Data Integration
The project team shall integrate data from multiple
sources into a single data warehouse. This complex,
iterative task will commence during the second six months
of the project and will run concurrently with, and be
tied closely to, the data acquisition tasks.
The project team shall begin QA/QC at project start.
Certain well data are already known to be necessary, and
certain storage formats are already known to be required
for project results.
During the process of data collection, it is likely
that certain data types will be proven unnecessary,
too sparse, or too difficult to standardize. Some data
types may be more or less easily acquired than initially
estimated. Therefore, periodic review of the data
acquisition task will be necessary. This review will
point out not only gaps in the data but also potential
problems in data consistency that may be avoided by
a simple adjustment in collection methodologies, which
in turn will greatly simplify data cleaning (QA/QC).
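
The periodic review described above could be supported by a
simple completeness report. The following is a minimal sketch
only, not the project's method: the field names, the sample
records, and the 50 percent sparseness threshold are all
hypothetical and chosen for illustration.

```python
from collections import defaultdict


def completeness_report(records, fields):
    """Return, per field, the fraction of records holding a non-empty value."""
    counts = defaultdict(int)
    for rec in records:
        for field in fields:
            value = rec.get(field)
            if value is not None and str(value).strip() != "":
                counts[field] += 1
    total = len(records)
    return {f: (counts[f] / total if total else 0.0) for f in fields}


def sparse_fields(report, threshold=0.5):
    """List fields whose completeness falls below the (assumed) threshold."""
    return sorted(f for f, frac in report.items() if frac < threshold)


# Hypothetical sample of collected well records.
records = [
    {"well_id": "W-001", "depth_ft": 5200, "operator": "Acme"},
    {"well_id": "W-002", "depth_ft": None, "operator": "Acme"},
    {"well_id": "W-003", "depth_ft": None, "operator": ""},
    {"well_id": "W-004", "depth_ft": None, "operator": "Beta"},
]
report = completeness_report(records, ["well_id", "depth_ft", "operator"])
too_sparse = sparse_fields(report)
```

Running such a report at each review would show which data
types are candidates for being dropped or for a change in
collection methodology.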
Data Cleaning
The project team shall first collect data into temporary
databases for data cleaning. A standardized indexing
method will be developed so that all the data about
a given well can be related across the data sets. A
variety of data cleaning routines will be employed:
finding and filling in missing values, using outlier
analysis to identify and correct or delete outlying
values, and applying approximate string matching to find
and correct discrepancies and duplications in the data.
Finally, the team shall complete a data review and analysis
which may reveal a need for data reduction: the process
of obtaining a reduced, representative sample of the
whole data set via such methods as clustering, data
aggregation, dimensionality reduction, data compression,
and generalization.
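
The indexing and cleaning routines named in this section
might look roughly as follows. This is a sketch under stated
assumptions: the key-normalization rule, the median fill, the
median-absolute-deviation (MAD) outlier test with a 3.5
threshold, and the 0.8 similarity cutoff are illustrative
choices, and the operator names and depths are invented. The
source text does not specify which techniques the team will
use.

```python
import difflib
import statistics


def well_key(raw_id):
    """Normalize a raw well identifier into a shared index key.

    Hypothetical rule: uppercase, then keep only letters and digits.
    """
    return "".join(ch for ch in str(raw_id).upper() if ch.isalnum())


def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)
    median = statistics.median(known)
    return [median if v is None else v for v in values]


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score (MAD-based) exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread, so nothing can be flagged
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - median) / mad) > threshold]


def match_name(name, canonical, cutoff=0.8):
    """Map a possibly misspelled name to its closest canonical spelling."""
    hits = difflib.get_close_matches(name.upper(), canonical, n=1, cutoff=cutoff)
    return hits[0] if hits else None


# Hypothetical inputs for illustration only.
depths = [5100, 5200, None, 5300, 52000]
filled = fill_missing(depths)
suspect = mad_outliers(filled)

operators = ["ACME OIL COMPANY", "BETA ENERGY"]
fixed = match_name("Acme Oil Compny", operators)
unknown = match_name("Zenith Drilling", operators)
```

Flagged values and unmatched names would still need manual
review before being corrected or deleted.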
Creation of Primary Database
After the data have been acquired, indexed, subjected
to any necessary preprocessing or reduction techniques,
and cleaned to reasonable standards, the project team
shall create a final database. The database shall be
provided on both SQL Server (for use via the Internet) and
Microsoft Access (for stand-alone use). The geo-referenced
database shall be accessible online through services
such as ArcSDE and ArcIMS. Downloadable shape files
for some data features will be included for stand-alone
users who wish to use software of their own preference.
Certain important data tables shall also be made available
as comma-separated-value (.csv) text files. Although
these files will not have the relational integrity of
the database format, the project team shall ensure that
they are highly useful; the format will in all likelihood
remain usable by a variety of software programs for the
foreseeable future. Query tools shall also allow export
of portions of the database as spreadsheet files.
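
As an illustration of the .csv export described above, the
sketch below writes one table to comma-separated text. It is
not the project's tooling: an in-memory SQLite database stands
in for the SQL Server and Access platforms named in this
section, and the table name, column names, and values are
hypothetical.

```python
import csv
import io
import sqlite3


def export_table_csv(conn, table, out):
    """Write every row of `table` to `out` as CSV, with a header row."""
    # Check the table name against the catalog before using it in SQL.
    names = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in names:
        raise ValueError(f"unknown table: {table}")
    cur = conn.execute(f'SELECT * FROM "{table}" ORDER BY 1')
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)


# Stand-in database with a hypothetical wells table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wells (well_id TEXT PRIMARY KEY, depth_ft REAL)")
conn.executemany("INSERT INTO wells VALUES (?, ?)",
                 [("W001", 5200.0), ("W002", 5300.0)])

buf = io.StringIO()
export_table_csv(conn, "wells", buf)
csv_text = buf.getvalue()
```

The exported text carries the column names and values but
none of the keys or constraints, which is the loss of
relational integrity noted above.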