San Juan Basin Data Recovery Project
a PRRC undertaking

Data Integration

The project team shall integrate multi-sourced data into a single data warehouse. This complex, iterative task will begin during the second six months of the project and will run concurrently with, and be tied closely to, the data acquisition tasks.
The project team shall begin QA/QC at project start. Certain well data are known to be necessary, and certain storage formats are known to be required for project results. During data collection, however, some data types will likely prove unnecessary, too sparse, or too difficult to standardize, and some may be more or less easily acquired than initially estimated. Periodic review of the data acquisition task will therefore be necessary. This review will identify not only gaps in the data but also potential consistency problems that a simple adjustment in collection methodology could avoid, which will greatly simplify data cleaning (QA/QC).
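
As an illustration of what a periodic review might compute, the following minimal sketch reports how fully each field is populated across the collected well records. The field names and sample records are invented for the example; the project's actual data types and review procedure are not specified here.

```python
from collections import Counter


def completeness_report(records, fields):
    """Return, for each field, the fraction of records holding a non-empty value."""
    total = len(records)
    filled = Counter()
    for rec in records:
        for field in fields:
            value = rec.get(field)
            # Treat None and blank strings as missing.
            if value is not None and str(value).strip() != "":
                filled[field] += 1
    return {f: (filled[f] / total if total else 0.0) for f in fields}


# Hypothetical sample records (not project data).
wells = [
    {"well_id": "W-001", "total_depth_ft": 6120, "formation": "Dakota"},
    {"well_id": "W-002", "total_depth_ft": None, "formation": "Mesaverde"},
    {"well_id": "W-003", "total_depth_ft": 5480, "formation": ""},
    {"well_id": "W-004", "total_depth_ft": None, "formation": "Dakota"},
]

report = completeness_report(wells, ["well_id", "total_depth_ft", "formation"])
for field, fraction in report.items():
    print(f"{field}: {fraction:.0%} populated")
```

A field that stays sparsely populated from one review to the next would be a candidate for the "too sparse" category described above.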

Data Cleaning

The project team shall first collect data into temporary databases for data cleaning. A standardized indexing method will be developed so that all data about a given well can be related across data sets. A variety of routines will be employed to clean the data: finding and filling in missing values; using outlier analysis to identify and then correct or delete outlying values; and using approximate string matching to find and correct discrepancies and duplications.
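
The three kinds of cleaning routine named above can be sketched with the Python standard library. This is only an illustration under stated assumptions: median filling, the 1.5 x IQR outlier rule, and the 0.8 similarity cutoff are choices made for the example, not methods the project has specified, and the function names and sample values are hypothetical.

```python
import difflib
import statistics


def fill_missing(values):
    """Replace None entries with the median of the known values (assumed policy)."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def standardize_name(name, canonical, cutoff=0.8):
    """Map a name to its closest canonical spelling, or None if nothing is close."""
    lookup = {c.lower(): c for c in canonical}  # compare case-insensitively
    matches = difflib.get_close_matches(
        name.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


# Hypothetical depth values in feet; 59000 looks like a data-entry error.
depths = fill_missing([6120, None, 5480, 5900, 6050, 59000])
suspect = flag_outliers(depths)

# Hypothetical list of accepted formation spellings.
formations = ["Dakota", "Mesaverde", "Pictured Cliffs"]
print(depths, suspect, standardize_name("Mesa Verde", formations))
```

Flagged indices are returned rather than altered, so that a reviewer can decide whether each outlying value should be corrected or deleted. The median is used for filling because a single extreme value shifts it far less than it would shift a mean.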

Finally, the team shall complete a data review and analysis, which may reveal a need for data reduction: the process of obtaining a reduced, representative sample of the whole data set through methods such as clustering, data aggregation, dimensionality reduction, data compression, and generalization.
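
Of the reduction methods listed, data aggregation is the simplest to illustrate. The sketch below rolls monthly records up to one total per well per year. The record layout (well identifier, "YYYY-MM" period, volume) is an assumption made for the example, since the specific data types to be reduced are not stated.

```python
from collections import defaultdict


def aggregate_annual(monthly_rows):
    """Sum monthly volumes into one total per (well_id, year)."""
    totals = defaultdict(float)
    for well_id, year_month, volume in monthly_rows:
        year = int(year_month[:4])  # "1998-02" -> 1998
        totals[(well_id, year)] += volume
    return dict(totals)


# Hypothetical monthly records (not project data).
monthly = [
    ("W-001", "1998-01", 120.0),
    ("W-001", "1998-02", 110.0),
    ("W-001", "1999-01", 95.0),
    ("W-002", "1998-01", 40.0),
]

annual = aggregate_annual(monthly)
print(annual)
```

Aggregation of this kind trades temporal detail for a smaller table, so the monthly source records would normally be retained alongside the reduced set.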

Creation of Primary Database

After the project team has processed the data (acquired, indexed, subjected to any necessary preprocessing or reduction techniques, and cleaned to reasonable standards), a final database shall be created. The database shall be provided on two platforms: SQL Server (for use via the Internet) and Microsoft Access (for stand-alone use). The geo-referenced database shall be accessible online through services such as ArcSDE and ArcIMS. Downloadable shapefiles for some data features will be included for stand-alone users who wish to use software of their own choosing. Certain important data tables shall also be made available as comma-separated-value (.csv) text files. Although these files will not have the relational integrity of the database format, they will be highly useful, and the format will in all likelihood remain usable by a variety of software programs for the foreseeable future. Query tools shall also allow export of portions of the database as spreadsheet files.
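
The .csv export described above might look like the following minimal sketch. It uses an in-memory SQLite database purely as a stand-in for the SQL Server and Access platforms, and the table, column names, and rows are hypothetical.

```python
import csv
import io
import sqlite3


def export_table_csv(conn, table, out):
    """Write every row of `table`, preceded by a header row, to the text stream `out`."""
    # Table names cannot be bound as SQL parameters, so validate the identifier.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY 1')
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])  # column names
    writer.writerows(cursor)  # NULL values are written as empty fields


# Stand-in database with a hypothetical table (not the project schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wells (well_id TEXT PRIMARY KEY, formation TEXT, total_depth_ft REAL)"
)
conn.executemany(
    "INSERT INTO wells VALUES (?, ?, ?)",
    [("W-001", "Dakota", 6120.0), ("W-002", "Pictured Cliffs", None)],
)

buffer = io.StringIO()
export_table_csv(conn, "wells", buffer)
print(buffer.getvalue())
```

The exported file carries the well identifier in every row but no key constraints, which is the loss of relational integrity noted above. Any program that reads the file must re-join tables on that identifier itself.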