|
diff --git a/README b/README new file mode 100644 index 0000000..9caedb8 --- a/dev/null +++ b/ README |
|
@@ -0,0 +1,35 @@ |
| | 1 | Experiment 007 |
| | 2 | Don Pellegrino [don@drexel.edu] |
| | 3 | |
| | 4 | Collection and inventory of influenza data. |
| | 5 | |
| | 6 | INTRODUCTION |
| | 7 | |
| | 8 | The "Influenza Virus Resource" at NCBI |
| | 9 | [http://www.ncbi.nlm.nih.gov/genomes/FLU/] exposes the sequence records and |
| | 10 | their meta-data in a number of different ways. An exploration of the |
| | 11 | phylogenetic properties of the records first requires that the available data |
| | 12 | be collected and inventoried. |
| | 13 | |
| | 14 | Two primary alternatives have been identified for managing the data. The |
| | 15 | first is a relational database; IBM DB2 has been used for this. The use of |
| | 16 | a relational database is limited by the difficulty in sharing the data. Each |
| | 17 | vendor uses incompatible import and export routines. Additionally, installing |
| | 18 | an instance of a database management system (DBMS) often requires a large |
| | 19 | amount of effort and may not be practical on hosted environments that do not |
| | 20 | support the running of user daemons. Finally, proper parallelization of a DBMS |
| | 21 | requires additional system-specific configuration for each machine used. |
| | 22 | |
| | 23 | An alternative to the DBMS is to use a container file format such as HDF5. |
| | 24 | This has the advantage that all of the data can be collected into a single |
| | 25 | file which can then be shared with others. It has the disadvantage that it |
| | 26 | lacks the robust search and SQL operations provided by a DBMS. In addition, the |
| | 27 | two alternatives use fundamentally different storage strategies, with the DBMS |
| | 28 | using a relational model and the container file format using a hierarchical |
| | 29 | model. |
| | 30 | |
| | 31 | The "doc/Data Deployments.dia" diagram shows the source systems that |
| | 32 | expose the various records as well as the transform routines that are |
| | 33 | used for aggregation of the data on the local system. |
| | 34 | |
| | 35 | LocalWords: NCBI parallelization HDF SQL Pellegrino phylogenetic DBMS dia |
|
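The README's contrast between the relational model and the hierarchical model can be made concrete with a small sketch. This is an illustration only, not code from the experiment. It assumes Python, uses the standard-library `sqlite3` module as a stand-in for a full DBMS such as DB2, and uses a nested dictionary as a stand-in for the group hierarchy of a container file such as HDF5. The record fields, accession identifiers, and function names are all hypothetical placeholders.

```python
import sqlite3

# Hypothetical records; accession IDs and fields are placeholders, not NCBI data.
RECORDS = [
    {"accession": "ACC0001", "subtype": "H1N1", "segment": 4, "length": 1701},
    {"accession": "ACC0002", "subtype": "H3N2", "segment": 4, "length": 1698},
    {"accession": "ACC0003", "subtype": "H1N1", "segment": 6, "length": 1410},
]


def build_relational(records):
    """Load records into one flat table; sqlite3 stands in for a full DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequence ("
        "accession TEXT PRIMARY KEY, subtype TEXT, segment INTEGER, length INTEGER)"
    )
    conn.executemany(
        "INSERT INTO sequence VALUES (:accession, :subtype, :segment, :length)",
        records,
    )
    return conn


def build_hierarchical(records):
    """Nest records as subtype -> segment -> accession, like groups in a container file."""
    root = {}
    for rec in records:
        group = root.setdefault(rec["subtype"], {}).setdefault(rec["segment"], {})
        group[rec["accession"]] = {"length": rec["length"]}
    return root


def accessions_by_segment_sql(conn, segment):
    """Relational lookup: a declarative query over the table."""
    rows = conn.execute(
        "SELECT accession FROM sequence WHERE segment = ? ORDER BY accession",
        (segment,),
    )
    return [row[0] for row in rows]


def accessions_by_segment_tree(root, segment):
    """Hierarchical lookup: segment is not the top-level key, so walk every subtype."""
    found = []
    for segments in root.values():
        found.extend(segments.get(segment, {}))
    return sorted(found)
```

Both lookups return the same accessions. The difference is that the relational version states the condition in SQL, while the hierarchical version must traverse the tree whenever the query does not follow the nesting order. That is the search trade-off the README describes.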