-rw-r--r-- | README | 35
-rw-r--r-- | data/ProteinNames.txt | 275
-rw-r--r-- | doc/Data Deployments.dia | bin | 0 -> 3566 bytes
3 files changed, 310 insertions, 0 deletions
@@ -0,0 +1,35 @@
+Experiment 007
+Don Pellegrino [don@drexel.edu]
+
+Collection and inventory of influenza data.
+
+INTRODUCTION
+
+The "Influenza Virus Resource" at NCBI
+[http://www.ncbi.nlm.nih.gov/genomes/FLU/] exposes the sequence records and
+their metadata in a number of different ways. An exploration of the
+phylogenetic properties of the records first requires that the available data
+be collected and inventoried.
+
+Two primary alternatives have been identified for managing the data. A
+relational database can be used. IBM DB2 has been used for this. The use of
+a relational database is limited by the difficulty in sharing the data. Each
+vendor uses incompatible import and export routines. Additionally, installing
+an instance of a database management system (DBMS) often requires a large
+amount of effort and may not be practical on hosted environments which do not
+support the running of user daemons. Finally, proper parallelization of a DBMS
+will require additional system-specific configuration for each machine used.
+
+An alternative to the DBMS is to use a container file format such as HDF5.
+This has the advantage that all of the data can be collected into a single
+file which can then be shared with others. It has the disadvantage that it
+lacks the robust search and SQL operations provided by a DBMS. In addition,
+the two alternatives use fundamentally different storage strategies, with the
+DBMS using a relational model and the container file format using a
+hierarchical model.
+
+The "doc/Data Deployments.dia" diagram shows the source systems that
+expose the various records as well as the transform routines that are
+used for aggregation of the data on the local system.
+
+ LocalWords: NCBI parallelization HDF SQL Pellegrino phylogenetic DBMS dia
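
The README's contrast between a relational model and a hierarchical model can be illustrated with a minimal sketch. It is not the author's schema or tooling: an in-memory SQLite table stands in for the DBMS (the README names IBM DB2), a nested dictionary stands in for the groups of a container file such as HDF5, and the accessions, field names, and residue strings are invented placeholders. Only the Python standard library is used.

```python
import sqlite3

# Placeholder records: (accession, subtype, segment, residues).
# All values are invented for illustration; none come from NCBI.
RECORDS = [
    ("ACC0001", "H1N1", 4, "ATGAAGGC"),
    ("ACC0002", "H1N1", 6, "ATGAATCC"),
    ("ACC0003", "H3N2", 4, "ATGAAGAC"),
]


def build_relational(records):
    """Load the records into one flat table (stand-in for a DBMS)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequence ("
        "accession TEXT PRIMARY KEY, subtype TEXT, "
        "segment INTEGER, residues TEXT)"
    )
    conn.executemany("INSERT INTO sequence VALUES (?, ?, ?, ?)", records)
    conn.commit()
    return conn


def build_hierarchical(records):
    """Nest the records as subtype -> segment -> accession, the way
    groups and datasets nest inside a container file (assumed layout)."""
    tree = {}
    for accession, subtype, segment, residues in records:
        tree.setdefault(subtype, {}).setdefault(segment, {})[accession] = residues
    return tree


def accessions_for_segment_sql(conn, segment):
    """Relational model: a declarative query answers the question."""
    rows = conn.execute(
        "SELECT accession FROM sequence WHERE segment = ? ORDER BY accession",
        (segment,),
    )
    return [row[0] for row in rows]


def accessions_for_segment_tree(tree, segment):
    """Hierarchical model: the caller walks every subtype branch by hand."""
    found = []
    for segments in tree.values():
        found.extend(segments.get(segment, {}))  # extends with the accession keys
    return sorted(found)


if __name__ == "__main__":
    conn = build_relational(RECORDS)
    tree = build_hierarchical(RECORDS)
    print(accessions_for_segment_sql(conn, 4))
    print(accessions_for_segment_tree(tree, 4))
```

Both lookups return the same accessions, but a question that cuts across the nesting order ("all records for segment 4, regardless of subtype") is a single query in the relational form and a manual traversal in the hierarchical one. That is the search trade-off the README attributes to container formats.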