Experiment 007
Don Pellegrino [don@drexel.edu]

Collection and inventory of influenza data.

INTRODUCTION

The "Influenza Virus Resource" at NCBI
[http://www.ncbi.nlm.nih.gov/genomes/FLU/] exposes the sequence records and
their meta-data in a number of different ways.  An exploration of the
phylogenetic properties of the records first requires that the available data
be collected and inventoried.

Two primary alternatives have been identified for managing the data.  The
first is a relational database; IBM DB2 has been used for this.  The use of
a relational database is limited by the difficulty in sharing the data, since
each vendor uses incompatible import and export routines.  Additionally,
installing an instance of a database management system (DBMS) often requires
a large amount of effort and may not be practical on hosted environments which
do not support the running of user daemons.  Finally, proper parallelization
of a DBMS will require additional system-specific configuration for each
machine used.

An alternative to the DBMS is to use a container file format such as HDF5.
This has the advantage that all of the data can be collected into a single
file which can then be shared with others.  It has the disadvantage that it
lacks the robust search and SQL operations provided by a DBMS.  In addition,
the two alternatives use fundamentally different storage strategies, with the
DBMS using a relational model and the container file format using a
hierarchical model.
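
The difference between the two storage models can be illustrated with a
minimal sketch.  It is not the code used in this experiment: SQLite (from the
Python standard library) stands in for DB2, and a nested dictionary stands in
for the group hierarchy of a container file such as HDF5.  The table name, the
field names (accession, subtype, year), and the records themselves are
hypothetical placeholders chosen only for illustration.

```python
import sqlite3

# Hypothetical placeholder records (accession, subtype, year); not real data.
RECORDS = [
    ("ACC0001", "H1N1", 2009),
    ("ACC0002", "H3N2", 2009),
    ("ACC0003", "H1N1", 2010),
]


def build_relational(records):
    """Load the records into one flat table in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequence ("
        "accession TEXT PRIMARY KEY, subtype TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)", records)
    return conn


def build_hierarchical(records):
    """Nest the records as subtype -> year -> [accession], like nested groups."""
    tree = {}
    for accession, subtype, year in records:
        tree.setdefault(subtype, {}).setdefault(year, []).append(accession)
    return tree


def relational_by_year(conn, year):
    """Relational search: a single declarative query over the flat table."""
    rows = conn.execute(
        "SELECT accession FROM sequence WHERE year = ? ORDER BY accession",
        (year,),
    )
    return [row[0] for row in rows]


def hierarchical_by_year(tree, year):
    """Hierarchical search: year is not the top level, so walk every subtype."""
    found = []
    for years in tree.values():
        found.extend(years.get(year, []))
    return sorted(found)
```

Both lookups return the same accessions, but the hierarchical version has to
be written against the chosen nesting order, while the SQL query does not
depend on how the rows are laid out.  This is the kind of search convenience
that is given up in exchange for a single shareable file.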

The "doc/Data Deployments.dia" diagram shows the source systems that
expose the various records as well as the transform routines that are
used for aggregation of the data on the local system.

 LocalWords:  NCBI parallelization HDF SQL Pellegrino phylogenetic DBMS dia