Experiment 007
Don Pellegrino [don@drexel.edu]
Collection and inventory of influenza data.

INTRODUCTION

The "Influenza Virus Resource" at NCBI [http://www.ncbi.nlm.nih.gov/genomes/FLU/] exposes the sequence records and their metadata in a number of different ways. An exploration of the phylogenetic properties of the records first requires that the available data be collected and inventoried. Two primary alternatives have been identified for managing the data.

The first alternative is a relational database; IBM DB2 was used for this purpose in exp004. The relational approach is limited by the difficulty of sharing the data, since each vendor uses incompatible import and export routines. Additionally, installing an instance of a database management system (DBMS) often requires a large amount of effort and may not be practical on hosted environments that do not support running user daemons. Proper parallelization of a DBMS would also require additional system-specific configuration for each machine used. Generally, a single DB2 instance with Internet connectivity has been used in conjunction with DB2 client installations on the analytical environments.

The second alternative is a container file format such as HDF5. This has the advantage that all of the data can be collected into a single file, which can then be shared with others. It has the disadvantage of lacking the robust search and SQL operations provided by a DBMS. The two alternatives use fundamentally different storage strategies: the DBMS uses a relational model, while the container file format uses a hierarchical model.

The "doc/Data Deployments.dia" diagram shows the source systems that expose the various influenza records, as well as the transform routines used to aggregate the data on the local system. Initially it may appear that loading the text files directly into the HDF5 container is redundant, particularly as a pure preprocessing step.
It is indeed a redundant effort in cases where tools are used that require yet another load step. For custom C programs, however, reading the data from disk and converting it from ASCII text to a native datatype is a necessary preprocessing step. Sharing the C struct definitions between HDF5 and the native code is the key differentiator between loading from text and loading from the binary HDF5 container. Since these read and conversion operations must be done in the C code anyway, the additional effort of saving their results in the HDF5 container is justified by the time saved in reusing the HDF5 data rather than rerunning the read and conversion operations on the plain text.

BUILDING

An autogen.sh script is provided to initialize the project directory with the necessary GNU Autotools configuration. When building on a Debian system, the mpi.h file is in a subdirectory of /usr/include and is therefore not found on the default include path. To account for this, run the following before running ./configure:

$ export CPPFLAGS=-I/usr/include/mpi

TEST CASES

The "load_influenza_aa_dat" function loads a single tab-delimited text file into a table structure in the HDF5 file. The HDFView GUI can be used to open the loaded table and export it back out as a text file. That text file can then be compared with the original input to verify that the load completed without error:

$ diff --report-identical-files \
    /home/don/exp004/genomes/INFLUENZA/influenza_aa.dat \
    Protein\ Sequences.txt
Files /home/don/exp004/genomes/INFLUENZA/influenza_aa.dat and Protein Sequences.txt are identical