Experiment 007
Don Pellegrino [don@drexel.edu]

Collection and inventory of influenza data.

INTRODUCTION

The "Influenza Virus Resource" at NCBI
[http://www.ncbi.nlm.nih.gov/genomes/FLU/] exposes the sequence records
and their meta-data in a number of different ways. An exploration of
the phylogenetic properties of the records first requires that the
available data be collected and inventoried.

Two primary alternatives have been identified for managing the data.

The first is a relational database; IBM DB2 has been used for this.
The use of a relational database is limited by the difficulty in
sharing the data, since each vendor uses incompatible import and export
routines. Additionally, installing an instance of a database management
system (DBMS) often requires a large amount of effort and may not be
practical on hosted environments which do not support the running of
user daemons. Finally, proper parallelization of a DBMS will require
additional system-specific configuration for each machine used.

The alternative to the DBMS is to use a container file format such as
HDF5. This has the advantage that all of the data can be collected
into a single file which can then be shared with others. It has the
disadvantage that it lacks the robust search and SQL operations
provided by a DBMS.

In addition, the two alternatives use fundamentally different storage
strategies, with the DBMS using a relational model and the container
file format using a hierarchical model.

The "doc/Data Deployments.dia" diagram shows the source systems that
expose the various records as well as the transform routines that are
used for aggregation of the data on the local system.

BUILDING

An autogen.sh script is provided to initialize the project directory
with the necessary GNU Autotools configuration.

When building on a Debian system, the mpi.h file is in a subdirectory
of /usr/include and is therefore not found within the default include
path. To account for this, run the following before running
./configure.

$ export CPPFLAGS=-I/usr/include/mpi

TEST CASES

The "load_influenza_aa_dat" function loads a single tab-delimited text
file into a table structure in the HDF5 file. The HDFView GUI can be
used to open the loaded table and then export it back out as a text
file. The text file can then be compared with the original input to
verify that the load was completed without error.

$ diff --report-identical-files \
    /home/don/exp004/genomes/INFLUENZA/influenza_aa.dat \
    Protein\ Sequences.txt
Files /home/don/exp004/genomes/INFLUENZA/influenza_aa.dat and Protein Sequences.txt are identical
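
The comparison described under TEST CASES can also be scripted so that
it yields an exit status instead of a message to be read by eye. The
following is a minimal sketch, assuming a POSIX shell and that the
table has already been exported from HDFView to a text file. The
function name verify_roundtrip and its two arguments are hypothetical
and are not part of this project.

```shell
# Hypothetical helper (not part of this project): compare the original
# input file with the text file exported from HDFView.
# Usage: verify_roundtrip ORIGINAL EXPORTED
verify_roundtrip() {
    # cmp -s prints nothing and exits 0 only when both files are
    # byte-for-byte identical; any other status is treated as failure.
    if cmp -s "$1" "$2"; then
        echo "PASS: $2 matches $1"
    else
        echo "FAIL: $2 does not match $1 (or a file could not be read)" >&2
        return 1
    fi
}
```

With the paths used in the test case, this might be invoked as
verify_roundtrip /home/don/exp004/genomes/INFLUENZA/influenza_aa.dat
"Protein Sequences.txt". A non-zero return status could then be used
to fail a "make check" style target.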