-rw-r--r-- README 56
1 files changed, 37 insertions, 19 deletions
diff --git a/README b/README
index 197d289..f56193e 100644
--- a/README
+++ b/README
@@ -11,26 +11,43 @@ their meta-data in a number of different ways. An exploration of the
phylogenetic properties of the records first requires that the available data
be collected and inventoried.
-Two primary alternatives have been identified for managing the data. A
-relational database can be used. IBM DB2 has been used for this. The use of
-a relational database is limited by the difficulty in sharing the data. Each
-vendor uses incompatible import and export routines. Additionally installing
-an instance of a database management system (DBMS) often requires a large
-amount of effort and many not be practical on hosted environments which do not
-support the running of user daemons. Finally proper parallelization of a DBMS
-will require additional system specific configuration for each machine used.
-
-An alternative to the DBMS is to use a container file format such as HDF5.
-This has the advantage that all of the data can be collected into a single
-file which can then be shared with others. It has the disadvantage that is
-lacks the robust search and SQL operations provided by a DBMS. In addition to
-two alternatives use fundamentally different storage strategies with the DBMS
-using a relational model and the contain file format using a hierarchical
-model.
+Two primary alternatives have been identified for managing the data.
+A relational database can be used. IBM DB2 has been used for this in
+exp004. The use of a relational database is limited by the difficulty
+in sharing the data. Each vendor uses incompatible import and export
+routines. Additionally installing an instance of a database
+management system (DBMS) often requires a large amount of effort and
+many not be practical on hosted environments which do not support the
+running of user daemons. Proper parallelization of a DBMS will
+require additional system specific configuration for each machine
+used. Generally a single DB2 instance with Internet connectivity has
+been used in conjunction with DB2 client installations on the
+analytical environments.
+
+An alternative to the DBMS is to use a container file format such as
+HDF5. This has the advantage that all of the data can be collected
+into a single file which can then be shared with others. It has the
+disadvantage that it lacks the robust search and SQL operations
+provided by a DBMS. These two alternatives use fundamentally
+different storage strategies with the DBMS using a relational model
+and the container file format using a hierarchical model.
The "doc/Data Deployments.dia" diagram shows the source systems that
-expose the various records as well as the transform routines that are
-used for aggregation of the data on the local system.
+expose the various influenza records as well as the transform routines
+that are used for aggregation of the data on the local system.
+Initially it may appear that loading the text files directly into the
+HDF5 container is redundant, particularly as a pure pre-processing
+step. This will be a redundant effort for cases where tools are used
+which require yet another load step. For custom C programs however
+reading the data from disk and converting it from ASCII text to a
+native datatype is a necessary preprocessing step. Sharing the C
+struct definitions between HDF5 and the native code is the key
+differentiator between loading from text and loading from the binary
+HDF5 container. Since these read and conversion operations must be
+done in the C code anyway the additional effort to save their results
+in the HDF5 container are justified by any time that can be saved by
+reusing the HDF5 data rather than rerunning the read and conversion
+operations from plain text.
BUILDING
@@ -60,4 +77,5 @@ verify that the load was completed without error.
Protein Sequences.txt are identical
LocalWords: NCBI parallelization HDF SQL Pellegrino phylogenetic DBMS dia mpi
- LocalWords: autogen Autotools CPPFLAGS aa dat HDFView GUI diff txt
+ LocalWords: autogen Autotools CPPFLAGS aa dat HDFView GUI diff txt exp pre
+ LocalWords: datatype struct
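
The README text added in this change argues that a custom C program must read the text files and convert the ASCII fields to native datatypes anyway, so storing the converted records for reuse costs little extra. A minimal sketch of that idea follows. It is an illustration only: the seq_record layout, the field names, and the three helper functions are hypothetical and are not taken from the project, and a plain binary file stands in for the HDF5 container so that the example needs nothing beyond the C standard library. With HDF5 the same struct would instead be described to the library as a compound datatype.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record layout; the field names are illustrative and are
 * not the project's actual struct definitions. */
typedef struct {
    int  gi;             /* numeric identifier parsed from text */
    int  length;         /* sequence length parsed from text */
    char accession[16];  /* fixed-width string field */
} seq_record;

/* Convert one whitespace-separated ASCII line into the native struct.
 * Returns 0 on success, -1 if the line does not match the layout. */
int parse_record(const char *line, seq_record *rec)
{
    memset(rec, 0, sizeof *rec);  /* avoid storing uninitialized bytes */
    if (sscanf(line, "%d %15s %d",
               &rec->gi, rec->accession, &rec->length) != 3)
        return -1;
    return 0;
}

/* Store already-converted records so later runs can skip the text parse.
 * Stand-in for writing a dataset into the HDF5 container. */
int save_records(const char *path, const seq_record *recs, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t written = fwrite(recs, sizeof *recs, n, f);
    if (fclose(f) != 0 || written != n)
        return -1;
    return 0;
}

/* Read records back directly into the same struct type.
 * Returns the number of records read, or -1 if the file cannot be opened. */
long load_records(const char *path, seq_record *recs, size_t max)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(recs, sizeof *recs, max, f);
    fclose(f);
    return (long)got;
}
```

The point of the sketch is that parse_record runs once per text line, while load_records returns data already in the struct layout the analysis code uses. Unlike HDF5, a raw struct dump like this is not portable across machines with different padding or byte order, which is one reason a self-describing container is the better fit for shared data.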


Copyright © 2009 Don Pellegrino All Rights Reserved.