-rw-r--r-- README 56
1 files changed, 37 insertions, 19 deletions
diff --git a/README b/README
index 197d289..f56193e 100644
--- a/README
+++ b/README
@@ -11,26 +11,43 @@ their meta-data in a number of different ways. An exploration of the
 phylogenetic properties of the records first requires that the available data
 be collected and inventoried.
 
-Two primary alternatives have been identified for managing the data. A
-relational database can be used. IBM DB2 has been used for this. The use of
-a relational database is limited by the difficulty in sharing the data. Each
-vendor uses incompatible import and export routines. Additionally installing
-an instance of a database management system (DBMS) often requires a large
-amount of effort and many not be practical on hosted environments which do not
-support the running of user daemons. Finally proper parallelization of a DBMS
-will require additional system specific configuration for each machine used.
-
-An alternative to the DBMS is to use a container file format such as HDF5.
-This has the advantage that all of the data can be collected into a single
-file which can then be shared with others. It has the disadvantage that is
-lacks the robust search and SQL operations provided by a DBMS. In addition to
-two alternatives use fundamentally different storage strategies with the DBMS
-using a relational model and the contain file format using a hierarchical
-model.
+Two primary alternatives have been identified for managing the data.
+A relational database can be used. IBM DB2 has been used for this in
+exp004. The use of a relational database is limited by the difficulty
+in sharing the data. Each vendor uses incompatible import and export
+routines. Additionally installing an instance of a database
+management system (DBMS) often requires a large amount of effort and
+many not be practical on hosted environments which do not support the
+running of user daemons. Proper parallelization of a DBMS will
+require additional system specific configuration for each machine
+used. Generally a single DB2 instance with Internet connectivity has
+been used in conjunction with DB2 client installations on the
+analytical environments.
+
+An alternative to the DBMS is to use a container file format such as
+HDF5. This has the advantage that all of the data can be collected
+into a single file which can then be shared with others. It has the
+disadvantage that it lacks the robust search and SQL operations
+provided by a DBMS. These two alternatives use fundamentally
+different storage strategies with the DBMS using a relational model
+and the container file format using a hierarchical model.
 
 The "doc/Data Deployments.dia" diagram shows the source systems that
-expose the various records as well as the transform routines that are
-used for aggregation of the data on the local system.
+expose the various influenza records as well as the transform routines
+that are used for aggregation of the data on the local system.
+Initially it may appear that loading the text files directly into the
+HDF5 container is redundant, particularly as a pure pre-processing
+step. This will be a redundant effort for cases where tools are used
+which require yet another load step. For custom C programs however
+reading the data from disk and converting it from ASCII text to a
+native datatype is a necessary preprocessing step. Sharing the C
+struct definitions between HDF5 and the native code is the key
+differentiator between loading from text and loading from the binary
+HDF5 container. Since these read and conversion operations must be
+done in the C code anyway the additional effort to save their results
+in the HDF5 container are justified by any time that can be saved by
+reusing the HDF5 data rather than rerunning the read and conversion
+operations from plain text.
 
 BUILDING
 
@@ -60,4 +77,5 @@ verify that the load was completed without error.
  Protein Sequences.txt are identical
 
  LocalWords: NCBI parallelization HDF SQL Pellegrino phylogenetic DBMS dia mpi
- LocalWords: autogen Autotools CPPFLAGS aa dat HDFView GUI diff txt
+ LocalWords: autogen Autotools CPPFLAGS aa dat HDFView GUI diff txt exp pre
+ LocalWords: datatype struct
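
Note on the added README paragraph: the argument is that a custom C program must convert ASCII text into a native struct anyway, so saving the converted structs in a binary container lets later runs skip the parse. The sketch below illustrates that idea only and is not taken from this repository. The record_t layout, the tab-separated input format, and the function names are all hypothetical, and a flat binary file written with fwrite stands in for the HDF5 container so that the example needs nothing beyond the C standard library. With HDF5 the same struct would typically be described once as a compound datatype (H5Tcreate with H5T_COMPOUND, then H5Tinsert with HOFFSET for each field) and shared by the loader and the analysis code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record layout; field names are illustrative only. */
typedef struct {
    int  id;          /* numeric identifier parsed from text */
    int  year;        /* year field parsed from text */
    char name[32];    /* fixed-width label, NUL-terminated */
} record_t;

/* Convert one tab-separated text line "id<TAB>year<TAB>name" into *rec.
   Returns 0 on success, -1 if the line does not match the layout. */
int parse_record(const char *line, record_t *rec)
{
    assert(line != NULL && rec != NULL);
    memset(rec, 0, sizeof *rec);  /* zero unused bytes of the name field */
    if (sscanf(line, "%d\t%d\t%31[^\t\n]",
               &rec->id, &rec->year, rec->name) != 3)
        return -1;
    return 0;
}

/* Save already-converted records in native binary form (stand-in for
   writing an HDF5 dataset). Returns 0 on success, -1 on I/O error. */
int save_records(const char *path, const record_t *recs, size_t n)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    size_t written = fwrite(recs, sizeof *recs, n, fp);
    if (fclose(fp) != 0 || written != n)
        return -1;
    return 0;
}

/* Reload records without any text parsing. Returns the number of
   records read, or -1 if the file cannot be opened. */
long load_records(const char *path, record_t *recs, size_t max)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    size_t n = fread(recs, sizeof *recs, max, fp);
    fclose(fp);
    return (long)n;
}
```

A first run would call parse_record for every input line and then save_records once; later runs call load_records and go straight to analysis. Unlike HDF5, a raw struct dump is not portable across machines with different byte order or padding, which is one reason a self-describing container is the better fit for shared data.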

Copyright © 2009 Don Pellegrino All Rights Reserved.