|
diff --git a/README b/README
index 197d289..f56193e 100644
--- a/README
+++ b/README |
|
@@ -11,26 +11,43 @@ their meta-data in a number of different ways. An exploration of the |
11 | phylogenetic properties of the records first requires that the available data | 11 | phylogenetic properties of the records first requires that the available data |
12 | be collected and inventoried. | 12 | be collected and inventoried. |
13 | | 13 | |
14 | Two primary alternatives have been identified for managing the data. A | 14 | Two primary alternatives have been identified for managing the data. |
15 | relational database can be used. IBM DB2 has been used for this. The use of | 15 | A relational database can be used. IBM DB2 has been used for this in |
16 | a relational database is limited by the difficulty in sharing the data. Each | 16 | exp004. The use of a relational database is limited by the difficulty |
17 | vendor uses incompatible import and export routines. Additionally installing | 17 | in sharing the data. Each vendor uses incompatible import and export |
18 | an instance of a database management system (DBMS) often requires a large | 18 | routines. Additionally installing an instance of a database |
19 | amount of effort and may not be practical on hosted environments which do not | 19 | management system (DBMS) often requires a large amount of effort and |
20 | support the running of user daemons. Finally proper parallelization of a DBMS | 20 | may not be practical on hosted environments which do not support the |
21 | will require additional system specific configuration for each machine used. | 21 | running of user daemons. Proper parallelization of a DBMS will |
22 | | 22 | require additional system specific configuration for each machine |
23 | An alternative to the DBMS is to use a container file format such as HDF5. | 23 | used. Generally a single DB2 instance with Internet connectivity has |
24 | This has the advantage that all of the data can be collected into a single | 24 | been used in conjunction with DB2 client installations on the |
25 | file which can then be shared with others. It has the disadvantage that is | 25 | analytical environments. |
26 | lacks the robust search and SQL operations provided by a DBMS. In addition to | 26 | |
27 | two alternatives use fundamentally different storage strategies with the DBMS | 27 | An alternative to the DBMS is to use a container file format such as |
28 | using a relational model and the contain file format using a hierarchical | 28 | HDF5. This has the advantage that all of the data can be collected |
29 | model. | 29 | into a single file which can then be shared with others. It has the |
| | 30 | disadvantage that it lacks the robust search and SQL operations |
| | 31 | provided by a DBMS. These two alternatives use fundamentally |
| | 32 | different storage strategies with the DBMS using a relational model |
| | 33 | and the container file format using a hierarchical model. |
30 | | 34 | |
31 | The "doc/Data Deployments.dia" diagram shows the source systems that | 35 | The "doc/Data Deployments.dia" diagram shows the source systems that |
32 | expose the various records as well as the transform routines that are | 36 | expose the various influenza records as well as the transform routines |
33 | used for aggregation of the data on the local system. | 37 | that are used for aggregation of the data on the local system. |
| | 38 | Initially it may appear that loading the text files directly into the |
| | 39 | HDF5 container is redundant, particularly as a pure pre-processing |
| | 40 | step. This will be a redundant effort for cases where tools are used |
| | 41 | which require yet another load step. For custom C programs however |
| | 42 | reading the data from disk and converting it from ASCII text to a |
| | 43 | native datatype is a necessary preprocessing step. Sharing the C |
| | 44 | struct definitions between HDF5 and the native code is the key |
| | 45 | differentiator between loading from text and loading from the binary |
| | 46 | HDF5 container. Since these read and conversion operations must be |
| | 47 | done in the C code anyway the additional effort to save their results |
| | 48 | in the HDF5 container is justified by any time that can be saved by |
| | 49 | reusing the HDF5 data rather than rerunning the read and conversion |
| | 50 | operations from plain text. |
34 | | 51 | |
35 | BUILDING | 52 | BUILDING |
36 | | 53 | |
@@ -60,4 +77,5 @@ verify that the load was completed without error. |
60 | Protein Sequences.txt are identical | 77 | Protein Sequences.txt are identical |
61 | | 78 | |
62 | LocalWords: NCBI parallelization HDF SQL Pellegrino phylogenetic DBMS dia mpi | 79 | LocalWords: NCBI parallelization HDF SQL Pellegrino phylogenetic DBMS dia mpi |
63 | LocalWords: autogen Autotools CPPFLAGS aa dat HDFView GUI diff txt | 80 | LocalWords: autogen Autotools CPPFLAGS aa dat HDFView GUI diff txt exp pre |
| | 81 | LocalWords: datatype struct |
|
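
The new README paragraph argues that a custom C program must convert ASCII text to native datatypes anyway, so saving the converted structs pays off on later runs. A minimal sketch of that idea follows. It is an illustration, not the project's loader: the `record_t` layout, the tab-separated input format, and the function names are all hypothetical, and plain `fwrite`/`fread` stands in for the HDF5 container so that the example needs only the C standard library.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record layout; the project's real struct is not shown here. */
typedef struct {
    int  year;           /* collection year */
    int  length;         /* sequence length */
    char accession[16];  /* NUL-terminated accession string */
} record_t;

/* Convert one text line, "accession<TAB>year<TAB>length", to a native record.
 * Returns 0 on success and -1 if the line is malformed. */
int parse_record(const char *line, record_t *out)
{
    assert(line != NULL && out != NULL);
    memset(out, 0, sizeof *out);  /* zero the buffer so unused bytes are defined */
    /* %15[^\t] limits the accession to 15 characters plus the terminator. */
    if (sscanf(line, "%15[^\t]\t%d\t%d",
               out->accession, &out->year, &out->length) != 3)
        return -1;
    return 0;
}

/* Save already-converted records in binary form (stand-in for the HDF5 write).
 * Returns 0 on success and -1 on any I/O error. */
int save_records(const char *path, const record_t *recs, size_t n)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    size_t written = fwrite(recs, sizeof *recs, n, fp);
    if (fclose(fp) != 0 || written != n)
        return -1;
    return 0;
}

/* Reload records without repeating the text parsing step.
 * Returns the number of records read, or -1 if the file cannot be opened. */
long load_records(const char *path, record_t *recs, size_t max)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    size_t got = fread(recs, sizeof *recs, max, fp);
    fclose(fp);
    return (long)got;
}
```

With HDF5 the same `record_t` definition would typically be described to the library as a compound datatype (for example via `H5Tcreate(H5T_COMPOUND, sizeof(record_t))` and `H5Tinsert` with `HOFFSET`), which is what lets the struct be shared between the container and the native code. Unlike the raw binary file in this sketch, the HDF5 file also records the layout, so it stays portable across machines.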