Added thoughts on the value provided by the HDF5 container.

The load features of the analytical tools to be used should be considered to determine whether the HDF5 container adds any value. In general, custom C programs should get value out of saving the results of the read and ASCII-to-binary conversion steps back out to HDF5.
author: Don Pellegrino <don@drexel.edu> 2010-01-16 15:34:33 (GMT)
committer: Don Pellegrino <don@drexel.edu> 2010-01-16 15:34:33 (GMT)
commit: 7d4cded2072023a0bf1505ab01df21b66ffb4987
tree: a96c138735b955ce8b188408d774d85134c20de0
parent: ad58bc2790477c9804641489ecf26437e784feff
1 files changed, 37 insertions, 19 deletions
diff --git a/README b/README
index 197d289..f56193e 100644
--- a/README
+++ b/README
@@ -11,26 +11,43 @@ their meta-data in a number of different ways.  An exploration of the
 phylogenetic properties of the records first requires that the available data
 be collected and inventoried.
 
-Two primary alternatives have been identified for managing the data.  A
-relational database can be used.  IBM DB2 has been used for this.  The use of
-a relational database is limited by the difficulty in sharing the data.  Each
-vendor uses incompatible import and export routines.  Additionally installing
-an instance of a database management system (DBMS) often requires a large
-amount of effort and many not be practical on hosted environments which do not
-support the running of user daemons.  Finally proper parallelization of a DBMS
-will require additional system specific configuration for each machine used.
-
-An alternative to the DBMS is to use a container file format such as HDF5.
-This has the advantage that all of the data can be collected into a single
-file which can then be shared with others.  It has the disadvantage that is
-lacks the robust search and SQL operations provided by a DBMS.  In addition to
-two alternatives use fundamentally different storage strategies with the DBMS
-using a relational model and the contain file format using a hierarchical
-model.
+Two primary alternatives have been identified for managing the data.
+A relational database can be used.  IBM DB2 has been used for this in
+exp004.  The use of a relational database is limited by the difficulty
+in sharing the data.  Each vendor uses incompatible import and export
+routines.  Additionally installing an instance of a database
+management system (DBMS) often requires a large amount of effort and
+many not be practical on hosted environments which do not support the
+running of user daemons.  Proper parallelization of a DBMS will
+require additional system specific configuration for each machine
+used.  Generally a single DB2 instance with Internet connectivity has
+been used in conjunction with DB2 client installations on the
+analytical environments.
+
+An alternative to the DBMS is to use a container file format such as
+HDF5.  This has the advantage that all of the data can be collected
+into a single file which can then be shared with others.  It has the
+disadvantage that it lacks the robust search and SQL operations
+provided by a DBMS.  These two alternatives use fundamentally
+different storage strategies with the DBMS using a relational model
+and the container file format using a hierarchical model.
 
 The "doc/Data Deployments.dia" diagram shows the source systems that
-expose the various records as well as the transform routines that are
-used for aggregation of the data on the local system.
+expose the various influenza records as well as the transform routines
+that are used for aggregation of the data on the local system.
+Initially it may appear that loading the text files directly into the
+HDF5 container is redundant, particularly as a pure pre-processing
+step.  This will be a redundant effort for cases where tools are used
+which require yet another load step.  For custom C programs however
+reading the data from disk and converting it from ASCII text to a
+native datatype is a necessary preprocessing step.  Sharing the C
+struct definitions between HDF5 and the native code is the key
+differentiator between loading from text and loading from the binary
+HDF5 container.  Since these read and conversion operations must be
+done in the C code anyway the additional effort to save their results
+in the HDF5 container are justified by any time that can be saved by
+reusing the HDF5 data rather than rerunning the read and conversion
+operations from plain text.
 
 BUILDING
 
@@ -60,4 +77,5 @@ verify that the load was completed without error.
  Protein Sequences.txt are identical
 
 LocalWords:  NCBI parallelization HDF SQL Pellegrino phylogenetic DBMS dia mpi
- LocalWords:  autogen Autotools CPPFLAGS aa dat HDFView GUI diff txt
+ LocalWords:  autogen Autotools CPPFLAGS aa dat HDFView GUI diff txt exp pre
+ LocalWords:  datatype struct
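
The README text added in this commit argues that a custom C program has to read the text files and convert them to native types anyway, so writing the converted structs back out costs little extra. The following is a minimal sketch of that idea, not code from exp007: the `record_t` layout, the field names, the tab-separated input format, the sample accession strings and the function names are all assumptions made for illustration. To stay self-contained it writes the structs with plain `fwrite` as a stand-in for the HDF5 library. In an HDF5-based version the same struct would instead be described to the library as a compound datatype (`H5Tcreate` with `H5T_COMPOUND`, then `H5Tinsert` with `HOFFSET` for each field), which is how one C definition can be shared between the container and the native code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record layout; the fields are illustrative only. */
typedef struct {
    int  year;           /* collection year */
    int  length;         /* sequence length */
    char accession[16];  /* NUL-terminated accession string */
} record_t;

/* Convert one tab-separated ASCII line ("accession<TAB>year<TAB>length")
 * into the native struct.  Returns 0 on success, -1 on a malformed line. */
int parse_record(const char *line, record_t *out)
{
    /* Zero the whole struct so unused bytes are deterministic on disk. */
    memset(out, 0, sizeof *out);
    if (sscanf(line, "%15[^\t]\t%d\t%d",
               out->accession, &out->year, &out->length) != 3)
        return -1;
    return 0;
}

/* Save n already-converted records in binary form.
 * Stand-in for writing an HDF5 dataset.  Returns 0 on success. */
int save_records(const char *path, const record_t *recs, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t written = fwrite(recs, sizeof *recs, n, f);
    if (fclose(f) != 0 || written != n)
        return -1;
    return 0;
}

/* Load up to max records directly into native structs, with no text
 * parsing.  Returns the number of records read, or -1 if the file
 * cannot be opened. */
long load_records(const char *path, record_t *recs, size_t max)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(recs, sizeof *recs, max, f);
    fclose(f);
    return (long)got;
}
```

A first run would call `parse_record` on every input line and then `save_records` once; later runs can call `load_records` and skip the ASCII conversion entirely. Unlike HDF5, the raw `fwrite` stand-in is not portable across machines that differ in struct padding or byte order, which is part of what the container format adds.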