-rw-r--r--  README                     35
-rw-r--r--  data/ProteinNames.txt      275
-rw-r--r--  doc/Data Deployments.dia   bin 0 -> 3566 bytes
3 files changed, 310 insertions, 0 deletions
diff --git a/README b/README
new file mode 100644
index 0000000..9caedb8
--- /dev/null
+++ b/README
@@ -0,0 +1,35 @@
Experiment 007
Don Pellegrino [don@drexel.edu]

Collection and inventory of influenza data.

INTRODUCTION

The "Influenza Virus Resource" at NCBI
[http://www.ncbi.nlm.nih.gov/genomes/FLU/] exposes the sequence records and
their meta-data in a number of different ways. An exploration of the
phylogenetic properties of the records first requires that the available data
be collected and inventoried.

Two primary alternatives have been identified for managing the data. The
first is a relational database; IBM DB2 has been used for this. The use of
a relational database is limited by the difficulty in sharing the data. Each
vendor uses incompatible import and export routines. Additionally, installing
an instance of a database management system (DBMS) often requires a large
amount of effort and may not be practical on hosted environments which do not
support the running of user daemons. Finally, proper parallelization of a DBMS
will require additional system-specific configuration for each machine used.

An alternative to the DBMS is to use a container file format such as HDF5.
This has the advantage that all of the data can be collected into a single
file which can then be shared with others. It has the disadvantage that it
lacks the robust search and SQL operations provided by a DBMS. In addition,
the two alternatives use fundamentally different storage strategies, with the
DBMS using a relational model and the container file format using a
hierarchical model.

The "doc/Data Deployments.dia" diagram shows the source systems that
expose the various records as well as the transform routines that are
used for aggregation of the data on the local system.

 LocalWords: NCBI parallelization HDF SQL Pellegrino phylogenetic DBMS dia

Copyright © 2009 Don Pellegrino All Rights Reserved.