Temporal Analysis - Proteins by Year of Strain

Post by donpellegrino » Thu Dec 17, 2009 6:33 pm

Influenza sequence records from NCBI often include a strain feature. For example, the record at http://www.ncbi.nlm.nih.gov/nuccore/CY053413 reports "/strain='A/Russia/19/2009(H1N1)'". NCBI eFetch was run against the August 7, 2009 sequence data to collect the metadata for all 115,384 protein sequences available as of that date. The strain feature can be extracted from the XML metadata with the following DB2 query:

Code:
INSERT INTO strains (
SELECT   X.*
FROM   efetch_protein EP,
   XMLTABLE (
   'for $x in
   $d/GBSet/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier
   where $x/GBQualifier_name = "strain"
   return $x'
   PASSING EP.metadata as "d"
   COLUMNS
   strain              VARCHAR(254)   PATH './GBQualifier_value',
   GBSeq_accession_version   VARCHAR(50)   PATH '../../../../GBSeq_accession-version'
   ) AS X
);
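
For readers without DB2, the same extraction can be sketched in Python with the standard library. This is only an illustration, not the method used above: it assumes each metadata document has a GBSet root as in the XQuery path, and the function name, the SAMPLE document, and the accession "XX000000.1" are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down GBSet document; the accession is a placeholder.
SAMPLE = """<GBSet>
  <GBSeq>
    <GBSeq_accession-version>XX000000.1</GBSeq_accession-version>
    <GBSeq_feature-table>
      <GBFeature>
        <GBFeature_quals>
          <GBQualifier>
            <GBQualifier_name>strain</GBQualifier_name>
            <GBQualifier_value>A/Russia/19/2009(H1N1)</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>organism</GBQualifier_name>
            <GBQualifier_value>Influenza A virus</GBQualifier_value>
          </GBQualifier>
        </GBFeature_quals>
      </GBFeature>
    </GBSeq_feature-table>
  </GBSeq>
</GBSet>"""

QUALIFIER_PATH = "GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier"


def extract_strains(xml_text):
    """Return (strain, accession_version) pairs, one per strain qualifier."""
    root = ET.fromstring(xml_text)  # assumed to be the GBSet element
    rows = []
    for seq in root.findall("GBSeq"):
        accession = seq.findtext("GBSeq_accession-version")
        for qual in seq.findall(QUALIFIER_PATH):
            # Keep only qualifiers named "strain", like the XQuery where clause.
            if qual.findtext("GBQualifier_name") == "strain":
                rows.append((qual.findtext("GBQualifier_value"), accession))
    return rows


if __name__ == "__main__":
    for strain, accession in extract_strains(SAMPLE):
        print(accession, strain)
```

Each returned pair corresponds to one row the query would insert into the strains table.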


Parsing the year component out of the strain string populates the year field for 111,576 (96.7%) of the records. Alternatively, the year column from /genomes/INFLUENZA/influenza_aa.dat could be used.
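
The parser itself is not shown in this post. A minimal sketch of one way to do it, assuming the year is the last slash-separated field, optionally followed by a parenthesized subtype, might look like the following. The function name and the year bounds are my own choices for illustration, and this version deliberately ignores two-digit years rather than guess the century.

```python
import re

# A four-digit number after the final slash, optionally followed by a
# parenthesized subtype such as (H1N1), at the end of the string.
_YEAR_RE = re.compile(r"/(\d{4})\s*(?:\([^)]*\))?\s*$")


def parse_year(strain, min_year=1902, max_year=2009):
    """Return the strain year as an int, or None if no plausible year is found."""
    match = _YEAR_RE.search(strain)
    if match is None:
        return None
    year = int(match.group(1))
    # Reject values outside the range covered by the dataset.
    return year if min_year <= year <= max_year else None


if __name__ == "__main__":
    for s in ["A/Russia/19/2009(H1N1)", "A/Hong Kong/1/1968", "A/Puerto Rico/8/34(H1N1)"]:
        print(s, "->", parse_year(s))
```

Strings that do not match come back as None, which is how a sketch like this would leave a year field unpopulated.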

With the year value parsed from the strain feature of the protein records, it is possible to annotate the map. A base layer is created in which each point is colored by its year value, and a color legend shows the mapping from color to year. In addition, an animation frame is created for each year from 1902, the first year for which data is available, to 2009. Each frame overlays white points for the proteins associated with a strain from that year, so we can see how sequences appear and disappear from different points along the base layer over time. For years with no records, the frame shows no highlights. I used ParaView to create the animation with three frames per year at one frame per second, so each year is shown for three seconds. I then used mencoder to write an AVI file of the animation. This has been posted to the site at:

http://cluster.ischool.drexel.edu/~st96 ... 091217.avi

The most striking feature of the animation is how tightly 2009 clusters compared with prior years. I provided a split view with a zoomed-in region on the left to show how dramatic this feature is. Interestingly, the previous pandemic year, 1968, shows a similar global phenomenon. There are at least two possible explanations:

1. Perhaps during pandemic years scientists simply sequenced only the pandemic strains.

2. And/or during pandemic years the pandemic strain is so dominant that it drives all the other strains to extinction.

From here I might use quantitative methods to profile the potential mutations we will see in new influenza strains in upcoming years by fitting a regression model to the historical temporal diversity data. This view also provides a good framework on which to overlay the temporal themes in the influenza literature identified by CiteSpace.

What patterns do you see in the animation?

Re: Temporal Analysis - Proteins by Year of Strain

Post by donpellegrino » Mon Dec 21, 2009 2:32 pm

I have also posted a copy of the animation to YouTube (http://www.youtube.com/watch?v=A5xkn6sfKPM). Initial feedback has been "Wow! - That's boring" and "nothing is happening." The uneven distribution of data across the years is the difficulty here. This first draft of the animation uses a constant frame rate (1 fps) with 3 frames dedicated to every year between the first data point in 1902 and the last data point in 2009. Since there is little data between 1920 and 1968, there is little movement in the animation for roughly the first 3:20 of the movie, which makes the beginning quite boring. I could just drop the first 3:20, but I opted to leave it in for completeness. A separate version of the animation should include narration and a dynamic time scale. Such a version would be a more interesting complement to the analytically accurate but boring first version.
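
To make the timing concrete, here is a rough sketch comparing the constant schedule described above with one possible dynamic time scale. The logarithmic weighting, the function names, and the min/max frame limits are assumptions for illustration only, not a design I have settled on, and the counts in the example are made-up placeholders rather than values from the dataset.

```python
import math


def frames_constant(first_year, last_year, frames_per_year=3):
    """Constant schedule: the same number of frames for every year."""
    return {y: frames_per_year for y in range(first_year, last_year + 1)}


def frames_dynamic(counts, first_year, last_year, min_frames=1, max_frames=6):
    """Give years with more records more frames, on a log scale.

    counts maps year -> number of records; missing years count as zero.
    """
    peak = max(counts.values(), default=0)
    schedule = {}
    for year in range(first_year, last_year + 1):
        n = counts.get(year, 0)
        if n <= 0 or peak <= 0:
            schedule[year] = min_frames  # empty years pass quickly
        else:
            weight = math.log1p(n) / math.log1p(peak)  # 0 < weight <= 1
            schedule[year] = min_frames + round(weight * (max_frames - min_frames))
    return schedule


if __name__ == "__main__":
    constant = frames_constant(1902, 2009)
    print("constant schedule:", sum(constant.values()), "frames")
    # Placeholder counts, NOT the real per-year totals.
    example_counts = {1918: 10, 1968: 100, 2009: 10000}
    dynamic = frames_dynamic(example_counts, 1902, 2009)
    print("dynamic schedule:", sum(dynamic.values()), "frames")
```

At 1 fps the constant schedule covers 108 years in 324 frames (5:24), of which the 66 years before 1968 take 198 seconds, close to the 3:20 mentioned above. A dynamic schedule shortens the sparse years while still touching every year.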

[Figure: proteins_summary.png. Count of data points in the dataset by year of strain.]


The first version of the animation is better suited to manual analysis, interactively moving the time slider back and forth in a movie player that supports it. A second version will have to be made for watching from beginning to end, with a narrative that tells the story. The challenge now is to figure out what story the data is telling.