Influenza Sequence Mapping Project - Data  

The iSchool at Drexel


This page describes the data aspects of the Influenza Sequence Mapping Project.

Influenza Protein Sequence Data

1. NCBI to local FASTA via rsync

The Influenza Virus Resource from the National Center for Biotechnology Information (NCBI) provides a public repository of influenza sequence data. This data serves as the input for generating the map. The scripts start by using rsync to copy the contents of the repository directory to a local machine.

2. FASTA to BLAST via formatdb

Once a copy of the data has been retrieved from NCBI, the formatdb command is used to create a BLAST database from the influenza.faa file, which contains the protein sequence data in FASTA format.
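For readers unfamiliar with the FASTA layout that formatdb consumes, the following is a minimal Python sketch of how such a file can be read. It is not part of the project's scripts; the function name read_fasta and the two sample records are made up for illustration and are not real influenza sequences.

```python
from typing import Dict, Iterable

# Made-up records for illustration only (not taken from influenza.faa).
SAMPLE_FASTA = """\
>seq1 hypothetical protein A
MKAIL
VVLLY
>seq2 hypothetical protein B
MNPNQ
"""


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping from each record identifier to its full sequence.

    A record starts with a '>' header line; the identifier is taken to be
    the first whitespace-delimited token of the header. The sequence may
    span several lines, which are joined together.
    """
    records: Dict[str, str] = {}
    current_id = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if current_id is not None:
                records[current_id] = "".join(chunks)
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    # Store the final record.
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


if __name__ == "__main__":
    for identifier, sequence in read_fasta(SAMPLE_FASTA.splitlines()).items():
        print(identifier, len(sequence))
```

With a real file, the same function could be called as read_fasta(open("influenza.faa")), assuming the file fits comfortably in memory.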

3. BLAST to Similarity Scores via BLASTP

BLASTP is run to compare all proteins in the database created in step 2 with each other. This results in a set of similarity scores for each protein sequence. Each set lists the other protein sequences to which the given sequence is most similar.
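The page does not say which BLASTP output format the project uses. As a sketch only, the code below assumes the standard 12-column tab-separated tabular output (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score) and groups hits into one similarity set per query. The function name, the e-value cutoff, and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Made-up tabular rows for illustration only.
SAMPLE_HITS = [
    "A\tA\t100.00\t10\t0\t0\t1\t10\t1\t10\t1e-10\t50.0",
    "A\tB\t90.00\t10\t1\t0\t1\t10\t1\t10\t1e-8\t40.0",
    "A\tC\t30.00\t10\t7\t0\t1\t10\t1\t10\t5.0\t12.0",
    "B\tA\t90.00\t10\t1\t0\t1\t10\t1\t10\t1e-8\t40.0",
]


def similarity_sets(
    lines: Iterable[str], max_evalue: float = 1e-5
) -> Dict[str, List[Tuple[str, float]]]:
    """Group tabular BLAST hits by query sequence.

    Returns, for each query, a list of (subject, bit_score) pairs sorted
    from highest to lowest bit score. Self-hits and hits whose e-value
    exceeds max_evalue (an assumed cutoff) are dropped.
    """
    hits: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular row
        query, subject = fields[0], fields[1]
        evalue, bit_score = float(fields[10]), float(fields[11])
        if query == subject or evalue > max_evalue:
            continue
        hits[query].append((subject, bit_score))
    for query in hits:
        hits[query].sort(key=lambda hit: hit[1], reverse=True)
    return dict(hits)


if __name__ == "__main__":
    for query, neighbours in similarity_sets(SAMPLE_HITS).items():
        print(query, neighbours)
```

The cutoff is a tuning choice: a stricter value yields smaller similarity sets and therefore a sparser graph in the next step.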

4. Similarity Scores to Undirected Graph via custom procedure

A custom stored procedure is run using IBM DB2 to create an undirected graph from the similarity scores. Nodes in the graph are instantiated for each protein with scores assigned in step 3. Undirected edges are instantiated between each protein in the similarity sets.
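The project performs this step with a DB2 stored procedure, which is not reproduced on this page. The Python sketch below only illustrates the idea under one reading of the description: each protein that has a similarity set becomes a node, and one undirected edge joins a protein to each member of its set. The function names and sample data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Made-up similarity sets for illustration only.
SAMPLE_SETS: Dict[str, List[str]] = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": [],
}


def graph_nodes(sets: Dict[str, List[str]]) -> Set[str]:
    """Every protein that appears as a query or as a neighbour."""
    nodes = set(sets)
    for neighbours in sets.values():
        nodes.update(neighbours)
    return nodes


def undirected_edges(sets: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Collapse directed 'A is similar to B' hits into undirected edges.

    Each edge is stored as a sorted pair, so A-B and B-A are the same
    edge. Self-loops are skipped.
    """
    edges: Set[Tuple[str, str]] = set()
    for protein, neighbours in sets.items():
        for other in neighbours:
            if other == protein:
                continue
            first, second = sorted((protein, other))
            edges.add((first, second))
    return edges


if __name__ == "__main__":
    print(sorted(graph_nodes(SAMPLE_SETS)))
    print(sorted(undirected_edges(SAMPLE_SETS)))
```

Storing each edge as a sorted pair is what makes the graph undirected: the reciprocal hits A to B and B to A collapse into a single edge.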

5. Undirected Graph to 2D coordinates via LGL

The Large Graph Layout (LGL) algorithm is used to assign each node coordinates in two-dimensional space. The interactive visualization tool can be used to render the resulting layout.
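How the graph is handed to LGL is not described here. One plausible route, shown as a sketch, is LGL's plain-text .ncol edge list, which as commonly documented holds one edge per line as two whitespace-separated vertex names, with no self-loops and no edge repeated in either direction. The function name to_ncol is hypothetical, and vertex names are assumed to contain no whitespace.

```python
from typing import Iterable, Tuple


def to_ncol(edges: Iterable[Tuple[str, str]]) -> str:
    """Serialise undirected edges as .ncol text, one 'name1 name2' per line.

    Self-loops are dropped, and an edge is written only once even if it
    is supplied in both directions.
    """
    seen = set()
    lines = []
    for a, b in edges:
        if a == b:
            continue  # self-loops are not allowed
        key = (a, b) if a <= b else (b, a)
        if key in seen:
            continue  # already written in one direction
        seen.add(key)
        lines.append(f"{key[0]} {key[1]}")
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    # Made-up edges for illustration only.
    print(to_ncol([("B", "A"), ("A", "B"), ("C", "C"), ("A", "C")]), end="")
```

The returned text can be written to a file and passed to the layout tool; the layout run itself is outside the scope of this sketch.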

Influenza Protein Sequence Meta-data

The NCBI EFetch tool is used to collect the meta-data for each influenza protein sequence record from the NCBI protein database. This data is retrieved from NCBI in XML format. It is then transformed into a relational model and stored in a local DB2 database for fast lookup by the visualization tool. The stored fields are populated with varying levels of completeness and accuracy.
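As a rough illustration of the retrieve-and-transform step, the Python sketch below builds an EFetch request URL for the protein database and flattens GenBank-style XML records into rows. Several things here are assumptions and not the project's actual choices: the rettype value, the three GBSeq elements selected, the table layout, the use of SQLite as a stand-in for DB2, and the sample XML, which is made up. Missing elements become NULL, which is one way incomplete fields can show up in the relational model.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import Iterable, List, Optional, Tuple
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Assumed selection of elements; the project's real field list may differ.
FIELDS = ("GBSeq_locus", "GBSeq_length", "GBSeq_organism")

# Made-up XML for illustration only.
SAMPLE_XML = """<GBSet>
  <GBSeq>
    <GBSeq_locus>EXAMPLE1</GBSeq_locus>
    <GBSeq_length>566</GBSeq_length>
    <GBSeq_organism>Influenza A virus</GBSeq_organism>
  </GBSeq>
  <GBSeq>
    <GBSeq_locus>EXAMPLE2</GBSeq_locus>
  </GBSeq>
</GBSet>"""

Row = Tuple[Optional[str], ...]


def efetch_url(ids: Iterable[str]) -> str:
    """Build an EFetch URL requesting protein records as XML."""
    params = {"db": "protein", "id": ",".join(ids),
              "rettype": "gp", "retmode": "xml"}
    return EFETCH + "?" + urlencode(params)


def parse_records(xml_text: str) -> List[Row]:
    """Flatten each GBSeq element into one row; absent fields become None."""
    root = ET.fromstring(xml_text)
    return [tuple(seq.findtext(tag) for tag in FIELDS)
            for seq in root.iter("GBSeq")]


def store(rows: List[Row]) -> sqlite3.Connection:
    """Load rows into an in-memory SQLite table (stand-in for DB2)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein_meta (locus TEXT, length TEXT, organism TEXT)")
    conn.executemany("INSERT INTO protein_meta VALUES (?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    print(efetch_url(["EXAMPLE1", "EXAMPLE2"]))
    conn = store(parse_records(SAMPLE_XML))
    print(conn.execute("SELECT * FROM protein_meta").fetchall())
```

The sketch never contacts NCBI; fetching the URL and respecting NCBI's usage guidelines are left out on purpose.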


Metadata associated with this resource:
Copyright © 2008 Don Pellegrino All Rights Reserved.