Map distance as indicator of sequence similarity.

Interesting paths and connections through the data.

Post by donpellegrino » Wed Oct 28, 2009 2:55 pm

The positioning of the protein sequences on the two-dimensional projection is handled by the Large Graph Layout algorithm. In general, sequences that appear close to each other on the map should be similar, and sequences that appear farther apart should be dissimilar. This can be easily verified using the selection interface and performing a BLAST comparison. First, two points that are very close to one another on the map are selected:

Code:
GI, GBSEQ_LOCUS, GBSEQ_ACCESSION_VERSION, GBSEQ_UPDATE_DATE, GBSEQ_CREATE_DATE
ORGANISM
STRAIN
SEROTYPE, N_CITES

227977170, ACP44188, ACP44188.1, 01-JUN-2009, 28-APR-2009
Influenza A virus (A/California/07/2009(H1N1))
A/California/07/2009
H1N1, 3

78032581, ABB17157, ABB17157.1, 14-MAY-2008, 26-JAN-2006
Influenza A virus (A/mallard/Alberta/77/1977(H2N3))
A/mallard/Alberta/77/1977
H2N3, 2
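
The selection listing above follows a fixed four-line layout per record, so it can be loaded into a script before running any comparison. The following is only a minimal sketch, assuming records are separated by blank lines and that the header block is repeated verbatim; `parse_selection` and the dictionary keys are hypothetical names, not part of the tool.

```python
import re
from typing import Dict, List


def parse_selection(text: str) -> List[Dict[str, object]]:
    """Parse blank-line-separated, four-line records from the selection listing."""
    records: List[Dict[str, object]] = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) != 4:
            continue  # not a complete record
        head = [f.strip() for f in lines[0].split(",")]
        tail = [f.strip() for f in lines[3].split(",")]
        # Skip the column-name header and anything malformed.
        if len(head) != 5 or len(tail) != 2:
            continue
        if not head[0].isdigit() or not tail[1].isdigit():
            continue
        records.append({
            "gi": head[0],
            "locus": head[1],
            "accession_version": head[2],
            "update_date": head[3],
            "create_date": head[4],
            "organism": lines[1],
            "strain": lines[2],
            "serotype": tail[0],
            "n_cites": int(tail[1]),
        })
    return records


# The listing from the post, pasted as-is (assumed input format).
SAMPLE = """GI, GBSEQ_LOCUS, GBSEQ_ACCESSION_VERSION, GBSEQ_UPDATE_DATE, GBSEQ_CREATE_DATE
ORGANISM
STRAIN
SEROTYPE, N_CITES

227977170, ACP44188, ACP44188.1, 01-JUN-2009, 28-APR-2009
Influenza A virus (A/California/07/2009(H1N1))
A/California/07/2009
H1N1, 3

78032581 , ABB17157, ABB17157.1, 14-MAY-2008, 26-JAN-2006
Influenza A virus (A/mallard/Alberta/77/1977(H2N3))
A/mallard/Alberta/77/1977
H2N3, 2
"""

records = parse_selection(SAMPLE)
for rec in records:
    print(rec["gi"], rec["serotype"], rec["strain"])
```

The GI numbers extracted this way are the identifiers used in the BLAST comparison.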


Consistent with expectations, a sequence alignment between gi 227977170 and gi 78032581 shows that they are very similar (693 of 716 residues identical, no gaps):

Code:
Query ID
    gi|227977170|gb|ACP44188.1|
Description
    polymerase PA [Influenza A virus (A/California/07/2009(H1N1))]
Molecule type
    amino acid
Query Length
    716

Subject ID
    gi|78032581|gb|ABB17157.1|
Description
    polymerase PA [Influenza A virus (A/mallard/Alberta/77/1977(H2N3))] See details
Molecule type
    amino acid
Subject Length
    716
Program
    BLASTP 2.2.22+

gb|ABB17157.1|  polymerase PA [Influenza A virus (A/mallard/Alberta/77/1977(H2N3))]
Length=716

Score = 1454 bits (3764),  Expect = 0.0, Method: Compositional matrix adjust.
Identities = 693/716 (96%), Positives = 707/716 (98%), Gaps = 0/716 (0%)

Query  1    MEDFVRQCFNPMIVELAXKAMKEYGEDPKIETNKFAAICTHLEVCFMYSDFHFIDERGES  60
            MEDFVRQCFNPMIVELA KAMKEYGEDPKIETNKFAAICTHLEVCFMYSDFHFIDERGES
Sbjct  1    MEDFVRQCFNPMIVELAEKAMKEYGEDPKIETNKFAAICTHLEVCFMYSDFHFIDERGES  60

Query  61   IIVESGDPNALLKHRFEIIEGRDRIMAWTVVNSICNTTGVEKPKFLPDLYDYKENRFIEI  120
            IIVESGDPNALLKHRFEIIEGRDR MAWTVVNSICNTTGVEKPKFLPDLYDYKENRFIEI
Sbjct  61   IIVESGDPNALLKHRFEIIEGRDRTMAWTVVNSICNTTGVEKPKFLPDLYDYKENRFIEI  120

Query  121  GVTRREVHIYYLEKANKIKSEKTHIHIFSFTGEEMATKADYTLDEESRARIKTRLFTIRQ  180
            GVTRREVHIYYLEKANKIKSEKTHIHIFSFTGEEMATKADYTLDEESRARIKTRLFTIRQ
Sbjct  121  GVTRREVHIYYLEKANKIKSEKTHIHIFSFTGEEMATKADYTLDEESRARIKTRLFTIRQ  180

Query  181  EMASRSLWDSFRQSERGEETIEEKFEITGTMRKLADQSLPPNFPSLENFRAYVDGFEPNG  240
            EMASR LWDSFRQSERGEETIEE+FEITGTMR+LADQSLPPNF SLENFRAYVDGFEPNG
Sbjct  181  EMASRGLWDSFRQSERGEETIEERFEITGTMRRLADQSLPPNFSSLENFRAYVDGFEPNG  240

Query  241  CIEGKLSQMSKEVNAKIEPFLRTTPRPLRLPDGPLCHQRSKFLLMDALKLSIEDPSHEGE  300
            CIEGKLSQMSKEVNA+IEPFL+TTPRPLRLPDGP C QRSKFLLMDALKLSIEDPSHEGE
Sbjct  241  CIEGKLSQMSKEVNARIEPFLKTTPRPLRLPDGPPCSQRSKFLLMDALKLSIEDPSHEGE  300

Query  301  GIPLYDAIKCMKTFFGWKEPNIVKPHEKGINPNYLMAWKQVLAELQDIENEEKIPRTKNM  360
            GIPLYDAIKCMKTFFGWKEP I+KPHEKGINPNYL+AWKQVLAELQDIENEEKIP+TKNM
Sbjct  301  GIPLYDAIKCMKTFFGWKEPKIIKPHEKGINPNYLLAWKQVLAELQDIENEEKIPKTKNM  360

Query  361  KRTSQLKWALGENMAPEKVDFDDCKDVGDLKQYDSDEPEPRSLASWVQNEFNKACELTDS  420
            K+TSQLKWALGENMAPEKVDF+DCKDV DLKQYDSDEPE RSLASW+Q+EFNKACELTDS
Sbjct  361  KKTSQLKWALGENMAPEKVDFEDCKDVSDLKQYDSDEPETRSLASWIQSEFNKACELTDS  420

Query  421  SWIELDEIGEDVAPIEHIASMRRNYFTAEVSHCRATEYIMKGVYINTALLNASCAAMDDF  480
            SW+ELDEIGED+APIEHIASMRRNYFTAEVSHCRATEYIMKGVYINTALLNASCAAMDDF
Sbjct  421  SWMELDEIGEDIAPIEHIASMRRNYFTAEVSHCRATEYIMKGVYINTALLNASCAAMDDF  480

Query  481  QLIPMISKCRTKEGRRKTNLYGFIIKGRSHLRNDTDVVNFVSMEFSLTDPRLEPHKWEKY  540
            QLIPMISKCRTKEGRRKTNLYGFIIKGRSHLRNDTDVVNFVSMEFSLTDPRLEPHKWEKY
Sbjct  481  QLIPMISKCRTKEGRRKTNLYGFIIKGRSHLRNDTDVVNFVSMEFSLTDPRLEPHKWEKY  540

Query  541  CVLEIGDMLLRTAIGQVSRPMFLYVRTNGTSKIKMKWGMEMRRCLLQSLQQIESMIEAES  600
            CVLEIGDMLLRTAIGQVSRPMFLYVRTNGTSKIKMKWGMEMRRCLLQSLQQIESMIEAES
Sbjct  541  CVLEIGDMLLRTAIGQVSRPMFLYVRTNGTSKIKMKWGMEMRRCLLQSLQQIESMIEAES  600

Query  601  SVKEKDMTKEFFENKSETWPIGESPRGVEEGSIGKVCRTLLAKSVFNSLYASPQLEGFSA  660
            SVKEKDMTKEFFENKSETWPIGESP+GVEEGSIGKVCRTLLAKSVFNSLYASPQLEGFSA
Sbjct  601  SVKEKDMTKEFFENKSETWPIGESPKGVEEGSIGKVCRTLLAKSVFNSLYASPQLEGFSA  660

Query  661  ESRKLLLIVQALRDNLEPGTFDLGGLYEAIEECLINDPWVLLNASWFNSFLTHALK  716
            ESRKLLLIVQALRDNLEPGTFDLGGLYEAIEECLINDPWVLLNASWFNSFLTHALK
Sbjct  661  ESRKLLLIVQALRDNLEPGTFDLGGLYEAIEECLINDPWVLLNASWFNSFLTHALK  716
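
For comparing more than a couple of pairs, the "Identities / Positives / Gaps" line of the report above can be read programmatically rather than by eye. This is a rough sketch that assumes the plain-text BLASTP layout shown in this post; `parse_hsp_stats` is a hypothetical helper, not part of BLAST.

```python
import re
from typing import Dict, List

# Matches the per-alignment statistics line as it appears in the report text.
_HSP_STATS = re.compile(
    r"Identities = (\d+)/(\d+) \(\d+%\), "
    r"Positives = (\d+)/(\d+) \(\d+%\), "
    r"Gaps = (\d+)/(\d+) \(\d+%\)"
)


def parse_hsp_stats(report: str) -> List[Dict[str, float]]:
    """Return one dict of fractions per alignment found in the report text."""
    stats: List[Dict[str, float]] = []
    for match in _HSP_STATS.finditer(report):
        ident, length, pos, _, gaps, _ = map(int, match.groups())
        stats.append({
            "aligned_length": length,
            "identity": ident / length,
            "positives": pos / length,
            "gaps": gaps / length,
        })
    return stats


# Statistics line copied from the alignment above.
report = "Identities = 693/716 (96%), Positives = 707/716 (98%), Gaps = 0/716 (0%)"
hsp = parse_hsp_stats(report)[0]
print(f"identity over {hsp['aligned_length']} positions: {hsp['identity']:.3f}")
```

Recomputing the fraction from the raw counts avoids relying on the whole-number percentage printed in the report.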


Next, two sequences from opposite ends of the map are selected:

Code:
167859496, ACA04707, ACA04707.1, 08-JAN-2009, 19-FEB-2008
Influenza A virus (A/duck/Eastern China/89/2005(H5N1))
A/duck/Eastern China/89/2005
H5N1, 2

31339558, AAP49111, AAP49111.1, 01-JAN-1900, 03-JUN-2003
Influenza A virus (A/wild duck/Shantou/4808/01(H9N2))
(A/Wild Duck/Shantou/4808/01(H9N2))
-, 2


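Again, consistent with expectations, these are found to be far more dissimilar. BLAST reports only short, low-scoring local matches (bit scores below 16, Expect values between 3.4 and 5.5):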

Code:
Query ID
    gi|167859496|gb|ACA04707.1|
Description
    neuraminidase [Influenza A virus (A/duck/Eastern China/89/2005(H5N1))]
Molecule type
    amino acid
Query Length
    449

Subject ID
    gi|31339558|gb|AAP49111.1|
Description
    polymerase [Influenza A virus (A/Wild Duck/Shantou/4808/01(H9N2))] See details
Molecule type
    amino acid
Subject Length
    563
Program
    BLASTP 2.2.22+

gb|AAP49111.1|  polymerase [Influenza A virus (A/Wild Duck/Shantou/4808/01(H9N2))]
Length=563

Score = 15.8 bits (29),  Expect = 3.4, Method: Compositional matrix adjust.
Identities = 4/7 (57%), Positives = 6/7 (85%), Gaps = 0/7 (0%)

Query  142  PVGETPS  148
            P+GE+P
Sbjct  467  PIGESPK  473


Score = 15.4 bits (28),  Expect = 4.8, Method: Compositional matrix adjust.
Identities = 8/25 (32%), Positives = 11/25 (44%), Gaps = 2/25 (8%)

Query  270  CVCRDNWHGSNRPWVSFNQNLEYQI  294
            C+  D W   N  W  FN  L + +
Sbjct  540  CLINDPWVLLNASW--FNSFLTHAL  562


Score = 15.4 bits (28),  Expect = 5.5, Method: Compositional matrix adjust.
Identities = 10/48 (20%), Positives = 16/48 (33%), Gaps = 10/48 (20%)

Query  168  TSWLTIGISGPDNGAVAVLKYNGIITDTIKSWRNNILRTQESECAGVN  215
            +SW+ +   G D   +          + I S R N    + S C   
Sbjct  267  SSWIELDEIGEDVAPI----------EHIASMRRNYFTAEVSHCRATE  304


As a simple sanity check, this shows that the distance between two points on the map is a general indicator of the similarity between the two sequences those points represent.
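
The same check can be written down as a small script. This is only a sketch: the map coordinates below are hypothetical placeholders (the post does not give the actual layout positions), while the bit scores are the best scores from the two BLASTP reports above. `map_distance` and `is_consistent` are illustrative names.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def map_distance(a: Point, b: Point) -> float:
    """Euclidean distance between two points of the 2D layout."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def is_consistent(near_dist: float, far_dist: float,
                  near_score: float, far_score: float) -> bool:
    """True when the pair that is closer on the map also scores higher in BLAST."""
    return near_dist < far_dist and near_score > far_score


# HYPOTHETICAL coordinates, chosen only to illustrate "close" and "far".
near_pair = ((10.0, 12.0), (10.5, 12.2))
far_pair = ((-80.0, 5.0), (95.0, -40.0))

# Best bit scores reported in the post for each pair.
near_score = 1454.0  # gi 227977170 vs gi 78032581
far_score = 15.8     # gi 167859496 vs gi 31339558

near_dist = map_distance(*near_pair)
far_dist = map_distance(*far_pair)
print(is_consistent(near_dist, far_dist, near_score, far_score))
```

With real layout coordinates substituted in, the same comparison could be repeated over many pairs instead of two.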

Re: Map distance as indicator of sequence similarity.

Post by donpellegrino » Wed Oct 28, 2009 2:58 pm

Attachment: gi_227977170_78032581.png
gi 227977170 and gi 78032581 selected on the map.