Data formats in biology

  1. Data formats in biology. Pierre Poulain, pierre.poulain@univ-paris-diderot.fr, 09/2011
  2. Menu: 1 Review, 2 Problem statement, 3 Sequences, 4 Structures, 5 A few precautions, 6 Conclusion, 7 References & image credits. PP Université Paris Diderot - Paris 7
  3. Menu: 1 Review, 2 Problem statement, 3 Sequences, 4 Structures, 5 A few precautions, 6 Conclusion, 7 References & image credits
  4. Central dogma of biology: DNA, then RNA (transcription), then protein (translation)
  5. Menu: 1 Review, 2 Problem statement, 3 Sequences, 4 Structures, 5 A few precautions, 6 Conclusion, 7 References & image credits
  6. Experimentally: DNA (A, T, C, G), RNA (A, U, C, G), protein (V, G, W, C...)
      AAGATGACCGTGTGTCAT
      TTGATCCTGAACTGTTTG
      AAAAAATGTTCCGTGACG
      GACTCTTTGATGATGAGA
      CCTCGGAAGTAACGGAGC
      AGCGCAATGTTCCGTGAC
      CAGCTGACAATGTATCAG
      ATTCCAGACTGGATCAGA
      TCTGAATGCCATTAGCTT
  7. Sequences > structures (slide background: a wall of raw DNA sequence)
  8. Sequences > structures
  9. Sequences > structures
  10. Lots of data...
  11. ...that you work with
  12. Menu: 1 Review, 2 Problem statement, 3 Sequences, 4 Structures, 5 A few precautions, 6 Conclusion, 7 References & image credits
  13. Nucleic acid and protein sequences
  14. FASTA format: the simplest
      >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
      LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
      EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
      LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
      GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
      IENY
  15. FASTA
      >header
      sequence, at most 80 characters per line
      sequence, at most 80 characters per line
      sequence, at most 80 characters per line
      sequence, at most 80 characters per line
      sequence, at most 80 charac
  16. Remarks: the ">" is stuck to the header (no space in between); every sequence line has a fixed length; extensions: .fasta, .seq, .fas, .fna, .faa; Python: strings + lists (+ Biopython)
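The layout rules above (a ">" stuck to the header, sequence lines of fixed length) can be sketched with plain Python strings and lists. The helper name `format_fasta` and the toy record are illustrative assumptions, not part of the original slides; the default width of 80 follows the limit shown on slide 15.

```python
def format_fasta(header, sequence, width=80):
    """Return one FASTA record as text (hypothetical helper, not a library call)."""
    # The ">" is written directly before the header, with no space in between.
    lines = [">" + header]
    # Cut the sequence into chunks of at most `width` characters using slices.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Toy sequence made up for illustration: 170 residues.
record = format_fasta("toy_protein example", "MK" * 85)
print(record)
```

With a 170-character sequence, the record holds two full lines of 80 characters and a final line of 10.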
  17. Multifasta
      >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
      LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
      EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
      LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
      GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
      IENY
      >gi|134252438|gb|ABO64984.1| cytochrome b [Elephantulus rupestris]
      TAFSSVTHICRDVNYGWLIRYLHANGASLFFICLFIHVGRGIYYGSYLYFETWNIGVILLFITMATAFMG
      YVLPWGQMSFWGATVITNLLSAIPYIGTTLVEWIWGGFSVDKATLTRFFAFHFILPFIIAALAMVHLLFL
      HETGSNNPLGLVSDSDKIPFHPYYTIKDLLGVFAILILHLSLVLFSPDLLGDPDNYTPANPLNTPPHIKP
      EWYFLFAYAILRSIPNKLGGVLALVLSILILIIFPLLHTSKQRSLMFRPISQCLFWVLVADLLTLTWIGG
      QPVEHPYIIIGQLASILYFTIILVLMPIAGVIENHIIKL
      >gi|157367467|gb|ABV45600.1| cytochrome b [Mammuthus primigenius]
      MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTMTAFSSMSHIC
      RDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSF
      WGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFALHFILPFTMIALAGVHLTFLHETGSNNPLG
      LTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAI
      LRSVPNKLGGILALLLSILILGMMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEHPYIII
      GQMASILYFSIILAFLPIAGMIENYLIK
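Reading a multi-record FASTA file only needs strings and lists, as the remarks on slide 16 suggest. What follows is a minimal sketch under that assumption; `parse_fasta` and the two toy records are hypothetical names and data chosen for illustration, and Biopython would be the more robust choice for real files.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input made up for illustration.
example = """>seq1 toy protein
MTHIRKSHPL
LKIINKSF
>seq2 toy DNA
AAGATGACCG
"""
records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

Sequence lines are joined back into a single string per record, so the line length used in the file no longer matters once the record is parsed.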
  18. Primary sequence databases: GenBank, EMBL, DDBJ
  19. GenBank http://www.ncbi.nlm.nih.gov/
  20. Trypsin?
  21. Trypsin!
  22. Example
      LOCUS       NM_001001317             940 bp    mRNA    linear   PRI 27-DEC-2010
      DEFINITION  Homo sapiens trypsin X3 (TRYX3), mRNA.
      ACCESSION   NM_001001317
      VERSION     NM_001001317.2  GI:170650697
      [...]
      FEATURES             Location/Qualifiers
           source          1..940
                           /organism="Homo sapiens"
                           /mol_type="mRNA"
                           /db_xref="taxon:9606"
                           /chromosome="7"
                           /map="7q34"
           gene            1..940
                           /gene="TRYX3"
                           /gene_synonym="FLJ16649; MGC35022; PRSS1; TRY1; UNQ2540"
                           /note="trypsin X3"
                           /db_xref="GeneID:136541"
                           /db_xref="HPRD:15572"
      [...]
      ORIGIN
              1 aaggctggca aaaaggagac cagacaggag gcgtctgtag agatatcatg aacttcaact
             61 tagctttgtt ttccagagac tggagctaaa ctgggctttc aacatcatca tgaagtttat
      [...]
            781 tgccaaaatt ttttactata taccctggat tgaaaatgta atccaaaata actgagctgt
            841 ggcagttgtg gaccatatga cacagcttgt ccccatcgtt cacctttaga attaaatata
            901 aattaactcc tcaaaaaaaa aaaaaaaaaa aaaaaaaaaa
      //
  23. Example (the same record as on slide 22, annotated in three parts): header (LOCUS to VERSION), features (the FEATURES block), sequence (ORIGIN to //)
  24. Header
      LOCUS       NM_001001317             940 bp    mRNA    linear   PRI 27-DEC-2010
                  | name (NM_001001317), size (940 bp), molecule type (mRNA), division (PRI), modification date (27-DEC-2010)
      ACCESSION   NM_001001317
                  | accession number (unique and stable)
      SOURCE      Homo sapiens (human)
                  | organism name
        ORGANISM  Homo sapiens
                  Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                  Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
                  Catarrhini; Hominidae; Homo.
                  | taxonomy
      REFERENCE   1  (bases 1 to 940)
        AUTHORS   Bubb,K.L., Bovee,D., Buckley,D., Haugen,E., Kibukawa,M.,
                  Paddock,M., Palmieri,A., Subramanian,S., Zhou,Y., Kaul,R.,
                  Green,P. and Olson,M.V.
        TITLE     Scan of human genome reveals no new Loci under ancient balancing
                  selection
        JOURNAL   Genetics 173 (4), 2165-2177 (2006)
         PUBMED   16751668
                  | bibliographic reference
  25. Features
           gene            1..940                          | start and end of the gene
                           /gene="TRYX3"                   | gene name
                           /gene_synonym="FLJ16649; MGC35022; PRSS1; TRY1; UNQ2540"
                           /note="trypsin X3"
                           /db_xref="GeneID:136541"
                           /db_xref="HPRD:15572"           | identifiers in other databases
           CDS             110..835                        | coding sequence: start and end
                           /gene="TRYX3"
                           /gene_synonym="FLJ16649; MGC35022; PRSS1; TRY1; UNQ2540"
                           /EC_number="3.4.21.4"
                           /note="trypsin-X3"
                           /codon_start=1
                           /product="trypsin-X3 precursor" | name of the protein product
                           /protein_id="NP_001001317.1"
                           /db_xref="GI:48255915"
                           /db_xref="CCDS:CCDS5871.1"
                           /db_xref="GeneID:136541"
                           /db_xref="HPRD:15572"
                           /translation="MKFILLWALLNLTVALAFNPDYTVSSTPPYLVYLKSDYLPCAGV
                           LIHPLWVITAAHCNLPKLRVILGVTIPADSNEKHLQVIGYEKMIHHPHFSVTSIDHDI
                           MLIKLKTEAELNDYVKLANLPYQTISENTMCSVSTWSYNVCDIYKEPDSLQTVNISVI
                           SKPQCRDAYKTYNITENMLCVGIVPGRRQPCKEVSAAPAICNGMLQGILSFADGCVLR
                           ADVGIYAKIFYYIPWIENVIQNN"        | protein sequence
  26. Sequence
      ORIGIN
              1 aaggctggca aaaaggagac cagacaggag gcgtctgtag agatatcatg aacttcaact
             61 tagctttgtt ttccagagac tggagctaaa ctgggctttc aacatcatca tgaagtttat
            121 cctcctctgg gccctcttga atctgactgt tgctttggcc tttaatccag attacacagt
            181 cagctccact cccccttact tggtctattt gaaatctgac tacttgccct gcgctggagt
            241 cctgatccac ccgctttggg tgatcacagc tgcacactgc aatttaccaa agcttcgggt
            301 gatattgggg gttacaatcc cagcagactc taatgaaaag catctgcaag tgattggcta
            361 tgagaagatg attcatcatc cacacttctc agtcacttct attgatcatg acatcatgct
            421 aatcaagctg aaaacagagg ctgaactcaa tgactatgtg aaattagcca acctgcccta
            481 ccaaactatc tctgaaaata ccatgtgctc tgtctctacc tggagctaca atgtgtgtga
            541 tatctacaaa gagcccgatt cactgcaaac tgtgaacatc tctgtaatct ccaagcctca
            601 gtgtcgcgat gcctataaaa cctacaacat cacggaaaat atgctgtgtg tgggcattgt
            661 gccaggaagg aggcagccct gcaaggaagt ttctgctgcc ccggcaatct gcaatgggat
            721 gcttcaagga atcctgtctt ttgcggatgg atgtgttttg agagccgatg ttggcatcta
            781 tgccaaaatt ttttactata taccctggat tgaaaatgta atccaaaata actgagctgt
            841 ggcagttgtg gaccatatga cacagcttgt ccccatcgtt cacctttaga attaaatata
            901 aattaactcc tcaaaaaaaa aaaaaaaaaa aaaaaaaaaa
      //
      | gene sequence
  27. Remarks: extension .gbk; visualization: Artemis (http://www.sanger.ac.uk/resources/software/artemis/); the EMBL format (.embl) is similar to .gbk; Python: strings/lists + regular expressions
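As the remarks suggest, a GenBank record can be explored with strings and regular expressions. The sketch below reads the LOCUS line and rebuilds the sequence from the ORIGIN block. The names `LOCUS_RE`, `ORIGIN_RE` and `parse_genbank` are assumptions made for this example, and the embedded record is a shortened copy of the one on the slides (only two sequence lines are kept, so the parsed sequence is shorter than the 940 bp announced in LOCUS).

```python
import re

# Shortened copy of the slide record: only the first 80 bases are included.
RECORD = """\
LOCUS       NM_001001317             940 bp    mRNA    linear   PRI 27-DEC-2010
DEFINITION  Homo sapiens trypsin X3 (TRYX3), mRNA.
ACCESSION   NM_001001317
ORIGIN
        1 aaggctggca aaaaggagac cagacaggag gcgtctgtag agatatcatg aacttcaact
       61 tagctttgtt ttccagagac
//
"""

# One named group per field of the LOCUS line.
LOCUS_RE = re.compile(
    r"^LOCUS\s+(?P<name>\S+)\s+(?P<size>\d+) bp\s+(?P<mol_type>\S+)"
    r"\s+(?P<topology>\S+)\s+(?P<division>\S+)\s+(?P<date>\S+)\s*$",
    re.MULTILINE,
)
# Everything between the ORIGIN line and the closing "//".
ORIGIN_RE = re.compile(r"^ORIGIN[^\n]*\n(?P<body>.*?)^//", re.MULTILINE | re.DOTALL)


def parse_genbank(text):
    locus = LOCUS_RE.search(text)
    origin = ORIGIN_RE.search(text)
    if locus is None or origin is None:
        raise ValueError("LOCUS or ORIGIN section not found")
    info = locus.groupdict()
    info["size"] = int(info["size"])
    # Drop the position numbers and all whitespace to keep only the bases.
    info["sequence"] = re.sub(r"[\d\s]", "", origin.group("body"))
    return info


entry = parse_genbank(RECORD)
print(entry["name"], entry["size"], entry["division"], len(entry["sequence"]))
```

Comparing `entry["size"]` with `len(entry["sequence"])` is a cheap consistency check on a complete record; here the two differ only because the example is truncated on purpose.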
  28. EMBL
      ID   7 standard; DNA; HTG; 5916 BP.
      AC   chromosome:GRCh37:7:141951963:141957878:-1
      [...]
      OS   Homo sapiens (human)
      OC   Eukaryota; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia;
      OC   Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi;
      OC   Sarcopterygii; Tetrapoda; Amniota; Mammalia; Theria; Eutheria;
      OC   Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini;
      OC   Hominoidea; Hominidae; Homininae; Homo.
      [...]
      FT   gene            1..5916
      FT                   /gene=ENSG00000171147
      FT                   /locus_tag="U66059.56"
      FT                   /note="Trypsin-X3 Precursor (EC 3.4.21.4) [...]
      FT   CDS             join(352..391,2386..2524,2748..3004,5448..5587,5689..5838)
      FT                   /gene="ENSESTG00000027201"
      FT                   /protein_id="ENSESTP00000068598"
      FT                   /note="transcript_id=ENSESTT00000068598"
      FT                   /translation="MKFILLWALLNLTVALAFNPDYTVSSTPPYLVYLKSDYLPCAGVL
      FT                   IHPLWVITAAHCNLPKLRVILGVTIPADSNEKHLQVIGYEKMIHHPHFSVTSIDHDIML
      [...]
      SQ   Sequence 5916 BP; 1714 A; 1266 C; 1022 G; 1914 T; 0 other;
           AAGGCTGGCA AAAAGGAGAC CAGACAGGAG GCGTCTGTAG AGATATCATG AACTTCAACT        60
           TAGCTTTGGT ACTTTCTTCC CTGAAGACAG AGGGCAGAAC TCTGAGTTCC AGAACCATTT       120
           TCAACTGTAT TGGGGACCAA TCACTTGACT CTATTCTTGT CTCTCTGACA GATGACGCTA       180
           CACTCTCCTC TGAATAATGG ACACCATTTC TAAAACTGAA TCCTGCTACT AAAATAATTC       240
           [...]
           GTAATCCAAA ATAACTGAGC TGTGGCAGTT GTGGACCATA TGACACAGCT TGTCCCCATC      5880
           GTTCACCTTT AGAATTAAAT ATAAATTAAC TCCTCA                                5916
      //
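The CDS of this EMBL entry is given as a `join(...)` of several intervals (exons on the genomic sequence), whereas the GenBank mRNA record gives a single interval, `110..835`. A minimal sketch of how such location strings could be handled is shown below. The helper names are hypothetical, and the regular expression deliberately ignores operators such as `complement(...)` and partial markers like `<`, which a real parser would have to honor.

```python
import re


def parse_location(location):
    """Return (start, end) pairs, 1-based and inclusive, from a location string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(location):
    """Total number of positions covered by all intervals of the location."""
    return sum(end - start + 1 for start, end in parse_location(location))


def extract(sequence, location):
    """Concatenate the sub-sequences designated by the location."""
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return "".join(sequence[start - 1:end] for start, end in parse_location(location))


embl_cds = "join(352..391,2386..2524,2748..3004,5448..5587,5689..5838)"
genbank_cds = "110..835"
print(spliced_length(embl_cds), spliced_length(genbank_cds))
```

Both locations add up to 726 nucleotides (40 + 139 + 257 + 140 + 150 for the five intervals), that is 242 codons, which is consistent with the two records describing the same coding sequence.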
  29. Secondary sequence databases: UniProt, Pfam, ProSite, ...
  30. UniProt http://www.uniprot.org/
  31. Trypsin?
  32. Trypsin!
  33. Example
      ID   TRY3_HUMAN              Reviewed;         304 AA.
      AC   P35030; A9Z1Y4; P15951; Q15665; Q5VXV0; Q9UQV3;
      DT   01-FEB-1994, integrated into UniProtKB/Swiss-Prot.
      DT   14-OCT-2008, sequence version 2.
      DT   11-JAN-2011, entry version 111.
      DE   RecName: Full=Trypsin-3;
      DE            EC=3.4.21.4;
      DE   AltName: Full=Brain trypsinogen;
      DE   AltName: Full=Mesotrypsinogen;
      [...]
      CC   -!- FUNCTION: Digestive protease specialized for the degradation of
      CC       trypsin inhibitors.
      CC   -!- CATALYTIC ACTIVITY: Preferential cleavage: Arg-|-Xaa, Lys-|-Xaa.
      CC   -!- COFACTOR: Binds 1 calcium ion per subunit.
      [...]
      DR   PIR; S33496; S33496.
      DR   RefSeq; NP_002762.2; NM_002771.3.
      DR   UniGene; Hs.654513; -.
      DR   PDB; 1H4W; X-ray; 1.70 A; A=81-304.
      [...]
      FT   DISULFID    196    263
      FT   DISULFID    228    242
      FT   DISULFID    253    277
      [...]
      SQ   SEQUENCE   304 AA;  32529 MW;  4C4303C310B7BFFC CRC64;
           MCGPDDRCPA RWPGPGRAVK CGKGLAAARP GRVERGGAQR GGAGLELHPL LGGRTWRAAR
           DADGCEALGT VAVPFDDDDK IVGGYTCEEN SLPYQVSLNS GSHFCGGSLI SEQWVVSAAH
           CYKTRIQVRL GEHNIKVLEG NEQFINAAKI IRHPKYNRDT LDNDIMLIKL SSPAVINARV
           STISLPTTPP AAGTECLISG WGNTLSFGAD YPDELKCLDA PVLTQAECKA SYPGKITNSM
           FCVGFLEGGK DSCQRDSGGP VVCNGQLQGV VSWGHGCAWK NRPGVYTKVY NYVDWIKDTI
           AANS
      //
  34. Details
      ID   TRY3_HUMAN              Reviewed;         304 AA.
           | name (TRY3_HUMAN), origin: Swiss-Prot (Reviewed), size (304 AA)
      DT   01-FEB-1994, integrated into UniProtKB/Swiss-Prot.
      DT   14-OCT-2008, sequence version 2.
      DT   11-JAN-2011, entry version 111.
           | dates of integration into UniProt, of the last sequence modification, of the last entry modification
      DE   RecName: Full=Trypsin-3;
           | protein name
      DE   AltName: Full=Brain trypsinogen;
      DE   AltName: Full=Mesotrypsinogen;
      DE   AltName: Full=Serine protease 3;
      DE   AltName: Full=Serine protease 4;
      DE   AltName: Full=Trypsin III;
           | alternative names
      OS   Homo sapiens (Human).
           | organism
      OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
      OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
      OC   Catarrhini; Hominidae; Homo.
           | taxonomy
  35. Details (2)
      RN   [1]
      RP   NUCLEOTIDE SEQUENCE [MRNA] (ISOFORMS A AND B), AND VARIANT ALA-188.
      RC   TISSUE=Brain;
      RX   MEDLINE=94123994; PubMed=8294000; DOI=10.1016/0378-1119(93)90460-K;
      RA   Wiegand U., Corbach S., Minn A., Kang J., Mueller-Hill B.;
      RT   "Cloning of the cDNA encoding human brain trypsinogen and
      RT   characterization of its product.";
      RL   Gene 136:167-175(1993).
           | bibliographic reference
      CC   -!- FUNCTION: Digestive protease specialized for the degradation of
      CC       trypsin inhibitors.
      CC   -!- CATALYTIC ACTIVITY: Preferential cleavage: Arg-|-Xaa, Lys-|-Xaa.
      CC   -!- COFACTOR: Binds 1 calcium ion per subunit.
      CC   -!- SUBCELLULAR LOCATION: Secreted.
           | annotations (function, localization)
      DR   PIR; S12764; S12764.
      DR   PIR; S33496; S33496.
      DR   RefSeq; NP_002762.2; NM_002771.3.
      DR   UniGene; Hs.654513; -.
           | identifiers in other databases
      PE   1: Evidence at protein level;
           | level of confidence in the existence (expression) of the protein
  36. Details (3)
      FT   MOD_RES     211    211       Sulfotyrosine (By similarity).
      FT   DISULFID     87    217
      FT   DISULFID    105    121
      [...]
      FT   STRAND      111    117
      FT   HELIX       119    121
           | sequence annotations
      SQ   SEQUENCE   304 AA;  32529 MW;  4C4303C310B7BFFC CRC64;
           MCGPDDRCPA RWPGPGRAVK CGKGLAAARP GRVERGGAQR GGAGLELHPL LGGRTWRAAR
           DADGCEALGT VAVPFDDDDK IVGGYTCEEN SLPYQVSLNS GSHFCGGSLI SEQWVVSAAH
           CYKTRIQVRL GEHNIKVLEG NEQFINAAKI IRHPKYNRDT LDNDIMLIKL SSPAVINARV
           STISLPTTPP AAGTECLISG WGNTLSFGAD YPDELKCLDA PVLTQAECKA SYPGKITNSM
           FCVGFLEGGK DSCQRDSGGP VVCNGQLQGV VSWGHGCAWK NRPGVYTKVY NYVDWIKDTI
           AANS
           | protein sequence
      //   | end of the entry
  37. Remarks: extension .txt, also available as .xml; Python: strings/lists + regular expressions (+ xml module)
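In the UniProt text format every line starts with a two-letter code (ID, AC, FT, SQ...), so a first pass can simply group lines by that code and then apply string methods or regular expressions to each group. The sketch below assumes the usual layout (code in the first two columns, content from the sixth column on); the function name `group_by_line_code` and the shortened entry are illustrative, and the sequence is cut to its first 20 residues.

```python
import re
from collections import defaultdict

# Shortened copy of the TRY3_HUMAN entry: the sequence is truncated on purpose.
ENTRY = """\
ID   TRY3_HUMAN              Reviewed;         304 AA.
AC   P35030; A9Z1Y4; P15951; Q15665; Q5VXV0; Q9UQV3;
FT   DISULFID    196    263
FT   DISULFID    228    242
SQ   SEQUENCE   304 AA;  32529 MW;  4C4303C310B7BFFC CRC64;
     MCGPDDRCPA RWPGPGRAVK
//
"""


def group_by_line_code(text):
    """Map each two-letter line code to the list of its line contents."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of the entry
        fields[line[:2]].append(line[5:])
    return fields


fields = group_by_line_code(ENTRY)

# Accession numbers are separated by semicolons.
accessions = [acc.strip() for acc in " ".join(fields["AC"]).split(";") if acc.strip()]

# Disulfide bonds: keep the two residue positions of each DISULFID feature.
disulfides = []
for feature in fields["FT"]:
    match = re.match(r"DISULFID\s+(\d+)\s+(\d+)", feature)
    if match:
        disulfides.append((int(match.group(1)), int(match.group(2))))

# Sequence lines carry no code: their first two characters are blanks.
sequence = "".join(fields["  "]).replace(" ", "")

print(accessions[0], disulfides, sequence)
```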
  38. XML
      <?xml version='1.0' encoding='UTF-8'?>
      <uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      <entry dataset="Swiss-Prot" created="1994-02-01" modified="2011-01-11" version="111">
        <accession>P35030</accession>
        <accession>A9Z1Y4</accession>
        <accession>P15951</accession>
        <accession>Q15665</accession>
        [...]
        <dbReference type="NCBI Taxonomy" id="9606" key="2"/>
        <lineage>
          <taxon>Eukaryota</taxon>
          <taxon>Metazoa</taxon>
          <taxon>Chordata</taxon>
          [...]
        <feature type="disulfide bond">
          <location>
            <begin position="228"/>
            <end position="242"/>
        [...]
        <feature type="strand">
          <location>
            <begin position="133"/>
            <end position="137"/>
        [...]
        <sequence length="304" mass="32529" checksum="4C4303C310B7BFFC" modified="2008-10-14" version="2"
        MCGPDDRCPARWPGPGRAVKCGKGLAAARPGRVERGGAQRGGAGLELHPLLGGRTWRAAR
        DADGCEALGTVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAH
        CYKTRIQVRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARV
        STISLPTTPPAAGTECLISGWGNTLSFGADYPDELKCLDAPVLTQAECKASYPGKITNSM
        FCVGFLEGGKDSCQRDSGGPVVCNGQLQGVVSWGHGCAWKNRPGVYTKVYNYVDWIKDTI
        AANS
        </sequence>
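The XML version of the same entry can be read with the `xml.etree.ElementTree` module from the Python standard library. The sketch below works on a small, well-formed fragment written for this example (the excerpt on the slide is abridged and would not parse as is). The one point that often surprises is the default namespace declared on the root element, which must be given in every search path; the prefix `up` is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Small well-formed fragment modeled on the slide; not a complete UniProt entry.
XML_TEXT = """\
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot" created="1994-02-01" modified="2011-01-11" version="111">
    <accession>P35030</accession>
    <accession>A9Z1Y4</accession>
    <feature type="disulfide bond">
      <location>
        <begin position="228"/>
        <end position="242"/>
      </location>
    </feature>
    <feature type="strand">
      <location>
        <begin position="133"/>
        <end position="137"/>
      </location>
    </feature>
  </entry>
</uniprot>
"""

# Map an arbitrary prefix to the namespace declared on the root element.
NS = {"up": "http://uniprot.org/uniprot"}

root = ET.fromstring(XML_TEXT)
accessions = [node.text for node in root.findall("up:entry/up:accession", NS)]

bonds = []
for feature in root.findall(".//up:feature[@type='disulfide bond']", NS):
    begin = feature.find("up:location/up:begin", NS).get("position")
    end = feature.find("up:location/up:end", NS).get("position")
    bonds.append((int(begin), int(end)))

print(accessions, bonds)
```

Compared with the text format, no column counting or regular expression is needed: the structure is carried by the tags and attributes.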
  39. Menu: 1 Review, 2 Problem statement, 3 Sequences, 4 Structures, 5 A few precautions, 6 Conclusion, 7 References & image credits
  40. Protein Data Bank (PDB). Structures: DNA, RNA, proteins, viruses... X-rays, NMR, cryo-electron microscopy
  41. PDB http://www.rcsb.org/pdb/home/home.do
  42. Trypsin?
  43. Trypsin!
  44. Example
      HEADER    HYDROLASE (SERINE PROTEINASE)           26-OCT-81   2PTN
      TITLE     ON THE DISORDERED ACTIVATION DOMAIN IN TRYPSINOGEN.
      TITLE    2 CHEMICAL LABELLING AND LOW-TEMPERATURE CRYSTALLOGRAPHY
      COMPND    MOL_ID: 1;
      COMPND   2 MOLECULE: TRYPSIN;
      COMPND   3 CHAIN: A;
      COMPND   4 EC: 3.4.21.4;
      COMPND   5 ENGINEERED: YES
      SOURCE    MOL_ID: 1;
      SOURCE   2 ORGANISM_SCIENTIFIC: BOS TAURUS;
      SOURCE   3 ORGANISM_COMMON: CATTLE;
      SOURCE   4 ORGANISM_TAXID: 9913
      KEYWDS    HYDROLASE (SERINE PROTEINASE)
      EXPDTA    X-RAY DIFFRACTION
      [...]
      REMARK   2 RESOLUTION.    1.55 ANGSTROMS.
      [...]
      [...]
      ATOM    273  N   ALA A  55       6.294  11.611  25.982  1.00  9.30           N
      ATOM    274  CA  ALA A  55       6.778  12.670  25.099  1.00  9.30           C
      ATOM    275  C   ALA A  55       7.329  13.864  25.883  1.00  9.30           C
      ATOM    276  O   ALA A  55       6.747  14.218  26.934  1.00  9.30           O
      ATOM    277  CB  ALA A  55       5.636  13.154  24.190  1.00  9.30           C
      ATOM    278  N   ALA A  56       8.461  14.383  25.454  1.00  7.97           N
      ATOM    279  CA  ALA A  56       9.069  15.522  26.129  1.00  7.97           C
      ATOM    280  C   ALA A  56       8.143  16.740  26.167  1.00  7.97           C
      ATOM    281  O   ALA A  56       8.162  17.496  27.169  1.00  7.97           O
      ATOM    282  CB  ALA A  56      10.414  15.918  25.506  1.00  7.97           C
      [...]
  45. PDB: header, followed by coordinates
  46. Coordinates: PyMOL, Rasmol, VMD, ..., Python
  47. Coordinates
      ATOM    601  N   LEU A  99      10.007  19.687  17.536  1.00 12.25           N
      ATOM    602  CA  LEU A  99       9.599  18.429  18.188  1.00 12.25           C
      ATOM    603  C   LEU A  99      10.565  17.281  17.914  1.00 12.25           C
      ATOM    604  O   LEU A  99      10.256  16.101  18.215  1.00 12.25           O
      ATOM    605  CB  LEU A  99       8.149  18.040  17.853  1.00 12.25           C
      ATOM    606  CG  LEU A  99       7.125  19.029  18.438  1.00 18.18           C
      ATOM    607  CD1 LEU A  99       5.695  18.554  18.168  1.00 18.18           C
      ATOM    608  CD2 LEU A  99       7.323  19.236  19.952  1.00 18.18           C
  49. Remarks: several chains; several structures (NMR); gaps (X-ray); Python: strings (slices) + lists
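The PDB format is column-based: each field of an ATOM record sits at fixed character positions, which is why the remarks point to string slices. A minimal sketch follows, with column ranges taken from the wwPDB format description (converted to 0-based Python slices); the function name `parse_atom_line` and the dictionary keys are choices made for this example only.

```python
def parse_atom_line(line):
    """Slice one ATOM record using the fixed column positions of the PDB format."""
    return {
        "serial": int(line[6:11]),        # columns 7-11: atom serial number
        "name": line[12:16].strip(),      # columns 13-16: atom name
        "resname": line[17:20],           # columns 18-20: residue name
        "chain": line[21],                # column 22: chain identifier
        "resseq": int(line[22:26]),       # columns 23-26: residue number
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "bfactor": float(line[60:66]),    # columns 61-66
    }


# First atom of the coordinates slide, laid out in fixed-width columns.
LINE = "ATOM    601  N   LEU A  99      10.007  19.687  17.536  1.00 12.25           N"
atom = parse_atom_line(LINE)
print(atom)
```

A list of such dictionaries (one per ATOM line) is enough for most simple analyses, such as computing distances or selecting the atoms of one chain.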
  50. Several chains
      ATOM    955  CD2 TYR A 117      28.547  16.730  59.818  1.00 34.54           C
      ATOM    956  CE1 TYR A 117      26.512  14.828  59.696  1.00 34.81           C
      ATOM    957  CE2 TYR A 117      28.117  16.089  60.985  1.00 35.96           C
      ATOM    958  CZ  TYR A 117      27.100  15.139  60.917  1.00 35.42           C
      ATOM    959  OH  TYR A 117      26.673  14.515  62.069  1.00 37.14           O
      ATOM    960  OXT TYR A 117      25.735  19.061  58.351  1.00 32.81           O
      TER     961      TYR A 117
      ATOM    962  N   ARG B   3      42.047  55.053  18.876  1.00 34.90           N
      ATOM    963  CA  ARG B   3      42.680  56.307  19.383  1.00 35.03           C
      ATOM    964  C   ARG B   3      43.365  56.041  20.722  1.00 33.56           C
      ATOM    965  O   ARG B   3      42.720  55.647  21.691  1.00 33.47           O
      ATOM    966  CB  ARG B   3      41.614  57.395  19.562  1.00 37.48           C
      ATOM    967  CG  ARG B   3      40.638  57.499  18.394  1.00 41.05           C
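When a file holds several chains, the chain identifier (column 22 of each ATOM record) tells them apart, and a TER record closes each chain. As a small sketch, the atoms of each chain can be counted with a `Counter`; the excerpt below reuses a few lines of the slide, and the variable names are illustrative.

```python
from collections import Counter

# A few records from the slide: end of chain A, TER, start of chain B.
PDB_TEXT = """\
ATOM    959  OH  TYR A 117      26.673  14.515  62.069  1.00 37.14           O
ATOM    960  OXT TYR A 117      25.735  19.061  58.351  1.00 32.81           O
TER     961      TYR A 117
ATOM    962  N   ARG B   3      42.047  55.053  18.876  1.00 34.90           N
ATOM    963  CA  ARG B   3      42.680  56.307  19.383  1.00 35.03           C
"""

atoms_per_chain = Counter()
for line in PDB_TEXT.splitlines():
    # Only coordinate records are counted; TER and other records are skipped.
    if line.startswith("ATOM"):
        atoms_per_chain[line[21]] += 1

print(dict(atoms_per_chain))
```

Note that residue numbering restarts independently in each chain (117 in chain A, 3 in chain B here), so a residue is only identified unambiguously by the pair (chain, residue number).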
  51. Several structures
      MODEL        1
      ATOM      1  N   GLY A   1      11.935 -10.938   0.352  1.00  0.00           N
      ATOM      2  CA  GLY A   1      13.344 -10.643   0.600  1.00  0.00           C
      ATOM      3  C   GLY A   1      13.861  -9.576  -0.330  1.00  0.00           C
      ATOM      4  O   GLY A   1      14.929  -9.728  -0.931  1.00  0.00           O
      [...]
      ATOM    934  HB2 GLU A  60       9.981   7.744   1.905  1.00  0.00           H
      ATOM    935  HB3 GLU A  60      10.321   6.103   2.451  1.00  0.00           H
      ATOM    936  HG2 GLU A  60      12.152   6.972   3.824  1.00  0.00           H
      ATOM    937  HG3 GLU A  60      11.700   8.597   3.310  1.00  0.00           H
      TER     938      GLU A  60
      ENDMDL
      MODEL        2
      ATOM      1  N   GLY A   1      19.334  -6.988   0.864  1.00  0.00           N
      ATOM      2  CA  GLY A   1      18.296  -6.813   1.874  1.00  0.00           C
      ATOM      3  C   GLY A   1      18.000  -5.370   2.142  1.00  0.00           C
      ATOM      4  O   GLY A   1      18.677  -4.724   2.959  1.00  0.00           O
      [...]
      ATOM    934  HB2 GLU A  60      11.353   9.615  -0.439  1.00  0.00           H
      ATOM    935  HB3 GLU A  60      13.095   9.643  -0.204  1.00  0.00           H
      ATOM    936  HG2 GLU A  60      13.380  10.930  -2.203  1.00  0.00           H
      ATOM    937  HG3 GLU A  60      11.654  10.817  -2.534  1.00  0.00           H
      TER     938      GLU A  60
      ENDMDL
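NMR entries usually contain several models of the same molecule, each enclosed between MODEL and ENDMDL records, with identical atom numbering from one model to the next. The sketch below splits a file into models so that each one can be analyzed separately; `split_models` is a hypothetical helper, and the two-atom excerpt is taken from the slide.

```python
# Two models reduced to their first two atoms, taken from the slide.
PDB_TEXT = """\
MODEL        1
ATOM      1  N   GLY A   1      11.935 -10.938   0.352  1.00  0.00           N
ATOM      2  CA  GLY A   1      13.344 -10.643   0.600  1.00  0.00           C
ENDMDL
MODEL        2
ATOM      1  N   GLY A   1      19.334  -6.988   0.864  1.00  0.00           N
ATOM      2  CA  GLY A   1      18.296  -6.813   1.874  1.00  0.00           C
ENDMDL
"""


def split_models(text):
    """Return a dictionary: model number -> list of its ATOM lines."""
    models = {}
    current = None
    for line in text.splitlines():
        record = line[:6].strip()
        if record == "MODEL":
            current = int(line.split()[1])
            models[current] = []
        elif record == "ENDMDL":
            current = None
        elif record == "ATOM" and current is not None:
            models[current].append(line)
    return models


models = split_models(PDB_TEXT)
# Same atom (serial 1), different x coordinate in each model.
x_by_model = {number: float(lines[0][30:38]) for number, lines in models.items()}
print(x_by_model)
```

Forgetting about models is a classic trap: reading all ATOM lines of such a file as one structure would give every atom as many copies as there are models.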
  52. Gaps
      [...]
      ATOM   7568  CB  LYS B  72     -59.462-109.221 -72.440  1.00 31.64           C
      ATOM   7569  CG  LYS B  72     -58.524-109.915 -73.424  1.00 31.85           C
      ATOM   7570  CD  LYS B  72     -58.889-109.602 -74.868  1.00 32.02           C
      ATOM   7571  CE  LYS B  72     -58.174-110.533 -75.837  1.00 31.61           C
      ATOM   7572  NZ  LYS B  72     -58.629-110.335 -77.242  1.00 31.27           N
      ATOM   7573  N   GLY B  73     -61.309-106.416 -72.158  1.00 31.85           N
      ATOM   7574  CA  GLY B  73     -62.485-105.832 -71.510  1.00 30.84           C
      ATOM   7575  C   GLY B  73     -63.598-106.848 -71.303  1.00 29.65           C
      ATOM   7576  O   GLY B  73     -64.660-106.750 -71.920  1.00 28.85           O
      ATOM   7577  N   SER B  74     -63.354-107.820 -70.425  1.00 28.53           N
      ATOM   7578  CA  SER B  74     -64.301-108.911 -70.179  1.00 27.75           C
      ATOM   7579  C   SER B  74     -64.180-109.438 -68.754  1.00 26.72           C
      ATOM   7580  O   SER B  74     -65.113-110.041 -68.227  1.00 24.48           O
      ATOM   7581  CB  SER B  74     -64.070-110.058 -71.166  1.00 26.32           C
      ATOM   7582  OG  SER B  74     -64.505-109.716 -72.470  1.00 25.54           O
      ATOM   7583  N   GLN B  79     -62.682-105.888 -62.336  1.00 42.85           N
      ATOM   7584  CA  GLN B  79     -63.246-104.902 -63.248  1.00 42.57           C
      ATOM   7585  C   GLN B  79     -62.146-104.278 -64.103  1.00 42.60           C
      ATOM   7586  O   GLN B  79     -60.992-104.191 -63.681  1.00 42.45           O
      ATOM   7587  CB  GLN B  79     -63.996-103.819 -62.464  1.00 42.46           C
      ATOM   7588  CG  GLN B  79     -64.950-102.964 -63.300  1.00 42.30           C
      ATOM   7589  CD  GLN B  79     -66.093-103.764 -63.905  1.00 42.15           C
      ATOM   7590  OE1 GLN B  79     -66.388-104.879 -63.472  1.00 42.18           O
      ATOM   7591  NE2 GLN B  79     -66.743-103.194 -64.911  1.00 41.70           N
      ATOM   7592  N   VAL B  80     -62.514-103.846 -65.305  1.00 42.30           N
      ATOM   7593  CA  VAL B  80     -61.549-103.342 -66.275  1.00 42.03           C
      ATOM   7594  C   VAL B  80     -60.882-102.055 -65.796  1.00 42.42           C
      ATOM   7595  O   VAL B  80     -61.544-101.165 -65.260  1.00 43.09           O
      [...]
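This excerpt shows two pitfalls at once. First, residue numbering jumps from 74 to 79: residues that were not resolved in the crystal are simply absent. Second, the x and y columns touch each other (`-59.462-109.221`), so splitting lines on whitespace gives the wrong number of fields, whereas slices keep working. The sketch below detects numbering jumps from the alpha carbons; `find_gaps` is a hypothetical helper, and a jump in numbering is only a hint, since authors may also skip numbers deliberately.

```python
# Alpha carbons of residues 73, 74, 79 and 80 of chain B, taken from the slide.
PDB_LINES = [
    "ATOM   7574  CA  GLY B  73     -62.485-105.832 -71.510  1.00 30.84           C",
    "ATOM   7578  CA  SER B  74     -64.301-108.911 -70.179  1.00 27.75           C",
    "ATOM   7584  CA  GLN B  79     -63.246-104.902 -63.248  1.00 42.57           C",
    "ATOM   7593  CA  VAL B  80     -61.549-103.342 -66.275  1.00 42.03           C",
]


def find_gaps(lines):
    """Return (chain, last residue before the gap, first residue after it)."""
    gaps = []
    previous = {}  # chain identifier -> last residue number seen
    for line in lines:
        # One atom per residue is enough: keep only the alpha carbons.
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        chain, resseq = line[21], int(line[22:26])
        if chain in previous and resseq - previous[chain] > 1:
            gaps.append((chain, previous[chain], resseq))
        previous[chain] = resseq
    return gaps


gaps = find_gaps(PDB_LINES)
print(gaps)

# Whitespace splitting merges x and y into one field; slicing does not.
naive_fields = PDB_LINES[0].split()
x, y = float(PDB_LINES[0][30:38]), float(PDB_LINES[0][38:46])
print(len(naive_fields), x, y)
```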
  53. Menu: 1 Review, 2 Problem statement, 3 Sequences, 4 Structures, 5 A few precautions, 6 Conclusion, 7 References & image credits
  54. A few precautions: stay cautious with the data
  55. GenBank Z71230
      LOCUS       Z71230                   124 bp    DNA     linear   PLN 14-NOV-2006
      DEFINITION  Nicotiana tabacum chloroplast JLA region, sequence 2.
      ACCESSION   Z71230
      VERSION     Z71230.1  GI:1279604
      KEYWORDS    rpl2 gene; transfer RNA-His; trnH gene.
      SOURCE      chloroplast Nicotiana tabacum (common tobacco)
        ORGANISM  Nicotiana tabacum
                  Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
                  Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
                  asterids; lamiids; Solanales; Solanaceae; Nicotianoideae;
                  Nicotianeae; Nicotiana.
      REFERENCE   1  (bases 1 to 124)
        AUTHORS   Goulding,S.E., Olmstead,R.G., Morden,C.W. and Wolfe,K.H.
        TITLE     Ebb and flow of the chloroplast inverted repeat
        JOURNAL   Mol. Gen. Genet. 252 (1-2), 195-206 (1996)
         PUBMED   8804393
      [...]
      FEATURES             Location/Qualifiers
           source          1..124
                           /organism="Nicotiana tabacum"
                           /organelle="plastid:chloroplast"
                           /mol_type="genomic DNA"
                           /isolate="Cuban cahibo cigar, gift from President Fidel Castro"
                           /db_xref="taxon:4097"
           gene            <1..11
                           /gene="rpl2"
  56. GenBank NC_001610
      LOCUS       NC_001610              17084 bp    DNA     circular MAM 14-APR-2009
      DEFINITION  Didelphis virginiana mitochondrion, complete genome.
      ACCESSION   NC_001610
      VERSION     NC_001610.1  GI:5835037
      DBLINK      Project: 11806
      KEYWORDS    .
      SOURCE      mitochondrion Didelphis virginiana (North American opossum)
        ORGANISM  Didelphis virginiana
                  Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                  Mammalia; Metatheria; Didelphimorphia; Didelphidae; Didelphis.
      REFERENCE   1  (bases 1 to 17084)
        AUTHORS   Janke,A., Feldmaier-Fuchs,G., Thomas,W.K., von Haeseler,A. and
                  Paabo,S.
        TITLE     The marsupial mitochondrial genome and the evolution of placental
                  mammals
        JOURNAL   Genetics 137 (1), 243-256 (1994)
         PUBMED   8056314
      [...]
      FEATURES             Location/Qualifiers
           source          1..17084
                           /organism="Didelphis virginiana"
                           /organelle="mitochondrion"
                           /mol_type="genomic DNA"
                           /isolate="fresh road killed individual"
                           /db_xref="taxon:9267"
                           /tissue_type="liver"
                           /dev_stage="adult"
  57. GenBank 252544
      LOCUS       252544                   649 bp    RNA     linear   VRL 19-SEP-2002
      DEFINITION  gene 7 3' end, 5' end, segment 7 [human rotavirus, strain Wa,
                  Genomic RNA, 425 nt 2 segments].
      ACCESSION
      VERSION     GI:252544
      KEYWORDS    .
      SOURCE      Human rotavirus A
        ORGANISM  Human rotavirus A
                  Viruses; dsRNA viruses; Reoviridae; Sedoreovirinae; Rotavirus;
                  Rotavirus A.
      [...]
      FEATURES             Location/Qualifiers
           source          1..649
                           /organism="Human rotavirus A"
                           /mol_type="genomic RNA"
                           /strain="Wa"
                           /db_xref="taxon:10941"
      ORIGIN
              1 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
             61 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
            121 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
            181 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
            241 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
            301 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
            361 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
            421 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
            481 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
            541 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
            601 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnn
      //
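An entry such as this one, whose 649 positions are all `n`, is perfectly valid as a file and useless as a sequence. One possible safeguard, sketched below under the assumption that unambiguous bases are a, c, g, t and u, is to measure the fraction of other symbols before using a sequence. The function name `ambiguous_fraction` is hypothetical, and any rejection threshold built on it would be an arbitrary choice.

```python
from collections import Counter


def ambiguous_fraction(sequence, alphabet="acgtu"):
    """Fraction of symbols that are not unambiguous bases (n, other codes...)."""
    sequence = sequence.lower()
    if not sequence:
        return 0.0
    counts = Counter(sequence)
    known = sum(counts[base] for base in alphabet)
    return 1 - known / len(sequence)


rotavirus_like = "n" * 649          # same composition as GenBank entry 252544
clean = "aaggctggcaaaaaggagac"      # start of the trypsin mRNA shown earlier
print(ambiguous_fraction(rotavirus_like), ambiguous_fraction(clean))
```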
  58. PDB 7GBP, chain D, residue 67. Oops!
  59. Menu: 1 Review, 2 Problem statement, 3 Sequences, 4 Structures, 5 A few precautions, 6 Conclusion, 7 References & image credits
  60. Data: sequences, structures... (slide background: the same wall of raw DNA sequence as on slide 7)
  61. Formats and information
  62. Standards exist... but they are not always followed
  63. Think about the objects you are handling
  65. Menu: 1 Review, 2 Problem statement, 3 Sequences, 4 Structures, 5 A few precautions, 6 Conclusion, 7 References & image credits
  66. References
      J.-C. Gelly's course, "Bases de données en biologie"
      "Bioinformatics for Dummies", by J.-M. Claverie and C. Notredame
      BioStar, "Incorrect / unusual entries in main databases (GenBank, UniProt, PDB)?": http://biostar.stackexchange.com/questions/10869/incorrect-unusual-entries-in-main-databases-genbank-uniprot-pdb
  67. References (2)
      FASTA format: http://en.wikipedia.org/wiki/FASTA_format
      GenBank: http://www.ncbi.nlm.nih.gov/ (format: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html)
      UniProt: http://www.uniprot.org/ (format: http://www.uniprot.org/manual/)
      PDB: http://www.rcsb.org/pdb/home/home.do (format: http://www.wwpdb.org/documentation/format23/v2.3.html)
  68. Image credits: Squidonius (Wikimedia), Ralphbijker (Flickr), USDA/ARS, Viktorvoigt (Wikimedia), Icons-Land (Findicons), herzogbr (Flickr), Icons-Land (Findicons), PAPYRARRI (Flickr)