Introduction to Bioinformatics Victor Jin Department of Biomedical Informatics Ohio State University
Transcript of "Introduction to Bioinformatics Victor Jin"

  1. Introduction to Bioinformatics. Victor Jin, Department of Biomedical Informatics, Ohio State University
  2. What is Bioinformatics? bio·in·for·mat·ics: the collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied in molecular genetics and genomics. Source: Merriam-Webster's Medical Dictionary, © 2002 Merriam-Webster, Inc.
  3. Myth 1: Bioinformatics is about genomics
     - Nucleotide: DNA, RNA, …
     - Genome: sequences, chromosomes, expression data, …
     - Protein: sequences, 3-D structure, interactions, …
     - System: gene networks, protein networks, TFs, …
     - Other: mass spectrometry, microarrays, images, lab records, journals, literature, …
     - The goal is to understand how the system works.
  4. Myth 2: Data vs. Information
     - Data
       - Nucleotide: DNA, RNA, …
       - Genome: sequences, chromosomes, expression data, …
       - Protein: sequences, 3-D structure, interactions, …
       - System: gene networks, protein networks, TFs, …
       - Other: mass spectrometry, microarrays, images, lab records, journals, literature, …
     - Information
       - Genotype
       - Phenotype
       - Genotype-phenotype relationship
       - SNPs
       - Pathways
       - Drug targets
     - Getting data is "easy"; extracting information is hard!
  5. Myth 3: The computer is intelligent
     - Pros
       - Repeated work
       - Accurate storage
       - Precise computation
       - Fast communication
       - …
     - Cons
       - Cannot generalize
       - No real intelligence
       - …
     - The results must be reviewed and validated by biologists. In addition, biologists must have some understanding of how a computer processes data (algorithms). That is why we need to learn bioinformatics.
  6. Biology – Bioinformatics Bioinformatics
  7. High-throughput techniques
     - DNA sequencing
     - 1970s: Nobel Prize
     - 1980s: Ph.D. thesis
     - Early 1990s: major research projects
     - Late 1990s to now: $20
  8. Human Genome Project. The Beginning (1988), Cold Spring Harbor Laboratory, Long Island, New York
  9. Initial Analysis of the Human Genome
  10. What information do we want to extract?
      - Total genetic difference (# of bases) is 4%: 35 million single-base substitutions plus 5 million insertions or deletions (indels).
      - The average protein differs by only two amino acids, and 29% of proteins are identical.
      - Genotype – Phenotype relationship!
      - Source: Science, 9/2/2005
  11. Phenotype
      - mRNA level
      - Protein expression
      - Protein structure
      - Cell morphology
      - Tissue morphology
      - System physiological functions
      - Behavior
      - …
  12. High-throughput techniques
      - High-throughput protein crystallization
      - Mass spectrometry
      - Microarray
      - High-throughput cell imaging
      - High-throughput in vivo screening
      - …
  13. How to extract the information?
      - Computational tools
      - Building the databases
      - Performing analysis / extracting features
      - Data fusion / integration
      - Data mining / classification / statistical learning
      - Visualization / representation
      - Result: biological information!
  14. What we are going to do:
      - Search the databases
      - Perform analysis
      - Present output
      - Be a salient user!
  15. What does the scope of bioinformatics teaching cover?
      - Genomics
      - Proteomics
      - Microarray analysis
      - Other aspects
        - Ontology
      - Machine learning / statistical analysis
      - Visualization
      - Data sources (databases)
      - Available tools
      - Major issues in using the databases and tools
      - Other resources
  16. Review of Biology: Central dogma
  17. Review of Biology: Operon
  18. Review of Biology: mRNA, cDNA, exon, intron
  19. Review of Biology: Protein folding and structure
  20. Databases
      - GenBank: www.ncbi.nlm.nih.gov/GenBank/
      - EMBL: www.ebi.ac.uk/embl/
      - DDBJ: www.ddbj.nig.ac.jp
      - Synchronized daily.
      - Accession numbers are managed in a consistent way.
      - AceDB
      - DDJP DNA
      - JJPID
      - MIPS
      - PHRED
      - PIR
      - PROSITE
      - RDP
      - TIGR
      - UNIGENE
      - …
  21. Resources
      - Local:
        - OSU library
      - Web:
        - PubMed
        - JSTOR ( http://www.jstor.com )
        - http://www.expasy.org
        - http://www.genecards.org
        - http://www.pathguide.org/
  22. PubMed – Entrez
      - PubMed: http://www.pubmed.gov , http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
      - PubMed training: http://www.nlm.nih.gov/bsd/disted/pubmed.html
      - Entrez: http://www.ncbi.nlm.nih.gov/Database/index.html
      - Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.
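An Entrez text search can also be expressed as a URL rather than typed into the web form. The following is a minimal sketch, assuming NCBI's E-utilities `esearch` endpoint (which is not mentioned on the slides); the helper name `build_esearch_url` and the `retmax` default are hypothetical choices for illustration. The code only builds the URL and sends no request.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities esearch endpoint, not named on the slides.
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an esearch URL for a text query against one Entrez database.

    db     -- Entrez database name, e.g. "pubmed" or "nucleotide"
    term   -- the query text, optionally with a field tag such as [ACCN]
    retmax -- maximum number of record IDs to request (illustrative default)
    """
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH_BASE}?{query}"


if __name__ == "__main__":
    # Literature search for the gene used in the slide examples.
    print(build_esearch_url("pubmed", "E2F3"))
    # Nucleotide search restricted to the accession field.
    print(build_esearch_url("nucleotide", "M12345[ACCN]"))
```

Square brackets in a field tag are percent-encoded by `urlencode`, so the query survives being placed in a URL.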
  23. Entrez Databases
  24. Literature
      - Examples:
        - E2F3
        - Retinoblastoma
      - Constraints: automatic vs. manual
      - Save: tutorial at http://www.nlm.nih.gov/bsd/viewlet/myncbi/saving_searches.swf
  25. Literature
  26. Literature
  27. Literature
      - Examples:
        - E2F3
        - Retinoblastoma
      - Constraints: automatic vs. manual
  28. Literature
  29. Nucleotide
      - Gene
      - Genome
      - Sequence
      - mRNA
      - cDNA
      - SNP
      - ESTs (expressed sequence tags) / UniGene
      - Name
      - Accession number
      - GI number
      - Version number
      - Alias
  30. Accession number, GI number, Version
      - Accession number (GenBank): The accession number is the unique identifier assigned to the entire sequence record when the record is submitted to GenBank. The GenBank accession number is a combination of letters and numbers, usually in the format of one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456).
      - The accession number for a particular record will not change even if the author submits a request to change some of the information in the record. Note that an accession number is a unique identifier for a complete sequence record, while a Sequence Identifier, such as a Version, GI, or ProteinID, is an identification number assigned just to the sequence data. The NCBI Entrez System is searchable by accession number using the Accession [ACCN] search field.
      - GI (GenBank): A GI or "GenInfo Identifier" is a sequence identifier that can be assigned to a nucleotide sequence or protein translation. Each GI is a numeric value of one or more digits. The protein translation and the nucleotide sequence contained in the same record will each be assigned different GI numbers.
      - Every time the sequence data for a particular record is changed, its version number increases and it receives a new GI. However, while each new version number is based upon the previous version number, a new GI for an altered sequence may be completely different from the previous GI. For example, in the GenBank record M12345, the original GI might be 7654321, but after a change in the sequence is submitted, the new GI for the changed sequence could be 10529376. Individuals can search for nucleotide sequences and protein translations by GI using the UID search field in the NCBI sequence databases.
      - GI number is NOT GeneID.
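The identifier rules above can be checked mechanically. The sketch below is illustrative only: the regular expression covers just the two accession shapes named on the slide (one letter plus five digits, two letters plus six digits), so other valid identifiers such as the RefSeq-style `NM_000240` will not match it. The function names are hypothetical, and the `accession.version` split assumes the dotted form seen in headers like `NM_000240.2`.

```python
import re

# Only the two "usual" GenBank shapes described on the slide:
# one letter + five digits (M12345) or two letters + six digits (AC123456).
CLASSIC_ACCESSION = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def is_classic_genbank_accession(text):
    """True if text matches one of the two formats named on the slide."""
    return bool(CLASSIC_ACCESSION.match(text))


def split_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    Returns (identifier, None) when there is no numeric version suffix.
    """
    accession, dot, version = identifier.rpartition(".")
    if not dot or not version.isdigit():
        return identifier, None
    return accession, int(version)


if __name__ == "__main__":
    for candidate in ("M12345", "AC123456", "7654321"):
        print(candidate, is_classic_genbank_accession(candidate))
    print(split_accession_version("NM_000240.2"))
    print(split_accession_version("M12345"))
```

A purely numeric string such as `7654321` fails the accession check, which matches the slide's point that a GI is a separate, all-digit identifier.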
  31. Example: E2F3
  32. Example: E2F3
  33. Data Format: FASTA (.fasta file)

      >gi|33469954|ref|NM_000240.2| Homo sapiens monoamine oxidase A (MAOA), nuclear gene encoding mitochondrial protein, mRNA
      GGGCGCTCCCGGAGTATCAGCAAAAGGGTTCGCCCCGCCCACAGTGCCCGGCTCCCCCCGGGTATCAAAA
      GAAGGATCGGCTCCGCCCCCGGGCTCCCCGGGGGAGTTGATAGAAGGGTCCTTCCCACCCTTTGCCGTCC
      CCACTCCTGTGCCTACGACCCAGGAGCGTGTCAGCCAAAGCATGGAGAATCAAGAGAAGGCGAGTATCGC
      GGGCCACATGTTCGACGTAGTCGTGATCGGAGGTGGCATTTCAGGACTATCTGCTGCCAAACTCTTGACT
      GAATATGGCGTTAGTGTTTTGGTTTTAGAAGCTCGGGACAGGGTTGGAGGAAGAACATATACTATAAGGA
      ATGAGCATGTTGATTACGTAGATGTTGGTGGAGCTTATGTGGGACCAACCCAAAACAGAATCTTACGCTT
      GTCTAAGGAGCTGGGCATAGAGACTTACAAAGTGAATGTCAGTGAGCGTCTCGTTCAATATGTCAAGGGG
      AAAACATATCCATTTCGGGGCGCCTTTCCACCAGTATGGAATCCCATTGCATATTTGGATTACAATAATC
      TGTGGAGGACAATAGATAACATGGGGAAGGAGATTCCAACTGATGCACCCTGGGAGGCTCAACATGCTGA
      CAAATGGGACAAAATGACCATGAAAGAGCTCATTGACAAAATCTGCTGGACAAAGACTGCTAGGCGGTTT
      GCTTATCTTTTTGTGAATATCAATGTGACCTCTGAGCCTCACGAAGTGTCTGCCCTGTGGTTCTTGTGGT
      ATGTGAAGCAGTGCGGGGGCACCACTCGGATATTCTCTGTCACCAATGGTGGCCAGGAACGGAAGTTTGT
      AGGTGGATCTGGTCAAGTGAGCGAACGGATAATGGACCTCCTCGGAGACCAAGTGAAGCTGAACCATCCT
      GTCACTCACGTTGACCAGTCAAGTGACAACATCATCATAGAGACGCTGAACCATGAACATTATGAGTGCA
      AATACGTAATTAATGCGATCCCTCCGACCTTGACTGCCAAGATTCACTTCAGACCAGAGCTTCCAGCAGA
      GAGAAACCAGTTAATTCAGCGGCTTCCAATGGGAGCTGTCATTAAGTGCATGATGTATTACAAGGAGGCC
      TTCTGGAAGAAGAAGGATTACTGTGGCTGCATGATCATTGAAGATGAAGATGCTCCAATTTCAATAACCT
      TGGATGACACCAAGCCAGATGGGTCACTGCCTGCCATCATGGGCTTCATTCTTGCCCGGAAAGCTGATCG
      ACTTGCTAAGCTACATAAGGAAATAAGGAAGAAGAAAATCTGTGAGCTCTATGCCAAAGTGCTGGGATCC
      CAAGAAGCTTTACATCCAGTGCATTATGAAGAGAAGAACTGGTGTGAGGAGCAGTACTCTGGGGGCTGCT
      ACACGGCCTACTTCCCTCCTGGGATCATGACTCAATATGGAAGGGTGATTCGTCAACCCGTGGGCAGGAT
      TTTCTTTGCGGGCACAGAGACTGCCACAAAGTGGAGCGGCTACATGGAAGGGGCAGTTGAGGCTGGAGAA
      CGAGCAGCTAGGGAGGTCTTAAATGGTCTCGGGAAGGTGACCGAGAAAGATATCTGGGTACAAGAACCTG
      …

      >gi|4557735|ref|NP_000231.1| monoamine oxidase A [Homo sapiens]
      MENQEKASIAGHMFDVVVIGGGISGLSAAKLLTEYGVSVLVLEARDRVGGRTYTIRNEHVDYVDVGGAYV
      GPTQNRILRLSKELGIETYKVNVSERLVQYVKGKTYPFRGAFPPVWNPIAYLDYNNLWRTIDNMGKEIPT
      DAPWEAQHADKWDKMTMKELIDKICWTKTARRFAYLFVNINVTSEPHEVSALWFLWYVKQCGGTTRIFSV
      TNGGQERKFVGGSGQVSERIMDLLGDQVKLNHPVTHVDQSSDNIIIETLNHEHYECKYVINAIPPTLTAK
      IHFRPELPAERNQLIQRLPMGAVIKCMMYYKEAFWKKKDYCGCMIIEDEDAPISITLDDTKPDGSLPAIM
      GFILARKADRLAKLHKEIRKKKICELYAKVLGSQEALHPVHYEEKNWCEEQYSGGCYTAYFPPGIMTQYG
      RVIRQPVGRIFFAGTETATKWSGYMEGAVEAGERAAREVLNGLGKVTEKDIWVQEPESKDVPAVEITHTF
      WERNLPSVSGLLKIIGFSTSVTALGFVLYKYKLLPRS
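To make the FASTA layout concrete, here is a minimal parsing sketch. It assumes the simple form shown on the slide: a header line starting with `>` followed by one or more sequence lines. The function names are hypothetical, the sample sequence is deliberately truncated to two short lines for illustration, and the header splitter only handles the `tag|value|tag|value|` style seen in the slide's example.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_ncbi_header(header):
    """Split 'gi|123|ref|ACC.1| description' into ({tag: value}, description)."""
    ident, _, description = header.partition(" ")
    fields = ident.strip("|").split("|")
    return dict(zip(fields[0::2], fields[1::2])), description


# Header taken from the slide; the sequence is truncated for illustration.
SAMPLE = """>gi|33469954|ref|NM_000240.2| Homo sapiens monoamine oxidase A (MAOA), nuclear gene encoding mitochondrial protein, mRNA
GGGCGCTCCC
GGAGTATCAG
"""

if __name__ == "__main__":
    for header, sequence in parse_fasta(SAMPLE):
        ids, description = split_ncbi_header(header)
        print(ids, description, len(sequence))
```

Sequence lines are joined without separators, so the line width used in the file does not affect the parsed sequence.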
  34. Data Format: other formats
      - NBRF/PIR (.pir file): begins with ">P1;" for a protein sequence and ">N1;" for a nucleotide sequence.
      - GDE (.gde file): similar to a FASTA file, but begins with "%" instead of ">".
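The leading characters described above are enough for a rough format guess. This is a sketch under the assumption that only the markers named on the slides are in play (`>P1;` and `>N1;` for NBRF/PIR, `%` for GDE, a bare `>` for FASTA); real files may use other markers, and the function name is hypothetical.

```python
def guess_sequence_format(text):
    """Guess a sequence file format from its first non-blank line."""
    first = next((ln.strip() for ln in text.splitlines() if ln.strip()), "")
    # Check the PIR prefixes before the generic FASTA ">" prefix,
    # because PIR headers also start with ">".
    if first.startswith((">P1;", ">N1;")):
        return "NBRF/PIR"
    if first.startswith("%"):
        return "GDE"
    if first.startswith(">"):
        return "FASTA"
    return "unknown"


if __name__ == "__main__":
    print(guess_sequence_format(">P1;EXAMPLE\nexample protein\nMENQEK*\n"))
    print(guess_sequence_format("%example\nMENQEK\n"))
    print(guess_sequence_format(">gi|4557735|ref|NP_000231.1| monoamine oxidase A\nMENQEK\n"))
```

The order of the checks matters: testing for `>` first would label every NBRF/PIR file as FASTA.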
  35. Protein Databases
      - UniProt is the universal protein database, a central repository of protein data created by combining Swiss-Prot, TrEMBL and PIR. This makes it the world's most comprehensive resource on protein information.
      - The Protein Information Resource (PIR), located at Georgetown University Medical Center (GUMC), is an integrated public bioinformatics resource to support genomic and proteomic research, and scientific studies.
      - Swiss-Prot is a curated biological database of protein sequences from different species, created in 1986 by Amos Bairoch during his PhD and developed by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute.
      - Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.
      - PDB
      - NCBI
      - http://proteome.nih.gov/links.html
  36. PubMed – Protein Databases
      - The Protein database contains sequence data from the translated coding regions of DNA sequences in GenBank, EMBL, and DDBJ, as well as protein sequences submitted to the Protein Information Resource (PIR), SWISS-PROT, the Protein Research Foundation (PRF), and the Protein Data Bank (PDB) (sequences from solved structures).
      - The Structure database, or Molecular Modeling Database (MMDB), contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB). The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy.
      - Use Cn3D, the NCBI 3D structure viewer, for easy interactive visualization of molecular structures from Entrez.
      - Tutorial: http://www.pdb.org/pdbstatic/tutorials/tutorial.html
  37. Example – UniProt – Expasy
      - http://www.uniprot.org/
      - http://www.expasy.org/
  38. Example – UniProt – Expasy
  39. Example – UniProt – Expasy
  40. Example – UniProt – Expasy
  41. Example – UniProt – Expasy
  42. Annotation – Visualization: UCSC Genome Browser ( http://genome.ucsc.edu/ )