Sucheta Tripathy,
           27th September 2012
https://sites.google.com/site/suchetalab/
   Introduction.
   History of Genome Sequencing.
   Rationale behind genome sequencing.
   How genomes are sequenced.
   What happens next.
    ◦ Assembly and Annotation.
    ◦ Sequence Submissions.
   Microbial Genome Sequencing.
   Human Genome Project.
    ◦ ENCODE Project.
    ◦ 1000 Genomes Project.
   Gene + Chromosome -> Genome
    ◦ DNA: A/T/G/C
    ◦ RNA: A/U/G/C
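The two alphabets above differ in a single letter: RNA uses U where DNA uses T. A minimal Python sketch of that substitution, assuming the input is the coding strand written 5' to 3'. The function name `transcribe` is illustrative and does not come from the slides.

```python
def transcribe(dna: str) -> str:
    """Return the RNA spelling of a DNA coding-strand sequence (T -> U)."""
    dna = dna.upper()
    # Reject anything outside the four-letter DNA alphabet.
    unknown = set(dna) - set("ATGC")
    if unknown:
        raise ValueError(f"unexpected letters: {sorted(unknown)}")
    return dna.replace("T", "U")


print(transcribe("ATGGCTTAA"))  # AUGGCUUAA
```

This only respells the sequence; it does not model the template strand or any biological processing.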
   Determining the order of the billions of chemical
    units that build the genetic material.
    ◦ The secrets of life are locked up in the order of the 4
      letters!
    ◦ 5-100 million living species?
Organism                   Year   Institute                                 Genome Size
Bacteriophage MS2          1976   Walter Fiers at the University of Ghent   3569 bp
Phage Φ-X174               1977   Fred Sanger, Cambridge                    5386 bp
Haemophilus influenzae     1995   TIGR                                      1,830,138 bp
Saccharomyces cerevisiae   1996   European Effort                           12,495,682 bp (16 chromosomes)
Human Genome Project       2000   Multiple Organizations                    3.3 x 10^9 bp (3 billion letters)
   Eukaryotes [2231]
    ◦   Animal
    ◦   Fungi
    ◦   Plants
    ◦   Protists
    ◦   Others
 Prokaryotes [14268]
 Viruses [3219]
Ref: http://www.ncbi.nlm.nih.gov/genome/browse/
   Sanger dideoxy sequencing method (1977)
   Maxam-Gilbert chemical degradation method (1977)
   Two labs that owned automated sequencers:
    1. Leroy Hood at Caltech, 1986 (commercialized by
       Applied Biosystems)
    2. Wilhelm Ansorge at EMBL, 1986 (commercialized by
       Pharmacia-Amersham and GE Healthcare)
    3. Hypoxanthine-guanine phosphoribosyltransferase
       (HGPRT), Alu sequences
    4. Hitachi Laboratory developed a high-throughput
       capillary array sequencer, 1996.
    5. 1991: a patent filed by EMBL on media-less,
       solid-support-based sequencing.
   454 sequencing methods (2006)
    ◦ Principles of pyrophosphate detection (1985, 1988)
   Illumina (Solexa) genome sequencing methods (2007)
   Applied Biosystems ABI SOLiD System (2007)
   Helicos single-molecule sequencing (HeliScope, 2007)
   Pacific Biosciences single-molecule real-time (SMRT)
    technology, 2010
   Sequenom for nanotechnology-based sequencing.
   BioNanomatrix nanofluidics
   RNAP technology
http://www.ncbi.nlm.nih.gov/books/NBK20261/
http://www.springerimages.com/Images/Biomedicine/1-10.1007_s12575-009-9004-1-1
http://en.wikipedia.org/wiki/Sequence_assembly
   Gene Prediction
   Comparative Genomics
   Ortholog search
   BLAST analysis
   Functional Categories
http://www.genomesonline.org/cgi-bin/GOLD/index.cgi
http://www.insdc.org/
http://www.ebi.ac.uk/embl/Contact/collaboration.html
   JGI – IMG [http://img.jgi.doe.gov/]
   Broad
   TIGR
   WashU
   VBI at Virginia Tech
NHGRI
• October 1990: Human Genome Project started
• 2000: First publication
• 2003: Finished paper
• Pilot proposals solicited for ENCODE
• 2005: GWAS - 90% lies outside coding regions
• RFAs were sought for full ENCODE
• 2007: First report on ENCODE published
• 2012: ENCODE published
http://www.youtube.com/watch?v=N4i6lYfYQzY
What we knew
• 95% of the genome is “junk”.
   – 2.94% of the genome is coding
• Cis-regulatory elements occur within a
  limited genomic distance.
• Most of the genome consists of transposable
  elements of obscure origin that are
  dying out.
• Transcribed elements are more often
  translated than not.
Encyclopedia of DNA Elements
Some of the useful links:
• http://www.nature.com/encode/
• http://www.encodeproject.org/ENCODE/
• http://www.factorbook.org/
• http://encodeproject.org/ENCODE/dataStandards.html
• http://1000genomes.org
• http://genome.ucsc.edu/ENCODE/
http://www.gencodegenes.org/data.html
http://www.nature.com/nature/journal/v489/n7414/full/489049a.html
Key Findings:
• 80% of the human genome is active!
  – 70,000 promoters and 400,000 enhancers
• 75% of the genome is transcribed in some tissue
  or other during a lifetime.
• The environment plays a great role in switching
  many genes on or off. [Epigenetics]
• Most diseases lie not with the genes
  but with the switches!
• The dark matter controlling the genes is
  physically close to the genes it controls.
Key Findings:
• Genes and switches do not have a one-to-one
  relationship!
• 4 million switches control 21,000
  genes!
• Identical twins are NOT identical:
  they are greatly influenced by environment.
• Astronomy and genetic biology look
  similar (95% of the Universe is called
  dark matter, which we don't understand)
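As a rough check on the "4 million switches, 21,000 genes" figure, the average number of switches per gene can be worked out directly. This is a back-of-the-envelope sketch in Python using only the two numbers quoted on the slide. It assumes nothing about how switches are actually distributed across genes.

```python
switches = 4_000_000  # regulatory "switches" quoted on the slide
genes = 21_000        # genes quoted on the slide

# Simple average; real genes may have far more or far fewer switches.
switches_per_gene = switches / genes
print(round(switches_per_gene))  # 190
```

An average of roughly 190 switches per gene is one way to see why the relationship cannot be one-to-one.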
Copy Number Variation
SNPs
Indels
   http://en.wikipedia.org/wiki/1000_Genomes_Project
Yoruba in Ibadan, Nigeria; Japanese in Tokyo; Chinese in Beijing; Utah
residents with ancestry from northern and western Europe; Luhya in
Webuye, Kenya; Maasai in Kinyawa, Kenya; Toscani in Italy; Peruvians in
Perú; Gujarati Indians in Houston; Chinese in metropolitan Denver; people
   To study the environment and its
    effects on diseases.
   99.5% of DNA is similar between individuals.
   269 individuals genotyped.
   One million SNPs genotyped
    ◦ Rose to 10 million including polymorphic sites.
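To put the 99.5% similarity figure in absolute terms, the remaining 0.5% can be converted into base pairs. This is a rough Python sketch that assumes the human genome size of 3.3 x 10^9 bp given in the genome-size table earlier in these slides. The constant and variable names are illustrative.

```python
GENOME_SIZE_BP = 3_300_000_000  # 3.3 x 10^9 bp, assumed from the earlier table

# 99.5% similar means 0.5% differs; integer math (5 per 1000) avoids float rounding.
differing_bp = GENOME_SIZE_BP * 5 // 1000
print(f"{differing_bp:,}")  # 16,500,000
```

On these assumptions, two individuals could differ at up to about 16.5 million positions, which gives a sense of scale next to the 1 to 10 million SNPs genotyped.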

Genome sequencing projects
