
DNA is Data - BioInfoSummer 2012 (Dave Adelson)

  • You share 6000 genes with a banana.
  • Transcript of "DNA is Data - BioInfoSummer 2012 (Dave Adelson)"

    1. DNA is Data. Dave Adelson (david.adelson@adelaide.edu.au), BioInfoSummer 2012, Dec. 3 2012
    2. What is Bioinformatics?
       • The mathematical, statistical and computing methods that aim to solve biological problems using DNA and protein sequences and related information.
       • My main interest is genome analysis of mammals.
    3. G-gnome vs Genome (thanks to Ernie Bailey)
    4. What is a Genome? (image courtesy NHGRI)
       • The genome is the total genetic content of the individual/cell.
       • All mammal genomes are about the same size.
       • Made up of chromosomes, each of which is a single molecule of DNA.
       • Total genome length is about 3,000,000,000 base pairs.
    5. Central paradigm of Molecular Biology: DNA → RNA → Protein → Phenotype
       • DNA bases: Guanine (G), Adenine (A), Thymine (T), Cytosine (C)
       • RNA bases: Guanine (G), Adenine (A), Uracil (U), Cytosine (C)
       • Protein: 20 amino acids, e.g. Glycine (Gly, G), Proline (Pro, P), Alanine (Ala, A), Valine (Val, V)
    6. Central paradigm of Molecular Biology
    7. Gene vs Genome
       • Each chromosome is a single, long DNA molecule.
       • Genes are the basic unit of heredity.
       • Genes are specific DNA sequences located on chromosomes.
       • The genome contains approximately 20,000 protein-coding genes.
       • These 20,000 genes make up about 2% of the genome.
    8. DNA sequences: three bases and stop codons (http://www.genome.gov/EdKit/bio2b.html)
    9. Genetic Code (http://plato.stanford.edu/entries/information-biological/GeneticCode.png)
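
To make the genetic code on slides 5 to 9 concrete, here is a minimal Python sketch that transcribes DNA to RNA and translates a coding sequence into one-letter amino acids. The function names `transcribe` and `translate` are illustrative choices, not something from the talk, and the table assumed here is the standard nuclear genetic code.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G order.
# '*' marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the RNA version of a sense-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate DNA codon by codon from position 0, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(transcribe("ATGGGT"))             # AUGGGU
print(translate("ATGGGTCCTGCTGTTTAA"))  # MGPAV
```

The example sequence encodes Met, Gly, Pro, Ala and Val, which covers the four example amino acids listed on slide 5.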
    10. Sense Strand / Antisense Strand (http://www.genome.gov/EdKit/bio2c.html)
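
The relationship between the two strands can be sketched in a few lines of Python: the antisense strand is the reverse complement of the sense strand when both are written 5' to 3'. The name `reverse_complement` is an illustrative choice, and the sketch assumes the sequence contains only A, C, G and T.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGGCC"))  # GGCCAT
```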
    11. Open reading frames (http://www.genome.gov/EdKit/bio2d.html)
    12. Reading frames (http://www.genome.gov/EdKit/bio2e.html)
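
As a rough illustration of reading frames and open reading frames, the following Python sketch scans all six frames (three per strand) for stretches that run from an ATG to the next in-frame stop codon. It is a toy under stated assumptions: only ATG counts as a start, only TAA, TAG and TGA count as stops, and the names `find_orfs` and `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, orf_sequence) for every ATG...stop stretch.

    min_codons is the minimum number of codons before the stop codon,
    counting the ATG itself.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, s[start:i + 3]))
                    start = None
    return orfs


print(find_orfs("CCATGAAATAGCC"))  # [('+', 2, 'ATGAAATAG')]
```

Real gene prediction has to cope with introns, alternative start codons and sequencing errors, so this only shows the frame bookkeeping.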
    13. Exons and Introns (http://www.genome.gov/EdKit/bio2i.html)
    14. Genes from different animals are similar (query = human actin, subject = fruit fly actin)
    15. Bioinformatics: what we do
    16. Bioinformatics: What we really do
       • Once the sequencing has been done, every other part of the process is bioinformatics.
         – Genome Assembly
         – Gene Prediction
         – Sequence Analysis
    17. Bioinformatics: Why do we do it?
       • It’s the only way to make sense of billions of base pairs of DNA sequence.
       • To understand the mechanistic basis of biological trait determination.
    18. Cost of DNA sequencing
    19. GenBank: 1982-2008
       • The number of entries in databases of gene sequences has increased exponentially.
    20. GenBank: latest release
       • On October 15 2012, Release 192.0:
         – 145,430,961,262 bases
         – 157,889,737 reported sequences
    21. Growth of GenBank (Nucleic Acids Res. 2011 Jan;39(Database issue):D32-7)
    22. Current status of genome projects (http://www.genomesonline.org/cgi-bin/GOLD/index.cgi?page_requested=Statistics)
    23. Mammalian Genes (figure: unique genes and known genes for cow, dog, man, mouse, rat, opossum and platypus)
    24. Mammal Family Tree Based on Genes
    25. Key things about DNA sequencing
       • Only short sequences can be generated (up to 1000 bp long, depending on technology).
       • A typical mammalian genome is 3 x 10^9 bp.
       • Sequencing a genome means stitching together millions of short reads.
       • To assemble reads, one must be able to identify overlap by aligning sequences.
       • Sequence alignment tools are fundamental to bioinformatics.
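
The "millions of short reads" figure follows from simple arithmetic, sketched here in Python. The function names are illustrative, and the uncovered-fraction estimate assumes reads land uniformly at random (the idealized Lander-Waterman/Poisson model), which real libraries only approximate.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, coverage=1):
    """Number of reads whose total length equals coverage x genome size."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def expected_uncovered_fraction(coverage):
    """Idealized fraction of bases hit by no read at a given average coverage."""
    return math.exp(-coverage)


genome = 3_000_000_000  # typical mammalian genome, from the slide
read_len = 1000         # upper end of read length, from the slide

print(reads_needed(genome, read_len, coverage=1))  # 3000000
print(reads_needed(genome, read_len, coverage=8))  # 24000000
print(expected_uncovered_fraction(8))              # small, but not zero
```

Even at the longest read length on the slide, a single pass over the genome needs three million reads, and more coverage is required before most bases are seen at least once.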
    26. Shotgun sequencing (ED Green (2001) Nature Reviews Genetics vol. 2 (8) pp. 573-583)
       1. Create libraries of the whole genome.
       2. Sequence millions of fragments.
       3. Look for overlap between reads.
       4. Assemble reads based on overlaps into contigs.
    27. Shotgun assembly steps
       • Remove bad sequences, trim adapters from reads.
       • Identify repeats.
       • Identify overlaps by sequence alignment (excluding repeats).
       • Build contigs from overlapping sequences.
       • Use paired-end reads to assemble contigs into scaffolds.
       • Use additional marker information to order and orient scaffolds into super-scaffolds (chromosomes).
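
The overlap and contig-building steps above can be sketched with a toy greedy assembler in Python. This is not the method used by any particular assembler: it assumes error-free reads from one strand, uses exact suffix/prefix matching in place of real alignment, and the names `overlap`, `greedy_assemble` and `min_len` are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

Reads with no sufficient overlap stay as separate contigs, which is where paired-end and marker information come in for ordering and orienting them.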
    28. Shotgun sequencing problems (ED Green (2001) Nature Reviews Genetics vol. 2 (8) pp. 573-583)
       • Leaves gaps.
       • Contigs have to be ordered and oriented.
       • Occasional misassembled contigs.
    29. Old style paired end libraries
       • To take advantage of information from paired ends, multiple libraries are made:
         – Small insert ~2kb (plasmid)
         – Medium insert ~10kb (plasmid)
         – Large insert ~40kb (fosmid)
       • Tight control of insert size is paramount. Use random shearing, not restriction digest, to generate inserts. Small inserts may well be sequenced through with overlap from ends.
    30. Scaffold (long range) assembly (E Myers et al. (2000) Science, v287, p2196)
    31. Current shotgun sequencing
    32. Repeat sequences cause problems
    33. “Junk DNA”, an unfortunate choice of words
       • Used to describe the mostly repetitive DNA between genes.
       • http://www.junkdna.com/ohno.html
    34. LINEs and SINEs (Adelson GENE3111/3110)
       • These are typical of Eukaryotes, in particular mammals.
       • Intact autonomous elements are about 6kb long.
       • Non-autonomous truncated (SINE) elements that share the same tail make use of the autonomous elements' insertion machinery.
    35. Retrotransposition (figure labels: Cell, Cytoplasm, Nucleus)
       • Retrotransposons are ancient, retroviral-like pieces of DNA that copy themselves around the genome.
       • They cannot “infect” other individuals or cells because they lack key components that viruses have.
    36. Repeats and genome assembly
       • Repeats can align to many places in the genome.
       • Many repeats are longer than the sequence reads produced by current sequencers.
       • To avoid many-to-many mapping, leading to incorrect contig assembly, repeats must be identified and masked prior to alignment.
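
A small Python sketch can show both halves of this problem: a repeat-derived read has more than one valid placement, and masking hides repeated sequence from the aligner. This is only a toy that soft-masks (lowercases) any k-mer seen more than once in the same sequence; production repeat maskers rely on curated repeat libraries instead, and the names `placements`, `mask_repeats` and `k` are hypothetical.

```python
from collections import Counter


def placements(genome, read):
    """All start positions where the read matches the genome exactly."""
    return [i for i in range(len(genome) - len(read) + 1)
            if genome.startswith(read, i)]


def mask_repeats(seq, k=5):
    """Lowercase every base covered by a k-mer that occurs more than once."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    masked = list(seq)
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k]] > 1:
            for j in range(i, i + k):
                masked[j] = masked[j].lower()
    return "".join(masked)


genome = "AAGGCTTTAAGGCCC"
print(placements(genome, "AAGGC"))  # [0, 8]: the read is ambiguous
print(mask_repeats(genome))         # aaggcTTTaaggcCC
```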
    37. Repeats affect anything requiring alignment.
       • Any sequence data needing to be aligned must have repeats masked.
         – Transcriptome data
         – Structural variation (SV)/mutation mapping
    38. Resequencing is the norm
       • Sequencing of patient samples to determine mutations underlying disease.
       • Must be able to detect a range of mutation events (of various sizes).
       • Applies to germ line mutations/variations or somatic mutations/variations (i.e. cancer).
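
To illustrate what detecting a range of mutation events means at the smallest scale, here is a minimal Python sketch that compares a reference with a sample and reports substitutions, insertions and deletions. It assumes the two sequences have already been aligned, with `-` marking gaps, and it ignores larger structural variants; the function name `call_variants` and the output format are hypothetical.

```python
def call_variants(ref_aln, sample_aln):
    """List differences between two pre-aligned sequences.

    Each variant is (reference_position, kind, ref_base, sample_base).
    Positions are 1-based in the ungapped reference; an insertion is
    reported at the position of the reference base just before it.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have the same length")
    variants = []
    ref_pos = 0
    for r, s in zip(ref_aln.upper(), sample_aln.upper()):
        if r != "-":
            ref_pos += 1
        if r == s:
            continue
        if r == "-":
            variants.append((ref_pos, "insertion", "-", s))
        elif s == "-":
            variants.append((ref_pos, "deletion", r, "-"))
        else:
            variants.append((ref_pos, "substitution", r, s))
    return variants


for variant in call_variants("ACGT-ACGT", "ACCTTAC-T"):
    print(variant)
# (3, 'substitution', 'G', 'C')
# (4, 'insertion', '-', 'T')
# (7, 'deletion', 'G', '-')
```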
    39. Different classes of mutation operating in the human genome. (Freeman J L et al. Genome Res. 2006;16:949-961. Copyright © 2006, Cold Spring Harbor Laboratory Press)
    40. Genome resequencing for SV (http://www.sciencemag.org/cgi/content/full/318/5849/420)
    41. Summary and Challenges Ahead
       • DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore’s law (the rate at which computing gets faster and cheaper).
       • The ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data.
       • Data handling is now the bottleneck.
       • It costs more to analyze a genome than to sequence a genome.
       • The cost of sequencing a human genome (all three billion bases of DNA in a set of human chromosomes) plunged to under $10,000 this year, from $8.9 million in July 2007.
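
The comparison with Moore's law can be made roughly quantitative with a short Python calculation of how often the sequencing cost halved. The cost figures come from the slide; the 65-month span (July 2007 to the December 2012 talk date) and the 24-month halving period often quoted for Moore's law are assumptions, and `halving_time_months` is an illustrative name.

```python
import math


def halving_time_months(start_cost, end_cost, elapsed_months):
    """Average number of months it took for the cost to halve."""
    halvings = math.log2(start_cost / end_cost)
    return elapsed_months / halvings


# $8.9 million in July 2007 to about $10,000 in 2012 (figures from the slide).
sequencing = halving_time_months(8_900_000, 10_000, elapsed_months=65)
moore = 24  # assumed: one halving of computing cost roughly every two years

print(f"sequencing cost halved about every {sequencing:.1f} months")
print(f"that is roughly {moore / sequencing:.1f} times faster than Moore's law")
```

Under these assumptions the cost halved about every six to seven months, several times faster than the assumed Moore's law pace.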
    42. Summary and Challenges Ahead
       • Storage of and access to data cause issues.
         – Not all data are in GenBank or in a format that can be easily accessed.
       • Demand from the health care system for tools to visualize, understand and interpret patient genomic data.
    43. Biggest driver for bioinformatics