2. DNA SEQUENCING
• DNA sequencing is the process of
determining the precise order of
nucleotides within a DNA molecule.
• It includes any method or technology that
is used to determine the order of the four
bases—adenine, guanine, cytosine, and
thymine—in a strand of DNA.
4. MAXAM & GILBERT METHOD
• A. M. Maxam and W. Gilbert, 1977
• Chemical sequencing
• Treatment of DNA with certain chemicals → DNA is cut into fragments → monitoring of sequences
6. SANGER METHOD
• Most common approach used for DNA sequencing.
• Invented by Frederick Sanger, 1977
• Nobel Prize, 1980
• Also termed the chain termination or dideoxy method
7. SANGER METHOD
• The chain termination reaction
• Dideoxynucleotide triphosphates (ddNTPs) act as chain terminators
• ddNTPs have an H on the 3′ carbon of the deoxyribose sugar (dNTPs normally have an OH there)
• ssDNA + addition of dNTPs → elongation
• ssDNA + addition of ddNTPs → elongation stops
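The chain termination idea can be illustrated with a toy model. This is only a sketch, not a simulation of the real reaction chemistry: it assumes the template is written 3′→5′, that a ddNTP can be incorporated at every position, and the function names `terminated_fragments` and `read_sequence` are made up for this illustration.

```python
# Toy model of chain termination (illustrative only).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def terminated_fragments(template: str) -> list[str]:
    """Return every chain-terminated product for a template written 3'->5'.

    The new strand is the base-by-base complement of the template. If a ddNTP
    is incorporated at position i, elongation stops there, so the product is
    the first i + 1 bases of the new strand.
    """
    new_strand = "".join(COMPLEMENT[base] for base in template)
    return [new_strand[: i + 1] for i in range(len(new_strand))]


def read_sequence(fragments: list[str]) -> str:
    """Sort fragments by size and read the terminating (last) base of each."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


if __name__ == "__main__":
    fragments = terminated_fragments("TACA")
    print(fragments)
    print(read_sequence(fragments))
```

Sorting by length plays the role of size separation; reading the last base of each fragment corresponds to identifying the terminating ddNTP.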
12. Fluorescent Dyes
• Fluorescent dyes are multicyclic molecules that absorb light and emit fluorescence at specific wavelengths.
• Examples are fluorescein and rhodamine
derivatives.
• For sequencing applications, these
molecules can be covalently linked to
nucleotides.
13. Dye Terminator Sequencing
• A distinct dye or “color” is used for each of the four ddNTPs.
• Since the terminating nucleotides can be distinguished by color, all four reactions can be performed in a single tube.
• The fragments are distinguished by size and “color.”
18. The Human Genome Project
• First draft of the human genome in 2001, finished version in 2004
• Estimated cost: $3 billion; time: 13 years
• Used Sanger sequencing
• Today:
Illumina: 1 week, $9,500
Exome: 6 weeks*, $1,000
Towards the $1,000 genome
Setia Pramana
19. The Human Genome Project
• The draft sequence of the HGP was imperfect because of the incomplete coverage of many regions, which left a huge number of gaps
• The IHGSC published a ‘finished’ version of the human genome sequence in 2004, and the HGP was then deemed to be ‘complete’
20. The Human Genome Project
• This ‘finished’ version of the genome achieved almost complete coverage of all regions and reduced the number of gaps from the initial hundreds of thousands to 341
• It initiated a new era in the study of genetic variation and the functional characterization of the human genome
21. Next (second) Generation Sequencing
• New technologies allow the massive production of tens of millions of short sequencing fragments, hence the name “massively parallel sequencing”
• These techniques can be used to
• deal with problems similar to those addressed by microarrays,
• but also with many others.
• They raised the promise of personalized medicine
22. NGS
• The advent of high-throughput
sequencing technologies has initiated
the ‘personal genome sequencing’ era
for both normal and cancer genomes
• Large-scale international projects include the 1000 Genomes Project and the International Cancer Genome Consortium
23. NGS
• NGS technologies have been on the
market only since 2004
• They have now largely replaced Sanger sequencing technologies, owing to their ultra-high throughput (hundreds of gigabases)
• Ability to sequence millions of DNA fragments simultaneously (massively parallel sequencing)
24. NGS
• NGS has reduced sequencing costs significantly, making large-scale or whole genome sequencing (WGS) studies much more affordable
28. Sequencing has gotten Cheaper and Faster
Cost of one human genome
• HGP: $3 billion (13 years)
• 2004: $30,000,000
• 2008: $100,000
• 2010: $30,000
• 2011: $10,000
• 2012-13: $7,000
• 2014: $4,000 (~1 week)
• ???: $1,000
The Race for the $1,000 Genome
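To put the cost figures in proportion, the fold reduction relative to the HGP can be computed directly. This is a small sketch using only the numbers listed on the slide; the names `HGP_COST`, `COST_PER_GENOME` and `fold_reduction` are illustrative.

```python
# Figures copied from the slide. Labels are strings because one entry spans 2012-13.
HGP_COST = 3_000_000_000
COST_PER_GENOME = [
    ("2004", 30_000_000),
    ("2008", 100_000),
    ("2010", 30_000),
    ("2011", 10_000),
    ("2012-13", 7_000),
    ("2014", 4_000),
]


def fold_reduction(baseline: float, cost: float) -> float:
    """Return how many times cheaper `cost` is than `baseline`."""
    return baseline / cost


if __name__ == "__main__":
    for label, cost in COST_PER_GENOME:
        factor = fold_reduction(HGP_COST, cost)
        print(f"{label}: ${cost:,} ({factor:,.0f}x cheaper than the HGP)")
```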
29. (Sequencing) Cost is Getting Cheaper
• NGS has reduced sequencing costs significantly, making large-scale or WGS studies much more affordable
32. NGS Challenges
• The highest cost is (almost) not the sequencing but the storage and analysis.
• A standard human whole genome sequencing run (30-40x coverage) would create about 100 Gb of data
• Extreme data size causes problems
• Just transferring and storing the data
• Standard comparisons fail (N*N)
• Standard tools cannot be used
• Think in terms of fast, parallel programs
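A rough back-of-the-envelope check of these numbers can be sketched as follows. The genome size of about 3.1 billion bases is an assumption not stated on the slide, "Gb" is read here as gigabases, and the function names are illustrative.

```python
# Assumption: approximate haploid human genome size (not given on the slide).
HUMAN_GENOME_BASES = 3_100_000_000


def sequenced_bases(coverage: int, genome_bases: int = HUMAN_GENOME_BASES) -> int:
    """Total bases produced when the genome is covered `coverage` times."""
    return coverage * genome_bases


def all_against_all(n_items: int) -> int:
    """Number of comparisons in a naive N*N (all-against-all) comparison."""
    return n_items * n_items


if __name__ == "__main__":
    for coverage in (30, 40):
        gigabases = sequenced_bases(coverage) / 1e9
        print(f"{coverage}x coverage: about {gigabases:.0f} gigabases")
    print(f"N*N for 20 million reads: {all_against_all(20_000_000):.1e} comparisons")
```

Under these assumptions, 30-40x coverage gives roughly 93 to 124 gigabases, in line with the figure of about 100 Gb, and the quadratic growth of N*N shows why all-against-all comparisons stop being practical at this scale.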
33. Bioinformatics Challenges of NGS
• Need for a large amount of CPU power
- Informatics groups must manage compute clusters
- Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
- Another level of software complexity and challenges to interoperability
34. Bioinformatics Challenges of NGS
• VERY large text files (~10 million lines long)
- Can’t do ‘business as usual’ with familiar tools such as Perl/Python.
- Impossible memory usage and execution time
- Impossible to browse for problems
• Need sequence quality filtering
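One way to keep memory usage bounded is to stream the file record by record instead of loading it whole, and to filter on quality as the records go by. The sketch below makes several assumptions not stated on the slide: reads are stored as 4-line FASTQ records, qualities use the Phred+33 encoding, and a mean-quality threshold of 20 is an arbitrary illustrative cutoff. The function names are hypothetical.

```python
import io
from typing import Iterator, TextIO

Record = tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield one record at a time, assuming strict 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(handle: TextIO, min_mean_quality: float = 20.0) -> Iterator[Record]:
    """Yield only the reads whose mean quality reaches the threshold."""
    for header, sequence, quality in read_fastq(handle):
        if quality and mean_phred(quality) >= min_mean_quality:
            yield header, sequence, quality


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n")
    for record in filter_reads(example):
        print(record)
```

Because both functions are generators, only one record is held in memory at a time, regardless of how many lines the file has.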
35. Data Management Issues
• Raw data are large. How long should they be kept?
• Processed data are manageable for most people
• 20 million reads (50 bp) ~1 Gb
• More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4 GB RAM
• Certain studies are much more data intensive than others
• Whole genome sequencing
• 30X coverage genome pair (tumor/normal) ~500 GB
• 50 genome pairs ~25 TB
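The figures above follow from simple multiplication, sketched here. The only inputs are the numbers on the slide; reading "Gb" as gigabases and using decimal units (1 TB = 1000 GB) are assumptions, and the function names are illustrative.

```python
def total_bases(n_reads: int, read_length_bp: int) -> int:
    """Total number of bases in a run of equal-length reads."""
    return n_reads * read_length_bp


def cohort_storage_tb(n_pairs: int, gb_per_pair: float) -> float:
    """Storage for a cohort of tumor/normal pairs, in TB (1 TB = 1000 GB assumed)."""
    return n_pairs * gb_per_pair / 1000


if __name__ == "__main__":
    print(f"20 million reads x 50 bp = {total_bases(20_000_000, 50) / 1e9:.0f} gigabase")
    print(f"50 pairs x 500 GB = {cohort_storage_tb(50, 500):.0f} TB")
```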
36. Data Management
• Primary data are usually discarded soon after the run
• Secondary and tertiary data are maintained on fast-access disk during analysis, then moved to slower-access disk afterward
38. Big Collaboration
• Collaborative expertise (human intelligence and intuition) is required for meaning and interpretation (Bergeron 2002)
• This includes on-demand communication and sharing of protocols, electronic resources, data, and findings among the stakeholders
• Collaboration with other Big Data sources: national registers, BPJS, hospitals, etc.
39. Summary
• Challenges:
• Still expensive
• Lack of infrastructure (in developing countries)
• Lack of skilled personnel in bioinformatics
• Need for (large-scale) collaborations
• Integrating different technologies and systems
• Making it all clinically relevant