2011 jeroen vanhoudt_ngs
2. Next-Generation sequencing (NGS) technologies – overview
NGS targeted re-sequencing – fishing out the regions of interest
NGS workflow: data collection and processing – the exome sequencing pipeline
4. The automated Sanger method is considered as a ‘first-generation’ technology, and newer methods are referred to as next-generation sequencing (NGS).
5. 1953 Discovery of DNA double helix structure
1977
◦ A. Maxam and W. Gilbert, "DNA sequencing by chemical degradation"
◦ F. Sanger, "DNA sequencing with chain-terminating inhibitors"
1984 DNA sequence of the Epstein-Barr virus, 170 kb
1987 Applied Biosystems - first automated sequencer
1991 Sequencing of human genome in Venter's lab
1996 P. Nyrén and M. Ronaghi - pyrosequencing
2001 A draft sequence of the human genome
2003 human genome completed
2004 454 Life Sciences markets first NGS machine
11. Produce a non-biased source of nucleic acid material from the genome
Current methods:
◦ Randomly break genomic DNA into smaller fragments
◦ Ligate adaptors
◦ Attach or immobilize the template to a solid surface or support
◦ The spatially separated template sites allow thousands to billions of sequencing reactions to be performed simultaneously
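The first two steps above (random fragmentation, then adaptor ligation) can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the adaptor sequences, the function names `fragment` and `ligate`, and the 1 kb random "genome" are hypothetical and do not correspond to any vendor's chemistry or to material in the slides.

```python
import random

# Hypothetical adaptor sequences, for illustration only.
ADAPTOR_5 = "AAAACCCC"
ADAPTOR_3 = "GGGGTTTT"

def fragment(genome, mean_len, rng):
    """Cut `genome` at random positions so fragments average about `mean_len` bases."""
    n_cuts = max(len(genome) // mean_len - 1, 0)
    cuts = sorted(rng.sample(range(1, len(genome)), n_cuts))
    bounds = [0] + cuts + [len(genome)]
    return [genome[a:b] for a, b in zip(bounds, bounds[1:])]

def ligate(fragments, left, right):
    """Attach the same adaptor pair to both ends of every fragment."""
    return [left + f + right for f in fragments]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
genome = "".join(rng.choice("ACGT") for _ in range(1000))
fragments = fragment(genome, 100, rng)
library = ligate(fragments, ADAPTOR_5, ADAPTOR_3)
print(len(library), "adaptor-ligated fragments")
```

Because every fragment carries the same known adaptor ends, each one can later be captured on a surface and primed in the same way, whatever its internal sequence.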
26. The major advance offered by NGS is the ability to cheaply produce an enormous volume of data
The arrival of NGS technologies in the marketplace has changed the way we think about scientific approaches in basic, applied and clinical research
36. In solution
• Relatively cheap
• High throughput is possible
• Small amounts of DNA sufficient
Solid phase
• Straightforward method
• Flexible
• Higher amounts of DNA needed
45. The human genome
◦ Genome = 3 Gb
◦ Exome = 30 Mb
◦ 180 000 exons
Protein coding genes
◦ constitute only approximately 1% of the human genome
◦ It is estimated that 85% of the mutations with large effects on disease-related traits can be found in exons or splice sites
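The "approximately 1%" figure follows directly from the sizes quoted on this slide. A minimal check, using only the slide's rounded numbers (the variable names are mine):

```python
GENOME_BP = 3_000_000_000   # 3 Gb, from the slide
EXOME_BP = 30_000_000       # 30 Mb, from the slide
N_EXONS = 180_000           # from the slide

exome_fraction = EXOME_BP / GENOME_BP   # share of the genome that is exome
mean_exon_bp = EXOME_BP / N_EXONS       # average exon length implied by the slide

print(f"Exome fraction of genome: {exome_fraction:.1%}")
print(f"Implied mean exon length: {mean_exon_bp:.0f} bp")
```

The implied mean exon length of roughly 170 bp is only a consequence of the rounded inputs, not a measured value.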
48. HiSeq specifications:
◦ 2 flow cells
◦ 16 lanes (8 per flow cell)
◦ 200-300 Gbases per flow cell
◦ 10 days for a single run
Exome throughput
◦ 96 exomes at 60x coverage per run
◦ 3000 exomes at 60x coverage per year
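The slides do not say how the exome throughput figures were derived, so the following is only a back-of-the-envelope consistency check. It assumes the 30 Mb exome size quoted earlier in the deck and a 365-day year with back-to-back runs. The "implied on-target fraction" is my own derived quantity, not a number from the slides.

```python
EXOME_BP = 30_000_000               # 30 Mb exome (from the earlier slide)
COVERAGE = 60                       # 60x
EXOMES_PER_RUN = 96
FLOW_CELLS = 2
YIELD_PER_FLOW_CELL_GB = (200, 300) # low and high end of the quoted range
RUN_DAYS = 10

# Bases that must land on target to give 96 exomes at 60x.
on_target_gb = EXOME_BP * COVERAGE * EXOMES_PER_RUN / 1e9

# Total yield of one run with both flow cells.
run_yield_gb = tuple(FLOW_CELLS * y for y in YIELD_PER_FLOW_CELL_GB)

# Fraction of the run's output that would have to be usable on-target sequence.
implied_fraction = tuple(on_target_gb / y for y in run_yield_gb)

# Upper bound on yearly throughput if the instrument never sits idle.
max_runs_per_year = 365 // RUN_DAYS
max_exomes_per_year = max_runs_per_year * EXOMES_PER_RUN

print(f"On-target bases needed per run: {on_target_gb:.1f} Gb")
print(f"Run yield: {run_yield_gb[0]}-{run_yield_gb[1]} Gb")
print(f"Implied usable fraction: {implied_fraction[1]:.0%}-{implied_fraction[0]:.0%}")
print(f"Ceiling: {max_runs_per_year} runs, {max_exomes_per_year} exomes per year")
```

Under these assumptions the quoted 3000 exomes per year sits below the theoretical ceiling, which would be consistent with some instrument downtime between runs.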