SlideShare a Scribd company logo
1 of 30
Download to read offline
FIND MEANING IN COMPLEXITY
© Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
Jane Landolin
Wednesday, January 21st
Open Data: PacBio Long Reads for Model Organisms
Best Balti in Bay Area ?
2
Little Delhi
• Bad neighborhood, good food
• San Francisco “tenderloin”
• 83 Eddy St
• (415) 398-3173
Amber India
• Unlimited Balti Buffet
• Mountain View
• Palo Alto
• San Jose
• San Francisco
Bioinformatics Support
3
• Bioinformatics Scientist in the Customer Support group
• Focus on enabling our customers
• Installing and using SMRT Analysis software
• Interpreting data
• Experimental design
• Referencing third-party tools
• Customer Portal/Github Issues
• We support everyone with PacBio Data
http://marketanalysts.lifescienceexecutive.com/blog/?p=1510
Mini Survey: Customer Service and Technical Support for Life Science Products Courtesy of BioInformatics, LLC June 2013
How important is the quality of technical
support in your decision to purchase a
new instrument?
Agenda
4
• Paper summary
• How to download the data
• DNA & Sample Prep
• Quality filter & technical validation
• Summary Statistics
Goal: Instructions to enable users
• Analysis and Assembly
• S.Cerevisiae
• Neurospora
• Drosophila
• MHAP + Human
• Thoughts on open data
http://www.nature.com/articles/sdata201445
Summary
5
• BioRxiv paper first released on
August 15 2014
• Published at Scientific Data online on
Nov 25 2014 (4 months later)
• Data released without restriction
• Data released at NCBI SRA
(.sra .fastq format) & Amazon S3 (.h5 format)
• Five Model Organisms
(E.coli, S.cerevisiae, N.crassa, A.Thaliana, D.Melanogaster)
• Eight datasets
• 55.8 Giga-bases of filtered sequence
(adapters and low quality sequence removed)
How to Download Data (Supplementary Table 1)
6
0
2000
4000
6000
8000
10000
12000
14000
Data & Technology Release Timelines (Newest P6C4)
AverageReadLength(bp)
2008 2009 2010 2011 2012 2013 2014 2015
Early PacBio® chemistries
453 1012 1734
LPR
FCR
ECR2
C2–C2
P4–C2
P5–C3
P6–C4
Half of data in reads: >14,000 bp
Average read length: 10,000 - 15,000 bp
Consensus accuracy: Achieves QV50 @30X
Preprint (BiorXiv)
Publication
DNA Prep is Lab & Organism-specific
8
Get large fragments of DNA:
- gentle-handling of DNA
- sequence right after prep
- minimal freeze-thaws
- Blue Pippin size selection
Remove Contaminants
- CTAB
- CsCl
- RNase
- Phenol Chloroform
- Ampure bead cleanup
40kb
20kb
15kb
S1 S2 S3
Quality Filtering
9
• In SMRT Sequencing, we typically have high yields after quality filtering
• On average, 95% of bases are high quality bases and pass quality filtering
• All high-quality samples retained 90-97% of the bases after filtering
(E. coli, A. thaliana, D. melanogaster)
o Retain high-quality (HQ) regions, remove others
o Remove adapter sequences between subreads
Quality Filtering Statistics
10
Mapping and Coverage Statistics
11
• Subreads are mapped to available reference
- In some cases, reference is not the
same strain
- Results are typical of SMRT Sequencing
- concordance includes indels and mismatches
- mode at 86%
• De Novo assemblies achieve
> 99.99% consensus accuracy
• Coverage is even along entire genome
- Expected distribution of coverage
- Least bias
- Random profile
• Mapping artifacts reflect poor quality of
reference genome, not sequence data
FIND MEANING IN COMPLEXITY
© Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
New Analysis and Genome Assemblies
PacBio-only De Novo Sequencing of Yeast
I II III IV V VI VII VIII IX X XI XII XIII XIV XV XVI M
100 kb
Reference (S228C): 17 chromosomes
• Genome size = 12.3 Mb
• N50 = 950 kb
• Max chrom = 1.5 Mb (chr. IV)
HGAP de novo assembly : 30 contigs
• Assembly size = 12.3 Mb
• N50 = 770 kb
• Max contig = 1.5 Mb (chr. IV)
Neurospora HGAP assembly fills 356 gaps (only 4 left)
14
Chromosome
(reference scaffold)
Length # gaps Assembled Contigs # gaps
Supercontig 12.1 9.7 Mb 89 Contig_54 (6.6 Mb)
Contig_63 (3.4 Mb)
1
Supercontig 12.2 4.5 Mb 56 Contig 60 (4.5 Mb) 0
Supercontig 12.3 5.3 Mb 45 Contig 59 (5.3 Mb) 0
Supercontig 12.4 6.0 Mb 47 Contig 57 (6.2 Mb)
Contig 58 (12 kb)
1
Supercontig 12.5 6.4 Mb 42 Contig 56 (6.4 Mb) 0
Supercontig 12.6 4.2 Mb 38 Contig 62 (4.3 Mb) 0
Supercontig 12.7 4.3 Mb 43 Contig 69 (2.6 Mb)
Contig 70 (1.7 Mb)
Contig 61 (20 kb)
2
http://figshare.com/articles/ENCODE_like_study_using_PacBio_sequencing/928630
• Added >0.5Mb of sequence
Telomere-to-Telomere Assembly!
15
Boundary of 276 kb centromere
4.5 Mb chromosome captured in contig 60
Drosophila Assembly (~160 Mb)
16
Reference genome De novo assembly
chr2L 6 pieces 4-6 pieces
chr2R 27 pieces 2 pieces
chr3L 22 pieces 1 piece
chr3R 15 pieces 3 pieces
chr4 2 pieces 3 pieces
chrX 3 pieces 42 pieces
10+ years
shotgun sanger + BAC + Opgen +
manual finishing
$millions$
1 week – collect DNA
1 week – sample prep
6 days – sequencing
3 weeks – assembly
$9,000
24.6 MB!!
Drosophila Y Chromosome
Release 5 reference contains
~1% of chromosome Y
This assembly: >50%
Drosophilla Assembly (vs. Synthetic reads)
18
Assembly I
(FALCON)
Assembly II
(Celera
Assembler +
PBcR)
Moleculo
(Celera
Assembler)
Number of
contigs
434 128 5,066
N50
length
5.0 Mb 15.3 Mb 0.1 Mb
http://biorxiv.org/content/early/2014/06/17/001834
“By directly sequencing long
molecules, these third-generation
technologies will likely outperform
TruSeq synthetic long-reads in certain
capacities, such as assembly contiguity
enabled by homogeneous genome
coverage. Indeed, preliminary results
from the assembly of a different
substrain of D. melanogaster using
corrected PacBio data achieved an N50
contig length of 15.3 Mbp and closed two
of the remaining gaps in the euchromatin
of the Release 5 reference sequence
(Landolin, et al., 2014,
http://dx.doi.org/10.6084/m9.figshare.976097).”
Completely spans repeat elements
19
PacBio
Reads
Moleculo
Synthetic
Reads
Repeats
6kb ROO elements
Stacked
reads
No reads
simple repeats
(chr2L:922,441-1,013,372)
96
5.2
PacBio Moleculo
0
20
40
60
80
100
%
Resolved roo TEs:
Sequences through GC-rich regions
20
PacBio
GC Percent
(chr2L:1,714,784-1,741,283)
Moleculo
Synthetic
Reads
Advances in PacBio-only De Novo Assembly
Spinach 1G
Contig N50
531 kbpDrosophila 170M
Contig N50
4.5 MbpArabidopsis 120M
Contig N50
7.1 Mbp
Human 3.2 G
Contig N50
4.4 Mbp,
Max=44 Mbp
(Assembly
powered by
Google®
Exacycle)
2013 2014
Bacteria:
Finished
Genomes
Yeast 12M
Resolve most
chromosomes
“Haploid” Assemblies
Next Challenge:
Diploid Assemblies
MinHash Alignment Process (MHAP)
22http://biorxiv.org/content/early/2014/08/14/008003
For D. melanogaster, MHAP
achieved a 600-fold speedup
relative to prior methods and a
cloud computing cost of a few
hundred dollars.
Public Genome Assembly Tools (blog/preprint)
• Dazzler
– Gene Myers, U. Dresden
- Benchmarking on H. sapiens
- Distributed filesystem (GlusterFS) to optimize read/write I/O operations
- New data structures to minimize data loading/memory burden (.qva, DAM)
- Blog: https://dazzlerblog.wordpress.com/
- Code: https://github.com/thegenemyers/DALIGNER
• ECtools
- Mike Schatz, CSHL
- Benchmarking on E.coli, S. Cerevisiae, A. thaliana, O. sativa
- Hybrid Assembly
- Support Vector Regression
- Preprint: http://schatzlab.cshl.edu/data/ectools/AssemblyComplexity.pdf
- Code: https://github.com/jgurtowski/ectools
23
Human PacBio Data
24
• Resolved >26,000 euchromatic structural variants at the base-pair level
• ~22,000 (85%) of these are novel
• Closes/extends 55% of the remaining gaps in human reference genome
Chaisson et al. (2014) Nature doi:10.1038/nature13907
PacBio Data vs. GRCh37 & 1000 Genomes Project
Chaisson et al. (2014) Nature doi:10.1038/nature13907
Gonzaga-Jauregui et al. (2012) Annu Rev Med 63: 35-61
Genomic Variation Detection by 2nd Gen Technologies
“Approximately 35% of the genes in the human genome are
encompassed either totally or partially by a CNV that can alter their
expression or even their structure, possibly giving rise to novel fusion
transcripts.”
“Detection of structural variation is imperative in any WGS study.”
Behavioral Diseases Associated with Structural Variation
Girirajan & Eichler (2010) Human Molecular Genetics 19: R176-187
Accelerating discovery in open-access/preprint world
28
Data Release Paper:
BiorXiv preprint: http://biorxiv.org/content/early/2014/10/23/008037
Publication: http://www.nature.com/articles/sdata201445
Neurospora:
Poster: http://figshare.com/articles/ENCODE_like_study_using_PacBio_sequencing/928630
Publication: In the works
Drosophila:
Poster: http://figshare.com/articles/A_better_Drosophila_Melanogaster_genome_by_long_read_sequencing/976097
Publication: In the works
Moleculo Synthetic Reads Paper
BiorXiv preprint: http://biorxiv.org/content/early/2014/01/19/001834
MinHash Assembly Process Paper:
BiorXiv: http://biorxiv.org/content/early/2014/08/14/008003
Publication: Accepted
Human Structural Complexity Paper:
Publication: http://www.nature.com/nature/journal/vaop/ncurrent/full/nature13907.html
What’s next for PacBio
29
Data
• https://github.com/PacificBiosciences/DevNet/wiki/Datasets
o P6C4 C.elegans 40X dataset
o HLA Multiplexed GenDx Amplicon
Performance
Analysis
Software/Analysis
• Scaling with increasing platform throughput and provide faster time to results
• De novo assembly for larger genomes
• Diploid Genome Assembler
• Regional methylation analysis for large genomes
• Intuitive and easy to use Graphical User Interface (SMRT® Portal)
Estimated Output per
SMRT® Cell
Read Length
Avg N50bases Max
Jan 2013 ~100 Mb 4,500 bp 6 kb >20 kb
Oct 2013
P5-C3
~400 Mb 8,500 bp 10 kb >40 kb
Oct 2014
P6-C4
500 Mb – 1 Gb 10-15 kb 12-18 kb >60 kb
Active Loading, Template Prep,
& Read Length Improvements
2 – 4 Gb 15-20 kb 17-23 kb >60 kb
2015 roadmap (http://blog.pacificbiosciences.com/)
Thank you!
30
PacBio
Kristi Spittle-Kim, Primo Babayan, Paul Peluso, David Rank, Jonas Korlach
Collaborators
Casey Bergman, Sue Celniker, Jane Yeadon, David Catcheside, Joachim Li
Community
Lex Nederbragt, Konstantin Berlin, Sergey Koren, Adam Phillippy, Gene Myers,
Mike Schatz, Nick Loman

More Related Content

What's hot

Combining PacBio with short read technology for improved de novo genome assembly
Combining PacBio with short read technology for improved de novo genome assemblyCombining PacBio with short read technology for improved de novo genome assembly
Combining PacBio with short read technology for improved de novo genome assembly
Lex Nederbragt
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
Nikolay Vyahhi
 

What's hot (20)

NGx Sequencing 101-platforms
NGx Sequencing 101-platformsNGx Sequencing 101-platforms
NGx Sequencing 101-platforms
 
Combining PacBio with short read technology for improved de novo genome assembly
Combining PacBio with short read technology for improved de novo genome assemblyCombining PacBio with short read technology for improved de novo genome assembly
Combining PacBio with short read technology for improved de novo genome assembly
 
Jan2016 dnanexus giab uses andrew carroll
Jan2016 dnanexus giab uses andrew carrollJan2016 dnanexus giab uses andrew carroll
Jan2016 dnanexus giab uses andrew carroll
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
Generating high-quality reference human genomes using PromethION nanopore seq...
Generating high-quality reference human genomes using PromethION nanopore seq...Generating high-quality reference human genomes using PromethION nanopore seq...
Generating high-quality reference human genomes using PromethION nanopore seq...
 
40 Years of Genome Assembly: Are We Done Yet?
40 Years of Genome Assembly: Are We Done Yet?40 Years of Genome Assembly: Are We Done Yet?
40 Years of Genome Assembly: Are We Done Yet?
 
Aug2015 analysis team spiral genetics
Aug2015 analysis team spiral geneticsAug2015 analysis team spiral genetics
Aug2015 analysis team spiral genetics
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
 
Aug2015 analysis team 04 10x genomics
Aug2015 analysis team 04 10x genomicsAug2015 analysis team 04 10x genomics
Aug2015 analysis team 04 10x genomics
 
ChIP-seq Theory
ChIP-seq TheoryChIP-seq Theory
ChIP-seq Theory
 
Ngs intro_v6_public
 Ngs intro_v6_public Ngs intro_v6_public
Ngs intro_v6_public
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
 
GLBIO/CCBC Metagenomics Workshop
GLBIO/CCBC Metagenomics WorkshopGLBIO/CCBC Metagenomics Workshop
GLBIO/CCBC Metagenomics Workshop
 
2015 03 13_puurs_v_public
2015 03 13_puurs_v_public2015 03 13_puurs_v_public
2015 03 13_puurs_v_public
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
 
Jan2016 bio nano han cao
Jan2016 bio nano han caoJan2016 bio nano han cao
Jan2016 bio nano han cao
 
Mane v2 final
Mane v2 finalMane v2 final
Mane v2 final
 
Schneider grc workshop_final
Schneider grc workshop_finalSchneider grc workshop_final
Schneider grc workshop_final
 
Previewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCPreviewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRC
 

Similar to Open pacbiomodelorgpaper j_landolin_20150121

Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Surya Saha
 
China Medical University Student ePaper2
China Medical University Student ePaper2China Medical University Student ePaper2
China Medical University Student ePaper2
Isabelle Chiu
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal
GenomeInABottle
 

Similar to Open pacbiomodelorgpaper j_landolin_20150121 (20)

BioSB meeting 2015
BioSB meeting 2015BioSB meeting 2015
BioSB meeting 2015
 
Bda2015 tutorial-part2-data&databases
Bda2015 tutorial-part2-data&databasesBda2015 tutorial-part2-data&databases
Bda2015 tutorial-part2-data&databases
 
Introduction to 16S Microbiome Analysis
Introduction to 16S Microbiome AnalysisIntroduction to 16S Microbiome Analysis
Introduction to 16S Microbiome Analysis
 
CRISPR Screening: the What, Why and How
CRISPR Screening: the What, Why and HowCRISPR Screening: the What, Why and How
CRISPR Screening: the What, Why and How
 
whole-genome-sequencing-guide-small-genomes.pdf.pdf
whole-genome-sequencing-guide-small-genomes.pdf.pdfwhole-genome-sequencing-guide-small-genomes.pdf.pdf
whole-genome-sequencing-guide-small-genomes.pdf.pdf
 
Thesis biobix
Thesis biobixThesis biobix
Thesis biobix
 
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
 
Generations of sequencing technologies.
Generations of sequencing technologies. Generations of sequencing technologies.
Generations of sequencing technologies.
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_course
 
DNA sequencing: rapid improvements and their implications
DNA sequencing: rapid improvements and their implicationsDNA sequencing: rapid improvements and their implications
DNA sequencing: rapid improvements and their implications
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
 
China Medical University Student ePaper2
China Medical University Student ePaper2China Medical University Student ePaper2
China Medical University Student ePaper2
 
26072016 uc davis_small
26072016 uc davis_small26072016 uc davis_small
26072016 uc davis_small
 
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisSo you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
 
Rewriting the Genome Using CRISPR and Synthetic Biology
Rewriting the Genome Using CRISPR and Synthetic Biology Rewriting the Genome Using CRISPR and Synthetic Biology
Rewriting the Genome Using CRISPR and Synthetic Biology
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...
 
Variant analysis and whole exome sequencing
Variant analysis and whole exome sequencingVariant analysis and whole exome sequencing
Variant analysis and whole exome sequencing
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
 

Recently uploaded

biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
PODOCARPUS...........................pptx
PODOCARPUS...........................pptxPODOCARPUS...........................pptx
PODOCARPUS...........................pptx
Cherry
 

Recently uploaded (20)

FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Genome sequencing,shotgun sequencing.pptx
Genome sequencing,shotgun sequencing.pptxGenome sequencing,shotgun sequencing.pptx
Genome sequencing,shotgun sequencing.pptx
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Cot curve, melting temperature, unique and repetitive DNA
Cot curve, melting temperature, unique and repetitive DNACot curve, melting temperature, unique and repetitive DNA
Cot curve, melting temperature, unique and repetitive DNA
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
Terpineol and it's characterization pptx
Terpineol and it's characterization pptxTerpineol and it's characterization pptx
Terpineol and it's characterization pptx
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
Genome organization in virus,bacteria and eukaryotes.pptx
Genome organization in virus,bacteria and eukaryotes.pptxGenome organization in virus,bacteria and eukaryotes.pptx
Genome organization in virus,bacteria and eukaryotes.pptx
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Plasmid: types, structure and functions.
Plasmid: types, structure and functions.Plasmid: types, structure and functions.
Plasmid: types, structure and functions.
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
Kanchipuram Escorts 🥰 8617370543 Call Girls Offer VIP Hot Girls
Kanchipuram Escorts 🥰 8617370543 Call Girls Offer VIP Hot GirlsKanchipuram Escorts 🥰 8617370543 Call Girls Offer VIP Hot Girls
Kanchipuram Escorts 🥰 8617370543 Call Girls Offer VIP Hot Girls
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 
FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.
 
PODOCARPUS...........................pptx
PODOCARPUS...........................pptxPODOCARPUS...........................pptx
PODOCARPUS...........................pptx
 

Open pacbiomodelorgpaper j_landolin_20150121

  • 1. FIND MEANING IN COMPLEXITY © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved. Jane Landolin Wednesday, January 21st Open Data: PacBio Long Reads for Model Organisms
  • 2. Best Balti in Bay Area ? 2 Little Delhi • Bad neighborhood, good food • San Francisco “tenderloin” • 83 Eddy St • (415) 398-3173 Amber India • Unlimited Balti Buffet • Mountain View • Palo Alto • San Jose • San Francisco
  • 3. Bioinformatics Support 3 • Bioinformatics Scientist in the Customer Support group • Focus on enabling our customers • Installing and using SMRT Analysis software • Interpreting data • Experimental design • Referencing third-party tools • Customer Portal/Github Issues • We support everyone with PacBio Data http://marketanalysts.lifescienceexecutive.com/blog/?p=1510 Mini Survey: Customer Service and Technical Support for Life Science Products Courtesy of BioInformatics, LLC June 2013 How important is the quality of technical support in your decision to purchase a new instrument?
  • 4. Agenda 4 • Paper summary • How to download the data • DNA & Sample Prep • Quality filter & technical validation • Summary Statistics Goal: Instructions to enable users • Analysis and Assembly • S.Cerevisiae • Neurospora • Drosophila • MHAP + Human • Thoughts on open data http://www.nature.com/articles/sdata201445
  • 5. Summary 5 • BioRxiv paper first released on August 15 2014 • Published at Scientific Data online on Nov 25 2014 (4 months later) • Data released without restriction • Data released at NCBI SRA (.sra .fastq format) & Amazon S3 (.h5 format) • Five Model Organisms (E.coli, S.cerevisiae, N.crassa, A.Thaliana, D.Melanogaster) • Eight datasets • 55.8 Giga-bases of filtered sequence (adapters and low quality sequence removed)
  • 6. How to Download Data (Supplementary Table 1) 6
  • 7. 0 2000 4000 6000 8000 10000 12000 14000 Data & Technology Release Timelines (Newest P6C4) AverageReadLength(bp) 2008 2009 2010 2011 2012 2013 2014 2015 Early PacBio® chemistries 453 1012 1734 LPR FCR ECR2 C2–C2 P4–C2 P5–C3 P6–C4 Half of data in reads: >14,000 bp Average read length: 10,000 - 15,000 bp Consensus accuracy: Achieves QV50 @30X Preprint (BiorXiv) Publication
  • 8. DNA Prep is Lab & Organism-specific 8 Get large fragments of DNA: - gentle-handling of DNA - sequence right after prep - minimal freeze-thaws - Blue Pippin size selection Remove Contaminants - CTAB - CsCl - RNase - Phenol Chloroform - Ampure bead cleanup 40kb 20kb 15kb S1 S2 S3
  • 9. Quality Filtering 9 • In SMRT Sequencing, we typically have high yields after quality filtering • On average, 95% of bases are high quality bases and pass quality filtering • All high-quality samples retained 90-97% of the bases after filtering (E. coli, A. thaliana, D. melanogaster) o Retain high-quality (HQ) regions, remove others o Remove adapter sequences between subreads
  • 11. Mapping and Coverage Statistics 11 • Subreads are mapped to available reference - In some cases, reference is not the same strain - Results are typical of SMRT Sequencing - concordance includes indels and mismatches - mode at 86% • De Novo assemblies achieve > 99.99% consensus accuracy • Coverage is even along entire genome - Expected distribution of coverage - Least bias - Random profile • Mapping artifacts reflect poor quality of reference genome, not sequence data
  • 12. FIND MEANING IN COMPLEXITY © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved. New Analysis and Genome Assemblies
  • 13. PacBio-only De Novo Sequencing of Yeast I II III IV V VI VII VIII IX X XI XII XIII XIV XV XVI M 100 kb Reference (S228C): 17 chromosomes • Genome size = 12.3 Mb • N50 = 950 kb • Max chrom = 1.5 Mb (chr. IV) HGAP de novo assembly : 30 contigs • Assembly size = 12.3 Mb • N50 = 770 kb • Max contig = 1.5 Mb (chr. IV)
  • 14. Neurospora HGAP assembly fills 356 gaps (only 4 left) 14 Chromosome (reference scaffold) Length # gaps Assembled Contigs # gaps Supercontig 12.1 9.7 Mb 89 Contig_54 (6.6 Mb) Contig_63 (3.4 Mb) 1 Supercontig 12.2 4.5 Mb 56 Contig 60 (4.5 Mb) 0 Supercontig 12.3 5.3 Mb 45 Contig 59 (5.3 Mb) 0 Supercontig 12.4 6.0 Mb 47 Contig 57 (6.2 Mb) Contig 58 (12 kb) 1 Supercontig 12.5 6.4 Mb 42 Contig 56 (6.4 Mb) 0 Supercontig 12.6 4.2 Mb 38 Contig 62 (4.3 Mb) 0 Supercontig 12.7 4.3 Mb 43 Contig 69 (2.6 Mb) Contig 70 (1.7 Mb) Contig 61 (20 kb) 2 http://figshare.com/articles/ENCODE_like_study_using_PacBio_sequencing/928630 • Added >0.5Mb of sequence
  • 15. Telomere-to-Telomere Assembly! 15 Boundary of 276 kb centromere 4.5 Mb chromosome captured in contig 60
  • 16. Drosophila Assembly (~160 Mb) 16 Reference genome De novo assembly chr2L 6 pieces 4-6 pieces chr2R 27 pieces 2 pieces chr3L 22 pieces 1 piece chr3R 15 pieces 3 pieces chr4 2 pieces 3 pieces chrX 3 pieces 42 pieces 10+ years shotgun sanger + BAC + Opgen + manual finishing $millions$ 1 week – collect DNA 1 week – sample prep 6 days – sequencing 3 weeks – assembly $9,000 24.6 MB!!
  • 17. Drosophila Y Chromosome Release 5 reference contains ~1% of chromosome Y This assembly: >50%
  • 18. Drosophilla Assembly (vs. Synthetic reads) 18 Assembly I (FALCON) Assembly II (Celera Assembler + PBcR) Moleculo (Celera Assembler) Number of contigs 434 128 5,066 N50 length 5.0 Mb 15.3 Mb 0.1 Mb http://biorxiv.org/content/early/2014/06/17/001834 “By directly sequencing long molecules, these third-generation technologies will likely outperform TruSeq synthetic long-reads in certain capacities, such as assembly contiguity enabled by homogeneous genome coverage. Indeed, preliminary results from the assembly of a different substrain of D. melanogaster using corrected PacBio data achieved an N50 contig length of 15.3 Mbp and closed two of the remaining gaps in the euchromatin of the Release 5 reference sequence (Landolin, et al., 2014, http://dx.doi.org/10.6084/m9.figshare.976097).”
  • 19. Completely spans repeat elements 19 PacBio Reads Moleculo Synthetic Reads Repeats 6kb ROO elements Stacked reads No reads simple repeats (chr2L:922,441-1,013,372) 96 5.2 PacBio Moleculo 0 20 40 60 80 100 % Resolved roo TEs:
  • 20. Sequences through GC-rich regions 20 PacBio GC Percent (chr2L:1,714,784-1,741,283) Moleculo Synthetic Reads
  • 21. Advances in PacBio-only De Novo Assembly Spinach 1G Contig N50 531 kbpDrosophila 170M Contig N50 4.5 MbpArabidopsis 120M Contig N50 7.1 Mbp Human 3.2 G Contig N50 4.4 Mbp, Max=44 Mbp (Assembly powered by Google® Exacycle) 2013 2014 Bacteria: Finished Genomes Yeast 12M Resolve most chromosomes “Haploid” Assemblies Next Challenge: Diploid Assemblies
  • 22. MinHash Alignment Process (MHAP) 22http://biorxiv.org/content/early/2014/08/14/008003 For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars.
  • 23. Public Genome Assembly Tools (blog/preprint) • Dazzler – Gene Myers, U. Dresden - Benchmarking on H. sapiens - Distributed filesystem (GlusterFS) to optimize read/write I/O operations - New data structures to minimize data loading/memory burden (.qva, DAM) - Blog: https://dazzlerblog.wordpress.com/ - Code: https://github.com/thegenemyers/DALIGNER • ECtools - Mike Schatz, CSHL - Benchmarking on E.coli, S. Cerevisiae, A. thaliana, O. sativa - Hybrid Assembly - Support Vector Regression - Preprint: http://schatzlab.cshl.edu/data/ectools/AssemblyComplexity.pdf - Code: https://github.com/jgurtowski/ectools 23
  • 24. Human PacBio Data 24 • Resolved >26,000 euchromatic structural variants at the base-pair level • ~22,000 (85%) of these are novel • Closes/extends 55% of the remaining gaps in human reference genome Chaisson et al. (2014) Nature doi:10.1038/nature13907
  • 25. PacBio Data vs. GRCh37 & 1000 Genomes Project Chaisson et al. (2014) Nature doi:10.1038/nature13907
  • 26. Gonzaga-Jauregui et al. (2012) Annu Rev Med 63: 35-61 Genomic Variation Detection by 2nd Gen Technologies “Approximately 35% of the genes in the human genome are encompassed either totally or partially by a CNV that can alter their expression or even their structure, possibly giving rise to novel fusion transcripts.” “Detection of structural variation is imperative in any WGS study.”
  • 27. Behavioral Diseases Associated with Structural Variation Girirajan & Eichler (2010) Human Molecular Genetics 19: R176-187
  • 28. Accelerating discovery in open-access/preprint world 28 Data Release Paper: BiorXiv preprint: http://biorxiv.org/content/early/2014/10/23/008037 Publication: http://www.nature.com/articles/sdata201445 Neurospora: Poster: http://figshare.com/articles/ENCODE_like_study_using_PacBio_sequencing/928630 Publication: In the works Drosophila: Poster: http://figshare.com/articles/A_better_Drosophila_Melanogaster_genome_by_long_read_sequencing/976097 Publication: In the works Moleculo Synthetic Reads Paper BiorXiv preprint: http://biorxiv.org/content/early/2014/01/19/001834 MinHash Assembly Process Paper: BiorXiv: http://biorxiv.org/content/early/2014/08/14/008003 Publication: Accepted Human Structural Complexity Paper: Publication: http://www.nature.com/nature/journal/vaop/ncurrent/full/nature13907.html
  • 29. What’s next for PacBio 29 Data • https://github.com/PacificBiosciences/DevNet/wiki/Datasets o P6C4 C.elegans 40X dataset o HLA Multiplexed GenDx Amplicon Performance Analysis Software/Analysis • Scaling with increasing platform throughput and provide faster time to results • De novo assembly for larger genomes • Diploid Genome Assembler • Regional methylation analysis for large genomes • Intuitive and easy to use Graphical User Interface (SMRT® Portal) Estimated Output per SMRT® Cell Read Length Avg N50bases Max Jan 2013 ~100 Mb 4,500 bp 6 kb >20 kb Oct 2013 P5-C3 ~400 Mb 8,500 bp 10 kb >40 kb Oct 2014 P6-C4 500 Mb – 1 Gb 10-15 kb 12-18 kb >60 kb Active Loading, Template Prep, & Read Length Improvements 2 – 4 Gb 15-20 kb 17-23 kb >60 kb 2015 roadmap (http://blog.pacificbiosciences.com/)
  • 30. Thank you! 30 PacBio Kristi Spittle-Kim, Primo Babayan, Paul Peluso, David Rank, Jonas Korlach Collaborators Casey Bergman, Sue Celniker, Jane Yeadon, David Catcheside, Joachim Li Community Lex Nederbragt, Konstantin Berlin, Sergey Koren, Adam Phillippy, Gene Myers, Mike Schatz, Nick Loman