Things	to	consider	when	
initiating	a	genome	project
the	assembly	pipeline	@	SciLifeLab
Helsinki,	Dec	9th	2015
Álvaro	Martínez	Barrio,		PhD	
Alvaro.Martinez.Barrio@scilifelab.se	
								linkedin.com/in/ambarrio	
							@ambarrio
Workshop	Outline
• Introducing	SciLifeLab	
• The important considerations of all genome projects
• The	annotation	and	assembly	platforms	
• A	vision	into	the	future
Survey
• How many of you have used sequencing facilities?
• Assembled a genome?
• Planning to start a genome project?
• Have worked with NGS data?
• Just curious about NGS?
Things	to	consider
• Repeats	
• Heterozygosity	
• Size	of	your	genome	
• GC	content	
• Access	to	material	and	specifically	HMW	DNA	
• Access	to	a	good	computational	cluster	
• Good	bioinformaticians	/	lab	technicians
WHAT	IS	YOUR	SCIENTIFIC	QUESTION?
Variation	space
http://www.intechopen.com/books/recent-advances-in-autism-spectrum-disorders-volume-i/discovering-the-genetics-of-autism
Alkan	C.,	Coe	B.P.,	Eichler	E.E..	Nature	Rev	Genetics	(2011)
Ward L.D. & Kellis M. Nat Biotechnology (2012)
About	me
Álvaro	Martínez	Barrio,		PhD	
• PhD	Bioinformatics	2010	
• Postdoc	Pop	Genetics	/	Comp	Biol	2014,			
L.	Andersson	+	H.	Ronne	
• Herring:	Illumina,	SOLiD,	Moleculo,	PacBio	
• Species	Plant:	454,	SOLiD,	Illumina	
• Species	Seal	(~3Gb):	Illumina	
• Species	Beetle:	Illumina,	PacBio
Figure 1 | Cost-effectiveness of Pool-seq. The accuracy of allele frequency estimates is compared for whole-genome sequencing of pools of individuals (Pool-seq) and whole-genome sequencing of individuals using the ratio of the standard deviation (SD) of the estimated allele frequency with both methods. The same number of reads is used for both sequencing strategies. A value smaller than one indicates that Pool-seq is more accurate than sequencing of individuals. a | The influence of the pool size is shown. A larger pool size results in higher accuracy of Pool-seq, but Pool-seq still produces more accurate allele frequency estimates even for pool sizes of 50 individuals in most …
[Figure: two panels (a, b). X axis: number of individuals sequenced separately (10 to 50); Y axis: SDpool/SDindividuals (0.4 to 1.1). Curves are labelled by pool size / coverage per sequenced individual / deviation in DNA content from each individual in the pool: 100 / 20× / 0%, 100 / 20× / 30%, 100 / 5× / 30% and 100 / 1× / 30% in one panel; 500 / 5× / 30%, 100 / 5× / 30% and 50 / 5× / 30% in the other.]
Nature Reviews | Genetics
Schlötterer	C.,	Tobler	R.,	Kofler	R.	and	Nolte	V.	Nature	Rev	Genetics	(2014)
Why	pooling?
SciLifeLab (promotion slides)
National service
Local scientific
center
SciLifeLab
Director (July 2015)
Olli Kallioniemi
Co-director
Kerstin Lindblad-Toh
Vision:
To be an internationally leading center that
develops, uses and provides access to
advanced technologies for molecular
biosciences with focus on health and
environment.
www.scilifelab.se
2010: Strategic research initiative
2013: National resource
2015: New management and chairman
SciLifeLab platforms
SciLifeLab
National Genomics Infrastructure
National Bioinformatics Infrastructure Sweden
Joakim Lundeberg
Ann-Christine Syvänen
Ulf Gyllensten
Bengt Persson
Clinical Diagnostics
…
Lars Engstrand
Computer resources free for Swedish researchers
VR
SNIC
Ongoing merge of BILS, WABI and more; complete 2016.
National, distributed
Know	a	good	bioinformatician
NBIS - We're here for you!
The Bioinformatics Platform 2016
Funding
•  The Research Council
•  SciLifeLab
•  KAW foundation
•  Host universities
Applied at the Research Council as continued
national infrastructure 2016-2023. Decision late 2015.
Custom-tailored support Tools Training
Today
~70 FTE
Long-term Support
Wallenberg Advanced Bioinformatics Infrastructure
www.scilifelab.se/facilities/wabi/
Björn Nystedt Thomas Svensson
Tailored solutions – high impact
Siv Andersson, Gunnar von Heijne
Applied bioinformatics: 500h free support/project
•  Variant analyses in health and disease
•  Transcriptomics
•  Single-cell analyses
•  Epigenetics
•  Metagenomics
Directors
Managers
Sweden's strongest unit for analyses of large-scale genomic data (24 FTE)
National committee reviews and selects
projects based on scientific quality
Staff in Stockholm, Uppsala, Lund,
Gothenburg, Linköping, Umeå.
WABI	personnel	(2013-2014)
Johan Reimegård
Mikael Huss
Åsa Björklund
Pär Engström
Jakub Orzechowski Westholm
Estelle Proux-Wéra
Sanella Kjellqvist
Diana Ekman
Pall Olason
Anna Johansson
Marcel Martin
Alvaro Martinez Barrio
Per Unneberg
Know	how	to	handle	your	data
Today: Human genome sequenced in days
- towards $1000 genome
Massively parallel sequencing…
…requires supercomputers for analysis and storage
SciLifeLab Bioinformatics Compute and Storage (UPPNEX)
1. Sample transfer
2. Data delivery
3. Analysis
Scientists
www.uppmax.uu.se/uppnex
High-performance computers and large-scale storage for bioinformatics analysis.
How do you work on UPPMAX computers?
Login
Submit jobs
Job queue
Job assigned
Work interactively
Job	Queue
Resources
2015: Research ~8000 cores; Production ~3200 cores; Redundancy 768 cores; Storage ~11 PB; Long-term storage; Mosler 384 cores
2014: Research 3328 cores; Production 768 cores; Storage ~7 PB; Long-term storage; Mosler 384 cores
Private Cloud: 1600 cores
Chipster, CanvasDB
Project growth
[Figure: number of active projects per year, 2004 to 2013 (Y axis 100 to 400), for UPPMAX and UPPNEX.]
UPPNEX history
2009: 13,152 MSEK from KAW and SNIC
2012: 23.8 MSEK from KAW/SNIC
2014: 20 + 20 MSEK from KAW for WGS
      SNIC receives 47.8 MSEK from VR to handle sensitive data
UPPMAX personnel
+3 more…
Know	how	to	extract	your	DNA
Olga	Vinnere	Pettersson	(UGC)
olga.pettersson@igp.uu.se
mp4	-	http://bit.ly/1Ul7RmH	
pptx	-	http://bit.ly/1Z6yIFH	
Q&A	-	http://bit.ly/1I1Sb6o
Bacteria Fungi
Insects Plants
Know	how	to	measure	your	assembly	results
Just a word on N50…
N50 typically refers to a contig (or scaffold) length
But…
•  The original definition is the number of contigs needed to reach half of the genome size (L50 is the length)
•  Many programs use the total assembly size as a proxy for the genome size; this is sometimes completely misleading: Use NG50!
•  PIs don't understand N50 anyway; use something more intuitive :)
   - contigs larger than 1 kbp sum to 93% of the genome size
   - contigs larger than 10 kbp sum to 48% of the genome size
   - contigs larger than 100 kbp sum to 19% of the genome size
[Figure: a genome compared with its assembly. Labels: Genome, Assembly, Genome size, Assembly size, NG50, N50. Example values: 3 contigs / 100 kbp and 5 contigs / 30 kbp.]
Just	a	word	on	N50
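The difference between N50 and NG50 can be sketched in a few lines of Python. This is a minimal illustration that uses the common "N50 is a length" convention; the function names and the toy contig lengths are made up for the example and do not come from any assembler.

```python
def nx50(lengths, total=None):
    """Length of the contig at which the cumulative sum of contigs, sorted
    longest first, reaches half of `total`.

    total=None   -> use the assembly size (N50)
    total=genome -> use the estimated genome size (NG50)
    Returns 0 if the assembly never reaches half of `total`.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def fraction_in_contigs_at_least(lengths, min_len, genome_size):
    """The 'more intuitive' summary: share of the genome in contigs >= min_len."""
    return sum(l for l in lengths if l >= min_len) / genome_size


# Hypothetical toy assembly (lengths in kbp) of a genome estimated at 400 kbp.
contigs = [100, 60, 30, 20, 10]
print(nx50(contigs))                                   # N50 against assembly size
print(nx50(contigs, total=400))                        # NG50 against genome size
print(fraction_in_contigs_at_least(contigs, 50, 400))  # share in contigs >= 50 kbp
```

With these toy numbers the assembly covers only 220 of 400 kbp, so N50 (60 kbp) looks much better than NG50 (20 kbp), which is the misleading effect the slide warns about.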
Know	why	assembling	is	difficult
Two	types	of	assemblies	
Case	1	:			 Flycatcher	(1.2	Gbp)	
Herring		(800	Mbp)		
Malassezia	(7	Mbp)	
Case	2	:			 Spruce	(20	Gbp)	
Barnacle	(1.4	Gbp)	
Wolbachia	(4	Mbp)	
Two	types	of	assemblies
Pre-assembly
• Quality	trimming	
• (Error	correction)	
• Kmer	analysis	
• De	novo	repeat	library
Quality trimming
DeBruijn-graph assemblers are in principle sensitive to errors since they do not take base quality values into account
•  Trim adapters (e.g. Cutadapt)
•  Filter on quality, both 5' and 3' end! (e.g. Trimmomatic)
•  Consider hard-trimming of 5' end
•  Error correction (e.g. Quake)
•  Inspect (e.g. FastQC)
Plots by Olof Karlberg
Quality	trimming
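To make "filter on quality, both 5' and 3' end" concrete, here is a deliberately simple Python sketch that removes low-quality bases from both ends of a read. It is not how Cutadapt or Trimmomatic work internally (those tools offer adapter matching, sliding windows and more); the function names and the threshold of Q20 are assumptions chosen for illustration.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string (Phred+33 encoding) into integer scores."""
    return [ord(c) - 33 for c in qual_string]


def trim_ends(seq, quals, min_q=20):
    """Drop bases below `min_q` from the 5' end and from the 3' end.

    Only the ends are trimmed; low-quality bases inside the read are kept.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


# Hypothetical read: one poor base at the 5' end, two at the 3' end.
seq, quals = trim_ends("ACGTACGT", [5, 30, 30, 30, 30, 30, 10, 8])
print(seq, quals)
```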
Reads vs kmers
1 read: 100 bp
Kmers: k = 21 bp
N = (L - k + 1) = (100 bp - 21 bp + 1) = 80
Kmer coverage = Base coverage * (L - k + 1) / L
Ex: 50X * (100 - 21 + 1) / 100 = 40X (i.e. kmer coverage is 80% of base coverage)
Reads	vs	Kmers
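The arithmetic on this slide, written out as a small Python sketch (the function names are illustrative only):

```python
def kmers_per_read(read_len, k):
    """Number of kmers in one read: N = L - k + 1."""
    return read_len - k + 1


def kmer_coverage(base_coverage, read_len, k):
    """Kmer coverage = base coverage * (L - k + 1) / L."""
    return base_coverage * kmers_per_read(read_len, k) / read_len


print(kmers_per_read(100, 21))     # 80 kmers in a 100 bp read for k = 21
print(kmer_coverage(50, 100, 21))  # 40.0, i.e. 80% of the 50X base coverage
```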
Kmer analyses
Compute the frequency of each kmer in the dataset (e.g. Jellyfish --both-strands)
Note: RAM-intense!
How	to	count	kmers?
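What a kmer counter computes can be shown with a toy Python version. Counting both strands means a kmer and its reverse complement are treated as the same ("canonical") kmer. This sketch keeps everything in a dictionary, which is exactly why real datasets need a dedicated tool such as Jellyfish; the function names are assumptions for the example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical kmers over all reads, skipping kmers that contain N."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts):
    """Map each depth to the number of distinct kmers seen that many times."""
    return Counter(counts.values())


counts = count_kmers(["ACGTA", "TACGT"], k=3)
print(dict(counts))                  # {'ACG': 4, 'GTA': 2}
print(dict(kmer_histogram(counts)))  # {4: 1, 2: 1}
```

The histogram (depth versus number of distinct kmers) is the input to the genome-size and heterozygosity estimates on the following slides.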
Digging into the kmers
Genome size
•  Remove low-copy kmers
•  Identify the coverage peak
•  Divide total nb of kmers by peak
Genome size = Ktot / Cpeak
Here: 1.4 Gbp = 80 G / 55
Note: Ktot = Nb reads * (L - k + 1)
Cpeak: "20 million distinct kmers occur 55 times in all reads combined"
Base coverage = Cpeak / ((L - k + 1) / L)
Here: 69X = 55 / ((100 - 21 + 1) / 100)
Interpreting	kmer	graphs	(1/2)
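The three steps above can be sketched in Python as follows. The histogram is assumed to be a dictionary mapping depth to the number of distinct kmers at that depth, and `min_depth` is an assumed cut-off for discarding low-copy (mostly erroneous) kmers; both are choices made for this illustration, and the toy histogram is invented.

```python
def coverage_peak(hist, min_depth):
    """Depth with the most distinct kmers, ignoring depths below `min_depth`."""
    kept = {depth: n for depth, n in hist.items() if depth >= min_depth}
    return max(kept, key=kept.get)


def genome_size(hist, min_depth):
    """Genome size = Ktot / Cpeak, with Ktot counted after removing low-copy kmers."""
    peak = coverage_peak(hist, min_depth)
    total_kmers = sum(depth * n for depth, n in hist.items() if depth >= min_depth)
    return total_kmers / peak


def base_coverage(peak, read_len, k):
    """Base coverage = Cpeak / ((L - k + 1) / L)."""
    return peak * read_len / (read_len - k + 1)


# Invented toy histogram: an error spike at depth 1-2, a peak at depth 10.
toy = {1: 1000, 2: 100, 9: 20, 10: 60, 11: 20}
print(coverage_peak(toy, min_depth=5))  # 10
print(genome_size(toy, min_depth=5))    # 100.0

# The numbers from the slide.
print(round(80e9 / 55 / 1e9, 2))        # 1.45 (Gbp)
print(base_coverage(55, 100, 21))       # 68.75, i.e. about 69X
```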
Repeats: first shot
The nb of distinct kmers in the single-copy peak corresponds roughly to the single-copy genome size
[Histogram regions: Single-copy, Repeats]
Example
Beetle: 0.75 Gbp is single-copy, so almost 40% of the 1.2 Gbp genome is repeated (kmer=27)
Interpreting	kmer	graphs	(2/2)
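A rough Python sketch of this first-shot repeat estimate. The window `[lo, hi]` that delimits the single-copy peak is picked by eye in practice; here it is simply an assumed parameter, and the toy histogram is invented.

```python
def single_copy_size(hist, lo, hi):
    """Number of distinct kmers whose depth falls inside the single-copy peak."""
    return sum(n for depth, n in hist.items() if lo <= depth <= hi)


def repeat_fraction(single_copy_bp, genome_bp):
    """Share of the genome that is not single-copy."""
    return 1 - single_copy_bp / genome_bp


toy = {1: 1000, 9: 20, 10: 60, 11: 20, 30: 5}
print(single_copy_size(toy, lo=5, hi=15))  # 100

# Beetle example from the slide: 0.75 Gbp single-copy out of 1.2 Gbp.
print(round(repeat_fraction(0.75e9, 1.2e9), 3))  # 0.375, "almost 40%"
```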
Heterozygosity	and	ploidy
…and	humans	are	easy.	
Bacteria,	archaea,	
fungi,	some	plants	
Most	animals,	
some	plants		
Many	plants	
Also: Heterozygosity is generally very low in mammals; most other species are much harder
Heterozygosity	with	kmer	graphs
Double peak in the kmer histogram; clear indication of heterozygosity
Not entirely easy to quantify (although attempts have been made)
A	word	on	quality	filtering…	
Light	QC	filter	 Hard	QC	filter	
A word of caution on quality filtering!
Heterozygosity	with	kmer	graphs
[Figure: Fig4.1 17-mer depth distribution. X axis: Depth (X), 0 to 100; Y axis: Percentage (%), 0 to 2.]
Table4.2 17-mer Data statistics
K    K-mer_NO.        Peak_depth   Genome Size    Used Bases        Used Reads    X
17   22,425,038,045   32           700,782,438    26,827,492,429    275,153,399   38.28
Total 26.82Gb data was retained for 17-mer analysis. The 17-mer frequency distribution derived from the sequenced reads was plotted in Fig1; the peak of the 17-mer distribution is about 32, and the total K-mer count is 22,425,038,045, then the genome size can be estimated (by …
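The figures in Table4.2 are consistent with the Ktot / Cpeak rule from the earlier slides, which a short Python check makes explicit. The report text is cut off before it states its formula, so the assumption here is that genome size is the total K-mer count divided by the peak depth (rounded down) and that X is used bases divided by genome size.

```python
def estimate_from_table(total_kmers, peak_depth, used_bases):
    """Return (genome size, sequencing depth X) from 17-mer statistics."""
    genome_size = total_kmers // peak_depth  # integer division, as the table appears to use
    depth_x = used_bases / genome_size
    return genome_size, depth_x


size, depth_x = estimate_from_table(
    total_kmers=22_425_038_045,
    peak_depth=32,
    used_bases=26_827_492_429,
)
print(size)               # 700782438
print(round(depth_x, 2))  # 38.28
```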
Fig4.2 Hybrid effect on K-mer distribution.
The X axis is the depth of 17-mer and the Y axis is the ratio of 17-mer. Epi is the 17-mer curve of herring. H_0.01067 means that the heterozygosis rate is 1.067%, H_0.012 is 1.2%, and H_0.015 is 1.5%.
From this figure, we can see that as the heterozygosis rate increases, the sub-peak becomes more apparent at half of the expected K-mer depth on the X axis. We can conclude that the heterozygosis rate of the herring genome is about 1.5%.
[Figure: X axis: Depth (X), 0 to 80; Y axis: Percentage, 0 to 2; curves: H_0.01067, Epi, H_0.012, H_0.015.]
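The "sub-peak at half of the expected depth" signal can be looked for programmatically. The following Python sketch is a naive heuristic, not the method used in the report: it finds local maxima in a depth histogram and checks whether one lies near half of the main peak. The tolerance, the minimum depth and the toy histograms are assumptions, and real histograms usually need smoothing first.

```python
def local_peaks(hist, min_depth=3):
    """Depths whose count is strictly higher than both neighbouring depths."""
    peaks = []
    for depth in sorted(d for d in hist if d >= min_depth):
        if hist[depth] > hist.get(depth - 1, 0) and hist[depth] > hist.get(depth + 1, 0):
            peaks.append(depth)
    return peaks


def has_half_depth_peak(hist, main_peak, tolerance=0.2, min_depth=3):
    """True if some other local peak lies within `tolerance` of main_peak / 2."""
    target = main_peak / 2
    return any(
        abs(peak - target) <= tolerance * target
        for peak in local_peaks(hist, min_depth)
        if peak != main_peak
    )


# Invented histograms with a main peak at depth 32 (as in the herring table).
heterozygous = {14: 5, 15: 8, 16: 12, 17: 8, 18: 5, 31: 20, 32: 30, 33: 20}
homozygous = {31: 20, 32: 30, 33: 20}
print(has_half_depth_peak(heterozygous, main_peak=32))  # True
print(has_half_depth_peak(homozygous, main_peak=32))    # False
```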
The	heterozygosity	was	estimated	to	be	1.5%
Heterozygosity	with	kmer	graphs
Estimating	repeats	with	kmer	graphs
Why repeats destroy assemblies
Genome assembly - things to think about
Repeat	library	and	repeat	quantification
Create a de novo repeat library
•  Run a low-coverage (e.g. 0.1X) assembly (e.g. RepeatExplorer or Trinity)
•  Filter contaminants and mito/chloro
•  [ Make non-redundant (e.g. Cdhit) ]
•  Quantify the (high) repeat content by an independent subset of reads
   - Mapping (e.g. bwa), or
   - Mask with RepeatMasker
A real example
Coverage
%GC
5 Mbp mitochondrion in spruce
Repeat library from low coverage data
R   R   R'   R   R''
Sparse seq data
Overlaps?
Assembled contigs
Warning! Beware of contaminations, plastids etc
Repeat library from low coverage data
Quantify your repeat seqs
R   R   R'   R   R''
Independent set of sparse data
Screen reads with repeat seqs
33% of all bases in the reads are covered by repeat seqs
⇔
33% of the genome is "repeated"
Warning! The quantification depends heavily on the size of the original read set
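Once an independent read subset has been screened against the repeat library, the repeat content is just the share of masked bases. A minimal Python sketch, assuming the masking step was run in soft-masking mode so that repeat-covered bases are lowercase (an assumption about how the masked reads are written; the function name is illustrative):

```python
def masked_fraction(sequences):
    """Fraction of all bases that are soft-masked (lowercase) across the reads."""
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        return 0.0
    masked = sum(1 for seq in sequences for base in seq if base.islower())
    return masked / total


# Invented example: 4 of 12 bases are covered by repeat sequences.
reads = ["ACGTacgt", "ACGT"]
print(round(masked_fraction(reads), 2))  # 0.33, i.e. about 33% "repeated"
```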
Classifying	repeats
LTR Gypsy/Copia
LINE/SINE
DNA elements
…
This is very tricky…
Classifying the repeat library directly
•  RepeatMasker
•  Repeat protein domain search (http://www.repeatmasker.org/cgi-bin/RepeatProteinMaskRequest)
Problems
•  No close homologs in databases
•  Rapid evolution of repeats (like transposable elements, TEs)
•  Non-autonomous TEs do not contain proteins
Solutions
•  Fetch intact ORFs from hits in assembly
•  Extend assembly matches and get more complete elements
•  Check match alignment profiles in assembly (LINEs conserved at 3' end but not at 5'..)
=> Often slow, manual, species-specific solutions
Know	the	technology	bias
[Figure: coverage distribution on hg19 for 454, Illumina and SOLiD, with the average coverage marked. X axis: Coverage (0 to 100); Y axis: Number of Mb's in hg19 (0 to 350).]
• Stephan	C.	Schuster	(Penn	U)
Clark M.J., et al. Nat. Biotech (2011). Performance comparison of exome DNA sequencing technologies. (Mike Snyder's lab)
Ning L., et al. Scientific Reports (2015)
Know	the	assembly	algorithms
Short Reads (Illumina) - graph assembly: adapter removal, quality trimming, error correction, de Bruijn or string graph construction, contigs, scaffolding with read pairs (gaps as NNNNNN), read mapping
Long Reads (PacBio) - HGAP assembly: reads (by read length), read self-correction, overlap-layout-consensus assembly, consensus calling with quiver, assembled genome
1 pre-processing 2 assembly 3 finishing/polishing
the overall assembly strategy is the same…
…but the data and tools are fundamentally different
http://www.lucigen.com/NxSeq-Long-Mate-Pair-Library-Kit/
Many instruments… to
Assembler Name   Algorithm       Input
Arachne          OLC             Sanger
CAP3             OLC             Sanger
TIGR             Greedy          Sanger
Newbler          OLC             454/Roche
Edena            OLC             Illumina
SGA              OLC             Illumina
MaSuRCA          De Bruijn/OLC   Illumina
Velvet           De Bruijn       Illumina
ALLPATHS         De Bruijn       Illumina/PacBio
ABySS            De Bruijn       Illumina
SOAPdenovo       De Bruijn       Illumina
CLC              De Bruijn       Illumina/454
CABOG            OLC             Hybrid
OLC	vs.	de	Bruijn
OLC
• Pros:	Can	use	longer	reads	properly	
• Cons: Time consuming, high memory requirements
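The "overlap" step of overlap-layout-consensus, and why it is time consuming, can be illustrated with a toy Python sketch. It only finds exact suffix/prefix matches, whereas real OLC assemblers use approximate alignment and indexing to avoid comparing every pair; the function names and reads are made up for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads: quadratic in the number of reads."""
    found = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n > 0:
                    found[(a, b)] = n
    return found


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
print(all_overlaps(reads))
```

Because whole reads are compared, long reads are used at full length, which matches the "pros" above; the all-against-all comparison is the source of the time and memory cost.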
de Bruijn
Generate assembly via de Bruijn
Martin & Wang, Nat. Rev. Genet. (2011)
• Pros: Computationally efficient, can work with large coverage short read datasets
• Cons: Sensitive to sequence errors, connection between assembly and read is lost, does not work so well with longer reads
De	Bruijn
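A toy de Bruijn graph in Python shows both the efficiency and the weak spots listed above. Reads are chopped into kmers, each kmer becomes an edge between its (k-1)-mer prefix and suffix, and contigs are read off unambiguous paths. This is a bare-bones sketch with invented reads; real assemblers add error pruning, bubble popping and both-strand handling.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a graph mapping each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Extend a contig from `start` while there is exactly one way forward."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:  # stop on cycles
            break
        contig += next_node[-1]
        seen.add(next_node)
        node = next_node
    return contig


graph = de_bruijn(["ATGGCG", "GGCGTG", "CGTGCA"], k=4)
print(walk(graph, "ATG"))  # ATGGCGTGCA

# A shared kmer (here "ACG", as a repeat or a sequencing error would create)
# makes a branch, and the walk has to stop there.
branching = de_bruijn(["TACGA", "GACGT"], k=3)
print(sorted(branching["CG"]))  # ['GA', 'GT']
```

Note that once the graph is built, the reads themselves are no longer needed, which is the "connection between assembly and read is lost" point.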
•  No easy way to determine best assembly/assembler
•  implemented heuristics are the key issue
•  Choice of approach depends on data being assembled
•  Currently efforts ongoing to establish best practices
•  Assemblathons and GAGE to evaluate existing solutions…
Some	recommendations
• Large eukaryote genome, Illumina data: Allpaths-LG (needs specific libraries), SOAPdenovo, SGA, Masurca, DISCOVAR
• Large eukaryote genome, additional longer reads: Masurca, Newbler, CABOG
• Small eukaryote or prokaryote genome, Illumina data: Spades, Masurca, SOAPdenovo, Abyss, Velvet, DISCOVAR
• Small eukaryote or prokaryote genome, mixed data: MIRA, Spades, Masurca, Newbler
• Need to run in parallel: Abyss, Ray
• Amplified data (Single Cell Genomics): Spades
Standard	contiguity	metrics
The devil is in the repeats
C   R   A   B
Mathematically best result:
Repeat errors
Overlapping non-identical reads
Collapsed repeats and chimeras
Wrong contig order
Inversions
ATCGGGTATATAG-CCTA
||||||| || || ||||
ATCGGGTGTACAGCCCTA
?
[Figure labels: A, B, A & B]
Collapsible repeat errors (worst!)
Know	how	to	patch	gaps/finalize
Gaps
CCS	vs	CLR
other options for assembling PacBio reads
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads
Hybrid	assemblies
Gaps
•  PacBio data cannot (currently) be assembled in its raw state
•  several strategies exist for correcting reads prior to assembly
•  correction without complementary technology used to be difficult
   –  until recently, was limited by computational power and SMRT cell throughput
PacBio data is noisy
Koren & Phillippy Curr Op Micro 2014
Hybrid	assemblers	(for	PacBio)
Hybrid	assemblers
Zimin	A.V.,	Marçais	G.,	Puiu	D.,	Roberts	M.,	Salzberg	S.L.,	Yorke	J.A.	Bioinformatics	(2013)
Pure	PacBio
Finishing/Polishing	(Olli-Pekka)
quiver isn't perfect
using Pilon to polish remaining indels
•  makes use of short read mapping to identify potential indels, SNPs, ambiguous bases, local misassemblies
$ java -Xmx16G -jar path/to/pilon-1.8.jar \
    --genome path/to/fasta --unpaired path/to/mapping.bam \
    --output sample_name --changes --variant --tracks \
    --mindepth 100
Pilon removed 128 remaining indels in 3.8 Mbp genome despite …
LETTER doi:10.1038/nature15714
Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum
Robert VanBuren*, Doug Bryant*, Patrick P. Edger, Haibao Tang, Diane Burgess, Dinakar Challabathula†, Kristi Spittle, Richard Hall, Jenny Gu, Eric Lyons, Michael Freeling, Dorothea Bartels, Boudewijn Ten Hallers, Alex Hastie, Todd P. Michael & Todd C. Mockler
Plant genomes, and eukaryotic genomes in general, are typically
repetitive, polyploid and heterozygous, which complicates genome
assembly1
. The short read lengths of early Sanger and current
next-generation sequencing platforms hinder assembly through
complex repeat regions, and many draft and reference genomes
are fragmented, lacking skewed GC and repetitive intergenic
sequences, which are gaining importance due to projects like
the Encyclopedia of DNA Elements (ENCODE)2
. Here we report
the whole-genome sequencing and assembly of the desiccation-
tolerant grass Oropetium thomaeum. Using only single-molecule
real-time sequencing, which generates long (>16 kilobases)
reads with random errors, we assembled 99% (244megabases)
of the Oropetium genome into 625 contigs with an N50 length of
2.4megabases. Oropetium is an example of a ‘near-complete’ draft
genome which includes gapless coverage over gene space as well as
intergenic sequences such as centromeres, telomeres, transposable
elements and rRNA clusters that are typically unassembled in draft
genomes. Oropetium has 28,466 protein-coding genes and 43%
repeat sequences, yet with 30% more compact euchromatic regions
it is the smallest known grass genome. The Oropetium genome
demonstrates the utility of single-molecule real-time sequencing for
assembling high-quality plant and other eukaryotic genomes, and
serves as a valuable resource for the plant comparative genomics
community.
The genomes of Arabidopsis3
, rice4
, poplar, grape and Sorghum5
were first sequenced using high-quality and reiterative Sanger-based
approaches producing a series of ‘gold standard’ reference genomes.
The advent of next-generation sequencing (NGS) technologies reduced
and comparative genomics, although draft genomes are now avail-
able for most agriculturally important grasses1
. The largest genome
assemblies, such as maize (2,300megabases (Mb))7
, barley (5,100Mb)8
and wheat (hexaploid, 17,000Mb)9
are highly fragmented as a result
of the inability of current sequencing technologies to span complex
repeat regions. Near-finished reference genomes are available for rice4
,
Sorghum5
and Brachypodium10
, but more high-quality grass genomes
are needed for comparative genomics and gene discovery. Here we pres-
ent the ‘near-complete’ draft genome of the grass Oropetium thomaeum,
the first high-quality reference genome from the Chloridoideae sub-
family. The draft genome is near complete because we were able to
sequence through complex repeat regions that are unassembled in most
draft genomes. Oropetium has the smallest known grass genome at
245Mb and is also a resurrection plant that can survive the extreme
water stress such as loss of >95% of cellular water (Fig. 1)11
.
Single-molecule real-time (SMRT) sequencing (Pacific Biosciences)
produces long and unbiased sequences, which enables assembly of
complex repeat structures and GC- and AT-rich regions that are often
unassembled or highly fragmented in NGS-based draft genomes. We
generated ~72× sequencing coverage of the Oropetium genome using
32 SMRT cells on the PacBio RS II platform (which is equivalent to <1
week of sequencing time and <US$10,000 in reagents). The resulting
sequence had a read N50 length of over 16 kilobases (kb), and there was
10× coverage of reads over 20 kb in length (Extended Data Fig. 1a). The
raw reads were error-corrected using the hierarchical genome assembly
process (HGAP), and the longest reads (>16 kb) were assembled using
Celera assembler followed by two rounds of genome polishing using
Quiver [12]. The assembly contains 650 contigs spanning 99% (244 Mb)
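The coverage figures quoted in this excerpt follow from simple arithmetic: coverage is the total number of sequenced bases divided by the genome size. The sketch below only restates the numbers from the excerpt (72× coverage, 245 Mb genome, 32 SMRT cells); the function and variable names are illustrative, and the per-cell yield is a derived average, not a figure reported in the source.

```python
def total_bases(coverage, genome_size_bp):
    """Total sequenced bases implied by a given fold coverage."""
    return coverage * genome_size_bp


GENOME_SIZE_BP = 245_000_000   # Oropetium genome size quoted in the excerpt
COVERAGE = 72                  # approximate fold coverage quoted in the excerpt
SMRT_CELLS = 32                # number of SMRT cells quoted in the excerpt

sequenced_bp = total_bases(COVERAGE, GENOME_SIZE_BP)
per_cell_bp = sequenced_bp / SMRT_CELLS  # derived average yield per cell

print(f"total sequenced: {sequenced_bp / 1e9:.2f} Gb")
print(f"average per SMRT cell: {per_cell_bp / 1e6:.1f} Mb")
```

The same calculation can be run in reverse when planning a project: pick a target coverage, multiply by the estimated genome size, and divide by the expected yield per run.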
Know	how	to	annotate
Annotation (Jarkko)
BILS assembly and annotation service
Henrik Lantz, Team leader
Mahesh Panchal, Assembly
Jacques Dainat, Annotation
Martin Norling, Assembly
Lucile Soler, Annotation
5 PhDs, all in Uppsala
• Annotation 2 years, assembly 1 year
• Not driving own research, focusing on support
• 80 h of free support to all projects - submitted by customer
• Dedicated compute cluster for annotation, ~160 cores
• Assemblies run on shared cluster, ~3200 cores
• All organisms - all types of data
• Close contact with sequencing facilities
Annotation/Assembly technology
Assembly
Perl/Make pipeline
• Pre-assembly
– Quality control
– k-mer analyses
• Assembly
– Different assembly programs
• Assembly validation
– FRCbam
– Quast
– Own tools
Annotation
• Maker-MPI
– proteins
– RNA-seq
• Refinement scripts
• Functional annotation
– Blast
– Synteny
Know	how	to	validate
Assembly validation
Assembly validation… is it important?
Sometimes, easy questions are the most difficult:
• Is my de novo assembly correct?
• Which assembler do I need to use?
• I just used all the possible assemblers one can think of… How do I pick one now?
• Does my assembly contain genes?
• Is my assembly good enough to perform gene annotation?
Assembly validation
Assembly validation is extremely difficult
• Too often only connectivity measures are used
• There is no real solution, only a set of best practices that one can follow
Assembly validation has recently received a lot of attention.
Evaluating assemblies with a reference
Counting errors is not always possible:
• A reference is almost always absent.
• Error types are not weighted accordingly.
Visualization is useful, however:
• No automation
• Does not scale to large genomes
Wow… it looks difficult even with the answer.
Evaluating assemblies without a reference
• Statistics (N50, etc.)
• Congruency with raw sequencing data:
– Alignments
– QAtools
– FRCbam
– REAPR
• Gene space
– CEGMA
– reference genes
– transcriptome
There is no real recipe or single tool. We can only suggest some best practices.
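As a concrete instance of the "Statistics (N50, etc.)" item, N50 can be computed directly from contig lengths: it is the length of the contig at which the running total, counted from the longest contig down, first reaches half of the assembly size. This is a minimal sketch; the function name and the example lengths are illustrative and not taken from the slides.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or read) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
contig_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print("N50:", n50(contig_lengths))
```

N50 is a contiguity measure only: an assembler that joins contigs incorrectly raises N50 while lowering correctness, which is why the slide lists it next to congruency and gene-space checks.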
Your reads are often the best source to validate your assemblies
Post assembly… am I on the right track?
• Check your library insert sizes again (Picard Tools, http://picard.sourceforge.net/), for both PE and MP libraries
[Figure: three insert size histograms (insert size vs. count; FR, RF and TANDEM orientations) for All_Reads in files MP_on_masurca_sorted.bam, PE_on_masurca_sorted.bam and 7_130425_AD1YUEACXX_P469_101_index12_trimmed-to-assembly.abyss.scaf_onlyAligned.bam]
• Failed MP or bad assembly?
• Plot coverage vs %GC vs length
Look at the plots and at the tables; duplication rate is an important measure. You need to check whether the plots coincide with what you expect.
[Figure: Plotting coverage and GC content. Panels: coverage vs. GC, coverage frequency histogram, contig length (kbp) vs. coverage. Annotated clusters: your genome, mitochondrion, contaminations]
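The insert-size check above can also be summarised numerically once the insert sizes have been extracted from the alignments. The sketch below is not Picard's implementation; it assumes the insert sizes are already available as a plain list of integers (for example taken from the template-length column of the alignments), and the function name, the `max_dev` cut-off and the example values are hypothetical.

```python
import statistics


def insert_size_summary(sizes, max_dev=3.0):
    """Median, median absolute deviation (MAD) and outlier fraction."""
    abs_sizes = [abs(s) for s in sizes if s != 0]  # ignore unpaired records
    median = statistics.median(abs_sizes)
    mad = statistics.median([abs(s - median) for s in abs_sizes])
    # A pair is flagged when it deviates more than max_dev MADs from the median.
    outliers = [s for s in abs_sizes if abs(s - median) > max_dev * mad]
    return {
        "median": median,
        "mad": mad,
        "outlier_fraction": len(outliers) / len(abs_sizes),
    }


# Hypothetical paired-end insert sizes with one suspiciously long pair.
summary = insert_size_summary([300, 310, 290, 305, 295, 5000])
print(summary)
```

A high outlier fraction, or a median far from the library's nominal insert size, is the numerical counterpart of the "failed MP or bad assembly?" question on the slide.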
Data congruency
Idea: map read pairs back to the assembly and look for discrepancies such as:
• no read coverage
• no span coverage
• too long/short pair distances
Reads can be aligned back to the assembly to identify “suspicious” features.
But what do we do with these features?
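The "no read coverage" discrepancy can be sketched in a few lines: given where reads align on a contig, compute the per-base depth and report the stretches that fall below a minimum. This is an illustrative sketch, not how FRCbam or REAPR are implemented; it assumes alignments are given as half-open `(start, end)` intervals lying within the contig, and all names and example coordinates are hypothetical.

```python
def uncovered_regions(contig_length, alignments, min_depth=1):
    """Return half-open (start, end) regions with depth below min_depth."""
    # Difference array: +1 where an alignment starts, -1 where it ends.
    diff = [0] * (contig_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1

    regions = []
    depth = 0
    region_start = None
    for pos in range(contig_length):
        depth += diff[pos]
        if depth < min_depth:
            if region_start is None:
                region_start = pos          # a low-coverage run begins
        elif region_start is not None:
            regions.append((region_start, pos))  # the run ended at pos
            region_start = None
    if region_start is not None:
        regions.append((region_start, contig_length))
    return regions


# Hypothetical read alignments on a 20 bp contig.
reads = [(0, 5), (3, 8), (12, 20)]
print(uncovered_regions(20, reads))
```

Passing the intervals spanned by whole read pairs (from the start of one mate to the end of the other) instead of single reads gives the corresponding "no span coverage" check.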
FRCbam (Vezzi et al. 2012)
Features
4 coverage-related features:
• LOW_COV_PE, HIGH_COV_PE, LOW_NORM_COV_PE, and HIGH_NORM_COV_PE
4 features for compression/expansion events (CE stats):
• COMPR_PE, STRECH_PE, COMPR_MP, and STRECH_MP
6 features on suspicious pair/mate orientations:
• HIGH_SINGLE_PE and HIGH_SINGLE_MP
• HIGH_SPAN_PE and HIGH_SPAN_MP
• HIGH_OUTIE_PE and HIGH_OUTIE_MP
[Figure: diagrams of regions A, B, C and repeats R1, R2, with example read alignments (AGAGCTAGC, AGATCTCGC)]
FRCurve
FRCbam predicted the “Assemblathon 2” outcome
The Feature Response Curve (FRCurve) characterizes the sensitivity (coverage) of the sequence assembler as a function of its discrimination threshold (number of features).
Feature Response Curve:
• Overcomes the limits of standard indicators (e.g. N50)
• Captures the trade-off between quality and contiguity
• Deeply connected to ROC curves
• Features can be used to identify problematic regions
• Single features can be plotted to identify assembler-specific bias
[Figure: FRCurve, feature space rhody TOTAL. x-axis: feature threshold (0 to 8,000); y-axis: approximate coverage (%). Assemblers: SGA, Ray, CLC, SOAPdenovo, ALLPATHS-LG PB, ABySS, MSRA-CA, CABOG PB, CABOG, VELVET, ALLPATHS-LG]
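The definition above (coverage as a function of a feature threshold) can be sketched as follows, using the usual construction: sort contigs from longest to shortest, keep adding contigs while the cumulative number of features stays within the threshold, and report the fraction of the genome those contigs cover. This is a simplified illustration, not FRCbam's code; it assumes a per-contig feature count and an estimated genome size are already available, and the names and numbers are hypothetical.

```python
def feature_response_curve(contigs, genome_size, thresholds):
    """contigs: list of (length, n_features). Returns (threshold, coverage %)."""
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)
    curve = []
    for threshold in thresholds:
        features = 0
        covered = 0
        for length, n_features in ordered:
            # Stop at the first contig that would exceed the feature budget.
            if features + n_features > threshold:
                break
            features += n_features
            covered += length
        curve.append((threshold, 100.0 * covered / genome_size))
    return curve


# Hypothetical assembly: (contig length, number of suspicious features).
assembly = [(500, 3), (300, 0), (200, 5)]
for threshold, coverage in feature_response_curve(assembly, 1000, [0, 3, 8]):
    print(threshold, coverage)
```

An assembly whose curve rises steeply reaches high coverage with few suspicious features, which is the quality versus contiguity trade-off the slide refers to.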
Features and PCA
[Figure: two PCA plots (PCA1 vs. PCA2) of assembly features, one point per assembly, labelled by organism: bifido, clap, clap19, copro, ecoli, egg, entero, eubac, fragilis, fuso7, fusonuke, kleb, rhody, staphylocossus, strep, swig, tim]
Assembled 18 bacterial genomes with 11 assemblers (Illumina + PacBio data)
PCA performed on features:
• Assemblies of the same organism (family) tend to cluster
• No clear difference when using PacBio data
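To make the PCA step concrete, the sketch below projects a small matrix of per-assembly feature counts onto its first principal component, using power iteration on the covariance matrix. It is a pure-Python illustration under stated assumptions: the input is a list of equally long numeric rows (one row per assembly, one column per feature type), the numbers are invented, and a real analysis would normally use a numerical library and look at more than one component.

```python
import math


def first_principal_component(rows, iterations=500):
    """Return (direction, scores) for the first principal component."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in rows]
    # Sample covariance matrix of the feature columns.
    cov = [[sum(c[i] * c[j] for c in centered) / (n_rows - 1)
            for j in range(n_cols)] for i in range(n_cols)]
    # Power iteration; the start vector is arbitrary and could, in contrived
    # cases, be orthogonal to the leading direction.
    vec = [1.0 / (j + 1) for j in range(n_cols)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(n_cols))
               for i in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    scores = [sum(c[j] * vec[j] for j in range(n_cols)) for c in centered]
    return vec, scores


# Hypothetical counts of two feature types in four assemblies.
feature_counts = [[1, 2], [2, 4], [3, 6], [4, 8]]
direction, scores = first_principal_component(feature_counts)
print(direction, scores)
```

Plotting the scores of each assembly on the first two components is what produces cluster plots like the ones on the slide.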
REAPR (Hunt et al. 2013)
Uses the same principle as FRCurve:
• Identifies suspicious/erroneous positions
• Breaks assemblies at suspicious positions
• The “broken assembly” is more fragmented but hopefully more correct (REAPR cannot make things worse…)
Conserved core (species) gene space
Gene space
CEGMA (http://korflab.ucdavis.edu/datasets/cegma/)
HMMs for 248 core eukaryotic genes aligned to your assembly to assess completeness of gene space
“complete”: 70% aligned
“partial”: 30% aligned
Similar idea based on aa or nt alignments of
• Gold standard genes from own species
• Transcriptome assembly
• Reference species protein set
Use e.g. GSNAP/BLAT (nt), exonerate/SCIPIO (aa)
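The gene-space idea can be sketched as a simple tally: for each reference gene, take the fraction of its length that aligns to the assembly and bin it. The sketch below is not CEGMA's scoring; it assumes the 70% and 30% figures on the slide act as minimum aligned fractions for "complete" and "partial" (an interpretation, since the comparison signs were lost), and the gene names and fractions are invented.

```python
def classify_gene(aligned_fraction, complete_min=0.70, partial_min=0.30):
    """Bin one gene by the fraction of its length aligned to the assembly."""
    if aligned_fraction >= complete_min:
        return "complete"
    if aligned_fraction >= partial_min:
        return "partial"
    return "missing"


def gene_space_summary(aligned_fractions):
    """Percentage of genes in each class, from {gene: aligned fraction}."""
    counts = {"complete": 0, "partial": 0, "missing": 0}
    for fraction in aligned_fractions.values():
        counts[classify_gene(fraction)] += 1
    total = len(aligned_fractions)
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical aligned fractions for four reference genes.
fractions = {"geneA": 0.95, "geneB": 0.50, "geneC": 0.10, "geneD": 0.70}
print(gene_space_summary(fractions))
```

The same tally works whether the fractions come from core-gene HMM hits, a transcriptome assembly, or a reference protein set aligned with the tools listed on the slide.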
Other external validation methods
• Restriction map
◦ Representation of the cut sites on a given DNA molecule to provide spatial information of genetic loci
Optical maps can be used to check assembly correctness. Long PacBio reads can be used as well.
• De novo reconstructs parts missing in the reference strain
• Correctly assembles long tandem repeats
[Figure: de novo assembly (Illumina, PGM) giving a set of unordered and unoriented contigs; optical map; DNA sequence contigs]
Don’t panic. And don’t rush
Keeping up with the development can be stressful, so you need to stay calm
• Choose quality before quantity
• Know your biological system so you know what to expect
• Combine sequencing with other data
• Share knowledge and be nice to your bioinformatics friends
For each conclusion, ask yourself if it can be an artefact due to
• Incomplete assembly
• Repeats
• Indels
• Coverage bias
• Divergent sequences (mapping)
Know	that	your	final	assembly	will	be	incomplete
Things	that	are	not	there
[Figure: chromosome ideograms for chromosomes 1 to 22 and X (scale bar 100 Mb). Legend: closed gap, inversion, complex event; STR density from low to high]
Extended Data Figure 3 | Genome distribution of closed gaps and
insertions. Chromosome ideogram heatmap depicts the normalized density of
inserted CHM1 base pairs per 5-Mb bin with a strong bias noted near the end of
most chromosomes. Locations of structural variants and closed gaps are given
by coloured diamonds to the left of each chromosome: closed gap sequences
(red), inversions (green), and complex events (blue).
Chaison	M.J.P	et	al.	Nature	(2014)
…by high-throughput DNA sequencing (ChIP-seq) analysis (Supplementary
Information). We identified a significant 15-fold enrichment of short
tandem repeats (STRs) when compared to a random sample (P < 0.00001)
(Fig. 1a). A total of 78% (39 out of 50) of the closed gap sequences were
composed of 10% or more of STRs. The STRs were frequently embedded
in longer, more complex, tandem arrays of degenerate repeats reaching
up to 8,000 bp in length (Extended Data Fig. 1a–c), some of which
bore resemblance to sequences known to be toxic to Escherichia coli [16].
Because most human reference sequences [17,18] have been derived from
clones propagated in E. coli, it is perhaps not surprising that the
application of a long-read sequence technology to uncloned DNA would
resolve such gaps. Moreover, the length and complex degeneracy of these
STRs embedded within (G+C)-rich DNA probably thwarted efforts to
follow up most of these by PCR amplification and sequencing.
Next, we developed a computational pipeline (Extended Data Fig. 2)
to characterize structural variation systematically (structural variation
defined here as differences ≥50 bp in length, including deletions,
duplications, insertions and inversions [7]). Structural variants were discovered
by mapping SMRT sequencing reads to the human reference genome [11]
[Figure 1, panels a and b. Axis labels: proportion of region with simple repeats (gaps vs. reference; P < 2.2 × 10^-16); (G+C) content for reference flank, gap closure and tandem repeat, shown for gaps with only tandem repeats, gaps without tandem repeats and sampled reference. Reported P values: P = 0.02712, P = 0.00003, P < 0.00001]
Figure 1 | Sequence content of gap closures. a, Gap closures are enriched
for simple repeats compared to equivalently sized regions randomly sampled
from GRCh37. b, Human genome gaps typically consist of (G+C)-rich
sequence (yellow) flanking complex (A+T)-rich STRs (green) (empirical
P value; Supplementary Information). Red line indicates genomic (G+C)
content.
Things	that	are	not	there
Steinberg K.M. et al. Genome Research (2014)
Figure 5. Overview of the Chr 11 (NC_018922.2) 1.9-Mb region, exhibiting three alignment bins with a large number of PacBio “cliff” reads where the
alignment coverage dropped off sharply. WGS component (light green lines) boundaries flanked by such reads are marked with red dashed lines. The ends
of each component at the boundary are labeled with letters to show orientation. Pairs of alignments corresponding to three different PacBio reads are
marked in yellow, green, and dark blue. These alignments overlap by < 10% on each of the reads. The split alignments for these three reads suggest that
the two WGS components marked in purple should be inverted and translocated as indicated by the arrow at the top of the image. The other PacBio reads
in these bins exhibit the same pattern of split alignments, which supports the proposed reordering and orientation of the WGS components. The bottom
light green lines show a proposed tiling path with the orientation corrected; the letters indicate where each end of the initial tiling path components should
be placed.
Summary
• Genome size and repeat content can be estimated without an assembly.
• Removing adapters and trimming low-quality (low QV) bases is good unless the assembly program does error correction (EC) itself.
• Assess the levels of heterozygosity in your target genome before you assemble (or sequence) it and set your expectations accordingly.
• Choose an assembler that excels in the area you are interested in (e.g., coverage, continuity, or number of error-free bases) and make libraries for it.
• Interested only in coding potential analyses (e.g., training a gene finder, studying codon usage bias, looking for intron-specific motifs)? => Consider studying exome assemblies.
• Or consider a proxy: study a species that is evolutionarily close enough and whose genome is of good quality.
Settle on an assembly so science can continue!
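The first summary point, that genome size can be estimated without an assembly, usually relies on a k-mer histogram of the raw reads: the total number of k-mers divided by the depth of the main histogram peak approximates the genome size. The sketch below is a toy illustration under stated assumptions: reads are plain strings held in memory, k-mers seen fewer than `min_depth` times are treated as sequencing errors, and the names and the synthetic histogram are hypothetical. Real projects would use a dedicated k-mer counter and a model that also handles heterozygosity and repeats.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Map k-mer multiplicity -> number of distinct k-mers with it."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Total k-mers divided by the depth of the main histogram peak."""
    # Drop low-multiplicity k-mers, assumed to be sequencing errors.
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth


# Synthetic histogram: an error peak at depth 1 and a main peak at depth 10.
synthetic = {1: 1000, 9: 50, 10: 200, 11: 50, 20: 10}
print(estimate_genome_size(synthetic))
```

In this synthetic case the distinct k-mers at depth 20 are counted twice as heavily as those at the peak, which is how repeated sequence contributes to the size estimate.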
Know	the	future
A	vision	into	the	future
Acknowledgements
• Olga	Vinnere	Pettersson	
• Björn Nystedt
• Ola Spjuth
• Henrik Lantz
• Jacques Dainat
• Francesco	Vezzi	
• BGI	
• Jon	Badalamenti	(Bond	Lab)	
• Stephan	C.	Schuster	(Penn	U)

GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 

Helsinki genome project-20151210-amb