16th December 2015
Genomics 3.0: Big Data in
Precision Medicine
Asoke K Talukder, Ph.D
InterpretOmics, Bangalore, India
BDA 2015, 16th December 2015,
Hyderabad, India
Part III – Big Data Genomics
Multi Scale Big Data
Multi Omics Big Data
Big ‘OMICS’
(High-throughput) Data Domains
DNA-Seq
ChIP-Seq
RNA-Seq
Systems
Biology
Meta
Analysis
Population
Genetics
GWAS
Microarray
Exome-Seq
Repli-Seq
Small
RNA-Seq
Biological
Networks
Proteomics
Metagenomics
Model
• Create a virtual (or physical) entity that has the same physical
appearance as the original entity at a reduced scale
(space)
• Use Physical Science to create sensors that can sense
and quantify the input to the system causing Perturbation
• Use Physical Science to measure the output of the
Perturbed model entity
• Use Mathematical (or Statistical) science that can simulate
the function and behaviour of the original entity in reduced
space and reduced time with perturbation
Dimensions of Big Data
The 7 Vs of Genomic Big Data
• Volume is defined in terms of the physical volume of the data that need to be online, like gigabytes (10^9),
terabytes (10^12), petabytes (10^15), exabytes (10^18) or even beyond.
• Velocity is about the data-retrieval time or the time taken to service a request. Velocity is
also measured through the rate of change of the data volume.
• Variety relates to heterogeneous types of data like text, structured, unstructured, video, audio
etcetera.
• Veracity is another dimension to measure data reliability - the ability of an organization to trust the data
and be able to confidently use it to make crucial decisions.
• Vexing covers the effectiveness of the algorithm. The algorithm needs to be designed to ensure that
data processing time is close to linear and the algorithm does not have any bias; irrespective of the
volume of the data, the algorithm is able to process the data in reasonable time.
• Variability is the scale of data. Data in biology is multi-scale, ranging from sub-atomic ions at
picometers, macro-molecules, cells, tissues and finally to a population [9] at thousands of kilometers.
• Value is the final actionable insight or the functional knowledge. The same
mutation in a gene may have a different effect depending on the population or the
environmental factors.
Types of Genomic Big Data
1. Patient Data (n = 1)
2. Perishable (n = 1)
3. Persistent (n = N)
4. Phenotypic (n = N)
5. Clinical (n = N)
6. Biological/Molecular (n = N)
Applications of Next-Generation Sequencing
Frederick Sanger
• Frederick Sanger was born in Rendcomb, a small
village in Gloucestershire on August 13, 1918. He
completed his Ph.D. in 1943 on lysine metabolism and a
more practical problem concerning the nitrogen of
potatoes. Sanger's first triumph was to determine the
complete amino acid sequence of the two polypeptide
chains of insulin in 1955. It was this achievement that
earned him his first Nobel prize in Chemistry in 1958. By
1967 he had determined the nucleotide sequence of the
5S ribosomal RNA from Escherichia coli, a small RNA
about 115 nucleotides long. He then turned to DNA and,
by 1975, had developed the “dideoxy” method for
sequencing DNA molecules, also known as the Sanger
method. This has been of key importance in such
projects as the Human Genome Project and earned him
his second Nobel prize in Chemistry in 1980.
Sample generation and cluster generation
200,000 clusters per tile
62.5 million reads per lane
100 bp reads -> 12.5 Gb per lane
Prepare DNA or
cDNA fragments
Ligate adapters
100μm Random
array of clusters
Attach single molecules to
surface Amplify to form cl
Base Calling
Consecutive cycles
The identity of each base of a cluster is read from stacked sequence image
Sequence
Dideoxynucleotide Sequencing
Decoding the Book of Life
– milestone for Quantitative Biology
A Milestone for Humanity – the Human genome
Human Genome working draft announced, June 26th, 2000
Francis Collins, Bill Clinton, J. Craig Venter
3 billion base pairs => 6 G letters
&
1 letter => 1 byte
The whole genome can be recorded in
just 10 CD-ROMs!
In 2003, Human genome
sequence was deciphered!
• Genome is the complete set of genetic material (DNA), including all genes, of a living thing.
• In 2003, the human genome sequencing was completed.
• The human genome contains about 3 billion base pairs.
• The number of genes is estimated to be between 20,000 and
25,000.
• The difference between the genome of human and that of
chimpanzee is only 1.23%!
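The storage arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch: the 650 MB CD-ROM capacity and the 2-bit-per-base packing are assumptions added for illustration and do not come from the slide.

```python
import math

BASE_PAIRS = 3_000_000_000        # human genome size, from the slide
LETTERS = 2 * BASE_PAIRS          # the slide counts 6 G letters for 3 billion base pairs
CD_ROM_BYTES = 650 * 10**6        # assumed CD-ROM capacity (650 MB)


def discs_needed(total_bytes, disc_bytes=CD_ROM_BYTES):
    """Number of whole discs required to hold total_bytes."""
    return math.ceil(total_bytes / disc_bytes)


one_byte_per_letter = LETTERS * 1       # 1 letter => 1 byte, as on the slide
two_bits_per_letter = LETTERS * 2 // 8  # assumed packing: A, C, G, T in 2 bits each

print(discs_needed(one_byte_per_letter))
print(discs_needed(two_bits_per_letter))
```

With one byte per letter the data fit on the "10 CD-ROMs" quoted above; a 2-bit encoding would shrink that further, at the cost of not representing ambiguous bases such as N.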
Illumina Genome Analyzer (GA)
• The Genome Analyzer
sequences clustered template
DNA using a robust four-color
DNA Sequencing-By-
Synthesis (SBS) technology
that employs reversible
terminators with removable
fluorescence. This approach
provides a high degree of
sequencing accuracy even
through homopolymeric
regions.
NGS (Next Generation Sequencing)
Technology
How is Microarray Manufactured?
• Affymetrix GeneChip
• silicon chip
• oligonucleotide probes lithographically synthesized on the
array
• cRNA is used instead of cDNA
How Does Microarray Work?
Part IV – Biological Databases
Molecular Biology Databases …
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb,
ARR, AsDb, BBDB, BCGD, Beanref, BioImage,
BioMagResBank, BIOMDB, BLOCKS, BovGBASE,
BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
CARBHYD, CATH, CAZY, CCDC, CD4OLbase, CGAP,
ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb,
Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC,
ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db,
ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView,
GCRDB, GDB, GENATLAS, Genbank, GeneCards,
Genline, GenLink, GENOTK, GenProtEC, GIFTS,
GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB,
HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD,
HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB,
HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat,
KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB,
Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP,
Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us,
MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-lycBase,
OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB,
PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD,
PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE,
PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE,
SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase,
SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D,
SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-
MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB,
TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE,
VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD,
YPM, etc .................. !!!!
NCBI (National Center for Biotechnology
Information)
• over 30 databases including
GenBank, PubMed, OMIM, and
GEO
• Access all NCBI resources via
Entrez
(www.ncbi.nlm.nih.gov/Entrez/)
Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)
Protein Data Bank (PDB)
ENTREZ: A DISCOVERY SYSTEM
[Figure: Entrez links among Gene, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences and 3-D Structure. Link labels: Word weight, VAST, BLAST, Phylogeny, Hard Link, BLink, Domains. Neighbors: Related Sequences, Related Structures.]
• Pre-computed and pre-compiled data.
• A potential “gold mine” of undiscovered
relationships.
• Used less than expected.
PRECISE RESULTS
MLH1[Gene Name] AND Human[Organism]
UMLS Knowledge Source Server (UMLSKS)
Home Page
Unified Medical Language System
From top links or buttons
• Search 3 Knowledge Sources
From sidebar
• Downloads
• Documentation
• Resources
“Biologic Function” hierarchy
Semantic type                        Count
Biologic Function                      360
Pathologic Function                   9983
Physiologic Function                   691
Disease or Syndrome                  67716
Cell or Molecular Dysfunction         1276
Experimental Model of Disease           72
Organism Function                     1528
Organ or Tissue Function              2912
Cell Function                         4417
Molecular Function                   13442
Mental or Behavioral Dysfunction      5691
Neoplastic Process                   19436
Mental Process                        1224
Genetic Function                      1340
Part V – Algorithms
Algorithms
• An algorithm is a sequence of instructions that one
must perform in order to solve a well-formulated
problem
• First you must identify exactly what the problem is!
• A problem describes a class of computational tasks. A
problem instance is one particular input from that task
• In general, you should design your algorithms to work
for any instance of a problem (although there are cases
in which this is not possible)
• Unlike commercial software, which is data-intensive,
algorithms are science- and mathematics-intensive
Schematic representation of our implementation of the de Bruijn graph
Zerbino D. R., Birney E. Genome Res.;2008;18:821-829
©2008 by Cold Spring Harbor Laboratory Press
Example of Tour Bus error correction
Zerbino D. R., Birney E. Genome Res.;2008;18:821-829
©2008 by Cold Spring Harbor Laboratory Press
Breadcrumb algorithm
Zerbino D. R., Birney E. Genome Res.;2008;18:821-829
©2008 by Cold Spring Harbor Laboratory Press
• Overview of Human Disease
– Classifications, inheritance, mechanisms (cause)
• Databases
– OMIM (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim)
– Gene Clinics (http://www.geneclinics.org/)
– Mutation database (http://mutdb.org/)
– Oncomine (http://www.oncomine.org/)
– Cancer Genome Project (http://www.sanger.ac.uk/genetics/CGP/)
• Analysis of genes for molecular functions,
biological processes and pathways
– PANTHER (Protein ANalysis THrough Evolutionary Relationships)
(http://www.pantherdb.org/)
– Protein interaction network (http://string.embl.de/)
Why Statistics?
• Are results statistically significant?
• Many random processes are involved in biological
processes
• Many processes appear to be random but in reality
are non-random
• Many chances and uncertainties are involved in
biological data collection
• Statistical modeling of biological phenomena can
help to understand patterns in life
Deductive and Inductive Science
Ref: Sylvia Wassertheil-Smoller, Biostatistics and Epidemiology, Springer, 2003
Law of Gravitation,
Newton's Law of Motion
E = mc²
Biological Phenomenon
Simulation
Clinical Trial
Why Statistics?
Purpose of statistics is to draw
inferences from samples of data to
the population from which these
samples came
Or
Abstract an entity with average
behavior where the behavior of the
constituent parts cannot be
measured
Challenges in Computing
• Nature is a Tweaker
• Computers are efficient in discovering identity but
not similarity
• Biology needs similarity & not identity
• All Biology problems are different & unique
• Huge data generated by Next Generation
Sequencers with many errors
• Eliminate Noise from Information
• Minimize False Positive and False Negative
Most Biology Solutions are
NP-Hard
• No algorithm is known that solves an NP-hard problem in time
polynomial in the input size (NP stands for nondeterministic
polynomial time); if the data volume increases by x, the cost of an
exact solution grows much faster than x
• Getting exact solutions may not be possible for some
problems on some inputs, without spending a great deal of
time
• You may not know when you have an optimal solution, if
you use a heuristic
• Arriving at an exact solution is often impractical; however,
for problems in NP, a proposed solution can be verified in
polynomial time
• Sometimes exact solutions may not be necessary, and
approximate solutions may suffice. But, how good an
approximation does the solution need?
NGS: Experiment with an Open Mind
• The process (Wet Lab)
• Take DNA/RNA/cDNA/miRNA etc
• Break into tiny pieces
• Amplify them
• Read them as sequence of bases
• The process (Dry Lab)
• Analyze the data
• Extract information from data
• NGS Experiments are unbiased
• NGS can help discover many unknown patterns in
the genome/gene or cell
Next Generation Sequence Data
• FASTQ (Illumina)
• Sff (454)
• CCS (PacBio)
• ...
• Microarray
[Figure labels: Single End Sequences; Paired End or Mate-paired Sequences; Insert Size; Library Size; DNA/RNA/miRNA; cDNA/mRNA; Overlapped reads; Random Order & Orientation; Long reads; Short reads; Fixed length reads; Variable length reads; Circular Consensus reads; Hundreds to Billions Bases]
Paired-end/Mate-pair Data
Paul Medvedev, Monica Stanciu & Michael Brudno, Computational methods for discovering structural variation with next-generation sequencing,
Nature Methods Supplement| Vol.6 No.11s | November 2009
Roche 454 NGS Data (.sff)
FNA File content
>contig00001 length=439 numreads=17
CcTcGGCGACGCACTCCgTCTTTtCAGTCAAAGGTCGAGGCAGTtGAGGTTACCCCACCC
GTCCATCCGCCTTCGGCGGCTGTCCACCCTCCCCTCAAGGGGGAGGGGAACGCCCCGCCA
GGAACCCCGCCAATGACCGACGCCCCGACCGTTCTTtCCCCcACCGCCGAAGCCCCGGTC
GAAGGCCTGCCGTCGGGTTTCGGCGAAGGCATCGCCGGCAAGGCCGCATTTCTCATCGCC
QUAL File content
>contig00001 length=439 numreads=17
64 35 64 34 64 64 64 64 64 64 64 64 64 64 64 64 64 23 64 64 64 64 64 11 64 64 64 64 64 64 64 64 64 64
64 64 58 64 64 58 64 64 64 64 25 64 64 64 64 64 64 64 64 64 64 61 64 64 64 49
64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 53 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 61 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 64 16 64 64 64 64 18 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 61
64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
Phred quality score
Q is defined as a property that is logarithmically related to the base-calling error probability P:
Q = -10 * log10(P)
or
P = 10^(-Q/10)
• If Phred assigns a quality score of 30 to a base, the chance that this base is called
incorrectly is 1 in 1000. A common quality check is to count the bases with a quality
score of 20 and above (an error probability of at most 1 in 100). The high accuracy of Phred
quality scores makes them an ideal tool to assess the quality of sequences.
• In the one-character representation, ASCII codes below 32 are unprintable, so an offset of 33 or 64
is added to the Q value, depending on the vendor
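The two formulas and the character offset can be sketched in Python as follows. This is a minimal illustration; the function names are hypothetical and the default offset of 33 is an assumption (pass 64 for data that use the other encoding).

```python
import math


def phred_to_error_prob(q):
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality_string, offset=33):
    """Convert a one-character-per-base quality string to Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


scores = decode_quality("IIII")           # 'I' is ASCII 73, so Q = 73 - 33
high_quality = [q for q in scores if q >= 20]
print(scores, len(high_quality), phred_to_error_prob(30))
```

Decoding with the wrong offset shifts every score by 31, so it is worth confirming the encoding of a file before filtering on quality.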
@HWI-EAS107_1_4_1_113_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_213_501
ATTATAAATTGAAGCTTATACAAAAAACTCGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_313_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_413_501
TTATAAATTGAAGCTTCTTTAATCTTGGAGCAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@HWI-EAS107_1_4_1_513_501
TATAAATTGAAGCTTCTTTAATCTTGGAGCAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
>HWI-EAS107_1_4_1_113_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
>HWI-EAS107_1_4_1_213_501
ATTATAAATTGAAGCTTATACAAAAAACTCGAA
>HWI-EAS107_1_4_1_313_501
CATTATAAATTGAAGCTTATACAAAAAACTCGA
>HWI-EAS107_1_4_1_413_501
TTATAAATTGAAGCTTCTTTAATCTTGGAGCAA
>HWI-EAS107_1_4_1_513_501
TATAAATTGAAGCTTCTTTAATCTTGGAGCAAA
Data in FASTQ/FASTA Format
• For Paired-end sequences you have two
files with name
• _1 & _2 to indicate End_1 & End_2
• Within files you have matching record id
@HWI-EAS107_1_4_1_113_501/1
• To indicate the sequence of End_1
• And
@HWI-EAS107_1_4_1_113_501/2
• To indicate the sequence of End_2
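• Paired-end reads point INWARD, toward each other
• Mate-pair reads point OUTWARD, away from each other
• Of the two listings above, the first (records starting with @, with a + line and a quality string) is FASTQ
and the second (records starting with >) is FASTA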
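The relationship between the two formats can be sketched with a small FASTQ reader that drops the quality line to produce FASTA. This is a minimal sketch assuming four-line FASTQ records (no wrapped sequences); the function names are hypothetical.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        plus = next(it).strip()
        qual = next(it).strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], seq, qual


def fastq_to_fasta(lines):
    """Return FASTA lines for the given FASTQ lines (qualities are discarded)."""
    out = []
    for name, seq, _qual in parse_fastq(lines):
        out.append(">" + name)
        out.append(seq)
    return out


def mate_key(name):
    """Strip a trailing /1 or /2 so both ends of a pair share one key."""
    return name[:-2] if name.endswith(("/1", "/2")) else name


read = "CATTATAAATTGAAGCTTATACAAAAAACTCGA"
record = ["@HWI-EAS107_1_4_1_113_501/1", read, "+", "I" * len(read)]
print("\n".join(fastq_to_fasta(record)))
```

For paired-end data, the two files can be read in step and matched on `mate_key`, since the record ids differ only in the `/1` and `/2` suffix.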
Error Due to Physics
Beginning
(bad quality data)
Middle
(good quality data)
End
(bad quality data)
Source: Wikipedia
Base-calling Error
(Errors occur at rates of 1 to 5 per 100 nucleotides)
Error-free   Substitution   Insertion   Deletion
ACCGT        ACCGT          ACCGT       ACCGT
CGTGC        CGTGC          CAGTGC      CGTGC
TTAC         TTAC           TTAC        TTAC
TACCGT       TGCCGT         TACCGT      TACGT
Ref: Joao Setubal, Joao Meidanis, Introduction to Computational Molecular Biology
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
TTACCGTGC (Consensus)
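A consensus can be called from such a layout by majority vote in each column. The sketch below assumes the fragments are already aligned and padded with "-" as in the layout above; the function name is hypothetical, and ties are broken by first occurrence.

```python
from collections import Counter


def consensus(aligned_rows):
    """Majority-vote consensus over aligned rows; '-' means no base at that column."""
    width = max(len(row) for row in aligned_rows)
    called = []
    for col in range(width):
        bases = [row[col] for row in aligned_rows
                 if col < len(row) and row[col] != "-"]
        if bases:  # skip columns that no fragment covers
            called.append(Counter(bases).most_common(1)[0][0])
    return "".join(called)


layout = [
    "--ACCGT--",
    "----CGTGC",
    "TTAC-----",
    "-TACCGT--",
]
print(consensus(layout))

# The same layout with the substitution error (A -> G) in the last fragment:
with_error = layout[:3] + ["-TGCCGT--"]
print(consensus(with_error))
```

Because three fragments cover the affected column, the single substitution is outvoted, which is why higher coverage makes base-calling errors easier to correct.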
Adaptors & Contamination
• Illumina Adaptors:
1) P-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
2) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
3) AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
4) CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
5) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
6) CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
• In a paired read, contamination in one
end results in both ends being filtered out
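The pair-level filtering rule can be sketched as below. This is a simplified illustration: it flags a read when it contains the first `min_match` bases of an adaptor exactly, whereas real trimming tools also tolerate mismatches and partial matches at read ends. The function names and the `min_match` value are assumptions.

```python
ADAPTORS = [
    "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG",   # adaptor 1 from the slide (without the P- prefix)
    "ACACTCTTTCCCTACACGACGCTCTTCCGATCT",  # adaptor 2 from the slide
]


def is_contaminated(read, adaptors=ADAPTORS, min_match=12):
    """True if the read contains the first min_match bases of any adaptor."""
    return any(adaptor[:min_match] in read for adaptor in adaptors)


def filter_pairs(pairs, adaptors=ADAPTORS, min_match=12):
    """Keep a pair only if neither end is contaminated."""
    return [
        (end1, end2)
        for end1, end2 in pairs
        if not is_contaminated(end1, adaptors, min_match)
        and not is_contaminated(end2, adaptors, min_match)
    ]


clean = ("ACGTACGTACGTACGT", "TTGGCCAATTGGCCAA")
dirty = ("ACGT" + ADAPTORS[1][:15], "TTGGCCAATTGGCCAA")  # End_1 carries adaptor sequence
print(filter_pairs([clean, dirty]))
```

Dropping both ends keeps the `_1` and `_2` files in step, at the cost of discarding a clean mate.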
Genome/DNA Data
Run1:
Lane   No. of Reads   Size (bytes)
1      41,179,668     10,285,393,108
2      43,252,726     10,455,103,434
3      42,951,004     10,381,539,992
4      43,580,180     10,534,360,126
6      42,071,130     10,171,701,008
7      43,084,416     10,414,795,392
8      42,891,196     10,369,596,648
Run2:
Lane   No. of Reads   Size (bytes)
1      42,773,842     10,703,924,228
2      44,809,016     10,772,709,314
3      44,898,528     10,790,934,680
4      44,099,962     10,598,532,600
6      44,731,270     10,746,564,462
7      44,162,428     10,607,946,662
8      43,689,238     10,492,962,600
Run3: Mate Pair data with Read size 35 Nucleotide Library Size 5K (Insert size 5470 NT)
Lane   Size (bytes)
1      6,535,068,410
2      6,512,213,186
3      6,497,931,646
4      6,417,130,928
6      6,396,631,302
7      6,392,634,380
8      6,240,332,704
Run1: Total # of Paired-End Reads: 272,535,758; 29,901,032,000 Nucleotides
Run2: Total # of Paired-End Reads: 282,273,960; 30,916,428,400 Nucleotides
Run3: Total # of Mate-paired Reads: 841,326,748; 30,287,762,928 Nucleotides
RNA-Seq Data for a Marine Animal
Tissue Name # Reads # Bases Size (bytes)
Brain 73,224,886 4,393,493,160 14,378,439,860
Heart 71,954,940 4,317,296,400 14,129,178,812
Liver 68,992,472 4,139,548,320 13,547,005,500
miRNA Data
Sample   No. of Bases   No. of Bases   No. of      Size of
Name     Received       Processed      Sequences   Data
========================================================
S1       27,951,043     27,951,043     1,114,585    70.5 MB
S2       24,768,291     24,768,291     1,043,462    64.5 MB
S3       41,569,143     41,569,143     1,685,096   106.5 MB
S4       34,037,239     34,037,239     1,433,791    89.2 MB
S5       24,963,089     24,963,089     1,033,362    61.6 MB
S6       34,846,223     34,846,223     1,439,337    96.5 MB
S7       74,262,271     74,262,271     2,309,712   164.6 MB
Read size varies from 18 to 36 nucleotides; data are in FASTA format
Typical Biological Data Volume
(Illumina sequencing platform based)
Complexities in NGS Data
• Large files – Microsoft Windows often fails to even open the file
• Variable-length reads – allocating memory is always a computational
challenge
• Computers are good at identity discovery, but biology needs similarity
discovery
• Categorical data – the difference between two objects cannot be computed arithmetically
• Data are error prone – quality of data is always a challenge
• Proprietary formats and encodings (e.g., SFF, XSQ, CEL, 0 base, 33 base, 64 base)
• Needs supercomputing power with terabytes of memory and petabytes of
storage
• Most biology problems are NP-hard – algorithms fail to scale with large data
volume
• Many open source tools for NGS data are poorly documented and not
maintained, supported, or easy to change
NGS Data Challenges
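[Figure: three challenges. Read errors: substitution, insertion and deletion (example fragments TACCGT, TGCCGT, TCCGT, TCCCGT, ACCCGT, ACCGT). No coverage: regions of the target not covered by any fragment. Repeats: copies of a repeat X in the target (segments A, B, C, D) leading to an assembled sequence that differs from the target.]
Ref: Joao Setubal, Joao Meidanis, Introduction to Computational Molecular Biology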
Unknown Orientation & Order
CACGT
ACGT
TGCA
ACTACG
GTACT
ACTGA
CTGA
CACGT--------
-ACGT--------
-ACGT--------
--CGTAGT-----
-----AGTAC---
--------ACTGA
---------CTGA
CACGTAGTACTGA
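Because the strand of each fragment is unknown, an assembler has to consider both a fragment and its reverse complement. A minimal sketch of that operation follows; the function name is hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


# Two fragments from the slide appear in the layout as their reverse complements.
for fragment in ("ACTACG", "GTACT"):
    print(fragment, "->", reverse_complement(fragment))
```

In the layout above, ACTACG is placed as CGTAGT and GTACT as AGTAC, which is exactly what this function returns for them.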
Discovering Biomedical Knowledge
Data
Information
Knowledge
Literature/
Molecular Data
Clinical/Bedside Data Medical
Knowledge
Target Data
Preprocessed
Data
Transformed
Data
Patterns
iOmics
Clinical/Drug Data
Data Information Knowledge
Zoltán N. Oltvai and Albert-László Barabási, Life’s Complexity Pyramid, Science Vol 298, 25 October 2002
Wet Lab experiment &
High-throughput data
Open-domain widely
used Algorithms &
Tools
Custom Tools and
Open-domain
Databases
Problem Specific
Algorithms, Analysis,
and Databases
Data
Information
Knowledge
Related
Information
Systems Biology –
Hypothesis Agnostic System/Genome Wide Study
[Figure: workflow with the components Scientist / Biologist, Hypothesis, Experiment/Sample, NGS / Sequencer, Big Data, ETL, Data Science, Molecular Biology / Genetics, Computer Science / Algorithms, Bioinformatics, Statistics, Meta Analysis / Network Biology, Biomedical Databases, Literature, and Publish / Translational Biomedicine]
Data Sciences
• Data Science is about learning from data, in order to gain useful
predictions and insights
• Separating signal from noise presents many computational and
inferential challenges, which are approached from a perspective at the
interface of computer science and statistics
• Data munging/scraping/sampling/cleaning in order to get an
informative, manageable data set
• Data storage and management in order to be able to access data -
especially big data - quickly and reliably during subsequent analysis
• Exploratory data analysis to generate hypotheses and intuition about
the data
• Prediction based on statistical tools such as regression, classification,
and clustering
• Communication of results through visualization, stories, and
interpretable summaries.
Data Simulator (Synthetic Data)
• Take a Reference genome (e.g., hg19 or mm10 or some other genome)
• Create a VCF (Variant Call Format) file with synthetic mutations
• Or, take known mutations in VCF format from COSMIC or the 1000 Genomes Project
• Apply (inject) the mutations from VCF file into the reference genome
• This will create a genome (single strand) with known mutations
• Inject random errors (sequencer errors)
• Define the depth or coverage
• Create fixed length single-end or paired-end reads
• A FASTQ file will be generated with known coverage and known
mutations
• Single strand RNA-Seq, DNA-Seq, or ChIP-Seq data
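The simulator steps above can be sketched in a few functions. This is a toy illustration under stated assumptions, not the tool behind the slide: mutations are given as a dictionary of 0-based positions to alternate bases (VCF itself uses 1-based positions), only substitutions are injected as sequencer errors, reads are single-end, and every base gets the same placeholder quality character. All function names are hypothetical.

```python
import random


def apply_snvs(reference, snvs):
    """Inject known point mutations: snvs maps 0-based position -> alternate base."""
    genome = list(reference)
    for position, alt in snvs.items():
        genome[position] = alt
    return "".join(genome)


def simulate_reads(genome, read_len, coverage, error_rate, seed=0):
    """Sample fixed-length single-end reads at the requested average coverage."""
    rng = random.Random(seed)
    n_reads = (coverage * len(genome)) // read_len
    reads = []
    for i in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for j, base in enumerate(bases):
            if rng.random() < error_rate:  # random substitution error
                bases[j] = rng.choice([b for b in "ACGT" if b != base])
        reads.append(("sim_%d_pos%d" % (i, start), "".join(bases)))
    return reads


def to_fastq(reads, quality_char="I"):
    """Render reads as FASTQ text with a constant placeholder quality."""
    lines = []
    for name, seq in reads:
        lines += ["@" + name, seq, "+", quality_char * len(seq)]
    return "\n".join(lines)


mutated = apply_snvs("ACGT" * 25, {2: "T"})
reads = simulate_reads(mutated, read_len=10, coverage=5, error_rate=0.01)
print(to_fastq(reads[:2]))
```

Because the mutations and the error rate are known in advance, the output of an alignment or variant-calling pipeline run on such data can be scored against the truth.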
Data Scientists' Skills
Ref: Wikipedia
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an
approach/philosophy for data analysis that employs a
variety of techniques (mostly graphical) to
1. Maximize insight into a data set;
2. Uncover underlying structure;
3. Extract important variables;
4. Detect outliers and anomalies;
5. Test underlying assumptions;
6. Develop parsimonious models; and
7. Determine optimal factor settings.
• Real Human miRNA Data
• Nucleotide Patterns
– Mono, Di, Poly statistics
– Motif Statistics
• Quality of Nucleotides
Truth is in the Data
Random genomes
fragmentation
Genomes assembly
using overlaps
Metagenomics/
Multiple genomes
The Sequencing & Assembly Process
Target Microbial
Genomes
The Jigsaw Puzzle
Source: Unknown
Phases in Assembly
• Understand the data
– Data inventory
– Single End, Paired End, Mate Paired etc
– Sequence structure (Read size, Format)
– Quality of the data
– Patterns within the data
• Clean up the data
– Remove (Filter/Trim) vector/adaptor contaminated data
– Remove data of bad quality
– Remove data that might cause chimeric error
• Genome or Transcriptome in Ref-Assembly
• Contigs in De novo Assembly
Genome Reference Assembly
• Seed Based Algorithm
– Indexes either the genome or the reads in a data
structure
– All k-long words (k-mers) of one sequence are
indexed in a table with an entry for every possible
k-mer
– Seeds (exact or nearly exact substring matches
between the read and the genome) are used to
rapidly isolate the potential locations where the
read could match, and then a sensitive, full
alignment phase, often with the Smith–Waterman
Ref: Adrian V. Dalca and Michael Brudno, Genome variation discovery with high-throughput sequencing data, doi:10.1093/bib/bbp058
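The seed step can be sketched with a hash table of k-mers. This is a minimal illustration that indexes the genome and uses exact k-mer matches as seeds; the follow-up full alignment phase is not shown. The function names and the choice of k are assumptions.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def candidate_positions(read, index, k):
    """Rank candidate read start positions by the number of supporting seeds."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for genome_pos in index.get(read[offset:offset + k], ()):
            start = genome_pos - offset  # where the read would begin
            if start >= 0:
                votes[start] += 1
    return sorted(votes, key=lambda s: (-votes[s], s))


genome = "TTACCGTGCAAGGCT"
index = build_kmer_index(genome, k=3)
print(candidate_positions("CCGTGC", index, k=3))
```

Only the top-ranked positions need to be passed to the sensitive alignment phase, which is what makes the seed-and-extend approach fast.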
MAQ
• MAQ (Mapping and Assembly with Qualities) is a
reference assembler that supports short fixed-length
reads of up to 63 bases
• MAQ was designed for Illumina 1G Genetic Analyzer data,
with functions to handle ABI SOLiD data.
• MAQ aligns reads to reference sequences and then calls the
consensus. For single-end reads, MAQ is able to find all hits
with up to 2 or 3 mismatches.
• For paired-end reads, MAQ finds all paired hits with one of
the two reads containing up to 1 mismatch.
• At the assembling stage, MAQ calls the consensus based on
a statistical model. It calls the base which maximizes the
posterior probability and calculates a Phred quality at each
position along the consensus. Heterozygotes are also called
in this process.
Ref: http://maq.sourceforge.net/
• BWT (Burrows–Wheeler Transform)
• In the BWT index, only a fraction of the
pointers must be precomputed and saved,
while the rest are reconstructed on demand
• Bowtie and BWA utilize heuristic algorithms to
search for non-exact matches in the BWT-
based index, if exact matches cannot be
located
Faster Genome Ref-assembly Algorithm
Ref: Adrian V. Dalca and Michael Brudno, Genome variation discovery with high-throughput sequencing data, doi:10.1093/bib/bbp058
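The idea behind BWT-based search can be sketched on a toy string. The code below builds the transform by sorting all rotations and counts exact pattern occurrences by backward search. It is a didactic sketch only: it is quadratic in memory, does not sample pointers as the slide describes, and does not implement the inexact matching heuristics of Bowtie or BWA. The function names are hypothetical.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (fine for toy inputs only)."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern in the original text by backward search."""
    first = {}  # first[c]: index of the first row starting with character c
    for i, ch in enumerate(sorted(bwt_string)):
        first.setdefault(ch, i)

    def occ(ch, i):
        """Occurrences of ch in bwt_string[:i]."""
        return bwt_string[:i].count(ch)

    lo, hi = 0, len(bwt_string)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("ACGTACGT")
print(transformed, count_occurrences(transformed, "CGT"))
```

Production indexes replace the `occ` scan with precomputed, sampled counts, which is what keeps the memory footprint small for a whole genome.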
Alignment – Bowtie
(SAM – Sequence Alignment/Map)
HWUSI-EAS705_9146:3:24:828:1109/1 0 chr1_length_4160774
1374500 255 100M * 0 0
TCTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGA
CCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCA
%%%%%%%%%%%%%%%41213:/=;555440323113=44;;>1=;?=1>>=53>;?A/
>8=?;===A;?A5AA9A4?B?AAAB@BA>AAA<ABAB@@A@< XA:i:0 MD:Z:100
NM:i:0
HWUSI-EAS705_9146:3:98:1103:366/1 0 chr1_length_4160774
1374501 255 100M * 0 0
CTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGAC
CTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCAT
444454313355544455544433244445661493/3;;565=;491=;5;54==3=
;;>;5;;;95>><:==53=?2>??=>A;A=A?A>?AB>AA>A XA:i:0 MD:Z:100
NM:i:0
HWUSI-EAS705_9146:3:20:433:1834/1 0 chr1_length_4160774
1374502 255 100M * 0 0
TTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACC
TTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCATC
BAA<AB=?A30@A?AAA>?9=B<=>5;8;=>?4:=919;3555/554533;35;5555
5;5554%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% XA:i:0 MD:Z:100
NM:i:0
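Each SAM alignment line has eleven tab-separated mandatory fields followed by optional TAG:TYPE:VALUE fields such as NM:i:0. A minimal parsing sketch follows; the function name is hypothetical and the example record is a shortened, made-up line, not one of the records shown above.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")
INTEGER_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line):
    """Parse one SAM alignment line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    record["tags"] = {}
    for optional in columns[len(SAM_FIELDS):]:
        tag, value_type, value = optional.split(":", 2)
        record["tags"][tag] = int(value) if value_type == "i" else value
    return record


# Hypothetical record for illustration (tabs separate the fields).
example = "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tMD:Z:4"
parsed = parse_sam_line(example)
print(parsed["rname"], parsed["pos"], parsed["cigar"], parsed["tags"])
```

In the records on this slide, FLAG 0 indicates a forward-strand single-end alignment, 100M means all 100 bases align as matches or mismatches, and NM:i:0 reports zero edits.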
Alignment in Genome Viewer
SNP, Micro-InDel, & Point Mutation
• The greatest computational challenge in variation
analysis (SNP/InDel) lies in judging the
likelihood that a position is a heterozygous or
homozygous variant, given the error rates of the
various platforms, the probability of bad mappings,
and the amount of support or coverage
• Therefore, most of the tools include a detailed data
preparation step in which they filter, realign and
often re-score reads, followed by a nucleotide or
heterozygosity calling step done under a Bayesian
framework
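The Bayesian calling step can be illustrated with a toy model for a single position. The sketch assumes independent reads, a single per-base error rate, and made-up genotype priors; it is not the model of any particular tool, and all names and default values are assumptions.

```python
import math

DEFAULT_PRIORS = {"hom_ref": 0.998, "het": 0.001, "hom_alt": 0.001}  # assumed priors


def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior probability of each genotype given reference/alternate read counts."""
    priors = priors or DEFAULT_PRIORS
    # Probability that a single read shows the alternate allele under each genotype.
    p_alt = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1.0 - error_rate}
    log_post = {
        g: math.log(priors[g])
        + alt_count * math.log(p_alt[g])
        + ref_count * math.log(1.0 - p_alt[g])
        for g in priors
    }
    # Normalise in log space for numerical stability.
    peak = max(log_post.values())
    weights = {g: math.exp(v - peak) for g, v in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def call_genotype(ref_count, alt_count, **kwargs):
    """Return the genotype with the highest posterior probability."""
    posteriors = genotype_posteriors(ref_count, alt_count, **kwargs)
    return max(posteriors, key=posteriors.get)


print(call_genotype(10, 10), call_genotype(20, 0), call_genotype(0, 20))
```

With low coverage the posteriors stay close to the priors, which is one reason the amount of support matters as much as the error rate.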
Lack of Coverage
• Coverage at a position i of a target is defined as
the number of fragments that cover this position. If
coverage is zero or low, there is not enough
information in the fragment set to reconstruct the
target completely
[Figure: a target with the fragments laid out beneath it; two regions of the target are marked "No Coverage"]
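The coverage definition translates directly into code. The sketch below assumes each fragment is given as a 0-based, half-open (start, end) interval on the target; the function names are hypothetical.

```python
def coverage(target_length, fragments):
    """Per-position coverage from (start, end) fragment intervals (0-based, end exclusive)."""
    diff = [0] * (target_length + 1)
    for start, end in fragments:
        diff[start] += 1   # a fragment begins covering here
        diff[end] -= 1     # and stops covering here
    depth, running = [], 0
    for change in diff[:-1]:
        running += change
        depth.append(running)
    return depth


def uncovered_regions(depth):
    """Return (start, end) intervals where coverage is zero."""
    gaps, gap_start = [], None
    for position, d in enumerate(depth):
        if d == 0 and gap_start is None:
            gap_start = position
        elif d != 0 and gap_start is not None:
            gaps.append((gap_start, position))
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, len(depth)))
    return gaps


depth = coverage(10, [(0, 4), (2, 6), (8, 10)])
print(depth, uncovered_regions(depth))
```

Any interval reported by `uncovered_regions` is a stretch of the target that cannot be reconstructed from the fragment set.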
End of Part III, IV & V
InterpretOmics
Office: Shezan Lavelle, 5th Floor,
#15 Walton Road, Bengaluru 560001
Lab: #329, 7th Main, HAL 2nd Stage,
Indiranagar, Bengaluru 560008
Phone: +91(80)46623800

Bda2015 tutorial-part2-data&amp;databases

  • 1.
    16th December 2015 Genomics3.0: Big Data in Precision Medicine Asoke K Talukder, Ph.D InterpretOmics, Bangalore, India BDA 2015, 16th December 2015, Hyderabad, India 17th December 2009
  • 2.
    16th December 2015 PartIII – Big Data Genomics
  • 3.
    16th December 2015 MultiScale Big Data 3
  • 4.
    16th December 2015 MultiOmics Big Data 4
  • 5.
    16th December 2015 Big‘OMICS’ (High-throughput) Data Domains DNA-Seq ChIP-Seq RNA-Seq Systems Biology Meta Analysis Population Genetics GWAS Microarray Exome-Seq Repli-Seq Small RNA-Seq Biological Networks Proteomics Metagenomics 5
  • 6.
    16th December 2015 Model •Create a virtual (or physical) entity that has same physical appearance of the original entity in a reduced scale (space) • Use Physical Science to create sensors that can sense and quantify the input to the system causing Perturbation • Use Physical Science to measure the output of the Perturbed model entity • Use Mathematical (or Statistical) science that can simulate the function and behaviour of the original entity in reduced space and reduced time with perturbation 6
  • 7.
    16th December 2015 Dimensionsof Big Data The 7 Vs of Genomic Big Data • Volume is defined in terms of the physical volume of the data that need to be online, like giga-byte (10 9 ), tera-byte (10 12 ), peta-byte (10 15 ) or exa-byte (10 18 ) or even beyond. • Velocity is about the data-retrieval time or the time taken to service a request. Velocity is also measured through the rate of change of the data volume. • Variety relates to heterogeneous types of data like text, structured, unstructured, video, audio etcetera. • Veracity is another dimension to measure data reliability - the ability of an organization to trust the data and be able to confidently use it to make crucial decisions. • Vexing covers the effectiveness of the algorithm. The algorithm needs to be designed to ensure that data processing time is close to linear and the algorithm does not have any bias; irrespective of the volume of the data, the algorithm is able to process the data in reasonable time. • Variability is the scale of data. Data in biology is multi-scale, ranging from sub-atomic ions at picometers, macro-molecules, cells, tissues and finally to a population [9] at thousands of kilometers. • Value is the final actionable insight or the functional knowledge. The same mutation in a gene may have a different effect depending on the population or the environmental factors.
  • 8.
    16th December 2015 Typesof Genomic Big Data 1. Patient Data (n = 1) 2. Perishable (n = 1) 3. Persistent (n = N) 4. Phenotypic (n = N) 5. Clinical (n = N) 6. Biological/Molecular (n = N)
  • 9.
    16th December 2015 Applicationsof Next-Generation Sequencing 9
  • 10.
    16th December 2015 AsokeTalukder Frederick Sanger • Frederick Sanger was born in Rendcomb, a small village in Gloucestershire on August 13, 1918. He completed his Ph.D. in 1943 on lysine metabolism and a more practical problem concerning the nitrogen of potatoes. Sanger's first triumph was to determine the complete amino acid sequence of the two polypeptide chains of insulin in 1955. It was this achievement that earned him his first Nobel prize in Chemistry in 1958. By 1967 he had determined the nucleotide sequence of the 5S ribosomal RNA from Escherichia coli, a small RNA about 115 nucleotides long. He then turned to DNA and, by 1975, had developed the “dideoxy” method for sequencing DNA molecules, also known as the Sanger method. This has been of key importance in such projects as the Human Genome Project and earned him his second Nobel prize in Chemistry in 1980. 10
  • 11.
  • 12.
    16th December 2015 Samplegeneration and cluster generation 200,000 clusters per tile 62.5 million reads per lane 100 bp reads -> 12.5 Gb per lane Prepare DNA or cDNA fragments Ligate adapters 100μm Random array of clusters Attach single molecules to surface Amplify to form cl 12
  • 13.
    16th December 2015 BaseCalling Consecuitive cycles The identity of each base of a cluster is read from stacked sequence image Sequence 13
  • 14.
    16th December 2015 AsokeTalukder 14 Dideoxynucleotide Sequencing 14
  • 15.
    16th December 2015 Decodingthe Book of Life – milestone for Quantitative Biology A Milestone for Humanity – the Human genome Human Genome Completed, June 26th, 2000 15 Francis CollinsBill ClintonJ Craig Ventor 15
  • 16.
    In 2003, the Human Genome Sequence Was Deciphered!
    3 billion base pairs => 6 G letters, and 1 letter => 1 byte: the whole genome can be recorded in just 10 CD-ROMs!
    • The genome is the complete set of genetic material (DNA) of a living thing, including all of its genes.
    • In 2003, the sequencing of the human genome was completed.
    • The human genome contains about 3 billion base pairs.
    • The number of genes is estimated to be between 20,000 and 25,000.
    • The difference between the human genome and the chimpanzee genome is only 1.23%!
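The storage estimate above can be checked with a few lines of arithmetic. This is a minimal sketch: the figures of 3 billion base pairs, 6 G letters, and 1 byte per letter come from the slide, while the 650 MiB capacity per CD-ROM is an assumption made here for illustration.

```python
# Back-of-the-envelope storage estimate for a human genome.
BASE_PAIRS = 3_000_000_000        # about 3 billion base pairs (from the slide)
LETTERS = 2 * BASE_PAIRS          # 6 G letters, as stated on the slide
BYTES_PER_LETTER = 1              # 1 letter => 1 byte (plain text)
CD_ROM_BYTES = 650 * 1024 * 1024  # assumption: 650 MiB per CD-ROM


def cds_needed(total_bytes: int, cd_bytes: int = CD_ROM_BYTES) -> int:
    """Number of whole CD-ROMs needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // cd_bytes)


total_bytes = LETTERS * BYTES_PER_LETTER
print(total_bytes, cds_needed(total_bytes))
```

Under these assumptions the result is nine discs, which is consistent with the slide's "just 10 CD-ROMs".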
  • 17.
    Illumina Genome Analyzer (GA)
    • The Genome Analyzer sequences clustered template DNA using a robust four-color DNA Sequencing-By-Synthesis (SBS) technology that employs reversible terminators with removable fluorescence. This approach provides a high degree of sequencing accuracy even through homopolymeric regions.
  • 18.
    NGS (Next Generation Sequencing) Technology
  • 19.
    How is a Microarray Manufactured?
    • Affymetrix GeneChip
    • Silicon chip
    • Oligonucleotide probes are lithographically synthesized on the array
    • cRNA is used instead of cDNA
  • 20.
    How Does a Microarray Work?
  • 21.
    Part IV – Biological Databases
  • 22.
    Molecular Biology Databases: AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc.
  • 23.
    NCBI (National Center for Biotechnology Information)
    • Over 30 databases, including GenBank, PubMed, OMIM, and GEO
    • Access all NCBI resources via Entrez (www.ncbi.nlm.nih.gov/Entrez/)
  • 36.
    Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)
  • 38.
    Protein Data Bank (PDB)
  • 39.
  • 40.
    ENTREZ: A DISCOVERY SYSTEM
    [Figure: the Entrez databases (Gene, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure) connected by hard links and by neighbor relationships: word weight, BLAST, VAST, phylogeny, BLink, domains, related sequences, and related structures.]
    • Pre-computed and pre-compiled data.
    • A potential "gold mine" of undiscovered relationships.
    • Used less than expected.
  • 41.
  • 42.
    UMLS Knowledge Source Server (UMLSKS) Home Page: Unified Medical Language System
    • From top links or buttons: Search; 3 Knowledge Sources
    • From sidebar: Downloads; Documentation; Resources
  • 43.
    "Biologic Function" hierarchy (each name is followed by its count)
    Biologic Function 360
      Pathologic Function 9983
        Disease or Syndrome 67716
          Mental or Behavioral Dysfunction 5691
          Neoplastic Process 19436
        Cell or Molecular Dysfunction 1276
        Experimental Model of Disease 72
      Physiologic Function 691
        Organism Function 1528
          Mental Process 1224
        Organ or Tissue Function 2912
        Cell Function 4417
        Molecular Function 13442
          Genetic Function 1340
  • 44.
    Part V – Algorithms
  • 45.
    Algorithms
    • An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem
    • First you must identify exactly what the problem is!
    • A problem describes a class of computational tasks. A problem instance is one particular input from that class
    • In general, you should design your algorithms to work for any instance of a problem (although there are cases in which this is not possible)
    • Unlike commercial software, which is data intensive, algorithms are science and mathematics intensive
  • 46.
    Schematic representation of our implementation of the de Bruijn graph. Zerbino D. R., Birney E. Genome Res. 2008;18:821-829. ©2008 by Cold Spring Harbor Laboratory Press
  • 47.
    Example of Tour Bus error correction. Zerbino D. R., Birney E. Genome Res. 2008;18:821-829. ©2008 by Cold Spring Harbor Laboratory Press
  • 48.
    Breadcrumb algorithm. Zerbino D. R., Birney E. Genome Res. 2008;18:821-829. ©2008 by Cold Spring Harbor Laboratory Press
  • 49.
  • 50.
    • Overview of Human Disease: classifications, inheritance, mechanisms (cause)
    • Databases
      – OMIM (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim)
      – Gene Clinics (http://www.geneclinics.org/)
      – Mutation database (http://mutdb.org/)
      – Oncomine (http://www.oncomine.org/)
      – Cancer Genome Project (http://www.sanger.ac.uk/genetics/CGP/)
    • Analysis of genes for molecular functions, biological processes and pathways
      – PANTHER (Protein ANalysis THrough Evolutionary Relationships) (http://www.pantherdb.org/)
      – Protein interaction network (http://string.embl.de/)
  • 51.
    Why Statistics?
    • Are results statistically significant?
    • Many random processes are involved in biological processes
    • Many processes appear to be random but in reality are non-random
    • Much chance and uncertainty is involved in biological data collection
    • Statistical modeling of biological phenomena can help to understand patterns in life
  • 52.
    Deductive and Inductive Science
    [Figure: examples shown are the Law of Gravitation, Newton's Law of Motion, and $E = mc^{2}$, alongside Biological Phenomenon, Simulation, and Clinical Trial.]
    Ref: Sylvia Wassertheil-Smoller, Biostatistics and Epidemiology, Springer, 2003
  • 53.
    Why Statistics?
    The purpose of statistics is to draw inferences from samples of data about the population from which the samples came, or to abstract an entity with average behavior where the behavior of its constituent parts cannot be measured.
  • 54.
    Challenges in Computing
    • Nature is a Tweaker
    • Computers are efficient at discovering identity but not similarity
    • Biology needs similarity, not identity
    • All biology problems are different and unique
    • Next Generation Sequencers generate huge data with many errors
    • Eliminate noise from information
    • Minimize false positives and false negatives
  • 55.
    Most Biology Solutions are NP-Hard
    • If the data volume increases by a factor of x, the cost of solving the problem grows much faster than x, and no polynomial-time exact algorithm is known (NP stands for nondeterministic polynomial time)
    • Getting exact solutions may not be possible for some problems on some inputs without spending a great deal of time
    • If you use a heuristic, you may not know when you have an optimal solution
    • Arriving at an exact solution is almost impossible; however, once a candidate solution is obtained, it can be verified efficiently
    • Sometimes exact solutions are not necessary and approximate solutions suffice. But how good does the approximation need to be?
  • 56.
    NGS: Experiment with an Open Mind
    • The process (Wet Lab)
      – Take DNA/RNA/cDNA/miRNA etc.
      – Break into tiny pieces
      – Amplify them
      – Read them as a sequence of bases
    • The process (Dry Lab)
      – Analyze the data
      – Extract information from data
    • NGS experiments are unbiased
    • NGS can help discover many unknown patterns in the genome, gene, or cell
  • 57.
    Next Generation Sequence Data
    • FASTQ (Illumina)
    • SFF (454)
    • CCS (PacBio)
    • ...
    • Microarray
    [Figure: read types. Single-end sequences; paired-end or mate-paired sequences (insert size, library size); overlapped reads; random order and orientation; long reads and short reads; fixed-length and variable-length reads; circular consensus reads. Sources: DNA/RNA/miRNA and cDNA/mRNA. Lengths range from hundreds to billions of bases.]
  • 58.
    Paired-end/Mate-pair Data
    Ref: Paul Medvedev, Monica Stanciu & Michael Brudno, Computational methods for discovering structural variation with next-generation sequencing, Nature Methods Supplement, Vol. 6 No. 11s, November 2009
  • 59.
    Roche 454 NGS Data (.sff)
    FNA file content:
    >contig00001 length=439 numreads=17
    CcTcGGCGACGCACTCCgTCTTTtCAGTCAAAGGTCGAGGCAGTtGAGGTTACCCCACCC
    GTCCATCCGCCTTCGGCGGCTGTCCACCCTCCCCTCAAGGGGGAGGGGAACGCCCCGCCA
    GGAACCCCGCCAATGACCGACGCCCCGACCGTTCTTtCCCCcACCGCCGAAGCCCCGGTC
    GAAGGCCTGCCGTCGGGTTTCGGCGAAGGCATCGCCGGCAAGGCCGCATTTCTCATCGCC
    QUAL file content:
    >contig00001 length=439 numreads=17
    64 35 64 34 64 64 64 64 64 64 64 64 64 64 64 64 64 23 64 64 64 64 64 11 64 64 64 64 64 64 64 64 64 64 64 64 58 64 64 58 64 64 64 64 25 64 64 64 64 64 64 64 64 64 64 61 64 64 64 49 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 53 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 61 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 16 64 64 64 64 18 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 61 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
    The Phred quality score Q is logarithmically related to the base-calling error probability P: $Q = -10 \log_{10} P$, or equivalently $P = 10^{-Q/10}$.
    • If Phred assigns a quality score of 30 to a base, the chance that this base is called incorrectly is 1 in 1000. The most commonly used method is to count the bases with a quality score of 20 and above. The high accuracy of Phred quality scores makes them an ideal tool to assess the quality of sequences.
    • In the one-character representation, ASCII codes below 32 are not printable, so an offset of 33 or 64 (depending on the vendor) is added to the Q value before it is written as a character.
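The relation between Q, P, and the one-character encoding can be sketched as follows. This is a minimal illustration, not a vendor tool; the function names are hypothetical, and the offsets 33 and 64 are the two conventions mentioned on the slide.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(qual: str, offset: int = 33) -> list[int]:
    """Turn a quality string into Q values by subtracting the ASCII offset."""
    return [ord(ch) - offset for ch in qual]


def encode_quality(scores: list[int], offset: int = 33) -> str:
    """Turn Q values into printable characters by adding the ASCII offset."""
    return "".join(chr(q + offset) for q in scores)


def fraction_q20(scores: list[int], threshold: int = 20) -> float:
    """Fraction of bases whose quality is at or above the threshold."""
    return sum(q >= threshold for q in scores) / len(scores)


print(phred_to_error_prob(30))
print(decode_quality("I", offset=33), decode_quality("h", offset=64))
```

The same Q value therefore appears as a different character under each offset, which is why the offset must be known before a quality string is interpreted.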
  • 60.
    Data in FASTQ/FASTA Format
    • For paired-end sequences you have two files, with _1 and _2 in the names to indicate End_1 and End_2
    • Within the files you have matching record IDs: @HWI-EAS107_1_4_1_113_501/1 indicates the sequence of End_1, and @HWI-EAS107_1_4_1_113_501/2 indicates the sequence of End_2
    • A paired-end read is INWARD (→ ←)
    • A mate-pair read is OUTWARD (← →)
    FASTQ:
    @HWI-EAS107_1_4_1_113_501
    CATTATAAATTGAAGCTTATACAAAAAACTCGA
    +
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    @HWI-EAS107_1_4_1_213_501
    ATTATAAATTGAAGCTTATACAAAAAACTCGAA
    +
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    @HWI-EAS107_1_4_1_313_501
    CATTATAAATTGAAGCTTATACAAAAAACTCGA
    +
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    @HWI-EAS107_1_4_1_413_501
    TTATAAATTGAAGCTTCTTTAATCTTGGAGCAA
    +
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    @HWI-EAS107_1_4_1_513_501
    TATAAATTGAAGCTTCTTTAATCTTGGAGCAAA
    +
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    FASTA:
    >HWI-EAS107_1_4_1_113_501
    CATTATAAATTGAAGCTTATACAAAAAACTCGA
    >HWI-EAS107_1_4_1_213_501
    ATTATAAATTGAAGCTTATACAAAAAACTCGAA
    >HWI-EAS107_1_4_1_313_501
    CATTATAAATTGAAGCTTATACAAAAAACTCGA
    >HWI-EAS107_1_4_1_413_501
    TTATAAATTGAAGCTTCTTTAATCTTGGAGCAA
    >HWI-EAS107_1_4_1_513_501
    TATAAATTGAAGCTTCTTTAATCTTGGAGCAAA
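The relationship between the two formats can be sketched in code: FASTA keeps only the identifier and the bases of each FASTQ record. This is a minimal sketch that assumes the simple four-line FASTQ layout shown on the slide (no wrapped sequence lines); the function names are hypothetical.

```python
import io
from typing import Iterator, Optional, TextIO


def read_fastq(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def fastq_to_fasta(fastq_text: str) -> str:
    """Drop the quality lines and switch the '@' header to a '>' header."""
    return "".join(
        f">{read_id}\n{seq}\n"
        for read_id, seq, _qual in read_fastq(io.StringIO(fastq_text))
    )


def split_mate(read_id: str) -> tuple[str, Optional[int]]:
    """Split 'name/1' or 'name/2' into (name, end); end is None if absent."""
    base, sep, end = read_id.rpartition("/")
    if sep and end in ("1", "2"):
        return base, int(end)
    return read_id, None


record = "@HWI-EAS107_1_4_1_113_501/1\nCATTATAAATTGAAGCTTATACAAAAAACTCGA\n+\n" + "I" * 33 + "\n"
print(fastq_to_fasta(record), end="")
print(split_mate("HWI-EAS107_1_4_1_113_501/1"))
```

Matching End_1 and End_2 records then reduces to comparing the name returned by `split_mate` across the two files.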
  • 61.
    Error Due to Physics
    [Figure: read quality along a read. Beginning (bad quality data), middle (good quality data), end (bad quality data). Source: Wikipedia]
  • 62.
    Base-calling Error (errors occur at rates of 1 to 5 per 100 nucleotides)
    Error type     Fragments
    (none)         ACCGT  CGTGC   TTAC  TACCGT
    Substitution   ACCGT  CGTGC   TTAC  TGCCGT
    Insertion      ACCGT  CAGTGC  TTAC  TACCGT
    Deletion       ACCGT  CGTGC   TTAC  TACGT
    Layout of the error-free fragments:
    --ACCGT--
    ----CGTGC
    TTAC-----
    -TACCGT--
    TTACCGTGC (Consensus)
    Ref: Joao Setubal, Joao Meidanis, Introduction to Computational Molecular Biology
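A consensus can be read off an aligned layout by taking the most frequent base in each column, which also shows why a single substitution error is outvoted when coverage is sufficient. This is a minimal sketch, not the method of any particular assembler; the function name is hypothetical, the layout is assumed to be already aligned and padded with '-', and ties are broken in favor of the base seen first.

```python
from collections import Counter


def consensus(layout: list[str], gap: str = "-") -> str:
    """Column-wise majority vote over an already aligned fragment layout."""
    width = max(len(row) for row in layout)
    called = []
    for col in range(width):
        bases = [row[col] for row in layout if col < len(row) and row[col] != gap]
        if not bases:
            called.append("N")  # no fragment covers this column
        else:
            called.append(Counter(bases).most_common(1)[0][0])
    return "".join(called)


clean = ["--ACCGT--", "----CGTGC", "TTAC-----", "-TACCGT--"]
with_substitution = ["--ACCGT--", "----CGTGC", "TTAC-----", "-TGCCGT--"]
print(consensus(clean))
print(consensus(with_substitution))
```

Both layouts give the same consensus, because the column holding the substituted G is also covered by two fragments that read A.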
  • 63.
    Adaptors & Contamination
    • Illumina adaptors:
      1) P-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
      2) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
      3) AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
      4) CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
      5) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
      6) CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
    • In a paired read, contamination in one end results in both ends being filtered out
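The pair-level filtering rule can be sketched as follows. This is a simplified illustration that flags a read when it shares any exact k-mer with an adaptor; the seed length k = 12 and the function names are assumptions made here, and production trimmers typically also tolerate mismatches and partial matches at read ends.

```python
# Unique adaptor sequences from the slide (the leading "P-" on the first one
# is dropped here on the assumption that it denotes a modification, not bases).
ADAPTORS = [
    "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG",
    "ACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT",
    "CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT",
]


def adaptor_kmers(adaptors: list[str], k: int) -> set[str]:
    """All k-long substrings of all adaptors."""
    return {a[i:i + k] for a in adaptors for i in range(len(a) - k + 1)}


def is_contaminated(read: str, kmers: set[str], k: int) -> bool:
    """True if the read contains any adaptor k-mer exactly."""
    return any(read[i:i + k] in kmers for i in range(len(read) - k + 1))


def filter_pairs(pairs: list[tuple[str, str]], adaptors: list[str], k: int = 12):
    """Keep a pair only if neither end is contaminated."""
    kmers = adaptor_kmers(adaptors, k)
    return [
        (end1, end2)
        for end1, end2 in pairs
        if not is_contaminated(end1, kmers, k) and not is_contaminated(end2, kmers, k)
    ]


clean_pair = ("ACGT" * 10, "TTGCA" * 8)
dirty_pair = ("ACGTACGT" + ADAPTORS[1][:20], "TTGCA" * 8)  # End_1 carries adaptor
print(len(filter_pairs([clean_pair, dirty_pair], ADAPTORS)))
```

Dropping the whole pair keeps the End_1 and End_2 files in step, which downstream paired-end tools generally expect.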
  • 64.
    Genome/DNA Data
    Run1:
    Lane   No. of Reads   Size (bytes)
    1      41,179,668     10,285,393,108
    2      43,252,726     10,455,103,434
    3      42,951,004     10,381,539,992
    4      43,580,180     10,534,360,126
    6      42,071,130     10,171,701,008
    7      43,084,416     10,414,795,392
    8      42,891,196     10,369,596,648
    Run2:
    Lane   No. of Reads   Size (bytes)
    1      42,773,842     10,703,924,228
    2      44,809,016     10,772,709,314
    3      44,898,528     10,790,934,680
    4      44,099,962     10,598,532,600
    6      44,731,270     10,746,564,462
    7      44,162,428     10,607,946,662
    8      43,689,238     10,492,962,600
    Run3: mate-pair data with read size 35 nucleotides, library size 5K (insert size 5470 nt)
    Lane   Size (bytes)
    1      6,535,068,410
    2      6,512,213,186
    3      6,497,931,646
    4      6,417,130,928
    6      6,396,631,302
    7      6,392,634,380
    8      6,240,332,704
    Run1: total number of paired-end reads: 272,535,758; 29,901,032,000 nucleotides
    Run2: total number of paired-end reads: 282,273,960; 30,916,428,400 nucleotides
    Run3: total number of mate-paired reads: 841,326,748; 30,287,762,928 nucleotides
  • 65.
    RNA-Seq Data for a Marine Animal
    Tissue Name   # Reads      # Bases         Size (bytes)
    Brain         73,224,886   4,393,493,160   14,378,439,860
    Heart         71,954,940   4,317,296,400   14,129,178,812
    Liver         68,992,472   4,139,548,320   13,547,005,500
  • 66.
    miRNA Data
    Sample   No. of Bases   No. of Bases   No. of      Size of
    Name     Received       Processed      Sequences   Data
    ============================================================
    S1       27,951,043     27,951,043     1,114,585   70.5 MB
    S2       24,768,291     24,768,291     1,043,462   64.5 MB
    S3       41,569,143     41,569,143     1,685,096   106.5 MB
    S4       34,037,239     34,037,239     1,433,791   89.2 MB
    S5       24,963,089     24,963,089     1,033,362   61.6 MB
    S6       34,846,223     34,846,223     1,439,337   96.5 MB
    S7       74,262,271     74,262,271     2,309,712   164.6 MB
    Read size varies from 18 to 36 nucleotides; data are in FASTA format
  • 67.
    Typical Biological Data Volume (Illumina sequencing platform based)
  • 68.
    Complexities in NGS Data
    • Large files: Microsoft Windows often fails to even open the file
    • Variable-length reads: allocating memory is always a computational challenge
    • Computers are good at identity discovery, but biology needs similarity discovery
    • Categorical data: one cannot take the difference between two objects
    • Data are error prone: quality of data is always a challenge
    • Proprietary formats and differing conventions (e.g., SFF, XSQ, CEL, 0 base, 33 base, 64 base)
    • Needs supercomputing power with terabytes of memory and petabytes of storage
    • Most biology problems are NP-Hard: algorithms fail to scale with large data volumes
    • Many open-source tools for NGS data are poorly documented, unmaintained, unsupported, or hard to change
  • 69.
    NGS Data Challenges
    [Figure: read errors (substitution, insertion, deletion), illustrated with TACCGT, TGCCGT, TCCGT, TCCCGT, ACCCGT, ACCGT; regions of the target with no fragment coverage; and repeats, where copies of a repeat X between segments A, B, C, D of the target lead to a differently ordered assembly.]
    Ref: Joao Setubal, Joao Meidanis, Introduction to Computational Molecular Biology
  • 70.
    Unknown Orientation & Order
    Fragments: CACGT, ACGT, TGCA, ACTACG, GTACT, ACTGA, CTGA
    Layout (some fragments are placed in the opposite orientation):
    CACGT--------
    -ACGT--------
    -ACGT--------
    --CGTAGT-----
    -----AGTAC---
    --------ACTGA
    ---------CTGA
    CACGTAGTACTGA (consensus)
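Because the strand of each fragment is unknown, an assembler has to try a fragment both as read and as its reverse complement. The sketch below shows this for exact matches only; the function names are hypothetical, and real assemblers compare fragments against each other with inexact overlaps rather than against a known target.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def place_fragment(fragment: str, target: str) -> Optional[tuple[int, str]]:
    """Return (offset, strand) of the first exact match, trying both orientations."""
    pos = target.find(fragment)
    if pos >= 0:
        return pos, "+"
    pos = target.find(reverse_complement(fragment))
    if pos >= 0:
        return pos, "-"
    return None


target = "CACGTAGTACTGA"
print(reverse_complement("ACTACG"))
print(place_fragment("ACTACG", target))
print(place_fragment("CACGT", target))
```

Here ACTACG only fits the target as CGTAGT at offset 2, which is the row `--CGTAGT-----` of the layout on the slide.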
  • 71.
    Discovering Biomedical Knowledge
    [Figure: Data -> Information -> Knowledge. Literature/molecular data, clinical/bedside data, and clinical/drug data pass through the stages target data, preprocessed data, transformed data, and patterns (iOmics) to yield medical knowledge.]
  • 72.
    Data, Information, Knowledge
    [Figure: pyramid with levels Data, Related Information, Information, and Knowledge. Associated labels: wet lab experiment and high-throughput data; open-domain widely used algorithms and tools; custom tools and open-domain databases; problem-specific algorithms, analysis, and databases.]
    Ref: Zoltán N. Oltvai and Albert-László Barabási, Life's Complexity Pyramid, Science Vol 298, 25 October 2002
  • 73.
    Systems Biology – Hypothesis Agnostic System/Genome Wide Study
    [Figure: workflow linking Scientist/Biologist, Hypothesis, Experiment/Sample, NGS/Sequencer, ETL, Big Data, Data Science (Molecular Biology/Genetics, Computer Science/Algorithms, Bioinformatics, Statistics), Meta Analysis/Network Biology, Biomedical Databases, Literature, and Publish/Translational Biomedicine.]
  • 74.
    Data Sciences
    • Data Science is about learning from data in order to gain useful predictions and insights
    • Separating signal from noise presents many computational and inferential challenges, which we approach from a perspective at the interface of computer science and statistics
    • Data munging/scraping/sampling/cleaning in order to get an informative, manageable data set
    • Data storage and management in order to be able to access data, especially big data, quickly and reliably during subsequent analysis
    • Exploratory data analysis to generate hypotheses and intuition about the data
    • Prediction based on statistical tools such as regression, classification, and clustering
    • Communication of results through visualization, stories, and interpretable summaries
  • 75.
    Data Simulator (Synthetic Data)
    • Take a reference genome (e.g., hg19, mm10, or some other genome)
    • Create a VCF (Variant Call Format) file with synthetic mutations
    • Or take known mutations in VCF format from COSMIC or the 1000 Genomes Project
    • Apply (inject) the mutations from the VCF file into the reference genome
    • This creates a genome (single strand) with known mutations
    • Inject random errors (sequencer errors)
    • Define the depth or coverage
    • Create fixed-length single-end or paired-end reads
    • A FASTQ file is generated with known coverage and known mutations
    • Single-strand RNA-Seq, DNA-Seq, or ChIP-Seq data
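The simulator steps can be sketched on a toy scale. This is a minimal illustration under several assumptions: mutations are limited to single-base substitutions given as a position-to-base mapping rather than parsed from a real VCF file, reads are single-end, sequencer errors are uniform substitutions, and every base gets the same hypothetical quality character. All function names are made up for this example.

```python
import random
from typing import Iterator


def apply_snvs(reference: str, snvs: dict[int, str]) -> str:
    """Inject known substitutions (0-based position -> alternate base)."""
    genome = list(reference)
    for pos, alt in snvs.items():
        genome[pos] = alt
    return "".join(genome)


def simulate_reads(genome: str, read_len: int, coverage: int,
                   error_rate: float, rng: random.Random) -> Iterator[tuple[str, str]]:
    """Yield (name, sequence) for fixed-length single-end reads at a given depth."""
    n_reads = (coverage * len(genome)) // read_len
    for i in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for j, base in enumerate(bases):
            if rng.random() < error_rate:  # random sequencer error
                bases[j] = rng.choice([b for b in "ACGT" if b != base])
        yield f"sim_{i}_{start}", "".join(bases)


def to_fastq(reads, qual_char: str = "I") -> str:
    """Serialize reads as four-line FASTQ records with a constant quality."""
    return "".join(
        f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n" for name, seq in reads
    )


rng = random.Random(0)
reference = "ACGT" * 50                       # toy 200-base reference
mutated = apply_snvs(reference, {10: "T"})    # one known mutation
reads = list(simulate_reads(mutated, read_len=25, coverage=10, error_rate=0.01, rng=rng))
fastq_text = to_fastq(reads)
print(len(reads), fastq_text.count("\n"))
```

Because the injected mutations and the start position of every read are known, the output of an aligner or variant caller on such data can be scored against ground truth.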
  • 76.
    Data Scientists' Skills
    Ref: Wikipedia
  • 77.
    Exploratory Data Analysis
    Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:
    1. Maximize insight into a data set;
    2. Uncover underlying structure;
    3. Extract important variables;
    4. Detect outliers and anomalies;
    5. Test underlying assumptions;
    6. Develop parsimonious models; and
    7. Determine optimal factor settings.
  • 78.
    Truth is in the Data
    • Real human miRNA data
    • Nucleotide patterns
      – Mono, Di, Poly statistics
      – Motif statistics
    • Quality of nucleotides
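Mono- and di-nucleotide statistics are k-mer counts with k = 1 and k = 2. The sketch below shows one simple way to compute them; the function names and the toy reads are made up for illustration.

```python
from collections import Counter


def kmer_counts(sequences: list[str], k: int) -> Counter:
    """Count every overlapping k-long word across all sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def frequencies(counts: Counter) -> dict[str, float]:
    """Convert raw counts to relative frequencies."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


toy_reads = ["TGAGGTAGTAGGTTGTATAGTT", "AAAT"]  # hypothetical reads
print(kmer_counts(toy_reads, 1))
print(frequencies(kmer_counts(["AAAT"], 2)))
```

Strong deviations from the expected composition, such as one base or motif dominating, are often the first sign of adaptor contamination or low-complexity reads.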
  • 79.
    The Sequencing & Assembly Process
    [Figure: target microbial genomes (metagenomics/multiple genomes) -> random fragmentation of the genomes -> assembly of the genomes using overlaps.]
  • 80.
    The Jigsaw Puzzle
    Source: Unknown
  • 81.
    Phases in Assembly
    • Understand the data
      – Data inventory: single end, paired end, mate paired, etc.
      – Sequence structure (read size, format)
      – Quality of the data
      – Patterns within the data
    • Clean up the data
      – Remove (filter/trim) vector/adaptor contaminated data
      – Remove data of bad quality
      – Remove data that might cause chimeric errors
    • Genome or transcriptome in reference assembly
    • Contigs in de novo assembly
  • 82.
    Genome Reference Assembly
    • Seed-based algorithm
      – Indexes either the genome or the reads in a data structure
      – All k-long words (k-mers) of one sequence are indexed in a table with an entry for every possible k-mer
      – Seeds (exact or nearly exact substring matches between the read and the genome) are used to rapidly isolate the potential locations where the read could match; a sensitive, full alignment phase then follows, often with the Smith–Waterman algorithm
    Ref: Adrian V. Dalca and Michael Brudno, Genome variation discovery with high-throughput sequencing data, doi:10.1093/bib/bbp058
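The seed-and-verify idea can be sketched with a dictionary that maps each k-mer of the genome to its positions. This is a toy illustration: the function names are hypothetical, and the verification step simply counts mismatches at each candidate location instead of running a full Smith–Waterman alignment, so insertions and deletions are not handled.

```python
from collections import Counter, defaultdict
from typing import Optional


def build_kmer_index(genome: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of the genome to the list of positions where it starts."""
    index: dict[str, list[int]] = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def candidate_starts(read: str, index: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Use exact k-mer seeds to vote for read start positions on the genome."""
    votes: Counter = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            if pos - offset >= 0:
                votes[pos - offset] += 1
    return votes.most_common()


def map_read(read: str, genome: str, index: dict[str, list[int]],
             k: int, max_mismatches: int) -> Optional[tuple[int, int]]:
    """Verify candidates; return (start, mismatches) for the first acceptable one."""
    for start, _votes in candidate_starts(read, index, k):
        window = genome[start:start + len(read)]
        if len(window) != len(read):
            continue
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            return start, mismatches
    return None


genome = "TTACCGTGCAGGCTTAACGGATCCA"   # toy genome
index = build_kmer_index(genome, k=4)
read = genome[5:17]
noisy = read[:6] + ("T" if read[6] != "T" else "A") + read[7:]  # one substitution
print(map_read(read, genome, index, k=4, max_mismatches=0))
print(map_read(noisy, genome, index, k=4, max_mismatches=1))
```

The seeds on either side of the substituted base still hit the index, which is what lets the read be located despite the error.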
  • 83.
    MAQ
    • MAQ (Mapping and Assembly with Qualities) is a reference assembler that supports short fixed-length reads of up to 63 bases
    • MAQ was designed for Illumina 1G Genetic Analyzer data, with functions to handle ABI SOLiD data
    • MAQ aligns reads to reference sequences and then calls the consensus. For single-end reads, MAQ is able to find all hits with up to 2 or 3 mismatches
    • For paired-end reads, MAQ finds all paired hits with one of the two reads containing up to 1 mismatch
    • At the assembling stage, MAQ calls the consensus based on a statistical model. It calls the base that maximizes the posterior probability and calculates a Phred quality at each position along the consensus. Heterozygotes are also called in this process
    Ref: http://maq.sourceforge.net/
  • 84.
    Faster Genome Ref-assembly Algorithm
    • BWT (Burrows–Wheeler Transform)
    • In the BWT index, only a fraction of the pointers must be precomputed and saved, while the rest are reconstructed on demand
    • Bowtie and BWA use heuristic algorithms to search for non-exact matches in the BWT-based index if exact matches cannot be located
    Ref: Adrian V. Dalca and Michael Brudno, Genome variation discovery with high-throughput sequencing data, doi:10.1093/bib/bbp058
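The transform and exact-match search on it can be sketched in a few lines. This is a teaching-scale illustration, not how Bowtie or BWA are implemented: the transform is built by sorting all rotations, occurrence counts are recomputed with `str.count` on every step instead of being stored in sampled tables, and only exact matches are counted. The function names are hypothetical.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations; '$' marks the end."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first[c] = index of the first row (in sorted order) that starts with c
    first: dict[str, int] = {}
    for row, ch in enumerate(sorted(bwt_text)):
        first.setdefault(ch, row)
    lo, hi = 0, len(bwt_text)          # current half-open range of matching rows
    for ch in reversed(pattern):       # extend the match one base to the left
        if ch not in first:
            return 0
        lo = first[ch] + bwt_text[:lo].count(ch)
        hi = first[ch] + bwt_text[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("ACGTACGT")
print(transformed)
print(count_occurrences(transformed, "CGT"), count_occurrences(transformed, "GG"))
```

The search touches the pattern once per base and never scans the genome itself, which is the property that makes BWT-based indexes fast for short-read alignment.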
  • 85.
    Alignment – Bowtie (SAM – Sequence Alignment/Map)
    Each record lists the read name, flag, reference name, position, mapping quality, CIGAR string, mate reference, mate position, template length, read sequence, base qualities, and optional tags:
    HWUSI-EAS705_9146:3:24:828:1109/1 0 chr1_length_4160774 1374500 255 100M * 0 0 TCTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCA %%%%%%%%%%%%%%%41213:/=;555440323113=44;;>1=;?=1>>=53>;?A/>8=?;===A;?A5AA9A4?B?AAAB@BA>AAA<ABAB@@A@< XA:i:0 MD:Z:100 NM:i:0
    HWUSI-EAS705_9146:3:98:1103:366/1 0 chr1_length_4160774 1374501 255 100M * 0 0 CTTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCAT 444454313355544455544433244445661493/3;;565=;491=;5;54==3=;;>;5;;;95>><:==53=?2>??=>A;A=A?A>?AB>AA>A XA:i:0 MD:Z:100 NM:i:0
    HWUSI-EAS705_9146:3:20:433:1834/1 0 chr1_length_4160774 1374502 255 100M * 0 0 TTCGCCTTCGGCCTTCTTGTCGCGGGCGATTTCCTTGCCGGTGGCCTGGTCGACGACCTTCATCGACAGGCGGACCTTGCCGCGCTCGTCGAAGCCCATC BAA<AB=?A30@A?AAA>?9=B<=>5;8;=>?4:=919;3555/554533;35;55555;5554%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% XA:i:0 MD:Z:100 NM:i:0
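A SAM alignment record is a line of tab-separated fields: eleven mandatory columns followed by optional TAG:TYPE:VALUE entries. The sketch below parses one record into a small data class. The class and function names are hypothetical, the example record is a shortened, made-up line in the same shape as the Bowtie output above, and header lines (which start with '@' and carry no alignment) are not handled.

```python
from dataclasses import dataclass, field


@dataclass
class SamRecord:
    qname: str
    flag: int
    rname: str
    pos: int          # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict = field(default_factory=dict)

    @property
    def is_reverse(self) -> bool:
        """Flag bit 0x10 marks a read aligned to the reverse strand."""
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one tab-separated SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


example = "read1/1\t0\tchr1\t1374500\t255\t4M\t*\t0\t0\tTCTT\tIIII\tXA:i:0\tMD:Z:4\tNM:i:0"
record = parse_sam_line(example)
print(record.rname, record.pos, record.mapq, record.tags)
```

For the three reads on the slide, the positions 1374500, 1374501, and 1374502 differ by one base each, matching the one-base shift visible in their sequences.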
  • 86.
    Alignment in Genome Viewer
  • 87.
    SNP, Micro-InDel, & Point Mutation
    • The greatest computational challenge in variation analysis (SNP/InDel) lies in judging the likelihood that a position is a heterozygous or homozygous variant, given the error rates of the various platforms, the probability of bad mappings, and the amount of support or coverage
    • Therefore, most tools include a detailed data preparation step in which they filter, realign, and often re-score reads, followed by a nucleotide or heterozygosity calling step done under a Bayesian framework
  • 88.
    Lack of Coverage
    • Coverage at a position i of a target is defined as the number of fragments that cover this position. If coverage is zero or low, there is not enough information in the fragment set to reconstruct the target completely
    [Figure: a target with fragments aligned beneath it and two regions marked "No Coverage".]
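The definition of coverage translates directly into code. The sketch below computes per-position coverage from fragment placements and reports the uncovered stretches; the function names are hypothetical, and fragments are assumed to be given as (0-based start, length) pairs on the target.

```python
def coverage_profile(target_len: int, fragments: list[tuple[int, int]]) -> list[int]:
    """Number of fragments covering each position of the target."""
    diff = [0] * (target_len + 1)          # difference array: +1 at start, -1 past end
    for start, length in fragments:
        if start < 0 or start >= target_len:
            continue                       # fragment lies outside the target
        end = min(start + length, target_len)
        diff[start] += 1
        diff[end] -= 1
    profile, running = [], 0
    for delta in diff[:-1]:
        running += delta
        profile.append(running)
    return profile


def uncovered_regions(profile: list[int]) -> list[tuple[int, int]]:
    """Half-open (start, end) intervals where coverage is zero."""
    regions, start = [], None
    for i, depth in enumerate(profile):
        if depth == 0 and start is None:
            start = i
        elif depth != 0 and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(profile)))
    return regions


profile = coverage_profile(10, [(0, 4), (2, 4), (8, 2)])
print(profile)
print(uncovered_regions(profile))
```

Positions 6 and 7 in this toy example have zero coverage, so no assembler could reconstruct that part of the target from these fragments.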
  • 89.
    End of Part III, IV & V
    InterpretOmics
    Office: Shezan Lavelle, 5th Floor, #15 Walton Road, Bengaluru 560001
    Lab: #329, 7th Main, HAL 2nd Stage, Indiranagar, Bengaluru 560008
    Phone: +91(80)46623800