Data analysis challenges in genomics

Guest lecture, Data Mining

Uppsala 2013-10-08
Mikael Huss
Science for Life Laboratory / Stockholm University
Where I work

Science for Life Laboratory Stockholm, at Karolinska Institutet science park
A national center for high-throughput biology, i.e. massively parallel measurements of
DNA/RNA (“genomics”, “next generation DNA sequencing”), proteins (“proteomics”,
mass spectrometry), etc.
Nodes in Uppsala & Stockholm; funded by strategic grants
Offers services to customers, mostly DNA sequencing + associated analysis
Outline

1. Context (short intro to DNA sequencing)
2. Big goals / visions
3. Examples of data mining applications and technical challenges
1. Some context on DNA sequencing

All* living organisms have DNA as
their blueprint
GTTACGTAACCGTTACGTA…..
CCTTGATCGTAAC….
Etc. (2x3 billion letters for humans)
*OK, some viruses have RNA
A short refresher on molecular genetics!
DNA: …ACGT…
Blueprint / source code (http://ds9a.nl/amazing-dna)
Pretty much identical in all your cells

RNA: …ACGU…
“Expressed”, “active” genes
Differs between tissues, cell types, disease vs health

Proteins: …KVL…
The molecules that actually do stuff

Reading the nucleotides or amino acids is called sequencing
DNA and RNA are easier to isolate, and therefore to sequence, than proteins
DNA sequencing means “reading the genome”
RNA sequencing can be used to get a snapshot of the active genes
Protein abundance can be measured, but this is harder to do on a massive scale
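The DNA -> RNA -> protein relationship in the refresher can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and a made-up nine-letter sequence; the names `transcribe` and `translate` are hypothetical helpers, not part of the lecture.

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Return the RNA with the same sequence as the coding DNA strand (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate RNA codon by codon into one-letter amino acids ('*' = stop)."""
    dna = rna.upper().replace("U", "T")
    # Ignore a trailing partial codon, if any.
    codons = (dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "AAAGTTCTG"  # hypothetical toy sequence, chosen for illustration
rna = transcribe(dna)
protein = translate(rna)
print(dna, rna, protein)
```

Real genes are more involved (introns, reading frames, strand orientation), so this only shows the alphabet change from ACGT to ACGU to amino-acid letters.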
SciLifeLab

Presently sequencing ~3 megabases of DNA per second
Corresponding to about 3 human genome sizes per hour
Also RNA, protein measurements
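As a quick sanity check of the throughput figures, the arithmetic can be written out. This sketch assumes a human genome size of about 3 Gbp, as quoted elsewhere in these slides; the variable names are illustrative only.

```python
BASES_PER_SECOND = 3e6      # "~3 megabases of DNA per second"
HUMAN_GENOME_BASES = 3e9    # one human genome copy, ~3 Gbp
SECONDS_PER_HOUR = 3600

bases_per_hour = BASES_PER_SECOND * SECONDS_PER_HOUR
genomes_per_hour = bases_per_hour / HUMAN_GENOME_BASES
print(f"{bases_per_hour / 1e9:.1f} Gbp per hour, "
      f"about {genomes_per_hour:.1f} human genome sizes per hour")
```

The result is a little above 3 genome sizes per hour, consistent with the rounded figure on the slide.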
What is sequencing good for?
- Mapping new genomes
- Comparing individual genomes to each other
- Looking at how genes are expressed (RNA sequencing)
De novo genome sequencing
Mapping new genomes
E.g. Norway spruce (Christmas tree)
Economically the most important Swedish tree
Provides a basis for research on
• tools for breeding for tree productivity, quality, health
• tools for cellulose and wood fibre modification (new materials)

Genome sizes: Arabidopsis (0.12 Gbp), Populus (0.45 Gbp), humans (3 Gbp), spruce (20 Gbp), conifers (20 Gbp)
Resequencing and variation analysis
Working in the context of a known
reference genome.
Common application: Looking for genes
responsible for hereditary diseases
Often rare monogenic or common
complex diseases
More than 6,000 known monogenic
diseases
Only ~½ have an associated gene
(OMIM)
Complex diseases – diabetes, asthma,
MS, ….
Functional genomics
- How genes actually get expressed

Variation between
-Tissues
-Cell types
-Cell states
-Individuals
Functional genomics

Transcriptional patterns
“cell types” as attractors
in systems of interacting
genes

Furusawa and Kaneko, Biology Direct 2009 4:17
2. Big goals / visions
Big goals / visions
• Precision medicine
– Genomic medicine
– Personalized medicine
– Individualized treatments

• Understanding natural diversity
– Discovering new organisms
– Mapping ecological niches

• Understanding complex diseases
– Molecular definitions of diseases
– Lifestyle and epigenetics
Mount Sinai Medical Center / Eric Schadt
Personal sequencing?

Genomics apps
Community genomics & crowdsourced clinical trials

https://www.23andme.com/about/factoids/
Exploring the human
microbiome

Estimated 10x more
bacterial cells than
human cells in human
body
Three “enterotypes”
Personal microbiome sequencing
Environmental samples: soil, ocean etc

Identifying new viruses in human or environmental samples; <1% known so far
http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html
Planetary ecology
Perhaps: “genomic observatories” continuously monitoring environmental DNA
streaming, real-time analysis important
Complex diseases
• Cardiovascular disease
• Autoimmune disease
– Rheumatism
– Multiple sclerosis
– Psoriasis
– …
• Diabetes
(etc.)
No simple genetic explanation.
Lifestyle & environment factors likely important.
Data integration and correlative analysis
Cancer – not one disease
http://techcrunch.com/2012/03/29/cloud-will-cure-cancer/

“Collecting comprehensive profiles of every tumor for every patient provides a dataset to
build models that learn normal cellular function from cancerous deviations.
Diagnostics and treatment companies/hospitals/physicians can then use the models to
deliver therapy.

If we imagine a world where every tumor is comprehensively profiled, it quickly becomes
clear that not only will the data sets be very large but also involve different domains of
expertise required for quality control, model building, and interpretation.”
Epigenetics and lifestyle

Genes – Epigenetics – Lifestyle - Environment

Understanding the interplay of lifestyle
(including environment) and genes through
the “interface layer”, epigenetics.
Massive correlational analyses …

epigenetics – changes in gene expression that are not due to base sequence changes
(and that can be passed on to daughter cells during cell division)
Gigantic clinical sequencing projects

Genomics England / NHS will sequence 100,000 genomes of patients in
the next 5 years

… BGI aims for a million

But are we ready to interpret genomes?
3. Applications and challenges of data
mining in genomics
Storage and transfer

“European Bioinformatics Institute (EBI) stores 20 PB of data, of which 2 PB is
genomic”
“Single human genome ~140 Gb”
“… downloading the data is time-consuming, and researchers must be sure that their
computational infrastructure and software tools are up to the task. ‘If I could, I would
routinely look at all sequenced cancer genomes,’ says [Arend] Sidow. ‘With the
current infrastructure, that's impossible.’”
Cloud solutions:
Embassy Cloud – EBI + CSC in Espoo
easyGenomics – BGI Hong Kong
DNANexus – commercial service, Silicon Valley
Analysis challenges

Dealing with the size of raw data

Growth in sequencing capacity has outstripped
Moore’s law

Need to throw away data
-> Tailored streaming / approximate algorithms

The Economist
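One classic example of a streaming algorithm that deliberately throws data away is reservoir sampling, which keeps a fixed-size uniform random sample from a stream of unknown length. The sketch below is only an illustration of the idea; the slides do not say which algorithms are meant, and the read identifiers are invented.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # happens with probability k / (i + 1)
    return reservoir


# Hypothetical stream of read identifiers; real reads would come from a file
# or directly from the instrument.
reads = (f"read_{n}" for n in range(100_000))
sample = reservoir_sample(reads, k=5, rng=random.Random(0))
print(sample)
```

Memory use depends only on `k`, not on the number of reads, which is the property that matters when raw data grow faster than storage.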
Shape of data
“Commercial” big data:
(e.g. purchase data, movie ratings, “likes”, cell phone locations, tweets)
- Typically cheap to collect examples (data points) -> many observations
- Usually low-dimensional (few features)
- Data are informative only in aggregate (each data point is almost meaningless)
Biomedical big data:
(e.g. DNA sequencing, fMRI etc)
- Typically expensive to collect data points -> few observations
- Usually very high dimensional (e.g. ~20,000 gene measurements)
- Underpowered for modelling: many more features than observations
So, biological data often seem to be “transposed” relative to other types
(“large p, small n”)
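Why “large p, small n” is a problem can be shown with pure noise: when there are far more features than samples, some features correlate strongly with any labelling by chance. A minimal sketch with toy sizes (20 samples, 2,000 random features), all values simulated:

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(42)
n_samples, n_features = 20, 2000    # small n, large p (toy sizes)

# Labels and features are independent noise: no feature is truly informative.
labels = [i % 2 for i in range(n_samples)]
features = [[rng.gauss(0, 1) for _ in range(n_samples)]
            for _ in range(n_features)]

correlations = [abs(pearson(f, labels)) for f in features]
n_strong = sum(c > 0.5 for c in correlations)
print(f"strongest |r| = {max(correlations):.2f}; "
      f"features with |r| > 0.5: {n_strong}")
```

Any apparent “biomarker” found this way is spurious, which is why held-out test sets and multiple-testing corrections matter so much for this shape of data.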
The shape of (raw and processed) data

10-250 million such entries for one sample in an experiment
Gene expression

20,000-row x 125-column matrix

Genetic variants

Perhaps 3 million rows
Examples of data mining applications in genomics
• Classification
– Diseases and disease subtypes
– Biomarkers for disease
– Predicting disease presence or subtype from gene expression
• Clustering and visualization
– Defining cell types
– Molecular definitions of disease
• Association rules
– Text analysis
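As a small illustration of predicting disease presence from gene expression, here is a nearest-centroid classifier in plain Python. It is a sketch of one simple technique, not a method named in the lecture, and the expression values are made up.

```python
from collections import defaultdict
from math import dist


def fit_centroids(profiles, labels):
    """Average the expression profiles of each class into one centroid."""
    grouped = defaultdict(list)
    for profile, label in zip(profiles, labels):
        grouped[label].append(profile)
    return {
        label: [sum(values) / len(values) for values in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, profile):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], profile))


# Made-up expression values for three genes; not real patient data.
train_profiles = [
    [5.1, 1.0, 3.2], [4.8, 1.3, 3.0],   # "control"
    [2.0, 4.2, 3.1], [2.3, 3.9, 2.8],   # "disease"
]
train_labels = ["control", "control", "disease", "disease"]

centroids = fit_centroids(train_profiles, train_labels)
print(predict(centroids, [4.9, 1.1, 3.3]))
print(predict(centroids, [2.1, 4.0, 3.0]))
```

With real data there would be thousands of genes per sample, so feature selection or shrinkage usually comes before a classifier like this.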
Electronic health records
Mining electronic health records: towards better research
applications and clinical care
Peter B. Jensen, Lars J. Jensen & Søren Brunak
Nature Reviews Genetics 13, 395-405 (June 2012)

Unstructured and structured text
Medication history
Test results
Demographics
(etc)
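Association rules over records like these are usually scored by support and confidence. The sketch below computes both for one rule on invented records; the item names (`drug_A`, `diabetes`, and so on) are hypothetical and do not come from the cited review.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(items <= record for record in records) / len(records)


def confidence(records, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the records."""
    both = set(antecedent) | set(consequent)
    return support(records, both) / support(records, antecedent)


# Made-up records: each is the set of codes seen for one hypothetical patient.
records = [
    {"drug_A", "high_glucose", "diabetes"},
    {"drug_A", "diabetes"},
    {"drug_B", "high_glucose"},
    {"drug_A", "high_glucose", "diabetes"},
    {"drug_B"},
]

rule_support = support(records, {"drug_A", "diabetes"})
rule_confidence = confidence(records, {"drug_A"}, {"diabetes"})
print(f"drug_A -> diabetes: support={rule_support:.2f}, "
      f"confidence={rule_confidence:.2f}")
```

A high-confidence rule is still only a correlation: here the drug is presumably prescribed because of the diagnosis, not the other way round.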
Genome interpretation
Gene expression patterns and
neuronal cell types

Gene expression

Genes

Shape and
behavior of
neurons

Cell types
Sugino et al, Molecular taxonomy of major neuronal classes in the
adult mouse forebrain, Nature Neuroscience 9, 99 - 107 (2005)
Genetics of multiple sclerosis
• Gene expression data on ~120 patients and 70 controls
• Medication, lifestyle, specific diagnosis
• Environment important – sunlight, tobacco etc

[Figure: per-sample panels: Gene expression; Medication, diagnosis etc]
Predictive analysis contests
Science-oriented
SBV Improver Challenge #1

• Build predictive models for classifying gene
expression signatures for:
– Psoriasis
– Multiple sclerosis
– COPD
– Lung cancer

• Training set was public data; the secret test set
was proprietary
SBV Improver Challenge #1

• Psoriasis easy
• Lung cancer hard
• MS diagnostic, COPD somewhere in the middle
• MS subtype: no statistically significant submissions!

https://www.sbvimprover.com/sbv-improver-symposium-2012-presentations
Species translation challenge
- Can the perturbations of signaling
pathways in one species predict the
response to a given stimulus in another
species?

- Which computational methods are most
effective for inferring gene, phosphorylation
and pathway responses from one species
to another?
CAMDA 2013 challenges
Question 1: Can we replace the animal study
with an in vitro assay? Current safety
assessment relies largely on the animal
model, which is time-consuming, labor-intensive,
and not in line with animal-rights concerns.
There is a paradigm shift in toxicology to explore
the possibility of replacing the animal model with
in vitro assays coupled with toxicogenomics.
The TGP data contain both in vitro and animal data,
which are essential to address this question.
Question 2: Can we predict liver injury in
humans using toxicogenomics data from
animals?

Available data:
Drug Information (Excel table) – basic information about individual drugs from DrugBank
Pathology Data (Excel table) – pathology and clinical chemistry data for each rat
Array Metadata (CSV format) – metadata (e.g., dose, time, sacrifice time, etc.)

“toxicogenomics”
Fully open code that runs on the server to generate predictions. Can build on others’ results

More Related Content

What's hot

Microbial Metagenomics Drives a New Cyberinfrastructure
Microbial Metagenomics Drives a New CyberinfrastructureMicrobial Metagenomics Drives a New Cyberinfrastructure
Microbial Metagenomics Drives a New CyberinfrastructureLarry Smarr
 
Bioinformatics workshop presentation
Bioinformatics   workshop presentationBioinformatics   workshop presentation
Bioinformatics workshop presentationSKUAST-Kashmir
 
Bioinformatics (Exam point of view)
Bioinformatics (Exam point of view)Bioinformatics (Exam point of view)
Bioinformatics (Exam point of view)Sijo A
 
Microbial Genomics and Bioinformatics: BM405 (2015)
Microbial Genomics and Bioinformatics: BM405 (2015)Microbial Genomics and Bioinformatics: BM405 (2015)
Microbial Genomics and Bioinformatics: BM405 (2015)Leighton Pritchard
 
Introduction to Bioinformatics Slides
Introduction to Bioinformatics SlidesIntroduction to Bioinformatics Slides
Introduction to Bioinformatics SlidesSaide OER Africa
 
Computational Genomics - Bioinformatics - IK
Computational Genomics - Bioinformatics - IKComputational Genomics - Bioinformatics - IK
Computational Genomics - Bioinformatics - IKIlgın Kavaklıoğulları
 
BioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomicsBioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomicsAyeshaYousaf20
 
Bioinformatics
BioinformaticsBioinformatics
BioinformaticsAmna Jalil
 
Metagenomics sequencing
Metagenomics sequencingMetagenomics sequencing
Metagenomics sequencingcdgenomics525
 
Variant (SNPs/Indels) calling in DNA sequences, Part 1
Variant (SNPs/Indels) calling in DNA sequences, Part 1 Variant (SNPs/Indels) calling in DNA sequences, Part 1
Variant (SNPs/Indels) calling in DNA sequences, Part 1 Denis C. Bauer
 
Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchAnshika Bansal
 
bioinformatics simple
bioinformatics simple bioinformatics simple
bioinformatics simple nadeem akhter
 
Biodatabases 101220022654-phpapp02
Biodatabases 101220022654-phpapp02Biodatabases 101220022654-phpapp02
Biodatabases 101220022654-phpapp02Sreekanth Gali
 

What's hot (20)

Microbial Metagenomics Drives a New Cyberinfrastructure
Microbial Metagenomics Drives a New CyberinfrastructureMicrobial Metagenomics Drives a New Cyberinfrastructure
Microbial Metagenomics Drives a New Cyberinfrastructure
 
Bioinformatics workshop presentation
Bioinformatics   workshop presentationBioinformatics   workshop presentation
Bioinformatics workshop presentation
 
Bioinformatics (Exam point of view)
Bioinformatics (Exam point of view)Bioinformatics (Exam point of view)
Bioinformatics (Exam point of view)
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Microbial Genomics and Bioinformatics: BM405 (2015)
Microbial Genomics and Bioinformatics: BM405 (2015)Microbial Genomics and Bioinformatics: BM405 (2015)
Microbial Genomics and Bioinformatics: BM405 (2015)
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Introduction to Bioinformatics Slides
Introduction to Bioinformatics SlidesIntroduction to Bioinformatics Slides
Introduction to Bioinformatics Slides
 
Computational Genomics - Bioinformatics - IK
Computational Genomics - Bioinformatics - IKComputational Genomics - Bioinformatics - IK
Computational Genomics - Bioinformatics - IK
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
BioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomicsBioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomics
 
Major biological nucleotide databases
Major biological nucleotide databasesMajor biological nucleotide databases
Major biological nucleotide databases
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Metagenomics sequencing
Metagenomics sequencingMetagenomics sequencing
Metagenomics sequencing
 
Intro bioinfo
Intro bioinfoIntro bioinfo
Intro bioinfo
 
Variant (SNPs/Indels) calling in DNA sequences, Part 1
Variant (SNPs/Indels) calling in DNA sequences, Part 1 Variant (SNPs/Indels) calling in DNA sequences, Part 1
Variant (SNPs/Indels) calling in DNA sequences, Part 1
 
Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences research
 
Bioinformatics in a Nutshell
Bioinformatics in a NutshellBioinformatics in a Nutshell
Bioinformatics in a Nutshell
 
Big data nebraska
Big data nebraskaBig data nebraska
Big data nebraska
 
bioinformatics simple
bioinformatics simple bioinformatics simple
bioinformatics simple
 
Biodatabases 101220022654-phpapp02
Biodatabases 101220022654-phpapp02Biodatabases 101220022654-phpapp02
Biodatabases 101220022654-phpapp02
 

Viewers also liked

Intel - Challenges and Opportunities in Cloud-Based Genomics Analytics
Intel - Challenges and Opportunities in Cloud-Based Genomics AnalyticsIntel - Challenges and Opportunities in Cloud-Based Genomics Analytics
Intel - Challenges and Opportunities in Cloud-Based Genomics AnalyticsIntelHealthcare
 
Big data analysing genomics and the bdg project
Big data   analysing genomics and the bdg projectBig data   analysing genomics and the bdg project
Big data analysing genomics and the bdg projectsree navya
 
BITS - Search engines for mass spec data
BITS - Search engines for mass spec dataBITS - Search engines for mass spec data
BITS - Search engines for mass spec dataBITS
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't SpecialAllen Day, PhD
 
Introduction to Linux for bioinformatics
Introduction to Linux for bioinformaticsIntroduction to Linux for bioinformatics
Introduction to Linux for bioinformaticsBITS
 
Towards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systemsTowards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systemscursoNGS
 
NGS analysis of micro-RNA
NGS analysis of micro-RNANGS analysis of micro-RNA
NGS analysis of micro-RNAcursoNGS
 
Genomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersGenomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersAllen Day, PhD
 
BITS - Introduction to comparative genomics
BITS - Introduction to comparative genomicsBITS - Introduction to comparative genomics
BITS - Introduction to comparative genomicsBITS
 
RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6BITS
 
RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4BITS
 
Utilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humanaUtilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humanacursoNGS
 
Text mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformaticsText mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformaticsBITS
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
 
Managing your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformaticsManaging your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformaticsBITS
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3BITS
 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in Rmikaelhuss
 
Differential expression in RNA-Seq
Differential expression in RNA-SeqDifferential expression in RNA-Seq
Differential expression in RNA-SeqcursoNGS
 

Viewers also liked (20)

Intel - Challenges and Opportunities in Cloud-Based Genomics Analytics
Intel - Challenges and Opportunities in Cloud-Based Genomics AnalyticsIntel - Challenges and Opportunities in Cloud-Based Genomics Analytics
Intel - Challenges and Opportunities in Cloud-Based Genomics Analytics
 
Big data analysing genomics and the bdg project
Big data   analysing genomics and the bdg projectBig data   analysing genomics and the bdg project
Big data analysing genomics and the bdg project
 
BITS - Search engines for mass spec data
BITS - Search engines for mass spec dataBITS - Search engines for mass spec data
BITS - Search engines for mass spec data
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't Special
 
Introduction to Linux for bioinformatics
Introduction to Linux for bioinformaticsIntroduction to Linux for bioinformatics
Introduction to Linux for bioinformatics
 
Towards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systemsTowards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systems
 
CAD CAM CAE
CAD CAM CAECAD CAM CAE
CAD CAM CAE
 
NGS analysis of micro-RNA
NGS analysis of micro-RNANGS analysis of micro-RNA
NGS analysis of micro-RNA
 
Genomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersGenomics Crash Course for Data Engineers
Genomics Crash Course for Data Engineers
 
BITS - Introduction to comparative genomics
BITS - Introduction to comparative genomicsBITS - Introduction to comparative genomics
BITS - Introduction to comparative genomics
 
RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6
 
RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4
 
Utilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humanaUtilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humana
 
Text mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformaticsText mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformatics
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAM
 
Managing your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformaticsManaging your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformatics
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in R
 
Differential expression in RNA-Seq
Differential expression in RNA-SeqDifferential expression in RNA-Seq
Differential expression in RNA-Seq
 
Cad cam cae
Cad cam caeCad cam cae
Cad cam cae
 

Similar to Data analytics challenges in genomics

TLSC Biotech 101 Noc 2010 (Moore)
TLSC Biotech 101 Noc 2010 (Moore)TLSC Biotech 101 Noc 2010 (Moore)
TLSC Biotech 101 Noc 2010 (Moore)jmoore89
 
Jacques Fellay, EPFL, pour la journée e-health 2013
Jacques Fellay, EPFL, pour la journée e-health 2013Jacques Fellay, EPFL, pour la journée e-health 2013
Jacques Fellay, EPFL, pour la journée e-health 2013Thearkvalais
 
How to transform genomic big data into valuable clinical information
How to transform genomic big data into valuable clinical informationHow to transform genomic big data into valuable clinical information
How to transform genomic big data into valuable clinical informationJoaquin Dopazo
 
Methods to enhance the validity of precision guidelines emerging from big data
Methods to enhance the validity of precision guidelines emerging from big dataMethods to enhance the validity of precision guidelines emerging from big data
Methods to enhance the validity of precision guidelines emerging from big dataChirag Patel
 
tranSMART Community Meeting 5-7 Nov 13 - Session 5: The Accelerated Cure Proj...
tranSMART Community Meeting 5-7 Nov 13 - Session 5: The Accelerated Cure Proj...tranSMART Community Meeting 5-7 Nov 13 - Session 5: The Accelerated Cure Proj...
tranSMART Community Meeting 5-7 Nov 13 - Session 5: The Accelerated Cure Proj...David Peyruc
 
NIH Data Science Special Interest Group
NIH Data Science Special Interest GroupNIH Data Science Special Interest Group
NIH Data Science Special Interest GroupYaffa Rubinstien
 
Biobanking a user’s perspective: Dr. Jonathan Pevsner
Biobanking a user’s perspective: Dr. Jonathan PevsnerBiobanking a user’s perspective: Dr. Jonathan Pevsner
Biobanking a user’s perspective: Dr. Jonathan PevsnerData Science NIH
 
Introducción a la bioinformatica
Introducción a la bioinformaticaIntroducción a la bioinformatica
Introducción a la bioinformaticaMartín Arrieta
 
PAPER 3.1 ~ HUMAN GENOME PROJECT
PAPER 3.1 ~  HUMAN GENOME PROJECTPAPER 3.1 ~  HUMAN GENOME PROJECT
PAPER 3.1 ~ HUMAN GENOME PROJECTNusrat Gulbarga
 
Developing tools & Methodologies for the NExt Generation of Genomics & Bio In...
Developing tools & Methodologies for the NExt Generation of Genomics & Bio In...Developing tools & Methodologies for the NExt Generation of Genomics & Bio In...
Developing tools & Methodologies for the NExt Generation of Genomics & Bio In...Intel IT Center
 
Amia tb-review-08
Amia tb-review-08Amia tb-review-08
Amia tb-review-08Russ Altman
 
Addressing standardization challenges through integrated approaches in biomed...
Addressing standardization challenges through integrated approaches in biomed...Addressing standardization challenges through integrated approaches in biomed...
Addressing standardization challenges through integrated approaches in biomed...Lynn Schriml
 
Evolution of Knowledge Discovery and Management
Evolution of Knowledge Discovery and Management Evolution of Knowledge Discovery and Management
Evolution of Knowledge Discovery and Management inscit2006
 
Digging into thousands of variants to find disease genes in Mendelian and com...
Digging into thousands of variants to find disease genes in Mendelian and com...Digging into thousands of variants to find disease genes in Mendelian and com...
Digging into thousands of variants to find disease genes in Mendelian and com...Joaquin Dopazo
 
Bioinformatics - Discovering the Bio Logic Of Nature
Bioinformatics - Discovering the Bio Logic Of NatureBioinformatics - Discovering the Bio Logic Of Nature
Bioinformatics - Discovering the Bio Logic Of NatureRobert Cormia
 
Manteia non confidential-presentation 2003-09
Manteia non confidential-presentation 2003-09Manteia non confidential-presentation 2003-09
Manteia non confidential-presentation 2003-09Pascal Mayer
 

Similar to Data analytics challenges in genomics (20)

TLSC Biotech 101 Noc 2010 (Moore)
TLSC Biotech 101 Noc 2010 (Moore)TLSC Biotech 101 Noc 2010 (Moore)
TLSC Biotech 101 Noc 2010 (Moore)
 
Jacques Fellay, EPFL, pour la journée e-health 2013
Jacques Fellay, EPFL, pour la journée e-health 2013Jacques Fellay, EPFL, pour la journée e-health 2013
Jacques Fellay, EPFL, pour la journée e-health 2013
 
Gellibolian 2010 Audio Visual2
Gellibolian 2010 Audio Visual2Gellibolian 2010 Audio Visual2
Gellibolian 2010 Audio Visual2
 
How to transform genomic big data into valuable clinical information
How to transform genomic big data into valuable clinical informationHow to transform genomic big data into valuable clinical information
How to transform genomic big data into valuable clinical information
 
Methods to enhance the validity of precision guidelines emerging from big data
Methods to enhance the validity of precision guidelines emerging from big dataMethods to enhance the validity of precision guidelines emerging from big data
Methods to enhance the validity of precision guidelines emerging from big data
 
tranSMART Community Meeting 5-7 Nov 13 - Session 5: The Accelerated Cure Proj...
tranSMART Community Meeting 5-7 Nov 13 - Session 5: The Accelerated Cure Proj...tranSMART Community Meeting 5-7 Nov 13 - Session 5: The Accelerated Cure Proj...
tranSMART Community Meeting 5-7 Nov 13 - Session 5: The Accelerated Cure Proj...
 
NIH Data Science Special Interest Group
NIH Data Science Special Interest GroupNIH Data Science Special Interest Group
NIH Data Science Special Interest Group
 
Biobanking a user’s perspective: Dr. Jonathan Pevsner
Biobanking a user’s perspective: Dr. Jonathan PevsnerBiobanking a user’s perspective: Dr. Jonathan Pevsner
Biobanking a user’s perspective: Dr. Jonathan Pevsner
 
Introducción a la bioinformatica
Introducción a la bioinformaticaIntroducción a la bioinformatica
Introducción a la bioinformatica
 
PAPER 3.1 ~ HUMAN GENOME PROJECT
PAPER 3.1 ~  HUMAN GENOME PROJECTPAPER 3.1 ~  HUMAN GENOME PROJECT
PAPER 3.1 ~ HUMAN GENOME PROJECT
 
Developing tools & Methodologies for the NExt Generation of Genomics & Bio In...
Developing tools & Methodologies for the NExt Generation of Genomics & Bio In...Developing tools & Methodologies for the NExt Generation of Genomics & Bio In...
Developing tools & Methodologies for the NExt Generation of Genomics & Bio In...
 
Amia tb-review-08
Amia tb-review-08Amia tb-review-08
Amia tb-review-08
 
Addressing standardization challenges through integrated approaches in biomed...
Addressing standardization challenges through integrated approaches in biomed...Addressing standardization challenges through integrated approaches in biomed...
Addressing standardization challenges through integrated approaches in biomed...
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Evolution of Knowledge Discovery and Management
Evolution of Knowledge Discovery and Management Evolution of Knowledge Discovery and Management
Evolution of Knowledge Discovery and Management
 
Digging into thousands of variants to find disease genes in Mendelian and com...
Digging into thousands of variants to find disease genes in Mendelian and com...Digging into thousands of variants to find disease genes in Mendelian and com...
Digging into thousands of variants to find disease genes in Mendelian and com...
 
Dr. Leroy Hood Lecuture on P4 Medicine
Dr. Leroy Hood Lecuture on P4 MedicineDr. Leroy Hood Lecuture on P4 Medicine
Dr. Leroy Hood Lecuture on P4 Medicine
 
Bioinformatics - Discovering the Bio Logic Of Nature
Bioinformatics - Discovering the Bio Logic Of NatureBioinformatics - Discovering the Bio Logic Of Nature
Bioinformatics - Discovering the Bio Logic Of Nature
 
Genomic Data Analysis
Genomic Data AnalysisGenomic Data Analysis
Genomic Data Analysis
 
Manteia non confidential-presentation 2003-09
Manteia non confidential-presentation 2003-09Manteia non confidential-presentation 2003-09
Manteia non confidential-presentation 2003-09
 

Data analytics challenges in genomics

  • 1. Data analysis challenges in genomics Guest lecture, Data Mining Uppsala 2013-10-08 Mikael Huss Science for Life Laboratory / Stockholm University
  • 2. Where I work Science for Life Laboratory Stockholm, at Karolinska Institutet science park A national center for high-throughput biology, i.e. massively parallel measurements of DNA/RNA (“genomics”, “next generation DNA sequencing”), proteins (“proteomics”, mass spectrometry) etc. Nodes in Uppsala & Stockholm; funded by strategic grants Offers services to customers, mostly DNA sequencing + associated analysis
  • 3. Outline 1. Context (short intro to DNA sequencing) 2. Big goals / visions 3. Examples of data mining applications and technical challenges
  • 4. 1. Some context on DNA sequencing
  • 5. ? All* living organisms have DNA as their blueprint GTTACGTAACCGTTACGTA….. CCTTGATCGTAAC…. Etc. (2x3 billion letters for humans) *OK, some viruses have RNA
  • 6. A short refresher on molecular genetics! …ACGT… DNA Blueprint / source code (http://ds9a.nl/amazing-dna) Pretty much identical in all your cells …ACGU… RNA “Expressed”, “active” genes Differs between tissues, cell types, disease vs health Proteins The molecules that actually do stuff …KVL… Reading the nucleotides or amino acids is called sequencing It is easier to isolate and therefore to sequence DNA and RNA DNA sequencing means “reading the genome” RNA sequencing can be used to get a snapshot of the active genes Protein abundance can be measured, but this is harder to do on a massive scale
  • 7. SciLifeLab Presently sequencing ~3 megabases of DNA per second Corresponding to about 3 human genome sizes per hour Also RNA, protein measurements
  • 8. What is sequencing good for? - Mapping new genomes - Comparing individual genomes to each other - Looking at how genes are expressed (RNA sequencing)
  • 9. De novo genome sequencing Mapping new genomes E.g. Norway spruce (Christmas tree) Economically the most important Swedish tree Provides a basis for research on conifers • tools for breeding for tree productivity, quality, health • tools for cellulose and wood fibre modification (new materials) Genome sizes: Arabidopsis (0.12 Gbp) Populus (0.45 Gbp) Humans (3 Gbp) Spruce (20 Gbp)
  • 10. Resequencing and variation analysis Working in the context of a known reference genome. Common application: Looking for genes responsible for hereditary diseases Often rare monogenic or common complex diseases More than 6,000 known monogenic diseases Only about half have an associated gene (OMIM) Complex diseases – diabetes, asthma, MS, ….
  • 11. Functional genomics - How genes actually get expressed Variation between -Tissues -Cell types -Cell states -Individuals
  • 12. Functional genomics Transcriptional patterns “cell types” as attractors in systems of interacting genes Furusawa and Kaneko, Biology Direct 2009 4:17
  • 13. 2. Big goals / visions
  • 14. Big goals / visions • Precision medicine – Genomic medicine – Personalized medicine – Individualized treatments • Understanding natural diversity – Discovering new organisms – Mapping ecological niches • Understanding complex diseases – Molecular definitions of diseases – Lifestyle and epigenetics
  • 16. Mount Sinai Medical Center / Eric Schadt
  • 19. Community genomics & crowdsourced clinical trials https://www.23andme.com/about/factoids/
  • 20. Exploring the human microbiome Estimated 10x more bacterial cells than human cells in human body Three “enterotypes”
  • 22. Big goals / visions • Precision medicine – Genomic medicine – Personalized medicine – Individualized treatments • Understanding natural diversity – Discovering new organisms – Mapping ecological niches • Understanding complex diseases – Molecular definitions of diseases – Lifestyle and epigenetics
  • 23. Environmental samples: soil, ocean etc Identifying new viruses in human or environmental samples; <1% known so far
  • 25. Planetary ecology Perhaps: “genomic observatories” continuously monitoring environmental DNA; streaming, real-time analysis important
  • 26. Big goals / visions • Precision medicine – Genomic medicine – Personalized medicine – Individualized treatments • Understanding natural diversity – Discovering new organisms – Mapping ecological niches • Understanding complex diseases – Molecular definitions of diseases – Lifestyle and epigenetics
  • 27. Complex diseases • Cardiovascular disease • Autoimmune disease – Rheumatism – Multiple sclerosis – Psoriasis – … • Diabetes (etc.) No simple genetic explanation. Lifestyle & environment factors likely important.
  • 28. Data integration and correlative analysis Cancer – not one disease http://techcrunch.com/2012/03/29/cloud-will-cure-cancer/ “Collecting comprehensive profiles of every tumor for every patient provides a dataset to build models that learn normal cellular function from cancerous deviations. Diagnostics and treatment companies/hospitals/physicians can then use the models to deliver therapy. If we imagine a world where every tumor is comprehensively profiled, it quickly becomes clear that not only will the data sets be very large but also involve different domains of expertise required for quality control, model building, and interpretation.”
  • 29. Epigenetics and lifestyle Genes – Epigenetics – Lifestyle - Environment Understanding the interplay of lifestyle (including environment) and genes through the “interface layer”, epigenetics. Massive correlational analyses … epigenetics – changes in gene expression that are not due to base sequence changes (and that can be passed on to daughter cells during cell division)
  • 30. Gigantic clinical sequencing projects Genomics England / NHS will sequence 100,000 genomes of patients in the next 5 years … BGI aims for a million But are we ready to interpret genomes?
  • 31. 3. Applications and challenges of data mining in genomics
  • 32. Storage and transfer “European Bioinformatics Institute (EBI) stores 20 PB of data, of which 2 PB is genomic” “Single human genome ~140 Gb” “ … downloading the data is time-consuming, and researchers must be sure that their computational infrastructure and software tools are up to the task. ‘If I could, I would routinely look at all sequenced cancer genomes,’ says [Arend] Sidow. ‘With the current infrastructure, that's impossible.’” Cloud solutions: Embassy Cloud (EBI + CSC in Espoo); easyGenomics (BGI, Hong Kong); DNAnexus (commercial service, Silicon Valley)
  • 33. Analysis challenges Dealing with the size of raw data Growth in sequencing capacity has outstripped Moore’s law Need to throw away data → tailored streaming / approximate algorithms (Figure source: The Economist)
  • 34. Shape of data “Commercial” big data: (e.g. purchase data, movie ratings, “likes”, cell phone locations, tweets) - Typically cheap to collect examples (data points) -> many observations - Usually low-dimensional (few features) - Data are informative only in aggregate (each data point is almost meaningless) Biomedical big data: (e.g. DNA sequencing, fMRI etc) - Typically expensive to collect data points -> few observations - Usually very high-dimensional (e.g. ~20,000 gene measurements) - Underpowered for modelling: many more features than observations So, biological data often seem to be “transposed” relative to other types (“large p, small n”)
  • 35. The shape of (raw and processed) data Raw reads: 10-250 million such entries for one sample in an experiment Gene expression: 20,000-row x 125-column matrix Genetic variants: perhaps 3 million rows
  • 36. Examples of data mining applications in genomics • Classification – Diseases and disease subtypes – Biomarkers for disease – Predicting disease presence or subtype from gene expression • Clustering and visualization – Defining cell types – Molecular definitions of disease • Association rules – Text analysis
  • 37. Electronic health records Mining electronic health records: towards better research applications and clinical care Peter B. Jensen, Lars J. Jensen & Søren Brunak Nature Reviews Genetics 13, 395-405 (June 2012) Unstructured and structured text Medication history Test results Demographics (etc)
  • 39. Gene expression patterns and neuronal cell types Gene expression Genes Shape and behavior of neurons Cell types Sugino et al, Molecular taxonomy of major neuronal classes in the adult mouse forebrain, Nature Neuroscience 9, 99 - 107 (2005)
  • 40. Genetics of multiple sclerosis • Gene expression data on ~120 patients and 70 controls • Medication, lifestyle, specific diagnosis • Environment important – sunlight, tobacco etc [Figure: per-sample panels labeled “Gene expression” and “Medication, diagnosis etc”; sample IDs omitted]
  • 44. SBV Improver Challenge #1 • Build predictive models for classifying gene expression signatures for: – Psoriasis – Multiple sclerosis – COPD – Lung cancer • Training set is public data, the secret test set was proprietary
  • 46. SBV Improver Challenge #1 • Psoriasis easy • Lung cancer hard • MS diagnostic, COPD somewhere in the middle • MS subtype: no statistically significant submissions! https://www.sbvimprover.com/sbv-improver-symposium-2012-presentations
  • 47. Species translation challenge - Can the perturbations of signaling pathways in one species predict the response to a given stimulus in another species? - Which computational methods are most effective for inferring gene, phosphorylation and pathway responses from one species to another?
  • 48. CAMDA 2013 challenges Question 1: Can we replace the animal study with an in vitro assay? Current safety assessment relies largely on the animal model, which is time-consuming, labor-intensive, and not in line with animal rights concerns. There is a paradigm shift in toxicology to explore the possibility of replacing the animal model with in vitro assays coupled with toxicogenomics. The TGP data contain both in vitro and animal data, which is essential to address this question. Question 2: Can we predict liver injury in humans using toxicogenomics data from animals? Available data: Drug Information (Excel table) – the basic information about individual drugs from DrugBank Pathology Data (Excel table) – pathology and clinical chemistry data for each rat (“toxicogenomics”) Array Metadata (csv format) – metadata (e.g., dose, time, sacrifice time, etc.)
  • 49. Fully open code that runs on the server to generate predictions. Can build on others’ results
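
The “large p, small n” point from slide 34 can be illustrated with a small simulation. This is a minimal sketch and not part of the lecture: the data are pure random noise, the sample and feature counts are assumptions chosen to resemble a small gene expression study, and the "classifier" is just the sign of a single feature. With only 12 samples and 20,000 features, some feature will almost certainly match the random labels nearly perfectly in training, yet it predicts fresh samples at chance level.

```python
import random


def best_single_feature(X, y):
    """Return (index, sign, training accuracy) of the feature whose sign best matches y."""
    n = len(y)
    best = (0, 1, 0.0)
    for j in range(len(X[0])):
        # Count samples where "feature > 0" agrees with "label == 1".
        agree = sum(1 for i in range(n) if (X[i][j] > 0) == (y[i] == 1))
        if agree >= n - agree:
            sign, acc = 1, agree / n
        else:
            # Flipping the feature's sign turns disagreements into agreements.
            sign, acc = -1, (n - agree) / n
        if acc > best[2]:
            best = (j, sign, acc)
    return best


def simulate(n_samples=12, n_features=20000, n_test=1000, seed=0):
    """Fit on pure noise, then score the selected feature on fresh noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [rng.choice([0, 1]) for _ in range(n_samples)]
    _, sign, train_acc = best_single_feature(X, y)

    # The chosen feature is noise, so fresh values and labels are independent draws.
    hits = 0
    for _ in range(n_test):
        value = rng.gauss(0, 1)
        label = rng.choice([0, 1])
        predicted = 1 if sign * value > 0 else 0
        hits += int(predicted == label)
    return train_acc, hits / n_test


if __name__ == "__main__":
    train_acc, test_acc = simulate()
    print(f"training accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The gap between the two numbers is the reason held-out test sets, such as the secret test set in the SBV Improver challenge, matter so much for this kind of data.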

Editor's Notes

  1. DNA is the blueprint for living organisms from bacteria to plants, animals and people. DNA sequencing refers to “reading” the “letters”, or bases, that make up the DNA, the genetic code. The past decade has seen an explosive growth in sequencing capability worldwide.
  2. More generally, researchers are trying to move to a data-driven, predictive, personalized view of disease and health.
  3. Some of the solutions could lie in community or crowd based approaches. A new generation of USB drive sized sequencers could enable regular people like you and me to sequence themselves, and cloud apps for genomic analysis are appearing so that people can do their own analysis.
  4. It’s not just the human genome that is interesting to sequence for medical reasons. It is estimated that ten times more bacterial cells than human cells are inhabiting your body. Each person has a specific bacterial flora which can be connected to various diseases or things like obesity.
  5. These bacteria can be characterized through massively parallel sequencing, and novel viruses can also be found in this way from body fluids, like snot, or in environmental samples from soil, ocean water and so on.
  6. Sequencing samples from their “natural state” outside the lab is called metagenomics and opens up whole new vistas for understanding “biological dark matter”, as the virologist Nathan Wolfe has put it.
  7. There will be huge challenges in understanding the interplay between genetics, environment and lifestyle, as well as in monitoring the biological environment, perhaps in “genomic observatories” around the world.
  8. Then there is the problem of just dealing with the raw data, especially for applications like monitoring infectious disease outbreaks or metagenomic monitoring of an environment. The growth in sequencing capacity has outpaced Moore’s law so that we need to start throwing away some of the data and developing tailored streaming approximate algorithms to extract the most relevant information.
  9. What could technically be regarded as 50 million separate data points usually gets summarized as something smaller.
  10. This, of course, has many implications for medical research. For instance, it is now much easier to look for genetic variants that cause rare diseases. Our team from SciLifeLab recently participated in an international genomic analysis competition where we and other teams identified mutations probably underlying rare muscle and heart diseases in children.
  11. Online competitions such as this one about predicting breast cancer prognosis – with fully open code – can help us discover the best analysis methods.
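
Note 8 mentions tailored streaming approximate algorithms. One common example of that family is a count-min sketch, which counts items in a stream using fixed memory at the cost of occasional overestimates. The sketch below applies it to k-mers (length-k substrings of reads). It is an illustrative assumption, not a method named in the lecture; the class name, table sizes, and helper functions are hypothetical.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter; estimates are never below the true count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent hash per row, obtained by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # Collisions only inflate cells, so the smallest cell is the tightest bound.
        return min(self.table[row][col] for row, col in self._cells(item))


def kmers(read, k):
    """Yield every length-k substring of a read, in order."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_stream(reads, k, sketch):
    """Feed reads one at a time; no read needs to be kept after it is processed."""
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    return sketch


if __name__ == "__main__":
    # The two sequence fragments shown on slide 5, used here as toy reads.
    reads = ["GTTACGTAACCGTTACGTA", "CCTTGATCGTAAC"]
    sketch = count_stream(reads, k=4, sketch=CountMinSketch())
    print("approximate count of GTTA:", sketch.estimate("GTTA"))
```

Memory use depends only on `width` and `depth`, not on the number of reads, which is the property that makes this kind of structure attractive when raw data cannot all be stored.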