Data analytics challenges in genomics

Data analysis challenges in genomics

Guest lecture, Data Mining

Uppsala 2013-10-08
Mikael Huss
Science for Life Laboratory / Stockholm University

Where I work

Science for Life Laboratory Stockholm, at Karolinska Institutet Science Park
A national center for high-throughput biology, i.e. massively parallel measurements of
DNA/RNA (“genomics”, “next generation DNA sequencing”), proteins (“proteomics”,
mass spectrometry), etc.
Nodes in Uppsala & Stockholm; funded by strategic grants
Offers services to customers, mostly DNA sequencing + associated analysis

Outline

1. Context (short intro to DNA sequencing)
2. Big goals / visions
3. Examples of data mining applications and technical challenges

1. Some context on DNA sequencing

All* living organisms have DNA as
their blueprint
GTTACGTAACCGTTACGTA…..
CCTTGATCGTAAC….
Etc. (2x3 billion letters for humans)
*OK, some viruses have RNA

A short refresher on molecular genetics!
…ACGT…

DNA

Blueprint / source code (http://ds9a.nl/amazing-dna)
Pretty much identical in all your cells

…ACGU…

RNA

“Expressed”, “active” genes
Differs between tissues, cell types, disease vs health

Proteins

The molecules that actually do stuff

…KVL…

Reading the nucleotides or amino acids is called sequencing
DNA and RNA are easier to isolate, and therefore to sequence, than proteins
DNA sequencing means “reading the genome”
RNA sequencing can be used to get a snapshot of the active genes
Protein abundance can be measured, but this is harder to do on a massive scale

SciLifeLab

Presently sequencing ~3 megabases of DNA per second
Corresponding to about 3 human genome sizes per hour
Also RNA, protein measurements
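
The throughput figures above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it counts raw bases, takes one copy of the human genome as ~3 billion letters, and ignores that a real genome is sequenced many times over. The function name is made up for illustration.

```python
# Back-of-the-envelope check of the throughput figures on this slide.
BASES_PER_SECOND = 3e6      # ~3 megabases per second (from the slide)
HUMAN_GENOME_BASES = 3e9    # ~3 billion letters, one copy of the human genome

def genome_equivalents_per_hour(bases_per_second, genome_size):
    """Return how many genome-sized amounts of sequence are produced per hour."""
    return bases_per_second * 3600 / genome_size

per_hour = genome_equivalents_per_hour(BASES_PER_SECOND, HUMAN_GENOME_BASES)
print(f"{per_hour:.1f} human genome sizes per hour")
```

The result is a little above the "about 3" quoted on the slide, which is consistent given the rounding of the per-second rate.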

What is sequencing good for?
- Mapping new genomes
- Comparing individual genomes to each other
- Looking at how genes are expressed (RNA sequencing)

De novo genome sequencing
Mapping new genomes
E.g. Norway spruce (Christmas tree)
Economically the most important Swedish tree
Provide a basis for research on
• tools for breeding for tree productivity, quality, health
• tools for cellulose and wood fibre modification (new materials)

Genome sizes:
Arabidopsis (0.12 Gbp)
Populus (0.45 Gbp)
Humans (3 Gbp)
Spruce (20 Gbp)
Conifers (20 Gbp)

Resequencing and variation analysis
Working in the context of a known
reference genome.
Common application: Looking for genes
responsible for hereditary diseases
Often rare monogenic or common
complex diseases
More than 6,000 known monogenic
diseases
Only ~ ½ have an associated gene
(OMIM)
Complex diseases – diabetes, asthma,
MS, ….

Functional genomics
- How genes actually get expressed

Variation between
-Tissues
-Cell types
-Cell states
-Individuals

Functional genomics

Transcriptional patterns
“cell types” as attractors
in systems of interacting
genes

Furusawa and Kaneko, Biology Direct 2009 4:17

Big goals / visions
• Precision medicine
– Genomic medicine
– Personalized medicine
– Individualized treatments

• Understanding natural diversity
– Discovering new organisms
– Mapping ecological niches

• Understanding complex diseases
– Molecular definitions of diseases
– Lifestyle and epigenetics

Mount Sinai Medical Center / Eric Schadt

Personal sequencing?

Genomics apps

Community genomics & crowdsourced clinical trials

https://www.23andme.com/about/factoids/

Exploring the human
microbiome

Estimated 10x more
bacterial cells than
human cells in human
body
Three “enterotypes”

Personal microbiome sequencing

Environmental samples: soil, ocean etc

Identifying new viruses in human or environmental samples; <1% known so far

http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html

Planetary ecology
Perhaps: “genomic observatories” continuously monitoring environmental DNA
streaming, real-time analysis important

Complex diseases
• Cardiovascular disease
• Autoimmune disease
– Rheumatism
– Multiple sclerosis
– Psoriasis
– …

• Diabetes
(etc.)
No simple genetic explanation.
Lifestyle & environment factors likely important.

Data integration and correlative analysis
Cancer – not one disease
http://techcrunch.com/2012/03/29/cloud-will-cure-cancer/

“Collecting comprehensive profiles of every tumor for every patient provides a dataset to
build models that learn normal cellular function from cancerous deviations.
Diagnostics and treatment companies/hospitals/physicians can then use the models to
deliver therapy.

If we imagine a world where every tumor is comprehensively profiled, it quickly becomes
clear that not only will the data sets be very large but also involve different domains of
expertise required for quality control, model building, and interpretation.”

Epigenetics and lifestyle

Genes – Epigenetics – Lifestyle - Environment

Understanding the interplay of lifestyle
(including environment) and genes through
the “interface layer”, epigenetics.
Massive correlational analyses …

epigenetics – changes in gene expression that are not due to base sequence changes
(and that can be passed on to daughter cells during cell division)

Gigantic clinical sequencing projects

Genomics England / NHS will sequence 100,000 genomes of patients in
the next 5 years

… BGI aims for a million

But are we ready to interpret genomes?

3. Applications and challenges of data
mining in genomics

Storage and transfer

“European Bioinformatics Institute (EBI) stores 20 PB of data, of which 2 PB is
genomic”
“Single human genome ~140 GB”
“… downloading the data is time-consuming, and researchers must be sure that their
computational infrastructure and software tools are up to the task. ‘If I could, I would
routinely look at all sequenced cancer genomes,’ says [Arend] Sidow. ‘With the
current infrastructure, that's impossible.’”
Cloud solutions:
Embassy Cloud – EBI + CSC in Espoo
easyGenomics – BGI Hong Kong
DNANexus – commercial service, Silicon Valley

Analysis challenges

Dealing with the size of raw data

Growth in sequencing capacity has outstripped
Moore’s law

Need to throw away data
-> Tailored streaming / approximate algorithms

The Economist
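
One generic example of a streaming algorithm that "throws away data" is reservoir sampling, which keeps a fixed-size uniform random subsample of a stream without ever holding the whole stream in memory. The sketch below is not a method named in the lecture; `fake_reads` is a hypothetical stand-in for a real stream of sequencing reads.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Uses O(k) memory regardless of how many items pass through (Algorithm R).
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

def fake_reads(n, length=20, seed=0):
    """Hypothetical stand-in for a read stream: yields n random DNA strings."""
    rng = random.Random(seed)
    for _ in range(n):
        yield "".join(rng.choice("ACGT") for _ in range(length))

sample = reservoir_sample(fake_reads(20_000), k=1000, rng=random.Random(1))
print(len(sample), sample[0])
```

The trade-off is typical of this class of methods: memory use is bounded by k, but anything computed from the sample is an estimate rather than an exact answer.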

Shape of data
“Commercial” big data:
(e.g. purchase data, movie ratings, “likes”, cell phone locations, tweets)
- Typically cheap to collect examples (data points) -> many observations
- Usually low-dimensional (few features)
- Data are informative only in aggregate (each data point is almost meaningless)
Biomedical big data:
(e.g. DNA sequencing, fMRI etc)
- Typically expensive to collect data points -> few observations
- Usually very high dimensional (e.g. ~20,000 gene measurements)
- Underpowered for modelling, many more features than observations
So, biological data often seems to be “transposed” relative to other types
(“large p, small n”)
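
Why "large p, small n" is a problem can be illustrated with pure noise. In the sketch below, 20 samples with balanced labels are given 20,000 random features; none carries any signal, yet the best single feature appears to separate the classes well on the training data. The sample and feature counts are illustrative assumptions, not figures from a real study.

```python
import random

def best_noise_feature_accuracy(n_samples=20, n_features=20_000, seed=0):
    """Best training accuracy of any single pure-noise feature.

    Labels are balanced and every feature is random noise, so any apparent
    predictive power is due to chance alone.
    """
    rng = random.Random(seed)
    labels = [0] * (n_samples // 2) + [1] * (n_samples - n_samples // 2)
    best = 0.0
    for _ in range(n_features):
        values = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        threshold = sorted(values)[n_samples // 2]  # upper median as cut-off
        predictions = [1 if v >= threshold else 0 for v in values]
        correct = sum(p == y for p, y in zip(predictions, labels))
        best = max(best, correct / n_samples)
    return best

print(best_noise_feature_accuracy())
```

A feature picked this way would do no better than a coin flip on new samples, which is why feature selection has to be validated on data that played no part in the selection.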

The shape of (raw and processed) data

10-250 million such entries for one sample in an experiment
Gene expression

20,000-row x 125-column matrix

Genetic variants

Perhaps 3 million rows

Examples of data mining applications in
genomics
• Classification
– Diseases and disease subtypes
– Biomarkers for disease
– Predicting disease presence or subtype from gene expression
• Clustering and visualization
– Defining cell types
– Molecular definitions of disease
• Association rules
– Text analysis
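
As a minimal sketch of the classification task (predicting disease presence from gene expression), the code below trains a nearest-centroid classifier and scores it with leave-one-out cross-validation. The data are synthetic and every name and parameter is an assumption made for illustration; it is one simple baseline, not the method used in any study mentioned here.

```python
import random

def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(train_x, train_y, sample):
    """Assign the sample to the class whose centroid is nearest."""
    centroids = {
        label: centroid([x for x, y in zip(train_x, train_y) if y == label])
        for label in set(train_y)
    }
    return min(centroids, key=lambda label: squared_distance(centroids[label], sample))

def leave_one_out_accuracy(xs, ys):
    """Hold out each sample in turn, train on the rest, and score the prediction."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)

def synthetic_expression(n_per_class=15, n_genes=200, n_informative=20, shift=2.0, seed=0):
    """Made-up data: 'disease' samples have the first few genes shifted upwards."""
    rng = random.Random(seed)
    xs, ys = [], []
    for label in ("control", "disease"):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == "disease":
                for g in range(n_informative):
                    row[g] += shift
            xs.append(row)
            ys.append(label)
    return xs, ys

xs, ys = synthetic_expression()
print(f"leave-one-out accuracy: {leave_one_out_accuracy(xs, ys):.2f}")
```

Holding out each sample before computing the centroids matters: with far more genes than samples, accuracy measured on the training data itself would be misleadingly high.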

Electronic health records
Mining electronic health records: towards better research
applications and clinical care
Peter B. Jensen, Lars J. Jensen & Søren Brunak
Nature Reviews Genetics 13, 395-405 (June 2012)

Unstructured and structured text
Medication history
Test results
Demographics
(etc)
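
For structured record data of this kind, association rules are scored by support (how often the items occur together) and confidence (how often the consequent occurs given the antecedent). The sketch below computes both for rules between single items; the records and item names are entirely made up and carry no medical meaning.

```python
from itertools import combinations

# Hypothetical, made-up patient records: each is a set of coded items.
records = [
    {"drug_A", "diagnosis_X", "test_high"},
    {"drug_A", "diagnosis_X"},
    {"drug_B", "diagnosis_Y"},
    {"drug_A", "diagnosis_X", "diagnosis_Y"},
    {"drug_B", "test_high"},
]

def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    return sum(itemset <= r for r in records) / len(records)

def confidence(antecedent, consequent, records):
    """Estimated P(consequent | antecedent) over the records."""
    return support(antecedent | consequent, records) / support(antecedent, records)

def pairwise_rules(records, min_support=0.4, min_confidence=0.8):
    """List rules {a} -> {b} between single items that pass both thresholds."""
    items = sorted(set().union(*records))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in (({a}, {b}), ({b}, {a})):
            s = support(lhs | rhs, records)
            if s >= min_support and confidence(lhs, rhs, records) >= min_confidence:
                rules.append((lhs, rhs, s, confidence(lhs, rhs, records)))
    return rules

for lhs, rhs, s, c in pairwise_rules(records):
    print(f"{sorted(lhs)} -> {sorted(rhs)}  support={s:.2f}  confidence={c:.2f}")
```

Enumerating all pairs is fine for a handful of items; real record collections need a pruning strategy such as Apriori, and a high-confidence rule is still only a correlation.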

Gene expression patterns and
neuronal cell types

Gene expression

Genes

Shape and behavior of neurons

Cell types
Sugino et al, Molecular taxonomy of major neuronal classes in the
adult mouse forebrain, Nature Neuroscience 9, 99 - 107 (2005)

Genetics of multiple sclerosis
• Gene expression data on ~120 patients and 70 controls
• Medication, lifestyle, specific diagnosis
• Environment important – sunlight, tobacco etc

[Figure: two panels sharing the same per-sample labels (12_B, 42_B, …, 158, 150): gene expression; medication, diagnosis etc.]

SBV Improver Challenge #1

• Build predictive models for classifying gene
expression signatures for:
– Psoriasis
– Multiple sclerosis
– COPD
– Lung cancer

• Training set was public data; the secret test set
was proprietary

SBV Improver Challenge #1

• Psoriasis easy
• Lung cancer hard
• MS diagnostic, COPD somewhere in the middle
• MS subtype: no statistically significant submissions!

https://www.sbvimprover.com/sbv-improver-symposium-2012-presentations
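
What "statistically significant" can mean for a classifier is sketched below with a permutation test: the true labels are shuffled many times to see how often chance alone matches the observed accuracy. This is a generic illustration with made-up predictions, not the scoring procedure used by the challenge.

```python
import random

def accuracy(predictions, truth):
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

def permutation_p_value(predictions, truth, n_permutations=10_000, seed=0):
    """Fraction of label shufflings scoring at least as well as the real labels.

    A small value suggests the observed accuracy is unlikely under chance.
    """
    rng = random.Random(seed)
    observed = accuracy(predictions, truth)
    shuffled = list(truth)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if accuracy(predictions, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_permutations + 1)

# Hypothetical example: 20 test samples, 14 predicted correctly.
truth = [0] * 10 + [1] * 10
predictions = [0] * 7 + [1] * 3 + [0] * 3 + [1] * 7
p_value = permutation_p_value(predictions, truth)
print(accuracy(predictions, truth), p_value)
```

With only 20 test samples, 70% accuracy does not reach the conventional 0.05 threshold, which shows how a small test set can leave a seemingly decent result indistinguishable from chance.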

Species translation challenge
- Can the perturbations of signaling
pathways in one species predict the
response to a given stimulus in another
species?

- Which computational methods are most
effective for inferring gene, phosphorylation
and pathway responses from one species
to another?

CAMDA 2013 challenges
Question 1: Can we replace the animal study
with an in vitro assay? Current safety
assessment relies largely on the animal
model, which is time-consuming, labor-intensive, and not in line with
animal rights concerns. There is a paradigm shift in
toxicology to explore the possibility of replacing
the animal model with in vitro assays coupled
with toxicogenomics. The TGP data contain
both in vitro and animal data, which is essential
to address this question.
Question 2: Can we predict liver injury in
humans using toxicogenomics data from
animals?

Available data:
Drug Information (Excel table) – basic information about individual drugs from DrugBank
Pathology Data (Excel table) – pathology and clinical chemistry data for each rat
Array Metadata (csv format) – metadata (e.g., dose, time, sacrifice time, etc.)

“toxicogenomics”

Fully open code that runs on the server to generate predictions. Can build on others’ results
