Lecture given for the Data Mining course at Uppsala University in October 2013. The presentation talks about data analysis in the context of genomics, next-generation sequencing, metagenomics etc.
Data analytics challenges in genomics
1. Data analysis challenges in genomics
Guest lecture, Data Mining
Uppsala 2013-10-08
Mikael Huss
Science for Life Laboratory / Stockholm University
2. Where I work
Science for Life Laboratory Stockholm, at Karolinska Institutet Science Park
A national center for high-throughput biology, i.e. massively parallel measurements of
DNA/RNA (“genomics”, “next generation DNA sequencing”), proteins (“proteomics”,
mass spectrometry) etc.
Nodes in Uppsala & Stockholm; funded by strategic grants
Offers services to customers, mostly DNA sequencing + associated analysis
3. Outline
1. Context (short intro to DNA sequencing)
2. Big goals / visions
3. Examples of data mining applications and technical challenges
5. ?
All* living organisms have DNA as
their blueprint
GTTACGTAACCGTTACGTA…..
CCTTGATCGTAAC….
Etc. (2x3 billion letters for humans)
*OK, some viruses have RNA
6. A short refresher on molecular genetics!
…ACGT…
DNA
Blueprint / source code (http://ds9a.nl/amazing-dna)
Pretty much identical in all your cells
…ACGU…
RNA
“Expressed”, “active” genes
Differs between tissues, cell types, disease vs health
Proteins
The molecules that actually do stuff
…KVL…
Reading the nucleotides or amino acids is called sequencing
It is easier to isolate and therefore to sequence DNA and RNA
DNA sequencing means “reading the genome”
RNA sequencing can be used to get a snapshot of the active genes
Protein abundance can be measured, but this is harder to do on a massive scale
7. SciLifeLab
Presently sequencing ~3 megabases of DNA per second
Corresponding to about 3 human genome sizes per hour
Also RNA, protein measurements
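As a quick sanity check of the two throughput figures above, the arithmetic can be written out. This is only a back-of-the-envelope sketch: it takes "~3 megabases per second" literally as 3,000,000 bases per second and assumes a human genome size of 3 Gbp, the figure used elsewhere in these slides.

```python
# Back-of-the-envelope check of the throughput figures on this slide.
BASES_PER_SECOND = 3_000_000        # "~3 megabases of DNA per second"
HUMAN_GENOME_BASES = 3_000_000_000  # assumed: ~3 Gbp per human genome

bases_per_hour = BASES_PER_SECOND * 3600
genomes_per_hour = bases_per_hour / HUMAN_GENOME_BASES

print(f"{bases_per_hour / 1e9:.1f} Gbp per hour")
print(f"{genomes_per_hour:.1f} human genome sizes per hour")
```

Under these assumptions the result comes out a little above 3, which is consistent with "about 3 human genome sizes per hour".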
8. What is sequencing good for?
- Mapping new genomes
- Comparing individual genomes to each other
- Looking at how genes are expressed (RNA sequencing)
9. De novo genome sequencing
Mapping new genomes
E.g. Norway spruce (Christmas tree)
Economically the most important Swedish tree
Provides a basis for research on:
• tools for breeding for tree productivity, quality, health
• tools for cellulose and wood fibre modification (new materials)
Genome sizes:
Arabidopsis (0.12 Gbp)
Populus (0.45 Gbp)
Humans (3 Gbp)
Spruce (20 Gbp)
Conifers (20 Gbp)
10. Resequencing and variation analysis
Working in the context of a known
reference genome.
Common application: Looking for genes
responsible for hereditary diseases
Often rare monogenic or common
complex diseases
More than 6,000 known monogenic
diseases
Only ~½ have an associated gene
(OMIM)
Complex diseases – diabetes, asthma,
MS, ….
11. Functional genomics
- How genes actually get expressed
Variation between
-Tissues
-Cell types
-Cell states
-Individuals
28. Data integration and correlative analysis
Cancer – not one disease
http://techcrunch.com/2012/03/29/cloud-will-cure-cancer/
“Collecting comprehensive profiles of every tumor for every patient provides a dataset to
build models that learn normal cellular function from cancerous deviations.
Diagnostics and treatment companies/hospitals/physicians can then use the models to
deliver therapy.
If we imagine a world where every tumor is comprehensively profiled, it quickly becomes
clear that not only will the data sets be very large but also involve different domains of
expertise required for quality control, model building, and interpretation.”
29. Epigenetics and lifestyle
Genes – Epigenetics – Lifestyle - Environment
Understanding the interplay of lifestyle
(including environment) and genes through
the “interface layer”, epigenetics.
Massive correlational analyses …
epigenetics – changes in gene expression that are not due to base sequence changes
(and that can be passed on to daughter cells during cell division)
30. Gigantic clinical sequencing projects
Genomics England / NHS will sequence 100,000 genomes of patients in
the next 5 years
… BGI aims for a million
But are we ready to interpret genomes?
32. Storage and transfer
“European Bioinformatics Institute (EBI) stores 20 PB of data, of which 2 PB is
genomic”
“Single human genome ~140 GB”
“… downloading the data is time-consuming, and researchers must be sure that their
computational infrastructure and software tools are up to the task. ‘If I could, I would
routinely look at all sequenced cancer genomes,’ says [Arend] Sidow. ‘With the
current infrastructure, that's impossible.’”
Cloud solutions:
Embassy Cloud – EBI + CSC in Espoo
easyGenomics – BGI Hong Kong
DNANexus – commercial service, Silicon Valley
33. Analysis challenges
Dealing with the size of raw data
Growth in sequencing capacity has outstripped
Moore’s law
Need to throw away data
Tailored streaming / approximate algorithms
The Economist
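One simple instance of "throwing away data" in a streaming setting is reservoir sampling, which keeps a fixed-size uniform random sample of reads without ever holding the whole stream in memory. The sketch below is only an illustration and not a method taken from the lecture: the function names `reservoir_sample` and `toy_reads`, the random toy reads and the sample size are all made up for the example.

```python
import random
from typing import Iterable, Iterator, List, Optional


def reservoir_sample(stream: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Keep a uniform random sample of k items from a stream of unknown length.

    Uses Algorithm R: memory is O(k) no matter how many items pass by.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


def toy_reads(n: int, length: int = 20, seed: int = 0) -> Iterator[str]:
    """Yield n random DNA strings (hypothetical stand-in for sequencer output)."""
    rng = random.Random(seed)
    for _ in range(n):
        yield "".join(rng.choice("ACGT") for _ in range(length))


sample = reservoir_sample(toy_reads(10_000), k=5, seed=42)
print(len(sample), "reads kept out of 10,000")
```

The design trade-off is the one named on the slide: the sample is an approximation of the full data set, but it can be maintained in a single pass with constant memory.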
34. Shape of data
“Commercial” big data:
(e.g. purchase data, movie ratings, “likes”, cell phone locations, tweets)
- Typically cheap to collect examples (data points) -> many observations
- Usually low-dimensional (few features)
- Data are informative only in aggregate (each data point is almost meaningless)
Biomedical big data:
(e.g. DNA sequencing, fMRI etc)
- Typically expensive to collect data points -> few observations
- Usually very high dimensional (e.g. ~20,000 gene measurements)
- Underpowered for modelling, many more features than observations
So, biological data often seems to be “transposed” relative to other types
(“large p, small n”)
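Why "large p, small n" leaves a study underpowered can be illustrated with pure noise. In the sketch below, which uses made-up sizes (10 samples and 20,000 random binary "features"), no feature has any real relation to the labels, yet some features separate the two groups perfectly just by chance.

```python
import random

rng = random.Random(1)

N_SAMPLES = 10        # few observations (small n)
N_FEATURES = 20_000   # many features (large p), echoing ~20,000 genes

# Arbitrary labels: 5 "healthy" (0) and 5 "diseased" (1) samples.
labels = [0] * 5 + [1] * 5

best_accuracy = 0.0
perfect_features = 0
for _ in range(N_FEATURES):
    # A feature that is pure noise: a random 0/1 value per sample.
    feature = [rng.randint(0, 1) for _ in range(N_SAMPLES)]
    agreement = sum(f == y for f, y in zip(feature, labels)) / N_SAMPLES
    # A perfectly anti-correlated feature is just as "predictive".
    accuracy = max(agreement, 1 - agreement)
    best_accuracy = max(best_accuracy, accuracy)
    if accuracy == 1.0:
        perfect_features += 1

print(f"best single-feature accuracy on noise: {best_accuracy:.1f}")
print(f"noise features that 'predict' the labels perfectly: {perfect_features}")
```

Each noise feature matches the labels (or their complement) with probability 2/1024, so with 20,000 features about 39 such spurious "biomarkers" are expected on average. This is why models built on such data need independent validation.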
35. The shape of (raw and processed) data
10-250 million such entries for one sample in an experiment
Gene expression
20,000-row x 125-column matrix
Genetic variants
Perhaps 3 million rows
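The step from raw to processed data can be sketched as a counting operation: millions of per-read records collapse into one number per gene per sample. The example below assumes that reads have already been assigned to genes by some upstream alignment step; the sample names, gene names and the `count_matrix` helper are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_matrix(
    assignments: Iterable[Tuple[str, str]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Collapse (sample, gene) read assignments into a genes x samples matrix."""
    counts: Dict[str, Counter] = {}
    genes = set()
    for sample, gene in assignments:
        counts.setdefault(sample, Counter())[gene] += 1
        genes.add(gene)
    samples = sorted(counts)
    gene_list = sorted(genes)
    # Rows are genes, columns are samples; unseen combinations count as 0.
    matrix = [[counts[s][g] for s in samples] for g in gene_list]
    return gene_list, samples, matrix


# Hypothetical toy input: one (sample, gene) pair per aligned read.
reads = [
    ("sample_A", "GENE1"), ("sample_A", "GENE1"), ("sample_A", "GENE2"),
    ("sample_B", "GENE2"), ("sample_B", "GENE3"), ("sample_B", "GENE2"),
]

genes, samples, matrix = count_matrix(reads)
print("genes:", genes)
print("samples:", samples)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

With real data the same idea would give a matrix of the shape described above, presumably with one row per gene and one column per sample.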
36. Examples of data mining applications in genomics
• Classification
– Diseases and disease subtypes
– Biomarkers for disease
– Predicting disease presence or subtype from gene expression
• Clustering and visualization
– Defining cell types
– Molecular definitions of disease
• Association rules
– Text analysis
37. Electronic health records
Mining electronic health records: towards better research
applications and clinical care
Peter B. Jensen, Lars J. Jensen & Søren Brunak
Nature Reviews Genetics 13, 395-405 (June 2012)
Unstructured and structured text
Medication history
Test results
Demographics
(etc)
39. Gene expression patterns and neuronal cell types
[Figure labels: gene expression, genes, cell types, shape and behavior of neurons]
Sugino et al, Molecular taxonomy of major neuronal classes in the
adult mouse forebrain, Nature Neuroscience 9, 99 - 107 (2005)
44. SBV Improver Challenge #1
• Build predictive models for classifying gene
expression signatures for:
– Psoriasis
– Multiple sclerosis
– COPD
– Lung cancer
• The training set was public data; the secret test set
was proprietary
46. SBV Improver Challenge #1
• Psoriasis easy
• Lung cancer hard
• MS diagnostic, COPD somewhere in the middle
• MS subtype: no statistically significant submissions!
https://www.sbvimprover.com/sbv-improver-symposium-2012-presentations
47. Species translation challenge
- Can the perturbations of signaling
pathways in one species predict the
response to a given stimulus in another
species?
- Which computational methods are most
effective for inferring gene, phosphorylation
and pathway responses from one species
to another?
48. CAMDA 2013 challenges
Question 1: Can we replace the animal study
with an in vitro assay? Current safety
assessment relies largely on the animal
model, which is time-consuming, labor-intensive, and definitely not in line with the
animal rights voice. There is a paradigm shift in
toxicology to explore the possibility of replacing
the animal model with in vitro assays coupled
with toxicogenomics. The TGP data contains
both in vitro and animal data, which is essential
to address this question.
Question 2: Can we predict the liver injury in
humans using toxicogenomics data from
animals?
Available data:
Drug Information (Excel table) – the basic information about
individual drugs from DrugBank
Pathology Data (Excel table) – Pathology and clinical chemistry
data for each rat
“toxicogenomics”
Array Metadata (csv format) – Metadata (e.g. dose, time,
sacrifice time, etc.)
49. Fully open code that runs on the server to generate predictions. Can build on others’ results
Editor's Notes
DNA is the blueprint for living organisms from bacteria to plants, animals and people. DNA sequencing refers to “reading” the “letters”, or bases, that make up the DNA, the genetic code. The past decade has seen an explosive growth in sequencing capability worldwide.
More generally, researchers are trying to move to a data-driven, predictive, personalized view of disease and health.
Some of the solutions could lie in community or crowd based approaches. A new generation of USB drive sized sequencers could enable regular people like you and me to sequence themselves, and cloud apps for genomic analysis are appearing so that people can do their own analysis.
It’s not just the human genome that is interesting to sequence for medical reasons. It is estimated that ten times more bacterial cells than human cells are inhabiting your body. Each person has a specific bacterial flora which can be connected to various diseases or things like obesity.
These bacteria can be characterized through massively parallel sequencing, and novel viruses can also be found in this way from body fluids, like snot, or in environmental samples from soil, ocean water and so on.
Sequencing samples from their “natural state” outside the lab is called metagenomics and opens up whole new vistas for understanding “biological dark matter”, as the virologist Nathan Wolfe has put it.
There will be huge challenges in understanding the interplay between genetics, environment and lifestyle, as well as in monitoring the biological environment, perhaps in “genomic observatories” around the world.
Then there is the problem of just dealing with the raw data, especially for applications like monitoring infectious disease outbreaks or metagenomic monitoring of an environment. The growth in sequencing capacity has outpaced Moore’s law so that we need to start throwing away some of the data and developing tailored streaming approximate algorithms to extract the most relevant information.
What could technically be regarded as 50 million separate data points usually gets summarized as something smaller.
This, of course, has many implications for medical research. For instance, it is now much easier to look for genetic variants that cause rare diseases. Our team from SciLifeLab recently participated in an international genomic analysis competition where we and other teams identified mutations probably underlying rare muscle and heart diseases in children.
Online competitions such as this one about predicting breast cancer prognosis – with fully open code – can help us discover the best analysis methods.