Data analytics challenges in genomics

Lecture given for the Data Mining course at Uppsala University in October 2013. The presentation covers data analysis in the context of genomics, next-generation sequencing, metagenomics etc.

  • DNA is the blueprint for living organisms from bacteria to plants, animals and people. DNA sequencing refers to “reading” the “letters”, or bases, that make up the DNA, the genetic code. The past decade has seen an explosive growth in sequencing capability worldwide.
  • More generally, researchers are trying to move to a data-driven, predictive, personalized view of disease and health.
  • Some of the solutions could lie in community or crowd based approaches. A new generation of USB drive sized sequencers could enable regular people like you and me to sequence themselves, and cloud apps for genomic analysis are appearing so that people can do their own analysis.
  • It’s not just the human genome that is interesting to sequence for medical reasons. It is estimated that ten times more bacterial cells than human cells are inhabiting your body. Each person has a specific bacterial flora which can be connected to various diseases or things like obesity.
  • These bacteria can be characterized through massively parallel sequencing, and novel viruses can also be found in this way in body fluids, like snot, or in environmental samples from soil, ocean water and so on.
  • Sequencing samples from their “natural state” outside the lab is called metagenomics and opens up whole new vistas for understanding “biological dark matter”, as the virologist Nathan Wolfe has put it.
  • There will be huge challenges in understanding the interplay between genetics, environment and lifestyle, as well as in monitoring the biological environment, perhaps in “genomic observatories” around the world.
  • Then there is the problem of just dealing with the raw data, especially for applications like monitoring infectious disease outbreaks or metagenomic monitoring of an environment. The growth in sequencing capacity has outpaced Moore’s law so that we need to start throwing away some of the data and developing tailored streaming approximate algorithms to extract the most relevant information.
  • What could technically be regarded as 50 million separate data points usually gets summarized as something smaller.
  • This, of course, has many implications for medical research. For instance, it is now much easier to look for genetic variants that cause rare diseases. Our team from SciLifeLab recently participated in an international genomic analysis competition where we and other teams identified mutations probably underlying rare muscle and heart diseases in children.
  • Online competitions such as this one about predicting breast cancer prognosis – with fully open code – can help us discover the best analysis methods.
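
The note above about throwing away raw data and using streaming approximate algorithms can be made concrete with a small sketch. The lecture does not name a specific algorithm, so the following is only one possible illustration: a Count-Min sketch that counts k-mers (length-k substrings of reads) in fixed memory as reads stream past, without ever storing the reads. The class and function names, k = 5, and the table dimensions are assumptions chosen for the example; the two reads are the example sequences shown on the slides.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Fixed-memory approximate counter; estimates never fall below the true count."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One cell per row, chosen by a hash of the item salted with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(item.encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.rows[row][col] += 1

    def estimate(self, item):
        # Hash collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.rows[row][col] for row, col in self._cells(item))


def kmers(read, k):
    """Yield every substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def sketch_reads(reads, k=5, width=2048, depth=4):
    """Consume reads one at a time; only the sketch is kept, not the reads."""
    sketch = CountMinSketch(width, depth)
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    return sketch


# Example sequences from the slides (illustrative input only).
reads = ["GTTACGTAACCGTTACGTA", "CCTTGATCGTAAC"]
sketch = sketch_reads(reads, k=5)
exact = Counter(kmer for read in reads for kmer in kmers(read, 5))
print("GTTAC exact:", exact["GTTAC"], "estimated:", sketch.estimate("GTTAC"))
```

The memory used is fixed by `width * depth` regardless of how many reads arrive, which is the trade-off such approaches make: counts may be overestimated when k-mers collide, but never underestimated.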

Presentation Transcript

  • Data analysis challenges in genomics Guest lecture, Data Mining Uppsala 2013-10-08 Mikael Huss Science for Life Laboratory / Stockholm University
  • Where I work: Science for Life Laboratory, Stockholm, at the Karolinska Institutet science park. A national center for high-throughput biology, i.e. massively parallel measurements of DNA/RNA (“genomics”, “next generation DNA sequencing”), proteins (“proteomics”, mass spectrometry) etc. Nodes in Uppsala & Stockholm; funded by strategic grants. Offers services to customers, mostly DNA sequencing + associated analysis
  • Outline 1. Context (short intro to DNA sequencing) 2. Big goals / visions 3. Examples of data mining applications and technical challenges
  • 1. Some context on DNA sequencing
  • All* living organisms have DNA as their blueprint GTTACGTAACCGTTACGTA… CCTTGATCGTAAC… etc. (2 x 3 billion letters for humans) *OK, some viruses have RNA
  • A short refresher on molecular genetics! …ACGT… DNA Blueprint / source code (http://ds9a.nl/amazing-dna) Pretty much identical in all your cells …ACGU… RNA “Expressed”, “active” genes Differs between tissues, cell types, disease vs health Proteins The molecules that actually do stuff …KVL… Reading the nucleotides or amino acids is called sequencing. It is easier to isolate, and therefore to sequence, DNA and RNA. DNA sequencing means “reading the genome”. RNA sequencing can be used to get a snapshot of the active genes. Protein abundance can be measured, but this is harder to do on a massive scale.
  • SciLifeLab Presently sequencing ~3 megabases of DNA per second Corresponding to about 3 human genome sizes per hour Also RNA, protein measurements
  • What is sequencing good for? - Mapping new genomes - Comparing individual genomes to each other - Looking at how genes are expressed (RNA sequencing)
  • De novo genome sequencing Mapping new genomes E. g. Norwegian spruce (Christmas tree) Economically the most important Swedish tree Provide basis for research on Conifers (20 Gbp) • tools for breeding for tree productivity, quality, health • tools for cellulose and wood fibre modification (new materials) Arabidopsis (0.12 Gbp) Populus (0.45 Gbp) Humans (3 Gbp) Spruce (20 Gbp)
  • Resequencing and variation analysis Working in the context of a known reference genome. Common application: Looking for genes responsible for hereditary diseases Often rare monogenic or common complex diseases More than 6,000 known monogenic diseases Only ~ ½ have a gene associated (OMIM) Complex diseases – diabetes, asthma, MS, ….
  • Functional genomics - How genes actually get expressed Variation between -Tissues -Cell types -Cell states -Individuals
  • Functional genomics Transcriptional patterns “cell types” as attractors in systems of interacting genes Furusawa and Kaneko, Biology Direct 2009 4:17
  • 2. Big goals / visions
  • Big goals / visions • Precision medicine – Genomic medicine – Personalized medicine – Individualized treatments • Understanding natural diversity – Discovering new organisms – Mapping ecological niches • Understanding complex diseases – Molecular definitions of diseases – Lifestyle and epigenetics
  • Mount Sinai Medical Center / Eric Schadt
  • Personal sequencing? Genomics apps
  • Community genomics & crowdsourced clinical trials https://www.23andme.com/about/factoids/
  • Exploring the human microbiome Estimated 10x more bacterial cells than human cells in human body Three “enterotypes”
  • Personal microbiome sequencing
  • Environmental samples: soil, ocean etc Identifying new viruses in human or environmental samples; <1% known so far
  • http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html
  • Planetary ecology Perhaps: “genomic observatories” continuously monitoring environmental DNA streaming, real-time analysis important
  • Complex diseases • Cardiovascular disease • Autoimmune disease – Rheumatism – Multiple sclerosis – Psoriasis – … • Diabetes (etc.) No simple genetic explanation. Lifestyle & environment factors likely important.
  • Data integration and correlative analysis Cancer – not one disease http://techcrunch.com/2012/03/29/cloud-will-cure-cancer/ “Collecting comprehensive profiles of every tumor for every patient provides a dataset to build models that learn normal cellular function from cancerous deviations. Diagnostics and treatment companies/hospitals/physicians can then use the models to deliver therapy. If we imagine a world where every tumor is comprehensively profiled, it quickly becomes clear that not only will the data sets be very large but also involve different domains of expertise required for quality control, model building, and interpretation.”
  • Epigenetics and lifestyle Genes – Epigenetics – Lifestyle - Environment Understanding the interplay of lifestyle (including environment) and genes through the “interface layer”, epigenetics. Massive correlational analyses … epigenetics – changes in gene expression that are not due to base sequence changes (and that can be passed on to daughter cells during cell division)
  • Gigantic clinical sequencing projects Genomics England / NHS will sequence 100,000 genomes of patients in the next 5 years … BGI aims for a million But are we ready to interpret genomes?
  • 3. Applications and challenges of data mining in genomics
  • Storage and transfer “European Bioinformatics Institute (EBI) stores 20 PB of data, of which 2 PB is genomic” “Single human genome ~140 GB” “… downloading the data is time-consuming, and researchers must be sure that their computational infrastructure and software tools are up to the task.” “If I could, I would routinely look at all sequenced cancer genomes,” says [Arend] Sidow. “With the current infrastructure, that's impossible.” Cloud solutions: Embassy Cloud – EBI + CSC in Espoo easyGenomics – BGI Hong Kong DNANexus – commercial service, Silicon Valley
  • Analysis challenges Dealing with the size of raw data Growth in sequencing capacity has outstripped Moore’s law Need to throw away data -> Tailored streaming / approximate algorithms The Economist
  • Shape of data “Commercial” big data: (e.g. purchase data, movie ratings, “likes”, cell phone locations, tweets) - Typically cheap to collect examples (data points) -> many observations - Usually low-dimensional (few features) - Data are informative only in aggregate (each data point is almost meaningless) Biomedical big data: (e.g. DNA sequencing, fMRI etc) - Typically expensive to collect data points -> few observations - Usually very high dimensional (e.g. ~20,000 gene measurements) - Underpowered for modelling, many more features than observations So, biological data often seems to be “transposed” relative to other types (“large p, small n”)
  • The shape of (raw and processed) data Raw data: 10-250 million such entries for one sample in an experiment Gene expression: 20,000-row x 125-column matrix Genetic variants: Perhaps 3 million rows
  • Examples of data mining applications in genomics • Classification – Diseases and disease subtypes – Biomarkers for disease – Predicting disease presence or subtype from gene expression • Clustering and visualization – Defining cell types – Molecular definitions of disease • Association rules – Text analysis
  • Electronic health records Mining electronic health records: towards better research applications and clinical care Peter B. Jensen, Lars J. Jensen & Søren Brunak Nature Reviews Genetics 13, 395-405 (June 2012) Unstructured and structured text Medication history Test results Demographics (etc)
  • Genome interpretation
  • Gene expression patterns and neuronal cell types Gene expression Genes Shape and behavior of neurons Cell types Sugino et al, Molecular taxonomy of major neuronal classes in the adult mouse forebrain, Nature Neuroscience 9, 99 - 107 (2005)
  • Genetics of multiple sclerosis • Gene expression data on ~120 patients and 70 controls • Medication, lifestyle, specific diagnosis • Environment important – sunlight, tobacco etc [Figure: per-sample panels labelled “Gene expression” and “Medication, diagnosis etc”]
  • Predictive analysis contests
  • Science-oriented
  • SBV Improver Challenge #1 • Build predictive models for classifying gene expression signatures for: – Psoriasis – Multiple sclerosis – COPD – Lung cancer • Training set is public data, the secret test set was proprietary
  • SBV Improver Challenge #1 • Psoriasis easy • Lung cancer hard • MS diagnostic, COPD somewhere in the middle • MS subtype: no statistically significant submissions! https://www.sbvimprover.com/sbv-improver-symposium-2012-presentations
  • Species translation challenge - Can the perturbations of signaling pathways in one species predict the response to a given stimulus in another species? - Which computational methods are most effective for inferring gene, phosphorylation and pathway responses from one species to another?
  • CAMDA 2013 challenges Question 1: Can we replace the animal study with an in vitro assay? The current safety assessment relies largely on the animal model, which is time-consuming, labor-intensive, and definitely not in line with the animal rights voice. There is a paradigm shift in toxicology to explore the possibility of replacing the animal model with in vitro assays coupled with toxicogenomics. The TGP data contains both in vitro and animal data, which is essential to address this question. Question 2: Can we predict the liver injury in humans using toxicogenomics data from animals? Available data: Drug Information (Excel table) – the basic information about individual drugs from DrugBank Pathology Data (Excel table) – Pathology and clinical chemistry data for each rat “toxicogenomics” Array Metadata (csv format) – Meta data (e.g., dose, time, sacrifice time etc.)
  • Fully open code that runs on the server to generate predictions. Can build on others’ results
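
The “large p, small n” point from the “Shape of data” slide, that datasets with many more features than observations are underpowered for modelling, can be illustrated with a small simulation. This is a hedged sketch on synthetic data, not an analysis from the lecture: the sample count, feature count, seed, and function names are all assumptions. It generates features that are pure noise, then reports how well the best single noise feature appears to separate two arbitrary classes on the training data alone.

```python
import random


def best_single_feature_accuracy(values, labels):
    """Best training accuracy of a one-feature threshold rule, in either direction."""
    n = len(labels)
    order = sorted(range(n), key=lambda i: values[i])
    total_pos = sum(labels)
    best = max(total_pos, n - total_pos)  # trivial rule: same class for everyone
    pos_left = 0
    for rank, i in enumerate(order[:-1], start=1):
        # Put the threshold just after the sample at position `rank` in sorted order.
        pos_left += labels[i]
        neg_left = rank - pos_left
        pos_right = total_pos - pos_left
        neg_right = (n - rank) - pos_right
        # Rule A: left side -> class 0, right side -> class 1. Rule B: the reverse.
        best = max(best, neg_left + pos_right, pos_left + neg_right)
    return best / n


def simulate(n_samples=20, n_features=2000, seed=0):
    """Return (best, mean) training accuracy over features that are pure noise."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]  # balanced, arbitrary class labels
    accuracies = []
    for _ in range(n_features):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        accuracies.append(best_single_feature_accuracy(noise, labels))
    return max(accuracies), sum(accuracies) / len(accuracies)


best, mean = simulate()
print(f"best noise feature: {best:.2f} training accuracy (mean over features: {mean:.2f})")
```

Because the features carry no signal, any apparent accuracy above chance comes purely from searching many features with few samples. This is one reason contests like those above score submissions on a held-out test set rather than on the training data.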