Strata London: Big (Sequence) Data in Pharmaceutical R&D
 


Does pre-competitive collaboration ease the pain of adopting disruptive big-data technologies? This question is tackled using the example of management/analysis of large genomic sequence data sets, and their role in the development of personalised medicine.


Usage Rights

© All Rights Reserved

  • 2012: the most exciting time to be working in molecular biology for over half a century. Over the past 5 years, new techniques have let us measure our genetic blueprint with unprecedented accuracy. Last month a major international project, the “Encyclopedia of DNA Elements” (ENCODE), published findings from hundreds of researchers in 30 publications. They also made the data and analysis available as a virtual machine, but that’s a topic for another talk. Today we know an order of magnitude more, in data terms, about the switches that regulate our genes than we did previously. The new scientific approaches embodied by ENCODE and many others have ready applications to industrial R&D throughout the life sciences, from plant breeding to biofuels to personal hygiene.
  • This talk focuses on the highest profile of all these applications: medicine. It is widely anticipated, e.g. by Eric Lander (PI of the genome sequencing project): “genomic medicine will revolutionise healthcare for our children and our children's children”. As with all disruptive technologies, there are challenges to overcome. We’ll get to those later. But I'd like to start this talk with a quick introduction to the science.
  • …And I'd like to use the Eagle fact sheet to introduce … DNA sequences. So, here's Eagle, nestled in the beautiful South Cambridgeshire countryside at Babraham Hall, 7 miles south-east of our namesake, the Eagle public house, where in 1953 Crick and Watson famously announced their discovery of the double helix structure of DNA. And this line running for 1 mile south of Addenbrooke's Hospital marks the location of...
  • ... the DNA cycle path. Into the tarmac are set 10,000 coloured tiles. Each colour represents one of the four letters of the DNA alphabet: A, C, G and T. The sequence of these letters codes for a single human gene, the breast cancer susceptibility locus BRCA2.
  • A further 5 miles down the cycle path is the Wellcome Trust Sanger Institute, home of the UK contribution to the genome sequencing project. Released in 2000, it took 10 years, and the sequencing cost over $100M. If this path were to continue for the 3 billion letters and 20,000 genes of the human genome, it would circle the world 10 times! You have this complete molecular instruction set reproduced in each of the ten trillion cells in your body.
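
As a rough sanity check on the cycle-path analogy, here is a minimal sketch of the arithmetic. It assumes the tiles are evenly spaced at 10,000 per mile (the figures given for the BRCA2 path) and takes the Earth's circumference as about 24,901 miles, a value that is my assumption and not from the talk. The answer comes out at roughly a dozen circuits, the same order as the 10 quoted.

```python
# Back-of-envelope check of the "circle the world" claim.
# Assumptions: evenly spaced tiles, and an equatorial circumference
# of about 24,901 miles (not a figure from the talk).
TILES_PER_MILE = 10_000              # 10,000 tiles along the 1-mile BRCA2 path
GENOME_LETTERS = 3_000_000_000       # ~3 billion letters in the human genome
EARTH_CIRCUMFERENCE_MILES = 24_901   # assumed value


def path_length_miles(letters: int, tiles_per_mile: int = TILES_PER_MILE) -> float:
    """Length of a path that has one tile per DNA letter."""
    return letters / tiles_per_mile


def times_around_earth(miles: float) -> float:
    """Number of circuits of the Earth that a path of this length would make."""
    return miles / EARTH_CIRCUMFERENCE_MILES


if __name__ == "__main__":
    miles = path_length_miles(GENOME_LETTERS)
    print(f"{miles:,.0f} miles, about {times_around_earth(miles):.0f} circuits")
```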
  • We talk about THE human genome, but genome sequences vary slightly between individuals. Even single-letter differences, inherited from your parents, can lead to differences in observable traits (phenotypes): height, hair colour, or disease susceptibility. By comparing the decoded genomes of people with a trait to those without, you can determine which differences are statistically likely to be associated with it. This slide is an example of such an experiment, which compared 27,000 people. From the whole genome (top panel) we zoom into a 1 million base pair region (0.03% of the genome) that contains the most strongly associated signal (y-axis): the single-letter change called rs2470893. The bottom panel shows the genes in this region. So what is the trait in this case? ... Coffee consumption! The strongest genomic signal lies in the region of CYP1A1 and CYP1A2, the latter being a primary caffeine-metabolising enzyme. Another strong association is with NRCAM, a neuronal cell adhesion gene implicated in addiction vulnerability. So what does this mean? For each copy of the letter T you inherit from your parents, you will drink 0.2 extra cups of coffee per day. I have had my genome decoded by 23andMe: I have one T, the other is a C. That means my genome at this location is responsible for about 1 cup per week of my prodigious coffee habit.
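
The per-allele arithmetic above can be sketched in a few lines. This assumes a simple additive model (each copy of T adds the same amount), which is my reading of the "0.2 cups per copy" figure and not something the slide spells out; the function names are illustrative only.

```python
# Sketch of the additive (per-allele) effect quoted for rs2470893.
# Assumption: each copy of the effect allele contributes equally.
CUPS_PER_DAY_PER_T = 0.2          # effect size quoted in the talk
GENOME_BP = 3_000_000_000         # ~3 billion base pairs


def extra_cups_per_week(genotype: str, effect_allele: str = "T") -> float:
    """Extra weekly cups attributed to the effect allele for a two-letter genotype."""
    genotype = genotype.upper()
    if len(genotype) != 2 or any(base not in "ACGT" for base in genotype):
        raise ValueError("genotype must be two letters from A, C, G, T")
    copies = genotype.count(effect_allele)
    return copies * CUPS_PER_DAY_PER_T * 7


def region_percent_of_genome(region_bp: int, genome_bp: int = GENOME_BP) -> float:
    """Size of a region as a percentage of the whole genome."""
    return 100 * region_bp / genome_bp


if __name__ == "__main__":
    for genotype in ("CC", "CT", "TT"):
        print(genotype, round(extra_cups_per_week(genotype), 1), "extra cups/week")
    print(round(region_percent_of_genome(1_000_000), 2), "% of the genome")
```

For a C/T genotype this gives 1.4 extra cups per week, which is the "about 1 cup per week" mentioned above.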
  • This leads to pharmacogenomics: the influence of genetic variation on drug response. Discovering pharmacogenomic associations is biomarker discovery. Use biomarkers to develop genetic tests. Use the tests to give drugs to the right patient at the right dose at the right time. Patients with a variant of the gene that metabolises warfarin (CYP2C9) break it down more slowly and need a lower dose to avoid an adverse reaction. Only cancer patients with a mutation in a gene (HER2) will get therapeutic benefit from Herceptin. More examples are being published all the time. Consideration of pharmacogenomics in clinical trials is becoming widely accepted => stratification.
  • Alongside biomarker discovery, genomics is also used for target discovery: discovering interesting disease-related biology that can be modified using drugs. Find the genetic drivers of disease (e.g. mutations in cancer) and then develop drugs that target those mutations. Genomics has revolutionised our understanding of the genetic basis of disease. In the words of the CEO of Eli Lilly, John Lechleiter: “In some cases, biological knowledge is akin to lights being turned on in a room versus groping around in the dark.”
  • I would like to call modern genomics data “an embarrassment of riches”. But we're not quite there yet. We’re still at the stage of severe information overload.
  • Travel up the cycle path 2 miles to the Chemistry Dept in the centre of Cambridge and you find the home of a technology for massively parallel next-generation DNA sequencing (NGS). This has caused sequencing costs and generation times to plummet, from $100M and 10 years a decade ago to $10K in 10 days today. This is one of the NGS machines; it costs about £300,000.
  • As a result, genomes are now being sequenced in their thousands. Head south on the path again, back to the Sanger Institute, which coordinated data collection for the international 1000 Genomes Project. Its aim is to provide a comprehensive catalogue of common genetic variants. It has so far generated over 200 TB of sequence data.
  • …Analysis at the kilo-genome scale takes compute. Sanger has a 10,000-CPU high-performance compute cluster and petabytes of storage. They also have a large team of sysadmins to keep it all running. Such projects are becoming commonplace; the UK10K project is underway. Like many bio projects, the 1000 Genomes data is all in the public domain, and is available as an AWS public data set. Public data banks that archive these data, such as the European Nucleotide Archive (Hinxton), are growing to petabytes of data. These data banks are used daily by genomics researchers.
  • An iconic slide in the field. Sequencing costs continue to plummet, at 1 order of magnitude per year. With nanopore technology (e.g. Oxford Nanopore), the slide is predicted to continue. We are getting close to the magic $1000 to sequence a person’s genome, and with it routine clinical whole-genome sequencing. But there is a problem: sequencing costs are falling much faster than computational costs (Moore’s law). We have recently reached the point where data analysis costs exceed data generation costs.
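
To see why the two curves diverge so quickly, here is a minimal sketch comparing the two rates of decline. The sequencing rate (tenfold per year) is the one quoted above; the Moore's-law rate (cost halving every two years) is my assumption, and other doubling periods are commonly quoted.

```python
# Sketch: how fast the gap between sequencing cost and compute cost opens up.
# Assumption: compute cost halves every 2 years (one common reading of
# Moore's law); the tenfold-per-year sequencing figure is from the talk.
SEQUENCING = {"factor_per_period": 10.0, "period_years": 1.0}
COMPUTE = {"factor_per_period": 2.0, "period_years": 2.0}


def fold_reduction(factor_per_period: float, period_years: float, years: float) -> float:
    """Total fold reduction in cost after `years` of steady decline."""
    return factor_per_period ** (years / period_years)


def relative_compute_burden(years: float) -> float:
    """Factor by which compute cost grows relative to sequencing cost."""
    return fold_reduction(years=years, **SEQUENCING) / fold_reduction(years=years, **COMPUTE)


if __name__ == "__main__":
    for years in (1, 2, 4):
        print(years, "years:", round(relative_compute_burden(years)), "x")
```

Under these assumptions, after four years sequencing is 10,000 times cheaper while compute is only 4 times cheaper, so analysis becomes 2,500 times more expensive relative to data generation.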
  • Why is bioinformatics becoming the bottleneck? Time is spent on configuring experimental infrastructure, collecting metadata, and managing the data input files. Time is spent buried in vast quantities of output data in a variety of obscure formats. Precious little time is available for actual analysis. This is the informatics bathtub. What we need, of course, are systems that take care of the grunt work, leaving the researcher free to do >ahem< research.
  • So, if you will forgive the mixed metaphor, this leaves us in the situation of having to battle the flood to reach the haystack before we can even get started looking for the flaming needle!
  • The pharma industry gains no competitive advantage by developing the required management infrastructure in-house. The alternative is “open innovation”: looking outside of the organisation for ideas and know-how to advance their technology. The Pistoia Alliance is a shining example of open innovation in action. Since 2009 the Alliance has taken a collaborative approach to defining the key informatics challenges, with the aim of improving the interoperability of R&D business processes. The Alliance membership now extends to over 50 life science companies, vendors, and publishers. [To clarify what I mean by open innovation: for open source, open data and open access publishing, it is the final product that is made openly available to all. Open innovation operates at the other end of the development lifecycle, i.e. it is the initial specification of the problem that is made public.]
  • Pistoia identified NGS as a key informatics challenge faced by the pharma industry. In 2010 it embarked on the Sequence Services project, directly sponsored by 4 pharma companies: GSK, AZ, Roche and Lundbeck. So what is the overarching vision of Sequence Services? A software platform to support experimental research using DNA sequences.
  • The requirements came as a 30-page RFP, compiled by the four sponsors. It was a detailed technical specification with four main drivers: 1. to collaborate securely with external partners without opening up the corporate firewall; 2. to share the cost of access to the latest research data and software; 3. to convert expenditure from capital to operational; and 4. elasticity: to get on-demand access to compute and storage, which reflects the highly variable infrastructure requirements of R&D activities. Cloud, basically.
  • When providing a hosted service for sequence data, careful consideration of data protection legislation is needed. Laws vary from country to country, and may affect where the data can be stored and processed. It can also be argued that, in addition to being personally identifiable, sequence data also encodes health-related data, bringing additional legislation such as the US Health Insurance Portability and Accountability Act (HIPAA) into play. For an SME like Eagle, it is crucial to know where your processes stand against the legislation. Since being involved in Pistoia Sequence Services we have found information security management systems such as ISO 27001 to be invaluable.
  • Eagle participated in Sequence Services, and we were delighted to be funded. We have been building analysis pipelines, developing HPC solutions and hosting secure cloud apps for years. In addition to meeting the needs of the pharma industry, we saw this as an opportunity to build a platform we could use ourselves in order to improve on the delivery of what we already do.
  • In broad terms, what were we proposing? How is it different to what has gone before? How do we get the researcher out of the bathtub? It's about experimental templates, not fixed sample-processing workflows; there are LIMS systems for that. It's about analytic components, not end-to-end automated analyses. Simply put, it is a platform that enables the researcher to perform experiments. It does not attempt to do the experiment for them!
  • So now we know what we want to achieve, the next step is to engage a world-beating partner. We were lucky enough to spark the interest of Cycle Computing. This addresses the question: how does an SME like Eagle compete with an acknowledged HPC powerhouse like the Sanger? Well, Cycle Computing recently spun up the world’s largest virtual supercomputer, consisting of 50,000 cores, which well and truly trumps the Sanger. Its cost is $5,000/h; a similar system is over $10 million up front on traditional hardware.
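
The cost comparison is easier to judge as a break-even point. The sketch below uses the two figures quoted for the 50,000-core cluster and ignores the running costs of owned hardware (power, space, sysadmins), which is a simplifying assumption of mine; since the hardware figure is "over $10 million", the result is a lower bound.

```python
# Sketch: hours of cloud use that cost as much as buying the hardware.
# Assumption: owned hardware has no running costs, so this understates
# the true break-even point.
CLOUD_COST_PER_HOUR = 5_000          # $/h quoted for the 50,000-core cluster
UPFRONT_HARDWARE_COST = 10_000_000   # $ quoted as a lower bound


def break_even_hours(upfront: float = UPFRONT_HARDWARE_COST,
                     hourly: float = CLOUD_COST_PER_HOUR) -> float:
    """Hours of on-demand use whose cost equals the up-front purchase."""
    if hourly <= 0:
        raise ValueError("hourly cost must be positive")
    return upfront / hourly


if __name__ == "__main__":
    hours = break_even_hours()
    print(f"{hours:,.0f} hours, or about {hours / 24:.0f} days of continuous use")
```

That is 2,000 hours, or roughly 83 days of round-the-clock use, before the up-front purchase starts to pay for itself. For the bursty workloads typical of R&D, on-demand capacity looks attractive.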
  • Next is to design the architecture, given the budget and timelines imposed by the project. Our approach, proven in other engagements, was to adopt a loosely coupled architecture: select best-of-breed components, many of which are open source, and connect them using industry standards, protocols, and web services. Federate as much as possible (identity, infrastructure) to third parties (e.g. AWS). We also implemented the emerging ISA-Tab metadata tracking standard.
  • So we now have the Elastic Analysis Platform, for storage, analysis and sharing of life sciences data in the cloud. Data scale was validated using the 1000 Genomes public data set, adding value through metadata (ISA-Tab). Compute scale was validated using a Bayer/Max Planck Institute collaboration investigating cancer biogenesis. The pipeline was tested against public data from the NCBI GEO database.
  • How long has this all taken? The Pistoia Sequence Services project officially ran over a period of 9 months, from July '11 to April '11. We have been developing ElasticAP for 9 months, from Jan '11, and now have a restricted beta release of the software. We are also looking for venture funding to take the project forward. The seed funding from Pistoia has been a huge benefit in this regard.
  • Inflection point. Kenneth Cukier (The Economist): “a change in scale leads to a change in state”. Applicable to other areas? I stole the analysis bathtub idea from Thayne Coffman, “lessons learned from network analysis R&D in defence”.
  • Thanks to Cycle Computing and the Pistoia Alliance.