Strata London: Big (Sequence) Data in Pharmaceutical R&D

Does pre-competitive collaboration ease the pain of adopting disruptive big-data technologies? This question is tackled using the example of management/analysis of large genomic sequence data sets, and their role in the development of personalised medicine.

Published in: Technology
  • 2012: the most exciting time to be working in molecular biology for over half a century. Over the past 5 years, new techniques have let us measure our genetic blueprint with unprecedented accuracy. Last month a major international project, the “Encyclopedia of DNA Elements” (ENCODE), published findings from hundreds of researchers in 30 publications. They also made the data and analysis available as a virtual machine, but that’s a topic for another talk. Today we know an order of magnitude more, in data terms, about the switches that regulate our genes than we did previously. The new scientific approaches embodied by ENCODE and many others have ready applications to industrial R&D throughout the life sciences, from plant breeding to biofuels to personal hygiene.
  • This talk focuses on the highest profile of all these applications: medicine. It is widely anticipated, e.g. by Eric Lander (PI of the genome sequencing project): “genomic medicine will revolutionise healthcare for our children and our children's children”. As with all disruptive technologies, there are challenges to overcome. We’ll get to those later, but I'd like to start this talk with a quick introduction to the science.
  • …And I'd like to use the Eagle fact sheet to introduce … DNA sequences. So, here's Eagle, nestled in the beautiful South Cambridgeshire countryside at Babraham Hall, 7 miles south-east of our namesake, the Eagle public house, where in 1953 Crick and Watson famously announced their discovery of the double helix structure of DNA. <CL> And this line running for 1 mile south of Addenbrooke's Hospital marks the location of...
  • ... the DNA cycle path. Set into the tarmac are 10,000 coloured tiles. Each colour represents one of the four letters of the DNA alphabet: A, C, G and T. The sequence of these letters codes for a single human gene, the breast cancer susceptibility locus BRCA2.
  • A further 5 miles down the cycle path is the Wellcome Trust Sanger Institute, home of the UK contribution to the genome sequencing project. Released in 2000, the first sequence took 10 years, and the sequencing cost over $100M. If this path were to continue for the 3 billion letters and 20,000 genes of the human genome, it would circle the world 10 times! You have this complete molecular instruction set reproduced in each of the ten trillion cells in your body.
  • We talk about THE human genome, but genome sequences vary slightly between individuals. Even single-letter differences, inherited from your parents, can lead to differences in observable traits (phenotypes): height, hair colour, or disease susceptibility. By comparing decoded genomes of people with a trait to those without, you can determine which differences are statistically likely to be associated with it. This slide is an example of such an experiment, which compared 27,000 people. From the whole genome (top panel) we zoom into a 1 million base pair region (0.03% of the genome) that contains the most strongly associated signal (y-axis): the single-letter change called rs2470893. The bottom panel shows the genes in this region. <CL> So what is the trait in this case? <CL> ...Coffee consumption! The strongest genomic signal lies in the region of CYP1A1, which is a primary caffeine-metabolising enzyme. Another strong association is with NRCAM, a neuronal cell adhesion gene implicated in addiction vulnerability. So what does this mean? For each copy of the letter T you inherit from your parents, you will drink 0.2 extra cups of coffee/day. I have had my genome decoded by 23andMe; I have one T, the other is a C. That means my genome at this location is responsible for about 1 cup per week of my prodigious coffee habit.
  • This leads to pharmacogenomics: the influence of genetic variation on drug response. Discover pharmacogenomic associations (biomarker discovery). Use biomarkers to develop genetic tests. Use tests to give drugs to the right patient at the right dose at the right time. Hypertension patients with a gene variant that slowly metabolises Warfarin (CYP2C9) need a lower dose to avoid an adverse reaction. Only cancer patients with a mutation in a gene (HER2) will get therapeutic benefit from Herceptin. More examples are being published all the time. Consideration of pharmacogenomics in clinical trials is becoming widely accepted => stratification.
  • Alongside biomarker discovery, genomics is also used for target discovery: discover interesting disease-related biology that can be modified using drugs. Find genetic drivers for disease (e.g. mutations in cancer) and then develop drugs that target those mutations. Genomics has revolutionised our understanding of the genetic basis behind disease. The CEO of Eli Lilly, John Lechleiter: “In some cases, biological knowledge is akin to lights being turned on in a room versus groping around in the dark.”
  • I would like to call modern genomics data “an embarrassment of riches”, but we're not quite there yet. We’re still at the stage of severe information overload.
  • Travel up the cycle path 2 miles to the Chemistry Department in the centre of Cambridge to find the home of a technology for massively parallel next-generation DNA sequencing (NGS). This has caused sequencing costs and generation times to plummet, from $100M and 10 years a decade ago to $10K and 10 days today. This is one of the NGS machines; it costs about £300,000.
  • As a result, genomes are now being sequenced in their thousands. Head south on the path again, back to the Sanger Institute, which coordinated data collection for the international 1000 Genomes Project. Its aim is to provide a comprehensive catalogue of common genetic variants. It has so far generated over 200 TB of sequence data.
  • …Analysis at the kilo-genomes scale takes compute. Sanger has a 10,000-CPU high performance compute cluster and petabytes of storage. They also have a large team of sysadmins to keep it all running. Such projects are becoming commonplace; the UK10K is underway. Like many bio projects, the 1000 Genomes data is all in the public domain, and is available as an AWS public data set. Public data banks that archive these data, such as the European Nucleotide Archive (Hinxton), are growing to petabytes of data. These data banks are used daily by genomics researchers.
  • An iconic slide in the field. Sequencing costs continue to plummet, at 1 order of magnitude per year. With nanopore technology (e.g. Oxford Nanopore) the slide is predicted to continue. We are getting close to the magic $1000 to sequence a person’s genome, and to routine clinical whole genome sequencing. But there is a problem: sequencing costs are falling much faster than computational costs (Moore’s law). We have recently reached the point where data analysis costs exceed the data generation costs.
  • Why is bioinformatics becoming the bottleneck? Time is spent configuring experimental infrastructure, collecting metadata, and managing the data input files. Time is spent buried in vast quantities of output data in a variety of obscure formats. Precious little time is available for actual analysis. This is the informatics bathtub. What we need, of course, are systems that take care of the grunt work, leaving the researcher free to do >ahem< research.
  • So, if you will forgive the mixed metaphor, this leaves us in the situation of having to battle the flood to reach the haystack before we can even get started looking for the flaming needle!
  • The pharma industry gains no competitive advantage by developing the required management infrastructure in-house. The alternative is “open innovation”: looking outside of the organisation for ideas and know-how to advance their technology. The Pistoia Alliance is a shining example of open innovation in action. Since 2009 the Alliance has taken a collaborative approach to defining the key informatics challenges, with the aim of improving the interoperability of R&D business processes. The Alliance membership now extends to over 50 life science companies, vendors, and publishers. [To clarify what I mean by open innovation: for open source, open data and open access publishing, it is the final product that is made openly available to all. Open innovation operates at the other end of the development lifecycle, i.e. it is the initial specification of the problem that is made public.]
  • Pistoia identified NGS as a key informatics challenge faced by the pharma industry. In 2010 it embarked on the Sequence Services project, directly sponsored by 4 pharma companies: GSK, AZ, Roche and Lundbeck. So what is the overarching vision of Sequence Services? A software platform to support experimental research using DNA sequences.
  • The requirements came as a 30-page RFP, compiled by the four sponsors. It is a detailed technical specification with four main drivers: 1. to collaborate securely with external partners without opening up the corporate firewall; 2. to share the cost of access to the latest research data and software; 3. to convert expenditure from capital to operational; and 4. elasticity: to get on-demand access to compute and storage that reflects the highly variable infrastructure requirements of R&D activities. Cloud, basically.
  • When providing a hosted service for sequence data, careful consideration must be given to data protection legislation. Laws vary from country to country, and may affect where the data can be stored and processed. It can also be argued that, in addition to being personally identifiable, sequence data also encodes health-related data, bringing additional legislation such as the US Health Insurance Portability and Accountability Act into play. For an SME like Eagle, it is crucial to know where your processes stand against the legislation. Since being involved in Pistoia Sequence Services we have found information security management systems such as ISO27001 to be invaluable.
  • Eagle participated in Sequence Services, and we were delighted to be funded. We have been building analysis pipelines, developing HPC solutions, and hosting secure cloud apps for years. In addition to meeting the needs of the pharma industry, we saw this as an opportunity to build a platform we could use ourselves in order to improve on the delivery of what we already do.
  • In broad terms, what were we proposing? How is it different to what has gone before? How do we get the researcher out of the bathtub? It's about experimental templates, not fixed sample-processing workflows; there are LIMS systems for that. It's about analytic components, not end-to-end automated analyses. Simply put, it is a platform that enables the researcher to perform experiments. It does not attempt to do the experiment for them!
  • So now we know what we want to achieve, the next step is to engage a world-beating partner. We were lucky enough to spark the interest of Cycle Computing. This addresses the question: how does an SME like Eagle compete with an acknowledged HPC powerhouse like the Sanger? Well, Cycle Computing recently spun up the world’s largest virtual supercomputer, consisting of 50,000 cores, which well and truly trumps the Sanger. Its cost is $5,000/hour; a similar system is over $10 million up front on traditional hardware.
  • Next is to design the architecture given the budget and timelines imposed by the project. Our approach, proven in other engagements, was to adopt a loosely coupled architecture: select best-of-breed components, many of which are open source, and connect them using industry standards, protocols, and web services. Federate as much as possible (identity, infrastructure) to third parties (e.g. AWS). We also implemented the emerging ISA-Tab metadata tracking standard.
  • So we now have the Elastic Analysis Platform, for storage, analysis and sharing of life sciences data in the cloud. Data scale was validated using the 1000 Genomes public data set, adding value through metadata (ISA-Tab). Compute scale was validated using a Bayer/Max Planck Institute collaboration investigating cancer biogenesis. The pipeline was tested against public data from the NCBI GEO database.
  • How long has this all taken? The Pistoia Sequence Services project officially ran over a period of 9 months, from July '11 to April '12. We have been developing ElasticAP for 9 months, from Jan '12, and now have a restricted beta release of the software. We are also looking for venture funding to take the project forward. The seed funding from Pistoia has been a huge benefit in this regard.
  • Inflection point. Kenneth Cukier (The Economist): “a change in scale leads to a change in state”. Applicable to other areas? I stole the analysis bathtub idea from Thayne Coffman: “lessons learned from network analysis R&D in defence”.
  • Thanks to Cycle Computing and the Pistoia Alliance.
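
Several figures quoted in the notes above are simple arithmetic, and a short Python sketch can be used to sanity-check them. The letter counts, costs, timescales and the 0.2 cups/day effect are taken from the talk; the Earth's circumference (roughly 24,901 miles) and all function names are assumptions introduced here purely for illustration.

```python
# Back-of-envelope checks of figures quoted in the speaker notes.
# Values marked "assumed" are NOT from the talk.

LETTERS_PER_MILE = 10_000            # DNA cycle path: 10,000 tiles over 1 mile
GENOME_LETTERS = 3_000_000_000       # human genome size quoted in the notes
EARTH_CIRCUMFERENCE_MILES = 24_901   # assumed equatorial circumference


def path_laps_around_earth(letters: int = GENOME_LETTERS) -> float:
    """How many times a genome-length tile path would circle the Earth."""
    path_miles = letters / LETTERS_PER_MILE
    return path_miles / EARTH_CIRCUMFERENCE_MILES


def extra_cups_per_week(t_alleles: int, cups_per_day_per_allele: float = 0.2) -> float:
    """Extra weekly coffee attributed to copies of the T allele of rs2470893."""
    return t_alleles * cups_per_day_per_allele * 7


def fold_change(before: float, after: float) -> float:
    """Ratio between an old and a new value (cost or duration)."""
    return before / after


if __name__ == "__main__":
    print("laps around the Earth:", round(path_laps_around_earth(), 1))
    print("extra cups/week with one T:", round(extra_cups_per_week(1), 1))
    print("cost reduction factor:", fold_change(100_000_000, 10_000))
    print("time reduction factor:", fold_change(10 * 365, 10))
```

Under these assumptions the path comes out at about 12 laps, the same order as the "10 times" in the notes, and one T allele corresponds to 1.4 cups per week, consistent with "about 1 cup per week". The cost falls by a factor of 10,000 and the turnaround by a factor of 365 (counting a year as 365 days).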
  • Transcript

    1. Big (sequence) data in pharmaceutical R&D. William Spooner, CTO and Founder, Eagle Genomics, @wspoonr. O’Reilly Strata | London, 1st October 2012. ©Eagle Genomics Ltd
    2. The dawn of the age of genomic medicine: The Science; The Data Deluge; Pharma’s Challenge; Eagle’s Response. Image: CC-BY-NC-ND 3.0
    3. About Eagle Genomics: Babraham-based consultancy; Informatics: life science R&D; Customers in US, Europe, Asia; Operating for 4 years; 13 employees
    4. The DNA Path: 1 mile; 10,000 letters; 1 gene: BRCA2 (BReast CAncer 2), tumor suppressor. © Keith Edkins (CC BY-SA 2.0)
    5. The Human Genome: 3,000,000,000 letters; 20,000 genes; x10 round the world. First sequence (HGP): released in 2000; took 10 years; cost $100M. © (CC SA 3.0)
    6. Phenotype Association: scientific impact of genomics. Image: Sartr CC BY-NC; Molecular Psychiatry advance online publication 30 August 2011
    7. Genomics in pharmacology. Pharmacogenomics: Genotypic, Transcriptomic, Epigenetic. Genetic Test. Personalised Medicine: Right drug, Right patient, Right time
    8. Genomics in disease research
    9. The Data Deluge. Image: CC-BY 2.0
    10. Next Generation DNA Sequencing (NGS). Latest figures (2012): takes 10 days; costs $10,000. Costs still falling rapidly. © S. Ballard (CC BY-SA 2.0)
    11. 1000 Human Genomes: 200 TB sequence data
    12. Processing 1000 genomes. Sanger Institute HPC: • 10,000 cores • 10 PB storage • Supported by a large team. © T. Harris; © Genome Research Ltd.
    13. $1000 genome: ~$0.01 per Mbase (30x coverage)
    14. What causes the analysis bottleneck? • Poor Experimental Reproducibility • Low Researcher Productivity. [Chart: researcher effort against progression through experiment (collection, analysis, reporting)]
    15. Pharma’s Challenge. Image: Public Domain
    16. Open Innovation. Collaborate: Enterprise, Academia, Government, Foundations. Explore: work together to find a common purpose. Nurture: build trust, shared language. Exploit: turn ideas into tangible benefits
    17. Sequence Services • Vision for a platform for researchers: – where they can solve scientific problems – based on DNA/RNA sequence information – tailored to the needs of the pharmaceutical industry. $50,000 proof-of-concept funding
    18. • Collaborate securely/easily with other organisations/individuals – Without any risk to company firewalls. • Secure access to the latest public data and applications, – Outsource management of rapidly changing resources. • Cost reduction – Convert capital expense to operational expense. • Store large amounts of data in an extensible way. – Internal capacity planning cycles are much longer than the time over which demand varies. – Applies equally well to compute as to storage.
    19. Regulation for storage and analysis of DNA data • Data/identity protection/privacy laws; – Vary between territories, – Can affect where data must be located. • If genomes = personal health information? – Mandates compliance with HIPAA • If analyses used in clinical trials? – Mandates compliance with FDA’s 21 CFR part 11 B • Information security management is key – Certification, e.g. ISO27001, IASME
    20. Eagle’s Solution. Image: iStockphoto all rights reserved
    21. Mission Statement • Researchers need a tool that enables flexible experimental workflows – Distill, mature, scale, apply, integrate, catalog, and share. • Provide prototype experimental templates – NOT Fixed sample-centric workflows built around standing research • Provide reusable analytic tools – NOT Massive, inflexible, automated, analytic “solutions” • Embrace researcher-centric iterative process – DO NOT Try to take the researcher out of the loop. CC-BY-NC-SA 3.0
    22. HPC with AWS. Virtual supercomputer: 50,000 cores; $5,000/hour. Vs. hardware cost of: ~$15,000,000. Used for protein simulation experiment
    23. Architecture: REST, HTTPS, SAML, SSL
    24. The platform for storage, analysis and sharing of life sciences data in the cloud
    25. Pistoia Sequence Services Timeline [Chart: monthly axis, Jul 2011 to Oct 2012]
    26. Big data open innovation in Pharma R&D? Experience from Pistoia sequence services… • Genomic medicine has huge potential, but – Lots of R&D headaches – the “bioinformatics bathtub” • Inflection point at the move to big data – opportunity to consider new delivery models • Open innovation for Pistoia – Collaboration improves specification – Shared development costs – New approaches are introduced • Open innovation for Eagle – Pre-validation of opportunity – Introduction to new partners – Accelerated product development • Pharma’s requirements are not unique – Apply to other areas of pre-competitive big data R&D
    27. +44 (0)1223 654481; @wspoonr; @eaglegen. Eagle® is a registered trademark no. 010418135 of Eagle Genomics Ltd. Postal address: Eagle Genomics Ltd., Babraham Research Campus, Cambridge CB22 3AT, United Kingdom. ©Eagle Genomics Ltd
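
The headline numbers on slides 11, 13 and 22 can be cross-checked with a few lines of Python. This is only a rough sketch: the inputs are the figures shown on the slides, while the function names are hypothetical, and the break-even calculation deliberately ignores power, staffing and hardware depreciation, none of which the talk quantifies.

```python
# Rough cross-checks of figures from the slide transcript.
# Function names are illustrative; inputs come from the slides.

GENOME_BASES = 3_000_000_000  # slide 5: 3,000,000,000 letters


def cost_per_megabase(genome_cost_usd: float, coverage: float,
                      genome_bases: int = GENOME_BASES) -> float:
    """Cost per sequenced megabase when the genome is read `coverage` times over."""
    sequenced_megabases = genome_bases * coverage / 1_000_000
    return genome_cost_usd / sequenced_megabases


def break_even_hours(hardware_cost_usd: float, cloud_rate_usd_per_hour: float) -> float:
    """Hours of cloud use whose cost equals the up-front hardware price.

    Simplification: ignores running costs of owned hardware.
    """
    return hardware_cost_usd / cloud_rate_usd_per_hour


def terabytes_per_genome(total_tb: float, genomes: int) -> float:
    """Crude average data volume per genome."""
    return total_tb / genomes


if __name__ == "__main__":
    print("USD per Mbase:", round(cost_per_megabase(1000, 30), 3))
    print("break-even hours:", break_even_hours(15_000_000, 5_000))
    print("TB per genome:", terabytes_per_genome(200, 1000))
```

With the slide values, a $1000 genome at 30x coverage works out at about $0.011 per Mbase, in line with the "~$0.01" on slide 13. The $15,000,000 hardware figure equals 3,000 hours (125 days) of the 50,000-core cluster at $5,000/hour, and 200 TB over 1000 genomes averages 0.2 TB each.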