Accelerate Pharmaceutical R&D with Big Data and MongoDB



Introducing disruptive technologies, including the use of unstructured data, is critical to pharmaceutical R&D. We will explore how MongoDB can be used to accelerate that R&D. We will also have an open discussion with panel members who are using MongoDB in this space. This session will be 30 minutes and will be followed by a 20-minute panel discussion led by Jason Tetrault and Deniz Kural.

Published in: Technology, Business

  • A chat about introducing disruptive technology. Mention the panel. Please hold questions until the end.
  • I am an Architect focusing on drug R&D. I support researchers, specifically in Oncology and Infection R&D. I focus on Next Generation Sequencing and IAAS. Also running some Big Data pilots.
  • Introduce disruptive concepts. Stress: Unstructured Data; NoSQL + Map Reduce; Document Store and Mongo. How: Lunch and Learns, Big Cookies for Big Data.
  • Make a good pilot. Show what unstructured data is. See what you can do. Make something difficult easy.
  • Run through tech findings. We are not alone… Mention the panel. Ask if anyone else is using Mongo in this space.
  • Focus on closing out Introduction of Disruptive Technology. Genome sequencing is getting cheaper. More public samples are available. Big Data tools can be used to sift through some of these bigger questions.
  • Make joke about “All set, ready to work on this?” A Whole Human Genome is the whole thing. They tend to be 10–20X the size of a “Standard Human Experiment”. Rattle off numbers.
  • Make sure you talk about metadata and the Big Data Store. Wrap up the vision. Now, let's talk about VCF.
  • Now let's look at a variant. Remember that question I had earlier: how do I ask my driving question?

    1. Mongo Boston 2013: Accelerate Pharmaceutical R&D with Big Data and MongoDB. Jason Tetrault, Architect, AstraZeneca
    2. AstraZeneca at a glance
       We are a global, innovation-led biopharmaceutical company with a mission to make a meaningful difference to patient health through great medicines and a belief that health connects us all.
       • Global: 57,000 people; sales in 100 countries; manufacturing in 16; R&D across 3 continents; $4 bn invested in R&D; $33 bn sales in 2011. Constantly anticipating and adapting to the needs of a changing world.
       • Targeted: Cancer; Cardiovascular; Gastrointestinal; Infection; Neuroscience; Respiratory & inflammation. Driving continued innovation where we can make the most difference.
       • Collaborative: HCPs; Patients; Payers; Regulators; Partners; Local communities. Connecting with others to achieve common goals in improving healthcare.
       • Committed to driving business success responsibly.
    3. Architect: R&D Information. What does this mean?
       • Support the Researchers
       • AstraZeneca has multiple iMeds that are focused on different areas of R&D
       • Specifically, I work with the Oncology and Infection iMeds here in Waltham
       • Support different software and system builds and / or purchases
       • Looking to apply new technologies to enable Researchers
       • Core Focus:
         • Next Generation Sequencing Scaling
         • IAAS
         • Big Data Pilots and Exploration
    4. Introduction of Disruptive Technology: Step 1: Introduce Concepts
       • What
         • Unstructured Data
         • NoSQL
         • Categories (Document, Key Value, Graph)
         • Hadoop
         • Map Reduce
         • Horizontal Scalability
         • Cloud (IAAS and SAAS)
       • How
         • Lunch and Learns
         • Examples (Craigslist uses this)
         • “Big Cookies for Big Data”
         • Demonstrations
    5. Introduction of Disruptive Technology: Step 2: Pilots
       • Goals:
         • We needed to show what “Unstructured Data” actually means.
         • We needed to prove what these technologies can and cannot do for us.
         • Find something difficult and make it easy!
         • We needed to find the best way to enable researchers.
    6. Iterative Agile Analytics
       How quickly can I make indirect associations between gene sequence features and structural fingerprints? Now scale up to 4M compounds, 20K assays … and more decoration: 5 to 50 TB.
       • Data sources: Compound with Fingerprints; Gene sequence; Target mappings; Assay results
       • Steps shown: Gather; Aggregate; Analyze; Decorate
       • [Diagram labels: Compound JSON; Pivot; Map Reduce; Matrix; AssayResults (300K Compounds) – 200 GB; GeneCatalog (1.4M fingerprints) – 1 GB; Fingerprint with compounds; (500M pairs) – 81 GB; Tanimoto matrix; Gene matrix; Target mappings]
       • Easily convert to JSON and import an initial cut of data from different sources (e.g. spreadsheets, RDBMS, …)
       • Embrace unstructured data, massage it into a more useful format: Rinse, Wash, Repeat!
       • Ability to decorate data, adding fields and additional datastores quickly
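The Tanimoto matrix named on this slide holds pairwise fingerprint similarities. The slide does not show how the pilot computed it, so the following is only a minimal sketch of the standard Tanimoto (Jaccard) coefficient. It assumes each fingerprint is stored as an array of set-bit positions; the function names and the sample fingerprints are hypothetical.

```javascript
// Tanimoto coefficient between two fingerprints: |A ∩ B| / |A ∪ B|.
// Assumption: a fingerprint is an array of the bit positions that are set.
function tanimoto(fpA, fpB) {
  const a = new Set(fpA);
  const b = new Set(fpB);
  let shared = 0;
  for (const bit of a) {
    if (b.has(bit)) shared += 1;
  }
  const union = a.size + b.size - shared;
  // Convention chosen for this sketch: two empty fingerprints score 0.
  return union === 0 ? 0 : shared / union;
}

// Full pairwise matrix; entry [i][j] is the similarity of fingerprints i and j.
function tanimotoMatrix(fingerprints) {
  return fingerprints.map((row) => fingerprints.map((col) => tanimoto(row, col)));
}

// Hypothetical sample data, not taken from the pilot.
const sampleFingerprints = [[1, 4, 7, 9], [1, 4, 8], [2, 3]];
const similarityMatrix = tanimotoMatrix(sampleFingerprints);
console.log(similarityMatrix);
```

The matrix is symmetric with ones on the diagonal, so at the slide's scale (hundreds of millions of pairs) only the upper triangle would need to be computed, which is one reason a map-reduce style split over pairs fits this step.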
    7. Introduction of Disruptive Technology: Pilot Findings
       • Tech Findings:
         • GSON can help with weird character conversions.
         • There are per-node write limits (500 per second), but you can save a batch of documents at once (change to bulk insert).
         • Users think that even though they could do it relationally, this was way quicker.
         • Using arrays for multiple results in a doc can be interesting.
         • JSON and JavaScript are fairly natural to technical researchers (python).
       • We are not alone…
         • Davy Suvee
         • tranSMART
         • Seven Bridges
         • …
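The bulk insert finding above amounts to grouping documents so that each write call carries many of them. A minimal sketch of that batching follows. The batch size of 1000 is an arbitrary assumption, the function names are hypothetical, and `insertBatch` is a stand-in callback for whatever bulk call the driver offers (for example `insertMany` in current MongoDB shells and drivers); here it is replaced by an in-memory array so the sketch runs without a database.

```javascript
// Split an array of documents into consecutive batches of at most batchSize.
function toBatches(docs, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// Hand each batch to insertBatch and return how many write calls were made.
// insertBatch is a hypothetical callback standing in for a driver bulk insert.
function saveInBatches(docs, batchSize, insertBatch) {
  let calls = 0;
  for (const batch of toBatches(docs, batchSize)) {
    insertBatch(batch);
    calls += 1;
  }
  return calls;
}

// Demonstration with an in-memory stand-in for the collection.
const stored = [];
const compoundDocs = Array.from({ length: 2500 }, (_, i) => ({ compoundId: i }));
const writeCalls = saveInBatches(compoundDocs, 1000, (batch) => stored.push(...batch));
console.log(writeCalls, stored.length);
```

With one document per call the 2500 sample documents would need 2500 writes; batched, they need three, which is the effect the finding describes.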
    8. Next Generation Sequencing: Driving Question
       Can we predict which drug is most effective against specific tumors? How many other cancer types that I have processed have the same variation as the cancer type I am working on?
    9. Fairly Inaccurate Overview of Genetics Processing: A 2-Minute Oversimplification of a Really Hard Problem
    10. Fairly Inaccurate Overview of Genetics Processing: Sequencing
    11. Fairly Inaccurate Overview of Genetics Processing: Sequencing
    12. Fairly Inaccurate Overview of Genetics Processing: Alignment (HG19)
    13. Fairly Inaccurate Overview of Genetics Processing: Down Stream Processing (Variant) (HG19)
    14. Can I Process 88 Whole Human Genomes?
       Researcher: I would like to process 88 public genomic samples from cancer patients. They are Whole Human Genomes. Each patient has 2 genomic sequences, one from the tumor and one from a normal cell.
       Tech:
       • 200 GB raw uncompressed fastq per experiment
       • 176 Genome Pipelines to process
       • Each “pipeline” runs on an m1.xlarge
       • We ran 4 runs of ~3.5 days on 50 nodes
       • Total processed data in the pipeline may be 5X per experiment
       • Could expand to 10X or more for more complex pipelines
       • ~86 GB result average to save
       • Stored in S3 / Glacier
       • Totals:
         • ~171 TB Total Processed Storage
         • ~14,784 hours of processing
         • ~15 TB of results
       [Diagram: Elastic HPC Infrastructure. Labels: Scripts, programs, reference; Shared Storage; Compute; Amazon; StarCluster; Elastic Node Expansion; Local Storage; Processing; Result offload to S3; Transition to Glacier]
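The totals on this slide can be reproduced from the per-experiment figures. The sketch below is a reconstruction, not the author's worksheet: it assumes 1 TB = 1024 GB (the conversion that matches the slide's ~171 TB) and that each pipeline occupies one node for one full ~3.5-day run. The variable names are illustrative.

```javascript
// Figures taken from the slide.
const patients = 88;
const genomesPerPatient = 2;      // one tumor and one normal sample each
const rawGbPerGenome = 200;       // raw uncompressed fastq
const expansionFactor = 5;        // processed data may be 5X the raw input
const resultGbPerGenome = 86;     // average result size to save
const hoursPerPipeline = 3.5 * 24;

// Assumption of this sketch: binary terabytes.
const GB_PER_TB = 1024;

const pipelines = patients * genomesPerPatient;
const processedTb = (pipelines * rawGbPerGenome * expansionFactor) / GB_PER_TB;
const computeHours = pipelines * hoursPerPipeline;
const resultTb = (pipelines * resultGbPerGenome) / GB_PER_TB;

console.log({ pipelines, processedTb, computeHours, resultTb });
```

Under these assumptions the script gives 176 pipelines, 171.875 TB processed, 14,784 compute hours, and about 14.8 TB of results, in line with the slide's rounded totals. The 4 runs on 50 nodes provide 200 node slots, enough for the 176 pipelines.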
    15. A Possible Vision for Experiment Management
       • [Diagram labels, NGS Data: Explants; Tumors – FFPE; Tumors – fresh frozen; Cell lines; Expression; RNASeq; Variants; Amplicon DNASeq; Whole exome; Whole genome; Coding and non-coding variants; Coding variants]
       • [Diagram labels, outputs: Patient stratification; Biomarkers for prognosis, drug response, safety; Mechanism of drug action; Mechanism of disease; New Target ID]
       • [Diagram labels, platform: Inbound; Seven Bridges; GenePattern; Storage Partners; Big Data Store; Experiment Management / Metadata Management Services; Genome Upload / Curation; Pipeline Engines; Long Term Storage; Partner Integration; Big Data Storage and Analytics]
    16. Let's look at a Variant … Another Area Mongo May Help
    17. VCF Format

           ##fileformat=VCFv4.1
           ##fileDate=20090805
           ##source=myImputationProgramV3.1
           ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
           ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
           ##phasing=partial
           ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
           ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
           ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
           ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
           ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
           ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
           ##FILTER=<ID=q10,Description="Quality below 10">
           ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
           ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
           ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
           ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
           ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
           #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
           20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
           20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
           20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
           20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
           20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
    18. VCF as JSON: Header and Variant Information

           Header document:

           {
             "_id" : ObjectId("52617b613004b77f64efed67"),
             "phasing" : "partial",
             "fileformat" : "VCFv4.1",
             "fileDate" : "20090805",
             "source" : "myImputationProgramV3.1",
             "FORMAT" : { "Description" : "\"Haplotype Quality\"", "Type" : "Integer", "Number" : "2", "ID" : "HQ" },
             "__vcfid" : "40770f6f-165a-4930-8092-05e98e4e0b27",
             "contig" : { "species" : "\"Homo sapiens\"", "assembly" : "B36", "md5" : "f126cdf8a6e0c7f379d618ff66beb2da", "length" : "62435964", "ID" : "20", "taxonomy" : "x" },
             "INFO" : { "Description" : "\"HapMap2 membership\"", "Type" : "Flag", "Number" : "0", "ID" : "H2" },
             "reference" : "file:///seq/references/1000GenomesPilot-NCBI36.fasta",
             "FILTER" : { "Description" : "\"Less than 50% of samples have data\"", "ID" : "s50" }
           }

           Variant document:

           {
             "_id" : ObjectId("52617b613004b77f64efed62"),
             "ALT" : [ "A" ],
             "QUAL" : "29",
             "NA00001" : "0|0:48:1:51,51",
             "POS" : 14370,
             "NA00002" : "1|0:48:8:51,51",
             "FILTER" : "PASS",
             "CHROM" : "20",
             "NA00003" : "1/1:43:5:.,.",
             "FORMAT" : "GT:GQ:DP:HQ",
             "__vcfid" : "40770f6f-165a-4930-8092-05e98e4e0b27",
             "ID" : "rs6054257",
             "INFO" : { "DP" : "14", "AF" : "0.5", "NS" : "3" },
             "REF" : "G"
           }
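The deck does not show the converter that produced these documents, so the following is only a sketch of how one tab-separated VCF data line could be turned into a variant document of the same shape. The function name is hypothetical. As on the slide, `POS` becomes a number, `ALT` becomes an array, and other values stay strings. One deliberate difference, which is an assumption of this sketch: valueless `INFO` flags such as `DB` and `H2` are stored as `true`, whereas the slide's document omits them.

```javascript
// Convert one VCF data line into a document keyed by the #CHROM header columns.
// Assumption: both lines are tab-separated, as the VCF format specifies.
function parseVcfDataLine(headerLine, dataLine, vcfId) {
  const columns = headerLine.replace(/^#/, "").split("\t");
  const values = dataLine.split("\t");
  const doc = { __vcfid: vcfId }; // ties the variant back to its source file
  columns.forEach((name, i) => {
    doc[name] = values[i];
  });
  doc.POS = parseInt(doc.POS, 10); // numeric so range queries work
  doc.ALT = doc.ALT.split(",");    // one entry per alternate allele
  const info = {};
  for (const entry of doc.INFO.split(";")) {
    const [key, value] = entry.split("=");
    // Sketch-only choice: flags without a value are kept as true.
    info[key] = value === undefined ? true : value;
  }
  doc.INFO = info;
  return doc;
}

const header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003";
const line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.";
const variantDoc = parseVcfDataLine(header, line, "40770f6f-165a-4930-8092-05e98e4e0b27");
console.log(JSON.stringify(variantDoc, null, 2));
```

Keeping `POS` numeric and `ALT` as an array is what lets a range query with `$gte` / `$lt` and an equality match on a single alternate allele work against these documents.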
    19. Query: Search Variant Ranges

           // Here is our range definition
           var begin = 10000;
           var end = 10200;

           // The chromosome value is fuzzy in format, so we use a regex
           var chromosome = ".*17$";
           var variant = "A";

           // Query for range and chromosome
           db.publicvariants.find( {"POS":{$gte: begin, $lt: end}, "CHROM":{$regex : chromosome} })
           db.variants.find( {"POS":{$gte: begin, $lt: end}, "CHROM":{$regex : chromosome} })

           // Query for a specific variant in a range
           db.publicvariants.find( {"POS":{$gte: begin, $lt: end}, "CHROM":{$regex : chromosome}, "ALT":variant})
           db.variants.find( {"POS":{$gte: begin, $lt: end}, "CHROM":{$regex : chromosome}, "ALT":variant})
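The two query shapes on this slide differ only in the optional `ALT` clause, so they can be produced by one helper. The sketch below is illustrative: `buildVariantQuery` and `matchesVariantQuery` are hypothetical names, and the matcher is an in-memory stand-in that mimics only the `$gte`, `$lt`, `$regex`, and array-contains behaviour this filter relies on. It is not MongoDB's query engine, and the sample documents are invented.

```javascript
// Build the filter from the slide; alt is optional.
function buildVariantQuery(chromosomePattern, begin, end, alt) {
  const query = {
    POS: { $gte: begin, $lt: end },
    CHROM: { $regex: chromosomePattern },
  };
  if (alt !== undefined) query.ALT = alt;
  return query;
}

// In-memory stand-in for the server-side match, limited to this filter shape.
function matchesVariantQuery(doc, query) {
  const inRange = doc.POS >= query.POS.$gte && doc.POS < query.POS.$lt;
  const onChromosome = new RegExp(query.CHROM.$regex).test(doc.CHROM);
  // An equality match on an array field succeeds if any element is equal.
  const hasAlt = query.ALT === undefined || doc.ALT.includes(query.ALT);
  return inRange && onChromosome && hasAlt;
}

// Hypothetical sample documents.
const sampleVariants = [
  { CHROM: "chr17", POS: 10050, ALT: ["A"] },
  { CHROM: "17", POS: 10150, ALT: ["G", "T"] },
  { CHROM: "20", POS: 10100, ALT: ["A"] },
  { CHROM: "chr17", POS: 10200, ALT: ["A"] }, // excluded: end is exclusive
];

const rangeQuery = buildVariantQuery(".*17$", 10000, 10200);
const altQuery = buildVariantQuery(".*17$", 10000, 10200, "A");
const rangeHits = sampleVariants.filter((doc) => matchesVariantQuery(doc, rangeQuery));
const altHits = sampleVariants.filter((doc) => matchesVariantQuery(doc, altQuery));
console.log(rangeHits.length, altHits.length);
```

Against a real collection the same filter object would be passed straight to `find`, for example `db.variants.find(buildVariantQuery(".*17$", 10000, 10200, "A"))`.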
    20. Wrap Up and Panel
       • Panel
         • Deniz Kural: Founder and CEO – SevenBridges
       • Code:
       • Thanks
         • Todd Nelson, Rajan Desai
         • Sebastien Lefebvre, Robin Brouwer
         • Sara Dempster
    21. The Panel …
    22. Confidentiality Notice
       This file is private and may contain confidential and proprietary information. If you have received this file in error, please notify us and remove it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the contents of this file is not permitted and may be unlawful. AstraZeneca PLC, 2 Kingdom Street, London, W2 6BD, UK, T: +44 (0)20 7604 8000, F: +44 (0)20 7604 8151