
Insights from Building the Future of Drug Discovery with Apache Spark with Lukas Habegger


Human genetics holds the key to understanding the pathogenesis of many devastating diseases, such as type 2 diabetes and Alzheimer’s disease. The discovery, development, and commercialization of new classes of drugs can take 10-15 years and more than $5 billion in R&D investment, only for fewer than 5% of the drugs to make it to market. Committed to creating therapeutic innovations, Regeneron has built one of the world’s most comprehensive genetics databases to supplement our state-of-the-art drug development pipeline. While these massive volumes of data provide an unprecedented opportunity to gain novel therapeutic insights, Regeneron has encountered a number of challenges on the road to delivering on the promises of big data and genomics in drug discovery. For example, how do you enable fast and accurate queries over >80B data points? And how do you expedite novel statistical tests on TB-scale data?

This presentation will share Regeneron’s vision for building a scalable and performant informatics infrastructure to accelerate genetics-driven drug development. Specifically, we highlight key challenges in establishing the world’s largest clinical genetics databases, provide an overview of how Regeneron leverages Databricks’ Unified Analytics Platform and Apache Spark, and discuss in detail key engineering innovations that have already come out of this collaborative effort.


  1. Insights from Building the Future of Drug Discovery with Apache Spark. Lukas Habegger, Associate Director, Bioinformatics, Regeneron Genetics Center (RGC). #EntSAIS14
  2. Outline
     • Current state of drug discovery and development
     • Benefits of leveraging human genetics data
     • Overview of the Regeneron Genetics Center (RGC)
     • Challenges on the road to delivering on the promises of big data and genomics in drug discovery
     • Overview of how the RGC leverages Databricks’ Unified Analytics Platform and Apache Spark
     • Discussion of key engineering innovations
     • Conclusions & lessons learned
  3. Current state of drug discovery and development: Maximizing chances of success with human genetics
     • 95% of experimental medicines fail in development; costs exceed $2B per approved drug
     • >$100B spent on worldwide R&D by the biopharma industry → only 10–20 new drugs per year
     • Target bottleneck: <1,000 genes (<5% of all genes) account for the targets of all drugs currently in development
     • Higher probability of success for drugs with strong human genetics evidence
     • You cannot pursue modern drug discovery and development without incorporating human genetics
     Sources:
     – Herper M. Forbes.com. The Truly Staggering Cost of Inventing New Drugs. https://www.forbes.com/sites/matthewherper/2012/02/10/the-truly-staggering-cost-of-inventing-new-drugs/#355471a54a94. Feb. 10, 2012.
     – Herper M. Forbes.com. How the Staggering Cost of Inventing New Drugs Is Shaping the Future of Medicine. https://www.forbes.com/sites/matthewherper/2013/08/11/how-the-staggering-cost-of-inventing-new-drugs-is-shaping-the-future-of-medicine/#30f1a95113c3. Aug. 11, 2013.
     – Booth B. Forbes.com. A Billion Here, A Billion There: The Cost of Making a Drug Revisited. https://www.forbes.com/sites/brucebooth/2014/11/21/a-billion-here-a-billion-there-the-cost-of-making-a-drug-revisited/#6034e7f226a8. Nov. 21, 2014.
     – Nat Genet. 2015 Aug;47(8):856-60. doi: 10.1038/ng.3314.
     – Nat Rev Drug Discov. 2013 Aug;12(8):581-94. doi: 10.1038/nrd4051.
     – Nat Rev Drug Discov. 2017 Jan;16(1):19-34. doi: 10.1038/nrd.2016.230.
  4. Why is human genetics such a powerful tool for drug discovery?
     [Diagram labels: DNA mutation (example: A → T); impact on gene product: loss-of-function, neutral, gain-of-function; impact on disease: protective, neutral, damaging]
  7. Why is human genetics such a powerful tool for drug discovery?
     [Diagram labels: DNA mutation (example: A → T); drug; impact on gene product: loss-of-function, neutral, gain-of-function; impact on disease: protective, neutral, damaging]
  8. PCSK9: A success story where human genetics evidence played a key role in drug development
     • Loss-of-function mutations in PCSK9 protect against heart disease
     • Regeneron developed a drug to block PCSK9, which has been shown to be effective in preventing heart disease
     [Diagram labels: DNA mutation (example: A → T); drug; impact on gene product: loss-of-function, neutral, gain-of-function; impact on disease: protective, neutral, damaging]
  9. The goal of the RGC is to build one of the world’s largest genotype-phenotype resources
     • Regeneron has a long history of commitment to genetics-based science, and a track record of integrating human genetics into development programs, delivering new medicines to patients
     • Regeneron established the Regeneron Genetics Center (RGC) in 2014
     • Goal: build one of the world’s most comprehensive genetics databases to supplement our state-of-the-art drug development pipeline
     • To date, the RGC has sequenced DNA from more than 300,000 individuals
  10. Breadth of human genetics resources: RGC network of 60+ collaborators representing over 1 million samples
      [Figure labels: founder populations, phenotype-specific cohorts, family studies, general population]
  12. RGC collaboration with UK Biobank: RGC will sequence ~500K participants over 3-5 years
  13. Automation is key to enable large-scale data production and analysis
      • Automated biobank (1.4M samples)
      • Library preparation (>300,000 samples / year)
      • Sequencing instruments (>300,000 samples / year)
      • 100% cloud-based informatics & analysis
      A scalable informatics platform is needed to analyze this data and make it accessible to a broad set of users
  14. How do we analyze our data to gain novel insights? Approach and desired goal
      Approach:
      1. Sequence a large number of individuals to identify their mutations
      2. Obtain paired clinical data (traits derived from de-identified electronic medical records)
      3. Test for correlations/associations between all mutations and traits
      4. Mine association results in various ways to gain insights
      [Diagram, desired goal: Mutation Matrix (MM; individuals, mutations) and Trait Matrix (TM; individuals, traits) → analytical engine → Association Results (AR; mutation : trait)]
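Step 3 of this approach can be illustrated with a toy example. The sketch below is only an assumption about what an "analytical engine" might do at its simplest: it uses a plain Pearson correlation as a stand-in for the statistical tests, which the slides do not specify. The mutation IDs, trait values, and function names are all hypothetical.

```python
import math
from itertools import product


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so no measurable association
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: each vector covers the same six individuals, in the same order.
mutation_matrix = {
    "chr1:1000:A>T": [0, 1, 0, 2, 1, 0],  # alternate-allele count per individual
    "chr2:5000:G>C": [1, 0, 0, 0, 1, 0],
}
trait_matrix = {
    "ALT": [30.0, 22.0, 31.0, 15.0, 24.0, 29.0],
    "AST": [28.0, 25.0, 27.0, 20.0, 24.0, 30.0],
}

# Every mutation is tested against every trait: one result row per pair.
association_results = [
    {"mutation": m, "trait": t, "r": pearson_r(mutation_matrix[m], trait_matrix[t])}
    for m, t in product(mutation_matrix, trait_matrix)
]

for row in association_results:
    print(row)
```

A production analysis would typically use regression models with covariates rather than a raw correlation; the point here is only the shape of the computation, one result per mutation : trait pair.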
  15. How do we analyze our data to gain novel insights? It’s more complicated – lack of data unification
      • Data is decentralized and stored in different formats
      • Data is organized in different ways (e.g., not squared off, transposed, custom representations and indexing schemes)
      • Asking simple questions requires many time-consuming data wrangling steps
      [Diagram, desired goal vs. reality: in reality, MM, TM, and AR are held as separate files (pVCF, txt)]
  16. How do we analyze our data to gain novel insights? It’s more complicated – data from multiple cohorts
      • The RGC has data from multiple collaborators
      • Data is not always consistent
      • Limited functionality to unify / aggregate matrices from multiple cohorts
      [Diagram, desired goal vs. reality: in reality, each cohort has its own matrices (GT, MM, TM) and files (pVCF, txt)]
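To make "unify / aggregate matrices from multiple cohorts" concrete, here is a minimal sketch of squaring off genotype data from two cohorts that do not share the same set of mutations. The data layout, cohort names, and the `unify_cohorts` helper are assumptions for illustration, not the RGC's actual representation.

```python
def unify_cohorts(cohorts):
    """Merge per-cohort genotype maps into one squared-off matrix.

    cohorts: {cohort_name: {individual_id: {mutation_id: genotype}}}
    Returns (columns, rows): a sorted list of every mutation seen in any
    cohort, and one row per individual with None where there is no call.
    """
    columns = sorted(
        {m for cohort in cohorts.values() for calls in cohort.values() for m in calls}
    )
    rows = {}
    for cohort_name, cohort in cohorts.items():
        for individual, calls in cohort.items():
            # Prefix with the cohort name so IDs from different cohorts cannot collide.
            rows[f"{cohort_name}:{individual}"] = [calls.get(m) for m in columns]
    return columns, rows


# Hypothetical input: two cohorts with overlapping but different mutation sets.
cohorts = {
    "cohort_a": {"p1": {"chr1:1000:A>T": 1, "chr2:5000:G>C": 0}},
    "cohort_b": {"p1": {"chr1:1000:A>T": 2}, "p2": {"chr3:42:C>G": 1}},
}
columns, rows = unify_cohorts(cohorts)
print(columns)
for key, row in rows.items():
    print(key, row)
```

Missing calls are kept as `None` rather than 0 in this sketch, because a mutation that was never assayed in a cohort is not the same as a reference genotype.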
  17. How do we analyze our data to gain novel insights? It’s more complicated – scalability issues
      • Large inputs (MM & TM)
      • MM x TM cross join
      • Massive outputs (AR)
      [Diagram, desired goal vs. reality: scale labels of 10s of millions, 100s of billions, and 10s of thousands]
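The three scale labels on this slide are consistent with a cross join, assuming they refer to mutations (10s of millions), traits (10s of thousands), and association results (100s of billions). A back-of-the-envelope check, using the low end of each range and a made-up row size:

```python
# Assumed low-end counts taken from the slide's order-of-magnitude labels.
n_mutations = 10_000_000  # "10s of millions"
n_traits = 10_000         # "10s of thousands"

# A cross join produces one association result per (mutation, trait) pair.
n_results = n_mutations * n_traits

# Hypothetical row size, only to give a feel for storage; not from the slides.
bytes_per_result = 32
size_tb = n_results * bytes_per_result / 1e12

print(f"{n_results:,} association results")
print(f"about {size_tb:.1f} TB at {bytes_per_result} bytes per result")
```

Even at the low end the output is on the order of 100 billion rows, which is why the output table, rather than either input matrix, dominates the storage and query problem.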
  18. How do we find out what these mutations do? The Databricks solution
      • The RGC established a major partnership with Databricks in 2017
      • The RGC is leveraging the Databricks Unified Analytics Platform to create a unified data & compute infrastructure:
        1. Developed efficient and unified data representations
        2. Implemented scalable production workflows optimized for analyzing billions of rows
        3. Created a unified codebase to enable all levels of users to perform computation
      [Diagram: Mutation Matrix (MM; individuals, mutations) and Trait Matrix (TM; individuals, traits) → analytical engine → Association Results (AR; mutation : trait)]
  19. The RGC has developed easy-to-use web applications to make the data accessible to a broad set of users
      Goal: to enable everyone in the drug development process to easily access, analyze, and extract insights from the RGC’s data
      [Diagram, architecture of RGC web applications; labels as extracted: Web Application, Databricks Cluster, Query, Results, Queries Library; data: MM, TM, AR, analytical engine]
  20. The RGC Results Browser enables users to query billions of association results
      • Goal: Efficiently search billions of association results across multiple cohorts
      • The data set is updated when association results from a new cohort become available
      • Size of the current data set: >67 billion association results (>200 billion results for the next update)
  21. Optimizations to the ETL workflow have significantly reduced the time to ingest the association results
      • Association results are ingested and merged from multiple cohorts
      • Spark-based solution scales linearly with cluster size
        – Several optimizations have made the process more efficient
        – Migration of other QC processes into this workflow enables an end-to-end Spark solution
  22. Optimizing the partitioning scheme has significantly reduced the query response time
      • The input data is naturally organized by cohort; not query optimized
      • Optimizations reduced the query response time from >30 minutes to <3 seconds
      [Diagram: association results (AR) range partitioned by chromosomal location; labels: gene density, results density, chromosomal boundaries, partition density, variable range width & count]
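The idea of range partitions with variable width can be sketched in a few lines. This is not the RGC's implementation, which the slides do not show; it is a generic equal-count range partitioner, assuming rows are keyed by a hypothetical `(chromosome, position)` tuple. Dense regions get narrow ranges and sparse regions get wide ones, so every partition holds a similar number of rows and a location query touches only the partitions that overlap it.

```python
from bisect import bisect_right


def equal_count_boundaries(sorted_keys, n_partitions):
    """Choose cut points so each partition holds roughly the same number of rows."""
    n = len(sorted_keys)
    return [sorted_keys[(i * n) // n_partitions] for i in range(1, n_partitions)]


def partition_for(key, boundaries):
    """Index of the partition whose key range contains `key`."""
    return bisect_right(boundaries, key)


def partitions_for_range(lo, hi, boundaries):
    """Partitions that a query over the inclusive key range [lo, hi] must scan."""
    return list(range(partition_for(lo, boundaries), partition_for(hi, boundaries) + 1))


# Hypothetical (chromosome, position) keys: dense at the start, sparse afterwards.
keys = sorted([
    (1, 100), (1, 110), (1, 120), (1, 130),
    (1, 140), (1, 150), (1, 5000), (2, 9000),
])
boundaries = equal_count_boundaries(keys, n_partitions=4)

# Count how many rows land in each partition.
counts = [0] * 4
for key in keys:
    counts[partition_for(key, boundaries)] += 1

print("boundaries:", boundaries)
print("rows per partition:", counts)
print("partitions scanned for chr1:125-135:", partitions_for_range((1, 125), (1, 135), boundaries))
```

In Spark, a comparable layout can be produced with `repartitionByRange`, which samples the data to pick range boundaries; whether the RGC used that call is not stated in the slides.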
  23. Demo notebook: mining association results and extracting key insights
  24. The RGC has recently identified a new potential drug target for treating liver disease
      Source: https://endpts.com/the-pcsk9-of-nash-regeneron-and-alnylam-join-forces-to-tackle-a-promising-target-for-severe-liver-diseases/
  25. Liver disease can be detected based on enzyme levels in the blood
      • Two enzymes are typically analyzed to evaluate liver damage:
        – AST (aspartate transaminase)
        – ALT (alanine transaminase)
      • Elevated levels of AST and ALT are indicative of liver damage
        – Necessary but not sufficient
      • Goal: identify loss-of-function mutations that are associated with lower AST and ALT levels (protective effect)
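The goal on this slide amounts to a filter over the association results. The sketch below assumes a hypothetical record layout (`is_lof`, `beta` for effect direction, `p` for the p-value) and uses the conventional genome-wide significance threshold of 5e-8 as a default; neither the schema nor the threshold comes from the slides.

```python
def protective_lof_candidates(results, traits=("AST", "ALT"), p_max=5e-8):
    """Loss-of-function mutations significantly associated with LOWER levels of every trait."""
    hits = {}
    for r in results:
        if r["is_lof"] and r["trait"] in traits and r["beta"] < 0 and r["p"] < p_max:
            hits.setdefault(r["mutation"], set()).add(r["trait"])
    # Keep only mutations that pass the filter for all requested traits.
    return sorted(m for m, seen in hits.items() if seen == set(traits))


# Hypothetical association results; gene and mutation names are placeholders.
results = [
    {"mutation": "GENE_A:lof_1", "is_lof": True, "trait": "AST", "beta": -0.10, "p": 1e-12},
    {"mutation": "GENE_A:lof_1", "is_lof": True, "trait": "ALT", "beta": -0.08, "p": 3e-10},
    {"mutation": "GENE_B:lof_1", "is_lof": True, "trait": "AST", "beta": -0.05, "p": 2e-9},
    {"mutation": "GENE_B:lof_1", "is_lof": True, "trait": "ALT", "beta": 0.02, "p": 0.4},
    {"mutation": "GENE_C:missense_1", "is_lof": False, "trait": "AST", "beta": -0.20, "p": 1e-15},
    {"mutation": "GENE_C:missense_1", "is_lof": False, "trait": "ALT", "beta": -0.18, "p": 1e-14},
]

print(protective_lof_candidates(results))
```

Only `GENE_A:lof_1` passes: `GENE_B:lof_1` lowers AST but not ALT, and `GENE_C:missense_1` is not a loss-of-function mutation.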
  26. Manhattan plot for AST: Several mutations in the genome are associated with this liver trait
      What peak / mutation is the most interesting?
  27. Manhattan plot for AST: Several mutations in the genome are associated with this liver trait
      What peak / mutation is the most interesting? HSD17B13
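A Manhattan plot places each mutation at its genomic position on the x-axis and at -log10(p) on the y-axis, so the strongest associations appear as the tallest peaks. The sketch below computes those plot coordinates and the lead signal per chromosome; the positions and p-values are made up, and 5e-8 is the conventional genome-wide significance threshold, not a value taken from the slides.

```python
import math

GENOME_WIDE_P = 5e-8  # conventional threshold; an assumption here


def manhattan_points(results):
    """(chromosome, position, -log10 p) triples, i.e. the coordinates of the plot."""
    return [(r["chrom"], r["pos"], -math.log10(r["p"])) for r in results]


def lead_signals(results, p_max=GENOME_WIDE_P):
    """Most significant result per chromosome, among those below the threshold."""
    best = {}
    for r in results:
        if r["p"] < p_max and (r["chrom"] not in best or r["p"] < best[r["chrom"]]["p"]):
            best[r["chrom"]] = r
    return best


# Hypothetical association results for a single trait.
results = [
    {"chrom": 1, "pos": 1_000, "p": 0.2},
    {"chrom": 4, "pos": 88_000, "p": 1e-12},
    {"chrom": 4, "pos": 88_500, "p": 1e-9},
    {"chrom": 22, "pos": 44_000, "p": 1e-10},
]

for chrom, pos, height in manhattan_points(results):
    print(chrom, pos, round(height, 2))
print(lead_signals(results))
```

Which peak is "most interesting" is a scientific judgment that goes beyond peak height, for example whether the mutation is loss-of-function and whether the effect is protective.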
  29. • The mutation of interest is associated with a broad spectrum of liver disease traits
      • All of these associations confer protection from liver disease
  30. Conclusions & lessons learned
      • At Regeneron our goal is to bring the power of science to medicine and develop new medicines for patients in need
      • Incorporating human genetics evidence is critical for pursuing modern drug discovery; the RGC is building one of the world’s largest genetics databases to identify new potential drug targets
      • Our strategic partnership with Databricks has enabled us to build a state-of-the-art data science platform from scratch by:
        – Developing efficient and unified data representations
        – Building out scalable workflows to mine billions of rows and addressing key bottlenecks (e.g., reducing the ETL time from weeks to hours and optimizing the query response time to <3s)
        – Creating a unified codebase to enable all levels of users to perform computation
      • Most importantly, the Databricks Unified Analytics Platform brings our data, tools, and people together to accelerate innovation
  31. Acknowledgements
      • RGC-LT: Alan Shuldiner, Aris Baras, Aris Economides, Jeffrey Reid, John Overton
      • RGC-GI: Alicia Hawes, Ashish Yadav, Claire Chai, Evan Maxwell, Gisu Eom, Jeff Staples, John Penn, Leland Barnard, Shareef Khalid, Sheldon Bai, Suganthi Balasubramanian, Young Hahn
      • RGC: Alexander Li, Alexander Lopez, Amy Damask, Charlie Paulding, Claudia Schurmann, Colm O’Dushlaine, Cristopher Van Hout, Dylan Sun, Jan Freudenberg, Kavita Praveen, Kia Manoochehri, Lauren Gurski, Manasi Pradhan, Mike Norsen, Nehal Gosalia, Nila Banerjee, Rick Ulloa, Shane McCarthy, Tanya Teslovich Dostal, Tony Marcketta
      • Databricks: Ali Ghodsi, Ali Hodroj, Allan Marcos, Ambareesh Kulkarni, Bavesh Patel, Christopher Hoshino-Fish, David Weaver, Francis Gerace, Hossein Falaki, Ion Stoica, Juliusz Sompolski, Li Yu, Navid Bazzazzadeh, Paris Georgallis, Ram Sriharsha, Ronak Shah, Shiva Bhattacharjee, Vida Ha, Yongsheng Huang
      • REGN-IT: Abdul Shaik, Allen Chiang, Brandon Fetch, Christopher McCabe, Dale Cochran, David Glosser, Long Le, Michael Phillips, Mohammad Saeed, Pat Leblanc, Sal Mineo, Shaw Nawaz, Shiva Ravi, Stephen Huvane, Vin Dahake, Weylin Preodor
  32. Questions? We are hiring! https://tinyurl.com/yaqwl2bt
