Hadoop ecosystem for health/life sciences


Published on

Overview of using the Hadoop ecosystem for life sciences/health use cases

Published in: Technology
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Already mature technologies at this point.DB community thought it was silly.Non-Google were not yet at this scale.Google not in the business of releasing infrastructure software. They sell ads.
  • Mostly through the Apache Software Foundation
  • Talk HDFS and MapReduce.Then some other tools.
  • Large blocksBlocks replicated around
  • Two functions required.
  • Only need to supply 2 functions
  • Need to be careful because you can DDoS your database.
  • Log scale.
  • Define ETL
  • Define ETL
  • Volume,Variety, Velocity
  • Software: Cloudera Enterprise – The Platform for Big DataA complete data management solution powered by Apache HadoopA collection of open source projects form the foundation of the platformCloudera has wrapped the open source core with additional software for system and data management as well as technical support5 Attributes of Cloudera Enterprise:ScalableStorage and compute in a single system – brings computation to data (rather than the other way around)Scale capacity and performance linearly – just add nodesProven at massive scale – tens of PB of data, millions of usersFlexibleStore any type of dataStructured, unstructured, semi-structuredIn it’s native format – no conversion requiredNo loss of data fidelity due to ETLFluid structuringNo single model or schema that the data must conform toDetermine how you want to look at data at the time you ask the question – if the attribute exists in the raw data, you can query against itAlter structure to optimize query performance as desired (not required) – multiple open source file formats like Avro, ParquetMultiple forms of computationBring different tools to bear on the data, depending on your skillset and what you want to doBatch processing – MapReduce, Hive, Pig, JavaInteractive SQL – Impala, BI toolsInteractive Search – for non-technical users, or helping to identify datasets for further analysisMachine learning – apply algorithms to large datasets using libraries like Apache MahoutMath – tools like SAS and R for data scientists and statisticiansMore to come…Cost-EffectiveScale out on inexpensive, industry standard hardware (vs. highly tuned, specialized hardware)Fault tolerance built-inLeverage cost structures with existing vendorsReduced data movement – can perform more operations in a single place due to flexible toolingFewer redundant copies of dataLess time spent migrating/managingOpen source software is easy acquire and prove the value/ROIOpenRapid innovationLarge development communitiesThe most talented engineers from across the worldEasy to acquire and prove valueFree to download and deployDemonstrate the value of the technology before you make a large-scale investmentNo vendor lock-in – choose your vendor based solely on meritCloudera’s open source strategyIf it stores or processes data, it’s open sourceBig commitment to open sourceLeading contributor to the Apache Hadoop ecosystem – defining the future of the platform together with the communityIntegratedWorks with all your existing investmentsDatabases and data warehousesAnalytics and BI solutionsETL toolsPlatforms and operating systemsHardware and networking equipmentOver 700 partners including all of the leaders in the market segments aboveComplements those investments by allowing you to align data and processes to the right solution
  • Hadoop ecosystem for health/life sciences

    1. 1. 1 Hadoop ecosystem for life sciences Uri Laserson 30 September 2013
    2. 2. About the speaker • Currently “Data Scientist” at Cloudera • PhD in Biomedical Engineering at MIT/Harvard (2005-2012) • Focused on next-generation DNA sequencing technology in George Church’s lab • Co-founded Good Start Genetics (2007-) • First application of next-gen sequencing to genetic carrier screening • laserson@cloudera.com 2
    3. 3. Agenda • Historical context • Introduction to Hadoop ecosystem • Genomics on Hadoop • Other use cases in life sciences 3
    4. 4. 4 Historical Context
    5. 5. 5
    6. 6. Indexing the Web • Web is Huge • Hundreds of millions of pages in 1999 • How do you index it? • Crawl all the pages • Rank pages based on relevance metrics • Build search index of keywords to pages • Do it in real time! 6
    7. 7. 7
    8. 8. Databases in 1999 1. Buy a really big machine 2. Install expensive DBMS on it 3. Point your workload at it 4. Hope it doesn’t fail 5. Ambitious: buy another big machine as backup 8
    9. 9. 9
    10. 10. Database Limitations • Didn’t scale horizontally • High marginal cost ($$$) • No real fault-tolerance story • Vendor lock-in ($$$) • SQL unsuited for search ranking • Complex analysis (PageRank) • Unstructured data 10
    11. 11. 11
    12. 12. Google does something different • Designed their own storage and processing infrastructure • Google File System (GFS) and MapReduce (MR) • Goals: KISS • Cheap • Scalable • Reliable 12
    13. 13. Google does something different • It worked! • Powered Google Search for many years • General framework for large-scale batch computation tasks • Still used internally at Google to this day 13
    14. 14. Google benevolent enough to publish 14 2003 2004
    15. 15. Birth of Hadoop at Yahoo! • 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR. • 2006: Spun out as Apache Hadoop • Named after Doug’s son’s yellow stuffed elephant 15
    16. 16. Industry strategy: Copy Google 16 Google Open-source Function GFS HDFS Distributed file system MapReduce MapReduce Batch distributed data processing Bigtable HBase Distributed DB/key-value store Protobuf/Stubby Thrift or Avro Data serialization/RPC Pregel Giraph Distributed graph processing Dremel/F1 Cloudera Impala Scalable interactive SQL (MPP) FlumeJava Crunch Abstracted data pipelines on Hadoop Hadoop
    17. 17. 17 Overview of core technology
    18. 18. HDFS design assumptions • Based on Google File System • Files are large (GBs to TBs) • Failures are common • Massive scale means failures very likely • Disk, node, or network failures • Accesses are large and sequential • Files are append-only 18
    19. 19. HDFS properties • Fault-tolerant • Gracefully responds to node/disk/network failures • Horizontally scalable • Low marginal cost • High-bandwidth 19 1 2 3 4 5 2 4 5 1 2 5 1 3 4 2 3 5 1 3 4 Input File HDFS storage distribution Node A Node B Node C Node D Node E
    20. 20. MapReduce computation 20
    21. 21. MapReduce • Structured as 1. Embarrassingly parallel “map stage” 2. Cluster-wide distributed sort (“shuffle”) 3. Aggregation “reduce stage” • Data-locality: process the data where it is stored • Fault-tolerance: failed tasks automatically detected and restarted • Schema-on-read: data must not be stored conforming to rigid schema 21
    22. 22. WordCount example 22
    23. 23. HPC separates compute from storage 23 Storage infrastructure Compute cluster • Proprietary, distributed file system • Expensive • High-performance hardware • Low failure rate • Expensive Big network pipe ($$$) User typically works by manually submitting jobs to scheduler e.g., LSF, Grid Engine, etc. HPC is about compute. Hadoop is about data.
    24. 24. Hadoop colocates compute and storage 24 Compute cluster Storage infrastructure • Commodity hardware • Data-locality • Reduced networking needs User typically works by manually submitting jobs to scheduler e.g., LSF, Grid Engine, etc. HPC is about compute. Hadoop is about data.
    25. 25. HPC is lower-level than Hadoop • HPC only exposes job scheduling • Parallelization typically occurs through MPI • Very low-level communication primitives • Difficult to horizontally scale by simply adding nodes • Large data sets must be manually split • Failures must be dealt with manually • Hadoop has fault-tolerance, data locality, horizontal scalability 25
    26. 26. Sqoop 26 Bidirectional data transfer between Hadoop and almost any SQL database with a JDBC driver
    27. 27. Flume 27 A streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc. Client Client Client Client Agent Agent Agent
    28. 28. Cloudera Impala 28 Modern MPP database built on top of HDFS Designed for interactive queries on terabyte-scale data sets.
    29. 29. Cloudera Search 29 • Interactive search queries on top of HDFS • Built on Solr and SolrCloud • Near-realtime indexing of new documents
    30. 30. Benefits of Hadoop ecosystem • Inexpensive commodity compute/storage • Tolerates random hardware failure • Decreased need for high-bandwidth network pipes • Co-locate compute and storage • Exploit data locality • Simple horizontal scalability by adding nodes • MapReduce jobs effectively guaranteed to scale • Fault-tolerance/replication built-in. Data is durable • Large ecosystem of tools • Flexible data storage. Schema-on-read. Unstructured data. 30
    31. 31. 31 Scaling Genomics
    32. 32. 32
    33. 33. NCBI Sequence Read Archive (SRA) 33 Today… 1.14 petabytes One year ago… 609 terabytes
    34. 34. Every ‘ome has a -seq 34 Genome DNA-seq Transcriptome RNA-seq FRT-seq NET-seq Methylome Bisulfite-seq Immunome Immune-seq Proteome PhIP-seq Bind-n-seq
    35. 35. Genomics ETL 35 GATK best practices
    36. 36. Genomics ETL 36 .fastq .bam .vcf short read alignment genotype calling • Short read alignment is embarrassingly parallel • Pileup/variant calling requires distributed sort • GATK is a reimplementation of MapReduce; could run on Hadoop • Already available Hadoop tools • Crossbow: short read alignment/variant calling • Hadoop-BAM: distributed bamtools • BioPig: manipulating large fasta/q • SEAL: Hadoop-enabled BWA • Contrail: de-novo assembly
    37. 37. Use case 1: Scaling a genome center pipeline • Currently at 5k genomes (150 TB incl. raw), looking to scale to 25k now (1 PB) and eventually 100k (requiring 4 PB) • Current throughput • >1300 samples per month • >12 TB raw data per month • Data ultimately served from MySQL database • 750 GB of processed variant data • 25k genomes requires >3.5 TB in MySQL • Complex 4-tier storage system, including tape, filer, and RDMBS 37
    38. 38. Use case 1: Scaling a genome center pipeline • Database serves population genetics applications and case/control studies • Unify all data processing into HDFS • Replace MySQL with Impala on Hadoop for increased scalability • Possibly move raw data processing into MapReduce 38
    39. 39. Use case 2: Querying large, integrated data sets • Biotech client has thousands of genomes • Want to expose ad hoc querying functionality on large scale • e.g., vcftools/PLINK-SEQ on terabyte-scale data sets • Integrating data with public data sets (e.g., ENCODE, UCSC browser) • Terabyte-scale annotation sets • Currently, these capabilities (e.g., data joins) are often manually implemented 39
    40. 40. Use case 2: Querying large, integrated data sets • Hadoop allows all data to be centrally stored and accessible • Impala exposes a SQL query interface to data sets in Hadoop 40
    41. 41. Variant-filtering example • “Give me all SNPs that are: • on chromosome 5 • absent from dbSNP • present in COSMIC • observed in breast cancer samples • absent from prostate cancer samples • overlap a DNase hypersensitivity site • overlap a ChIP-seq site for a particular TF” • On full 1000 genome data set (~37 billion variants), query finishes in a couple seconds 41
    42. 42. All-vs-all eQTL • Possible to generate trillions of hypothesis tests • 107 loci x 104 phenotypes x 10s of tissues = 1012 p-values • Tested below on 120 billion associations • Example queries: • “Given 5 genes of interest, find top 20 most significant eQTLs (cis and/or trans)” • Finishes in several seconds • “Find all cis-eQTLs across the entire genome” • Finishes in a couple of minutes • Limited by disk throughput 42
    43. 43. All-vs-all eQTL • “Find all SNPs that are: • in LD with some lead SNP or eQTL of interest • align with some functional annotation of interest” • Still in testing, but likely finishes in seconds 43 Schaub et al, Genome Research, 2012
    44. 44. Genomics summary • ETL (raw data to analysis-ready data) • Data integration • e.g., interactively queryable UCSC genome browser • De novo assembly • NLP on scientific literature 44
    45. 45. 45 Clinical data Manufacturing Other use cases
    46. 46. Use case 3: Clinical document queries for EHR company • EHR wants to expose query functionality to clinicians • >16 million clinical documents with free text; processed through NLP pipeline • >500 million lab results • Perform subject expansion on search queries via ontologies • e.g., “myocardial infarction” will match “heart disease” • Search functionality implemented with Lucene (serving) on top of Hbase (processing/storage/indexing) 46
    47. 47. Use case 3: Clinical document queries for EHR company • Interested in recommendation engine-enabled queries, like: • Clinician searches “diabetes” and has relevant lab results already highlighted when opening a patient’s record • Clinician wants to know what other conditions might be correlated with a finding of interest 47
    48. 48. Use case 3: Clinical document queries for EHR company 48 “Find other patients similar to mine” • The Stanford system is limited to search • Recommendation engines allow a button “find similar”
    49. 49. Use case 4: Insurance company • Data from 30 different EHRs across multiple business units • High variance in ICD9 coding between locales. • Use NLP and machine learning to improve ICD9 coding to reduce variance in diagnosis 49
    50. 50. Use case 5: Pharma company variance in yields • Pharma company performs large batch fermentations of their product • Find high levels of variance in their yield • Fermentations are automated and highly instrumented • e.g., dissolved oxygen, nutrients, COAs, temperature, etc. • Perform time series analysis on fermentation runs to predict yields and determine which variables control variance. 50
    51. 51. Use case 6: AgTech company integrating data sources • Multiple reference genome sequences • Genotyping on thousands of samples • Weather data • Soil data • Microbiome data • Yield data • Geo data • All integrated in HBase 51
    52. 52. Use case 6: AgTech company integrating data sources • Can increase crop yields ~15% by “printing” seeds onto a field • Support search queries by name, ontology concepts, protein families, creation dates, assembly/chromsome positions, SNPs • Import any annotation data in CSV/GFF • Integration with cloning tools • Supports a web front-end for easy access 52
    53. 53. 53 Conclusions
    54. 54. Highly heterogeneous data 5 4 COMMUNICATIONS Location- based advertising HEALTH CARE Patient sensors, monitoring, EHRs Quality of care LAW ENFORCEMENT & DEFENSE Threat analysis, Social media monitoring, Photo analysis EDUCATION & RESEARCH Experiment sensor analysis FINANCIAL SERVICES Risk & portfolio analysis New products ON-LINE ERVICES / SOCIAL MEDIA People & career matching Website optimization UTILITIES Smart Meter analysis for network capacity CONSUMER PACKAGED GOODS Sentiment analysis of what’s hot, customer service MEDIA / ENTERTAINMENT Viewers / advertising effectiveness TRAVEL & TRANSPORTATION Sensor analysis for optimal traffic flows Customer sentiment LIFE SCIENCES Clinical trials Genomics RETAIL Consumer sentiment Optimized marketing AUTOMOTIVE Auto sensors reporting location, problems HIGH TECH / INDUSTRIAL MFG. Mfg. quality Warranty analysis OIL & GAS Drilling exploration sensor analysis ©2013 Cloudera, Inc. All Rights Reserved.
    55. 55. Flexibility • Store any data • Run any analysis and processing • Keeps pace with the rate of change of incoming data Scalability • Proven growth to PBs/1,000s of nodes • No need to rewrite queries, automatically scales • Keeps pace with the rate of growth of incoming data Efficiency • Cost per TB at a fraction of other options • Keep all of your data alive in an active archive • Powering the data beats algorithm movement The Cloudera Enterprise Platform for Big Data 55 ©2013 Cloudera, Inc. All Rights Reserved.
    56. 56. 56 Cloudera Hadoop Stack
    57. 57. 57