
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China

Genomics, Genealogy, and Biometrics / UID use cases for Hadoop and HBase.

Published in: Science, Technology, Business

  1. 1. © 2014 MapR Technologies 1 Genomics Use Cases @ MapR
  2. 2. © 2014 MapR Technologies 2 DNA Sequencing Company
  3. 3. © 2014 MapR Technologies 3 Parallelize Primary Analytics: .fastq → short read alignment → reads & mappings → genotype calling → .vcf
  4. 4. © 2014 MapR Technologies 4 Sequence Analysis, Quick Overview — fragments ([…] G A C T A G A; A C A G T T T A C A; A G A T A - - A G A; A A C A G C T T A C A […]; C T A T A G A T A A) aligned against the reference DNA ([…] G A T T A C A G A T T A C A G A T T A C A […]) to call the sample DNA ([…] G A C T A C A G A T A A C A G A T T A C A […])
  5. 5. © 2014 MapR Technologies 5 What is the (Probable) Color of Each Column?
  6. 6. © 2014 MapR Technologies 6 Which Columns are (probably) Not White? Strategy 1: examine foreach column, foreach row — O(rows*cols) ops + O(1 col) memory
  7. 7. © 2014 MapR Technologies 7 Which Columns are (probably) Not White? Strategy 2: examine foreach row, keep running tallies — O(rows) ops + O(rows*cols) memory
  8. 8. © 2014 MapR Technologies 8 Which Columns are (probably) Not White? Strategy 3: rotate matrix, examine foreach column — O(rows log rows) + O(cols) ops + O(1 col) memory
  9. 9. © 2014 MapR Technologies 9 Comparison of Strategies — Strategy 1: low mem req; random access pattern, many ops — O(rows*cols) + O(1 col) memory. Strategy 2: high mem req; sequential access pattern — O(rows) + O(rows*cols) memory. Strategy 3: low mem req; sequential access pattern; requires sort — O(rows log rows) + O(cols) + O(1 col) memory
  10. 10. © 2014 MapR Technologies 10 Comparison of Strategies — Strategy 1: low mem req; random access pattern, many ops — O(rows*cols) + O(1 col) memory. Strategy 2: high mem req; sequential access pattern — O(rows) + O(rows*cols) memory. Strategy 3: low mem req; sequential access pattern; requires sort — (O(rows log rows) + O(cols)) ÷ shards + O(1 col) memory. As the # of rows & columns increases, Strategy 3 becomes more attractive.
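To make the tradeoffs concrete: a minimal Python sketch (mine, not from the deck) of Strategies 1 and 2 on a small in-memory matrix. Once rows live on disk in a distributed store, Strategy 1's column-at-a-time probing becomes random access, which is why Strategy 3's sort-then-scan — the shape of a MapReduce shuffle — wins at scale.

    # Strategy 1: examine each column in turn. O(rows*cols) ops, only one
    # column of state, but random access if rows are stored contiguously.
    def strategy1(matrix, n_cols):
        return [c for c in range(n_cols)
                if any(row[c] != 'white' for row in matrix)]

    # Strategy 2: one sequential pass over the rows, keeping running
    # tallies. Sequential access, but the tally state stays in memory.
    def strategy2(matrix, n_cols):
        seen = [False] * n_cols
        for row in matrix:                      # sequential access pattern
            for c, cell in enumerate(row):
                if cell != 'white':
                    seen[c] = True
        return [c for c, hit in enumerate(seen) if hit]

    matrix = [['white', 'red',   'white'],
              ['white', 'white', 'white'],
              ['white', 'red',   'blue']]
    assert strategy1(matrix, 3) == strategy2(matrix, 3) == [1, 2]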
  11. 11. © 2014 MapR Technologies 11 Primary Sequence Analysis (ETL), MapReduce style: .fastq → short read alignment (MAP) → .bam → rotate matrix 90º (REDUCE, O(mn) + O(n log n)) → genotype calling (MAP) → .vcf
  12. 12. © 2014 MapR Technologies 12 Clinical Applications: Performance Matters — DNA Sequencers stream Raw DNA over NFS into the MapR Filesystem; 1º Analytics turns Raw DNA into SNP calls for Static Clinical Reporting (Physician ↔ Patient); ETL into Reference DBs and the SNP DB feeds 2º Analytics (Researcher ↔ Subject)
  13. 13. © 2014 MapR Technologies 13 Variant Collection Enables Downstream Apps • GWAS Association Studies • Versioned, Personalized Medicine • Companion Diagnostics — SNP DB → 2º Analytics → New Markets; more linear algebra [Spark, Summingbird, Lambda Architecture Slides]
  14. 14. © 2014 MapR Technologies 14 The Post-Sequencing Genomics Workload Sboner, et al, 2011. The real cost of sequencing: higher than you think!
  15. 15. © 2014 MapR Technologies 15 Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc
  16. 16. © 2014 MapR Technologies 16 Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc • In context of, e.g. – ε1: racial, etc. background – ε2: experimental design-specific concerns (e.g. familial IBD/IBS) – ε3: environmental factors and penetrance – ε4: assay-specific biases and noise. At risk of over-simplifying as a business-level concept: phenotype = α·genotype + β + ε1 + ε2 + ε3 + ε4
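Read as a plain linear model, the slide's equation can be fit one SNP at a time; a toy sketch with simulated data and ordinary least squares (illustrative only — real GWAS tooling models the ε terms as covariates rather than folding them into noise):

    import numpy as np

    rng = np.random.default_rng(0)
    genotype = rng.integers(0, 3, size=1000)        # minor-allele counts 0/1/2
    phenotype = 0.4 * genotype + 2.0 + rng.normal(0, 1, size=1000)

    # phenotype = alpha*genotype + beta + noise, solved by least squares
    X = np.column_stack([genotype, np.ones_like(genotype)])
    (alpha, beta), *_ = np.linalg.lstsq(X, phenotype, rcond=None)
    print(f"alpha ~ {alpha:.2f}, beta ~ {beta:.2f}")  # recovers ~0.4 and ~2.0

Doing this across millions of SNPs, phenotypes, and contexts is exactly the combinatorial explosion the next slide warns about.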
  17. 17. © 2014 MapR Technologies 17 HUGE PROBLEM COMBINATORIAL EXPLOSION
  18. 18. © 2014 MapR Technologies 18 What’s a Percolator? • Google Percolator – “Caffeine” update 2010 • Iterative, incremental prioritized updates • No batch processing • Decouple computational results from data size Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
  19. 19. © 2014 MapR Technologies 19 Solution: Percolate — SNPs, experimental groupings, assay technologies, assayed phenotypes, and annotations/ontologies are denormalized and percolated: (re)prioritize & (re)process → service queries → drive dashboards → create reports → denormalize for display → buffer → new models
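A toy sketch of the percolator pattern (hypothetical API — a dict and callbacks stand in for Percolator's distributed table and notifications): each write triggers incremental re-processing of only the rows it touches, so work tracks the update rate, not the table size.

    from collections import defaultdict

    table = defaultdict(list)        # snp_id -> observations
    observers = []                   # notification hooks, percolator-style

    def on_update(fn):
        observers.append(fn)
        return fn

    def write(snp_id, observation):
        table[snp_id].append(observation)
        for fn in observers:         # notify instead of re-running a batch
            fn(snp_id)

    @on_update
    def reprioritize(snp_id):
        score = len(table[snp_id])   # stand-in for a real priority model
        print(f"re-ranked {snp_id}: {score} supporting observations")

    write("rs12345", {"phenotype": "T2D", "p": 1e-6})
    write("rs12345", {"phenotype": "T2D", "p": 3e-7})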
  20. 20. © 2014 MapR Technologies 20 Robot Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  21. 21. © 2014 MapR Technologies 21 Robot (Data?) Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  22. 22. © 2014 MapR Technologies 22 Genealogy Company — slides credit: Bill Yetman, Hadoop Summit 2014 http://slidesha.re/1vRh3kY
  23. 23. © 2014 MapR Technologies 23 GERMLINE is… • …an algorithm that finds hidden relationships within a pool of DNA • …the reference implementation of that algorithm, written in C++ • You can find it here: http://www1.cs.columbia.edu/~gusev/germline/
  24. 24. © 2014 MapR Technologies 24 Projected GERMLINE run times (in hours) — [chart: hours (0–700) vs. samples (2,500–122,500); run times grow super-linearly, reaching ~700 hours = 29+ days] EXPONENTIAL COMPLEXITY
  25. 25. © 2014 MapR Technologies 25 GERMLINE: What's the Problem? • GERMLINE (the implementation) was not meant to be used in an industrial setting – Stateless, single threaded, prone to swapping (heavy memory usage) – GERMLINE performs poorly on large data sets • Our metrics predicted exactly where the process would slow to a crawl • Put simply: GERMLINE couldn't scale
  26. 26. © 2014 MapR Technologies 26 Run times for matching (in hours) — [chart: hours (0–180) vs. samples; projected GERMLINE run times remain EXPONENTIAL, while Jermline run times after the HBase refactor are LINEAR and nearly flat]
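As best I understand the refactor (the schema details are in the slides referenced on the next page), the linear behavior comes from indexing fixed-width genotype segments by content, so a new sample is compared only against samples that already share a segment instead of against everyone. A toy sketch, with a plain dict standing in for the HBase table:

    WORD = 16  # bases per indexed segment (illustrative width)

    word_index = {}   # (position, segment) -> ids of samples carrying it

    def add_sample(sample_id, sequence):
        """Index the sample; return candidate matches seen so far."""
        candidates = set()
        for pos in range(0, len(sequence) - WORD + 1, WORD):
            key = (pos, sequence[pos:pos + WORD])
            hits = word_index.setdefault(key, set())
            candidates |= hits        # only samples sharing this segment
            hits.add(sample_id)
        return candidates

    add_sample("s1", "GATTACAGATTACAGATTACAGATTACAGATT")
    print(add_sample("s2", "GATTACAGATTACAGATTACAGATTACAGATT"))  # {'s1'}

Per new sample, the work is proportional to genome length plus the number of true candidates, which is what flattens the curve.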
  27. 27. © 2014 MapR Technologies 27 • Paper submitted describing the implementation • Releasing as an Open Source project soon • [HBase Schema/Algorithm Slides]
  28. 28. © 2014 MapR Technologies 28 Further Growth & Optimization
  29. 29. © 2014 MapR Technologies 29 Underdog (Strand Phasing) performance – went from 12 hours to process 1,000 samples to under 25 minutes with a MapReduce implementation, with improved accuracy! Underdog replaces Beagle. [chart: total Beagle→Underdog phasing duration vs. total run size, 0–80,000 samples]
  30. 30. © 2014 MapR Technologies 30 Pipeline steps and incremental change… – Incremental change over time – Supporting the business in a "just in time" Agile way. [chart: per-step pipeline durations vs. cumulative samples (500 to ~395,000); steps: Pipeline Initialization, Plink Prep, Admixture, Beagle Post Phasing, Germline→Jermline Processing, Germline→Jermline Results Processing, Relationship Processing, Finalize, Beagle→Underdog Phasing; milestones: Jermline replaces Germline, Ethnicity V2 Release, Underdog Replaces Beagle, AdMixture on Hadoop]
  31. 31. © 2014 MapR Technologies 31 …while the business continues to grow rapidly. [chart: DNA database size, # of processed samples, Jan-12 through Apr-14, from near zero to ~400,000+]
  32. 32. © 2014 MapR Technologies 32 BigData App Development Lifecycle
  33. 33. © 2014 MapR Technologies 33 BigData App Development Lifecycle: input (1M rows) → tail | grep | sort | uniq -c → output
  34. 34. © 2014 MapR Technologies 34 Evolution of Data Storage — [chart axes: Functionality/Compatibility vs. Scalability] Linux POSIX: over decades of progress, Unix-based systems have set the standard for compatibility and functionality
  35. 35. © 2014 MapR Technologies 35 BigData App Development Lifecycle: input (1M rows) → tail | grep | sort | uniq -c → output
  36. 36. © 2014 MapR Technologies 36 BigData App Development Lifecycle: input (1M rows → 1B rows) → tail | grep | sort | uniq -c → output
  37. 37. © 2014 MapR Technologies 37 BigData App Development Lifecycle: input (1M rows → 1B rows → 1T rows) → tail | grep | sort | uniq -c → output
  38. 38. © 2014 MapR Technologies 38 Evolution of Data Storage — [chart axes: Functionality/Compatibility vs. Scalability] Hadoop achieves much higher scalability by trading away essentially all of this compatibility
  39. 39. © 2014 MapR Technologies 39 BigData App Development Lifecycle: input (1T rows) → tail | grep | sort | uniq -c → output, vs. Port to BigData Tools ($$$$): input (1T rows) → output
  40. 40. © 2014 MapR Technologies 40 Evolution of Data Storage — [chart axes: Functionality/Compatibility vs. Scalability] MapR enhances Apache Hadoop by restoring the compatibility while increasing scalability and performance
  41. 41. © 2014 MapR Technologies 41 BigData App Development Lifecycle: input (1T rows) → tail | grep | sort | uniq -c → output over POSIX (NFS), or port to Hadoop HDFS
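For the stages that do need porting, the shell prototype maps almost term for term onto a distributed job. A hedged sketch in PySpark (paths hypothetical):

    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="grep-sort-uniq")
    counts = (sc.textFile("hdfs:///logs/access_log")    # input, 1T rows
                .filter(lambda line: "ERROR" in line)   # grep
                .map(lambda line: (line, 1))
                .reduceByKey(add)                       # uniq -c
                .sortByKey())                           # sort
    counts.saveAsTextFile("hdfs:///logs/error_counts")  # output

With the cluster filesystem also exported over NFS, the original shell pipeline keeps working against the very same files.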
  42. 42. © 2014 MapR Technologies 42 BigData App Development Lifecycle — dev cost per stage of tail | grep | sort | uniq -c: Prototype Tools Dev Cost 1/1/1/1 (Use When Possible) vs. BigData Tools Dev Cost 100/100/100/100 (Use When Needed)
  43. 43. © 2014 MapR Technologies 43 BigData App Development Lifecycle — mixed pipeline: keep prototype tools where they suffice (dev cost 1/1/1) and port only the stage that needs BigData tools (dev cost 100) — use when possible, use when needed
  44. 44. © 2014 MapR Technologies 44 Aadhaar – World's Largest Biometric Database
  45. 45. © 2014 MapR Technologies 45 Largest Biometric Database in the World: 1.2B PEOPLE
  46. 46. © 2014 MapR Technologies 46 India: Problem • 1.2 billion residents – 640,000 villages, ~60% live under $2/day – ~75% literacy, <3% pay Income Tax, <20% have banking – ~800 million mobile subscribers, ~200-300 million migrant workers • Govt. spends about $25-40 billion on direct subsidies – Residents have no standard identity document – Most programs are plagued with ghost and multiple identities, causing leakage of 30-40%
  47. 47. © 2014 MapR Technologies 47 India: Vision • Create a common “national identity” for every “resident” – Biometric backed identity to eliminate duplicates – “Verifiable online identity” for portability • Applications ecosystem using open APIs – Aadhaar enabled bank account and payment platform – Aadhaar enabled electronic, paperless KYC • Enrolment – One time in a person’s lifetime – Multi-modal biometrics (fingerprints, iris)
  48. 48. © 2014 MapR Technologies 48 Aadhaar Biometric Capture & Index
  49. 49. © 2014 MapR Technologies 49 Aadhaar Biometric Capture & Index
  50. 50. © 2014 MapR Technologies 50 Aadhaar Biometric Capture & Index
  51. 51. © 2014 MapR Technologies 51 Architectural Principles • Design for Scale – Every component needs to scale to large volumes – Millions of transactions and billions of records – Accommodate failure and design for recovery • Open Architecture – Open Source – Open APIs • Security – End-to-end security of resident data
  52. 52. © 2014 MapR Technologies 52 Design for Scale • Horizontal scale-out • Distributed computing • Distributed data storage and partitioning • No single points of failure • No single points of bottleneck • Asynchronous processing throughout the system – Allows loose coupling various components – Allows independent component level scaling
  53. 53. © 2014 MapR Technologies 53 Aadhaar Multi-DC Data Storage Stack* on the MapR Filesystem: ID + Biometrics (M7 HBase); all raw packets (HDFS+NFS); ID + Demographics + Photo + Benefits (MySQL, Solr) — serving the Enrollment ID, Authentication, and Authorization APIs. * as best I understand from public documents
  54. 54. © 2014 MapR Technologies 54 Enrollment Volume • 600 to 800 million UIDs in 4 years – 1 million a day – 200+ trillion matches every day!!! • ~5MB per resident – Maps to about 10-15 PB of raw data (2048-bit PKI encrypted!) – About 30 TB of I/O every day – Replication and backup across DCs of about 5+ TB of incremental data every day – Lifecycle updates and new enrolments will continue forever • Additional process data – Several million events on average moving through async channels (some persistent, some transient) – Needing complete update and insert guarantees across data stores
  55. 55. © 2014 MapR Technologies 55 Authentication Volume • 100+ million authentications per day (10 hrs) – Possible high variance between peak and average – Sub-second response – Guaranteed audits • Multi-DC architecture – All changes need to be propagated from enrolment data stores to all authentication sites • Authentication request is about 4 KB – 100 million authentications a day – 1 billion audit records in 10 days (30+ billion a year) – 4 TB of encrypted audit logs in 10 days – Audit writes must be guaranteed
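The audit figures are internally consistent; a quick back-of-envelope check:

    auths_per_day = 100_000_000
    request_bytes = 4 * 1024                          # ~4 KB per authentication
    print(auths_per_day * 10)                         # 1 billion audit records / 10 days
    print(auths_per_day * request_bytes * 10 / 1e12)  # ~4.1 TB of audit logs / 10 days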
  56. 56. © 2014 MapR Technologies 56 How Do Biometrics Relate to Genomics? Data Shape and Size • Aadhaar: 5MB of features (minutiae) • Genome: ~3M features (variants) Data Set Operations • Aadhaar: ƒ(x) unique feature subset => identity • Genome: likewise • Genome: ƒ-1(x) variant × phenotype commonality => causal genes — SNP DB → 2º Analytics
  57. 57. © 2014 MapR Technologies 57 (same comparison) — Vector Pattern Matching: ƒ-1(x): common features; ƒ(x): unique features; ƒ(x): uncommon features; ƒ(x): other features — SNP DB → 2º Analytics
  58. 58. © 2014 MapR Technologies 58 (same comparison) — Topological Pattern Matching — SNP DB → 2º Analytics
  59. 59. © 2014 MapR Technologies 59 MapR Platform
  60. 60. © 2014 MapR Technologies 60 Apache Hadoop NameNode High Availability (HA) — HDFS-based distributions [diagrams: NameNode(s) holding file metadata A–F over DataNodes]. Plain HDFS NameNode: single point of failure, limited to 50-200 million files, performance bottleneck, metadata must fit in memory. HDFS HA (primary + standby NameNode, NAS appliance): only one active NameNode, limited to 50-200 million files, commercial NAS possibly needed, metadata must fit in memory, performance bottleneck, double the block reports. HDFS Federation (multiple NameNodes): multiple single points of failure w/o HA, needs 20 NameNodes for 1 billion files, commercial NAS needed, metadata must fit in memory, performance bottleneck, double the block reports
  61. 61. © 2014 MapR Technologies 61 No NameNode Architecture — [diagram: file metadata A–F distributed and replicated across the DataNodes themselves, no NameNode] • Up to 1T files (>5000x advantage) • Significantly less hardware & OpEx • Higher performance • No special config to enable HA • Automatic failover & re-replication • Metadata is persisted to disk
  62. 62. © 2014 MapR Technologies 62 MapR M7: The Best In-Hadoop Database • NoSQL Columnar Store • Apache HBase API • In-Hadoop database — [diagram: other distros stack the HBase JVM on the HDFS JVM on ext3/ext4 on disks; MapR M7 serves tables/files directly from disks] The most scalable, enterprise-grade NoSQL database that supports online applications and analytics
  63. 63. © 2014 MapR Technologies 63 MapR M7: The Best In-Hadoop Database • NoSQL Columnar Store • Apache HBase API • In-Hadoop database — [diagram, with a BigData Application on top: other distros stack an HBase interface JVM on an HDFS interface JVM on ext3/ext4 on disks; MapR M7 serves tables/files directly from disks] The most scalable, enterprise-grade NoSQL database that supports online applications and analytics
  64. 64. © 2014 MapR Technologies 64 HBase Apps: High Performance with Consistent Low Latency — [chart: M7 read latency vs. others' read latency over time]
  65. 65. © 2014 MapR Technologies 65 MapR Services
  66. 66. © 2014 MapR Technologies 66 Professional Services — [matrix: ENGAGEMENTS × AUDIENCE × SKILLS] • Hadoop Core Services: installation, migrations, SLA plans, best practices, performance tuning — audience: IT/Infrastructure; skills: Linux, networking, data center, storage, operations • Big Data Workflows: Hive/Pig, Oozie/Sqoop, Flume, M7/HBase, data flow — audience: BI/DBA; skills: BI/ETL/reporting, scripting/Java, Hadoop MR, eco projects (HBase, Hive, …) • Solution Design: HBase/M7, Map/Reduce, application development, integration development — audience: Java Hadoop developer; skills: architectural design • Advanced Analytics: use case discovery, use case modeling, POC, workshops — audience: modeler/analyst, data engineering, data science; skills: PhD statistics/math, MatLab/R/SAS, scripting/Java, BI/ETL/reporting
  67. 67. © 2014 MapR Technologies 67 Global PS Resources — 17 today (+8 in Q3). D.C.: Keys Botzum (DE/Security/Developer), Joe Blue (Data Scientist), Venkat Gunnup (DE/Development), Alex Rodriguez (DE/Development), Kannappan Sirchabesa (DE/OPS). SAN JOSE: Wayne Cappas (Director/DE), John Benninghoff (DE/OPS), Dmitry Gomerman (DE/OPS & Security), Ivan Bishop (DE/OPS), James Caseletto (Data Scientist), Sungwook Yoon (Data Scientist), Sridhar Reddy (Director – M7/HBase). LOS ANGELES: John Ewing (DE/OPS), Marco Vasquez (Data Scientist/DE). SOUTH CAROLINA: David Schexnayder (DE/OPS). PHOENIX: Michael Farnbach (DE/OPS). SINGAPORE: Allen Day (Data Scientist)
  68. 68. © 2014 MapR Technologies 68 Use Case Data Flow Example — MapR Data Platform • Ingest: Sqoop, Flume, HDFS, NFS • Access: Tez, Drill, Hive, Pig, Impala • Data Sources: clickstream, billing data, mobile data, product catalog, social media, server logs, merchant listings, online chat, call detail records, set-top box data • Processing and Analytics: M7 HBase, MapReduce v1 & v2, Storm, Cascading, Pig, Solr, Mahout, YARN, Oozie, Hive, MLLib • Visualization
  69. 69. © 2014 MapR Technologies 69 Engagement Types • Customer engagement is typically 1-4 weeks (longer is okay) • Well-established partners (15,000 resources globally) • Custom training based on the customer's use case • Small 1-3 day workshops • Extended support / staff augmentation
  70. 70. © 2014 MapR Technologies 70 Q&A — Thanks! twitter.com/allenday aday@mapr.com slideshare.net/allenday linkedin.com/in/allenday
  71. 71. © 2014 MapR Technologies 71 An Overview of Apache Spark
  72. 72. © 2014 MapR Technologies 72 Agenda • MapReduce Refresher • What is Spark? • The Difference with Spark • Preexisting MapReduce • Examples and Resources
  73. 73. © 2014 MapR Technologies 73 MapReduce Refresher
  74. 74. © 2014 MapR Technologies 74 MapReduce Basics • Foundational model is based on a distributed file system – Scalability and fault-tolerance • Map – Loading of the data and defining a set of keys • Reduce – Collects the organized key-based data to process and output • Performance can be tweaked based on known details of your source files and cluster shape (size, total number)
  75. 75. © 2014 MapR Technologies 75 Languages and Frameworks • Languages – Java, Scala, Clojure – Python, Ruby • Higher Level Languages – Hive – Pig • Frameworks – Cascading, Crunch • DSLs – Scalding, Scrunch, Scoobi, Cascalog
  76. 76. © 2014 MapR Technologies 76 MapReduce Processing Model • Define mappers • Shuffling is automatic • Define reducers • For complex work, chain jobs together – Or use a higher level language or DSL that does this for you
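For concreteness, a minimal word count in the Hadoop Streaming style (illustrative, not from the deck): the mapper emits keyed records, the framework sorts and shuffles them, and the reducer folds each key's group.

    import sys
    from itertools import groupby

    def mapper(lines):
        for line in lines:
            for word in line.split():
                print(f"{word}\t1")             # emit (key, value)

    def reducer(lines):
        # Hadoop delivers reducer input sorted by key, so groupby works.
        pairs = (line.rstrip("\n").split("\t") for line in lines)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(v) for _, v in group)}")

    if __name__ == "__main__":
        # e.g. hadoop streaming -mapper 'wc.py map' -reducer 'wc.py reduce'
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)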
  77. 77. © 2014 MapR Technologies 77 What is Spark?
  78. 78. © 2014 MapR Technologies 78 Apache Spark spark.apache.org github.com/apache/spark user@spark.apache.org • Originally developed in 2009 in UC Berkeley’s AMP Lab • Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation
  79. 79. © 2014 MapR Technologies 79 The Spark Community
  80. 80. © 2014 MapR Technologies 80 Spark is the Most Active Open Source Project in Big Data — [chart: project contributors in the past year, 0–140 scale; Spark leads Giraph, Storm, and Tez]
  81. 81. © 2014 MapR Technologies 81 Unified Platform — Spark (general execution engine) with Shark (SQL), Spark Streaming (streaming), MLlib (machine learning), and GraphX (graph computation) on top. Continued innovation bringing new functionality, e.g.: • Java 8 (closures, lambda expressions) • Spark SQL (SQL on Spark, not just Hive) • BlinkDB (approximate queries) • SparkR (R wrapper for Spark)
  82. 82. © 2014 MapR Technologies 82 Supported Languages • Java • Scala • Python • Hive?
  83. 83. © 2014 MapR Technologies 83 Data Sources • Local Files – file:///opt/httpd/logs/access_log • S3 • Hadoop Distributed Filesystem – Regular files, sequence files, any other Hadoop InputFormat • HBase
  84. 84. © 2014 MapR Technologies 84 Machine Learning - MLlib • K-Means • L1 and L2-regularized Linear Regression • L1 and L2-regularized Logistic Regression • Alternating Least Squares • Naive Bayes • Stochastic Gradient Descent * As of May 14, 2014 ** Don’t be surprised if you see the Mahout library converting to Spark soon
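As a flavor of the API, a k-means sketch against the RDD-based MLlib of that era (assumes a live SparkContext sc; toy data):

    from numpy import array
    from pyspark.mllib.clustering import KMeans

    points = sc.parallelize([
        array([0.0, 0.0]), array([1.0, 1.0]),   # one cluster
        array([9.0, 8.0]), array([8.0, 9.0]),   # another
    ])
    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.predict(array([0.5, 0.5])))     # cluster id of a new point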
  85. 85. © 2014 MapR Technologies 85 The Difference with Spark
  86. 86. © 2014 MapR Technologies 86 Easy and Fast Big Data • Easy to Develop – Rich APIs in Java, Scala, Python – Interactive shell • Fast to Run – General execution graphs – In-memory storage 2-5× less code Up to 10× faster on disk, 100× in memory
  87. 87. © 2014 MapR Technologies 87 Resilient Distributed Datasets (RDD) • Spark revolves around RDDs • Fault-tolerant collection of elements that can be operated on in parallel – Parallelized Collection: Scala collection which is run in parallel – Hadoop Dataset: records of files supported by Hadoop http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  88. 88. © 2014 MapR Technologies 88 RDD Operations • Transformations – Creation of a new dataset from an existing • map, filter, distinct, union, sample, groupByKey, join, etc… • Actions – Return a value after running a computation • collect, count, first, takeSample, foreach, etc… Check the documentation for a complete list http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
  89. 89. © 2014 MapR Technologies 89 RDD Persistence / Caching • Variety of storage levels – memory_only (default), memory_and_disk, etc… • API Calls – persist(StorageLevel) – cache() – shorthand for persist(StorageLevel.MEMORY_ONLY) • Considerations – Read from disk vs. recompute (memory_and_disk) – Total memory storage size (memory_only_ser) – Replicate to second node for faster fault recovery (memory_only_2) • Think about this option if supporting a web application http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
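Putting the last two slides together: transformations are lazy, actions trigger computation, and cache() pins the result for reuse (sketch; assumes a SparkContext sc):

    rdd = sc.parallelize(range(1000000))
    evens = rdd.filter(lambda x: x % 2 == 0)  # transformation: nothing runs yet
    evens.cache()                             # shorthand for persist(MEMORY_ONLY)
    print(evens.count())                      # action: computes, then caches
    print(evens.take(5))                      # action: served from the cache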
  90. 90. © 2014 MapR Technologies 90 Cache Scaling Matters — [chart: execution time vs. % of working set in cache: 69 s cache disabled, 58 s at 25%, 41 s at 50%, 30 s at 75%, 12 s fully cached]
  91. 91. © 2014 MapR Technologies 91 Directed Acyclic Graph (DAG) • Directed – Only in a single direction • Acyclic – No looping • Why does this matter? – This supports fault-tolerance
  92. 92. © 2014 MapR Technologies 92 RDD Fault Recovery — RDDs track lineage information that can be used to efficiently recompute lost data: msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2]) — HDFS File → filter(func = startswith(…)) → Filtered RDD → map(func = split(…)) → Mapped RDD
  93. 93. © 2014 MapR Technologies 93 Comparison to Storm • Higher throughput than Storm – Spark Streaming: 670k records/sec/node – Storm: 115k records/sec/node – Commercial systems: 100-500k records/sec/node — [charts: throughput per node (MB/s) vs. record size (bytes, 100–1000) for WordCount and Grep; Spark above Storm in both]
  94. 94. © 2014 MapR Technologies 94 Interactive Shell • Iterative Development – Cache those RDDs – Open the shell and ask questions • We have all wished we could do this with MapReduce – Compile / save your code for scheduled jobs later • Scala – spark-shell • Python – pyspark
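A typical shell loop looks like this (sketch; the shell creates sc for you, and the log path is the one from the Data Sources slide):

    # $ pyspark
    logs = sc.textFile("file:///opt/httpd/logs/access_log")
    logs.cache()                                  # pay the read cost once
    logs.filter(lambda l: " 500 " in l).count()   # ask a question
    logs.filter(lambda l: " 404 " in l).count()   # refine it; reuses the cache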
  95. 95. © 2014 MapR Technologies 95 The Game Changer!
  96. 96. © 2014 MapR Technologies 96 Preexisting MapReduce
  97. 97. © 2014 MapR Technologies 97 Existing Jobs • Java MapReduce – Port them over if you need better performance • Be sure to share the results and learnings • Pig Scripts – Port them over – Try SPORK! • Hive Queries…
  98. 98. © 2014 MapR Technologies 98 Shark – SQL over Spark • Hive-compatible (HiveQL, UDFs, metadata) – Works in existing Hive warehouses without changing queries or data! • Augments Hive – In-memory tables and columnar memory store • Fast execution engine – Uses Spark as the underlying execution engine – Low-latency, interactive queries – Scale-out and tolerates worker failures
  99. 99. © 2014 MapR Technologies 99 Examples and Resources
  100. 100. © 2014 MapR Technologies 100 Word Count • Java MapReduce (~15 lines of code) • Java Spark (~7 lines of code):
    SparkContext sc = new SparkContext(master, appName, [sparkHome], [jars]);
    JavaRDD<String> file = sc.textFile("hdfs://...");
    JavaRDD<String> counts = file.flatMap(line -> Arrays.asList(line.split(" ")))
        .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
        .reduceByKey((x, y) -> x + y);
    counts.saveAsTextFile("hdfs://...");
  • Scala and Python (4 lines of code) – interactive shell: skip line 1 and replace the last line with counts.collect():
    val sc = new SparkContext(master, appName, [sparkHome], [jars])
    val file = sc.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
  • Java 8 (4 lines of code)
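The slide counts a Python version without printing it; one plausible rendering (sketch, same elided hdfs:// paths, runnable in pyspark where sc exists):

    file = sc.textFile("hdfs://...")
    counts = file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile("hdfs://...")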
  101. 101. © 2014 MapR Technologies 101 Network Word Count – Streaming
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
      System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))
    // Create a NetworkInputDStream on target host:port and count the
    // words in the input stream of \n-delimited text (e.g. generated by 'nc')
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
  102. 102. © 2014 MapR Technologies 102 Deploying Spark – Cluster Manager Types • Standalone mode – Comes bundled (EC2 capable) • YARN • Mesos
  103. 103. © 2014 MapR Technologies 103 Remember • If you want to use a new technology, you must learn that new technology • For those who have been using Hadoop for a while: at one time you had to learn all about MapReduce and how to manage and tune it • To get the most out of a new technology you need to learn that technology, and that includes tuning – There are switches you can use to optimize your work
  104. 104. © 2014 MapR Technologies 104 Configuration http://spark.apache.org/docs/latest/ Most Important • Application Configuration http://spark.apache.org/docs/latest/configuration.html • Standalone Cluster Configuration http://spark.apache.org/docs/latest/spark-standalone.html • Tuning Guide http://spark.apache.org/docs/latest/tuning.html
  105. 105. © 2014 MapR Technologies 105 Resources • Pig on Spark – http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html – https://github.com/aniket486/pig – https://github.com/twitter/pig/tree/spork – http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 – https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix • Latest on Spark – http://databricks.com/categories/spark/ – http://www.spark-stack.org/
  106. 106. © 2014 MapR Technologies 106 • San Francisco June 30 – July 2 • Use Cases • Tech Talks • Training http://spark-summit.org/
  107. 107. © 2014 MapR Technologies 107 Q&A @mapr maprtech jscott@mapr.com Engage with us! MapR maprtech mapr-technologies
