2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China

Genomics, Genealogy, and Biometrics / UID use cases for Hadoop and HBase in

Speaker notes

  • Graph of each step in the pipeline for every run. This graph shows how important it is to measure everything. Some steps have been greatly reduced or eliminated. Light blue is the matching step. You can see it going quadratic and then the change when ‘J’ Jermline was released.
  • Gives up random access read on files
    Gives up strong authentication / authorization model
    Gives up random access write / append on files
  • Historically, the NameNode in Hadoop is a single point of failure, a scalability limitation, and a performance bottleneck.

    With other distributions, you have a bottleneck regardless of the number of nodes in the cluster: the maximum number of files you can support is about 200M, and that is with an extremely high-end server. 50% of the Hadoop processing at Facebook is packing and unpacking files to work around this limitation.

    As you add more nodes to your cluster and want to configure HA, you have to add expensive NAS and keep warm standbys for the NameNode and its related metadata, which is held in memory. Even more, once you surpass the file limit in HDFS, you have to add more NameNode servers to support those additional files and nodes: a “federated NameNode” approach.

    Think of the additional dedicated hardware and configurations/administration required to set up NameNode HA in Hadoop! And this is ONLY for NameNode HA.
  • What if you could distribute the NameNode metadata and have it share resources in your cluster? What if Hadoop was a truly distributed environment?

    With MapR there is no dedicated NameNode. The NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data loss avoidance, scalability and performance.

    (Advantages of this approach are called out on the left and right sides of the diagram.)
  • Because of architecture.
    Apache HBase runs in a JVM and reads/writes to HDFS, which runs in a separate JVM and stores data in the Linux filesystem, which in turn reads and writes to disk.
    As data is collected, it needs to be written to disk and “compacted” (i.e., maintenance is performed); this introduces many layers and steps.
    MapR M7 has integrated tables and files in a true file system, reading and writing directly to disk.
    MapR M7 is a tightly integrated, in-Hadoop NoSQL columnar store that is 100% Apache HBase API compatible.
  • CONSISTENT low latency on reads (no latency spikes due to compactions)
    Recall Aadhaar
    Why?
  • Spark is really cool…
  • When do you use regular mapreduce over higher level languages? When Hive? When Pig? When anything?
  • You can find project resources on the Apache project site. You’ll also find information about the mailing list there (including archives)
  • Yahoo and Adobe are in production with Spark.
  • This sounds a lot like the reason to consider Pig vs. Java MapReduce
  • Gracefully
  • Looks kind of like a source control tree
  • You can import MLlib and use it right here in the shell! For example:
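    A minimal, hypothetical sketch of what that could look like in spark-shell (the file path, k, and iteration count are placeholders, assuming the Spark 1.0-era MLlib KMeans API):

      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      // Parse whitespace-separated numeric features into MLlib vectors (placeholder path)
      val data = sc.textFile("hdfs:///path/to/features.txt")
                   .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
                   .cache()

      // Train a k-means model directly from the interactive shell
      val model = KMeans.train(data, 2 /* k */, 20 /* iterations */)
      model.clusterCenters.foreach(println)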
  • Best use case? Standalone followed by Mesos… My personal opinion is that Mesos is where the future will take us.
  • Don’t forget to share your experiences. This is really what the community is about.

    If you don’t have time to contribute to open source, use it and share your experiences!
  • This isn’t all proven out yet, but some of it should just work already.
  • This is a really simple example. In reality there are 22 chromosomes and 96 characters in a word
  • ‘G’ Germline would have to rebuild the hash table for all samples and then re-run all comparisons: an all-by-all comparison
  • This is where HBase shines. It is easy to add columns and rows, and it is very efficient with empty cells (sparse matrix). You can hammer HBase with multiple processes doing this at the same time, as in the sketch below.
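    A hedged illustration only (the real Jermline schema is described in the paper referenced later in the transcript; the table name, row-key layout, and column family here are hypothetical), using the 0.98-era HBase client API from Scala:

      import org.apache.hadoop.hbase.HBaseConfiguration
      import org.apache.hadoop.hbase.client.{HTable, Put}
      import org.apache.hadoop.hbase.util.Bytes

      // Hypothetical layout: row key = chromosome + position + genotype "word",
      // one column per sample carrying that word. Empty cells cost nothing (sparse matrix).
      val conf  = HBaseConfiguration.create()
      val table = new HTable(conf, "word_matches")   // hypothetical table name

      def recordWord(chrom: String, pos: Int, word: String, sampleId: String): Unit = {
        val put = new Put(Bytes.toBytes(s"$chrom:$pos:$word"))
        put.add(Bytes.toBytes("s"), Bytes.toBytes(sampleId), Bytes.toBytes(1L))
        table.put(put)   // many loader processes can do this concurrently
      }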

Transcript

  • 1. © 2014 MapR Technologies 1© 2014 MapR Technologies Genomics Use Cases @ MapR
  • 2. © 2014 MapR Technologies 2© 2014 MapR Technologies DNA Sequencing Company
  • 3. © 2014 MapR Technologies 3 Parallelize Primary Analytics .fastq .vcf short read alignment genotype calling reads & mappings
  • 4. © 2014 MapR Technologies 4 Sequence Analysis, Quick Overview […] G A C T A G A fragment1 A C A G T T T A C A fragment2 A G A T A - - A G A fragment3 A A C A G C T T A C A […] fragment4 C T A T A G A T A A fragment5 […] G A T T A C A G A T T A C A G A T T A C A […] referenceDNA […] G A C T A C A G A T A A C A G A T T A C A […] sampleDNA
  • 5. © 2014 MapR Technologies 5 What is the (Probable) Color of Each Column?
  • 6. © 2014 MapR Technologies 6 Which Columns are (probably) Not White? Strategy 1: examine foreach column, foreach row O(rows*cols) + O(1 col) memory
  • 7. © 2014 MapR Technologies 7 Which Columns are (probably) Not White? Strategy 2: examine foreach row. keep running tallies O(rows) + O(rows*cols) memory
  • 8. © 2014 MapR Technologies 8 Which Columns are (probably) Not White? Strategy 3: rotate matrix. examine foreach column O(rows log rows) + O(cols) + O(1 col) memory
  • 9. © 2014 MapR Technologies 9 Comparison of Strategies Strategy 1 • Low mem req • Random access pattern, many ops Strategy 3 • Low mem req • Sequential access pattern • Requires Sort Strategy 2 • High mem req • Sequential access pattern O(rows*cols) + O(1 col) memory O(rows) + O(rows*cols) memory O(rows log rows) + O(cols) + O(1 col) memory
  • 10. © 2014 MapR Technologies 10 Comparison of Strategies Strategy 1 • Low mem req • Random access pattern, many ops Strategy 3 • Low mem req • Sequential access pattern • Requires Sort Strategy 2 • High mem req • Sequential access pattern O(rows*cols) + O(1 col) memory O(rows) + O(rows*cols) memory O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) memory As # of rows & columns increases Strategy 3 becomes more attractive
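    As a rough, hypothetical sketch of Strategy 2 above (one sequential pass over the rows while keeping a running tally per column; the boolean "non-white" encoding and the threshold are illustrative only):

      // Strategy 2 sketch: stream the rows once, keep one tally per column.
      // Sequential access pattern; memory grows with the number of columns.
      def nonWhiteColumns(rows: Iterator[Array[Boolean]], numCols: Int): Seq[Int] = {
        val tallies = new Array[Long](numCols)       // running count of non-white cells per column
        for (row <- rows; col <- row.indices if row(col))
          tallies(col) += 1
        tallies.indices.filter(c => tallies(c) > 0)  // columns that are (probably) not white
      }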
  • 11. © 2014 MapR Technologies 11 Primary Sequence Analysis (ETL), MapReduce style .fastq .bam .vcf short read alignment genotype calling MAP MAP REDUCE, rotate matrix 90º (O(mn)) / 1 (O(mn) + O(n log n)) / s Hello!
  • 12. © 2014 MapR Technologies 12 Clinical Applications: Performance Matters MapR FilesystemN F S DNA Sequencer DNA Sequencer DNA Sequencer Raw DNARaw DNARaw DNA 1º Analytics Raw DNARaw DNASNP calls Static Clinical Reporting PhysicianPatient Reference DBs SNP DB ETL 2º Analytics ResearcherSubject
  • 13. © 2014 MapR Technologies 13 Variant Collection Enables Downstream Apps • GWAS Association Studies • Versioned, Personalized Medicine • Companion Diagnostics SNP DB 2º Analytics New Markets Hello! More linear algebra [Spark, Summingbird, Lambda Architecture Slides]
  • 14. © 2014 MapR Technologies 14 The Post-Sequencing Genomics Workload Sboner, et al, 2011. The real cost of sequencing: higher than you think!
  • 15. © 2014 MapR Technologies 15 Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc
  • 16. © 2014 MapR Technologies 16 Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc • In context of, e.g. – ε1: Racial, etc. background – ε2: Experimental design- specific concerns (e.g. familial IBD/IBS) – ε3: Environmental factors and penetrance – ε4: Assay-specific biases and noise phenotype = αgenotype + β + ε1 + ε2 + ε3 + ε4 At risk of over-simplifying as business-level concept…
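    Written out as a formula, the business-level model from this slide is a linear model with structured error terms:

      $$\text{phenotype} = \alpha \cdot \text{genotype} + \beta + \varepsilon_1 + \varepsilon_2 + \varepsilon_3 + \varepsilon_4$$

    where the ε terms capture the background, experimental-design, environmental, and assay-specific effects listed on the slide.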
  • 17. © 2014 MapR Technologies 17 HUGE PROBLEM COMBINATORIAL EXPLOSION
  • 18. © 2014 MapR Technologies 18 What’s a Percolator? • Google Percolator – “Caffeine” update 2010 • Iterative, incremental prioritized updates • No batch processing • Decouple computational results from data size Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
  • 19. © 2014 MapR Technologies 19 Solution: Percolate SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies Denormalize and Percolate (re)prioritize & (re)process service queries drive dashboards create reports denormalize for display buffer New models
  • 20. © 2014 MapR Technologies 20 Robot Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • 21. © 2014 MapR Technologies 21 Robot (Data?) Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • 22. © 2014 MapR Technologies 22© 2014 MapR Technologies Genealogy Company Slides credit: Bill Yetman, Hadoop Summit 2014 http://slidesha.re/1vRh3kY
  • 23. © 2014 MapR Technologies 23 GERMLINE is… • …an algorithm that finds hidden relationships within a pool of DNA • …the reference implementation of that algorithm written in C++. • You can find it here: http://www1.cs.columbia.edu/~gusev/germline/ 2 3
  • 24. © 2014 MapR Technologies 24 Projected GERMLINE run times (in hours) [chart: hours vs. samples (2,500 to 122,500); measured and projected GERMLINE run times; the projection reaches 700 hours = 29+ days] EXPONENTIAL COMPLEXITY
  • 25. © 2014 MapR Technologies 25 GERMLINE: What’s the Problem? • GERMLINE (the implementation) was not meant to be used in an industrial setting – Stateless, single threaded, prone to swapping (heavy memory usage) – GERMLINE performs poorly on large data sets • Our metrics predicted exactly where the process would slow to a crawl • Put simply: GERMLINE couldn't scale 2 5
  • 26. © 2014 MapR Technologies 26 Run times for matching (in hours) [chart: hours vs. samples; GERMLINE run times, Jermline run times, and projected GERMLINE run times; EXPONENTIAL vs. LINEAR after the HBase refactor]
  • 27. © 2014 MapR Technologies 27 • Paper submitted describing the implementation • Releasing as an Open Source project soon • [HBase Schema/Algorithm Slides] 2 7
  • 28. © 2014 MapR Technologies 28© 2014 MapR Technologies Further Growth & Optimization
  • 29. © 2014 MapR Technologies 29 Underdog (Strand Phasing) performance – Went from 12 hours to process 1,000 samples to under 25 minutes with a MapReduce implementation, with improved accuracy! Underdog replaces Beagle. [chart: total Beagle/Underdog duration vs. total run size]
  • 30. © 2014 MapR Technologies 30 Pipeline steps and incremental change… – Incremental change over time – Supporting the business in a “just in time” Agile way [chart: per-step pipeline duration across growing run sizes; steps: Beagle-Underdog Phasing, Pipeline Finalize, Relationship Processing, Germline-Jermline Results Processing, Germline-Jermline Processing, Beagle Post Phasing, Admixture, Plink Prep, Pipeline Initialization; milestones: Jermline replaces Germline, Ethnicity V2 Release, Underdog Replaces Beagle, AdMixture on Hadoop]
  • 31. © 2014 MapR Technologies 31 …while the business continues to grow rapidly [chart: DNA database size in # of processed samples, Jan-12 through Apr-14]
  • 32. © 2014 MapR Technologies 32© 2014 MapR Technologies BigData App Development Lifecycle
  • 33. © 2014 MapR Technologies 33 BigData App Development Lifecycle outputinput 1M rows tail | grep | sort | uniq -c
  • 34. © 2014 MapR Technologies 34 Evolution of Data Storage Functionality Compatibility Scalability Linux POSIX Over decades of progress, Unix-based systems have set the standard for compatibility and functionality
  • 35. © 2014 MapR Technologies 35 BigData App Development Lifecycle outputinput 1M rows tail | grep | sort | uniq -c
  • 36. © 2014 MapR Technologies 36 BigData App Development Lifecycle tail | grep | sort | uniq -c outputinput 1M rows 1B rows
  • 37. © 2014 MapR Technologies 37 BigData App Development Lifecycle tail | grep | sort | uniq -c outputinput 1M rows 1B rows 1T rows
  • 38. © 2014 MapR Technologies 38 Evolution of Data Storage Functionality Compatibility Scalability Linux POSIX Hadoop Hadoop achieves much higher scalability by trading away essentially all of this compatibility
  • 39. © 2014 MapR Technologies 39 BigData App Development Lifecycle tail | grep | sort | uniq -c outputinput 1T rows 1T rows input output Port to BigData Tools ($$$$)
  • 40. © 2014 MapR Technologies 40 Evolution of Data Storage Functionality Compatibility Scalability Linux POSIX Hadoop MapR enhances Apache Hadoop by restoring the compatibility while increasing scalability and performance
  • 41. © 2014 MapR Technologies 41 BigData App Development Lifecycle tail | grep | sort | uniq -c outputinput 1T rows POSIX (NFS) Hadoop HDFS Port
  • 42. © 2014 MapR Technologies 42 BigData App Development Lifecycle tail | grep | sort | uniq -c 1 1 1 1 100 100 100 100 Prototype Tools Dev Cost BigData Tools Dev Cost Use When Possible Use When Needed
  • 43. © 2014 MapR Technologies 43 BigData App Development Lifecycle tail | grep | sort | uniq -c 1 1 1 100 Prototype Tools Dev Cost BigData Tools Dev Cost Use When Possible Use When Needed
  • 44. © 2014 MapR Technologies 44© 2014 MapR Technologies Aadhaar – World’s Largest Biometric Database
  • 45. © 2014 MapR Technologies 45 Largest Biometric Database in the World – 1.2B PEOPLE
  • 46. © 2014 MapR Technologies 46 India: Problem • 1.2 billion residents – 640,000 villages, ~60% lives under $2/day – ~75% literacy, <3% pays Income Tax, <20% banking – ~800 million mobile, ~200-300 mn migrant workers • Govt. spends about $25-40 billion on direct subsidies – Residents have no standard identity document – Most programs plagued with ghost and multiple identities causing leakage of 30-40%
  • 47. © 2014 MapR Technologies 47 India: Vision • Create a common “national identity” for every “resident” – Biometric backed identity to eliminate duplicates – “Verifiable online identity” for portability • Applications ecosystem using open APIs – Aadhaar enabled bank account and payment platform – Aadhaar enabled electronic, paperless KYC • Enrolment – One time in a person’s lifetime – Multi-modal biometrics (fingerprints, iris)
  • 48. © 2014 MapR Technologies 48 Aadhaar Biometric Capture & Index
  • 49. © 2014 MapR Technologies 49 Aadhaar Biometric Capture & Index
  • 50. © 2014 MapR Technologies 50 Aadhaar Biometric Capture & Index
  • 51. © 2014 MapR Technologies 51 Architectural Principles • Design for Scale – Every component needs to scale to large volumes – Millions of transactions and billions of records – Accommodate failure and design for recovery • Open Architecture – Open Source – Open APIs • Security – End-to-end security of resident data
  • 52. © 2014 MapR Technologies 52 Design for Scale • Horizontal scale-out • Distributed computing • Distributed data storage and partitioning • No single points of failure • No single points of bottleneck • Asynchronous processing throughout the system – Allows loose coupling of various components – Allows independent component-level scaling
  • 53. © 2014 MapR Technologies 53 MapR Filesystem Aadhaar Multi-DC Data Storage Stack* ID + Biometrics (M7 HBase) All raw packets (HDFS+NFS) Enrollment ID API ID + Demo + Photo + Benefits (MySQL, Solr) Authentication API Authorization API * as best I understand from public documents
  • 54. © 2014 MapR Technologies 54 Enrollment Volume • 600 to 800 million UIDs in 4 years – 1 million a day – 200+ trillion matches every day!!! • ~5MB per resident – Maps to about 10-15 PB of raw data (2048-bit PKI encrypted!) – About 30 TB I/O every day – Replication and backup across DCs of about 5+ TB of incremental data every day – Lifecycle updates and new enrolments will continue for ever • Additional process data – Several million events on an average moving through async channels (some persistent and some transient) – Needing complete update and insert guarantees across data stores
  • 55. © 2014 MapR Technologies 55 Authentication Volume • 100+ million authentications per day (10 hrs) – Possible high variance on peak and average – Sub-second response – Guaranteed audits • Multi-DC architecture – All changes need to be propagated from enrolment data stores to all authentication sites • Authentication request is about 4 K – 100 million authentications a day – 1 billion audit records in 10 days (30+ billion a year) – 4 TB encrypted audit logs in 10 days – Audit write must be guaranteed
  • 56. © 2014 MapR Technologies 56 How Do Biometrics Relate to Genomics? Data Shape and Size • Aadhaar: 5MB features (minutia) • Genome: ~3M features (variants) Data Set Operations • Aadhaar: ƒ(x) Unique feature subset => identity • Genome: “ “ “ “ “ • Genome: Variant × Phenotype Commonality => Causal Genes ƒ-1(x) ! SNP DB 2º Analytics
  • 57. © 2014 MapR Technologies 57 Data Shape and Size • Aadhaar: 5MB features (minutia) • Genome: ~3M features (variants) Data Set Operations • Aadhaar: ƒ(x) Unique feature subset => identity • Genome: “ “ “ “ “ • Genome: Variant × Phenotype Commonality => Causal Genes ƒ-1(x) ! Vector Pattern Matching SNP DB 2º Analytics ƒ-1(x): common features ƒ(x): unique features ƒ(x): uncommon features ƒ(x): other features
  • 58. © 2014 MapR Technologies 58 Data Shape and Size • Aadhaar: 5MB features (minutia) • Genome: ~3M features (variants) Data Set Operations • Aadhaar: ƒ(x) Unique feature subset => identity • Genome: “ “ “ “ “ • Genome: Variant × Phenotype Commonality => Causal Genes ƒ-1(x) ! Topological Pattern Matching SNP DB 2º Analytics
  • 59. © 2014 MapR Technologies 59© 2014 MapR Technologies MapR Platform
  • 60. © 2014 MapR Technologies 60 Apache Hadoop NameNode High Availability (HA) [diagram comparing HDFS-based distributions] Single NameNode: single point of failure, limited to 50-200 million files, performance bottleneck, metadata must fit in memory. HDFS HA (primary + standby NameNode with a NAS appliance): only one active NameNode, limited to 50-200 million files, commercial NAS possibly needed, metadata must fit in memory, performance bottleneck, double the block reports. HDFS Federation (multiple NameNodes): multiple single points of failure w/o HA, needs 20 NameNodes for 1 billion files, commercial NAS needed, metadata must fit in memory, performance bottleneck, double the block reports.
  • 61. © 2014 MapR Technologies 61 No NameNode Architecture [diagram: NameNode metadata distributed across all DataNodes] Up to 1T files (> 5000x advantage), significantly less hardware & OpEx, higher performance, no special config to enable HA, automatic failover & re-replication, metadata is persisted to disk
  • 62. © 2014 MapR Technologies 62 MapR M7: The Best In-Hadoop Database – NoSQL Columnar Store – Apache HBase API – In-Hadoop database [diagram: other distros stack HBase JVM → HDFS JVM → ext3/ext4 → disks; MapR M7 serves tables/files directly from disks] The most scalable, enterprise-grade, NoSQL database that supports online applications and analytics
  • 63. © 2014 MapR Technologies 63 MapR M7: The Best In-Hadoop Database – NoSQL Columnar Store – Apache HBase API – In-Hadoop database [diagram: a BigData application on other distros goes through an HBase-interface JVM and an HDFS-interface JVM to ext3/ext4 and disks; on MapR M7 it goes through tables/files directly to disks] The most scalable, enterprise-grade, NoSQL database that supports online applications and analytics
  • 64. © 2014 MapR Technologies 64 HBase Apps: High Performance with Consistent Low Latency [chart: M7 read latency vs. others’ read latency]
  • 65. © 2014 MapR Technologies 65© 2014 MapR Technologies MapR Services
  • 66. © 2014 MapR Technologies 66 Professional Services • Installation • Migrations • SLA Plans • Best Practices • Performance Tuning Hadoop Core Services IT/ Infrastructure Linux Networking Data Center Storage Operations Big Data Workflows • Hive/Pig • Oozie/Sqoop • Flume • M7/HBase • Data Flow BI / DBA BI / ETL / Reporting Scripting / Java Hadoop MR Eco Projects (HBase, Hive, …) Solution Design • HBase/M7 • Map/Reduce • Application Development • Integration Development Java Hadoop Developer Architectural Design Advanced Analytics • Use case Discovery • Use case Modeling • POC • Workshops Modeler / Analyst PhD Statistics/Math MatLab / R / SAS Scripting / Java BI / ETL / Reporting Data Engineering Data Science AUDIENCE ENGAGEMENTS SKILLS
  • 67. © 2014 MapR Technologies 67 Global PS Resources 17 Today (+8 in Q3) D.C. Keys Botzum (DE/Security/Developer) Joe Blue (Data Scientist) Venkat Gunnup (DE/Development) Alex Rodriguez (DE/Development) Kannappan Sirchabesa (DE/OPS) SAN JOSE Wayne Cappas (Director/DE) John Benninghoff (DE/OPS) Dmitry Gomerman (DE/OPS & Security) Ivan Bishop (DE/OPS) James Caseletto (Data Scientist) Sungwook Yoon (Data Scientist) Sridhar Reddy (Director - M7/Hbase) LOS ANGELES John Ewing (DE/OPS) Marco Vasquez (Data Scientist/DE) SOUTH CAROLINA David Schexnayder (DE/OPS) PHOENIX Michael Farnbach (DE/OPS) SINGAPORE Allen Day (Data Scientist)
  • 68. © 2014 MapR Technologies 68 Use Case Data Flow Example MapR Data Platform Processing and Analytics Ingest Sqoop Flume HDFS NFS Access Tez Drill Hive Pig Impala Data Sources Clickstream Billing Data Mobile Data Product Catalog Social Media Server Logs Merchant Listings Online Chat Call Detail Records Visualization M7 HBase MapReduce v1 & v2 Storm Cascading Pig Solr Mahout YARN Oozie Hive MLLib Set-Top Box Data
  • 69. © 2014 MapR Technologies 69 Engagement Types • Customer engagement is typically 1-4 weeks (longer okay) • Well established partners (15,000 resources globally) • Custom training based on customer use-case • Small 1-3 days workshops • Extended support / Staff augmentation
  • 70. © 2014 MapR Technologies 70 Q&A twitter.com/allenday aday@mapr.com Thanks! slideshare.net/allendaylinkedin.com/in/allenday
  • 71. © 2014 MapR Technologies 71© 2014 MapR Technologies An Overview of Apache Spark
  • 72. © 2014 MapR Technologies 72 Agenda • MapReduce Refresher • What is Spark? • The Difference with Spark • Preexisting MapReduce • Examples and Resources
  • 73. © 2014 MapR Technologies 73© 2014 MapR Technologies MapReduce Refresher
  • 74. © 2014 MapR Technologies 74 MapReduce Basics • Foundational model is based on a distributed file system – Scalability and fault-tolerance • Map – Loading of the data and defining a set of keys • Reduce – Collects the organized key-based data to process and output • Performance can be tweaked based on known details of your source files and cluster shape (size, total number)
  • 75. © 2014 MapR Technologies 75 Languages and Frameworks • Languages – Java, Scala, Clojure – Python, Ruby • Higher Level Languages – Hive – Pig • Frameworks – Cascading, Crunch • DSLs – Scalding, Scrunch, Scoobi, Cascalog
  • 76. © 2014 MapR Technologies 76 MapReduce Processing Model • Define mappers • Shuffling is automatic • Define reducers • For complex work, chain jobs together – Or use a higher level language or DSL that does this for you
  • 77. © 2014 MapR Technologies 77© 2014 MapR Technologies What is Spark?
  • 78. © 2014 MapR Technologies 78 Apache Spark spark.apache.org github.com/apache/spark user@spark.apache.org • Originally developed in 2009 in UC Berkeley’s AMP Lab • Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation
  • 79. © 2014 MapR Technologies 79 The Spark Community
  • 80. © 2014 MapR Technologies 80 Spark is the Most Active Open Source Project in Big Data [chart: project contributors in past year – Spark vs. Giraph, Storm, Tez]
  • 81. © 2014 MapR Technologies 81 Unified Platform Shark (SQL) Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Continued innovation bringing new functionality, e.g.: • Java 8 (Closures, Lambda Expressions) • Spark SQL (SQL on Spark, not just Hive) • BlinkDB (Approximate Queries) • SparkR (R wrapper for Spark)
  • 82. © 2014 MapR Technologies 82 Supported Languages • Java • Scala • Python • Hive?
  • 83. © 2014 MapR Technologies 83 Data Sources • Local Files – file:///opt/httpd/logs/access_log • S3 • Hadoop Distributed Filesystem – Regular files, sequence files, any other Hadoop InputFormat • HBase
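    A minimal sketch of pointing a SparkContext at these sources (the access_log path comes from the slide; the HDFS and S3 paths are placeholders):

      // Local file, HDFS path, and an S3 bucket; any Hadoop InputFormat also works
      val localLog = sc.textFile("file:///opt/httpd/logs/access_log")
      val hdfsData = sc.textFile("hdfs:///data/reads.fastq")   // placeholder path
      val s3Data   = sc.textFile("s3n://my-bucket/genomes/")   // placeholder bucket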
  • 84. © 2014 MapR Technologies 84 Machine Learning - MLlib • K-Means • L1 and L2-regularized Linear Regression • L1 and L2-regularized Logistic Regression • Alternating Least Squares • Naive Bayes • Stochastic Gradient Descent * As of May 14, 2014 ** Don’t be surprised if you see the Mahout library converting to Spark soon
  • 85. © 2014 MapR Technologies 85© 2014 MapR Technologies The Difference with Spark
  • 86. © 2014 MapR Technologies 86 Easy and Fast Big Data • Easy to Develop – Rich APIs in Java, Scala, Python – Interactive shell • Fast to Run – General execution graphs – In-memory storage 2-5× less code Up to 10× faster on disk, 100× in memory
  • 87. © 2014 MapR Technologies 87 Resilient Distributed Datasets (RDD) • Spark revolves around RDDs • Fault-tolerant collection of elements that can be operated on in parallel – Parallelized Collection: Scala collection which is run in parallel – Hadoop Dataset: records of files supported by Hadoop http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 88. © 2014 MapR Technologies 88 RDD Operations • Transformations – Creation of a new dataset from an existing • map, filter, distinct, union, sample, groupByKey, join, etc… • Actions – Return a value after running a computation • collect, count, first, takeSample, foreach, etc… Check the documentation for a complete list http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
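    For instance, a small illustrative example combining transformations and actions (the names and values are made up):

      val nums       = sc.parallelize(1 to 1000)        // parallelized collection
      val evens      = nums.filter(_ % 2 == 0)          // transformation: lazy, builds a new RDD
      val lastDigits = evens.map(_ % 10).distinct()     // more transformations, still lazy
      println(lastDigits.count())                       // action: runs the job, returns 5
      lastDigits.collect().sorted.foreach(println)      // action: brings results to the driver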
  • 89. © 2014 MapR Technologies 89 RDD Persistence / Caching • Variety of storage levels – memory_only (default), memory_and_disk, etc… • API Calls – persist(StorageLevel) – cache() – shorthand for persist(StorageLevel.MEMORY_ONLY) • Considerations – Read from disk vs. recompute (memory_and_disk) – Total memory storage size (memory_only_ser) – Replicate to second node for faster fault recovery (memory_only_2) • Think about this option if supporting a web application http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
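    A short sketch of those calls (the path and the choice of storage level are illustrative):

      import org.apache.spark.storage.StorageLevel

      val errors = sc.textFile("hdfs:///logs/").filter(_.contains("ERROR"))   // placeholder path
      errors.persist(StorageLevel.MEMORY_AND_DISK)   // or errors.cache() for MEMORY_ONLY
      println(errors.count())                        // first action materializes and caches the RDD
      println(errors.count())                        // later actions reuse the cached data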
  • 90. © 2014 MapR Technologies 90 Cache Scaling Matters [chart: execution time (s) vs. % of working set in cache – cache disabled: 69 s, 25%: 58 s, 50%: 41 s, 75%: 30 s, fully cached: 12 s]
  • 91. © 2014 MapR Technologies 91 Directed Acyclic Graph (DAG) • Directed – Only in a single direction • Acyclic – No looping • Why does this matter? – This supports fault-tolerance
  • 92. © 2014 MapR Technologies 92 RDD Fault Recovery RDDs track lineage information that can be used to efficiently recompute lost data msgs = textFile.filter(lambda s: s.startswith(“ERROR”)).map(lambda s: s.split(“\t”)[2]) [diagram: HDFS File → Filtered RDD via filter (func = startswith(…)) → Mapped RDD via map (func = split(...))]
  • 93. © 2014 MapR Technologies 93 Comparison to Storm • Higher throughput than Storm – Spark Streaming: 670k records/sec/node – Storm: 115k records/sec/node – Commercial systems: 100-500k records/sec/node [charts: throughput per node (MB/s) vs. record size (bytes) for WordCount and Grep, Spark vs. Storm]
  • 94. © 2014 MapR Technologies 94 Interactive Shell • Iterative Development – Cache those RDDs – Open the shell and ask questions • We have all wished we could do this with MapReduce – Compile / save your code for scheduled jobs later • Scala – spark-shell • Python – pyspark
  • 95. © 2014 MapR Technologies 95 The Game Changer!
  • 96. © 2014 MapR Technologies 96© 2014 MapR Technologies Preexisting MapReduce
  • 97. © 2014 MapR Technologies 97 Existing Jobs • Java MapReduce – Port them over if you need better performance • Be sure to share the results and learnings • Pig Scripts – Port them over – Try SPORK! • Hive Queries….
  • 98. © 2014 MapR Technologies 98 Shark – SQL over Spark • Hive-compatible (HiveQL, UDFs, metadata) – Works in existing Hive warehouses without changing queries or data! • Augments Hive – In-memory tables and columnar memory store • Fast execution engine – Uses Spark as the underlying execution engine – Low-latency, interactive queries – Scale-out and tolerates worker failures
  • 99. © 2014 MapR Technologies 99© 2014 MapR Technologies Examples and Resources
  • 100. © 2014 MapR Technologies 100 Word Count
      SparkContext sc = new SparkContext(master, appName, [sparkHome], [jars]);
      JavaRDD<String> file = sc.textFile("hdfs://...");
      JavaPairRDD<String, Integer> counts = file.flatMap(line -> Arrays.asList(line.split(" ")))
          .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
          .reduceByKey((x, y) -> x + y);
      counts.saveAsTextFile("hdfs://...");

      val sc = new SparkContext(master, appName, [sparkHome], [jars])
      val file = sc.textFile("hdfs://...")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs://...")

    • Java MapReduce (~15 lines of code)
    • Java Spark (~7 lines of code)
    • Scala and Python (4 lines of code) – interactive shell: skip line 1 and replace the last line with counts.collect()
    • Java 8 (4 lines of code)
  • 101. © 2014 MapR Technologies 101 Network Word Count – Streaming
      // Create the context with a 1 second batch size
      val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
        System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))
      // Create a NetworkInputDStream on target host:port and count the
      // words in the input stream of \n delimited text (e.g. generated by 'nc')
      val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
      val words = lines.flatMap(_.split(" "))
      val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
      wordCounts.print()
      ssc.start()
  • 102. © 2014 MapR Technologies 102 Deploying Spark – Cluster Manager Types • Standalone mode – Comes bundled (EC2 capable) • YARN • Mesos
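    As a rough sketch, the choice of cluster manager mostly shows up in the master URL used when creating the context (host names, ports, and the app name below are placeholders):

      // Master URL selects the cluster manager:
      //   local[4]            – run locally with 4 threads (development)
      //   spark://host:7077   – standalone mode (comes bundled, EC2 capable)
      //   mesos://host:5050   – Apache Mesos
      //   YARN is typically selected when launching through spark-submit
      val conf = new org.apache.spark.SparkConf()
        .setAppName("GenomicsJob")             // placeholder app name
        .setMaster("spark://master:7077")      // swap for local[4], mesos://..., etc.
      val sc = new org.apache.spark.SparkContext(conf)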
  • 103. © 2014 MapR Technologies 103 Remember • If you want to use a new technology you must learn that new technology • For those who have been using Hadoop for a while, at one time you had to learn all about MapReduce and how to manage and tune it • To get the most out of a new technology you need to learn that technology, this includes tuning – There are switches you can use to optimize your work
  • 104. © 2014 MapR Technologies 104 Configuration http://spark.apache.org/docs/latest/ Most Important • Application Configuration http://spark.apache.org/docs/latest/configuration.html • Standalone Cluster Configuration http://spark.apache.org/docs/latest/spark-standalone.html • Tuning Guide http://spark.apache.org/docs/latest/tuning.html
  • 105. © 2014 MapR Technologies 105 Resources • Pig on Spark – http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html – https://github.com/aniket486/pig – https://github.com/twitter/pig/tree/spork – http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 – https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix • Latest on Spark – http://databricks.com/categories/spark/ – http://www.spark-stack.org/
  • 106. © 2014 MapR Technologies 106 • San Francisco June 30 – July 2 • Use Cases • Tech Talks • Training http://spark-summit.org/
  • 107. © 2014 MapR Technologies 107 Q&A @mapr maprtech jscott@mapr.com Engage with us! MapR maprtech mapr-technologies