© 2014 MapR Technologies
Genomics Use Cases @ MapR
© 2014 MapR Technologies 67
Global PS Resources
17 Today (+8 in Q3)
D.C.
Keys Botzum (DE/Security/Developer)
Joe Blue (Dat...
© 2014 MapR Technologies 68
Use Case Data Flow Example
MapR Data Platform
Processing and Analytics
Ingest
Sqoop
Flume
HDFS...
© 2014 MapR Technologies 69
Engagement Types
• Customer engagement is typically 1-4 weeks (longer okay)
• Well established...
© 2014 MapR Technologies 70
Q&A
twitter.com/allenday aday@mapr.com
Thanks!
slideshare.net/allenday linkedin.com/in/allenday
© 2014 MapR Technologies 71© 2014 MapR Technologies
An Overview of Apache Spark
© 2014 MapR Technologies 72
Agenda
• MapReduce Refresher
• What is Spark?
• The Difference with Spark
• Preexisting MapRed...
© 2014 MapR Technologies 73© 2014 MapR Technologies
MapReduce Refresher
© 2014 MapR Technologies 74
MapReduce Basics
• Foundational model is based on a distributed file system
– Scalability and ...
© 2014 MapR Technologies 75
Languages and Frameworks
• Languages
– Java, Scala, Clojure
– Python, Ruby
• Higher Level Lang...
© 2014 MapR Technologies 76
MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For c...
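The processing model above can be sketched as a toy in-process MapReduce in plain Python (illustrative only, not the Hadoop API): you define the mappers and reducers, and the shuffle (group-by-key) happens automatically in between.

```python
from collections import defaultdict

def map_fn(line):                      # user-defined mapper
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):            # user-defined reducer
    return (key, sum(values))

def mapreduce(records, map_fn, reduce_fn):
    # shuffle: group all mapper output by key (automatic in Hadoop)
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return sorted(reduce_fn(k, vs) for k, vs in groups.items())

lines = ["big data", "big fast data"]
assert mapreduce(lines, map_fn, reduce_fn) == \
    [("big", 2), ("data", 2), ("fast", 1)]
```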
© 2014 MapR Technologies 77© 2014 MapR Technologies
What is Spark?
© 2014 MapR Technologies 78
Apache Spark
spark.apache.org
github.com/apache/spark
user@spark.apache.org
• Originally devel...
© 2014 MapR Technologies 79
The Spark Community
© 2014 MapR Technologies 80
Spark is the Most Active Open Source Project in Big Data
[chart: contribution activity compared with Giraph, Storm, and Tez]
© 2014 MapR Technologies 81
Unified Platform
Shark
(SQL)
Spark Streaming
(Streaming)
MLlib
(Machine learning)
Spark (Gener...
© 2014 MapR Technologies 82
Supported Languages
• Java
• Scala
• Python
• Hive?
© 2014 MapR Technologies 83
Data Sources
• Local Files
– file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed Files...
© 2014 MapR Technologies 84
Machine Learning - MLlib
• K-Means
• L1 and L2-regularized Linear Regression
• L1 and L2-regul...
© 2014 MapR Technologies 85© 2014 MapR Technologies
The Difference with Spark
© 2014 MapR Technologies 86
Easy and Fast Big Data
• Easy to Develop
– Rich APIs in Java, Scala,
Python
– Interactive shel...
© 2014 MapR Technologies 87
Resilient Distributed Datasets (RDD)
• Spark revolves around RDDs
• Fault-tolerant collection ...
© 2014 MapR Technologies 88
RDD Operations
• Transformations
– Creation of a new dataset from an existing
• map, filter, d...
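A toy illustration of the transformation/action split in plain Python (a stand-in for RDDs, not Spark's actual classes): transformations only record lineage; an action forces computation.

```python
class MiniRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions compute."""
    def __init__(self, source, ops=()):
        self.source = source      # parent data
        self.ops = list(ops)      # recorded lineage of transformations

    # --- transformations: return a new dataset, nothing runs yet ---
    def map(self, f):
        return MiniRDD(self.source, self.ops + [("map", f)])

    def filter(self, pred):
        return MiniRDD(self.source, self.ops + [("filter", pred)])

    # --- actions: walk the lineage and produce a value ---
    def collect(self):
        data = list(self.source)
        for kind, f in self.ops:
            data = [f(x) for x in data] if kind == "map" else \
                   [x for x in data if f(x)]
        return data

    def count(self):
        return len(self.collect())

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet; only the lineage has been recorded:
assert [kind for kind, _ in rdd.ops] == ["map", "filter"]
assert rdd.collect() == [0, 4, 16, 36, 64]   # the action triggers the work
assert rdd.count() == 5
```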
© 2014 MapR Technologies 89
RDD Persistence / Caching
• Variety of storage levels
– memory_only (default), memory_and_disk...
© 2014 MapR Technologies 90
Cache Scaling Matters
[chart: run times of 69, 58, 41, 30, and 12 as the cache grows from disabled through 25%, 50%, 75% to fully cached]
© 2014 MapR Technologies 91
Directed Acyclic Graph (DAG)
• Directed
– Only in a single direction
• Acyclic
– No looping
• W...
© 2014 MapR Technologies 92
RDD Fault Recovery
RDDs track lineage information that can be used to efficiently
recompute lo...
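The same idea in toy form (not Spark internals): because the lineage is recorded, a lost partition can be rebuilt by replaying its transformations against the source data, without recomputing everything else.

```python
# Toy fault recovery via lineage: partitions are cached results, and a
# lost one is recomputed by replaying the recorded transformations.
source_partitions = {0: [1, 2, 3], 1: [4, 5, 6]}
lineage = [lambda x: x + 1, lambda x: x * 10]   # recorded transformations

def compute(pid):
    data = source_partitions[pid]
    for f in lineage:
        data = [f(x) for x in data]
    return data

cache = {pid: compute(pid) for pid in source_partitions}
del cache[1]                       # the node holding partition 1 fails

missing = [pid for pid in source_partitions if pid not in cache]
for pid in missing:                # only the lost partition is recomputed
    cache[pid] = compute(pid)

assert cache[1] == [50, 60, 70]
```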
© 2014 MapR Technologies 93
Comparison to Storm
• Higher throughput than Storm
– Spark Streaming: 670k records/sec/node
– ...
© 2014 MapR Technologies 94
Interactive Shell
• Iterative Development
– Cache those RDDs
– Open the shell and ask question...
© 2014 MapR Technologies 95
The Game Changer!
• The
– Port them over if you need better performance
• Be sure to share the...
© 2014 MapR Technologies 96© 2014 MapR Technologies
Preexisting MapReduce
© 2014 MapR Technologies 97
Existing Jobs
• Java MapReduce
– Port them over if you need better performance
• Be sure to sh...
© 2014 MapR Technologies 98
Shark – SQL over Spark
• Hive-compatible (HiveQL, UDFs, metadata)
– Works in existing Hive war...
© 2014 MapR Technologies 99© 2014 MapR Technologies
Examples and Resources
© 2014 MapR Technologies 100
SparkContext sc = new SparkContext(master, appName, [sparkHome], [jars]);
JavaRDD<String> fil...
© 2014 MapR Technologies 101
Network Word Count – Streaming
// Create the context with a 1 second batch size
val ssc = new...
© 2014 MapR Technologies 102
Deploying Spark – Cluster Manager Types
• Standalone mode
– Comes bundled (EC2 capable)
• YAR...
© 2014 MapR Technologies 103
Remember
• If you want to use a new technology you must learn that new
technology
• For those...
© 2014 MapR Technologies 104
Configuration
http://spark.apache.org/docs/latest/
Most Important
• Application Configuration...
© 2014 MapR Technologies 105
Resources
• Pig on Spark
– http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-t...
© 2014 MapR Technologies 106
• San Francisco
June 30 – July 2
• Use Cases
• Tech Talks
• Training
http://spark-summit.org/
© 2014 MapR Technologies 107
Q&A
@mapr maprtech
jscott@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
Genomics, Genealogy, and Biometrics / UID use cases for Hadoop and HBase in

Published in: Science, Technology, Business
  • Graph of each step in the pipeline for every run. This graph shows how important it is to measure everything. Some steps have been greatly reduced or eliminated. Light blue is the matching step. You can see it going quadratic and then the change when ‘J’ Jermline was released.
  • Gives up random access read on files
    Gives up strong authentication / authorization model
    Gives up random access write / append on files
  • Historically, the NameNode in Hadoop is a single point of failure, a scalability limitation, and a performance bottleneck.

    With other distributions you have a bottleneck regardless of the number of nodes in the cluster: the maximum number of files you can support is about 200M, and that is with an extremely high-end server. 50% of the Hadoop processing at Facebook goes to packing and unpacking files to work around this limitation.

    As you add more nodes to your cluster and want to configure HA, you have to add expensive NAS and keep warm standbys for the NameNode and the related metadata, which is persisted in memory. Even more, once you surpass the file limit in HDFS, you have to add regional NameNode servers to support those additional nodes: a “federated NameNode” approach.

    Think of the additional dedicated hardware and configurations/administration required to set up NameNode HA in Hadoop! And this is ONLY for NameNode HA.
  • What if you could distribute the NameNode metadata and have it share resources in your cluster? What if Hadoop was a truly distributed environment?

    With MapR there is no dedicated NameNode. The NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data loss avoidance, scalability and performance.

    (advantages of this approach are called out on the left and right sides of the diagram)
  • Because of architecture:
    Apache HBase runs in a JVM and reads/writes to HDFS, which runs in a separate JVM, storing data in the Linux OS, which in turn reads and writes to disk.
    As data is collected, it needs to be written to disk and “compacted” (i.e., maintenance is performed); this introduces many layers and steps.
    MapR M7 has integrated tables and files in a true file system, reading and writing directly on disks.
    MapR M7 is a tightly integrated, in-Hadoop database: a NoSQL columnar store that is 100% Apache HBase API compatible.
  • **Consistent** low latency on reads: no compaction-induced latency spikes
    Recall Aadhaar
    Why?
  • Spark is really cool…
  • When do you use regular MapReduce over higher-level languages? When Hive? When Pig? When anything else?
  • You can find project resources on the Apache Spark site. You’ll also find information about the mailing list there (including archives)
  • Yahoo and Adobe are in production with Spark.
  • This sounds a lot like the reason to consider Pig vs. Java MapReduce
  • Gracefully
  • Looks kind of like a source control tree
  • You can import the MLlib to use here in the shell!
  • Best use case? Standalone followed by Mesos… My personal opinion is that Mesos is where the future will take us.
  • Don’t forget to share your experiences. This is really what the community is about.

    Don’t have time to contribute to open source, use it and share your experiences!
  • This isn’t all proven out yet, but some of it should just work already.
  • This is a really simple example. Reality is 22 chromosomes and 96 characters in a word
  • ‘G’ Germline would have to rebuild the hash table for all samples and then re-run all comparisons. An all by all comparison
  • This is where HBase shines. It is easy to add columns and rows, very efficient with empty cells (sparse matrix). Hammer HBase with multiple processes doing this at the same time.
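The matching scheme these notes describe can be sketched in plain Python (a dict stands in for the HBase table; in the real system the words are 96 characters long across 22 chromosomes):

```python
from collections import defaultdict

# word (a DNA segment at a fixed position) -> set of sample ids sharing it.
# In production this index lives in an HBase table; a dict stands in here.
index = defaultdict(set)

def add_sample(sample_id, words):
    """Incremental insert: compare the new sample only against samples
    that already share a word, instead of an all-by-all rerun."""
    candidates = set()
    for pos, w in enumerate(words):
        key = (pos, w)
        candidates |= index[key]
        index[key].add(sample_id)
    return candidates  # samples with at least one matching segment

add_sample("s1", ["GATT", "ACAG"])
add_sample("s2", ["GATT", "TTTT"])
hits = add_sample("s3", ["CCCC", "ACAG"])
assert hits == {"s1"}   # s3 shares the segment at position 1 with s1 only
```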
  • Transcript of "2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China"

    1. 1. © 2014 MapR Technologies 1© 2014 MapR Technologies Genomics Use Cases @ MapR
    2. 2. © 2014 MapR Technologies 2© 2014 MapR Technologies DNA Sequencing Company
    3. 3. © 2014 MapR Technologies 3 Parallelize Primary Analytics .fastq .vcf short read alignment genotype callingreads & mappings
    4. 4. © 2014 MapR Technologies 4 Sequence Analysis, Quick Overview […] G A C T A G A fragment1 A C A G T T T A C A fragment2 A G A T A - - A G A fragment3 A A C A G C T T A C A […] fragment4 C T A T A G A T A A fragment5 […] G A T T A C A G A T T A C A G A T T A C A […] referenceDNA […] G A C T A C A G A T A A C A G A T T A C A […] sampleDNA
    5. 5. © 2014 MapR Technologies 5 What is the (Probable) Color of Each Column?
    6. 6. © 2014 MapR Technologies 6 Which Columns are (probably) Not White? Strategy 1: examine foreach column, foreach row O(rows*cols) + O(1 col) memory
    7. 7. © 2014 MapR Technologies 7 Which Columns are (probably) Not White? Strategy 2: examine foreach row. keep running tallies O(rows) + O(rows*cols) memory
    8. 8. © 2014 MapR Technologies 8 Which Columns are (probably) Not White? Strategy 3: rotate matrix. examine foreach column O(rows log rows) + O(cols) + O(1 col) memory
    9. 9. © 2014 MapR Technologies 9 Comparison of Strategies Strategy 1 • Low mem req • Random access pattern, many ops Strategy 3 • Low mem req • Sequential access pattern • Requires Sort Strategy 2 • High mem req • Sequential access pattern O(rows*cols) + O(1 col) memory O(rows) + O(rows*cols) memory O(rows log rows) + O(cols) + O(1 col) memory
    10. 10. © 2014 MapR Technologies 10 Comparison of Strategies Strategy 1 • Low mem req • Random access pattern, many ops Strategy 3 • Low mem req • Sequential access pattern • Requires Sort Strategy 2 • High mem req • Sequential access pattern O(rows*cols) + O(1 col) memory O(rows) + O(rows*cols) memory O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) memory As # of rows & columns increases Strategy 3 becomes more attractive
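The three strategies compared above can be sketched in plain Python (a toy grid of 0/1 pixels stands in for the image; any 1 makes a column "not white"):

```python
from collections import Counter

grid = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]  # rows x cols; 1 = non-white pixel

def strategy1(grid):
    """Foreach column, foreach row: O(rows*cols) work, O(1 col) memory,
    but a column-major (random) access pattern over row-major storage."""
    cols = len(grid[0])
    return {c for c in range(cols) if any(row[c] for row in grid)}

def strategy2(grid):
    """Foreach row, keep running tallies: sequential access pattern,
    but O(cols) of tally state held in memory the whole time."""
    tally = Counter()
    for row in grid:
        for c, px in enumerate(row):
            tally[c] += px
    return {c for c, n in tally.items() if n > 0}

def strategy3(grid):
    """Rotate the matrix first (the sort/shuffle step), then scan each
    column sequentially with O(1 col) memory."""
    rotated = list(zip(*grid))          # column c becomes row c
    return {c for c, col in enumerate(rotated) if any(col)}

assert strategy1(grid) == strategy2(grid) == strategy3(grid) == {1, 2}
```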
    11. 11. © 2014 MapR Technologies 11 Primary Sequence Analysis (ETL), MapReduce style .fastq .bam .vcf short read alignment genotype calling MAP MAP REDUCE, rotate matrix 90º (O(mn)) / 1 (O(mn) + O(n log n)) / s Hello!
    12. 12. © 2014 MapR Technologies 12 Clinical Applications: Performance Matters MapR FilesystemN F S DNA Sequencer DNA Sequencer DNA Sequencer Raw DNARaw DNARaw DNA 1º Analytics Raw DNARaw DNASNP calls Static Clinical Reporting PhysicianPatient Reference DBs SNP DB ETL 2º Analytics ResearcherSubject
    13. 13. © 2014 MapR Technologies 13 Variant Collection Enables Downstream Apps • GWAS Association Studies • Versioned, Personalized Medicine • Companion Diagnostics SNP DB 2º Analytics New Markets Hello! More linear algebra  [Spark, Summingbird, Lambda Architecture Slides]
    14. 14. © 2014 MapR Technologies 14 The Post-Sequencing Genomics Workload Sboner, et al, 2011. The real cost of sequencing: higher than you think!
    15. 15. © 2014 MapR Technologies 15 Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc
    16. 16. © 2014 MapR Technologies 16 Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc • In context of, e.g. – ε1: Racial, etc. background – ε2: Experimental design- specific concerns (e.g. familial IBD/IBS) – ε3: Environmental factors and penetrance – ε4: Assay-specific biases and noise phenotype = αgenotype + β + ε1 + ε2 + ε3 + ε4 At risk of over-simplifying as business-level concept…
    17. 17. © 2014 MapR Technologies 17 HUGE PROBLEM COMBINATORIAL EXPLOSION
    18. 18. © 2014 MapR Technologies 18 What’s a Percolator? • Google Percolator – “Caffeine” update 2010 • Iterative, incremental prioritized updates • No batch processing • Decouple computational results from data size Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
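A toy sketch of the percolator idea in Python (illustrative names, not Google's actual Percolator API): instead of a batch recompute over all data, each update marks only the keys it touches, so work tracks update size rather than data size.

```python
# Toy incremental processor: percolate-style updates vs. batch recompute.
data = {}        # key -> raw value
derived = {}     # key -> computed result
dirty = set()    # keys touched since the last percolation

def ingest(key, value):
    data[key] = value
    dirty.add(key)          # notification: only this key needs work

def percolate():
    """Reprocess only dirty keys; cost tracks update size, not data size."""
    processed = 0
    while dirty:
        key = dirty.pop()
        derived[key] = data[key] * 2   # stand-in "computation"
        processed += 1
    return processed

for i in range(1000):
    ingest(i, i)
percolate()                 # initial load: all 1000 keys processed

ingest(7, 70)               # one incremental update...
assert percolate() == 1     # ...costs one unit of work, not 1000
assert derived[7] == 140
```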
    19. 19. © 2014 MapR Technologies 19 Solution: Percolate SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies Denormalize and Percolate (re)prioritize & (re)process service queries drive dashboards create reports denormalize for display buffer New models
    20. 20. © 2014 MapR Technologies 20 Robot Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
    21. 21. © 2014 MapR Technologies 21 Robot (Data?) Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
    22. 22. © 2014 MapR Technologies 22© 2014 MapR Technologies Genealogy Company Slides credit: Bill Yetman, Hadoop Summit 2014 http://slidesha.re/1vRh3kY
    23. 23. © 2014 MapR Technologies 23 GERMLINE is… • …an algorithm that finds hidden relationships within a pool of DNA • …the reference implementation of that algorithm, written in C++. You can find it here: http://www1.cs.columbia.edu/~gusev/germline/
    24. 24. © 2014 MapR Technologies 24 Projected GERMLINE run times (in hours) [chart: hours vs. samples from 2,500 to 122,500; measured and projected GERMLINE run times] 700 hours = 29+ days EXPONENTIAL COMPLEXITY
    25. 25. © 2014 MapR Technologies 25 GERMLINE: What’s the Problem? • GERMLINE (the implementation) was not meant to be used in an industrial setting – Stateless, single threaded, prone to swapping (heavy memory usage) – GERMLINE performs poorly on large data sets • Our metrics predicted exactly where the process would slow to a crawl • Put simply: GERMLINE couldn’t scale
    26. 26. © 2014 MapR Technologies 26 Run times for matching (in hours) [chart: hours vs. samples; GERMLINE run times (EXPONENTIAL) vs. Jermline run times (LINEAR) after the HBase refactor]
    27. 27. © 2014 MapR Technologies 27 • Paper submitted describing the implementation • Releasing as an Open Source project soon • [HBase Schema/Algorithm Slides]
    28. 28. © 2014 MapR Technologies 28© 2014 MapR Technologies Further Growth & Optimization
    29. 29. © 2014 MapR Technologies 29 Underdog (Strand Phasing) performance – Went from 12 hours to process 1,000 samples to under 25 minutes with a MapReduce implementation – With improved accuracy! Underdog replaces Beagle [chart: total Beagle-Underdog duration vs. total run size, up to 80,000 samples]
    30. 30. © 2014 MapR Technologies 30 Pipeline steps and incremental change… – Incremental change over time – Supporting the business in a “just in time” Agile way [chart: per-step pipeline durations across runs; steps include Beagle-Underdog Phasing, Pipeline Finalize, Relationship Processing, Germline-Jermline Results Processing, Germline-Jermline Processing, Beagle Post Phasing, Admixture, Plink Prep, and Pipeline Initialization; milestones: Jermline replaces Germline, Ethnicity V2 Release, Underdog Replaces Beagle, AdMixture on Hadoop]
    31. 31. © 2014 MapR Technologies 31 …while the business continues to grow rapidly [chart: DNA database size, # of processed samples growing to 400,000+, Jan-12 through Apr-14]
    32. 32. © 2014 MapR Technologies 32© 2014 MapR Technologies BigData App Development Lifecycle
    33. 33. © 2014 MapR Technologies 33 BigData App Development Lifecycle outputinput 1M rows tail | grep | sort | uniq -c
    34. 34. © 2014 MapR Technologies 34 Evolution of Data Storage Functionality Compatibility Scalability Linux POSIX Over decades of progress, Unix-based systems have set the standard for compatibility and functionality
    35. 35. © 2014 MapR Technologies 35 BigData App Development Lifecycle outputinput 1M rows tail | grep | sort | uniq -c
    36. 36. © 2014 MapR Technologies 36 BigData App Development Lifecycle tail | grep | sort | uniq -c outputinput 1M rows 1B rows
    37. 37. © 2014 MapR Technologies 37 BigData App Development Lifecycle tail | grep | sort | uniq -c outputinput 1M rows 1B rows 1T rows
    38. 38. © 2014 MapR Technologies 38 Evolution of Data Storage Functionality Compatibility Scalability Linux POSIX Hadoop Hadoop achieves much higher scalability by trading away essentially all of this compatibility
    39. 39. © 2014 MapR Technologies 39 BigData App Development Lifecycle tail | grep | sort | uniq -c outputinput 1T rows 1T rows input output Port to BigData Tools ($$$$)
    40. 40. © 2014 MapR Technologies 40 Evolution of Data Storage Functionality Compatibility Scalability Linux POSIX Hadoop MapR enhances Apache Hadoop by restoring the compatibility while increasing scalability and performance
    41. 41. © 2014 MapR Technologies 41 BigData App Development Lifecycle tail | grep | sort | uniq -c outputinput 1T rows POSIX (NFS) Hadoop HDFS Port
    42. 42. © 2014 MapR Technologies 42 BigData App Development Lifecycle tail | grep | sort | uniq -c 1 1 1 1 100 100 100 100 Prototype Tools Dev Cost BigData Tools Dev Cost Use When Possible Use When Needed
    43. 43. © 2014 MapR Technologies 43 BigData App Development Lifecycle tail | grep | sort | uniq -c 1 1 1 100 Prototype Tools Dev Cost BigData Tools Dev Cost Use When Possible Use When Needed
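The prototype-tools pipeline on these slides is deliberately cheap to write; a rough in-process Python equivalent of tail | grep | sort | uniq -c (illustrative only; sorting the counted output is equivalent to sort | uniq -c over sorted input):

```python
from collections import Counter

log_lines = [
    "GET /a", "GET /b", "POST /a", "GET /a",
    "GET /c", "POST /a", "GET /a",
]

def tail_grep_sort_uniq_c(lines, n=5, pattern="GET"):
    """tail -n 5 | grep GET | sort | uniq -c, done in-process."""
    tail = lines[-n:]                               # tail -n 5
    grepped = [l for l in tail if pattern in l]     # grep GET
    counts = Counter(grepped)                       # uniq -c
    return sorted(counts.items())                   # sort

result = tail_grep_sort_uniq_c(log_lines)
assert result == [("GET /a", 2), ("GET /c", 1)]
```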
    44. 44. © 2014 MapR Technologies 44© 2014 MapR Technologies Aadhaar – World’s Largest Biometric Database
    45. 45. © 2014 MapR Technologies 45 Largest Biometric Database in the World 1.2B PEOPLE
    46. 46. © 2014 MapR Technologies 46 India: Problem • 1.2 billion residents – 640,000 villages, ~60% lives under $2/day – ~75% literacy, <3% pays Income Tax, <20% banking – ~800 million mobile, ~200-300 mn migrant workers • Govt. spends about $25-40 billion on direct subsidies – Residents have no standard identity document – Most programs plagued with ghost and multiple identities causing leakage of 30-40%
    47. 47. © 2014 MapR Technologies 47 India: Vision • Create a common “national identity” for every “resident” – Biometric backed identity to eliminate duplicates – “Verifiable online identity” for portability • Applications ecosystem using open APIs – Aadhaar enabled bank account and payment platform – Aadhaar enabled electronic, paperless KYC • Enrolment – One time in a person’s lifetime – Multi-modal biometrics (fingerprints, iris)
    48. 48. © 2014 MapR Technologies 48 Aadhaar Biometric Capture & Index
    49. 49. © 2014 MapR Technologies 49 Aadhaar Biometric Capture & Index
    50. 50. © 2014 MapR Technologies 50 Aadhaar Biometric Capture & Index
    51. 51. © 2014 MapR Technologies 51 Architectural Principles • Design for Scale – Every component needs to scale to large volumes – Millions of transactions and billions of records – Accommodate failure and design for recovery • Open Architecture – Open Source – Open APIs • Security – End-to-end security of resident data
    52. 52. © 2014 MapR Technologies 52 Design for Scale • Horizontal scale-out • Distributed computing • Distributed data storage and partitioning • No single points of failure • No single points of bottleneck • Asynchronous processing throughout the system – Allows loose coupling various components – Allows independent component level scaling
    53. 53. © 2014 MapR Technologies 53 MapR Filesystem Aadhaar Multi-DC Data Storage Stack* ID + Biometrics (M7 HBase) All raw packets (HDFS+NFS) Enrollment ID API ID + Demo + Photo + Benefits (MySQL, Solr) Authentication API Authorization API * as best I understand from public documents
    54. 54. © 2014 MapR Technologies 54 Enrollment Volume • 600 to 800 million UIDs in 4 years – 1 million a day – 200+ trillion matches every day!!! • ~5MB per resident – Maps to about 10-15 PB of raw data (2048-bit PKI encrypted!) – About 30 TB I/O every day – Replication and backup across DCs of about 5+ TB of incremental data every day – Lifecycle updates and new enrolments will continue for ever • Additional process data – Several million events on an average moving through async channels (some persistent and some transient) – Needing complete update and insert guarantees across data stores
    55. 55. © 2014 MapR Technologies 55 Authentication Volume • 100+ million authentications per day (10 hrs) – Possible high variance on peak and average – Sub second response – Guaranteed audits • Multi-DC architecture – All changes needs to be propagated from enrolment data stores to all authentication sites • Authentication request is about 4 K – 100 million authentications a day – 1 billion audit records in 10 days (30+ billion a year) – 4 TB encrypted audit logs in 10 days – Audit write must be guaranteed
    56. 56. © 2014 MapR Technologies 56 How Do Biometrics Relate to Genomics? Data Shape and Size • Aadhaar: 5MB features (minutia) • Genome: ~3M features (variants) Data Set Operations • Aadhaar: ƒ(x) Unique feature subset => identity • Genome: “ “ “ “ “ • Genome: Variant × Phenotype Commonality => Causal Genes ƒ-1(x) ! SNP DB 2º Analytics
    57. 57. © 2014 MapR Technologies 57 Data Shape and Size • Aadhaar: 5MB features (minutia) • Genome: ~3M features (variants) Data Set Operations • Aadhaar: ƒ(x) Unique feature subset => identity • Genome: “ “ “ “ “ • Genome: Variant × Phenotype Commonality => Causal Genes ƒ-1(x) ! Vector Pattern Matching SNP DB 2º Analytics ƒ-1(x): common features ƒ(x): unique features ƒ(x): uncommon features ƒ(x): other features
    58. 58. © 2014 MapR Technologies 58 Data Shape and Size • Aadhaar: 5MB features (minutia) • Genome: ~3M features (variants) Data Set Operations • Aadhaar: ƒ(x) Unique feature subset => identity • Genome: “ “ “ “ “ • Genome: Variant × Phenotype Commonality => Causal Genes ƒ-1(x) ! Topological Pattern Matching SNP DB 2º Analytics
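The set operations sketched on these three slides can be made concrete with toy data (illustrative feature labels, not real minutiae or variants):

```python
# Identity, f(x): a subset of features unique to one record identifies it.
people = {
    "p1": {"f1", "f2", "f9"},
    "p2": {"f1", "f3", "f7"},
    "p3": {"f2", "f3", "f8"},
}

def identify(probe_features):
    """Match a probe feature set against every stored record."""
    return [pid for pid, feats in people.items() if probe_features <= feats]

assert identify({"f1", "f9"}) == ["p1"]   # unique subset -> one identity

# Causal-gene hunting is roughly the inverse, f^-1(x): intersect the
# variants of everyone sharing a phenotype to find the common features.
cases = [{"v1", "v2", "v5"}, {"v2", "v5", "v7"}, {"v2", "v4", "v5"}]
shared = set.intersection(*cases)
assert shared == {"v2", "v5"}
```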
    59. 59. © 2014 MapR Technologies 59© 2014 MapR Technologies MapR Platform
    60. 60. © 2014 MapR Technologies 60 Apache Hadoop NameNode High Availability (HA) [diagram: HDFS-based distributions] HDFS: single point of failure; limited to 50-200 million files; performance bottleneck; metadata must fit in memory. HDFS HA: only one active NameNode; limited to 50-200 million files; commercial NAS possibly needed; metadata must fit in memory; performance bottleneck; double the block reports. HDFS Federation: multiple single points of failure w/o HA; needs 20 NameNodes for 1 billion files; commercial NAS needed; metadata must fit in memory; performance bottleneck; double the block reports.
    61. No NameNode Architecture — metadata is distributed across the DataNodes themselves. Up to 1T files (a >5000× advantage); significantly less hardware & OpEx; higher performance; no special config to enable HA; automatic failover & re-replication; metadata is persisted to disk.
    62. MapR M7: The Best In-Hadoop Database — NoSQL columnar store; Apache HBase API; in-Hadoop database. Other distros: HBase JVM over HDFS JVM over ext3/ext4 over disks. MapR M7: tables/files directly over disks. The most scalable, enterprise-grade NoSQL database that supports online applications and analytics.
    63. (Same architecture slide, with a Big Data application on top: HBase interface and HDFS interface JVMs in other distros vs. tables/files directly over disks in MapR M7.)
    64. HBase Apps: High Performance with Consistent Low Latency (chart: M7 read latency vs. others' read latency)
    65. MapR Services
    66. Professional Services — engagements, audience, and skills by area. Hadoop Core Services (audience: IT/infrastructure; skills: Linux, networking, data center storage operations): installation, migrations, SLA plans, best practices, performance tuning. Big Data Workflows (audience: BI/DBA; skills: BI/ETL/reporting, scripting/Java, Hadoop MR, eco projects such as HBase and Hive): Hive/Pig, Oozie/Sqoop, Flume, M7/HBase, data flow. Solution Design (audience: Java Hadoop developer; skills: architectural design): HBase/M7, Map/Reduce, application development, integration development. Advanced Analytics (audience: modeler/analyst; skills: PhD statistics/math, MatLab/R/SAS, scripting/Java, BI/ETL/reporting, data engineering, data science): use case discovery, use case modeling, POC, workshops.
    67. Global PS Resources — 17 today (+8 in Q3). D.C.: Keys Botzum (DE/Security/Developer), Joe Blue (Data Scientist), Venkat Gunnup (DE/Development), Alex Rodriguez (DE/Development), Kannappan Sirchabesa (DE/OPS). San Jose: Wayne Cappas (Director/DE), John Benninghoff (DE/OPS), Dmitry Gomerman (DE/OPS & Security), Ivan Bishop (DE/OPS), James Caseletto (Data Scientist), Sungwook Yoon (Data Scientist), Sridhar Reddy (Director, M7/HBase). Los Angeles: John Ewing (DE/OPS), Marco Vasquez (Data Scientist/DE). South Carolina: David Schexnayder (DE/OPS). Phoenix: Michael Farnbach (DE/OPS). Singapore: Allen Day (Data Scientist).
    68. Use Case Data Flow Example — Data sources: clickstream, billing data, mobile data, product catalog, social media, server logs, merchant listings, online chat, call detail records, set-top box data. Ingest: Sqoop, Flume, HDFS/NFS access. Processing and analytics on the MapR Data Platform: MapReduce v1 & v2, YARN, Pig, Hive, Cascading, Storm, Tez, Drill, Impala, Solr, Mahout, MLlib, Oozie, M7/HBase. Output: visualization.
    69. Engagement Types • Customer engagement is typically 1–4 weeks (longer okay) • Well-established partners (15,000 resources globally) • Custom training based on customer use case • Small 1–3 day workshops • Extended support / staff augmentation
    70. Q&A — Thanks! twitter.com/allenday • aday@mapr.com • slideshare.net/allenday • linkedin.com/in/allenday
    71. An Overview of Apache Spark
    72. Agenda • MapReduce Refresher • What is Spark? • The Difference with Spark • Preexisting MapReduce • Examples and Resources
    73. MapReduce Refresher
    74. MapReduce Basics • Foundational model is based on a distributed file system – Scalability and fault-tolerance • Map – Loads the data and defines a set of keys • Reduce – Collects the organized key-based data to process and output • Performance can be tuned based on known details of your source files and cluster shape (size, node count)
    75. Languages and Frameworks • Languages – Java, Scala, Clojure – Python, Ruby • Higher-Level Languages – Hive – Pig • Frameworks – Cascading, Crunch • DSLs – Scalding, Scrunch, Scoobi, Cascalog
    76. MapReduce Processing Model • Define mappers • Shuffling is automatic • Define reducers • For complex work, chain jobs together – Or use a higher-level language or DSL that does this for you
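The map → shuffle → reduce flow above can be sketched in plain Python. This is a minimal single-process model of the programming model, not a distributed implementation; the word-count mapper and reducer are illustrative choices:

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Map: the mapper emits (key, value) pairs for each input record
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    # Shuffle: group all values by key -- automatic in MapReduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    # Reduce: collapse each key's value list into an output value
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the canonical mapper/reducer pair
mapper = lambda line: ((word, 1) for word in line.split())
reducer = lambda key, values: sum(values)

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines, mapper)), reducer)
print(counts["the"])  # 2
```

Chaining jobs, as the slide notes, amounts to feeding one job's reduce output into the next job's map phase.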
    77. What is Spark?
    78. Apache Spark — spark.apache.org • github.com/apache/spark • user@spark.apache.org • Originally developed in 2009 in UC Berkeley's AMP Lab • Fully open-sourced in 2010 – now a Top-Level Project at the Apache Software Foundation
    79. The Spark Community
    80. Spark is the Most Active Open Source Project in Big Data (chart: project contributors in the past year — Spark far ahead of Giraph, Storm, and Tez)
    81. Unified Platform — Spark (general execution engine) underpins Shark (SQL), Spark Streaming (streaming), MLlib (machine learning), and GraphX (graph computation). Continued innovation brings new functionality, e.g.: • Java 8 (closures, lambda expressions) • Spark SQL (SQL on Spark, not just Hive) • BlinkDB (approximate queries) • SparkR (R wrapper for Spark)
    82. Supported Languages • Java • Scala • Python • Hive?
    83. Data Sources • Local files – file:///opt/httpd/logs/access_log • S3 • Hadoop Distributed Filesystem – Regular files, sequence files, any other Hadoop InputFormat • HBase
    84. Machine Learning – MLlib • K-Means • L1- and L2-regularized Linear Regression • L1- and L2-regularized Logistic Regression • Alternating Least Squares • Naive Bayes • Stochastic Gradient Descent * As of May 14, 2014 ** Don't be surprised if you see the Mahout library converting to Spark soon
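For a flavor of what MLlib's K-Means computes, here is a minimal pure-Python sketch of Lloyd's algorithm on 1-D points. This is illustrative only; MLlib's implementation is distributed and uses smarter initialization, and `kmeans_1d` is a hypothetical helper name:

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D data: assign points, then re-center."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to its cluster's mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious clusters around 1.5 and 9.5
print(kmeans_1d([1.0, 2.0, 9.0, 10.0], centers=[0.0, 5.0]))  # [1.5, 9.5]
```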
    85. The Difference with Spark
    86. Easy and Fast Big Data • Easy to develop – Rich APIs in Java, Scala, Python – Interactive shell • Fast to run – General execution graphs – In-memory storage • 2–5× less code • Up to 10× faster on disk, 100× in memory
    87. Resilient Distributed Datasets (RDD) • Spark revolves around RDDs • Fault-tolerant collections of elements that can be operated on in parallel – Parallelized collection: a Scala collection run in parallel – Hadoop dataset: records of files supported by Hadoop http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
    88. RDD Operations • Transformations – Create a new dataset from an existing one • map, filter, distinct, union, sample, groupByKey, join, etc. • Actions – Return a value after running a computation • collect, count, first, takeSample, foreach, etc. Check the documentation for a complete list: http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
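The key property behind this split is that transformations are lazy and only actions trigger computation. The semantics can be modeled in plain Python with generators (a sketch of the behavior, not Spark's API; the `rdd.*` calls in comments are the real Spark equivalents):

```python
# Transformations: build up a pipeline lazily -- nothing runs yet
data = range(10)
mapped = (x * x for x in data)                 # like rdd.map(lambda x: x * x)
filtered = (x for x in mapped if x % 2 == 0)   # like rdd.filter(lambda x: x % 2 == 0)

# Action: forces evaluation of the whole pipeline, like rdd.collect()
result = list(filtered)
print(result)  # [0, 4, 16, 36, 64]
```

Because nothing executes until an action, Spark can inspect the whole transformation chain and schedule it efficiently.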
    89. RDD Persistence / Caching • Variety of storage levels – MEMORY_ONLY (default), MEMORY_AND_DISK, etc. • API calls – persist(StorageLevel) – cache() – shorthand for persist(StorageLevel.MEMORY_ONLY) • Considerations – Read from disk vs. recompute (MEMORY_AND_DISK) – Total memory storage size (MEMORY_ONLY_SER) – Replicate to a second node for faster fault recovery (MEMORY_ONLY_2) • Consider this option if supporting a web application http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
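The payoff of persist()/cache() is that an RDD consumed by several actions is computed once instead of once per action. A plain-Python sketch of that effect, where the hypothetical `expensive_parse` stands in for a costly transformation chain:

```python
calls = 0

def expensive_parse(x):
    # Hypothetical costly transformation; counts how often it runs
    global calls
    calls += 1
    return x * 2

data = [1, 2, 3]

# Uncached: each "action" re-runs the transformation, like an
# unpersisted RDD being recomputed from source for every action
uncached = lambda: (expensive_parse(x) for x in data)
total = sum(uncached())    # first action: 3 computations
biggest = max(uncached())  # second action: 3 more computations
assert calls == 6

# Cached: materialize once and reuse -- the effect of rdd.cache()
calls = 0
cached = [expensive_parse(x) for x in data]
total = sum(cached)
biggest = max(cached)
assert calls == 3
```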
    90. Cache Scaling Matters (chart: execution time vs. % of working set in cache — 69s with cache disabled, 58s at 25%, 41s at 50%, 30s at 75%, 12s fully cached)
    91. Directed Acyclic Graph (DAG) • Directed – Only in a single direction • Acyclic – No looping • Why does this matter? – This supports fault-tolerance
    92. RDD Fault Recovery — RDDs track lineage information that can be used to efficiently recompute lost data: msgs = textFile.filter(lambda s: s.startsWith("ERROR")).map(lambda s: s.split("\t")[2]) — HDFS file → filter (func = startsWith(…)) → filtered RDD → map (func = split(…)) → mapped RDD
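Lineage-based recovery can be sketched as: a partition remembers how it was derived, so a lost partition is rebuilt by replaying its transformations from durable input rather than restoring a replica. A toy model (the `Partition` class and its names are illustrative, not Spark internals):

```python
class Partition:
    """Toy model: a partition stores its source and its lineage."""
    def __init__(self, source, lineage):
        self.source = source    # durable input (e.g. a file's records)
        self.lineage = lineage  # ordered list of transformations
        self.data = None        # in-memory result; may be lost

    def compute(self):
        # Replay the lineage over the source to (re)build the data
        records = self.source
        for transform in self.lineage:
            records = transform(records)
        self.data = list(records)
        return self.data

# The ERROR-filtering pipeline from the slide, modeled as lineage
log = ["ERROR\tdisk\tfull", "INFO\tok\tboot", "ERROR\tnet\tdown"]
part = Partition(log, [
    lambda rs: (r for r in rs if r.startswith("ERROR")),  # filter
    lambda rs: (r.split("\t")[2] for r in rs),            # map
])

part.compute()
part.data = None            # simulate losing the cached partition
recovered = part.compute()  # rebuilt from lineage, no replication needed
print(recovered)  # ['full', 'down']
```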
    93. Comparison to Storm • Higher throughput than Storm – Spark Streaming: 670k records/sec/node – Storm: 115k records/sec/node – Commercial systems: 100–500k records/sec/node (charts: throughput per node (MB/s) vs. record size (bytes) for WordCount and Grep, Spark vs. Storm)
    94. Interactive Shell • Iterative development – Cache those RDDs – Open the shell and ask questions • We have all wished we could do this with MapReduce – Compile / save your code for scheduled jobs later • Scala – spark-shell • Python – pyspark
    95. The Game Changer!
    96. Preexisting MapReduce
    97. Existing Jobs • Java MapReduce – Port them over if you need better performance • Be sure to share the results and learnings • Pig scripts – Port them over – Try SPORK! • Hive queries…
    98. Shark – SQL over Spark • Hive-compatible (HiveQL, UDFs, metadata) – Works in existing Hive warehouses without changing queries or data! • Augments Hive – In-memory tables and columnar memory store • Fast execution engine – Uses Spark as the underlying execution engine – Low-latency, interactive queries – Scales out and tolerates worker failures
    99. Examples and Resources
    100. Word Count • Java MapReduce (~15 lines of code) • Java Spark (~7 lines of code) • Scala and Python (4 lines of code) – in the interactive shell: skip line 1 and replace the last line with counts.collect() • Java 8 (4 lines of code)

        Java 8:
        SparkContext sc = new SparkContext(master, appName, [sparkHome], [jars]);
        JavaRDD<String> file = sc.textFile("hdfs://...");
        JavaPairRDD<String, Integer> counts =
            file.flatMap(line -> Arrays.asList(line.split(" ")))
                .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
                .reduceByKey((x, y) -> x + y);
        counts.saveAsTextFile("hdfs://...");

        Scala:
        val sc = new SparkContext(master, appName, [sparkHome], [jars])
        val file = sc.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://...")
    101. Network Word Count – Streaming

        // Create the context with a 1 second batch size
        val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
          System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))
        // Create a NetworkInputDStream on the target host:port and count the
        // words in the input stream of \n-delimited text (e.g. generated by 'nc')
        val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
        val words = lines.flatMap(_.split(" "))
        val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
        wordCounts.print()
        ssc.start()
    102. Deploying Spark – Cluster Manager Types • Standalone mode – Comes bundled (EC2-capable) • YARN • Mesos
    103. Remember • To use a new technology, you must learn that technology • Those who have been using Hadoop for a while once had to learn all about MapReduce and how to manage and tune it • Getting the most out of a new technology includes learning to tune it – There are switches you can use to optimize your work
    104. Configuration http://spark.apache.org/docs/latest/ Most important: • Application configuration – http://spark.apache.org/docs/latest/configuration.html • Standalone cluster configuration – http://spark.apache.org/docs/latest/spark-standalone.html • Tuning guide – http://spark.apache.org/docs/latest/tuning.html
    105. Resources • Pig on Spark – http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html – https://github.com/aniket486/pig – https://github.com/twitter/pig/tree/spork – http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 – https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix • Latest on Spark – http://databricks.com/categories/spark/ – http://www.spark-stack.org/
    106. Spark Summit • San Francisco, June 30 – July 2 • Use cases • Tech talks • Training http://spark-summit.org/
    107. Q&A — Engage with us! @mapr • maprtech • jscott@mapr.com • MapR • maprtech • mapr-technologies
