
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop

http://bit.ly/1BTaXZP – Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and there has been plenty of hype about it in recent months. How much of the discussion is marketing spin, and what are the facts? MapR and Databricks, the company that created and led the development of the Spark stack, will cut through the noise to uncover the practical advantages of having the full set of Spark technologies at your disposal and reveal the benefits of running Spark on Hadoop.

This presentation was given at a webinar hosted by Data Science Central and co-presented by MapR + Databricks.

To see the webinar, please go to: http://www.datasciencecentral.com/video/let-spark-fly-advantages-and-use-cases-for-spark-on-hadoop

Published in: Technology
Notes
  • The power of MapR begins with the power of open source innovation and community participation. In some cases MapR leads the community in projects like Apache Mahout (machine learning) or Apache Drill (SQL on Hadoop). In other areas, MapR contributes to and integrates Apache and other open source software (OSS) projects into the MapR distribution, delivering a more reliable and performant system with lower overall TCO and easier system management. MapR releases a new version with the latest OSS innovations on a monthly basis. We add 2-4 new Apache projects annually as new projects become production ready and based on customer demand.
  • You can find Project Resources on the Apache Incubator site. You’ll also find information about the mailing list there (including archives).
  • One of the most exciting things you’ll find. Growing all the time. NASCAR slide. Including several sponsors of this event that are just starting to get involved… If your logo is not up here, forgive us – it’s hard to keep up!
  • Key idea: add “variables” to the “functions” in functional programming
  • This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  • Gracefully
Transcript of "Let Spark Fly: Advantages and Use Cases for Spark on Hadoop"

    1. © 2014 MapR Technologies
        adawar@mapr.com
        pat.mcdonough@databricks.com
    2. About MapR and Databricks
        Databricks:
        • Project leads for Spark, formerly with UC Berkeley’s AMPLab
        • Founded in June 2013 and backed by Andreessen Horowitz
        • Strong engineering focus
        MapR:
        • Top-ranked distribution for Hadoop*
        • Hundreds of deployments
          – 17 of Fortune 100
          – Largest deployment in FSI (1000+ nodes)
        • Strong focus on making Hadoop resilient and enterprise grade
        • Worldwide presence
        * Forrester Wave Big Data Hadoop Solutions, Q1 2014
    3. Hadoop Evolves
        Make it solid
        • HA: eliminate SPOFs
        • Data Protection: recover from application/user errors
        • Disaster Recovery: data center outages
        • Enterprise Integration: breaking the wall that separates Hadoop from the rest
        • Security & Multi-tenancy: sharing the cluster and meeting SLAs, secure authorization, data governance
        Make it do more (easily)
        • Interactive apps (e.g. SQL)
        • Iterative programs
        • Streaming apps
        • Medium/Small Data
        • Architecture: using memory efficiently
        • How many different tools should it take?
          – It’s hard to get interoperability amongst different data-parallel models right
          – Learning curves and operational costs increase with each new tool
    4. MapR – Top ranked Hadoop distribution
        [Diagram: Management, the MapR Data Platform, and the Apache Hadoop and OSS ecosystem. Groupings shown: Execution Engines, Batch, Streaming, NoSQL & Search, ML/Graph, SQL, Workflow & Data Governance, Data Integration & Access, Provisioning/coordination, Security, Data Governance and Operations. Components shown: YARN, MapReduce v1 & v2, Tez*, Pig, Cascading, Storm*, HBase, Solr, Accumulo*, Mahout, Hive, Impala, Drill*, Sentry*, Oozie, ZooKeeper, Sqoop, Knox*, Whirr, Falcon*, Flume, HttpFS, Hue, Juju, Savannah*.]
        * Certification/support planned for 2014
        Enterprise-grade
        • High availability
        • Data protection
        • Disaster recovery
        Interoperability
        • Standard file access
        • Standard database access
        • Pluggable services
        • Broad developer support
        Security
        • Enterprise security authorization
        • Wire-level authentication
        • Data governance
        Operational
        • Ability to support predictive analytics, real-time database operations, and high arrival rate data
        Multi-tenancy
        • Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators
        Performance
        • 2X to 7X higher performance
        • Consistent, low latency
        * Forrester Wave Big Data Hadoop Solutions, Q1 2014
    5. MapR – The Only Distribution to Integrate the Complete Apache Spark Stack
        [Diagram: the same platform and ecosystem components as on slide 4, with the Spark stack added: Shark (SQL), Spark Streaming (Streaming), MLlib (Machine learning), and GraphX (Graph computation), on Spark (General execution engine).]
        * Certification/support planned for 2014
    6. Spark on MapR
        • High Performance: world-record performance on disk coupled with in-memory processing advantages
        • Enterprise-grade dependability for Spark: industry-leading enterprise-grade High Availability, Data Protection and Disaster Recovery
        • 24/7 Best-in-class Global Support: strategic partnership with Databricks to ensure enterprise support for the entire stack
        • Can Run Natively on MapR: the Spark stack can also be deployed natively as an independent standalone service on the MapR cluster
    7. Apache Spark
    8. Apache Spark
        spark.apache.org
        github.com/apache/spark
        user@spark.apache.org
        • Originally developed in 2009 in UC Berkeley’s AMP Lab
        • Fully open sourced in 2010
        • Top-level Apache Project as of 2014
    9. The Spark Community
    10. Spark is The Most Active Open Source Project in Big Data
        [Bar chart: project contributors in past year, axis 0 to 140; labeled bars include Giraph, Storm, and Tez.]
    11. Spark: Easy and Fast Big Data
        Easy to Develop
        > Rich APIs in Java, Scala, Python
        > Interactive shell
        Fast to Run
        > General execution graphs
        > In-memory storage
    12. Spark: Easy and Fast Big Data
        Easy to Develop
        > Rich APIs in Java, Scala, Python
        > Interactive shell
        Fast to Run
        > General execution graphs
        > In-memory storage
        2-5× less code
        Up to 10× faster on disk, 100× in memory
    13. Easy: Get Started Immediately
        • Multi-language support
        • Interactive Shell
        Python
            lines = sc.textFile(...)
            lines.filter(lambda s: "ERROR" in s).count()
        Scala
            val lines = sc.textFile(...)
            lines.filter(x => x.contains("ERROR")).count()
        Java
            JavaRDD<String> lines = sc.textFile(...);
            lines.filter(new Function<String, Boolean>() {
              public Boolean call(String s) { return s.contains("ERROR"); }
            }).count();
    14. Easy: Get Started Immediately
        Same content as slide 13, with one more language:
        Java 8 (Coming Soon)
            JavaRDD<String> lines = sc.textFile(...);
            lines.filter(x -> x.contains("ERROR")).count();
    15. Easy: Clean API
        Resilient Distributed Datasets
        • Collections of objects spread across a cluster, stored in RAM or on disk
        • Built through parallel transformations
        • Automatically rebuilt on failure
        Operations
        • Transformations (e.g. map, filter, groupBy)
        • Actions (e.g. count, collect, save)
        Write programs in terms of transformations on distributed datasets
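The split between transformations and actions on slide 15 can be illustrated without a cluster. The sketch below is plain Python, not Spark: `LazyDataset` is a hypothetical stand-in that only records `map` and `filter` steps and does no work until an action (`count` or `collect`) is called. It assumes an in-memory list as the data source and ignores partitioning and distribution entirely.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not the Spark API).

    Transformations are recorded; nothing runs until an action is called.
    """

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable, e.g. a list
        self._steps = tuple(steps)   # recorded (kind, function) pairs

    # Transformations: return a new dataset, do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Replay the recorded steps over the source, lazily.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger evaluation and return a plain Python value.
    def count(self):
        return sum(1 for _ in self._run())

    def collect(self):
        return list(self._run())


log_lines = ["INFO start", "ERROR disk full", "WARN slow", "ERROR timeout"]
errors = LazyDataset(log_lines).filter(lambda s: "ERROR" in s)
error_count = errors.count()
messages = errors.map(lambda s: s.split(" ", 1)[1]).collect()
```

Because each transformation returns a new object that remembers its steps, the same `errors` dataset can feed several actions, which is the property that makes caching and recomputation possible in the real system.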
    16. Easy: Expressive API
        map, reduce
    17. Easy: Expressive API
        map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
    18. Easy: Example – Word Count
        Hadoop MapReduce
            public static class WordCountMapClass extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
              private final static IntWritable one = new IntWritable(1);
              private Text word = new Text();

              public void map(LongWritable key, Text value,
                              OutputCollector<Text, IntWritable> output,
                              Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer itr = new StringTokenizer(line);
                while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  output.collect(word, one);
                }
              }
            }

            public static class WordCountReduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
              public void reduce(Text key, Iterator<IntWritable> values,
                                 OutputCollector<Text, IntWritable> output,
                                 Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                  sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
              }
            }
        Spark
            val spark = new SparkContext(master, appName, [sparkHome], [jars])
            val file = spark.textFile("hdfs://...")
            val counts = file.flatMap(line => line.split(" "))
                             .map(word => (word, 1))
                             .reduceByKey(_ + _)
            counts.saveAsTextFile("hdfs://...")
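To make the three steps of the Spark word count concrete, here is a minimal single-machine sketch in plain Python. The helpers `flat_map` and `reduce_by_key` are hypothetical names chosen to mirror `flatMap` and `reduceByKey`; they are not Spark functions, and the two input lines are made-up sample data.

```python
from itertools import chain


def flat_map(fn, items):
    """Apply fn to each item and flatten the resulting sequences."""
    return chain.from_iterable(fn(item) for item in items)


def reduce_by_key(fn, pairs):
    """Combine the values of each key pairwise with fn."""
    totals = {}
    for key, value in pairs:
        totals[key] = fn(totals[key], value) if key in totals else value
    return totals


lines = ["to be or not", "to be"]                    # stand-in for a text file
words = flat_map(lambda line: line.split(" "), lines)
pairs = ((word, 1) for word in words)                # (word, 1) per occurrence
counts = reduce_by_key(lambda a, b: a + b, pairs)    # sum the ones per word
```

In Spark the same combining function runs on each partition first and then across partitions, which is why it is expected to be associative.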
    20. Easy: Works Well With Hadoop
        Data Compatibility
        • Access your existing Hadoop data
        • Use the same data formats
        • Adheres to data locality for efficient processing
        Deployment Models
        • “Standalone” deployment
        • YARN-based deployment
        • Mesos-based deployment
        • Deploy on existing Hadoop cluster or side-by-side
    21. Easy: User-Driven Roadmap
        Language support
        > Improved Python support
        > SparkR
        > Java 8
        > Integrated Schema and SQL support in Spark’s APIs
        Better ML
        > Sparse Data Support
        > Model Evaluation Framework
        > Performance Testing
    22. Example: Logistic Regression
            import numpy
            from math import exp

            # Parse the points once and keep them in memory across iterations
            data = spark.textFile(...).map(readPoint).cache()
            w = numpy.random.rand(D)
            for i in range(iterations):
                # Gradient of the logistic loss, summed over all points
                gradient = data.map(
                    lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
                ).reduce(lambda x, y: x + y)
                w -= gradient
            print("Final w: %s" % w)
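The loop on slide 22 is ordinary batch gradient descent: each iteration maps every point to its gradient contribution, sums the contributions, and moves `w` against the sum. The sketch below runs the same arithmetic in plain Python on four made-up points with labels in {-1, +1}. The zero starting vector, the step size of 0.5, and the helper names are assumptions for illustration and do not come from the slide.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def gradient(w, points):
    """Sum of per-point logistic-loss gradients: (sigmoid(y * w.x) - 1) * y * x."""
    grad = [0.0] * len(w)
    for x, y in points:
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        for j, xj in enumerate(x):
            grad[j] += scale * xj
    return grad


def train(points, iterations=50, step=0.5):
    w = [0.0] * len(points[0][0])    # assumed start; the slide uses a random w
    for _ in range(iterations):
        g = gradient(w, points)      # the "map + reduce" over the dataset
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w


# Hypothetical, linearly separable toy data: (features, label)
points = [([1.0, 2.0], 1), ([2.0, 1.5], 1), ([-1.0, -1.0], -1), ([-2.0, -0.5], -1)]
w = train(points)
predictions = [1 if dot(w, x) > 0 else -1 for x, _ in points]
```

Every iteration reads the full dataset again, which is why keeping the parsed points in memory with `cache()` matters so much for this kind of algorithm.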
    23. Fast: Logistic Regression Performance
        [Chart: running time (s), axis 0 to 4000, against number of iterations (1, 5, 10, 20, 30), for Hadoop and Spark.]
        • Hadoop: 110 s / iteration
        • Spark: first iteration 80 s, further iterations 1 s
    24. Fast: Using RAM, Operator Graphs
        In-memory Caching
        • Data partitions read from RAM instead of disk
        Operator Graphs
        • Scheduling optimizations
        • Fault tolerance
        [Diagram: operator graph of RDDs A to F connected by map, join, filter, and groupBy, divided into Stages 1 to 3; the legend marks RDDs and cached partitions.]
    25. Fast: Scaling Down
        [Bar chart: execution time (s) against % of working set in cache.]
        • Cache disabled: 69
        • 25%: 58
        • 50%: 41
        • 75%: 30
        • Fully cached: 12
    26. Easy: Fault Recovery
        RDDs track lineage information that can be used to efficiently recompute lost data
            msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
                           .map(lambda s: s.split("\t")[2])
        [Diagram: HDFS File → filter (func = startsWith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD]
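Lineage-based recovery can be sketched in a few lines of plain Python. The `Node` class below is a hypothetical illustration, not Spark's implementation: each dataset remembers its parent and the function that derives it, so a lost result is rebuilt by replaying the chain from the source. The three log lines are made-up sample data.

```python
class Node:
    """One dataset in a lineage chain (hypothetical, single-machine sketch)."""

    def __init__(self, compute, parent=None):
        self.compute = compute   # function deriving this dataset from its parent
        self.parent = parent
        self.cache = None        # materialized result, if still available

    def get(self):
        if self.cache is None:   # missing or lost: recompute from lineage
            parent_data = self.parent.get() if self.parent else None
            self.cache = self.compute(parent_data)
        return self.cache


file_lines = ["ERROR\tdb\ttimeout", "INFO\tweb\tok", "ERROR\tfs\tdisk full"]
source = Node(lambda _: list(file_lines))
filtered = Node(lambda rows: [s for s in rows if s.startswith("ERROR")], source)
mapped = Node(lambda rows: [s.split("\t")[2] for s in rows], filtered)

first = mapped.get()

# Simulate losing the derived data, then recover it from lineage alone.
mapped.cache = None
filtered.cache = None
recovered = mapped.get()
```

Only the functions and parent links need to survive a failure, which is much cheaper than replicating the data itself.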
    27. Easy: Unified Platform
        [Stack diagram: Spark SQL (SQL), Spark Streaming (Streaming), MLlib (Machine learning), and GraphX (Graph computation), on Spark (General execution engine).]
        Continued innovation bringing new functionality, e.g.:
        • BlinkDB (Approximate Queries)
        • SparkR (R wrapper for Spark)
        • Tachyon (off-heap RDD caching)
    28. [Stack diagram: Spark SQL (SQL), Spark Streaming (Streaming), MLlib (Machine learning), and GraphX (Graph computation), on Spark (General execution engine).]
    29. Hive Compatibility
        • Interfaces to access data and code in the Hive ecosystem:
          o Support for writing queries in HQL
          o Catalog that interfaces with the Hive MetaStore
          o Tablescan operator that uses Hive SerDes
          o Wrappers for Hive UDFs, UDAFs, UDTFs
    30. Parquet Support
        Native support for reading data stored in Parquet:
        • Columnar storage avoids reading unneeded data.
        • Currently only supports flat structures (nested data on short-term roadmap).
        • RDDs can be written to Parquet files, preserving the schema.
    31. Mixing SQL and Machine Learning
            val trainingDataTable = sql("""
              SELECT e.action, u.age, u.latitude, u.longitude
              FROM Users u JOIN Events e ON u.userId = e.userId""")

            // Since `sql` returns an RDD, the results can be easily used in MLlib
            val trainingData = trainingDataTable.map { row =>
              val features = Array[Double](row(1), row(2), row(3))
              LabeledPoint(row(0), features)
            }
            val model = new LogisticRegressionWithSGD().run(trainingData)
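The pattern on slide 31, a SQL join whose rows are turned directly into labeled training points, can be tried locally with Python's built-in `sqlite3` module. This is only an analogy under stated assumptions: SQLite stands in for Spark SQL, a list of `(label, features)` tuples stands in for an RDD of `LabeledPoint`, and the table contents are made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (userId INTEGER, age REAL, latitude REAL, longitude REAL);
    CREATE TABLE Events (userId INTEGER, action REAL);
    INSERT INTO Users VALUES (1, 34, 37.77, -122.42), (2, 51, 40.71, -74.01);
    INSERT INTO Events VALUES (1, 1.0), (2, 0.0), (1, 0.0);
""")

# Same join as on the slide; ORDER BY is added only to make the output stable.
rows = conn.execute("""
    SELECT e.action, u.age, u.latitude, u.longitude
    FROM Users u JOIN Events e ON u.userId = e.userId
    ORDER BY e.rowid
""").fetchall()

# First column is the label, the remaining columns are the feature vector.
training_data = [(row[0], list(row[1:])) for row in rows]
conn.close()
```

The point of the slide is that no export step sits between the query and the learner: the query result is already in the form the training code consumes.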
    32. Relationship to [Shark]
        Borrows
        • Hive data loading code / in-memory columnar representation
        • Hardened Spark execution engine
        Adds
        • RDD-aware optimizer / query planner
        • Execution engine
        • Language interfaces
        Catalyst/SparkSQL is a nearly from-scratch rewrite that leverages the best parts of Shark
    33. [Stack diagram: Shark (SQL), Spark Streaming (Streaming), MLlib (Machine learning), and GraphX (Graph computation), on Spark (General execution engine).]
    34. Spark Streaming
        Run a streaming computation as a series of very small, deterministic batch jobs
        [Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
        • Chop up the live stream into batches of ½ second or more, leverage RDDs for micro-batch processing
        • Use the same familiar Spark APIs to process streams
        • Combine your batch and online processing in a single system
        • Guarantee exactly-once semantics
    35. Window-based Transformations
            val tweets = ssc.twitterStream()
            val hashTags = tweets.flatMap(status => getTags(status))
            val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
        [Diagram: a DStream of data with a sliding window operation; the annotations mark the window length (Minutes(1)) and the sliding interval (Seconds(5)).]
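A sliding window over micro-batches can be sketched in plain Python. The function `windowed_counts` below is a hypothetical illustration, not the Spark Streaming API: it assumes the stream has already been chopped into batches, and it measures both the window length and the sliding interval in number of batches rather than in seconds. The hashtags are made-up sample data.

```python
from collections import Counter


def windowed_counts(batches, window_length, slide_interval):
    """Count values over a sliding window of micro-batches.

    batches: list of lists, one inner list per batch interval.
    Returns one Counter per window, emitted every slide_interval batches.
    """
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)   # window covers the last N batches
        window = [tag for batch in batches[start:end] for tag in batch]
        results.append(Counter(window))
    return results


batches = [["#spark"], ["#hadoop", "#spark"], ["#spark"], ["#mapr"]]
counts = windowed_counts(batches, window_length=3, slide_interval=1)
```

On the slide, `Minutes(1)` plays the role of `window_length` and `Seconds(5)` the role of `slide_interval`, so each result covers the last minute of tweets and a new result is produced every five seconds.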
    36. [Stack diagram: Shark (SQL), Spark Streaming (Streaming), MLlib (Machine learning), and GraphX (Graph computation), on Spark (General execution engine).]
    37. MLlib – Machine Learning library
        • Classification: Logistic Regression, Linear SVM (+L1, L2), Decision Trees, Naive Bayes
        • Regression: Linear Regression (+Lasso, Ridge)
        • Collaborative Filtering: Alternating Least Squares
        • Clustering / Exploration: K-Means, SVD
        • Optimization Primitives: SGD, Parallel Gradient
        • Interoperability: Scala, Java, PySpark (0.9)
    38. [Stack diagram: Shark (SQL), Spark Streaming (Streaming), MLlib (Machine learning), and GraphX (Graph computation), on Spark (General execution engine).]
    39. The GraphX Unified Approach
        Enabling users to easily and efficiently express the entire graph analytics pipeline
        • New API: blurs the distinction between Tables and Graphs
        • New System: combines Data-Parallel and Graph-Parallel Systems
    40. Easy: Unified Platform
        [Stack diagram: Shark (SQL), Spark Streaming (Streaming), MLlib (Machine learning), and GraphX (Graph computation), on Spark (General execution engine).]
        Continued innovation bringing new functionality, e.g.:
        • BlinkDB (Approximate Queries)
        • SparkR (R wrapper for Spark)
        • Tachyon (off-heap RDD caching)
    41. Use Cases
    42. Interactive Exploratory Analytics
        • Leverage Spark’s in-memory caching and efficient execution to explore large distributed datasets
        • Use Spark’s APIs to explore any kind of data (structured, unstructured, semi-structured, etc.) and combine programming models
        • Execute arbitrary code using a fully functional interactive programming environment
        • Connect external tools via SQL drivers
    43. Machine Learning
        • Improve performance of iterative algorithms by caching frequently accessed datasets
        • Develop programs that are easy to reason about using a fully capable functional programming style
        • Refine algorithms using the interactive REPL
        • Use carefully curated algorithms out of the box with MLlib
    44. Power Real-time Dashboards
        • Use Spark Streaming to perform low-latency window-based aggregations
        • Combine offline models with streaming data for online clustering and classification within the dashboard
        • Use Spark’s core APIs and/or Spark SQL to give users large-scale, low-latency drill-down capabilities in exploring dashboard data
    45. Faster ETL
        • Leverage Spark’s optimized scheduling for more efficient I/O on large datasets, and in-memory processing for aggregations, shuffles, and more
        • Use Spark SQL to perform ETL using a familiar SQL interface
        • Easily port Pig scripts to Spark’s API
        • Run existing Hive queries directly on Spark SQL or Shark
    46. San Francisco, June 30 – July 2
        • Use Cases
        • Tech Talks
        • Training
        http://spark-summit.org/
    47. Q&A
        Engage with us!
        @mapr, maprtech, adawar@mapr.com
        MapR, maprtech, mapr-technologies
