What's New in the Berkeley Data Analytics Stack


The Berkeley Data Analytics Stack (BDAS) aims to address emerging challenges in data analysis through a set of systems, including Spark, Shark and Mesos, that enable faster and more powerful analytics. In this talk, we’ll cover two recent additions to BDAS:
* Spark Streaming is an extension of Spark that enables high-speed, fault-tolerant stream processing through a high-level API. It uses a new processing model called “discretized streams” to enable fault-tolerant stateful processing with exactly-once semantics, without the costly transactions required by existing systems. This lets applications process much higher rates of data per node. It also makes programming streaming applications easier by providing a set of high-level operators on streams (e.g. maps, filters, and windows) in Java and Scala.
* Shark is a Spark-based data warehouse system compatible with Hive. It can answer HiveQL queries up to 100 times faster than Hive without modification to the existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions. It employs a number of novel and traditional database optimization techniques, including column-oriented storage and mid-query replanning, to efficiently execute SQL on top of Spark. The system is in early use at companies including Yahoo! and Conviva.

Transcript of "What's New in the Berkeley Data Analytics Stack"

  1. What's New in the Berkeley Data Analytics Stack. Tathagata Das, Reynold Xin (AMPLab, UC Berkeley). Hadoop Summit 2013.
  2. Berkeley Data Analytics Stack. [Stack diagram: Mesos / YARN resource manager at the base, HDFS / Hadoop storage above it, the Spark engine on top of that, and Shark (SQL), Spark Streaming, GraphX, and MLBase built on Spark.]
  3. Today’s Talk. [The same stack diagram, indicating the components covered in this talk.]
  4. Project History. 2010: Spark (core execution engine) open sourced. 2012: Shark open sourced. Feb 2013: Spark Streaming alpha open sourced. Jun 2013: Spark entered the Apache Incubator.
  5. Community. 3000+ people in online training. 800+ meetup members. 60+ developers contributing. 17 companies contributing.
  6. Hadoop and continuous computing: looking beyond MapReduce. Bruno Fernandez-Ruiz, Senior Fellow & VP.
  7. 2012 Hadoop Summit.
  8. 2012 Hadoop Summit (Future of Apache Hadoop).
  9. 2012 Hadoop Summit (Future of Apache Hadoop). 2013 Hadoop Summit.
  10. 2012 Hadoop Summit (Future of Apache Hadoop). 2013 Hadoop Summit (Hadoop Economics).
  11. Today’s Talk. [Stack diagram repeated.]
  12. Spark. Fast and expressive cluster computing system interoperable with Apache Hadoop. Improves efficiency through: »In-memory computing primitives »General computation graphs. Improves usability through: »Rich APIs in Scala, Java, Python »Interactive shell. Up to 100× faster (2-10× on disk); often 5× less code.
  13. Why a New Framework? MapReduce greatly simplified big data analysis, but as soon as it got popular, users wanted more: »More complex, multi-pass analytics (e.g. ML, graph) »More interactive ad-hoc queries »More real-time stream processing.
  14. Spark Programming Model. Key idea: resilient distributed datasets (RDDs). »Distributed collections of objects »Can optionally be cached in memory across the cluster »Manipulated through parallel operators »Automatically recomputed on failure. Programming interface: »Functional APIs in Scala, Java, Python »Interactive use from the Scala and Python shells. (A minimal sketch follows this slide.)
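
To make the RDD model concrete, here is a minimal sketch in Scala against the core RDD API. The master URL, application name, and data are illustrative assumptions, and the package names follow later Apache Spark releases rather than the pre-incubation spark.* namespace:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object RddSketch {
      def main(args: Array[String]): Unit = {
        // "local[2]" and the app name are illustrative choices, not from the slides.
        val sc = new SparkContext("local[2]", "RddSketch")

        // An RDD: a distributed collection of objects, built here from a local range.
        val nums = sc.parallelize(1 to 1000000)

        // Parallel operators build new RDDs lazily from old ones.
        val squares = nums.map(x => x.toLong * x)

        // cache() marks the RDD to be kept in memory across the cluster for reuse.
        squares.cache()

        // Actions trigger computation; lost partitions are recomputed from lineage on failure.
        println(squares.reduce(_ + _))

        sc.stop()
      }
    }
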
  15. Example: Log Mining. Exposes RDDs through a functional API in Java, Python, Scala. lines = spark.textFile("hdfs://..."); errors = lines.filter(_.startsWith("ERROR")); errors.persist(); errors.filter(_.contains("foo")).count(); errors.filter(_.contains("bar")).count(). Base RDD, transformed RDD, action. [Diagram: the master ships tasks to workers; each worker caches a partition of the errors RDD (Errors 1-3) built from HDFS blocks (Block 1-3) and returns results.] Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); 1 TB of data in 5 sec (vs 170 sec for on-disk data).
  16. Spark: Expressive API. map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ... (A short example combining a few of these follows this slide.)
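
A short sketch showing a few of these operators composing into one job; the datasets are made up for the example:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // enables pair-RDD operators such as reduceByKey and join

    object OperatorsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "OperatorsSketch")

        // Made-up data: (user, clicks) and (user, country).
        val clicks    = sc.parallelize(Seq(("alice", 3), ("bob", 1), ("alice", 2)))
        val countries = sc.parallelize(Seq(("alice", "US"), ("bob", "DE")))

        // reduceByKey aggregates per key; join pairs records across the two RDDs.
        val totals = clicks.reduceByKey(_ + _)   // ("alice", 5), ("bob", 1)
        val joined = totals.join(countries)      // ("alice", (5, "US")), ("bob", (1, "DE"))

        joined.sortByKey().collect().foreach(println)
        sc.stop()
      }
    }
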
  17. Machine Learning Algorithms. [Bar charts of time per iteration in seconds: logistic regression, 0.96 s for Spark vs 110 s for Hadoop MR; K-means clustering, 4.1 s for Spark vs 155 s for Hadoop MR.]
  18. Spark in Java and Python. Python API: lines = spark.textFile(…); errors = lines.filter(lambda s: "ERROR" in s); errors.count(). Java API: JavaRDD<String> lines = spark.textFile(…); JavaRDD<String> errors = lines.filter(new Function<String, Boolean>() { public Boolean call(String s) { return s.contains("ERROR"); } }); errors.count();
  19. Projects Building on Spark. [Stack diagram repeated.]
  20. GraphX. Combining data-parallel and graph-parallel: »Run graph analytics and ETL in the same engine »Consume graph computation output in Spark »Interactive shell. Programmability: »Support GraphLab / Pregel APIs in 20 LOC »Implement PageRank in 5 LOC. Coming this summer as a Spark module.
  21. Scalable Machine Learning. What you want to do: build a classifier for X. What you have to do: • Learn the internals of ML classification algorithms, sampling, feature selection, cross-validation, … • Potentially learn Spark/Hadoop/… • Implement 3-4 algorithms • Implement grid search to find the right algorithm parameters • Implement validation algorithms • Experiment with different sample sizes, algorithms, features • … and in the end, ask for help.
  22. MLBase. Making large scale machine learning easy: »User specifies the task (e.g. "classify this dataset") »MLBase picks the best algorithm and the best parameters for the task. Develop scalable, high-quality ML algorithms: »Naïve Bayes »Logistic/Least Squares Regression (L1/L2 regularization) »Matrix Factorization (ALS, CCD) »K-Means & DP-Means. First release (summer): a collection of scalable algorithms.
  23. Today’s Talk. [Stack diagram repeated.]
  24. Shark. Hive compatible: HiveQL, UDFs, metadata, etc. »Works in existing Hive warehouses without changing queries or data! Fast execution engine: »Uses Spark as the underlying execution engine »Low-latency, interactive queries »Scales out and tolerates worker failures. Easy to combine with Spark: »Process data with SQL queries as well as raw Spark code. (A hedged sketch follows this slide.)
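
As a sketch of what Hive compatibility plus Spark integration looks like in code, here is a hedged example based on Shark's documented Scala entry points of the time (SharkEnv.initWithSharkContext, sql, sql2rdd); treat the exact package and method names as assumptions, and note the logs table and its columns are invented:

    // Package and method names follow Shark-era documentation; treat them as assumptions.
    import shark.{SharkContext, SharkEnv}

    object SharkSketch {
      def main(args: Array[String]): Unit = {
        // SharkContext extends SparkContext and speaks HiveQL against the Hive metastore.
        val sc: SharkContext = SharkEnv.initWithSharkContext("SharkSketch")

        // Any existing HiveQL runs unchanged; the table and columns are made up here.
        sc.sql("SELECT page, COUNT(*) FROM logs WHERE status = 500 GROUP BY page")
          .foreach(println)

        // sql2rdd bridges SQL and Spark: the query result comes back as an RDD
        // that raw Spark code can keep transforming.
        val rdd = sc.sql2rdd("SELECT page, status FROM logs")
        println(rdd.count())
      }
    }
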
  25. Real-world Performance. [Bar chart: runtime in seconds (0-100 scale) for queries Q1-Q4, comparing Shark, Shark (disk), and Hive; the Shark bars are labeled 1.1, 0.8, 0.7, and 1.0 seconds.] 1.7 TB of real warehouse data on 100 EC2 nodes.
  26. Comparison. [Bar chart: runtime in seconds (0-20 scale) for Impala, Impala (mem), Redshift, Shark (disk), and Shark (mem).] http://tinyurl.com/bigdata-benchmark
  27. Today’s Talk. [Stack diagram repeated.]
  28. Spark Streaming. Extends Spark for large scale stream processing: »Receive data directly from Kafka, Flume, Twitter, etc. »Fast, scalable, and fault-tolerant. Simple, yet rich batch-like API: »Easy to express your complex streaming computation »Fault-tolerant, stateful stream processing out of the box.
  29. Motivation. Many important applications must process large streams of live data and provide results in near-real-time: » Social network trends » Website statistics » Intrusion detection systems » Etc.
  30. Challenges. Require large clusters. Require latencies of a few seconds. Require fault-tolerance. Require integration with batch processing.
  31. Integration with Batch Processing. Many environments require processing the same data in live streaming as well as in batch post-processing. Hard for any single existing framework to achieve both: » Provide low latency for streaming workloads » Handle large volumes of data for batch workloads. Extremely painful to maintain two stacks: » Different programming models » Double the implementation effort » Double the number of bugs.
  32. Existing Streaming Systems. Storm: limited fault-tolerance guarantees. »Replays records if not processed »Processes each record at least once »May double-count events! »Mutable state can be lost due to failure! Trident: uses transactions to update state. »Processes each record exactly once »Per-state transactions to an external database are slow. Neither integrates well with batch processing systems.
  33. Spark Streaming. • Chop up the live stream into batches of X seconds • Spark treats each batch of data as an RDD and processes it using RDD operations • Finally, the processed results of the RDD operations are returned in batches. [Diagram: a live data stream enters Spark Streaming, which feeds batches of X seconds to Spark, which emits processed results.] Discretized stream processing: run a streaming computation as a series of very small, deterministic batch jobs.
  34. Spark Streaming. Discretized stream processing: run a streaming computation as a series of very small, deterministic batch jobs. • Batch sizes as low as ½ second, latency ~1 second • Potential for combining batch processing and streaming processing in the same system. [Same diagram as the previous slide.] (A minimal setup sketch follows this slide.)
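
A minimal sketch of the discretized-stream model: a StreamingContext with a 1-second batch interval turns a live source into a DStream whose batches are processed as ordinary RDDs. The socket source, host, and port are illustrative assumptions:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DStreamSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamSketch")
        // Batch interval of 1 second: the live stream is chopped into 1-second RDDs.
        val ssc = new StreamingContext(conf, Seconds(1))

        // Lines arriving on a TCP socket become a DStream (a sequence of RDDs).
        val lines = ssc.socketTextStream("localhost", 9999)

        // Each batch is processed with ordinary RDD operations.
        val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }
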
  35. Example: Get Twitter Hashtags. val tweets = ssc.twitterStream(<username>, <password>). DStream: a sequence of RDDs representing a stream of data. [Diagram: the tweets DStream as batches @ t, t+1, t+2 arriving from the Twitter Streaming API, each batch stored in memory as an RDD (immutable, distributed).]
  36. Example: Get Twitter Hashtags. val tweets = ssc.twitterStream(<username>, <password>); val hashTags = tweets.flatMap(status => getTags(status)). Transformation: modify data in one DStream to create another DStream. [Diagram: flatMap applied to every batch of the tweets DStream creates the hashTags DStream; new RDDs ([#cat, #dog, …]) are created for every batch.]
  37. Example: Get Twitter Hashtags. val tweets = ssc.twitterStream(<username>, <password>); val hashTags = tweets.flatMap(status => getTags(status)); hashTags.saveAsHadoopFiles("hdfs://..."). Output operation: push data to external storage. [Diagram: flatMap then save on each batch; every batch saved to HDFS.]
  38. Example: Get Twitter Hashtags. val tweets = ssc.twitterStream(<username>, <password>); val hashTags = tweets.flatMap(status => getTags(status)); hashTags.foreach(hashTagRDD => { … }). foreach: do whatever you want with the processed data. [Diagram: flatMap then foreach on each batch.] Write to a database, update an analytics UI, do whatever you want.
  39. Window-based Transformations. val tweets = ssc.twitterStream(<username>, <password>); val hashTags = tweets.flatMap(status => getTags(status)); val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue(). [Diagram: a sliding window over the DStream, annotated with the window length (1 minute) and the sliding interval (5 seconds).] (See the note after this slide.)
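
As a usage note, the window-then-count pattern on the slide also exists as a single fused operator; a hedged sketch, assuming hashTags is the DStream[String] built above:

    import org.apache.spark.streaming.{Minutes, Seconds}
    import org.apache.spark.streaming.dstream.DStream

    object WindowSketch {
      // hashTags is the DStream[String] built on the slide.
      def tagCounts(hashTags: DStream[String]): DStream[(String, Long)] =
        // Fused form of window(...).countByValue(): count each tag over the last
        // 1 minute of data, recomputed every 5 seconds. Incremental window
        // operators may require ssc.checkpoint(...) to be set.
        hashTags.countByValueAndWindow(Minutes(1), Seconds(5))
    }
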
  40. Arbitrary Stateful Computations. Specify a function to generate new state based on the previous state and new data. » Example: maintain per-user mood as state, and update it with their tweets: updateMood(newTweets, lastMood) => newMood; moods = tweets.updateStateByKey(tweets => updateMood(tweets)). » Exactly-once semantics even under worker failures. (A fuller sketch follows this slide.)
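
The slide's pseudocode compresses the real signature. Here is a hedged sketch of what it expands to, using updateStateByKey's (new values, previous state) => new state contract; the Mood type, the scoring rule, and the keying of tweets by user are made-up stand-ins:

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations (needed on older releases)
    import org.apache.spark.streaming.dstream.DStream

    object MoodSketch {
      // Made-up state type and update rule, for illustration only.
      case class Mood(score: Double)

      // updateStateByKey's contract: (new values for a key, previous state) => new state.
      def updateMood(newTweets: Seq[String], lastMood: Option[Mood]): Option[Mood] = {
        val prev = lastMood.getOrElse(Mood(0.0))
        // Stand-in scoring: +1 per tweet; a real app would analyze the text.
        Some(Mood(prev.score + newTweets.size))
      }

      // tweets is assumed to be keyed by user: DStream[(userId, tweetText)].
      def moods(ssc: StreamingContext, tweets: DStream[(String, String)]): DStream[(String, Mood)] = {
        ssc.checkpoint("hdfs://...")  // stateful processing requires checkpointing; path elided as on the slides
        tweets.updateStateByKey(updateMood _)
      }
    }
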
  41. Arbitrary Combination of Batch and Streaming Computations. Inter-mix RDD and DStream operations! » Example: join incoming tweets with a spam HDFS file to filter out bad tweets: tweets.transform(tweetsRDD => { tweetsRDD.join(spamHDFSFile).filter(...) })
  42. DStream Input Sources. Out of the box we provide: »Kafka »Twitter »HDFS »Flume »Raw TCP sockets. Very simple API to write a receiver for your own data source! (A sketch of two built-in sources follows this slide.)
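
A brief sketch of two of the built-in sources on StreamingContext (socketTextStream and textFileStream); the host and port are illustrative, and the HDFS path stays elided as on the slides. The Kafka, Flume, and Twitter sources are set up in a similar spirit with brokers or credentials:

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    object SourcesSketch {
      def streams(ssc: StreamingContext): (DStream[String], DStream[String]) = {
        // Raw TCP socket source: one line of text per record.
        val socketLines = ssc.socketTextStream("localhost", 9999)

        // HDFS source: picks up new files appearing in a monitored directory.
        val fileLines = ssc.textFileStream("hdfs://...")

        (socketLines, fileLines)
      }
    }
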
  43. Performance. Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency. » Tested with 100 text streams on 100 EC2 instances with 4 cores each. [Charts: cluster throughput (GB/s) vs number of nodes in the cluster for Grep and WordCount, at 1-second and 2-second batch intervals.] High throughput and low latency.
  44. Comparison with Storm. Higher throughput than Storm: »Spark Streaming: 670k records/second/node »Storm: 115k records/second/node. [Charts: throughput per node (MB/s) vs record size (100 and 1000 bytes) for Grep and WordCount, Spark vs Storm.]
  45. Fast Fault Recovery. Recovers from faults/stragglers within 1 second.
  46. Real Applications: Traffic Sensing. Traffic transit time estimation using online machine learning on GPS observations. • Markov chain Monte Carlo simulations on GPS observations • Very CPU intensive; requires dozens of machines for useful computation • Scales linearly with cluster size. [Chart: GPS observations processed per second vs number of nodes in the cluster.]
  47. Unifying Batch and Stream Models. Spark program on a Twitter log file using RDDs: val tweets = sc.hadoopFile("hdfs://..."); val hashTags = tweets.flatMap(status => getTags(status)); hashTags.saveAsHadoopFile("hdfs://..."). Spark Streaming program on a Twitter stream using DStreams: val tweets = ssc.twitterStream(<username>, <password>); val hashTags = tweets.flatMap(status => getTags(status)); hashTags.saveAsHadoopFiles("hdfs://..."). The same code base works for both batch processing and stream processing.
  48. Conclusion. Berkeley Data Analytics Stack: »Next generation data analytics stack with speed and functionality. More information: www.spark-project.org. Hands-on tutorials: ampcamp.berkeley.edu »Video tutorials, EC2 exercises »AMP Camp 2: August 29-30, 2013.