Osd ctw spark

Published at OSDC.tw 2014, Taiwan. Published in: Software.
1. Spark: Next generation cloud computing engine (Wisely Chen)
2. Agenda • What is Spark? • Next big thing • How to use Spark? • Demo • Q&A
3. Who am I? • Wisely Chen ( thegiive@gmail.com ) • Sr. Engineer in Yahoo! [Taiwan] data team • Loves to promote open source tech • Hadoop Summit 2013 San Jose • Jenkins Conf 2013 Palo Alto • Coscup 2006, 2012, 2013, OSDC 2007, Webconf 2013, PHPConf 2012, RubyConf 2012
4. Taiwan Data Team: Data Highway, BI Report, Serving API, Data Mart, ETL / Forecast, Machine Learning
5. Machine Learning, Distributed Computing, Big Data
6. Recommendation, Forecast
7. HADOOP
8. Faster ML, Distributed Computing, Bigger Big Data
9. Opinion from Cloudera • "The leading candidate for 'successor to MapReduce' today is Apache Spark. No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason." • From http://0rz.tw/y3OfM
10. What is Spark • From UC Berkeley AMP Lab • The most active big data open source project since Hadoop
11. Where is Spark?
12. Hadoop 2.0: HDFS, YARN, MapReduce, plus Storm, HBase, and others
13. Hadoop architecture: HDFS (storage), YARN (resource management), MapReduce (computing engine), Hive (SQL)
14. Hadoop vs Spark: both sit on HDFS and YARN; MapReduce vs Spark as the computing engine, Hive vs Shark for SQL
15. Spark vs Hadoop • Spark runs on YARN, Mesos, or in standalone mode • Spark's main concept is based on MapReduce • Spark can read from • HDFS: data locality • HBase • Cassandra
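A minimal sketch of that deployment flexibility (Spark 1.x-era API; the host names and paths are placeholders): the same job targets standalone, YARN or Mesos by changing only the master URL.

  import org.apache.spark.{SparkConf, SparkContext}

  object MasterModeDemo {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("master-mode-demo")
        .setMaster("local[*]") // or "spark://host:7077" (standalone), "yarn-client", "mesos://host:5050"
      val sc = new SparkContext(conf)
      // HDFS paths give data locality for free; HBase/Cassandra need their connectors
      val lines = sc.textFile("hdfs://namenode:9000/logs/a.txt")
      println(lines.count())
      sc.stop()
    }
  }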
16. More than MapReduce • On HDFS: Spark Core (MapReduce), Shark (Hive), GraphX (Pregel), MLlib (Mahout), Streaming (Storm) • Resource management system: YARN, Mesos
17. Why Spark?
18. "Among all martial arts under heaven, none is unbreakable; only speed cannot be broken." (天下武功，無堅不破，惟快不破)
19. 3X~25X faster than the MapReduce framework (from Matei's paper: http://0rz.tw/VVqgP). Running time in seconds: Logistic regression: MR 76, Spark 3. KMeans: MR 106, Spark 33. PageRank: MR 171, Spark 23.
20. What is Spark • Apache Spark™ is a very fast and general engine for large-scale data processing
21. Why is Spark so fast?
22. HDFS • roughly 100X slower than memory • stores data over network + disk • the network is about 100X slower than memory • implements fault tolerance (replication)
23. MapReduce PageRank • ...readInputFromHDFS... • for (int runs = 0; runs < iter_runnumber; runs++) { • ... • isCompleted = runRankCalculation(inPath, lastResultPath); • ... • } • ...writeOutputToHDFS...
24. Workflow • MapReduce: Input (HDFS) → Iter 1 RunRank → Tmp (HDFS) → Iter 2 RunRank → Tmp (HDFS) → ... → Iter N RunRank • Spark: Input (HDFS) → Iter 1 RunRank → Tmp (Mem) → Iter 2 RunRank → Tmp (Mem) → ... → Iter N RunRank
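A sketch of why the Spark side of this diagram is cheaper, assuming links: RDD[(String, Seq[String])] maps each page to its outlinks (damping factors follow the standard PageRank formula):

  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 10) {
    val contribs = links.join(ranks).values.flatMap {
      case (urls, rank) => urls.map(url => (url, rank / urls.size))
    }
    // the intermediate ranks stay in memory; MapReduce would write them to HDFS here
    ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
  }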
25. PageRank on 1 billion URL records: first iteration takes 200 sec, 2nd iteration takes 20 sec, 3rd iteration takes 20 sec
26. RDD • Resilient Distributed Dataset • Collections of objects spread across a cluster, stored in RAM or on disk • Built through parallel transformations
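The "RAM or on disk" choice is exposed as a storage level in the Spark API; a small sketch (paths are placeholders):

  import org.apache.spark.storage.StorageLevel
  val inMem = sc.textFile("hdfs://...").persist(StorageLevel.MEMORY_ONLY)     // what cache() does
  val spill = sc.textFile("hdfs://...").persist(StorageLevel.MEMORY_AND_DISK) // spill partitions that don't fit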
27. Fault Tolerance • "Among all martial arts under heaven, none is unbreakable; only speed cannot be broken." (天下武功，無堅不破，惟快不破)
28. RDD • val a = sc.textFile("hdfs://....") → RDD a • val b = a.filter( line => line.contains("Spark") ) → RDD b (Transformation) • val c = b.count() → Value c (Action)
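A sketch of that distinction: transformations only record lineage, and nothing executes until the action at the end.

  val nums = sc.parallelize(1 to 1000000)  // base RDD from a local range
  val evens = nums.filter(_ % 2 == 0)      // transformation: recorded, not run
  val doubled = evens.map(_ * 2L)          // still nothing has executed
  val total = doubled.reduce(_ + _)        // action: the whole chain runs now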
29. Log mining (the code for slides 29-35, run in the shell):
  val a = sc.textFile("hdfs://aaa.com/a.txt")
  val err = a.filter( t => t.contains("ERROR") )
             .filter( t => t.contains("2014") )
  err.cache()
  err.count()
  val m = err.filter( t => t.contains("MYSQL") ).count()
  val a = err.filter( t => t.contains("APACHE") ).count()
The driver distributes tasks to the three workers.
30. Log mining: each worker reads its HDFS block (Block1/Block2/Block3) into a partition of RDD a.
31. Log mining: the two filters produce RDD err on each worker.
32. Log mining: err.count() runs over the err partitions and returns the result to the driver.
33. Log mining: err.cache() keeps the err partitions in memory (Cache1/Cache2/Cache3).
34. Log mining: the MYSQL filter builds RDD m from the cached partitions.
35. Log mining: the APACHE filter builds RDD a from the cached partitions.
36. RDD Cache • 1st iteration (no cache) takes the same time • with cache, takes 7 sec
37. RDD Cache • Data locality • Cache • Self-join on 5 billion records: the big shuffle takes 20 min; after cache, only 265 ms
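A sketch of that cached self-join in spark-shell terms, with a made-up tab-separated (id, value) record layout:

  val pairs = sc.textFile("hdfs://.../records")
    .map { line => val f = line.split("\t"); (f(0), f(1)) } // hypothetical format
  pairs.cache()
  pairs.count()             // first action: reads from HDFS and fills the cache
  pairs.join(pairs).count() // the self-join now scans memory instead of re-reading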
38. Easy to use • Interactive shell • Multi-language API • JVM: Scala, Java • PySpark: Python
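For instance, the interactive shell gives you a ready-made SparkContext named sc (pyspark does the same for Python):

  $ ./bin/spark-shell
  scala> sc.textFile("hdfs://...").filter(_.contains("ERROR")).count()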
39. Scala Word Count • val file = spark.textFile("hdfs://...") • val counts = file.flatMap(line => line.split(" ")) • .map(word => (word, 1)) • .reduceByKey(_ + _) • counts.saveAsTextFile("hdfs://...")
40. Step by Step • file.flatMap(line => line.split(" ")) => (aaa,bb,cc) • .map(word => (word, 1)) => ((aaa,1),(bb,1)..) • .reduceByKey(_ + _) => ((aaa,123),(bb,23)...)
41. Java Wordcount • JavaRDD<String> file = spark.textFile("hdfs://..."); • JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() { • public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); } • }); • JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() { • public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); } • }); • JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() { • public Integer call(Integer a, Integer b) { return a + b; } • }); • counts.saveAsTextFile("hdfs://...");
42. Java vs Scala • Scala: file.flatMap(line => line.split(" ")) • Java version: • JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() { • public Iterable<String> call(String s) { • return Arrays.asList(s.split(" ")); } • });
43. Python • file = spark.textFile("hdfs://...") • counts = file.flatMap(lambda line: line.split(" ")) • .map(lambda word: (word, 1)) • .reduceByKey(lambda a, b: a + b) • counts.saveAsTextFile("hdfs://...")
44. Highly Recommended • Scala: latest API features, stable • Python • a very familiar language • native libs: NumPy, SciPy
45. FYI • Combiner: reduceByKey(_ + _) • Typical WordCount: • groupByKey().mapValues{ arr => • var r = 0; arr.foreach{ i => r += i }; r • }
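Side by side, assuming words is the usual RDD of (word, 1) pairs:

  val words = sc.textFile("hdfs://...").flatMap(_.split(" ")).map(w => (w, 1))
  val combined = words.reduceByKey(_ + _)            // partial sums happen map-side
  val shuffled = words.groupByKey().mapValues(_.sum) // every (w, 1) pair crosses the network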
46. WordCount • reduceByKey: reduces a lot on the map side • Hadoop-style shuffle (groupByKey): sends a lot of data over the network
47. DEMO
48. • Check in on Facebook with the Yahoo! recruiting post and get a Yahoo! bath duck • Check in on Facebook saying "The Yahoo! APP is awesome!!" with a screenshot of the Yahoo! Super Mall or News app, and redeem the check-in record for a duck wristband or a shopping bag
49. Just memory? • From Matei's paper: http://0rz.tw/VVqgP • HBM: stores data in an in-memory HDFS instance • SP: Spark • HBM'1, SP'1: first run • Storage: HDFS with 256 MB blocks • Nodes: m1.xlarge EC2 instances, 4 cores, 15 GB of RAM each
50. 100GB data on a 100-node cluster. Running time in seconds: Logistic regression: HBM'1 139, HBM 62, SP'1 46, SP 3. KMeans: HBM'1 182, HBM 87, SP'1 82, SP 33.
51. There is more • General DAG scheduler • Control over partition shuffle • Fast driver RPC to launch tasks • For more info, check http://0rz.tw/jwYwI
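One sketch of "control over partition shuffle" (the partition count of 100 is arbitrary): pre-partitioning a pair RDD and caching it lets later joins reuse the layout.

  import org.apache.spark.HashPartitioner
  val pairs = sc.textFile("hdfs://...").map(line => (line, 1))
  val parted = pairs.partitionBy(new HashPartitioner(100)).cache() // pin the layout
  parted.join(parted).count() // co-partitioned join: no extra shuffle of parted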
