Spark
Next generation cloud computing engine
Wisely Chen
Agenda
• What is Spark?
• Next big thing
• How to use Spark?
• Demo
• Q&A
Who am I?
• Wisely Chen ( thegiive@gmail.com )
• Sr. Engineer in Yahoo! Taiwan data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Coscup 2006, 2012, 2013, OSDC 2007, Webconf 2013, PHPConf 2012, RubyConf 2012
Taiwan Data Team
• Data Highway
• BI Report
• Serving API
• Data Mart
• ETL / Forecast
• Machine Learning
Machine Learning
Distributed Computing
Big Data
Recommendation
Forecast
HADOOP
Faster ML
Distributed Computing
Bigger Big Data
Opinion from Cloudera
• The leading candidate for “successor to
MapReduce” today is Apache Spark
• No vendor — no new project — is likely to catch
up. Chasing Spark would be a waste of time,
and would delay availability of real-time analytic
and processing services for no good reason.
• From http://0rz.tw/y3OfM
What is Spark
• From UC Berkeley AMP Lab
• The most active open source big data project since Hadoop
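Where is Spark?
[Diagram: Hadoop 2.0 stack, with HDFS at the bottom, YARN on top of it, and MapReduce, Storm, HBase and others running on YARN]
Hadoop Architecture
• Storage: HDFS
• Resource Management: YARN
• Computing Engine: MapReduce
• SQL: Hive
Hadoop vs Spark
• Storage: HDFS
• Resource Management: YARN
• Computing Engine: MapReduce vs Spark
• SQL: Hive vs Shark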
Spark vs Hadoop
• Spark runs on YARN, Mesos, or in standalone mode
• Spark’s main concept is based on MapReduce
• Spark can read from
• HDFS: data locality
• HBase
• Cassandra
More than MapReduce
Each Spark component and its counterpart in the Hadoop world:
• Spark Core: MapReduce
• Shark: Hive
• GraphX: Pregel
• MLlib: Mahout
• Streaming: Storm
• Storage: HDFS
• Resource Management System: YARN, Mesos
Why Spark?
Of all the martial arts under heaven, no defense is unbreakable; only speed cannot be beaten
3X~25X faster than the MapReduce framework
From Matei’s paper: http://0rz.tw/VVqgP
Running time (s), MR vs Spark:
• Logistic regression: MR 76, Spark 3
• KMeans: MR 106, Spark 33
• PageRank: MR 171, Spark 23
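The "3X~25X" claim can be checked against the chart values with a little arithmetic. This is a minimal sketch, assuming the smaller number in each MR/Spark pair is Spark's running time; the dictionary name is only for illustration.

```python
# Running times in seconds, read off the charts: (MapReduce, Spark).
# Assumption: the smaller value of each pair belongs to Spark.
running_time = {
    "Logistic regression": (76, 3),
    "KMeans": (106, 33),
    "PageRank": (171, 23),
}

# Speedup = MapReduce time divided by Spark time.
speedup = {name: mr / spark for name, (mr, spark) in running_time.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}X")
# Logistic regression: 25.3X
# KMeans: 3.2X
# PageRank: 7.4X
```

The smallest and largest ratios (about 3X and 25X) match the range quoted on the slide.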
What is Spark
• Apache Spark™ is a very fast and general
engine for large-scale data processing
Why is Spark so fast?
HDFS
• 100X slower than memory
• Stores data on network + disk
• Network is 100X slower than memory
• Implements fault tolerance
MapReduce PageRank
// ... readInputFromHDFS ...
for (int runs = 0; runs < iter_runnumber; runs++) {
    // ...
    isCompleted = runRankCalculation(inPath, lastResultPath);
    // ...
}
// ... writeOutputToHDFS ...
Workflow
• MapReduce: Input (HDFS) → Iter 1 RunRank → Tmp (HDFS) → Iter 2 RunRank → Tmp (HDFS) → ... → Iter N RunRank
• Spark: Input (HDFS) → Iter 1 RunRank → Tmp (Mem) → Iter 2 RunRank → Tmp (Mem) → ... → Iter N RunRank
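The two workflows can be compared in a small plain-Python sketch. It is not Spark or Hadoop code: a temporary JSON file stands in for the HDFS round trip, a Python dict stands in for memory, and the toy graph, the damping factor and all function names are assumptions made for illustration.

```python
import json
import os
import tempfile

# Hypothetical toy link graph: page -> pages it links to.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
DAMPING = 0.85


def run_rank(ranks):
    """One PageRank iteration: spread each page's rank over its out-links."""
    contrib = {page: 0.0 for page in LINKS}
    for page, targets in LINKS.items():
        share = ranks[page] / len(targets)
        for target in targets:
            contrib[target] += share
    n = len(LINKS)
    return {p: (1 - DAMPING) / n + DAMPING * c for p, c in contrib.items()}


def pagerank_via_disk(iterations):
    """MapReduce style: every iteration reads and writes its state as a file."""
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        with open(path, "w") as f:
            json.dump({p: 1.0 / len(LINKS) for p in LINKS}, f)
        for _ in range(iterations):
            with open(path) as f:
                ranks = json.load(f)
            with open(path, "w") as f:
                json.dump(run_rank(ranks), f)
        with open(path) as f:
            return json.load(f)
    finally:
        os.remove(path)


def pagerank_in_memory(iterations):
    """Spark style: the state stays in a dict between iterations."""
    ranks = {p: 1.0 / len(LINKS) for p in LINKS}
    for _ in range(iterations):
        ranks = run_rank(ranks)
    return ranks


disk_result = pagerank_via_disk(10)
memory_result = pagerank_in_memory(10)
```

Both functions compute the same ranks; the only difference is where the intermediate result lives between iterations, which is the point of the diagram.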
PageRank algorithm on 1 billion URL records
• First iteration: takes 200 sec
• 2nd iteration: takes 20 sec
• 3rd iteration: takes 20 sec
RDD
• Resilient Distributed Dataset
• Collections of objects spread across a cluster,
stored in RAM or on Disk
• Built through parallel transformations
Fault Tolerance
Of all the martial arts under heaven, no defense is unbreakable; only speed cannot be beaten
RDD
val a = sc.textFile("hdfs://....")                     // RDD a
val b = a.filter( line => line.contains("Spark") )     // RDD b (Transformation)
val c = b.count()                                      // Value c (Action)
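The split between transformations and actions can be sketched in plain Python. `ToyRDD` below is a hypothetical stand-in, not the Spark API: `filter` only records a step, and `count` is the point where the data is actually read.

```python
class ToyRDD:
    """Plain-Python stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source  # zero-argument function returning the records
        self._steps = steps    # recorded filter predicates (the lineage)

    def filter(self, predicate):
        # Transformation: record the step and return a new dataset.
        return ToyRDD(self._source, self._steps + (predicate,))

    def count(self):
        # Action: read the source and replay the recorded steps.
        records = list(self._source())
        for predicate in self._steps:
            records = [r for r in records if predicate(r)]
        return len(records)


reads = []  # one entry per time the source is read


def load_lines():
    reads.append(1)
    return ["Spark is fast", "Hadoop MapReduce", "Spark on YARN"]


a = ToyRDD(load_lines)
b = a.filter(lambda line: "Spark" in line)
reads_before_action = len(reads)  # still 0: nothing has been read yet
c = b.count()                     # the action triggers the read
```

Because each dataset only remembers how it was built, a lost result can be recomputed from the source by replaying the recorded steps.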
Log mining
val a = sc.textFile("hdfs://aaa.com/a.txt")
val err = a.filter( t => t.contains("ERROR") )
           .filter( t => t.contains("2014") )

err.cache()
err.count()

val m = err.filter( t => t.contains("MYSQL") )
           .count()
val apache = err.filter( t => t.contains("APACHE") )
                .count()
[Diagram: the Driver sends Tasks to three Workers]
Log mining: step by step
• sc.textFile: each Worker holds one HDFS block (Block1, Block2, Block3) as its part of RDD a
• filter: each Worker builds its part of RDD err from its local block
• err.cache() and err.count(): the first action runs the filters and keeps RDD err in each Worker's memory (Cache1, Cache2, Cache3)
• MYSQL filter and count: each Worker builds RDD m from its cache instead of reading HDFS again
• APACHE filter and count: reuses the same caches
• 1st iteration (no cache): takes the same time
• With cache: takes 7 sec
RDD Cache
• Data locality
• Cache
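The effect of caching can be sketched in plain Python. `CachedLines` is a hypothetical stand-in for a cached RDD, not the Spark API, and the log lines are made up for illustration.

```python
class CachedLines:
    """Toy stand-in for a cacheable RDD (hypothetical, not the Spark API)."""

    def __init__(self, load):
        self._load = load        # function that reads the source data
        self._use_cache = False
        self._cached = None      # filled by the first action after cache()
        self.loads = 0           # how many times the source was read

    def cache(self):
        self._use_cache = True   # only marks the data; nothing is read yet
        return self

    def _records(self):
        if self._cached is not None:
            return self._cached
        self.loads += 1
        records = list(self._load())
        if self._use_cache:
            self._cached = records
        return records

    def count_containing(self, word):
        return sum(1 for line in self._records() if word in line)


# Made-up log lines for the example.
LOG = [
    "ERROR 2014 MYSQL down",
    "ERROR 2014 APACHE timeout",
    "INFO 2014 ok",
    "ERROR 2014 MYSQL slow",
]


def load_errors():
    return [line for line in LOG if "ERROR" in line and "2014" in line]


err = CachedLines(load_errors).cache()
mysql = err.count_containing("MYSQL")    # reads the source and fills the cache
apache = err.count_containing("APACHE")  # served from the cache

plain = CachedLines(load_errors)         # same queries without cache()
plain.count_containing("MYSQL")
plain.count_containing("APACHE")
```

With `cache()` the source is read once for both queries; without it, every query reads the source again.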
Self join on 5 billion records
• A big shuffle: takes 20 min
• After cache: takes only 265 ms
Easy to use
• Interactive Shell
• Multi Language API
• JVM: Scala, Java
• PySpark: Python
Scala Word Count
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Step by Step
• file.flatMap(line => line.split(" ")) => (aaa,bb,cc)
• .map(word => (word, 1)) => ((aaa,1),(bb,1)..)
• .reduceByKey(_ + _) => ((aaa,123),(bb,23)…)
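The three steps can be traced in plain Python on a tiny input. This is only a sketch of what each step produces, not PySpark code, and the input lines are made up.

```python
from collections import defaultdict
from itertools import chain

lines = ["aaa bb cc", "aaa bb", "aaa"]  # hypothetical input lines

# flatMap: split every line and flatten the results into one list of words
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair every word with the count 1
pairs = [(word, 1) for word in words]

# reduceByKey: add up the counts that share the same word
totals = defaultdict(int)
for word, n in pairs:
    totals[word] += n
counts = dict(totals)

print(words)   # ['aaa', 'bb', 'cc', 'aaa', 'bb', 'aaa']
print(counts)  # {'aaa': 3, 'bb': 2, 'cc': 1}
```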
Java Wordcount
JavaRDD<String> file = spark.textFile("hdfs://...");
// Split each line into words
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
// Pair each word with the count 1
JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
// Sum the counts for each word
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
Java vs Scala
• Scala: file.flatMap(line => line.split(" "))
• Java version:
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
});
Python
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Highly Recommend
• Scala: latest API features, stable
• Python
• A very familiar language
• Native libs: NumPy, SciPy
FYI
• Combiner: reduceByKey(_ + _)
• Typical WordCount:
    groupByKey().mapValues { arr =>
      var r = 0; arr.foreach { i => r += i }; r
    }
WordCount
• reduceByKey: reduces a lot on the map side
• Hadoop-style shuffle: sends a lot of data over the network
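The difference can be counted in a plain-Python sketch. It is not Spark code: two lists stand in for map-side partitions, and the length of each "shuffled" list stands in for the number of records sent over the network. All names and data are assumptions for illustration.

```python
from collections import Counter

# Hypothetical input: words already split across two map-side partitions.
partitions = [
    ["aaa", "bb", "aaa", "aaa"],
    ["bb", "aaa", "cc", "bb"],
]

# groupByKey style: every (word, 1) pair leaves its partition unchanged.
shuffled_group = [(word, 1) for part in partitions for word in part]

# reduceByKey style: each partition first sums its own counts (the combiner),
# so at most one record per distinct word leaves each partition.
shuffled_reduce = [pair for part in partitions for pair in Counter(part).items()]


def merge(pairs):
    """Reduce side: add up the counts that arrive for each word."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)


print(len(shuffled_group), len(shuffled_reduce))  # 8 5
```

Both variants give the same final counts; the combiner only changes how many records have to be shuffled.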
DEMO
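• Check in on FB with the Yahoo! recruiting message and get a Yahoo! bath duck
• Check in on FB saying "Yahoo! APP is awesome!!" and attach a screenshot of the Super Mall or News APP; show your check-in record to get a duck wrist guard or a shopping bag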
Just memory?
• From Matei’s paper: http://0rz.tw/VVqgP
• HBM: stores data in an in-memory HDFS instance
• SP: Spark
• HBM’1, SP’1: first run
• Storage: HDFS with 256 MB blocks
• Node information
• m1.xlarge EC2 nodes
• 4 cores
• 15 GB of RAM
100 GB of data on a 100-node cluster
Running time (s):
• Logistic regression: HBM’1 139, HBM 62, SP’1 46, SP 3
• KMeans: HBM’1 182, HBM 87, SP’1 82, SP 33
There is more
• General DAG scheduler
• Control partition shuffle
• Fast event-driven RPC to launch tasks
• For more info, check http://0rz.tw/jwYwI
Strategies for Successful Data Migration Tools.pptx
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 

Osd ctw spark

  • 2. Agenda • What is Spark? • Next big thing • How to use Spark? • Demo • Q&A
  • 3. Who am I? • Wisely Chen ( thegiive@gmail.com ) • Sr. Engineer in Yahoo! [Taiwan] data team • Loves to promote open source tech • Hadoop Summit 2013 San Jose • Jenkins Conf 2013 Palo Alto • Coscup 2006, 2012, 2013, OSDC 2007, Webconf 2013, PHPConf 2012, RubyConf 2012
  • 9. Opinion from Cloudera • The leading candidate for “successor to MapReduce” today is Apache Spark • No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason. ! • From http://0rz.tw/y3OfM
  • 10. What is Spark • From UC Berkeley AMP Lab • Most active big data open source project since Hadoop
  • 15. Spark vs Hadoop • Spark runs on YARN, Mesos, or in standalone mode • Spark’s main concept is based on MapReduce • Spark can read from • HDFS: data locality • HBase • Cassandra
  • 16. More than MapReduce • HDFS • Spark Core: MapReduce • Shark: Hive • GraphX: Pregel • MLlib: Mahout • Streaming: Storm • Resource Management System (YARN, Mesos)
  • 19. 3X~25X faster than the MapReduce framework • From Matei’s paper: http://0rz.tw/VVqgP • Running time (s) • Logistic regression: MR 76, Spark 3 • KMeans: MR 106, Spark 33 • PageRank: MR 171, Spark 23
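The "3X~25X" range can be checked against the chart values on this slide. The snippet below is only an arithmetic sketch in plain Python; it assumes the larger number in each pair is the MapReduce running time and the smaller one is Spark's, and the names `running_time_s` and `speedup` are made up for illustration.

```python
# Running times in seconds, as read from the slide's charts.
# Assumption: the larger value of each pair is MapReduce, the smaller is Spark.
running_time_s = {
    "Logistic regression": {"MR": 76, "Spark": 3},
    "KMeans": {"MR": 106, "Spark": 33},
    "PageRank": {"MR": 171, "Spark": 23},
}


def speedup(times):
    """Return how many times faster Spark is than MapReduce for one workload."""
    return times["MR"] / times["Spark"]


# Round to one decimal place for display.
speedups = {name: round(speedup(t), 1) for name, t in running_time_s.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```

Under that reading, the three workloads land at roughly 25x, 3x, and 7x, which matches the range quoted in the slide title.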
  • 20. What is Spark • Apache Spark™ is a very fast and general engine for large-scale data processing
  • 21. Why is Spark so fast?
  • 22. HDFS • 100X slower than memory • Stores data on network + disk • Network is 100X slower than memory • Implements fault tolerance
  • 23. MapReduce PageRank • …readInputFromHDFS… • for (int runs = 0; runs < iter_runnumber; runs++) { • … • isCompleted = runRankCalculation(inPath, lastResultPath); • … • } • …writeOutputToHDFS…
  • 24. Workflow • MapReduce: Input (HDFS) → Iter 1 RunRank → Tmp (HDFS) → Iter 2 RunRank → Tmp (HDFS) → … → Iter N RunRank • Spark: Input (HDFS) → Iter 1 RunRank → Tmp (Mem) → Iter 2 RunRank → Tmp (Mem) → … → Iter N RunRank
  • 25. PageRank algorithm on 1 billion URL records • First iteration takes 200 sec • 2nd iteration takes 20 sec • 3rd iteration takes 20 sec
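The iterative workflow above keeps the intermediate ranks in memory between iterations instead of writing them back to HDFS. As a rough illustration of that loop shape, here is a small PageRank in plain Python (not Spark code). The function name `pagerank`, the toy link graph, and the 0.85 damping factor are assumptions chosen for the example, not taken from the slides.

```python
def pagerank(links, iterations=10, damping=0.85):
    """Iterative PageRank; `ranks` stays in memory between iterations.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {dest for dests in links.values() for dest in dests}
    ranks = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Each page splits its current rank evenly among its outgoing links.
        contribs = {page: 0.0 for page in pages}
        for src, dests in links.items():
            if not dests:
                continue
            share = ranks[src] / len(dests)
            for dest in dests:
                contribs[dest] += share
        # The new ranks replace the old ones in memory; nothing is written to disk.
        ranks = {page: (1 - damping) + damping * c for page, c in contribs.items()}

    return ranks


# Hypothetical three-page graph.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_ranks = pagerank(toy_links, iterations=20)
```

In a MapReduce-style job, the reassignment of `ranks` at the end of each pass would instead be a write to, and a later read from, temporary HDFS storage.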
  • 26. RDD • Resilient Distributed Dataset • Collections of objects spread across a cluster, stored in RAM or on Disk • Built through parallel transformations
  • 28. RDD • RDD a: val a = sc.textFile("hdfs://....") • RDD b (transformation): val b = a.filter(line => line.contains("Spark")) • Value c (action): val c = b.count()
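The split between transformations and actions on this slide can be mimicked in a few lines of plain Python. This is a toy model, not the Spark API: the `LazyDataset` class and the `read_log` list are hypothetical names introduced only to show that `filter` merely describes work while `count` triggers it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it records how to compute data and runs only on an action."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning an iterator

    @classmethod
    def from_lines(cls, lines, read_log):
        def compute():
            read_log.append("read")  # note every time the source is really read
            return iter(lines)
        return cls(compute)

    def filter(self, predicate):
        # Transformation: builds a new dataset description; nothing runs yet.
        return LazyDataset(lambda: (x for x in self._compute() if predicate(x)))

    def count(self):
        # Action: runs the whole chain and returns a plain value.
        return sum(1 for _ in self._compute())


read_log = []
a = LazyDataset.from_lines(
    ["Spark is fast", "Hadoop MapReduce", "I like Spark"], read_log
)
b = a.filter(lambda line: "Spark" in line)
reads_before_action = len(read_log)  # the source has not been touched yet
c = b.count()
reads_after_action = len(read_log)   # the action forced one read
```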
  • 29. Log mining • val a = sc.textFile("hdfs://aaa.com/a.txt") • val err = a.filter(t => t.contains("ERROR")).filter(t => t.contains("2014")) • err.cache() • err.count() • val m = err.filter(t => t.contains("MYSQL")).count() • val a = err.filter(t => t.contains("APACHE")).count() • Diagram: the Driver sends a Task to each of three Workers
  • 30. Log mining (same code as slide 29) • Diagram: Driver and three Workers; each Worker holds RDD a over one block (Block1, Block2, Block3)
  • 31. Log mining (same code as slide 29) • Diagram: Driver and three Workers; each Worker holds RDD err over Block1, Block2, Block3
  • 32. Log mining (same code as slide 29) • Diagram: Driver and three Workers; each Worker holds RDD err over Block1, Block2, Block3
  • 33. Log mining (same code as slide 29) • Diagram: Driver and three Workers; each Worker holds RDD err in Cache1, Cache2, Cache3
  • 34. Log mining (same code as slide 29) • Diagram: Driver and three Workers; each Worker holds RDD m over Cache1, Cache2, Cache3
  • 35. Log mining (same code as slide 29) • Diagram: Driver and three Workers; each Worker holds RDD a over Cache1, Cache2, Cache3
  • 36. RDD Cache • 1st iteration (no cache) takes the same time • With cache, takes 7 sec
  • 37. RDD Cache • Data locality • Cache • Self join on 5 billion records: a big shuffle takes 20 min; after cache, it takes only 265 ms
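The effect of `cache()` in the log mining walkthrough can be sketched in plain Python: once the filtered data is cached, later queries reuse it instead of going back to the source. This is a toy model rather than Spark itself; `CachedDataset`, `read_source`, and the sample log lines are all invented for illustration.

```python
class CachedDataset:
    """Toy dataset whose computed result can be kept in memory after the first use."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning an iterable
        self._use_cache = False
        self._cached = None

    def filter(self, predicate):
        # Lazy: the parent is only collected when the child is collected.
        return CachedDataset(lambda: [x for x in self.collect() if predicate(x)])

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached  # served from memory, no recomputation
        data = list(self._compute())
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        return len(self.collect())


source_reads = []
log_lines = [
    "2014 ERROR MYSQL connection lost",
    "2014 ERROR APACHE timeout",
    "2013 ERROR MYSQL disk full",
    "2014 INFO APACHE started",
]


def read_source():
    source_reads.append(1)  # stands in for an expensive HDFS read
    return log_lines


a = CachedDataset(read_source)
err = a.filter(lambda t: "ERROR" in t).filter(lambda t: "2014" in t)
err.cache()
total = err.count()                                      # first action: reads the source
mysql = err.filter(lambda t: "MYSQL" in t).count()       # reuses the cached err
apache = err.filter(lambda t: "APACHE" in t).count()     # reuses the cached err
```

Three counts are computed, but the source is read only once; without the `cache()` call each count would trigger its own read.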
  • 38. Easy to use • Interactive shell • Multi-language API • JVM: Scala, Java • PySpark: Python
  • 39. Scala Word Count • val file = spark.textFile("hdfs://...") • val counts = file.flatMap(line => line.split(" ")) • .map(word => (word, 1)) • .reduceByKey(_ + _) • counts.saveAsTextFile("hdfs://...")
  • 40. Step by Step • file.flatMap(line => line.split(" ")) => (aaa,bb,cc) • .map(word => (word, 1)) => ((aaa,1),(bb,1)..) • .reduceByKey(_ + _) => ((aaa,123),(bb,23)…)
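The three steps above can be traced on a tiny input with ordinary Python collections. This is a sketch of what each stage produces, not PySpark code; the sample `lines` are made up.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input standing in for the lines of the HDFS file.
lines = ["aaa bb cc", "aaa bb", "aaa"]

# flatMap: split every line and flatten the results into one list of words.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair every word with the count 1.
pairs = [(word, 1) for word in words]

# reduceByKey: add up the counts that share the same word.
totals = defaultdict(int)
for word, n in pairs:
    totals[word] += n
counts = dict(totals)
```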
  • 41. Java Word Count • JavaRDD<String> file = spark.textFile("hdfs://..."); • JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() { • public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); } • }); • JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() { • public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); } • }); • JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() { • public Integer call(Integer a, Integer b) { return a + b; } • }); • counts.saveAsTextFile("hdfs://...");
  • 42. Java vs Scala • Scala : file.flatMap(line => line.split(" ")) • Java version : • JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() { • public Iterable<String> call(String s) { • return Arrays.asList(s.split(" ")); } • });
  • 43. Python • file = spark.textFile("hdfs://...") • counts = file.flatMap(lambda line: line.split(" ")) \ • .map(lambda word: (word, 1)) \ • .reduceByKey(lambda a, b: a + b) • counts.saveAsTextFile("hdfs://...")
  • 44. Highly Recommended • Scala: latest API features, stable • Python • A very familiar language • Native libraries: NumPy, SciPy
  • 45. FYI • Combiner: reduceByKey(_ + _) • Typical WordCount: • groupByKey().mapValues{ arr => • var r = 0 ; arr.foreach{i=> r+=i} ; r • }
  • 46. WordCount • reduceByKey: reduces a lot on the map side • Hadoop-style shuffle: sends a lot of data over the network
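The difference between the two shuffles can be made concrete by counting how many records would leave each partition. The sketch below is plain Python, not Spark; the two `partitions` and the helper names are hypothetical, and a `Counter` per partition stands in for the map-side combine that `reduceByKey` performs.

```python
from collections import Counter

# Hypothetical words held by two map-side partitions.
partitions = [
    ["a", "b", "a", "a"],
    ["b", "b", "a", "c"],
]


def shuffle_without_combiner(parts):
    """groupByKey style: every (word, 1) pair crosses the network."""
    return [(word, 1) for part in parts for word in part]


def shuffle_with_combiner(parts):
    """reduceByKey style: each partition pre-sums its own words before sending."""
    shuffled = []
    for part in parts:
        shuffled.extend(Counter(part).items())
    return shuffled


def reduce_side(shuffled):
    """Final merge on the reduce side; identical for both strategies."""
    totals = Counter()
    for word, n in shuffled:
        totals[word] += n
    return dict(totals)


plain = shuffle_without_combiner(partitions)
combined = shuffle_with_combiner(partitions)
print(len(plain), len(combined))
```

Both paths give the same totals, but the combined path ships fewer records, and the gap grows as words repeat more often within a partition.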
  • 47. DEMO
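  • 48. • Check in on Facebook with the Yahoo! recruiting message to get a Yahoo! bath duck • Check in on Facebook saying "Yahoo! APP is awesome!!" with a screenshot of the Super Mall or News app attached, then show the check-in record to get a duck wristband or a shopping bag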
  • 49. Just memory? • From Matei’s paper: http://0rz.tw/VVqgP • HBM: stores data in an in-memory HDFS instance. • SP : Spark • HBM’1, SP’1 : first run • Storage: HDFS with 256 MB blocks • Node information • m1.xlarge EC2 nodes • 4 cores • 15 GB of RAM
  • 50. 100 GB of data on a 100-node cluster, running time (s) • Logistic regression: HBM'1 139, HBM 62, SP'1 46, SP 3 • KMeans: HBM'1 182, HBM 87, SP'1 82, SP 33
  • 51. There is more • General DAG scheduler • Control partition shuffle • Fast driven RPC to launch tasks • For more info, check http://0rz.tw/jwYwI