SlideShare a Scribd company logo
1 of 31
Download to read offline
S PA R K - N E W K I D O N
T H E B L O C K
A B O U T M E …
• I designed Bamboo (HP’s Big Data Analytics Platform)
• I write software (mostly with Scala but leaning towards Haskell
recently …)
• I like translating seq to parallel algorithms mostly using CUDA /
OpenCL; embedded assembly is an EVIL thing.
• I wrote 2 books
• OpenCL Parallel Programming Development Cookbook
• Developing an Akka Edge
W H AT ’ S C O V E R E D T O D AY ?
• What’s Apache Spark
• What’s a RDD ? How can i understand it ?
• What’s Spark SQL
• What’s Spark Streaming
• References
W H AT ’ S A PA C H E S PA R K
• As a beginner’s guide, you can refer to Tsai Li Ming’s talk.
• API model abstracts
• how to extract data from 3rd party s/w (via JDBC,
Cassandra, HBase)
• how to extract-compute data (via GraphX, MLLib,
SparkSQL)
• how to store data (data connectors to “local”, “hdfs”,
“s3”
R E S I L I E N T D I S T R I B U T E D D ATA S E T S
• Apache Spark works on data broken into chunks
• These chunks are called RDDs
• RDDs are chained into a lineage graph => a graph
that identifies relationships.
• RDDs can be queried, grouped, transformed in a
coarse grained manner to a fine grained manner.
• A RDD has a lifecycle:
• reification
• lazy-compute/lazy re-compute
• destruction
• RDD’s lifecycle is managed by the system unless …
• A program commands the RDD to persist() or unpersist()
which affects the lazy computation.
R E S I L I E N T D I S T R I B U T E D D ATA S E T S
“ A G G R E G AT E ” I N S PA R K
> val data = sc.parallelize( (1 to 4) toList,2)
> data.aggregate(0)
> .. (math.max(_, _),
> .. ( _ + _ ))
> …..
> result = 6
def aggregate(zerovalue: U)
(fbinary: (U, T) => U,
fagg: (U, U) => U): U
H O W “ A G G R E G AT E ” W O R K S I N S PA R K
e1
RDD
fagg
fbinary
e2 e3 e4
zerovalue
res1
fbinary
res2
fagg final result
caveat:
partition-sensitive
algorithm should work
correctly regardless of
partitions
“ C O G R O U P ” I N S PA R K
> val x = sc.parallelize(List(1, 2, 1, 3), 1)
> val y = x.map((_, "y"))
> val z = x.map((_, "z"))
> y.cogroup(z).collect
res72: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,
(Array(y, y),Array(z, z))), (3,(Array(y),Array(z))), (2,
(Array(y),Array(z))))
def cogroup[W1, W2, W3]
(other1: RDD[(K, W1)],
other2: RDD[(K, W2)],
other3: RDD[(K, W3)], numPartitions: Int):
RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2],
Iterable[W3]))]
H O W “ C O G R O U P ” W O R K S I N S PA R K
RDDx
(k1,va) (k2,vb) (k1,vc) (k3,vd) (k1,ve)
(k1,vf) (k2,vg) (k1,vh) RDDy
RDDx.cogroup(RDDy) =?
H O W “ C O G R O U P ” W O R K S I N S PA R K
Arraycombined
Array[(k1,[va,vc,ve,vf,vh]),
(k2,[vb,vg]),
(k3,[vd])]
RDDx.cogroup(RDDy) = *see below*
“ C O G R O U P ” I N S PA R K
• CoGroup works in both RDD and Spark Streams
• the ability to combine multiple RDDs allows higher
abstractions to be constructed
• A Stream in Spark is just a list of (Time,RDD[U])
W H AT ’ S S PA R K S Q L
• Spark SQL is new, largely replaced Shark
• Large scale queries (inline queries) to be embedded
into a Spark program
• Spark SQL supports Apache Hive, JSON, Parquet,
RDD.
• Spark SQL’s optimizer is clever!
• Supports UDFs from Hive or Write your own !
S PA R K S Q L
J S O N
S PA R K S Q L
PA R Q U E TH I V E
data sources
R D D
S PA R K S Q L ( A N E X A M P L E )
// import spark sql
import org.apache.spark.sql.hive.HiveContext
// create a spark sql hivecontext
val sc = new SparkContext(…)
val hiveCtx = new HiveContext(sc)
S PA R K S Q L ( A N E X A M P L E )
// import spark sql
import org.apache.spark.sql.hive.HiveContext
// create a spark sql hivecontext
val sc = new SparkContext(…)
val hiveCtx = new HiveContext(sc)
val input = hiveCtx.jsonFile(inputFile)
input.registerTempTable(“tweets”)
S PA R K S Q L ( A N E X A M P L E )
// import spark sql
import org.apache.spark.sql.hive.HiveContext
// create a spark sql hivecontext
val sc = new SparkContext(…)
val hiveCtx = new HiveContext(sc)
val input = hiveCtx.jsonFile(inputFile)
input.registerTempTable(“tweets”)
val topTweets = hiveCtx.sql(“SELECT text,
retweetCount
FROM tweets ORDER BY retweetCount LIMIT 10”)
S PA R K S Q L ( A N E X A M P L E )
// import spark sql
import org.apache.spark.sql.hive.HiveContext
// create a spark sql hivecontext
val sc = new SparkContext(…)
val hiveCtx = new HiveContext(sc)
val input = hiveCtx.jsonFile(inputFile)
input.registerTempTable(“tweets”)
val topTweets = hiveCtx.sql(“SELECT text,
retweetCount
FROM tweets ORDER BY retweetCount LIMIT 10”)
val topTweetContent = topTweets.map(row ⇒
row.getString(0))
W H AT ’ S S PA R K S T R E A M I N G
• Core component is a DStream
• DStream is an abstract RDD whose basic components
is a (key,value) pairs where key = Time, value = RDD.
• Forward and backward queries are supported
• Fault-Tolerance by check-pointing RDDs.
• What you can do with RDDs, you can do with
DStreams.
S PA R K S T R E A M I N G ( Q U I C K E X A M P L E )
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.Duration
// Create a StreamingContext with a 1-second batch
size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))

S PA R K S T R E A M I N G ( Q U I C K E X A M P L E )
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.Duration
// Create a StreamingContext with a 1-second batch size from a
SparkConf
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream using data received after connecting to
// port 7777 on the local machine
val lines = ssc.socketTextStream("localhost", 7777)

// Filter our DStream for lines with "error"

val errorLines = lines.filter(_.contains("error"))

// Print out the lines with errors

errorLines.print()
S PA R K S T R E A M I N G ( Q U I C K E X A M P L E )
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.Duration
// Create a StreamingContext with a 1-second batch size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream using data received after connecting to
// port 7777 on the local machine
val lines = ssc.socketTextStream("localhost", 7777)

// Filter our DStream for lines with "error"

val errorLines = lines.filter(_.contains("error"))

// Print out the lines with errors

errorLines.print()
// Start our streaming context and wait for it to "finish"
ssc.start()

// Wait for the job to finish
ssc.awaitTermination()
A D S T R E A M L O O K S L I K E …
t1 to t2 t2 to t3 t3 to t4
timestart
DStream
A D S T R E A M C A N H AV E
T R A N S F O R M AT I O N S O N T H E M !
t1 to t2
timestart
DStream(s)
t1 to t2
data-1
data-2
f
transformation
on the fly!
S PA R K S T R E A M T R A N S F O R M AT I O N
t1 to t2t2 to t3
timestart
DStream(s)
t1 to t2t2 to t3
data-1
data-2
f f
data output in
batches
S PA R K S T R E A M T R A N S F O R M AT I O N
t3 to t4
timestart
DStream(s)
t3 to t4
data-1
data-2
f
t1 to t2t2 to t3
t1 to t2t2 to t3
f fff
S TAT E F U L S PA R K S T R E A M
T R A N S F O R M AT I O N
t3 to t4
timestart
DStream(s)
t3 to t4
data-1
data-2
f
t1 to t2t2 to t3
t1 to t2t2 to t3
f fff
H O W D O E S S PA R K S T R E A M I N G
H A N D L E FA U LT S ?
• As before, check-point is the key to fault-tolerance
(especially in stateful-dstream transformations)
• Programs can recover from check-points => no need
to restart all over again.
• You can use “monit” to restart Spark jobs or pass the
Spark flag “- - supervise” to the job config a.k.a driver
fault tolerance
• All incoming data to workers replicated
• In-house RDDs follow the lineage graph to recover
• The above is known as worker fault tolerance.
• Receivers fault tolerance is largely dependent on whether
data sources can re-send lost data
• Streams guarantee exactly-once semantics; caveat:
multiple writes can occur to the HDFS (app specific logic
needs to handle)
H O W D O E S S PA R K S T R E A M I N G
H A N D L E FA U LT S ?
R E F E R E N C E S
• Books:
• “Learning Spark: Lightning Fast Big Data ANlaytics”
• “Advanced Analytics with Spark: Patterns for Learning from Data At Scale”
• “Fast Data Processing with Spark”
• “Machine Learning with Spark”
• Berkeley Data Bootcamp
• Introduction to Big Data with Apache Spark
• Kien Dang’s introduction to Spark and R using Naive Bayes (click here)
• Spark Streaming with Scala and Akka (click here)
T H E E N D
Q U E S T I O N S ?
T W I T T E R : @ R AY M O N D TAY B L
G I T H U B : @ R AY G I T

More Related Content

What's hot

Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousRussell Spitzer
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaKnoldus Inc.
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with ScalaHimanshu Gupta
 
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016Duyhai Doan
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with CassandraJacek Lewandowski
 
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...CloudxLab
 
Data analysis scala_spark
Data analysis scala_sparkData analysis scala_spark
Data analysis scala_sparkYiguang Hu
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on AndroidTomáš Kypta
 
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streamsPSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streamsStephane Manciot
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureRussell Spitzer
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsRussell Spitzer
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...CloudxLab
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Sparkfelixcss
 
Datastax day 2016 introduction to apache cassandra
Datastax day 2016   introduction to apache cassandraDatastax day 2016   introduction to apache cassandra
Datastax day 2016 introduction to apache cassandraDuyhai Doan
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupHolden Karau
 

What's hot (20)

Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + Kafka
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
 
Data analysis scala_spark
Data analysis scala_sparkData analysis scala_spark
Data analysis scala_spark
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on Android
 
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streamsPSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
 
SFScon 2020 - Peter Hopfgartner - Open Data de luxe
SFScon 2020 - Peter Hopfgartner - Open Data de luxeSFScon 2020 - Peter Hopfgartner - Open Data de luxe
SFScon 2020 - Peter Hopfgartner - Open Data de luxe
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* Ops
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
Datastax day 2016 introduction to apache cassandra
Datastax day 2016   introduction to apache cassandraDatastax day 2016   introduction to apache cassandra
Datastax day 2016 introduction to apache cassandra
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
 

Viewers also liked

Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingSantosh Sahoo
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Sparkjlacefie
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Building Robust, Adaptive Streaming Apps with Spark Streaming
Building Robust, Adaptive Streaming Apps with Spark StreamingBuilding Robust, Adaptive Streaming Apps with Spark Streaming
Building Robust, Adaptive Streaming Apps with Spark StreamingDatabricks
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
 
DataStax: Spark Cassandra Connector - Past, Present and Future
DataStax: Spark Cassandra Connector - Past, Present and FutureDataStax: Spark Cassandra Connector - Past, Present and Future
DataStax: Spark Cassandra Connector - Past, Present and FutureDataStax Academy
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 

Viewers also liked (9)

Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark Streaming
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Building Robust, Adaptive Streaming Apps with Spark Streaming
Building Robust, Adaptive Streaming Apps with Spark StreamingBuilding Robust, Adaptive Streaming Apps with Spark Streaming
Building Robust, Adaptive Streaming Apps with Spark Streaming
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
SPARK SQL
SPARK SQLSPARK SQL
SPARK SQL
 
Learning spark ch1-2
Learning spark ch1-2Learning spark ch1-2
Learning spark ch1-2
 
DataStax: Spark Cassandra Connector - Past, Present and Future
DataStax: Spark Cassandra Connector - Past, Present and FutureDataStax: Spark Cassandra Connector - Past, Present and Future
DataStax: Spark Cassandra Connector - Past, Present and Future
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 

Similar to Toying with spark

SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowKristian Alexander
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetupjlacefie
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemAdarsh Pannu
 
Learning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLLearning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLphanleson
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 

Similar to Toying with spark (20)

SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Spark core
Spark coreSpark core
Spark core
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetup
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating System
 
Learning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLLearning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQL
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 

More from Raymond Tay

Principled io in_scala_2019_distribution
Principled io in_scala_2019_distributionPrincipled io in_scala_2019_distribution
Principled io in_scala_2019_distributionRaymond Tay
 
Building a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamBuilding a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamRaymond Tay
 
Distributed computing for new bloods
Distributed computing for new bloodsDistributed computing for new bloods
Distributed computing for new bloodsRaymond Tay
 
Functional programming with_scala
Functional programming with_scalaFunctional programming with_scala
Functional programming with_scalaRaymond Tay
 
Introduction to cuda geek camp singapore 2011
Introduction to cuda   geek camp singapore 2011Introduction to cuda   geek camp singapore 2011
Introduction to cuda geek camp singapore 2011Raymond Tay
 
Introduction to Erlang
Introduction to ErlangIntroduction to Erlang
Introduction to ErlangRaymond Tay
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDARaymond Tay
 

More from Raymond Tay (8)

Principled io in_scala_2019_distribution
Principled io in_scala_2019_distributionPrincipled io in_scala_2019_distribution
Principled io in_scala_2019_distribution
 
Building a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamBuilding a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beam
 
Practical cats
Practical catsPractical cats
Practical cats
 
Distributed computing for new bloods
Distributed computing for new bloodsDistributed computing for new bloods
Distributed computing for new bloods
 
Functional programming with_scala
Functional programming with_scalaFunctional programming with_scala
Functional programming with_scala
 
Introduction to cuda geek camp singapore 2011
Introduction to cuda   geek camp singapore 2011Introduction to cuda   geek camp singapore 2011
Introduction to cuda geek camp singapore 2011
 
Introduction to Erlang
Introduction to ErlangIntroduction to Erlang
Introduction to Erlang
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDA
 

Recently uploaded

Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 

Recently uploaded (20)

Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

Toying with spark

  • 1. S PA R K - N E W K I D O N T H E B L O C K
  • 2. A B O U T M E … • I designed Bamboo (HP’s Big Data Analytics Platform) • I write software (mostly with Scala but leaning towards Haskell recently …) • I like translating seq to parallel algorithms mostly using CUDA / OpenCL; embedded assembly is an EVIL thing. • I wrote 2 books • OpenCL Parallel Programming Development Cookbook • Developing an Akka Edge
  • 3. W H AT ’ S C O V E R E D T O D AY ? • What’s Apache Spark • What’s a RDD ? How can i understand it ? • What’s Spark SQL • What’s Spark Streaming • References
  • 4. W H AT ’ S A PA C H E S PA R K • As a beginner’s guide, you can refer to Tsai Li Ming’s talk. • API model abstracts • how to extract data from 3rd party s/w (via JDBC, Cassandra, HBase) • how to extract-compute data (via GraphX, MLLib, SparkSQL) • how to store data (data connectors to “local”, “hdfs”, “s3”
  • 5. R E S I L I E N T D I S T R I B U T E D D ATA S E T S • Apache Spark works on data broken into chunks • These chunks are called RDDs • RDDs are chained into a lineage graph => a graph that identifies relationships. • RDDs can be queried, grouped, transformed in a coarse grained manner to a fine grained manner.
  • 6. • A RDD has a lifecycle: • reification • lazy-compute/lazy re-compute • destruction • RDD’s lifecycle is managed by the system unless … • A program commands the RDD to persist() or unpersist() which affects the lazy computation. R E S I L I E N T D I S T R I B U T E D D ATA S E T S
  • 7. “ A G G R E G AT E ” I N S PA R K > val data = sc.parallelize( (1 to 4) toList,2) > data.aggregate(0) > .. (math.max(_, _), > .. ( _ + _ )) > ….. > result = 6 def aggregate(zerovalue: U) (fbinary: (U, T) => U, fagg: (U, U) => U): U
  • 8. H O W “ A G G R E G AT E ” W O R K S I N S PA R K e1 RDD fagg fbinary e2 e3 e4 zerovalue res1 fbinary res2 fagg final result caveat: partition-sensitive algorithm should work correctly regardless of partitions
  • 9. “ C O G R O U P ” I N S PA R K > val x = sc.parallelize(List(1, 2, 1, 3), 1) > val y = x.map((_, "y")) > val z = x.map((_, "z")) > y.cogroup(z).collect res72: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1, (Array(y, y),Array(z, z))), (3,(Array(y),Array(z))), (2, (Array(y),Array(z)))) def cogroup[W1, W2, W3] (other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
  • 10. H O W “ C O G R O U P ” W O R K S I N S PA R K RDDx (k1,va) (k2,vb) (k1,vc) (k3,vd) (k1,ve) (k1,vf) (k2,vg) (k1,vh) RDDy RDDx.cogroup(RDDy) =?
  • 11. H O W “ C O G R O U P ” W O R K S I N S PA R K Arraycombined Array[(k1,[va,vc,ve,vf,vh]), (k2,[vb,vg]), (k3,[vd])] RDDx.cogroup(RDDy) = *see below*
  • 12. “ C O G R O U P ” I N S PA R K • CoGroup works in both RDD and Spark Streams • the ability to combine multiple RDDs allows higher abstractions to be constructed • A Stream in Spark is just a list of (Time,RDD[U])
  • 13. W H AT ’ S S PA R K S Q L • Spark SQL is new, largely replaced Shark • Large scale queries (inline queries) to be embedded into a Spark program • Spark SQL supports Apache Hive, JSON, Parquet, RDD. • Spark SQL’s optimizer is clever! • Supports UDFs from Hive or Write your own !
  • 14. S PA R K S Q L J S O N S PA R K S Q L PA R Q U E TH I V E data sources R D D
  • 15. S PA R K S Q L ( A N E X A M P L E ) // import spark sql import org.apache.spark.sql.hive.HiveContext // create a spark sql hivecontext val sc = new SparkContext(…) val hiveCtx = new HiveContext(sc)
  • 16. S PA R K S Q L ( A N E X A M P L E ) // import spark sql import org.apache.spark.sql.hive.HiveContext // create a spark sql hivecontext val sc = new SparkContext(…) val hiveCtx = new HiveContext(sc) val input = hiveCtx.jsonFile(inputFile) input.registerTempTable(“tweets”)
  • 17. S PA R K S Q L ( A N E X A M P L E ) // import spark sql import org.apache.spark.sql.hive.HiveContext // create a spark sql hivecontext val sc = new SparkContext(…) val hiveCtx = new HiveContext(sc) val input = hiveCtx.jsonFile(inputFile) input.registerTempTable(“tweets”) val topTweets = hiveCtx.sql(“SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10”)
  • 18. S PA R K S Q L ( A N E X A M P L E ) // import spark sql import org.apache.spark.sql.hive.HiveContext // create a spark sql hivecontext val sc = new SparkContext(…) val hiveCtx = new HiveContext(sc) val input = hiveCtx.jsonFile(inputFile) input.registerTempTable(“tweets”) val topTweets = hiveCtx.sql(“SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10”) val topTweetContent = topTweets.map(row ⇒ row.getString(0))
  • 19. W H AT ’ S S PA R K S T R E A M I N G • Core component is a DStream • DStream is an abstract RDD whose basic components is a (key,value) pairs where key = Time, value = RDD. • Forward and backward queries are supported • Fault-Tolerance by check-pointing RDDs. • What you can do with RDDs, you can do with DStreams.
  • 20. S PA R K S T R E A M I N G ( Q U I C K E X A M P L E ) import org.apache.spark.streaming.StreamingContext import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.dstream.DStream import org.apache.spark.streaming.Duration // Create a StreamingContext with a 1-second batch size from a SparkConf val ssc = new StreamingContext(conf, Seconds(1))

  • 21. S PA R K S T R E A M I N G ( Q U I C K E X A M P L E ) import org.apache.spark.streaming.StreamingContext import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.dstream.DStream import org.apache.spark.streaming.Duration // Create a StreamingContext with a 1-second batch size from a SparkConf val ssc = new StreamingContext(conf, Seconds(1))
 // Create a DStream using data received after connecting to // port 7777 on the local machine val lines = ssc.socketTextStream("localhost", 7777)
 // Filter our DStream for lines with "error"
 val errorLines = lines.filter(_.contains("error"))
 // Print out the lines with errors
 errorLines.print()
  • 22. S PA R K S T R E A M I N G ( Q U I C K E X A M P L E ) import org.apache.spark.streaming.StreamingContext import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.dstream.DStream import org.apache.spark.streaming.Duration // Create a StreamingContext with a 1-second batch size from a SparkConf val ssc = new StreamingContext(conf, Seconds(1))
 // Create a DStream using data received after connecting to // port 7777 on the local machine val lines = ssc.socketTextStream("localhost", 7777)
 // Filter our DStream for lines with "error"
 val errorLines = lines.filter(_.contains("error"))
 // Print out the lines with errors
 errorLines.print() // Start our streaming context and wait for it to "finish" ssc.start()
 // Wait for the job to finish ssc.awaitTermination()
  • 23. A D S T R E A M L O O K S L I K E … t1 to t2 t2 to t3 t3 to t4 timestart DStream
  • 24. A D S T R E A M C A N H AV E T R A N S F O R M AT I O N S O N T H E M ! t1 to t2 timestart DStream(s) t1 to t2 data-1 data-2 f transformation on the fly!
  • 25. S PA R K S T R E A M T R A N S F O R M AT I O N t1 to t2t2 to t3 timestart DStream(s) t1 to t2t2 to t3 data-1 data-2 f f data output in batches
  • 26. S PA R K S T R E A M T R A N S F O R M AT I O N t3 to t4 timestart DStream(s) t3 to t4 data-1 data-2 f t1 to t2t2 to t3 t1 to t2t2 to t3 f fff
  • 27. S TAT E F U L S PA R K S T R E A M T R A N S F O R M AT I O N t3 to t4 timestart DStream(s) t3 to t4 data-1 data-2 f t1 to t2t2 to t3 t1 to t2t2 to t3 f fff
  • 28. H O W D O E S S PA R K S T R E A M I N G H A N D L E FA U LT S ? • As before, check-point is the key to fault-tolerance (especially in stateful-dstream transformations) • Programs can recover from check-points => no need to restart all over again. • You can use “monit” to restart Spark jobs or pass the Spark flag “- - supervise” to the job config a.k.a driver fault tolerance
  • 29. • All incoming data to workers replicated • In-house RDDs follow the lineage graph to recover • The above is known as worker fault tolerance. • Receivers fault tolerance is largely dependent on whether data sources can re-send lost data • Streams guarantee exactly-once semantics; caveat: multiple writes can occur to the HDFS (app specific logic needs to handle) H O W D O E S S PA R K S T R E A M I N G H A N D L E FA U LT S ?
  • 30. R E F E R E N C E S • Books: • “Learning Spark: Lightning Fast Big Data ANlaytics” • “Advanced Analytics with Spark: Patterns for Learning from Data At Scale” • “Fast Data Processing with Spark” • “Machine Learning with Spark” • Berkeley Data Bootcamp • Introduction to Big Data with Apache Spark • Kien Dang’s introduction to Spark and R using Naive Bayes (click here) • Spark Streaming with Scala and Akka (click here)
  • 31. T H E E N D Q U E S T I O N S ? T W I T T E R : @ R AY M O N D TAY B L G I T H U B : @ R AY G I T