All product images owned by respective companies/institutions
Intro to Apache Spark
Takeaways
To understand:
•  Why we have big data today
•  What big data problems Spark solves
•  How Spark approaches big data differently
But most of all… to feel comfortable trying Spark out!
Image Credit: http://commons.wikimedia.org/wiki/File:BigData_2267x1146_white.png
Why does big data exist?
7.2 B 6.8 B 1.44 B 300 M 236 M 3.5 B / day
When data is small it’s cute and cuddly, easy to contain…
When data gets big, we need tools to help us.
What tools can help?
2002 – MapReduce @ Google
2004 – MapReduce Paper
2006 – Hadoop @ Yahoo
2011 – Hadoop Released
Hadoop Data Flow
But MapReduce falls short…
Hadoop’s Limitations
Lacks the one thing needed to succeed at:
•  Iterative queries
•  Interactive queries
That missing piece: fast data sharing.
Image courtesy of: http://workinganalytics.com/
We need… a better way.
We need… fault tolerance and speed.
We need… a better data abstraction.
The Solution…
•  Resilient Distributed Datasets (RDDs)
–  A distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
2002 – MapReduce @ Google
2004 – MapReduce Paper
2006 – Hadoop @ Yahoo
2009 – Spark at UC Berkeley
2011 – Hadoop Released
2013 – Spark @ Apache
2014 – Spark 1.0 Released
Hadoop Data Flow
Spark Data Flow
Why Spark?
Fast
General Purpose
Easy
Streaming
Adoption
Image Credits:
http://pixabay.com/en/tunnel-light-speed-fast-auto-101976/
http://www.freestockphotos.biz/stockphoto/9182
http://upload.wikimedia.org/wikipedia/commons/9/92/Easy_button.JPG
http://pixabay.com/en/faucet-water-bad-sanitaryblock-686958/
Spark Use Cases
ETL
Machine Learning
Analytics
Modeling
Data Mining
Table Credit: http://www.wsj.com/articles/SB10001424052970203914304576630742911364206
Spark Modules
Image Credit: http://www.numaq.com
Basics
Spark Data Flow
Creating RDDs
•  From practically any data source
–  HDFS
–  Local file system
–  S3
–  NoSQL (Cassandra, HBase, …)
–  JDBC
•  From any collection
•  Transform an existing RDD
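Each of these sources can be sketched in a couple of lines from the Spark shell. This assumes the shell's built-in SparkContext `sc`; the HDFS path is a placeholder, as elsewhere in these slides.

```scala
// From a local collection:
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// From practically any data source (HDFS, local file system, S3, ...):
val lines = sc.textFile("hdfs://...") // path elided

// By transforming an existing RDD:
val doubled = nums.map(_ * 2)
```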
[Diagram: Text File → File RDD → Word RDD → Word Count RDD → All Word Counts, via Read File → Split Words → Count Words → Store Result]

We start with some data. Put it in a form Spark understands…

RDDs:
•  Computation blueprint
•  Lazy: hold instructions – not data

Transformations chain operations together. Nothing is actually computed yet…

Actions compute results. Why is laziness good?

Only compute what we need – e.g. a Top 10 Words action. Laziness allows you to:
- Focus more on the algorithm
- Worry less about performance

By default, RDDs are recomputed on each use – e.g. reusing the Word RDD to find words starting with “A” triggers the whole chain again.

For better performance… persist reused RDDs.

RDDs are fault tolerant. Lineage allows recreation.
Once	
  more,	
  with	
  code	
  
Word	
  Count	
  Example	
  
val input = sc.textFile("hdfs://...") // HadoopRDD
// Transformation
val words = input.flatMap(line => line.split(" ")) // FlatMappedRDD
// Transformation
val result = words.map(word => (word, 1)).reduceByKey((acc, curr) => acc + curr)
// Action
val collectedResult = result.collect()
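The branches from the data-flow diagrams – Top 10 Words, and words starting with “A” – can be sketched by extending the example above. This is a sketch, not part of the original deck: it assumes the `words` and `result` RDDs defined above, and persists `words` because it is reused by two separate actions.

```scala
// Persist the reused RDD so it is not recomputed for each action below.
words.persist()

// Branch 1: Top 10 Words – swap (word, count) to (count, word), sort, take.
val top10 = result.map { case (word, count) => (count, word) }
                  .sortByKey(ascending = false)
                  .take(10) // Action

// Branch 2: words starting with "A".
val aWords = words.filter(word => word.startsWith("A")) // Transformation
val aCount = aWords.count() // Action
```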
Image courtesy of http://blog.jetoile.fr
Cluster Basics
Image courtesy of https://spark.apache.org
Driver Program (“Main”):
•  Connects our program to Spark
•  Creates RDDs
•  Executes code on the cluster
Cluster Manager (YARN, Mesos, Standalone…):
•  Acquires cluster resources
Workers:
•  Spawn executors
•  Perform tasks
Managed by Spark vs. managed by you
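Picking the cluster manager happens at submission time. A sketch of a `spark-submit` invocation – the class and jar names are hypothetical placeholders; `--master` selects YARN, Mesos (`mesos://host:port`), Standalone (`spark://host:port`), or `local[*]` for a single machine:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.WordCount \
  wordcount.jar
```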
In Action
Questions?
More Information on Spark
•  https://spark.apache.org/docs/latest/index.html
•  http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf
•  http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
•  http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf
•  http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
•  https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia
•  http://www.meetup.com/Washington-DC-Area-Spark-Interactive/
•  https://spark-summit.org/
Shared Variables
●  Broadcast variables
o  Let you keep a read-only variable cached on each machine instead of shipping it with every task
o  e.g. a lookup table
●  Accumulators
o  Workers can “add” to them using associative operations
o  Only the driver can read them
o  Used for:
§  counters
§  sums
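Both kinds of shared variable can be sketched in a few lines. This uses the Spark 1.x API (matching the Spark 1.0 era of these slides) and assumes the shell's SparkContext `sc`; the lookup table and file path are placeholders.

```scala
// Broadcast: a read-only lookup table cached on each machine.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // hypothetical table

// Accumulator: workers add to it; only the driver reads it.
val blankLines = sc.accumulator(0)

val lines = sc.textFile("hdfs://...") // path elided, as earlier
val codes = lines.map { line =>
  if (line.isEmpty) blankLines += 1          // associative "add" on workers
  lookup.value.getOrElse(line, 0)            // read broadcast value in a task
}
codes.count()                                // run an action first...
println(blankLines.value)                    // ...then read on the driver only
```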