All product images owned by respective companies/institutions
Intro to Apache Spark
Takeaways
To understand:
•  Why we have big data today
•  What big data problems Spark solves
•  How Spark approaches big data differently
But most of all… to feel comfortable trying Spark out!
Image Credit: http://commons.wikimedia.org/wiki/File:BigData_2267x1146_white.png
Why does big data exist?
7.2 B 6.8 B 1.44 B 300 M 236 M 3.5 B / day
When data is small it’s cute and cuddly, easy to contain…
When data gets big, we need tools to help us.
What tools can help?
2002 – MapReduce @ Google
2004 – MapReduce Paper
2006 – Hadoop @ Yahoo
2011 – Hadoop Released
Hadoop Data Flow
But MapReduce falls short…
Hadoop’s Limitations
Lacks the one thing needed to succeed at:
•  Iterative queries
•  Interactive queries
That missing piece: fast data sharing.
Image courtesy of: http://workinganalytics.com/
We need… a better way.
We need… fault tolerance and speed.
We need… a better data abstraction.
The Solution…
•  Resilient Distributed Datasets (RDDs)
–  A distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
2002 – MapReduce @ Google
2004 – MapReduce Paper
2006 – Hadoop @ Yahoo
2009 – Spark at UC Berkeley
2011 – Hadoop Released
2013 – Spark @ Apache
2014 – Spark 1.0 Released
Hadoop Data Flow
Spark Data Flow
Why Spark?
Fast
General Purpose
Easy
Streaming
Adoption
Image Credits:
http://pixabay.com/en/tunnel-light-speed-fast-auto-101976/
http://www.freestockphotos.biz/stockphoto/9182
http://upload.wikimedia.org/wikipedia/commons/9/92/Easy_button.JPG
http://pixabay.com/en/faucet-water-bad-sanitaryblock-686958/
Spark Use Cases
ETL
Machine Learning
Analytics
Modeling
Data Mining
Table Credit: http://www.wsj.com/articles/SB10001424052970203914304576630742911364206
Spark Modules
Image Credit: http://www.numaq.com
Basics
Spark Data Flow
Creating RDDs
•  From practically any data source
–  HDFS
–  Local file system
–  S3
–  NoSQL (Cassandra, HBase, …)
–  JDBC
•  From any collection
•  Transform an existing RDD
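Each of these sources can be sketched in a couple of lines from the Spark shell. This assumes the shell's built-in SparkContext `sc`; the HDFS path is a placeholder, as elsewhere in these slides.

```scala
// From a local collection:
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// From practically any data source (HDFS, local file system, S3, ...):
val lines = sc.textFile("hdfs://...") // path elided

// By transforming an existing RDD:
val doubled = nums.map(_ * 2)
```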
[Diagram: Text File → File RDD → Word RDD → Word Count RDD → All Word Counts, via Read File → Split Words → Count Words → Store Result]

We start with some data. Put it in a form Spark understands…

RDDs:
•  Computation blueprint
•  Lazy: hold instructions – not data

Transformations chain operations together. Nothing is actually computed yet…

Actions compute results. Why is laziness good?

Only compute what we need – e.g. a Top 10 Words action. Laziness allows you to:
- Focus more on the algorithm
- Worry less about performance

By default, RDDs are recomputed on each use – e.g. reusing the Word RDD to find words starting with “A” triggers the whole chain again.

For better performance… persist reused RDDs.

RDDs are fault tolerant. Lineage allows recreation.
Once	
  more,	
  with	
  code	
  
Word	
  Count	
  Example	
  
val input = sc.textFile("hdfs://...") // HadoopRDD
// Transformation
val words = input.flatMap(line => line.split(" ")) // FlatMappedRDD
// Transformation
val result = words.map(word => (word, 1)).reduceByKey((acc, curr) => acc + curr)
// Action
val collectedResult = result.collect()
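The branches from the data-flow diagrams – Top 10 Words, and words starting with “A” – can be sketched by extending the example above. This is a sketch, not part of the original deck: it assumes the `words` and `result` RDDs defined above, and persists `words` because it is reused by two separate actions.

```scala
// Persist the reused RDD so it is not recomputed for each action below.
words.persist()

// Branch 1: Top 10 Words – swap (word, count) to (count, word), sort, take.
val top10 = result.map { case (word, count) => (count, word) }
                  .sortByKey(ascending = false)
                  .take(10) // Action

// Branch 2: words starting with "A".
val aWords = words.filter(word => word.startsWith("A")) // Transformation
val aCount = aWords.count() // Action
```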
Image courtesy of http://blog.jetoile.fr
Cluster Basics
Image courtesy of https://spark.apache.org
Driver Program (“Main”):
•  Connects our program to Spark
•  Creates RDDs
•  Executes code on the cluster
Cluster Manager (YARN, Mesos, Standalone…):
•  Acquires cluster resources
Workers:
•  Spawn executors
•  Perform tasks
Managed by Spark vs. managed by you
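Picking the cluster manager happens at submission time. A sketch of a `spark-submit` invocation – the class and jar names are hypothetical placeholders; `--master` selects YARN, Mesos (`mesos://host:port`), Standalone (`spark://host:port`), or `local[*]` for a single machine:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.WordCount \
  wordcount.jar
```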
In Action
Questions?
More Information on Spark
•  https://spark.apache.org/docs/latest/index.html
•  http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf
•  http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
•  http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf
•  http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
•  https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia
•  http://www.meetup.com/Washington-DC-Area-Spark-Interactive/
•  https://spark-summit.org/
Shared Variables
●  Broadcast variables
o  Let you keep a read-only variable cached on each machine instead of shipping it with every task
o  e.g. a lookup table
●  Accumulators
o  Workers can “add” to them using associative operations
o  Only the driver can read them
o  Used for:
§  counters
§  sums
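Both kinds of shared variable can be sketched in a few lines. This uses the Spark 1.x API (matching the Spark 1.0 era of these slides) and assumes the shell's SparkContext `sc`; the lookup table and file path are placeholders.

```scala
// Broadcast: a read-only lookup table cached on each machine.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // hypothetical table

// Accumulator: workers add to it; only the driver reads it.
val blankLines = sc.accumulator(0)

val lines = sc.textFile("hdfs://...") // path elided, as earlier
val codes = lines.map { line =>
  if (line.isEmpty) blankLines += 1          // associative "add" on workers
  lookup.value.getOrElse(line, 0)            // read broadcast value in a task
}
codes.count()                                // run an action first...
println(blankLines.value)                    // ...then read on the driver only
```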