"Apache Spark™ is a fast and general engine for large-scale data processing." The statement above is taken from the Apache Spark welcome page. It's one of those definitions that, while describing the product in one sentence and being 100% true, still tells a curious newcomer very little.
Why take an interest in Apache Spark? Apache Spark promises to be up to 100x faster than Hadoop MapReduce in certain scenarios. It provides a comprehensible programming model (familiar to anyone used to functional programming) and a vast ecosystem of tools.
In my talk I will try to reveal the secrets of Apache Spark to absolute beginners.
We will first give a quick introduction to the set of problems commonly known as Big Data: what they try to solve, what their obstacles and challenges are, and how those can be addressed. We will quickly take a peek at MapReduce: theory and implementation. We will then move on to Apache Spark. We will see what the main factor was that drove its creators to introduce yet another large-scale processing engine, how it works, and what its main advantages are. The presentation will be a mix of slides and code examples.
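The MapReduce model the talk builds on can be sketched in a few lines of plain Python (a toy, single-machine illustration of the idea, not Hadoop's or Spark's actual API): a map phase emits key-value pairs, and a reduce phase groups them by key and combines the values.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["to be or not to be"]
print(reduce_phase(map_phase(lines)))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real cluster the map and reduce phases run in parallel across machines; the toy above keeps only the shape of the computation.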
Does Google still need links? - SearchLove San Diego 2017 | Tom Capper
Back in Google's early days, people navigated the web using links, and this made PageRank an excellent proxy for popularity and authority. The web is moving away from primarily link based surfing, and Google no longer needs a proxy - so what, in 2017, is the point in links?
The Sourcecon webinar slides delivered by Andy Headworth from http://sironaconsulting.com/ on 22nd October 2014. It is about using Twitter and Google Plus to source candidates.
It covers sourcing individuals on both Google+ and Twitter as well as sourcing candidates from Communities and Twitter Lists.
Using Visualizations to Monitor Changes and Harvest Insights from a Global-sc... | Krist Wongsuphasawat
Slides from my talk at the IEEE Conference on Visual Analytics Science and Technology (VAST) 2014 in Paris, France.
ABSTRACT
Logging user activities is essential to data analysis for internet products and services.
Twitter has built a unified logging infrastructure that captures user activities across all clients it owns, making it one of the largest datasets in the organization.
This paper describes challenges and opportunities in applying information visualization to log analysis at this massive scale, and shows how various visualization techniques can be adapted to help data scientists extract insights.
In particular, we focus on two scenarios: (1) monitoring and exploring a large collection of log events, and (2) performing visual funnel analysis on log data with tens of thousands of event types.
Two interactive visualizations were developed for these purposes:
we discuss design choices and the implementation of these systems, along with case studies of how they are being used in day-to-day operations at Twitter.
New information social networking - Nice 2011 | John Mayfield
This was a presentation for FNAIM in Nice, France, to explain why Social Media is important to consider as a real estate agent, and ways to implement social media in their businesses.
Enterprise SEO - Pain Management Strategies | petryshen
Achieving success in SEO can be challenging at the best of times. At the enterprise level, multiple stakeholders, legacy technology and unknown hurdles can make the road to SEO success long and difficult. Once you get there, the last thing you want to do is to fall off the radar. Tom shares his SEO pain management strategies to stay on top.
If you've avoided creating social media profiles because you were hoping it was a passing fad, or if you created them but have no idea what to do now, this presentation is for you. You'll learn the essentials you need to optimize social media for business and personal use and control the digital fingerprints you leave behind.
Learner objectives:
- Understand the difference between different social media platforms.
- Learn what content is best to post and share on each platform.
- Identify best practices for professional and personal use.
For more business-friendly advice, especially for admins and event planners, visit http://planyourmeetings.com. Like what you see? Subscriptions are free!
A complete guide to the best times to post on social media (and more!) | Marketing Wallah
Do you know the most effective times to post on social media, send an email, or publish a blog? We've broken down the data behind the most effective times to post content on Twitter, Instagram, Facebook, Content Marketing, and Email.
Congrats! You're being social. Now what? Managing multiple profiles can be overwhelming and more than a little intimidating. If you're not seeing any return on the time you're spending on social platforms or if you're not sure of next steps, this session is for you. You'll learn about tools that will help you minimize the amount of time you're spending on social media, while maximizing the size and engagement level of your audience.
This presentation was originally created for the 2015 IAAP Georgia-Alabama Branch Event.
For more business advice, best practices and time-management tips, visit http://planyourmeetings.com. Like what you see? Subscriptions are free.
SearchLove San Diego 2018 | Ashley Ward | Reuse, Recycle: How to Repurpose Yo... | Distilled
Creating content that can be reused is an effective way to extend the life of your content, increase its views, and reach your content marketing goals. Ashley will be demonstrating how to find the content which should be reused, the rules to follow when reusing your content, and how to analyze the effectiveness of this recycled content and which tools will best help us find ROI.
You can expect to walk out of this content session with a strategic plan to take back to your office on how to recycle your content at low cost and achieve high ROI.
We're told on a regular basis to monitor the performance, speed, responsiveness, memory, and general health of our websites, with the ever-present threat of down time hanging over our shoulder. But how often do we pay this same attention to our own physical and mental health?
As a Type 1 Diabetic, it's a little more front-of-mind, as it's not just about how much exercise I've gotten in the last month, how healthy my diet is, or how much of a workaholic I am... It's about what the ratio of sugar to insulin is in my bloodstream at every moment of every day. It's about making sure I've got a spare insulin pod, my test machine, a granola bar, glucose tabs, and my trusty sidekick Ember Dog (with all of his accouterment) at all times.
But just because I have to be more aware of certain things doesn't lessen the importance of paying attention to general physical and mental health, which come with their own set of potentially deadly side effects. In this Ignite talk, I'll touch briefly on my day-to-day life with diabetes, and then segue into what the past two years have taught me about mental and physical health.
A visualization tool used to see whether ai-one's biologically inspired computing can discern meaningful associations in the mess of tweets from a technical conference. This capability serves as the foundation for building intelligent agents and other applications that allow human interpretation of large data sets.
Voice search is fast becoming part of our broader organic search ecosystem. Discover how organic data, including Featured Snippets, feed voice and represent new SEO opportunities.
We start by very briefly introducing the Twitter platform and detailing the demographics of the users and the biases they introduce. The relationship between geography, mobility and social network properties will be described using the Twitter service as a case study. Finally, tutorial attendees will get the chance to review the most seminal works in the area where spatial and geographic perspectives are highlighted.
This slide deck is used as an introduction to the internals of Apache Spark, as part of the Distributed Systems and Cloud Computing course I hold at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
Apache Spark is a In Memory Data Processing Solution that can work with existing data source like HDFS and can make use of your existing computation infrastructure like YARN/Mesos etc. This talk will cover a basic introduction of Apache Spark with its various components like MLib, Shark, GrpahX and with few examples.
This presentation summarizes multiple screen development difficulties, optimizations for different kinds of devices and screen sizes and gives best practices to handle multi screen problems in Android.
Redington Value is the Value Added Distribution division of Redington Gulf, the largest distributor of IT products in the Middle East and Africa. Redington Value helps its partners in the channel deliver the most optimal IT solution to their customers in the Middle East and Africa. These solutions span technology domains such as Networking, Voice, Servers, Storage, Software, Security and Infrastructure.
Sukrit magazine offers complete guidance on human science and life: identity, spirituality, stressful job conditions, health, hygiene and much more.
It offers remedies for the stresses of present-day human life.
The key to spiritual ascension and good living.
Explore eternal knowledge, self-identity, the objective of human life, science, health and professional guidance through the latest issue of Sukrit. This magazine is a unique tool for a fulfilled human life.
Automotive Troubleshooting With An Oscilloscope | Jeffrey Bledsoe
A local auto repair shop made several bad guesses at a church van's intermittent ignition issue, with costs totaling about $1,600. Volunteering to determine the cause of the problem, I instrumented engine ignition signals, set the oscilloscope to trigger on engine shutdown, and drove the van around town for about 3 months until the failure mechanism was revealed.
Novel machine learning techniques come from spending time with people who have distinct needs. This talk addresses how listening to end users can give rise to novel machine learning applications.
Faster! Faster! Accelerate your business with blazing prototypes | OSCON Byrum
Bring your ideas to life! Convince your boss that open source development is faster and cheaper than the "safe" COTS solution they probably hate anyway. Let's investigate ways to get real-life, functional prototypes up with blazing speed. We'll look at and compare tools for truly rapid development including Python, Django, Flask, PHP, Amazon EC2 and Heroku.
DN18 | A/B Testing: Lessons Learned | Dan McKinley | Mailchimp | Dataconomy Media
Abstract about the Presentation:
Introducing A/B testing to a large team that has never done it before is a weird and bewildering thing that Dan McKinley has somehow done twice. This has burdened him with many opinions about how to achieve this with minimal wailing and gnashing of teeth.
About the Author:
Dan McKinley is a Co-Founder of Skyliner in Los Angeles. Previously he worked at Stripe and spent nearly 7 years building Etsy, during which he worked on “pretty much every feature and backend facility on the site”. He resides in LA with his wife and son.
LESSON 3B. FOCUS: FOR LOOPS, NESTED LOOPS, TASKS AND CHALLENGES.
Introduction to, with examples, For loops. Challenges and tasks included with solutions (predict the output). Compare ‘while’ and ‘for’ loops. Use the break statement and explore how it works in different scenarios. Learn about Nested Loops. Learn about the need for initialisation (set starting value). Create your own for loops. Create the beginnings of an arithmetic quiz using a random function and for loops. Big ideas discussion: Is the universe digital. A program? Introducing Gottfried Leibniz and Konrad Zuse. Includes a suggested videos, ‘Big ideas’ discussion, and HW/research projects section.
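The loop constructs the lesson covers can be sketched in Python like this (my own minimal illustration; the quiz snippet auto-answers itself so the example runs unattended, whereas in class you would read the answer with input()):

```python
import random

# A 'for' loop with break: stop at the first multiple of 7
for n in range(1, 20):
    if n % 7 == 0:
        print("first multiple of 7:", n)  # prints 7
        break

# Nested loops: a small multiplication table
for row in range(1, 4):
    for col in range(1, 4):
        print(row * col, end=" ")
    print()

# Beginnings of an arithmetic quiz using a random function and a for loop
score = 0  # initialisation: set the starting value
for _ in range(3):
    a, b = random.randint(1, 10), random.randint(1, 10)
    answer = a + b  # in the real quiz, read this with: int(input(f"What is {a} + {b}? "))
    if answer == a + b:
        score += 1
print("You scored", score, "out of 3")
```

Predicting the output of snippets like these is exactly the kind of task the lesson's challenges pose.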
From list sorting to network routing, and from hash tables to capacity planning, a programmer's daily work is filled with probability. We use probabilistic algorithms, data structures, and systems constantly often without even thinking about it. Experienced engineers reach for probabilistic algorithms frequently and intentionally, especially when building systems of serious scale. How do probabilistic algorithms actually work in practice? And how do we know they'll be safe and reliable in our critical production systems? We'll address those questions, explore a few algorithms, and see why "with high probability" is often better than "exactly".
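As a concrete illustration (my own minimal sketch, not code from the talk): a Bloom filter answers membership queries "with high probability". It may report false positives but never false negatives, and it uses far less memory than an exact set.

```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter: set membership with possible false
    positives, guaranteed no false negatives."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        # Derive several bit positions from independent-ish hashes
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means probably present
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("spark")
print(bf.might_contain("spark"))   # True
print(bf.might_contain("hadoop"))  # almost certainly False
```

With 1024 bits, 3 hash functions, and one item inserted, the false-positive probability is roughly (3/1024)³, which is why "probably present" is good enough in practice.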
Replication in Data Science - A Dance Between Data Science & Machine Learning... | June Andrews
We use Iterative Supervised Clustering as a simple building block for exploring Pinterest's content. But simplicity can unlock great power, and with this building block we show the shocking result of how hard it is to replicate data science conclusions. This leads us to ask: when is data science a house of cards?
Computing Social Score of Web Artifacts | Venkatesh J N
We propose an approach which computes a single aggregate score of an artifact that reflects the popularity across different social media sites and not just limited to any particular site.
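A minimal sketch of such an aggregation (the site names and weights below are hypothetical illustrations, not the weighting proposed in the paper): normalise each site's popularity signal to a common scale, then take a weighted combination so that no single site dominates just because its raw counts are larger.

```python
def aggregate_score(signals, weights):
    """Combine per-site popularity signals (each already normalised
    to [0, 1]) into one weighted aggregate score."""
    total_weight = sum(weights[site] for site in signals)
    return sum(weights[site] * value for site, value in signals.items()) / total_weight

# Hypothetical normalised popularity signals for one artifact
signals = {"twitter": 0.8, "facebook": 0.5, "reddit": 0.2}
weights = {"twitter": 0.5, "facebook": 0.3, "reddit": 0.2}
print(round(aggregate_score(signals, weights), 2))  # 0.59
```

Dividing by the total weight keeps the score meaningful even when an artifact is absent from some sites.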
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da... | Databricks
PixieDust is a new open source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark be more efficient. PixieDust speeds up data manipulation and display with features like: auto-visualization of Spark DataFrames, real-time Spark job progress monitoring, automated local install of Python and Scala kernels running with Spark, and much more.
Come along and learn how you can use this tool in your own projects to visualize and explore data effortlessly with no coding. Oh, and if you prefer working with a Scala Notebook, this session is also for you, as PixieDust can also run on a Scala Kernel. Imagine being able to visualize your favorite Python chart engines from a Scala Notebook!
We’ll finish the session with a demo combining Twitter, Watson Tone Analyzer, Spark Streaming, and some fun real-time visualizations–all running within a Notebook.
Maintainable Software Architecture in Haskell (with Polysemy) | Pawel Szulc
Target audience:
Developers who are interested in seeing Haskell being used as a general programming language. Engineers who are hoping to see Haskell as an environment in which they can quickly and effectively iterate between requirements, design and running executable code: providing value to the business with an immediate feedback loop. Developers who are eager to see how they can rapidly create software that is flexible to changes, extensible and testable.
Topics explored:
software architecture
software testability and maintainability
Free Monads
Polysemy library
Pre-requirements:
- basic understanding of Haskell syntax (functions, composition, do-notation, type classes)
- basic FP building blocks: functor, monad
- (optionally) some previous exposure to free monads
- (optionally) scars and war stories of using tagless final in Haskell
Tech stack we will explore:
- polysemy
- servant
It's happened to all of us: we ran away from some conversation or library because it kept on using those "weird" phrases. You know, like "type classes", "semigroups", "monoids", "applicatives". Yikes! They all seem so academic, so pointlessly detached from real-world problems. But then again, given how frequently we run into them in functional programming, are they REALLY irrelevant, or do they have real-world applications? This talk will go beyond giving you raw definitions of these terms, and show you real-world motivations behind the concepts. By attending, you'll be able to keep your skills relevant to an ever-changing industry, confuse your significant other ("You know, honey, a monad is just a monoid in the category of endofunctors!"), and sound extra smart on the next job interview!
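To make one of those "weird" words concrete: a monoid is just an associative operation plus an identity element, and that is exactly what makes folds safe on empty input and splittable across chunks or machines. A small sketch (my own illustration, not code from the talk):

```python
from functools import reduce

# A monoid is a set with an associative operation and an identity element.
def mconcat(op, identity, values):
    # Fold the values with op, starting from the identity,
    # so empty input is handled for free.
    return reduce(op, values, identity)

assert mconcat(lambda a, b: a + b, 0, [1, 2, 3]) == 6         # ints under +
assert mconcat(lambda a, b: a + b, "", ["fo", "o"]) == "foo"  # strings under concat
assert mconcat(max, float("-inf"), [3, 1, 2]) == 3            # max with -inf identity

# Associativity lets you fold chunks independently, then combine the
# partial results: the basis of map-reduce style parallelism.
chunks = [[1, 2], [3], []]
partial = [mconcat(lambda a, b: a + b, 0, c) for c in chunks]
total = mconcat(lambda a, b: a + b, 0, partial)
print(total)  # 6
```

The same structure is what type-class libraries capture once and reuse everywhere.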
Developing Distributed High-performance Computing Capabilities of an Open Sci... | Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus... | Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Quarkus Hidden and Forbidden Extensions | Max Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
Understanding Globus Data Transfers with NetSage | Globus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks worldwide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including:
- Who is using Globus to share data with my institution, and what kind of performance are they able to achieve?
- How many transfers has Globus supported for us?
- Which sites are we sharing the most data with, and how is that changing over time?
- How is my site using Globus to move data internally, and what kind of performance do we see for those transfers?
- What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Globus Compute with IRI Workflows - GlobusWorld 2024 | Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work, the team is investigating ways to speed up the time to solution for many different parts of the DIII-D workflow, including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks, and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
SOCRadar Research Team: Latest Activities of IntelBroker | SOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntelBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar's Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G... | Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ... | Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc. I didn't get rich from it, but they did reach 63K downloads (powering possibly tens of thousands of websites).
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite | Google
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
An Enterprise Resource Planning system includes various modules that reduce any business's workload. Additionally, it organizes workflows, which enhances productivity. Here is a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
May Marketo Masterclass, London MUG May 22 2024.pdfAdele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden,India, Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfJay Das
With the advent of artificial intelligence or AI tools, project management processes are undergoing a transformative shift. By using tools like ChatGPT, and Bard organizations can empower their leaders and managers to plan, execute, and monitor projects more effectively.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
14. twitter: @rabbitonweb,
email: paul.szulc@gmail.com
Big Data is like...
“Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”
21. Big Data is all about...
● well, the data :)
● It is said that 2.5 exabytes (2.5×10^18 bytes) of data are created around the world every single day
● It is a scale at which you can no longer use standard tools and methods of processing
25. To the rescue: MapReduce
“MapReduce is a framework for processing parallelizable problems across huge datasets using a cluster, taking into consideration scalability and fault-tolerance.”
32. MapReduce - key/value
“In MapReduce, no value stands on its own. Every value has a key associated with it. Keys identify related values. The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of these functions is the same: both a key and a value.”
33. Word Count
● The “Hello World” of the Big Data world.
● For an initial input of multiple lines, extract all words together with their number of occurrences.
Input:
To be or not to be
Let it be
Be me
It must be
Let it be
Expected output:
be 6
to 2
it 3
let 2
or 1
not 1
must 1
me 1
36–39. Input → Splitting → Mapping → Shuffling → Reducing → Final result
Input:
To be or not to be
Let it be
Be me
It must be
Let it be
Splitting: every line becomes an independent input record that can be processed in parallel.
Mapping: each word is emitted as a (word, 1) pair:
to 1, be 1, or 1, not 1, to 1, be 1 | let 1, it 1, be 1 | be 1, me 1 | it 1, must 1, be 1 | let 1, it 1, be 1
Shuffling: pairs are grouped by key, so all occurrences of a word land at the same reducer:
be 1 ×6, to 1 ×2, it 1 ×3, let 1 ×2, or 1, not 1, must 1, me 1
Reducing: each reducer sums the partial counts for its key.
Final result:
be 6
to 2
it 3
let 2
or 1
not 1
must 1
me 1
41. Word count - pseudo-code
function map(String name, String document):
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  sum = 0
  for each pc in partialCounts:
    sum += ParseInt(pc)
  emit (word, sum)
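The pseudo-code above can be sketched as a tiny, self-contained simulation in plain Python (no Hadoop involved; `map_fn`, `shuffle` and `reduce_fn` are names invented for this illustration): the mapper emits (word, 1) pairs, a shuffle groups them by key, and the reducer sums each group.

```python
from collections import defaultdict

def map_fn(name, document):
    # emit a (word, 1) pair for every word in the document
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # group all values by key, like the framework's shuffle phase
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, partial_counts):
    # sum the partial counts for one word
    return (word, sum(partial_counts))

lines = ["To be or not to be", "Let it be", "Be me", "It must be", "Let it be"]
pairs = [p for line in lines for p in map_fn("doc", line)]
result = dict(reduce_fn(w, counts) for w, counts in shuffle(pairs).items())
print(result)
# {'to': 2, 'be': 6, 'or': 1, 'not': 1, 'let': 2, 'it': 3, 'me': 1, 'must': 1}
```

In a real cluster each phase runs on many machines and the shuffle moves data over the network; the logic per phase, however, is exactly this simple.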
46. Word count: Hadoop implementation
public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) { sum += val.get(); }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    // (truncated on the slide; the remaining lines set the output format,
    // the input/output paths and call job.waitForCompletion)
  }
}
60. Performance issues
● One rigid map–reduce pair per job; anything more complex requires chaining many jobs
● Output of every job is saved to the file system
● Iterative algorithms go through the IO path again and again
● Poor API: only (key, value); even a basic join requires expensive code
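To see why iterative algorithms hurt, here is a toy illustration in plain Python (function names are invented; a temp file stands in for HDFS): one version writes and re-reads its intermediate result on every pass, MapReduce-style, while the other keeps the working set in memory. Both compute the same answer; only the first pays IO on every iteration.

```python
import json
import os
import tempfile

def iterate_via_disk(data, step, iterations):
    # MapReduce-style: every iteration's output goes to storage and is read back
    for _ in range(iterations):
        data = [step(x) for x in data]
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)          # write the intermediate result out...
        with open(path) as f:
            data = json.load(f)         # ...and read it back for the next pass
        os.remove(path)
    return data

def iterate_in_memory(data, step, iterations):
    # in-memory engine style: the intermediate result stays in RAM between passes
    for _ in range(iterations):
        data = [step(x) for x in data]
    return data

result_disk = iterate_via_disk([1, 2, 3], lambda x: x * 2, 10)
result_ram = iterate_in_memory([1, 2, 3], lambda x: x * 2, 10)
print(result_disk == result_ram)  # True: same answer, very different IO cost
```

On a cluster the "file" is a replicated distributed file system, so each round trip also includes network transfer and replication, which is exactly the cost Spark avoids for iterative workloads.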
62. Problems with MapReduce
1. MapReduce provides a difficult programming model for developers
2. It suffers from a number of performance issues
3. While batch-mode analysis is still important, reacting to events as they arrive has become more important (MapReduce lacks support for near real-time processing)
101–105. Spark performance - vs Hadoop (3)
“(...) we decided to participate in the Sort Benchmark (...), an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by (...) Hadoop (...) cluster of 2100 nodes. Using Spark on 206 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All (...) without using Spark’s in-memory cache.”
117–147. The Big Picture
Driver Program:
val master = "spark://host:pt"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val logs = sc.textFile("logs.txt")
println(logs.count())

(The original deck animates this across slides 117–147.) The Driver Program talks to the Master of the cluster (Standalone, Yarn or Mesos), which allocates Executor 1, Executor 2 and Executor 3 on the worker nodes. sc.textFile("logs.txt") reads from distributed storage (HDFS, GlusterFS), taking data locality into account. When the action logs.count() runs, the driver breaks the job into tasks T1, T2 and T3 and ships them to the executors, which process their partitions in parallel and send their partial results back to the driver. When Executor 3 dies mid-job, its task is simply re-scheduled on the surviving executors and the job still completes.
151–152. RDD - the definition
RDD stands for Resilient Distributed Dataset
Resilient - if data is lost, it can be recreated
Distributed - stored on nodes across the cluster
Dataset - initial data comes from a file or can be created programmatically
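“Resilient” works through lineage: an RDD remembers how it was derived from its parent, so a lost partition can be recomputed instead of being restored from a replica. A minimal sketch of the idea in Python (the class and its methods are invented for illustration, not Spark's API):

```python
class TinyRDD:
    """Toy stand-in for an RDD: cached data plus the recipe that produced it."""

    def __init__(self, compute, parent=None):
        self.compute = compute      # how to (re)build this dataset
        self.parent = parent        # lineage: where the data came from
        self.cached = None

    def collect(self):
        if self.cached is None:           # data lost, or never computed?
            self.cached = self.compute()  # recreate it from the lineage
        return self.cached

    def map(self, f):
        # derive a child RDD; it records only the transformation, not the data
        return TinyRDD(lambda: [f(x) for x in self.collect()], parent=self)

base = TinyRDD(lambda: [1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())   # [2, 4, 6]
doubled.cached = None      # simulate losing the partition
print(doubled.collect())   # [2, 4, 6] again, recomputed from lineage
```

Real Spark tracks lineage per partition, so after a node failure only the lost partitions are recomputed, not the whole dataset.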
164. RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
And yet another RDD…
Performance alert?!?
165. RDD - Operations
1. Transformations
a. map
b. filter
c. flatMap
d. sample
e. union
f. intersection
g. distinct
h. groupByKey
i. ….
2. Actions
a. reduce
b. collect
c. count
d. first
e. take(n)
f. takeSample
g. saveAsTextFile
h. ….
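This transformation/action split is what resolves the “performance alert” from slide 164: transformations only describe work, and an action executes it. A rough analogy in plain Python using generators (an illustration of the lazy-evaluation idea, not Spark code):

```python
def text_file(lines):
    # generator: nothing runs until the data is consumed
    for line in lines:
        yield line

logs = text_file(["ERROR disk full", "ok", "error timeout"])
lc_logs = (line.lower() for line in logs)               # transformation: lazy
errors = (line for line in lc_logs if "error" in line)  # transformation: lazy

# Nothing has been read or lowercased yet. The action forces evaluation:
number_of_errors = sum(1 for _ in errors)               # action, like count
print(number_of_errors)  # 2
```

Because the whole chain is known before anything runs, Spark can fuse the map and filter into a single pass over each partition instead of materializing intermediate datasets.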
167–169. RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
val numberOfErrors = errors.count
The call to count (an action) is what triggers the computation; numberOfErrors then holds the calculated value (a Long).
174–175. Why Spark Streaming
A need to process data in almost real time:
● monitoring
● web log analysis
● fraud detection
● online ads
Problem: no single framework to do both batch & stream processing.
177–178. How does Spark Streaming work?
Live streamed data is chopped by Spark Streaming into small RDDs (micro-batches); Spark Core then processes each of these small RDDs in turn and produces the output data.
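The micro-batch idea above can be sketched in a few lines of plain Python (function names are hypothetical): chop the incoming stream into small batches and hand each one to the same function you would use for ordinary batch processing.

```python
def micro_batches(stream, batch_size):
    # chop a live stream into small batches (the "small RDDs")
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch          # hand one micro-batch to the core engine
            batch = []
    if batch:
        yield batch              # flush whatever arrived last

def process_batch(batch):
    # the same batch-style code handles every micro-batch
    return len(batch)

stream = iter(["click", "view", "click", "buy", "view"])
counts = [process_batch(b) for b in micro_batches(stream, 2)]
print(counts)  # [2, 2, 1]
```

Spark Streaming cuts batches by time (e.g. every second) rather than by count, but the consequence is the same: one engine, and one API, serves both batch and streaming workloads.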
179–182. Spark Streaming - Usage
val ssc = new StreamingContext(conf, Seconds(1))
// similar to SparkContext: the entry point for the streaming API
val lines = ssc.socketTextStream("localhost", 9999)
// a DStream is created (think of it as a streamed RDD)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// exactly the same API as for RDDs
ssc.start()