From Big Data to Fast Data
An introduction to Apache Spark
Stefano Baghino
Codemotion Milan 2015
From Big Data to Fast Data with Functional Reactive Containerized Microservices and AI-driven Monads in a galaxy far far away…
Hello!
I am Stefano Baghino
Software Engineer @ DATABIZ

stefano.baghino@databiz.it
@stefanobaghino

Favorite PL: Scala
My hero: XKCD’s Beret Guy
What I fear: [object Object]
Agenda
u Big Data?
u Fast Data?
u What do we have now?
u How can we do better? 
u What is Spark? 
u What does it do? 
u How does it work? 
And also code, somewhere here and there.
1.
What is Big Data?
More than a buzzword, I guess
Really, what is it?
u Data that cannot be stored on a single box
u Requires horizontal scalability
u Requires a shift from traditional solutions
2.
What is Fast Data?
More than yet another buzzword
Basically: Streaming

The need to process huge quantities of incoming data in real-time.
Let’s look at MapReduce

Disk I/O all the time
Each step reads its input from and writes its output to disk.

Limited model
It’s difficult to fit all algorithms into the MapReduce model.
Ok, so what is so good about Spark?
May sit on top of an existing Hadoop deployment.

Builds heavily on simple functional programming ideas.

Computes and caches data in-memory to deliver blazing performance.
Fast? Really? Yes!
              Hadoop (102.5 TB)   Spark (100 TB)   Spark (1 PB)
Elapsed time  72 min              23 min           234 min
# Cores       50,400              6,592            6,080
Rate/node     0.67 GB/min         20.7 GB/min      22.5 GB/min
Source: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
So, where can I use it?
Java, Scala, Python
Momentum
700+ contributors
50+ companies
3.
What is Spark?
Let’s get to the point
The architecture
Deploy on the cluster manager of your choice:
◎ Local (127.0.0.1)
◎ Standalone
◎ Hadoop
◎ Mesos
Working with Spark
◎ Resilient Distributed Dataset
◎ Closely resembles a Scala collection
◎ Very natural to use for Scala devs
From the user’s point of view, the RDD is effectively
a collection, hiding all the details of its
distribution throughout the cluster.
Example
Word Count
Let’s get our hands a little bit dirty
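The demo itself isn’t reproduced in these slides, so here is a minimal sketch of the classic word count, assuming an existing SparkContext `sc` and the deck’s usual placeholder input path:

// Split lines into words, pair each word with 1, then sum per word
val counts = sc.textFile("hdfs://...")
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Action: triggers the actual computation
counts.collect().foreach(println)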
The anatomy of a Resilient Distributed Dataset
What about resilience?

Let’s learn what RDDs really are and how Spark works in order to get it.
What is an RDD, really?
[Diagram: an RDD execution graph, with operations like create, filter, join and collect linked as a DAG]
What can I do with an RDD?

Transformations

Produce a new RDD, extending the execution graph at each step.

e.g.:
◎ map
◎ flatMap
◎ filter

Actions

They are “terminal” operations that actually trigger the execution to extract a value.

e.g.:
◎ collect
◎ reduce
The execution model
1. Create a DAG of RDDs to represent the computation
2. Create a logical execution plan for the DAG
3. Schedule and execute individual tasks
The execution model in action
Let’s count distinct names grouped by their initial
sc.textFile("hdfs://...")
.map(n => (n.charAt(0), n))
.groupByKey()
.mapValues(n => n.toSet.size)
.collect()
Step 1: Create the logical DAG
sc.textFile("hdfs://...")      → HadoopRDD
.map(n => (n.charAt(0), n))    → MappedRDD
.groupByKey()                  → ShuffledRDD
.mapValues(n => n.toSet.size)  → MappedValuesRDD
.collect()                     → Array[(Char, Int)]
Step 2: Create the execution plan
u Pipeline as much as possible
u Split into “stages” based on the need to “shuffle” data
HadoopRDD
MappedRDD
ShuffledRDD
MappedValuesRDD
Array[(Char, Int)]
Alice
 Bob
 Andy
(A, Alice)
 (B, Bob)
 (A, Andy)
(A, (Alice, Andy))
 (B, Bob)
(A, 2)
Res0 = [(A, 2),….]
(B, 1)
Stage
1
Res0 = [(A, 2), (B, 1)]
Stage
2
So, how is it a Resilient Distributed Dataset?
Being a lazy, immutable representation of a computation, rather than an actual collection of data, RDDs achieve resiliency by simply being re-executed when their results are lost*.
* because distributed systems and Murphy’s Law are best buddies.
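You can actually inspect the “recipe” Spark would replay. A sketch assuming an existing SparkContext `sc`; `toDebugString` prints an RDD’s recorded lineage:

// Every RDD records the lineage of transformations that produced it;
// lost partitions are recomputed by replaying just this graph.
val names     = sc.parallelize(Seq("Alice", "Bob", "Andy"))
val byInitial = names.map(n => (n.charAt(0), n)).groupByKey()

println(byInitial.toDebugString)  // prints the lineage, stage by stage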
The ecosystem
◎ Spark SQL: structured data
◎ Spark Streaming: real-time
◎ MLlib: machine learning
◎ GraphX: graph processing
◎ SparkR: statistical analysis

All sit on top of Spark Core, which runs on the Standalone Scheduler, YARN or Mesos.
What we’ll see today: Spark Streaming
Let’s get to Spark Streaming
It’s Fast Data time!

Surprise! You already know everything you need.
Spark Streaming
Live data stream → Spark Streaming → “mini-batches” → Spark → processed result
“Mini-batches” are DStreams
These “mini-batches” are DStreams, or discretized streams, and they are basically a sequence of RDDs.

DStreams can be created from streaming sources or by applying transformations to an existing DStream.
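The Twitter demo comes next; as a warm-up, here is a minimal sketch of a streaming word count over a plain text socket (the app name, host, port and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 5-second mini-batches over a plain text socket
val conf = new SparkConf().setAppName("StreamingSketch")
val ssc  = new StreamingContext(conf, Seconds(5))

// Same collection-like API you already know from RDDs
val counts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()         // sample of each mini-batch’s result
ssc.start()            // start receiving and processing
ssc.awaitTermination()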
Example
Twitter streaming
“Sentiment analysis” for dummies
Sure, it’s on GitHub!
https://github.com/stefanobaghino/spark-twitter-stream-example
A lot more to be said!
u Caching
u Shared variables
u Partioning optimization
u DataFrames
u A huge API
u A huge ecosystem
Tomorrow at Codemotion!
Thanks!
Any questions?
You can find me at:
@stefanobaghino
stefano.baghino@databiz.it
