Simplifying Big Data Analysis
with Apache Spark
Matei Zaharia
April 27, 2015
What is Apache Spark?
Fast and general cluster computing engine interoperable
with Apache Hadoop
Improves efficiency through:
–  In-memory data sharing
–  General computation graphs
Improves usability through:
–  Rich APIs in Java, Scala, Python
–  Interactive shell
Up to 100× faster
2-5× less code
A General Engine
[Diagram: Spark Core with built-in libraries on top: Spark Streaming (real-time), Spark SQL (structured data), MLlib (machine learning), GraphX (graph), …]
A Large Community
[Charts: commits in the past year and lines of code changed in the past year, comparing MapReduce, YARN, HDFS, Storm, and Spark.]
About Databricks
Founded by the creators of Spark and remains the largest contributor
Offers a hosted service, Databricks Cloud
–  Spark on EC2 with notebooks, dashboards, scheduled jobs
This Talk
Introduction to Spark
Built-in libraries
New APIs in 2015
–  DataFrames
–  Data sources
–  ML Pipelines
Why a New Programming Model?
MapReduce simplified big data analysis
But users quickly wanted more:
–  More complex, multi-pass analytics (e.g. ML, graph)
–  More interactive ad-hoc queries
–  More real-time stream processing
All 3 need faster data sharing in parallel apps
Data Sharing in MapReduce
[Diagram: iterative jobs (iter. 1, iter. 2, …) and ad-hoc queries (query 1, 2, 3) each read the input from HDFS and write results back to HDFS between every step.]
Slow due to data replication and disk I/O
What We’d Like
[Diagram: the same iterative jobs and queries share the input through distributed memory after one-time processing, instead of going through HDFS each step.]
10-100× faster than network and disk
Spark Model
Write programs in terms of transformations on datasets
Resilient Distributed Datasets (RDDs)
–  Collections of objects that can be stored in memory or on disk across a cluster
–  Built via parallel transformations (map, filter, …)
–  Automatically rebuilt on failure
Example: Log Mining
Load error messages from a log into memory, then interactively
search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()
messages.filter(lambda s: "foo" in s).count()
messages.filter(lambda s: "bar" in s).count()
. . .
[Diagram: the driver sends tasks to three workers; each worker reads one block of the file, caches its partition of messages (Cache 1-3), and returns results to the driver. lines is the base RDD, messages a transformed RDD, and count() an action.]
Full-text search of Wikipedia in <1 sec
(vs 20 sec for on-disk data)
Fault Tolerance
RDDs track the transformations used to build them (their lineage)
to recompute lost data
messages = textFile(...).filter(lambda s: "ERROR" in s)
                        .map(lambda s: s.split("\t")[2])
[Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = lambda s: …) → MappedRDD (func = lambda s: …)]
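As a side note (not from the slides), the same lineage can be printed in PySpark with RDD.toDebugString(), which shows the chain Spark would replay to rebuild lost partitions:

messages = (sc.textFile("hdfs://.../log")
              .filter(lambda s: "ERROR" in s)
              .map(lambda s: s.split("\t")[2]))
print(messages.toDebugString())   # HadoopRDD -> filtered -> mapped chain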
Example: Logistic Regression
[Chart: running time (s) vs. number of iterations (1-30) for Hadoop and Spark. Hadoop: 110 s / iteration. Spark: 80 s for the first iteration, ~1 s for later iterations.]
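A rough sketch of the iterative pattern behind this chart (not from the slides; parsePoint, D, and the iteration count are assumed): the points RDD is cached once, so every pass after the first reads from memory instead of HDFS.

import numpy as np

points = sc.textFile("hdfs://.../points.txt").map(parsePoint).cache()
w = np.random.rand(D)   # initial weights, D = number of features (assumed)

for i in range(iterations):
    # each iteration scans the cached RDD and aggregates a gradient
    gradient = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p.y * w.dot(p.x))) - 1.0) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient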
On-Disk Performance
Time to sort 100 TB (Daytona GraySort benchmark, sortbenchmark.org):
–  2013 record: Hadoop, 2100 machines, 72 minutes
–  2014 record: Spark, 207 machines, 23 minutes
Supported Operators
map
filter
groupBy
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
flatMap
take
first
partitionBy
pipe
distinct
save
...
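A short PySpark sketch (not from the slides) chaining a few of these operators: flatMap, map, reduceByKey, filter, and take.

lines = sc.textFile("hdfs://.../logs")
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
frequent = word_counts.filter(lambda kv: kv[1] > 100)
print(frequent.take(5))   # first five (word, count) pairs with count > 100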
Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains("ERROR")).count();
User Community
Over 500 production users
Clusters up to 8000 nodes,
processing 1 PB/day
Single jobs over 1 PB
This Talk
Introduction to Spark
Built-in libraries
New APIs in 2015
–  DataFrames
–  Data sources
–  ML Pipelines
Built-in Libraries
[Diagram: Spark Core with Spark Streaming (real-time), Spark SQL (structured data), MLlib (machine learning), and GraphX (graph) built on top.]
Key Idea
Instead of having separate execution engines for each
task, all libraries work directly on RDDs
Caching + DAG model is enough to run them efficiently
Combining libraries into one program is much faster
Spark SQL
Represents tables as RDDs
Tables = Schema + Data

From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON (tweets.json):
{"text": "hi",
 "user": {
   "name": "matei",
   "id": 123
 }}

c.jsonFile("tweets.json").registerTempTable("tweets")
c.sql("select text, user.name from tweets")
Spark Streaming
Represents streams as a series of RDDs over time
[Diagram: the input stream is chopped by time into a sequence of RDDs.]

sc.twitterStream(...)
  .map(lambda t: (t.username, 1))
  .reduceByWindow("30s", lambda a, b: a + b)
  .print()
MLlib
Vectors, Matrices = RDD[Vector]
Iterative computation

points = sc.textFile("data.txt").map(parsePoint)
model = KMeans.train(points, 10)
model.predict(newPoint)
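parsePoint is not defined on the slide; a minimal assumed implementation turns each line of space-separated numbers into a NumPy vector, which KMeans.train accepts:

from numpy import array

def parsePoint(line):
    return array([float(x) for x in line.split()])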
GraphX
Represents graphs as RDDs of vertices and edges
Performance vs. Specialized Engines
[Charts comparing Spark with specialized engines:
–  SQL: response time (sec) for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem)
–  ML: response time (min) for Mahout, GraphLab, Spark
–  Streaming: throughput (MB/s/node) for Storm, Spark]
Combining Processing Types
// Load data using SQL
points = ctx.sql("select latitude, longitude from hive_tweets")

// Train a machine learning model
model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
Combining Processing Types
[Diagram. Separate engines: each stage (prepare, train, apply) reads its input from HDFS and writes its output back to HDFS. Spark: data is read from HDFS once, then prepare, train, apply, and interactive analysis run in memory.]
This Talk
Introduction to Spark
Built-in libraries
New APIs in 2015
–  DataFrames
–  Data sources
–  ML Pipelines
Main Directions in 2015
Data Science: making it easier for a wider class of users
Platform Interfaces: scaling the ecosystem
From MapReduce to Spark
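// Word count in Hadoop MapReduce: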
public static class WordCountMapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
public static class WordCountReduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
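// The same word count in Spark (Scala):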
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Beyond MapReduce Experts
Early adopters: understand MapReduce & functional APIs
New users: data scientists, statisticians, R users, PyData, …
Data Frames
De facto data processing abstraction for data science
(R and Python)
[Chart: Google Trends for "dataframe"]
From RDDs to DataFrames
Spark DataFrames
Collections of structured
data similar to R, pandas
Automatically optimized
via Spark SQL
–  Columnar storage
–  Code-gen. execution
df = jsonFile("tweets.json")
df[df["user"] == "matei"]
  .groupBy("date")
  .sum("retweets")

[Chart: running time for the same computation using the Python RDD API, the Scala RDD API, and the DataFrame API]
Optimization via Spark SQL
DataFrame expressions are relational queries, letting Spark inspect them
Spark automatically performs expression optimization, join algorithm selection, columnar storage, and compilation to Java bytecode
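A minimal sketch (assuming the PySpark 1.3+ DataFrame API and an existing sqlContext) of inspecting what Spark derives from such expressions; explain() prints the logical and physical plans these optimizations produce:

df = sqlContext.jsonFile("tweets.json")
agg = df[df["user"] == "matei"].groupBy("date").sum("retweets")
agg.explain(True)   # show extended (logical + physical) plans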
Machine Learning Pipelines
tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)
[Diagram: DataFrame → tokenizer → TF → LR → model]
High-level API similar to scikit-learn
Operates on DataFrames
Grid search to tune params
across a whole pipeline
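A hedged sketch of that grid search (assuming the pyspark.ml.tuning API from Spark 1.4 and the pipeline defined above): every parameter combination re-fits the whole tokenizer → TF → LR pipeline and is scored by cross-validation.

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

grid = (ParamGridBuilder()
        .addGrid(tf.numFeatures, [1000, 10000])
        .addGrid(lr.regParam, [0.01, 0.1])
        .build())

cv = CrossValidator(estimator=pipe,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)
model = cv.fit(df).bestModel   # best pipeline across all parameter combinations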
Spark R Interface
Exposes DataFrames and
ML pipelines in R
Parallelize calls to R code
df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei",],
    "date"),
  sum("retweets"))
Target: Spark 1.4 (June)
Main Directions in 2015
Data Science: making it easier for a wider class of users
Platform Interfaces: scaling the ecosystem
Data Sources API
Allows plugging smart data
sources into Spark
Returns DataFrames usable
in Spark apps or SQL
Pushes logic into sources
[Diagram: Spark reading from pluggable data sources such as JSON.]
Data Sources API
Pushes logic into the sources, e.g. a query over a MySQL source and a Hive table:

SELECT * FROM mysql_users u JOIN hive_logs h
WHERE u.lang = "en"

Spark pushes the filter down to MySQL as:

SELECT * FROM users WHERE lang = "en"
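A hedged sketch (using the DataFrame reader API from Spark 1.4; the JDBC URL, table names, and join key are illustrative) of plugging a JDBC source in next to a Hive table so the filter above can be pushed down to MySQL:

users = (sqlContext.read.format("jdbc")
         .options(url="jdbc:mysql://host/db", dbtable="users")
         .load())
users.registerTempTable("mysql_users")

result = sqlContext.sql("""
  SELECT * FROM mysql_users u JOIN hive_logs h ON u.id = h.user_id
  WHERE u.lang = 'en'
""")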
Current Data Sources
Built-in:
Hive, JSON, Parquet, JDBC
Community:
CSV, Avro, ElasticSearch,
Redshift, Cloudant, Mongo,
Cassandra, SequoiaDB
List at spark-packages.org
Goal: unified engine across data sources,
workloads and environments
To Learn More
Downloads & docs: spark.apache.org
Try Spark in Databricks Cloud: databricks.com
Spark Summit: spark-summit.org