2. What is Apache Spark?
Fast and general cluster computing engine interoperable
with Apache Hadoop
Improves efficiency through:
– In-memory data sharing
– General computation graphs
Improves usability through:
– Rich APIs in Java, Scala, Python
– Interactive shell
Up to 100× faster than Hadoop MapReduce
2-5× less code
5. About Databricks
Founded by creators of Spark and remains largest contributor
Offers a hosted service, Databricks Cloud
– Spark on EC2 with notebooks, dashboards, scheduled jobs
6. This Talk
Introduction to Spark
Built-in libraries
New APIs in 2015
– DataFrames
– Data sources
– ML Pipelines
7. Why a New Programming Model?
MapReduce simplified big data analysis
But users quickly wanted more:
– More complex, multi-pass analytics (e.g. ML, graph)
– More interactive ad-hoc queries
– More real-time stream processing
All 3 need faster data sharing in parallel apps
8. Data Sharing in MapReduce
[Diagram: each iteration reads its input from HDFS and writes its output back to HDFS; ad-hoc queries (query 1, 2, 3, …) each re-read the input from HDFS before producing their results]
Slow due to data replication and disk I/O
9. What We’d Like
[Diagram: iterations and ad-hoc queries share data through distributed memory, with one-time processing of the input]
10-100× faster than network and disk
10. Spark Model
Write programs in terms of transformations on datasets
Resilient Distributed Datasets (RDDs)
– Collections of objects that can be stored in memory or on disk across a cluster
– Built via parallel transformations (map, filter, …)
– Automatically rebuilt on failure
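A minimal PySpark sketch of this model (assuming a SparkContext named sc, e.g. from the pyspark shell; the dataset is illustrative):

nums = sc.parallelize(range(1, 1000))         # base RDD, partitioned across the cluster
squares = nums.map(lambda x: x * x)           # transformation: lazily defines a new RDD
evens = squares.filter(lambda x: x % 2 == 0)  # another transformation
evens.cache()                                 # keep this RDD in memory once computed
print(evens.count())                          # action: triggers the actual computation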
11. Example: Log Mining
Load error messages from a log into memory, then interactively
search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "foo" in s).count()
messages.filter(lambda s: "bar" in s).count()
. . .
[Diagram: the driver sends tasks to workers and collects results; each worker reads one HDFS block (Block 1-3) and caches its partition of messages (Cache 1-3). textFile builds the base RDD, filter/map build transformed RDDs, and count() is an action.]
Full-text search of Wikipedia in <1 sec
(vs 20 sec for on-disk data)
12. Fault Tolerance
RDDs track the transformations used to build them (their lineage)
to recompute lost data
messages = textFile(...).filter(lambda s: "ERROR" in s)
                        .map(lambda s: s.split("\t")[2])
[Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = lambda s: …) → MappedRDD (func = lambda s: …)]
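A small sketch of inspecting lineage from PySpark (the path is a placeholder and sc an existing SparkContext); Spark uses this graph to recompute only the lost partitions after a failure:

messages = (sc.textFile("file:///tmp/app.log")
              .filter(lambda s: "ERROR" in s)
              .map(lambda s: s.split("\t")[2]))
print(messages.toDebugString())   # shows the chain of RDDs the result depends on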
13. Example: Logistic Regression
[Chart: running time (s) vs. number of iterations (1-30). Hadoop: ~110 s per iteration. Spark: 80 s for the first iteration, ~1 s for later iterations.]
20. Key Idea
Instead of having separate execution engines for each
task, all libraries work directly on RDDs
Caching + DAG model is enough to run them efficiently
Combining libraries into one program is much faster
21. Spark SQL
Represents tables as RDDs
Tables = Schema + Data

From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON (tweets.json):
{"text": "hi",
 "user": {
   "name": "matei",
   "id": 123
 }}
c.jsonFile("tweets.json").registerTempTable("tweets")
c.sql("select text, user.name from tweets")
23. Spark Streaming
Represents streams as a series of RDDs over time
[Diagram: the input stream is chopped along the time axis into a sequence of RDDs]

sc.twitterStream(...)
  .map(lambda t: (t.username, 1))
  .reduceByWindow("30s", lambda a, b: a + b)
  .print()
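A hedged sketch of the same windowed count with the concrete DStream API (the twitterStream call above is shorthand); a socket text stream of usernames stands in for the tweet stream, and the host, port, and checkpoint path are placeholders:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)               # 1-second micro-batches
ssc.checkpoint("/tmp/spark-checkpoint")     # checkpointing, needed by some windowed operations
users = ssc.socketTextStream("localhost", 9999)
counts = (users.map(lambda u: (u, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10))
counts.pprint()                             # print each window's counts
ssc.start()
ssc.awaitTermination()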
28. Combining Processing Types
# Load data using SQL
points = ctx.sql("select latitude, longitude from hive_tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
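The snippet above is shorthand; a hedged sketch of the SQL + MLlib part with concrete 2015-era APIs (the Hive table is a placeholder and sqlContext an existing HiveContext):

from pyspark.mllib.clustering import KMeans

points = (sqlContext.sql("select latitude, longitude from hive_tweets")
                    .rdd.map(lambda r: [r.latitude, r.longitude]))
model = KMeans.train(points, k=10)       # cluster tweet locations into 10 groups
print(model.predict([37.77, -122.42]))   # assign a new location to a cluster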
30. This Talk
Introduction to Spark
Built-in libraries
New APIs in 2015
– DataFrames
– Data sources
– ML Pipelines
31. Main Directions in 2015
Data Science: making it easier for a wider class of users
Platform Interfaces: scaling the ecosystem
32. From MapReduce to Spark
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
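For comparison, a hedged PySpark version of the same word count (paths as on the slide):

file = sc.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs://...")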
33. Beyond MapReduce Experts
[Diagram: early adopters, who understand MapReduce and functional APIs, are only a small part of the potential user base; data scientists, statisticians, R users, the PyData community, … are a much larger group of users]
34. Data Frames
De facto data processing abstraction for data science
(R and Python)
[Chart: Google Trends for "dataframe"]
36. Spark DataFrames
Collections of structured
data similar to R, pandas
Automatically optimized
via Spark SQL
– Columnar storage
– Code-gen. execution
df = jsonFile("tweets.json")
df[df["user"] == "matei"]
  .groupBy("date")
  .sum("retweets")
[Chart: running time of the Python RDD, Scala RDD, and DataFrame versions of the same computation]
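A slightly more explicit, hedged version of the snippet above with the 1.x PySpark API (sqlContext is an existing SQLContext or HiveContext; jsonFile was later superseded by sqlContext.read.json):

df = sqlContext.jsonFile("tweets.json")
(df[df["user"] == "matei"]
   .groupBy("date")
   .sum("retweets")
   .show())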
37. Optimization via Spark SQL
DataFrame expressions are relational queries, letting Spark inspect them
Automatically performs expression optimization, join algorithm selection, columnar storage, and compilation to Java bytecode
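A quick way to see this at work (a sketch; df is any DataFrame and the column names are illustrative): DataFrame.explain prints the logical and physical plans Spark SQL derives from the expression.

df.filter(df["retweets"] > 10).select("text").explain(True)
# prints the parsed, analyzed, and optimized logical plans plus the physical plan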
38. Machine Learning Pipelines
tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)
[Diagram: DataFrame → tokenizer → TF → LR → model]
High-level API similar to scikit-learn
Operates on DataFrames
Grid search to tune params
across a whole pipeline
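A hedged sketch of grid search over the whole pipeline with the pyspark.ml API (Spark 1.4-era; column names and the training DataFrame df are illustrative, with df holding "text" and "label" columns):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
lr = LogisticRegression(maxIter=10)
pipe = Pipeline(stages=[tokenizer, tf, lr])

grid = (ParamGridBuilder()
        .addGrid(tf.numFeatures, [1000, 10000])
        .addGrid(lr.regParam, [0.01, 0.1])
        .build())
cv = CrossValidator(estimator=pipe, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=3)
model = cv.fit(df)   # best pipeline found by cross-validated grid search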
39. Spark R Interface
Exposes DataFrames and
ML pipelines in R
Parallelizes calls to R code
df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei",],
    "date"),
  sum("retweets"))
Target: Spark 1.4 (June)
40. Main Directions in 2015
Data Science: making it easier for a wider class of users
Platform Interfaces: scaling the ecosystem
41. Data Sources API
Allows plugging smart data
sources into Spark
Returns DataFrames usable
in Spark apps or SQL
Pushes logic into sources
[Diagram: Spark connected to external data sources such as JSON files and databases]
42. Data Sources API
Allows plugging smart data sources into Spark
Returns DataFrames usable in Spark apps or SQL
Pushes logic into sources, e.g. for

SELECT * FROM mysql_users u JOIN hive_logs h WHERE u.lang = 'en'

Spark pushes the filter into the MySQL source as: SELECT * FROM users WHERE lang = 'en'
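A hedged Python sketch of consuming a data source through the 1.4-era reader API (the JDBC URL and table name are placeholders):

users = (sqlContext.read.format("jdbc")
                   .options(url="jdbc:mysql://host/db", dbtable="users")
                   .load())
users.registerTempTable("mysql_users")
english = sqlContext.sql("SELECT * FROM mysql_users WHERE lang = 'en'")
# the filter can be pushed down into MySQL by the JDBC source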
43. Current Data Sources
Built-in:
Hive, JSON, Parquet, JDBC
Community:
CSV, Avro, ElasticSearch,
Redshift, Cloudant, Mongo,
Cassandra, SequoiaDB
List at spark-packages.org
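For example, a community package can be pulled in at launch and then used like a built-in source (a sketch; the artifact coordinates, version, and file path are illustrative):

pyspark --packages com.databricks:spark-csv_2.10:1.0.3    # shell launch

df = (sqlContext.read.format("com.databricks.spark.csv")
                .options(header="true")
                .load("data.csv"))
df.printSchema()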