2. What is Apache Spark?
Open source computing engine for clusters
» Generalizes MapReduce
Rich set of APIs & libraries
» APIs in Scala, Java, Python, R
» SQL, machine learning, graphs
[Diagram: libraries such as ML, streaming, SQL, and graph processing built on the Spark engine]
3. Project History
Started as research project at Berkeley in 2009
Open sourced in 2010
Joined the Apache Software Foundation in 2013
1000+ contributors to date
7. Key Idea
Write apps in terms of transformations on
distributed datasets
Resilient distributed datasets (RDDs)
» Collections of objects spread across a cluster
» Built through parallel transformations (map, filter, etc.)
» Automatically rebuilt on failure
» Controllable persistence (e.g. caching in RAM)
8. Operations
Transformations (e.g. map, filter, groupBy)
» Lazy operations to build RDDs from other RDDs
Actions (e.g. count, collect, save)
» Return a result or write it to storage
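A minimal sketch of the laziness this implies (assuming a SparkContext named sc, as on later slides): the transformation only records what to compute; nothing runs until the action.
nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda x: x * 2)  # transformation: builds a new RDD, runs nothing yet
doubled.count()                      # action: triggers the job and returns 4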
9. Example: Log Mining
Load error messages from a log into memory,
then interactively search for various patterns
lines = spark.textFile("hdfs://...")                     # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
[Diagram: the driver ships tasks to three workers; each worker reads one HDFS block (Block 1-3), caches its partition of messages (Cache 1-3), and returns results to the driver]
messages.filter(lambda s: "foo" in s).count()   # action
messages.filter(lambda s: "bar" in s).count()
. . .
Result: full-text search of Wikipedia in
0.5 sec (vs 20 s for on-disk data)
10. Fault Recovery
RDDs track lineage information that can be used
to efficiently recompute lost data
Ex: msgs = textFile.filter(lambda s: s.startswith("ERROR"))
                   .map(lambda s: s.split("\t")[2])
[Diagram: lineage graph: HDFS File, then filter(func = _.contains(...)) to a Filtered RDD, then map(func = _.split(...)) to a Mapped RDD]
11. Behavior with Less RAM
[Chart: execution time (s) vs. % of working set in cache: 69 s with cache disabled, 58 s at 25%, 41 s at 50%, 30 s at 75%, 12 s fully cached]
16. Learning Spark
Easiest way: the shell (spark-shell or pyspark)
» Special Scala/Python interpreters for cluster use
Runs in local mode on all cores by default, but
can connect to clusters too (see docs)
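For example, inside pyspark the context is already created for you (a small sketch; sc.master simply reports whichever master URL the shell was started with):
sc.master                          # e.g. 'local[*]' when running on all local cores
sc.parallelize(range(10)).sum()    # => 45, computed by the local threads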
17. First Stop: SparkContext
Main entry point to Spark functionality
Available in shell as variable sc
In standalone apps, you create your own
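A minimal sketch of that standalone case (the app name and master URL are placeholders):
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("MyApp").setMaster("local[*]")
sc = SparkContext(conf=conf)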
18. Creating RDDs
# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")
# Use existing Hadoop InputFormat (Java/Scala only)
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
19. Basic Transformations
nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
squares = nums.map(lambda x: x*x)   # => {1, 4, 9}
# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)   # => {4}
# Map each element to zero or more others
nums.flatMap(lambda x: range(x))
# => {0, 0, 1, 0, 1, 2}
(range(x) is a sequence of the numbers 0, 1, …, x-1)
20. Basic Actions
nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
nums.collect() # => [1, 2, 3]
# Return first K elements
nums.take(2) # => [1, 2]
# Count number of elements
nums.count() # => 3
# Merge elements with an associative function
nums.reduce(lambda x, y: x + y) # => 6
# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
21. Working with Key-Value Pairs
Spark’s “distributed reduce” transformations
operate on RDDs of key-value pairs
Python: pair = (a, b)
pair[0] # => a
pair[1] # => b
Scala: val pair = (a, b)
pair._1 // => a
pair._2 // => b
Java: Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b
22. Some Key-Value Operations
pets = sc.parallelize(
  [("cat", 1), ("dog", 1), ("cat", 2)])
pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}
pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}
pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also aggregates on the map side
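As a rough illustration of that last point, both lines below give the same per-key sums for pets, but reduceByKey pre-aggregates within each partition before the shuffle, while groupByKey ships every value:
sums = pets.reduceByKey(lambda x, y: x + y)   # combines on the map side first
sums2 = pets.groupByKey().mapValues(sum)      # shuffles all values, then sums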
25. Setting the Level of Parallelism
All the pair RDD operations take an optional
second parameter for number of tasks
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
26. Using Local Variables
Any external variables you use in a closure will
automatically be shipped to the cluster:
query = sys.stdin.readline()
pages.filter(lambda x: query in x).count()
Some caveats:
» Each task gets a new copy (updates aren’t sent back)
» Variable must be Serializable / Pickle-able
» Don’t use fields of an outer object (ships all of it!)
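A hypothetical sketch of that last caveat and the usual fix (the class and field names are illustrative):
class LogSearcher(object):
    def __init__(self, query):
        self.query = query
    def count_matches_bad(self, rdd):
        # the lambda references self.query, so the whole LogSearcher object is shipped
        return rdd.filter(lambda line: self.query in line).count()
    def count_matches_good(self, rdd):
        q = self.query   # copy the field into a local variable; only q is shipped
        return rdd.filter(lambda line: q in line).count()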
29. Components
Spark runs as a library in your
driver program
Runs tasks locally or on cluster
» Standalone, Mesos or YARN
Accesses storage via data
source plugins
» Can use S3, HDFS, GCE, …
[Diagram: your application creates a SparkContext with local threads; it talks to a cluster manager, which launches Spark executors on workers; executors access HDFS or other storage]
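As a sketch of how the cluster manager is chosen, the master URL passed to SparkConf decides where tasks run (the URL formats below are standard Spark master URLs; host names are placeholders):
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("ComponentsDemo")
conf.setMaster("local[4]")                     # run tasks in 4 local threads
# conf.setMaster("spark://master-host:7077")   # standalone cluster manager
# conf.setMaster("yarn")                       # YARN
# conf.setMaster("mesos://master-host:5050")   # Mesos
sc = SparkContext(conf=conf)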
33. Libraries Built on Spark
Libraries on top of Spark Core:
» Spark Streaming (real-time)
» Spark SQL + DataFrames (structured data)
» MLlib (machine learning)
» GraphX (graph)
34. Spark SQL & DataFrames
APIs for structured data (table-like data)
» SQL
» DataFrames: dynamically typed
» Datasets: statically typed
Similar optimizations to relational databases
35. DataFrame API
Domain-specific API similar to Pandas and R
» DataFrames are tables with named columns
users = spark.sql("select * from users")
ca_users = users[users["state"] == "CA"]   # the column expression builds an expression AST
ca_users.count()
ca_users.groupBy("name").avg("age")
ca_users.map(lambda row: row.name.upper())
37. Performance
[Chart: time for an aggregation benchmark (s), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]
39. MLlib
High-level pipeline API similar to scikit-learn
Acts on DataFrames
Grid search and cross-validation for tuning
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# column names below are illustrative
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)
[Diagram: a DataFrame flows through the tokenizer, TF, and LR stages to produce the fitted model]
40. MLlib Algorithms
Generalized linear models
Alternating least squares
Decision trees
Random forests, GBTs
Naïve Bayes
PCA, SVD
AUC, ROC, f-measure
K-means
Latent Dirichlet allocation
Power iteration clustering
Gaussian mixtures
FP-growth
Word2Vec
Streaming k-means
42. Spark Streaming
Represents streams as a series of RDDs over time
val spammers = sc.sequenceFile("hdfs://spammers.seq")
sc.twitterStream(...)
  .filter(t => t.text.contains("Stanford"))
  .transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
  .print()
43. Combining Libraries
# Load data using Spark SQL
points = spark.sql(
  "select latitude, longitude from tweets")
# Train a machine learning model
model = KMeans.train(points, 10)
# Apply it to a stream
sc.twitterStream(...)
.map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
44. Conclusion
Spark offers a wide range of high-level APIs for parallel data processing
Can run on your laptop or a cloud service
Online tutorials:
» spark.apache.org/docs/latest
» Databricks Community Edition