2. What is Apache Spark?
Open source computing engine for clusters
» Generalizes MapReduce
Rich set of APIs & libraries
» APIs in Scala, Java, Python, R
» SQL, machine learning, graphs
[Diagram: libraries such as ML, streaming, SQL, and graph processing built on the Spark engine]
3. Project History
Started as research project at Berkeley in 2009
Open sourced in 2010
Joined the Apache Software Foundation in 2013
1000+ contributors to date
7. Key Idea
Write apps in terms of transformations on
distributed datasets
Resilient distributed datasets (RDDs)
» Collections of objects spread across a cluster
» Built through parallel transformations (map, filter, etc.)
» Automatically rebuilt on failure
» Controllable persistence (e.g. caching in RAM)
8. Operations
Transformations (e.g. map, filter, groupBy)
» Lazy operations to build RDDs from other RDDs
Actions (e.g. count, collect, save)
» Return a result or write it to storage
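A minimal sketch of the laziness this implies (assuming a SparkContext named sc, as on later slides): the transformation only records what to compute; nothing runs until the action.
nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda x: x * 2)  # transformation: builds a new RDD, runs nothing yet
doubled.count()                      # action: triggers the job and returns 4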
9. Example: Log Mining
Load error messages from a log into memory,
then interactively search for various patterns
lines = spark.textFile("hdfs://...")                     # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
[Diagram: the driver ships tasks to three workers; each worker reads one HDFS block (Block 1-3), caches its partition of messages (Cache 1-3), and returns results to the driver]
messages.filter(lambda s: "foo" in s).count()   # action
messages.filter(lambda s: "bar" in s).count()
. . .
Result: full-text search of Wikipedia in
0.5 sec (vs 20 s for on-disk data)
10. Fault Recovery
RDDs track lineage information that can be used
to efficiently recompute lost data
Ex: msgs = textFile.filter(lambda s: s.startswith("ERROR"))
                   .map(lambda s: s.split("\t")[2])
[Diagram: lineage graph: HDFS File, then filter(func = _.contains(...)) to a Filtered RDD, then map(func = _.split(...)) to a Mapped RDD]
11. Behavior with Less RAM
[Chart: execution time (s) vs. % of working set in cache: 69 s with cache disabled, 58 s at 25%, 41 s at 50%, 30 s at 75%, 12 s fully cached]
16. Learning Spark
Easiest way: the shell (spark-shell or pyspark)
» Special Scala/Python interpreters for cluster use
Runs in local mode on all cores by default, but
can connect to clusters too (see docs)
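For example, inside pyspark the context is already created for you (a small sketch; sc.master simply reports whichever master URL the shell was started with):
sc.master                          # e.g. 'local[*]' when running on all local cores
sc.parallelize(range(10)).sum()    # => 45, computed by the local threads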
17. First Stop: SparkContext
Main entry point to Spark functionality
Available in shell as variable sc
In standalone apps, you create your own
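A minimal sketch of that standalone case (the app name and master URL are placeholders):
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("MyApp").setMaster("local[*]")
sc = SparkContext(conf=conf)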
18. Creating RDDs
# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")
# Use existing Hadoop InputFormat (Java/Scala only)
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
19. Basic Transformations
nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
squares = nums.map(lambda x: x*x)   # => {1, 4, 9}
# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)   # => {4}
# Map each element to zero or more others
nums.flatMap(lambda x: range(x))
# => {0, 0, 1, 0, 1, 2}
(range(x) is a sequence of the numbers 0, 1, …, x-1)
20. Basic Actions
nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
nums.collect() # => [1, 2, 3]
# Return first K elements
nums.take(2) # => [1, 2]
# Count number of elements
nums.count() # => 3
# Merge elements with an associative function
nums.reduce(lambda x, y: x + y) # => 6
# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
21. Working with Key-Value Pairs
Spark’s “distributed reduce” transformations
operate on RDDs of key-value pairs
Python: pair = (a, b)
pair[0] # => a
pair[1] # => b
Scala: val pair = (a, b)
pair._1 // => a
pair._2 // => b
Java: Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b
22. Some Key-Value Operations
pets = sc.parallelize(
  [("cat", 1), ("dog", 1), ("cat", 2)])
pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}
pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}
pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also aggregates on the map side
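As a rough illustration of that last point, both lines below give the same per-key sums for pets, but reduceByKey pre-aggregates within each partition before the shuffle, while groupByKey ships every value:
sums = pets.reduceByKey(lambda x, y: x + y)   # combines on the map side first
sums2 = pets.groupByKey().mapValues(sum)      # shuffles all values, then sums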
25. Setting the Level of Parallelism
All the pair RDD operations take an optional
second parameter for number of tasks
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
26. Using Local Variables
Any external variables you use in a closure will
automatically be shipped to the cluster:
query = sys.stdin.readline()
pages.filter(lambda x: query in x).count()
Some caveats:
» Each task gets a new copy (updates aren’t sent back)
» Variable must be Serializable / Pickle-able
» Don’t use fields of an outer object (ships all of it!)
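A hypothetical sketch of that last caveat and the usual fix (the class and field names are illustrative):
class LogSearcher(object):
    def __init__(self, query):
        self.query = query
    def count_matches_bad(self, rdd):
        # the lambda references self.query, so the whole LogSearcher object is shipped
        return rdd.filter(lambda line: self.query in line).count()
    def count_matches_good(self, rdd):
        q = self.query   # copy the field into a local variable; only q is shipped
        return rdd.filter(lambda line: q in line).count()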
29. Components
Spark runs as a library in your
driver program
Runs tasks locally or on cluster
» Standalone, Mesos or YARN
Accesses storage via data
source plugins
» Can use S3, HDFS, GCE, …
[Diagram: your application creates a SparkContext with local threads; it talks to a cluster manager, which launches Spark executors on workers; executors access HDFS or other storage]
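As a sketch of how the cluster manager is chosen, the master URL passed to SparkConf decides where tasks run (the URL formats below are standard Spark master URLs; host names are placeholders):
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("ComponentsDemo")
conf.setMaster("local[4]")                     # run tasks in 4 local threads
# conf.setMaster("spark://master-host:7077")   # standalone cluster manager
# conf.setMaster("yarn")                       # YARN
# conf.setMaster("mesos://master-host:5050")   # Mesos
sc = SparkContext(conf=conf)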
33. Libraries Built on Spark
Libraries on top of Spark Core:
» Spark Streaming (real-time)
» Spark SQL + DataFrames (structured data)
» MLlib (machine learning)
» GraphX (graph)
34. Spark SQL & DataFrames
APIs for structured data (table-like data)
» SQL
» DataFrames: dynamically typed
» Datasets: statically typed
Similar optimizations to relational databases
35. DataFrame API
Domain-specific API similar to Pandas and R
» DataFrames are tables with named columns
users = spark.sql("select * from users")
ca_users = users[users["state"] == "CA"]   # the column expression builds an expression AST
ca_users.count()
ca_users.groupBy("name").avg("age")
ca_users.map(lambda row: row.name.upper())
37. Performance
[Chart: time for an aggregation benchmark (s), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]
39. MLlib
High-level pipeline API similar to scikit-learn
Acts on DataFrames
Grid search and cross-validation for tuning
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# column names below are illustrative
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)
[Diagram: a DataFrame flows through the tokenizer, TF, and LR stages to produce the fitted model]
40. MLlib Algorithms
Generalized linear models
Alternating least squares
Decision trees
Random forests, GBTs
Naïve Bayes
PCA, SVD
AUC, ROC, f-measure
K-means
Latent Dirichlet allocation
Power iteration clustering
Gaussian mixtures
FP-growth
Word2Vec
Streaming k-means
42. Spark Streaming
Represents streams as a series of RDDs over time
val spammers = sc.sequenceFile("hdfs://spammers.seq")
sc.twitterStream(...)
  .filter(t => t.text.contains("Stanford"))
  .transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
  .print()
43. Combining Libraries
# Load data using Spark SQL
points = spark.sql(
  "select latitude, longitude from tweets")
# Train a machine learning model
model = KMeans.train(points, 10)
# Apply it to a stream
sc.twitterStream(...)
.map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
44. Conclusion
Spark offers a wide range of high-level APIs for parallel data processing
Can run on your laptop or a cloud service
Online tutorials:
» spark.apache.org/docs/latest
» Databricks Community Edition