SlideShare a Scribd company logo
1 of 44
Download to read offline
Matei Zaharia
CS 341, Spring 2017
Parallel Programming
with Apache Spark
What is Apache Spark?
Open source computing
engine for clusters
» Generalizes MapReduce
Rich set of APIs & libraries
» APIs in Scala, Java, Python, R
» SQL, machine learning, graphs
ML
Streaming SQL Graph
…
Project History
Started as research project at Berkeley in 2009
Open sourced in 2010
Joined Apache foundation in 2013
1000+ contributors to date
Spark Community
1000+ companies, clusters up to 8000 nodes
Community Growth
2014 2015 2016
Spark Meetup
Members
2014 2015 2016
Developers
Contributing
350
600
1100 230K
66K
20K
This Talk
Introduction to Spark
Tour of Spark operations
Job execution
Higher-level libraries
Key Idea
Write apps in terms of transformations on
distributed datasets
Resilient distributed datasets (RDDs)
» Collections of objects spread across a cluster
» Built through parallel transformations (map, filter, etc)
» Automatically rebuilt on failure
» Controllable persistence (e.g. caching in RAM)
Operations
Transformations (e.g. map, filter, groupBy)
» Lazy operations to build RDDs from other RDDs
Actions (e.g. count, collect, save)
» Return a result or write it to storage
Example: Log Mining
Load error messages from a log into memory,
then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“t”)[2])
messages.cache()
Block 1
Block 2
Block 3
Worker
Worker
Worker
Driver
messages.filter(lambda s: “foo” in s).count()
messages.filter(lambda s: “bar” in s).count()
. . .
tasks
results
Cache 1
Cache 2
Cache 3
Base RDD
Transformed RDD
Action
Result: full-text search of Wikipedia in
0.5 sec (vs 20 s for on-disk data)
Fault Recovery
RDDs track lineage information that can be used
to efficiently recompute lost data
Ex: msgs = textFile.filter(lambda s: s.startsWith(“ERROR”))
.map(lambda s: s.split(“t”)[2])
HDFS File Filtered RDD Mapped RDD
filter
(func = _.contains(...))
map
(func = _.split(...))
Behavior with Less RAM
69
58
41
30
12
0
20
40
60
80
100
Cache
disabled
25% 50% 75% Fully
cached
Execution
time
(s)
% of working set in cache
Iterative Algorithms
121
4.1
0 50 100 150
K-means Clustering
Spark
Hadoop MR
sec
80
0.96
0 20 40 60 80 100
Logistic Regression
Spark
Hadoop MR
sec
Spark in Scala and Java
// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains(“ERROR”)).count()
// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains(“error”)).count();
Installing Spark
Spark runs on your laptop: download it from
spark.apache.org
Cloud services:
»Google Cloud DataProc
»Databricks Community Edition
This Talk
Introduction to Spark
Tour of Spark operations
Job execution
Higher-level libraries
Learning Spark
Easiest way: the shell (spark-shell or pyspark)
» Special Scala/Python interpreters for cluster use
Runs in local mode on all cores by default, but
can connect to clusters too (see docs)
First Stop: SparkContext
Main entry point to Spark functionality
Available in shell as variable sc
In standalone apps, you create your own
Creating RDDs
# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
sc.textFile(“file.txt”)
sc.textFile(“directory/*.txt”)
sc.textFile(“hdfs://namenode:9000/path/file”)
# Use existing Hadoop InputFormat (Java/Scala only)
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
Basic Transformations
nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
squares = nums.map(lambda x: x*x) // {1, 4, 9}
# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0) // {4}
# Map each element to zero or more others
nums.flatMap(lambda x: range(x))
# => {0, 0, 1, 0, 1, 2}
Range object (sequence
of numbers 0, 1, …, x-1)
Basic Actions
nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
nums.collect() # => [1, 2, 3]
# Return first K elements
nums.take(2) # => [1, 2]
# Count number of elements
nums.count() # => 3
# Merge elements with an associative function
nums.reduce(lambda x, y: x + y) # => 6
# Write elements to a text file
nums.saveAsTextFile(“hdfs://file.txt”)
Working with Key-Value Pairs
Spark’s “distributed reduce” transformations
operate on RDDs of key-value pairs
Python: pair = (a, b)
pair[0] # => a
pair[1] # => b
Scala: val pair = (a, b)
pair._1 // => a
pair._2 // => b
Java: Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b
Some Key-Value Operations
pets = sc.parallelize(
[(“cat”, 1), (“dog”, 1), (“cat”, 2)])
pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}
pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}
pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also aggregates on the map side
lines = sc.textFile(“hamlet.txt”)
counts = lines.flatMap(lambda line: line.split(“ ”))
.map(lambda word: (word, 1))
.reduceByKey(lambda x, y: x + y)
Example: Word Count
“to be or”
“not to be”
“to”
“be”
“or”
“not”
“to”
“be”
(to, 1)
(be, 1)
(or, 1)
(not, 1)
(to, 1)
(be, 1)
(be, 2)
(not, 1)
(or, 1)
(to, 2)
Other Key-Value Operations
visits = sc.parallelize([ (“index.html”, “1.2.3.4”),
(“about.html”, “3.4.5.6”),
(“index.html”, “1.3.3.1”) ])
pageNames = sc.parallelize([ (“index.html”, “Home”),
(“about.html”, “About”) ])
visits.join(pageNames)
# (“index.html”, (“1.2.3.4”, “Home”))
# (“index.html”, (“1.3.3.1”, “Home”))
# (“about.html”, (“3.4.5.6”, “About”))
visits.cogroup(pageNames)
# (“index.html”, ([“1.2.3.4”, “1.3.3.1”], [“Home”]))
# (“about.html”, ([“3.4.5.6”], [“About”]))
Setting the Level of Parallelism
All the pair RDD operations take an optional
second parameter for number of tasks
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
Using Local Variables
Any external variables you use in a closure will
automatically be shipped to the cluster:
query = sys.stdin.readline()
pages.filter(lambda x: query in x).count()
Some caveats:
» Each task gets a new copy (updates aren’t sent back)
» Variable must be Serializable / Pickle-able
» Don’t use fields of an outer object (ships all of it!)
Other RDD Operators
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
More details: spark.apache.org/docs/latest
This Talk
Introduction to Spark
Tour of Spark operations
Job execution
Higher-level libraries
Components
Spark runs as a library in your
driver program
Runs tasks locally or on cluster
» Standalone, Mesos or YARN
Accesses storage via data
source plugins
» Can use S3, HDFS, GCE, …
Your application
SparkContext
Local
threads
Cluster
manager
Worker
Spark
executor
Worker
Spark
executor
HDFS or other storage
Job Scheduler
General task graphs
Automatically
pipelines functions
Data locality aware
Partitioning aware
to avoid shuffles
= cached partition
= RDD
join
filter
groupBy
Stage 3
Stage 1
Stage 2
A: B:
C: D: E:
F:
map
Debugging
Spark UI available at http://<master-node>:4040
This Talk
Introduction to Spark
Tour of Spark operations
Job execution
Higher-level libraries
Libraries Built on Spark
Spark Core
Spark
Streaming
real-time
Spark SQL+
DataFrames
structured data
MLlib
machine
learning
GraphX
graph
Spark SQL & DataFrames
APIs for structured data (table-like data)
» SQL
» DataFrames: dynamically typed
» Datasets: statically typed
Similar optimizations to relational databases
DataFrame API
Domain-specific API similar to Pandas and R
» DataFrames are tables with named columns
users = spark.sql(“select * from users”)
ca_users = users[users[“state”] == “CA”]
ca_users.count()
ca_users.groupBy(“name”).avg(“age”)
caUsers.map(lambda row: row.name.upper())
Expression AST
Execution Steps
Logical
Plan
Physical
Plan
Catalog
Optimizer
RDDs
…
Data
Source
API
Catalog
SQL
Data
Frame
Code
Generator
Dataset
Performance
37
0 2 4 6 8 10
RDD Scala
RDD Python
DataFrame Scala
DataFrame Python
DataFrame R
DataFrame SQL
Time for aggregation benchmark (s)
Performance
38
0 2 4 6 8 10
RDD Scala
RDD Python
DataFrame Scala
DataFrame Python
DataFrame R
DataFrame SQL
Time for aggregation benchmark (s)
MLlib
High-level pipeline API
similar to SciKit-Learn
Acts on DataFrames
Grid search and cross
validation for tuning
tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(
[tokenizer, tf, lr])
model = pipe.fit(df)
tokenizer TF LR
model
DataFrame
40
MLlib Algorithms
Generalized linear models
Alternating least squares
Decision trees
Random forests, GBTs
Naïve Bayes
PCA, SVD
AUC, ROC, f-measure
K-means
Latent Dirichlet allocation
Power iteration clustering
Gaussian mixtures
FP-growth
Word2Vec
Streaming k-means
Time
Input
Spark Streaming
RDD
RDD
RDD
RDD
RDD
RDD
Time
Represents streams as a series of RDDs over time
val spammers = sc.sequenceFile(“hdfs://spammers.seq”)
sc.twitterStream(...)
.filter(t => t.text.contains(“Stanford”))
.transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
.print()
Spark Streaming
Combining Libraries
# Load data using Spark SQL
points = spark.sql(
“select latitude, longitude from tweets”)
# Train a machine learning model
model = KMeans.train(points, 10)
# Apply it to a stream
sc.twitterStream(...)
.map(lambda t: (model.predict(t.location), 1))
.reduceByWindow(“5s”, lambda a, b: a + b)
Spark offers a wide range of high-level APIs
for parallel data processing
Can run on your laptop or a cloud service
Online tutorials:
»spark.apache.org/docs/latest
»Databricks Community Edition
Conclusion

More Related Content

Similar to Artigo 81 - spark_tutorial.pdf

Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZDataFactZ
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
 

Similar to Artigo 81 - spark_tutorial.pdf (20)

20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Spark devoxx2014
Spark devoxx2014Spark devoxx2014
Spark devoxx2014
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 

Recently uploaded

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 

Recently uploaded (20)

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 

Artigo 81 - spark_tutorial.pdf

  • 1. Matei Zaharia CS 341, Spring 2017 Parallel Programming with Apache Spark
  • 2. What is Apache Spark? Open source computing engine for clusters » Generalizes MapReduce Rich set of APIs & libraries » APIs in Scala, Java, Python, R » SQL, machine learning, graphs ML Streaming SQL Graph …
  • 3. Project History Started as research project at Berkeley in 2009 Open sourced in 2010 Joined Apache foundation in 2013 1000+ contributors to date
  • 4. Spark Community 1000+ companies, clusters up to 8000 nodes
  • 5. Community Growth 2014 2015 2016 Spark Meetup Members 2014 2015 2016 Developers Contributing 350 600 1100 230K 66K 20K
  • 6. This Talk Introduction to Spark Tour of Spark operations Job execution Higher-level libraries
  • 7. Key Idea Write apps in terms of transformations on distributed datasets Resilient distributed datasets (RDDs) » Collections of objects spread across a cluster » Built through parallel transformations (map, filter, etc) » Automatically rebuilt on failure » Controllable persistence (e.g. caching in RAM)
  • 8. Operations Transformations (e.g. map, filter, groupBy) » Lazy operations to build RDDs from other RDDs Actions (e.g. count, collect, save) » Return a result or write it to storage
  • 9. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(“t”)[2]) messages.cache() Block 1 Block 2 Block 3 Worker Worker Worker Driver messages.filter(lambda s: “foo” in s).count() messages.filter(lambda s: “bar” in s).count() . . . tasks results Cache 1 Cache 2 Cache 3 Base RDD Transformed RDD Action Result: full-text search of Wikipedia in 0.5 sec (vs 20 s for on-disk data)
  • 10. Fault Recovery RDDs track lineage information that can be used to efficiently recompute lost data Ex: msgs = textFile.filter(lambda s: s.startsWith(“ERROR”)) .map(lambda s: s.split(“t”)[2]) HDFS File Filtered RDD Mapped RDD filter (func = _.contains(...)) map (func = _.split(...))
  • 11. Behavior with Less RAM 69 58 41 30 12 0 20 40 60 80 100 Cache disabled 25% 50% 75% Fully cached Execution time (s) % of working set in cache
  • 12. Iterative Algorithms 121 4.1 0 50 100 150 K-means Clustering Spark Hadoop MR sec 80 0.96 0 20 40 60 80 100 Logistic Regression Spark Hadoop MR sec
  • 13. Spark in Scala and Java // Scala: val lines = sc.textFile(...) lines.filter(x => x.contains(“ERROR”)).count() // Java: JavaRDD<String> lines = sc.textFile(...); lines.filter(s -> s.contains(“error”)).count();
  • 14. Installing Spark Spark runs on your laptop: download it from spark.apache.org Cloud services: »Google Cloud DataProc »Databricks Community Edition
  • 15. This Talk Introduction to Spark Tour of Spark operations Job execution Higher-level libraries
  • 16. Learning Spark Easiest way: the shell (spark-shell or pyspark) » Special Scala/Python interpreters for cluster use Runs in local mode on all cores by default, but can connect to clusters too (see docs)
  • 17. First Stop: SparkContext Main entry point to Spark functionality Available in shell as variable sc In standalone apps, you create your own
  • 18. Creating RDDs # Turn a Python collection into an RDD sc.parallelize([1, 2, 3]) # Load text file from local FS, HDFS, or S3 sc.textFile(“file.txt”) sc.textFile(“directory/*.txt”) sc.textFile(“hdfs://namenode:9000/path/file”) # Use existing Hadoop InputFormat (Java/Scala only) sc.hadoopFile(keyClass, valClass, inputFmt, conf)
  • 19. Basic Transformations nums = sc.parallelize([1, 2, 3]) # Pass each element through a function squares = nums.map(lambda x: x*x) // {1, 4, 9} # Keep elements passing a predicate even = squares.filter(lambda x: x % 2 == 0) // {4} # Map each element to zero or more others nums.flatMap(lambda x: range(x)) # => {0, 0, 1, 0, 1, 2} Range object (sequence of numbers 0, 1, …, x-1)
  • 20. Basic Actions nums = sc.parallelize([1, 2, 3]) # Retrieve RDD contents as a local collection nums.collect() # => [1, 2, 3] # Return first K elements nums.take(2) # => [1, 2] # Count number of elements nums.count() # => 3 # Merge elements with an associative function nums.reduce(lambda x, y: x + y) # => 6 # Write elements to a text file nums.saveAsTextFile(“hdfs://file.txt”)
  • 21. Working with Key-Value Pairs Spark’s “distributed reduce” transformations operate on RDDs of key-value pairs Python: pair = (a, b) pair[0] # => a pair[1] # => b Scala: val pair = (a, b) pair._1 // => a pair._2 // => b Java: Tuple2 pair = new Tuple2(a, b); pair._1 // => a pair._2 // => b
  • 22. Some Key-Value Operations pets = sc.parallelize( [(“cat”, 1), (“dog”, 1), (“cat”, 2)]) pets.reduceByKey(lambda x, y: x + y) # => {(cat, 3), (dog, 1)} pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])} pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)} reduceByKey also aggregates on the map side
  • 23. lines = sc.textFile(“hamlet.txt”) counts = lines.flatMap(lambda line: line.split(“ ”)) .map(lambda word: (word, 1)) .reduceByKey(lambda x, y: x + y) Example: Word Count “to be or” “not to be” “to” “be” “or” “not” “to” “be” (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1) (be, 2) (not, 1) (or, 1) (to, 2)
  • 24. Other Key-Value Operations visits = sc.parallelize([ (“index.html”, “1.2.3.4”), (“about.html”, “3.4.5.6”), (“index.html”, “1.3.3.1”) ]) pageNames = sc.parallelize([ (“index.html”, “Home”), (“about.html”, “About”) ]) visits.join(pageNames) # (“index.html”, (“1.2.3.4”, “Home”)) # (“index.html”, (“1.3.3.1”, “Home”)) # (“about.html”, (“3.4.5.6”, “About”)) visits.cogroup(pageNames) # (“index.html”, ([“1.2.3.4”, “1.3.3.1”], [“Home”])) # (“about.html”, ([“3.4.5.6”], [“About”]))
  • 25. Setting the Level of Parallelism All the pair RDD operations take an optional second parameter for number of tasks words.reduceByKey(lambda x, y: x + y, 5) words.groupByKey(5) visits.join(pageViews, 5)
  • 26. Using Local Variables Any external variables you use in a closure will automatically be shipped to the cluster: query = sys.stdin.readline() pages.filter(lambda x: query in x).count() Some caveats: » Each task gets a new copy (updates aren’t sent back) » Variable must be Serializable / Pickle-able » Don’t use fields of an outer object (ships all of it!)
  • 28. This Talk Introduction to Spark Tour of Spark operations Job execution Higher-level libraries
  • 29. Components Spark runs as a library in your driver program Runs tasks locally or on cluster » Standalone, Mesos or YARN Accesses storage via data source plugins » Can use S3, HDFS, GCE, … Your application SparkContext Local threads Cluster manager Worker Spark executor Worker Spark executor HDFS or other storage
  • 30. Job Scheduler General task graphs Automatically pipelines functions Data locality aware Partitioning aware to avoid shuffles = cached partition = RDD join filter groupBy Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: map
  • 31. Debugging Spark UI available at http://<master-node>:4040
  • 32. This Talk Introduction to Spark Tour of Spark operations Job execution Higher-level libraries
  • 33. Libraries Built on Spark Spark Core Spark Streaming real-time Spark SQL+ DataFrames structured data MLlib machine learning GraphX graph
  • 34. Spark SQL & DataFrames APIs for structured data (table-like data) » SQL » DataFrames: dynamically typed » Datasets: statically typed Similar optimizations to relational databases
  • 35. DataFrame API Domain-specific API similar to Pandas and R » DataFrames are tables with named columns users = spark.sql(“select * from users”) ca_users = users[users[“state”] == “CA”] ca_users.count() ca_users.groupBy(“name”).avg(“age”) caUsers.map(lambda row: row.name.upper()) Expression AST
  • 37. Performance 37 0 2 4 6 8 10 RDD Scala RDD Python DataFrame Scala DataFrame Python DataFrame R DataFrame SQL Time for aggregation benchmark (s)
  • 38. Performance 38 0 2 4 6 8 10 RDD Scala RDD Python DataFrame Scala DataFrame Python DataFrame R DataFrame SQL Time for aggregation benchmark (s)
  • 39. MLlib High-level pipeline API similar to SciKit-Learn Acts on DataFrames Grid search and cross validation for tuning tokenizer = Tokenizer() tf = HashingTF(numFeatures=1000) lr = LogisticRegression() pipe = Pipeline( [tokenizer, tf, lr]) model = pipe.fit(df) tokenizer TF LR model DataFrame
  • 40. 40 MLlib Algorithms Generalized linear models Alternating least squares Decision trees Random forests, GBTs Naïve Bayes PCA, SVD AUC, ROC, f-measure K-means Latent Dirichlet allocation Power iteration clustering Gaussian mixtures FP-growth Word2Vec Streaming k-means
  • 42. RDD RDD RDD RDD RDD RDD Time Represents streams as a series of RDDs over time val spammers = sc.sequenceFile(“hdfs://spammers.seq”) sc.twitterStream(...) .filter(t => t.text.contains(“Stanford”)) .transform(tweets => tweets.map(t => (t.user, t)).join(spammers)) .print() Spark Streaming
  • 43. Combining Libraries # Load data using Spark SQL points = spark.sql( “select latitude, longitude from tweets”) # Train a machine learning model model = KMeans.train(points, 10) # Apply it to a stream sc.twitterStream(...) .map(lambda t: (model.predict(t.location), 1)) .reduceByWindow(“5s”, lambda a, b: a + b)
  • 44. Spark offers a wide range of high-level APIs for parallel data processing Can run on your laptop or a cloud service Online tutorials: »spark.apache.org/docs/latest »Databricks Community Edition Conclusion