Matei Zaharia
CS 341, Spring 2017
Parallel Programming
with Apache Spark
What is Apache Spark?
Open source computing
engine for clusters
» Generalizes MapReduce
Rich set of APIs & libraries
» APIs in Scala, Java, Python, R
» SQL, machine learning, graphs
[Diagram: libraries on the Spark engine: ML, Streaming, SQL, Graph, …]
Project History
Started as research project at Berkeley in 2009
Open sourced in 2010
Joined Apache foundation in 2013
1000+ contributors to date
Spark Community
1000+ companies, clusters up to 8000 nodes
Community Growth
» Spark meetup members: 20K (2014), 66K (2015), 230K (2016)
» Developers contributing: 350 (2014), 600 (2015), 1100 (2016)
This Talk
Introduction to Spark
Tour of Spark operations
Job execution
Higher-level libraries
Key Idea
Write apps in terms of transformations on
distributed datasets
Resilient distributed datasets (RDDs)
» Collections of objects spread across a cluster
» Built through parallel transformations (map, filter, etc)
» Automatically rebuilt on failure
» Controllable persistence (e.g. caching in RAM)
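For instance, a minimal persistence sketch in Python (assuming an existing RDD named rdd; StorageLevel is part of the standard PySpark API):

from pyspark import StorageLevel

# keep partitions in memory only (this is what cache() does)
rdd.persist(StorageLevel.MEMORY_ONLY)
# or: rdd.persist(StorageLevel.MEMORY_AND_DISK)
#     spills partitions to disk when RAM runs short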
Operations
Transformations (e.g. map, filter, groupBy)
» Lazy operations to build RDDs from other RDDs
Actions (e.g. count, collect, save)
» Return a result or write it to storage
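To see the laziness in action, a small sketch (assuming a SparkContext named sc, as in the examples below):

nums = sc.parallelize([1, 2, 3])
doubled = nums.map(lambda x: 2 * x)   # returns immediately; nothing has run yet
doubled.count()                       # action: only now does the map execute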
Example: Log Mining
Load error messages from a log into memory,
then interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver ships tasks to three workers, each scanning one HDFS block (Block 1–3); each worker caches its partition of messages (Cache 1–3) and returns results to the driver. lines is the base RDD, messages a transformed RDD, and count() an action.]
Result: full-text search of Wikipedia in
0.5 sec (vs 20 s for on-disk data)
Fault Recovery
RDDs track lineage information that can be used
to efficiently recompute lost data
Ex: msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
                   .map(lambda s: s.split("\t")[2])

[Diagram: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]
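This lineage is inspectable from code. A small sketch (toDebugString() is a standard RDD method; note PySpark may return it as a byte string):

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
print(msgs.toDebugString())   # prints the chain of parent RDDs back to the HDFS file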
Behavior with Less RAM
% of working set in cache:   Cache disabled   25%   50%   75%   Fully cached
Execution time (s):          69               58    41    30    12
Iterative Algorithms
K-means Clustering: Hadoop MR 121 s, Spark 4.1 s
Logistic Regression: Hadoop MR 80 s, Spark 0.96 s
Spark in Scala and Java
// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains("ERROR")).count();
Installing Spark
Spark runs on your laptop: download it from
spark.apache.org
Cloud services:
»Google Cloud Dataproc
»Databricks Community Edition
This Talk
Introduction to Spark
Tour of Spark operations
Job execution
Higher-level libraries
Learning Spark
Easiest way: the shell (spark-shell or pyspark)
» Special Scala/Python interpreters for cluster use
Runs in local mode on all cores by default, but
can connect to clusters too (see docs)
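For example (a sketch; --master is the standard flag for choosing where to run):

$ ./bin/pyspark                               # local mode on all cores
$ ./bin/pyspark --master spark://host:7077    # attach to a standalone cluster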
First Stop: SparkContext
Main entry point to Spark functionality
Available in shell as variable sc
In standalone apps, you create your own
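A minimal standalone-app sketch in Python (the constructor arguments shown are standard PySpark):

from pyspark import SparkContext

# "local[*]" runs on all local cores; a cluster URL works here too
sc = SparkContext(master="local[*]", appName="MyApp")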
Creating RDDs
# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")
# Use existing Hadoop InputFormat (Java/Scala only)
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
Basic Transformations
nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
squares = nums.map(lambda x: x*x)   # => {1, 4, 9}

# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)   # => {4}

# Map each element to zero or more others
nums.flatMap(lambda x: range(x))   # => {0, 0, 1, 0, 1, 2}
# (range(x) is a sequence of the numbers 0, 1, …, x-1)
Basic Actions
nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
nums.collect() # => [1, 2, 3]
# Return first K elements
nums.take(2) # => [1, 2]
# Count number of elements
nums.count() # => 3
# Merge elements with an associative function
nums.reduce(lambda x, y: x + y) # => 6
# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
Working with Key-Value Pairs
Spark’s “distributed reduce” transformations
operate on RDDs of key-value pairs
Python: pair = (a, b)
pair[0] # => a
pair[1] # => b
Scala: val pair = (a, b)
pair._1 // => a
pair._2 // => b
Java: Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b
Some Key-Value Operations
pets = sc.parallelize(
    [("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}

pets.groupByKey()   # => {(cat, [1, 2]), (dog, [1])}

pets.sortByKey()    # => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also aggregates on the map side
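This map-side aggregation is why reduceByKey is usually preferred over an equivalent groupByKey-based version, sketched here for contrast:

# shuffles every (key, value) pair, then sums on the reduce side
pets.groupByKey().mapValues(sum)        # => {(cat, 3), (dog, 1)}

# combines values within each partition first, shuffling only partial sums
pets.reduceByKey(lambda x, y: x + y)    # => {(cat, 3), (dog, 1)}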
Example: Word Count

lines = sc.textFile("hamlet.txt")
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))

[Diagram: "to be or" / "not to be" → words "to", "be", "or", "not", "to", "be" → pairs (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1) → counts (be, 2), (not, 1), (or, 1), (to, 2)]
Other Key-Value Operations
visits = sc.parallelize([ ("index.html", "1.2.3.4"),
                          ("about.html", "3.4.5.6"),
                          ("index.html", "1.3.3.1") ])

pageNames = sc.parallelize([ ("index.html", "Home"),
                             ("about.html", "About") ])

visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))
Setting the Level of Parallelism
All the pair RDD operations take an optional
second parameter for number of tasks
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
Using Local Variables
Any external variables you use in a closure will
automatically be shipped to the cluster:
query = sys.stdin.readline()
pages.filter(lambda x: query in x).count()
Some caveats:
» Each task gets a new copy (updates aren’t sent back)
» Variable must be Serializable / Pickle-able
» Don’t use fields of an outer object (ships all of it!)
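The last caveat has a simple workaround, sketched below: copy the field into a local variable before the closure captures it (LogSearcher and its field are hypothetical):

class LogSearcher:
    def __init__(self, query):
        self.query = query

    def count_matches(self, pages):
        # BAD: lambda x: self.query in x would capture self,
        # shipping the entire LogSearcher object with every task
        q = self.query   # capture just the string instead
        return pages.filter(lambda x: q in x).count()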
Other RDD Operators
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save, ...
More details: spark.apache.org/docs/latest
This Talk
Introduction to Spark
Tour of Spark operations
Job execution
Higher-level libraries
Components
Spark runs as a library in your
driver program
Runs tasks locally or on cluster
» Standalone, Mesos or YARN
Accesses storage via data
source plugins
» Can use S3, HDFS, GCE, …
[Diagram: your application creates a SparkContext in the driver, which runs tasks on local threads or hands them to a cluster manager; each worker runs a Spark executor, and all read from HDFS or other storage.]
Job Scheduler
General task graphs
Automatically
pipelines functions
Data locality aware
Partitioning aware
to avoid shuffles
[Diagram: a task graph over RDDs A–F built from map, filter, groupBy, and join, cut into Stages 1–3 at shuffle boundaries; cached partitions are marked and skipped.]
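A rough way to see these stage boundaries from code (toDebugString is a standard RDD method; the data here is made up): narrow operations like map and filter are pipelined into one stage, while groupByKey forces a shuffle into a new one.

rdd = sc.parallelize(range(100), 4)
pairs = rdd.map(lambda x: (x % 10, x)).filter(lambda kv: kv[1] > 5)  # pipelined together
grouped = pairs.groupByKey()     # shuffle boundary: the scheduler starts a new stage
print(grouped.toDebugString())   # indentation in the output marks the stages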
Debugging
Spark UI available at http://<master-node>:4040
This Talk
Introduction to Spark
Tour of Spark operations
Job execution
Higher-level libraries
Libraries Built on Spark
Spark Core
» Spark Streaming (real-time)
» Spark SQL + DataFrames (structured data)
» MLlib (machine learning)
» GraphX (graph)
Spark SQL & DataFrames
APIs for structured data (table-like data)
» SQL
» DataFrames: dynamically typed
» Datasets: statically typed
Similar optimizations to relational databases
DataFrame API
Domain-specific API similar to Pandas and R
» DataFrames are tables with named columns
users = spark.sql("select * from users")

ca_users = users[users["state"] == "CA"]

ca_users.count()
ca_users.groupBy("name").avg("age")
ca_users.map(lambda row: row.name.upper())

(The expression users["state"] == "CA" is captured as an expression AST, not evaluated eagerly.)
Execution Steps
[Diagram: SQL, DataFrame, and Dataset queries are resolved against the catalog into a logical plan, optimized into a physical plan, and turned into RDDs by a code generator, reading input through the Data Source API.]
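These steps can be inspected from Python with DataFrame.explain(), a standard API; the users table here is the one assumed on the previous slide:

df = spark.sql("select state, avg(age) from users group by state")
df.explain(True)   # prints the parsed/analyzed/optimized logical plans and the physical plan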
Performance
[Chart: time for an aggregation benchmark (s), 0–10 s scale, comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL.]
MLlib
High-level pipeline API
similar to SciKit-Learn
Acts on DataFrames
Grid search and cross
validation for tuning
tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()

pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)

[Diagram: a DataFrame flows through tokenizer → TF → LR inside the pipeline to produce model.]
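Once fit, the whole pipeline applies as a single transformer; a usage sketch (test_df is a hypothetical DataFrame with the same text column as df):

predictions = model.transform(test_df)   # runs tokenizer, TF, and LR in sequence
predictions.select("prediction").show()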
MLlib Algorithms
Generalized linear models
Alternating least squares
Decision trees
Random forests, GBTs
Naïve Bayes
PCA, SVD
AUC, ROC, f-measure
K-means
Latent Dirichlet allocation
Power iteration clustering
Gaussian mixtures
FP-growth
Word2Vec
Streaming k-means
Spark Streaming

Represents streams as a series of RDDs over time

[Diagram: input arriving over time is chopped into a series of RDDs, one per batch interval.]

val spammers = sc.sequenceFile("hdfs://spammers.seq")

sc.twitterStream(...)
  .filter(t => t.text.contains("Stanford"))
  .transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
  .print()
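The same model is available from Python; a minimal sketch using the classic DStream API (the socket source and port are placeholders):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=1)       # 1-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)   # each batch arrives as an RDD
counts = (lines.flatMap(lambda l: l.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()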
Combining Libraries
# Load data using Spark SQL
points = spark.sql(
    "select latitude, longitude from tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
(sc.twitterStream(...)
   .map(lambda t: (model.predict(t.location), 1))
   .reduceByWindow("5s", lambda a, b: a + b))
Conclusion

Spark offers a wide range of high-level APIs
for parallel data processing

Can run on your laptop or a cloud service

Online tutorials:
»spark.apache.org/docs/latest
»Databricks Community Edition