Learning PySpark
A Tutorial
By:
Maria Mestre (@mariarmestre)
Sahan Bulathwela (@in4maniac)
Erik Pazos (@zerophewl)
This tutorial
Skimlinks | Spark… A view from the trenches !!
● Some key Spark concepts (2 minute crash course)
● First part: Spark core
○ Notebook: basic operations
○ Spark execution model
● Second part: DataFrames and Spark SQL
○ Notebook: using DataFrames and Spark SQL
○ DataFrames execution model
● Final note on Spark configs and useful areas to go from here
How to setup the tutorial
Skimlinks | Spark… A view from the trenches !!
● Directions and resources for setting up the tutorial in your local environment can be found in the blog post below:
https://in4maniac.wordpress.com/2016/10/09/spark-tutorial/
The datasets
Skimlinks | Spark… A view from the trenches
● Data extracted from the Amazon dataset
o Image-based recommendations on styles and substitutes, J. McAuley, C. Targett, J. Shi, A. van den Hengel, SIGIR, 2015
o Inferring networks of substitutable and complementary products, J. McAuley, R. Pandey, J. Leskovec, Knowledge Discovery and Data Mining, 2015
● Sample of Amazon product reviews
o fashion.json, electronics.json, sports.json
o fields: ASIN, review text, reviewer name, …
● Sample of product metadata
o sample_metadata.json
o fields: ASIN, price, category, ...
Some Spark definitions (1)
Skimlinks | Spark… A view from the trenches
● An RDD (Resilient Distributed Dataset) is a distributed dataset
● The dataset is divided into partitions
● RDDs can be cached in memory
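A minimal PySpark sketch of these three ideas; the file name and partition count are illustrative, and sc will already exist if you are inside the tutorial notebook:

from pyspark import SparkContext

sc = SparkContext(appName="rdd-basics")            # driver-side entry point
reviews = sc.textFile("product_reviews.txt", 4)    # distributed dataset, split into ~4 partitions
print(reviews.getNumPartitions())                  # how the dataset is divided
reviews.cache()                                    # keep partitions in memory after the first computation
print(reviews.count())                             # first action materialises (and caches) the data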
Some Spark definitions (2)
Skimlinks | Spark… A view from the trenches
● A cluster = a master node and slave nodes
● Work is submitted through the Spark context
● Only the master node has access to the Spark context
● Operations on RDDs are either transformations (lazy) or actions (which trigger computation)
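A small sketch of the transformation/action distinction, assuming a SparkContext sc as above:

nums = sc.parallelize(range(1000), 8)          # source RDD spread over 8 partitions
squares = nums.map(lambda x: x * x)            # transformation: lazy, only records the lineage
evens = squares.filter(lambda x: x % 2 == 0)   # still lazy, nothing has run on the cluster yet
print(evens.count())                           # action: triggers a job, returns a number to the driver
print(evens.take(5))                           # action: pulls the first few elements back to the driver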
Skimlinks | Spark… A view from the trenches
Why understand Spark internals?
● Essential for understanding failures and improving performance
This section is a condensed version of: https://spark-summit.org/2014/talk/a-deeper-understanding-of-spark-internals
Skimlinks | Spark… A view from the trenches !!
From code to computations
Skimlinks | Spark… A view from the trenches
import json

rd = sc.textFile('product_reviews.txt').map(json.loads)  # assumes one JSON review record per line
(rd.map(lambda x: (x['asin'], x['overall']))
   .groupByKey()
   .filter(lambda x: len(x[1]) > 1)
   .count())
From code to computations
Skimlinks | Spark… A view from the trenches
1. You write code using RDDs
2. Spark creates a graph of RDDs
import json

rd = sc.textFile('product_reviews.txt').map(json.loads)  # assumes one JSON review record per line
(rd.map(lambda x: (x['asin'], x['overall']))
   .groupByKey()
   .filter(lambda x: len(x[1]) > 1)
   .count())
Execution model
Skimlinks | Spark… A view from the trenches
3. Spark figures out a logical execution plan for each computation
(diagram: the RDD graph is grouped into pipelined stages, Stage 1 and Stage 2, split where the data has to be shuffled)
Execution model
Skimlinks | Spark… A view from the trenches
4. Schedules and executes individual tasks
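To see how Spark has grouped a lineage into stages, you can print an RDD's debug string; a sketch, assuming the rd pipeline from the previous slides:

grouped = (rd.map(lambda x: (x['asin'], x['overall']))
             .groupByKey()
             .filter(lambda x: len(x[1]) > 1))
print(grouped.toDebugString())   # RDD lineage; indentation changes mark shuffle (stage) boundaries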
Skimlinks | Spark… A view from the trenches
If your shuffle fails...
● Shuffles are usually the bottleneck:
o if very large tasks ⇒ memory pressure
o if too many tasks ⇒ network overhead
o if too few tasks ⇒ suboptimal cluster utilisation
● Best practices:
o always tune the number of partitions!
o between 100 and 10,000 partitions
o lower bound: at least ~2x number of cores
o upper bound: task should take at least 100 ms
● https://spark.apache.org/docs/latest/tuning.html
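A sketch of how the partition count is typically set in PySpark, assuming the rd RDD of review records from earlier; the numbers are placeholders to tune:

pairs = rd.map(lambda x: (x['asin'], x['overall']))
grouped = pairs.groupByKey(numPartitions=200)   # number of reduce-side tasks for the shuffle
wider = pairs.repartition(400)                  # reshuffle an existing RDD into more partitions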
Skimlinks | Spark… A view from the trenches
Other things failing...
● I’m trying to save a file but it keeps failing...
○ Turn speculation off!
● I get an error “no space left on device”!
○ Make sure SPARK_LOCAL_DIRS points to the right disk partition on the slaves
● I keep losing my executors
○ could be a memory problem: increase executor memory, or
reduce the number of cores
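One way to apply these fixes is through the Spark configuration; a sketch with illustrative values (the same keys can also be passed to spark-submit with --conf):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.speculation", "false")         # avoid duplicate speculative writes when saving files
        .set("spark.local.dir", "/mnt/big-disk")   # disk used for shuffle and spill files on the slaves
        .set("spark.executor.memory", "8g")        # more memory per executor
        .set("spark.executor.cores", "2"))         # fewer concurrent tasks per executor
sc = SparkContext(conf=conf)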
Skimlinks | Spark… A view from the trenches
Skimlinks | Spark… A view from the trenches
Apache Spark
Skimlinks | Spark… A view from the trenches
DataFrames API
Skimlinks | Spark… A view from the trenches
DataFrames API
DataFrames and Spark SQL
Skimlinks | Spark… A view from the trenches
A DataFrame is a distributed collection of data organized into named columns.
● API very similar to Pandas/R DataFrames
Spark SQL lets you query DataFrames using SQL-like syntax
● Catalyst SQL engine
● HiveContext opens up most HQL functionality on DataFrames
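A sketch using the Spark 1.6-era API this deck refers to; the file name comes from the datasets slide, and sc is the existing SparkContext:

from pyspark.sql import HiveContext   # SQLContext also works if Hive support isn't needed

sqlContext = HiveContext(sc)
reviews_df = sqlContext.read.json("fashion.json")   # schema is inferred from the JSON records
reviews_df.printSchema()
reviews_df.groupBy("asin").count().show(5)          # DataFrame API, similar in feel to pandas
reviews_df.registerTempTable("reviews")             # expose the DataFrame to SQL
sqlContext.sql("SELECT asin, COUNT(*) AS n FROM reviews GROUP BY asin").show(5)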
RDDs and DataFrames
Skimlinks | Spark… A view from the trenches
RDD
● Data is stored as independent objects in partitions
● Does process optimization at the RDD level
● More focus on “HOW” to obtain the required data
DataFrame
● Data has higher-level column information in addition to partitioning
● Does optimizations on the schematic structure
● More focus on “WHAT” data is required
Transformable: an RDD can be converted into a DataFrame and back.
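A sketch of that conversion in PySpark, assuming the rd RDD, the reviews_df DataFrame and the HiveContext from the earlier sketches are in scope:

from pyspark.sql import Row

rows_rdd = reviews_df.rdd                            # DataFrame -> RDD of Row objects
ratings_df = (rd.map(lambda x: Row(asin=x['asin'], overall=x['overall']))
                .toDF())                             # RDD -> DataFrame; toDF needs a SQLContext in scope
ratings_df.show(3)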
Skimlinks | Spark… A view from the trenches
How do DataFrames work?
● Why DataFrames?
● Overview
This section is inspired by:
http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science
Skimlinks | Spark… A view from the trenches
Main Considerations
Skimlinks | Spark… A view from the trenches
Chart extracted from:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Fundamentals
Skimlinks | Spark… A view from the trenches
(diagram: Catalyst takes a query expressed either as SQL — SELECT cols FROM tables WHERE cond — or as DataFrame code, builds an Unresolved Logical Plan, resolves it into a Logical Plan, produces an Optimized Logical Plan, generates candidate Physical Plans, and finally executes the selected, efficient Physical Plan as RDD operations)
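You can look at the plans Catalyst produces for a DataFrame query with explain(); a sketch using the reviews_df DataFrame assumed above:

query = reviews_df.filter(reviews_df.overall > 3).select("asin", "overall")
query.explain(True)   # prints the parsed, analyzed and optimized logical plans plus the physical plan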
New stuff: Data Source APIs
● Schema evolution
o In Parquet, you can start from a basic schema and keep adding new fields.
● Run SQL directly on the file
o Because Parquet files carry their own schema, you can run SQL on the file itself.
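A sketch of both features with the Spark 1.6 API; the Parquet paths are illustrative:

df = (sqlContext.read
      .option("mergeSchema", "true")      # reconcile fields that were added to the schema over time
      .parquet("data/reviews_parquet"))

# Query a Parquet file directly, without registering a table first
sqlContext.sql("SELECT asin, overall FROM parquet.`data/reviews_parquet`").show(5)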
Data Source APIs
● Partition discovery
o Table partitioning is used in systems like Hive
o Data is normally stored in different directories, one per partition value
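A sketch of partition discovery, assuming a Hive-style directory layout (paths illustrative):

# data/reviews/category=fashion/part-00000.parquet
# data/reviews/category=sports/part-00000.parquet
df = sqlContext.read.parquet("data/reviews")
df.printSchema()                              # 'category' appears as a discovered partition column
df.filter(df.category == "sports").count()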
spark-sklearn
● Parameter tuning is the problem
o Dataset is small
o Grid search is BIG
More info: https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html
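A sketch of the spark-sklearn approach from the linked post: scikit-learn stays the modelling library, and Spark fans the parameter grid out across the cluster.

from sklearn import datasets, svm
from spark_sklearn import GridSearchCV         # drop-in for sklearn's GridSearchCV

digits = datasets.load_digits()                # small dataset: fits comfortably on every worker
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}   # the grid is what is big

clf = GridSearchCV(sc, svm.SVC(), param_grid)  # each parameter combination runs as a Spark task
clf.fit(digits.data, digits.target)
print(clf.best_params_)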
New stuff: DataSet API
● Spark: complex analyses with minimal programming effort
● Run Spark applications faster
o Closely knit to the Catalyst and Tungsten engines
● Extension of the DataFrame API: a type-safe, object-oriented programming interface
More info:
https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
Spark 2.0
● API changes
● A lot of work on the Tungsten execution engine
● Support for the Dataset API
● Unification of the DataFrame & Dataset APIs
More info: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
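In Spark 2.0 the separate contexts are folded into a single entry point, SparkSession; a minimal sketch (file name from the datasets slide):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark2-example").getOrCreate()
df = spark.read.json("fashion.json")        # replaces sqlContext.read.json
df.createOrReplaceTempView("reviews")       # replaces registerTempTable
spark.sql("SELECT COUNT(*) AS n FROM reviews").show()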
Important Links
Skimlinks | Spark… A view from the trenches
● Amazon Dataset :
https://snap.stanford.edu/data/web-Amazon.html
● Spark DataFrames:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
● More resources about Apache Spark:
○ http://www.slideshare.net/databricks
○ https://www.youtube.com/channel/UC3q8O3Bh2Le8Rj1-Q-_UUbA
● Spark SQL programming guide for 1.6.1:
https://spark.apache.org/docs/latest/sql-programming-guide.html
● Using Apache Spark in real world applications:
http://files.meetup.com/13722842/Spark%20Meetup.pdf
● Tungsten
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
● Further Questions:
○ Maria : @mariarmestre
○ Erik : @zerophewl
○ Sahan : @in4maniac
Skimlinks is hiring Data Scientists and Senior Software Engineers !!
● Machine Learning
● Apache Spark and Big Data
Get in touch with:
● Sahan : sahan@skimlinks.com
● Erik : erik@skimlinks.com

Editor's Notes

  1. Partitions and tasks are sometimes used interchangeably.
  2. Partitions and tasks are sometimes used interchangeably.
  3. CREDITS
  4. CREDITS
  5. Understanding the way Spark distributes its computations across the cluster is very important for understanding why things fail. Must read: the Spark overview.
  6. RDD graph: this is how we represent the computations; each operation creates an RDD.
  7. Logical plan: how can we execute the computations efficiently? The goal is to pipeline as much as possible (fuse operations together so that we don't go over the data multiple times and don't have too much overhead from multiple operations). Fusing means we take the output of a function and put it directly into another function call (the overhead of multiple pipelineable operations is extremely small) ⇒ we group all operations together into a single super-operation that we call a stage. Until when can you just fuse operations? ⇒ until we need to reorganise the data! How do we generate the result? If it is independent of any other data, then it is pipelineable (e.g. the first map). GroupByKey needs the data to be reorganised and depends on the results of multiple previous tasks.
  8. Each stage is split into tasks: each task is data + computation. The bottom of the first stage is the map() and the top of the first stage is the groupBy(). We assume here that we have as many input tasks/partitions as we have output tasks/partitions. In a shuffle, we typically need to group data by some key, so in a typical reduceByKey we will have to send tasks from each mapper (output of stage 1) to each single reducer (input of stage 2). We hash all the ASINs to the same bucket and group them in the same place; e.g. if we need to reduceByKey on the ASIN, then each reducer will contain a range of ASINs. We execute all tasks of one stage before we can start another stage. Shuffle ⇒ data is moved across the network, an expensive operation, avoided whenever possible. Intermediate files are written to disk; data is partitioned before the shuffle into 4 files; once all files are there, the second stage begins. Each task in the input of stage 2 will read these files. If the data for the same key is already in the same place, then there is no need to send data over the network, which is highly desirable. Spark does some pre-aggregation before sending over the network as an optimisation.
  9. Data skew: e.g. many reviews for the same product, so one of the partitions will be very large. This is just the tip of the iceberg, but it gives you an overview of what Spark does behind the scenes. It is very useful to know once you start dealing with larger amounts of data and need to debug a job. Symptoms: machine/executor failures, memory problems or too many shuffle files.
  10. Partitions and tasks are sometimes used interchangeably.
  11. RDDs can do all the transformations that are available to DataFrames, so why DataFrames? What you need rather than how to get what you need. The ability to enable your entire organization to use the power of big data without getting intimidated.
  12. Partitions and tasks are sometimes used interchangeably.