2. This tutorial
Skimlinks | Spark… A view from the trenches !!
● Some key Spark concepts (2 minute crash course)
● First part: Spark core
○ Notebook: basic operations
○ Spark execution model
● Second part: Dataframes and SparkSQL
○ Notebook : using DataFrames and Spark SQL
○ DataFrames execution model
● Final note on Spark configs and useful areas to go from here
3. How to set up the tutorial
● Directions and resources for setting up the tutorial in your local
environment can be found in this blog post:
https://in4maniac.wordpress.com/2016/10/09/spark-tutorial/
4. The datasets
● Data extracted from the Amazon dataset
o Image-based recommendations on styles and substitutes, J. McAuley, C. Targett, J. Shi, A. van den Hengel, SIGIR, 2015
o Inferring networks of substitutable and complementary products, J. McAuley, R. Pandey, J. Leskovec, Knowledge Discovery and Data Mining, 2015
● Sample of Amazon product reviews
o fashion.json, electronics.json, sports.json
o fields: ASIN, review text, reviewer name, …
● Sample of product metadata
o sample_metadata.json
o fields: ASIN, price, category, ...
5. Some Spark definitions (1)
● An RDD (Resilient Distributed Dataset) is a distributed dataset
● The dataset is divided into partitions
● It is possible to cache the data in memory
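A minimal PySpark sketch of these definitions, assuming a SparkContext sc is already available (as in the tutorial notebooks):

rdd = sc.parallelize(range(1000), 8)   # an RDD split into 8 partitions
print(rdd.getNumPartitions())          # -> 8

rdd.cache()                            # mark the RDD to be cached in memory
print(rdd.count())                     # first action materialises (and caches) it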
6. Some Spark definitions (2)
● A cluster = a master node and slave nodes
● Transformations are issued through the Spark context
● Only the master node has access to the Spark context
● Operations are either actions or transformations
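A small sketch of the action/transformation distinction (again assuming an existing SparkContext sc):

words = sc.parallelize(["spark", "from", "the", "trenches"])
lengths = words.map(len)                      # transformation: lazy, just extends the RDD graph
long_words = lengths.filter(lambda n: n > 4)  # still lazy, nothing has run yet
print(long_words.collect())                   # action: runs on the slaves, returns [5, 8]
print(long_words.count())                     # action: returns 2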
8. Why understand Spark internals?
● Essential to understand failures and improve performance
This section is a condensed version of: https://spark-summit.org/2014/talk/a-deeper-understanding-of-spark-internals
9. From code to computations
rd = sc.textFile('product_reviews.txt')
rd.map(lambda x: (x['asin'], x['overall'])) \
  .groupByKey() \
  .filter(lambda x: len(x[1]) > 1) \
  .count()
10. From code to computations
1. You write code using RDDs
2. Spark creates a graph of RDDs
rd = sc.textFile('product_reviews.txt')
rd.map(lambda x: (x['asin'], x['overall'])) \
  .groupByKey() \
  .filter(lambda x: len(x[1]) > 1) \
  .count()
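A runnable variant of this example, under the assumption that the input is one JSON review per line (as in the tutorial's fashion.json) with 'asin' and 'overall' fields:

import json

rd = sc.textFile('fashion.json').map(json.loads)   # parse each line into a dict
n = (rd.map(lambda x: (x['asin'], x['overall']))   # (product, rating) pairs
       .groupByKey()                               # group all ratings per product
       .filter(lambda x: len(x[1]) > 1)            # keep products with more than one review
       .count())                                   # action: triggers the whole job
print(n)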
11. Execution model
3. Spark figures out a logical execution plan for each computation
[Diagram: the RDD graph is split into stages (Stage 1, Stage 2); operations are pipelined within a stage, and a new stage starts where the data has to be reorganised (shuffled)]
13. If your shuffle fails...
● Shuffles are usually the bottleneck:
o if very large tasks ⇒ memory pressure
o if too many tasks ⇒ network overhead
o if too few tasks ⇒ suboptimal cluster utilisation
● Best practices (see the sketch below):
o always tune the number of partitions!
o between 100 and 10,000 partitions
o lower bound: at least ~2x number of cores
o upper bound: task should take at least 100 ms
● https://spark.apache.org/docs/latest/tuning.html
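A sketch of tuning partition counts around a shuffle (the numbers are illustrative only, not recommendations for any particular cluster):

pairs = sc.parallelize([('A1', 5.0), ('A1', 3.0), ('A2', 4.0)] * 1000)  # toy (asin, rating) RDD
grouped = pairs.groupByKey(numPartitions=200)   # most shuffle operations accept a partition count
rebalanced = grouped.repartition(400)           # or repartition explicitly before/after the shuffle
print(rebalanced.getNumPartitions())            # -> 400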
14. Other things failing...
● I'm trying to save a file but it keeps failing...
○ Turn speculation off!
● I get an error "no space left on device"!
○ Make sure SPARK_LOCAL_DIRS points at the right disk partition on the slaves
● I keep losing my executors
○ could be a memory problem: increase executor memory, or reduce the number of cores
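The corresponding configuration knobs, as they might be set when creating the context (a sketch; the path and sizes are illustrative, not recommendations):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.speculation", "false")           # turn speculation off
        .set("spark.local.dir", "/mnt/spark-tmp")    # shuffle spill location (SPARK_LOCAL_DIRS equivalent)
        .set("spark.executor.memory", "8g")          # more memory per executor
        .set("spark.executor.cores", "2"))           # fewer cores per executor
sc = SparkContext(conf=conf)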
19. DataFrames and Spark SQL
A DataFrame is a collection of data organised into named columns.
● API very similar to Pandas/R DataFrames
Spark SQL lets you query DataFrames using an SQL-like language over their schema.
● Catalyst SQL engine
● The Hive context opens up most HQL functionality on DataFrames
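A short sketch in the Spark 1.x API used in this tutorial, assuming the sample review file from earlier and a Hive context:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
reviews = sqlContext.read.json('fashion.json')          # DataFrame with named columns
reviews.printSchema()
reviews.select('asin', 'overall').groupBy('asin').count().show(5)   # Pandas/R-like API
reviews.registerTempTable('reviews')                    # expose it to Spark SQL
sqlContext.sql("SELECT asin, AVG(overall) AS avg_rating FROM reviews GROUP BY asin").show(5)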
20. RDDs and DataFrames
RDD:
● Data is stored as independent objects in partitions
● Process optimization is done at the RDD level
● More focus on "HOW" to obtain the required data
DataFrame:
● Data carries higher-level column information in addition to partitioning
● Optimizations are done on the schematic structure
● More focus on "WHAT" data is required
The two are transformable into one another.
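Converting between the two is a one-liner in each direction (a sketch, reusing sqlContext and the reviews DataFrame from the previous slide):

from pyspark.sql import Row

reviews_rdd = reviews.rdd                                   # DataFrame -> RDD of Row objects
print(reviews_rdd.map(lambda row: row['asin']).take(3))
rows = sc.parallelize([Row(asin='A1', overall=5.0), Row(asin='A2', overall=3.0)])
sqlContext.createDataFrame(rows).show()                     # RDD -> DataFrame, schema inferred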
22. How do DataFrames work?
● Why DataFrames?
● Overview
This section is inspired by: http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science
23. Main Considerations
Chart extracted from: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
24. Fundamentals
[Diagram: Catalyst query planning. DataFrame code and Spark SQL queries (SELECT cols FROM tables WHERE cond) are parsed into an unresolved logical plan, resolved into a logical plan, rewritten into an optimized logical plan, expanded into candidate physical plans, and finally executed as an efficient physical plan over RDDs.]
26. New stuff: Data Source APIs
● Schema evolution
o In Parquet, you can start from a basic schema and keep adding new fields.
● Run SQL directly on the file
o Because Parquet files carry their own structure, you can run SQL on the file itself.
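Both features in the 1.x API (the paths are illustrative):

# merge schemas across Parquet files written with evolving schemas
df = sqlContext.read.option("mergeSchema", "true").parquet("reviews_parquet/")
# query a Parquet file directly, without registering a table first
sqlContext.sql("SELECT asin, overall FROM parquet.`reviews_parquet/part1.parquet`").show(5)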
27. Data Source APIs
● Partition discovery
o Table partitioning is used in systems like Hive
o Data is normally stored in different directories, one per partition value
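For example, with a Hive-style directory layout (paths are illustrative), Spark discovers the partition column from the directory names:

# reviews_parquet/category=fashion/part-00000.parquet
# reviews_parquet/category=sports/part-00000.parquet
df = sqlContext.read.parquet("reviews_parquet/")   # point Spark at the root directory
df.printSchema()                                   # schema now includes a discovered 'category' column
df.filter(df.category == "fashion").count()        # partition pruning skips the other directories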
28. spark-sklearn
● Parameter tuning is the problem
o the dataset is small
o the grid search is BIG
More info: https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html
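A hedged sketch of what this looks like with the spark-sklearn package (following the example in the linked post; check the package docs for the exact API in your version):

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV   # drop-in replacement for sklearn's GridSearchCV

digits = load_digits()                                                 # small dataset: fits on every worker
param_grid = {"max_depth": [3, None], "n_estimators": [10, 50, 100]}   # big grid: distributed by Spark
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid)
gs.fit(digits.data, digits.target)
print(gs.best_params_)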
29. New stuff: Dataset API
● Spark: complex analyses with minimal programming effort
● Run Spark applications faster
o Closely knit to the Catalyst engine and the Tungsten engine
● Extension of the DataFrame API: a type-safe, object-oriented programming interface
More info: https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
30. Spark 2.0
● API changes
● A lot of work on the Tungsten execution engine
● Support of the Dataset API
● Unification of the DataFrame & Dataset APIs
More info: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
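In PySpark, the unification mostly shows up as the new SparkSession entry point (a sketch; on the Scala side a DataFrame is just Dataset[Row]):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trenches").getOrCreate()
reviews = spark.read.json("fashion.json")          # the reader now hangs off the session
reviews.createOrReplaceTempView("reviews")         # replaces registerTempTable
spark.sql("SELECT COUNT(*) FROM reviews").show()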
31. Important Links
● Amazon Dataset:
https://snap.stanford.edu/data/web-Amazon.html
● Spark DataFrames:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
● More resources about Apache Spark:
○ http://www.slideshare.net/databricks
○ https://www.youtube.com/channel/UC3q8O3Bh2Le8Rj1-Q-_UUbA
● Spark SQL programming guide for 1.6.1:
https://spark.apache.org/docs/latest/sql-programming-guide.html
● Using Apache Spark in real world applications:
http://files.meetup.com/13722842/Spark%20Meetup.pdf
● Tungsten:
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
● Further Questions:
○ Maria: @mariarmestre
○ Erik: @zerophewl
○ Sahan: @in4maniac
32. Skimlinks is hiring Data Scientists and Senior Software Engineers !!
● Machine Learning
● Apache Spark and Big Data
Get in touch with:
● Sahan: sahan@skimlinks.com
● Erik: erik@skimlinks.com
Editor's Notes
-partitions and tasks are sometimes used interchangeably
-Understanding the way Spark distributes its computations across the cluster is very important to understand why things fail.
-must read: Spark overview
-RDD graph: this is how we represent the computations
-each operation creates an RDD
-logical plan: how can we execute the computations efficiently?
-goal is to pipeline as much as possible (fuse operations together so that we don't go over the data multiple times and don't have too much overhead from multiple operations)
-fusing means we take the output of a function and put it directly into another function call (overhead of multiple operations that are pipelineable is extremely small) ⇒ we group all operations together into a single super-operation that we call a stage.
-until when can you just fuse operations? ⇒ until we need to reorganise the data!
-how do we generate the result? if independent of any other data, then pipelineable (e.g. first map). GroupByKey needs to be reorganised and depends on the results of multiple previous tasks.
Each stage is split into tasks: each task is data + computation
The bottom of the first stage is the map() and the top of the first stage is the groupBy()
we assume here that we have as many input tasks/partitions as we have output tasks/partitions
in a shuffle, we typically need to group data by some key so often in a typical reduceByKey, we will have to send tasks from each mapper (output of stage 1) to each single reducer (input of stage 2)
we hash the ASINs so that the same ASIN always goes to the same bucket and gets grouped in the same place
e.g. if we need to reduceByKey on the asin, then each reducer will contain a range of asins
We execute all tasks of one stage before we can start another stage
Shuffle ⇒ data is moved across the network, expensive operation, avoided whenever possible
intermediate files written to disk
data is partitioned before the shuffle into 4 files
once all files are there, the second stage begins. Each task in the input of stage 2 will read these files.
if the data for the same key is already in the same place, then there is no need to send data over the network, which is highly desirable
Spark does some pre-aggregation before sending over the network as an optimisation
-data skew: e.g. many reviews for the same product, one of the partitions will be very large
-this is just the tip of the iceberg, but gives you an overview of what Spark does behind the scenes. It is very useful to know once you start dealing with larger amounts of data, and you need to debug a job.
symptoms:
-machine/executor failures: memory problems or too many shuffle files
RDDs can do all the transformations that are available to DataFrames, so why DataFrames?
What you need rather than how to get what you need
Enables your entire organization to use the power of big data without getting intimidated