A Year With Spark
Martin Goodson, VP Data Science
Skimlinks
Phase I of my big data experience
R files, Python files, awk, sed, a job scheduler
(Sun Grid Engine), Make/bash scripts
Phase II of my big data experience
Pig + Python user-defined functions
Phase III of my big data experience
PySpark?
Skimlinks data
Automated Affiliatization Tech
140,000 publisher sites
Collect 30TB/month of user behaviour (clicks,
impressions, purchases)
Data science team
5 Data scientists
Machine learning or statistical computing
Varying programming experience
Not engineers
No devops
Reality
Spark Can Be Unpredictable
Reality
Learning in depth how Spark works
Try to divide and conquer
Learning how to configure Spark properly
Learning in depth how Spark works
Read all this:
https://spark.apache.org/docs/1.2.1/programming-guide.html
https://spark.apache.org/docs/1.2.1/configuration.html
https://spark.apache.org/docs/1.2.1/cluster-overview.html
And then:
https://www.youtube.com/watch?v=49Hr5xZyTEA (spark internals)
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py
Try to divide and conquer
Don't throw 30TB of data at a Spark script and
expect it to just work.
Divide the work into bite-sized chunks,
aggregating and projecting as you go.
Try to divide and conquer
Use reduceByKey() not groupByKey()
Use max() and add()
(cf. http://www.slideshare.net/samthemonad/spark-meetup-talk-final)
Start with this
(k1, 1)
(k1, 1)
(k1, 2)
(k1, 1)
(k1, 5)
(k2, 1)
(k2, 2)
Use RDD.reduceByKey(add) to get this:
(k1, 10)
(k2, 3)
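As a runnable sketch (assuming an existing SparkContext sc):

from operator import add

pairs = sc.parallelize([('k1', 1), ('k1', 1), ('k1', 2), ('k1', 1), ('k1', 5),
                        ('k2', 1), ('k2', 2)])

# reduceByKey merges values per key inside each partition first (map-side combine),
# so only one small record per key is shuffled, unlike groupByKey which ships every value.
totals = pairs.reduceByKey(add)
print(sorted(totals.collect()))   # [('k1', 10), ('k2', 3)]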
Key concept: reduceByKey(combineByKey)
(diagram) The input pairs (k1, 1), (k1, 1), (k1, 2), (k1, 1), (k1, 5) are first merged
within each partition (combineLocally) into partial combiners such as {k1: 2, …},
{k1: 3, …}, {k1: 5, …}; these are then merged across partitions (_mergeCombiners)
into the final result {k1: 10, …}.
Key concept: reduceByKey(combineByKey)
(diagram, as above, annotated with reduceByKey(numPartitions): the numPartitions
argument controls how many partitions the merged combiners are shuffled into)
PySpark Memory: worked example
PySpark Memory: worked example
10 x r3.4xlarge (122G, 16 cores)
Use half for each executor: 60GB
Number of cores = 120
Cache = 60% x 60GB x 10 = 360GB
Each Java thread: 40% x 60GB / 12 = ~2GB
Each Python process: ~4GB
OS: ~12GB
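Spelling the same budget out as arithmetic (a sketch; the 12-cores-per-node split and the 60% storage fraction are the slide's assumptions / Spark 1.2 defaults):

nodes, node_ram_gb = 10, 122                          # 10 x r3.4xlarge
executor_gb = 60                                      # about half of each node for the JVM executor
cores_per_node = 12                                   # 10 x 12 = 120 cores in total

cluster_cache_gb = 0.6 * executor_gb * nodes          # = 360 GB of RDD cache cluster-wide
per_task_jvm_gb = 0.4 * executor_gb / cores_per_node  # = ~2 GB per Java task thread

leftover_gb = node_ram_gb - executor_gb               # = 62 GB per node outside the JVM:
                                                      #   ~12 Python workers x ~4 GB + ~12 GB for the OS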
PySpark Memory: worked example
spark.executor.memory=60g
spark.cores.max=120
spark.driver.memory=60g
PySpark Memory: worked example
spark.executor.memory=60g
spark.cores.max=120
spark.driver.memory=60g
~/spark/bin/pyspark --driver-memory 60g
PySpark: other memory configuration
spark.akka.frameSize=1000
spark.kryoserializer.buffer.max.mb=10
(spark.python.worker.memory)
PySpark: other configuration
spark.shuffle.consolidateFiles=true
spark.rdd.compress=true
spark.speculation=true
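A hedged sketch of passing these settings programmatically (property names as listed on the slides, which target Spark 1.2; several of them, e.g. spark.akka.frameSize and spark.shuffle.consolidateFiles, were later removed or renamed):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set('spark.executor.memory', '60g')
        .set('spark.cores.max', '120')
        .set('spark.driver.memory', '60g')            # ignored once the driver JVM is running,
                                                      # hence the --driver-memory flag above
        .set('spark.akka.frameSize', '1000')
        .set('spark.kryoserializer.buffer.max.mb', '10')
        .set('spark.shuffle.consolidateFiles', 'true')
        .set('spark.rdd.compress', 'true')
        .set('spark.speculation', 'true'))

sc = SparkContext(conf=conf)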
Errors
java.net.SocketException: Connection reset
java.net.SocketTimeoutException: Read timed out
Lost executor, cancelled key exceptions, etc
Errors
java.net.SocketException: Connection reset
java.net.SocketTimeoutException: Read timed out
Lost executor, cancelled key exceptions, etc
All of the above are caused by memory errors!
Errors
'ERROR LiveListenerBus: Dropping SparkListenerEvent
because no remaining room in event queue': caused by filter()ing a small amount of
data out of many partitions - use coalesce()
collect() fails - increase driver memory + akka frameSize
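A minimal sketch of the coalesce() fix, assuming a selective filter over a many-partition input (the path and predicate are illustrative):

events = sc.textFile('s3n://bucket/events/2014-10-01/*')   # hypothetical path, many partitions
clicks = events.filter(lambda line: '"click"' in line)      # keeps only a tiny fraction

# The filtered data is small but still spread across every input partition;
# coalesce() (no shuffle) collapses it so the driver is not flooded with
# near-empty tasks and listener events.
clicks = clicks.coalesce(100)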
Were our assumptions correct?
We have a very fast development process.
Use Spark both for development and for scale-up.
Scalable data science development.
Large-scale machine learning
with Spark and Python
Empowering the data scientist
by Maria Mestre
ML @ Skimlinks
● Mostly for research and prototyping
● No developer background
● Familiar with scikit-learn and Spark
● Building a data scientist toolbox
Every ML system….
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier
Data collection: scraping lots of pages
This is how I would do it on my local machine…
● use the Scrapy package
● write a function scrape() that creates a Scrapy object

urls = open('list_urls.txt', 'r').readlines()
output = s3_bucket + 'results.json'
scrape(urls, output)
Distributing over the cluster

def distributed_scrape(urls, index, s3_bucket):
    output = s3_bucket + 'part' + str(index) + '.json'
    scrape(urls, output)
    return []   # mapPartitionsWithIndex expects an iterable back

urls = open('list_urls.txt', 'r').readlines()
urls = sc.parallelize(urls, 100)
urls.mapPartitionsWithIndex(lambda index, urls:
    distributed_scrape(urls, index, s3_bucket)).count()   # count() forces the lazy scrape to run
Installing scrapy over the cluster
1/ need to use Python 2.7
echo 'export PYSPARK_PYTHON=python2.7' >> ~/spark/conf/spark-env.sh
2/ use pssh to install packages on the slaves
pssh -h /root/spark-ec2/slaves 'easy_install-2.7 Scrapy'
Every ML system….
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier
Example: filtering
● we want to find the activity of 30M users (2 GB) in 2
months of events (6 TB)
○ map-side join using broadcast() ⇒ does not work with
large objects!
■ e.g. input.filter(lambda x: x['user'] in user_list_b)
○ use of mapPartitions()
■ e.g. input.mapPartitions(lambda x: read_file_and_filter(x))
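A hedged sketch of the mapPartitions() route, assuming read_file_and_filter() loads the 2 GB user list from a file already present on each worker (the path and field name are illustrative):

def read_file_and_filter(events):
    # load the user list once per partition instead of broadcasting it
    with open('/mnt/user_list.txt') as f:                 # hypothetical path on each worker
        users = set(line.strip() for line in f)
    for event in events:
        if event['user'] in users:
            yield event

matches = input.mapPartitions(read_file_and_filter)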
bloom filter join: 6 TB (~11B input events) → 35 mins → 113 GB (529M matches)
→ 9 mins → 60 GB (515M matches)
Example: segmenting urls
● we want to convert a URL 'www.iloveshoes.com' to ['i', 'love', 'shoes']
● Segmentation
○ wordsegment package in Python ⇒ very slow!
○ 300M urls take 10 hours with 120 cores!
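One way to push this through Spark (a sketch, assuming the wordsegment package is installed on every worker; the host-extraction logic is deliberately crude, and url_list stands in for the 300M urls):

def segment_partition(urls):
    import wordsegment                         # import + corpus load once per partition
    if hasattr(wordsegment, 'load'):           # newer wordsegment versions need an explicit load()
        wordsegment.load()
    for url in urls:
        host = url.split('//')[-1].split('/')[0]           # crude host extraction
        name = host.replace('www.', '').split('.')[0]      # 'www.iloveshoes.com' -> 'iloveshoes'
        yield (url, wordsegment.segment(name))             # -> ['i', 'love', 'shoes']

segmented = sc.parallelize(url_list, 100).mapPartitions(segment_partition)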
Example: getting a representative sample
Our solution in Spark!

sample = sc.parallelize([], 1)
sample_size = 1000
input.cache()
for category, proportion in stats.items():
    category_pages = input.filter(lambda x: x['category'] == category)
    # takeSample() returns a local list, so parallelize it before union()
    category_sample = category_pages.takeSample(False, int(sample_size * proportion))
    sample = sample.union(sc.parallelize(category_sample))

MLlib offers a probabilistic solution (not an exact sample size):
sample = sampleByKey(input, stats)
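For reference, a sketch of the sampleByKey() route: in PySpark it is an RDD method, RDD.sampleByKey(withReplacement, fractions), on a keyed RDD, so the category has to become the key and the per-key fractions have to be derived first (the field name and fraction formula are illustrative):

keyed = input.map(lambda x: (x['category'], x))
counts = keyed.countByKey()                    # records per category (driver-side dict)
# per-key fraction targeting roughly sample_size * proportion records per category
fractions = {cat: min(1.0, sample_size * prop / counts[cat]) for cat, prop in stats.items()}
sample = keyed.sampleByKey(False, fractions).values()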
Every ML system….
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier
Grid search for hyperparameters
Problem: we have some candidate values [λ1, λ2, ..., λ10000] for a hyperparameter λ:
which one should we choose?
If the data is small enough that processing time is fine
➢ Do it in a single machine
If the data is too large to process on a single machine
➢ Use MLlib
If the data can be processed on a single machine but takes too long to train
➢ The next slide!
number of combinations = |{parameters}| = 2
Using cross-validation to optimise a hyperparameter
1. separate the data into k equally-sized chunks
2. for each candidate value λi
a. use (k-1) chunks to fit the classifier parameters
b. use the remaining chunk to get a classification score
c. report the average score
3. At the end, select the λi that achieves the best average score
number of combinations = |{parameters}| x |{folds}| = 4
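A minimal sketch of the idea behind the parameters x folds count ("the next slide" case: data fits on one machine, training is slow): broadcast the small training set and let Spark run one scikit-learn fit per (candidate value, fold) pair. X, y, LogisticRegression and the candidate grid are illustrative, not from the talk:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

data_b = sc.broadcast((X, y))                 # X, y: small numpy training set
candidates = [0.01, 0.1, 1.0, 10.0]           # candidate values for C
folds = list(KFold(n_splits=5).split(X))
combos = [(c, tr, te) for c in candidates for tr, te in folds]

def fit_and_score(combo):
    c, tr, te = combo
    X_all, y_all = data_b.value
    clf = LogisticRegression(C=c).fit(X_all[tr], y_all[tr])
    return (c, clf.score(X_all[te], y_all[te]))

scores = (sc.parallelize(combos, len(combos))   # one (candidate, fold) task per partition
            .map(fit_and_score)
            .reduceByKey(lambda a, b: a + b)    # sum fold scores per candidate
            .mapValues(lambda s: s / len(folds))
            .collect())
best = max(scores, key=lambda kv: kv[1])[0]     # candidate with the best average score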
Every ML system….
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier
Apply the classifier over the new_data: easy!
With scikit-learn:
classifier_b = sc.broadcast(classifier)
new_labels = new_data.map(lambda x: classifier_b.value.predict(x))
With scikit-learn but cannot broadcast:
save classifier models to files, ship to S3
use mapPartitions to read model parameters and classify
With MLlib:
(model._threshold = None)
new_labels = new_data.map(lambda x: model.predict(x))
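A sketch of the "cannot broadcast" path, assuming the pickled scikit-learn model was shipped to S3 and boto is available on the workers; bucket and key names are hypothetical:

import pickle

def classify_partition(rows):
    import boto                                            # boto 2, era-appropriate
    bucket = boto.connect_s3().get_bucket('my-models')     # hypothetical bucket
    blob = bucket.get_key('classifier.pkl').get_contents_as_string()
    clf = pickle.loads(blob)                               # one model load per partition
    for row in rows:
        yield clf.predict([row])[0]

new_labels = new_data.mapPartitions(classify_partition)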
Thanks!
Apache Spark for Big
Data
Spark at Scale & Performance Tuning
Sahan Bulathwela | Data Science Engineer @ Skimlinks
Outline
● Spark at scale: Big Data Example
● Tuning and Performance
Spark at Scale: Big Data Example
● Yes, we use Spark !!
● Not just for prototyping or one-off analyses
● Run automated analyses at large scale on a
daily basis
● Use-case: generating audience statistics for
our customers
Before…
● We provide data products based on
audience statistics to customers
● Extract event data from Datastore
● Generate Audience statistics and reports
Data
● Skimlinks records web data as user
events such as clicks and impressions
● Our Data!!
○ 18M click records (11 GB)
○ 203M impression records (950 GB)
○ These numbers are per day (Oct 01, 2014)
● About 1TB of relevant events
A few days and data scientists
later...
Statistics
Major pain points
● Most of the data is not relevant
○ Only 3-4 out of 30ish fields are
useful for each report
● Many duplicate steps
○ Reading the data
○ Extracting relevant fields
○ Transformations such as classifying
events
Solution
Aggregation doing its magic
● Mostly grouping events and summarizing
● Distribute the workload in time
● “Reduce by” instead of “Group by”
● BOTS
Deep Dive
Datastore → Events (1 TB) → Build Daily Profiles → Daily Profiles (1.8 GB, an
intermediate data structure compressed with GZIP) → Build Monthly Profiles →
Monthly Aggregate (40 GB) → Generate Audience Statistics → Statistics (7 GB) →
Customers
● Takes 4 hours
● 150 statistics
● Delivered daily to clients
SO WHAT???
                                        Before         After
Computing daily event summary           1+ DAYS !!!    20 mins
Computing monthly aggregate             -              40 mins
Storing daily event summary             100's of GBs   1.8 GB
Storing monthly aggregate               -              40 GB
Total time taken for generating stats   1+ DAYS !!!    3 hrs 30 mins
Time taken per report                   1+ DAYS !!!    1.4 mins
Parquet enabled us to reduce our storage costs by 86% and increase data loading
speed by 5x
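For context, a sketch of the Parquet read/write step using the Spark SQL API of that era (SchemaRDD methods from Spark 1.x); the paths and profile fields are illustrative:

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

# write the daily profiles once as Parquet (columnar, compressed)...
rows = daily_profiles.map(lambda p: Row(user=p['user'], clicks=p['clicks']))
sqlContext.inferSchema(rows).saveAsParquetFile('s3n://bucket/daily-profiles/2014-10-01')

# ...so later jobs read back only the columns each report needs
profiles = sqlContext.parquetFile('s3n://bucket/daily-profiles/2014-10-01')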
Storage
Performance when parsing 31 daily profiles
Thank You !!