Spark Meetup

A Year With Spark
Martin Goodson, VP Data Science
Skimlinks

Phase I of my big data experience
R files, python files, Awk, Sed, job scheduler
(sun grid engine), Make/bash scripts

Phase II of my big data experience
Pig + python user defined functions

Phase III of my big data experience
PySpark?

Skimlinks data
Automated Affiliatization Tech
140,000 publisher sites
Collect 30 TB per month of user behaviour (clicks,
impressions, purchases)

Data science team
5 Data scientists
Machine learning or statistical computing
Varying programming experience
Not engineers
No devops

Reality
Learning in depth how spark works
Try to divide and conquer
Learning how to configure spark properly

Learning in depth how spark works
Read all this:
https://spark.apache.org/docs/1.2.1/programming-guide.html
https://spark.apache.org/docs/1.2.1/configuration.html
https://spark.apache.org/docs/1.2.1/cluster-overview.html
And then:
https://www.youtube.com/watch?v=49Hr5xZyTEA (spark internals)
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py

Don't throw 30 TB of data at a spark script and
expect it to just work.
Divide the work into bite sized chunks -
aggregating and projecting as you go.

Use reduceByKey() not groupByKey()
Use max() and add()
(cf. http://www.slideshare.net/samthemonad/spark-meetup-talk-final)

Start with this
(k1, 1)
(k1, 1)
(k1, 2)
(k1, 1)
(k1, 5)
(k2, 1)
(k2, 2)
Use RDD.reduceByKey(add) to get this:
(k1, 10)
(k2, 3)

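Key concept: reduceByKey(combineByKey)
[Diagram: the input pairs (k1, 1), (k1, 1), (k1, 2), (k1, 1), (k1, 5) are spread over several partitions.
combineLocally reduces each partition to a dictionary: {k1: 2, …}, {k1: 3, …}, {k1: 5, …}.
The combined pairs (k1, 2), (k1, 3), (k1, 5) are shuffled and _mergeCombiners merges them into {k1: 10, …}.
reduceByKey(numPartitions) sets the number of output partitions.]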
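The two steps in the diagram can be sketched in plain Python, without Spark. This is only an illustration of the idea. The function names mirror the labels on the slide but are not PySpark's internal code, and the split of the records over three partitions is an assumption.

```python
from operator import add

def combine_locally(partition, func):
    """Map-side step: reduce the values of each key within one partition."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

def merge_combiners(dicts, func):
    """Reduce-side step: merge the per-partition dictionaries."""
    merged = {}
    for d in dicts:
        for key, value in d.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Hypothetical split of the example records over three partitions.
partitions = [
    [("k1", 1), ("k1", 1)],
    [("k1", 2), ("k1", 1), ("k2", 1)],
    [("k1", 5), ("k2", 2)],
]

local = [combine_locally(p, add) for p in partitions]
result = merge_combiners(local, add)
print(local)
print(sorted(result.items()))
```

Only one small dictionary per partition crosses the network, which is why reduceByKey() is cheaper than groupByKey() for this kind of aggregation.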

PySpark Memory: worked example

10 x r3.4xlarge (122G, 16 cores)
Use half for each executor: 60GB
Number of cores = 120
Cache = 60% x 60GB x 10 = 360GB
Each java thread: 40% x 60GB / 12 = ~2GB
Each python process: ~4GB
OS: ~12GB
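The arithmetic on this slide can be checked with a few lines of Python. The figures come from the slide. Two things are assumptions: that 12 cores are used on each node (120 cores over 10 nodes), and that one python process runs per used core.

```python
nodes = 10
node_memory_gb = 122            # r3.4xlarge
cores_used_per_node = 12        # assumption: 120 cores / 10 nodes
executor_memory_gb = 60         # roughly half of each node
cache_percent = 60              # share of the executor heap used for caching
python_process_gb = 4
os_gb = 12

total_cores = nodes * cores_used_per_node
cache_total_gb = cache_percent * executor_memory_gb * nodes / 100
per_thread_gb = (100 - cache_percent) * executor_memory_gb / 100 / cores_used_per_node

# Assumption: one python process per used core on each node.
python_total_gb = cores_used_per_node * python_process_gb
node_total_gb = executor_memory_gb + python_total_gb + os_gb

print(total_cores, cache_total_gb, per_thread_gb, node_total_gb)
```

Under these assumptions each node needs about 120 GB, which just fits in the 122 GB available.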

spark.executor.memory=60g
spark.cores.max=120
spark.driver.memory=60g
~/spark/bin/pyspark --driver-memory 60g

PySpark: other memory configuration
spark.akka.frameSize=1000
spark.kryoserializer.buffer.max.mb=10
(spark.python.worker.memory)

PySpark: other configuration
spark.shuffle.consolidateFiles=true
spark.rdd.compress=true
spark.speculation=true
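One way to pass settings like these is as --conf flags on the spark-submit command line. The sketch below only builds the command as a list of strings; it does not run Spark. The script name my_job.py is hypothetical.

```python
settings = {
    "spark.shuffle.consolidateFiles": "true",
    "spark.rdd.compress": "true",
    "spark.speculation": "true",
}

def to_conf_flags(settings):
    """Turn a dict of Spark properties into a list of --conf key=value flags."""
    flags = []
    for key in sorted(settings):
        flags += ["--conf", "{}={}".format(key, settings[key])]
    return flags

flags = to_conf_flags(settings)
command = ["spark-submit"] + flags + ["my_job.py"]  # my_job.py is a hypothetical script
print(" ".join(command))
```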

Errors
java.net.SocketException: Connection reset
java.net.SocketTimeoutException: Read timed out
Lost executor, cancelled key exceptions, etc
All of the above are caused by memory errors!

Errors
'ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue':
happens when filter() keeps little data from many partitions; use coalesce()
collect() fails: increase driver memory and akka frameSize

Were our assumptions correct?
We have a very fast development process.
Use spark for development and for scale-up.
Scale-able data science development.

Large-scale machine learning
with Spark and Python
Empowering the data scientist
by Maria Mestre

ML @ Skimlinks
● Mostly for research and prototyping
● No developer background
● Familiar with scikit-learn and Spark
● Building a data scientist toolbox

Every ML system…
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier

Data collection: scraping lots of pages
This is how I would do it on my local machine…
● use of Scrapy package
● write a function scrape() that creates a Scrapy object
urls = open('list_urls.txt', 'r').readlines()
output = s3_bucket + 'results.json'
scrape(urls, output)

Distributing over the cluster
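def distributed_scrape(urls, index, s3_bucket):
    output = s3_bucket + 'part' + str(index) + '.json'
    scrape(urls, output)
    return [output]  # mapPartitionsWithIndex() needs an iterable back

urls = open('list_urls.txt', 'r').readlines()
urls = sc.parallelize(urls, 100)
# transformations are lazy: collect() is what triggers the scraping
outputs = urls.mapPartitionsWithIndex(
    lambda index, urls: distributed_scrape(urls, index, s3_bucket)).collect()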
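The same pattern can be tried without a cluster. In this sketch scrape() is a hypothetical stub, the bucket name is made up, and the list is cut into contiguous slices by hand; Spark's own slicing in parallelize() may differ.

```python
def scrape(urls, output):
    """Hypothetical stand-in for the Scrapy-based scrape() from the slides."""
    return {"output": output, "n_urls": len(urls)}

def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous slices, as evenly as possible."""
    n = len(items)
    return [items[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

def distributed_scrape(urls, index, s3_bucket):
    # One output file per partition, named after the partition index.
    output = s3_bucket + "part" + str(index) + ".json"
    return [scrape(list(urls), output)]

urls = ["http://example.com/page%d" % i for i in range(10)]
s3_bucket = "s3://my-bucket/"  # hypothetical bucket
results = []
for index, part in enumerate(split_into_partitions(urls, 3)):
    results.extend(distributed_scrape(part, index, s3_bucket))

for r in results:
    print(r)
```

Each partition writes to its own file, so the workers never compete for the same output.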

Installing scrapy over the cluster
1/ need to use Python 2.7
echo 'export PYSPARK_PYTHON=python2.7' >> ~/spark/conf/spark-env.sh
2/ use pssh to install packages in the slaves
pssh -h /root/spark-ec2/slaves 'easy_install-2.7 Scrapy'

Example: filtering
● we want to find the activity of 30M users in 2
months of activity: 2 GB vs 6 TB
○ map-side join using broadcast() ⇒ does not work with
large objects!
■ e.g. input.filter(lambda x: x['user'] in user_list_b.value)
○ use of mapPartitions()
■ e.g. input.mapPartitions(lambda x: read_file_and_filter(x))

[Diagram: bloom filter join. 6 TB, ~11B input → 35 mins → 113 GB, 529M matches → 9 mins → 60 GB, 515M matches]
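A bloom filter join can be sketched in plain Python as below. This is an illustration and not the code used in the talk. The filter size, the number of hashes and the MD5-based hashing are all assumptions, and the user ids are invented.

```python
import hashlib

class BloomFilter(object):
    """Compact set membership test: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            data = ("%d:%s" % (seed, item)).encode("utf-8")
            yield int(hashlib.md5(data).hexdigest(), 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

wanted_users = {"u1", "u7", "u42"}
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for user in wanted_users:
    bloom.add(user)

events = [{"user": "u%d" % i, "event": "click"} for i in range(100)]
# Step 1: cheap filter against the small bloom filter (may keep false positives).
candidates = [e for e in events if e["user"] in bloom]
# Step 2: exact check on the much smaller candidate set.
matches = [e for e in candidates if e["user"] in wanted_users]
print(len(events), len(candidates), len(matches))
```

The filter is far smaller than the user list itself, so it is much cheaper to ship to every worker. The false positives it lets through are removed in the exact second pass.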

Example: segmenting urls
● we want to convert a url 'www.iloveshoes.com'
to ['i', 'love', 'shoes']
● Segmentation
○ wordsegment package in python ⇒ very slow!
○ 300M urls take 10 hours with 120 cores!
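To show what segmentation involves, here is a minimal dictionary-based sketch using dynamic programming. It is not the wordsegment package, which scores candidate splits with word frequencies. The toy vocabulary and the helper names are assumptions.

```python
def segment(text, vocabulary):
    """Split text into vocabulary words using the fewest words; None if impossible."""
    best = [None] * (len(text) + 1)  # best[i] holds a segmentation of text[:i]
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]

def url_to_words(url, vocabulary):
    """Strip a leading 'www.' and the domain suffix, then segment the name."""
    host = url.lower()
    if host.startswith("www."):
        host = host[4:]
    name = host.split(".")[0]
    return segment(name, vocabulary)

vocabulary = {"i", "love", "shoes", "shoe", "lo", "hoes"}  # toy vocabulary
print(url_to_words("www.iloveshoes.com", vocabulary))
```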

Example: getting a representative sample

Our solution in Spark!
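sample = sc.parallelize([], 1)
sample_size = 1000
input.cache()
for category, proportion in stats.items():
    category_pages = input.filter(lambda x: x['category'] == category)
    # takeSample() returns a list, so turn it back into an RDD before union()
    category_sample = category_pages.takeSample(False, int(sample_size * proportion))
    sample = sample.union(sc.parallelize(category_sample))
MLlib offers a probabilistic solution (not exact sample size):
# fractions: dict of category -> sampling fraction (name chosen here for illustration)
sample = input.keyBy(lambda x: x['category']).sampleByKey(False, fractions)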
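The exact-size version of the idea can be tried locally with the standard library. This is a sketch under assumptions: the records, the category names and the proportions are invented, and random.sample stands in for takeSample().

```python
import random

def stratified_sample(records, proportions, sample_size, seed=0):
    """Draw an exact number of records per category, without replacement."""
    rng = random.Random(seed)
    sample = []
    for category, proportion in sorted(proportions.items()):
        members = [r for r in records if r["category"] == category]
        k = min(len(members), int(round(sample_size * proportion)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 700 fashion pages and 300 travel pages.
records = ([{"category": "fashion", "id": i} for i in range(700)] +
           [{"category": "travel", "id": i} for i in range(300)])
stats = {"fashion": 0.7, "travel": 0.3}

sample = stratified_sample(records, stats, sample_size=100)
counts = {}
for r in sample:
    counts[r["category"]] = counts.get(r["category"], 0) + 1
print(counts)
```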

Grid search for hyperparameters
Problem: we have some candidate values [λ_1, λ_2, ..., λ_10000] for a hyperparameter λ;
which one should we choose?
If the data is small enough that processing time is fine
➢ Do it on a single machine
If the data is too large to process on a single machine
➢ Use MLlib
If the data can be processed on a single machine but takes too long to train
➢ The next slide!

number of combinations = {parameters} = 2

Using cross-validation to optimise a hyperparameter
1. separate the data into k equally-sized chunks
2. for each candidate value i
   a. use (k-1) chunks to fit the classifier parameters
   b. use the remaining chunk to get a classification score
   c. repeat for each of the k chunks and report the average score
3. At the end, select the candidate value that achieves the best average score

number of combinations = {parameters} x {folds} = 4
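The count above is the number of independent (parameter, fold) tasks. The sketch below lists them and scores each one. The model is a toy (a mean shrunk by the hyperparameter) standing in for a real classifier, and the data and candidate values are invented.

```python
from itertools import product

def split_folds(data, k):
    """Split data into k equally sized chunks (len(data) must be divisible by k)."""
    size = len(data) // k
    return [data[i * size:(i + 1) * size] for i in range(k)]

def fit(train, lam):
    """Toy model: a mean estimate shrunk towards zero by the hyperparameter lam."""
    return sum(train) / (len(train) + lam)

def score(model, test):
    """Negative mean squared error, so that higher is better."""
    return -sum((x - model) ** 2 for x in test) / len(test)

def evaluate(task, folds):
    """Fit on all folds except the held-out one, then score on the held-out fold."""
    lam, held_out = task
    train = [x for i, fold in enumerate(folds) if i != held_out for x in fold]
    return lam, score(fit(train, lam), folds[held_out])

data = [1.0, 3.0, 2.0, 2.0]
candidates = [0.0, 4.0]
folds = split_folds(data, 2)

tasks = list(product(candidates, range(len(folds))))  # {parameters} x {folds}
results = [evaluate(task, folds) for task in tasks]   # each task is independent

average = {}
for lam in candidates:
    scores = [s for l, s in results if l == lam]
    average[lam] = sum(scores) / len(scores)
best = max(average, key=average.get)
print(len(tasks), average, best)
```

Because the tasks do not depend on each other, a list like tasks is the kind of thing that could be handed to sc.parallelize() and evaluated with map(). That is an assumption about how the work would be distributed.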

Apply the classifier over the new_data: easy!
With scikit-learn:
classifier_b = sc.broadcast(classifier)
new_labels = new_data.map(lambda x: classifier_b.value.predict(x))
With scikit-learn but cannot broadcast:
save classifier models to files, ship to s3
use mapPartitions to read model parameters and classify
With MLlib:
(model._threshold = None)
new_labels = new_data.map(lambda x: model.predict(x))
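The "cannot broadcast" route can be sketched as below: load the model parameters once per partition, then classify every row in it. The threshold model, the JSON file and the local temporary directory (standing in for s3) are all assumptions.

```python
import json
import os
import tempfile

def predict(params, x):
    """Toy stand-in for a trained classifier: a simple threshold rule."""
    return 1 if x > params["threshold"] else 0

def classify_partition(rows, model_path):
    """Read the model parameters once, then classify every row of the partition."""
    with open(model_path) as f:
        params = json.load(f)
    for x in rows:
        yield predict(params, x)

# Save the parameters to a file (a local temp file stands in for s3 here).
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(model_path, "w") as f:
    json.dump({"threshold": 0.5}, f)

# Two hypothetical partitions of new data.
partitions = [[0.1, 0.9], [0.7, 0.2, 0.6]]
new_labels = [list(classify_partition(iter(p), model_path)) for p in partitions]
print(new_labels)
```

With Spark the per-partition function would be passed to mapPartitions(), so the file is read once per partition and not once per record.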

Apache Spark for Big
Data
Spark at Scale & Performance Tuning
Sahan Bulathwela |
Data Science Engineer @ Skimlinks |

Outline
● Spark at scale: Big Data Example
● Tuning and Performance

Spark at Scale: Big Data Example
● Yes, we use Spark !!
● Not just for prototyping or one-off analyses
● Run automated analyses at a large scale on
a daily basis
● Use-case: Generating audience statistics for
our customers

Before…
● We provide data products based on
audience statistics to customers
● Extract event data from Datastore
● Generate Audience statistics and reports

Data
● Skimlinks records web data in terms of user
events such as clicks, impressions, etc.
● Our Data!!
○ Records 18M clicks (11 GB)
○ Records 203M impressions (950 GB)
○ These are daily figures (Oct 01, 2014)
● About 1TB of relevant events

A few days and data scientists
later...
Statistics

Major pain points
● Most of the data is not relevant
○ Only 3-4 out of 30ish fields are
useful for each report
● Many duplicate steps
○ Reading the data
○ Extracting relevant fields
○ Transformations such as classifying
events

Aggregation doing its magic
● Mostly grouping events and summarizing
● Distribute the workload in time
● “Reduce by” instead of “Group by”
● BOTS
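The idea of spreading the work over time can be sketched with counters: build a small profile per day that keeps only the fields needed, then merge the daily profiles into a monthly one. The field names and events below are invented for illustration.

```python
from collections import Counter

def build_daily_profile(events):
    """Keep only the fields needed and count events per (user, type)."""
    profile = Counter()
    for event in events:
        profile[(event["user"], event["type"])] += 1
    return profile

def build_monthly_profile(daily_profiles):
    """Merge the small daily profiles; the raw events are never re-read."""
    monthly = Counter()
    for daily in daily_profiles:
        monthly.update(daily)
    return monthly

# Hypothetical raw events with fields that the reports never use.
day1 = [
    {"user": "u1", "type": "click", "url": "a.com", "agent": "x"},
    {"user": "u1", "type": "impression", "url": "a.com", "agent": "x"},
    {"user": "u2", "type": "impression", "url": "b.com", "agent": "y"},
]
day2 = [
    {"user": "u1", "type": "click", "url": "c.com", "agent": "x"},
    {"user": "u2", "type": "impression", "url": "b.com", "agent": "y"},
    {"user": "u2", "type": "impression", "url": "d.com", "agent": "y"},
]

daily_profiles = [build_daily_profile(day) for day in (day1, day2)]
monthly = build_monthly_profile(daily_profiles)
print(sorted(monthly.items()))
```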

Deep Dive
[Pipeline diagram: Datastore → Events (1 TB) → Build daily profiles → Daily Profiles, 1.8 GB
(intermediate data structure, compressed in GZIP) → Build monthly profiles → Monthly Aggregate, 40 GB
→ Generate audience statistics → Statistics, 7 GB → Customers]
● Takes 4 hours
● 150 Statistics
● Delivered daily to clients

SO WHAT???
                                        Before         After
Computing daily event summary           1+ DAYS !!!    20 mins
Computing monthly aggregate             -              40 mins
Storing daily event summary             100's of GBs   1.8 GB
Storing monthly aggregate               -              40 GB
Total time taken for generating stats   1+ DAYS !!!    3 hrs 30 mins
Time taken per report                   1+ DAYS !!!    1.4 mins

Parquet enabled us to
reduce our storage
costs by 86% and
increase data loading
speed by 5x

Performance when parsing 31 daily
profiles
