Martin Goodson describes his experience with Spark over three phases. In Phase I, he worked with various data processing tools like R, Python, Pig and Spark. In Phase II, he focused on Pig and Python UDFs. In Phase III, he plans to explore PySpark. He also discusses Skimlinks' data volume of 30TB per month, their data science team, and some realities of working with Spark including configuration challenges and common errors.
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database | Jimmy Angelakos
Presentation of an investigation into how Python's RDFLib and SQLAlchemy can be used to leverage PostgreSQL's capabilities to provide a persistent storage back-end for Graphs, and become the elusive practical RDF triple store for the Semantic Web (or simply help you export your data to someone who's expecting RDF)!
Talk presented at FOSDEM 2017 in Brussels on 04-05/02/2017. Practical & hands-on presentation with example code which is certainly not optimal ;)
Video:
MP4: http://video.fosdem.org/2017/H.1309/postgresql_semantic_web.mp4
WebM/VP8: http://ftp.osuosl.org/pub/fosdem/2017/H.1309/postgresql_semantic_web.vp8.webm
Apache Spark Toronto Meetup, July 27, 2016.
Wattpad talks about their experiences with Apache Spark: from starting in 2014 with Shark, to building distributed recommendation algorithms using ANN, to improving search results using a sessionized query log. We also talk about some of the issues we faced building our analytics pipeline, including getting Spark to work with Luigi, an open source project by Spotify.
Introduction To Elastic MapReduce at WHUG | Adam Kawa
Elastic MapReduce presentation given at the 2nd meeting of the Warsaw Hadoop User Group.
Watch also the demonstration at www.youtube.com/watch?v=Azwilbn8GCs (it shows how to create a Hadoop cluster on Amazon Elastic MapReduce with Karmasphere Studio for EMR, a plugin for Eclipse, to launch big calculations quickly and easily).
Streaming ML on Spark: Deprecated, experimental and internal APIs galore! | Holden Karau
Slides from: https://www.meetup.com/Sydney-Apache-Spark-User-Group/events/246892684/
Welcome to the first Sydney Spark Meetup in 2018!
We are very glad to have a visiting Apache Spark committer, Holden Karau, to give a talk on streaming machine learning. Title: Streaming ML w/Spark (and why it's a bit painful today & #workingonit)
Apache Spark is one of the most popular distributed systems, and it has built in libraries for both machine learning and streaming. This talk will cover Spark's two streaming libraries, look at the future, and how to make streaming ML work today (for both serving and prediction). If you aren't familiar with Spark, that's ok! We'll spend the first ~5 minutes covering just enough to get through the rest of the talk, and for those of you already familiar you can spend those ~5 minutes downloading the sample code :)
About Holden:
Holden is a transgender Canadian open source developer advocate @ Google with a focus on Apache Spark, BEAM, and related "big data" tools. She is the co-author of Learning Spark, High Performance Spark, and another Spark book that's a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal.
A couple of us will be at the doors of 60 Margaret St to let people in until 6.10pm.
Debugging PySpark: Spark Summit East talk by Holden Karau | Spark Summit
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they behave in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, and will also look ahead to the data property type accumulators which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
So you want to get started with Hadoop, but how? This session will show you how to get started with Hadoop development using Pig. Prior Hadoop experience is not needed.
Thursday, May 8th, 02:00pm-02:50pm
Bucket your partitions wisely - Cassandra summit 2016 | Markus Höfer
When we talk about bucketing, we essentially talk about ways to split Cassandra partitions into several smaller parts, rather than having only one large partition.
Bucketing of Cassandra partitions can be crucial for optimizing queries, preventing large partitions, or fighting the TombstoneOverwhelmingException, which can occur when creating too many tombstones.
In this talk I want to show how to recognize large partitions during data modeling. I will also show different strategies we used in our projects to create, use and maintain buckets for our partitions.
Beyond shuffling - Scala Days Berlin 2016 | Holden Karau
This session will cover our & community experiences scaling Spark jobs to large datasets and the resulting best practices along with code snippets to illustrate.
The planned topics are:
Using Spark counters for performance investigation
Spark collects a large number of statistics about our code, but how often do we really look at them? We will cover how to investigate performance issues and figure out where to best spend our time using both counters and the UI.
Working with Key/Value Data
Replacing groupByKey for awesomeness
groupByKey makes it too easy to accidentally collect individual records which are too large to process. We will talk about how to replace it in different common cases with more memory-efficient operations.
Effective caching & checkpointing
Being able to reuse previously computed RDDs without recomputing can substantially reduce execution time. Choosing when to cache, checkpoint, or what storage level to use can have a huge performance impact.
Considerations for noisy clusters
Functional transformations with Spark Datasets
How to have some of the benefits of Spark’s DataFrames while still having the ability to work with arbitrary Scala code
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines | MongoDB
Presented by Eoin Brazil, Proactive Technical Services Engineer, MongoDB
Experience level: Advanced
MongoDB offers a flexible, scalable, and easy way to store your large data set. Python provides many useful data science tools (e.g. NumPy, SciPy, Scikit-learn, etc.). This talk will discuss the concerns for creating operational data analytic pipelines, introduce Monary as an alternative for loading data into NumPy, and give examples of accessing data with Monary, as well as how to build scalable data analysis pipelines using these open source tools.
Streaming machine learning is being integrated in Spark 2.1+, but you don’t need to wait. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark’s new Structured Streaming and walk you through creating your own streaming model. By the end of this session, you’ll have a better understanding of Spark’s Structured Streaming API as well as how machine learning works in Spark.
Optimizing spark based data pipelines - are you up for it? | Etti Gur
Etti Gur from Israel, Senior Big Data Engineer @ Nielsen, will talk about Optimizing spark-based data pipelines - are you up for it?
At Nielsen, we ingest billions of events per day into our big data stores and we need to do it in a scalable yet cost-efficient manner. In this talk, we will discuss how we significantly optimized our Spark-based in-flight analytics daily pipeline, reducing its total execution time from over 20 hours down to 1 hour, resulting in a huge cost reduction.
Topics include:
* Ways to identify Spark optimization opportunities;
* Optimizing Spark resource allocation;
* Parallelizing Spark output phase with dynamic partition inserts;
* Running multiple Spark 'jobs' in parallel within a single Spark application;
In Spark SQL the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey understanding and familiarity of query plans in Spark SQL, and use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain some relevant information that can be useful in order to understand some details about the execution. If you understand the query plan, you can look for the weak spot and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution.
The main content of this talk is based on Spark source code but it will reflect some real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated focusing mainly on the phase of physical planning. In general, in this talk we want to share what we have learned from both Spark source code and real-life queries that we run in our daily data processing.
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK | zmhassan
As spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Intro to Apache Kafka I gave at the Big Data Meetup in Geneva in June 2016. Covers the basics and gets into some more advanced topics. Includes demo and source code to write clients and unit tests in Java (GitHub repo on the last slides).
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... | Databricks
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized spark environment.
In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Headaches and Breakthroughs in Building Continuous Applications | Databricks
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to use Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap... | Landon Robinson
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Presented by Landon Robinson and Jack Chapa
Parquet performance tuning: the missing guide | Ryan Blue
Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet’s features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he’s learned, creating the missing guide you need.
Topics include:
* The tools and techniques Netflix uses to analyze Parquet tables
* How to spot common problems
* Recommendations for Parquet configuration settings to get the best performance out of your processing platform
* The impact of this work in speeding up applications like Netflix’s telemetry service and A/B testing platform
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses, how they set up & keep schemas in sync between Hive, Presto, Redshift & Spark and make access easy for their data scientists, etc. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
Managing your black friday logs - Code Europe | David Pilato
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical in Black Fridays.
Topics include:
* monitoring architectures
* optimal bulk size
* distributing the load
* index and shard size
* optimizing disk IO
Takeaway: best practices when building a monitoring system with the Elastic Stack, advanced tuning to optimize and increase event ingestion performance.
A super fast introduction to Spark and glance at BEAM | Holden Karau
Apache Spark is one of the most popular general purpose distributed systems, with built-in libraries to support everything from ML to SQL. Spark has APIs across languages including Scala, Java, Python, and R -- with more 3rd party language support (like Julia & C#). Apache BEAM is a cross-platform tool for building on top of different distributed systems, but it's in its early stages. This talk will introduce the core concepts of Apache Spark, and look to the potential future of Apache BEAM.
Apache Spark has two core abstractions for representing distributed data and computations. This talk will introduce the basics of RDDs and Spark DataFrames & Datasets, and Spark’s method for achieving resiliency. Since it’s a big data talk, we will include the almost required wordcount example, and end the Spark part with follow-up pointers on Spark’s new ML APIs. For folks who are interested, we’ll then talk a bit about portability, and how Apache BEAM aims to improve portability (as well as its unique approach to cross-language support).
Slides from Holden's talk at https://www.meetup.com/Wellington-Data-Scaling-Chats/events/mdcsdpyxcbxb/
Spark Gotchas and Lessons Learned (2/20/20) | Jen Waller
Presentation from the Boulder/Denver Big Data Meetup on 2/20/2020 in Boulder, CO. Topics covered: Troubleshooting Spark jobs (groupby, shuffle) for big data, tuning AWS EMR Spark clusters, EMR cluster resource utilization, writing scalable Scala for scanning S3 metadata.
Databricks: What We Have Learned by Eating Our Dog Food | Databricks
Databricks Unified Analytics Platform (UAP) is a cloud-based service for running all analytics in one place - from highly reliable and performant data pipelines to state-of-the-art Machine Learning. From the original creators of Apache Spark and MLflow, it provides data science and engineering teams ready to use pre-packaged clusters with optimized Apache Spark and various ML frameworks coupled with powerful collaboration capabilities to improve productivity across the ML lifecycle. Yada yada yada... But in addition to being a vendor, Databricks is also a user of UAP.
So, what have we learned by eating our own dog food? Attend a “from the trenches” report from Suraj Acharya, Director of Engineering responsible for Databricks’ in-house data engineering team, on how his team put Databricks technology to use, the lessons they have learned along the way, and best practices for using Databricks for data engineering.
Managing your Black Friday Logs NDC Oslo | David Pilato
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical in Black Fridays.
Topics include:
* monitoring architectures
* optimal bulk size
* distributing the load
* index and shard size
* optimizing disk IO
Takeaway: best practices when building a monitoring system with the Elastic Stack, advanced tuning to optimize and increase event ingestion performance.
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018 | Codemotion
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical in Black Fridays.
9. Reality
Learning in depth how spark works
Try to divide and conquer
Learning how to configure spark properly
10. Learning in depth how spark works
Read all this:
https://spark.apache.org/docs/1.2.1/programming-guide.html
https://spark.apache.org/docs/1.2.1/configuration.html
https://spark.apache.org/docs/1.2.1/cluster-overview.html
And then:
https://www.youtube.com/watch?v=49Hr5xZyTEA (spark internals)
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py
11. Try to divide and conquer
Don't throw 30TB of data at a Spark script and
expect it to just work.
Divide the work into bite sized chunks -
aggregating and projecting as you go.
12. Try to divide and conquer
Use reduceByKey() not groupByKey()
Use max() and add()
(cf. http://www.slideshare.net/samthemonad/spark-meetup-talk-final)
13. Start with this
(k1, 1)
(k1, 1)
(k1, 2)
(k1, 1)
(k1, 5)
(k2, 1)
(k2, 2)
Use RDD.reduceByKey(add) to get this:
(k1, 10)
(k2, 3)
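A minimal local sketch of what the reduceByKey(add) step on this slide computes, written in plain Python so it runs without a cluster. The helper `reduce_by_key` is a hypothetical stand-in and not part of Spark; on a real RDD the corresponding call would be `rdd.reduceByKey(add)`. Passing `max` instead of `add` illustrates the other reducer the previous slide mentions.

```python
from operator import add


def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for RDD.reduceByKey.

    Values are folded into one running result per key, so no per-key list
    of values is ever materialised (which is what groupByKey would do).
    """
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# The (key, value) pairs from the slide.
pairs = [
    ("k1", 1), ("k1", 1), ("k1", 2), ("k1", 1), ("k1", 5),
    ("k2", 1), ("k2", 2),
]

sums = reduce_by_key(pairs, add)    # per-key totals
maxima = reduce_by_key(pairs, max)  # per-key maxima

print(sorted(sums.items()))    # [('k1', 10), ('k2', 3)]
print(sorted(maxima.items()))  # [('k1', 5), ('k2', 2)]
```

Because the reducer only ever sees two values at a time, the same function can be applied on each partition before the shuffle, which is why this pattern is lighter on memory than grouping first.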
17. PySpark Memory: worked example
10 x r3.4xlarge (122G, 16 cores)
Use half for each executor: 60GB
Number of cores = 120
Cache = 60% x 60GB x 10 = 360GB
Each java thread: 40% x 60GB / 12 = ~2GB
Each python process: ~4GB
OS: ~12GB
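The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the figure of 12 cores used per node is inferred from 120 cores over 10 nodes, and the assumption of one Python worker process per core in the per-node total is mine, not stated on the slide.

```python
# Figures taken from the slide.
NODES = 10
NODE_RAM_GB = 122        # r3.4xlarge
EXECUTOR_GB = 60         # roughly half of each node
TOTAL_CORES = 120
CACHE_PERCENT = 60       # share of executor memory used for caching
PYTHON_WORKER_GB = 4
OS_GB = 12

# Inferred: cores actually used on each node.
cores_per_node = TOTAL_CORES // NODES

# Cluster-wide cache: 60% x 60GB x 10 nodes.
cache_gb = CACHE_PERCENT * EXECUTOR_GB * NODES // 100

# Remaining 40% of the executor heap, shared by the JVM threads on a node.
java_thread_gb = (100 - CACHE_PERCENT) * EXECUTOR_GB / 100 / cores_per_node

# Assumption: one Python worker per core, alongside the executor and the OS.
per_node_gb = EXECUTOR_GB + cores_per_node * PYTHON_WORKER_GB + OS_GB

print(cores_per_node, cache_gb, java_thread_gb, per_node_gb)
```

Under these assumptions the per-node total comes to 120GB, which fits within the 122GB of an r3.4xlarge.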
18. PySpark Memory: worked example
spark.executor.memory=60g
spark.cores.max=120
spark.driver.memory=60g
19. PySpark Memory: worked example
spark.executor.memory=60g
spark.cores.max=120
spark.driver.memory=60g
~/spark/bin/pyspark --driver-memory 60g
20. PySpark: other memory configuration
spark.akka.frameSize=1000
spark.kryoserializer.buffer.max.mb=10
(spark.python.worker.memory)
24. Errors
‘ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue’: happens when filter() leaves little data spread over many partitions; use coalesce()
collect() fails: increase driver memory and spark.akka.frameSize
25. Were our assumptions correct?
We have a very fast development process.
Use spark for development and for scale-up.
Scalable data science development.
27. ML @ Skimlinks
● Mostly for research and prototyping
● No developer background
● Familiar with scikit-learn and Spark
● Building a data scientist toolbox
28. Every ML system….
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier
29. Data collection: scraping lots of pages
This is how I would do it on my local machine…
● use of Scrapy package
● write a function scrape() that creates a Scrapy object
urls = open('list_urls.txt', 'r').readlines()
output = s3_bucket + 'results.json'
scrape(urls, output)
31. Installing scrapy over the cluster
1/ need to use Python 2.7
echo 'export PYSPARK_PYTHON=python2.7' >> ~/spark/conf/spark-env.sh
2/ use pssh to install packages on the slaves
pssh -h /root/spark-ec2/slaves 'easy_install-2.7 Scrapy'
32. Every ML system….
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier
33. Example: filtering
● we want to find the activity of 30M users in 2 months of data: 2 GB vs 6 TB
○ map-side join using broadcast() ⇒ does not work with large objects!
■ e.g. input.filter(lambda x: x['user'] in user_list_b.value)
○ use of mapPartitions()
■ e.g. input.mapPartitions(lambda x: read_file_and_filter(x))
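A minimal sketch of the mapPartitions() approach, written as plain Python so it runs without a cluster. The `filter_partition` and `load_user_ids` names are hypothetical stand-ins for the talk's read_file_and_filter(); the idea is that the user list is loaded once per partition rather than once per record.

```python
def filter_partition(records, load_user_ids):
    """Yield only the records whose 'user' is in the wanted set.

    load_user_ids is called once per partition, not once per record.
    """
    wanted = set(load_user_ids())
    for record in records:
        if record["user"] in wanted:
            yield record


def load_user_ids():
    # Hypothetical stand-in for reading the user list from a file or S3.
    return ["u1", "u3"]


partition = [{"user": "u1"}, {"user": "u2"}, {"user": "u3"}]
matches = list(filter_partition(partition, load_user_ids))
print(matches)

# With PySpark this would be used roughly as:
# filtered = input.mapPartitions(lambda part: filter_partition(part, load_user_ids))
```

Because the function is a generator, each partition is streamed through without being held in memory as a whole.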
34. Bloom filter join
6 TB, ~11B input → 35 mins → 113 GB, 529M matches → 9 mins → 60 GB, 515M matches
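To illustrate the idea behind a Bloom filter join, here is a small self-contained sketch: a compact Bloom filter built from the small side of the join pre-filters the large side, and an exact join then removes the false positives. The `BloomFilter` class, its sizes and the sample data are assumptions for illustration, not the implementation used in the talk.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


users = {"u1", "u3"}  # the small side of the join
events = [("u1", "click"), ("u2", "click"), ("u3", "view"), ("u4", "view")]

bloom = BloomFilter()
for user in users:
    bloom.add(user)

# Stage 1: cheap pre-filter; some non-members may slip through.
candidates = [event for event in events if event[0] in bloom]
# Stage 2: exact join on the much smaller candidate set.
joined = [event for event in candidates if event[0] in users]
print(joined)
```

The filter is far smaller than the full user list, which is what makes it practical to ship to every worker when the list itself is too large to broadcast.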
35. Example: segmenting urls
● we want to convert a URL 'www.iloveshoes.com' to ['i', 'love', 'shoes']
● Segmentation
○ wordsegment package in python ⇒ very slow!
○ 300M urls take 10 hours with 120 cores!
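For readers unfamiliar with the task, the sketch below shows a very simple dictionary-based segmenter using dynamic programming. It is not how the wordsegment package works internally (that package scores candidate words using corpus statistics); the tiny `VOCAB` set and both helper functions are assumptions for illustration.

```python
# Hypothetical vocabulary; a real segmenter scores words with corpus frequencies.
VOCAB = {"i", "love", "shoes", "shoe", "lo", "ve"}


def segment(text, vocab=VOCAB):
    """Split text into the fewest vocabulary words, or return None if impossible."""
    best = [None] * (len(text) + 1)  # best[i] = best split of text[:i]
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1]


def domain_label(url):
    """Strip a leading 'www.' and the final suffix: 'www.iloveshoes.com' -> 'iloveshoes'."""
    host = url[4:] if url.startswith("www.") else url
    return host.rsplit(".", 1)[0]


print(segment(domain_label("www.iloveshoes.com")))  # ['i', 'love', 'shoes']
```

Each URL is segmented independently of the others, so the work parallelises naturally across cores.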
38. Every ML system….
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier
39. Grid search for hyperparameters
Problem: we have some candidate values [1, 2, ..., 10000] for a hyperparameter. Which one should we choose?
If the data is small enough that processing time is fine
➢ Do it on a single machine
If the data is too large to process on a single machine
➢ Use MLlib
If the data can be processed on a single machine but takes too long to train
➢ The next slide!
44. Using cross-validation to optimise a hyperparameter
1. separate the data into k equally sized chunks
2. for each candidate value i
a. use (k-1) chunks to fit the classifier parameters
b. use the remaining chunk to get a classification score
c. repeat a and b so that each chunk is held out once, and report the average score
3. at the end, select the candidate value that achieves the best average score
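The procedure above can be sketched in plain Python as follows. The toy "model" (a shrunken mean whose shrinkage `lam` plays the role of the hyperparameter), the scoring function and the data are all assumptions chosen to keep the example self-contained; in practice `fit` and `score` would wrap a real classifier.

```python
def k_fold_chunks(data, k):
    """Split data into k chunks of (nearly) equal size."""
    return [data[i::k] for i in range(k)]


def cross_val_score(data, k, fit, score, value):
    """Average held-out score for one candidate hyperparameter value."""
    chunks = k_fold_chunks(data, k)
    scores = []
    for i, held_out in enumerate(chunks):
        training = [x for j, chunk in enumerate(chunks) if j != i for x in chunk]
        model = fit(training, value)
        scores.append(score(model, held_out))
    return sum(scores) / k


# Toy model (an assumption for illustration): a shrunken mean, where the
# hyperparameter `lam` pulls the estimate towards zero.
def fit(training, lam):
    return sum(training) / (len(training) + lam)


def score(model, held_out):
    # Negative mean squared error, so that higher is better.
    return -sum((y - model) ** 2 for y in held_out) / len(held_out)


data = [9.0, 10.0, 11.0, 10.0, 9.0, 11.0]
candidates = [0, 1, 10, 100]
averages = {lam: cross_val_score(data, 3, fit, score, lam) for lam in candidates}
best = max(averages, key=averages.get)
print(best, averages[best])
```

Each candidate value is scored independently of the others, which is why a search of this kind can be spread over many workers.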
46. Every ML system….
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier
47. Apply the classifier over the new_data: easy!
With scikit-learn:
classifier_b = sc.broadcast(classifier)
new_labels = new_data.map(lambda x: classifier_b.value.predict(x))
With scikit-learn but cannot broadcast:
save classifier models to files, ship to s3
use mapPartitions to read model parameters and classify
With MLlib:
(model._threshold = None)
new_labels = new_data.map(lambda x: model.predict(x))
51. Spark at Scale: Big Data Example
● Yes, we use Spark!!
● Not just for prototyping or one-time analyses
● Run automated analyses at a large scale on a daily basis
● Use-case: Generating audience statistics for
our customers
52. Before…
● We provide data products based on
audience statistics to customers
● Extract event data from Datastore
● Generate Audience statistics and reports
53. Data
● Skimlinks records web data in terms of user events such as clicks, impressions, etc.
● Our Data!!
○ Records 18M clicks (11 GB)
○ Records 203M impressions (950 GB)
○ These numbers are on a daily basis (Oct 01, 2014)
● About 1TB of relevant events
54. A few days and data scientists later...
Statistics
55. Major pain points
● Most of the data is not relevant
○ Only 3-4 out of 30ish fields are
useful for each report
● Many duplicate steps
○ Reading the data
○ Extracting relevant fields
○ Transformations such as classifying
events
63. SO WHAT??? (Before → After)
Computing daily event summary: 1+ DAYS !!! → 20 mins
Computing monthly aggregate: 40 mins
Storing daily event summary: 100’s of GBs → 1.8 GB
Storing monthly aggregate: 40 GB
Total time taken for generating stats: 1+ DAYS !!! → 3 hrs 30 mins
Time taken per report: 1+ DAYS !!! → 1.4 mins
64. Parquet enabled us to reduce our storage costs by 86% and increase data loading speed by 5x