Description
This talk assumes you have a basic understanding of Spark (if not, check out one of the intro videos on YouTube: http://bit.ly/hkPySpark) and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want faster jobs - this is the talk for you.
Abstract
This talk covers a number of important topics for making scalable Apache Spark programs - from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important, and more. We also include Python specific considerations, like the difference between DataFrames and traditional RDDs with Python. Looking at Spark 2.0, we examine how to mix functional transformations with relational queries for performance using the new (to PySpark) Dataset API. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Getting the best performance with PySpark - Spark Summit West 2016 - Holden Karau
This talk assumes you have a basic understanding of Spark and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want to get faster jobs – this is the talk for you. This talk covers a number of important topics for making scalable Apache Spark programs – from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Introduction to Spark Datasets - Functional and relational together at last - Holden Karau
Spark Datasets are an evolution of Spark DataFrames which allow us to work with both functional and relational transformations on big data with the speed of Spark.
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ... - Holden Karau
Beyond Shuffling - Tips & Tricks for scaling your Apache Spark programs. This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with an introduction to one of Spark's newest features: Datasets.
Testing and validating spark programs - Strata SJ 2016 - Holden Karau
Apache Spark is a fast, general engine for big data processing. As Spark jobs are used for more mission-critical tasks, it is important to have effective tools for testing and validation. Expanding her Strata NYC talk, “Effective Testing of Spark Programs,” Holden Karau details reasonable validation rules for production jobs and best practices for creating effective tests, as well as options for generating test data.
Holden explores best practices for generating complex test data, setting up performance testing, as well as basic unit testing. The validation component will focus on how to create reasonable validation rules given the constraints of Spark’s accumulators.
Unit testing of Spark programs is deceptively simple. Holden looks at how unit testing of Spark itself is accomplished and distills a number of best practices into traits we can use. This includes dealing with local mode cluster creation and tear down during test suites, factoring our functions to increase testability, mock data for RDDs, and mock data for Spark SQL. A number of interesting problems also arise when testing Spark Streaming programs, including handling of starting and stopping the streaming context, providing mock data, and collecting results, and Holden pulls out simple takeaways for dealing with these issues.
Holden also explores Spark’s internal methods for generating random data, as well as options using external libraries to generate effective test datasets (for both small- and large-scale testing). And while acceptance tests are not always thought of as part of testing, they share a number of similarities, so Holden discusses which counters Spark programs generate that we can use for creating acceptance tests, best practices for storing historic values, and some common counters we can easily use to track the success of our job, all while working within the constraints of Spark’s accumulators.
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp... - Holden Karau
Beyond Shuffling - Tips & Tricks for scaling your Apache Spark programs. This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with a preview of some of the work being done to add code generation to Spark ML.
Effective testing for spark programs scala bay preview (pre-strata ny 2015) - Holden Karau
We all know testing is important, but we often end up cutting corners because it's too much effort. Come learn how to make testing Spark programs less effort and save yourself from future production disasters when your recommendation system starts to return no results. We will explore how to quickly make tests for regular Spark programs, working with DataFrames, and special considerations for making effective unit tests for Spark Streaming. If you are super excited about the subject of testing Spark programs, make sure to also check out the corresponding Strata NY talk for even more Spark testing fun. http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/42993
Holden Karau walks attendees through a number of common mistakes that can keep your Spark programs from scaling and examines solutions and general techniques useful for moving beyond a proof of concept to production.
Topics include:
Working with key/value data
Replacing groupByKey for awesomeness
Key skew: your data probably has it and how to survive
Effective caching and checkpointing
Considerations for noisy clusters
Functional transformations with Spark Datasets: getting the benefits of Catalyst with the ease of functional development
How to make our code testable
Beyond parallelize and collect - Spark Summit East 2016 - Holden Karau
As Spark jobs are used for more mission critical tasks, beyond exploration, it is important to have effective tools for testing. This talk expands on “Effective Testing For Spark Programs” (not required to have been seen) to discuss how to create large scale test jobs without depending on collect & parallelize which limit the sizes of datasets we can work with. Testing Spark Streaming jobs can be especially challenging, as the normal techniques for loading test data don’t work and additional work must be done to collect the results and stop streaming. We will explore the difficulties with testing Streaming Programs, options for setting up integration testing, beyond just local mode, with Spark, and also examine best practices for acceptance tests.
Streaming & Scaling Spark - London Spark Meetup 2016 - Holden Karau
This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with an introduction to Datasets with Structured Streaming (new in Spark 2.0) and how to do weird things with them.
Streaming machine learning is being integrated in Spark 2.1+, but you don’t need to wait. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark’s new Structured Streaming and walk you through creating your own streaming model. By the end of this session, you’ll have a better understanding of Spark’s Structured Streaming API as well as how machine learning works in Spark.
Beyond shuffling - Scala Days Berlin 2016 - Holden Karau
This session will cover our & community experiences scaling Spark jobs to large datasets and the resulting best practices along with code snippets to illustrate.
The planned topics are:
Using Spark counters for performance investigation
Spark collects a large number of statistics about our code, but how often do we really look at them? We will cover how to investigate performance issues and figure out where to best spend our time using both counters and the UI.
Working with Key/Value Data
Replacing groupByKey for awesomeness
groupByKey makes it too easy to accidentally collect individual records which are too large to process. We will talk about how to replace it in different common cases with more memory-efficient operations.
Effective caching & checkpointing
Being able to reuse previously computed RDDs without recomputing can substantially reduce execution time. Choosing when to cache, checkpoint, or what storage level to use can have a huge performance impact.
Considerations for noisy clusters
Functional transformations with Spark Datasets
How to get some of the benefits of Spark's DataFrames while still having the ability to work with arbitrary Scala code
Improving PySpark performance: Spark Performance Beyond the JVM - Holden Karau
This talk covers a number of important topics for making scalable Apache Spark programs - from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
A fast introduction to PySpark with a quick look at Arrow based UDFs - Holden Karau
This talk will introduce Apache Spark (one of the most popular big data tools), the different built ins (from SQL to ML), and, of course, everyone's favorite wordcount example. Once we've got the nice parts out of the way, we'll talk about some of the limitations and the work being undertaken to improve those limitations. We'll also look at the cases where Spark is more like trying to hammer a screw. Since we want to finish on a happy note, we will close out with looking at the new vectorized UDFs in PySpark 2.3.
Extending spark ML for custom models now with python! - Holden Karau
Are you interested in adding your own custom algorithms to Spark ML? This is the talk for you! See the companion examples in High Performance Spark and the Sparkling ML project.
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup - Holden Karau
This talk starts with a focus on "How to not make Spark Explode" as a developer, and then shifts to look towards the future of all of the cool nifty things we will be able to do with structured streaming.
Introduction to Spark ML Pipelines Workshop - Holden Karau
Introduction to Spark ML Pipelines Workshop slides - companion Jupyter notebooks in Python & Scala are available from my GitHub at https://github.com/holdenk/spark-intro-ml-pipeline-workshop
Debugging PySpark: Spark Summit East talk by Holden Karau - Spark Summit
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark's accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark's current accumulators for debugging, as well as a look to the future for data property accumulators which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Beyond Wordcount with spark datasets (and scalaing) - Nike PDX Jan 2018 - Holden Karau
Apache Spark is one of the most popular big data systems, but once the shiny finish starts to wear off you can find yourself wondering if you've accidentally deployed a Ford Pinto into production. This talk will look at the challenges that come with scaling Spark jobs. Also, the talk will explore Spark's new(ish) Dataset/DataFrame API, as well as how it’s evolving in Spark 2.3 with improved Python support.
If you're already a Spark user, come to find out why it’s not all your fault. If you aren't already a Spark user, come to find out how to save yourself from some of the pitfalls once you move beyond the example code.
Check out Holden's newest book, High Performance Spark, for more information!
From https://niketechtalksjan2018.splashthat.com/
A super fast introduction to Spark and glance at BEAM - Holden Karau
Apache Spark is one of the most popular general purpose distributed systems, with built-in libraries to support everything from ML to SQL. Spark has APIs across languages including Scala, Java, Python, and R -- with more 3rd party language support (like Julia & C#). Apache BEAM is a cross-platform tool for building on top of different distributed systems, but it's in its early stages. This talk will introduce the core concepts of Apache Spark, and look to the potential future of Apache BEAM.
Apache Spark has two core abstractions for representing distributed data and computations. This talk will introduce the basics of RDDs and Spark DataFrames & Datasets, and Spark's method for achieving resiliency. Since it's a big data talk, we will include the almost required wordcount example, and end the Spark part with follow-up pointers on Spark's new ML APIs. For folks who are interested we'll then talk a bit about portability, and how Apache BEAM aims to improve portability (as well as its unique approach to cross-language support).
Slides from Holden's talk at https://www.meetup.com/Wellington-Data-Scaling-Chats/events/mdcsdpyxcbxb/
Testing and validating distributed systems with Apache Spark and Apache Beam ... - Holden Karau
As distributed data parallel systems, like Spark, are used for more mission-critical tasks, it is important to have effective tools for testing and validation. This talk explores the general considerations and challenges of testing systems like Spark through spark-testing-base and other related libraries.
With over 40% of folks automatically deploying the results of their Spark jobs to production, testing is especially important. Many of the tools for working with big data systems (like notebooks) are great for exploratory work, and can give a false sense of security (as well as additional excuses not to test). This talk explores why testing these systems is hard, special considerations for simulating "bad" partitioning, figuring out when your stream tests are stopped, and solutions to these challenges.
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in PySpark, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark's accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark's current accumulators for debugging, as well as a look to the future for data property accumulators which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Debuggers are a wonderful tool, however when you have 100 computers the “wonder” can be a bit more like “pain”. This talk will look at how to connect remote debuggers, but also remind you that it’s probably not the easiest path forward.
Beyond shuffling global big data tech conference 2015 sj - Holden Karau
Beyond Shuffling - Tips & Tricks for scaling your Apache Spark programs. This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production.
Introducing Apache Spark's Data Frames and Dataset APIs workshop series - Holden Karau
This session of the workshop introduces Spark SQL along with DataFrames and Datasets. Datasets give us the ability to easily intermix relational and functional style programming. So that we can explore the new Dataset API, this iteration will be focused on Scala.
Getting started contributing to Apache Spark - Holden Karau
Are you interested in contributing to Apache Spark? This workshop and associated slides walk through the basics of contributing to Apache Spark as a developer. This advice is based on my 3 years of contributing to Apache Spark but should not be considered official in any way.
The magic of (data parallel) distributed systems and where it all breaks - Re... - Holden Karau
Distributed systems can seem magical, and sometimes all of the magic works and our job succeeds. However, if you've worked with them for long enough you've found a few places where the magic starts to break down, and discovered that it's actually a collection of several hundred garden gnomes* rather than a single large garden gnome.
This talk will use Apache Spark, Beam, Flink, Kafka, and Map Reduce to explore the world of data parallel distributed systems. We'll start with some happy pieces of magic, like how we can combine different transformations into a single pass over the data, working between different languages, data partitioning, and lambda serialization. After each new piece of magic is introduced we'll look at how it breaks in one (or two) of the systems.
Come to be told it's not your fault everything is broken, or if your distributed software still works an exciting preview of everything that's going to go wrong. Don't work with distributed systems? Come to be reassured you've made good life choices.
In this talk we look to the future by introducing Spark as an alternative to the classic Hadoop MapReduce engine. We describe the most important differences from it, detail the main components that make up the Spark ecosystem, and introduce basic concepts so you can get started developing simple applications on top of it.
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC - Holden Karau
Apache Spark has been a great driver of not only Scala adoption, but introducing a new generation of developers to functional programming concepts. As Spark places more emphasis on its newer DataFrame & Dataset APIs, it’s important to ask ourselves how we can benefit from this while still keeping our fun functional roots. We will explore the cases where the Dataset APIs empower us to do cool things we couldn’t before, what the different approaches to serialization mean, and how to figure out when the shiny new API is actually just trying to steal your lunch money (aka CPU cycles).
The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill, that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. In this talk I will introduce you to a powerhouse combination of Cassandra and Spark, which provides a high-speed platform for both real-time and batch analysis.
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018 - Holden Karau
Apache Spark has driven a lot of adoption of both Scala and functional programming concepts in non-traditional industries. Many programmers in the big data world come looking for a solution to scaling their code, quickly find themselves dealing with immutable data structures and lambdas, and those who love it stay. However, there is a dark side (of escape): much of Spark's functional programming is changing, and even though it encourages functional programming it does so in a variety of languages with different expectations (in-line XML as a valid part of your language is fun!). This talk will look at how Spark does a good job of introducing folks to concepts like immutability, but also places where we maybe don't do a great job of setting up developers for a life of functional programming. Things like accumulators, our three different models for streaming data, and an "interesting" approach to closures (come to find out what the ClosureCleaner does, stay to find out why). The talk will close out with a look at how the functionally inspired API is exposed in the different languages, and how this impacts the kind of code written (Scala, Java, and Python - other languages are supported by Spark but I don't want to re-learn Javascript or learn R just for this talk). Pictures of cute animals will be included in the slides to distract from the sad parts.
Video: https://www.youtube.com/watch?v=EDJfpkDpoE4
Big Data Beyond the JVM - Strata San Jose 2018 - Holden Karau
Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to work with Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
This presentation includes a comprehensive introduction to Apache Spark, from an explanation of its rapid ascent to its performance and developer advantages over MapReduce. We also explore its built-in functionality for application types involving streaming, machine learning, and Extract, Transform and Load (ETL).
A really really fast introduction to PySpark - lightning fast cluster computi... - Holden Karau
Apache Spark is a fast and general engine for distributed computing & big data processing with APIs in Scala, Java, Python, and R. This tutorial will briefly introduce PySpark (the Python API for Spark) with some hands-on exercises combined with a quick introduction to Spark's core concepts. We will cover the obligatory wordcount example which comes with every big-data tutorial, as well as discuss Spark's unique methods for handling node failure and other relevant internals. Then we will briefly look at how to access some of Spark's libraries (like Spark SQL & Spark ML) from Python. While Spark is available in a variety of languages, this workshop will be focused on using Spark and Python together.
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J... - Databricks
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide some performance tuning and testing tips for your Spark applications.
In this second part, we'll continue the review of Spark and introduce Spark SQL, which allows us to use DataFrames in Python, Java, and Scala; read and write data in a variety of structured formats; and query big data with SQL.
2. Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data processing with Spark
○ co-author of a new book focused on Spark performance coming this next year*
● @holdenkarau
● SlideShare http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
3. What is going to be covered:
● What I think I might know about you
● A quick background of how PySpark works
● RDD re-use (caching, persistence levels, and checkpointing)
● Working with key/value data
○ Why groupByKey is evil and what we can do about it
● When Spark SQL can be amazing and wonderful
● A brief introduction to Datasets (new in Spark 1.6)
● Calling Scala code from Python with Spark
● How we can make PySpark go fast in the future (vroom vroom)
Torsten Reuschling
5. Who I think you wonderful humans are?
● Nice people - we are at PyData conference :)
● Don’t mind pictures of cats
● Might know some Apache Spark
● Want to scale your Apache Spark jobs
● Don’t overly mind a grab-bag of topics
Lori Erickson
6. What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
7. The different pieces of Spark
[Diagram: the Apache Spark stack - language APIs (Scala, Java, Python, & R); SQL & DataFrames; Streaming; MLLib & Spark ML; Graph tools (bagel & GraphX); and community packages]
Jon Ross
8. SparkContext: entry to the world
● Can be used to create distributed data from many input sources
○ Native collections, local & remote FS
○ Any Hadoop Data Source
● Also create counters & accumulators
● Automatically created in the shells (called sc)
● Specify master & app name when creating
○ Master can be local[*], spark://, yarn, etc.
○ app name should be human readable and make sense
● etc.
Petful
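A minimal sketch of standing up a SparkContext outside of the shells - the master URL, app name, and paths here are placeholders, not from the deck:

from pyspark import SparkContext, SparkConf

# Placeholder master & app name - adjust for your cluster.
conf = SparkConf().setMaster("local[*]").setAppName("pyspark-perf-demo")
sc = SparkContext(conf=conf)

# Distributed data from a native collection and from a file on a local/remote FS.
nums = sc.parallelize(range(1000))
lines = sc.textFile("hdfs:///path/to/input.txt")  # hypothetical path

# Counters & accumulators also come from the SparkContext.
errors = sc.accumulator(0)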
9. RDDs: Spark’s Primary abstraction
RDD (Resilient Distributed Dataset)
● Distributed collection
● Recomputed on node failure
● Distributes data & work across the cluster
● Lazily evaluated (transformations & actions)
Helen Olney
10. What’s new for PySpark in 2.0?
● Newer Py4J bridge
● SparkSession now replaces SQLContext & HiveContext
● DataFrame/SQL speedups
● Better filter push downs in SQL
● Much better ML interop
● Streaming DataFrames* (ALPHA)
● WARNING: Slightly Different Persistence Levels
● And a bunch more :)
11.
12. A detour into PySpark’s internals
Photo by Bill Ward
13. Spark in Scala, how does PySpark work?
● Py4J + pickling + magic
○ This can be kind of slow sometimes
● RDDs are generally RDDs of pickled objects
● Spark SQL (and DataFrames) avoid some of this
kristin klein
14. So what does that look like?
(Diagram: the driver JVM talks to Python over py4j; each of the workers, Worker 1 through Worker K, pipes data to and from a Python worker process.)
15. So how does that impact PySpark?
● Data from the Spark worker is serialized and piped to the Python worker
○ Multiple iterator-to-iterator transformations are still pipelined :)
● Double serialization cost makes everything more expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● etc.
16. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
17. Lets look at some old stand bys:
words = rdd.flatMap(lambda x: x.split(" "))
wordPairs = words.map(lambda w: (w, 1))
grouped = wordPairs.groupByKey()
counts = grouped.mapValues(lambda c: sum(c))
counts.saveAsTextFile("counts")
warnings = rdd.filter(lambda x: x.lower().find("warning") != -1).count()
Tomomi
18. RDD re-use - sadly not magic
● If we know we are going to re-use the RDD what should we do?
○ If it fits nicely in memory, cache it in memory
○ persist at another level
■ MEMORY_ONLY, MEMORY_AND_DISK, etc.
○ checkpoint it
○ The options changed in Spark 2.0 (we can’t easily specify serialized anymore since there is no benefit on RDDs - but things get complicated when sharing RDDs or working with DataFrames)
● Noisy clusters
○ _2 replication & checkpointing can help
● Persist first when checkpointing
Richard Gillin
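A quick sketch of the re-use options above; the storage level, input path, and checkpoint directory are placeholders:

from pyspark import StorageLevel

rdd = sc.textFile("hdfs:///path/to/input.txt")  # hypothetical input

# Cache in memory if it fits nicely (same as MEMORY_ONLY).
cached = rdd.map(lambda x: x.strip()).cache()

# Or pick an explicit persistence level - spill to disk, or use the
# _2 variants to replicate on noisy clusters.
persisted = rdd.map(lambda x: x.split(",")).persist(StorageLevel.MEMORY_AND_DISK_2)

# Checkpointing truncates the lineage; persist first so the data isn't
# recomputed just to write it out.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  # hypothetical directory
persisted.checkpoint()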
19. What is key skew and why do we care?
● Keys aren’t evenly distributed
○ Sales by zip code, or records by city, etc.
● groupByKey will explode (but it's pretty easy to break)
● We can have really unbalanced partitions
○ If we have enough key skew sortByKey could even fail
○ Stragglers (uneven sharding can make some tasks take much longer)
Mitchell Joyce
20. groupByKey - just how evil is it?
● Pretty evil
● Groups all of the records with the same key into a single record
○ Even if we immediately reduce it (e.g. sum it or similar)
○ This can be too big to fit in memory, then our job fails
● Unless we are in SQL then happy pandas
PROgeckoam
21. So what does that look like?
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)
(10003, A, R)
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])
Tomomi
22. “Normal” Word count w/RDDs
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile(output)

No data is read or processed until the saveAsTextFile line - it is an “action” which forces Spark to evaluate the RDD. The flatMap & map steps are still pipelined inside of the same Python executor.
Trish Hamme
25. So what did we do instead?
● reduceByKey
○ Works when the types are the same (e.g. in our summing version)
● aggregateByKey
○ Doesn’t require the types to be the same (e.g. computing stats model or similar)
Allows Spark to pipeline the reduction & skip making the list
We also got a map-side reduction (note the difference in shuffled read)
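A rough sketch of the two options; wordPairs is the pair RDD from the earlier slide, while key_value_rdd and the (count, sum) stats example are illustrative stand-ins:

# reduceByKey: input and output value types are the same (ints here).
word_counts = wordPairs.reduceByKey(lambda x, y: x + y)

# aggregateByKey: the result type can differ from the value type -
# here we build a (count, sum) pair per key from plain numeric values.
sums_and_counts = key_value_rdd.aggregateByKey(
    (0, 0.0),                                  # zero value
    lambda acc, v: (acc[0] + 1, acc[1] + v),   # merge a value into the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))   # merge two accumulators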
26. Can just the shuffle cause problems?
● Sorting by key can put all of the records in the same partition
● We can run into partition size limits (around 2GB)
● Or just get bad performance
● So that we can handle data like the above, we can add some “junk” (a salt) to our key - see the sketch after slide 28
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
Todd Klassy
27. Shuffle explosions :(
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110, A, B)
(94110, A, C)
(94110, E, F)
(94110, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)
(10003, A, R)
(10003, D, E)
javier_artiles
28. 100% less explosions
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110_A, A, B)
(94110_A, A, C)
(94110_A, A, R)
(94110_D, D, R)
(94110_T, T, R)
(10003_A, A, R)
(10003_D, D, E)
(67843_T, T, R)
(94110_E, E, R)
(94110_E, E, R)
(94110_E, E, F)
(94110_T, T, R)
Jennifer Williams
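A minimal sketch of the salting idea - the slide salts by a value field, while this sketch uses a random salt for a simple count; the pairs RDD and bucket count are made up for illustration:

import random

# Hypothetical pair RDD of (zip_code, record) with a badly skewed key.
SALT_BUCKETS = 16
salted = pairs.map(lambda kv: ((kv[0], random.randint(0, SALT_BUCKETS - 1)), 1))

# Reduce on the salted key first (smaller groups per task)...
partial = salted.reduceByKey(lambda x, y: x + y)

# ...then drop the salt and combine the partial counts per real key.
counts = (partial
          .map(lambda kv: (kv[0][0], kv[1]))
          .reduceByKey(lambda x, y: x + y))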
29. Well there is a bit of magic in the shuffle….
● We can reuse shuffle files
● But it can (and does) explode*
Sculpture by Flaming Lotus Girls
Photo by Zaskoda
30. Our saviour from serialization: DataFrames
● For the most part keeps data in the JVM
○ Notable exception is UDFs written in Python
● Takes our Python calls and turns them into a query plan
● If we need more than the native operations in Spark’s DataFrames
● be wary of Distributed Systems bringing claims of usability….
Andy Blackledge
31. So what are Spark DataFrames?
● More than SQL tables
● Not Pandas or R DataFrames
● Semi-structured (have schema information)
● tabular
● work on expressions instead of lambdas
○ e.g. df.filter(df.col(“happy”) == true) instead of rdd.filter(lambda x: x.happy == true)
● Not a subset of “Datasets” - since the Dataset API isn’t exposed in Python yet :(
Quinn Dombrowski
32. Why are DataFrames good for performance?
● Space efficient columnar cached representation
● Able to push down operations to the data store
● Reduced serialization/data transfer overhead
● Able to perform some operations on serialized data
● Optimizer is able to look inside of our operations
○ Regular Spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _))
34. Loading with sparkSQL & spark-csv
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("resources/adult.data"))
Jess Johnson
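For reference, in Spark 2.0+ csv is built in (see the note on the next slide), so the same load looks roughly like this sketch - spark here is the SparkSession created automatically in the 2.0 shells:

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("resources/adult.data"))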
35. What about other data formats?
● Built in
○ Parquet
○ JDBC
○ Json (which is amazing!)
○ Orc
○ Hive
● Available as packages
○ csv*
○ Avro, Redshift, Mongo, Cassandra, Cloudant, Couchbase, etc.
○ +34 at http://spark-packages.org/?q=tags%3A%22Data%20Sources%22
Michael Coghlan
*pre-2.0 package, 2.0+ built in hopefully
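A couple of the built-in formats as a quick sketch; the file paths, connection URL, and table name are placeholders:

# JSON & Parquet are available out of the box on the DataFrame reader.
json_df = sqlContext.read.json("examples/people.json")        # hypothetical path
parquet_df = sqlContext.read.parquet("examples/people.parquet")

# JDBC sources take a connection URL plus a table name.
jdbc_df = sqlContext.read.jdbc("jdbc:postgresql:dbserver", "schema.tablename")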
36. Ok so we’ve got our Data, what now?
● We can inspect the Schema
● We can start to apply some transformations (relational)
● We can do some machine learning
● We can jump into an RDD or a Dataset for functional transformations
● We could wordcount - again!
37. Getting the schema
● printSchema() for human readable
● schema for machine readable
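In PySpark terms that’s roughly:

df.printSchema()   # human readable tree of fields & types
df.schema          # a StructType we can inspect programmatically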
40. Word count w/Dataframes
df = sqlCtx.read.load(src)
# flatMap on a DataFrame returns an RDD
words = df.select("text").flatMap(lambda x: x.text.split(" "))
words_df = words.map(lambda x: Row(word=x, cnt=1)).toDF()
word_count = words_df.groupBy("word").sum()
word_count.write.format("parquet").save("wc.parquet")

Still have the double serialization here :(
*(Also in 2.0 we have to explicitly switch to an RDD first)
41. Buuuut….
● UDFs / custom maps will be “slow” (e.g. require data copy from executor and back)
Nick Ellis
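As a concrete sketch of the kind of UDF that pays this cost - the column name is made up:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Every value is pickled, shipped to a Python worker, run through the
# lambda, and shipped back - hence the "slow".
word_len = udf(lambda s: len(s) if s is not None else 0, IntegerType())
df.select(word_len(df["text"]).alias("text_len"))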
42. Mixing Python & JVM code FTW:
● DataFrames are an example of pushing our processing to the JVM
● Python UDFs & maps lose this benefit
● But we can write Scala UDFs and call them from Python
○ py4j error messages can be difficult to understand :(
● Work to make JVM UDFs easier to register in PR #9766
● Trickier with RDDs since they store pickled objects
43. Exposing functions to be callable from Python:
// functions we want to be callable from python
object functions {
  def kurtosis(e: Column): Column =
    new Column(Kurtosis(EvilSqlTools.getExpr(e)))

  def registerUdfs(sqlCtx: SQLContext): Unit = {
    sqlCtx.udf.register("rowKurtosis", helpers.rowKurtosis _)
  }
}
Fiona Henderson
44. Calling the functions with py4j*:
● The SparkContext has a reference to the jvm (_jvm)
● Many Python objects which are wrappers of JVM objects have _j[objtype] to get the JVM object
○ rdd._jrdd
○ df._jdf
○ sc._jsc
● These are private and may change
*The py4j bridge only exists on the driver**
** Not exactly true but close enough
Fiona Henderson
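A rough sketch of wiring the Scala object above up from Python - the package name is hypothetical, the UDF's column is illustrative, and the underscore attributes are private APIs that may change between versions:

# Assumes the Scala `functions` object is in a jar on the classpath under
# the (hypothetical) package com.example.
java_functions = sc._jvm.com.example.functions

# Hand over the JVM-side SQLContext so the UDFs get registered for SQL use.
java_functions.registerUdfs(sqlCtx._ssql_ctx)

# Once registered, the Scala UDF is callable from SQL like any built-in.
df.registerTempTable("my_table")
sqlCtx.sql("SELECT rowKurtosis(value) FROM my_table").show()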
46. More things to keep in mind with DFs (in Python)
● Schema serialized as json from JVM
● toPandas is essentially collect
● joins can result in the cross product
○ big data x big data =~ out of memory
● Pre 2.0: Use the HiveContext
○ you don’t need a hive install
○ more powerful UDFs, window functions, etc.
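For example, since toPandas pulls everything back to the driver, a guard like this sketch is often worth it:

# toPandas is essentially collect: limit or sample first so the driver survives.
local_pdf = df.limit(1000).toPandas()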
47.
48. The “future*”: Awesome UDFs
● Work going on in Scala land to translate simple Scala into SQL expressions - needs the Dataset API
○ Maybe we can try similar approaches with Python?
● Very early work going on to use Jython for simple UDFs (e.g. 2.7 compat & no native libraries) - SPARK-15369
○ Early benchmarking w/word count: 5% slower than a native Scala UDF, close to 2x faster than regular Python
● Willing to share your Python UDFs for benchmarking? - http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it’s ok!
49.
50. Want to help with reviewing the code?
● https://github.com/apache/spark/pull/13571
Some open questions:
● Do we want to make the Jython dependency optional?
● If so how do we want people to load it?
● Do we want to fall back automatically on Jython failure?
E-mail me: holden@pigscanfly.ca :)
51. The “future*”: Faster interchange
● Faster interchange between Python and Spark (e.g. Tungsten + Apache Arrow)? (SPARK-13391 & SPARK-13534)
● Willing to share your Python UDFs for benchmarking? - http://bit.ly/pySparkUDF
● Dask integration?
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it’s ok!
52. Additional Spark Resources
● Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.)
○ http://spark.apache.org/docs/latest/
● Books
● Videos
● Spark Office Hours
○ Normally in the bay area - will do Google Hangouts ones soon
○ follow me on twitter for future ones - https://twitter.com/holdenkarau
raider of gin
53. Books:
● Learning Spark
● Fast Data Processing with Spark (out of date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Coming soon: Spark in Action
● Coming soon: High Performance Spark
54. And the next book…..
First five chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
● Chapter 9(ish) - Going Beyond Scala
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome book.
55. Spark Videos
● Apache Spark Youtube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
56. k thnx bye!
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
PySpark users: Have some simple UDFs you wish ran faster that you are willing to share?
http://bit.ly/pySparkUDF