Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as look to the future at the data property type accumulators which may be coming to Spark in a future version.
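To make the recompute problem concrete, here is a minimal sketch in plain Python rather than Spark. The `Accumulator` and `LazyMapped` classes are hypothetical stand-ins invented for illustration (they are not Spark APIs); they only mimic the idea that a lazily evaluated transformation re-runs its function on every action unless the result is cached.

```python
class Accumulator:
    """Toy stand-in for an accumulator: a counter that tasks add to."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount


class LazyMapped:
    """Toy lazy collection: the function only runs when collect() is called."""

    def __init__(self, data, func):
        self._data = data
        self._func = func
        self._use_cache = False
        self._cached = None

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._cached is not None:
            return list(self._cached)
        # Without a cached result, every action recomputes the map,
        # so any accumulator updated inside func is bumped again.
        result = [self._func(x) for x in self._data]
        if self._use_cache:
            self._cached = result
        return list(result)


def count_bad_records(records, use_cache):
    bad = Accumulator()

    def parse(record):
        if not record.isdigit():
            bad.add(1)  # debugging counter updated inside a transformation
            return None
        return int(record)

    parsed = LazyMapped(records, parse)
    if use_cache:
        parsed.cache()
    parsed.collect()  # first action
    parsed.collect()  # second action on the same data
    return bad.value


records = ["1", "2", "oops", "4", "n/a"]
print(count_bad_records(records, use_cache=False))  # 4: two bad records, counted twice
print(count_bad_records(records, use_cache=True))   # 2: the second action hits the cache
```

The counts are still useful for debugging as long as they are read as "at least this many events were seen" rather than as exact totals.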
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Video: https://www.youtube.com/watch?v=A0jYQlxc2FU&feature=youtu.be
Getting started contributing to Apache Spark
Holden Karau
Are you interested in contributing to Apache Spark? This workshop and associated slides walk through the basics of contributing to Apache Spark as a developer. This advice is based on my 3 years of contributing to Apache Spark but should not be considered official in any way.
Streaming & Scaling Spark - London Spark Meetup 2016
Holden Karau
This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with an introduction to Datasets with Structured Streaming (new in Spark 2.0) and how to do weird things with them.
At the recent sold-out Spark & Machine Learning Meetup in Brussels, Holden Karau of the Spark Technology Center delivered a lightning talk called A very brief introduction to extending Spark ML for custom models: Talk + Demo.
Holden took a look at Apache Spark ML™ pipelines. Inspired by scikit-learn, they have the potential to make machine learning tasks much easier. This talk looked at how to extend Spark ML with custom model types when the built-in options don't meet your needs.
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in PySpark, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as look to the future at the data property type accumulators which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Debuggers are a wonderful tool; however, when you have 100 computers the “wonder” can be a bit more like “pain”. This talk will look at how to connect remote debuggers, but also remind you that it’s probably not the easiest path forward.
Extending spark ML for custom models now with python!
Holden Karau
Are you interested in adding your own custom algorithms to Spark ML? This is the talk for you! See the companion examples in the High Performance Spark and Sparkling ML projects.
Holden Karau walks attendees through a number of common mistakes that can keep your Spark programs from scaling and examines solutions and general techniques useful for moving beyond a proof of concept to production.
Topics include:
Working with key/value data
Replacing groupByKey for awesomeness
Key skew: your data probably has it and how to survive
Effective caching and checkpointing
Considerations for noisy clusters
Functional transformations with Spark Datasets: getting the benefits of Catalyst with the ease of functional development
How to make our code testable
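As a rough illustration of why replacing groupByKey matters, the following plain-Python sketch (not Spark code) counts how many records would cross a shuffle under two strategies. The function names and the list-of-lists "partitions" are assumptions made up for this example; the point is only that combining values per key inside each partition first means far fewer records have to move.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle, then gets summed."""
    shuffled = [pair for part in partitions for pair in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_reduce_by_key(partitions):
    """reduceByKey-style: combine inside each partition before shuffling."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # At most one record per key leaves each partition.
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


partitions = [
    [("spark", 1), ("spark", 1), ("beam", 1)],
    [("spark", 1), ("beam", 1), ("beam", 1), ("spark", 1)],
]
grouped_totals, grouped_records = shuffle_group_by_key(partitions)
reduced_totals, reduced_records = shuffle_reduce_by_key(partitions)
print(grouped_totals == reduced_totals)  # True: same answer either way
print(grouped_records, reduced_records)  # 7 4: fewer records shuffled after combining
```

With skewed keys the gap grows, since a hot key contributes one record per partition instead of one record per occurrence.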
Getting the best performance with PySpark - Spark Summit West 2016
Holden Karau
This talk assumes you have a basic understanding of Spark and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want to get faster jobs – this is the talk for you. This talk covers a number of important topics for making scalable Apache Spark programs – from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Introduction to Spark ML Pipelines Workshop
Holden Karau
Introduction to Spark ML Pipelines Workshop slides - companion Jupyter notebooks in Python & Scala are available from my GitHub at https://github.com/holdenk/spark-intro-ml-pipeline-workshop
Debugging Apache Spark - Scala & Python super happy fun times 2017
Holden Karau
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in our job.
Introduction to Spark Datasets - Functional and relational together at last
Holden Karau
Spark Datasets are an evolution of Spark DataFrames which allow us to work with both functional and relational transformations on big data with the speed of Spark.
Testing and validating distributed systems with Apache Spark and Apache Beam ...
Holden Karau
As distributed data parallel systems, like Spark, are used for more mission-critical tasks, it is important to have effective tools for testing and validation. This talk explores the general considerations and challenges of testing systems like Spark through spark-testing-base and other related libraries.
With over 40% of folks automatically deploying the results of their Spark jobs to production, testing is especially important. Many of the tools for working with big data systems (like notebooks) are great for exploratory work, and can give a false sense of security (as well as additional excuses not to test). This talk explores why testing these systems is hard, special considerations for simulating "bad" partitioning, figuring out when your stream tests are stopped, and solutions to these challenges.
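One way to picture the "bad" partitioning concern is a test helper that runs the same per-partition logic under several deliberately awkward splits and checks that the answer does not change. The sketch below is plain Python and does not use spark-testing-base; every function name in it is hypothetical and chosen for illustration only.

```python
def partitionings(data):
    """Yield a few deliberately awkward ways to split data into partitions."""
    data = list(data)
    yield [data]                        # everything in one partition
    yield [[x] for x in data]           # one element per partition
    yield [[], data, []]                # empty partitions around the data
    mid = len(data) // 2
    yield [data[:mid], [], data[mid:]]  # uneven split with an empty partition


def assert_partition_independent(per_partition, combine, data):
    """Run the job under each partitioning and require identical results."""
    results = [
        combine([per_partition(part) for part in parts])
        for parts in partitionings(data)
    ]
    assert all(r == results[0] for r in results), results
    return results[0]


# Correct mean: carry (sum, count) per partition and combine at the end.
def sum_and_count(part):
    return (sum(part), len(part))


def combine_mean(partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Buggy mean: averages the per-partition averages.
def naive_mean(part):
    return sum(part) / len(part) if part else 0.0


def combine_naive(partials):
    return sum(partials) / len(partials)


data = [1, 2, 3, 10]
print(assert_partition_independent(sum_and_count, combine_mean, data))  # 4.0
try:
    assert_partition_independent(naive_mean, combine_naive, data)
except AssertionError:
    print("naive mean depends on partitioning")
```

The buggy version happens to pass on a single partition, which is exactly the situation a small local test tends to exercise.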
A fast introduction to PySpark with a quick look at Arrow based UDFs
Holden Karau
This talk will introduce Apache Spark (one of the most popular big data tools), the different built-ins (from SQL to ML), and, of course, everyone's favorite wordcount example. Once we've got the nice parts out of the way, we'll talk about some of the limitations and the work being undertaken to improve those limitations. We'll also look at the cases where Spark is more like trying to hammer a screw. Since we want to finish on a happy note, we will close out with looking at the new vectorized UDFs in PySpark 2.3.
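For readers who have not seen it, the wordcount example has a simple shape: split lines into words, pair each word with a 1, then sum the pairs per word. The sketch below shows that shape in plain Python on a local list; it is an illustration of the idea, not the code from the talk. In PySpark's RDD API the corresponding steps are usually written with `flatMap`, `map`, and `reduceByKey`.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    # "flatMap" step: each line becomes zero or more words.
    words = chain.from_iterable(line.lower().split() for line in lines)
    # "map" step: pair every word with a count of one.
    pairs = ((word, 1) for word in words)
    # "reduceByKey" step: sum the counts for each distinct word.
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


lines = ["spark is fast", "beam is portable", "spark is popular"]
print(word_count(lines))
```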
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Holden Karau
Beyond Shuffling - Tips & Tricks for scaling your Apache Spark programs. This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with a preview of some of the work being done to add code generation to Spark ML.
A super fast introduction to Spark and glance at BEAM
Holden Karau
Apache Spark is one of the most popular general purpose distributed systems, with built-in libraries to support everything from ML to SQL. Spark has APIs across languages including Scala, Java, Python, and R, with more 3rd party language support (like Julia & C#). Apache BEAM is a cross-platform tool for building on top of different distributed systems, but it’s in its early stages. This talk will introduce the core concepts of Apache Spark, and look to the potential future of Apache BEAM.
Apache Spark has two core abstractions for representing distributed data and computations. This talk will introduce the basics of RDDs and Spark DataFrames & Datasets, and Spark’s method for achieving resiliency. Since it’s a big data talk, we will include the almost required wordcount example, and end the Spark part with follow-up pointers on Spark’s new ML APIs. For folks who are interested we’ll then talk a bit about portability, and how Apache BEAM aims to improve portability (as well as its unique approach to cross-language support).
Slides from Holden's talk at https://www.meetup.com/Wellington-Data-Scaling-Chats/events/mdcsdpyxcbxb/
Beyond Wordcount with spark datasets (and scalaing) - Nike PDX Jan 2018
Holden Karau
Apache Spark is one of the most popular big data systems, but once the shiny finish starts to wear off you can find yourself wondering if you've accidentally deployed a Ford Pinto into production. This talk will look at the challenges that come with scaling Spark jobs. Also, the talk will explore Spark's new(ish) Dataset/DataFrame API, as well as how it’s evolving in Spark 2.3 with improved Python support.
If you're already a Spark user, come to find out why it’s not all your fault. If you aren't already a Spark user, come to find out how to save yourself from some of the pitfalls once you move beyond the example code.
Check out Holden's newest book, High Performance Spark, for more information!
From https://niketechtalksjan2018.splashthat.com/
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Holden Karau
Description
This talk assumes you have a basic understanding of Spark (if not check out one of the intro videos on youtube - http://bit.ly/hkPySpark ) and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want to get faster jobs - this is the talk for you.
Abstract
This talk covers a number of important topics for making scalable Apache Spark programs - from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames and traditional RDDs with Python. Looking at Spark 2.0, we examine how to mix functional transformations with relational queries for performance using the new (to PySpark) Dataset API. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Improving PySpark performance: Spark Performance Beyond the JVM
Holden Karau
This talk covers a number of important topics for making scalable Apache Spark programs - from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Big Data Beyond the JVM - Strata San Jose 2018
Holden Karau
Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to use Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Holden Karau
This session of the workshop introduces Spark SQL along with DataFrames and Datasets. Datasets give us the ability to easily intermix relational and functional style programming. So that we can explore the new Dataset API, this iteration will be focused on Scala.
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Holden Karau
This talk starts with a focus on "How to not make Spark Explode" as a developer, and then shifts to look towards the future of all of the cool nifty things we will be able to do with structured streaming.
Accelerating Big Data beyond the JVM - Fosdem 2018
Holden Karau
Many popular big data technologies (such as Apache Spark, BEAM, Flink, and Kafka) are built in the JVM, and many interesting tools are built in other languages (ranging from Python to CUDA). For simple operations the cost of copying the data can quickly dominate, and in complex cases can limit our ability to take advantage of specialty hardware. This talk explores how improved formats are being integrated to reduce these hurdles to co-operation.
Many popular big data technologies (such as Apache Spark, BEAM, and Flink) are built in the JVM, and many interesting AI tools are built in other languages, some of which require copying data to the GPU. As many folks have experienced, while we may wish that we could spend all of our time playing with cool algorithms, we often need to spend more of our time working on data prep. Having to copy our data slowly between the JVM and the target language of computation can remove much of the benefit of being able to access our specialized tooling. Thankfully, as illustrated in the soon-to-be-released Spark 2.3, Apache Arrow and related tools offer the ability to reduce this overhead. This talk will explore how Arrow is being integrated into Spark, and how it can be integrated into other systems, but also limitations and places where Apache Arrow will not magically save us.
Link: https://fosdem.org/2018/schedule/event/big_data_outside_jvm/
Streaming machine learning is being integrated in Spark 2.1+, but you don’t need to wait. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark’s new Structured Streaming and walk you through creating your own streaming model. By the end of this session, you’ll have a better understanding of Spark’s Structured Streaming API as well as how machine learning works in Spark.
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Holden Karau
Slides from PyData London exploring how the big data ecosystem (currently) works together as well as how different parts of the ecosystem work with Python. Proof-of-concept examples are provided using nltk & spacy with Spark. Then we look to the future and how we can improve.
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Holden Karau
With the new Apache Arrow integration in PySpark 2.3, it is now starting to become reasonable to look to the Python world and ask “what else do we want to steal besides TensorFlow”, or as a Python developer look and say “how can I get my code into production without it being rewritten into a mess of Java?”
Regardless of your specific side(s) in the JVM/Python divide, collaboration is getting a lot faster, so let's learn how to share! In this brief talk we will examine sharing some of the wonders of spaCy with the Java world, which still has a somewhat lackluster set of options for NLP.
Extending spark ML for custom models now with python!Holden Karau
Are you interested in adding your own custom algorithms to Spark ML? This is the talk for you! See the companion examples in the High Performance Spark, and Sparkling ML project.
Holden Karau walks attendees through a number of common mistakes that can keep your Spark programs from scaling and examines solutions and general techniques useful for moving beyond a proof of concept to production.
Topics include:
Working with key/value data
Replacing groupByKey for awesomeness
Key skew: your data probably has it and how to survive
Effective caching and checkpointing
Considerations for noisy clusters
Functional transformations with Spark Datasets: getting the benefits of Catalyst with the ease of functional development
How to make our code testable
Getting the best performance with PySpark - Spark Summit West 2016Holden Karau
This talk assumes you have a basic understanding of Spark and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want to get faster jobs – this is the talk for you. This talk covers a number of important topics for making scalable Apache Spark programs – from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Introduction to Spark ML Pipelines WorkshopHolden Karau
Introduction to Spark ML Pipelines Workshop slides - companion IJupyter notebooks in Python & Scala are available from my github at https://github.com/holdenk/spark-intro-ml-pipeline-workshop
Debugging Apache Spark - Scala & Python super happy fun times 2017Holden Karau
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in our job.
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
Spark Datasets are an evolution of Spark DataFrames which allow us to work with both functional and relational transformations on big data with the speed of Spark.
Testing and validating distributed systems with Apache Spark and Apache Beam ...Holden Karau
As distributed data parallel systems, like Spark, are used for more mission-critical tasks, it is important to have effective tools for testing and validation. This talk explores the general considerations and challenges of testing systems like Spark through spark-testing-base and other related libraries.
With over 40% of folks automatically deploying the results of their Spark jobs to production, testing is especially important. Many of the tools for working with big data systems (like notebooks) are great for exploratory work, and can give a false sense of security (as well as additional excuses not to test). This talk explores why testing these systems are hard, special considerations for simulating "bad" partioning, figuring out when your stream tests are stopped, and solutions to these challenges.
A fast introduction to PySpark with a quick look at Arrow based UDFsHolden Karau
This talk will introduce Apache Spark (one of the most popular big data tools), the different built ins (from SQL to ML), and, of course, everyone's favorite wordcount example. Once we've got the nice parts out of the way, we'll talk about some of the limitations and the work being undertaken to improve those limitations. We'll also look at the cases where Spark is more like trying to hammer a screw. Since we want to finish on a happy note, we will close out with looking at the new vectorized UDFs in PySpark 2.3.
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Holden Karau
Beyond Shuffling - Tips & Tricks for scaling your Apache Spark programs. This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving from beyond a prof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with a preview of some of the work being done to add code generation to Spark ML.
A super fast introduction to Spark and glance at BEAMHolden Karau
Apache Spark is one of the most popular general purpose distributed systems, with built in libraries to support everything from ML to SQL. Spark has APIs across languages including Scala, Java, Python, and R -- with more 3rd party language support (like Julia & C#). Apache BEAM is a cross-platform tool for building on top of different distributed systems, but its in it’s early stages. This talk will introduce the core concepts of Apache Spark, and look to the potential future of Apache BEAM.
Apache Spark has two core abstractions for representing distributed data and computations. This talk will introduce the basics of RDDs and Spark DataFrames & Datasets, and Spark’s method for achieving resiliency. Since it’s a big data talk, we will include the almost required wordcount example, and end the Spark part with follow up pointers on Spark’s new ML APIs. For folks who are interested we’ll then talk a bit about portability, and how Apache BEAM aims to improve portability (as well it’s unique approach to cross-language support).
Slides from Holden's talk at https://www.meetup.com/Wellington-Data-Scaling-Chats/events/mdcsdpyxcbxb/
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018Holden Karau
Apache Spark is one of the most popular big data systems, but once the shiny finish starts to wear off you can find yourself wondering if you've accidentally deployed a Ford Pinto into production. This talk will look at the challenges that come with scaling Spark jobs. Also, the talk will explore Spark's new(ish) Dataset/DataFrame API, as well as how it’s evolving in Spark 2.3 with improved Python support.
If you're already a Spark user, come to find out why it’s not all your fault. If you aren't already a Spark user, come to find out how to save yourself from some of the pitfalls once you move beyond the example code.
Check out Holden's newest book, High Performance Spark, for more information!
From https://niketechtalksjan2018.splashthat.com/
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Holden Karau
Description
This talk assumes you have a basic understanding of Spark (if not check out one of the intro videos on youtube - http://bit.ly/hkPySpark ) and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want to get faster jobs - this is the talk for you.
Abstract
This talk covers a number of important topics for making scalable Apache Spark programs - from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames and traditional RDDs with Python. Looking at Spark 2.0; we examine how to mix functional transformations with relational queries for performance using the new (to PySpark) Dataset API. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Improving PySpark performance: Spark Performance Beyond the JVMHolden Karau
This talk covers a number of important topics for making scalable Apache Spark programs - from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to Python code from data scientist and Python developers while avoiding the traditional trap of needing to rewrite everything.
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
This session of the workshop introduces Spark SQL along with DataFrames, Datasets. Datasets give us the ability to easily intermix relational and functional style programming. So that we can explore the new Dataset API this iteration will be focused in Scala.
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup - Holden Karau
This talk starts with a focus on "How to not make Spark Explode" as a developer, and then shifts to look towards the future of all of the cool nifty things we will be able to do with structured streaming.
Accelerating Big Data beyond the JVM - Fosdem 2018 - Holden Karau
Many popular big data technologies (such as Apache Spark, Beam, Flink, and Kafka) are built on the JVM, and many interesting tools are built in other languages (ranging from Python to CUDA). For simple operations the cost of copying the data can quickly dominate, and in complex cases it can limit our ability to take advantage of specialty hardware. This talk explores how improved formats are being integrated to reduce these hurdles to co-operation.
Many popular big data technologies (such as Apache Spark, Beam, and Flink) are built on the JVM, and many interesting AI tools are built in other languages, some of which require copying data to the GPU. As many folks have experienced, while we may wish that we could spend all of our time playing with cool algorithms, we often need to spend more of our time working on data prep. Having to copy our data slowly between the JVM and the target language of computation can remove much of the benefit of being able to access our specialized tooling. Thankfully, as illustrated in the soon to be released Spark 2.3, Apache Arrow and related tools offer the ability to reduce this overhead. This talk will explore how Arrow is being integrated into Spark and how it can be integrated into other systems, as well as the limitations and places where Apache Arrow will not magically save us.
Link: https://fosdem.org/2018/schedule/event/big_data_outside_jvm/
Streaming machine learning is being integrated in Spark 2.1+, but you don’t need to wait. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark’s new Structured Streaming and walk you through creating your own streaming model. By the end of this session, you’ll have a better understanding of Spark’s Structured Streaming API as well as how machine learning works in Spark.
Making the big data ecosystem work together with Python & Apache Arrow, Apach... - Holden Karau
Slides from PyData London exploring how the big data ecosystem (currently) works together as well as how different parts of the ecosystem work with Python. Proof-of-concept examples are provided using nltk & spacy with Spark. Then we look to the future and how we can improve.
Sharing (or stealing) the jewels of python with big data & the jvm (1) - Holden Karau
With the new Apache Arrow integration in PySpark 2.3, it is now starting to become reasonable to look to the Python world and ask “what else do we want to steal besides TensorFlow”, or as a Python developer look and say “how can I get my code into production without it being rewritten into a mess of Java?”
Regardless of your specific side(s) in the JVM/Python divide, collaboration is getting a lot faster, so let’s learn how to share! In this brief talk we will examine sharing some of the wonders of spaCy with the Java world, which still has a somewhat lackluster set of options for NLP.
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018 - Holden Karau
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in your job.
The talk will wrap up with Holden trying to get everyone to buy several copies of her new book, High Performance Spark.
Are general purpose big data systems eating the world? - Holden Karau
Every time there is a new piece of big data technology, we often see many different specific implementations of the concepts, which often eventually consolidate down to a few viable options and then frequently end up getting rolled into another larger project. This talk will examine this trend in the big data ecosystem, look at the exceptions to the "rule", and look at how better interchange formats like Apache Arrow have the potential to change this going forward. In addition to general vague happy feelings (or sad, depending on your ideas about how software should be made), this talk will look at some specific examples with deep learning, so if anyone is looking for a little bit of pixie dust to sprinkle on a failing business plan to take to Silicon Valley to raise a series A, you'll get something out of this as well.
Video - https://www.youtube.com/watch?v=P_YKrLFZQJo
Simplifying training deep and serving learning models with big data in python... - Holden Karau
More Serious Business Kitty Description:
While some deep learning systems have promised to not require any kind of data preparation or cleaning, in practice many folks find that effectively training their models requires some amount of data preparation and often we spend more time on our data preparation than anything else. This talk will examine tools for data preparation that can be used at scale on "big-data" and then how to use their results on-line at serving time (where we hopefully no longer require a cluster to predict every new user).
Less Serious Business Kitty Description:
Deep Learning, in addition to being a world class tool for detecting the presence of cats, requires large amounts of data for training. As much as vendors may say "no data prep required", they are all lying*. This talk will look at tools to build a deep learning pipeline with feature prep on top of existing big data technologies without rewriting your code for serving.
Traditionally, feature prep done in a big data system like Spark, Flink, or Beam would have to be rewritten for the on-line serving component. This is about as much fun as when we have to rewrite our sample Python code into Java, as for some reason that's what a lot of companies associate with "production." Come for the deep learning buzz-words, stay for how to perform on-line serving without writing Java code.
*All vendors are optimists when it comes to their own products, including the vendors who pay Holden and Gris, but they pay us so it's OK.
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018 - Holden Karau
Apache Spark has driven a lot of adoption of both Scala and functional programming concepts in non-traditional industries. Many programmers in the big data world come looking for a solution to scaling their code and quickly find themselves dealing with immutable data structures and lambdas, and those who love it stay. However, there is a dark side (of escape): much of Spark’s functional programming is changing, and even though it encourages functional programming, it does so in a variety of languages with different expectations (in-line XML as a valid part of your language is fun!). This talk will look at how Spark does a good job of introducing folks to concepts like immutability, but also at places where we maybe don’t do a great job of setting up developers for a life of functional programming: things like accumulators, our three different models for streaming data, and an “interesting” approach to closures (come to find out what the ClosureCleaner does, stay to find out why). The talk will close out with a look at how the functional-inspired API is exposed in the different languages, and how this impacts the kind of code written (Scala, Java, and Python; other languages are supported by Spark but I don’t want to re-learn JavaScript or learn R just for this talk). Pictures of cute animals will be included in the slides to distract from the sad parts.
Video: https://www.youtube.com/watch?v=EDJfpkDpoE4
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC - Holden Karau
Apache Spark has been a great driver of not only Scala adoption, but introducing a new generation of developers to functional programming concepts. As Spark places more emphasis on its newer DataFrame & Dataset APIs, it’s important to ask ourselves how we can benefit from this while still keeping our fun functional roots. We will explore the cases where the Dataset APIs empower us to do cool things we couldn’t before, what the different approaches to serialization mean, and how to figure out when the shiny new API is actually just trying to steal your lunch money (aka CPU cycles).
Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON... - Holden Karau
TensorFlow is all kinds of fancy, from helping startups raise their series A in Silicon Valley to detecting if something is a cat. However, when things start to get “real,” you may find yourself no longer just dealing with mnist.csv but instead needing to do large-scale data prep as well as training.
Holden Karau details how to use TensorFlow in conjunction with Apache Spark, Flink, and Beam to create a full machine learning pipeline—including the annoying feature engineering and data prep components that we like to pretend don’t exist. Holden also explains why these feature prep stages need to be integrated into the serving layer. She concludes by examining changing industry trends, like Apache Arrow, and how they impact cross-language development for things like deep learning. Even if you’re not trying to raise a round of funding in Silicon Valley, this talk will give you tools to do interesting machine learning problems at scale.
An introduction into Spark ML plus how to go beyond when you get stuck - Data Con LA
Abstract:-
This talk will introduce Spark's new machine learning framework (Spark ML) and how to train basic models with it. A companion Jupyter notebook for people to follow along with will be provided. Once we've got the basics down, we'll look at what to do when we find we need more than the tools available in Spark ML (and I'll try to convince people to contribute to my latest side project, Sparkling ML).
Bio:-
Holden Karau is a transgender Canadian, Apache Spark committer, an active open source contributor, and coauthor of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden speaks internationally about Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She holds a bachelor of mathematics in computer science from the University of Waterloo. Outside of computers she enjoys scootering and playing with fire.
Validating Big Data Pipelines - Big Data Spain 2018 - Holden Karau
As big data jobs move from the proof-of-concept phase into powering real production services, we have to start considering what will happen when everything eventually goes wrong (such as recommending inappropriate products or other decisions taken on bad data). This talk will attempt to convince you that we will all eventually get aboard the failboat (especially with ~40% of respondents automatically deploying their Spark jobs results to production), and it’s important to automatically recognize when things have gone wrong so we can stop deployment before we have to update our resumes.
Testing and validating spark programs - Strata SJ 2016 - Holden Karau
Apache Spark is a fast, general engine for big data processing. As Spark jobs are used for more mission-critical tasks, it is important to have effective tools for testing and validation. Expanding her Strata NYC talk, “Effective Testing of Spark Programs,” Holden Karau details reasonable validation rules for production jobs and best practices for creating effective tests, as well as options for generating test data.
Holden explores best practices for generating complex test data, setting up performance testing, as well as basic unit testing. The validation component will focus on how to create reasonable validation rules given the constraints of Spark’s accumulators.
Unit testing of Spark programs is deceptively simple. Holden looks at how unit testing of Spark itself is accomplished and distills a number of best practices into traits we can use. This includes dealing with local mode cluster creation and tear down during test suites, factoring our functions to increase testability, mock data for RDDs, and mock data for Spark SQL. A number of interesting problems also arise when testing Spark Streaming programs, including handling of starting and stopping the streaming context, providing mock data, and collecting results, and Holden pulls out simple takeaways for dealing with these issues.
Holden also explores Spark’s internal methods for generating random data, as well as options using external libraries to generate effective test datasets (for both small- and large-scale testing). And while acceptance tests are not always thought of as part of testing, they share a number of similarities, so Holden discusses which counters Spark programs generate that we can use for creating acceptance tests, best practices for storing historic values, and some common counters we can easily use to track the success of our job, all while working within the constraints of Spark’s accumulators.
Intro - End to end ML with Kubeflow @ SignalConf 2018 - Holden Karau
There are many great tools for training machine learning models, ranging from scikit-learn to Apache Spark and TensorFlow. However, many of these systems largely leave open the question of how to use our models outside of the batch world (like in a reactive application). Different options exist for persisting the results and using them for live training, and we will explore the trade-offs of the different formats and their corresponding serving/prediction layers.
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...Databricks
As big data jobs move from the proof-of-concept phase into powering real production services, we have to start considering what will happen when everything eventually goes wrong (such as recommending inappropriate products or other decisions taken on bad data). This talk will attempt to convince you that we will all eventually get aboard the failboat (especially with ~40% of respondents automatically deploying their Spark jobs results to production), and it’s important to automatically recognize when things have gone wrong so we can stop deployment before we have to update our resumes.
Figuring out when things have gone terribly wrong is trickier than it first appears, since we want to catch the errors before our users notice them (or failing that before CNN notices them). We will explore general techniques for validation, look at responses from people validating big data jobs in production environments, and libraries that can assist us in writing relative validation rules based on historical data.
For folks working in streaming, we will talk about the unique challenges of attempting to validate in a real-time system, and what we can do besides keeping an up-to-date resume on file for when things go wrong. To keep the talk interesting, real-world examples (with company names removed) will be presented, as well as several Creative Commons licensed cat pictures and an adorable panda GIF.
If you’ve seen Holden’s previous testing Spark talks, this can be viewed as a deep dive on the second half, focused on what else we need to do besides good testing practices to create production quality pipelines. If you haven’t seen the testing talks, watch those on YouTube after you come see this one.
Using Spark ML on Spark Errors – What Do the Clusters Tell Us? with Holden K... - Databricks
If you’re subscribed to user@spark.apache.org, or work in a large company, you may see some common Spark error messages. If you have attended Spark Summit over the past few years, you may also have seen talks like the “Top K Mistakes in Spark.” While cool non-machine-learning-based tools do exist to examine Spark’s logs, they don’t use machine learning and are therefore not as cool, and they are also limited by the amount of effort humans can put into writing rules for them. This talk will look at what happens when we train “regular” clustering models on stack traces, and explore DL models for classifying user messages to the Spark list. Come for the reassurance that the robots are not yet able to fix themselves, and stay to learn how to work better with the help of our robot friends. The tl;dr of this talk is that Spark ML on Spark output, plus a little bit of TensorFlow, is fun for the whole family, but probably shouldn’t automatically respond to user list posts just yet.
Similar to Debugging PySpark - Spark Summit East 2017 (20)
1. Debugging PySpark
Or why is there a JVM stack trace and what does it mean?
Holden Karau
IBM - Spark Technology Center
2. Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● Apache Spark committer (as of last month!) :)
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data Processing with Spark
○ co-author of a new book focused on Spark performance coming this year*
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● LinkedIn https://www.linkedin.com/in/holdenkarau
● GitHub https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
3.
4. What is the Spark Technology Center?
● An IBM technology center focused around Spark
● We work on open source Apache Spark to make it more awesome
○ Python, SQL, ML, and more! :)
● Related components as well:
○ Apache Toree [Incubating] (Notebook solution for Spark with Jupyter)
○ spark-testing-base (testing utilities on top of Spark)
○ Apache Bahir
○ Apache SystemML [Incubating] - Machine Learning
● Partner with the Scala Foundation and other important players
● Multiple Spark Committers (Nick Pentreath, Xiao (Sean) Li, Prashant Sharma, Holden Karau (me!))
● Lots of contributions in Spark 2.0 & beyond :)
6. Who I think you wonderful humans are?
● Friendly people (this is a Python focused talk after all)
● Don’t mind pictures of cats or stuffed animals
● Know some Python
● Know some Spark
● Want to debug your Spark applications
● Ok with things getting a little bit silly
Lori Erickson
7. What will be covered?
● A quick overview of PySpark architecture to understand how it can impact our
debugging
● Getting at Spark’s logs & persisting them
● What your options for logging are
● Attempting to understand Spark error messages
● My somewhat subtle attempts to get you to use spark-testing-base or similar
● My even less subtle attempts to get you to buy my new book
● Pictures of cats & stuffed animals
10. Spark in Scala, how does PySpark work?
● Py4J + pickling + magic
○ This can be kind of slow sometimes
● RDDs are generally RDDs of pickled objects
● Spark SQL (and DataFrames) avoid some of this
kristin klein
11. So what does that look like?
[Diagram: Driver (py4j); Worker 1 ... Worker K, each connected to a Python worker by a pipe]
12. So how does that impact PySpark?
● Data from Spark worker serialized and piped to Python
worker
○ Multiple iterator-to-iterator transformations are still pipelined :)
● Double serialization cost makes everything more
expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go
over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● etc.
13. So where are the logs/errors?
(e.g. before we can identify a monster we have to find it)
● Error messages reported to the console*
● Log messages reported to the console*
● Log messages on the workers - access through the
Spark Web UI or Spark History Server :)
(*When running in client mode)
Andrey
14. Working in Jupyter?
“The error messages were so useless -
I looked up how to disable error reporting in Jupyter”
(paraphrased from PyData DC)
15. Working in Jupyter - try your terminal for help
Possibly fixed by https://issues.apache.org/jira/browse/SPARK-19094 but may not get in
tonynetone
16. Working in YARN?
(e.g. before we can identify a monster we have to find it)
● Use yarn logs to get logs after log collection
● Or set up the Spark history server
● Or yarn.nodemanager.delete.debug-delay-sec :)
Lauren Mitchell
17. Spark is pretty verbose by default
● Most of the time it tells you things you already know
● Or don’t need to know
● You can dynamically control the log level with
sc.setLogLevel
● This is especially useful to increase logging near the
point of error in your code
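A minimal sketch of turning logging up only around the suspicious code, using the sc.setLogLevel call mentioned above. The helper name temporary_log_level and its restore_to parameter are assumptions for illustration (this sketch does not try to read the previous level, so you tell it what to go back to):

```python
from contextlib import contextmanager


@contextmanager
def temporary_log_level(sc, level, restore_to="WARN"):
    """Change Spark's log level for the body of a `with` block.

    `sc` is an existing SparkContext. `restore_to` is an assumption:
    the level to switch back to afterwards is passed in by the caller.
    """
    sc.setLogLevel(level)
    try:
        yield sc
    finally:
        # Always switch back, even if the suspicious code raised.
        sc.setLogLevel(restore_to)


# Usage, assuming `sc` and `suspicious_rdd` already exist in your session:
# with temporary_log_level(sc, "DEBUG"):
#     suspicious_rdd.count()
```

Because the level is restored in a finally block, the rest of the job goes back to quiet logging even when the wrapped action fails.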
18. But what about when we get an error?
● Python Spark errors often come in two-ish parts
● JVM Stack Trace (Friend Monster - comes with most errors)
● Python Stack Trace (Boo - has information)
● Buddy - Often used to report the information from Friend
Monster and Boo
19. So what is that JVM stack trace?
● Doesn’t want your error messages to get lonely
● Often not very informative
○ Except if the error happens purely in the JVM - like asking Spark to
load a file which doesn’t exist
20. Let’s make some mistakes & debug :)
● Error in transformation
● Run out of memory in the workers
Image by: Tomomi
21. Bad outer transformation:
data = sc.parallelize(range(10))
transform1 = data.map(lambda x: x + 1)
transform2 = transform1.map(lambda x: x / 0)
transform2.count()
David Martyn Hunt
22. Let’s look at the error messages for it:
[Stage 0:> (0 + 0) / 4]17/02/01 09:52:07 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
Continued for ~400 lines
File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>
25. A scroll down (not quite to the bottom)
File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>
transform2 = transform1.map(lambda x: x / 0)
ZeroDivisionError: integer division or modulo by zero
26. Or look at the bottom of console logs:
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
27. Or look at the bottom of console logs:
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>
transform2 = transform1.map(lambda x: x / 0)
ZeroDivisionError: integer division or modulo by zero
28. Python Pipelines
● Some pipelining happens inside of Python
○ For performance (fewer copies from Python to Scala)
● DAG visualization is generated inside of Scala
○ Misses Python pipelines :(
Regardless of language
● Can be difficult to determine which element failed
● Stack trace _sometimes_ helps (it did this time)
● take(1) + count() are your friends - but a lot of work :(
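One way to make the take(1) + count() approach less work: a rough sketch that applies transformations one at a time and forces evaluation after each, so the error surfaces at the step that introduced it. The helper name find_failing_step and the (name, function) step format are assumptions for illustration, not Spark APIs; the sketch only relies on the RDD having count().

```python
def find_failing_step(rdd, steps):
    """Apply named transformations one by one, evaluating after each.

    `steps` is a list of (name, function) pairs, where each function
    takes an RDD and returns a new RDD. Returns the name of the first
    step whose evaluation raises, or None if every step succeeds.
    """
    current = rdd
    for name, step in steps:
        current = step(current)
        try:
            # count() forces everything up to and including this step.
            current.count()
        except Exception as e:
            print("step %r failed with %s" % (name, type(e).__name__))
            return name
    return None


# Usage, assuming `data` is an existing RDD:
# find_failing_step(data, [
#     ("add one", lambda r: r.map(lambda x: x + 1)),
#     ("divide", lambda r: r.map(lambda x: x / 0)),
# ])
```

Each count() recomputes the upstream steps unless they are cached, so this is for debugging on small or sampled data rather than for production runs.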
29. Side note: Lambdas aren’t always your friend
● Lambdas can make finding the error more challenging
● I love lambda x, y: x / y as much as the next human but
when y is zero :(
● A small bit of refactoring for your debugging never hurt
anyone*
● If your inner functions are causing errors it’s a good time
to have tests for them!
● Difficult to put logs inside of them
*A blatant lie, but…. it hurts less often than it helps
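A small pure-Python illustration of why the refactoring helps: the frame that raised shows up as <lambda> for a lambda but under its own name for a named function. The helper innermost_frame_name is a hypothetical name used only for this sketch.

```python
def divide(x, y):
    """Named replacement for `lambda x, y: x / y`."""
    return x / y


def innermost_frame_name(func, *args):
    """Call func and return the name of the frame that raised, or None."""
    try:
        func(*args)
    except Exception as e:
        tb = e.__traceback__
        # Walk to the deepest traceback entry, where the error happened.
        while tb.tb_next is not None:
            tb = tb.tb_next
        return tb.tb_frame.f_code.co_name
    return None


print(innermost_frame_name(lambda x, y: x / y, 1, 0))  # <lambda>
print(innermost_frame_name(divide, 1, 0))              # divide
```

A named function is also something you can import into a unit test and call directly, which a lambda buried in a chain of transformations is not.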
30. Testing - you should do it!
● spark-testing-base is on pip now for your happy test
adventures
● That’s a talk unto itself though (but it's on YouTube)
31. Adding your own logging:
● Java users use Log4J & friends
● Python users: use logging library (or even print!)
● Accumulators
○ Behave a bit weirdly, don’t put large amounts of data in them
32. Also not all errors are “hard” errors
● Parsing input? Going to reject some malformed records
● flatMap or filter + map can make this simpler
● Still want to track number of rejected records (see
accumulators)
Mustafasari
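A minimal sketch of the flatMap approach to malformed records: the parsing function returns a list with one element for a good record and an empty list for a bad one, and flatMap flattens the rejects away. The names parse_or_reject and parse_all, and the choice of int parsing, are assumptions for illustration.

```python
def parse_or_reject(line):
    """Return [value] for a well-formed record, [] for a malformed one."""
    try:
        return [int(line)]
    except ValueError:
        return []


def parse_all(lines_rdd):
    """Parse an RDD of strings, silently dropping malformed records."""
    # flatMap flattens the zero-or-one element lists into one RDD.
    return lines_rdd.flatMap(parse_or_reject)


# Usage, assuming `sc` exists:
# parse_all(sc.parallelize(["1", "oops", "3"])).collect()
```

The filter + map alternative would need to validate and parse in two passes over each record; returning a list lets one function do both.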
33. So using names & logging & accs could be:
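# sc is the SparkContext (for example, the one the PySpark shell provides)
data = sc.parallelize(range(10))
rejectedCount = sc.accumulator(0)
def loggedDivZero(x):
    import logging
    try:
        return [x / 0]
    except Exception as e:
        # Count the rejected record and log why it was rejected
        rejectedCount.add(1)
        logging.warning("Error found " + repr(e))
        return []
# add1 was not shown on the slide; this definition is an assumption
def add1(x):
    return x + 1
transform1 = data.flatMap(loggedDivZero)
transform2 = transform1.map(add1)
transform2.count()
print("Reject " + str(rejectedCount.value))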
34. Spark accumulators
● Really “great” way for keeping track of failed records
● Double counting makes things really tricky
○ Jobs which worked “fine” don’t continue to work “fine” when minor changes happen
● Relative rules can save us* under certain conditions
Found Animals Foundation
35. Could we just use -mtrace?
● Spark makes certain assumptions about how Python is
launched on the workers, so this doesn’t (currently) work
● Namely it assumes PYSPARK_PYTHON points to a file
● Also assumes arg[0] has certain meanings :(
paul
36. Ok what about if we run out of memory?
In the middle of some Java stack traces:
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "high_performance_pyspark/bad_pyspark.py", line 132, in generate_too_much
return range(10000000000000)
MemoryError
37. Tubbs doesn’t always look the same
● Out of memory can be pure JVM (worker)
○ OOM exception during join
○ GC timelimit exceeded
● OutOfMemory error, Executors being killed by kernel,
etc.
● Running in YARN? “Application overhead exceeded”
● JVM out of memory on the driver side from Py4J
38. Reasons for JVM worker OOMs
(w/PySpark)
● Unbalanced shuffles
● Buffering of Rows with PySpark + UDFs
○ If you have a down stream select move it up stream
● Individual jumbo records (after pickling)
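A rough way to go looking for jumbo records: measure how big each record is once pickled and pull the largest sizes back to the driver. The helper names are assumptions, and the numbers are only an approximation, since PySpark's own serializers batch records and may not match a plain pickle.dumps call.

```python
import pickle


def pickled_size(record):
    """Approximate serialized size of one record, in bytes."""
    return len(pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL))


def largest_pickled_sizes(rdd, n=5):
    """Return the n largest per-record pickled sizes in the RDD."""
    # top(n) brings only the n largest values back to the driver.
    return rdd.map(pickled_size).top(n)


# Usage, assuming `records` is an existing RDD:
# print(largest_pickled_sizes(records))
```

If a handful of sizes are orders of magnitude above the rest, those records are good candidates for the OOM.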
39. Reasons for Python worker OOMs
(w/PySpark)
● Insufficient memory reserved for Python worker
● Jumbo records
● Eager entire partition evaluation (e.g. sort +
mapPartitions)
● Too large partitions (unbalanced or not enough
partitions)
● Native code memory leak
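To illustrate the eager entire partition evaluation point, here is a sketch of two functions that could be passed to mapPartitions. Both produce the same values, but the first holds the whole partition in Python memory while the second handles one record at a time. The function names and the doubling are placeholders.

```python
def double_all_eager(records):
    """Materializes the entire partition before producing any output."""
    rows = list(records)  # the whole partition is held in memory here
    return [r * 2 for r in rows]


def double_all_lazy(records):
    """Iterator-to-iterator: only one record is in flight at a time."""
    for r in records:
        yield r * 2


# Usage, assuming `rdd` is an existing RDD:
# rdd.mapPartitions(double_all_lazy).count()
```

Some operations (such as sorting within a partition) genuinely need every record at once; in that case more, smaller partitions are the usual way out.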
40. And loading invalid paths:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/doesnotexist
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
42. What about if that isn’t enough to debug?
● Move take(1) up the dependency chain
● DAG in the WebUI -- less useful for Python :(
● toDebugString -- also less useful in Python :(
● Sample data and run locally
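A minimal sketch of the sample-and-run-locally idea: pull a small sample to the driver, then run the suspect function over it in plain Python so the failing record and the exception are both in hand. The helper name failing_records is an assumption, and the fraction and seed in the usage comment are arbitrary.

```python
def failing_records(records, func):
    """Run func over local records and collect (record, exception) pairs."""
    failures = []
    for record in records:
        try:
            func(record)
        except Exception as e:
            failures.append((record, e))
    return failures


# Usage, assuming `rdd` and `suspect_function` exist:
# sample = rdd.sample(False, 0.01, seed=42).collect()
# for record, error in failing_records(sample, suspect_function):
#     print(repr(record), repr(error))
```

A random sample can miss a rare bad record, so an empty result here only means the sample was clean, not that the data is.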
43. Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
Coming soon: High Performance Spark
Coming soon: Learning PySpark
44. High Performance Spark (soon!)
First seven chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
● Python is in Chapter 7 & Debugging in Appendix
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
45. K thnx bye!
Get in touch if you want:
@holdenkarau on twitter
Have some simple UDFs you wish ran faster?: http://bit.ly/pySparkUDF
If you care about Spark testing: http://bit.ly/holdenTestingSpark
Want to start contributing to PySpark? Talk to me IRL or
E-mail: holden.karau+contributing@gmail.com