This is a version of a talk I presented at Spark Summit East 2016 with Rachel Warren. In this version, I also discuss memory management on the JVM with pictures from Alexey Grishchenko, Sandy Ryza, and Mark Grover.
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
by Anya Bida and Rachel Warren from Alpine Data
https://spark-summit.org/east-2016/events/spark-tuning-for-enterprise-system-administrators/
Spark offers the promise of speed, but many enterprises are reluctant to make the leap from Hadoop to Spark. Indeed, System Administrators will face many challenges with tuning Spark performance. This talk is a gentle introduction to Spark Tuning for the Enterprise System Administrator, based on experience assisting two enterprise companies running Spark in yarn-cluster mode. The initial challenges can be categorized in two FAQs. First, with so many Spark Tuning parameters, how do I know which parameters are important for which jobs? Second, once I know which Spark Tuning parameters I need, how do I enforce them for the various users submitting various jobs to my cluster? This introduction to Spark Tuning will enable enterprise system administrators to overcome common issues quickly and focus on more advanced Spark Tuning challenges. The audience will understand the “cheat-sheet” posted here: http://techsuppdiva.github.io/
Key takeaways:
FAQ 1: With so many Spark Tuning parameters, how do I know which parameters are important for which jobs?
Solution 1: The Spark Tuning cheat-sheet! A visualization that guides the System Administrator to quickly overcome the most common hurdles to algorithm deployment. [1] http://techsuppdiva.github.io/
FAQ 2: Once I know which Spark Tuning parameters I need, how do I enforce them at the user level? Job level? Algorithm level? Project level? Cluster level?
Solution 2: We’ll approach these challenges using job & cluster configuration, the Spark context, and 3rd party tools – of which Alpine will be one example. We’ll operationalize Spark parameters according to user, job, algorithm, workflow pipeline, or cluster levels.
Breaking Spark: Top 5 mistakes to avoid when using Apache Spark in production, by Neelesh Srinivas Salian
Spark has been growing in deployments for the past year. The increasing amount of data being analyzed and processed through the framework is massive and continues to push the boundaries of the engine. Drawing on experiences across 150+ production deployments, Neelesh Srinivas Salian explores common issues observed in a cluster environment setup with Apache Spark and offers guidelines to help setup a real-world environment when planning an Apache Spark deployment in a cluster. Attendees can use these observations to improve the usability and supportability of Apache Spark and avoid such issues in their projects.
Topics include:
Scaling the architecture
Memory configurations
End-user code
Incompatible dependencies
Administration- and operation-related issues
From common errors seen in running Spark applications (e.g., OutOfMemory, NoClassFound, disk I/O bottlenecks, History Server crashes, cluster under-utilization) to the advanced settings used to resolve large-scale Spark SQL workloads (such as HDFS block size vs. Parquet block size, and how best to run the HDFS Balancer to redistribute file blocks), you will get the full scoop in this information-packed presentation.
Harnessing Spark and Cassandra with Groovy, by Steve Pember
This talk is an introduction to a powerful combination in the big data space: Apache Spark and Cassandra. Spark is a cluster-computing framework that allows users to perform calculations against resilient in-memory datasets using a functional programming interface. Cassandra is a linearly scalable, fault tolerant, decentralized datastore. These two technologies are complicated, but integrate well and provide such a level of utility that whole companies have formed around them.
In this talk we’ll learn how Spark and Cassandra can be leveraged within your Groovy application, even though Spark normally expects a Scala environment. We’ll talk about Spark and Cassandra from a high level and walk through code examples. We’ll discuss the pitfalls of working with these technologies - like modeling your data appropriately to ensure even distribution in Cassandra, and general packaging woes with Spark - and ways to avoid them. Finally, we’ll explore how we at ThirdChannel are using these technologies.
Getting Buzzed on Buzzwords: Using Cloud & Big Data to Pentest at Scale, by Bishop Fox
You’ve heard about cloud, big data, server-less infrastructure, web scale, and other buzzwords that cause VCs to throw money at people - but how does this help you? If you’re getting bored going over the same checklist in your pentests then you’re missing out on what some of these new technologies can offer you. Using some of the newer cloud technologies not only can you automate all of your workflows, but you can do so with almost zero maintenance at a low cost with almost infinite scalability! This talk will show you how to blow conventional pentesters out of the water using some cool new technologies along with a little bit of trickery.
Some of the topics we’ll go over include:
Cheap and scalable rainbow tables with BigQuery, 5TB in 10 seconds
SQS & Lambda, like Burp Intruder but 10K QPS
Scalable GPU clusters on the cheap with Spot Instances and Elastic Beanstalk
Cloud exit nodes, rotating IPs via Elastic Beanstalk and nano instances
Cost-effective fuzzing with Elastic Beanstalk and Spot Instances
(This was originally presented on November 16, 2018 at Kiwicon 2038).
SF Solr Meetup - Interactively Search and Visualize Your Big Data, by gethue
Open up your user base to the data! Contrary to programming and SQL, almost everybody knows how to search. This talk describes through an interactive demo based on open source Hue how users can graphically search their data in Hadoop. The underlying technical details of the application and its interaction with Apache Solr will be clarified.
The session will detail how to get started with data indexing in just a few clicks as well as explore several data analysis scenarios with the latest Solr Analytics Facets and Spark Streaming. Through a Web browser, attendees will be shown how to explore and visualize data for quick answers. The search dashboard in Hue, with its draggable charts and dynamic interface, lets any non-technical user look for documents or patterns.
Attendees of this talk will learn how to get started with interactive search visualization in their Solr cluster.
Riot Games serves an international base of players that creates terabytes of data daily. With their data centers quickly reaching capacity, Riot Games migrated their entire warehouse to AWS in order to scale operations more effectively. Learn how Riot Games used DynamoDB and Elastic MapReduce to import millions of rows of customer metrics. Hear about the technical specifics involved, and the lessons learned along the way of migrating large amounts of data to AWS.
How Parse Built a Mobile Backend as a Service on AWS (MBL307) | AWS re:Invent..., by Amazon Web Services
Parse is a BaaS for mobile developers that is built entirely on AWS. With over 150,000 mobile apps hosted on Parse, the stability of the platform is our primary concern, but it coexists with rapid growth and a demanding release schedule. This session is a technical discussion of the current architecture and the design decisions that went into scaling the platform rapidly and robustly over the past year and a half. We talk about some of the lessons learned managing and scaling MongoDB, Cassandra, Redis, and MySQL in the cloud. We also discuss how Parse went from launching individual instances using Chef to managing clusters of hosts with Auto Scaling groups, with instance discovery and registry handled by ZooKeeper, thus enabling us to manage vastly larger sets of services with fewer human resources. This session is useful to anyone who is trying to scale up from startup to established platform without sacrificing agility.
Building a highly scalable website requires understanding the core building blocks of your application environment. In this talk we dive into Jahia's core components to understand how they interact and how, by (1) respecting a few architectural practices and (2) fine-tuning Jahia components and the JVM, you will be able to build a highly scalable service.
At Yahoo!, over the past year we have helped migrate hundreds of our grids' users to YARN. Our YARN clusters have in aggregate run over 18 million jobs with more than 3 billion tasks, consuming over 10 thousand years of compute time, with one single cluster running 90 thousand jobs a day. From this experience we would like to share what we have learned about running YARN well, how this is different from running a 1.0-based cluster, and what it takes to migrate your jobs from 1.0 to YARN.
Serverless is a new framework that allows developers to easily harness AWS Lambda and API Gateway to build and deploy full-fledged API services without needing to deal with any ops-level overhead or paying for servers when they're not in use. It's kinda like Heroku on-demand for single functions.
The job throughput and Apache Hadoop cluster utilization benefits of YARN and MapReduce v2 are widely known. Who wouldn’t want job throughput increased by 2x? Most likely you’ve heard (repeatedly) about the key benefits that could be gained from migrating your Hadoop cluster from MapReduce v1 to YARN: namely, improved job throughput and cluster utilization, as well as permitting different computational frameworks to run on Hadoop. What you probably haven’t heard about are the configuration tweaks needed to ensure your existing MR v1 jobs can run on your YARN cluster, as well as YARN-specific configuration settings. In this session we’ll start with a list of recommended YARN configurations, and then step through the most common use cases we’ve seen in the field. Production migrations can quickly go awry without proper guidance. Learn from others’ misconfigurations to get your YARN cluster configured right the first time.
In recent years a wide range of new technologies have disrupted traditional data management. We're now in the middle of a revolution in data processing methods. Choosing allegiances in revolution is risky. In this talk, Doug will present the underlying causes of the revolution and predict how the data world might look once we're through it.
Learn why 451 Research believes Infochimps is well-positioned with an easy-to-consume managed service for those without Hadoop expertise, as well as a stack of technologically interesting projects for the 'devops' crowd.
Opening with a market positioning statement and ending with a competitive and SWOT analysis, Matt Aslett provides a comprehensive impact report.
This slide deck introduces the environment you need to prepare when setting up a Hadoop cluster with CDH (Cloudera's Distribution Including Apache Hadoop), including hardware specifications, the system environment, and software versions.
Related resources collected in the future will be kept in the following two Diigo libraries:
https://www.diigo.com/user/phate334/cloudera
https://www.diigo.com/user/phate334/hadoop
The new YARN framework promises to make Hadoop a general-purpose platform for Big Data and enterprise data hub applications. In this talk, you'll learn about writing and taking advantage of applications built on YARN.
Unlock Hadoop Success with Cloudera Navigator Optimizer, by Cloudera, Inc.
Cloudera Navigator Optimizer analyzes existing SQL workloads to provide instant insights into your workloads and turns that into an intelligent optimization strategy so you can unlock peak performance and efficiency with Hadoop.
In this document, we will present a very brief introduction to Big Data (what is Big Data?), Hadoop (how does Hadoop fit into the picture?), and Cloudera Hadoop (what is the difference between Cloudera Hadoop and regular Hadoop?).
Please note that this document is for Hadoop beginners looking for a place to start.
Data Science at Scale Using Apache Spark and Apache Hadoop, by Cloudera, Inc.
Learn about the skills and tools a data scientist needs and how to start training to be one.
There's so much noise about what a data scientist is or isn't that it can be challenging to identify the skills needed to start training a team or becoming one yourself. What exactly is a data scientist and where do you start?
Cloudera's Director of Data Science, Sean Owen, will start by walking through the different skills a data scientist should have and why businesses need them. Afterwards, Tom Wheeler, Cloudera's Principal Curriculum Developer, will introduce the latest data science course developed by Cloudera University, designed to help people take their first steps toward becoming a data scientist.
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ..., by Anya Bida
Abstract: Imagine we have Ada, our data science intern. Let's run through a very simple word-count Spark job and find a handful of potential failure points. Dozens of failures can and should happen when running Spark jobs on commodity hardware. Given the basic foundation for infrastructure-level expectations, this talk gives Ada tools to ensure her job isn’t caught dead. Once the simple example job runs reliably, with the potential to scale, our data scientist can apply the same toolset to focus on some more interesting algorithms. Turn SNAFUs into successes by anticipating and handling infra failures gracefully.
Note: this talk is a spark-focused extension of Part I, "Just Enough DevOps For Data Scientists" from Scale by The Bay 2018
https://www.youtube.com/watch?v=RqpnBl5NgW0&t=19s
Bio: Anya Bida (https://www.linkedin.com/in/anyabida/)
Spark started at Facebook as an experiment when the project was still in its early phases. Spark's appeal stemmed from its ease of use and an integrated environment to run SQL, MLlib, and custom applications. At that time the system was used by a handful of people to process small amounts of data. However, we've come a long way since then. Currently, Spark is one of the primary SQL engines at Facebook, in addition to being the primary system for writing custom batch applications. This talk will cover the story of how we optimized, tuned and scaled Apache Spark at Facebook to run on 10s of thousands of machines, processing 100s of petabytes of data, and used by 1000s of data scientists, engineers and product analysts every day. In this talk, we'll focus on three areas:
Scaling Compute: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogeneous machines in disaggregated (shared-storage) clusters.
Optimizing Core Engine: How we continuously tune, optimize and add features to the core engine in order to maximize the useful work done per second.
Scaling Users: How we make Spark easy to use, and faster to debug, to seamlessly onboard new users.
Speakers: Ankit Agarwal, Sameer Agarwal
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in PySpark, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as take a look to the future at data property type accumulators, which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Debuggers are a wonderful tool, however when you have 100 computers the “wonder” can be a bit more like “pain”. This talk will look at how to connect remote debuggers, but also remind you that it’s probably not the easiest path forward.
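As a quick illustration of the accumulator-for-debugging idea described above, here is a minimal Scala sketch (not code from the talk; the input path and field layout are made up) that counts malformed records while a job runs:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: path and field layout are hypothetical.
val spark = SparkSession.builder().appName("debug-with-accumulators").getOrCreate()
val sc = spark.sparkContext

val badRecords = sc.longAccumulator("badRecords")    // shows up per stage in the Spark UI

val parsed = sc.textFile("hdfs:///data/input.csv")   // assumed input path
  .flatMap { line =>
    val cols = line.split(",")
    if (cols.length < 2) { badRecords.add(1); None }  // count, then drop, malformed lines
    else Some((cols(0), cols(1)))
  }

parsed.count()                                        // an action forces evaluation
println(s"bad records seen: ${badRecords.value}")     // caveat: recomputed partitions can inflate this value
```

The final comment is exactly the caveat the abstract alludes to: if a partition is recomputed, the accumulator is incremented again, so values from transformations should be treated as debugging hints rather than exact counts.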
Debugging Apache Spark - Scala & Python super happy fun times 2017, by Holden Karau
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in our job.
Just enough DevOps for Data Scientists (Part II), by Databricks
Imagine we have Ada, our data science intern. Let's run through a very simple word-count Spark job and find a handful of potential failure points. Dozens of failures can and should happen when running Spark jobs on commodity hardware. Given the basic foundation for infrastructure-level expectations, this talk gives Ada tools to ensure her job isn’t caught dead. Once the simple example job runs reliably, with the potential to scale, our data scientist can apply the same toolset to focus on some more interesting algorithms. Turn SNAFUs into successes by anticipating and handling infra failures gracefully.
Note: this talk is a spark-focused extension of Part I, "Just Enough DevOps For Data Scientists" from Scale by The Bay 2018
https://www.youtube.com/watch?v=RqpnBl5NgW0&t=19s
Getting Started with Apache Spark on Kubernetes, by Databricks
Community adoption of Kubernetes (instead of YARN) as a scheduler for Apache Spark has been accelerating since the major improvements from Spark 3.0 release. Companies choose to run Spark on Kubernetes to use a single cloud-agnostic technology across their entire stack, and to benefit from improved isolation and resource sharing for concurrent workloads. In this talk, the founders of Data Mechanics, a serverless Spark platform powered by Kubernetes, will show how to easily get started with Spark on Kubernetes.
Debugging PySpark - Spark Summit East 2017, by Holden Karau
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as take a look to the future at data property type accumulators, which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Video: https://www.youtube.com/watch?v=A0jYQlxc2FU&feature=youtu.be
Debugging PySpark: Spark Summit East talk by Holden Karau (Spark Summit)
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as take a look to the future at data property type accumulators, which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Understanding Spark Tuning: Strata New York, by Rachel Warren
How to design a Spark Auto Tuner.
The first section covers how to set basic Spark settings, e.g., executor memory, driver memory, dynamic allocation, shuffle settings, number of partitions, etc. The second section covers how to collect historical data about a Spark job, and the third section discusses designing an auto-tuner application which will programmatically configure Spark jobs using that historical data.
Apache Spark is an amazing distributed system, but part of the bargain we’ve made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Tuning Apache Spark is somewhat of a dark art, although thankfully, when it goes wrong, all we tend to lose is several hours of our day and our employer’s money.
Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using both historical and live job information, using systems like Apache BEAM, Mahout, and internal Spark ML jobs as workloads. Much of the data required to effectively tune jobs is already collected inside of Spark. You just need to understand it. Holden, Rachel, and Anya outline sample auto-tuners and discuss the options for improving them and applying similar techniques in your own work. They also discuss what kind of tuning can be done statically (e.g., without depending on historic information) and look at Spark’s own built-in components for auto-tuning (currently dynamically scaling cluster size) and how you can improve them.
Even if the idea of building an auto-tuner sounds as appealing as using a rusty spoon to debug the JVM on a haunted supercomputer, this talk will give you a better understanding of the knobs available to you to tune your Apache Spark jobs.
Also, to be clear, Holden, Rachel, and Anya don’t promise to stop your pager going off at 2:00am, but hopefully this helps.
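As a rough sketch of the third idea above, programmatically configuring a job from collected history, here is what that could look like; the JobHistory shape, the lookupHistory helper, and the 1.2x headroom factor are all hypothetical, not the design from the talk:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical shape of the per-job history an auto-tuner might collect.
case class JobHistory(peakExecutorMemGb: Double, shufflePartitions: Int)

// Stand-in for a lookup against a store of previous runs (hypothetical helper).
def lookupHistory(jobName: String): Option[JobHistory] =
  Some(JobHistory(peakExecutorMemGb = 3.2, shufflePartitions = 400))

val jobName = "nightly-aggregation"
val history = lookupHistory(jobName)

// Derive settings from history, adding ~20% headroom, with safe fallbacks when no history exists.
val executorMem = history.map(h => s"${math.ceil(h.peakExecutorMemGb * 1.2).toInt}g").getOrElse("4g")
val partitions  = history.map(_.shufflePartitions).getOrElse(200)

val spark = SparkSession.builder()
  .appName(jobName)
  .config("spark.executor.memory", executorMem)
  .config("spark.sql.shuffle.partitions", partitions.toString)
  .getOrCreate()
```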
Apache Spark Performance is too hard. Let's make it easier, by Databricks
Apache Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will then describe how this is critical to the design of Spark and how this tight interplay allows very efficient execution. We will also discuss various sources of metrics on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. Users and operators who are aware of these concepts will become more effective at their interactions with Spark.
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018, by Holden Karau
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in your job.
The talk will wrap up with Holden trying to get everyone to buy several copies of her new book, High Performance Spark.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ..., by Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2..., by pchutichetpong
M Capital Group (“MCG”) expects to see growing demand and an evolving supply picture, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), alongside an ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Adjusting OpenMP PageRank: SHORT REPORT / NOTES, by Subhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method..., by 2023240532
Quantitative Data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder
6. Default != Recommended
Example: By default, spark.executor.memory = 1g
1g allows small jobs to finish out of the box.
Spark assumes you'll increase this parameter.
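For example, a minimal sketch of raising that default when building the Spark context (the 4g value is illustrative, not a recommendation from the deck):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical job-level override of the 1g default; the same key can also be set
// cluster-wide in spark-defaults.conf or per submission with --executor-memory.
val conf = new SparkConf()
  .setAppName("mySparkApp")
  .set("spark.executor.memory", "4g")

val spark = SparkSession.builder().config(conf).getOrCreate()
```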
7. Which parameters are important? How do I configure them?
Default != Recommended
8. Filter* data before an expensive reduce or aggregation; consider* coalesce().
Use* data structures that require less memory.
Serialize*: PySpark serializing is built-in; Scala/Java? persist(storageLevel.[*]_SER). Recommended: kryoserializer.
* See tuning.html#tuning-data-structures and the cheat-sheet sections "Optimize partitions," "GC investigation," and "Checkpointing."
The Spark Tuning Cheat-Sheet
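To picture several of those tips together, here is a small Scala sketch (app name, paths, and the partition count are made up; this is not code from the deck):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch only: app name, paths, and numbers below are hypothetical.
val spark = SparkSession.builder()
  .appName("mySparkApp")
  // The cheat-sheet's recommended serializer; usually faster and more compact than Java serialization.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val lines = spark.sparkContext.textFile("hdfs:///data/events")   // assumed input path

// Filter before the expensive aggregation, and persist the reused data in serialized form.
val filtered = lines.filter(_.nonEmpty)
filtered.persist(StorageLevel.MEMORY_ONLY_SER)

val counts = filtered
  .map(line => (line.split(",")(0), 1L))
  .reduceByKey(_ + _)

// coalesce() down to a few partitions before writing small output.
counts.coalesce(8).saveAsTextFile("hdfs:///out/counts")           // assumed output path
```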
28. Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
mySparkApp_mem_limit > driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
What is the memory limit for mySparkApp?
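To make the rule of thumb concrete, here is a tiny worked check with hypothetical numbers (a 100 GB pool, a 4g driver, 7g executors, and at most 10 executors):

```scala
// All numbers below are hypothetical, purely to illustrate the slide's formula.
val poolMaxGb          = 100.0                     // max memory available in the "pool" (YARN queue)
val mySparkAppMemLimit = poolMaxGb * 3.0 / 4.0     // = 75 GB

val driverMemoryGb   = 4.0                         // spark.driver.memory = 4g
val executorMemoryGb = 7.0                         // spark.executor.memory = 7g
val maxExecutors     = 10                          // spark.dynamicAllocation.maxExecutors = 10

val requestedGb = driverMemoryGb + executorMemoryGb * maxExecutors   // 4 + 7 * 10 = 74 GB
assert(requestedGb <= mySparkAppMemLimit)          // 74 GB fits under the 75 GB limit
```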
30. Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
mySparkApp_mem_limit > driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
What is the memory limit for mySparkApp?
Limitation: Driver must not be larger than a single node.
34. Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
mySparkApp_mem_limit > driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
What is the memory limit for mySparkApp?
Verify my calculations respect this limitation.
39. Reduce the memory needed for mySparkApp. How?
Gracefully handle memory limitations. How?
mySparkApp memory issues
41. Reduce the memory needed for mySparkApp. How?
Gracefully handle memory limitations. How?
mySparkApp memory issues
Here let's talk about one scenario.
43. Reduce the memory needed for mySparkApp. How?
Gracefully handle memory limitations. How?
mySparkApp memory issues
persist(storageLevel.[*]_SER)
45. Reduce the memory needed for mySparkApp. How?
Gracefully handle memory limitations. How?
mySparkApp memory issues
persist(storageLevel.[*]_SER)
Recommended: kryoserializer *