How to design a Spark Auto Tuner.
The first section covers how to set basic Spark settings, e.g. executor memory, driver memory, dynamic allocation, shuffle settings, and number of partitions. The second section covers how to collect historical data about a Spark job, and the third discusses designing an auto-tuner application that programmatically configures Spark jobs using that historical data.
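As a rough illustration of the knobs the first section covers, here is a minimal Scala sketch of a baseline configuration; the values are placeholders rather than recommendations, and they are exactly the kind of settings an auto tuner would later override per job.

```scala
import org.apache.spark.sql.SparkSession

// Baseline configuration sketch; every value here is an illustrative placeholder.
val spark = SparkSession.builder()
  .appName("auto-tuner-baseline")
  .config("spark.executor.memory", "4g")
  .config("spark.driver.memory", "2g")
  .config("spark.executor.cores", "2")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")   // external shuffle service, needed for dynamic allocation on YARN
  .config("spark.sql.shuffle.partitions", "400")     // shuffle parallelism for DataFrame/SQL jobs
  .config("spark.default.parallelism", "400")        // shuffle parallelism for RDD jobs
  .getOrCreate()
```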
Think Like Spark: Some Spark Concepts and a Use Case (Rachel Warren)
A deeper explanation of Spark's evaluation principles, including lazy evaluation, the Spark execution environment, and the anatomy of a Spark job (tasks, stages, query execution plan), with one use case to demonstrate these concepts.
Top 5 Mistakes to Avoid When Writing Apache Spark Applications (Cloudera, Inc.)
The document discusses 5 common mistakes people make when writing Spark applications:
1) Not properly sizing executors for memory and cores.
2) Having shuffle blocks larger than 2GB which can cause jobs to fail.
3) Not addressing data skew which can cause joins and shuffles to be very slow.
4) Not properly managing the DAG to minimize shuffles and stages.
5) Classpath conflicts from mismatched dependencies causing errors.
700 Updatable Queries Per Second: Spark as a Real-Time Web Service (Evan Chan)
700 Updatable Queries Per Second: Spark as a Real-Time Web Service. Find out how to use Apache Spark with FiloDb for low-latency queries - something you never thought possible with Spark. Scale it down, not just scale it up!
The document discusses tuning Spark parameters to optimize performance. It describes how to control Spark's resource usage through parameters like num-executors, executor-cores, and executor-memory. Advanced parameters like spark.shuffle.memoryFraction and spark.reducer.maxSizeInFlight are also covered. Dynamic allocation allows scaling resources up and down based on workload. Tips provided include tuning memory usage, choosing serialization and storage levels, setting parallelism, and avoiding operations like groupByKey. An example recommends tuning the collaborative filtering algorithm in the RW project, reducing runtime from 27 minutes to under 7 minutes.
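A small sketch of a few of the settings mentioned above (Kryo serialization, dynamic allocation bounds, an explicit storage level for a reused dataset); the values and the parquet path are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("tuning-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.reducer.maxSizeInFlight", "96m")            // larger in-flight shuffle fetch buffer
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .getOrCreate()

// Cache a dataset that is reused across actions with an explicit storage level.
val events = spark.read.parquet("/data/events")   // path is illustrative
events.persist(StorageLevel.MEMORY_AND_DISK_SER)
```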
Top 5 mistakes when writing Spark applications (hadooparchbook)
This document discusses common mistakes people make when writing Spark applications and provides recommendations to address them. It covers issues related to executor configuration, application failures due to shuffle block sizes exceeding limits, slow jobs caused by data skew, and managing the DAG to avoid excessive shuffles and stages. Recommendations include using smaller executors, increasing the number of partitions, addressing skew through techniques like salting, and preferring ReduceByKey over GroupByKey and TreeReduce over Reduce to improve performance and resource usage.
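To make the salting recommendation concrete, here is a minimal sketch of a salted join on toy data; the bucket count, column names, and data are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("salting-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data: key "hot" dominates the fact table and would skew a plain join.
val facts = Seq(("hot", 1), ("hot", 2), ("hot", 3), ("cold", 4)).toDF("key", "value")
val dims  = Seq(("hot", "a"), ("cold", "b")).toDF("key", "label")

val saltBuckets = 4

// Spread the hot key across several partitions by appending a random salt...
val saltedFacts = facts.withColumn("salt", (rand() * saltBuckets).cast("long"))

// ...and replicate each dimension row once per salt value so every salted key still matches.
val saltedDims = dims.crossJoin(spark.range(saltBuckets).toDF("salt"))

val joined = saltedFacts.join(saltedDims, Seq("key", "salt")).drop("salt")
joined.show()
```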
Spark supports four cluster managers: Local, Standalone, YARN, and Mesos. YARN is highly recommended for production use. When running Spark on YARN, careful tuning of configuration settings like the number of executors, executor memory and cores, and dynamic allocation is important to optimize performance and resource utilization. Configuring queues also allows separating different applications by priority and resource needs.
Top 5 Mistakes When Writing Spark Applications (Spark Summit)
This document discusses 5 common mistakes when writing Spark applications:
1) Improperly sizing executors by not considering cores, memory, and overhead. The optimal configuration depends on the workload and cluster resources.
2) Applications failing due to shuffle blocks exceeding 2GB size limit. Increasing the number of partitions helps address this.
3) Jobs running slowly due to data skew in joins and shuffles. Techniques like salting keys can help address skew.
4) Not properly managing the DAG to avoid shuffles and bring work to the data. Using reduceByKey over groupByKey and treeReduce over reduce when possible (see the sketch after this list).
5) Classpath conflicts arising from mismatched library versions, which can be addressed using shading.
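The sketch referenced in item 4 above (also touching on the partition advice in item 2): the data, partition counts, and tree depth are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("dag-sketch").setMaster("local[*]"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

// groupByKey ships every value across the network before summing.
val slow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, shrinking the shuffle.
val fast = pairs.reduceByKey(_ + _)

// treeReduce aggregates in levels instead of pulling everything to the driver at once.
val nums  = sc.parallelize(1 to 1000000, numSlices = 200)
val total = nums.treeReduce(_ + _, depth = 2)

// More partitions keep individual shuffle blocks well under the 2 GB limit.
val repartitioned = pairs.repartition(400)
```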
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
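A few of the shuffle-related configuration options alluded to above, as a hedged sketch; the values are examples, not tuned recommendations.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-config-sketch")
  .config("spark.shuffle.compress", "true")                 // compress map output files
  .config("spark.shuffle.file.buffer", "64k")               // buffer size for shuffle file writes
  .config("spark.shuffle.service.enabled", "true")          // serve shuffle blocks from the external service
  .config("spark.shuffle.sort.bypassMergeThreshold", "200") // when sort-based shuffle may skip map-side sorting
  .getOrCreate()
```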
EMR Spark tuning involves configuring Spark and YARN parameters like executor memory and cores to optimize performance. The default Spark configurations depend on the deployment method (Thrift, Zeppelin etc). YARN is used for resource management in cluster mode, and allocates resources to containers based on minimum and maximum thresholds. When tuning, factors like available cluster resources, executor instances and cores should be considered to avoid overcommitting resources.
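A back-of-the-envelope sizing sketch for the kind of reasoning described above; the node dimensions, reservations, and the five-cores-per-executor heuristic are assumptions, not EMR defaults.

```scala
// Hypothetical YARN node; all numbers are illustrative.
val coresPerNode  = 16
val memPerNodeGb  = 64
val reservedCores = 1    // leave a core for the OS / NodeManager
val reservedMemGb = 8    // OS, NodeManager, and other daemons
val coresPerExec  = 5    // common heuristic for good throughput without overcommitting

val execsPerNode = (coresPerNode - reservedCores) / coresPerExec      // 3
val memPerExecGb = (memPerNodeGb - reservedMemGb) / execsPerNode      // 18
val heapGb       = (memPerExecGb * 0.9).toInt                         // leave ~10% for memory overhead

println(s"per node: $execsPerNode executors, --executor-cores $coresPerExec, --executor-memory ${heapGb}g")
```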
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe... (Richard Seymour)
A tour of pyspark streaming in Apache Spark with an example calculating CPU usage using the Docker stats API. Two buzzwordy technologies for the price of one.
Spark started at Facebook as an experiment when the project was still in its early phases. Spark's appeal stemmed from its ease of use and an integrated environment to run SQL, MLlib, and custom applications. At that time the system was used by a handful of people to process small amounts of data. However, we've come a long way since then. Currently, Spark is one of the primary SQL engines at Facebook, in addition to being the primary system for writing custom batch applications. This talk will cover the story of how we optimized, tuned and scaled Apache Spark at Facebook to run on tens of thousands of machines, processing hundreds of petabytes of data, used by thousands of data scientists, engineers and product analysts every day. In this talk, we'll focus on three areas:
* Scaling Compute: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogeneous machines in disaggregated (shared-storage) clusters.
* Optimizing Core Engine: How we continuously tune, optimize and add features to the core engine in order to maximize the useful work done per second.
* Scaling Users: How we make Spark easy to use and faster to debug, to seamlessly onboard new users.
Speakers: Ankit Agarwal, Sameer Agarwal
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of an Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu (Databricks)
Catalyst is an excellent optimizer in Spark SQL that provides an open interface for rule-based optimization at planning time. However, static rule-based optimization cannot take runtime data distribution into account. Adaptive Execution, introduced in Spark 2.0, aims to cover this gap but is still at an early stage. We enhanced the existing Adaptive Execution feature, focusing on adjusting the execution plan at runtime based on the intermediate outputs of each stage: setting partition numbers for joins and aggregations, avoiding unnecessary data shuffling and disk I/O, handling data skew, and even optimizing join order as a CBO would. In our benchmark comparisons, this feature saves huge manual effort in tuning parameters like the shuffle partition number, which is error-prone and misleading. In this talk, we will present the new adaptive execution framework, task scheduling, failover and retry mechanisms, and runtime plan switching. Finally, we will share our experience benchmarking TPCx-BB at the 100-300 TB scale on a Spark cluster of hundreds of bare-metal nodes.
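The adaptive-execution work described in this talk predates the version that eventually shipped in mainline Spark; as a point of reference, in Spark 3.x the same runtime behaviours (choosing shuffle partition counts, handling skewed joins) are controlled by configuration roughly as sketched below. Values are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-sketch")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")   // pick shuffle partition counts at runtime
  .config("spark.sql.adaptive.skewJoin.enabled", "true")             // split skewed partitions during joins
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m") // target size for coalesced partitions
  .getOrCreate()
```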
Apache Spark Performance: Past, Future and Present (Databricks)
This document discusses improving performance instrumentation in Apache Spark. It summarizes that existing instrumentation focuses on blocked times in the main task thread, but opportunities exist to better instrument read/write times and machine-level resource utilization. The author proposes combining per-task I/O metrics with machine utilization data to provide complete metrics about time spent using each resource on a per-task basis, improving performance clarity for users. More details are available at the listed website.
Enterprise Scale Topological Data Analysis Using Spark (Alpine Data)
This document discusses scaling topological data analysis (TDA) using the Mapper algorithm to analyze large datasets. It describes how the authors built the first open-source scalable implementation of Mapper called Betti Mapper using Spark. Betti Mapper uses locality-sensitive hashing to bin data points and compute topological summaries on prototype points to achieve an 8-11x performance improvement over a naive Spark implementation. The key aspects of Betti Mapper that enable scaling to enterprise datasets are locality-sensitive hashing for sampling and using prototype points to reduce the distance matrix computation.
Properly shaping partitions and jobs enables powerful optimizations, eliminates skew, and maximizes cluster utilization. We will explore various Spark partition-shaping methods along with several optimization strategies, including join optimizations, aggregate optimizations, salting, and multi-dimensional parallelism.
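A hedged sketch of one partition-shaping pattern: repartitioning both sides of a join on the join key with an explicit partition count instead of the default. The table paths, column names, and partition count are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("partition-shaping").getOrCreate()

val orders    = spark.read.parquet("/data/orders")      // paths are illustrative
val customers = spark.read.parquet("/data/customers")

// Shape partitions around the join key on both sides, with an explicit count.
val shaped = orders
  .repartition(800, col("customer_id"))
  .join(customers.repartition(800, col("customer_id")), "customer_id")

// Aggregate with a partial (map-side) combine rather than collecting whole groups.
val perCustomer = shaped.groupBy("customer_id").agg(sum("amount").as("total"))
```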
Improving Apache Spark by Taking Advantage of Disaggregated Architecture (Databricks)
Shuffle in Apache Spark is an intermediate phase that redistributes data across computing units, and one of its important properties is that shuffle data is persisted on local disks. This architecture suffers from some scalability and reliability issues. Moreover, the assumption of collocated storage does not always hold in today's data centers; the hardware trend is moving to a disaggregated storage and compute architecture for better cost efficiency and scalability. To address the issues of Spark shuffle and to support disaggregated storage and compute, we implemented a new remote Spark shuffle manager. This new architecture writes shuffle data to a remote cluster with different Hadoop-compatible filesystem backends. Firstly, the failure of compute nodes will no longer cause shuffle data recomputation, and Spark executors can be allocated and recycled dynamically, which results in better resource utilization. Secondly, for most customers currently running Spark with collocated storage, it is usually challenging to upgrade the disks on every node to the latest hardware like NVMe SSDs and persistent memory because of cost considerations and system compatibility. With this new shuffle manager, they are free to build a separate cluster for storing and serving the shuffle data, leveraging the latest hardware to improve performance and reliability. Thirdly, in the HPC world, more customers are trying Spark as their high-performance data analytics tool, while storage and compute in HPC clusters are typically disaggregated; this work will make their lives easier. In this talk, we will present an overview of the issues with the current Spark shuffle implementation, the design of the new remote shuffle manager, and a performance study of the work.
Key attributes for modern real-time stream processing and interactive analytics
What is so exciting to me about Spark?
What are some of the myths?
What is missing in Spark for real time?
SnappyData’s mission: fuse Spark with in-memory data management in one unified cluster to offer OLTP + OLAP + stream processing + probabilistic data
Stories About Spark, HPC and Barcelona by Jordi Torres (Spark Summit)
HPC in Barcelona is centered around the MareNostrum supercomputer and BSC's 425-person team from 40 countries. MareNostrum allows simulation and analysis in fields like life sciences, earth sciences, and engineering. To meet new demands of big data analytics, BSC developed the Spark4MN module to run Spark workloads on MareNostrum. Benchmarking showed Spark4MN achieved good speed-up and scale-out. Further work profiles Spark using BSC tools and benchmarks workloads like image analysis on different hardware. BSC's vision is to advance understanding through technologies like cognitive computing and deep learning.
Solving low latency query over big data with Spark SQL (Julien Pierre)
This document provides an overview of client data, capabilities, and architecture for a data analytics platform. It discusses data size and query latency, processing and storage using Cosmos, SparkSQL and HDFS, a Mesos cluster architecture with Zookeeper, and interactive analytics using Zeppelin and Avocado notebooks. The platform aims to provide a unified environment for data ingestion, transformation, storage, processing and analytics to enable intelligent data products and experiences.
Deep Learning on Apache® Spark™: Workflows and Best Practices (Jen Aman)
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
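The webinar predates Spark's built-in GPU scheduling, but in Spark 3.x the task-conflict and multiple-GPUs-per-worker concerns can be expressed with resource configuration, roughly as below; the GPU counts and the discovery-script path are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dl-on-spark-sketch")
  .config("spark.executor.resource.gpu.amount", "2")   // GPUs made available to each executor
  .config("spark.task.resource.gpu.amount", "1")       // GPUs each task may claim, avoiding conflicts
  .config("spark.executor.resource.gpu.discoveryScript",
          "/opt/spark/examples/src/main/scripts/getGpusResources.sh") // path is illustrative
  .getOrCreate()
```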
In this one-day workshop, we will introduce Spark at a high level. Spark is fundamentally different from writing MapReduce jobs, so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster (DataWorks Summit)
This document discusses Apache Spark-on-YARN, which allows Spark applications to leverage existing Hadoop clusters. Spark improves efficiency over Hadoop via in-memory computing and supports rich APIs. Spark-on-YARN provides access to HDFS data and resources on Hadoop clusters without extra deployment costs. It supports running Spark jobs in YARN cluster and client modes. The document describes Yahoo's use of Spark-on-YARN for machine learning applications on large datasets.
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing (DataWorks Summit)
This document discusses architectures for processing both streaming and batch data. It begins by explaining why both streaming and batch processing are needed. It then discusses challenges with maintaining separate streaming and batch applications that do the same work. Various architectures and technologies are presented as solutions, including the Kappa architecture, Lambda architecture, SummingBird, Apache Spark, Apache Flink, and bringing your own framework. Specific examples are provided for how to implement word count in many of these technologies for both batch and streaming data.
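As a rough sketch of the "same logic, batch and streaming" idea in Spark itself, the following reuses one word-count transformation for a static file and a socket stream; the file path and socket source are placeholders.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("wordcount-both").master("local[*]").getOrCreate()
import spark.implicits._

// The same transformation, applied to a batch source and a streaming source.
def countWords(lines: Dataset[String]) =
  lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).groupBy("value").count()

// Batch: a static text file (path is illustrative).
countWords(spark.read.textFile("/data/corpus.txt")).show()

// Streaming: the same logic over a socket source, printed to the console.
val streamLines = spark.readStream
  .format("socket").option("host", "localhost").option("port", "9999")
  .load().as[String]

countWords(streamLines).writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```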
Accelerating Data Processing in Spark SQL with Pandas UDFs (Databricks)
Spark SQL provides a convenient layer of abstraction for users to express their query’s intent while letting Spark handle the more difficult task of query optimization. Since Spark 2.3, pandas UDFs allow users to define arbitrary functions in Python that can be executed in batches, giving them the flexibility required to write queries that suit very niche cases.
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa... (Spark Summit)
This document discusses Indicium, a system for enabling interactive querying at scale on Spark. It describes a unified data platform using Spark for storing and processing data, as well as enabling scheduled jobs, BI reports, and real-time lookups. It then discusses two parts of Indicium: (1) a managed context pool using Apache Zeppelin and Spark Job Server for multi-user SQL queries, and (2) a smart query scheduler for dynamic scheduling, load balancing, and high availability of queries. The smart scheduler aims to improve on the managed context pool by addressing issues like FIFO scheduling and lack of failure handling.
Slides for presentation on Cloudera Impala I gave at the DC/NOVA Java Users Group on 7/9/2013. It is a slightly updated set of slides from the ones I uploaded a few months ago on 4/19/2013. It covers version 1.0.1 and also includes some new slides on HortonWorks' Stinger Initiative.
This document discusses programmatically tuning Spark jobs. It recommends collecting historical metrics like stage durations and task metrics from previous job runs. These metrics can then be used along with information about the execution environment and input data size to optimize configuration settings like memory, cores, partitions for new jobs. The document demonstrates using the Robin Sparkles library to save metrics and get an optimized configuration based on prior run data and metrics. Tuning goals include reducing out of memory errors, shuffle spills, and improving cluster utilization.
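A minimal sketch of that metrics-collection idea (not the Robin Sparkles API itself): a listener records per-stage duration and memory spill, the history is persisted somewhere durable, and a later run combines it with input size to pick a partition count. The heuristics and thresholds are placeholders.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ArrayBuffer

// Record (stageId, durationMs, memoryBytesSpilled) for every completed stage.
class StageMetricsListener extends SparkListener {
  val stages = ArrayBuffer[(Int, Long, Long)]()
  override def onStageCompleted(e: SparkListenerStageCompleted): Unit = {
    val info = e.stageInfo
    val duration = for { s <- info.submissionTime; c <- info.completionTime } yield c - s
    stages += ((info.stageId, duration.getOrElse(0L), info.taskMetrics.memoryBytesSpilled))
  }
}

val spark = SparkSession.builder().appName("metrics-driven-tuning").getOrCreate()
val listener = new StageMetricsListener()
spark.sparkContext.addSparkListener(listener)

// ... run the job, then persist listener.stages somewhere durable (HDFS, a table) ...

// Next run: a toy heuristic that scales partitions with input size and doubles them
// if the previous run spilled to disk.
def suggestedPartitions(inputBytes: Long, spilledLastRun: Boolean): Int = {
  val targetPartitionBytes = 128L * 1024 * 1024
  val base = math.max(1, (inputBytes / targetPartitionBytes).toInt)
  if (spilledLastRun) base * 2 else base
}
```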
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ... (Spark Summit)
Spark is by its nature very fault tolerant. However, faults and application failures can and do happen in production at scale.
In this talk, we’ll discuss the nuts and bolts of fault tolerance in Spark.
We will begin with a brief overview of the sorts of fault tolerance offered, and lead into a deep dive of the internals of fault tolerance. This will include a discussion of Spark on YARN, scheduling, and resource allocation.
We will then spend some time on a case study and discussing some tools used to find and verify fault tolerance issues. Our case study comes from a customer who experienced an application outage that was root caused to a scheduler bug. We discuss the analysis we did to reach this conclusion and the work that we did to reproduce it locally. We highlight some of the techniques used to simulate faults and find bugs.
At the end, we’ll discuss some future directions for fault tolerance improvements in Spark, such as scheduler and checkpointing changes.
The document provides guidance on tuning Apache Spark jobs. It discusses tuning memory and garbage collection, optimizing shuffle operations, increasing parallelism through partitioning, monitoring jobs, and testing Spark applications.
Spark is an in-memory cluster computing framework that provides high performance for large-scale data processing. It excels over Hadoop by keeping data in memory as RDDs (Resilient Distributed Datasets) for faster processing. The document provides an overview of Spark architecture including its core-based execution model compared to Hadoop's JVM-based model. It also demonstrates Spark's programming model using RDD transformations and actions through an example of log mining, showing how jobs are lazily evaluated and distributed across the cluster.
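The log-mining example the summary refers to is conventionally sketched like this; the log path and message format are illustrative, and nothing executes until the first count().

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("log-mining").setMaster("local[*]"))

val lines  = sc.textFile("hdfs:///logs/app.log")          // transformation: nothing read yet
val errors = lines.filter(_.startsWith("ERROR")).cache()  // still lazy; marked for caching

val mysqlErrors = errors.filter(_.contains("mysql")).count()  // first action triggers the read
val phpErrors   = errors.filter(_.contains("php")).count()    // served from the cached RDD
```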
This is an introductory tutorial to Apache Spark from the Lagos Scala Meetup II. We discussed the basics of the Spark processing engine and how it relates to Hadoop MapReduce, with a little hands-on at the end of the session.
Big Data and Hadoop in Cloud - Leveraging Amazon EMR (Vijay Rayapati)
This document discusses big data, Hadoop, and using Hadoop in the cloud via Amazon EMR. It provides an overview of big data and what Hadoop is, explains how Hadoop works and how it can help store and process large datasets. It then discusses how Amazon EMR can be used to deploy Hadoop clusters in the cloud without having to manage the underlying infrastructure, and provides instructions on setting up and using EMR. Finally, it discusses debugging, profiling, and performance tuning Hadoop jobs and EMR clusters.
Tech talk about the performance tools provided with the standard Go distribution, given at the Go meetup group in Seattle:
http://www.meetup.com/golang/events/231455969/
Architecting and productionising data science applications at scale (samthemonad)
This document discusses architecting and productionizing data science applications at scale. It covers topics like parallel processing with Spark, streaming platforms like Kafka, and scalable machine learning approaches. It also discusses architectures for data pipelines and productionizing models, with a focus on automation, avoiding SQL databases, and using Kafka streams and Spark for batch and streaming workloads.
Caching and tuning fun for high scalability @ FrOSCon 2011 (Wim Godden)
The document discusses using caching and tuning techniques to improve scalability for websites. It covers caching full pages, parts of pages, SQL queries, and complex processing results. Memcache is presented as a fast and distributed caching solution. The document also discusses installing and using Memcache, as well as replacing Apache with Nginx as a lighter-weight web server that can serve static files and forward dynamic requests.
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018 (Holden Karau)
Apache Spark has driven a lot of adoption of both Scala and functional programming concepts in non-traditional industries. Many programmers in the big data world come looking for a solution to scaling their code, quickly find themselves dealing with immutable data structures and lambdas, and those who love it stay. However, there is a dark side (of escape): much of Spark’s functional programming is changing, and even though it encourages functional programming, it does so in a variety of languages with different expectations (in-line XML as a valid part of your language is fun!). This talk will look at how Spark does a good job of introducing folks to concepts like immutability, but also at places where we maybe don’t do a great job of setting developers up for a life of functional programming. Things like accumulators, our three different models for streaming data, and an “interesting” approach to closures (come to find out what the ClosureCleaner does, stay to find out why). The talk will close out with a look at how the functional-inspired API is exposed in the different languages, and how this impacts the kind of code written (Scala, Java, and Python – other languages are supported by Spark but I don’t want to re-learn Javascript or learn R just for this talk). Pictures of cute animals will be included in the slides to distract from the sad parts.
Video: https://www.youtube.com/watch?v=EDJfpkDpoE4
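A small sketch of the accumulator behaviour the talk pokes at: updates made inside a transformation only become meaningful once an action has run (and can be over-counted if a stage is retried). The input path and record format are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("accumulator-sketch").setMaster("local[*]"))

val badRecords = sc.longAccumulator("badRecords")

// Count malformed rows as a side effect of parsing (path and format are illustrative).
val parsed = sc.textFile("/data/input.csv").flatMap { line =>
  val fields = line.split(",")
  if (fields.length != 3) { badRecords.add(1); None } else Some(fields)
}

parsed.count()                                    // the action that actually runs the job
println(s"bad records seen: ${badRecords.value}") // only meaningful after the action
```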
Building Apache Cassandra clusters for massive scale (Alex Thompson)
Covering theory and operational aspects of bringing up Apache Cassandra clusters, this presentation can be used as a field reference. Presented by Alex Thompson at the Sydney Cassandra Meetup.
Explore big data at speed of thought with Spark 2.0 and Snappydata (Data Con LA)
Abstract:
Data exploration often requires running aggregation/slice-and-dice queries on data sourced from disparate sources. You may want to identify distribution patterns, outliers, etc., and aid the feature selection process as you train your predictive models. As you begin to understand your data, you want to ask ad-hoc questions expressed through your visualization tool (which typically translates to SQL queries), study the results, and iteratively explore the data set through more queries. Unfortunately, even when data sets fit in memory, computations over large data sets take time, breaking the train of thought and increasing time to insight. We know Spark can be fast through its in-memory parallel processing, but Spark 1.x isn’t quite there. Spark 2.0 promises to offer 10x better speed than its predecessor and ushers in some impressive improvements to interactive query performance. We first explore these advances - compiling the query plan, eliminating virtual function calls, and other improvements in the Catalyst engine. We compare the performance to other popular query processing engines by studying the Spark query plans. We then go through SnappyData (an open source project that integrates Spark with a database that offers OLTP, OLAP and stream processing in a single cluster), where we use smarter data colocation and synopsis data structures (e.g. stratified sampling) to dramatically cut down on the memory requirements as well as the query latency. We explain the key concepts in summarizing data using structures like stratified sampling by walking through some examples in Apache Zeppelin notebooks (an open source visualization tool for Spark) and demonstrate how we can explore massive data sets with just your laptop’s resources while achieving remarkable speeds.
Bio:
Jags Ramnarayan is a founder and the CTO of SnappyData. Previously, Jags was the Chief Architect for “fast data” products at Pivotal and served in the extended leadership team of the company. At Pivotal and previously at VMware, he led the technology direction for GemFire and other distributed in-memory products.
UKOUG version of a presentation trying to establish the sensible limits of parallelism on a couple of hardware configurations. Detailed white paper is at http://oracledoug.com/px_slaves.pdf
Fetch policy determines when pages are brought into main memory: demand paging brings pages in only when they are referenced, while prepaging brings in additional pages, often ones that are never referenced. The TLB caches page table entries to map virtual to physical addresses, and TLB misses are roughly as costly as instruction/data cache misses. When no free frames exist, a page replacement algorithm selects a victim frame to swap out and free.
Caching and tuning fun for high scalability @ phpBenelux 2011 (Wim Godden)
This document summarizes Wim Godden's presentation on caching and tuning for high scalability. It discusses various caching techniques including caching entire pages, parts of pages, SQL queries, and complex PHP results. It also covers different caching storage options like Memcache and APC. The presentation aims to increase performance, reliability, and scalability through proper caching and tuning techniques.
The document provides information on migrating to and managing databases on Amazon RDS/Aurora. Some key points include:
- RDS/Aurora handles complexity and makes the database highly available, but it also limits customization options compared to managing your own databases.
- Aurora is a MySQL-compatible database cluster that shares storage across nodes for high availability without replication lag. A cluster has writer and reader endpoints.
- CloudFormation is recommended for creating and managing Aurora clusters due to its native AWS support and ability to integrate with other services.
- Loading large amounts of data into Aurora may require using parallel dump/load tools like Mydumper/Myloader instead of mysqldump due to improved
This presentation reviews the many aspects of PHP performance that can impact day-to-day living. It explores basic concepts for resolution when PHP performance has got you down. The focus is on Zend Server configuration options including, but not limited to: caching, Apache settings, PHP syntax fundamentals, diagnosing bottlenecks, and DB2/SQL optimization.
How Adobe Does 2 Million Records Per Second Using Apache Spark! (Databricks)
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment which is used from Ingestion to Processing.
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
2. Rachel
- Rachel Warren → She/ Her
- Data Scientist / Software engineer at Salesforce Einstein
- Formerly at Alpine Data (with Holden)
- Lots of experience scaling Spark in different production environments
- The other half of the High Performance Spark team :)
- @warre_n_peace
- LinkedIn: https://www.linkedin.com/in/rachelbwarren/
- Slideshare: https://www.slideshare.net/RachelWarren4/
- Github: https://github.com/rachelwarren
3. Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC :)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams & live coding: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
● Talk feedback: http://bit.ly/holdenTalkFeedback
4.
5. Who we think you wonderful humans are?
● Nice enough people
● I’m sure you love pictures of cats
● Might know something about using Spark, or are using it in production
● Maybe a sys-admin or developer
● Are tired of spending so much time fussing with Spark settings to get jobs to run
Lori Erickson
6. The goal of this talk is to give you the resources to programmatically tune
your Spark jobs so that they run consistently and efficiently
In terms of time and $$$$$
7. What will we cover?
- A run down of the most important settings
- Getting the most out of Spark’s built-in auto-tuner options, e.g. dynamic allocation
- A few examples of errors and performance problems that can be addressed
by tuning
- A job can go out of tune over time as it and the world change, much like Holden’s Vespa.
- How to tune jobs “statically”, i.e. without historical data
- How to collect historical data
- An example
- The latest and greatest auto tuner tools
8. I can haz application :p
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("my_awesome_app")
val sc = SparkContext.getOrCreate(conf)
val rdd = sc.textFile(inputFile)
val words: RDD[String] = rdd.flatMap(_.split(" ").map(_.trim.toLowerCase))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts.saveAsTextFile(outputFile)
Trish Hamme
Settings go here
This is a shuffle
9. I can haz application :p
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("my_awesome_app")
val sc = SparkContext.getOrCreate(conf)
val rdd = sc.textFile(inputFile)
val words: RDD[String] = rdd.flatMap(_.split(" ").map(_.trim.toLowerCase))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts.saveAsTextFile(outputFile)
Trish Hamme
Start of application
End Stage 1
Stage 2
Action, Launches Job
10. Spark Execution Environment
Other App
My Spark App
- A node can have several executors
- But one executor can only be on one node
- Executors all have the same amount of memory and cores
- One task per core
- A task is the computation for one partition
- An RDD is a distributed set of partitions
11. How many resources to give my application?
● spark.executor.memory
● spark.driver.memory
● spark.executor.cores
● Enable dynamic allocation
○ (or set the number of executors)
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("my_awesome_app")
  .set("spark.executor.memory", ???)
  .set("spark.driver.memory", ???)
  .set("spark.executor.cores", ???)
12. Executor and Driver Memory
- Driver Memory
- As small as it can be without failing (but that can be pretty big)
- Executor memory + overhead should be less than the size of the container
- Think about binning (see the sketch below)
- if you have 12 GB nodes, making an 8 GB executor is maybe silly
- Pros of Fewer, Larger Executors Per Node
- Maybe less likely to OOM
- Pros of More, Smaller Executors
- Some people report a slowdown with more than ~5 cores … (more on that later)
- If using dynamic allocation, it’s easier to “scale up” on a busy cluster
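As a rough illustration of the binning idea, here is a minimal sketch. The helper name, the example numbers, and the max(384 MB, 10%) overhead rule are assumptions based on common YARN defaults, not something from the talk:

// Hypothetical helper: do N executors of a given size fit on a node once
// an (assumed) YARN-style memory overhead is added to each container?
def fitsOnNode(nodeMemoryMB: Long, executorMemoryMB: Long, executorsPerNode: Int): Boolean = {
  // assumed default overhead: max(384 MB, 10% of executor memory)
  val overheadMB = math.max(384L, (executorMemoryMB * 0.10).toLong)
  (executorMemoryMB + overheadMB) * executorsPerNode <= nodeMemoryMB
}

fitsOnNode(12 * 1024, 8 * 1024, 1)  // true, but ~3 GB of the node sits idle
fitsOnNode(12 * 1024, 5 * 1024, 2)  // true, and the node is used more evenly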
13. Vcores
- Remember 1 core = 1 task*. So the number of concurrent tasks is limited by the total cores
- With HDFS, too many cores per executor may cause issues with concurrent HDFS client threads (maybe?)
- 1 core per executor takes away some benefit of things like broadcast variables
- Think about “burning” CPU and memory equally (a worked sketch follows)
- If you have 60 GB RAM & 10-core nodes, making the default executor size 30 GB but with ten cores is maybe not so great
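A quick worked version of the “burn CPU and memory equally” idea, as a sketch (the helper and the numbers are illustrative, not from the talk):

// Memory an executor should get so that CPU and memory run out together
// on the node (ignoring overhead for simplicity).
def balancedExecutorMemoryGB(nodeMemoryGB: Double, nodeCores: Int, executorCores: Int): Double =
  nodeMemoryGB / nodeCores * executorCores

balancedExecutorMemoryGB(60, 10, 10)  // 60.0 GB for a 10-core executor
balancedExecutorMemoryGB(60, 10, 5)   // 30.0 GB for a 5-core executor
// A 30 GB / 10-core executor would burn all the cores while leaving half the memory idle.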
14. How To Enable Dynamic Allocation
Dynamic allocation allows Spark to add and remove executors between jobs over the course of an application
- To configure
- spark.dynamicAllocation.enabled=true
- spark.shuffle.service.enabled=true (you have to configure external shuffle service on each worker)
- spark.dynamicAllocation.minExecutors
- spark.dynamicAllocation.maxExecutors
- spark.dynamicAllocation.initialExecutors
- To Adjust
- Spark will add executors when there are pending tasks (spark.dynamicAllocation.schedulerBacklogTimeout)
- and exponentially increase them as long as tasks in the backlog persist
(spark...sustainedSchedulerBacklogTimeout)
- Executors are decommissioned when they have been idle for
spark.dynamicAllocation.executorIdleTimeout
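For reference, a minimal SparkConf sketch that wires these settings together (assumes the usual org.apache.spark.SparkConf import; the executor counts and timeout are illustrative values, not recommendations from the talk):

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  // the external shuffle service must also be running on each worker
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  .set("spark.dynamicAllocation.initialExecutors", "10")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")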
15. Why To Enable Dynamic Allocation
When
- Most important for shared or cost sensitive environments
- Good when an application contains several jobs of differing sizes
- The only real way to adjust resources throughout an application
Improvements
- If jobs are very short adjust the timeouts to be shorter
- For jobs that you know are large start with a higher number of initial executors to avoid slow spin up
- If you are sharing a cluster, setting max executors can prevent you from hogging it
17. Oh no! It failed :( How could we adjust it?
Suppose that in the driver log, we see a “container lost exception” and on the
executor logs we see:
java.lang.OutOfMemoryError: Java heap space
Or:
Application Memory Overhead Exceeded
This points to an out of memory error on the executors
hkase
18. Addressing Executor OOM
- If we have more executor memory to give it, try that!
- Let’s try increasing the number of partitions so that each executor will process smaller pieces of the data at once
- spark.default.parallelism = 10
- Or by adding the number of partitions to the code, e.g. reduceByKey(_ + _, numPartitions = 10) (see the sketch below)
- Many more things you can do to improve the code
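A minimal sketch of both options, reusing the wordPairs RDD from the word count example earlier (the value 200 is just an illustration; the right number depends on your data and executors):

// Option 1: raise the default parallelism in the conf
conf.set("spark.default.parallelism", "200")

// Option 2: set the partition count directly on the shuffle
val wordCounts = wordPairs.reduceByKey(_ + _, numPartitions = 200)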
20. What to do about it?
- If we see idle executors but the total size of our job is small, we may just be
requesting too many executors
- If all executors are idle, it may be because we are doing a large computation in the driver
- If the computation is very large, and we see idle executors, this may be because the executors are waiting for a “large” task → so we can increase partitions
- At some point adding partitions will slow the job down
- But only if there is not too much skew
Toshiyuki IMAI
22. Preventing Shuffle Spill to Disk
- Larger executors
- Configure off heap storage
- More partitions can help (ideally the work of all the partitions running concurrently on one executor can “fit” in that executor’s memory)
- We can adjust shuffle settings (an example conf follows this list)
- Increase the shuffle memory fraction (legacy spark.shuffle.memoryFraction; spark.memory.fraction in Spark 1.6+)
- Try increasing:
- spark.shuffle.file.buffer
- Configure an external shuffle service, so that shuffle files are served by the shuffle service rather than by the executors themselves
- spark.shuffle.io.serverThreads
- spark.shuffle.io.backLog
jaci XIII
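As an illustration, a conf sketch touching the settings above (the values are placeholders to show the shape, not tuned recommendations; spark.memory.fraction is the Spark 1.6+ unified-memory setting rather than the legacy shuffle fraction):

val conf = new SparkConf()
  .set("spark.memory.fraction", "0.8")            // give execution + storage more of the heap
  .set("spark.shuffle.file.buffer", "1m")         // buffer more in memory before writing shuffle files
  .set("spark.shuffle.io.serverThreads", "128")
  .set("spark.shuffle.io.backLog", "8192")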
23. Signs of Too Many Partitions
The number of partitions determines the size of the data each core is computing … smaller pieces are easier to process, but only up to a point
- Spark needs to keep metadata about each partition on the driver
- Driver memory errors & driver overhead errors
- Very long task “spin up” time
- Too many partitions at read time is usually caused by many small part files
- Lots of pending tasks & Low memory utilization
- Long file write time for relatively small I/O “size” (especially with blockstores)
Dorian Wallender
24. PYTHON SETTINGS
● Application memory overhead
○ We can tune this based on if an app is PySpark or not
○ In the proposed PySpark on K8s PR this is done for us
○ More tuning may still be required
● Buffers & batch sizes oh my
○ spark.sql.execution.arrow.maxRecordsPerBatch
○ spark.python.worker.memory - defaults to 512m, but the default memory for Python can be lower :(
■ Set it based on the amount of memory assigned to Python to reduce OOMs
○ Normally automatic, but sometimes set wrong - a code change is required :( (example conf below)
Nessima E.
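A hedged sketch of what those knobs look like in a conf (the values are made up for illustration; spark.executor.memoryOverhead is the post-2.3 name, previously spark.yarn.executor.memoryOverhead):

val conf = new SparkConf()
  .set("spark.executor.memoryOverhead", "2g")                   // room for the Python workers
  .set("spark.python.worker.memory", "1g")                      // Python worker memory before it spills
  .set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")  // smaller Arrow batches, smaller memory peaks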
25. Tuning Can Help With
- Low cluster utilization ($$$$)
- Out of memory errors
- Spill to Disk / Slow shuffles
- GC errors / High GC overhead
- Driver / Executor overhead exceptions
- Reliable deployment
Melinda Seckington
26. Things you can’t “tune away”
- Silly Shuffles
- You can make each shuffle faster through tuning but you cannot change the fundamental size
or number of shuffles without adjusting code
- Think about how much you are moving data, and how to do it more efficiently
- Be very careful with column based operations on Spark SQL
- Unbalanced shuffles (caused by unbalanced keys)
- if one or two tasks are much larger than the others
- #Partitions is bounded by #distinct keys
- Bad object management / data structure choices
- Can lead to memory exceptions / memory overhead exceptions / gc errors
- Serialization exceptions*
- *Exceptions due to cost of serialization can be tuned
- Python
- This makes Holden sad, buuuut. Arrow?
Neil Piddock
27. But enough doom, lets watch (our new) software fail.
Christian Walther
28. But what if we run a billion Spark jobs per day?
29. Tuning Could Depend on 4 Factors:
1. The execution environment
2. The size of the input data
3. The kind of computation (ML, python, streaming)
_______________________________
4. The historical runs of that job
We can get all this information programmatically!
30. Execution Environment
What we need to know about where the job will run
- How much memory do I have available to me: on a single node/ on the cluster
- In my queue
- How much CPU: on a single node, on the cluster → corresponds to total
number of concurrent tasks
- We can get all this information from the YARN API (https://dzone.com/articles/how-to-use-the-yarn-api-to-determine-resources-ava-1)
- Can I configure dynamic allocation?
- Cluster Health
31. Size of the input data
- How big is the input data on Disk
- How big will it be in memory?
- “Coefficient of in-memory expansion”: shuffle spill (memory) / shuffle spill (disk) (see the sketch below)
- Can get historically or guess
- How many part files? (the default number of partitions at read)
- Cardinality and type?
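One way to turn the spill metrics from a previous run into an estimate of the in-memory size, as a sketch with made-up names (not the talk’s actual code):

// expansion = shuffle spill (memory) / shuffle spill (disk), falling back to 1.0 if we never spilled
def estimatedInMemorySizeMB(inputSizeOnDiskMB: Double,
                            shuffleSpillMemoryBytes: Long,
                            shuffleSpillDiskBytes: Long): Double = {
  val expansion =
    if (shuffleSpillDiskBytes <= 0) 1.0
    else shuffleSpillMemoryBytes.toDouble / shuffleSpillDiskBytes
  inputSizeOnDiskMB * expansion
}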
32. Setting # partitions on the first try
Assuming you have many distinct keys, you want to try to make
partitions small enough that each partition fits in the memory “available
to each task” to avoid spilling to disk or failing
See Sandy Ryza’s much-cited post
https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
# partitions ≈ (size of input data in memory) / (amount of memory available per task on the executors)
(Image: a child eating a melon; the child is an executor with one core, a partition is a slice of melon, and “this bite is too big”)
33. How much memory should each task use?
// assumes sparkConf (the job’s SparkConf) is in scope, e.g. a field on the enclosing class
def availableTaskMemoryMB(executorMemoryMB: Long): Double = {
  val memFraction = sparkConf.getDouble("spark.memory.fraction", 0.6)
  val storageFraction = sparkConf.getDouble("spark.memory.storageFraction", 0.5)
  val nonStorage = 1 - storageFraction
  val cores = sparkConf.getInt("spark.executor.cores", 1)
  // execution memory per core, in MB
  Math.ceil((executorMemoryMB * memFraction * nonStorage) / cores)
}
34. Partitions could be inputDataSize/ memory per task
def determinePartitionsFromInputDataSize(inputDataSize: Double): Int = {
  // executor memory in MB, read from the same SparkConf as above
  val executorMemoryMB = sparkConf.getSizeAsMb("spark.executor.memory", "1g")
  Math.round(inputDataSize / availableTaskMemoryMB(executorMemoryMB)).toInt
}
36. What kinds of historical data could we use?
- Historical cluster info
- (was one node sad? How many executors were used?)
- Stage Metrics
- duration (only on an isolated cluster)
- executor CPU
- vcores/Second
- Task Metrics
- Shuffle read & shuffle write → how much data is being moved & how big is input data
- “Total task time” = time for each task * number of tasks
- Shuffle read spill memory / shuffle spill disk = input data expansion in memory
- What kind of computation was run (python/streaming)
- Who was running the computation
37. How do we save the metrics?
- For a human: Spark history server can be persisted even if the cluster is
terminated
- For a program: Can use listeners to save everything from web UI
- (See Spark Measure and read these at the start of job)
- Can also get significantly more by capturing the full event stream
- We can save information at task/ stage/ job level.
38. Meet Robin Sparkles!!!!!
- Saves historical data using listeners from Spark Measure
- ( https://github.com/LucaCanali/sparkMeasure)
- Create directory scheme to save task and stage data for
successive runs
- Read in n previous runs before creating conf
- Use the result of the previous runs in tandem with current Spark
Conf to Optimize some Spark Settings. (WIP)
- Note: not to be used in production, simply the start of a spell
book.
Let’s go to the Mall!
39. Start a Spark listener and save metrics
Extend Spark Measure flight recorder which automatically saves metrics:
class RobinStageListener(sc: SparkContext, override val metricsFileName: String)
extends ch.cern.sparkmeasure.FlightRecorderStageMetrics(sc.getConf) {
}
Start Spark Listener
val myStageListener = new RobinStageListener(sc,
stageMetricsPath(runNumber))
sc.addSparkListener(myStageListener)
40. Add listener to program code
def run(sc: SparkContext, id: Int, metricsDir: String,
inputFile: String, outputFile: String): Unit = {
val metricsCollector = new MetricsCollector(sc, metricsDir)
metricsCollector.startSparkJobWithRecording(id)
//some app code
}
41. Read in the metrics
val STAGE_METRICS_SUBDIR = "stage_metrics"
val metricsDir = s"$metricsRootDir/${appName}"
val stageMetricsDir = s"$metricsDir/$STAGE_METRICS_SUBDIR"
def stageMetricsPath(n: Int): String = s"$stageMetricsDir/run=$n"
def readStageInfo(n : Int) =
ch.cern.sparkmeasure.Utils.readSerializedStageMetrics(stageMetricsPath(n))
42. To use: read in metrics, then create optimized conf
val conf = new SparkConf() ….
val (newConf: SparkConf, id: Int) = Runner.getOptimizedConf(metricsDir, conf)
val sc = SparkContext.getOrCreate(newConf)
Runner.run(sc, id, metricsDir, inputFile, outputFile)
43. Put it all together!
A programmatic way of defining the number of
partitions …
(Chart: performance vs. number of partitions)
44. Parallelism / number of partitions
Keep increasing the number of partitions until the metric we care about stops
improving
- spark.default.parallelism is only the default if the number of partitions is not set in code
- Different stages could / should have different partitions
- can also compute for each stage and then set in the shuffle phase:
- rdd.reduceByKey(_ + _, numPartitions = x)
- You can design your job to get the number of partitions from a variable in the conf (see the sketch after this list)
- Which stage to tune if we can only do one:
- We can search for the longest stage
- If we are tuning for parallelism, we might want to capture the stage “with the biggest shuffle”
- We use “total shuffle write” which is the number of bytes written in shuffle files during the stage
(this should be setting agnostic)
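A minimal sketch of reading the partition count from the conf (the config key myapp.wordcount.partitions is made up; it just needs to be a key the tuner knows how to set before the job starts):

// the auto tuner writes this key into the conf before the job runs
val shufflePartitions = sc.getConf.getInt("myapp.wordcount.partitions", 200)
val wordCounts = wordPairs.reduceByKey(_ + _, shufflePartitions)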
45. What metric defines better?
- Depends on your business … often a trade off between speed and efficiency
- If you just want speed Holden has several cloud solutions to sell you (or would if she knew
anyone in the sales department).
- The cluster is “fully utilized” when: stage duration * executors ≈ sum(task time) + fudge factor (a rough check is sketched below)
- WARNING: stage duration also depends on instance & executor spin-up, network delays, etcetera
- Could use anything that Spark measures:
- Executor Serialization time
- Deserialization time
- Executor Idle time
- Here, “fastest” stage → lowest executor CPU time (the total CPU time across all of its tasks)
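A rough version of that utilization check as code, as a sketch (the fudge factor and the parameter names are assumptions, not from the talk):

// "Fully utilized" if the executors' wall-clock capacity is within a fudge
// factor of the summed task time for the stage.
def roughlyFullyUtilized(stageDurationMs: Long,
                         numExecutors: Int,
                         sumTaskTimeMs: Long,
                         fudge: Double = 1.2): Boolean =
  stageDurationMs * numExecutors <= sumTaskTimeMs * fudge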
46. Compute partitions given a list of metrics for the “biggest stage” across several runs
def fromStageMetricSharedCluster(previousRuns: List[StageInfo]): Int = {
previousRuns match {
case Nil =>
//If this is the first run and parallelism is not provided, use the number of concurrent tasks
//We could also look at the file on disk
possibleConcurrentTasks()
case first :: Nil =>
val fromInputSize = determinePartitionsFromInputDataSize(first.totalInputSize)
Math.max(first.numPartitionsUsed + math.max(first.numExecutors,1), fromInputSize)
// (the match continues on the next slide)
47. If we have several runs of the same computation
case _ =>
  val first = previousRuns(previousRuns.length - 2)
  val second = previousRuns(previousRuns.length - 1)
  if (morePartitionsIsBetter(first, second)) {
    // increase the number of partitions beyond everything that we have tried
    Seq(first.numPartitionsUsed, second.numPartitionsUsed).max + second.numExecutors
  } else {
    // if we overshot the number of partitions, use whichever run had the best executor CPU time
    previousRuns.sortBy(_.executorCPUTime).head.numPartitionsUsed
  }
}
}
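The helper morePartitionsIsBetter is not shown on the slides; a plausible sketch, assuming StageInfo is Robin Sparkles’ own record with the fields used above:

// "Better" here = the run with more partitions had lower total executor CPU time.
def morePartitionsIsBetter(first: StageInfo, second: StageInfo): Boolean = {
  val (fewer, more) =
    if (first.numPartitionsUsed <= second.numPartitionsUsed) (first, second)
    else (second, first)
  more.executorCPUTime < fewer.executorCPUTime
}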
48. How to use Robin-Sparkles
- Assume user constructs the Spark Context
- Configure listener to write to a persistent location
- Maybe need to be in an external location if you are using a web based system
- Modify the application code that creates the Spark context to use Robin
Sparkles to set the settings
Note: not applicable where the Spark context is created for you, except for tuning the number of partitions or other variables that can change at the job (rather than application) level
49. She could do so much more!
- Additional settings to tune
- Set executor memory and driver memory based on system
- Executor health and system monitoring
- Could surface warnings about data skew
- Shuffle settings could be adjusted
- Currently we don’t get info if the stage fails
- Our reader would actually fail if stage data is malformed; we should read partial stage data
- We could use the firehose listener, which records every event from the event stream, to monitor failures
- These other listeners would contain additional information that is not in the web ui
51. Other Tools
Sometimes the right answer isn’t tuning, it’s telling the user to change the code
(see Sparklens) or telling the administrator to look at their cluster
Melinda Seckington
52. Sparklens - Tells us what to tune or refactor
● AppTimelineAnalyzer
● EfficiencyStatisticsAnalyzer
● ExecutorTimelineAnalyzer
● ExecutorWallclockAnalyzer
● HostTimelineAnalyzer
● JobOverlapAnalyzer
● SimpleAppAnalyzer
● StageOverlapAnalyzer
● StageSkewAnalyzer
Kitty Terwolbeck
53. Dr Elephant
● Spark ConfigurationHeuristic
○ val SPARK_DRIVER_MEMORY_KEY = "spark.driver.memory"
○ val SPARK_EXECUTOR_MEMORY_KEY = "spark.executor.memory"
○ val SPARK_EXECUTOR_INSTANCES_KEY = "spark.executor.instances"
○ val SPARK_EXECUTOR_CORES_KEY = "spark.executor.cores"
○ val SPARK_SERIALIZER_KEY = "spark.serializer"
○ val SPARK_APPLICATION_DURATION = "spark.application.duration"
○ val SPARK_SHUFFLE_SERVICE_ENABLED = "spark.shuffle.service.enabled"
○ val SPARK_DYNAMIC_ALLOCATION_ENABLED = "spark.dynamicAllocation.enabled"
○ val SPARK_DRIVER_CORES_KEY = "spark.driver.cores"
○ val SPARK_DYNAMIC_ALLOCATION_MIN_EXECUTORS =
"spark.dynamicAllocation.minExecutors"
○ etc.
● Among other things
55. Upcoming settings in 2.4 (probably)
spark.executor.instances used dynamic allocation for Spark Streaming (zomg!)
spark.executor.pyspark.memory - hey what if we told Python how much memory
to use?
56. Some upcoming talks
● Lambda World Seattle - Bringing the Jewels of Python to Scala
● Spark Summit London
● Reversim Tel Aviv (keynote)
● PyCon Canada Toronto (keynote)
● ScalaX London
● Big Data Spain
57. High Performance Spark!
You can buy it today! And come to our book signing at 2:30.
The settings didn’t get their own chapter; they’re in the appendix (doing things on time is hard)
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print rather than e-book.
58. Thanks and keep in touch!
Discussion and slides - http://bit.ly/2CNcpUQ :D
Editor's Notes
Photo from https://www.flickr.com/photos/lorika/4148361363/
What we will cover
The most important Spark settings in the context of the Spark execution environment
Some examples of errors and performance problems that we see in Spark jobs
Tuning jobs statically, and then with historical data
Open source tools to collect historical data
Do stuff
Other auto-tuning Spark tools
Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using both historical and live job information, using systems like Apache BEAM, Mahout, and internal Spark ML jobs as workloads. Much of the data required to effectively tune jobs is already collected inside of Spark. You just need to understand it. Holden, Rachel, and Anya outline sample auto-tuners and discuss the options for improving them and applying similar techniques in your own work. They also discuss what kind of tuning can be done statically (e.g., without depending on historic information) and look at Spark’s own built-in components for auto-tuning (currently dynamically scaling cluster size) and how you can improve them.
Even if the idea of building an auto-tuner sounds as appealing as using a rusty spoon to debug the JVM on a haunted supercomputer, this talk will give you a better understanding of the knobs available to you to tune your Apache Spark jobs.
Also, to be clear, Holden, Rachel, and Anya don’t promise to stop your pager going off at 2:00am, but hopefully this helps.
Big data talk joke
So this is the spark conf
This is where we put the spark settings if we need them
Starting the Spark context corresponds to starting the Spark application (kicks off the machinery)
Action
Two stages: everything that happens before the data movement goes in the first stage
Mention that it doesn’t always look like this and that there are some other ways of …
Think about how to present this because the executors are different sizes;...
A few large executors can be bad because of GC
Look up which HDFS daemon this is
Sort of, unless you change it. Terms and conditions apply to Python users.
Spark will add executors when there are pending tasks (spark.dynamicAllocation.schedulerBacklogTimeout)
and exponentially increase them as long as tasks in the backlog persist (spark...sustainedSchedulerBacklogTimeout)
Executors are decommissioned when they have been idle for spark.dynamicAllocation.executorIdleTimeout
Now we are going to look at some ways that we can fall over
HOLDEN Speakzzzz
RESEARCH OTHER METRICS OF CLUSTER UTILIZATION
if partitions = total cores, then the next stage will wait on the slowest partition
If we have more partitions than the total number of cores, then several tasks with fewer records can be executed while one task with many records is executed.
At some point there are too many partitions, the overhead gets high, and the job will slow down
Shuffle spill (memory) is the size of the deserialized form of the data in memory at the time when we spill it, whereas shuffle spill (disk) is the size of the serialized form of the data on disk after we spill it. This is why the latter tends to be much smaller than the former. Note that both metrics are aggregated over the entire duration of the task (i.e. within each task you can spill multiple times).
External shuffle service is storing shuffle files somewhere else
Relative CPU and GC time maybe define them ...
Requesting too many resources should say that earlier
Rachel Starts to Speak again :)))))
Give that configuration
Mem (spark.memory.fraction) is the fraction of the heap that is reserved for execution and storage.
Storage (spark.memory.storageFraction) is the amount that is reserved for caching (but can be used for execution when free) .. e.g. could have cached data in it wooo
Storage is for caching