Presented at Spark+AI Summit Europe 2019
https://databricks.com/session_eu19/apache-spark-at-scale-in-the-cloud
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
Using the Spark UI and simple metrics, this talk explores how to diagnose and remedy issues on jobs such as:
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting out GC – when to go parallel, when to go G1, when offheap can help you)
Shuffle (give a little to get a lot – configs for better out-of-box shuffle)
Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
2. Rose Toomey, Coatue Management
Spark At Scale In the Cloud
#UnifiedDataAnalytics #SparkAISummit
3. About me
NYC. Finance. Technology. Code.
• At each job I wrote code but found that the data challenges just kept growing
– Lead API Developer at Gemini Trust
– Director at Novus Partners
• Now: coding and working with data full time
– Software Engineer at Coatue Management
4. How do you process this…
Numbers are approximate.
• Dataset is 35+ TiB raw
• Input files are 80k+ unsplittable compressed row-based format with heavy skew, deeply nested directory structure
• Processing results in 275+ billion rows cached to disk
• Lots of data written back out to S3
– Including stages ending in sustained writes of tens of TiB
5. On a very big Spark cluster…
Sometimes you just need to bring the entire dataset into memory.
The more nodes a Spark cluster has, the more important configuration tuning becomes.
Even more so in the cloud, where you will regularly experience I/O variance and unreliable nodes.
6. In the cloud?
• Infrastructure management is hard
– Scaling resources and bandwidth in a datacenter is not instant
– Spark/Hadoop clusters are not islands – you’re managing an entire ecosystem of supporting players
• Optimizing Spark jobs is hard
Let’s limit the number of hard things we’re going to tackle at once.
7. Things going wrong at scale
Everything is relative. In smaller clusters, these configurations worked fine.
• Everything is waiting on everything else because Netty doesn't have enough firepower to shuffle faster
• Speculation meets skew and relaunches the very slowest parts of a join, leaving most of the cluster idle
• An external service rate limits, which causes blacklisting to sideline most of a perfectly good cluster
8. Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Networking
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
9. Putting together a big Spark cluster
• What kind of nodes should the cluster have? Big? Small? Medium?
• What's your resource limitation for the number of executors?
– Just memory (standalone)
– Both memory and vCPUs (YARN)
• Individual executors should have how much memory and how many virtual CPUs?
Galactic Wreckage in Stephan's Quintet
10. One Very Big Standalone Node
One mega instance configured with many "just right" executors, each provisioned with
• < 32 GiB heap (sweet spot for GC)
• 5 cores (for good throughput)
• Minimizes shuffle overhead
• Like the pony, not offered by your cloud provider. Also, poor fault tolerance.
11. Multiple Medium-sized Nodes
When looking at medium sized nodes, we have a choice:
• Just one executor
• Multiple executors
But a single executor might not be the best resource usage:
• More cores on a single executor is not necessarily better
• When using a cluster manager like YARN, more executors could be a more efficient use of CPU and memory
12. Many Small Nodes
• 500+ small nodes
• Each node over-provisioned relative to multiple executor per node configurations
• Single executor per node
• Most fault tolerant but big communications overhead
“Desperate affairs require desperate measures.”
Vice Admiral Horatio Nelson
13. Why ever choose the worst solution?
Single executor per small (or medium) node is the worst configuration for cost, provisioning, and resource usage. Why not recommend against it?
• Resilient to node degradation and loss
• Quick transition to production: relative over-provisioning of resources to each executor behaves more like a notebook
• Awkward instance sizes may provision more quickly than larger instances
14. Onward!
Now that you have your cluster composition in mind, you’ll need to scale up your base infrastructure to support the number of nodes:
• Memory and garbage collection
• Tune RPC for cluster communications
• Where do you put very large datasets?
• How do you get them off the cluster?
• No task left behind: scheduling in difficult times
15. Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Networking
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
16. Spark memory management
SPARK-10000: Consolidate storage and execution memory management
• NewRatio controls Young/Old proportion
• spark.memory.fraction sets storage and execution space to ~60% of tenured space
[Diagram: executor heap layout – Young Generation 1/3, Old Generation 2/3, 300m reserved; spark.memory.fraction ~60%: 50% execution (dynamic – will take more) and 50% storage (spark.memory.storageFraction); remaining ~40%: Spark metadata, user data structures, OOM safety]
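For concreteness, a minimal sketch of setting these knobs on a session – the values below are just the defaults discussed above, not recommendations for every workload:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-tuning-sketch")
  // share of (heap - 300 MiB) unified for execution + storage (default 0.6)
  .config("spark.memory.fraction", "0.6")
  // portion of the unified region protected for storage (default 0.5)
  .config("spark.memory.storageFraction", "0.5")
  // keep Old Generation at 2/3 of the heap so it covers spark.memory.fraction
  .config("spark.executor.extraJavaOptions", "-XX:NewRatio=2")
  .getOrCreate()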
18. Field guide to Spark GC tuning
• Lots of minor GC - easy fix
– Increase Eden space (high allocation rate)
• Lots of major GC - need to diagnose the trigger
– Triggered by promotion - increase Eden space
– Triggered by Old Generation filling up - increase Old Generation space or decrease spark.memory.fraction
• Full GC before stage completes
– Trigger minor GC earlier and more often
19. Full GC tailspin
Balance sizing up against tuning code
• Switch to bigger and/or more nodes
• Look for slow running stages caused by avoidable shuffle, tune joins and aggregation operations
• Checkpoint both to preserve work at strategic points but also to truncate DAG lineage (see the sketch below)
• Cache to disk only
• Trade CPU for memory by compressing data in memory using spark.rdd.compress
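A hedged sketch of the checkpoint-and-compress tactics above; the paths and dataset are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.rdd.compress", "true") // trade CPU for memory
  .getOrCreate()
// hypothetical checkpoint location
spark.sparkContext.setCheckpointDir("s3a://my-bucket/checkpoints")

val enriched = spark.read.parquet("s3a://my-bucket/raw").groupBy("key").count()
// checkpoint() materializes the result and truncates the DAG lineage, so a
// GC tailspin or lost executor doesn't force recomputation from the source
val stable = enriched.checkpoint()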
20. Which garbage collector?
Throughput or latency?
• ParallelGC favors throughput
• G1GC is low latency
– Shiny new things like string deduplication
– Vulnerable to wide rows
Whichever you choose, collect early and often.
21. Where to cache big datasets
• To disk. Which is slow.
• But frees up as much tenured space as possible for execution, and for storing things which must be in memory
– internal metadata
– user data structures
– broadcasting the skew side of joins
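As a sketch, caching to disk is just a storage level – bigDf and the path are placeholders:

import org.apache.spark.storage.StorageLevel

val bigDf = spark.read.parquet("s3a://my-bucket/huge-dataset")
// keep the heap for execution; blocks go to local disk instead
bigDf.persist(StorageLevel.DISK_ONLY)
bigDf.count() // materialize the cache before downstream stages reuse it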
23. Perils of caching to disk
19/04/13 01:27:33 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_48_27005 !
When you lose an executor, you lose all the cached blocks stored by that executor even if the node is still running.
• If lineage is gone, the entire job will fail
• If lineage is present, RDD#getOrCompute tries to compensate for the missing blocks by re-ingesting the source data. While it keeps your job from failing, this could introduce enormous slowdowns if the source data is skewed, your ingestion process is complex, etc.
24. Self healing block management
// use this with replication >= 2 when caching to disk in a non-distributed filesystem
spark.storage.replication.proactive = true
Pro-active block replenishment in case of node/executor failures
https://issues.apache.org/jira/browse/SPARK-15355
https://github.com/apache/spark/pull/14412
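A sketch pairing this config with a replicated disk cache (the path is a placeholder): with two replicas, losing one executor no longer loses the block, and proactive replenishment rebuilds the missing replica in the background.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .config("spark.storage.replication.proactive", "true")
  .getOrCreate()

val cached = spark.read.parquet("s3a://my-bucket/huge-dataset")
cached.persist(StorageLevel.DISK_ONLY_2) // DISK_ONLY with replication = 2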
25. Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Networking
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
26. Tune RPC for cluster communications
The Netty server processing RPC requests is the backbone of both authentication and shuffle services.
Insufficient RPC resources cause slow-speed mayhem: clients disassociate, operations time out.
org.apache.spark.network.util.TransportConf is the shared config for both shuffle and authentication services.
Ruth Teitelbaum and Marlyn Meltzer reprogramming ENIAC, 1946
27. Scaling RPC
// used for auth
spark.rpc.io.serverThreads = coresPerDriver * rpcThreadMultiplier
// used for shuffle
spark.shuffle.io.serverThreads = coresPerDriver * rpcThreadMultiplier
Where "RPC thread multiplier" is a scaling factor to increase the service's thread pool.
• 8 is aggressive, might cause issues
• 4 is moderately aggressive
• 2 is recommended (start here, benchmark, then increase)
• 1 (number of vCPU cores) is default but is too small for a large cluster
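Expressed as code, assuming a hypothetical 16-core driver and the recommended multiplier of 2:

import org.apache.spark.SparkConf

val coresPerDriver = 16 // placeholder: size for your own driver
val rpcThreadMultiplier = 2 // start here, benchmark, then increase

val conf = new SparkConf()
  .set("spark.rpc.io.serverThreads", (coresPerDriver * rpcThreadMultiplier).toString) // auth
  .set("spark.shuffle.io.serverThreads", (coresPerDriver * rpcThreadMultiplier).toString) // shuffle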
28. Shuffle
The definitive presentation on shuffle tuning: Tuning Apache Spark for Large-Scale Workloads (Gaoxiang Liu and Sital Kedia)
So this section focuses on
• Some differences to configurations presented in Liu and Kedia's presentation, as well as
• Configurations that weren't shown in their presentation
29. Strategy for lots of shuffle clients
1. Scale the server way up
// mentioned in Liu/Kedia presentation but now deprecated
// spark.shuffle.service.index.cache.entries = 2048
// default: 100 MiB
spark.shuffle.service.index.cache.size = 256m
// length of accept queue. default: 64
spark.shuffle.io.backLog = 8192
// default (not increased by spark.network.timeout)
spark.rpc.lookupTimeout = 120s
30. Strategy for lots of shuffle clients
2. Make clients more patient, more fault tolerant, fewer simultaneous requests in flight
spark.reducer.maxReqsInFlight = 5 // default: Int.MaxValue
spark.shuffle.io.maxRetries = 10 // default: 3
spark.shuffle.io.retryWait = 60s // default 5s
31. Strategy for lots of shuffle clients
spark.shuffle.io.numConnectionsPerPeer = 1
Scaling this up conservatively for multiple executor per node configurations can be helpful.
Not recommended to change the default for single executor per node.
32. Shuffle partitions
spark.sql.shuffle.partitions = max(1, nodes - 1) * coresPerExecutor * parallelismPerCore
where parallelism per core is some hyperthreading factor, let's say 2. It's not the best for large shuffles although it can be adjusted.
Apache Spark Core—Deep Dive—Proper Optimization (Daniel Tomes) recommends setting this value to max(cluster executor cores, shuffle stage input / 200 MB). That translates to 5242 partitions per TB. Highly aggressive shuffle optimization is required for a large dataset on a cluster with a large number of executors.
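A sketch of both heuristics, with illustrative numbers:

val nodes = 500
val coresPerExecutor = 5
val parallelismPerCore = 2 // hyperthreading factor

// baseline: size for the cluster
val byCluster = math.max(1, nodes - 1) * coresPerExecutor * parallelismPerCore

// Tomes: size for the shuffle stage input at ~200 MB per partition
val shuffleInputBytes = 50L * 1024 * 1024 * 1024 * 1024 // e.g. a 50 TiB stage
val byInput = (shuffleInputBytes / (200L * 1000 * 1000)).toInt

spark.conf.set("spark.sql.shuffle.partitions", math.max(byCluster, byInput).toString)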
33. Kill Spill
spark.shuffle.spill.numElementsForceSpillThreshold = 25000000
spark.sql.windowExec.buffer.spill.threshold = 25000000
spark.sql.sortMergeJoinExec.buffer.spill.threshold = 25000000
• Spill is the number one cause of poor performance on very large Spark clusters. These settings control when Spark spills data from memory to disk – the defaults are a bad choice!
• Set these to a big Integer value – start with 25000000 and increase if you can. More is more.
• SPARK-21595: Separate thresholds for buffering and spilling in ExternalAppendOnlyUnsafeRowArray
34. Scaling AWS S3 Writes
Hadoop AWS S3 support in 3.2.0 is amazing
• Especially the new S3A committers
https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/index.html
EMR: write to HDFS and copy off using s3DistCp (limit reducers if necessary)
Databricks: writing directly to S3 just works
First NASA ISINGLASS rocket launch
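A hedged sketch of turning on the S3A directory committer with Hadoop 3.2 – the class names follow the hadoop-aws / spark-hadoop-cloud documentation and may differ in your distribution:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.hadoop.fs.s3a.committer.name", "directory") // or "partitioned", "magic"
  .set("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .set("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")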
35. Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Services
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
36. Task Scheduling
Spark's powerful task scheduling settings can interact in unexpected ways at scale.
• Dynamic resource allocation
• External shuffle
• Speculative Execution
• Blacklisting
• Task reaper
Apollo 13 Mailbox at Mission Control
37. Dynamic resource allocation
Dynamic resource allocation benefits a multi-tenant cluster where multiple applications can share resources.
If you have an ETL pipeline running on a large transient Spark cluster, dynamic allocation is not useful to your single application.
Note that even in the first case, when your application no longer needs some executors, those cluster nodes don't get spun down:
• Dynamic allocation requires an external shuffle service
• The node stays live and shuffle blocks continue to be served from it
38. External shuffle service
spark.shuffle.service.enabled = true
spark.shuffle.registration.timeout = 60000 // default: 5000 ms
spark.shuffle.registration.maxAttempts = 5 // default: 3
Even without dynamic allocation, an external shuffle service may be a good idea.
• If you lose executors through dynamic allocation, the external shuffle process still serves up those blocks.
• The external shuffle service could be more responsive than the executor itself
However, the registration values are insufficient for a large busy cluster:
SPARK-20640 Make rpc timeout and retry for shuffle registration configurable
39. Speculative execution
When speculative execution works as intended, tasks running slowly due to transient node issues don't bog down that stage indefinitely.
• Spark calculates the median execution time of all tasks in the stage
• spark.speculation.quantile - don't start speculating until this percentage of tasks are complete (default 0.75)
• spark.speculation.multiplier - expressed as a multiple of the median execution time, this is how slow a task must be to be considered for speculation
• Whichever task is still running when the first finishes gets killed
40. One size does not fit all
spark.speculation = true
spark.speculation.quantile = 0.8 //default: 0.75
spark.speculation.multiplier = 4 // default: 1.5
These were our standard speculative execution settings. They worked "fine" in most of our pipelines. But they worked fine because the median size of the tasks at 80% was OK.
What happens when reasonable settings meet unreasonable data?
42. Speculation: unintended consequences
The median task length is based on the fast 80% – but due to heavy skew, this estimate is bad! It causes the scheduler to take the worst part of the job and launch more copies of the longest running tasks ... one of which then gets killed.
spark.speculation = true
// start later (might get a better estimate)
spark.speculation.quantile = 0.90
// default 1.5 - require a task to be really bad
spark.speculation.multiplier = 6
The solution was two-fold:
• Start speculative execution later (increase the quantile) and require a greater slowness multiplier
• Do something about the skew
43. Benefits of speculative execution
• Speculation can be very helpful when the application is interacting with an external service. Example: writing to S3
• When speculation kills a task that was going to fail anyway, it doesn't count against the failed tasks for that stage/executor/node/job
• Clusters are not tuned in a day! Speculation can help pave over slowdowns caused by scaling issues
• Useful canary: when you see tasks being intentionally killed in any quantity, it's worth investigating why
44. Blacklisting
spark.blacklist.enabled = true
spark.blacklist.task.maxTaskAttemptsPerExecutor = 1 // task blacklisted from executor
spark.blacklist.stage.maxFailedTasksPerExecutor = 2 // executor blacklisted from stage
// how many different tasks must fail in successful task sets before executor
// is blacklisted from application
spark.blacklist.application.maxFailedTasksPerExecutor = 2
spark.blacklist.timeout = 1h // executor removed from blacklist, takes new tasks
Blacklisting prevents Spark from scheduling tasks on executors/nodes which have failed too many times in the current stage.
The default number of failures is too conservative when using flaky external services. Let's see how quickly it can add up...
46. Blacklisting gone wrong
• While writing three very large datasets to S3, something went wrong about 17 TiB in
• 8600+ errors trying to write to S3 in the space of eight minutes, distributed across 1000 nodes
– Some executors backoff and retry, succeed
– Speculative execution kicks in, padding the blow
– But all the nodes quickly accumulate at least two failed tasks, many have more and get blacklisted
• Eventually translating to four failed tasks, killing the job
48. Don't blacklist too soon
• We enabled blacklisting but didn't adjust the defaults because we never "needed" to before
• Post mortem showed cluster blocks were too large for our s3a settings
spark.blacklist.enabled = true
spark.blacklist.stage.maxFailedTasksPerExecutor = 8 // default: 2
spark.blacklist.application.maxFailedTasksPerExecutor = 24 // default: 2
spark.blacklist.timeout = 15m // default: 1h
Solution was to
• Make blacklisting a lot more tolerant of failure
• Repartition data on write for better block size
• Adjust s3a settings to raise multipart upload size
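A sketch of that fix under stated assumptions – the partition count, sizes, path, and df are illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // raise the size of each multipart part and the threshold where
  // multipart uploads kick in (hadoop-aws fs.s3a.* settings)
  .set("spark.hadoop.fs.s3a.multipart.size", "256M")
  .set("spark.hadoop.fs.s3a.multipart.threshold", "512M")

// repartition on write so output files land near the target block size
df.repartition(5000).write.parquet("s3a://my-bucket/output")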
49. Don't fear the reaper
spark.task.reaper.enabled = true
// default: -1 (prevents executor from self-destructing)
spark.task.reaper.killTimeout = 180s
The task reaper monitors interrupted or killed tasks to make sure they actually shut down.
On a large job, give a little extra time before killing the JVM
• If you've increased timeouts, the task may need more time to shut down cleanly
• If the task reaper kills the JVM abruptly, you could lose cached blocks
SPARK-18761 Uncancellable / unkillable tasks may starve jobs of resources
50. Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Services
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
51. Increase tolerance
• If you find a timeout or number of retries, raise it
• If you find a buffer, backlog, queue, or threshold, increase it
• If you have an MR task with a number of reducers trying to use a service concurrently in a large cluster
– Either limit the number of active tasks per reducer, or
– Limit the number of reducers active at the same time
52. Be more patient
// default - might be too low for a large cluster under load
spark.network.timeout = 120s
Spark has a lot of different networking timeouts. This is the biggest knob to turn: increasing this increases many settings at once.
(This setting does not increase the spark.rpc.timeout used by shuffle and authentication services.)
53. Executor heartbeat timeouts
spark.executor.heartbeatInterval = 10s // default
spark.executor.heartbeatInterval should be significantly less than spark.network.timeout.
Executors missing heartbeats usually signify a memory issue, not a network problem.
• Increase the number of partitions in the dataset
• Remediate skew causing some partition(s) to be much larger than the others
54. Be resilient to failure
spark.stage.maxConsecutiveAttempts = 10 // default: 4
// default: 4 (would go higher for cloud storage misbehavior)
spark.task.maxFailures = 12
spark.max.fetch.failures.per.stage = 10 // default: 4 (helps shuffle)
Increase the number of failures your application can accept at the task and stage level.
Use blacklisting and speculation to your advantage. It's better to concede some extra resources to a stage which eventually succeeds than to fail the entire job:
• Note that tasks killed through speculation - which might otherwise have failed - don't count against you here.
• Blacklisting - which in the best case removes from a stage or job a host which can't participate anyway - also helps proactively keep this count down. Just be sure to raise the number of failures there too!
55. Koan
A Spark job that is broken is only a special case of a Spark job that is working.
Koan Mu calligraphy by Brigitte D'Ortschy is licensed under CC BY 3.0
56. Interested?
• What we do: data engineering @ Coatue
‒ Terabyte scale, billions of rows
‒ Lambda architecture
‒ Functional programming
• Stack
‒ Scala (cats, shapeless, fs2, http4s)
‒ Spark / Hadoop / EMR / Databricks
‒ Data warehouses
‒ Python / R / Tableau
‒ Chat with me or email: rtoomey@coatue.com
‒ Twitter: @prasinous
58. Desirable heap size for executors
spark.executor.memory = ???
The JVM flag -XX:+UseCompressedOops lets you use 4-byte pointers instead
of 8-byte ones (on by default in JDK 7+).
< 32 GB: good for prompt GC; supports compressed oops.
32-48 GB: the "dead zone" - without compressed oops over 32 GB, you need almost
48 GB just to hold the same number of objects.
49-64+ GB: very large joins, or the special case of wide rows with G1GC.
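A hedged example of staying under the compressed-oops ceiling (the value is illustrative):
spark.executor.memory = 28g // < 32g keeps 4-byte pointers and prompt GC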
59. How many concurrent tasks per executor?
spark.executor.cores = ???
This is the maximum number of concurrent tasks that can run on a single
executor (it defaults to 1 on YARN and to all available cores in standalone mode).
< 2: too few cores; doesn't make good use of parallelism.
2-4: recommended size for "most" Spark apps.
5: where HDFS client throughput tops out.
> 8: too many cores; context-switching overhead outweighs the benefit.
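A worked example under the guidance above (the node size is hypothetical): on 16-core
workers, 4 executors per node at 4 cores each uses every core without oversubscribing.
spark.executor.cores = 4 // 4 concurrent tasks per executor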
60. Memory
• Spark docs: Garbage Collection Tuning
• Distribution of Executors, Cores and Memory for a Spark Application
running in Yarn (spoddutur.github.io/spark-notes)
• How-to: Tune Your Apache Spark Jobs (Part 2) - (Sandy Ryza)
• Why Your Spark Applications Are Slow or Failing, Part 1: Memory
Management (Rishitesh Mishra)
• Why 35GB Heap is Less Than 32GB – Java JVM Memory Oddities
(Fabian Lange)
• Everything by Aleksey Shipilëv at https://shipilev.net/, @shipilev, or
anywhere else
61. GC debug logging
Restart your cluster with these options in
spark.executor.extraJavaOptions and
spark.driver.extraJavaOptions
-verbose:gc -XX:+PrintGC -XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps -XX:+PrintGCDetails
-XX:+PrintGCCause -XX:+PrintTenuringDistribution
-XX:+PrintFlagsFinal
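Note: these are the JDK 8-era flag names; on JDK 9+ most of them were removed in
favor of unified GC logging (roughly equivalent, assuming a newer JVM: -Xlog:gc*).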
62. Parallel GC: throughput friendly
-XX:+UseParallelGC -XX:ParallelGCThreads=NUM_THREADS
• The heap size is set using spark.driver.memory and
spark.executor.memory
• Defaults to one third Young Generation and two thirds Old
Generation
• Number of threads does not scale 1:1 with number of cores
– Start with 8
– After 8 cores, use 5/8 remaining cores
– After 32 cores, use 5/16 remaining cores
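Working the heuristic above through a few hypothetical core counts:
// 8 cores -> 8 threads
// 16 cores -> 8 + (16 - 8) * 5/8 = 13 threads
// 64 cores -> 8 + (32 - 8) * 5/8 + (64 - 32) * 5/16 = 8 + 15 + 10 = 33 threads
-XX:ParallelGCThreads=13 // illustrative, for a 16-core executor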
63. Parallel GC: sizing Young Generation
• Eden is 3/4 of young generation
• Each of the two survivor spaces is 1/8 of young generation
By default, -XX:NewRatio=2, meaning that Old Generation occupies 2/3
of the heap
• Increase NewRatio to give Old Generation more space (3 for
3/4 of the heap)
• Decrease NewRatio to give Young Generation more space (1
for 1/2 of the heap)
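A worked example using the fractions above (the heap size is hypothetical): with a
24 GB heap and the default -XX:NewRatio=2, Old Generation gets 16 GB and Young
Generation 8 GB; Eden is then 6 GB and each survivor space is 1 GB.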
64. Parallel GC: sizing Old Generation
By default, spark.memory.fraction lets Spark's unified execution and
storage memory occupy 0.6 * (heap size - 300 MB). Old Generation needs
to be bigger than this region.
• Decrease spark.memory.storageFraction (default 0.5) to free
up more space for execution
• Increase Old Generation space to combat spilling to disk,
cache eviction
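Checking the arithmetic for a hypothetical 28 GB executor heap: the unified region is
0.6 * (28 GB - 300 MB) ≈ 16.6 GB, while Old Generation at its default 2/3 of the heap
is ≈ 18.7 GB, so the default sizing clears the bar.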
65. G1 GC: latency friendly
-XX:+UseG1GC -XX:ParallelGCThreads=X
-XX:ConcGCThreads=(2*X)
Parallel GC threads are the "stop the world" worker threads. Defaults to the same
calculation as parallel GC; some articles recommend 8 + max(0, cores - 8) * 0.625.
Concurrent GC threads mark in parallel with the running application. The default of a
quarter as many threads as parallel GC may be conservative for a large Spark
application. Several articles recommend scaling this thread count up in conjunction
with a lower initiating heap occupancy.
Garbage First Garbage Collector Tuning (Monica Beckwith)
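Working the cited recommendation through a hypothetical 32-core executor:
// ParallelGCThreads = 8 + max(0, 32 - 8) * 0.625 = 23
// the default ConcGCThreads would then be ~23 / 4 ≈ 5; the advice above is to go higher
-XX:ParallelGCThreads=23 -XX:ConcGCThreads=12 // illustrative values, not prescriptive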
66. G1 GC logging
Same as shown for parallel GC, but also
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintAdaptiveSizePolicy
-XX:+G1SummarizeConcMark
G1 offers a range of GC logging information on top of the
standard parallel GC logging options.
Collecting and reading G1 garbage collector logs - part 2 (Matt
Robson)
67. G1 Initiating heap occupancy
-XX:InitiatingHeapOccupancyPercent=35
By default, G1 GC will initiate garbage collection when the heap is 45 percent full. This can lead to
a situation where full GC is necessary before the less costly concurrent phase has run or
completed.
By triggering concurrent GC sooner and scaling up the number of threads available to perform the
concurrent work, the more aggressive concurrent phase can forestall full collections.
Best practices for successfully managing memory for Apache Spark applications on Amazon EMR
(Karunanithi Shanmugam)
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing (Eric Kaczmarek and
Liqi Yi, Intel)
68. G1 Region size
-XX:G1HeapRegionSize=16m
The JVM defaults to a region size between 1 and 32 MiB, scaled to the heap. For example, a heap
with <= 32 GiB gets a region size of 8 MiB; one with <= 16 GiB gets 4 MiB.
If you see "Humongous Allocation" in your GC logs - an object occupying more than 50% of the current
region size - consider increasing G1HeapRegionSize. Changing this setting is not recommended for most
cases because:
• increasing region size reduces the number of available regions, and
• the additional cost of copying/cleaning up larger regions may reduce throughput or increase latency.
Humongous allocations are most commonly caused by datasets with very wide rows. If you can't improve
G1 performance, switch back to parallel GC.
Plumbr.io handbook: GC Tuning: In Practice: Other Examples: Humongous Allocations
69. G1 string deduplication
-XX:+UseStringDeduplication
-XX:+PrintStringDeduplicationStatistics
May decrease your memory usage if you have a significant
number of duplicate String instances in memory.
JEP 192: String Deduplication in G1
70. Shuffle
• Scaling Apache Spark at Facebook (Ankit Agarwal and Sameer Agarwal)
• Spark Shuffle Deep Dive (Bo Yang)
These older presentations sometimes pertain to previous versions of Spark
but still have substantial value.
• Optimal Strategies for Large Scale Batch ETL Jobs (Emma Tang) - 2017
• Apache Spark @Scale: A 60 TB+ production use case from Facebook
(Sital Kedia, Shuojie Wang and Avery Ching) - 2016
• Apache Spark the fastest open source engine for sorting a petabyte
(Reynold Xin) - 2014
71. S3
• Best Practices Design Patterns: Optimizing Amazon S3
Performance (Mai-Lan Tomsen Bukovec, Andy Warfield, and
Tim Harris)
• Seven Tips for Using S3DistCp on Amazon EMR to Move
Data Efficiently Between HDFS and Amazon S3 (Illya
Yalovyy)
• Cost optimization through performance improvement of
S3DistCp (Sarang Anajwala)
72. S3: EMR
Write your data to HDFS, then add a separate step that uses S3DistCp to
copy the files to S3.
This utility is problematic for large clusters and large datasets:
• Primitive error handling
– Deals with being rate limited by S3 by... trying harder, choking, and failing
– No way to increase the number of allowed failures
– No way to distinguish between being rate limited and hitting fatal backend
errors
• If any S3DistCp step fails, the EMR job fails, even if a later S3DistCp step
succeeds
73. Using s3DistCp on a large cluster
-D mapreduce.job.reduces=(numExecutors / 2)
The default number of reducers is one per executor; the Hadoop documentation says the "right"
number is probably 0.95 or 1.75 × (nodes × maximum containers per node). All three choices are
bad for S3DistCp, where the reduce phase of the job writes to S3. Experiment to figure out how
far to scale down the number of reducers so the data is copied off in a timely manner without
too much rate limiting.
On large jobs, run the S3DistCp step as many times as necessary to ensure all your data makes
it off HDFS to S3 before the cluster shuts down (invocation sketch below).
Hadoop Map Reduce Tutorial: Map-Reduce User Interfaces
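A hedged invocation sketch (the reducer count, source, and destination are illustrative):
s3-dist-cp -D mapreduce.job.reduces=200 --src hdfs:///output --dest s3://bucket/output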
74. Databricks
fs.s3a.multipart.threshold = 2147483647 // default (in bytes)
fs.s3a.multipart.size = 104857600
fs.s3a.connection.maximum = min(clusterNodes, 500)
fs.s3a.connection.timeout = 60000 // default: 20000ms
fs.s3a.block.size = 134217728 // default: 32M - used for reading
fs.s3a.fast.upload = true // disable if writes are failing
// spark.stage.maxConsecutiveAttempts = 10 // default: 4 - increase if writes are failing
The Databricks Runtime uses its own S3 committer code, which provides
reliable performance writing directly to S3.
75. Hadoop 3.2.0
// https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/committers.html
fs.s3a.committer.name = directory
fs.s3a.committer.staging.conflict-mode = replace // replace == overwrite
fs.s3a.attempts.maximum = 20 // how many times to retry commands on transient errors
fs.s3a.retry.throttle.limit = 20 // number of times to retry a throttled request
fs.s3a.retry.throttle.interval = 1000ms
// Controls the maximum number of simultaneous connections to S3
fs.s3a.connection.maximum = ???
// Number of (part)uploads allowed to the queue before blocking additional uploads.
fs.s3a.max.total.tasks = ???
If you're lucky enough to have access to Hadoop 3.2.0, here are some highlights
pertinent to large clusters.