EMR Spark tuning involves configuring Spark and YARN parameters, such as executor memory and cores, to optimize performance. The default Spark configurations depend on the deployment method (Thrift server, Zeppelin, etc.). In cluster mode, YARN handles resource management and allocates resources to containers based on minimum and maximum thresholds. When tuning, consider the available cluster resources, executor instances, and cores to avoid overcommitting resources.
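A minimal sketch of what such tuning looks like in practice (the instance size and all numbers below are illustrative assumptions, not EMR-verified defaults; my_job.py is a placeholder):

# Hypothetical sizing for a node with 16 vCPUs and ~120 GiB of memory:
# leave headroom for the OS and YARN daemons, then split the rest
# across executors (here 3 executors x 5 cores each).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 3 \
  --executor-cores 5 \
  --executor-memory 34G \
  --conf spark.yarn.executor.memoryOverhead=3g \
  my_job.py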
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Noritaka Sekiyama, Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
Properly shaping partitions and your jobs enables powerful optimizations, eliminates skew and maximizes cluster utilization. We will explore various Spark partition shaping methods along with several optimization strategies including join optimizations, aggregate optimizations, salting and multi-dimensional parallelism.
Data Science & Best Practices for Apache Spark on Amazon EMR (Amazon Web Services)
Organizations need to perform increasingly complex analysis on their data — streaming analytics, ad-hoc querying and predictive analytics — in order to get better customer insights and actionable business intelligence. However, the growing data volume, speed, and complexity of diverse data formats make current tools inadequate or difficult to use. Apache Spark has recently emerged as the framework of choice to address these challenges. Spark is a general-purpose processing framework that follows a DAG model and also provides high-level APIs, making it more flexible and easier to use than MapReduce. Thanks to its use of in-memory datasets (RDDs), embedded libraries, fault-tolerance, and support for a variety of programming languages, Apache Spark enables developers to implement and scale far more complex big data use cases, including real-time data processing, interactive querying, graph computations and predictive analytics. In this session, we present a technical deep dive on Spark running on Amazon EMR. You learn why Spark is great for ad-hoc interactive analysis and real-time stream processing, how to deploy and tune scalable clusters running Spark on Amazon EMR, how to use EMRFS with Spark to query data directly in Amazon S3, and best practices and patterns for Spark on Amazon EMR.
Fine Tuning and Enhancing Performance of Apache Spark Jobs (Databricks)
Apache Spark defaults provide decent performance for large data sets but leave room for significant performance gains if you can tune parameters based on your resources and job.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Optimizing Spark jobs through a true understanding of Spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
How to Automate Performance Tuning for Apache Spark (Databricks)
Spark has made writing big data pipelines much easier than before. But a lot of effort is required to maintain performant and stable data pipelines in production over time. Did I choose the right type of infrastructure for my application? Did I set the Spark configurations correctly? Can my application keep running smoothly as the volume of ingested data grows over time? How to make sure that my pipeline always finishes on time and meets its SLA?
These questions are not easy to answer even for a handful of jobs, and this maintenance work can become a real burden as you scale to dozens, hundreds, or thousands of jobs. This talk will review what we found to be the most useful pieces of information and parameters to look at for manual tuning, and the different options available to engineers who want to automate this work, from open-source tools to managed services provided by the data platform or third parties like the Data Mechanics platform.
Tuning Apache Spark for Large-Scale Workloads (Gaoxiang Liu and Sital Kedia, Databricks)
Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workloads. You’ll also learn about Facebook’s new efforts towards automatically tuning several important configurations based on the nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... (Spark Summit)
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
Magnet Shuffle Service: Push-based Shuffle at LinkedIn (Databricks)
The number of daily Apache Spark applications at LinkedIn has increased by 3X in the past year. The shuffle process alone, which is one of the most costly operators in batch computation, is processing PBs of data and billions of blocks daily in our clusters. With such a rapid increase of Apache Spark workloads, we quickly realized that the shuffle process can become a severe bottleneck for both infrastructure scalability and workloads efficiency. In our production clusters, we have observed both reliability issues due to shuffle fetch connection failures and efficiency issues due to the random reads of small shuffle blocks on HDDs.
To tackle those challenges and optimize shuffle performance in Apache Spark, we have developed Magnet shuffle service, a push-based shuffle mechanism that works natively with Apache Spark. Our paper on Magnet has been accepted by VLDB 2020. In this talk, we will introduce how push-based shuffle can drastically increase shuffle efficiency when compared with the existing pull-based shuffle. In addition, by combining push-based shuffle and pull-based shuffle, we show how Magnet shuffle service helps to harden shuffle infrastructure at LinkedIn scale by both reducing shuffle related failures and removing scaling bottlenecks. Furthermore, we will share our experiences of productionizing Magnet at LinkedIn to process close to 10 PB of daily shuffle data.
Managing Apache Spark Workload and Automatic Optimizing (Databricks)
eBay relies heavily on Spark as one of its most significant data engines. In the data warehouse domain, millions of batch queries run every day against 6000+ key DW tables, which contain over 22PB of data (compressed), a volume that keeps growing every year. In the machine learning domain, Spark is playing an ever more significant role. We presented our migration from an MPP database to Apache Spark last year at the Europe Summit. Furthermore, from the vision of the entire infrastructure, managing workload and efficiency for all Spark jobs across our data center remains a big challenge. Our team leads the whole big data platform infrastructure and the management tools upon it, helping our customers (not only DW engineers and data scientists, but also AI engineers) to work on the same page. In this session, we will introduce how a self-service workload management portal/system benefits all of them. First, we will share the basic architecture of this system to illustrate how it collects metrics from multiple data centers and how it detects abnormal workloads in real time. We developed a component called Profiler, which enhances Spark core to support customized metric collection. Next, we will demonstrate some real user stories at eBay to show how the self-service system reduces effort on both the customer side and the infra-team side; that is the highlight about Spark job analysis and diagnosis. Finally, some upcoming advanced features will be introduced, describing an automatic optimizing workflow rather than just alerting.
Speaker: Lantao Jin
How Adobe Does 2 Million Records Per Second Using Apache Spark! (Databricks)
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment which is used from Ingestion to Processing.
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
How We Optimize Spark SQL Jobs With parallel and sync IO (Databricks)
Although NVMe has become more and more popular in recent years, large numbers of HDDs are still widely used in super-large-scale big data clusters. In an EB-level data platform, IO cost (including decompression and decoding) contributes a large proportion of Spark jobs' cost. In other words, IO is worth optimizing.
At ByteDance, we applied a series of IO optimizations to improve performance, including parallel reads and asynchronous shuffle. First, we implemented file-level parallel reads to improve performance when there are many small files. Second, we designed row-group-level parallel reads to accelerate queries in big-file scenarios. Third, we implemented asynchronous spill to improve job performance. Besides, we designed Parquet column families, which split a table into a few column families stored in different Parquet files. Different column families can be read in parallel, so read performance is much higher than with the existing approach. In our practice, end-to-end performance improved by 5% to 30%.
In this talk, I will illustrate how we implemented these features and how they accelerate Apache Spark jobs.
Troubleshooting Kafka's socket server: from incident to resolution (Joel Koshy)
LinkedIn’s Kafka deployment is nearing 1300 brokers that move close to 1.3 trillion messages a day. While operating Kafka smoothly even at this scale is a testament to both Kafka’s scalability and the operational expertise of LinkedIn SREs, we occasionally run into some very interesting bugs at this scale. In this talk I will dive into a production issue that we recently encountered as an example of how even a subtle bug can suddenly manifest at scale and cause a near meltdown of the cluster. We will go over how we detected and responded to the situation, investigated it after the fact, and summarize some lessons learned and best practices from this incident.
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J... (Databricks)
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide performance tuning and testing tips for your Spark applications.
From common errors seen in running Spark applications (e.g., OutOfMemory, NoClassFound, disk IO bottlenecks, History Server crashes, cluster under-utilization) to advanced settings used to resolve large-scale Spark SQL workloads (such as HDFS block size vs Parquet block size, and how best to run HDFS Balancer to re-distribute file blocks), you will get all the scoop in this information-packed presentation.
Spark started at Facebook as an experiment when the project was still in its early phases. Spark's appeal stemmed from its ease of use and an integrated environment to run SQL, MLlib, and custom applications. At that time the system was used by a handful of people to process small amounts of data. However, we've come a long way since then. Currently, Spark is one of the primary SQL engines at Facebook in addition to being the primary system for writing custom batch applications. This talk will cover the story of how we optimized, tuned and scaled Apache Spark at Facebook to run on 10s of thousands of machines, processing 100s of petabytes of data, and used by 1000s of data scientists, engineers and product analysts every day. In this talk, we'll focus on three areas: * *Scaling Compute*: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogeneous machines in disaggregated (shared-storage) clusters. * *Optimizing Core Engine*: How we continuously tune, optimize and add features to the core engine in order to maximize the useful work done per second. * *Scaling Users:* How we make Spark easy to use, and faster to debug to seamlessly onboard new users.
Speakers: Ankit Agarwal, Sameer Agarwal
Introduction to Machine Learning in Spark. Presented at Bangalore Apache Spark Meetup by Shashank L and Shashidhar E S on 17/10/2015.
http://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/225649429/
Presented at Spark+AI Summit Europe 2019
https://databricks.com/session_eu19/apache-spark-at-scale-in-the-cloud
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
By using Spark UI and simple metrics, explore how to diagnose and remedy issues on jobs:
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you)
Shuffle (give a little to get a lot – configs for better out of box shuffle) – Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
Have you recently started working with Spark and your jobs are taking forever to finish? This presentation is for you.
Himanshu Arora and Nitya Nand YADAV have gathered numerous best practices, optimizations and adjustments that they have applied over the years in production to make their jobs faster and less resource-consuming.
In this presentation, they teach us advanced Spark optimization techniques, data serialization formats, storage formats, hardware optimizations, parallelism control, resource manager settings, better data locality, GC optimization and more.
They also show us the appropriate use of RDD, DataFrame and Dataset in order to fully benefit from Spark's internal optimizations.
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR (Amazon Web Services)
"Is your team looking to bring the power of full, end-to-end stream processing with Apache Flink to your organization but are concerned about the time, resources or skills required? In this talk, Sharon Xie, Decodable Founding Engineer and Apache Flink PMC Member, Robert Metzger, will reveal the biggest lessons learned, and how to avoid common mistakes when adopting Apache Flink. If you have any plans on implementing Apache Flink, then this is a session you do not want to miss.
We will talk about avoiding data-loss with Flink’s Kafka exactly-once producer, configuring Flink for getting the most bang for the buck out of your memory configuration and tuning for efficient checkpointing."
Software Profiling: Java Performance, Profiling and Flamegraphs (Isuru Perera)
Guest lecture at University of Colombo School of Computing on 30th May 2018
Covers the following topics:
Software Profiling
Measuring Performance
Java Garbage Collection
Sampling vs Instrumentation
Java Profilers. Java Flight Recorder
Java Just-in-Time (JIT) compilation
Flame Graphs
Linux Profiling
Couchbase Data Platform | Big Data Demystified (Omid Vahdaty)
Couchbase is a popular open source NoSQL platform used by giants like Apple, LinkedIn, Walmart, Visa and many others and runs on-premise or in a public/hybrid/multi cloud.
Couchbase has a sub-millisecond K/V cache integrated with a document-based DB, plus many more unique services and features.
In this session we will talk about the unique architecture of Couchbase, its unique N1QL language - a SQL-Like language that is ANSI compliant, the services and features Couchbase offers and demonstrate some of them live.
We will also discuss what makes Couchbase different than other popular NoSQL platforms like MongoDB, Cassandra, Redis, DynamoDB etc.
At the end we will talk about the next version of Couchbase (6.5), which will be released later this year, and about Couchbase 7.0, which will be released next year.
Machine Learning Essentials Demystified part2 | Big Data Demystified (Omid Vahdaty)
Machine Learning Essentials Abstract:
Machine Learning (ML) is one of the hottest topics in the IT world today. But what is it really all about?
In this session we will talk about what ML actually is and in which cases it is useful.
We will talk about a few common algorithms for creating ML models and demonstrate their use with Python. We will also take a peek at Deep Learning (DL) and Artificial Neural Networks and explain how they work (without too much math) and demonstrate DL model with Python.
The target audience are developers, data engineers and DBAs that do not have prior experience with ML and want to know how it actually works.
Machine Learning Essentials Demystified part1 | Big Data Demystified (Omid Vahdaty)
Machine Learning Essentials Abstract:
Machine Learning (ML) is one of the hottest topics in the IT world today. But what is it really all about?
In this session we will talk about what ML actually is and in which cases it is useful.
We will talk about a few common algorithms for creating ML models and demonstrate their use with Python. We will also take a peek at Deep Learning (DL) and Artificial Neural Networks and explain how they work (without too much math) and demonstrate DL model with Python.
The target audience are developers, data engineers and DBAs that do not have prior experience with ML and want to know how it actually works.
The technology of fake news between a new front and a new frontier | Big Dat... (Omid Vahdaty)
My name is Nitzan Or Kadrai, and I stand at the fascinating intersection of technology, media and activism.
For the past four and a half years I have been working at Yedioth Ahronoth, first as the product manager of the ynet app and today as head of innovation.
I was a partner in founding the Start-Ach nonprofit, which provides development and product services for other nonprofits, and recently I have been building a community whose goal is to explore the technological aspects of the fake news phenomenon and to build practical tools for managing the fight against it intelligently.
The talk will cover the fake news phenomenon. We will focus on the technology that enables the spread of fake news and see examples of how it is used.
We will examine the scale of the phenomenon on social networks and learn how the tech giants are trying to fight it.
Big Data in 200 km/h | AWS Big Data Demystified #1.3 (Omid Vahdaty)
What we're about
A while ago I entered the challenging world of Big Data. As an engineer, at first I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world.
How to transform data (TXT, CSV, TSV, JSON) into Parquet or ORC? Which technology should we use to model the data: EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL? How to handle streaming? How to manage costs? Performance tips? Security tips? Cloud best practice tips?
Some of our online materials:
Website:
https://big-data-demystified.ninja/
Youtube channels:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
Meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
https://www.meetup.com/Big-Data-Demystified
Facebook Group :
https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Facebook page (https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/)
Audience:
Data Engineers
Data Science
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
Making your analytics talk business | Big Data Demystified (Omid Vahdaty)
MAKING YOUR ANALYTICS TALK BUSINESS
Aligning your analysis to the business is fundamental for all types of analytics (digital or product analytics, business intelligence, etc.) and is vertical- and tool-agnostic. In this talk we will build on the discussion that was started in the previous meetup, and will discuss how analysts can learn to derive their stakeholders' expectations, how to shift from metrics to "real" KPIs, and how to approach an analysis in order to create real impact.
This session is primarily geared towards those starting out into analytics, practitioners who feel that they are still struggling to prove their value in the organization or simply folks who want to power up their reporting and recommendation skills. If you are already a master at aligning your analysis to the business, you're most welcome as well: join us to share your experiences so that we can all learn from each other and improve!
Bios:
Eliza Savov - Eliza is the team lead of the Customer Experience and Analytics team at Clicktale, the worldwide leader in behavioral analytics. She has extensive experience working with data analytics, having previously worked at Clicktale as a senior customer experience analyst, and as a product analyst at Seeking Alpha.
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H... (Omid Vahdaty)
In the talk we will discuss how to break down the company’s overall goals all the way to your BI team’s daily activities in 3 simple stages:
1. Understanding the path to success - Creating a revenue model
2. Gathering support and strategizing - Structuring a team
3. Executing - Tracking KPIs
Bios:
Omri Halak - Omri is the director of business operations at Logz.io, an intelligent and scalable machine data analytics platform built on ELK & Grafana that empowers engineers to monitor, troubleshoot, and secure mission-critical applications more effectively. In this position, Omri combines actionable business insights from the BI side with fast and effective delivery on the Operations side. Omri has ample experience connecting data with business, with previous positions at SimilarWeb as a business analyst, at Woobi as finance director, and as Head of State Guarantees at Israel Ministry of Finance.
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy... (Omid Vahdaty)
The lecturer has deep experience defining cloud computing security models for IaaS, PaaS, and SaaS architectures, specifically as the architecture relates to IAM, and deep experience defining privacy protection policy; a big fan of GDPR interpretation.
Deep experience in information security: defining healthcare security best practices (including AI and Big Data), IT security, and ICS security and privacy controls in industrial environments.
Deep knowledge of security frameworks such as Cloud Security Alliance (CSA), International Organization for Standardization (ISO), National Institute of Standards and Technology (NIST), IBM ITCS104, etc.
What will you learn:
Every day, websites collect a huge amount of data. This data makes it possible to analyze the behavior of Internet users, their interests, their purchasing behavior and conversion rates. To increase business, big data offers the tools to analyze and process data in order to reveal the competitive advantages hidden within it.
What does healthcare have to do with Big Data?
How can AI assist in patient care?
Why are some afraid? Are there any dangers?
Aerospike meetup july 2019 | Big Data Demystified (Omid Vahdaty)
Building a low latency (sub millisecond), high throughput database that can handle big data AND linearly scale is not easy - but we did it anyway...
In this session we will get to know Aerospike, an enterprise distributed primary key database solution.
- We will do an introduction to Aerospike - basic terms, how it works and why is it widely used in mission critical systems deployments.
- We will understand the 'magic' behind Aerospike's ability to handle small, medium and even Petabyte scale data, and still guarantee predictable performance of sub-millisecond latency
- We will learn how Aerospike devops is different than other solutions in the market, and see how easy it is to run it on cloud environments as well as on premise.
We will also run a demo, showing a live example of the performance and self-healing technologies the database has to offer.
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei... (Omid Vahdaty)
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS
-Learn how to connect BI and product management to solve business problems
-Discover how to lead clients to ask the right questions to get the data and insight they really want
-Get pointers on saving your time and your company's resources by understanding what your customers need, not what they ask for
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned (Omid Vahdaty)
A while ago I entered the challenging world of Big Data. As an engineer, at first I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS & GCP and Data Center infrastructure to answer the basic questions of anyone starting their way in the big data world.
How to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC or AVRO? Which technology should we use to model the data: EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL? GCS? BigQuery? Dataflow? Data Lab? TensorFlow? How to handle streaming? How to manage costs? Performance tips? Security tips? Cloud best practice tips?
In this meetup we shall present lecturers working with several cloud vendors, various big data platforms such as Hadoop, data warehouses, and startups working on big data products. Basically, if it is related to big data, this is THE meetup.
Some of our online materials (mixed content from several cloud vendors):
Website:
https://big-data-demystified.ninja (under construction)
Meetups:
https://www.meetup.com/Big-Data-Demystified
https://www.meetup.com/AWS-Big-Data-Demystified/
YouTube channels:
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
Audience:
Data Engineers
Data Science
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r... (Omid Vahdaty)
AWS Big Data Demystified is all about knowledge sharing, because knowledge should be given for free. In this lecture we will discuss the advantages of working with Zeppelin + Spark SQL, JDBC + Thrift, Ganglia, R + SparkR + Livy, and a little bit about Ganglia on EMR.
Subscribe to our YouTube channel to see the video of this lecture:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
Amazon AWS Big Data Demystified | Introduction to streaming and messaging flu... (Omid Vahdaty)
Amazon AWS Big Data Demystified meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
Introduction to streaming and messaging: Flume, Kafka, SQS, Kinesis
AWS Big Data Demystified #1: Big data architecture lessons learned (Omid Vahdaty)
AWS Big Data Demystified #1: Big data architecture lessons learned. A quick overview of the big data technologies that were selected or disregarded at our company.
The video: https://youtu.be/l5KmaZNQxaU
Don't forget to subscribe to the YouTube channel.
The website: https://amazon-aws-big-data-demystified.ninja/
The meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
The facebook group : https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
2. Tuning?
Spark, as you have likely figured out by this point, is a parallel processing engine. What is maybe less obvious is that Spark is not a "magic" parallel processing engine: it is limited in its ability to figure out the optimal amount of parallelism on its own. Every Spark stage has a number of tasks, each of which processes data sequentially. In tuning Spark jobs, this task count is probably the single most important parameter in determining performance.
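As a hedged sketch, these are the two settings that most directly control the per-stage task count (the value 128 is purely illustrative, and my_job.py is a placeholder):

# spark.default.parallelism     -> default task count for RDD shuffles
# spark.sql.shuffle.partitions  -> task count for DataFrame/Spark SQL shuffles
spark-submit \
  --conf spark.default.parallelism=128 \
  --conf spark.sql.shuffle.partitions=128 \
  my_job.py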
3. What are the defaults?
● Thrift JDBC Server (Spark SQL)
● Zeppelin (Spark SQL)
● Yarn
4. Spark + Thrift
● sudo /usr/lib/spark/sbin/start-thriftserver.sh --help
config | description | default
--deploy-mode | Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") | client
--driver-memory | Memory for the driver (e.g. 1000M, 2G) | 1024M
--driver-cores | Number of cores used by the driver (cluster mode only) | 1
--executor-memory | Memory per executor (e.g. 1000M, 2G) | 1G
--num-executors | Number of executors to launch; if dynamic allocation is enabled, the initial number of executors will be at least this value | 2
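Putting these flags together, a sketch of launching the Thrift server with explicit resources instead of the defaults above (the values are illustrative, not recommendations):

sudo /usr/lib/spark/sbin/start-thriftserver.sh \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 5G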
6. EMR Spark defaults

config | description | default (r4.4xl) | default (c4.8xl)
Deploy mode | Client/Cluster | client | client
spark.driver.memory | Set based on the smaller of the instance types in these two instance groups. | |
spark.executor.cores | Configured based on the slave instance types in the cluster. | 32 | 4
spark.executor.memory | Configured based on the slave instance types in the cluster. | 218880M | 5120M
8. Zeppelin
config | Description | Default
zeppelin.spark.concurrentSQL | Run notebooks and paragraphs in parallel | false
spark.app.name | Application name | zeppelin
spark.cores.max | Total number of cores to use | empty, which means use all available cores
spark.executor.instances | (--num-executors in spark-submit) | -
spark.executor.cores | (--executor-cores) | -
spark.executor.memory | (--executor-memory); set with a G suffix for GB, e.g. 16G | 1G
9. Default Config Comparison: Client mode (c4.8xl)
config | Description | Thrift | Zeppelin | Spark EMR
spark.executor.instances | (--num-executors in spark-submit) Number of executors to launch; if dynamic allocation is enabled, the initial number of executors will be at least this value. | 2 | - | -
spark.executor.cores | (--executor-cores) | 1 | 4 | 4
spark.executor.memory | (--executor-memory) | 1G | 5120M | 5120M
11. Yarn
● What is a container?
○ A container is a YARN JVM process. In MapReduce, the application master service, mapper and reducer tasks are all containers that execute inside the YARN framework. You can view running container stats via the Resource Manager's web interface: "http://<resource_manager_host>:8088/cluster/scheduler"
● When running Spark on YARN, each Spark executor runs as a YARN container. [...]
● The YARN Resource Manager (RM) allocates resources to the application through logical queues, which include memory, CPU, and disk resources.
○ By default, the RM will allow up to 8192MB ("yarn.scheduler.maximum-allocation-mb") for an Application Master (AM) container allocation request.
○ The default minimum allocation is 1024MB ("yarn.scheduler.minimum-allocation-mb").
○ The AM can only request resources from the RM in increments of "yarn.scheduler.minimum-allocation-mb", not exceeding "yarn.scheduler.maximum-allocation-mb".
○ The AM is responsible for rounding "mapreduce.map.memory.mb" and "mapreduce.reduce.memory.mb" up to a value divisible by "yarn.scheduler.minimum-allocation-mb".
○ With those defaults, the RM will deny an allocation request greater than 8192MB, or one not divisible by 1024MB.
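The rounding rule is easy to sanity-check with shell arithmetic; the 5000MB request below is an arbitrary example:
# Round a 5000MB container request up to the next multiple of
# yarn.scheduler.minimum-allocation-mb (1024MB here).
REQUEST_MB=5000
MIN_MB=1024
echo $(( (REQUEST_MB + MIN_MB - 1) / MIN_MB * MIN_MB ))   # prints 5120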
12. Yarn Deploy mode: Cluster/Client
There are two deploy modes that can be used to launch Spark applications on
YARN.
● In cluster mode, the Spark driver runs inside an application master process
which is managed by YARN on the cluster, and the client can go away after
initiating the application.
● In client mode, the driver runs in the client process, and the application master is only used for requesting resources.
● Confirm via: http://ec2-34-245-34-125.eu-west-1.compute.amazonaws.com:18080/history/application_1519552338222_0001/environment/
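For example (application jar and class names are placeholders):
# Cluster mode: the driver runs inside the YARN application master.
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar
# Client mode: the driver runs in the local client process.
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar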
13. Yarn Scheduler
● FAIR
● FIFO
● What is the default? In Spark itself the default is FIFO; on EMR it is FAIR (confirm via: http://ec2-34-245-34-125.eu-west-1.compute.amazonaws.com:18080/history/application_1519552338222_0001/environment/)
● sqlContext.getConf("spark.scheduler.mode")
● Scheduling queues
○ Scheduling pools: https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools
○ Each pool has its own sub-priorities/configuration: https://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
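A minimal sketch of turning this on explicitly; both confs are standard Spark settings, but the allocation-file path is an assumption:
spark-submit \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=/home/hadoop/fairscheduler.xml \
  my-app.jar
# Inside the application, jobs can then be routed to a named pool with:
#   sc.setLocalProperty("spark.scheduler.pool", "pool1")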
14. Debugging an App
In YARN terminology, executors and application masters run inside “containers”. YARN has two modes for handling container logs after an
application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS
and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the yarn logs command.
Current setup can be confirmed at:
http://ec2-34-245-34-125.eu-west-1.compute.amazonaws.com:18080/history/application_1519552338222_0001/environment/
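With aggregation enabled, the logs for the application above can then be fetched from any node (application ID taken from the URL above):
yarn logs -applicationId application_1519552338222_0001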
15. Yarn - default values per instance
Configuration Option | Description | Default (r4.8xlarge) | Default (r4.4xlarge) | Default (c4.8xl)
mapreduce.map.java.opts | Each mapper and reducer is a Java process and needs some reserve memory beyond the heap, so the map/reduce container memory should be greater than the JVM heap size. | -Xmx6042m | -Xmx5837m | -Xmx1183m
mapreduce.reduce.java.opts | (as above) | -Xmx12084m | -Xmx11674m | -Xmx2366m
mapreduce.map.memory.mb | The upper memory limit that Hadoop allows to be allocated to a mapper. | 7552 | 7296 | 1479
mapreduce.reduce.memory.mb | The upper memory limit that Hadoop allows to be allocated to a reducer. | 15104 | 14592 | 2958
yarn.app.mapreduce.am.resource.mb | Memory required for the MapReduce Application Master; any NodeManager with at least this much memory available can be selected to run it. | 7552 | 14592 | 2958
16. Yarn - default values per instance
Configuration Option | Description | Default (r4.8xlarge) | Default (r4.4xlarge) | Default (c4.8xl)
yarn.scheduler.minimum-allocation-mb | Minimum RAM allocation per container, in MB. | 32 | 32 | 32
yarn.scheduler.maximum-allocation-mb | The maximum memory allocation available for a container, in MB. The RM can only allocate memory to containers in increments of "yarn.scheduler.minimum-allocation-mb", without exceeding "yarn.scheduler.maximum-allocation-mb"; it should not be more than the total allocatable memory of the node. | 241664 | 116736 | 53248
yarn.nodemanager.resource.memory-mb | Amount of physical memory, in MB, that can be allocated for containers, i.e. the amount of memory YARN can use on this node; it should therefore be lower than the total memory of the machine. | 241664 | 116736 | 53248
18. Yarn Tuning
- 1 c4.8xlarge master
- 4 c4.8xlarge instances
- each with 36 CPU cores
- and each with 60 GB RAM.
First, the master will not count toward the calculation, since it should be managing the cluster rather than running jobs like a worker would.
Before doing anything else, we need to look at YARN, which allocates resources to the cluster. There are some things to consider:
19. Yarn Tuning
1) To figure out what resources YARN has available, there is a chart for emr-5.11.0 showing the defaults (you can also read the values straight from yarn-site.xml, as shown after this list).
2) By default (for c4.8xlarge in the chart), YARN has 53248 MB (52 GB) max and 32 MB min available to it.
3) The max is supposed to be cleanly divisible by the minimum (53248 / 32 = 1664 increments in this case), since memory is allocated in increments of the minimum.
4) YARN is configured per instance it runs on (the total RAM that YARN may allocate on that node).
5) Be sure to leave at least 1-2GB RAM and 1 vCPU on each instance for the O/S and other applications to run too. The default amount of RAM (52 GB out of the 60 GB on the instance) seems to cover this, but it leaves us with 35 (36 - 1) vCPUs per instance off the top.
6) If a Spark container runs over the amount of memory YARN has allocated to it, YARN will kill the container and retry running it.
7) If this works and you want to tune further, you can look at increasing the resources available to YARN (perhaps using 58 GB of RAM instead of 52 GB) and test that too.
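To confirm what a given node is actually configured with, the relevant properties can be read straight from yarn-site.xml (the path is the one used later in this deck):
grep -A 1 -e 'yarn.nodemanager.resource.memory-mb' \
          -e 'yarn.scheduler.maximum-allocation-mb' \
          /etc/hadoop/conf/yarn-site.xml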
20. When it comes to executors
1) This is set for the entire cluster; it doesn't have to use all resources.
2) Likely, you are going to want more than 1 executor per node, since parallelism is usually the goal. This can be experimented with to manually size a job; again, it doesn't have to use all resources.
3) If we were to use the article [1] as an example, we may want to try 3 executors per node to start with (not too few, not too many), so --num-executors 11 [3 executors per node x 4 nodes = 12 executors, - 1 for the running application master = 11].
21. And executor cores...
1) Although it isn't explicitly documented, I have noticed problems with other customers when the total executor core count exceeds the instance's vCPU/core count.
2) It is suggested to set aside at least 1 core per node to ensure there are resources left over to run the O/S, the application master and the like.
3) Executor cores are assigned per executor (discussed just above). So this isn't a total per node or for the cluster, but rather the resources available divided by the number of executors wanted per node.
4) In the article's example this works out to --executor-cores 11 [36 cores/vCPUs per instance - 1 for AM/overhead = 35 cores; 35 cores / 3 executors per node = 11 executor cores, rounded down].
22. And executor memory...
1) This is assigned like executor cores (finally, something sized similarly). It is per executor, and thus is not the total amount of memory on a node or the amount set in YARN, but rather the resources available divided by the number of executors per node (like executor-cores).
2) Off-heap memory and other overheads can make an executor/container larger than the memory you set. Leave room so YARN doesn't kill it for using more memory than was provided.
3) In the article's example this works out to --executor-memory 16G (16384 MB) [52 GB provided to YARN / 3 executors = 17 GB RAM max; 17 GB - 1 GB for extra overhead = 16GB (16384 MB)].
So, putting the above together as an example, we could use 3 executors per node with: --num-executors 11 --executor-cores 11 --executor-memory 16G
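As one runnable line (the jar name is a placeholder, and the figures are the worked example's, not a universal recommendation):
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 11 \
  --executor-cores 11 \
  --executor-memory 16G \
  my-app.jar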
23. And spark.executor.instances
If Spark dynamic allocation is enabled, the initial number of executor instances is based on the Spark default for spark.dynamicAllocation.initialExecutors, which in turn falls back to spark.dynamicAllocation.minExecutors (default 0).
If you set --num-executors on the command line, then the number of executors initially started will equal that value.
If MaximizeResourceAllocation is set, then the EMR service sets spark.executor.instances to the number of core nodes in the cluster.
If dynamic allocation is also set, then the initial number of executors started will be the number of core nodes in the cluster.
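A minimal sketch of these interactions (all standard Spark confs; the counts and jar name are illustrative):
# With dynamic allocation, initialExecutors decides how many executors start;
# the count then floats between minExecutors and maxExecutors.
spark-submit --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=0 \
  --conf spark.dynamicAllocation.initialExecutors=4 \
  --conf spark.dynamicAllocation.maxExecutors=40 \
  my-app.jar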
24. Yarn Deploy mode tuning
1) When we use --deploy-mode client, we tell YARN to deploy the driver to an 'external node'; but since this is being run from the master, it strains the master for the entire cluster. Client mode is best if you have spark-shell on another instance in the same network and can submit a job from there to the cluster, using that instance's resources for the driver (taking the load off the cluster), or when using it interactively to test (since the application master is on the master instance, its logs will be there).
2) We probably want to use YARN as our deployment mode [5]. In particular, we should be looking at yarn-cluster. This means each application's master is deployed in the cluster and not just on the one device that is also managing the whole cluster. The running resources are spread out, but so are the application master logs. There's another good explanation of this available [6].
3) Remember, if you over-subscribe the driver memory and cores, there's less for the rest of the cluster to use for actual work. Most often, the driver memory is set to 1-2GB; 20GB would take more than half the RAM, and 4 cores would take most of the cores in our earlier example. It may be worth leaving these at their defaults (not using the options) for now, and then conservatively raising them if the logs tell you to.
25. EMR and s3
* Use S3DistCp to move objects in and out of Amazon S3. S3DistCp implements error handling, retries, and back-offs to match the requirements of Amazon S3 (see the sketch after this list). For more information, see Distributed Copy Using S3DistCp.
* Design your application with eventual consistency in mind. Use HDFS for intermediate data storage while the cluster is running and
Amazon S3 only to input the initial data and output the final results.
* If your clusters will commit 200 or more transactions per second to Amazon S3, contact support to prepare your bucket for greater
transactions per second and consider using the key partition strategies described in Amazon S3 Performance Tips & Tricks.
* Set the Hadoop configuration setting io.file.buffer.size to 65536. This causes Hadoop to spend less time seeking through Amazon S3
objects.
* Consider disabling Hadoop's speculative execution feature if your cluster is experiencing Amazon S3 concurrency issues. You do this
through the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution configuration settings. This is also
useful when you are troubleshooting a slow cluster.
* If you are running a Hive cluster, see Are you having trouble loading data to or from Amazon S3 into Hive?.
* Pre-cache the results of an Amazon S3 list operation locally on the cluster. You do this in HiveQL with the
following command: set hive.optimize.s3.query=true;.
● Apply compression testing…
● Use s3a
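As a concrete instance of the first point (bucket and paths are placeholders; the s3-dist-cp command ships with EMR):
# Keep intermediate data on HDFS while the cluster runs, then push the
# final results to S3 with S3DistCp:
s3-dist-cp --src hdfs:///output/final --dest s3://my-bucket/results/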
References:
[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-error-hive.html#emr-troubleshoot-error-hive-3
[2] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-errors-io.html#emr-troubleshoot-errors-io-1
26. How can I undo MaximizeResourceAllocation?
When MaximizeResourceAllocation is enabled, five parameter values are set under spark-defaults.conf. To undo MaximizeResourceAllocation, adjust the spark.executor.cores and spark.executor.memory parameters, etc., according to your use case.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
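Since command-line confs take precedence over spark-defaults.conf, one way to neutralize it per job is to override those values at submit time (the numbers are placeholders to adapt to your workload):
spark-submit --master yarn \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=8g \
  --conf spark.executor.instances=10 \
  my-app.jar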
27. What would be a recommended setting for running a cluster in parallel for 2 or more users? Should we turn off MaximizeResourceAllocation? Turn on spark.dynamicAllocation?
It depends on the use case, such as:
● how many applications you want to run in parallel
● how many tasks each application is expected to launch
● Disabling MaximizeResourceAllocation helps to run applications in parallel, but disabling it may reduce single-application performance.
28. Example of tuning
Spark resource tuning is essentially a case of fitting the number of executors we want to assign per job to the available resources in the cluster. Assume your core and task nodes are 40 (20 + 20) instances of type m3.xlarge.
Each m3.xlarge instance has 4 vCPUs and 15GiB of RAM available for use [1]. Since we are going to leave 1 vCPU per node free as per the initial issue, this gives us total cluster vCPU resources of:
40 x 3 vCPUs = 120 vCPUs
For m3.xlarge instance types, the maximum amount of RAM available on each node is controlled by the property "yarn.nodemanager.resource.memory-mb" in the /etc/hadoop/conf/yarn-site.xml file, located on the master instance node. It is set by default to 11520 (even with maximizeResourceAllocation enabled). This leaves some RAM free on each node for other tasks such as system processes.
(40 x 11520MB RAM) / 1024 (to convert to GiB) = 450GiB RAM*
* Note: this does not mean that it is OK to divide this 450GiB by 40 and set each executor's memory to that value, because this maximum is a hard maximum, and each executor requires some memoryOverhead to account for things like VM overheads, interned strings, and other native overheads [2]. Typically, the actual maximum memory set by --executor-memory is 80% of this value.
29. Example of tuning
 | CPU | Memory | Comments
Per instance | 4 | 15 GiB | m3.xlarge; yarn.nodemanager.resource.memory-mb = 11520MB; each executor requires some memoryOverhead to account for things like VM overheads, so the actual maximum memory set by --executor-memory is 80% of this value.
Per cluster (20 task + 20 core = 40) | 40 x (4 - 1) = 120 vCPUs | (40 x 11520MB RAM) / 1024 = 450GiB RAM | -1 core per node for the OS and daemons.
30. Example of tuning
 | Executor cores | Executor memory | Comments
Per executor | 3 | 80% of YARN (11.5GB) = 9GB, or 60% of machine total (15GB) = 9GB | m3.xlarge; yarn.nodemanager.resource.memory-mb = 11520MB; each executor requires some memoryOverhead, so the actual maximum set by --executor-memory is 80% of this value.
Per cluster (20 task + 20 core = 40) | 40 x (4 - 1) = 120 | 40 x 9GB = 360GB | -1 core per node for the OS and daemons.
31. Example of tuning
Using this as a simple formula, we could have many examples, such as (using your previously provided settings as a basis):
1) 1 Spark-submit job -> --executor-cores 3 --executor-memory 9g --conf spark.dynamicAllocation.maxExecutors=40 --conf spark.yarn.executor.memoryOverhead=2048
2) 2 Spark-submit jobs -> --executor-cores 1 --executor-memory 4500m --conf spark.dynamicAllocation.maxExecutors=80 --conf spark.yarn.executor.memoryOverhead=1024
(--executor-cores is 1 here, as 3/2 = 1.5 and decimals are rounded down)
3) 3 Spark-submit jobs -> --executor-cores 1 --executor-memory 3000m --conf spark.dynamicAllocation.maxExecutors=120 --conf spark.yarn.executor.memoryOverhead=768
32. Example of tuning
Example Executors
Cores
Executors-mem spark.dynamicAlloca
tion.maxExecutors
spark.yarn.executor.
memoryOverhead
comments
job1 3 9GB 40 2048 20%
OVERHEAD
Job2 (half) 1 4.5GB 80 1024 -executor-cores are 1
here as 3/2 = 1.5
and decimals are
rounded down
Job3 (third) 1 3GB 120 768
33. Example of tuning - double the instance size...
Let's double the instance size and see what happens...
On an m3.2xlarge, with double the resources per node, we would have 40 x 8 vCPUs and 40 x 30GiB RAM. Again, leaving 1 vCPU free per node (8 - 1 = 7 vCPUs per node) and trusting that YARN limits the max usable RAM to 23GiB (23552MiB) or so, our new values would look like:
1) 1 Spark-submit job -> --executor-cores 7 --executor-memory 19g --conf spark.dynamicAllocation.maxExecutors=40 --conf spark.yarn.executor.memoryOverhead=4096
(19456MiB + 4096MiB) * 1 overall job = 23552MiB => A perfect fit!
23552 * 0.8 = 18841.6MiB, so we can probably round this up to 19GB of RAM or so without issue. The majority of the memory discrepancy between 23 and 19 (4GiB) can be set in the memoryOverhead parameter.
2) 2 Spark-submit jobs -> --executor-cores 3 --executor-memory 9g --conf spark.dynamicAllocation.maxExecutors=80 --conf spark.yarn.executor.memoryOverhead=2560
(9216 + 2560) * 2 jobs to run = 23552MiB => A perfect fit!
(--executor-cores is 3 here, as 7/2 = 3.5 and decimals are rounded down)
34. So for the example:
"3) 3 Spark-submit jobs -> --executor-cores 2 --executor-memory 6g --conf spark.dynamicAllocation.maxExecutors=120 --conf spark.yarn.executor.memoryOverhead=1536"
We have 3 jobs with a maximum executor memory size of 6GiB each. That is 18GiB, which fits into the 23GiB. Of the 5GiB remaining, we allocate it as memoryOverhead with an equal share per job; this means we set the value to 1536MB RAM, using all of the remaining RAM except 512MiB.
As you can see above, the per-job calculation is:
(executor-memory + memoryOverhead) * number of concurrent jobs <= YARN threshold
(6144MB + 1536MB) * 3 = 23040MiB, which is 512MiB less than the YARN threshold of 23552MiB (23GiB).
35. Example of tuning - double the instance size...
 | CPU | Memory | Comments
Per instance | 8 | 30 GiB | m3.2xlarge; yarn.nodemanager.resource.memory-mb = 23552MB; each executor requires some memoryOverhead to account for things like VM overheads, so the actual maximum memory set by --executor-memory is 80% of this value.
Per cluster (20 task + 20 core = 40) | 40 x (8 - 1) = 280 | 40 x 23.5GB = 940GiB RAM | -1 core per node for the OS and daemons.
36. Example of tuning - double the instance size...
 | Executor cores | Executor memory | Comments
Per executor | 7 | 80% of YARN (23.5GB) = 18GB, or 60% of machine total (30GB) = 18GB | m3.2xlarge; yarn.nodemanager.resource.memory-mb = 23552MB; each executor requires some memoryOverhead, so the actual maximum set by --executor-memory is 80% of this value.
Per cluster (20 task + 20 core = 40) | 40 x (8 - 1) = 280 | 40 x 18GB = 720GB | -1 core per node for the OS and daemons.
37. Example of tuning - double the instance size...
Example | Executor cores | Executor memory | spark.dynamicAllocation.maxExecutors | spark.yarn.executor.memoryOverhead (MB) | Comments
job4 | 7 | 19GB | 40 | 4096 | (19456MiB + 4096MiB) * 1 overall job = 23552MiB => a perfect fit. 23552 * 0.8 = 18841.6MiB, so we can probably round up to 19GB of RAM without issue; the bulk of the 23 - 19 (4GiB) discrepancy can be assigned to memoryOverhead.
Job5 (half) | 3 | 9GB | 80 | 2560 | (9216 + 2560) * 2 jobs = 23552MiB => a perfect fit. (--executor-cores is 3 here, as 7/2 = 3.5 and decimals are rounded down)
Job6 (third) | 2 | 6GB | 120 | 1536 | (--executor-cores is 2 here, as 7/3 = 2.33 and decimals are rounded down)
38. How much memory per executor? A lot? A little?
● Running executors with too much memory often results in excessive garbage collection delays. 64GB is a good upper limit for a single executor.
● I've noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it's good to keep the number of cores per executor below that number.
● Running tiny executors (with a single core and just enough memory to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM. For example, broadcast variables need to be replicated once on each executor, so many small executors result in many more copies of the data.
39. Is it right to say the setting maximizeResourceAllocation should be used in all cases? If not, what are the reasons not to use it?
Using the dynamicAllocation settings, we can better determine the number of overall resources we have and how many executors we want to give to each job.
For example, leaving the cluster as-is and running only the original single job per cluster, we would naturally want to allocate all of our resources to that job. If the job runs better using large executors, we would allocate all of our resources to it. We could do this by creating the cluster with the maximizeResourceAllocation property set to true, which automatically tunes the cluster to use the most resources available [8]. Your existing settings of "--executor-memory 9g --executor-cores 3 --conf spark.yarn.executor.memoryOverhead=2048" would be sufficient for this task, though not actually necessary if the maximizeResourceAllocation option is set. You can see the settings that the maximizeResourceAllocation option sets in the /etc/spark/conf/spark-defaults.conf file on your master node.
40. 1 Job, smaller number of larger executors
spark-submit --master yarn --executor-cores 3 \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=40 \
  --conf spark.dynamicAllocation.initialExecutors=40 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.driver.maxResultSize=10G \
  --conf spark.akka.frameSize=1000
(Note: maximizeResourceAllocation is an EMR cluster-creation setting in the spark classification rather than a spark-submit conf, so it is assumed to already be enabled on the cluster instead of being passed on the command line.)
Here, Spark dynamicAllocation will allocate the maximum available RAM for your executors. We will be running 40 individual executors with 3 vCPUs each. But what if your job runs better with a greater number of smaller executors?
41. 1 Job, greater number of smaller executors:
In this case you would simply set the dynamicAllocation settings in a way similar to the following, but adjust your memory and vCPU options to allow more executors to be launched on each node. So, for twice the executors on our 40 nodes, we need to halve the memory and the vCPUs so that 2 executors "fit" onto each of our nodes.
spark-submit --master yarn --executor-cores 1 --executor-memory=4500m \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=80 \
  --conf spark.dynamicAllocation.initialExecutors=20 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.driver.maxResultSize=10G
42. 2 Jobs, equal number of 1 larger executor per node per job
spark-submit --master yarn --executor-cores 1 --executor-memory=4500m \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=40 \
  --conf spark.dynamicAllocation.initialExecutors=20 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.driver.maxResultSize=10G
43. 2 Jobs, equal number of 1 larger executor per node per job
The main difference here is the value of "--conf spark.dynamicAllocation.maxExecutors=40". We need to set this to 40, as our memory limit is essentially halved when running 2 separate jobs on each node. Hence, our total cluster resources effectively halve to 80 vCPUs and 225GiB RAM per spark-submit job. Fitting our 2 jobs into these new limits again gives us 1 vCPU per executor: the halved budget of 80 vCPUs across 40 nodes works out to 2 vCPUs per node per job, but since we need to leave 1 of the 4 cores free per node, we really have 4 - 1 = 3 vCPUs per node; divided between 2 jobs that is 1.5, which rounds down to 1 vCPU per job.
225 * 0.8 (to get 80% of settable memory) = 180GiB per job; 180 / 40 nodes = 4.5GiB per job per node.
By setting spark.dynamicAllocation.maxExecutors to 40 for each job, we ensure that neither job encroaches on the resources allocated to the other. With maxExecutors limited to 40 per job, the maximum resources available per job are:
40 x 4500MiB RAM (divided by 1000 to convert to GiB) = 180GiB per job; 40 x 1 vCPU = 40 vCPUs per job.
2 jobs means that in total 360GiB RAM is used (80% of total RAM, as expected), and 40 vCPUs * 2 jobs is 80 vCPUs, which again fits within the maximum vCPUs for the cluster.
The difference between Example 3 and Example 2 is that in Example 2, all of the resources are used by 2 executors on each node for the 1 job: your job would have a maximum of 80 executors (40 nodes * 2 executors per node) running on the same 1 job. In Example 3, your 2 jobs would each have a maximum of 40 executors (40 nodes * 1 executor per node) running concurrently.
44. NOTE 1
The --executor-cores value was only set to 1 because we are only using 3 of the 4 vCPUs per node and 3/2 = 1.5, and decimal values are not allowed in this parameter. This means a whole vCPU core goes unused.
In Example 2 above, you could simply lower the --executor-memory value to 3GiB and increase the --conf spark.dynamicAllocation.maxExecutors value to 120 to use this core.
"Please note that with the reduced CPU power per node, it may be worthwhile either increasing your instance count or instance type in order to maintain the speeds that you want. Increasing the instance type (say, to an m3.2xlarge) also has the benefit of being able to use more of the node's CPU for the executors. The OS only needs a small amount of CPU free, but the Spark tunings only let us specify the CPU to use in terms of whole cores.
Hence, on an m3.xlarge, 1 core left free from the 4 cores is 25% of your CPU resources left free, but on an m3.2xlarge, leaving 1 core free from the 8 cores is only 12.5% of your CPU resources left free, and so on."
You may need to balance the instance type against the instance count to get a "best case" fit for your resource requirements.
45. NOTE 2
You could consider amalgamating both of your jobs into the one .jar file that is being called. While this would likely require more coding and tweaking of your application, it would let the Spark dynamicAllocation process effectively "load balance" the number of executors across your jobs.
Instead of running your 2 jobs concurrently in two separate spark-submit calls, a single spark-submit call using settings similar to Example 1 above would allow the resources from one job to be used by the other if one finishes earlier.
Of course, you have less control over which job gets more exclusive resources, but it may mean that (using Example 3 as an example) resources left over from the end of Job 1 could be used to help speed up the remainder of Job 2, and vice versa.
46. NOTE 3
If you run the jobs on the same cluster, note that the job logs will be grouped together. This can make debugging issues in your jobs more difficult, as there will be less separation of the job logs. If you want a greater level of job log separation, consider running each job on a separate EMR cluster. This gives you an extra cluster to monitor, but also separate CloudWatch metrics per job instead of a single set of metrics for both jobs. That makes monitoring a lot easier, as the metrics per cluster relate to only one of your jobs; with a shared cluster it would be much harder to see from the CloudWatch metrics which specific job was causing an issue.
47. Links:
[1] Amazon EC2 Instance Types: https://aws.amazon.com/ec2/instance-types/
[2] Running Spark on YARN: http://spark.apache.org/docs/1.6.1/running-on-yarn.html
[3] Dynamic allocation configuration: https://spark.apache.org/docs/1.6.1/configuration.html#dynamic-allocation
[4] How-to: Tune Your Apache Spark Jobs (Part 1): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
[5] How-to: Tune Your Apache Spark Jobs (Part 2): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
[6] Tuning Spark: http://spark.apache.org/docs/1.6.1/tuning.html
[7] Submitting User Applications with spark-submit: https://blogs.aws.amazon.com/bigdata/post/Tx578UTQUV7LRP/Submitting-User-Applications-with-spark-submit
[8] Configure Spark (EMR 4.x Releases): http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html#spark-dynamic-allocation
[9] Apache Spark Config Cheatsheet: http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/