This document summarizes a presentation on Spark tuning for system administrators. It provides contact information for the presenters, Anya Bida and Rachel Warren, and covers intermittent, reliable, and optimal behavior of a Spark application (mySparkApp). Key points include setting initial Spark configuration parameters such as spark.executor.memory, using fair schedulers in YARN and Spark, and handling memory issues with techniques such as persisting RDDs to disk and checkpointing to improve reliability.
This is a version of a talk I presented at Spark Summit East 2016 with Rachel Warren. In this version, I also discuss memory management on the JVM with pictures from Alexey Grishchenko, Sandy Ryza, and Mark Grover.
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
by Anya Bida and Rachel Warren from Alpine Data
https://spark-summit.org/east-2016/events/spark-tuning-for-enterprise-system-administrators/
Spark offers the promise of speed, but many enterprises are reluctant to make the leap from Hadoop to Spark. Indeed, System Administrators will face many challenges with tuning Spark performance. This talk is a gentle introduction to Spark Tuning for the Enterprise System Administrator, based on experience assisting two enterprise companies running Spark in yarn-cluster mode. The initial challenges can be categorized in two FAQs. First, with so many Spark Tuning parameters, how do I know which parameters are important for which jobs? Second, once I know which Spark Tuning parameters I need, how do I enforce them for the various users submitting various jobs to my cluster? This introduction to Spark Tuning will enable enterprise system administrators to overcome common issues quickly and focus on more advanced Spark Tuning challenges. The audience will understand the “cheat-sheet” posted here: http://techsuppdiva.github.io/
Key takeaways:
FAQ 1: With so many Spark Tuning parameters, how do I know which parameters are important for which jobs?
Solution 1: The Spark Tuning cheat-sheet! A visualization that guides the System Administrator to quickly overcome the most common hurdles to algorithm deployment. http://techsuppdiva.github.io/
FAQ 2: Once I know which Spark Tuning parameters I need, how do I enforce them at the user level? job level? algorithm level? project level? cluster level?
Solution 2: We’ll approach these challenges using job & cluster configuration, the Spark context, and 3rd party tools, of which Alpine will be one example. We’ll operationalize Spark parameters according to user, job, algorithm, workflow pipeline, or cluster levels.
6. Default != Recommended
Example: By default, spark.executor.memory = 1g.
1g allows small jobs to finish out of the box.
Spark assumes you'll increase this parameter.
7. Which parameters are important?
How do I configure them?
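As a minimal sketch of overriding a default such as spark.executor.memory at submit time, the helper below builds a spark-submit argument list with one --conf flag per setting. The helper name spark_submit_args, the script name my_spark_app.py, and the value 4g are hypothetical and only for illustration; the right value depends on the cluster.

```python
def spark_submit_args(app, overrides):
    """Build a spark-submit argument list with one --conf flag per override."""
    args = ["spark-submit"]
    # Sort keys so the generated command is stable across runs.
    for key, value in sorted(overrides.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# Hypothetical application and value: raise executor memory above the 1g default.
cmd = spark_submit_args("my_spark_app.py", {"spark.executor.memory": "4g"})
print(" ".join(cmd))
```

Keeping the overrides in a dictionary makes it easy to review which defaults a given job changes before it is submitted.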
8. The Spark Tuning Cheat-Sheet
• Filter* data before an expensive reduce or aggregation.
• Consider* coalesce(
• Use* data structures that require less memory (tuning.html#tuning-data-structures).
• Serialize*: PySpark serializing is built-in. Scala/Java? persist(storageLevel.[*]_SER). Recommended: kryoserializer *
• See "Optimize partitions." *
• See "GC investigation." *
• See "Checkpointing." *
29. What is the memory limit for mySparkApp?
Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
mySparkApp_mem_limit = driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
Limitation: Driver must not be larger than a single node.
Parameter             Default         Recommended
spark.executor.cores  1 (YARN mode)   5 or less
executors per node = (cores per node) / (5 cores per executor)
executor.memory = (memory per node) / (executors per node)
maxExecutors = (executors per node) x (num nodes)
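The three sizing formulas above can be sketched as a small calculation. The function name size_executors and the example cluster (16 cores and 64 GB per node, 10 nodes) are hypothetical, and rounding down to whole executors and whole gigabytes is an assumption of this sketch, not something the slides specify.

```python
def size_executors(cores_per_node, memory_per_node_gb, num_nodes, cores_per_executor=5):
    """Apply the slide formulas literally, rounding down (an assumption)."""
    # executors per node = (cores per node) / (5 cores per executor)
    executors_per_node = cores_per_node // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("node has fewer cores than one executor needs")
    # executor.memory = (memory per node) / (executors per node)
    executor_memory_gb = memory_per_node_gb // executors_per_node
    # maxExecutors = (executors per node) x (num nodes)
    max_executors = executors_per_node * num_nodes
    return executors_per_node, executor_memory_gb, max_executors


# Hypothetical cluster: 16 cores and 64 GB per node, 10 nodes.
print(size_executors(cores_per_node=16, memory_per_node_gb=64, num_nodes=10))
# prints (3, 21, 30)
```

The resulting values would feed spark.executor.cores, spark.executor.memory, and spark.dynamicAllocation.maxExecutors respectively.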
Verify that my calculations respect the memory limit for mySparkApp (mySparkApp_mem_limit).
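One way to run that verification is sketched below. It reads the formulas as "the memory requested by driver plus executors should not exceed 3/4 of the pool's maximum memory", which is an interpretation of the slides; the function names and the example numbers (a 1000 GB pool, a 4 GB driver, 21 GB executors) are hypothetical.

```python
def app_memory_gb(driver_memory_gb, executor_memory_gb, max_executors):
    """driver.memory + (executor.memory x dynamicAllocation.maxExecutors)"""
    return driver_memory_gb + executor_memory_gb * max_executors


def fits_in_pool(pool_max_memory_gb, driver_memory_gb, executor_memory_gb, max_executors):
    """True if the requested memory stays within 3/4 of the pool's max memory."""
    mem_limit_gb = pool_max_memory_gb * 3 / 4
    requested = app_memory_gb(driver_memory_gb, executor_memory_gb, max_executors)
    return requested <= mem_limit_gb


# Hypothetical pool of 1000 GB, so the limit is 750 GB.
print(fits_in_pool(1000, driver_memory_gb=4, executor_memory_gb=21, max_executors=30))
# prints True  (4 + 21 * 30 = 634 GB)
print(fits_in_pool(1000, driver_memory_gb=4, executor_memory_gb=21, max_executors=40))
# prints False (4 + 21 * 40 = 844 GB)
```

If the check fails, lowering maxExecutors or executor.memory brings the request back under the limit.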
44. mySparkApp memory issues
Reduce the memory needed for mySparkApp. How?
Gracefully handle memory limitations. How?
persist(storageLevel.[*]_SER)
Recommended: kryoserializer *
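A small sketch of enabling the recommended Kryo serializer through configuration follows. spark.serializer and org.apache.spark.serializer.KryoSerializer are Spark's own setting and class name; the helper spark_defaults_lines is hypothetical and simply renders settings in the "key value" layout used by spark-defaults.conf. The serialized storage levels themselves (for example StorageLevel.MEMORY_ONLY_SER) are chosen in Scala/Java application code and are not shown here.

```python
def spark_defaults_lines(settings):
    """Render settings as 'key value' lines, the layout of spark-defaults.conf."""
    return [f"{key} {value}" for key, value in sorted(settings.items())]


# Switch from the default Java serializer to Kryo.
kryo = {"spark.serializer": "org.apache.spark.serializer.KryoSerializer"}
for line in spark_defaults_lines(kryo):
    print(line)
```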
Here, let's talk about one scenario.
56. Symptoms:
• mySparkApp is running for several hours.
• Container is lost.
• Several Executors are lost.
• Behavior is intermittent (sometimes succeeds, sometimes fails).
61. Potential Solution: RDD.checkpoint()
Use in these cases:
• high-traffic cluster
• network blips
• preemption
• disk space nearly full
Function:
• saves the RDD to stable storage (e.g. HDFS or S3)
How-to:
Cache first!
SparkContext.setCheckpointDir(directory: String)
RDD.checkpoint()
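The how-to above can be sketched in PySpark as follows. The function name checkpoint_rdd and the directory used in the usage comment are hypothetical; sc is assumed to be an existing SparkContext and rdd an RDD created from it. The final count() is an assumption of this sketch: the checkpoint is only written once an action runs, so an action is triggered explicitly.

```python
def checkpoint_rdd(sc, rdd, checkpoint_dir):
    """Cache an RDD, mark it for checkpointing, and force the checkpoint to be written."""
    # Stable storage location, e.g. an HDFS or S3 path.
    sc.setCheckpointDir(checkpoint_dir)
    # Cache first, so writing the checkpoint does not recompute the RDD.
    rdd.cache()
    # Mark the RDD for checkpointing; nothing is written yet.
    rdd.checkpoint()
    # Run an action so the checkpoint is actually materialized.
    rdd.count()
    return rdd


# Usage with a real SparkContext (hypothetical path):
#   rdd = checkpoint_rdd(sc, sc.parallelize(range(1000)), "hdfs:///tmp/mySparkApp-checkpoints")
```

After the checkpoint is written, lost executors can reload the data from stable storage instead of replaying the full lineage.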