Standalone Spark Deployment
For Stability and Performance
Totango
❖ Leading Customer Success Platform
❖ Helps companies retain and grow their customer base
❖ Advanced actionable analytics for subscription and recurring revenue
❖ Founded @ 2010
❖ Infrastructure on AWS cloud
❖ Spark for batch processing
❖ ElasticSearch for serving layer
About Me
Romi Kuntsman
Senior Big Data Engineer @ Totango
Working with Apache Spark since v1.0
Working with AWS Cloud since 2008
Spark on AWS - first attempts
❖ We tried Amazon EMR (Elastic MapReduce) to install Spark on YARN
➢Performance hit per application (starts Spark instance for each)
➢Performance hit per server (running services we don't use, like HDFS)
➢Slow and unstable cluster resizing (often got stuck, and the cluster had to be recreated)
❖We tried spark-ec2 script to install Spark Standalone on AWS EC2 machines
➢Serial (not parallel) initialization of multiple servers - slow!
➢Unmaintained scripts since availability of Spark on EMR (see above)
➢Doesn't integrate with our existing systems
Spark on AWS - road to success
❖ We decided to write our own scripts to integrate and control everything
❖Understood all Spark components and configuration settings
❖Deployment based on Chef, as on all our servers
❖Integrated monitoring and logging, like we have in all our systems
❖Full server utilization - running exactly what we need and nothing more
❖Cluster hanging or crashing no longer happens
❖Seamless cluster resize without hurting any existing jobs
❖Able to upgrade to any version of Spark (not dependent on a third party)
What we'll discuss
❖Separation of Spark Components
❖Centralized Managed Logging
❖Monitoring Cluster Utilization
❖Auto Scaling Groups
❖Termination Protection
❖Upstart Mechanism
❖NewRelic Integration
❖Chef-based Instantiation
Data w/ Romi
Ops w/ Alon
Separation of Components
❖Spark Master Server (single)
➢Master Process - accepts requests to start applications
➢History Process - serves history data of completed applications
❖Spark Slave Server (multiple)
➢Worker Process - handles workload of applications on server
➢External Shuffle Service - handles data exchange between workers
➢Executor Process (one per core - for running apps) - runs actual code
Configuration - Deploy Spread Out
❖spark.deploy.spreadOut (SPARK_MASTER_OPTS)
➢true = use cores spread across all workers
➢false = fill up all worker cores before getting more
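As a rough illustration, the setting could be passed to the master through conf/spark-env.sh as sketched here. The value false is picked only for the example (Spark's default is true); the slides do not say which value is used in production.

```shell
# conf/spark-env.sh on the master server (sketch; value chosen for illustration).
# spreadOut=false packs each application onto as few workers as possible,
# spreadOut=true (the Spark default) spreads its cores across all workers.
SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"
export SPARK_MASTER_OPTS
```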
Configuration - Cleanup
❖spark.worker.cleanup.* (SPARK_WORKER_OPTS)
➢.enabled = true (turn on mechanism to clean up app folders)
➢.interval = 1800 (run every 1800 seconds, or 30 minutes)
➢.appDataTtl = 1800 (remove finished applications after 30 minutes)
❖We have 100s of applications per day, each with its jars and logs
❖Rapid cleanup is essential to avoid filling up disk space
❖We collect the logs before cleanup - details in following slides ;-)
❖Only cleans up files of completed applications
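The three cleanup values listed above might be wired into each worker through conf/spark-env.sh roughly as follows. This is a sketch of one way to set them, not necessarily how the Chef recipes in this talk do it.

```shell
# conf/spark-env.sh on each worker server (sketch using the values from the slide).
# interval and appDataTtl are both given in seconds: 1800 s = 30 minutes.
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
-Dspark.worker.cleanup.interval=1800 \
-Dspark.worker.cleanup.appDataTtl=1800"
export SPARK_WORKER_OPTS
```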
External Shuffle Service
❖Preserves shuffle files written by executors
❖Serves shuffle files to other executors that want to fetch them
❖If (when) one executor crashes (OOM etc.), others can still access its shuffle files
❖We run the shuffle service itself in a separate process from the executor
❖To enable: spark.shuffle.service.enabled=true
❖Config: spark.shuffle.io.* (see documentation)
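A minimal sketch of turning the service on from a provisioning script, assuming the setting lives in a spark-defaults.conf file. The Spark configuration documentation spells the property spark.shuffle.service.enabled; the helper name and the file path passed to it are hypothetical.

```shell
# Hypothetical helper: add the shuffle-service flag to a spark-defaults.conf
# file once, so repeated provisioning runs do not duplicate the line.
enable_shuffle_service() {
  conf_file=$1
  if ! grep -q '^spark\.shuffle\.service\.enabled' "$conf_file" 2>/dev/null; then
    # spark-defaults.conf uses "key value" pairs separated by whitespace
    printf 'spark.shuffle.service.enabled true\n' >> "$conf_file"
  fi
}

# Usage (path is an assumption):
#   enable_shuffle_service /opt/spark/conf/spark-defaults.conf
```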
Logging - components
❖ Master Log (/logs/spark-runner-org.apache.spark.deploy.master.Master-*)
➢Application registration, worker coordination
❖History Log (/logs/spark-runner-org.apache.spark.deploy.history.HistoryServer-*)
➢Access to history, errors reading history (e.g. I/O from S3, not found)
❖Worker Log (/logs/spark-runner-org.apache.spark.deploy.worker.Worker-*)
➢Executor management (launch, kill, ACLs)
❖Shuffle Log (/logs/org.apache.spark.deploy.ExternalShuffleService-*)
➢External Executor Registrations
Logging - applications
❖Application Logs (/mnt/spark-work/app-12345/execid/stderr)
➢All output from executor process, including your own code
❖Using LogStash to gather logs from all applications together
input {
  file {
    path => "/mnt/spark-work/app-*/*/std*"
    start_position => beginning
  }
}
filter {
  grok {
    match => [ "path", "/mnt/spark-work/%{NOTSPACE:application}/.+/%{NOTSPACE:logtype}" ]
  }
}
output {
  file {
    path => "/logs/applications.log"
    message_format => "%{application} %{logtype} %{message}"
  }
}
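To make the grok step concrete, the shell sketch below mimics what the pattern pulls out of a log path. It assumes the three-part layout shown on the slide (application ID, executor ID, file name); the function name is made up for the example and is not part of LogStash.

```shell
# Hypothetical illustration of the grok match: split an executor log path
# into the "application" and "logtype" fields used in the output format.
parse_spark_log_path() {
  path=$1
  rest=${path#/mnt/spark-work/}   # e.g. app-12345/0/stderr
  application=${rest%%/*}         # first path component: application ID
  logtype=${path##*/}             # last path component: stdout or stderr
  printf '%s %s\n' "$application" "$logtype"
}

parse_spark_log_path /mnt/spark-work/app-12345/0/stderr
# prints: app-12345 stderr
```

Each line written to /logs/applications.log is then prefixed with these two fields, so logs from all applications can be filtered from a single file.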
Monitoring Cluster Utilization
❖ Spark Reports Metrics (Codahale) through Graphite
➢Master metrics - running applications and their status
➢Worker metrics - used cores, free cores
➢JVM metrics - memory allocation, GC
❖We use Anodot to view and track metric trends and anomalies
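Spark's metrics system is configured through a metrics.properties file. The sketch below writes one that enables the Graphite sink and the JVM source for the master and worker. The host name, port and reporting period are placeholder assumptions, and the helper function is hypothetical.

```shell
# Hypothetical helper: generate conf/metrics.properties for Graphite reporting.
write_metrics_properties() {
  conf_dir=$1
  graphite_host=$2
  cat > "$conf_dir/metrics.properties" <<EOF
# Send metrics from all Spark instances to Graphite
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=$graphite_host
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
# Add JVM memory and GC metrics for the master and worker daemons
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
EOF
}

# Usage (directory and host are assumptions):
#   write_metrics_properties /opt/spark/conf graphite.example.internal
```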
And now, to the Ops side...
Alon Torres
DevOps Engineer @ Totango
Auto Scaling Group Components
❖Auto Scaling Group
➢Scale your group up or down flexibly
➢Supports health checks and load balancing
❖Launch Configuration
➢Template used by the ASG to launch instances
➢User Data script for post-launch configuration
❖User Data
➢Install prerequisites and fetch instance info
➢Install and start Chef client
[Diagram: Launch Configuration (with User Data) → Auto Scaling Group → EC2 Instances]
Auto Scaling Group resizing in AWS
❖ Scheduled
➢Set the desired size according to a specified schedule
➢Good for scenarios with predictable, cyclic workloads.
❖Alert-Based
➢Set specific alerts that trigger a cluster action
➢Alerts can monitor instance health properties (resource usage)
❖Remote-triggered
➢Using the AWS API/CLI, resize the cluster however you want
Resizing the ASG with Jenkins
❖We use schedule-based Jenkins jobs that utilize the AWS CLI
➢Each job sets the desired Spark cluster size
➢Makes it easy for our Data team to make changes to the schedule
➢Desired size can be manually overridden if needed
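A scheduled Jenkins job of this kind could boil down to a single AWS CLI call. The sketch below wraps it in a function; the group name and sizes are hypothetical, and AWS_CMD is an invented override that lets the command be previewed with echo instead of being sent to AWS.

```shell
# Hypothetical wrapper around the AWS CLI call that sets the cluster size.
resize_spark_cluster() {
  asg_name=$1
  desired=$2
  # Reject anything that is not a non-negative integer before calling AWS
  case $desired in
    ''|*[!0-9]*)
      echo "desired size must be a non-negative integer" >&2
      return 2
      ;;
  esac
  ${AWS_CMD:-aws} autoscaling set-desired-capacity \
    --auto-scaling-group-name "$asg_name" \
    --desired-capacity "$desired"
}

# Usage, e.g. from a Jenkins "Execute shell" step (names are assumptions):
#   resize_spark_cluster spark-workers 10   # before the nightly batch
#   resize_spark_cluster spark-workers 2    # after it finishes
```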
Termination Protection
❖When scaling down, ASG treats all nodes as equal termination candidates
❖We want to avoid killing instances with currently running jobs
❖To achieve this, we used a built-in feature of ASG - termination protection
❖Any instance in the ASG can be set as protected, which prevents its termination when the cluster scales down
if [ $(ps -ef | grep executor | grep spark | wc -l) -ne 0 ]; then
  aws autoscaling set-instance-protection --protected-from-scale-in …
fi
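The check above only turns protection on. A cron script also needs to turn it off again once the executors are gone, otherwise the instance could never be scaled in. The sketch below is one possible shape for that, not the script from the talk: the function names, the AWS_CMD override and the use of pgrep on the executor class name CoarseGrainedExecutorBackend are all assumptions.

```shell
# Pick the AWS CLI flag that matches the number of running executors.
protection_flag() {
  if [ "$1" -gt 0 ]; then
    echo "--protected-from-scale-in"
  else
    echo "--no-protected-from-scale-in"
  fi
}

# Hypothetical cron entry point: protect the instance while executors run,
# release it when none are left.
update_protection() {
  asg_name=$1
  instance_id=$2
  # Arithmetic expansion strips any padding that wc adds around the count
  executors=$(( $(pgrep -f CoarseGrainedExecutorBackend | wc -l) ))
  ${AWS_CMD:-aws} autoscaling set-instance-protection \
    --auto-scaling-group-name "$asg_name" \
    --instance-ids "$instance_id" \
    $(protection_flag "$executors")
}

# Usage (group name is an assumption; the ID comes from EC2 instance metadata):
#   id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
#   update_protection spark-workers "$id"
```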
Upstart Jobs for Spark
❖ Every Spark component has an Upstart job that does the following
➢Set Spark Niceness (Process priority in CPU resource distribution)
➢Start the required Spark component and ensure it stays running
■ The default spark daemon script runs in the background
■ For Upstart, we modified the script to run in the foreground
❖ Default (background): nohup nice -n "$SPARK_NICENESS"…&
vs
❖ Modified for Upstart (foreground): nice -n "$SPARK_NICENESS" ...
NewRelic Monitoring
❖ Cloud-based Application and Server monitoring
❖Supports multiple alert policies for different needs
➢Who to alert, and what triggers the alerts
❖Newly created instances are auto-assigned the default alert policy
Policy Assignment using AWS Lambda
❖Spark instances have their own policy in NewRelic
❖Each instance has to ask NewRelic to be reassigned to the new policy
➢Parallel reassignment requests may collide and override each other
❖Solution - during provisioning and shutdown, we do the following:
➢Put a record in an AWS Kinesis stream containing the instance's hostname and its desired NewRelic policy ID
➢The record triggers an AWS Lambda script that uses the NewRelic API to reassign the given hostname to the given policy ID
Chef
❖Configuration Management Tool, can provision and configure instances
➢Describe an instance state as code, let chef handle the rest
➢Typically works in server/client mode - client updates every 30m
➢Besides provisioning, also prevents configuration drift
❖Vast amount of plugins and cookbooks - the sky's the limit!
❖Configures all the instances in our DC
Spark Instance Provisioning
❖ Setup Spark
➢Setup prerequisites - users, directories, symlinks and jars
➢ Download and extract spark package from S3
❖Configure termination protection cron script
❖Configure upstart conf files
❖Place spark config files
❖Assign NewRelic policy
❖Add shutdown scripts
➢Delete instance from chef database
Questions?
❖ Alon Torres, DevOps
https://il.linkedin.com/in/alontorres
❖Romi Kuntsman, Senior Big Data Engineer
https://il.linkedin.com/in/romik
❖Stay in touch!
Totango Engineering Technical Blog
http://labs.totango.com/

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Standalone Spark Deployment for Stability and Performance

  • 1. Standalone Spark Deployment For Stability and Performance
  • 2. Totango ❖ Leading Customer Success Platform ❖ Helps companies retain and grow their customer base ❖ Advanced actionable analytics for subscription and recurring revenue ❖ Founded @ 2010 ❖ Infrastructure on AWS cloud ❖ Spark for batch processing ❖ ElasticSearch for serving layer
  • 3. About Me Romi Kuntsman Senior Big Data Engineer @ Totango Working with Apache Spark since v1.0 Working with AWS Cloud since 2008
  • 4. Spark on AWS - first attempts ❖ We tried Amazon EMR (Elastic MapReduce) to install Spark on YARN ➢Performance hit per application (starts Spark instance for each) ➢Performance hit per server (running services we don't use, like HDFS) ➢Slow and unstable cluster resizing (often stuck and need to recreate) ❖We tried spark-ec2 script to install Spark Standalone on AWS EC2 machines ➢Serial (not parallel) initialization of multiple servers - slow! ➢Unmaintained scripts since availability of Spark on EMR (see above) ➢Doesn't integrate with our existing systems
  • 5. Spark on AWS - road to success ❖ We decided to write our own scripts to integrate and control everything ❖Understood all Spark components and configuration settings ❖Deployment based on Chef, as we do on all servers ❖Integrated monitoring and logging, as we have in all our systems ❖Full server utilization - running exactly what we need and nothing more ❖Cluster hanging or crashing no longer happens ❖Seamless cluster resize without hurting any existing jobs ❖Able to upgrade to any version of Spark (not dependent on a third party)
  • 6. What we'll discuss ❖Separation of Spark Components ❖Centralized Managed Logging ❖Monitoring Cluster Utilization ❖Auto Scaling Groups ❖Termination Protection ❖Upstart Mechanism ❖NewRelic Integration ❖Chef-based Instantiation Data w/ Romi Ops w/ Alon
  • 7. Separation of Components ❖Spark Master Server (single) ➢Master Process - accepts requests to start applications ➢History Process - serves history data of completed applications ❖Spark Slave Server (multiple) ➢Worker Process - handles workload of applications on server ➢External Shuffle Service - handles data exchange between workers ➢Executor Process (one per core - for running apps) - runs actual code
  • 8. Configuration - Deploy Spread Out ❖spark.deploy.spreadOut (SPARK_MASTER_OPTS) ➢true = use cores spread across all workers ➢false = fill up all worker cores before getting more
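A minimal sketch of how this setting might be passed to the master process, assuming it lives in conf/spark-env.sh. The slide does not say which value was chosen, so the `false` below is purely illustrative.

```shell
# conf/spark-env.sh on the master server (assumed location).
# spreadOut=false fills up the cores of one worker before using the next;
# Spark's default is true. The value here is illustrative only.
SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"
export SPARK_MASTER_OPTS
```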
  • 9. Configuration - Cleanup ❖spark.worker.cleanup.* (SPARK_WORKER_OPTS) ➢.enabled = true (turn on mechanism to clean up app folders) ➢.interval = 1800 (run every 1800 seconds, or 30 minutes) ➢.appDataTtl = 1800 (remove finished applications after 30 minutes) ❖We have 100s of applications per day, each with its jars and logs ❖Rapid cleanup is essential to avoid filling up disk space ❖We collect the logs before cleanup - details in following slides ;-) ❖Only cleans up files of completed applications
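The three cleanup properties above could be wired into the worker roughly as follows. This is a sketch that assumes the options are set in conf/spark-env.sh on each worker server; the values are the ones quoted on the slide.

```shell
# conf/spark-env.sh on each worker server (assumed location).
# Values from the slide: run the cleanup every 30 minutes and remove
# the folders of finished applications after 30 minutes.
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.worker.cleanup.interval=1800"
SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.worker.cleanup.appDataTtl=1800"
export SPARK_WORKER_OPTS
```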
  • 10. External Shuffle Service ❖Preserves shuffle files written by executors ❖Serves shuffle files to other executors that want to fetch them ❖If (when) one executor crashes (OOM etc), others may still access its shuffle files ❖We run the shuffle service itself in a separate process from the executor ❖To enable: spark.shuffle.service.enabled=true ❖Config: spark.shuffle.io.* (see documentation)
  • 11. Logging - components ❖ Master Log (/logs/spark-runner-org.apache.spark.deploy.master.Master-*) ➢Application registration, worker coordination ❖History Log (/logs/spark-runner-org.apache.spark.deploy.history.HistoryServer-*) ➢Access to history, errors reading history (e.g. I/O from S3, not found) ❖Worker Log (/logs/spark-runner-org.apache.spark.deploy.worker.Worker-*) ➢Executor management (launch, kill, ACLs) ❖Shuffle Log (/logs/org.apache.spark.deploy.ExternalShuffleService-*) ➢External Executor Registrations
  • 12. Logging - applications ❖Application Logs (/mnt/spark-work/app-12345/execid/stderr) ➢All output from executor process, including your own code ❖Using LogStash to gather logs from all applications together:
      input { file { path => "/mnt/spark-work/app-*/*/std*" start_position => beginning } }
      filter { grok { match => [ "path", "/mnt/spark-work/%{NOTSPACE:application}/.+/%{NOTSPACE:logtype}" ] } }
      output { file { path => "/logs/applications.log" message_format => "%{application} %{logtype} %{message}" } }
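To make the grok pattern concrete, the sketch below pulls the same two fields (application and log type) out of a log path using plain shell parameter expansion. It is not how LogStash works internally, only an illustration of what ends up in each field; the function name `parse_log_path` is hypothetical and the path layout is the one shown on the slide.

```shell
#!/bin/sh
# Illustration only: extract the "application" and "logtype" fields that the
# grok filter captures, assuming paths of the form
#   /mnt/spark-work/<application>/<executor id>/<stdout|stderr>
parse_log_path() {
    path="$1"
    rest="${path#/mnt/spark-work/}"   # strip the work-dir prefix
    application="${rest%%/*}"         # first path component
    logtype="${path##*/}"             # last path component
    echo "$application $logtype"
}

parse_log_path /mnt/spark-work/app-12345/0/stderr   # prints: app-12345 stderr
```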
  • 13. Monitoring Cluster Utilization ❖ Spark Reports Metrics (Codahale) through Graphite ➢Master metrics - running applications and their status ➢Worker metrics - used cores, free cores ➢JVM metrics - memory allocation, GC ❖We use Anodot to view and track metrics trends and anomalies
  • 14. And now, to the Ops side... Alon Torres DevOps Engineer @ Totango
  • 15. Auto Scaling Group Components ❖Auto Scaling Group ➢Scale your group up or down flexibly ➢Supports health checks and load balancing ❖Launch Configuration ➢Template used by the ASG to launch instances ➢User Data script for post-launch configuration ❖User Data ➢Install prerequisites and fetch instance info ➢Install and start Chef client [Diagram: Launch Configuration, User Data, Auto Scaling Group, six EC2 instances]
  • 16. Auto Scaling Group resizing in AWS ❖ Scheduled ➢Set the desired size according to a specified schedule ➢Good for scenarios with predictable, cyclic workloads. ❖Alert-Based ➢Set specific alerts that trigger a cluster action ➢Alerts can monitor instance health properties (resource usage) ❖Remote-triggered ➢Using the AWS API/CLI, resize the cluster however you want
  • 17. Resizing the ASG with Jenkins ❖We use schedule-based Jenkins jobs that utilize the AWS CLI ➢Each job sets the desired Spark cluster size ➢Makes it easy for our Data team to make changes to the schedule ➢Desired size can be manually overridden if needed
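A scheduled job of this kind might boil down to a single AWS CLI call. The sketch below assumes a group named `spark-workers` and a target size of 20, both hypothetical; by default it only prints the command, and setting `DRY_RUN=0` makes it call AWS.

```shell
#!/bin/sh
# Sketch of a scheduled resize job. ASG_NAME and the size are hypothetical.
ASG_NAME="${ASG_NAME:-spark-workers}"

# Print the command by default; set DRY_RUN=0 to actually call AWS.
resize_cluster() {
    desired="$1"
    set -- aws autoscaling set-desired-capacity \
        --auto-scaling-group-name "$ASG_NAME" \
        --desired-capacity "$desired"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

resize_cluster 20
```

Keeping the size as the only argument means each scheduled job differs by one number, which matches the idea of letting the Data team adjust the schedule easily.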
  • 18. Termination Protection ❖When scaling down, ASG treats all nodes as equal termination candidates ❖We want to avoid killing instances with currently running jobs ❖To achieve this, we used a built-in feature of ASG - termination protection ❖Any instance in the ASG can be set as protected, thus preventing termination when scaling down the cluster.
      if [ $(ps -ef | grep executor | grep spark | wc -l) -ne 0 ]; then
        aws autoscaling set-instance-protection --protected-from-scale-in …
      fi
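A fuller version of such a check, written as a sketch rather than the script actually used: it assumes the group is called `spark-workers` (hypothetical), that the instance ID can be read from the EC2 instance metadata endpoint without a session token, and that executors can be recognized by the `CoarseGrainedExecutorBackend` class name instead of the two-step grep on the slide. It also removes the protection again once no executor is running, which the slide does not show.

```shell
#!/bin/sh
# Cron-driven sketch: protect this instance from scale-in while Spark
# executors are running, and release the protection once it is idle.
# ASG_NAME is a hypothetical group name.
ASG_NAME="${ASG_NAME:-spark-workers}"

# Count running executor JVMs; the [C] bracket stops grep matching itself.
count_executors() {
    ps -ef | grep -c '[C]oarseGrainedExecutorBackend' || true
}

# Choose the flag for "aws autoscaling set-instance-protection".
protection_flag() {
    if [ "$1" -gt 0 ]; then
        echo "--protected-from-scale-in"
    else
        echo "--no-protected-from-scale-in"
    fi
}

update_protection() {
    # The EC2 instance metadata service returns this instance's ID.
    instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    aws autoscaling set-instance-protection \
        --instance-ids "$instance_id" \
        --auto-scaling-group-name "$ASG_NAME" \
        "$(protection_flag "$(count_executors)")"
}

# Only touch AWS when explicitly asked to, e.g. from cron with APPLY=1.
if [ "${APPLY:-0}" = "1" ]; then
    update_protection
fi
```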
  • 19. Upstart Jobs for Spark ❖ Every Spark component has an Upstart job that does the following ➢Set Spark Niceness (Process priority in CPU resource distribution) ➢Start the required Spark component and ensure it stays running ■ The default Spark daemon script runs in the background ■ For Upstart, we modified the script to run in the foreground ❖ Default (background): nohup nice -n "$SPARK_NICENESS"…& vs ❖ Modified (foreground): nice -n "$SPARK_NICENESS" ...
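As an illustration of the foreground variant, a launcher that an Upstart job could exec might look like the sketch below. The install path, master URL and the choice of the worker component are assumptions, not details from the slides; by default the script only prints the command, and `DRY_RUN=0` replaces the shell with the Spark process so the supervisor tracks it directly.

```shell
#!/bin/sh
# Foreground launcher sketch for a Spark worker.
# SPARK_HOME and MASTER_URL defaults are hypothetical.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
SPARK_NICENESS="${SPARK_NICENESS:-0}"
MASTER_URL="${MASTER_URL:-spark://spark-master.example:7077}"

start_worker() {
    set -- nice -n "$SPARK_NICENESS" "$SPARK_HOME/bin/spark-class" \
        org.apache.spark.deploy.worker.Worker "$MASTER_URL"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        # No nohup and no trailing "&": stay in the foreground.
        exec "$@"
    fi
}

start_worker
```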
  • 20. NewRelic Monitoring ❖ Cloud-based Application and Server monitoring ❖Supports multiple alert policies for different needs ➢Who to alert, and what triggers the alerts ❖Newly created instances are auto-assigned the default alert policy
  • 21. Policy Assignment using AWS Lambda ❖Spark instances have their own policy in NewRelic ❖Each instance has to ask NewRelic to be reassigned to the new policy ➢Parallel reassignment requests may collide and override each other ❖Solution - during provisioning and shutdown, each instance does the following: ➢Puts a record in an AWS Kinesis stream that contains its hostname and its desired NewRelic policy ID ➢The record triggers an AWS Lambda script that uses the NewRelic API to reassign the given hostname to the given policy ID
  • 22. Chef ❖Configuration Management Tool, can provision and configure instances ➢Describe an instance state as code, let Chef handle the rest ➢Typically works in server/client mode - client updates every 30m ➢Besides provisioning, also prevents configuration drift ❖Vast number of plugins and cookbooks - the sky's the limit! ❖Configures all the instances in our DC
  • 23. Spark Instance Provisioning ❖ Setup Spark ➢Setup prerequisites - users, directories, symlinks and jars ➢ Download and extract spark package from S3 ❖Configure termination protection cron script ❖Configure upstart conf files ❖Place spark config files ❖Assign NewRelic policy ❖Add shutdown scripts ➢Delete instance from chef database
  • 24. Questions? ❖ Alon Torres, DevOps https://il.linkedin.com/in/alontorres ❖Romi Kuntsman, Senior Big Data Engineer https://il.linkedin.com/in/romik ❖Stay in touch! Totango Engineering Technical Blog http://labs.totango.com/