Spark as Service in cloud on Yarn
Hadoop Meetup
bharatb@qubole.com, rgupta@qubole.com
May 15, 2015
Agenda
• Spark on Yarn
• Autoscaling Spark Apps and Cluster management
• Hive Integration with Spark
• Persistent History Server
Spark on Yarn
Hadoop1
Disadvantages of Hadoop 1
• Limited to MapReduce (MR) only
• Separate Map and Reduce slots => underutilization
• The JobTracker (JT) is heavily loaded: it handles job scheduling, monitoring, and resource allocation
Yarn Overview
Advantages of Spark on Yarn
• A general-purpose cluster for running multiple workflows. The ApplicationMaster (AM) can have custom scheduling logic
• The AM can ask for more containers when required and give up containers when they are free
• This becomes even better when YARN clusters can autoscale
• Features such as spot nodes become available, which bring additional challenges
Advantages of Spark on Yarn
• Qubole YARN clusters upscale and downscale based on load, and support spot instances
Autoscaling Spark Applications
Spark Provisioning: Problems
• A Spark application starts with a fixed number of resources and holds on to them for as long as it is alive
• It is sometimes difficult to estimate the resources a job requires, since the AM is long running
• This is especially limiting when YARN clusters can autoscale
Dynamic Provisioning
• Speed up Spark commands by using free resources in the YARN cluster, and release resources back to the ResourceManager (RM) when they are free
Spark on Yarn basics
[Diagram: Driver, AM, Executor-1 … Executor-n]
• Cluster mode: the Driver and AM run in the same JVM, in a YARN container
• Client mode: the Driver and AM run in separate JVMs
• The Driver and AM communicate using actors, which handles both cases
Dynamic Provisioning: Problem Statement
• Two parts:
– The Spark AM has no way to ask for additional containers or to give up free containers
– Automating the process of requesting and releasing containers. Cached data in containers makes this difficult
Dynamic Provisioning: Part1
Dynamic Provisioning: Part1
• Implementation of two new APIs:
// Request 5 extra executors
sc.requestExecutors(5)
// Kill executors with IDs 1, 15, and 16
sc.killExecutors(Seq("1", "15", "16"))
requestExecutors
[Diagram: Driver; AM with Reporter Thread; executors E1, E2, … En]
• The AM has a reporter thread that keeps a count of the number of executors
• The reporter thread was already used to restart executors that died
• The Driver increments this executor count when sc.requestExecutors is called
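The interaction between the target count and the reporter thread can be pictured with a small toy model. This is only a sketch of the idea, not Spark code: the class `ReporterModel` and its method names are hypothetical, and the real reporter thread talks to the YARN ResourceManager rather than filling a set.

```python
from dataclasses import dataclass, field


@dataclass
class ReporterModel:
    """Toy stand-in for the AM reporter thread (hypothetical, not Spark code)."""
    target_executors: int                        # count the reporter tries to maintain
    running: set = field(default_factory=set)    # ids of live executors
    next_id: int = 1

    def request_executors(self, extra: int) -> None:
        # Models the Driver raising the target count via sc.requestExecutors(extra).
        self.target_executors += extra

    def executor_died(self, executor_id: str) -> None:
        # Models an executor being lost; the target count is unchanged.
        self.running.discard(executor_id)

    def reporter_tick(self) -> list:
        # One pass of the reporter loop: launch executors until running == target.
        launched = []
        while len(self.running) < self.target_executors:
            executor_id = str(self.next_id)
            self.next_id += 1
            self.running.add(executor_id)
            launched.append(executor_id)
        return launched


am = ReporterModel(target_executors=2)
print(am.reporter_tick())      # initial launch: ['1', '2']
am.executor_died("1")
print(am.reporter_tick())      # replacement for the dead executor: ['3']
am.request_executors(5)
print(len(am.reporter_tick())) # five more executors are launched: 5
```

The point of the model is that the same loop that replaces dead executors also serves upscaling: raising the target is enough, and the next pass closes the gap.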
removeExecutors
• To kill executors, the caller must specify exactly which executors to kill
• The Driver maintains a list of all executors, which can be obtained with:
sc.executorStorageStatuses.foreach(x => println(x.blockManagerId.executorId))
• What is cached in each executor is also available:
sc.executorStorageStatuses.foreach(x => println(s"memUsed = ${x.memUsed} diskUsed = ${x.diskUsed}"))
Removing Executors Tradeoffs
• The BlockManager in each executor can hold cached RDDs, shuffle data, and broadcast data
• Killing an executor that holds shuffle data requires the stage to be rerun
• To avoid this, use the external shuffle service introduced in Spark 1.2
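As an illustration of how the external shuffle service is typically switched on together with dynamic allocation, the sketch below builds `--conf` arguments for spark-submit. The property names are the ones used by upstream Apache Spark (1.2 and later); the slides do not list them, so treat them as an assumption for a Qubole build. The helper names `dynamic_allocation_conf` and `to_submit_args` are hypothetical.

```python
def dynamic_allocation_conf(min_executors: int, max_executors: int) -> dict:
    """Return upstream Spark properties for dynamic allocation (assumed, not from the slides)."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        # Lets shuffle files outlive the executor that wrote them.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_submit_args(conf: dict) -> list:
    """Flatten a property dict into spark-submit '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


print(" ".join(to_submit_args(dynamic_allocation_conf(2, 20))))
```

With the shuffle service enabled, an idle executor can be released without forcing the stage that produced its shuffle output to rerun.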
Dynamic Provisioning: Part2
Upscaling Heuristics
• Request as many executors as there are pending tasks
• Request executors in rounds while there are pending tasks, doubling the number of executors added in each round, bounded by some upper limit
• Request executors by estimating the workload
• Introduced --max-executors as an extra parameter
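The round-based heuristic above can be sketched as follows. This is a simplified model under stated assumptions: the function name `plan_upscale` and its parameters are hypothetical, the first round adds one executor, and the backlog is assumed to persist for a given number of rounds. The actual policy lives in the pull requests linked later in the deck.

```python
def plan_upscale(current: int, max_executors: int, rounds_with_backlog: int) -> list:
    """Return the number of executors requested in each round.

    The request size doubles every round while tasks are still pending,
    and is capped so the total never exceeds max_executors.
    """
    requests = []
    step = 1  # assumed size of the first request
    for _ in range(rounds_with_backlog):
        headroom = max_executors - current
        if headroom <= 0:
            break  # upper limit reached; stop requesting
        grant = min(step, headroom)
        requests.append(grant)
        current += grant
        step *= 2
    return requests


# Starting from 2 executors with a cap of 20 and a backlog lasting 5 rounds:
print(plan_upscale(current=2, max_executors=20, rounds_with_backlog=5))
# → [1, 2, 4, 8, 3]
```

Doubling keeps short jobs from over-provisioning while still letting a long backlog reach the cap in a logarithmic number of rounds.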
Downscaling Heuristics
• Remove executors when they are idle
• Remove executors if they have been idle for X seconds
• Executors holding shuffle data or broadcast data cannot be removed
• --num-executors acts as the minimum number of executors
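A minimal sketch of these downscaling rules, assuming a per-executor record with idle time and data flags. `ExecutorInfo` and `executors_to_remove` are hypothetical names, and removing the longest-idle executors first is an added assumption, not something the slides state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorInfo:
    executor_id: str
    idle_seconds: float
    has_shuffle_data: bool = False
    has_broadcast_data: bool = False


def executors_to_remove(executors, idle_timeout: float, min_executors: int) -> list:
    """Pick executor ids that are safe to release.

    An executor qualifies if it has been idle for at least idle_timeout
    seconds and holds no shuffle or broadcast data. The total is never
    reduced below min_executors (the role --num-executors plays).
    """
    removable = [
        e for e in executors
        if e.idle_seconds >= idle_timeout
        and not e.has_shuffle_data
        and not e.has_broadcast_data
    ]
    removable.sort(key=lambda e: e.idle_seconds, reverse=True)  # longest idle first
    allowed = max(0, len(executors) - min_executors)
    return [e.executor_id for e in removable[:allowed]]


cluster = [
    ExecutorInfo("1", idle_seconds=120),
    ExecutorInfo("2", idle_seconds=300, has_shuffle_data=True),
    ExecutorInfo("3", idle_seconds=10),
    ExecutorInfo("4", idle_seconds=90),
]
print(executors_to_remove(cluster, idle_timeout=60, min_executors=3))  # → ['1']
print(executors_to_remove(cluster, idle_timeout=60, min_executors=1))  # → ['1', '4']
```

The returned ids are what would be passed to sc.killExecutors.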
Scope
• Kill executors on spot nodes first
• Add a flag to avoid killing executors that hold shuffle data
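One way to picture both scope items is as an ordering step over kill candidates. The sketch below is hypothetical: the tuple layout, the function name `order_kill_candidates`, and the flag name `keep_shuffle_data` are illustrative and not taken from the slides or from Spark.

```python
def order_kill_candidates(candidates, keep_shuffle_data: bool = True) -> list:
    """Order executor ids for removal, spot-node executors first.

    candidates: iterable of (executor_id, on_spot_node, has_shuffle_data).
    When keep_shuffle_data is set, executors holding shuffle data are skipped.
    """
    eligible = [c for c in candidates if not (keep_shuffle_data and c[2])]
    # sorted() is stable and False sorts before True, so negating the spot
    # flag puts spot executors first while preserving the original order.
    return [c[0] for c in sorted(eligible, key=lambda c: not c[1])]


executors = [
    ("1", False, False),
    ("2", True, False),
    ("3", True, True),
    ("4", False, True),
]
print(order_kill_candidates(executors))                           # → ['2', '1']
print(order_kill_candidates(executors, keep_shuffle_data=False))  # → ['2', '3', '1', '4']
```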
Where is the code?
• https://github.com/apache/spark/pull/2840
• https://github.com/apache/spark/pull/2746
Spark Hive Integration
What is involved?
• Spark programs should be able to access the Hive metastore
• Other Qubole services (Hive, Presto, Pig, etc.) can be producers or consumers of data and metadata
Using SparkSQL - Command UI
Using SparkSQL - Results
Using SparkSQL - Notebook
• Input can be SQL, Python, or Scala code
Using SparkSQL - REST api - scala
curl --silent -X POST \
-H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-d '{
"program" : "val s = new org.apache.spark.sql.hive.HiveContext(sc); s.sql(\"show tables\").collect.foreach(println)",
"language" : "scala",
"command_type" : "SparkCommand"
}' \
https://api.qubole.net/api/latest/commands
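Because the Scala program contains double quotes, it has to be escaped before it is embedded in the JSON body. A small sketch of building the same request from Python with only the standard library is shown here. The URL, headers, and field names are the ones from the slide. The helper name `build_spark_command_request` is hypothetical, and the request is only constructed, not sent.

```python
import json
import urllib.request

API_URL = "https://api.qubole.net/api/latest/commands"  # endpoint from the slide


def build_spark_command_request(program: str, language: str, auth_token: str):
    """Build (but do not send) the POST request for a SparkCommand."""
    payload = {
        "program": program,
        "language": language,
        "command_type": "SparkCommand",
    }
    # json.dumps escapes the quotes inside the program text.
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "X-AUTH-TOKEN": auth_token,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


scala_program = (
    'val s = new org.apache.spark.sql.hive.HiveContext(sc); '
    's.sql("show tables").collect.foreach(println)'
)
request = build_spark_command_request(scala_program, "scala", "PLACEHOLDER_TOKEN")
print(request.data.decode("utf-8"))
```

Passing the resulting request to `urllib.request.urlopen` would submit it. That step is left out because it needs a real token.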
Using SparkSQL - REST api - sql
curl --silent -X POST \
-H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-d '{
"program" : "show tables",
"language" : "sql",
"command_type" : "SparkCommand"
}' \
https://api.qubole.net/api/latest/commands
NOT RELEASED YET
Using SparkSQL - qds-sdk-py / java
from qds_sdk.commands import SparkCommand

# Read the PySpark program to submit
with open("test_spark.py") as f:
    code = f.read()

# Submit the program as a Spark command and fetch its results
cmd = SparkCommand.run(language="python", label="spark", program=code)
results = cmd.get_results()
Using SparkSQL - Cluster config
Spark UI container info
Basic cluster organization
• DB instance in the Qubole account
• SSH tunnel from the master to the metastore DB
• Metastore server running on the master on port 10000
• On master and slave nodes, hive-site.xml sets:
hive.metastore.uris=thrift://master_ip:10000
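In hive-site.xml this setting takes the usual Hadoop `<property>` form. The sketch below renders that fragment for a given master address. The helper name `metastore_site_xml` and the sample IP are hypothetical, while the property name and port come from the slide.

```python
import xml.etree.ElementTree as ET


def metastore_site_xml(master_ip: str, port: int = 10000) -> str:
    """Render a hive-site.xml fragment pointing at the metastore on the master."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "hive.metastore.uris"
    ET.SubElement(prop, "value").text = f"thrift://{master_ip}:{port}"
    return ET.tostring(configuration, encoding="unicode")


print(metastore_site_xml("10.0.0.1"))
```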
Hosted metastore
Problems
• YARN overhead should be 20% (TPC-H)
• Parquet needs a higher PermGen setting
• Cached tables use the actual table
• ALTER TABLE RECOVER PARTITIONS is not supported
• VPC clusters have slow access to the metastore
• SchemaRDD is gone, so old jars don't run
• Hive jars are needed on the system classpath
Future/Near future
• Run with Qubole's Hive codebase
• Metastore caching
• Benchmarking
Future/Near future
• Persistent History Server
• Fast access to the Spark AM running in the customer's cluster
Thank You

Building Spark as Service in Cloud


Editor's Notes

  1. Intro: self, Qubole. In this video, we will see how users set up a Qubole cluster in 3 simple steps. Those 3 steps are…