SlideShare a Scribd company logo
1 of 19
Apache Spark-on-YARN:
Empower Spark Applications on Hadoop Cluster
P R E S E N T E D B Y A n d y F e n g ( @ a f e n g 7 6 ) , T h o m a s G r a v e s ⎪ J u n e 5 , 2 0 1 4
2 0 1 4 H a d o o p S u m m i t , S a n J o s e , C a l i f o r n i a
Agenda
Apache Spark
Spark-on-YARN
Yahoo Use Cases
What is Apache Spark?
 Fast and expressive cluster computing system
compatible with Apache Hadoop
 Improves efficiency via
› General execution graphs
› In-memory storage
 Improves usability via
› Rich APIs in Java, Scala, Python
› Interactive shell
 Runs standalone, on YARN, on Mesos, and on
Amazon EC2
Apache Spark – Key Idea
 RDD: resilient distributed
dataset
› Collections of objects spread
across a cluster
› Built through parallel
transformations
› Automatically rebuilt on
failure
› Controllable persistence
 Write programs in terms
of RDD transformations
and actions
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(_.startsWith("Error"))
messages = errors.map(_.split('t')(2))
messages.cache()
for (str <- Array(“foo”, “bar”))
messages.filter(_.contains(str)).count()
Base RDD from HDFS
RDD in memory
Iterative processing
Projects Built on Spark
Spark
GraphX
(graph)
…
Shark
(SQL)
MLbase
(machine
learning)
BlinkDB
val tweets = spark.textFile(“hdfs://…”)
val points = sql(
“select latitude, longitude from tweets”)
val model = KMeans.train(points, 10, 50)
Spark-on-YARN: Hadoop + Spark
HDFS (Storage)
YARN (Resource Manager)
MapReduce
(Batch Processing)
Spark
(Iterative
Processing)
…
Why Spark-on-YARN?
● Access to HDFS dataset
● Ex. 150PB data on Yahoo Hadoop clusters
● Leverage existing Hadoop clusters
● Ex. 32,000 compute nodes in Yahoo
● Familiar by Hadoop users
● No extra deployment and operation costs
● Easy to get started
Spark-on-YARN: Features
● YARN-client and YARN-cluster modes
● Integration w/ Spark history server
● Secure HDFS access
● Authentication between Spark components
● Support running application JARs in HDFS
● Hadoop Distributed cache (files/archives) supported
● Hadoop 0.23 and Hadoop 2.x supported
Spark-on-YARN: Cluster Mode
spark-submit MYJAR --master yarn-cluster –class MYCLASS
Spark-YARN: Client Mode
spark-submit MYJAR --master yarn-client –class MYCLASS
Interactive Shell on YARN
 Spark shell
› Easy way to learn spark
› Great for adhoc queries
 Launch shell w/ Spark-
on-YARN
› Scala
• MASTER=yarn-client spark-shell
› Python
• MASTER=yarn-client pyspark
Spark-on-YARN UI
Spark-on-YARN UI
Spark-on-YARN: Upcoming
● Support long running jobs
o Ex. long-lived machine learning jobs
o Ex. Shark
● Dynamic resource allocation
o # of containers
● Integrate with Hadoop enhancements
o Generic History Server (Timeline Server)
o Pre-emption
o …
Use Case: Yahoo Native Ads POC
 Logistic regression algorithm
o 120 LOC in Spark/Scala
o 30 min. on model creation for
100M samples and 13K features
 Initial version launched within
2 hours after Spark/YARN
announcement
o Compared: Several days on
hardware acquisition, system setup
and data movement
Use Case: Mobile Search Ads POC
 Learn from mobile search ads clicks data
 600M labeled examples on HDFS
 100M sparse features
 Spark programs for Gradient Boosting
Decision Trees
 6 hours for model training with 100 workers
 Model w/ accuracy very close to heavily-manually-
tuned Logistic Regression models
 See us @ Spark Summit 2014
 M. Amde, H. Das, E. Sparks, and A. Talwalkar:
“Scalable Distributed Decision Trees in Spark MLLib”
Use Case: Audience Expansion for Ad Targeting
Hadoop & Storm
Spark
• G. Li, J. Kim, and A. Feng: “Yahoo Audience Expansion: Migration from
Hadoop Streaming to Spark”, Spark Summit 2013.
Acknowledgement
 Spark-on-YARN contributors
› Matei and his team at DataBricks & AMPLab
› Mridul, Thomas, Bobby, Paul etc. from Yahoo
› Sandy, Marcelo etc. from Cloudera
 Spark-on-YARN users
› Hirakendu, Amit, Gavin, Jaebong and many Yahoo users
› Production launch in Taobao
Thank You
@afeng76
We are hiring!
Stop by Kiosk P9
or reach out to us at
bigdata@yahoo-inc.com.

More Related Content

What's hot

Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 

What's hot (20)

Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Spark tuning
Spark tuningSpark tuning
Spark tuning
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 

Viewers also liked

Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark Application
DataWorks Summit
 
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
gethue
 

Viewers also liked (15)

SocSciBot(01 Mar2010) - Korean Manual
SocSciBot(01 Mar2010) - Korean ManualSocSciBot(01 Mar2010) - Korean Manual
SocSciBot(01 Mar2010) - Korean Manual
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark Application
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
Hadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageHadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better Storage
 
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
 
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu Kasinathan
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Proxy Servers
Proxy ServersProxy Servers
Proxy Servers
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Proxy Server
Proxy ServerProxy Server
Proxy Server
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 

Similar to Spark-on-YARN: Empower Spark Applications on Hadoop Cluster

Similar to Spark-on-YARN: Empower Spark Applications on Hadoop Cluster (20)

Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0
 
Sandish3Certs
Sandish3CertsSandish3Certs
Sandish3Certs
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Module01
 Module01 Module01
Module01
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Apache spark
Apache sparkApache spark
Apache spark
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
Poorna Hadoop
Poorna HadoopPoorna Hadoop
Poorna Hadoop
 
Apache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in HadoopApache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in Hadoop
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
 
Devops Spark Streaming
Devops Spark StreamingDevops Spark Streaming
Devops Spark Streaming
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Sanath pabba hadoop resume 1.0
Sanath pabba hadoop resume 1.0Sanath pabba hadoop resume 1.0
Sanath pabba hadoop resume 1.0
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache Spark
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Spark-on-YARN: Empower Spark Applications on Hadoop Cluster

  • 1. Apache Spark-on-YARN: Empower Spark Applications on Hadoop Cluster P R E S E N T E D B Y A n d y F e n g ( @ a f e n g 7 6 ) , T h o m a s G r a v e s ⎪ J u n e 5 , 2 0 1 4 2 0 1 4 H a d o o p S u m m i t , S a n J o s e , C a l i f o r n i a
  • 3. What is Apache Spark?  Fast and expressive cluster computing system compatible with Apache Hadoop  Improves efficiency via › General execution graphs › In-memory storage  Improves usability via › Rich APIs in Java, Scala, Python › Interactive shell  Runs standalone, on YARN, on Mesos, and on Amazon EC2
  • 4. Apache Spark – Key Idea  RDD: resilient distributed dataset › Collections of objects spread across a cluster › Built through parallel transformations › Automatically rebuilt on failure › Controllable persistence  Write programs in terms of RDD transformations and actions lines = spark.textFile(“hdfs://...”) errors = lines.filter(_.startsWith("Error")) messages = errors.map(_.split('t')(2)) messages.cache() for (str <- Array(“foo”, “bar”)) messages.filter(_.contains(str)).count() Base RDD from HDFS RDD in memory Iterative processing
  • 5. Projects Built on Spark Spark GraphX (graph) … Shark (SQL) MLbase (machine learning) BlinkDB val tweets = spark.textFile(“hdfs://…”) val points = sql( “select latitude, longitude from tweets”) val model = KMeans.train(points, 10, 50)
  • 6. Spark-on-YARN: Hadoop + Spark HDFS (Storage) YARN (Resource Manager) MapReduce (Batch Processing) Spark (Iterative Processing) …
  • 7. Why Spark-on-YARN? ● Access to HDFS dataset ● Ex. 150PB data on Yahoo Hadoop clusters ● Leverage existing Hadoop clusters ● Ex. 32,000 compute nodes in Yahoo ● Familiar by Hadoop users ● No extra deployment and operation costs ● Easy to get started
  • 8. Spark-on-YARN: Features ● YARN-client and YARN-cluster modes ● Integration w/ Spark history server ● Secure HDFS access ● Authentication between Spark components ● Support running application JARs in HDFS ● Hadoop Distributed cache (files/archives) supported ● Hadoop 0.23 and Hadoop 2.x supported
  • 9. Spark-on-YARN: Cluster Mode spark-submit MYJAR --master yarn-cluster –class MYCLASS
  • 10. Spark-YARN: Client Mode spark-submit MYJAR --master yarn-client –class MYCLASS
  • 11. Interactive Shell on YARN  Spark shell › Easy way to learn spark › Great for adhoc queries  Launch shell w/ Spark- on-YARN › Scala • MASTER=yarn-client spark-shell › Python • MASTER=yarn-client pyspark
  • 14. Spark-on-YARN: Upcoming ● Support long running jobs o Ex. long-lived machine learning jobs o Ex. Shark ● Dynamic resource allocation o # of containers ● Integrate with Hadoop enhancements o Generic History Server (Timeline Server) o Pre-emption o …
  • 15. Use Case: Yahoo Native Ads POC  Logistic regression algorithm o 120 LOC in Spark/Scala o 30 min. on model creation for 100M samples and 13K features  Initial version launched within 2 hours after Spark/YARN announcement o Compared: Several days on hardware acquisition, system setup and data movement
  • 16. Use Case: Mobile Search Ads POC  Learn from mobile search ads clicks data  600M labeled examples on HDFS  100M sparse features  Spark programs for Gradient Boosting Decision Trees  6 hours for model training with 100 workers  Model w/ accuracy very close to heavily-manually- tuned Logistic Regression models  See us @ Spark Summit 2014  M. Amde, H. Das, E. Sparks, and A. Talwalkar: “Scalable Distributed Decision Trees in Spark MLLib”
  • 17. Use Case: Audience Expansion for Ad Targeting Hadoop & Storm Spark • G. Li, J. Kim, and A. Feng: “Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark”, Spark Summit 2013.
  • 18. Acknowledgement  Spark-on-YARN contributors › Matei and his team at DataBricks & AMPLab › Mridul, Thomas, Bobby, Paul etc. from Yahoo › Sandy, Marcelo etc. from Cloudera  Spark-on-YARN users › Hirakendu, Amit, Gavin, Jaebong and many Yahoo users › Production launch in Taobao
  • 19. Thank You @afeng76 We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.