Get most out of Spark on YARN
DataWorks Summit
Get most out of Spark on YARN Oleg Zhurakousky Hortonworks
1.
Get most out of Spark on YARN
Oleg Zhurakousky, Hortonworks (@z_oleg)
© Hortonworks Inc. 2014
2.
Spark
“Apache Spark™ is a general engine for data processing.”
Word Count in Spark’s Scala API:
    // `spark` is the SparkContext; the HDFS paths are elided on the slide
    val file = spark.textFile("hdfs://...")
    // split lines into words, pair each word with 1, then sum the counts per word
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
3.
Spark Under the Hood
[Diagram: the word count job shown as an RDD lineage, (Hadoop | FlatMapped | Filter | MapPartitions | Shuffled)RDD, split into two stages]
• stage0: (Hadoop | FlatMapped | Filter | MapPartitions)RDD
  – ShuffleMapTask: (flatMap | map)
• stage1: ShuffledRDD
  – ResultTask: (reduceByKey)
• Layers: Spark API; Spark Compiler / Optimizer; DAG Runtime Execution Engine; Spark Cluster (YARN, Mesos)
• Components shown: Client, Cluster, DAGScheduler, ActiveJob, Tasks, SparkAM
4.
Spark 101
• Spark provides:
  – API
  – Task execution engine
  – Libraries for SQL, machine learning, graph
• Resilient Distributed Dataset (RDD)
  – Immutable, distributed data collection
  – 2 types of RDD functions:
    – Transformations (e.g. map, filter) create a new RDD
    – Actions (e.g. collect, count) lead to Spark DAG execution
  – Spark transformations are pipelined until an action is called
  – Spark provides factory methods to create RDDs:
    – From an in-memory collection
    – From various data sources, e.g. HDFS, local files
  – You can create an RDD manually as well
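The split between transformations and actions can be sketched as follows. This is a minimal illustration, not code from the talk: the object name, the helper `sumOfEvenSquares`, and the `local[2]` master are assumptions chosen so the sketch runs without a cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsVsActions {

  // Hypothetical helper: sums the squares of the even numbers in `xs`.
  def sumOfEvenSquares(sc: SparkContext, xs: Seq[Int]): Int = {
    // Factory method: build an RDD from an in-memory collection.
    val numbers = sc.parallelize(xs)

    // Transformations: each call returns a new RDD; no job runs yet.
    val evens   = numbers.filter(_ % 2 == 0)
    val squares = evens.map(n => n * n)

    // Action: triggers execution of the pipelined filter + map.
    squares.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads instead of a YARN cluster.
    val conf = new SparkConf().setAppName("spark-101-sketch").setMaster("local[2]")
    val sc   = new SparkContext(conf)
    try {
      // For 1 to 10: 4 + 16 + 36 + 64 + 100
      println(sumOfEvenSquares(sc, 1 to 10))
    } finally {
      sc.stop()
    }
  }
}
```

Nothing is computed when `filter` and `map` are called; the single `reduce` action runs both as one pipelined pass over each partition.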
5.
Spark 101 - 2
• Spark Driver
  – Client-side application that creates the Spark Context
• Spark Context
  – Talks to the Spark Driver and the Cluster Manager to launch Spark Executors
• Cluster Manager
  – E.g. YARN, Spark Standalone, Mesos
• Executors
  – Spark worker bees
6.
Demo
7.
Extensibility – Why?
• Integrate with native features of the target system (e.g., YARN Shuffle)
• Optimize by benefiting from side effects
  – KV grouping
  – Sorting
• Unified security and monitoring
• Hybrid execution environment
• Streaming and micro-batching
• Cross-context data sharing
• Specialization over generalization
8.
Extensibility
Spark is a framework with many extensibility points
• RDD
  – Add additional operations, optimize existing operations
• SparkContext
  – Execution delegation, hybrid execution environments
• ShuffleReaders/ShuffleWriters
  – Delegate reads/writes to other source/target storage systems
• Additional operations
  – RDD and Context
• Data broadcasting and caching
  – Cross-context sharing of cached/broadcast data
• More...
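One common way to add an operation to an RDD without touching Spark itself is a Scala implicit class. The sketch below is an assumption about how such an extension could look, not the mechanism shown in the talk's demo; `RddExtensions`, `StringRddOps`, and `countByLength` are hypothetical names.

```scala
import org.apache.spark.rdd.RDD

object RddExtensions {

  // Hypothetical extension: adds `countByLength` to any RDD[String].
  implicit class StringRddOps(rdd: RDD[String]) {

    // Counts how many strings there are of each length.
    // Built only from existing RDD operations (map, reduceByKey, collect).
    def countByLength(): Map[Int, Long] =
      rdd.map(s => (s.length, 1L))
         .reduceByKey(_ + _)
         .collect()
         .toMap
  }
}
```

With `import RddExtensions._` in scope, `sc.parallelize(Seq("a", "bb", "cc")).countByLength()` reads like a built-in RDD method. Deeper extension points named on the slide, such as custom RDD subclasses or shuffle readers and writers, need more than this pattern.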
9.
Demo
10.
Spark on YARN – Why?
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Workloads run natively IN Hadoop
[Diagram: application types running on YARN (Cluster Resource Management) on top of HDFS (Redundant, Reliable Storage)]
• BATCH (MapReduce)
• SQL (Tez)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• SPARK
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (Search, Weave, …)
Run the Spark workload IN Hadoop, instead of shipping the data to the code, with predictable performance and quality of service.
11.
Spark On YARN Benefits
• Ship the code to the data instead of the other way around
• Efficient locality-aware access to data stored in HDFS
• Leverage existing Hadoop clusters
• Single provisioning, management, monitoring story
• Simplified deployment and operations
• Scale-up for production use with minimal IT involvement
• Secure, multi-tenant workload management
12.
Demo
13.
Spark on YARN – Key Features
• Flexible deployment options
• Support for secure mode
• Monitoring and metrics
• Distributed cache and local resource management
14.
Integration with Hadoop Security
• Kerberos support: kinit & submit
  – kinit -kt /etc/security/keytabs/myuser.headless.keytab myuser
• HiveContext: accessing the Hive Metastore in secure mode
  – Minimum configuration: hive.metastore.uris
  – Be careful with extra configuration, as Spark cannot support some Hive configurations.
  – The SPARK-5111 patch is needed for a basic Kerberos setup to work against Hadoop 2.6 (connecting to the Hive metastore).
• Spark ThriftServer:
  – Minimum configuration: Kerberos keytab/principal, SASL, authorization
  – The Spark Thrift server has to be co-located with the Hive Thrift server.
  – The Spark user has to be able to access Hive's keytab.
15.
Application Timeline Service (ATS) Integration
16.
OrcFile support
• saveAsOrcFile
  – Saves the table as an ORC format file
• orcFile
  – Reads an ORC format file as a table source
• Other features:
  – Column pruning, self-contained schema support, predicate pushdown, different compression methods
• External data source API:
  – Refer to the latest PR in the JIRA (SPARK-2883) with the support.
  – import org.apache.spark.sql.hive.orc._
  – Operate under HiveContext
Tech Preview in HDP 2.2, targeted for rewrite to the Spark Data Source API
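The slide names the tech-preview methods `saveAsOrcFile` and `orcFile` and says they were targeted for a rewrite onto the Data Source API. The sketch below deliberately uses that later API (`DataFrameWriter.orc` and `DataFrameReader.orc` on a `SparkSession`), not the tech-preview calls. It assumes Spark 2.3 or later, where ORC can be used without Hive support; the object name, the sample rows, and the column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object OrcRoundTrip {

  // Hypothetical helper: writes a small table as ORC, reads it back,
  // and returns the names of people older than `minAge`.
  def namesOlderThan(spark: SparkSession, path: String, minAge: Int): Seq[String] = {
    import spark.implicits._

    // Made-up sample data.
    val people = Seq(("Ada", 36), ("Linus", 28)).toDF("name", "age")

    // Write: the schema is stored inside the ORC files themselves.
    people.write.mode("overwrite").orc(path)

    // Read: the filter and the single selected column are what give the
    // reader a chance to apply predicate pushdown and column pruning.
    spark.read.orc(path)
      .where($"age" > minAge)
      .select("name")
      .as[String]
      .collect()
      .toSeq
  }
}
```

Whether the filter is actually pushed down to the ORC reader depends on the Spark version and its ORC settings, so treat the comments as intent rather than a guarantee.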
17.
Spark on YARN - Beyond the basics
• Multi-tenancy & workload management
18.
Key Themes
Theme: Long Running Services Support
• Security: Kerberos Token Renewal
• Log Aggregation: View logs for running applications
• Fault Tolerance: AM Retry and Container keep alive
• Service Registry: Directory for services running in YARN
Theme: Workload Mgmt
• CPU Scheduling
• CPU Isolation through CGroups
• Node Labels for scheduling constraints
Theme: Reliable and Secure Operations
• YARN Rolling Upgrades support
• Work Preserving Restart
• Timeline Server support in Secure clusters
19.
Enhancements to Support Long Running Services
Security Capability
• YARN-941: YARN updates Kerberos token for a Long Running Service after the token expires
Log Aggregation Capability
• YARN-2468: Aggregate and capture logs during the lifetime of a Long Running Service
Fault Tolerance Capability
• YARN-1489: When ApplicationMaster (AM) restarts, do not kill all associated containers – reconnect to the AM
• YARN-611/YARN-614: Tolerate AM failures for Long Running Services
Service Registry Capability
• YARN-913: Service Registry that publishes host and ports that each service comes up on
20.
Node Labels: Apply Node Constraints
[Diagram: apps A and B and nodes carrying label L1; stages shown are Deploy/Allocate and Isolate]
21.
CPU Scheduling
What
• Admin tells YARN how much CPU capacity is available in the cluster
• Applications specify the CPU capacity needed for each container
• YARN Capacity Scheduler schedules applications taking available CPU capacity into account
Why
• Applications (for example Storm, HBase, Machine Learning) need predictable access to CPU as a resource
• CPU has become the bottleneck instead of memory in certain clusters (128 GB RAM, 6 CPUs)
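A small worked example shows why CPU can become the bottleneck on the node size mentioned above (128 GB RAM, 6 CPUs). The container size used here (4 GB and 1 vcore) is an assumption chosen for illustration, not a figure from the talk.

```python
def containers_per_node(node_mem_gb, node_vcores, container_mem_gb, container_vcores):
    """Return (fit, by_memory, by_cpu): how many containers a node can hold
    overall, and the separate limits imposed by memory and by CPU."""
    by_memory = node_mem_gb // container_mem_gb
    by_cpu = node_vcores // container_vcores
    return min(by_memory, by_cpu), by_memory, by_cpu


# Node from the slide: 128 GB RAM, 6 CPUs.
# Assumed container request: 4 GB and 1 vcore each.
fit, by_memory, by_cpu = containers_per_node(128, 6, 4, 1)
print(f"memory allows {by_memory}, CPU allows {by_cpu}, node fits {fit}")
# prints "memory allows 32, CPU allows 6, node fits 6"
```

A scheduler that counts only memory would place up to 32 such containers on a node with 6 CPUs, which is the oversubscription that CPU-aware scheduling is meant to prevent.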
22.
CGroup Isolation
What
• Admin enables CGroups for CPU isolation of YARN application workloads
Why
• Applications need guaranteed access to CPU resources
• To ensure SLAs, CPU allocations given to an application container need to be enforced
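To make the enforcement idea concrete, the following sketch models proportional CPU sharing between busy containers. It assumes each container's cgroup weight is 1024 shares per requested vcore and that every container is competing for CPU at the same time; both are simplifying assumptions for illustration, and the container names are hypothetical.

```python
SHARES_PER_VCORE = 1024  # assumed weight per vcore, for illustration


def cpu_fractions(vcores_by_container):
    """Return the fraction of CPU time each container receives under full
    contention, assuming time is split in proportion to cgroup shares."""
    shares = {name: v * SHARES_PER_VCORE for name, v in vcores_by_container.items()}
    total = sum(shares.values())
    return {name: s / total for name, s in shares.items()}


# Two hypothetical containers: one asked for 1 vcore, the other for 3.
fractions = cpu_fractions({"container_a": 1, "container_b": 3})
print(fractions)
# prints "{'container_a': 0.25, 'container_b': 0.75}"
```

Without isolation, a container that asked for 1 vcore could still consume most of the node; with shares in place its CPU time under contention stays in proportion to what it was allocated.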
23.
Default Queue Mapping
What
• Admin can define a default queue per user or group
• If no queue is specified for an application, YARN Capacity Scheduler will use the user's or group's default queue for the application
Why
• Queues are required for enforcing SLAs, so they should be easy to use
• Users and applications want to submit YARN apps/jobs without having to specify a queue
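The lookup described above can be sketched as a first-match search over a mapping list. The sketch assumes the comma-separated `u:<user>:<queue>` and `g:<group>:<queue>` entry format used by the Capacity Scheduler's `yarn.scheduler.capacity.queue-mappings` property and ignores placeholder values; the user, group, and queue names are hypothetical.

```python
from typing import List, Optional


def resolve_queue(mappings: str, user: str, groups: List[str]) -> Optional[str]:
    """Return the queue of the first mapping that matches the user or one of
    the user's groups, scanning entries left to right; None if nothing matches."""
    for entry in mappings.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty input or trailing commas
        kind, name, queue = entry.split(":")
        if kind == "u" and name == user:
            return queue
        if kind == "g" and name in groups:
            return queue
    return None


# Hypothetical mapping: alice goes to "analytics", members of "etl" go to "batch".
mappings = "u:alice:analytics,g:etl:batch"
print(resolve_queue(mappings, "alice", ["etl"]))  # prints "analytics"
print(resolve_queue(mappings, "bob", ["etl"]))    # prints "batch"
print(resolve_queue(mappings, "carol", []))       # prints "None"
```

Because the first match wins, the order of entries matters: alice belongs to the `etl` group but still lands in `analytics` because her user mapping comes first.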
24.
ResourceManager REST API
What
• Submit YARN applications through REST
• Get YARN application status through REST
• Kill YARN applications through REST
Why
• External applications need to interact (schedule/monitor) with YARN without needing to embed a Java library
• Enable administration from a remote system
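As a sketch of how an external, non-Java client might talk to the ResourceManager, the following builds the HTTP requests with Python's standard library only. The paths follow the ResourceManager's `/ws/v1/cluster` REST endpoints; the host name and application ID are hypothetical, and authentication (for example on a Kerberos-secured cluster) is deliberately left out. The requests are constructed but not sent.

```python
import json
import urllib.request


def new_application_request(rm_url):
    """POST that asks the ResourceManager for a new application ID (first step of a submit)."""
    return urllib.request.Request(f"{rm_url}/ws/v1/cluster/apps/new-application", method="POST")


def status_request(rm_url, app_id):
    """GET that fetches the report (including state) of one application."""
    return urllib.request.Request(
        f"{rm_url}/ws/v1/cluster/apps/{app_id}",
        headers={"Accept": "application/json"},
        method="GET",
    )


def kill_request(rm_url, app_id):
    """PUT that asks the ResourceManager to move the application to the KILLED state."""
    body = json.dumps({"state": "KILLED"}).encode("utf-8")
    return urllib.request.Request(
        f"{rm_url}/ws/v1/cluster/apps/{app_id}/state",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


rm = "http://rm.example.com:8088"          # hypothetical ResourceManager address
app = "application_1425500000000_0001"     # hypothetical application ID
for req in (new_application_request(rm), status_request(rm, app), kill_request(rm, app)):
    print(req.get_method(), req.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would perform the call against a reachable ResourceManager.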
25.
When Things Go Wrong
• Where to look
  – yarn application -list (get the list of running applications)
  – yarn logs -applicationId <app_id>
  – Check Spark Environment: http://<host>:8088/proxy/<job_id>/environment/
• Common Issues
  – Submitted a job but nothing happens
    – Job stays in accepted state when allocated more memory/cores than is available
    – May need to kill unresponsive/stale jobs
  – Insufficient HDFS access
    – May lead to a failure such as "Loading data to table default.testtable Failed with exception Unable to move source hdfs://red1:8020/tmp/hive-spark/hive_2015-03-04_12-45-42_404_3643812080461575333-1/-ext-10000/kv1.txt to destination hdfs://red1:8020/apps/hive/warehouse/testtable/kv1.txt"
  – Wrong host in Beeline, shows error as invalid URL
    – "Error: Invalid URL: jdbc:hive2://localhost:10001 (state=08S01,code=0)"
  – Error about closed SQLContext: restart the Thrift Server
Grant user/group necessary HDF
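The "where to look" steps above can be collected into a few helpers that assemble the commands and the environment URL for a given application. This is a convenience sketch: the `-kill` subcommand is added here for the "kill unresponsive/stale jobs" item, and the host and application ID are hypothetical.

```python
def list_apps_command():
    """Command that lists running YARN applications."""
    return ["yarn", "application", "-list"]


def logs_command(app_id):
    """Command that fetches the aggregated logs of one application."""
    return ["yarn", "logs", "-applicationId", app_id]


def kill_command(app_id):
    """Command that kills an unresponsive or stale application."""
    return ["yarn", "application", "-kill", app_id]


def environment_url(host, job_id):
    """URL of the Spark environment page behind the ResourceManager proxy."""
    return f"http://{host}:8088/proxy/{job_id}/environment/"


app_id = "application_1425500000000_0001"  # hypothetical application ID
for cmd in (list_apps_command(), logs_command(app_id), kill_command(app_id)):
    print(" ".join(cmd))
print(environment_url("rm.example.com", app_id))  # hypothetical host
```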
26.
Thank you!