1© Cloudera, Inc. All rights reserved.
Spark Operations
Kostas Sakellis
Me
• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, contributed to Cloudera Manager
Building a proof of concept!
Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg
Example
sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()
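The slide does not show how `Movie` is built from a line of text. The following is a minimal sketch of one way it could look, assuming pipe-delimited input whose third field is a release date in dd-MMM-yyyy form. The `Movie` fields, the `MovieExample` object, and the sample rows are hypothetical, and a local `Seq` stands in for the RDD so the snippet runs without a cluster.

```scala
// Hypothetical stand-in for the talk's Movie type; the field layout is an assumption.
final case class Movie(id: Int, title: String, month: String)

object Movie {
  // Parses one pipe-delimited line such as "1|Sample A|01-Jan-1995".
  // The month is the middle part of the dd-MMM-yyyy release date.
  def apply(line: String): Movie = {
    val fields = line.split('|')
    val month = fields(2).split('-')(1)
    Movie(fields(0).toInt, fields(1), month)
  }
}

object MovieExample {
  // Same map/filter chain as the slide, applied to a local Seq instead of an RDD.
  def novemberMovies(lines: Seq[String]): Seq[Movie] =
    lines.map(Movie(_)).filter(_.month == "Nov")

  def main(args: Array[String]): Unit = {
    // Made-up sample rows, not taken from the real data set.
    val lines = Seq("1|Sample A|01-Jan-1995", "2|Sample B|22-Nov-1996")
    novemberMovies(lines).foreach(println)
  }
}
```

With a real SparkContext the same `Movie(_)` function could be passed to `map` on the RDD returned by `sc.textFile`.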
Partitions
sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()
[Diagram: the HDFS file split into Partition 1 through Partition 4]
RDDs
sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()
[Diagram: HDFS feeds a chain of three RDDs, each with Partition 1 through Partition 4, ending in Collect]
RDD Lineage
sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()
[Diagram: HDFS, three RDDs with Partition 1 through Partition 4 each, and Collect; the chain is labeled Lineage]
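One way to picture lineage is as a chain of nodes in which each node keeps a reference to its parent and knows how to derive a partition from it. The sketch below is a toy model in plain Scala, not Spark's RDD implementation. All names in it (`LineageSketch`, `Node`, `Source`, `Mapped`, `Filtered`) are hypothetical.

```scala
object LineageSketch {
  // Each node can compute one partition and describe the chain it depends on.
  sealed trait Node {
    def compute(partition: Int): Seq[String]
    def describe: String
  }

  // Root of the chain: partitions that can always be re-read from storage.
  final case class Source(partitions: Vector[Seq[String]]) extends Node {
    def compute(partition: Int): Seq[String] = partitions(partition)
    def describe: String = "Source"
  }

  // Derived node: recomputes its partition by asking the parent first.
  final case class Mapped(parent: Node, f: String => String) extends Node {
    def compute(partition: Int): Seq[String] = parent.compute(partition).map(f)
    def describe: String = s"Mapped <- ${parent.describe}"
  }

  final case class Filtered(parent: Node, p: String => Boolean) extends Node {
    def compute(partition: Int): Seq[String] = parent.compute(partition).filter(p)
    def describe: String = s"Filtered <- ${parent.describe}"
  }
}
```

Because every node can walk back to the source, a single lost partition can be rebuilt by calling `compute` for that partition index alone, without touching the others.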
Task
• A pipelined set of transformations on a single thread
[Diagram: HDFS, three RDDs with Partition 1 through Partition 4 each, and Collect]
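The idea of a task as a pipelined set of transformations can be sketched in plain Scala: the map and filter steps are fused over one iterator per partition, and a collect step concatenates the per-partition results. This is an illustrative model under simplifying assumptions, not Spark's scheduler. `TaskSketch` and its functions are hypothetical names, and the tasks run sequentially here rather than on separate threads.

```scala
object TaskSketch {
  // Splits the input into at most `n` partitions (stand-in for HDFS splits).
  def partition[A](data: Seq[A], n: Int): Seq[Seq[A]] = {
    val size = math.max(1, math.ceil(data.size.toDouble / n).toInt)
    data.grouped(size).toVector
  }

  // One task: map and filter are applied record by record over a single
  // iterator, so no intermediate collection is built between the two steps.
  def runTask(part: Seq[String]): Seq[String] =
    part.iterator.map(_.toUpperCase).filter(_.startsWith("N")).toVector

  // Stand-in for collect(): run one task per partition and concatenate the results.
  def collect(parts: Seq[Seq[String]]): Seq[String] = parts.flatMap(runTask)
}
```

For example, `TaskSketch.collect(TaskSketch.partition(Seq("nov", "jan", "nov", "dec"), 2))` processes two partitions of two records each.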
Spark Architecture
Spark System Architecture
Deployments
• Spark supports pluggable Cluster Managers
• local, Standalone, YARN and Mesos
• In early 2014, CDH 4.x with Spark 0.9 only supported Standalone
• CDH 5.x includes Spark on YARN support
Standalone
[Diagram labels: Client, Master, Worker (x2), App Master, Process (x2)]
Standalone
• On cluster
./sbin/start-master.sh
./sbin/start-slave.sh <master-spark-URL>
• Submit job
spark-submit --master <master-spark-URL> …
YARN Architecture
[Diagram labels: Client, Resource Manager, Node Manager (x2), Container (x3), App Master, Process (x2)]
Spark on YARN Architecture
[Diagram labels: Client, Resource Manager, Node Manager (x2), Container (x3), App Master, Process (x2)]
Spark on YARN
• Submit job (client mode)
spark-submit --master yarn-client …
• Cluster mode
spark-submit --master yarn-cluster …
• Spark shell only works in client mode!
Customers often have shared infrastructure
Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg
Multi-tenancy
• Cluster utilization is the top metric
• Target: 70-80% utilization
• Mixed workloads from mixed customers
• We recommend YARN
• Built-in resource manager
Underutilized Clusters
Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG
Dynamic Allocation
• Spark applications scale the number of executors based on load
• Removes need for: --num-executors
• Idle executors get killed
• First supported in CDH 5.4
• Ideal for:
• Long ETL jobs with large shuffles
• Shell applications: Hive and Spark shell
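As a rough illustration of how dynamic allocation is switched on, the sketch below assembles the relevant Spark properties and renders them as `--conf` arguments for spark-submit. The property names are the ones listed in the Spark configuration documentation. The `DynamicAllocationConf` helper and the executor bounds passed to it are hypothetical, and the external shuffle service setting reflects the usual requirement on YARN, which may differ in your version.

```scala
object DynamicAllocationConf {
  // Property names come from the Spark configuration docs; values are placeholders.
  def settings(minExecutors: Int, maxExecutors: Int): Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    // Lets shuffle files stay available after an idle executor is removed.
    "spark.shuffle.service.enabled" -> "true",
    "spark.dynamicAllocation.minExecutors" -> minExecutors.toString,
    "spark.dynamicAllocation.maxExecutors" -> maxExecutors.toString
  )

  // Renders the settings as spark-submit `--conf key=value` arguments.
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }
}
```

Calling `DynamicAllocationConf.toSubmitArgs(DynamicAllocationConf.settings(1, 20))` yields eight arguments that can be appended to a spark-submit command line in place of a fixed `--num-executors`.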
Dynamic Allocation Limitations
• Still required to specify cores
• --executor-cores
• Memory
• --executor-memory
• Includes JVM overhead
• Need to do the math yourself
• Our customers still get it wrong!
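The "math" mentioned above can be sketched as follows. The sketch assumes that the container requested for each executor is the `--executor-memory` value plus an overhead of max(384 MiB, 10% of executor memory). That rule matches recent Spark documentation for `spark.executor.memoryOverhead`, but the factor has changed between versions, so treat the defaults here as assumptions. `ExecutorMemoryMath` and its functions are hypothetical names.

```scala
object ExecutorMemoryMath {
  // Assumed rule: overhead = max(minOverheadMb, factor * executor memory).
  def overheadMb(executorMemoryMb: Int, factor: Double = 0.10, minOverheadMb: Int = 384): Int =
    math.max(minOverheadMb, (executorMemoryMb * factor).toInt)

  // Memory the cluster manager has to grant for one executor container.
  def containerMb(executorMemoryMb: Int): Int =
    executorMemoryMb + overheadMb(executorMemoryMb)

  // Number of such executors that fit into the memory a node offers to containers.
  def executorsPerNode(nodeMemoryMb: Int, executorMemoryMb: Int): Int =
    nodeMemoryMb / containerMb(executorMemoryMb)
}
```

Under these assumptions an 8192 MiB executor needs a 9011 MiB container, so a node offering 65536 MiB fits 7 executors rather than the 8 that dividing by `--executor-memory` alone would suggest.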
The Future of Dynamic Allocation
• Only “task size” needed: --task-size
• Eliminates
• --executor-cores
• --num-executors
• --executor-memory
• Leads to better cluster utilization
Security, now it’s getting serious.
Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg
Authentication
• Kerberos – the necessary evil
• Ubiquitous amongst other services
• YARN, HDFS, Hive, HBase, etc.
• Spark utilizes delegation tokens
Encryption
• Control plane: SPARK-6028 (replace with Netty)
• File distribution: replace with Netty
• Block Manager: Spark 1.4
• User UI / REST API: SPARK-2750 (SSL)
• Data-at-rest (shuffle files): SPARK-5682
Authorization
• Enterprises have sensitive data
• Beyond HDFS file permissions
• Partial access to data
• Column level granularity
• Apache Sentry
• HDFS-Sentry synchronization plugin
• RecordService
• Column level security for Spark!
Thank you
We’re Hiring!

More Related Content

What's hot

How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
Cloudera, Inc.
 
Empower Hive with Spark
Empower Hive with SparkEmpower Hive with Spark
Empower Hive with Spark
DataWorks Summit
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwords
Szehon Ho
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
hadooparchbook
 
dplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Datadplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Data
Cloudera, Inc.
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
BlueData, Inc.
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
Cloudera, Inc.
 
Effective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant ClustersEffective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant Clusters
DataWorks Summit/Hadoop Summit
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Cloudera, Inc.
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
DataWorks Summit
 
A deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloudA deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloud
Cloudera, Inc.
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark Summit
 
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared ClustersMercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
DataWorks Summit
 
Road to Cloudera certification
Road to Cloudera certificationRoad to Cloudera certification
Road to Cloudera certification
Cloudera, Inc.
 
Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
نهاد مبارك
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
Cloudera, Inc.
 
Apache Hadoop 3
Apache Hadoop 3Apache Hadoop 3
Apache Hadoop 3
Cloudera, Inc.
 
Apache solr performance and scalability effort update palo alto 2017%2 f7
Apache solr performance and scalability effort update palo alto 2017%2 f7Apache solr performance and scalability effort update palo alto 2017%2 f7
Apache solr performance and scalability effort update palo alto 2017%2 f7
Cloudera, Inc.
 
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac... Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
Cloudera, Inc.
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
DataWorks Summit
 

What's hot (20)

How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
 
Empower Hive with Spark
Empower Hive with SparkEmpower Hive with Spark
Empower Hive with Spark
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwords
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
dplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Datadplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Data
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Effective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant ClustersEffective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant Clusters
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
 
A deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloudA deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloud
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
 
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared ClustersMercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
 
Road to Cloudera certification
Road to Cloudera certificationRoad to Cloudera certification
Road to Cloudera certification
 
Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
 
Apache Hadoop 3
Apache Hadoop 3Apache Hadoop 3
Apache Hadoop 3
 
Apache solr performance and scalability effort update palo alto 2017%2 f7
Apache solr performance and scalability effort update palo alto 2017%2 f7Apache solr performance and scalability effort update palo alto 2017%2 f7
Apache solr performance and scalability effort update palo alto 2017%2 f7
 
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac... Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
 

Viewers also liked

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Are you Kudu-ing me?!
Are you Kudu-ing me?!Are you Kudu-ing me?!
Are you Kudu-ing me?!
Przemek Maciolek
 
Spark on YARN: The Road Ahead
Spark on YARN: The Road AheadSpark on YARN: The Road Ahead
Spark on YARN: The Road Ahead
Cloudera, Inc.
 
Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604
Chicago Hadoop Users Group
 
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
StampedeCon
 
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARN
Adam Kawa
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
Cloudera, Inc.
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
Apache Apex
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
Emilio Coppa
 
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudData Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Cloudera, Inc.
 
Operations on rdd
Operations on rddOperations on rdd
Operations on rdd
sparrowAnalytics.com
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Hortonworks
 
The Aggregation Framework
The Aggregation FrameworkThe Aggregation Framework
The Aggregation Framework
MongoDB
 
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB
 
Dynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPDynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDP
Hortonworks
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
Hortonworks
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
Hortonworks
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
Cloudera, Inc.
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 

Viewers also liked (19)

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Are you Kudu-ing me?!
Are you Kudu-ing me?!Are you Kudu-ing me?!
Are you Kudu-ing me?!
 
Spark on YARN: The Road Ahead
Spark on YARN: The Road AheadSpark on YARN: The Road Ahead
Spark on YARN: The Road Ahead
 
Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604
 
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
 
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARN
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudData Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
 
Operations on rdd
Operations on rddOperations on rdd
Operations on rdd
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
 
The Aggregation Framework
The Aggregation FrameworkThe Aggregation Framework
The Aggregation Framework
 
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
 
Dynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPDynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDP
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 

Similar to Apache Spark Operations

Getting Apache Spark Customers to Production
Getting Apache Spark Customers to ProductionGetting Apache Spark Customers to Production
Getting Apache Spark Customers to Production
Cloudera, Inc.
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the Enterprise
Cloudera, Inc.
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
Gwen (Chen) Shapira
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

Cloudera, Inc.
 
How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017
Cloudera Japan
 
Cloudera のサポートエンジニアリング #supennight
Cloudera のサポートエンジニアリング #supennightCloudera のサポートエンジニアリング #supennight
Cloudera のサポートエンジニアリング #supennight
Cloudera Japan
 
Enterprise machine learning on k8s lessons learned and the road ahead
Enterprise machine learning on k8s   lessons learned and the road aheadEnterprise machine learning on k8s   lessons learned and the road ahead
Enterprise machine learning on k8s lessons learned and the road ahead
Timothy Chen
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
Cloudera, Inc.
 
Kite SDK introduction for Portland Big Data
Kite SDK introduction for Portland Big DataKite SDK introduction for Portland Big Data
Kite SDK introduction for Portland Big Data
_blue
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache Kafka
Grant Henke
 
Container Native Development Tools - Talk by Mickey Boxell
Container Native Development Tools - Talk by Mickey BoxellContainer Native Development Tools - Talk by Mickey Boxell
Container Native Development Tools - Talk by Mickey Boxell
Oracle Developers
 
Querying multiple distributed storage systems with Apache Hive robustly
Querying multiple distributed storage systems with Apache Hive robustlyQuerying multiple distributed storage systems with Apache Hive robustly
Querying multiple distributed storage systems with Apache Hive robustly
Ashish Singh
 
YARN
YARNYARN
One Hadoop, Multiple Clouds
One Hadoop, Multiple CloudsOne Hadoop, Multiple Clouds
One Hadoop, Multiple Clouds
Cloudera, Inc.
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Stefan Lipp
 
Continuous Delivery to Kubernetes with Jenkins and Helm
Continuous Delivery to Kubernetes with Jenkins and HelmContinuous Delivery to Kubernetes with Jenkins and Helm
Continuous Delivery to Kubernetes with Jenkins and Helm
David Currie
 
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
Timothy Spann
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
Cloudera, Inc.
 
Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2
DataWorks Summit
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Data Con LA
 

Similar to Apache Spark Operations (20)

Getting Apache Spark Customers to Production
Getting Apache Spark Customers to ProductionGetting Apache Spark Customers to Production
Getting Apache Spark Customers to Production
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the Enterprise
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

 
How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017
 
Cloudera のサポートエンジニアリング #supennight
Cloudera のサポートエンジニアリング #supennightCloudera のサポートエンジニアリング #supennight
Cloudera のサポートエンジニアリング #supennight
 
Enterprise machine learning on k8s lessons learned and the road ahead
Enterprise machine learning on k8s   lessons learned and the road aheadEnterprise machine learning on k8s   lessons learned and the road ahead
Enterprise machine learning on k8s lessons learned and the road ahead
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
 
Kite SDK introduction for Portland Big Data
Kite SDK introduction for Portland Big DataKite SDK introduction for Portland Big Data
Kite SDK introduction for Portland Big Data
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache Kafka
 
Container Native Development Tools - Talk by Mickey Boxell
Container Native Development Tools - Talk by Mickey BoxellContainer Native Development Tools - Talk by Mickey Boxell
Container Native Development Tools - Talk by Mickey Boxell
 
Querying multiple distributed storage systems with Apache Hive robustly
Querying multiple distributed storage systems with Apache Hive robustlyQuerying multiple distributed storage systems with Apache Hive robustly
Querying multiple distributed storage systems with Apache Hive robustly
 
YARN
YARNYARN
YARN
 
One Hadoop, Multiple Clouds
One Hadoop, Multiple CloudsOne Hadoop, Multiple Clouds
One Hadoop, Multiple Clouds
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
 
Continuous Delivery to Kubernetes with Jenkins and Helm
Continuous Delivery to Kubernetes with Jenkins and HelmContinuous Delivery to Kubernetes with Jenkins and Helm
Continuous Delivery to Kubernetes with Jenkins and Helm
 
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
 
Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 





Apache Spark Operations

  • 1. Spark Operations. Kostas Sakellis. © Cloudera, Inc. All rights reserved.
  • 2. Me • Software Engineer at Cloudera • Contributor to Apache Spark • Before that, contributed to Cloudera Manager
  • 3. Building a proof of concept! Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg
  • 4. Example: sc.textFile("hdfs://data/u.item", 4) .map(Movie(_)) .filter(_.month.equals("Nov")) .collect()
  • 5. Example (same code as slide 4)
  • 6. Example (same code as slide 4)
  • 7. Partitions (same code as slide 4). Diagram: HDFS with Partitions 1 to 4.
  • 8. RDDs (same code as slide 4). Diagram: HDFS and one RDD with Partitions 1 to 4.
  • 9. RDDs (same code as slide 4). Diagram: HDFS and two RDDs, each with Partitions 1 to 4.
  • 10. RDDs (same code as slide 4). Diagram: HDFS and three RDDs, each with Partitions 1 to 4.
  • 11. RDDs (same code as slide 4). Diagram: HDFS and three RDDs, each with Partitions 1 to 4, followed by Collect.
  • 12. RDD Lineage (same code as slide 4). Diagram: HDFS and three RDDs, each with Partitions 1 to 4, followed by Collect; the chain is labeled Lineage.
  • 13. Task • A pipelined set of transformations on a single thread. Diagram: HDFS and three RDDs, each with Partitions 1 to 4, followed by Collect.
  • 14. Spark Architecture
  • 15. Spark System Architecture
  • 16. Deployments • Spark supports pluggable Cluster Managers • local, Standalone, YARN and Mesos • In early 2014, CDH 4.x with Spark 0.9 only supported Standalone • CDH 5.x includes Spark on YARN support
  • 17. Standalone. Diagram labels: Master, Worker, Client, Worker Process, App Master Process.
  • 18. Standalone • On cluster: ./sbin/start-master.sh ./sbin/start-slave.sh <master-spark-URL> • Submit job: spark-submit --master <master-spark-URL> …
  • 19. YARN Architecture. Diagram labels: Resource Manager, Node Manager, Client, Container, Process, App Master.
  • 20. Spark on YARN Architecture. Diagram labels: Resource Manager, Node Manager, Client, Container, Process, App Master.
  • 21. Spark on YARN Architecture (same diagram labels as slide 20)
  • 22. Spark on YARN • Submit job: spark-submit --master yarn-client … • Cluster mode: spark-submit --master yarn-cluster … • Spark shell only works in client mode!
  • 23. Customers often have shared infrastructure. Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg
  • 24. Multi-tenancy • Cluster utilization is the top metric • Target: 70-80% utilization • Mixed workloads from mixed customers • We recommend YARN • Built-in resource manager
  • 25. Underutilized Clusters. Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG
  • 26. Dynamic Allocation • Spark applications scale the number of executors based on load • Removes the need for --num-executors • Idle executors get killed • First supported in CDH 5.4 • Ideal for: • Long ETL jobs with large shuffles • Shell applications: Hive and Spark shell
  • 27. Dynamic Allocation Limitations • Still required to specify cores: --executor-cores • Still required to specify memory: --executor-memory • Includes JVM overhead • Need to do the math yourself • Our customers still get it wrong!
  • 28. The Future of Dynamic Allocation • Only "task size" needed: --task-size • Eliminates --executor-cores, --num-executors and --executor-memory • Leads to better cluster utilization
  • 29. Security, now it’s getting serious. Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg
  • 30. Authentication • Kerberos, the necessary evil • Ubiquitous amongst other services • YARN, HDFS, Hive, HBase, etc. • Spark utilizes delegation tokens
  • 31. Encryption • Control plane • File distribution • Block Manager • User UI / REST API • Data-at-rest (shuffle files). Slide annotations, whose pairing with the bullets was lost in extraction: SPARK-6028 (Replace with netty), Replace with netty, Spark 1.4, SPARK-2750 (SSL), SPARK-5682.
  • 32. Authorization • Enterprises have sensitive data • Beyond HDFS file permissions • Partial access to data • Column level granularity • Apache Sentry • HDFS-Sentry synchronization plugin • Record Service • Column level security for Spark!
  • 33. Thank you. We’re Hiring!
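
The Dynamic Allocation Limitations slide notes that the memory request includes JVM overhead and that users have to do the math themselves. A minimal Scala sketch of that arithmetic follows. It assumes an overhead equal to the larger of 384 MiB and 10% of --executor-memory, which is the commonly documented Spark-on-YARN default; the fraction has differed between Spark releases, and YARN's rounding of requests to its allocation increment is not modeled. The object and method names are hypothetical.

```scala
object ExecutorSizing {
  // Assumed defaults (not taken from the slides): overhead = max(384 MiB, 10% of executor memory).
  val MinOverheadMiB: Long = 384L
  val OverheadFraction: Double = 0.10

  // Off-heap overhead added on top of --executor-memory, rounded up to be conservative.
  def overheadMiB(executorMemoryMiB: Long): Long =
    math.max(MinOverheadMiB, math.ceil(executorMemoryMiB * OverheadFraction).toLong)

  // Memory the cluster manager must grant for one executor container.
  def containerMiB(executorMemoryMiB: Long): Long =
    executorMemoryMiB + overheadMiB(executorMemoryMiB)

  // How many such executors fit in the memory a node offers to YARN.
  def executorsPerNode(nodeMemoryMiB: Long, executorMemoryMiB: Long): Long =
    nodeMemoryMiB / containerMiB(executorMemoryMiB)
}
```

Under these assumptions an 8192 MiB executor needs a 9012 MiB container, so a node offering 65536 MiB to YARN fits 7 such executors rather than the 8 that a naive division by --executor-memory would suggest.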

Editor's Notes

  1. Let's talk about the issues we have seen our customers run into as they try to get Spark into production.
  2. In scope: a focus on operational issues, not on building the code itself. Based on experience from our customer support tickets.
  3. Spark makes building a proof of concept with a subset of data relatively easy. But then things go wrong. Plug for my talk at Hadoop Summit.
  4. Let's start with an example program in Spark.
  5. Let's start with an example program in Spark.
  6. The collect() call launches a job.
  7. A partition is a chunk of data somewhere. It could be on the Hadoop Distributed File System (HDFS), or it could be cached in Spark. Partitions define the degree of parallelism.
  8. An RDD describes a way of generating input and output partitions. Immutable, which is very important! RDDs can depend on other RDDs: most have a single parent; joins have multiple parents. Lineage is used instead of replication for fault tolerance. See https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  9. Same as note 8.
  10. Same as note 8.
  11. Same as note 8.
  12. Same as note 8.
  13. Let's review the general Spark architecture.
  14. A driver: where the DAG scheduler lives; drives the show; a single point of failure. Executors: communicate with the driver and run the tasks created by the driver; think of an executor as a ThreadPoolExecutor in Java. Pluggable cluster managers: YARN, Mesos, standalone.
  15. In scope: a focus on operational issues, not on building the code itself. Based on experience from our customer support tickets.
  16. Let's review the general Spark architecture.
  17. In scope: a focus on operational issues, not on building the code itself. Based on experience from our customer support tickets.
  18. Let's review the general Spark architecture.
  19. Let's review the general Spark architecture.
  20. Let's review the general Spark architecture.
  21. In scope: a focus on operational issues, not on building the code itself. Based on experience from our customer support tickets.
  22. Spark makes building a proof of concept with a subset of data relatively easy.
  23. Spark makes building a proof of concept with a subset of data relatively easy.
  24. Control plane; file distribution; Block Manager; user UI / REST API; data-at-rest (shuffle files).
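
The notes above describe partitions, tasks that pipeline transformations on a single thread, and executors that behave like a ThreadPoolExecutor. The sketch below models that idea in plain Scala with java.util.concurrent and no Spark dependency, so it is an illustration of the concept, not Spark's implementation. The Movie fields, the pipe-delimited line layout, the sample rows, and all object and method names are assumptions made for illustration; the slides only show Movie(_) and a month field.

```scala
import java.util.concurrent.{Callable, Executors}

// Hypothetical record type; the slides only reveal that Movie has a `month`.
final case class Movie(id: Int, title: String, month: String)

object Movie {
  // Assumes lines shaped like "1|Some Title|01-Jan-1995" (id|title|dd-Mon-yyyy).
  def fromLine(line: String): Movie = {
    val fields = line.split('|')
    Movie(fields(0).toInt, fields(1), fields(2).split('-')(1))
  }
}

object DriverExecutorSketch {
  // One task: map and filter are pipelined over a single partition's iterator,
  // so each record flows through both steps on the same thread.
  def runTask(partition: Seq[String]): List[Movie] =
    partition.iterator.map(Movie.fromLine).filter(_.month == "Nov").toList

  // Stand-in for collect(): the "driver" submits one task per partition to a
  // thread pool (the "executor") and gathers the results in partition order.
  def collect(partitions: Seq[Seq[String]], threads: Int): List[Movie] = {
    val pool = Executors.newFixedThreadPool(threads)
    try {
      val futures = partitions.toList.map { p =>
        pool.submit(new Callable[List[Movie]] {
          def call(): List[Movie] = runTask(p)
        })
      }
      futures.flatMap(_.get())
    } finally pool.shutdown()
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample rows, one per partition, mirroring the 4 partitions on the slides.
    val partitions = Seq(
      Seq("1|Example Film A|01-Jan-1995"),
      Seq("2|Example Film B|10-Nov-1995"),
      Seq("3|Example Film C|22-Nov-1996"),
      Seq("4|Example Film D|05-Mar-1997")
    )
    collect(partitions, threads = 2).foreach(println)
  }
}
```

Because each partition is handled by exactly one task, the number of partitions caps how many threads can do useful work at once, which is the sense in which partitions define the degree of parallelism.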