SlideShare a Scribd company logo
1© Cloudera, Inc. All rights reserved.
Getting Spark Customers to Production
Kostas Sakellis
2© Cloudera, Inc. All rights reserved.
Me
• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, contributed to Cloudera Manager
3© Cloudera, Inc. All rights reserved.
Our customers
• Various degrees of sophistication with Spark
• In all stages of development
• From POC to production deployments
• 95% use Spark on YARN*
• Biweekly analysis of tickets
4© Cloudera, Inc. All rights reserved.
WARING: This is biased!
5© Cloudera, Inc. All rights reserved.
Building a proof of
concept!
Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg
6© Cloudera, Inc. All rights reserved.
“Why is my job failing?”
7© Cloudera, Inc. All rights reserved.
“Why is my job slow?”
8© Cloudera, Inc. All rights reserved.
Misconfiguration
accounts for 20% of
job failures
Courtesy of: http://blog.sdrock.com/pastors/files/2013/06/time-clock.jpg
9© Cloudera, Inc. All rights reserved.
Resource Declaration
• Not easy knowing what you need and how to specify it
• Compute:
• --num-executors vs. --num-cores
• Memory
• --executor-memory
• Includes JVM overhead
• Need to do the math yourself
10© Cloudera, Inc. All rights reserved.
Dynamic Allocation
• Let Spark do the work for you
• Available since Spark 1.2*
• No need to specify compute a priori
• Limitation: Still required to specify cores
• In future:
• Allow specification of “task size”
• Dynamically allocate cores
11© Cloudera, Inc. All rights reserved.
YARN Configuration mismatch
• Compute:
• yarn.nodemanager.resource.cpu-vcores
• yarn.scheduler.maximum-allocation.vcores
• Memory:
• yarn.nodemanager.resource.memory-mb
• yarn.scheduler.maximum-allocation-mb
12© Cloudera, Inc. All rights reserved.
YARN Configuration mismatch
• Common to ask for more resources than allowed
• Future work:
• Exposing relevant YARN configurations in Spark UI
• Requires changes to YARN itself
13© Cloudera, Inc. All rights reserved.
Container
[pid=63375,containerID=container_1388158490598_0001_01_00
0003] is running beyond physical memory limits. Current
usage: 2.1 GB of 2 GB physical memory used; 2.8 GB of 4.2
GB virtual memory used. Killing container.
[...]
Another YARN goodie…
14© Cloudera, Inc. All rights reserved.
yarn.nodemanager.resource.memory-mb
Executor Container
spark.yarn.executor.memoryOverhead (7%) (10% in 1.4)
spark.executor.memory
spark.shuffle.memoryFraction (0.4) spark.storage.memoryFraction (0.6)
Memory allocation
15© Cloudera, Inc. All rights reserved.
YARN Overhead
• Future work:
• Better understanding of off heap allocations
• Improve memory usage visibility
16© Cloudera, Inc. All rights reserved.
Run program
through all our
data
Courtesy of:https://conniehallscott.files.wordpress.com/2013/01/411748_538971446114753_1125606225_o.jpg
17© Cloudera, Inc. All rights reserved.
Data dependent tuning
• As data rates change, re-tuning Spark is usually necessary
• Spark is sensitive to shuffle spills
• The most common knob we modify is…
18© Cloudera, Inc. All rights reserved.
Partitions, Partitions, Partitions!
19© Cloudera, Inc. All rights reserved.
GC Stalls
20© Cloudera, Inc. All rights reserved.
Partitions
• Smaller is often better
• Parameterized partition size
• reduceByKey(…, nPartitions)
• Parameterize application
• Future work:
• Dynamically determine # of partitions (SPARK-4630)
21© Cloudera, Inc. All rights reserved.
But for now?
• Easy answer:
• Keep multiplying by 1.5 and see what works
• Harder answer:
22© Cloudera, Inc. All rights reserved.
Shuffle less!
23© Cloudera, Inc. All rights reserved.
Shuffles
Wide DependencyNarrow Dependencies
24© Cloudera, Inc. All rights reserved.
ReduceByKey when Possible
•ReduceByKey allows a map-side-combine
parsed
.map{line =>(line.level, 1)}
.reduceByKey{(a, b) => a + b}
.collect()
•GroupByKey transfers all the data
parsed
.map{line =>(line.level, 1)}
.groupByKey.map{case(word,counts) =>
(word,counts.sum)}
.collect()
25© Cloudera, Inc. All rights reserved.
ReduceByKey when Possible
•ReduceByKey
•GroupByKey
26© Cloudera, Inc. All rights reserved.
Security, now it’s
getting serious.
Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg
27© Cloudera, Inc. All rights reserved.
Authentication
• Kerberos – the necessary evil
• Ubiquitous amongst other services
• YARN, HDFS, Hive, HBase, etc.
• Spark utilizes delegation tokens
28© Cloudera, Inc. All rights reserved.
Encryption
• Control plane
• File distribution
• Block Manager
• User UI / REST API
• Data-at-rest (shuffle files)
SPARK-6028 (Replace with netty)
Replace with netty
Spark 1.4
SPARK-2750 (SSL)
SPARK-5682
29© Cloudera, Inc. All rights reserved.
Authorization
• Enterprises have sensitive data
• Beyond HDFS file permissions
• Partial access to data
• Column level granularity
• Apache Sentry
• HDFS-Sentry synchronization plugin
30© Cloudera, Inc. All rights reserved.
Customers often
have shared
infrastructure
Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg
31© Cloudera, Inc. All rights reserved.
Multi-tenancy
• Cluster utilization is top metric
• Target: 70-80% utilization
• Mixed workloads from mixed customers
• We recommend YARN
• Built in resource manager
32© Cloudera, Inc. All rights reserved.
Underutilized
Clusters
Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG
33© Cloudera, Inc. All rights reserved.
Dynamic Allocation
• Allows jobs to scale to size according to load
• Knobs to control min, max and initial size
• Future Work:
• Target: Dynamic allocation enabled by default
• Data locality & Caching
• Open question with Streaming
34© Cloudera, Inc. All rights reserved.
Thank you
We’re Hiring!

More Related Content

What's hot

Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer
Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer  Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer
Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer
Mary Kypreos
 
Apache Accumulo Overview
Apache Accumulo OverviewApache Accumulo Overview
Apache Accumulo Overview
Bill Havanki
 
Security implementation on hadoop
Security implementation on hadoopSecurity implementation on hadoop
Security implementation on hadoop
Wei-Chiu Chuang
 
Hadoop on Docker
Hadoop on DockerHadoop on Docker
Hadoop on Docker
Rakesh Saha
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Cloudera, Inc.
 
Running Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformRunning Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale Platform
InMobi Technology
 
A deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloudA deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloud
Cloudera, Inc.
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
Cloudera, Inc.
 
Road to Cloudera certification
Road to Cloudera certificationRoad to Cloudera certification
Road to Cloudera certification
Cloudera, Inc.
 
Farming hadoop in_the_cloud
Farming hadoop in_the_cloudFarming hadoop in_the_cloud
Farming hadoop in_the_cloud
Steve Loughran
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud?
DataWorks Summit
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Cloudera, Inc.
 
Multi-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BTMulti-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BT
Cloudera, Inc.
 
February 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with DockerFebruary 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with Docker
Yahoo Developer Network
 
Apache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in HadoopApache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in Hadoop
Cloudera Japan
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Cloudera, Inc.
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Jeremy Beard
 
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac... Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
Cloudera, Inc.
 
Five Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWSFive Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWS
Cloudera, Inc.
 

What's hot (20)

Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer
Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer  Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer
Spark in YARN-managed Multi-tenant Clusters by Pravin Mittal and Rajesh Iyer
 
Apache Accumulo Overview
Apache Accumulo OverviewApache Accumulo Overview
Apache Accumulo Overview
 
Security implementation on hadoop
Security implementation on hadoopSecurity implementation on hadoop
Security implementation on hadoop
 
Hadoop on Docker
Hadoop on DockerHadoop on Docker
Hadoop on Docker
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Running Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformRunning Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale Platform
 
A deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloudA deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloud
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
 
Road to Cloudera certification
Road to Cloudera certificationRoad to Cloudera certification
Road to Cloudera certification
 
Farming hadoop in_the_cloud
Farming hadoop in_the_cloudFarming hadoop in_the_cloud
Farming hadoop in_the_cloud
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud?
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
 
Multi-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BTMulti-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BT
 
February 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with DockerFebruary 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with Docker
 
Apache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in HadoopApache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in Hadoop
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac... Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 
Five Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWSFive Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWS
 

Viewers also liked

Why Your Apache Spark Job is Failing
Why Your Apache Spark Job is FailingWhy Your Apache Spark Job is Failing
Why Your Apache Spark Job is Failing
Cloudera, Inc.
 
Apache Spark Use case for Education Industry
Apache Spark Use case for Education IndustryApache Spark Use case for Education Industry
Apache Spark Use case for Education Industry
Vinayak Agrawal
 
Cancer Outlier Pro file Analysis using Apache Spark
Cancer Outlier Profile Analysis using Apache SparkCancer Outlier Profile Analysis using Apache Spark
Cancer Outlier Pro file Analysis using Apache Spark
Mahmoud Parsian
 
How Totango uses Apache Spark
How Totango uses Apache SparkHow Totango uses Apache Spark
How Totango uses Apache Spark
Oren Raboy
 
Kodu Game Lab e Project Spark
Kodu Game Lab e Project SparkKodu Game Lab e Project Spark
Kodu Game Lab e Project Spark
Fabrício Catae
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
Miklos Christine
 
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
Modern Data Stack France
 
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Databricks
 
Building a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with DatabricksBuilding a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with Databricks
Databricks
 
Lambda Architectures in Practice
Lambda Architectures in PracticeLambda Architectures in Practice
Lambda Architectures in Practice
C4Media
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
DataWorks Summit/Hadoop Summit
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Joe Stein
 
Real Time BOM Explosions with Apache Solr and Spark
Real Time BOM Explosions with Apache Solr and SparkReal Time BOM Explosions with Apache Solr and Spark
Real Time BOM Explosions with Apache Solr and Spark
QAware GmbH
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Tony Ng
 
Voxxed Days Thesaloniki 2016 - Streaming Engines for Big Data
Voxxed Days Thesaloniki 2016 - Streaming Engines for Big DataVoxxed Days Thesaloniki 2016 - Streaming Engines for Big Data
Voxxed Days Thesaloniki 2016 - Streaming Engines for Big Data
Voxxed Days Thessaloniki
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
Databricks
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark Summit
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
Databricks
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
Legacy Typesafe (now Lightbend)
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Legacy Typesafe (now Lightbend)
 

Viewers also liked (20)

Why Your Apache Spark Job is Failing
Why Your Apache Spark Job is FailingWhy Your Apache Spark Job is Failing
Why Your Apache Spark Job is Failing
 
Apache Spark Use case for Education Industry
Apache Spark Use case for Education IndustryApache Spark Use case for Education Industry
Apache Spark Use case for Education Industry
 
Cancer Outlier Pro file Analysis using Apache Spark
Cancer Outlier Profile Analysis using Apache SparkCancer Outlier Profile Analysis using Apache Spark
Cancer Outlier Pro file Analysis using Apache Spark
 
How Totango uses Apache Spark
How Totango uses Apache SparkHow Totango uses Apache Spark
How Totango uses Apache Spark
 
Kodu Game Lab e Project Spark
Kodu Game Lab e Project SparkKodu Game Lab e Project Spark
Kodu Game Lab e Project Spark
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
 
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
 
Building a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with DatabricksBuilding a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with Databricks
 
Lambda Architectures in Practice
Lambda Architectures in PracticeLambda Architectures in Practice
Lambda Architectures in Practice
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
 
Real Time BOM Explosions with Apache Solr and Spark
Real Time BOM Explosions with Apache Solr and SparkReal Time BOM Explosions with Apache Solr and Spark
Real Time BOM Explosions with Apache Solr and Spark
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
Voxxed Days Thesaloniki 2016 - Streaming Engines for Big Data
Voxxed Days Thesaloniki 2016 - Streaming Engines for Big DataVoxxed Days Thesaloniki 2016 - Streaming Engines for Big Data
Voxxed Days Thesaloniki 2016 - Streaming Engines for Big Data
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 

Similar to Getting Apache Spark Customers to Production

Apache Spark Operations
Apache Spark OperationsApache Spark Operations
Apache Spark Operations
Cloudera, Inc.
 
Effective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant ClustersEffective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant Clusters
DataWorks Summit/Hadoop Summit
 
Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2
DataWorks Summit
 
Empower Hive with Spark
Empower Hive with SparkEmpower Hive with Spark
Empower Hive with Spark
DataWorks Summit
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
Cloudera, Inc.
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache Spark
Jeremy Beard
 
Spark etl
Spark etlSpark etl
Spark etl
Imran Rashid
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
Gwen (Chen) Shapira
 
YARN
YARNYARN
The Kubernetes WebLogic revival (part 2)
The Kubernetes WebLogic revival (part 2)The Kubernetes WebLogic revival (part 2)
The Kubernetes WebLogic revival (part 2)
Simon Haslam
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache Kafka
Grant Henke
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
DataWorks Summit
 
20191201 kubernetes managed weblogic revival - part 2
20191201 kubernetes managed weblogic revival - part 220191201 kubernetes managed weblogic revival - part 2
20191201 kubernetes managed weblogic revival - part 2
makker_nl
 
Hadoop security implementationon 20171003
Hadoop security implementationon 20171003Hadoop security implementationon 20171003
Hadoop security implementationon 20171003
lee tracie
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
Cloudera, Inc.
 
Elastic build environment
Elastic build environmentElastic build environment
Elastic build environment
Cachet Software Solutions Ltd
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)
Kathleen Ting
 
AWS實際架構實踐演化與解決方案
AWS實際架構實踐演化與解決方案AWS實際架構實踐演化與解決方案
AWS實際架構實踐演化與解決方案
CKmates
 
實際架構實踐演化與解決方案
實際架構實踐演化與解決方案實際架構實踐演化與解決方案
實際架構實踐演化與解決方案
Camel Camel
 
實際架構實踐演化與解決方案
實際架構實踐演化與解決方案實際架構實踐演化與解決方案
實際架構實踐演化與解決方案
CKmates
 

Similar to Getting Apache Spark Customers to Production (20)

Apache Spark Operations
Apache Spark OperationsApache Spark Operations
Apache Spark Operations
 
Effective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant ClustersEffective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant Clusters
 
Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2
 
Empower Hive with Spark
Empower Hive with SparkEmpower Hive with Spark
Empower Hive with Spark
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache Spark
 
Spark etl
Spark etlSpark etl
Spark etl
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
YARN
YARNYARN
YARN
 
The Kubernetes WebLogic revival (part 2)
The Kubernetes WebLogic revival (part 2)The Kubernetes WebLogic revival (part 2)
The Kubernetes WebLogic revival (part 2)
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache Kafka
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
 
20191201 kubernetes managed weblogic revival - part 2
20191201 kubernetes managed weblogic revival - part 220191201 kubernetes managed weblogic revival - part 2
20191201 kubernetes managed weblogic revival - part 2
 
Hadoop security implementationon 20171003
Hadoop security implementationon 20171003Hadoop security implementationon 20171003
Hadoop security implementationon 20171003
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
 
Elastic build environment
Elastic build environmentElastic build environment
Elastic build environment
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)
 
AWS實際架構實踐演化與解決方案
AWS實際架構實踐演化與解決方案AWS實際架構實踐演化與解決方案
AWS實際架構實踐演化與解決方案
 
實際架構實踐演化與解決方案
實際架構實踐演化與解決方案實際架構實踐演化與解決方案
實際架構實踐演化與解決方案
 
實際架構實踐演化與解決方案
實際架構實踐演化與解決方案實際架構實踐演化與解決方案
實際架構實踐演化與解決方案
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
The Third Creative Media
 
The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024
Yara Milbes
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
Peter Muessig
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
Quickdice ERP
 
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Paul Brebner
 
Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !
Marcin Chrost
 
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLESINTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
anfaltahir1010
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
VALiNTRY360
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
ICS
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
XfilesPro
 
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
dakas1
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
dakas1
 
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
kalichargn70th171
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
rodomar2
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
Remote DBA Services
 
Project Management: The Role of Project Dashboards.pdf
Project Management: The Role of Project Dashboards.pdfProject Management: The Role of Project Dashboards.pdf
Project Management: The Role of Project Dashboards.pdf
Karya Keeper
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
ervikas4
 
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom KittEnhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Peter Caitens
 

Recently uploaded (20)

Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...
 
The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
 
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
 
Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !
 
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLESINTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
 
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
 
Project Management: The Role of Project Dashboards.pdf
Project Management: The Role of Project Dashboards.pdfProject Management: The Role of Project Dashboards.pdf
Project Management: The Role of Project Dashboards.pdf
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
 
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom KittEnhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
 

Getting Apache Spark Customers to Production

  • 1. 1© Cloudera, Inc. All rights reserved. Getting Spark Customers to Production Kostas Sakellis
  • 2. 2© Cloudera, Inc. All rights reserved. Me • Software Engineer at Cloudera • Contributor to Apache Spark • Before that, contributed to Cloudera Manager
  • 3. 3© Cloudera, Inc. All rights reserved. Our customers • Various degrees of sophistication with Spark • In all stages of development • From POC to production deployments • 95% use Spark on YARN* • Biweekly analysis of tickets
  • 4. 4© Cloudera, Inc. All rights reserved. WARING: This is biased!
  • 5. 5© Cloudera, Inc. All rights reserved. Building a proof of concept! Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg
  • 6. 6© Cloudera, Inc. All rights reserved. “Why is my job failing?”
  • 7. 7© Cloudera, Inc. All rights reserved. “Why is my job slow?”
  • 8. 8© Cloudera, Inc. All rights reserved. Misconfiguration accounts for 20% of job failures Courtesy of: http://blog.sdrock.com/pastors/files/2013/06/time-clock.jpg
  • 9. 9© Cloudera, Inc. All rights reserved. Resource Declaration • Not easy knowing what you need and how to specify it • Compute: • --num-executors vs. --num-cores • Memory • --executor-memory • Includes JVM overhead • Need to do the math yourself
  • 10. 10© Cloudera, Inc. All rights reserved. Dynamic Allocation • Let Spark do the work for you • Available since Spark 1.2* • No need to specify compute a priori • Limitation: Still required to specify cores • In future: • Allow specification of “task size” • Dynamically allocate cores
  • 11. 11© Cloudera, Inc. All rights reserved. YARN Configuration mismatch • Compute: • yarn.nodemanager.resource.cpu-vcores • yarn.scheduler.maximum-allocation.vcores • Memory: • yarn.nodemanager.resource.memory-mb • yarn.scheduler.maximum-allocation-mb
  • 12. 12© Cloudera, Inc. All rights reserved. YARN Configuration mismatch • Common to ask for more resources than allowed • Future work: • Exposing relevant YARN configurations in Spark UI • Requires changes to YARN itself
  • 13. 13© Cloudera, Inc. All rights reserved. Container [pid=63375,containerID=container_1388158490598_0001_01_00 0003] is running beyond physical memory limits. Current usage: 2.1 GB of 2 GB physical memory used; 2.8 GB of 4.2 GB virtual memory used. Killing container. [...] Another YARN goodie…
  • 14. 14© Cloudera, Inc. All rights reserved. yarn.nodemanager.resource.memory-mb Executor Container spark.yarn.executor.memoryOverhead (7%) (10% in 1.4) spark.executor.memory spark.shuffle.memoryFraction (0.4) spark.storage.memoryFraction (0.6) Memory allocation
  • 15. 15© Cloudera, Inc. All rights reserved. YARN Overhead • Future work: • Better understanding of off heap allocations • Improve memory usage visibility
  • 16. 16© Cloudera, Inc. All rights reserved. Run program through all our data Courtesy of:https://conniehallscott.files.wordpress.com/2013/01/411748_538971446114753_1125606225_o.jpg
  • 17. 17© Cloudera, Inc. All rights reserved. Data dependent tuning • As data rates change, re-tuning Spark is usually necessary • Spark is sensitive to shuffle spills • The most common knob we modify is…
  • 18. 18© Cloudera, Inc. All rights reserved. Partitions, Partitions, Partitions!
  • 19. 19© Cloudera, Inc. All rights reserved. GC Stalls
  • 20. 20© Cloudera, Inc. All rights reserved. Partitions • Smaller is often better • Parameterized partition size • reduceByKey(…, nPartitions) • Parameterize application • Future work: • Dynamically determine # of partitions (SPARK-4630)
  • 21. 21© Cloudera, Inc. All rights reserved. But for now? • Easy answer: • Keep multiplying by 1.5 and see what works • Harder answer:
  • 22. 22© Cloudera, Inc. All rights reserved. Shuffle less!
  • 23. 23© Cloudera, Inc. All rights reserved. Shuffles Wide DependencyNarrow Dependencies
  • 24. 24© Cloudera, Inc. All rights reserved. ReduceByKey when Possible •ReduceByKey allows a map-side-combine parsed .map{line =>(line.level, 1)} .reduceByKey{(a, b) => a + b} .collect() •GroupByKey transfers all the data parsed .map{line =>(line.level, 1)} .groupByKey.map{case(word,counts) => (word,counts.sum)} .collect()
  • 25. 25© Cloudera, Inc. All rights reserved. ReduceByKey when Possible •ReduceByKey •GroupByKey
  • 26. 26© Cloudera, Inc. All rights reserved. Security, now it’s getting serious. Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg
  • 27. 27© Cloudera, Inc. All rights reserved. Authentication • Kerberos – the necessary evil • Ubiquitous amongst other services • YARN, HDFS, Hive, HBase, etc. • Spark utilizes delegation tokens
  • 28. 28© Cloudera, Inc. All rights reserved. Encryption • Control plane • File distribution • Block Manager • User UI / REST API • Data-at-rest (shuffle files) SPARK-6028 (Replace with netty) Replace with netty Spark 1.4 SPARK-2750 (SSL) SPARK-5682
  • 29. 29© Cloudera, Inc. All rights reserved. Authorization • Enterprises have sensitive data • Beyond HDFS file permissions • Partial access to data • Column level granularity • Apache Sentry • HDFS-Sentry synchronization plugin
  • 30. 30© Cloudera, Inc. All rights reserved. Customers often have shared infrastructure Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg
  • 31. 31© Cloudera, Inc. All rights reserved. Multi-tenancy • Cluster utilization is top metric • Target: 70-80% utilization • Mixed workloads from mixed customers • We recommend YARN • Built in resource manager
  • 32. 32© Cloudera, Inc. All rights reserved. Underutilized Clusters Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG
  • 33. 33© Cloudera, Inc. All rights reserved. Dynamic Allocation • Allows jobs to scale to size according to load • Knobs to control min, max and initial size • Future Work: • Target: Dynamic allocation enabled by default • Data locality & Caching • Open question with Streaming
  • 34. 34© Cloudera, Inc. All rights reserved. Thank you We’re Hiring!

Editor's Notes

  1. Lets talk about what we have seen as issues from our customers as issues as they try to get Spark into production.
  2. In scope - Focus on operational issues - Not on building the code itself Experience from our customer support tickets
  3. In scope - Focus on operational issues - Not on building the code itself Experience from our customer support tickets
  4. Spark makes building a proof of concept with a subset of data relatively easy. But then things go wrong Plug for my talk at Hadoop Summit
  5. num-executors vs. num-cores? 10 executors with 1 core, or 5 executors with 2 cores? Memory: - this is the aggregate across all cores.
  6. This shows up in the YARN NodeManager logs
  7. Spark makes building a proof of concept with a subset of data relatively easy.
  8. Max partition size is 2GB Small partitions help deal w/ stragglers Small partitions avoid overhead
  9. Fastest way to shuffle a lot of data: Don’t shuffle Second fastest way to shuffle a lot of data: Shuffle a small amount of data
  10. Data is merged together before its serialized & sent over network Vs. Higher serialization and network transfer costs
  11. Data is merged together before its serialized & sent over network Vs. Higher serialization and network transfer costs
  12. Data is merged togethe before its serialized & sent over network Vs. Higher serialization and network transfer costs
  13. Spark makes building a proof of concept with a subset of data relatively easy.
  14. Control plane File distribution Block Manager User UI / REST API Data-at-rest (shuffle files)
  15. Spark makes building a proof of concept with a subset of data relatively easy.
  16. Dynamic allocation: - streaming - locality (worked on) - making it even better.