SlideShare a Scribd company logo
Brad Kaiser, IBM/TWC
Craig Ingram, IBM/TWC
Supporting Highly
Multitenant Spark
Notebook Workloads
#EUdev8
Hosting Multitenant Spark Notebooks Is
Hard But It Doesn't Have To Be
• Our Journey
• Best Practices
• New Work
2#EUdev8
Our Journey
3#EUdev8
Who we are
4#EUdev8
IBM's Commitment to Open Source
5#EUdev8
• Contribute intellectual and technical
capital to the Apache Spark
community.
• Make the core technology enterprise-
and cloud-ready.
• Build data science skills to drive
intelligence into business applications
— https://cognitiveclass.ai/
• http://spark.tc
Mission
• Provide:
– a secure, performant, stable cluster.
– interactive analytics, visualizations, and reports.
– collaboration and sharing with other data scientists,
engineers, and consumers.
– job scheduling capabilities.
– a quick and easy way to get started with Spark.
6#EUdev8
Goals
• Support hundreds of analysts/data scientists
using Spark
– Quick kernel creation (<10s from notebook creation to
available sc).
– Utilize cluster resources efficiently.
– Elastically scale based on load.
7#EUdev8
Lessons Learned at TWC
8#EUdev8
One Big Cluster
9#EUdev8
• Spark collocated with
Cassandra
• Fast
• Stable
One Big Cluster - Cons
10#EUdev8
• Outside analysts start using
our cluster
• Provided notebook services
• Interrupted our perfectly
scheduled jobs
• Used a lot of resources
causing Cassandra to crash
Add Smaller Clusters
11#EUdev8
• We built some smaller
clusters
• Still platform agnostic
• Other teams couldn't
affect our production
cluster
Add Smaller Clusters - Cons
12#EUdev8
• Hassle to set up
• Required a lot of
maintenance
• Sat idle
EMR
• Analysts make ad hoc clusters for their needs
• No maintenance from us
• Learning curve for analysts
• They tend to leave them running
13#EUdev8
Lessons Learned - IBM
14#EUdev8
Data Science Experience
• Collaborative environment on the front end
– Collaboration Tools
– Shared Data Sets
– Flows
– GitHub Integration
• Multiple compute environments on the back end
– DSX on the cloud: compute runs on IBM cloud
– DSX Local: compute runs on private cloud or Z
15#EUdev8
Lessons Learned - IBM
• You need kernel remoting
– Allows advanced collaborative tools in the application tier
– Allows resource consolidation in the analytics tier
• Resource consolidation puts stress on the analytics
tier
– Starvation
– Management of cached data
– Performance bottlenecks (example: Spark web UI)
16#EUdev8
Best Practices
17#EUdev8
Best Practices
• Use kernel remoting
• Use fewer, bigger clusters
• Know your workloads
• Isolate users
• Schedule resources efficiently
18#EUdev8
Use Kernel Remoting
• Running all of your notebook kernels on the
same server is a bottle neck
• Run your kernels distributed on the cluster
• You can run a lot more notebooks
19#EUdev8
Jupyter Enterprise Gateway
• New Open Source project from IBM
• Goals:
– Allow hundreds of notebook users to share a single Spark
cluster…
– …with enterprise-level security and performance.
• Used in IBM Analytics Engine (GA)
• developer.ibm.com/code/openprojects/jupyter-
enterprise-gateway-2
20#EUdev8
Jupyter Enterprise Gateway:
How it Works
21#EUdev8
Spark Cluster
Security
Layer
Jupyter Enterprise Gateway
• Multitenant
• Remote Kernel Lifecycle Management
YARN
Workers
Impersonation:
Alice’s kernel
runs under
Alice’s user ID.
Spark
ExecutorsSpark Executors
Spark Executors
Yarn Container
Jupyter Kernel
Spark Driver
Spark
ExecutorsSpark Executors
Spark Executors
Yarn Container
Jupyter Kernel
Spark Driver
Secure
Secure
Secure Secure
Use Fewer, Bigger Clusters
• Better resource utilization through statistical multiplexing
• Improved security and auditing due to centralization
– Hive, Ranger, and Atlas are common in the ecosystem.
– Many new, platform specific solutions to address this problem.
• Easier collaboration between users
– Shared notebooks with interactive visualizations and markdown
support.
– GitHub integration for versioning and external sharing.
– Catalog based data discovery and sharing.
– Governance and auditing support.
22#EUdev8
Know your workloads
• What is the main resource they use?
• Overprovision the hardware
• When to scale up and down?
– CPU load
– YARN/RM Queue Stats (depth, waiting jobs, available
CPU/mem, preemptions)
• If containers are getting preempted, it’s due to queues filling
up.
23#EUdev8
Isolate Users
• Don't have a generic user account
• Catalog and Governance
– Hive
– Atlas
– Ranger
• You don't want users embedding keys and
passwords in their notebooks.
24#EUdev8
Schedule Resources Efficiently
25#EUdev8
YARN Queues
• Take advantage of YARN’s hierarchical queue system to manage
and organize resources.
– Over-allocate queues for better resource utilization and sharing.
– Take advantage of node labels for users that have priority jobs
that require an SLA.
– Intra-queue preemption and asynchronous container allocation
should be available in YARN 3.0.
26#EUdev8
YARN Queues
27#EUdev8
Dynamic Allocation
• Dynamic allocation lets you take advantage of varying activity
– Proactively scales the number of executors based on the
scheduler's backlog.
– Removes idle executors after a timeout.
– Be sure to set a sensible number of initial executors
and minimum executor floor and let them ramp up on
demand.
• Static allocation best for known workloads
28#EUdev8
New Work
29#EUdev8
Improvements to Spark
• Alleviate the tradeoffs inherent in current best
practices.
– Recover cached data when shutting down idle
executors
– Proactively shut down executors to prevent
starvation
30#EUdev8
Recovering Cached Data
• Replicates cached data to remaining executors
before shutting them down.
• Ameliorates cache issues with dynamic
allocation
• Useful in shared spark notebook environments
31#EUdev8
Benchmarks
32#EUdev8
Benchmarks
33#EUdev8
Check out my PR
– SPARK-21097
– github.com/apache/spark/pull/19041
34#EUdev8
Preventing Starvation
• Eliminate issues where users are unable to run
anything due to other users taking up all of the
cluster’s resources.
• Especially useful in shared spark notebook
environments where idle resources can be
reclaimed easily.
• Preemption can solve this.
35#EUdev8
Enter Preemption
• Requests containers associated with over-
allocated queues to shut down.
• Handle YARN's PreemptionMessage in a way
that best suits the workload.
• Pick the right executors to terminate.
36#EUdev8
Keep up with the JIRA
• SPARK-21122
• PR coming soon
37#EUdev8
Call to action
• Look at our JIRAs
• Try out our PRs
38#EUdev8
Shout-outs!!!
• Our notebook workload simulator, benchmark,
and tracing tools.
– spark-bench - github.com/SparkTC/spark-bench
• Check out Emily Curtin’s talk tomorrow about spark-bench.
– spark-tracing - github.com/SparkTC/spark-tracing
• Matthew Schauer's baby awaiting open-source approval.
• Special thanks to Vijay Bommireddipalli and
Fred Reiss for their guidance and support!
39#EUdev8
Contact Info
Brad Kaiser
kaiserb@us.ibm.com
Craig Ingram
cingram@us.ibm.com
40#EUdev8
Extra Material
41#EUdev8
YARN Asynchronous Scheduling
• Enable asynchronous scheduling of containers in YARN.
– yarn.scheduler.capacity.schedule-asynchronously.enable
– YARN-7327 and YARN-5139
42#EUdev8
References
• Some Icons provided by Icons8
43#EUdev8
Quick Settings
• Use spark.yarn.jars or
spark.yarn.archive.
• Running Spark from tmpfs did not improve
performance.
• Support for multiple/new versions of Spark.
44#EUdev8
Move to backup
Disable unused credential providers
• spark.yarn.security.credentials.hive.enabled
• spark.yarn.security.credentials.hbase.enabled
45#EUdev8
conf avg stdev
default 8.745 0.051
nohive 8.7 0.05
nohbase 7.87 0.05
Move to backup
Problem Domain
• Security
– UI and service protection
– Data governance and auditing
• Stability
• Performance
46#EUdev8
What’s next…
• spark-on-k8s
• Scheduler improvements
• Executor startup time reduction
47#EUdev8
• Hundreds of notebook users leads to a
highly multitenant and interactive workload.
• Challenge: Give each user the illusion of having
a large cluster all to herself
48#EUdev8
Many users
at once
Bursty
offered load
Latency is
important
What platforms are out there…
Hosted Solutions
• Data Science Experience
(DSX)/Watson Data Platform (WDP)
• databricks
• Google Cloud Platform
• Microsoft Azure HDInsight
• and others!!!
49#EUdev8
What platforms are out there…
Self-Hosted Solutions
• HDP – Hadoop Data Platform
• CDH – Cloudera Distribution
Including Hadoop
• MapR
• Mesosphere
50#EUdev8
Reasons to Self Host at TWC
• Cloud Agnostic
• Flexibility
• Sensitive Data
• Fewer Options in 2014
• Cassandra Colocation
• Cost…maybe?
51#EUdev8
Potential Downfalls
• In a self-hosted environment, everything is up to
you.
– Security
– Stability and Performance
– Scalability
• Compute
• Storage
– Monitoring
– Alerting
– Logging
52#EUdev8

More Related Content

What's hot

Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and Present
Databricks
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
Joan Viladrosa Riera
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Spark Summit
 
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Supporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined FunctionsSupporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined Functions
Databricks
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Databricks
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen Fan
Databricks
 
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaSpeeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Databricks
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
Spark Summit
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Spark Summit
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Databricks
 
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael GreeneUnleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Databricks
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
Databricks
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
Databricks
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
Databricks
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
Jen Aman
 

What's hot (20)

Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and Present
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Ana...
 
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Supporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined FunctionsSupporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined Functions
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen Fan
 
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaSpeeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce Spitler
 
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael GreeneUnleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 

Similar to Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser

The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
Big Data Spain
 
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
Luciano Resende
 
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
An Enterprise Analytics Platform with Jupyter Notebooks and Apache SparkAn Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
Luciano Resende
 
Reliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on KubernetesReliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on Kubernetes
Databricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Rose Toomey
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
YARN
YARNYARN
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
Dona Mary Philip
 
Desplegar en la nube y no morir en el intento - Plain Concepts Dev Day
Desplegar en la nube y no morir en el intento - Plain Concepts Dev DayDesplegar en la nube y no morir en el intento - Plain Concepts Dev Day
Desplegar en la nube y no morir en el intento - Plain Concepts Dev Day
Plain Concepts
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Databricks
 
Commit to excellence - Java in containers
Commit to excellence - Java in containersCommit to excellence - Java in containers
Commit to excellence - Java in containers
Red Hat Developers
 
Databricks clusters in autopilot mode
Databricks clusters in autopilot modeDatabricks clusters in autopilot mode
Databricks clusters in autopilot mode
Prakash Chockalingam
 
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Christopher Curtin
 
Virtualizing Apache Spark with Justin Murray
Virtualizing Apache Spark with Justin MurrayVirtualizing Apache Spark with Justin Murray
Virtualizing Apache Spark with Justin Murray
Databricks
 
(ATS4-PLAT06) Considerations for sizing and deployment
(ATS4-PLAT06) Considerations for sizing and deployment(ATS4-PLAT06) Considerations for sizing and deployment
(ATS4-PLAT06) Considerations for sizing and deployment
BIOVIA
 
Chicago spark meetup-april2017-public
Chicago spark meetup-april2017-publicChicago spark meetup-april2017-public
Chicago spark meetup-april2017-public
Guru Dharmateja Medasani
 
DevOps Supercharged with Docker on Exadata
DevOps Supercharged with Docker on ExadataDevOps Supercharged with Docker on Exadata
DevOps Supercharged with Docker on Exadata
MarketingArrowECS_CZ
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel GatewayBig analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel Gateway
Luciano Resende
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark Presentation
Stephen Borg
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Ali Hodroj
 

Similar to Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser (20)

The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
 
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
An Enterprise Analytics Platform with Jupyter Notebooks and Apache SparkAn Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
 
Reliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on KubernetesReliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on Kubernetes
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
YARN
YARNYARN
YARN
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
 
Desplegar en la nube y no morir en el intento - Plain Concepts Dev Day
Desplegar en la nube y no morir en el intento - Plain Concepts Dev DayDesplegar en la nube y no morir en el intento - Plain Concepts Dev Day
Desplegar en la nube y no morir en el intento - Plain Concepts Dev Day
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
 
Commit to excellence - Java in containers
Commit to excellence - Java in containersCommit to excellence - Java in containers
Commit to excellence - Java in containers
 
Databricks clusters in autopilot mode
Databricks clusters in autopilot modeDatabricks clusters in autopilot mode
Databricks clusters in autopilot mode
 
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
 
Virtualizing Apache Spark with Justin Murray
Virtualizing Apache Spark with Justin MurrayVirtualizing Apache Spark with Justin Murray
Virtualizing Apache Spark with Justin Murray
 
(ATS4-PLAT06) Considerations for sizing and deployment
(ATS4-PLAT06) Considerations for sizing and deployment(ATS4-PLAT06) Considerations for sizing and deployment
(ATS4-PLAT06) Considerations for sizing and deployment
 
Chicago spark meetup-april2017-public
Chicago spark meetup-april2017-publicChicago spark meetup-april2017-public
Chicago spark meetup-april2017-public
 
DevOps Supercharged with Docker on Exadata
DevOps Supercharged with Docker on ExadataDevOps Supercharged with Docker on Exadata
DevOps Supercharged with Docker on Exadata
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel GatewayBig analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel Gateway
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark Presentation
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Spark Summit
 
Variant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr SzulVariant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr Szul
Spark Summit
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
 
Variant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr SzulVariant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr Szul
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
 

Recently uploaded

一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
yuvarajkumar334
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
yuvarajkumar334
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
tzu5xla
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
Vineet
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
zsafxbf
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
KiriakiENikolaidou
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
Vineet
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
osoyvvf
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
nhutnguyen355078
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Marlon Dumas
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
Bisnar Chase Personal Injury Attorneys
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
slg6lamcq
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
blueshagoo1
 

Recently uploaded (20)

一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
 

Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and Brad Kaiser

  • 1. Brad Kaiser, IBM/TWC Craig Ingram, IBM/TWC Supporting Highly Multitenant Spark Notebook Workloads #EUdev8
  • 2. Hosting Multitenant Spark Notebooks Is Hard But It Doesn't Have To Be • Our Journey • Best Practices • New Work 2#EUdev8
  • 5. IBM's Commitment to Open Source 5#EUdev8 • Contribute intellectual and technical capital to the Apache Spark community. • Make the core technology enterprise- and cloud-ready. • Build data science skills to drive intelligence into business applications — https://cognitiveclass.ai/ • http://spark.tc
  • 6. Mission • Provide: – a secure, performant, stable cluster. – interactive analytics, visualizations, and reports. – collaboration and sharing with other data scientists, engineers, and consumers. – job scheduling capabilities. – a quick and easy way to get started with Spark. 6#EUdev8
  • 7. Goals • Support hundreds of analysts/data scientists using Spark – Quick kernel creation (<10s from notebook creation to available sc). – Utilize cluster resources efficiently. – Elastically scale based on load. 7#EUdev8
  • 8. Lessons Learned at TWC 8#EUdev8
  • 9. One Big Cluster 9#EUdev8 • Spark collocated with Cassandra • Fast • Stable
  • 10. One Big Cluster - Cons 10#EUdev8 • Outside analysts start using our cluster • Provided notebook services • Interrupted our perfectly scheduled jobs • Used a lot of resources causing Cassandra to crash
  • 11. Add Smaller Clusters 11#EUdev8 • We built some smaller clusters • Still platform agnostic • Other teams couldn't affect our production cluster
  • 12. Add Smaller Clusters - Cons 12#EUdev8 • Hassle to set up • Required a lot of maintenance • Sat idle
  • 13. EMR • Analysts make ad hoc clusters for their needs • No maintenance from us • Learning curve for analysts • They tend to leave them running 13#EUdev8
  • 14. Lessons Learned - IBM 14#EUdev8
  • 15. Data Science Experience • Collaborative environment on the front end – Collaboration Tools – Shared Data Sets – Flows – GitHub Integration • Multiple compute environments on the back end – DSX on the cloud: compute runs on IBM cloud – DSX Local: compute runs on private cloud or Z 15#EUdev8
  • 16. Lessons Learned - IBM • You need kernel remoting – Allows advanced collaborative tools in the application tier – Allows resource consolidation in the analytics tier • Resource consolidation puts stress on the analytics tier – Starvation – Management of cached data – Performance bottlenecks (example: Spark web UI) 16#EUdev8
  • 18. Best Practices • Use kernel remoting • Use fewer, bigger clusters • Know your workloads • Isolate users • Schedule resources efficiently 18#EUdev8
  • 19. Use Kernel Remoting • Running all of your notebook kernels on the same server is a bottle neck • Run your kernels distributed on the cluster • You can run a lot more notebooks 19#EUdev8
  • 20. Jupyter Enterprise Gateway • New Open Source project from IBM • Goals: – Allow hundreds of notebook users to share a single Spark cluster… – …with enterprise-level security and performance. • Used in IBM Analytics Engine (GA) • developer.ibm.com/code/openprojects/jupyter- enterprise-gateway-2 20#EUdev8
  • 21. Jupyter Enterprise Gateway: How it Works 21#EUdev8 Spark Cluster Security Layer Jupyter Enterprise Gateway • Multitenant • Remote Kernel Lifecycle Management YARN Workers Impersonation: Alice’s kernel runs under Alice’s user ID. Spark ExecutorsSpark Executors Spark Executors Yarn Container Jupyter Kernel Spark Driver Spark ExecutorsSpark Executors Spark Executors Yarn Container Jupyter Kernel Spark Driver Secure Secure Secure Secure
  • 22. Use Fewer, Bigger Clusters • Better resource utilization through statistical multiplexing • Improved security and auditing due to centralization – Hive, Ranger, and Atlas are common in the ecosystem. – Many new, platform specific solutions to address this problem. • Easier collaboration between users – Shared notebooks with interactive visualizations and markdown support. – GitHub integration for versioning and external sharing. – Catalog based data discovery and sharing. – Governance and auditing support. 22#EUdev8
  • 23. Know your workloads • What is the main resource they use? • Overprovision the hardware • When to scale up and down? – CPU load – YARN/RM Queue Stats (depth, waiting jobs, available CPU/mem, preemptions) • If containers are getting preempted, it’s due to queues filling up. 23#EUdev8
  • 24. Isolate Users • Don't have a generic user account • Catalog and Governance – Hive – Atlas – Ranger • You don't want users embedding keys and passwords in their notebooks. 24#EUdev8
  • 26. YARN Queues • Take advantage of YARN’s hierarchical queue system to manage and organize resources. – Over-allocate queues for better resource utilization and sharing. – Take advantage of node labels for users that have priority jobs that require an SLA. – Intra-queue preemption and asynchronous container allocation should be available in YARN 3.0. 26#EUdev8
  • 28. Dynamic Allocation • Dynamic allocation lets you take advantage of varying activity – Proactively scales the number of executors based on the scheduler's backlog. – Removes idle executors after a timeout. – Be sure to set a sensible number of initial executors and minimum executor floor and let them ramp up on demand. • Static allocation best for known workloads 28#EUdev8
  • 30. Improvements to Spark • Alleviate the tradeoffs inherent in current best practices. – Recover cached data when shutting down idle executors – Proactively shut down executors to prevent starvation 30#EUdev8
  • 31. Recovering Cached Data • Replicates cached data to remaining executors before shutting them down. • Ameliorates cache issues with dynamic allocation • Useful in shared spark notebook environments 31#EUdev8
  • 34. Check out my PR – SPARK-21097 – github.com/apache/spark/pull/19041 34#EUdev8
  • 35. Preventing Starvation • Eliminate issues where users are unable to run anything due to other users taking up all of the cluster’s resources. • Especially useful in shared spark notebook environments where idle resources can be reclaimed easily. • Preemption can solve this. 35#EUdev8
  • 36. Enter Preemption • Requests containers associated with over- allocated queues to shut down. • Handle YARN's PreemptionMessage in a way that best suits the workload. • Pick the right executors to terminate. 36#EUdev8
  • 37. Keep up with the JIRA • SPARK-21122 • PR coming soon 37#EUdev8
  • 38. Call to action • Look at our JIRAs • Try out our PRs 38#EUdev8
  • 39. Shout-outs!!! • Our notebook workload simulator, benchmark, and tracing tools. – spark-bench - github.com/SparkTC/spark-bench • Check out Emily Curtin’s talk tomorrow about spark-bench. – spark-tracing - github.com/SparkTC/spark-tracing • Matthew Schauer's baby awaiting open-source approval. • Special thanks to Vijay Bommireddipalli and Fred Reiss for their guidance and support! 39#EUdev8
  • 40. Contact Info Brad Kaiser kaiserb@us.ibm.com Craig Ingram cingram@us.ibm.com 40#EUdev8
  • 42. YARN Asynchronous Scheduling • Enable asynchronous scheduling of containers in YARN. – yarn.scheduler.capacity.schedule-asynchronously.enable – YARN-7327 and YARN-5139 42#EUdev8
  • 43. References • Some Icons provided by Icons8 43#EUdev8
  • 44. Quick Settings • Use spark.yarn.jars or spark.yarn.archive. • Running Spark from tmpfs did not improve performance. • Support for multiple/new versions of Spark. 44#EUdev8 Move to backup
  • 45. Disable unused credential providers • spark.yarn.security.credentials.hive.enabled • spark.yarn.security.credentials.hbase.enabled 45#EUdev8 conf avg stdev default 8.745 0.051 nohive 8.7 0.05 nohbase 7.87 0.05 Move to backup
  • 46. Problem Domain • Security – UI and service protection – Data governance and auditing • Stability • Performance 46#EUdev8
  • 47. What’s next… • spark-on-k8s • Scheduler improvements • Executor startup time reduction 47#EUdev8
  • 48. • Hundreds of notebook users leads to a highly multitenant and interactive workload. • Challenge: Give each user the illusion of having a large cluster all to herself 48#EUdev8 Many users at once Bursty offered load Latency is important
  • 49. What platforms are out there… Hosted Solutions • Data Science Experience (DSX)/Watson Data Platform (WDP) • databricks • Google Cloud Platform • Microsoft Azure HDInsight • and others!!! 49#EUdev8
  • 50. What platforms are out there… Self-Hosted Solutions • HDP – Hadoop Data Platform • CDH – Cloudera Distribution Including Hadoop • MapR • Mesosphere 50#EUdev8
  • 51. Reasons to Self Host at TWC • Cloud Agnostic • Flexibility • Sensitive Data • Fewer Options in 2014 • Cassandra Colocation • Cost…maybe? 51#EUdev8
  • 52. Potential Downfalls • In a self-hosted environment, everything is up to you. – Security – Stability and Performance – Scalability • Compute • Storage – Monitoring – Alerting – Logging 52#EUdev8