SlideShare a Scribd company logo
Why Kubernetes as a container
orchestrator is a right choice for
running spark clusters on cloud?
Rachit Arora
rachitar@in.ibm.com
IBM, India Software Labs
Spark
Unified, open source, parallel, data processing framework for Big Data Analytics
Spark Core Engine
Yarn Mesos
Standalon
e
Scheduler
Kubernete
s
Spark SQL
Interactive
Queries
Spark
Streaming
Stream
processing
Spark
MLlib
Machine
Learning
GraphX
Graph
Computation
Typical Bigdata Application
Secure
Catalog and Search
Ingest &
Store
Prepare Analyze Visualize
Date Engineer Date Scientist
Application
Developer
Let look into role of Data Scientist
• I want to run my analytics jobs
• Social media analytics
• Text analytics (Structure and Unstructured)
• I want to run queries on demand
• I want to run R scripts
• I want to submit Spark jobs
• I want to view History Server Logs of my application
• I want to View Daemon logs
• I want to write Notebooks
Evolution of Spark Analytics
On Prem Install
• Acquire
Hardware
• Prepare
Machine
• Install Spark
• Retry
• Apply patches
• security
• Upgrades
• Scale
• High
availability
Virtualization
• Prepare Vm
Imaging
Solution
• Network
Management
• High
Avilability
• Patches
• Scale
Managed
• Configure
Cluster
• Customize
• Scale
• Pay even if
idle
Serverless
• Run analytics
Spark Serverless Characteristics
• No Servers to Provision
• Scale with usage
• Availability and fault tolerant
• Never pay for idle
Kernel
History
Server
Notebook
Server
Browswer
Data
Scientist
COS/Injestion
Data
Engineer
Typical Hadoop/Spark Cluster - Setup
• Get the suitable hardware
• Prepare host machine
• Setup various networks
• Private
• Public
• Management
• Fetch the binaries for the install
• Prepare the blueprint/config file for the install
• Start the install
• Many a times install fails, debug and retry again.
Earlier Experiments
Option OS Provisioning Config
Cluster Management /
Updates
1 Bare metal Chef Chef
2 xCAT – Stateful(Create your own VMs) PostScripts xCAT - updateNode
3 xCAT – Sysclone(Image from current system) Not Needed xCAT - updateNode
4 Bare metal PostScripts xCAT - updateNode
5 Cloud Provider Specific Images Not Needed Manual/Scripts
6 Standard-ISO Image Anaconda -post scripts Manual/Scripts
How do I build Serverless Spark
• Option 1: Vanilla Containers – If I need to build with Kubernetes
• Repeatable
• Application Portability
• Faster Development Cycle
• Reduced dev-ops load
• Improved Infrastructure Utilization
Guiding Principles
• Virtualization helps repeatability, lesser failures & speed
• Maintenance
• Performance(Equivalent to Bare metal)
• Use open source from an active community
• Cloud-agnostic
Think Containers!!
Containers: Thinking as VMs ? (mistake)
Docker in Hadoop Cluster on Cloud
• Each cluster node is a virtual node powered by Docker. Each node of
the cluster is a Docker container
• Docker containers run on a bunch of bare metal hosts (Docker-hosts)
• Each Hadoop cluster will have multiple nodes/Docker containers
spanning multiple hosts
• Docker
• Container management - Custom
• Multi host networking – Overlay Network
• Registry – Private
• Local Storage
Typical Clusters
L
D
A
P
K
M
S
My
SQL
S
S
H
Master Data 1
Data 2 Data 3
L
D
A
P
K
M
S
My
SQL
S
S
H
Master Data 1
Data 2
L
D
A
P
K
M
S
My
SQL
S
S
H
Master Data 1
Data 2 Data 3
Data 4 Data 5Cluster 1
Cluster 2
Cluster 3
Network Boundary
Docker Images
• Master node
• Data Node
• Edge Node
• Auxiliary service images
• Ldap
• Mysql
• Ambari server
• KMS
Multi host Docker networking
• Weave based overlay network among nodes
• One /26 private subnet per cluster (172.x.x.x)
• Master node to have a public IP – ports-forwarding
• Portable public IPs
• Network speed (shared with other masters)
• Edge node will be accessible using a public IP
• User can SSH and run Hive, Hbase, Hadoop & Spark shells
• Private network
• High Speed
• Secure
Network Architecture
A A A B
bond1 bond0 eth-weave
docker0
10 Gbps Softlayer
private network
10 Gbps Softlayer
public network
B B B C
eth-weave
docker0
C C C C
eth-weave
docker0
bond1 bond0 bond1 bond0
docker port
forwarding
* docker ICC=false (no inter container communication over docker0 network)
* All inter-container communication is through weave network
* One weave’s private subnet per cluster (No communication across subnets)
weave overlay network on Softlayer private network
Host 1 Host 2 Host 3
Master
node
B BB
Master
node
Edge
node
Data
node
Cluster Provisioning
Provisioning Infrastructure
• Cluster Manager that provides REST API to create cluster
• API Gateway application
• Deployment agent
• Home grown Container Orchestrator - Deployer scripts that actually do all the work
• Prepare Directory structure
• Prepare network
• Start Containers with right options for
• Volumes
• Ports
• IP
• Hostname
• network
Phases involved
1. Acquire Hardware
2. Deploy Provisioning Infra
3. Add Hardware to Resource Pool
4. Prepare Host Machines
5. Orchestrate Cluster lifecycle
1. Create Cluster, Add Nodes, Remove Nodes, Delete Cluster
How is a cluster created?
Cluster
Manager
DBAPI Gateway
2. Manage resources
1. Create Cluster
Deployer
Agent
Deployer
Agent
Deployer
Agent
3. Create Cluster
4. PrepareNode
7. Install IOP
5. Get Node details
6. Get Blueprint
How do I build Serverless Spark
• Option 2 : Function as a service
• Single Node Cluster – Or No Cluster at all
• Spark local mode
• all in one Image
• Resource Limitations
• Design Limitations
How do I build Serverless Spark
• Option 3 : Kubernetes
* Slide from Kubernetes Scheduler Design & Discussion
What Kubernetes Bring in?
• Kubernetes is an open-source system for automating deployment,
scaling, and management of containerized applications.
• It Manages Containers for me
• It Manages High availability
• It Provides me flexibility to choose resource I WANT and Persistence I want
• Kubernetes – Lots of addon services: third-party logging, monitoring,
and security tools
• Reduced operational costs
• Improved infrastructure utilization
Kubernetes - Spark Cluster Manager Options
Kubernetes Scheduler
Standalone
Yarn on Kubernetes ?
Conclusion
• Spark Serverless - need for Data Scientist
• Kuberenetes enables
• Spark Clusters in Background and Kernel Up and Running in seconds
• High Availability
• Auto scaling with Spark monitoring and Kubernetes deployment features
• No Extra Money for idle time
References
• IBM Watson Studio
https://datascience.ibm.com
• IBM Watson
https://www.ibm.com/analytics/us/en/watson-data-platform/tutorial/
• Analytics Engine
https://www.ibm.com/cloud/analytics-engine
• Apache Spark
• Kubernetes Scheduler
Design & Discussion
• Kuberenetes Clusters on IBM Cloud
Rachit Arora
rachitar@in.ibm.com
@rachit1arora
Thank you
Rachit Arora
rachitar@in.ibm.com
@rachit1arora

More Related Content

What's hot

Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
DataWorks Summit
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
DataWorks Summit/Hadoop Summit
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
DataWorks Summit
 
Data Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and KafkaData Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and Kafka
DataWorks Summit
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
DataWorks Summit
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
DataWorks Summit
 
Docker data science pipeline
Docker data science pipelineDocker data science pipeline
Docker data science pipeline
DataWorks Summit
 
Omid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache PhoenixOmid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache Phoenix
DataWorks Summit
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streaming
DataWorks Summit
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
DataWorks Summit/Hadoop Summit
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
DataWorks Summit/Hadoop Summit
 
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...
Databricks
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
DataWorks Summit/Hadoop Summit
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
DataWorks Summit
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
DataWorks Summit
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
DataWorks Summit
 
The Future of Apache Ambari
The Future of Apache AmbariThe Future of Apache Ambari
The Future of Apache Ambari
DataWorks Summit
 
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
DataWorks Summit
 
Schema Registry - Set Your Data Free
Schema Registry - Set Your Data FreeSchema Registry - Set Your Data Free
Schema Registry - Set Your Data Free
DataWorks Summit
 

What's hot (20)

Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Data Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and KafkaData Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and Kafka
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Docker data science pipeline
Docker data science pipelineDocker data science pipeline
Docker data science pipeline
 
Omid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache PhoenixOmid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache Phoenix
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streaming
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
 
The Future of Apache Ambari
The Future of Apache AmbariThe Future of Apache Ambari
The Future of Apache Ambari
 
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
 
Schema Registry - Set Your Data Free
Schema Registry - Set Your Data FreeSchema Registry - Set Your Data Free
Schema Registry - Set Your Data Free
 

Similar to Why Kubernetes as a container orchestrator is a right choice for running spark clusters on cloud?

Serverless spark
Serverless sparkServerless spark
Serverless spark
MamathaBusi
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
DataWorks Summit
 
Spark volume requirements 2018
Spark volume requirements 2018Spark volume requirements 2018
Spark volume requirements 2018
Rachit Arora
 
Secure Your Containers: What Network Admins Should Know When Moving Into Prod...
Secure Your Containers: What Network Admins Should Know When Moving Into Prod...Secure Your Containers: What Network Admins Should Know When Moving Into Prod...
Secure Your Containers: What Network Admins Should Know When Moving Into Prod...
Cynthia Thomas
 
Rami Sayar - Node microservices with Docker
Rami Sayar - Node microservices with DockerRami Sayar - Node microservices with Docker
Rami Sayar - Node microservices with Docker
Web à Québec
 
Docker and kubernetes
Docker and kubernetesDocker and kubernetes
Docker and kubernetes
Dongwon Kim
 
OpenStack and Windows
OpenStack and WindowsOpenStack and Windows
OpenStack and Windows
Alessandro Pilotti
 
Containers, Serverless and Functions in a nutshell
Containers, Serverless and Functions in a nutshellContainers, Serverless and Functions in a nutshell
Containers, Serverless and Functions in a nutshell
Eugene Fedorenko
 
stackconf 2020 | Replace your Docker based Containers with Cri-o Kata Contain...
stackconf 2020 | Replace your Docker based Containers with Cri-o Kata Contain...stackconf 2020 | Replace your Docker based Containers with Cri-o Kata Contain...
stackconf 2020 | Replace your Docker based Containers with Cri-o Kata Contain...
NETWAYS
 
Containers docker-docker hub-azureacr-azure aci
Containers docker-docker hub-azureacr-azure aciContainers docker-docker hub-azureacr-azure aci
Containers docker-docker hub-azureacr-azure aci
Rajesh Kolla
 
Oscon 2017: Build your own container-based system with the Moby project
Oscon 2017: Build your own container-based system with the Moby projectOscon 2017: Build your own container-based system with the Moby project
Oscon 2017: Build your own container-based system with the Moby project
Patrick Chanezon
 
Best Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBest Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker Containers
BlueData, Inc.
 
Hacking apache cloud stack
Hacking apache cloud stackHacking apache cloud stack
Hacking apache cloud stackNitin Mehta
 
Docker - Portable Deployment
Docker - Portable DeploymentDocker - Portable Deployment
Docker - Portable Deploymentjavaonfly
 
Containers 101
Containers 101Containers 101
Containers 101
Black Duck by Synopsys
 
Write Once and REALLY Run Anywhere | OpenStack Summit HK 2013
Write Once and REALLY Run Anywhere | OpenStack Summit HK 2013Write Once and REALLY Run Anywhere | OpenStack Summit HK 2013
Write Once and REALLY Run Anywhere | OpenStack Summit HK 2013dotCloud
 
Cloud orchestration major tools comparision
Cloud orchestration major tools comparisionCloud orchestration major tools comparision
Cloud orchestration major tools comparision
Ravi Kiran
 
State of the Container Ecosystem
State of the Container EcosystemState of the Container Ecosystem
State of the Container Ecosystem
Vinay Rao
 

Similar to Why Kubernetes as a container orchestrator is a right choice for running spark clusters on cloud? (20)

Serverless spark
Serverless sparkServerless spark
Serverless spark
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
 
Spark volume requirements 2018
Spark volume requirements 2018Spark volume requirements 2018
Spark volume requirements 2018
 
Secure Your Containers: What Network Admins Should Know When Moving Into Prod...
Secure Your Containers: What Network Admins Should Know When Moving Into Prod...Secure Your Containers: What Network Admins Should Know When Moving Into Prod...
Secure Your Containers: What Network Admins Should Know When Moving Into Prod...
 
Rami Sayar - Node microservices with Docker
Rami Sayar - Node microservices with DockerRami Sayar - Node microservices with Docker
Rami Sayar - Node microservices with Docker
 
Docker and kubernetes
Docker and kubernetesDocker and kubernetes
Docker and kubernetes
 
OpenStack and Windows
OpenStack and WindowsOpenStack and Windows
OpenStack and Windows
 
Containers, Serverless and Functions in a nutshell
Containers, Serverless and Functions in a nutshellContainers, Serverless and Functions in a nutshell
Containers, Serverless and Functions in a nutshell
 
stackconf 2020 | Replace your Docker based Containers with Cri-o Kata Contain...
stackconf 2020 | Replace your Docker based Containers with Cri-o Kata Contain...stackconf 2020 | Replace your Docker based Containers with Cri-o Kata Contain...
stackconf 2020 | Replace your Docker based Containers with Cri-o Kata Contain...
 
Containers docker-docker hub-azureacr-azure aci
Containers docker-docker hub-azureacr-azure aciContainers docker-docker hub-azureacr-azure aci
Containers docker-docker hub-azureacr-azure aci
 
Oscon 2017: Build your own container-based system with the Moby project
Oscon 2017: Build your own container-based system with the Moby projectOscon 2017: Build your own container-based system with the Moby project
Oscon 2017: Build your own container-based system with the Moby project
 
Best Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBest Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker Containers
 
Hacking apache cloud stack
Hacking apache cloud stackHacking apache cloud stack
Hacking apache cloud stack
 
OpenStack Summit
OpenStack SummitOpenStack Summit
OpenStack Summit
 
Docker - Portable Deployment
Docker - Portable DeploymentDocker - Portable Deployment
Docker - Portable Deployment
 
Docker-Intro
Docker-IntroDocker-Intro
Docker-Intro
 
Containers 101
Containers 101Containers 101
Containers 101
 
Write Once and REALLY Run Anywhere | OpenStack Summit HK 2013
Write Once and REALLY Run Anywhere | OpenStack Summit HK 2013Write Once and REALLY Run Anywhere | OpenStack Summit HK 2013
Write Once and REALLY Run Anywhere | OpenStack Summit HK 2013
 
Cloud orchestration major tools comparision
Cloud orchestration major tools comparisionCloud orchestration major tools comparision
Cloud orchestration major tools comparision
 
State of the Container Ecosystem
State of the Container EcosystemState of the Container Ecosystem
State of the Container Ecosystem
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 

Recently uploaded (20)

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 

Why Kubernetes as a container orchestrator is a right choice for running spark clusters on cloud?

  • 1. Why Kubernetes as a container orchestrator is a right choice for running spark clusters on cloud? Rachit Arora rachitar@in.ibm.com IBM, India Software Labs
  • 2. Spark Unified, open source, parallel, data processing framework for Big Data Analytics Spark Core Engine Yarn Mesos Standalon e Scheduler Kubernete s Spark SQL Interactive Queries Spark Streaming Stream processing Spark MLlib Machine Learning GraphX Graph Computation
  • 3. Typical Bigdata Application Secure Catalog and Search Ingest & Store Prepare Analyze Visualize Date Engineer Date Scientist Application Developer
  • 4. Let look into role of Data Scientist • I want to run my analytics jobs • Social media analytics • Text analytics (Structure and Unstructured) • I want to run queries on demand • I want to run R scripts • I want to submit Spark jobs • I want to view History Server Logs of my application • I want to View Daemon logs • I want to write Notebooks
  • 5. Evolution of Spark Analytics On Prem Install • Acquire Hardware • Prepare Machine • Install Spark • Retry • Apply patches • security • Upgrades • Scale • High availability Virtualization • Prepare Vm Imaging Solution • Network Management • High Avilability • Patches • Scale Managed • Configure Cluster • Customize • Scale • Pay even if idle Serverless • Run analytics
  • 6. Spark Serverless Characteristics • No Servers to Provision • Scale with usage • Availability and fault tolerant • Never pay for idle Kernel History Server Notebook Server Browswer Data Scientist COS/Injestion Data Engineer
  • 7. Typical Hadoop/Spark Cluster - Setup • Get the suitable hardware • Prepare host machine • Setup various networks • Private • Public • Management • Fetch the binaries for the install • Prepare the blueprint/config file for the install • Start the install • Many a times install fails, debug and retry again.
  • 8. Earlier Experiments Option OS Provisioning Config Cluster Management / Updates 1 Bare metal Chef Chef 2 xCAT – Stateful(Create your own VMs) PostScripts xCAT - updateNode 3 xCAT – Sysclone(Image from current system) Not Needed xCAT - updateNode 4 Bare metal PostScripts xCAT - updateNode 5 Cloud Provider Specific Images Not Needed Manual/Scripts 6 Standard-ISO Image Anaconda -post scripts Manual/Scripts
  • 9. How do I build Serverless Spark • Option 1: Vanilla Containers – If I need to build with Kubernetes • Repeatable • Application Portability • Faster Development Cycle • Reduced dev-ops load • Improved Infrastructure Utilization
  • 10. Guiding Principles • Virtualization helps repeatability, lesser failures & speed • Maintenance • Performance(Equivalent to Bare metal) • Use open source from an active community • Cloud-agnostic
  • 12. Containers: Thinking as VMs ? (mistake)
  • 13. Docker in Hadoop Cluster on Cloud • Each cluster node is a virtual node powered by Docker. Each node of the cluster is a Docker container • Docker containers run on a bunch of bare metal hosts (Docker-hosts) • Each Hadoop cluster will have multiple nodes/Docker containers spanning multiple hosts • Docker • Container management - Custom • Multi host networking – Overlay Network • Registry – Private • Local Storage
  • 14. Typical Clusters L D A P K M S My SQL S S H Master Data 1 Data 2 Data 3 L D A P K M S My SQL S S H Master Data 1 Data 2 L D A P K M S My SQL S S H Master Data 1 Data 2 Data 3 Data 4 Data 5Cluster 1 Cluster 2 Cluster 3 Network Boundary
  • 15. Docker Images • Master node • Data Node • Edge Node • Auxiliary service images • Ldap • Mysql • Ambari server • KMS
  • 16. Multi host Docker networking • Weave based overlay network among nodes • One /26 private subnet per cluster (172.x.x.x) • Master node to have a public IP – ports-forwarding • Portable public IPs • Network speed (shared with other masters) • Edge node will be accessible using a public IP • User can SSH and run Hive, Hbase, Hadoop & Spark shells • Private network • High Speed • Secure
  • 17. Network Architecture A A A B bond1 bond0 eth-weave docker0 10 Gbps Softlayer private network 10 Gbps Softlayer public network B B B C eth-weave docker0 C C C C eth-weave docker0 bond1 bond0 bond1 bond0 docker port forwarding * docker ICC=false (no inter container communication over docker0 network) * All inter-container communication is through weave network * One weave’s private subnet per cluster (No communication across subnets) weave overlay network on Softlayer private network Host 1 Host 2 Host 3 Master node B BB Master node Edge node Data node
  • 19. Provisioning Infrastructure • Cluster Manager that provides REST API to create cluster • API Gateway application • Deployment agent • Home grown Container Orchestrator - Deployer scripts that actually do all the work • Prepare Directory structure • Prepare network • Start Containers with right options for • Volumes • Ports • IP • Hostname • network
  • 20. Phases involved 1. Acquire Hardware 2. Deploy Provisioning Infra 3. Add Hardware to Resource Pool 4. Prepare Host Machines 5. Orchestrate Cluster lifecycle 1. Create Cluster, Add Nodes, Remove Nodes, Delete Cluster
  • 21. How is a cluster created? Cluster Manager DBAPI Gateway 2. Manage resources 1. Create Cluster Deployer Agent Deployer Agent Deployer Agent 3. Create Cluster 4. PrepareNode 7. Install IOP 5. Get Node details 6. Get Blueprint
  • 22. How do I build Serverless Spark • Option 2 : Function as a service • Single Node Cluster – Or No Cluster at all • Spark local mode • all in one Image • Resource Limitations • Design Limitations
  • 23. How do I build Serverless Spark • Option 3 : Kubernetes * Slide from Kubernetes Scheduler Design & Discussion
  • 24. What Kubernetes Bring in? • Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. • It Manages Containers for me • It Manages High availability • It Provides me flexibility to choose resource I WANT and Persistence I want • Kubernetes – Lots of addon services: third-party logging, monitoring, and security tools • Reduced operational costs • Improved infrastructure utilization
  • 25. Kubernetes - Spark Cluster Manager Options Kubernetes Scheduler Standalone Yarn on Kubernetes ?
  • 26. Conclusion • Spark Serverless - need for Data Scientist • Kuberenetes enables • Spark Clusters in Background and Kernel Up and Running in seconds • High Availability • Auto scaling with Spark monitoring and Kubernetes deployment features • No Extra Money for idle time
  • 27. References • IBM Watson Studio https://datascience.ibm.com • IBM Watson https://www.ibm.com/analytics/us/en/watson-data-platform/tutorial/ • Analytics Engine https://www.ibm.com/cloud/analytics-engine • Apache Spark • Kubernetes Scheduler Design & Discussion • Kuberenetes Clusters on IBM Cloud Rachit Arora rachitar@in.ibm.com @rachit1arora

Editor's Notes

  1. Spark is an open source, scalable, massively parallel, in-memory execution engine for analytics applications. Think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster. Spark Core: The foundation of Spark that lot of libraires for scheduling and basic I/O Spark offers over 100s of high-level operators that make it easy to build parallel apps. Spark also includes prebuilt machine-learning algorithms and graph analysis algorithms that are especially written to execute in parallel and in memory. It also supports interactive SQL processing of queries and real-time streaming analytics. As a result, you can write analytics applications in programming languages such as Java, Python, R and Scala. You can run Spark using its standalone cluster mode, on Cloud, on Hadoop YARN, on Apache Mesos, or on Kubernetes. Access data in HDFS, Cassandra, HBase, Hive, Object Store, and any Hadoop data source.
  2. Prepare Even though you have the right data, it may not be in the right format or structure for analysis. That’s where data preparation comes in. Data engineers need to bring raw data into one interface from wherever it lives – on premises, in the cloud or on your desktop – where it can then be shaped, transformed, explored, and prepared for analysis. Data scientist: Primarily responsible for building predictive analytic models and building insights. He will analyze data that’s been cataloged and prepared by the data engineer using machine learning tools like Watson Machine Learning. He will build applications using Jupyter Notebooks, RStudio After the data scientist shares his Analytical outputs , Application developer can build APPs like a cognitive chatbot. As the chatbot engages with customers, it will continuously improve its knowledge and help uncover new insights.
  3. As a data scientist what I was required to do On Prem to Virtuliation as demand increased in my organization for the sevrice I decided to move to virtualized VM to handle many request on demand but there still pain was more Then I decided to try services being offereed on cloud like EMR and IBM Analytics Engine or Microsoft Insights etce but there I need to order cluster sand configure them to suit my work loads Keep them running even when I do not want to use them Cover what is takes to install a hadoop/spark cluster
  4. Wit Spalr and Hadoop  serveless is whole new game   “Function as aservice “  - History , Logs , Performance as if i am doing stuff on prem Lets take case of Data Scentient  Note book - kernel , log , History Server expectation from spark server less Data Engineer and Scientist sending request to Serverless Spark 
  5. Setup is hard 6.8 to 6.7 example Optimal seetings for work loads We do have such offering , its deployed in bluemix
  6. xCAT is Extreme Cluster/Cloud Administration Toolkit, xCAT offers complete management for clusters, Grids, Clouds, Datacenters, and many other things. It is agile, extensible, and based on years of system administration best practices and experience. It enables you to: Provision Operating Systems on physical or virtual machines: RHEL, CentOS, Fedora, SLES, Ubuntu, AIX, Windows, VMWare, KVM, PowerVM, PowerKVM, zVM. Provision using scripted install, stateless, statelite, iSCSI, or cloning
  7. What problems we faced - Managing Hardware as a service – complex Build solutions for Container orchestration Container Security and OS Patch PROBLEM : INSTALLATION AND CUSTOMIZATION Data Persistence
  8. - First we tried for imaging solution and when I dd docker run command . That was just magic
  9. “How do I backup a container?” How do I handle restart of host machine ? “What’s my patch management strategy for my running containers?” How do I network them ? Where is my data going to be ? How do I manage them ?
  10. 500 baremetal We spin conatiners we are not using orch as of now , will talk why Our own registry Local storage
  11. Monolthic application being breaked down . We plan to still do it further to service break downs
  12. Strech on benfits of separating out / like I can change ldap tp IPA , Mysql to other db , KMS verssion change . Separate metrics for these conatiners
  13. Stress on security Stress on extensible to adopt newer networking solutions Mentions it its not recommened to have ssh / Talk about ip forwarding and multiple networks we have and need for those networks in the cluster
  14. Explain here about the cpu allocation and the schduleings strategies and how we wanted to schedule them and reason behind it for the taking advanteage of the hadoop’s build in retundancy and replicatioon to have less failure chances T
  15. Talk here about orchestration needs , CPU SET and Local disks . how many of the current orchestrators not fitting in . Talk about the montioring aspects of the conatiners and logging aspects of the conatiners
  16. Resource manager , layout maker Deployer can be any of machines will work no master deployer here Wait for nodes to prepare Once up start the install which is config driven . Yum install hadoop ! Oh I have it , set up db , Oh I have it .
  17. AWS Lambda , IBM Apache OpenWisk , Microsoft functions. No Cluster just executors Not The experience I am used too. History Server Logs, Monitoring tools? Inability to communicate directly: Spark using DAG execution framework spawns jobs with multiple stages. For inter-stage communication, Spark requires data transfer across executors. Many Fiunction as a service does not allow communication between two Lambda functions. This poses a challenge for running executors in this environment. Extremely limited runtime resources: Many Function invocations are currently limited to a maximum execution duration of 5 minutes, 1536 MB memory and 512 MB disk space. Spark loves memory, can have a large disk footprint and can spawn long running tasks. This makes Functions a difficult environment to run Spark on.
  18. Introduction to Kubernetes Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It Manages Containers for me It Manages High avilabilty It Provides me flexibilty to choose resource I WANT and Persistence I want Kubernetes – Lots of addon services: third-party logging, monitoring, and security tools – Reduced operational costs – Improved infrastructure utilization
  19. No or very less latency
  20. IBM Watson brings together data management, data policies, data preparation, and analysis capabilities into a common framework. You can index, discover, control, and share data with Watson Knowledge Catalog, refine and prepare the data with Data Refinery, then organize resources to analyze the same data with Watson Studio. The IBM Watson apps are fully integrated to use the same user interface and framework. You can pick whichever apps and tools you need for your organization. Watson Studio (Watson Studio) provides you with the environment and tools to solve your business problems by collaboratively analyzing data What is Analytics Engine? You can use AE to Build and deploy clusters within minutes with simplified user experience, scalability, and reliability. You Custom configure the environment and Scale on demand.