Data Processing and Kubernetes
Anirudh Ramanathan (Google Inc.)
Agenda
• Basics of Kubernetes & Containers
• Motivation
• Apache Spark and HDFS on Kubernetes
• Data Processing Ecosystem
• Future Work
What is Kubernetes?
Kubernetes
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
‘Containerized’
Containers
• Repeatable Builds and Workflows
• Application Portability
• High Degree of Control over Software
• Faster Development Cycle
• Reduced dev-ops load
• Improved Infrastructure Utilization
[Figure: several containers, each packaging its own app and libs, on top of one shared kernel]
Statistics
• Based on Google's experience running containers in production for over 15 years
• Large OSS Community - 1200+ contributors and 45k+ commits
• Ecosystem and Partners - 100+ organizations involved
• One of the top 100 projects overall on GitHub - 23k+ stars
Overview
At a Glance
[Figure: cluster architecture. Users reach the master through the CLI, API, and UI; the master runs the apiserver, scheduler, controllers, and etcd; each node runs a kubelet.]
Nodes and Pods
• A pod is a set of co-located containers
• Created by declarative specification
• Each pod has a distinct IP address
• Volumes are local or network-attached
[Figure: pods made up of containers and volumes, with containers listening on port 8080]
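A minimal sketch of what "created by declarative specification" can look like, written as a Python dictionary that serializes to a JSON manifest. The field names follow the core v1 Pod API; the pod name, image, volume name, and mount path are hypothetical placeholders, not values from the talk.

```python
import json

# Hypothetical pod: one container listening on 8080 plus a local scratch volume.
# "example/app:1.0", "web", "scratch", and "/data" are illustrative names only.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web"},
    "spec": {
        "containers": [
            {
                "name": "app",
                "image": "example/app:1.0",
                "ports": [{"containerPort": 8080}],
                "volumeMounts": [{"name": "scratch", "mountPath": "/data"}],
            }
        ],
        # emptyDir is a node-local volume that lives as long as the pod does.
        "volumes": [{"name": "scratch", "emptyDir": {}}],
    },
}


def container_ports(manifest):
    """Return every containerPort declared by the pod's containers."""
    return [
        port["containerPort"]
        for container in manifest["spec"]["containers"]
        for port in container.get("ports", [])
    ]


if __name__ == "__main__":
    # The JSON form is what would be submitted to the API server.
    print(json.dumps(pod_manifest, indent=2))
```

The manifest only states the desired pod; the cluster decides where and how to run it.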
Controllers
● Drive current state -> desired state
● Act independently
● Recurring pattern in the system
Examples:
● Deployment
● DaemonSet
● StatefulSet
[Figure: control loop cycling through observe, diff, act]
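The observe / diff / act pattern can be sketched as a toy reconcile loop. This is an in-memory illustration under simplifying assumptions, not how any real Kubernetes controller is implemented; the function names and the "pod-N" naming scheme are hypothetical.

```python
def reconcile(desired, running):
    """One pass: observe the running set, diff against desired, return actions."""
    diff = desired - len(running)
    if diff > 0:
        return [("create", None)] * diff
    if diff < 0:
        # Too many replicas: pick the surplus in a deterministic order.
        return [("delete", name) for name in sorted(running)[:-diff]]
    return []


def control_loop(desired, running, max_passes=10):
    """Apply reconcile() repeatedly until current state matches desired state."""
    running = set(running)
    next_id = 0
    for _ in range(max_passes):
        actions = reconcile(desired, running)
        if not actions:
            break  # converged: nothing left to do
        for verb, name in actions:
            if verb == "create":
                # Find an unused hypothetical name of the form "pod-N".
                while f"pod-{next_id}" in running:
                    next_id += 1
                running.add(f"pod-{next_id}")
            else:
                running.discard(name)
    return running
```

For example, `control_loop(3, {"pod-0"})` adds two replicas, and `control_loop(1, {"a", "b", "c"})` removes two. Each controller runs a loop like this independently for the object type it owns.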
Motivation
Why Kubernetes?
• Resource sharing between batch, serving and stateful workloads
– Streamlined developer experience
– Reduced operational costs
– Improved infrastructure utilization
• Kubernetes and the Container Ecosystem
– Lots of add-on services: third-party logging, monitoring, and security tools
– For example, the Istio project, announced on May 24, 2017 by IBM, Google and Lyft
Cluster Administration
[Figure: Namespaces, Resource Accounting, Logging, Monitoring, Resource Quota, Pluggable Authorization, Admission Control, RBAC]
• Launch Jobs as a particular user into a specific namespace
• RBAC and Namespace-level resource quotas
• Audit logging for clusters
• Several monitoring solutions to see node, cluster and pod-level statistics
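As an illustration of namespace-level resource quotas, the sketch below builds a ResourceQuota manifest as a Python dictionary and checks how many pods remain under it. The field names follow the core v1 ResourceQuota API; the namespace, quota name, and limits are hypothetical, and the helper is a simplification of what the cluster enforces at admission time.

```python
# Hypothetical quota for a namespace used by batch jobs.
resource_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "batch-quota", "namespace": "spark-jobs"},
    "spec": {
        "hard": {
            "pods": "20",               # max pods in the namespace
            "requests.cpu": "40",       # total CPU cores requested
            "requests.memory": "64Gi",  # total memory requested
        }
    },
}


def pods_remaining(quota, pods_in_use):
    """Return how many more pods the namespace may create under this quota."""
    hard_limit = int(quota["spec"]["hard"]["pods"])
    return max(hard_limit - pods_in_use, 0)


if __name__ == "__main__":
    print(pods_remaining(resource_quota, pods_in_use=5))
```

Jobs launched into the `spark-jobs` namespace in this sketch would share the same pod, CPU, and memory budget.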
Data Processing
Spark on Kubernetes
• Beta recently announced at Spark Summit 2017
• Google, Haiwen, Hyperpilot, Intel, Palantir, Pepperdata, Red Hat, and growing.
https://github.com/apache-spark-on-k8s/spark
[Figure: Spark stack. GraphX, SparkSQL, MLlib, and Streaming sit on Spark Core, which runs on a cluster manager: Kubernetes, Standalone, YARN, or Mesos.]
Spark on Kubernetes
Kubernetes Integration
Several ways of running Spark Jobs along with their dependencies on Kubernetes:
• Container images with dependencies baked in
• Files from GCS/S3/HDFS/HTTP
• File Staging Server: staged files and JARs
Spark on Kubernetes
[Figure: Spark Core talks to the Kubernetes Scheduler Backend, which takes configuration and asks the Kubernetes Cluster for new executors or to remove executors]
• Resource Requests
• Authnz
• Communication with K8s
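A rough sketch of the "new executors / remove executors" decision a scheduler backend has to make, together with the resource requests attached to each executor pod. This is not the project's actual implementation: the function names, label keys, image name, and default sizes are hypothetical; only the manifest field names come from the core v1 Pod API.

```python
def executor_pod(app_id, executor_id, cores, memory_mib,
                 image="example/spark-executor:latest"):
    """Build a pod manifest for one executor with explicit resource requests."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{app_id}-exec-{executor_id}",
            # Hypothetical labels used to find this application's executors.
            "labels": {"spark-app": app_id, "spark-role": "executor"},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "executor",
                "image": image,
                "resources": {
                    "requests": {"cpu": str(cores), "memory": f"{memory_mib}Mi"}
                },
            }],
        },
    }


def scale_executors(app_id, target, running_ids, cores=1, memory_mib=1024):
    """Return (pods_to_create, executor_ids_to_remove) to reach `target`."""
    running_ids = sorted(running_ids)
    if target >= len(running_ids):
        # Scale up: allocate fresh ids after the highest one in use.
        start = max(running_ids, default=0) + 1
        new_ids = range(start, start + target - len(running_ids))
        return [executor_pod(app_id, i, cores, memory_mib) for i in new_ids], []
    # Scale down: keep the first `target` executors, remove the rest.
    return [], running_ids[target:]
```

With one executor running and a target of three, `scale_executors("app", 3, [1])` yields two new pod manifests and nothing to remove.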
State of Spark
[Figure: feature status chart covering Spark Streaming, Spark Shell, Client Mode, Python/R support, Cluster Mode, Java/Scala Support, Dynamic Allocation, Local File Staging, High Availability, Spark SQL, GraphX, and MLlib; legend markers distinguish "supported but untested" and "not yet supported"]
Timeline:
• Nov 2016: Design
• Dec 2016: Development Began
• Mar 2017: Alpha Release
• June 2017: Beta Release
HDFS on Kubernetes
• Community-driven effort to get HDFS running well on Kubernetes
• Uses a helm chart to install onto a cluster
• Identified and solved several problems around data locality when running Spark Jobs
https://github.com/apache-spark-on-k8s/kubernetes-HDFS
HDFS on Kubernetes
[Figure: node A and node B hosting the Driver Pod, Executor Pod 1, Executor Pod 2, Namenode Pod, Datanode Pod 1, and Datanode Pod 2; addresses shown: 10.0.0.2, 10.0.0.3, 10.0.1.2, 196.0.0.5, 196.0.0.6]
Reference: HDFS on Kubernetes -- Lessons Learned [Public], Kimoon Kim (Pepperdata)
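One way a data locality mismatch can arise, and one way to resolve it, is sketched below. It assumes that datanodes are identified by node addresses while executors are identified by pod addresses, so a direct comparison never matches. The slides do not spell out the fix, and the pairing of addresses to pods and nodes here is an assumption made for illustration; the function and variable names are hypothetical.

```python
# Assumed placement: executor pod IP -> IP of the node hosting that pod.
executor_pod_to_node = {
    "10.0.0.2": "196.0.0.5",
    "10.0.1.2": "196.0.0.6",
}


def naive_locality(executor_ip, block_datanode_ips):
    """Compare the executor's own (pod) IP to datanode addresses directly."""
    return "NODE_LOCAL" if executor_ip in block_datanode_ips else "ANY"


def node_aware_locality(executor_ip, block_datanode_ips, pod_to_node):
    """Translate the executor's pod IP to its node IP before comparing."""
    node_ip = pod_to_node.get(executor_ip)
    return "NODE_LOCAL" if node_ip in block_datanode_ips else "ANY"


if __name__ == "__main__":
    block_locations = {"196.0.0.5"}  # datanode holding the block, by node IP
    print(naive_locality("10.0.0.2", block_locations))
    print(node_aware_locality("10.0.0.2", block_locations, executor_pod_to_node))
```

In this sketch the naive check reports `ANY` even though the executor and the block sit on the same node, while the node-aware check reports `NODE_LOCAL`.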
State of HDFS
• HDFS with basic data locality works!
• Future Work
– Remaining data locality issues -- rack locality, node preference, etc
– Performance benchmarks and testing
– Kerberos support
– Namenode HA
Ecosystem
Data Pipelines are complicated!
• Pipelines feature many other components.
• All of the below must run well on K8s
– Cassandra
– Kafka
– Zookeeper
– Elasticsearch, Kibana, etc
Cassandra, Kafka and Zookeeper
• Cassandra: https://github.com/kubernetes/examples/tree/master/cassandra
• Kafka: https://github.com/kubernetes/contrib/tree/master/statefulsets/kafka
• Zookeeper: https://github.com/kubernetes/charts/tree/master/incubator/zookeeper
• zetcd: https://github.com/coreos/zetcd
• Elasticsearch Operator: https://github.com/upmc-enterprises/elasticsearch-operator
Future Work
• Batch Scheduling and Resource Sharing
– Priorities and Preemption
• Storage
– Local Storage Provisioning
• Extensibility
– Kubernetes CustomResources (formerly ThirdPartyResources)
– UI and Dashboard Improvements
• Cluster Federation and Multi-cloud deployments
• Get involved! https://github.com/kubernetes/community/tree/master/sig-big-data
• SIG BigData weekly meeting open to all (10am PT on Wednesdays) via Zoom: http://zoom.us/my/sig.big.data
Questions/Discussion
