Big data and Kubernetes

High-level overview of work in the Kubernetes ecosystem to enable big data applications and pipelines.

Published in: Technology

  1. 1. Data Processing and Kubernetes. Anirudh Ramanathan (Google Inc.)
  2. 2. Agenda • Basics of Kubernetes & Containers • Motivation • Apache Spark and HDFS on Kubernetes • Data Processing Ecosystem • Future Work
  3. 3. What is Kubernetes?
  4. 4. Kubernetes • Kubernetes is an open-source system
  5. 5. Kubernetes • Kubernetes is an open-source system for automating deployment, scaling, and management
  6. 6. Kubernetes • Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
  7. 7. ‘Containerized’
  8. 8. Containers • Repeatable Builds and Workflows • Application Portability • High Degree of Control over Software • Faster Development Cycle • Reduced dev-ops load • Improved Infrastructure Utilization [Diagram: several app and libs pairs on top of a single kernel]
  9. 9. Statistics • Based on Google's experience running containers in production for over 15 years • Large OSS Community - 1200+ contributors and 45k+ commits • Ecosystem and Partners - 100+ organizations involved • One of the top 100 projects overall on GitHub - 23k+ stars
  10. 10. Overview
  11. 11. At a Glance [Architecture diagram labels: users; CLI, API, UI; master: apiserver, scheduler, controllers, etcd; nodes: kubelet on each node]
  12. 12. Nodes and Pods • A pod is a set of co-located containers • Created by declarative specification • Each pod has a distinct IP address • Volumes are local or network-attached [Diagram: pods holding containers and volumes, with port 8080 exposed]
  13. 13. Controllers ● Drive current state -> desired state ● Act independently ● Recurring pattern in the system ● Examples: Deployment, DaemonSet, StatefulSet [Diagram: observe -> diff -> act loop]
  14. 14. Motivation
  15. 15. Why Kubernetes? • Resource sharing between batch, serving, and stateful workloads – Streamlined developer experience – Reduced operational costs – Improved infrastructure utilization • Kubernetes and the Container Ecosystem – Many add-on services: third-party logging, monitoring, and security tools – For example, the Istio project, announced May 24 by IBM, Google, and Lyft
  16. 16. Cluster Administration • Launch Jobs as a particular user into a specific namespace • RBAC and Namespace-level resource quotas • Audit logging for clusters • Several monitoring solutions to see node, cluster and pod-level statistics [Diagram labels: Namespaces; Resource Accounting; Logging; Monitoring; Resource Quota; Pluggable Authorization; Admission Control; RBAC]
  17. 17. Data Processing
  18. 18. Spark on Kubernetes • Beta recently announced at Spark Summit 2017 • Google, Haiwen, Hyperpilot, Intel, Palantir, Pepperdata, Red Hat, and growing • https://github.com/apache-spark-on-k8s/spark [Diagram labels: Spark Core; cluster managers: Kubernetes, Standalone, YARN, Mesos; libraries: GraphX, SparkSQL, MLlib, Streaming]
  19. 19. Spark on Kubernetes • Several ways of running Spark Jobs along with their dependencies on Kubernetes [Diagram labels: Kubernetes Integration; Container images with dependencies baked in; Files from GCS/S3/HDFS/HTTP; File Staging Server; Staged files and JARs]
  20. 20. Spark on Kubernetes [Diagram labels: Spark Core; Kubernetes Scheduler Backend; Kubernetes Cluster; new executors; remove executors; configuration] • Resource Requests • Authnz • Communication with K8s
  21. 21. State of Spark • Features: Spark Streaming, Spark Shell, Client Mode, Python/R support, Cluster Mode, Java/Scala Support, Dynamic Allocation, Local File Staging, High Availability, Spark SQL, GraphX, MLlib • Timeline: Nov 2016 Design; Dec 2016 Development Began; Mar 2017 Alpha Release; June 2017 Beta Release [Legend: supported but untested; not yet supported. Per-feature markers are not preserved in this transcript.]
  22. 22. HDFS on Kubernetes • Community-driven effort to get HDFS running well on Kubernetes • Uses a Helm chart to install onto a cluster • Identified and solved several problems around data locality when running Spark Jobs • https://github.com/apache-spark-on-k8s/kubernetes-HDFS
  23. 23. HDFS on Kubernetes [Diagram labels: node A, node B; Driver Pod, Executor Pod 1, Executor Pod 2; Namenode Pod, Datanode Pod 1, Datanode Pod 2; IP addresses 10.0.0.2, 196.0.0.5, 196.0.0.6, 10.0.0.3, 10.0.1.2] See: "HDFS on Kubernetes -- Lessons Learned" [Public], Kimoon Kim (PepperData)
  24. 24. State of HDFS • HDFS with basic data locality works! • Future Work – Remaining data locality issues -- rack locality, node preference, etc – Performance benchmarks and testing – Kerberos support – Namenode HA
  25. 25. Ecosystem
  26. 26. Data Pipelines are complicated! • Pipelines feature many other components. • All of the below must run well on K8s – Cassandra – Kafka – Zookeeper – Elasticsearch, Kibana, etc.
  27. 27. Cassandra, Kafka and Zookeeper • Cassandra: https://github.com/kubernetes/examples/tree/master/cassandra • Kafka: https://github.com/kubernetes/contrib/tree/master/statefulsets/kafka • Zookeeper: https://github.com/kubernetes/charts/tree/master/incubator/zookeeper • zetcd: https://github.com/coreos/zetcd • Elasticsearch Operator: https://github.com/upmc-enterprises/elasticsearch-operator
  28. 28. Future Work • Batch Scheduling and Resource Sharing – Priorities and Preemption • Storage – Local Storage Provisioning • Extensibility – Kubernetes CustomResources (formerly ThirdPartyResources) – UI and Dashboard Improvements • Cluster Federation and Multi-cloud deployments
  29. 29. Future Work • Get involved! https://github.com/kubernetes/community/tree/master/sig-big-data • SIG BigData weekly meeting open to all (10am PT on Wednesdays) via Zoom: http://zoom.us/my/sig.big.data
  30. 30. Questions/Discussion
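
A note on slide 13: the observe/diff/act pattern that controllers follow can be sketched in a few lines of Python. This is a toy illustration rather than Kubernetes code; `ClusterState`, `reconcile`, and the pod names are hypothetical and stand in for whatever state a real controller watches through the API server.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Toy stand-in for the state a controller watches (hypothetical)."""
    desired_replicas: int
    running_pods: list


def reconcile(state: ClusterState) -> list:
    """Run one observe -> diff -> act pass and return the actions taken."""
    actions = []
    # Observe: read the current state.
    current = len(state.running_pods)
    # Diff: compare it against the desired state.
    delta = state.desired_replicas - current
    # Act: create or delete pods to close the gap.
    if delta > 0:
        for i in range(delta):
            name = f"pod-{current + i}"
            state.running_pods.append(name)
            actions.append(("create", name))
    elif delta < 0:
        for _ in range(-delta):
            name = state.running_pods.pop()
            actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    state = ClusterState(desired_replicas=3, running_pods=["pod-0"])
    print(reconcile(state))  # first pass creates the missing pods
    print(reconcile(state))  # second pass finds nothing to do
```

Calling `reconcile` repeatedly is safe: once current and desired state agree, a pass takes no action, which is the property that lets each controller act independently.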

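A note on slides 22 to 24: one way to picture the data locality problem is that executors run in pods with their own IP addresses, while HDFS reports block locations by datanode address. The sketch below assumes (the slides do not state this) that datanodes are known by node IP and that a pod-to-node lookup is available. `locality_level` and `pod_to_node` are hypothetical names, and the pairing of slide 23's IP addresses with pods and nodes is a guess made purely for illustration.

```python
def locality_level(executor_pod_ip, block_hosts, pod_to_node):
    """Classify a task's locality for one executor (illustrative only).

    executor_pod_ip: IP address of the executor pod.
    block_hosts: set of addresses holding replicas of the HDFS block.
    pod_to_node: assumed lookup from pod IP to the IP of the hosting node.
    """
    # Resolve the pod IP to its node before comparing, since the block
    # locations are assumed to be node addresses rather than pod addresses.
    node_ip = pod_to_node.get(executor_pod_ip)
    if node_ip is not None and node_ip in block_hosts:
        return "NODE_LOCAL"
    return "ANY"


if __name__ == "__main__":
    # Assumed layout: each executor pod runs on a different node.
    pod_to_node = {"196.0.0.5": "10.0.0.2", "196.0.0.6": "10.0.0.3"}
    block_hosts = {"10.0.0.2"}

    # Comparing the pod IP directly never matches a node address.
    print("196.0.0.5" in block_hosts)
    # Resolving through the node recovers locality for the co-located pod.
    print(locality_level("196.0.0.5", block_hosts, pod_to_node))
    print(locality_level("196.0.0.6", block_hosts, pod_to_node))
```

Rack locality and node preference, listed as remaining work on slide 24, are outside what this sketch covers.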