
What's the Hadoop-la about Kubernetes?


There is increased interest in using Kubernetes, the open-source container orchestration system for modern, stateful Big Data analytics workloads. The promised land is a unified platform that can handle cloud native stateless and stateful Big Data applications. However, stateful, multi-service Big Data cluster orchestration brings unique challenges. This session will delve into the technical gaps and considerations for Big Data on Kubernetes.

Containers offer significant value to businesses, including increased developer agility and the ability to move applications between on-premises servers, cloud instances, and data centers. Organizations have embarked on this journey to containerization with an emphasis on stateless workloads. Stateless applications are usually microservices or containerized applications that don’t “store” data. Web services (such as front-end UIs and simple, content-centric experiences) are often great candidates for stateless deployment, since HTTP is stateless by nature: a stateless workload has no dependency on local container storage.
Stateful applications, on the other hand, are services that require backing storage, and keeping state is critical to running the service. Hadoop, Spark, and, to a lesser extent, NoSQL platforms such as Cassandra and MongoDB, as well as relational databases such as Postgres and MySQL, are great examples. They require some form of persistent storage that will survive service restarts...

Anant Chintamaneni, VP Products, BlueData
Nanda Vijaydev, Director of Solutions, BlueData



  1. 1. DataWorks Summit 2018, San Jose, CA: What’s the ‘Hadoop-la’ about Kubernetes?
  2. 2. Today’s Speakers: Anant Chintamaneni (@AnantCman), Vice President of Products, BlueData Software; Nanda Vijaydev (@NandaVijaydev), Sr. Director of Solutions, BlueData Software
  3. 3. Agenda • Market Dynamics (with containers) • What is Kubernetes – Why should you care? • Requirements for Stateful Hadoop Clusters • Key gaps in Kubernetes for running Hadoop • What will it take to go from here to there? • Q & A
  4. 4. The “Promised Land”: a single “container” platform for multiple application patterns – infra-agnostic workloads (stateless: web frontends, servers; stateful: databases, queues; daemons: log collection, monitoring; others?) targeting public cloud or on-prem infrastructure
  5. 5. And the winner is……..
  6. 6. Kubernetes (K8s) – Key Points | Open source “platform” for containerized workloads | Platform building blocks vs. turnkey platform | Top use case is stateless/microservices deployments | Evolving for stateful and other workloads
  7. 7. Kubernetes (K8s) – Key Concepts | Kubernetes: a platform for application patterns | Pod: a single instance of an application in Kubernetes | Controller: manages replicated pods for an application pattern
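The key concepts above map to short YAML manifests. A minimal sketch of a Pod follows; the name, labels, and image here are illustrative, not from the deck:

```yaml
# A minimal Pod: a single instance of an application in Kubernetes.
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod            # hypothetical name
  labels:
    app: demo               # labels let controllers and services find the pod
spec:
  containers:
  - name: web
    image: nginx:1.14       # any container image; illustrative only
    ports:
    - containerPort: 80
```

A Controller (e.g. a Deployment or StatefulSet) then manages replicated Pods from a template like this spec.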
  8. 8. Kubernetes (K8s) – Master/Worker
  9. 9. Kubernetes (K8s) – Pods
  10. 10. Kubernetes (K8s) – Controller
  11. 11. Kubernetes (K8s) – Service
  12. 12. Kubernetes (K8s) – Storage | Volume: ephemeral, tied to the lifecycle of a pod | Persistent Volume (PV): networked storage, independent of any pod | Persistent Volume Claim (PVC): a request for a specific amount of storage
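These storage concepts can also be sketched in YAML. A minimal PersistentVolumeClaim (the name and size are illustrative):

```yaml
# A PersistentVolumeClaim: a request for a given amount of storage.
# Kubernetes binds the claim to a matching PersistentVolume, which
# outlives any pod that mounts it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hadoop-data         # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce           # mountable read-write by a single node
  resources:
    requests:
      storage: 100Gi        # the requested amount
```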
  13. 13. Kubernetes (K8s) - Controller Patterns
  14. 14. Reality Check…. K8s challenges
  15. 15. Why Hadoop/Spark on Containers Infrastructure • Agility and elasticity • Standardized environments (dev, test, prod) • Portability (on-premises and cloud) • Higher resource utilization Applications • Fool-proof packaging (configs, libraries, driver versions, etc.) • Repeatable builds and orchestration • Faster app dev cycles
  16. 16. Not to be confused with…….. This is not about using containers to run Hadoop/Spark tasks on YARN
  17. 17. Hadoop in Docker Containers: this is about running Hadoop clusters in containers
  18. 18. Attributes of Hadoop Clusters • Not exactly monolithic applications, but close • Multiple, co-operating services with dynamic APIs – Service start-up / tear-down ordering requirements – Different sets of services running on different hosts (nodes) – Tricky service interdependencies impact scalability • Lots of configuration (aka state) – Host name, IP address, ports, etc. – Big meta-data: Hadoop and Spark service-specific configurations
  19. 19. Hadoop itself is clustered…. The Master Node runs the HDFS NameNode (NN), YARN ResourceManager (RM), and Hive Server 2; each Worker Node runs an HDFS DataNode (DN) and YARN NodeManager (NM) – together handling both data and metadata
  20. 20. And lots of services to keep in sync: Job History Server (JHS), JournalNode (JN), ZooKeeper (ZK), HttpFS (HFS), HBase Master (HM), HBase RegionServer (HRS), Hue, Oozie (OZ), Spark History Server (SHS), Ambari server, MySQL/Postgres (DB), Gateway (GW), Flume Agent (FA), Tez, Solr Server (SS), Hive on LLAP, Ranger (RA), Hive Server (HS), Hive Metastore Service (HSS)… ACK! There is seemingly no end to these services and versions
  21. 21. Managing and Configuring Hadoop • Use a Hadoop manager – Hortonworks: Ambari – Cloudera: Cloudera Manager – MapR: MapR Control System (MCS) • Follow a common deployment pattern • Ensures distro supportability
  22. 22. And we want multiple Hadoop clusters: multiple evaluation teams (Data Engineering, SQL Analytics, Machine Learning) evaluate different business use cases (e.g. ETL, machine learning), using different services (e.g. Hive, Pig, SparkR) and different distributions/versions (2.1, 2.2, 2.5, 2.6, 2.7) on shared ‘containerized’ infrastructure with petabyte-scale data/storage – multiple distributions, services, and tools on shared, cost-effective infrastructure
  23. 23. Requirements for success – Hadoop won’t change: resource management (YARN) + master services always running + Hadoop service dependencies & endpoints + state persistence (data + metadata)
  24. 24. Hadoop Clusters on Kubernetes – Challenges and Gaps • Existing, available controller patterns are insufficient • Hadoop service inter-communication via K8s Services (ClusterIP, NodePort, etc.) is not trivial • The persistent volume (PV) and persistent volume claim (PVC) approach needs to adapt to Hadoop’s requirements for state persistence
  25. 25. So is it possible to run Hadoop in all its glory on Kubernetes (K8s)?
  26. 26. It’s a journey
  27. 27. Started with a BlueData custom controller on K8s 12 months ago – we learned a lot!
  28. 28. Custom Controller – Architecture: a K8s cluster (API Server, Scheduler, Controller Manager) with pod networking (e.g. Calico); in the BlueData namespace (alongside the default namespace), a Custom Controller pod manages the HDP cluster pods – one pod running Ambari, NameNode (NN), and ResourceManager (RM), plus several DataNode/NodeManager (DN, NM) pods
  29. 29. • Launch statefulsets for defined roles • Configure and start services in the right sequence • Make the services available to end users – Network and port mapping • Secure the services with existing enterprise policies (e.g. LDAP / AD) • Maintain Big Data performance goals Our ‘Custom Controller’ Approach..
  30. 30. Launching HDP on K8s with Ambari: each role is a StatefulSet (4 StatefulSets for this cluster). Launch via the BlueData UI or API – Cluster metadata: manifest file – Node roles: StatefulSets – Node count: number of pods per role – Node services: list of services and ports
  31. 31. HDP cluster running on K8s with BlueData – a NodePort service is created per pod, covering all endpoints of that pod
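The per-pod NodePort pattern described here can be sketched as a Service that selects a single StatefulSet pod by name; all names and ports below are hypothetical, not BlueData’s actual generated objects:

```yaml
# One NodePort Service per pod: selects a single StatefulSet pod via the
# pod-name label that Kubernetes stamps on StatefulSet pods, and exposes
# its endpoints on a port of every cluster node.
apiVersion: v1
kind: Service
metadata:
  name: hdp-controller-0          # hypothetical: one service per pod
spec:
  type: NodePort
  selector:
    statefulset.kubernetes.io/pod-name: hdp-controller-0
  ports:
  - name: ambari
    port: 8080                    # Ambari UI, as an example endpoint
  - name: ssh
    port: 22
```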
  32. 32. StatefulSet definition – details • Persistent storage • Volume claim template • Preserve / (root) to enable restarts and migration • Both init & app containers mount the same “subPath” from the dynamic volume • The initContainer sets up /var, /opt, and /etc on the dynamically provisioned volume, which the app container then uses • Container access setup • Leverage the K8s postStart hook to set up authorized_keys & /etc/resolv.conf • Ease of use • Added the concept of a flavor definition for CPU, memory, storage, etc.
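A sketch of the StatefulSet pattern described above. All names, images, commands, and sizes are illustrative assumptions (the actual BlueData definitions are not in the deck); what it demonstrates is the structure: a volume claim template, init and app containers mounting the same subPaths of the dynamic volume, and a postStart hook for access setup:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdp-worker                      # hypothetical role name
spec:
  serviceName: hdp-worker
  replicas: 3
  selector:
    matchLabels:
      role: hdp-worker
  template:
    metadata:
      labels:
        role: hdp-worker
    spec:
      initContainers:
      - name: init                      # seeds /var, /opt, /etc onto the volume
        image: bluedata/hdp-init:example   # hypothetical image
        command: ["/bin/sh", "-c", "cp -a /var/. /v && cp -a /opt/. /o && cp -a /etc/. /e"]
        volumeMounts:
        - { name: state, mountPath: /v, subPath: var }
        - { name: state, mountPath: /o, subPath: opt }
        - { name: state, mountPath: /e, subPath: etc }
      containers:
      - name: hdp
        image: bluedata/hdp26:example   # hypothetical image
        lifecycle:
          postStart:                    # container access setup
            exec:
              command: ["/bin/sh", "-c", "mkdir -p /root/.ssh  # then set up authorized_keys, resolv.conf"]
        volumeMounts:                   # same subPaths as the init container
        - { name: state, mountPath: /var, subPath: var }
        - { name: state, mountPath: /opt, subPath: opt }
        - { name: state, mountPath: /etc, subPath: etc }
  volumeClaimTemplates:                 # one dynamically provisioned PV per pod
  - metadata:
      name: state
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 50Gi
```

Because each pod gets its own claim from the template, a restarted or migrated pod reattaches to the same volume, which is what preserves the configured state across restarts.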
  33. 33. Key Gaps (Custom Controller) Functional gaps • Authentication and authorization were done by the controller • Limited to a single namespace and lacked mapping to K8s multi-tenancy Usability gaps • Inability to use native kubectl commands for all operations • Unable to use Helm charts and other community projects
  34. 34. So what’s next to make it more K8s-native and address the gaps?
  35. 35. Available Approaches…. • Use kubectl commands for simple deployment • Use Helm charts for dependency management • Use Operators for managing complex actions during and after deployment Operator = Custom Resource Definition (CRD) + Custom Controller
  36. 36. Creating a Hadoop “Custom Operator”: register the Hadoop CRD with the API Server (backed by etcd, alongside the Scheduler and Controller); create a Hadoop cluster (kubectl create hadoopcluster); the Custom Hadoop Controller runs an observe / assess / act loop. The Hadoop Operator: 1. Creates StatefulSets 2. Configures services 3. Maps ports 4. Scales up / scales down 5. Migrates to ensure FT
  37. 37. Custom Operator – CRD • Native extension to standard K8s APIs • Uses same authentication, authorization, and audit logging • Use kubectl commands to operate on CRD object (e.g. create hadoopcluster) • API request object will be stored in “etcd”
  38. 38. Example – CRD Registration and Usage: a CustomResourceDefinition with version: v1alpha1, scope: Namespaced, and names plural: hadoopclusters, singular: hadoopcluster, kind: HadoopCluster • Create: kubectl create –f <CRD>.yaml • New REST API endpoints: /apis/…/v1alpha1/namespaces/*/hadoopclusters/...
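Reassembled as ordinary YAML, the CRD registration might look like this. The apiVersion is the CRD API that shipped with K8s 1.9; the API group is a placeholder, since the deck elides BlueData’s actual group name:

```yaml
apiVersion: apiextensions.k8s.io/v1beta1   # CRD API as of K8s 1.9
kind: CustomResourceDefinition
metadata:
  name: hadoopclusters.example.com         # must be <plural>.<group>
spec:
  group: example.com                       # placeholder; real group elided in the deck
  version: v1alpha1
  scope: Namespaced
  names:
    plural: hadoopclusters
    singular: hadoopcluster
    kind: HadoopCluster
```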
  39. 39. Example – New objects using CRD: a HadoopCluster object named my-new-hdp-cluster, with spec image: bluedata/hdp26:v0.0.1 and roles master (replicas: 1, resources …) and worker (replicas: 4, resources …) • Create: kubectl create –f <request>.yaml • Manage: kubectl get hadoopcluster
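The HadoopCluster object on this slide, reassembled as YAML. The API group is the same placeholder assumed for the CRD, and the resources stanzas are illustrative, since the deck elides them (“…”):

```yaml
apiVersion: example.com/v1alpha1    # placeholder group + version from the CRD
kind: HadoopCluster
metadata:
  name: my-new-hdp-cluster
spec:
  image: bluedata/hdp26:v0.0.1
  roles:
  - name: master
    replicas: 1
    resources:                      # illustrative; elided in the deck
      requests: { cpu: "4", memory: 16Gi }
  - name: worker
    replicas: 4
    resources:
      requests: { cpu: "2", memory: 8Gi }
```

Once the CRD is registered, such objects are created and listed with plain kubectl, exactly like built-in resource types.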
  40. 40. Custom Operator – Controller • Watches instances of objects with the type defined in the CRD • Example: create an HDP cluster with Hive and Oozie • Runs scripts and services to coordinate activities between a cluster’s pods • Example: start HDFS, start HiveServer2 • Modification and scaling logic can be applied using custom controller watch events • Example: expand and shrink a cluster • The same controller handles requests for multiple instances of the custom object • Example: create and monitor multiple HDP clusters
  41. 41. Review – Hadoop “Custom Operator”: register the Hadoop CRD with the API Server (backed by etcd, alongside the Scheduler and Controller); create a Hadoop cluster (kubectl create hadoopcluster); the Custom Hadoop Controller runs an observe / assess / act loop
  42. 42. Additional Configuration • Lightweight Directory Access Protocol (LDAP) service • Active Directory (AD) service • Domain Name System (DNS) • Kerberos Key Distribution Center (KDC) • Key Management Service (KMS)
  43. 43. Network and access to services • Networking – used Calico for our testing • Storage – persistent external storage (Gluster) • This approach allows us to run on any standard K8s installation (1.9 and higher)
  44. 44. Key Takeaways • Kubernetes is still best suited for stateless services • Complex stateful services like Hadoop require significant work • StatefulSets are a key enabler – necessary, but not sufficient • New innovations and K8s contributions are needed to run Big Data • BlueData will simplify onboarding of Hadoop products to K8s
  45. 45. Thank You For more information: Booth # S5