● What is Kubernetes?
● Why run Spark on Kubernetes
● How Kubernetes compares with other cluster managers
● Hands-on with a few Spark jobs
Kubernetes
● Container orchestrator.
● Provisions containers on multiple nodes and abstracts the network across those nodes.
● Supports multiple namespaces for better project isolation.
● User role and privilege management.
● Supports different deployment models.
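The namespace-isolation and role-management bullets above can be sketched with `kubectl`; the namespace and service-account names (`spark-jobs`, `spark`) are illustrative assumptions, not fixed conventions:

```shell
# Create an isolated namespace for Spark workloads (name is an example)
kubectl create namespace spark-jobs

# Service account that driver pods could run as (hypothetical name)
kubectl create serviceaccount spark -n spark-jobs

# Grant that service account rights to manage resources in its own namespace
kubectl create rolebinding spark-edit \
  --clusterrole=edit \
  --serviceaccount=spark-jobs:spark \
  -n spark-jobs
```

Scoping the role binding to the namespace (rather than cluster-wide) is what gives each project its isolation boundary.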
Hadoop, HDFS and YARN
● HDFS provides the storage layer; the NameNode and DataNode services handle data storage.
● YARN, the resource negotiator, provides a compute framework over the HDFS nodes.
● MapReduce jobs run on the YARN framework.
● Best fit for batch processing and big-data storage.
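A minimal end-to-end run of the stack described above, assuming a working Hadoop installation with `HADOOP_HOME` set and the example jar that ships with Hadoop; the HDFS paths and input file are illustrative:

```shell
# Stage input data into HDFS (paths are examples)
hdfs dfs -mkdir -p /user/demo/input
hdfs dfs -put words.txt /user/demo/input/

# Submit the classic WordCount MapReduce job to YARN
yarn jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /user/demo/input /user/demo/output

# Inspect the reducer output
hdfs dfs -cat /user/demo/output/part-r-00000
```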
Apache Spark on YARN
● Using YARN, the Spark driver and executors are deployed into the Hadoop cluster.
● YARN provides more flexible resource management.
● Dynamic, on-demand executor allocation.
● Best fit if you already have a Hadoop cluster and want to run Spark jobs on it.
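Submitting a Spark job to YARN looks like the following sketch; the `SparkPi` class and example jar ship with Spark, while the executor counts and memory sizes are illustrative numbers:

```shell
# Run the bundled SparkPi example on YARN in cluster mode
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar 1000
```

In cluster mode the driver itself runs inside a YARN container, so the submitting machine can disconnect once the job is accepted.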
● SQL interface for big data, with a Spark-like architecture.
● Interfaces with HDFS, NoSQL stores, Hive, Kafka, etc., and provides a unified, standard SQL interface.
● Exposes JDBC and HTTP APIs.
● Best fit for quick data analysis using SQL commands.
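The bullets above do not name the engine, so as a generic illustration of the SQL-over-JDBC pattern they describe, here is a `beeline` session against a Thrift-style JDBC endpoint; the host, port, and `access_logs` table are hypothetical:

```shell
# Connect over JDBC and run an ad-hoc aggregation (table name is hypothetical)
beeline -u "jdbc:hive2://<host>:10000/default" \
  -e "SELECT page, COUNT(*) AS hits
      FROM access_logs
      GROUP BY page
      ORDER BY hits DESC
      LIMIT 10;"
```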
Apache Spark on Kubernetes
● Kubernetes is a widely used container orchestrator.
● Major deployments exist outside the big-data domain, serving many different needs.
● The project supports running big-data tools such as Spark and Hadoop on top of it.
● Run Spark jobs on an existing Kubernetes cluster.
● Offers a richer resource-management feature set than the other cluster managers.
● Best fit if you already have a Kubernetes cluster in your environment.
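A sketch of submitting a Spark job directly to a Kubernetes cluster: `spark-submit` talks to the API server, which schedules the driver and executor pods. The API server address, namespace, service account, container image, and jar path inside the image are placeholders to substitute for your environment:

```shell
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.namespace=<namespace> \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=<service-account> \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///<path-to-spark-examples-jar-inside-image>
```

The `local://` scheme tells Spark the jar already exists inside the container image, so nothing needs to be uploaded at submit time.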