Towards a Cloud Native Big Data Platform using MiCADO
Abdelkhalik Mosa, Tamas Kiss, Gabriele Pierantoni, James DesLauriers, Dimitrios Kagialis, and Gabor Terstyanszky
Research Centre for Parallel Computing, School of Computer Science and Engineering, University of Westminster, UK
ISPDC 2020
Introduction
• Big data requires building scalable distributed architectures for efficient storage and processing.
• The Hadoop Ecosystem is a software suite for handling the storage and processing requirements of big data.
Mrozek, Dariusz. "Foundations of the Hadoop Ecosystem." Scalable Big Data Analytics for Protein Bioinformatics. Springer, Cham, 2018.
Hadoop Deployment Challenges
• Deploying a big data cluster is time-consuming and challenging.
• Static Hadoop clusters can be either overprovisioned or underprovisioned.
• Managed Hadoop offerings (e.g., AWS EMR, HDInsight) suffer from vendor lock-in.
• Organisations are migrating 50% of their public cloud applications/data to private clouds due to security concerns, cost, performance, and control1.
• Automating infrastructure makes on-premises solutions more cost-effective than the public cloud for 85%-90% of workloads2.
• Cloudbreak automates Hortonworks Data Platform (HDP) deployment; however, it is not cloud native and only supports HDP.
1. https://www.crn.com/businesses-moving-from-public-cloud-due-to-security-says-idc-survey
2. https://www.crn.com/news/data-center/300100761/michael-dell-on-premises-solutions-are-more-cost-effective-than-public-cloud-for-85-to-90-percent-of-workloads.htm
Hadoop and Cloud Native Computing
• In cloud native computing:
• Applications are deployed as microservices.
• Every single microservice is containerized.
• Containers are dynamically orchestrated to optimize resource utilization and efficiently manage the entire application.
• Hadoop was not originally built for the cloud; it was designed to build big data clusters from commodity hardware.
• Hadoop might be a good fit for cloud native computing, as the Hadoop ecosystem consists of several disjoint tools that can be containerized.
Research Aim
• Building cloud native Hadoop-based big data clusters using MiCADO
MiCADO Overview
MiCADO is an application-level multi-cloud orchestration and auto-scaling framework.
https://project-cola.eu/
[Architecture figure: The MiCADO Master Node hosts the Submitter (translates the TOSCA Application Description Template, ADT), the Policy Keeper (enforces scaling), an ML-based Optimiser, the Prometheus monitoring system, a container orchestrator (Kubernetes or Swarm), and a cloud orchestrator (Occopus or Terraform). MiCADO Worker Nodes run the Docker container executor, with cAdvisor and Node Exporter for container/node monitoring.]
https://micado-scale.readthedocs.io/en/latest/deployment.html
MiCADO-based Big Data Platform
Building MiCADO-based Big Data Platform
1. Containerizing big data tools
2. Coupling or decoupling of compute and storage
3. Writing application description templates (ADTs)
4. Submitting the ADT to MiCADO
Containerizing Big Data Tools
• Hadoop is based on a set of disjoint services.
• These disjoint services make it easier to containerize various big data tools (e.g., HDFS, YARN, …).
[Diagram: Hadoop core services: HDFS (NameNode, DataNode) and YARN (ResourceManager, NodeManager), alongside Kafka and ZooKeeper.]
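Each Hadoop core service can run in its own container. A minimal sketch in Docker Compose form, where the image names, ports, and environment variables are illustrative assumptions, not taken from the presentation:

```yaml
# Hypothetical sketch: HDFS NameNode and DataNode as separate containers.
# Image names and environment variables are placeholders for illustration.
version: "3"
services:
  namenode:
    image: example/hadoop-namenode:3.2    # hypothetical image
    environment:
      - CLUSTER_NAME=micado-bigdata
    ports:
      - "9870:9870"                       # NameNode web UI
  datanode:
    image: example/hadoop-datanode:3.2    # hypothetical image
    environment:
      - FS_DEFAULT_FS=hdfs://namenode:9000
    depends_on:
      - namenode
```

Because the roles are separate containers, an orchestrator can later replicate only the DataNode/NodeManager side when scaling out.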
Coupling vs Decoupling
Compute and Storage
• Coupling
• Hadoop couples compute and storage to minimize data movement and overcome slow network access speeds.
• Scaling could waste resources (either compute or storage).
• Decoupling
• Decoupling contradicts the original Hadoop design, which aimed to bring compute to the data.
• Decoupling enables independent scaling of compute and storage for better utilization and cost savings.
• Decoupling may suffer from performance degradation.
https://conferences.oreilly.com/strata/strata-ny-2017/public/schedule/detail/63400
Creating ADT – Applications and Nodes
• GCE
• Azure
• Nova
• CloudSigma
• CloudBroker
• Oracle?
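For illustration, a worker node in an ADT is described as a TOSCA node template in YAML. The fragment below targets EC2; the type name follows the MiCADO documentation's naming convention, and every property value is a placeholder:

```yaml
# Hypothetical fragment of a MiCADO ADT (TOSCA in YAML). The node type and
# property values are illustrative; consult the MiCADO docs for the exact
# schema of your target cloud (GCE, Azure, Nova, CloudSigma, ...).
topology_template:
  node_templates:
    worker-node:
      type: tosca.nodes.MiCADO.EC2.Compute
      properties:
        region_name: eu-west-2
        image_id: ami-0example        # placeholder AMI
        instance_type: t2.small
```

Swapping the node type and properties is what makes the same ADT portable across the clouds listed above.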
MiCADO CPU-based Autoscaling
[Diagram: CPU-based autoscaling example. Minimum cluster state: three slave nodes, each running a DataNode and a NodeManager pod (datanode-0..2, nodeman-0..2). Maximum cluster state: five slave nodes (datanode-0..4, nodeman-0..4). Scaling rule: an alert fires when CPU > 60% for 1 minute, triggering the addition or removal of n nodes/pods.]
Complex autoscaling policies:
• Resource-based scaling
• Deadline-based scaling
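The threshold rule on this slide can be sketched as a simple decision function. This is illustrative only: MiCADO expresses such policies declaratively in the ADT, and the scale-in threshold below is an assumption, not stated on the slide.

```python
# Illustrative sketch (not MiCADO code) of the CPU-based scaling rule:
# alert when CPU > 60% for 1 minute, then add/remove n nodes/pods,
# bounded by the minimum and maximum cluster sizes from the example.

MIN_NODES = 3   # minimum slave nodes (slide example)
MAX_NODES = 5   # maximum slave nodes (slide example)

def scale_decision(cpu_samples, current_nodes,
                   high=60.0, low=20.0, step=1):
    """Return the new node count given one minute of CPU samples (%).

    Scale out if every sample in the window exceeds `high`; scale in if
    every sample is below `low` (an assumed scale-in threshold);
    otherwise hold the cluster size steady.
    """
    if all(s > high for s in cpu_samples):
        return min(current_nodes + step, MAX_NODES)
    if all(s < low for s in cpu_samples):
        return max(current_nodes - step, MIN_NODES)
    return current_nodes

print(scale_decision([75, 82, 68, 71], 3))  # sustained load -> 4
print(scale_decision([10, 12, 8, 15], 4))   # idle -> 3
print(scale_decision([55, 70, 40, 65], 3))  # mixed -> 3
```

A deadline-based policy would replace the CPU predicate with an estimate of remaining work versus time left, but the bounded add/remove structure stays the same.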
Creating ADT – Scaling Policies
Deployment and Orchestration using MiCADO
• Deploy MiCADO using Ansible1
• Download the ADT from the GitHub repository2
• Customise the ADT, such as:
• Changing the cluster size, using a different private or
public cloud
• Defining different scaling policies
• Submit the ADT to the MiCADO master.
• Cloud orchestrator (Occopus or Terraform) creates
VMs on the selected cloud.
• Container orchestrator (Kubernetes) deploys the big
data tools and exposes ports.
• MiCADO policy keeper applies automatic scaling
based on user-defined scaling policies.
1. https://micado-scale.readthedocs.io/en/latest/deployment.html
2. https://github.com/UoW-CPC/MiCADO-based-big-data-platform
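On the command line, the steps above might look roughly as follows. The repository layout, playbook name, and submission endpoint are assumptions for illustration; the MiCADO documentation gives the authoritative commands for each release.

```shell
# Hypothetical walk-through; host names, file names, and the REST endpoint
# are placeholders -- check the MiCADO docs for your release.
git clone https://github.com/micado-scale/ansible-micado.git
cd ansible-micado
# ... edit the inventory with the address/credentials of the master VM ...
ansible-playbook -i hosts.yml micado.yml     # deploy the MiCADO master

# Fetch and customise the big data ADT
git clone https://github.com/UoW-CPC/MiCADO-based-big-data-platform.git
# ... edit the ADT: cluster size, target cloud, scaling policies ...

# Submit the ADT to the MiCADO master (TOSCA Submitter REST API)
curl -k -F file=@adt.yaml https://<micado-master-ip>/toscasubmitter/v1.0/app/launch/
```

From here, Occopus or Terraform provisions the VMs, Kubernetes deploys the containers, and the Policy Keeper takes over scaling.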
Experimental Evaluation
• MiCADO master on AWS EC2 using:
• t2.medium instance (2 vCPUs, 2.3 GHz, and
4.0 GiB Memory).
• Two Experiments:
• Deployment time:
• Hadoop master and slave nodes are
running on:
t2.small instances (1 vCPU, 2.5 GHz, and 2.0 GiB Memory).
• Autoscaling using Teragen and Terasort:
• Hadoop master and slave nodes are
running on:
t2.xlarge (4 vCPUs, 2.3 GHz, 16 GiB Memory, and 2 TB storage).
Deployment
Time
Teragen
and
Terasort
Scaling by adding two nodes, one at a time
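As a sketch, the benchmark jobs can be launched with the standard Hadoop examples JAR. The row count and HDFS paths below are illustrative, not the values used in the experiments:

```shell
# Hypothetical benchmark run. TeraGen writes N 100-byte rows to HDFS and
# TeraSort sorts them; the sustained CPU load from these jobs is what
# triggers the CPU-based autoscaling rule.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen 10000000000 /benchmark/tera-in       # ~1 TB of input data
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  terasort /benchmark/tera-in /benchmark/tera-out
```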
Conclusion
• Building a portable, scalable, cloud native big data platform can overcome existing challenges in Hadoop deployments.
• Cost, security and control reasons motivate migrations
to private clouds.
• MiCADO automates the deployment and scalability of
containerized applications in a cloud agnostic
(portable) way.
• Building a cloud native Hadoop-based big data cluster:
• Containerizing big data tools.
• Coupling versus decoupling of compute and storage.
• Writing the ADT
• Submitting the ADT to MiCADO
• To build a Hadoop cluster on MiCADO:
• Deploy MiCADO, download the ADT, customize ADT, and
submit to MiCADO
Resources
• MiCADO GitHub:
• https://github.com/micado-scale
• MiCADO Docs:
• https://micado-scale.readthedocs.io/en/latest/
• MiCADO-based Big Data:
• https://github.com/UoW-CPC/MiCADO-based-big-data-platform
Questions
