Towards a Cloud Native Big Data Platform using MiCADO
Abdelkhalik Mosa, Tamas Kiss, Gabriele Pierantoni, James DesLauriers, Dimitrios Kagialis, and Gabor Terstyanszky
Research Centre for Parallel Computing, School of Computer Science and Engineering, University of Westminster, UK
ISPDC 2020
Introduction
• Big data requires building scalable distributed architectures for efficient storage and processing.
• The Hadoop Ecosystem is a software suite for handling the storage and processing requirements of big data.
Mrozek, Dariusz. "Foundations of the Hadoop Ecosystem." Scalable Big Data Analytics for Protein Bioinformatics. Springer, Cham, 2018.
Hadoop Deployment Challenges
• Deploying a big data cluster is time-consuming and challenging.
• Static Hadoop clusters can be either overprovisioned or underprovisioned.
• Managed Hadoop offerings (e.g., AWS EMR, HDInsight) suffer from vendor lock-in.
• Organisations are migrating 50% of their public cloud applications/data to private clouds due to security concerns, cost, performance, and control1.
• Automating infrastructure makes on-premises solutions more cost-effective than the public cloud for 85%-90% of workloads2.
• Cloudbreak automates Hortonworks Data Platform (HDP) deployment; however, it is not cloud native and only supports HDP.
1. https://www.crn.com/businesses-moving-from-public-cloud-due-to-security-says-idc-survey
2. https://www.crn.com/news/data-center/300100761/michael-dell-on-premises-solutions-are-more-cost-effective-than-public-cloud-for-85-to-90-percent-of-workloads.htm
Hadoop and Cloud Native Computing
• In cloud native computing:
• Applications are deployed as microservices.
• Every single microservice is containerized.
• Containers are dynamically orchestrated to optimize resource utilization and efficiently manage the entire application.
• Hadoop was not originally built for the cloud; it was designed to build big data clusters from commodity hardware.
• Hadoop might be a good fit for cloud native computing, as the Hadoop ecosystem consists of several disjoint tools that can be containerized.
Research Aim
• Building cloud native Hadoop-based big data clusters using MiCADO
MiCADO Overview
MiCADO is an application-level multi-cloud orchestration and auto-scaling framework.
https://project-cola.eu/
[Architecture figure: The MiCADO Master Node hosts the Submitter (translates the TOSCA Application Description Template, ADT), the Policy Keeper (enforces scaling), an ML-based Optimiser, the Prometheus monitoring system, a container orchestrator (Kubernetes or Swarm), and a cloud orchestrator (Occopus or Terraform). MiCADO Worker Nodes run the Docker container executor, with cAdvisor and Node Exporter for container/node monitoring.]
https://micado-scale.readthedocs.io/en/latest/deployment.html
MiCADO-based Big Data Platform
Building MiCADO-based Big Data Platform
1. Containerizing big data tools
2. Coupling or decoupling of compute and storage
3. Writing application description templates (ADTs)
4. Submitting the ADT to MiCADO
Containerizing Big Data Tools
• Hadoop is based on a set of disjoint services.
• These disjoint services make it easier to containerize various big data tools (e.g., HDFS, YARN, …).
[Diagram: Hadoop core services: HDFS (NameNode, DataNode) and YARN (ResourceManager, NodeManager), alongside Kafka and ZooKeeper.]
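Each Hadoop core service can run in its own container. A minimal sketch in Docker Compose form, where the image names, ports, and environment variables are illustrative assumptions, not taken from the presentation:

```yaml
# Hypothetical sketch: HDFS NameNode and DataNode as separate containers.
# Image names and environment variables are placeholders for illustration.
version: "3"
services:
  namenode:
    image: example/hadoop-namenode:3.2    # hypothetical image
    environment:
      - CLUSTER_NAME=micado-bigdata
    ports:
      - "9870:9870"                       # NameNode web UI
  datanode:
    image: example/hadoop-datanode:3.2    # hypothetical image
    environment:
      - FS_DEFAULT_FS=hdfs://namenode:9000
    depends_on:
      - namenode
```

Because the roles are separate containers, an orchestrator can later replicate only the DataNode/NodeManager side when scaling out.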
Coupling vs Decoupling
Compute and Storage
• Coupling
• Hadoop couples compute and storage to minimize data movement and overcome slow network access speeds.
• Scaling could waste resources (either compute or storage).
• Decoupling
• Decoupling contradicts the original Hadoop design, which aimed to bring compute to the data.
• Decoupling enables independent scaling of compute and storage for better utilization and cost savings.
• Decoupling may suffer from performance degradation.
https://conferences.oreilly.com/strata/strata-ny-2017/public/schedule/detail/63400
Creating ADT – Applications and Nodes
• GCE
• Azure
• Nova
• CloudSigma
• CloudBroker
• Oracle?
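For illustration, a worker node in an ADT is described as a TOSCA node template in YAML. The fragment below targets EC2; the type name follows the MiCADO documentation's naming convention, and every property value is a placeholder:

```yaml
# Hypothetical fragment of a MiCADO ADT (TOSCA in YAML). The node type and
# property values are illustrative; consult the MiCADO docs for the exact
# schema of your target cloud (GCE, Azure, Nova, CloudSigma, ...).
topology_template:
  node_templates:
    worker-node:
      type: tosca.nodes.MiCADO.EC2.Compute
      properties:
        region_name: eu-west-2
        image_id: ami-0example        # placeholder AMI
        instance_type: t2.small
```

Swapping the node type and properties is what makes the same ADT portable across the clouds listed above.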
MiCADO CPU-based Autoscaling
[Diagram: CPU-based autoscaling example. Minimum cluster state: three slave nodes, each running a DataNode and a NodeManager pod (datanode-0..2, nodeman-0..2). Maximum cluster state: five slave nodes (datanode-0..4, nodeman-0..4). Scaling rule: an alert fires when CPU > 60% for 1 minute, triggering the addition or removal of n nodes/pods.]
Complex autoscaling policies:
• Resource-based scaling
• Deadline-based scaling
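The threshold rule on this slide can be sketched as a simple decision function. This is illustrative only: MiCADO expresses such policies declaratively in the ADT, and the scale-in threshold below is an assumption, not stated on the slide.

```python
# Illustrative sketch (not MiCADO code) of the CPU-based scaling rule:
# alert when CPU > 60% for 1 minute, then add/remove n nodes/pods,
# bounded by the minimum and maximum cluster sizes from the example.

MIN_NODES = 3   # minimum slave nodes (slide example)
MAX_NODES = 5   # maximum slave nodes (slide example)

def scale_decision(cpu_samples, current_nodes,
                   high=60.0, low=20.0, step=1):
    """Return the new node count given one minute of CPU samples (%).

    Scale out if every sample in the window exceeds `high`; scale in if
    every sample is below `low` (an assumed scale-in threshold);
    otherwise hold the cluster size steady.
    """
    if all(s > high for s in cpu_samples):
        return min(current_nodes + step, MAX_NODES)
    if all(s < low for s in cpu_samples):
        return max(current_nodes - step, MIN_NODES)
    return current_nodes

print(scale_decision([75, 82, 68, 71], 3))  # sustained load -> 4
print(scale_decision([10, 12, 8, 15], 4))   # idle -> 3
print(scale_decision([55, 70, 40, 65], 3))  # mixed -> 3
```

A deadline-based policy would replace the CPU predicate with an estimate of remaining work versus time left, but the bounded add/remove structure stays the same.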
Creating ADT – Scaling Policies
Deployment and Orchestration using MiCADO
• Deploy MiCADO using Ansible1
• Download the ADT from the GitHub repository2
• Customise the ADT, such as:
• Changing the cluster size, using a different private or
public cloud
• Defining different scaling policies
• Submit the ADT to the MiCADO master.
• Cloud orchestrator (Occopus or Terraform) creates
VMs on the selected cloud.
• Container orchestrator (Kubernetes) deploys the big
data tools and exposes ports.
• MiCADO policy keeper applies automatic scaling
based on user-defined scaling policies.
1. https://micado-scale.readthedocs.io/en/latest/deployment.html
2. https://github.com/UoW-CPC/MiCADO-based-big-data-platform
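On the command line, the steps above might look roughly as follows. The repository layout, playbook name, and submission endpoint are assumptions for illustration; the MiCADO documentation gives the authoritative commands for each release.

```shell
# Hypothetical walk-through; host names, file names, and the REST endpoint
# are placeholders -- check the MiCADO docs for your release.
git clone https://github.com/micado-scale/ansible-micado.git
cd ansible-micado
# ... edit the inventory with the address/credentials of the master VM ...
ansible-playbook -i hosts.yml micado.yml     # deploy the MiCADO master

# Fetch and customise the big data ADT
git clone https://github.com/UoW-CPC/MiCADO-based-big-data-platform.git
# ... edit the ADT: cluster size, target cloud, scaling policies ...

# Submit the ADT to the MiCADO master (TOSCA Submitter REST API)
curl -k -F file=@adt.yaml https://<micado-master-ip>/toscasubmitter/v1.0/app/launch/
```

From here, Occopus or Terraform provisions the VMs, Kubernetes deploys the containers, and the Policy Keeper takes over scaling.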
Experimental Evaluation
• MiCADO master on AWS EC2 using:
• t2.medium instance (2 vCPUs, 2.3 GHz, and
4.0 GiB Memory).
• Two Experiments:
• Deployment time:
• Hadoop master and slave nodes are
running on:
t2.small instances (1 vCPU, 2.5 GHz, and 2.0 GiB Memory).
• Autoscaling using Teragen and Terasort:
• Hadoop master and slave nodes are
running on:
t2.xlarge (4 vCPUs, 2.3 GHz, 16 GiB Memory, and 2 TB storage).
Deployment
Time
Teragen
and
Terasort
Scaling by adding two nodes, one at a time
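As a sketch, the benchmark jobs can be launched with the standard Hadoop examples JAR. The row count and HDFS paths below are illustrative, not the values used in the experiments:

```shell
# Hypothetical benchmark run. TeraGen writes N 100-byte rows to HDFS and
# TeraSort sorts them; the sustained CPU load from these jobs is what
# triggers the CPU-based autoscaling rule.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen 10000000000 /benchmark/tera-in       # ~1 TB of input data
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  terasort /benchmark/tera-in /benchmark/tera-out
```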
Conclusion
• Building a portable, scalable, cloud native big data platform can overcome existing challenges in Hadoop deployments.
• Cost, security and control reasons motivate migrations
to private clouds.
• MiCADO automates the deployment and scalability of
containerized applications in a cloud agnostic
(portable) way.
• Building a cloud native Hadoop-based big data cluster:
• Containerizing big data tools.
• Coupling versus decoupling of compute and storage.
• Writing the ADT
• Submitting the ADT to MiCADO
• To build a Hadoop cluster on MiCADO:
• Deploy MiCADO, download the ADT, customize ADT, and
submit to MiCADO
Resources
• MiCADO GitHub:
• https://github.com/micado-scale
• MiCADO Docs:
• https://micado-scale.readthedocs.io/en/latest/
• MiCADO-based Big Data:
• https://github.com/UoW-CPC/MiCADO-based-big-data-platform
Questions
