Automation of Hadoop cluster operations in Arm Treasure Data

Confidential © Arm 2017
Yan Wang
Arm Treasure Data
March 14, 2019
Who am I?
● Yan Wang (王岩)
● Since May 2018: Arm Treasure Data, Hadoop team, Software Engineer
● Contributes to Hadoop
● Likes Japanese Mahjong
● Blog https://tiana528.github.io/
LukaMe
Agenda
● Hadoop in Arm Treasure Data
● Hadoop Cluster Operation Automation
○ Reduce Hadoop cluster creation time significantly
○ Simplify Hadoop cluster recreation
○ Modernize instance type of slaves
○ Create patches to fast fail jobs consuming too much disk
○ Simplify incident handling
○ Make it easy to know when to scale out
○ Simplify shutting down nodes
○ Replace Chef with Debian packaging and CodeDeploy
● Future roadmap
● Summary
Arm Treasure Data Product
Customers don’t need to operate Hadoop clusters. We do.
Hadoop Usage
[Diagram labels: multi-clouds; Cluster; very multi-tenancy; permanent storage; HA; cluster structure (M, S, S, S); patched hadoop (PTD-2.7.3-xxx); operation tool (CDH, HDP, Self-developed)]
Operation point of view
● Recreate cluster on incident
● Self-developed operation tool is the key point for operation (improved in the past year)
Reduce Hadoop cluster creation time significantly
-- by making use of AWS Auto Scaling Group (9 months ago)
● Before
○ Environment setup, then create a cluster of 100 nodes by launching nodes one by one
○ Too slow
■ Client side: 1 hour
■ Cluster ready: 1 hour
● After
○ Environment setup, then create a cluster of 100 nodes by creating an AWS Auto Scaling Group
○ Much faster
■ Client side: 3 minutes
■ Cluster ready: 15 minutes
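The "After" flow amounts to asking AWS for the whole fleet in one request instead of launching nodes one by one. The following is a minimal sketch under stated assumptions, not the team's actual tool: the group name suffix, the launch template name, and the subnet IDs are hypothetical, and the boto3 call is left as a comment so the snippet runs without AWS credentials.

```python
def build_asg_request(cluster_name, node_count, subnet_ids, launch_template):
    """Build keyword arguments for the EC2 Auto Scaling CreateAutoScalingGroup API."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return {
        # Hypothetical naming convention for the worker group.
        "AutoScalingGroupName": f"{cluster_name}-workers",
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template,
            "Version": "$Latest",
        },
        # A fixed-size fleet: min, max, and desired are all the node count.
        "MinSize": node_count,
        "MaxSize": node_count,
        "DesiredCapacity": node_count,
        # The API expects subnet IDs as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
    }


params = build_asg_request(
    "ClusterB", 100, ["subnet-aaa", "subnet-bbb"], "hadoop-worker"
)
# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("autoscaling").create_auto_scaling_group(**params)
print(params["AutoScalingGroupName"], params["DesiredCapacity"])
```

Because the client only submits one request, the client-side time no longer grows with the number of nodes.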
General flow of how to recreate a Hadoop cluster
● No downtime: A/B switch
○ Start: the job server sends jobs to ClusterA
○ Create new cluster: ClusterB is created next to ClusterA
○ Switch traffic: the job server sends jobs to ClusterB
○ Shut down old cluster: ClusterA is removed and only ClusterB remains
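The three steps of the A/B switch can be sketched as a small function. This is an illustration only: the `routing` dictionary, its `"default"` key, and the `create`/`shutdown` callbacks are hypothetical stand-ins for the real job server and cluster tooling.

```python
def recreate_cluster(routing, old, new, create, shutdown):
    """Replace `old` with `new` without downtime and return the ordered step log."""
    if routing.get("default") != old:
        raise RuntimeError(f"traffic is not currently routed to {old}")
    steps = []
    create(new)                  # 1. create the new cluster next to the old one
    steps.append(f"create {new}")
    routing["default"] = new     # 2. switch job server traffic to the new cluster
    steps.append(f"switch {old} -> {new}")
    shutdown(old)                # 3. shut down the old cluster, now idle
    steps.append(f"shutdown {old}")
    return steps


live_clusters = {"ClusterA"}
routing = {"default": "ClusterA"}
steps = recreate_cluster(
    routing, "ClusterA", "ClusterB", live_clusters.add, live_clusters.remove
)
print(steps)
```

The old cluster is only shut down after traffic has moved, which is what keeps the switch free of downtime.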
Simplify Hadoop cluster recreation
-- by creating our wrapper script of SRE tool (7 months ago)
● Before: use the SRE team tool directly
service create -S aws -s development -c ClusterB ...
service delete -S aws -s development -c ClusterA ...
○ Issues
■ Too many parameters
■ Stressful to shut down
● After: use our wrapper script (= SRE tool + verification + config)
cluster create ClusterB
cluster delete ClusterA
○ Improved
■ 1 parameter
■ Stress-free shutdown
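The idea of "SRE tool + verification + config" can be sketched as a function that expands a one-parameter request into the full SRE command. This is a hedged sketch, not the actual wrapper: the `CONFIG` mapping and the "do not delete the active cluster" check are assumptions, and the parameters elided as "..." on the slide are not reconstructed.

```python
# Hypothetical per-environment config; only the flags shown on the slide appear.
CONFIG = {"-S": "aws", "-s": "development"}


def build_service_command(action, cluster, active_cluster, config=CONFIG):
    """Expand `cluster <action> <name>` into the underlying `service` command."""
    if action not in ("create", "delete"):
        raise ValueError(f"unsupported action: {action}")
    # Verification step (assumed): never delete the cluster that receives traffic.
    if action == "delete" and cluster == active_cluster:
        raise RuntimeError(f"{cluster} is still serving traffic")
    command = ["service", action]
    for flag, value in config.items():
        command += [flag, value]
    command += ["-c", cluster]
    return command


create_cmd = build_service_command("create", "ClusterB", active_cluster="ClusterA")
print(" ".join(create_cmd))
```

Keeping the fixed flags in config is what reduces the operator's input to a single parameter, and the verification step is what removes the stress from shutdown.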
Gained many benefits by changing the instance type of slaves (6 months ago)
● Before: c3.8xlarge (very old model)
● After: m5d.12xlarge (latest model)
● Improved
○ Larger per-container memory
○ Larger & faster local disk
○ Lower cost
○ ...
● But …
But… a new issue occurred
● New issue
○ Amazon does not have enough m5d instances for on-demand allocation
○ Insufficient instances to do an A/B switch in one availability zone when
recreating a cluster.
● Asked Amazon support for help
○ They suggested buying more reserved instances or using other instance
types in the meantime.
● Other approaches?
Handle the situation of insufficient instances in one AZ
-- by supporting cross AZ environment
● Cross AZ environment
○ [Diagram: job servers and workers in AZ_1 and AZ_2, connected through REST API, with clusters A, B, and C shown across the steps: create new cluster, switch traffic, shutdown old cluster]
● Key point: no large network traffic between AZs, which can be expensive.
Create patches to fast fail jobs consuming too much disk (4 months ago)
● Before
○ [Task timeline, 0h to 40h: failed, retried, failed, retried, failed, retried, failed; the job fails only after the last attempt]
○ Retry is meaningless
● After
○ [Task timeline, 0h to 40h: failed once; the job fails there]
○ Fail fast
● We created two patches
○ For local: MAPREDUCE-7022 Fast fail rogue jobs based on task scratch dir size
○ For HDFS: MAPREDUCE-7148 Fast fail jobs when exceeds dfs quota limitation
(Disk quota configured)
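The difference between the two timelines can be illustrated with a toy retry loop. This is not the code of MAPREDUCE-7022 or MAPREDUCE-7148 (those patches live inside Hadoop and are written in Java); it is a sketch of the idea, with the exception class, the byte limits, and the default of four attempts chosen for illustration.

```python
class DiskLimitExceeded(Exception):
    """Raised when a task uses more scratch space than its limit allows."""


def check_scratch_usage(used_bytes, limit_bytes):
    """Fail the task attempt when its scratch directory grows past the limit."""
    if used_bytes > limit_bytes:
        raise DiskLimitExceeded(f"{used_bytes} bytes used, limit is {limit_bytes}")


def run_job(task, max_attempts=4):
    """Run `task` with retries; return (succeeded, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return True, attempt
        except DiskLimitExceeded:
            # Retrying cannot help: the task would consume the same disk again.
            return False, attempt
        except Exception:
            # Other failures may be transient, so the next attempt is allowed.
            continue
    return False, max_attempts


def rogue_task():
    check_scratch_usage(used_bytes=200, limit_bytes=100)


def flaky_task():
    raise IOError("transient failure")


print(run_job(rogue_task))  # fails on the first attempt
print(run_job(flaky_task))  # uses every attempt before failing
```

Treating the disk-limit failure as fatal is what moves the "job fail here" marker from the end of the timeline to the first failed attempt.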
Simplify incident handling by creating health check scripts (4 months ago)
● Before: runbook
○ Check A
○ Run command B
○ Check C
○ If … else…
○ Open URL ...
○ When an incident happens
■ Follow a complex runbook during the incident. Information has to be collected first.
● After: health check script
○ Installed on all nodes
○ Checks very detailed status
○ Datadog metrics trigger alerts
○ When an incident happens
■ Run the health check during the incident and see where the issue is.
● Future
○ Integrate with Auto Scaling Group health check.
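A health check script of this kind can be thought of as the runbook steps turned into functions that run in one pass. The sketch below is an assumption about the shape of such a script, not the team's implementation: the check names and their canned results are hypothetical.

```python
def run_health_checks(checks):
    """Run every check and return (healthy, list of failure messages).

    `checks` maps a check name to a function returning (ok, detail).
    """
    failures = []
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:
            # A crashing check is reported as a failure instead of aborting the run.
            ok, detail = False, f"check crashed: {exc}"
        if not ok:
            failures.append(f"{name}: {detail}")
    return not failures, failures


# Hypothetical checks with canned results, standing in for real node probes.
checks = {
    "datanode_process": lambda: (True, "running"),
    "local_disk": lambda: (False, "95% used"),
}
healthy, failures = run_health_checks(checks)
print(healthy, failures)
```

Running all checks and reporting every failure at once replaces the manual "collect info first" phase of the runbook.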
Easy to know when to scale out
-- by creating capacity metrics based on machine learning (ongoing, POC)
● Before
○ An alert comes; manually scale out if there is a performance issue
○ Issue
■ A little late…
■ Hard for junior engineers to understand
● After
○ Inputs: HDFS put/get latency, price plan & using slots, probe query, HDFS usage, CPU I/O wait
○ Linear regression produces capacity metrics
○ Expected improvement
■ Know when to scale out immediately and easily.
● Future plan: use it for auto scaling.
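As an illustration of how linear regression can turn raw measurements into a capacity signal, the sketch below fits a line between one load measure and one latency measure and solves for the load at which latency would cross a threshold. The slide does not describe the actual model; the single-feature fit, the sample data, and the 5-second threshold are all assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def load_at_threshold(slope, intercept, threshold):
    """Load at which the fitted metric is expected to reach `threshold`."""
    if slope <= 0:
        raise ValueError("metric does not grow with load")
    return (threshold - intercept) / slope


# Hypothetical samples: slots in use versus probe query latency in seconds.
slots_in_use = [10, 20, 30, 40]
probe_latency_s = [1.0, 2.0, 3.0, 4.0]

slope, intercept = fit_line(slots_in_use, probe_latency_s)
scale_out_at = load_at_threshold(slope, intercept, threshold=5.0)
print(round(scale_out_at, 3))
```

A single number such as "scale out at about this many slots" is easier to act on than reading several raw dashboards during an alert.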
Simplify shutting down slaves
-- by using Auto Scaling Group shutdown hook (ongoing)
● Before
○ Shut down 2 at a time, wait for block replication to finish, then shut down 2 more…
○ Issue
■ Boring operation
■ Potential job retry
● After
○ AWS Auto Scaling Group shutdown hook
○ Hadoop node decommission script
○ Expected improvement
■ Safe & fast
● Future plan: find a “proper” node to kill
○ e.g. short running tasks
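The manual "Before" procedure is a loop that is easy to automate, which is what a decommission script can do when the shutdown hook fires. The sketch below shows only the batching logic; the callbacks and the `FakeCluster` class are hypothetical stand-ins for real HDFS and node operations, and in real use `poll` would sleep between checks.

```python
def shutdown_in_batches(nodes, decommission, under_replicated_blocks,
                        batch_size=2, poll=lambda: None):
    """Decommission nodes a batch at a time, waiting for re-replication between batches."""
    batches = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            decommission(node)
        # Do not touch the next batch until HDFS has re-replicated all blocks.
        while under_replicated_blocks() > 0:
            poll()
        batches.append(batch)
    return batches


class FakeCluster:
    """Test double: every decommission leaves one block to re-replicate."""

    def __init__(self):
        self.pending = 0
        self.stopped = []

    def decommission(self, node):
        self.stopped.append(node)
        self.pending += 1

    def under_replicated_blocks(self):
        return self.pending

    def replicate_one(self):
        self.pending -= 1


cluster = FakeCluster()
batches = shutdown_in_batches(
    ["node1", "node2", "node3"],
    cluster.decommission,
    cluster.under_replicated_blocks,
    poll=cluster.replicate_one,
)
print(batches)
```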
Replace Chef with Debian packaging and CodeDeploy (ongoing)
● Before: we met many issues using Chef
○ Only Ruby
○ Unnecessarily complicated
○ Stateful
○ 15 override rules of attributes
○ Slow
○ Fails silently
○ Dependent on other team’s release cycle
○ Two-pass model
○ 5 years of adding little by little
○ ...
● After
○ Debian packaging
■ Standard way in Linux
○ AWS CodeDeploy
■ Fast and easy to maintain
■ Can be used in other clouds
○ Expected improvement
■ Much easier to maintain
■ Cluster creation: 15 minutes => 5 minutes
Agenda
● Hadoop in Arm Treasure Data
● Hadoop Cluster Operation Automation
● Future roadmap
○ API-based routing and workflow-based hadoop recreation
○ Usage history based account routing
● Summary
API-based routing and workflow-based Hadoop recreation
● Before: change routing manually
○ Submit git pull request, review, merge, upload databag, run chef-client on all nodes
○ [Diagram: job server routing between clusters A and B]
○ Issue
■ Very manual
■ Depends on manual validation
● After: API-based routing, change routing with a REST API call
curl -X PUT .../hadoop_routes -d '{"defauls":"ClusterB"}'
○ [Diagram: job server routing between clusters A and B]
○ Expected improvement
■ Fully automate Hadoop cluster recreation through workflow
■ Server-side validation
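The curl call above can also be issued from a workflow step in code. The sketch below only builds the request and does not send it, so it runs offline. The base URL is hypothetical (the slide elides it as "..."), the JSON content type header is an assumption, and the payload is copied verbatim from the slide.

```python
import json
import urllib.request


def build_route_request(base_url, payload):
    """Build a PUT request for the hadoop_routes endpoint without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/hadoop_routes",
        data=body,
        method="PUT",
        # Assumed header; the slide does not show which content type is used.
        headers={"Content-Type": "application/json"},
    )


# Hypothetical base URL; payload key and value are as written on the slide.
request = build_route_request(
    "https://job-server.example.com/api", {"defauls": "ClusterB"}
)
print(request.get_method(), request.full_url)
# To send it for real: urllib.request.urlopen(request)
```

Because the change is a single API call, the server can validate it, and a workflow engine can chain it after cluster creation.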
Usage history based account routing
● Before: fixed routing
○ The job server uses fixed routing, so one cluster is busy while another is idle; resources are not fully utilized
○ Big cluster of fixed size: easy to meet the insufficient-instance issue when creating a big cluster
● After: dynamic account routing
○ Dynamic routing sends more accounts to the idle cluster; resource utilization increases
○ Easy to split a cluster when instances are insufficient: smaller cluster1 and smaller cluster2 across AZ_1 and AZ_2
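One simple way to route accounts by usage history is a greedy assignment: take accounts from heaviest to lightest and place each on the cluster with the least load so far. The slides do not say which algorithm is planned, so this is an illustrative sketch; the account names and usage figures are made up.

```python
import heapq


def route_accounts(usage_by_account, clusters):
    """Greedily assign accounts to clusters, heaviest first, to balance load.

    Returns (routing, load) where routing maps account -> cluster.
    """
    # Min-heap of (current load, cluster name); ties resolve by cluster name.
    heap = [(0, name) for name in sorted(clusters)]
    heapq.heapify(heap)
    routing = {}
    ordered = sorted(usage_by_account.items(), key=lambda kv: (-kv[1], kv[0]))
    for account, usage in ordered:
        load, cluster = heapq.heappop(heap)
        routing[account] = cluster
        heapq.heappush(heap, (load + usage, cluster))
    return routing, {cluster: load for load, cluster in heap}


# Hypothetical usage history, e.g. slot-hours consumed last week.
usage = {"account_a": 50, "account_b": 30, "account_c": 20, "account_d": 10}
routing, load = route_accounts(usage, ["cluster1", "cluster2"])
print(routing)
print(load)
```

Recomputing the assignment from recent usage is what keeps one cluster from staying busy while another sits idle.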
Summary
● Common ideas
○ Use a modern, cloud-based approach
○ API-based operation
○ Start small: many small changes lead to a large impact
We are hiring
https://www.treasuredata.com/company/careers/jobs/positions/?job=f6fd040b-c843-4991-bd49-bc674aab9a9e&team=Engineering
Thank You!
Danke!
Merci!
谢谢!
ありがとう!
Gracias!
Kiitos!
Automation of Hadoop cluster operations in Arm Treasure Data

  • 1. Confidential © Arm 2017 Automation of Hadoop cluster operations in Arm Treasure Data Yan Wang Arm Treasure Data March 14, 2019
  • 2. Confidential © Arm 20172 Who am I? ● Yan Wang (王岩) ● May 2018 〜 Arm Treasure Data Hadoop team, Software Engineer ● Contributing hadoop ● Like Japanese Mahjong ● Blog https://tiana528.github.io/ LukaMe
  • 3. Confidential © Arm 20173 Agenda ● Hadoop in Arm Treasure Data ● Hadoop Cluster Operation Automation ○ Reduce hadoop cluster creation time significantly ○ Simplify hadoop cluster recreation ○ Modernize instance type of slaves ○ Create patches to fast fail jobs consuming too much disk ○ Simplify incident handling ○ Make it easy to know when to scale out ○ Simplify shutting down nodes ○ Replace chef by debian packaging and Codedeploy ● Future roadmap ● Summary
  • 4. Confidential © Arm 20174 Arm Treasure Data Product Customers don’t need to operate hadoop clusters. We do.
  • 5. Confidential © Arm 20175 Hadoop Usage multi-clouds Cluster very multi-tenancy permanent storage HA M S S S cluster structure patched hadoop PTD-2.7.3-xxx operation tool CDH HDP Self-developed Operation point of view ● Recreate cluster on incident ● Self-developed operation tool is key point for operation Improved in the past year
  • 6. Confidential © Arm 20176 Agenda ● Hadoop in Arm Treasure Data ● Hadoop Cluster Operation Automation ○ Reduce hadoop cluster creation time significantly ○ Simplify hadoop cluster recreation ○ Modernize instance type of slaves ○ Create patches to fast fail jobs consuming too much disk ○ Simplify incident handling ○ Make it easy to know when to scale out ○ Simplify shutting down nodes ○ Replace chef by debian packaging and Codedeploy ● Future roadmap ● Summary
  • 7. Confidential © Arm 20177 Reduce hadoop cluster creation time significantly -- by making use of AWS Auto Scaling Group ● Before Environment Setup Create cluster of 100 nodes launch nodes one by one ● Too slow ○ Client side ■ 1 hour ○ Cluster ready ■ 1 hour Environment Setup create AWS Auto Scaling Group ● Much faster ○ Client side ■ 3 minutes ○ Cluster ready ■ 15 minutes ● After Create cluster of 100 nodes 9 months ago
  • 8. Confidential © Arm 20178 Agenda ● Hadoop in Arm Treasure Data ● Hadoop Cluster Operation Automation ○ Reduce hadoop cluster creation time significantly ○ Simplify hadoop cluster recreation ○ Modernize instance type of slaves ○ Create patches to fast fail jobs consuming too much disk ○ Simplify incident handling ○ Make it easy to know when to scale out ○ Simplify shutting down nodes ○ Replace chef by debian packaging and Codedeploy ● Future roadmap ● Summary
  • 9. Confidential © Arm 20179 General flow of how to recreate a hadoop cluster ● No downtime : A/B switch ClusterA job server ClusterA job server ClusterB ClusterA job server ClusterB ClusterB job server create new cluster switch traffic shutdown old cluster
  • 10. Confidential © Arm 201710 Simplify hadoop cluster recreation -- by creating our wrapper script of SRE tool ● Issues ○ Too many parameters ○ Stressful to shutdown 7 months ago ● Before ● After service create -S aws -s development -c ClusterB ... service delete -S aws -s development -c ClusterA ... cluster create ClusterB cluster delete ClusterA ● Improved ○ 1 parameter ○ Stressless to shutdown Use SRE team tool directly Use our wrapper script = SRE tool + verification + config
  • 11. Confidential © Arm 201711 Agenda ● Hadoop in Arm Treasure Data ● Hadoop Cluster Operation Automation ○ Reduce hadoop cluster creation time significantly ○ Simplify hadoop cluster recreation ○ Modernize instance type of slaves ○ Create patches to fast fail jobs consuming too much disk ○ Simplify incident handling ○ Make it easy to know when to scale out ○ Simplify shutting down nodes ○ Replace chef by debian packaging and Codedeploy ● Future roadmap ● Summary
  • 12. Confidential © Arm 201712 Gained a lot of merits by changing instance type of slaves c3.8xlarge Very old model 6 months ago ● Before ● After m5d.12xlarge Latest model ● Improved ○ Larger per container memory ○ Larger & faster local disk ○ Lower cost ○ ... ● But …
  • 13. Confidential © Arm 201713 But… new issue occured ● New issue happened ○ Amazon don’t have so many m5d instances for on-demand allocation ○ Insufficient instances to do A/B switch in one availability zone when recreate a cluster. ● Ask Amazon support for help ○ They suggest us buying more reserved instances or use other instance types intermediately. ● Other approaches?
  • 14. Handle the situation of insufficient instances in one AZ -- by supporting cross AZ environment ● Cross AZ environment ● Key point: no large network traffic between AZs, which can be expensive ● Steps: create new cluster, switch traffic, shutdown old cluster ● [Diagram: job servers, a worker, and clusters A, B, C placed across AZ_1 and AZ_2, connected through a REST API, shown at each step]
  • 16. Create patches to fast fail jobs consuming too much disk (4 months ago) ● Before: the task failed and was retried again and again (failed, retried, failed, retried, failed, retried, failed), and the job failed only after the last retry; retry is meaningless (task timeline axis: 0h to 40h) ● After: fail fast; the job fails as soon as the task fails ● We created two patches ○ For local: MAPREDUCE-7022 Fast fail rogue jobs based on task scratch dir size ○ For HDFS: MAPREDUCE-7148 Fast fail jobs when exceeds dfs quota limitation (disk quota configured)
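To make the local-disk case concrete, here is a small Python sketch of the idea behind MAPREDUCE-7022: measure a task's scratch directory and treat exceeding a limit as a fatal, non-retryable condition. It is not the patch itself (which lives in Hadoop's Java code); the function names and the byte limit are hypothetical.

```python
import os
import tempfile


def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def should_fast_fail(scratch_dir, limit_bytes):
    """Return True when the task's scratch dir has outgrown the limit.

    A caller would fail the whole job here instead of retrying the task,
    because a retry would fill the disk in the same way.
    """
    return dir_size_bytes(scratch_dir) > limit_bytes


# Demo with a temporary directory standing in for a task scratch dir.
with tempfile.TemporaryDirectory() as scratch:
    with open(os.path.join(scratch, "spill.out"), "wb") as f:
        f.write(b"x" * 2048)
    print(should_fast_fail(scratch, limit_bytes=1024))
    print(should_fast_fail(scratch, limit_bytes=4096))
```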
  • 18. Simplify incident handling by creating health check scripts (4 months ago) ● Before: runbook (Check A, Run command B, Check C, If … else…, Open URL ...) ○ When an incident happens: follow a complex runbook during the incident; information needs to be collected first ● After: health check script ○ Installed on all nodes ○ Checks very detailed status ○ Datadog metrics trigger alerts ○ When an incident happens: run the health check during the incident and know where the issue is ● Future ○ Integrate with Auto Scaling Group health check
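A health check script of this kind can be pictured as a list of named checks that replaces the runbook's "Check A, Run command B" steps. The sketch below is an assumption about the general shape only; the check names, the 90% threshold, and the stubbed process check are illustrative and not taken from the real script.

```python
import shutil


def disk_usage_ratio(path):
    """Fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def run_checks(checks):
    """Run every (name, callable) pair and return the names that failed.

    A check fails when it returns a falsy value or raises an exception.
    """
    failed = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Hypothetical checks; a real script would inspect Hadoop daemons in detail.
checks = [
    ("root disk below 90% used", lambda: disk_usage_ratio("/") < 0.90),
    ("datanode process running", lambda: True),  # stub for illustration
]
print(run_checks(checks))
```

Returning the names of failed checks is what lets an operator see where the issue is without collecting information by hand first.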
  • 20. Easy to know when to scale out -- by creating capacity metrics based on machine learning (ongoing, POC) ● Before: an alert comes, then manually scale out if there is a performance issue ● Issue ○ A little late… ○ Hard for junior members to understand ● After: linear regression over the inputs below produces capacity metrics ○ HDFS put/get latency ○ Price plan & slots in use ○ Probe query ○ HDFS usage ○ CPU I/O wait ● Expected improvement ○ Know when to scale out immediately and easily ● Future plan: use it for auto scaling
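As a simplified illustration of using linear regression for a capacity signal, the sketch below fits a straight line to a single input (HDFS usage over time) and estimates how many days remain before a threshold is reached. The slide's model combines several inputs and its details are not given, so the single-feature trend, the function names, and the sample numbers here are all assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(threshold, days, usage):
    """Days after the last sample until the fitted line reaches `threshold`.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = fit_line(days, usage)
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - days[-1]


# Made-up samples: HDFS usage in percent, one sample per day.
days = [0, 1, 2, 3, 4]
usage = [50.0, 52.0, 54.0, 56.0, 58.0]
print(days_until(80.0, days, usage))
```

A metric like this answers "when do we need to scale out" as a single number, which is easier to act on than a late performance alert.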
  • 22. Simplify shutting down slaves -- by using Auto Scaling Group shutdown hook (ongoing) ● Before: shut down 2 nodes at a time, wait for block replication to finish, then shut down 2 more… ● Issue ○ Boring operation ○ Potential job retry ● After: AWS Auto Scaling Group shutdown hook with hadoop node decommission script ● Expected improvement ○ Safe & fast ● Future plan: find a “proper” node to kill ○ e.g. short running tasks
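The manual "2 at a time" procedure is essentially a batch loop, which is the part a decommission script can take over. The sketch below shows that loop with the Hadoop and AWS interactions passed in as plain callables; `drain_nodes`, its parameters, and the in-memory stand-ins are hypothetical and do not call any real Hadoop or AWS API.

```python
def batches(nodes, size):
    """Yield consecutive groups of at most `size` nodes."""
    for start in range(0, len(nodes), size):
        yield nodes[start:start + size]


def drain_nodes(nodes, decommission, replication_finished,
                batch_size=2, wait=lambda: None, max_polls=1000):
    """Decommission nodes batch by batch.

    Each batch must finish block replication before the next one starts,
    so data is never left under-replicated by the shutdown itself.
    """
    drained = []
    for batch in batches(nodes, batch_size):
        for node in batch:
            decommission(node)
        polls = 0
        while not all(replication_finished(node) for node in batch):
            polls += 1
            if polls > max_polls:
                raise TimeoutError(f"replication did not finish for {batch}")
            wait()  # e.g. sleep between polls in a real script
        drained.extend(batch)
    return drained


# In-memory stand-ins: a node counts as replicated once decommission started.
started = []
print(drain_nodes(["slave1", "slave2", "slave3"], started.append,
                  lambda node: node in started))
```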
  • 24. Replace Chef by debian packaging and Codedeploy (ongoing) ● Before: we met many issues using Chef ○ Only ruby ○ Unnecessarily complicated ○ Stateful ○ 15 override rules of attributes ○ Slow ○ Fails silently ○ Dependent on other team’s release cycle ○ Two-pass model ○ 5 years of adding little by little ○ ... ● After ○ Debian packaging: standard way in Linux ○ AWS Codedeploy: fast and easy to maintain; can be used in other clouds ● Expected improvement ○ Much easier to maintain ○ Cluster creation 15 minutes => 5 minutes
  • 25. Agenda ● Hadoop in Arm Treasure Data ● Hadoop Cluster Operation Automation ● Future roadmap ○ API-based routing and workflow-based hadoop recreation ○ Usage history based account routing ● Summary
  • 26. API-based routing and workflow-based hadoop recreation ● Before: to change routing, submit git pull request, review, merge, upload databag, run chef-client on all nodes ● Issue ○ Very manual ○ Depends on manual validation ● After: API-based routing; change routing with a REST API call ○ curl -X PUT .../hadoop_routes -d '{"defauls":"ClusterB"}' ● Expected improvement ○ Totally automate hadoop cluster recreation through workflow ○ Server side validation
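For a workflow step, the curl call above could be issued from code. The sketch below only builds the PUT request and does not send it; the base URL `http://job-server.example` and the `Content-Type` header are assumptions (the slide elides the host and shows no headers), and the payload key is copied verbatim from the slide.

```python
import json
import urllib.request


def build_route_request(base_url, payload):
    """Build (but do not send) a PUT request to the hadoop_routes endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/hadoop_routes",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="PUT",
    )


# Hypothetical host; payload key spelled exactly as on the slide.
request = build_route_request("http://job-server.example", {"defauls": "ClusterB"})
print(request.get_method(), request.full_url)
```

Because the change is a single API call, the server can validate it, which is what replaces the manual review and chef-client run.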
  • 28. Usage history based account routing ● Before: fixed routing, fixed size ○ The job server uses fixed routing, so one cluster is busy while another is idle; resources are not fully utilized ○ A big cluster easily meets the insufficient instance issue when it is created ● After: dynamic account routing ○ The job server routes more accounts to the idle cluster; resource utilization increases ○ Easy to split a cluster into smaller clusters (smaller cluster1, smaller cluster2, across AZ_1 and AZ_2) when instances are insufficient
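One simple way to picture dynamic account routing is a greedy assignment: take accounts from heaviest to lightest (by past usage) and send each one to the cluster with the least load so far. This is only an illustrative policy; the slides do not say how routing will actually be decided, and the function name, account names, and usage numbers are made up.

```python
def assign_accounts(usage_by_account, clusters):
    """Greedy routing: heaviest accounts first, each to the least-loaded cluster.

    Returns (routing, load), where routing maps account -> cluster and
    load maps cluster -> summed historical usage.
    """
    load = {cluster: 0.0 for cluster in clusters}
    routing = {}
    # Sort by usage descending, then by name so that ties are deterministic.
    ordered = sorted(usage_by_account.items(), key=lambda kv: (-kv[1], kv[0]))
    for account, usage in ordered:
        target = min(clusters, key=lambda cluster: load[cluster])
        routing[account] = target
        load[target] += usage
    return routing, load


# Made-up usage history (e.g. slot-hours per day).
history = {"acct1": 50, "acct2": 30, "acct3": 20, "acct4": 10}
routing, load = assign_accounts(history, ["cluster1", "cluster2"])
print(routing)
print(load)
```

The same function works for any number of clusters, which fits the idea of splitting one big cluster into several smaller ones.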
  • 30. Summary ● Common idea ○ Use modernized cloud-based approach ○ API-based operation ○ Start small: many small changes lead to a large impact
  • 31. We are hiring https://www.treasuredata.com/company/careers/jobs/positions/?job=f6fd040b-c843-4991-bd49-bc674aab9a9e&team=Engineering
  • 32. Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos!