Infrastructure for auto scaling distributed system
1. Copyright 1995-2018 Arm Limited (or its affiliates). All rights reserved.
Big Data Conference in Vilnius 2018
Kai Sasaki
Bio
Kai Sasaki (佐々木 海)
• Senior Software Engineer at Arm Treasure Data since 2015
• Hadoop, Presto, Spark, TensorFlow.js, Apache Hivemall
• Books
– Available as paperback and ebook.
• Twitter
– @Lewuathe
Agenda
• Who is Treasure Data?
• What is distributed data analysis?
• What kinds of challenges do we have?
– Operational Cost
– Stability and Scalability
• Our Approach
– AWS CodeDeploy & Auto Scaling Group
– Query Simulation
– Graceful/Force Shutdown
Who is Treasure Data?
Treasure Data
Founded in December 2011 in Silicon Valley
• Mountain View, CA
• DMP, eCDP, IoT, Cloud
• We joined Arm in October 2018
Treasure Data
We provide an end-to-end integrated data analysis platform.
• Data Ingestion
– Mobile Device, Automotive, IoT
• Enterprise Customer Data Platform
• Service Integration
– BI tool (e.g. Tableau)
– Marketing tool
Treasure Data
Open Source Lover
• Fluentd
• Embulk
• Digdag
• Apache Hivemall
Enterprise Data Analysis
• Scalable processing
• Reliable platform
• Secure data protection
Arm Pelion Platform
Treasure Data is part of the Arm Pelion IoT Platform
• Flexibility in connectivity management
• Efficient data processing
Distributed Data Analysis
Distributed Data Analysis
A service component that enables us to process huge datasets
Scalability
• Easy to scale horizontally
• Flexible to business requirements
– Interface (e.g. SQL)
– Data Format
Throughput
• Impossible to scale with a single-node machine
• Business requirements for batch processing (e.g. daily batch)
Data Consistency
• Write-side operations are possible
– INSERT, DELETE, UPDATE
• Correct measurement is the key to data analysis
Distributed Processing Engines
Many open source projects are available for distributed processing
• Hadoop
• Presto
• Spark
• Kafka
Typical Architecture
Master-Worker Model
https://www.tutorialspoint.com/apache_presto/apache_presto_architecture.htm
Distributed Plan
select
t1.class,
t2.features,
count(1)
from iris t1
join iris t2
on t1.class = t2.class
group by 1, 2;
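One way an engine might distribute the query above is sketched here in Python. This is only an illustration, not the plan any particular engine produces: the function names, the three-step split (shuffle by join key, local join and partial count, final merge), and the row layout are all assumptions.

```python
from collections import Counter, defaultdict


def partition_by_key(rows, num_workers):
    """Shuffle step: route each row to a worker by hashing the join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row["class"]) % num_workers].append(row)
    return partitions


def local_join_and_count(rows):
    """Worker step: self-join on class, then count per (class, features)."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row["class"]].append(row)
    counts = Counter()
    for cls, group in by_class.items():
        for _t1 in group:          # left side of the self-join
            for t2 in group:       # right side of the self-join
                counts[(cls, t2["features"])] += 1
    return counts


def run_distributed(rows, num_workers=3):
    """Coordinator step: merge the partial counts from every worker."""
    total = Counter()
    for part in partition_by_key(rows, num_workers).values():
        total.update(local_join_and_count(part))
    return dict(total)
```

Because rows are partitioned on the join key, every worker can join and count its own partition without talking to the others; only the small partial counts travel back to the coordinator.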
Challenges for Distributed Data Analysis
Maintaining a distributed data analysis platform in the real world is not easy.
• Operation
– Deployment
– Log Investigation
– Monitoring
• Money
– Large Scale Cluster
– Network Cost
• Stability
– Capacity Sufficiency
Challenges for Distributed Data Analysis
Manual launch/termination?
Is the capacity estimation correct?
Which version is deployed?
What kind of metrics do we need to monitor?
How much does it cost?
MANUALLY
Our Approach
Practical solutions by taking full advantage of public cloud services
• AWS CodeDeploy
– Integration with Auto Scaling Group
• EC2 Auto Scaling Group
– Load test by Query Simulation
– Metric Based Capacity Estimation
– Graceful/Force Instance Termination
CodeDeploy
A service for automating deployment in AWS
• Easy to Integrate with Auto Scaling Group
• Available Everywhere
– Supporting On-Premise Instances
• Scalable for distributed system use cases
• https://docs.aws.amazon.com/codedeploy/index.html
Auto Scaling System
The system should scale automatically, without any manual operation
• Load test by Query Simulation
• Metric Based Capacity Estimation
• Graceful Termination & Force Termination
Query Simulation
Load tests should be based on real-world workloads.
• Get the query list from our customers' past query history
• Cluster the queries by query signature
• Construct the test data set and query list based on that list
• This lets us easily run load tests that reflect the production workload
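The steps above can be sketched as follows. This is a minimal illustration under assumptions: the slides do not say how queries are picked from each cluster, so the proportional sampling, the function name `build_load_test_queries`, and the `(signature, sql)` input shape are all hypothetical.

```python
import random
from collections import defaultdict


def build_load_test_queries(history, sample_size, seed=0):
    """Group past queries by signature, then sample in proportion to cluster size.

    history: iterable of (signature, sql) pairs taken from the query history.
    """
    clusters = defaultdict(list)
    for signature, sql in history:
        clusters[signature].append(sql)
    if not clusters:
        return []

    rng = random.Random(seed)  # fixed seed keeps the load test reproducible
    signatures = list(clusters)
    weights = [len(clusters[s]) for s in signatures]

    # Frequent query shapes in production stay frequent in the test list.
    picked = rng.choices(signatures, weights=weights, k=sample_size)
    return [rng.choice(clusters[s]) for s in picked]
```

Sampling by cluster size keeps the mix of query shapes close to production while letting the test list be much shorter than the full history.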
Query Signature
A query signature represents a query in a shortened format.
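The slides do not show how the signature is computed, so the following is only a hypothetical sketch of the idea: strip out the parts of a query that vary between runs (literals, whitespace, letter case) and hash what remains, so that queries with the same shape share one short signature. The function name and the normalization rules are assumptions, not the format Treasure Data uses.

```python
import hashlib
import re


def query_signature(sql: str) -> str:
    """Return a short signature shared by queries that differ only in literals."""
    normalized = sql.lower()
    normalized = re.sub(r"'[^']*'", "?", normalized)           # string literals
    normalized = re.sub(r"\b\d+(\.\d+)?\b", "?", normalized)   # numeric literals
    normalized = re.sub(r"\s+", " ", normalized).strip()       # whitespace
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]
```

With this sketch, `SELECT * FROM t WHERE id = 1` and `select * from t where id = 42` map to the same signature, while a query with a different structure maps to a different one.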
Query Simulation
[Figure: a Conductor instance (c5.9xlarge) 1. gets the raw query list and 2. constructs the test data and query list]
Metric Based Capacity Estimation
Designed to reach a target metric value by adjusting capacity
• Add or remove instances in proportion to the target metric value
• e.g. Target average CPU usage = 40%
• 40% is the threshold that balances cost and performance
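As a worked illustration of proportional adjustment, the sketch below scales the instance count by the ratio of the observed metric to the target. The exact rule used in production is not given in the slides, so the formula, the function name `desired_capacity`, and the `min_size`/`max_size` bounds are assumptions.

```python
import math


def desired_capacity(current_instances, current_metric, target_metric,
                     min_size=1, max_size=100):
    """Estimate how many instances bring the metric back to its target.

    Assumes the metric (e.g. average CPU usage in percent) is inversely
    proportional to the number of instances sharing the load.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = current_instances * current_metric / target_metric
    # Round up so the cluster errs on the side of spare capacity.
    return max(min_size, min(max_size, math.ceil(raw)))
```

For example, with a 40% CPU target, 10 instances running at 80% average CPU would be scaled out to 20, and 10 instances running at 20% would be scaled in to 5.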
Graceful Termination
Terminating instances gracefully
• Avoid degrading the user experience
• Lifecycle hook in auto scaling group
• Cron job to check running tasks
– Number of tasks in the worker
– Send completion to lifecycle hook
https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroupLifecycle.html
1. The instance moves to the Terminating:Wait state
2. The cron job triggers the transition to Terminating:Proceed
3. The instance is gracefully terminated
[Figure: the cron job sends the lifecycle hook completion; the ASG terminates the instance]
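The cron-side check can be sketched as below. It is a minimal sketch, not the production job: the function name and parameters are hypothetical, and `autoscaling_client` is assumed to be a boto3 Auto Scaling client (for example `boto3.client("autoscaling")`), whose `complete_lifecycle_action` call moves the instance out of Terminating:Wait. How the number of running tasks is obtained from the worker is left to the caller.

```python
def complete_if_drained(autoscaling_client, running_task_count,
                        hook_name, group_name, instance_id):
    """Run periodically from cron on an instance in Terminating:Wait.

    Returns True when the lifecycle action was completed, False when the
    worker still has running tasks and must keep waiting.
    """
    if running_task_count > 0:
        return False  # still busy: stay in Terminating:Wait

    # No tasks left: tell the Auto Scaling Group it may proceed.
    autoscaling_client.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return True
```

Passing the client in as an argument keeps the check easy to test with a stub and avoids hard-coding credentials or region in the cron script.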
Force Termination
Long-running tasks can block graceful termination
• Set a “timeout” limit
• Simulate “how long it takes to terminate gracefully”
[Figure: chart with axes Date and Time]
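A simple way to simulate the effect of a timeout is sketched here. The slides do not describe the simulation itself, so the function name, its inputs (the remaining run time of each task on the instance), and the returned pair are assumptions.

```python
def simulate_drain(remaining_seconds, timeout_seconds):
    """Return (seconds waited before termination, tasks force-terminated).

    remaining_seconds: remaining run time of each task on the instance.
    """
    longest = max(remaining_seconds, default=0)
    if longest <= timeout_seconds:
        # Every task finishes in time: the termination is fully graceful.
        return longest, 0
    # The timeout fires first: tasks still running at that point are killed.
    killed = sum(1 for r in remaining_seconds if r > timeout_seconds)
    return timeout_seconds, killed
```

Replaying this over historical task durations gives an estimate of how often a given timeout would turn a graceful termination into a forced one.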
Instance Termination
Balance customer experience against cost optimization.
Graceful Termination
Keeping queries running as long as possible satisfies customer expectations.
• Non-fault-tolerant systems such as Presto
• Distributed analysis workloads tend to be too long to retry
Force Termination
Cost optimization is one of the primary goals of auto scaling.
• Scaling out/in within about 10 minutes keeps capacity adjustment agile
• Force-terminating only queries that run longer than 10 minutes is acceptable
Recap
• Who is Treasure Data?
• What is distributed data analysis?
• What kinds of challenges do we have?
– Operational Cost
– Stability and Scalability
• Our Approach
– AWS CodeDeploy & Auto Scaling Group
– Query Simulation
– Graceful/Force Shutdown