
20180522 infra autoscaling_system


Infrastructure for auto-scaling distributed systems

Published in: Engineering

  1. TREASURE DATA: INFRASTRUCTURE FOR AUTO-SCALING DISTRIBUTED SYSTEMS. Auto scaling of a distributed processing engine is not common, so here is what we did. Kai Sasaki, Software Engineer at Treasure Data
  2. ABOUT ME • Kai Sasaki (@Lewuathe) • Software Engineer at Treasure Data since 2015, working in the Query Engine team (managing Hive and Presto in Treasure Data): PTD migration, Remerge on MapReduce, Presto auto-scaling system, improving the testing environment of the query engine • Open-source contributor to Hadoop, Hivemall, Spark, Presto, and TensorFlow
  3. TOPICS • Distributed Processing Engine: the general architecture of a distributed processing engine and the characteristics of the specific platforms used in Treasure Data, such as Hadoop and Presto. • Painful Scalable System: auto scaling of a distributed processing engine is not common; we describe the real pain points of scaling our distributed system. • Solution with Cloud: to achieve a highly scalable distributed system, we make use of existing cloud features such as auto scaling groups and a deployment management system.
  4. AGENDA • Distributed System in TD • Presto and Hive • Painful points in scaling out a distributed processing engine • What we’ve done • Decoupling the storage layer • Packaging and deployment on CodeDeploy • Capacity resizing with Auto Scaling Group • Real Auto Scaling • CPU metrics • Cost reduction
  6. HIVE AND PRESTO ON PLAZMADB (architecture diagram: Bulk Import, Fluentd, and the Mobile SDK feed PlazmaDB, backed by Amazon S3; Presto and Hive query it via SQL and CDP)
  7. HIVE IN TREASURE DATA • Multiple clusters with 50+ workers each • Hive 0.13 (latest is 3.0.0) • Stats as of May 2018: 4+ million queries / month, 257 trillion records / month, 3+ PB / month
  8. PRESTO IN TREASURE DATA • Multiple clusters with 40+ workers each • Presto 0.188 (latest is 0.201) • Stats as of May 2018: 14+ million queries / month, 1053 trillion records / month, 19.7+ PB / month
  9. DISTRIBUTED PROCESSING IN TD (diagram: a Coordinator/AppMaster dispatches work to Worker/Container nodes; scaling adds workers)
  10. WHY IS SCALING EFFECTIVE • A distributed processing engine splits a job into multiple task fragments that can run in parallel and distributes them to multiple worker nodes. Having more worker nodes lets us distribute smaller tasks to each worker. Example query: SELECT t1.user, t1.path, t2.address, t2.age FROM www_access t1 JOIN user_table t2 ON t1.user = t2.user WHERE t1.path = ‘/register’ (plan fragments: scan, scan, filter, join, output)
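The fragment-splitting idea above can be sketched in a few lines of Python (an illustrative model, not Treasure Data's engine): the scan work is divided into one fragment per worker, so adding workers shrinks the largest fragment any single node must process.

```python
def split_into_fragments(num_rows, num_workers):
    """Divide num_rows of scan work into one fragment per worker."""
    base, rem = divmod(num_rows, num_workers)
    fragments, start = [], 0
    for w in range(num_workers):
        # Spread the remainder across the first `rem` workers.
        size = base + (1 if w < rem else 0)
        fragments.append(range(start, start + size))
        start += size
    return fragments

# Doubling the workers halves the largest fragment each node must scan.
for workers in (4, 8):
    frags = split_into_fragments(1_000_000, workers)
    print(workers, max(len(f) for f in frags))  # 4 250000, then 8 125000
```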
  11. WHY IS SCALING EFFECTIVE • The JOIN operation is common in OLAP processing and generally consumes a lot of computing resources. It is well known that there are efficient parallel join algorithms that a distributed processing engine can fully leverage.
  12. WHY IS SCALING EFFECTIVE • Reading data from a backend storage (e.g. S3) is a high-latency operation that can consume a lot of memory and network bandwidth. Distributing table-scan fragments makes it possible to spread the network load evenly.
  13. PAIN POINTS • In theory, horizontal scaling of worker nodes in these systems gives us a way to achieve competitive performance easily, without losing much time or money. • But there are several painful points to overcome in daily releases: • Stateful datasources • Bootstrapping an instance takes time • Manually tracking the deployment process is unstable • The configuration override mechanism is complex (as is package versioning) • Deployment verification (smoke tests) • Graceful shutdown to achieve reliable query execution • Specifying the target of a deployment in a multi-cloud environment • Capacity estimation
  15. DECOUPLING STORAGE LAYER (diagram: PlazmaDB on Amazon S3) • Only intermediate/temporary data is stored in the processing engine, so we can discard any instance in the cluster at any time. Input/output classes are replaced with Plazma-specific ones.
  16. CODEDEPLOY + AUTO SCALING GROUP • We migrated our Presto infrastructure to AWS CodeDeploy and EC2 Auto Scaling Groups, which make it easy to do cluster-based provisioning and to scale cluster capacity out and in. (diagram: CodeDeploy deploys a package to the Auto Scaling group)
  17. CODEDEPLOY • CodeDeploy is an AWS service that automates the deployment flow, managing the steps of deploying a specific package as a whole. • We can define a group where the package should be deployed. • A CodeDeploy package is just a zip file including all the contents needed to run the application: • Configs • Binary package • Hook scripts • Versioning of packages, checking application health, and configuration rendering can all be done via CodeDeploy.
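"The package is just a zip file" can be made concrete with a small sketch. The `appspec.yml` at the archive root is CodeDeploy's standard entry point; the other file names and contents here are illustrative placeholders, not the actual package layout.

```python
import io
import zipfile

# Hypothetical contents of a CodeDeploy bundle: configs, a binary
# package, and hook scripts, zipped together with the appspec.
contents = {
    "appspec.yml": "version: 0.0\nos: linux\n",      # deployment steps (illustrative)
    "config/jvm.config": "-Xmx4G\n-Xms1G\n",         # placeholder config
    "bin/presto-server.tar.gz": "",                  # placeholder for the binary
    "hooks/validate_service.sh": "#!/bin/sh\nexit 0\n",
}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    for name, body in contents.items():
        z.writestr(name, body)

with zipfile.ZipFile(buf) as z:
    print(sorted(z.namelist()))
```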
  18. DEPLOYMENT TARGET • Specifying the target of a deployment is troublesome in a multi-cloud environment. • A deployment target is a definition that specifies the unique resource where the package should be deployed: • site • stage • service • cluster • For example, a Presto cluster resource can be specified as: • presto-coordinator-default-production-aws • presto-worker-default-production-aws • A deployment specifies: • the CodeDeploy package version to be deployed (a Git tag) • the deployment target
  19. DEPLOYMENT TARGET • Configuration has an override hierarchy based on the deployment target. • You can define default configuration values at each layer of the deployment target. • For example: • site=aws: -Xmx4G -Xms1G • stage=production: -Xmx192G -Xms160G • service=presto-worker: N/A • cluster=one: -Xmx127G • You can control configuration in a fine-grained manner without losing deployment flexibility.
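The override hierarchy above can be modeled as an ordered merge: each more specific layer (site, then stage, then service, then cluster) overrides the defaults beneath it. This is a minimal sketch; the key names are hypothetical, not TD's real configuration schema.

```python
# Layers in order of increasing specificity; later layers win.
LAYERS = ["site", "stage", "service", "cluster"]

def resolve_config(layered):
    """Merge per-layer settings, most specific layer last."""
    resolved = {}
    for layer in LAYERS:
        resolved.update(layered.get(layer, {}))
    return resolved

# Mirrors the slide's example: cluster "one" overrides the stage's -Xmx.
config = resolve_config({
    "site":    {"jvm_xmx": "4G",   "jvm_xms": "1G"},
    "stage":   {"jvm_xmx": "192G", "jvm_xms": "160G"},
    "cluster": {"jvm_xmx": "127G"},
})
print(config)  # {'jvm_xmx': '127G', 'jvm_xms': '160G'}
```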
  20. CODEDEPLOY + AUTO SCALING GROUP • Bootstrapping EC2 instances is delegated to the Auto Scaling Group, and application provisioning is done through AWS CodeDeploy. • The CodeDeploy package includes all application configurations. (diagram: ASG for workers, plus a master node)
  21. CODEDEPLOY + AUTO SCALING GROUP • If you increase the total capacity of the ASG, it automatically bootstraps EC2 instances and deploys the same CodeDeploy package that the other workers have. • Provisioning a cluster of 100 instances now takes far less time (from 1+ hour down to about 10 minutes). (diagram: ASG for workers, plus a master node)
  22. CODEDEPLOY + AUTO SCALING GROUP • We can scale in the cluster capacity in a similar manner. One thing we need to take care of is the shutdown process: to avoid failing ongoing queries, instances must be shut down gracefully. We use a lifecycle hook to tell the ASG when it may shut down an instance. (diagram: ASG for workers, master, lifecycle hook - complete)
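The graceful-shutdown flow can be sketched as a drain loop: stop taking new work, wait for in-flight queries to finish, then complete the lifecycle action so the ASG may terminate the instance. This is a simplified model; `running_queries` and `complete_lifecycle_action` are stand-ins for the real Presto and ASG lifecycle-hook calls.

```python
import time

def drain_and_shutdown(running_queries, complete_lifecycle_action,
                       poll_interval=0.0, timeout=60.0):
    """Wait until no queries remain (or timeout), then signal the ASG."""
    deadline = time.monotonic() + timeout
    while running_queries() > 0 and time.monotonic() < deadline:
        time.sleep(poll_interval)          # poll until the node is drained
    complete_lifecycle_action()            # tells the ASG it may terminate us
    return running_queries() == 0          # True if we drained cleanly

# Simulated drain: the query count drops to zero over three polls.
counts = iter([2, 1, 0, 0])
done = []
ok = drain_and_shutdown(lambda: next(counts), lambda: done.append(True))
print(ok, done)  # True [True]
```

Note that a long-running query can hold the loop until the timeout, which is exactly the scale-in blocking problem described on the later "FINALLY" slide.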
  23. DEPLOYMENT HOOK • CodeDeploy provides a way to run hook scripts at each stage of application deployment: • ApplicationStop • BeforeInstall • AfterInstall • ApplicationStart • ValidateService • By running a smoke test in the ValidateService hook, it is easy to validate that the application process is running properly before the node joins the cluster.
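A ValidateService-style smoke test can be as small as running one trivial query against the freshly deployed node and failing the deployment on any error or wrong result. This is a sketch with a hypothetical `run_query` callable, not TD's actual check.

```python
def smoke_test(run_query):
    """Return True only if the node answers a trivial query correctly."""
    try:
        # A deployment hook would exit non-zero when this returns False,
        # so an unhealthy node never joins the cluster.
        return run_query("SELECT 1") == [[1]]
    except Exception:
        return False

print(smoke_test(lambda q: [[1]]))  # True: healthy node passes
print(smoke_test(lambda q: []))     # False: wrong result fails the deploy
```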
  25. REAL “AUTO” SCALING… • But we still had to specify the desired capacity manually, based on our past history and experience. (40? 50? or more?)
  26. REAL AUTO SCALING • The auto scaling policy feature provides an easy way to scale the cluster automatically based on a specific metric: • Simple Scaling Policy • Step Scaling Policy • Target Tracking Scaling Policy • A target tracking scaling policy lets us adjust cluster capacity in a fine-grained manner: it calculates the necessary capacity from the gap between the current metric and the target value.
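The target-tracking calculation can be approximated by the standard proportional rule, a sketch of the idea rather than AWS's exact implementation: desired capacity scales with the ratio of the current metric to the target, clamped to the group's size bounds.

```python
import math

def desired_capacity(current_capacity, current_metric, target_metric,
                     min_size=1, max_size=100):
    """Proportional capacity estimate, clamped to the ASG's bounds."""
    raw = current_capacity * current_metric / target_metric
    return max(min_size, min(max_size, math.ceil(raw)))

# 40 workers at 60% average CPU with a 40% target: scale out to 60.
print(desired_capacity(40, 60.0, 40.0))  # 60
# 40 workers at 20% average CPU: scale in to 20.
print(desired_capacity(40, 20.0, 40.0))  # 20
```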
  27. TARGET TRACKING OF CPU • Workload simulation with 10–30 concurrent queries • Production-size cluster (around 40 worker instances)
  28. TARGET TRACKING OF CPU • Expected cost reduction as a function of the target average CPU usage.
  29. TARGET TRACKING OF CPU • Actual capacity transition with the target tracking policy • Target of 40% CPU usage, with scale-in enabled.
  30. FINALLY… • It did not work properly, because: • The scale-in transition is conservative compared to scale-out, so cluster capacity tends to stay high for a long time -> cost increase. • Graceful shutdown also hinders scale-in, because long-running queries can block instance termination.
  31. FUTURE WORK • Real auto scaling without the target tracking policy • Detect instances that can enter the shutdown process soon • Estimate capacity based on application-specific metrics • Fine-grained tests to ensure query result consistency • Automatic query migration in case of an outage
  33. TREASURE DATA