Infrastructure for auto scaling distributed system

  1. Copyright 1995-2018 Arm Limited (or its affiliates). All rights reserved. Big Data Conference in Vilnius 2018 • Kai Sasaki • Infrastructure for Auto Scaling Distributed System
  2. Bio: Kai Sasaki (佐々木 海) • Senior Software Engineer at Arm Treasure Data since 2015 • Hadoop, Presto, Spark, TensorFlow.js, Apache Hivemall • Books – Available as paperback and ebook • Twitter – @Lewuathe
  3. Agenda • Who is Treasure Data? • What is distributed data analysis? • What kind of challenges do we have? – Operational Cost – Stability and Scalability • Our Approach – AWS CodeDeploy & Auto Scaling Group – Query Simulation – Graceful/Force Shutdown
  4. Who is Treasure Data?
  5. Treasure Data: Founded in Dec 2011 in Silicon Valley • Mountain View, CA • DMP, eCDP, IoT, Cloud • We joined Arm in Oct 2018
  6. Treasure Data: We provide an end-to-end integrated data analysis platform. • Data Ingestion – Mobile Device, Automotive, IoT • Enterprise Customer Data Platform • Service Integration – BI tool (e.g. Tableau) – Marketing tool
  7. Treasure Data: Open Source Lover • Fluentd • Embulk • Digdag • Apache Hivemall
  8. Enterprise Data Analysis • Scalable processing • Reliable platform • Secure data protection
  9. Arm Pelion Platform: Treasure Data is a part of the Arm Pelion IoT Platform • Flexibility in connectivity management • Efficient data processing
  10. Distributed Data Analysis
  11. Distributed Data Analysis: A service component that enables us to process huge datasets. Scalability: • Easy to scale horizontally • Flexible to the business requirement – Interface (e.g. SQL) – Data Format. Throughput: • Impossible to scale with a single-node machine • Business requirement for batch processing (e.g. daily batch). Data Consistency: • Write-side operations are possible – INSERT, DELETE, UPDATE • Correct measurement is the key for data analysis
  12. Distributed Processing Engines: A number of open source software projects are available for distributed processing • Hadoop • Presto • Spark • Kafka
  13. Typical Architecture: Master-Worker Model https://www.tutorialspoint.com/apache_presto/apache_presto_architecture.htm
  14. Distributed Plan: select t1.class, t2.features, count(1) from iris t1 join iris t2 on t1.class = t2.class group by 1, 2;
  15. Challenges
  16. Challenges for Distributed Data Analysis: Maintaining a distributed data analysis platform in the real world is not easy. • Operation – Deployment – Log Investigation – Monitoring • Money – Large Scale Cluster – Network Cost • Stability – Capacity Sufficiency
  17. Challenges for Distributed Data Analysis: Manual launch/termination? Is the capacity estimation correct? Which version is deployed? What kind of metrics do we need to monitor? How much does it cost?
  18. Challenges for Distributed Data Analysis: Manual launch/termination? Is the capacity estimation correct? Which version is deployed? What kind of metrics do we need to monitor? How much does it cost? MANUALLY
  19. Our Approach
  20. Our Approach: Practical solutions that take full advantage of public cloud services • AWS CodeDeploy – Integration with Auto Scaling Group • EC2 Auto Scaling Group – Load test by Query Simulation – Metric Based Capacity Estimation – Graceful/Force Instance Termination
  21. CodeDeploy: Deployment service in AWS • Easy to integrate with Auto Scaling Group • Available everywhere – Supports on-premises instances • Scalable for distributed system use cases • https://docs.aws.amazon.com/codedeploy/index.html
  22. Auto Scaling System: The system should scale automatically without any manual operation • Load test by Query Simulation • Metric Based Capacity Estimation • Graceful Termination & Force Termination
  23. Query Simulation: Load tests should be based on real-world workloads. • Get the query list from our customers' past history • Cluster queries by query signature • Construct the data set and query list based on that list • This lets us run load tests easily based on the production workload
  24. Query Signature: A query signature represents a query in a shortened format.
  25. Query Simulation: Conductor (c5.9xlarge) 1. Get raw query list 2. Construct test data and query list
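The slides do not show the actual signature format, so the following is only a minimal sketch of query signature clustering under an assumed definition: a signature is the query text lowercased, with string and numeric literals replaced by `?` and whitespace collapsed. The function names `query_signature` and `cluster_by_signature` are hypothetical and not taken from the talk.

```python
import re
from collections import Counter


def query_signature(sql: str) -> str:
    """Return an assumed 'shortened' form of a query (not the talk's real format)."""
    s = sql.strip().rstrip(";").lower()
    s = re.sub(r"'(?:[^']|'')*'", "?", s)      # replace string literals
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "?", s)   # replace standalone numeric literals
    s = re.sub(r"\s+", " ", s)                 # collapse whitespace
    return s


def cluster_by_signature(queries):
    """Count how many historical queries share each signature."""
    return Counter(query_signature(q) for q in queries)


history = [
    "SELECT * FROM logs WHERE id = 10",
    "select *  from logs where id = 25;",
    "SELECT name FROM users WHERE city = 'Vilnius'",
]
clusters = cluster_by_signature(history)
```

The resulting counts could then be used to pick representative queries for a load test in proportion to how often each signature appears in production.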
  26. Metric Based Capacity Estimation: Designed to achieve a target metric value by adjusting capacity • Add/remove instances in proportion to the target metric value • e.g. Target average CPU usage = 40%
  27. Metric Based Capacity Estimation: Designed to achieve a target metric value by adjusting capacity • 40% is the threshold that balances cost and performance
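The proportional adjustment described above can be sketched as follows. This assumes a simple rule, desired = ceil(current instances × current metric / target metric), clamped to a group size range; the exact formula used in the talk is not given, and `desired_capacity`, `min_size`, and `max_size` are illustrative names.

```python
import math


def desired_capacity(current_instances: int,
                     current_cpu: float,
                     target_cpu: float = 40.0,
                     min_size: int = 1,
                     max_size: int = 100) -> int:
    """Estimate the instance count needed to bring average CPU to the target.

    Assumed rule: capacity scales in proportion to current_cpu / target_cpu.
    """
    if target_cpu <= 0:
        raise ValueError("target_cpu must be positive")
    raw = math.ceil(current_instances * current_cpu / target_cpu)
    # Keep the result inside the allowed group size.
    return max(min_size, min(max_size, raw))
```

For example, 10 workers averaging 80% CPU against a 40% target would be doubled to 20, while the same 10 workers at 20% CPU would be reduced to 5.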
  28. Graceful Termination: Terminating instances gracefully • Avoid degrading the user experience • Lifecycle hook in the Auto Scaling group • Cron job to check running tasks – Number of tasks on the worker – Send completion to the lifecycle hook https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroupLifecycle.html
  29. Graceful Termination: Terminating instances gracefully 1. The instance is moved to the Terminating:Wait state 2. The cron job completes the lifecycle hook, which moves the state to Terminating:Proceed 3. The ASG terminates the instance gracefully
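A minimal sketch of the cron-side check follows. It assumes the caller supplies an Auto Scaling client (for example `boto3.client("autoscaling")`, whose `complete_lifecycle_action` call is used here) and a hypothetical `count_running_tasks` callable that reports how many tasks the worker is still running; how that number is obtained is not covered by the slides.

```python
def check_and_release(autoscaling, instance_id, group_name, hook_name,
                      count_running_tasks):
    """Complete the termination lifecycle hook once the worker is idle.

    Returns True if the hook was completed, False if tasks are still running.
    `count_running_tasks` is a hypothetical callable supplied by the caller.
    """
    if count_running_tasks() > 0:
        # Stay in Terminating:Wait; the next cron run will check again.
        return False
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return True
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real Auto Scaling group.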
  30. Force Termination: A long-running task can block graceful termination • Set a “timeout” limit • Simulate “how long it takes to terminate gracefully” [Chart: axes Date and Time]
  31. Instance Termination: Balance between customer experience and cost optimization. Graceful Termination: Keeping queries running as long as possible satisfies customer expectations. • Non-fault-tolerant systems such as Presto • Distributed analysis workloads tend to be too long to be retried. Force Termination: Cost optimization is one of the primary goals of auto scaling • Scaling out/in within around 10 minutes does not lose agility for capacity adjustment • Force termination that affects only queries running longer than 10 minutes is acceptable
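The trade-off between graceful and force termination can be explored with a small simulation. This is a sketch under assumptions: the input is a list of remaining task durations in seconds (for example, replayed from query history), the timeout defaults to the 10 minutes mentioned above, and the function names are illustrative.

```python
def graceful_wait_seconds(remaining_task_seconds, timeout_seconds=600):
    """How long an instance waits before it is released.

    It waits for its longest remaining task, capped by the force-termination timeout.
    """
    if not remaining_task_seconds:
        return 0
    return min(max(remaining_task_seconds), timeout_seconds)


def forced_fraction(remaining_task_seconds, timeout_seconds=600):
    """Fraction of tasks that would be killed by force termination."""
    if not remaining_task_seconds:
        return 0.0
    killed = sum(1 for s in remaining_task_seconds if s > timeout_seconds)
    return killed / len(remaining_task_seconds)
```

Running both functions over historical workloads with different timeout values gives a rough view of how much extra instance time a longer timeout costs against how many queries a shorter one would interrupt.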
  32. Recap • Who is Treasure Data? • What is distributed data analysis? • What kind of challenges do we have? – Operational Cost – Stability and Scalability • Our Approach – AWS CodeDeploy & Auto Scaling Group – Query Simulation – Graceful/Force Shutdown
  33. Thank You! Danke! Merci! 谢谢! Gracias! Kiitos!
