Tokyo SRE Meetup - Building Reliable Services - A Journey from servers to services


TD Presents: Reliability x Large Scale talks for infrastructure and Site Reliability Engineering

Talk: Building Reliable Services - A journey from servers to services
Speaker: Chris Maxwell

Location: Tokyo, Japan
Date: March 15, 2018

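[TD Presents] Companies operating "reliability x large-scale" services share their stories! A technical case-study meetup on running services stably and scalably (Infrastructure/SRE edition)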

Published in: Technology

  2. 2. TREASURE DATA BUILDING RELIABLE SERVICES The journey from servers to services Chris Maxwell Site Reliability Manager
  3. 3. Treasure Data Services
  4. 4. WHY? Building Reliable Services • Reliability is an emergent property • You cannot buy reliability • You can invest in communication, tools, and processes that increase reliability
  5. 5. Product Sales Marketing Analytics DAILY WORKLOAD 1+ Million Events / Sec 400,000+ Queries / Day 15+ Trillion Rows / Day 173+ Million Rows / Sec
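The two row figures on this slide appear to be the same quantity at different time scales. A quick check, assuming the per-second number is simply the daily number averaged over 24 hours (the slide does not say how it was derived):

```python
# Sanity check: convert the slide's daily row count to a per-second average.
# Assumption: "173+ Million Rows / Sec" is a plain 24-hour average of
# "15+ Trillion Rows / Day"; the slide does not state this explicitly.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

rows_per_day = 15 * 10**12  # lower bound taken from the slide
rows_per_sec = rows_per_day / SECONDS_PER_DAY

print(f"{rows_per_sec / 1e6:.1f} million rows/sec")  # prints "173.6 million rows/sec"
```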
  6. 6. MANY DEPLOYMENTS 8+ Environments Varying capabilities and scale per environment 50+ Services Not a micro services architecture… 275+ Deployments Production clusters from 3 to 200+ instances
  7. 7. RUNTIME CONVERGENCE Cookbooks Downloaded Configuration Management Server Pattern Code Downloaded Configuration Management of releases Runtime Failures Dependencies and Releases use same process Dependencies Downloaded 3rd Party dependencies are everywhere
  8. 8. OUR HERO Infrastructure Engineer Systems Engineer who owns the resources underlying services. Automation, Cloud, Networks, Security Groups, DNS, Production Support services Site Reliability Engineer Software Engineer and Systems Engineer that improves services with automation and system-wide tools and best practices
  9. 9. INCREASE VELOCITY Faster than Weekly Deployments • Releases through Configuration Management • Infrastructure team gatekeeping More Sites • We need more sites by end of the year • 50+ services per site
  10. 10. COMPLEX PLATFORM Where to Start? • Job Control • Query and Compute • Storage • Segmentation Many Differences • Ruby • Java • Hadoop • Presto • Scala Many teams • Backend • Query • API • Integrations • Frontend • Infrastructure Growth and Change • New features every week • Product evolution
  11. 11. SERVICE DELIVERY IS HARD Hero Refuses Politely… Teams continue using existing practices Foundation is Dirty Work Thankless tasks Change exposes implicit usage Measure Reliability Improves existing processes Starts measuring features
  12. 12. WISDOM FROM OUTSIDE Simple First “Everything should be made as simple as possible, but not simpler.” — Paraphrase of Albert Einstein
  13. 13. ON EXPERTS AND ADVICE You’re the expert given your specific context and needs
  14. 14. MENTOR RETURNS The number of “chunks” of context a human engineer can retain is the: “magical number seven (7), plus or minus two” — George Miller
  15. 15. FIRST CHANGES Standard Deployment Targets For our environment, we need: • Site - data residency • Cloud - vendor / implementation • Region - resource location • Service - internal service name • Stage - delivery stages • Cluster - deployment target
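One way to picture the six-part deployment target is as a small immutable record. This is only a sketch: the field names come from the slide, but the `DeploymentTarget` class, the `path()` helper, the `/` separator, and the example values are hypothetical and not taken from the talk.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class DeploymentTarget:
    """Six coordinates that identify where a service runs (names from the slide)."""
    site: str     # data residency
    cloud: str    # vendor / implementation
    region: str   # resource location
    service: str  # internal service name
    stage: str    # delivery stage
    cluster: str  # deployment target

    def path(self) -> str:
        # Hypothetical helper: join the fields into one identifier string.
        return "/".join(astuple(self))


# Illustrative values only.
target = DeploymentTarget("us", "aws", "us-east-1", "presto", "staging", "blue")
print(target.path())  # prints "us/aws/us-east-1/presto/staging/blue"
```

Making the record frozen means a target cannot be altered after it has been discovered, which keeps every later step working from the same answer.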
  16. 16. HARD WORK AHEAD Reliability sometimes means rolling up your sleeves and getting dirty, working on core infrastructure to create a strong foundation to rely upon
  17. 17. FIRST CHANGES Standard Startup Services For our environment, we need: • preinit - discover deployment target • ephemeral - automatic volume mounting • final - bootstrap configuration management
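The three startup services read as an ordered pipeline in which each stage depends on the one before it. A minimal sketch of that ordering follows; the stage names are from the slide, while the function bodies, the `hint` key, and the context dictionary are placeholders invented for illustration.

```python
from typing import Callable, Dict, List

Context = Dict[str, str]


def preinit(ctx: Context) -> None:
    # Discover the deployment target. A real implementation might read
    # instance metadata or tags; here it is taken from a hypothetical hint.
    ctx["target"] = ctx.get("hint", "unknown")


def ephemeral(ctx: Context) -> None:
    # Stand-in for automatic volume mounting.
    ctx["volumes"] = "mounted"


def final(ctx: Context) -> None:
    # Bootstrap configuration management, which needs the discovered target.
    if "target" not in ctx:
        raise RuntimeError("preinit must run before final")
    ctx["config_management"] = f"bootstrapped for {ctx['target']}"


STAGES: List[Callable[[Context], None]] = [preinit, ephemeral, final]


def run_startup(ctx: Context) -> List[str]:
    """Run every stage in order and return the names of completed stages."""
    completed = []
    for stage in STAGES:
        stage(ctx)
        completed.append(stage.__name__)
    return completed


ctx: Context = {"hint": "presto/staging"}
print(run_startup(ctx))  # prints "['preinit', 'ephemeral', 'final']"
```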
  18. 18. KEEP IT SIMPLE “Complexity is the root cause of the vast majority of problems with software today” — Moseley & Marks
  19. 19. ACCEPTS CHALLENGE Standard Service Definition • Autoscale Group • Optional CodeDeploy Package • Internal Load Balancer • Internal DNS Endpoint • Optional External Load Balancer & DNS Endpoint
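The standard service definition can be sketched as a record with three required parts and optional extras. The class below is an assumption made for illustration: the component list is from the slide, but the `ServiceDefinition` name, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServiceDefinition:
    """Components of a standard service, as listed on the slide."""
    name: str
    autoscale_group: str
    internal_load_balancer: str
    internal_dns: str
    codedeploy_package: Optional[str] = None      # optional per the slide
    external_load_balancer: Optional[str] = None  # optional per the slide
    external_dns: Optional[str] = None            # optional per the slide

    def components(self) -> List[str]:
        """Return the labels of the components this service actually uses."""
        parts = ["autoscale_group", "internal_load_balancer", "internal_dns"]
        for optional in ("codedeploy_package", "external_load_balancer", "external_dns"):
            if getattr(self, optional) is not None:
                parts.append(optional)
        return parts


# Illustrative values only: an internal-only service with a deploy package.
svc = ServiceDefinition(
    name="presto",
    autoscale_group="presto-asg",
    internal_load_balancer="presto-ilb",
    internal_dns="presto.internal.example",
    codedeploy_package="presto-bundle",
)
print(svc.components())
```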
  20. 20. AUTOSCALING PRESTO Attach to the Team Our hero joins a service team Autoscaling Presto Helps to autoscale the entire service Work with Team Helps transition config into artifact
  21. 21. CODEDEPLOY PRESTO Learn from Team Their challenges and needs Artifact Code + Config Transition from simple autoscaling to Code + Config Artifacts Simple is Hard 3+ sources of configuration truth 12+ mostly same but different configurations Complexity was workaround for inflexible Configuration Management
  22. 22. MOVE FAST Direct API Tools • Service API not complete • Team needed compound operations Conductor to manage cluster ops • Built service-specific tools using underlying APIs • Routing and Segmentation
  23. 23. FRIENDS FOR THE JOURNEY AutoScaling & Launch Configuration • IAM Instance-Profile Roles • Route53 • CodeDeploy • EC2 Security Group • Application Load Balancer & Target Group
  24. 24. MORE FRIENDS Trusting Team Software Engineering teams trusted our hero Outside Experience Engineers with Domain Specific experience helped our hero understand the systems
  25. 25. SLIDE TITLE value of explicitly defined service contracts talk first, software later
  26. 26. DELIVERY STATES Dangerous Shutdown Some services require careful shutdown procedures Delivery cannot hard-fail 14-day running jobs Loose definition of responsibility Delivery is an organic combination of Configuration Management, system service control, release control New Orchestration exposes old assumptions In-place is sub-optimal for 2-week jobs New-cluster is sub-optimal for remaining jobs
  27. 27. MENTOR RETURNS Tools express the process Process should uplift the organization “Tools are necessary but not sufficient. To build a future we all can live with, we have to build it together” — Bridget Kromhout
  28. 28. OUR HERO Service Tool Orchestrate 6 infrastructure APIs with MVP tools: • Leverage immediate gain • orchestration • Paying interest • Learning team needs and behaviour • Liability that must be paid in full • Intend to replace with API + client
  29. 29. SERVICES FIRST All services should look the same Any engineer can • Create a cluster • Update a cluster • Deploy to a cluster • Delete a cluster Safely, using the same tool
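The "same tool for every service" idea amounts to one small interface with four verbs. Here is a toy in-memory sketch of that interface; the `ClusterTool` class and its method signatures are hypothetical, and a real tool would call infrastructure APIs instead of updating a dictionary.

```python
from typing import Dict


class ClusterTool:
    """Toy stand-in for a single tool exposing the four operations on the slide."""

    def __init__(self) -> None:
        self._clusters: Dict[str, Dict[str, object]] = {}

    def create(self, name: str, size: int) -> None:
        # Refuse to overwrite an existing cluster, so create stays safe to run.
        if name in self._clusters:
            raise ValueError(f"cluster {name!r} already exists")
        self._clusters[name] = {"size": size, "version": None}

    def update(self, name: str, size: int) -> None:
        self._require(name)["size"] = size

    def deploy(self, name: str, version: str) -> None:
        self._require(name)["version"] = version

    def delete(self, name: str) -> None:
        self._require(name)
        del self._clusters[name]

    def describe(self, name: str) -> Dict[str, object]:
        # Return a copy so callers cannot mutate internal state.
        return dict(self._require(name))

    def _require(self, name: str) -> Dict[str, object]:
        if name not in self._clusters:
            raise KeyError(f"unknown cluster {name!r}")
        return self._clusters[name]


tool = ClusterTool()
tool.create("presto-blue", size=3)
tool.update("presto-blue", size=5)
tool.deploy("presto-blue", version="1.2.3")
print(tool.describe("presto-blue"))  # prints "{'size': 5, 'version': '1.2.3'}"
tool.delete("presto-blue")
```

Because every service goes through the same four methods, an engineer who has operated one cluster already knows how to operate the next.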
  30. 30. SLIDE TITLE Survey the Work How deep does the hole go? Start with Friends API and Segmentation Where to Start? Look for the greatest need
  31. 31. COMPLEXITY Complex Service(s) • Manual Post-Start Actions • Service Discovery because no standards Duplication in Many Places • 5 services of the same service • We were pushing the limits of legacy model
  32. 32. COMPLEXITY Unclear boundaries • Configuration ownership shared across teams • Service Discovery because no standards Unclear assumptions • Inconsistent naming and usage • The way it works now is the way it should be
  33. 33. MIGRATION Simplifying Complex Re-evaluate all choices in light of services-first Many Transitional Changes Startup Services Infrastructure to Application Precision Replacement Coordinated Handover Careful work
  34. 34. THE PROCESS Legacy Process • Servers First • Human Orchestration Transition • Services First • Automatic triggers legacy Value • Replace legacy with artifact
  35. 35. VISION Standard Services First With standards, exceptions are hard; Without standards, everything is hard
  36. 36. OUR HERO Autoscaling Implemented • Second Services Team: • Launched to Staging last week • Launched to Production yesterday
  37. 37. THE REWARD Service Patterns for Scaling • Deployment Targets • Standard Startup • Standard Services New Powers • On-Demand Clusters • Per-Cluster Versioning • Immediate Feedback
  38. 38. OUR HERO Your team builds it, your team runs it; we can help your team run it better
  39. 39. OUR BLUEPRINT Standard Services • Deployment Target • Internal Hostname • Internal Load Balancer • Autoscale Group • CodeDeploy Artifact Supporting Services Artifacts are easier with: • Configuration support hooks • Service Control hooks • Remote Execution hooks • Metrics, monitors, logs, alerts
  40. 40. REMAINING SERVICES 41+ Services Just 41+ more to go Each one needs conversion 200+ Deployments Just 200+ more to go Each one needs re-deployment Empathy Not all services were designed for multi-cluster environments Not all services were designed for graceful termination Not all services have active improvements planned Challenges • Non-idempotent • State-full / Disk-full • Master/Worker Co-Services • Maintain Service Levels • High Throughput Environment
  41. 41. THE WAY HOME Best Practices Standard Services Standard Delivery Standard Tooling Work for Teams Improve Service as a Service Work with Teams Enable Super Powers Deploy on Demand Per-Cluster Versions
  42. 42. REMAINING SERVICES Service Improvements Target business value: Delivery Velocity High-Trust Services Support Config Management No Big-Bang Replacements Business Depends on Previous Process Strategy to Improve Small Iterations Incremental Value
  43. 43. OUR SERVICE IS NOT YOUR SERVICE All software is created within a context, and trade-offs are made based on that context
  44. 44. RELIABILITY Reliability is: The quality of being trustworthy or performing consistently well
  45. 45. INVESTMENTS Understandable Make every service easy to understand Allow any engineer to quickly operate and improve Consistent Make every service look the same Allow any engineer to work on any system without context Repeatable Practice makes perfect
  47. 47. NO HEROES, ONLY TEAM Yuu Yamashita Takashi Kokubun Yuki Ito Chris Maxwell You? Site Reliability Engineer Robin Bowes You? Site Reliability Engineer You? Infrastructure Engineer You? Site Reliability Engineer
  48. 48. TREASURE DATA BUILDING RELIABLE SERVICES • @WrathOfChris • Chris Maxwell • Careers • Treasure Data K.K.