Show me the Money! Cost & Resource Tracking for Hadoop and Storm

1. Show Me The Money! Cost & Resource Tracking for Hadoop & Storm
   Hadoop Summit, June 30, 2016. Kendall Thrapp
2. Hadoop @ Yahoo Scale
   • 3000+ grid users
   • ~600 distinct projects
   • Running 1.2M+ apps/day
   … all focused on meeting their own SLAs, but not necessarily on how their grid usage impacts others.
   Tracking resource usage and cost is critical to manage capacity and ensure fairness.
   Image by b k @ https://flic.kr/p/4EjNgb (CC BY-SA 2.0)
3. Why Care About Resource Utilization?
   • Capacity Planning: see trends over time and predict future shortfalls
   • Operational Efficiency: provide justification for engineering more efficient code
   • Profitability & ROI: include Hadoop platform usage cost in overall project cost
   • Grid Efficiency: move projects between clusters to maximize efficiency
   • Transparency: see resource usage and cost of all grid tenants
4. Three Year Mission…
   But tracking resource usage in Hadoop was hard… really hard. So three years ago, we set out on a mission to show:
   • Resource usage for any YARN app
   • Resource usage over time for clusters, queues, users, and projects
   • Cost for any resource usage
   Image derived from https://flic.kr/p/dN895J by JD Hancock (CC BY 2.0)
5. The Language of Grid Resource Usage
   Resource Usage = amount allocated × time allocated
   • One 2 GB mapper running for 5 hours = 10 GB-Hours
   • Five 2 GB mappers running for 1 hour = 10 GB-Hours
   Example units per resource:
   • RAM: GB-Hour or MB-Second
   • CPU: vCore-Hour or vCore-Second
   Image by Casey Fleser @ https://flic.kr/p/6ACfUz (CC BY 2.0)
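As a quick sanity check of the arithmetic on this slide, here is a minimal sketch; the `gb_hours` helper is illustrative, not something from the deck:

```python
def gb_hours(memory_gb: float, containers: int, hours: float) -> float:
    """Resource usage = amount allocated x time allocated, summed over containers."""
    return memory_gb * containers * hours

# One 2 GB mapper running for 5 hours
print(gb_hours(memory_gb=2, containers=1, hours=5))   # 10.0 GB-Hours

# Five 2 GB mappers running for 1 hour
print(gb_hours(memory_gb=2, containers=5, hours=1))   # 10.0 GB-Hours
```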
6. Introducing YARN-415
   Capture aggregate resource allocation at the app level in MB-seconds & vCore-seconds.
   • 28 months from JIRA to full deployment
   • First time getting resource usage for non-MR applications, like Spark, Tez, or Slider
   • Available through the Hadoop UI, even while the app is still running
   • Stored long term by the Grid UI team and made available through a REST API
   • Can benchmark apps to see how code & config changes affect resource usage
   • Can convert this to a $ cost using the TCO method described later
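For context, the aggregate figures that YARN-415 exposes can be read from the ResourceManager's cluster applications REST API and converted into GB-Hours and vCore-Hours. A minimal sketch, assuming a hypothetical ResourceManager host and application id; the `memorySeconds` and `vcoreSeconds` fields follow the standard Hadoop 2.x apps API:

```python
import requests

RM = "http://resourcemanager.example.com:8088"     # hypothetical ResourceManager address
app_id = "application_1467244327200_0001"          # hypothetical application id

# Cluster Application API: /ws/v1/cluster/apps/{appid} returns a single "app" object.
app = requests.get(f"{RM}/ws/v1/cluster/apps/{app_id}").json()["app"]

# YARN reports aggregate allocation in MB-seconds and vCore-seconds.
gb_hours    = app["memorySeconds"] / 1024 / 3600
vcore_hours = app["vcoreSeconds"] / 3600

print(f"{gb_hours:.1f} GB-Hours, {vcore_hours:.1f} vCore-Hours")
```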
7. Resource Utilization Over Time
   YARN-415 only gives us half the story…
   • Sample cluster-, queue-, and user-level compute resource utilization every minute across all clusters
   • Make it available via the Grid Utilization Dashboard and REST API
   • Further aggregate by project and time at hourly, daily, and monthly intervals
   • Projects can see a rolling one-year history of their compute and storage usage on Doppler
   Image from Grid Utilization Dashboard
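One way per-minute samples could be rolled up into hourly averages is sketched below. The sample tuples and values are hypothetical; this is not the Grid UI team's actual pipeline:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical per-minute samples: (timestamp, queue, allocated_gb, allocated_vcores)
samples = [
    (datetime(2016, 6, 30, 10, 0), "search", 4096, 1200),
    (datetime(2016, 6, 30, 10, 1), "search", 4200, 1250),
    (datetime(2016, 6, 30, 11, 0), "search", 3900, 1100),
]

# Bucket samples by (hour, queue), then average each bucket.
hourly = defaultdict(list)
for ts, queue, gb, vcores in samples:
    hourly[(ts.replace(minute=0, second=0), queue)].append((gb, vcores))

for (hour, queue), points in sorted(hourly.items()):
    avg_gb = sum(gb for gb, _ in points) / len(points)
    avg_vc = sum(vc for _, vc in points) / len(points)
    print(f"{hour} {queue}: {avg_gb:.0f} GB avg, {avg_vc:.0f} vCores avg")
```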
8. Viewing Project Compute Utilization (in the Doppler web application)
   • Monthly average RAM & CPU usage for the current month and past three months, as well as quotas
   • Rolling one-year historical charts for RAM & CPU:
     ● Central solid line is the daily average
     ● Inner (darker) band is the average ± 1 standard deviation
     ● Outer (lighter) band is the daily min/max
     ● Dashed line is the approved quota
   • Zoom by time window or date range; when zoomed in, use the scrollbar to see other dates
   • Hover over the chart to see exact values for dates
   • Flags indicate major events, like the upgrade to Hadoop 2.6
   • Click a name in the legend to show or hide a series; chart axes dynamically resize to maximize detail
   • The webpage has additional panels like this for each queue ever used by the project
9. Viewing Project Storage Utilization (in the Doppler web application)
   • Gauges show the latest observed disk and namespace usage, gradually turning from green to red as utilization approaches 100%
   • Current utilization and quota are shown both before and after replication
   • Rolling one-year historical charts for disk and namespace usage:
     ● Blue area is the daily average
     ● Dashed orange line is the actual quota
   • Hover over the chart to see exact values for dates
   • The webpage has additional panels like this for each project directory used by the project
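The before/after replication distinction is just a multiplication by the HDFS replication factor. A one-line sketch, assuming the common default factor of 3 (actual replication is configured per file, so this is only an approximation):

```python
HDFS_REPLICATION_FACTOR = 3   # assumed default; real factor can vary per file

def with_replication(logical_tb: float, factor: int = HDFS_REPLICATION_FACTOR) -> float:
    """Convert logical (pre-replication) usage to raw disk consumed after replication."""
    return logical_tb * factor

print(with_replication(100))   # 100 TB of data occupies ~300 TB of raw disk
```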
10. Show Me the Money!
    • Total Cost of Ownership (TCO) initiative in 2015 to begin computing a $ cost for all compute and storage utilization by projects on Hadoop.
    • In June 2015, we added a TCO panel to all Hadoop project and project environment pages in the Doppler web application showing historical monthly TCO cost.
11. How is Project TCO Calculated?
    Total Hadoop TCO is split across four resource types: RAM, CPU, Disk, and Namespace (25% each).
    1. Compute total Hadoop TCO, comprised of many different sources of cost, not just hardware (see next slide).
    2. Divide total TCO amongst resource types. An even distribution was chosen initially; the distribution can be adjusted (monthly) to allow scarce resources to be priced more expensively.
    3. Compute project resource TCO as a fraction of total resource TCO:
       Project Resource TCO = (Project Resource Usage / Total Resource Usage) × Total Resource TCO
    4. Total project TCO is the sum of all individual project resource TCOs.
    This distributes overhead/unused capacity costs across projects proportional to their grid usage.
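The slide's formula translates directly into code. The figures below are placeholders for illustration only, mirroring the even 25% split across RAM, CPU, disk, and namespace; none of them are real Yahoo numbers:

```python
def project_resource_tco(project_usage: float, total_usage: float, total_resource_tco: float) -> float:
    """Project Resource TCO = (project usage / total usage) x total resource TCO."""
    return project_usage / total_usage * total_resource_tco

# Illustrative totals only: total monthly TCO split evenly across four resource types.
total_tco = 8_100_000
resource_share = {"ram": 0.25, "cpu": 0.25, "disk": 0.25, "namespace": 0.25}

# Hypothetical (project usage, cluster-wide total usage) per resource.
usage = {
    "ram":       (120_000, 4_000_000),    # GB-Hours
    "cpu":       (30_000, 1_500_000),     # vCore-Hours
    "disk":      (500, 20_000),           # TB
    "namespace": (2_000_000, 90_000_000), # namespace objects
}

# Total project TCO is the sum of its per-resource TCOs.
project_tco = sum(
    project_resource_tco(used, total, total_tco * resource_share[resource])
    for resource, (used, total) in usage.items()
)
print(f"Project monthly TCO: ${project_tco:,.0f}")
```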
12. Total Hadoop TCO Makeup (illustrative)
    An illustrative monthly TCO of $8.1M, split across seven components (roughly 60%, 12%, 10%, 7%, 6%, 3%, and 2% of the total):
    1. Cluster Hardware: data nodes, name nodes, job trackers, gateways, load proxies, monitoring, aggregator, and web servers
    2. R&D Headcount: headcount for platform software development, quality, and release engineering
    3. Active Use and Operations (recurring): recurring datacenter ops cost (power, space, labor support, and facility maintenance)
    4. Network Hardware: aggregated network component costs, including switches, wiring, terminal servers, power strips, etc.
    5. Acquisition/Install (one-time): labor, POs, transportation, space, support, upgrades, decommissions, shipping/receiving, etc.
    6. Operations Engineering: headcount for service engineering and data operations teams responsible for day-to-day ops and support
    7. Network Bandwidth: data transferred into and out of clusters for all colos, including cross-colo transfers
13. TCO Dashboard (in the Doppler web application)
    The TCO Dashboard (yo/grid-tco) allows users to view and sum TCO information along a variety of dimensions.
    • One row in the table per project environment and month
    • Filter TCO data on: date range, project name, business unit, cluster name, cluster type
    • Search on anything in the table; sort on any column or multiple columns
    • Resource and cost totals for all filtered results are shown
    • Export to CSV for offline analysis
    Note: Cost data is for illustrative purposes only (not real unit costs)
14. Results!
    • Unmasked hidden issues, like:
      – Projects using far more compute resources than they were ever approved for
      – Projects requesting more resources when they were underutilizing what they already had
      – Projects launching apps in queues they weren't supposed to be using
      – Zombie projects that were cancelled/retired but continued to consume grid resources
    • Helped teams verify a significant reduction in their compute usage after some major efficiency improvements
15. Beyond Hadoop: Storm Project Compute Utilization (in the Doppler web application)
    • Sample assigned RAM & CPU per topology every minute across all clusters using Nimbus' topology summary REST API
    • Aggregate by user and by project
    • Make available via the Doppler UI and REST API
    • Coming soon: compare assigned memory/CPU vs. actual usage
    • Convert to a monthly $ cost via the TCO model
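A sketch of how per-topology assigned resources might be sampled and rolled up by owner. The Storm UI host is hypothetical, and the `assignedTotalMem`/`assignedCpu` field names are assumptions based on Storm releases that ship the Resource Aware Scheduler; they are not confirmed by the deck, so check your version's REST docs:

```python
import requests

STORM_UI = "http://storm-ui.example.com:8080"   # hypothetical Storm UI host

# Topology summary endpoint; field availability depends on the Storm version.
summary = requests.get(f"{STORM_UI}/api/v1/topology/summary").json()

# Roll assigned memory (MB) and CPU (% of a core) up by topology owner.
by_owner: dict[str, list[float]] = {}
for topo in summary["topologies"]:
    owner = topo.get("owner", "unknown")
    mem_gb = topo.get("assignedTotalMem", 0) / 1024      # assumed field, MB -> GB
    cpu_cores = topo.get("assignedCpu", 0) / 100          # assumed field, % -> cores
    acc = by_owner.setdefault(owner, [0.0, 0.0])
    acc[0] += mem_gb
    acc[1] += cpu_cores

for owner, (mem_gb, cores) in sorted(by_owner.items()):
    print(f"{owner}: {mem_gb:.1f} GB, {cores:.1f} cores assigned")
```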
16. Recap: Resource and cost tracking for Hadoop & Storm
    ● Get compute resource usage for all Hadoop apps through YARN-415
    ● Store historical Hadoop resource utilization at the cluster, queue, user, and project levels
    ● Store historical Storm resource utilization at the topology, user, and project levels
    ● Developed a cost model and applied it to compute a monthly cost for all Hadoop and Storm projects
    ● Make utilization and cost data and charts available via web apps and REST APIs
17. The mission continues…
    • Visibility and cost for NameNode operations
    • Visibility and cost for network utilization in Storm
    • Identify waste when there are large gaps between allocated and peak used container memory (Downsizer)
    • Move to an OPEX model where teams just pay for what they use
    Image by Reinhard Kuchenbäcker @ https://flic.kr/p/naFkFH (CC BY 2.0)
18. Q&A
    Authors: Kendall Thrapp, Shawna Martell, Alessandro Bellina, Eric Payne, Sumeet Singh
