Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Getafix : Using Popularity to Curtail Memory Costs in Interactive Analytics Engines

75 views

Published on

Getafix is a data replication scheme that cuts memory costs in interactive analytics engines. It tracks data segment popularity in a cluster and runs a modified best-fit algorithm to find a memory efficient allocation. It is built upon our optimal solution to the static version of the scheduling problem, which yields a solution that has the least makespan and replication factor. In practice, it can cut cluster memory costs by 2X without sacrificing query performance and save firms as much as $10 million annually. It also automatically tiers popular data to powerful nodes in a heterogeneous cluster and greatly simplifies sys admin work.

Website: https://medium.com/@iwonderhow/getafix-using-popularity-to-curtail-memory-costs-in-interactive-analytics-engines-1471cf5dab84
Paper: https://dl.acm.org/citation.cfm?id=3190542&dl=ACM&coll=DL
Source Code: http://dprg.cs.uiuc.edu/downloads#l_getafix

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Getafix : Using Popularity to Curtail Memory Costs in Interactive Analytics Engines

  1. 1. Getafix: Using Popularity to Curtail Memory Costs in Interactive Analytics Engines Distributed Protocols Research Group http://dprg.cs.uiuc.edu/ Mainak Ghosh*, Ashwini Raina*, Le Xu*, Xiaoyao Qian*, Indranil Gupta*, Himanshu Gupta# *University of Illinois at Urbana Champaign, #Oath Inc.
  2. 2. Real-Time Analytics 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 2 Distributed OLAP Streaming Analytics Worth 13.7 billion dollars by 2021[1] Interactive Analytics Engine [1] Research and Markets. 2015. Streaming Analytics Market by Verticals – Worldwide Market Forecast & Analysis (2015 - 2020). Report. (June 2015).
  3. 3. Interactive Analytics Engines 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 3 Data Ingestion Deep Storage Batch Data Real-time Data Query Compute Router Compute Nodes Prefetched, Cached Replicated Handles petabytes of data and millions of queries per day Requires large amount of memorySegments Less replication => less memory
  4. 4. Memory is not cheap … 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 4 Managing data is important in these systems
  5. 5. Memory Performance Tradeoff • Uniform Replication Strategy • Difficult to determine the knee offline • Can we do better? 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 5 Observed Knee
  6. 6. Skewed Access Pattern 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 6 Top 1% segments 43% of query accesses
  7. 7. Reducing Memory • Selective Replication • Scarlett [Ananthanarayanan’11] • Data Compression • Blowfish [Khandelwal’16] • EC-Cache[Rashmi’16] 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 7
  8. 8. Getafix: Goals • Minimize memory required • Reduce cost of cluster provisioning • Minimize Makespan • Small impact on average query latency 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 8 Goal
  9. 9. Agenda • Motivation • Problem Formulation • System Design • Evaluation • Takeaway 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 9
  10. 10. Colored Blocks and Bins … 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines CN3 CN4 Q1(Count) Q2(Top 10) Q3(GroupBy) Q4(Average) Goal is to provide a load balanced assignment with the least amount of replication S1 (5 AM – 6 AM) S2 (6 AM – 7 AM) S3 (7 AM – 8 AM) 10 CN1 CN2 Segments Compute Nodes CN1
  11. 11. An arbitrary mapping ... 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines CN3 CN4 Q1(Count) Q2(Top 10) Q3(GroupBy) Q4(Average) S1 (5 AM – 6 AM) S2 (6 AM – 7 AM) S3 (7 AM – 8 AM) 11 CN1 CN2 Segments Compute Nodes CN1 2 22 1 Load Balanced Assignment. Number of Replicas: 7
  12. 12. Solution which minimizes number of replicas … 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines CN3 CN4 Q1(Count) Q2(Top 10) Q3(GroupBy) Q4(Average) S1 (5 AM – 6 AM) S2 (6 AM – 7 AM) S3 (7 AM – 8 AM) 12 CN1 CN2 Segments Compute Nodes CN1 1 21 1 Load Balanced Assignment. Number of Replicas: 5
  13. 13. Static Problem: Assumptions (Relaxed later) 1. Blocks are of uniform size. • All queries take same time to execute on segments 2. The number of blocks and bins are static. • All queries and segments have arrived. 3. Bins are of uniform capacity. • Compute nodes are homogeneous in computation power. 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 13
  14. 14. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 14 CN1 CN2 CN3 Capacity: 4 𝟔 + 𝟑 + 𝟐 + 𝟏 𝟑 S1 S2 S3 S4 6 3 2 1 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count
  15. 15. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 15 CN1 CN2 CN3 S1 S2 S3 S4 6 3 2 1 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Capacity: 4
  16. 16. Compute Node Selection Policy • Bin Packing heuristics like First Fit, Largest Fit and Best Fit applicable here • In our running example, we use First Fit • We choose the next CN (lowest CN id) that is not yet full. 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 16
  17. 17. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 17 CN1 CN2 CN3 S1 S2 S3 S4 6 3 2 1 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Capacity: 4
  18. 18. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 18 CN1 CN2 CN3 S2 S1 S3 S4 3 2 2 1 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Capacity: 4
  19. 19. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 19 CN1 CN2 CN3 S2 S1 S3 S4 3 2 2 1 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Continue till the queue is empty Capacity: 4
  20. 20. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 20 CN1 CN2 CN3 S1 S3 S4 S2 2 2 1 0 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Continue till the queue is empty Capacity: 4
  21. 21. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 21 CN1 CN2 CN3 S3 S1 S4 S2 2 1 1 0 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Continue till the queue is empty Capacity: 4
  22. 22. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 22 CN1 CN2 CN3 S1 S4 S2 S3 1 1 0 0 S1 S4 S2 S3 1 1 0 0 S1 S4 S2 S3 1 1 0 0 S1 S4 S2 S3 1 1 0 0 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Continue till the queue is empty Capacity: 4
  23. 23. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 23 CN1 CN2 CN3 S1 S2 S3 S4 0 0 0 0 Load Balanced but not optimal in replica count 1 2 3 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Continue till the queue is empty Capacity: 4
  24. 24. Segment Placement Algorithm 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 24 CN1 CN2 CN3 S1 S2 S3 S4 0 0 0 0 Assign bin capacity as total number of query segment pairs by number of compute nodes Create a priority queue on segments by access count Pick from the head, assign to compute node based on a policy and return the rest Continue till the queue is empty Best Fit provably achieves this minimal replica count 1 2 2 Capacity: 4
  25. 25. Agenda • Motivation • Problem Formulation • System Design • Evaluation • Takeaway 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 25
  26. 26. Algorithm Execution • Executes in periodic rounds • Small interval length • Tracks total access time • Total time all queries spend accessing a particular segment 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 26 CN1 Access Logger S1 (5 AM – 6 AM) S2 (6 AM – 7 AM) Coordinator Best Fit Algorithm Segment Access (CN1): S1: 1, S2: 2 Queries
  27. 27. Segment Management • Aggressively removes replicas of unpopular segments except one • One replica avoids network fetch on access • Single replica garbage collected under resource constraint • Bootstrap loads new segments • Avoids segment loads in fast path 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 27 Deep Storage CN1 Access Logger S1 (5 AM – 6 AM) S2 (6 AM – 7 AM) Coordinator Best Fit Algorithm CN1: Insert S3 Remove S1 Bootstrap Loader Garbage Collection Fetch S3
  28. 28. Cluster Heterogeneity • Production clusters are tiered – hot, warm and cold data • Unexpected slowdowns (stragglers) • Bins are assigned unequal capacities in the algorithm • Auto-Tiering: Match popular segments to powerful nodes. • Done manually today by sys-admin. 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 28
  29. 29. Other Details … • Query Routing • Segment Load Balancer • Minimizing Network Transfer • Fault Tolerance 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 29
  30. 30. Agenda • Motivation • Problem Formulation • System Design • Evaluation • Takeaway 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 30
  31. 31. Memory Latency Tradeoff 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 31 • Getafix uses ~50% less memory compared to Scarlett • Small impact on average latency Getafix
  32. 32. Memory Savings -- Scalability • Baseline: Scarlett[2] • Total memory savings increases with client load 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 32
  33. 33. Dollar Cost Savings Memory Cost Model Working Set Data Size x Replication Factor x Cost of Memory Cost Saving Compared to Scarlett = 100 TB x (4.2 – 1.9) x $0.005 per GB per hour (EC2) = $1150 per hour = $10 million annually 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 33
  34. 34. Tiering Accuracy Getafix achieves 75% tiering accuracy compared to 42% for Getafix-B 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 34 ` 16 8 4 ` 16 8 4 Getafix-B Getafix High Moderate Low Rounds Query Processing Volume (CPU Time)
  35. 35. Takeaway • Data management in interactive analytics engine – an important problem • Best fit based segment replication algorithm minimizes replica count under assumptions • Our system, Getafix built on top of Druid relaxes these assumptions • Getafix out-performs known baselines by reducing memory and cost 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 35
  36. 36. Evaluation Summary Getafix minimizes memory and cost with little impact on performance Getafix adapts well to heterogeneity in clusters by effectively auto-tiering 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 36
  37. 37. Memory Savings -- Scalability • Baseline: Scarlett[2] • Total memory savings increases with client load • 1.72x – 1.92x reduction in maximum memory 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 37 [2] Ganesh Ananthanarayanan, et.al. Scarlett: Coping with Skewed Content Popularity in Mapreduce Clusters. (EuroSys ’11).
  38. 38. Heterogeneity Improvements 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 38 • Baseline: (Getafix-B) Getafix w/o Heterogeneity • Improves on all metrics • Total Memory Savings improves 20 – 30%
  39. 39. Heterogeneity Improvements 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 39 • Baseline: Vanilla Getafix (Getafix-B) • Improves on all metrics • Total Memory Savings improves 20 – 30% • ~20% improvement in tail latency observed
  40. 40. Evaluation Goals Memory and Cost Savings Effectiveness of Auto-Tiering 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 40
  41. 41. Getafix: Goals • Minimize memory required • Minimize cost of cluster provisioning • Minimize makespan • Minimal impact on average query latency 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 45
  42. 42. Memory Latency Tradeoff 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 46 • Baseline: Uniform and Scarlett • 2.15X worse query performance for Uniform
  43. 43. Memory Latency Tradeoff 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 47 • Baseline: Uniform and Scarlett • 2.15X worse query performance for Uniform • Getafix uses least memory
  44. 44. Evaluation Goals Getafix minimizes memory and cost with little impact on performance Support for Heterogeneity 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 48
  45. 45. Implementation • Tracks total access time • Total time all queries spend accessing a particular segment • Executes the algorithm in periodic rounds • Small interval length to catch popularity change early • Segment Popularity at round K+1 for segment sj estimated as: Popularity(sj, K + 1) = 1 2 ∗ Popularity(sj, K) + AccessTime(sj, K + 1) 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 49
  46. 46. Segment Management • Getafix follows the segment placement plan • Segments are bootstrap loaded • Avoids segment loads in fast path • Aggressively removes replicas of unpopular segments except one • One replica avoids network fetch on access • Single replica garbage collected under resource constraint 6/11/2018 Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines 50

×