2. Background – ODS Time rollups
▪ ODS receives, stores and displays time series data
▪ Many services, many points per minute
▪ Storing everything forever would take up a lot of space
▪ Solution: Beyond 2 days, keep only an averaged value every 15 minutes / 1 hour (a rollup sketch follows below)
[Chart: raw per-minute series vs. rolled-up averages; x-axis in minutes, 0–150]
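A minimal sketch of what such a rollup does, assuming raw samples arrive as (timestamp, value) pairs; the function name and bucket sizes are illustrative, not the actual ODS implementation:

```python
from collections import defaultdict
from statistics import mean

def rollup(samples, bucket_seconds=900):
    """Average raw (timestamp, value) samples into fixed-width buckets.
    bucket_seconds=900 gives 15-minute rollups; 3600 gives hourly ones.
    (Hypothetical helper, for illustration only.)"""
    buckets = defaultdict(list)
    for t, v in samples:
        buckets[int(t // bucket_seconds) * bucket_seconds].append(v)
    # One averaged point per bucket is all that survives beyond 2 days.
    return sorted((t, mean(vs)) for t, vs in buckets.items())
```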
3. Background – ODS Time rollups
▪ Problems:
▪ Can’t distinguish spikes from bumps
▪ The magnitudes of the spikes are not preserved
▪ The times are not preserved
[Chart: rolled-up series illustrating the problems above; x-axis in minutes, 0–120]
5. Intern project:
▪ How can we preserve the values and times of major spikes
without using more space?
6. Bottom-up algorithm:
• Greedy approach:
– Keep removing the most “useless” points until the resulting time
series exceeds an error threshold (a sketch follows below)
[Chart: the example series after bottom-up compression; x-axis in minutes, 0–120]
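A minimal sketch of that greedy bottom-up pass, under the assumption that a point's "usefulness" is how far its value sits from the straight line between its surviving neighbors; the names, the error measure, and the exact stop condition (stopping just before the threshold is exceeded) are illustrative, not the actual ODS code:

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (timestamp, value)

def interpolate(left: Point, right: Point, t: float) -> float:
    """Linearly interpolate the value at time t between two kept neighbors."""
    (t0, v0), (t1, v1) = left, right
    if t1 == t0:
        return v0
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def removal_error(points: List[Point], i: int) -> float:
    """Error introduced by dropping points[i]: how far its value is from the
    value we would reconstruct by interpolating its surviving neighbors."""
    t, v = points[i]
    return abs(v - interpolate(points[i - 1], points[i + 1], t))

def bottom_up_compress(points: List[Point], max_error: float) -> List[Point]:
    """Greedily drop the 'most useless' interior point (smallest reconstruction
    error) until the cheapest removal would exceed max_error. Endpoints stay."""
    kept = list(points)
    while len(kept) > 2:
        i = min(range(1, len(kept) - 1), key=lambda j: removal_error(kept, j))
        if removal_error(kept, i) > max_error:
            break
        del kept[i]
    return kept
```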
14. Nice properties
• Save space from “uninteresting” parts to store more
details of “interesting” parts
• Automatically use less storage when there is less
information
• Store exact points, no need to guess
• If you think you’re missing a spike, you know it’s smaller
than the smallest spike visible in that time window.
• (From our side) Easy to turn the knob between higher
compression and better accuracy.
15. Show me the numbers!
• Matches the compression of hourly rollups, with less
information loss than 15 min rollups
– Hourly rollup’s compression factor: ~ 20.6x
– For a similar compression,
• 67% RMS error reduction against 15 min rollups
• 83% RMS error reduction against hourly rollups
• Can also match 15 min rollups
– 15 min rollup’s compression factor: ~ 5.3x
– For a similar compression,
• 96.9% RMS error reduction against 15 min rollups
• 98.3% RMS error reduction against hourly rollups
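The slide does not define its metrics; a hedged reading, assuming the conventional definitions, is:

```latex
% Assumed definitions (not stated on the slide):
\[
  \text{compression factor} = \frac{\#\ \text{raw points}}{\#\ \text{points kept}},
  \qquad
  \text{RMS error reduction} = 1 - \frac{\mathrm{RMSE}_{\text{bottom-up}}}{\mathrm{RMSE}_{\text{rollup}}}
\]
```

Under that reading, a 67% RMS error reduction means the bottom-up output has roughly one third of the 15 min rollup's RMS error at a comparable storage cost.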
16. “Lossless” compression?
• Some graphs are too spiky; loading all this data slows
down the chart page
• What if we cap the number of points?
– Keep at most 30 points every 3 hours
• 1 point every 6 minutes on average if necessary
– After that, only remove points that provide zero information (sketched below)
• Result:
– Same compression as 15 min rollups, but 94% RMS error
reduction
– If chart has fewer than 30 points in a 3 hour window =>
lossless compression
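A sketch of that capped variant, reusing removal_error() from the bottom-up sketch above; the function would be applied per 3-hour window, and the names and error measure are again illustrative rather than the actual implementation:

```python
def capped_compress(points, max_points=30):
    """Force the window under max_points, then keep pruning only points
    whose removal introduces zero reconstruction error.
    (Hypothetical helper; 30 points per 3-hour window as on the slide.)"""
    kept = list(points)
    while len(kept) > 2:
        i = min(range(1, len(kept) - 1), key=lambda j: removal_error(kept, j))
        if len(kept) <= max_points and removal_error(kept, i) > 0:
            break  # under the cap and every remaining removal would lose information
        del kept[i]
    return kept
```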
17. Drawbacks
• MapReduce job currently produces high I/O load on the
cluster
• Harder to reason about sums and averages
• More time lag – currently needs 3 hours of past data to
perform compression
– But the raw information is available for 2 days, so it’s not really
a big issue
• Only guaranteed to store two points every 3 hours.
– Need to query over a longer time period to see at least one
point
18. What’s left?
• Reducing I/O load on cluster
• Reducing time lag (which also reduces I/O load)
• Need to interpolate values when applying formulas across
time series (see the sketch below)
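A minimal sketch of why that is needed and one way to do it, reusing interpolate() from the bottom-up sketch: two compressed series keep points at different timestamps, so a formula such as a ratio has to be evaluated on interpolated values. These helpers are hypothetical, not ODS code:

```python
def eval_at(points, t):
    """Value of a compressed series at time t via linear interpolation
    between the nearest kept points."""
    for left, right in zip(points, points[1:]):
        if left[0] <= t <= right[0]:
            return interpolate(left, right, t)
    raise ValueError("t outside the stored range")

def combine(a, b, op, timestamps):
    """Apply a per-timestamp formula to two compressed series,
    e.g. combine(a, b, lambda x, y: x / y, common_ts) for a ratio."""
    return [(t, op(eval_at(a, t), eval_at(b, t))) for t in timestamps]
```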
19. Thanks!
• Mentor: Justin Teller
• ODS team: Vinod, Charles, Tuomas, Scott, Alex, Ostap, Jason
• Other helpful interns: Marjori, Michal, Mateusz, Arash
• Special thanks to: Adela and Alexey
• Facebook!
----- Meeting Notes (8/22/13 13:40) -----
Store the number of data points; check HDFS space usage.
----- Meeting Notes (8/22/13 13:40) -----
Easier for Rapido to figure it out: store points of 0 and 1 for whether there's data or not (good compression on HBase too).
----- Meeting Notes (8/6/13 11:43) -----
A hybrid approach: try both and pick the one with the least error.
Check: if spiky, use bottom-up (BU); if smooth, use rollups.
----- Meeting Notes (8/22/13 13:40) -----
Overlapping time window of 4 minutes into the past.