Efficient ODS Time Rollup
ODS (Monitoring)
Scott Yak
August 22, 2013
Background – ODS Time rollups
▪ ODS receives, stores and displays time series data
▪ Many services, many points per minute
▪ Storing everything forever would take up a lot of space
▪ Solution: Beyond 2 days, keep only an averaged value every 15 minutes / 1 hour (rollup sketched below)
[Figure: sample time series, time axis 0–150 min]
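To make the existing rollup concrete, here is a minimal sketch of fixed-bucket averaging (illustrative only; the real ODS storage pipeline and its data model are not shown in this deck):

```python
# Minimal sketch of a fixed-bucket time rollup: average all (timestamp, value)
# points that fall into the same 15-minute bucket. Illustrative only.
from statistics import mean

def rollup(points, bucket_secs=15 * 60):
    """Return one (bucket_start, average) pair per occupied bucket."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % bucket_secs, []).append(value)
    return sorted((start, mean(vals)) for start, vals in buckets.items())

# A 40.0 spike at t=900 gets averaged down to about 20.55 in its bucket.
raw = [(0, 1.0), (60, 1.2), (900, 40.0), (960, 1.1)]
print(rollup(raw))  # roughly [(0, 1.1), (900, 20.55)]
```

This is exactly the failure mode the next slide calls out: the averaged bucket no longer shows when the spike happened or how large it was.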
Background – ODS Time rollups
▪ Problems:
▪ Can’t distinguish spikes from bumps
▪ The magnitudes of the spikes are not preserved
▪ The times of the spikes are not preserved
[Figure: sample time series, time axis 0–120 min]
Significance:
▪ Incomplete information during troubleshooting
▪ May hide issues that are not discovered within the two days of raw data
Intern project:
▪ How can we preserve the values and times of major spikes
without using more space?
Bottom-up algorithm:
• Greedy approach:
– Keep removing the least informative (most “useless”) point for as long as the
reconstructed time series stays within an error threshold (see the sketch below)
[Figure: sample time series, time axis 0–120 min]
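A minimal sketch of this greedy bottom-up removal, assuming linear interpolation between the surviving neighbours as the reconstruction and an absolute per-point error cost; the production MapReduce job and its exact error metric are not described on these slides:

```python
# Sketch of greedy bottom-up compression: repeatedly drop the interior point
# whose removal changes the reconstructed series the least, and stop once even
# the cheapest removal would exceed the error threshold.

def removal_cost(points, i):
    """Error introduced at points[i] if it is dropped and linearly interpolated."""
    (t0, v0), (t, v), (t1, v1) = points[i - 1], points[i], points[i + 1]
    interpolated = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return abs(v - interpolated)

def compress(points, max_error):
    points = list(points)
    while len(points) > 2:
        i = min(range(1, len(points) - 1), key=lambda j: removal_cost(points, j))
        if removal_cost(points, i) > max_error:
            break  # removing anything else would distort the series too much
        del points[i]
    return points

raw = [(0, 1.0), (60, 1.1), (120, 1.0), (180, 55.0), (240, 1.2), (300, 1.1)]
print(compress(raw, max_error=0.5))
# -> [(0, 1.0), (120, 1.0), (180, 55.0), (240, 1.2), (300, 1.1)]
# The t=180 spike survives with its exact time and value; the flat t=60 point is dropped.
```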
“Live” demo
[Demo charts: raw data; raw + rollups; raw + rollups + compressed; raw; rollups; compressed]
Nice properties
• Save space on “uninteresting” parts to store more
detail about “interesting” parts
• Automatically use less storage when there is less
information
• Store exact points, no need to guess
• If you think you’re missing a spike, you know it’s smaller
than the smallest spike visible in that time window.
• (From our side) Easy to tune the knob that trades higher
compression against better accuracy.
Show me the numbers!
• Matches the compression of hourly rollups with less
information loss than 15-min rollups (error metric sketched below)
– Hourly rollup’s compression factor: ~ 20.6x
– For a similar compression,
• 67% RMS error reduction against 15 min rollups
• 83% RMS error reduction against hourly rollups
• Can also match the compression factor of 15-min rollups
– 15 min rollup’s compression factor: ~ 5.3x
– For a similar compression,
• 96.9% RMS error reduction against 15 min rollups
• 98.3% RMS error reduction against hourly rollups
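For context, a hedged sketch of how an “RMS error reduction” number like those above could be computed; the exact evaluation procedure is not spelled out on the slide, so this simply compares each lossy representation against the raw series and takes one minus the ratio of RMS errors:

```python
# Hypothetical evaluation helper: rms_reduction(raw, compressed_at, rollup_at)
# returns e.g. 0.67 for a "67% RMS error reduction". Both *_at arguments are
# callables mapping a raw timestamp to the reconstructed value.
import math

def rms_error(raw, approx_at):
    """RMS difference between raw (t, v) points and a reconstruction approx_at(t)."""
    return math.sqrt(sum((v - approx_at(t)) ** 2 for t, v in raw) / len(raw))

def rms_reduction(raw, compressed_at, rollup_at):
    return 1.0 - rms_error(raw, compressed_at) / rms_error(raw, rollup_at)
```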
“Lossless” compression?
• Some graphs are too spiky; loading all this data slows
down the chart page
• What if we cap the number of points? (variant sketched below)
– Keep at most 30 points every 3 hours
• 1 point every 6 minutes on average if necessary
– After that, only remove points that provide zero information
• Result:
– Same compression as 15 min rollups, but 94% RMS error
reduction
– If chart has fewer than 30 points in a 3 hour window =>
lossless compression
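A sketch of the capped variant described above, reusing removal_cost() from the bottom-up sketch; the 30-points-per-3-hours cap comes from the slide, while the per-window handling and the “zero information” test are assumptions:

```python
# Capped variant for one 3-hour window of points: first get under the cap,
# then keep dropping only points whose removal costs nothing.
# Assumes the removal_cost() helper from the bottom-up sketch above.
def compress_capped(points, max_points=30):
    points = list(points)
    while len(points) > 2:
        i = min(range(1, len(points) - 1), key=lambda j: removal_cost(points, j))
        if len(points) <= max_points and removal_cost(points, i) > 0:
            break  # under the cap: only zero-information points may go
        del points[i]
    return points
```

When a window already has fewer than 30 informative points, nothing is removed, which is the “lossless” case on this slide.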
Drawbacks
• MapReduce job currently produces a high I/O load on the
cluster
• Harder to reason about sums and averages
• More time lag – currently needs 3 hours of past data to
perform compression
– But the raw information is available for 2 days, so it’s not really
a big issue
• Only guaranteed to store two points every 3 hours.
– Need to query over a longer time period to see at least one
point
What’s left?
• Reducing I/O load on cluster
• Reducing time lag (which also reduces I/O load)
• Need to interpolate values when applying formulas
between time series (see the interpolation sketch below)
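The interpolation point refers to combining series whose surviving points no longer share timestamps (for example, taking a ratio of two compressed series). A simple linear-interpolation sketch; the function names and the ratio example are illustrative assumptions, not the ODS formula API:

```python
# Linear interpolation at an arbitrary timestamp, for combining two compressed
# series whose kept points no longer line up. Assumes points sorted by time.
import bisect

def value_at(points, t):
    times = [p[0] for p in points]
    i = bisect.bisect_left(times, t)
    if i == len(points):
        return points[-1][1]      # clamp past the end
    if i == 0 or times[i] == t:
        return points[i][1]       # exact hit, or clamp before the start
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def ratio_at(series_a, series_b, t):
    """Example formula between two series: a(t) / b(t)."""
    return value_at(series_a, t) / value_at(series_b, t)
```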
Thanks!
• Mentor: Justin Teller
• ODS team: Vinod, Charles, Tuomas, Scott, Alex, Ostap, Jason
• Other helpful interns: Marjori, Michal, Mateusz, Arash
• Special thanks to: Adela and Alexey
• Facebook!
Questions?

Editor's Notes

  1. Meeting notes (8/22/13 13:40): store the number of data points; check HDFS space usage.
  2. Meeting notes (8/22/13 13:40): to make it easier for rapido to figure out, store points of 0 and 1 to mark whether data is present or not (this also compresses well in HBase).
  3. Meeting notes (8/6/13 11:43): a hybrid approach - try both and pick the one with the least error; check the series first: if spiky, use bottom-up, if smooth, use rollups. Meeting notes (8/22/13 13:40): use an overlapping time window of 4 minutes into the past.