Efficient ODS Time Rollup
ODS (Monitoring)
Scott Yak
August 22, 2013
Background – ODS Time rollups
▪ ODS receives, stores and displays time series data
▪ Many services, many points per minute
▪ Storing everything forever would take up a lot of space
▪ Solution: Beyond 2 days, keep only an averaged value every 15 minutes / 1 hour (rollup sketched below)
[Figure: sample time series, time axis 0–150 min]
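To make the existing rollup concrete, here is a minimal sketch of fixed-bucket averaging (illustrative only; the real ODS storage pipeline and its data model are not shown in this deck):

```python
# Minimal sketch of a fixed-bucket time rollup: average all (timestamp, value)
# points that fall into the same 15-minute bucket. Illustrative only.
from statistics import mean

def rollup(points, bucket_secs=15 * 60):
    """Return one (bucket_start, average) pair per occupied bucket."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % bucket_secs, []).append(value)
    return sorted((start, mean(vals)) for start, vals in buckets.items())

# A 40.0 spike at t=900 gets averaged down to about 20.55 in its bucket.
raw = [(0, 1.0), (60, 1.2), (900, 40.0), (960, 1.1)]
print(rollup(raw))  # roughly [(0, 1.1), (900, 20.55)]
```

This is exactly the failure mode the next slide calls out: the averaged bucket no longer shows when the spike happened or how large it was.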
Background – ODS Time rollups
▪ Problems:
▪ Can’t distinguish spikes from bumps
▪ The magnitudes of the spikes are not preserved
▪ The times of the spikes are not preserved
[Figure: sample time series, time axis 0–120 min]
Significance:
▪ Incomplete information during troubleshooting
▪ May hide issues that are not discovered within the two days of raw data
Intern project:
▪ How can we preserve the values and times of major spikes
without using more space?
Bottom-up algorithm:
• Greedy approach:
– Keep removing the least informative (most “useless”) point for as long as the
reconstructed time series stays within an error threshold (see the sketch below)
[Figure: sample time series, time axis 0–120 min]
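A minimal sketch of this greedy bottom-up removal, assuming linear interpolation between the surviving neighbours as the reconstruction and an absolute per-point error cost; the production MapReduce job and its exact error metric are not described on these slides:

```python
# Sketch of greedy bottom-up compression: repeatedly drop the interior point
# whose removal changes the reconstructed series the least, and stop once even
# the cheapest removal would exceed the error threshold.

def removal_cost(points, i):
    """Error introduced at points[i] if it is dropped and linearly interpolated."""
    (t0, v0), (t, v), (t1, v1) = points[i - 1], points[i], points[i + 1]
    interpolated = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return abs(v - interpolated)

def compress(points, max_error):
    points = list(points)
    while len(points) > 2:
        i = min(range(1, len(points) - 1), key=lambda j: removal_cost(points, j))
        if removal_cost(points, i) > max_error:
            break  # removing anything else would distort the series too much
        del points[i]
    return points

raw = [(0, 1.0), (60, 1.1), (120, 1.0), (180, 55.0), (240, 1.2), (300, 1.1)]
print(compress(raw, max_error=0.5))
# -> [(0, 1.0), (120, 1.0), (180, 55.0), (240, 1.2), (300, 1.1)]
# The t=180 spike survives with its exact time and value; the flat t=60 point is dropped.
```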
“Live” demo
[Demo charts: raw data; raw + rollups; raw + rollups + compressed; raw; rollups; compressed]
Nice properties
• Save space on “uninteresting” parts to store more
detail about “interesting” parts
• Automatically use less storage when there is less
information
• Store exact points, no need to guess
• If you think you’re missing a spike, you know it’s smaller
than the smallest spike visible in that time window.
• (From our side) Easy to tune the knob that trades higher
compression against better accuracy.
Show me the numbers!
• Matches the compression of hourly rollups with less
information loss than 15-min rollups (error metric sketched below)
– Hourly rollup’s compression factor: ~ 20.6x
– For a similar compression,
• 67% RMS error reduction against 15 min rollups
• 83% RMS error reduction against hourly rollups
• Can also match the compression factor of 15-min rollups
– 15 min rollup’s compression factor: ~ 5.3x
– For a similar compression,
• 96.9% RMS error reduction against 15 min rollups
• 98.3% RMS error reduction against hourly rollups
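For context, a hedged sketch of how an “RMS error reduction” number like those above could be computed; the exact evaluation procedure is not spelled out on the slide, so this simply compares each lossy representation against the raw series and takes one minus the ratio of RMS errors:

```python
# Hypothetical evaluation helper: rms_reduction(raw, compressed_at, rollup_at)
# returns e.g. 0.67 for a "67% RMS error reduction". Both *_at arguments are
# callables mapping a raw timestamp to the reconstructed value.
import math

def rms_error(raw, approx_at):
    """RMS difference between raw (t, v) points and a reconstruction approx_at(t)."""
    return math.sqrt(sum((v - approx_at(t)) ** 2 for t, v in raw) / len(raw))

def rms_reduction(raw, compressed_at, rollup_at):
    return 1.0 - rms_error(raw, compressed_at) / rms_error(raw, rollup_at)
```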
“Lossless” compression?
• Some graphs are too spiky; loading all this data slows
down the chart page
• What if we cap the number of points? (variant sketched below)
– Keep at most 30 points every 3 hours
• 1 point every 6 minutes on average if necessary
– After that, only remove points that provide zero information
• Result:
– Same compression as 15 min rollups, but 94% RMS error
reduction
– If chart has fewer than 30 points in a 3 hour window =>
lossless compression
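A sketch of the capped variant described above, reusing removal_cost() from the bottom-up sketch; the 30-points-per-3-hours cap comes from the slide, while the per-window handling and the “zero information” test are assumptions:

```python
# Capped variant for one 3-hour window of points: first get under the cap,
# then keep dropping only points whose removal costs nothing.
# Assumes the removal_cost() helper from the bottom-up sketch above.
def compress_capped(points, max_points=30):
    points = list(points)
    while len(points) > 2:
        i = min(range(1, len(points) - 1), key=lambda j: removal_cost(points, j))
        if len(points) <= max_points and removal_cost(points, i) > 0:
            break  # under the cap: only zero-information points may go
        del points[i]
    return points
```

When a window already has fewer than 30 informative points, nothing is removed, which is the “lossless” case on this slide.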
Drawbacks
• MapReduce job currently produces a high I/O load on the
cluster
• Harder to reason about sums and averages
• More time lag – currently needs 3 hours of past data to
perform compression
– But the raw information is available for 2 days, so it’s not really
a big issue
• Only guaranteed to store two points every 3 hours.
– Need to query over a longer time period to see at least one
point
What’s left?
• Reducing I/O load on cluster
• Reducing time lag (which also reduces I/O load)
• Need to interpolate values when applying formulas
between time series (see the interpolation sketch below)
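The interpolation point refers to combining series whose surviving points no longer share timestamps (for example, taking a ratio of two compressed series). A simple linear-interpolation sketch; the function names and the ratio example are illustrative assumptions, not the ODS formula API:

```python
# Linear interpolation at an arbitrary timestamp, for combining two compressed
# series whose kept points no longer line up. Assumes points sorted by time.
import bisect

def value_at(points, t):
    times = [p[0] for p in points]
    i = bisect.bisect_left(times, t)
    if i == len(points):
        return points[-1][1]      # clamp past the end
    if i == 0 or times[i] == t:
        return points[i][1]       # exact hit, or clamp before the start
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def ratio_at(series_a, series_b, t):
    """Example formula between two series: a(t) / b(t)."""
    return value_at(series_a, t) / value_at(series_b, t)
```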
Thanks!
• Mentor: Justin Teller
• ODS team: Vinod, Charles, Tuomas, Scott, Alex, Ostap, Jason
• Other helpful interns: Marjori, Michal, Mateusz, Arash
• Special thanks to: Adela and Alexey
• Facebook!
Questions?

Editor's Notes

  1. Meeting notes (8/22/13 13:40): store the number of data points; check HDFS space usage.
  2. Meeting notes (8/22/13 13:40): to make it easier for rapido to figure out, store points of 0 and 1 to mark whether data is present or not (this also compresses well in HBase).
  3. Meeting notes (8/6/13 11:43): a hybrid approach - try both and pick the one with the least error; check the series first: if spiky, use bottom-up, if smooth, use rollups. Meeting notes (8/22/13 13:40): use an overlapping time window of 4 minutes into the past.