Is this normal?

There are many modern techniques for identifying anomalies in datasets. There are fewer that work as online algorithms suitable for application to real-time streaming data. What’s worse? Most of these methodologies require a deep understanding of the data itself. In this talk, we tour the options for identifying anomalies in real-time data and discuss how much we really need to know beforehand to guess at the ever-useful question: is this normal?

Published in: Technology, Business

  1. Is this normal? Finding anomalies in real-time data.
  2. Who am I? I’m Theo (@postwait on Twitter). I write a lot of code: 50+ open source projects and several commercial code bases. I wrote “Scalable Internet Architectures.” I sit on the ACM Queue and Professions boards. I spend all day looking at telemetry data at Circonus.
  3. What is real-time? Hard real-time systems are those where the outputs of a system based on specific inputs are considered incorrect if the latency of their delivery is above a specified amount. Soft real-time systems are similar, but late outputs are “less useful” instead of “incorrect.” I don’t design life support systems, avionics or other systems where lives are at stake, so it’s a soft real-time life for me.
  4. A survey of big data systems. Traditional: Oracle, Postgres, MySQL, Teradata, Vertica, Netezza, Greenplum, Tableau, K. The shiny: Hadoop, Hive, HBase, Pig, Cassandra. The real-time: SQLstream, S4, Flumebase, Truviso, Esper, Storm.
  5. Big data the old way. Relational databases, both column store and not. Just work. Likely store more data than your “big data.”
  6. Big data the distributed way. Distributed systems allow much larger data sets, but markedly change the data analytics methods. It is hard for existing quants to roll up their sleeves. They are highly scalable and accommodate growth.
  7. Big data the real-time way. What we do needs a different approach. The old systems (and even the distributed ones) are not designed for soft real-time, complex observation of data. Notable exceptions are S4 and Storm.
  8. So, what’s your problem? We have telemetry: over 10 trillion data points on near-line storage, growing super-linearly.
  9. Data, what kind? Most data is numeric: counts, averages, derivatives, stddevs, etc. Some data is: text changes (ssh fingerprints, production launches), histograms, highly dimensional event streams.
  10. Data rates. Quantity of data isn’t such a big deal (okay, yes it is, but we’ll get to that later). The rate of new data arrival makes the problem hard. Low end: 15k data points / second. High end: 300k data points / second. Growing rapidly.
  11. What we use. We use Esper. Esper is very powerful, elegantly coded and performance focused. Like any good tool that allows users to write queries... http://www.flickr.com/photos/mcertou/
  12. What we do with Esper. Detect absence in streams:
      select b from pattern [every a=Event -> (timer:interval(30 sec) and not b=Event(id=a.id, metric=a.metric))]
      Detect ad-hoc threshold violation:
      select * from Event(id="host1", metric="disk1") where value > 95
      etc. etc. etc. [1]
  13. Making the problem harder. So, it just wasn’t enough. We want to do long term trending and apply that information to anomaly detection. Think: Holt-Winters (or multivariate regressions). Look at historic data. Use that to predict the immediate future with some quantifiable confidence.
  14. How we do it. We implemented the Snowth for storage of data. [2] We implemented a C/lua distributed system to analyze 4 weeks of data (~8k statistical aggregates), yielding a prediction with confidences (triple exponential smoothing). [3] To keep the system real-time, we need to ensure that queries return in less than 2ms (our goal is 100µs).
  15. Cheating is winning. Our predictions work on 5 minute windows. 4 weeks of data is 8064 windows. Given Pred(T-8063 .. T0) -> (P1, C1). Given Pred(T-8062 .. T0, P1) -> ~(P2, C2).
  16. Tolerably inaccurate. When V arrives, we determine the prediction window WN we need. If WN isn’t in cache, we assume V is within tolerances. If WN+1 isn’t in cache, we query the Snowth for WN and WN+1, placing them in cache. Cache accesses are local and always < 100µs.
  17. I see challenges. How do I take offline data analytics techniques and apply them online to high-volume, low-latency event streams? Quickly? Without deep expertise?
  18. Thank you. Circonus is hiring: software engineers, quants, and visualization engineers.
      [1] http://esper.codehaus.org/tutorials/solution_patterns/solution_patterns.html
      [2] http://omniti.com/surge/2011/speakers/theo-schlossnagle
      [3] http://labs.omniti.com/people/jesus/papers/holtwinters.pdf
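
Slides 14 and 15 describe predicting from 4 weeks of 5 minute windows with triple exponential smoothing, then reusing the first prediction to approximate the next one. The following is a minimal Python sketch of that idea, not the C/lua system from the talk. It assumes the additive form of Holt-Winters, a simple initialization from the first two seasons, and a daily season of 288 windows; the function name and all smoothing parameters are hypothetical, and the confidence values (C1, C2) are left out.

```python
WINDOW_SECONDS = 5 * 60                                    # 5 minute windows (slide 15)
WINDOWS_PER_DAY = 24 * 60 * 60 // WINDOW_SECONDS           # 288; daily season is an assumption
HISTORY_WINDOWS = 4 * 7 * WINDOWS_PER_DAY                  # 4 weeks of windows


def holt_winters_forecast(series, season_len, alpha, beta, gamma):
    """One-step-ahead forecast using additive Holt-Winters smoothing.

    series: observed window values, oldest first.
    season_len: number of windows per seasonal cycle.
    alpha, beta, gamma: smoothing factors for level, trend and season.
    """
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")

    # Initialize level from the first season and trend from the change
    # between the means of the first two seasons.
    level = sum(series[:season_len]) / season_len
    second_mean = sum(series[season_len:2 * season_len]) / season_len
    trend = (second_mean - level) / season_len
    seasonal = [series[i] - level for i in range(season_len)]

    # Fold the remaining observations into level, trend and season.
    for t in range(season_len, len(series)):
        s = seasonal[t % season_len]
        prev_level = level
        level = alpha * (series[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (series[t] - level) + (1 - gamma) * s

    return level + trend + seasonal[len(series) % season_len]


def next_two_predictions(series, season_len, alpha, beta, gamma):
    """P1 from real data, then an approximate P2 that treats P1 as observed."""
    p1 = holt_winters_forecast(series, season_len, alpha, beta, gamma)
    # Drop the oldest window and append P1, as in Pred(T-8062 .. T0, P1).
    p2 = holt_winters_forecast(series[1:] + [p1], season_len, alpha, beta, gamma)
    return p1, p2
```

With real telemetry, `series` would hold the last `HISTORY_WINDOWS` aggregates and `season_len` would be `WINDOWS_PER_DAY` under the daily-season assumption. The second prediction is only approximate because it is built on a predicted value rather than an observed one.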

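The cache behaviour on slide 16 can be sketched in Python as follows. This is an illustration under assumptions, not the production code: the `PredictionCache` class, the `(predicted, tolerance)` shape of a cached entry, and the `fetch` callback standing in for a Snowth query are all hypothetical. The fetch here is synchronous for simplicity, and no eviction is shown.

```python
from typing import Callable, Dict, Tuple

WINDOW_SECONDS = 5 * 60  # 5 minute prediction windows

# Hypothetical shape of a cached prediction: (predicted value, allowed deviation).
Prediction = Tuple[float, float]


class PredictionCache:
    """Local cache of per-window predictions, keyed by window number."""

    def __init__(self, fetch: Callable[[int], Prediction]) -> None:
        self._fetch = fetch                      # stand-in for a storage query
        self._cache: Dict[int, Prediction] = {}

    def is_normal(self, timestamp: float, value: float) -> bool:
        wn = int(timestamp // WINDOW_SECONDS)    # prediction window for this value
        prediction = self._cache.get(wn)         # cache state when the value arrived

        # If the next window is missing, load this window and the next one
        # so that later values find them in cache.
        if wn + 1 not in self._cache:
            for window in (wn, wn + 1):
                self._cache[window] = self._fetch(window)

        # Nothing was cached for this window on arrival: assume the value is fine.
        if prediction is None:
            return True

        predicted, tolerance = prediction
        return abs(value - predicted) <= tolerance
```

The design trades accuracy for latency: a value that arrives before its window is cached is never flagged, so the check itself only ever touches local memory.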