Esperwhispering: Using Esper to Find Problems in Real-time Data / Real-time and real(ly) big
Who am I? • @postwait on twitter • Author of “Scalable Internet Architectures” Pearson, ISBN: 067232699X • Contributor to “Web Operations” O’Reilly, ISBN: 1449377440 • Founder of OmniTI, Message Systems, Fontdeck, & Circonus • I like to tackle problems that are “always on” and “always growing.” • I am an Engineer • A practitioner of academic computing. • IEEE member and Senior ACM member. • On the Editorial Board of ACM’s Queue magazine. • On the ACM professions board.
What is BigData? • Few agree. • I say it is any data-related problem that can’t be solved (well) on one machine. • Never use a distributed system to solve a problem that can be easily solved on a single system: • performance • simplicity • debuggability
Framing the data problem • events... to make it web related, let’s say it is web activity • for every user action, we have an event • an event is composed of about 20-30 known attributes (say ~400 bytes) • url, referrer, site category • ip address, ASN, geo location info • user-perceived performance info (like load time)
Framing the volume problem • We see 100 of these per second on a site • Easy problem (more or less) • We run SaaS, so we need to support 2000 customers: • 200,000 events/second (or, at 30 attributes each, 6,000,000 column appends/second)
What do we want? • I want answers, dammit • I would like to know what is slow (or fast) by • ASN • geo location • browser type • I’d also like to know, given an event: • is it outside the average ± 2σ • over the last 5 minutes
What else do we want? • I want answers now, dammit • “now” defined: not later
What is real-time? • The correctness of the answer depends on both the logical correctness of the result and the temporal proximity of the result to the question. • hard real-time: old answers are worthless. • soft real-time: old answers are worth less.
Real-time on the Internet • Hard real-time systems on the Internet; this sort of thing ain’t my bag, baby! • Someone is just going to get hurt.
Soft real-time? • We need soft real-time systems any time we are going to react to a user. • If the answer is either wrong or late, it is less relevant to them. • The problems we look at have temporal constraints ranging from 5 seconds (counters and statistics) to 1 second (fraud detection) to 10 milliseconds (user-action reaction) and everywhere in between.
Enter CEP • Complex Event Processing... • Queries always running. • Tuples introduced. • Tuples emitted. • Esper is my hero.
More concretely • node.js listens for web requests and submits data to Esper via AMQP • Esper runs “magic” • The output of that magic is pushed back via AMQP • node.js listens and returns data back over JSONP.
First steps for simplicity • I want to create a view on 30 minutes of data for a specific client and populate that view with those “hit” events:
    create window fl9875309_hit30m.win:time(30 minute) as hit
    insert into fl9875309_hit30m select * from hit(_ls_part=fl9875309)
• Some useful thoughts: • data flowing into this window: “istream” • data also flowing out of this window (after 30 minutes): “rstream” • if you are interested in both streams, we call it: “irstream”
Asking a question: • EPL, as you can see, looks much like SQL... so
    select count(*) from fl9875309_hit30m
• SQLers will be very surprised by the result of this... • ideas? • Hint: this query runs forever and emits results as available • Esper defaults to using the istream of the events from which it selects • So: • this statement emits a result on each event entering the window • and the return set is the total number of events within the window • We really wanted:
    select irstream count(*) from fl9875309_hit30m
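To make the difference concrete, here is a toy Python model of the behaviour as described on this slide: a count over a time window that reports on every arrival, and optionally on every expiry as well. It does not call Esper and is not a model of the engine's actual output format; the function name `count_updates` and its parameters are made up for illustration.

```python
def count_updates(arrivals, window_s, end_time, irstream=False):
    """Return (time, count) updates for a count(*) over a time window.

    arrivals: arrival timestamps of events entering the window.
    irstream: if True, also report when an event leaves the window.
    """
    # Every event contributes +1 when it arrives and -1 when it expires.
    timeline = [(t, +1) for t in arrivals]
    timeline += [(t + window_s, -1) for t in arrivals]
    timeline.sort()  # at equal times, expiries (-1) sort before arrivals (+1)

    updates = []
    count = 0
    for t, delta in timeline:
        if t > end_time:
            break
        count += delta
        # Arrivals always produce output; expiries only with irstream.
        if delta > 0 or irstream:
            updates.append((t, count))
    return updates
```

With three hits at t = 0, 10 and 20 seconds and a 30-second window, the arrival-only variant reports the count three times and then goes quiet, leaving a stale "3" on screen; the `irstream=True` variant also reports the count falling back to zero as the hits expire.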
Asking a (cooler) question: • I’d like to know the view volume by referring site... so
    select irstream referrer_host, count(*) as views
    from fl9875309_hit30m
    where referrer_host <> url_host
    group by referrer_host
• This outputs on any event entering or leaving the window... but, • it only outputs the group that is being updated by the event(s) entering and/or leaving the window... • (perhaps) not so useful
Snapshots • Sometimes you want to see the complete state. • Given that we’re async, we can decouple the output from the input. • Let’s get the top 10 referrers, every 5 seconds.
    select irstream referrer_host, count(*) as views
    from fl9875309_hit30m
    where referrer_host <> url_host
    group by referrer_host
    output snapshot every 5 seconds
    order by count(*) desc
    limit 10
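As a rough illustration of what one such snapshot contains, the following Python sketch computes the top referrers over the last 30 minutes from a list of hits. It is not Esper code; the function name `top_referrers` and the `ts` field (seconds) are assumptions, while `referrer_host` and `url_host` follow the statement above.

```python
from collections import Counter


def top_referrers(hits, now, window_s=1800, limit=10):
    """Return the `limit` busiest external referrers as (host, views) pairs.

    hits: iterable of dicts with "ts", "referrer_host" and "url_host" keys.
    Only hits inside the last `window_s` seconds are counted, and
    self-referrals (referrer_host == url_host) are skipped.
    """
    views = Counter(
        h["referrer_host"]
        for h in hits
        if now - window_s < h["ts"] <= now
        and h["referrer_host"] != h["url_host"]
    )
    # most_common sorts by count, descending.
    return views.most_common(limit)
```

A caller would invoke this on a 5-second timer rather than on every event, which is the decoupling of output from input that the slide mentions.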
Finding anomalies... • Note: this is very, very simplistic. • I’d like to break the dataset out by network (AS) • I’d like to find individual hits whose load_time is greater than the average + 3 times the standard deviation • I’d like details about the hit’s IP, browser and load_time
    select asn_orgname, browser_version, ip, load_time, average, stddev, datapoints as sample_size
    from fl9875309_hit30m(load_time is not null)
        .std:groupwin(asn_orgname)
        .stat:uni(load_time, ip, browser_version, load_time) as s
    where s.load_time > s.average + 3 * s.stddev
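The idea behind the statement (per-network statistics, then flag hits slower than average + 3 standard deviations for their own network) can be sketched in Python as follows. This is a simplified stand-in, not a translation of Esper's views: it keeps running statistics per `asn_orgname` with Welford's algorithm, ignores the 30-minute expiry, uses the sample standard deviation, and includes the current hit in the statistics before testing it. All of those choices, and the names `RunningStats` and `find_slow_hits`, are assumptions.

```python
from collections import defaultdict
from math import sqrt


class RunningStats:
    """Running mean and sample standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def stddev(self):
        return sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def find_slow_hits(hits, k=3.0):
    """Return hits whose load_time exceeds their network's mean + k*stddev."""
    per_asn = defaultdict(RunningStats)  # one set of statistics per network
    flagged = []
    for hit in hits:
        if hit.get("load_time") is None:
            continue  # mirrors the "load_time is not null" filter
        stats = per_asn[hit["asn_orgname"]]
        stats.add(hit["load_time"])
        if hit["load_time"] > stats.mean + k * stats.stddev:
            flagged.append(hit)
    return flagged
```

Because each network has its own baseline, a network that is uniformly slow flags nothing, while a single slow hit on an otherwise fast network stands out once enough samples have accumulated.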
Mapping it all out. • Looking at performance: a world’s-eye view
What’s this all mean? • Big data is all relative. • 100 records/s at 400 bytes each is... ~3GB/day or ~1TB/year • 100,000 records/s is... ~3TB/day or ~1PB/year • 500,000 records/s is... ~15TB/day or ~5PB/year • Which is big data? You choose. • The technology that can act on this in real-time exists and is different from the technologies to store it and crunch it. • Don’t think big... think efficient.
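The volume figures above are simple multiplication, and a few lines of Python reproduce them. The helper name `daily_and_yearly_bytes` is made up for this sketch; the 400-byte record size comes from the earlier framing slide, and a year is taken as 365 days.

```python
SECONDS_PER_DAY = 86_400


def daily_and_yearly_bytes(records_per_s, record_bytes=400):
    """Return (bytes per day, bytes per year) for a steady event rate."""
    per_day = records_per_s * record_bytes * SECONDS_PER_DAY
    return per_day, per_day * 365


if __name__ == "__main__":
    for rate in (100, 100_000, 500_000):
        day, year = daily_and_yearly_bytes(rate)
        # Decimal units: 1 GB = 1e9 bytes, 1 TB = 1e12 bytes.
        print(f"{rate:>7} rec/s: {day / 1e9:,.1f} GB/day, {year / 1e12:,.1f} TB/year")
```

The exact results come out somewhat above the rounded figures on the slide (about 17 TB/day at 500,000 records/s, for instance), which does not change the point being made.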
Thank You • Consider attending: Surge 2011, discussing scalability matters, because scalability matters • Thank you!