Esperwhispering

Using Esper to extract information from real-time data streams.


Published in: Technology

Transcript

    • 1. Esperwhispering: Using Esper to Find Problems in Real-time Data / Real-time and real(ly) big
    • 2. Who am I? • @postwait on twitter • Author of “Scalable Internet Architectures” (Pearson, ISBN: 067232699X) • Contributor to “Web Operations” (O’Reilly, ISBN: 1449377440) • Founder of OmniTI, Message Systems, Fontdeck, & Circonus • I like to tackle problems that are “always on” and “always growing.” • I am an Engineer: a practitioner of academic computing. • IEEE member and Senior ACM member. • On the Editorial Board of ACM’s Queue magazine. • On the ACM professions board.
    • 3. What is BigData? • Few agree. • I say it is any data-related problem that can’t be solved (well) on one machine. • Never use a distributed system to solve a problem that can be easily solved on a single system: • performance • simplicity • debuggability
    • 4. Framing the data problem • events... to make it web related, let’s say it is web activity • for every user action, we have an event • an event is composed of about 20-30 known attributes (say ~400 bytes) • url, referrer, site category • ip address, ASN, geo location info • user-perceived performance info (like load time)
    • 5. Framing the volume problem • We see 100 of these per second on a site • Easy problem (more or less) • We run SaaS, so we need to support 2000 customers: • 200,000 events/second (or 30x = 6,000,000 column appends/second)
    • 6. What do we want? • I want answers, dammit • I would like to know what is slow (or fast) by • ASN • geo location • browser type • I’d also like to know, given an event: • is it outside the average ± 2σ • over the last 5 minutes
    • 7. What else do we want? • I want answers now, dammit
    • 8. What else do we want? • I want answers now, dammit (defined: not later)
    • 9. What is real-time? • The correctness of the answer depends on both the logical correctness of the result and temporal proximity of the result and the question. • hard real-time: old answers are worthless. • soft real-time: old answers are worth less.
    • 10. Real-time on the Internet • Hard real-time systems on the Internet; this sort of thing ain’t my bag, baby! • Someone is just going to get hurt.
    • 11. Soft real-time? • We need soft real-time systems any time we are going to react to a user. • If the answer is either wrong or late, it is less relevant to them. • The problems we look at have temporal constraints ranging from 5 seconds (counters and statistics) to 1 second (fraud detection) to 10 milliseconds (user-action reaction) and everywhere in between.
    • 12. Enter CEP • Complex Event Processing... • Queries always running. • Tuples introduced. • Tuples emitted. • EsperTech’s Esper is my hero.
    • 13. Typical (OmniTI) Esper deployment (architecture diagram): custom Java glue; Application / Infrastructure / Cloud layers
    • 14. More concretely • node.js listens for web requests and submits data to Esper via AMQP • Esper runs “magic” • The output of that magic is pushed back via AMQP • node.js listens and returns data back over JSONP.
    • 15. What our event really looks like (field groups as annotated across slides 15–19):
          {
            // Client Token
            _ls_part: { type: String },
            // HTTP Info
            url_schema: { type: String }, url_host: { type: String }, url: { type: String },
            referrer_schema: { type: String }, referrer_host: { type: String }, referrer_path: { type: String },
            ip: { type: String }, method: { type: String }, http_version: { type: String },
            browser: { type: String }, browser_version: { type: String },
            // User Location
            asn: { type: Integer }, asn_orgname: { type: String },
            geoip_longitude: { type: Double }, geoip_latitude: { type: Double },
            geoip_country_code: { type: String }, geoip_continent_code: { type: String },
            geoip_region: { type: String }, geoip_metro_code: { type: Integer },
            geoip_country: { type: String }, geoip_city: { type: String }, geoip_area_code: { type: Integer },
            // User Perceived Performance Data
            map_id: { type: String },
            red_time: { type: Double }, dns_time: { type: Double }, con_time: { type: Double },
            req_start: { type: Double }, res_start: { type: Double }, res_end: { type: Double },
            dom_time: { type: Double }, load_time: { type: Double }
          }
    • 20. First steps for simplicity • I want to create a view on 30 minutes of data for a specific client and populate that view with those “hit” events:
          create window fl9875309_hit30m.win:time(30 minute) as hit
          insert into fl9875309_hit30m select * from hit(_ls_part='fl9875309')
      • Some useful thoughts: • data flowing into this window: “istream” • data also flowing out of this window (after 30 minutes): “rstream” • if you are interested in both streams, we call it: “irstream”
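The window semantics above can be mimicked outside of Esper. A minimal Python sketch of a win:time-style sliding window, where inserting events (istream) also expires anything older than the span (rstream); the class and method names are hypothetical, not Esper API:

```python
import time
from collections import deque

class TimeWindow:
    """Minimal sketch of Esper's win:time view: events enter the window
    (the "istream") and leave it once older than `span_seconds` (the "rstream")."""

    def __init__(self, span_seconds):
        self.span = span_seconds
        self.events = deque()  # (timestamp, event) pairs, oldest first

    def insert(self, event, now=None):
        """Add an event; return the list of events this insert expired."""
        now = time.time() if now is None else now
        expired = self._expire(now)
        self.events.append((now, event))
        return expired

    def _expire(self, now):
        out = []
        while self.events and self.events[0][0] <= now - self.span:
            out.append(self.events.popleft()[1])
        return out

    def __len__(self):
        return len(self.events)

# 30-minute window; timestamps are passed explicitly to keep this deterministic
window = TimeWindow(30 * 60)
window.insert({"_ls_part": "fl9875309", "url": "/"}, now=0)
window.insert({"_ls_part": "fl9875309", "url": "/a"}, now=60)
expired = window.insert({"_ls_part": "fl9875309", "url": "/b"}, now=1801)
# the t=0 event has aged out: it lands in `expired`, two events remain
```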
    • 21. Asking a question: • EPL, as you can see, looks much like SQL... so
          select count(*) from fl9875309_hit30m
      • SQLers will be very surprised by the result of this... • ideas? • Hint: this query runs forever and emits results as available • Esper defaults to use the istream of events from which it selects • So: • this statement emits a result on each event entering the window • and the return set is the total number of events within the window • We really wanted:
          select irstream count(*) from fl9875309_hit30m
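The istream-vs-irstream difference can be made concrete with a small sketch (an approximation of Esper's emission behavior, not its API): the running count fires on every event entering the window, and with irstream it also fires when events age out.

```python
from collections import deque

def count_emissions(events, span, irstream=True):
    """Sketch of `select count(*)` (istream, the default) versus
    `select irstream count(*)` over a time window.  `events` is a list
    of arrival timestamps in seconds; returns the sequence of counts
    the continuous query would emit."""
    window = deque()
    emitted = []
    for t in events:
        removed = False
        while window and window[0] <= t - span:
            window.popleft()
            removed = True
        if removed and irstream:
            emitted.append(len(window))  # rstream update: count after expiry
        window.append(t)
        emitted.append(len(window))      # istream update: count after insert
    return emitted

# three events inside a 30-minute span, then one that expires the first:
# irstream reports the drop to 2 before the insert raises the count to 3
counts = count_emissions([0, 10, 20, 1801], span=1800)
```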
    • 22. Asking a (cooler) question: • I’d like to know the view volume by referring site... so
          select irstream referrer_host, count(*) as views
          from fl9875309_hit30m
          where referrer_host <> url_host
          group by referrer_host
      • This outputs on any event entering or leaving the window... but, • it only outputs the group that is being updated by the event(s) entering and/or leaving the window... • (perhaps) not so useful
    • 23. Snapshots • Sometimes you want to see the complete state. • Given that we’re asynchronous, we can decouple the output from the input. • Let’s get the top 10 referrers, every 5 seconds.
          select irstream referrer_host, count(*) as views
          from fl9875309_hit30m
          where referrer_host <> url_host
          group by referrer_host
          output snapshot every 5 seconds
          order by count(*) desc
          limit 10
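What the snapshot computes is simple to state in plain code. A Python sketch of one snapshot over the window's current contents (sample hostnames are hypothetical; Esper's `output snapshot every 5 seconds` would run this on a timer, decoupled from event arrival):

```python
from collections import Counter

def top_referrers(events, limit=10):
    """Count views per referrer_host, excluding self-referrals
    (referrer_host == url_host), and return the top `limit` hosts
    by view count."""
    counts = Counter(
        e["referrer_host"] for e in events if e["referrer_host"] != e["url_host"]
    )
    return counts.most_common(limit)

# hypothetical window contents
hits = [
    {"referrer_host": "news.example", "url_host": "shop.example"},
    {"referrer_host": "news.example", "url_host": "shop.example"},
    {"referrer_host": "shop.example", "url_host": "shop.example"},  # self-referral, excluded
    {"referrer_host": "search.example", "url_host": "shop.example"},
]
snapshot = top_referrers(hits)
```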
    • 24. Finding anomalies... • Note: this is very, very simplistic. • I’d like to break the dataset out by network (AS) • I’d like to find individual hits whose load_time is greater than the average + 3 times the standard deviation • I’d like details about the hit’s IP, browser and load_time
          select asn_orgname, browser_version, ip, load_time, average, stddev, datapoints as sample_size
          from fl9875309_hit30m(load_time is not null)
               .std:groupwin(asn_orgname)
               .stat:uni(load_time, ip, browser_version, load_time) as s
          where s.load_time > s.average + 3 * s.stddev
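The statistics behind std:groupwin + stat:uni are just per-group running mean and standard deviation. A Python sketch using Welford's online algorithm (population stddev; the class name is hypothetical, and the ASN label is a sample value using a reserved AS number):

```python
import math
from collections import defaultdict

class GroupStats:
    """Sketch of std:groupwin(asn_orgname) feeding stat:uni(load_time):
    running count/mean/variance per group via Welford's algorithm, used
    to flag a hit whose load_time exceeds average + 3 * stddev."""

    def __init__(self):
        self.n = defaultdict(int)
        self.mean = defaultdict(float)
        self._m2 = defaultdict(float)  # sum of squared deviations

    def update(self, group, x):
        self.n[group] += 1
        delta = x - self.mean[group]
        self.mean[group] += delta / self.n[group]
        self._m2[group] += delta * (x - self.mean[group])

    def stddev(self, group):
        n = self.n[group]
        return math.sqrt(self._m2[group] / n) if n > 1 else 0.0

    def is_anomaly(self, group, load_time):
        """Mirrors: where s.load_time > s.average + 3 * s.stddev"""
        return self.n[group] > 1 and load_time > self.mean[group] + 3 * self.stddev(group)

stats = GroupStats()
for lt in [100, 110, 90, 105, 95, 100, 98, 102]:  # typical load times for one ASN
    stats.update("AS64496 EXAMPLE-NET", lt)

# mean is 100, stddev is roughly 5.7, so 500 is far past mean + 3*stddev
anomalous = stats.is_anomaly("AS64496 EXAMPLE-NET", 500.0)
```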
    • 32. Mapping it all out. • Looking at performance: a world’s-eye view
    • 33. What’s this all mean? • Big data is all relative. • 100 records/s at 400 bytes each is... ~3GB/day or ~1TB/year • 100,000 records/s is... ~3TB/day or 1PB/year • 500,000 records/s is... ~15TB/day or 5PB/year • Which is big data? You choose. • The technology that can act on this in real-time exists and is different from the technologies used to store it and crunch it. • Don’t think big... think efficient.
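The volume figures are back-of-the-envelope arithmetic: records/s × bytes/record × 86,400 s/day. A quick sanity check (the slide's "~3GB/day" rounds down from roughly 3.5GB):

```python
def gb_per_day(records_per_second, bytes_per_record=400):
    """records/s * bytes/record * 86,400 s/day, expressed in GB (10**9 bytes)."""
    return records_per_second * bytes_per_record * 86_400 / 1e9

daily_gb = gb_per_day(100)         # 100 records/s at 400 bytes each: ~3.5 GB/day
yearly_tb = daily_gb * 365 / 1000  # ~1.26 TB/year, the slide's "~1TB/year"
```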
    • 34. Thank You • Thank you • Thank you • Thank you • Consider attending Surge 2011: discussing scalability matters, because scalability matters. • Thank you!
