Esperwhispering:
Using Esper to Find Problems in Real-time Data




             Real-time and real(ly) big


Who am I? @postwait on Twitter


       Author of “Scalable Internet Architectures”
       Pearson, ISBN: 067232699X

       Contributor to “Web Operations”
       O’Reilly, ISBN: 1449377440



       Founder of OmniTI, Message Systems, Fontdeck, & Circonus
       I like to tackle problems that are “always on” and “always growing.”




       I am an Engineer
       A practitioner of academic computing.
       IEEE member and Senior ACM member.
       On the Editorial Board of ACM’s Queue magazine.
       On the ACM professions board.


What is BigData?




    •   Few agree.

    •   I say it is any data-related problem that
        can’t be solved (well) on one machine.

    •   Never use a distributed system to solve a problem
        that can be easily solved on a single system:

        •   performance

        •   simplicity

        •   debuggability




Framing the data problem




      •   events... to make it web-related, let’s say it is web activity

      •   for every user action, we have an event

      •   an event is composed of about 20-30 known attributes
          (say ~400 bytes)

          •   url, referrer, site category,

          •   ip address, ASN, geo location info,

          •   user-perceived performance info (like load time)




Framing the volume problem




      •   We see 100 of these per second on a site

          •   Easy problem (more or less)

      •   We run SaaS, so we need to support 2000 customers:

          •   200,000 events/second
              (or 30x = 6,000,000 column appends/second)
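A quick check of that arithmetic, sketched in Java (the language Esper itself is written in). The per-site rate, customer count, and ~30 attributes per event come from the slide; the class and method names are illustrative, and treating each attribute of each event as one column append is our reading of the “30x” figure.

```java
// Back-of-the-envelope event volume for the SaaS scenario above.
// Inputs come from the slide; names are illustrative only.
final class VolumeMath {
    private VolumeMath() {}

    /** Aggregate event rate across all customer sites. */
    static long totalEventsPerSecond(long customers, long eventsPerSitePerSecond) {
        return customers * eventsPerSitePerSecond;
    }

    /** Assumes one column append per attribute per event (column-store layout). */
    static long columnAppendsPerSecond(long eventsPerSecond, long attributesPerEvent) {
        return eventsPerSecond * attributesPerEvent;
    }
}
```

With the slide's inputs, `columnAppendsPerSecond(totalEventsPerSecond(2000, 100), 30)` evaluates to 6,000,000.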




What do we want?



        •   I want answers, dammit

        •   I would like to know what is slow (or fast) by

            •   ASN

            •   geo location

            •   browser type

        •   I’d also like to know given an event:

            •   is it outside the average +/- 2 x σ

            •   over the last 5 minutes
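One plain-Java way to answer that second question, assuming load times arrive with caller-supplied millisecond timestamps: keep the last 5 minutes of samples, maintain a running sum and sum of squares, and flag a new value that falls outside mean ± 2σ. `SlidingSigmaCheck` is a hypothetical name and none of this is Esper API; it only illustrates the computation a standing query would perform.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding time window of samples; flags a value outside mean +/- k*sigma.
// Hypothetical sketch, not Esper API.
final class SlidingSigmaCheck {
    private static final class Sample {
        final long timestampMillis;
        final double value;

        Sample(long timestampMillis, double value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    private final long windowMillis;
    private final double k;
    private final Deque<Sample> window = new ArrayDeque<>();
    private double sum;
    private double sumSq;

    SlidingSigmaCheck(long windowMillis, double k) {
        this.windowMillis = windowMillis;
        this.k = k;
    }

    /** True if value lies outside mean +/- k*sigma of the samples already in the window. */
    boolean isOutlier(long nowMillis, double value) {
        evictOlderThan(nowMillis - windowMillis);
        boolean outlier = false;
        int n = window.size();
        if (n >= 2) { // a sample standard deviation needs at least two points
            double mean = sum / n;
            double variance = Math.max(0.0, (sumSq - n * mean * mean) / (n - 1));
            outlier = Math.abs(value - mean) > k * Math.sqrt(variance);
        }
        window.addLast(new Sample(nowMillis, value));
        sum += value;
        sumSq += value * value;
        return outlier;
    }

    /** Drop samples whose timestamp is at or before the cutoff. */
    private void evictOlderThan(long cutoffMillis) {
        while (!window.isEmpty() && window.peekFirst().timestampMillis <= cutoffMillis) {
            Sample old = window.removeFirst();
            sum -= old.value;
            sumSq -= old.value * old.value;
        }
    }
}
```

This sketch compares the new value against the window as it stood before the value arrived, so a spike cannot inflate its own σ; that is a design choice here, not a claim about Esper. The sum-of-squares formula is adequate for a sketch but can lose precision over long windows of large values.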




What else do we want?




        •   I want answers now, dammit


                            defined: not later




What is real-time?




       •   The correctness of the answer depends on both the logical
           correctness of the result and temporal proximity of the result and
           the question.

           •   hard real-time: old answers are worthless.

           •   soft real-time: old answers are worth less.
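To make “worthless” versus “worth less” concrete, here is a toy value-of-an-answer function. The deadline and the linear decay after it are arbitrary assumptions chosen for illustration, not a standard model.

```java
// Value of an answer (1.0 = full value) as a function of its latency.
// The linear decay in the soft case is an assumed shape, for illustration only.
final class AnswerValue {
    private AnswerValue() {}

    /** Hard real-time: full value up to the deadline, worthless afterwards. */
    static double hard(long latencyMillis, long deadlineMillis) {
        return latencyMillis <= deadlineMillis ? 1.0 : 0.0;
    }

    /** Soft real-time: full value up to the deadline, then decays linearly to zero over graceMillis. */
    static double soft(long latencyMillis, long deadlineMillis, long graceMillis) {
        if (latencyMillis <= deadlineMillis) {
            return 1.0;
        }
        double late = latencyMillis - deadlineMillis;
        return Math.max(0.0, 1.0 - late / graceMillis);
    }
}
```

For example, with a 1-second deadline and a 1-second grace period, an answer delivered after 1.5 seconds is worth 0.5 under the soft model and 0 under the hard one.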




Real-time on the Internet



     •   Hard real-time systems on the Internet;
         this sort of thing ain’t my bag, baby!




     •   Someone is just going to get hurt.




Soft real-time?




     •   We need soft real-time systems any time we are going to react to a user.

     •   If the answer is either wrong or late, it is less relevant to them.

     •   The problems we look at have temporal constraints ranging from
         5 seconds (counters and statistics) to
         1 second (fraud detection) to
         10 milliseconds (user-action reaction) and
         everywhere in between.




Enter CEP




     •   Complex Event Processing...

         •   Queries always running.

         •   Tuples introduced.

         •   Tuples emitted.

     •   EsperTech’s Esper is my hero.
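The inversion CEP makes (the query stays put while tuples flow through it) can be sketched in a few lines of Java. This is not Esper's API: `ContinuousQuery`, `send`, and `addListener` are hypothetical names, and the “query” is just a where-predicate plus a select-projection evaluated once per introduced tuple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// A toy "always running" query: registered once, then evaluated against every
// tuple introduced, emitting output tuples to listeners. Hypothetical names.
final class ContinuousQuery {
    private final Predicate<Map<String, Object>> where;
    private final Function<Map<String, Object>, Object> select;
    private final List<Consumer<Object>> listeners = new ArrayList<>();

    ContinuousQuery(Predicate<Map<String, Object>> where,
                    Function<Map<String, Object>, Object> select) {
        this.where = where;
        this.select = select;
    }

    void addListener(Consumer<Object> listener) {
        listeners.add(listener);
    }

    /** Introduce one tuple; if it matches, the projected tuple is emitted immediately. */
    void send(Map<String, Object> tuple) {
        if (where.test(tuple)) {
            Object out = select.apply(tuple);
            for (Consumer<Object> listener : listeners) {
                listener.accept(out);
            }
        }
    }
}
```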




Typical (OmniTI) Esper deployment:



     [Diagram labels: custom Java glue; Application, Infrastructure, Cloud]

More concretely




     •   node.js listens for web requests and submits data to Esper via AMQP

     •   Esper runs “magic”

     •   The output of that magic is pushed back via AMQP

     •   node.js listens and returns the results over JSONP.
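The shape of that round trip can be mocked in-process for testing. In this sketch two `BlockingQueue`s stand in for the inbound and outbound AMQP queues, and a per-client hit count stands in for the “magic”; the class name, the message format (a bare client token in, `token=count` out), and the single-threaded pump are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// In-process stand-in for: node.js -> AMQP -> Esper -> AMQP -> node.js.
// Each inbound message is one hit's client token; for each, the "engine"
// publishes "token=count" downstream. Hypothetical throughout.
final class PipelineSketch {
    final BlockingQueue<String> inbound = new LinkedBlockingQueue<>();
    final BlockingQueue<String> outbound = new LinkedBlockingQueue<>();
    private final Map<String, Integer> hitsPerClient = new HashMap<>();

    /** Drain whatever is waiting inbound and push one result per message outbound. */
    void pumpOnce() {
        String token;
        while ((token = inbound.poll()) != null) {
            int count = hitsPerClient.merge(token, 1, Integer::sum);
            outbound.add(token + "=" + count);
        }
    }
}
```

In a real deployment a consumer thread would block on the inbound queue rather than poll it, and an AMQP client library would replace both queues.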




What our event really looks like:
{
    // Client Token
    '_ls_part': { 'type': 'String' },

    // HTTP Info
    'url_schema': { 'type': 'String' },
    'url_host': { 'type': 'String' },
    'url': { 'type': 'String' },
    'referrer_schema': { 'type': 'String' },
    'referrer_host': { 'type': 'String' },
    'referrer_path': { 'type': 'String' },
    'method': { 'type': 'String' },
    'http_version': { 'type': 'String' },
    'browser': { 'type': 'String' },
    'browser_version': { 'type': 'String' },

    // User Location
    'ip': { 'type': 'String' },
    'asn': { 'type': 'Integer' },
    'asn_orgname': { 'type': 'String' },
    'map_id': { 'type': 'String' },
    'geoip_longitude': { 'type': 'Double' },
    'geoip_latitude': { 'type': 'Double' },
    'geoip_country_code': { 'type': 'String' },
    'geoip_continent_code': { 'type': 'String' },
    'geoip_region': { 'type': 'String' },
    'geoip_metro_code': { 'type': 'Integer' },
    'geoip_country': { 'type': 'String' },
    'geoip_city': { 'type': 'String' },
    'geoip_area_code': { 'type': 'Integer' },

    // User Perceived Performance Data
    'red_time': { 'type': 'Double' },
    'dns_time': { 'type': 'Double' },
    'con_time': { 'type': 'Double' },
    'req_start': { 'type': 'Double' },
    'res_start': { 'type': 'Double' },
    'res_end': { 'type': 'Double' },
    'dom_time': { 'type': 'Double' },
    'load_time': { 'type': 'Double' }
}
First steps for simplicity




     •   I want to create a named window holding 30 minutes of data for a specific
         client and populate it with that client’s “hit” events:

         create window fl9875309_hit30m.win:time(30 minute) as hit
         insert into fl9875309_hit30m select * from hit(_ls_part='fl9875309')


     •   Some useful thoughts:

         •   data flowing into this window: the insert stream, “istream”

         •   data flowing out of this window (after 30 minutes): the remove stream, “rstream”

         •   if you are interested in both streams, ask for the “irstream”
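A toy model of those three terms, assuming a millisecond clock supplied by the caller: events entering a 30-minute time window are recorded on an insert stream, and events aging out are recorded on a remove stream. `TimeWindowSketch` is a hypothetical illustration, not how Esper implements `win:time`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy time window: arrivals go to the insert stream (istream), expiries go to
// the remove stream (rstream); reading both lists is the "irstream" view.
// Hypothetical illustration; not Esper's win:time implementation.
final class TimeWindowSketch<E> {
    private static final class Entry<T> {
        final long arrivedMillis;
        final T event;

        Entry(long arrivedMillis, T event) {
            this.arrivedMillis = arrivedMillis;
            this.event = event;
        }
    }

    private final long lengthMillis;
    private final Deque<Entry<E>> contents = new ArrayDeque<>();
    final List<E> istream = new ArrayList<>();
    final List<E> rstream = new ArrayList<>();

    TimeWindowSketch(long lengthMillis) {
        this.lengthMillis = lengthMillis;
    }

    /** Move the clock forward, expiring events that have been in the window for lengthMillis or more. */
    void advanceTo(long nowMillis) {
        while (!contents.isEmpty()
                && nowMillis - contents.peekFirst().arrivedMillis >= lengthMillis) {
            rstream.add(contents.removeFirst().event);
        }
    }

    /** An event enters the window at nowMillis. */
    void insert(long nowMillis, E event) {
        advanceTo(nowMillis);
        contents.addLast(new Entry<>(nowMillis, event));
        istream.add(event);
    }

    int size() {
        return contents.size();
    }
}
```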




Asking a question:

     •   EPL, as you can see, looks much like SQL, so:

         select count(*) from fl9875309_hit30m


     •   SQLers will be very surprised by the result of this...

         •     ideas?

     •   Hint: this query runs forever and emits results as available

     •   Esper defaults to using the istream of the events from which it selects

     •   So:

         •     this statement emits a result on each event entering the window

         •     and the return set is the total number of events within the window

     •   We really wanted:

         select irstream count(*) from fl9875309_hit30m
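The difference between the two statements can be mimicked by replaying a sequence of window changes. In this hypothetical model the count always tracks the window contents, but with the default istream only arrivals produce output, so expirations go unreported until the next arrival; with `irstream` every change produces output. `CountEmissions` is an illustrative name, not an Esper class.

```java
import java.util.ArrayList;
import java.util.List;

// Models when "select count(*)" vs "select irstream count(*)" would emit,
// given window arrivals (+1) and expirations (-1). Illustrative only.
final class CountEmissions {
    private CountEmissions() {}

    /**
     * @param changes  +1 for an event entering the window, -1 for one leaving
     * @param irstream if false, only arrivals trigger output (default istream)
     * @return the current count after each change that triggers output, in order
     */
    static List<Integer> emitted(int[] changes, boolean irstream) {
        List<Integer> out = new ArrayList<>();
        int count = 0;
        for (int change : changes) {
            count += change; // the aggregate always reflects the window contents
            if (change > 0 || irstream) {
                out.add(count);
            }
        }
        return out;
    }
}
```

Replaying three arrivals followed by two expirations yields `[1, 2, 3]` in the default mode and `[1, 2, 3, 2, 1]` with `irstream`.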



Asking a (cooler) question:




     •   I’d like to know the view volume by referring site, so:

         select irstream referrer_host, count(*) as views
         from fl9875309_hit30m
         where referrer_host <> url_host
         group by referrer_host


     •   This outputs on any event entering or leaving the window... but,

         •   it only outputs the group that is being updated by the event(s)
             entering and/or leaving the window...

         •   (perhaps) not so useful
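The per-group behavior can be modeled the same way: each arrival (+1) or expiry (-1) updates one referrer's count, and only that referrer's row comes out. `GroupedCounts` and its `apply` method are hypothetical; a `null` return stands for an event filtered out by `where referrer_host <> url_host`.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashMap;
import java.util.Map;

// Models the grouped irstream query: one window change updates one group,
// and only that group's row is output. Hypothetical names.
final class GroupedCounts {
    private final Map<String, Integer> views = new HashMap<>();

    /**
     * Apply one window change (+1 entering, -1 leaving).
     * Returns the single row that would be output, or null if filtered out.
     */
    Map.Entry<String, Integer> apply(String referrerHost, String urlHost, int change) {
        if (referrerHost.equals(urlHost)) {
            return null; // where referrer_host <> url_host
        }
        int updated = views.merge(referrerHost, change, Integer::sum);
        return new SimpleImmutableEntry<>(referrerHost, updated);
    }
}
```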




Snapshots




     •   Sometimes you want to see the complete state.

     •   Given that we’re asynchronous, we can decouple the output from the input.

     •   Let’s get the top 10 referrers, every 5 seconds.

         select irstream referrer_host, count(*) as views
         from fl9875309_hit30m
         where referrer_host <> url_host
         group by referrer_host
         output snapshot every 5 seconds
         order by count(*) desc
         limit 10
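What each 5-second snapshot contains can be sketched as a pure function over the current per-referrer counts: sort by count descending and keep the first 10. The timer that fires it every 5 seconds is left out, `TopReferrers.snapshot` is a hypothetical name, and breaking ties by host name is this sketch's choice for deterministic output.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// The full current state, sorted and truncated: what a snapshot-style output
// would deliver each time its timer fires. Scheduling is omitted.
final class TopReferrers {
    private TopReferrers() {}

    static List<Map.Entry<String, Integer>> snapshot(Map<String, Integer> viewsByReferrer, int limit) {
        return viewsByReferrer.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(limit)
                // copy entries so the snapshot does not change if the map does
                .map(e -> Map.entry(e.getKey(), e.getValue()))
                .collect(Collectors.toList());
    }
}
```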




Finding anomalies...



     •   Note: this is very very simplistic.

     •   I’d like to break the dataset out by network (AS)

     •   I’d like to find individual hits whose load_time is
         greater than the average + 3 times the standard deviation

     •   I’d like details about the hit’s IP, browser and load_time

         select asn_orgname, browser_version, ip, load_time,
                average, stddev, datapoints as sample_size
           from fl9875309_hit30m(load_time is not null)
                  .std:groupwin(asn_orgname)
                  .stat:uni(load_time, ip, browser_version, load_time) as s
          where s.load_time > s.average + 3 * s.stddev
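For readers without Esper at hand, the logic of that statement can be approximated in plain Java: one set of univariate statistics per `asn_orgname` group, updated as hits enter (and, for a time window, leave), with a hit reported when its `load_time` exceeds average + 3 × stddev. Two assumptions: stddev is taken to be the sample standard deviation, and the current hit is counted in its group's statistics before the comparison, which is our reading of the `stat:uni` output. `PerGroupOutliers` and its methods are hypothetical, and the 30-minute expiry timer is omitted.

```java
import java.util.HashMap;
import java.util.Map;

// Per-group univariate stats (datapoints, average, stddev) with a k-sigma test.
// Hypothetical approximation of groupwin + uni, not Esper code.
final class PerGroupOutliers {
    static final class UniStats {
        long datapoints;
        double sum;
        double sumSq;

        void add(double x) { datapoints++; sum += x; sumSq += x * x; }

        void remove(double x) { datapoints--; sum -= x; sumSq -= x * x; }

        double average() { return datapoints == 0 ? Double.NaN : sum / datapoints; }

        /** Sample standard deviation; NaN with fewer than two datapoints. */
        double stddev() {
            if (datapoints < 2) return Double.NaN;
            double mean = sum / datapoints;
            double var = (sumSq - datapoints * mean * mean) / (datapoints - 1);
            return Math.sqrt(Math.max(0.0, var));
        }
    }

    private final double k;
    private final Map<String, UniStats> byGroup = new HashMap<>();

    PerGroupOutliers(double k) { this.k = k; }

    /** A hit entered the window: update its group, then test it. Null load times are skipped. */
    boolean enter(String asnOrgname, Double loadTime) {
        if (loadTime == null) return false; // mirrors (load_time is not null)
        UniStats s = byGroup.computeIfAbsent(asnOrgname, g -> new UniStats());
        s.add(loadTime);
        // Comparisons with NaN are false, so groups with one datapoint never flag.
        return loadTime > s.average() + k * s.stddev();
    }

    /** A hit left the window (for example, aged out after 30 minutes). */
    void leave(String asnOrgname, Double loadTime) {
        if (loadTime == null) return;
        UniStats s = byGroup.get(asnOrgname);
        if (s != null) s.remove(loadTime);
    }

    UniStats stats(String asnOrgname) { return byGroup.get(asnOrgname); }
}
```

Because the current hit is included in its own group's statistics, a single value can only exceed three standard deviations once the group holds roughly a dozen datapoints, which is one more reason the slide calls this simplistic.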




Mapping it all out.

     •   Looking at performance: a world’s-eye view



What does this all mean?



     •   Big data is all relative.

         •   100 records/s at 400 bytes each is... ~3GB/day or ~1TB/year

         •   100,000 records/s is... ~3TB/day or ~1PB/year

         •   500,000 records/s is... ~15TB/day or ~5PB/year

     •   Which is big data? You choose.

     •   The technology that can act on this in real-time exists and is different
         from the technologies used to store it and crunch it.

     •   Don’t think big... think efficient.
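The storage figures above follow from the 400-byte record size and 86,400 seconds per day. This sketch reproduces the raw byte counts; the helper names are illustrative. The slide's numbers line up with binary units rounded down: 500,000 records/s is 17.28 × 10^12 bytes/day, about 15.7 TiB, and about 5.6 PiB/year.

```java
// Raw storage volume from record rate and record size (decimal bytes).
// Names are illustrative; convert to TiB/PiB by dividing by 2^40 / 2^50.
final class StorageMath {
    static final long SECONDS_PER_DAY = 86_400L;

    private StorageMath() {}

    static long bytesPerDay(long recordsPerSecond, long bytesPerRecord) {
        return recordsPerSecond * bytesPerRecord * SECONDS_PER_DAY;
    }

    static long bytesPerYear(long recordsPerSecond, long bytesPerRecord) {
        return bytesPerDay(recordsPerSecond, bytesPerRecord) * 365L;
    }
}
```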
Thank You

    • Thank you

    • Consider attending:
            Surge 2011
            discussing scalability matters,
            because scalability matters


    • Thank you!
