SnowPlow

How Apache Hive and other big data technologies are transforming web analytics
How Hive is used at SnowPlow
Strengths and weaknesses of Hive vs alternatives

http://snowplowanalytics.com
@snowplowdata
@yalisassoon
Some history

       Year   Web analytics                    Big data

       1990   Web is born
       1993   Log file based web analytics
       1996
       1997   Javascript tagging
       2004                                    Google publishes MapReduce paper
       2006                                    Hadoop project split out of Nutch
       2008                                    Facebook develops Hive
       2010                                    Google publishes Dremel paper
Implications

•   Web analytics solutions were developed on the assumption that granular, event-level and
    customer-level data was too expensive to store and query

•   Data is aggregated from the start; data collection and analysis are tightly coupled

•   Web analytics is limited
     –   Hits
     –   Clicks, unique visitors, conversions
     –   Traffic sources

•   Web analytics is siloed (separate tools vs other data sets)
     –   Hard to link to customer data (e.g. CRM)
     –   Hard to link to marketing data (e.g. DoubleClick)
     –   Hard to link to financial data (e.g. unit profit)
Let’s reinvent web analytics

    Web analytics is one (very rich) data set that is at the heart of:

Customer analytics
•   How do my users segment by behaviour?
•   What is the customer lifetime value of my users? How can I forecast it based on their
    behaviour?
•   What are the ‘sliding doors’ moments in a customer’s journey that impact their lifetime
    value?
•   Which channels should I spend marketing budget on to acquire high-value customers?

Platform / application analytics
•   How do improvements to my application drive improved user engagement and lifetime
    value?
•   Which parts of my application should I focus development on to drive return?

Catalogue analytics
•   How are my different products (items in a shop / articles on a newspaper / videos on a
    media site) performing? What is driving the most engagement? Revenue? Profit?
•   How should I organise my catalogue online to drive the best user experience? How can I
    personalise it to different users?


                     SnowPlow: open source platform that delivers the granular web
                            analytics data, so you can perform the above
SnowPlow leverages big data and cloud technology
across its architecture. Hive on EMR is used A LOT

•   Javascript tag → pixel served from Amazon CloudFront; request to pixel (incl. query
    string) logged
•   Hive: read logs using a custom SerDe; write a single table of clean, partitioned event
    data back to S3 for ease of querying
•   S3: single, “fat” Hive table
•   Query in Hive: output to other analytics programmes, e.g. Excel, Tableau, R…
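The custom SerDe’s job is essentially to split each CloudFront access-log line into named fields. A minimal Python sketch of that parsing step — the field list here is an assumption based on the standard tab-separated CloudFront log layout of the time, not SnowPlow’s actual SerDe:

```python
# Sketch: parse a CloudFront access-log line into a dict of named fields,
# roughly what the custom Hive SerDe does for each row.
# The field list is an ASSUMPTION based on the W3C-style CloudFront format.
FIELDS = [
    "date", "time", "edge_location", "bytes", "ip", "method",
    "host", "uri_stem", "status", "referrer", "user_agent", "query_string",
]

def parse_log_line(line):
    """Split one tab-separated log line into a field dict; skip comment lines."""
    if line.startswith("#"):           # '#Version' / '#Fields' header lines
        return None
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))

example = ("2012-05-24\t11:35:53\tDUB2\t3402\t74.125.17.210\tGET\t"
           "d111111abcdef8.cloudfront.net\t/i\t200\t"
           "http://www.example.com/\tMozilla/5.0\te=pv&page=Home")
event = parse_log_line(example)
print(event["uri_stem"], event["query_string"])
```

The query string (the last field here) is where the Javascript tag encodes the actual event data, which is why logging the pixel request is enough to capture the event.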
How SnowPlow data looks in Hive:

One line per event (e.g. page view, add-to-basket), with columns grouped as:

•   User: user_id, visit_id, ip_address
•   Page: url, title
•   Marketing: source, medium, term, content, campaign, referrer
•   Event: category, action, label, property, value
•   Browser: name, family, version, type, lang…
•   OS: name, family, manufacturer
•   Device: type, is_mobile?, width, height
We ♥ Hive… but…

We like:
•   Easy to use and query (especially compared with NoSQL competitors, e.g. MongoDB)
     –   E.g. http://snowplowanalytics.com/analytics/basic-recipes.html
     –   http://snowplowanalytics.com/analytics/customer-analytics/cohort-analysis.html
•   Rapidly develop ETL and analytics queries
•   Easy to run on Amazon EMR
•   Tight integration with Amazon S3

But:
•   Hard to debug
•   Slow
•   Limited power
•   Batch based (Hadoop’s fault…)
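The cohort-analysis recipe linked above is written in HiveQL; the underlying logic is simple enough to sketch in Python (with hypothetical sample data, not SnowPlow’s actual query):

```python
# Sketch of the cohort-analysis idea: assign each user to a cohort by their
# first-seen month, then count how many users from each cohort were active
# in each month. Data is hypothetical (user_id, "YYYY-MM") visit records.
from collections import defaultdict

visits = [
    ("alice", "2012-01"), ("alice", "2012-02"), ("alice", "2012-03"),
    ("bob",   "2012-01"), ("bob",   "2012-03"),
    ("carol", "2012-02"), ("carol", "2012-03"),
]

def cohort_retention(visits):
    # Cohort = the first month each user was seen
    first_seen = {}
    for user, month in sorted(visits, key=lambda v: v[1]):
        first_seen.setdefault(user, month)
    # (cohort, month) -> set of users active that month
    active = defaultdict(set)
    for user, month in visits:
        active[(first_seen[user], month)].add(user)
    return {key: len(users) for key, users in active.items()}

print(cohort_retention(visits))
# e.g. the 2012-01 cohort has 2 users, and both are active again in 2012-03
```

In Hive the same grouping is a self-join of the events table against each user’s minimum visit month; the point of the recipes is that granular event data makes such questions a single query.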
For storage and analytics, columnar databases provide
an attractive alternative

Hive on EMR:
  •   Scales horizontally – to petabytes at least
  •   Pay-as-you-go (on EMR) – each query costs $
  •   An increasing number of front-ends can be ‘plugged in’, e.g. Toad for Cloud Databases

Columnar databases:
  •   Scales to terabytes (not petabytes)
  •   Fixed cost (dedicated analytics server with LOTs of RAM)
  •   Significantly faster – seconds, not minutes
  •   Plug in to many analytics front ends, e.g. Tableau, Qlikview, R
For segmentation, personalisation and recommendation
on web analytics data, you can’t beat Mahout

Hive:
You can do computations that do not fit the SQL processing model, incl. machine
learning, in Hive via transformation scripts…

   CREATE TABLE docs(contents STRING);

   FROM (
     MAP docs.contents USING 'tokenizer_script' AS word, cnt
     FROM docs
     CLUSTER BY word
   ) map_output
   REDUCE map_output.word, map_output.cnt
   USING 'count_script' AS word, cnt;

… but why would you?

Mahout:
•   Large number of recommendation, clustering and categorisation algorithms
•   Plays well with Hadoop
•   Large, active developer community
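The 'tokenizer_script' and 'count_script' in the example above are external programs that Hive streams rows through: tab-separated rows arrive on stdin and go back out on stdout. A hedged sketch of what such scripts typically look like — illustrative Python, not any actual SnowPlow script, written as functions rather than stdin loops so the logic is visible:

```python
# Sketch of the two streaming scripts from the Hive word-count example.
# In production each would be a standalone script looping over sys.stdin
# and printing to stdout; here they are plain generator functions.

def tokenize(lines):
    """tokenizer_script: emit one 'word<TAB>1' row per word in each doc."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word

def count(lines):
    """count_script: sum counts over each run of identical words.
    Relies on Hive's CLUSTER BY word to group identical words together."""
    current, total = None, 0
    for line in lines:
        word, cnt = line.strip().split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(cnt)
    if current is not None:
        yield "%s\t%d" % (current, total)
```

The CLUSTER BY in the HiveQL is what makes the single-pass count correct: the reducer only ever sees each word as one contiguous run.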
For ETL in production, you really need something more
robust than Hive
•   ETL: need to define sophisticated data pipelines so that there is:
     –   A clear audit path: which lines of data have been processed, and which have not
     –   Where they have not, error-handling flows to deal with those lines (including potential reprocessing)
     –   Graceful failure (a bad line should not shut down the whole job)
     –   Easy debugging when things go wrong: diagnose the problem and start again where you left off…

•   An alternative to Hive we are exploring: Cascading
     –   Java framework for developing Hadoop-powered data processing applications
     –   Scala (Scalding) and Clojure (Cascalog) wrappers available
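The error-handling requirements above can be sketched in a few lines (a hypothetical structure, not Cascading’s API): route failing lines to an error flow for later reprocessing instead of aborting the job, and keep an audit count of what was and wasn’t processed.

```python
# Hypothetical sketch of a fail-gracefully ETL step: good lines go to the
# output, bad lines are routed (with the failure reason and line number) to
# an error flow for inspection/reprocessing, and nothing aborts the job.
def run_step(lines, transform):
    processed, errors = [], []
    for n, line in enumerate(lines, start=1):
        try:
            processed.append(transform(line))
        except Exception as exc:          # fail this line, not the whole job
            errors.append({"line_no": n, "line": line, "reason": str(exc)})
    audit = {"total": len(processed) + len(errors),
             "ok": len(processed), "failed": len(errors)}
    return processed, errors, audit

# Example: a transform that rejects malformed rows
good, bad, audit = run_step(
    ["1\tpv", "oops", "2\tpv"],
    lambda line: line.split("\t")[1],
)
```

Cascading provides this kind of structure (flows, traps for failed tuples) as a first-class framework concept, which is what plain HiveQL lacks.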
Where we’re going with Hive @ SnowPlow

•   Javascript tag → pixel served from Amazon CloudFront; request to pixel (incl. query
    string) logged
•   Scalding (Cascading), with a Ruby wrapper, replaces Hive for ETL
•   Infobright holds the event data
•   Query Infobright from BI tools (e.g. Tableau, Qlikview, Pentaho), data exploration
    tools (e.g. R, Excel) and MI tools (e.g. Mahout)

                     Hive for ad hoc analytics on the atomic data
                    Hive for SnowPlow users with petabytes of data
             But… for most users… Hive is NOT part of the core flow
Any questions?



                 http://snowplowanalytics.com




                 http://github.com/snowplow




                 @snowplowdata

                 @yalisassoon

How we use Hive at SnowPlow, and how the role of Hive is changing
