SnowPlow

How Apache Hive and other big data technologies are transforming web analytics
How Hive is used at SnowPlow
Strengths and weaknesses of Hive vs alternatives

http://snowplowanalytics.com
@snowplowdata
@yalisassoon
Some history

       Year   Web analytics                    Big data

       1990   Web is born
       1993   Log file based web analytics
       1996
       1997   Javascript tagging
       2004                                    Google publishes MapReduce paper
       2006                                    Hadoop project split out of Nutch
       2008                                    Facebook develops Hive
       2010                                    Google publishes Dremel paper
Implications

•   Web analytics solutions were developed on the assumption that granular, event-level and
    customer-level data was too expensive to store and query

•   Data is aggregated from the start; data collection and analysis are tightly coupled

•   Web analytics is limited
     –   Hits
     –   Clicks, unique visitors, conversions
     –   Traffic sources

•   Web analytics is siloed (separate tools vs other data sets)
     –   Hard to link to customer data (e.g. CRM)
     –   Hard to link to marketing data (e.g. DoubleClick)
     –   Hard to link to financial data (e.g. unit profit)
Let’s reinvent web analytics

    Web analytics is one (very rich) data set that is at the heart of:

Customer analytics
•   How do my users segment by behaviour?
•   What is the customer lifetime value of my users? How can I forecast it based on their
    behaviour?
•   What are the ‘sliding doors’ moments in a customer’s journey that impact their lifetime
    value?
•   Which channels should I spend marketing budget on to acquire high-value customers?

Platform / application analytics
•   How do improvements to my application drive improved user engagement and lifetime
    value?
•   Which parts of my application should I focus development on to drive return?

Catalogue analytics
•   How are my different products (items in a shop / articles on a newspaper / videos on a
    media site) performing? What is driving the most engagement? Revenue? Profit?
•   How should I organise my catalogue online to drive the best user experience? How can I
    personalise it to different users?


                     SnowPlow: open source platform that delivers the granular web
                            analytics data, so you can perform the above
SnowPlow leverages big data and cloud technology
across its architecture. Hive on EMR is used A LOT

•   Javascript tag → pixel served from Amazon CloudFront; request to pixel (incl. query
    string) logged
•   Hive: read logs using a custom SerDe; write a single table of clean, partitioned event
    data back to S3 for ease of querying
•   S3: single, “fat” Hive table
•   Query in Hive: output to other analytics programmes, e.g. Excel, Tableau, R…
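The custom SerDe’s job is essentially to split each CloudFront access-log line into named fields. A minimal Python sketch of that parsing step — the field list here is an assumption based on the standard tab-separated CloudFront log layout of the time, not SnowPlow’s actual SerDe:

```python
# Sketch: parse a CloudFront access-log line into a dict of named fields,
# roughly what the custom Hive SerDe does for each row.
# The field list is an ASSUMPTION based on the W3C-style CloudFront format.
FIELDS = [
    "date", "time", "edge_location", "bytes", "ip", "method",
    "host", "uri_stem", "status", "referrer", "user_agent", "query_string",
]

def parse_log_line(line):
    """Split one tab-separated log line into a field dict; skip comment lines."""
    if line.startswith("#"):           # '#Version' / '#Fields' header lines
        return None
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))

example = ("2012-05-24\t11:35:53\tDUB2\t3402\t74.125.17.210\tGET\t"
           "d111111abcdef8.cloudfront.net\t/i\t200\t"
           "http://www.example.com/\tMozilla/5.0\te=pv&page=Home")
event = parse_log_line(example)
print(event["uri_stem"], event["query_string"])
```

The query string (the last field here) is where the Javascript tag encodes the actual event data, which is why logging the pixel request is enough to capture the event.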
How SnowPlow data looks in Hive:

One line per event (e.g. page view, add-to-basket), with columns grouped as:

•   User: user_id, visit_id, ip_address
•   Page: url, title
•   Marketing: source, medium, term, content, campaign, referrer
•   Event: category, action, label, property, value
•   Browser: name, family, version, type, lang…
•   OS: name, family, manufacturer
•   Device: type, is_mobile?, width, height
We ♥ Hive… but…

We like:
•   Easy to use and query (especially compared with NoSQL competitors, e.g. MongoDB)
     –   E.g. http://snowplowanalytics.com/analytics/basic-recipes.html
     –   http://snowplowanalytics.com/analytics/customer-analytics/cohort-analysis.html
•   Rapidly develop ETL and analytics queries
•   Easy to run on Amazon EMR
•   Tight integration with Amazon S3

But:
•   Hard to debug
•   Slow
•   Limited power
•   Batch based (Hadoop’s fault…)
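The cohort-analysis recipe linked above is written in HiveQL; the underlying logic is simple enough to sketch in Python (with hypothetical sample data, not SnowPlow’s actual query):

```python
# Sketch of the cohort-analysis idea: assign each user to a cohort by their
# first-seen month, then count how many users from each cohort were active
# in each month. Data is hypothetical (user_id, "YYYY-MM") visit records.
from collections import defaultdict

visits = [
    ("alice", "2012-01"), ("alice", "2012-02"), ("alice", "2012-03"),
    ("bob",   "2012-01"), ("bob",   "2012-03"),
    ("carol", "2012-02"), ("carol", "2012-03"),
]

def cohort_retention(visits):
    # Cohort = the first month each user was seen
    first_seen = {}
    for user, month in sorted(visits, key=lambda v: v[1]):
        first_seen.setdefault(user, month)
    # (cohort, month) -> set of users active that month
    active = defaultdict(set)
    for user, month in visits:
        active[(first_seen[user], month)].add(user)
    return {key: len(users) for key, users in active.items()}

print(cohort_retention(visits))
# e.g. the 2012-01 cohort has 2 users, and both are active again in 2012-03
```

In Hive the same grouping is a self-join of the events table against each user’s minimum visit month; the point of the recipes is that granular event data makes such questions a single query.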
For storage and analytics, columnar databases provide
an attractive alternative

Hive on EMR:
  •   Scales horizontally – to petabytes at least
  •   Pay-as-you-go (on EMR) – each query costs $
  •   An increasing number of front-ends can be ‘plugged in’, e.g. Toad for Cloud Databases

Columnar databases:
  •   Scales to terabytes (not petabytes)
  •   Fixed cost (dedicated analytics server with LOTs of RAM)
  •   Significantly faster – seconds, not minutes
  •   Plug in to many analytics front ends, e.g. Tableau, Qlikview, R
For segmentation, personalisation and recommendation
on web analytics data, you can’t beat Mahout

Hive:
You can do computations that do not fit the SQL processing model, incl. machine
learning, in Hive via transformation scripts…

   CREATE TABLE docs(contents STRING);

   FROM (
     MAP docs.contents USING 'tokenizer_script' AS word, cnt
     FROM docs
     CLUSTER BY word
   ) map_output
   REDUCE map_output.word, map_output.cnt
   USING 'count_script' AS word, cnt;

… but why would you?

Mahout:
•   Large number of recommendation, clustering and categorisation algorithms
•   Plays well with Hadoop
•   Large, active developer community
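The 'tokenizer_script' and 'count_script' in the example above are external programs that Hive streams rows through: tab-separated rows arrive on stdin and go back out on stdout. A hedged sketch of what such scripts typically look like — illustrative Python, not any actual SnowPlow script, written as functions rather than stdin loops so the logic is visible:

```python
# Sketch of the two streaming scripts from the Hive word-count example.
# In production each would be a standalone script looping over sys.stdin
# and printing to stdout; here they are plain generator functions.

def tokenize(lines):
    """tokenizer_script: emit one 'word<TAB>1' row per word in each doc."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word

def count(lines):
    """count_script: sum counts over each run of identical words.
    Relies on Hive's CLUSTER BY word to group identical words together."""
    current, total = None, 0
    for line in lines:
        word, cnt = line.strip().split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(cnt)
    if current is not None:
        yield "%s\t%d" % (current, total)
```

The CLUSTER BY in the HiveQL is what makes the single-pass count correct: the reducer only ever sees each word as one contiguous run.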
For ETL in production, you really need something more
robust than Hive
•   ETL: need to define sophisticated data pipelines so that there is:
     –   A clear audit path: which lines of data have been processed, and which have not
     –   Where they have not, error-handling flows to deal with those lines (including potential reprocessing)
     –   Graceful failure (a bad line should not shut down the whole job)
     –   Easy debugging when things go wrong: diagnose the problem and start again where you left off…

•   An alternative to Hive we are exploring: Cascading
     –   Java framework for developing Hadoop-powered data processing applications
     –   Scala (Scalding) and Clojure (Cascalog) wrappers available
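The error-handling requirements above can be sketched in a few lines (a hypothetical structure, not Cascading’s API): route failing lines to an error flow for later reprocessing instead of aborting the job, and keep an audit count of what was and wasn’t processed.

```python
# Hypothetical sketch of a fail-gracefully ETL step: good lines go to the
# output, bad lines are routed (with the failure reason and line number) to
# an error flow for inspection/reprocessing, and nothing aborts the job.
def run_step(lines, transform):
    processed, errors = [], []
    for n, line in enumerate(lines, start=1):
        try:
            processed.append(transform(line))
        except Exception as exc:          # fail this line, not the whole job
            errors.append({"line_no": n, "line": line, "reason": str(exc)})
    audit = {"total": len(processed) + len(errors),
             "ok": len(processed), "failed": len(errors)}
    return processed, errors, audit

# Example: a transform that rejects malformed rows
good, bad, audit = run_step(
    ["1\tpv", "oops", "2\tpv"],
    lambda line: line.split("\t")[1],
)
```

Cascading provides this kind of structure (flows, traps for failed tuples) as a first-class framework concept, which is what plain HiveQL lacks.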
Where we’re going with Hive @ SnowPlow

•   Javascript tag → pixel served from Amazon CloudFront; request to pixel (incl. query
    string) logged
•   Scalding (Cascading), with a Ruby wrapper, replaces Hive for ETL
•   Infobright holds the event data
•   Query Infobright from BI tools (e.g. Tableau, Qlikview, Pentaho), data exploration
    tools (e.g. R, Excel) and MI tools (e.g. Mahout)

                     Hive for ad hoc analytics on the atomic data
                    Hive for SnowPlow users with petabytes of data
             But… for most users… Hive is NOT part of the core flow
Any questions?



                 http://snowplowanalytics.com




                 http://github.com/snowplow




                 @snowplowdata

                 @yalisassoon

How we use Hive at SnowPlow, and how the role of Hive is changing
