How we use Hive at SnowPlow, and how the role of Hive is changing


  1. SnowPlow
     How Apache Hive and other big data technologies are transforming web analytics
     • How Hive is used at SnowPlow
     • Strengths and weaknesses of Hive vs alternatives
     http://snowplowanalytics.com
     @snowplowdata @yalisassoon
  2. Some history
     • 1990: Web is born
     • 1993: Log file based web analytics
     • 1997: Javascript tagging
     • 2004: Google publishes the MapReduce paper
     • 2006: Hadoop project split out of Nutch
     • 2008: Facebook develops Hive
     • 2010: Google publishes the Dremel paper
  3. Implications
     • Web analytics solutions were developed on the assumption that granular, event-level and customer-level data was too expensive to store and query
     • Data is aggregated from the start; data collection and analysis are tightly coupled
     • Web analytics is limited to:
       – Hits
       – Clicks, unique visitors, conversions
       – Traffic sources
     • Web analytics is silo’ed (separate tools to use vs other data sets):
       – Hard to link to customer data (e.g. CRM)
       – Hard to link to marketing data (e.g. DoubleClick)
       – Hard to link to financial data (e.g. unit profit)
  4. Let’s reinvent web analytics
     Web analytics is one (very rich) data set that is at the heart of:
     Customer analytics
     • How do my users segment by behaviour?
     • What is the customer lifetime value of my users? How can I forecast it based on their behaviour?
     • What are the ‘sliding doors’ moments in a customer’s journey that impact their lifetime value?
     • Which channels should I spend marketing budget on to acquire high value customers?
     Platform / application analytics
     • How do improvements to my application drive improved user engagement and lifetime value?
     • Which parts of my application should I focus development on to drive return?
     Catalogue analytics
     • How are my different products (items in a shop / articles on a newspaper / videos on a media site) performing? What is driving the most engagement? Revenue? Profit?
     • How should I organise my catalogue online to drive the best user experience? How can I personalise it to different users?
     SnowPlow: an open source platform that delivers the granular web analytics data, so you can perform the above
  5. SnowPlow leverages big data and cloud technology across its architecture. Hive on EMR is used A LOT:
     • Javascript tag: request to pixel (incl. query string) is logged
     • Pixel served from Amazon Cloudfront
     • Hive reads the logs using a custom SerDe
     • Hive writes a single table of clean, partitioned event data back to S3 for ease of querying: one single, “fat” Hive table
     • Query in Hive, or output to other analytics programmes e.g. Excel, Tableau, R…
     (A HiveQL sketch of this flow follows below.)
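A minimal HiveQL sketch of that flow. The SerDe class name, S3 paths and column names here are hypothetical stand-ins for illustration, not SnowPlow’s actual artefacts:

       -- Map the raw Cloudfront access logs into a Hive table; the custom
       -- SerDe (hypothetical class name) turns each log line into columns.
       CREATE EXTERNAL TABLE raw_logs
       ROW FORMAT SERDE 'com.example.snowplow.CloudfrontLogDeserializer'
       LOCATION 's3://my-cloudfront-logs/';

       -- The single, "fat" event table, written back to S3 partitioned
       -- by day. Cut down to two columns here; a fuller sketch follows
       -- the next slide.
       CREATE EXTERNAL TABLE events (
         user_id  STRING,
         page_url STRING
       )
       PARTITIONED BY (dt STRING)
       LOCATION 's3://my-snowplow-events/';

       -- In practice the raw logs would be restricted to one day's files.
       INSERT OVERWRITE TABLE events PARTITION (dt='2013-03-01')
       SELECT user_id, page_url
       FROM raw_logs;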
  6. How SnowPlow data looks in Hive: one line per event, e.g. page view, add-to-basket. The columns group as follows (an illustrative table definition follows below):
     • User: user_id, visit_id, ip_address
     • Page: url, title, referrer
     • Marketing: source, medium, term, content, campaign
     • Event: category, action, label, property, value
     • Browser: name, family, version, type, lang…
     • OS: name, family, manufacturer
     • Device: type, is_mobile?, width, height
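A sketch of how such a “fat” event table might be declared, fleshing out the cut-down events table above with one column group per heading on this slide. The column names are illustrative, not SnowPlow’s exact schema:

       CREATE EXTERNAL TABLE events (
         -- user
         user_id            STRING,
         visit_id           INT,
         ip_address         STRING,
         -- page
         page_url           STRING,
         page_title         STRING,
         page_referrer      STRING,
         -- marketing
         mkt_source         STRING,
         mkt_medium         STRING,
         mkt_term           STRING,
         mkt_content        STRING,
         mkt_campaign       STRING,
         -- event
         ev_category        STRING,
         ev_action          STRING,
         ev_label           STRING,
         ev_property        STRING,
         ev_value           STRING,
         -- browser
         br_name            STRING,
         br_family          STRING,
         br_version         STRING,
         br_type            STRING,
         br_lang            STRING,
         -- OS
         os_name            STRING,
         os_family          STRING,
         os_manufacturer    STRING,
         -- device
         dvce_type          STRING,
         dvce_is_mobile     BOOLEAN,
         dvce_screen_width  INT,
         dvce_screen_height INT
       )
       PARTITIONED BY (dt STRING)
       LOCATION 's3://my-snowplow-events/';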
  7. We love Hive… but…
     Strengths:
     • Easy to use and query (especially compared with NoSQL competitors e.g. MongoDB), e.g.:
       – http://snowplowanalytics.com/analytics/basic-recipes.html
       – http://snowplowanalytics.com/analytics/customer-analytics/cohort-analysis.html
     • Rapidly develop ETL and analytics queries
     • Easy to run on Amazon EMR
     • Tight integration with Amazon S3
     Weaknesses:
     • Hard to debug
     • Slow
     • Limited power
     • Batch based (Hadoop’s fault…)
     (A flavour of the ease of querying is sketched below.)
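For a flavour of that ease of querying, a basic recipe in the spirit of the links above, written against the illustrative events table sketched earlier (column names assumed, not SnowPlow’s exact schema):

       -- Unique visitors and total events per day.
       SELECT
         dt,
         COUNT(DISTINCT user_id) AS unique_visitors,
         COUNT(*)                AS events
       FROM events
       GROUP BY dt;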
  8. For storage and analytics, columnar databases provide an attractive alternative
     Hive on EMR:
     • Scales horizontally – to petabytes at least
     • Pay-as-you-go (on EMR) – each query costs $
     • An increasing number of front-ends can be ‘plugged in’, e.g. Toad for Cloud Databases
     Columnar databases:
     • Scale to terabytes (not petabytes)
     • Fixed cost (dedicated analytics server with LOTs of RAM)
     • Significantly faster – seconds not minutes
     • Plug in to many analytics front ends, e.g. Tableau, Qlikview, R
     (An example of the columnar advantage follows below.)
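Much of that speed comes from I/O: a typical analytics query touches only a handful of the fat event table’s columns, and a columnar store reads just those columns from disk. For example, against the same illustrative schema as before:

       -- However wide the events table is, a columnar store only has to
       -- read the mkt_campaign and user_id columns to answer this.
       SELECT mkt_campaign, COUNT(DISTINCT user_id) AS visitors
       FROM events
       GROUP BY mkt_campaign;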
  9. For segmentation, personalisation and recommendation on web analytics data, you can’t beat Mahout
     You can do computations that do not fit the SQL processing model, incl. machine learning, in Hive via transformation scripts…

       CREATE TABLE docs(contents STRING);

       FROM (
         MAP docs.contents USING 'tokenizer_script' AS word, cnt
         FROM docs
         CLUSTER BY word
       ) map_output
       REDUCE map_output.word, map_output.cnt
       USING 'count_script' AS word, cnt;

     … but why would you? Mahout:
     • Large number of recommendation, clustering and categorisation algorithms
     • Plays well with Hadoop
     • Large, active developer community
  10. For ETL in production, you really need something more robust than Hive
      • ETL: need to define sophisticated data pipelines so that:
        – There is a clear audit path: which lines of data have been processed, which have not
        – Where they have not, error handling flows deal with those lines (including potential reprocessing)
        – Failures are handled gracefully (not by shutting down the whole job)
        – When things go wrong, it is easy to debug, diagnose the problem, and start again where you left off…
      • An alternative to Hive we are exploring is Cascading:
        – A Java framework for developing Hadoop-powered data processing applications
        – Scala (Scalding) and Clojure (Cascalog) wrappers available
  11. Where we’re going with Hive @ SnowPlow
      • Javascript tag: request to pixel (incl. query string) is logged
      • Pixel served from Amazon Cloudfront
      • Scalding (Cascading), with a Ruby wrapper, replaces Hive in the ETL step
      • Event data is loaded into Infobright, which feeds:
        – BI tools e.g. Tableau, Qlikview, Pentaho
        – Data exploration tools e.g. R, Excel
        – MI tools e.g. Mahout
      • Hive remains for ad hoc analytics on the atomic data, and for SnowPlow users with petabytes of data
      • But… for most users… Hive is NOT part of the core flow
  12. Any questions?
      http://snowplowanalytics.com
      http://github.com/snowplow
      @snowplowdata
      @yalisassoon
