Advanced Analytics using Apache Hive

Presentation done at StampedeCon 2013 on Advanced Analytics using Apache Hive windowing & table functions.

Published in: Technology
  • Hive: a data warehouse system for Hadoop. How Harish and I met and decided to collaborate.
  • How we plan to go through the material.
  • Nuggets or data points: 1.5 PB is not as big as Yahoo or Facebook, but it is huge from a retail industry perspective.
  • Site optimization and the others are just a few of the use cases that can be solved by leveraging clickstream analytics.
  • Hive usage at {rr}.
  • So the picture in your mind should be:
      - The user specifies a Function in SQL anywhere a Table can appear.
      - Behind the scenes, at runtime, the Function is responsible for taking a Partition and returning a Partition.
    Or:
      - The user specifies one or more Windowing expressions.
      - Behind the scenes, the internal Windowing Table Function processes the data, partition by partition.
    The Windowing and PTF infrastructure is the same.
  • NPath: get the example from Hive.
  • One last thing, a quick picture of the runtime:
      - Here is how PTFs fit into the Hive flow.
      - A Query is translated into a set of Jobs by the Hive Driver.
      - Within each task, one or more SQL Operators are executed.
      - These operate on a stream of rows.
      - For PTFs, a new PTF Operator gets injected into the reduce side.
      - It collects the rows of a partition into a Partition object and invokes the PTF Function.
      - The Function's job is to provide an output Partition, whose rows get injected back into the stream of rows.
  • Fluent way to do things
  • RANK function: the inner query selects a certain set of fields, partitions the data by sessionId, and sorts the views in each session by timestamp, i.e. in the order in which they occurred, starting with the first one. The query then selects only the first event of each session, which is the row with rank = 1. The outer query groups the data by page_type and applies the count aggregate function to the sessionId.
  • The example just does a count. Landing events are pages where the referral id is not NULL. Google → landing events in a session → item page: a non-bounce page. Sessions which have one row: the one where rank() = 1. If you want to compute something per session using time, you are computing a difference between the first and the last: FIRST and LAST value.
  • Highlighting that the window does not have to be a number range; it can be value based. In a row in a session you want to look ahead: what some one time every activity. Timeline function: Table Functions give a lot more leeway, with some kind of pathing just like NPATH.
  • How is it different from the last one? Lead function: cannot pivot the value. The fundamental patterns are the same.
  • How about the following: if I understand the schema, the query below should give you the Orders, and the products purchased, that contain all the listed products. So say the products you are looking for are 'P1,P2,P3'; then the sum will give you a count of the products in this Order that match one of the listed products. The having clause will filter out all Orders that don't have at least 3 matches (i.e. matching all the listed products). The r = 1 condition will return 1 row per order. The output is of the form: OrderNumber, {products in order as a set}, other details… You can of course return each product in the Order as a separate row if you want to do more aggregation, e.g. count the orders that these products appear in and then rank them, or set up a cutoff threshold, etc.
  • Notes: R and SQL. This would bring a different way of working: pull data into R, or push R functionality to where the data is? Who is thinking about this future?
  • Advanced Analytics using Apache Hive

    1. Analytics using Apache Hive with the power of Windowing and Table functions: Use Cases. Murtaza Doctor (murtaza@richrelevance.com), Principal Architect, RichRelevance. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential.
    2. Outline
         • {rr} story
         • What is Clickstream Analytics
         • Hive at {rr}
         • Windowing & PTF Framework
         • Case Study: use cases
         • Current, Next & Future
         • Q&A
    3. RichRelevance {rr}
    4. RichRelevance DataMesh [diagram labels]: Data Ingestion; 3rd Party; Realtime; Customer Data Store; Analytics & Optimization; Clickstream; Catalog; Online sales; In-store sales; Ad impressions; Social profiles; Redemptions…; 125+ models; Customer models; Product models; A/B, MVT testing; King-of-the-hill optimization; Offline Data Feeds; Real-time Decisioning (65 msec); [Client] Innovation Cloud; Event Triggered (minutes); Batch Updates (hours); Reporting (ad hoc, OLAP, Excel); Custom apps and APIs; Self-Serve Analytics; Personalized Category Sort; Real-time Segmentation; Network Ad Tracking; {rr} SaaS & APIs. Underlying Technologies: Hadoop, HBase, Hive, Kafka, Avro, Voldemort, Postgres, Pentaho OLAP, R.
    5. Did You Know?
         • Our data capacity includes a 1.5 PB Hadoop infrastructure, which enables us to employ 100+ algorithms in real time.
         • Our cloud-based platform supports both real-time processes and analytical use cases, using technologies such as Crunch, Hive, HBase, Avro, Azkaban, Voldemort and Kafka, to name a few.
         • In the US, we serve 7000 requests per second with an average response time of 50 ms.
         • Someone clicks on a {rr} recommendation every 21 milliseconds.
    6. What is Clickstream Analytics?
    7. What is Clickstream Analytics?
         • Collect, Combine, Aggregate & Analyze
         • Clickstream: view, click, purchase events
         • It is all about the Session or Visit
         • User properties: userId, location, etc.
         • Site Optimization, Sentiment Analysis, Buying Patterns and many more
         Example: we use click-through rate (clicks/sessions) to measure how well ad placement positions are doing on pages, and can then test them based on engagement to see if other positions would work better.
    8. Getting MAD on Hive
    9. MAD skills from Hellerstein's paper. From a developer perspective, a data platform should be:
         • Magnetic: attract data.
             As opposed to: having to justify loading any new data to a DBA. The quality and schema regimes have the adverse effect of repelling data.
         • Agile: data comes in many shapes and forms. Enable bringing in data in its native form.
             As opposed to: forcing a complex ETL process to bring data in.
             With Hive: pluggable formats, storage handlers, indices.
         • Deep: ability to operate on data directly, using existing algorithms that operate on native formats. SQL + machine learning + graph + …
             As opposed to: only SQL.
             With Hive: SQL + MapReduce scripts. But can we do better?
    10. Problem for App Developer
         I want to do:
              • Sessionization
              • Clustering
              • Collaborative Filtering
              • Fraud detection
              • Time Series Analysis
              • Churn Analysis
         And I want to combine these analyses with SQL.
         Analytic capabilities are available in most databases as:
              • User Defined Table Functions
              • External Table mechanisms, etc.
              • The Aster SQL/MR library provides functions for many of the use cases above.
              • Oracle Stored Procedures + Table Functions are used to provide analytic packages.
         Our work: bring the same capability to Hive.
    11. Hive at {rr}
         • Real-time data in Hive
         • Getting to 1 PB of data in Hive!
         • Hive tables: event types, catalog, rollups, etc.
         • Custom SerDe
         • Partitioning scheme: most of the tables are partitioned by event date
    12. Architecture
    13. Roadblocks to the Solution
         • Too many temporary tables
         • Random sampling
         • R for ranking & aggregate functions
         • R can only handle smaller data sets
         • Lots of self-joins
         • Inefficient queries
    14. Welcome to PTFs and Windowing
    15. 3 Major SQL Concepts
         1. Table Function
              • Enables injecting custom logic into the query data flow.
              • The contract for a TF is table-in/table-out.
              • So it opens up analysis beyond row calculations and aggregations.
              • Example: a Sessionize function that decides which weblog entries belong to a session.
              • Syntactically, the function can appear anywhere a table can in SQL.
              [Diagram, top to bottom: Project, tableOut, Table Function, tableIn, Join, Select]
         2. Partitioned Table Function
              • A scaling mechanism.
              • Instead of operating on the entire table, divide the work into partitions.
              • Instances operating on individual partitions don't communicate.
              • Example: divide the weblog by day or week and operate on each part independently.
              • Intuitively like MR: PTF processing is done as MR jobs.
              [Diagram, top to bottom: Select, Project, Partitions Out, Table Function, Partitions In, Join, Select, Select]
         3. Windowing
              • Operate on a set of rows surrounding the current row.
              • Windows are defined like '5 preceding and 4 succeeding'.
              • On the window, allow aggregations and also navigation: lead, lag, first, last.
         PTFs and Windows are related
              • You do windowing after everything else: join, group by, etc.
              • You define windows on ordered partitions.
              • You then do aggregations and inter-row navigation on these windows.
              • If all the partitions across all window expressions are the same, then this is a special PTF.
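The partition-in/partition-out contract on this slide can be illustrated outside Hive. What follows is a minimal Python sketch, not Hive's implementation: the helper `apply_ptf`, the function `sessionize`, the 30-minute inactivity gap, and the `user`/`ts` fields are all assumptions made up for the illustration.

```python
from itertools import groupby
from operator import itemgetter

def apply_ptf(rows, partition_key, order_key, fn):
    """Partition rows, order each partition, and let fn map partition -> partition."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[partition_key], r[order_key]))
    for _, partition in groupby(ordered, key=itemgetter(partition_key)):
        # Each partition is handed to fn on its own; partitions never communicate.
        out.extend(fn(list(partition)))
    return out

def sessionize(partition, gap=1800):
    """Hypothetical table function: start a new session after `gap` seconds of inactivity."""
    result, session, last_ts = [], 0, None
    for row in partition:
        if last_ts is not None and row["ts"] - last_ts > gap:
            session += 1
        result.append({**row, "session": session})
        last_ts = row["ts"]
    return result

# Made-up weblog rows (timestamps in seconds).
weblog = [
    {"user": "u1", "ts": 0},
    {"user": "u2", "ts": 50},
    {"user": "u1", "ts": 100},
    {"user": "u1", "ts": 4000},
]
sessions = apply_ptf(weblog, "user", "ts", sessionize)
```

The output has one row per input row, with a session number added: it is a new table, not a summary, which is the difference from a plain aggregation.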
    16. Ordered Partition: the central concept
         [Diagram: the query goes through the Hive Translator; the function or window expression receives a Partition and produces an Output Partition]
         In Functions:
              select * from Sessionize(weblog….)
              • Analyze a partition of rows as a unit
              • Output is not a summary of rows
              • Sessionization: relate events to sessions
              • Market Basket: find the most common Product/Page combinations
         In Windowing:
              • Ranking: Rank, Tiling
              • Trending: Lead/Lag, Cumulative Sum
              SELECT ViewsData.*,
                     rank() over (DISTRIBUTE BY sessionid SORT BY timestamp DESC) as exit_rank
              FROM ViewsData
    17. Example: Time Series Analysis
         Time Series Analysis: identify Flights that have a delay problem.
              • We want to look at all the times a Flight happened and then make a judgment.
              • To do this, one conceivable starting point is to find occurrences where a Flight was late 3 or more times in a row.
              • Use these as a starting point for further analysis.
         Flights Table:
              Origin      Fl. Num  Year  Month  Day  Arr. Delay
              Boston      1017     2010  10     25   59.37
              Boston      1017     2010  10     26   58.14
              Boston      1017     2010  10     28   30.83
              Boston      1017     2010  10     29   25.67
              Pittsburgh  1058     2010  12     26   82.62
         Analyze rows by Fl. Number. Look for sequences of Late incidents. Output aggregation statistics about these sequences:
              Origin      FlNum  Year  Month  Day  Avg. Delay  Num Of Delays
              Boston      1017   2010  10     25   59.37       8
              Boston      1017   2010  11     10   41.54       7
              Pittsburgh  1058   2010  12     26   82.62       8
    18. Use NPath PTF
         Use a PTF: NPath
              • Helps you look for patterns in time.
              • The user specifies Labels: interesting conditions, e.g. LATE: arr_delay > 15 mins.
              • Then specifies Patterns on Labels. Patterns are simple regexes, e.g. LATE.LATE.LATE+ looks for occurrences where a flight is late 3 or more times.
              • On the Occurrences found (an Occurrence is a set of rows), specify aggregation calculations, e.g. average delay among late occurrences, number of delays.
         Query on Flights Table:
              select origin_city_name, fl_num, year, month, day_of_month, sz, tpath
              from NPATH(
                  'LATE.LATE+',
                  'LATE', arr_delay > 15,
                  'origin_city_name, fl_num, year, month, day_of_month, size(tpath) as numDelay, arrAvg(tpath, "arrDelay") as avgDelay'
                  on flights
                  distribute by fl_num
                  sort by year, month, day_of_month
              )
         Slide annotations:
              • Looking at data per Flight; order within partition by time
              • Arg. 1 specifies the PATTERN
              • Arg. 2 specifies conditions as LABELS
              • Arg. 3 specifies AGGR. EXPRESSIONS
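The label-then-pattern idea behind NPath can be sketched in a few lines of Python. This is an illustration of the technique under stated assumptions, not the NPath implementation: the function `late_runs`, the single-character labels, and the flight rows are made up, and the regular expression `L{3,}` stands in for the pattern LATE.LATE.LATE+.

```python
import re
from itertools import groupby

# Made-up rows: (fl_num, year, month, day, arr_delay)
flights = [
    (1017, 2010, 10, 25, 59.0),
    (1017, 2010, 10, 26, 58.0),
    (1017, 2010, 10, 27, 3.0),
    (1017, 2010, 10, 28, 30.0),
    (1017, 2010, 10, 29, 25.0),
    (1017, 2010, 10, 30, 20.0),
    (1058, 2010, 12, 26, 82.0),
]

def late_runs(rows, min_run=3, threshold=15):
    """Find runs of at least `min_run` consecutive late arrivals per flight."""
    results = []
    # Sorting the tuples partitions by fl_num and orders by date within each flight.
    for fl_num, part in groupby(sorted(rows), key=lambda r: r[0]):
        part = list(part)
        # Label every row: "L" for LATE (arr_delay > threshold), "-" otherwise.
        labels = "".join("L" if r[4] > threshold else "-" for r in part)
        # Apply the pattern to the label string; each match is one occurrence.
        for m in re.finditer("L{%d,}" % min_run, labels):
            run = part[m.start():m.end()]
            delays = [r[4] for r in run]
            # Aggregate over the occurrence: start date, number of delays, average delay.
            results.append((fl_num, run[0][1:4], len(run), sum(delays) / len(delays)))
    return results

occurrences = late_runs(flights)
```

Flight 1017 is late on the 25th and 26th, on time on the 27th, then late three days in a row, so only the run starting on the 28th qualifies; flight 1058 has a single late row and produces nothing.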
    19. Runtime: PTF execution
         [Diagram: Hive Translator; Input DataSet; MR Job. Map Splits; Map Task (Rows, Table Scan, Rows, Select); Shuffle controlled by partition and order specification; Reduce Task (Rows, Join, PTF with Partition and Function, FileSink)]
         • A PartitionedTableFunction (PTF), given a Partition, computes an output Partition.
         • An invocation of a PTF specifies how the input dataset should be partitioned and ordered.
         • A PTF defines the shape of its output.
         • A PTF may operate on raw data before it is partitioned and ordered.
    20. {rr} Case Study on Windowing
    21. Case I: Landing/Exit Page Rate
         • Landing page: the first page the user lands on within a session
         • Exit page: the last page through which the user exits a session
         • Landing rate: distribution of landing events by page type
         • Exit rate: distribution of exit events by page type
         • Usage: SEO & Advertising
    22. Case I: Landing/Exit Page
         SELECT eventdate, landPage, exitPage, COUNT(DISTINCT sessionid)
         FROM (
           SELECT sessionid, eventdate,
                  -- ordered, full-partition frame so first/last values are well defined
                  first_value(pageType) over (partition by sessionid order by timestamp asc
                      rows between unbounded preceding and unbounded following) as landPage,
                  last_value(pageType) over (partition by sessionid order by timestamp asc
                      rows between unbounded preceding and unbounded following) as exitPage
           FROM (
             SELECT pageType, eventdate, sessionid, timestamp,
                    -- no ORDER BY here, so c is the total number of views in the session
                    count(*) over (PARTITION BY sessionid) as c,
                    rank() over (PARTITION BY sessionid order by timestamp asc) as r
             FROM views
             WHERE siteid = 1 and eventdate >= '2013-01-01' and eventdate < '2013-01-13'
           ) a
           WHERE r = 1 or r = c
         ) b
         GROUP BY eventdate, landPage, exitPage
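To make the landing/exit logic concrete, here is a small Python sketch of what the query is meant to compute: per session, take the first and last page by timestamp, then count sessions per (landing, exit) pair. The function name `landing_exit_counts` and the sample rows are assumptions for illustration only.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Made-up rows: (sessionid, timestamp, page_type)
views = [
    ("s1", 1, "home"), ("s1", 2, "search"), ("s1", 3, "item"),
    ("s2", 5, "item"),
    ("s3", 7, "home"), ("s3", 9, "cart"),
]

def landing_exit_counts(rows):
    """Count sessions per (landing page type, exit page type)."""
    counts = Counter()
    # Equivalent of PARTITION BY sessionid ORDER BY timestamp.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, part in groupby(ordered, key=itemgetter(0)):
        part = list(part)
        landing = part[0][2]   # first_value over the whole partition
        exit_ = part[-1][2]    # last_value over the whole partition
        counts[(landing, exit_)] += 1
    return counts

result = landing_exit_counts(views)
```

A single-page session such as `s2` lands and exits on the same page type, which is exactly the bounce case treated in Case II a.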
    23. Case I: Landing Page Breakdown
    24. Case I: Landing Page Time Series
    25. Case I: Exit Page Time Series
    26. Case II a: Bounce Rate (by Page Type)
         • Single page in session
         • Landing Page is equal to Exit Page
         • Usage: Site engagement metrics report
    27. Case II a: Bounce Rate (by Page Type)
         SELECT page_type, eventdate,
                sum(case when c = 1 then 1 else 0 end) as bounce_count,
                count(1) as total_sessions
         FROM (
           SELECT page_type, eventdate, sessionid, timestamp,
                  -- no ORDER BY here, so c is the total number of views in the session for that day
                  count(*) over (PARTITION BY sessionid, eventdate) as c,
                  rank() over (PARTITION BY sessionid, eventdate order by timestamp asc) as bounce_rank
           FROM views
           WHERE siteid = 1 and eventdate >= '2013-01-01' and eventdate < '2013-01-13'
         ) a
         WHERE bounce_rank = 1
         GROUP BY page_type, eventdate
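The bounce computation can be checked on a toy data set. The Python sketch below mirrors the intent of the query (a bounce is a session with exactly one view, attributed to its landing page type); the function `bounce_rate` and the rows are hypothetical.

```python
from collections import defaultdict

# Made-up rows: (sessionid, timestamp, page_type)
views = [
    ("s1", 1, "home"), ("s1", 2, "search"), ("s1", 3, "item"),
    ("s2", 5, "item"),
    ("s3", 7, "home"), ("s3", 9, "cart"),
]

def bounce_rate(rows):
    """Return {landing page_type: (bounce_count, total_sessions)}."""
    sessions = defaultdict(list)
    for sessionid, ts, page_type in rows:
        sessions[sessionid].append((ts, page_type))
    stats = defaultdict(lambda: [0, 0])
    for events in sessions.values():
        events.sort()
        landing = events[0][1]      # the row with rank = 1
        stats[landing][1] += 1      # one more session landing on this page type
        if len(events) == 1:        # c = 1: the session has a single view
            stats[landing][0] += 1
    return {page: tuple(v) for page, v in stats.items()}

rates = bounce_rate(views)
```

Dividing `bounce_count` by `total_sessions` per page type gives the bounce rate reported on the following slides.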
    28. Case II a: Bounce Rate (by Page Type)
    29. Case II a: Bounce Rate Time Series
    30. Case II b: New versus Repeat Traffic
         • Comparison metric between first-time visitors to the site and those who came back more than once
         • Usage: Insights into audience optimization
    31. Case II b: New vs Repeat Traffic
         SELECT userid, siteid, eventdate,
                sum(case when c = 1 then 1 else 0 end) new_users,
                sum(case when c > 1 then 1 else 0 end) repeat_users
         FROM (
           SELECT userid, siteid, eventdate,
                  count(*) over (PARTITION BY userid, siteid order by eventdate) as c,
                  rank() over (PARTITION BY userid, siteid order by eventdate) as rank
           FROM views
           WHERE siteid = 1 and eventdate >= '2013-01-01' and eventdate < '2013-01-14'
         ) page_views
         WHERE rank = 1
         GROUP BY userid, siteid, eventdate;
    32. Case III: Path to Purchase
         • Most commonly taken path which leads to a purchase
         • Example: search page → item page → add to cart → purchase
         • Usage: Site Optimization, Attribution Models
    33. Case III: Path to Purchase
         SELECT sessionid, eventdate, collect_set(page_type) as path_to_purchase
         FROM (
           SELECT sessionid, eventdate, page_type,
                  -- full-partition frame so last_page is the final event of the session
                  last_value(page_type) over (PARTITION BY sessionid, eventdate order by timestamp
                      rows between unbounded preceding and unbounded following) as last_page
           FROM (
             SELECT sessionid, eventdate, timestamp, 'purchase' as page_type
             FROM purchases
             WHERE siteid = 999 and eventdate = '2013-01-01'
             UNION ALL
             SELECT sessionid, eventdate, timestamp, page_type
             FROM views
             WHERE siteid = 1 and eventdate = '2013-01-01'
           ) a
         ) b
         WHERE last_page = 'purchase'
         -- collect_set is an aggregate, so group by the session columns
         GROUP BY sessionid, eventdate
    34. Case IV: Most Frequent Next Action
         • The path a user takes says a lot about the user experience
         • Next most common action
         • Example: Search → item page
         • Usage: Site Optimization
    35. Case IV: Most Frequent Next Action
         -- c: number of times next_page_type directly follows page_type
         SELECT page_type, next_page_type, count(*) as c
         FROM (
           SELECT sessionid, page_type,
                  lead(page_type, 1) OVER (PARTITION BY sessionid order by timestamp asc) as next_page_type,
                  rank() OVER (PARTITION BY sessionid order by timestamp asc) as page_view
           FROM views
           WHERE siteid = 1 and eventdate = '2013-01-01'
         ) a
         GROUP BY page_type, next_page_type;
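The effect of `lead(page_type, 1)` is to pair each row with the row that follows it in the same ordered partition. A minimal Python sketch of that pairing and the subsequent count is shown below; `next_action_counts` and the sample rows are made up for the illustration.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Made-up rows: (sessionid, timestamp, page_type)
views = [
    ("s1", 1, "search"), ("s1", 2, "item"), ("s1", 3, "cart"),
    ("s2", 4, "search"), ("s2", 6, "item"),
    ("s3", 8, "home"),
]

def next_action_counts(rows):
    """Count (page_type, next_page_type) transitions within each session."""
    counts = Counter()
    ordered = sorted(rows, key=itemgetter(0, 1))  # partition by session, order by timestamp
    for _, part in groupby(ordered, key=itemgetter(0)):
        pages = [r[2] for r in part]
        # lead(page_type, 1): the last row of a session has no successor, hence None.
        for page, nxt in zip(pages, pages[1:] + [None]):
            counts[(page, nxt)] += 1
    return counts

transitions = next_action_counts(views)
top = transitions.most_common(1)
```

In this toy data the most frequent next action is search followed by item page, matching the example on the previous slide.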
    36. Case IV: Most Frequent Next Action
    37. Case V: Purchase Co-Occurrence
         • People who bought X also bought Y
         • List of products more frequently bought in the same orders as a user-specified list of products
         • Usage: Provides behavioral insights that would not surface in sales metrics
    38. Case V: Purchase Co-Occurrence
         SELECT siteid, eventdate, userid, sessionid, ip, timestamp, ordernumber, prods
         FROM (
           SELECT siteid, eventdate, userid, sessionid, ip, timestamp, ordernumber,
                  prods.productid as productid,
                  sum(case when find_in_set(prods.productid, 'P1,P2,P3') > 0 then 1 else 0 end)
                      OVER (PARTITION BY purchase_complete_page.ordernumber
                            rows between unbounded preceding and unbounded following) as matches,
                  collect_set(prods.productid)
                      OVER (PARTITION BY purchase_complete_page.ordernumber
                            rows between unbounded preceding and unbounded following) as prods,
                  -- ordering by productid is an assumption; rank needs an order for r = 1 to pick one row per order
                  rank() OVER (PARTITION BY purchase_complete_page.ordernumber
                               ORDER BY prods.productid) as r
           FROM purchases
                LATERAL VIEW explode(purchase_complete_page.productspurchased) prodTable as prods
           WHERE eventdate >= $P{startdate} and eventdate <= $P{enddate} and siteid = $P{siteid}
         ) t
         WHERE matches >= 3 and r = 1
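The filtering step of the co-occurrence query can be sketched in Python as follows. This is an illustration only: `orders_containing_all` and the order data are hypothetical, and unlike the SQL `sum`, which counts matching rows, the sketch counts distinct matching products.

```python
def orders_containing_all(orders, wanted):
    """orders: {ordernumber: [productid, ...]}. Keep orders that contain every wanted product."""
    wanted = set(wanted)
    result = {}
    for ordernumber, products in orders.items():
        prods = set(products)              # collect_set over the order partition
        matches = len(prods & wanted)      # listed products present in this order
        if matches >= len(wanted):         # the `matches >= 3` filter for three listed products
            result[ordernumber] = prods    # one row per order, products as a set
    return result

# Made-up orders.
orders = {
    "o1": ["P1", "P2", "P3", "P9"],
    "o2": ["P1", "P2"],
    "o3": ["P3", "P2", "P1"],
}
matching = orders_containing_all(orders, ["P1", "P2", "P3"])
```

From the surviving orders, the extra products (here `P9`) are the candidates for "people who bought X also bought Y".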
    39. Solution: Current, Next & Future
    40. Solution: Current
         History of this project (started by Harish Butani)
              • First provided this functionality on top of Hive
              • See the Github project for details, and the Hadoop Summit talk from Harish Butani on this
              • Had more functions and features, but not ideal
              • So started to fold into Hive in November 2012
              • 3 patches for HQL: see Jira 896
              • A separate 'windowing & ptf' Hive branch
         Hive Journey
              • Available as HiveQL
              • Currently part of Hive 0.11
              • Equivalent to functionality provided by Postgres
              • Differences are documented in Jira 4197
    41. Solution: Next
         Solidify Infrastructure
              • Performance improvements
              • Dynamic registration of PTFs
         More Functions
              • Candidate Frequent Itemsets: key process in Market Basket Analysis
              • TimeLine: another kind of time series analysis, based on a RichRelevance use case
    42. Solution: Future
         Use the PTF mechanism to integrate:
              • R as R script PTF
              • Mahout functions as Mahout PTF
              • Groovy script PTF
         Query structure:
              Select …. From Rscript( 'r script' on Npath(args… On Flights.. ) )
              • Npath identifies interesting incidents
              • Use R to make the final decision
         [Diagram: Reduce Task (Rows, Join, rFn. PTF, FileSink); Partition; R Data Frame; rJava; rEngine]
         Multi-pass PTF Operator:
              • Enable iterative algorithms: clustering, market basket analysis, graph traversal, etc.
    43. {rr} richrelevance is hiring! Thank You
