Advanced Analytics using Apache Hive

Analytics using Apache Hive
with the power of Windowing
and Table functions:
Use Cases
Murtaza Doctor - murtaza@richrelevance.com
Principal Architect, RichRelevance

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.

Outline
•
•
•
•
•
•
•

{rr} story
What is Clickstream Analytics
Hive at {rr}
Windowing & PTF Framework
Case Study: use cases
Current, Next & Future
Q&A


RichRelevance {rr}


RichRelevance DataMesh
Data Ingestion
3rd
Party

Realtime

Customer
Data Store

Analytics &
Optimization

Clickstream
Catalog
Online sales
In-store sales
Ad impressions
Social profiles
Redemptions…

125+ models
Customer models
Product models
A/B, MVT testing
King-of-the-hill
optimization

Offline

Data Feeds
Real-time
Decisioning
(65 msec)

[Client]
Innovation
Cloud

Event
Triggered
(minutes)

Batch
Updates
(hours)

Reporting
(ad
hoc, OLAP, E
xcel)

Underlying Technologies:
Hadoop, HBase, Hive, Kafka, Avro, Voldemort, Postgres, Pentaho OLAP, R

Custom apps and APIs
Self-Serve
Analytics

Personalized
Category Sort

Real-time
Segmentation

Network Ad
Tracking

{rr} SaaS & APIs

Did You Know?
Our data capacity
includes a 1.5 PB
Hadoop
infrastructure, which
enables us to employ
100+ algorithms in realtime

Our cloud-based platform
supports both real-time
processes and analytical
use cases, utilizing
technologies to name a
few:
Crunch, Hive, HBase, Avro,
Azkaban, Voldemort, Kafka

In the US, we serve 7000
requests per second
with an average
response time of
50 ms

Someone clicks on a {rr}
recommendation
every 21 milliseconds


What is Clickstream
Analytics?


What is Clickstream Analytics?
•
•
•
•
•

Collect, Combine, Aggregate & Analyze
Clickstream – view, click, purchase events
It is all about the Session or Visit
User properties – userId, location etc
Site Optimization, Sentiment Analysis, Buying Patterns and
many more

Example: we use click through rate (clicks/sessions) to
measure how well ad placement positions are doing on
pages, and then can test them based on engagement to see if
other positions would work better.


Getting MAD on Hive


MAD skills from Hellerstein’s paper
From Developer perspective Data
Platform should be:
Magnetic – Attract Data

As opposed to

With Hive

Having to justify loading any new
data to a DBA. The Quality and
schema regimes have the
adverse effect of Repelling Data

Agile – Data comes in many
Forcing a complex ETL process to
shapes and forms. Enable
bring data in.
bringing in Data in its native form.

Pluggable
• Formats
• Storage
Handlers
• Indices

Deep – Ability to operate on data Only SQL
directly; using existing algorithms
that operate on native formats.
SQL + M/C Learning + Graph + …

SQL + Map
Reduce scripts.
But can we do
better?


Problem for App Developer
I want to do
•
•
•
•
•
•

Sessionization
Clustering
Collaborative Filtering
Fraud detection
Time Series Analysis
Churn Analysis

And I want to do combine these analysis with SQL
Analytic capabilities available in most Databases as:
•
•
•
•

User Defined Table Functions
External Table mechanisms etc.
Aster SQL/MR library provides functions for many of the Use Cases above
Oracle Stored Procedure + Table Functions used to provide Analytic
packages.

Our work: bring same capability to Hive.

Hive at {rr}
•
•
•
•
•

Real-time data in Hive
Getting to 1PB of data in Hive!
Hive Tables: Event types, Catalog, Rollups etc
Custom Serde
Partitioning scheme: most of the tables
partitioned by event date

{rr}

Architecture


Roadblocks to the Solution
•
•
•
•
•
•

Too many temporary tables
Random sampling
R for ranking & aggregate functions
R can only handle smaller data sets
Lots of self-joins
Inefficient queries


Welcome to PTFs and
Windowing


3 Major SQL Concepts
1. Table Function
• Enable injecting custom logic
into the Query Data Flow
• Contract for TF is TableIn/Table-Out
• So opens up analysis
beyond row calculations
and aggregations
• Sessionize Fn. that decides
what weblog entries belong to
a Session.
• Syntactically function can
appear anywhere a Table can
in SQL.

Project
tableOut
Table Function
tableIn
Join

Select

2. Partitioned Table Function

Select

Project
Partitions Out

Table Function
Partitions In

Join

Select

Select

• a scaling mechanism
• Instead of operating on
the entire table divide
work into Partitions
• instances operating on
individual Partitions
don’t communicate.
• Divide weblog by Day or
Week and operate
independently
• Intuitively like MR:
processing PTF done
as MR jobs.

3. Windowing

current
row

• Operate on a set of rows
surrounding the current
row
• Windows defined like „5
preceding and 4 succeeding‟
• On the window allow
aggregations; and also
Navigation: lead, lag, First,
Last

PTFS and Windows related
• You do windowing after everything else: join, group by etc.
• You define windows on ordered Partitions
• You then do aggregations, inter row navigations on these
windows
• If all the Partitions across all Window expressions are the
same, then this is a special PTF.


Ordered Partition: the central concept
select *
From Sessionize(weblog….)

Partition

Hive
Translator

In Functions:
• Analyze partition of rows as a unit
• Output is not a summary of rows
• Sessionization : relate events to
sessions.
• Market Basket: find most common
Product/Page combinations
In Windowing:
• Ranking: Rank, Tiling,
• Trending: Lead/Lag,
Cumulative Sum

SELECT ViewsData.*,
rank() as exit_rank
over(DISTRIBUTE BY sessionid
SORT BY timsetamp DESC),
FROM ViewsData

Hive
Translator


Output
Partition

Example: Time Series Analysis
Time Series Analysis: Identify Flights that have a delay problem.
• We want to look at all the times a Flight happened and then make a
judgment.
• To do this: one conceivable starting point is to find occurrences where a
Flight was late 3 or more times in a row.
• Use these as a starting point for further analysis.
Flights Table
Origin

Fl. Num

Year

Month

Day

Arr. Delay

Boston

1017

2010

10

25

59.37

Boston

1017

2010

10

26

58.14

Boston

1017

2010

10

28

30.83

Boston

1017

2010

10

29

25.67

Pittsburgh

1058

2010

12

26

82.62

Analysis rows by Fl.
Number. Look for
sequences of Late
incidents.

Origin

FlNum

Year

Boston

1017

Boston
Pittsburg
h

Output aggregation
statistics about
these sequences.

Day

2010

Mont
h
10

25

Avg.
Delay
59.37

Num Of
Delays
8

1017

2010

11

10

41.54

7

1058

2010

12

26

82.62

8


Use NPath PTF
Use a PTF: NPath
•
•
•

Helps you look for patterns in Time
User specifies Labels: interesting conditions, for e.g. LATE : arr_delay > 15 mins
Then specifies Patterns on Labels. Patterns are simple Regexes. For e.g.
•

•

LATE.LATE.LATE+  look for occurrences where a flight is 3 or more times late.

On Occurrences found (Occurrences are a set of rows) specify aggregation
calculations. For e.g.
•
•

Average Delay among late occurrences
Number of delays

3.

1. Query on Flights Table
select origin_city_name, fl_num, year, month, day_of_month, sz, tpath
from NPATH(
'LATE.LATE+',
'LATE', arr_delay > 15,
'origin_city_name, fl_num, year, month, day_of_month, size(tpath) as numDelay, arrAvg(tpath, “arrDelay”)
as avgDelay'
on
flights
distribute by fl_num
Looking at data
sort by year, month, day_of_month
per Flight; order
)

2.

within partition by
time


• Arg. 1 specify PATTERN
• Arg. 2 specify conditions as
LABELS
• Arg. 3 specify AGGR.
EXPRESSIONS

Runtime: PTF execution
Hive
Translator

Input DataSet

MR Job

Map Splits
Map Task
Rows

Table
Sc+an

Rows

Select

Partition

Reduce Task
Rows

Join

PTF

Shuffle controlled by
partition and order
specification

FileSink

Partition

Function

A PartitionedTableFunction (PTF)
given a Partition computes an output
Partition.
An invocation of PTF specifies how input
dataset should be partitioned and ordered.
A PTF defines shape of Output.
A PTF may operate on raw data before it is
partitioned and ordered.


{rr} Case Study on Windowing


Case I: Landing/Exit Page Rate
• First page the user lands on within a session
• Last page the user exits through a session
• Landing rate:
distribution of landing events by page type
• Exit rate:
distribution of exit events by page type
• Usage: SEO & Advertising


Case I: Landing/Exit Page
SELECT eventdate, landPage, exitPage, COUNT(DISTINCT sessionid)
FROM (
SELECT sessionid, eventdate,
first_value(pageType) over (partition by sessionid) as landPage,
last_value(pageType) over(partition by sessionid) as exitPage
FROM (
SELECT pageType, eventdate, sessionid, timestamp,
count(*) over(PARTITION
BY sessionid order by timestamp asc) as c,
rank() over(PARTITION
BY sessionid order by timestamp asc) as r
FROM views
WHERE siteid = 1 and
eventdate >= '2013-01-01' and evendate < '2013--01-13'
)a
WHERE r = 1 or r = c
)b
GROUP BY eventdate, landing_page, exit_page

Case I: Landing Page Breakdown


Case I: Landing Page Time Series


Case I: Exit Page Time Series


Case II a: Bounce Rate (by Page Type)
• Single page in session
• Landing Page is equal to Exit Page
• Usage: Site engagement metrics report


SELECT page_type, eventdate,
sum(case when c=1 then 1 else 0 end) as bounce_count,
count(1) as total_sessions
FROM (
SELECT page_type, eventdate, sessionid, timestamp,
count(*) over(PARTITION BY sessionid, eventdate order by
timestamp asc) as c,
rank() over(PARTITION BY sessionid, eventdate order by
timestamp asc) as bounce_rank
FROM views
eventdate >= '2013-01-01' and evendate < '2013-01-13'
)a
WHERE bounce_rank = 1
GROUP by page_type, eventdate


Case II a: Bounce Rate Time Series


Case II b: New versus Repeat Traffic
• Comparison metric between first time visitors to
site v/s who came back more than once
• Usage: Insights into audience optimization


Case II b: New vs Repeat Traffic
SELECT userid, siteid, eventdate,
sum(case when c=1 then 1 else 0 end) new_users,
sum(case when c>1 then 1 else 0 end) repeat_users
FROM (
SELECT userid, siteid, eventdate,
count(*) over(PARTITION BY userid, siteid order by
eventdate as c,
rank() over(PARTITION BY userid, siteid order by
eventdate ) as rank
FROM views
eventdate >= '2013-01-01' and eventdate < '2013-01-14’
) page_views
WHERE rank = 1
GROUP BY userid, siteid, eventdate;


Case III: Path to Purchase
• Most commonly taken path which leads to a
purchase
• Example: search page  item page  add to
cart  purchase
• Usage: Site Optimization, Attribution Models


Case III: Path to Purchase
SELECT sessionid, eventdate,
collect_set(page_type) as path_to_purchase
FROM (
SELECT sessionid, eventdate, page_type,
last_value(page_type) over(PARTITION BY sessionid, eventdate
order by timestamp) as last_page
FROM (
SELECT sessionid, eventdate, timestamp, 'purchase' as page_type
FROM purchases
WHERE siteid=999 and eventdate = '2013-01-01'
UNION ALL
SELECT sessionid, eventdate, timestamp, page_type
FROM views
WHERE siteid = 1 and eventdate = '2013-01-01'
)a
)b
WHERE
last_page = 'purchase'

Case IV: Most Frequent Next Action
• Path a user takes, speaks a lot about user
experience
• Next most common action
• Example: Search  item page
• Usage: Site Optimization


SELECT page_type, next_page_type, c
FROM(
SELECT sessionid, page_type,
lead(page_type,1) OVER (PARTITION BY sessionid sort by
timestamp asc) as next_page_type,
count(*) OVER (PARTITION BY sessionid sort by
timestamp asc) as c,
rank() ) OVER (PARTITION BY sessionid sort by
timestamp asc) as page_view
FROM views where siteid = 1 and eventdate='2013-01-01‟
)a
GROUP BY page_type, next_page_type;


Case V: Purchase Co-Occurrence
• People who bought X also bought Y
• List of products more frequently bought in the
same orders as a user specified list of products
• Usage: Provides behavioral insights that would
not surface in sales metrics


Case V: Purchase Co-Occurrence
SELECT siteid, eventdate, userid, sessionid, ip, timestamp, ordernumber,
prods (
SELECT siteid, eventdate, userid, sessionid, ip, timestamp,
ordernumber, prods.productid as productid sum(case when
find_in_set(prods.productid, 'P1,P2,P3') > 0 then 1 else 0)
OVER (PARTITION BY purchase_complete_page. ordernumber
rows between unbounded preceding and unbounded following) as
matches, collect_set(prods.productid)
OVER(PARTITION BY purchase_complete_page.ordernumber
rows between unbounded preceding and unbounded following) as
prods, rank() OVER (PARTITION BY
purchase_complete_page.ordernumber
rows between unbounded preceding and unbounded following) as r
FROM purchases explode(purchase_complete_page.productspurchased)
prodTable as prods
WHERE eventdate >= $P{startdate} and
eventdate <= $P{enddate} and
siteid = $P{siteid}
)
WHERE matches >= 3 and r = 1


Solution: Current, Next &
Future
{rr}


Solution: Current
History of this project (started by Harish Butani)
• First provided this functionality on top of Hive
• See Github project for details & Hadoop Summit talk from Harish Butani
on this
• Had more functions and features, but not ideal
• So started to fold into Hive in November 2012
• 3 patches for HQL: see Jira 896
• A separate „windowing & ptf‟ hive branch

Hive Journey
•
•
•
•

Available as HiveQL
Currently part of Hive 0.11
Equivalent to functionality provided by Postgres
Differences are documented in Jira 4197


Solution: Next
Solidify Infrastructure
• Performance improvements
• Dynamic Registration of PTFs.

More Functions
• Candidate Frequent Itemsets: key process in Market Basket Analysis
• TimeLine: another kind of time series analysis, based on a
RichRelevance use case.


Solution: Future
Use PTF mechanism to integrate:
• R as R script PTF
• Mahout functions as Mahout PTF
• Groovy script PTF

Reduce Task
Rows

Join

Query structure:
Select ….
From Rscript(
‘r script’
on Npath(args…
On Flights..
)
)

rFn.

PTF

FileSink

rJava

rEngine

 Npath identifies interesting incidents
 Use R to make final decision

Partition

R Data Frame

Multi pass PTF Operator:
• Enable Iterative Algorithms:
Clustering, Market basket
Analysis, Graph traversal etc.

{rr} richrelevance
is hiring!

Thank You


Advanced Analytics using Apache Hive

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Advanced Analytics using Apache Hive

Similar to Advanced Analytics using Apache Hive (20)

Recently uploaded

Recently uploaded (20)

Advanced Analytics using Apache Hive

Editor's Notes