MAKING BIG DATA COME ALIVE
Clustering click-stream data using Spark
Marissa Saunders
Slides available at:
http://www.slideshare.net/MarissaSaunders/clickstream-data-with-spark
2
• Why?
– Why clustering?
– Why Spark?
– Why click-stream?
• What?
– What is the raw data?
• How?
– Parsing user agent data on Spark
– Distributed K-modes on Spark
• So what?
– Details of applying the method to this
use case
– Resulting clusters
– Time access patterns
– Preferred websites
• Questions
Agenda
3
Objectives
Understand:
• k-means and k-modes clustering
• why Spark is a good choice
• different data structures in Spark
– RDD, dataframe and dataset
• clickstream data and how user-agent parsing works
Demonstrate:
• mapping a function over an RDD
• defining a custom UDF and mapping it over a dataframe
• mapping a Python function over a partition
• how identifying different user types can drive insight into user
behavior
4
Why Clustering?
5
We have a plot like this …
• 2 groups of data
• Clustering can find them
• This can lead to insight …
– There are two different groups of
unladen swallows
– The heavy species flies more
slowly
– When asking for airspeed, we
should specify if we mean African
or European swallows
Why clustering?
… with apologies to Monty Python
[Plot: flight velocity vs. bird mass, colored by bird type]
6-11
How does it work?
For 2 clusters:
1. Pick 2 points at random as centroids
2. Assign each point to its closest centroid
3. Calculate the mean of each cluster as the new centroids
4. Repeat 2 and 3 to convergence
Converged
This is called
K-means
clustering
… and there is a Spark
function for this
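A minimal sketch of that function, using the pyspark.ml API (the deck does not show its exact call, and the numbers below are made-up stand-ins for the swallow plot):

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

# toy (mass, velocity) points standing in for the two swallow groups
df = spark.createDataFrame(
    [(20.0, 11.0), (22.0, 10.5), (54.0, 7.0), (60.0, 6.5)],
    ["mass", "velocity"])

# Spark's KMeans expects a single vector column of features
df = VectorAssembler(inputCols=["mass", "velocity"],
                     outputCol="features").transform(df)

model = KMeans(k=2, seed=1).fit(df)   # runs steps 1-4 above to convergence
model.transform(df).show()            # adds a "prediction" (cluster id) column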
12
What about categorical data?
• Use modes instead of means
– Most frequently occurring value
• Use binary distance metric for each dimension
– 0 = the same
– 1 = not the same
• Use the same iterative cluster assignment algorithm
This is called
K-modes
clustering
Color | Mass | Speed | Type
Green/Grey | Heavy | Slow | African
Green/Grey | Heavy | Fast | African
Green/Black | Heavy | Slow | African
Green/Grey | Light | Slow | African
Blue/White | Heavy | Fast | European
Blue/White | Light | Fast | European
Blue/Grey | Light | Slow | European
Blue/White | Light | Fast | European
… and we’ve open-sourced
a Spark function for this
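The building blocks are simple enough to sketch in plain Python (this is an illustration, not the open-sourced Spark implementation linked above):

from collections import Counter

def matching_distance(a, b):
    # binary distance per dimension: 0 if the same, 1 if not
    return sum(int(x != y) for x, y in zip(a, b))

def mode_of_cluster(records):
    # a categorical "centroid" is the most frequent value in each column
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

swallows = [("Green/Grey", "Heavy", "Slow"),
            ("Green/Grey", "Heavy", "Fast"),
            ("Green/Black", "Heavy", "Slow")]
print(mode_of_cluster(swallows))                    # ('Green/Grey', 'Heavy', 'Slow')
print(matching_distance(swallows[0], swallows[1]))  # 1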
13
Why Spark?
14
What is Spark?
Apache Spark™ is a fast and general engine for large-scale data processing.
- spark.apache.org
• Distributed computing
• Relies on HDFS (or other DFS)
• In-memory
• Optimized execution
• High-level functionality
15
[Diagram: a data set split into blocks (Block1-Block8) spread across the nodes of a cluster]
Why Spark?
• Take the computation to the data
• Spark works faster on partitioned data than MapReduce
– In-memory operation avoids I/O costs
– DAG optimization reduces computational costs
• Fast to develop
– Data transformation and machine learning libraries are part of Spark
http://spark.apache.org/docs/latest/cluster-overview.html
It is FAST
16
Basic data structures in Spark
• Resilient Distributed Dataset (RDD)
• Dataframe = RDD with a schema
– SQL-style syntax
– Refer to column by name
– Optimized queries
• Dataset = best of both worlds?!?
[Diagram: the full data set is split into blocks, and each block is stored on more than one node]
What makes it resilient?
• Multiple copies
• Stores lineage
17
A little terminology …
[Diagram: the full data set is spread across nodes; each node holds partitions, and each partition holds records]
18
Why Clickstream?
19
What is clickstream data?
• Information trail left behind by each user
• Semi-structured website log files
• Includes:
– User agent information
- Device
- OS
- Browser
– Geo information
- Timezone
- Lat/Longitude
- City
- Country
– Time of access
– Referring website
– Website accessed
Photo credit: Tim Franklin Photography via Foter.com
20
What is this good for?
• Web analytics can answer questions like:
– How long do users take from first visit to purchase?
– When do users visit the website?
– What marketing channels are effective in attracting users?
– Where are users located?
– What are the paths that users take through the website?
– How long do users stay on a specific page?
– Which pages draw the most users?
– etc…
21
The sample use case
Clickstream data from 1usagov
– Created whenever anyone shortens a .gov or .mil site with bitly
– Feed at http://developer.usa.gov/1usagov
– Archive for 2011-2013: http://bitly.measuredvoice.com/bitly_archive/?C=M;O=D
Why this is a great dataset:
– Large volume
– Realistic format
- Streaming
- Not cleaned
– Interesting questions
- What subtypes of users are there?
- How do the activity patterns of these subtypes differ?
– Publicly available archive
22
What is the raw data?
23
What is the raw data?
{"h":"1rzB4JL","g":"1laU0gx","l":"anonymous","hh":"1.usa.gov","u":"http://www.cdc.gov/cdcgrandrounds/index.htm","r":"direct","a":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)","i":"","t":1460753233,"k":"","nk":0,"hc":1413468615,"_id":"52d237f4-8c0e-0ac2-a0ed-a32acabe05bb","al":"en-US","c":"US","ll":[38,-97],"sl":"1rzB4JL"}
• json format
• Fields include:
– Website clicked (long url)
– Referring url
– User agent – what machine is this?
– Time accessed
– etc.
28
Parsing click stream data on Spark
29
High level picture
• Need to extract:
– Time in date, hours
– Information about the user:
- Device type
- OS
- Timezone
– Main domain of the url
– Referring url
• Do this for one record in Python
• Map this function over all records
using Spark
{"h":"1rzB4JL","g":"1laU0gx","l":"anonymous","hh":"1.usa.gov","u":"http://www.cdc.gov/cdcgrandrounds/index.htm","r":"direct","a":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)","i":"","t":1460753233,"k":"","nk":0,"hc":1413468615,"_id":"52d237f4-8c0e-0ac2-a0ed-a32acabe05bb","al":"en-US","c":"US","ll":[38,-97],"sl":"1rzB4JL","tz":"America/New_York"}
Day: Friday
Local_hour: 16
Device_type: pc
Browser: IE
OS: Windows 7
Is_bot: false
30
Actual transformation
• Define a parsing function
– Leverage the user_agents library
– Apply a custom function to the user agent string
– Keep every extracted entry as an item in a list
• Map the parsing function over the RDD of parsed json data, once for every record (sketched below)
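A minimal sketch of that step, assuming an RDD of json dicts named records (a hypothetical name) and the user_agents package; the deck's actual function extracts more fields than shown here:

from user_agents import parse as ua_parse

def parse_record(record):
    # apply the user_agents library to the raw user agent string (the "a" field)
    ua = ua_parse(record.get("a", ""))
    # keep every extracted entry as an item in a list
    return [ua.os.family,
            ua.browser.family,
            "mobile" if ua.is_mobile else ("pc" if ua.is_pc else "other"),
            record.get("c", "NoGeoInfo"),    # country
            record.get("tz", "NoGeoInfo")]   # timezone

parsed = records.map(parse_record)   # records: RDD of dicts from json.loads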
33
Distributed K-modes
34
How does clustering have to change to be distributed?
K-means example:
Clustering is a collective operation.
How can we distribute it?
35
How does clustering have to change to be distributed?
K-means example:
• Do k-means on each partition
• Cluster the collected centroids
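A minimal sketch of that two-stage idea, assuming a hypothetical local_kmeans(points, k) helper that runs ordinary k-means on an in-memory list and returns k centroids:

def cluster_partition(iterator):
    # stage 1: ordinary k-means over the records of a single partition
    yield local_kmeans(list(iterator), k=2)

# each partition contributes its own k centroids ...
per_partition = rdd.mapPartitions(cluster_partition).collect()
# ... stage 2: cluster the collected centroids on the driver
final_centroids = local_kmeans([c for group in per_partition for c in group], k=2)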
36
Mapping over data in Spark
• Map over a record:
def f(record): return transform(record)
rdd2 = rdd1.map(f)
37
Mapping over data in Spark
[Diagram: map applies a function to each record (block) of the full data set; what is the equivalent for a whole partition?]
Spark has two possibilities:
1. mapPartitions:
• get each record in turn and do something; return after all records are done
2. mapPartitionsWithIndex:
• keep track of which partition returned which result
38
Mapping over data in Spark
• Map over a record:
def f(record): return transform(record)
rdd2 = rdd1.map(f)
• Map over a partition:
def f(iterator): yield cluster(iterator)
rdd2 = rdd1.mapPartitions(f)
• Map over a partition with a partition key
def f(splitIndex, iterator): yield (splitIndex, cluster(iterator))
rdd2 = rdd1.mapPartitionsWithIndex(f)
For K-modes, we have open-sourced an implementation of distributed clustering:
https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes
iterator = yields each record in the partition once
39
Applying to 1USAGOV data
40
Getting 1usagov clickstream data
• Scrape data from archive site:
– http://1usagov.measuredvoice.com/
– json format
• Concatenate into files by month
• Store in HDFS
• Load into Spark
41
Loading json data
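A minimal sketch of this step, assuming the monthly files sit at a hypothetical HDFS path and that sc is the SparkContext:

import json

raw = sc.textFile("hdfs:///1usagov/2012-01.json")
records = (raw.filter(lambda line: line.strip())   # drop blank lines
              .map(json.loads))                    # one json object per line -> dict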
42
Parse to extract user agent information
• Python package user_agents
– Input string -> output information
• Add some custom parsing to extract features
– os family, os_version, device
• Use Spark to map this over each clickstream entry
43
Prepare for K-modes clustering
To reduce dimensionality:
• Decide which variables to use
for clustering
• Keep only the top few
categories for each variable
Prasad Patil, as referenced on http://www.newsnshit.com/curse-of-dimensionality-interactive-demo/
The CURSE of dimensionality ….
44
Prepare for K-modes clustering
• Decide which variables to use for clustering
– Country
– Timezone
– Device Type
– OS
– Browser
• Keep only the top few categories for each variable
Custom UDFs for Spark dataframes: apply a series of UDFs (sketched below)
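A minimal sketch of one such UDF, with assumed column and category names (collapsing rare browsers into "Other"):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

TOP_BROWSERS = {"Firefox", "Chrome", "IE", "Mobile Safari", "Android"}

def collapse_browser(browser):
    # keep only the top few categories; everything else becomes "Other"
    return browser if browser in TOP_BROWSERS else "Other"

collapse_browser_udf = udf(collapse_browser, StringType())
df = df.withColumn("browser", collapse_browser_udf(df["browser"]))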
45
Perform distributed k-modes clustering
• Uses the open-sourced package https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes
[Code screenshot, annotated: the call takes the # of modes and the max. iterations. Diagram: create an RDD from the full log; split it into partitions; run local clustering on each partition to get per-partition centroids; run distributed clustering over the collected centroids]
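A hypothetical usage sketch of that package; the module, class, and argument names below are assumptions pieced together from the slide's annotations, so the repo's README is the authority on the real interface:

from pyspark_distributed_kmodes import EnsembleKModes   # names assumed

method = EnsembleKModes(n_clusters=10, max_dist_iter=10)  # "# of modes", "max. iterations"
model = method.fit(parsed_rdd)   # parsed_rdd: RDD of categorical feature lists
print(model.clusters)            # the resulting modes (attribute name assumed)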
46
Clustering results: 10 clusters
47
What do the clusters look like?
# | Size | Country | Timezone | Device Type | OS | Browser
1 | 617820 | US: 93% | US/NY: 53% | PC: 97% | Win 7: 75% | Firefox: 57%
2 | 226035 | NotUS: 68% | Other: 57% | Mobile: 75% | iOS: 84% | Mobile Safari: 78%
3 | 152053 | NoGeoInfo: 86% | NoGeoInfo: 86% | PC: 99% | Windows: 81% | Chrome/IE: 72%
4 | 161947 | US: 96% | US/NY: 60% | PC: 99% | Windows not 7: 99% | IE: 81%
5 | 105090 | NoGeoInfo: 76% | NoGeoInfo: 76% | Mobile: 70% | Other: 70% | Other: 99%
6 | 235719 | NotUS: 99% | Other: 89% | PC: 99% | Win7: 68% | Chrome: 51%
7 | 121464 | US: 100% | US/LA: 59% | PC: 95% | Mac OS X: 72% | Chrome: 54%
8 | 121115 | US: 48% | NoGeoInfo: 40% | Mobile: 93% | Android: 100% | Android: 99%
9 | 101052 | NotUS: 98% | Other: 90% | PC: 100% | Win other than 7: 84% | Firefox: 57%
10 | 173424 | US: 100% | US/NY: 48% | Mobile: 68% | iOS: 100% | Mobile Safari: 74%
48
Access patterns
49
Access patterns
50
Top sites visited: January 2012
Description | Top 3 domains
US, pc, Win7 | www.nysdot.gov 212K; www.nasa.gov 59K; www.fda.gov 18K
US, pc, Win_not7, IE | www.nasa.gov 15K; www.shrewsbury-ma.gov 9K; www.fda.gov 5K
US, pc, Mac OS X | www.nysdot.gov 29K; www.nasa.gov 16K; www.whitehouse.gov 6K
notUS, pc, Win7 | www.nasa.gov 87K; earthobservatory.nasa.gov 15K; www.nysdot.gov 14K
notUS, pc, Win_not7 | www.nasa.gov 30K; www.navy.mil 8K; globalhealth.gov 7K
noGeo, pc, Win, Chrome | www.nasa.gov 34K; www.nysdot.gov 17K; earthobservatory.nasa.gov 6K
US, mobile, iOS | www.nasa.gov 33K; earthobservatory.nasa.gov 11K; forecast.weather.gov 9K
notUS, mobile, iOS | www.nasa.gov 82K; earthobservatory.nasa.gov 24K; www.navy.mil 13K
Mobile, Android | www.nasa.gov 29K; earthobservatory.nasa.gov 9K; www.navy.mil 6K
noGeo, mobile, OtherOS | www.nasa.gov 24K; www.nysdot.gov 8K; www.army.mil 5K
51
Where do users come from: January 2012
Description | Top 3 referring domains
US, pc, Win7 | direct 342K; t.co 135K; www.facebook.com 67K
US, pc, Win_not7, IE | direct 69K; t.co 33K; www.facebook.com 19K
US, pc, Mac OS X | t.co 49K; direct 41K; www.facebook.com 15K
notUS, pc, Win7 | t.co 125K; www.facebook.com 45K; direct 38K
notUS, pc, Win_not7 | t.co 41K; direct 29K; www.facebook.com 14K
noGeo, pc, Win, Chrome | t.co 56K; direct 47K; www.facebook.com 24K
US, mobile, iOS | twitter.com 83K; direct 59K; m.facebook.com 17K
notUS, mobile, iOS | twitter.com 119K; direct 69K; t.co 21K
Mobile, Android | t.co 62K; direct 34K; m.facebook.com 17K
noGeo, mobile, OtherOS | direct 63K; t.co 20K; m.facebook.com 13K
52
What happened in space that had the twitter-sphere
abuzz in January 2012?
Solar Flares!
Especially non-US users
to: nasa.gov, earthobservatory.nasa.gov
from: Twitter
http://earthobservatory.nasa.gov/NaturalHazards/view.php?id=76998
53
Summary
• Data processing operations, like parsing user-agent strings, can be distributed using Spark
• Clustering of large data sets can be distributed using Spark
• Clustering finds groups of related users/records
• These user types show distinct behaviors
• Segmenting users can drive insight and facilitate appropriate messaging
– When are they visiting?
– Where are they looking?
– Where are they coming from?
[Diagram: web log data → user information → user groups → targeted message]
54
Questions?
Slides available at:
http://www.slideshare.net/MarissaSaunders/clickstream-data-with-spark
Distributed K-modes clustering for pyspark:
https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes


Editor's Notes

  • #3 As an image: word map of title; flow chart showing process?
  • #6 Figure – demonstrate clustering on a 2-D numerical example; demonstrate clustering on categorical variables using a Venn diagram
  • #15 From the Apache website: For those that like pictures, here's a word cloud from the wiki page and the Apache info pages for Spark, MLlib, GraphX, Spark Streaming, and Spark SQL. The things that pop out at me are: distributed, Python, R, Scala, Java, machine learning, Hadoop, and HDFS. I'd add in-memory to this list: it is the major thing that sets Spark apart from Hadoop MapReduce and the thing that makes it able to run faster than MR. Spark is not, however, magic. You can't just run a Python or R script in a Spark context and expect it to automagically be distributed. But hopefully today's talk will show you how to make the (relatively easy) changes that you need to accomplish this. Moving away from general descriptions, let's take a look at the Spark ecosystem.