MAKING BIG DATA COME ALIVE
Clustering click-stream data using Spark
Marissa Saunders
Slides available at:
http://www.slideshare.net/MarissaSaunders/clickstream-data-with-spark
2
• Why?
– Why clustering?
– Why Spark?
– Why click-stream?
• What?
– What is the raw data?
• How?
– Parsing user agent data on Spark
– Distributed K-modes on Spark
• So what?
– Details of applying the method to this
use case
– Resulting clusters
– Time access patterns
– Preferred websites
• Questions
Agenda
3
Objectives
Understand:
• k-means and k-modes clustering
• why Spark is a good choice
• different data structures in Spark
– RDD, dataframe and dataset
• clickstream data and how user-agent parsing works
Demonstrate:
• mapping a function over an RDD
• defining a custom UDF and mapping it over a dataframe
• mapping a Python function over a partition
• how identifying different user types can drive insight into user
behavior
4
Why Clustering?
5
We have a plot like this …
• 2 groups of data
• Clustering can find them
• This can lead to insight …
– There are two different groups of
unladen swallows
– The heavy species flies more
slowly
– When asking for airspeed, we
should specify if we mean African
or European swallows
Why clustering?
… with apologies to Monty Python
[Plot: flight velocity vs. bird mass, colored by bird type]
6-11
How does it work?
For 2 clusters:
1. Pick 2 points at random as centroids
2. Assign each point to its closest centroid
3. Calculate the mean of each cluster as the new centroids
4. Repeat 2 and 3 to convergence
Converged
This is called
K-means
clustering
… and there is a Spark
function for this
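A minimal sketch of that function, using the pyspark.ml API (the deck does not show its exact call, and the numbers below are made-up stand-ins for the swallow plot):

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

# toy (mass, velocity) points standing in for the two swallow groups
df = spark.createDataFrame(
    [(20.0, 11.0), (22.0, 10.5), (54.0, 7.0), (60.0, 6.5)],
    ["mass", "velocity"])

# Spark's KMeans expects a single vector column of features
df = VectorAssembler(inputCols=["mass", "velocity"],
                     outputCol="features").transform(df)

model = KMeans(k=2, seed=1).fit(df)   # runs steps 1-4 above to convergence
model.transform(df).show()            # adds a "prediction" (cluster id) column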
12
What about categorical data?
• Use modes instead of means
– Most frequently occurring value
• Use binary distance metric for each dimension
– 0 = the same
– 1 = not the same
• Use the same iterative cluster assignment algorithm
This is called
K-modes
clustering
Color | Mass | Speed | Type
Green/Grey | Heavy | Slow | African
Green/Grey | Heavy | Fast | African
Green/Black | Heavy | Slow | African
Green/Grey | Light | Slow | African
Blue/White | Heavy | Fast | European
Blue/White | Light | Fast | European
Blue/Grey | Light | Slow | European
Blue/White | Light | Fast | European
… and we’ve open-sourced
a Spark function for this
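The building blocks are simple enough to sketch in plain Python (this is an illustration, not the open-sourced Spark implementation linked above):

from collections import Counter

def matching_distance(a, b):
    # binary distance per dimension: 0 if the same, 1 if not
    return sum(int(x != y) for x, y in zip(a, b))

def mode_of_cluster(records):
    # a categorical "centroid" is the most frequent value in each column
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

swallows = [("Green/Grey", "Heavy", "Slow"),
            ("Green/Grey", "Heavy", "Fast"),
            ("Green/Black", "Heavy", "Slow")]
print(mode_of_cluster(swallows))                    # ('Green/Grey', 'Heavy', 'Slow')
print(matching_distance(swallows[0], swallows[1]))  # 1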
13
Why Spark?
14
What is Spark?
Apache Spark™ is a fast and general engine for large-scale data processing.
- spark.apache.org
• Distributed computing
• Relies on HDFS (or other DFS)
• In-memory
• Optimized execution
• High-level functionality
15
[Diagram: a data set split into blocks (Block1-Block8) spread across the nodes of a cluster]
Why Spark?
• Take the computation to the data
• Spark works faster on partitioned data than MapReduce
– In-memory operation avoids I/O costs
– DAG optimization reduces computational costs
• Fast to develop
– Data transformation and machine learning libraries are part of Spark
http://spark.apache.org/docs/latest/cluster-overview.html
It is FAST
16
Basic data structures in Spark
• Resilient Distributed Dataset (RDD)
• Dataframe = RDD with a schema
– SQL-style syntax
– Refer to column by name
– Optimized queries
• Dataset = best of both worlds?!?
[Diagram: the full data set is split into blocks, and each block is stored on more than one node]
What makes it resilient?
• Multiple copies
• Stores lineage
17
A little terminology …
[Diagram: the full data set is spread across nodes; each node holds partitions, and each partition holds records]
18
Why Clickstream?
19
What is clickstream data?
• Information trail left behind by each user
• Semi-structured website log files
• Includes:
– User agent information
- Device
- OS
- Browser
– Geo information
- Timezone
- Lat/Longitude
- City
- Country
– Time of access
– Referring website
– Website accessed
Photo credit: Tim Franklin Photography via Foter.com
20
What is this good for?
• Web analytics can answer questions like:
– How long do users take from first visit to purchase?
– When do users visit the website?
– What marketing channels are effective in attracting users?
– Where are users located?
– What are the paths that users take through the website?
– How long do users stay on a specific page?
– Which pages draw the most users?
– etc…
21
The sample use case
Clickstream data from 1usagov
– Created whenever anyone shortens a .gov or .mil site with bitly
– Feed at http://developer.usa.gov/1usagov
– Archive for 2011-2013: http://bitly.measuredvoice.com/bitly_archive/?C=M;O=D
Why this is a great dataset:
– Large volume
– Realistic format
- Streaming
- Not cleaned
– Interesting questions
- What subtypes of users are there?
- How do the activity patterns of these subtypes differ?
– Publicly available archive
22
What is the raw data?
23
What is the raw data?
{"h":"1rzB4JL","g":"1laU0gx","l":"anonymous","hh":"1.usa.gov","u":"http://www.cdc.gov/cdcgrandrounds/index.htm","r":"direct","a":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)","i":"","t":1460753233,"k":"","nk":0,"hc":1413468615,"_id":"52d237f4-8c0e-0ac2-a0ed-a32acabe05bb","al":"en-US","c":"US","ll":[38,-97],"sl":"1rzB4JL"}
• json format
• Fields include:
– Website clicked (long url)
– Referring url
– User agent – what machine is this?
– Time accessed
– etc.
28
Parsing click stream data on Spark
29
High level picture
• Need to extract:
– Time in date, hours
– Information about the user:
- Device type
- OS
- Timezone
– Main domain of the url
– Referring url
• Do this for one record in Python
• Map this function over all records
using Spark
{"h":"1rzB4JL","g":"1laU0gx","l":"anonymous","hh":"1.usa.gov","u":"http://www.cdc.gov/cdcgrandrounds/index.htm","r":"direct","a":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)","i":"","t":1460753233,"k":"","nk":0,"hc":1413468615,"_id":"52d237f4-8c0e-0ac2-a0ed-a32acabe05bb","al":"en-US","c":"US","ll":[38,-97],"sl":"1rzB4JL","tz":"America/New_York"}
Day: Friday
Local_hour: 16
Device_type: pc
Browser: IE
OS: Windows 7
Is_bot: false
30
Actual transformation
• Define a parsing function
– Leverage the user_agents library
– Apply a custom function to the user agent string
– Keep every extracted entry as an item in a list
• Map the parsing function over the RDD of parsed json data, once for every record (sketched below)
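A minimal sketch of that step, assuming an RDD of json dicts named records (a hypothetical name) and the user_agents package; the deck's actual function extracts more fields than shown here:

from user_agents import parse as ua_parse

def parse_record(record):
    # apply the user_agents library to the raw user agent string (the "a" field)
    ua = ua_parse(record.get("a", ""))
    # keep every extracted entry as an item in a list
    return [ua.os.family,
            ua.browser.family,
            "mobile" if ua.is_mobile else ("pc" if ua.is_pc else "other"),
            record.get("c", "NoGeoInfo"),    # country
            record.get("tz", "NoGeoInfo")]   # timezone

parsed = records.map(parse_record)   # records: RDD of dicts from json.loads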
33
Distributed K-modes
34
How does clustering have to change to be distributed?
K-means example:
Clustering is a collective operation.
How can we distribute it?
35
How does clustering have to change to be distributed?
K-means example:
• Do k-means on each partition
• Cluster the collected centroids
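A minimal sketch of that two-stage idea, assuming a hypothetical local_kmeans(points, k) helper that runs ordinary k-means on an in-memory list and returns k centroids:

def cluster_partition(iterator):
    # stage 1: ordinary k-means over the records of a single partition
    yield local_kmeans(list(iterator), k=2)

# each partition contributes its own k centroids ...
per_partition = rdd.mapPartitions(cluster_partition).collect()
# ... stage 2: cluster the collected centroids on the driver
final_centroids = local_kmeans([c for group in per_partition for c in group], k=2)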
36
Mapping over data in Spark
• Map over a record:
def f(record): return transform(record)
rdd2 = rdd1.map(f)
37
Mapping over data in Spark
[Diagram: map applies a function to each record (block) of the full data set; what is the equivalent for a whole partition?]
Spark has two possibilities:
1. mapPartitions:
• get each record in turn and do something; return after all records are done
2. mapPartitionsWithIndex:
• keep track of which partition returned which result
38
Mapping over data in Spark
• Map over a record:
def f(record): return transform(record)
rdd2 = rdd1.map(f)
• Map over a partition:
def f(iterator): yield cluster(iterator)
rdd2 = rdd1.mapPartitions(f)
• Map over a partition with a partition key
def f(splitIndex, iterator): yield (splitIndex, cluster(iterator))
rdd2 = rdd1.mapPartitionsWithIndex(f)
For K-modes, we have open-sourced an implementation of distributed clustering:
https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes
iterator = yields each record in the partition once
39
Applying to 1USAGOV data
40
Getting 1usagov clickstream data
• Scrape data from archive site:
– http://1usagov.measuredvoice.com/
– json format
• Concatenate into files by month
• Store in HDFS
• Load into Spark
41
Loading json data
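A minimal sketch of this step, assuming the monthly files sit at a hypothetical HDFS path and that sc is the SparkContext:

import json

raw = sc.textFile("hdfs:///1usagov/2012-01.json")
records = (raw.filter(lambda line: line.strip())   # drop blank lines
              .map(json.loads))                    # one json object per line -> dict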
42
Parse to extract user agent information
• Python package user_agents
– Input string -> output information
• Add some custom parsing to extract features
– os family, os_version, device
• Use Spark to map this over each clickstream entry
43
Prepare for K-modes clustering
To reduce dimensionality:
• Decide which variables to use
for clustering
• Keep only the top few
categories for each variable
Prasad Patil, as referenced on http://www.newsnshit.com/curse-of-dimensionality-interactive-demo/
The CURSE of dimensionality ….
44
Prepare for K-modes clustering
• Decide which variables to use for clustering
– Country
– Timezone
– Device Type
– OS
– Browser
• Keep only the top few categories for each variable
Custom UDFs for Spark dataframes: apply a series of UDFs (sketched below)
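A minimal sketch of one such UDF, with assumed column and category names (collapsing rare browsers into "Other"):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

TOP_BROWSERS = {"Firefox", "Chrome", "IE", "Mobile Safari", "Android"}

def collapse_browser(browser):
    # keep only the top few categories; everything else becomes "Other"
    return browser if browser in TOP_BROWSERS else "Other"

collapse_browser_udf = udf(collapse_browser, StringType())
df = df.withColumn("browser", collapse_browser_udf(df["browser"]))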
45
Perform distributed k-modes clustering
• Uses the open-sourced package https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes
[Code screenshot, annotated: the call takes the # of modes and the max. iterations. Diagram: create an RDD from the full log; split it into partitions; run local clustering on each partition to get per-partition centroids; run distributed clustering over the collected centroids]
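A hypothetical usage sketch of that package; the module, class, and argument names below are assumptions pieced together from the slide's annotations, so the repo's README is the authority on the real interface:

from pyspark_distributed_kmodes import EnsembleKModes   # names assumed

method = EnsembleKModes(n_clusters=10, max_dist_iter=10)  # "# of modes", "max. iterations"
model = method.fit(parsed_rdd)   # parsed_rdd: RDD of categorical feature lists
print(model.clusters)            # the resulting modes (attribute name assumed)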
46
Clustering results: 10 clusters
47
What do the clusters look like?
# | Size | Country | Timezone | Device Type | OS | Browser
1 | 617820 | US: 93% | US/NY: 53% | PC: 97% | Win 7: 75% | Firefox: 57%
2 | 226035 | NotUS: 68% | Other: 57% | Mobile: 75% | iOS: 84% | Mobile Safari: 78%
3 | 152053 | NoGeoInfo: 86% | NoGeoInfo: 86% | PC: 99% | Windows: 81% | Chrome/IE: 72%
4 | 161947 | US: 96% | US/NY: 60% | PC: 99% | Windows not 7: 99% | IE: 81%
5 | 105090 | NoGeoInfo: 76% | NoGeoInfo: 76% | Mobile: 70% | Other: 70% | Other: 99%
6 | 235719 | NotUS: 99% | Other: 89% | PC: 99% | Win7: 68% | Chrome: 51%
7 | 121464 | US: 100% | US/LA: 59% | PC: 95% | Mac OS X: 72% | Chrome: 54%
8 | 121115 | US: 48% | NoGeoInfo: 40% | Mobile: 93% | Android: 100% | Android: 99%
9 | 101052 | NotUS: 98% | Other: 90% | PC: 100% | Win other than 7: 84% | Firefox: 57%
10 | 173424 | US: 100% | US/NY: 48% | Mobile: 68% | iOS: 100% | Mobile Safari: 74%
48
Access patterns
49
Access patterns
50
Top sites visited: January 2012
Description | Top 3 domains
US, pc, Win7 | www.nysdot.gov 212K; www.nasa.gov 59K; www.fda.gov 18K
US, pc, Win_not7, IE | www.nasa.gov 15K; www.shrewsbury-ma.gov 9K; www.fda.gov 5K
US, pc, Mac OS X | www.nysdot.gov 29K; www.nasa.gov 16K; www.whitehouse.gov 6K
notUS, pc, Win7 | www.nasa.gov 87K; earthobservatory.nasa.gov 15K; www.nysdot.gov 14K
notUS, pc, Win_not7 | www.nasa.gov 30K; www.navy.mil 8K; globalhealth.gov 7K
noGeo, pc, Win, Chrome | www.nasa.gov 34K; www.nysdot.gov 17K; earthobservatory.nasa.gov 6K
US, mobile, iOS | www.nasa.gov 33K; earthobservatory.nasa.gov 11K; forecast.weather.gov 9K
notUS, mobile, iOS | www.nasa.gov 82K; earthobservatory.nasa.gov 24K; www.navy.mil 13K
Mobile, Android | www.nasa.gov 29K; earthobservatory.nasa.gov 9K; www.navy.mil 6K
noGeo, mobile, OtherOS | www.nasa.gov 24K; www.nysdot.gov 8K; www.army.mil 5K
51
Where do users come from: January 2012
Description | Top 3 referring domains
US, pc, Win7 | direct 342K; t.co 135K; www.facebook.com 67K
US, pc, Win_not7, IE | direct 69K; t.co 33K; www.facebook.com 19K
US, pc, Mac OS X | t.co 49K; direct 41K; www.facebook.com 15K
notUS, pc, Win7 | t.co 125K; www.facebook.com 45K; direct 38K
notUS, pc, Win_not7 | t.co 41K; direct 29K; www.facebook.com 14K
noGeo, pc, Win, Chrome | t.co 56K; direct 47K; www.facebook.com 24K
US, mobile, iOS | twitter.com 83K; direct 59K; m.facebook.com 17K
notUS, mobile, iOS | twitter.com 119K; direct 69K; t.co 21K
Mobile, Android | t.co 62K; direct 34K; m.facebook.com 17K
noGeo, mobile, OtherOS | direct 63K; t.co 20K; m.facebook.com 13K
52
What happened in space that had the twitter-sphere
abuzz in January 2012?
Solar Flares!
Especially non-US users
to: nasa.gov, earthobservatory.nasa.gov
from: Twitter
http://earthobservatory.nasa.gov/NaturalHazards/view.php?id=76998
53
Summary
• Data processing operations, like parsing user-agent strings, can be distributed using Spark
• Clustering of large data sets can be distributed using Spark
• Clustering finds groups of related users/records
• These user types show distinct behaviors
• Segmenting users can drive insight and facilitate appropriate messaging
– When are they visiting?
– Where are they looking?
– Where are they coming from?
[Diagram: web log data → user information → user groups → targeted message]
54
Questions?
Slides available at:
http://www.slideshare.net/MarissaSaunders/clickstream-data-with-spark
Distributed K-modes clustering for pyspark:
https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes


Editor's Notes

  • #3 As an image: word map of title; flow chart showing process?
  • #6 Figure – demonstrate clustering on a 2-D numerical example; demonstrate clustering on categorical variables using a Venn diagram
  • #15 From the Apache website: For those that like pictures, here's a word cloud from the wiki page and the Apache info pages for Spark, MLlib, GraphX, Spark Streaming, and Spark SQL. The things that pop out at me are: distributed, Python, R, Scala, Java, machine learning, Hadoop, and HDFS. I'd add in-memory to this list: it is the major thing that sets Spark apart from Hadoop MapReduce and the thing that makes it able to run faster than MR. Spark is not, however, magic. You can't just run a Python or R script in a Spark context and expect it to automagically be distributed. But hopefully today's talk will show you how to make the (relatively easy) changes that you need to accomplish this. Moving away from general descriptions, let's take a look at the Spark ecosystem.