4. Who cares about raw data?
• SaaS analytics are full-featured
• Custom variables let you link with your backend data
• Did you really join all the data you will need in the future?
• Do you have access to all the necessary data, and do you want to push it to the JS?
• What kinds of analysis will you do later on?
www.dataiku.com
5. A real example
Segmentation and tracking user-satisfaction
Pipeline: Raw tracking data → User-level stats → User base segmentation → Metrics per segment → Tracking over time
Data volume shrinks from TB to GB along the way
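The first step of this pipeline, going from raw tracking data to user-level stats, can be sketched as a plain aggregation. This is only an illustration, not the method used in the talk: the event fields (`visitor_id`, `session_id`, `type`) and the `user_level_stats` function are hypothetical.

```python
from collections import defaultdict


def user_level_stats(events):
    """Aggregate raw tracking events into one row of stats per visitor.

    Each event is a dict with hypothetical keys:
    visitor_id, session_id and type ("page" or "comment").
    """
    sessions = defaultdict(set)
    pages = defaultdict(int)
    comments = defaultdict(int)
    for event in events:
        visitor = event["visitor_id"]
        sessions[visitor].add(event["session_id"])
        if event["type"] == "page":
            pages[visitor] += 1
        elif event["type"] == "comment":
            comments[visitor] += 1
    return {
        visitor: {
            "sessions": len(sessions[visitor]),
            "pages": pages[visitor],
            "comments": comments[visitor],
        }
        for visitor in sessions
    }


# Made-up events, for illustration only.
raw_events = [
    {"visitor_id": "a", "session_id": "s1", "type": "page"},
    {"visitor_id": "a", "session_id": "s1", "type": "comment"},
    {"visitor_id": "a", "session_id": "s2", "type": "page"},
    {"visitor_id": "b", "session_id": "s3", "type": "page"},
]
stats = user_level_stats(raw_events)
```

The resulting per-visitor rows are what a segmentation step would then take as input.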
8. Labeling
Here you need your business intelligence
Segment labels:
• Search for a specific topic
• Newcomer from Google News
• Foreigner discovering the site
• Fan who loves to comment
• Home page wanderer
• Dark bot (competitor?)
9. Compute metrics per segment
Here you need to cross with your CRM
Segments:
• Search for a specific topic
• Newcomer from Google News
• Foreigner discovering the site
• Fan who loves to comment
• Home page wanderer
• Dark bot (competitor?)
Metrics shown on the segment diagram:
• 738k sessions, 0.83€ per session, 0.73€ acquisition costs
• 938k sessions, 0.3€ per session, 0.23€ acquisition costs
• 13k sessions, 1.3€ per session, 0.23€ acquisition costs
• 938k sessions, 0.3€ per session, 0.23€ acquisition costs
• 68k sessions, 0.3€ per session, 1.23€ acquisition costs
• 1k sessions, 0€ per session, 0€ acquisition costs
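Computing such figures is a group-by over labeled sessions. The sketch below assumes that "€ per session" means average revenue per session and that each session record already carries its segment label, revenue and acquisition cost (for instance after crossing with the CRM). The field names and `metrics_per_segment` are hypothetical, and the sample data is made up.

```python
from collections import defaultdict


def metrics_per_segment(sessions):
    """Return session count and per-session averages for each segment.

    Each session is a dict with hypothetical keys:
    segment, revenue, acquisition_cost (both in euros).
    """
    totals = defaultdict(lambda: {"sessions": 0, "revenue": 0.0, "cost": 0.0})
    for session in sessions:
        total = totals[session["segment"]]
        total["sessions"] += 1
        total["revenue"] += session["revenue"]
        total["cost"] += session["acquisition_cost"]
    return {
        segment: {
            "sessions": total["sessions"],
            "revenue_per_session": total["revenue"] / total["sessions"],
            "cost_per_session": total["cost"] / total["sessions"],
        }
        for segment, total in totals.items()
    }


# Made-up sessions, for illustration only.
sample = [
    {"segment": "fan", "revenue": 1.0, "acquisition_cost": 0.5},
    {"segment": "fan", "revenue": 2.0, "acquisition_cost": 0.0},
    {"segment": "bot", "revenue": 0.0, "acquisition_cost": 0.0},
]
metrics = metrics_per_segment(sample)
```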
10. Track metrics over time
Using your already-computed segments
Segments tracked:
• Search for a specific topic
• Newcomer from Google News
• Fan who loves to comment
• Home page wanderer
• Foreigner discovering the site
• Dark bot (competitor?)
"Damn, our latest release has diverging effects on segments"
11. A few other examples
• Churn prediction and explanation
• Customer lifetime value prediction
13. So, I have these Apache logs
• First level of web tracking
• "Nothing required"
14. Are backend logs a solution?
Challenge 1: Identify a visitor
• IP?
– NAT / proxy
– Not everyone has a public IP address
• IP + user-agent?
– Big companies! (many identical workstations behind one IP)
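Why IP + user-agent is a weak identifier can be shown in a few lines. The sketch below is an illustration only: `naive_visitor_key` is a hypothetical helper, and the IP addresses and user-agent strings are made up.

```python
import hashlib


def naive_visitor_key(ip, user_agent):
    """Derive a visitor key from the only identity hints found in a backend log line."""
    raw = f"{ip}|{user_agent}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:12]


# Two different employees behind the same corporate proxy, with the same
# company-managed browser, end up with the same key.
alice = naive_visitor_key("203.0.113.7", "Mozilla/5.0 (CorpBuild)")
bob = naive_visitor_key("203.0.113.7", "Mozilla/5.0 (CorpBuild)")
# The same person on another browser gets a different key.
alice_other_browser = naive_visitor_key("203.0.113.7", "OtherBrowser/1.0")
```

Here `alice == bob` although they are distinct visitors, while one person using two browsers is counted twice.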
15. Are backend logs a solution?
Challenge 2: Re-create sessions
• Using expiration times
• Advanced SQL / Hive / … makes this easier
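Re-creating sessions with an expiration time boils down to cutting a visitor's hits wherever the gap between two consecutive hits exceeds a timeout. A minimal sketch, assuming a 30-minute timeout (a common convention, not a value given in the talk) and a hypothetical `sessionize` helper:

```python
from datetime import datetime, timedelta


def sessionize(timestamps, timeout=timedelta(minutes=30)):
    """Group one visitor's hit timestamps into sessions.

    A new session starts whenever the gap since the previous hit
    is longer than `timeout`.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])  # gap too long: open a new session
    return sessions


# Made-up hits for a single visitor.
hits = [
    datetime(2013, 6, 1, 10, 0),
    datetime(2013, 6, 1, 10, 10),
    datetime(2013, 6, 1, 11, 0),
]
visitor_sessions = sessionize(hits)
```

In SQL or Hive the same logic is usually expressed with a window function that compares each hit to the previous one for the same visitor.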
16. Are backend logs a solution?
Challenge 3: Single-page webapps
• Track behaviour within each page
• Track events, not pages
Also: getting logs from IT is sometimes another challenge
17. Client-side tracking
• visitor_id and session_id handled with cookies
• Tracking page loads and various events
• Historically, "tracking" = fetching a 1x1 image
• Today: AJAX calls
[Diagram: the Browser loads www.website.com, runs the JS tracking code, and sends tracking calls to tracker.com]
18. Are cookies good for your (web) health?
• Each cookie belongs to a domain (and its subdomains)
• Who can write a cookie?
– The HTTP server, which becomes the owner (via the Set-Cookie HTTP header)
– JS code running on the "owner" domain
• Who can read a cookie?
– The owner HTTP server (sent by the browser)
– JS code running on the "owner" domain
19. First-party cookies
• Set by the originating server (HTTP) or JS code
• Belong to the originating domain
• Sent by HTTP to the originating domain only
• Readable by JS code
Participants: www.website.com, Browser, tracker.com
Sequence (first visit):
• Browser state: cookies for www.website.com: none
• Browser → www.website.com: GET / (Cookies: none)
• Browser fetches the tracking script
• Tracking JS code reads the cookies for www.website.com
• Tracking JS code creates a visitor id and sets the cookie
20. First-party cookies
Sequence (continued):
• Browser state: cookies for www.website.com: visitor_id=d37ecba
• JS code sends an AJAX request to tracker.com with the visitor_id
• Browser → tracker.com: GET /track?visitor_id=d37ecba (Cookies: none)
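The first-party flow can be summarised in a few lines. In reality this logic runs as JS in the browser. The Python below is only a stand-in to show the steps, with a plain dict playing the role of the cookie jar for www.website.com. The function names and the `http://tracker.com/track` URL are assumptions made for illustration.

```python
import uuid
from urllib.parse import urlencode


def get_or_create_visitor_id(site_cookies):
    """Read visitor_id from the first-party cookie jar, creating it on first visit."""
    if "visitor_id" not in site_cookies:
        site_cookies["visitor_id"] = uuid.uuid4().hex[:7]
    return site_cookies["visitor_id"]


def tracking_url(site_cookies, tracker="http://tracker.com/track"):
    """Build the tracking call: the id travels in the query string, not in a cookie."""
    visitor_id = get_or_create_visitor_id(site_cookies)
    return tracker + "?" + urlencode({"visitor_id": visitor_id})


# First visit: the jar is empty, so an id is created and stored.
jar = {}
first_call = tracking_url(jar)
# Later page loads reuse the stored id.
second_call = tracking_url(jar)
# A returning visitor who already holds the id from the slide.
returning_call = tracking_url({"visitor_id": "d37ecba"})
```

The tracker never sees a cookie of its own: the identifier only exists on the originating domain and is passed explicitly.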
21. Third-party cookies
• Set (in HTTP) by the tracker's domain
• Belong to the tracker's domain
• Not sent by HTTP to the originating domain (they do not belong to it)
• NOT readable by JS code on the originating site (they do not belong to it)
Participants: www.website.com, Browser, tracker.com
Sequence (first visit):
• Browser state: cookies for www.website.com: none; cookies for tracker.com: none
• Browser → www.website.com: GET / (Cookies: none)
• Browser fetches the tracking script
22. Third-party cookies
Sequence (continued):
• Browser state: cookies for www.website.com: none; cookies for tracker.com: none
• Browser → tracker.com: GET /track (Cookies: none)
• Tracker code assigns a visitor_id
• tracker.com → Browser: 200 OK, Set-Cookie: visitor_id=33d7
23. Third-party cookies
Sequence (subsequent calls):
• Browser state: cookies for www.website.com: none; cookies for tracker.com: visitor_id=33d7
• Browser → tracker.com: GET /track (Cookies: visitor_id=33d7)
• Tracker code reads the visitor_id
• tracker.com → Browser: 200 OK
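On the tracker side, the third-party flow amounts to "read the cookie if it is there, otherwise assign an id and send Set-Cookie". A minimal sketch using only the Python standard library. `handle_track` is a hypothetical handler, not the API of any particular tracker.

```python
import uuid
from http.cookies import SimpleCookie


def handle_track(cookie_header):
    """Handle one GET /track call on the tracker's domain.

    `cookie_header` is the raw Cookie request header (or None).
    Returns (visitor_id, extra_response_headers).
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    if "visitor_id" in cookies:
        # Known visitor: the browser sent back the tracker's own cookie.
        return cookies["visitor_id"].value, []
    # Unknown visitor: assign an id and ask the browser to store it.
    visitor_id = uuid.uuid4().hex[:4]
    return visitor_id, [("Set-Cookie", f"visitor_id={visitor_id}")]


# First call: no cookie yet, so the response carries Set-Cookie.
new_id, first_headers = handle_track(None)
# Next call: the browser sends the cookie back, nothing to set.
same_id, next_headers = handle_track(f"visitor_id={new_id}")
```

Because the cookie belongs to the tracker's domain, the same id comes back from every website that embeds this tracker.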
24. Why each?
First-party cookie
• Tracks on a single website
• Requires JS code for tracking
• Reduced privacy impact: no exchange of information between sites
• Usage: track your users' behaviour
• Rarely blocked (used for logins)
Third-party cookie
• Tracks across all websites using the same tracker
• More frowned upon
• Usage: generally ads, but also multi-website tracking
• Blocked by up to 40% of visitors
25. What are your obligations?
With ALL cookies
• You should ask users whether they want cookies
• Even non-tracking-related cookies
• Yes, even login-related ones
26. What are your obligations?
With third-party cookies
• Obey the Do-Not-Track header
Sequence:
• Browser → tracker.com: GET /track (Cookies: none, DNT: 1)
• Tracker code: DO NOTHING
• tracker.com → Browser: 200 OK
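Honouring Do-Not-Track is a small check at the top of the tracking handler. A minimal sketch. `should_track` and `handle_track_request` are hypothetical names, and headers are passed as a plain dict.

```python
def should_track(headers):
    """Return False when the request carries DNT: 1 (header names are case-insensitive)."""
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("dnt") != "1"


def handle_track_request(headers, record_event):
    """Answer 200 OK in every case, but record nothing for DNT visitors."""
    if should_track(headers):
        record_event(headers)
    return 200


recorded = []
status_dnt = handle_track_request({"DNT": "1"}, recorded.append)
recorded_after_dnt = len(recorded)
status_plain = handle_track_request({"User-Agent": "x"}, recorded.append)
```

For a DNT visitor the tracker should also refrain from setting or reading its visitor_id cookie.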
27. What are your obligations?
With third-party cookies
• Provide an opt-out URL
• Allows the user to call /optin, /optout or /status
See it in action: www.youronlinechoices.com
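The three opt-out endpoints can be sketched as a tiny state machine over the tracker's own cookies. This assumes the choice is stored in a cookie named `optout` on the tracker's domain, which is an assumption for illustration. `handle_choice` is a hypothetical function.

```python
def handle_choice(path, cookies):
    """Apply /optin, /optout or /status to the tracker-domain cookies (a dict).

    Returns the resulting status: "opted-out" or "opted-in".
    """
    if path == "/optout":
        cookies["optout"] = "1"
        cookies.pop("visitor_id", None)  # stop identifying this browser
    elif path == "/optin":
        cookies.pop("optout", None)
    elif path != "/status":
        raise ValueError(f"unknown path: {path}")
    return "opted-out" if cookies.get("optout") == "1" else "opted-in"


tracker_cookies = {"visitor_id": "33d7"}
initial = handle_choice("/status", tracker_cookies)
after_optout = handle_choice("/optout", tracker_cookies)
after_optin = handle_choice("/optin", tracker_cookies)
```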
28. What are your obligations?
With third-party cookies
• Provide a P3P policy: "What are you doing with my data?"
• Otherwise, older versions of IE block you
Looks like this:
CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
29. Tracking in mobile apps
• Preserve battery
– Each network call is costly
– Do not track everything synchronously
• Network access is intermittent
– Queue events and wait for network access
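Both constraints point to the same design: buffer events locally and send them in batches when the network is available. A minimal sketch of such a queue. `EventQueue`, its `send_batch` callback and the batch size are hypothetical and do not describe any specific mobile library.

```python
class EventQueue:
    """Buffer tracking events and send them in batches."""

    def __init__(self, send_batch, batch_size=20):
        # send_batch(list_of_events) must return True when the upload succeeded.
        self.send_batch = send_batch
        self.batch_size = batch_size
        self.pending = []

    def track(self, event, network_available):
        """Queue the event; only touch the network once a full batch is waiting."""
        self.pending.append(event)
        if network_available and len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Try to send everything that is pending; keep events if the upload fails."""
        if not self.pending:
            return
        batch = list(self.pending)
        if self.send_batch(batch):
            del self.pending[:len(batch)]


sent = []
queue = EventQueue(lambda batch: sent.append(batch) or True, batch_size=2)
queue.track("open_app", network_available=False)
queue.track("view_item", network_available=False)  # offline: nothing is sent
queue.track("add_to_cart", network_available=True)  # back online: one batch
```

One network call carries all three events, and nothing is lost while the device is offline.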
30. So, what are my choices?
• You might really want to be your own web tracker
• Most used open source web tracker: Piwik
• Provides both raw data and nice dashboards
– MySQL backend
– Raw data via API
– Slightly less suited for analytics
32. WT1
An open source (Apache License) server to build your own web tracking
https://github.com/dataiku/wt1
• Designed to provide you with raw data, directly usable for analytics
• Very high performance and scalability
33. Features
• 1st or 3rd party cookies
– Handling of DNT and opt-out
– Helps with P3P handling
• Track events or pages with key-value data
• Visitor-scope and session-scope variables
• "Live view" debugging console
34. Features
• Dashboards: None
• Events processing and storage
– Filesystem, S3
– Event queues: Flume
– Custom processors
• JSON API for custom tracking
• iOS library
35. Architecture
Client-side: JS tracker, iOS library
• 1st or 3rd party cookies
• Event-level tracking
• Automatic batching
• Queuing to deal with network interruptions
Client → WT1 Server: JSON POST
WT1 Server
• Java
• > 20K events / second
• Handles DNT, P3P, opt-out, …
Raw storage
• Filesystem
• S3
Event processors
• Real-time aggregations
• Custom code
Event queues
• Flume
• Kafka, RabbitMQ, …
36. Future work
• Android library
• More event queues supported OOTB
– Kafka
– RabbitMQ
• Avro storage
Web tracking is important, right? You must understand how your users behave on your website
One of the core points of lean
So, let's not do it anymore and let others do it!
A huge number of SaaS solutions provide great dashboards
Chances are good that you should use one of them!
This talk encourages you to do it yourself, but as a startup you should probably start with a hosted solution.
You generally have to choose between "cheap" (or free) solutions
Free: Google Analytics is an entry point to sell ads. Not bad, but you should know what it's about.
Examples of data to add:
complaints / support calls
History prior to setting up *this* tracking
Analysis: ML is not inaccessible or reserved for elites
Track user satisfaction metrics over time *by behaviour*
Not science fiction
Raw -> User: recreate *features* for users. Time-based aggregations
What Olivier Grisel just said
Just a few quick remarks
Fairly standard if you are used to web trackers
GA-like API