A talk on data science at Piano, covering the following:
1. Tips on how to make sure your data are analysis-friendly
2. A short introduction to doing data science with a for loop (partially stolen from https://goo.gl/wHwZKv)
3. A brief look at how the output of the paywall health check for our clients (publishers) has evolved
4. A sneak peek into the challenges we currently face
P I A N O F R A M E W O R K
What Data Science?
R U B Y S L A V A 2 0 1 5 P I A N O . I O
Why do we need analytics (data science)?
User segments: Paid users, Registered, Engaged, Casual
1. What is my potential? (Potential Clients)
2. X has happened, why? Is Y a good idea? (Existing Clients, Account managers)
3. Input into products
D A T A
What data do we collect?
1. Transactional Data
Subscriptions, Users
2. System settings
Products, Clients, Offers, Messaging
3. Clickstream Data
Pageviews, user agents
4. Conversion Data
Steps in a conversion funnel
Where the data lives, by platform stack (diagram labels: Piano, Press+, Tinypass):
• Stack 1
  Pageviews: MongoDB
  Transactions + System settings: PostgreSQL
  Conversion Data: Google Analytics
• Stack 2
  Pageviews: Amazon S3 (Cassandra)
  Transactions + System settings: Oracle
  Conversion Data: Amazon S3 (Cassandra)
• Stack 3
  Pageviews + Conversion Data: BigQuery (Cassandra)
  Transactions + System settings: MySQL
?
And then you wonder why 60 – 80 % of a data scientist’s job is cleaning and merging datasets…
What are we even measuring?
[Chart: % of paying users (y-axis, 0% to 40%) by number of devices (x-axis, 1 to 10+)]
1. Cookies …
… can be deleted
2. Fingerprinting …
… doesn’t work for mobile
3. IP address …
… shared networks, proxies, VPNs
4. Registration …
… Try convincing a publisher all their
readers should register
Ref: ‘Not your business’
Stats > Math
It is important to understand the data generating process
We are looking at how much users read in order to estimate the ideal setting for a metered paywall. In this particular case, we realized that the site refreshes after every 5 minutes of user inactivity, so refreshed pageviews inflate the apparent readership.
[Chart: % of users reading articles (y-axis, 0% to 50%) by readership level (x-axis, 1+ to 20+); series: All PVs, Without Refreshed PVs]
Domain knowledge and observations
Being able to perform a calculation doesn’t mean you should
A session is defined as a stream of actions by one user, with less than 30 minutes between consecutive actions, tied together by consecutive referrers.
It can also be tracked via a session cookie.
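The time-gap part of that session definition can be sketched in a few lines of Python. This is only an illustration, not the production logic: the tuple layout `(uid, timestamp, url)`, the function name `assign_sessions` and the sample pageviews are all hypothetical, and the referrer condition is ignored.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

# Assumption: a gap longer than 30 minutes starts a new session.
SESSION_GAP = timedelta(minutes=30)

def assign_sessions(pageviews):
    """Take (uid, timestamp, url) tuples and return (uid, timestamp, url, session_no)."""
    ordered = sorted(pageviews, key=itemgetter(0, 1))  # by user, then by time
    result = []
    for uid, rows in groupby(ordered, key=itemgetter(0)):
        session_no = 0
        previous = None
        for _, ts, url in rows:
            # First pageview of a user, or a long pause: open a new session.
            if previous is None or ts - previous > SESSION_GAP:
                session_no += 1
            previous = ts
            result.append((uid, ts, url, session_no))
    return result

# Hypothetical sample data.
pageviews = [
    ("u1", datetime(2015, 3, 1, 10, 0), "/a"),
    ("u1", datetime(2015, 3, 1, 10, 20), "/b"),
    ("u1", datetime(2015, 3, 1, 11, 0), "/c"),
    ("u2", datetime(2015, 3, 1, 10, 5), "/a"),
]
sessions = assign_sessions(pageviews)
for row in sessions:
    print(row)
```

Each user's session counter starts at 1, so a session is identified by the pair of user id and session number.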
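-- Assign a session id to every pageview: a new session starts when the user
-- changes or when more than 1800 s (30 min) passed since the previous pageview.
-- Relies on user variables being evaluated left to right, row by row,
-- which MySQL does not guarantee.
create table pvs_with_visits select
    @sessionId:=if(@prevUser=uid AND diff <= 1800, @sessionId, @sessionId+1) as sessionId,
    @prevUser:=uid AS uid,
    url, t, diff, rldiff, ref, dt, sid
from
    (select @sessionId:=0, @prevUser:='-') b
join
    (select
        -- seconds since the same user's previous pageview (0 for the first one)
        TIME_TO_SEC(if(@prevU=uid, TIMEDIFF(t, @prevD), '00:00')) as diff,
        -- logical AND instead of the bitwise & used originally
        if(@prevU=uid AND @prevrl!=1000, @prevrl-a_rl, 0) as rldiff,
        @prevU:=uid as uid,
        @prevD:=t as t,
        @prevrl:=a_rl as a_rl,
        url, ref, dt, sid
    from
        pageviews
    join
        -- initialise every variable read above (':=' assigns, '=' only compares)
        (select @prevD:=NULL, @prevrl:=0, @prevU:='-') a
    order by
        uid, t) a;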
Have mercy on your analysts
select *
from (SELECT
*,
FIRST_VALUE(referrerSegmentId) OVER (PARTITION BY uid, session_order order by datetime) AS session_ref,
FIRST_VALUE(url_class) OVER (PARTITION BY uid, session_order order by datetime) AS session_start_class
FROM (
SELECT
*,
MAX(session_order) OVER (PARTITION BY uid) AS n_sessions_of_user
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY datetime_string) AS pvs_order_in_visit
FROM (
SELECT
*,
CONCAT(uid, '-', CAST(session_order AS STRING)) AS session_id
FROM (
SELECT
*,
session_order_anc + 1 AS session_order
FROM (
SELECT
*,
SUM(new_event_boundary) OVER (PARTITION BY uid ORDER BY datetime_string) AS session_order_anc
FROM (
SELECT
*,
(datetime_ut - lag_1)/1000000/60 AS minutes_since_last_interval,
CASE WHEN (datetime_ut - lag_1)/1000000 > 30 * 60 THEN 1 ELSE 0 END AS new_event_boundary
FROM (
SELECT
*,
LAG(datetime_ut) OVER (PARTITION BY uid ORDER BY datetime_string) AS lag_1,
MONTH(datetime_string) AS month
FROM (
SELECT
*,
datetime AS datetime_string,
TIMESTAMP(datetime) AS datetime_ut
FROM
[tmp.pageviews_r])))))))))",
project = project;
Storage
• DO NOT keep information only in “Application Logic”
Period_Type_ID Decoding
1 Day(s)
2 Week
3 Every 10 days
4 Every other week
5 Every 15 days
6 Every 20 days
7 Month(s)/30-day
8 Month(s) Actual
9 2 Months/60-day
10 2 Months Actual
11 3 Months/90-day
12 3 Months Actual
13 6 Months/180-day
14 6 Months Actual
15 Year(s) 365-days
Duration = Period_Type * Cycle_Count
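One way to keep this decoding out of application logic is to store it as data next to the IDs. The sketch below is hypothetical: the labels come from the table above, but the day counts are read off those labels as an assumption, and the "Actual" calendar-month types are deliberately left without a fixed length.

```python
# Lookup table: period_type_id -> (label, fixed length in days or None).
# Day counts are an assumption derived from the labels; "Actual" periods
# follow the calendar, so they have no fixed length.
PERIOD_TYPES = {
    1: ("Day(s)", 1),
    2: ("Week", 7),
    3: ("Every 10 days", 10),
    4: ("Every other week", 14),
    5: ("Every 15 days", 15),
    6: ("Every 20 days", 20),
    7: ("Month(s)/30-day", 30),
    8: ("Month(s) Actual", None),
    9: ("2 Months/60-day", 60),
    10: ("2 Months Actual", None),
    11: ("3 Months/90-day", 90),
    12: ("3 Months Actual", None),
    13: ("6 Months/180-day", 180),
    14: ("6 Months Actual", None),
    15: ("Year(s) 365-days", 365),
}

def duration_days(period_type_id, cycle_count):
    """Duration = decoded period length * cycle count, for fixed-length periods."""
    label, days = PERIOD_TYPES[period_type_id]
    if days is None:
        raise ValueError(f"'{label}' depends on the calendar, not on a fixed day count")
    return days * cycle_count

print(duration_days(4, 2))  # two cycles of "Every other week"
```

With the table stored alongside the data, an analyst can compute durations without reading the application code.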
Storage
• DO store history, either in the form of delta logs or record validity dates
• As is versus as is
• As is versus as was
• As was versus as was
Record before cancellation:
Payment_id | Duration | Fee | Start Date | End Date
123456 | 366 | 39.00 | 1. 1. 2015 | 31. 1. 2015
Record after cancellation on 30. 6. 2015:
Payment_id | Duration | Fee | Start Date | End Date
123456 | 180 | 19.18 | 1. 1. 2015 | 30. 6. 2015
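A minimal sketch of the record-validity approach, assuming hypothetical `valid_from` / `valid_to` columns (neither is in the original table). Instead of overwriting the payment row on cancellation, both versions are kept, which makes "as was" questions answerable. The duration and fee values are taken from the example above.

```python
from datetime import date

# Both versions of payment 123456 are kept; valid_to=None marks the current one.
payment_history = [
    {"payment_id": 123456, "duration": 366, "fee": 39.00,
     "valid_from": date(2015, 1, 1), "valid_to": date(2015, 6, 30)},
    {"payment_id": 123456, "duration": 180, "fee": 19.18,
     "valid_from": date(2015, 6, 30), "valid_to": None},
]

def as_of(history, payment_id, day):
    """Return the version of the record that was valid on the given day, if any."""
    for row in history:
        if row["payment_id"] != payment_id:
            continue
        started = row["valid_from"] <= day
        not_ended = row["valid_to"] is None or day < row["valid_to"]
        if started and not_ended:
            return row
    return None

as_was = as_of(payment_history, 123456, date(2015, 3, 1))
as_is = as_of(payment_history, 123456, date(2015, 7, 15))
print(as_was["fee"], as_is["fee"])
```

Comparing `as_was` with `as_is` is the "as is versus as was" view; without the first row, only "as is" would be possible.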
On Big Data
• Do we really have Big Data? (50 %)
• Do we need to work with it in Big Data form? (1 % – 5 %)
Dataset sizes:
• 1.3 billion rows @ 25 columns: 3 months of US clickstream data, 150 GB gzipped ≈ 1.5 TB full
• 1.3 billion rows @ 6 columns: 3 months of US clickstream data, 252 GB full
• 614M rows @ 10 columns: 3 months of session data, 78.1 GB
• 340M rows @ 15 columns: 3 months of user data, 49.5 GB
• 3170 rows @ 31 columns: 1.3 MB
Your big data might be quite small…
M E T H O D S
If you can write a for loop you can do Data Science
What is a p-value?
One simple trick,
the statisticians hate it!
Is there a significant difference between
average CPU usage?
Data Centre 1: 84%, 72%, 57%, 46%, 63%, 76%, 99%, 91%
Data Centre 2: 81%, 69%, 74%, 61%, 56%, 87%, 69%, 65%, 66%, 44%, 62%, 69%
Mean Data Centre 1: 73.5 %
Mean Data Centre 2: 66.9 %
Difference: 6.6 %
source: https://speakerdeck.com/jakevdp/statistics-for-hackers
What would statistics do?
$t = \frac{73.5 - 66.9}{\sqrt{\frac{316.8}{8} + \frac{124.8}{12}}} = 0.932$
Test: $t > t_{crit}$?
$0.932 < 1.796$: the difference is not significant
source: https://speakerdeck.com/jakevdp/statistics-for-hackers
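For comparison, here is a small sketch that recomputes the statistic from the raw CPU figures using only the Python standard library. It assumes the unequal-variance (Welch) form of the t-test; the function name `welch_t` is made up for illustration. The variances are computed from the data, so they can differ slightly from the rounded numbers on the slide.

```python
from math import sqrt
from statistics import mean, variance

dc1 = [84, 72, 57, 46, 63, 76, 99, 91]
dc2 = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]

def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    se2_a = variance(a) / len(a)  # squared standard error of mean(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, dof

t_stat, dof = welch_t(dc1, dc2)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.1f}")
```

The critical value still has to be looked up in a table or taken from a statistics library, which is exactly the kind of step the for-loop approach avoids.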
What can a for loop do?
1. Shuffle the observations between the 2 groups randomly
2. Compute the mean of each group
3. Compute the difference of the means
4. Repeat n times (n can be 10 000)
5. The p-value is the share of repeats in which the difference is at least as large as the observed one
An approach used in real scientific papers
source: https://speakerdeck.com/jakevdp/statistics-for-hackers
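The shuffling recipe can be sketched as follows, using the CPU figures for the two data centres. This is an illustrative, one-sided version with a fixed random seed; the function name and the seed are arbitrary choices, not part of the talk.

```python
import random

dc1 = [84, 72, 57, 46, 63, 76, 99, 91]
dc2 = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]

def average(values):
    return sum(values) / len(values)

def permutation_p_value(a, b, n_iter=10_000, seed=0):
    """Share of random regroupings whose mean difference reaches the observed one."""
    rng = random.Random(seed)
    observed = average(a) - average(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)                      # 1. shuffle between the groups
        diff = (average(pooled[:len(a)])         # 2. mean of each group
                - average(pooled[len(a):]))      # 3. difference of the means
        if diff >= observed:
            hits += 1
    return observed, hits / n_iter               # 4. repeated n_iter times

observed_diff, p_value = permutation_p_value(dc1, dc2)
print(f"observed difference = {observed_diff:.1f}, p-value = {p_value:.3f}")
```

A p-value well above 0.05 leads to the same conclusion as the t-test: the difference is not significant.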
A more interesting application
1. Take the dataset you’ve built the model on
2. Shuffle Y values
3. Build the model from random data
In what % of cases did you build a model better than the original one?
Call:
glm(formula = DD_index ~ ., data = perm_data_dummies)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.71028 -0.07349 0.00514 0.08096 0.63189
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.052e-01 7.716e-02 5.251 1.52e-07 ***
`authorAdam Molon` -4.710e-01 5.098e-02 -9.239 < 2e-16 ***
`authorAlexandra Gibbs` -2.160e-02 1.546e-02 -1.397 0.162359
`authorAlex Crippen` 1.025e-01 1.647e-02 6.220 5.02e-10 ***
`authorAlex Rosenberg` 5.432e-02 1.495e-02 3.634 0.000279 ***
…
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.01748868)
Null deviance: 3115.15 on 35716 degrees of freedom
Residual deviance: 608.73 on 34807 degrees of freedom
AIC: -42258
Number of Fisher Scoring iterations: 2
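The same shuffling idea applied to a model can be sketched like this. To keep it self-contained, a one-variable linear regression scored by R² stands in for the `glm` above, and the data are synthetic; every name and number in the block is hypothetical.

```python
import random

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def share_of_better_shuffled_models(x, y, n_iter=2000, seed=0):
    """Refit on shuffled Y values and count how often the fit beats the original."""
    rng = random.Random(seed)
    original = r_squared(x, y)
    shuffled_y = list(y)
    better = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled_y)          # break any real link between X and Y
        if r_squared(x, shuffled_y) >= original:
            better += 1
    return original, better / n_iter

# Synthetic data with a real relationship plus noise.
data_rng = random.Random(42)
x = [float(i) for i in range(50)]
y = [2.0 * xi + data_rng.gauss(0, 10) for xi in x]

original_r2, share_better = share_of_better_shuffled_models(x, y)
print(f"original R^2 = {original_r2:.2f}, share of better shuffled models = {share_better:.4f}")
```

If a noticeable share of models built on shuffled data match or beat the original, the original model is probably fitting noise.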
O U T P U T S
Health Diagnostics for Sites with a Paywall: Approach 1
Can we, without doing a deep dive, identify areas for improvement for any of our approx. 1,200 (600) sites?
• Percentiles
• One-dimensional view
• Not everyone can be above average
$\text{Stop Rate} = \frac{\text{Unique Visitors Being Stopped}}{\text{Unique Visitors}}$
[Chart: values from 0% to 8% (y-axis) by percentile, 95% down to 5% (x-axis), with a “You are here” marker]
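A rough sketch of the percentile view: compute a site's stop rate and see where it falls among all sites. The function names and the sample stop rates are hypothetical, and the percentile definition used here (share of sites at or below the value) is just one common convention.

```python
from bisect import bisect_right

def stop_rate(stopped_visitors, unique_visitors):
    """Stop Rate = unique visitors being stopped / unique visitors."""
    return stopped_visitors / unique_visitors

def percentile_rank(value, population):
    """Percentage of sites whose value is at or below the given one."""
    ordered = sorted(population)
    return 100.0 * bisect_right(ordered, value) / len(ordered)

# Hypothetical stop rates of other sites, plus one site of interest.
all_sites = [0.005, 0.010, 0.015, 0.020, 0.030, 0.040, 0.050, 0.070]
my_site = stop_rate(stopped_visitors=300, unique_visitors=10_000)

rank = percentile_rank(my_site, all_sites)
print(f"stop rate = {my_site:.2%}, percentile = {rank:.1f}")
```

This gives the one-dimensional "You are here" reading, with the limitation noted above that not everyone can be above average.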
Health Diagnostics for Sites with a Paywall: Approach 2
• Clustering (PAM -> Fuzzy)
• KPIs in relation to each other
• Easy to read (or so we thought)
• Too much variation
[Figure panels: Site 1, Site 2]
Health Diagnostics for Sites with a Paywall: Approach 3
• Site similarity (Size, Age)
[Figure: example of a site with good 3rd-party benchmarks (Site 1)]
[Figure: example of a “lonely” site (xyx)]
Health Diagnostics for Sites with a Paywall: Approach 3
• Compare to:
• Similar Sites
• Sites of the same Publisher
• Worst Site
• Best Site
• Display multiple KPIs in different units in one chart
C H A L L E N G E S
Building Data Products
What if we didn’t have to analyze the data,
what if we could just say – [This] is
interesting?
[This] can be a combination of multiple
variables such as author, section, traffic from
device and anything else.
[This] sits in a hierarchy.
We want to know that [This] is interesting because of [This] alone, not because of [Parent(s) of This] or [Child of This].
If [This] is constructed from author, section and traffic from device, the hierarchy in which [This] sits also includes author, section and device individually, as well as all possible combinations of 2 variables.
If we assume an ever-changing number of variables that [This] can be constructed from, then constructing the hierarchy of all possible [This] elements requires the following number of queries for n variables:
$\#V(\text{This}) = \sum_{k=1}^{n} \frac{n!}{k!\,(n-k)!} = 2^n - 1$
Variables Queries
1 1
2 3
3 7
5 31
10 1,023
15 32,767
20 1,048,575
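The growth in the table can be checked with a short sketch that both counts and enumerates the variable combinations. The function names are made up for illustration; the three variable names are the ones used in the example above.

```python
from itertools import combinations
from math import comb

def count_combinations(n):
    """Number of non-empty subsets of n variables: sum of C(n, k) for k = 1..n."""
    return sum(comb(n, k) for k in range(1, n + 1))

def enumerate_combinations(variables):
    """Yield every non-empty combination of the given variables."""
    for k in range(1, len(variables) + 1):
        yield from combinations(variables, k)

variables = ["author", "section", "device"]
all_combos = list(enumerate_combinations(variables))
for combo in all_combos:
    print(combo)

for n in (1, 2, 3, 5, 10, 15, 20):
    print(n, count_combinations(n))
```

Because the count doubles with every added variable, querying each combination separately stops being practical somewhere between 10 and 20 variables.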
Currently we are looking for interesting [This] in a very simple context. We define a [Segment], which can be any type of user (in our case, loyal users), and we measure how strong its preference for [This] is compared with the general preference for [This] in the whole population.
And the results are exciting; sometimes [This] is clearly interesting because of one of its parents.
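One simple way to express "preference of a segment over the general preference" is a ratio of shares, often called lift. The sketch below is an assumption about how such a measure could look, not the measure actually used; the pageview counts and the name `preference_lift` are hypothetical.

```python
def preference_lift(segment_views, population_views, this):
    """Share of the segment's pageviews on [This] divided by the population's share."""
    segment_share = segment_views.get(this, 0) / sum(segment_views.values())
    population_share = population_views.get(this, 0) / sum(population_views.values())
    return segment_share / population_share

# Hypothetical pageview counts keyed by (author, section).
loyal_users = {("Author A", "sports"): 30, ("other", "other"): 70}
everyone = {("Author A", "sports"): 100, ("other", "other"): 900}

lift = preference_lift(loyal_users, everyone, ("Author A", "sports"))
print(f"lift = {lift:.2f}")
```

A lift well above 1 flags [This] as a candidate, but the same ratio should also be computed for its parents (here, the author alone and the section alone) to see whether the combination adds anything.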
To be continued…