2 0 1 5 P I A N O . I O
Data Science @
Rubyslava
October 2015
Overview
01 Intro
02 Data
03 Methods
04 Outputs
05 Challenges
P I A N O F R A M E W O R K
What Data Science?
R U B Y S L A V A 2 0 1 5 P I A N O . I O
Why do we need analytics (data science)?
[Figure: user segments: Paid Users, Registered, Engaged, Casual]
1. What is my potential? (Potential Clients)
2. X has happened, why? Is Y a good idea? (Existing
Clients, Account managers)
3. Input into products
D A T A
What data do we collect?
1. Transactional Data
Subscriptions, Users
2. System settings
Products, Clients, Offers, Messaging
3. Clickstream Data
Pageviews, user agents
4. Conversion Data
Steps in a conversion funnel
[Diagram: three storage stacks and a "?"; labels: Piano, Press+, Tinypass]
Stack 1: Pageviews: MongoDB. Transactions + system settings: PostgreSQL. Conversion data: Google Analytics.
Stack 2: Pageviews: Amazon S3 (Cassandra). Transactions + system settings: Oracle. Conversion data: Amazon S3 (Cassandra).
Stack 3: Pageviews + conversion data: BigQuery (Cassandra). Transactions + system settings: MySQL.
And then you wonder why 60–80 % of a data scientist’s job is cleaning and merging datasets…
What are we even measuring?
[Chart: % Paying Users by # Devices (1 to 10+)]
1. Cookies … can be deleted
2. Fingerprinting … doesn’t work for mobile
3. IP address … shared networks, proxies, VPNs
4. Registration … try convincing a publisher that all their readers should register
Ref: ‘Not your business’
Stats > Math
It is important to understand the data generating process
We are looking at how much users read in order to estimate the ideal setting for a metered paywall. In this particular case we realized that the site refreshes itself after every 5 minutes of user inactivity, so refreshes are counted as pageviews.
[Chart: % of users reading articles by readership level (1+ to 20+); series: All PVs, Without Refreshed PVs]
Domain knowledge and observations
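One way to strip the refreshed pageviews could look like the sketch below. It is only an illustration: the `Pageview` fields, the 5-second tolerance and the rule itself (same user, same URL, roughly 300 seconds after the previous pageview) are assumptions, not the filter actually used for the chart.

```python
from typing import NamedTuple


class Pageview(NamedTuple):
    uid: str    # user identifier (hypothetical field name)
    url: str
    t: float    # timestamp in seconds


REFRESH_SECONDS = 300  # the 5-minute auto-refresh described above
TOLERANCE = 5          # assumed slack in seconds, purely illustrative


def drop_auto_refreshes(pageviews):
    """Return the pageviews that do not look like automatic reloads.

    A pageview counts as an auto-refresh when the same user requests the
    same URL roughly REFRESH_SECONDS after their previous pageview.
    """
    kept = []
    previous = {}  # uid -> that user's previous pageview, kept or not
    for pv in sorted(pageviews, key=lambda p: (p.uid, p.t)):
        prev = previous.get(pv.uid)
        is_refresh = (
            prev is not None
            and prev.url == pv.url
            and abs((pv.t - prev.t) - REFRESH_SECONDS) <= TOLERANCE
        )
        if not is_refresh:
            kept.append(pv)
        previous[pv.uid] = pv
    return kept
```

Comparing against the previous pageview even when it was itself dropped lets a chain of consecutive refreshes collapse into the single real pageview that started it.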
Being able to perform a calculation doesn’t mean you should
A session is defined as a stream of actions by one user with less than 30 minutes between consecutive actions, tied together by consecutive referrers.
It can also be tracked via a session cookie.
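-- Assign a session id to every pageview using MySQL user variables
-- (relies on left-to-right evaluation of the select list).
create table pvs_with_visits select
  -- new session on a new user or after a gap of more than 1800 s
  @sessionId:=if(@prevUser=uid AND diff <= 1800, @sessionId, @sessionId+1) as sessionId,
  @prevUser:=uid AS uid,
  url,
  t,
  diff,
  rldiff,
  ref,
  dt,
  sid
from
  (select @sessionId:=0, @prevUser:='-') b
join
  (select
     -- seconds since the same user's previous pageview (0 for the first one)
     TIME_TO_SEC(if(@prevU=uid, TIMEDIFF(t, @prevD), '00:00')) as diff,
     if(@prevU=uid AND @prevrl!=1000, @prevrl-a_rl, 0) as rldiff,
     @prevU:=uid as uid,
     @prevD:=t as t,
     @prevrl:=a_rl,
     url,
     ref,
     dt,
     sid
   from
     pageviews
   join
     (select @prev:=0, @prevU:='-') a
   order by
     uid,
     t) a;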
Have mercy on your analysts
select *
from (SELECT
  *,
  -- referrer and URL class of the first pageview in each session
  FIRST_VALUE(referrerSegmentId) OVER (PARTITION BY uid, session_order order by datetime) AS session_ref,
  FIRST_VALUE(url_class) OVER (PARTITION BY uid, session_order order by datetime) AS session_start_class
FROM (
  SELECT
    *,
    MAX(session_order) OVER (PARTITION BY uid) AS n_sessions_of_user
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY datetime_string) AS pvs_order_in_visit
    FROM (
      SELECT
        *,
        CONCAT(uid, '-', CAST(session_order AS STRING)) AS session_id
      FROM (
        SELECT
          *,
          session_order_anc + 1 AS session_order
        FROM (
          SELECT
            *,
            -- running count of session boundaries per user
            SUM(new_event_boundary) OVER (PARTITION BY uid ORDER BY datetime_string) AS session_order_anc
          FROM (
            SELECT
              *,
              (datetime_ut - lag_1)/1000000/60 AS minutes_since_last_interval,
              -- a gap of more than 30 minutes starts a new session
              CASE WHEN (datetime_ut - lag_1)/1000000 > 30 * 60 THEN 1 ELSE 0 END AS new_event_boundary
            FROM (
              SELECT
                *,
                LAG(datetime_ut) OVER (PARTITION BY uid ORDER BY datetime_string) AS lag_1,
                MONTH(datetime_string) AS month
              FROM (
                SELECT
                  *,
                  datetime AS datetime_string,
                  TIMESTAMP(datetime) AS datetime_ut
                FROM
                  [tmp.pageviews_r])))))))))",
project = project;
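For comparison, the 30-minute rule on its own fits in a short loop. The sketch below is a simplification under stated assumptions: pageviews arrive as hypothetical `(uid, t)` tuples with `t` in seconds, and the referrer condition from the session definition is left out.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 30 * 60  # a longer gap starts a new session


def assign_sessions(pageviews):
    """Number each user's sessions starting from 1.

    pageviews: iterable of (uid, t) tuples, t in seconds.
    Returns a list of (uid, t, session_order) sorted by user and time.
    """
    result = []
    ordered = sorted(pageviews, key=itemgetter(0, 1))
    for uid, rows in groupby(ordered, key=itemgetter(0)):
        session_order = 0
        prev_t = None
        for _, t in rows:
            # the first pageview of a user, or a gap above the threshold
            if prev_t is None or t - prev_t > SESSION_GAP_SECONDS:
                session_order += 1
            result.append((uid, t, session_order))
            prev_t = t
    return result
```

A gap of exactly 1800 seconds stays in the same session, which matches the `diff <= 1800` and `> 30 * 60` conditions in the two queries.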
Storage
• DO NOT keep information only in “Application Logic”
Period_Type_ID Decoding
1 Day(s)
2 Week
3 Every 10 days
4 Every other week
5 Every 15 days
6 Every 20 days
7 Month(s)/30-day
8 Month(s) Actual
9 2 Months/60-day
10 2 Months Actual
11 3 Months/90-day
12 3 Months Actual
13 6 Months/180-day
14 6 Months Actual
15 Year(s) 365-days
Duration = Period_Type * Cycle_Count
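As an illustration of keeping the decoding next to the data rather than in application code, the sketch below treats the table as a lookup that analysts can query. The day counts are read off the labels above. The dictionary name and the decision to return `None` for the calendar-based "Actual" types are assumptions.

```python
# Hypothetical lookup table; in practice it would be a table in the database.
# Only period types with a fixed number of days are listed. The "Actual"
# types (8, 10, 12, 14) depend on the calendar and need date arithmetic.
PERIOD_TYPE_DAYS = {
    1: 1,     # Day(s)
    2: 7,     # Week
    3: 10,    # Every 10 days
    4: 14,    # Every other week
    5: 15,    # Every 15 days
    6: 20,    # Every 20 days
    7: 30,    # Month(s)/30-day
    9: 60,    # 2 Months/60-day
    11: 90,   # 3 Months/90-day
    13: 180,  # 6 Months/180-day
    15: 365,  # Year(s) 365-days
}


def duration_days(period_type_id, cycle_count):
    """Duration = Period_Type * Cycle_Count, or None when not fixed."""
    days = PERIOD_TYPE_DAYS.get(period_type_id)
    if days is None:
        return None
    return days * cycle_count
```

With the lookup stored as data, a query can join against it instead of re-implementing the mapping in every analysis.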
Storage
• DO store history, either in the form of delta logs or record validity dates
• As is versus as is
• As is versus as was
• As was versus as was
Payment_id   Duration   Fee     Start Date    End Date
123456       366        39.00   1. 1. 2015    31. 1. 2015
Cancellation on 30. 6. 2015
Payment_id   Duration   Fee     Start Date    End Date
123456       180        19.18   1. 1. 2015    30. 6. 2015
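A minimal sketch of the record-validity approach, assuming hypothetical `valid_from` and `valid_to` columns and assuming the record was rewritten on the cancellation date. The fee figures are consistent with a simple proration (39.00 × 180 / 366 ≈ 19.18), but that reading is an inference rather than something the slide states.

```python
from datetime import date

# Hypothetical history of payment 123456: one row per version of the record.
HISTORY = [
    {"payment_id": 123456, "duration": 366, "fee": 39.00,
     "valid_from": date(2015, 1, 1), "valid_to": date(2015, 6, 30)},
    {"payment_id": 123456, "duration": 180, "fee": 19.18,
     "valid_from": date(2015, 6, 30), "valid_to": None},  # current version
]


def as_of(history, payment_id, when):
    """Return the version of the record that was valid on the given date."""
    for row in history:
        if row["payment_id"] != payment_id:
            continue
        ends = row["valid_to"]
        if row["valid_from"] <= when and (ends is None or when < ends):
            return row
    return None


def prorated_fee(fee, full_duration, used_duration):
    """Fee scaled to the part of the period that was actually used."""
    return round(fee * used_duration / full_duration, 2)
```

Calling `as_of` with a date before the cancellation gives the "as was" view (fee 39.00), while a later date gives the "as is" view (fee 19.18).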
On Big Data
1.3 billion rows @ 25 columns: 3 months of US clickstream data, 150 GB gzipped ≈ 1.5 TB full
1.3 billion rows @ 6 columns: 3 months of US clickstream data, 252 GB full
614M rows @ 10 columns: 3 months of session data, 78.1 GB
340M rows @ 15 columns: 3 months of user data, 49.5 GB
3170 rows @ 31 columns: 1.3 MB
• Do we really have Big Data? (50 %)
• Do we need to work with it in Big Data form? (1 % – 5 %)
Your big data might be quite small…
M E T H O D S
If you can write a for loop, you can do Data Science
What is a p-value?
One simple trick,
the statisticians hate it!
Is there a significant difference between average CPU usage?
Data Centre 1: 84%, 72%, 57%, 46%, 63%, 76%, 99%, 91%
Data Centre 2: 81%, 69%, 74%, 61%, 56%, 87%, 69%, 65%, 66%, 44%, 62%, 69%
Mean Data Centre 1: 73.5 %
Mean Data Centre 2: 66.9 %
Difference: 6.6 %
source: https://speakerdeck.com/jakevdp/statistics-for-hackers
What would statistics do?
$$t = \frac{73.5 - 66.9}{\sqrt{\frac{316.8}{8} + \frac{124.8}{12}}}$$
The difference is significant if $t > t_{crit}$.
Here $0.932 < 1.796$, so the difference is not significant.
source: https://speakerdeck.com/jakevdp/statistics-for-hackers
What can a for loop do?
1. Shuffle the observations between 2 groups randomly
2. Compute means for each group
3. Compute the differences of the means
4. Repeat n times (n can be 10 000)
An approach used in real scientific papers
source: https://speakerdeck.com/jakevdp/statistics-for-hackers
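The four steps above can be sketched as a short loop. The values are the CPU readings from the data-centre slide, split into groups of 8 and 12 so that they reproduce the stated means of 73.5 % and 66.9 %. The function name, the seed and the one-sided comparison are choices made for this illustration.

```python
import random

# CPU usage in %, as read from the data-centre slide
dc1 = [84, 72, 57, 46, 63, 76, 99, 91]
dc2 = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, n=10_000, seed=0):
    """Share of random relabelings with a mean difference >= the observed one."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n):
        rng.shuffle(pooled)                      # 1. shuffle between groups
        shuffled_a = pooled[:len(a)]
        shuffled_b = pooled[len(a):]
        diff = mean(shuffled_a) - mean(shuffled_b)  # 2. + 3. difference of means
        if diff >= observed:
            hits += 1
    return observed, hits / n                    # 4. repeated n times


observed, p_value = permutation_p_value(dc1, dc2)
print(f"observed difference: {observed:.1f}, p-value: {p_value:.3f}")
```

The returned share plays the role of a one-sided p-value: if random relabelings produce a difference at least as large as the observed one fairly often, the observed difference is not convincing evidence.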
A more interesting application
1. Take the dataset you’ve built the model on
2. Shuffle Y values
3. Build the model from random data
In what % of cases did you build a model better
than the original one?
Call:
glm(formula = DD_index ~ ., data = perm_data_dummies)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.71028 -0.07349 0.00514 0.08096 0.63189
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.052e-01 7.716e-02 5.251 1.52e-07 ***
`authorAdam Molon` -4.710e-01 5.098e-02 -9.239 < 2e-16 ***
`authorAlexandra Gibbs` -2.160e-02 1.546e-02 -1.397 0.162359
`authorAlex Crippen` 1.025e-01 1.647e-02 6.220 5.02e-10 ***
`authorAlex Rosenberg` 5.432e-02 1.495e-02 3.634 0.000279 ***
…
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.01748868)
Null deviance: 3115.15 on 35716 degrees of freedom
Residual deviance: 608.73 on 34807 degrees of freedom
AIC: -42258
Number of Fisher Scoring iterations: 2
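The shuffle-the-Y idea can be sketched in a few lines. This is not the `glm` fit shown above: a one-predictor least-squares fit scored by R² stands in for the model, and the toy data, function names and seeds are all assumptions made for the illustration.

```python
import random


def r_squared(xs, ys):
    """R^2 of a one-predictor least-squares fit (squared correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def share_of_better_random_models(xs, ys, n=1000, seed=0):
    """Fraction of shuffled-Y fits that score at least as well as the original."""
    rng = random.Random(seed)
    original = r_squared(xs, ys)
    shuffled = list(ys)
    better = 0
    for _ in range(n):
        rng.shuffle(shuffled)            # break any real X-Y relationship
        if r_squared(xs, shuffled) >= original:
            better += 1
    return better / n


# Toy data with a real linear relationship plus noise
toy_rng = random.Random(1)
toy_x = list(range(30))
toy_y = [2 * x + toy_rng.gauss(0, 3) for x in toy_x]
share = share_of_better_random_models(toy_x, toy_y)
print(f"share of random models that beat the original: {share:.3f}")
```

A share close to zero suggests the original model found structure that random data does not offer. A large share would be a warning sign.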
O U T P U T S
Health Diagnostics for Sites with a Paywall: Approach 1
Can we, without doing a deep dive, identify areas for improvement for any of our approx. 1200 (600) sites?
• Percentiles
• One-dimensional view
• Not everyone can be above average
$$\text{Stop Rate} = \frac{\text{Unique Visitors being Stopped}}{\text{Unique Visitors}}$$
[Chart: Stop Rate (0 % to 8 %) by percentile (95% down to 5%), with a "You are here" marker]
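A rough sketch of the percentile view: compute the stop rate for a site and see where it falls among all sites. The percentile definition used here (share of sites with a stop rate at or below the site's own) and the function names are assumptions.

```python
def stop_rate(stopped_unique_visitors, unique_visitors):
    """Stop Rate = unique visitors being stopped / unique visitors."""
    return stopped_unique_visitors / unique_visitors


def percentile_rank(value, population):
    """Percentage of sites whose value is at or below the given value."""
    at_or_below = sum(1 for v in population if v <= value)
    return 100.0 * at_or_below / len(population)


# Hypothetical stop rates of four sites, and one site's position among them
all_sites = [0.01, 0.02, 0.05, 0.07]
my_rate = stop_rate(50, 1000)
print(f"stop rate {my_rate:.2%}, percentile {percentile_rank(my_rate, all_sites):.0f}")
```

This reproduces the "You are here" idea for a single KPI, and it shares the limitation listed above: the view is one-dimensional.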
Health Diagnostics for Sites with a Paywall: Approach 2
• Clustering (PAM -> Fuzzy)
• KPIs in relation to each other
• Easy to read (or so we thought)
• Too much variation
[Figure panels: Site 1, Site 2]
Health Diagnostics for Sites with a Paywall: Approach 3
• Site similarity (Size, Age)
Example of a site with good 3rd party benchmarks
Site 1
Example of a „lonely“ site
xyx
• Compare to:
• Similar Sites
• Sites of the same Publisher
• Worst Site
• Best Site
• Display multiple KPIs in different
units in one chart
C H A L L E N G E S
Building Data Products
What if we didn’t have to analyze the data,
what if we could just say – [This] is
interesting?
[This] can be a combination of multiple
variables such as author, section, traffic from
device and anything else.
[This] sits in a hierarchy.
We want to know that [This] is interesting because of [This] alone, not because of [Parent(s) of This] or [Child of This].
If [This] is constructed from author, section and traffic from device, the hierarchy in which [This] sits also includes author, section and device individually, as well as all possible combinations of 2 variables.
If we assume an ever-changing number of variables that [This] can be constructed from, then constructing a hierarchy of all possible [This] elements requires the following number of queries:
$$\#\text{This} = \sum_{k=1}^{n} \frac{n!}{k!\,(n-k)!} = 2^{n} - 1$$
Variables Queries
1 1
2 3
3 7
5 31
10 1,023
15 32,767
20 1,048,575
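The table can be checked with a few lines that count every non-empty combination of variables, which is the binomial sum in the formula. The function names are illustrative.

```python
from itertools import combinations
from math import comb


def n_queries(n):
    """Number of non-empty subsets of n variables: sum of C(n, k) for k = 1..n."""
    return sum(comb(n, k) for k in range(1, n + 1))


def all_this(variables):
    """Every combination of variables that a [This] element could be built from."""
    return [
        combo
        for k in range(1, len(variables) + 1)
        for combo in combinations(variables, k)
    ]


for n in (1, 2, 3, 5, 10, 15, 20):
    print(n, n_queries(n))

print(all_this(["author", "section", "device"]))
```

Each added variable roughly doubles the number of queries, which is why enumerating the full hierarchy stops being practical quickly.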
Currently we are looking for interesting [This] in a very simple context. We define a [Segment], which can be any type of user (in our case a loyal user), and we measure how strong its preference for [This] is relative to the general preference for [This] in the whole population.
And the results are exciting. Sometimes [This] is clearly interesting because of one of its parents.
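The slides do not give the formula for this preference measure. One plausible reading is a simple ratio of shares, sketched below with hypothetical pageview counts as inputs.

```python
def preference_index(segment_this, segment_total, population_this, population_total):
    """Share of [This] within the segment divided by its share in the population.

    Values above 1 mean the segment prefers [This] more than users in general.
    This ratio is an assumed measure, not necessarily the one used in the talk.
    """
    segment_share = segment_this / segment_total
    population_share = population_this / population_total
    return segment_share / population_share


# Loyal users spend 30 of 100 pageviews on [This]; everyone spends 100 of 1000
print(preference_index(30, 100, 100, 1000))
```

Comparing the index of [This] with the index of its parents is one way to tell whether [This] is interesting on its own or only inherits the effect.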
To be continued…
Thank you for your time!
Roman Gavuliak
Lead Data Scientist
@rgavuliak
P I A N O . I O

Piano rubyslava final

  • 1. 1 2 0 1 5 P I A N O . I O Data Science @ Rubyslava October 2015
  • 2. 2 2 0 1 5 P I A N O . I O 2 0 1 5 P I A N O . Overview R U B YS L A V A Intro Data Methods Outputs Challenges 01 02 03 04 05
  • 3. 3
  • 4. 4 P I A N O F R A M E W O R K What Data Science? 4 R U B Y S L A V A 2 0 1 5 P I A N O . I O
  • 5. 5 P I A N O F R A M E W O R K Whydo we need analytics (data science) 5 R U B Y S L A V A 2 0 1 5 P I A N O . I O PAID USERSREGISTEREDENGAGEDCASUAL 1. What is my potential? (Potential Clients) 2. X has happened, why? Is Y a good idea? (Existing Clients, Account managers) 3. Input into products
  • 6. 6
  • 7. 7 D A T A Whatdata do we collect? 7 R U B Y S L A V A 2 0 1 5 P I A N O . I O 1. Transactional Data Subscriptions, Users 2. System settings Products, Clients, Offers, Messaging 3. Clickstream Data Pageviews, user agents 4. Conversion Data Steps in a conversion funnel Pageviews: Mongo DB Transactions + System settings: PostGreSQL Conversion Data: Google Analytics Pageviews: Amazon S3 (Cassandra) Transactions + System settings: Oracle Conversion Data: Amazon S3 (Cassandra) Pageviews + Conversion Data: BigQuery (Cassandra) Transactions + System settings: MySQL ? And then you wonder why 60 – 80 % of Data Scientist’s job is cleaning and merging datasets… Piano Press+Tinypass
  • 8. 8 0.00% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 35.00% 40.00% 1 2 3 4 5 6 7 8 9 10+ %PayingUsers # Devices D A T A Whatare we even measuring? R U B Y S L A V A 2 0 1 5 P I A N O . I O 1. Cookies … … can be deleted 2. Fingerprinting … … doesn’t work for mobile 3. IP address … … shared networks, proxies, VPNs 4. Registration … … Try convincing a publisher all their readers should register Ref: ‘Not your business’ Stats > Math
  • 9. 9 D A T A It is important to understand the data generating process R U B Y S L A V A 2 0 1 5 P I A N O . I O We are looking at how much users read in order to estimate the ideal setting for a metered paywall. In this particular case we’ve realized the site refreshes every 5 minutes the user is inactive. 0.00% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 35.00% 40.00% 45.00% 50.00% 1+ 2+ 3+ 4+ 5+ 6+ 7+ 8+ 9+ 10+ 11+ 12+ 13+ 14+ 15+ 16+ 17+ 18+ 19+ 20+ %ofUsersreadingarticles Readership Level All PVs Without Refreshed PVs Domain knowledge and observations
  • 10. 10 D A T A Being able to perform a calculation, doesn’t mean you should R U B Y S L A V A 2 0 1 5 P I A N O . I O A session is defined as stream of actions for a user with delays less than 30 minutes between each action tied by consecutive referrer. It can also be tracked vie a session cookie create table pvs_with_visits select @sessionId:=if(@prevUser=uid AND diff <= 1800 , @sessionId, @sessionId+1) as sessionId, @prevUser:=uid AS uid, url, t, diff, rldiff, ref, dt, sid from (select @sessionId:=0, @prevUser:='-') b join (select TIME_TO_SEC(if(@prevU=uid, TIMEDIFF(t, @prevD), '00:00')) as diff, if(@prevU=uid & @prevrl!=1000, @prevrl-a_rl,0) as rldiff, @prevU:=uid as uid, @prevD:=t as t, @prevrl:=a_rl, url, ref, dt, sid from pageviews join (select @prev:=0, @prevU='-')a order by uid, t) a; Have mercy on your analysts select * from (SELECT *, FIRST_VALUE(referrerSegmentId) OVER (PARTITION BY uid, session_order order by datetime) AS session_ref, FIRST_VALUE(url_class) OVER (PARTITION BY uid, session_order order by datetime) AS session_start_class FROM ( SELECT *, MAX(session_order) OVER (PARTITION BY uid) AS n_sessions_of_user FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY datetime_string) AS pvs_order_in_visit FROM ( SELECT *, CONCAT(uid, '-', CAST(session_order AS STRING)) AS session_id FROM ( SELECT *, session_order_anc + 1 AS session_order FROM ( SELECT *, SUM(new_event_boundary) OVER (PARTITION BY uid ORDER BY datetime_string) AS session_order_anc FROM ( SELECT *, (datetime_ut - lag_1)/1000000/60 AS minutes_since_last_interval, CASE WHEN (datetime_ut - lag_1)/1000000 > 30 * 60 THEN 1 ELSE 0 END AS new_event_boundary FROM ( SELECT *, LAG(datetime_ut) OVER (PARTITION BY uid ORDER BY datetime_string) AS lag_1, MONTH(datetime_string) AS month FROM ( SELECT *, datetime AS datetime_string, TIMESTAMP(datetime) AS datetime_ut FROM [tmp.pageviews_r])))))))))", project = project;
  • 11. DATA: Storage. DO NOT keep information only in "Application Logic". Example: Duration = Period_Type * Cycle_Count, where the period types are decoded as follows:

        Period_Type_ID   Decoding
        1                Day(s)
        2                Week
        3                Every 10 days
        4                Every other week
        5                Every 15 days
        6                Every 20 days
        7                Month(s)/30-day
        8                Month(s) Actual
        9                2 Months/60-day
        10               2 Months Actual
        11               3 Months/90-day
        12               3 Months Actual
        13               6 Months/180-day
        14               6 Months Actual
        15               Year(s) 365-days
  • 12. DATA: Storage. DO store history, either in the form of delta logs or record validity dates. This allows three kinds of comparison: as is versus as is, as is versus as was, and as was versus as was.

        Original record:
        Payment_id   Duration   Fee     Start Date    End Date
        123456       366        39.00   1. 1. 2015    31. 1. 2015

        After cancellation on 30. 6. 2015:
        Payment_id   Duration   Fee     Start Date    End Date
        123456       180        19.18   1. 1. 2015    30. 6. 2015
  • 13. DATA: On Big Data.
        • Do we really have Big Data? (50%)
        • Do we need to work with it in Big Data form? (1% to 5%)
        Your big data might be quite small:
        • 1.3 billion rows @ 25 columns: 3 months of US clickstream data, 150 GB gzipped ≈ 1.5 TB full
        • 1.3 billion rows @ 6 columns: 3 months of US clickstream data, 252 GB full
        • 614M rows @ 10 columns: 3 months of session data, 78.1 GB
        • 340M rows @ 15 columns: 3 months of user data, 49.5 GB
        • 3170 rows @ 31 columns: 1.3 MB
  • 15. METHODS: If you can write a for loop, you can do Data Science. What is a p-value? One simple trick that statisticians hate!
  • 16. METHODS: If you can write a for loop, you can do Data Science. Is there a significant difference between average CPU usage?
        Data Centre 1: 84%, 72%, 57%, 46%, 63%, 76%, 99%, 91%
        Data Centre 2: 81%, 69%, 74%, 61%, 56%, 87%, 69%, 65%, 66%, 44%, 62%, 69%
        Mean Data Centre 1: 73.5%. Mean Data Centre 2: 66.9%. Difference: 6.6%.
        Source: https://speakerdeck.com/jakevdp/statistics-for-hackers
  • 17. METHODS: If you can write a for loop, you can do Data Science. What would statistics do? $$t = \frac{73.5 - 66.9}{\sqrt{\frac{316.8}{8} + \frac{124.8}{12}}} = 0.932$$ Is $t > t_{crit}$? No: $0.932 < 1.796$, so the difference is not significant. Source: https://speakerdeck.com/jakevdp/statistics-for-hackers
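The t statistic above can be checked with a few lines of Python. This is a minimal sketch using only the standard library. Splitting the 20 readings into 8 for Data Centre 1 and 12 for Data Centre 2 is an assumption reconstructed from the reported means (73.5 and 66.9). The critical value 1.796 is taken from the slide, not computed, and the function name `welch_t` is illustrative.

```python
from math import sqrt
from statistics import mean, variance

# Assumed grouping of the CPU readings (reproduces the means 73.5 and 66.9).
dc1 = [84, 72, 57, 46, 63, 76, 99, 91]
dc2 = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]

T_CRIT = 1.796  # critical value quoted on the slide


def welch_t(a, b):
    """t statistic for two samples with unequal variances (sample variances)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


t_stat = welch_t(dc1, dc2)
is_significant = t_stat > T_CRIT
print(f"t = {t_stat:.3f}, significant: {is_significant}")
```

The two variance terms are added under the square root, and the result stays below the critical value, which agrees with the slide's conclusion.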
  • 18. METHODS: If you can write a for loop, you can do Data Science. What can a for loop do?
        1. Shuffle the observations between the 2 groups randomly
        2. Compute the mean of each group
        3. Compute the difference of the means
        4. Repeat n times (n can be 10 000)
        This is an approach used in real scientific papers. Source: https://speakerdeck.com/jakevdp/statistics-for-hackers
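The four steps can be sketched as a plain for loop. This is an illustrative Python version, not code from the talk. The split of the CPU readings into the two data centres is an assumption reconstructed from the reported means. The final counting step, which turns the shuffled differences into a p-value, is not listed on the slide but is the usual way to finish a permutation test. The function name, seed and iteration count are arbitrary choices.

```python
import random

# Assumed grouping of the CPU readings (reproduces the means 73.5 and 66.9).
dc1 = [84, 72, 57, 46, 63, 76, 99, 91]
dc2 = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]


def average(values):
    return sum(values) / len(values)


def permutation_test(a, b, n_iter=10_000, seed=0):
    """Return (observed difference of means, one-sided p-value estimate)."""
    rng = random.Random(seed)
    observed = average(a) - average(b)
    pooled = list(a) + list(b)
    at_least_as_large = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)                       # 1. shuffle between groups
        mean_a = average(pooled[:len(a)])         # 2. mean of each group
        mean_b = average(pooled[len(a):])
        diff = mean_a - mean_b                    # 3. difference of the means
        if diff >= observed:                      # count shuffles as extreme as the data
            at_least_as_large += 1
    return observed, at_least_as_large / n_iter   # 4. repeated n_iter times


observed_diff, p_value = permutation_test(dc1, dc2)
print(f"observed difference: {observed_diff:.2f}, p-value: {p_value:.3f}")
```

The p-value is simply the share of random shuffles that produce a difference at least as large as the observed one, so no distribution table is needed.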
  • 19. METHODS: If you can write a for loop, you can do Data Science. A more interesting application:
        1. Take the dataset you've built the model on
        2. Shuffle the Y values
        3. Build the model from the random data
        In what % of cases did you build a model better than the original one?

        Call:
        glm(formula = DD_index ~ ., data = perm_data_dummies)

        Deviance Residuals:
             Min        1Q    Median        3Q       Max
        -0.71028  -0.07349   0.00514   0.08096   0.63189

        Coefficients: (1 not defined because of singularities)
                                  Estimate Std. Error t value Pr(>|t|)
        (Intercept)              4.052e-01  7.716e-02   5.251 1.52e-07 ***
        `authorAdam Molon`      -4.710e-01  5.098e-02  -9.239  < 2e-16 ***
        `authorAlexandra Gibbs` -2.160e-02  1.546e-02  -1.397 0.162359
        `authorAlex Crippen`     1.025e-01  1.647e-02   6.220 5.02e-10 ***
        `authorAlex Rosenberg`   5.432e-02  1.495e-02   3.634 0.000279 ***
        …
        Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        (Dispersion parameter for gaussian family taken to be 0.01748868)

        Null deviance: 3115.15 on 35716 degrees of freedom
        Residual deviance: 608.73 on 34807 degrees of freedom
        AIC: -42258

        Number of Fisher Scoring iterations: 2
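The same shuffling idea applied to a model can be sketched as follows. This is a simplified Python illustration under stated assumptions. Instead of the `glm` with author dummies shown on the slide, it fits a one-variable least-squares line and uses R² as the measure of "better". The data are synthetic, and all names (`r_squared`, `share_of_better_random_models`) are hypothetical.

```python
import random


def r_squared(xs, ys):
    """R^2 of a simple least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def share_of_better_random_models(xs, ys, n_iter=1_000, seed=0):
    """Refit on shuffled Y values and count how often the fit beats the original."""
    rng = random.Random(seed)
    original = r_squared(xs, ys)
    shuffled = list(ys)
    better = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled)                  # break any real link between X and Y
        if r_squared(xs, shuffled) >= original:
            better += 1
    return original, better / n_iter


# Synthetic data with a real linear relationship plus noise (assumption).
data_rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + data_rng.gauss(0, 10) for x in xs]

original_r2, share_better = share_of_better_random_models(xs, ys)
print(f"original R^2: {original_r2:.3f}, share of better random models: {share_better:.3f}")
```

If random targets rarely or never produce a better fit, the original model is unlikely to be an artifact of chance. A large share would be a warning sign.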
  • 21. OUTPUTS: Health Diagnostics for Sites with a Paywall: Approach 1. Can we, without doing a deep dive, identify areas for improvement for any of our approx. 1200 (600) sites?
        • Percentiles
        • One-dimensional view
        • Not everyone can be above average
        $$\text{Stop Rate} = \frac{\text{Unique Visitors Being Stopped}}{\text{Unique Visitors}}$$
        [Chart: stop rate (0% to 8%) by percentile (95% down to 5%), with a "You are here" marker.]
  • 22. OUTPUTS: Health Diagnostics for Sites with a Paywall: Approach 2. Can we, without doing a deep dive, identify areas for improvement for any of our approx. 1200 (600) sites?
        • Clustering (PAM -> Fuzzy)
        • KPIs in relation to each other
        • Easy to read (or so we thought)
        • Too much variation
        [Charts: Site 1, Site 2.]
  • 23. OUTPUTS: Health Diagnostics for Sites with a Paywall: Approach 3. Can we, without doing a deep dive, identify areas for improvement for any of our approx. 1200 (600) sites?
        • Site similarity (size, age)
        [Charts: an example of a site with good third-party benchmarks (Site 1) and an example of a "lonely" site.]
  • 24. OUTPUTS: Health Diagnostics for Sites with a Paywall: Approach 3. Can we, without doing a deep dive, identify areas for improvement for any of our approx. 1200 (600) sites?
        • Compare to:
            • Similar sites
            • Sites of the same publisher
            • Worst site
            • Best site
        • Display multiple KPIs in different units in one chart
  • 26. CHALLENGES: Building Data Products. What if we didn't have to analyze the data? What if we could just say: [This] is interesting? [This] can be a combination of multiple variables such as author, section, traffic from a device and anything else. [This] sits in a hierarchy. We want to know that [This] is interesting because of [This] alone, not because of the [Parent(s) of This] or a [Child of This].
  • 27. CHALLENGES: Building Data Products. If [This] is constructed from author, section and traffic from a device, the hierarchy in which [This] sits also includes author, section and device individually, as well as all possible combinations of 2 variables. If we assume an ever-changing number of variables that [This] can be constructed from, then constructing the hierarchy of all possible [This] elements for n variables requires the following number of queries: $$\#\,\text{This} = \sum_{k=1}^{n} \frac{n!}{k!\,(n-k)!} = 2^{n} - 1$$

        Variables   Queries
        1           1
        2           3
        3           7
        5           31
        10          1,023
        15          32,767
        20          1,048,575
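The growth in the table can be reproduced with a short Python sketch. It counts the non-empty combinations with `math.comb` and, for a small example, enumerates them with `itertools.combinations`. The function names and the variable labels in the example list are illustrative only.

```python
from itertools import combinations
from math import comb


def count_queries(n_variables):
    """Number of non-empty variable combinations: sum of C(n, k) for k = 1..n."""
    return sum(comb(n_variables, k) for k in range(1, n_variables + 1))


def enumerate_combinations(variables):
    """List every non-empty combination of the given variables."""
    return [
        combo
        for size in range(1, len(variables) + 1)
        for combo in combinations(variables, size)
    ]


for n in (1, 2, 3, 5, 10, 15, 20):
    print(n, count_queries(n))

# Each tuple is one level of the hierarchy that [This] can sit in.
levels = enumerate_combinations(["author", "section", "device"])
```

For three variables the enumeration gives the three single variables, the three pairs and the one triple, which is the hierarchy described on the slide.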
  • 28. CHALLENGES: Building Data Products. Currently we are looking for an interesting [This] in a very simple context. We define a [Segment], which can be any type of user (in our case a loyal user), and we measure how strong the segment's preference for [This] is relative to the general preference for [This] in the whole population. The results are exciting, although sometimes [This] is clearly interesting only because of one of its parents. To be continued…
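One simple way to put a number on "preference relative to the whole population" is a ratio of shares. The slide does not say which measure is actually used, so the following Python sketch is an assumption. The function name `preference_lift`, the use of pageview counts and the example numbers are all hypothetical.

```python
def preference_lift(segment_views, segment_total, population_views, population_total):
    """Ratio of the segment's share of views on [This] to the population's share.

    A value above 1 means the segment prefers [This] more than the population does.
    """
    segment_share = segment_views / segment_total
    population_share = population_views / population_total
    return segment_share / population_share


# Hypothetical numbers: loyal users spend 30% of their pageviews on one author,
# while the whole population spends 10% on that author.
lift = preference_lift(300, 1_000, 5_000, 50_000)
print(f"lift: {lift:.2f}")
```

To separate [This] from its parents, the same ratio could be computed for each parent combination and compared, although how the talk handles that step is not stated.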
  • 30. Thank you for your time! Roman Gavuliak, Lead Data Scientist, @rgavuliak. PIANO.IO, 2015.