Building Understanding Out of Incomplete and Biased Datasets using Machine Learning and Databricks
1. Building understanding out of incomplete and biased
datasets using Machine Learning and Databricks
LUKE HEINRICH | MIKE DIAS
2. Agenda
Introduction to data at Atlassian
The problem of bias and incompleteness
Solving this with self-serve estimation
Tackling subadditivity in this framework
Summary and Q&A
6. PRODUCT ANALYTICS
Product Analytics helps us understand how users navigate our products and engage with our features, all with the aim of building better products.
7. PRODUCT ANALYTICS PIPELINE (SIMPLIFIED)
[Diagram: cloud products stream events through streaming services into the Data Lake; on-premises products arrive in batch.]
8. ATLASSIAN DATA LAKE (SIMPLIFIED)
[Diagram: product analytics, RDBS sources, and integrations are processed into S3 buckets; Glue connects these to a SQL layer serving data analysts / scientists, product managers, and visualization tools.]
9. DATA @ ATLASSIAN
AWS based: end to end, everything is on AWS
Petabyte scale: growing by 100TB a month
2,000 users: internal users of the platform each month
200+ data workers: users whose primary job is to work with data
12. Agenda
Introduction to data at Atlassian
The problem of bias and incompleteness
Solving this with self-serve estimation
Tackling subadditivity in this framework
Summary and Q&A
18. WHAT FRACTION OF CUSTOMERS USE FEATURE X?
True distribution: 50%
With the red group 2x as likely to send analytics, the observed estimate is 37%
With the yellow group 3x as likely to send analytics, the observed estimate is 70%
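The mechanics of this bias can be sketched with a toy calculation; the group split and reporting rates below are illustrative assumptions, not the segment mix behind the slide's 37%/70% figures:

```python
# Toy model of reporting bias: two equal-sized groups of customers,
# one of which sends analytics twice as often (rates are assumptions).
def observed_fraction(p_use_a, p_use_b, report_rate_a, report_rate_b):
    # fraction of *reporting* customers observed to use feature X
    users = 0.5 * p_use_a * report_rate_a + 0.5 * p_use_b * report_rate_b
    reporters = 0.5 * report_rate_a + 0.5 * report_rate_b
    return users / reporters

# True usage is 50% overall (group A uses X, group B does not),
# but group B reports twice as often, dragging the estimate down:
observed_fraction(1.0, 0.0, 0.3, 0.6)  # ~0.33, not 0.5
```

With equal reporting rates the estimate recovers the true 50%; any differential reporting shifts it.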
21. REPORTING RATIOS BY SIZE FOR ONE OF OUR PRODUCTS
0x
0.5x
1x
1.5x
2x
Small Medium Large
Significantly
underrepresented
23. HOW MANY USERS ARE IMPACTED BY THIS PLATFORM CHANGE?

Person                      Jim (Product A)   Jane (Product B)    John (Product C)
Sees this many users
affected in the data        15                10                  8
The true value is …         45                30                  20
They …                      do nothing to     extrapolate using   as Jane, but
                            the number        the % reporting     segment by size
They report:                15                20                  25

Jim: largest impact, lowest reported number. John: smallest impact, highest reported number.
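Jane's and John's approaches amount to dividing what they observe by an assumed reporting rate; a minimal sketch (the rates are inferred from the toy numbers above, not stated on the slide):

```python
def extrapolate(observed, reporting_rate):
    # scale the observed count up by the assumed fraction of
    # licenses that actually send analytics
    return observed / reporting_rate

extrapolate(10, 0.5)   # Jane's estimate: 20.0 (the true value was 30)
extrapolate(15, 1/3)   # with the true 1/3 reporting rate, Jim's 15 scales to ~45
```

The point of the table is that everyone picks a different rate (or none at all), so the three reported numbers are not comparable.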
31. Agenda
Introduction to data at Atlassian
The problem of bias and incompleteness
Solving this with self-serve estimation
Tackling subadditivity in this framework
Summary and Q&A
33. WE ALWAYS NEED TO DO
ESTIMATION TO TRY AND
GET THE NUMBER RIGHT
34. DESIRED THEORETICALLY
Unbiased: on average, the number will be right
Consistent: the more data, the better
Efficient: using the best method to estimate something
…
ALSO DESIRED IN PRACTICE
…
35. DESIRED THEORETICALLY
Unbiased: on average, the number will be right
Consistent: the more data, the better
Efficient: using the best method to estimate something
…
ALSO DESIRED IN PRACTICE
Auditable: there's a paper trail of how the number was estimated
Consistent (the second one!): speaks in the same language as other estimates
Best practice: leverages the strongest estimation methodologies
Understood: methodology has been socialised with stakeholders
Accessible: readily available to all users who want to make an estimate
…
37. How it works (the high level)
We provide similarity scores for each license: a given license distributes its total of 100% "similarity" across other licenses.
These scores allow the user to quickly estimate unknown data, letting them work as if they had a complete picture.
[Diagram: example similarity weights (1%, 8%, 6%, …) flowing between licenses.]
38. WHAT THAT LOOKS LIKE

1. Calculated metrics (from the sample of data we get):

   license  metric
   A        NULL      <- unknown
   B        7
   C        9
   D        1
   …        …

2. Similarity tables (we provide):

   base  target  similarity
   A     B       0.05
   A     C       0.08
   A     D       0.01
   A     E       0.12
   …     …       …

3. Estimated unknowns: A = 0.05*B + 0.08*C + … = 8

   license  metric  estimated
   A        NULL    8         <- estimated
   B        7       7
   C        9       9
   D        1       1
   …        …       …
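The step-3 arithmetic can be sketched in a few lines; the similarity row and metrics below are invented stand-ins (the real weights come from the similarity tables), and targets with no known metric are simply skipped:

```python
def estimate_metric(similarities, known_metrics):
    # similarity-weighted sum over licenses whose metric is known;
    # weights for unknown targets are dropped, mirroring SUM(similarity * metric)
    return sum(w * known_metrics[target]
               for target, w in similarities.items()
               if known_metrics.get(target) is not None)

# hypothetical similarity row for license A (invented weights summing to 1)
similarity_a = {"B": 0.5, "C": 0.3, "D": 0.2}
known = {"A": None, "B": 7, "C": 9, "D": 1}
estimate_metric(similarity_a, known)  # 0.5*7 + 0.3*9 + 0.2*1 = 6.4
```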
42. Under the hood: data structure

license  tenure  owns_confluence  seats  …  usage
A        1.5     1                25     …  17
B        6.3     0                500    …  251
C        5.2     1                100    …  87
D        3.9     0                50     …  28
…        …       …                …      …  …
L        4.7     1                250    …  ?
M        0.7     1                50     …  ?

The features (tenure, owns_confluence, seats, …) are known for all licenses; "usage" is the attribute of product usage we predict. We optimise the prediction on known data, score out on the unknown licenses, then output tables and transfer.
43. Under the hood: random forest kernel
We build a random forest, but focus on the tree structure embedded in it
Rather than focusing on the prediction itself, we care about the structure that generated it
From it, we extract the weights of the prediction
The prediction of the random forest can be expressed as a weighted average of the data in the corresponding leaf nodes, and these "weights" are what we extract
These weights give us our supervised similarity
The random forest does our feature selection for us, overcoming the downfall of unsupervised methods
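A minimal sketch of this weight extraction, assuming a scikit-learn regression forest on invented stand-in data; with bootstrapping disabled, the leaf-co-membership weights reproduce the forest's prediction exactly:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# toy stand-in data: 3 known features per license, "usage" as the response
rng = np.random.default_rng(0)
X_known = rng.normal(size=(50, 3))
y_known = 2 * X_known[:, 0] + rng.normal(scale=0.1, size=50)
X_unknown = rng.normal(size=(4, 3))  # licenses with no analytics

# bootstrap=False so each tree's leaf value is the mean of *all*
# training rows in that leaf, making the weight identity exact
rf = RandomForestRegressor(n_estimators=100, bootstrap=False,
                           random_state=0).fit(X_known, y_known)

# leaf index of every row in every tree: shape (n_rows, n_trees)
leaves_known = rf.apply(X_known)
leaves_unknown = rf.apply(X_unknown)

# weight of known row j for unknown row i:
# average over trees of 1/leaf_size when the two rows share a leaf
n_trees = leaves_known.shape[1]
weights = np.zeros((len(X_unknown), len(X_known)))
for t in range(n_trees):
    same_leaf = leaves_unknown[:, [t]] == leaves_known[None, :, t]
    weights += same_leaf / same_leaf.sum(axis=1, keepdims=True) / n_trees

# the weighted average of known responses equals the forest prediction,
# and each row of `weights` is a supervised-similarity score summing to 1
assert np.allclose(weights @ y_known, rf.predict(X_unknown))
```

The rows of `weights` are the kind of per-license similarity distribution the slides describe; materialised as a (base, target, similarity) table, they drive the SQL shown later.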
44. Under the hood: output tables and transfer
We materialise both the tree structure and the implied similarities into tables in our data lake
The implied similarity table was discussed earlier, but we also convert the scored forest into a table of license * tree * leaf
Tables are then available to data users for estimation of any metric and its distribution
As before, any arbitrary metric based on our incomplete product analytics can be estimated for the full set of licenses
Various "similarities" are available depending on the metric being analysed
As similarity is defined with regard to an objective (e.g. use, product co-use etc.), we provide models built against different responses, and the user selects the one that will best transfer to their problem
45. The user experience: similarity tables

-- the user defines/owns their metric
WITH my_metric AS
(
  SELECT
    license
    , metric
  FROM (user-generated logic)
)
SELECT
  m1.license
  , m1.metric
  -- take the metric when it is known (m1) or estimate it with the similarity scores
  , COALESCE(m1.metric, SUM(s.similarity * m2.metric)) AS estimated_metric
FROM my_metric m1
-- join to similar licenses (and their similarity weights)
LEFT JOIN similarities s
  ON m1.license = s.base
-- take the metric values for those "similar" licenses
LEFT JOIN my_metric m2
  ON s.target = m2.license
GROUP BY 1, 2
;
46. The user experience: direct tree structure

-- the user defines/owns their metric
WITH my_metric AS
(
  SELECT
    license
    , metric
  FROM (user-generated logic)
),
-- those without data
needs_est AS
(
  SELECT
    f.*
  FROM scored_forest f
  LEFT JOIN my_metric m
    ON f.license = m.license
  WHERE m.metric IS NULL
),
-- those with data
estimators AS
(
  SELECT
    f.*
    , m.metric
  FROM scored_forest f
  LEFT JOIN my_metric m
    ON f.license = m.license
  WHERE m.metric IS NOT NULL
),
-- prediction with tree structure
kernels AS
(
  SELECT
    license
    , AVG(tree_est) AS metric
  FROM
  (
    SELECT
      n.license
      , n.tree
      , AVG(e.metric) AS tree_est
    FROM needs_est n
    INNER JOIN estimators e
      ON n.tree = e.tree
      AND n.leaf = e.leaf
    GROUP BY 1, 2
  ) per_tree
  GROUP BY 1
)
-- union the known and the estimated
SELECT *
FROM kernels
UNION ALL
SELECT *
FROM my_metric
WHERE metric IS NOT NULL
;
47. PIPELINE
[Diagram: the Atlassian Data Lake (RDBS, integrations, product analytics) supplies known input variables to the model pipeline; the model's outputs (similarities and tree structure) are written back to the lake as output model results for data workers.]
48. DESIRED THEORETICALLY
Unbiased: on average, the number will be right
Consistent: the more data, the better
Efficient: using the best method to estimate something
…
ALSO DESIRED IN PRACTICE
Auditable: there's a paper trail of how the number was estimated
Consistent (the second one!): speaks in the same language as other estimates
Best practice: leverages the strongest estimation methodologies
Understood: methodology has been socialised with stakeholders
Accessible: readily available to all users who want to make an estimate
…
49. Agenda
Introduction to data at Atlassian
The problem of bias and incompleteness
Solving this with self-serve estimation
Tackling subadditivity in this framework
Summary and Q&A
54. MONTHLY ACTIVE USERS IN AN UNCERTAIN WORLD
? + ? = ?
Who using this license already exists in the data?
Then, based on adding the left license in, who is new on this one?
62. Under the hood: modelling it out
We first model total MAU by focusing on incremental MAU
Within a customer, we start with their highest-use instance and "add" the others to it according to the formula:
MAU_total = MAU_largest + Σ wᵢ × MAUᵢ
The fraction or weight "w" is what we model, based on attributes of the license we're adding compared to the licenses already considered, such as "product seen before" or "country seen before" etc.
We then score this model out for all customers
This iterative process is followed to create estimates for all individual customers, starting with their largest instance (either predicted or known) and summing individually through all others
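A toy sketch of the iteration above; the weights here are invented stand-ins for the modelled "w" values:

```python
# Incremental-MAU accumulation: start from the largest instance and add
# a modelled fraction of each further instance's MAU as genuinely new users.
def estimate_total_mau(instance_maus, weights):
    """instance_maus: raw MAU per instance of one customer;
    weights[i]: modelled fraction of the (i+1)-th largest instance's
    MAU that is new on top of the instances already added."""
    ordered = sorted(instance_maus, reverse=True)
    total = ordered[0]
    for mau, w in zip(ordered[1:], weights):
        total += w * mau
    return total

estimate_total_mau([100, 40, 10], [0.5, 0.2])  # 100 + 0.5*40 + 0.2*10 = 122.0
```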
63. Under the hood: allocating it back to licenses
Total MAU is then allocated back over individual licenses through an equal proportional allocation
As the sum of MAU across instances is not our estimate of total MAU, we give each instance a "haircut" to create its "allocated MAU", based on how large it is relative to the other licenses within a customer
We then pivot our similarity scores to focus back on "known" examples
Instead of focusing on the unknown values, we focus on how the known predicts the unknown, creating "influence scores" based on how much unknown "allocated MAU" is considered similar to the known license.
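The proportional allocation step can be sketched as follows (invented numbers; each instance keeps its share of the raw sum, scaled down to the estimated total):

```python
# Equal proportional allocation: haircut every instance's raw MAU so the
# allocated parts sum to the customer's estimated total MAU.
def allocate_mau(instance_maus, estimated_total):
    raw_sum = sum(instance_maus)
    return [estimated_total * m / raw_sum for m in instance_maus]

# raw instance MAUs sum to 150, but the estimated total is 122,
# so each instance is scaled by 122/150
allocate_mau([100, 40, 10], 122)
```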
64. Under the hood: output tables and transfer
We provide these influence tables for ready estimation
These influence scores can then be used to estimate total MAU of a feature or system for the whole population, based on a simple multiplier applied to observed data.
This is done for various windows
All of the calculations above change if you're looking at just a subset of products, a subset of geographies and so on, so we provide influence tables for various aggregations.
65. The user experience: MAU extrapolation

-- the user defines/owns their "MAU"
WITH my_mau AS
(
  SELECT
    license
    , feature_mau
  FROM (user-generated logic)
)
SELECT
  -- upweight the observed MAU according to influence to estimate the total
  SUM(m.feature_mau * w.weight_mau) AS feature_mau_estimate
FROM my_mau m
-- attach the influence scores
INNER JOIN mau_influence_weight w
  ON m.license = w.license
;
66. Agenda
Introduction to data at Atlassian
The problem of bias and incompleteness
Solving this with self-serve estimation
Tackling subadditivity in this framework
Summary and Q&A
68. Summary
Problem: we have an incomplete and biased dataset that we want to infer from
Solution: creating methods for all data workers to create consistent and accurate estimates of the total
Result: quicker, more trustworthy insights back to product teams