Building Understanding Out of Incomplete and Biased Datasets using Machine Learning and Databricks
1. Building understanding out of incomplete and biased
datasets using Machine Learning and Databricks
LUKE HEINRICH | MIKE DIAS
2. Agenda
Introduction to data at Atlassian
The problem of bias and incompleteness
Solving this with self-serve estimation
Tackling subadditivity in this framework
Summary and Q&A
6. PRODUCT ANALYTICS
Product Analytics helps us understand how users navigate our products and engage with our features, all with the aim of building better products.
7. PRODUCT ANALYTICS PIPELINE (SIMPLIFIED)
[Diagram: cloud products stream events through streaming services into the Data Lake; on-premises products arrive in batch.]
8. ATLASSIAN DATA LAKE (SIMPLIFIED)
[Diagram: product analytics, RDBS sources, and integrations are processed into S3 buckets; Glue connects these to a SQL layer serving data analysts / scientists, product managers, and visualization tools.]
9. DATA @ ATLASSIAN
AWS based: end to end, everything is on AWS
Petabyte scale: growing by 100TB a month
2,000 users: internal users of the platform each month
200+ data workers: users whose primary job is to work with data
12. Agenda
Introduction to data at Atlassian
The problem of bias and incompleteness
Solving this with self-serve estimation
Tackling subadditivity in this framework
Summary and Q&A
18. WHAT FRACTION OF CUSTOMERS USE FEATURE X?
True distribution: 50%
With the red group 2x as likely to send analytics, the observed estimate is 37%
With the yellow group 3x as likely to send analytics, the observed estimate is 70%
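The mechanics of this bias can be sketched with a toy calculation; the group split and reporting rates below are illustrative assumptions, not the segment mix behind the slide's 37%/70% figures:

```python
# Toy model of reporting bias: two equal-sized groups of customers,
# one of which sends analytics twice as often (rates are assumptions).
def observed_fraction(p_use_a, p_use_b, report_rate_a, report_rate_b):
    # fraction of *reporting* customers observed to use feature X
    users = 0.5 * p_use_a * report_rate_a + 0.5 * p_use_b * report_rate_b
    reporters = 0.5 * report_rate_a + 0.5 * report_rate_b
    return users / reporters

# True usage is 50% overall (group A uses X, group B does not),
# but group B reports twice as often, dragging the estimate down:
observed_fraction(1.0, 0.0, 0.3, 0.6)  # ~0.33, not 0.5
```

With equal reporting rates the estimate recovers the true 50%; any differential reporting shifts it.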
21. REPORTING RATIOS BY SIZE FOR ONE OF OUR PRODUCTS
0x
0.5x
1x
1.5x
2x
Small Medium Large
Significantly
underrepresented
23. HOW MANY USERS ARE IMPACTED BY THIS PLATFORM CHANGE?

Person                      Jim (Product A)   Jane (Product B)    John (Product C)
Sees this many users
affected in the data        15                10                  8
The true value is …         45                30                  20
They …                      do nothing to     extrapolate using   as Jane, but
                            the number        the % reporting     segment by size
They report:                15                20                  25

Jim: largest impact, lowest reported number. John: smallest impact, highest reported number.
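Jane's and John's approaches amount to dividing what they observe by an assumed reporting rate; a minimal sketch (the rates are inferred from the toy numbers above, not stated on the slide):

```python
def extrapolate(observed, reporting_rate):
    # scale the observed count up by the assumed fraction of
    # licenses that actually send analytics
    return observed / reporting_rate

extrapolate(10, 0.5)   # Jane's estimate: 20.0 (the true value was 30)
extrapolate(15, 1/3)   # with the true 1/3 reporting rate, Jim's 15 scales to ~45
```

The point of the table is that everyone picks a different rate (or none at all), so the three reported numbers are not comparable.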
31. Agenda
Introduction to data at Atlassian
The problem of bias and incompleteness
Solving this with self-serve estimation
Tackling subadditivity in this framework
Summary and Q&A
33. WE ALWAYS NEED TO DO
ESTIMATION TO TRY AND
GET THE NUMBER RIGHT
34. DESIRED THEORETICALLY
Unbiased: on average, the number will be right
Consistent: the more data, the better
Efficient: using the best method to estimate something
…
ALSO DESIRED IN PRACTICE
…
35. DESIRED THEORETICALLY
Unbiased: on average, the number will be right
Consistent: the more data, the better
Efficient: using the best method to estimate something
…
ALSO DESIRED IN PRACTICE
Auditable: there's a paper trail of how the number was estimated
Consistent (the second one!): speaks in the same language as other estimates
Best practice: leverages the strongest estimation methodologies
Understood: methodology has been socialised with stakeholders
Accessible: readily available to all users who want to make an estimate
…
37. How it works (the high level)
We provide similarity scores for each license: a given license distributes its total of 100% "similarity" across other licenses.
These scores allow the user to quickly estimate unknown data, letting them work as if they had a complete picture.
[Diagram: example similarity weights (1%, 8%, 6%, …) flowing between licenses.]
38. WHAT THAT LOOKS LIKE

1. Calculated metrics (from the sample of data we get):

   license  metric
   A        NULL      <- unknown
   B        7
   C        9
   D        1
   …        …

2. Similarity tables (we provide):

   base  target  similarity
   A     B       0.05
   A     C       0.08
   A     D       0.01
   A     E       0.12
   …     …       …

3. Estimated unknowns: A = 0.05*B + 0.08*C + … = 8

   license  metric  estimated
   A        NULL    8         <- estimated
   B        7       7
   C        9       9
   D        1       1
   …        …       …
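The step-3 arithmetic can be sketched in a few lines; the similarity row and metrics below are invented stand-ins (the real weights come from the similarity tables), and targets with no known metric are simply skipped:

```python
def estimate_metric(similarities, known_metrics):
    # similarity-weighted sum over licenses whose metric is known;
    # weights for unknown targets are dropped, mirroring SUM(similarity * metric)
    return sum(w * known_metrics[target]
               for target, w in similarities.items()
               if known_metrics.get(target) is not None)

# hypothetical similarity row for license A (invented weights summing to 1)
similarity_a = {"B": 0.5, "C": 0.3, "D": 0.2}
known = {"A": None, "B": 7, "C": 9, "D": 1}
estimate_metric(similarity_a, known)  # 0.5*7 + 0.3*9 + 0.2*1 = 6.4
```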
42. Under the hood: data structure

license  tenure  owns_confluence  seats  …  usage
A        1.5     1                25     …  17
B        6.3     0                500    …  251
C        5.2     1                100    …  87
D        3.9     0                50     …  28
…        …       …                …      …  …
L        4.7     1                250    …  ?
M        0.7     1                50     …  ?

The features (tenure, owns_confluence, seats, …) are known for all licenses; "usage" is the attribute of product usage we predict. We optimise the prediction on known data, score out on the unknown licenses, then output tables and transfer.
43. Under the hood: random forest kernel
We build a random forest, but focus on the tree structure embedded in it
Rather than focusing on the prediction itself, we care about the structure that generated it
From it, we extract the weights of the prediction
The prediction of the random forest can be expressed as a weighted average of the data in the corresponding leaf nodes, and these "weights" are what we extract
These weights give us our supervised similarity
The random forest does our feature selection for us, overcoming the downfall of unsupervised methods
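A minimal sketch of this weight extraction, assuming a scikit-learn regression forest on invented stand-in data; with bootstrapping disabled, the leaf-co-membership weights reproduce the forest's prediction exactly:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# toy stand-in data: 3 known features per license, "usage" as the response
rng = np.random.default_rng(0)
X_known = rng.normal(size=(50, 3))
y_known = 2 * X_known[:, 0] + rng.normal(scale=0.1, size=50)
X_unknown = rng.normal(size=(4, 3))  # licenses with no analytics

# bootstrap=False so each tree's leaf value is the mean of *all*
# training rows in that leaf, making the weight identity exact
rf = RandomForestRegressor(n_estimators=100, bootstrap=False,
                           random_state=0).fit(X_known, y_known)

# leaf index of every row in every tree: shape (n_rows, n_trees)
leaves_known = rf.apply(X_known)
leaves_unknown = rf.apply(X_unknown)

# weight of known row j for unknown row i:
# average over trees of 1/leaf_size when the two rows share a leaf
n_trees = leaves_known.shape[1]
weights = np.zeros((len(X_unknown), len(X_known)))
for t in range(n_trees):
    same_leaf = leaves_unknown[:, [t]] == leaves_known[None, :, t]
    weights += same_leaf / same_leaf.sum(axis=1, keepdims=True) / n_trees

# the weighted average of known responses equals the forest prediction,
# and each row of `weights` is a supervised-similarity score summing to 1
assert np.allclose(weights @ y_known, rf.predict(X_unknown))
```

The rows of `weights` are the kind of per-license similarity distribution the slides describe; materialised as a (base, target, similarity) table, they drive the SQL shown later.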
44. Under the hood: output tables and transfer
We materialise both the tree structure and the implied similarities into tables in our data lake
The implied similarity table was discussed earlier, but we also convert the scored forest into a table of license * tree * leaf
Tables are then available to data users for estimation of any metric and its distribution
As before, any arbitrary metric based on our incomplete product analytics can be estimated for the full set of licenses
Various "similarities" are available depending on the metric being analysed
As similarity is defined with regard to an objective (e.g. use, product co-use etc.), we provide models built against different responses, and the user selects the one that will best transfer to their problem
45. The user experience: similarity tables

-- the user defines/owns their metric
WITH my_metric AS
(
  SELECT
    license
    , metric
  FROM (user-generated logic)
)
SELECT
  m1.license
  , m1.metric
  -- take the metric when it is known (m1) or estimate it with the similarity scores
  , COALESCE(m1.metric, SUM(s.similarity * m2.metric)) AS estimated_metric
FROM my_metric m1
-- join to similar licenses (and their similarity weights)
LEFT JOIN similarities s
  ON m1.license = s.base
-- take the metric values for those "similar" licenses
LEFT JOIN my_metric m2
  ON s.target = m2.license
GROUP BY 1, 2
;
46. The user experience: direct tree structure

-- the user defines/owns their metric
WITH my_metric AS
(
  SELECT
    license
    , metric
  FROM (user-generated logic)
),
-- those without data
needs_est AS
(
  SELECT
    f.*
  FROM scored_forest f
  LEFT JOIN my_metric m
    ON f.license = m.license
  WHERE m.metric IS NULL
),
-- those with data
estimators AS
(
  SELECT
    f.*
    , m.metric
  FROM scored_forest f
  LEFT JOIN my_metric m
    ON f.license = m.license
  WHERE m.metric IS NOT NULL
),
-- prediction with tree structure
kernels AS
(
  SELECT
    license
    , AVG(tree_est) AS metric
  FROM
  (
    SELECT
      n.license
      , n.tree
      , AVG(e.metric) AS tree_est
    FROM needs_est n
    INNER JOIN estimators e
      ON n.tree = e.tree
      AND n.leaf = e.leaf
    GROUP BY 1, 2
  ) per_tree
  GROUP BY 1
)
-- union the known and the estimated
SELECT *
FROM kernels
UNION ALL
SELECT *
FROM my_metric
WHERE metric IS NOT NULL
;
47. PIPELINE
[Diagram: the Atlassian Data Lake (RDBS, integrations, product analytics) supplies known input variables to the model pipeline; the model's outputs (similarities and tree structure) are written back to the lake as output model results for data workers.]
48. DESIRED THEORETICALLY
Unbiased: on average, the number will be right
Consistent: the more data, the better
Efficient: using the best method to estimate something
…
ALSO DESIRED IN PRACTICE
Auditable: there's a paper trail of how the number was estimated
Consistent (the second one!): speaks in the same language as other estimates
Best practice: leverages the strongest estimation methodologies
Understood: methodology has been socialised with stakeholders
Accessible: readily available to all users who want to make an estimate
…
49. Agenda
Introduction to data at Atlassian
The problem of bias and incompleteness
Solving this with self-serve estimation
Tackling subadditivity in this framework
Summary and Q&A
54. MONTHLY ACTIVE USERS IN AN UNCERTAIN WORLD
? + ? = ?
Who using this license already exists in the data?
Then, based on adding the left license in, who is new on this one?
62. Under the hood: modelling it out
We first model total MAU by focusing on incremental MAU
Within a customer, we start with their highest-use instance and "add" the others to it according to the formula:
MAU_total = MAU_largest + Σ wᵢ × MAUᵢ
The fraction or weight "w" is what we model, based on attributes of the license we're adding compared to the licenses already considered, such as "product seen before" or "country seen before" etc.
We then score this model out for all customers
This iterative process is followed to create estimates for all individual customers, starting with their largest instance (either predicted or known) and summing individually through all others
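A toy sketch of the iteration above; the weights here are invented stand-ins for the modelled "w" values:

```python
# Incremental-MAU accumulation: start from the largest instance and add
# a modelled fraction of each further instance's MAU as genuinely new users.
def estimate_total_mau(instance_maus, weights):
    """instance_maus: raw MAU per instance of one customer;
    weights[i]: modelled fraction of the (i+1)-th largest instance's
    MAU that is new on top of the instances already added."""
    ordered = sorted(instance_maus, reverse=True)
    total = ordered[0]
    for mau, w in zip(ordered[1:], weights):
        total += w * mau
    return total

estimate_total_mau([100, 40, 10], [0.5, 0.2])  # 100 + 0.5*40 + 0.2*10 = 122.0
```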
63. Under the hood: allocating it back to licenses
Total MAU is then allocated back over individual licenses through an equal proportional allocation
As the sum of MAU across instances is not our estimate of total MAU, we give each instance a "haircut" to create its "allocated MAU", based on how large it is relative to the other licenses within a customer
We then pivot our similarity scores to focus back on "known" examples
Instead of focusing on the unknown values, we focus on how the known predicts the unknown, creating "influence scores" based on how much unknown "allocated MAU" is considered similar to the known license.
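The proportional allocation step can be sketched as follows (invented numbers; each instance keeps its share of the raw sum, scaled down to the estimated total):

```python
# Equal proportional allocation: haircut every instance's raw MAU so the
# allocated parts sum to the customer's estimated total MAU.
def allocate_mau(instance_maus, estimated_total):
    raw_sum = sum(instance_maus)
    return [estimated_total * m / raw_sum for m in instance_maus]

# raw instance MAUs sum to 150, but the estimated total is 122,
# so each instance is scaled by 122/150
allocate_mau([100, 40, 10], 122)
```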
64. Under the hood: output tables and transfer
We provide these influence tables for ready estimation
These influence scores can then be used to estimate total MAU of a feature or system for the whole population, based on a simple multiplier applied to observed data.
This is done for various windows
All of the calculations above change if you're looking at just a subset of products, a subset of geographies and so on, so we provide influence tables for various aggregations.
65. The user experience: MAU extrapolation

-- the user defines/owns their "MAU"
WITH my_mau AS
(
  SELECT
    license
    , feature_mau
  FROM (user-generated logic)
)
SELECT
  -- upweight the observed MAU according to influence to estimate the total
  SUM(m.feature_mau * w.weight_mau) AS feature_mau_estimate
FROM my_mau m
-- attach the influence scores
INNER JOIN mau_influence_weight w
  ON m.license = w.license
;
66. Agenda
Introduction to data at Atlassian
The problem of bias and incompleteness
Solving this with self-serve estimation
Tackling subadditivity in this framework
Summary and Q&A
68. Summary
Problem: we have an incomplete and biased dataset that we want to infer from
Solution: creating methods for all data workers to create consistent and accurate estimates of the total
Result: quicker, more trustworthy insights back to product teams