Building understanding out of incomplete and biased
datasets using Machine Learning and Databricks
LUKE HEINRICH | MIKE DIAS
Agenda
Introduction to data at Atlassian
The problem of bias and incompleteness
Solving this with self-serve estimation
Tackling subadditivity in this framework
Summary and Q&A
ATLASSIAN MISSION
Unleash the potential of every team
OUR MOST POPULAR PRODUCTS
ON-PREMISES & CLOUD OPTIONS
Cloud Server Data Center
PRODUCT ANALYTICS
Product Analytics helps us understand how users navigate our products and engage with our features, all with the aim of building better products
PRODUCT ANALYTICS PIPELINE (SIMPLIFIED)
[Diagram labels: Cloud products, On-premises products, Streaming, Data Lake, stream, batch]
ATLASSIAN DATA LAKE (SIMPLIFIED)
[Diagram labels: S3, GLUE, SQL, RDBS, Product analytics, Integrations, Visualization tools, Data analysts / scientists, Product Managers, PROCES]
DATA @ ATLASSIAN
AWS Based: end to end, everything is on AWS
Petabyte scale: growing by 100TB a month
2000 Users: internal users of the platform each month
200+ Data Workers: users whose primary job is to work with data
DATA COMPLETENESS
[Diagram: licences versus product analytics coverage, shown for CLOUD and for ON-]
WHAT FRACTION OF
CUSTOMERS USE FEATURE
X?
WHAT FRACTION OF CUSTOMERS USE FEATURE X?
Group A
Usage: 10%
Group B
Usage: 90%
50% of customers are in each group
The answer: 50% use feature X
WHAT FRACTION OF CUSTOMERS USE FEATURE X?
True distribution: 50%
Red group 2x as likely to send analytics: 37%
Yellow group 3x as likely to send analytics: 70%
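The 37% and 70% figures can be reproduced with a reporting-weighted average. The sketch below assumes the red group is Group A (10% usage) and the yellow group is Group B (90% usage), which is the pairing that makes the arithmetic match the slide. The function name `observed_usage` is illustrative only and does not come from the talk.

```python
def observed_usage(groups):
    """Usage fraction seen in the data when groups report at different rates.

    groups: list of (population_share, usage_rate, relative_reporting_rate).
    """
    # A group's weight in the observed data is its share times how likely it is to report.
    reporting = [share * rate for share, _, rate in groups]
    users = [share * usage * rate for share, usage, rate in groups]
    return sum(users) / sum(reporting)


# Group A: 50% of customers, 10% usage. Group B: 50% of customers, 90% usage.
true_fraction = observed_usage([(0.5, 0.10, 1), (0.5, 0.90, 1)])
red_biased = observed_usage([(0.5, 0.10, 2), (0.5, 0.90, 1)])     # Group A reports 2x as often
yellow_biased = observed_usage([(0.5, 0.10, 1), (0.5, 0.90, 3)])  # Group B reports 3x as often

print(f"{true_fraction:.0%} {red_biased:.0%} {yellow_biased:.0%}")
```

The same true usage of 50% reads as roughly 37% or 70% depending only on which group is over-represented in the analytics.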
REPORTING RATIOS BY SIZE FOR ONE OF OUR PRODUCTS
[Bar chart: reporting ratio (axis 0x to 2x) by size (Small, Medium, Large); one segment is annotated "Significantly underrepresented"]
HOW MANY USERS ARE
IMPACTED BY THIS
PLATFORM CHANGE?
HOW MANY USERS ARE IMPACTED BY THIS PLATFORM CHANGE?
Person | Jim (Product A) | Jane (Product B) | John (Product C)
Sees this many users affected in the data | 15 | 10 | 8
The true value is … | 45 | 30 | 20
They … | Do nothing to the number | Extrapolates using the % reporting | As Jane, but segmenting by size
They report: | 15 | 20 | 25
Jim: largest impact, lowest reported number
John: smallest impact, highest reported number
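The three approaches in the table can be sketched side by side to show why their outputs are not comparable. All numbers below are hypothetical and are not the figures behind the slide. The segment fields (`licenses`, `reporting`, `observed`) are assumptions made for illustration.

```python
def report_raw(segments):
    """Jim's approach: report only what is seen in the data."""
    return sum(s["observed"] for s in segments)


def report_overall(segments):
    """Jane's approach: scale by the overall fraction of licenses reporting."""
    total = sum(s["licenses"] for s in segments)
    reporting = sum(s["reporting"] for s in segments)
    return report_raw(segments) * total / reporting


def report_segmented(segments):
    """John's approach: scale each size segment by its own reporting fraction."""
    return sum(s["observed"] * s["licenses"] / s["reporting"] for s in segments)


# Hypothetical numbers, not taken from the talk.
segments = [
    {"size": "small", "licenses": 100, "reporting": 50, "observed": 10},
    {"size": "large", "licenses": 20, "reporting": 2, "observed": 3},
]
print(report_raw(segments), report_overall(segments), report_segmented(segments))
```

The three functions give different answers from the same observed data, which is the inconsistency the slide illustrates.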
WE ALWAYS NEED TO DO
ESTIMATION TO TRY AND
GET THE NUMBER RIGHT
DESIRED THEORETICALLY
Unbiased: on average, the number will be right
Consistent: the more data, the better
Efficient: using the best method to estimate something
…
ALSO DESIRED IN PRACTICE
Auditable: there’s a paper trail of how the number was estimated
Consistent (the second one!): speaks in the same language as other estimates
Best practice: leverages the strongest estimation methodologies
Understood: methodology has been socialised with stakeholders
Accessible: readily available to all users who want to make an estimate
…
WE PROVIDE
SELF-SERVE ESTIMATION
How it works (the high level)
We provide similarity scores for each license
For a given license, it distributes its total of 100% “similarity” across other licenses
[Diagram: one license linked to other licenses with example similarity shares of 1%, 8%, 6%, …]
These scores allow the user to quickly estimate unknown data
Allowing them to work as if they had a complete picture
WHAT THAT LOOKS LIKE
1. Calculated metrics (from the sample of data we get)
license | metric
A | NULL
B | 7
C | 9
D | 1
… | …
2. Similarity tables (we provide)
base | target | similarity
A | B | 0.05
A | C | 0.08
A | D | 0.01
A | E | 0.12
… | … | …
3. Estimated unknowns (1 + 2 = 3)
license | metric | estimated
A | NULL | 8
B | 7 | 7
C | 9 | 9
D | 1 | 1
… | … | …
A = 0.05*B + 0.08*C + … = 8
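The estimation step is a similarity-weighted sum over licenses whose metric is known. The sketch below uses hypothetical toy weights, because the slide elides most of license A's similarity rows. The function name `estimate_metric` and the dictionary layout are assumptions for illustration.

```python
def estimate_metric(metrics, similarities):
    """Fill unknown (None) metrics with a similarity-weighted sum of known ones.

    metrics: {license: value or None}
    similarities: {base_license: {target_license: weight}}
    """
    estimated = {}
    for license_id, value in metrics.items():
        if value is not None:
            estimated[license_id] = value  # known values pass through unchanged
            continue
        weights = similarities.get(license_id, {})
        # Only targets with a known metric can contribute to the estimate.
        estimated[license_id] = sum(
            w * metrics[target]
            for target, w in weights.items()
            if metrics.get(target) is not None
        )
    return estimated


# Hypothetical toy data: the weights for A sum to 1 over three known licenses.
metrics = {"A": None, "B": 7, "C": 9, "D": 1}
similarities = {"A": {"B": 0.5, "C": 0.25, "D": 0.25}}
print(estimate_metric(metrics, similarities))
```

Known licenses keep their own value and only the unknown license A is filled in, mirroring the "metric" and "estimated" columns in the table.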
Under the hood: Data structure
license | tenure | owns_confluence | seats | … | usage
A | 1.5 | 1 | 25 | … | 17
B | 6.3 | 0 | 500 | … | 251
C | 5.2 | 1 | 100 | … | 87
D | 3.9 | 0 | 50 | … | 28
… | … | … | … | … | …
L | 4.7 | 1 | 250 | … | ?
M | 0.7 | 1 | 50 | … | ?
Features known for all licenses: tenure, owns_confluence, seats, …
Attribute of product usage: usage
Optimise prediction on known data
Score out on unknown licenses
Under the hood: Random Forest kernel
We build a random forest, but focus on the tree structure embedded in it
Rather than focusing on the prediction, we focus on the structure that generated it
From it, we extract the weights of the prediction
The prediction of the random forest can be expressed as a weighted average of data in the corresponding leaf nodes, and these “weights” are what we extract
These weights give us our supervised similarity
The random forest does our feature selection for us, overcoming the downfall of unsupervised methods
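The weighted-average view of a forest prediction can be checked on a toy example. This is a minimal sketch, assuming each tree predicts the plain mean of the known licenses in a leaf (no bootstrap weighting) and that every leaf reached by the target holds at least one known license. The leaf assignments are hypothetical, and the helper names are illustrative rather than the authors' implementation.

```python
from collections import defaultdict


def forest_weights(leaves, known, target):
    """Similarity weights of `target` over `known` licenses implied by shared leaves.

    leaves: {license: [leaf id in tree 0, leaf id in tree 1, ...]}
    """
    n_trees = len(leaves[target])
    weights = defaultdict(float)
    for t in range(n_trees):
        # Known licenses that land in the same leaf as the target in tree t.
        peers = [k for k in known if leaves[k][t] == leaves[target][t]]
        for k in peers:
            # Each tree spreads an equal share of 1/n_trees across the leaf peers.
            weights[k] += 1.0 / (n_trees * len(peers))
    return dict(weights)


def forest_prediction(leaves, usage, target):
    """Average over trees of the mean usage in the target's leaf."""
    per_tree = []
    for t in range(len(leaves[target])):
        peers = [usage[k] for k in usage if leaves[k][t] == leaves[target][t]]
        per_tree.append(sum(peers) / len(peers))
    return sum(per_tree) / len(per_tree)


# Hypothetical leaf assignments for a 2-tree forest; usage values follow the A-D rows.
leaves = {"A": [0, 1], "B": [1, 0], "C": [0, 0], "D": [0, 1], "L": [0, 0]}
usage = {"A": 17, "B": 251, "C": 87, "D": 28}

w = forest_weights(leaves, usage, "L")
weighted = sum(w[k] * usage[k] for k in w)
print(w, weighted, forest_prediction(leaves, usage, "L"))
```

The weights sum to 1 and the weighted sum equals the forest prediction, so the weights can be stored once and reused for any other metric.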
Under the hood: Output tables and transfer
We materialise both the tree structure and the implied similarities into tables in our data lake
The implied similarity table was discussed earlier, but we also convert the scored forest into a table of license * tree * leaf
Tables are then available to data users to use in estimation of any metric and its distribution
As before, any arbitrary metric based on our incomplete product analytics can be estimated for the full set of licenses
Various "similarities" are available depending on the metric being analysed
As similarity is defined with regard to an objective (e.g. use, product co-use etc.), we provide models built to different responses so the user can select which one will best transfer to their problem
The user experience: Similarity tables
-- the user defines/owns their metric
WITH my_metric AS
(
    SELECT
        license
      , metric
    FROM (user-generated logic)
)
SELECT
    m1.license
  , m1.metric
    -- take the metric when it is known (m1) or estimate it with the similarity scores
  , COALESCE(m1.metric, SUM(s.similarity * m2.metric)) AS estimated_metric
FROM my_metric m1
-- join to similar licenses (and their similarity weights)
LEFT JOIN similarities s
    ON m1.license = s.base
-- take the metric values for those “similar” licenses
LEFT JOIN my_metric m2
    ON s.target = m2.license
GROUP BY 1,2
;
The user experience: Direct tree structure
-- the user defines/owns their metric
WITH my_metric AS
(
    SELECT
        license
      , metric
    FROM (user-generated logic)
),
-- those without data
needs_est AS
(
    SELECT
        f.*
    FROM scored_forest f
    LEFT JOIN my_metric m
        ON f.license = m.license
    -- the metric lives on my_metric (m), not on the scored forest (f)
    WHERE m.metric IS NULL
),
-- those with data
estimators AS
(
    SELECT
        f.*
      , m.metric
    FROM scored_forest f
    LEFT JOIN my_metric m
        ON f.license = m.license
    WHERE m.metric IS NOT NULL
),
-- prediction with tree structure
kernels AS
(
    SELECT
        license
      , AVG(tree_est) AS metric
    FROM
    (
        -- per tree: average the known metric of licenses sharing the same leaf
        SELECT
            n.license
          , n.tree
          , AVG(e.metric) AS tree_est
        FROM needs_est n
        INNER JOIN estimators e
            ON n.tree = e.tree
            AND n.leaf = e.leaf
        GROUP BY 1,2
    ) per_tree -- alias added because some SQL engines require one on a derived table
    GROUP BY 1
)
-- union the known and the estimated
SELECT *
FROM kernels
UNION ALL
SELECT *
FROM my_metric
WHERE metric IS NOT NULL
;
MODEL PIPELINE
[Diagram labels: Atlassian data lake; known, input variables (RDBS, Integrations, Product analytics); output model results (Similarities, Tree structure); data workers]
WHERE OUR SIMILARITY
SCORES FALL DOWN:
SUBADDITIVE METRICS
MONTHLY ACTIVE USERS IN AN UNCERTAIN WORLD
14 + 11 = 17
14 + ? = ?
? + ? = ?
Who using this license already exists in the data?
Then based on adding the left license in, who is new on this one?
WE’RE BACK TO SQUARE
ONE
WITH ALL OUR PROBLEMS
MONTHLY ACTIVE USERS IN AN UNCERTAIN WORLD
9 + 10 = ?
9 + 10 = 11
9 + 10 = 19
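Subadditivity comes from users who are active on more than one license being counted once in the combined total. The sketch below uses hypothetical user ids chosen so that 14 and 11 active users combine to 17, and shows the range a combined total can take when the overlap is unknown. The function names are illustrative.

```python
def combined_mau(users_by_license):
    """Distinct active users across licenses (a user on two licenses counts once)."""
    return len(set().union(*users_by_license))


def mau_bounds(mau_a, mau_b):
    """Range of possible combined MAU when the overlap is unknown."""
    # Full overlap gives the larger of the two; no overlap gives the plain sum.
    return max(mau_a, mau_b), mau_a + mau_b


# Hypothetical user ids: 14 users on one license, 11 on another, 8 shared.
left = set(range(14))      # users 0..13
right = set(range(6, 17))  # users 6..16, overlapping on 6..13
print(len(left), len(right), combined_mau([left, right]))
print(mau_bounds(9, 10))
```

Both 11 and 19 fall inside the bounds for licenses of 9 and 10 users, which is why the total cannot be read off the per-license numbers alone.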
HOW DO WE PROVIDE
A SIMILAR FRAMEWORK?
Under the hood: Modelling it out
We first model total MAU through focusing on incremental MAU
Within a customer, we start with their highest-use instance and start “adding” others to it according to the formula:
[formula not recoverable from the slide text]
The fraction or weight “w” is what we model, based on attributes of the license we’re adding compared to the licenses that have already been considered, such as “product seen before” or “country seen before” etc.
We then score this model out for all customers
This iterative process is followed to create estimates for all individual customers, starting with their largest instance (either predicted or known) and summing individually through all others
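The formula on the slide is not present in the text. One form consistent with the description, offered purely as an assumption, is total_k = total_(k-1) + w_k * MAU_k, where w_k between 0 and 1 is the modelled fraction of license k's users not already counted. The sketch below takes the weights as given, whereas in the talk they come from a model. The function name is hypothetical.

```python
def incremental_total(maus, weights):
    """Accumulate a customer's total MAU one license at a time.

    maus: per-license MAU, largest first.
    weights: assumed fraction of each later license's MAU that is new,
             one weight per license after the first.
    """
    if len(weights) != len(maus) - 1:
        raise ValueError("need one weight per license after the first")
    total = maus[0]  # start from the highest-use instance
    for mau, w in zip(maus[1:], weights):
        total += w * mau  # add only the users assumed not to have been seen before
    return total


# With MAU of 14 and 11, a weight of 3/11 gives a combined total of about 17.
print(incremental_total([14, 11], [3 / 11]))
```

A weight of 1 recovers the plain sum and a weight of 0 means every user on the added license was already counted.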
Under the hood: Allocating it back to licenses
Total MAU is then allocated back over individual licenses through an equal proportional allocation
As the sum of MAU across instances is not our estimate of total MAU, we give each instance a haircut to create its “allocated MAU” based on how large it is relative to other licenses within a customer
We then pivot our similarity scores to focus back on “known” examples
Instead of focusing on the unknown values, we focus on how the known predicts the unknown, creating “influence scores” based on how much unknown “allocated MAU” is considered similar to the known license.
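The proportional "haircut" can be sketched as scaling every license by the same factor so that the allocated values add up to the customer's estimated total. The numbers and license names below are hypothetical, and `allocate_mau` is an illustrative name rather than the authors' code.

```python
def allocate_mau(total_mau, license_maus):
    """Split a customer's total MAU across licenses in proportion to each license's own MAU."""
    raw_sum = sum(license_maus.values())
    return {lic: total_mau * mau / raw_sum for lic, mau in license_maus.items()}


# Hypothetical customer: licenses with MAU 14 and 11 whose combined total is estimated at 17.
allocated = allocate_mau(17, {"X": 14, "Y": 11})
print(allocated)
```

After allocation the per-license values are additive again, which lets the similarity scores be applied to them.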
Under the hood: Output tables and transfer
And provide these influence tables for ready estimation
These influence scores can then be used to estimate total MAU of a feature or system for the whole population, based on a simple multiplier applied to observed data.
This is done for various windows
All of the calculations above change if you’re just using a subset of products, a subset of geographies and so on - we provide influence tables for various aggregations.
The user experience: MAU extrapolation
-- the user defines/owns their “MAU”
WITH my_mau AS
(
    SELECT
        license
      , feature_mau
    FROM (user-generated logic)
)
SELECT
    -- upweight the observed MAU according to influence to estimate the total
    SUM(feature_mau * weight_mau) AS feature_mau_estimate
FROM my_mau m
-- attach on influence scores
INNER JOIN mau_influence_weight w
    ON m.license = w.license
;
Summary
Problem: We have an incomplete and biased dataset that we want to infer from
Solution: Creating methods for all data workers to create consistent and accurate estimates of the total
Result: Quicker, more trustworthy insights back to product teams
Q&A
LUKE HEINRICH | MIKE DIAS

More Related Content

What's hot

Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...Sri Ambati
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowDatabricks
 
IBM Predictive analytics IoT Presentation
IBM Predictive analytics IoT PresentationIBM Predictive analytics IoT Presentation
IBM Predictive analytics IoT PresentationIan Skerrett
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeIdo Shilon
 
Big data in the cloud - welcome to cost oriented design
Big data in the cloud - welcome to cost oriented designBig data in the cloud - welcome to cost oriented design
Big data in the cloud - welcome to cost oriented designArnon Rotem-Gal-Oz
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learningStanley Wang
 
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify Dataconomy Media
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerDatabricks
 
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...MSAdvAnalytics
 
Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...
Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...
Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...Codemotion
 
Consolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsConsolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsDatabricks
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners Jen Stirrup
 
Making your RDBMS fast!
Making your RDBMS fast! Making your RDBMS fast!
Making your RDBMS fast! VictorSzoltysek
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013MLconf
 
BDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
BDW16 London - William Vambenepe, Google - 3rd Generation Data PlatformBDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
BDW16 London - William Vambenepe, Google - 3rd Generation Data PlatformBig Data Week
 
Linear regression on 1 terabytes of data? Some crazy observations and actions
Linear regression on 1 terabytes of data? Some crazy observations and actionsLinear regression on 1 terabytes of data? Some crazy observations and actions
Linear regression on 1 terabytes of data? Some crazy observations and actionsHesen Peng
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLPaco Nathan
 

What's hot (20)

Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLow
 
IBM Predictive analytics IoT Presentation
IBM Predictive analytics IoT PresentationIBM Predictive analytics IoT Presentation
IBM Predictive analytics IoT Presentation
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
Big data in the cloud - welcome to cost oriented design
Big data in the cloud - welcome to cost oriented designBig data in the cloud - welcome to cost oriented design
Big data in the cloud - welcome to cost oriented design
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learning
 
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
 
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles Baker
 
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
 
Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...
Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...
Yufeng Guo - Building machine learning systems for scale with Google Cloud AI...
 
Consolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsConsolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest Airports
 
Presentation
PresentationPresentation
Presentation
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners
 
Making your RDBMS fast!
Making your RDBMS fast! Making your RDBMS fast!
Making your RDBMS fast!
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013
 
BDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
BDW16 London - William Vambenepe, Google - 3rd Generation Data PlatformBDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
BDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
 
Linear regression on 1 terabytes of data? Some crazy observations and actions
Linear regression on 1 terabytes of data? Some crazy observations and actionsLinear regression on 1 terabytes of data? Some crazy observations and actions
Linear regression on 1 terabytes of data? Some crazy observations and actions
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
 

Similar to Building Understanding Out of Incomplete and Biased Datasets using Machine Learning and Databricks

Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Simplilearn
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarinn5712036
 
7qc_tools_173.ppt for 7 QC tools implementation of the quality
7qc_tools_173.ppt for 7 QC tools implementation of the quality7qc_tools_173.ppt for 7 QC tools implementation of the quality
7qc_tools_173.ppt for 7 QC tools implementation of the qualityMANISHDUBEY14500
 
7qctools173-100325112603-phpapp01.ppt
7qctools173-100325112603-phpapp01.ppt7qctools173-100325112603-phpapp01.ppt
7qctools173-100325112603-phpapp01.pptOswaldo Gonzales
 
Big Data Science - hype?
Big Data Science - hype?Big Data Science - hype?
Big Data Science - hype?BalaBit
 
Retail products - machine learning recommendation engine
Retail products   - machine learning recommendation engineRetail products   - machine learning recommendation engine
Retail products - machine learning recommendation enginehkbhadraa
 
Price optimization for high-mix, low-volume environments | Using R and Tablea...
Price optimization for high-mix, low-volume environments | Using R and Tablea...Price optimization for high-mix, low-volume environments | Using R and Tablea...
Price optimization for high-mix, low-volume environments | Using R and Tablea...Wil Davis
 
Introduction to Tableau
Introduction to TableauIntroduction to Tableau
Introduction to TableauKanika Nagpal
 
Top tableau questions and answers in 2019
Top tableau questions and answers in 2019Top tableau questions and answers in 2019
Top tableau questions and answers in 2019minatibiswal1
 
Implementing a data_science_project (Python Version)_part1
Implementing a data_science_project (Python Version)_part1Implementing a data_science_project (Python Version)_part1
Implementing a data_science_project (Python Version)_part1Dr Sulaimon Afolabi
 
Power BI vs Tableau vs Cognos: A Data Analytics Research
Power BI vs Tableau vs Cognos: A Data Analytics ResearchPower BI vs Tableau vs Cognos: A Data Analytics Research
Power BI vs Tableau vs Cognos: A Data Analytics ResearchLuciano Vilas Boas
 
Lobsters, Wine and Market Research
Lobsters, Wine and Market ResearchLobsters, Wine and Market Research
Lobsters, Wine and Market ResearchTed Clark
 
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...Big Data Week
 
End-to-End Machine Learning Project
End-to-End Machine Learning ProjectEnd-to-End Machine Learning Project
End-to-End Machine Learning ProjectEng Teong Cheah
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dmsumit621
 

Similar to Building Understanding Out of Incomplete and Biased Datasets using Machine Learning and Databricks (20)

Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
 
7qc_tools_173.ppt for 7 QC tools implementation of the quality
7qc_tools_173.ppt for 7 QC tools implementation of the quality7qc_tools_173.ppt for 7 QC tools implementation of the quality
7qc_tools_173.ppt for 7 QC tools implementation of the quality
 
7qctools173-100325112603-phpapp01.ppt
7qctools173-100325112603-phpapp01.ppt7qctools173-100325112603-phpapp01.ppt
7qctools173-100325112603-phpapp01.ppt
 
Big Data Science - hype?
Big Data Science - hype?Big Data Science - hype?
Big Data Science - hype?
 
Retail products - machine learning recommendation engine
Retail products   - machine learning recommendation engineRetail products   - machine learning recommendation engine
Retail products - machine learning recommendation engine
 
Price optimization for high-mix, low-volume environments | Using R and Tablea...
Price optimization for high-mix, low-volume environments | Using R and Tablea...Price optimization for high-mix, low-volume environments | Using R and Tablea...
Price optimization for high-mix, low-volume environments | Using R and Tablea...
 
Introduction to Tableau
Introduction to TableauIntroduction to Tableau
Introduction to Tableau
 
Top tableau questions and answers in 2019
Top tableau questions and answers in 2019Top tableau questions and answers in 2019
Top tableau questions and answers in 2019
 
Implementing a data_science_project (Python Version)_part1
Implementing a data_science_project (Python Version)_part1Implementing a data_science_project (Python Version)_part1
Implementing a data_science_project (Python Version)_part1
 
Online Shop Recommendation System
Online Shop Recommendation SystemOnline Shop Recommendation System
Online Shop Recommendation System
 
Similarity at Scale
Similarity at ScaleSimilarity at Scale
Similarity at Scale
 
Power BI vs Tableau vs Cognos: A Data Analytics Research
Power BI vs Tableau vs Cognos: A Data Analytics ResearchPower BI vs Tableau vs Cognos: A Data Analytics Research
Power BI vs Tableau vs Cognos: A Data Analytics Research
 
7qc Tools 173
7qc Tools 1737qc Tools 173
7qc Tools 173
 
7qc Tools 173
7qc Tools 1737qc Tools 173
7qc Tools 173
 
Lobsters, Wine and Market Research
Lobsters, Wine and Market ResearchLobsters, Wine and Market Research
Lobsters, Wine and Market Research
 
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
 
Unit 3-2.ppt
Unit 3-2.pptUnit 3-2.ppt
Unit 3-2.ppt
 
End-to-End Machine Learning Project
End-to-End Machine Learning ProjectEnd-to-End Machine Learning Project
End-to-End Machine Learning Project
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dm
 

Building Understanding Out of Incomplete and Biased Datasets using Machine Learning and Databricks

  • 1. Building understanding out of incomplete and biased datasets using Machine Learning and Databricks LUKE HEINRICH | MIKE DIAS
  • 2. Agenda Introduction to data at Atlassian The problem of bias and incompleteness Solving this with self-serve estimation Tackling subadditivity in this framework Summary and Q&A
  • 3. Unleash the potential of every team ATLASSIAN MISSION
  • 4. OUR MOST POPULAR PRODUCTS
  • 5. ON-PREMISES & CLOUD OPTIONS Cloud Server Data Center
  • 6. Product Analytics helps us understand how users navigate our products and engage with our features, all with the aim of building better products PRODUCT ANALYTICS
  • 7. PRODUCT ANALYTICS PIPELINE (SIMPLIFIED) Cloud products Streaming Streaming On-premises products Data Lake stream batch
  • 8. ATLASSIAN DATA LAKE (SIMPLIFIED) S3 S3 S3 GLUE SQL Data analysts / scientists Product Managers Visualization tools RDBS Product analytics Integrations PROCES
  • 9. DATA @ ATLASSIAN. AWS based: end to end, everything is on AWS. Petabyte scale: growing by 100TB a month. 2000 users: internal users of the platform each month. 200+ data workers: users whose primary job is to work with data.
  • 12. Agenda Introduction to data at Atlassian The problem of bias and incompleteness Solving this with self-serve estimation Tackling subadditivity in this framework Summary and Q&A
  • 14. WHAT FRACTION OF CUSTOMERS USE FEATURE X?
  • 15. WHAT FRACTION OF CUSTOMERS USE FEATURE X? Group A Usage: 10% Group B Usage: 90% 50% of customers are in each group
  • 16. WHAT FRACTION OF CUSTOMERS USE FEATURE X? Group A Usage: 10% Group B Usage: 90% 50% of customers are in each group The answer: 50% use feature X
  • 17. WHAT FRACTION OF CUSTOMERS USE FEATURE X? True distribution 50%
  • 18. WHAT FRACTION OF CUSTOMERS USE FEATURE X? True distribution 50% Red group 2x as likely to send analytics
  • 19. WHAT FRACTION OF CUSTOMERS USE FEATURE X? True distribution: 50%. Red group 2x as likely to send analytics: 37%.
  • 20. WHAT FRACTION OF CUSTOMERS USE FEATURE X? True distribution: 50%. Red group 2x as likely to send analytics: 37%. Yellow group 3x as likely to send analytics: 70%.
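The 37% and 70% figures are what you get when each group's share of the sample is scaled by how likely it is to send analytics. The sketch below reproduces that arithmetic, assuming the "red" group is Group A (10% usage) and the "yellow" group is Group B (90% usage), which is the pairing consistent with the numbers on the slides. The function name and its parametrisation are illustrative, not taken from the talk.

```python
def observed_usage(groups):
    """Usage rate seen in the analytics sample.

    groups: list of (population_share, usage_rate, relative_reporting_rate).
    A group's weight in the sample is its population share multiplied by
    how likely it is to send analytics.
    """
    total_weight = sum(share * report for share, _, report in groups)
    weighted_usage = sum(share * usage * report for share, usage, report in groups)
    return weighted_usage / total_weight


# Group A uses feature X 10% of the time, Group B 90%; each is half of customers.
unbiased = observed_usage([(0.5, 0.10, 1), (0.5, 0.90, 1)])    # 0.50
group_a_2x = observed_usage([(0.5, 0.10, 2), (0.5, 0.90, 1)])  # about 0.37
group_b_3x = observed_usage([(0.5, 0.10, 1), (0.5, 0.90, 3)])  # 0.70
```

The true answer stays at 50% throughout; only the mix of who reports changes, which is why the naive sample average can land on either side of it.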
  • 21. REPORTING RATIOS BY SIZE FOR ONE OF OUR PRODUCTS [Chart: reporting ratio (axis from 0x to 2x) for Small, Medium and Large customers; annotation: "Significantly underrepresented"]
  • 22. HOW MANY USERS ARE IMPACTED BY THIS PLATFORM CHANGE?
  • 23-30. HOW MANY USERS ARE IMPACTED BY THIS PLATFORM CHANGE?
      Jim (Product A): sees 15 users affected in the data; the true value is 45; does nothing to the number; reports 15. Largest impact, lowest reported number.
      Jane (Product B): sees 10 users affected in the data; the true value is 30; extrapolates using the % reporting; reports 20.
      John (Product C): sees 8 users affected in the data; the true value is 20; as Jane, but segmenting by size; reports 25. Smallest impact, highest reported number.
  • 31. Agenda Introduction to data at Atlassian The problem of bias and incompleteness Solving this with self-serve estimation Tackling subadditivity in this framework Summary and Q&A
  • 33. WE ALWAYS NEED TO DO ESTIMATION TO TRY AND GET THE NUMBER RIGHT
  • 34. DESIRED THEORETICALLY: Unbiased (on average, the number will be right); Consistent (the more data, the better); Efficient (using the best method to estimate something); … ALSO DESIRED IN PRACTICE.
  • 35. DESIRED THEORETICALLY: Unbiased (on average, the number will be right); Consistent (the more data, the better); Efficient (using the best method to estimate something); … ALSO DESIRED IN PRACTICE: Auditable (there’s a paper trail of how the number was estimated); Consistent, the second one! (speaks in the same language as other estimates); Best practice (leverages the strongest estimation methodologies); Understood (methodology has been socialised with stakeholders); Accessible (readily available to all users who want to make an estimate); …
  • 37. How it works (the high level). We provide similarity scores for each license: a given license distributes its total of 100% “similarity” across other licenses (for example 1%, 8%, 6%, …). These scores allow the user to quickly estimate unknown data, allowing them to work as if they had a complete picture.
  • 38-41. WHAT THAT LOOKS LIKE
      1. Calculated metrics (from the sample of data we get). license: metric. A: NULL (unknown); B: 7; C: 9; D: 1; …
      2. Similarity tables (we provide). base, target: similarity. A, B: 0.05; A, C: 0.08; A, D: 0.01; A, E: 0.12; …
      3. Estimated unknowns. A = 0.05*B + 0.08*C + … = 8. license: metric, estimated. A: NULL, 8 (estimated); B: 7, 7; C: 9, 9; D: 1, 1; …
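The estimation step on these slides is a similarity-weighted sum over licenses whose metric is known. A minimal sketch follows; the function name and the toy similarity values are hypothetical, chosen so the weights for license A sum to 1, and are not the values from the slide, whose remaining rows are elided.

```python
def estimate_unknowns(metrics, similarities):
    """Fill unknown metrics with a similarity-weighted sum of known ones.

    metrics: dict license -> value, or None when the license sent no data.
    similarities: dict base -> {target: weight}; weights for a base are
        assumed to sum to 1 over licenses with known metrics.
    """
    estimated = {}
    for license_id, value in metrics.items():
        if value is not None:
            estimated[license_id] = value  # keep observed values as they are
            continue
        weights = similarities.get(license_id, {})
        estimated[license_id] = sum(
            weight * metrics[target]
            for target, weight in weights.items()
            if metrics.get(target) is not None
        )
    return estimated


metrics = {"A": None, "B": 7, "C": 9, "D": 1}
similarities = {"A": {"B": 0.5, "C": 0.4, "D": 0.1}}  # toy weights, not from the talk
estimated = estimate_unknowns(metrics, similarities)
```

If some of a license's similar licenses are themselves unknown, the plain sum is deflated; renormalising by the sum of the weights actually used is one possible remedy, though the slides do not say whether this is done.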
  • 42. Under the hood (Data structure | Random Forest kernel | Output tables and transfer): Data structure
      license | tenure | owns_confluence | seats | … | usage
      A | 1.5 | 1 | 25 | … | 17
      B | 6.3 | 0 | 500 | … | 251
      C | 5.2 | 1 | 100 | … | 87
      D | 3.9 | 0 | 50 | … | 28
      … | … | … | … | … | …
      L | 4.7 | 1 | 250 | … | ?
      M | 0.7 | 1 | 50 | … | ?
      tenure, owns_confluence, seats, …: features known for all licenses. usage: attribute of product usage. Optimise prediction on known data; score out on unknown licenses.
  • 43. Under the hood (Data structure | Random Forest kernel | Output tables and transfer): Random Forest kernel. We build a random forest, but focus on the tree structure embedded in it: rather than focusing on the prediction, we care about the structure that generated it. From it, we extract the weights of the prediction: the prediction of the random forest can be expressed as a weighted average of data in the corresponding leaf nodes, and these “weights” are what we extract. These weights give us our supervised similarity: the random forest does our feature selection for us, overcoming the downfall of unsupervised methods.
  • 44. Under the hood (Data structure | Random Forest kernel | Output tables and transfer): Output tables and transfer. We materialise both the tree structure and the implied similarities into tables in our data lake: the implied similarity table was discussed earlier, but we also convert the scored forest into a table of license * tree * leaf. Tables are then available to data users to use in estimation of any metric and its distribution: as before, any arbitrary metric based on our incomplete product analytics can be estimated for the full set of licenses. Various “similarities” are available depending on the metric being analysed: as similarity is defined with regard to an objective (e.g. use, product co-use, etc.), we provide models built to different responses, so the user can select the one that will best transfer to their problem.
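One way to see how a license * tree * leaf table implies similarity weights: in each tree, an unknown license is predicted by the mean of the known licenses sharing its leaf, and the forest averages those per-tree means. The sketch below derives the resulting weights in plain Python. It is an illustration under stated assumptions (no bootstrap sampling, trees with no known license in the leaf are skipped), and the function and variable names are hypothetical rather than the authors' implementation.

```python
from collections import defaultdict


def forest_similarities(scored_forest, known):
    """Derive similarity weights from leaf co-membership.

    scored_forest: iterable of (license, tree, leaf) rows.
    known: set of licenses that have observed data.
    Returns {base: {target: weight}} for every license without data.
    """
    members = defaultdict(list)  # (tree, leaf) -> known licenses in that leaf
    leaves = defaultdict(list)   # unknown license -> [(tree, leaf), ...]
    for license_id, tree, leaf in scored_forest:
        if license_id in known:
            members[(tree, leaf)].append(license_id)
        else:
            leaves[license_id].append((tree, leaf))

    similarities = {}
    for base, base_leaves in leaves.items():
        # only trees where the leaf holds at least one known license can vote
        usable = [key for key in base_leaves if members[key]]
        weights = defaultdict(float)
        for key in usable:
            for target in members[key]:
                # a tree predicts the leaf mean; the forest averages the trees
                weights[target] += 1.0 / (len(members[key]) * len(usable))
        similarities[base] = dict(weights)
    return similarities


rows = [
    ("A", 0, 1), ("B", 0, 1), ("C", 0, 1), ("D", 0, 2),
    ("A", 1, 5), ("B", 1, 5), ("C", 1, 6), ("D", 1, 6),
]
sims = forest_similarities(rows, known={"B", "C", "D"})
metric = {"B": 7, "C": 9, "D": 1}
estimate_a = sum(weight * metric[target] for target, weight in sims["A"].items())
```

In this toy forest, A shares a leaf with B and C in tree 0 and with B alone in tree 1, so its weights are 0.75 on B and 0.25 on C, and the weighted estimate equals the average of the two per-tree leaf means.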
  • 45. The user experience. Similarity tables | Direct tree structure
      -- the user defines/owns their metric
      WITH my_metric AS (
        SELECT license
             , metric
        FROM (user-generated logic)
      )
      SELECT m1.license
           , m1.metric
           -- take the metric when it is known (m1) or estimate it with the similarity scores
           , COALESCE(m1.metric, SUM(s.similarity * m2.metric)) AS estimated_metric
      FROM my_metric m1
      -- join to similar licenses (and their similarity weights)
      LEFT JOIN similarities s ON m1.license = s.base
      -- take the metric values for those “similar” licenses
      LEFT JOIN my_metric m2 ON s.target = m2.license
      GROUP BY 1,2
      ;
  • 46. The user experience. Similarity tables | Direct tree structure
      -- the user defines/owns their metric
      WITH my_metric AS (
        SELECT license
             , metric
        FROM (user-generated logic)
      ),
      -- those without data
      needs_est AS (
        SELECT f.*
        FROM scored_forest f
        LEFT JOIN my_metric m ON f.license = m.license
        -- the metric lives on my_metric (m), so test it there
        WHERE m.metric IS NULL
      ),
      -- those with data
      estimators AS (
        SELECT f.*
             , m.metric
        FROM scored_forest f
        LEFT JOIN my_metric m ON f.license = m.license
        WHERE m.metric IS NOT NULL
      ),
      -- prediction with tree structure
      kernels AS (
        SELECT license
             , AVG(tree_est) AS metric
        FROM (
          SELECT n.license
               , n.tree
               , AVG(e.metric) AS tree_est
          FROM needs_est n
          INNER JOIN estimators e ON n.tree = e.tree AND n.leaf = e.leaf
          GROUP BY 1,2
        )
        GROUP BY 1
      )
      -- union the known and the estimated
      SELECT * FROM kernels
      UNION ALL
      SELECT * FROM my_metric WHERE metric IS NOT NULL
      ;
  • 47. MODEL PIPELINE [Diagram labels: ATLASSIAN DATA LAKE; KNOWN, INPUT VARIABLES (RDBS, Integrations); PRODUCT ANALYTICS; MODEL PIPELINE; OUTPUT MODEL RESULTS (Similarities, Tree structure); DATA WORKERS]
  • 48. DESIRED THEORETICALLY: Unbiased (on average, the number will be right); Consistent (the more data, the better); Efficient (using the best method to estimate something); … ALSO DESIRED IN PRACTICE: Auditable (there’s a paper trail of how the number was estimated); Consistent, the second one! (speaks in the same language as other estimates); Best practice (leverages the strongest estimation methodologies); Understood (methodology has been socialised with stakeholders); Accessible (readily available to all users who want to make an estimate); …
  • 49. Agenda Introduction to data at Atlassian The problem of bias and incompleteness Solving this with self-serve estimation Tackling subadditivity in this framework Summary and Q&A
  • 51. WHERE OUR SIMILARITY SCORES FALL DOWN: SUBADDITIVE METRICS
  • 52. MONTHLY ACTIVE USERS IN AN UNCERTAIN WORLD: 14 + 11 = 17
  • 53. MONTHLY ACTIVE USERS IN AN UNCERTAIN WORLD: 14 + ? = ?
  • 54. MONTHLY ACTIVE USERS IN AN UNCERTAIN WORLD: ? + ? = ?
  • 55. MONTHLY ACTIVE USERS IN AN UNCERTAIN WORLD: ? + ? = ? Who using this license already exists in the data?
  • 56. MONTHLY ACTIVE USERS IN AN UNCERTAIN WORLD: ? + ? = ? Who using this license already exists in the data? Then, based on adding the left license in, who is new on this one?
  • 57. WE’RE BACK TO SQUARE ONE WITH ALL OUR PROBLEMS
  • 58. MONTHLY ACTIVE USERS IN AN UNCERTAIN WORLD: 9 + 10 = ?
  • 59. MONTHLY ACTIVE USERS IN AN UNCERTAIN WORLD: 9 + 10 = 11
  • 60. MONTHLY ACTIVE USERS IN AN UNCERTAIN WORLD: 9 + 10 = 19
  • 61. HOW DO WE PROVIDE A SIMILAR FRAMEWORK?
  • 62. Under the hood (Modelling it out | Allocating it back to licenses | Output tables and transfer): Modelling it out. We first model total MAU through focusing on incremental MAU: within a customer, we start with their highest-use instance and start “adding” others to it according to a formula with a fraction or weight “w”. This weight is what we model, based on attributes of the license we’re adding compared to the licenses that have already been considered, such as “product seen before” or “country seen before”. We then score this model out for all customers: this iterative process is followed to create estimates for all individual customers, starting with their largest instance (either predicted or known) and summing individually through all others.
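The formula itself did not survive in this transcript, so the sketch below assumes one plausible form purely for illustration: each added instance contributes a fraction w of its MAU, where w is the share of its users assumed to be new. The function name, the `incremental_weight` callable and this additive form are all assumptions, not the talk's actual model.

```python
def estimate_total_mau(instance_maus, incremental_weight):
    """Combine per-instance MAU for one customer into a total.

    instance_maus: MAU per instance (known or predicted).
    incremental_weight: hypothetical callable (already_counted, candidate) -> w
        in [0, 1], the share of the candidate's users assumed to be new.
    """
    ordered = sorted(instance_maus, reverse=True)  # start from the highest-use instance
    if not ordered:
        return 0.0
    total = float(ordered[0])
    counted = [ordered[0]]
    for mau in ordered[1:]:
        w = incremental_weight(counted, mau)
        total += w * mau  # assumed form: only the "new" share adds to the total
        counted.append(mau)
    return total


# Two instances with 9 and 10 monthly active users.
full_overlap = estimate_total_mau([9, 10], lambda counted, mau: 0.0)  # 10.0
no_overlap = estimate_total_mau([9, 10], lambda counted, mau: 1.0)    # 19.0
```

Under this assumed form the total is bounded below by the largest instance and above by the plain sum, which is the subadditive behaviour the preceding slides describe.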
  • 63. Under the hood (Modelling it out | Allocating it back to licenses | Output tables and transfer): Allocating it back to licenses. Total MAU is then allocated back over individual licenses through an equal proportional allocation: as the sum of MAU across instances is not our estimate of total MAU, we give each instance a haircut to create its “allocated MAU”, based on how large it is relative to other licenses within a customer. We then pivot our similarity scores to focus back on “known” examples: instead of focusing on the unknown values, we focus on how the known predicts the unknown, creating “influence scores” based on how much unknown “allocated MAU” is considered similar to the known license.
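The proportional "haircut" can be sketched as scaling every license's MAU by the same factor so that the allocations add up to the customer's estimated total. The function name and the example figures are illustrative assumptions, not taken from the authors' pipeline.

```python
def allocate_mau(instance_maus, total_mau):
    """Scale each license's MAU so the allocations sum to the customer total.

    instance_maus: dict license -> MAU (known or predicted).
    total_mau: estimated deduplicated MAU for the whole customer.
    """
    raw_sum = sum(instance_maus.values())
    if raw_sum == 0:
        return {license_id: 0.0 for license_id in instance_maus}
    haircut = total_mau / raw_sum  # same proportional reduction for every license
    return {license_id: mau * haircut for license_id, mau in instance_maus.items()}


# Two licenses with 14 and 11 MAU whose combined, deduplicated total is 17.
allocated = allocate_mau({"x": 14, "y": 11}, total_mau=17)
```

Because every license gets the same factor, larger licenses keep a proportionally larger share of the total, and the allocated values become additive across licenses.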
  • 64. Under the hood (Modelling it out | Allocating it back to licenses | Output tables and transfer): Output tables and transfer. We provide these influence tables for ready estimation: the influence scores can be used to estimate total MAU of a feature or system for the whole population, based on a simple multiplier applied to observed data. This is done for various windows: all of the calculations above change if you’re just using a subset of products, a subset of geographies and so on, so we provide influence tables for various aggregations.
  • 65. The user experience: MAU extrapolation
      -- the user defines/owns their “MAU”
      WITH my_mau AS (
        SELECT license
             , feature_mau
        FROM (user-generated logic)
      )
      SELECT
        -- upweight the observed MAU according to influence to estimate the total
        SUM(feature_mau * weight_mau) AS feature_mau_estimate
      FROM my_mau m
      -- attach on influence scores
      INNER JOIN mau_influence_weight w ON m.license = w.license
      ;
  • 66. Agenda Introduction to data at Atlassian The problem of bias and incompleteness Solving this with self-serve estimation Tackling subadditivity in this framework Summary and Q&A
  • 68. Summary. Problem: we have an incomplete and biased dataset that we want to infer from. Solution: creating methods for all data workers to create consistent and accurate estimates of the total. Result: quicker, more trustworthy insights back to product teams.
  • 69. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  • 70. Q&A LUKE HEINRICH | MIKE DIAS