Building Machine Learning Models
with Strict Privacy Boundaries
Renaud Bourassa
rbourassa@slack-corp.com
March 29, 2019
Agenda
1. Data at Slack and how it applies
to Machine Learning.
2. Building a privacy-preserving
search ranking model.
What is Slack?
At its core, Slack is a communication platform.
Data at Slack
● Two interesting characteristics that differentiate Slack from other
communication platforms:
1. Within an organization, data is public by default.
2. Across organizations, data is strictly private by default.
● In many traditional communication platforms, including email,
data within an organization is private by default.
Public by Default
[Diagram: a “Hello!” message sent directly from a sender to a recipient.]
Public by Default
● Data in Slack is (mostly) public by default and available to all users
within the organization.
[Diagram: a sender posts “Hello!” to #channel, where any recipient in the
organization can read it.]
Public by Default
● What does this mean in the context of Machine Learning?
Lots of public data at the organization level.
○ Gives us a huge source of data to build Machine Learning
models.
○ Makes Machine Learning a valuable tool to help users sift
through the data.
Data at Slack
● Two interesting characteristics that differentiate Slack from other
communication platforms:
1. Within an organization, data is public by default.
2. Across organizations, data is strictly private by default.
Strict Privacy Boundaries
● Data in Slack should not leak across organizations.
[Diagram: Organization A (#pizza, #burgers) and Organization B (#cats,
#dogs), each behind its own privacy boundary.]
Strict Privacy Boundaries
● Models in Slack should not leak data across organizations.
[Diagram: a topic model trained across organizations on #cats, #dogs,
#pizza, and #burgers. One organization's message “Company B is planning
layoffs” teaches the model to associate “Company B” with “Layoffs”,
leaking that information to other organizations. Bad!]
Strict Privacy Boundaries
● What does this mean in the context of Machine Learning?
Models should respect the privacy boundaries between
organizations.
○ Models should not leak data explicitly (e.g. by surfacing another
organization's text).
○ Models should not leak data implicitly (e.g. through statistics or
parameters learned from another organization's data).
Search
Problem
Given a query, return the most
relevant documents (e.g.
messages, files).
Learn to Rank
[Diagram: a query q is sent to Solr, which returns a candidate set
D = {d1, d2, …, dn}. The model computes a score f(q, di) for each
candidate, and the documents are returned sorted by score.]
● Sort documents by scores in a way that maximizes utility.
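For concreteness, a minimal Python sketch of this scoring flow; the
names and the toy scoring function are illustrative stand-ins, not
Slack's actual API:

# Minimal sketch of the scoring flow above. `score` stands in for the
# trained model f(q, d); all names here are illustrative.
def rank(query, candidates, score):
    """Return Solr candidates sorted by relevance score f(q, d), best first."""
    return sorted(candidates, key=lambda doc: score(query, doc), reverse=True)

# Toy usage with a term-overlap stand-in for the model:
docs = ["Q3 roadmap draft", "lunch menu", "roadmap review notes"]
overlap = lambda q, d: sum(term in d for term in q.split())
print(rank("Q3 roadmap", docs, overlap))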
Learn to Rank
● How do we train this model?
[Diagram: query logs and click logs in the data warehouse (DW) are
joined into training examples (q1, {d1,1, d1,2, …, d1,n}),
(q2, {d2,1, d2,2, …, d2,m}), …, which flow through model training to
produce the model f(q, d).]
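A sketch of how such training examples might be assembled from the
logs, assuming a hypothetical log schema (not Slack's actual
warehouse layout):

from collections import defaultdict

# Hypothetical log rows: query_logs yields (query_id, query, doc_id)
# for each result shown; click_logs yields (query_id, doc_id) for each
# click. The schema is an illustrative assumption.
def build_examples(query_logs, click_logs):
    clicked = defaultdict(set)
    for query_id, doc_id in click_logs:
        clicked[query_id].add(doc_id)

    queries, results = {}, defaultdict(list)
    for query_id, query, doc_id in query_logs:
        queries[query_id] = query
        label = 1 if doc_id in clicked[query_id] else 0  # clicked = relevant
        results[query_id].append((doc_id, label))

    # One training example per query: (q, [(d, label), ...])
    return [(queries[qid], docs) for qid, docs in results.items()]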
Learn to Rank
● How do we train this model in a privacy-preserving way?
[Diagram: the same pipeline, but the query and click logs in the data
warehouse mix data from every organization (#cats, #dogs, #pizza,
#burgers).]
Individual Models
● Why not build one model per organization?
○ Sparsity
High-dimensional inputs with low coverage within a single
organization.
○ Complexity
Over 500,000 organizations ranging from a few users to
Fortune 500 companies.
Global Model
● How can we train a global privacy-preserving model?
○ Attribute Parameterization
Feature transformation technique that factors out private
information and reduces sparsity.
Learning from User Interactions in Personal Search via Attribute
Parameterization (Bendersky et al. 2017)
Attribute Parameterization
[Diagram: the query “MLConf” and a candidate document, each decomposed
into attributes.]
● Query attributes: user_id:U123, terms:[“MLConf”]
● Document attributes: user_id:U456, channel_id:C789, terms:[“Hey”,…]
● Naive approach: one-hot encode each attribute.
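For intuition, a toy sketch of why one-hot encoding these attributes
is problematic; the vocabulary here is illustrative:

# Toy illustration of the naive encoding: every distinct user, channel,
# and term becomes its own dimension. In production the vocabulary
# spans millions of values, so vectors are enormous and almost entirely
# zero, and a single organization covers only a sliver of the dimensions.
vocab = ["U123", "U456", "C789", "MLConf", "Hey"]

def one_hot(value):
    return [1 if v == value else 0 for v in vocab]

print(one_hot("C789"))  # [0, 0, 1, 0, 0]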
Attribute Parameterization
[Diagram: the query and document attributes are the inputs to the
model f(q, d).]
Attribute Parameterization
[Diagram: instead of the raw document terms, a parameterization
g(d_terms) is computed and fed into f(q, d).]
Examples:
● num_terms(d_terms)
● num_emojis(d_terms)
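A minimal sketch of document-side parameterizations like the two
above; the emoji pattern is an assumed stand-in for how emoji appear
among a message's terms:

import re

# Matches emoji codes such as :tada:; an illustrative assumption.
EMOJI = re.compile(r":[a-z0-9_+-]+:")

def num_terms(terms):
    # Length of the document: a scalar feature carrying no private text.
    return len(terms)

def num_emojis(terms):
    # Count of emoji terms: likewise content-free.
    return sum(1 for t in terms if EMOJI.fullmatch(t))

doc_terms = ["Hey", ":wave:", "team"]
print(num_terms(doc_terms), num_emojis(doc_terms))  # 3 1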
Attribute Parameterization
[Diagram: the document's channel is parameterized as a click-through
rate, ctr(d_channel_id), before reaching f(q, d).]
Definitions:
● ctr(d_x) = clicks(d_x) / impressions(d_x)
● ctr(q_x, d_y) = clicks(q_x AND d_y) / impressions(q_x AND d_y)
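A sketch of how these CTR statistics could be aggregated from the
logs, following the definitions above; the input format is an
assumption for illustration:

from collections import Counter

# Aggregates ctr(d_x) = clicks(d_x) / impressions(d_x). `events`
# yields (attribute_value, clicked) pairs, e.g. ("C789", True) when a
# result from channel C789 was clicked.
def ctr_table(events):
    clicks, impressions = Counter(), Counter()
    for value, clicked in events:
        impressions[value] += 1
        clicks[value] += clicked  # bool counts as 0 or 1
    return {v: clicks[v] / impressions[v] for v in impressions}

print(ctr_table([("C789", True), ("C789", False), ("C123", False)]))
# {'C789': 0.5, 'C123': 0.0}

The joint statistic ctr(q_x, d_y) is the same aggregation with the
(q_x, d_y) pair as the key.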
Attribute Parameterization
[Diagram: a joint query-document statistic, ctr(q_user_id, d_channel_id),
is fed into f(q, d).]
Examples:
● ctr(q_user_id, d_user_id)
● ctr(q_user_id, d_reactor_id)
● ctr(q_team_id, d_term)
Attribute Parameterization
[Diagram: the joint statistic ctr(q_terms, d_terms) feeds f(q, d).]
● Could leak private data between organizations! The statistic
aggregates clicks on (query term, document term) pairs across all
organizations, so one organization's text can influence rankings in
another.
Attribute Parameterization
[Diagram: adding the querying user to the key, ctr(q_user_id, q_terms,
d_terms), feeds f(q, d).]
● Safe! Keyed by user, the statistic is computed only from that user's
own history, so it never mixes data across organizations.
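A sketch of the safe keying, under the same assumed event format;
including the querying user in the key is what keeps each count
inside one organization:

from collections import Counter

# Same aggregation, keyed by (user_id, query_term, doc_term). Because
# the querying user is part of the key, each statistic is computed
# only from that user's own history and can never mix text from
# another organization. Assumed event format:
# (user_id, query_term, doc_term, clicked).
def safe_term_ctr(events):
    clicks, impressions = Counter(), Counter()
    for user_id, q_term, d_term, clicked in events:
        key = (user_id, q_term, d_term)
        impressions[key] += 1
        clicks[key] += clicked
    return {k: clicks[k] / impressions[k] for k in impressions}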
Attribute Parameterization
● Precompute and index CTR features in a feature store.
[Diagram: offline, query and click logs from the data warehouse (DW)
flow through attribute parameterization into the feature store. Online,
a query q goes to Solr, the precomputed features are joined in, and the
model f(q, d) returns the candidates di, dj, dk, … sorted by score.]
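A sketch of the precompute-then-lookup pattern, with a plain dict
standing in for the feature store and an assumed event format:

from collections import Counter

# Offline job: aggregate ctr(d_channel_id) from the logs and write it
# to the feature store; a dict stands in for the real indexed store.
def precompute_channel_ctr(events):
    clicks, impressions = Counter(), Counter()
    for channel_id, clicked in events:
        impressions[channel_id] += 1
        clicks[channel_id] += clicked
    return {c: clicks[c] / impressions[c] for c in impressions}

feature_store = precompute_channel_ctr([("C789", True), ("C789", False)])

# Online: at ranking time the feature is a cheap lookup, never a
# recomputation over raw logs; unseen channels fall back to a neutral
# default (an illustrative choice).
def channel_ctr_feature(channel_id):
    return feature_store.get(channel_id, 0.0)

print(channel_ctr_feature("C789"))  # 0.5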
Learn to Rank
● How do we train this model in a privacy-preserving way?
By learning from carefully crafted functions of the high-dimensional
attributes of the query and documents, we can factor out the private
data and reduce the sparsity of our training set before it reaches
the model.
Thank You!
We’re hiring!
https://goo.gl/FqzD6U
