
Building Machine Learning Models with Strict Privacy Boundaries


  1. Building Machine Learning Models with Strict Privacy Boundaries. Renaud Bourassa, rbourassa@slack-corp.com, March 29, 2019
  2. Agenda: 1. Data at Slack and how it applies to Machine Learning. 2. Building a privacy-preserving search ranking model.
  3. What is Slack?
  4. What is Slack? At its core, Slack is a communication platform.
  5. Data at Slack ● Two interesting characteristics that differentiate Slack from other communication platforms: 1. Within an organization, data is public by default. 2. Across organizations, data is strictly private by default.
  6. Public by Default ● In many traditional communication platforms, including email, data within an organization is private by default. (Diagram: a "Hello!" message goes from Sender directly to Recipient.)
  7. Public by Default ● Data in Slack is (mostly) public by default and available to all users within the organization. (Diagram: Sender posts "Hello!" to a mostly public #channel.)
  8. Public by Default ● Data in Slack is (mostly) public by default and available to all users within the organization. (Diagram: the "Hello!" message in the mostly public #channel reaches the Recipient.)
  9. Public by Default ● What does this mean in the context of Machine Learning? Lots of public data at the organization level. ○ Gives us a huge source of data to build Machine Learning models. ○ Makes Machine Learning a valuable tool to help users sift through the data.
  10. Data at Slack ● Two interesting characteristics that differentiate Slack from other communication platforms: 1. Within an organization, data is public by default. 2. Across organizations, data is strictly private by default.
  11. Strict Privacy Boundaries ● Data in Slack should not leak across organizations. (Diagram: Organization A with channels #cats and #dogs; Organization B with channels #pizza and #burgers.)
  12. Strict Privacy Boundaries ● Models in Slack should not leak data across organizations. (Diagram: a topic model trained on both organizations' channels surfaces the topics "Layoffs" and "Company B" from the message "Company B is planning layoffs". Bad!)
  13. Strict Privacy Boundaries ● What does this mean in the context of Machine Learning? Models should respect the privacy boundaries between organizations. ○ Models should not leak data explicitly. ○ Models should not leak data implicitly.
  14. Search Problem Given a query, return the most relevant documents (e.g. messages, files).
  15. Learn to Rank ● Given a query q, Solr retrieves candidate documents D = {d1, d2, …, dn}; the model computes a score f(q, di) for each candidate. Sort documents by scores in a way that maximizes utility.
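     A minimal sketch of the scoring-and-sorting step described on this slide, assuming only a generic score function f(q, d); the retrieval layer and the model itself are placeholders, not Slack's implementation.

     # Rank the candidates returned by the retrieval layer by model score f(q, d).
     from typing import Callable, Dict, List

     def rank(query: Dict, candidates: List[Dict],
              f: Callable[[Dict, Dict], float]) -> List[Dict]:
         # Sort documents by descending score so the most relevant come first.
         return sorted(candidates, key=lambda d: f(query, d), reverse=True)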
  16. Learn to Rank ● How do we train this model? The data warehouse (DW) query logs and click logs yield examples (q1, {d1,1, d1,2, …, d1,n}), (q2, {d2,1, d2,2, …, d2,m}), …, which feed model training and produce the model f(q, d).
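     One common way to turn such logs into supervision is pairwise examples (clicked versus skipped documents for the same query). The sketch below assumes that setup and an illustrative log schema; it is not Slack's actual pipeline.

     # Hypothetical schema: each query-log entry lists the results shown;
     # the click log records which (query_id, doc_id) pairs were clicked.
     def pairwise_examples(query_log, click_log):
         clicked_pairs = {(c["query_id"], c["doc_id"]) for c in click_log}
         for entry in query_log:
             qid, query, results = entry["query_id"], entry["query"], entry["results"]
             clicked = [d for d in results if (qid, d["doc_id"]) in clicked_pairs]
             skipped = [d for d in results if (qid, d["doc_id"]) not in clicked_pairs]
             for pos in clicked:
                 for neg in skipped:
                     yield query, pos, neg  # train f so that f(query, pos) > f(query, neg)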
  17. Learn to Rank ● How do we train this model in a privacy-preserving way? (Diagram: the data warehouse query logs and click logs span channels from many organizations, e.g. #cats, #dogs, #pizza, #burgers.)
  18. Individual Models ● Why not build one model per organization? ○ Sparsity: high-dimensional inputs with low coverage within a single organization. ○ Complexity: over 500,000 organizations, ranging from a few users to Fortune 500 companies.
  19. Global Model ● How can we train a global privacy-preserving model? ○ Attribute Parameterization: a feature transformation technique that factors out private information and reduces sparsity. See "Learning from User Interactions in Personal Search via Attribute Parameterization" (Bendersky et al., 2017).
  20. Attribute Parameterization ● Running example: query "MLConf" with attributes user_id:U123, terms:["MLConf"]; document with attributes user_id:U456, channel_id:C789, terms:["Hey",…].
  21. Attribute Parameterization ● One-hot encoding: the raw query attributes (user_id:U123, terms:["MLConf"]) and document attributes (user_id:U456, channel_id:C789, terms:["Hey",…]) are fed directly into the model f(q,d).
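     For concreteness, a minimal sketch of what one-hot encoding a raw attribute means here; the vocabulary would have to cover every user id, channel id, and term across all organizations, which is both enormous (sparsity) and organization-specific (privacy). The helper is purely illustrative.

     # One dimension per known attribute value; the vocabulary spans all organizations.
     def one_hot(value, vocabulary):
         vec = [0] * len(vocabulary)
         vec[vocabulary.index(value)] = 1
         return vec

     # e.g. one_hot("U123", all_user_ids) or one_hot("MLConf", all_terms)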
  22. Attribute Parameterization ● A parameterization g(d_terms) replaces the raw terms attribute before it reaches the model f(q,d). Examples: num_terms(d_terms), num_emojis(d_terms).
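     A minimal sketch of the two example parameterizations named on this slide; the Slack-style :emoji: pattern is an assumption for illustration, not taken from the talk.

     import re

     EMOJI = re.compile(r":[a-z0-9_+\-]+:")  # assumed Slack-style emoji codes

     def num_terms(doc_terms):
         return len(doc_terms)

     def num_emojis(doc_terms):
         return sum(len(EMOJI.findall(term)) for term in doc_terms)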
  23. Attribute Parameterization ● A parameterization ctr(d_channel_id) replaces the raw channel id before it reaches the model f(q,d). Definition: ctr(d_x) = clicks(d_x) / impressions(d_x).
  24. Attribute Parameterization ● Cross parameterizations combine a query attribute with a document attribute, e.g. ctr(q_user_id, d_channel_id), before it reaches the model f(q,d). Definition: ctr(q_x, d_y) = clicks(q_x AND d_y) / impressions(q_x AND d_y). Examples: ctr(q_user_id, d_user_id), ctr(q_user_id, d_reactor_id), ctr(q_team_id, d_term).
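     A minimal sketch of computing these click-through-rate features from logged impressions and clicks; the event schema and the key functions are illustrative assumptions, not Slack's.

     from collections import Counter

     def ctr_table(events, key):
         """events: logged impressions with a 'clicked' flag; key(event) -> hashable."""
         clicks, impressions = Counter(), Counter()
         for e in events:
             k = key(e)
             impressions[k] += 1
             clicks[k] += e["clicked"]
         return {k: clicks[k] / impressions[k] for k in impressions}

     # ctr(d_channel_id):            key = lambda e: e["doc"]["channel_id"]
     # ctr(q_user_id, d_channel_id): key = lambda e: (e["query"]["user_id"], e["doc"]["channel_id"])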
  25. Attribute Parameterization ● ctr(q_terms, d_terms) could leak private data between organizations!
  26. Attribute Parameterization ● ctr(q_user_id, q_terms, d_terms) is safe: keying the statistic on the querying user confines each count to that user's own history, so nothing is aggregated across organizations.
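     Continuing the ctr_table sketch above, the difference between the unsafe and the safe feature is just the key: the unsafe key pools clicks from every organization that happens to share a term, while the safe key scopes counts to a single user and therefore a single organization. The field names are illustrative.

     # Unsafe: pools (query term, document term) statistics across organizations.
     unsafe_key = lambda e: (e["query"]["term"], e["doc"]["term"])

     # Safe: adding the querying user's id keeps each count inside one user's
     # (and hence one organization's) own search history.
     safe_key = lambda e: (e["query"]["user_id"], e["query"]["term"], e["doc"]["term"])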
  27. Attribute Parameterization ● Precompute and index CTR features in a feature store: the data warehouse query logs and click logs are parameterized offline into the feature store, and at query time the model f(q,d) scores the Solr candidates D using those stored features.
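     A rough sketch of that offline/online split, assuming a simple key-value feature store, a hypothetical feature name, and a placeholder model; none of these names are Slack's actual components.

     class FeatureStore:
         """Stand-in key-value store for precomputed features."""
         def __init__(self):
             self._table = {}
         def put(self, name, key, value):
             self._table[(name, key)] = value
         def get(self, name, key, default=0.0):
             return self._table.get((name, key), default)

     def score_candidates(query, candidates, store, model):
         ranked = []
         for d in candidates:
             features = {
                 "ctr_user_channel": store.get("ctr_user_channel",
                                               (query["user_id"], d["channel_id"])),
                 "num_terms": len(d["terms"]),
             }
             ranked.append((model(features), d))  # model sees only parameterized features
         ranked.sort(key=lambda pair: pair[0], reverse=True)
         return [d for _, d in ranked]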
  28. Learn to Rank ● How do we train this model in a privacy-preserving way? By learning from carefully crafted functions of the high-dimensional attributes of the query and documents, we are able to factor out the private data and reduce the sparsity of our training set before it reaches the model.
  29. Thank You! We’re hiring! https://goo.gl/FqzD6U
