- 1. Practical Considerations for Interactive AI: Robustness, Privacy, Fairness, Transparency Tom Diethe tdiethe@amazon.com Interactive AI CDT Winter School January 29 2020
- 2. Outline
  1. Interactive AI at Amazon
  2. Robustness & Transparency via Continual Learning
     - Bayesian Continual Learning
     - Continual Learning in Practice
  3. Algorithmic Privacy
     - Differential Privacy
     - Privacy for Text
     - Experiments on Text Data
     - Optimizing the Privacy-Utility Trade-off
     - DPareto experiments
  4. Algorithmic Fairness
  5. Summary

  Tom Diethe (Amazon), Practical Considerations for Interactive AI, January 29 2020
- 4. Interactive AI at Amazon
- 6. Alexa AI
  - What is Alexa?
    - A cloud-based voice service that can help you with tasks, entertainment, general information, shopping, and more
    - The more you talk to Alexa, the more Alexa adapts to your speech patterns, vocabulary, and personal preferences
  - How do we ensure that ...
    - we create robust and efficient AI systems?
    - the privacy of customer data is safeguarded?
    - customers are treated fairly by ML algorithms?
- 7. Failure Modes
  - Unintentional failures: the ML system produces a formally correct but completely unsafe outcome
    - Outliers/anomalies
    - Dataset shift
    - Limited memory
  - Intentional failures: the failure is caused by an active adversary attempting to subvert the system to attain her goals, such as to:
    - misclassify the result
    - infer private training data
    - steal the underlying algorithm
- 9. $F_X(x_{t_1}, \ldots, x_{t_n}) = F_X(x_{t_1+\tau}, \ldots, x_{t_n+\tau})$ for all $\tau, t_1, \ldots, t_n$ and for all $n \in \mathbb{N}$
- 10. SageMaker
- 18. Robustness & Transparency via Continual Learning
  - Data arrive continually
  - (Possibly) non-IID
  - Tasks may change over time (e.g. trends/fashions in shopping)
  - New tasks may emerge (e.g. new product categories, new marketplaces)
  - Robustness: how can we adapt to new data whilst retaining existing knowledge?
  - Transparency: how can we have systems that can signal when they are going wrong?
  - Standard approaches:
    - Train individual models on each task, then train a combination
    - Maintain a single model and use regularization to fix influential parameters
- 20. Bayesian Continual Learning [Nguyen 2018]
  - Given, e.g., the data in task $t$ as $D_t = \{(x_t^{(n_t)}, y_t^{(n_t)})\}_{n_t=1}^{N_t}$ and parameters $\theta$ (e.g. BLR, BNN, GP, ...):

    $$p(\theta \mid D_{1:T}) \propto p(\theta)\, p(D_{1:T} \mid \theta) = p(\theta) \prod_{t=1}^{T} \prod_{n_t=1}^{N_t} p\big(y_t^{(n_t)} \mid \theta, x_t^{(n_t)}\big) \propto p(\theta \mid D_{1:T-1})\, p(D_T \mid \theta)$$

  - Natural recursive algorithm!
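The recursion on this slide can be checked on a model whose posterior is available in closed form. The sketch below is only an illustration and is not the method of [Nguyen 2018]: it assumes a scalar Gaussian mean with known noise variance and a Gaussian prior. The function name `update_gaussian_mean`, the prior values, and the synthetic task data are all hypothetical.

```python
import random


def update_gaussian_mean(prior_mean, prior_var, data, noise_var):
    """Conjugate posterior for the mean of a Gaussian with known noise variance."""
    post_precision = 1.0 / prior_var + len(data) / noise_var
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + sum(data) / noise_var)
    return post_mean, post_var


rng = random.Random(0)
noise_var = 0.5 ** 2
# Three synthetic "tasks", each a batch of noisy observations of the same mean.
tasks = [[rng.gauss(1.5, 0.5) for _ in range(20)] for _ in range(3)]

# Recursive route: the posterior after task t becomes the prior for task t + 1.
mean, var = 0.0, 10.0
for task_data in tasks:
    mean, var = update_gaussian_mean(mean, var, task_data, noise_var)

# Batch route: condition on all tasks at once, starting from the original prior.
all_data = [y for task_data in tasks for y in task_data]
batch_mean, batch_var = update_gaussian_mean(0.0, 10.0, all_data, noise_var)

print(mean, batch_mean)
print(var, batch_var)
```

In this conjugate case the two routes agree up to floating-point error, so no earlier data need to be stored. For models such as BNNs the posterior has to be approximated, and that approximation is where the difficulty lies.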
- 21. Generative models in continual learning. Task i consists of items of class i and generated samples from the previous task; the goal is to generate samples from all previously seen classes
- 22. Why is this Useful?
  - [Figure: Fashion-MNIST examples generated by a Wasserstein GAN in Bayesian continual learning]
  - Generative models play an important role in mitigating the forgetting of earlier tasks, as they can be used to generate samples of previous tasks [Wu 2018], a method known as generative replay
  - For deep learning models this is a form of transparency: a window onto what the model has learnt
- 23. Engineering a Continual Learning System
  - Automating data retention policies:
    - Sketcher/Compressor: when the data rate is too high
    - Joiner: when labels arrive late
    - Shared infrastructure: optimal use of space, like an OS cache
  - Automating monitoring and quality control:
    - Data monitoring: dataset shift detection, anomaly detection
    - Prediction monitoring: monitor the performance of models
  - Automating the ML life-cycle:
    - Trainer and HPO: store provenance, warm-start training
    - Model policy engine: ensure re-training is performed at the right cadence
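One simple way to bound storage when the data rate is too high is reservoir sampling, which keeps a uniform random sample of fixed size from a stream of unknown length. The sketch below is a generic illustration (Algorithm R); whether the sketcher component described in the talk works this way is an assumption, and the function name `reservoir_sample` is hypothetical.

```python
import random


def reservoir_sample(stream, capacity, rng):
    """Keep a uniform random sample of at most `capacity` items from a stream."""
    reservoir = []
    for index, item in enumerate(stream):
        if index < capacity:
            # Fill the reservoir with the first `capacity` items.
            reservoir.append(item)
        else:
            # Keep the new item with probability capacity / (index + 1),
            # overwriting a uniformly chosen stored item.
            slot = rng.randint(0, index)
            if slot < capacity:
                reservoir[slot] = item
    return reservoir


rng = random.Random(0)
sample = reservoir_sample(range(10_000), capacity=100, rng=rng)
print(len(sample), min(sample), max(sample))
```

Memory use is constant in the stream length, which is the property a retention policy needs. A production system would also have to handle late labels and distribution shift, neither of which this sketch covers.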
- 24. “Zero-Touch” Machine Learning [Architecture diagram. Components and labels shown: Model Policy Engine, Streams, Model Stream, Trainer, HPO, Data Statistics, Data Monitoring (Anomaly Detection, Distribution Shift Measurement), Retrain, Rollback, Prediction Statistics, Prediction Monitoring (Accuracy, Shift), Predictor, Business Metrics, Business Logic, Costs, Desired accuracy, Joiner, System State DB, Diagnostic Logs, Sketcher/Sampler, Predictions, Shared Infrastructure, Model DB, Training Data Reservoir, Validation Data Reservoir]
- 25. Summary: Continual Learning
  - Bayesian methods are a natural fit for continual learning
  - However, it is tricky to make them work well with deep learning methods
  - An engineering viewpoint is also required
- 27. A first attempt: Can’t I just anonymize my data?
  - k-anonymity: the information for each person cannot be distinguished from that of at least k − 1 other individuals whose information also appears in the release
  - Suppose a company is audited for salary discrimination
  - The auditor can see salaries by gender, age and nationality for each department and office
  - If the auditor has a friend, an ex, or a date working for the company, she will learn the salary of that person:

    | Office | Dept. | Salary | D.O.B. | Nationality | Gender |
    |---|---|---|---|---|---|
    | London | IT | £##### | May 1985 | Portuguese | Female |

  - Reducing data granularity reduces the risk, but also reduces accuracy (fidelity in this case):

    | Office | Dept. | Salary | D.O.B. | Nationality | Gender |
    |---|---|---|---|---|---|
    | UK | IT | £##### | 1980-1985 | - | Female |

  - This still presents a risk of re-identification! If there are 10 females born between 1980 and 1985 in the whole of the UK’s IT department, 9 of them could conspire to learn the salary of the 10th one
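The effect of reducing granularity can be made concrete by computing k directly: group the records by their quasi-identifiers and take the size of the smallest group. The sketch below uses four invented records. The column names, the `generalise` policy (office to country, birth year to a six-year band), and the helper names are assumptions made for illustration.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[col] for col in quasi_identifiers) for r in records)
    return min(groups.values())


def generalise(record):
    """Coarsen a record: office -> country, birth year -> six-year band (hypothetical policy)."""
    band_start = record["birth_year"] - (record["birth_year"] - 1980) % 6
    return {
        "office": "UK",  # assumes every office in the toy data is in the UK
        "dept": record["dept"],
        "birth_band": f"{band_start}-{band_start + 5}",
        "gender": record["gender"],
    }


# Invented records; salaries are omitted because they are the sensitive attribute.
records = [
    {"office": "London", "dept": "IT", "birth_year": 1985, "gender": "Female"},
    {"office": "Bristol", "dept": "IT", "birth_year": 1982, "gender": "Female"},
    {"office": "London", "dept": "IT", "birth_year": 1981, "gender": "Female"},
    {"office": "Leeds", "dept": "IT", "birth_year": 1984, "gender": "Female"},
]

k_raw = k_anonymity(records, ["office", "dept", "birth_year", "gender"])
coarse = [generalise(r) for r in records]
k_coarse = k_anonymity(coarse, ["office", "dept", "birth_band", "gender"])
print(k_raw, k_coarse)
```

Generalisation raises k from 1 to 4 on this toy data, but as the slide notes, a larger k does not rule out collusion among the other members of a group.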
- 31. Anonymized Data Isn’t
  - Example 1: In the mid-1990s, the Massachusetts “Group Insurance Commission” (GIC) released “anonymized” data on state employees that showed every hospital visit
  - The goal was to help researchers. All obvious identifiers such as name, address, and social security number were removed
  - MIT PhD student Latanya Sweeney decided to attempt to reverse the anonymization and requested a copy of the data
  - Re-identification:
    - William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers
    - Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city with a population of 54,000 and 7 ZIP codes
    - For $20, she purchased the complete voter rolls from the city of Cambridge, containing the name, address, ZIP code, birth date, and gender of every voter
    - Crossing this with the GIC records, Sweeney found Governor Weld with ease: only 6 people shared his birth date, only 3 of them were men, and of them, only he lived in his ZIP code
    - Sweeney sent the Governor’s health records (including diagnoses and prescriptions) to his office
- 33. Anonymized Data Isn’t
  - Example 2: In 2006, Netflix released data pertaining to how 500,000 of its users rated movies over a six-year period
  - Netflix “anonymized” the data before releasing it by removing usernames, but assigned unique identification numbers to users in order to allow for continuous tracking of user ratings and trends
  - Re-identification:
    - Researchers used this information to uniquely identify individual Netflix users by crossing the data with the public IMDb database
    - According to the study, if a person has information about when and how a user rated six movies, that person can identify 99% of people in the Netflix database
- 35. Differential Privacy
  - A randomised mechanism $M : X \to Y$ is $\varepsilon$-differentially private if for all neighbouring inputs $x \simeq x'$ (i.e. $\|x - x'\|_1 = 1$) and for all sets of outputs $E \subseteq Y$ we have

    $$P[M(x) \in E] \le e^{\varepsilon}\, P[M(x') \in E]$$

  - [Figure: output densities of $M(D)$ and $M(D')$; their ratio is bounded by $e^{\varepsilon}$]
  - Mechanisms:
    - Randomised response $\longrightarrow$ plausible deniability
    - Laplace mechanism: e.g. $\tilde{\mu} = \mu + \xi$, $\xi \sim \mathrm{Lap}\big(\tfrac{1}{n\varepsilon}\big)$
    - Output perturbation
    - ...
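The Laplace mechanism for a mean can be sketched in a few lines. The version below rests on assumptions not stated on the slide: values are clipped to a known range `[lower, upper]`, the dataset size n is public, and neighbouring datasets differ in the value of one record, so the sensitivity of the mean is `(upper - lower) / n`. The names `laplace_noise` and `private_mean` are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, epsilon, rng, lower=0.0, upper=1.0):
    """Release the mean of `values` with Laplace noise calibrated to epsilon."""
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sensitivity = (upper - lower) / n  # largest change from altering one record
    true_mean = sum(clipped) / n
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
values = [rng.random() for _ in range(1000)]
true_mean = sum(values) / len(values)
noisy_mean = private_mean(values, epsilon=1.0, rng=rng)
print(true_mean, noisy_mean)
```

With values in [0, 1] the noise scale is 1/(nε), matching the example on the slide, so the error shrinks as the dataset grows or as ε is relaxed.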
- 37. Randomized Response [Warner ’65]
  - Say you want to release a bit $x \in \{\text{Yes}, \text{No}\}$. Do the following:
    1. Flip a coin
    2. If tails, respond truthfully with $x$
    3. If heads, flip a second coin and respond “Yes” if heads; respond “No” if tails
  - Claim: the above algorithm satisfies $(\log 3)$-differential privacy

    $$\frac{\Pr[\text{Response} = \text{Yes} \mid x = \text{Yes}]}{\Pr[\text{Response} = \text{Yes} \mid x = \text{No}]} = \frac{1/2 \times 1 + 1/2 \times 1/2}{1/2 \times 0 + 1/2 \times 1/2} = \frac{3/4}{1/4} = 3 \implies e^{\varepsilon} = 3$$

  - The same bound holds for $\dfrac{\Pr[\text{Response} = \text{No} \mid x = \text{No}]}{\Pr[\text{Response} = \text{No} \mid x = \text{Yes}]}$
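The claim can be checked numerically, and the same coin-flip scheme still allows population statistics to be recovered. The sketch below computes the exact response probabilities with `fractions.Fraction`, then simulates a survey. The true "Yes" rate of 0.3, the sample size, and the function names are assumptions made for illustration. Because Pr[Yes] = 1/4 + p/2 for a true rate p, an unbiased estimate is 2 × (observed rate − 1/4).

```python
import math
import random
from fractions import Fraction


def randomized_response(truth, rng):
    """Return a noisy copy of the boolean `truth` using two fair coin flips."""
    if rng.random() < 0.5:      # first coin is tails: answer truthfully
        return truth
    return rng.random() < 0.5   # first coin is heads: answer with a second coin


def response_probability(response, truth):
    """Exact probability of observing `response` when the true bit is `truth`."""
    truthful_part = Fraction(1, 2) if response == truth else Fraction(0)
    return truthful_part + Fraction(1, 4)


# Worst-case likelihood ratio over both responses and both true values.
worst_ratio = max(
    response_probability(r, x) / response_probability(r, not x)
    for r in (True, False)
    for x in (True, False)
)
epsilon = math.log(float(worst_ratio))

# Simulated survey: each respondent's true answer is "Yes" with probability 0.3.
rng = random.Random(0)
n = 100_000
responses = [randomized_response(rng.random() < 0.3, rng) for _ in range(n)]
observed = sum(responses) / n
estimate = 2.0 * (observed - 0.25)
print(worst_ratio, epsilon, estimate)
```

Each individual answer is deniable, yet the aggregate estimate is close to the true rate once n is large.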
- 38. Important Properties
  - Robustness to post-processing: if $M$ is $(\varepsilon, \delta)$-DP, then $f(M)$ is $(\varepsilon, \delta)$-DP
  - Composition: if $M_1, \ldots, M_n$ are $(\varepsilon_i, \delta_i)$-DP, then $g(M_1, \ldots, M_n)$ is $\big(\sum_{i=1}^{n} \varepsilon_i, \sum_{i=1}^{n} \delta_i\big)$-DP
  - Protects against arbitrary side knowledge
- 44. User-AI system interaction via natural language
  - User’s goal: meet some specific need with respect to an issued query $x$
  - Agent’s goal: satisfy the user’s request
  - Privacy violation: occurs when $x$ is used to make a personal inference, e.g. when unrestricted PII is present
  - Mechanism: modify the query to protect privacy whilst preserving semantics
  - Our approach: Metric Differential Privacy
- 45. Desired Functionality

  | Intent | Query $x$ | Modified query $x'$ |
  |---|---|---|
  | GetWeather | Will it be colder in Cleveland | Will it be colder in Ohio |
  | PlayMusic | Play Cantopop on lastfm | Play C-pop on lastfm |
  | BookRestaurant | Book a restaurant in Milladore | Book a restaurant in Wood County |
  | SearchCreativeWork | I want to watch Manthan film | I want to watch Hindi film |
- 46. Word Embeddings
  - Mapping from words into vectors of real numbers (many ways to do this!)
  - e.g. neural-network-based models (Word2Vec, GloVe, fastText)
  - Defines a mapping $\phi : W \to \mathbb{R}^n$
  - Nearest neighbours are often synonyms
- 49. Metric Differential Privacy
  - Recall the definition of DP:

    $$P[M(x) \in E] \le e^{\varepsilon}\, P[M(x') \in E] \quad \text{for } x, x' \in X \text{ s.t. } \|x - x'\|_1 = 1$$

  - This can be rewritten into a single equation as:

    $$\frac{P[M(x) \in E]}{P[M(x') \in E]} \le e^{\varepsilon \|x - x'\|_1}$$

  - Metric differential privacy generalises this to use any valid metric $d(x, x')$:

    $$\frac{P[M(x) \in E]}{P[M(x') \in E]} \le e^{\varepsilon\, d(x, x')}$$

  - (It is easy to see that standard DP is metric DP with $d(x, x') = \|x - x'\|_1$)
- 52. Privacy in the Space of Word Embeddings [Feyisetan 2019, Feyisetan 2020]
  - Given:
    - $w \in W$: word to be “privatised” from the word space $W$ (dictionary)
    - $\phi : W \to Z$: embedding function from the word space to the embedding space $Z$ (e.g. $\mathbb{R}^n$)
    - $v = \phi(w)$: corresponding word vector
    - $d : Z \times Z \to \mathbb{R}$: distance function in the embedding space
    - $\Omega(\varepsilon)$: the DP noise sampling distribution (e.g. $\Omega_i(\varepsilon) = \mathrm{Lap}\big(\tfrac{1}{n\varepsilon}\big)$, $i = 1, \ldots, n$ for $\mathbb{R}^n$)
  - Metric DP mechanism for word embeddings:
    1. Perturb the word vector: $v' = v + \xi$ where $\xi \sim \Omega(\varepsilon)$
    2. The new vector $v'$ will not be a word (a.s.)
    3. Project back to $W$: $w' = \arg\min_{u \in W} d(v', \phi(u))$, and return $w'$
  - What do we need?
    - $d$ satisfies the axioms of a metric (non-negativity, identity of indiscernibles, symmetry, triangle inequality)
    - A way to sample using $\Omega$ in the metric space that respects $d$ and gives us $\varepsilon$-metric DP
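The three steps of the mechanism can be sketched on a toy vocabulary. Everything concrete below is an assumption made for illustration: the 2-D vectors in `EMBEDDINGS` are invented rather than taken from a trained model, $d$ is taken to be the Euclidean distance, and the noise has density proportional to $\exp(-\varepsilon \|z\|_2)$, sampled as a uniform direction with a Gamma(n, 1/ε) magnitude. The function names are hypothetical.

```python
import math
import random

# Hypothetical 2-D embeddings, invented for illustration only.
EMBEDDINGS = {
    "cleveland": (0.90, 0.10),
    "toledo": (0.85, 0.05),
    "ohio": (0.80, 0.20),
    "jazz": (-0.70, 0.60),
    "cantopop": (-0.80, 0.70),
}


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def sample_noise(dim, epsilon, rng):
    """Draw noise with density proportional to exp(-epsilon * ||z||_2)."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in direction))
    magnitude = rng.gammavariate(dim, 1.0 / epsilon)  # shape dim, scale 1/epsilon
    return [magnitude * c / norm for c in direction]


def privatise_word(word, epsilon, rng, embeddings=EMBEDDINGS):
    """Perturb the word vector, then project back to the nearest vocabulary word."""
    vector = embeddings[word]
    noise = sample_noise(len(vector), epsilon, rng)
    noisy = [v + z for v, z in zip(vector, noise)]
    return min(embeddings, key=lambda u: euclidean(noisy, embeddings[u]))


rng = random.Random(0)
print([privatise_word("cleveland", epsilon=20.0, rng=rng) for _ in range(5)])
```

Smaller values of ε add more noise, so the returned word drifts further from the original in embedding space. Very large values almost always return the input word unchanged.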
- 53. [Figure: utility versus privacy]
- 54. Example: Differentially Private SGD
  - Algorithm 1: Differentially Private SGD
    - Input: dataset $z = (z_1, \ldots, z_n)$
    - Hyperparameters: learning rate $\eta$, mini-batch size $m$, number of epochs $T$, noise variance $\sigma^2$, clipping norm $L$
    - Initialize $w \leftarrow 0$
    - for $t \in [T]$ do
      - for $k \in [n/m]$ do
        - Sample $S \subset [n]$ with $|S| = m$ uniformly at random
        - Let $g \leftarrow \frac{1}{m} \sum_{j \in S} \mathrm{clip}_L\big(\nabla \ell(z_j, w)\big) + \frac{2L}{m}\, \mathcal{N}(0, \sigma^2 I)$
        - Update $w \leftarrow w - \eta g$
    - return $w$
  - 5+ hyper-parameters affecting both privacy and utility
  - For deep learning applications we only have empirical utility (not analytic)
  - How do we find the hyperparameters that give us an optimal trade-off?
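Algorithm 1 translates almost line by line into code. The sketch below is a pure-Python illustration that assumes a logistic-regression loss and a tiny synthetic dataset, neither of which comes from the talk. It performs the clipped, noised update only and does not compute the privacy loss ε implied by the chosen σ, m, and T, which needs a separate accountant. All function names are hypothetical.

```python
import math
import random


def clip(vector, max_norm):
    """Rescale `vector` so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(v * v for v in vector))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vector]


def logistic_gradient(example, w):
    """Gradient of the logistic loss for one example (x, y) with y in {0, 1}."""
    x, y = example
    z = sum(wi * xi for wi, xi in zip(w, x))
    # Numerically stable sigmoid.
    p = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
    return [(p - y) * xi for xi in x]


def dp_sgd(data, lr, batch_size, epochs, sigma, clip_norm, rng):
    """Noisy, clipped mini-batch SGD following the structure of Algorithm 1."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for _ in range(len(data) // batch_size):
            batch = rng.sample(data, batch_size)  # S with |S| = m, uniformly at random
            summed = [0.0] * dim
            for example in batch:
                clipped = clip(logistic_gradient(example, w), clip_norm)
                summed = [s + c for s, c in zip(summed, clipped)]
            noise_scale = 2.0 * clip_norm / batch_size  # the 2L/m factor
            g = [s / batch_size + noise_scale * rng.gauss(0.0, sigma) for s in summed]
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


# Synthetic task: label is 1 when the first feature is positive; second feature is a bias.
rng = random.Random(0)
inputs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
data = [((x, 1.0), 1 if x > 0 else 0) for x in inputs]
w = dp_sgd(data, lr=0.5, batch_size=20, epochs=20, sigma=0.5, clip_norm=1.0, rng=rng)
accuracy = sum((w[0] * x[0] + w[1] * x[1] > 0) == (y == 1) for x, y in data) / len(data)
print(w, accuracy)
```

Every argument of `dp_sgd` changes either the noise injected or the number of noisy steps taken, which is why the hyperparameters affect privacy and utility at the same time.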
- 63. The Privacy-Utility Pareto Front [Figure. Labels: Pareto-optimal points; hyper-parameter space; axes: privacy loss, error]
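Given a set of evaluated hyper-parameter settings, each summarised by a (privacy loss, error) pair, the Pareto-optimal points are those for which no other point is at least as good in both coordinates and strictly better in one. The sketch below extracts them for two objectives that are both minimised. The sample points are invented and the function name `pareto_front` is hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated (privacy_loss, error) pairs, minimising both."""
    front = []
    best_error = float("inf")
    # After sorting by privacy loss, a point is non-dominated exactly when its
    # error is strictly lower than every error seen so far.
    for privacy_loss, error in sorted(points):
        if error < best_error:
            front.append((privacy_loss, error))
            best_error = error
    return front


# Invented evaluations of (privacy loss, classification error).
evaluations = [
    (0.1, 0.90),
    (0.5, 0.40),
    (1.0, 0.45),
    (2.0, 0.20),
    (3.0, 0.25),
    (10.0, 0.18),
]
front = pareto_front(evaluations)
print(front)
```

The points (1.0, 0.45) and (3.0, 0.25) are dropped because another setting achieves lower error at lower privacy loss.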
- 73. Bayesian Optimization Gradient-free optimization for black-box functions Widely used in applications (HPO in ML, scheduling & planning, experimental design ...) In multi-objective problems, BO aims to learn the Pareto front with a minimal number of evaluations. Tom Diethe (Amazon) Practical Considerations for Interactive AI January 29 2020 31 / 44
- 74. DPareto
  - Repeat until the budget is exhausted:
    1. For each objective (privacy, utility):
       1. Fit a surrogate model (a Gaussian process, GP) using the available dataset
       2. Calculate the predictive distribution using the GP mean and variance functions
    2. Use the posterior of the surrogate models to form an acquisition function
    3. Collect the next point at the estimated global maximum of the acquisition function
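The loop above can be sketched in a few dozen lines of plain Python. This is a toy under loudly stated assumptions, not the DPareto implementation: `evaluate` is a hypothetical black box standing in for training a private model, the GP uses a fixed RBF kernel with hand-picked hyperparameters, the acquisition function is a simple optimistic hypervolume gain chosen for brevity (not necessarily the one DPareto uses), and the acquisition is maximised over a small candidate grid instead of globally.

```python
import math


def rbf(a, b, length=0.5):
    """Squared-exponential kernel with a fixed (assumed) length scale."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def gp_predict(xs, ys, x_new, noise=1e-4):
    """Step 1: GP posterior mean and standard deviation at x_new."""
    mean_y = sum(ys) / len(ys)  # constant prior mean set to the data mean
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, [y - mean_y for y in ys])
    v = solve(K, k_star)
    mean = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mean, math.sqrt(var)


def hypervolume(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimised)."""
    area, best_y = 0.0, ref[1]
    for x, y in sorted(points):
        if x < ref[0] and y < best_y:
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area


def evaluate(sigma):
    """Hypothetical black box: more noise lowers epsilon but raises error."""
    return 1.0 / sigma, 0.1 + 0.1 * sigma


candidates = [0.5 + 0.25 * i for i in range(15)]  # noise scales in [0.5, 4.0]
xs = [0.5, 4.0]                                    # initial design
obs = [evaluate(x) for x in xs]                    # (epsilon, error) pairs
ref = (2.5, 0.6)                                   # worse than any outcome
budget = 5

for _ in range(budget):
    base = hypervolume(obs, ref)
    best_x, best_gain = None, -1.0
    for x in candidates:
        if x in xs:
            continue
        # Step 2: stand-in acquisition, the hypervolume gained if the
        # optimistic (mean minus one std) prediction were observed.
        optimistic = []
        for k in range(2):
            mu, sd = gp_predict(xs, [o[k] for o in obs], x)
            optimistic.append(mu - sd)
        gain = hypervolume(obs + [tuple(optimistic)], ref) - base
        if gain > best_gain:
            best_x, best_gain = x, gain
    # Step 3: evaluate the black box at the best candidate.
    xs.append(best_x)
    obs.append(evaluate(best_x))

print(sorted(xs), hypervolume(obs, ref))
```

The hypervolume of the observed points is the same quantity tracked when comparing BO against random sampling: a larger value means the evaluated points cover more of the privacy-utility trade-off.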
- 75. DPareto vs Random Sampling
  - [Figure, three panels. (1) Hypervolume Evolution: Pareto-front (PF) hypervolume against number of sampled points, for MLP1 and MLP2 under random sampling (RS) and Bayesian optimization (BO). (2) MLP2 Pareto Fronts: classification error against ε for the initial points, +256 RS, and +256 BO. (3) LogReg+SGD: classification error against ε for 1500 RS samples and 256 BO samples.]
- 76. Summary: Privacy Enhancing Technologies
  - Privacy risks can be counter-intuitive and tricky to formalize
  - High-dimensional data and side knowledge make privacy hard
  - Semantic guarantees (e.g. DP) behave better than syntactic ones (e.g. k-anonymization)
  - Differential privacy is a mature privacy enhancing technology
  - Metric DP provides local plausible deniability; accuracy can be good even in cases with an infinite number of outcomes
  - Empirical privacy-utility trade-off evaluation enables application-specific decisions
  - Bayesian optimization provides a computationally efficient method to recover the Pareto front (especially with a large number of hyper-parameters)
- 78. The Need for Algorithmic Fairness
  - Risks:
    1. ML predictors might discriminate against groups of individuals protected by law or by ethics
    2. Choosing a model that minimizes the expected loss may be good for the majority population but overlook minority populations
  - Examples: image classification [Buolamwini & Gebru, 2018] and natural language tasks [Bolukbasi et al., 2016]
  - Causes:
    1. Training data may contain biases
    2. The analysis of the training data may inadvertently introduce biases
  - Unlike privacy, there is no single agreed-upon definition!
- 79. Statistical Bias
  - Definition: the difference between an estimator's expected value and the true value
  - Is statistical bias an adequate fairness criterion?
    - "The model summarises the data correctly; if the data is biased it's not the algorithm's fault"
    - Says nothing about the distribution of errors (variance of estimator)
  - Biases are inevitable! Take ownership ...
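The definition can be written out, together with the standard decomposition of mean squared error that shows why bias alone is silent about the spread of errors. The symbols are notation introduced here rather than taken from the slide: $\hat\theta$ is the estimator and $\theta$ the true value.

```latex
\operatorname{Bias}(\hat\theta) = \mathbb{E}[\hat\theta] - \theta,
\qquad
\mathbb{E}\big[(\hat\theta - \theta)^2\big]
  = \operatorname{Bias}(\hat\theta)^2 + \operatorname{Var}(\hat\theta).
```

An estimator with zero bias can therefore still make large errors through the variance term, and nothing in the definition says which individuals bear those errors.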
- 84. Calibration
  - Calibrated classifier [Dawid, 1982]: "a forecaster is well calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent"
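Dawid's criterion can be checked empirically by grouping events by their forecast value and comparing each value with the observed frequency. The sketch below assumes a small hypothetical record of forecasts and binary outcomes; the function name `observed_frequencies` is illustrative. A finite sample only approximates the long-run proportion in the quote.

```python
from collections import defaultdict


def observed_frequencies(forecasts, outcomes):
    """For each distinct forecast value, return the fraction of events that occurred."""
    counts = defaultdict(lambda: [0, 0])  # forecast -> [occurred, total]
    for forecast, occurred in zip(forecasts, outcomes):
        counts[forecast][0] += occurred
        counts[forecast][1] += 1
    return {v: hits / total for v, (hits, total) in counts.items()}


# Hypothetical record: ten events forecast at 30%, five at 80%.
forecasts = [0.3] * 10 + [0.8] * 5
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0] + [1, 1, 0, 1, 1]
table = observed_frequencies(forecasts, outcomes)
for value, frequency in sorted(table.items()):
    print(f"forecast {value:.1f}: observed {frequency:.1f}")
```

In this record the forecaster is well calibrated: three of the ten events forecast at 30% occurred, and four of the five forecast at 80%.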
- 85. Calibration
  - α-Accuracy: if we do not want a predictor $f$ to downplay a subpopulation $S \subseteq X$, we require it to be (approximately) unbiased over $S$ for some small $\alpha \in [0, 1]$: $\left| \mathbb{E}_{i \sim S}(f_i - p^*_i) \right| \le \alpha$
  - α-Calibration: for any $v \in [0, 1]$, let $S_v = \{ i \in S : f_i = v \}$; then $\left| \mathbb{E}_{i \sim S_v}(f_i - p^*_i) \right| = \left| v - \mathbb{E}_{i \sim S_v}(p^*_i) \right| \le \alpha$, i.e. we are calibrated for all but a small number (α) of items.
  - Weakness: the guarantees are too coarse. E.g. assign every member of $S$ the value $\mathbb{E}_{i \sim S}(p^*_i)$. This is perfectly calibrated, but "qualified" members of $S$ with large $p^*_i$ will be hurt.
  - Typically this is applied over large disjoint sets, e.g. race or gender.
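The weakness is easy to reproduce numerically. The sketch below implements the two quantities as stated on the slide, assuming the true probabilities $p^*_i$ are known (they never are in practice) and using a hypothetical four-member population; the function names are illustrative.

```python
from collections import defaultdict


def accuracy_gap(f, p_star, S):
    """|E_{i~S}(f_i - p*_i)| over the index set S."""
    return abs(sum(f[i] - p_star[i] for i in S) / len(S))


def calibration_gap(f, p_star, S):
    """Largest |v - E_{i~S_v}(p*_i)| over the level sets S_v = {i in S : f_i = v}."""
    level_sets = defaultdict(list)
    for i in S:
        level_sets[f[i]].append(i)
    return max(abs(v - sum(p_star[i] for i in idx) / len(idx))
               for v, idx in level_sets.items())


# Hypothetical population: three members with low p*, one "qualified" member.
p_star = [0.25, 0.25, 0.25, 0.75]
S = [0, 1, 2, 3]

# Assign everyone the average of p* over S (0.375).
constant = [sum(p_star) / len(p_star)] * len(p_star)

print(accuracy_gap(constant, p_star, S))     # 0.0
print(calibration_gap(constant, p_star, S))  # 0.0
print(abs(constant[3] - p_star[3]))          # 0.375
```

The constant predictor passes both checks with a gap of zero, yet the qualified member's score is off by 0.375.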
- 86. Multicalibration [Hébert-Johnson et al., 2018]
  - Stronger notion: ensure calibration on every subpopulation (including the "qualified" members from before). But ... this requires perfect predictions!
  - Need an intermediate definition that balances protecting subgroups against the information bottleneck of small samples
  - Multicalibration definition: "A predictor f is multicalibrated w.r.t. a family of subpopulations C if it is calibrated w.r.t. every S ∈ C", where C are computationally identifiable subsets
  - Let $C \subseteq 2^X$ be a collection of subsets of $X$ and $\alpha \in [0, 1]$. A predictor $f$ is $(C, \alpha)$-multicalibrated if, for all $S \in C$, $f$ is α-calibrated w.r.t. $S$.
  - Think of C as a collection of subpopulations where set membership can be determined efficiently, e.g. through boolean operations or by small decision trees
  - C can be quite rich, with many overlapping subgroups of a protected group S
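A check of the $(C, \alpha)$ definition can be sketched by applying the α-calibration test to every set in the family. As before, this assumes access to the true probabilities $p^*_i$ and uses a hypothetical population and family C; it only audits a given predictor and is not the learning algorithm from the cited paper.

```python
from collections import defaultdict


def calibration_gap(f, p_star, S):
    """Largest |v - E_{i~S_v}(p*_i)| over the level sets S_v = {i in S : f_i = v}."""
    level_sets = defaultdict(list)
    for i in S:
        level_sets[f[i]].append(i)
    return max(abs(v - sum(p_star[i] for i in idx) / len(idx))
               for v, idx in level_sets.items())


def multicalibration_violations(f, p_star, C, alpha):
    """Return the sets S in C on which f is not alpha-calibrated."""
    return [S for S in C if calibration_gap(f, p_star, S) > alpha]


p_star = [0.25, 0.25, 0.25, 0.75]
C = [[0, 1, 2, 3], [0, 1], [2, 3]]  # hypothetical family of non-empty subsets
alpha = 0.05

constant = [0.375] * 4            # calibrated on the whole population only
coarse = [0.25, 0.25, 0.5, 0.5]   # not perfect, but calibrated on every S in C

print(multicalibration_violations(constant, p_star, C, alpha))  # [[0, 1], [2, 3]]
print(multicalibration_violations(coarse, p_star, C, alpha))    # []
```

The `coarse` predictor still misjudges members 2 and 3 individually, which illustrates the intermediate position: multicalibration protects every subgroup named in C without demanding perfect predictions.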
- 87. Summary: Algorithmic Fairness
  - Multicalibration is one particular notion of algorithmic fairness
  - Attractive since it can be run post hoc
  - But ... currently limited to small datasets
  - How does this interact with privacy?
- 89. Summary
  - www.mbmlbook.com
  - Interactive AI requires more than just smart algorithms!
  - It also requires us to think about robustness and ethical implications
  - Future work (potential CDT projects!):
    - Multicalibration using random forests
    - Optimize the fairness–utility, privacy–utility, and privacy–fairness–utility trade-offs
    - Build privacy and fairness directly into continual learning systems
    - Leverage crowdsourcing and active learning to test privacy and fairness hypotheses
- 90. Questions? tdiethe@amazon.com