
Explainable AI in Industry (KDD 2019 Tutorial)



[Video recording available at https://www.youtube.com/playlist?list=PLewjn-vrZ7d3x0M4Uu_57oaJPRXkiS221]

Artificial Intelligence is increasingly playing an integral role in determining our day-to-day experiences. Moreover, with the proliferation of AI-based solutions in areas such as hiring, lending, criminal justice, healthcare, and education, the personal and professional implications of AI are far-reaching. The dominant role played by AI models in these domains has led to growing concern about potential bias in these models, and to a demand for model transparency and interpretability. In addition, model explainability is a prerequisite for building trust in and adoption of AI systems in high-stakes domains requiring reliability and safety, such as healthcare and automated transportation, and in critical industrial applications with significant economic implications, such as predictive maintenance, exploration of natural resources, and climate change modeling.

As a consequence, AI researchers and practitioners have focused their attention on explainable AI to help them better trust and understand models at scale. The challenges for the research community include (i) defining model explainability, (ii) formulating explainability tasks for understanding model behavior and developing solutions for these tasks, and finally (iii) designing measures for evaluating the performance of models in explainability tasks.

In this tutorial, we present an overview of model interpretability and explainability in AI, key regulations and laws, and techniques and tools for providing explainability as part of AI/ML systems. We then focus on the application of explainability techniques in industry, presenting practical challenges and guidelines for effectively using explainability techniques, along with lessons learned from deploying explainable models for several web-scale machine learning and data mining applications. We present case studies across different companies, spanning application domains such as search and recommendation systems, hiring, sales, and lending. Finally, based on our experiences in industry, we identify open problems and research directions for the data mining and machine learning community.



  1. 1. Explainable AI in Industry KDD 2019 Tutorial Sahin Cem Geyik, Krishnaram Kenthapadi & Varun Mithal Krishna Gade & Ankur Taly https://sites.google.com/view/kdd19-explainable-ai-tutorial 1
  2. 2. Agenda ● Motivation ● AI Explainability: Foundations and Techniques ○ Explainability concepts, problem formulations, and evaluation methods ○ Post hoc Explainability ○ Intrinsically Explainable models ● AI Explainability: Industrial Practice ○ Case Studies from LinkedIn, Fiddler Labs, and Google Research ● Demo ● Key Takeaways and Challenges 2
  3. 3. Motivation 3
  4. 4. Third Wave of AI. Factors driving the rapid advancement of AI. Symbolic AI: logic rules represent knowledge; no learning capability and poor handling of uncertainty. Statistical AI: statistical models for specific domains, trained on big data; no contextual capability and minimal explainability. Explainable AI: systems construct explanatory models, and learn and reason about new tasks and situations. Enablers: GPUs and on-chip neural networks, data availability, cloud infrastructure, new algorithms
  5. 5. Need for Explainable AI ● AI today is machine-learning centric. ● ML models are opaque, non-intuitive, and difficult to understand. Marketing Security Logistics Finance Current AI Systems User ● Why did you do that? ● Why not something else? ● When do you succeed or fail? ● How do I correct an error? ● When do I trust you? Explainable AI and ML is essential for future customers to understand, trust, and effectively manage the emerging generation of AI applications
  6. 6. Black-box AI creates business risk for Industry
  7. 7. Internal Audit, Regulators IT & Operations Data Scientists Business Owner Can I trust our AI decisions? Are these AI system decisions fair? Customer Support How do I answer this customer complaint? How do I monitor and debug this model? Is this the best model that can be built? Black-box AI Why am I getting this decision? How can I get a better decision? Poor Decision Black-box AI creates confusion and doubt
  8. 8. What is Explainable AI? Data Black-Box AI AI product Confusion with Today’s AI Black Box ● Why did you do that? ● Why did you not do that? ● When do you succeed or fail? ● How do I correct an error? Black Box AI Decision, Recommendation Clear & Transparent Predictions ● I understand why ● I understand why not ● I know why you succeed or fail ● I understand, so I trust you Explainable AI Data Explainable AI Explainable AI Product Decision Explanation Feedback
  9. 9. Why Explainability: Verify the ML Model / System 9 Credit: Samek, Binder, Tutorial on Interpretable ML, MICCAI’18
  10. 10. 10 Why Explainability: Improve ML Model Credit: Samek, Binder, Tutorial on Interpretable ML, MICCAI’18
  11. 11. 11 Why Explainability: Learn New Insights Credit: Samek, Binder, Tutorial on Interpretable ML, MICCAI’18
  12. 12. 12 Why Explainability: Learn Insights in the Sciences Credit: Samek, Binder, Tutorial on Interpretable ML, MICCAI’18
  13. 13. Why Explainability: Debug (Mis-)Predictions 13 Top label: “clog” Why did the network label this image as “clog”?
  14. 14. Immigration Reform and Control Act Citizenship Rehabilitation Act of 1973; Americans with Disabilities Act of 1990 Disability status Civil Rights Act of 1964 Race Age Discrimination in Employment Act of 1967 Age Equal Pay Act of 1963; Civil Rights Act of 1964 Sex And more... Why Explainability: Laws against Discrimination 14
  15. 15. Fairness Privacy Transparency Explainability 15
  16. 16. GDPR Concerns Around Lack of Explainability in AI “Companies should commit to ensuring systems that could fall under GDPR, including AI, will be compliant. The threat of sizeable fines of €20 million or 4% of global turnover provides a sharp incentive. Article 22 of GDPR empowers individuals with the right to demand an explanation of how an AI system made a decision that affects them.” - Vice President, European Commission
  17. 17. Fairness Privacy Transparency Explainability 17
  18. 18. Fairness Privacy Transparency Explainability 18
  19. 19. 19 SR 11-7 and OCC regulations for Financial Institutions
  20. 20. Model Diagnostics Root Cause Analytics Performance monitoring Fairness monitoring Model Comparison Cohort Analysis Explainable Decisions API Support Model Launch Signoff Model Release Mgmt Model Evaluation Compliance Testing Model Debugging Model Visualization Explainable AI Train QA Predict Deploy A/B Test Monitor Debug Feedback Loop “Explainability by Design” for AI products
  21. 21. Example: Facebook adds Explainable AI to build Trust
  22. 22. Foundations and Techniques 22
  23. 23. Achieving Explainable AI Approach 1: Post-hoc explain a given AI model ● Individual prediction explanations in terms of input features, influential examples, concepts, local decision rules ● Global explanations of the entire model, in terms of partial dependence plots, global feature importance, global decision rules Approach 2: Build an interpretable model ● Logistic regression, Decision trees, Decision lists and sets, Generalized Additive Models (GAMs) 23
  24. 24. Achieving Explainable AI Approach 1: Post-hoc explain a given AI model ● Individual prediction explanations in terms of input features, influential examples, concepts, local decision rules ● Global explanations of the entire model, in terms of partial dependence plots, global feature importance, global decision rules Approach 2: Build an interpretable model ● Logistic regression, Decision trees, Decision lists and sets, Generalized Additive Models (GAMs) 24
  25. 25. Top label: “fireboat” Why did the network label this image as “fireboat”? 25
  26. 26. Top label: “clog” Why did the network label this image as “clog”? 26
  27. 27. Credit Line Increase Fair lending laws [ECOA, FCRA] require credit decisions to be explainable Bank Credit Lending Model Why? Why not? How? ? Request Denied Query AI System Credit Lending Score = 0.3 Credit Lending in a black-box ML world
  28. 28. Attribute a model’s prediction on an input to features of the input Examples: ● Attribute an object recognition network’s prediction to its pixels ● Attribute a text sentiment network’s prediction to individual words ● Attribute a lending model’s prediction to its features A reductive formulation of “why this prediction” but surprisingly useful :-) The Attribution Problem
  29. 29. Application of Attributions ● Debugging model predictions E.g., Attributing an image misclassification to the pixels responsible for it ● Generating an explanation for the end-user E.g., Exposing attributions for a lending prediction to the end-user ● Analyzing model robustness E.g., Crafting adversarial examples using weaknesses surfaced by attributions ● Extracting rules from the model E.g., Combining attributions to craft rules (pharmacophores) capturing the prediction logic of a drug screening network 29
  30. 30. Next few slides We will cover the following attribution methods** ● Ablations ● Gradient based methods ● Score Backpropagation based methods ● Shapley Value based methods **Not a complete list! See Ancona et al. [ICML 2019], Guidotti et al. [arxiv 2018] for a comprehensive survey 30
  31. 31. Ablations Drop each feature and attribute the change in prediction to that feature Useful tool but not a perfect attribution method. Why? ● Unrealistic inputs ● Improper accounting of interactive features ● Computationally expensive 31
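To make the ablation idea concrete, here is a minimal sketch (not code from the tutorial): `ablation_attributions`, the toy scorer, and its inputs are hypothetical, and the model is assumed to map a 1-D feature vector to a scalar score.

```python
import numpy as np

def ablation_attributions(model, x, baseline):
    """Attribute model(x) by replacing one feature at a time with a baseline value."""
    base_score = model(x)                 # score the intact input once
    attrs = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        x_ablated = x.copy()
        x_ablated[i] = baseline[i]        # "drop" feature i
        attrs[i] = base_score - model(x_ablated)  # change in prediction
    return attrs

# Toy scorer: 2*x0 + 3*x1 (a stand-in for a real model).
toy_model = lambda v: 2.0 * v[0] + 3.0 * v[1]
attrs = ablation_attributions(toy_model, np.array([1.0, 1.0]), np.zeros(2))
```

Even this tiny sketch exhibits the caveats on the slide: the ablated input may be unrealistic, each feature is ablated in isolation so interactions are mis-attributed, and the loop costs one model call per feature.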
  32. 32. Feature*Gradient Attribution to a feature is feature value times gradient, i.e., xᵢ ⋅ ∂y/∂xᵢ ● Gradient captures sensitivity of output w.r.t. feature ● Equivalent to Feature*Coefficient for linear models ○ First-order Taylor approximation of non-linear models ● Popularized by SaliencyMaps [NIPS 2013], Baehrens et al. [JMLR 2010] 32 Gradients in the vicinity of the input seem like noise
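A sketch of Feature*Gradient, with a finite-difference gradient standing in for a framework's autodiff; the toy model `f` is hypothetical.

```python
import numpy as np

def grad_x_input(f, x, eps=1e-6):
    """Feature*Gradient attribution: x_i * dF/dx_i, using a central
    finite difference so the sketch needs no autodiff framework."""
    attrs = np.zeros_like(x)
    for i in range(len(x)):
        x_hi, x_lo = x.copy(), x.copy()
        x_hi[i] += eps
        x_lo[i] -= eps
        grad_i = (f(x_hi) - f(x_lo)) / (2 * eps)
        attrs[i] = x[i] * grad_i
    return attrs

f = lambda v: 2.0 * v[0] + v[0] * v[1]   # toy non-linear model
attrs = grad_x_input(f, np.array([1.0, 2.0]))
```

For a purely linear model this reduces exactly to Feature*Coefficient, matching the slide's observation.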
  33. 33. [Figure: gradient saturation. The score curve flattens as pixel intensity is scaled from the baseline (0.0) to the input (1.0): gradients are interesting near the baseline and uninteresting (saturated) near the input. Gradients are therefore computed at scaled inputs along the baseline-to-input path.]
  34. 34. IG(input, base) ::= (input − base) ⋅ ∫₀¹ ∇F(𝛂⋅input + (1−𝛂)⋅base) d𝛂 Original image Integrated Gradients Integrated Gradients [ICML 2017] Integrate the gradients along a straight-line path from baseline to input
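The integral is typically approximated with a Riemann sum over scaled inputs. A minimal sketch, assuming a scalar-valued model and using finite differences in place of framework gradients (illustrative only, not the authors' implementation):

```python
import numpy as np

def integrated_gradients(f, x, baseline, steps=100, eps=1e-6):
    """Midpoint Riemann-sum approximation of
    IG(x, base) = (x - base) * integral_0^1 gradF(base + a*(x - base)) da."""
    def grad(v):
        g = np.zeros_like(v)
        for i in range(len(v)):
            hi, lo = v.copy(), v.copy()
            hi[i] += eps
            lo[i] -= eps
            g[i] = (f(hi) - f(lo)) / (2 * eps)
        return g
    total = np.zeros_like(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps                 # midpoint of each slice
        total += grad(baseline + alpha * (x - baseline))
    return (x - baseline) * total / steps

f = lambda v: v[0] * v[1]                         # toy differentiable model
x, base = np.array([2.0, 3.0]), np.zeros(2)
ig = integrated_gradients(f, x, base)
```

For f(x₀, x₁) = x₀·x₁ with a zero baseline this attributes 3.0 to each feature, and the attributions sum to f(x) − f(base) = 6 (the completeness property discussed on the next slide).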
  35. 35. What is a baseline? ● Ideally, the baseline is an informationless input for the model ○ E.g., Black image for image models ○ E.g., Empty text or zero embedding vector for text models ● Integrated Gradients explains F(input) - F(baseline) in terms of input features Aside: Baselines (or norms) are essential to explanations [Kahneman-Miller 86] ● E.g., A man suffers from indigestion. The doctor blames it on a stomach ulcer; his wife blames it on eating turnips. Both are correct relative to their baselines. ● The baseline may also be an important analysis knob.
  36. 36. Original image “Clog” Why is this image labeled as “clog”?
  37. 37. Original image Integrated Gradients (for label “clog”) “Clog” Why is this image labeled as “clog”?
  38. 38. Detecting an architecture bug ● Deep network [Kearns, 2016] predicts if a molecule binds to certain DNA site ● Finding: Some atoms had identical attributions despite different connectivity
  39. 39. ● Deep network [Kearns, 2016] predicts if a molecule binds to certain DNA site ● Finding: Some atoms had identical attributions despite different connectivity Detecting an architecture bug ● Bug: The architecture had a bug due to which the convolved bond features did not affect the prediction!
  40. 40. ● Deep network predicts various diseases from chest x-rays Original image Integrated gradients (for top label) Detecting a data issue
  41. 41. ● Deep network predicts various diseases from chest x-rays ● Finding: Attributions fell on radiologist’s markings (rather than the pathology) Original image Integrated gradients (for top label) Detecting a data issue
  42. 42. Score Back-Propagation based Methods Re-distribute the prediction score through the neurons in the network ● LRP [JMLR 2017], DeepLift [ICML 2017], Guided BackProp [ICLR 2014] Easy case: Output of a neuron is a linear function of the previous neurons (i.e., nᵢ = Σⱼ wᵢⱼ ⋅ nⱼ), e.g., the logit neuron ● Re-distribute the contribution in proportion to the coefficients wᵢⱼ 42 Image credit heatmapping.org
  43. 43. Score Back-Propagation based Methods Re-distribute the prediction score through the neurons in the network ● LRP [JMLR 2017], DeepLift [ICML 2017], Guided BackProp [ICLR 2014] Tricky case: Output of a neuron is a non-linear function, e.g., ReLU, Sigmoid, etc. ● Guided BackProp: Only consider ReLUs that are on (linear regime), and which contribute positively ● LRP: Use first-order Taylor decomposition to linearize the activation function ● DeepLift: Distribute the activation difference relative to a reference point in proportion to edge weights 43 Image credit heatmapping.org
  44. 44. Score Back-Propagation based Methods Re-distribute the prediction score through the neurons in the network ● LRP [JMLR 2017], DeepLift [ICML 2017], Guided BackProp [ICLR 2014] Pros: ● Conceptually simple ● Methods have been empirically validated to yield sensible results Cons: ● Hard to implement, requires instrumenting the model ● Often breaks implementation invariance Think: F(x, y, z) = x * y * z and G(x, y, z) = x * (y * z) Image credit heatmapping.org
  45. 45. So far we’ve looked at differentiable models. But, what about non-differentiable models? E.g., ● Decision trees ● Boosted trees ● Random forests ● etc.
  46. 46. Classic result in game theory on distributing gain in a coalition game ● Coalition Games ○ Players collaborating to generate some gain (think: revenue) ○ Set function v(S) determining the gain for any subset S of players Shapley Value [Annals of Mathematical studies,1953]
  47. 47. Classic result in game theory on distributing gain in a coalition game ● Coalition Games ○ Players collaborating to generate some gain (think: revenue) ○ Set function v(S) determining the gain for any subset S of players ● Shapley Values are a fair way to attribute the total gain to the players based on their contributions ○ Concept: Marginal contribution of a player to a subset of other players (v(S ∪ {i}) - v(S)) ○ The Shapley value of a player is a specific weighted aggregation of its marginal contributions over all possible subsets of other players Shapley Value for player i = Σ_{S ⊆ N∖{i}} w(S) ⋅ (v(S ∪ {i}) − v(S)) (where w(S) = |S|! (|N| − |S| − 1)! / |N|!) Shapley Value [Annals of Mathematics Studies, 1953]
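Computed directly from the definition, the Shapley value looks like this. A toy sketch (the two-player game `v` is hypothetical); note the enumeration over subsets is exponential in the number of players.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values: phi_i = sum over S subseteq N\\{i} of
    |S|!(n-|S|-1)!/n! * (v(S u {i}) - v(S)).  Exponential in n."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):                       # subset sizes 0 .. n-1
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (v(frozenset(S) | {i}) - v(frozenset(S)))
        phi[i] = total
    return phi

# Toy game: the gain of 10 requires both players, so they split it evenly.
v = lambda S: 10.0 if {"a", "b"} <= set(S) else 0.0
phi = shapley_values(["a", "b"], v)
```

The toy result illustrates the symmetry axiom on the next slide: the two players are interchangeable in `v`, so each receives half the total gain.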
  48. 48. Shapley values are unique under four simple axioms ● Dummy: If a player never contributes to the game then it must receive zero attribution ● Efficiency: Attributions must add to the total gain ● Symmetry: Symmetric players must receive equal attribution ● Linearity: Attribution for the (weighted) sum of two games must be the same as the (weighted) sum of the attributions for each of the games Shapley Value Justification
  49. 49. SHAP [NeurIPS 2017], QII [S&P 2016], Štrumbelj & Kononenko [JMLR 2009] ● Define a coalition game for each model input X ○ Players are the features in the input ○ Gain is the model prediction (output), i.e., gain = F(X) ● Feature attributions are the Shapley values of this game Shapley Values for Explaining ML models
  50. 50. SHAP [NeurIPS 2017], QII [S&P 2016], Štrumbelj & Kononenko [JMLR 2009] ● Define a coalition game for each model input X ○ Players are the features in the input ○ Gain is the model prediction (output), i.e., gain = F(X) ● Feature attributions are the Shapley values of this game Challenge: Shapley Values require the gain to be defined for all subsets of players ● What is the prediction when some players (features) are absent? i.e., what is F(x_1, <absent>, x_3, …, <absent>)? Shapley Values for Explaining ML models
  51. 51. Key Idea: Take the expected prediction when the (absent) feature is sampled from a certain distribution. Different approaches choose different distributions ● [SHAP, NIPS 2017] Use conditional distribution w.r.t. the present features ● [QII, S&P 2016] Use marginal distribution ● [Štrumbelj et al., JMLR 2009] Use uniform distribution ● [Integrated Gradients, ICML 2017] Use a specific baseline point Modeling Feature Absence
  52. 52. Exact Shapley value computation is exponential in the number of features ● Shapley values can be expressed as an expectation of marginals 𝜙(i) = ES ~ D [marginal(S, i)] ● Sampling-based methods can be used to approximate the expectation ● See: “Computational Aspects of Cooperative Game Theory”, Chalkiadakis et al. 2011 ● The method is still computationally infeasible for models with hundreds of features, e.g., image models Computing Shapley Values
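One common sampling approximation averages each player's marginal contribution over random orderings, since the Shapley value is exactly that expectation over uniformly random permutations. A sketch (the game `v` is a hypothetical toy):

```python
import random

def shapley_sample(players, v, n_samples=2000, seed=0):
    """Monte-Carlo Shapley: average each player's marginal contribution
    over randomly shuffled orderings of the players."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(n_samples):
        order = players[:]
        rng.shuffle(order)
        prefix = frozenset()
        for p in order:
            phi[p] += v(prefix | {p}) - v(prefix)   # marginal of p
            prefix = prefix | {p}
    return {p: s / n_samples for p, s in phi.items()}

# Toy game: only "a" contributes, so "b" and "c" are dummies.
v = lambda S: 12.0 if "a" in S else 0.0
phi = shapley_sample(["a", "b", "c"], v)
```

In this toy game every ordering gives "a" a marginal of 12 and the dummies 0, illustrating the dummy axiom; for real models each evaluation of `v` is a (possibly expensive) expected model prediction, which is why image-scale models remain out of reach.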
  53. 53. Evaluating Attribution Methods
  54. 54. Human Review Have humans review attributions and/or compare them to (human provided) groundtruth on “feature importance” Pros: ● Helps assess if attributions are human-intelligible ● Helps increase trust in the attribution method Cons: ● Attributions may appear incorrect because model reasons differently ● Confirmation bias
  55. 55. Perturbations (Samek et al., IEEE TNNLS 2017) Perturb the top-k features by attribution and observe the change in prediction ● The higher the change, the better the method ● Perturbation may amount to replacing the feature with a random value ● Samek et al. formalize this using a metric: area over the perturbation curve ○ Plot the prediction for the input with its top-k features perturbed, as a function of k ○ Take the area over this curve [Figure: prediction for perturbed inputs vs. number of perturbed features (10–60); the drop in prediction when the top 40 features are perturbed illustrates the area over the perturbation curve]
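A sketch of computing one perturbation curve (the `model`, `attributions`, and baseline-replacement scheme are hypothetical; real evaluations replace features with random or baseline values and average over many inputs):

```python
import numpy as np

def perturbation_curve(model, x, attributions, baseline):
    """Perturb features in decreasing order of attribution and record the
    prediction after each step; a good attribution method drops the score fast."""
    order = np.argsort(-attributions)        # most important feature first
    x_pert = x.astype(float)
    scores = [model(x_pert)]
    for i in order:
        x_pert[i] = baseline[i]
        scores.append(model(x_pert))
    return np.array(scores)

model = lambda v: v.sum()                    # toy model: score = sum of features
x = np.array([5.0, 1.0, 3.0])
curve = perturbation_curve(model, x, attributions=x, baseline=np.zeros(3))
```

The area over this curve (relative to the unperturbed score) is the metric described on the slide.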
  56. 56. Axiomatic Justification Inspired by how Shapley Values are justified ● List desirable criteria (axioms) for an attribution method ● Establish a uniqueness result: X is the only method that satisfies these criteria Integrated Gradients, SHAP, QII, and Štrumbelj & Kononenko are justified in this manner Theorem [Integrated Gradients, ICML 2017]: Integrated Gradients is the unique path-integral method satisfying: Sensitivity, Insensitivity, Linearity preservation, Implementation invariance, Completeness, and Symmetry
  57. 57. Some limitations and caveats
  58. 58. Attributions are pretty shallow Attributions do not explain: ● Feature interactions ● What training examples influenced the prediction ● Global properties of the model An instance where attributions are useless: ● A model that predicts TRUE when there is an even number of black pixels and FALSE otherwise
  59. 59. Attributions are for human consumption ● Humans interpret attributions and generate insights ○ Doctor maps attributions for x-rays to pathologies ● Visualization matters as much as the attribution technique
  60. 60. Attributions are for human consumption Naive scaling of attributions from 0 to 255 Attributions have a large range and long tail across pixels After clipping attributions at 99% to reduce range ● Humans interpret attributions and generate insights ○ Doctor maps attributions for x-rays to pathologies ● Visualization matters as much as the attribution technique
  61. 61. Other types of Post-hoc Explanations
  62. 62. Example based Explanations 62 ● Prototypes: representative instances of the training data. ● Criticisms: data instances that are not well represented by the set of prototypes. Figure credit: Examples are not Enough, Learn to Criticize! Criticism for Interpretability. Kim, Khanna and Koyejo. NIPS 2016 Learned prototypes and criticisms from the Imagenet dataset (two dog breeds)
  63. 63. Influence functions ● Trace a model’s prediction through the learning algorithm and back to its training data ● Training points “responsible” for a given prediction 63 Figure credit: Understanding Black-box Predictions via Influence Functions. Koh and Liang ICML 2017
  64. 64. Local Interpretable Model-agnostic Explanations (Ribeiro et al. KDD 2016) 64 Figure credit: Anchors: High-Precision Model-Agnostic Explanations. Ribeiro et al. AAAI 2018 Figure credit: Ribeiro et al. KDD 2016
  65. 65. Anchors 65 Figure credit: Anchors: High-Precision Model-Agnostic Explanations. Ribeiro et al. AAAI 2018
  66. 66. 66 Figure credit: Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) Kim et al. 2018
  67. 67. Global Explanations 67
  68. 68. Global Explanations Methods ● Partial Dependence Plot: Shows the marginal effect one or two features have on the predicted outcome of a machine learning model 68
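A minimal sketch of computing a partial dependence curve for one feature (toy model and data are hypothetical; real usage would plot `grid` against the returned values):

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """Partial dependence of `model` on one feature: for each grid value,
    overwrite that feature across the dataset X and average the predictions."""
    pd_curve = []
    for g in grid:
        X_mod = X.copy()
        X_mod[:, feature] = g                     # force the feature to g
        pd_curve.append(np.mean([model(row) for row in X_mod]))
    return np.array(pd_curve)

model = lambda v: 2.0 * v[0] + v[1]               # toy model
X = np.array([[1.0, 0.0], [2.0, 4.0]])
pd_curve = partial_dependence(model, X, feature=0, grid=[0.0, 1.0])
```

Averaging over the dataset is what makes this a marginal effect: the other features keep their observed values while the feature of interest is swept across the grid.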
  69. 69. Global Explanations Methods ● Permutations: The importance of a feature is the increase in the model's prediction error after permuting the feature's values, which breaks the relationship between the feature and the true outcome. 69
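A sketch of permutation importance, measured here as the increase in mean-squared error after shuffling one feature column (all names are hypothetical; averaging over several shuffles reduces noise):

```python
import numpy as np

def permutation_importance(model, X, y, feature, n_repeats=10, seed=0):
    """Increase in MSE after permuting one feature column, averaged
    over several random permutations."""
    rng = np.random.default_rng(seed)
    def mse(X_):
        preds = np.array([model(row) for row in X_])
        return np.mean((preds - y) ** 2)
    base = mse(X)
    increases = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        perm = rng.permutation(len(X))
        X_perm[:, feature] = X_perm[perm, feature]   # break feature-outcome link
        increases.append(mse(X_perm) - base)
    return float(np.mean(increases))

model = lambda v: v[0]                                # model uses only feature 0
X = np.array([[1.0, 9.0], [2.0, 8.0], [3.0, 7.0]])
y = np.array([1.0, 2.0, 3.0])
imp_used = permutation_importance(model, X, y, feature=0)
imp_unused = permutation_importance(model, X, y, feature=1)
```

A feature the model ignores gets exactly zero importance, since shuffling it cannot change any prediction.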
  70. 70. Achieving Explainable AI Approach 1: Post-hoc explain a given AI model ● Individual prediction explanations in terms of input features, influential examples, concepts, local decision rules ● Global explanations of the entire model, in terms of partial dependence plots, global feature importance, global decision rules Approach 2: Build an interpretable model ● Logistic regression, Decision trees, Decision lists and sets, Generalized Additive Models (GAMs) 70
  71. 71. Decision Trees 71 Is the person fit? Age < 30 ? Eats a lot of pizzas? Exercises in the morning? Unfit UnfitFit Fit Yes No Yes Yes No No
  72. 72. Decision List 72 Figure credit: Interpretable Decision Sets: A Joint Framework for Description and Prediction, Lakkaraju, Bach, Leskovec
  73. 73. Decision Set 73 Figure credit: Interpretable Decision Sets: A Joint Framework for Description and Prediction, Lakkaraju, Bach, Leskovec
  74. 74. GLMs and GAMs 74 Intelligible Models for Classification and Regression. Lou, Caruana and Gehrke KDD 2012 Accurate Intelligible Models with Pairwise Interactions. Lou, Caruana, Gehrke and Hooker. KDD 2013
  75. 75. Case Studies from Industry 75
  76. 76. Case Study: Talent Search Varun Mithal, Girish Kathalagiri, Sahin Cem Geyik 76
  77. 77. LinkedIn Recruiter ● Recruiter Searches for Candidates ○ Standardized and free-text search criteria ● Retrieval and Ranking ○ Filter candidates using the criteria ○ Rank candidates in multiple levels using ML models 77
  78. 78. Modeling Approaches ● Pairwise XGBoost ● GLMix ● DNNs via TensorFlow ● Optimization Criteria: inMail Accepts ○ Positive: inMail sent by recruiter, and positively responded by candidate ■ Mutual interest between the recruiter and the candidate 78
  79. 79. Feature Importance in XGBoost 79
  80. 80. How We Utilize Feature Importances for GBDT ● Understanding feature digressions ○ Which features that were impactful are no longer impactful? ○ Should we debug feature generation? ● Introducing new features in bulk and identifying effective ones ○ Activity features for the last 3, 6, 12, and 24 hours were introduced (costly to compute) ○ Should we keep all such features? ● Separating the factors that caused an improvement ○ Did an improvement come from a new feature, a new labeling strategy, or a new data source? ○ Did the ordering between features change? ● Shortcoming: A global view, not case by case 80
  81. 81. GLMix Models ● Generalized Linear Mixed Models ○ Global: Linear Model ○ Per-contract: Linear Model ○ Per-recruiter: Linear Model ● Lots of parameters overall ○ For a specific recruiter or contract the weights can be summed up ● Inherently explainable ○ Contribution of a feature is “weight x feature value” ○ Can be examined in a case-by-case manner as well 81
  82. 82. TensorFlow Models in Recruiter and Explaining Them ● We utilize the Integrated Gradients [ICML 2017] method ● How do we determine the baseline example? ○ Every query creates its own feature values for the same candidate ○ Query match features, time-based features ○ Recruiter affinity, and candidate affinity features ○ A candidate would be scored differently by each query ○ Cannot recommend a “Software Engineer” to a search for a “Forensic Chemist” ○ There is no globally neutral example for comparison! 82
  83. 83. Query-Specific Baseline Selection ● For each query: ○ Score examples by the TF model ○ Rank examples ○ Choose one example as the baseline ○ Compare others to the baseline example ● How to choose the baseline example ○ Last candidate ○ Kth percentile in ranking ○ A random candidate ○ Request by user (answering a question like: “Why was I presented candidate x above candidate y?”) 83
  84. 84. Example 84
  85. 85. Example - Detailed 85
  Feature       Description      Difference (1 vs 2)   Contribution
  Feature……….   Description……….   -2.0476928           -2.144455602
  Feature……….   Description……….   -2.3223877            1.903594618
  Feature……….   Description……….    0.11666667           0.2114946752
  Feature……….   Description……….   -2.1442587            0.2060414469
  Feature……….   Description……….  -14                    0.1215354111
  Feature……….   Description……….    1                    0.1000282466
  Feature……….   Description……….  -92                   -0.085286277
  Feature……….   Description……….    0.9333333            0.0568533262
  Feature……….   Description……….   -1                   -0.051796317
  Feature……….   Description……….   -1                   -0.050895940
  86. 86. Pros & Cons ● Explains potentially very complex models ● Case-by-case analysis ○ Why do you think candidate x is a better match for my position? ○ Why do you think I am a better fit for this job? ○ Why am I being shown this ad? ○ Great for debugging real-time problems in production ● Global view is missing ○ Aggregate Contributions can be computed ○ Could be costly to compute 86
  87. 87. Lessons Learned and Next Steps ● Global explanations vs. Case-by-case Explanations ○ Global gives an overview, better for making modeling decisions ○ Case-by-case could be more useful for the non-technical user, better for debugging ● Integrated gradients worked well for us ○ Complex models make it harder for developers to map improvement to effort ○ Use-case gave intuitive results, on top of completely describing score differences ● Next steps ○ Global explanations for Deep Models 87
  88. 88. Case Study: Model Interpretation for Predictive Models in B2B Sales Predictions Jilei Yang, Wei Di, Songtao Guo 88
  89. 89. Problem Setting ● Predictive models in B2B sales prediction ○ E.g.: random forest, gradient boosting, deep neural network, … ○ High accuracy, low interpretability ● Global feature importance → Individual feature reasoning 89
  90. 90. Example 90
  91. 91. Revisiting LIME ● Given a target sample x_k, approximate its prediction pred(x_k) by building a sample-specific linear model: pred(X) ≈ β_k1 X_1 + β_k2 X_2 + …, X ∈ neighbor(x_k) ● E.g., for company CompanyX: 0.76 ≈ 1.82 ∗ 0.17 + 1.61 ∗ 0.11 + … 91
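The local linear model can be sketched as a least-squares fit on samples drawn around the target point. This illustrates the idea only, not the LIME package: the Gaussian sampling scheme, the lack of distance weighting, and all names here are simplifying assumptions.

```python
import numpy as np

def local_linear_surrogate(model, x, n_samples=500, scale=0.1, seed=0):
    """Fit a local linear surrogate around x: sample nearby points, query the
    model, and solve least squares.  The coefficients play the role of the
    betas in pred(X) ~ beta_1*X_1 + beta_2*X_2 + ..."""
    rng = np.random.default_rng(seed)
    Xs = x + scale * rng.standard_normal((n_samples, len(x)))  # neighborhood
    ys = np.array([model(row) for row in Xs])
    A = np.hstack([Xs, np.ones((n_samples, 1))])               # intercept column
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coef[:-1], coef[-1]                                 # betas, intercept

model = lambda v: 3.0 * v[0] - 2.0 * v[1]    # toy model (already linear)
betas, intercept = local_linear_surrogate(model, np.array([1.0, 1.0]))
```

When the underlying model is itself linear, the surrogate recovers its coefficients exactly; the interesting case is a non-linear model, where the betas describe only the local neighborhood of x_k, which is precisely the limitation xLIME's piecewise fit addresses.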
  92. 92. xLIME Piecewise Linear Regression Localized Stratified Sampling 92
  93. 93. Piecewise Linear Regression Motivation: Separate top positive feature influencers and top negative feature influencers 93
  94. 94. Impact of Piecewise Approach ● Target sample x_k = (x_k1, x_k2, ⋯) ● Top feature contributor ○ LIME: large magnitude of β_kj ⋅ x_kj ○ xLIME: large magnitude of β_kj⁻ ⋅ x_kj ● Top positive feature influencer ○ LIME: large magnitude of β_kj ○ xLIME: large magnitude of negative β_kj⁻ or positive β_kj⁺ ● Top negative feature influencer ○ LIME: large magnitude of β_kj ○ xLIME: large magnitude of positive β_kj⁻ or negative β_kj⁺ 94
  95. 95. Localized Stratified Sampling: Idea Method: Sampling based on empirical distribution around target value at each feature level 95
  96. 96. Localized Stratified Sampling: Method ● Sampling based on empirical distribution around the target value for each feature ● For target sample x_k = (x_k1, x_k2, ⋯), sample values of feature j according to p_j(X_j) ⋅ N(x_kj, (α ⋅ s_j)²) ○ p_j(X_j): empirical distribution ○ x_kj: feature value in the target sample ○ s_j: standard deviation ○ α: interpretable range; trade-off between interpretable coverage and local accuracy ● In LIME, sampling is according to N(x_j, s_j²). 96
  97. 97. Summary 97
  98. 98. LTS LCP (LinkedIn Career Page) Upsell ● A subset of churn data ○ Total Companies: ~ 19K ○ Company features: 117 ● Problem: Estimate whether there will be upsell given a set of features about the company’s utility from the product 98
  99. 99. Top Feature Contributor 99
  100. 100. 100
  101. 101. Top Feature Influencers 101
  102. 102. Key Takeaways ● Looking at the explanation as contributor vs. influencer features is useful ○ Contributor: Which features end-up in the current outcome case-by-case ○ Influencer: What needs to be done to improve likelihood, case-by-case ● xLIME aims to improve on LIME via: ○ Piecewise linear regression: More accurately describes local point, helps with finding correct influencers ○ Localized stratified sampling: More realistic set of local points ● Better captures the important features 102
  103. 103. Case Study: Relevance Debugging and Explaining @ Daniel Qiu, Yucheng Qian 103
  104. 104. Debugging Relevance Models 104
  105. 105. Architecture 105
  106. 106. What Could Go Wrong? 106
  107. 107. Challenges 107
  108. 108. Solution 108
  109. 109. Call Graph 109
  110. 110. Timing 110
  111. 111. Features 111
  112. 112. Advanced Use Cases 112
  113. 113. Perturbation 113
  114. 114. Comparison 114
  115. 115. Holistic Comparison 115
  116. 116. Granular Comparison 116
  117. 117. Replay 117
  118. 118. Teams ● Search ● Feed ● Comments ● People you may know ● Jobs you may be interested in ● Notification 118
  119. 119. Case Study: Integrated Gradients for Adversarial Analysis of Question-Answering models Ankur Taly** (Fiddler labs) (Joint work with Mukund Sundararajan, Kedar Dhamdhere, Pramod Mudrakarta) 119 **This research was carried out at Google Research
  120. 120. Robustness question: Do these models understand the question? :-) ● Tabular QA: Neural Programmer (2017) model, 33.5% accuracy on WikiTableQuestions. Q: How many medals did India win? A: 197 ● Visual QA: Kazemi and Elqursh (2017) model, 61.1% on VQA 1.0 dataset (state of the art = 66.7%). Q: How symmetrical are the white bricks on either side of the building? A: very ● Reading Comprehension: Yu et al (2018) model, 84.6 F-1 score on SQuAD (state of the art). Passage: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football Operations and General Manager. Q: Name of the quarterback who was 38 in Super Bowl XXXIII? A: John Elway 120
  121. 121. Kazemi and Elqursh (2017) model. Accuracy: 61.1% (state of the art: 66.7%) Visual QA Q: How symmetrical are the white bricks on either side of the building? A: very 121
  122. 122. Visual QA Q: How symmetrical are the white bricks on either side of the building? A: very Q: How asymmetrical are the white bricks on either side of the building? A: very 122 Kazemi and Elqursh (2017) model. Accuracy: 61.1% (state of the art: 66.7%)
  123. 123. Visual QA Q: How symmetrical are the white bricks on either side of the building? A: very Q: How asymmetrical are the white bricks on either side of the building? A: very Q: How big are the white bricks on either side of the building? A: very 123 Kazemi and Elqursh (2017) model. Accuracy: 61.1% (state of the art: 66.7%)
  124. 124. Visual QA Q: How symmetrical are the white bricks on either side of the building? A: very Q: How asymmetrical are the white bricks on either side of the building? A: very Q: How big are the white bricks on either side of the building? A: very Q: How fast are the bricks speaking on either side of the building? A: very 124 Kazemi and Elqursh (2017) model. Accuracy: 61.1% (state of the art: 66.7%)
  125. 125. Visual QA Q: How symmetrical are the white bricks on either side of the building? A: very Q: How asymmetrical are the white bricks on either side of the building? A: very Q: How big are the white bricks on either side of the building? A: very Q: How fast are the bricks speaking on either side of the building? A: very 125 Test/dev accuracy does not show us the entire picture. Need to look inside! Kazemi and Elqursh (2017) model. Accuracy: 61.1% (state of the art: 66.7%)
  126. 126. Analysis procedure ● Attribute the answer (or answer selection logic) to question words ○ Baseline: Empty question, but full context (image, text, paragraph) ■ By design, attribution will not fall on the context ● Visualize attributions per example ● Aggregate attributions across examples
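The attribution step above uses Integrated Gradients. A minimal numpy sketch of the method; for a real QA model, `grad_fn` would return gradients of the answer-selection score w.r.t. question-word embeddings, and `baseline` would be the empty-question encoding (those bindings are assumptions for illustration):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Integrated Gradients (Sundararajan et al., 2017), approximated
    with a midpoint Riemann sum along the straight-line path from
    baseline to x. `grad_fn(z)` returns the model's gradient at z.
    """
    alphas = (np.arange(steps) + 0.5) / steps   # midpoints in (0, 1)
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    # Scale the averaged gradients by the input-baseline difference.
    return (x - baseline) * total / steps
```

By the completeness axiom, the attributions sum to f(x) − f(baseline), which is what makes "attribution will not fall on the context" a by-design property of the empty-question baseline.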
  127. 127. Visual QA attributions Q: How symmetrical are the white bricks on either side of the building? A: very How symmetrical are the white bricks on either side of the building? red: high attribution blue: negative attribution gray: near-zero attribution 127
  128. 128. Over-stability [Jia and Liang, EMNLP 2017] Jia & Liang note that: ● Image networks suffer from “over-sensitivity” to pixel perturbations ● Paragraph QA models suffer from “over-stability” to semantics-altering edits Attributions show how such over-stability manifests in Visual QA, Tabular QA and Paragraph QA networks
  129. 129. During inference, drop all words from the dataset except ones which are frequently top attributions ● E.g. How many red buses are in the picture? 129 Visual QA Over-stability Top tokens: color, many, what, is, how, there, …
  130. 130. During inference, drop all words from the dataset except ones which are frequently top attributions ● E.g. How many red buses are in the picture? 130 Top tokens: color, many, what, is, how, there, … Visual QA Over-stability 80% of final accuracy reached with just 100 words (Orig vocab size: 5305) 50% of final accuracy with just one word “color”
  131. 131. Attack: Subject ablation Low-attribution nouns 'tweet', 'childhood', 'copyrights', 'mornings', 'disorder', 'importance', 'topless', 'critter', 'jumper', 'fits' What is the man doing? → What is the tweet doing? How many children are there? → How many tweet are there? VQA model’s response remains the same 75.6% of the time on questions that it originally answered correctly 131 Replace the subject of a question with a low-attribution noun from the vocabulary ● This ought to change the answer but often does not!
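The subject-ablation attack can be sketched as below; identifying the subject span is assumed to happen upstream (e.g. via a dependency parse), and the noun list comes from low-attribution vocabulary words:

```python
def subject_ablation(question, subject, low_attr_nouns):
    """Replace the question's subject with each low-attribution noun,
    producing variants whose answers ought to change."""
    return [question.replace(subject, noun) for noun in low_attr_nouns]
```

If the model's answer is unchanged on most such rewrites (75.6% here), the subject word evidently carried little weight in its decision.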
  132. 132. Many other attacks! ● Visual QA ○ Prefix concatenation attack (accuracy drop: 61.1% to 19%) ○ Stop word deletion attack (accuracy drop: 61.1% to 52%) ● Tabular QA ○ Prefix concatenation attack (accuracy drop: 33.5% to 11.4%) ○ Stop word deletion attack (accuracy drop: 33.5% to 28.5%) ○ Table row reordering attack (accuracy drop: 33.5% to 23%) ● Paragraph QA ○ Improved paragraph concatenation attacks of Jia and Liang [EMNLP 2017] Paper: Did the model understand the question? [ACL 2018]
  133. 133. Fiddler Demo 133
  134. 134. Fiddler is an explainable AI engine designed for the enterprise ● Pluggable Platform: Integrate, deploy, visualize a wide variety of custom models ● Explainable AI: Deliver clear decisions and explanations to your end users ● Trust & Governance: Easy governed access helps teams build and understand Trusted AI ● Simplified Setup: Lean and pluggable AI platform with cloud or on-prem integrations
  135. 135. 135 All your data Any data warehouse Custom Models Fiddler Modeling Layer Explainable AI for everyone APIs, Dashboards, Reports, Trusted Insights Fiddler - Explainable AI Engine
  136. 136. Benefits Across the Organization ● Customers: More trust and customer retention ● DevOps: DevOps and IT have greater visibility into AI ● Business Owner: Transparent AI workflow and GDPR compliance ● Data Scientists: Data Scientists are more agile and effective
  137. 137. Fiddler enables building Explainable AI Applications like this!
  138. 138. Can explanations help build Trust? ● Can we know when the model is uncertain? ● Does the model make the same mistake as a human? ● Are we comfortable with the model? *Zachary Lipton, et al. The Mythos of Model Interpretability ICML 2016
  139. 139. Can explanations help identify Causality? ● Predictions vs actions ● Explanations on why this happened as opposed to how *Zachary Lipton, et al. The Mythos of Model Interpretability ICML 2016
  140. 140. Can explanations be Transferable? ● Training and test setups often differ from the wild ● Real world data is always changing and noisy *Zachary Lipton, et al. The Mythos of Model Interpretability ICML 2016
  141. 141. Can explanations provide more Information? ● Oftentimes, models aid human decisions ● Extra bits of information beyond the model decision could be valuable *Zachary Lipton, et al. The Mythos of Model Interpretability ICML 2016
  142. 142. Challenges & Tradeoffs [Diagram: tradeoffs among Transparency, Fairness, Performance, and User Privacy] ● Lack of a standard interface for ML models makes pluggable explanations hard ● Explanation needs vary depending on the type of user who needs it and on the problem at hand ● The algorithm you employ for explanations might depend on the use case, model type, data format, etc. ● There are trade-offs w.r.t. explainability, performance, fairness, and privacy 142
  143. 143. Reflections ● Case studies on explainable AI in practice ● Need “Explainability by Design” when building AI products 143
  144. 144. Fairness Privacy Transparency Explainability Related KDD’19 sessions: 1. Tutorial: Fairness-Aware Machine Learning: Practical Challenges and Lessons Learned (Sun) 2. Workshop: Explainable AI/ML (XAI) for Accountability, Fairness, and Transparency (Mon) 3. Social Impact Workshop (Wed, 8:15 – 11:45) 4. Keynote: Cynthia Rudin, Do Simpler Models Exist and How Can We Find Them? (Thu, 8 - 9am) 5. Several papers on fairness (e.g., ADS7 (Thu, 10-12), ADS9 (Thu, 1:30-3:30)) 6. Research Track Session RT17: Interpretability (Thu, 10am - 12pm) 144
  145. 145. Thanks! Questions? ● Feedback most welcome :-) ○ krishna@fiddler.ai, sgeyik@linkedin.com, kkenthapadi@linkedin.com, vamithal@linkedin.com, ankur@fiddler.ai ● Tutorial website: https://sites.google.com/view/kdd19-explainable-ai-tutorial ● To try Fiddler, please send an email to info@fiddler.ai 145

Editor's Notes

  • Krishna to talk about the evolution of AI


  • Models have many interactions within
    a business, multiple stakeholders, and varying requirements. Businesses need a
    360-degree view of all models within a unified modeling universe that allows them to
    understand how models affect one another and all aspects of business functions across
    the organization.

    Worth saying that some of the highlighted concerns are applicable even for decisions not involving humans.
    Example: “Can I trust our AI decisions?” arises in domains with high stakes, e.g., deciding where to excavate/drill for mining/oil exploration;
    understanding the predicted impact of certain actions (e.g., reducing carbon emission by xx%) on global warming.
  • Krishna
  • At the same time, explainability is important from compliance & legal perspective as well.

    There have been several laws in countries such as the United States that prohibit discrimination based on “protected attributes” such as race, gender, age, disability status, and religion. Many of these laws have their origins in the Civil Rights Movement in the United States (e.g., US Civil Rights Act of 1964).

    When legal frameworks prohibit the use of such protected attributes in decision making, there are usually two competing approaches on how this is enforced in practice: Disparate Treatment vs. Disparate Impact. Avoiding disparate treatment requires that such attributes should not be actively used as a criterion for decision making and no group of people should be discriminated against because of their membership in some protected group. Avoiding disparate impact requires that the end result of any decision making should result in equal opportunities for members of all protected groups irrespective of how the decision is made.

    Please see NeurIPS’17 tutorial titled Fairness in machine learning by Solon Barocas and Moritz Hardt for a thorough discussion.
  • Renewed focus in light of privacy / algorithmic bias / discrimination issues observed recently with algorithmic systems.

    Examples: EU General Data Protection Regulation (GDPR), which came into effect in May, 2018, and subsequent regulations such as California’s Consumer Privacy Act, which would take effect in January, 2020.

    The focus is not only around privacy of users (as the name may indicate), but also on related dimensions such as algorithmic bias (or ensuring fairness), transparency, and explainability of decisions.


    Image credit: https://pixabay.com/en/gdpr-symbol-privacy-icon-security-3499380/
  • European Commission Vice-President for the #DigitalSingleMarket.
  • Retain human decision in order to assign responsibility.

    https://gdpr-info.eu/art-22-gdpr/

    Bryce Goodman, Seth Flaxman, European Union regulations on algorithmic decision-making and a “right to explanation”
    https://arxiv.org/pdf/1606.08813.pdf


  • https://gdpr-info.eu/recitals/no-71/

  • Convey that this is the Need
    Purpose: Convey the need for adopting “explainability by design” approach when building AI products, that is, we should take explainability into account while designing & building AI/ML systems, as opposed to viewing these factors as an after-thought.
  • This criteria distinguishes whether interpretability is achieved by restricting the complexity of the machine learning model (intrinsic) or by applying methods that analyze the model after training (post hoc). Intrinsic interpretability refers to machine learning models that are considered interpretable due to their simple structure, such as short decision trees or sparse linear models. Post hoc interpretability refers to the application of interpretation methods after model training.
  • We ask this question for a training image classification network.
  • There are several different formulations of this Why question.
    Explain the function captured by the network in the local vicinity of the network
    Explain the entire function of the network
    Explain each prediction in terms of training examples that influence it
    And then there is attribution which is what we consider.
  • Some important methods not covered
    Grad-CAM [ICCV 2017]
    Grad-CAM++ [arxiv 2017]
    SmoothGrad [arxiv 2017]
  • Point out correlation is not causation (yet again)
  • In such cases, it may be feasible to attribute to a small set of derived features, e.g., super pixels
  • Attributions are useful when the network behavior entails that a strict subset of input features are important
  • So why do we need relevance debugging and explaining? The most obvious reason is that we want to improve the performance of our relevance machine learning models. After days and weeks of data collection, feature engineering, and model training, the model was finally deployed, and we want to know how our machine learning models actually work in production. Do they work well with other pieces of the system? Do they work well on edge cases? How can we improve them? If we only look at the metrics numbers, we won't be able to answer these questions.
    By Improving the relevance machine learning models, we provide value to our members. Our members come to LinkedIn to look for people and jobs they are interested in, and to stay connected to their professional network. We want to provide the most relevant experience to our members.
    In this way, we can build trust with our members. Our members would only trust us when we understand what we are doing. For example if we recommend an entry level job to a CEO, we better understand what's going on why we are making that mistake.
  • Now, let’s take a look at the architecture of our relevance system at LinkedIn. Let’s take search as an example.
    Our members use their mobile apps or browser to find relevant information, the request goes through frontend API layer and will be sent to the federator layer. There are different types of entities the member maybe looking for, for example they could be looking for another member, or for a job. So the federator will send the query to the broker layer of different types of entities.
    For each type of these entities, the data will be partitioned in to different shards, and each shards has a cluster of machines located in different data center. So the request goes to different shards and the search results will be fetched and finally returned to the member.
    This is where machine learning models comes into play. The federator layer will try to understand what the user is looking for to decide which broker to query. It could be one or more brokers depending on the search keyword. So there’s a machine learning model in this layer to figure out the user’s intention. In the broker and shards layer, there are machine learning models to score the search results. There will be a first pass ranking and second pass ranking.
  • So what could go wrong in this diagram? First of all, the services could go down. The request could fail because of a bug or the query took too long and got timed out.
    The data in each shard could be stale, the data doesn’t exists in our search index.
    When it comes to the machine learning part,
    1.The quality of the feature might not be good, sometimes the feature doesn’t really represent what we thought it should be.
    2.The features could be inconsistent between offline training and online serving due to some issues with the feature generation pipeline.
    3.When we ramp a new model, it doesn’t always work well for all the use cases, so that could also go wrong.
  • Debugging these kinds of issues can be hard.
    First of all, the infrastructure is complex. As we have seen in the architecture, there are many services involved in this system, they are deployed in different hosts in different data centers, and most of the time, they are owned by different teams. So for an AI engineer, it could be hard for them to understand all the components in the system.
    The issue might be hard to reproduce locally. The production environment could be very different from dev environment and it could be constantly changing.
    Some of the issues could be time related, and this kind of issue is especially hard to debug. For example, when a member reports that they saw a feed item or received a notification that is not relevant to them, it could be days later when an AI engineer picks up the issue, and the engineer would have no idea what was going on at that time. So basically there is no way to debug the issue.
    Also, debugging the issues locally could be time consuming: you'll need to set up a local environment, deploy multiple services locally, and get permission to access production data to reproduce the issue, and most of the time you won't be able to reproduce it. The whole process could take a long time and it's really not productive.
  • So what we do is to provide the infrastructure to collect the debugging data and a web UI tool to query the system and visualize the data.
    This is the overall architecture of the solution that we provide.
    The Engineers will use the Web UI tool to query the relevance system.
    The request sent out from the tool will identify itself as a debugging call, so when each service receives the request, it will collect the data for debugging for example the request and response of this service, some statistics about the query and most importantly the features and scores of each item.
    And each of these services will send out the collected data using Kafka and the data will be stored in some kind of storage system. After the query has finished, the web UI tool will query the storage system and visualize the call graph and data to the engineer.
    The reason we use Kafka instead of returning the debug data in the response is that the data size could be large, and that would affect the performance of the system. Also, if a service fails, we would lose all the debugging data from the other services.
  • Next, I’ll show you some UI parts of our debugging tool and how we can use them to debug the issues.
    The call graph here shows how the query was sent to different services. The service call graph could be different depending on the search key word and search vertical. So it’s very important for our engineers to understand the call graph, and be able to see the request and response and what happened in each of these services. For example, if a service returned 500, we would like to see the hostname, the version of the service, and also the logs.
    Sometimes the engineers not only want to see what was returned by the frontend API, they also want to debug a particular service in a particular data center or cluster. They could use our tools to specify which service or which host they want to debug and construct the query. In this way, they can tweak the request to see how does it change the results.
  • Our users could also see some statistics of the query, such as how long does each phases take and how efficient is the query. This could help them to locate the issues or improve latency of the system.
  • Sometimes the engineers would get questions like “Why am I seeing this feed item?” or “Why is this search result ranked in first place?”. To answer these questions, they would like to look at the features of these items. The issue usually happens when they are experimenting with a new feature in the model, and the quality of the feature doesn’t meet their expectations and doesn’t perform well in the online system. In this case, the engineers would be interested in this particular new feature to see if its value is abnormal.

    Sometimes the issue happens when there’s a new model being ramped in production. Since we could have multiple models deployed for the A/B tests, the engineers would also like to know what is the model being used at the time of scoring.
    This is also useful because sometimes a model has a failure and it falls back to a default model, they would also want to confirm which one was used.

    Our engineers would also get the questions such as “Why am I not seeing something”. It would be very helpful to see what has been returned and what has been dropped in each layer and to see how was this item re-ranked. Sometimes the data doesn’t exists in our online store, sometimes the item are dropped because the score is not high enough. This would help the engineers to find out the root cause quickly.
  • Here’re some more advanced and complex use cases when it comes to debugging and explaining relevance systems. We’ll now go over each one.
  • The first use case is system perturbation. The idea is simple: I want to change something in the system and see how the results change. For example, I want to confirm a bug fix before making code changes, or just to observe how a new model would work before exposing it to traffic. So how does it work? First, the engineer uses our tool to specify what kind of perturbation he or she wants to perform, which is injected as part of the request. Our tool then passes the perturbation to downstream services, overrides the system behavior, and surfaces the results back to the engineer.

    Some examples of perturbation include:
    -To override settings or treatments of online A/B tests
    -To select a new machine learning model and sanity check its results before ramping up on traffic
    To override feature values and see if an item would be ranked differently in the search results
  • Another advanced use case is side-by-side comparison:
    -We want to compare the results of 2 different machine learning models
    Or we may want to compare features and ranking scores of 2 items. The 2 items can come from a single query (to understand their relative ranking), or from 2 different queries (to understand differences in models).
  • Here’s the user interface when comparing results from 2 different queries. The UI shows changes in ranking positions of the same item. As you can see in this example, Query 1 on the left side, using a different model than Query 2 on the right side, ranked the Lead Software Engineer job higher.
  • And why do the models rank the same item differently? Our tool can compare this item from 2 queries and show the difference in the ranking process of the 2 models. As you can see in the table, this UI highlights the differences in feature values, both raw and computed by the models. AI engineers use this comparison information to understand model behavior and identify potential problems or improvements.
  • The last use case is Replay. Scoring and ranking relevant information could be time sensitive. For example, we may get a complaint about old items showing up in a member’s feed. To debug the problem, we need to reproduce the feed items returned by the model at a specific timepoint in the past. Our tool provides the capability of replaying past model inferences.
    How does it work? First, when making model inferences, the online system emits features and ranking scores using Kafka. These Kafka events are then consumed by our tool using Samza and stored in a database. Later, AI engineers can query the database for historical data when debugging.

    This is the UI for Feed Replay. On the left side, an AI engineer can query sessions of a given member within a time range. And when a session is selected, on the right side are the list of feed items shown to the member during that session.
  • So we have talked about many different use cases, and I would like add a note on security and privacy. As you have seen earlier, debugging and explaining the relevance systems require access to personalized data. And we have always made privacy concerns a central part of the tool. Like the rest of LinkedIn, our debugging tool is GDPR compliant. We control and audit all access. Debugging data has a limited retention time of about a couple of weeks. And the replay capability from the previous slide only works for our internal employees. With the debugging tool, we give our engineers as much visibility as possible while protecting our members’ privacy.

    Here’s a list of AI teams that are using our tool on a daily basis: search, feed, comments, etc. As AI becomes more and more prevalent in our daily lives, it’s important to understand the models and debug them when they don’t behave as expected. We hope our talk has been helpful, and we’re happy to answer questions. Thank you!
  • Have these models read the question carefully? Can we say “yes”, because they achieve good accuracy? That is the question we wish to address in this paper. Let us start by looking at an example.
  • Completely changing the question’s meaning still does not change the behaviour of the model. Clearly, the model has not fully grasped the question.
  • You may feel that Integration and Gradients are inverse operations of each other, and we did nothing.
  • We encourage all of you to check out KDD’19 sessions/tutorials/talks related to these topics:

    Tutorial: Fairness-Aware Machine Learning: Practical Challenges and Lessons Learned (Sun)
    Workshop: Explainable AI/ML (XAI) for Accountability, Fairness, and Transparency (Mon)
    Social Impact Workshop (Wed, 8:15 – 11:45)
    Keynote: Cynthia Rudin, Do Simpler Models Exist and How Can We Find Them? (Thu, 8 - 9am)
    Several papers on fairness (e.g., ADS7 (Thu, 10-12), ADS9 (Thu, 1:30-3:30))
    Research Track Session RT17: Interpretability (Thu, 10am - 12pm)
