www.prmia.org© PRMIA 2020
Model Risk Management for Machine Learning Models
Sri Krishnamurthy, CFA, CAP
Founder & CEO
www.QuantUniversity.com
www.prmia.org© PRMIA 2020
Thought Leadership Webinar
www.prmia.org© PRMIA 2020
Presenter
Sri Krishnamurthy, CFA, CAP
Founder & CEO, QuantUniversity
• Advisory and Consultancy for Financial Analytics
• Prior experience at MathWorks, Citigroup, and Endeca and
25+ years in financial services and energy
• Columnist for the Wilmott Magazine
• Teaches Analytics in the Babson College MBA program and at
Northeastern University, Boston
• Reviewer: Journal of Asset Management
www.prmia.org© PRMIA 2020
About www.QuantUniversity.com
• Boston-based Data Science, Quant
Finance and Machine Learning
training and consulting advisory
• Trained more than 5,000 students
in Quantitative methods, Data
Science and Big Data Technologies
using MATLAB, Python and R
• Building a platform for AI
and Machine Learning Enablement
in the Enterprise
www.prmia.org© PRMIA 2020
Agenda
Considerations for MRM
for Machine Learning
models
Case Study
Machine Learning
www.prmia.org© PRMIA 2020
Machine Learning in Finance
Part 1
www.prmia.org© PRMIA 2020
The world as we know it has changed!
www.prmia.org© PRMIA 2020
Machine Learning and AI Have Revolutionized Finance
www.prmia.org© PRMIA 2020
Machine Learning & AI in Finance: A Paradigm Shift
Stochastic
Models
Factor Models Optimization
Risk Factors P/Q Quants
Derivative
pricing
Trading
Strategies
Simulations
Distribution
fitting
Real-time
analytics
Predictive
analytics
Machine
Learning
RPA NLP
Deep
Learning
Computer
Vision
Graph
Analytics
Chatbots
Sentiment
Analysis
Alternative
Data
Quant Data Scientist/ML
Engineer
www.prmia.org© PRMIA 2020
Machine Learning
AI
• Artificial intelligence is
intelligence demonstrated by
machines, in contrast to the
natural intelligence displayed by
humans and animals1.
Definitions: Machine Learning and AI
• Machine learning is the scientific
study of algorithms and statistical
models that computer systems use
to effectively perform a specific
task without using explicit
instructions, relying on patterns
and inference instead1.
1. https://en.wikipedia.org/wiki/Machine_learning
2. Figure Source: http://www.fsb.org/wp-content/uploads/P011117.pdf
www.prmia.org© PRMIA 2020
Polling Question 1
• Question: Have you deployed machine learning models in your
organization?
a) Considering it
b) Will be rolled out soon
c) In Production
d) Not yet
www.prmia.org© PRMIA 2020
Considerations for MRM for Machine
Learning models
Part 2
www.prmia.org© PRMIA 2020
The Basics
www.prmia.org© PRMIA 2020
Model Risk Defined
www.prmia.org© PRMIA 2020
The Machine Learning and AI Workflow
Data Scraping/
Ingestion
Data
Exploration
Data Cleansing
and Processing
Feature
Engineering
Model
Evaluation
& Tuning
Model
Selection
Model
Deployment/
Inference
Supervised
Unsupervised
Modeling
Data Engineer, Dev Ops Engineer
• Auto ML
• Model Validation
• Interpretability
Robotic Process Automation (RPA) (Microservices, Pipelines )
• SW: Web/ Rest API
• HW: GPU, Cloud
• Monitoring
• Regression
• KNN
• Decision Trees
• Naive Bayes
• Neural Networks
• Ensembles
• Clustering
• PCA
• Autoencoder
• RMSE
• MAPE
• MAE
• Confusion Matrix
• Precision/Recall
• ROC
• Hyper-parameter
tuning
• Parameter Grids
Risk Management / Compliance (all stages)
Software / Web Engineer
Data Scientist / Quants
Analysts & Decision Makers
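To make the model evaluation and tuning stage concrete, the sketch below pairs a parameter grid with cross-validated hyper-parameter search and reports the metrics named above (confusion matrix, precision/recall, ROC AUC). It is a minimal illustration on synthetic data; the estimator, grid values, and dataset are assumptions, not part of the original workflow.

```python
# Minimal sketch: hyper-parameter tuning with a parameter grid, plus the
# evaluation metrics listed above. Data, estimator and grid values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {"n_estimators": [100, 300], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

print("Best params:", search.best_params_)
print(confusion_matrix(y_test, y_pred))           # confusion matrix
print(classification_report(y_test, y_pred))      # precision / recall
print("ROC AUC:", roc_auc_score(y_test, y_prob))  # ROC
```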
www.prmia.org© PRMIA 2020
Elements of Model Risk Management
www.prmia.org© PRMIA 2020
Model Governance Structure
www.prmia.org© PRMIA 2020
• Components that need to be tracked
What constitutes an ML model?
• Interdependencies
• Lineage/Provenance
of individual
components
• Model params
• Hyper parameters
• Pipeline specifications
• Model specific
• Tests
• Data versions
Data • Model • Process • Environment
• Programming environment
• Execution environment
• Hardware specs
• Cloud
• GPU
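One lightweight way to make these tracked components explicit is a record that travels with every trained model. The sketch below is a hypothetical schema mirroring the four quadrants (data, model, process, environment); the field names are assumptions, not a standard.

```python
# Hypothetical model-record schema covering the four quadrants above.
# Field names are illustrative, not a standard.
from dataclasses import dataclass, field


@dataclass
class MLModelRecord:
    # Data
    data_versions: dict = field(default_factory=dict)    # e.g. {"train": "v1.2", "test": "v1.2"}
    # Model
    model_params: dict = field(default_factory=dict)     # learned parameters / artifact URI
    hyperparameters: dict = field(default_factory=dict)
    tests: list = field(default_factory=list)            # model-specific test identifiers
    # Process
    pipeline_spec: str = ""                               # pipeline definition / DAG reference
    lineage: list = field(default_factory=list)          # upstream components and their versions
    # Environment
    programming_env: dict = field(default_factory=dict)  # e.g. {"python": "3.10", "sklearn": "1.4"}
    execution_env: dict = field(default_factory=dict)    # e.g. {"hardware": "GPU", "cloud": "AWS"}


record = MLModelRecord(
    data_versions={"train": "2020-01-31"},
    hyperparameters={"max_depth": 5},
    pipeline_spec="pipelines/credit_scoring.yaml",
)
```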
www.prmia.org© PRMIA 2020
Elements of a Machine Learning System
Source: Sculley et al., 2015 "Hidden Technical Debt in Machine Learning Systems"
www.prmia.org© PRMIA 2020 19
AI Governance Is Gaining Focus
https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449
www.prmia.org© PRMIA 2020 20
Theory to Practice: How to cross the chasm?
• Theory
• Regulations
• Local Laws
• Practical ML systems
• Company Expertise
• Company culture and Best
practices
www.prmia.org© PRMIA 2020 21
1. ML Life cycle management
2. Tracking
3. Metadata management
4. Scaling
5. Reproducibility
6. Interpretability
7. Testing
8. Measurement
Themes We Will Discuss Today
www.prmia.org© PRMIA 2020
Polling Question 2
• Which is the most challenging aspect in your organization?
a) ML Life cycle management
b) Tracking & Metadata management
c) Scaling
d) Reproducibility & Interpretability
e) Testing & Measurement
www.prmia.org© PRMIA 2020
Up Next
www.prmia.org© PRMIA 2020
24
Model Lifecycle Management
www.prmia.org© PRMIA 2020
Source: T. van der Weide, O. Smirnov, M. Zielinski, D. Papadopoulos, and T. van Kasteren. Versioned machine learning pipelines for batch experimentation. In ML Systems, Workshop NIPS 2016, 2016.
Provenance and Lineage of Pipelines
www.prmia.org© PRMIA 2020 26
Versioning
www.prmia.org© PRMIA 2020
Schemas proposed
Sebastian Schelter, Joos-Hendrik Boese, Johannes Kirschnick, Thoralf Klein, and Stephan Seufert. Automatically Tracking Metadata and Provenance of Machine Learning Experiments. NIPS Workshop on
Machine Learning Systems, 2017.
www.prmia.org© PRMIA 2020
Schemas proposed
G. C. Publio, D. Esteves, and H. Zafar, “ML-Schema : Exposing the Semantics of Machine Learning with Schemas and Ontologies,” in Reproducibility in ML Workshop, ICML’18, 2018.
www.prmia.org© PRMIA 2020
MLflow
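A minimal sketch of experiment tracking with MLflow is shown below: parameters, metrics, and the fitted model artifact are logged per run. The estimator and the values logged are illustrative assumptions, not part of the original demo.

```python
# Minimal MLflow tracking sketch: parameters, metrics and the model artifact
# are logged for each run. The estimator and values are illustrative.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

with mlflow.start_run(run_name="ridge-baseline"):
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X, y)

    mlflow.log_param("alpha", alpha)
    mlflow.log_metric("train_mse", mean_squared_error(y, model.predict(X)))
    mlflow.sklearn.log_model(model, "model")  # versioned artifact for later retrieval
```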
www.prmia.org© PRMIA 2020
DVC
Source: https://dvc.org/
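DVC versions data and pipeline stages alongside Git. The sketch below reads one pinned version of a tracked dataset through DVC's Python API; the repository URL, file path, and tag are hypothetical.

```python
# Minimal sketch: read a specific version (Git tag) of a DVC-tracked dataset
# via DVC's Python API. Repository URL, path and revision are hypothetical.
import io

import dvc.api
import pandas as pd

data_text = dvc.api.read(
    path="data/training.csv",                       # path tracked by DVC in the repo
    repo="https://github.com/example/ml-project",   # hypothetical repository
    rev="v1.0",                                     # Git tag / commit pinning the data version
)
df = pd.read_csv(io.StringIO(data_text))
print(df.shape)
```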
www.prmia.org© PRMIA 2020
31
Sample Project Structure
REF: Harvard Computefest 2020 demo example
www.prmia.org© PRMIA 2020
GoCD
Source: https://www.gocd.org/
www.prmia.org© PRMIA 2020
Up Next
www.prmia.org© PRMIA 2020
I. Altintas, O. Barney, and E. Jaeger-Frank. Provenance collection support in the Kepler scientific workflow system. In Provenance and annotation of data, pages 118–132.
Current Approaches
www.prmia.org© PRMIA 2020
Miao, Hui & Chavan, Amit & Deshpande, Amol. (2016). ProvDB: A System for Lifecycle Management of Collaborative Analysis Workflows.
Current Approaches
www.prmia.org© PRMIA 2020
Related Work
Xueping Liang, Sachin Shetty, Deepak Tosh, Charles Kamhoua, Kevin Kwiat, and Laurent Njilla. 2017. ProvChain: A Blockchain-based Data Provenance Architecture in Cloud Environment with Enhanced
Privacy and Availability. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '17). IEEE Press, Piscataway, NJ, USA, 468-477. DOI:
https://doi.org/10.1109/CCGRID.2017.8
Focus on Cloud data
provenance using Blockchain
www.prmia.org© PRMIA 2020
Related Work
Ramachandran, Aravind & Kantarcioglu, Dr. (2017). Using Blockchain and smart contracts for secure data provenance management.
DataProv: Built on top of
Ethereum, the platform
utilizes smart contracts and
open provenance model
(OPM) to record immutable
data trails.
www.prmia.org© PRMIA 2020
Related Work
Sarpatwar, Kanthi & Vaculín, Roman & Min, Hong & Su, Gong & Heath, Terry & Ganapavarapu, Giridhar & Dillenberger, Donna. (2019). Towards Enabling Trusted Artificial Intelligence via Blockchain.
10.1007/978-3-030-17277-0_8.
Trusted AI and provenance of
AI models
www.prmia.org© PRMIA 2020
Model Inference Standards
www.prmia.org© PRMIA 2020
Up Next
www.prmia.org© PRMIA 2020
Meta Data Management
www.prmia.org© PRMIA 2020
Meta Data Management
1. Add people to Amundsen’s data graph by integrating with HR
systems like Workday. Show commonly used and bookmarked data
assets.
2. Add dashboards and reports (e.g. Tableau, Looker, Apache
Superset) to Amundsen.
3. Add support for lineage across disparate data assets like
dashboards and tables.
4. Add events/schemas (e.g. schema registry) to Amundsen.
5. Add streams (e.g. Apache Kafka, AWS Kinesis) to Amundsen.
https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9
www.prmia.org© PRMIA 2020
43
• Many machine learning applications fail due to the lack of rich, diverse, and
clean datasets needed to build models.
• Historical datasets may be hard to acquire or may be skewed towards the
majority class.
• All plausible scenarios of the future haven’t happened yet!
• Synthetic data is used to enrich and augment existing datasets, providing more
comprehensive samples when training machine learning models (see the
sketch below).
Role of Data Augmentation
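As one concrete example of augmentation for imbalanced data, the sketch below oversamples the minority class with SMOTE from the imbalanced-learn package (synthetic data; the class ratio is an assumption).

```python
# Minimal sketch: rebalance a skewed dataset by generating synthetic
# minority-class samples with SMOTE. Class ratio and data are illustrative.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
print("Before:", Counter(y))     # heavily skewed towards the majority class

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After:", Counter(y_res))  # classes balanced with synthetic samples
```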
www.prmia.org© PRMIA 2020
Up Next
www.prmia.org© PRMIA 2020
GPUs for Scaling
REF : NVIDIA DLI Multi-GPU course slide deck
www.prmia.org© PRMIA 2020
GPUs for Scaling
REF : NVIDIA DLI Multi-GPU course slide deck
www.prmia.org© PRMIA 2020
“TSNE Optimizations
There are four optimizations used to improve the performance of TSNE on GPUs:
1. calculating higher dimensional probabilities with less GPU memory,
2. approximating higher dimensional probabilities,
3. reducing arithmetic operations, and
4. broadcasting along rows.”
Ref: https://medium.com/rapids-ai/tsne-with-gpus-hours-to-seconds-9d9c17c941db
Using GPUs requires GPU compatible code changes
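The required change can be small when a GPU library mirrors the CPU API. The sketch below runs scikit-learn's CPU t-SNE; the commented import indicates how a RAPIDS cuML TSNE class is typically swapped in, stated here as an assumption about cuML's interface rather than verified usage.

```python
# CPU baseline: scikit-learn t-SNE on a small synthetic dataset.
import numpy as np
from sklearn.manifold import TSNE

# GPU variant (assumption: RAPIDS cuML exposes a largely drop-in TSNE class):
# from cuml.manifold import TSNE

X = np.random.RandomState(0).rand(1000, 50)
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)
print(embedding.shape)  # (1000, 2)
```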
www.prmia.org© PRMIA 2020
Polling Question 3
• What kinds of ML tools do you use in your organization?
a) None
b) On-prem - Enterprise
c) Cloud - Enterprise
d) On-prem – Open Source
e) Cloud – Open Source
www.prmia.org© PRMIA 2020
Up Next
www.prmia.org© PRMIA 2020
The Reproducibility Challenge
https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
www.prmia.org© PRMIA 2020
• Repeatability (Same team, same experimental setup)
— The measurement can be obtained with stated precision by the same team using the same
measurement procedure, the same measuring system, under the same operating conditions, in
the same location on multiple trials. For computational experiments, this means that a
researcher can reliably repeat her own computation.
• Replicability (Different team, same experimental setup)
— The measurement can be obtained with stated precision by a different team using the
same measurement procedure, the same measuring system, under the same operating
conditions, in the same or a different location on multiple trials. For computational
experiments, this means that an independent group can obtain the same result using the
author’s own artifacts.
• Reproducibility (Different team, different experimental setup)
— The measurement can be obtained with stated precision by a different team, a different
measuring system, in a different location on multiple trials. For computational
experiments, this means that an independent group can obtain the same result using
artifacts which they develop completely independently.
Repeatable or Reproducible or Replicable
https://www.acm.org/publications/policies/artifact-review-badging
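On the computational side, a common first step toward repeatability is pinning every source of randomness and recording the library versions in use, as in the minimal sketch below (the seed value and the libraries listed are arbitrary choices):

```python
# Minimal repeatability sketch: fix random seeds and record library versions
# so the same team can reliably re-run the same computation. Seed is arbitrary.
import random
import sys

import numpy as np
import sklearn

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

print("python :", sys.version.split()[0])
print("numpy  :", np.__version__)
print("sklearn:", sklearn.__version__)

# Pass the same seed to any estimator that accepts one, e.g.
# RandomForestClassifier(random_state=SEED)
```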
www.prmia.org© PRMIA 2020
Up Next
www.prmia.org© PRMIA 2020
“Interpretability is the degree to which a human can
consistently predict the model's result”1
What is the objective?2
• Simply to get more useful information from the model
• Uncover causal structure in observational data
• Transparency? Convergence?
• Model complexity?
• Culture?
The Interpretability Challenge
1. https://christophm.github.io/interpretable-ml-book/interpretability.html
2. https://arxiv.org/abs/1606.03490
www.prmia.org© PRMIA 2020
• Partial dependence plots (PDP)
• Shapley Values
• Lime (Local Interpretable Model-Agnostic Explanations)
• SHAP (SHapley Additive exPlanations)
Reference: https://christophm.github.io/interpretable-ml-book/
Shapley Values
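A minimal sketch of one of these techniques, SHAP values for a tree-based regressor, is shown below (synthetic data; the estimator and plotting call are illustrative assumptions, not a recommendation of a specific configuration):

```python
# Minimal SHAP sketch: explain a tree model's predictions with Shapley values.
# Data and estimator are synthetic and purely illustrative.
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one attribution per feature per sample

shap.summary_plot(shap_values, X)       # global view of feature effects
```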
www.prmia.org© PRMIA 2020
• Partial dependence plots (PDP) show the dependence between the target
response and a set of ‘target’ features, marginalizing over the values of all
other features (the ‘complement’ features).
• Intuitively, we can interpret the partial dependence as the expected target
response as a function of the ‘target’ features.
https://scikit-learn.org/stable/modules/partial_dependence.html
The Interpretability Challenge
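In scikit-learn, computing and plotting a partial dependence display is a single call; a minimal sketch on synthetic data follows (the feature indices and estimator are arbitrary choices, and a recent scikit-learn version is assumed):

```python
# Minimal partial dependence sketch with scikit-learn (version >= 1.0 assumed).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=500, n_features=6, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Expected target response as a function of features 0 and 1,
# marginalizing over all other ("complement") features.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
```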
www.prmia.org© PRMIA 2020
Which model to choose?
Client Objective:
• Build the best forecasting model that has a
MAPE of 5% or less
Result:
· Regression – 7% MAPE
· Neural Networks – 4% MAPE
· Random Forest – 5% MAPE
Client choice:
· Regression despite being the worst of the
top-3 models
· “I won’t deploy anything that I don’t
understand”
Source: http://engineering.electrical-equipment.org/electrical-distribution/electric-load-forecasting-advantages-challenges.html
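For reference, MAPE as used in the client objective above is the mean absolute percentage error; a minimal sketch follows (the forecasts and actuals are illustrative, not the client's data):

```python
# Minimal MAPE sketch; the forecasts and actuals below are illustrative only.
import numpy as np


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (assumes no zero actuals)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))


actual = [100, 110, 120, 130]
forecast = [96, 112, 115, 138]
print(f"MAPE: {mape(actual, forecast):.1f}%")
```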
www.prmia.org© PRMIA 2020
Up Next
www.prmia.org© PRMIA 2020
Testing for Machine Learning Models
Figure ref: http://www.actuaries.org/CTTEES_SOLV/Documents/StressTestingPaper.pdf
www.prmia.org© PRMIA 2020
59
Comprehensive Testing Is Important
www.prmia.org© PRMIA 2020 60
Can Machine Learning algorithms be gamed?
https://www.youtube.com/watch?time_continue=36&v=MIbFvK2S9g8
https://arxiv.org/abs/1904.08653
www.prmia.org© PRMIA 2020
Up Next
www.prmia.org© PRMIA 2020
Model Risk Assessment Framework
www.prmia.org© PRMIA 2020
Quantifying Model Risk Is Important
www.prmia.org© PRMIA 2020
RISK GRADING
Risk Scores

Impact  5 |  5  10  15  20  25
        4 |  4   8  12  16  20
        3 |  3   6   9  12  15
        2 |  2   4   6   8  10
        1 |  1   2   3   4   5
          +---------------------
             1   2   3   4   5
          Likelihood of occurrence

Red    = High Risk
Yellow = Moderate Risk
Green  = Low Risk

High Impact, High likelihood of occurrence: needs adequate model risk
control measures to mitigate risk
High Impact, Low likelihood of occurrence: address through model risk
control measures and contingency plans
Low Impact, High likelihood of occurrence: lower priority model risk
control measures
Low Impact, Low likelihood of occurrence: least priority model risk control
measures
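A minimal sketch of the grading rule behind this grid: the score is impact times likelihood on 1 to 5 scales. The red/yellow/green cut-offs below are assumptions, since the slide does not state them exactly.

```python
# Minimal sketch of the risk-grading rule: score = impact x likelihood (1-5 scales).
# The band thresholds are illustrative assumptions; the slide gives no exact cut-offs.
def risk_grade(impact: int, likelihood: int) -> str:
    assert 1 <= impact <= 5 and 1 <= likelihood <= 5
    score = impact * likelihood
    if score >= 15:
        return f"{score}: High risk (red)"
    if score >= 6:
        return f"{score}: Moderate risk (yellow)"
    return f"{score}: Low risk (green)"


print(risk_grade(impact=5, likelihood=4))  # high impact, high likelihood
print(risk_grade(impact=5, likelihood=1))  # high impact, low likelihood
print(risk_grade(impact=1, likelihood=2))  # low impact, low likelihood
```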
www.prmia.org© PRMIA 2020
Summary
1. ML Life cycle management
2. Tracking
3. Metadata management
4. Scaling
5. Reproducibility
6. Interpretability
7. Testing
8. Measurement
www.prmia.org© PRMIA 2020
Up Next Case study:
Using Synthetic Data for Model Validation
www.prmia.org© PRMIA 2020
Polling Question 4
• Have you considered using Synthetic/Simulated data for testing
and validating models?
a) No
b) Considering it
c) Yes
d) Tried it and decided not to use it
www.prmia.org© PRMIA 2020
Synthetic Data
• Synthetic data is "any production data applicable to a given situation that
are not obtained by direct measurement.”1
• In finance, synthetic data has been used in stress and scenario analysis for
many years.
• Example: Monte Carlo simulations have been used to generate future
scenarios (see the sketch below).
• In machine learning, synthetic data plays an important role in preventing
overfitting, handling imbalanced class problems, and accommodating
plausible scenarios.
1 https://en.wikipedia.org/wiki/Synthetic_data
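As a simple illustration of the Monte Carlo point above, the sketch below generates future price scenarios from a geometric Brownian motion; all parameter values are illustrative assumptions.

```python
# Minimal Monte Carlo sketch: simulate future price paths under geometric Brownian
# motion. Drift, volatility, horizon and starting price are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
s0, mu, sigma = 100.0, 0.05, 0.20        # start price, annual drift, annual volatility
n_paths, n_steps, dt = 10_000, 252, 1 / 252

shocks = rng.standard_normal((n_paths, n_steps))
log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * shocks
paths = s0 * np.exp(np.cumsum(log_returns, axis=1))

print("1-year 5th / 95th percentile:",
      np.percentile(paths[:, -1], [5, 95]).round(2))
```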
www.prmia.org© PRMIA 2020
Challenges with Real Datasets
All scenarios haven’t played out
• Stress scenarios
• What-if scenarios
Figure ref: http://www.actuaries.org/CTTEES_SOLV/Documents/StressTestingPaper.pdf
www.prmia.org© PRMIA 2020
Access
• Hard to find
• Rare class problems
• Privacy concerns making it
difficult to share
Challenges with Real Datasets
Picture source: www.pixabay.com
www.prmia.org© PRMIA 2020
Imbalanced
• Need more samples of rare class
• Need proxies for data points that
were not observed or recorded
Challenges with Real Datasets
Picture source: www.pixabay.com
www.prmia.org© PRMIA 2020
Synthetic Data in Finance
Ref: Machine Learning for Asset Managers, Marcos M. López de Prado, Cambridge University Press, 2020
www.prmia.org© PRMIA 2020
73
www.prmia.org© PRMIA 2020
MRM Use Cases
• Data Anonymization
— Anonymize training and test data sets for internal and external model
validation
• Data Augmentation
— Augment sparse datasets with realistic datasets
• Handling Imbalanced data classes
— Address algorithmic bias and test model efficacy for rare-class
problems
• Stress and Scenario testing
— Simulate extreme but plausible scenarios to test
model behavior
www.prmia.org© PRMIA 2020
VIX Characteristics
REF: https://www.investopedia.com/terms/v/vix.asp
www.prmia.org© PRMIA 2020
Demo: Synthetic VIX Generation
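The demo itself runs on the QuSandbox. As a stand-alone, hedged illustration of one way a VIX-like synthetic series could be produced, the sketch below simulates a mean-reverting process with a floor; the parameters are assumptions and this is not the model used in the demo.

```python
# Hedged sketch only: generate a VIX-like synthetic series with a simple
# mean-reverting process. Parameters are assumptions; NOT the QuSandbox demo model.
import numpy as np

rng = np.random.default_rng(0)
n_days = 1000
theta, long_run_level, vol_of_vol = 0.10, 18.0, 1.5   # reversion speed, mean level, shock size

vix = np.empty(n_days)
vix[0] = 15.0
for t in range(1, n_days):
    shock = vol_of_vol * rng.standard_normal()
    vix[t] = max(vix[t - 1] + theta * (long_run_level - vix[t - 1]) + shock, 1.0)

print("Synthetic series: mean %.1f, min %.1f, max %.1f" % (vix.mean(), vix.min(), vix.max()))
```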
www.prmia.org© PRMIA 2020
Up Next Demo
If you would like access to the demo and the QuSandbox,
please contact us at info@qusandbox.com.
www.prmia.org© PRMIA 2020
Use Code MRMPRMIA for $100 off!
Register here
www.prmia.org© PRMIA 2020
QuantUniversity’s Model Risk related papers
Email me at sri@quantuniversity.com for a copy
www.prmia.org© PRMIA 2020
Q&A Sri Krishnamurthy, CFA, CAP
Founder and CEO
Information, data and drawings embodied in this presentation are strictly the property of QuantUniversity LLC and shall not
be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
