Machine Learning: Addressing the Disillusionment to Bring Actual Business Benefit
Jon Mead
Technical Services Director, North America
Egress Software Technologies, Inc.
June 14, 2019
About the Speaker
Jon Mead
Technical Services Director North America, Egress Software
An experienced technical engineer, Jon has worked across corporate
and government organizations to effectively deploy and manage
SaaS technologies in complex environments. As Technical Services
Director for North America, Jon provides expert technical support
and guidance to Egress clients as they achieve key compliance and
business objectives. Working closely with strategic personnel at
Egress, Jon plays an integral part in the development and delivery of
the company’s innovative data security platform that empowers
users to send, receive and manage information without risk.
About Egress
A leader in intelligent, user-centric data security
• A decade of success in sophisticated defense, government and private sector data privacy.
• Identify, Classify, Secure, Control, Monitor, Audit & Report
• 2000+ Enterprise customers across industry: banking and insurance, government, healthcare, non-profit, professional services, industry regulators, utilities
• US Headquarters in Boston, MA.
• Vetted, certified products and services (NIST, ISO, NATO, Common Criteria)
Machine Learning: Where do we begin?
» Define the real-world problem
• Is there a problem to solve?
• Can we solve the problem?
• Should we solve the problem?
• What data do we need to solve this problem?
• Can/Should we use Machine Learning?
The rise of mistake-driven breaches – 2018 Verizon Data Breach Report*
*53,308 security incidents, 2,216 data breaches, 65 countries, 67 contributors. https://www.verizonenterprise.com/verizon-insights-lab/dbir/
Machine Learning: An Example
» Define the real-world problem
✓ Is there a problem to solve?
✓ Can we solve the problem?
✓ Should we solve the problem?
✓ What data do we need to solve this problem?
✓ Can/Should we use Machine Learning?
Egress: A Problem Worth Machine Learning?
Business Problem:
• How does an organization handle real-world risks to data as it travels over untrusted networks to potentially untrusted recipients?
• Can an organization account for human error and/or malicious behavior with that data?
• Ultimately, how can an organization avoid data breaches and demonstrate compliance with rigorous data protection regulations, such as CCPA, in the real world?
Machine Learning Process
• Define the objective of the Problem Statement
• Data Gathering
• Data Preparation
• Exploratory Data Analysis
• Building a Machine Learning Model
• Model Evaluation & Optimization
• Predictions
Business Machine Learning Process
• Define the Business Objective (Problem)
• Source the appropriate data
• Split the data in a meaningful way
• Select the evaluation metric(s)
• Define all features that may be created from the data
• Train the model
• Feature selection
• Production system
• Feed the model
Define the Business Objective
Source the appropriate data
Split the data
Select the evaluation metrics
Define all features
Train the model
Feature Selection
Create Production Version
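The "split the data" and "select the evaluation metrics" steps can be sketched as follows. This is a minimal illustration, assuming a time-ordered history of sent messages; the event list, the cutoff date, and the stand-in predictions are hypothetical and not taken from the talk. Splitting by time keeps future messages out of training, and precision and recall are one reasonable metric pair when mis-sends are rare.

```python
from datetime import date

# Hypothetical labelled history: (date sent, was it a mis-send?).
events = [
    (date(2019, 1, 5), True),
    (date(2019, 1, 9), False),
    (date(2019, 2, 2), False),
    (date(2019, 2, 20), True),
    (date(2019, 3, 3), True),
    (date(2019, 3, 15), False),
]

def chronological_split(rows, cutoff):
    """Train on everything before the cutoff, evaluate on everything from it onward."""
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test

def precision_recall(predicted, actual):
    """Precision and recall for boolean predictions against boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

train, test = chronological_split(events, cutoff=date(2019, 2, 1))
actual = [label for _, label in test]
predicted = [True, False, True, False]  # stand-in for a trained model's output
precision, recall = precision_recall(predicted, actual)
```

With the stand-in predictions above, one of the two flagged messages is a real mis-send and one of the two real mis-sends is caught, so both metrics come out at 0.5.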
Deploying Machine Learning: Common Pitfalls?
• Machine Learning in practice
• Machine Learning in production
• Common pitfalls when deploying from practice to production?
• How are these pitfalls defined?
Deploying Machine Learning: Common Pitfalls?
• Sampling Bias
• Data Leakage
• Unknown Unknowns
• Scaling and Normalization
• Impact of Outliers
• Fitting Data
• Overfitting the Model
• Social Engineering
Deploying Machine Learning: Sampling Bias
Symptom-Based Sampling
Truncate Selection
Caveman Effect
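One of the sampling-bias types listed above, truncate selection, can be shown with a few lines of arithmetic. This is a sketch on synthetic numbers, assuming that only items above a cutoff ever reach the data set; the population and the cutoff are invented for illustration.

```python
from statistics import mean

# Synthetic population: message sizes in KB, 1..100 (illustrative only).
population = list(range(1, 101))

# Truncated selection: only items at or above a cutoff are ever recorded,
# for example because an existing filter logs large messages only.
CUTOFF = 60
truncated_sample = [x for x in population if x >= CUTOFF]

population_mean = mean(population)
sample_mean = mean(truncated_sample)
bias = sample_mean - population_mean  # how far the sample overstates the mean
```

A model trained on the truncated sample would see an average of 80 where the true average is 50.5, and it would never see a small message at all.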
Deploying Machine Learning: Data Leakage
• Use Tags and Labels to organize structured data
• Unstructured Data – How do we organize?
• How can we prevent data leakage in our machine learning model?
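A common route for data leakage is preprocessing that is fitted on the full data set before the split. The sketch below assumes a simple standardization step and invented values; it contrasts statistics computed over train plus test (leaky) with statistics computed on the training split only.

```python
from statistics import mean, pstdev

train = [10.0, 12.0, 11.0, 13.0]
test = [40.0]  # held-out data the model must not see during fitting

def fit_standardizer(values):
    """Return (mean, standard deviation) learned from the given values only."""
    return mean(values), pstdev(values)

def standardize(values, mu, sigma):
    """Apply previously fitted statistics to new values."""
    return [(v - mu) / sigma for v in values]

# Leaky: the held-out point shifts the statistics used to build training features.
leaky_mu, leaky_sigma = fit_standardizer(train + test)

# Safe: statistics come from the training split and are reused unchanged on test.
mu, sigma = fit_standardizer(train)
test_scaled = standardize(test, mu, sigma)
```

In the leaky version the mean moves from 11.5 to 17.2 because of a point the model is supposed to be evaluated on, so the evaluation no longer reflects unseen data.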
Deploying Machine Learning: Unknown Unknowns
• What are the unknown unknowns in Machine Learning?
• Why are unknown unknowns a problem for Machine Learning?
• How can we address unknown unknowns in our machine learning model?
Deploying Machine Learning: Scaling and Normalization
• What is the impact of Scaling in Machine Learning and how can it hurt our model?
• What is normalization and why should we consider it when working with Machine Learning?
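To make the scaling question concrete, the sketch below shows how an unscaled feature with a large numeric range can dominate a distance-based comparison. The feature names and values are hypothetical, and min-max scaling is only one of several normalization choices.

```python
import math

# Hypothetical features per email: (attachment size in bytes, number of recipients).
emails = {
    "query": (1000, 2),
    "a": (1100, 50),
    "b": (1500, 3),
    "c": (10000, 1),
}

def min_max_scale(points):
    """Rescale every feature to [0, 1] using the min and max seen in `points`."""
    columns = list(zip(*points.values()))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return {
        name: tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for name, p in points.items()
    }

def nearest(points, target="query"):
    """Name of the point closest to `target` by Euclidean distance."""
    others = (n for n in points if n != target)
    return min(others, key=lambda n: math.dist(points[n], points[target]))

nearest_raw = nearest(emails)                    # byte count dominates
nearest_scaled = nearest(min_max_scale(emails))  # both features contribute
```

On raw values the closest match to the query is "a", even though it goes to 50 recipients instead of 2; after scaling the closest match is "b", which differs only slightly on both features.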
Deploying Machine Learning: Outliers
Univariate Method
Multivariate Method
Minkowski Error
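Two of the three methods named on this slide can be sketched with the standard library. The univariate check below uses the common interquartile-range fence, and the Minkowski error raises each residual to a power below 2 so that a single large error weighs less than it would under squared error. The readings, residuals, and the exponent 1.5 are illustrative assumptions; the multivariate method is not shown.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Univariate check: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

def minkowski_error(errors, p=1.5):
    """Mean of |error|**p; p=2 gives mean squared error."""
    return sum(abs(e) ** p for e in errors) / len(errors)

readings = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(readings)

residuals = [1.0, -1.0, 10.0]
squared = minkowski_error(residuals, p=2)
robust = minkowski_error(residuals, p=1.5)
```

Here the value 95 is the only reading flagged, and the residual of 10 contributes far less to the error at p=1.5 than at p=2.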
Deploying Machine Learning: Select the Fitting Data
Deploying Machine Learning: Overfitting
• Without enough data, organizations are at risk of overfitting the machine learning model
• Using all the data in the world does not mean that the developed model is accurate, or even viable
• Complication is impressive, but simplicity is brilliance
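The overfitting points above can be illustrated with a deterministic toy example. It assumes synthetic data in which the true rule is a single threshold and every fifth label is deliberately flipped as noise. A model that memorizes the training set is compared with a one-parameter threshold model; both models and the data are invented for illustration.

```python
def make_data(offset, flip_remainder):
    """100 grid points; the true label is i >= 50, and every fifth label is flipped."""
    data = []
    for i in range(100):
        label = i >= 50
        if i % 5 == flip_remainder:
            label = not label  # synthetic label noise
        data.append(((i + offset) / 100, label))
    return data

train = make_data(offset=0.0, flip_remainder=0)
test = make_data(offset=0.3, flip_remainder=2)

def memorizer(x):
    """Overfit model: copy the label of the closest training point (1-nearest-neighbour)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(model, data):
    """Fraction of points where the model's output matches the label."""
    return sum(model(x) == label for x, label in data) / len(data)

# Simple model: a single threshold, chosen to maximise training accuracy.
candidates = [x for x, _ in train]
best_t = max(candidates, key=lambda t: accuracy(lambda x: x >= t, train))

def simple(x):
    return x >= best_t

memo_train, memo_test = accuracy(memorizer, train), accuracy(memorizer, test)
simple_train, simple_test = accuracy(simple, train), accuracy(simple, test)
```

The memorizer scores perfectly on the training data because it has learned the noise as well as the signal, and it then does worse on the test set than the threshold model, whose training and test scores stay close together.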
Deploying Machine Learning: Social Engineering
• What is the impact of Social Engineering in Machine Learning?
• How can models defend against social engineering attacks?
Egress: How did we employ Machine Learning?
Original Business Problem:
• How does an organization handle real-world risks to data as it travels over untrusted networks to potentially untrusted recipients?
• Can an organization account for human error and/or malicious behavior with that data?
• Ultimately, how can an organization avoid data breaches and demonstrate compliance with rigorous data protection regulations, such as CCPA, in the real world?
Risk-Based Protection: What?
» Apply protection and rights management on-the-fly based on risk
» Protect against the accidental sharing of data
» Auto-encrypt messages for other Egress clients
» Increase user engagement and adoption
Risk-Based Protection: How?
» Analyses previous email communications to protect from accidental sends
» Calculates a risk score based on domain, user behaviour and system info
» Applies protection based on sensitivity of data and risk score
» Uses any email protection, including TLS, O365, Voltage, etc.
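The flow described above (score the risk, then pick a protection level from the score and the sensitivity of the data) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the field names, the trusted-domain list, the weights, and the thresholds are hypothetical and do not describe the Egress product, and the "system info" input is omitted.

```python
from dataclasses import dataclass

@dataclass
class Email:
    recipient_domain: str
    prior_messages_to_recipient: int  # how often the sender has emailed this address
    sensitivity: float                # 0.0 (public) to 1.0 (highly sensitive)

TRUSTED_DOMAINS = {"example.org"}     # hypothetical allow-list

def risk_score(email):
    """Combine a domain signal and a behaviour signal into a score in [0, 1]."""
    domain_risk = 0.0 if email.recipient_domain in TRUSTED_DOMAINS else 0.5
    # Familiar recipients are lower risk: this term halves with each prior message.
    behaviour_risk = 0.5 * (0.5 ** email.prior_messages_to_recipient)
    return domain_risk + behaviour_risk

def choose_protection(email):
    """Pick an action from the risk score weighted by data sensitivity."""
    exposure = risk_score(email) * email.sensitivity
    if exposure >= 0.5:
        return "encrypt"
    if exposure >= 0.2:
        return "warn"
    return "send"

routine = choose_protection(Email("example.org", 10, 0.9))
first_contact = choose_protection(Email("unknown.example", 0, 0.9))
occasional = choose_protection(Email("unknown.example", 3, 0.5))
```

A sensitive message to a familiar, trusted recipient is sent normally, the same message to a never-seen external address is encrypted, and a moderately sensitive message to an occasional external contact triggers a warning.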
A New Way: Machine Learning to Detect Errors
• Use historical behavior to detect anomalies
• Parallel processing and cloud AI enable “cognitive” processing of vast quantities of collected data
• “Graph” databases: link relationships and past behaviour to quickly detect anomalies and pattern changes
• Outcomes change with learning, time, and data
• Analysis of user “cliques” (groups) to detect and prevent accidents
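The relationship-graph idea can be sketched with plain dictionaries rather than a graph database. The example assumes a history of past recipient lists and flags any recipient on a draft who has never appeared alongside the others. The addresses are made up, and pairwise co-occurrence is a simplification of true clique analysis.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical history: the recipient list of each past message.
history = [
    {"ann@corp.example", "bob@corp.example", "cy@corp.example"},
    {"ann@corp.example", "bob@corp.example"},
    {"dee@client.example", "eli@client.example"},
]

def build_graph(messages):
    """Link every pair of addresses that has appeared together on a message."""
    graph = defaultdict(set)
    for recipients in messages:
        for a, b in combinations(sorted(recipients), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def unusual_recipients(graph, recipients):
    """Recipients with no historical link to anyone else on this message."""
    return sorted(
        r for r in recipients
        if not (graph.get(r, set()) & (set(recipients) - {r}))
    )

graph = build_graph(history)
draft = {"ann@corp.example", "bob@corp.example", "eli@client.example"}
flagged = unusual_recipients(graph, draft)
```

In the draft above, the third recipient has only ever been emailed together with a different group, so that address is the one flagged for a second look.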
That’s great… but what about all those pitfalls?
• Data Leakage
• Scaling with Machine Learning
• Selecting Appropriate Fitting Data
• Social Engineering
Egress Data Leakage Resolution
• Data Leakage
• Identified left-out data
• Unsupervised Probabilistic Machine Learning
• Historical Behavior with real-time comparison
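One simple reading of "unsupervised probabilistic" with "historical behavior with real-time comparison" is a frequency model: estimate from a sender's history, without any labels, how likely each recipient is, then compare a new message against that estimate. The sketch below uses additive smoothing and invented addresses; it is an assumption about what such a model could look like, not a description of the Egress implementation.

```python
from collections import Counter

def recipient_probability(history, recipient, alpha=1.0):
    """Additively smoothed estimate of how likely the sender is to email `recipient`."""
    counts = Counter(history)
    vocabulary = set(history) | {recipient}
    return (counts[recipient] + alpha) / (len(history) + alpha * len(vocabulary))

# Hypothetical sender history: eight messages to one colleague, two to another.
sent_to = ["ann@corp.example"] * 8 + ["bob@corp.example"] * 2

p_known = recipient_probability(sent_to, "ann@corp.example")
p_new = recipient_probability(sent_to, "amn@corp.example")  # likely a typo of "ann"
```

The familiar address scores 0.75 while the near-identical but never-seen address scores below 0.1, which is the kind of gap a real-time check could use to prompt the sender before the message leaves.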
Egress Addresses Scaling with Machine Learning
• Scaling with Machine Learning: Serverless Technologies
• What is Serverless?
• Why use serverless?
• Benefits from the serverless architecture in practice with Machine Learning
Egress Selection of Appropriate Fitting Data
• Fitting Data Problem
• Data Selection and Testing application
• Build several models to develop the Golden Model
• Run parallel models in fitting and in product
• Feed the Machine
Egress Defending against Social Engineering / Malicious Data Manipulation
• Organizational Domain Relationship Model
• Behavior-Based Risk Assessment: Why did we use a problematic approach?
• How did we mitigate the behavior-based risk assessment model – Eager Update and User-Models
Future: What does this mean for our Clients
[Timeline figure, 2017 to 2020 (projected), covering Data Privacy and Data Security regulations]
• Regulations shown: NYDFS 23 NYCRR 500*, GDPR, CA AB375
• Milestones as labeled: Feb 2018, Phase 2, Transition ends, Full compliance, Sept 2018, Phase 3
• US state amended laws: NAIC Model SC H4655, Colorado (3 CCR 704-1), VT 4:4 Vt Code R. 8:8-4, CO House Bill 18-1128
Thank you!
Talk to us at the Egress stand.
E: info@egress.com
T: 1-800-732-0746
W: www.egress.com
Twitter: @EgressSoftware
"Despite what most SaaS companies are saying, Machine Learning requires time and
preparation. Whenever you hear the term AI, you must think about the data behind it." -
Alexandre Gonfalonieri, February 2019
Appendix
Sources:
• https://www.neuraldesigner.com/blog/3_methods_to_deal_with_outliers
• https://cds.nyu.edu/unknown-unknowns-machine-learning/
• https://elitedatascience.com/model-training
• https://towardsdatascience.com/machine-learning-general-process-8f1b510bd8af
• https://towardsdatascience.com/how-to-build-a-data-set-for-your-machine-learning-project-5b3b871881ac
• https://www.kdnuggets.com/2017/08/understanding-overfitting-meme-supervised-learning.html
• https://towardsdatascience.com/identifying-and-correcting-label-bias-in-machine-learning-ed177d30349e
• https://thenextweb.com/contributors/2018/10/27/4-human-caused-biases-machine-learning/
• https://www.datanami.com/2018/07/18/three-ways-biased-data-can-ruin-your-ml-models/
• https://en.wikipedia.org/wiki/Sampling_bias
• https://towardsdatascience.com/security-and-privacy-considerations-in-artificial-intelligence-machine-learning-part-5-when-6d6d9f457734
• https://machinelearningmastery.com/data-leakage-machine-learning/
• https://imarticus.org/what-is-machine-learning-and-does-it-matter/
• “Identifying and Correcting Label Bias in Machine Learning”, Heinrich Jiang and Ofir Nachum, 15 Jan 2019
• https://www.kdnuggets.com/2018/12/essence-machine-learning.html
