Machine Learning is increasingly used by companies as a disruptor or to provide a unique selling point (USP). This means Machine Learning models need to cope with being a critical part of solutions, and if those solutions handle PCI-DSS or PII data then the models must be highly secure.
In addition, if a Machine Learning model is part of your USP then you will want to protect it. Also, the EU AI Regulation and UK AI Strategy mean that AI is becoming increasingly regulated. This means you need to be able to prove which model made a prediction and why it made it, by providing auditability and explainability.
In this talk we cover these issues and how to address them, including using AWS and implementing development best practices.
2. WHO AM I?
• Ipswich AWS User Group Leader; contributes to the AWS Community by speaking at a number of summits, community days and meet-ups.
• Regular blogger, open-source contributor, and SME on Machine Learning, MLOps, DevOps, Containers and Serverless.
• Very experienced principal solutions architect, lead developer and head of practice for Inawisdom.
• 12 AWS Certifications including SA Pro, DevOps Pro, and the Data Analytics and Machine Learning Specialties.
• Over 6 years of AWS experience; responsible for running production workloads of over 200 containers in a high-performance system that responded to 18,000 requests per second.
• Visionary in MLOps; produced production workloads of ML models at scale, including 1,500 inferences per minute, with active monitoring and alerting.
• Has developed professionally in Python, NodeJS and J2EE.
• Implemented pipelines that deployed changes rapidly using canary, blue/green and full deployment options.
• Extensive experience of writing complex applications and solutions to a high standard and quality.
Phil Basford
philip.basford@gmail.com
@philipbasford
3. What are the risks we need to mitigate?
IP AND SECURITY CONSIDERATIONS
Risk areas: Data Access & Privacy; Model Management; Custom IP & Algorithms; Infrastructure and Physical Security.
ML lifecycle stages: Business target understanding → Data inspection and collection → Data pre-processing → Feature engineering → Machine Learning modelling → Model inference.
4. Data Characteristics, Personal Information and Company Secrets
DATA ACCESS & PRIVACY
For PII and PCI-DSS data, the protection and securing of data is essential and a legal requirement (GDPR).
The source of the data, and an understanding of what is contained within it, is very important as it is part of the "secret sauce" (e.g. domain knowledge and understanding features).
To safeguard IPR and data, separate analytical and operational workloads. Ideally there is clear isolation, which helps with areas of responsibility and hand-offs.
Machine Learning models are products of their data; if that data is wrong or unsatisfactory it will impact your predictions.
An important consideration while training is that you might be using multiple CPUs/GPUs and/or instances. This requires data to be exchanged securely across boundaries.
5. Data Characteristics, Personal Information and Company Secrets (mitigation)
DATA ACCESS & PRIVACY
Data: Only use the data you need. If the data contains PII or PCI-DSS values and you do not need them, remove or sanitise them. If you do need them, have complete auditability over their use.
Data Foundation: In most situations it is advisable to have your data secured within a comprehensive data lake or data warehouse that has clear lineage and classification of data.
Data Layers or Virtualisation: It is very important to build layers into your data foundation: raw, sanitised, curated and optimised. Alternatively, use virtualisation to restrict what data is usable/viewable.
Validation: Source data used as input, and any predictions produced as output, must be validated before being consumed. This can help detect or prevent bad actors from influencing the model.
Secure GPU pipeline and data exchange: Using shared tenancy in the cloud means secure inter-process communication is needed when sending data or the model between chips. Your cloud provider should have mechanisms to allow you to do this.
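As a minimal sketch of the "remove or sanitise" step above (the field names and key handling are illustrative assumptions, not from the talk), PII values can be replaced with keyed hashes before data reaches the training environment, so records remain joinable but raw values are never exposed:

```python
import hmac
import hashlib

# Hypothetical helper: pseudonymise PII fields with a keyed hash so records
# remain joinable for analytics but raw values never reach the training set.
SECRET_KEY = b"rotate-me-and-keep-in-a-secrets-manager"  # assumption: managed secret
PII_FIELDS = {"email", "card_number"}  # assumption: fields flagged by classification

def pseudonymise(record: dict) -> dict:
    """Replace PII values with a deterministic HMAC-SHA256 digest."""
    clean = {}
    for field, value in record.items():
        if field in PII_FIELDS:
            digest = hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256)
            clean[field] = digest.hexdigest()
        else:
            clean[field] = value
    return clean

row = {"email": "jane@example.com", "card_number": "4111111111111111", "spend": 42.5}
safe = pseudonymise(row)
assert safe["spend"] == 42.5          # non-PII passes through untouched
assert safe["email"] != row["email"]  # PII is no longer readable
```

Because the hash is deterministic per key, the same customer still maps to the same token across data sets, which preserves joins without retaining the raw value.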
6. PROTECTING CUSTOM IP & ALGORITHMS
Specific attention is needed around allowing data scientists to experiment quickly with real data. However, the ongoing training process needs to be done within a robust architecture and pipelines. Investment in CI/CD and automated processes to build, test and deploy with clear quality standards is required.
Most supervised learning uses established algorithms. However, it is very common for unsupervised problems to implement custom algorithms, e.g. clustering data together based on specific domain knowledge.
The uniqueness of your model is formed from the characteristics contained in your data and the logic you apply to them, including the algorithm and hyperparameters you use.
Machine Learning code created by data scientists needs to be kept securely, and obfuscation may provide extra protection.
7. PROTECTING CUSTOM IP & ALGORITHMS
Secure Source Code: Storing code, including notebooks, in Git helps version control. Using private Git repositories allows tight control of access. Combined with CI/CD this can allow for a clear operational hand-off.
Docker: Protects IPR when used with a private registry; it is also advised to make images immutable and have them signed. Docker also means the operations team does not need to look inside the container.
Private Repositories: Use private code and artifact repositories and Docker image registries to store code and binaries, including algorithms. These also support governance of which dependencies are approved for use.
Security Scanning: Regular vulnerability and security best-practice scanning and static analysis is essential. This includes it being part of pull requests and the build process, to prevent and detect issues.
Licensing: Establish a process to review the privacy and license agreements for all software and ML libraries needed. Ensure these agreements comply with your organization's legal, privacy, and security terms and conditions.
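A minimal sketch of wiring such scanning into a build (the tool choices here, bandit for static analysis and pip-audit for vulnerable dependencies, are illustrative examples, not prescribed by the talk):

```yaml
# Hypothetical AWS CodeBuild buildspec fragment: fail the build on
# static-analysis findings or known-vulnerable dependencies.
version: 0.2
phases:
  install:
    commands:
      - pip install bandit pip-audit
  pre_build:
    commands:
      - bandit -r src/ -ll             # static security analysis of Python code
      - pip-audit -r requirements.txt  # known CVEs in pinned dependencies
  build:
    commands:
      - python -m pytest tests/
```

Running the scans in `pre_build` means a finding stops the pipeline before any artifact is produced or published.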
8. MODEL MANAGEMENT
Models produced by frameworks like scikit-learn, PyTorch, TensorFlow etc. can be reverse engineered. However, the more complicated the model, or the more custom the data, the harder this is to accomplish.
Models need to be stored in secure model storage within a secure network, using private encryption keys.
The secure model storage needs to be versioned and auditable, so that unintentional or intentional loss or corruption of models is prevented.
The MLOps pipeline needs to be the official path that builds and deploys models to the secure model storage.
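One way to make stored models versioned and tamper-evident (a sketch; the registry entry shape is an assumption, not from the talk) is to record a content hash for each artifact, so corruption or unofficial changes are detectable before deployment:

```python
import hashlib

def register_model(artifact: bytes, registry: list) -> dict:
    """Append a new, immutable version entry; the SHA-256 digest makes
    any later corruption or tampering of the artifact detectable."""
    entry = {
        "version": len(registry) + 1,
        "sha256": hashlib.sha256(artifact).hexdigest(),
        "size_bytes": len(artifact),
    }
    registry.append(entry)
    return entry

def verify_model(artifact: bytes, entry: dict) -> bool:
    """Check an artifact against its registered digest before deployment."""
    return hashlib.sha256(artifact).hexdigest() == entry["sha256"]

registry = []
model_bytes = b"pretend-serialised-model"
entry = register_model(model_bytes, registry)
assert verify_model(model_bytes, entry)       # untouched artifact verifies
assert not verify_model(b"corrupted", entry)  # any change is detected
```

In practice the registry entries would live in the secure model storage and the digest check would gate the deployment stage of the MLOps pipeline.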
9. MODEL MANAGEMENT
Model Registry: Secure storage for trained models that provides versioning and auditability.
Feature Store: Central storage for features, allowing reuse and a catalog of which models use them.
Bias & Drift Detection: Monitoring the performance of the model in data science terms is very important. This is especially important when a human in the loop is involved, to spot features not performing correctly.
10. The central model repository
1. Model Storage: The ability to securely store all kinds of models from all the major frameworks, or your own algorithm.
2. Model Versioning: The ability to store multiple increments of a model and the changes between versions.
3. Model Approval: The ability to register models as approved versions, record who approved them, and trigger deployments.
4. Stage Transitions: The ability to transition models through stages of deployment into production.
5. Data Versioning:
a) The source of the data and an immutable copy of it in an optimised format
b) The details of the processes run on the data to create the optimised format
c) The version and source code of the script for those processes
6. Meta Data: metadata or a manifest that covers:
a) The hyperparameters and settings used for training
b) The location of the script in source control and its revision
MODEL REGISTRY
11. Auditability and Governance
1. First, we need to know which artifact was used during inference:
a) Each artifact produced needs a version number
b) Each prediction needs to be signed with that version number
2. For a versioned artifact or model, you need to know the following from training:
a) The hyperparameters and settings used for training
b) The location of the script in source control and its revision
c) The version number of the algorithm, framework, and any other dependencies
d) The data set needs to be versioned and stored somewhere that allows it to be retrieved
e) An approval process, be it automated or manual, that provides clear auditing of whether a model is approved and who approved it
3. Lastly, the data set itself needs to have data lineage:
a) The source of the data and the processes run on the data
b) The version and source code of the script for those processes
MODEL LINEAGE
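The training details listed above can be captured as a small, machine-readable manifest stored alongside each model version. The field names and values below are an illustrative assumption, not a format from the talk:

```python
import json

# Hypothetical lineage manifest for one versioned model artifact, capturing
# the training facts listed above so any prediction can be traced back.
manifest = {
    "artifact_version": "1.4.0",
    "hyperparameters": {"max_depth": 6, "eta": 0.1},
    "training_script": {"repo": "git@example.com:ml/models.git", "revision": "abc1234"},
    "dependencies": {"framework": "xgboost==1.7.6", "python": "3.11"},
    "dataset": {"uri": "s3://example-bucket/train/v42/", "version": "v42"},
    "approval": {"approved": True, "approved_by": "ml-lead@example.com"},
}

# Serialise deterministically so the manifest itself can be hashed and audited.
serialised = json.dumps(manifest, sort_keys=True)
restored = json.loads(serialised)
assert restored["dataset"]["version"] == "v42"
```

Signing each prediction with `artifact_version` then links every inference back to this record, giving the auditability the regulation requires.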
12. ADVERSARIAL MACHINE LEARNING
Model Poisoning
► Interfering with or polluting the training data to alter the predictions made by a model
► This type of attack is typically used to drive benefit for yourself or to discredit a provider
► Publicly known examples are fake reviews on Trip Advisor or Amazon
Model Extraction
► A black-box brute-force attack on a model to recreate it and understand its secrets, for example the weights in a CNN
► The attack uses systematic probing of the model
► Used in conjunction with other adversarial attacks
Model Inversion & Inference
► Exposing personal information from a model due to it being improved with data outside of the general training data (linked to knowledge transfer)
► Important to exclude your PII or payment details from multi-tenancy services, or invest in single-tenancy models
► The most publicly known example is auto complete/correct learning from private email correspondence
Knowledge Transfer
► A white-box brute-force attack on a model to train a secondary model to produce the same results as the primary model
► The secondary model is less complex (big) than the primary
► This technique can be used to protect a primary model
► The secondary model can be used to understand the inputs into the original
Model Evasion
► Manipulating the prediction a model makes by finding the outliers in the input
► Attacks are focused on classifiers and adding enough noise into the source
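A common, simple mitigation for evasion (and a first line of defence against probing) is to validate inference inputs against the ranges seen in training and reject outliers. The bounds and tolerance below are illustrative assumptions, not values from the talk:

```python
# Minimal input-validation guard: reject inference requests whose features
# fall far outside the ranges observed in training data. Feature bounds and
# the tolerance are illustrative assumptions.
TRAINING_BOUNDS = {
    "age": (18.0, 100.0),
    "amount": (0.0, 10_000.0),
}
TOLERANCE = 0.10  # allow 10% headroom beyond the observed training range

def validate_input(features: dict) -> list:
    """Return the names of features that look like adversarial outliers."""
    suspicious = []
    for name, (low, high) in TRAINING_BOUNDS.items():
        value = features.get(name)
        span = high - low
        if value is None or value < low - TOLERANCE * span or value > high + TOLERANCE * span:
            suspicious.append(name)
    return suspicious

assert validate_input({"age": 35, "amount": 120.0}) == []              # normal request
assert validate_input({"age": 35, "amount": 950_000.0}) == ["amount"]  # outlier flagged
```

Flagged requests can be rejected or routed for review, which also surfaces the systematic probing used in extraction attacks.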
13.
Operational Excellence: Monitoring, observing and alerting using CloudWatch and X-Ray. Infrastructure as Code with SAM and CloudFormation.
Security: Least privilege, data encryption at rest, and data encryption in transit using IAM policies, resource policies, KMS, Secrets Manager, VPCs and security groups.
Performance: Elastic scaling based on demand and meeting response times using Auto Scaling, Serverless, and per-request managed services.
Cost Optimisation: Serverless and fully managed services to lower TCO. Resource-tag everything possible for cost analysis. Right-size instance types for model hosting.
Reliability: Fault tolerance and auto-healing to meet a target availability using Auto Scaling, Multi-AZ, Multi-Region, read replicas and snapshots.
https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Machine-Learning-Lens.pdf
AWS WELL ARCHITECTED
15. Networking, Encryption, Auditability, and Access Control
INFRASTRUCTURE AND OVERALL SECURITY
AWS KMS: Encrypt everything! However, if your data is PII or PCI-DSS then consider using a dedicated customer-managed key in KMS to do this. This allows you tighter control by limiting the ability to decrypt data, providing another layer of security over S3.
AWS IAM: Grant access to other AWS services using IAM roles, and make sure your policies are locked down to only the Actions and Resources you need.
Amazon S3: S3 is an easy-to-use object store. However, please make sure you enable encryption, resource policies, logging and versioning on your buckets.
Amazon VPC: AWS services can run outside a VPC and access data over the public internet (hopefully using HTTPS). Please do not do that; it also runs contrary to most corporate information security policies. Therefore deploy in a VPC with PrivateLink for extra security.
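As an illustrative sketch of pairing least-privilege IAM with a dedicated KMS key (the account ID, key ID and bucket name are placeholders), a policy can grant read access to one data prefix and decrypt rights on one specific customer-managed key only:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadTrainingData",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::example-ml-data/curated/*"
    },
    {
      "Sid": "DecryptWithOneKeyOnly",
      "Effect": "Allow",
      "Action": ["kms:Decrypt"],
      "Resource": "arn:aws:kms:eu-west-1:111122223333:key/EXAMPLE-KEY-ID"
    }
  ]
}
```

A role holding this policy can read the curated layer but cannot decrypt objects protected by any other key, so sensitive raw data stays out of reach even if bucket access is misconfigured.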
16. Repositories and CI/CD
BUILD PROCESS AND CODE SECURITY
CodeCommit: Provides private Git repositories that store code securely. Can be combined with best-practice and ML security scanning, with access control. However, it is not as good as GitHub, and other AWS services support both.
ECR: Provides public and private Docker repositories that store encrypted Docker images, with vulnerability scanning, access control and networking safeguards.
CodeArtifact: Provides public and private library/dependency repositories (like Nexus). Python wheels/packages are stored securely using encryption, with access control and networking safeguards. Auditing of dependencies can be done, but there is no process for the approval of licence terms.
CodePipeline + CodeBuild: Provide basic CI/CD capabilities within your AWS network. However, Jenkins is better in combination with GitHub, but requires IaaS and access integration for control.
17. Feature Engineering, Training and Inference
MACHINE LEARNING
Amazon SageMaker Training: Provides secure training from within AWS. It supports running training from within a VPC, with access to private repositories and data stores. Provides GPU and memory encryption for GPU/data parallelism.
Amazon SageMaker Processing: Provides the ability to run Python code within Docker to do feature engineering. Provides lineage when used with the rest of SageMaker.
Amazon SageMaker Endpoints/Batch: Provide secure model hosting within a VPC for real-time inference or batch transform. Able to access private repositories and data stores. Supports access control and access over HTTPS.
Amazon SageMaker Model Monitor + Clarify: In combination they can evaluate and detect issues within the features, or bias in the accuracy of the predictions, triggering alerts for investigation or retraining.
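The kind of check that monitoring tools like Model Monitor automate can be sketched in a few lines (the baseline values and threshold are illustrative assumptions): compare a feature's live distribution with its training baseline and alert on drift:

```python
import statistics

# Training-time baseline for one feature (illustrative values).
BASELINE_MEAN = 50.0
BASELINE_STDEV = 10.0
DRIFT_THRESHOLD = 3.0  # alert if the live mean drifts > 3 baseline stdevs

def detect_drift(live_values: list) -> bool:
    """Return True if the live feature mean has drifted from the baseline."""
    live_mean = statistics.mean(live_values)
    z = abs(live_mean - BASELINE_MEAN) / BASELINE_STDEV
    return z > DRIFT_THRESHOLD

assert not detect_drift([48.0, 52.0, 50.5, 49.5])  # close to baseline: no alert
assert detect_drift([120.0, 130.0, 125.0])         # large shift: alert
```

A drift alert is the trigger for the investigation or retraining step described above.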
18. Data Storage, ETL and Data Access
DATA AND DATA PROCESSING
Redshift: Provides a relational data store in a columnar format for analytics. Supports multiple databases and schemas to control and centralise access to data.
S3: Using multiple buckets and folders to store data means you can limit access to certain sensitive fields in a data set, providing copies that are sanitised, and encrypted with different keys, to people who do not need the sensitive values.
EMR: Provides a managed Hadoop service with secure block storage in EBS or object storage in S3. Runs within a VPC and supports encryption between nodes.
Glue Job + Catalog: Fully "serverless" ability to run PySpark. Native integration with S3 and Redshift. Metadata storage in the Glue Catalog, with Lake Formation for fine-grained data access.
19. Using full âserverlessâ and Cloud Native abilities
INTEGRATION AND INFERENCE
Lambda: Provides the ability to run application code securely within a VPC without the need to maintain and patch servers. It is not possible to access the underlying instance.
API Gateway: Provides REST and WSS secure APIs using HTTPS, with access control via IAM and AWS Cognito (supporting SAML2 federation). Supports private access via VPC.
DynamoDB: NoSQL data store that is good for storing hot reference data and watermarking. This helps to prevent model extraction, as not all inputs are publicly known.
Step Functions: Provides orchestration and workflow management of batch processes. Lots of native integration with other services, using access control in IAM to access them. You do not have access to the underlying instance. Airflow is an alternative with wider integrations with other vendors like Databricks; however, Airflow is either a managed service or needs deploying via IaaS.
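The watermarking idea mentioned above can be sketched as follows (the scheme and values are illustrative assumptions, not from the talk): keep a private set of "canary" inputs whose expected predictions are known, so a suspected stolen model can be fingerprinted:

```python
# Hypothetical watermark check: a private set of canary inputs with the
# predictions our model is known to give. A suspect model that agrees on
# nearly all canaries was likely extracted/copied from ours.
CANARIES = {
    (0.12, 7.0): "approve",
    (9.81, 1.5): "decline",
    (3.33, 3.3): "approve",
}

def watermark_match_rate(suspect_predict) -> float:
    """Fraction of private canary inputs where the suspect model agrees."""
    hits = sum(1 for x, label in CANARIES.items() if suspect_predict(x) == label)
    return hits / len(CANARIES)

# A copied model reproduces our canary behaviour exactly.
copied = lambda x: CANARIES[x]
assert watermark_match_rate(copied) == 1.0

# An independently trained model is unlikely to match on every canary.
independent = lambda x: "approve"
assert watermark_match_rate(independent) < 1.0
```

Keeping the canary set in a private store such as DynamoDB is what makes the fingerprint meaningful: the inputs are never publicly known.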
20. Solid foundational layers are required to accelerate and scale delivery
DATA AND INFRASTRUCTURE
Foundations:
► A solid data lake/warehouse with good sources of data is required for long-term scaling of ML usage.
► Running models operationally also means considering availability, fault tolerance and scaling of instances.
► Having a robust security posture using multiple layers, with auditability, is essential.
► Consistent architecture, development approaches and deployments aid maintainability.
Scaling and refinement:
► Have your models improved, and do they still meet, the outcomes and KPIs that you set out to affect?
► Have innovations in technology meant that complexity in development or deployment can be simplified, allowing more focus to be put on other uses of ML?
► Are your models running on the latest and most optimal hardware?
21.
► Events and talks focused on the Public Cloud and AWS
► Founded 2016, rebooted 2021
► 20-30 members
► Meets 3-4 times a year
https://www.meetup.com/Ipswich-AWS-User-Group/
@awsipswich
IPSWICH AWS USER GROUP