SRE[in]con 2019
1. Building Secure and Scalable Machine Learning Pipelines: Challenges and Security Patterns
Anamitra Dutta Majumdar
Thomas Goetze
2. Talk Outline
• ML use cases at LinkedIn
• Phases in ML pipelines and infrastructure
• Security risks and security solution patterns
• Scalability and security control challenges
3. Acknowledgements
• SRE, Data APA, Foundation, and AI teams for partnering in developing security controls for the different phases of the machine learning pipelines.
• House Security Team for spearheading some of the cutting-edge security initiatives at LinkedIn to scale the pace of business innovation in a secure manner.
• Marius Seritan from the Relevance Infra team for helping us put the slides together during the initial phases of preparation.
4. Machine Learning
Model
• Trained on sample data
• Used to make a statistical prediction, e.g. 23% chance a user will click on a job, 10% chance of non-professional content
Feature
• Input to a model
• Category (accountant job, programming job)
• Numeric (years of experience)
• Term vectors (contains "dev ops")
• Construction
• Reduction
5. Machine Learning Use Cases at LinkedIn
PYMK
• Started 12 years ago
• One of the early growth mechanisms
• Heavy innovation in online serving (Gaia), Venice Compute, Graph Convolutional Network
Feed
• Leader in experimentation velocity, hundreds of experiments per quarter
• Custom deployment, dark canary
Job Search
• Multi-layered models
• Generalized linear ensemble
• TensorFlow embeddings
6. Pro-ML Initiative
Tooling to manage all aspects of machine learning.
Unified product and frameworks to reduce routine work.
GOAL: Double the productivity of machine learning engineers.
7. Phases in Machine Learning Pipeline and Pro-ML
Primary security concerns:
• Data exfiltration
• Unauthorized data access
Pipeline components: Feature Engineering, Model Artifact Store, Model Health Assurance, Model Training Compute, Model Deployment Executor, Model Scoring and Selection, Hyperparameter Tuning, Model Explorer, Model Registry, Model Inference Engine, AI Metadata Hub, Pro-ML Workspace App, Plugins/SDKs/Pipeline Framework, Model Runtime Environment, Data Store.
Phases: Experimentation (Model Training and Evaluation), Model Deployment, Model Serving and Inference.
8. Data Storage and Management Infrastructure
• Data Sources: 3rd-party services through GAAP
• Data Ingestion: Gobblin
• Data Storage: Espresso, Oracle DB, HDFS, Venice
• Dataset Access Management Layer: Espresso, Dali View
Compute Orchestration Infrastructure
• A/B testing
• Cluster Management: YARN, K8s master
• Compute Engines
• Workflow Orchestration: Azkaban
• Use cases: Relevance, Analytics, Reporting
11. ML Pipeline Security Challenges
• Experimentation Phase: unauthorized data access, sensitive data leakage
• Model Training Phase: sensitive data generation and leakage
• Deployment Phase: unauthorized model actions, leakage of sensitive models
• Inference Phase: DoS, security misconfiguration, membership inference, vulnerabilities
12. Security Controls and Patterns
Experimentation
• Access controls
• Encryption
• Privacy-preserving libraries
• Feature sensitivity annotation
Model Training
• Encryption
• Authenticated, authorized, and automated flows
• Cleanup of sensitive intermediate datasets
• ML classifiers
Model Deployment
• Dual verification and multi-factor authentication
• Model randomization
• Use of synthetic data
Model Inference
• Visibility into workloads
• Segregation of workloads
• Secure configuration
• Model Health Assurance
13. Examples of Security Controls in the Pro-ML Deployment Phase and Runtime Environment
• Publication requires read access by a privileged account, opt-in policy
• Runtime access to wormhole for reading model artifacts
• Current: encrypted keytab
• Future: service principals using KSudo
14. Dev and EI Validation
In theory, trained models should work for inference just as they were trained.
In reality, the code that uses the models can be non-trivial, and developers need access to models to test in development environments.
How to allow models in EI and DEV:
• Validate no PII in the models
• Obfuscation
• Randomization
15. Challenges: Heterogeneous Authentication Controls for the Offline and Online Worlds
OFFLINE GRID
• Name Node, Data Node
• Distributed ML jobs, ML job scheduler
• Kerberos, delegation tokens, block access tokens
ONLINE
• Service A and Service B authenticate with mutual TLS
• X509 certificates
16. Security Control Pattern: Heterogeneous Authentication and Authorization Control Pattern
Components: Translator Service, Identity Management System, Secret Store, Distributed Compute, Distributed Storage, Service A, Service B
18. Key Takeaways for Building ML Pipeline Security Patterns
Segregation of Infrastructure
• Segregation of storage and computation
• Segregation based on workload sensitivity
• Control plane and data plane components
AI Metadata System
• Model training and inference-time security threats and requirements
• Centralized feature metadata system
Monitoring
• Continuous monitoring
• Scanning
• Security metrics
Security Infrastructure
• Efficient identity management platform wrappers and access layers
• Scalable key management system
Security Control Scaling
• Engineer and operationalize the automation of the security controls
Notes: Major use cases of ML to solve business problems
High Level Phases in ML pipelines
Security risks and security solution patterns in each phase
Scalability challenges in applying security controls to various distributed systems involved in the pipeline
High-level phases in any machine learning pipeline, with each phase acting on data primarily collected from events generated by member activity:
Experimentation phase, where data scientists perform data exploration, problem formulation, feature engineering, and model authoring.
Model Training phase, where the model is trained on a set of features and the best model is selected and pushed to the model artifact store.
Model Deployment phase, where the tooling instructs an online service when a new model is available for serving.
In the last phase, the model is fetched from the model artifact store and loaded to the deployment target, typically an online service, for inference.
One product initiative to unify all workflows within a single pane of glass, with the goal of boosting productivity.
Security is a cross cutting concern in each of the layers and phases
Applying controls is challenging when full service or user attribution is required across phases and infrastructure.
Rule based Authorization enforced at Data access layer
Analytics infra deals with data coming from various sources, which is ingested by highly efficient data ingestion platforms, stored on our PB-scale HDFS, and abstracted as cluster-agnostic logical datasets.
Risks:
Model Experimentation
Unauthorized data access at various points in the pipeline at various phases
Data leakage as part of some of the automated flows into less secure regions in the environment
Combination of two or more non-PII datasets to generate PII Features.
Model Training Phase
Leakage of sensitive data generated in an intermediate training or feature transformation phase.
Models learning PII
Model Deployment
Unauthorized Model actions like Model publication to production environments
Use of vulnerable components as the effect is immediately amplified
Movement of sensitive models into less sensitive areas of the infrastructure
Inference phase:
Model runtime misconfiguration or vulnerable component use leading to resource exhaustion and hence DoS
Authentication of Model Runtime environment to fetch models
Model performance degradation
Access Controls:
In the case of datasets, access controls are implemented based on roles and rules.
Roles are enforced when generating an authenticator (such as a token) that uses a single source of identity truth for entitlement determination; rules are enforced based on the attributes of the target. The pattern is to enforce such rules in a data access layer and to set attributes on the dataset in a metadata hub, with the rules based on those attributes.
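The rule pattern above can be sketched minimally as follows. All names here (datasets, roles, the shape of the metadata hub) are hypothetical placeholders, not LinkedIn's actual schema; the point is that authorization consults dataset attributes from a metadata hub and role entitlements from a single identity source, and fails closed.

```python
# Hypothetical sketch of rule-based authorization in a data access layer.
# Dataset attributes live in a metadata hub; rules are evaluated against
# those attributes rather than being hard-coded per dataset.

# Metadata hub: dataset name -> attributes (illustrative values)
METADATA_HUB = {
    "member_activity": {"sensitivity": "high", "contains_pii": True},
    "job_postings":    {"sensitivity": "low",  "contains_pii": False},
}

# Role entitlements from a single source of identity truth (illustrative)
ROLE_ENTITLEMENTS = {
    "relevance-engineer": {"max_sensitivity": "low"},
    "privacy-reviewer":   {"max_sensitivity": "high"},
}

_LEVELS = {"low": 0, "high": 1}

def authorize_read(role: str, dataset: str) -> bool:
    """Rule: a role may read a dataset only if the dataset's sensitivity
    attribute does not exceed the role's entitlement. Unknown roles or
    datasets fail closed."""
    attrs = METADATA_HUB.get(dataset)
    caps = ROLE_ENTITLEMENTS.get(role)
    if attrs is None or caps is None:
        return False
    return _LEVELS[attrs["sensitivity"]] <= _LEVELS[caps["max_sensitivity"]]
```

Keeping the rule logic in one access layer means a new dataset only needs its attributes registered in the metadata hub, not a new code path.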
Compute engines for training and inference phases should authorize and authenticate to the data by using system level identities.
Encryption:
Sensitive data should be encrypted at rest and in motion. In use cases where encryption cannot be used, the data should be decrypted for the shortest possible period while in use. It is worth mentioning that there are experimental efforts underway to train models on homomorphically encrypted data.
Use privacy-preserving libraries at training time to ensure PII is not leaked through the combination of non-PII fields. The pattern is to integrate these tools and libraries into the model training phase of the pipeline.
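A crude sketch of the underlying risk check: individually non-PII columns (quasi-identifiers) can re-identify people when combined, as in the well-known ZIP + birth date + gender linkage result. The column names and threshold below are illustrative assumptions, not a real privacy library's API.

```python
# Hypothetical sketch: flag dataset combinations whose joint
# quasi-identifiers could re-identify members, even though each
# dataset alone is considered non-PII.

QUASI_IDENTIFIERS = {"zip_code", "birth_date", "gender"}  # illustrative

def combination_risky(columns_a, columns_b, threshold: int = 3) -> bool:
    """Return True if the combined datasets expose at least `threshold`
    quasi-identifier columns (a crude linkage-risk heuristic)."""
    combined = (set(columns_a) | set(columns_b)) & QUASI_IDENTIFIERS
    return len(combined) >= threshold
```

A real privacy-preserving library would use much richer measures (k-anonymity, differential privacy budgets), but the pipeline hook is the same: run the check before a join is materialized in the training phase.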
Model Deployment Phase
Model deployment actions like publish and test need verification. For test deployments, randomization techniques prevent leakage of PII to less secure environments. Ideally, tests should be performed using synthetic data.
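One way the randomization technique might look, as a minimal sketch (the function name and noise scale are assumptions; a real implementation would operate on the actual model artifact format): perturb the weights so the exported artifact still exercises the serving code path in dev/EI, but no longer reproduces anything memorized from training data.

```python
# Hypothetical sketch: perturb model weights before shipping an artifact
# to a dev/EI environment, so memorized training data cannot be recovered
# from it while the serving code path is still testable.

import random

def randomize_weights(weights, noise_scale: float = 0.5, seed=None):
    """Add bounded uniform noise to each weight; the perturbed model keeps
    the same shape but different values."""
    rng = random.Random(seed)
    return [w + rng.uniform(-noise_scale, noise_scale) for w in weights]
```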
Model Inference
Visibility into compute workloads of various sensitivity levels to look for abnormal traffic patterns
Segregate workloads based on the sensitivity level of the data they handle, and define key performance metrics based on threat severity.
Model runtime configuration should be secure. One of the tests to ensure this would be running security benchmarks.
An important aspect to monitor is model accuracy degradation over time.
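A minimal sketch of such a monitor, assuming labeled feedback eventually arrives for each prediction (class name, window size, and thresholds are all illustrative): track a rolling accuracy and alert when it falls below the offline baseline by more than a tolerance.

```python
# Hypothetical sketch: rolling-window monitor that flags when online
# model accuracy degrades below an offline baseline.

from collections import deque

class AccuracyMonitor:
    def __init__(self, baseline: float, tolerance: float = 0.05,
                 window: int = 100):
        self.baseline = baseline      # accuracy measured at training time
        self.tolerance = tolerance    # allowed drop before alerting
        self.outcomes = deque(maxlen=window)  # recent prediction outcomes

    def record(self, correct: bool) -> None:
        self.outcomes.append(1 if correct else 0)

    def degraded(self) -> bool:
        """True once rolling accuracy drops below baseline - tolerance."""
        if not self.outcomes:
            return False
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.baseline - self.tolerance
```

In practice the alert would feed the same monitoring stack as the other pipeline metrics, since degradation can signal data drift or an attack as well as ordinary staleness.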
Online services use certificates for authentication, while offline jobs use Kerberos. There are gaps in full user attribution within the context of an operation that spans online and offline entities. This is an issue in cases where an online service needs to contact infrastructure components that rely on token-based authentication and authorization. A good example of this in the actual pipeline is the Model Runtime Environment pulling model files from the Model Artifact Store.
Introduce a translator service that understands online and offline authentication constructs.
The translator services must preserve full user attribution during translation
The translator service must consult a single source of identity truth trusted by both online and offline
Examples are Ksudo and PKInit
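The translator-service requirements above can be sketched as follows. This is not how KSudo or PKInit work internally; it is a toy illustration (the identity map, secret, and token format are all invented) of the two invariants: the translation consults a single source of identity truth, and the resulting offline token still names the original principal.

```python
# Hypothetical sketch of a translator service: exchange an online identity
# (an mTLS certificate subject) for a signed offline-style token, while
# preserving full attribution of the original principal.

import hashlib
import hmac

# Single source of identity truth trusted by both worlds (illustrative)
IDENTITY_TRUTH = {"svc-model-runtime": "urn:li:service:model-runtime"}

_SIGNING_KEY = b"demo-signing-key"  # in practice, from the secret store

def translate(cert_subject: str) -> str:
    """Map a certificate subject to a signed token that still names the
    original principal, so attribution survives the translation."""
    principal = IDENTITY_TRUTH.get(cert_subject)
    if principal is None:
        raise PermissionError(f"unknown identity: {cert_subject}")
    sig = hmac.new(_SIGNING_KEY, principal.encode(),
                   hashlib.sha256).hexdigest()[:16]
    return f"{principal}|{sig}"
```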
Scalability challenges arise when applying security controls due to the tight coupling of compute and storage in the GRID.
Machine learning workloads are inherently distributed in nature. To apply security controls in a scalable manner, important security boundaries need to be determined based on the blast radius of compromise, data leakage, and exfiltration risks. A high-level guideline is to divide the distributed systems along the following dimensions:
Storage tier and compute tier
control plane and data plane components
Sensitivity level of workloads
ML Domain understanding
Some of the security risks are very specific to the ML domain, such as model performance degradation over time or models learning PII during training, so it is important to understand the purpose of each model, the features it is trained on, and so forth.
Another important aspect is to maintain an AI metadata system that supports sensitivity annotation of the various types of data used in the pipeline. Such annotations are extremely useful for decision making around dataflow between different points in the pipeline.
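As a minimal sketch of how those annotations drive dataflow decisions (the region names, datasets, and clearance levels are all hypothetical): a dataset may only flow into a region whose clearance covers the dataset's annotated sensitivity, and anything unannotated is treated as most sensitive.

```python
# Hypothetical sketch: consult AI-metadata sensitivity annotations before
# allowing data to flow into a less secure region of the pipeline.

REGION_CLEARANCE = {"prod": 2, "ei": 1, "dev": 0}   # illustrative levels
DATA_SENSITIVITY = {                                # from the metadata hub
    "raw_member_events":   2,
    "aggregated_features": 1,
    "synthetic_eval_set":  0,
}

def flow_allowed(dataset: str, target_region: str) -> bool:
    """Allow the flow only if the region's clearance covers the dataset's
    annotated sensitivity; unannotated data is assumed most sensitive."""
    sensitivity = DATA_SENSITIVITY.get(dataset, 2)   # fail closed
    clearance = REGION_CLEARANCE.get(target_region, 0)
    return sensitivity <= clearance
```

This is also where the dev/EI validation story connects: synthetic or randomized artifacts carry a low annotation and are therefore the only things allowed to flow into DEV.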
Monitoring
Monitor at various levels and points of the pipeline, across the different phases, capturing a full audit of events performed by human and service principals.
Available and Scalable Security Infrastructure
The identity management system is the core for establishing user attribution based on various authenticators. It is imperative for the identity management system to help generate identities and authenticators for all phases, and, in cases where a workflow spans phases, to help with authenticator translation.
The key management system is essential for storing the encryption keys, private keys, and tokens required at various levels.
Both the identity management and key management systems should be highly available, to prevent a single point of failure, and highly scalable, to meet the demands of calls from massively distributed systems.
Security Control Scaling through Automation
Additionally, functions like scaling, authenticator generation based on understanding of the target use, and detection of PII in unwanted areas need to be automated.