This presentation, part talk and part practical demonstration, applies Privacy-by-Design (PbD) to a typical software application as part of a Secure Development Lifecycle, with a live demo showing how artificial intelligence (AI) can contribute to the process.
DevSecCon London 2019 - Achieve AI-Powered API Privacy Using Open Source
1. Achieve AI-Powered API Privacy
Using Open Source
Gianluca Brigandi
CEO : Atricore Inc. / Veridax
gianluca@veridax.com
2. Exploring concrete solutions: not YAPAT (Yet-Another-Privacy-Awareness-Talk)
Introduce a grass-roots approach for application privacy
What can we mix-and-match TODAY to start making progress before regulations hit you
How AI (DNN specifically) can enable new capabilities in terms of hardening our apps privacy-wise
What is this talk about?
3. About me: Just a curious guy
Developer, security researcher, entrepreneur and open source contributor
During the past 15 years I’ve architected products at the intersections of privacy, application and container security, Identity & Access Management and AI
Introduced first model-driven security solution back in 2011 (security-as-code but visual) with Fortune 500 and Defense clients
R&D on automating application privacy during the past couple of years
First computer: Commodore 64!
4. What is Privacy?
“State or condition of being free from being observed or disturbed by other people.”
5. Privacy-by-design Principles
(by Prof. Ann Cavoukian)
Proactive not reactive; Preventative not remedial
Privacy as the default setting
Privacy embedded into design
Full functionality – positive-sum, not zero-sum
End-to-end security – full lifecycle protection
Visibility and transparency – keep it open
Respect for user privacy – keep it user-centric
6. 1.2B Personal Records Breached in 2017
Cost breakdown
$140B Direct Cost
+ Class Action Liability
+ Lost Business
+ Regulator Penalties
7. Why “fixing” Privacy is challenging
Cuts across Infrastructure, Data, Applications and Processes
Has to address what’s inside and outside the perimeter
Must coordinate with laws and regulations (e.g. GDPR, CCPA)
Fairly new discipline: embryonic body of knowledge operating at 10K feet
Lack of accessible tools to enable faster adoption
Little insight on how to leverage security tools and techniques to introduce automation
Requires strong and effective Governance model with CxO support
8. Agile PbD – The bazaar mindset
(The Cathedral & the Bazaar - Eric Raymond)
Agile adoption with a grass-roots strategy
Leverages current enterprise security practices: threat modeling
Builds on open source (OS) security tech that can bring value to the table: static and dynamic code analysis, behavioral analytics
Plays nice with existing DevSecOps processes and toolchain: automate privacy controls
Proactive vs Reactive: accommodating regulatory demands in advance instead of reacting “out of the blue”
9. PbD - Cathedral vs. Bazaar
Cathedral: Policy-driven implementation
Bazaar: Engineering-driven
Cathedral: Hierarchical, owned by compliance, top-down information flow
Bazaar: Graph, owned by engineering team and compliance. Everyone can contribute
Cathedral: Siloed, disconnected from the security architecture
Bazaar: Built around the existing security architecture and capabilities
Cathedral: Infrastructure and Data first, Applications an afterthought
Bazaar: Applications as first-class citizens
Cathedral: Mindset that buying COTS software will translate to solving the problem
Bazaar: Build and Adopt, buy as a last resort
11. Challenges
Missing Dataset
Scale and variety of APIs – No standards!
Manual labeling too laborious and expensive
Consumption-ready PII is not publicly available
Lack of FOSS references for inspiration
13. Synthetic Dataset Generation: Bird’s Eye
[Diagram: OpenAPI descriptors are compiled by the OpenAPI stack and feed mock REST request generation, which uses PII types and their regexes for mock PII field generation; unlabeled requests then go through automatic labeling and oversampling, yielding labeled mock API requests.]
14. Synthetic Dataset Generation: Flow
OpenAPI descriptor gets compiled
PrivAPI takes over request generation
Instead of sending the request over the wire, it generates a mock request containing mocked fields based on the specified format (e.g. SSN, dates)
Labels mock request based on trigger words
Oversamples minority class (i.e. PII requests)
Saves it
Note: It’s just a baseline. Augment it with “real world” data
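The flow above can be sketched in plain Python. This is only an illustrative sketch, not PrivAPI's actual code: the PII types, regexes, field names, trigger words, endpoint and the helper names (`mock_request`, `label`, `oversample`, `build_dataset`, `save`) are all assumptions made for the example, and the OpenAPI compilation step is replaced by a hard-coded list of fields.

```python
import json
import random
import re

# Hypothetical PII types: a mock-value generator plus the regex describing the format.
PII_TYPES = {
    "ssn": (
        lambda rng: f"{rng.randint(100, 899):03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}",
        re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    ),
    "birth_date": (
        lambda rng: f"{rng.randint(1950, 2005)}-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",
        re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    ),
}
NON_PII_FIELDS = ["quantity", "page", "sku", "status"]  # stand-in for descriptor fields
TRIGGER_WORDS = {"ssn", "birth_date"}  # assumed trigger words used for labeling


def mock_request(rng, with_pii):
    """Build a mock REST request as text instead of sending it over the wire."""
    body = {name: rng.randint(1, 500) for name in rng.sample(NON_PII_FIELDS, 2)}
    if with_pii:
        name = rng.choice(sorted(PII_TYPES))
        make_value, pattern = PII_TYPES[name]
        value = make_value(rng)
        assert pattern.match(value)  # the mocked field respects its declared format
        body[name] = value
    return "POST /api/v1/orders " + json.dumps(body, sort_keys=True)


def label(request):
    """Label a request 1 (PII) if any trigger word appears among its tokens, else 0."""
    tokens = set(re.findall(r"[a-z_]+", request.lower()))
    return int(bool(tokens & TRIGGER_WORDS))


def oversample(rows, rng):
    """Randomly duplicate minority-class rows until both classes have the same size."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def build_dataset(n=200, pii_ratio=0.1, seed=7):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        request = mock_request(rng, rng.random() < pii_ratio)
        rows.append((request, label(request)))
    balanced = oversample(rows, rng)
    rng.shuffle(balanced)
    return balanced


def save(rows, path):
    """Persist the labeled dataset as JSON lines."""
    with open(path, "w", encoding="utf-8") as fh:
        for request, is_pii in rows:
            fh.write(json.dumps({"request": request, "is_pii": is_pii}) + "\n")
```

Random duplication is the simplest form of oversampling; it balances the classes but adds no new information, which is one more reason to augment the baseline with real-world data.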
15. Model Training: Bird’s Eye
[Diagram: labeled mock API requests (the dataset) go through vocabulary creation and vectorization into embeddings; Keras + TensorFlow trains an LSTM deep neural network, which produces the analytics model.]
16. Model Training: Flow
A. PrivAPI Dataset generated in the previous step gets loaded
B. Vocabulary is created from it
C. Vector embeddings are calculated for every API request
D. LSTM Deep Neural Network is created by learning from API requests
E. Analytics model is saved
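Steps B to D can be sketched as follows. This is a minimal sketch under stated assumptions, not the talk's actual implementation: the tokenizer, the reserved ids, the layer sizes and the helper names (`tokenize`, `build_vocabulary`, `vectorize`, `build_model`) are illustrative choices. The Keras calls used (`Sequential`, `Embedding`, `LSTM`, `Dense`, `compile`) are standard; TensorFlow is imported lazily so the vocabulary helpers can be tried without it.

```python
import re
from collections import Counter

PAD, UNK = 0, 1  # reserved ids: padding and out-of-vocabulary tokens


def tokenize(request):
    """Split a request into lowercase word and number tokens (assumed scheme)."""
    return re.findall(r"[a-z_]+|\d+", request.lower())


def build_vocabulary(requests, max_size=5000):
    """Step B: map the most frequent tokens to integer ids, starting at 2."""
    counts = Counter(tok for req in requests for tok in tokenize(req))
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_size))}


def vectorize(request, vocab, max_len=32):
    """Step C (first half): turn a request into a fixed-length list of token ids."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(request)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def build_model(vocab_size, embedding_dim=32):
    """Steps C-D: embedding layer plus LSTM binary classifier (sizes are guesses)."""
    from tensorflow import keras  # lazy import; only needed for training

    model = keras.Sequential([
        # Learns a dense vector per token id; id 0 (padding) is masked out.
        keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True),
        keras.layers.LSTM(64),
        # Single sigmoid unit: probability that the request contains PII.
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

With a list of `(request, label)` pairs, training would then amount to building the vocabulary, calling `build_model(len(vocab) + 2)`, fitting on the vectorized requests with `model.fit`, and persisting the result with `model.save` (step E), along with the vocabulary so that classification can reuse the same token ids.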
17. Classifying: Bird’s Eye
[Diagram: an API request taken from real-world API traffic is vectorized into embeddings; Keras + TensorFlow runs the LSTM deep neural network, which consumes the analytics model and outputs a prediction: the “Is PII” classification.]
18. Classifying: Flow
A. PrivAPI analytics model generated in the previous step is loaded, along with the vocabulary
B. Analytics model (LSTM) created in the previous step is loaded
C. Target “real” API request is read and vectorized
D. Prediction task is executed for API request
E. Prediction results (whether the submitted API request does or does not contain PII) are presented
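A rough sketch of this flow is shown below. It assumes the vocabulary was saved as a JSON token-to-id mapping and that the trained model is exposed as a callable returning a PII probability; both are assumptions for illustration, as are the helper names (`load_vocabulary`, `classify`) and the 0.5 threshold. With Keras, the callable could wrap `keras.models.load_model(path)` and `model.predict`.

```python
import json
import re

PAD, UNK = 0, 1  # must match the ids reserved when the vocabulary was built


def tokenize(request):
    """Same assumed tokenization that was used at training time."""
    return re.findall(r"[a-z_]+|\d+", request.lower())


def vectorize(request, vocab, max_len=32):
    """Step C: turn the target request into a fixed-length list of token ids."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(request)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def load_vocabulary(path):
    """Step A: load the token-to-id mapping (assumed to be stored as JSON)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def classify(request, vocab, predict_proba, threshold=0.5, max_len=32):
    """Steps C-E: vectorize, score, and report whether the request contains PII.

    predict_proba is any callable mapping a token-id vector to a probability.
    For a Keras model it could be, for example:
        lambda v: model.predict(np.array([v]), verbose=0)[0][0]
    """
    vector = vectorize(request, vocab, max_len)
    score = float(predict_proba(vector))
    return {"is_pii": score >= threshold, "score": score}
```

Keeping the model behind a plain callable makes it easy to swap a stub in for testing, and to tune the threshold toward fewer missed PII requests or fewer false alarms.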
19. Going to Prod? Model quality is key
Synthetic dataset is just a baseline – augment with real world examples!
Introduce smarter (domain specific) labeling through custom ‘fakers’ and Natural Language Processing techniques (e.g. NER)
Get human feedback
Allow the model to continuously improve based on new data (online learning)