2. What is this talk about?
• Exploring concrete solutions – not YAPAT (Yet-Another-Privacy-Awareness-Talk)
• Introduce a grass-roots approach for application privacy
• What can we mix-and-match TODAY to start making progress before regulations hit you
• How AI (DNN specifically) can enable new capabilities in terms of hardening our apps privacy-wise
3. What is Privacy?
State or condition of being free from being observed or disturbed by other people.
4. Privacy-by-design Principles (by Prof. Ann Cavoukian)
Proactive not reactive; Preventative not remedial
Privacy as the default setting
Privacy embedded into design
Full functionality – positive-sum, not zero-sum
End-to-end security – full lifecycle protection
Visibility and transparency – keep it open
Respect for user privacy – keep it user-centric
5. 1.2B Personal Records Breached in 2017
Cost breakdown
• $140B Direct Cost
• + Class Action Liability
• + Lost Business
• + Regulator Penalties
6. Why “fixing” Privacy is challenging
• Cuts across Infrastructure, Data, Applications and Processes
• Has to address what’s inside and outside the perimeter
• Must coordinate with laws and regulations (e.g. GDPR, CCPA)
• Fairly new discipline: embryonic body of knowledge operating at 10K feet
• Lack of accessible tools to enable faster adoption
• Little insight on how to leverage security tools and techniques to introduce automation
• Requires strong and effective Governance model with CxO support
7. Agile PbD – The bazaar mindset (The Cathedral & the Bazaar - Eric Raymond)
• Agile adoption with a grass-roots strategy
• Leverages current enterprise security practices: threat modeling
• Builds on OS security tech that can bring value to the table: static and dynamic code analysis, behavioral analytics
• Plays nice with existing DevSecOps processes and toolchain: automate privacy controls
• Proactive vs Reactive – Accommodating regulatory demands instead of reacting “out of the blue”
8. PbD - Cathedral vs. Bazaar
Cathedral | Bazaar
Policy-driven implementation | Engineering-driven
Hierarchical: Owned by compliance, top-down information flow | Graph: Owned by engineering team and compliance. Everyone can contribute
Siloed – disconnected from the security architecture | Built around the existing security architecture and capabilities
Infrastructure and Data First, Applications as a second thought | Applications as first-class citizens
Mindset that buying COTS software will translate to solving the problem | Build and Adopt, buy as a last resort
10. Challenges
• Missing Dataset
• Scale and variety of APIs – No standards!
• Manual labeling too laborious and expensive
• Consumption-ready PII is not publicly available
• Lack of FOSS references for inspiration
12. Synthetic Dataset Generation: Bird’s Eye
[Diagram. Components shown: OpenAPI descriptors; API descriptor; compiled API descriptor; OpenAPI stack (REST request generation); PII types and their regexes; mock PII fields generation; mock REST request generation; unlabeled request; automatic labeling; oversampling; labeled mock API requests.]
13. Synthetic Dataset Generation: Flow
• The OpenAPI descriptor gets compiled
• PrivAPI takes over request generation
• Instead of sending the request over the wire, it generates a mock request containing mocked fields based on the specified format (e.g. SSN, dates)
• Labels the mock request based on trigger words
• Oversamples the minority class (i.e. PII requests)
• Saves the labeled mock requests as the dataset
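The flow above can be sketched in a few lines of Python. This is a minimal illustration, not PrivAPI's actual code: the field names, value formats, trigger words and helper names (`mock_request`, `label`, `oversample`, `build_dataset`) are all assumptions made for the example, and the hard-coded field lists stand in for what a compiled OpenAPI descriptor would provide.

```python
import random
import re

# Hypothetical PII field generators; the slides only name SSN and dates as examples.
PII_GENERATORS = {
    "ssn": lambda rng: f"{rng.randint(100, 899):03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}",
    "birth_date": lambda rng: f"{rng.randint(1950, 2005)}-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",
}
# Hypothetical non-PII fields, used to build the majority class.
NON_PII_GENERATORS = {
    "page": lambda rng: str(rng.randint(1, 50)),
    "sort": lambda rng: rng.choice(["asc", "desc"]),
}
# Assumed trigger words: a request mentioning any of them is labeled as PII.
TRIGGER_WORDS = re.compile(r"ssn|birth_date", re.IGNORECASE)


def mock_request(rng, path, fields):
    """Build a mock request string instead of sending anything over the wire."""
    generators = {**PII_GENERATORS, **NON_PII_GENERATORS}
    query = "&".join(f"{name}={generators[name](rng)}" for name in fields)
    return f"GET {path}?{query}"


def label(request):
    """Return 1 if the request contains a trigger word, else 0."""
    return 1 if TRIGGER_WORDS.search(request) else 0


def oversample(rows, rng):
    """Duplicate random minority-class rows until both classes have equal size."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra


def build_dataset(seed=0):
    """Generate, label and balance a small mock dataset of (request, label) rows."""
    rng = random.Random(seed)
    requests = [mock_request(rng, "/orders", ["page", "sort"]) for _ in range(8)]
    requests += [mock_request(rng, "/customers", ["ssn", "birth_date"]) for _ in range(2)]
    rows = [(req, label(req)) for req in requests]
    return oversample(rows, rng)
```

Calling `build_dataset()` returns a balanced list of `(request, label)` pairs that could then be written to disk. Random duplication is only one way to oversample; the slides do not say which method is used.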
14. Model Training: Bird’s Eye
[Diagram. Components shown: labeled API requests dataset; labeled mock API request; vocabulary creation; vectorize mock request; embeddings; training (Keras + TensorFlow); LSTM deep neural network; produces the analytics model.]
15. Model Training: Flow
a) PrivAPI Dataset generated in the previous step gets loaded
b) Vocabulary is created from it
c) Vector embeddings are calculated for every API request
d) LSTM Deep Neural Network is trained on the labeled API requests
e) Analytics model is saved
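Steps b) and c) can be illustrated without any deep-learning dependency. The sketch below is an assumption about how the vocabulary and vectorization might look, not the talk's implementation: the tokenizer regex, the reserved `PAD`/`UNK` ids and the function names are hypothetical. It produces fixed-length integer sequences, which is the kind of input a Keras `Embedding` layer followed by an `LSTM` layer consumes.

```python
import re
from collections import Counter

PAD, UNK = 0, 1  # reserved ids for padding and unknown tokens (assumed convention)


def tokenize(request):
    """Split a request string into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", request.lower())


def build_vocabulary(requests, max_size=10000):
    """Map the most frequent tokens to integer ids, starting after PAD and UNK."""
    counts = Counter(tok for req in requests for tok in tokenize(req))
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_size))}


def vectorize(request, vocab, max_len=16):
    """Turn a request into a fixed-length list of token ids, truncating or padding."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(request)][:max_len]
    return ids + [PAD] * (max_len - len(ids))
```

The vocabulary has to be saved next to the trained model, because classification must map tokens to the same ids that were used during training.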
16. Classifying: Bird’s Eye
[Diagram. Components shown: real-world API traffic; API request; vectorization; embeddings; prediction (Keras + TensorFlow); LSTM deep neural network, which consumes the analytics model; "is PII" classification.]
17. Classifying: Flow
a) Vocabulary generated in the previous step is loaded
b) Analytics model (LSTM) created in the previous step is loaded
c) Target “real” API request is read and vectorized
d) Prediction task is executed for the API request
e) Prediction result (whether or not the submitted API request contains PII) is presented
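The classification steps can be sketched as below. To keep the example runnable without TensorFlow, the trained LSTM is replaced by a stand-in scoring function. The inline vocabulary, the `classify` helper, the 0.5 threshold and `stand_in_predict` are all assumptions for illustration; in the pipeline described above, `predict` would wrap the loaded analytics model.

```python
import json
import re

PAD, UNK = 0, 1  # assumed reserved ids; must match the ones used during training


def vectorize(request, vocab, max_len=16):
    """Turn a request into a fixed-length list of token ids."""
    tokens = re.findall(r"[a-z0-9_]+", request.lower())
    ids = [vocab.get(tok, UNK) for tok in tokens][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def classify(request, vocab, predict, threshold=0.5):
    """Vectorize the request, score it, and report whether it looks like PII.

    `predict` is any callable mapping a list of token ids to a score in [0, 1].
    """
    score = predict(vectorize(request, vocab))
    return {"request": request, "score": score, "is_pii": score >= threshold}


# Stand-in for the vocabulary file saved at training time (hypothetical content).
vocab = json.loads('{"get": 2, "customers": 3, "ssn": 4, "orders": 5, "page": 6}')


def stand_in_predict(ids):
    """Placeholder for the trained model: scores high when the 'ssn' token is present."""
    return 0.9 if vocab["ssn"] in ids else 0.1


result = classify("GET /customers?ssn=123-45-6789", vocab, stand_in_predict)
print(result["is_pii"], result["score"])
```

Passing the model in as a callable keeps the vectorization and thresholding logic testable on its own, independent of how the model is stored or served.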