Continuous ML Integration & Delivery for Advanced Email Attack Detection
Jeshua Bratman & Justin Young
www.abnormalsecurity.com
Machine Learning CI/CD for Email Attack Detection

Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, the adversarial nature of the problem, and the scale of the data. To move quickly and adapt to the newest threats, we needed to build a Continuous Integration / Continuous Delivery pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to change any part of the stack, including joined datasets for hydration, feature extraction code, and detection logic, and to develop and train ML models.

In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.

  1. Continuous ML Integration & Delivery for Advanced Email Attack Detection. Jeshua Bratman & Justin Young
  2. The Detection Problem
     From: "Josephine Wright" <invoicing@edisonpower.com>
     To: "Tim James" <accounts@northwestmercyhospitals.com>
     Subject: "Invoice details for September electricity service"
     Hi Tim, September invoice is ready! Please pay the attached invoice amount of $883,000 for electricity services for Northwest Mercy Hospitals.
     ABA: 12321001 Routing#: 123456789
     -Jo
     Invoice Payment Fraud!
  3. The Detection Problem
     [Figure: email categories arranged along an axis labeled "More Damaging & Sophisticated & Rare".]
     Categories: Advanced Social Engineering (Business Email Compromise, Extortion, Compromised Employee, Invoice Fraud, Heists, Scam, Compromised Vendor); Phishing, Spear Phishing, Malware; Spam; Graymail; Legitimate Email.
     Frequency labels: ~25% of emails, ~25% of emails, ~50% of emails, <.1% of emails, <.01% of emails, < 1 in 100k emails, < 1 in a million emails, < 1 in 10 million emails.
  4. The Detection Problem
     This is a hard machine learning problem:
     1. Rarity of attacks
     2. Adversarial attack landscape
     3. High-dimensional data at high volume
     4. Need for extremely high precision and recall simultaneously
  5. Move Fast! Lightning-speed iteration to get ahead of new attacks.
     Don’t Break Things! We don’t want to stop catching old attacks.
     Continuous Integration and Delivery (CI/CD) for our ENTIRE ML Detection Engine.
  6. Part 2: CI/CD for a Machine Learning Detection Engine. How do we develop quickly without breaking things?
  7. Traditional CI/CD
     An engineer modifies code, the tests run (do the tests pass?), and the change is landed and deployed.
  8. What happens if we *do not* have CI/CD?
     ● No idea whether a code change breaks the system
     ● Engineers fixing each other’s bugs all the time
     ● Pushing bad code to production
     In modern software development it would be insane not to have CI/CD.
  9. Machine Learning CI/CD
     An ML engineer modifies code, models, and datasets. The change then goes through:
     ● Tests: do the tests pass?
     ● Rescoring analytics: is performance good?
     ● Model training: can new models train?
     ● Deployment
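The gated flow on slide 9 can be sketched as a list of checks that run in order and stop at the first failure. This is a minimal illustration, not the speakers' implementation: `Gate`, `run_gates`, and the lambda stand-ins are hypothetical names, and each check would call into the real test, rescoring, and training jobs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Gate:
    """One stage of the pipeline: a name plus a check that returns True on success."""
    name: str
    check: Callable[[], bool]


def run_gates(gates: List[Gate]) -> Tuple[bool, List[str]]:
    """Run gates in order and stop at the first failure.

    Returns (ok, names of the gates that passed).
    """
    passed: List[str] = []
    for gate in gates:
        if not gate.check():
            return False, passed
        passed.append(gate.name)
    return True, passed


# Hypothetical stand-ins for the real stages (assumptions for illustration only).
gates = [
    Gate("tests", lambda: True),                # Do the tests pass?
    Gate("rescoring_analytics", lambda: True),  # Is performance good?
    Gate("model_training", lambda: False),      # Can new models train?
]
ok, passed = run_gates(gates)
```

In this sketch deployment would proceed only when `ok` is true; here the training gate fails, so the change is not deployed.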
  10. What happens if we *do not* have CI/CD?
      ● Cannot safely change the system to fix a false negative (FN) or false positive (FP)
      ● May degrade the system unintentionally when shipping improvements
      ● Cannot know the overall impact of a new model on the entire system
      Most ML products run blind like this! It greatly hampers development speed and product stability.
  11. Adversarial!
      Invoice Payment Fraud!
      From: "Josephine Wright" <invoicing@edisonpower.com>
      To: "Tim James" <accounts@northwestmercyhospitals.com>
      Subject: "Invoice details for September electricity service"
      Hi Tim, September invoice is ready! Please pay the attached invoice amount of $883,000 for electricity services for Northwest Mercy Hospitals.
      ABA: 12321001 Routing#: 123456789
      -Jo
      New Attack Strategy: Billing Account Update Fraud!
      From: "Josephine Wright" <invoicing@edisonpovver.com>
      To: "Tim James" <accounts@northwestmercyhospitals.com>
      Subject: "Invoice details for September electricity service"
      Hi Tim, Just wanted to update you, we recently had to switch banks (long story) but our account number has changed for future invoices. See attached document for updated banking details.
      -Josephine
      Attachment: BankDetails.pdf
  12. OK, how would we use this?
      (The Billing Account Update Fraud email from slide 11, sent from invoicing@edisonpovver.com.)
      ● New or improved NLP models to identify language around changing bank accounts
      ● New code to parse PDFs and extract bank account numbers from them
      ● New counting features for how often a sender uses a particular domain, new feature extraction code, and a model that uses those features
  13. Machine Learning CI/CD Details
      Diagram: an ML engineer modifies the code, models, and datasets that make up the ML Detection Engine; the diagram also shows Labeled Samples, Rescoring Analytics, and Model Training.
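One way to picture the rescoring analytics step is to score the same labeled samples with the current system and with a candidate change, then compare precision and recall. The sketch below is an assumption-laden illustration: the sample layout, the `x` and `new_domain` features, the two scoring functions, and the 0.8 threshold are all hypothetical and do not come from the talk.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]


def precision_recall(
    samples: List[Tuple[Sample, bool]],
    score: Callable[[Sample], float],
    threshold: float,
) -> Tuple[float, float]:
    """Precision and recall of `score >= threshold` against the attack labels."""
    tp = fp = fn = 0
    for features, is_attack in samples:
        flagged = score(features) >= threshold
        if flagged and is_attack:
            tp += 1
        elif flagged and not is_attack:
            fp += 1
        elif not flagged and is_attack:
            fn += 1
    # With nothing flagged (or no attacks), report 1.0 rather than divide by zero.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical labeled samples: (features, is_attack).
samples = [
    ({"x": 0.9, "new_domain": 0.0}, True),
    ({"x": 0.6, "new_domain": 1.0}, True),
    ({"x": 0.7, "new_domain": 0.0}, False),
    ({"x": 0.1, "new_domain": 0.0}, False),
]


def baseline(s: Sample) -> float:
    return s["x"]


def candidate(s: Sample) -> float:
    # Candidate adds weight to the hypothetical "new_domain" feature.
    return s["x"] + 0.3 * s["new_domain"]


baseline_pr = precision_recall(samples, baseline, threshold=0.8)
candidate_pr = precision_recall(samples, candidate, threshold=0.8)
```

A gate could then require that the candidate's recall improves without precision dropping below the baseline.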
  14. Requirements of good CI/CD for ML
      Accurate
      ● Rescoring analytics reflect performance in production
      ● Training data is unbiased (including time travel to avoid future leakage)
      ML Engineer Effectiveness
      ● Easy and fast for engineers to run for retraining and evaluation
      ● Can add new models, datasets, and features easily
  15. Part 3: Designing the System. How do we build a CI/CD platform for our ML system that enables developers and also scales well?
  16. So how do we do this?
      This is a big data problem! Data, models, and code are all part of the software system we’re testing.
      So, we’ll use Spark to simulate our online system. But things get complicated fast...
      Diagram: Code, Models, and Datasets (ML Detection Engine); Labeled Samples; Rescoring Analytics; Model Training.
  17. A Familiar ML Story
      (The Billing Account Update Fraud email from slide 11, sent from invoicing@edisonpovver.com.)
      New counting features for how often a sender uses a particular domain, new feature extraction code, and a model that uses those features.
      A data scientist has a great new feature… but how do we safely get it into production?
      Domain Count Dataset:
          ...
          ("Josephine Wright", "edisonpower.com"): 1000,
          ("Josephine Wright", "edisonpovver.com"): 0,
          ...
  18. A Familiar ML Story
      A data scientist has a great new feature… but how do we safely get it into production?
      For just the new domain count feature:
      1. Domain Count Dataset
      2. Feature extraction code
      3. New sub-model?
      Domain Count Dataset:
          ...
          ("Josephine Wright", "edisonpower.com"): 1000,
          ("Josephine Wright", "edisonpovver.com"): 0,
          ...
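As a rough illustration of the "feature extraction code" item, the sketch below turns the Domain Count Dataset into numeric features for one message. The function names, the feature names, and the choice to flag a known sender on a never-seen domain are assumptions for illustration; the talk does not show the real extractor.

```python
from typing import Dict, Tuple

# (display name, sender domain) -> number of messages seen
DomainCounts = Dict[Tuple[str, str], int]


def sender_domain(address: str) -> str:
    """Return the lowercased domain part of an email address."""
    return address.rsplit("@", 1)[-1].lower()


def extract_domain_count_features(
    display_name: str, from_address: str, counts: DomainCounts
) -> Dict[str, float]:
    """Hypothetical counting features for a (display name, domain) pair."""
    domain = sender_domain(from_address)
    seen_with_domain = counts.get((display_name, domain), 0)
    seen_total = sum(n for (name, _), n in counts.items() if name == display_name)
    return {
        "sender_domain_count": float(seen_with_domain),
        "sender_total_count": float(seen_total),
        # Known display name arriving from a domain it has never used before.
        "sender_domain_is_new": 1.0 if seen_total > 0 and seen_with_domain == 0 else 0.0,
    }


counts = {
    ("Josephine Wright", "edisonpower.com"): 1000,
    ("Josephine Wright", "edisonpovver.com"): 0,
}
legit = extract_domain_count_features(
    "Josephine Wright", "invoicing@edisonpower.com", counts
)
spoof = extract_domain_count_features(
    "Josephine Wright", "invoicing@edisonpovver.com", counts
)
```

With the slide's example counts, the look-alike domain yields a zero domain count for a sender with a long history, which is the signal a downstream model could learn from.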
  19. What does it look like to test this new feature?
      In a typical software test, we can mock out complex dependencies. But for ML, we can’t mock the data!
      Does every data scientist have to become a data engineer?
      Diagram: the ML CI/CD diagram from slide 13, now with the Domain Count Dataset.
  20. Adding Our New Dataset
      What would it look like for our data scientist to add the new dataset?
      ● SparkFiles: download the dataset to disk on each executor
      ● Broadcast variable: broadcast the dataset in memory to each PySpark process
  21. Adding Our New Dataset
      Broadcast variable (broadcast the dataset in memory to each PySpark process):
          # Broadcast variable to every executor
          small_ip_dataset = {"1.2.3.4": 123, "5.6.7.8": 567}
          ip_broadcast = sc.broadcast(small_ip_dataset)
          # hydrate_with_ip_count can use the small_ip_dataset dictionary
          hydrated_rdd = rdd.map(
              lambda message: hydrate_with_ip_count(message, ip_broadcast.value)
          )
      SparkFiles (download the dataset to disk on each executor):
          import os
          from pyspark import SparkFiles
          # Add Spark file so that every executor will download it
          sc.addFile(remote_dataset_path)
          # Now the file can be loaded in any Spark operation from local_dataset_path.
          # The looked-up name drops the ".tar.gz" suffix, as on the slide.
          local_dataset_path = SparkFiles.get(
              os.path.basename(remote_dataset_path)[: -len(".tar.gz")]
          )
  22. Adding Our New Dataset
      What would it look like for our data scientist to add the new dataset?
      ● SparkFiles: download the dataset to disk on each executor
      ● Broadcast variable: broadcast the dataset in memory to each PySpark process
      ● Spark join: join large distributed datasets via Spark operations (the option shown for the Domain Count Dataset)
  23. Wait, what about time travel?
      Diagram: along a time axis, the counting feature hydrated up to time t has the value 50, while the same feature hydrated up to the earlier time t-x has the value 48.
  24. Feature Hydration With Time Travel
      The Domain Count Dataset is stored as daily counts; summing over time gives cumulative counts.
  25. Feature Hydration With Time Travel
      Events are time-bucketed and keyed.
  26. Feature Hydration With Time Travel
      Events are joined with the counts by key and time, producing hydrated events.
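The three steps on slides 24 to 26 can be sketched in plain Python, without Spark, to show the time-travel logic on its own. The function names and the event layout are hypothetical, and the choice to count only days strictly before the event's day is an assumption made here to rule out same-day and future leakage; the talk does not state the exact boundary.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # e.g. (display name, sender domain)


def running_totals(
    daily_counts: Dict[Tuple[Key, date], int]
) -> Dict[Key, List[Tuple[date, int]]]:
    """Sum over time: per key, a sorted list of (day, total through that day)."""
    by_key: Dict[Key, List[Tuple[date, int]]] = defaultdict(list)
    for (key, day), n in daily_counts.items():
        by_key[key].append((day, n))
    result: Dict[Key, List[Tuple[date, int]]] = {}
    for key, rows in by_key.items():
        total = 0
        running = []
        for day, n in sorted(rows):
            total += n
            running.append((day, total))
        result[key] = running
    return result


def count_as_of(running: List[Tuple[date, int]], event_day: date) -> int:
    """Total over days strictly before event_day (assumed leakage boundary)."""
    total = 0
    for day, cumulative in running:
        if day >= event_day:
            break
        total = cumulative
    return total


def hydrate(events: List[dict], daily_counts: Dict[Tuple[Key, date], int]) -> List[dict]:
    """Join by key and time: attach the as-of count to each event."""
    totals = running_totals(daily_counts)
    return [
        {**event, "count": count_as_of(totals.get(event["key"], []), event["day"])}
        for event in events
    ]


key = ("Josephine Wright", "edisonpower.com")
daily_counts = {
    (key, date(2020, 9, 1)): 48,
    (key, date(2020, 9, 2)): 2,
    (key, date(2020, 9, 5)): 10,
}
events = [
    {"id": "e1", "key": key, "day": date(2020, 9, 1)},
    {"id": "e2", "key": key, "day": date(2020, 9, 2)},
    {"id": "e3", "key": key, "day": date(2020, 9, 3)},
    {"id": "e4", "key": key, "day": date(2020, 9, 10)},
]
hydrated = hydrate(events, daily_counts)
```

The same event key therefore gets a different count depending on when the event occurred, which is what keeps rescoring and training data consistent with what the online system would have seen at that time.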
  27. Deep Dive: Re-hydrating Behavior Graph
          # Index every event by key and day, and take event ID to avoid passing around large objects
          keyed_event_id_rdd = _expand_events_by_key_day(event_rdd)

          # Index every count by key and day
          keyed_counts_rdd = _expand_counts_by_key_day(time_sliced_counts_rdds)

          # Join date-indexed event IDs with date-indexed counts, by common key
          joined_event_id_and_daily_counts_rdd = keyed_event_id_rdd.leftOuterJoin(keyed_counts_rdd)

          # In memory, sum up cumulative counts and key by event ID
          cumulative_counts_by_event_id_rdd = joined_event_id_and_daily_counts_rdd.flatMap(
              _extract_cumulative_counts
          )

          # Join actual events back in by event ID
          joined_event_and_cumulative_counts_rdd = cumulative_counts_by_event_id_rdd.join(
              event_rdd.keyBy(_get_id_from_event)
          )

          # Hydrate every event with cumulative counts
          hydrated_events_rdd = joined_event_and_cumulative_counts_rdd.map(
              _hydrate_event_with_counts
          )
  28. Back To Our ML Story
      So we can do all of this in Spark. But no data scientist should ever have to think about this!
      Data engineers should go to great efforts to provide a simple platform that hides these details.
      Data scientists should spend as much time as possible doing data science.
      (The slide repeats the Domain Count Dataset excerpt and the re-hydration code from slide 27.)
  29. Re-scoring Is Part of the MLOps Platform
      Data engineers have to make re-scoring as easy to use as traditional CI/CD.
      This means providing a playbook that’s as easy as adding unit tests.
      Domain Count Dataset:
          ...
          ("Josephine Wright", "edisonpower.com"): 1000,
          ("Josephine Wright", "edisonpovver.com"): 0,
          ...
  30. Re-scoring Is Part of the MLOps Platform
      Data engineers have to make re-scoring as easy to use as traditional CI/CD.
      This means providing a playbook that’s as easy as adding unit tests.
          class TimeSlicedStatsEventHydrater(Generic[Stat, Event]):
              # Class for building set of stats to lookup
              _lookup_stats_builder: LookupStatsBuilder
              # How to hydrate the Event with the Stats
              _hydrate_event: EventHydrater
              # Takes in an event and returns the date on which it occurred
              _get_date_from_event: DateExtractor
              # Takes in an event and returns its ID
              _get_id_from_event: IdExtractor
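To make the "as easy as adding unit tests" idea concrete, here is a hedged sketch of how a data scientist might plug four small callables into a hydrater of this shape. `HydraterConfig` and its `hydrate` method are hypothetical and are not the `TimeSlicedStatsEventHydrater` API from the slide, whose methods are not shown; the in-memory lookup stands in for the distributed join the platform would run.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Generic, TypeVar

Stat = TypeVar("Stat")
Event = TypeVar("Event")


@dataclass
class HydraterConfig(Generic[Stat, Event]):
    """Hypothetical single-process analogue of the four fields on the slide."""
    lookup_stats: Callable[[Event, date], Stat]    # stats visible as of a date
    hydrate_event: Callable[[Event, Stat], Event]  # how to attach stats to an event
    get_date_from_event: Callable[[Event], date]   # when the event occurred
    get_id_from_event: Callable[[Event], str]      # event ID (a join key in Spark)

    def hydrate(self, event: Event) -> Event:
        stat = self.lookup_stats(event, self.get_date_from_event(event))
        return self.hydrate_event(event, stat)


# Hypothetical daily counts keyed by (display name, domain).
counts_by_day = {
    ("Josephine Wright", "edisonpower.com"): {date(2020, 9, 1): 48, date(2020, 9, 2): 2},
}


def lookup(event: dict, day: date) -> int:
    daily = counts_by_day.get((event["sender"], event["domain"]), {})
    # Only days strictly before the event's day, to avoid future leakage.
    return sum(n for d, n in daily.items() if d < day)


config = HydraterConfig(
    lookup_stats=lookup,
    hydrate_event=lambda event, stat: {**event, "domain_count": stat},
    get_date_from_event=lambda event: event["day"],
    get_id_from_event=lambda event: event["id"],
)
hydrated_event = config.hydrate(
    {"id": "e1", "sender": "Josephine Wright", "domain": "edisonpower.com", "day": date(2020, 9, 2)}
)
```

The point of the design is that the data scientist writes only the small functions, while the platform owns the time slicing and joins.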
  31. Requirements of good CI/CD for ML
      Accurate
      ● Rescoring analytics reflect performance in production
      ● Training data is unbiased (including time travel to avoid future leakage)
      ML Engineer Effectiveness
      ● Easy and fast for engineers to run for retraining and evaluation
      ● Can add new models, datasets, and features easily
      Data Engineer Jobs-to-be-done
      ● Provide a simple API that just works
      ● Make the system efficient enough to run both on a regular schedule and ad hoc
  32. What happens if we DO have CI/CD?
      ● Quickly iterate
      ● Know if things break
      ● Train models on old examples
      You will have a better and more flexible product.
      You will be able to address customer requests quickly.
      You will be able to support a larger team of ML engineers working in parallel.
  33. We’re Hiring! abnormalsecurity.com/careers/
  34. Thank You