Continuous ML Integration & Delivery for Advanced Email Attack Detection
Jeshua Bratman & Justin Young
www.abnormalsecurity.com
The Detection Problem
From: “Josephine Wright” <invoicing@edisonpower.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
September invoice is ready! Please pay the attached invoice amount of
$883,000 for electricity services for Northwest Mercy Hospitals.
ABA: 12321001
Routing#: 123456789
-Jo
Invoice Payment Fraud!
The Detection Problem
[Figure: email categories along an axis labeled “More Damaging & Sophisticated & Rare”. Categories: Legitimate Email; Graymail; Spam; Phishing, Spear Phishing, Malware; and Advanced Social Engineering (Business Email Compromise, Extortion, Compromised Employee, Invoice Fraud, Heists, Scam, Compromised Vendor). Frequency labels: ~25% of emails, ~25% of emails, ~50% of emails, <.1% of emails, <.01% of emails, < 1 in 100k emails, < 1 in a million emails, < 1 in 10 million emails.]
The Detection Problem
This is a hard machine learning problem
1. Rarity of attacks
2. Adversarial attack landscape
3. High-dimensional & high-volume data
4. Need extremely high precision and recall simultaneously
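A back-of-the-envelope calculation shows why rarity forces extreme precision. The sketch below is illustrative only: the recall and the false-positive rates are assumed values, not figures from the talk, and only the 1-in-100k base rate echoes the frequency labels on the earlier slide.

```python
def precision(base_rate: float, recall: float, false_positive_rate: float) -> float:
    """Fraction of flagged emails that are real attacks (Bayes' rule)."""
    true_positives = base_rate * recall
    false_positives = (1.0 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Assumed numbers for illustration: attacks are 1 in 100k emails,
# and the detector catches 95% of them.
BASE_RATE = 1e-5
RECALL = 0.95

for fpr in (1e-3, 1e-5, 1e-7):
    print(f"FPR={fpr:.0e} -> precision={precision(BASE_RATE, RECALL, fpr):.3f}")
```

Under these assumptions, a false-positive rate of 0.1% leaves precision below 1%, so almost every flagged email would be a false alarm. The false-positive rate has to fall well below the attack base rate before precision becomes usable.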
Move Fast!
Lightning speed iteration to get ahead of new attacks
Don’t Break Things!
We don’t want to stop catching old attacks
Continuous Integration and Delivery (CI/CD) for our ENTIRE ML Detection Engine
Part 2: CI/CD for a Machine Learning Detection Engine
How do we develop quickly without breaking things?
Traditional CI/CD
[Diagram: Engineer modifies Code → Tests (“Do the tests pass?”) → Land & Deploy]
What happens if we *do not* have CI/CD?
● No idea if a code change breaks the system
● Engineers fixing each other’s bugs all the time
● Pushing bad code to production
In modern software development it would be insane not to have CI/CD
Machine Learning CI/CD
[Diagram: ML Engineer modifies Code, Models, and Datasets → Tests (“Do the tests pass?”) → Rescoring Analytics (“Is performance good?”) → Model Training (“Can new models train?”) → Deployment]
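The three questions in the diagram can be read as sequential gates in front of deployment. The following is a minimal sketch of that idea, not the speakers' pipeline: `GateResult`, `run_ml_ci`, `can_deploy`, the stand-in checks, and the 0.95 threshold are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GateResult:
    name: str
    passed: bool


def run_ml_ci(gates: List[Tuple[str, Callable[[], bool]]]) -> List[GateResult]:
    """Run gates in order and stop at the first failure, like a CI pipeline."""
    results = []
    for name, check in gates:
        passed = bool(check())
        results.append(GateResult(name, passed))
        if not passed:
            break  # later stages are skipped once a gate fails
    return results


def can_deploy(results: List[GateResult], expected_gates: int) -> bool:
    """Deploy only if every gate ran and passed."""
    return len(results) == expected_gates and all(r.passed for r in results)


# Hypothetical stand-ins for the three questions in the diagram.
gates = [
    ("Do the tests pass?", lambda: True),
    ("Is performance good?", lambda: 0.97 >= 0.95),  # assumed rescored metric vs. assumed threshold
    ("Can new models train?", lambda: True),
]
results = run_ml_ci(gates)
print([(r.name, r.passed) for r in results], can_deploy(results, len(gates)))
```

Stopping at the first failed gate mirrors traditional CI: there is little point in retraining models if the rescoring analytics already show a regression.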
What happens if we *do not* have CI/CD?
● Cannot safely change the system to fix a false negative (FN) or false positive (FP)
● May degrade the system unintentionally when shipping improvements
● Cannot know the overall impact of a new model on the entire system
Most ML products run blind like this! It greatly hampers development speed and product stability.
Adversarial!
From: “Josephine Wright” <invoicing@edisonpower.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
September invoice is ready! Please pay the attached invoice amount of
$883,000 for electricity services for Northwest Mercy Hospitals.
ABA: 12321001
Routing#: 123456789
-Jo
From: “Josephine Wright” <invoicing@edisonpovver.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
Just wanted to update you, we recently had to switch banks
(long story) but our account number has changed for future
invoices. See attached document for updated banking details.
-Josephine
Attachment: BankDetails.pdf
Invoice Payment Fraud! (first email)
New Attack Strategy: Billing Account Update Fraud! (second email)
OK, how would we use this
From: “Josephine Wright” <invoicing@edisonpovver.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
Just wanted to update you, we recently had to switch banks
(long story) but our account number has changed for future
invoices. See attached document for updated banking details.
-Josephine
Attachment: BankDetails.pdf
Billing Account Update Fraud!
● New or improved NLP models to identify language around changing bank accounts
● New code to parse PDFs and extract bank account numbers from them
● New counting features for how often a sender uses a particular domain, new code with a feature extractor, and a model that uses those features
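As a rough illustration of the third idea, a counting feature can be as simple as a lookup keyed by sender display name and domain. The sketch below is hypothetical: `sender_key`, `build_domain_counts`, and `extract_domain_count_feature` are names invented here, and the history is toy data modeled on the example emails.

```python
from collections import Counter
from email.utils import parseaddr
from typing import Dict, Iterable, Tuple

DomainCounts = Dict[Tuple[str, str], int]


def sender_key(from_header: str) -> Tuple[str, str]:
    """Split a From header into (display name, domain)."""
    name, address = parseaddr(from_header)
    domain = address.rpartition("@")[2].lower()
    return name, domain


def build_domain_counts(from_headers: Iterable[str]) -> DomainCounts:
    """Count how often each display name has been seen with each domain."""
    return Counter(sender_key(h) for h in from_headers)


def extract_domain_count_feature(from_header: str, counts: DomainCounts) -> int:
    """Feature value: past messages from this display name at this domain."""
    return counts.get(sender_key(from_header), 0)


# Toy history: the legitimate vendor has written many times before.
history = ['"Josephine Wright" <invoicing@edisonpower.com>'] * 1000
counts = build_domain_counts(history)

legit = '"Josephine Wright" <invoicing@edisonpower.com>'
spoof = '"Josephine Wright" <invoicing@edisonpovver.com>'
print(extract_domain_count_feature(legit, counts))
print(extract_domain_count_feature(spoof, counts))
```

A familiar display name arriving from a domain it has never used before is the kind of signal a downstream model could learn from.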
Machine Learning CI/CD Details
[Diagram components: ML Engineer modifies Code, Models, and Datasets; Labeled Samples; ML Detection Engine; Rescoring Analytics; Model Training]
Requirements of good CI/CD for ML
Accurate
● Rescoring analytics reflect performance in production
● Training data is unbiased (including time travel to avoid future leakage)
ML Engineer Effectiveness
● Easy and fast for engineers to run for retraining and evaluation
● Can add new models, datasets, and features easily
Part 3: Designing the System
How do we build a CI/CD platform for our ML system that enables developers and also scales well?
So how do we do this?
This is a big data problem! Data, models, and code are all part of the software system we’re testing.
So, we’ll use Spark to simulate our online system. But things get complicated fast...
[Diagram components: Code, Models, Datasets; Labeled Samples; ML Detection Engine; Rescoring Analytics; Model Training]
A Familiar ML Story
● New counting features for how often a sender uses a particular domain, new code with a feature extractor, and a model that uses those features
A data scientist has a great new feature… but how do we safely get it into production?
Domain Count Dataset:
...
(“Josephine Wright”, “edisonpower.com”): 1000,
(“Josephine Wright”, “edisonpovver.com”): 0,
...
For just the new domain count feature:
1. Domain Count Dataset
2. Feature extraction code
3. New sub-model?
What does it look like to test this new feature?
In a typical software test, we can mock out complex dependencies.
But for ML, we can’t mock the data!
Does every data scientist have to become a data engineer?
[Diagram: the new Domain Count Dataset shown alongside the pipeline components: Code, Models, Datasets; Labeled Samples; ML Detection Engine; Rescoring Analytics; Model Training]
Adding Our New Dataset
What would it look like for our data scientist to add the new dataset?
● SparkFiles: download dataset to disk on each executor
● Broadcast Variable: broadcast dataset in memory in each PySpark process
# Broadcast variable: ship a small in-memory dataset to every executor
small_ip_dataset = {"1.2.3.4": 123, "5.6.7.8": 567}
ip_broadcast = sc.broadcast(small_ip_dataset)
# hydrate_with_ip_count can use the small_ip_dataset dictionary via ip_broadcast.value
hydrated_rdd = rdd.map(
    lambda message: hydrate_with_ip_count(message, ip_broadcast.value)
)

# SparkFiles: every executor downloads the dataset to local disk
import os
from pyspark import SparkFiles
# Add Spark file so that every executor will download it
sc.addFile(remote_dataset_path)
# Now the file can be loaded in any Spark operation from local_dataset_path
local_dataset_path = SparkFiles.get(
    os.path.basename(remote_dataset_path)[: -len(".tar.gz")]
)
Adding Our New Dataset
Beyond SparkFiles and Broadcast Variable, a third option:
● Spark Join: join large distributed datasets via Spark operations
[Diagram label: Domain Count Dataset]
Wait, what about time travel?
[Figure: timeline (axis: Time). Hydration of the counting feature up to time t gives 50; hydration of the counting feature up to the earlier time t-x gives 48.]
Feature Hydration With Time Travel
1. Domain Count Dataset: sum Daily Counts over time to get Cumulative Counts
2. Events: time-bucket and key
3. Join by key + time to get Hydrated Events
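These steps can be sketched on a single machine before worrying about Spark. This is a simplified, in-memory illustration rather than the production implementation: the function names, the event dictionaries, the toy dates, and the choice to count only days strictly before the event's day are all assumptions made for the example.

```python
from datetime import date
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (display name, domain)


def cumulative_counts(daily: Dict[Key, Dict[date, int]]) -> Dict[Key, List[Tuple[date, int]]]:
    """Sum daily counts over time into running totals per key."""
    result = {}
    for key, per_day in daily.items():
        total, series = 0, []
        for day in sorted(per_day):
            total += per_day[day]
            series.append((day, total))
        result[key] = series
    return result


def hydrate(events: List[dict], cumulative: Dict[Key, List[Tuple[date, int]]]) -> List[dict]:
    """Join each event against the counts for its key and day.

    Only days strictly before the event's day are counted, so an event
    never sees its own day or the future (an assumed leakage rule).
    """
    hydrated = []
    for event in events:
        count = 0
        for day, total in cumulative.get(event["key"], []):
            if day < event["day"]:
                count = total
            else:
                break
        hydrated.append({**event, "domain_count": count})
    return hydrated


# Toy data: 48 messages on one day, 2 more two days later.
good = ("Josephine Wright", "edisonpower.com")
bad = ("Josephine Wright", "edisonpovver.com")
daily = {good: {date(2020, 9, 1): 48, date(2020, 9, 3): 2}}
events = [
    {"id": "a", "key": good, "day": date(2020, 9, 2)},
    {"id": "b", "key": good, "day": date(2020, 9, 4)},
    {"id": "c", "key": bad, "day": date(2020, 9, 4)},
]
hydrated = hydrate(events, cumulative_counts(daily))
print([(e["id"], e["domain_count"]) for e in hydrated])
```

The earlier event sees a smaller count than the later one for the same sender, which is the point of time travel: each event is scored with the feature value it would have had when it arrived.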
Deep Dive: Re-hydrating Behavior Graph
# Index every event by key and day, and take event ID to avoid passing around large objects
keyed_event_id_rdd = _expand_events_by_key_day(event_rdd)
# Index every count by key and day
keyed_counts_rdd = _expand_counts_by_key_day(time_sliced_counts_rdds)
# Join date-indexed event IDs with date-indexed counts, by common key
joined_event_id_and_daily_counts_rdd = keyed_event_id_rdd.leftOuterJoin(keyed_counts_rdd)
# In memory, sum up cumulative counts and key by event ID
cumulative_counts_by_event_id_rdd = joined_event_id_and_daily_counts_rdd.flatMap(
    _extract_cumulative_counts
)
# Join actual events back in by event ID
joined_event_and_cumulative_counts_rdd = cumulative_counts_by_event_id_rdd.join(
    event_rdd.keyBy(_get_id_from_event)
)
# Hydrate every event with cumulative counts
hydrated_events_rdd = joined_event_and_cumulative_counts_rdd.map(
    _hydrate_event_with_counts
)
Back To Our ML Story
So we can do all of this in Spark. But no data scientist should ever have to think about this!
Data engineers should go to great efforts to provide a simple platform that hides these details.
Data scientists should spend as much time as possible doing data science.
Re-scoring Is Part of the MLOps Platform
Data engineers have to make re-scoring as easy to use as traditional CI/CD.
This means providing a playbook that’s as easy as adding unit tests.
class TimeSlicedStatsEventHydrater(Generic[Stat, Event]):
    # Class for building the set of stats to look up
    _lookup_stats_builder: LookupStatsBuilder
    # How to hydrate the Event with the Stats
    _hydrate_event: EventHydrater
    # Takes in an event and returns the date on which it occurred
    _get_date_from_event: DateExtractor
    # Takes in an event and returns its ID
    _get_id_from_event: IdExtractor
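To make the shape of such an API concrete, here is a small in-memory analogue. It is a sketch under assumptions, not the class from the talk: `SimpleEventHydrater`, its fields, and its `hydrate` method are hypothetical, and the stats lookup is a plain dictionary rather than a Spark dataset.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Generic, Hashable, List, Optional, Tuple, TypeVar

Stat = TypeVar("Stat")
Event = TypeVar("Event")


@dataclass
class SimpleEventHydrater(Generic[Stat, Event]):
    # Stats keyed by (lookup key, date); stands in for a stats builder's output
    lookup_stats: Dict[Tuple[Hashable, date], Stat]
    # How to attach a stat (None when missing) to an event
    hydrate_event: Callable[[Event, Optional[Stat]], Event]
    # Takes in an event and returns the date on which it occurred
    get_date_from_event: Callable[[Event], date]
    # Takes in an event and returns the key used to look up its stats
    get_key_from_event: Callable[[Event], Hashable]

    def hydrate(self, events: List[Event]) -> List[Event]:
        hydrated = []
        for event in events:
            lookup = (self.get_key_from_event(event), self.get_date_from_event(event))
            hydrated.append(self.hydrate_event(event, self.lookup_stats.get(lookup)))
        return hydrated


# A data scientist supplies only small functions; the join logic stays hidden.
good = ("Josephine Wright", "edisonpower.com")
bad = ("Josephine Wright", "edisonpovver.com")
hydrater = SimpleEventHydrater(
    lookup_stats={(good, date(2020, 9, 2)): 48},
    hydrate_event=lambda event, stat: {**event, "domain_count": stat or 0},
    get_date_from_event=lambda event: event["day"],
    get_key_from_event=lambda event: event["sender"],
)
out = hydrater.hydrate([
    {"sender": good, "day": date(2020, 9, 2)},
    {"sender": bad, "day": date(2020, 9, 2)},
])
print([e["domain_count"] for e in out])
```

The design choice illustrated is that the caller describes what to extract and how to attach it, while the platform decides how the lookup is actually executed.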
Requirements of good CI/CD for ML
Accurate
● Rescoring analytics reflect performance in production
● Training data is unbiased (including time travel to avoid future leakage)
ML Engineer Effectiveness
● Easy and fast for engineers to run for retraining and evaluation
● Can add new models, datasets, and features easily
Data Engineer Jobs-to-be-done
● Provide a simple API that just works
● Make the system efficient enough to run both on a regular schedule and ad hoc
What happens if we DO have CI/CD?
● Quickly iterate
● Know if things break
● Train models on old examples
You will have a better & more flexible product
You will be able to address customer requests quickly
You will be able to support a larger team of ML engineers working in parallel
We’re Hiring!
abnormalsecurity.com/careers/
Thank You

More Related Content

What's hot

Data Pipelines With Streamsets
Data Pipelines With Streamsets Data Pipelines With Streamsets
Data Pipelines With Streamsets Jowanza Joseph
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...Big Data Spain
 
Automated Testing For Protecting Data Pipelines from Undocumented Assumptions
Automated Testing For Protecting Data Pipelines from Undocumented AssumptionsAutomated Testing For Protecting Data Pipelines from Undocumented Assumptions
Automated Testing For Protecting Data Pipelines from Undocumented AssumptionsDatabricks
 
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...Big Data Spain
 
Redash: Open Source SQL Analytics on Data Lakes
Redash: Open Source SQL Analytics on Data LakesRedash: Open Source SQL Analytics on Data Lakes
Redash: Open Source SQL Analytics on Data LakesDatabricks
 
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020Building a Streaming Microservices Architecture - Data + AI Summit EU 2020
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020Databricks
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastDatabricks
 
Dealing With Drift - Building an Enterprise Data Lake
Dealing With Drift - Building an Enterprise Data LakeDealing With Drift - Building an Enterprise Data Lake
Dealing With Drift - Building an Enterprise Data LakePat Patterson
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Big Data Spain
 
An Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization SystemsAn Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization SystemsDatabricks
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
Dealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data LakeDealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data LakePat Patterson
 
Improving Power Grid Reliability Using IoT Analytics
Improving Power Grid Reliability Using IoT AnalyticsImproving Power Grid Reliability Using IoT Analytics
Improving Power Grid Reliability Using IoT AnalyticsDatabricks
 
Effective AIOps with Open Source Software in a Week
Effective AIOps with Open Source Software in a WeekEffective AIOps with Open Source Software in a Week
Effective AIOps with Open Source Software in a WeekDatabricks
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningDatabricks
 
Analysing data analytics use cases to understand big data platform
Analysing data analytics use cases  to understand big data platformAnalysing data analytics use cases  to understand big data platform
Analysing data analytics use cases to understand big data platformdataeaze systems
 
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsEnabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsStreamsets Inc.
 
Stream Analytics
Stream Analytics Stream Analytics
Stream Analytics Franco Ucci
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 

What's hot (20)

Intuit Analytics Cloud 101
Intuit Analytics Cloud 101Intuit Analytics Cloud 101
Intuit Analytics Cloud 101
 
Data Pipelines With Streamsets
Data Pipelines With Streamsets Data Pipelines With Streamsets
Data Pipelines With Streamsets
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
 
Automated Testing For Protecting Data Pipelines from Undocumented Assumptions
Automated Testing For Protecting Data Pipelines from Undocumented AssumptionsAutomated Testing For Protecting Data Pipelines from Undocumented Assumptions
Automated Testing For Protecting Data Pipelines from Undocumented Assumptions
 
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
 
Redash: Open Source SQL Analytics on Data Lakes
Redash: Open Source SQL Analytics on Data LakesRedash: Open Source SQL Analytics on Data Lakes
Redash: Open Source SQL Analytics on Data Lakes
 
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020Building a Streaming Microservices Architecture - Data + AI Summit EU 2020
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at Comcast
 
Dealing With Drift - Building an Enterprise Data Lake
Dealing With Drift - Building an Enterprise Data LakeDealing With Drift - Building an Enterprise Data Lake
Dealing With Drift - Building an Enterprise Data Lake
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
 
An Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization SystemsAn Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization Systems
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Dealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data LakeDealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data Lake
 
Improving Power Grid Reliability Using IoT Analytics
Improving Power Grid Reliability Using IoT AnalyticsImproving Power Grid Reliability Using IoT Analytics
Improving Power Grid Reliability Using IoT Analytics
 
Effective AIOps with Open Source Software in a Week
Effective AIOps with Open Source Software in a WeekEffective AIOps with Open Source Software in a Week
Effective AIOps with Open Source Software in a Week
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 
Analysing data analytics use cases to understand big data platform
Analysing data analytics use cases  to understand big data platformAnalysing data analytics use cases  to understand big data platform
Analysing data analytics use cases to understand big data platform
 
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsEnabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
 
Stream Analytics
Stream Analytics Stream Analytics
Stream Analytics
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 

Similar to Machine Learning CI/CD for Email Attack Detection

Netpluz Managed SOC - MSS Service
Netpluz Managed SOC - MSS Service Netpluz Managed SOC - MSS Service
Netpluz Managed SOC - MSS Service Netpluz Asia Pte Ltd
 
1st Party
1st Party1st Party
1st Partymdnunez
 
Event Sourcing with Microservices
Event Sourcing with MicroservicesEvent Sourcing with Microservices
Event Sourcing with MicroservicesRalph Winzinger
 
Low Latency Fraud Detection & Prevention
Low Latency Fraud Detection & PreventionLow Latency Fraud Detection & Prevention
Low Latency Fraud Detection & PreventionSid Anand
 
Evolution of a big data project
Evolution of a big data projectEvolution of a big data project
Evolution of a big data projectMichael Peacock
 
Amazon Web Services: Building a 'Web-Scale Computing' Architecture
Amazon Web Services: Building a 'Web-Scale Computing' ArchitectureAmazon Web Services: Building a 'Web-Scale Computing' Architecture
Amazon Web Services: Building a 'Web-Scale Computing' Architecturegoodfriday
 
AWS Presentation
AWS PresentationAWS Presentation
AWS Presentationjlechowicz
 
Barga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 KeynoteBarga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 KeynoteRoger Barga
 
From monolithic to serverless with Amazon Step Functions
From monolithic to serverless with Amazon Step FunctionsFrom monolithic to serverless with Amazon Step Functions
From monolithic to serverless with Amazon Step FunctionsScott Triglia
 
Analyzing Streams: Data Analytics Week at the SF Loft
Analyzing Streams: Data Analytics Week at the SF LoftAnalyzing Streams: Data Analytics Week at the SF Loft
Analyzing Streams: Data Analytics Week at the SF LoftAmazon Web Services
 
Logging makes perfect - Riemann, Elasticsearch and friends
Logging makes perfect - Riemann, Elasticsearch and friendsLogging makes perfect - Riemann, Elasticsearch and friends
Logging makes perfect - Riemann, Elasticsearch and friendsItamar
 
Time Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sTime Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sInside Analysis
 
Inspire 2014 Using eForms and iScripts with Business Applications
Inspire 2014 Using eForms and iScripts with Business ApplicationsInspire 2014 Using eForms and iScripts with Business Applications
Inspire 2014 Using eForms and iScripts with Business ApplicationsMary Fisher
 
FME Server Meets the Challenge of Real-time
FME Server Meets the Challenge of Real-timeFME Server Meets the Challenge of Real-time
FME Server Meets the Challenge of Real-timeSafe Software
 

Similar to Machine Learning CI/CD for Email Attack Detection (20)

Netpluz Managed SOC - MSS Service
Netpluz Managed SOC - MSS Service Netpluz Managed SOC - MSS Service
Netpluz Managed SOC - MSS Service
 
1st Party
1st Party1st Party
1st Party
 
Event Sourcing with Microservices
Event Sourcing with MicroservicesEvent Sourcing with Microservices
Event Sourcing with Microservices
 
Low Latency Fraud Detection & Prevention
Low Latency Fraud Detection & PreventionLow Latency Fraud Detection & Prevention
Low Latency Fraud Detection & Prevention
 
Evolution of a big data project
Evolution of a big data projectEvolution of a big data project
Evolution of a big data project
 
Amazon Web Services: Building a 'Web-Scale Computing' Architecture
Amazon Web Services: Building a 'Web-Scale Computing' ArchitectureAmazon Web Services: Building a 'Web-Scale Computing' Architecture
Amazon Web Services: Building a 'Web-Scale Computing' Architecture
 
Data Breach Risk Brief - 2015
Data Breach Risk Brief - 2015Data Breach Risk Brief - 2015
Data Breach Risk Brief - 2015
 
AWS Presentation
AWS PresentationAWS Presentation
AWS Presentation
 
Analyzing Streams
Analyzing StreamsAnalyzing Streams
Analyzing Streams
 
Analyzing Streams
Analyzing StreamsAnalyzing Streams
Analyzing Streams
 
Analyzing Streams
Analyzing StreamsAnalyzing Streams
Analyzing Streams
 
Analyzing Streams
Analyzing StreamsAnalyzing Streams
Analyzing Streams
 
Barga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 KeynoteBarga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 Keynote
 
From monolithic to serverless with Amazon Step Functions
From monolithic to serverless with Amazon Step FunctionsFrom monolithic to serverless with Amazon Step Functions
From monolithic to serverless with Amazon Step Functions
 
Analyzing Streams: Data Analytics Week at the SF Loft
Analyzing Streams: Data Analytics Week at the SF LoftAnalyzing Streams: Data Analytics Week at the SF Loft
Analyzing Streams: Data Analytics Week at the SF Loft
 
Analyzing Streams
Analyzing StreamsAnalyzing Streams
Analyzing Streams
 
Logging makes perfect - Riemann, Elasticsearch and friends
Logging makes perfect - Riemann, Elasticsearch and friendsLogging makes perfect - Riemann, Elasticsearch and friends
Logging makes perfect - Riemann, Elasticsearch and friends
 
Time Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sTime Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today's
 
Inspire 2014 Using eForms and iScripts with Business Applications
Inspire 2014 Using eForms and iScripts with Business ApplicationsInspire 2014 Using eForms and iScripts with Business Applications
Inspire 2014 Using eForms and iScripts with Business Applications
 
FME Server Meets the Challenge of Real-time
FME Server Meets the Challenge of Real-timeFME Server Meets the Challenge of Real-time
FME Server Meets the Challenge of Real-time
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration




Machine Learning CI/CD for Email Attack Detection

  • 1. Continuous ML Integration & Delivery for Advanced Email Attack Detection Jeshua Bratman & Justin Young
  • 2. The Detection Problem. From: “Josephine Wright” <invoicing@edisonpower.com> To: “Tim James” <accounts@northwestmercyhospitals.com> Subject: “Invoice details for September electricity service” Body: Hi Tim, September invoice is ready! Please pay the attached invoice amount of $883,000 for electricity services for Northwest Mercy Hospitals. ABA: 12321001 Routing#: 123456789 -Jo. Label: Invoice Payment Fraud!
  • 3. The Detection Problem. Category labels, in slide order: Advanced Social Engineering; Phishing, Spear Phishing, Malware; Spam; Graymail; Business Email Compromise; Extortion; Compromised Employee; Invoice Fraud; Heists; Scam; Compromised Vendor; Legitimate Email. Axis label: More Damaging & Sophisticated & Rare. Frequency labels, in slide order: ~25% of emails; ~25% of emails; ~50% of emails; <.1% of emails; <.01% of emails; < 1 in 100k emails; < 1 in a million emails; < 1 in 10 million emails.
  • 4. The Detection Problem. This is a hard machine learning problem: 1. Rarity of attacks. 2. Adversarial attack landscape. 3. High-dimensional & high data volume. 4. Need extremely high precision and recall simultaneously.
  • 5. Move Fast! Lightning-speed iteration to get ahead of new attacks. Don’t Break Things! We don’t want to stop catching old attacks. Continuous Integration and Delivery (CI/CD) for our ENTIRE ML Detection Engine.
  • 6. Part 2: CI/CD for a Machine Learning Detection Engine. How do we develop quickly without breaking things?
  • 8. What happens if we *do not* have CI/CD? No idea if a code change breaks the system. Engineers fixing each other's bugs all the time. Pushing bad code to production. In modern software development it would be insane not to have CI/CD.
  • 9. Machine Learning CI/CD. ML Engineer modifies: Code, Models, Datasets. Stages: Tests (Do the tests pass?), Rescoring Analytics (Is performance good?), Model Training (Can new models train?), then Deployment.
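The "Is performance good?" check on slide 9 could be sketched as a comparison between rescoring results for the current engine and for a candidate change. This is only an illustration under stated assumptions: `RescoringReport`, `passes_rescoring_gate`, and the `tolerance` parameter are hypothetical names, and the talk does not say which metrics or thresholds the real gate uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RescoringReport:
    """Hypothetical confusion counts from rescoring labeled samples."""
    true_positives: int
    false_positives: int
    false_negatives: int

    @property
    def precision(self) -> float:
        flagged = self.true_positives + self.false_positives
        return self.true_positives / flagged if flagged else 1.0

    @property
    def recall(self) -> float:
        attacks = self.true_positives + self.false_negatives
        return self.true_positives / attacks if attacks else 1.0


def passes_rescoring_gate(baseline: RescoringReport,
                          candidate: RescoringReport,
                          tolerance: float = 0.0) -> bool:
    """Allow deployment only if neither precision nor recall regresses
    by more than `tolerance` relative to the baseline (an assumed rule)."""
    return (candidate.precision >= baseline.precision - tolerance
            and candidate.recall >= baseline.recall - tolerance)


# Made-up counts, purely for illustration.
baseline = RescoringReport(true_positives=90, false_positives=5, false_negatives=10)
better = RescoringReport(true_positives=95, false_positives=5, false_negatives=5)
worse = RescoringReport(true_positives=80, false_positives=5, false_negatives=20)
```

Requiring both metrics to hold mirrors the "Move Fast / Don't Break Things" pairing: a change that catches a new attack but drops recall on old ones would be blocked.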
  • 10. What happens if we *do not* have CI/CD? Cannot safely change the system to fix a false negative (FN) or false positive (FP). May degrade the system unintentionally when shipping improvements. Cannot know the overall impact of a new model on the entire system. Most ML products run blind like this! It greatly hampers development speed and product stability.
  • 11. Adversarial! Original attack (Invoice Payment Fraud!): From: “Josephine Wright” <invoicing@edisonpower.com> To: “Tim James” <accounts@northwestmercyhospitals.com> Subject: “Invoice details for September electricity service” Body: Hi Tim, September invoice is ready! Please pay the attached invoice amount of $883,000 for electricity services for Northwest Mercy Hospitals. ABA: 12321001 Routing#: 123456789 -Jo. New attack strategy (Billing Account Update Fraud!): From: “Josephine Wright” <invoicing@edisonpovver.com> To: “Tim James” <accounts@northwestmercyhospitals.com> Subject: “Invoice details for September electricity service” Body: Hi Tim, Just wanted to update you, we recently had to switch banks (long story) but our account number has changed for future invoices. See attached document for updated banking details. -Josephine Attachment: BankDetails.pdf
  • 12. OK, how would we use this? Billing Account Update Fraud! From: “Josephine Wright” <invoicing@edisonpovver.com> To: “Tim James” <accounts@northwestmercyhospitals.com> Subject: “Invoice details for September electricity service” Body: Hi Tim, Just wanted to update you, we recently had to switch banks (long story) but our account number has changed for future invoices. See attached document for updated banking details. -Josephine Attachment: BankDetails.pdf. Changes needed: (1) New or improved NLP models to identify language around changing bank accounts. (2) New code to parse PDFs and extract bank account numbers from them. (3) New counting features for how often a sender uses a particular domain, new code with a feature extractor, and a model that uses those features.
  • 13. Machine Learning CI/CD Details. ML Engineer modifies: Code, Models, Datasets. Diagram labels: ML Detection Engine, Labeled Samples, Rescoring Analytics, Model Training.
  • 14. Requirements of good CI/CD for ML. Accurate: ● Rescoring analytics reflect performance in production ● Training data is unbiased (including time travel to avoid future leakage). ML Engineer Effectiveness: ● Easy and fast to run by engineers for retraining and evaluation ● Can add new models, datasets, features easily.
  • 15. Part 3: Designing the System. How do we build a CI/CD platform for our ML system that enables developers and also scales well?
  • 16. So how do we do this? This is a big data problem! Data, models, and code are all part of the software system we’re testing. So, we’ll use Spark to simulate our online system. But things get complicated fast... Diagram labels: Code, Models, Datasets, ML Detection Engine, Labeled Samples, Rescoring Analytics, Model Training.
  • 17. A Familiar ML Story. Billing Account Update Fraud! From: “Josephine Wright” <invoicing@edisonpovver.com> To: “Tim James” <accounts@northwestmercyhospitals.com> Subject: “Invoice details for September electricity service” Body: Hi Tim, Just wanted to update you, we recently had to switch banks (long story) but our account number has changed for future invoices. See attached document for updated banking details. -Josephine Attachment: BankDetails.pdf. New counting features for how often a sender uses a particular domain, new code with a feature extractor, and a model that uses those features. A data scientist has a great new feature… but how do we safely get it into production? Domain Count Dataset: ... (“Josephine Wright”, “edisonpower.com”): 1000, (“Josephine Wright”, “edisonpovver.com”): 0, ...
  • 18. A Familiar ML Story. A data scientist has a great new feature… but how do we safely get it into production? For just the new domain count feature: 1. Domain Count Dataset 2. Feature extraction code 3. New sub-model? Domain Count Dataset: ... (“Josephine Wright”, “edisonpower.com”): 1000, (“Josephine Wright”, “edisonpovver.com”): 0, ...
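As a rough illustration of item 2 on slide 18 (feature extraction code), the sketch below looks up how often a display name has been seen with a sending domain. The function names and the plain-dictionary layout are assumptions for this example; the talk shows only the dataset excerpt, not the extractor itself.

```python
from typing import Dict, Tuple

# Hypothetical in-memory form of the Domain Count Dataset:
# (sender display name, sender domain) -> number of past messages.
DomainCounts = Dict[Tuple[str, str], int]


def sender_domain(address: str) -> str:
    """Return the lower-cased domain part of an email address."""
    return address.rsplit("@", 1)[-1].lower()


def extract_domain_count_feature(display_name: str,
                                 from_address: str,
                                 domain_counts: DomainCounts) -> int:
    """Count of past messages from this display name on this domain.

    Pairs that were never seen default to 0, which is the signal of
    interest for a lookalike domain.
    """
    key = (display_name, sender_domain(from_address))
    return domain_counts.get(key, 0)


domain_counts: DomainCounts = {("Josephine Wright", "edisonpower.com"): 1000}

known = extract_domain_count_feature(
    "Josephine Wright", "invoicing@edisonpower.com", domain_counts)
lookalike = extract_domain_count_feature(
    "Josephine Wright", "invoicing@edisonpovver.com", domain_counts)
```

With the dataset excerpt from the slide, the legitimate domain yields 1000 and the lookalike domain yields 0, which a downstream model could use as evidence of impersonation.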
  • 19. What does it look like to test this new feature? In a typical software test, we can mock out complex dependencies. But for ML, we can’t mock the data! Does every data scientist have to become a data engineer? Diagram labels: Domain Count Dataset, Code, Models, Datasets, ML Detection Engine, Labeled Samples, Rescoring Analytics, Model Training.
  • 20. Adding Our New Dataset. What would it look like for our data scientist to add the new dataset? SparkFiles: download dataset to disk on each executor. Broadcast Variable: broadcast dataset in memory in each PySpark process.
  • 21. Adding Our New Dataset. SparkFiles: download dataset to disk on each executor. Broadcast Variable: broadcast dataset in memory in each PySpark process.
        # Assumes an existing SparkContext `sc` and an RDD of messages `rdd`.
        # Broadcast variable to every executor
        small_ip_dataset = {"1.2.3.4": 123, "5.6.7.8": 567}
        ip_broadcast = sc.broadcast(small_ip_dataset)
        # hydrate_with_ip_count can use the small_ip_dataset dictionary
        hydrated_rdd = rdd.map(lambda message: hydrate_with_ip_count(message, ip_broadcast.value))

        import os
        from pyspark import SparkFiles
        # Add Spark file so that every executor will download it
        sc.addFile(remote_dataset_path)
        # Now the file can be loaded in any Spark operation from local_dataset_path
        local_dataset_path = SparkFiles.get(os.path.basename(remote_dataset_path)[: -len(".tar.gz")])
  • 22. Adding Our New Dataset. What would it look like for our data scientist to add the new dataset? SparkFiles: download dataset to disk on each executor. Broadcast Variable: broadcast dataset in memory in each PySpark process. Spark Join: join large distributed datasets via Spark operations (the option shown for the Domain Count Dataset).
  • 23. Wait, what about time travel? [Figure: timeline. Hydration of the counting feature up to time t gives 50; hydration up to the earlier time t-x gives 48.]
  • 24. Feature Hydration With Time Travel. [Figure: the Domain Count Dataset's daily counts are summed over time into cumulative counts.]
  • 25. Feature Hydration With Time Travel. [Figure: events are time-bucketed and keyed.]
  • 26. Feature Hydration With Time Travel. [Figure: events are joined with counts by key + time, producing hydrated events.]
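The three steps on slides 24 to 26 (sum daily counts over time, bucket events by key and day, join by key + time) can be sketched without Spark on a small in-memory example. Everything here is an assumption for illustration: the function names are hypothetical, and the choice to count only days strictly before the event's day is one possible way to avoid future leakage, not necessarily the authors' rule.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import date
from itertools import accumulate


def build_cumulative_counts(daily_counts):
    """Turn {(key, day): count} into {key: (sorted_days, running_totals)}."""
    per_key = defaultdict(dict)
    for (key, day), count in daily_counts.items():
        per_key[key][day] = count
    cumulative = {}
    for key, by_day in per_key.items():
        days = sorted(by_day)
        totals = list(accumulate(by_day[d] for d in days))
        cumulative[key] = (days, totals)
    return cumulative


def count_before(cumulative, key, event_day):
    """Cumulative count for `key` using only days strictly before `event_day`."""
    if key not in cumulative:
        return 0
    days, totals = cumulative[key]
    idx = bisect_left(days, event_day)  # number of recorded days < event_day
    return totals[idx - 1] if idx else 0


# Made-up daily counts for one (display name, domain) key.
key = ("Josephine Wright", "edisonpower.com")
daily = {
    (key, date(2020, 9, 1)): 3,
    (key, date(2020, 9, 2)): 2,
    (key, date(2020, 9, 4)): 5,
}
cumulative = build_cumulative_counts(daily)
```

Rescoring an event dated 3 September would then see a count of 5, while the same event dated 5 September would see 10, so a historical message is scored with the feature value it would have had at the time.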
  • 27. Deep Dive: Re-hydrating Behavior Graph
        # Index every event by key and day, and take event ID to avoid passing around large objects
        keyed_event_id_rdd = _expand_events_by_key_day(event_rdd)
        # Index every count by key and day
        keyed_counts_rdd = _expand_counts_by_key_day(time_sliced_counts_rdds)
        # Join date-indexed event IDs with date-indexed counts, by common key
        joined_event_id_and_daily_counts_rdd = keyed_event_id_rdd.leftOuterJoin(keyed_counts_rdd)
        # In memory, sum up cumulative counts and key by event ID
        cumulative_counts_by_event_id_rdd = joined_event_id_and_daily_counts_rdd.flatMap(
            _extract_cumulative_counts
        )
        # Join actual events back in by event ID
        joined_event_and_cumulative_counts_rdd = cumulative_counts_by_event_id_rdd.join(
            event_rdd.keyBy(_get_id_from_event)
        )
        # Hydrate every event with cumulative counts
        hydrated_events_rdd = joined_event_and_cumulative_counts_rdd.map(
            _hydrate_event_with_counts
        )
  • 28. Back To Our ML Story. So we can do all of this in Spark. But no data scientist should ever have to think about this! Data engineers should go to great efforts to provide a simple platform that hides these details. Data scientists should spend as much time as possible doing data science. Domain Count Dataset: ... (“Josephine Wright”, “edisonpower.com”): 1000, (“Josephine Wright”, “edisonpovver.com”): 0, ... (The slide repeats the re-hydration code from slide 27.)
  • 29. Re-scoring Is Part of the MLOps Platform. Data engineers have to make re-scoring as easy to use as traditional CI/CD. This means providing a playbook that’s as easy as adding unit tests. Domain Count Dataset: ... (“Josephine Wright”, “edisonpower.com”): 1000, (“Josephine Wright”, “edisonpovver.com”): 0, ...
  • 30. Re-scoring Is Part of the MLOps Platform. Data engineers have to make re-scoring as easy to use as traditional CI/CD. This means providing a playbook that’s as easy as adding unit tests. Domain Count Dataset: ... (“Josephine Wright”, “edisonpower.com”): 1000, (“Josephine Wright”, “edisonpovver.com”): 0, ...
        class TimeSlicedStatsEventHydrater(Generic[Stat, Event]):
            # Class for building set of stats to lookup
            _lookup_stats_builder: LookupStatsBuilder
            # How to hydrate the Event with the Stats
            _hydrate_event: EventHydrater
            # Takes in an event and returns the date on which it occurred
            _get_date_from_event: DateExtractor
            # Takes in an event and returns its ID
            _get_id_from_event: IdExtractor
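To suggest what "as easy as adding unit tests" might look like from the data scientist's side, here is a hypothetical, simplified stand-in for the class on slide 30. The slide does not define `EventHydrater`, `DateExtractor`, or `IdExtractor`, so they are modeled here as plain callables. The `HydraterSpec` name and the dictionary-shaped events are assumptions, and the stats-builder field is left out because its interface is not shown.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Generic, TypeVar

Stat = TypeVar("Stat")
Event = TypeVar("Event")


@dataclass(frozen=True)
class HydraterSpec(Generic[Stat, Event]):
    """Hypothetical declarative spec a data scientist would fill in."""
    # How to hydrate the event with the looked-up stat
    hydrate_event: Callable[[Event, Stat], Event]
    # Takes in an event and returns the date on which it occurred
    get_date_from_event: Callable[[Event], date]
    # Takes in an event and returns its ID
    get_id_from_event: Callable[[Event], str]


def _with_domain_count(event: Dict, count: int) -> Dict:
    """Return a copy of the event carrying the domain count feature."""
    return {**event, "domain_count": count}


# The only code the feature author writes: three small functions.
domain_count_spec = HydraterSpec(
    hydrate_event=_with_domain_count,
    get_date_from_event=lambda e: e["received"],
    get_id_from_event=lambda e: e["id"],
)

event = {"id": "msg-1", "received": date(2020, 9, 3),
         "sender_domain": "edisonpovver.com"}
hydrated = domain_count_spec.hydrate_event(event, 0)
```

The point of such a spec is that the joins and time slicing stay inside the platform, while the feature author supplies only small pure functions that are easy to unit test.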
  • 31. Requirements of good CI/CD for ML. Accurate: ● Rescoring analytics reflect performance in production ● Training data is unbiased (including time travel to avoid future leakage). ML Engineer Effectiveness: ● Easy and fast to run by engineers for retraining and evaluation ● Can add new models, datasets, features easily. Data Engineer Jobs-to-be-done: ● Provide a simple API that just works ● Make the system efficient enough to run on a regular schedule and ad hoc.
  • 32. What happens if we DO have CI/CD? Quickly iterate. Know if things break. Train models on old examples. You will have a better & more flexible product. You will be able to address customer requests quickly. You will be able to support a larger team of ML engineers working in parallel.