Zachary S. Brown
Cyber Threat Ranking Through READ
Who We Are (1)
UnitedHealth Group
• UnitedHealthcare
• Optum
Who We Are (2)
EIS: Cybersecurity group for the enterprise
• Real time monitoring and alerting
• Security operations
• Investigation and incident response
• "We have a cybersecurity team!?"
Data Analytics and Security Innovation (DASI)
• Big data platform and advanced analytics
• Primarily data scientists, data engineers, data analysts
Who We Are (3)
Johanna Favole
Oliver Chan
William Casey
Zachary Brown
Security Big Data Lake (1)
Security Big Data Lake
• Primary platform for all enterprise cybersecurity data
• Built upon Hadoop and Elastic
• Streaming ingest of ~10 – 15 TB daily (~80k EPS)
Security Big Data Lake (2)
Transactional
• SIEM loggers
• Firewalls
• Email security and web proxy appliances
• Database activity monitors
• Endpoint sensors
• Vulnerability scans
• Security ticketing system
• Incident response data collectors
Enriching
• IP reputation
• Threat feeds
• External vulnerabilities
• External geolocation
• Contextual transaction data
• Analyst feedback
Referential
• Human capital management data
• System configuration management data
• Enterprise technology management
• Acquired entity (AE) references
• Application configuration management data
• Internal geolocation
7 billion events per day from 160+ sources
Reactive
• Forensic data collection
• Forensic data analysis
• Vulnerability scan data correlation
Security Big Data Lake (3)
SIEM
Real-time alerting
Low-latency exploration
Flexible, scalable compute
Motivation (1)
Threat feeds provide indicators of compromise (IOC)
• Domains, IPs, hashes, etc.
• SIEM provides some threat matching functionality
• Extraction from external feeds for enrichment in the SBDL
We’re drowning in threat matches
• How do we determine which matches are higher priority?
• Matches are rule/signature based
– Supplement with statistical behavioral analysis
Motivation (2)
Two-step process to better leverage threat feed matches
• Extract threats from feeds, categorize, match against all data
– Produces a large volume of matches
• Utilize anomaly detection methods to implement a ranking system
More efficient analyst workflows
• Going beyond signature-based alerts
• Provide analysts with a list of the top N candidates for investigation
– Provide additional contextual information to aid in investigation
Motivation (3)
Borrow an approach from the literature
• AI^2: Training a Big Data Machine to Defend
• Extract portions of outlier detection methodology (matrix decomposition)
• Outlier detection through reconstruction error
Literature describes multi-pronged approach
• Reconstruction error for PCA and autoencoder models
– An additional density-based score is utilized as well
• Human-in-the-loop to introduce feedback through auxiliary model
– Introduce supervised learning model to incorporate feedback
Scope
Scope for initial POC
• Use only PCA to compute reconstruction error score
– Only vanilla Python and Spark available at project start
• Initial focus on data captured only by enterprise web proxy
– Very rich, noisy, high volume data
• Initial focus on IP-based IOCs from threats
– Less pre-processing of proxy data; no fuzzy matching
Future plans
• Auto-encoder scoring, additional data sources, HITL, additional IOCs
Threat Extraction
Threat feeds
• Nearly a dozen individual sources
– Source formats vary wildly; CSV, JSON, nesting, etc.
• Internally and externally sourced
• Tens of thousands of individual IOCs each day
• Inconsistent availability for some feeds
ETL pipeline for processing
• Un-nesting, standardization, deduplication
• Each IOC tagged with type, source, etc.
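The un-nesting, standardization, and deduplication steps can be sketched roughly as follows. This is a minimal illustration, not the team's pipeline: the two feed layouts, the field names (`indicator`, `type`, `threats`, `kind`, `indicators`), the feed labels, and all helper functions are assumptions made up for the example.

```python
import csv
import io
import json

def normalize_ioc(value, ioc_type, source):
    """Return one standardized IOC record (lower-cased, whitespace stripped)."""
    return {"value": value.strip().lower(), "type": ioc_type, "source": source}

def parse_csv_feed(text, source):
    """Hypothetical flat feed with header columns: indicator,type."""
    reader = csv.DictReader(io.StringIO(text))
    return [normalize_ioc(row["indicator"], row["type"], source) for row in reader]

def parse_json_feed(text, source):
    """Hypothetical nested feed: {"threats": [{"kind": ..., "indicators": [...]}]}."""
    records = []
    for threat in json.loads(text)["threats"]:
        for indicator in threat["indicators"]:  # un-nest one level
            records.append(normalize_ioc(indicator, threat["kind"], source))
    return records

def deduplicate(records):
    """Keep one record per (value, type) and remember every source that reported it."""
    merged = {}
    for rec in records:
        key = (rec["value"], rec["type"])
        entry = merged.setdefault(
            key, {"value": rec["value"], "type": rec["type"], "sources": set()}
        )
        entry["sources"].add(rec["source"])
    return list(merged.values())

# Illustrative sample data (documentation-range addresses, made-up feeds)
csv_feed = "indicator,type\n203.0.113.7,ip\nEvil.example.com ,domain\n"
json_feed = json.dumps(
    {"threats": [{"kind": "ip", "indicators": ["203.0.113.7", "198.51.100.9"]}]}
)

iocs = deduplicate(
    parse_csv_feed(csv_feed, "feed_a") + parse_json_feed(json_feed, "feed_b")
)
```

Keeping the set of sources per IOC preserves the "tagged with type, source" information after duplicates across feeds are merged.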
Threat Matching
Begin by looking for all individual matches in all data
• Non-trivial engineering problem!
– Multiple matching categories (IP, hash, CIDR, URL/domain)
– Fuzzy matching/whitelisting
• Tens of thousands of individual IOCs each day
– Billions of security events; the main limiting factor
Inconsistency in relevance of IOCs from threat feeds
• High variability in confidence and maliciousness within and across feeds
• IOCs lose relevance due to a myriad of factors
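As a small-scale illustration of two of the matching categories (exact IP and CIDR), the sketch below uses Python's standard `ipaddress` module. The event layout, the sample addresses, and the helper names are assumptions for the example; a linear scan like this would not scale to billions of events, which is where a distributed engine comes in.

```python
import ipaddress

def build_matchers(iocs):
    """Split IOC strings into exact IPs (set lookup) and CIDR networks (linear scan)."""
    exact, networks = set(), []
    for ioc in iocs:
        if "/" in ioc:
            networks.append(ipaddress.ip_network(ioc, strict=False))
        else:
            exact.add(ipaddress.ip_address(ioc))
    return exact, networks

def match_event(dst, exact, networks):
    """Return the IOC (as a string) that the destination address matches, or None."""
    addr = ipaddress.ip_address(dst)
    if addr in exact:
        return str(addr)
    for net in networks:
        if addr in net:
            return str(net)
    return None

# Illustrative IOCs and events (documentation-range addresses)
iocs = ["203.0.113.7", "198.51.100.0/24"]
events = [
    {"dst": "203.0.113.7", "date": "2020-01-01"},
    {"dst": "198.51.100.42", "date": "2020-01-01"},
    {"dst": "192.0.2.10", "date": "2020-01-01"},
]

exact, networks = build_matchers(iocs)
matches = [(e["dst"], match_event(e["dst"], exact, networks)) for e in events]
```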
Feature Engineering (1)
For any ML model, need numerical features as input
• Want to build a statistical model of what is "normal"
• Use this to determine which records associated with IOCs are abnormal
Feature granularity
• Calculate features at the level of the IOC, e.g. domain, external IP
• Pick a time granularity to aggregate features over
– Begin with daily features
• A unique set of features can be calculated for each data source
– Begin by focusing on web proxy logs; extremely rich data source
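To make the granularity concrete, here is a minimal pure-Python sketch of daily features keyed on destination and date. The event fields (`dst`, `date`, `in`, `requestMethod`) follow names that appear elsewhere in the deck, but the record layout, the sample values, and the `daily_features` helper are assumptions for illustration only.

```python
from collections import defaultdict

def daily_features(events):
    """Aggregate raw proxy events into one feature row per (dst, date) key."""
    groups = defaultdict(list)
    for e in events:
        groups[(e["dst"], e["date"])].append(e)

    rows = {}
    for key, evs in groups.items():
        in_bytes = [e["in"] for e in evs]
        rows[key] = {
            "n_events": len(evs),
            "in_sum": sum(in_bytes),
            "in_mean": sum(in_bytes) / len(in_bytes),
            "in_max": max(in_bytes),
            # Pivot a categorical column into per-value counts
            "requestMethod_get": sum(e["requestMethod"] == "GET" for e in evs),
            "requestMethod_post": sum(e["requestMethod"] == "POST" for e in evs),
        }
    return rows

# Illustrative events only
events = [
    {"dst": "203.0.113.7", "date": "2020-01-01", "in": 100, "requestMethod": "GET"},
    {"dst": "203.0.113.7", "date": "2020-01-01", "in": 300, "requestMethod": "POST"},
    {"dst": "192.0.2.10", "date": "2020-01-01", "in": 50, "requestMethod": "GET"},
]
features = daily_features(events)
```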
Many more opportunities for feature generation!
• Windowing, interactions, historical/group statistics
• Other data sources
Feature Engineering (2)
Features Example
Feature Engineering Implementation (1)
Feature engineering implemented with Apache Spark (SQL)
• Very efficient implementation of aggregations, joins, etc.
• Develop reusable modules that are data source agnostic
• Functions defined to take as input
– keys for features
– column(s) to derive features from
– feature types
Keep track of individual feature sets and join on keys
Feature Engineering Implementation (2)
Example function call for feature generation
# Calculate feature: stats for in and out fields
keys = ['dst', 'date']
aggs = ['min', 'max', 'sum', 'mean']
in_out_stats = agg_num_columns(keys, columns=['in', 'out'], aggs=aggs)

# Register the table name
in_out_stats.registerTempTable('io_stats')

# Add the table name to the list of tables to be
# passed to the join function
tables.append('io_stats')
Feature Engineering Pipeline
Feature Engineering Takeaways
Key takeaways from feature generation process
• Spark SQL is your best friend
– Python string substitution makes it easy to generalize functionality
• Wrap complex mappings in Python functions -> register in Spark SQL
• Provide Spark as much information as you have available
– E.g., if you're pivoting a column, provide the distinct values to pivot on
Feature generation performance
• ~2 hours on 192 executors, processing ~1.5 to 2 TB of data each day
• Run time grows very little as time granularity is increased!
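The string-substitution takeaway can be sketched as a small query builder. This is a hypothetical helper, not the `agg_num_columns` function from the slides (whose definition is not shown); the table name `proxy` and the output column naming scheme are also assumptions.

```python
def build_agg_sql(table, keys, columns, aggs):
    """Build a Spark SQL aggregation query via plain Python string formatting.

    Column names are back-quoted because names such as `in` are SQL keywords.
    """
    key_list = ", ".join("`{}`".format(k) for k in keys)
    agg_exprs = [
        "{agg}(`{col}`) AS `{col}_{agg}`".format(agg=agg, col=col)
        for col in columns
        for agg in aggs
    ]
    return "SELECT {keys}, {exprs} FROM {table} GROUP BY {keys}".format(
        keys=key_list, exprs=", ".join(agg_exprs), table=table
    )

# Hypothetical table name; keys and columns mirror the example call in the slides
sql = build_agg_sql(
    "proxy", keys=["dst", "date"], columns=["in", "out"], aggs=["min", "mean"]
)
print(sql)
```

The resulting string would then be handed to the Spark SQL entry point available in the environment. Because the keys, columns, and aggregations are parameters, the same builder works for any data source that shares the key columns.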
Principal Component Analysis
What is principal component analysis?
• Method of summarizing data
• Constructs new features from old that best summarize data
– New features constructed as linear combinations of old features
• Constructed to simultaneously:
– Maximize variance
– Minimize reconstruction error
• Often used for dimensionality reduction
– Reducing the number of features in a given data set
– Remove feature redundancy
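A minimal NumPy sketch of the "variance explained" idea, using synthetic data (all values below are made up for illustration). The component variances are obtained from the singular values of the centered data, which is one standard way to compute PCA.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                    # center each feature
    # Singular values of the centered data give the component variances
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2 / (X.shape[0] - 1)
    return var / var.sum()

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
# Two strongly correlated features plus one independent low-variance feature
X = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(500, 1)),
    0.1 * rng.normal(size=(500, 1)),
])
ratio = explained_variance_ratio(X)
```

Because the first two features are nearly redundant, almost all of the variance lands on the first component, which is the situation dimensionality reduction exploits.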
Variance Explained
Reconstruction Error (1)
Decomposition, transformation, and reconstruction
• Compute principal components of input feature set
• Retain top K principal components, transform to PC space
• Invert the transformation with only the top K components
Reconstruction Error (2)
Reconstruction error is calculated by:
• Reconstruction error is defined as:
• Outliers present large deviations in last principal components
• Majority of variance is captured by top K components
– Large deviations in top K components contribute less to reconstruction error
– Large deviations in last components contribute more to reconstruction error
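The decompose, transform, and reconstruct steps can be sketched in NumPy as below. The score used here is the squared Euclidean distance between a row and its top-K reconstruction, which is one common definition and not necessarily the exact formula on the original slide. The synthetic data and function names are assumptions.

```python
import numpy as np

def fit_pca(X, k):
    """Return the feature means and the top-k principal axes (as rows)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruction_error(X, mean, components):
    """Squared Euclidean distance between each row and its rank-k reconstruction."""
    Xc = X - mean
    projected = Xc @ components.T      # transform to PC space
    restored = projected @ components  # invert using only the top-k components
    return ((Xc - restored) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
# Synthetic population lying close to the line y = 2x
population = np.hstack([t, 2 * t]) + 0.01 * rng.normal(size=(200, 2))
mean, components = fit_pca(population, k=1)

on_trend = np.array([[3.0, 6.0]])    # large, but follows the dominant direction
off_trend = np.array([[1.0, -2.0]])  # modest size, but breaks the correlation
err_on = reconstruction_error(on_trend, mean, components)[0]
err_off = reconstruction_error(off_trend, mean, components)[0]
```

The on-trend point is far from the mean yet reconstructs almost perfectly, while the off-trend point deviates along the discarded component and receives a much larger error.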
Data Transformations for PCA (1)
Should avoid using raw features as input for PCA
• Raw distribution is highly skewed
Data Transformations for PCA (2)
Results look great, right?!
• Almost all of our variance is explained by a single component
Data Transformations for PCA (3)
Log transformations are always a good start
Data Transformations for PCA (4)
Results looking better...
Data Transformations for PCA (5)
Scaling the data helps to ensure that individual features don't dominate
Data Transformations for PCA (6)
Finally looking much more balanced
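The effect of the log transform and scaling can be reproduced on synthetic data. In this sketch the three skewed features, their distributions, and the helper names are assumptions chosen only to mimic heavy-tailed proxy counts; `log1p` and per-column standardization stand in for the transformations described on these slides.

```python
import numpy as np

def log_scale(X):
    """Apply log(1 + x) to tame skew, then standardize each column."""
    logged = np.log1p(X)
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    std[std == 0] = 1.0                # guard against constant columns
    return (logged - mean) / std

def top_component_share(X):
    """Share of total variance explained by the first principal component."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    return (s[0] ** 2) / (s ** 2).sum()

rng = np.random.default_rng(0)
# Hypothetical skewed features: byte counts dwarf the other columns
raw = np.column_stack([
    rng.lognormal(mean=10, sigma=2, size=1000),  # e.g. bytes in
    rng.lognormal(mean=2, sigma=1, size=1000),   # e.g. request count
    rng.lognormal(mean=1, sigma=1, size=1000),   # e.g. distinct users
])
share_raw = top_component_share(raw)
share_scaled = top_component_share(log_scale(raw))
```

On the raw data a single component explains nearly everything simply because one feature has a far larger scale; after transforming, the variance is spread much more evenly.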
Reconstruction Error Revisited
Recall decomposition/reconstruction:
And reconstruction error calculation:
Reconstruction Error Distribution (1)
Reconstruction Error Distribution (2)
Process Overview
• Threat Extract
• Proxy Features
• Matched features
• Population features
• Decompose and stats
• Ranked Matches
• Analyst Report
• Supplement
Ranking, Stats and Enrichment (1)
Reconstruction error provides us with a ranking metric
• Allows us to determine how abnormal an IOC is w.r.t. overall population
• Doesn't provide an investigator with a concrete starting point
Need to identify the drivers of the abnormal behavior
Also helpful to supplement with contextual information
Ranking, Stats and Enrichment (2)
Utilize reconstruction error as a ranking metric
• Calculate PCA for population
– Store mean and std for transformed features
Decompose, reconstruct, score threat match features
• Join the threat matches to features
• Score all matched threats
Determine features driving large reconstruction error
• Calculate z-score for all features w.r.t. stored population mean and std
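A minimal sketch of the z-score step, assuming the features have already been transformed and that the population mean and standard deviation were stored beforehand. The feature names are borrowed from the example messages in the deck; the synthetic population, the matched row, and the helper names are illustrative assumptions.

```python
import numpy as np

def fit_population_stats(X):
    """Store per-feature mean and standard deviation of the population."""
    std = X.std(axis=0)
    std[std == 0] = 1.0                # guard against constant columns
    return X.mean(axis=0), std

def top_drivers(x, mean, std, names, n=2):
    """Return the n features with the largest absolute z-score for one row."""
    z = (x - mean) / std
    order = np.argsort(-np.abs(z))[:n]
    return [(names[i], float(z[i])) for i in order]

names = ["in_mean", "requestMethod_post", "requestMethod_get"]
rng = np.random.default_rng(0)
# Synthetic population of (already transformed) feature rows
population = rng.normal(loc=[5.0, 2.0, 8.0], scale=[1.0, 0.5, 2.0], size=(1000, 3))
mean, std = fit_population_stats(population)

# Hypothetical matched IOC with an unusually high in_mean
matched = np.array([30.0, 2.1, 9.0])
drivers = top_drivers(matched, mean, std, names)
```

Sorting by absolute z-score surfaces both unusually high and unusually low features, which gives the investigator a starting point beyond the single ranking score.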
Ranking, Stats and Enrichment (3)
Map top N lowest/highest z-scores to message strings
• Field in_mean displayed high values (max z-score: 25)
• Field requestMethod_post displayed high values (max z-score: 18)
• Field requestMethod_get displayed abnormally high values (max z-score: 17)
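Turning z-scores into messages like the ones above can be sketched as follows. The message template is modeled loosely on the slide's examples; the `describe_drivers` helper, the reporting threshold of 3, and the sample z-scores are assumptions for illustration.

```python
def describe_drivers(z_scores, n=3, threshold=3.0):
    """Turn the n largest-magnitude z-scores into short analyst-facing messages."""
    ranked = sorted(z_scores.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]
    messages = []
    for field, z in ranked:
        if abs(z) < threshold:
            continue                   # not unusual enough to report
        direction = "high" if z > 0 else "low"
        messages.append(
            "Field {} displayed {} values (z-score: {:.0f})".format(field, direction, z)
        )
    return messages

# Illustrative z-scores only
z_scores = {
    "in_mean": 25.2,
    "requestMethod_post": 18.0,
    "out_mean": -0.4,
    "requestMethod_get": 16.8,
}
messages = describe_drivers(z_scores)
```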
Ranking, Stats and Enrichment (4)
Additional enrichment with relevant contextual information
• How was traffic to IOC handled in firewall?
• What users were accessing this IOC? What business units?
• Whois lookup information: country, ownership, time since registration
• Available reputation scores, alerting from other security tools
• What specific threat feed the IOC came from
Next Steps
So, where do we go from here?
• Add in auto-encoder
• Introduce a feedback loop -> supervised learning
• Introduce additional data sources -> more features
• Look at more granular time buckets -> time dependence?
• Additional post-processing for more useful context
• Kibana dashboard
Principal Component Analysis (2)
Given some raw features:
Principal Component Analysis (3)
Given some raw features:

 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 

PCA-based Ranking of Cyber Threat Matches

  • 1. Zachary S. Brown Cyber Threat Ranking Through READ
  • 4. Who We Are (1) 4 UnitedHealth Group UnitedHealthcare Optum
  • 5. Who We Are (2) EIS: Cybersecurity group for the enterprise • Real time monitoring and alerting • Security operations • Investigation and incident response • "We have a cybersecurity team!?" Data Analytics and Security Innovation (DASI) • Big data platform and advanced analytics • Primarily data scientists, data engineers, data analysts 5
  • 6. Who We Are (3) 6 Johanna Favole Oliver Chan William Casey Zachary Brown
  • 7. Security Big Data Lake (1) Security Big Data Lake • Primary platform for all enterprise cybersecurity data • Built upon Hadoop and Elastic • Streaming ingest of ~10 – 15 TB daily (~80k EPS) 7
  • 8. Security Big Data Lake (2) 7 billion events per day from 160+ sources. Transactional: • SIEM loggers • Firewalls • Email security and web proxy appliances • Database activity monitors • Endpoint sensors • Vulnerability scans • Security ticketing system • Incident response data collectors Enriching: • IP reputation • Threat feeds • External vulnerabilities • External geolocation • Contextual transaction data • Analyst feedback Referential: • Human capital management data • System configuration management data • Enterprise technology management • Acquired entity (AE) references • Application configuration management data • Internal geolocation Reactive: • Forensic data collection • Forensic data analysis • Vulnerability scan data correlation
  • 9. Security Big Data Lake (3) 9 SIEM Real-time alerting Low-latency exploration Flexible, scalable compute
  • 10. Motivation (1) Threat feeds provide indicators of compromise (IOC) • Domains, IPs, hashes, etc. • SIEM provides some threat matching functionality • Extraction from external feeds for enrichment in the SBDL We’re drowning in threat matches • How do we determine which matches are higher priority? • Matches are rule/signature based – Supplement with statistical behavioral analysis
  • 11. Motivation (2) Two step process to better leverage threat feed matches • Extract threats from feeds, categorize, match against all data – Produces a large volume of matches • Utilize anomaly detection methods to implement a ranking system More efficient analyst workflows • Going beyond signature-based alerts • Provide analysts list of top N candidates for investigation – Provide additional contextual information to aid in investigation 11
  • 12. Motivation (3) Borrow approach from literature • AI^2: Training a big data machine to defend • Extract portions of outlier detection methodology (matrix decomposition) • Outlier detection through reconstruction error Literature describes multi-pronged approach • Reconstruction error for PCA and autoencoder models – Additional density-based score utilized as well • Human-in-the-loop to introduce feedback through auxiliary model – Introduce supervised learning model to incorporate feedback
  • 13. Scope Scope for initial POC • Use only PCA to compute reconstruction error score – Only vanilla Python and Spark available at project start • Initial focus on data captured only by enterprise web proxy – Very rich, noisy, high volume data • Initial focus on IP-based IOCs from threats – Less pre-processing of proxy data; no fuzzy matching Future plans • Autoencoder scoring, additional data sources, HITL, additional IOCs
  • 15. Threat Extraction Threat feeds • Nearly a dozen individual sources – Source formats vary wildly; CSV, JSON, nesting, etc. • Internally and externally sourced • Tens of thousands of individual IOCs each day • Inconsistent availability for some feeds ETL pipeline for processing • Un-nesting, standardization, deduplication • Each IOC tagged with type, source, etc. 15
  • 16. Threat Matching Begin by looking for all individual matches in all data • Non-trivial engineering problem! – Multiple matching categories (IP, hash, CIDR, URL/domain) – Fuzzy matching/whitelisting • Tens of thousands of individual IOCs each day – Billions of security events; Main limiting factor Inconsistency in relevance of IOCs from threat feeds • High variability in confidence and maliciousness within and across feeds • IOCs lose relevance due to a myriad of factors 16
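As a rough illustration of the IP and CIDR matching categories named on the Threat Matching slide, here is a minimal sketch using Python's standard `ipaddress` module. The function name `match_ip_iocs`, the event field `dst`, and the sample addresses are assumptions made for illustration; the talk does not show its matching code, and a nested loop like this would not scale to billions of events.

```python
import ipaddress

def match_ip_iocs(events, ip_iocs, cidr_iocs):
    """Return (event, ioc) pairs where the event destination matches an IOC.

    Hypothetical helper: `events` is a list of dicts with a 'dst' field.
    """
    exact = {ipaddress.ip_address(i) for i in ip_iocs}
    networks = [ipaddress.ip_network(c) for c in cidr_iocs]
    matches = []
    for event in events:
        addr = ipaddress.ip_address(event["dst"])
        # Exact IP match is a set lookup.
        if addr in exact:
            matches.append((event, str(addr)))
        # CIDR match checks membership in each network range.
        for net in networks:
            if addr in net:
                matches.append((event, str(net)))
    return matches

# Illustrative events using documentation-reserved address ranges.
events = [
    {"dst": "203.0.113.7", "date": "2018-01-01"},
    {"dst": "198.51.100.20", "date": "2018-01-01"},
    {"dst": "192.0.2.1", "date": "2018-01-01"},
]
matches = match_ip_iocs(
    events, ip_iocs=["203.0.113.7"], cidr_iocs=["198.51.100.0/24"]
)
for event, ioc in matches:
    print(event["dst"], "matched", ioc)
```

At the volumes described in the talk, the same logic would more plausibly be expressed as a distributed join, with the IOC list broadcast to the workers.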
  • 18. Feature Engineering (1) For any ML model, need numerical features as input • Want to build a statistical model of what is "normal" • Use this to determine which records associated with IOCs are abnormal Feature granularity • Calculate features at the level of the IOC, e.g. domain, external IP • Pick a time granularity to aggregate features over – Begin with daily features • A unique set of features can be calculated for each data source – Begin by focusing on web proxy logs; extremely rich data source
  • 19. Feature Engineering (2) Many more opportunities for feature generation! • Windowing, interactions, historical/group statistics • Other data sources
  • 21. Feature Engineering Implementation (1) Feature engineering implemented with Apache Spark (SQL) • Very efficient implementation of aggregations, joins, etc. • Develop reusable modules that are data source agnostic • Functions defined to take as input – keys for features – column(s) to derive features from – feature types Keep track of individual feature sets and join on keys 21
  • 22. Feature Engineering Implementation (2) Example function call for feature generation:
      # Calculate feature: stats for in and out fields
      # Register the table names
      # Add the table name to the list of tables to be
      # passed to the join function
      keys = ['dst', 'date']
      aggs = ['min', 'max', 'sum', 'mean']
      in_out_stats = agg_num_columns(keys, columns=['in', 'out'], aggs=aggs)
      in_out_stats.registerTempTable('io_stats')
      tables.append('io_stats')
  • 24. Feature Engineering Takeaways Key takeaways from feature generation process • Spark SQL is your best friend – Python string substitution makes it easy to generalize functionality • Wrap complex mappings in Python functions -> register in Spark SQL • Provide Spark as much information as you have available – E.g. If you're pivoting a column, provide the distinct values to pivot Feature generation performance • ~2 hours on 192 executors, processing ~1.5 - 2TB data each day • Very minimal scaling as time granularity is increased! 24
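The takeaway about Python string substitution can be sketched as follows. This is not the talk's `agg_num_columns` (its body is not shown); `build_agg_sql` and the table name `proxy_logs` are hypothetical, and the sketch only builds the Spark SQL text that such a helper might pass to a Spark session. Column names are backtick-quoted because `in` is a reserved word in SQL.

```python
def build_agg_sql(table, keys, columns, aggs):
    """Build a GROUP BY query computing every aggregate of every column.

    Hypothetical helper; inputs are assumed to be trusted identifiers.
    """
    key_list = ", ".join(f"`{k}`" for k in keys)
    # One expression per (column, aggregate) pair, aliased as <column>_<agg>.
    agg_exprs = [
        f"{agg}(`{col}`) AS {col}_{agg}"
        for col in columns
        for agg in aggs
    ]
    select_list = ", ".join([key_list] + agg_exprs)
    return f"SELECT {select_list} FROM {table} GROUP BY {key_list}"

sql = build_agg_sql(
    "proxy_logs",
    keys=["dst", "date"],
    columns=["in", "out"],
    aggs=["min", "max", "sum", "mean"],
)
print(sql)
```

The alias pattern yields feature names such as `in_mean`, which is the style of field name that appears in the ranking messages later in the deck.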
  • 26. Principal Component Analysis What is principal component analysis? • Method of summarizing data • Constructs new features from old that best summarize data – New features constructed as linear combinations of old features • Constructed to simultaneously: – Maximize variance – Minimize reconstruction error • Often used for dimensionality reduction – Reducing the number of features in a given data set – Remove feature redundancy 26
  • 28. Reconstruction Error (1) Decomposition, transformation, and reconstruction • Compute principal components of input feature set • Retain top K principal components, transform to PC space • Invert the transformation with only the top K components 28
  • 29. Reconstruction Error (2) Reconstruction error is calculated by: • Reconstruction error is defined as: • Outliers present large deviations in last principal components • Majority of variance is captured by top K components – Large deviations in top K components contribute less to reconstruction error – Large deviations in last components contribute more to reconstruction error 29
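The decompose, transform, and reconstruct steps from the two Reconstruction Error slides can be sketched with NumPy. The slide's own formula did not survive in this transcript, so the unweighted squared error used here is an assumption, as are the function name `reconstruction_error` and the synthetic data.

```python
import numpy as np

def reconstruction_error(X, k):
    """Squared reconstruction error per row, keeping the top k components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:k]
    Z = Xc @ top.T           # transform to PC space
    X_hat = Z @ top + mean   # invert using only the top k components
    return np.sum((X - X_hat) ** 2, axis=1)

# Synthetic data: two strongly correlated features plus a near-constant one.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.01 * rng.normal(size=200),
    rng.normal(scale=0.01, size=200),
])
# Append one record that deviates along the low-variance direction.
X = np.vstack([X, [0.0, 0.0, 5.0]])

errors = reconstruction_error(X, k=1)
print("most anomalous row:", int(errors.argmax()))
```

The appended record is unremarkable along the dominant component but deviates along a direction the retained component does not capture, which is the situation the slide describes for outliers.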
  • 30. Data Transformations for PCA (1) Should avoid using raw features as input for PCA • Raw distribution is highly skewed 30
  • 31. Data Transformations for PCA (2) Results look great, right?! • Almost all of our variance is explained by a single component 31
  • 32. Data Transformations for PCA (3) Log transformations are always a good start 32
  • 33. Data Transformations for PCA (4) Results looking better... 33
  • 34. Data Transformations for PCA (5) Scaling the data helps to ensure that individual features don't dominate 34
  • 35. Data Transformations for PCA (6) Finally looking much more balanced 35
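The effect shown across the Data Transformations slides can be reproduced on synthetic data. This is a sketch under assumptions: the lognormal features, `log1p` as the log transformation, and standardization as the scaling step are illustrative choices, not the talk's actual data or code.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var / var.sum()

# Three independent, highly skewed features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.lognormal(mean=10, sigma=2, size=500),
    rng.lognormal(mean=0, sigma=1, size=500),
    rng.lognormal(mean=2, sigma=0.5, size=500),
])

raw_ratio = explained_variance_ratio(X)

# Log transform to reduce skew, then scale each feature to unit variance.
L = np.log1p(X)
scaled = (L - L.mean(axis=0)) / L.std(axis=0)
scaled_ratio = explained_variance_ratio(scaled)

print("raw:   ", np.round(raw_ratio, 3))
print("scaled:", np.round(scaled_ratio, 3))
```

On the raw features a single component absorbs almost all of the variance because one feature's scale dominates; after the log transform and scaling, the variance is spread far more evenly.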
  • 36. Reconstruction Error Revisited Recall decomposition/reconstruction: And reconstruction error calculation: 36
  • 41. Ranking, Stats and Enrichment (1) Reconstruction error provides us with a ranking metric • Allows us to determine how abnormal an IOC is w.r.t. overall population • Doesn't provide an investigator with a concrete starting point Need to identify the drivers of the abnormal behavior Also helpful to supplement with contextual information
  • 42. Ranking, Stats and Enrichment (2) Utilize reconstruction error as a ranking metric • Calculate PCA for population – Store mean and std for transformed features Decompose, reconstruct, score threat match features • Join the threat matches to features • Score all matched threats Determine features driving large reconstruction error • Calculate z-score for all features w.r.t. stored population mean and std 42
  • 43. Ranking, Stats and Enrichment (3) Map top N lowest/highest z-scores to message strings: • Field in_mean displayed high values (max z-score: 25) • Field requestMethod_post displayed high values (max z-score: 18) • Field requestMethod_get displayed abnormally high values (max z-score: 17)
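The z-score step and the mapping to message strings might look like the sketch below. The function name `explain_outlier`, the population statistics, and the feature values are all assumptions, with the numbers chosen only so that the output resembles the messages on the slide. Population standard deviations are assumed to be nonzero.

```python
def explain_outlier(features, pop_mean, pop_std, top_n=3):
    """Describe the features with the largest absolute z-scores.

    Hypothetical helper; all arguments are dicts keyed by feature name.
    """
    z = {
        name: (value - pop_mean[name]) / pop_std[name]
        for name, value in features.items()
    }
    # Rank by magnitude so that unusually low values also surface.
    ranked = sorted(z.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_n]
    messages = []
    for name, score in ranked:
        direction = "high" if score > 0 else "low"
        messages.append(
            f"Field {name} displayed {direction} values (z-score: {score:.0f})"
        )
    return messages

# Illustrative stored population statistics and one matched IOC's features.
pop_mean = {"in_mean": 1000.0, "requestMethod_post": 5.0,
            "requestMethod_get": 40.0, "out_mean": 500.0}
pop_std = {"in_mean": 200.0, "requestMethod_post": 2.0,
           "requestMethod_get": 10.0, "out_mean": 100.0}
ioc_features = {"in_mean": 6000.0, "requestMethod_post": 41.0,
                "requestMethod_get": 210.0, "out_mean": 450.0}

messages = explain_outlier(ioc_features, pop_mean, pop_std)
for message in messages:
    print(message)
```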
  • 44. Ranking, Stats and Enrichment (4) Additional enrichment with relevant contextual information • How was traffic to IOC handled in firewall? • What users were accessing this IOC? What business units? • Whois lookup information: country, ownership, time since registration • Available reputation scores, alerting from other security tools • What specific threat feed the IOC came from 44
  • 45. Next Steps So, where do we go from here? • Add in autoencoder • Introduce a feedback loop -> supervised learning • Introduce additional data sources -> more features • Look at more granular time buckets -> time dependence? • Additional post-processing for more useful context • Kibana dashboard
  • 48. Principal Component Analysis (2) Given some raw features: 48
  • 49. Principal Component Analysis (3) Given some raw features: 49