4. Who We Are (1)
UnitedHealth Group: UnitedHealthcare and Optum
5. Who We Are (2)
EIS: Cybersecurity group for the enterprise
• Real time monitoring and alerting
• Security operations
• Investigation and incident response
• "We have a cybersecurity team!?"
Data Analytics and Security Innovation (DASI)
• Big data platform and advanced analytics
• Primarily data scientists, data engineers, data analysts
6. Who We Are (3)
Johanna Favole, Oliver Chan, William Casey, Zachary Brown
7. Security Big Data Lake (1)
Security Big Data Lake
• Primary platform for all enterprise cybersecurity data
• Built upon Hadoop and Elastic
• Streaming ingest of ~10 – 15 TB daily (~80k EPS)
8. Security Big Data Lake (2)
7 billion events per day from 160+ sources

Transactional
• SIEM loggers
• Firewalls
• Email security and web proxy appliances
• Database activity monitors
• Endpoint sensors
• Vulnerability scans
• Security ticketing system
• Incident response data collectors

Enriching
• IP reputation
• Threat feeds
• External vulnerabilities
• External geolocation
• Contextual transaction data
• Analyst feedback

Referential
• Human capital management data
• System configuration management data
• Enterprise technology management
• Acquired entity (AE) references
• Application configuration management data
• Internal geolocation

Reactive
• Forensic data collection
• Forensic data analysis
• Vulnerability scan data correlation
9. Security Big Data Lake (3)
• SIEM: real-time alerting
• Low-latency exploration
• Flexible, scalable compute
10. Motivation (1)
Threat feeds provide indicators of compromise (IOC)
• Domains, IPs, hashes, etc.
• SIEM provides some threat matching functionality
• Extraction from external feeds for enrichment in the SBDL
We’re drowning in threat matches
• How do we determine which matches are higher priority?
• Matches are rule/signature based
– Supplement with statistical behavioral analysis
11. Motivation (2)
Two step process to better leverage threat feed matches
• Extract threats from feeds, categorize, match against all data
– Produces a large volume of matches
• Utilize anomaly detection methods to implement a ranking system
More efficient analyst workflows
• Going beyond signature-based alerts
• Provide analysts list of top N candidates for investigation
– Provide additional contextual information to aid in investigation
12. Motivation (3)
Borrow approach from literature
• AI^2: Training a big data machine to defend
• Extract portions of outlier detection methodology (matrix decomposition)
• Outlier detection through reconstruction error
Literature describes multi-pronged approach
• Reconstruction error for PCA and autoencoder models
– Additional density-based scores are utilized as well
• Human-in-the-loop to introduce feedback through auxiliary model
– Introduce supervised learning model to incorporate feedback
13. Scope
Scope for initial POC
• Use only PCA to compute reconstruction error score
– Only vanilla Python and Spark available at project start
• Initial focus on data captured only by enterprise web proxy
– Very rich, noisy, high volume data
• Initial focus on IP-based IOCs from threats
– Less pre-processing of proxy data; no fuzzy matching
Future plans
• Auto-encoder scoring, additional data sources, HITL, additional IOCs
15. Threat Extraction
Threat feeds
• Nearly a dozen individual sources
– Source formats vary wildly; CSV, JSON, nesting, etc.
• Internally and externally sourced
• Tens of thousands of individual IOCs each day
• Inconsistent availability for some feeds
ETL pipeline for processing
• Un-nesting, standardization, deduplication
• Each IOC tagged with type, source, etc.
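The ETL steps above (un-nesting, standardization, deduplication, tagging) can be sketched in plain Python. The record shape and field names here are hypothetical illustrations, since the actual feed formats vary wildly:

```python
def normalize_iocs(raw_records, source_name):
    """Flatten, standardize, deduplicate, and tag IOC records from one feed.

    Assumes a hypothetical raw shape where a record may nest its
    indicator under {"indicator": {"type": ..., "value": ...}}.
    """
    seen = set()
    out = []
    for rec in raw_records:
        ind = rec.get("indicator", rec)          # un-nest if needed
        ioc_type = str(ind.get("type", "unknown")).lower()
        value = str(ind.get("value", "")).strip().lower()
        if not value:
            continue
        key = (ioc_type, value)
        if key in seen:                          # deduplicate
            continue
        seen.add(key)
        out.append({"type": ioc_type, "value": value, "source": source_name})
    return out

feed = [
    {"indicator": {"type": "IP", "value": "203.0.113.7 "}},
    {"type": "domain", "value": "evil.example.com"},
    {"indicator": {"type": "ip", "value": "203.0.113.7"}},  # duplicate
]
print(normalize_iocs(feed, "feed_a"))  # two records survive
```

In practice each feed would get its own un-nesting adapter in front of a shared standardization step like this one.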
16. Threat Matching
Begin by looking for all individual matches in all data
• Non-trivial engineering problem!
– Multiple matching categories (IP, hash, CIDR, URL/domain)
– Fuzzy matching/whitelisting
• Tens of thousands of individual IOCs each day
– Billions of security events; the main limiting factor
Inconsistency in relevance of IOCs from threat feeds
• High variability in confidence and maliciousness within and across feeds
• IOCs lose relevance due to a myriad of factors
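The IP/CIDR category of the matching above can be sketched with the standard-library ipaddress module. In production this logic runs over billions of events in Spark; the IOC values here are illustrative documentation addresses, and hash/domain matching is a separate path:

```python
import ipaddress

def build_networks(iocs):
    """Parse IP and CIDR IOCs into network objects once, up front."""
    nets = []
    for ioc in iocs:
        try:
            nets.append((ipaddress.ip_network(ioc, strict=False), ioc))
        except ValueError:
            pass  # non-IP IOC (hash, domain, ...) handled elsewhere
    return nets

def match_event_ips(event_ips, nets):
    """Return (event_ip, matched_ioc) pairs for every IOC hit."""
    hits = []
    for raw in event_ips:
        addr = ipaddress.ip_address(raw)
        for net, ioc in nets:
            if addr in net:
                hits.append((raw, ioc))
    return hits

nets = build_networks(["203.0.113.0/24", "198.51.100.9", "bad-hash"])
print(match_event_ips(["203.0.113.42", "192.0.2.1", "198.51.100.9"], nets))
```

Treating single IPs as /32 networks lets one code path cover both exact and CIDR matches.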
18. Feature Engineering (1)
For any ML model, need numerical features as input
• Want to build a statistical model of what is "normal"
• Use this to determine whether records associated with IOCs are abnormal
Feature granularity
• Calculate features at the level of the IOC, e.g. domain, external ip
• Pick a time granularity to aggregate features over
– Begin with daily features
• A unique set of features can be calculated for each data source
– Begin by focusing on web proxy logs; extremely rich data source
19. Feature Engineering (2)
Many more opportunities for feature generation!
• Windowing, interactions, historical/group statistics
• Other data sources
21. Feature Engineering Implementation (1)
Feature engineering implemented with Apache Spark (SQL)
• Very efficient implementation of aggregations, joins, etc.
• Develop reusable modules that are data source agnostic
• Functions defined to take as input
– keys for features
– column(s) to derive features from
– feature types
Keep track of individual feature sets and join on keys
22. Feature Engineering Implementation (2)
Example function call for feature generation
# Calculate feature: stats for the 'in' and 'out' fields
keys = ['dst', 'date']
aggs = ['min', 'max', 'sum', 'mean']
in_out_stats = agg_num_columns(keys, columns=['in', 'out'], aggs=aggs)

# Register the table name and add it to the list of tables
# to be passed to the join function
in_out_stats.registerTempTable('io_stats')
tables.append('io_stats')
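A helper like agg_num_columns might build its aggregation query through Python string substitution, as the takeaways slide suggests. This sketch is an assumption about its internals (the real function returns a DataFrame; here we just produce the SQL text that would be handed to spark.sql(), which keeps the string-building logic testable on its own):

```python
def agg_num_sql(table, keys, columns, aggs):
    """Build a Spark SQL aggregation query via string substitution.

    Hypothetical sketch of what agg_num_columns might generate
    internally; table and column names are illustrative.
    """
    exprs = [
        "{agg}({col}) AS {col}_{agg}".format(agg=agg, col=col)
        for col in columns
        for agg in aggs
    ]
    return "SELECT {keys}, {exprs} FROM {table} GROUP BY {keys}".format(
        keys=", ".join(keys), exprs=", ".join(exprs), table=table
    )

sql = agg_num_sql("proxy_logs", ["dst", "date"], ["in", "out"],
                  ["min", "max", "sum", "mean"])
print(sql)
```

Because the keys, columns, and aggregation types are all parameters, the same helper serves any data source, which is what makes the modules data-source agnostic.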
24. Feature Engineering Takeaways
Key takeaways from feature generation process
• Spark SQL is your best friend
– Python string substitution makes it easy to generalize functionality
• Wrap complex mappings in Python functions -> register in Spark SQL
• Provide Spark as much information as you have available
– E.g. If you're pivoting a column, provide the distinct values to pivot
Feature generation performance
• ~2 hours on 192 executors, processing ~1.5 - 2TB data each day
• Very minimal scaling as time granularity is increased!
26. Principal Component Analysis
What is principal component analysis?
• Method of summarizing data
• Constructs new features from old that best summarize data
– New features constructed as linear combinations of old features
• Constructed to simultaneously:
– Maximize variance
– Minimize reconstruction error
• Often used for dimensionality reduction
– Reducing the number of features in a given data set
– Remove feature redundancy
28. Reconstruction Error (1)
Decomposition, transformation, and reconstruction
• Compute principal components of input feature set
• Retain top K principal components, transform to PC space
• Invert the transformation with only the top K components
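The decompose/transform/reconstruct steps above can be illustrated in miniature. This pure-Python two-feature sketch (made-up data; the real pipeline does this in Spark over many features) keeps the top 1 of 2 principal components and returns a record's distance to its reconstruction:

```python
import math

def pca_recon_error_2d(data, point):
    """Reconstruction error of `point` using the top 1 of 2 principal
    components fit on `data` (closed-form 2x2 eigendecomposition)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # 2x2 covariance matrix entries [[a, b], [b, c]]
    a = sum(x * x for x, _ in centered) / n
    c = sum(y * y for _, y in centered) / n
    b = sum(x * y for x, y in centered) / n
    # top eigenvalue and its (normalized) eigenvector, closed form
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # project the centered point onto the top component, then invert
    px, py = point[0] - mx, point[1] - my
    score = px * vx + py * vy
    rx, ry = score * vx, score * vy
    return math.hypot(px - rx, py - ry)   # distance to reconstruction

# Points near the main trend reconstruct well; an off-trend point does not.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0), (5, 9.9)]
print(pca_recon_error_2d(data, (3, 6.0)))   # small: fits the trend
print(pca_recon_error_2d(data, (3, 1.0)))   # large: off the main axis
```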
29. Reconstruction Error (2)
Reconstruction error is calculated by comparing each record to its reconstruction
• Reconstruction error is defined as e(x) = ‖x − x̂‖², where x̂ is x projected onto the top K principal components and mapped back to the original feature space
• Outliers present large deviations in the last principal components
• Majority of variance is captured by the top K components
– Large deviations in top K components contribute less to reconstruction error
– Large deviations in last components contribute more to reconstruction error
30. Data Transformations for PCA (1)
Should avoid using raw features as input for PCA
• Raw distribution is highly skewed
31. Data Transformations for PCA (2)
Results look great, right?!
• Almost all of our variance is explained by a single component
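A standard remedy for this skew problem, assumed here rather than taken from the deck, is to log-transform heavy-tailed features and then standardize them before running PCA, so no single raw-scale feature dominates the variance:

```python
import math

def log_standardize(values):
    """Log-transform then z-scale one skewed feature column.

    log1p tames heavy right tails (e.g. byte counts); standardizing
    to zero mean and unit variance keeps any one feature from
    dominating the PCA variance.
    """
    logged = [math.log1p(v) for v in values]
    n = len(logged)
    mean = sum(logged) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in logged) / n) or 1.0
    return [(v - mean) / std for v in logged]

# A heavily skewed raw feature, e.g. bytes out per IOC per day
raw = [10, 12, 9, 11, 250000]
print(log_standardize(raw))
```

The population mean and std used here would be stored so that threat-match records can later be transformed on the same scale.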
41. Ranking, Stats and Enrichment (1)
Reconstruction error provides us with a ranking metric
• Allows us to determine how abnormal an IOC is w.r.t. overall population
• Doesn't provide an investigator with a concrete starting point
Need to identify the drivers of the abnormal behavior
Also helpful to supplement with contextual information
42. Ranking, Stats and Enrichment (2)
Utilize reconstruction error as a ranking metric
• Calculate PCA for population
– Store mean and std for transformed features
Decompose, reconstruct, score threat match features
• Join the threat matches to features
• Score all matched threats
Determine features driving large reconstruction error
• Calculate z-score for all features w.r.t. stored population mean and std
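The driver-identification step might look like this sketch: compute a z-score for each feature of one scored record against the stored population mean and std, then rank by magnitude. Field names and statistics are hypothetical, chosen to mirror the example message strings:

```python
def top_drivers(record, pop_mean, pop_std, n=3):
    """Rank one record's features by |z| against stored population
    statistics and render them as analyst-facing messages."""
    zs = {}
    for feat, value in record.items():
        std = pop_std.get(feat) or 1.0           # guard zero/missing std
        zs[feat] = (value - pop_mean.get(feat, 0.0)) / std
    ranked = sorted(zs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [
        "Field {} displayed {} values (z-score: {:.0f})".format(
            feat, "high" if z > 0 else "low", z
        )
        for feat, z in ranked[:n]
    ]

record = {"in_mean": 5100.0, "requestMethod_post": 94.0, "out_sum": 10.0}
pop_mean = {"in_mean": 100.0, "requestMethod_post": 4.0, "out_sum": 9.0}
pop_std = {"in_mean": 200.0, "requestMethod_post": 5.0, "out_sum": 2.0}
print(top_drivers(record, pop_mean, pop_std))
```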
43. Ranking, Stats and Enrichment (3)
Map top N lowest/highest z-scores to message strings
• Field in_mean displayed high values (max z-score: 25)
• Field requestMethod_post displayed high values (max z-score: 18)
• Field requestMethod_get displayed abnormally high values (max z-score: 17)
44. Ranking, Stats and Enrichment (4)
Additional enrichment with relevant contextual information
• How was traffic to IOC handled in firewall?
• What users were accessing this IOC? What business units?
• Whois lookup information: country, ownership, time since registration
• Available reputation scores, alerting from other security tools
• What specific threat feed the IOC came from
45. Next Steps
So, where do we go from here?
• Add in auto-encoder
• Introduce a feedback loop -> supervised learning
• Introduce additional data sources -> more features
• Look at more granular time buckets -> time dependence?
• Additional post-processing for more useful context
• Kibana dashboard