Zachary S. Brown
Cyber Threat Ranking Through READ
Outline
Who We Are (1)
UnitedHealth Group: UnitedHealthcare and Optum
Who We Are (2)
EIS: Cybersecurity group for the enterprise
• Real-time monitoring and alerting
• Security operations
• Investigation and incident response
• "We have a cybersecurity team!?"
Data Analytics and Security Innovation (DASI)
• Big data platform and advanced analytics
• Primarily data scientists, data engineers, data analysts
Who We Are (3)
Johanna Favole, Oliver Chan, William Casey, Zachary Brown
Security Big Data Lake (1)
Security Big Data Lake
• Primary platform for all enterprise cybersecurity data
• Built upon Hadoop and Elastic
• Streaming ingest of ~10–15 TB daily (~80k EPS)
Security Big Data Lake (2)
7 billion events per day from 160+ sources
Transactional
• SIEM loggers
• Firewalls
• Email security and web proxy appliances
• Database activity monitors
• Endpoint sensors
• Vulnerability scans
• Security ticketing system
• Incident response data collectors
Enriching
• IP reputation
• Threat feeds
• External vulnerabilities
• External geolocation
• Contextual transaction data
• Analyst feedback
Referential
• Human capital management data
• System configuration management data
• Enterprise technology management
• Acquired entity (AE) references
• Application configuration management data
• Internal geolocation
Reactive
• Forensic data collection
• Forensic data analysis
• Vulnerability scan data correlation
Security Big Data Lake (3)
• SIEM: real-time alerting
• Elastic: low-latency exploration
• Hadoop: flexible, scalable compute
Motivation (1)
Threat feeds provide indicators of compromise (IOC)
• Domains, IPs, hashes, etc.
• SIEM provides some threat matching functionality
• Extraction from external feeds for enrichment in the SBDL
We’re drowning in threat matches
• How do we determine which matches are higher priority?
• Matches are rule/signature-based
– Supplement with statistical behavioral analysis
Motivation (2)
Two-step process to better leverage threat feed matches
• Extract threats from feeds, categorize, match against all data
– Produces a large volume of matches
• Utilize anomaly detection methods to implement a ranking system
More efficient analyst workflows
• Going beyond signature-based alerts
• Provide analysts list of top N candidates for investigation
– Provide additional contextual information to aid in investigation
Motivation (3)
Borrow approach from literature
• AI^2: Training a big data machine to defend
• Extract portions of outlier detection methodology (matrix decomposition)
• Outlier detection through reconstruction error
Literature describes multi-pronged approach
• Reconstruction error for PCA and autoencoder models
– Additional density-based scores are utilized as well
• Human-in-the-loop to introduce feedback through auxiliary model
– Introduce supervised learning model to incorporate feedback
Scope
Scope for initial POC
• Use only PCA to compute reconstruction error score
– Only vanilla Python and Spark available at project start
• Initial focus on data captured only by enterprise web proxy
– Very rich, noisy, high-volume data
• Initial focus on IP-based IOCs from threats
– Less pre-processing of proxy data; no fuzzy matching
Future plans
• Autoencoder scoring, additional data sources, HITL, additional IOCs
Threat Extraction
Threat feeds
• Nearly a dozen individual sources
– Source formats vary wildly; CSV, JSON, nesting, etc.
• Internally and externally sourced
• Tens of thousands of individual IOCs each day
• Inconsistent availability for some feeds
ETL pipeline for processing
• Un-nesting, standardization, deduplication
• Each IOC tagged with type, source, etc.
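A minimal PySpark sketch of that normalization step, assuming a hypothetical JSON feed with a nested indicators array (real feed paths and schemas vary widely):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical nested JSON feed; each real feed needs its own reader
raw = spark.read.json('/data/threat_feeds/feed_x/')

iocs = (
    raw
    # Un-nest: one row per individual indicator
    .select(F.explode('indicators').alias('ioc'))
    # Standardize: common schema, each IOC tagged with type and source
    .select(
        F.lower(F.col('ioc.value')).alias('indicator'),
        F.col('ioc.type').alias('ioc_type'),
        F.lit('feed_x').alias('source'),
        F.current_date().alias('load_date'),
    )
    # Deduplicate within the day's pull
    .dropDuplicates(['indicator', 'ioc_type', 'source'])
)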
Threat Matching
Begin by looking for all individual matches in all data
• Non-trivial engineering problem!
– Multiple matching categories (IP, hash, CIDR, URL/domain)
– Fuzzy matching/whitelisting
• Tens of thousands of individual IOCs each day
– Billions of security events; the main limiting factor
Inconsistency in relevance of IOCs from threat feeds
• High variability in confidence and maliciousness within and across feeds
• IOCs lose relevance due to a myriad of factors
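For exact IP matching, the IOC set is tiny relative to the event data, so a broadcast join lets Spark scan the billions of proxy events without a shuffle. A hedged sketch, assuming the iocs table above and a proxy_events DataFrame with a hypothetical dst column:

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

# proxy_events: DataFrame of web proxy logs, one row per event (assumed)
# Broadcast the small IOC table to every executor rather than shuffling events
ip_iocs = iocs.filter(F.col('ioc_type') == 'ip').select('indicator', 'source')

matches = proxy_events.join(
    broadcast(ip_iocs),
    proxy_events['dst'] == ip_iocs['indicator'],
    'inner',
)

This covers only the exact IP case in the POC scope; CIDR ranges and fuzzy URL/domain matching need more than an equality join.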
Feature Engineering (1)
For any ML model, need numerical features as input
• Want to build a statistical model of what is "normal"
• Use this to determine which records associated with IOCs are abnormal
Feature granularity
• Calculate features at the level of the IOC, e.g. domain, external IP
• Pick a time granularity to aggregate features over
– Begin with daily features
• A unique set of features can be calculated for each data source
– Begin by focusing on web proxy logs; extremely rich data source
Feature Engineering (2)
Many more opportunities for feature generation!
• Windowing, interactions, historical/group statistics
• Other data sources
Features Example
Feature Engineering Implementation (1)
Feature engineering implemented with Apache Spark (SQL)
• Very efficient implementation of aggregations, joins, etc.
• Develop reusable modules that are data source agnostic
• Functions defined to take as input
– keys for features
– column(s) to derive features from
– feature types
Keep track of individual feature sets and join on keys
Feature Engineering Implementation (2)
Example function call for feature generation:

keys = ['dst', 'date']
aggs = ['min', 'max', 'sum', 'mean']

# Calculate features: stats for the in and out fields
in_out_stats = agg_num_columns(keys, columns=['in', 'out'], aggs=aggs)

# Register the table name
in_out_stats.registerTempTable('io_stats')

# Add the table name to the list of tables to be
# passed to the join function
tables.append('io_stats')
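The deck doesn't show agg_num_columns itself; here is a minimal sketch of what such a helper could look like, using the Spark SQL string-substitution approach called out in the takeaways (the proxy source table name is an assumption):

def agg_num_columns(keys, columns, aggs, source_table='proxy'):
    """Hypothetical reconstruction: build per-key aggregate features via
    Spark SQL string substitution. Assumes source_table is registered as a
    temp table; output columns follow a '<column>_<agg>' naming convention
    (e.g. in_mean, out_max)."""
    exprs = ', '.join(
        '{agg}(`{col}`) AS {col}_{agg}'.format(agg=agg, col=col)
        for col in columns for agg in aggs
    )
    return spark.sql(
        'SELECT {keys}, {exprs} FROM {table} GROUP BY {keys}'.format(
            keys=', '.join(keys), exprs=exprs, table=source_table
        )
    )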
Feature Engineering Pipeline
Feature Engineering Takeaways
Key takeaways from the feature generation process
• Spark SQL is your best friend
– Python string substitution makes it easy to generalize functionality
• Wrap complex mappings in Python functions -> register in Spark SQL (see the sketch below)
• Provide Spark as much information as you have available
– e.g., if you're pivoting a column, provide the distinct values to pivot
Feature generation performance
• ~2 hours on 192 executors, processing ~1.5–2 TB of data each day
• Very minimal scaling as time granularity is increased!
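To illustrate the "wrap complex mappings" point, a sketch of registering a plain Python function as a Spark SQL UDF (the proxy table and url column are assumptions):

from urllib.parse import urlparse
from pyspark.sql.types import StringType

def url_to_domain(url):
    # Complex mapping wrapped in plain Python: reduce a URL to its host
    try:
        return urlparse(url).netloc.split(':')[0]
    except Exception:
        return None

# Register the function so it can be called inside Spark SQL strings
spark.udf.register('url_to_domain', url_to_domain, StringType())

domains = spark.sql(
    "SELECT url_to_domain(url) AS domain, count(*) AS hits "
    "FROM proxy GROUP BY url_to_domain(url)"
)

Python UDFs ship rows through the Python workers, so prefer native Spark SQL functions where one exists.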
Principal Component Analysis
What is principal component analysis?
• Method of summarizing data
• Constructs new features from old that best summarize data
– New features constructed as linear combinations of old features
• Constructed to simultaneously:
– Maximize variance
– Minimize reconstruction error
• Often used for dimensionality reduction
– Reducing the number of features in a given data set
– Remove feature redundancy
Variance Explained
Reconstruction Error (1)
Decomposition, transformation, and reconstruction
• Compute principal components of input feature set
• Retain top K principal components, transform to PC space
• Invert the transformation with only the top K components
Reconstruction Error (2)
Reconstruction error is calculated by projecting onto the top K components and measuring what is lost:
• Reconstruction: $\hat{x} = W_K W_K^{\top} (x - \mu) + \mu$, where $W_K$ holds the top $K$ principal components and $\mu$ is the feature mean
• Reconstruction error: $e(x) = \lVert x - \hat{x} \rVert_2^2$
• Outliers present large deviations in the last principal components
• Majority of variance is captured by top K components
– Large deviations in top K components contribute less to reconstruction error
– Large deviations in last components contribute more to reconstruction error
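A minimal, self-contained sketch of this scoring idea with synthetic data (in the real pipeline the features are first log-transformed and scaled, as the next slides show):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 40))  # stand-in for transformed feature rows

# Fit PCA on the population and keep the top K components
K = 10
pca = PCA(n_components=K).fit(X)

# Transform to PC space, then invert using only the top K components
X_hat = pca.inverse_transform(pca.transform(X))

# Reconstruction error per row: e(x) = ||x - x_hat||^2
errors = np.sum((X - X_hat) ** 2, axis=1)

# The largest errors are the most anomalous rows
top_candidates = np.argsort(errors)[::-1][:25]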
Data Transformations for PCA (1)
Should avoid using raw features as input for PCA
• Raw distribution is highly skewed
Data Transformations for PCA (2)
Results look great, right?!
• Almost all of our variance is explained by a single component
Data Transformations for PCA (3)
Log transformations are always a good start
Data Transformations for PCA (4)
Results looking better...
Data Transformations for PCA (5)
Scaling the data helps to ensure that individual features don't dominate
Data Transformations for PCA (6)
Finally looking much more balanced
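A small sketch of the effect, using synthetic skewed features spanning several orders of magnitude (a stand-in for raw proxy byte counts):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Skewed, positive features with wildly different scales
X_raw = rng.lognormal(sigma=2.0, size=(10_000, 20)) * np.logspace(0, 6, 20)

def top_variance_ratio(X):
    # Fraction of variance explained by the first principal component
    return PCA(n_components=5).fit(X).explained_variance_ratio_[0]

print(top_variance_ratio(X_raw))  # the largest-scale feature dominates

# Log transform to tame the skew, then scale so no single feature dominates
X = StandardScaler().fit_transform(np.log1p(X_raw))
print(top_variance_ratio(X))      # far more balanced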
Reconstruction Error Revisited
Recall the decomposition/reconstruction: $\hat{x} = W_K W_K^{\top} (x - \mu) + \mu$
And the reconstruction error calculation: $e(x) = \lVert x - \hat{x} \rVert_2^2$
Reconstruction Error Distribution (1)
Reconstruction Error Distribution (2)
Process Overview
• Threat Extract + Proxy Features → Matched features
• Proxy Features → Population features
• Matched features + Population features → Decompose and stats → Ranked Matches
• Ranked Matches + Supplement → Analyst Report
Ranking, Stats and Enrichment (1)
Reconstruction error provides us with a ranking metric
• Allows us to determine how abnormal an IOC is w.r.t. overall population
• Doesn't provide an investigator with a concrete starting point
Need to identify the drivers of the abnormal behavior
Also helpful to supplement with contextual information
Ranking, Stats and Enrichment (2)
Utilize reconstruction error as a ranking metric
• Calculate PCA for population
– Store mean and std for transformed features
Decompose, reconstruct, score threat match features
• Join the threat matches to features
• Score all matched threats
Determine features driving large reconstruction error
• Calculate z-score for all features w.r.t. stored population mean and std
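A sketch of the z-scoring step, with hypothetical arrays standing in for the transformed population and matched-threat feature matrices:

import numpy as np

rng = np.random.default_rng(0)
X_pop = rng.normal(size=(10_000, 40))   # transformed population features
X_matched = rng.normal(size=(500, 40))  # transformed features for matches

# Population statistics stored when the model is fit
pop_mean = X_pop.mean(axis=0)
pop_std = X_pop.std(axis=0)

# Z-score each matched row against the population (guarding zero std)
Z = (X_matched - pop_mean) / np.where(pop_std > 0, pop_std, 1.0)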
Ranking, Stats and Enrichment (3)
Map top N lowest/highest z-scores to message strings:
• Field in_mean displayed high values (max z-score: 25)
• Field requestMethod_post displayed high values (max z-score: 18)
• Field requestMethod_get displayed abnormally high values (max z-score: 17)
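A sketch of that mapping, assuming one row of z-scores per match and a parallel list of feature names (the template mirrors the examples above):

import numpy as np

def z_score_messages(z_row, feature_names, top_n=3):
    # Take the largest-magnitude z-scores and render analyst-readable strings
    order = np.argsort(-np.abs(z_row))[:top_n]
    messages = []
    for i in order:
        direction = 'high' if z_row[i] > 0 else 'low'
        messages.append(
            'Field {0} displayed abnormally {1} values (max z-score: {2:.0f})'
            .format(feature_names[i], direction, z_row[i])
        )
    return messages

feature_names = ['in_mean', 'requestMethod_post', 'requestMethod_get', 'out_max']
z_row = np.array([25.0, 18.0, 17.0, -0.4])
for line in z_score_messages(z_row, feature_names):
    print(line)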
Ranking, Stats and Enrichment (4)
Additional enrichment with relevant contextual information:
• How was traffic to the IOC handled in the firewall?
• What users were accessing this IOC? What business units?
• Whois lookup information: country, ownership, time since registration
• Available reputation scores, alerting from other security tools
• Which specific threat feed the IOC came from
Next Steps
So, where do we go from here?
• Add in an autoencoder
• Introduce a feedback loop -> supervised learning
• Introduce additional data sources -> more features
• Look at more granular time buckets -> time dependence?
• Additional post-processing for more useful context
• Kibana dashboard
Principal Component Analysis (2)
Given some raw features:
Principal Component Analysis (3)
Given some raw features:
