4. Who We Are (1)
UnitedHealth Group: UnitedHealthcare and Optum
5. Who We Are (2)
EIS: Cybersecurity group for the enterprise
• Real time monitoring and alerting
• Security operations
• Investigation and incident response
• "We have a cybersecurity team!?"
Data Analytics and Security Innovation (DASI)
• Big data platform and advanced analytics
• Primarily data scientists, data engineers, data analysts
6. Who We Are (3)
Johanna Favole, Oliver Chan, William Casey, Zachary Brown
7. Security Big Data Lake (1)
Security Big Data Lake
• Primary platform for all enterprise cybersecurity data
• Built upon Hadoop and Elastic
• Streaming ingest of ~10 – 15 TB daily (~80k EPS)
8. Security Big Data Lake (2)
7 billion events per day from 160+ sources

Transactional
• SIEM loggers
• Firewalls
• Email security and web proxy appliances
• Database activity monitors
• Endpoint sensors
• Vulnerability scans
• Security ticketing system
• Incident response data collectors

Enriching
• IP reputation
• Threat feeds
• External vulnerabilities
• External geolocation
• Contextual transaction data
• Analyst feedback

Referential
• Human capital management data
• System configuration management data
• Enterprise technology management
• Acquired entity (AE) references
• Application configuration management data
• Internal geolocation

Reactive
• Forensic data collection
• Forensic data analysis
• Vulnerability scan data correlation
9. Security Big Data Lake (3)
• SIEM: real-time alerting
• Low-latency exploration
• Flexible, scalable compute
10. Motivation (1)
Threat feeds provide indicators of compromise (IOC)
• Domains, IPs, hashes, etc.
• SIEM provides some threat matching functionality
• Extraction from external feeds for enrichment in the SBDL
We’re drowning in threat matches
• How do we determine which matches are higher priority?
• Matches are rule/signature based
– Supplement with statistical behavioral analysis
11. Motivation (2)
Two step process to better leverage threat feed matches
• Extract threats from feeds, categorize, match against all data
– Produces a large volume of matches
• Utilize anomaly detection methods to implement a ranking system
More efficient analyst workflows
• Going beyond signature-based alerts
• Provide analysts list of top N candidates for investigation
– Provide additional contextual information to aid in investigation
12. Motivation (3)
Borrow approach from literature
• AI^2: Training a big data machine to defend
• Extract portions of outlier detection methodology (matrix decomposition)
• Outlier detection through reconstruction error
Literature describes multi-pronged approach
• Reconstruction error for PCA and autoencoder models
– Additional density-based scores are utilized as well
• Human-in-the-loop to introduce feedback through auxiliary model
– Introduce supervised learning model to incorporate feedback
13. Scope
Scope for initial POC
• Use only PCA to compute reconstruction error score
– Only vanilla Python and Spark available at project start
• Initial focus on data captured only by enterprise web proxy
– Very rich, noisy, high volume data
• Initial focus on IP-based IOCs from threats
– Less pre-processing of proxy data; no fuzzy matching
Future plans
• Auto-encoder scoring, additional data sources, HITL, additional IOCs
15. Threat Extraction
Threat feeds
• Nearly a dozen individual sources
– Source formats vary wildly; CSV, JSON, nesting, etc.
• Internally and externally sourced
• Tens of thousands of individual IOCs each day
• Inconsistent availability for some feeds
ETL pipeline for processing
• Un-nesting, standardization, deduplication
• Each IOC tagged with type, source, etc.
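The ETL steps above (un-nesting, standardization, deduplication, tagging) can be sketched in plain Python. The record shape and field names here are hypothetical illustrations, since the actual feed formats vary wildly:

```python
def normalize_iocs(raw_records, source_name):
    """Flatten, standardize, deduplicate, and tag IOC records from one feed.

    Assumes a hypothetical raw shape where a record may nest its
    indicator under {"indicator": {"type": ..., "value": ...}}.
    """
    seen = set()
    out = []
    for rec in raw_records:
        ind = rec.get("indicator", rec)          # un-nest if needed
        ioc_type = str(ind.get("type", "unknown")).lower()
        value = str(ind.get("value", "")).strip().lower()
        if not value:
            continue
        key = (ioc_type, value)
        if key in seen:                          # deduplicate
            continue
        seen.add(key)
        out.append({"type": ioc_type, "value": value, "source": source_name})
    return out

feed = [
    {"indicator": {"type": "IP", "value": "203.0.113.7 "}},
    {"type": "domain", "value": "evil.example.com"},
    {"indicator": {"type": "ip", "value": "203.0.113.7"}},  # duplicate
]
print(normalize_iocs(feed, "feed_a"))  # two records survive
```

In practice each feed would get its own un-nesting adapter in front of a shared standardization step like this one.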
16. Threat Matching
Begin by looking for all individual matches in all data
• Non-trivial engineering problem!
– Multiple matching categories (IP, hash, CIDR, URL/domain)
– Fuzzy matching/whitelisting
• Tens of thousands of individual IOCs each day
– Billions of security events; the main limiting factor
Inconsistency in relevance of IOCs from threat feeds
• High variability in confidence and maliciousness within and across feeds
• IOCs lose relevance due to a myriad of factors
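The IP/CIDR category of the matching above can be sketched with the standard-library ipaddress module. In production this logic runs over billions of events in Spark; the IOC values here are illustrative documentation addresses, and hash/domain matching is a separate path:

```python
import ipaddress

def build_networks(iocs):
    """Parse IP and CIDR IOCs into network objects once, up front."""
    nets = []
    for ioc in iocs:
        try:
            nets.append((ipaddress.ip_network(ioc, strict=False), ioc))
        except ValueError:
            pass  # non-IP IOC (hash, domain, ...) handled elsewhere
    return nets

def match_event_ips(event_ips, nets):
    """Return (event_ip, matched_ioc) pairs for every IOC hit."""
    hits = []
    for raw in event_ips:
        addr = ipaddress.ip_address(raw)
        for net, ioc in nets:
            if addr in net:
                hits.append((raw, ioc))
    return hits

nets = build_networks(["203.0.113.0/24", "198.51.100.9", "bad-hash"])
print(match_event_ips(["203.0.113.42", "192.0.2.1", "198.51.100.9"], nets))
```

Treating single IPs as /32 networks lets one code path cover both exact and CIDR matches.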
18. Feature Engineering (1)
For any ML model, need numerical features as input
• Want to build a statistical model of what is "normal"
• Use this to determine whether records associated with IOCs are abnormal
Feature granularity
• Calculate features at the level of the IOC, e.g. domain, external ip
• Pick a time granularity to aggregate features over
– Begin with daily features
• A unique set of features can be calculated for each data source
– Begin by focusing on web proxy logs; extremely rich data source
19. Feature Engineering (2)
Many more opportunities for feature generation!
• Windowing, interactions, historical/group statistics
• Other data sources
21. Feature Engineering Implementation (1)
Feature engineering implemented with Apache Spark (SQL)
• Very efficient implementation of aggregations, joins, etc.
• Develop reusable modules that are data source agnostic
• Functions defined to take as input
– keys for features
– column(s) to derive features from
– feature types
Keep track of individual feature sets and join on keys
22. Feature Engineering Implementation (2)
Example function call for feature generation
# Calculate feature: stats for the 'in' and 'out' fields
keys = ['dst', 'date']
aggs = ['min', 'max', 'sum', 'mean']
in_out_stats = agg_num_columns(keys, columns=['in', 'out'], aggs=aggs)

# Register the table name and add it to the list of tables
# to be passed to the join function
in_out_stats.registerTempTable('io_stats')
tables.append('io_stats')
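A helper like agg_num_columns might build its aggregation query through Python string substitution, as the takeaways slide suggests. This sketch is an assumption about its internals (the real function returns a DataFrame; here we just produce the SQL text that would be handed to spark.sql(), which keeps the string-building logic testable on its own):

```python
def agg_num_sql(table, keys, columns, aggs):
    """Build a Spark SQL aggregation query via string substitution.

    Hypothetical sketch of what agg_num_columns might generate
    internally; table and column names are illustrative.
    """
    exprs = [
        "{agg}({col}) AS {col}_{agg}".format(agg=agg, col=col)
        for col in columns
        for agg in aggs
    ]
    return "SELECT {keys}, {exprs} FROM {table} GROUP BY {keys}".format(
        keys=", ".join(keys), exprs=", ".join(exprs), table=table
    )

sql = agg_num_sql("proxy_logs", ["dst", "date"], ["in", "out"],
                  ["min", "max", "sum", "mean"])
print(sql)
```

Because the keys, columns, and aggregation types are all parameters, the same helper serves any data source, which is what makes the modules data-source agnostic.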
24. Feature Engineering Takeaways
Key takeaways from feature generation process
• Spark SQL is your best friend
– Python string substitution makes it easy to generalize functionality
• Wrap complex mappings in Python functions -> register in Spark SQL
• Provide Spark as much information as you have available
– E.g. If you're pivoting a column, provide the distinct values to pivot
Feature generation performance
• ~2 hours on 192 executors, processing ~1.5 - 2TB data each day
• Very minimal scaling as time granularity is increased!
26. Principal Component Analysis
What is principal component analysis?
• Method of summarizing data
• Constructs new features from old that best summarize data
– New features constructed as linear combinations of old features
• Constructed to simultaneously:
– Maximize variance
– Minimize reconstruction error
• Often used for dimensionality reduction
– Reducing the number of features in a given data set
– Remove feature redundancy
28. Reconstruction Error (1)
Decomposition, transformation, and reconstruction
• Compute principal components of input feature set
• Retain top K principal components, transform to PC space
• Invert the transformation with only the top K components
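The decompose/transform/reconstruct steps above can be illustrated in miniature. This pure-Python two-feature sketch (made-up data; the real pipeline does this in Spark over many features) keeps the top 1 of 2 principal components and returns a record's distance to its reconstruction:

```python
import math

def pca_recon_error_2d(data, point):
    """Reconstruction error of `point` using the top 1 of 2 principal
    components fit on `data` (closed-form 2x2 eigendecomposition)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # 2x2 covariance matrix entries [[a, b], [b, c]]
    a = sum(x * x for x, _ in centered) / n
    c = sum(y * y for _, y in centered) / n
    b = sum(x * y for x, y in centered) / n
    # top eigenvalue and its (normalized) eigenvector, closed form
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # project the centered point onto the top component, then invert
    px, py = point[0] - mx, point[1] - my
    score = px * vx + py * vy
    rx, ry = score * vx, score * vy
    return math.hypot(px - rx, py - ry)   # distance to reconstruction

# Points near the main trend reconstruct well; an off-trend point does not.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0), (5, 9.9)]
print(pca_recon_error_2d(data, (3, 6.0)))   # small: fits the trend
print(pca_recon_error_2d(data, (3, 1.0)))   # large: off the main axis
```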
29. Reconstruction Error (2)
Reconstruction error is calculated by comparing each record to its reconstruction
• Reconstruction error is defined as e(x) = ‖x − x̂‖², where x̂ is x projected onto the top K principal components and mapped back to the original feature space
• Outliers present large deviations in the last principal components
• Majority of variance is captured by the top K components
– Large deviations in top K components contribute less to reconstruction error
– Large deviations in last components contribute more to reconstruction error
30. Data Transformations for PCA (1)
Should avoid using raw features as input for PCA
• Raw distribution is highly skewed
31. Data Transformations for PCA (2)
Results look great, right?!
• Almost all of our variance is explained by a single component
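A standard remedy for this skew problem, assumed here rather than taken from the deck, is to log-transform heavy-tailed features and then standardize them before running PCA, so no single raw-scale feature dominates the variance:

```python
import math

def log_standardize(values):
    """Log-transform then z-scale one skewed feature column.

    log1p tames heavy right tails (e.g. byte counts); standardizing
    to zero mean and unit variance keeps any one feature from
    dominating the PCA variance.
    """
    logged = [math.log1p(v) for v in values]
    n = len(logged)
    mean = sum(logged) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in logged) / n) or 1.0
    return [(v - mean) / std for v in logged]

# A heavily skewed raw feature, e.g. bytes out per IOC per day
raw = [10, 12, 9, 11, 250000]
print(log_standardize(raw))
```

The population mean and std used here would be stored so that threat-match records can later be transformed on the same scale.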
41. Ranking, Stats and Enrichment (1)
Reconstruction error provides us with a ranking metric
• Allows us to determine how abnormal an IOC is w.r.t. overall population
• Doesn't provide an investigator with a concrete starting point
Need to identify the drivers of the abnormal behavior
Also helpful to supplement with contextual information
42. Ranking, Stats and Enrichment (2)
Utilize reconstruction error as a ranking metric
• Calculate PCA for population
– Store mean and std for transformed features
Decompose, reconstruct, score threat match features
• Join the threat matches to features
• Score all matched threats
Determine features driving large reconstruction error
• Calculate z-score for all features w.r.t. stored population mean and std
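The driver-identification step might look like this sketch: compute a z-score for each feature of one scored record against the stored population mean and std, then rank by magnitude. Field names and statistics are hypothetical, chosen to mirror the example message strings:

```python
def top_drivers(record, pop_mean, pop_std, n=3):
    """Rank one record's features by |z| against stored population
    statistics and render them as analyst-facing messages."""
    zs = {}
    for feat, value in record.items():
        std = pop_std.get(feat) or 1.0           # guard zero/missing std
        zs[feat] = (value - pop_mean.get(feat, 0.0)) / std
    ranked = sorted(zs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [
        "Field {} displayed {} values (z-score: {:.0f})".format(
            feat, "high" if z > 0 else "low", z
        )
        for feat, z in ranked[:n]
    ]

record = {"in_mean": 5100.0, "requestMethod_post": 94.0, "out_sum": 10.0}
pop_mean = {"in_mean": 100.0, "requestMethod_post": 4.0, "out_sum": 9.0}
pop_std = {"in_mean": 200.0, "requestMethod_post": 5.0, "out_sum": 2.0}
print(top_drivers(record, pop_mean, pop_std))
```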
43. Ranking, Stats and Enrichment (3)
Map top N lowest/highest z-scores to message strings
• Field in_mean displayed high values (max z-score: 25)
• Field requestMethod_post displayed high values (max z-score: 18)
• Field requestMethod_get displayed abnormally high values (max z-score: 17)
44. Ranking, Stats and Enrichment (4)
Additional enrichment with relevant contextual information
• How was traffic to IOC handled in firewall?
• What users were accessing this IOC? What business units?
• Whois lookup information: country, ownership, time since registration
• Available reputation scores, alerting from other security tools
• What specific threat feed the IOC came from
45. Next Steps
So, where do we go from here?
• Add in auto-encoder
• Introduce a feedback loop -> supervised learning
• Introduce additional data sources -> more features
• Look at more granular time buckets -> time dependence?
• Additional post-processing for more useful context
• Kibana dashboard