David Hurley, Microsoft
Bryan Jeffrey, Microsoft
Naveed Ahmad, Microsoft
Speed Kills. If you cannot detect and remediate an attacker before they achieve their objective, you lose. The sooner you detect an attacker along the kill chain, the better your chance of disrupting and thwarting the attack.
The M365 Vanquish system leverages open-source technologies such as Spark on Azure HDI clusters and Siphon/Kafka to deliver detections and alerts in minutes at M365 scale. Signals are generated by an on-box agent, HostIDS (developed by ODSP Protect), as well as from Windows security events, and are sent to Vanquish for processing in the cloud. Detection processing is not tightly coupled to workload service resources, and implementing new detections is not tied to the deployment of the workload services being monitored. This removes the risk of changes in security detection processing impacting production services and allows increased agility in detection creation, processing, and deployment.
We’ll show examples from the ‘Red October’ pen test of how this agility allows incident-specific detections to be created, deployed, and delivering results while an incident is in progress. Additionally, M365-wide IOC (Indicator of Compromise) detections can be added in seconds; we’ll show recent pen test examples of how adding incident-specific command-and-control IPs can provide scope and insight into attacker movement. Dynamic signal filters across all of M365 can likewise be applied in seconds to focus an investigation.
Another key factor in enabling agility is being able to validate quickly that changes to the system work as intended. Using AttackBot (the M365 Red Team’s attack generation tool), Vanquish continuously measures detection effectiveness across M365 services with automated, real attack signal, allowing us to validate that detections continue to work. By measuring effectiveness rather than attesting to it, you can quickly identify regressions when they occasionally happen and have confidence that your detections are up and working.
Agility is also a key factor in our machine-learning-based detections. In M365 Vanquish, model training and promotion are fully automated. New models are trained and promoted constantly, with measurements of effectiveness against AttackBot and historical attacks gatekeeping the promotion of each model. By automating this process end to end, we have confidence that we are always armed with the best possible models. Signals are evaluated and scored in real time against the most recent model, enabling high-fidelity, low-latency urgent alerting within minutes of attacker activity. We’ll show examples of pen test activity that produced urgent alerts within minutes!
By designing detection systems that ensure agility with measurable effectiveness, you can improve your security posture with confidence that is not misplaced.
3. Taken from the 2018 Verizon Data Breach Investigations Report
https://www.verizonenterprise.com/resources/reports/rp_DBIR_2018_Report_execsummary_en_xg.pdf
4. What we are going to talk about
What is Vanquish?
Agility by Design
Agility through Measurability
Agility via Machine Learning
Lessons of Red October Pen Test
5. What is Vanquish?
Near real-time security monitoring & analytics platform for M365 Data Center infrastructure
• Detections
• Remediation
• Alerting
• Telemetry from Hosts
• Integrated Context
• Incident Management
• Analyst Tools
7. Agility by Design
Move fast, don’t impact customers
Treat detections as code, not scripts
Leverage the right technologies
Detections at the speed of attackers
Remediation and Investigation at the speed of attackers
8. Vanquish is decoupled from monitored assets
• Move fast, but don’t impact customers
• Deploy new code without risk to monitored assets
• Apply filters across all dimensions in seconds
• Create IOCs across all dimensions in seconds
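The “filters and IOCs across all dimensions in seconds” point can be sketched as a detection-side matcher whose indicator set is updated at runtime rather than at deploy time. This is a minimal illustration under assumed names; the class, the signal fields, and the example address are invented, not the Vanquish interfaces.

```python
# Minimal sketch of a runtime-updatable IOC matcher. All names and
# addresses here are illustrative, not the actual Vanquish API.
class IocMatcher:
    def __init__(self):
        self._iocs = set()

    def add_iocs(self, indicators):
        # A set update takes effect immediately for every signal that
        # follows -- no redeploy of the detection pipeline needed.
        self._iocs.update(indicators)

    def match(self, signal):
        """Return the indicators this signal touches (empty set if none)."""
        return self._iocs & {signal.get("src_ip"), signal.get("dest_ip")}

matcher = IocMatcher()
matcher.add_iocs({"203.0.113.7"})  # e.g. an incident-specific C2 address
```

Because matching is just a set intersection, an incident responder can scope attacker movement across the whole fleet seconds after learning a new command-and-control address.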
9. Detections are code – not scripts
Broken is not agile
Detections are tested – deployment is gated on passing tests
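“Detections are code, deployment is gated on passing tests” can be illustrated with a detection written as a plain, testable function plus a gate that refuses to deploy on any failing case. The detection logic and event fields below are assumptions made for the sketch, not a real Vanquish detection.

```python
# Sketch of "detections are code": each detection is a plain, testable
# function, and deployment is gated on its test cases passing. The
# detection logic and event fields are illustrative assumptions.
def office_spawns_shell(event):
    """Flag Office apps spawning a shell -- a common initial-access pattern."""
    office = {"winword.exe", "excel.exe", "powerpnt.exe"}
    shells = {"cmd.exe", "powershell.exe"}
    return (event.get("parent", "").lower() in office
            and event.get("process", "").lower() in shells)

def gate_deployment(detection, test_cases):
    # Deploy only when every (event, expected_verdict) pair passes.
    return all(detection(ev) == want for ev, want in test_cases)

CASES = [
    ({"parent": "WINWORD.EXE", "process": "cmd.exe"}, True),
    ({"parent": "explorer.exe", "process": "cmd.exe"}, False),
]
```

Keeping negative cases alongside positive ones is what makes “broken is not agile” enforceable: a noisy or silently dead detection fails the gate the same way a compile error would.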
11. Detections at the speed of attackers
Detection of Badness → Forensic Analysis → Created New Detection → Added the Detection to ML Model → More Badness Found
The Hunt for Red October
• Alerted <4 minutes after intrusion
• IOCs added in minutes
• New detection deployed within hours
12. Remediation: Too Slow
1. 9am: Decision to remediate
2. 1pm: Attacker starts pivoting
3. 4pm: Remediation complete
Ample opportunity to remediate
Delayed by tooling
Active attacker kept ahead of us
13. Investigation and Remediation at the speed of attackers
On-Host Telemetry → Cloud-Based Detection → Remediation Subsystem → Data-Center Management Service → On-Host Remediation or Investigation
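The flow above can be sketched as a chain of pluggable stages: on-host telemetry feeds a cloud detection, a verdict yields a remediation action, and the action is executed back on the host via the data-center management layer. Every function and field name below is a stand-in for illustration, not the real subsystem interfaces.

```python
# Sketch of the telemetry -> detection -> remediation -> on-host
# execution chain. All names are illustrative stand-ins.
def remediation_flow(event, detect, plan_action, execute_on_host):
    verdict = detect(event)
    if verdict is None:
        return None  # benign: nothing to remediate
    action = plan_action(verdict)
    return execute_on_host(event["host"], action)
```

Keeping each stage behind an interface is what lets remediation tooling move at the speed of attackers: the slow manual handoffs from the “Too Slow” timeline become one automated call path.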
15. System is Up!
Run pipeline as a service
Monitor for data latency & completeness
Monitor Spark Jobs
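The data-latency check above can be sketched as a freshness monitor, assuming each data source reports the event time of its newest processed record. The source names and thresholds are invented for the example.

```python
import time

# Illustrative data-latency check for pipeline health monitoring,
# assuming each source reports the event time of its newest processed
# record. Source names and thresholds are made up.
def stale_sources(last_event_time, max_lag_seconds, now=None):
    """Return the sources whose newest data exceeds the allowed lag."""
    now = time.time() if now is None else now
    return sorted(src for src, ts in last_event_time.items()
                  if now - ts > max_lag_seconds)
```

A detection pipeline that is running but starved of fresh signal is as blind as one that is down, which is why latency and completeness are monitored alongside the Spark jobs themselves.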
16. Endpoints are covered!
We look for heartbeats and configuration correctness for each host
We have monitoring for HostIDS health
Remediation is automated for unhealthy endpoints
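The endpoint-coverage checks above can be sketched as a triage pass: a host is healthy only if it has heartbeated within the window and reports the expected agent configuration, and anything else is queued for automated remediation. The field names, window, and config string are assumptions for the sketch.

```python
# Sketch of endpoint-coverage triage: heartbeat freshness plus
# configuration correctness, with unhealthy hosts queued for automated
# remediation. Field names, window, and config string are assumptions.
HEARTBEAT_WINDOW = 300            # seconds
EXPECTED_CONFIG = "hostids-v2"    # hypothetical agent config version

def triage_hosts(hosts, now):
    """Split hosts into (healthy, needs_remediation) name lists."""
    healthy, remediate = [], []
    for host in hosts:
        fresh = now - host["last_heartbeat"] <= HEARTBEAT_WINDOW
        configured = host["config"] == EXPECTED_CONFIG
        (healthy if fresh and configured else remediate).append(host["name"])
    return healthy, remediate
```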
18. Pen Test results
Pen tests in the last year which did not trigger a paging alert: 0
Before we get too overconfident – our Red Team is awesome
• Detecting them does not mean that they did not achieve their objective
M365 still believes in the Assume Breach approach
19. AttackBot is constantly validating detections
Automated Attacks Run Frequently → Process Signal / Create Detections → Auto-Label Detections → Measure Results → Adjust (if needed)
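The measurement step in the loop above can be sketched as a regression check: replay a labelled attack, compare the detections that fired against the ones the attack is known to trigger, and surface anything missing. The detection names are invented for illustration.

```python
# Sketch of the "Measure Results" step of the AttackBot loop: compare
# the detections that fired for a replayed attack against the set the
# attack is known to trigger. Detection names are invented.
def detection_regressions(expected, fired):
    """Detections that should have fired for this attack but did not."""
    return set(expected) - set(fired)
```

A non-empty result is what turns “attesting to effectiveness” into “measuring effectiveness”: the regression is caught by automation, not discovered during a real incident.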
22. Anomaly Calculation
• A service in a Data Center is largely uniform
• Automate whitelists for normal behavior
• State snapshots: autorun reg keys, group membership
• Challenges:
• Anomalous ≠ Malicious
• Emerging behaviors create noise
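The whitelist/snapshot idea above can be sketched as a diff: compare a host’s current state (e.g. autorun registry keys, group membership) against an automatically learned baseline of normal values, and report only what the baseline has never seen. Categories and values below are illustrative.

```python
# Minimal sketch of snapshot-based anomaly detection: diff a host's
# current state against an automatically learned whitelist of normal
# values. Category and value names are illustrative.
def anomalies(baseline, snapshot):
    """Per-category items present now that the baseline has never seen."""
    result = {}
    for category, values in snapshot.items():
        new = values - baseline.get(category, set())
        if new:
            result[category] = new
    return result
```

This also shows why anomalous ≠ malicious: a legitimate new autorun entry rolled out by an engineering team produces exactly the same kind of diff as an attacker’s persistence mechanism.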
23. Anomaly Detection is Not Enough
~500 Billion Events Per Day
+ We Catch Attacks
[Charts: Anomaly Detections per Day, 8/25–9/2, in the hundreds of thousands; Alerts per Day over the same period, in single digits]
24. Supervised Machine Learning
Maintain an archive of known malicious behavior
• Pen test, attack automation, and more
• Anything that our security analysts have labelled malicious
New behavior → Is it similar to known attacks?
Limitation – can’t learn what we haven’t seen
• But there is value in auto-learning what we have seen
• With a world-class Pen Test team you can auto-learn a lot
Challenge – M365 evolves quickly → the model becomes stale quickly
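The “is new behavior similar to known attacks?” question can be sketched as nearest-neighbour scoring: reduce behaviour to a feature vector and score it by cosine similarity to the closest vector in the archive of labelled malicious behaviour. The real system trains models over far richer features; this is only a sketch of the idea.

```python
import math

# Illustrative "similar to known attacks?" scoring via cosine
# similarity to the nearest labelled malicious feature vector.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def attack_score(vector, known_attacks):
    """Similarity of a new behaviour to its nearest labelled attack."""
    return max((cosine(vector, k) for k in known_attacks), default=0.0)
```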
25. Repeatable Intelligent Automation
• Data processing, model training, evaluation & promotion take time when done manually
• AI & automation are the key to agility & better results
Pipeline: Data Processing (wrangling, normalization, sampling, bootstrapping, etc.) → Feature Extraction → ML Model Training & Evaluation → Model Promotion & Threshold Selection
Repeatable intelligent automation runs this pipeline end to end, without needing human intervention.
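The promotion step of the pipeline above can be sketched as a gate: a freshly trained model replaces production only if its measured effectiveness on the AttackBot corpus and historical attacks clears a floor and does not regress from the current model. The recall metric and threshold value are invented examples of such a gate, not the actual promotion criteria.

```python
# Sketch of automated, gated model promotion: the candidate must clear
# an effectiveness floor and must not regress from production. The
# metric (recall) and the threshold are invented for illustration.
def should_promote(candidate_recall, production_recall, floor=0.95):
    """Promote only when the candidate clears the floor and doesn't regress."""
    return candidate_recall >= floor and candidate_recall >= production_recall
```

Making the gate mechanical is what gives confidence that you are “always armed with the best possible models”: no model ships on a hunch, and no regression ships at all.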
26. Model Performance and Automated Learning
Hunt for Red October: 24 machines compromised in 4 days
• 10 – alerted by ML before humans
• 6 – tied between ML & humans
• 8 – missed by ML at first; the new malicious behavior was identified and labelled by humans on a couple of machines, learned by ML automatically, and ML then alerted on the rest
27. Agile Model Experimentation and Update
Adding/updating features in the ML model doesn’t require a code change.
Feature Extraction: Normalized Detections → Feature Vectors
<Features>
  <Feature Type="Numeric" Signal="Detection1" Operation="Count" Field="ProcessName" /> <!-- Number of processes captured -->
  <Feature Type="Numeric" Signal="Detection2" Operation="Max" Field="Score" />
  <Feature Type="Numeric" Signal="Detection3" Operation="MaxSum" Field="Bytes,IP,Port" /> <!-- New detection feature: max bytes transferred to a destination -->
</Features>
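Config-driven feature extraction of this kind can be sketched as follows: each <Feature> element becomes one entry in the feature vector, so adding a feature is a config change rather than a code change. The element and attribute names follow the slide; the operations supported here are a small assumed subset of whatever the real extractor handles.

```python
import xml.etree.ElementTree as ET

# Sketch of config-driven feature extraction: one vector entry per
# <Feature> element. Element/attribute names follow the slide; the
# supported operations are an assumed subset.
CONFIG = """
<Features>
  <Feature Type="Numeric" Signal="Detection1" Operation="Count" Field="ProcessName" />
  <Feature Type="Numeric" Signal="Detection2" Operation="Max" Field="Score" />
</Features>
"""

def extract_features(config_xml, detections):
    """Build a feature vector from normalized detection records."""
    vector = []
    for feat in ET.fromstring(config_xml):
        values = [d[feat.get("Field")] for d in detections
                  if d["signal"] == feat.get("Signal")]
        if feat.get("Operation") == "Count":
            vector.append(len(values))
        elif feat.get("Operation") == "Max":
            vector.append(max(values, default=0))
    return vector
```

Because the extractor only interprets declarations, experimenting with a new feature means editing XML and letting the automated training/promotion pipeline decide whether the resulting model is better.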
31. Takeaways
Design to move fast, without impacting customers
Build confidence through continuous validation
Effectiveness at scale through Intelligent Automated ML
If you are an M365 service – get onboarded with us :-)
32. Questions?
Bryan Jeffrey, Naveed Ahmad, David Hurley
O365 Signals - Security Signals Team
Members in Cambridge, Redmond, and Suzhou
Contact us:
O365f-enggsise@microsoft.com
Bryan.Jeffrey@microsoft.com
Navahm@microsoft.com
Davehur@microsoft.com
M365 Service that wants to onboard to Vanquish?
https://aka.ms/getvanquish