Graph The Planet 2019 - Intrusion Detection with Graphs

The Office 365 intrusion detection team uses graphs to link alerts and incorporate low-fidelity observations without overwhelming our analysts. In this talk, we describe how we represent alerts in the graph, how we use the structure of the graph to determine which alerts should be reviewed by our analysts, and how we rank subgraphs to ensure that the most important activity is reviewed first. We also discuss approaches we are investigating next to get even more value out of our alert graph.

Published in: Engineering

  1. Intrusion Detection with Graphs
     Faster, smarter, and with more context
  2. The challenge
     Windows server intrusion detection in Office 365
       • Security event logs from hundreds of thousands of servers
       • Contains system activity like deployment, upgrade, engineer troubleshooting
       • Analysis and response performed by security engineering team
     Graphs help us succeed at scale and in detail
       • Review alerts in context, not in isolation
       • Prioritize investigation according to risk
       • Incorporate low-fidelity signals without overwhelming analysts
  3. Detection pipeline
     Detection inputs
       • Process, user behavior from built-in Windows audit events
       • Per-process network activity, DNS lookups
       • Windows internal subsystem activity via ETW monitoring
     Detection results
       • Stored in a flexible-schema columnar database (Azure Data Explorer)
       • Column values are normalized to enforce common semantics across results
       • Classified according to the fidelity of the detection
  4. Building the graph
     Three steps
       • Extract entities that represent “pivots” between detection results
       • Link each result to the entities it contains and insert these into the graph
       • If an entity already exists from a prior step, use it
     Forms a hypergraph that links related results together
       • Resulting graph is sparsely connected and easy to visualize
       • Algorithm is O(n) and trivial to implement in JavaScript, C#, etc.
  5. Building the graph
       • Anomalous DLL: rundll32.exe launched as svc_sql11 on CFE110095
       • New process uploading: rundll32.exe to 40.114.40.133 on CFE110095
       • Large transfer: 50MB to 40.114.40.133 from sqlagent.exe on SQL11006
  6. Building the graph
       • Anomalous DLL: rundll32.exe launched as svc_sql11 on CFE110095
       • New process uploading: rundll32.exe to 40.114.40.133 on CFE110095
       • Large transfer: 50MB to 40.114.40.133 from sqlagent.exe on SQL11006
     [Graph diagram: detection nodes anomalousdll, procupload, and largetransfer linked to entity nodes svc_sql11, CFE110095, rundll32.exe, 40.114.40.133, sqlagent.exe, and SQL11006; edge labels: detection type, hostname, process, user]
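The three graph-building steps and the example alerts above can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: the record layout (`id`, `type`, `entities`), the entity kind names (including `ip`), and the fourth, unrelated result are assumptions made for the example. Results and entities are stored as nodes of a bipartite graph, and connected components stand in for the linked groups of results.

```python
from collections import defaultdict

# Hypothetical detection results modeled on the three example alerts in the slides.
# Field names and entity kinds are illustrative assumptions; r4 is an invented,
# unrelated result added only to show that it stays in its own group.
RESULTS = [
    {"id": "r1", "type": "anomalousdll",
     "entities": [("process", "rundll32.exe"), ("user", "svc_sql11"), ("hostname", "CFE110095")]},
    {"id": "r2", "type": "procupload",
     "entities": [("process", "rundll32.exe"), ("ip", "40.114.40.133"), ("hostname", "CFE110095")]},
    {"id": "r3", "type": "largetransfer",
     "entities": [("process", "sqlagent.exe"), ("ip", "40.114.40.133"), ("hostname", "SQL11006")]},
    {"id": "r4", "type": "largetransfer",
     "entities": [("process", "backup.exe"), ("ip", "10.0.0.7"), ("hostname", "WEB22001")]},
]


def build_graph(results):
    """Return adjacency sets for a bipartite result/entity graph in one pass."""
    graph = defaultdict(set)
    for result in results:
        result_node = ("result", result["id"])
        graph[result_node]  # make sure a result with no entities still appears
        for entity in result["entities"]:
            # The (kind, value) tuple is the node key, so an entity seen in an
            # earlier result is reused rather than inserted again.
            graph[result_node].add(entity)
            graph[entity].add(result_node)
    return graph


def linked_results(graph):
    """Group result ids by connected component using an iterative traversal."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            node = stack.pop()
            if node[0] == "result":
                members.append(node[1])
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(sorted(members))
    return groups


groups = linked_results(build_graph(RESULTS))
print(groups)
```

With this data, r1 and r2 share the process and hostname, and r2 and r3 share the IP address, so the three example alerts end up in one group while r4 stays alone. Each result and entity is visited a constant number of times per edge, which is consistent with the O(n) claim on the slide.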
  7. Graph clustering
     Each cluster represents an “incident”
       • Detection results with entities in common that tell a story
       • Analysts view and triage all results in the cluster together
       • View cluster results in tabular form for increased density and detail
     Identical clusters are merged together
       • Define similarity by the types of detection results each cluster contains
       • Collapses the long tail of small clusters caused by environment-wide changes
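One way to read "identical clusters are merged together" is to key each cluster by the set of detection types it contains and group clusters that share a key. The sketch below assumes exactly that: the signature ignores how many times each type fired and which machines were involved, which the slides do not specify. The function name and input shape are hypothetical.

```python
from collections import defaultdict


def merge_by_signature(clusters):
    """Group cluster indexes by the set of detection types each cluster contains.

    `clusters` is a list with one list of detection-type names per cluster.
    """
    merged = defaultdict(list)
    for index, detection_types in enumerate(clusters):
        # Assumption: two clusters are "identical" when their type sets match.
        signature = frozenset(detection_types)
        merged[signature].append(index)
    return dict(merged)


# Hypothetical clusters, described only by the detection types of their results.
example_clusters = [
    ["procupload", "largetransfer"],
    ["largetransfer", "procupload", "procupload"],
    ["anomalousdll"],
]
merged = merge_by_signature(example_clusters)
```

Here the first two clusters collapse into one entry because they contain the same detection types, which is how an environment-wide change that produces many small look-alike clusters could be reviewed once.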
  8. Cluster scoring
     Clusters must meet criteria to be eligible for triage
       • One result classified alert or atomic
       • Two unique detection types classified behavioral
     Score based on detection and entity uniqueness
       • Points assigned to each distinct detection type in the cluster
       • Divided by number of distinct machines emitting that detection type
       • Multiplied together to generate an overall cluster score
       • Down-votes systemic behavior and up-votes clusters with many unique detections
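The eligibility rule and the score described above can be sketched as follows. Several details are assumptions, since the slides do not state them: the point values in `POINTS`, the field names (`type`, `fidelity`, `hostname`), the fidelity assigned to each example result, treating the two eligibility conditions as alternatives, and counting distinct machines within the cluster rather than across the environment.

```python
from collections import defaultdict
from math import prod

# Hypothetical point values per detection type; the slides give no actual numbers.
POINTS = {"anomalousdll": 10.0, "procupload": 6.0, "largetransfer": 4.0}


def is_eligible(results):
    """Assumed rule: one alert/atomic result, OR two distinct behavioral types."""
    has_high_fidelity = any(r["fidelity"] in ("alert", "atomic") for r in results)
    behavioral_types = {r["type"] for r in results if r["fidelity"] == "behavioral"}
    return has_high_fidelity or len(behavioral_types) >= 2


def cluster_score(results, points=POINTS):
    """Multiply (points / distinct machines) over each detection type in the cluster."""
    machines = defaultdict(set)
    for r in results:
        machines[r["type"]].add(r["hostname"])
    return prod(points[t] / len(hosts) for t, hosts in machines.items())


# A focused cluster: three different detection types, each seen on one machine.
FOCUSED = [
    {"type": "anomalousdll", "fidelity": "atomic", "hostname": "CFE110095"},
    {"type": "procupload", "fidelity": "behavioral", "hostname": "CFE110095"},
    {"type": "largetransfer", "fidelity": "behavioral", "hostname": "SQL11006"},
]

# A systemic cluster: one behavioral detection type firing on eight machines.
SYSTEMIC = [
    {"type": "largetransfer", "fidelity": "behavioral", "hostname": f"SRV{i}"}
    for i in range(8)
]

print(is_eligible(FOCUSED), cluster_score(FOCUSED))    # True 240.0
print(is_eligible(SYSTEMIC), cluster_score(SYSTEMIC))  # False 0.5
```

Under these assumptions the focused cluster scores 10 × 6 × 4 = 240, while the systemic one scores 4 / 8 = 0.5 and is not eligible for triage. Note that the product only rewards additional detection types when each type's points exceed its machine count, so the real point values matter.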
  9. Cluster-based actions
     Alerting for high-scoring clusters
       • In-memory graph ingests new detection results and triage decisions
       • Scores each cluster, persists cluster snapshot as JSON, exposes REST API
       • Emits a high-fidelity alert when cluster score reaches a threshold
     Automated triage for environment-wide behavior
       • “Time-travel triage” identifies activity that occurs across many servers
       • Adds a rule to suppress future alerts and a detection result to inform analysts
  10. Opportunities
      Time-series analysis
        • Updated cluster snapshots are written every 5 minutes
        • Can we visualize progression over time or score based on rate of change?
      Improved cluster scoring
        • Can we use statistics to boost influence of detections that rarely fire?
        • Can we categorize detections by killchain stage and look for in-time-order traversal?
        • Can we use ML to identify detection types that typically fire together?
  11. Bonus
      Same technique can be applied to customer audit logs
        • Are privileged operations being performed across many resources?
        • Are specific IP addresses responsible for a high number of access attempts?
        • Are sensitive documents being accessed in bulk by a single user?
      Example using O365 audit logs and PowerBI: aka.ms/auditgraph
        • Graph-based exploratory data analysis on user behavior
        • Great opportunity to help customers get more value out of their audit logs
        • Would love to see someone make this a point-and-click integration with O365
  12. Thank you!
      mswann@microsoft.com
      @MSwannMSFT
      linkedin.com/in/swannman
