DataWorks 2018: How Big Data and AI Saved the Day

1 | © 2018 Interset Software
How Big Data and AI Saved the Day:
Critical IP Almost Walked Out the Door
Roy Wilds, PhD
Field Data Scientist
Interset.AI

Welcome
Partners
About Interset
• 75 employees & growing
• 450% ARR growth
• Data science & analytics focused on cybersecurity
• 100 person-years of Anomaly Detection R&D
• Offices in Ottawa, Canada & Newport Beach, California
About Me
• Data miner scientist since 2006
• 4+ years building machine learning systems for threat hunting
• 8 years' experience using Hadoop for large-scale advanced analytics
Field Data Scientist
• Identify valuable data feeds
• Optimize system for use cases
We uncover the threats that matter!

What is AI-Based Security Analytics About?
Advanced analytics to help you catch the bad guys

Increasing Threat Hunting Efficiency
Low Success Rate SOC Cycle
Generate Highly Anomalous Threat Leads

Augment Architecture to Increase Visibility & Efficiency
[Diagram: SECURITY ANALYTICS layer added on top of existing data sources: SIEM, IAM, ENDPOINT, NETWORK, DLP, BUSINESS APPLICATIONS, CUSTOM DATA]

Platform based on Unsupervised Machine Learning & AI
ACQUIRE DATA: Broad data collection. Gather the raw materials.
Data sources: SIEM, DLP, ENDPOINT, BizApps, NETWORK, IAM, CUSTOM DATA
CREATE UNIQUE BASELINES: Determine what is normal.
DETECT, MEASURE AND SCORE ANOMALIES: Find the behavior that matters.
HIGH QUALITY THREAT LEADS: Contextual views. Drill-down and cyber-hunting. Workflow engine for incident response.
Threat leads: INTERNAL RECON, INFECTED HOST, DATA STAGING & THEFT, COMPROMISED ACCOUNT, LATERAL MOVEMENT, ACCOUNT MISUSE, FRAUD, CUSTOM
Kibana

Mathematically Measure Cybersecurity Risk

Baseline “Unique Normal” for Every Entity
CREATE UNIQUE BASELINES: Determine what is normal
• Rules & thresholds don’t work: they assume the same rules work for every entity, causing many false positives. They require the system to already be coded to know what it’s looking for, which is not scalable and makes it easy for adversaries to game the system.
• Must scale horizontally to accommodate measurement of “unique normal” for thousands of entities → requires big data architecture for storage and compute.
• Need unsupervised machine learning to mathematically discover patterns that create unique baselines:
• for a single entity (user, machine, printer, server, website, ...)
• for a group of entities (peer group)
• for all entities (population)
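The three baseline levels above can be illustrated with a minimal sketch. It assumes a simple mean and standard deviation baseline over a single hypothetical statistic (`mb_moved` per day); the field names, users, and the z-score measure are illustrative assumptions, not the models used in the platform described here.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baselines(events, key):
    """Group the hypothetical 'mb_moved' statistic by key(event).

    Returns {group: (mean, standard deviation)} for each group.
    """
    groups = defaultdict(list)
    for event in events:
        groups[key(event)].append(event["mb_moved"])
    return {group: (mean(values), pstdev(values)) for group, values in groups.items()}


def z_score(value, baseline):
    """Distance of value from the baseline mean, in standard deviations."""
    mu, sigma = baseline
    return 0.0 if sigma == 0 else (value - mu) / sigma


# Hypothetical history: daily megabytes moved by two very different users.
history = [
    {"user": "ann", "peer_group": "engineering", "mb_moved": mb}
    for mb in (10, 12, 9, 11, 10, 13, 12)
] + [
    {"user": "bob", "peer_group": "backup-admins", "mb_moved": mb}
    for mb in (900, 1100, 1000, 950, 1050, 980, 1020)
]

per_entity = build_baselines(history, key=lambda e: e["user"])            # single entity
per_peer_group = build_baselines(history, key=lambda e: e["peer_group"])  # peer group
population = build_baselines(history, key=lambda e: "all")                # all entities

# The same observation (1,050 MB moved in one day) measured three ways.
ann_z = z_score(1050, per_entity["ann"])      # far outside Ann's own normal
bob_z = z_score(1050, per_entity["bob"])      # ordinary for Bob
pop_z = z_score(1050, population["all"])      # unremarkable against everyone

print(f"ann: {ann_z:.1f}  bob: {bob_z:.1f}  population: {pop_z:.1f}")
```

In this toy data a population-wide rule would not flag Ann's 1,050 MB day, while a fixed threshold low enough to catch it would flag Bob every day; measuring each entity against its own history separates the two cases.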

Multiple ML Algorithms to Assess Enterprise Risk
Data sources:
● Authentication Logs
● Endpoint Logs
● Operating System Logs
● Proxy Logs
● VPN Logs
● Printer Logs
● Network Logs
● File/Network Share Logs
Models:
● Volumetric Models
● Neural Networks
● Probability Distribution Estimation
● Other
Detection of Threats like:
● Compromised Account
● Data Breach
● Fraud
● Infected Host
Based on Anomalies like:
● Multiple failed logins
● Unusual locations
● Unusual successful attempt
From Individually Measured Statistics for Every Entity Like:
● Ann moves a significant volume of data
● Ann accesses and takes from file folders
● Printer had multiple failed logins
● Server accesses unusual locations
● Server shows unusual successful login
● Ann’s peer has different expense report for the same event
● Ann sends email to personal account
Entities:
● Account
● Machine
● File
● IP Addresses
● Servers
● Websites
● Printers
● Projects
Many Data Sources → Detect Anomalies → Produce Risk Score: 96
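The slide does not say how individual anomalies are combined into a risk score. As a hedged illustration only, the sketch below assumes each anomaly carries a probability of being malicious and combines them with a noisy-OR, scaled to 0 to 100. The function name `risk_score`, the anomaly names, and the probabilities are all hypothetical, and this is not claimed to be the scoring method behind the score shown above.

```python
def risk_score(anomaly_probabilities):
    """Combine anomaly probabilities into a 0-100 score with a noisy-OR.

    Assumption: anomalies are treated as independent, so the chance that
    all of them are benign is the product of (1 - p) over the anomalies.
    """
    p_all_benign = 1.0
    for p in anomaly_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        p_all_benign *= 1.0 - p
    return round(100 * (1.0 - p_all_benign))


# Hypothetical anomalies observed for one entity.
ann_anomalies = {
    "moves a significant volume of data": 0.7,
    "accesses unusual file folders": 0.6,
    "sends email to personal account": 0.5,
}

score = risk_score(ann_anomalies.values())
print(score)
```

A design point worth noting: with this kind of aggregation, several moderately unusual behaviors by the same entity raise the score more than any single one does, which matches the idea of ranking entities rather than raising one alert per event.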

Insider Threat Detection Requires Measuring “Unique Normal”
Because of scalability shortcomings, current tools must assume common rules for the entire population
Comparing everyone to the same pattern means many false positives
Measuring “Unique Normal” for each user/machine/filesystem/printer/... results in accuracy
Only large-scale machine learning can measure what is normal for every user for every category

“Unique Normal”, Or Not Requires Big Data & Unsupervised #ML
Supervised approaches, such as deep learning, are good for cybersecurity data with lots of labels, e.g. malware. The malware use case has decades’ worth of example binaries, both malicious and innocent.
Unsupervised approaches are best for cybersecurity data with limited data, typically without labels, such as detecting anomalies indicative of unique insider threats where there is not enough data for supervised ML.
Supervised learning is learning by example and requires “labeled” data.
Unsupervised learning is self-discovery of patterns and doesn’t need labels/examples.

Because Every SOC Has LOTS of Data
5,210,465,083
Billions of events analyzed with machine learning
Anomalies discovered by data science
High quality “most wanted” list
Users, machines, files, projects, servers, sharing behavior, resources,
websites, IP Addresses and more

To Find Threats Such As:
• At-Risk Employee
• High-Risk Employees
• Account Misuse
• Privilege Account Misuse
• Terminated Employee Activity
• Data Staging
• Data Exfiltration
• Email Exfiltration
• Print Exfiltration
• USB Exfiltration
• Unusual data access
• Unusual uploads
• Compromised Account
• C2 Activity Detection
• Impossible Journeys
• Internal Recon
• Dormant Account Usage
• Unusual Login Patterns
• Audit Log Tampering
• Unusual Traffic
• Password Manipulation
• Abnormal Processes
• Unusual Applications
• Infected Host
• Malicious Tunneling
• Bot Detection
• Mooching
• Snooping
• Interactions with dormant resources/files
• High Risk IP/Data Access
• Lateral Movement
• Transaction Abuse
• Expense Fraud
Categories: Insider Threat, Advanced Threat, IP Theft, Data Breach, Fraud
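One item in this list, impossible journeys, lends itself to a short worked example: two logins by the same account whose locations are too far apart for the time between them. The sketch below is an assumption-laden illustration, not the detection logic of any product: the login record fields, the coordinates, and the 1,000 km/h speed limit are all hypothetical choices.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 1000.0  # assumption: roughly the speed of a commercial flight


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def is_impossible_journey(login_a, login_b, max_kmh=MAX_PLAUSIBLE_KMH):
    """True if travelling between the two logins would exceed max_kmh."""
    hours = abs((login_b["time"] - login_a["time"]).total_seconds()) / 3600
    distance = haversine_km(login_a["lat"], login_a["lon"],
                            login_b["lat"], login_b["lon"])
    if hours == 0:
        return distance > 0
    return distance / hours > max_kmh


# Hypothetical logins for one account (approximate city coordinates).
ottawa_9am = {"time": datetime(2018, 6, 19, 9, 0), "lat": 45.42, "lon": -75.70}
newport_10am = {"time": datetime(2018, 6, 19, 10, 0), "lat": 33.62, "lon": -117.93}
newport_5pm = {"time": datetime(2018, 6, 19, 17, 0), "lat": 33.62, "lon": -117.93}

print(is_impossible_journey(ottawa_9am, newport_10am))  # one hour apart
print(is_impossible_journey(ottawa_9am, newport_5pm))   # eight hours apart
```

In practice geolocation from IP addresses is noisy (VPNs, mobile carriers), so a check like this is better treated as one anomaly among many than as a verdict on its own.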

Case Study #1: $20B Manufacturer
• 2 engineers stole data
• 1 year, $1 million spent: large security vendor failed to find anything
• 2 weeks: easily identified the 2 engineers
• Found 3 additional users stealing data in North America
• Found 8 additional users stealing data in China

Case Study #2: High Profile Media Leak
IT’S ABOUT VISIBILITY

Case Study #3: Healthcare Records & Payments
• Profile: 6.5 billion transactions annually, 750+ customers, 500+ employees
• Team of 7: CISO, 1 security architect, 3 security analysts, 2 network security
• Analytics surfaced (for example) an employee who attempted to move “sensitive data” from endpoint to personal Dropbox
• Employee was arrested and prosecuted using incident data
Focused and prioritized incident responses
Incident alert accuracy increased from 28% to 92%
Incident mitigation coverage doubled from 70 per week to 140

Case Study #4: Defense Contractor
High Probability Anomalous Behavior Models
• Detected large copies to the portable hard drive, at an unusual time of day
• Bayesian models to measure and detect highly improbable events
High Risk File Models
• Detected high risk files, including PowerPoints collecting large amounts of inappropriate content
• Risk aggregation based on suspicious behaviors and unusual derivative movement
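The slide mentions Bayesian models for highly improbable events without specifying them. As one possible reading, the sketch below uses a categorical model over hour of day with a symmetric Dirichlet prior, so the posterior predictive probability of an event at a given hour is (count + alpha) / (total + 24 * alpha). The event history, the prior strength `alpha`, and the function name are assumptions for illustration.

```python
from collections import Counter

HOURS_PER_DAY = 24


def posterior_predictive(hour_counts, hour, alpha=1.0):
    """Probability of the next event falling in `hour`.

    Assumes a categorical model over the 24 hours with a symmetric
    Dirichlet(alpha) prior, updated with the observed hour counts.
    """
    total = sum(hour_counts.values())
    return (hour_counts[hour] + alpha) / (total + HOURS_PER_DAY * alpha)


# Hypothetical history: hours at which one user copied files to a
# portable drive (100 events, all during business hours).
history = Counter([9, 10, 10, 11, 14, 15, 15, 16, 10, 11] * 10)

p_usual = posterior_predictive(history, hour=10)    # a typical hour for this user
p_unusual = posterior_predictive(history, hour=3)   # never seen before

print(f"10:00 -> {p_usual:.3f}, 03:00 -> {p_unusual:.4f}")
```

The prior keeps never-seen hours from getting probability zero, so a 3 a.m. copy is scored as highly improbable rather than impossible, and the estimate sharpens as more of the user's history accumulates.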

Lesson: AI is the buzzword, but The Math Matters – Test It
Recommendations
• Agree on the use cases in advance
• Use a proof-of-concept with historical/existing data to test the security analytics (SA) math
• Engage red team or pen testing if available
• Evaluate the results: Do they support the use cases you care about?
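One way to make the last recommendation concrete during a proof-of-concept is to compare the ranked threat leads against entities already known to be bad, for example from past incidents or a red team exercise. The sketch below computes precision and recall over the top k leads; the entity names, the ranking, and the choice of k are hypothetical.

```python
def precision_recall_at_k(ranked_entities, known_bad, k):
    """Score the top-k threat leads against a set of known-bad entities.

    precision: fraction of the top-k leads that are known bad
    recall:    fraction of known-bad entities that appear in the top k
    """
    top_k = ranked_entities[:k]
    hits = sum(1 for entity in top_k if entity in known_bad)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(known_bad) if known_bad else 0.0
    return precision, recall


# Hypothetical output of an analytics run on historical data, highest risk first.
ranked = ["eng-17", "svc-backup", "eng-04", "hr-22", "eng-31"]

# Hypothetical ground truth from a past incident plus a red team account.
known_bad = {"eng-17", "eng-04", "eng-99"}

precision, recall = precision_recall_at_k(ranked, known_bad, k=3)
print(f"precision@3 = {precision:.2f}, recall@3 = {recall:.2f}")
```

Choosing k to match how many leads the SOC team can realistically investigate per week keeps the evaluation tied to the agreed use cases rather than to the tool's overall alert volume.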

QUESTIONS?
Roy Wilds – Field Data Scientist
@roywilds
Learn more at Interset.AI

About Interset.AI
SECURITY ANALYTICS LEADER
ABOUT US
• Data science & analytics focused on cybersecurity
• 100 person-years of security analytics and anomaly detection R&D
• Offices in Ottawa, Canada; Newport Beach, CA
PARTNERS
Interset.AI
