Enhancing Threat Detection
with Big Data and AI
Michael Armbrust & Burak Yavuz
@michaelarmbrust
Bay Area Cyber Security Meetup – April 2018
2
About Us
Engineers on the StreamTeam @
Committers / PMC Members on
Created:
• Spark SQL – High-level, declarative queries on big data
• Structured Streaming – Low latency, incremental processing
• Databricks Delta - Massive Scale, Transactional Cloud Storage
3
Fast and General Cluster Computing
Scala
SQL
EC2
Kubernetes
Automatic Parallelism and Fault-tolerance
GCP
YARN
4
Apache Spark Philosophy
Unified engine for complete
data applications
High-level user-friendly APIs
SQL Streaming ML Graph
…
5
Write Less Code: Compute an Average
private IntWritable one =
  new IntWritable(1);
private IntWritable output =
  new IntWritable();

protected void map(
    LongWritable key,
    Text value,
    Context context) throws IOException, InterruptedException {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) throws IOException, InterruptedException {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [x[1], 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()
6
Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [x[1], 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()
Using DataFrames
sqlCtx.table("people") 
.groupBy("name") 
.agg(avg("age")) 
.collect()
Using SQL
SELECT name, avg(age)
FROM people
GROUP BY name
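All three variants above compute the same per-key average via the (sum, count) pattern. As a minimal plain-Python sketch of that logic (no Spark required; the sample data is made up for illustration):

```python
from collections import defaultdict

def average_by_key(rows):
    """Group (name, age) pairs by name and average the ages,
    accumulating (sum, count) per key just like the reduceByKey version."""
    totals = defaultdict(lambda: [0, 0])  # name -> [sum, count]
    for name, age in rows:
        totals[name][0] += age
        totals[name][1] += 1
    return {name: s / c for name, (s, c) in totals.items()}

people = [("alice", 30), ("bob", 40), ("alice", 34)]
# average_by_key(people) -> {"alice": 32.0, "bob": 40.0}
```

The DataFrame and SQL versions express exactly this aggregation declaratively, which is what lets Spark optimize it.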
7
Why should I care?
As a business analyst / data scientist / APT hunter, how does this help me?
8
Let's look at what has been
happening in AI…
9
Big Data was the Missing Link for AI
BIG DATA
Customer Data
Emails/Web pages
Click Streams
Sensor data (IoT)
Video/Speech
…
GREAT RESULTS
10
Hardest part of AI isn’t AI
“Hidden Technical Debt in Machine Learning Systems”, Google, NIPS 2015
The hardest part of AI is Big Data
ML Code
11
Does the same apply to
Security?
12
• Only a few weeks of data
• Very expensive to scale
• Proprietary formats
• No predictions (ML)
Messy data not ready
for analytics
DATA LAKE
Complex ETL
EDW
EDW
EDW
Incident Response
Alerting
Reports
SIEM
Security Data @ Fortune 100 Company
Security Infrastructure
IDS/IPS, DLP, antivirus, load
balancers, proxy servers
Cloud Infrastructure & Apps
AWS, Azure, Google Cloud, Audit Logs
Server Infrastructure
Linux, Unix, Windows
Network Infrastructure
Routers, switches, WAPs,
databases, LDAP
Threat Intelligence Feeds
Trillions of Records
13
DELTA
DATA LAKE
Reporting
Streaming
Analytics
The LOW-LATENCY of streaming
The RELIABILITY & PERFORMANCE of a data warehouse
The SCALE of a data lake
The Delta Architecture
14
An Example Hunt
Raise your hand when you know what we are searching for…
spark.read.table("dns")
  .where("length(query) > 50")
  .groupBy(window($"ts", "5 minutes"), $"src_ip")
  .count()
  .where("count > 20")
Answer: DNS Exfiltration Attack
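The query's logic — flag any source IP that issues more than 20 unusually long DNS queries inside a 5-minute window — can be sketched in plain Python. This mirrors the Spark query above; the record layout and thresholds are illustrative:

```python
from collections import Counter

WINDOW = 5 * 60  # 5-minute tumbling windows, in seconds

def dns_exfil_suspects(records, min_len=50, min_count=20):
    """records: iterable of (ts_seconds, src_ip, query) tuples.
    Returns the set of source IPs exceeding min_count long queries
    within any single window."""
    counts = Counter()
    for ts, src_ip, query in records:
        if len(query) > min_len:  # exfil tools encode data in long names
            counts[(ts // WINDOW, src_ip)] += 1
    return {ip for (_, ip), n in counts.items() if n > min_count}
```

A host tunneling data out over DNS encodes chunks of it in query names, so it produces a burst of abnormally long lookups — exactly what this aggregation surfaces.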
15
What about Future Threats?
Tune for specific SLAs using the same code:
Batch
high latency
execute
on-demand
high throughput
Micro-batch
medium latency
efficient resource
allocation
high throughput
Continuous
millisecond latency
static resource
allocation
16
What about Future Threats?
Tune for specific SLAs using the same code:
Batch
historical
hunting
Micro-batch
human-in-the-loop
alerts
Continuous
automatic
remediation
17
DNS Exfiltration Search
Rewritten as a streaming alert
spark.readStream.table("dns")
  .where("length(query) > 50")
  .groupBy(window($"ts", "5 minutes"), $"src_ip")
  .count()
  .where("count > 20")
  .writeStream
  .foreach(new PagerDutySink)
  .start()
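The foreach sink hands each output row to a user-supplied writer with an open/process/close lifecycle, which is what `PagerDutySink` would implement. A minimal plain-Python stand-in (the class and message format are illustrative; a real sink would call the paging API instead of appending to a list):

```python
class AlertSink:
    """Stand-in for a Structured Streaming foreach writer:
    open() is called once per partition/epoch, process() once per row,
    close() when the partition finishes."""

    def __init__(self):
        self.sent = []

    def open(self, partition_id, epoch_id):
        return True  # accept this partition; return False to skip it

    def process(self, row):
        # A real PagerDutySink would trigger a page here.
        self.sent.append(f"ALERT: {row['src_ip']} ({row['count']} long queries)")

    def close(self, error):
        pass  # release connections, surface errors
```

Because the same query runs in batch, micro-batch, or continuous mode, this one sink covers everything from historical hunting to automatic remediation.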
18
Demo: Hunting in
• Hosted Apache Spark in the Cloud
• Integrated Collaboration
• Monitoring / Alerting
19
15% Discount Code: BASMU
