Data Science Driven Malware Detection

© Copyright 2015 Pivotal. All rights reserved.
Data Science Driven
Malware Detection
Malicious Domain Association
Anirudh Kondaveeti, PhD
Principal Data Scientist
Project Goal
• Goal: Find domains that have time- and user-based co-occurrence relationships, to aid the detection of coordinated network attacks.
• Example: Domain A is a watering hole. It redirects users to an exploit kit at Domain B within a short time window.
– B is relatively unknown: visiting B is a low-frequency (low-support) event.
– B is almost always reached by redirection from A: the conditional probability (confidence) of an initial visit to A, given that B is visited later on, is high.
[Figure: attack flow. User visits watering hole domain A; watering hole domain A redirects to domain B; domain B hosts exploit kit; user machine compromised.]
Data Sources & Preprocessing
• Historical Proxy Logs
– Information about “who is accessing which website at what time”
– Approx. 3 months of data with billions of connection records
• Local Domain Whitelist
– List of non-malicious websites
• Preprocessing
– Host name normalization (anirudh.facebook.com -> facebook.com)
– Filter invalid host names (e.g. www.facebook,ca)
– Identify “unpopular” domains (e.g. www.francelegal.com)
– User-specific sessionization
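The normalization and validation steps could be sketched as follows. This is a minimal illustration, not the deck's implementation: the function names are hypothetical, and reducing a host to its last two labels is a simplifying assumption that breaks for suffixes such as co.uk (a production version would consult a public suffix list).

```python
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def is_valid_host(host: str) -> bool:
    """Return True if host looks like a syntactically valid dotted host name."""
    labels = host.strip().lower().split(".")
    return len(labels) >= 2 and all(_LABEL.match(label) for label in labels)


def normalize_host(host: str) -> str:
    """Naively reduce a host name to its last two labels (assumption, see lead-in)."""
    labels = host.strip().lower().split(".")
    return ".".join(labels[-2:])
```

With the slide's examples, `normalize_host("anirudh.facebook.com")` yields `facebook.com`, and `is_valid_host("www.facebook,ca")` is false because of the comma.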
User-Specific Sessionization
• Each user’s proxy logs are sessionized so that any two consecutive connections in the same session occur within a user-specified time window (e.g. 60s) of each other.
• Sequential patterns are derived from sessionized data.

Connection Time       Domain                Session ID
2015-07-03 12:41:08   googlevideo.com       1
2015-07-03 12:41:09   twitter.com           1
2015-07-03 12:41:12   youtube.com           1
2015-07-03 12:41:14   doubleclick.net       1
2015-07-03 12:41:15   google.com            1
2015-07-03 12:41:15   googleanalytics.com   1
2015-07-03 12:41:28   youtube.com           1
2015-07-03 12:59:23   facebook.com          2   (>60s apart, start a new session)
2015-07-03 12:59:24   yahoo.com             2
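A minimal sketch of gap-based sessionization for a single user, using the rows of the table above. The function name and the in-memory list are assumptions for illustration; the deck states that the real pipeline runs in SQL on Greenplum.

```python
from datetime import datetime, timedelta


def sessionize(events, gap_seconds=60):
    """Assign session IDs to one user's (timestamp, domain) events.

    A new session starts whenever the gap to the previous connection
    exceeds gap_seconds. Returns (timestamp, domain, session_id) tuples.
    """
    gap = timedelta(seconds=gap_seconds)
    out, session_id, previous = [], 0, None
    for ts, domain in sorted(events):
        if previous is None or ts - previous > gap:
            session_id += 1
        out.append((ts, domain, session_id))
        previous = ts
    return out


def _t(clock):
    """Parse a clock time on the example date."""
    return datetime.strptime("2015-07-03 " + clock, "%Y-%m-%d %H:%M:%S")


log = [
    (_t("12:41:08"), "googlevideo.com"),
    (_t("12:41:09"), "twitter.com"),
    (_t("12:41:12"), "youtube.com"),
    (_t("12:41:14"), "doubleclick.net"),
    (_t("12:41:15"), "google.com"),
    (_t("12:41:15"), "googleanalytics.com"),
    (_t("12:41:28"), "youtube.com"),
    (_t("12:59:23"), "facebook.com"),
    (_t("12:59:24"), "yahoo.com"),
]
sessions = sessionize(log)
```

The first seven connections land in session 1 and the last two in session 2, matching the table.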
Modeling Approaches
• Sequential Pattern Mining
– Find time-ordered co-occurrence relationships between multiple domains.
– Output low-frequency, high-confidence sequences of domains: [{Domain1},{Domain2, Domain3},…] => [DomainN].
• Graph Mining
– Build a “social network” graph between domains by creating edges between pairs of domains that are associated with high confidence.
– Use graph-based algorithms to find fully and partially connected subgraphs.
• The two approaches can be used in conjunction to complement each other.
Modeling Framework Design Considerations
• Operational feasibility
– Incremental data processing and modeling on incoming new data, e.g. on a weekly basis, to distribute workload over time.
– Results are updated to incorporate new model outputs.
• Computational tractability
– Implement most of the modeling frameworks in plain SQL, and design efficient window functions to achieve better runtime performance.
– Explicit PL/R routine parallelization to leverage the Massively Parallel Processing architecture of the Greenplum database.
An Incremental Modeling Framework
• Initial Run
– Input: Initial Proxy Logs & Domain Whitelist
– Preprocessing (host normalization & validation, data filtering, sessionization) produces Preprocessed Proxy Logs
– Model Execution (Sequential Pattern Mining, Graph Mining) produces Model-Specific Results
• Update
– Input: New Proxy Logs & (Possibly) Updated Domain Whitelist
– Preprocessing (host normalization & validation, data filtering, sessionization) produces Preprocessed New Proxy Logs
– Model Update (Sequential Pattern Mining, Graph Mining) produces Updated Model-Specific Results
Modeling Approaches
Sequential Pattern Mining
Model Execution: Sequential Pattern Mining
Workflow (in order):
• Create time-ordered domain sequences from sessionized data.
• Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains.
• Find high-confidence, low-support sequential patterns of the targeted domains in parallel.
Sequence Creation
• Each sequence contains the domains in one session of a single user.
• Domains are ordered by connection time.
• Sequences for the example table below:
– Sequence 1: [{googlevideo.com}, {twitter.com}, {youtube.com}, {doubleclick.net}, {google.com}, {googleanalytics.com}]
– Sequence 2: [{facebook.com}, {yahoo.com}]

Connection Time       Domain                Session ID
2015-01-06 14:41:08   googlevideo.com       1
2015-01-06 14:41:09   twitter.com           1
2015-01-06 14:41:12   youtube.com           1
2015-01-06 14:41:14   doubleclick.net       1
2015-01-06 14:41:15   google.com            1
2015-01-06 14:41:15   googleanalytics.com   1
2015-01-06 14:59:23   facebook.com          2
2015-01-06 14:59:24   yahoo.com             2
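A small sketch of how sessionized rows might be turned into per-session sequences. The row layout, the user label `u1`, and the function name are assumptions; each sequence element is kept as a plain domain string rather than the single-item sets shown on the slide.

```python
from collections import defaultdict


def build_sequences(rows):
    """Group (user, timestamp, domain, session_id) rows into sequences.

    Returns {(user, session_id): [domain, ...]} ordered by connection time.
    Ties in time keep their input order because the sort is stable.
    """
    grouped = defaultdict(list)
    for user, _, domain, session_id in sorted(rows, key=lambda r: (r[0], r[3], r[1])):
        grouped[(user, session_id)].append(domain)
    return dict(grouped)


rows = [
    ("u1", "2015-01-06 14:41:08", "googlevideo.com", 1),
    ("u1", "2015-01-06 14:41:09", "twitter.com", 1),
    ("u1", "2015-01-06 14:41:12", "youtube.com", 1),
    ("u1", "2015-01-06 14:41:14", "doubleclick.net", 1),
    ("u1", "2015-01-06 14:41:15", "google.com", 1),
    ("u1", "2015-01-06 14:41:15", "googleanalytics.com", 1),
    ("u1", "2015-01-06 14:59:23", "facebook.com", 2),
    ("u1", "2015-01-06 14:59:24", "yahoo.com", 2),
]
sequences = build_sequences(rows)
```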
Sequence Statistics
(Example: 10 sequences from a single user.)
• sup: Support of a pattern P is the fraction of sequences in which the pattern occurs.
– sup({a,e}) = 2/10
• conf: Confidence of a rule X => Y is the proportion of transactions (here, sequences) containing X that also contain Y.
– conf({a => e}) = sup({a,e})/sup({a}) = 2/5
• #users: Number of distinct users for which a pattern P occurs.
– #users({a}) = 1
• sup and #users are monotone: they cannot increase when a pattern is extended, i.e.
– {a,e} ⊇ {a}
– sup({a,e}) ≤ sup({a})
– #users({a,e}) ≤ #users({a})
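The three statistics can be sketched as below. The ten sequences are hypothetical (the slide's figure is not reproduced in the text); they are chosen only so that `a` occurs in 5 sequences and `a` followed by `e` in 2, which reproduces the slide's numbers. Treating a pattern as an ordered subsequence is also an assumption.

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence as an ordered subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for _, s in sequences) / len(sequences)


def confidence(sequences, lhs, rhs):
    """support(lhs followed by rhs) / support(lhs)."""
    return support(sequences, lhs + rhs) / support(sequences, lhs)


def num_users(sequences, pattern):
    """Number of distinct users with a sequence containing the pattern."""
    return len({user for user, s in sequences if contains(s, pattern)})


# Hypothetical data: 10 sequences, all from the same user "u1".
sequences = [("u1", s) for s in (
    ["a", "e"], ["a", "b", "e"], ["a", "b"], ["a", "c"], ["a"],
    ["b"], ["c"], ["b", "c"], ["e"], ["d"],
)]
```

Here `support(sequences, ["a", "e"])` is 2/10 and `confidence(sequences, ["a"], ["e"])` is 2/5, and the monotone property holds because every sequence containing the longer pattern also contains the shorter one.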
Sequential Pattern Mining (SPM) in Parallel
• Developed a scalable algorithm in Greenplum database (GPDB) to identify low-support, high-confidence patterns that occur in at least a minimum number of user sequences.
• High-confidence patterns relating to a given set of domains are obtained in parallel, i.e. SPM runs independently on different subsets of sequences for different domains.
Pseudo code:
SELECT a_targeted_domain,
       sequential_pattern_mining(min_support, min_confidence, min_num_users)
FROM input_table
Workflow:
• Find a domain A with small support (or a known bad domain).
• Subset the sequences in the data that contain A.
• Find sequential patterns of A with high confidence.
• Repeat for all A in parallel on separate GPDB nodes.
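The per-domain independence described above might look like this in miniature. This is an illustrative stand-in, not the GPDB routine: it mines only single-domain rules X => target, uses a thread pool in place of database segments, and all names, thresholds, and example domains are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def mine_for_target(sequences, target, min_confidence=0.6, min_users=1):
    """Find rules X => target among the sequences that contain target.

    Returns {X: (confidence, number_of_users)} for single domains X.
    """
    subset = [(user, seq) for user, seq in sequences if target in seq]
    lhs_count, rule_count, rule_users = Counter(), Counter(), {}
    for user, seq in subset:
        last_target = len(seq) - 1 - seq[::-1].index(target)
        for x in set(seq) - {target}:
            lhs_count[x] += 1
            if seq.index(x) < last_target:  # X occurs before the target
                rule_count[x] += 1
                rule_users.setdefault(x, set()).add(user)
    rules = {}
    for x, n in rule_count.items():
        conf = n / lhs_count[x]
        if conf >= min_confidence and len(rule_users[x]) >= min_users:
            rules[x] = (conf, len(rule_users[x]))
    return rules


def mine_all(sequences, targets, workers=4):
    """Run mine_for_target independently for every target domain."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda t: mine_for_target(sequences, t), targets))
    return dict(zip(targets, results))


sequences = [
    ("u1", ["a.example", "bad.example"]),
    ("u2", ["a.example", "bad.example"]),
    ("u2", ["b.example", "a.example", "bad.example"]),
    ("u3", ["bad.example", "b.example"]),
    ("u3", ["c.example"]),
]
patterns = mine_all(sequences, ["bad.example"])
```

Because each target only needs its own subset of sequences, the calls share no state and can be distributed freely.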
Relative Confidence to Adjust Ranking of Patterns
• For each domain of interest, SPM is run only on the subset of sequences containing that domain. This may cause some sequential patterns to have artificially high confidence.
• Recall: confidence(X => Y) := support(<X,Y>) / support(X) = |<X,Y>| / |X|. Here |X|, the number of sequences in the subset that contain the left-hand-side pattern, may not reflect the popularity of X in the full dataset.
• We define relative confidence as: relative_confidence(X => Y) := |<X,Y>| / |X|_fullset, where |X|_fullset is the number of sequences in the full dataset that contain the left-hand-side pattern.
• Relative confidence favors patterns whose left-hand side contains less popular domains (see the example below).

Domain            Pattern                                                    Supp   Conf  Rel Conf
revenueindia.net  <{google.com},{facebook.com}> => <{revenueindia.net}>      0.079  0.75  0.0001
revenueindia.net  <{google.com},{fileshare.com}> => <{revenueindia.net}>     0.071  0.75  0.067
revenueindia.net  <{fileshare.com},{redworm.com}> => <{revenueindia.net}>    0.030  1.00  0.51
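A toy comparison of confidence and relative confidence, assuming patterns are ordered subsequences. The domains and counts are hypothetical and the sketch does not guard against a left-hand side that never occurs.

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence as an ordered subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def confidence_pair(full, target, lhs):
    """Return (confidence, relative_confidence) of the rule lhs => target.

    confidence uses |lhs| counted in the subset of sequences containing
    target; relative confidence uses |lhs| counted in the full dataset.
    """
    subset = [s for s in full if target in s]
    both = sum(contains(s, lhs + [target]) for s in subset)
    lhs_subset = sum(contains(s, lhs) for s in subset)
    lhs_full = sum(contains(s, lhs) for s in full)
    return both / lhs_subset, both / lhs_full


# Hypothetical data: one popular and one unpopular left-hand-side domain.
full = (
    [["popular.example", "target.example"], ["unpopular.example", "target.example"]]
    + [["popular.example"]] * 8
)
popular = confidence_pair(full, "target.example", ["popular.example"])
unpopular = confidence_pair(full, "target.example", ["unpopular.example"])
```

Both rules have confidence 1.0 within the subset, but the popular domain's relative confidence drops to 1/9 while the unpopular one stays at 1.0, which is the re-ranking effect the table shows.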
Model Update: Sequential Pattern Mining
• The model update module for sequential pattern mining follows a workflow similar to that of the model execution module.
• The one additional step is to merge the new results obtained from the incoming data with the existing set of patterns, including updating the rule quality metrics (support, confidence, etc.).
Workflow (in order):
• Create time-ordered domain sequences from new sessionized data.
• Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains.
• Find high-confidence, low-support sequential patterns of the targeted domains in parallel.
• Merge the new results with the existing set of patterns.
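One way the merge step could work is to store raw counts per rule and recompute the metrics after each batch. The slides do not say how the metrics are updated, so this is an assumption, as are the data layout and names.

```python
from collections import Counter


def merge_counts(existing, new):
    """Add the raw counts from a new batch to the existing counts."""
    merged = Counter(existing)
    merged.update(new)
    return merged


def metrics(rule_counts, lhs_counts, n_sequences):
    """Recompute support and confidence from merged raw counts.

    rule_counts maps (lhs, rhs) to the number of sequences containing the
    rule; lhs_counts maps lhs to the number of sequences containing lhs.
    """
    return {
        rule: {"support": n / n_sequences, "confidence": n / lhs_counts[rule[0]]}
        for rule, n in rule_counts.items()
    }


rule = (("a.example",), "b.example")  # hypothetical rule a.example => b.example
old_rules, old_lhs, old_total = {rule: 3}, {("a.example",): 4}, 100
new_rules, new_lhs, new_total = {rule: 1}, {("a.example",): 4}, 100

updated = metrics(
    merge_counts(old_rules, new_rules),
    merge_counts(old_lhs, new_lhs),
    old_total + new_total,
)
```

Merging counts rather than averaging ratios keeps the updated support and confidence identical to what a full re-run over both batches would give.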
Modeling Approaches
Graph Mining
Model Execution: Graph Mining
Workflow (in order):
• Construct “baskets” of co-occurring domains by running a sliding window of a certain time interval through the data.
• Find high-confidence, low-support pairwise association rules of the form Domain 1 => Domain 2.
• Create a social network of domains.
• Find partially and fully connected sub-graphs.
Construction of “Baskets”
• Domains visited by a user within a certain time window form a “basket”, analogous to the items purchased in a single transaction in market basket analysis.
• The time interval of the sliding window (60s in the implementation) can be tuned.
• A basket contains the distinct domains in a sliding window. For the example below:
Basket 1 = {googlevideo.com, twitter.com, youtube.com, doubleclick.net, google.com}

Connection Time       Domain
2015-01-06 14:41:00   googlevideo.com
2015-01-06 14:41:09   twitter.com
2015-01-06 14:41:12   youtube.com
2015-01-06 14:41:14   doubleclick.net
2015-01-06 14:42:00   google.com
2015-01-06 14:42:05   googleanalytics.com
2015-01-06 14:42:08   pivotal.io
2015-01-06 14:59:23   facebook.com
2015-01-06 14:59:24   yahoo.com
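A sketch of basket construction under one possible reading of the sliding window: one basket per connection, holding the distinct domains seen from that connection up to and including 60 seconds later. The exact window semantics are an assumption, chosen because it reproduces Basket 1 from the example; the quadratic loop is for clarity only.

```python
from datetime import datetime, timedelta


def make_baskets(events, window_seconds=60):
    """Return one basket (set of distinct domains) per event.

    The basket for an event holds every domain visited from that event's
    time up to window_seconds later, inclusive.
    """
    events = sorted(events)
    window = timedelta(seconds=window_seconds)
    result = []
    for i, (start, _) in enumerate(events):
        result.append({d for ts, d in events[i:] if ts - start <= window})
    return result


def _t(clock):
    """Parse a clock time on the example date."""
    return datetime.strptime("2015-01-06 " + clock, "%Y-%m-%d %H:%M:%S")


log = [
    (_t("14:41:00"), "googlevideo.com"),
    (_t("14:41:09"), "twitter.com"),
    (_t("14:41:12"), "youtube.com"),
    (_t("14:41:14"), "doubleclick.net"),
    (_t("14:42:00"), "google.com"),
    (_t("14:42:05"), "googleanalytics.com"),
    (_t("14:42:08"), "pivotal.io"),
    (_t("14:59:23"), "facebook.com"),
    (_t("14:59:24"), "yahoo.com"),
]
baskets = make_baskets(log)
```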
Pairwise Association Rule Mining
• Given domain-to-basket assignments, pairwise association rule mining mainly involves evaluation of:
– Co-occurrence frequency: the number of times two domains fall in a common basket.
– Conditional probability: probability of seeing domain 2 given domain 1 is present.
• Pairwise rule mining is implemented in plain SQL in a scalable fashion.

Domain A    Domain B            #{A,B}  #A   #B  P(A|B)    P(B|A)    #A to B  #B to A  #AB Same Time  Max(#User Names/M)  #Date  Min Date    Max Date
pivotal.io  montecarlo.com      10      560  10  1.000000  0.017857  9        0        1              1                   1      2015-02-26  2015-02-26
pivotal.io  bigbangtheory.com   25      560  26  0.961538  0.044643  21       4        0              2                   1      2015-02-23  2015-02-23
pivotal.io  sciencefiction.com  78      560  97  0.804124  0.139286  61       15       2              4                   8      2015-01-23  2015-02-17

High confidence (>0.5) associations involving multiple users over several days (e.g. highlighted rules) are generally more interesting.
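The two core quantities can be sketched from a list of baskets as follows. This in-memory version is only an illustration of the arithmetic (the deck's version is SQL), and counting occurrences per basket is an assumption. The toy data is built to reproduce the counts in the first table row: 560 baskets with pivotal.io, 10 of them also containing montecarlo.com.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(baskets):
    """Return {(a, b): (n_ab, n_a, n_b, P(a|b), P(b|a))} for co-occurring pairs.

    Pairs are keyed in alphabetical order; counts are numbers of baskets.
    """
    single, pair = Counter(), Counter()
    for basket in baskets:
        items = sorted(set(basket))
        single.update(items)
        pair.update(combinations(items, 2))
    return {
        (a, b): (n, single[a], single[b], n / single[b], n / single[a])
        for (a, b), n in pair.items()
    }


baskets = [{"pivotal.io", "montecarlo.com"}] * 10 + [{"pivotal.io"}] * 550
rules = pairwise_rules(baskets)
n_ab, n_monte, n_pivotal, p_monte_given_pivotal, p_pivotal_given_monte = rules[
    ("montecarlo.com", "pivotal.io")
]
```

With A = pivotal.io and B = montecarlo.com, `p_pivotal_given_monte` corresponds to P(A|B) = 1.0 and `p_monte_given_pivotal` to P(B|A) = 10/560.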
Exploring Interactions between Domains
• To explore the interactions between domains, we build an undirected correlation graph from the discovered pairwise domain association rules.
• Each node in the graph is a domain. An edge connects two domains if their co-occurrence confidence is higher than a threshold (e.g. 0.2).
• The example figure shows the tightly connected “social network” of a particular domain.
• Partially and fully connected networks indicate possible watering hole or bot-net attacks.
• Question: How to quantify the connectivity of a network?
[Figure: example graph with nodes abc.com, xyz.com, hga.com, hebf.com. Each node denotes a domain; the weight of each edge denotes the confidence. Edge weights shown: 0.25, 0.37, 0.71, 0.52, 0.1, 0.6, 0.1.]
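A possible way to turn directed rule confidences into the undirected graph. Taking the larger of the two directions as the edge weight is an assumption (the slides do not say how directions are combined), and the pairing of weights to edges below is hypothetical, since the figure's layout is not recoverable from the text.

```python
def build_graph(confidences, threshold=0.2):
    """Build undirected adjacency sets from directed rule confidences.

    confidences maps (a, b) to the confidence of a => b. An edge is kept
    when the larger of its two directions exceeds the threshold.
    """
    weights = {}
    for (a, b), conf in confidences.items():
        key = frozenset((a, b))
        weights[key] = max(conf, weights.get(key, 0.0))
    graph = {}
    for key, weight in weights.items():
        if weight > threshold and len(key) == 2:  # skip self-pairs
            a, b = key
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


confidences = {
    ("abc.com", "xyz.com"): 0.71,
    ("xyz.com", "abc.com"): 0.1,
    ("abc.com", "hga.com"): 0.1,
    ("hga.com", "hebf.com"): 0.25,
}
graph = build_graph(confidences)
```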
OddBall Metrics for Graph Anomaly Detection
• We take the OddBall approach* to quantify the connectivity of each domain’s network:
– Identify each domain’s one-step neighborhood (also called its ego-net).
– Extract two graph features from the ego-net:
▪ N: Number of neighbors
▪ E: Number of edges in the ego-net
• The number of neighbors and the number of edges follow a power law: E ∝ N^α, 1 ≤ α ≤ 2.
• Use log(E)/log(N) to approximate the slope. log(E)/log(N) > 1 indicates some degree of connectivity among neighbors.
• The higher the ratio, the higher the degree of connectivity (given the same number of neighbors). Generally, an OddBall ratio > 1.5 is more interesting.
• One can additionally compute the clique percentage, the ratio between E and the number of edges needed to form a clique, E/[(N^2+N)/2], to measure network connectivity.
* OddBall: Spotting Anomalies in Weighted Graphs, Leman Akoglu et al., PAKDD, Hyderabad, India, June 2010.
[Figure: picture source: ICDM’12 tutorial on graph anomaly detection.]
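The ego-net features and both ratios can be computed from adjacency sets as sketched below. The function name and graph representation are assumptions; the example is a five-node clique, for which N = 4 and E = 10.

```python
import math
from itertools import combinations


def egonet_features(graph, ego):
    """Return (N, E, log(E)/log(N), clique_fraction) for ego's ego-net.

    N is the number of neighbors; E counts the ego-to-neighbor edges plus
    the edges among the neighbors. A clique on the N + 1 ego-net nodes
    has (N^2 + N) / 2 edges.
    """
    neighbors = graph[ego]
    n = len(neighbors)
    if n == 0:
        return 0, 0, float("nan"), float("nan")
    among = sum(1 for a, b in combinations(sorted(neighbors), 2) if b in graph[a])
    e = n + among
    ratio = math.log(e) / math.log(n) if n > 1 else float("nan")
    return n, e, ratio, e / ((n * n + n) / 2)


nodes = ["a.com", "b.com", "c.com", "d.com", "e.com"]
clique = {node: set(nodes) - {node} for node in nodes}
n, e, ratio, clique_fraction = egonet_features(clique, "a.com")
```

For the clique, the ratio is log(10)/log(4), about 1.66, and the clique fraction is 1.0 (100%).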
Sample Domains with Highly Connected Networks

Domain  # Neighbors  # Edges  log(E)/log(N)  Clique Percent  # User Names  Neighbors
a.com   4            10       1.66           100%            6             {b.com, c.com, d.com, e.com}
s.com   7            27       1.69           96%             9             {a.com, b.com, c.com, d.com, e.com, f.com}
r.com   9            43       1.71           96%             7             {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}
abc.ru  9            42       1.70           93%             11            {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}

The highlighted domain, a.com (clique percent 100%), has a fully connected network, a clique!
[Figure: ego-net of a.com with neighbors b.com, c.com, d.com, e.com.]
Detecting Isolated Clusters
• Given the domain correlation graph, one can also identify isolated groups of domains that interact only with domains in the same group and not with others (a bot-net like structure).
• This can be formulated as the task of finding connected components (CCs) in a graph.
• The example below shows that malicious sites tend to exist in small CCs.
[Figure: sample connected component with nodes qre.com, jekc.com, fbc.com, abc.com, ghk.com, bcd.com; one node is marked as a known malicious site.]
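Finding connected components can be sketched with a breadth-first search over adjacency sets. The graph below reuses the node names from the sample figure, but its edges and the second component are hypothetical.

```python
from collections import deque


def connected_components(graph):
    """Return the connected components of an undirected graph as sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


graph = {
    "qre.com": {"jekc.com", "fbc.com"},
    "jekc.com": {"qre.com"},
    "fbc.com": {"qre.com", "abc.com"},
    "abc.com": {"fbc.com", "ghk.com", "bcd.com"},
    "ghk.com": {"abc.com"},
    "bcd.com": {"abc.com"},
    "big1.example": {"big2.example"},
    "big2.example": {"big1.example"},
}
components = sorted(connected_components(graph), key=len)
```

Small components can then be ranked for inspection, for example by size or by whether they contain a known malicious site.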
Operationalization and
Outlook
Operationalization Vision
• Run Algorithms
– Owned by Data Engineer/Data Scientist
– Incrementally (e.g. weekly) update models using new batches of data, e.g. as a cron job
• Inspect Anomalies
– Owned by security team
– Ideally, model outputs are provided via interactive web dashboards
• Evaluate Model Outputs
– Feedback on model performance from security team
– Opportunities for refinement and ideas for new models
• Refine Algorithms
– Owned by Data Scientist
• Load New Data
– Owned by Data Engineer
BUILT FOR THE SPEED OF BUSINESS