The Power of Partnership – from Vision to Reality
L-3 Data Tactics:
Data Science Brown Bag
Welcome!
Hard and Soft Clusters and Cyber Data
April 22, 2014
R2 = 500; p<.05
asymptotically approaching perfect
DT Data Science Brown Bag: Outline
• Why a (our 3rd) Data Science Brown Bag? (Rich H.)
• About Us & About You (Rich H.)
• Case Studies in Cyber:
  • What are clustering, honeypots, and density-based clustering? (Max W.)
  • What is OPTICS clustering, how is it different from DB (density-based) clustering, and how can it be used for outlier detection? (David P.)
  • What is so-called soft clustering, how is it different from hard clustering, and how can it be used for outlier detection? (Nathan D.)
• On the horizon... (Rich H.)
L-3
DT Data Science Brown Bag: Outline
Learning [close] at a pace similar to the pace at which we learn.
Learning and educating from/to DS to PMs, SWEs, and OPs.
DS2PM: Provide insights for RFIs/RFPs.
PM2DS: Atmospherics from our customers.
DS2SWE: Integrating algorithms.
SWE2DS: Accessing data spaces.
DS2OP: How do you consume the outputs of models?
OP2DS: What models are best to present to OPs?
DS: Data Scientists, PM: Program Managers, SWE: Software Engineers, OP: Operators
About Us: DT Data Science Team
The Team: Geoffrey B., Nathan D., Rich H., David P., Ted P., Shrayes R., Jonathan T., Adam VE., Max W.
Graduates from top universities, many of whom are EMC Data Science Certified.
Advanced degrees include: mathematics, computer science, astrophysics, electrical engineering, mechanical engineering, statistics, social sciences.
Base competencies (horizontals): clustering, association rules, regression, naive Bayesian classifier, decision trees, time-series, text analysis.
Going beyond the base (verticals)...
Clustering || Regression || Decision Trees || Text Analysis
Association Rules || Naive Bayesian Classifier || Time Series Analysis
Verticals: econometrics, spatial econometrics, graph theory algorithms, astrophysical time-series analysis, path planning algorithms, Bayesian statistics, constrained optimizations, numerical integration techniques, PCA, bagging/boosting, hierarchical models, IRT, space-time, latent class analysis, structural equation modeling, mixture models, SVM, maxent, CART, autoregressive models, ICA, factor analysis, random forest, dimensional reduction, topic models, sentiment analysis, frequency domain patterns, unsupervised by supervised, change-point models, LUBAP, DLISA, DBAC, OPTICS clustering
Hierarchy of Data Scientists
No Free Lunch (NFL) theorems: no algorithm performs
better than any other when their performance is averaged
uniformly over all possible problems of a particular type.
Algorithms must be designed for a particular domain or style
of problem; there is no such thing as a general-purpose
algorithm.
ABOUT YOU:
35 confirmed, 15 via WebEx, 21 Data Tactics employees, 13 L-3 NSS
employees. Sam Posten was the first to sign up (WebEx) and Aaron Glahe was
the first to sign up in person!
library(twitteR)  # getUser(), lookupUsers(); assumes OAuth is already set up
library(igraph)   # graph.data.frame(), tkplot()

# define Twitter account names
l3 <- getUser("L3_NSS")
dt <- getUser("DataTactics")

# find all connections (friends) of each account independently
l3.friends <- unname(sapply(lookupUsers(l3$getFriendIDs()), function(u) u$screenName))
dt.friends <- unname(sapply(lookupUsers(dt$getFriendIDs()), function(u) u$screenName))

# create one large table that relates each account to its connections
relations <- rbind(data.frame(user = "L3_NSS", friend = l3.friends),
                   data.frame(user = "DataTactics", friend = dt.friends))

# create network layout showing each account's community and overlap
g <- graph.data.frame(relations, directed = TRUE)

# finally plot the graph
tkplot(g)
[Figure: network graphs of the @DataTactics and @L3_NSS Twitter communities and their overlap]
Why Clustering?
Six Pillars of Data Mining:
Clustering has become a workhorse in Big Data and fits into the Six Pillars of Data Mining and our own
DS4PM & DS4G frameworks.
• Anomaly detection: identifies unusual data records, which may be interesting or may be data errors that require further investigation.
• Association rule learning: searches for relationships between variables.
• Clustering: discovers groups and structures in the data that are in some way "similar", without using known structures in the data.
• Classification: generalizes known structure to apply to new data.
• Regression: finds a function that models the data with the least error.
• Summarization: provides a more compact representation of the data set.
Taxonomy of Questions (ref: DS4PM):
• Causal Effects: the statistical analysis of cause and effect based on the framework of potential outcomes.
• Classification/Clustering: identifying to which of a set of categories an observation belongs, on the basis of a labeled training set (classification) or without labels (clustering).
• Outlier Detection: the identification of events or items that do not conform to an expected pattern or to the other items in a dataset.
• Big Data and Analytics: discovering interesting relations between variables in large databases.
• Measurement Models: statistical models that measure the relationships between observable variables and an unobserved (or "latent") quantity.
• Text Analysis: the process of deriving high-quality information from text.
Today's presenters:
Max Watson: Max's background is in physics and applied mathematics. He completed his undergraduate degree at the University of California, Berkeley, and his PhD at the University of California, Santa Barbara, in 2012. Max specializes in large-scale simulations, signal analysis, and statistical physics. He joined the Data Tactics team in January 2014 and has supported DHS. Max is an EMC Certified Data Scientist.
David Pekarek: David's background is in mechanical engineering; he specializes in mechanical control systems, optimization, and spatio-temporal statistics. David finished his PhD at the California Institute of Technology in 2010, joined Data Tactics in the fall of 2012, and currently supports DARPA.
Nathan Danneman: Nathan's background is in political science, with specializations in applied statistics and international conflict. He finished his PhD in June of 2013, and joined Data Tactics in May of that same year. He recently co-authored Social Media Mining with R, is active in the local Data Science community, and currently supports DARPA. Nathan is an EMC Certified Data Scientist.
Cluster Analysis of
Honeypot Data
By Max Watson
Outline
• What are Honeypots?
• Cluster Analysis
  - General Principles
  - Density Based Clustering
• Cluster Analysis Applied to Honeypot Data
• Conclusions
Honeypots
Honeypots are traps set to detect, deflect, or counteract attempts at unauthorized use of information systems.
8 websites: (USA, 4), (Singapore, 2), (Brazil, 2) [brought to you by Ted Procita]
Collection period: October 15, 2013 - November 18, 2013
2 sources of data: requests at firewall and requests at webserver
Number of webserver requests: ~4000
Some information from a typical 'hit' on the webserver:
IP address	Country	Request	Timestamp
101.227.4.25	CN	/robots.txt	10/17/13 17:58:21
Goals of Honeypot Analysis
• Categorize IP addresses in terms of similar requests
• Determine how requests vary by country
• Detect outliers
Cluster Analysis
Grouping similar objects:
Requirements:
1) Distance metric
2) Method for grouping nearby objects
Distance I: Combine Requests
1) Gather all unique requests invoked by each IP address:
	IP1 ⇒ { /, /robots.txt, …}
	IP2 ⇒ { /HNAP1/, /manager/html, …}
	...
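The gathering step above can be sketched in Python as follows. This is a minimal illustration, not the presenters' code: the function name `requests_by_ip` and the sample hits are assumptions, with only the `101.227.4.25` / `/robots.txt` pair taken from the slides.

```python
from collections import defaultdict


def requests_by_ip(hits):
    """Map each IP address to the set of unique requests it issued."""
    grouped = defaultdict(set)
    for ip, request in hits:
        grouped[ip].add(request)  # a set keeps each request only once
    return dict(grouped)


# Hypothetical webserver hits as (IP address, request) pairs.
hits = [
    ("101.227.4.25", "/robots.txt"),
    ("101.227.4.25", "/"),
    ("101.227.4.25", "/robots.txt"),  # repeated request, counted once
    ("10.0.0.7", "/HNAP1/"),
    ("10.0.0.7", "/manager/html"),
]
request_sets = requests_by_ip(hits)
```

Using sets rather than lists means repeat requests do not affect the similarity measure applied later.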
Distance II: Jaccard Similarity
Requests from IP address A: {♣, ♦}
Requests from IP address B: {♥, ♦}
Jaccard Similarity: J(A, B) = |intersection(A, B)| / |union(A, B)|
intersection(A, B) = {♦}	union(A, B) = {♣, ♦, ♥}
J(A, B) = 1/3
Effective Distance: D = 1 - J, with 0 ≤ D ≤ 1
D = 0 : A and B issue the same requests
D = 1 : A and B issue completely different requests
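As a small worked example of the effective distance D = 1 - J, here is a Python sketch. The function name `jaccard_distance` and the placeholder request names are assumptions, and treating two empty request sets as identical is a choice made here, not something the slides specify.

```python
def jaccard_distance(a, b):
    """Effective distance D = 1 - J(A, B) between two sets of requests."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty request sets count as identical
    return 1.0 - len(a & b) / len(union)


# Two IPs sharing one of three distinct requests: J = 1/3, so D = 2/3.
d_example = jaccard_distance({"club", "diamond"}, {"other", "diamond"})
d_same = jaccard_distance({"/", "/robots.txt"}, {"/", "/robots.txt"})
d_disjoint = jaccard_distance({"/"}, {"/manager/html"})
```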
Distance III: All Pairs
Calculate effective distance between all pairs of IP addresses
...leaving us with:
But usually in a high number of dimensions!
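The all-pairs step can be sketched as a symmetric distance matrix. This is an illustrative Python version under the assumption that each IP has already been reduced to a set of requests; the function name and the three sample IPs are hypothetical.

```python
from itertools import combinations


def pairwise_jaccard_distances(request_sets):
    """Return (ips, matrix) with matrix[i][j] = 1 - Jaccard similarity."""
    ips = sorted(request_sets)
    n = len(ips)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        a, b = request_sets[ips[i]], request_sets[ips[j]]
        union = a | b
        d = 1.0 - len(a & b) / len(union) if union else 0.0
        matrix[i][j] = matrix[j][i] = d  # distance is symmetric
    return ips, matrix


# Hypothetical request sets for three IP addresses.
request_sets = {
    "ip1": {"/", "/robots.txt"},
    "ip2": {"/HNAP1/", "/manager/html"},
    "ip3": {"/", "/manager/html"},
}
ips, dist = pairwise_jaccard_distances(request_sets)
```

The matrix has one row per IP address, so its size grows quadratically with the number of addresses.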
Identifying Clusters
How many clusters are there?
Are there outliers?
Density Based Clustering:
	● connectivity
	● density
Connectivity
[Figure: points linked when they fall within a Distance Threshold, forming Cluster 1 and Cluster 2; plot of Number of Clusters (3, 2, 1) vs. Distance Threshold]
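The connectivity idea (link any two points closer than a distance threshold, then count the connected groups) can be sketched in Python with a union-find structure. The function name, the 1-D sample points, and the thresholds are assumptions chosen for illustration.

```python
def count_clusters(dist, threshold):
    """Count connected components when points within `threshold` are linked."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)  # merge the two components
    return len({find(i) for i in range(n)})


# Hypothetical 1-D points; the distance is the absolute difference.
points = [0.0, 1.0, 5.0]
dist = [[abs(p - q) for q in points] for p in points]
counts = [count_clusters(dist, t) for t in (0.5, 2.0, 6.0)]
```

As the threshold grows, the number of clusters can only stay the same or fall, which is the staircase shape of a clusters-versus-threshold plot.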
Density Based Clustering
2 parameters: distance threshold and minimum number of neighbors (DBSCAN)
Example: minimum number = 2
[Figure: example points labeled as Clusters and Outliers]
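A compact Python sketch of DBSCAN on a precomputed distance matrix shows how the two parameters interact. This is a textbook-style illustration rather than the implementation used for the honeypot analysis; the names `eps` and `min_neighbors` are assumptions, and here a point's neighbors exclude the point itself.

```python
def dbscan(dist, eps, min_neighbors):
    """Label points with a cluster id (0, 1, ...) or -1 for outliers."""
    n = len(dist)
    noise = -1
    labels = [None] * n  # None means "not yet visited"

    def neighbors(i):
        return [j for j in range(n) if j != i and dist[i][j] <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_neighbors:
            labels[i] = noise  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_neighbors:
                queue.extend(j_nbrs)  # core point: keep expanding the cluster
    return labels


# Hypothetical 1-D points: two dense groups and one isolated point.
points = [0.0, 0.5, 1.0, 5.0, 5.4, 5.8, 20.0]
dist = [[abs(p - q) for q in points] for p in points]
labels = dbscan(dist, eps=0.6, min_neighbors=2)
```

Points that never fall within `eps` of a dense point keep the label -1, which is how the outliers on the slide arise.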
Shiny App for Analysis
China
Dominant Requests of Each Cluster
❶ /robots.txt	❷ /
❸ /manager/html	❹ www.baidu.com/
China
[Figure: Hits Over Time; x-axis: Time (~34 Days); y-axis: Number of Hits; series: /manager/html, Other]
United States
10 Clusters (Malicious & Benign)
United States
[Figure: Hits Over Time; x-axis: Time (~34 Days); y-axis: Number of Hits; series: Clusters, Outliers from same IP address]
Accomplishments
✓	Categorized behavior of IP addresses based on requests
✓	Detected outliers
✓	Determined how requests vary by country (China vs. USA)
What Clustering Can Do for You
Objects + Attributes
Cluster the Objects	Cluster the Attributes
Applications:
• IP addresses & their requests	• patients & their symptoms
• devices & their malfunctions	• people & their associates
Port Based Clustering of Firewall Activity
By David Pekarek
Firewall Activity Clustering Workflow
Honeypot Firewall Activity: ~32K logs (time, host, src IP, location, ports, protocol)
→ Data Preprocessing and Vectorization: aggregated counts of IPs' dest. port hits (~19K IPs, 128 ports)
→ OPTICS Clustering: Reachability Distance plot identifying user clusters and outliers
→ Follow-on Investigations: characteristics of outlying IPs and IP clusters (e.g., distinct activity levels on port 53)
• The majority of source IPs make use of only one destination port
• 94% of source IPs fall into some cluster with similar port usage and traffic volume
OPTICS: Hierarchical Density Based Clustering
• Clustering algorithms provide a means to sort data without pre-existing labels
• Density-based clustering methods are robust in identifying clusters with non-uniform shapes
• The OPTICS algorithm is a density-based approach that simultaneously evaluates cluster results at different scales
[Figure: k-Means results vs. density-based clustering results]
Is this one cluster or two? The answer depends on scale!
OPTICS: How does it work?
• The OPTICS algorithm performs two major operations on the data:
  • determining an ordering of all data points, based on the likelihood of points being clustered together
  • assigning each point a Reachability Distance (R.D.): a quantification of the length scale at which the given point will belong to any cluster
• Plotting R.D. vs the ordered data points, clusters appear as troughs
[Figure: reachability plot with troughs labeled Whole face, Eye, Eye, Smile, and with Outliers marked]
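The two operations (ordering the points and assigning reachability distances) can be sketched in a simplified O(n²) Python version of OPTICS. This is an illustration under stated assumptions, not the presenters' implementation: it works on a precomputed distance matrix, `min_pts` counts the point itself, and `max_eps` defaults to no limit.

```python
import math


def optics(dist, min_pts, max_eps=math.inf):
    """Return (order, reach): visiting order and reachability distance per point."""
    n = len(dist)
    reach = [math.inf] * n  # inf means "undefined" (e.g. the first point visited)
    processed = [False] * n
    order = []

    def core_distance(p):
        # Distance to the min_pts-th nearest point, counting p itself.
        d = sorted(dist[p])
        if len(d) < min_pts or d[min_pts - 1] > max_eps:
            return None
        return d[min_pts - 1]

    def update(p, seeds):
        core = core_distance(p)
        if core is None:
            return
        for o in range(n):
            if processed[o] or dist[p][o] > max_eps:
                continue
            new_reach = max(core, dist[p][o])
            if new_reach < reach[o]:
                reach[o] = new_reach
                seeds.add(o)

    for start in range(n):
        if processed[start]:
            continue
        processed[start] = True
        order.append(start)
        seeds = set()
        update(start, seeds)
        while seeds:
            # Always visit the most reachable remaining point next.
            q = min(seeds, key=lambda o: (reach[o], o))
            seeds.remove(q)
            processed[q] = True
            order.append(q)
            update(q, seeds)
    return order, reach


# Hypothetical 1-D points: two tight groups and one far-away point.
points = [0, 1, 2, 50, 51, 52, 500]
dist = [[abs(p - q) for q in points] for p in points]
order, reach = optics(dist, min_pts=2)
```

Reading `reach` in visiting order gives low values inside each tight group (the troughs), a jump when moving between groups, and a large value for the isolated point.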
OPTICS: How was it applied?
Workflow: Data Preprocessing and Vectorization → OPTICS Clustering → Follow-on Investigations
• Source IPs were used as the identifier for entities with traffic hitting the honeypot firewall.
• Destination ports were used to define the dimensions of the feature space. Each of the 127 most common ports (those with at least 60 hits from the total population) got its own dimension. The remaining 'rare' ports were bundled into a single dimension.
• The OPTICS algorithm identified clusters of IPs in 128-dimensional space, with clustering results summarized in a 2-D reachability plot.
• Follow-on investigations were performed to identify anomalous properties of outlying IPs and commonalities among clustered IPs.
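The vectorization step (one dimension per common destination port plus a single bundled dimension for rare ports) might look like the Python sketch below. The function name, the `min_hits` parameter, and the toy log records are assumptions; the slides use a threshold of 60 hits, while the example uses 2 to stay small.

```python
from collections import Counter, defaultdict


def port_count_vectors(logs, min_hits):
    """logs: iterable of (src_ip, dest_port). Returns (columns, vectors)."""
    logs = list(logs)
    totals = Counter(port for _, port in logs)
    common = sorted(p for p, c in totals.items() if c >= min_hits)
    columns = common + ["other"]  # rare ports share the final column
    index = {p: i for i, p in enumerate(common)}
    vectors = defaultdict(lambda: [0] * len(columns))
    for ip, port in logs:
        vectors[ip][index.get(port, len(common))] += 1
    return columns, dict(vectors)


# Hypothetical firewall records as (source IP, destination port).
logs = [("a", 22), ("a", 22), ("b", 22), ("b", 53), ("c", 53), ("c", 4899)]
columns, vectors = port_count_vectors(logs, min_hits=2)
```

Each source IP ends up as one row of counts, ready to be passed to a distance-based clustering method.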
Firewall Port Usage Clustering Results
[Figure: reachability plot annotated with "Clusters with some distinctive activity" and "Outlying IPs (their activity falls into clusters only at extremely generous length scales)"]
Interactive Plotting Demo
Port Usage Cluster Characterization
• IPs with minimal activity on highly travelled ports (22, 53, Other)
• 1-15 hits on port 80 (HTTP)
• 1-10 hits on port 3389 (RDP)
• 1-14 hits on port 1433 (MSSQL)
• 1-6 hits on port 445 (Active Directory)
• Small clusters with activity on less used ports (3306, 5060, 4899, 135, 25, 23, 45091, 48879, 1234)
• Outlying IPs: activity on multiple ports or very seldom used ports
Port 53 Traffic Clustering Validation
OPTICS identifies the multimodal distribution of traffic to port 53 (DNS)
Conclusions
• Destination ports show little correlation in the firewall logs. Source
IPs tend to cluster by the one port to which they sent traffic.
• OPTICS clustering efficiently sorts source IPs as outliers or
belonging to a cluster of common port usage.
• Interactive plotting tools allow for the rapid characterization of
clusters.
Latent Dirichlet Allocation: Characterizing normal behavior and identifying deviations from normality
By Nathan Danneman
Outline
• What is Latent Dirichlet Allocation (LDA)?
• How does it compare to other clustering tools?
• LDA by example: analyzing log files
LDA is a Mixture Model
• Mixture Models:
– Identify sets of variables that co-occur (behavioral patterns)
– Determine what behavioral patterns each individual exhibits
• Example: The Sports Equipment Analogy
Golf Clubs Tennis Racket Golf Balls Tennis Balls Baseball Bat
John 12 4
Susan 14 1 6 3
Chris 2 3
Jane 1 11 1
LDA Utilizes Soft Clustering
• Hard Clustering: every point is
assigned to one group
• Hard Clustering with Outliers: every
point is assigned to one or no
groups
• Soft Clustering: every point is
assigned to zero, one, or several
groups.
[Figure: example points plotted on axes x1 and x2]
Input Data for LDA: Cyber Data
• LDA takes a matrix of counts
• Data: log files from a large network; 8700 users, 85 log types
• Each row represents a user
• Each column represents a log type
	Connection Success	Termination Success	Invalid Login	...
User 1	0	3	2
User 2	12	3	0
User 3	3	0	18
User 4	2	22	1
User 5	7	5	9
...
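Building the user-by-log-type matrix of counts from raw log events can be sketched as follows. The function name and the event tuples are assumptions; the counts in the example mirror the first two rows of the table above.

```python
from collections import Counter


def count_matrix(events):
    """events: iterable of (user, log_type). Returns (users, log_types, matrix)."""
    counts = Counter(events)
    users = sorted({u for u, _ in counts})
    log_types = sorted({t for _, t in counts})
    # One row per user, one column per log type; missing pairs count as 0.
    matrix = [[counts[(u, t)] for t in log_types] for u in users]
    return users, log_types, matrix


# Hypothetical raw events reproducing the first two rows of the table.
events = (
    [("User 1", "Termination Success")] * 3
    + [("User 1", "Invalid Login")] * 2
    + [("User 2", "Connection Success")] * 12
    + [("User 2", "Termination Success")] * 3
)
users, log_types, matrix = count_matrix(events)
```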
LDA Estimates Two Mixtures
• Output 1: logs that co-occur, forming behavioral patterns
• Output 2: which behavioral pattern(s) characterize each user
[Figure: Log Types 1-4 ... linked to Behavioral Patterns 1 and 2, which are linked to Users 1-4 ...]
Each log relates to zero, one, or many behavioral patterns
Users exhibit zero, one, or many behaviors
LDA Workflow
• Build the N (observations) by P (log types) matrix of counts
• Use an empirical method to determine the optimal number of behavioral patterns to estimate
• Estimate the model
	Connection (Successful)	Connection (Failure)	Termination (Successful)	Connection (Time-Out)
User1	15	15	0	3
User2	8	12	2	0
Output 1: Mapping Log Types to Behavioral Patterns:

Behavioral Pattern #3
Firewall.Connections.Successful
Firewall.Connections.Terminations
Behavioral Pattern #4
Firewall.Connections.Successful
Firewall.Connections.Terminations
Windows.Hosts.User.Logins
Windows.Hosts.User.Logoffs
Windows.Hosts.User.Privileged.Use.Successful
Behavioral Pattern Characterization
• Behavioral Pattern 1:
• Windows Hosts: Failed Logins
• Behavioral Pattern 2:
• Firewall: Connections
• Windows Hosts: Logins, Logoffs
• Behavioral Pattern 3:
• Firewall: Connections, Terminations
• Behavioral Pattern 4:
• Windows Hosts: Logins
• Behavioral Pattern 5:
• Firewall: System Normal, Connections, Terminations
• Behavioral Pattern 6:
• Web Logs: System Normal
• Behavioral Pattern 7:
• Firewall: System Errors
Normal Activity
Abnormal Activity
Abnormal Activity
[Figure: behavioral patterns labeled Windows Hosts: Failed Logins; Firewall: Connections; Windows Hosts: Logins, Logoffs; Firewall: Connections, Terminations; Windows Hosts: Logins; Firewall: System Normal, Connections, Terminations; Web Logs: System Normal; Firewall: System Errors]
Characterizing Users with Behavioral Patterns
User # 2: essentially entirely firewall connections and terminations.
User # 43: lots of failed logins!
Normal activity: connections, terminations, logins, logoffs
Visualizing Two-Level Mixtures
[Figure: a user placed between Behavioral Pattern 3 and Behavioral Pattern 4]
A User Characterized by: 45% Behavior 3 and 55% Behavior 4
Outlier Detection with LDA
• Mixture models make predictions about the
proportion of each log type a user will have

• We can compare the predicted proportions to each
user’s actual proportions to see how well the
model captures each user’s actions

• Typical users should be well-characterized by
mixtures of common behavioral patterns – these
are “normal” users

• Users whose actions are not mixtures of common
behavioral patterns are doing things that are
uncommon – these are outliers
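The comparison described above can be sketched in Python: mix the behavioral patterns' log-type proportions by a user's pattern weights to get predicted proportions, then score agreement with the observed proportions by cosine similarity. All numbers, names, and the two-pattern setup are hypothetical; in practice the weights and pattern profiles would come from a fitted LDA model.

```python
import math


def predicted_proportions(user_mix, pattern_profiles):
    """Weight each pattern's log-type proportions by the user's pattern mix."""
    n_types = len(pattern_profiles[0])
    return [
        sum(w * profile[t] for w, profile in zip(user_mix, pattern_profiles))
        for t in range(n_types)
    ]


def cosine_similarity(x, y):
    """Cosine of the angle between two proportion vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical patterns over three log types (each row sums to 1).
pattern_profiles = [
    [0.5, 0.5, 0.0],  # e.g. connections and terminations
    [0.0, 0.0, 1.0],  # e.g. logins
]
user_mix = [0.45, 0.55]  # 45% first pattern, 55% second pattern
predicted = predicted_proportions(user_mix, pattern_profiles)

typical_user = [0.25, 0.20, 0.55]  # observed proportions close to the prediction
odd_user = [0.0, 1.0, 0.0]         # observed proportions far from the prediction
sim_typical = cosine_similarity(predicted, typical_user)
sim_odd = cosine_similarity(predicted, odd_user)
```

A user whose observed proportions score low against the model's prediction is a candidate outlier; where to draw the cutoff is a judgment call that this sketch does not settle.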
Measuring User-Level Discrepancy
[Figure: Proportions of All Log Types for a Single User, predicted vs. observed; Cosine Similarity = 0.99]
[Figure: Proportions of All Log Types for a Single User, predicted vs. observed; Cosine Similarity = 0.02]
Cosine Similarity between Predicted and Observed Data (all users)
~99% of users are well-explained
Cosine Similarity between Predicted and Observed Data (poorly fit users)
LDA Detects Univariate Outliers
One user had 77% Windows Hosts Failed Logins; mean for data is 0.002%
[Figure: Proportion of Windows Hosts: Failed Login Logs, with User # 12 marked]
LDA Detects Conditional Outliers
User # 53 has a typical proportion of Firewall Termination logs...
However, User 53 has more than twice as many Firewall Terminations as users with his/her same proportion of Firewall Connections.
[Figure: Number of Users vs. Percentage of Logs that are Firewall Terminations, with User 53 marked; Firewall Terminations comprise about 50% of many users' logs]
[Figure: Number of Users vs. Percentage of Logs that are Firewall Terminations, among users with 53's proportion of Firewall Connections, with User 53 marked]
Conclusions
• LDA allows an analyst to:
– Succinctly characterize common behavioral patterns
– Capture nuance through soft clustering
– Identify both simple and conditional outliers
• Next Steps:
– Radically improve parallelized versions of LDA
– Build enhanced visualizations that allow analysts to interact with data
• Previous Steps:
– Cyber IR&D II - Honeypots & Topic Graphs
– https://portal.data-tactics-corp.com/sites/analytics/Shared%20Documents/honeypots.pdf
Final Thoughts…
• Query-based analytics are tenuous for data with large feature spaces and population sizes. For complete answers, we must analyze with comprehensive algorithms.
• Cyber systems regularly lack reliable (or stationary) models and priors. Hence we have focused on questions of pattern detection (hard) and outlier detection (harder) for big cyber data, primarily obtaining results via clustering analyses.
• There are many, many clustering algorithms, each with distinct features and requirements (No Free Lunch theorems). Choosing the most appropriate tool requires a deep understanding of the available data, the questions at hand, and the pros and cons of applicable methods.
• L-3 Data Tactics has several minimum viable products (MVPs) working on very hard elements of the cyber analytics problem set.
• These MVPs can be used in a support function to existing security-protocol and signature-based systems, or can provide those systems already in place with pattern and anomaly detection.
• Previous and future honeypot collection will further define L-3's cyber competencies in proactive cyber analytics.
...on the Horizon:
Honeypots and Twitter Collection Platforms
Summer Data Science Internship Program (Robert R. & USMA cadets):
	Honeypots analytical application development
	USA Civil Affairs & CERDEC Analytics
		http://glimmer.rstudio.com/gosystems01/Stability/
Next Data Science Brown Bag late July.
DS4G & DS4PM both making appearances this year.
Data Science on display at the L-3 Technology Exchange 2014… more to come.
The Data Science Team
Homepage: http://www.data-tactics.com
Blog: http://datatactics.blogspot.com
GitHub: https://github.com/DataTacticsCorp
Twitter: https://twitter.com/DataTactics
Twitter: https://twitter.com/rheimann
Twitter: https://twitter.com/mwatson
Twitter: https://twitter.com/ndanneman
Or, me (Rich Heimann) at rheimann@data-tactics-corp.com
Questions?

More Related Content

What's hot

Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackBhaskar Mitra
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information RetrievalBhaskar Mitra
 
A Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics CorporationA Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics CorporationRich Heimann
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalBhaskar Mitra
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Bhaskar Mitra
 
Sherlock a deep learning approach to semantic data type dete
Sherlock a deep learning approach to semantic data type deteSherlock a deep learning approach to semantic data type dete
Sherlock a deep learning approach to semantic data type detemayank272369
 
Big Social Data: The Spatial Turn in Big Data (Video available soon on YouTube)
Big Social Data: The Spatial Turn in Big Data (Video available soon on YouTube)Big Social Data: The Spatial Turn in Big Data (Video available soon on YouTube)
Big Social Data: The Spatial Turn in Big Data (Video available soon on YouTube)Rich Heimann
 
Recommender Systems in the Linked Data era
Recommender Systems in the Linked Data eraRecommender Systems in the Linked Data era
Recommender Systems in the Linked Data eraRoku
 
Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...Bhaskar Mitra
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Andre Freitas
 
Topic Models Based Personalized Spam Filter
Topic Models Based Personalized Spam FilterTopic Models Based Personalized Spam Filter
Topic Models Based Personalized Spam FilterSudarsun Santhiappan
 
Towards Automatic Analysis of Online Discussions among Hong Kong Students
Towards Automatic Analysis of Online Discussions among Hong Kong StudentsTowards Automatic Analysis of Online Discussions among Hong Kong Students
Towards Automatic Analysis of Online Discussions among Hong Kong StudentsCITE
 
DATA MINING.doc
DATA MINING.docDATA MINING.doc
DATA MINING.docbutest
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackBhaskar Mitra
 
ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization
ABSTAT: Ontology-driven Linked Data Summaries with Pattern MinimalizationABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization
ABSTAT: Ontology-driven Linked Data Summaries with Pattern MinimalizationBlerina Spahiu
 

What's hot (20)

Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval
 
The Duet model
The Duet modelThe Duet model
The Duet model
 
A Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics CorporationA Blended Approach to Analytics at Data Tactics Corporation
A Blended Approach to Analytics at Data Tactics Corporation
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrieval
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)
 
Sherlock a deep learning approach to semantic data type dete
Sherlock a deep learning approach to semantic data type deteSherlock a deep learning approach to semantic data type dete
Sherlock a deep learning approach to semantic data type dete
 
Big Social Data: The Spatial Turn in Big Data (Video available soon on YouTube)
Big Social Data: The Spatial Turn in Big Data (Video available soon on YouTube)Big Social Data: The Spatial Turn in Big Data (Video available soon on YouTube)
Big Social Data: The Spatial Turn in Big Data (Video available soon on YouTube)
 
Recommender Systems in the Linked Data era
Recommender Systems in the Linked Data eraRecommender Systems in the Linked Data era
Recommender Systems in the Linked Data era
 
Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)
 
Word Embedding In IR
Word Embedding In IRWord Embedding In IR
Word Embedding In IR
 
Topic Models Based Personalized Spam Filter
Topic Models Based Personalized Spam FilterTopic Models Based Personalized Spam Filter
Topic Models Based Personalized Spam Filter
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
 
Towards Automatic Analysis of Online Discussions among Hong Kong Students
Towards Automatic Analysis of Online Discussions among Hong Kong StudentsTowards Automatic Analysis of Online Discussions among Hong Kong Students
Towards Automatic Analysis of Online Discussions among Hong Kong Students
 
Question answering
Question answeringQuestion answering
Question answering
 
DATA MINING.doc
DATA MINING.docDATA MINING.doc
DATA MINING.doc
 
Text categorization
Text categorizationText categorization
Text categorization
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning Track
 
ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization
ABSTAT: Ontology-driven Linked Data Summaries with Pattern MinimalizationABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization
ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization
 

Viewers also liked

Data Tactics Semantic and Interoperability Summit Feb 12, 2013
Data Science and Analytics Brown Bag

  • 1. The Power of Partnership – from Vision to Reality. L-3 Data Tactics: Data Science Brown Bag. Welcome! Hard and Soft Clusters and Cyber Data. April 22, 2014. R² = 500; p < .05; asymptotically approaching perfect.
  • 2. DT Data Science Brown Bag: Outline
    • Why a (our 3rd) Data Science Brown Bag? (Rich H.)
    • About Us & About You (Rich H.)
    • Case Studies in Cyber:
        • What are clustering, honeypots, and density-based clustering? (Max W.)
        • What is OPTICS clustering, how does it differ from density-based (DB) clustering, and how can it be used for outlier detection? (David P.)
        • What is so-called soft clustering, how does it differ from hard clustering, and how can it be used for outlier detection? (Nathan D.)
    • On the horizon... (Rich H.)
  • 3. DT Data Science Brown Bag: Outline
    Learning [close] at a pace similar to the pace at which we learn.
    Learning and educating between Data Scientists (DS) and Program Managers (PM), Software Engineers (SWE), and Operators (OP):
    • DS2PM: Provide insights for RFIs/RFPs.
    • PM2DS: Atmospherics from our customers.
    • DS2SWE: Integrating algorithms.
    • SWE2DS: Accessing data spaces.
    • DS2OP: How do you consume the outputs of models?
    • OP2DS: Which models are best to present to OPs?
  • 4. About Us: DT Data Science Team
    The Team: Geoffrey B., Nathan D., Rich H., David P., Ted P., Shrayes R., Jonathan T., Adam VE., Max W.
    Graduates from top universities, many of whom are EMC Data Science Certified.
    Advanced degrees include: mathematics, computer science, astrophysics, electrical engineering, mechanical engineering, statistics, social sciences.
    Base competencies (horizontals): clustering, association rules, regression, naive Bayesian classifier, decision trees, time series, text analysis.
    Going beyond the base (verticals)...
  • 5. About Us: DT Data Science Team
    Horizontals: Clustering || Regression || Decision Trees || Text Analysis || Association Rules || Naive Bayesian Classifier || Time Series Analysis
    Verticals: econometrics, spatial econometrics, graph theory algorithms, astrophysical time-series analysis, path planning algorithms, Bayesian statistics, constrained optimizations, numerical integration techniques, PCA, bagging/boosting, hierarchical models, IRT, space-time, latent class analysis, structural equation modeling, mixture models, SVM, maxent, CART, autoregressive models, ICA, factor analysis, random forest, dimensional reduction, topic models, sentiment analysis, frequency domain patterns, unsupervised by supervised, change-point models, LUBAP, DLISA, DBAC, OPTICS clustering
  • 6. About Us: DT Data Science Team. Hierarchy of Data Scientists
  • 7. About Us: DT Data Science Team
    No Free Lunch (NFL) theorems: no algorithm performs better than any other when performance is averaged uniformly over all possible problems of a particular type. Algorithms must therefore be designed for a particular domain or style of problem; there is no such thing as a general-purpose algorithm.
  • 8. ABOUT YOU: 35 confirmed, 15 WebEx, 21 Data Tactics employees, 13 L-3 NSS employees. Sam Posten was the first to sign up (WebEx) and Aaron Glahe was the first to sign up in person!
    library(twitteR)  # assumes an already authenticated Twitter API session
    library(igraph)
    # define Twitter account names
    l3 <- getUser("L3_NSS")
    dt <- getUser("DataTactics")
    # find all connections independently for each account
    l3.friends <- lookupUsers(l3$getFriendIDs())
    dt.friends <- lookupUsers(dt$getFriendIDs())
    # create one large table that relates each account to the accounts it follows
    relations <- rbind(
      data.frame(user = "L3_NSS", friend = sapply(l3.friends, function(u) u$screenName)),
      data.frame(user = "DataTactics", friend = sapply(dt.friends, function(u) u$screenName)))
    # create the network showing each account's community and the overlap
    g <- graph.data.frame(relations, directed = TRUE)
    # finally plot the graph
    tkplot(g)
  • 11. Why Clustering?
    Clustering has become a workhorse in Big Data and fits into the Six Pillars of Data Mining and our own DS4PM & DS4G framework.
    Six Pillars of Data Mining:
    • Anomaly detection: the identification of unusual data records that might be interesting, or data errors that require further investigation.
    • Association rule learning: searches for relationships between variables.
    • Clustering: the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
    • Classification: the task of generalizing known structure to apply to new data.
    • Regression: finds a function that models the data with the least error.
    • Summarization: provides a more compact representation of the data set.
    Taxonomy of Questions (ref: DS4PM):
    • Causal Effects: an approach to the statistical analysis of cause and effect based on the framework of potential outcomes.
    • Classification/Clustering: identifying to which of a set of categories an observation belongs, on the basis of a labeled training set (classification) or without labels (clustering).
    • Outlier Detection: the identification of events that do not conform to an expected pattern or to other items in a dataset.
    • Big Data and Analytics: discovering interesting relations between variables in large databases.
    • Measurement Models: statistical models that measure the relationships between the observable variables and the unobserved (or "latent") quantity.
    • Text Analysis: the process of deriving high-quality information from text.
  • 12. Today's presenters:
    Max Watson: Max's background is in physics and applied mathematics. He completed his undergraduate degree at the University of California, Berkeley and his PhD at the University of California, Santa Barbara in 2012. Max specializes in large-scale simulations, signal analysis, and statistical physics. He joined the Data Tactics team in January 2014 and has supported DHS. Max is an EMC Certified Data Scientist.
    David Pekarek: David's background is in mechanical engineering; he specializes in mechanical control systems, optimization, and spatio-temporal statistics. David finished his PhD at the California Institute of Technology in 2010, joined Data Tactics in the fall of 2012, and currently supports DARPA.
    Nathan Danneman: Nathan's background is in political science, with specializations in applied statistics and international conflict. He finished his PhD in June of 2013 and joined Data Tactics in May of that same year. He recently co-authored Social Media Mining with R, is active in the local Data Science community, and currently supports DARPA. Nathan is an EMC Certified Data Scientist.
  • 13. Cluster Analysis of Honeypot Data, by Max Watson
  • 14. Outline
    • What are honeypots?
    • Cluster analysis
        - General principles
        - Density-based clustering
    • Cluster analysis applied to honeypot data
    • Conclusions
  • 15. Honeypots
    Honeypots are traps set to detect, deflect, or counteract attempts at unauthorized use of information systems.
    • 8 websites: USA (4), Singapore (2), Brazil (2) [brought to you by Ted Procita]
    • Collection period: October 15, 2013 - November 18, 2013
    • 2 sources of data: requests at the firewall and requests at the webserver
    • Number of webserver requests: ~4000
    Some information from a typical 'hit' on the webserver:
    IP address     Country   Request       Timestamp
    101.227.4.25   CN        /robots.txt   10/17/13 17:58:21
  • 16. Goals of Honeypot Analysis
    • Categorize IP addresses in terms of similar requests
    • Determine how requests vary by country
    • Detect outliers
  • 17. Cluster Analysis
    Grouping similar objects. Requirements:
    1) Distance metric
    2) Method for grouping nearby objects
  • 18. Distance I: Combine Requests
    1) Gather all unique requests invoked by each IP address:
    IP1 ⇒ { /, /robots.txt, …}
    IP2 ⇒ { /HNAP1/, /manager/html, …}
    ...
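The grouping step on slide 18 can be sketched in a few lines of Python. This is only an illustration, not the presenters' code: the function name `requests_by_ip` and every sample hit except the first (`101.227.4.25`, `/robots.txt`, taken from slide 15) are assumptions made up for the example.

```python
from collections import defaultdict

def requests_by_ip(hits):
    """Map each source IP to the set of unique request paths it issued."""
    grouped = defaultdict(set)
    for ip, path in hits:
        grouped[ip].add(path)
    return dict(grouped)

# Hypothetical webserver hits as (IP address, request) pairs.
hits = [
    ("101.227.4.25", "/robots.txt"),
    ("101.227.4.25", "/"),
    ("101.227.4.25", "/robots.txt"),   # repeated request, counted once
    ("198.51.100.7", "/HNAP1/"),
    ("198.51.100.7", "/manager/html"),
]

ip_requests = requests_by_ip(hits)
```

Using a set per IP discards repeat requests, which matches the "unique requests" wording on the slide; request frequency is deliberately ignored in this representation.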
  • 19. Distance II: Jaccard Similarity
    Requests from IP address A: {♣, ♦}
    Requests from IP address B: {, ♦}
    Jaccard similarity: intersection(A, B) = {♦}; union(A, B) = {♣, ♦, }; J(A, B) = |intersection| / |union| = 1/3
    Effective distance: D = 1 - J, ranging from 0 to 1
    D = 0: A and B issue the same requests
    D = 1: A and B issue completely different requests
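As a worked check of the slide 19 arithmetic, here is a minimal Python sketch of the Jaccard distance. The string labels standing in for the card-suit symbols are placeholders chosen for this example, and the convention that two empty sets are at distance 0 is an assumption the slides do not address.

```python
def jaccard_distance(a, b):
    """Return D = 1 - |A ∩ B| / |A ∪ B| for two collections of requests."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty request sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Placeholder labels: A and B share one request and each has one unique request.
A = {"club", "diamond"}
B = {"other", "diamond"}

d = jaccard_distance(A, B)  # J = 1/3, so D = 2/3
```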
  • 20. Distance III: All Pairs
    Calculate the effective distance between all pairs of IP addresses, leaving us with: [figure] But usually in a high number of dimensions!
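The all-pairs step on slide 20 could look like the following Python sketch, which builds a symmetric distance table from per-IP request sets. The names `pairwise_distances`, `IP1`, `IP2`, and `IP3` and the sample request sets are hypothetical; the quadratic loop is fine for a few thousand IPs but is not meant as a scalable implementation.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| (0.0 for two empty sets, by assumption)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def pairwise_distances(ip_requests):
    """Return dist[p][q] for every pair of IPs, symmetric with a zero diagonal."""
    ips = sorted(ip_requests)
    dist = {ip: {ip: 0.0} for ip in ips}
    for p, q in combinations(ips, 2):
        d = jaccard_distance(ip_requests[p], ip_requests[q])
        dist[p][q] = d
        dist[q][p] = d
    return dist

# Hypothetical request sets for three IP addresses.
ip_requests = {
    "IP1": {"/", "/robots.txt"},
    "IP2": {"/HNAP1/", "/manager/html"},
    "IP3": {"/", "/manager/html"},
}

dist = pairwise_distances(ip_requests)
```

Because only distances are stored, the IPs never need explicit coordinates, which is why the slide's remark about a high number of dimensions does not get in the way of clustering.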
  • 21. Identifying Clusters
    How many clusters are there? Are there outliers?
    Density-based clustering:
    ● connectivity
    ● density
  • 22. Connectivity [figure: Cluster 1 and Cluster 2 formed by joining points within a distance threshold; plot of number of clusters (3, 2, 1) against distance threshold]
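The connectivity idea on slide 22 (points closer than a threshold belong to the same cluster, so the cluster count falls as the threshold grows) can be illustrated with a small union-find sketch in Python. The one-dimensional sample points and the function name `count_clusters` are assumptions made for this example, not data from the talk.

```python
def count_clusters(points, threshold):
    """Count connected groups when points within `threshold` are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(points[i] - points[j]) <= threshold:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

# Hypothetical 1-D data: two tight groups and one distant point.
points = [0.0, 1.0, 2.0, 6.0, 7.0, 15.0]
counts = {t: count_clusters(points, t) for t in (0.5, 1.0, 4.0, 8.0)}
```

With these points the count drops from six isolated points to three, two, and finally one cluster as the threshold increases, the same staircase shape the slide's plot describes.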
  • 23. Density-Based Clustering
    2 parameters: distance threshold and minimum number of neighbors (DBSCAN)
    Example: minimum number = 2 [figure: points labeled as Clusters and Outliers]
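To make the two DBSCAN parameters on slide 23 concrete, here is a compact pure-Python sketch. It is a teaching version under stated assumptions, not the implementation used in the talk: `min_neighbors` counts neighbors excluding the point itself (following the slide's wording, whereas many libraries count the point too), the O(n²) neighbor search is only suitable for small data, and the sample points are invented.

```python
def dbscan(points, dist, eps, min_neighbors):
    """Label each point with a cluster id (0, 1, ...) or -1 for an outlier."""
    n = len(points)
    # Neighbors of i: all other points within the distance threshold eps.
    neighbors = [
        [j for j in range(n) if j != i and dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_neighbors:
            labels[i] = -1  # provisional outlier; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_neighbors:
                frontier.extend(neighbors[j])  # core point: keep expanding
    return labels

# Hypothetical 1-D data: two dense groups and one isolated point.
points = [0.0, 0.5, 1.0, 5.0, 5.4, 5.8, 20.0]
labels = dbscan(points, lambda a, b: abs(a - b), eps=0.6, min_neighbors=2)
```

The `dist` argument can be any distance function, so the same routine would accept a Jaccard distance over request sets in place of the absolute difference used here.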
  • 24. Shiny App for Analysis
  • 25. China: Dominant Requests of Each Cluster
    ❶ /robots.txt ❷ / ❸ /manager/html ❹ www.baidu.com/
  • 27. United States: 10 Clusters (Malicious & Benign)
  • 28. United States: Hits Over Time [figure: number of hits against time (~34 days), showing clusters and outliers from the same IP address]
  • 29. Accomplishments
    ✓ Categorized behavior of IP addresses based on requests
    ✓ Detected outliers
    ✓ Determined how requests vary by country (China vs. USA)
  • 30. What Clustering Can Do for You
    Objects + Attributes: cluster the objects, or cluster the attributes.
    Applications:
    • IP addresses & their requests
    • patients & their symptoms
    • devices & their malfunctions
    • people & their associates
  • 31. Port-Based Clustering of Firewall Activity, by David Pekarek
  • 32. Firewall Activity Clustering Workflow
    1. Data preprocessing and vectorization: honeypot firewall activity (~32K logs: time, host, src IP, location, ports, protocol) becomes aggregated counts of each IP's destination port hits (~19K IPs × 128 ports).
    2. OPTICS clustering: a reachability distance plot identifying user clusters and outliers.
    3. Follow-on investigations: characteristics of outlying IPs and IP clusters, such as distinct activity levels on port 53.
    • The majority of source IPs make use of only one destination port.
    • 94% of source IPs fall into some cluster with similar port usage and traffic volume.
  • 33. OPTICS: Hierarchical Density-Based Clustering
    • Clustering algorithms provide a means to sort data without pre-existing labels.
    • Density-based clustering methods are robust in identifying clusters with non-uniform shapes.
    • The OPTICS algorithm is a density-based approach that simultaneously evaluates cluster results at different scales.
    [figure: k-means results compared with density-based clustering results] Is this one cluster or two? The answer depends on scale!
  • 34. OPTICS: How does it work? • The OPTICS algorithm performs two major operations on the data:! • determining an ordering of all data points, based on the likelihood of points being clustered together • assigning each point a Reachability Distance (R.D.): a quantification of the length scale at which the given point will belong to any cluster • Plotting R.D. vs the ordered data points, clusters appear as troughs Whole face Eye EyeSmile Outliers L-3
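The two operations on slide 34 (ordering the points and assigning each a reachability distance) can be illustrated with a small sketch. This is a simplified, quadratic-time version written for illustration, not the presenters' implementation: it assumes an unbounded neighborhood radius, counts a point as its own first neighbor when computing the core distance, and uses made-up 2-D points. The names `optics`, `min_pts`, and `points` are hypothetical.

```python
import math

def optics(points, min_pts):
    """Simplified OPTICS (unbounded radius). Returns (order, reach)."""
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    def core_distance(i):
        # Distance to the min_pts-th nearest neighbor, counting the point itself.
        return sorted(dist(i, j) for j in range(n))[min_pts - 1]

    reach = [math.inf] * n      # reachability distance per point
    processed = [False] * n
    order = []                  # visiting order of point indices

    def update(p, seeds):
        # Offer every unprocessed point a (possibly smaller) reachability via p.
        cd = core_distance(p)
        for o in range(n):
            if processed[o]:
                continue
            new_reach = max(cd, dist(p, o))
            if new_reach < reach[o]:
                reach[o] = new_reach
                seeds[o] = new_reach

    for start in range(n):
        if processed[start]:
            continue
        processed[start] = True
        order.append(start)
        seeds = {}
        update(start, seeds)
        while seeds:
            # Always visit the most reachable pending point next.
            q = min(seeds, key=seeds.get)
            del seeds[q]
            processed[q] = True
            order.append(q)
            update(q, seeds)
    return order, reach

# Made-up data: two tight groups and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
order, reach = optics(points, min_pts=3)
for idx in order:
    print(points[idx], round(reach[idx], 2))
```

Read in visiting order, the low reachability values inside each group form the troughs of a reachability plot, the jump between the groups marks the cluster boundary, and the isolated point is visited last with a large reachability distance.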
  • 35. OPTICS: How was it applied? (Workflow: Data Preprocessing and Vectorization, OPTICS Clustering, Follow-on Investigations) • Source IPs used as the identifier for entities with traffic hitting the honey pot firewall. • Destination ports used to define the dimensions of feature space. Each of the 127 most common ports (those with at least 60 hits from the total population) got its own dimension. The remaining ‘rare’ ports bundled as a single dimension. • OPTICS algorithm identified clusters of IPs in 128-dimensional space, with clustering results summarized in a 2-D reachability plot • Follow-on investigations performed to identify anomalous properties of outlying IPs and commonalities among clustered IPs
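The vectorization step on slide 35 can be sketched as follows: count hits per (source IP, destination port), give each sufficiently common port its own dimension, and bundle the rest into one final dimension. The function name `vectorize_port_hits`, the record layout, and the sample addresses are assumptions for illustration; the slides use a threshold of 60 hits, while the toy example uses 2 so that the effect is visible on six records.

```python
from collections import Counter, defaultdict

def vectorize_port_hits(records, min_hits):
    """records: iterable of (src_ip, dest_port) pairs, one per log entry."""
    records = list(records)
    port_totals = Counter(port for _, port in records)
    # Ports with at least min_hits total hits each get their own column.
    common_ports = sorted(p for p, c in port_totals.items() if c >= min_hits)
    column = {p: i for i, p in enumerate(common_ports)}
    other = len(common_ports)  # last column bundles all rare ports

    vectors = defaultdict(lambda: [0] * (other + 1))
    for ip, port in records:
        vectors[ip][column.get(port, other)] += 1
    return common_ports, dict(vectors)

# Hypothetical sample records (private addresses, invented counts).
logs = [("10.0.0.1", 22), ("10.0.0.1", 22), ("10.0.0.2", 53),
        ("10.0.0.2", 53), ("10.0.0.3", 22), ("10.0.0.3", 4899)]
ports, vectors = vectorize_port_hits(logs, min_hits=2)
print(ports)
for ip, vec in vectors.items():
    print(ip, vec)
```

Each resulting vector has one entry per common port plus a trailing "rare ports" entry, which mirrors the 127 + 1 = 128 dimensions described on the slide.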
  • 36. Firewall Port Usage Clustering Results
  • 37. Firewall Port Usage Clustering Results: Clusters with some distinctive activity; Outlying IPs (their activity falls into clusters only at extremely generous length scales)
  • 38. Interactive Plotting Demo
  • 39. Port Usage Cluster Characterization: IPs with minimal activity on highly travelled ports (22, 53, Other). Outlying IPs: activity on multiple ports or very seldom used ports. 1-15 hits on port 80 (HTTP); 1-10 hits on port 3389 (RDP); 1-14 hits on port 1433 (MSSQL); 1-6 hits on port 445 (SMB, used by Active Directory). Small clusters with activity on less used ports (3306, 5060, 4899, 135, 25, 23, 45091, 48879, 1234)
  • 40. Port 53 Traffic Clustering Validation: OPTICS identifies the multimodal distribution of traffic to port 53 (DNS)
  • 41. Conclusions • Destination ports show little correlation in the firewall logs. Source IPs tend to cluster by the one port to which they sent traffic. • OPTICS clustering efficiently sorts source IPs as outliers or belonging to a cluster of common port usage. • Interactive plotting tools allow for the rapid characterization of clusters.
  • 42. Latent Dirichlet Allocation: Characterizing normal behavior and identifying deviations from normality, by Nathan Danneman
  • 43. Outline • What is Latent Dirichlet Allocation (LDA)? • How does it compare to other clustering tools? • LDA by example: analyzing log files
  • 44. LDA is a Mixture Model • Mixture Models: – Identify sets of variables that co-occur (behavioral patterns) – Determine what behavioral patterns each individual exhibits • Example: The Sports Equipment Analogy. [Table of item counts per person; columns: Golf Clubs, Tennis Racket, Golf Balls, Tennis Balls, Baseball Bat; non-empty cells by row: John 12, 4; Susan 14, 1, 6, 3; Chris 2, 3; Jane 1, 11, 1]
  • 46. LDA Utilizes Soft Clustering • Hard Clustering: every point is assigned to one group • Hard Clustering with Outliers: every point is assigned to one or no groups • Soft Clustering: every point is assigned to zero, one, or several groups. [Figure axes: x1, x2]
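The three assignment rules on slide 46 can be contrasted with a short sketch. It assumes each point already comes with a membership weight per group (how those weights are estimated is the job of the clustering model and is not shown), and it uses a hypothetical `cutoff` to decide when a weight counts as membership. The function names and all numbers are invented for illustration.

```python
def hard(memberships):
    # Exactly one group: the index of the largest weight (first one on ties).
    return max(range(len(memberships)), key=memberships.__getitem__)

def hard_with_outliers(memberships, cutoff):
    # One group, or None when even the best weight is below the cutoff.
    best = hard(memberships)
    return best if memberships[best] >= cutoff else None

def soft(memberships, cutoff):
    # Zero, one, or several groups: every group whose weight clears the cutoff.
    return [g for g, w in enumerate(memberships) if w >= cutoff]

# Hypothetical membership weights of three points in two groups.
weights = {"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [0.1, 0.1]}
for name, w in weights.items():
    print(name, hard(w), hard_with_outliers(w, 0.3), soft(w, 0.3))
```

Point `b` shows the difference most clearly: hard clustering must pick one of two equally plausible groups, while soft clustering reports both.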
  • 48. Input Data for LDA: Cyber Data • LDA takes a matrix of counts • Data: log files from a large network; 8700 users, 85 log types • Each row represents a user • Each column represents a log type. Example (columns: Connection Success, Termination Success, Invalid Login, ...): User 1: 0, 3, 2; User 2: 12, 3, 0; User 3: 3, 0, 18; User 4: 2, 22, 1; User 5: 7, 5, 9; ...
  • 49. LDA Estimates Two Mixtures • Output 1: logs that co-occur, forming behavioral patterns. [Diagram: Log Types 1-4, ...; Behavioral Patterns 1-2] Each log relates to zero, one, or many behavioral patterns
  • 50. LDA Estimates Two Mixtures • Output 1: logs that co-occur, forming behavioral patterns • Output 2: which behavioral pattern(s) characterize each user. [Diagram: Log Types 1-4, ...; Behavioral Patterns 1-2; Users 1-4, ...] Users exhibit zero, one, or many behaviors
  • 51. LDA Workflow • Build the N (observations) by P (log types) matrix of counts • Use an empirical method to determine the optimal number of behavioral patterns to estimate • Estimate the model. Example (columns: Connection (Successful), Connection (Failure), Termination (Successful), Connection (Time-Out)): User1: 15, 15, 0, 3; User2: 8, 12, 2, 0
  • 52. Output 1: Mapping Log Types to Behavioral Patterns: Behavioral Pattern #3: Firewall.Connections.Successful, Firewall.Connections.Terminations
  • 53. Output 1: Mapping Log Types to Behavioral Patterns: Behavioral Pattern #4: Firewall.Connections.Successful, Firewall.Connections.Terminations, Windows.Hosts.User.Logins, Windows.Hosts.User.Logoffs, Windows.Hosts.User.Privileged.Use.Successful
  • 54. Behavioral Pattern Characterization • Behavioral Pattern 1: Windows Hosts: Failed Logins • Behavioral Pattern 2: Firewall: Connections; Windows Hosts: Logins, Logoffs • Behavioral Pattern 3: Firewall: Connections, Terminations • Behavioral Pattern 4: Windows Hosts: Logins • Behavioral Pattern 5: Firewall: System Normal, Connections, Terminations • Behavioral Pattern 6: Web Logs: System Normal • Behavioral Pattern 7: Firewall: System Errors. [Slide annotations: Normal Activity, Abnormal Activity, Abnormal Activity]
  • 57. Characterizing Users with Behavioral Patterns: User #2: essentially, entirely firewall connections and terminations.
  • 58. Characterizing Users with Behavioral Patterns: User #43: Lots of failed logins! Normal activity: connections, terminations, logins, logoffs
  • 59. Visualizing Two-Level Mixtures: Behavioral Pattern 3, Behavioral Pattern 4. A user characterized by 45% Behavior 3 and 55% Behavior 4
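The two-level mixture on slide 59 implies a predicted share for every log type: the user's pattern weights times each pattern's log-type profile, summed over patterns. The sketch below works that arithmetic for the 45% / 55% user. The pattern profiles are invented for illustration (the slides name the log types in Patterns 3 and 4 but not their proportions), and `predicted_proportions` is a hypothetical helper name.

```python
def predicted_proportions(user_mix, pattern_profiles):
    """Weighted sum of pattern profiles; inputs are proportions summing to 1."""
    log_types = sorted({lt for prof in pattern_profiles.values() for lt in prof})
    return {
        lt: sum(w * pattern_profiles[k].get(lt, 0.0) for k, w in user_mix.items())
        for lt in log_types
    }

# Hypothetical profiles: share of each log type within a behavioral pattern.
pattern_profiles = {
    3: {"Firewall.Connections.Successful": 0.5,
        "Firewall.Connections.Terminations": 0.5},
    4: {"Firewall.Connections.Successful": 0.2,
        "Firewall.Connections.Terminations": 0.2,
        "Windows.Hosts.User.Logins": 0.3,
        "Windows.Hosts.User.Logoffs": 0.3},
}
user_mix = {3: 0.45, 4: 0.55}  # the user from the slide

prediction = predicted_proportions(user_mix, pattern_profiles)
for log_type, share in prediction.items():
    print(f"{log_type}: {share:.3f}")
```

Because both the user mix and each profile sum to one, the predicted shares also sum to one, so they can be compared directly with the user's observed log-type proportions.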
  • 60. Outlier Detection with LDA • Mixture models make predictions about the proportion of each log type a user will have • We can compare the predicted proportions to each user’s actual proportions to see how well the model captures each user’s actions • Typical users should be well-characterized by mixtures of common behavioral patterns – these are “normal” users • Users whose actions are not mixtures of common behavioral patterns are doing things that are uncommon – these are outliers
  • 61. Measuring User-Level Discrepancy: Cosine Similarity = 0.99. [Plot: Proportions of All Log Types for a Single User]
  • 62. Measuring User-Level Discrepancy: Cosine Similarity = 0.02. [Plot: Proportions of All Log Types for a Single User]
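The discrepancy measure on slides 61 and 62 is the cosine similarity between a user's predicted and observed log-type proportions. A minimal sketch follows, with invented proportion vectors and a hypothetical threshold of 0.5 for calling a user poorly fit; the slides do not state which cutoff was used.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def poorly_fit_users(predicted, observed, threshold):
    # Users whose observed proportions point away from the model's prediction.
    return [user for user in observed
            if cosine_similarity(predicted[user], observed[user]) < threshold]

# Hypothetical proportions over three log types for two users.
predicted = {"u1": [0.5, 0.5, 0.0], "u2": [0.5, 0.5, 0.0]}
observed = {"u1": [0.48, 0.52, 0.0], "u2": [0.0, 0.02, 0.98]}

for user in observed:
    print(user, round(cosine_similarity(predicted[user], observed[user]), 3))
print(poorly_fit_users(predicted, observed, threshold=0.5))
```

A similarity near 1 means the mixture of common patterns explains the user well; a similarity near 0 means the user's logs are concentrated in types the model did not expect.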
  • 63. Cosine Similarity between Predicted and Observed Data (all users): ~99% of users are well-explained
  • 64. Cosine Similarity between Predicted and Observed Data (poorly fit users)
  • 65. LDA Detects Univariate Outliers: One user had 77% Windows Hosts Failed Logins; mean for data is 0.002%. [Plot: Proportion of Windows Hosts: Failed Login Logs; marked: User #12]
  • 66. LDA Detects Conditional Outliers: User #53 has a typical proportion of Firewall Termination logs... However, User 53 has more than twice as many Firewall Terminations as users with his/her same proportion of Firewall Connections. [Plot 1: Percentage of Logs that are Firewall Terminations vs. Number of Users, User 53 marked; Firewall Terminations comprise about 50% of many users’ logs] [Plot 2: same axes, Firewall Terminations among users with 53’s proportion of Firewall Connections, User 53 marked]
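The conditional-outlier idea on slide 66 can be sketched without LDA itself: compare a user's termination share only against peers with a similar connection share. The helper `conditional_ratio`, the `tolerance` used to pick peers, and every number below are assumptions for illustration, chosen so that the target user looks typical overall but unusual among peers.

```python
from statistics import mean

def conditional_ratio(users, target, tolerance):
    """users: {name: (connection_share, termination_share)}.

    Returns the target's termination share divided by the mean termination
    share of peers with a similar connection share, or None without peers.
    """
    conn, term = users[target]
    peers = [t for name, (c, t) in users.items()
             if name != target and abs(c - conn) <= tolerance]
    if not peers:
        return None
    peer_mean = mean(peers)
    return term / peer_mean if peer_mean else None

# Hypothetical shares of each user's logs: (connections, terminations).
users = {
    "a": (0.50, 0.48), "b": (0.45, 0.50),   # many users sit near 50% terminations
    "c": (0.20, 0.22), "d": (0.22, 0.20), "e": (0.18, 0.21),
    "user53": (0.20, 0.50),
}
ratio = conditional_ratio(users, "user53", tolerance=0.05)
print(round(ratio, 2))
```

In this toy data the target's 50% termination share matches users `a` and `b`, yet it is more than twice the average among the peers who share its connection share, which is the pattern the slide describes.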
  • 67. Conclusions • LDA allows an analyst to: – Succinctly characterize common behavioral patterns – Capture nuance through soft clustering – Identify both simple and conditional outliers • Next Steps: – Radically improve parallelized versions of LDA – Build enhanced visualizations that allow analysts to interact with data • Previous Steps: – Cyber IR&D II - Honeypots & Topic Graphs – https://portal.data-tactics-corp.com/sites/analytics/Shared%20Documents/honeypots.pdf
  • 68. Final Thoughts… • Query-based analytics are tenuous for data with large feature spaces and population sizes. For complete answers, we must analyze with comprehensive algorithms. • Cyber systems regularly lack reliable (or stationary) models and priors. Hence we have focused on questions of pattern detection (hard) and outlier detection (harder) for big cyber data, primarily obtaining results via clustering analyses. • There are many, many clustering algorithms, each with distinct features and requirements (No Free Lunch Theorem). Choosing the most appropriate tool requires a deep understanding of the available data, the questions at hand, and the pros and cons of applicable methods.
  • 69. Final Thoughts… • L-3 Data Tactics has several minimum viable products (MVPs) working on very hard elements of the cyber analytics problem set. • These MVPs can be used in a support function to existing security protocols and signature-based systems, or provide those systems already in place with pattern and anomaly detection. • Previous and future honeypot collection will further define L-3’s cyber competencies in proactive cyber analytics.
  • 70. ...on the Horizon: • Honeypots and Twitter Collection Platforms • Summer Data Science Internship Program (Robert R. & USMA cadets): Honeypots analytical application development USA Civil Affairs & CERDEC Analytics http://glimmer.rstudio.com/gosystems01/Stability/ Next Data Science Brown Bag late July. • DS4G & DS4PM both making appearances this year. • Data Science on display at the L-3 Technology Exchange 2014… more to come.
  • 71. The Data Science Team http://datatactics.blogspot.com
  • 72. The Data Science Team https://github.com/DataTacticsCorp
  • 73. Questions? Homepage: http://www.data-tactics.com Blog: http://datatactics.blogspot.com Twitter: https://twitter.com/rheimann, https://twitter.com/DataTactics, https://twitter.com/mwatson, https://twitter.com/ndanneman Or, me (Rich Heimann) at rheimann@data-tactics-corp.com