I presented this talk a while back, at S4 Fall 2012.
S4 is a San Francisco/Bay Area local meetup event for security professionals. Check out the past events here.
http://s4con.blogspot.com/
11. DNS transactions
• The less visited by good clients, the higher chance a domain is bad
• Two types of node
• Node is either visiting, or being visited, but never both
• There are super nodes that link to millions of other nodes
• Domains are classified as benign, malicious, unknown
Page rank
• The more linked by good pages, the higher a page is ranked
• One type of node
• One node can have both inlinks and outlinks
• Most nodes link to a limited amount of other nodes
• Pages are not classified
12. DNS transactions
• The domains visited by more good visitors are ranked high (inlink)
  - Assign a “positive” initial value
• Visitors visiting more good domains are ranked high (outlink)
  - Assign a “positive” initial value
• Linkage matrix NxM (N being the total number of domains, M being the total number of IPs)
• Potentially, we can consider query count as linkage weight
Page rank
• Damping factor (users get bored)
• Random sinks and cycles
• Page ranks are numbers between 0 and 1 and sum up to one in total
• Linkage matrix NxN (N being the total number of pages)
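The NxM linkage matrix with query counts as weights can be sketched as a sparse mapping that stores only the non-zero cells. This is a minimal illustration, not the talk's implementation: the function name `build_weighted_linkage` and the sample log entries are assumptions made up for the example.

```python
from collections import Counter

def build_weighted_linkage(query_log):
    """Count queries per (domain, ip) cell; cells that never occur are not stored."""
    weights = Counter()
    for ip, domain in query_log:
        weights[(domain, ip)] += 1
    return weights

# Hypothetical query log: (client ip, domain) pairs.
query_log = [
    ("10.0.0.1", "example.com"),
    ("10.0.0.1", "example.com"),
    ("10.0.0.2", "example.com"),
    ("10.0.0.2", "example.org"),
]

linkage = build_weighted_linkage(query_log)
domains = sorted({domain for domain, _ in linkage})  # N rows
ips = sorted({ip for _, ip in linkage})              # M columns
```

Here the matrix is 2x2, three of its four cells are stored, and the cell for example.com and 10.0.0.1 carries weight 2.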
13. Recursive definition
$$r(dn)_{t+1} = \sum_{ip} \frac{r(ip)_t}{L(ip)}$$
summed over all ips visiting domain dn
--- $r(ip)_t$: the rank for ip at time t
--- $L(ip)$: the total number of domains ip connects to (in a certain time window)
$$r(ip)_{t+1} = \sum_{dn} \frac{r(dn)_t}{L(dn)}$$
summed over all domains dn visited by ip
--- $r(dn)_t$: the rank for dn at time t
--- $L(dn)$: the total number of ips visiting domain dn (not variant by time)
The denominator gives the marginal (the sum of the counts of the conditioning variable co-occurring with anything else)
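A minimal single-machine sketch of one update step of the recursive definition, assuming an unweighted set of (ip, domain) edges and an initial rank of 1.0 for every node. The function name `iterate_ranks` and the sample graph are illustrative assumptions, not taken from the talk.

```python
from collections import defaultdict

def iterate_ranks(edges, ip_rank, dn_rank):
    """Apply both recursive definitions once, using only the ranks from time t."""
    domains_of = defaultdict(set)  # ip -> domains it connects to; L(ip) is its size
    ips_of = defaultdict(set)      # dn -> ips visiting it; L(dn) is its size
    for ip, dn in edges:
        domains_of[ip].add(dn)
        ips_of[dn].add(ip)
    new_dn = {dn: sum(ip_rank[ip] / len(domains_of[ip]) for ip in visitors)
              for dn, visitors in ips_of.items()}
    new_ip = {ip: sum(dn_rank[dn] / len(ips_of[dn]) for dn in visited)
              for ip, visited in domains_of.items()}
    return new_ip, new_dn

# Hypothetical graph: ip1 visits d1, d2, d3 and ip2 visits d1, d3.
edges = {("ip1", "d1"), ("ip1", "d2"), ("ip1", "d3"), ("ip2", "d1"), ("ip2", "d3")}
ip_rank = {"ip1": 1.0, "ip2": 1.0}
dn_rank = {"d1": 1.0, "d2": 1.0, "d3": 1.0}

new_ip, new_dn = iterate_ranks(edges, ip_rank, dn_rank)
```

Because each ip splits its rank evenly over the domains it visits, the domain ranks after the step add up to the total ip rank before the step.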
15. Link analysis – build sparse linkage matrix (row-wise)
input: query log (each entry: client ip to hostnames)
output:
dn -> ip ip ip ip
ip -> dn dn dn dn
//STRIPE DESIGN
//map job: parsing query entry, filter bad hostname, convert hostname to domain
emit [key(domain), value(ip)]
emit [key(ip), value(domain)]
//reduce job:
emit [key(domain), value(ip ip ip)]
emit [key(ip), value(domain domain domain)]
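The map and reduce jobs above can be imitated in plain Python to show the shape of the output. This is a sketch under several assumptions: the "dn:" and "ip:" key prefixes, the `is_bad_hostname` check, and the last-two-labels rule in `hostname_to_domain` are all hypothetical stand-ins (a real hostname-to-domain conversion would need a public-suffix list).

```python
from collections import defaultdict

def is_bad_hostname(hostname):
    # Assumption: "bad" means malformed, e.g. empty, containing spaces, or without a dot.
    return not hostname or " " in hostname or "." not in hostname

def hostname_to_domain(hostname):
    # Naive assumption: the domain is the last two labels of the hostname.
    return ".".join(hostname.lower().rstrip(".").split(".")[-2:])

def map_entry(entry):
    """Map job: emit one pair keyed by domain and one keyed by ip."""
    ip, hostname = entry
    if is_bad_hostname(hostname):
        return
    domain = hostname_to_domain(hostname)
    yield ("dn:" + domain, ip)
    yield ("ip:" + ip, domain)

def reduce_links(pairs):
    """Reduce job: collect all values per key into one de-duplicated stripe."""
    stripes = defaultdict(set)
    for key, value in pairs:
        stripes[key].add(value)
    return {key: sorted(values) for key, values in stripes.items()}

# Hypothetical query log entries: (client ip, hostname).
log = [
    ("10.0.0.1", "www.example.com"),
    ("10.0.0.2", "mail.example.com"),
    ("10.0.0.1", "example.org"),
    ("10.0.0.2", "localhost"),
]
stripes = reduce_links(kv for entry in log for kv in map_entry(entry))
```

Each key ends up with one row of the sparse linkage matrix: `stripes["dn:example.com"]` lists the ips that visited example.com, and `stripes["ip:10.0.0.1"]` lists the domains that client visited.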
16. Iterating – MapReduce
iteration #n
map
– input
Key (domain), value (pagerank ip ip ip)
Or Key (ip), value (pagerank dn dn dn dn)
– output
key (ip/domain), value (x = pagerank/linklist.size())
reduce
– input
Key (domain/ip), values (x) //x as defined above
key (domain/ip), value (x ip ip ... ip)
– output
Key (domain/ip), value (Σx ip ip ip)
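One iteration of this map/reduce scheme can be sketched in memory as follows. The function names, the "rank" and "links" tags, and the sample nodes are assumptions for illustration; the dictionary `grouped` stands in for the shuffle phase that a MapReduce framework would perform, and ip and domain keys are assumed never to collide.

```python
from collections import defaultdict

def map_node(key, value):
    """Emit x = rank / linklist size to every neighbour, then pass the link list through."""
    rank, links = value
    for neighbour in links:
        yield neighbour, ("rank", rank / len(links))
    yield key, ("links", links)

def reduce_node(key, values):
    """Sum the incoming x values and re-attach the node's own link list."""
    total, links = 0.0, []
    for tag, payload in values:
        if tag == "rank":
            total += payload
        else:
            links = payload
    return key, (total, links)

def run_iteration(nodes):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for key, value in nodes.items():
        for out_key, out_value in map_node(key, value):
            grouped[out_key].append(out_value)
    return dict(reduce_node(key, values) for key, values in grouped.items())

# Hypothetical input: node -> (rank, link list).
nodes = {
    "ip1": (1.0, ["d1", "d2"]),
    "ip2": (1.0, ["d1"]),
    "d1": (1.0, ["ip1", "ip2"]),
    "d2": (1.0, ["ip1"]),
}
updated = run_iteration(nodes)
```

The output has the same shape as the input, so `run_iteration` can be applied repeatedly, once per iteration #n.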
18. Hadoop Job 2 – linkage creation, domain (or ip) mappings
Mapper
Input: Querylog
Output:
key | value
Domain | IP
IP | Domain
Reducer
Output:
key | Value (rank, previous rank, links)
IP | 1.0, 1.0, d d d d
Domain | 1.0, 1.0, ip ip ip ip
19. Hadoop Job 2 – Security Ranking (SR)
Mapper
Input:
key | value
IP1 | 2.3, 1.0, d1, d2, d3
IP2 | -9.5, 1.0, d1, d3
d1 | 24, 1.0, IP1, IP2
Output:
key | value
d1 | “rank” 2.3/(num_of_links=3)
d1 | “rank” -9.5/(num_of_links=2)
d2 | “rank” 2.3/(num_of_links=3)
d3 | “rank” 2.3/(num_of_links=3)
d3 | “rank” -9.5/(num_of_links=2)
IP1 | “links” 2.3, 1.0, d1, d2, d3
IP2 | “links” -9.5, 1.0, d1, d3
Updating security rank
SR = Σ SRi/K, for each outlink, K being the number of outlinks of entity i
Reducer
Output:
key | value
d1 | 2.3/3 + -9.5/2, 24, IP1, IP2
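Working through the arithmetic for this example (values rounded to three decimals), the reducer receives two “rank” contributions for d1, one from IP1 and one from IP2:

$$SR(d_1) = \frac{2.3}{3} + \frac{-9.5}{2} \approx 0.767 - 4.750 = -3.983$$

By the same rule, d2 receives only IP1's share and d3 receives the same two shares as d1:

$$SR(d_2) = \frac{2.3}{3} \approx 0.767, \qquad SR(d_3) = \frac{2.3}{3} + \frac{-9.5}{2} \approx -3.983$$

The strongly negative rank of IP2 pulls d1 and d3 below zero, while d2, visited only by IP1, stays positive.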
20. Risks/Issues
• Behavior changes. A machine can be infected at any minute. Is a day or an hour a good window to measure the “cleanness” of a client?
• Noise
• Each individual source is one client IP, or a user, or a machine (e.g., school WIFI, where no consistent client visiting behavior can be obtained). Are these IPs introducing noise, or are they the ones bringing in the most likely malicious connections?
• Massive detection: is it massive FP?
21. Take-away
• Graph-based discovery
• Take a different view of your data
• Machine Learning at a different scale