Security data deluge

Security Data Deluge - Zions Bank's Hadoop Based Security Data Warehouse

Claiming the Intersection @ Information Security and Fraud

Brian Christian, CTO and Co-Founder, Zettaset

Michael Fowkes, SVP and Director of Fraud Prevention and Security Analytics, Zions Bancorp

Security Data Warehouse

•  A security data warehouse is a massive database intended to aggregate event data across your entire enterprise for long-term, large-scale security- and fraud-related analytics
•  The utility of this system is realized once the data is normalized into a common format and mined by experts with intimate understanding of the data itself
•  It’s also affordable to the common company

Why SDW Today

•  “More data is generated in 3 days than in the history of the world to 2003” (Eric Schmidt)
•  Fraudsters continue to innovate and leverage explosive growth of portable computing
•  Fraudsters continue to study “us”
•  Through massive data sets and comprehensive analytic modeling, you/we can begin to study them

SDW isn’t a product

•  Security is never a product; it’s a process
•  There are past “processes” that help build the system
  –  Key example is SIEM: SIEM creates a “Big Data” problem for InfoSec. Instead of dumping that data after 60 days, store ALL the data in the SDW, even the events you’re currently not logging
•  When fraud teams work with security, the common platform will accelerate the program

SDW Data Collection

•  The SDW is intended to collect EVERYTHING
  –  Everything in terms of event data, not just security
•  SDW business analysts live by the expression “the more data I receive, the better I feel”
•  All data is created equal, but data mined in certain combinations is more interesting than others
  –  Trust but verify: this goes for automated controls as well as human behavior

SDW System Availability

•  The system should be easy to use
  –  Average-skilled labor can maintain the platform/cluster
•  The system must be fault tolerant
  –  At 700 TB to 2 PB of data, the system should keep processing when hard drives fail
•  The SDW should grow as needed without performance degradation
  –  Affordable to meet tomorrow’s demand

SDW is used for Mining

•  SDW is where Information Security and Fraud teams meet to solve problems
  –  Most InfoSec and Fraud teams don’t communicate
  –  Silos of data are collapsed into a single view
•  The SDW is a laboratory, not a SIEM
  –  Are there indicators that other users/accounts are susceptible to fraud/attack?
  –  Run the model through the entire database to account for similar attributes
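
The "run the model through the entire database" idea can be sketched in a few lines. This is only an illustration, not the method used at Zions: the event layout and the field names `account`, `ip`, and `device_id` are hypothetical, and a real sweep would run as a Hive job over the warehouse rather than over an in-memory list.

```python
from collections import defaultdict


def accounts_sharing_attributes(events, known_fraud_accounts, keys=("ip", "device_id")):
    """Return {account: {(key, value), ...}} for accounts that share an
    attribute value with any account already confirmed as fraudulent.

    `events` is a list of dicts; the field names are hypothetical.
    """
    # Collect every attribute value seen on a known-fraud account.
    fraud_attrs = {
        (k, e[k])
        for e in events
        if e["account"] in known_fraud_accounts
        for k in keys
        if e.get(k)
    }

    # Sweep all remaining events and record which accounts overlap.
    hits = defaultdict(set)
    for e in events:
        if e["account"] in known_fraud_accounts:
            continue
        for k in keys:
            if (k, e.get(k)) in fraud_attrs:
                hits[e["account"]].add((k, e[k]))
    return dict(hits)


events = [
    {"account": "A1", "ip": "10.0.0.5", "device_id": "d-1"},
    {"account": "B2", "ip": "10.0.0.5", "device_id": "d-9"},
    {"account": "C3", "ip": "10.0.0.7", "device_id": "d-3"},
]
result = accounts_sharing_attributes(events, {"A1"})
print(result)
```

Here account B2 is flagged because it shares an IP address with the confirmed-fraud account A1, while C3 has nothing in common and is left out.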

What is a Security Data Warehouse

•  A Security Data Warehouse is a massive mineable database
•  The system is horizontally scalable to petabytes of data
•  The data available for analysis is historical and goes back many years
•  It’s affordable to the common person!
  –  Risk Management are the common people in IT

Why did we build a SDW?

[Figure: SIEM surrounded by repeated DATA labels]

SIEM Issues

•  Rigid data models
•  Did not deal well with unstructured data
•  RDBMS performance with large data sets
•  Limited ways to interact with data

Why Hadoop & Hive

•  Scalability / performance
•  Manage resources
  –  Fair Scheduler
•  Fault tolerance
•  SQL-like language (HiveQL)
  –  Most of the staff had SQL skills
•  Easy application / tool integration
  –  ODBC / JDBC driver
•  Fast data ingestion
•  Can handle unstructured data
•  Flexibility & extensibility
  –  UDFs
  –  Streaming jobs

ETL Philosophy

•  “Pre-mine” intelligence during the ETL process
  –  Add value at time of capture (enrichment)
  –  Quickly analyze important data
  –  Automate time-sensitive activities
•  Load all data: no filtering of data that will be loaded into the warehouse
  –  You don’t know what you will want tomorrow
  –  Leverage file compression, RCFiles, and table partitioning to address storage / performance issues
•  Store 2 years’ worth of historical data
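
Enrichment at capture time and day-based partitioning can be illustrated with a small sketch. The field names (`epoch`, `src_ip`), the blacklist set, and the table name are assumptions made for this example and are not taken from the deck; the `day=...` directory naming follows Hive's usual partition layout.

```python
from datetime import datetime, timezone


def enrich(event, blacklist):
    """Return a copy of the event with values pre-computed during ETL.

    `event` is a dict with hypothetical fields `epoch` (seconds) and `src_ip`.
    """
    ts = datetime.fromtimestamp(event["epoch"], tz=timezone.utc)
    enriched = dict(event)
    # Partition key: lets later queries read only the days they need.
    enriched["day"] = ts.strftime("%Y-%m-%d")
    # "Pre-mined" intelligence: flag known-bad sources once, at load time.
    enriched["src_blacklisted"] = event["src_ip"] in blacklist
    return enriched


def partition_path(table, event):
    """Build a Hive-style partition directory for an enriched event."""
    return f"{table}/day={event['day']}"


blacklist = {"203.0.113.9"}
raw = {"epoch": 1337994000, "src_ip": "203.0.113.9", "dst_ip": "1.1.1.1"}
enriched = enrich(raw, blacklist)
print(partition_path("firewall_logs", enriched), enriched["src_blacklisted"])
```

Nothing is filtered out here: every event is kept and only gains extra columns, which matches the "load all data" rule above.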

The Team

•  Data Scientist
•  Data Analyst
•  LOB User
•  Data Engineer
•  Data Platform Administrator

What was the outcome?

Hadoop JobTracker stats

•  1709 daily ETL and model jobs
•  350 daily employee jobs

Data Examples

•  Web server logs
•  OS logs
•  DB logs
•  Proxy server logs
•  SPAM filter logs
•  A/V events
•  DLP events
•  VPN logs
•  DNS logs
•  Firewall logs
•  E-mail logs
•  Router / switch logs
•  IP blacklists
•  Vulnerability scan results
•  Customer database(s)
•  Fraud model alerts
•  Mainframe activity logs
•  HTTP (customer Internet activity) logs
•  ATM/POS transactions
•  Credit card transactions
•  G/L logs
•  ACH, Wire, and Deposit transactions
•  On-line banking application logs
•  Deposits / savings / time account daily balances

Over 120 data sets

How users interact with the data

•  Data Scientist
  –  KNIME
  –  R
  –  Tableau
•  Data Analyst
  –  SQuirreL SQL
  –  Hive command line
•  LOB User
  –  Datameer
  –  Custom web app for common queries
    •  Parameterized queries
    •  Output to HTML table or tab-delimited file

Sample firewall log query

SELECT
    collect_set(src_ip) as src_ips,
    dst_ip,
    protocol,
    action,
    rule_uid,
    collect_set(rule) as rules,
    count(*) as log_entry_count
FROM firewall_logs
WHERE day = '2012-05-26'
  AND dst_ip = '1.1.1.1'
GROUP BY dst_ip, protocol, action, service, rule_uid
ORDER BY dst_ip, protocol, rule_uid
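
For readers less familiar with Hive's `collect_set`, the following sketch reproduces the shape of that aggregation in plain Python. It is a simplified illustration under stated assumptions: the rows are invented, the grouping key omits `service`, and the function name `summarize` is hypothetical.

```python
from collections import defaultdict


def summarize(rows, day, dst_ip):
    """Group firewall log rows for one day and destination.

    For each group, collect the distinct source IPs and rule names (as
    collect_set does) and count the log entries (as count(*) does).
    """
    groups = defaultdict(lambda: {"src_ips": set(), "rules": set(), "log_entry_count": 0})
    for r in rows:
        if r["day"] != day or r["dst_ip"] != dst_ip:
            continue  # WHERE clause
        key = (r["dst_ip"], r["protocol"], r["action"], r["rule_uid"])  # GROUP BY
        g = groups[key]
        g["src_ips"].add(r["src_ip"])
        g["rules"].add(r["rule"])
        g["log_entry_count"] += 1
    # ORDER BY dst_ip, protocol, rule_uid
    return sorted(groups.items(), key=lambda kv: (kv[0][0], kv[0][1], kv[0][3]))


rows = [
    {"day": "2012-05-26", "src_ip": "10.0.0.1", "dst_ip": "1.1.1.1",
     "protocol": "tcp", "action": "accept", "rule_uid": "r1", "rule": "allow-web"},
    {"day": "2012-05-26", "src_ip": "10.0.0.2", "dst_ip": "1.1.1.1",
     "protocol": "tcp", "action": "accept", "rule_uid": "r1", "rule": "allow-web"},
    {"day": "2012-05-26", "src_ip": "10.0.0.1", "dst_ip": "2.2.2.2",
     "protocol": "udp", "action": "drop", "rule_uid": "r7", "rule": "deny-dns"},
]
summary = summarize(rows, "2012-05-26", "1.1.1.1")
for key, agg in summary:
    print(key, sorted(agg["src_ips"]), agg["log_entry_count"])
```

The two rows for 1.1.1.1 collapse into a single group with two distinct source IPs, and the row for 2.2.2.2 is excluded by the filter.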

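Dealing with unstructured data

Via Perl:

while (<INFILE>) {
    # Only process lines that contain a "Transaction Inquiry" record
    if ($_ =~ /\s+\w+\s+Transaction\sInquiry\s+/) {
        chomp $_;
        # Pipe-delimited prefix: timestamp, IP, port, then the free-text payload
        my ($ts, $ip, $port, $payload) = split(/\|/, $_);
        # Reset per line so values from an earlier line are never reused
        my ($product_code, $account, $bank, $agent) = ('', '', '', '');
        if ($payload =~ /\s+Account:\s+\d+\s+Appl:\s+(\w+)\s+/)
            { $product_code = $1; }
        if ($payload =~ /\s+Account:\s+(\d+)\s+Appl:\s+\w+\s+/)
            { $account = $1; }
        if ($payload =~ /\s+\w+\s+Transaction\s+Inquiry\s+(\w+)\s+/)
            { $bank = $1; }
        if ($payload =~ /\s+(\w+)\s+Transaction\s+Inquiry\s+/)
            { $agent = $1; }
        print OUTFILE "$ts|$ip|$port|$product_code|$account|$bank|$agent\n";
    }
}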
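
To make the extraction concrete, here is a hedged Python sketch of the same field extraction applied to one line. The sample line is invented for illustration: the real mainframe log layout is not shown in the deck, so the values `TELLER07`, `ZB01`, and `DDA` and the function name `parse_line` are assumptions. The patterns mirror the four captures in the Perl loop above.

```python
import re

# Same filter and captures as the Perl loop, written as raw strings.
FILTER = re.compile(r"\s+\w+\s+Transaction\sInquiry\s+")
FIELDS = {
    "product_code": re.compile(r"\s+Account:\s+\d+\s+Appl:\s+(\w+)\s+"),
    "account": re.compile(r"\s+Account:\s+(\d+)\s+Appl:\s+\w+\s+"),
    "bank": re.compile(r"\s+\w+\s+Transaction\s+Inquiry\s+(\w+)\s+"),
    "agent": re.compile(r"\s+(\w+)\s+Transaction\s+Inquiry\s+"),
}
COLUMNS = ("ts", "ip", "port", "product_code", "account", "bank", "agent")


def parse_line(line):
    """Return a dict of extracted fields, or None if the line is not an inquiry."""
    if not FILTER.search(line):
        return None
    # Split off the three fixed fields; the rest is the free-text payload.
    ts, ip, port, payload = line.rstrip("\n").split("|", 3)
    record = {"ts": ts, "ip": ip, "port": port}
    for name, pattern in FIELDS.items():
        match = pattern.search(payload)
        record[name] = match.group(1) if match else ""
    return record


# Hypothetical input line, constructed only to exercise the patterns.
sample = ("2012-05-26 10:15:00|10.1.2.3|4001| TELLER07 Transaction Inquiry "
          "ZB01 Account: 123456 Appl: DDA \n")
record = parse_line(sample)
print("|".join(record[c] for c in COLUMNS))
```

Unlike the Perl `split`, this version limits the split to four fields, so a pipe character inside the payload would not be cut off.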

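Dealing with unstructured data

Via Hive:

-- Backslashes are doubled because Hive string literals treat \ as an escape character
SELECT
    ts,
    ip,
    port,
    regexp_extract(payload, '\\s+Account:\\s+\\d+\\s+Appl:\\s+(\\w+)\\s+', 1) as product_code,
    regexp_extract(payload, '\\s+Account:\\s+(\\d+)\\s+Appl:\\s+\\w+\\s+', 1) as account,
    regexp_extract(payload, '\\s+\\w+\\s+Transaction\\s+Inquiry\\s+(\\w+)\\s+', 1) as bank,
    regexp_extract(payload, '\\s+(\\w+)\\s+Transaction\\s+Inquiry\\s+', 1) as agent
FROM mainframe_logs
WHERE day = '2012-05-26'
  AND payload rlike '\\s+\\w+\\s+Transaction\\sInquiry\\s+'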

Predictive analytics examples

•  Spear phishing detection
•  Phishing website detection
•  Fraud detection:
  –  Online banking anomaly detection
  –  ACH / Wire fraud
  –  Monitoring of high-risk employee activities

Visualization example

[Figure: Fraud Events Per Day, plotted by month (JAN to DEC) and day of week (Sun to Sat) for 2010, 2011, and 2012]

