How to Create 80% of a Big Data Pilot Project

© 2015 ligaDATA, Inc. All Rights Reserved.
October 2015
Download, Forums, Docs, Events http://Kamanja.org
Meet 80% of the Needs of a Pilot Project
With a CC Fraud Detection Example
By Greg Makowski
ACM Data Science Camp, Saturday 10/24/2015
http://www.sfbayacm.org/event/silicon-valley-data-science-camp-2015
http://kamanja.org/white-papers/

Summary
Question) How to help an OSS pilot evaluation go faster?
Answer)
•  Develop “design patterns” for applications
•  Pick a specific app (Credit Card Fraud Detection)
•  Get data (end up generating it)
•  Need to vary arch config (like performance testing)
•  Given requirements, generate a multi-node example pilot system, involving many OSS components
•  PMML can abstract the production step from model building

Problem
•  When evaluating any new data mining or big data software, companies want to “try it out” and see how it meets their requirements. A common step is a pilot project.
•  A pilot would commonly involve integration with related software systems.
•  Open Source Software (OSS) may come with examples.
•  Need an example “production system”
Q) What can be done to shorten the time to finish a Pilot?

Problem: Questions to be answered from Pilot
•  How fast is it?
   •  It depends (yes, that is an annoying answer)
•  How to configure the system with other OSS software?
   •  It depends (yes, that is an annoying answer)

Problem: Questions to be answered from Pilot
•  How fast is it?
   •  It depends (yes, that is an annoying answer)
   •  Show example configs with performance results
•  How to configure the system with other OSS software?
   •  It depends (yes, that is an annoying answer)
   •  Consider different application “design patterns”
•  How will the system grow as complexity grows?
   •  The answer is specific per design pattern
•  How should DevOps monitor and manage?

Kamanja Platform
[Diagram: Kamanja platform overview. Labels: Input Queues; Output Queues; Storage; Data Store; Decision Engine (kamanja); Admin Management; Decisioning Actions; CDC, Logs, Apps; Databases; ESBs; Social; 3rd Party Data Sources; Next Best Action; Batch Stores; Application Updates; Alerts & Notifications]

See Kamanja.org, and github
•  Kamanja is used as an example.
•  The process in this talk is general and can be broadly applied to other OSS.
•  Kamanja is a big data continuous decisioning system
•  Apache license, available on github

Application Design Pattern: Departmental Model Scoring Application
Scaling challenges
•  transaction growth and type (quantity & speed)
•  model complexity (hybrid systems)
•  quantity of models: 10’s to 10k’s
•  for most models, most fields, need to access the data store for preprocessing
[Diagram labels: Financial, Log, Consumer, Business; Input queue; Model Scoring; PMML; Real time Output Queue; Cache + Data Store; Preprocessing & Scores; Management and Control System; Reporting, Analysis; Lambda Architecture Combines Real time And Batch]
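
The departmental scoring pattern above (input queue, preprocessing lookup against a cache + data store, model scoring, output queue) can be sketched in a single process. This is only an illustration: Python's standard `queue` module stands in for Kafka, and the `customer_store` dictionary, the field names and the toy `score` rule are all hypothetical, not part of Kamanja.

```python
import queue

# Hypothetical stand-in for the cache + data store: per-customer history.
customer_store = {"c1": {"avg_amount": 40.0}, "c2": {"avg_amount": 900.0}}


def preprocess(txn, store):
    """Enrich a raw transaction with a lookup from the data store."""
    history = store.get(txn["customer"], {"avg_amount": 0.0})
    avg = history["avg_amount"]
    ratio = txn["amount"] / avg if avg else float("inf")
    return {**txn, "amount_ratio": ratio}


def score(features):
    """Toy model: flag spend far above the customer's average."""
    return 1.0 if features["amount_ratio"] > 5.0 else 0.0


def run_pipeline(input_queue, output_queue, store):
    """Drain the input queue, score each message, publish the result."""
    while not input_queue.empty():
        txn = input_queue.get()
        features = preprocess(txn, store)
        output_queue.put({**features, "score": score(features)})


input_q, output_q = queue.Queue(), queue.Queue()
for txn in [{"customer": "c1", "amount": 500.0},
            {"customer": "c2", "amount": 950.0}]:
    input_q.put(txn)

run_pipeline(input_q, output_q, customer_store)
results = [output_q.get() for _ in range(output_q.qsize())]
```

The point of the sketch is the shape of the data flow: every message needs a store lookup before scoring, which is why the slide lists data store access as a scaling challenge.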

Social Network Analysis
Scaling challenges
•  transaction growth and source (quantity & speed)
•  model: sentiment, graph
•  quantity of models: a few
•  data store lookup for base user info
[Diagram labels: Twitter, Facebook, …; Input queue; Model Scoring; Java, Scala; Real Time Charting, Alerting; Cache + Data Store; User baseline; Network; Management and Control System; Trend Analysis; Deep Dive]

Text Mining, Search
Scaling challenges
•  transaction growth
•  some projects: very heavy computing for NLP parsing
•  quickly score on tagged results
[Diagram labels: Pages, Documents, Posts, Tweets; Input queue; Model Scoring; Java, Stanford NLP; Output Queue; Cache + Data Store; Parse trees; Inverted indexes; Management and Control System; Trending topics; Update Thesaurus; Docs ↔ Topics]

Details on Departmental Scoring: Credit Card Fraud Detection System
•  How to develop an example system?
•  There is no public data. Private won’t be shared
•  Generate the data (then can also test BIG DATA)
•  Focus on 5 use cases of “normal” and 5 “fraud”
•  Configuring architecture can be used for
   1) Performance testing for different requirements
   2) Pilot system, example included w/ Kamanja
•  Train models, generate PMML for scoring
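
Since the data has to be generated, one way to start is a small table of behaviour profiles, one per use case, sampled with a seeded random generator. The sketch below is a guess at how that could look; the two profile names, the amount ranges, channels and time gaps are hypothetical values chosen for illustration, not the parameters used for the talk.

```python
import random

# Hypothetical profile parameters: one "normal" and one "fraud" use case.
PROFILES = {
    "steady_state": {
        "label": "normal",
        "amount_range": (5, 120),
        "channels": ["store"],
        "gap_minutes": (300, 3000),
    },
    "drain_account": {
        "label": "fraud",
        "amount_range": (200, 2000),
        "channels": ["web", "mobile"],
        "gap_minutes": (1, 15),
    },
}


def generate_transactions(profile, n, seed=0):
    """Generate n synthetic transactions for a named behaviour profile."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    params = PROFILES[profile]
    rows = []
    for _ in range(n):
        rows.append({
            "profile": profile,
            "label": params["label"],
            "amount": round(rng.uniform(*params["amount_range"]), 2),
            "channel": rng.choice(params["channels"]),
            "minutes_since_prev": rng.randint(*params["gap_minutes"]),
        })
    return rows


data = (generate_transactions("steady_state", 1000)
        + generate_transactions("drain_account", 50, seed=1))
```

Because the generator is parameterised, the same code can produce a few thousand rows for model training or a much larger volume for performance testing.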

FRAUD Use Cases
Fraudster extracting value out of hacked card
•  Likely a first “test” of CC info.
   •  iTunes or unmanned gas pump w/o camera
•  Drain account up to CC limit in 15 min, up to 2-3 days
•  Purchase things “easy to cash out or resell” to launder money
   •  gift cards, gems, jewelry, small electronics easy to sell, burner phones
F1) Elder abuse: either PII or CC info gets copied
•  Fraudster opens first web or mobile account (surprising for grandmother)
•  Higher credit limit, long time with no web/mobile
•  Long time CC holder (high tenure), little spend variation
F2) Hacker bought PII (Personally Identifiable Information)
•  Fraudster used PII to apply for a new account
•  new account likely has a lower credit limit
•  Over 1st month, slowly changes PII to fraudster’s to not alert victim
•  use in “card not present” situations
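
The “drain account up to CC limit in 15 min” behaviour maps naturally onto a sliding-window spend rule. The following is a minimal sketch under stated assumptions: transactions arrive as `(minute, amount)` pairs sorted by time, and the 80% `fraction` threshold is a hypothetical tuning value, not one taken from the talk.

```python
from collections import deque


def drain_alert(transactions, credit_limit, window_minutes=15, fraction=0.8):
    """Return True if spend inside any sliding window reaches
    fraction * credit_limit.

    transactions: iterable of (minute, amount) tuples sorted by minute.
    """
    window = deque()
    total = 0.0
    for minute, amount in transactions:
        window.append((minute, amount))
        total += amount
        # Drop transactions that have fallen out of the time window.
        while window and minute - window[0][0] > window_minutes:
            total -= window.popleft()[1]
        if total >= fraction * credit_limit:
            return True
    return False


fraud_like = [(0, 900.0), (4, 1200.0), (9, 1500.0), (13, 800.0)]
normal_like = [(0, 60.0), (600, 45.0), (1500, 900.0)]
fraud_flag = drain_alert(fraud_like, credit_limit=5000.0)
normal_flag = drain_alert(normal_like, credit_limit=5000.0)
```

A rule like this would be one input feature among several; the slower variant on the slide (draining over 2-3 days) would need a longer window.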

FRAUD Use Cases
F3) Physical clone
•  Fraudster may have bought CC info online ($1/account) or copied mag strip from the victim in the store.
•  Fraudster card use can be concurrent with normal consumer use, or very different place and time zone
F4) Rare Behavior (may be part of other use cases)
•  Unusual time of day, geography, spending by type of goods / services
F5) Risky Behavior: fraudster may visit blacklisted web page
•  Fraudster is engaging with
•  Geography changes are not plausible (noon in San Jose, 1pm in Hong Kong)
•  Relate to past labeled cases of CC fraud.
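
The “geography changes are not plausible” check can be expressed as an implied travel speed between consecutive card uses. This sketch assumes each transaction carries latitude, longitude and a timestamp in hours on a common clock; it reads the slide's San Jose / Hong Kong example as two uses one hour apart. The 1000 km/h ceiling and the city coordinates are illustrative assumptions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def implausible_travel(prev, curr, max_kmh=1000.0):
    """Flag a pair of transactions whose implied speed exceeds max_kmh.

    prev, curr: dicts with 'lat', 'lon' and 'hours' (elapsed hours on a
    common clock, e.g. UTC).
    """
    dist = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    elapsed = curr["hours"] - prev["hours"]
    if elapsed <= 0:
        return dist > 0  # two places at the same instant
    return dist / elapsed > max_kmh


san_jose = {"lat": 37.3382, "lon": -121.8863, "hours": 12.0}
hong_kong = {"lat": 22.3193, "lon": 114.1694, "hours": 13.0}
san_francisco = {"lat": 37.7749, "lon": -122.4194, "hours": 13.0}

far_flag = implausible_travel(san_jose, hong_kong)
near_flag = implausible_travel(san_jose, san_francisco)
```

Card-not-present purchases would need to be excluded or handled separately, since a web purchase does not place the cardholder at the merchant's location.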

NORMAL Use Cases
1)  Steady State use: the CC use by these people is fairly consistent and stable.
   •  Can have a die vei
2)  New Card, 1st month: this example is set up to make it difficult to compare with fraudulently opened new cards.
   •  Spending may max out
3)  Young and starting singles or newly married.
   •  These people don’t have much of a credit rating
   •  More likely to use web and mobile channels.
   •  More likely to wander to dangerous areas of the web.
   •  Likely to spend in a bigger array of categories
   •  Possibly many geographic locations
4)  Normal Case, Family
   •  Medium to higher income limit, many don’t hit limit
   •  Low to moderate showing up in new geographies, or spending on new categories
5)  Work Travel: work in sales or consulting.
   •  New locations are no surprise.
   •  Higher spending limit and amounts, many flight, hotel, car rental, high mobile

Pilot Project & Performance Testing: Credit Card Fraud Detection
[Diagram: two example configurations, each with Input queue, Model Scoring and Real time Output stages. Small: 1 Kafka, 1 Kamanja, 1 Kafka. Large: ~3 Kafka, 16 Kamanja, ~3 Kafka, with Cache + Data Store. Add Preprocessing Logic and HBase table lookup.]

Performance Testing: Model Node
Credit Card Fraud Detection
•  Fields per record: testing network speed between nodes
   •  30, 120, 480 fields (yes, could go 10k, 100k)
•  Single model complexity: testing compute load
   •  Small, Medium & Large (100, 2k, 32.5k elements)
•  Preprocessing lookup tables: testing cache to HBase & network
   •  none, some
•  Ensemble Models per score: testing compute & network
   •  1, 5, 20
•  Number of Models in department:
   •  1, 10, 100
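
The five test dimensions above define a full grid of candidate performance runs. A quick way to enumerate it is a Cartesian product; the dictionary keys below are hypothetical names for the slide's dimensions, while the values are the ones listed.

```python
from itertools import product

# Values from the slide; key names are illustrative.
TEST_DIMENSIONS = {
    "fields_per_record": [30, 120, 480],
    "model_elements": [100, 2_000, 32_500],
    "preprocessing_lookup": ["none", "some"],
    "models_per_ensemble": [1, 5, 20],
    "models_in_department": [1, 10, 100],
}


def build_test_grid(dimensions):
    """Return one dict per combination of dimension values."""
    names = list(dimensions)
    return [dict(zip(names, combo))
            for combo in product(*dimensions.values())]


grid = build_test_grid(TEST_DIMENSIONS)
```

The full grid has 3 x 3 x 2 x 3 x 3 = 162 combinations, which is why a pilot would typically run only a representative subset and publish those results as reference points.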

Solution to Developer Questions (How Fast, How to Configure?)
•  How many fields per record?            30, 120, 480       (SML)
•  What model complexity?                 100, 2k, 32.5k     (SML)
•  Is data already preprocessed?          Yes, No            (YN)
•  Average models / ensemble?             1, 5, 20           (SML)
•  How many models in the department?     1, 10, 100         (SML)
•  What language?                         PMML, Java, Scala

(I want to create a table like…)

Requirements    Then need to configure         For speed (rec/s)
S,S,Y,M,S       1 Kaf, 1 Kam, 1 Kaf            1.1mm
M,L,Y,M,S       1 Kaf, 1 Kam, 1 Kaf            200K
L,L,N,L,L       3 Kaf, 16 Kam, 1 Kaf, 3HB      1.6mm

Generate Architecture and run an 80% relevant Pilot
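
The requirements codes in the table can be decoded mechanically, which is the first step toward generating an architecture from a questionnaire. This is a sketch only: the three reference rows are copied from the slide with their throughput strings kept verbatim, the key names are hypothetical, and reading “Kaf”, “Kam” and “HB” as Kafka, Kamanja and HBase nodes is an interpretation based on the earlier slides.

```python
LEVELS = {"S": 0, "M": 1, "L": 2}

# Question order matches the five letters in a requirements code.
QUESTIONS = [
    ("fields_per_record", [30, 120, 480]),
    ("model_elements", [100, 2000, 32500]),
    ("already_preprocessed", None),  # answered Y / N
    ("models_per_ensemble", [1, 5, 20]),
    ("models_in_department", [1, 10, 100]),
]

# Rows copied from the slide: code -> (node layout, records per second).
REFERENCE_CONFIGS = {
    "S,S,Y,M,S": ("1 Kaf, 1 Kam, 1 Kaf", "1.1mm"),
    "M,L,Y,M,S": ("1 Kaf, 1 Kam, 1 Kaf", "200K"),
    "L,L,N,L,L": ("3 Kaf, 16 Kam, 1 Kaf, 3HB", "1.6mm"),
}


def decode(code):
    """Expand a code such as 'L,L,N,L,L' into concrete requirement values."""
    decoded = {}
    for (name, values), letter in zip(QUESTIONS, code.split(",")):
        if values is None:
            decoded[name] = (letter == "Y")
        else:
            decoded[name] = values[LEVELS[letter]]
    return decoded


def lookup(code):
    """Return the reference layout and speed for a code, or None."""
    return REFERENCE_CONFIGS.get(code)


largest = decode("L,L,N,L,L")
layout, speed = lookup("L,L,N,L,L")
```

A code with no reference row returns `None`, which marks a configuration that still needs a measured run.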

Social Network Analysis: Example System Configuration
[Diagram: components are Text or Twitter API; Java 1 and GUI; Kafka; Kamanja; Java 2 for Features (Sentiment or Stanford NLP); Java 3 for analysis; Data Store (Matched_tags_per_text table, Alerts table); Tomcat web service. The steps are numbered 1 to 13 in the diagram. Step captions, in extraction order:]
•  Java calls API, and Kafka producer
•  Tweets returned in JSON
•  JSON tweets sent to Kafka
•  Kafka JSON to Kamanja
•  JSON with features saved in DB
•  JAVA: Every “time window”, queries the DB to aggregate (i.e. count (tags) by (tags) by..)
•  JSON returns the aggregate query results to JAVA
•  JSON query results to Kafka
•  JSON results of rule scoring, alert text
•  results to Java 3 for scoring, with thresholds
•  Save results to DB
•  JAVA 1: check for updates to the alerts table
•  13: Tomcat web service displays data and charts
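
The “every time window, aggregate counts by tag, then score with thresholds” step in this configuration can be sketched without a database. The example assumes events arrive as `(timestamp_seconds, tags)` pairs and uses fixed (tumbling) windows; the window size, the threshold and the sample tags are hypothetical.

```python
from collections import Counter, defaultdict


def count_tags_by_window(events, window_seconds=60):
    """Group events into fixed windows and count tag occurrences in each.

    events: iterable of (timestamp_seconds, list_of_tags).
    Returns {window_start_seconds: Counter of tags}.
    """
    windows = defaultdict(Counter)
    for timestamp, tags in events:
        start = (timestamp // window_seconds) * window_seconds
        windows[start].update(tags)
    return dict(windows)


def threshold_alerts(windows, threshold):
    """Return sorted (window_start, tag, count) for counts >= threshold."""
    return sorted(
        (start, tag, count)
        for start, counts in windows.items()
        for tag, count in counts.items()
        if count >= threshold
    )


events = [
    (5, ["fraud", "bank"]),
    (20, ["fraud"]),
    (59, ["fraud"]),
    (61, ["bank"]),
    (100, ["bank"]),
]
windows = count_tags_by_window(events)
alerts = threshold_alerts(windows, threshold=3)
```

In the configuration on the slide the aggregation is a database query and the alert rows are written back to an alerts table; the in-memory version only shows the logic.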

PMML Diagram
Predictive Model Markup Language
[Diagram: Training Project Phase: Training & test data (batch) go into a Data Mining Tool, the PMML Producer (18 available); File, Save As PMML writes a PMML File with the full model specification. Production Scoring Project Phase: the PMML File and Scoring data (real time streaming) go into the Scoring Engine (Kamanja), the PMML Consumer; Output data has new score field.]
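
To make the producer / consumer split concrete, here is a very small consumer-side sketch. It parses a hand-written PMML 4.2 `RegressionModel` with the standard library and applies it to one record. The model, its field names and coefficients are hypothetical, and the code handles only plain numeric predictors; a real consumer such as Kamanja supports far more of the specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy model, written by hand for illustration.
PMML_DOC = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="hypothetical toy fraud score"/>
  <DataDictionary numberOfFields="3">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="new_geography" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="new_geography"/>
      <MiningField name="score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.1">
      <NumericPredictor name="amount" coefficient="0.002"/>
      <NumericPredictor name="new_geography" coefficient="0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"p": "http://www.dmg.org/PMML-4_2"}


def load_regression(pmml_text):
    """Extract the intercept and coefficients from a simple RegressionModel."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        node.get("name"): float(node.get("coefficient"))
        for node in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(record, model):
    """Apply the linear model: intercept + sum(coefficient * value)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_regression(PMML_DOC)
result = score({"amount": 100.0, "new_geography": 1.0}, model)
```

The scoring code never sees the tool that trained the model, only the PMML file, which is the abstraction the diagram describes.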

Given industry fragmentation, PMML is a solution for Data Mining scoring
(* = Open Source)
PMML Producers (18 data mining packages)
•  R (Rattle, PMML)*
•  RapidMiner
•  KNIME*
•  Spark (MLlib)*
•  Weka*
•  SAS Enterprise Miner
PMML Consumers (12 co)
•  Zementis
•  IBM SPSS
•  KNIME
•  Microstrategy
•  SAS
•  Kamanja* (Open Source)
PREDICTIVE
•  Naïve Bayes
•  Neural Net
•  Regression
•  Rules
•  Scorecard
•  Sequence
•  SVM
•  Time Series
•  Trees
DESCRIPTIVE / OTHER
•  Association Rules
•  Cluster, K-Nearest Nb
•  Text Models
•  model ensembles & composition (e.g. Gradient Boosting)

Summary
Question) How to help an OSS pilot evaluation go faster?
Answer)
•  Develop “design patterns” for applications
•  Pick a specific app (Credit Card Fraud Detection)
•  Get data (end up generating it)
•  Need to vary arch config (like performance testing)
•  Given requirements, generate a multi-node example pilot system, involving many OSS components
•  PMML can abstract the production step from model building

Try out Kamanja
Download, Forums, Docs, Events http://Kamanja.org
http://kamanja.org/white-papers/

Kamanja: One Speed Analysis
Kamanja: 220k to 230k messages / second
CONFIGURATION:
•  16 core box, using Solid State Disc
•  Sample Tool to generate messages of size 1k (not being reduced)
•  Data Mining uses 100’s to 100k fields, not 100 byte message
•  Kafka Queue
•  3 input queues, each queue has 8 partitions
•  Kamanja Engine
•  Using the remaining 12-13 cores
•  Not saving score results per record in this test
SO WHAT? COMPARISON:
•  Storm is currently the lowest latency Apache big data system
•  Storm integration, got up to 90k to 100k for same data
•  Kamanja is 2.4 times faster than Storm = (225k/95k) in this test
•  Spark streaming is with mini-batches, with higher latency than Storm or Kamanja
Why is Kamanja faster than Storm?
Storm reads the data from the input queue (spout) and passes that to Bolts. On each pass from spout to bolt, the data is serialized and deserialized. There is other overhead.
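
The 2.4x figure follows from the midpoints of the two reported throughput ranges. A short check of the arithmetic, using only the numbers on the slide:

```python
def midpoint(low, high):
    """Midpoint of a reported throughput range, in messages per second."""
    return (low + high) / 2


kamanja_rate = midpoint(220_000, 230_000)  # 225k messages / second
storm_rate = midpoint(90_000, 100_000)     # 95k messages / second
speedup = kamanja_rate / storm_rate        # about 2.37, reported as 2.4
```

The ratio applies to this single test configuration; it is not a general statement about either system.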
