Deep Learning for Fraud Detection
Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC member for Apache Drill, ZooKeeper & others
VP of the Apache Incubator at the Apache Software Foundation
Email: tdunning@apache.org, tdunning@maprtech.com
Twitter: @ted_dunning
Goals for Today
• Explore the state of the art for deep-learning and fraud detection
• Separate at least some of the wheat from the chaff
• Provide some realistic guidance for getting results
• Play with cool stuff!
Agenda
• Motivation
• What are neural networks and deep learning?
• It can be simpler than you think
• But, no free lunch / you get what you pay for / other clever aphorism
• Some experiments
• Where to go from here
Motivation For Advanced Modeling in Fraud
• Neural networks have completely dominated credit card fraud detection since the late ’80s
– Random forests and tree ensembles are often used in other kinds of fraud and churn models
• The reason is that rule-based systems simply don’t work
– Well, they do work at first
– Fraudsters change tactics, you add rules, interaction mayhem ensues
• And learning algorithms really do work
– Fraudsters change tactics, you add features and retrain
So learning is good
But good learning is hard
And finding good features is really hard
Some Sample Features
• Charge size relative to previous averages for card
• Charge size relative to previous average for merchant
• Known merchant or not
• Doubled transaction
• Address Verification System (AVS) or Card Verification Value (CVV2) mismatch
• Unusual region for card
• Unusual time-of-day relative to history
• Magstripe use if chip available
• (hundreds more)
Sequence-Based Features
• Plausible pattern matching (rent a car, pay for gas at airport)
• Probe transactions (gas in wrong place, pizza, big charge)
• Previous transaction at compromised merchant
• Card velocity
Key Problems
• Good guys need data … that means that fraudsters get first chance at bat
• Good guys are careful and test systems before releasing
• Bad guys have many low-risk transactions and can change methods quickly
• In some areas, fraudsters adapt techniques in hours
Making up features is easy
Finding features that add real lift is very hard
What are neural networks and deep learning?
• Start simple … imagine we have 20 features, 0 or 1
– Let’s yell “Fraud” if any of the features is a 1
– Houston, we have a model
• But this model isn’t any better than a rule
• Also doesn’t have any interesting Greek letters
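That “yell Fraud” rule is easy to write down. Here is a minimal illustrative sketch (the function names are ours, not the talk’s); framing it as a thresholded sum also shows the shape that learning will exploit on the next slides:

```python
def naive_model(features):
    """Yell 'Fraud' if any of the binary features is a 1."""
    return 1 if any(features) else 0

# The same rule as a threshold on an unweighted sum: a linear model
# with every weight pinned at 1. Learning will adjust these weights.
def weighted_model(features, weights=None):
    weights = weights if weights is not None else [1.0] * len(features)
    score = sum(w * f for w, f in zip(weights, features))
    return 1 if score >= 1.0 else 0
```

Once the weights can differ per feature, the question becomes how to learn them from data.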
Real-world Intrudes
• We assumed all features are equally good
– What if some are kind of poor or weak?
• Can we weight different features more or less?
– Can we learn these weights from data?
Learning Works
• Yes. We can learn these models
• How we measure error is important
• We must have good features
• Even good features may need transformation
– Take logs of times and monetary values
– Subtract means, scale, bin values (sketched below)
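For concreteness, a minimal numpy sketch of those transformations; the values and bin edges are invented for the example:

```python
import numpy as np

# Toy transaction amounts and inter-arrival times (invented data).
amount = np.array([5.0, 12.5, 980.0, 7.25, 15000.0])
dt_sec = np.array([30.0, 3600.0, 5.0, 86400.0, 2.0])

# Take logs of times and monetary values: compresses heavy tails.
log_amount = np.log1p(amount)
log_dt = np.log1p(dt_sec)

# Subtract means and scale (standardize).
z_amount = (log_amount - log_amount.mean()) / log_amount.std()

# Bin values into coarse buckets (here, quartiles).
bins = np.quantile(log_amount, [0.25, 0.5, 0.75])
binned = np.digitize(log_amount, bins)   # 0..3 quartile bucket per charge
```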
Not Good Enough
• We need combinations of models
• Simple linear combinations aren’t subtle enough
• Enter multi-level models
– Can we learn a model that uses combinations of inputs
– Where each of those combinations is a model that we learn?
Yes, Virginia, There IS a Santa Claus
Each circle is a sum and a (soft) threshold; arrows are multiplication by a learned weight.
Errors on Output Can Propagate
Each circle sends error back along each arrow; arrows weight the back-propagating errors. (Figure: inputs feeding a hidden layer.)
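The whole two-slide story fits in a few lines of numpy. This is a generic sketch, not the talk’s code; the toy data, layer sizes, and learning rate are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 20 binary features; label is 1 if any "strong" feature fires.
X = rng.integers(0, 2, size=(1000, 20)).astype(float)
y = (X[:, :5].sum(axis=1) > 0).astype(float).reshape(-1, 1)

# Learned weights: the arrows in the diagram.
W1 = rng.normal(0, 0.1, size=(20, 8))   # input -> hidden
W2 = rng.normal(0, 0.1, size=(8, 1))    # hidden -> output

for step in range(2000):
    # Forward pass: each unit is a weighted sum plus a soft threshold.
    h = sigmoid(X @ W1)            # hidden-layer activations
    p = sigmoid(h @ W2)            # predicted probability of fraud

    # Backward pass: output error propagates back through the weights.
    d_out = (p - y) / len(X)                # gradient of mean cross-entropy
    d_hid = (d_out @ W2.T) * h * (1 - h)    # error weighted by the arrows

    W2 -= 1.0 * (h.T @ d_out)      # gradient-descent updates
    W1 -= 1.0 * (X.T @ d_hid)

print("final accuracy:", ((p > 0.5) == y).mean())
```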
Success!
Triumph!
World domination!
World domination!
With some reservations because features are hard
Turtles All the Way Down – We Wish
• This learning works well for just a few layers
• This is still a big deal …
– with cool features, we can build real systems
• With many layers, the learning no longer converges
• Well … until recently
Model Learning in an Ideal World
• If we could just learn the features
– Maybe unsupervised, maybe supervised
– And at the same time learn the model
• Presumably we could build models quicker
• And more easily
• And we wouldn’t have to dirty our minds with pedestrian domain knowledge
Example 1 – (not very) Deep Auto-encoder
• Let’s take an example where we can learn features
• Data is EKG traces
• We want to find anomalies
– No supervised training
Spot the Anomaly
Anomaly?
Maybe not!
Where’s Waldo?
This is the real anomaly
Normal Isn’t Just Normal
• What we want is a model of what is normal
• What doesn’t fit the model is the anomaly
• For simple signals, the model can be simple …
• The real world is rarely so accommodating
$x \sim m(t) + \mathcal{N}(0, \varepsilon)$
We Do Windows
Windows on the World
• The set of windowed signals is a nice model of our original signal
• Clustering can find the prototypes
– Fancier techniques available using sparse coding
• The result is a dictionary of shapes
• New signals can be encoded by shifting, scaling and adding shapes from the dictionary (a sketch follows)
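A minimal sketch of the windowing-plus-clustering idea. The window width, step, and k are arbitrary here, and the signal is a synthetic stand-in; the real demo lives in the repo cited on the k-means caveats slide below:

```python
import numpy as np
from sklearn.cluster import KMeans

def windows(signal, width=32, step=8):
    """Slice a 1-d signal into overlapping windows."""
    idx = np.arange(0, len(signal) - width, step)
    return np.stack([signal[i:i + width] for i in idx])

# Toy quasi-periodic "EKG-like" signal.
t = np.linspace(0, 100, 10_000)
signal = np.sin(2 * np.pi * t) + 0.3 * np.sin(6 * np.pi * t)

W = windows(signal)
W = W - W.mean(axis=1, keepdims=True)        # remove per-window offset

# The cluster centroids are the dictionary of common shapes.
dictionary = KMeans(n_clusters=20, n_init=5).fit(W)

# Encode windows by their nearest shape; reconstruction error is the
# distance to that shape.
labels = dictionary.predict(W)
recon = dictionary.cluster_centers_[labels]
err = np.linalg.norm(W - recon, axis=1)
print("mean reconstruction error:", err.mean())
```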
Most Common Shapes (for EKG)
Reconstructed Signal
(Figure: original signal, reconstructed signal, and reconstruction error; error is < 1 bit/sample.)
An Anomaly
The original technique for finding 1-d anomalies works against the reconstruction error
Close-up of anomaly
Not what you want your heart to do. And not what the model expects it to do.
A Different Kind of Anomaly
Some k-means Caveats
• But Eamonn Keogh says that k-means can’t work on time-series
• That is silly … and also kind of correct: k-means does have limits
– Other kinds of auto-encoders are much more powerful
• More fun and code demos at
– https://github.com/tdunning/k-means-auto-encoder
• Keogh’s paper: http://www.cs.ucr.edu/~eamonn/meaningless.pdf
The Limits of Clustering as Auto-encoder
• Clustering is like trying to tile your sample distribution
• Can be used to approximate a signal
• Filling a d-dimensional region with k clusters should give $\varepsilon \approx k^{-1/d}$
• If d is large, this is no good: with $d = 10$, halving the error takes $2^{10} \approx 1000\times$ more clusters
(Figure: time-series training data, first 2000 samples, plotted against time; test data shown with its reconstruction error.)
(Figure: reconstruction error for time-series data as the number of centroids grows from 0 to 2000; MAV error from 0.00 to 0.15 for training data and held-out data.)
Moral For Auto-encoders
• The simplest auto-encoders can be good models
• For more complex spaces/signals, more elaborate models may be required
– Winner take (absolutely) all may be problematic
– In particular, models that allow sparse linear combination may be better
• Consider deep learning, recurrent networks, denoising
How Does Clustering Do Reconstruction?
(Figure: input nodes x1 x2 … xn-1 xn.)
For normalized cluster centroids, dot-product and distance are equivalent
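The equivalence is one line of algebra: for a centroid $c$ with $\|c\| = 1$,

```latex
\|x - c\|^2 = \|x\|^2 - 2\, x \cdot c + \|c\|^2 = \|x\|^2 + 1 - 2\, x \cdot c
```

so minimizing distance over centroids is the same as maximizing the dot-product $x \cdot c$.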
(Same figure.) Winner takes all with k-means
(Figure: input x1 … xn, a hidden layer of clusters, and a reconstruction x'1 … x'n.)
Dot-product scales the centroid to reconstruct
AKA - Neural Network
(The same figure read as a network: input layer, hidden layer of clusters, reconstruction layer.)
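To make the “AKA” concrete, here is winner-take-all reconstruction written as that one-hidden-layer network. This is a sketch only; the centroid matrix C is random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)

# k = 20 unit-norm centroids in n = 32 dimensions (stand-in values).
C = rng.normal(size=(20, 32))
C /= np.linalg.norm(C, axis=1, keepdims=True)

x = rng.normal(size=32)          # an input window

# Hidden layer: dot-products with the centroids. With unit-norm
# centroids, the largest dot-product marks the nearest centroid.
a = C @ x

# Winner-take-all activation: a one-hot hidden layer holding the
# winning dot-product, zeros elsewhere.
h = np.where(np.arange(len(a)) == np.argmax(a), a, 0.0)

# Reconstruction: the dot-product scales the winning centroid.
x_hat = C.T @ h
print("reconstruction error:", np.linalg.norm(x - x_hat))
```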
What If … We Had More Layers?
(Figure: a deeper stack of layers, A → B → A'.)
Other Thoughts
• What if we allow more than one cluster to be active?
– k-sparse learning!
• Well, almost
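“Almost”, because k-sparse auto-encoders keep the k strongest activations rather than just the winner. A minimal sketch of that activation (k and the values are arbitrary):

```python
import numpy as np

def k_sparse(a, k=3):
    """Keep the k largest hidden activations, zero the rest."""
    out = np.zeros_like(a)
    top = np.argpartition(a, -k)[-k:]   # indices of the k largest entries
    out[top] = a[top]
    return out

a = np.array([0.1, 0.9, 0.3, 0.7, 0.05, 0.6])
print(k_sparse(a, k=2))   # only the two strongest clusters stay active
```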
The Point of Deep Learning
• It isn’t just many hidden layers in a neural network
• The goal is to eliminate feature engineering by learning features as well as the classifier
Experiment 3 – Card Velocity
• Most features so far are inherent in the data
• Few are true sequence features
• Card velocity is a pure combination
– Starting point can be anywhere
– The issue is where the next point is relative to starting point
Card Velocity
Non-fraud steps are reasonable in terms of distance and time.
Fraudulent use of a card by multiple attackers results in big, fast jumps.
Synthetic Data Example
• Generate random point
• Take four small steps
• If fraud, second step can be large
• Result is five positions, each in 3-d on surface of a sphere
– Data shape is N x (5 x 3)
• Add secondary features containing step size … N x 4 (generator sketched below)
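A sketch of a generator matching that description; the step sizes and fraud rate are our assumptions, not the talk’s exact values:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_unit(n):
    """n random points on the unit sphere."""
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def walk(n, fraud):
    """Five positions on the sphere; fraud makes the second step big."""
    pos = [random_unit(n)]
    for step in range(4):
        scale = np.where(fraud & (step == 1), 1.0, 0.05)  # big jump if fraud
        nxt = pos[-1] + scale[:, None] * rng.normal(size=(n, 3))
        pos.append(nxt / np.linalg.norm(nxt, axis=1, keepdims=True))
    return np.stack(pos, axis=1)                    # shape (n, 5, 3)

n = 100_000
fraud = rng.random(n) < 0.01
X = walk(n, fraud)                                  # raw features: N x (5 x 3)

# Secondary step-size features: N x 4 chord lengths between positions.
steps = np.linalg.norm(np.diff(X, axis=1), axis=2)
```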
The Truth is Out There
• With the right feature (step-size), it is trivial to spot the fraud
• Here we show the step size between positions
• Fraud cases take a big jump that others don’t
• But they can be anywhere
But Dimensionality Bites Hard
• With the step-size feature, learning succeeds instantly with the simplest models and gets nearly perfect accuracy
• Without the step-size feature, learning with TensorFlow gets modest accuracy after substantial learning cost (work in progress, could do better with lots more tuning)
• The problem is that there are too many combinations of 15 variables; we need a very specific combination of three pair-wise diffs combined non-linearly into a distance (written out below)
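Concretely, with positions $p_i = (x_i, y_i, z_i)$ on the sphere, the step-size feature the model must discover is the chord length

```latex
d_i = \sqrt{(x_{i+1}-x_i)^2 + (y_{i+1}-y_i)^2 + (z_{i+1}-z_i)^2},
\qquad i = 1, \dots, 4
```

Nothing in the 15 raw coordinates hints at this particular combination; the model has to find it on its own.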
(Figure: AUC and precision vs. data size, from 10^4 to 10^6 samples.)
We have a bona fide revolution
But old tricks still pay
Greenfield Problem Landscape
Mature Problem Landscape
Summary
• There is too much to say in 40 minutes; let’s talk some more at the MapR booth
• Deep learning, especially with systems like TensorFlow, has huge promise
• Deep learning trades feature engineering for learning-architecture engineering
• There are powerful middle grounds
Short Books by Ted Dunning & Ellen Friedman
• Published by O’Reilly in 2014–2016
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR
http://bit.ly/ebook-real-world-hadoop
http://bit.ly/mapr-tsdb-ebook
http://bit.ly/ebook-anomaly
http://bit.ly/recommendation-ebook
© 2014 MapR Technologies 75
Streaming Architecture
by Ted Dunning and Ellen Friedman © 2016 (published by O’Reilly)
Free copies at book signing today
http://bit.ly/mapr-ebook-streams
© 2014 MapR Technologies 76
Thank You!
© 2014 MapR Technologies 77
Q&A
Engage with us!
@mapr maprtech
tdunning@maprtech.com
MapR, maprtech, mapr-technologies
