Case Study

Value extraction from BBVA credit card transactions
104,000 employees
47 million customers
The idea

Extract value from anonymized credit card transaction data & share it.

Always:
- Impersonal
- Aggregated
- Dissociated
- Irreversible
Helping

Consumers: informed decisions
- Shop recommendations (by location and by category)
- Best time to buy
- Activity & fidelity of a shop's customers

Sellers: learning client patterns
- Activity & fidelity of a shop's customers
- Sex & age & location
- Buying patterns
Shop stats

For different periods:
- All, year, quarter, month, week, day

... and much more
The applications

- Internal use
- Sellers
- Customers
The challenges

- Company silos
- The costs
- The amount of data
- Security
- Development flexibility/agility
- Human failures
The platform

- Data storage: S3
- Data processing: Elastic MapReduce
- Data serving: EC2
The architecture

Hadoop

Distributed filesystem
- Files as big as you want
- Horizontal scalability
- Failover

Distributed computing
- MapReduce
- Batch oriented: input files are processed and converted into output files
- Horizontal scalability
Pangool: a Tuple MapReduce implementation for Hadoop

An easier Hadoop Java API
- While keeping similar efficiency

Common design patterns covered
- Compound records
- Secondary sorting
- Joins

Other improvements
- Instance-based configuration
- First-class multiple inputs/outputs
Tuple MapReduce

Our evolution of Google's MapReduce

Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo:
"Tuple MapReduce: Beyond classic MapReduce."
In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining.
Brussels, Belgium | December 10–13, 2012
Tuple MapReduce

Example: sales difference between the top-selling offices for each location
Tuple MapReduce

Main constraint
- The group-by clause must be a subset of the sort-by clause

Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
- Pangool -> Tuple MapReduce over Hadoop
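To see why the constraint matters, here is a minimal plain-Java sketch (hypothetical (shop, card) tuples, not Pangool's API): once records are sorted with the group fields leading, each group arrives as one contiguous run, which is exactly what a reducer consumes.

    import java.util.*;

    public class GroupBySubsetOfSortBy {
        record Tuple(String shop, String card) {}

        public static void main(String[] args) {
            List<Tuple> tuples = new ArrayList<>(List.of(
                    new Tuple("Shop 1", "5678"), new Tuple("Shop 2", "1234"),
                    new Tuple("Shop 1", "1234"), new Tuple("Shop 1", "1234")));

            // sort by (shop, card) ...
            tuples.sort(Comparator.comparing(Tuple::shop).thenComparing(Tuple::card));

            // ... then group by (shop): because the group field leads the sort
            // order, each shop's tuples are contiguous and can be consumed as
            // one reducer call. A group-by field absent from the sort-by clause
            // would scatter its groups across the sorted stream.
            String current = null;
            for (Tuple t : tuples) {
                if (!t.shop().equals(current)) {
                    current = t.shop();
                    System.out.println("new group: " + current);
                }
                System.out.println("  " + t);
            }
        }
    }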
Efficiency

Similar efficiency to Hadoop.

http://pangool.net/benchmark.html
Voldemort

Distributed key/value store
Voldemort & Hadoop

Benefits
- Scalability & failover
- Updating the database does not affect serving queries
- All data is replaced at each execution
  - Providing agility/flexibility: big development changes are not a pain
  - Easier recovery from human errors: fix the code and run again
  - Easy to set up new clusters with different topologies
Basic statistics

Easy to implement with Pangool/Hadoop
- One job, grouping by the dimension over which you want to calculate the statistics

Count, average, min, max, stdev
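To make the per-group work concrete, here is a minimal plain-Java sketch of all five statistics computed in a single pass over one group's values. It stands in for the reduce function of the job above; the amounts are made up.

    public class BasicStats {
        public static void main(String[] args) {
            double[] amounts = {12.0, 30.5, 7.25, 30.5}; // one group's transaction amounts (made up)
            long count = 0;
            double sum = 0, sumSq = 0;
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double a : amounts) {                   // single pass over the group
                count++;
                sum += a;
                sumSq += a * a;
                min = Math.min(min, a);
                max = Math.max(max, a);
            }
            double avg = sum / count;
            // population stdev from the running sums; max(0, ...) guards
            // against tiny negative values from floating-point rounding
            double stdev = Math.sqrt(Math.max(0, sumSq / count - avg * avg));
            System.out.printf("count=%d avg=%.2f min=%.2f max=%.2f stdev=%.2f%n",
                    count, avg, min, max, stdev);
        }
    }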
Computing several time periods in the same job
- Use the mapper to replicate each datum for each period
- Add a period identifier field to the tuple and include it in the group-by clause
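A sketch of that replication trick, assuming a simple date-string format. The Txn record and the period labels ("ALL", "Y2012", "M2012-07") are invented for illustration; a real mapper would emit Pangool tuples rather than string arrays.

    import java.util.*;

    public class PeriodReplication {
        // Hypothetical input record; the real job reads transactions from S3.
        record Txn(String shop, String date, double amount) {} // date as "YYYY-MM-DD"

        // Stand-in for the map function: emit one (shop, period, amount) tuple
        // per period the transaction falls into, so a single downstream
        // group-by on (shop, period) covers all periods at once.
        static List<String[]> map(Txn t) {
            String year = t.date().substring(0, 4);
            String month = t.date().substring(0, 7);
            List<String[]> out = new ArrayList<>();
            for (String period : new String[]{"ALL", "Y" + year, "M" + month}) {
                out.add(new String[]{t.shop(), period, Double.toString(t.amount())});
            }
            return out;
        }

        public static void main(String[] args) {
            for (String[] tuple : map(new Txn("Shop 1", "2012-07-15", 25.0))) {
                System.out.println(Arrays.toString(tuple));
            }
        }
    }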
Distinct count

Possible to compute in a single job
- Using secondary sorting by the field you want to distinct-count on
- Detecting changes in that field

Example: group by shop, sort by shop and card

    Shop     Card
    Shop 1   1234
    Shop 1   1234
    Shop 1   1234
    Shop 1   5678    <- change: +1
    Shop 1   5678    <- end of group: +1

    => 2 distinct buyers for shop 1
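The change-detection logic, sketched in plain Java over the already-sorted stream from the example (the actual implementation is a reducer in the Pangool job; the rows are the example's):

    import java.util.*;

    public class DistinctCountSketch {
        public static void main(String[] args) {
            // (shop, card) pairs, already grouped by shop and secondarily
            // sorted by card, as the shuffle would deliver them to a reducer.
            String[][] rows = {
                {"Shop 1", "1234"}, {"Shop 1", "1234"}, {"Shop 1", "1234"},
                {"Shop 1", "5678"}, {"Shop 1", "5678"},
            };
            Map<String, Integer> distinctBuyers = new LinkedHashMap<>();
            String prevShop = null, prevCard = null;
            for (String[] row : rows) {
                String shop = row[0], card = row[1];
                // A new shop or a change in the sorted card field means one
                // more distinct buyer; no per-shop set of cards is kept in memory.
                if (!shop.equals(prevShop) || !card.equals(prevCard)) {
                    distinctBuyers.merge(shop, 1, Integer::sum);
                }
                prevShop = shop;
                prevCard = card;
            }
            System.out.println(distinctBuyers); // {Shop 1=2}
        }
    }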
  
Histograms

Typically a two-pass algorithm
- First pass to detect the minimum and the maximum and determine the bin ranges
- Second pass to count the number of occurrences in each bin

Adaptive histogram
- One pass
- Fixed number of bins
- Bins adapt
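The slides do not spell out how the bins adapt; the sketch below shows one standard way to get those three properties, merging the two closest bin centroids whenever the budget is exceeded, in the spirit of streaming-histogram algorithms. The merge rule and the class are assumptions, not the deck's algorithm.

    import java.util.*;

    public class AdaptiveHistogram {
        private final int maxBins;
        // centroid -> count; TreeMap keeps centroids ordered so the closest
        // adjacent pair can be found with one scan
        private final TreeMap<Double, Long> bins = new TreeMap<>();

        AdaptiveHistogram(int maxBins) { this.maxBins = maxBins; }

        void add(double value) {                          // one pass: each value seen once
            bins.merge(value, 1L, Long::sum);
            if (bins.size() > maxBins) mergeClosestPair(); // fixed number of bins
        }

        private void mergeClosestPair() {
            Double prev = null, left = null;
            double bestGap = Double.POSITIVE_INFINITY;
            for (double c : bins.keySet()) {
                if (prev != null && c - prev < bestGap) { bestGap = c - prev; left = prev; }
                prev = c;
            }
            double right = bins.higherKey(left);
            long cl = bins.remove(left), cr = bins.remove(right);
            // replace the two bins by their weighted centroid: the bins "adapt"
            bins.merge((left * cl + right * cr) / (cl + cr), cl + cr, Long::sum);
        }

        public static void main(String[] args) {
            AdaptiveHistogram h = new AdaptiveHistogram(5);
            new Random(42).doubles(1000).forEach(h::add);
            h.bins.forEach((c, n) -> System.out.printf("centroid=%.3f count=%d%n", c, n));
        }
    }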
Optimal histogram

Calculate the histogram that best represents the original one, using a limited number of flexible-width bins
- Reduces storage needs
- More representative than fixed-width bins -> better visualization
Optimal histogram

Exact algorithm

Petri Kontkanen, Petri Myllymäki:
MDL Histogram Density Estimation
http://eprints.pascal-network.org/archive/00002983/

Too slow for production use
  
Optimal histogram

Alternative: approximate algorithm

Random-restart hill climbing
- A solution is just a way of grouping the existing bins
- From a solution, you can move to some close solutions
- Some moves are better: they reduce the representation error

Algorithm
1. Iterate N times, keeping the best solution:
   1. Generate a random solution
   2. Iterate until there is no improvement:
      1. Move to the next better possible movement
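A compact, self-contained sketch of the random-restart hill climbing just described. The error function (squared deviation of each original bin count from its group's mean) and the neighbourhood (shifting one group boundary by one position) are plausible choices for illustration, not details given in the deck.

    import java.util.*;

    public class HistogramHillClimbing {
        // Representation error of a grouping: sum of squared deviations of
        // each original bin count from its group's mean (assumed measure).
        static double error(double[] counts, int[] bounds) {
            double err = 0;
            for (int s = 0; s < bounds.length; s++) {
                int from = bounds[s];
                int to = (s + 1 < bounds.length) ? bounds[s + 1] : counts.length;
                double mean = 0;
                for (int i = from; i < to; i++) mean += counts[i];
                mean /= (to - from);
                for (int i = from; i < to; i++) err += (counts[i] - mean) * (counts[i] - mean);
            }
            return err;
        }

        // bounds[s] = index of the first original bin in flexible bin s; bounds[0] == 0.
        static int[] randomSolution(int nBins, int k, Random rnd) {
            TreeSet<Integer> cuts = new TreeSet<>();
            cuts.add(0);
            while (cuts.size() < k) cuts.add(1 + rnd.nextInt(nBins - 1)); // k - 1 distinct cuts
            int[] sol = new int[k];
            int i = 0;
            for (int c : cuts) sol[i++] = c;
            return sol;
        }

        public static int[] solve(double[] counts, int k, int restarts, Random rnd) {
            int[] best = null;
            double bestErr = Double.POSITIVE_INFINITY;
            for (int r = 0; r < restarts; r++) {                   // 1. iterate N times, keep the best
                int[] sol = randomSolution(counts.length, k, rnd); // 1.1 generate a random solution
                double err = error(counts, sol);
                boolean improved = true;
                while (improved) {                                 // 1.2 iterate until no improvement
                    improved = false;
                    for (int b = 1; b < k && !improved; b++) {
                        for (int d : new int[]{-1, 1}) {           // 1.2.1 move one boundary by one
                            int[] next = sol.clone();
                            next[b] += d;
                            if (next[b] <= next[b - 1]) continue;  // keep every bin non-empty
                            if (b + 1 < k && next[b] >= next[b + 1]) continue;
                            if (next[b] >= counts.length) continue;
                            double e = error(counts, next);
                            if (e < err) { sol = next; err = e; improved = true; break; }
                        }
                    }
                }
                if (err < bestErr) { bestErr = err; best = sol; }
            }
            return best;
        }

        public static void main(String[] args) {
            double[] counts = {1, 1, 2, 9, 10, 9, 2, 1, 1, 8}; // counts of 10 fixed-width bins
            System.out.println(Arrays.toString(solve(counts, 4, 20, new Random(7))));
        }
    }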
Optimal histogram

Alternative: approximate algorithm

Random-restart hill climbing
- One order of magnitude faster
- 99% accuracy
  
Everything in one job

- Basic statistics -> 1 job
- Distinct count statistics -> 1 job
- One-pass histograms -> 1 job
- Several periods & shops -> 1 job

We can put it all together, so that computing all statistics for all shops fits into exactly one job
Shop recommendations

Based on co-occurrences
- If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
- Only one co-occurrence is counted, even if a buyer bought several times in A and B
- The top co-occurrences for each shop are the recommendations

Improvements
- The most popular shops are filtered out, because almost everybody buys in them
- Recommendations by category, by location, and by both
- Different calculation periods
Shop recommendations

Implemented in Pangool
- Using its counting and joining capabilities
- Several jobs

Challenges
- If somebody bought in many shops, the list of co-occurrences can explode:
  - Co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought
- Alleviated by limiting the total number of distinct shops to consider:
  - Only use the top M shops where the client bought the most (see the sketch below)

Future
- Time-aware co-occurrences: the client bought in A and B within a short period of time
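To make the explosion and the top-M cap concrete, here is a small stand-alone sketch of the per-client co-occurrence generation. The real pipeline is several Pangool jobs; the shop names, purchase counts, and the value of M below are made up.

    import java.util.*;

    public class CoOccurrences {
        public static void main(String[] args) {
            int M = 3; // keep only the top M shops where this client bought the most
            Map<String, Integer> buysPerShop = Map.of("A", 5, "B", 2, "C", 9, "D", 1);

            // Top M shops by purchase count; without this cap, a client who
            // bought in N distinct shops contributes N * (N - 1) co-occurrences.
            List<String> top = buysPerShop.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(M)
                    .map(Map.Entry::getKey)
                    .toList();

            // One co-occurrence per ordered pair, regardless of how many
            // times the client bought in each shop.
            for (String a : top)
                for (String b : top)
                    if (!a.equals(b)) System.out.println(a + " -> " + b);
        }
    }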
Some numbers

Estimated resources needed for 1 year of data
- 270 GB of stats to serve
- 24 large instances ~ 11 hours of execution
- $3,500/month
  - Optimizations are still possible
  - Cost without the use of reserved instances
  - Probably cheaper with an in-house Hadoop cluster
Conclusion

It was possible to develop a Big Data solution for a bank
- With low use of resources
- Quickly
- Thanks to the use of technologies like Hadoop, Amazon Web Services, and NoSQL databases

The solution is
- Scalable
- Flexible/agile: improvements are easy to implement
- Prepared to withstand human failures
- At a reasonable cost

Main advantage: always recomputing everything
Future: Splout

Key/value datastores have limitations
- They only accept querying by the key
- Aggregations are not possible
- In other words, we are forced to pre-compute everything
  - Not always possible -> the data can explode
  - For this particular case, time ranges are fixed

Splout: like Voldemort but SQL!
- The idea: replace Voldemort with Splout SQL
- Much richer queries: real-time aggregations, flexible time ranges
- It would allow creating a kind of Google Analytics for the statistics discussed in this presentation
- Open sourced!
  https://github.com/datasalt/splout-db
