Case Study

Value extraction from BBVA credit card transactions
104,000 employees
47 million customers
The idea

Extract value from anonymized credit card transaction data & share it.

Always:
- Impersonal
- Aggregated
- Dissociated
- Irreversible
Helping

Consumers: informed decisions
- Shop recommendations (by location and by category)
- Best time to buy
- Activity & fidelity of a shop's customers

Sellers: learning client patterns
- Activity & fidelity of a shop's customers
- Sex & age & location
- Buying patterns
Shop stats

For different periods:
- All, year, quarter, month, week, day

... and much more
The applications

- Internal use
- Sellers
- Customers
The challenges

- Company silos
- The costs
- The amount of data
- Security
- Development flexibility/agility
- Human failures
The platform

- Data storage: S3
- Data processing: Elastic MapReduce
- Data serving: EC2
The architecture

Hadoop

Distributed filesystem
- Files as big as you want
- Horizontal scalability
- Failover

Distributed computing
- MapReduce
- Batch oriented: input files are processed and converted into output files
- Horizontal scalability
Pangool: a Tuple MapReduce implementation for Hadoop

An easier Hadoop Java API
- While keeping similar efficiency

Common design patterns covered
- Compound records
- Secondary sorting
- Joins

Other improvements
- Instance-based configuration
- First-class multiple inputs/outputs
Tuple MapReduce

Our evolution of Google's MapReduce

Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo:
"Tuple MapReduce: Beyond classic MapReduce."
In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining.
Brussels, Belgium | December 10–13, 2012
Tuple MapReduce

Example: sales difference between the top-selling offices for each location
Tuple MapReduce

Main constraint
- The group-by clause must be a subset of the sort-by clause

Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
- Pangool -> Tuple MapReduce over Hadoop
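To see why the constraint matters, here is a minimal plain-Java sketch (hypothetical (shop, card) tuples, not Pangool's API): once records are sorted with the group fields leading, each group arrives as one contiguous run, which is exactly what a reducer consumes.

    import java.util.*;

    public class GroupBySubsetOfSortBy {
        record Tuple(String shop, String card) {}

        public static void main(String[] args) {
            List<Tuple> tuples = new ArrayList<>(List.of(
                    new Tuple("Shop 1", "5678"), new Tuple("Shop 2", "1234"),
                    new Tuple("Shop 1", "1234"), new Tuple("Shop 1", "1234")));

            // sort by (shop, card) ...
            tuples.sort(Comparator.comparing(Tuple::shop).thenComparing(Tuple::card));

            // ... then group by (shop): because the group field leads the sort
            // order, each shop's tuples are contiguous and can be consumed as
            // one reducer call. A group-by field absent from the sort-by clause
            // would scatter its groups across the sorted stream.
            String current = null;
            for (Tuple t : tuples) {
                if (!t.shop().equals(current)) {
                    current = t.shop();
                    System.out.println("new group: " + current);
                }
                System.out.println("  " + t);
            }
        }
    }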
Efficiency

Similar efficiency to Hadoop.

http://pangool.net/benchmark.html
Voldemort

Distributed key/value store
Voldemort & Hadoop

Benefits
- Scalability & failover
- Updating the database does not affect serving queries
- All data is replaced at each execution
  - Providing agility/flexibility: big development changes are not a pain
  - Easier recovery from human errors: fix the code and run again
  - Easy to set up new clusters with different topologies
Basic statistics

Easy to implement with Pangool/Hadoop
- One job, grouping by the dimension over which you want to calculate the statistics

Count, average, min, max, stdev
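To make the per-group work concrete, here is a minimal plain-Java sketch of all five statistics computed in a single pass over one group's values. It stands in for the reduce function of the job above; the amounts are made up.

    public class BasicStats {
        public static void main(String[] args) {
            double[] amounts = {12.0, 30.5, 7.25, 30.5}; // one group's transaction amounts (made up)
            long count = 0;
            double sum = 0, sumSq = 0;
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double a : amounts) {                   // single pass over the group
                count++;
                sum += a;
                sumSq += a * a;
                min = Math.min(min, a);
                max = Math.max(max, a);
            }
            double avg = sum / count;
            // population stdev from the running sums; max(0, ...) guards
            // against tiny negative values from floating-point rounding
            double stdev = Math.sqrt(Math.max(0, sumSq / count - avg * avg));
            System.out.printf("count=%d avg=%.2f min=%.2f max=%.2f stdev=%.2f%n",
                    count, avg, min, max, stdev);
        }
    }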
Computing several time periods in the same job
- Use the mapper to replicate each datum for each period
- Add a period identifier field to the tuple and include it in the group-by clause
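A sketch of that replication trick, assuming a simple date-string format. The Txn record and the period labels ("ALL", "Y2012", "M2012-07") are invented for illustration; a real mapper would emit Pangool tuples rather than string arrays.

    import java.util.*;

    public class PeriodReplication {
        // Hypothetical input record; the real job reads transactions from S3.
        record Txn(String shop, String date, double amount) {} // date as "YYYY-MM-DD"

        // Stand-in for the map function: emit one (shop, period, amount) tuple
        // per period the transaction falls into, so a single downstream
        // group-by on (shop, period) covers all periods at once.
        static List<String[]> map(Txn t) {
            String year = t.date().substring(0, 4);
            String month = t.date().substring(0, 7);
            List<String[]> out = new ArrayList<>();
            for (String period : new String[]{"ALL", "Y" + year, "M" + month}) {
                out.add(new String[]{t.shop(), period, Double.toString(t.amount())});
            }
            return out;
        }

        public static void main(String[] args) {
            for (String[] tuple : map(new Txn("Shop 1", "2012-07-15", 25.0))) {
                System.out.println(Arrays.toString(tuple));
            }
        }
    }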
Distinct count

Possible to compute in a single job
- Using secondary sorting by the field you want to distinct-count on
- Detecting changes in that field

Example: group by shop, sort by shop and card

    Shop     Card
    Shop 1   1234
    Shop 1   1234
    Shop 1   1234
    Shop 1   5678    <- change: +1
    Shop 1   5678    <- end of group: +1

    => 2 distinct buyers for shop 1
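The change-detection logic, sketched in plain Java over the already-sorted stream from the example (the actual implementation is a reducer in the Pangool job; the rows are the example's):

    import java.util.*;

    public class DistinctCountSketch {
        public static void main(String[] args) {
            // (shop, card) pairs, already grouped by shop and secondarily
            // sorted by card, as the shuffle would deliver them to a reducer.
            String[][] rows = {
                {"Shop 1", "1234"}, {"Shop 1", "1234"}, {"Shop 1", "1234"},
                {"Shop 1", "5678"}, {"Shop 1", "5678"},
            };
            Map<String, Integer> distinctBuyers = new LinkedHashMap<>();
            String prevShop = null, prevCard = null;
            for (String[] row : rows) {
                String shop = row[0], card = row[1];
                // A new shop or a change in the sorted card field means one
                // more distinct buyer; no per-shop set of cards is kept in memory.
                if (!shop.equals(prevShop) || !card.equals(prevCard)) {
                    distinctBuyers.merge(shop, 1, Integer::sum);
                }
                prevShop = shop;
                prevCard = card;
            }
            System.out.println(distinctBuyers); // {Shop 1=2}
        }
    }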
  
Histograms

Typically a two-pass algorithm
- First pass to detect the minimum and the maximum and determine the bin ranges
- Second pass to count the number of occurrences in each bin

Adaptive histogram
- One pass
- Fixed number of bins
- Bins adapt
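The slides do not spell out how the bins adapt; the sketch below shows one standard way to get those three properties, merging the two closest bin centroids whenever the budget is exceeded, in the spirit of streaming-histogram algorithms. The merge rule and the class are assumptions, not the deck's algorithm.

    import java.util.*;

    public class AdaptiveHistogram {
        private final int maxBins;
        // centroid -> count; TreeMap keeps centroids ordered so the closest
        // adjacent pair can be found with one scan
        private final TreeMap<Double, Long> bins = new TreeMap<>();

        AdaptiveHistogram(int maxBins) { this.maxBins = maxBins; }

        void add(double value) {                          // one pass: each value seen once
            bins.merge(value, 1L, Long::sum);
            if (bins.size() > maxBins) mergeClosestPair(); // fixed number of bins
        }

        private void mergeClosestPair() {
            Double prev = null, left = null;
            double bestGap = Double.POSITIVE_INFINITY;
            for (double c : bins.keySet()) {
                if (prev != null && c - prev < bestGap) { bestGap = c - prev; left = prev; }
                prev = c;
            }
            double right = bins.higherKey(left);
            long cl = bins.remove(left), cr = bins.remove(right);
            // replace the two bins by their weighted centroid: the bins "adapt"
            bins.merge((left * cl + right * cr) / (cl + cr), cl + cr, Long::sum);
        }

        public static void main(String[] args) {
            AdaptiveHistogram h = new AdaptiveHistogram(5);
            new Random(42).doubles(1000).forEach(h::add);
            h.bins.forEach((c, n) -> System.out.printf("centroid=%.3f count=%d%n", c, n));
        }
    }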
Optimal histogram

Calculate the histogram that best represents the original one, using a limited number of flexible-width bins
- Reduces storage needs
- More representative than fixed-width bins -> better visualization
Optimal histogram

Exact algorithm

Petri Kontkanen, Petri Myllymäki:
MDL Histogram Density Estimation
http://eprints.pascal-network.org/archive/00002983/

Too slow for production use
  
Optimal histogram

Alternative: approximate algorithm

Random-restart hill climbing
- A solution is just a way of grouping the existing bins
- From a solution, you can move to some close solutions
- Some moves are better: they reduce the representation error

Algorithm
1. Iterate N times, keeping the best solution:
   1. Generate a random solution
   2. Iterate until there is no improvement:
      1. Move to the next better possible movement
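A compact, self-contained sketch of the random-restart hill climbing just described. The error function (squared deviation of each original bin count from its group's mean) and the neighbourhood (shifting one group boundary by one position) are plausible choices for illustration, not details given in the deck.

    import java.util.*;

    public class HistogramHillClimbing {
        // Representation error of a grouping: sum of squared deviations of
        // each original bin count from its group's mean (assumed measure).
        static double error(double[] counts, int[] bounds) {
            double err = 0;
            for (int s = 0; s < bounds.length; s++) {
                int from = bounds[s];
                int to = (s + 1 < bounds.length) ? bounds[s + 1] : counts.length;
                double mean = 0;
                for (int i = from; i < to; i++) mean += counts[i];
                mean /= (to - from);
                for (int i = from; i < to; i++) err += (counts[i] - mean) * (counts[i] - mean);
            }
            return err;
        }

        // bounds[s] = index of the first original bin in flexible bin s; bounds[0] == 0.
        static int[] randomSolution(int nBins, int k, Random rnd) {
            TreeSet<Integer> cuts = new TreeSet<>();
            cuts.add(0);
            while (cuts.size() < k) cuts.add(1 + rnd.nextInt(nBins - 1)); // k - 1 distinct cuts
            int[] sol = new int[k];
            int i = 0;
            for (int c : cuts) sol[i++] = c;
            return sol;
        }

        public static int[] solve(double[] counts, int k, int restarts, Random rnd) {
            int[] best = null;
            double bestErr = Double.POSITIVE_INFINITY;
            for (int r = 0; r < restarts; r++) {                   // 1. iterate N times, keep the best
                int[] sol = randomSolution(counts.length, k, rnd); // 1.1 generate a random solution
                double err = error(counts, sol);
                boolean improved = true;
                while (improved) {                                 // 1.2 iterate until no improvement
                    improved = false;
                    for (int b = 1; b < k && !improved; b++) {
                        for (int d : new int[]{-1, 1}) {           // 1.2.1 move one boundary by one
                            int[] next = sol.clone();
                            next[b] += d;
                            if (next[b] <= next[b - 1]) continue;  // keep every bin non-empty
                            if (b + 1 < k && next[b] >= next[b + 1]) continue;
                            if (next[b] >= counts.length) continue;
                            double e = error(counts, next);
                            if (e < err) { sol = next; err = e; improved = true; break; }
                        }
                    }
                }
                if (err < bestErr) { bestErr = err; best = sol; }
            }
            return best;
        }

        public static void main(String[] args) {
            double[] counts = {1, 1, 2, 9, 10, 9, 2, 1, 1, 8}; // counts of 10 fixed-width bins
            System.out.println(Arrays.toString(solve(counts, 4, 20, new Random(7))));
        }
    }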
Optimal histogram

Alternative: approximate algorithm

Random-restart hill climbing
- One order of magnitude faster
- 99% accuracy
  
Everything in one job

- Basic statistics -> 1 job
- Distinct count statistics -> 1 job
- One-pass histograms -> 1 job
- Several periods & shops -> 1 job

We can put it all together, so that computing all statistics for all shops fits into exactly one job
Shop recommendations

Based on co-occurrences
- If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
- Only one co-occurrence is counted, even if a buyer bought several times in A and B
- The top co-occurrences for each shop are the recommendations

Improvements
- The most popular shops are filtered out, because almost everybody buys in them
- Recommendations by category, by location, and by both
- Different calculation periods
Shop recommendations

Implemented in Pangool
- Using its counting and joining capabilities
- Several jobs

Challenges
- If somebody bought in many shops, the list of co-occurrences can explode:
  - Co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought
- Alleviated by limiting the total number of distinct shops to consider:
  - Only use the top M shops where the client bought the most (see the sketch below)

Future
- Time-aware co-occurrences: the client bought in A and B within a short period of time
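To make the explosion and the top-M cap concrete, here is a small stand-alone sketch of the per-client co-occurrence generation. The real pipeline is several Pangool jobs; the shop names, purchase counts, and the value of M below are made up.

    import java.util.*;

    public class CoOccurrences {
        public static void main(String[] args) {
            int M = 3; // keep only the top M shops where this client bought the most
            Map<String, Integer> buysPerShop = Map.of("A", 5, "B", 2, "C", 9, "D", 1);

            // Top M shops by purchase count; without this cap, a client who
            // bought in N distinct shops contributes N * (N - 1) co-occurrences.
            List<String> top = buysPerShop.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(M)
                    .map(Map.Entry::getKey)
                    .toList();

            // One co-occurrence per ordered pair, regardless of how many
            // times the client bought in each shop.
            for (String a : top)
                for (String b : top)
                    if (!a.equals(b)) System.out.println(a + " -> " + b);
        }
    }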
Some numbers

Estimated resources needed for 1 year of data
- 270 GB of stats to serve
- 24 large instances ~ 11 hours of execution
- $3,500/month
  - Optimizations are still possible
  - Cost without the use of reserved instances
  - Probably cheaper with an in-house Hadoop cluster
Conclusion

It was possible to develop a Big Data solution for a bank
- With low use of resources
- Quickly
- Thanks to the use of technologies like Hadoop, Amazon Web Services, and NoSQL databases

The solution is
- Scalable
- Flexible/agile: improvements are easy to implement
- Prepared to withstand human failures
- At a reasonable cost

Main advantage: always recomputing everything
Future: Splout

Key/value datastores have limitations
- They only accept querying by the key
- Aggregations are not possible
- In other words, we are forced to pre-compute everything
  - Not always possible -> the data can explode
  - For this particular case, time ranges are fixed

Splout: like Voldemort but SQL!
- The idea: replace Voldemort with Splout SQL
- Much richer queries: real-time aggregations, flexible time ranges
- It would allow creating a kind of Google Analytics for the statistics discussed in this presentation
- Open sourced!
  https://github.com/datasalt/splout-db
