Datasalt - BBVA case study - extracting value from credit card transactions

    Presentation Transcript

    • Case Study: Value extraction from BBVA credit card transactions
    • 104,000 employees, 47 million customers
    • The idea
      Extract value from anonymized credit card transactions data and share it.
      Always:
      - Impersonal
      - Aggregated
      - Dissociated
      - Irreversible
    • Helping
      Consumers (informed decisions):
      - Shop recommendations (by location and by category)
      - Best time to buy
      - Activity & fidelity of a shop's customers
      Sellers (learning client patterns):
      - Activity & fidelity of a shop's customers
      - Sex, age & location
      - Buying patterns
    • Shop stats
      For different periods:
      - All, year, quarter, month, week, day
      ... and much more
    • The applications
      - Internal use
      - Sellers
      - Customers
    • The challenges
      - Company silos
      - The costs
      - The amount of data
      - Security
      - Development flexibility/agility
      - Human failures
    • The platform
      - Data storage: S3
      - Data processing: Elastic MapReduce
      - Data serving: EC2
    • The architecture
    • Hadoop
      Distributed filesystem:
      - Files as big as you want
      - Horizontal scalability
      - Failover
      Distributed computing:
      - MapReduce
      - Batch oriented: input files are processed and converted into output files
      - Horizontal scalability
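The batch model above can be sketched in a few lines of Python; the `mapreduce` helper, the mapper and the reducer below are illustrative stand-ins for the idea, not Hadoop's actual API:

```python
from itertools import groupby

def mapreduce(records, mapper, reducer):
    """Minimal in-memory sketch of the MapReduce batch model: map each
    input record to (key, value) pairs, shuffle-sort by key, and reduce
    each key group into an output record."""
    pairs = sorted(kv for rec in records for kv in mapper(rec))
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])]

# Classic word count over input "files" (here, lines of text).
lines = ["hadoop scales", "hadoop batch"]
out = mapreduce(lines,
                mapper=lambda line: [(w, 1) for w in line.split()],
                reducer=lambda word, ones: (word, sum(ones)))
print(out)  # [('batch', 1), ('hadoop', 2), ('scales', 1)]
```

On a real cluster the sort happens in the distributed shuffle phase, which is what gives the model its horizontal scalability.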
    • Tuple MapReduce implementation for Hadoop
      An easier Hadoop Java API:
      - While keeping similar efficiency
      Common design patterns covered:
      - Compound records
      - Secondary sorting
      - Joins
      Other improvements:
      - Instance-based configuration
      - First-class multiple inputs/outputs
    • Tuple MapReduce
      Our evolution of Google's MapReduce.
      Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: "Tuple MapReduce: Beyond Classic MapReduce." In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium, December 10-13, 2012.
    • Tuple MapReduce
      Example: the sales difference between the most-selling offices for each location.
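The example on this slide can be simulated in plain Python; the office names and sales figures below are made up, and an in-memory sort stands in for the framework's sort-by clause:

```python
from itertools import groupby

# Hypothetical (office, location, sales) tuples; not real BBVA data.
records = [
    ("office-A", "Madrid", 300),
    ("office-B", "Madrid", 250),
    ("office-C", "Bilbao", 120),
    ("office-D", "Bilbao", 90),
    ("office-E", "Madrid", 100),
]

# Tuple MapReduce semantics: sort by (location, -sales) so that the
# group-by clause (location) is a subset of the sort-by clause.
records.sort(key=lambda r: (r[1], -r[2]))

def top_two_sales_diff(records):
    """For each location, the sales gap between its two best-selling offices."""
    diffs = {}
    for location, group in groupby(records, key=lambda r: r[1]):
        group = list(group)
        if len(group) >= 2:
            diffs[location] = group[0][2] - group[1][2]
    return diffs

print(top_two_sales_diff(records))  # {'Bilbao': 30, 'Madrid': 50}
```

Because each group arrives already sorted by sales, the reducer only has to look at the first two tuples of the group, which is exactly the pattern the group-by/sort-by constraint enables.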
    • Tuple MapReduce
      Main constraint:
      - The group-by clause must be a subset of the sort-by clause
      Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation:
      - Pangool -> Tuple MapReduce over Hadoop
    • Efficiency
      Similar efficiency to Hadoop: http://pangool.net/benchmark.html
    • Voldemort
      A distributed key/value store.
    • Voldemort & Hadoop
      Benefits:
      - Scalability & failover
      - Updating the database does not affect serving queries
      - All data is replaced at each execution:
        - Provides agility/flexibility: big development changes are not a pain
        - Easier survival of human errors: fix the code and run again
        - Easy to set up new clusters with different topologies
    • Basic statistics
      Easy to implement with Pangool/Hadoop:
      - One job, grouping by the dimension over which you want to calculate the statistics
      - Count, average, min, max, stdev
      Computing several time periods in the same job:
      - Use the mapper to replicate each datum for each period
      - Add a period identifier field to the tuple and include it in the group-by clause
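A minimal Python simulation of the period-replication trick described above; the transactions, the period definitions and the field names are invented for illustration:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transactions: (shop, age_in_days, amount); day 0 is "today".
transactions = [("shop-1", 0, 10.0), ("shop-1", 3, 30.0), ("shop-2", 0, 5.0)]

# Periods a transaction may fall into, as (name, max_age_in_days).
PERIODS = [("day", 1), ("week", 7), ("all", 10**9)]

def map_phase(transactions):
    """Replicate each datum once per matching period; the period id
    becomes part of the group-by key, as on the slide."""
    for shop, age, amount in transactions:
        for period, max_age in PERIODS:
            if age < max_age:
                yield (shop, period), amount

def reduce_phase(pairs):
    """Group by (shop, period) and compute the basic statistics."""
    groups = defaultdict(list)
    for key, amount in pairs:
        groups[key].append(amount)
    return {k: {"count": len(v), "avg": mean(v), "min": min(v), "max": max(v)}
            for k, v in groups.items()}

stats = reduce_phase(map_phase(transactions))
print(stats[("shop-1", "week")])
# {'count': 2, 'avg': 20.0, 'min': 10.0, 'max': 30.0}
```

Replicating in the mapper multiplies the shuffled data by the number of periods, but in exchange all periods come out of a single job.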
    • Distinct count
      Possible to compute in a single job:
      - Use secondary sorting on the field you want to distinct-count
      - Detect changes of that field
      Example:
      - Group by shop, sort by shop and card
      - Each change of the card value within a shop group adds one distinct buyer; e.g. the card sequence 1234, 1234, 1234, 5678, 5678 for shop 1 contains two changes, so shop 1 has 2 distinct buyers
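The change-detection idea can be sketched as follows; the in-memory sort stands in for Hadoop's shuffle with secondary sort, and the card numbers are the made-up ones from the slide:

```python
from itertools import groupby

# Hypothetical (shop, card) pairs, one per purchase.
purchases = [("shop-1", "1234"), ("shop-1", "1234"), ("shop-1", "5678"),
             ("shop-2", "1234"), ("shop-1", "5678")]

def distinct_buyers(purchases):
    """Single pass over data sorted by (shop, card): every change of the
    card value inside a shop group adds one distinct buyer, so no set of
    already-seen cards has to be kept in memory."""
    purchases = sorted(purchases)           # the shuffle would do this
    counts = {}
    for shop, group in groupby(purchases, key=lambda p: p[0]):
        previous_card, n = None, 0
        for _, card in group:
            if card != previous_card:       # change detected -> +1
                n += 1
                previous_card = card
        counts[shop] = n
    return counts

print(distinct_buyers(purchases))  # {'shop-1': 2, 'shop-2': 1}
```

This is why the secondary sort matters: with the cards arriving in order, constant memory per group suffices no matter how many buyers a shop has.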
    • Histograms
      Typically a two-pass algorithm:
      - First pass: detect the minimum and the maximum and determine the bin ranges
      - Second pass: count the number of occurrences in each bin
      Adaptive histogram:
      - One pass
      - Fixed number of bins
      - Bins adapt
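One common way to build a one-pass histogram with a fixed number of adapting bins is a merge-closest scheme in the style of Ben-Haim and Tom-Tov's streaming histogram; the slides do not name the exact algorithm used, so this sketch is only an assumption:

```python
import bisect

def adaptive_histogram(values, max_bins=4):
    """One-pass histogram with a fixed bin budget: each value becomes a
    (centre, count) bin, and whenever the budget is exceeded the two
    bins with the closest centres are merged into their weighted mean."""
    bins = []  # sorted list of [centre, count]
    for v in values:
        bisect.insort(bins, [float(v), 1])
        if len(bins) > max_bins:
            # find the adjacent pair with the smallest gap between centres
            i = min(range(len(bins) - 1),
                    key=lambda j: bins[j + 1][0] - bins[j][0])
            (c1, n1), (c2, n2) = bins[i], bins[i + 1]
            bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]
    return bins

print(adaptive_histogram([1, 2, 100, 101, 3, 102], max_bins=3))
```

The bins drift toward dense regions of the data, which is the "bins adapt" property the slide refers to.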
    • Optimal histogram
      Calculate the histogram that best represents the original one using a limited number of flexible-width bins:
      - Reduces storage needs
      - More representative than fixed-width bins -> better visualization
    • Optimal histogram
      Exact algorithm: Petri Kontkanen, Petri Myllymäki, "MDL Histogram Density Estimation", http://eprints.pascal-network.org/archive/00002983/
      Too slow for production use.
    • Optimal histogram
      Alternative: an approximate algorithm, random-restart hill climbing:
      - A solution is just a way of grouping the existing bins
      - From a solution, you can move to some close solutions
      - Some of them are better: they reduce the representation error
      Algorithm:
      1. Iterate N times, keeping the best solution:
         a. Generate a random solution
         b. Iterate until no improvement: move to the next better possible solution
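The hill climbing above can be sketched like this; the error metric (squared deviation from each group's mean height) and the move set (shifting one boundary by one bin) are my assumptions, since the slides do not spell them out:

```python
import random

def error(counts, bounds):
    """Squared error of representing each group of bins by its mean height."""
    err, edges = 0.0, [0] + bounds + [len(counts)]
    for a, b in zip(edges, edges[1:]):
        mean = sum(counts[a:b]) / (b - a)
        err += sum((c - mean) ** 2 for c in counts[a:b])
    return err

def optimal_histogram(counts, n_groups, restarts=20, seed=42):
    """Random-restart hill climbing over ways of grouping adjacent bins:
    a solution is a set of group boundaries, and a move shifts one
    boundary left or right by one bin."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        bounds = sorted(rng.sample(range(1, len(counts)), n_groups - 1))
        while True:
            neighbours = [sorted(bounds[:i] + [bounds[i] + d] + bounds[i + 1:])
                          for i in range(len(bounds)) for d in (-1, 1)
                          if 1 <= bounds[i] + d < len(counts)
                          and bounds[i] + d not in bounds]
            if not neighbours:
                break
            candidate = min(neighbours, key=lambda b: error(counts, b))
            if error(counts, candidate) < error(counts, bounds):
                bounds = candidate           # move to a better close solution
            else:
                break                        # local optimum reached
        if best is None or error(counts, bounds) < error(counts, best):
            best = bounds
    return best
```

Each restart can only get stuck in a local optimum, which is why the outer loop keeps the best of N independent attempts.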
    • Optimal histogram
      The approximate random-restart hill climbing algorithm is:
      - One order of magnitude faster
      - 99% accurate
    • Everything in one job
      - Basic statistics -> 1 job
      - Distinct count statistics -> 1 job
      - One-pass histograms -> 1 job
      - Several periods & shops -> 1 job
      We can put it all together so that computing all statistics for all shops fits into exactly one job.
    • Shop recommendations
      Based on co-occurrences:
      - If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
      - Only one co-occurrence is counted even if a buyer bought several times in A and B
      - The top co-occurrences of each shop are its recommendations
      Improvements:
      - The most popular shops are filtered out because almost everybody buys in them
      - Recommendations by category, by location, and by both
      - Different calculation periods
    • Shop recommendations
      Implemented in Pangool:
      - Using its counting and joining capabilities
      - Several jobs
      Challenges:
      - If somebody bought in many shops, the list of co-occurrences can explode:
        co-occurrences = N * (N - 1), where N = the number of distinct shops where the person bought
      - Alleviated by limiting the total number of distinct shops considered: only the top M shops where the client bought the most are used
      Future:
      - Time-aware co-occurrences: the client bought in A and B within a close period of time
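A toy Python version of the co-occurrence counting described above; the buyers, the shop names, and the way the top-M cap is applied (alphabetical here, by purchase frequency in the talk) are all illustrative:

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase log: buyer -> shops where they bought.
purchases = {
    "buyer-1": ["A", "B", "C"],
    "buyer-2": ["A", "B"],
    "buyer-3": ["B", "C"],
}

def recommendations(purchases, top_m_shops=50, top_k=2):
    """Co-occurrence counting: each buyer contributes one co-occurrence
    per ordered pair of distinct shops (N * (N - 1) pairs), with N capped
    at top_m_shops to keep the pair list from exploding."""
    cooc = Counter()
    for shops in purchases.values():
        # dedupe (one co-occurrence per buyer) and cap N; the real
        # pipeline keeps the M shops the client bought in the most
        shops = sorted(set(shops))[:top_m_shops]
        cooc.update(permutations(shops, 2))
    recs = {}
    for (a, b), n in cooc.items():
        recs.setdefault(a, []).append((n, b))
    return {shop: [b for n, b in sorted(pairs, reverse=True)[:top_k]]
            for shop, pairs in recs.items()}

print(recommendations(purchases))
```

The quadratic N * (N - 1) blow-up happens inside `permutations`, which is exactly why capping N per buyer matters at scale.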
    • Some numbers
      Estimated resources needed with 1 year of data:
      - 270 GB of stats to serve
      - 24 large instances, ~11 hours of execution
      - $3,500/month
      - Optimizations are still possible
      - Cost computed without the use of reserved instances
      - Probably cheaper with an in-house Hadoop cluster
    • Conclusion
      It was possible to develop a Big Data solution for a bank:
      - With a low use of resources
      - Quickly
      - Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases
      The solution is:
      - Scalable
      - Flexible/agile: improvements are easy to implement
      - Prepared to withstand human failures
      - Available at a reasonable cost
      Main advantage: always recomputing everything from scratch.
    • Future: Splout
      Key/value datastores have limitations:
      - They only accept querying by the key
      - Aggregations are not possible
      - In other words, we are forced to pre-compute everything
      - That is not always possible -> the data explodes
      - For this particular case, time ranges are fixed
      Splout: like Voldemort, but SQL!
      - The idea: replace Voldemort with Splout SQL
      - Much richer queries: real-time aggregations, flexible time ranges
      - It would allow building a kind of Google Analytics for the statistics discussed in this presentation
      - Open sourced: https://github.com/datasalt/splout-db