Datasalt - BBVA case study - extracting value from credit card transactions


Transcript of "Datasalt - BBVA case study - extracting value from credit card transactions"

  1. Case Study: Value extraction from BBVA credit card transactions
  2. 104,000 employees, 47 million customers
  3. The idea: extract value from anonymized credit card transactions data and share it. Always: impersonal, aggregated, dissociated, irreversible.
  4. Helping. Consumers (informed decision): shop recommendations (by location and by category); best time to buy; activity and fidelity of shop's customers. Sellers (learning clients' patterns): activity and fidelity of shop's customers; sex, age and location; buying patterns.
  5. Shop stats for different periods: all, year, quarter, month, week, day ... and much more.
  6. The applications: internal use, sellers, customers.
  7. The challenges: company silos; the costs; the amount of data; security; development flexibility/agility; human failures.
  8. The platform: data storage on S3; data processing on Elastic MapReduce; data serving on EC2.
  9. The architecture
  10. Hadoop. Distributed filesystem: files as big as you want; horizontal scalability; failover. Distributed computing: MapReduce; batch oriented (input files are processed and converted into output files); horizontal scalability.
  11. Easier Hadoop Java API, but keeping similar efficiency. Common design patterns covered: compound records; secondary sorting; joins. Other improvements: instance-based configuration; first-class multiple input/output. Tuple MapReduce implementation for Hadoop.
  12. Tuple MapReduce: our evolution of Google's MapReduce. Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: "Tuple MapReduce: Beyond classic MapReduce." In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium, December 10-13, 2012.
  13. Tuple MapReduce example: sales difference between the top-selling offices of each location.
  14. Tuple MapReduce. Main constraint: the group-by clause must be a subset of the sort-by clause. Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation. Pangool -> Tuple MapReduce over Hadoop.
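The example on slide 13 and the constraint on slide 14 can be illustrated with a minimal sketch in plain Python. This is not Pangool's API: the sample tuples and the helper name `top_two_gap` are hypothetical, and an in-memory sort stands in for the framework's shuffle. It groups by location while sorting by location and sales, so the group-by field is part of the sort-by fields.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical tuples: (location, office, sales).
sales = [
    ("Madrid", "office-1", 120),
    ("Madrid", "office-2", 300),
    ("Madrid", "office-3", 250),
    ("Bilbao", "office-4", 90),
    ("Bilbao", "office-5", 60),
]

def top_two_gap(tuples):
    """Sales gap between the two best-selling offices of each location."""
    # Sort by (location, sales descending). The group-by field (location)
    # is one of the sort-by fields, as the constraint requires.
    ordered = sorted(tuples, key=lambda t: (t[0], -t[2]))
    gaps = {}
    for location, group in groupby(ordered, key=itemgetter(0)):
        rows = list(group)
        # Rows arrive already sorted, so the top sellers are the first two.
        if len(rows) >= 2:
            gaps[location] = rows[0][2] - rows[1][2]
    return gaps

print(top_two_gap(sales))
```

Because each group arrives already sorted by sales, the reducer only has to look at its first two rows.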
  15. Efficiency: similar efficiency to Hadoop. http://pangool.net/benchmark.html
  16. Voldemort: distributed key/value store.
  17. Voldemort & Hadoop. Benefits: scalability and failover; updating the database does not affect serving queries; all data is replaced at each execution. Replacing all data provides agility/flexibility (big development changes are not a pain), easier survival of human errors (fix the code and run again), and easy setup of new clusters with different topologies.
  18. Basic statistics. Easy to implement with Pangool/Hadoop: one job, grouping by the dimension over which you want to calculate the statistics (count, average, min, max, stdev). Computing several time periods in the same job: use the mapper to replicate each datum for each period; add a period identifier field to the tuple and include it in the group-by clause.
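A minimal sketch of the approach on slide 18, written in plain Python rather than Pangool. The transactions, the period identifiers and the function names `map_periods` and `reduce_stats` are assumptions made for illustration. The mapper emits each datum once per period, and the period identifier is part of the grouping key.

```python
import math
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (shop, purchase date, amount).
transactions = [
    ("shop-1", date(2012, 3, 5), 10.0),
    ("shop-1", date(2012, 3, 20), 30.0),
    ("shop-1", date(2012, 4, 2), 20.0),
]

def map_periods(shop, day, amount):
    """Mapper: replicate the datum once per period, tagged with a period id."""
    yield (shop, "all"), amount
    yield (shop, f"year:{day.year}"), amount
    yield (shop, f"month:{day.year}-{day.month:02d}"), amount

def reduce_stats(amounts):
    """Reducer: count, average, min, max and population stdev of one group."""
    n = len(amounts)
    mean = sum(amounts) / n
    variance = sum((a - mean) ** 2 for a in amounts) / n
    return {"count": n, "avg": mean, "min": min(amounts),
            "max": max(amounts), "stdev": math.sqrt(variance)}

# Group by (shop, period id). A real job would leave this step to the framework.
groups = defaultdict(list)
for shop, day, amount in transactions:
    for key, value in map_periods(shop, day, amount):
        groups[key].append(value)

stats = {key: reduce_stats(values) for key, values in groups.items()}
print(stats[("shop-1", "month:2012-03")])
```

Adding another period (quarter, week, day) only means emitting one more key from the mapper. The reducer stays unchanged.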
  19. Distinct count. Possible to compute in a single job: use secondary sorting by the field you want to distinct count on; detect changes in that field. Example: group by shop, sort by shop and card. For the sorted rows (Shop 1, 1234), (Shop 1, 1234), (Shop 1, 1234), (Shop 1, 5678), (Shop 1, 5678), each change of card adds +1, giving 2 distinct buyers for Shop 1.
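The change-detection idea on slide 19 can be sketched as follows in plain Python. The sample purchases and the name `distinct_buyers` are hypothetical, and an in-memory `sorted` call stands in for the framework's secondary sort.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (shop, card) tuples, one per purchase, in arbitrary order.
purchases = [
    ("Shop 1", "5678"),
    ("Shop 1", "1234"),
    ("Shop 2", "1234"),
    ("Shop 1", "1234"),
    ("Shop 1", "5678"),
    ("Shop 1", "1234"),
]

def distinct_buyers(tuples):
    """Count distinct cards per shop by scanning tuples sorted by (shop, card)."""
    counts = {}
    # Sort by shop and card. In a real job this is the secondary sort.
    ordered = sorted(tuples)
    for shop, group in groupby(ordered, key=itemgetter(0)):
        distinct = 0
        previous_card = None
        for _, card in group:
            # Equal cards are adjacent, so a change means a new buyer.
            if card != previous_card:
                distinct += 1
                previous_card = card
        counts[shop] = distinct
    return counts

print(distinct_buyers(purchases))
```

Since equal cards are adjacent after sorting, the reducer only remembers the previous card instead of holding a set of all cards in memory.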
  20. Histograms. Typically a two-pass algorithm: a first pass to detect the minimum and the maximum and determine the bin ranges; a second pass to count the number of occurrences in each bin. Adaptive histogram: one pass; fixed number of bins; bins adapt.
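A minimal sketch of the two-pass variant described on slide 20, assuming equal-width bins. The function name `two_pass_histogram` and the sample amounts are hypothetical. The adaptive one-pass variant is not sketched because the slide does not say how the bins adapt.

```python
def two_pass_histogram(values, num_bins):
    """Fixed-width histogram: pass 1 finds the range, pass 2 counts per bin."""
    # First pass: the minimum and the maximum determine the bin ranges.
    low, high = min(values), max(values)
    width = (high - low) / num_bins or 1.0  # avoid a zero width if all values are equal
    counts = [0] * num_bins
    # Second pass: count the occurrences that fall in each bin.
    for v in values:
        index = min(int((v - low) / width), num_bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    edges = [low + i * width for i in range(num_bins + 1)]
    return edges, counts

amounts = [1.0, 2.0, 2.5, 4.0, 9.0, 10.0]  # hypothetical purchase amounts
edges, counts = two_pass_histogram(amounts, 3)
print(edges, counts)
```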
  21. Optimal histogram. Calculate the histogram that best represents the original one using a limited number of flexible-width bins: reduces storage needs; more representative than fixed-width ones -> better visualization.
  22. Optimal histogram. Exact algorithm: Petri Kontkanen, Petri Myllymäki, "MDL Histogram Density Estimation", http://eprints.pascal-network.org/archive/00002983/. Too slow for production use.
  23. Optimal histogram. Alternative: approximated algorithm, random-restart hill climbing. A solution is just a way of grouping existing bins; from a solution, you can move to some close solutions; some are better: they reduce the representation error. Algorithm: (1) iterate N times, keeping the best solution: (1.1) generate a random solution; (1.2) iterate until no improvement: (1.2.1) move to the next better possible movement.
  24. Optimal histogram. Alternative: approximated algorithm, random-restart hill climbing: one order of magnitude faster; 99% accuracy.
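The random-restart hill climbing outlined on slides 23 and 24 can be sketched as below. Several details are assumptions, since the slides do not specify them: a solution is encoded as a tuple of cut points between the existing fine-grained bins, a "close solution" shifts one cut point by one bin, and the representation error is the squared difference between each fine bin's count and the mean count of its merged group. All function names are hypothetical.

```python
import random

def error(counts, cuts):
    """Squared error of replacing each fine bin by its group's mean count."""
    total = 0.0
    bounds = [0] + list(cuts) + [len(counts)]
    for start, end in zip(bounds, bounds[1:]):
        group = counts[start:end]
        mean = sum(group) / len(group)
        total += sum((c - mean) ** 2 for c in group)
    return total

def neighbours(cuts, n):
    """Close solutions: shift one cut point by one fine bin, keeping cuts distinct."""
    for i, cut in enumerate(cuts):
        for moved in (cut - 1, cut + 1):
            if 0 < moved < n and moved not in cuts:
                yield tuple(sorted(cuts[:i] + (moved,) + cuts[i + 1:]))

def hill_climb(counts, num_bins, restarts, rng):
    """Random-restart hill climbing over groupings of the existing bins."""
    n = len(counts)
    best = None
    for _ in range(restarts):
        # Generate a random solution: num_bins - 1 distinct cut points.
        current = tuple(sorted(rng.sample(range(1, n), num_bins - 1)))
        improved = True
        while improved:
            improved = False
            for candidate in neighbours(current, n):
                # Move to the next better possible movement.
                if error(counts, candidate) < error(counts, current):
                    current, improved = candidate, True
                    break
        # Keep the best solution seen across restarts.
        if best is None or error(counts, current) < error(counts, best):
            best = current
    return best

fine_counts = [5, 5, 5, 40, 40, 40, 1, 1, 1, 1]  # hypothetical fine-grained histogram
cuts = hill_climb(fine_counts, num_bins=3, restarts=20, rng=random.Random(0))
print(cuts, error(fine_counts, cuts))
```

Each climb stops at a solution with no strictly better neighbour, which may be only a local optimum. The restarts are what give the search a chance to reach a better grouping.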
  25. Everything in one job. Basic statistics -> 1 job; distinct count statistics -> 1 job; one-pass histograms -> 1 job; several periods and shops -> 1 job. We can put it all together so that computing all statistics for all shops fits into exactly one job.
  26. Shop recommendations. Based on co-occurrences: if somebody bought in shop A and in shop B, then a co-occurrence between A and B exists; only one co-occurrence is counted even if a buyer bought several times in A and B; the top co-occurrences for each shop are the recommendations. Improvements: the most popular shops are filtered out because almost everybody buys in them; recommendations by category, by location and by both; different calculation periods.
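A minimal in-memory sketch of the co-occurrence idea on slide 26. The presentation implements it as several Pangool jobs, so this only illustrates the logic. The purchases, the `recommend` function and its `popular` parameter (a set of shops to filter out, supplied by the caller) are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical (card, shop) purchases. The repeated purchase is deliberate.
purchases = [
    ("card-1", "A"), ("card-1", "B"), ("card-1", "A"),
    ("card-2", "A"), ("card-2", "B"), ("card-2", "C"),
    ("card-3", "B"), ("card-3", "C"),
    ("card-4", "B"), ("card-4", "C"),
]

def recommend(purchases, top_k=2, popular=frozenset()):
    """Top co-occurring shops per shop, ignoring shops listed in `popular`."""
    shops_by_card = defaultdict(set)  # a set keeps one co-occurrence per buyer
    for card, shop in purchases:
        if shop not in popular:
            shops_by_card[card].add(shop)
    cooccurrences = defaultdict(Counter)
    for shops in shops_by_card.values():
        # Every ordered pair (a, b) of distinct shops of one buyer co-occurs once.
        for a, b in permutations(sorted(shops), 2):
            cooccurrences[a][b] += 1
    return {shop: [other for other, _ in counter.most_common(top_k)]
            for shop, counter in cooccurrences.items()}

print(recommend(purchases, top_k=1))
```

Recommendations by category, by location or by calculation period would follow the same pattern after restricting the input purchases accordingly.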
  27. Shop recommendations. Implemented in Pangool: using its counting and joining capabilities; several jobs. Challenges: if somebody bought in many shops, the list of co-occurrences can explode: co-occurrences = N * (N - 1), where N = number of distinct shops where the person bought; alleviated by limiting the total number of distinct shops to consider: only the top M shops where the client bought the most are used. Future: time-aware co-occurrences (the client bought in A and B, and did so within a close period of time).
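The growth described on slide 27 and the top-M mitigation can be made concrete with a short sketch. The function names and the sample purchase history are hypothetical. The value of M is a tuning choice that the slides do not specify.

```python
from collections import Counter

def cooccurrence_count(n):
    """Ordered co-occurrences generated by one buyer with n distinct shops."""
    return n * (n - 1)

def top_m_shops(shops_bought, m):
    """Keep only the m shops where this buyer bought the most."""
    return [shop for shop, _ in Counter(shops_bought).most_common(m)]

# Hypothetical buyer history: one entry per purchase.
history = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"]
print(cooccurrence_count(len(set(history))))  # 4 distinct shops
print(top_m_shops(history, 2))
print(cooccurrence_count(1000), cooccurrence_count(50))
```

A buyer with 1,000 distinct shops would emit 999,000 co-occurrences, while capping the list at 50 shops bounds the output at 2,450 per buyer.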
  28. Some numbers. Estimated resources needed with 1 year of data: 270 GB of stats to serve; 24 large instances; ~11 hours of execution; $3500/month. Optimizations still possible; cost without the use of reserved instances; probably cheaper with an in-house Hadoop cluster.
  29. Conclusion. It was possible to develop a Big Data solution for a bank: with low use of resources; quickly; thanks to technologies like Hadoop, Amazon Web Services and NoSQL databases. The solution is: scalable; flexible/agile (improvements are easy to implement); prepared to withstand human failures; at a reasonable cost. Main advantage: always doing everything
  30. Future: Splout. Key/value datastores have limitations: they only accept querying by the key; aggregations are not possible; in other words, we are forced to pre-compute everything; this is not always possible -> data explodes; for this particular case, time ranges are fixed. Splout: like Voldemort but SQL! The idea: to replace Voldemort with Splout SQL; much richer queries: real-time aggregations, flexible time ranges; it would allow creating some kind of Google Analytics for the statistics discussed in this presentation; open sourced! https://github.com/datasalt/splout-db