Merchant Lookup Service Intuit


The Merchant Lookup Service at Intuit enables users and products to look up business details by:

Business name (including partial names and misspellings)
Business location (street address, latitude and longitude)
Business type (category, SIC)
User location (IP, GPS-enabled device location)

This service enables auto-suggest, auto-complete, and auto-correct within products. The project aims to provide a more complete, canonical business profile by bringing together data and metadata from various information providers as well as merchants from Intuit's small business customer base.

The Business Directory Service is available as a web service that can be integrated into desktop, web, and mobile applications. It is exposed through a REST API, and response times are kept low because the data is indexed in Solr and distributed. The backend is powered by HBase, which stores the comprehensive, de-duplicated, canonical merchant information.

The source data comprises hundreds of millions of records. Duplicates arise from sparse, manually entered information from Intuit's small business customers and from overlapping records across information providers. A series of Hadoop jobs de-duplicates these records into a canonical set of merchants. The de-duping pipeline includes components such as a Reader, an Index Generator, several Matchers, a Score Combiner, and a Merchant Splicer.

Published in: Technology, Education

  1. Merchant Mastering & De-duping with Hadoop and Lucene
     Hadoop in Action @ Hadoop Summit, June 13th, 2012
     Michael J. Radwin, Intuit
  2. Merchant contact information
  3. Fuzzy matching & de-duplicating merchants
     Company ABC                                Company PQR
     name:    The Windsor Press, Inc.           name:    The Windsor Press
     address: PO Box 465 6 North Third Street   address: P.O. Box 465 6 North 3rd
     city:    Hamburg                           city:    Hamburg
     state:   PA                                state:   PA
     zip:     19526                             zip:     19526-0465
     phone:   (610) 562-2267                    phone:   (610) 562-2267
     Both of the above vendor records map to external reference data (Dun & Bradstreet):
     DUNSnum: 002114902
     Name:    The Windsor-Press Inc
     Street:  6 N 3rd St
     City:    Hamburg
     State:   PA
     Zip:     19526-1502
     Phone:   (610)-562-2267
  4. Automatic transaction categorization
     09/20/2010 ORCHARD SUPPLY #690 MOUNTAIN VI026460773 415-691-2000 320102640145034981 $20.09
  5. De-duping system architecture
     Diagram, in numbered order:
     1. Import (input data)
     2. Address Standardizer
     3. Matchers: name, phone, address (against merchant reference data)
     4. Matcher scores
     5. Score Combiner
     6. Merchant Splicer
     7. Applications: auto-complete, transaction categorization
  6. HBase schema example: Merchant table
     Row key: 25204939
     Info (column family):
       name:          Crepevine
       street:        367 University Avenue
       city:          Palo Alto
       state:         CA
       zip:           94031
       county:        Santa Clara County
       country:       United States of America
       phoneNumber:   16503233900
       latitude:      37.430211
       longitude:     -122.098221
       source:        internet
       mint_category: Food & Dining
       qbo_category:  Restaurants
       NAICS:         722110
       SIC:           51826
     Mapping (column family):
       sourcename:    10000048, 10000075
  7. MapReduce algorithm for matching
     Mapper
       – Input: Merchant A
       – Generate the subset of potential matches (lookup from Lucene): Merchant A1, A2, A3, A4
     Reducer
       – Compare attribute values via custom matching
       – Output a score between 0 and 1 for each matched pair:
           A: A1   0.6
           A: A2   0.9
           A: A3   0.4
           A: A4   0.667
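The two phases on this slide can be sketched in plain Java without Hadoop or Lucene. This is only an illustration under stated assumptions: the class `MatchSketch`, the `Merchant` fields, the use of the zip code as a stand-in for the Lucene lookup, and the "fraction of equal attributes" score are all hypothetical and not taken from the talk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for the map and reduce phases; every name here is hypothetical. */
final class MatchSketch {
    /** Minimal merchant record: an id plus a few comparable attributes. */
    static final class Merchant {
        final String id, name, phone, street, zip;
        Merchant(String id, String name, String phone, String street, String zip) {
            this.id = id; this.name = name; this.phone = phone;
            this.street = street; this.zip = zip;
        }
    }

    /** Assumption: a zip-code index plays the role of the Lucene reference index. */
    static Map<String, List<Merchant>> buildIndex(List<Merchant> reference) {
        Map<String, List<Merchant>> index = new HashMap<>();
        for (Merchant m : reference) {
            index.computeIfAbsent(m.zip, k -> new ArrayList<>()).add(m);
        }
        return index;
    }

    /** "Map" step: emit only the reference merchants worth comparing with the input. */
    static List<Merchant> candidates(Merchant input, Map<String, List<Merchant>> index) {
        return index.getOrDefault(input.zip, Collections.<Merchant>emptyList());
    }

    /** "Reduce" step: score each candidate in [0, 1] as the fraction of equal attributes. */
    static Map<String, Double> score(Merchant input, List<Merchant> candidates) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Merchant c : candidates) {
            int equal = 0;
            if (input.name.equalsIgnoreCase(c.name)) equal++;
            if (input.phone.equals(c.phone)) equal++;
            if (input.street.equalsIgnoreCase(c.street)) equal++;
            scores.put(input.id + ":" + c.id, equal / 3.0);
        }
        return scores;
    }
}
```

Calling `score(input, candidates(input, buildIndex(reference)))` yields one `input:candidate` entry per potential match, which mirrors the shape of the `A: A1 0.6` output on the slide. The real pipeline would replace the exact-equality checks with fuzzy comparisons.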
  8. Fuzzy-matching implementation details
     • Normalization & string pre-processing
       – Case, punctuation & special characters
       – Phone numbers: letter-to-digit conversion, remove extensions
       – Biz names: special handling for common suffixes like Inc, Corp, LLC
       – USA addresses: 123 North Main Ave becomes 123 N. Main
     • Jaccard and Jaro-Winkler string similarity approaches
     • Final Score = (0.4 * phone confidence) + (0.25 * name confidence) + (0.35 * address confidence)
       – Two businesses with the same phone are likely to be the same business
       – Same with email address
       – Similar business name less important
       – And sometimes two businesses share the same address
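A minimal sketch of the steps listed on this slide, using only the Java standard library. The weights in `finalScore` come from the slide; everything else (the class name `FuzzyScore`, the exact regular expressions, token-level Jaccard) is an assumption made for illustration, and Jaro-Winkler, letter-to-digit conversion, and extension stripping are left out for brevity.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative helpers only; not the implementation described in the talk. */
final class FuzzyScore {
    /** Lower-case, turn punctuation into spaces, drop common business suffixes. */
    static String normalizeName(String raw) {
        String s = raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9 ]", " ");
        s = s.replaceAll("\\b(inc|corp|llc)\\b", " ");
        return s.trim().replaceAll("\\s+", " ");
    }

    /** Keep digits only, so "(610) 562-2267" and "(610)-562-2267" compare equal. */
    static String normalizePhone(String raw) {
        return raw.replaceAll("[^0-9]", "");
    }

    /** Split on whitespace, ignoring empty tokens. */
    static Set<String> tokens(String s) {
        Set<String> out = new HashSet<>();
        for (String t : s.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** Jaccard similarity over tokens: |A ∩ B| / |A ∪ B|, or 0 when both are empty. */
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        if (union.isEmpty()) return 0.0;
        ta.retainAll(tb); // ta now holds the intersection
        return (double) ta.size() / union.size();
    }

    /** Weighted combination from the slide: 0.4 phone + 0.25 name + 0.35 address. */
    static double finalScore(double phone, double name, double address) {
        return 0.4 * phone + 0.25 * name + 0.35 * address;
    }
}
```

With this normalization, "The Windsor Press, Inc." and "The Windsor-Press Inc" both reduce to the same string, so their name confidence under `jaccard` is 1. Because the weights sum to 1, the final score stays in the range 0 to 1 whenever each confidence does.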
  9. 10x speedup via optimizations!
     • De-duping 1 million sample merchants takes about 1 hour (previously took 10 hours)
     • Writing back a sample set of 31 million records into the HBase cluster takes about 30 mins (previously took 4 hours 37 mins)
     • These metrics were calculated on a 20-node Hadoop cluster (HBase installed on 5 nodes)
  10. Optimizations – overall system design
      Idea: partition address matching by US state to allow parallelism
      1. Select the subset of the input table from a particular state (e.g. NY)
      2. Apply matching against a Lucene index that contains only reference data from that state
         – Each single-state Lucene index is small and fits entirely in memory
         – Standardize the addresses, normalize the strings
         – Compare using string distance metrics
      3. Run all 50 states (+ Washington DC, Puerto Rico, etc.)
         – Let Oozie run these in parallel
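The partition-then-parallelize idea can be sketched with a thread pool standing in for Oozie. All names here are hypothetical, the `STATE|record` line format is an assumption made for the example, and `matchOneState` is a placeholder that only counts records where the real job would run the per-state matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: one independent job per US state, run in parallel. */
final class StatePartitionSketch {
    /** Group lines of the assumed form "STATE|rest-of-record" by their state prefix. */
    static Map<String, List<String>> partitionByState(List<String> records) {
        Map<String, List<String>> byState = new TreeMap<>();
        for (String r : records) {
            String state = r.substring(0, r.indexOf('|')); // assumes every line has a '|'
            byState.computeIfAbsent(state, k -> new ArrayList<>()).add(r);
        }
        return byState;
    }

    /** Placeholder for the per-state matching job; here it only counts records. */
    static int matchOneState(String state, List<String> records) {
        return records.size();
    }

    /** Submit one job per state and collect the results once all have finished. */
    static Map<String, Integer> runAll(List<String> records, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<Integer>> pending = new TreeMap<>();
            for (Map.Entry<String, List<String>> e : partitionByState(records).entrySet()) {
                Callable<Integer> job = () -> matchOneState(e.getKey(), e.getValue());
                pending.put(e.getKey(), pool.submit(job));
            }
            Map<String, Integer> results = new TreeMap<>();
            for (Map.Entry<String, Future<Integer>> e : pending.entrySet()) {
                results.put(e.getKey(), e.getValue().get());
            }
            return results;
        } catch (InterruptedException | ExecutionException ex) {
            throw new IllegalStateException("state job failed", ex);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each state's records and reference index are disjoint from every other state's, the jobs share no mutable data and need no coordination beyond waiting for all of them to finish.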
  11. Optimizations – HBase config
      Set caching parameters to make our full table scans faster
        scan.setCaching(500);
      – transfers 500 rows at a time to the client to be processed
      – scanner timeout exceptions are possible if you set it too high
        scan.setCacheBlocks(false);
      – avoids block cache churn
        [setting name missing from the transcript] = 10 minutes
      – clients must report in within this period, else they are considered dead
  12. Optimizations – code level
      Cache frequently used column family and column names as immutable byte arrays in a public interface

        public static final byte[] COLUMN_NAME = Bytes.toBytes("name");
        public static final byte[] COLUMN_FAMILY_INFO = Bytes.toBytes("info");

      • Improves readability
      • Minor runtime performance improvement
  13. Best practices – Hadoop interfacing
      • For Hadoop jobs interfacing with HBase, used TableMapReduceUtil
        – On the input side (source) as well as the output side (sink)
        – Instead of doing a regular input split
      • When writing to an HBase table, emitted a 'put' from the Mapper or Reducer instead of a regular HTable put
        – Use context.write(rowKey, put)
        – Much faster than doing an HTable.put(), even for a bulk put
  14. Best practices – readability, maintainability
      Client gets values out of Result via convenience methods:

        String val = HBaseUtils.getColumnValue(result,
            COLUMN_FAMILY_INFO, COLUMN_NAME);

        Double lat = HBaseUtils.getDoubleColumnValue(result,
            COLUMN_FAMILY_INFO, COLUMN_LATITUDE);

        Long sicCode = HBaseUtils.getLongColumnValue(result,
            COLUMN_FAMILY_INFO, COLUMN_SIC);
  15. Best practices – HBaseUtils implementation

        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseUtils {
            // Reads one cell and decodes its bytes as a string.
            public static String getColumnValue(Result result, byte[] type, byte[] columnName) {
                return Bytes.toString(result.getValue(type, columnName));
            }

            // Returns null when the cell is missing or does not hold a number.
            public static Double getDoubleColumnValue(Result result, byte[] type, byte[] columnName) {
                try {
                    return Double.parseDouble(getColumnValue(result, type, columnName));
                } catch (Exception e) {
                    return null;
                }
            }
        }
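Slide 14 calls `HBaseUtils.getLongColumnValue`, which the implementation slide does not show. The following is a guess at how its parsing half might look, written against a raw `byte[]` (the type that `Result.getValue` returns) so that it runs without HBase on the classpath. The class and method names are hypothetical, and it assumes cell values are stored as UTF-8 strings, as the string-based `getColumnValue` on the slide suggests.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical parsing helpers that mirror the null-on-failure style of the slide. */
final class ColumnParsers {
    /** Decode a cell's raw bytes as UTF-8 text, or null when the cell is absent. */
    static String asString(byte[] cell) {
        return cell == null ? null : new String(cell, StandardCharsets.UTF_8);
    }

    /** Parse a string-encoded long; null when the cell is missing or not numeric. */
    static Long asLong(byte[] cell) {
        String text = asString(cell);
        if (text == null) {
            return null;
        }
        try {
            return Long.parseLong(text.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Returning `null` instead of throwing keeps the calling code short, at the cost of making a missing cell indistinguishable from a malformed one. Callers that care about the difference would need a separate check.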
  16. Thank You!
      Michael J. Radwin
      Twitter: @michael_radwin
  17. MR Workflow (Oozie)
      Diagram: Start → Data Import → Address Standardizer (fork-join) → Name Matcher, Phone Matcher, Address Matcher (fork-join) → Score Combiner → Splicer → End
      Each step moves to the next on OK; a failure routes to Failed.
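The fork-join shape of this workflow can be illustrated in plain Java with `CompletableFuture`. This is not Oozie configuration. The step names are labels only, and the ordering is an assumption based on the numbered architecture diagram on slide 5, since the exact edges cannot be recovered from the transcript.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of the control flow: sequential steps around one fork-join. */
final class WorkflowSketch {
    /** Runs the steps and returns the order in which they completed. */
    static List<String> run() {
        List<String> log = Collections.synchronizedList(new ArrayList<String>());

        // Sequential prefix.
        log.add("import");
        log.add("address-standardizer");

        // Fork: the three matchers are independent, so they may run concurrently.
        CompletableFuture<Void> name = CompletableFuture.runAsync(() -> log.add("name-matcher"));
        CompletableFuture<Void> phone = CompletableFuture.runAsync(() -> log.add("phone-matcher"));
        CompletableFuture<Void> address = CompletableFuture.runAsync(() -> log.add("address-matcher"));

        // Join: wait for all three before combining their scores.
        CompletableFuture.allOf(name, phone, address).join();

        // Sequential suffix.
        log.add("score-combiner");
        log.add("splicer");
        return new ArrayList<>(log);
    }
}
```

The three matcher entries can appear in any order between the standardizer and the combiner, which is the point of the fork: nothing downstream depends on which matcher finishes first, only on all of them having finished.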
  18. Backups via HBase Export
      • Backups are done before a new dataset is added or updates to an existing dataset are applied
      • Master dataset on HBase
        – Backed up before merge
        – Live cluster backup done using HBase Export
        – Data can be reimported using HBase Import