HBaseCon 2012 | HBase powered Merchant Lookup Service at Intuit

Intuit uses HBase for storing comprehensive, de-duplicated, canonical merchant information that powers the backend for a Merchant Lookup Service at Intuit. This service enables users and products to look up business details by various parameters like merchant name, location, and business type. It aims at providing a more complete, canonical business profile by bringing together data from across the various information providers including Intuit’s small business customer base. In this talk, we will describe the Hadoop deduping pipeline, our HBase data model, the challenges faced along the way and our plans to have upcoming projects leverage this data in HBase.

HBaseCon 2012 | HBase powered Merchant Lookup Service at Intuit

  1. HBase powered Merchant Lookup Service at Intuit. Vrushali Channapattan, Intuit. Lightning Talk @ HBaseCon2012 (May 22nd, 2012)
  2. About Intuit: Intuit is a leader in this trend because we are entrusted with the collective data of our 50 million customers.
  3. Problem: Duplicate Merchants
     Company ABC: name: The Windsor Press, Inc.; address: PO Box 465, 6 North Third Street; city: Hamburg; state: PA; zip: 19526; phone: (610) 562-2267
     Company PQR: name: The Windsor Press; address: P.O. Box 465, 6 North 3rd St.; city: Hamburg; state: PA; zip: 19526-0465; phone: (610) 562-2267
     Both of the above vendor records map to the Dun & Bradstreet (D&B) business: ID: 002114902; Name: The Windsor-Press Inc; Street: 6 N 3rd St; City: Hamburg; State: PA; Zip: 19526-1502; Phone: (610)-562-2267
  4. Applications of Merchant Lookup
  5. Applications of Merchant Lookup
  6. Backend Architecture (diagram; labels: Input Data, Loader, Applications, Internal Research Projects, Update Merchant, Splicer, Full Table Scan, Various Matchers (Name, Phone, Address), Individual Scores, Score Combiner, Final Match Score)
  7. Data Model: Tables in HBase
     • Merchants: master dataset of merchants
     • Sangria_id: unique id generation coordination across mapper processes
     • Duplicates: noting duplicate merchants after deduplication
     • SnapshotMerchants: merging into master dataset
     • NewMerchants: the new merchant set that is to be added to the master dataset of merchants
  8. 8. Schema Merchants Row key Info (column family) Mapping (column family) 25204939 name:Crepevine sourcename:10000048, street:367 University Avenue 10000075 city:Palo Alto state:CA zip:94031 county:Santa Clara County country: United States of America website:www.crepevine.com phoneNumber:16503233900 latitude:37.430211 longitude:-122.098221 source:internet mint_category:Food & Dining qbo_category:Restaurants NAICS:722110 SIC:5182 8
  9. Schema: Sangria_id and Duplicates
     Sangria_id (row key, then Info column family):
       default: seed:30000; comment:initial seed by vc of 1000
       qbo: seed:20550000; comment:initial seed by kf of 20000000
     Duplicates (row key, then Info column family):
       10000043: 25204921:0.998
       10000048: 25204939:0.78
       10000075: 25204939:0.95
  10. Optimizations (job level)
      • For Hadoop jobs interfacing with HBase, used TableMapReduceUtil.
      • Emitted a Put from the Mapper or Reducer instead of issuing a regular HTable put: context.write(rowKey, put)
      • To make full table scans faster in read-only Hadoop jobs over HBase (deduping matchers, Solr index generator):
          scan.setCaching(500);
          scan.setCacheBlocks(false);
      • Used a customized TableInputFormat while scanning, to set a custom number of splits for map tasks:
          job.setInputFormatClass(CustomizedTableInputFormat.class);
        CustomizedTableInputFormat extends TableInputFormat and overrides the getSplits method.
  11. Optimizations (code level)
      • Stored frequently used column family and column names as byte arrays in a public interface:
          public static final byte[] COLUMN_NAME = Bytes.toBytes("name");
          public static final byte[] COLUMN_FAMILY_INFO = Bytes.toBytes("info");
      • Utility class for getting values from hbase.client.Result:
          HBaseUtils.getColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_NAME);
          public static String getColumnValue(Result result, byte[] type, byte[] columnName) {
            return Bytes.toString(result.getValue(type, columnName));
          }
      • Writing a sample set of 31 million records into the HBase cluster went from 4 hours 37 mins 47 secs to 32 mins 18 secs.
  12. Thank You! Vrushali Channapattan, Intuit Data Group (BIO), vrushali_channapattan@intuit.com
  13. Schema: SnapshotMerchants and NewMerchants
      SnapshotMerchants (row key, then Info column family):
        merge: first:1336813613; start:1337029113; end:1337120100; comments:merging qbo against dandb merchants initiated on May 14th 2012; outcome:started (or) merge run successful
      NewMerchants: same as Merchants
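
The slides name individual matchers (name, phone, address) whose scores a combiner turns into a final match score, but they do not say how. The Java sketch below is one possible reading, not Intuit's implementation: the class name, the normalization rules, the ZIP comparison standing in for an address matcher, and the weights are all assumptions made for illustration. It uses the Windsor Press records from slide 3 as sample input.

```java
import java.util.Locale;

/** Illustrative only: names, weights, and normalization rules are assumptions, not Intuit's code. */
final class MerchantMatchSketch {

    // Hypothetical weights for combining individual matcher scores; they sum to 1.0.
    static final double NAME_WEIGHT = 0.4;
    static final double PHONE_WEIGHT = 0.4;
    static final double ZIP_WEIGHT = 0.2;

    /** Keeps digits only, so "(610) 562-2267" and "(610)-562-2267" compare equal. */
    static String normalizePhone(String phone) {
        return phone.replaceAll("[^0-9]", "");
    }

    /** Lowercases, turns punctuation into single spaces, and drops a trailing "inc". */
    static String normalizeName(String name) {
        String cleaned = name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
        return cleaned.replaceAll("\\s+inc$", "");
    }

    /** Reduces a ZIP code to its first five digits, ignoring any +4 suffix. */
    static String zip5(String zip) {
        String digits = zip.replaceAll("[^0-9]", "");
        return digits.length() > 5 ? digits.substring(0, 5) : digits;
    }

    /** Individual matcher (simplified): 1.0 when the normalized values agree, else 0.0. */
    static double exactScore(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }

    /** Score combiner (assumed form): weighted sum of the individual matcher scores. */
    static double combine(double nameScore, double phoneScore, double zipScore) {
        return NAME_WEIGHT * nameScore + PHONE_WEIGHT * phoneScore + ZIP_WEIGHT * zipScore;
    }

    /** Final match score for two records, each given as (name, phone, zip). */
    static double matchScore(String nameA, String phoneA, String zipA,
                             String nameB, String phoneB, String zipB) {
        double nameScore = exactScore(normalizeName(nameA), normalizeName(nameB));
        double phoneScore = exactScore(normalizePhone(phoneA), normalizePhone(phoneB));
        double zipScore = exactScore(zip5(zipA), zip5(zipB));
        return combine(nameScore, phoneScore, zipScore);
    }

    public static void main(String[] args) {
        // Company ABC's record from slide 3 compared against the D&B record.
        double score = matchScore("The Windsor Press, Inc.", "(610) 562-2267", "19526",
                                  "The Windsor-Press Inc", "(610)-562-2267", "19526-1502");
        System.out.println("final match score = " + score);
    }
}
```

A real matcher would presumably return graded similarity values rather than 0 or 1, which is consistent with the fractional scores shown in the Duplicates table (for example 0.78 and 0.95); the exact-match scoring here only keeps the sketch short.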

Editor's Notes

  • 9,223,372,036,854,775,807
  • 20550000