HBaseCon 2012 | HBase powered Merchant Lookup Service at Intuit


Intuit uses HBase for storing comprehensive, de-duplicated, canonical merchant information that powers the backend for a Merchant Lookup Service at Intuit. This service enables users and products to look up business details by various parameters like merchant name, location, and business type. It aims at providing a more complete, canonical business profile by bringing together data from across the various information providers including Intuit’s small business customer base. In this talk, we will describe the Hadoop deduping pipeline, our HBase data model, the challenges faced along the way and our plans to have upcoming projects leverage this data in HBase.

Published in: Technology, Education

  1. HBase powered Merchant Lookup Service at Intuit. Vrushali Channapattan, Intuit. Lightning Talk @ HBaseCon 2012 (May 22nd, 2012)
  2. About Intuit: Intuit is a leader in this trend because we are entrusted with the collective data of our 50 million customers.
  3. Problem: Duplicate Merchants
     Company ABC:
       name: The Windsor Press, Inc.
       address: PO Box 465 6 North Third Street
       city: Hamburg
       state: PA
       zip: 19526
       phone: (610) 562-2267
     Company PQR:
       name: The Windsor Press
       address: P.O. Box 465 6 North 3rd Hamburg
       state: PA
       zip: 19526-0465
       phone: (610) 562-2267
     Both of the above vendor records map to the D&B (Dun & Bradstreet) business:
       ID: 002114902
       Name: The Windsor-Press Inc
       Street: 6 N 3rd St
       City: Hamburg
       State: PA
       Zip: 19526-1502
       Phone: (610)-562-2267
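
The three records above differ only in punctuation, abbreviations, and ZIP+4 suffixes. A minimal sketch of the kind of field normalization that makes such records comparable is shown below. The class `MerchantNormalizer` and its rules are hypothetical illustrations, not the normalization used in the talk's pipeline.

```java
import java.util.Locale;

// Hypothetical helper: the talk does not describe its normalization rules.
final class MerchantNormalizer {
    // Keep only digits, so "(610) 562-2267" and "(610)-562-2267" compare equal.
    static String phoneDigits(String phone) {
        return phone.replaceAll("\\D", "");
    }

    // Reduce ZIP or ZIP+4 to the 5-digit prefix, so "19526-0465" becomes "19526".
    static String zip5(String zip) {
        String digits = zip.replaceAll("\\D", "");
        return digits.length() >= 5 ? digits.substring(0, 5) : digits;
    }

    // Lower-case and collapse punctuation/whitespace runs into single spaces.
    static String name(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
    }
}
```

With these rules, the phone numbers and 5-digit ZIP codes of all three records agree, and the names differ only by the trailing "inc" token, which a name matcher could then score as a near match.
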
  4. Applications of Merchant Lookup
  5. Applications of Merchant Lookup
  6. Backend Architecture
     [Architecture diagram. Labels shown: Input Data, Loader, Applications, Internal Research Projects, Update Merchant, Splicer, Full Table Scan, Various Matchers (Name, Phone, Address), Individual Scores, Score Combiner, Final Match Score]
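
The architecture slide shows individual name, phone, and address matcher scores feeding a score combiner that yields a final match score. The slides do not say how the scores are combined. The sketch below assumes a weighted average; the class `ScoreCombiner`, the matcher names used as keys, and any weights are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical combiner: a weighted average is an assumption, not the talk's stated method.
final class ScoreCombiner {
    private final Map<String, Double> weights = new LinkedHashMap<>();

    // Register a matcher (e.g. "name") with a non-negative weight.
    ScoreCombiner weight(String matcher, double w) {
        if (w < 0) {
            throw new IllegalArgumentException("weight must be non-negative");
        }
        weights.put(matcher, w);
        return this;
    }

    // Weighted average over the matchers that produced a score.
    // Matchers with no score are skipped, so the remaining weights are renormalized.
    double combine(Map<String, Double> scores) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            Double score = scores.get(e.getKey());
            if (score == null) {
                continue;
            }
            weightedSum += e.getValue() * score;
            totalWeight += e.getValue();
        }
        return totalWeight == 0.0 ? 0.0 : weightedSum / totalWeight;
    }
}
```

Skipping missing scores rather than treating them as zero is one possible design choice: a record without a phone number is then not penalized as if its phone number had mismatched.
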
  7. Data Model: Tables in HBase
     • Merchants: master dataset of merchants
     • Sangria_id: unique id generation coordination across mapper processes
     • Duplicates: noting duplicate merchants after deduplication
     • SnapshotMerchants: merging into master dataset
     • NewMerchants: the new merchant set that is to be added to the master dataset of merchants
  8. 8. Schema Merchants Row key Info (column family) Mapping (column family) 25204939 name:Crepevine sourcename:10000048, street:367 University Avenue 10000075 city:Palo Alto state:CA zip:94031 county:Santa Clara County country: United States of America phoneNumber:16503233900 latitude:37.430211 longitude:-122.098221 source:internet mint_category:Food & Dining qbo_category:Restaurants NAICS:722110 SIC:51828
  9. Schema: Sangria_id
     Row key: default
       Info (column family): seed:30000, comment:initial seed by vc of 1000
     Row key: qbo
       Info (column family): seed:20550000, comment:initial seed by kf of 20000000
     Schema: Duplicates
     Row key: 10000043
       Info (column family): 25204921:0.998
     Row key: 10000048
       Info (column family): 25204939:0.78
     Row key: 10000075
       Info (column family): 25204939:0.95
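
The Sangria_id table is described as coordinating unique id generation across mapper processes, with one seed per source (default, qbo). The slides do not show the mechanism. One common approach, assumed here, is for each mapper to reserve a block of ids by atomically advancing the shared seed and then hand out ids from its block locally. In the sketch below, `IdBlockAllocator` and the block size are hypothetical, and an `AtomicLong` stands in for the seed cell; against HBase, an atomic increment on the seed cell could play the same role.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: block reservation is an assumed mechanism, not the talk's stated one.
final class IdBlockAllocator {
    private final AtomicLong sharedSeed; // stand-in for the seed cell in Sangria_id
    private final int blockSize;
    private long next;   // next id to hand out from the current block
    private long limit;  // exclusive end of the current block

    IdBlockAllocator(AtomicLong sharedSeed, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.sharedSeed = sharedSeed;
        this.blockSize = blockSize;
    }

    // Returns a unique id; touches the shared seed only once per block.
    synchronized long nextId() {
        if (next == limit) {
            next = sharedSeed.getAndAdd(blockSize); // atomically reserve [next, next + blockSize)
            limit = next + blockSize;
        }
        return next++;
    }
}
```

Two mappers sharing a seed of 30000 with a block size of 1000 would draw ids from 30000 and 31000 respectively, so they never collide and contact the shared seed only once per 1000 ids.
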
  10. Optimizations (job level)
      • For Hadoop jobs interfacing with HBase, used TableMapReduceUtil
      • Emitted a ‘put’ from the Mapper or Reducer instead of a regular HTable put
        – Use context.write(rowKey, put)
      • To make full table scans faster in the read-only HBase Hadoop jobs (deduping matchers, Solr index generator):
          scan.setCaching(500);
          scan.setCacheBlocks(false);
      • Used a customized TableInputFormat while scanning, to get a custom number of splits for map tasks:
          job.setInputFormatClass(CustomizedTableInputFormat.class);
        CustomizedTableInputFormat extends TableInputFormat and overrides the getSplits method
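
The slide says the customized input format overrides getSplits to produce a custom number of splits, but does not show how the splits are computed. The sketch below illustrates only the range arithmetic such an override might perform, assuming numeric row keys like those in the schema slides. `KeyRangeSplitter` is hypothetical and deliberately uses no HBase classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: shows only the key-range arithmetic, not an actual getSplits override.
final class KeyRangeSplitter {
    // Divide [start, end) into at most n contiguous, near-equal ranges.
    // Each element of the result is {inclusiveStart, exclusiveEnd}.
    static List<long[]> split(long start, long end, int n) {
        if (n <= 0 || end <= start) {
            throw new IllegalArgumentException("need n > 0 and end > start");
        }
        long span = end - start;
        int parts = (int) Math.min(n, span); // never create empty ranges
        long base = span / parts;
        long extra = span % parts;           // the first 'extra' ranges get one more key
        List<long[]> ranges = new ArrayList<>();
        long lo = start;
        for (int i = 0; i < parts; i++) {
            long hi = lo + base + (i < extra ? 1 : 0);
            ranges.add(new long[] {lo, hi});
            lo = hi;
        }
        return ranges;
    }
}
```

Each resulting range would correspond to one map task's scan boundaries, so the number of ranges controls the number of mappers independently of the number of regions.
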
  11. Optimizations (code level)
      • Storing frequently used column family and column names as byte arrays in a public interface:
          public static final byte[] COLUMN_NAME = Bytes.toBytes("name");
          public static final byte[] COLUMN_FAMILY_INFO = Bytes.toBytes("info");
      • Utility class for getting values from hbase.client.Result:
          HBaseUtils.getColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_NAME);
          public static String getColumnValue(Result result, byte[] type, byte[] columnName) {
              return Bytes.toString(result.getValue(type, columnName));
          }
      • Writing a sample set of 31 million records into the HBase cluster went from 4 hours 37 mins 47 secs to 32 mins 18 secs
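
The load times quoted on this slide work out to roughly an 8.6x speedup. The short sketch below reproduces that arithmetic from the slide's figures; the class `LoadSpeedup` and its method names are hypothetical.

```java
import java.time.Duration;

// Hypothetical helper that reproduces the arithmetic behind the slide's timing figures.
final class LoadSpeedup {
    // Ratio of the old run time to the new run time.
    static double speedup(Duration before, Duration after) {
        return (double) before.getSeconds() / after.getSeconds();
    }

    // Average write throughput over a whole run.
    static double recordsPerSecond(long records, Duration elapsed) {
        return (double) records / elapsed.getSeconds();
    }

    public static void main(String[] args) {
        Duration before = Duration.ofHours(4).plusMinutes(37).plusSeconds(47); // 16667 s
        Duration after = Duration.ofMinutes(32).plusSeconds(18);               // 1938 s
        System.out.printf("speedup: %.1fx%n", speedup(before, after));
        System.out.printf("before: %.0f records/s%n", recordsPerSecond(31_000_000L, before));
        System.out.printf("after: %.0f records/s%n", recordsPerSecond(31_000_000L, after));
    }
}
```
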
  12. Thank You! Vrushali Channapattan, Intuit Data Group (BIO), vrushali_channapattan@intuit.com
  13. Schema: SnapshotMerchants
      Row key: merge
      Info (column family):
        first:1336813613
        start:1337029113
        end:1337120100
        comments:merging qbo against dandb merchants initiated on May 14th 2012
        outcome:started (or) merge run successful
      NewMerchants: same as Merchants
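
The first, start, and end values in the SnapshotMerchants row look like Unix timestamps in seconds. That reading is an assumption, though the start value does decode to May 14th 2012 (UTC), which agrees with the row's comment. A minimal sketch for decoding them and computing the length of a merge run follows; the class `MergeWindow` is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper; assumes the stored values are epoch seconds.
final class MergeWindow {
    // Interpret a stored value as seconds since 1970-01-01T00:00:00Z.
    static Instant toInstant(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds);
    }

    // Wall-clock length of a merge run from its start and end values.
    static Duration elapsed(long startEpochSeconds, long endEpochSeconds) {
        return Duration.ofSeconds(endEpochSeconds - startEpochSeconds);
    }
}
```

Under that assumption, the merge recorded on the slide started at 2012-05-14T20:58:33Z and ran for 25 hours, 16 minutes, and 27 seconds.
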