Intuit uses HBase to store comprehensive, de-duplicated, canonical merchant information that powers the backend of its Merchant Lookup Service. This service enables users and products to look up business details by parameters such as merchant name, location, and business type. It aims to provide a more complete, canonical business profile by bringing together data from various information providers, including Intuit’s small business customer base. In this talk, we will describe the Hadoop deduping pipeline, our HBase data model, the challenges faced along the way, and our plans to have upcoming projects leverage this data in HBase.
HBaseCon 2012 | HBase powered Merchant Lookup Service at Intuit
1. HBase powered Merchant Lookup Service at Intuit
Vrushali Channapattan, Intuit
Lightning Talk @ HBaseCon2012 (May 22nd, 2012)
2. About Intuit
Intuit is a leader in this trend because we are entrusted with the collective data of our 50 million customers.
3. Problem: Duplicate Merchants
            Company ABC                        Company PQR
name:       The Windsor Press, Inc.            The Windsor Press
address:    PO Box 465 6 North Third Street    P.O. Box 465 6 North 3rd St.
city:       Hamburg                            Hamburg
state:      PA                                 PA
zip:        19526                              19526-0465
phone:      (610) 562-2267                     (610) 562-2267
Both of the above vendor records map to the same Dun & Bradstreet (D&B) business:
ID: 002114902
Name: The Windsor-Press Inc
Street: 6 N 3rd St
City: Hamburg
State: PA
Zip: 19526-1502
Phone: (610)-562-2267
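The three records above differ only in punctuation, legal suffixes, and formatting. As a minimal sketch of why normalization helps before matching, the Java below canonicalizes names, phone numbers, and zip codes. The class `MerchantNormalizer`, its method names, and the suffix list are hypothetical and are not taken from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not from the talk): canonicalizes fields before comparison.
final class MerchantNormalizer {
    // Assumed list of legal suffixes to ignore when comparing names.
    private static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("inc", "llc", "corp"));

    // Lowercases, treats punctuation as whitespace, and drops legal suffixes.
    static String normalizeName(String name) {
        String[] tokens = name.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!token.isEmpty() && !SUFFIXES.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    // Keeps only the digits of a phone number.
    static String normalizePhone(String phone) {
        return phone.replaceAll("[^0-9]", "");
    }

    // Keeps only the first five digits of a US zip code.
    static String zip5(String zip) {
        String digits = zip.replaceAll("[^0-9]", "");
        return digits.length() >= 5 ? digits.substring(0, 5) : digits;
    }

    public static void main(String[] args) {
        System.out.println(normalizeName("The Windsor Press, Inc."));
        System.out.println(normalizeName("The Windsor-Press Inc"));
        System.out.println(normalizePhone("(610)-562-2267"));
        System.out.println(zip5("19526-0465"));
    }
}
```

With this normalization, the names, phone numbers, and five-digit zip codes of all three records compare equal, while the street addresses would still need abbreviation handling (for example "Third" versus "3rd").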
6. Backend Architecture
[Architecture diagram. Recoverable labels: Input Data, Loader, Update Merchant, Splicer, Full table Scan, Various Matchers (Name, Phone, Address), Matcher, Individual Score, Scores Combiner, Final Match Score, Applications, Internal Research Projects]
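The diagram names per-field matchers (Name, Phone, Address) that each produce an individual score, and a Scores Combiner that produces a final match score. The talk does not say how the scores are computed or combined, so the following is only a sketch under stated assumptions: token overlap (Jaccard) for name and address, exact digit match for phone, and a weighted average with made-up weights. All names and weights here are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: class name, weights, and scoring rules are assumptions.
final class MatchScoreSketch {
    // Assumed weights that sum to 1.0, so the final score stays in [0, 1].
    static final double NAME_WEIGHT = 0.5;
    static final double PHONE_WEIGHT = 0.3;
    static final double ADDRESS_WEIGHT = 0.2;

    // Splits a string into lowercase alphanumeric tokens.
    private static Set<String> tokens(String s) {
        Set<String> out = new HashSet<>(
                Arrays.asList(s.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")));
        out.remove("");
        return out;
    }

    // Jaccard similarity of the two token sets; 0.0 when both are empty.
    static double tokenOverlap(String a, String b) {
        Set<String> sa = tokens(a);
        Set<String> sb = tokens(b);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(sa);
        common.retainAll(sb);
        return (double) common.size() / union.size();
    }

    // 1.0 when both numbers have the same non-empty digit string, else 0.0.
    static double phoneScore(String a, String b) {
        String da = a.replaceAll("[^0-9]", "");
        String db = b.replaceAll("[^0-9]", "");
        return !da.isEmpty() && da.equals(db) ? 1.0 : 0.0;
    }

    // Weighted average of the individual matcher scores.
    static double combine(double name, double phone, double address) {
        return NAME_WEIGHT * name + PHONE_WEIGHT * phone + ADDRESS_WEIGHT * address;
    }

    public static void main(String[] args) {
        double name = tokenOverlap("The Windsor Press, Inc.", "The Windsor Press");
        double phone = phoneScore("(610) 562-2267", "(610) 562-2267");
        double address = tokenOverlap("PO Box 465 6 North Third Street",
                "P.O. Box 465 6 North 3rd St.");
        System.out.println(combine(name, phone, address));
    }
}
```

Keeping each matcher independent of the combiner means a new matcher can be added, or the weights retuned, without touching the others.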
7. Data Model - Tables in HBase
• Merchants: master dataset of merchants
• Sangria_id: coordinates unique id generation across mapper processes
• Duplicates: records duplicate merchants found by deduplication
• SnapshotMerchants: tracks merges into the master dataset
• NewMerchants: the new merchant set that is to be added to the master dataset of merchants
8. Schema
Merchants
Row key: 25204939
Info (column family):
    name:Crepevine
    street:367 University Avenue
    city:Palo Alto
    state:CA
    zip:94031
    county:Santa Clara County
    country:United States of America
    website:www.crepevine.com
    phoneNumber:16503233900
    latitude:37.430211
    longitude:-122.098221
    source:internet
    mint_category:Food & Dining
    qbo_category:Restaurants
    NAICS:722110
    SIC:5182
Mapping (column family):
    sourcename:10000048, 10000075
9. Schema
Sangria_id
Row key     Info (column family)
default     seed:30000
            comment:initial seed by vc of 1000
qbo         seed:20550000
            comment:initial seed by kf of 20000000

Duplicates
Row key     Info (column family)
10000043    25204921:0.998
10000048    25204939:0.78
10000075    25204939:0.95
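The Sangria_id table holds one seed per source so that parallel mappers can obtain unique ids. The talk does not describe the allocation protocol. One common approach, shown here purely as an assumption, is for each mapper to atomically reserve a block of ids from the seed and then hand them out locally. In the sketch below an `AtomicLong` stands in for the seed cell (in HBase an atomic counter call such as `incrementColumnValue` could play that role). `SeedStore`, `Allocator`, and the block size are hypothetical names and values.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of block-based id allocation; not the talk's implementation.
final class IdBlockSketch {

    // Stand-in for the seed cell of one Sangria_id row (for example "qbo").
    static final class SeedStore {
        private final AtomicLong seed;

        SeedStore(long initialSeed) {
            this.seed = new AtomicLong(initialSeed);
        }

        // Atomically reserves blockSize ids and returns the first id of the block.
        long reserve(long blockSize) {
            return seed.getAndAdd(blockSize);
        }

        long current() {
            return seed.get();
        }
    }

    // Per-mapper allocator: touches the shared seed only once per block.
    static final class Allocator {
        private final SeedStore store;
        private final long blockSize;
        private long next;
        private long limit;

        Allocator(SeedStore store, long blockSize) {
            this.store = store;
            this.blockSize = blockSize;
        }

        long nextId() {
            if (next == limit) {
                // Local block exhausted (or never fetched): reserve a new one.
                next = store.reserve(blockSize);
                limit = next + blockSize;
            }
            return next++;
        }
    }
}
```

Two allocators sharing one `SeedStore` never return the same id, at the cost of leaving gaps when a mapper finishes before using its whole block.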
10. Optimizations (job level)
• For Hadoop jobs interfacing with HBase, used TableMapReduceUtil
• Emitted a Put from the Mapper or Reducer instead of calling HTable.put directly
    – Use context.write(rowKey, put)
• To make full table scans faster (HBase read-only Hadoop jobs: deduping matchers, Solr index generator)
    scan.setCaching(500);
    scan.setCacheBlocks(false);
• Used a customized TableInputFormat while scanning (custom number of splits for map tasks)
    job.setInputFormatClass(CustomizedTableInputFormat.class);
    CustomizedTableInputFormat extends TableInputFormat and overrides the getSplits method
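The slide does not show how the overridden getSplits chooses its splits. As a rough illustration of the underlying idea only, the sketch below divides a row-key range into a requested number of contiguous sub-ranges, each of which could back one map task. It assumes numeric row keys, which is a simplification since HBase row keys are byte arrays, and `SplitSketch` is a hypothetical name. A real getSplits override would wrap each sub-range in an input split object instead of returning plain arrays.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits the key range [start, end) into up to n sub-ranges.
final class SplitSketch {

    // Returns contiguous {startInclusive, endExclusive} pairs that cover [start, end).
    static List<long[]> split(long start, long end, int n) {
        if (n <= 0 || end <= start) {
            throw new IllegalArgumentException("need n > 0 and start < end");
        }
        List<long[]> ranges = new ArrayList<>();
        long span = end - start;
        for (int i = 0; i < n; i++) {
            // Assumes span * n fits in a long, which holds for small key ranges.
            long lo = start + span * i / n;
            long hi = start + span * (i + 1) / n;
            if (hi > lo) {
                // Skip empty ranges when n exceeds the number of keys.
                ranges.add(new long[] {lo, hi});
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(10000000L, 30000000L, 4)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Choosing the number of sub-ranges independently of the region count lets the job tune map-task parallelism for a full table scan.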
11. Optimizations (code level)
• Stored frequently used column family and column names as byte arrays in a public interface
    public static final byte[] COLUMN_NAME = Bytes.toBytes("name");
    public static final byte[] COLUMN_FAMILY_INFO = Bytes.toBytes("info");
• Utility class for getting values from hbase.client.Result
    HBaseUtils.getColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_NAME);
    public static String getColumnValue(Result result, byte[] type, byte[] columnName) {
        return Bytes.toString(result.getValue(type, columnName));
    }
• Writing a sample set of 31 million records into the HBase cluster dropped from 4 hours 37 mins 47 secs to 32 mins 18 secs
12. Thank You!
Vrushali Channapattan, Intuit Data Group (BIO)
vrushali_channapattan@intuit.com
13. Schema
SnapshotMerchants
Row key     Info (column family)
merge       first:1336813613
            start:1337029113
            end:1337120100
            comments:merging qbo against dandb merchants initiated on May 14th 2012
            outcome:started (or) merge run successful

NewMerchants: same schema as Merchants