Standardizing +113 million Merchant Names in Financial Services with Greenplum Hadoop

Applied Analytics with Greenplum Hadoop:

Standardizing +113 million Merchant Names
with RegEx and Fuzzy Matching

Ian Andrews
Mike Goddard
© Copyright 2012 EMC Corporation. All rights reserved.

Greenplum, A Division of EMC
• 10 years of experience building and supporting enterprise-class massively
parallel data processing software based on open source technology
• Silicon Valley-based core engineering talent from
Yahoo!, Teradata, Oracle, Amazon, Microsoft, IBM, etc.
• 1,000 (and growing) personnel focused on Greenplum’s Big Data Platform
– Greenplum Database
– Greenplum HD (Hadoop)
– Chorus
– Data Computing Appliances
– Data Scientists
– Pivotal Labs
• Fully integrated with EMC’s award-winning global support infrastructure.
• 500+ customers in production globally across all industry segments.
• Established relationships with ecosystem partners:
Informatica, SAS, Talend, Pentaho, MicroStrategy, etc.
• Strategic development relationship with VMware around virtual big data
platforms


Greenplum Unified Analytic Platform


Transaction Data - Merchant Name
Standardization System


Overview of Findings
• Transaction data is difficult to analyze because the merchant
names found in credit and debit data are unstructured
and not standardized, even within a single business entity
• We created a system for cleaning and standardizing
merchant names
– Stage 1: feature extraction
– Stage 2: automated cleanup using regular expressions
– Stage 3: fuzzy matching algorithm
– Stage 4: application of manual rules
• This is an open system, easy to use, extend and modify
• We used the results to do some preliminary analysis on
the transaction data


Background Information -
Credit and Debit Data Overview
Credit Transactions (1)
• 1,396,344 distinct merchant names
• 16,554,889 credit transactions ($1,979,801,143.50)
• 161,931 households with credit transaction
• Min: -$32,585
• Max: $99,000
• Average: $120
• Std. Deviation: $496
Debit Transactions
• 2,598,462 distinct merchant names
• 96,658,020 debit transactions ($3,471,084,518.72)
• 435,615 households with debit transaction
• Min: $0.01
• Max: $39,404
• Average: $36
• Std. Deviation: $89
% # transactions: Credit 14.62%, Debit 85.38%
% sum transactions: Credit 36.32%, Debit 63.68%

(1) Excludes 13 SIC codes in the depository institution activity group
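The two percentage splits follow directly from the transaction counts and dollar totals reported on this slide. A minimal check in Python (the variable and function names are illustrative, not from the deck):

```python
# Counts and dollar totals as reported on the slide.
credit_txns, debit_txns = 16_554_889, 96_658_020
credit_spend, debit_spend = 1_979_801_143.50, 3_471_084_518.72


def share(part: float, other: float) -> float:
    """Return `part` as a percentage of (part + other), rounded to 2 places."""
    return round(100 * part / (part + other), 2)


credit_txn_share = share(credit_txns, debit_txns)
credit_spend_share = share(credit_spend, debit_spend)
print(credit_txn_share, credit_spend_share)
```

Credit accounts for roughly one in seven transactions but more than a third of the dollars, which matches the much higher average credit ticket ($120 versus $36).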


Why standardize merchant names?
• Because the same business is recorded under many different
names across locations, a single business entity appears as
many merchants in the database
• Examples

WAL-MART                   PAYPAL                 STARBUCKS
WALMART PORTRAITS 23093    PAYPAL *SACCAR.COM     STARBUCKSSTORE.COM-USD
WAL-MART #2366 SE2         PAYPAL *BRICKSUPPLY    STARBUCKS CORP00034488
WAL-MART STORE#1041        PAYPAL *BRETT2010FL    SS-STARBUCKS
WAL-MART SUPERCENTER 20    PAYPAL *UNITED T1      STARBUCKS J10431542
WAL MART LINCOLN           PAYPAL *TL5354         STARBUCKS C #112201505
WALMART.COM RELOAD         PAYPAL *CAR-KIT.COM    STARBUCKS WEST30081525


Examples of names passing through the
merchant name standardization system

Original: GIANT FOOD #089
Stage 1, Features:
Length: 14
1st White Space: 6
1st Special Character: 12
1st Digit: 13
Stage 2, Regex:
[^(?-i)a-z]
Remove all numbers (0-9), white space, & special characters
Stage 3, Fuzzy Matching:
1016 (count of <170 GIANTFOOD matches)
Stage 4, Manual Override:
None
Final Results:
GIANTFOOD

Original: PETSMART INC 1963
Stage 1, Features:
Length: 17
1st White Space: 9
Business Suffix: 10
1st Digit: 14
Stage 2, Regex:
[^(?-i)a-z]|( INC )$
Remove all numbers (0-9), white space, special characters, & remove business suffix
Stage 3, Fuzzy Matching:
PETSMART FOUND (Not run)
Stage 4, Manual Override:
None
Final Results:
PETSMART
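Stages 1 and 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the deck's implementation: the suffix list, the function names, and the use of Python's `re` module (rather than the regex dialect shown on the slide) are all hypothetical. Positions are 1-based to match the slide.

```python
import re

# Hypothetical list of business suffixes; the deck only shows INC.
SUFFIX = re.compile(r"\b(INC|LLC|CORP|CO|LTD)\b")


def first_pos(pattern, name):
    """1-based position of the first match of `pattern`, or None if absent."""
    m = re.search(pattern, name)
    return m.start() + 1 if m else None


def extract_features(name):
    """Stage 1: positional features of the raw merchant string."""
    return {
        "length": len(name),
        "first_white_space": first_pos(r"\s", name),
        "first_special_char": first_pos(r"[^A-Za-z0-9\s]", name),
        "first_digit": first_pos(r"\d", name),
        "business_suffix": first_pos(SUFFIX, name),
    }


def clean(name):
    """Stage 2: drop business suffixes, then everything that is not A-Z."""
    name = SUFFIX.sub(" ", name.upper())
    return re.sub(r"[^A-Z]", "", name)


print(extract_features("PETSMART INC 1963"))
print(clean("GIANT FOOD #089"), clean("PETSMART INC 1963"))
```

Removing the suffix before stripping non-letters matters: once white space is gone, `INC` can no longer be told apart from the letters of the merchant name itself.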


Example Results - STARBUCKS
Pre-Standardization          Post-Standardization
STARBUCKS DELI20371514       STARBUCKS
STARBUCKS-ARIFJAN CAMP2      STARBUCKS
STARBUCKS C #112201505       STARBUCKS
STARBUCKS USA 00115832       STARBUCKS
STARBUCK'S CAFE CROWNE       STARBUCKS
STARBUCKS CORP00134759       STARBUCKS
ATL MED CTR STARBUCKS        STARBUCKS
T3 N STARBUCKS30031512       STARBUCKS
STARBUCKS COFEE              STARBUCKS
STARBUCKS LA ISLA            STARBUCKS
OMNI FT WORTH - STARBUCKS    STARBUCKS
ST. RITA'S STARBUCKS         STARBUCKS
MGM GRND STARBUCKS-CASINO    STARBUCKS
006 STARBUCKS AMR            STARBUCKS
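The deck does not name the fuzzy matching algorithm used in Stage 3. As a stand-in, the sketch below uses `difflib.get_close_matches` from the Python standard library, followed by a Stage 4 lookup in a table of manual rules. The function name, the 0.6 cutoff, the canonical list, and the override rule are all assumptions made for illustration.

```python
from difflib import get_close_matches


def standardize(cleaned, canonical, overrides=None, cutoff=0.6):
    """Map a Stage 2 cleaned name to a canonical merchant name.

    Stage 3: skip fuzzy matching when the cleaned name is already
    canonical, otherwise take the closest canonical name above `cutoff`.
    Stage 4: a manual rule keyed on the cleaned name overrides the result.
    """
    if cleaned in canonical:
        result = cleaned  # exact hit, fuzzy matching not run
    else:
        matches = get_close_matches(cleaned, canonical, n=1, cutoff=cutoff)
        result = matches[0] if matches else cleaned
    return (overrides or {}).get(cleaned, result)


canonical = ["STARBUCKS", "WALMART", "PAYPAL"]
# Hypothetical manual rule, not one taken from the deck.
overrides = {"WMSUPERCENTER": "WALMART"}

print(standardize("STARBUCKSCOFEE", canonical))
print(standardize("GIANTFOOD", canonical))
print(standardize("WMSUPERCENTER", canonical, overrides))
```

A similarity ratio penalizes long location prefixes and suffixes, so a production system would likely need a different measure or a lower cutoff for names such as `MGM GRND STARBUCKS-CASINO`; the manual rules in Stage 4 cover whatever the automated stages miss.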


90% of all transactions occur at 7% of the
merchants
Company Name    Total Transactions
MCDONALDS       4,309,728
SPEEDWAY        2,032,474
WALMART         1,606,446
KROGER          1,564,819
SHELLOIL        1,546,056
SHEETZ          1,358,977
SUBWAY          1,280,037
REDBOX          1,236,148
EXXONMOBIL      1,205,451
WAWA            1,197,711
SUNO            1,180,799
WENDYS          1,066,628
MARATHONOIL     1,050,593
MEIJER          1,017,998
STARBUCKS       1,002,805

Gini Coefficient = 0.9447
• 0 represents equality
• 1 represents all transactions at 1 merchant
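For reference, a Gini coefficient over per-merchant transaction counts can be computed with the standard rank-weighted formula. The sketch below is a generic implementation, not the deck's code, and the function name is illustrative.

```python
def gini(values):
    """Gini coefficient of non-negative totals, one value per merchant.

    Uses G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    with x sorted ascending and i running from 1 to n.
    """
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one merchant with a non-zero total")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([5, 5, 5, 5]))    # every merchant has the same count
print(gini([0, 0, 0, 10]))   # one merchant has everything
```

With n merchants the maximum value is (n - 1) / n, which approaches 1 as the number of merchants grows, so a value near 0.94 over roughly a million merchants indicates very heavy concentration.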


90% of the total spend in 2011 occurred
at top 8.3% of merchants
Company Name     Total spent
WALMART          $87,454,235.66
KROGER           $63,850,902.99
SPEEDWAY         $54,270,752.65
TARGET           $48,086,797.70
MEIJER           $46,716,327.56
WMSUPERCENTER    $46,650,761.15
SHELLOIL         $45,115,993.12
GIANTEAGLE       $44,668,211.07
ATT              $44,497,819.88
VERIZONWRLS      $41,971,943.31
LOWES            $34,952,686.13
SUNO             $34,498,328.42
EXXONMOBIL       $33,695,575.95
MCDONALDS        $30,869,463.74
SHEETZ           $30,273,183.81

Gini Coefficient = 0.9408
• 0 represents equality
• 1 represents all money spent at 1 merchant
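A headline figure such as "90% of spend at the top 8.3% of merchants" is the smallest share of merchants, ranked by total, whose cumulative total reaches the target. A minimal sketch, with a hypothetical function name and toy inputs:

```python
def merchant_share_for(totals, target=0.90):
    """Smallest fraction of merchants (largest first) whose combined
    total reaches `target` of the grand total."""
    ranked = sorted(totals, reverse=True)
    grand = sum(ranked)
    running = 0
    for count, value in enumerate(ranked, start=1):
        running += value
        if running >= target * grand:
            return count / len(ranked)
    return 1.0


# Toy data: one dominant merchant and five small ones.
print(merchant_share_for([95, 1, 1, 1, 1, 1]))
```

The same function applies to transaction counts and to dollar totals; only the input list changes.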


SIC codes alone are problematic; they
can differ greatly across like businesses
• On average, each of the 1,000 most frequently occurring
merchants has ~6 SIC codes associated with its
cleaned merchant name

WALMART TARGET SAFEWAY KROGER AT&T VERIZON T-MOBILE
4814 5310 5411 12 1711 4812 12
4816 5411 5499 5411 2741 4814 4812
5300 5732 5921 5499 3640 4899 5732
5411 8043 5541 4112 5999 5999
6300 8099 5542 5971 7311 7299
… … … 7399 …
Total 31 Total 8 Total 71 Total 10

6 total matches 2 total matches 4 total matches
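Counting SIC codes per cleaned merchant name, and the codes two merchants have in common, is a simple grouping exercise once names are standardized. The sketch below assumes transactions are available as `(cleaned_name, sic_code)` pairs; that record layout and the function name are hypothetical, and the sample rows reuse only a few codes from the table above.

```python
from collections import defaultdict


def sic_codes_by_merchant(records):
    """Collect the distinct SIC codes seen for each cleaned merchant name."""
    codes = defaultdict(set)
    for name, sic in records:
        codes[name].add(sic)
    return dict(codes)


# Illustrative records only; repeated codes collapse into one set entry.
records = [
    ("WALMART", 4814), ("WALMART", 5411), ("WALMART", 5411),
    ("TARGET", 5310), ("TARGET", 5411),
]
by_merchant = sic_codes_by_merchant(records)
shared = by_merchant["WALMART"] & by_merchant["TARGET"]
average = sum(len(c) for c in by_merchant.values()) / len(by_merchant)
print(by_merchant, shared, average)
```

Reading the "total matches" row as the number of codes shared within a group of like businesses is one possible interpretation; the set intersection above computes that quantity for a pair of merchants.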


Relative Value Added segments created by
splitting the population into deciles based on
RVA

• Relative Value Added (RVA) provides an estimated ordinal
ranking of customers using balance and transaction data (a
rough precursor of EVA)
• The RVA was created to provide context for the merchant
name discovery, the distribution of PNC’s products and how
they interact
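The deck does not give the RVA formula, so the sketch below takes a per-customer RVA score as given and shows only the decile split. The function name, the customer ids, and the convention that cohort 1 holds the lowest scores are assumptions.

```python
def assign_deciles(scores):
    """Map each customer id to a cohort 1..10 by rank of RVA score.

    Cohort 1 holds the lowest-scoring tenth (an assumed convention).
    """
    ranked = sorted(scores, key=scores.get)
    n = len(ranked)
    return {cust: (i * 10) // n + 1 for i, cust in enumerate(ranked)}


# Hypothetical scores for 20 customers.
scores = {f"c{i}": float(i) for i in range(20)}
cohorts = assign_deciles(scores)
print(cohorts["c0"], cohorts["c19"])
```

Ranking rather than cutting on fixed score thresholds keeps the cohorts equal in size, which is what makes each cohort's share of the population 10%.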


Segment Profiles
Index: % segment / % population
Target’s marketing to higher income households seems to have worked
Cohort 1 Cohort 2 Cohort 3 Cohort 4 Cohort 5 Cohort 6 Cohort 7 Cohort 8 Cohort 9 Cohort 10
Cellular telephone providers
ATT 1.00 0.86 1.18 1.24 1.14 1.04 0.97 0.91 0.86 0.79
SPRINT 1.75 0.55 1.93 1.72 1.15 0.81 0.67 0.56 0.50 0.36
TMOBILE 1.35 0.95 1.38 1.36 1.06 0.86 0.92 0.81 0.71 0.60
VERIZONWRLS 0.95 0.52 1.18 1.32 1.28 1.11 1.01 0.95 0.90 0.78
Retail stores
SEARSROEBUCK 0.64 1.60 0.60 0.63 0.79 0.90 1.03 1.12 1.25 1.45
TJMAXX 0.68 1.46 0.71 0.66 0.83 0.96 1.02 1.12 1.22 1.32
TARGET 0.72 1.51 0.63 0.69 0.87 1.02 1.11 1.16 1.18 1.12
WALMART 0.82 1.77 0.82 0.82 0.88 0.89 0.92 0.97 1.00 1.11
STAPLES 0.69 1.72 0.71 0.55 0.68 0.88 0.97 1.06 1.19 1.54
STARBUCKS 0.82 0.47 0.81 0.88 1.04 1.21 1.23 1.23 1.19 1.14
PAYPAL 1.13 1.51 1.03 0.86 0.82 0.91 1.00 0.90 0.92 0.93
Groceries
PUBLIX 0.84 3.16 0.35 0.45 0.56 0.72 0.83 0.86 0.94 1.27
MENARDS 0.75 3.66 0.42 0.38 0.55 0.71 0.77 0.93 0.85 0.98
KROGER 0.79 1.13 0.79 0.87 1.00 1.01 1.03 1.10 1.09 1.20
Gas and convenience stores
EXXONMOBIL 1.07 0.93 1.04 1.03 1.01 0.99 1.00 0.96 0.96 1.01
SHEETZ 0.87 0.36 0.91 1.01 0.96 0.96 1.04 1.21 1.37 1.31
SHELLOIL 1.12 1.04 1.03 1.04 1.01 1.01 0.98 0.93 0.93 0.91
SPEEDWAY 1.17 0.90 1.25 1.24 1.16 1.04 0.97 0.87 0.77 0.63
Hotels
HILTON 0.69 1.70 0.49 0.53 0.76 1.02 1.15 1.14 1.16 1.36
RAMADAINN 0.75 2.29 0.40 0.64 0.90 0.88 1.00 1.10 0.90 1.13
RESIDENCEINN 0.92 1.94 0.56 0.73 0.68 0.84 1.00 0.82 0.97 1.55
ROYALINN 0.23 0.87 1.07 0.81 0.99 0.85 0.78 0.49 1.04 2.87


Segment Profiles
Index: % segment / % population
AT&T and Verizon appear to be gaining more high value customers
Cohort 1 Cohort 2 Cohort 3 Cohort 4 Cohort 5 Cohort 6 Cohort 7 Cohort 8 Cohort 9 Cohort 10
Cellular telephone providers
ATT 1.00 0.86 1.18 1.24 1.14 1.04 0.97 0.91 0.86 0.79
SPRINT 1.75 0.55 1.93 1.72 1.15 0.81 0.67 0.56 0.50 0.36
TMOBILE 1.35 0.95 1.38 1.36 1.06 0.86 0.92 0.81 0.71 0.60
VERIZONWRLS 0.95 0.52 1.18 1.32 1.28 1.11 1.01 0.95 0.90 0.78
Retail stores
SEARSROEBUCK 0.64 1.60 0.60 0.63 0.79 0.90 1.03 1.12 1.25 1.45
TJMAXX 0.68 1.46 0.71 0.66 0.83 0.96 1.02 1.12 1.22 1.32
TARGET 0.72 1.51 0.63 0.69 0.87 1.02 1.11 1.16 1.18 1.12
WALMART 0.82 1.77 0.82 0.82 0.88 0.89 0.92 0.97 1.00 1.11
STAPLES 0.69 1.72 0.71 0.55 0.68 0.88 0.97 1.06 1.19 1.54
STARBUCKS 0.82 0.47 0.81 0.88 1.04 1.21 1.23 1.23 1.19 1.14
PAYPAL 1.13 1.51 1.03 0.86 0.82 0.91 1.00 0.90 0.92 0.93
Groceries
PUBLIX 0.84 3.16 0.35 0.45 0.56 0.72 0.83 0.86 0.94 1.27
MENARDS 0.75 3.66 0.42 0.38 0.55 0.71 0.77 0.93 0.85 0.98
KROGER 0.79 1.13 0.79 0.87 1.00 1.01 1.03 1.10 1.09 1.20
Gas and convenience stores
EXXONMOBIL 1.07 0.93 1.04 1.03 1.01 0.99 1.00 0.96 0.96 1.01
SHEETZ 0.87 0.36 0.91 1.01 0.96 0.96 1.04 1.21 1.37 1.31
SHELLOIL 1.12 1.04 1.03 1.04 1.01 1.01 0.98 0.93 0.93 0.91
SPEEDWAY 1.17 0.90 1.25 1.24 1.16 1.04 0.97 0.87 0.77 0.63
Hotels
HILTON 0.69 1.70 0.49 0.53 0.76 1.02 1.15 1.14 1.16 1.36
RAMADAINN 0.75 2.29 0.40 0.64 0.90 0.88 1.00 1.10 0.90 1.13
RESIDENCEINN 0.92 1.94 0.56 0.73 0.68 0.84 1.00 0.82 0.97 1.55
ROYALINN 0.23 0.87 1.07 0.81 0.99 0.85 0.78 0.49 1.04 2.87
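The slides give the index only as "% segment / % population". One common reading, assumed here, is the share of a cohort's households that transact with a merchant divided by the share of all households that do. The function and parameter names are hypothetical.

```python
def segment_index(cohort_customers, cohort_size, merchant_customers, population):
    """Index = (% of the cohort using the merchant) / (% of everyone using it).

    Values above 1 mean the merchant is over-represented in the cohort.
    """
    pct_segment = cohort_customers / cohort_size
    pct_population = merchant_customers / population
    return pct_segment / pct_population


# Toy numbers: 35 of 100 cohort households vs. 200 of 1,000 overall.
print(round(segment_index(35, 100, 200, 1000), 2))
```

The ratio is symmetric: dividing the cohort's share of the merchant's customers by the cohort's share of the population (10% for a decile) gives the same number, so either reading of the slide's formula produces the same table.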


Summary of Findings
• We cleaned and standardized merchant names and
– Found 1.1 million distinct merchants from the original 113+ million
– Discovered that 90% of transactions and 90% of the money spent
occurred at fewer than 10% of the merchants
– Identified that SIC codes differ significantly across like businesses
– Identified differences in credit and debit purchase behavior
– In reaction to the announcement Square made on August 8th, we
used the cleaned merchant names to evaluate the potential impact
of the trend towards alternative payment methods
• Segmentation augmented by a value added metric
– We found that segmenting customers based on a rough measure of
value added and combining that with transaction data can provide
interesting insights
– Prediction of migration from low to high value segments seems
possible

