Standardizing +113 million Merchant Names in Financial Services with Greenplum Hadoop
 

Talk by Ian Andrews & Mike Goddard @Greenplum at Data Science London 28/11/2012. A financial services case on how to standardize merchant names with RegEx & fuzzy matching

  • SCRIPT: This diagram depicts the Greenplum Unified Analytics Platform (UAP). Let’s take a high-level look at the stack. The foundation of UAP is Greenplum Database for analyzing structured data, with Greenplum Hadoop co-processing unstructured data. These two components are fused together by Greenplum gNet, which allows for parallel data exchange and parallel query access. On top of them sits a unified data access and query layer that combines the languages of choice for your analysts (SQL, MapReduce, etc.). Above the access layer comes our partner tool and services layer. We are not about locking customers into a single tool or stack; instead we work with the tool vendor of your choice, be it SAS or R, MicroStrategy or Informatica. What truly enables productivity and ensures you get maximum value out of your data science team is Greenplum Chorus. What sets this diagram apart from a typical vendor example is the inclusion of people: the Data Stakeholders. UAP is designed to enable an emerging group of talent, the new practitioners, that we refer to as the Data Science team. This team can include the data platform administrator, data scientists, analysts, engineers, BI teams, and, most importantly, the line-of-business user. We develop, package, and support this as a unified software platform available on your favorite commodity hardware, on cloud infrastructure, or from our modular Data Computing Appliance.

Presentation Transcript

  • Applied Analytics with Greenplum Hadoop: Standardizing +113 million Merchant Names with RegEx and Fuzzy Matching
      Ian Andrews, Mike Goddard
      © Copyright 2012 EMC Corporation. All rights reserved.
  • Greenplum, A Division of EMC
      • 10 years of experience building and supporting enterprise-class massively parallel data processing software based on open source technology
      • Silicon Valley-based core engineering talent from Yahoo!, Teradata, Oracle, Amazon, Microsoft, IBM, etc.
      • 1,000 (and growing) personnel focused on Greenplum’s Big Data Platform
          – Greenplum Database
          – Greenplum HD (Hadoop)
          – Chorus
          – Data Computing Appliances
          – Data Scientists
          – Pivotal Labs
      • Fully integrated with EMC’s award-winning global support infrastructure
      • 500+ customers in production globally across all industry segments
      • Established relationships with ecosystem partners: Informatica, SAS, Talend, Pentaho, MicroStrategy, etc.
      • Strategic development relationship with VMware around virtual big data platforms
  • Greenplum Unified Analytic Platform
  • Transaction Data - Merchant Name Standardization System
  • Overview of Findings
      • Transaction data is difficult to analyze because the merchant names found in credit and debit data are unstructured and are not standardized across a single business entity
      • We created a system for cleaning and standardizing merchant names
          – Stage 1: feature extraction
          – Stage 2: automated cleanup using regular expressions
          – Stage 3: fuzzy matching algorithm
          – Stage 4: application of manual rules
      • This is an open system that is easy to use, extend, and modify
      • We used the results to do some preliminary analysis on the transaction data
  • Background Information - Credit and Debit Data Overview
      • Credit Transactions [1]
          – 1,396,344 distinct merchant names
          – 16,554,889 credit transactions ($1,979,801,143.50)
          – 161,931 households with credit transactions
          – Min: -$32,585
          – Max: $99,000
          – Average: $120
          – Std. Deviation: $496
      • Debit Transactions
          – 2,598,462 distinct merchant names
          – 96,658,020 debit transactions ($3,471,084,518.72)
          – 435,615 households with debit transactions
          – Min: $0.01
          – Max: $39,404
          – Average: $36
          – Std. Deviation: $89
      • % # transactions: Credit 14.62%, Debit 85.38%
      • % sum transactions: Credit 36.32%, Debit 63.68%
      [1] Excludes 13 SIC codes in the depository institution activity group
  • Why standardize merchant names?
      • Because the same business goes by multiple names across locations, a single business entity appears as many in the database
      • Examples

      WAL-MART                  PAYPAL                STARBUCKS
      WALMART PORTRAITS 23093   PAYPAL *SACCAR.COM    STARBUCKSSTORE.COM-USD
      WAL-MART #2366 SE2        PAYPAL *BRICKSUPPLY   STARBUCKS CORP00034488
      WAL-MART STORE#1041       PAYPAL *BRETT2010FL   SS-STARBUCKS
      WAL-MART SUPERCENTER 20   PAYPAL *UNITED T1     STARBUCKS J10431542
      WAL MART LINCOLN          PAYPAL *TL5354        STARBUCKS C #112201505
      WALMART.COM RELOAD        PAYPAL *CAR-KIT.COM   STARBUCKS WEST30081525
  • Examples of names passing through the merchant name standardization system
      • Example 1
          – Original: GIANT FOOD #089
          – Stage 1, Features: Length: 14; 1st White Space: 6; 1st Special Characters: 12; 1st Digit: 13
          – Stage 2, Regex: [^(?-i)a-z] (remove all numbers (0-9), white space, & special characters)
          – Stage 3, Fuzzy Matching: 1016 (count of GIANTFOOD matches)
          – Stage 4, Manual Override: None
          – Final Results: GIANTFOOD
      • Example 2
          – Original: PETSMART INC 1963
          – Stage 1, Features: Length: 17; 1st White Space: 9; Business Suffix: 10; 1st Digit: 14
          – Stage 2, Regex: [^(?-i)a-z]|( INC )$ (remove all numbers (0-9), white space, special characters, & remove business suffix)
          – Stage 3, Fuzzy Matching: <170 PETSMART FOUND (Not run)
          – Stage 4, Manual Override: None
          – Final Results: PETSMART
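Stages 1 and 2 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenters' implementation: the function names, the suffix list, and the regular expressions are hypothetical (the slide's own pattern is quoted verbatim above and is not reused here), and positions are reported 1-based, which matches the figures shown for PETSMART INC 1963.

```python
import re

# Hypothetical suffix list; the slide only shows "INC" being removed.
BUSINESS_SUFFIXES = ("INC", "CORP", "LLC", "CO")
SUFFIX_PATTERN = r"\b(?:" + "|".join(BUSINESS_SUFFIXES) + r")\b"


def first_position(pattern, text):
    """Return the 1-based position of the first match, or None if absent."""
    match = re.search(pattern, text)
    return match.start() + 1 if match else None


def extract_features(name):
    """Stage 1 (sketch): positional features of a raw merchant name."""
    return {
        "length": len(name),
        "first_white_space": first_position(r"\s", name),
        "first_special_char": first_position(r"[^A-Za-z0-9\s]", name),
        "first_digit": first_position(r"\d", name),
        "business_suffix": first_position(SUFFIX_PATTERN, name),
    }


def clean_name(name):
    """Stage 2 (sketch): drop a standalone business suffix, then every
    character that is not a letter (digits, white space, punctuation)."""
    without_suffix = re.sub(SUFFIX_PATTERN, "", name.upper())
    return re.sub(r"[^A-Z]", "", without_suffix)


if __name__ == "__main__":
    for raw in ("GIANT FOOD #089", "PETSMART INC 1963"):
        print(raw, extract_features(raw), clean_name(raw))
```

Keeping the features separate from the cleanup lets later stages decide, per name, which rule to apply (for example, only stripping a suffix when one was detected).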
  • Example Results - STARBUCKS

      Pre-Standardization         Post-Standardization
      STARBUCKS DELI20371514      STARBUCKS
      STARBUCKS-ARIFJAN CAMP2     STARBUCKS
      STARBUCKS C #112201505      STARBUCKS
      STARBUCKS USA 00115832      STARBUCKS
      STARBUCKS CAFE CROWNE       STARBUCKS
      STARBUCKS CORP00134759      STARBUCKS
      ATL MED CTR STARBUCKS       STARBUCKS
      T3 N STARBUCKS30031512      STARBUCKS
      STARBUCKS COFEE             STARBUCKS
      STARBUCKS LA ISLA           STARBUCKS
      OMNI FT WORTH - STARBUCKS   STARBUCKS
      ST. RITAS STARBUCKS         STARBUCKS
      MGM GRND STARBUCKS-CASINO   STARBUCKS
      006 STARBUCKS AMR           STARBUCKS
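The slides do not say which fuzzy matching algorithm was used in Stage 3. As a stand-in, the sketch below uses the similarity ratio from Python's standard `difflib` module to map an already cleaned name onto a list of canonical merchant names. The function name, the canonical list, and the 0.7 cutoff are assumptions made for illustration.

```python
import difflib


def fuzzy_standardize(cleaned_name, canonical_names, cutoff=0.7):
    """Return the closest canonical name, or the input unchanged when
    nothing scores at or above the cutoff (sketch, not the talk's method)."""
    matches = difflib.get_close_matches(
        cleaned_name, canonical_names, n=1, cutoff=cutoff
    )
    return matches[0] if matches else cleaned_name


if __name__ == "__main__":
    # Hypothetical canonical list built from names seen in the slides.
    canonical = ["STARBUCKS", "WALMART", "PETSMART", "GIANTFOOD"]
    for name in ("STARBUCKSCOFEE", "STARBUCKSDELI", "SHEETZ"):
        print(name, "->", fuzzy_standardize(name, canonical))
```

A ratio-based cutoff like this will miss names where the brand is a small part of a long string (for example a hotel name followed by STARBUCKS), which is one reason a manual-rules stage remains useful.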
  • 90% of all transactions occur at 7% of the merchants

      Company Name    Total Transactions
      MCDONALDS       4,309,728
      SPEEDWAY        2,032,474
      WALMART         1,606,446
      KROGER          1,564,819
      SHELLOIL        1,546,056
      SHEETZ          1,358,977
      SUBWAY          1,280,037
      REDBOX          1,236,148
      EXXONMOBIL      1,205,451
      WAWA            1,197,711
      SUNO            1,180,799
      WENDYS          1,066,628
      MARATHONOIL     1,050,593
      MEIJER          1,017,998
      STARBUCKS       1,002,805

      • Gini Coefficient = 0.9447
          – 0 represents equality
          – 1 represents all transactions at 1 merchant
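For readers who want to reproduce this kind of concentration measure on their own data, here is a minimal sketch of a Gini coefficient and of the "share of merchants needed to reach 90%" figure. It assumes non-negative per-merchant totals and uses the standard rank-weighted formula for sorted values; the function names and sample numbers are hypothetical, and the slides do not state which estimator the presenters used.

```python
def gini(values):
    """Gini coefficient of non-negative totals (0 = perfectly equal).

    With n merchants the maximum is (n - 1) / n, which approaches 1
    when everything happens at a single merchant.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def merchant_share_for(values, share=0.9):
    """Fraction of merchants (largest first) needed to reach `share`
    of the overall total."""
    xs = sorted(values, reverse=True)
    target = share * sum(xs)
    running = 0
    for count, x in enumerate(xs, start=1):
        running += x
        if running >= target:
            return count / len(xs)
    return 1.0


if __name__ == "__main__":
    counts = [80, 15, 3, 2]  # made-up transaction counts per merchant
    print(gini(counts), merchant_share_for(counts))
```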
  • 90% of the total spend in 2011 occurred at top 8.3% of merchants

      Company Name    Total spent
      WALMART         $87,454,235.66
      KROGER          $63,850,902.99
      SPEEDWAY        $54,270,752.65
      TARGET          $48,086,797.70
      MEIJER          $46,716,327.56
      WMSUPERCENTER   $46,650,761.15
      SHELLOIL        $45,115,993.12
      GIANTEAGLE      $44,668,211.07
      ATT             $44,497,819.88
      VERIZONWRLS     $41,971,943.31
      LOWES           $34,952,686.13
      SUNO            $34,498,328.42
      EXXONMOBIL      $33,695,575.95
      MCDONALDS       $30,869,463.74
      SHEETZ          $30,273,183.81

      • Gini Coefficient = 0.9408
          – 0 represents equality
          – 1 represents all money spent at 1 merchant
  • SIC codes alone are problematic; they can differ greatly across like businesses
      • On average, the 1,000 most frequently occurring merchants have ~6 SIC codes associated with their cleaned merchant name
      • SIC codes by merchant (columns: WALMART, TARGET, SAFEWAY, KROGER, AT&T, VERIZON, T-MOBILE)

      4814 5310 5411 12 1711 4812 12
      4816 5411 5499 5411 2741 4814 4812
      5300 5732 5921 5499 3640 4899 5732
      5411 8043 5541 4112 5999 5999
      6300 8099 5542 5971 7311 7299
      … … … 7399 …
      Total 31 Total 8 Total 71 Total 10 6 total matches 2 total matches 4 total matches
  • Relative Value Add segments created by splitting population into deciles based on RVA
      • Relative Value Added (RVA) provides an estimated ordinal ranking of customers using balance and transaction data (a rough precursor of EVA)
      • The RVA was created to put a context around the merchant name discovery, the distribution of PNC’s products and how they interact
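The slides do not give the RVA formula, so the sketch below takes a per-customer score as given and only illustrates the decile split itself. It assumes that decile 1 holds the lowest scores, which the slides do not confirm; the function name and sample scores are hypothetical.

```python
def assign_deciles(scores):
    """Map each customer id to a decile 1..10 by rank of its score.

    scores: dict of customer id -> numeric score (e.g. an RVA estimate).
    Assumption: decile 1 is the lowest-scoring tenth of customers.
    """
    ranked = sorted(scores, key=scores.get)
    n = len(ranked)
    return {cid: rank * 10 // n + 1 for rank, cid in enumerate(ranked)}


if __name__ == "__main__":
    # Made-up scores for 20 customers.
    sample = {f"c{i}": float(i) for i in range(20)}
    deciles = assign_deciles(sample)
    print(deciles["c0"], deciles["c19"])
```

Ranking before bucketing keeps the ten groups the same size regardless of how skewed the score distribution is, which suits an ordinal measure.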
  • Segment Profiles
      • Index: % segment / % population (columns 1 to 10 are Cohort 1 to Cohort 10)
      • Target’s marketing to higher income households seems to have worked

      Merchant         1     2     3     4     5     6     7     8     9    10
      Cellular telephone providers
      ATT           1.00  0.86  1.18  1.24  1.14  1.04  0.97  0.91  0.86  0.79
      SPRINT        1.75  0.55  1.93  1.72  1.15  0.81  0.67  0.56  0.50  0.36
      TMOBILE       1.35  0.95  1.38  1.36  1.06  0.86  0.92  0.81  0.71  0.60
      VERIZONWRLS   0.95  0.52  1.18  1.32  1.28  1.11  1.01  0.95  0.90  0.78
      Retail stores
      SEARSROEBUCK  0.64  1.60  0.60  0.63  0.79  0.90  1.03  1.12  1.25  1.45
      TJMAXX        0.68  1.46  0.71  0.66  0.83  0.96  1.02  1.12  1.22  1.32
      TARGET        0.72  1.51  0.63  0.69  0.87  1.02  1.11  1.16  1.18  1.12
      WALMART       0.82  1.77  0.82  0.82  0.88  0.89  0.92  0.97  1.00  1.11
      STAPLES       0.69  1.72  0.71  0.55  0.68  0.88  0.97  1.06  1.19  1.54
      STARBUCKS     0.82  0.47  0.81  0.88  1.04  1.21  1.23  1.23  1.19  1.14
      PAYPAL        1.13  1.51  1.03  0.86  0.82  0.91  1.00  0.90  0.92  0.93
      Groceries
      PUBLIX        0.84  3.16  0.35  0.45  0.56  0.72  0.83  0.86  0.94  1.27
      MENARDS       0.75  3.66  0.42  0.38  0.55  0.71  0.77  0.93  0.85  0.98
      KROGER        0.79  1.13  0.79  0.87  1.00  1.01  1.03  1.10  1.09  1.20
      Gas and convenience stores
      EXXONMOBIL    1.07  0.93  1.04  1.03  1.01  0.99  1.00  0.96  0.96  1.01
      SHEETZ        0.87  0.36  0.91  1.01  0.96  0.96  1.04  1.21  1.37  1.31
      SHELLOIL      1.12  1.04  1.03  1.04  1.01  1.01  0.98  0.93  0.93  0.91
      SPEEDWAY      1.17  0.90  1.25  1.24  1.16  1.04  0.97  0.87  0.77  0.63
      Hotels
      HILTON        0.69  1.70  0.49  0.53  0.76  1.02  1.15  1.14  1.16  1.36
      RAMADAINN     0.75  2.29  0.40  0.64  0.90  0.88  1.00  1.10  0.90  1.13
      RESIDENCEINN  0.92  1.94  0.56  0.73  0.68  0.84  1.00  0.82  0.97  1.55
      ROYALINN      0.23  0.87  1.07  0.81  0.99  0.85  0.78  0.49  1.04  2.87
  • Segment Profiles
      • Index: % segment / % population (same index table as on the previous slide, Cohort 1 to Cohort 10)
      • AT&T and Verizon appear to be gaining more high value customers
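The slides define the index only as "% segment / % population". One plausible reading, sketched below, is the share of a cohort's customers who shop at a merchant divided by the share of all customers who shop there, so that 1.00 means the cohort matches the population. The data layout, the function name, and this interpretation are assumptions, not something the slides confirm.

```python
from collections import defaultdict


def segment_index(customers):
    """Compute an index per (merchant, cohort) pair.

    customers: list of (cohort, set_of_merchants) tuples, one per customer.
    index = (share of the cohort shopping at the merchant)
            / (share of all customers shopping at the merchant)
    """
    total = len(customers)
    cohort_sizes = defaultdict(int)
    shoppers_overall = defaultdict(int)
    shoppers_by_cohort = defaultdict(int)

    for cohort, merchants in customers:
        cohort_sizes[cohort] += 1
        for merchant in merchants:
            shoppers_overall[merchant] += 1
            shoppers_by_cohort[(merchant, cohort)] += 1

    index = {}
    for (merchant, cohort), count in shoppers_by_cohort.items():
        pct_segment = count / cohort_sizes[cohort]
        pct_population = shoppers_overall[merchant] / total
        index[(merchant, cohort)] = pct_segment / pct_population
    return index


if __name__ == "__main__":
    # Made-up data: two cohorts of two customers each.
    data = [(1, {"TARGET"}), (1, set()), (2, {"TARGET"}), (2, {"TARGET"})]
    print(segment_index(data))
```

Values above 1.00 then indicate a cohort that is over-represented at a merchant, which is how the table's figures appear to be read in the callouts.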
  • Summary of Findings
      • We cleaned and standardized merchant names and
          – Found 1.1 million distinct merchants from the original 113+ million
          – Discovered that 90% of transactions and 90% of the money spent happened at less than 10% of the merchants
          – Identified that SIC codes differ significantly across like businesses
          – Identified differences in credit and debit purchase behavior
          – In reaction to the announcement Square made on August 8th, used the cleaned merchant names to evaluate the potential impact of the trend towards alternative payment methods
      • Segmentation augmented by a value added metric
          – We found that segmenting customers based on a rough measure of value added and combining that with transaction data can provide interesting insights
          – Prediction of migration from low to high value segments seems possible