From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records Per Year

Jaime G Fitzgerald, Fitzgerald Analytics
Alex Hasha, Bundle.com (a joint venture between Citi, Microsoft Money, and Morningstar)

Speaker notes
  • Jaime intro. Alex intro: Thanks, Jaime. Since Jaime has already introduced me, I’ll introduce Bundle. Bundle is a company that uses data to help consumers make better decisions with their money. We do this on the one hand by providing free tools for managing personal financial data. But more to the point of today’s talk, we are also mining mountains of credit card transaction data to extract actionable insights for consumers based on the spending behavior of their peers.
  • First to provide local merchant profiles for consumers that are deeply data-driven. The local search business (Yelp, Citysearch, Foursquare, Google, Bing): the share of local searches made on mobile devices is growing very fast, and this is a fast-growing sector for data-driven startups. Example: Ted’s Montana Grill. Bundle addresses issues with other sites: selection bias (strong opinions are over-represented), system gaming (just like SEO; there are interesting stories about “reputation management” companies!), and it offers explicit rankings (rank by the actual metrics!).
  • Alex: So where does text analytics come into this? As you might imagine, bending old data to a new purpose is fraught with difficulties, because the dataset was designed with different applications in mind. A key problem we faced with our credit card transaction database was that the transaction records lack a merchant identifier. The data’s primary purpose is interacting with cardholders and generating statements, and not surprisingly it’s formatted very much like an enormous credit card statement. The merchant name is embedded in a text field, which also contains other information. It’s semi-structured, but lacks a consistent format. Clearly, to unlock insights about merchants from this data, we have to associate the transactions with merchants using this text field, so text analytics is absolutely crucial to our business. AH: Just some background here: in the credit card industry there are “acquiring banks,” which deal with merchants and process their credit card transactions over various payment networks, and “issuing banks,” which issue cards to consumers and manage the generation of statements and the billing of individuals. Since the interactions with merchants and consumers are split between two entities, you end up with data sets that are either consumer-focused or merchant-focused. We get our data from an “issuing” bank, so it doesn’t have detailed merchant information beyond what it needs to generate statements for cardholders. That is the root of our problem.
  • Alex: This is a screen shot of our core offering, the Bundle Merchant recommender, which aims to help consumers with their most frequent money decisions: where to spend it. Visually, I’m sure you’re reminded of user review sites like Yelp or Citysearch, and the purpose, to help you discover great merchants, is similar. Our content, though, is very different because it’s generated directly from the credit card transactions of over 20 million US households.
  • Alex: (Review features left to right.) I just wanted to return to this screen shot to highlight the features that are made possible by transforming credit card data in this way. (Loyalty score) Unlike other sites, our star ratings are data driven: we assign each merchant what we call the “Bundle Loyalty Score,” which is calculated from the share of wallet a merchant’s customers devote to the business and how frequently they return (see the illustrative sketch after these notes). (Coverage) Because we capture transactions from a broad cross-section of the population, we have data on many small local merchants, not just the popular ones that attract a lot of reviews. (Segments and silent majority) We can break a merchant’s customers down into demographic and behavioral segments, to show how well it serves different groups and which groups it is most popular with. We’re capturing information about the silent majority of shoppers, who shop without writing about it online, and we also avoid the common bias on review sites towards extremely positive or extremely negative reviews. (Real price levels) We have rich data about the real range of prices visitors to this merchant are paying, based on real transactions. (Web of merchants) Another unique feature on Bundle is that we can show you what other merchants are popular with customers of this merchant. We’re all familiar with “People who bought this also bought” on Amazon and other online marketplaces, but I believe we’re the first to take this to the offline marketplace on a massive scale.
  • (Top 10 possible matches, like a Google search.)
  • Jaime: Take it back to the audience. A common theme in converting data to dollars is extracting new value from old data by MATCHING it with other preexisting data. No need to dwell on the particulars of the Bundle data on this slide, except as an instance of a more general pattern.
  • JF provides framing: This is a universal problem for companies seeking to convert Data to Dollars; repurposing old data sets often requires matching them with other data sets without a common key. AH: It should be clear now how a robust, accurate algorithm for matching text descriptions to merchant listings is a prerequisite for our entire user experience. There are two aspects of this problem that created significant challenges for us. First, there’s the basic issue that accurate fuzzy string matching is hard. Our inputs include highly variable transaction descriptions, sometimes dozens or hundreds per merchant, inconsistent coding, error-prone geographic indicators, and noisy merchant category indicators. These give us a lot to go on, but treating any of them as a source of truth gets you in trouble. We’re at a text analytics conference, so I don’t have to tell you that accurate fuzzy string matching can be hard, especially if supporting data like merchant category and geographic information are not 100% reliable. But before we could even begin to attack that problem, we had to do something about the sheer size of our data set. We receive about 1 billion credit card transactions per year, each of which must be associated with one of tens of millions of merchants in a comprehensive listing. Not that anyone would try this, but a brute-force attempt to take each transaction description and scan through the merchant listing item by item looking for a match would require on the order of 10^16 fuzzy string comparisons. To put that in perspective, if each comparison took about a millisecond, the match would take over 300,000 years to run (a back-of-envelope version of this arithmetic appears after these notes). Clearly something needs to be done to reduce the scale of the input AND the matching search space. Broadly speaking, we accomplished this by breaking the matching process into two phases, using text clustering in the first phase to dramatically decrease the size of the data set, and then proceeding to a fuzzy match.
  • This isn’t rocket science; there are a handful of obvious places to start simplifying the problem. One key lever is location: if you have a transaction that occurred in New Mexico, it doesn’t make sense to include merchants in New York in your search. There are tens of millions of merchants nationally, but only hundreds of thousands in each city, and maybe a thousand at most in each neighborhood. If you can identify the neighborhood of a transaction and only search the merchants in that neighborhood, the efficiency payoff is huge (see the blocking sketch after these notes). This wasn’t a completely obvious step for us, though, because as I mentioned before, the geographic fields in our transaction data were not 100% reliable. We could identify the city with no problem, but at the neighborhood level there is a significant error rate. We eventually realized we had to ignore all the little complications and, at all costs, reduce the size of our data so we could work with it efficiently. It’s worth creating an intermediate data set that’s still pretty messy, if you can now load it into R on your laptop and try out a few fuzzy matching experiments in an afternoon.
  • This slide gives a high-level overview of how we achieved a cascade of scale reductions by batching transactions by neighborhood. Considering each neighborhood in isolation, we dedupe and then cluster transaction strings that are highly likely to be generated by the same merchant. Each of these clusters is assigned a preliminary merchant ID. At this point we have a preliminary merchant listing that still suffers from some of the quality issues of the original data set, but it can provide aggregated transaction data views to inform subsequent matching, and it is on a much more manageable scale. The output of the clustering algorithm feeds into a more resource-intensive fuzzy matching algorithm, which becomes feasible at this scale. Taking this approach on a single machine, we were able to get our processing time down to about a week. However, in startup time a week is not much better than 300,000 years. Thanks to the revolution in open source parallel computing, we were able to quickly set up a small Hadoop cluster which parallelizes the text clustering operations so all the neighborhoods run at the same time. This brought our processing down to about 20 minutes. While this isn’t a complete solution to the initial problem, it vastly increases our capability to experiment with new methods and tweaks to the existing process. So that’s a quick and dirty introduction to a part of our technology stack, and now I’ll turn it over to Jaime to convert my case study into some high-level takeaways.
  • Robin: cust behavior: Pay, Complain, Pay... then... ST vs LT, Rec, Adv, Loyalty
  • Comments: Consider trade-offs between false positives and false negatives (see the threshold sketch after these notes). Related hot/emerging best practices we can mention to frame this: metrics-driven development; beginning with the end in mind / causal clarity.
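
The “Bundle Loyalty Score” note above mentions two ingredients, share of wallet and return frequency, without giving the formula. The sketch below is a hypothetical illustration of how such a score could blend them; the function name, the 50/50 weighting, and the four-visits-per-month cap are assumptions for illustration, not Bundle’s actual method.

```python
# A hypothetical sketch of a loyalty-style score combining the two ingredients
# the talk mentions: share of wallet and return frequency. Bundle's actual
# formula is not published here; the weighting below is illustrative only.
def loyalty_score(merchant_spend: float, total_spend: float,
                  visits: float, months_observed: float) -> float:
    """Blend share of wallet with visit frequency into a 0-100 score."""
    share_of_wallet = merchant_spend / total_spend if total_spend else 0.0
    visits_per_month = visits / months_observed if months_observed else 0.0
    # Illustrative blend: cap the frequency term so neither ingredient dominates.
    frequency_component = min(visits_per_month / 4.0, 1.0)  # 4+ visits/month maxes out
    return round(100 * (0.5 * share_of_wallet + 0.5 * frequency_component), 1)

# Example: $120 of a $1,000 monthly budget spent at a cafe, 6 visits per month.
print(loyalty_score(120, 1000, 18, 3))  # -> 56.0
```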
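
The 300,000-year figure in the notes above follows from simple arithmetic: roughly 10^9 transactions times 10^7 merchants gives about 10^16 pairwise comparisons. A back-of-envelope check, assuming the talk’s illustrative cost of about one millisecond per fuzzy comparison:

```python
# Back-of-envelope estimate of the brute-force matching workload,
# assuming ~1 ms per fuzzy string comparison (an illustrative figure).
transactions = 1_000_000_000   # ~1 billion card transactions per year (10^9)
merchants = 10_000_000         # ~tens of millions of US merchants (10^7)
seconds_per_comparison = 1e-3  # assumed cost of one fuzzy comparison

comparisons = transactions * merchants  # ~10^16 pairwise comparisons
total_seconds = comparisons * seconds_per_comparison
years = total_seconds / (60 * 60 * 24 * 365)

print(f"{comparisons:.0e} comparisons take about {years:,.0f} years at 1 ms each")
# -> 1e+16 comparisons take about 317,098 years at 1 ms each
```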
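
As a rough illustration of the neighborhood batching idea described above, the sketch below groups raw description strings by a coarse geographic key before any fuzzy comparison happens. The field names and sample strings are hypothetical; the real pipeline’s blocking key and data layout are not specified in the talk.

```python
# A minimal sketch of the "batch by neighborhood" scale reduction described above.
# Field names (zip_code, description) are hypothetical; the real data layout differs.
from collections import defaultdict

transactions = [
    {"zip_code": "11231", "description": "ANTHONYS RESTAURANT #123 BRKLY NY"},
    {"zip_code": "11231", "description": "ANTHONYS RESTAURANT #123 BRKLY NY"},
    {"zip_code": "11231", "description": "TEDS MONTANA GRILL BROOKLYN"},
    {"zip_code": "87501", "description": "TEDS MONTANA GRILL SANTA FE NM"},
]

# 1. Block: group raw descriptions by a coarse geographic key.
blocks = defaultdict(set)  # using a set also dedupes exact repeats
for txn in transactions:
    blocks[txn["zip_code"]].add(txn["description"])

# 2. Any fuzzy comparison now happens only within a block, so the search space
#    per transaction drops from tens of millions of merchants to the few hundred
#    (or thousand) candidates in the same neighborhood.
for zip_code, descriptions in blocks.items():
    print(zip_code, len(descriptions), "unique description strings")
```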
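
The note on trading off false positives against false negatives comes down to where you set the match-confidence threshold. The sketch below uses Python’s standard-library difflib as a stand-in similarity measure (the talk does not say which measure Bundle uses); the candidate names and threshold values are illustrative.

```python
# A sketch of the false-positive vs. false-negative trade-off when thresholding a
# fuzzy-match score. difflib stands in for whatever similarity measure is actually
# used in the pipeline; candidates and thresholds are made up for illustration.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Simple character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

description = "ANTHONYS RESTAURANT #123 BRKLY NY"
candidates = ["Anthony's Restaurant", "Antonio's Ristorante", "Anthony's Cleaners"]

for threshold in (0.5, 0.7, 0.9):
    accepted = [c for c in candidates if similarity(description, c) >= threshold]
    print(threshold, accepted)

# A low threshold links more transactions (fewer false negatives) but risks linking
# to the wrong merchant (more false positives); a high threshold does the reverse.
```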
  • Transcript

    • 1. From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records. Jaime Fitzgerald, President, Fitzgerald Analytics, Inc.; Alex Hasha, Chief Data Scientist, Bundle.com. May 1, 2012. Architects of Fact-Based Decisions™
    • 2. Agenda for Today’s Talk: 1. The Business Model; 2. The Text Analytics Challenge; 3. How We Overcame the Challenge; 4. Key Takeaways; 5. Q&A
    • 3. Introduction. Jaime Fitzgerald, Founder @ Fitzgerald Analytics, @JaimeFitzgerald; Alex Hasha, Data Scientist @ Bundle Corp, @AlexHasha. Responsible for: [Jaime] transforming data into value for clients; creating meaningful careers for employees. [Alex] leading development of data products; designing statistical methods / algorithms that transform data into insights for consumers. At a company that: [Jaime] helps clients convert Data to Dollars™; brings a strategic perspective to improve ROI on investments in technology, data, people, and processes. [Alex] uses data to help consumers make better decisions with their money; bends valuable legacy data to new purposes; is growing and hiring! Also working on: [Jaime] working to Democratize Analytics by reducing the “Barrier to Benefit” for non-profits, social entrepreneurs, and gov’t. [Alex] learning about and implementing best practices for managing complex data pipelines.
    • 4. The Local Search Business
    • 5. Gaps in Local Search Offerings. Paid advertisement: not trusted. User reviews: can be biased (selection bias), can be gamed, not personalized (to you).
    • 6. Bundle’s Unique Contribution. Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households. Example: credit card statement data.
    • 7. A Screen Shot From our Site
    • 8. A Screen Shot From our Site
    • 9. A Screen Shot From our Site
    • 10. We Do This with Billions of Real Spending Records. Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households. Example: credit card statement data. Key issues with this data: 1. credit card data lacks a merchant identifier; 2. so we rely on text analytics to associate transactions with merchants.
    • 11. Building our “Version of the Truth” from 3 sources. Our transaction data (pros: proprietary, differentiated, special sauce; cons: incomplete, semi-structured). Localeze (pros: high quality, clean / verified; cons: lag / recency). Factual (pros: crowd sourced, up to the minute; cons: more variability in quality).
    • 12. Data: Not Useful Until Refined.
    • 13. Key Steps in “Refinement” (Transformation). Old data: card transaction data; merchant listings (e.g., address, phone number, business type); other data (Census, Bureau of Labor Statistics, user feedback). Transformed in new ways: normalization, clustering, linking, aggregation. To create new features such as: “People Who Shop Here Also Like…”, the Bundle Loyalty Score, and data-driven reviews from an array of customer segments.
    • 14. Before the Fun Stuff Happens… Before we can generate insights about merchants for our users, we must associate each transaction in our database with a specific merchant from a master list. Two main problems: 1. accurate fuzzy matching is difficult (highly variable text descriptions, noisy geographic info, noisy merchant category info); 2. the scale of the data is enormous: credit card transactions (billions, 10^9) must be text-matched against a comprehensive listing of US merchants (tens of millions, 10^7). A naïve item-by-item search takes O(10^16) expensive string comparisons: too slow!
    • 15. A “Brute Force” Approach Would Never Work… 1. Matching within the nationwide merchant set would require massive processing; fortunately, we don’t need to match at this level. 2. Batching at the local area, we can process orders of magnitude faster. [Chart: processing time / workload vs. number of merchants in the comparison set, for neighborhood (hundreds), city (hundreds of thousands), and nation (tens of millions).]
    • 16. Solution to Scaling Problem: a “cascade of scale reductions,” parallelizing by location. Credit card transactions (billions, 10^9) -> batch transactions by geographic neighborhood -> dedupe description strings -> text clustering (not matching) consolidates strings belonging to the same merchant -> preliminary merchant listing generated directly from transactions (tens of millions, 10^7) -> secondary fuzzy matching process reconciles preliminary listings with the merchant “source of truth” -> final merged transaction data set. Keys to solving the scaling problem: 1. scale reduction / parallelized text clustering; 2. free open source software. Computational efficiency increased by a factor of 10^8! Eons -> Days -> Minutes.
    • 17. Data Preparation: Phase 1. DAMA lens: deduping, matching, cleansing. Machine learning lens: unsupervised learning, text clustering (strings), pattern discovery. Example: “Anthonys Restaurant #123 Brkly NY” x 10 -> “Anthony’s Restaurant”. [A minimal clustering sketch follows this transcript.]
    • 18. Data Preparation: Phase 2. DAMA lens: deduping, record linkage, cleansing, data quality enhancement, data enrichment. Machine learning lens: information retrieval, supervised classifier. Process: search retrieves the top 10 possible matches; the classifier is applied to each and returns a confidence score; if confidence is high, the records are linked (+30% more). [A sketch of this retrieve-then-classify flow follows this transcript.]
    • 19. Takeaways: 1. Tame your data before perfecting your methods; efficiency enables experimentation, iteration, and improvement. 2. Design your process to minimize unnecessary complexity (e.g., parallel processing at scale, normalization, pre-filtering). 3. Tools: take advantage of powerful (and inexpensive) open-source tools that enable your process...
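
Slide 17’s example (“Anthonys Restaurant #123 Brkly NY” x 10 collapsing to a single merchant record) can be illustrated with a minimal normalization-and-grouping sketch. The regex rules below are assumptions for illustration, not the unsupervised method the deck actually used.

```python
# A minimal sketch of the Phase 1 idea (slide 17): unsupervised cleanup that
# collapses many noisy description strings into one preliminary merchant record.
# The normalization rules here are illustrative, not Bundle's actual pipeline.
import re
from collections import Counter

raw_descriptions = [
    "ANTHONYS RESTAURANT #123 BRKLY NY",
    "ANTHONYS RESTAURANT #123 BRKLY NY",
    "ANTHONYS RESTAURANT#123 BRKLY NY",
]

def cluster_key(description: str) -> str:
    """Strip store numbers, punctuation, and extra whitespace to get a grouping key."""
    text = description.upper()
    text = re.sub(r"#\s*\d+", " ", text)  # drop store / terminal numbers
    text = re.sub(r"[^A-Z ]", " ", text)  # drop remaining punctuation and digits
    return re.sub(r"\s+", " ", text).strip()

clusters = Counter(cluster_key(d) for d in raw_descriptions)
for key, count in clusters.items():
    print(f"preliminary merchant: {key!r} ({count} transactions)")
# -> preliminary merchant: 'ANTHONYS RESTAURANT BRKLY NY' (3 transactions)
```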
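
Slide 18 describes a retrieve-then-classify flow: pull the top candidate merchants for a clustered description, score each, and link only on high confidence. The sketch below follows that shape, but the hand-rolled weighted score is a stand-in for the supervised classifier mentioned in the talk, and the listing, feature weights, and 0.75 cutoff are made up for illustration.

```python
# A sketch of the Phase 2 flow from slide 18: retrieve a short candidate list from
# the merchant listing, score each candidate, and link only when confidence is high.
# The scoring below stands in for the supervised classifier; all data is illustrative.
from difflib import SequenceMatcher

merchant_listing = [
    {"id": 1, "name": "Anthony's Restaurant", "zip": "11231", "category": "restaurant"},
    {"id": 2, "name": "Anthony's Cleaners",   "zip": "11231", "category": "laundry"},
    {"id": 3, "name": "Antonio's Ristorante", "zip": "10014", "category": "restaurant"},
]

def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def confidence(cluster: dict, merchant: dict) -> float:
    """Weighted blend of name similarity and agreement on supporting fields."""
    score = 0.7 * name_similarity(cluster["name"], merchant["name"])
    score += 0.2 * (cluster["zip"] == merchant["zip"])
    score += 0.1 * (cluster["category"] == merchant["category"])
    return score

cluster = {"name": "ANTHONYS RESTAURANT BRKLY NY", "zip": "11231", "category": "restaurant"}

# Retrieve the best candidates (top 10 in the talk; only 3 exist here), then threshold.
candidates = sorted(merchant_listing, key=lambda m: confidence(cluster, m), reverse=True)[:10]
best = candidates[0]
if confidence(cluster, best) >= 0.75:  # the "high confidence" cutoff is illustrative
    print(f"linked cluster to merchant #{best['id']}: {best['name']}")
else:
    print("left unlinked for review")
```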