New insights from big legacy data at bundle (Presented at Text Analytics World 2011)

Along with co-presenter Alex Hasha, Jaime spoke on “New Insights from ‘Big Legacy Data’: The Role of Text Analytics at Bundle.com.”

For decades, credit card transactions have generated mountains of data about consumer spending habits, but the data formats were designed for archiving and reporting rather than for data mining and pattern discovery. In particular, the merchant's name is embedded in a text field, which also contains other information, without any standard format.

Bundle.com is a new startup that is building a business on the extraction of value from this legacy data source. Text analytics are being used to robustly identify merchants in the dataset as a first crucial step in the extraction of powerful insights about consumer spending behavior.
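As a rough illustration of why a free-text field is awkward to mine, the sketch below reduces a few descriptor strings to a cruder merchant-name key. The sample strings, the `normalize_descriptor` helper, and the cleanup rules are all hypothetical, invented for this example; they are not Bundle.com's actual data or method.

```python
import re

# Hypothetical descriptor strings, invented for illustration only.
RAW = [
    "STARBUCKS #1234 NEW YORK NY",
    "Starbucks Store 00871 New York NY",
    "SQ *JOES PIZZA 212-555-0100",
]


def normalize_descriptor(text: str) -> str:
    """Reduce a free-text descriptor to a rough key (assumed rules, not a standard)."""
    s = text.upper()
    s = re.sub(r"^[A-Z]{2,3} \*", "", s)   # drop a short processor-style prefix such as "SQ *"
    s = re.sub(r"[#*]", " ", s)            # drop stray punctuation markers
    s = re.sub(r"\b\d[\d-]*\b", " ", s)    # drop store numbers and phone-like digit runs
    return re.sub(r"\s+", " ", s).strip()  # collapse whitespace


KEYS = [normalize_descriptor(r) for r in RAW]
```

Even after this cleanup the first two keys differ ("STARBUCKS NEW YORK NY" versus "STARBUCKS STORE NEW YORK NY"), which is one reason exact lookups are not enough and fuzzier text analytics are needed.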

Transcript of "New insights from big legacy data at bundle (Presented at Text Analytics World 2011)"

  1. New Insights from ‘Big Legacy Data’: The Role of Text Analytics at Bundle.com
       Jaime Fitzgerald, President, Fitzgerald Analytics, Inc.
       Alex Hasha, Chief Data Scientist, Bundle.com
       October 2011
       Architects of Fact-Based Decisions™
  2. Agenda for Today’s Talk
       1. Introduction to the Business Model
       2. The Role of Text Analytics
       3. A Key Challenge and How We Overcame It
       4. Takeaways
       5. Q&A
  3. Introduction
       Jaime Fitzgerald, Founder @ Fitzgerald Analytics (@JaimeFitzgerald)
         Responsible for: Transforming data into value for clients; creating meaningful careers for employees
         At a company that: Helps clients convert Data to Dollars™; brings a strategic perspective to improve ROI on investments in technology, data, people, and processes
         Also working on: A movement to Democratize Analytics by Reducing the “Barrier to Benefit” for non-profits, social entrepreneurs, and gov’t
       Alex Hasha, Data Scientist @ Bundle Corp (@AlexHasha)
         Responsible for: Leading development of data products; designing statistical methods / algorithms that transform data into insights for consumers
         At a company that: Uses data to help consumers make better decisions with their money; bends valuable legacy data to new purposes; is growing and hiring!
         Also working on: Learning about and implementing best practices for managing complex data pipelines
  4. For Example, We Help You Decide Where to Spend…
  5. We Do This with Billions of Real Spending Records
       Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households.
       Key Issues with this Data:
         1. Credit card data lacks merchant identifier
         2. So we rely on text analytics to associate transactions with merchants
       [Figure: Example: Credit Card Statement Data]
  6. A Business Model “Built Out Of Data”
       Old Data:
         Card Transaction Data
         Merchant Listings (e.g., Address, Phone Number, Business Type)
         Other Data: Census, Bureau of Labor Statistics, User Feedback
       Transformed in New Ways:
         Normalization
         Clustering
         Linking
         Aggregation
       To Create New Features Such As…:
         People Who Shop Here Also Like…
         The Bundle Loyalty Score
         Data-Driven Reviews From an Array of Customer Segments
  7. The Benefit is to Provide More Accurate, Less Biased Content
  8. Before the Fun Stuff Happens…
       Before we can generate insights about merchants for our users, we must associate each transaction in our database with a specific merchant from a master list.
       Two main problems:
         1. Accurate Fuzzy Matching is Difficult
         2. Scale of Data is Enormous
       This case focuses on the second problem.
       [Diagram: Credit Card Transactions (Billions – 10^9) are linked by Text Matching to a Comprehensive Listing of US Merchants (Tens of Millions – 10^7). The transactions have:]
         • Highly variable text descriptions
         • Noisy geographic info
         • Noisy merchant category info
       Naïve item by item search takes O(10^16) expensive string comparisons: Too Slow!
  9. A “Brute Force” Approach Would Never Work…
       1. Matching w/in Hundreds of Millions of Merchants would require massive processing… Fortunately we don’t need to match at this level.
       2. Batching by local area lets us process orders of magnitude faster.
       [Chart: Processing Time / Workload (0 to 1) versus # of Merchants in Comparison Set (Hundreds, Hundreds of Thousands, Tens of Millions), with levels labeled Neighborhood, City, Nation]
  10. Solution to Scaling Problem
       This is a “Cascade of Scale Reductions”, Parallelizing by Location
       Keys to solving the scaling problem:
         1. Scale Reduction / Parallelized Text Clustering
         2. Free Open Source Software
       [Diagram labels:]
         Credit Card Transactions (Billions – 10^9)
         Batch Transactions by Geographic Neighborhood (batches 1, 2, … 10000)
         Dedupe Description Strings
         Text Clustering (Not Matching): Consolidate Strings Belonging to Same Merchant
         Preliminary Merchant Listing Generated Directly from Transactions (Tens of Millions – 10^7)
         Secondary Fuzzy Matching Process Reconciles Preliminary Listings with Merchant “Source of Truth”
         Final Merged Transaction Data Set
       Computational Efficiency Increased by a Factor of 10^8! Eons -> Days -> Minutes
  11. Takeaways
       1. Tame your data before perfecting your methods. Efficiency enables experimentation, iteration, improvement.
       2. Design your process to minimize unnecessary complexity (e.g. Parallel Processing at Scale, Normalization, Pre-Filtering)
       3. Tools: Take advantage of powerful (and inexpensive) open-source tools that enable your process...
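
The scaling argument in slides 8 to 10 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the presenters' implementation: the slides do not name the open-source tools or the clustering algorithm, so the standard-library `difflib.SequenceMatcher`, the 0.8 threshold, the greedy clustering rule, and the sample records are all stand-ins chosen for this example. The secondary fuzzy match against the merchant "source of truth" is left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Order-of-magnitude counts quoted on the slides.
N_TRANSACTIONS = 10**9
N_MERCHANTS = 10**7
# Naive approach: compare every transaction against every merchant.
NAIVE_COMPARISONS = N_TRANSACTIONS * N_MERCHANTS  # 10**16


def similar(a, b, threshold=0.8):
    """Assumed similarity test; the threshold is an arbitrary illustration value."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def cluster_strings(strings, threshold=0.8):
    """Greedy clustering (an assumption): each string joins the first
    cluster whose first member it resembles, else starts a new cluster."""
    clusters = []
    for s in sorted(strings):
        for cluster in clusters:
            if similar(s, cluster[0], threshold):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def cascade(transactions):
    """transactions: iterable of (neighborhood, description) pairs."""
    # Step 1: batch by neighborhood; the set dedupes identical descriptions.
    batches = defaultdict(set)
    for neighborhood, description in transactions:
        batches[neighborhood].add(description.upper().strip())
    # Step 2: cluster strings within each batch. Batches are independent,
    # so this loop could be run in parallel, one worker per neighborhood.
    return {hood: cluster_strings(descs) for hood, descs in batches.items()}


# Hypothetical records, invented for illustration only.
SAMPLE = [
    ("10001", "JOES PIZZA"),
    ("10001", "JOES PIZZA"),
    ("10001", "JOE'S PIZZA"),
    ("10001", "CITY HARDWARE"),
    ("94103", "JOES PIZZA"),
]
RESULT = cascade(SAMPLE)
```

On the sample data, the two spellings of the pizza shop in neighborhood 10001 end up in one cluster, while the identically named shop in 94103 stays separate because it is never compared across batches. Each reduction (batching, deduping, clustering) shrinks the set of strings that the expensive comparison step has to touch.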