Data Refinement: The missing link between data collection and decisions








Usage Rights

© All Rights Reserved


Presentation Transcript

  • Data Refinement: The missing link between data collection and decisions
    Stephen H. Yu, Data Strategy & Analytics Consultant
  • What we will cover
    • Database Marketing Landscape
    • Analytics and Models
    • “Model-Ready” Environment
    • Data Summarization & Categorization
    • Delivery: Scoring & QC
    • Closing the Loop
  • Big Data, Small Data, Neat Data, Messy Data
    How is “Big Data” working out for you?
    • 2.5 quintillion bytes collected per “day” (1 quintillion bytes = 1 exabyte = 1 billion gigabytes)
    • Did all this data improve your decision-making process?
    • Do you have the results to show for it?
    • Information overload? You bet!
    → Harness insights, drop the noise
  • Database Marketing Landscape
    • No guessing game: you MUST know your target
    • Vast amounts of online & offline data are collected
      → But are they being used properly?
    • Analytics plays a huge role in prospecting & CRM
    • Short marketing cycles are getting even shorter
    • Huge difference between advanced marketers and those who are falling behind
    Winners are the ones who know how to wield the power of all available data faster.
  • Insights, not Raw Data
    • Database marketers must excel in:
      o Collection: size & speed matter
      o Refinement: get to answers, not just ingredients
      o Delivery: for consumption by end-users
    • Must provide “marketing answers” via advanced analytics, not just bits and pieces of data
    • Insight does not come from data; it is derived from data
  • Refined Answers
    Raw Data:
    • Demographic / Firmographic
    • RFM
    • Products & Services Used
    • Promotion / Response History
    • Lifestyle / Survey Responses
    • Delinquent history
    • Call / Communication Log
    • Movement Data
    • Sentiments
    Marketing Answers:
    • Likely to buy a luxury car
    • Likely to take a foreign vacation
    • Likely to respond to a free shipping offer
    • Likely to be a high-value customer
    • Likely to be qualified for credit
    • Likely to upgrade
    • Likely to leave
    • Likely to come back
  • Different Types of Analytics
    “Analytics” means different things…
    • BI (Business Intelligence) Reporting: Display of success metrics, dashboard reporting
    • Descriptive Analytics: Profiling, segmentation, clustering
    • Predictive Modeling: Response models, cloning models, lifetime value, revenue models, etc.
    • Optimization: Channel optimization, marketing spending analysis, econometric models
    Predictive Modeling for 1-to-1 Marketing
  • Why model?
    • Increase targeting accuracy
    • Reduce costs by contacting fewer targets, more smartly
    • Stay relevant
    • Consistent results
    • Reveal hidden patterns in data
    • Repeatable: key for automation
    • Expandable
    • “Supposedly” save time and effort
    Models summarize complex data into simple-to-use “scores”
  • Why NOT model?
    • Universe is too small
    • Predictive data not available
    • 1-to-1 marketing channels not in plan
    • Tight budget
    • Lack of resources
  • Custom Models for every stage of the Marketing Lifecycle
  • Any pain implementing models?
    • Not easy to find “Best” customers
    • Modelers are fixing data all the time
    • Rely on a few popular variables
    • Always need more variables
    • Takes too long to build models and score
    • Inconsistencies shown when scored
    • Disappointing results
  • What does your database support?
    If you have a database, it probably supports:
    • Order Fulfillment
    • Contact Management
    • Standard Reports
    • Ad hoc Reports and Queries
    • Name Selections
    • Response Analysis
    • Trend Analysis
    But does it support predictive modeling and scoring?
  • For modeling, clean the data first
    “Garbage-in, garbage-out”
    • Most data sets are messy & “unstructured”
    • 70-80% or more of model development time goes to data prep work
    • Most databases are NOT model-ready
    • Modeling & Scoring
      o An extension of database work
      o Consistency is “the” key
  • Predictive Modeling is all about “Ranking”
    Ultimately, models must properly “rank.” Determine the level of data accordingly:
    • Households
    • Individuals
    • Products
    • Relational or unstructured databases won’t cut it
    • Must create “Descriptors” that fit the level that needs to be ranked
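A minimal sketch of the ranking idea on this slide: records at one level (here, households) are sorted by a raw model score and cut into equal-sized model groups. The function name `assign_model_groups`, the household IDs, and the scores are hypothetical illustrations, not taken from the presentation.

```python
def assign_model_groups(scores, n_groups=10):
    """Rank records by raw score (highest first) and split them into
    n_groups equal-sized model groups; group 1 holds the best-ranked records."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    groups = {}
    for position, (record_id, _score) in enumerate(ranked):
        groups[record_id] = position * n_groups // len(ranked) + 1
    return groups


# Hypothetical household-level raw scores (illustrative values only).
raw_scores = {
    "H001": 0.91, "H002": 0.15, "H003": 0.62,
    "H004": 0.48, "H005": 0.77, "H006": 0.05,
}
model_groups = assign_model_groups(raw_scores, n_groups=3)
```

Keeping both `raw_scores` and `model_groups` makes it possible to re-cut the groups later without re-scoring.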
  • Unstructured to Structured
    • Most modern databases are optimized for massive storage and rapid retrieval, not necessarily for predictive analytics
      o Relational databases
      o NoSQL databases
    • Need an “Analytical Sandbox” (or database/data mart)
      o Structured & de-normalized
      o Variables as descriptors of model targets
      o Common analytical language (SAS, R, SPSS)
      o Must support “in-database” scoring
  • Analytical Sandbox – “Model-Ready” Environment
    From “all” types of data collection to decisions. Then repeat.
  • Why Is Front-end Data Processing (DP) Important?
    • Inexperienced analysts spend the majority of their time doing DP work
      → Modeling work gets done at the last minute!
    • Creative variables enhance models
    • Inconsistent data sets off a chain reaction that ends in meltdowns
    • Data append/match becomes ineffective
  • First, Define Analytical Goals
    • Rank & select prospect names
    • Cross-sell/Up-sell
    • Segment the universe for messaging strategy
    • Pinpoint attrition point
    • Assign lifetime values
    • Optimize media/channel spending
    • Create product packages
    • Project customer value
    • Detect fraud
  • 3 Major Types of Data for Marketers
    Descriptive Data
    • Demographic Data
    • Firmographic Data
    • Geo-demographic Data
    Transaction Data / Behavioral Data
    • Transaction data
    • Compiled / Co-op data
    • Online behavioral data
    Attitudinal Data
    • Surveys
    • Sentiments
    • Lifestyle data
    3 Dimensions in Predictive Analytics
  • Data Inventory
    • “Modeling is making the best of what we know”
    • Go beyond obvious RFM data
    • Get deeper
      o Product/Service Level Data
      o Historical Data
      o Channel Data: Inbound & Outbound
      o Online activities, sentiments, unstructured data
    • External Data
  • Create Data Menu
    • Base it on a companywide needs analysis
    • Ask the analysts first: What types of models are in the plan?
      o Affinity/Look-alike Models
      o Promotion/Response Models
      o Time-series Models
      o Attrition Models
    • Consider non-analytical departments
    • Maintain the data that fit the objective
      → Don’t be afraid to throw out “noise”
  • Data Menu (continued)
    Check the ingredients
    • What do you have today?
    • What can be bought?
    • What can be created?
    Cost: Can you afford to maintain it?
    • Storage/Platform: Consider the scoring part, too
    • Programming/Processing Time
    • Software
    • Update
    • External Data
  • Check Your Data Inventory
    You may have more than you thought, so let’s start with what you have:
    • Name & Address: Key to Geo/Demographic Data
    • Order Transaction Data: “RFM”, Payment Methods
    • Item/SKU Level Data: Products, Price, Units
    • Promotion/Response History: Source, Channel, Offer
    • Life-to-Date/Past “x” Months Summary Data
    • Customer Level Status Flags: Active, Dormant, Delinquent
    • Surveys/Product Registration Forms: Attitudinal/Lifestyle
    • Customer Communication History Data: Call-center, Web
    • Social Media, Click-through, Page views: Sentiment/Intentions
    All of these need Conversion, Categorization, & Summarization
  • Maximize the Power of Transaction Data
    Most databases describe shopping baskets
    → Start describing your targets
    • RFM data must be summarized (or denormalized)
    • Turn RFM data into individual / household level “Descriptors”
    • Combine with essential categorical variables (e.g., product, offer, channel, etc.)
  • Data Summarization – Matching the level of Data
    Order Table
      Cust ID  Order #  Order Date  $ Amount
      000123   100011   2009-05-06  $199.99
      000123   100128   2010-08-30  $50.49
      000123   103082   2011-12-21  $128.60
      003859   100036   2010-06-06  $43.99
      003859   101658   2011-01-20  $43.99
      003859   102189   2011-04-15  $119.45
      003859   106458   2012-02-18  $43.99
      004593   104535   2012-07-30  $354.72
      016899   107296   2011-07-14  $199.99
      019872   102982   2010-09-07  $128.60
      019872   103826   2011-04-30  $499.99
      019872   109056   2012-03-12  $59.99
    Order Summary Table
      Cust ID  # Orders  $ Total  First Order Date  Last Order Date
      000123   3         $379.08  2009-05-06        2011-12-21
      003859   4         $251.42  2010-06-06        2012-02-18
      004593   1         $354.72  2012-07-30        2012-07-30
      016899   1         $199.99  2011-07-14        2011-07-14
      019872   3         $688.58  2010-09-07        2012-03-12
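The roll-up from order level to customer level on this slide can be sketched as below. The rows are the slide's example orders for customers 004593 and 019872; the function name `summarize_orders` and the dictionary layout are assumptions made for illustration.

```python
from datetime import date
from decimal import Decimal

# (cust_id, order_no, order_date, amount) rows from the slide's Order Table.
ORDERS = [
    ("004593", "104535", date(2012, 7, 30), Decimal("354.72")),
    ("019872", "102982", date(2010, 9, 7), Decimal("128.60")),
    ("019872", "103826", date(2011, 4, 30), Decimal("499.99")),
    ("019872", "109056", date(2012, 3, 12), Decimal("59.99")),
]


def summarize_orders(order_rows):
    """Collapse order-level rows into one summary row per customer."""
    summary = {}
    for cust_id, _order_no, order_date, amount in order_rows:
        row = summary.setdefault(
            cust_id,
            {"orders": 0, "total": Decimal("0"), "first": order_date, "last": order_date},
        )
        row["orders"] += 1
        row["total"] += amount
        row["first"] = min(row["first"], order_date)
        row["last"] = max(row["last"], order_date)
    return summary


customer_summary = summarize_orders(ORDERS)
```

Each customer now has exactly one record, which is the level a customer-ranking model needs.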
  • Sample Variables after Summarization
    Before: Recency, Frequency, Monetary
    After summarization:
    • Recency
      o Weeks since last online purchase
      o Years since member sign up
      o Days since last delinquent date
      o Months since last response date
    • Frequency
      o Orders by offer type
      o Orders by product/service type
      o Payments by pay method
      o Average days between transactions
    • Monetary
      o Total $ past 24 months
      o Life-to-date spending
      o Average dollars by channel
      o Average dollars by product type
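A rough sketch of how a few of these summary variables might be derived for one customer. The function names, the `as_of` reference date, and the choice to approximate "past 24 months" by calendar-month difference are all assumptions for illustration; the order rows are customer 019872 from the previous slide.

```python
from datetime import date
from decimal import Decimal


def months_between(later, earlier):
    """Whole calendar-month difference, ignoring the day of the month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def rfm_descriptors(order_rows, as_of):
    """order_rows: list of (order_date, amount) tuples for ONE customer."""
    dates = sorted(order_date for order_date, _amount in order_rows)
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    return {
        # Recency
        "weeks_since_last_order": (as_of - dates[-1]).days // 7,
        # Frequency (None when there is only one order: incalculable, not zero)
        "avg_days_between_orders": sum(gaps) / len(gaps) if gaps else None,
        # Monetary
        "total_past_24_months": sum(
            (amt for d, amt in order_rows if months_between(as_of, d) < 24),
            Decimal("0"),
        ),
        "life_to_date_spending": sum((amt for _d, amt in order_rows), Decimal("0")),
    }


orders_019872 = [
    (date(2010, 9, 7), Decimal("128.60")),
    (date(2011, 4, 30), Decimal("499.99")),
    (date(2012, 3, 12), Decimal("59.99")),
]
descriptors = rfm_descriptors(orders_019872, as_of=date(2012, 12, 31))
```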
  • RFM Data Summary – Timeline
    • Life-to-date summary provides the historical view
      o May create bias towards tenured customers
    • Put time limits on variables (e.g., 12-month, 24-month, etc.)
      o May require a higher number of variables and complicate the process
    • For Lifetime Value & Time Series Models
      o Must create historical arrays (daily, weekly, monthly counts of events)
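One way to picture a "historical array" is a fixed-length list of monthly event counts per customer. The sketch below assumes a hypothetical `monthly_counts` helper and made-up event dates; daily or weekly arrays would follow the same pattern with a different index calculation.

```python
from datetime import date


def monthly_counts(event_dates, start, n_months):
    """Count events per calendar month, starting at the month of `start`.
    Events outside the window are ignored."""
    counts = [0] * n_months
    for d in event_dates:
        index = (d.year - start.year) * 12 + (d.month - start.month)
        if 0 <= index < n_months:
            counts[index] += 1
    return counts


# Hypothetical order dates for one customer (illustrative only).
events = [date(2011, 12, 31), date(2012, 1, 5), date(2012, 1, 20), date(2012, 3, 12)]
first_quarter = monthly_counts(events, start=date(2012, 1, 1), n_months=3)
```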
  • Who does the summary work?
    Answer: Not the statistician!
    Key takeaway: The data variables must be consistent everywhere in the “Analytical Sandbox”
    • Main analytical database
    • Model development sample
    • Pool of records to be scored
    → Pre-built summary variables in the database
  • Data Categorization
    Free-form data comes to life through categorization. Don’t give up!
    • Hidden data in:
      o Product, service, offer, channel, source, status, titles, surveys, etc.
    • Do you have a categorization guideline?
    • Who will do it?
      o Consider text mining techniques
    • What to throw out?
      o Keep data that matters in predictive modeling
  • Categorical Data
    Any non-numeric data:
    • Product
    • Service
    • Offer
    • Channel
    • Source
    • Market
    • Region
    • Business Title
    • Member Status
    • Payment Status
    • etc.
    Offer code example:
    • Flat Dollar Discount
    • % Discount
    • Buy 1, Get 1 Free
    • Free Shipping
    • No Payment Until…
    • Free Gift
    • etc.
    Categorize as much as possible at the data collection stage
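As an illustration of turning free-form offer text into the offer codes listed on this slide, here is a small rule-based sketch. The code names (`FREE_SHIP`, `BOGO`, and so on), the keyword lists, and the `categorize_offer` function are hypothetical; a real project would define its own rule table and might use text mining instead.

```python
# Ordered rules: the first rule whose keyword appears in the text wins.
OFFER_RULES = [
    ("FREE_SHIP", ("free shipping", "free ship")),
    ("FREE_GIFT", ("free gift",)),
    ("BOGO", ("buy 1", "buy one", "bogo")),
    ("NO_PAY_UNTIL", ("no payment until",)),
    ("PCT_OFF", ("%", "percent")),
    ("FLAT_OFF", ("$",)),
]


def categorize_offer(description):
    """Map a free-form offer description to a single offer code."""
    text = (description or "").strip().lower()
    if not text:
        return "MISSING"  # keep missing values distinct from real categories
    for code, keywords in OFFER_RULES:
        if any(keyword in text for keyword in keywords):
            return code
    return "OTHER"


coded = [categorize_offer(t) for t in
         ["Buy 1, Get 1 Free", "20% off sitewide", "$10 off your order",
          "Free Shipping on $50+", "", "Spring catalog"]]
```

Because the rules are applied in a fixed order by one function, the same text always gets the same code at collection, summarization, and scoring time.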
  • Categorization Guidelines
    Be consistent throughout:
    • Survey Form
    • Data Entry
    • Inventory Database
    • Data Collection & Compilation
    • Summarization
    • Modeling and Scoring
    Create a “Code” structure
    • NEVER allow free-form answers
  • Categorization Guidelines (continued)
    • Create rules and DON’T deviate from them
      o Training & Automation
    • The more specific, the better
      o Combine them later
    • But don’t allow too many variations (over 20) in one code
      o Break into multiple codes if necessary
    • Don’t forget the end goals and don’t overdo it
      o Must be “relevant”
  • Data Hygiene & Data Append
    Do you know how many customers you have in your database?
    • Data conversion
      o Create consistency
      o Standardization
      o Edit
      o Purge
    • Cover all bases: PII & RFM Data
    • Create rules and be consistent
  • PII – Gateway to External Data
    What is hidden behind a simple name & address?
    • Standardize Name & Address first
      o Maintain PII (Personally Identifiable Information)
      o Hygiene via periodic NCOA and standardization
    • First & Last Name: Ethnicity, Gender
    • Name, Address, Email: Demographic Data
    • Address: Geo-demographic, Census Data
    • Zip: County, Market Region, DMA
  • External Data
    • Always consider buying data before collecting and building
    • Compiled Demographic / Firmographic Data
    • Behavioral / Transaction / Co-op Data
    • Lifestyle / Attitudinal / Survey
    • Census / Geo-Demographic Data
  • External Data Check List
    • Test multiple data sources
      1. Depth of information / Uniqueness
      2. Coverage / Match Rate
      3. Consistency / Update Cycle
      4. Price
      5. User-friendliness
      6. Delivery Options: Real-time?
    • Learn about the data sources
      o What’s real and what’s imputed?
      o Don’t stop at Demographic: always consider “Behavioral” data
  • Missing Values
    “Missing data can be meaningful.”
    • For Numeric Data (e.g., $, Counters, Dates, etc.)
      o Incalculable vs. Data-append Non-matches
      o Missing is missing: DO NOT fill in with 0’s
    • For Categorical Data (e.g., Codes, Text, etc.)
      o Leave room for “N/A” (e.g., blank, “N/A”, “0”, “.”, etc.)
      o Code “Non-matches” to external files differently
    And yet, unstructured databases only store what’s available.
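The distinctions on this slide can be sketched in a few lines: an incalculable numeric value stays missing (`None`) rather than becoming 0, and an appended categorical field gets a different code for "no match to the external file" than for "matched, but no value". The function names and the `"NOMATCH"` / `"N/A"` codes are hypothetical choices for illustration.

```python
def average_order_amount(total_dollars, n_orders):
    """Return None (missing) when the average is incalculable; never 0."""
    if total_dollars is None or not n_orders:
        return None
    return total_dollars / n_orders


def code_appended_value(match_found, value):
    """Code an appended categorical field, keeping non-matches distinct."""
    if not match_found:
        return "NOMATCH"  # record was not found in the external file
    if value is None or value == "":
        return "N/A"      # record matched, but the source had no value
    return value


no_orders = average_order_amount(None, 0)
two_orders = average_order_amount(100.0, 2)
```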
  • More on missing data
    • Agree on Imputation Rules
      o Do it upfront
      o Must be part of the scoring code
    • Educate non-analysts
      o Hard to undo when combined with other values
      o Train vendors
    • Always check % missing
      o Development Sample vs. Live Database
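A minimal sketch of the "% missing" check between a development sample and the live database. Records are assumed to be dictionaries with `None` for missing values; the function names, variable names, and the 5-point tolerance are illustrative assumptions.

```python
def pct_missing(records, variable):
    """Percentage of records where `variable` is missing (None or absent)."""
    missing = sum(1 for record in records if record.get(variable) is None)
    return 100.0 * missing / len(records)


def missing_rate_gaps(dev_sample, live_db, variables, tolerance_pts=5.0):
    """Return variables whose % missing differs by more than the tolerance."""
    gaps = {}
    for variable in variables:
        diff = abs(pct_missing(dev_sample, variable) - pct_missing(live_db, variable))
        if diff > tolerance_pts:
            gaps[variable] = diff
    return gaps


# Hypothetical records (illustrative only).
dev = [{"income": 50, "age": 40}, {"income": None, "age": 35},
       {"income": 70, "age": 52}, {"income": 65, "age": 29}]
live = [{"income": None, "age": 41}, {"income": None, "age": 33},
        {"income": None, "age": 58}, {"income": 80, "age": 47}]
flagged = missing_rate_gaps(dev, live, ["income", "age"])
```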
  • Scoring – Sample vs. Database
    • Development Sample vs. Live Database
      o Database Structure
      o Variable List/Name
      o Variable Value
      o Imputation Assumption
    • It leads to disasters if “anything” is different
    • Do NOT play with model groups that are set in the development sample
  • Scoring QC
    Most troubles happen after the models are built…
    Check:
    • Model group distributions
    • Variable distributions (values and indices)
    • Missing values
    • Match rate for appended data
    • Scoring code, including score breaks
    • Compare to previous runs: check for deterioration
    Set parameters for acceptable differences and enforce them
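One of these checks, comparing model group distributions against a previous run with a preset tolerance, might look like the sketch below. The function names, the sample group assignments, and the `max_shift` threshold are hypothetical.

```python
from collections import Counter


def group_distribution(model_groups):
    """Share of records falling in each model group."""
    counts = Counter(model_groups)
    total = len(model_groups)
    return {group: count / total for group, count in counts.items()}


def distribution_shifts(previous_groups, current_groups, max_shift=0.02):
    """Return groups whose share moved by more than max_shift between runs."""
    previous = group_distribution(previous_groups)
    current = group_distribution(current_groups)
    shifts = {}
    for group in sorted(set(previous) | set(current)):
        change = current.get(group, 0.0) - previous.get(group, 0.0)
        if abs(change) > max_shift:
            shifts[group] = change
    return shifts


# Hypothetical model group assignments from two scoring runs.
last_run = [1, 1, 2, 2, 3, 3, 3, 3]
this_run = [1, 2, 2, 2, 3, 3, 3, 3]
out_of_tolerance = distribution_shifts(last_run, this_run, max_shift=0.1)
```

A non-empty result would be a signal to stop and investigate before the scores are released.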
  • Score installation in the database
    • Plan ahead to store model scores in the database and data marts
      → Sync with the main database
    • Store raw scores, not just model groups
    • Match the levels of scores
      o Household
      o Individual
      o Email
      o Product
  • Back-end Analysis
    Close the loop properly!
    • 1-to-1 Marketing 101: “Learn from past campaigns”
    • Must plan ahead
      o No excuse for not doing it
      o Schedule ahead & budget properly
      o Keep historical promo/response data (even without marketing databases)
    • Set metrics upfront
      o List/Data Source
      o By Offer, Creative, Season, Product, etc.
  • Response Reports & BI Analytics
    • Start with “Canned” Reports and BI Dashboards from vendors
    • Don’t insist on real-time for every report
    • Get ready to create “Custom” reports and dashboard views
      o Prioritize what you want
    • Format and Delivery
    • Timing and Interval
    • Timeline to be covered (YTD, 12-mo, etc.)
  • Key ROI Metrics
    Set ROI metrics, such as:
    • Open, Click-through, Conversion Rates
      o What is the “denominator” in each?
    • Revenue
      o Per 1,000 mailed/calls
      o Per Order
      o Per Display, Email, Click-through, Conversion
    • By
      o Source, Campaign, Time Period, Model Group, Offer, Creative, Targeting Criteria, Channel (in & outbound), Ad server, Publisher, Keyword, Script, Daypart, etc.
    • Key variables must reside in the database
      o Keep them in “ready-to-use” format
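The point about denominators can be made concrete with a small sketch: each rate below states explicitly what it is divided by. The function name `campaign_metrics` and all campaign figures are hypothetical, chosen only to illustrate the arithmetic.

```python
def campaign_metrics(mailed, responses, orders, revenue):
    """Basic ROI metrics with explicit denominators."""
    return {
        # responses / pieces mailed
        "response_rate": responses / mailed,
        # orders / responses (not orders / mailed)
        "conversion_rate": orders / responses if responses else None,
        "revenue_per_1000_mailed": 1000 * revenue / mailed,
        "revenue_per_order": revenue / orders if orders else None,
    }


# Hypothetical campaign totals (illustrative only).
metrics = campaign_metrics(mailed=50000, responses=1000, orders=250, revenue=25000.0)
```

Running the same function by source, offer, model group, or channel gives comparable numbers across campaigns.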
  • Where to Begin with the Analytical Sandbox
    Spec it out:
    • Project Goal
    • Data Source List (as detailed as possible)
    • Final Variable List
    • Project Flow:
      o Collection
      o Conversion & Categorization
      o Summarization
      o Matching / Data Append
      o Development Sample
      o Scoring
      o Storage
      o Backend Analysis
  • Who will build the sandbox?
    • In-house vs. outsourcing; must consider:
      o Platform
      o Software
      o Programming
      o Staffing
    • Cost it out
      o Don’t forget the update cost
    • Go back to the analysts for variable list review
    • Don’t be shy: ask for help from consultants
  • 10 essential items to consider when outsourcing analytical projects
    It is not just about modeling, but all surrounding services as well
      1. Consulting capability: Translate marketing goals into mathematics
      2. Data processing: Conversion, edit, summarization, data-append, etc.
      3. Pricing structure: Model development is only one part; hidden fees?
      4. Track record in the industry: Not in rocket science, but in marketing
      5. Types of models supported: Watch out for one-trick ponies
      6. Speed of execution: Turnaround time measured in days, not weeks
      7. Documentation: Full disclosure of algorithms, charts and reports
      8. Scoring validation: Job not done until fully scored and validated
      9. Back-end analysis: For true “Closed-loop” marketing
      10. Ongoing support: Periodic review and update
  • Scope it out
    • Know what you need, but don’t overdo it
      “Modeling is making the best of what’s available”
    • Take a phased approach
      o If budget is tight, start with low-hanging fruit
      o Run a proof of concept without full database commitment in the beginning
      o Maintain consistency
      o Keep historical data
  • Key Takeaways
    • Invest in analytics: It is about the insights, not just the data
    • Set business & analytical goals first
    • Advanced analytics requires its own sandbox
    • Ensure consistent data every step of the way: From sampling to scoring
    • Check every data source, but don’t wait for a perfect data set
    • Match the levels of data (Data Summary)
    • Don’t overdo it: employ a phased approach
    • Ask for help
  • Stephen H. Yu · Data Strategy & Analytics Consultant · 201.218.2068