BAQMaR 2008
BAQMaR Conference 2008

    BAQMaR 2008 Presentation Transcript

    • 600+ analysts
    • Our dreams
    • Inspiration
    • Online & offline Marketing research domains Friend organisations Neighbour countries
    • website!
    • Join the conversation !
    • Marketing Researchers TODAY ?!
    • C L Well known Popular Unique
    • An explosion of data !
    • Fatigue among respondents !
    • Me, MySpace & I Visual (n) ethnography among 13-17 year olds Joeri Van den Bergh (InSites Consulting) Veerle Colin (MTV Networks)
    • The Research Briefing WHY TELL ME WHY
    • Getting intimate with our target groups ID construction of adolescents in a digitalised world Social groups as an extension of psychographic segments Role of brands in ID construction within social groups
    • The Research Approach I SEE YOU, BABY
    • Traditional ethnography: participant in context, observed by the researcher (R) • Hawthorne effect • Researcher gaze • Time & cost intensive
    • From traditional to visual ethnography • Traditional ethnography: participant observed in context by the researcher (R); Hawthorne effect; researcher gaze; time & cost intensive • 360° visual ethnography: participant in context; contact with researcher via 2.0 tools; “informer” gaze: what is important?; follow multiple participants from anywhere
    • 360° Ethnography 1. User generated MM ethnography = Observation takes place via photos/video taken by the participants in the study = Participants observe their own environment & report back to the researcher
    • Ethnographic blog: identity related tasks • Non me, pictures of: clothes you no longer want to wear; youngsters you really do not want to be friends with • Aspirational me, pictures of: favorite clothes; youngsters you would like to be friends with • Other groups: pictures of other groups that are different but okay; pictures of youth of today; pictures of normal persons • Personal me, pictures of: place where you can really be yourself; clothes you wear at home; objects that are typical for me • Social me, pictures of: important persons in your life; clothes you wear to go out; groups of youngsters you like
    • 360° Ethnography 1. User generated ethnography = Observation takes place via photos/video taken by the participants in the study = Participants observe their own environment & report back to the researcher 2. Nethnography: the social life of teenagers is very much MOVING ONLINE!!! = Observation of the online behaviour & content of a target group or within a certain webspace
    • Nethnography: we are your friends
    • Social Network Sites: nethnography • Social identity and personal identity • Nickname • Conversation on guestbook • Clan membership • Profile text • Monitoring • Profile picture • Photo collection
    • And for the datamining freaks among you, Annelies ripped the internet • 300 active participants of Netlog randomly selected, with an equal spread of age x gender • Webcrawlers to “scan” pages of Netlog and extract content • Textmining: profile pages, photo tags, clan membership, conversations on the guestbook
    • Other online behaviour: tracking tool
    • That’s why we do call it (visual) (n)ethnography: shoot me! 100% UGC OBSERVED BY US
    • The Research Analysis THE HARD PART
    • Analysis INTIMACY
    • ME-SEUM
    • Social networks are not as social as you think they are. Only 32% of the conversations on the guestbook are “interactive”! The rest are all single statements. [Word cloud of conversation topics: Food, Little kids, Shoes, Mobile phones, Cars, Quarrel, Miss you, Making fun, Confirm friendship, Travel, Music, Welcome by strangers, Transport, It was great, Feedback on picture, I’m bored, Youth movement, Alcohol, Sleeping, How are you?, Practical appointment, School, Feedback on profiles, MSN, Festivals, Congratulations, Family, Gaming, Age, Party, Sport, Online movies, Express love, Study, Tokio Hotel]
    • One of the 3 key research questions: social groups among today’s youngsters
    • To what youngster group do I belong?
    • Methodology: ME, Non group, Aspirational group, Social group, Different but OK
    • [Map of youngster groups. Labels: REAL WORLD, WE, ME (I’m better), CHANGE, CONSERVATISM, THINK, DRINK, MAINSTREAM. Groups: Skater, Alternative, Fashion girl, Fashion boy, Tektonic, Rapper, Rockers, Hippies, Breezer sluts, Punk, Jumpers, Emo, Nerds, Gothic, Geek girls]
    • spartacus121, best PK (RuneScape), 30/04/2008 19:07:55, by laurent: I hit something around 27 and that is already crazy high, his hits are insane lol. His sword is 140M (I have 10M)
    • [Map of youngster groups (REAL WORLD, WE, ME (I’m better), CHANGE, CONSERVATISM, THINK, DRINK; Skater to Geek girls), overlaid with the label SKILLS]
    • [Map of youngster groups (REAL WORLD, WE, ME (I’m better), CHANGE, CONSERVATISM, THINK, DRINK; Skater to Geek girls), overlaid with the label LOOKS]
    • [Map of youngster groups (REAL WORLD, WE, ME (I’m better), CHANGE, CONSERVATISM, THINK, DRINK; Skater to Geek girls), overlaid with both SKILLS and LOOKS]
    • How this research changed our lives REMEMBER ME
    • 6 changes that rocked our socks off • The tools for ID construction have changed • New online quali methods proved to be efficient • Our target group = the new and better quali researchers • New reflection kit for entire MTV Networks staff • Closer connection with MTV & new clients • Redefine content strategy of TMF on screen & on line
    • 4C Consulting Introduction to our services
    • Our Mission | Boosting your customer value 4C Consulting helps companies win, keep and grow customer value
    • Our Solutions | Call us for… Customer Value Strategy, Customer Insight, Process Excellence; Business requirements definition, Package selection & implementation, Post-launch care
    • Our Focus | Boosting your customer value
      Increase Revenue: 1. Acquire new customers 2. Sell more to existing customers • More of the same (increase turnover) • Expand value proposition portfolio (cross-sell & product development *) • Upgrade value proposition (Upsell) 3. Prevent existing customers from leaving
      Reduce Costs: 1. Efficient Delivery (Process Excellence) 2. Align value propositions 3. Advanced pricing
    • Business Intelligence Practice SCV ROMSCCI Competing on Analytics Infrastructure Audit Operation Coaching Data Quality
    • Why 4C Consulting | 7 compelling reasons 1. 100% focus on customer value management 2. Result-oriented project approach 3. Connecting marketing, sales & customer care with senior management and IT 4. Independent consultant for 10 years 5. Experienced crew, passionate about marketing, sales & customer care 6. Value based pricing model 7. Satisfied & loyal customers: 90 customers, more than 380 projects
    • 60
    • Optimize your business with Business & Decision Michel Meulders - Domain Manager -
    • Business & Decision Benelux • Founded in 2002 • Merger of several companies specialised in BI • Business & Decision Benelux is a Consulting & System Integrator: more than 300 consultants; about 18 mio Euro turnover in 2007; 58% organic growth compared to 2006; last acquisition: BnV Group (BE+NL) • A multi-specialist in specific technology fields: Business Intelligence; Customer Relationship Management; Life sciences; Risk & Compliance • With foreign offices in Brussels, Amsterdam, Luxembourg • Top accounts in finance, pharma, telco, distribution, industry (Fortis, ING, ABN Amro, Dexia, GSK, UCB, Proximus, Belgacom, Carrefour, Honda, …) [Chart: consolidated turnover evolution in thousands of Euro (0 to 25000) for Belgium, Luxembourg and the Netherlands, 2004 to 2007 plus objective 2008]
    • For more info see http://www.businessdecision.com
    • Belgian Federation of Market Research Institutes www.febelmar.be
    • Febelmar mission • Development and promotion of market research and opinion polls in Belgium • Protecting the sector interests • Watching over correct use of deontological rules of market research in all phases of the market research process • Stimulating continuous improvement of quality of service in market research • Being a platform for communication, exchange of expertise and networking
    • Members 27 agencies. Together they represent about 75% of the total market research expenditures in Belgium.
    • What is I4BI? i4bi specialises in implementing Business Intelligence solutions in your company. Our team of BI experts has deep functional and technical experience with application development, business model definitions and data warehousing. Functional/technical design, development, application roll-out, training, etc. are project phases in which our consultants have many years of experience. i4bi consultants have a deep knowledge of the Oracle Business Intelligence products and solutions. i4bi sponsors the development of an independent analytical branch, which will probably see the light in 2009
    • What do we provide? We provide Business Experience, Technical Abilities and Analytical Expertise to support Strategic Decision Making
    • Expertise • Analytical Expertise • Data Mining • Statistical modelling • Predictive analysis • Basel II compliant modelling • Forecasting • Business Analysis Expertise • Reporting • Delivering Business Insight to decision makers • Marketing Analysis • Data Quality • Use of technical tools such as SAS – SPSS – Statistica to support & extend business knowledge
    • Contact • For more general information: www.I4BI.be • For more analytical information: Filip.deroover@I4BI.be
    • InSites Consulting 6 beliefs in 60 seconds
    • We believe ... in the empowered consumer Human-to-human interactions are more powerful than ever and can make or break your brand “Consumers are beginning in a very real sense to own our brands and participate. We need to begin to learn how to let go” A.G. Lafley, CEO & Chairman of P&G
    • We believe ... in giving back Rewarding experiences for participants Active involvement of panel members Charity contributions
    • We believe ... in connecting Everything we do is aimed at strengthening connections between you, your market and us “Connected Research” brings you closer to your market and taps into the wisdom of the crowds Some of our connected research methods Research communities Bulletin boards Blog research Online discussion groups More information: http://connectedresearch.insites.eu/
    • We believe ... in the power of new research methods for better marketing decision making Informational Providing more depth to research insights Transformational Doing things that were previously not possible Automational Conducting research more efficiently
    • We believe ... in 1 + 1 = 3 Old and new methods need to be optimally “fused” in order to fully grasp the new customer / consumer reality
    • We believe ... in the power of our team People make the difference Open, forward thinking, dedicated, passionate Specific knowledge centers
    • performance management, consulting, technology. Welcome to Keyrus. BAQMaR, 17 December 2008 ©Keyrus – all rights reserved
    • About Keyrus (Belgium) • founded in 1996 as SOLIDPartners • focus on performance management, business intelligence & data warehousing • strong and balanced client base spread over different industries • +100 consultants specialised in both technical and business domains • part of Keyrus group (France)
    • Keyrus’ global footprint: head office in Paris; present in 9 countries; +1300 employees; listed on the Euronext Paris stock exchange
    • Vision & mission Keyrus will be one of the few leading service providers in the area of performance management. We help our clients to effectively design, build and operate the adequate performance management organization and solutions in an integrated end-to-end fashion.
    • Portfolio of solutions & services: Information Management (IM), Business Intelligence Platforms (BIP), Analytic Applications (AA), People and Processes (P&P), Corporate Performance Management ((C)PM). [Architecture diagram: source systems & applications; data delivery & management functions; data warehouse & data marts; information outflow functions; BI layer (reports, OLAP, dashboards, alerts); Analytic Applications (e.g. data mining); CPM Applications (e.g. planning, ABM, PA); CPM data delivery, exchange & synchronization]
    • Contact us: Keyrus nv, Nijverheidslaan 3/2, B-1853 Strombeek-Bever, t +32 2 706 03 00, f +32 2 706 03 09, info@keyrus.be, www.keyrus.be
    • Introduction of Profacts
    • Who is PROFACTS? We are “the new kids on the block” in (online) market research ...
    • REVEALING FACTORS FOR SUCCESS: 1 strategy; 200% growth rate; 28 yrs mean age; 6 people are now working @ Profacts; 2 people have founded Profacts. [Timeline: Q4 2006 to Q4 2008]
    • REVEALING FACTORS FOR SUCCESS Profacts is active in more than 10 sectors ... AUTOMOTIVE, FMCG, RECRUITMENT, GPS, BANKING, INSURANCES, ICT, PHARMACEUTICAL, TELECOM, ENERGY
    • Python Predictions
    • Python Predictions PREDICT
    • Python Predictions GROWTH
    • Python Predictions www.pythonpredictions.com
    • Rogil Research A research agency with a view
    • MARKETING & SENSORY RESEARCH, OUR PASSION • Sensory research, Eye-tracking / Eye|watch, Tachistoscope, Trained panel, Consumer panel, Taste lab • FTF research (Mobile Unit), Telephone research, Online research, Panel services, Fieldwork in Europe
    • Sensory Safari. Note down in your agenda: SENSORY SAFARI • March 26th 2009 • 18:00 • At Rogil in Leuven
    • Sensory Safari 5 SENSES MARKETING
    • We hope to welcome you in March. Thanks for your attention !
    • SAS Analytics For Challenging Times Start Focused, Think Wide
    • Campaign Management Requires Optimization
    • CRM is becoming Risk Management
    • SAS Breadth of Analytic Offering… • Statistical Analysis • Survey Design/Analysis • Data Mining • Text Mining • Time Series Mining • Forecasting • Quality Improvement • Operations Research
    • SAS Innovations in Marketing Solutions….
    • http://www.sas.com/feature/analytics/index.html Copyright © 2006, SAS Institute Inc. All rights reserved.
    • The Mission Drive the widespread use of data in decision making
    • The Focus Attract Retain Grow Fraud Risk Driving and Maximizing Profit
    • The Vision Operational Processes Interaction data Attitudinal data Descriptive data Behavioral data Enterprise Enterprise Data Data Sources Sources
    • The Acceptance • The rise of the agnostics  Science vs. Chance  In numbers we trust!
    • The myth of the “best” algorithm: lessons learned from innovations in data sampling and data pre-processing for marketing analytics. Dr. Sven F. Crone, Deputy Director, Ass. Prof.
    • Associated Experts Prof. Paul Goodwin Directors Dr. Andrew Eaves Prof. Robert Fildes Prof. Peter Young Research & PhD students Dr.Sven F. Crone Heiko Kausch, RA Stavros Asimakopoulos Researchers Xi Chen Dr. Steve Finlay Bruce Havel Dr. Alastair Robertson Suzi Ismail Dr. Didier Soopramanien Nikolaos Kourentzes Dr. Kostas Nikolopoulos Ioannis Stamatopoulos Andrey Davidenko Prof. Stephen Taylor Charlotte Brown Dr. Wlodek Tych Hong Juan Liu Prof. David Peel T Hu Prof. Peter Pope John Prest Huang Tao Visiting Researchers Prof. Geoff Allen Dr. Yukun Bao Young-Sang Cho
    • “Take away this pudding, it has no theme.” Sir Winston Churchill (1915)
    • Agenda • Sampling issues in Data Mining • Case study 1: Direct Marketing • Cross-selling of Magazine subscriptions • Effect of data preprocessing: Sampling • Interaction of Sampling with Scaling & Coding • Case study 2: Credit & Behavioral Scoring • Predicting consumer credit default • Effects of sample size • Effects of sample distribution • Case study 3: Online Shopping Behaviour • Predicting consumer shopping channel choice • Sample distribution & multiple classes • Conclusion & Take-aways
    • Why (Under/Over) Sampling? • Knowledge Discovery in Databases (KDD) = non-trivial process of identifying valid, novel, useful patterns in large data sets • Data Mining = only one single step in the KDD process • Data sample determines the whole process! (GIGO: garbage in, garbage out) • “Research seems preoccupied with algorithms” [Hand 2000] [Figures: SAS SEMMA DM process; CRISP-DM process; monitoring]
    • Sampling in Direct Marketing Literature?
      Columns: reference; input type*; methods***; X marks across the preprocessing columns (parameter tuning; data reduction**: feature selection, resampling; data projection: standardisation and discretisation of continuous attributes, coding of categories). The column position of each X is not preserved in this transcript.
      [2]   2  BMLP, LR, LDA, QDA: X X
      [42]  1  MLP, LR, CHAID: X X
      [43]  2  MLP, RBF, LR, GP, CHAID: X X
      [44]  3  MLP, LR, LDA: X X
      [4]   2  CHAID, CART: X
      [6]   2  MLP, LR: X X X X X
      [9]   2  LVQ, RBF, 22 DT, 9 SC: X X
      [45]  2  LDA, LR, KNN, KDE, CART, MLP, RBF, MOE, FAR, LVQ: X X
      [3]   1  MLP: X X
      [7]   2  LSSVM: X X X
      [11]  2  LR, LS-SVM, KNN, NB, DT: X X X
      [10]  1  LDA, QDA, LR, BMLP, DT, SVM, LSSVM, TAN, LP, KNN: X X
      [46]  2  LR, MLP, BMLP: X X
      [47]  2  LSSVM, SVM, DT, RL, LDA, QDA, LR, NB, IBL: X X
      [48]  1  DT, MLP, LR, FC: X
      [49]  1  FC: X X
      Majority of direct marketing papers focus on algorithm tuning • Only 3 papers consider Resampling / Instance Selection • No analysis of the interaction with Sampling & Projection & …
    • Classification [Scatter plot: days since last purchase (few … many) against number of subscriptions (1 … many); last campaign: no response vs. subscribed to magazine] • Database of customers (instances) • Known attributes for all customers (age, gender, existing subscriptions, …) • Known response (class membership) of buyers & non-buyers from past mailings • Build a model to separate classes → decision boundary of different complexity
    • Classification [Scatter plot: days since last purchase (few … many) against number of subscriptions (1 … many); classes: no response, subscribed to magazine, class unknown] • Use the decision boundary to classify unseen instances • Calculate on which side of the hyperplane the instances lie (or their distance) • Assign class to unseen instances
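The classification step described on this slide can be sketched as follows. This is a minimal illustration assuming a linear decision boundary; the weights `W`, the offset `B` and the two example customers are hypothetical and not taken from the presentation.

```python
import math

def signed_distance(x, w, b):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def classify(x, w, b):
    """Assign +1 (subscribed) or -1 (no response) by the side of the boundary."""
    return 1 if signed_distance(x, w, b) >= 0 else -1

# Hypothetical boundary over (days since last purchase, number of subscriptions).
W = (-0.01, 1.0)
B = -1.5

recent_buyer = classify((30, 4), W, B)     # few days, many subscriptions
inactive_buyer = classify((400, 1), W, B)  # many days, one subscription
print(recent_buyer, inactive_buyer)
```

The signed distance can also serve as a score for ranking customers, not only as a hard class assignment.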
    • Reality Check: Imbalanced classes [Scatter plot: days since last purchase against number of subscriptions; classes: no response, subscribed to magazine] Problem • Classifiers are biased towards the majority class • Shifts the decision boundary • Error / accuracy based learning creates naïve classifiers • Invalid separation of classes. Balanced dataset = class distributions are equal, P(x|y=A) = P(x|y=B) → proportional sampling or stratified sampling feasible. Imbalanced dataset = class distributions unequal, P(x|y=A) >> P(x|y=B). The class of interest is often the minority (in most business applications)
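Why accuracy-based learning produces naïve classifiers can be shown with a few lines of arithmetic. The sketch below assumes an illustrative population of 1,000 customers with 13 responders (1.3%); the numbers and helper names are made up for the example.

```python
def accuracy(y_true, y_pred):
    """Share of instances whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of true positives that the classifier actually finds."""
    found = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return found / total

# Illustrative data: 13 responders (class 1) among 1,000 customers.
y_true = [1] * 13 + [0] * 987
# A naive classifier that always predicts the majority class.
y_naive = [0] * len(y_true)

naive_accuracy = accuracy(y_true, y_naive)
naive_recall = recall(y_true, y_naive)
print(naive_accuracy, naive_recall)
```

The naive model scores very high on accuracy while finding none of the responders, which is the class of interest.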
    • Imbalanced Data Sampling [Scatter plot: days since last purchase against number of subscriptions; classes: no response, subscribed to magazine] • Stratified Random Sampling: divide the DB into mutually exclusive strata (subpopulations) & draw random samples from each • Proportional: assure proportions in samples equal those in the population • Disproportional: weighted over- & undersampling of important classes • Size of the sample? • Distribution / location of the sample?
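A minimal sketch of stratified random sampling, assuming records are plain tuples and the stratum is read by a caller-supplied `key` function. The function name, the `fractions` parameter and the example data are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fractions, seed=0):
    """Draw a random sample from each stratum.

    fractions maps stratum label -> sampling fraction. Equal fractions give
    a proportional sample; unequal fractions give a disproportional one.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for label, members in strata.items():
        k = round(fractions[label] * len(members))
        sample.extend(rng.sample(members, k))
    return sample

# Illustrative database: 20 buyers and 980 non-buyers.
records = [("buyer", i) for i in range(20)] + [("non-buyer", i) for i in range(980)]

proportional = stratified_sample(
    records, key=lambda r: r[0], fractions={"buyer": 0.1, "non-buyer": 0.1})
disproportional = stratified_sample(
    records, key=lambda r: r[0], fractions={"buyer": 1.0, "non-buyer": 20 / 980})
```

With equal fractions the sample keeps the 2% buyer share; with the unequal fractions above it contains as many buyers as non-buyers.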
    • Random Undersampling [Scatter plot: days since last purchase against number of subscriptions; classes: no response, subscribed to magazine] • Exclude random instances of the majority class • Retain all instances of the minority class • Establish a balanced class distribution. Benefits • Helps detect rare target levels. Risks • Biases predictions (correctable) • Loses information contained in instances of the majority class • Creates different boundaries • Increases prediction variability • …
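Random undersampling as described on this slide can be sketched in a few lines. The function name and the example data are hypothetical.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Keep every minority instance and an equally sized random subset
    of the majority class, drawn without replacement."""
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, len(minority))
    return kept_majority, list(minority)

# Illustrative data: 987 non-responders, 13 responders.
non_responders = [("no response", i) for i in range(987)]
responders = [("subscribed", i) for i in range(13)]

balanced_majority, balanced_minority = random_undersample(non_responders, responders)
```

Everything not drawn from the majority class is discarded, which is the information loss listed under the risks.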
    • Random Oversampling [Scatter plot: days since last purchase against number of subscriptions; classes: no response, subscribed to magazine] • Retain all instances of the majority class in the sample • Duplicate identical instances of the minority class • Establish a balanced class distribution. Benefits • Helps detect rare target levels • No loss of information. Risks • Biases predictions (correctable) • Increases prediction variability • Increases processing time
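A sketch of random oversampling, together with one common way to correct the resulting bias in predicted probabilities. The slide only says the bias is "correctable"; the prior-correction formula used here is a standard technique chosen for illustration, not necessarily the one the presenter had in mind, and all names are hypothetical.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Keep all instances and duplicate random minority instances
    (with replacement) until both classes have the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

def correct_for_priors(p_sample, prior_population, prior_sample):
    """Map a probability predicted on the rebalanced sample back to the
    population scale, given the minority-class prior in each."""
    pos = p_sample * prior_population / prior_sample
    neg = (1 - p_sample) * (1 - prior_population) / (1 - prior_sample)
    return pos / (pos + neg)

majority = list(range(987))
minority = ["a", "b", "c"]
_, oversampled_minority = random_oversample(majority, minority)

# A score of 0.5 on a 50/50 sample corresponds to the population base rate.
corrected = correct_for_priors(0.5, prior_population=0.013, prior_sample=0.5)
```

The correction leaves the ranking of customers unchanged; it only rescales the scores so they can be read as population-level probabilities.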
    • Ready for more theory…? No, rather some case studies ...!
    • Business Case: Direct Marketing/Response Optimization • Sell a magazine subscription to existing customers • Whom to send mail to? (Which customers are most likely to respond?) • How many customers to contact? (What is the optimal mailing size?) • Corporate project with leading German Publishing House • Provided data set of past mailing campaigns • Benchmark novel methods against in-house SPSS Clementine • Explore Neural Networks (NN) and Support Vector Machines (SVM)
    • Benefits of Direct Marketing
                                          Simple                With data mining
      Addressees                          100.000               Top 40% = 40.000
      Cost                                2€/mail = 200.000€    2,5€/mail = 100.000€
      Response rate                       0,5% = 500            1,0% = 400
      Sales volume per response           300€                  300€
      Sales volume                        150.000€              120.000€
      Revenue (sales volume minus cost)   -50.000€              20.000€
      Smaller mailing (number of letters sent) → lower costs (Euro 1.- per letter) • Higher response rate → higher revenue • More specific mailing → lower cost • More relevant information → higher customer satisfaction
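The arithmetic behind the two campaign scenarios on this slide can be reproduced with a short function. The inputs are the slide's figures; the function and field names are illustrative.

```python
def campaign_result(addressees, cost_per_mail, response_rate, value_per_sale):
    """Return cost, responses, sales volume and net result of one mailing."""
    cost = addressees * cost_per_mail
    responses = addressees * response_rate
    sales = responses * value_per_sale
    return {"cost": cost, "responses": responses, "sales": sales, "net": sales - cost}

# Simple mailing to everyone vs. mailing to the top 40% selected by a model.
simple = campaign_result(100_000, 2.0, 0.005, 300)
targeted = campaign_result(40_000, 2.5, 0.010, 300)

for name, result in (("simple", simple), ("with data mining", targeted)):
    print(name, {k: round(v) for k, v in result.items()})
```

The targeted mailing sells fewer subscriptions in total but turns a loss into a positive net result because the cost falls faster than the sales volume.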
    • NN get worse with learning … • Wish to implement Neural Networks for next campaign • In-house team (with no NN knowledge) outperformed us EVERY TIME! • Analyzed software, training parameters, etc. → internal competition • Observed expert in building models … !
      Confusion matrices (%):
      Matrix 1   Pred. C0   Pred. C1   Sum
      C0         61.86      38.14      100
      C1         55.09      44.81      100
                 116.95     82.95      54.26
      Matrix 2   Pred. C0   Pred. C1   Sum
      C0         72.96      27.04      100
      C1         62.02      37.98      100
                 134.98     65.02      55.47
      Matrix 3   Pred. C0   Pred. C1   Sum
      C0         52.87      43.37      100
      C1         47.13      56.63      100
                 100        100        54.75
    • Experimental Design: Different data pre-processing • Different Encoding: handle categorical features (n, n-1, thermo, ordinal) • Different Scaling: scale numerical features (standardise, discretise) • Different Sampling: adjust imbalanced class distributions (over- & undersampling); decide on sample size and method • Handle outliers • Select useful features • Evaluate across 3 algorithms: Neural Networks (MLPs), Support Vector Machines & Decision Trees
    • Dataset Structure • Data set size: 300,000 customer records; 4,019 subscriptions sold; response rate of 1.3% • Data set structure: 18 categorical features; 35 numerical features; binary target variable • Evaluated the Impact of Data Preprocessing: Data Sampling (oversampling vs. undersampling); Categorical attribute Encoding (N, N-1, thermo, ordinal); Continuous attribute Projection (Binning vs. Normalisation); Continuous attribute Scaling ([0,+1] vs. [-1,+1] range) • Multifactorial design to evaluate impact across multiple methods: Neural Networks (NN), Support Vector Machines (SVM), Decision Trees (CART)
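The four categorical encodings and the two scaling ranges compared in this design can be illustrated as follows. This is a sketch under stated assumptions: the thermometer variant shown uses N bits (some authors use N-1), and the function names and the `LEVELS` example are hypothetical.

```python
def encode_n(value, categories):
    """One-of-N encoding: one bit per category."""
    return [1 if value == c else 0 for c in categories]

def encode_n_minus_1(value, categories):
    """N-1 encoding: the last category is the all-zero reference level."""
    return encode_n(value, categories)[:-1]

def encode_thermometer(value, categories):
    """Thermometer encoding for ordered categories: every bit up to and
    including the category's position is set."""
    level = categories.index(value)
    return [1 if i <= level else 0 for i in range(len(categories))]

def encode_ordinal(value, categories):
    """Ordinal encoding: a single integer 1, 2, 3, ..."""
    return categories.index(value) + 1

def scale(x, lo, hi, target=(0.0, 1.0)):
    """Linearly map x from [lo, hi] to the target range."""
    a, b = target
    return a + (x - lo) * (b - a) / (hi - lo)

LEVELS = ["low", "medium", "high"]
print(encode_n("medium", LEVELS))            # [0, 1, 0]
print(encode_n_minus_1("medium", LEVELS))    # [0, 1]
print(encode_thermometer("medium", LEVELS))  # [1, 1, 0]
print(encode_ordinal("medium", LEVELS))      # 2
print(scale(5, 0, 10), scale(5, 0, 10, target=(-1.0, 1.0)))  # 0.5 0.0
```

Ordinal coding imposes equal distances between categories on the learner, which may explain why some methods react badly to it.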
    • Sampling  Created 2 Dataset Sampling candidates Data partition (number of records) Oversampling Undersampling Data subset Class 1 Class -1 Class 1 Class -1 Training set 20,000 20,000 2,072 2,072 Validation set 10,000 10,000 1,035 1,035 SUM 30,000 30,000 3,107 3,107 Test (hold-out) set 912 64,088 912 64,088 Different balancing in the training data Original distribution in the test data (65,000 instances)
    • Results [Three charts, each annotated “Increase”] • Oversampling outperforms undersampling consistently! • Gain in Lift depends on method (different sensitivity) • Oversampling has higher impact than data coding & scaling
    • Recommendations from Case Study • Sampling: oversampling outperforms undersampling for all methods; undersampling: better in-sample results & worse out of sample • Choice of method: NN & SVM better than CART • Encoding & Projection: SVM: avoid ordinal coding (e.g. 1,2,3), all others similar (incl. N!); NN: avoid standardization & ordinal encoding; DT / CART: use thermometer (temperature) encoding, all others similar (incl. ordinal) • Binning & Scaling of continuous attributes irrelevant for all methods! • Use Undersampling & N-1 encoding with SVM & NN • Best preprocessed SVM → lift of 0.645 on test set … BUT …
    • Results across Pre-processing • Preprocessing: higher impact than method selection • Lift variation per method from Sampling/Scaling/Coding > difference of Lift between competing methods! • DPP causes 50%-70% of the differences between models [Three charts on the test data subset, by method (NN, SVM, DT): Lift performance, Arithmetic Mean performance (AM), Geometric Mean performance (GM)] • Results are consistent across error measures • Experiments allow identification of “best practices” to model methods • Best-practice preprocessing varies between methods
    • Business Case: Predicting Customer Online Shopping Adoption • Traditional buying process is offline & simultaneous → “bricks” store • Introduction of the Internet changes consumer behaviour: seek information online & offline; purchasing online & offline → changing purchasing behaviour through internet adoption → changing purchasing behaviour through Technology Acceptance • Development of heterogeneous Purchasing Behaviour • Example: purchasing electronic durable consumer goods: search for product info (e.g. video cameras) online → test product in-store → search for best deal on internet & purchase [Diagram: search for information online or offline, purchase online or offline; segments: Online Shoppers, Browsers, Non-Internet Shoppers]
    • Stages of Internet Adoption 1. OFFLINE BUYERS: information gathering & purchasing in stores 2. BROWSERS: information gathering online & purchasing in stores 3. ONLINE BUYERS: information gathering & purchasing online
    • Motivation
      DIDIER: Marketing Modelling • Econometric / Marketing Domain • Seeks to explain how customers behave in online shopping • Use of “black-box” logistic regression models • Models class membership to identify causal variables that explain choices • Descriptive & Normative Modelling • Best practices: balance datasets for a distribution representative of the population; use ordinal variables & nominal variables without recoding; do not normalise / scale data
      SVEN: Data Mining Perspective • IS/OR/MS Domain → Data Mining • Seeks to accurately predict regardless of explanation why customers buy • Use of “black-box” methods from computational intelligence • Models class membership to accurately classify unseen instances • Predictive Modelling • Best practices: rebalance datasets for equal distribution of target variables; recode ordinal → binary scale; rescale & normalise data to facilitate learning speed etc.
      Same dataset & same objectives & similar methods • Conflicting “best practice” approaches to modelling • Outside of most software simulators!!! Implicit knowledge? … WHO IS “CORRECT”? WHAT IS THE IMPACT?
    • Dataset • Survey on Internet Shopping Behaviour • 5500 UK households → 685 respondents • Adjusted for age, income etc. of customers (older less likely to buy) • Adjusted for product specific risk of online shopping for branded durable consumer goods (inspection required to some extent) • 73 questions on factors related to internet shopping, products etc. [Model diagram. Input variables (mixed scale of nominal, ordinal, interval): Online shopping specific factors, e.g. “Going to the shops is as convenient as Internet shopping”, “I would buy online if products are branded” etc. [1 = strongly agree; …]; Demographic factors: age, gender, income; Internet utility factors: score from 6 correlated variables. Models: Logistic Regression, Neural Networks. Output variables: Class 1: Browse Online & Buy Online; Class 2: Browse Online & Buy Offline; Class 3: Browse & Buy Offline]
    • Imbalanced Classification problem • Split of dataset for Training, Validation and Test {50%; 25%; 25%} • Distribution of target classes is skewed {65% online buyers; 22.5% browsers; 12.5% offline shoppers} • Rebalancing of data sets through over- & undersampling [Bar charts: count (0 to 400) of Online Shoppers, Browsers and Offline Shoppers in the training, validation and test subsets, for the imbalanced, oversampled and undersampled datasets]
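A 50/25/25 split that keeps the skewed class shares in every subset can be sketched as below. Whether the original study stratified its split is not stated on the slide, so the stratification is an assumption, as are the function name and the 200-row example.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.5, 0.25, 0.25), seed=0):
    """Split row indices into train/validation/test, class by class,
    so each subset keeps roughly the original class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_valid = round(fractions[1] * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test

# Illustrative labels with the slide's class shares: 65% / 22.5% / 12.5%.
labels = ["online"] * 130 + ["browser"] * 45 + ["offline"] * 25
train, valid, test = stratified_split(labels)
```

Rebalancing by over- or undersampling would then be applied to the training indices only, leaving the test subset at its original distribution.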
    • Results without Discretisation (confusion matrices in %; rows = true value, columns = predicted class; MCR = Mean Classification Rate)
      Logistic Regression
      Dataset               True value   Training: Online  Browse  Offline   Test: Online  Browse  Offline
      Original imbalanced   Online                 93.36    5.17     1.48           88.89    7.78     3.33
                            Browser                62.77   23.40    13.83           49.39   22.58    29.03
                            Offline                36.54   17.31    46.15           35.29   29.41    35.29
                            MCRtrain = 54.3%, MCRtest = 48.9%
      Undersampling         Online                 57.69   30.77    11.54           64.44   23.33    12.22
                            Browser                26.92   48.08    25.00           32.26   25.81    41.94
                            Offline                17.31   21.15    61.54           29.41   35.29    35.29
                            MCRtrain = 55.8%, MCRtest = 41.8%
      Oversampling          Online                 68.27   24.35     7.38           74.44   16.67     8.89
                            Browser                30.63   43.91    25.46           35.48   29.03    35.48
                            Offline                16.97   19.93    63.10           29.41   29.41    41.18
                            MCRtrain = 58.4%, MCRtest = 48.2%
      Neural Net
      Dataset               True value   Training: Online  Browse  Offline   Test: Online  Browse  Offline
      Original imbalanced   Online                 86.19   12.71     1.10           86.67    8.89     4.44
                            Browser                53.13   31.25    15.63           41.94   35.48    22.58
                            Offline                25.17   28.57    45.71           29.41   35.29    35.29
                            MCRtrain = 54.4%, MCRtest = 52.5%
      Undersampling         Online                 44.86   40.00    17.14           27.78   58.89    13.33
                            Browser                14.29   48.57    37.14           16.13   32.26    51.61
                            Offline                 8.57   20.00    71.43           11.76   41.18    47.06
                            MCRtrain = 54.9%, MCRtest = 35.7%
      Oversampling          Online                 81.22   18.23     0.55           61.11   22.22    16.67
                            Browser                14.92   83.43     1.66           19.35   77.42     3.23
                            Offline                15.52    0.55    99.45            0.00   11.76    88.24
                            MCRtrain = 88.0%, MCRtest = 75.6%
    • Results with Discretisation of Ordinal (confusion matrices in %; rows = true value, columns = predicted class; MCR = Mean Classification Rate)
      Logistic Regression
      Dataset               True value   Training: Online  Browse  Offline   Test: Online  Browse  Offline
      Original imbalanced   Online                 91.51    6.64     1.85           85.56    7.78     6.67
                            Browser                54.26   36.17     9.57           48.39   32.26    19.35
                            Offline                26.92   17.31    55.77           58.82   47.62    17.65
                            MCRtrain = 61.15%, MCRtest = 45.1%
      Undersampling         Online                 71.15   21.15     7.69           55.56   24.44    20.00
                            Browser                17.31   65.38    17.31           67.74    6.45    25.81
                            Offline                15.38   11.54    73.08           58.82    0.00    41.18
                            MCRtrain = 69.9%, MCRtest = 34.4%
      Oversampling          Online                 68.63   22.88     8.49           70.0    21.11     8.89
                            Browser                17.34   56.83    25.83           12.90   58.06    29.03
                            Offline                13.28   14.02    72.69           17.65   23.53    58.82
                            MCRtrain = 66.0%, MCRtest = 62.3%
      Neural Net
      Dataset               True value   Training: Online  Browse  Offline   Test: Online  Browse  Offline
      Original imbalanced   Online                 96.13    3.87     0.00           84.44   11.11     4.44
                            Browser                68.75   28.13     3.13           64.52   22.58    12.90
                            Offline                40.00   14.29    45.17           58.82   11.76    29.41
                            MCRtrain = 56.5%, MCRtest = 45.5%
      Undersampling         Online                 57.14   40.00     2.86           25.56   72.22     2.22
                            Browser                34.29   54.29    11.43           67.74   29.03     3.23
                            Offline                14.29   31.43    54.29           52.94   17.65    29.41
                            MCRtrain = 55.2%, MCRtest = 28.0%
      Oversampling          Online                 98.34    1.10     0.55           58.89   24.44    16.67
                            Browser                 0.00   100.0     0.00            3.23   83.87    12.90
                            Offline                 0.00    0.00    100.0            0.00    5.88    94.12
                            MCRtrain = 99.5%, MCRtest = 79.0%
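The Mean Classification Rate reported on these slides appears to be the average of the per-class hit rates, that is, the mean of the diagonal of the row-normalised confusion matrix. The sketch below checks this reading against one matrix from the slide (neural net, oversampling, test data, with discretisation); the function name is illustrative.

```python
def mean_classification_rate(matrix):
    """matrix[i][j] = % of true class i predicted as class j.
    Returns the unweighted mean of the per-class hit rates (the diagonal)."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)

# Neural net, oversampled training data, test set, with discretisation.
nn_oversampled_test = [
    [58.89, 24.44, 16.67],  # true Online
    [3.23, 83.87, 12.90],   # true Browser
    [0.00, 5.88, 94.12],    # true Offline
]

mcr = mean_classification_rate(nn_oversampled_test)
print(round(mcr, 1))  # 79.0, matching MCRtest on the slide
```

Because every class counts equally, this measure is not inflated by the 65% majority class the way plain accuracy would be.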
    • Summary • Oversampling outperforms other samplings: across different datasets; across various data preprocessing • Methods show different sensitivity to sampling: more variation from sampling, coding & scaling than between methods; using different preprocessing variants is important in modelling • Various sophisticated extensions exist: SMOTE (Synthetic Minority Oversampling Technique); K-nearest neighbour sampling (removal / creation); one-class learning etc. • Extend your bag of tricks … and experiment with imbalanced sampling!
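To give a feel for SMOTE, here is a deliberately simplified sketch of its core idea: new minority points are interpolated between an existing minority point and one of its nearest minority neighbours. This is not a full implementation of the published algorithm; the function name, parameters and example points are assumptions made for illustration.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Four illustrative minority-class points in a two-feature space.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority_points, n_new=10)
```

Unlike plain duplication, the synthetic points are not identical copies, which may reduce the overfitting that the near-perfect training rates of the oversampled neural net suggest.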
    • Questions? Sven F. Crone, Lancaster University Management School, Centre for Forecasting, Lancaster, LA1 4YX, email: s.crone@lancaster.ac.uk $SY_t = \alpha Y_{t-1} + (1-\alpha)\, SY_{t-1}$
    • Exploring Innovation: “Online panel” vs “Online streaming/convenience” sampling
    • Unfortunately, the presenters of iVOX & Corelio cannot share their presentation with the BAQMaR community due to reasons of confidentiality!