DATA MINING

Presentation Transcript

  • Data Mining: David L. Olson, James & H.K. Stuart Professor in MIS, University of Nebraska-Lincoln
  • Definition
    • DATA MINING: exploration & analysis
      • by automatic means
      • of large quantities of data
      • to discover actionable patterns & rules
    • Data mining is a way to utilize the massive quantities of data that businesses generate
  • Retail Outlets
    • Bar coding & Scanning generate masses of data
      • customer service
      • inventory control
      • MICROMARKETING
      • CUSTOMER PROFITABILITY ANALYSIS
      • MARKET BASKET ANALYSIS
  • FINGERHUT
    • Founded 1948
      • today sends out 130 different catalogs
      • to over 65 million customers
      • 6 terabyte data warehouse
      • 3,000 variables on the 12 million most active customers
      • over 300 predictive models
    • Focused marketing
  • Fingerhut
    • Purchased by Federated Department Stores for $1.7 billion in 1999 (for its database)
    • Fingerhut did $1.6 to $2 billion in business per year, targeted at lower-income households
    • Can mail 400,000 packages per day
    • Each product line has its own catalog
  • Fingerhut
    • Uses segmentation, decision tree, regression, and neural network tools from SAS and SPSS
    • Segmentation - combines order & demographic data with product offerings
      • can target mailings to greatest payoff
        • customers who had recently moved tripled their purchasing 12 weeks after the move
        • so send them furniture, telephone, and decoration catalogs
  • Data for SEGMENTATION
    • cluster indices
    • subject | age | income | marital | grocery | dine out | savings
    • 1001 | 53 | 80000 | wife | 180 | 90 | 30000
    • 1002 | 48 | 120000 | husband | 120 | 110 | 20000
    • 1003 | 32 | 90000 | single | 30 | 160 | 5000
    • 1004 | 26 | 40000 | wife | 80 | 40 | 0
    • 1005 | 51 | 90000 | wife | 110 | 90 | 20000
    • 1006 | 59 | 150000 | wife | 160 | 120 | 30000
    • 1007 | 43 | 120000 | husband | 140 | 110 | 10000
    • 1008 | 38 | 160000 | wife | 80 | 130 | 15000
    • 1009 | 35 | 70000 | single | 40 | 170 | 5000
    • 1010 | 27 | 50000 | wife | 130 | 80 | 0
  • Initial Look at Data
    • Want to know features of those who spend a lot dining out
    • INCLUDE AS MANY ACTIONABLE VARIABLES AS POSSIBLE
      • things you can identify
    • Manipulate data
      • sort on most likely indicator (dine out)
  • Sorted by Dine Out
    • cluster indices
    • subject | age | income | marital | grocery | dine out | savings
    • 1004 | 26 | 40000 | wife | 80 | 40 | 0
    • 1010 | 27 | 50000 | wife | 130 | 80 | 0
    • 1001 | 53 | 80000 | wife | 180 | 90 | 30000
    • 1005 | 51 | 90000 | wife | 110 | 90 | 20000
    • 1002 | 48 | 120000 | husband | 120 | 110 | 20000
    • 1007 | 43 | 120000 | husband | 140 | 110 | 10000
    • 1006 | 59 | 150000 | wife | 160 | 120 | 30000
    • 1008 | 38 | 160000 | wife | 80 | 130 | 15000
    • 1003 | 32 | 90000 | single | 30 | 160 | 5000
    • 1009 | 35 | 70000 | single | 40 | 170 | 5000
  • Analysis
    • Best indicators
      • marital status
      • groceries
    • Available
      • marital status might be easier to get
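
The sort-and-inspect step on the segmentation slides can be sketched in a few lines of Python. This is only an illustration using the ten rows shown; the tuple layout and the function names `sort_by_dine_out` and `mean_dine_out_by_marital` are assumptions made for this sketch, not part of the original presentation.

```python
from collections import defaultdict
from statistics import mean

# Rows copied from the segmentation slide.
# Assumed tuple layout: (subject, age, income, marital, grocery, dine_out, savings)
ROWS = [
    (1001, 53, 80000, "wife", 180, 90, 30000),
    (1002, 48, 120000, "husband", 120, 110, 20000),
    (1003, 32, 90000, "single", 30, 160, 5000),
    (1004, 26, 40000, "wife", 80, 40, 0),
    (1005, 51, 90000, "wife", 110, 90, 20000),
    (1006, 59, 150000, "wife", 160, 120, 30000),
    (1007, 43, 120000, "husband", 140, 110, 10000),
    (1008, 38, 160000, "wife", 80, 130, 15000),
    (1009, 35, 70000, "single", 40, 170, 5000),
    (1010, 27, 50000, "wife", 130, 80, 0),
]


def sort_by_dine_out(rows):
    """Return the rows ordered by dining-out spend, lowest first."""
    return sorted(rows, key=lambda r: r[5])


def mean_dine_out_by_marital(rows):
    """Average dining-out spend for each marital-status group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[3]].append(r[5])
    return {status: mean(values) for status, values in groups.items()}


if __name__ == "__main__":
    for row in sort_by_dine_out(ROWS):
        print(row)
    print(mean_dine_out_by_marital(ROWS))
```

With only ten rows the group averages are suggestive at best; the same code applies unchanged to a larger sample.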
  • Fingerhut
    • Mailstream optimization
      • identifies which customers are most likely to respond to existing catalog mailings
      • saves nearly $3 million per year
      • reversed trend of catalog sales industry in 1998
      • reduced mailings by 20% while increasing net earnings to over $37 million
  • Banking
    • Among first users of data mining
    • Used to find out what motivates their customers (reduce churn)
    • Loan applications
    • Target marketing
    • Norwest: 3% of customers provided 44% of profits
    • Bank of America: program cultivating top 10% of customers
  • CREDIT SCORING
    • Bank Loan Applications
    • Age | Income | Assets | Debts | Want | On-time
    • 24 | 55557 | 27040 | 48191 | 1500 | 1
    • 20 | 17152 | 11090 | 20455 | 400 | 1
    • 20 | 85104 | 0 | 14361 | 4500 | 1
    • 33 | 40921 | 91111 | 90076 | 2900 | 1
    • 30 | 76183 | 101162 | 114601 | 1000 | 1
    • 55 | 80149 | 511937 | 21923 | 1000 | 1
    • 28 | 26169 | 47355 | 49341 | 3100 | 0
    • 20 | 34843 | 0 | 21031 | 2100 | 1
    • 20 | 52623 | 0 | 23054 | 15900 | 0
    • 39 | 59006 | 195759 | 161750 | 600 | 1
  • Characteristics of Not On-time
    • Age | Income | Assets | Debts | Want | On-time
    • 28 | 26169 | 47355 | 49341 | 3100 | 0
    • 20 | 52623 | 0 | 23054 | 15900 | 0
    • Here, Debts exceed Assets
    • Age: young
    • Income: low
    • BETTER: Base on statistics, large sample
    • supplement data with other relevant variables
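
The caution about basing rules on statistics and a large sample can be illustrated with the ten loan applications shown. The sketch below applies the eyeballed rule "debts exceed assets" to every row and counts how many flagged applicants actually paid late. The tuple layout and function names are assumptions for this example only.

```python
# Rows copied from the credit scoring slide.
# Assumed tuple layout: (age, income, assets, debts, want, on_time)
APPLICANTS = [
    (24, 55557, 27040, 48191, 1500, 1),
    (20, 17152, 11090, 20455, 400, 1),
    (20, 85104, 0, 14361, 4500, 1),
    (33, 40921, 91111, 90076, 2900, 1),
    (30, 76183, 101162, 114601, 1000, 1),
    (55, 80149, 511937, 21923, 1000, 1),
    (28, 26169, 47355, 49341, 3100, 0),
    (20, 34843, 0, 21031, 2100, 1),
    (20, 52623, 0, 23054, 15900, 0),
    (39, 59006, 195759, 161750, 600, 1),
]


def debts_exceed_assets(row):
    """The eyeballed rule: flag an applicant whose debts exceed assets."""
    return row[3] > row[2]


def rule_outcomes(rows):
    """Return (number flagged by the rule, number of those not on time)."""
    flagged = [r for r in rows if debts_exceed_assets(r)]
    late = [r for r in flagged if r[5] == 0]
    return len(flagged), len(late)


if __name__ == "__main__":
    flagged, late = rule_outcomes(APPLICANTS)
    print(f"rule flags {flagged} applicants, {late} of them paid late")
```

On these rows the rule catches both late payers but also flags five applicants who paid on time, which is why a rule read off two cases needs testing against a larger sample.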
  • CHURN
    • Customer turnover
    • critical to:
      • telecommunications
      • banks
      • human resource management
      • retailers
  • Identify characteristics of those who leave
    • Age (years), Time-job (months), Time-town (months), min bal ($), checking, savings, card, loan
    • 27 12 12 549 x x
    • 41 18 41 3259 x x x
    • 28 9 15 286 x x
    • 55 301 5 2854 x x x
    • 43 18 18 1112 x x x
    • 29 6 3 0 x
    • 38 55 20 321 x x x
    • 63 185 3 2175 x x x
    • 26 15 15 386 x x
    • 46 13 12 1187 x x x
    • 37 32 25 1865 x x x
  • Analysis
    • What are the characteristics of those who leave?
      • Correlation analysis
    • Which customers do you want to keep?
      • Customer value - net present value of customer to the firm
  • Correlation
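    • (variable) | Age | Time-Job | Time-Town | Min-Bal | Check | Saving | Card | Loan
    • Age | 1.0 | 0.6 | 0.4 | -0.4 | 0.0 | 0.4 | 0.2 | 0.3
    • Time-Job | | 1.0 | 0.9 | -0.6 | 0.1 | 0.6 | 0.9 | -0.2
    • Time-Town | | | 1.0 | -0.5 | -0.1 | 0.3 | 0.5 | 0.4
    • Min-Bal | | | | 1.0 | -0.2 | 0.3 | 0.6 | -0.1
    • Check | | | | | 1.0 | 0.5 | 0.2 | 0.2
    • Saving | | | | | | 1.0 | 0.9 | 0.3
    • Card | | | | | | | 1.0 | 0.5
    • Loan | | | | | | | | 1.0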
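
Each entry in a correlation matrix like this one is a Pearson correlation coefficient between two columns. A minimal sketch of that calculation follows, using the age and time-on-job columns from the churn slide. The function name `pearson` is an assumption for this example, and the value computed from these eleven rows need not match the rounded figure on the slide, which may come from a different or larger sample.

```python
from math import sqrt

# Columns copied from the churn slide (age in years, time on job in months).
AGE = [27, 41, 28, 55, 43, 29, 38, 63, 26, 46, 37]
TIME_JOB = [12, 18, 9, 301, 18, 6, 55, 185, 15, 13, 32]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    print(round(pearson(AGE, TIME_JOB), 2))
```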
  • Mortgage Market
    • Early 1990s - massive refinancing
    • need to keep customers happy to retain them
    • contact current customers who have rates significantly higher than market
      • a major change in practice
      • data mining & telemarketing increased Crestar Mortgage’s retention rate from 8% to over 20%
  • Banking
    • Fleet Financial Group
      • $30 million data warehouse
      • hired 60 database marketers, statistical/quantitative analysts & DSS specialists
      • expect to add $100 million in profit by 2001
  • Banking
    • First Union
      • concentrated on contact-point
      • previously had very focused product groups, little coordination
      • Developed offers for customers
  • CREDIT SCORING
    • Data warehouse including demand deposits, savings, loans, credit cards, insurance, annuities, retirement programs, securities underwriting, other
    • Statistical & mathematical models (regression) to predict repayment
  • CUSTOMER RELATIONSHIP MANAGEMENT (CRM)
    • understanding the value a customer provides to the firm
      • Kathleen Khirallah - The Tower Group
        • Banks will spend $9 billion on CRM by end of 1999
      • Deloitte
        • only 31% of senior bank executives confident that their current distribution mix anticipated customer needs
  • Customer Value
    • Middle aged (41-55), 3-9 years on job, 3-9 years in town, savings account
    • year | annual purchases | profit | discounted (at 1.3 per year) | cumulative net
    • 1 | 1000 | 200 | 153 | 153
    • 2 | 1000 | 200 | 118 | 272
    • 3 | 1000 | 200 | 91 | 363
    • 4 | 1000 | 200 | 70 | 433
    • 5 | 1000 | 200 | 53 | 487
    • 6 | 1000 | 200 | 41 | 528
    • 7 | 1000 | 200 | 31 | 560
    • 8 | 1000 | 200 | 24 | 584
    • 9 | 1000 | 200 | 18 | 603
    • 10 | 1000 | 200 | 14 | 618
  • Younger Customer
    • Young (21-29), 0-2 years on job, 0-2 years in town, no savings account
    • year | annual purchases | profit | discounted (at 1.3 per year) | cumulative net
    • 1 | 300 | 60 | 46 | 46
    • 2 | 360 | 72 | 43 | 89
    • 3 | 432 | 86 | 39 | 128
    • 4 | 518 | 104 | 36 | 164
    • 5 | 622 | 124 | 34 | 198
    • 6 | 746 | 149 | 31 | 229
    • 7 | 896 | 179 | 29 | 257
    • 8 | 1075 | 215 | 26 | 284
    • 9 | 1290 | 258 | 24 | 308
    • 10 | 1548 | 310 | 22 | 331
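
The two customer-value tables follow from one calculation: divide each year's profit by 1.3 raised to the year number, then keep a running total. The sketch below reproduces both within rounding. The young-customer profits are regenerated from an assumed pattern (purchases starting at 300 and growing 20% a year, profit at 20% of purchases) that agrees with the rounded figures on the slide; the function names are likewise assumptions for this example.

```python
def discounted_profits(profits, factor=1.3):
    """Discount each year's profit by factor**year (year 1 is the first)."""
    return [p / factor ** year for year, p in enumerate(profits, start=1)]


def cumulative(values):
    """Running total of a list of numbers."""
    total = 0.0
    out = []
    for v in values:
        total += v
        out.append(total)
    return out


# Middle-aged customer: constant profit of 200 per year for 10 years.
MIDDLE_AGED = [200.0] * 10

# Younger customer (assumed pattern): purchases start at 300 and grow 20%
# per year; profit is 20% of purchases.
YOUNG = [0.2 * 300 * 1.2 ** t for t in range(10)]

if __name__ == "__main__":
    for name, profits in (("middle-aged", MIDDLE_AGED), ("young", YOUNG)):
        net = cumulative(discounted_profits(profits))
        print(name, [round(v, 1) for v in net])
```

The ten-year totals come out near 618 and 331, in line with the last rows of the two tables.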
  • Credit Card Management
    • Very profitable industry
    • Card surfing - pay old balance with new card
    • promotions typically generate 1000 responses, about 1%
    • in early 1990s, almost all mass-marketing
    • data mining improves response (lift)
  • LIFT
    • LIFT = probability in class by sample divided by probability in class by population
      • if population probability is 20% and
      • sample probability is 30%,
      • LIFT = 0.3/0.2 = 1.5
    • highest lift is not necessarily best
      • need sufficient sample size
      • as confidence increases, longer list but lower lift
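
The lift definition above is a single ratio. As a minimal sketch (the function name `lift` and its argument names are assumptions for this example):

```python
def lift(sample_rate, population_rate):
    """Lift = response probability in the selected sample divided by
    the response probability in the whole population."""
    if population_rate <= 0:
        raise ValueError("population rate must be positive")
    return sample_rate / population_rate


if __name__ == "__main__":
    # The slide's example: 30% in the sample vs. 20% in the population.
    print(round(lift(0.30, 0.20), 2))
```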
  • Lift Example
    • Product to be promoted
    • Sampled over 10 identifiable segments of potential buying population
      • Profit $50 per item sold
      • Mailing cost $1
      • Sorted by Estimated response rates
  • Lift Data
  • Lift Chart
  • Profit Impact
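
The lift data and chart did not survive in this transcript, but the profit logic can be sketched from the two figures given in the example: $50 profit per item sold and $1 per mailing. A segment is worth mailing only if its response rate exceeds 1/50 = 2%. The segment sizes and response rates below are hypothetical placeholders, not the slide's data, and the function names are assumptions.

```python
PROFIT_PER_SALE = 50.0   # from the Lift Example slide
MAILING_COST = 1.0       # from the Lift Example slide


def break_even_rate():
    """Response rate at which a mailing just pays for itself."""
    return MAILING_COST / PROFIT_PER_SALE


def segment_profit(size, response_rate):
    """Expected profit from mailing every member of one segment."""
    return size * (response_rate * PROFIT_PER_SALE - MAILING_COST)


def best_cutoff(segments):
    """Given (size, response_rate) pairs sorted by rate, highest first,
    return (number of segments to mail, cumulative profit) at the maximum."""
    best_n, best_profit, running = 0, 0.0, 0.0
    for n, (size, rate) in enumerate(segments, start=1):
        running += segment_profit(size, rate)
        if running > best_profit:
            best_n, best_profit = n, running
    return best_n, best_profit


# Hypothetical segments for illustration only (not the slide's data).
HYPOTHETICAL_SEGMENTS = [(1000, 0.06), (1000, 0.04), (1000, 0.03), (1000, 0.01)]

if __name__ == "__main__":
    print(break_even_rate())
    print(best_cutoff(HYPOTHETICAL_SEGMENTS))
```

Mailing stops being worthwhile at the first segment whose rate falls below break-even, which is where the cumulative profit peaks.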
  • INSURANCE
    • Marketing, as in retailing & banking
    • Special:
      • Farmers Insurance Group - underwriting system generating millions of dollars in higher revenues, lower claims
        • 7 databases, 35 million records
      • better understanding of market niches
        • lower rates on sports cars, increasing business
  • Insurance Fraud
    • Specialist criminals - multiple personas
    • InfoGlide specializes in fraud detection products
      • similarity search engine
        • link names, telephone numbers, streets, birthdays, variations
        • identify 7 times more fraud than exact-match systems
  • Insurance Fraud - Link Analysis
    • claim type | amount | physician | attorney
    • back | 50000 | Welby | McBeal
    • neck | 80000 | Frank | Jones
    • arm | 40000 | Barnard | Fraser
    • neck | 80000 | Frank | Jones
    • leg | 30000 | Schmidt | Mason
    • multiple | 120000 | Heinrich | Feiffer
    • neck | 80000 | Frank | Jones
    • back | 60000 | Schwartz | Nixon
    • arm | 30000 | Templer | White
    • internal | 180000 | Weiss | Richards
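
A very simple form of link analysis on the claims above is to count how often the same physician and attorney appear together. The sketch below does only that; real link-analysis tools also match near-duplicates and other fields. The tuple layout and the function name `repeated_pairs` are assumptions for this example.

```python
from collections import Counter

# Rows copied from the link analysis slide.
# Assumed tuple layout: (claim_type, amount, physician, attorney)
CLAIMS = [
    ("back", 50000, "Welby", "McBeal"),
    ("neck", 80000, "Frank", "Jones"),
    ("arm", 40000, "Barnard", "Fraser"),
    ("neck", 80000, "Frank", "Jones"),
    ("leg", 30000, "Schmidt", "Mason"),
    ("multiple", 120000, "Heinrich", "Feiffer"),
    ("neck", 80000, "Frank", "Jones"),
    ("back", 60000, "Schwartz", "Nixon"),
    ("arm", 30000, "Templer", "White"),
    ("internal", 180000, "Weiss", "Richards"),
]


def repeated_pairs(claims, min_count=2):
    """Physician/attorney pairs that occur together at least min_count times."""
    counts = Counter((physician, attorney) for _, _, physician, attorney in claims)
    return {pair: n for pair, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    print(repeated_pairs(CLAIMS))
```

On these rows the only repeated pairing is Frank with Jones, three times, each for a neck claim of 80000.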
  • Insurance Fraud
    • Analytics’ NetMap for Claims
      • uses industry-wide database
      • creates data mart of internal, external data
      • unusual activity for specific chiropractors, attorneys
    • HNC Insurance Solutions
      • workers compensation fraud
    • VeriComp - predictive software (neural nets)
      • saved Utah over $2 million
  • TELECOMMUNICATIONS
    • Deregulation - widespread competition
      • churn
        • 1/3 due to poor call quality, 1/2 due to poor equipment
      • wireless performance monitor tracking
        • reduced churn by about 61%, $580,000/year
      • cellular fraud prevention
      • spot problems when cell phones begin to go bad
  • Telecommunications
    • Metapath’s Communications Enterprise Operating System
      • help identify telephone customer problems
        • dropped calls, mobility patterns, demographics
        • to target specific customers
      • reduce subscription fraud
        • $1.1 billion
      • reduce cloning fraud
        • cost $650 million in 1996
  • Telecommunications
    • Churn Prophet, ChurnAlert
      • data mining to predict subscribers who cancel
    • Arbor/Mobile
      • set of products, including churn analysis
  • TELEMARKETING
    • MCI uses data marts to extract data on prospective customers
      • typically a 2 month program
      • 20% improvement in sales leads
      • multimillion investment in data marts & hardware
      • staff of 45
      • trend spotting (which approaches specific customers like)
  • Telemarketing
    • Australian Tourist Commission
      • maintained database since 1992
        • responses to travel inquiries on tours, hotels, airlines, travel agents, consumers
        • data mine to identify travel agents & consumers responding to various media
        • sales closure rate at 10% and up
        • lead lists faxed weekly to productive travel agents
  • Telemarketing
    • Segmentation
      • which customers respond to new promotions, to discounts, to new product offers
      • Determine who
        • to offer new service to
        • those most likely to commit fraud
  • Human Resource Management
    • Identify individuals liable to leave company without additional compensation or benefits
    • Firm may already know 20% use 80% of offered services
      • don’t know which 20%
      • data mining (business intelligence) can identify
    • Use most talented people in highest-priority (or most profitable) business units
  • Human Resource Management
    • Downsizing
      • identify right people, treat them well
      • track key performance indicators
      • data on talents, company needs, competitor requirements
    • State of Mississippi’s MERLIN network
      • 30 databases (finance, payroll, personnel, capital projects)
      • Cognos Impromptu system - 230 users
  • CASINOS
    • Casino gaming generates one of the richest data sets known
    • Harrah’s - incentive programs
      • about 8 million customers hold Total Gold cards, used whenever the customer spends money in the casino
      • comprehensive data collection
    • Trump’s Taj Card similar
  • Casinos
    • Bellagio & Mandalay Bay
      • strategy of luxury visits
      • child entertainment
      • change from old strategy - cheap food
    • Identify high rollers - cultivate
      • identify those to discourage from play
      • estimate lifetime value of players
  • ARTS
    • computerized box offices lead to high volumes of data
    • Identify potential consumers for shows
    • software to manage shows
      • similar to airline seating chart software
  • Research Projects
    • Techniques
      • Statistics (difference between data mining, conventional statistics)
    • Data Management
      • How to beat data into usable form
    • Visualization
      • Manny Parzen
    • Applications
  • Class Projects
    • Application
      • Gallup: rehabilitation of drug-using women
      • Relationship between strengths-based counseling, success
      • Finding: good counselor relationship key
    • Data: limited (but Gallup tends to agglomerate thousands over time)
    • Technique
      • Regression
        • On ordinal data
  • Sleep Disorder Prediction
    • OSHA data
      • 11 Nebraska plants
      • Demographic data
      • Epworth sleepiness scale
      • 21 sleep disorder variables
    • Applied Clementine models
      • Trained on 1500
      • Tested 214
    • If 4 of 6 models predicted a problem, that prediction was assigned
      • Increased prediction accuracy
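
The 4-of-6 rule on this slide is a voting ensemble. A minimal sketch follows, assuming each model returns a true/false prediction and that "4 of 6" means at least four; the function name `vote` and the threshold parameter are assumptions for this example, and nothing here depends on Clementine itself.

```python
def vote(predictions, threshold=4):
    """Return True when at least `threshold` models predict a problem.

    `predictions` is an iterable of booleans, one per model.
    """
    return sum(1 for p in predictions if p) >= threshold


if __name__ == "__main__":
    # Four of six hypothetical models flag the case, so it is assigned.
    print(vote([True, True, True, True, False, False]))
    # Only three agree, so it is not.
    print(vote([True, True, True, False, False, False]))
```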
  • Test Bank Analysis
    • Effort to develop on-line test
      • Math department
      • Service course – freshmen, entire University
        • Thousands of cases
    • Data manipulation problem
      • Some link analysis
        • Early prediction of performance
    • Identified which questions predicted results
      • Used to take corrective action early
  • Survey Mode Effects
    • Gallup
    • Surveys via telephone or internet
      • Effect of Interviewer
    • DATA
      • 2,979 Internet, 900 telephone
    • DATA MINING
      • Decision trees & neural networks
      • Provided valuable information when traditional models limited by missing data
  • IT R&D & Economy
    • Proposal: more IT R&D, better economy
      • Apparently not reverse
    • Data: 30 thousand cases, COMPUSTAT
      • 28 quarter differences
    • Technique:
      • Decision Tree, SQL Server
    • Results
      • Weak
      • Look at more variables
  • IT Effect on Firm Size
    • IT reduces transaction costs, reducing firm size
    • IT reduces coordination costs, increasing firm size
    • DATA:
      • Private fixed investment on IT
      • Firm size
      • Compustat
    • DATA MINING:
      • Rules
    • Source of hypotheses
  • Online Auction Fraud Prediction
    • eBay
      • Over 16 million items per day
      • Fraud: 3,700 in 1999, 6,600 in 2000
    • Purpose:
      • Predict seller fraud profile, which products
    • User Trust
    • DATA: golf clubs, humidifiers
    • Initial results inconclusive – more work