Dbm630 lecture10

434
-1

Published on

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
434
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
35
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Dbm630 lecture10

  1. 1. DBM630: Data Mining and Data Warehousing MS.IT. Rangsit University Semester 2/2011 Lecture 10 Data Mining: Case Studies and Applications by Kritsada Sriphaew (sriphaew.k AT gmail.com)1
  2. 2. Topics Data Mining Goals Data Floods Machine Learning / Data Mining Application areas Case Studies2 Data Warehousing and Data Mining by Kritsada Sriphaew
  3. 3. THE FOUR GOALS OF DATA MINING• Prediction: Using current data to make prediction on future activities• Identification: "Data patterns can be used to identify the existence of an item, an event, or an activity“• Classification: Breaking the data down into categories based on certain attributes.• Optimization: Using the mined data to make optimizations on resources, such as time, money, etc. 3 Data Warehousing and Data Mining by Kritsada Sriphaew
  4. 4. Trends leading to Data Flood More data is generated:  Bank, telecom, other business transactions ...  Scientific data: astronomy, biology, etc  Web, text, and e-commerce4 Data Warehousing and Data Mining by Kritsada Sriphaew
  5. 5. Big Data Examples Europes Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session  storage and analysis a big problem AT&T handles billions of calls per day  so much data, it cannot be all stored -- analysis has to be done “on the fly”, on streaming data5 Data Warehousing and Data Mining by Kritsada Sriphaew
  6. 6. Largest databases in 2003 Commercial databases:  Winter Corp. 2003 Survey: France Telecom has largest decision-support DB, ~30TB; AT&T ~ 26 TB Web  Alexa internet archive: 7 years of data, 500 TB  Google searches 4+ Billion pages, several hundreds TB  IBM WebFountain, 160 TB (2003)  Internet Archive (www.archive.org),~ 300 TB6 Data Warehousing and Data Mining by Kritsada Sriphaew
  7. 7. 5 million terabytes created in 2002 UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data was created in 2002. www.sims.berkeley.edu/research/projects/how-much-info-2003/ US produces ~40% of new stored data worldwide10187 Data Warehousing and Data Mining by Kritsada Sriphaew
  8. 8. Data Growth Rate Twice as much information was created in 2002 as in 1999 (~30% growth rate) Other growth rate estimates even higher Very little data will ever be looked at by a human Knowledge Discovery is NEEDED to make sense and use of data. 8 Data Warehousing and Data Mining by Kritsada Sriphaew
  9. 9. Machine Learning / Data MiningApplication areas Science  astronomy, bioinformatics, drug discovery, … Business  advertising, CRM (Customer Relationship management), investments, manufacturing, sports/entertainment, telecom, e-Commerce, targeted marketing, health care, … Web:  search engines, advertising, web and text mining, … Government  surveillance & anti-terror (?|), crime detection, profiling tax cheaters, … 9 Data Warehousing and Data Mining by Kritsada Sriphaew
  10. 10. DM applications in 2004 (in %) 13%: Banking 9%: Direct Marketing, Fraud Detection, Scientific data analysis 8%: Bioinformatics 7%: Insurance, Medical/Pharmaceutic Applications 6%: eCommerce/Web, Telecommunications 4%: Investments/Stocks, Manufacturing, Retail, Security Bellow: Travel, Entertainment/News, … 10 Data Warehousing and Data Mining by Kritsada Sriphaew
  11. 11. Data Mining for Customer Modeling Customer Tasks:  attrition prediction (odchod zákazníků)  targeted marketing:  cross-sell, customer acquisition  credit-risk  fraud detection Industries  banking, telecom, retail sales (maloobchodní prodej), …11 Data Warehousing and Data Mining by Kritsada Sriphaew
  12. 12. Customer Attrition: Case Study  Situation: Attrition rate for mobile phone customers is around 25-30% a year! Task:  Given customer information for the past N months, predict who is likely to attrite next month.  Also, estimate customer value and what is the cost- effective offer to be made to this customer.12 Data Warehousing and Data Mining by Kritsada Sriphaew
  13. 13. Customer Attrition Results Verizon Wireless built a customer data warehouse Identified potential attriters Developed multiple, regional models Targeted customers with high propensity to accept the offer Reduced attrition rate from over 2%/month to under 1.5%/month (huge impact, with >30 M subscribers)(Reported in 2003) 13 Data Warehousing and Data Mining by Kritsada Sriphaew
  14. 14. Assessing Credit Risk: Case Study Situation: Person applies for a loan Task: Should a bank approve the loan? Note: People who have the best credit don’t need the loans, and people with worst credit are not likely to repay. Bank’s best customers are in the middle 14 Data Warehousing and Data Mining by Kritsada Sriphaew
  15. 15. Credit Risk - Results Banks develop credit models using variety of machine learning methods. Mortgage and credit card proliferation are the results of being able to successfully predict if a person is likely to default on a loan Widely deployed in many countries 15 Data Warehousing and Data Mining by Kritsada Sriphaew
  16. 16. Successful e-commerce – Case Study A person buys a book (product) at Amazon.com. Task: Recommend other books (products) this person is likely to buy Amazon does clustering based on books bought:  customers who bought “Advances in Knowledge Discovery and Data Mining”, also bought “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations” Recommendation program is quite successful 16 Data Warehousing and Data Mining by Kritsada Sriphaew
  17. 17. Unsuccessful e-commerce case study (KDD-Cup2000) Data: clickstream and purchase data from Gazelle.com, legwear and legcare e-tailer Q: Characterize visitors who spend more than $12 on an average order at the site Dataset of 3,465 purchases, 1,831 customers Very interesting analysis by Cup participants  thousands of hours - $X,000,000 (Millions) of consulting Total sales -- $Y,000 Obituary: Gazelle.com out of business, Aug 2000 17 Data Warehousing and Data Mining by Kritsada Sriphaew
  18. 18. Genomic Microarrays – Case StudyGiven microarray data for a number of samples (patients), can we Accurately diagnose the disease? Predict outcome for given treatment? Recommend best treatment? 18 Data Warehousing and Data Mining by Kritsada Sriphaew
  19. 19. Example: ALL/AML data 38 training cases, 34 test, ~ 7,000 genes 2 Classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML) Use train data to build diagnostic modelALL AML Results on test data: 33/34 correct, 1 error may be mislabeled 19 Data Warehousing and Data Mining by Kritsada Sriphaew
  20. 20. Security and Fraud Detection - Case Study  Credit Card Fraud Detection  Detection of Money laundering  FAIS (US Treasury)  Securities Fraud  NASDAQ KDD system  Phone fraud  AT&T, Bell Atlantic, British Telecom/MCI  Bio-terrorism detection at Salt Lake Olympics 200220 Data Warehousing and Data Mining by Kritsada Sriphaew
  21. 21. Data Mining and Privacy in 2006, NSA (National Security Agency) was reported to be mining years of call info, to identify terrorism networks Social network analysis has a potential to find networks Invasion of privacy – do you mind if your call information is in a gov database? What if NSA program finds one real suspect for 1,000 false leads ? 1,000,000 false leads? 21 Data Warehousing and Data Mining by Kritsada Sriphaew
  22. 22. Case Study: Intrusion Detection Systems Detection Approach  Misuse Detection ▪ Based on known malicious patterns (signatures)  Anomaly Detection ▪ Based on deviations from established normal patterns (profiles) Data Source  Network-based (NIDS) ▪ Network traffic  Host-based (HIDS) ▪ Audit trails23 Data Warehousing and Data Mining by Kritsada Sriphaew
  23. 23. Case Study: Intrusion Detection SystemsData Mining Usage Signature extraction Rule matching Alarm data analysis  Reduce false alarms  Eliminate redundant alarms Feature selection Training Data cleaning 24 Data Warehousing and Data Mining by Kritsada Sriphaew
  24. 24. Case Study: Intrusion Detection Systems Behavioral Feature for Network Anomaly Detection  Training set = normal network traffic  Feature provides semantics of the values of data  Feature selection is important  Proposed method: ▪ Feature extraction based on protocol behavior ▪ Many Attacks uses protocol improperly ▪ Ping of Death ▪ SYN Flood ▪ Teardrop25 Data Warehousing and Data Mining by Kritsada Sriphaew
  25. 25. Case Study: Intrusion Detection Systems Decision Tree (only small part) <=0.4 WWW tcpPerPSH <=0.79 SMTP >0.01 >0.4 tcpPerPSH >0.79 FTPtcpPerFIN <=0.03 telnet >0.79 SMTP <=0.01 >546773 tcpPerSYN >73 tcpPerPSH meanIAT >0.03 meanipTLen … >546773 … <=73 … 26 Data Warehousing and Data Mining by Kritsada Sriphaew
  26. 26. CASE STUDY: METLIFECompany ProfileMetLife, Inc. is a leading provider of insurance and other financialservices to millions of individual and institutional customersthroughout the United States.Established in 1863, Metlife now has offices all overthe US and the world, and offers ten different typesof insurances and financial services. 27 Data Warehousing and Data Mining by Kritsada Sriphaew
  27. 27. CASE STUDY: METLIFEIndustry: Insurance and Financial ServicesHow they use Data Mining: Fraud Detection 28 Data Warehousing and Data Mining by Kritsada Sriphaew
  28. 28. CASE STUDY: METLIFE• Project first started in 2001• MetLife set out to build $50 Million relational database• This project would consolidate data from 30 business world wide. 29 Data Warehousing and Data Mining by Kritsada Sriphaew
  29. 29. CASE STUDY: METLIFE• Around same time, it was reported that $30 Million of insurance money went to fraudulent claims.• MetLife teamed up with Computer Sciences Corporation (CSC) to o License their data mining tool (called Fraud Investigator), o Develop @First, "an early fraud detection system" 30 Data Warehousing and Data Mining by Kritsada Sriphaew
  30. 30. CASE STUDY: METLIFE• By 2003, MetLifes data mining operation was in full swing.• They were able to detect fraud in a fraction of the time it would take in man hours• One example is detecting rate evasion 31 Data Warehousing and Data Mining by Kritsada Sriphaew
  31. 31. CASE STUDY: METLIFE• Rate evasion is lying about where you live to pay lower premiums.• Metlife used data mining to detect rate evasion by matching ZIP codes with phone numbers to see if the cities matched.• In 2.5 hours, Metlife found 107 fraudulent claims, all linked to a rate-evasion ring in NY and Massachusetts. 32 Data Warehousing and Data Mining by Kritsada Sriphaew

×