Published in: Education, Technology

  1. DBM630: Data Mining and Data Warehousing
     MS.IT., Rangsit University, Semester 2/2011
     by Kritsada Sriphaew (sriphaew.k AT gmail.com)
     Lecture 1: Introduction to Data Mining and Data Warehousing
     Text: Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers (2006). ISBN: 978-1558609013
  2. Administrative Matters
     • Course syllabus
     • Lecture notes, assignments, and quizzes
     • Course communication: announcements, discussion, lecture notes, etc.
       Page: http://www.facebook.com/pages/Data-mining-MSIT-RSU/
     Data Mining and Data Warehousing by Kritsada Sriphaew
  3. How will we be evaluated?
     Assessment task                          % of score
     Quizzes (approx. 2 times)                20
     Assignment (Discussion/Demonstration)    20
     Final                                    60
     To pass: at least 60% of the overall score.
  4. Textbooks
     Mandatory book:
     Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers (2006), Second Edition. ISBN-10: 1558609016, ISBN-13: 978-1558609013
     Supplementary book:
     Data Mining: Practical Machine Learning Tools and Techniques with JAVA Implementations, by Ian H. Witten and Eibe Frank, Morgan Kaufmann Publishers (2005), 2nd Edition. ISBN-10: 0120884070, ISBN-13: 978-0120884070
  5. Course Description (What we'll learn)
     Data warehousing: introduction to data warehousing, characteristics of data warehousing, drawbacks and benefits of data warehousing, architecture of data warehousing, internal data structure for data warehousing, data integration, creating high-quality data, data marts, online analytical processing (OLAP).
     Data mining: introduction to data mining, types of data for mining, architecture of a typical data mining system, data preprocessing, association rule mining, classification and prediction, clustering, mining complex data, data mining applications, current trends in data mining, text mining, web mining, and tools for data mining analysis such as WEKA and SAS.
  6. Course Schedule (tentative)
     Week  Date    Topics
     1     8 JAN   Introduction to Data Mining and Data Warehousing
     2     15 JAN  Data Warehouse and OLAP Technology – I
     3     22 JAN  Data Warehouse and OLAP Technology – II
     4     29 JAN  Data Mining Concepts and Data Preparation
     5     5 FEB   Association Rule Mining
     6     12 FEB  Classification Model: Decision Tree, Classification Rules
     7     19 FEB  Classification Model: Naïve Bayes
     8     26 FEB  Prediction Model: Regression
     9     4 MAR   Clustering
     10    11 MAR  Data Mining Application: Text Mining, Web Mining, Social Network Analysis
     11    18 MAR  Introduction to Data Mining Tool: WEKA
     12    25 MAR  Tutorials
                   Final
  7. Prerequisites
     • Basic database concepts
     • Basic statistics: probability, sampling, logic, linear regression, …
     • Algorithms: basic data structures, dynamic programming, ...
     We will provide some background, but the class will move faster if you have some of the basics in advance.
  8. Introduction
     • Motivation: Why mine data?
     • KDD: Knowledge Discovery in Databases
     • What is Data Mining?
     • Data Mining: On What Kind of Data?
     • Data Mining Tasks
     • Data Mining Applications
  9. Evolution of Database Technology
     • 1960s: data collection, database creation, IMS and network DBMS
     • 1970s: relational data model, relational DBMS implementation
     • 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
     • 1990s-2000s: data mining and data warehousing, multimedia databases, and Web databases
  10. Large Data Sets: A Motivation
     • There is often information "hidden" in the data that is not readily evident.
     • Human analysts take weeks to discover useful information.
     • Much of the data is never analyzed at all.
     • How do you explore millions of records, with tens or hundreds of fields, and find patterns?
  11. KDD Process (Knowledge Discovery in Databases)
     [Figure: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
     Adapted from: U. Fayyad, et al. (1995), "From Knowledge Discovery to Data Mining: An Overview," Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
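
The stages in the KDD figure can be sketched as a chain of small functions. This is only a toy illustration, not part of the lecture: the customer records, the function names, and the "pattern" (average spend by age group) are all assumptions made up for the example.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, age, monthly_spend); None marks a missing value.
raw_data = [
    (1, 25, 120.0),
    (2, None, 80.0),
    (3, 41, 300.0),
    (4, 38, None),
    (5, 52, 310.0),
]

def select(records):
    """Selection: keep only the attributes relevant to the task (age, spend)."""
    return [(age, spend) for _, age, spend in records]

def preprocess(target):
    """Preprocessing: drop rows that contain a missing value."""
    return [row for row in target if None not in row]

def mine(clean):
    """Data mining: a toy pattern, the average spend of under-40s vs. 40-and-over."""
    young = [spend for age, spend in clean if age < 40]
    older = [spend for age, spend in clean if age >= 40]
    return {"under_40": mean(young), "40_plus": mean(older)}

def interpret(patterns):
    """Interpretation/evaluation: turn the pattern into a statement a person can act on."""
    top = max(patterns, key=patterns.get)
    return f"Group '{top}' has the highest average spend ({patterns[top]:.1f})."

target_data = select(raw_data)          # Data -> Target Data
preprocessed = preprocess(target_data)  # Target Data -> Preprocessed Data
patterns = mine(preprocessed)           # Preprocessed Data -> Patterns
knowledge = interpret(patterns)         # Patterns -> Knowledge
print(knowledge)
```

Each intermediate variable corresponds to one of the data boxes in the figure, which makes it easy to inspect what every stage produced.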
  12. Knowledge Discovery
  13. Business Intelligence (BI) vs. Data Mining
     BI is a term for the processes, techniques, and tools that support business decisions using information technology.
     [Figure: pyramid of layers; the potential to support business decisions increases toward the top]
     • Making Decisions
     • Data Presentation: Visualization Techniques
     • Data Mining: Knowledge Discovery
     • Data Exploration: Statistical Analysis, Querying and Reporting
     • Data Warehouses / Data Marts: OLAP
     • Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
     Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
  14. Terminology
     • Data Mining: a step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data.
     • Knowledge Discovery Process: the process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.
  15. Other Definitions of Data Mining
     • Non-trivial extraction of implicit, previously unknown, and useful information from data
     • Automatic or semi-automatic process for analyzing large databases to find patterns that are:
       - valid: hold on new data with some certainty
       - novel: non-obvious to the system
       - useful: should be possible to act on the item
       - understandable: humans should be able to interpret the pattern
  16. Origins of Data Mining
     Data mining overlaps with various fields, but focuses on:
     • Scalability
     • Algorithm and architecture
     • Automation to handle large data
  17. Data Mining: On What Kind of Data?
     • Relational Databases
     • Data Warehouses
     • Transactional Databases
     • Advanced Database Systems
       - Object-Relational
       - Spatial and Temporal
       - Time-Series
       - Multimedia
       - Text
       - Heterogeneous, Legacy, and Distributed
       - WWW
     [Figure: example data types: Structure (3D anatomy), Function (1D signal), Metadata (annotation), and a GeneFilter Comparison Report table of raw and normalized gene intensities]
  18. Data Mining Tasks
     • Classification
     • Clustering
     • Association Rule Mining
     • Sequential Pattern Discovery
     • Regression
     • Anomaly Detection
  19. Ex: Classifying Galaxies
  20. Ex: Market Basket Analysis
     • Where should detergents be placed in the store to maximize their sales?
     • Are window cleaning products purchased when detergents and orange juice are bought together?
     • Is soda typically purchased with bananas? Does the brand of soda make a difference?
     • How are the demographics of the neighborhood affecting what customers are buying?
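
Questions like "are window cleaning products purchased when detergents and orange juice are bought together?" are commonly answered with the standard association-rule measures of support and confidence. The slide itself does not name these measures, so treat the following as a minimal sketch: the five baskets are invented, and `support` and `confidence` are illustrative helper names, not functions from any library.

```python
# Hypothetical transactions; each set holds the items in one shopping basket.
transactions = [
    {"detergent", "orange juice", "window cleaner"},
    {"detergent", "orange juice"},
    {"soda", "bananas"},
    {"detergent", "window cleaner"},
    {"soda", "bananas", "orange juice"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): joint support over antecedent support.

    Raises ZeroDivisionError if the antecedent never occurs in `baskets`.
    """
    joint = support(set(antecedent) | set(consequent), baskets)
    return joint / support(antecedent, baskets)

# Rule: {detergent, orange juice} -> {window cleaner}
rule_support = support({"detergent", "orange juice", "window cleaner"}, transactions)
rule_confidence = confidence({"detergent", "orange juice"}, {"window cleaner"}, transactions)

# Rule: {bananas} -> {soda}
soda_confidence = confidence({"bananas"}, {"soda"}, transactions)

print(rule_support, rule_confidence, soda_confidence)
```

In this toy data, window cleaner appears in half of the baskets that contain both detergent and orange juice, while every basket with bananas also contains soda.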
  21. Ex: Anomaly Detection
     Detect significant deviations from normal behavior.
     Applications:
     • Credit card fraud detection
     • Network intrusion detection
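
One simple way to make "significant deviation from normal behavior" concrete is to flag values that lie several standard deviations away from a historical mean. This is a minimal sketch under assumptions that are not in the slide: the purchase amounts are invented, `find_anomalies` is a hypothetical helper, and the threshold of 3 standard deviations is a common rule of thumb rather than a recommendation from the lecture. Real fraud or intrusion detection systems use far richer models.

```python
from statistics import mean, stdev

def find_anomalies(history, new_values, threshold=3.0):
    """Return the new values lying more than `threshold` sample standard
    deviations away from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)
    return [x for x in new_values if abs(x - mu) > threshold * sigma]

# Hypothetical purchase amounts that represent one cardholder's normal behavior.
history = [42.0, 38.5, 45.0, 40.0, 39.5, 44.0, 41.0, 43.5]

# New purchases to check against that history.
new_purchases = [40.5, 47.0, 950.0]

flagged = find_anomalies(history, new_purchases)
print(flagged)
```

Only the 950.0 purchase is flagged here; the other two fall within three standard deviations of the cardholder's usual spending.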
  22. Some Success Stories
     • Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data
       - Outperformed the (manual) knowledge engineering approach
       - http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good, detailed description of the entire process
     • Major US bank: customer attrition prediction
       - Segment customers based on financial behavior: 3 segments
       - Build attrition models for each of the 3 segments
       - 40-50% of attritions were predicted, a factor-of-18 increase
     • Targeted credit marketing: major US banks
       - Find customer segments based on 13 months of credit balances
       - Build another response model based on surveys
       - Increased response 4 times (2%)
  23. How You'll Benefit
     • Confidently discuss the role and applicability of data warehousing and data mining to business and organizational problems.
     • Gain background knowledge to explore further in your thesis, independent study, or career projects, since data mining methods (which extract knowledge from data) are useful in every field.
  24. Assignment
     • Assignments will aim to test your detailed knowledge and understanding of the topics, as well as your critical thinking and research ability.
     • Assignments may include tasks involving: writing detailed designs; reading research papers; learning and using specialist software/hardware.
     • Assessment: the assignment will be worth 20% of the total course assessment.
  25. PreTest
     1. Select only one of the following items to fill in each blank.
        (a) Characterization/Discrimination  (b) Classification  (c) Numeric Prediction
        (d) Clustering  (e) Association Analysis  (f) Trend Analysis
        Which function matches each of the following tasks?
        ______(1) To estimate the price of stock A next month
        ______(2) To display the portion of products sold, according to their types
        ______(3) To know which products are likely to be sold with which other products
        ______(4) To group customers into a set of similar groups based on their features
        ______(5) To find the value of an experiment when a substance is tested
        ______(6) To predict whether a customer tends to be a good customer or not
     2. Assume that we want to design a model to forecast tomorrow's SET index. Suggest the details of the model that we should construct, and recommend the input and output of the model.