PowerPoint Template



  1. Data Warehousing & Data Mining
     Lecturer: Dr. Bo Yuan
     E-mail: yuanb@sz.tsinghua.edu.cn
  2. Welcome
  3. Mining? Warehousing?
  4. Data Rich, Information Poor
  5. Heterogeneous Data
  6. The Value of Data
  7. Data Integration & Analysis
  8. From Data To Intelligence
     Diagram labels: Decision Models, Decision Support, Data Mining, Knowledge, Preprocessing, Information, Database, Data
  9. Business Intelligence
  10. Related Areas
  11. Is DM really important?
      Q: Your job sounds extremely interesting. What jobs would you recommend to a young person with an interest, and maybe a bachelors degree, in economics?
      A: If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.
      An interview with Google Chief Economist Hal Varian from the New York Times
  12. It is all about data …
      Retail
      Financial Institutions
      WWW
      Healthcare
      Consulting Companies
      Government
      Bioinformatics
      Telecommunication
  13. Course Profile
      Lecturer: Dr. Bo Yuan
      Contact
        Phone: 2603 6067
        E-mail: yuanb@sz.tsinghua.edu.cn
        Room: F-401A
      Time
        2:00 pm – 3:35 pm, Friday
        Venue: CI-105
      Consultation
        2:00 pm – 3:00 pm, Wednesday
        Appointment via phone or e-mail preferred
  14. Aims & Objectives
      Course Aims
        To gain a good understanding of popular data mining techniques.
        To gain experience in implementing and using data mining methods.
        To gain an appreciation for the basic principles of data warehousing.
      Learning Objectives
        Able to implement and apply data mining techniques to solve problems.
        Understand the main issues and core problems in data mining.
        Understand the relationship between data mining and other fields.
        Appreciate data mining research ideas and practice.
        Get familiar with academic writing and presentation.
      Graduate Attributes
        In-depth knowledge of the field of study
        Effective communication
        Independence and teamwork
        Critical judgment
  15. Learning Activities
      Week 1: Introduction
      Week 2: Principles of Data Warehousing
        ETL, OLAP, Metadata
      Week 3: Data Preprocessing
      Week 4 – Week 7: Data Mining (Foundations)
        Bayesian Classifiers, Decision Trees, Neural Networks, Regression, Clustering
        Support Vector Machines, Association Rules
      Week 8: Field Study
      Week 9 – Week 11: Data Mining (Advanced)
        Semi-supervised Learning, Active Learning
        Ensemble Learning, Evolutionary Computation
      Week 12 – Week 13: Special Topic A (Text Mining & Web Information Retrieval)
      Week 14: Special Topic B (Bioinformatics, CRM, Privacy Issue)
      Week 15: Project Presentation
  16. Assessment
      Assignment 1
        Type: Class Presentation
        Weight: 10%
        Task Description: Individual 25-minute talks on selected topics
      Assignment 2
        Type: Algorithm Experimentation
        Weight: 10%
        Task Description: Coding and testing of selected data mining algorithms
      Assignment 3
        Type: Problem Solving
        Weight: 30%
        Task Description: Group project on solving real-world data mining problems
      Final Exam
        Type: Closed Book Examination
        Weight: 50%
        Duration: 120 minutes
      Presentation matters!
  17. Learning Resources
  18. Learning Resources
      International Conference on Data Mining
      International Conference on Data Engineering
      International Conference on Machine Learning
      Pacific-Asia Conference on Knowledge Discovery and Data Mining
      ACM SIGKDD Conference on Knowledge Discovery and Data Mining
  19. Rules & Policies
      Plagiarism
        Plagiarism is the act of misrepresenting as one's own original work the ideas, interpretations, words or creative works of another. Examples include:
        Direct copying of paragraphs, sentences, a single sentence or significant parts of a sentence.
        Presenting work done in collaboration with others as independent work.
        Copying ideas, concepts, research results, computer codes, statistical tables, designs, images, sounds or text or any combination of these.
        Paraphrasing, summarizing or simply rearranging another person's words, ideas, etc., without changing the basic structure and/or meaning of the text.
        Copying or adapting another student's original work into a submitted assessment item.
  20. Rules & Policies
      Late Submission
        Late submissions will incur a penalty of 10% of the total marks for each day that the submission is late (including weekends). Submissions more than 5 days late will not be accepted.
      Assumed Background
        This course will deal with concepts using algorithms and data structures, mathematics, statistics and probability.
  21. 10 Minutes …
  22. Data
      Definition
        “Data are pieces of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.”
      Data Types
        Continuous, Binary
        Discrete, String
        Symbolic
      Storage
        Physical
        Logical
      Major Issues
        Transformation
        Errors and corruption
  23. Database
      Definition
        “A database is an integrated collection of logically related records or files that is stored in a computer system which consolidates records previously stored in separate files into a common pool of data records that provides data for many applications.”
        “A database is a collection of information that is organized so that it can easily be accessed, managed, and updated.”
      Relational Databases
  24. Relational Model
  25. First Normal Form (1NF)
      There's no top-to-bottom ordering to the rows.
      There's no left-to-right ordering to the columns.
      There are no duplicate rows.
      Every cell contains exactly one value from the applicable domain.
  26. First Normal Form (1NF)
  27. First Normal Form (1NF)
  28. Second Normal Form (2NF)
      Definition
        A 1NF table is in 2NF if and only if none of its non-prime attributes are functionally dependent on a part (proper subset) of a candidate key.
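
The 2NF definition rests on functional dependencies, so a small check may help make it concrete. The sketch below is illustrative only: the table, its column names (`employee`, `skill`, `office`) and the helper `holds` are hypothetical and not taken from the slides. It tests whether a dependency is consistent with the rows at hand; sample rows can refute a dependency but cannot prove that it holds for the schema in general.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs is consistent with rows."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        value = tuple(row[attr] for attr in rhs)
        # The same left-hand side must always map to the same right-hand side.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical table whose candidate key is (employee, skill).
rows = [
    {"employee": "Jones", "skill": "Typing", "office": "Room 114"},
    {"employee": "Jones", "skill": "Shorthand", "office": "Room 114"},
    {"employee": "Bravo", "skill": "Typing", "office": "Room 73"},
    {"employee": "Ellis", "skill": "Alchemy", "office": "Room 73"},
]

# office depends on employee alone, i.e. on a proper subset of the key.
# That partial dependency is what 2NF forbids.
partial_dependency = holds(rows, ["employee"], ["office"])
skill_determines_office = holds(rows, ["skill"], ["office"])
```

Under these assumptions, moving `office` into a separate table keyed by `employee` would remove the partial dependency.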
  29. Second Normal Form (2NF)
  30. Third Normal Form (3NF)
      Definition
        Every non-prime attribute of R is non-transitively dependent (directly dependent) on every key of R.
  31. Third Normal Form (3NF)
  32. Data Warehouse
      Operational databases are optimized for the preservation of data integrity and speed of recording of business transactions.
      Data warehouses are optimized for the speed of data retrieval.
      A data warehouse is a repository of an organization's electronically stored data, designed to facilitate reporting and analysis.
      W. H. Inmon states that the data warehouse is:
        Subject-oriented
        Time-variant
        Non-volatile
        Integrated
      Data Warehousing
        Business Intelligence Tools
        Tools to extract, transform, and load data into the repository
        Tools to manage and retrieve metadata
  33. Multidimensional Data
      OLAP Cube
  34. Star Schema
  35. To Build a Data Warehouse
      Data must be extracted from multiple, heterogeneous sources such as databases or other data feeds.
      Data must be formatted for consistency within the data warehouse. Names, meanings and domains of data from unrelated sources must be reconciled.
      Data must be cleaned to ensure validity. Data cleaning is an important part in building a data warehouse and it is one of the most labor-demanding tasks.
      Data must be fitted into the data model of the warehouse. Data may have to be converted from relational, object-oriented, or legacy databases.
      Data must be loaded into the warehouse. The sheer volume of data in the warehouse makes loading the data a significant task.
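
The extract, format, clean and load steps listed on this slide can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production pipeline: the two sources, their field names, the `COUNTRY_CODES` mapping and the `sales` table are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Two hypothetical, heterogeneous sources with different field names.
CSV_SOURCE = "cust_id,amount,country\n1,10.5,CN\n2,,US\n"
DICT_SOURCE = [{"CustomerID": 3, "Amt": "7.25", "Country": "china"}]

# Hypothetical mapping used to reconcile country codings.
COUNTRY_CODES = {"china": "CN", "cn": "CN", "us": "US"}


def extract():
    """Extract raw records from both sources."""
    yield from csv.DictReader(io.StringIO(CSV_SOURCE))
    yield from DICT_SOURCE


def transform(record):
    """Reconcile names and domains; return None for records that fail cleaning."""
    customer = record.get("cust_id", record.get("CustomerID"))
    amount = record.get("amount", record.get("Amt"))
    country = record.get("country", record.get("Country"))
    if amount in (None, ""):
        return None  # cleaning step: drop records with a missing amount
    return int(customer), float(amount), COUNTRY_CODES[country.lower()]


def load(rows):
    """Load cleaned rows into the (in-memory) warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (customer_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn


clean_rows = [row for row in map(transform, extract()) if row is not None]
warehouse = load(clean_rows)
loaded = warehouse.execute(
    "SELECT customer_id, amount, country FROM sales ORDER BY customer_id"
).fetchall()
```

Dropping incomplete records is only one possible cleaning policy; filling in missing values is another common choice.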
  36. Data Warehouse vs. Database
  37. Performance Dashboard
  38. 5 Minutes …
  39. Data Mining
      People have been analysing and investigating data for centuries.
        Statistics
          Mean, Variance, Correlation, Distribution …
      In modern days, data are often far beyond human comprehension.
        Diversity
        Volume
        Dimensionality
      Definition
        Data Mining is the process of automatically extracting interesting and useful hidden patterns from usually massive, incomplete and noisy data.
      Not a fully automatic process
        Human interventions are often inevitable.
        Domain Knowledge
        Data Collection and Pre-processing
      Synonym: Knowledge Discovery
      One Field, Many Techniques, Unlimited Applications
  40. The Process of Data Mining
  41. DM Techniques - Classification
      “Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as variables, characters, etc) and based on a training set of previously labeled items”.
      Given training data {(x1, y1), …, (xn, yn)}, the task is to produce a classifier that maps any unknown object xi to its true classification label yi defined by some unknown mapping.
      Algorithms
        Decision Trees
        K-nearest neighbours
        Neural Networks
        Support Vector Machines
      Applications
        Credit Scoring
        Churn Prediction
        Medical Diagnosis
      (Figure with axes X and Y)
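
As one illustration of learning a classifier from labeled pairs (x, y), here is a minimal k-nearest-neighbours sketch. K-nearest neighbours is one of the algorithms named on the slide; the training points, the labels "A" and "B", and the function name `knn_predict` are made up for this example.

```python
import math
from collections import Counter


def knn_predict(train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training set: two well-separated groups in the plane.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]

prediction_near_a = knn_predict(train, (1.1, 1.0))
prediction_near_b = knn_predict(train, (4.0, 4.0))
```

The sketch uses Euclidean distance via `math.dist` (Python 3.8+); other distance metrics could be substituted.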
  42. Classification Boundaries
  43. Confusion Matrix
      Accuracy = (TP + TN) / (P + N)
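
The accuracy formula can be checked on a small example. The sketch below counts the four confusion-matrix cells for a binary problem and applies Accuracy = (TP + TN) / (P + N), where P = TP + FN and N = TN + FP. The label vectors are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives and false negatives."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Accuracy = (TP + TN) / (P + N), with P = TP + FN and N = TN + FP."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical ground truth and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
```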
  44. Receiver Operating Characteristic
  45. Lift
  46. DM Techniques - Clustering
      Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.
      Distance Metrics
        Euclidean distance
        Manhattan distance
        Mahalanobis distance
      Algorithms
        K-means
        Leader
        RPCL
        Affinity Propagation
      Applications
        Market Research
        Image Segmentation
        Social Network Analysis
      What is the difference between classification and clustering?
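
Two of the listed distance metrics and the K-means algorithm can be sketched briefly. This is a simplified version of the standard K-means iteration with Euclidean distance; the points and the fixed initial centers are hypothetical, chosen so that the result is deterministic. Real uses typically pick the initial centers at random.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.dist(a, b)


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and updating centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical data: two visually obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
```

Unlike the classification sketch style of problem, no labels are supplied here: the groups emerge from the distances alone.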
  47. Hierarchical Clustering
  48. DM Techniques – Association Rule
  49. Association Rule
  50. DM Techniques – Regression
  51. Regression
  52. Overfitting – Regression
  53. Overfitting – Classification
  54. Cross Validation
      Diagram labels: Training Set, Generated Models, Evaluation, Data, Test Set
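
The slide's diagram splits the data into a training set and a test set. One common scheme, k-fold cross validation, repeats that split so that every sample is used for testing exactly once. The sketch below only generates the index splits; the function name `k_fold_indices` and the interleaved fold assignment are choices made for this example, and model fitting and evaluation are left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    # Interleaved assignment: sample i goes to fold i % k.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


splits = list(k_fold_indices(6, 3))
```

A model would be trained on each `train` list and evaluated on the matching `test` list, with the scores averaged across folds.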
  55. Seeing is Knowing
  56. Data Preprocessing
      Why data preprocessing?
        Real data are often surprisingly dirty.
        Incomplete Data
        Inconsistent Data
        Noisy Data
      Typical Issues
        Missing Attribute Values
        Different Coding/Naming Schemes
        Infeasible Values
        Outliers
      Data Quality
        Accuracy
        Completeness
        Consistency
        Interpretability
        Credibility
        Timeliness
  57. Data Preprocessing
      Data quality is a crucial factor in successful data mining tasks.
      Data Cleaning
        Fill in missing values.
        Correct inconsistent data.
        Identify outliers and noisy data.
      Data Integration
        Combine data from different sources.
      Data Transformation
        Normalization
        Aggregation
        Type Conversion
      Data Reduction
        Feature Selection
        Sampling
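
Two of the steps named on this slide, filling in missing values and normalization, can be sketched as follows. Mean imputation and min-max scaling are only one possible choice for each step, and the `ages` column is invented for illustration.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute with one missing value.
ages = [20, None, 40, 60]
filled = fill_missing_with_mean(ages)
normalized = min_max_normalize(filled)
```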
  58. Review
      What is data mining?
      Why is data mining important?
      What are the typical data mining applications?
      What is the general procedure of data mining?
      What are the major techniques in data mining?
      What is the difference between data warehouses and databases?
      What to expect in this course?
      Where to find relevant information?
      How to make the most of this course?
  59. Just in Case Someone Asks …
  60. Just in Case Someone Asks …