Chapter 1: Introduction to Data Mining



Published in: Education, Technology
  1. Knowledge Acquisition in Decision Making (SQIT 3033), Izwan Nizal Mohd Shaharanee, SQS 4017/6866
  2. Course Objective
      • To introduce knowledge about data mining and data warehouses
      • To evaluate and understand several data mining techniques
      • To enhance data mining skills by analysing business problems
      • To be able to apply the commonly used functions of SAS Enterprise Miner and WEKA to solve data mining problems
      • To develop skills in data mining modelling and data analysis with SAS Enterprise Miner and WEKA
  3. Course Content
      • Introduction to Knowledge Acquisition, also known as knowledge discovery (3 hours)
      • Knowledge Discovery Process (4 hours)
      • Pre-processing data (5 hours)
      • Predictive Modeling (10 hours): Decision Tree, Regression, Neural Network, Rough Set
      • Evaluation and Implementation (6 hours)
      • Descriptive Modeling (7 hours): Clustering, Association Rules
      • Data mining ethics (1.5 hours)
      • Project presentation
  4. Course Evaluation
      • Assignments
      • Case study + presentations
      • Project + poster presentations
      • Mid term?
      • Quizzes?
      • Class participation!
      • Final exam: 40% (the other components: 60%)
  5. Prerequisites
      • A basic statistics course such as SQQS2023 "Business Statistics", plus programming language knowledge, SAS knowledge, and database, spreadsheet and Web 2.0 skills
      • A passion for computer applications
      • Daring to take on challenges
      • A sincere heart to understand God's infinite knowledge
      • Attendance is compulsory (no skipping class, "tuang kelas")
      • Mind your gadgets and please respect others
  6. Timetable
  7. Please introduce yourself. Facebook Group; YouTube Channel + Vimeo Video: izwan nizal
  8.
  9. The Age of Big Data
      • “The BBC documentary follows people who mine Big Data, including LAPD police officers who use data to predict crime, a London scientist/trader who makes millions with math, and a South African astronomer who wants to catalog the entire cosmos.”
      • “Data Scientist” is the sexiest job of the 21st century. The Harvard Business Review made this claim last October and it seems that everyone (including your grandmother) has been repeating it ever since.
  10. Why Knowledge Acquisition?
      • Data explosion (a tremendous amount of data is available, plus cloud computing)
      • Data is being warehoused
      • Computing power: bionic skin?
      • Competitive pressure
      • Hard disks nowadays have capacities of more than 1 TB
  11. What is Knowledge Acquisition?
      • Also known as data mining, knowledge discovery, knowledge extraction, information discovery, information harvesting, etc.
      • The process of discovering useful information, hidden patterns or rules in large quantities of data (nontrivial, previously unknown)
      • Carried out by automatic or semi-automatic means
      • At this scale, finding the patterns manually is impractical
  12. Traditional Approaches
      • Traditional database queries: access a database using a well-defined query written in a language such as SQL
      • The query output consists of data from the database
      • The output is usually a subset of the database
      • (Diagram: SQL, DBMS, DB)
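A minimal sketch of the traditional query approach, using Python's built-in `sqlite3` module. The table name, columns and rows are invented for illustration and do not come from the course material; the point is only that a well-defined query returns a subset of what is already stored.

```python
import sqlite3

# Hypothetical in-memory table; names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (matric TEXT, program TEXT, grade REAL)")
rows = [("A001", "SQS", 3.6), ("A002", "SQS", 2.4), ("A003", "IT", 3.1)]
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", rows)

# A well-defined query: the answer is a subset of the stored rows.
result = conn.execute(
    "SELECT matric, grade FROM students WHERE grade >= ? ORDER BY matric",
    (3.0,),
).fetchall()
print(result)
conn.close()
```

The query only retrieves rows that match a condition stated in advance; it does not uncover patterns nobody asked about, which is where data mining differs.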
  13. Disciplines of Data Mining (diagram linking data mining to database systems, machine learning, algorithms, statistics, visualization and information retrieval)
  14. Data Mining Model & Task
      • Predictive: classification, regression, time series analysis, prediction
      • Descriptive: clustering, summarization, association rules, sequence discovery
  15. Try to relate this to your previous knowledge
      • How does data mining differ from forecasting or prediction? Are they similar?
  16. Predictive Model
      • Makes predictions about values of data using known results found from different data
      • Or based on the use of other historical data
      • Examples: credit card fraud, breast cancer early warning, terrorist acts, tsunamis, etc.
      • Films: Ghost Protocol, Minority Report, Eagle Eye
  17. Predictive Model
      • Performs inference on the current data to make predictions
      • We know what to predict (based on historical data)
      • Never 100% accurate
      • Concentrates on the input-output relationship (x, f(x))
      • Typical questions:
          - Which customers are likely to buy this product in the next four months?
          - What kinds of transactions are likely to be fraudulent?
          - Who is likely to drop this paper?
  18. Predictive Model (figure: profit in RM plotted against months; points marked x are current data, and the point marked o with a question mark is the future value to predict)
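The idea in the profit-against-months figure can be sketched with an ordinary least-squares line fitted to the current data and extended one month ahead. This is only one possible predictive technique, chosen here for brevity; the profit figures are invented and `fit_line` is a hypothetical helper, not something from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented monthly profit figures (RM); these play the role of "current data".
months = [1, 2, 3, 4, 5, 6]
profit_rm = [10.0, 12.5, 13.5, 16.0, 18.5, 19.5]

slope, intercept = fit_line(months, profit_rm)

# "Future data": extend the fitted input-output relationship to month 7.
forecast = slope * 7 + intercept
print(round(forecast, 2))
```

As the slides note, such a prediction is never 100% accurate; the fitted line only summarises the relationship seen in the historical data.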
  19. Descriptive Model
      • Identifies patterns or relationships in data
      • Serves as a way to explore the properties of the data examined, not to predict new properties
      • Always requires a domain expert
      • Examples:
          - Segmenting marketing areas
          - Profiling student performance
          - Profiling Google Play / Apple apps customers
  20. Descriptive Model
      • Discovers new patterns inside the data
      • We may have no idea what the data looks like
      • Explores the properties of the data examined
      • Patterns at various granularities (e.g. student: university -> faculty -> program -> major)
      • Typical questions:
          - What is the data?
          - What does it look like?
          - What does the data suggest for advertising to groups of customers?
  21. Descriptive Model Results (figure: scatter plot of points marked x, o and y, falling into Group 1, Group 2 and Group 3; axis label: major)
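Grouping of the kind shown in the figure can be sketched with a plain k-means loop. The slides do not say which algorithm produced the groups, so k-means is an assumption here, and the points and starting centroids are invented for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            groups[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups


# Invented 2-D points forming three loose groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
start = [(1, 1), (8, 8), (1, 8)]  # assumed starting centroids, one per group

centroids, groups = kmeans(points, start)
for number, group in enumerate(groups, start=1):
    print("Group", number, group)
```

No target value is predicted here: the algorithm only describes structure already present in the data, and a domain expert still has to decide what each group means.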
  22. View of DM
      • Data to be mined: data warehouse, WWW, time series, textual, spatial, multimedia, transactional
      • Knowledge to be mined: classification, prediction, summarization, trend
      • Techniques utilized: database, machine learning, visualization, statistics
      • Applications adapted: marketing, demographic segmentation, stock analysis
  23. DM in Action
      • Medical applications: clinical diagnosis, drug analysis
      • Business: marketing segmentation and strategies, insolvency prediction, loan risk assessment
      • Education: online learning
      • Internet: search engines
      • Etc.
  24. Data Mining Methodology: Hypothesis Testing vs Knowledge Discovery
      • Hypothesis Testing
          - Top-down approach
          - Attempts to substantiate or disprove a preconceived idea
      • Knowledge Discovery
          - Bottom-up approach
          - Starts with the data and tries to get it to tell us something we didn't already know
  25. Data Mining Methodology: Hypothesis Testing
      • Generate good ideas
      • Determine what data would allow these hypotheses to be tested
      • Locate the data
      • Prepare the data for analysis
      • Build computer models based on the data
      • Evaluate the computer models to confirm or reject the hypotheses
  26. Data Mining Methodology: Knowledge Discovery (Directed)
      • Identify sources of pre-classified data
      • Prepare the data for analysis
      • Select appropriate KD techniques based on the data characteristics and the data mining goal
      • Divide the data into training, testing and evaluation sets
      • Use the training dataset to build the model
      • Tune the model by applying it to the test dataset
      • Take action based on the data mining results
      • Measure the effect of the action taken
      • Restart the DM process, taking advantage of the new data generated by the action taken
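The "divide the data into training, testing and evaluation sets" step can be sketched as a shuffled three-way split. The 60/20/20 proportions, the seed and the helper name `split_dataset` are assumptions made for illustration; the slides do not prescribe them.

```python
import random


def split_dataset(records, train=0.6, test=0.2, seed=0):
    """Shuffle a copy of the records and cut it into three disjoint parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]  # whatever remains
    return training, testing, evaluation


records = list(range(100))  # stand-in for 100 pre-classified records
training, testing, evaluation = split_dataset(records)
print(len(training), len(testing), len(evaluation))
```

The model is built on `training`, tuned against `testing`, and judged on `evaluation`, which it has not seen during building or tuning.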
  27. Data Mining Methodology: Knowledge Discovery (Undirected)
      • Identify available data sources
      • Prepare the data for analysis
      • Select appropriate undirected KD techniques based on the data characteristics and the data mining goal
      • Use the selected technique to uncover hidden structure in the data
      • Identify potential targets for directed KD
      • Generate new hypotheses to test
  28. Revision: Two Approaches in Data Mining
      • Predictive (predict the future value): classification, regression, time series analysis, prediction
      • Descriptive (define relationships among data): clustering, summarization, association rules, sequence discovery
  29. Knowledge Discovery Process
  30. Knowledge Discovery Process
  31. Knowledge Discovery Process: 1.0 Selection
      • The data needed for the data mining process may be obtained from many different and heterogeneous data sources
      • Examples:
          - Business transactions
          - Scientific data
          - Video and pictures
          - UUM student database
  32. Knowledge Discovery Process: 2.0 Pre-processing
      • Main idea: ensure that the data is clean (high-quality data)
      • The data to be used by the process may be incorrect or have missing values
      • There may be anomalous data from multiple sources involving different data types and metrics
      • Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted (often using data mining tools)
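A minimal sketch of two common cleaning steps: dropping replicated records and supplying missing values. Filling with the mean is only one option among many, and the record layout and the helper name `clean` are invented for illustration.

```python
def clean(records):
    """Drop exact duplicate records, then fill missing grades with the mean grade."""
    seen = set()
    unique = []
    for record in records:
        key = (record["matric"], record["grade"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(record))  # copy so the input is left untouched

    known = [r["grade"] for r in unique if r["grade"] is not None]
    mean_grade = sum(known) / len(known)  # assumes at least one grade is known
    for r in unique:
        if r["grade"] is None:
            r["grade"] = mean_grade
    return unique


# Invented records: one duplicate and one missing value.
raw = [
    {"matric": "A001", "grade": 3.0},
    {"matric": "A002", "grade": None},
    {"matric": "A001", "grade": 3.0},
    {"matric": "A003", "grade": 4.0},
]
cleaned = clean(raw)
print(cleaned)
```

Whether a missing value should be filled, predicted or the record removed depends on the data and the mining goal, so this choice would normally be reviewed with a domain expert.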
  33. Knowledge Discovery Process: 3.0 Transformation
      • Data from different sources must be converted into a common format for processing
      • Some data may be encoded or transformed into more usable formats
      • Examples: data cleaning, data integration, data transformation, data reduction and data discretization
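Two simple transformations, sketched under assumptions: min-max scaling to bring numeric values onto a common 0 to 1 range, and integer coding of categories so that they can be fed to numeric algorithms. The helper names and sample values are invented for illustration.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


def encode_categories(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    codes = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [codes[label] for label in labels], codes


grades = [2.0, 3.0, 4.0]         # invented values
programs = ["SQS", "IT", "SQS"]  # invented categories

scaled = min_max_scale(grades)
encoded, codes = encode_categories(programs)
print(scaled, encoded, codes)
```

Integer codes imply an ordering that the categories may not have, so techniques that are sensitive to magnitude may need a different encoding.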
  34. Knowledge Discovery Process: 4.0 Data Mining
      • Main idea: use intelligent methods to extract patterns and knowledge from the database
      • This step applies algorithms to the transformed data to generate the desired results
      • The heart of the KD process (where unknown patterns are revealed)
      • Examples of algorithms: Regression (classification, prediction), Neural Networks (prediction, classification, clustering), Apriori Algorithm (association rules), K-Means (clustering), K-Nearest Neighbor (classification), Decision Tree (classification), Instance-Based Learning (classification)
  35. Knowledge Discovery Process: 5.0 Interpretation/Evaluation
      • How the data mining results are presented to the users is extremely important, because the usefulness of the results depends on it
      • Examples of presentation techniques: graphical, geometric, icon-based, pixel-based, hierarchical, hybrid
  36. Case Study: Predicting SQS Final Year Students' Performance
      • Student database: contains 30,000 records (academics, activities)
      • Selection: selected records {matric, PMK, grades}, only 2,000 records (contains incomplete records, etc.)
      • Pre-processing: clean records {replace the missing values, remove the replicated records}
      • Transformation: using neural networks, so transform the data into numerical form
      • Data mining: generated model (pattern for prediction): Y = w1x1 + w2x2 + b1
      • Interpretation & evaluation: testing result 90% correct, so accept the model
      • Knowledge: apply the model
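The evaluation step of the case study can be sketched by scoring a model of the form Y = w1x1 + w2x2 + b1 against labelled test records and accepting it if enough predictions are correct. The weights, the 0.5 cutoff, the test records and the reading of 90% as an acceptance threshold are all assumptions for illustration; none of them come from the slides.

```python
def predict(x1, x2, w1, w2, b1):
    """Model output of the form given on the slide: Y = w1*x1 + w2*x2 + b1."""
    return w1 * x1 + w2 * x2 + b1


def accuracy(examples, w1, w2, b1, cutoff=0.5):
    """Fraction of examples whose thresholded output matches the known label."""
    correct = 0
    for x1, x2, label in examples:
        predicted = 1 if predict(x1, x2, w1, w2, b1) >= cutoff else 0
        if predicted == label:
            correct += 1
    return correct / len(examples)


# Invented test records: (x1, x2, label), with both inputs already scaled to 0..1.
test_set = [
    (0.9, 0.8, 1), (0.8, 0.7, 1), (0.7, 0.9, 1), (0.6, 0.6, 1), (0.9, 0.9, 1),
    (0.2, 0.3, 0), (0.1, 0.4, 0), (0.3, 0.2, 0), (0.4, 0.1, 0), (0.7, 0.6, 0),
]

# Hypothetical weights, standing in for the values a trained model would supply.
acc = accuracy(test_set, w1=0.5, w2=0.5, b1=0.0)
accept_model = acc >= 0.9  # assumed acceptance rule
print(acc, accept_model)
```

In practice the weights would come from training on the cleaned and transformed records, and the acceptance threshold would be agreed with the people who will act on the predictions.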
  37. Assignment 1
      • Group assignment: you may be selected (randomly) to present your answer (2 minutes max)
      • Discuss how prediction/forecasting relates to your life, or any issue related to prediction/forecasting that interests you
      • You may discuss:
          - Can weather forecasting determine your daily exercise planning? How is it done?
          - Give an appropriate example, etc.
      • Minimum 1 page
      • Due date: 18 September 2013