Web-Based Data Mining System



  1. C.-C. Chan, Department of Computer Science, University of Akron, Akron, OH 44325-4003, USA. UA Faculty Forum 2008 by C.-C. Chan
  2. Outline
     • Overview of Data Mining
     • Software Tools
     • A Rule-Based System for Data Mining
     • Concluding Remarks
  3. Data Mining (KDD)
     • From Data to Knowledge
     • Process of KDD (Knowledge Discovery in Databases)
     • Related Technologies
     • Comparisons
  4. Why KDD?
     • "We are drowning in information, but starving for knowledge." (John Naisbitt)
     • Growing gap between data generation and data understanding:
       • Automation of business activities:
         • Telephone calls, credit card charges, medical tests, etc.
       • Earth observation satellites:
         • Estimated to generate one terabyte (10^12 bytes) of data per day. At a rate of one picture per second.
       • Biology:
         • The Human Genome database project has collected over gigabytes of data on the human genetic code [Fasman, Cuticchia, Kingsbury, 1994].
       • US Census data:
       • NASA databases:
       • …
       • World Wide Web:
  5. Process of KDD
     [1] Fayyad, U., Editorial, Int. J. of Data Mining and Knowledge Discovery, Vol. 1, Issue 1, 1997.
     [2] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery: an overview," in Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), MIT Press, 1996.
  6. Process of KDD
     • Selection
       • Learning the application domain
       • Creating a target dataset
     • Pre-Processing
       • Data cleaning and preprocessing
     • Transformation
       • Data reduction and projection
     • Data Mining
       • Choosing the functions and algorithms of data mining
       • Association rules, classification rules, clustering rules
     • Interpretation and Evaluation
       • Validate and verify discovered patterns
     • Using discovered knowledge
  7. Typical Data Mining Tasks
     • Finding Association Rules [Rakesh Agrawal et al., 1993]
       • Each transaction is a set of items.
       • Given a set of transactions, an association rule is of the form X ⇒ Y, where X and Y are sets of items.
         • e.g.: 30% of transactions that contain beer also contain diapers (the rule's confidence); 2% of all transactions contain both of these items (the rule's support).
     • Applications:
       • Market basket analysis and cross-marketing
       • Catalog design
       • Store layout
       • Buying patterns
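The two percentages on the association-rule slide correspond to what the association-rule literature calls confidence and support. A minimal Python sketch of how they could be computed follows; the function names and the five toy transactions are made up for illustration and are not taken from the talk.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0  # the rule X => Y never applies in this data
    return support(transactions, set(x) | set(y)) / support_x


# Hypothetical toy data: each transaction is a set of items.
transactions = [
    frozenset({"beer", "diapers", "chips"}),
    frozenset({"beer", "diapers"}),
    frozenset({"beer", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"milk", "diapers"}),
]

rule_support = support(transactions, {"beer", "diapers"})
rule_confidence = confidence(transactions, {"beer"}, {"diapers"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data, beer and diapers appear together in 2 of 5 transactions, and 2 of the 3 beer transactions also contain diapers. A real miner would enumerate candidate itemsets instead of checking one hand-picked rule.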
  8. Finding Sequential Patterns
     • Each data sequence is a list of transactions.
     • Find all sequential patterns with a user-specified minimum support.
       • e.g.: Consider a book-club database. A sequential pattern might be:
         • 5% of customers bought “Harry Potter I”, then “Harry Potter II”, and then “Harry Potter III”.
     • Applications:
       • Add-on sales
       • Customer satisfaction
       • Identify symptoms/diseases that precede certain diseases
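As a rough illustration of the book-club example, the sketch below counts how many customers' purchase histories contain a given pattern in order. It assumes each data sequence is a time-ordered list of item sets and that the steps of a pattern must fall in strictly later transactions; the customer data and function names are invented for the example.

```python
def contains_pattern(sequence, pattern):
    """True if the pattern's steps occur in order, each in a later transaction."""
    pos = 0
    for step in pattern:
        step = frozenset(step)
        # Advance to the first remaining transaction that includes this step.
        while pos < len(sequence) and not step <= frozenset(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next step must come from a strictly later transaction
    return True


def sequence_support(sequences, pattern):
    """Fraction of data sequences (customers) that contain the pattern."""
    return sum(contains_pattern(s, pattern) for s in sequences) / len(sequences)


# Hypothetical book-club data: one list of transactions per customer.
customers = [
    [{"Harry Potter I"}, {"Cookbook"}, {"Harry Potter II"}, {"Harry Potter III"}],
    [{"Harry Potter II"}, {"Harry Potter I"}, {"Harry Potter III"}],
    [{"Harry Potter I", "Harry Potter II"}, {"Harry Potter III"}],
    [{"Harry Potter I"}, {"Harry Potter II", "Atlas"}, {"Harry Potter III"}],
]
pattern = [{"Harry Potter I"}, {"Harry Potter II"}, {"Harry Potter III"}]
pattern_support = sequence_support(customers, pattern)
print(pattern_support)
```

Only the first and fourth customers bought the three books in order across separate transactions. A pattern would be reported when this fraction meets the user-specified minimum support.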
  9. Finding Classification Rules
     • Finding discriminant rules for objects of different classes.
     • Approaches:
       • Finding Decision Trees
       • Finding Production Rules
     • Applications:
       • Processing loan and credit card applications
       • Model identification
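To make "finding production rules" concrete, here is a small sketch using the 1R (one-rule) method, a deliberately simple technique chosen for brevity and not the approach used in the talk. It picks the single attribute whose value-to-class rules make the fewest training errors. The loan data and attribute names are hypothetical.

```python
from collections import Counter, defaultdict


def one_r(rows, target):
    """Return (attribute, rules, errors) for the best single-attribute rule set."""
    best = None
    for attr in rows[0]:
        if attr == target:
            continue
        # Count class labels for each value of this attribute.
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        # One rule per value: predict the majority class for that value.
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best


# Hypothetical loan applications.
loans = [
    {"income": "high", "employed": "yes", "decision": "approve"},
    {"income": "high", "employed": "no", "decision": "approve"},
    {"income": "low", "employed": "yes", "decision": "reject"},
    {"income": "low", "employed": "no", "decision": "reject"},
    {"income": "medium", "employed": "yes", "decision": "approve"},
    {"income": "medium", "employed": "no", "decision": "reject"},
]
best_attr, best_rules, best_errors = one_r(loans, "decision")
for value, decision in best_rules.items():
    print(f"IF {best_attr} = {value} THEN decision = {decision}")
```

Each printed line is a production rule of the form IF condition THEN class. Decision-tree learners extend the same idea by splitting recursively on further attributes.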
  10. Text Mining, Web Usage Mining, etc.
  11. Related Technologies
     • Database Systems
       • MS SQL Server
         • Transaction databases
         • OLAP (Data Cubes)
         • Data Mining
           • Decision Trees
           • Clustering Tools
     • Machine Learning/Data Mining Systems
       • CART (Classification And Regression Trees)
       • C 5.x (Decision Trees)
       • WEKA (Waikato Environment for Knowledge Analysis)
       • LERS
       • ROSE 2
     • Rule-Based Expert System Development Environments
       • CLIPS, JESS
       • EXSYS
     • Web-based Platforms
       • Java
       • MS .Net
  12. Comparisons

     System        | Pre-Processing | Learning / Data Mining                | Inference Engine | End-User Interface | Web-Based Access | Reasoning with Uncertainties
     --------------+----------------+---------------------------------------+------------------+--------------------+------------------+-----------------------------
     MS SQL Server | N/A            | Decision Trees, Clustering            | N/A              | N/A                | N/A              | N/A
     CART, C 5.x   | N/A            | Decision Trees                        | Built-in         | Embedded           | N/A              | N/A
     WEKA          | Yes            | Trees, Rules, Clustering, Association | N/A              | Embedded           | Need Programming | N/A
     CLIPS, JESS   | N/A            | N/A                                   | Built-in         | Embedded           | Need Programming | 3rd-party extensions
  13. Rule-Based Data Mining System Objectives
     • Develop an integrated rule-based data mining system that provides:
       • Synergy of database systems, machine learning, and expert systems
       • Handling of uncertain rules
       • Delivery of a web-based user interface
  14. Structure of Rule-Based Systems
  15. System Workflow: Input Data Set → Data Pre-processing → Rule Generator → User Interface Generator
  16.
     • Input Data Set:
       • Text file with comma-separated values (CSV)
       • It is assumed that there are N columns of values corresponding to N variables or parameters, which may be real or symbolic values.
       • The first N – 1 variables are considered inputs and the last one is the output variable.
     • Data Preprocessing:
       • Discretize the domains of real variables into a finite number of intervals.
       • The discretized data file is then used to generate an attribute information file and a training data file.
     • Rule Generator:
       • A symbolic learning program called BLEM2 is used to generate rules with uncertainty.
     • User Interface Generator:
       • Generate a web-based rule-based system from a rule file and the corresponding attribute file.
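The input and preprocessing steps described above can be sketched as follows. This is an illustration under stated assumptions, not the system's actual code: it assumes the CSV has no header row, treats a column as real-valued when every entry parses as a number, and uses equal-width binning because the slides do not say which discretization method the system applies. All function names and the sample data are hypothetical.

```python
import csv
import io


def load_csv(text):
    """Split CSV rows into input values (first N - 1 columns) and the output column."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    return [r[:-1] for r in rows], [r[-1] for r in rows]


def is_real(values):
    """True if every value in the column parses as a number."""
    try:
        for v in values:
            float(v)
        return True
    except ValueError:
        return False


def equal_width_bins(values, k):
    """Map each real value to an interval index 0..k-1 (equal-width binning)."""
    nums = [float(v) for v in values]
    lo, hi = min(nums), max(nums)
    width = (hi - lo) / k or 1.0  # avoid division by zero for constant columns
    return [min(int((x - lo) / width), k - 1) for x in nums]


def discretize(inputs, k=3):
    """Discretize real-valued columns; leave symbolic columns unchanged."""
    columns = list(zip(*inputs))
    new_columns = [equal_width_bins(c, k) if is_real(c) else list(c) for c in columns]
    return [list(row) for row in zip(*new_columns)]


# Hypothetical data set: temperature (real), pain level (symbolic), diagnosis (output).
SAMPLE = """36.6,low,healthy
38.9,high,sick
37.2,low,healthy
40.1,high,sick
"""
inputs, outputs = load_csv(SAMPLE)
training_inputs = discretize(inputs, k=3)
print(training_inputs, outputs)
```

The discretized rows plus the output column are what a rule learner such as BLEM2 would then consume. The attribute information file mentioned on the slide is not modeled here.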
  17. Architecture of RBC generator; Workflow of RBC generator
     [Diagram labels: Rule set File, Metadata File, RBC Generator, Rule Table Definition, SQL DB server, Middle Tier, Client, Requests, Responses]
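The diagram indicates that generated rules are stored in a rule table on a SQL database server and queried through a middle tier. The sketch below shows one way such a table could be laid out and consulted. Everything in it is an assumption: the schema, the table and column names, the use of an in-memory SQLite database, and the policy of returning the matching rule with the highest certainty. None of these are specified in the slides.

```python
import sqlite3


def build_rule_db(rules):
    """Store rules as a head table (decision, certainty) plus a condition table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE rule_head (id INTEGER PRIMARY KEY, decision TEXT, certainty REAL)")
    db.execute("CREATE TABLE rule_condition (rule_id INTEGER, attr TEXT, val TEXT)")
    for rule_id, (conditions, decision, certainty) in enumerate(rules, start=1):
        db.execute("INSERT INTO rule_head VALUES (?, ?, ?)", (rule_id, decision, certainty))
        db.executemany(
            "INSERT INTO rule_condition VALUES (?, ?, ?)",
            [(rule_id, a, v) for a, v in conditions.items()],
        )
    return db


def classify(db, case):
    """Return (decision, certainty) of the most certain rule whose conditions all hold."""
    best = None
    heads = db.execute("SELECT id, decision, certainty FROM rule_head").fetchall()
    for rule_id, decision, certainty in heads:
        conditions = db.execute(
            "SELECT attr, val FROM rule_condition WHERE rule_id = ?", (rule_id,)
        ).fetchall()
        if all(case.get(a) == v for a, v in conditions):
            if best is None or certainty > best[1]:
                best = (decision, certainty)
    return best


# Hypothetical uncertain rules: (conditions, decision, certainty).
rules = [
    ({"temperature": "high", "headache": "yes"}, "flu", 0.9),
    ({"temperature": "high"}, "flu", 0.6),
    ({"temperature": "normal"}, "healthy", 0.8),
]
db = build_rule_db(rules)
print(classify(db, {"temperature": "high", "headache": "yes"}))
print(classify(db, {"temperature": "low", "headache": "no"}))
```

In a web deployment, the `classify` step would sit in the middle tier, taking attribute values from a client request and returning the decision in the response.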
  18. Concluding Remarks
     • A system for generating rule-based classifiers from data, with the following benefits:
       • No need for end-user programming
       • Automatic rule-based system creation
       • The web-based delivery system provides easy access
  19. Project Status
     • The current version (1.4) of our system provides fundamental data mining features, including:
       • Data preprocessing
       • Management of preprocessed data files
       • A machine learning tool to generate rules from data
       • A rule-based classifier system supporting uncertain rules
       • Web-based access
  20. Future Work
     • More advanced features in data preprocessing, such as data cleansing, data transformation, and data statistics
     • Learning from multi-criteria inputs with preferential rankings to support Multiple Criteria Decision Making processes
     • Concept-oriented information retrieval and search
  21. Thank You!