Lecture 2



  1. 1. Data Mining UMUC CSMN 667 Lecture #2
  2. 2. Term Paper - Data Mining Case Analysis
      • Refer to the Project Descriptions section of the WebTycho course Syllabus for detailed information.
      • 1-page Summary (Abstract + Outline) due: April 4, 2005
      • Final Paper due: 12 midnight, April 18, 2005
      • Submit both in your WebTycho Assignments Folder
      • Term Paper page restrictions: 5-8 pages
      • I will submit your paper to TurnItIn.com for verification of originality, per UMUC Graduate School policies.
      • Format/Style: Use the SPIE Conference Proceedings Style, which is available at:
        http://www.spie.org/app/Publications/index.cfm?fuseaction=authinfo&type=manspecs
        [ONLY USE THIS FOR STYLE FILES AND FORMATTING INSTRUCTIONS]
  3. 3. Case Analysis Instructions (1)
      • The goal of the paper assignment is to complete an in-depth study of a data mining application. Examples of applications include financial, scientific, medical, intrusion detection, and web mining.
      • Describe the data types, data volumes, technical challenges, end-goals, and user community. Which data mining algorithms are most relevant? Why use data mining, and how is it used? What is the current status of data mining usage in this field?
      • Possible case topics include:
          • A direct mailing application looking to maximize cross-selling opportunities (e.g., Doubleclick).
          • A bank determining the credit worthiness of a potential customer (e.g., American Express, Bank of America).
          • A medical insurer looking to detect medical fraud.
          • Gene detection in BioInformatics (e.g., Celera).
          • Glitch or anomaly detection in scientific time series data.
          • Abnormal network access behavior for detection of computer system intrusion and security violation.
  4. 4. Case Analysis Instructions (2)
      • You may choose to go in depth in either one of these two areas:
          • A data mining application domain: Evaluate the application area in detail, as explained on the previous slide, including a review and analysis of the different data mining techniques employed there.
          Or
          • A data mining technique: Research in depth the different application domains where this technique has been used. Answer the questions on the previous slide when evaluating this technique’s different application areas.
  5. 5. Case Analysis Paper - Instructions (3)
      • Please e-mail me your suggested topic (application area to be researched) so that I may verify that it is okay.
  6. 6. Case Analysis Paper - Instructions (4)
      • Submit your completed paper in WebTycho.
      • You may submit your paper in any of these formats: PDF, Microsoft Word, or PostScript (PS).
      • You must submit it no later than midnight on April 18. WebTycho will not allow submissions after that time.
      • Submit the paper in your "Assignments Folder" (on the left menu bar within the WebTycho course website).
  7. 7. Lecture 2: “Data Mining Roots” (Chapter 2 of Dunham textbook)
  8. 8. Lecture 2 Outline
      • Summary of “What is Data Mining?” Tutorial
      • Foundations of Data Mining
      • Database Systems
      • Data Warehousing and OLAP
      • Statistics and Data Mining
      • Information Retrieval
      • Data Mining as “Rule Induction”
      • Fuzzy Sets and Logic
      • Machine Learning
      • Steps in the Data Mining Process
      • Major Issues in Data Mining
      • A Case Study: The NASA Mars Rover
  9. 9. “What is Data Mining?” From the online reading assignment: Data Mining Tutorial at http://www.megaputer.com/dm/dm101.php3
  10. 10. Summary of “What is Data Mining?” Tutorial
      • What is data mining?
      • Why use data mining?
      • What can Data Mining do for you?
      • Reasons for the growing popularity of Data Mining
      • Tasks Solved by Data Mining
      • Different DM Technologies and Systems
          • Subject-oriented analytical systems
          • Statistical packages
          • Neural Networks
          • Evolutionary Programming
          • Memory Based Reasoning
          • Decision Trees
          • Genetic Algorithms
          • Nonlinear Regression Methods
  11. 11. What can Data Mining do for you? (business-focused list)
      • Identify your best prospects and then retain them as customers.
      • Predict cross-sell opportunities and make recommendations.
      • Learn parameters influencing trends in sales and margins.
      • Segment markets and personalize communications.
  12. 12. Reasons for the Growing Popularity of Data Mining
      • Growing Data Volumes
      • Limitations of Human Analysis
      • Low Cost of Machine Learning
      Tasks Solved by Data Mining
      • Prediction
      • Classification
      • Detection of Relations
      • Deviation Detection
      • Explicit Modeling
      • Clustering
      • Market Basket Analysis
  13. 13. Foundations of Data Mining
  14. 14. Foundations of Data Mining: Databases, Statistics, and Machine Learning
      • David Hand (1998, “Data Mining: Statistics and More?”, The American Statistician, 52, pp. 112–118) used the following definition:
          • “Data mining is a new discipline lying at the interface of statistics, database technology, pattern recognition, machine learning, and other areas. It is concerned with the secondary analysis of large databases in order to find previously unsuspected relationships which are of interest or value to the database owners.”
          • Why “secondary”? Because the data were typically collected for other purposes (such as billing, accounting, customer addresses, etc.). Primary analysis of large databases is generally the domain of STATISTICS.
  15. 15. Evolution of Data Mining (slide from Lecture 1; source: http://www.thearling.com/text/dmwhite/dmwhite.htm)
  16. 16. Foundation for Data Mining Techniques
      • 1960s:
          • Data collection, database creation, IMS, and hierarchical DBMS
      • 1970s:
          • Relational data model, relational DBMS implementation
      • 1980s:
          • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, financial, manufacturing, sales, etc.)
      • 1990s-2000s:
          • Data mining and data warehousing, multimedia databases, and Web databases
  17. 17. History of Data Mining
      • Dates for specific events were imprecise in the preceding slides. This might be a little better:
  18. 18. Data Mining: Confluence of Multiple Disciplines
      [Diagram labels: Data Mining, Database Technology, Statistics, Other Disciplines, Information Science, Machine Learning, Visualization]
  19. 19. Data Mining Stepping Stones (http://www.cs.sfu.ca/~han/DM_Book.html)
      [Layered diagram, annotated “Increasing potential to support business decisions”]
      • Making Decisions
      • Data Presentation: Visualization Techniques
      • Data Mining: Information Discovery
      • Data Exploration: Statistical Analysis, Querying and Reporting
      • Data Warehouses / Data Marts: OLAP, MDA
      • Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
      Roles shown alongside the layers: End User, Business Analyst, Data Analyst, DBA
  20. 20. Database Systems
  21. 21. Database Systems
      • DBMS joins “AI and statistics” to become Data Mining
      • Data mining usually asks complex statistical questions that are difficult to answer via traditional SQL queries
      • Data mining relies on special algorithms outside of the standard DBMS/SQL family of tools
      • Data mining is used to extract knowledge from DBMS, not just the data bits (i.e., KDD)
      • Data mining applies familiar statistical concepts to large DBMS (e.g., outlier detection; cluster analysis; data modeling; evolutionary analysis; prediction)
  22. 22. Data Mining is a core database function
      • Data Mining has many names / aliases:
          • Knowledge Discovery in Databases (KDD)
          • Machine Learning (ML)
          • Exploratory Data Analysis (EDA)
          • Intelligent Data Analysis (IDA)
          • On-Line Analytical Processing (OLAP)
          • Business Intelligence (BI)
          • Customer Relationship Management (CRM)
          • Business Analytics
          • Target Marketing
          • Cross-Selling
          • Market Basket Analysis
          • Credit Scoring
          • Case-Based Reasoning (CBR)
          • Connecting the Dots
          • Intrusion Detection Systems (IDS)
          • Recommendation / Personalization Systems!
  23. 23. Database Systems and Data Mining
      • Data mining brings novel non-traditional concepts to large DBMS (e.g., association mining; neural nets; decision trees; link analysis; pattern recognition; classification; regression; SOMs). For example:
          • Clustering Analysis = group together similar items and separate dissimilar items
          • Classification Prediction = predict the class label
          • Regression = predict a numeric attribute value
          • Association Analysis = detect attribute-value conditions that occur frequently together (e.g., Beer & Diapers example)
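To make the Association Analysis bullet concrete, here is a minimal sketch that computes support and confidence for the rule "beer implies diapers". The baskets and item names are invented for illustration, and support/confidence are the usual association-rule measures rather than anything defined on the slide.

```python
# Hypothetical market baskets; the data are invented for illustration only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"beer", "diapers"}, transactions)
rule_confidence = confidence({"beer"}, {"diapers"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A real association miner would search over many candidate itemsets; this sketch only scores one rule that is already given.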
  24. 24. Types of Databases to be Mined
      • Relational databases
      • Data warehouses
      • Transactional databases
      • Advanced DB and information repositories:
          • Object-oriented and object-relational databases
          • Spatial databases
          • Time-series data and temporal data
          • Text databases and multimedia databases
          • Heterogeneous and legacy databases
          • WWW, and eventually the Semantic Web
  25. 25. Data Warehousing and OLAP
  26. 26. Data Warehousing
      • Data warehouse = Materialized view
      • Integrated view of data from distributed sources
      • If the transformation process can be represented via SQL, then the data warehouse can be seen as a DB view:
          • CREATE VIEW warehouse_table AS SELECT … FROM source_table1, source_table2, … WHERE …
          • except that the view is materialized = the result is stored and needs to be maintained when source data change
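The "stored result that must be maintained" idea can be sketched with Python's built-in sqlite3 module. SQLite has no materialized views, so this sketch emulates one with CREATE TABLE ... AS SELECT and a full rebuild on refresh. The source tables (sales_east, sales_west) and their contents are hypothetical; only the name warehouse_table comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_east (product TEXT, amount REAL);
    CREATE TABLE sales_west (product TEXT, amount REAL);
    INSERT INTO sales_east VALUES ('widget', 10.0), ('gadget', 5.0);
    INSERT INTO sales_west VALUES ('widget', 7.0);
""")

# SQL defining the integrated view over both (hypothetical) sources.
VIEW_SQL = """
    SELECT product, SUM(amount) AS total
    FROM (SELECT * FROM sales_east UNION ALL SELECT * FROM sales_west)
    GROUP BY product
"""

def refresh_warehouse(connection):
    """Rebuild the stored result from scratch (the simplest maintenance policy)."""
    connection.execute("DROP TABLE IF EXISTS warehouse_table")
    connection.execute("CREATE TABLE warehouse_table AS " + VIEW_SQL)

def totals(connection):
    """Read the stored (materialized) result as a product -> total mapping."""
    return dict(connection.execute("SELECT product, total FROM warehouse_table"))

refresh_warehouse(conn)
before = totals(conn)

# A source changes: the stored copy is stale until it is refreshed.
conn.execute("INSERT INTO sales_west VALUES ('gadget', 2.0)")
stale = totals(conn)
refresh_warehouse(conn)
after = totals(conn)
print(before, stale, after)
```

Rebuilding everything is easy to reason about but costly for large warehouses; incremental maintenance is the usual alternative and is not shown here.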
  27. 27. Order of Database Operations (1)
      • When building a DW, pay attention to the order of operations in the SQL command
          • particularly if large data need to be selected, grouped, and ordered
          • perhaps build intermediate views to cull data down to manageable size
      • Order of operations . . .
  28. 28. Order of Database Operations (2)
      Clause     Operational order   Role
      select     (4)                 specifies attributes and computations to appear in answer
      from       (1)                 indicates Cartesian product of source tables
      where      (2)                 provides boolean to filter Cartesian product
      groupby    (3)                 specifies attributes necessary to cluster the results of the where-filter
      orderby    (5)                 indicates attributes on which to order any visual display or sequential tuple returns
      into       (6)                 specifies a temporary table to hold the answer
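The operational order can be traced by hand in plain Python. This is only a sketch of the logical order on two tiny hypothetical tables (customers, orders); a real DBMS optimizer is free to execute the query differently as long as the answer is the same.

```python
from collections import defaultdict
from itertools import product

# Hypothetical source tables, invented for illustration.
customers = [{"cid": 1, "region": "east"}, {"cid": 2, "region": "west"}]
orders = [
    {"cid": 1, "amount": 30},
    {"cid": 1, "amount": 20},
    {"cid": 2, "amount": 40},
    {"cid": 2, "amount": 5},
]

# (1) FROM: Cartesian product of the source tables.
rows = list(product(customers, orders))
cartesian_size = len(rows)

# (2) WHERE: boolean filter applied to the Cartesian product.
rows = [(c, o) for c, o in rows if c["cid"] == o["cid"] and o["amount"] >= 10]

# (3) GROUP BY: cluster the rows that survived the filter.
groups = defaultdict(list)
for c, o in rows:
    groups[c["region"]].append(o["amount"])

# (4) SELECT: compute the attributes that appear in the answer.
answer = [{"region": region, "total": sum(amounts)} for region, amounts in groups.items()]

# (5) ORDER BY: order the result for display.
answer.sort(key=lambda row: row["total"], reverse=True)

# (6) INTO: hold the answer in a temporary table.
temp_table = list(answer)
print(cartesian_size, temp_table)
```

The Cartesian product has 8 rows while only 3 survive the filter, which is why culling data early (for example through intermediate views) matters for large tables.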
  29. 29. Maintaining the Data Warehouse
      • The key concept is ETL:
          • Extraction: extract relevant data and/or changes from the DB sources
          • Transformation: transform the data to match the warehouse schema
          • Loading: integrate data (and subsequent changes to data) into the warehouse
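The three ETL steps can be sketched as three small functions. Everything specific here is an assumption made for illustration: the two source schemas, the field names, and the choice of a dictionary as the "warehouse".

```python
# Hypothetical source records with inconsistent schemas (invented for illustration).
crm_rows = [{"CustName": " Ada Lovelace ", "Spend": "120.50"}]
web_rows = [{"user": "alan turing", "total_cents": 9900}]

def extract():
    """Extraction: pull the relevant records from each source, tagged by origin."""
    yield from (("crm", row) for row in crm_rows)
    yield from (("web", row) for row in web_rows)

def transform(source, row):
    """Transformation: map each source schema onto the warehouse schema."""
    if source == "crm":
        name, spend = row["CustName"], float(row["Spend"])
    else:
        name, spend = row["user"], row["total_cents"] / 100
    return {"customer": name.strip().title(), "spend_usd": round(spend, 2)}

def load(warehouse, record):
    """Loading: integrate the record, accumulating spend per customer."""
    key = record["customer"]
    warehouse[key] = warehouse.get(key, 0.0) + record["spend_usd"]

warehouse = {}
for source, row in extract():
    load(warehouse, transform(source, row))
print(warehouse)
```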
  30. 30. Data Warehousing “features”
      • Data are integrated into the DW in advance, prior to queries being formulated
          • Caution: Query results could therefore be stale
      • Data are copied from distributed sources
          • Care must be exercised to maintain consistency
          • Query processing is local to the DW:
              • faster
              • can operate even when data sources are unavailable
  31. 31. Selecting views to materialize
      • Factors that affect what to materialize:
          • Storage cost
          • Update cost
          • Which queries will benefit from it
          • How much will those queries benefit from it
      • Examples:
          • GROUP BY A1 is small, but not useful for most queries
          • GROUP BY A1, B2, C3 is useful for most queries, but too large to be of much benefit
  32. 32. Data Warehousing and OLAP (On-Line Analytical Processing)
      • OLAP as Data Mining:
          • Read data from integrated view of data sources
          • Complex queries of DW for Data Analysis
          • Data Analysis for Knowledge Discovery (KDD = Data Mining)
          • Knowledge Discovery for Decision Making
          • Goal: optimize reads and data warehouse queries for data exploration, mining, analysis
  33. 33. OLTP versus OLAP (On-Line Transaction Processing vs. On-Line Analytical Processing)
      • OLTP
          • Mostly updates
          • Short, simple transactions
          • DBA, clerical users
          • Goal: transaction throughput
          • Local sources: heterogeneous DBs
      • OLAP
          • Mostly reads
          • Long, complex queries
          • Analysts, decision makers
          • Goal: fast queries
          • Distributed sources: single integrated view (data warehouse)
  34. 34. OLAP Operations in the Warehouse
      • Slice (select one dimensional view)
      • Dice (select multi-dimensional view; aids in the search for trends and patterns)
      • Roll-up (consolidation; dimension reduction; aggregation; using simple or complex expressions)
      • Drill-down (querying specific items)
      • Visualize (“see” the results; allows for intuitive data understanding)
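A rough sketch of slice, dice, roll-up, and drill-down on a tiny in-memory fact table follows. The fact table, the dimension names, and the reading of drill-down as "re-aggregate at a finer level" are assumptions for illustration; OLAP products implement these operations over much richer cube structures.

```python
from collections import defaultdict

# Hypothetical fact table: (year, quarter, region, sales).
facts = [
    (2004, "Q1", "east", 100), (2004, "Q2", "east", 120),
    (2004, "Q1", "west", 80), (2004, "Q2", "west", 90),
    (2005, "Q1", "east", 130), (2005, "Q1", "west", 70),
]
DIMS = {"year": 0, "quarter": 1, "region": 2}  # column position of each dimension

def slice_(rows, dim, value):
    """Slice: fix a single dimension to one value."""
    return [r for r in rows if r[DIMS[dim]] == value]

def dice(rows, **allowed):
    """Dice: keep a sub-cube by restricting several dimensions at once."""
    return [r for r in rows if all(r[DIMS[d]] in vals for d, vals in allowed.items())]

def roll_up(rows, *keep):
    """Roll-up: sum sales over every dimension not listed in `keep`."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[DIMS[d]] for d in keep)] += r[3]
    return dict(totals)

by_year = roll_up(facts, "year")
# Drill-down: look inside one year at the finer quarter level.
drill_2004 = roll_up(slice_(facts, "year", 2004), "year", "quarter")
east_q1 = dice(facts, region={"east"}, quarter={"Q1"})
print(by_year, drill_2004, east_q1)
```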
  35. 35. The Data Warehouse as the Source for the Mining Process (from Lecture #1)
  36. 36. From “DataMines for DataWarehouses” article (available in Webliography)
      [Two diagrams: “Data Mining within the Data Warehouse” and “Data Mining external to the Data Warehouse”]
  37. 37. Statistics and Data Mining
  38. 38. Data Mining = Statistical Analysis?
      • “Data mining … is the exploration and analysis, by automatic and semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.” (Berry, M. J. A. & Linoff, G. [1997]. Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons, Inc., New York, p. 5, http://www.data-miners.com/books/order.html)
      • “Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns of data for business advantage.” (SAS Institute Inc., http://www.sas.com/technologies/analytics/datamining/index.html)
      • “Data mining simply means finding patterns in your business data which you can use to do your business better” (SPSS Inc., http://www.statistical.com.au/dm.htm)
      • “Data mining is the use of statistical analysis and machine learning techniques, in a semiautomatic fashion, on large collections of data.” (Jorgensen, M. & Gentleman, R. [1998]. Data Mining. Chance 11, 34–42.)
  39. 39. Statistics and Data Mining
      • Data mining got a bad name at first because it was viewed as “statistical dredging” or a “fishing expedition”.
      • Data mining became an acceptable practice because its users exercised statistical rigor in their analyses.
      • Challenges and concerns:
          • Data volumes are huge. Techniques often don’t scale.
          • Contaminated or corrupt data values (6-sigma effect)
          • Selection bias; non-independent observations
          • Fishing expedition = if you look hard enough, you will find something. But is it really useful or not? This is the “Interestingness” Problem:
              • Are the data mining results interesting to anyone?
  40. 40. Quality Management and Data Mining
      • The focus of TQM (Total Quality Management) is total customer satisfaction.
      • This can be realized through CRM (Customer Relationship Management) systems = a data mining technology:
          • Gather data
          • Analyze data
          • Make decisions based upon results
      • Related to this are 6-Sigma quality control processes: customer satisfaction maximized through minimizing defects in products and services delivered.
      • Some references:
          • http://www.sbaer.uca.edu/newsletter/2002/012202.pdf
          • http://www.qualitydigest.com/apr99/html/body_spcguide.html
  41. 41. Information Retrieval
  42. 42. Information Retrieval (IR)
      • IR is a combination of data discovery and data mining in digital libraries or other information repositories.
      • An IR system operates on a collection of documents (e.g., the WWW)
      • IR is sometimes called Text Mining or Web Mining
      • Effectiveness of an IR project is measured by precision and recall
  43. 43. Information Retrieval Metrics
      • Precision = (relevant & retrieved) / (retrieved)
          • “Am I interested in the documents retrieved?”
          • High Precision means most of the retrieved documents are relevant to my query
      • Recall = (relevant & retrieved) / (relevant)
          • “Have all relevant documents been retrieved?”
          • High Recall means that most of the relevant documents have been retrieved.
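The two ratios above translate directly into set operations. The sketch below assumes documents are identified by hypothetical string IDs and that an empty denominator should yield 0.0, which is a convention chosen here rather than part of the definitions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given document ID collections."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents returned, 5 documents truly relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5", "d6", "d7"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).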
  44. 44. IR and Text/Web Mining
      • Semantic markup of Web or other text documents using XML (eXtensible Markup Language)
      • XML enables metadata / keyword harvesting from document collections (e.g., Web screen-scraping)
      • Harvested metadata can be stored in a Data Warehouse for mining -- this is clearly an example of a materialized view of distributed data sources
      • Other metrics: “similarity” to other documents (e.g., common keywords, common keyphrases)
      • Application area: Automated Recommendation System
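One simple way to score "similarity" by common keywords is Jaccard similarity (shared keywords divided by all keywords). The slide does not name a specific measure, so Jaccard is an assumption here, as are the document names and keyword sets.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword sets: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical harvested keywords per document.
documents = {
    "doc_a": {"data", "mining", "warehouse"},
    "doc_b": {"data", "mining", "warehouse", "olap"},
    "doc_c": {"fuzzy", "logic", "control"},
}

def recommend(target, docs):
    """Rank the other documents by similarity to `target`, most similar first."""
    scores = [(jaccard(docs[target], kws), name)
              for name, kws in docs.items() if name != target]
    return [name for score, name in sorted(scores, key=lambda s: (-s[0], s[1]))]

print(recommend("doc_a", documents))
```

Ranking by similarity like this is the core of a very basic recommendation system: documents sharing the most keywords with what the user just read are suggested first.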
  45. 45. Information Retrieval Issues
      • Semantic content of documents
      • Unstructured versus structured content
      • Multi-modal content (image, text, numeric)
      • Reliability of sources
      • Quality of sources
      • Indexing for efficient & effective access
      • Similarity metrics (e.g., how do you do a Groupby or a Roll-up?)
      • Privacy, Copyright, Intellectual Property
  46. 46. IR and Image Mining
      • Image Mining is a form of IR and data mining
      • Techniques:
          • Wavelet analysis and summarization
          • Pixel value (color) histograms and vectorization
          • Scene pattern recognition and indexing
          • Event/anomaly detection and cataloguing (e.g., forest fires seen in satellite photos)
          • Edge detection (unsharp masking) and graphs
      • The data to be mined are the information databases extracted from the images (not the raw image data themselves)
  47. 47. Data Mining as “Rule Induction”
  48. 48. Decision Tree Classification: based on rules at each node of the tree (from Lecture #1)
      • Should I play tennis today?
  49. 49. Intelligent actions (decision support) are often represented by a set of rules… (example of Decision Tree rules)
      • IF age = “<=30” AND student = “no” THEN buys_computer = “no”
      • IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
      • IF age = “31…40” THEN buys_computer = “yes”
      • IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
      • IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
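The five rules above can be written as a small rule table. This sketch assumes age is given as a whole number (the slide uses age bins) and that a record is a dictionary with the slide's attribute names; returning None when no rule fires is a choice made here.

```python
# Each rule is (condition, outcome); the conditions mirror the five slide rules.
RULES = [
    (lambda r: r["age"] <= 30 and r["student"] == "no", "no"),
    (lambda r: r["age"] <= 30 and r["student"] == "yes", "yes"),
    (lambda r: 31 <= r["age"] <= 40, "yes"),
    (lambda r: r["age"] > 40 and r["credit_rating"] == "excellent", "yes"),
    (lambda r: r["age"] > 40 and r["credit_rating"] == "fair", "no"),
]

def buys_computer(record):
    """Return the outcome of the first rule whose condition holds, else None."""
    for condition, outcome in RULES:
        if condition(record):
            return outcome
    return None  # no rule covers this record

young_student = {"age": 25, "student": "yes", "credit_rating": "fair"}
senior_fair = {"age": 50, "student": "no", "credit_rating": "fair"}
print(buys_computer(young_student), buys_computer(senior_fair))
```

Because the five conditions are mutually exclusive, the order of the entries in RULES does not change the result.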
  50. 50. Rule-Based Algorithms (RBA)
      • RBA = Decision Support via “if-then rules”
      • Can generate the rules from a Decision Tree (DT).
      • But, rules do not need to be derived from a DT.
      • Rules have no order, unlike Decision Trees.
      • Trees are built by examining all cases; whereas rules are generated one case at a time.
      • Rule Induction is the method for deriving rules.
      • Case-Based Reasoning (CBR) is a related application of rule-based algorithms.
  51. 51. Sometimes the rules are fuzzy… (example of Fuzzy Rule Induction)
  52. 52. Fuzzy Sets and Logic
  53. 53. Fuzzy Sets and Logic
      • Data mining does not always yield absolute answers, but statistical answers that indicate the probability or frequency of occurrence of patterns or classes, or the likelihood that an object in the database belongs to a given class.
      • In predictive data mining, the result is fuzzy (e.g., predicting loan default through bank account analysis does not guarantee that the customer will indeed default on their loan).
      • Fuzzy Logic is a method for handling uncertainty in data, in decision-making, and in control systems.
  54. 54. Sets and Logic - Classical (Boolean)
  55. 55. Sets and Logic - Fuzzy
  56. 56. Classical versus Fuzzy
  57. 57. Fuzzy Logic, Control Systems, and Data Mining
      • Suppose you have a R/T (real-time) data monitoring (data mining) control system attached to machinery in a large manufacturing plant.
      • Temperature sensor on a machine says that it is running very hot (what is “hot”? -- that’s fuzzy).
      • Motion sensor within machine says that it is running at high RPM, very fast (what is “fast”? -- that’s fuzzy).
      • The machine is not technically over-heating, which you know because of past experience and common sense.
      • Control System responds to data and knowledge-base by invoking a rule to slow down the motor speed a little bit.
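The scenario above can be sketched with two fuzzy membership functions and one rule. All numbers are assumptions for illustration (the temperature and RPM ranges, the 20% maximum cut), and taking the minimum for fuzzy AND is one common convention, not something specified on the slide.

```python
def ramp(x, low, high):
    """Membership that rises linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def hot(temp_c):
    """Degree (0..1) to which a temperature counts as 'hot' (assumed range)."""
    return ramp(temp_c, 60.0, 100.0)

def fast(rpm):
    """Degree (0..1) to which a motor speed counts as 'fast' (assumed range)."""
    return ramp(rpm, 2000.0, 4000.0)

def speed_reduction(temp_c, rpm, max_cut=0.20):
    """Rule: IF hot AND fast THEN slow down, in proportion to the rule strength."""
    strength = min(hot(temp_c), fast(rpm))  # fuzzy AND taken as the minimum
    return max_cut * strength

cut = speed_reduction(80.0, 3500.0)
print(f"reduce motor speed by {cut:.0%}")
```

A machine at 80 degrees is "hot" to degree 0.5 and at 3500 RPM is "fast" to degree 0.75, so the rule fires at strength 0.5 and slows the motor a little rather than shutting it down.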
  58. 58. Application of Fuzzy Logic to Data Mining - 1 (http://www.cs.uah.edu/~thinke/CS687/Fall97/Tech/rahul_dbase_paper.html)
      • Direct Mailing System
      • The problem is to identify customers in a customer database who can be targeted for a sale, under the assumption that these customers responded positively to advertisements mailed to them. The additional constraint is that the mailing list budget is limited, so the number of advertisements mailed must be controlled to increase profit. The first step involves analyzing the database for attributes like "frequency of visits to the store", "sum of purchases", etc. Analysis and plots of the data then determine the cluster of good customers. Next, one has to find the attribute relationships to define a query condition, which is represented by a pair of attributes and a fuzzy linguistic value. One then verifies and refines the query condition by using another customer database. The customer database is then ranked and sorted by degree values based on the given fuzzy query condition. The customers retrieved by the query form the list of potential good customers.
  59. 59. Application of Fuzzy Logic to Data Mining - 2 (http://www.cs.uah.edu/~thinke/CS687/Fall97/Tech/rahul_dbase_paper.html)
      • Vibration Sensor
      • A product that was used to sense vibrations and predict their causes (e.g., earthquakes) was improved by utilizing fuzzy rules. The original sensor was based on a simple threshold rule, and its error rate was around 12%. The fuzzy rules were created by analyzing the actual data in specified cases of earthquakes, automobiles, etc. Feature extraction was done on the data set to identify each kind of cause. Relationships between the feature parameters and the kind of vibration were discovered to develop the fuzzy rules. These rules were then tested and refined. The accuracy of the sensor’s prediction improved dramatically, with the error rate falling to within 1%.
  60. 60. Non-Fuzzy Logic System
  61. 61. Adaptive Fuzzy Logic System
      This example is related to air conditioner settings in a warm room, but the adaptive fuzzy logic system may be applied to activate other “thinking machines”.
  62. 62. Machine Learning – a tool for Data Mining and Intelligent Decision Support
  63. 63. Machine Learning
      • What is Machine Learning? -- “ML is the application of computer algorithms that improve automatically through experience.”
      • Why is ML applicable to Data Mining? --
          • Refer to earlier slide “Reasons for the growing popularity of data mining”:
              • Growing Data Volume -- ML enables the intelligent analysis of overwhelmingly large data/knowledge repositories
              • Limitations of Human Analysis -- ML enables automated searches for complex multifactor dependencies in data
              • Low Cost of Machine Learning -- machines and software are cheaper than people; the ML process is repeatable, consistent, and robust in handling very large data analysis tasks; adaptive ML algorithms can scale with the problem.
  64. 64. Machine Learning and Data Mining
      • ML Techniques for DM (to be covered later):
          • Decision Trees
          • Rule Mining and Rule Learning
          • Case-Based Reasoning (CBR)
          • Neural Nets (NN)
          • Supervised and Unsupervised Learning
          • Support Vector Machines (SVM)
          • Bayesian Networks
          • Genetic Algorithms (GA)
  65. 65. Neural Nets
      • “Neural networks are the second best way of doing just about anything.” (John Denker)
      • The best way “is to apply all available domain knowledge and spend a considerable amount of time, money and effort in building a rule system that will give the right answer. The second best way of doing anything is to learn from experience.” (Burbidge & Buxton)
      [Diagram labels: Neural Network, Data, Fuzzy Rules]
  66. 66. Supervised vs. Unsupervised Learning
      • In Supervised Learning algorithms, a training set is provided (data with correct answers), which is used to mine for known patterns.
      • In Unsupervised Learning algorithms, data are provided with no a priori knowledge of the hidden patterns (knowledge) that they contain. The goal is to discover (learn) these patterns.
      • A class known as Semi-Supervised Learning also exists, where knowledge is known and applied from one data collection in order to mine, analyze, classify, and interpret a related data collection.
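The contrast between the first two cases can be sketched on one-dimensional data. The supervised half uses a nearest-centroid classifier trained on labeled values; the unsupervised half uses a two-cluster k-means that never sees the labels. Both algorithms and the data are stand-ins chosen for brevity, not methods named on the slide.

```python
def mean(values):
    return sum(values) / len(values)

def train_centroids(values, labels):
    """Supervised: learn one centroid per known class label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: mean(members) for label, members in groups.items()}

def classify(x, centroids):
    """Assign x to the class with the nearest learned centroid."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def two_means(values, iterations=10):
    """Unsupervised: discover two clusters without using any labels."""
    centers = [min(values), max(values)]  # deterministic starting centers
    clusters = ([], [])
    for _ in range(iterations):
        clusters = ([], [])
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical measurements; the labels are used only by the supervised half.
values = [1, 2, 3, 9, 10, 11]
labels = ["low", "low", "low", "high", "high", "high"]

centroids = train_centroids(values, labels)
prediction = classify(8, centroids)
centers, clusters = two_means(values)
print(centroids, prediction, centers, clusters)
```

On this data both halves arrive at the same two groups, but only the supervised model can attach the names "low" and "high" to them.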
  67. 67. Machine Learning, Data Mining, and Support Vector Machines (SVM)
      • SVM is the tool of choice for the application of ML to the data mining classification problem.
      • So what are they? “a statistical learning system for predictive data mining -- for estimating regression functions.”
      • Loads of information available here:
          • http://www.cs.rpi.edu/~bij2/svm.html
          • http://www.kernel-machines.org/tutorial.html
  68. 68. SVM Process Overview
      [Flow diagram labels: SVM Training, SVM Classification, Elements In Classification, Elements Out of Classification, Initial Classification, Data, Weights, Data]
  69. 69. SVM Classification
      • SVM attempts to find an optimal separating hyperplane between members of the two initial classifications.
      [Diagram labels: Separating hyperplane, Class “A”, Class “B”]
  70. 70. SVM Class Separation Problem
      • An optimal hyperplane partitions the initial classification correctly and maximizes distance from the plane to elements on either ‘side’: positive and negative examples.
      • When the training examples (initial classification) consist of very diverse expression patterns, then finding an optimal hyperplane can be impossible.
  71. 71. SVM Kernel Construction
      • The expression data can be transformed to a higher dimensional space (feature space) by applying a kernel function. This transformation can have the effect of allowing a separating hyperplane to be found.
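A toy illustration of why moving to a higher-dimensional space helps: the one-dimensional points below cannot be split by any single threshold, but after the explicit map x -> (x, x^2) a flat boundary separates them. No SVM is trained here; the data, the feature map, and the hand-picked hyperplane are all assumptions for illustration, and a real SVM would apply the kernel implicitly and choose the maximum-margin hyperplane itself.

```python
# Hypothetical 1-D data: the positive class sits between the negatives.
points = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [-1, -1, +1, +1, +1, -1, -1]

def separable_1d(xs, ys):
    """True if some threshold (in either orientation) splits the two classes."""
    ordered = sorted(xs)
    thresholds = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    for t in thresholds:
        for sign in (+1, -1):
            if all(sign * (x - t) * y > 0 for x, y in zip(xs, ys)):
                return True
    return False

def phi(x):
    """Explicit feature map into a 2-D feature space."""
    return (x, x * x)

# Hand-picked hyperplane in feature space: w . phi(x) + b = 2.5 - x^2.
w, b = (0.0, -1.0), 2.5

def decision(x):
    """Classify x by which side of the feature-space hyperplane it falls on."""
    features = phi(x)
    score = w[0] * features[0] + w[1] * features[1] + b
    return 1 if score > 0 else -1

print(separable_1d(points, labels), [decision(x) for x in points])
```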
  72. 72. Practical SVM Issues
      • Results depend heavily on the input parameters.
      • Using a high degree kernel function risks artificial separation of the data.
      • An iterative approach to increasing the kernel power is advisable.
  73. 73. SVM Results
      • Two classes are produced:
          • Positive Class: contains elements with expression patterns similar to those in the positive examples in the training set.
          • Negative Class: contains all other members of the input set.
      • Each of these classes has elements that fall in two groups:
          • Those initially in the class (true positives and true negatives)
          • Those recruited into the class (false positives and false negatives)
  74. 74. Machine Learning Resources
      • 1. Massive compilation of ML resources at:
        http://home.earthlink.net/~dwaha/research/machine-learning.html
      • 2. Excellent Reference Book: Tom Mitchell’s “Machine Learning” (1997; McGraw-Hill):
        http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
      • 3. Machine Learning & Data Mining Resources:
        http://www.mlnet.org/
        a site dedicated to “machine learning, knowledge discovery, case-based reasoning, knowledge acquisition, and data mining.”
      [Slide callouts: “My favorite ML site”; “Click on Software”]
  75. 75. Recap of ML and DM <ul><li>DM requires machine assistance in the search and analysis of very large (often distributed, heterogeneous) databases </li></ul><ul><li>Intelligent analysis of complex multi-dimensional multiple-dependency data also demands machine assistance </li></ul><ul><li>Algorithms for DM are most efficient when they are adaptable to the type and content of the data (i.e., the system “learns”) </li></ul><ul><li>Machines are less expensive than humans </li></ul><ul><li>Machines are usually scalable as the problem size grows </li></ul><ul><li>Actionable data (the end-goal of DM) depends in many cases on an embedded ML algorithm to take appropriate action (in control systems; decision-support systems; robotics; autonomous systems) </li></ul><ul><li>ML and DM are historically, technically, and functionally intertwined (e.g, some data mining research groups call themselves Machine Learning Groups) </li></ul>
  76. 76. Steps in the Data Mining Process
  77. 77. Steps in the Data Mining Process http://www.cs.sfu.ca/~han/DM_Book.html <ul><li>Learning the application domain: </li></ul><ul><ul><li>relevant prior knowledge and goals of DM application </li></ul></ul><ul><li>Creating a target data set: Data selection </li></ul><ul><li>Data cleaning and preprocessing : (may take 40-60% of effort!) </li></ul><ul><li>Data reduction and transformation : </li></ul><ul><ul><li>Find useful features, dimensionality/variable reduction, invariant representation. </li></ul></ul><ul><li>Choosing data mining functions </li></ul><ul><ul><li>summarization, classification, regression, association, clustering </li></ul></ul><ul><li>Choosing the mining algorithm(s) </li></ul><ul><li>Data mining & KDD : search for patterns of interest </li></ul><ul><li>Pattern evaluation and knowledge presentation </li></ul><ul><ul><li>visualization, transformation, removing redundant patterns, etc. </li></ul></ul><ul><li>Using the discovered knowledge = Actionable Data! </li></ul>
  78. 78. Steps in the Data Mining Process - Pictorial View
  79. 79. Cleaning the “Dirty Data” <ul><li>Excellent reference: Dorian Pyle’s book “ Data Preparation for Data Mining ” (1999, Morgan Kaufmann; 540pp) </li></ul><ul><li>Frequent problem: missing (NULL) values </li></ul><ul><li>Empty value ≠ Missing value (must treat each case differently) </li></ul><ul><li>Various options for NULLs (may introduce bias): </li></ul><ul><ul><li>use “fill value” (e.g., -999 ) </li></ul></ul><ul><ul><li>use estimated value (prediction from data model) </li></ul></ul><ul><ul><li>use interpolated value (from surrounding entries) </li></ul></ul><ul><ul><li>ignore any records with nulls </li></ul></ul><ul><li>November 2003 Workshop on Data Cleaning: </li></ul><ul><li>http://dimacs.rutgers.edu/Workshops/DataCleaning/ </li></ul>
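The four options for NULLs listed above can be sketched as follows. This is a minimal illustration under assumptions: `None` marks a missing value, the column mean stands in for the "prediction from data model", interpolation is linear, every column is assumed to hold at least one known value, and all function names and sample readings are hypothetical.

```python
MISSING = None  # assumed marker for a NULL entry


def fill_with_constant(values, fill_value=-999):
    """Option 1: replace each NULL with a fixed fill value."""
    return [fill_value if v is MISSING else v for v in values]


def fill_with_mean(values):
    """Option 2: replace each NULL with an estimate (here, simply the mean)."""
    known = [v for v in values if v is not MISSING]
    mean = sum(known) / len(known)
    return [mean if v is MISSING else v for v in values]


def fill_by_interpolation(values):
    """Option 3: linear interpolation between the nearest known neighbours.

    Gaps at either end copy the nearest known value.
    """
    known_idx = [i for i, v in enumerate(values) if v is not MISSING]
    result = list(values)
    for i, v in enumerate(values):
        if v is not MISSING:
            continue
        left = max((k for k in known_idx if k < i), default=None)
        right = min((k for k in known_idx if k > i), default=None)
        if left is None:
            result[i] = values[right]
        elif right is None:
            result[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            result[i] = values[left] + weight * (values[right] - values[left])
    return result


def drop_records_with_nulls(records):
    """Option 4: ignore any record that contains a NULL."""
    return [r for r in records if all(v is not MISSING for v in r)]


readings = [10.0, MISSING, 14.0, 16.0]  # invented sample column
filled_constant = fill_with_constant(readings)
filled_mean = fill_with_mean(readings)
filled_interp = fill_by_interpolation(readings)
kept_records = drop_records_with_nulls([(1, 2), (3, MISSING)])
```

Each option biases the data differently: a fill value such as -999 distorts any statistic computed without masking it, the mean shrinks the variance, interpolation assumes neighbouring entries are related, and dropping records shrinks the sample.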
  80. 80. Data Preprocessing (Laundering the Data) (may take 40-80% of the total data mining project effort!) (Reference: “Data Scrubbing” article in Computerworld 2003)
  81. 81. “Data Scrubbing by the Numbers” ( http://www.computerworld.com/printthis/2003/0,4814,78260,00.html ) <ul><li>Here are some of the findings: </li></ul><ul><ul><ul><ul><li>Data cleansing accounts for up to 70% of the cost and effort of implementing most data warehouse projects, according to analysts. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>In 2001, The Data Warehousing Institute estimated that dirty data costs U.S. businesses $600 billion per year. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Data cleanliness and quality was the No. 2 problem -- right behind budget cuts -- cited in a 2003 IDC survey of 1,648 companies implementing business analytics software enterprise-wide. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Only 23% of 130 companies surveyed by Cutter Consortium on their data warehousing and business-intelligence practices use specialized data cleansing tools. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Of those companies in the Cutter Consortium study using specialized data scrubbing software, 31% are using tools that were built in-house. </li></ul></ul></ul></ul>
  82. 82. Major Issues in Data Mining
  83. 83. Major Issues in Data Mining (1) <ul><li>Mining methodology and user interaction </li></ul><ul><ul><li>Mining different kinds of knowledge in databases </li></ul></ul><ul><ul><li>Interactive mining of knowledge at multiple levels of abstraction </li></ul></ul><ul><ul><li>Incorporation of background knowledge </li></ul></ul><ul><ul><li>Data mining query languages and ad-hoc data mining </li></ul></ul><ul><ul><li>Expression and visualization of data mining results </li></ul></ul><ul><ul><li>Handling of noise and incomplete data </li></ul></ul><ul><ul><li>Pattern evaluation: the interestingness problem </li></ul></ul><ul><li>Performance and scalability </li></ul><ul><ul><li>Handling very large data volumes (the “data flood”) </li></ul></ul><ul><ul><li>Efficiency and scalability of data mining algorithms </li></ul></ul><ul><ul><li>Parallel, distributed, and incremental mining methods </li></ul></ul>
  84. 84. Major Issues in Data Mining (2) <ul><li>Issues relating to the diversity of data types </li></ul><ul><ul><li>Handling relational and complex types of data </li></ul></ul><ul><ul><li>Mining information from heterogeneous databases and global information systems (WWW) </li></ul></ul><ul><li>Issues related to applications and social impacts </li></ul><ul><ul><li>Application of discovered knowledge </li></ul></ul><ul><ul><ul><li>Domain-specific data mining tools </li></ul></ul></ul><ul><ul><ul><li>Intelligent query answering </li></ul></ul></ul><ul><ul><ul><li>Process control and decision making </li></ul></ul></ul><ul><ul><li>Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem </li></ul></ul><ul><ul><li>Protection of data security, integrity, and privacy </li></ul></ul><ul><li>Dirty data (60% of the effort, or more) </li></ul><ul><ul><li>Preparing the data for mining (transformation, cleaning, processing) </li></ul></ul>
  85. 85. Case Study - The Mars Rover http://mars.jpl.nasa.gov/mer/mission/spacecraft_surface_rover.html
  86. 86. Data Mining in Action <ul><li>Data Mining facilitates Intelligent Data Understanding </li></ul><ul><li>Data Mining enables Decision Support and Active Control Systems </li></ul>
  87. 87. What is Intelligent Data Understanding? <ul><li>IDU refers to the application of techniques for transforming data into understanding. … (sound familiar?) </li></ul><ul><li>Web reference : http://is.arc.nasa.gov/IDU/index.html </li></ul><ul><li>IDU specifically refers to automating the following techniques for machine-assisted data analysis: </li></ul><ul><ul><li>Data Mining ( e.g., http://is.arc.nasa.gov/IDU/tasks/NVODDM.html ) </li></ul></ul><ul><ul><li>Knowledge Discovery </li></ul></ul><ul><ul><li>Machine Learning </li></ul></ul>Data → Information → Knowledge → Understanding / Wisdom!
  88. 88. Intelligent Data System Applications (1) <ul><li>Rove around the surface of Mars and take samples of rocks (mass spectroscopy = a data histogram) </li></ul><ul><li>Supervised Learning (search for rocks with known compositions) </li></ul><ul><li>Unsupervised Learning (discover what types of rocks are present, without preconceived biases) </li></ul><ul><li>Association Mining (find unusual associations) </li></ul><ul><li>Clustering (find the set of unique classes of rocks) </li></ul><ul><li>Classification (assign rocks to known classes) </li></ul><ul><li>Deviation/Outlier Detection (one-of-a-kind; interesting?) </li></ul>
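Two of the tasks above, classification and deviation/outlier detection, can be sketched together. Everything here is hypothetical: the three-bin "spectra", the reference values for the two rock classes, the distance threshold, and the function names are invented for illustration and do not describe the actual rover software.

```python
import math


def distance(a, b):
    """Euclidean distance between two histograms of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_or_flag(spectrum, known_classes, outlier_threshold):
    """Assign a spectrum to the nearest known class, or flag it as an outlier.

    A sample farther than outlier_threshold from every reference spectrum
    is treated as one-of-a-kind rather than forced into a known class.
    """
    best_name, best_dist = None, float("inf")
    for name, reference in known_classes.items():
        d = distance(spectrum, reference)
        if d < best_dist:
            best_name, best_dist = name, d
    if best_dist > outlier_threshold:
        return "outlier"
    return best_name


# Invented reference histograms for two known rock classes.
known_classes = {
    "basalt": (0.6, 0.3, 0.1),
    "hematite": (0.1, 0.2, 0.7),
}

# Invented samples: two near a known class, one unlike either.
samples = [(0.55, 0.35, 0.10), (0.15, 0.15, 0.70), (0.33, 0.33, 0.34)]
assigned = [classify_or_flag(s, known_classes, outlier_threshold=0.2)
            for s in samples]
```

The threshold controls the trade-off between the two tasks: a small value flags more samples as potentially interesting, while a large value forces more samples into the known classes.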
  89. 89. Intelligent Data System Applications (2) <ul><li>On-board Intelligent Data Understanding & Decision Support Systems ( Fuzzy Logic & Decision Trees & Case-Based Reasoning ) – Science Goal Monitoring : </li></ul><ul><ul><li>“ stay here and do more ” ; or else “ move on to another rock ” </li></ul></ul><ul><ul><li>“ send results to Earth immediately ” ; or “ send results later ” </li></ul></ul><ul><li>Learn as it goes ( Machine Learning & Neural Nets ) </li></ul><ul><li>Relate the results to other factors, such as dust storms ( XML & Information Retrieval & Information Fusion with other data from orbiting satellite “mother ship”) </li></ul><ul><li>Predict where to go in order to find interesting rocks ( Logistic Regression & Case-Based Reasoning ) </li></ul>
  90. 90. Mars Rover as an Adaptive Fuzzy Logic System <ul><li>Decisions are based on data mined, prior experience, new knowledge, and fuzzy logic </li></ul><ul><li>Rover acts autonomously, without human intervention, in Deep Space environment </li></ul><ul><li>Actions are driven by mining actionable data from all sensors </li></ul>
  91. 91. Summary
  92. 92. Summary of Topics Covered <ul><li>Summary of “What is Data Mining?” Tutorial </li></ul><ul><li>Foundations of Data Mining </li></ul><ul><li>Database Systems </li></ul><ul><li>Data Warehousing and OLAP </li></ul><ul><li>Statistics and Data Mining </li></ul><ul><li>Information Retrieval </li></ul><ul><li>Data Mining as “Rule Induction” </li></ul><ul><li>Fuzzy Sets and Logic </li></ul><ul><li>Machine Learning </li></ul><ul><li>Steps in the Data Mining Process </li></ul><ul><li>Major Issues in Data Mining </li></ul><ul><li>A Case Study: The NASA Mars Rover </li></ul>