Data Mining: Current Status and Research Directions



  1. 1. Data Mining: Current Status and Research Directions <ul><li>Jiawei Han </li></ul><ul><li>Intelligent Database Systems Research Lab </li></ul><ul><li>School of Computing Science </li></ul><ul><li>Simon Fraser University, Canada </li></ul>
  2. 2. Why Is Data Mining Hot? <ul><li>Data mining ( knowledge discovery in databases ) </li></ul><ul><ul><li>Extraction of interesting ( non-trivial, implicit , previously unknown and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories </li></ul></ul><ul><li>Necessity is the mother of invention </li></ul><ul><ul><li>Data is everywhere—data mining should be everywhere, too! </li></ul></ul><ul><ul><li>Understand and use data—an imminent task! </li></ul></ul>
  3. 3. Data, Data, Everywhere!! <ul><li>Relational database—A commodity of every enterprise </li></ul><ul><li>Huge data warehouses are under construction </li></ul><ul><li>POS (Point of Sales): Transactional DBs in terabytes </li></ul><ul><li>Object-relational databases, distributed, heterogeneous, and legacy databases </li></ul><ul><li>Spatial databases (GIS), remote sensing database (EOS), and scientific/engineering databases </li></ul><ul><li>Time-series data (e.g., stock trading) and temporal data </li></ul><ul><li>Text (documents, emails) and multimedia databases </li></ul><ul><li>WWW: A huge, hyper-linked, dynamic, global information system </li></ul>
  4. 4. Data Mining Is Everywhere, too! — A Multi-Dimensional View of Data Mining <ul><li>Databases to be mined </li></ul><ul><ul><li>Relational, transactional, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc. </li></ul></ul><ul><li>Knowledge to be mined </li></ul><ul><ul><li>Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc. </li></ul></ul><ul><li>Techniques utilized </li></ul><ul><ul><li>Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc. </li></ul></ul><ul><li>Applications adapted </li></ul><ul><ul><li>Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc. </li></ul></ul>
  5. 5. Data Mining: Confluence of Multiple Disciplines [Figure: data mining at the center, drawing on database technology, statistics, machine learning (AI), visualization, information science, and other disciplines]
  6. 6. Data Mining: One Can Trace It Back to Early Civilization <ul><li>Most scientific discoveries involve “data mining” </li></ul><ul><ul><li>Kepler’s Laws, Newton’s Laws, periodic table of chemical elements, …, from “big bang” to DNA </li></ul></ul><ul><li>Statistics: A discipline dedicated to data analysis </li></ul><ul><li>Then why data mining? What are the differences? </li></ul><ul><ul><li>Huge amounts of data: gigabytes to terabytes </li></ul></ul><ul><ul><li>Fast computers: quick response, interactive analysis </li></ul></ul><ul><ul><li>Multi-dimensional, powerful, thorough analysis </li></ul></ul><ul><ul><li>High-level, “declarative”: user’s ease and control </li></ul></ul><ul><ul><li>Automated or semi-automated: mining functions hidden or built into many systems </li></ul></ul>
  7. 7. A Brief History of Data Mining Activities <ul><li>1989 IJCAI Workshop on Knowledge Discovery in Databases </li></ul><ul><ul><li>Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991) </li></ul></ul><ul><li>1991-1994 Workshops on Knowledge Discovery in Databases </li></ul><ul><ul><li>Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996) </li></ul></ul><ul><li>1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98) </li></ul><ul><ul><li>Journal of Data Mining and Knowledge Discovery (1997) </li></ul></ul><ul><li>1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations </li></ul><ul><li>More conferences on data mining </li></ul><ul><ul><li>PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc. </li></ul></ul>
  8. 8. Research Progress in the Last Decade <ul><li>Multi-dimensional data analysis: Data warehouse and OLAP (on-line analytical processing) </li></ul><ul><li>Association, correlation, and causality analysis </li></ul><ul><li>Classification: scalability and new approaches </li></ul><ul><li>Clustering and outlier analysis </li></ul><ul><li>Sequential patterns and time-series analysis </li></ul><ul><li>Similarity analysis: curves, trends, images, texts, etc. </li></ul><ul><li>Text mining, Web mining and Weblog analysis </li></ul><ul><li>Spatial, multimedia, scientific data analysis </li></ul><ul><li>Data preprocessing and database compression </li></ul><ul><li>Data visualization and visual data mining </li></ul><ul><li>Many others, e.g., collaborative filtering </li></ul>
  9. 9. Multi-Dimensional Data Analysis <ul><li>Data warehousing: integration from heterogeneous or semi-structured databases </li></ul><ul><li>Multi-dimensional modeling of data: star & snowflake schemas </li></ul><ul><li>Efficient and scalable computation of data cubes or iceberg cubes </li></ul><ul><li>OLAP (on-line analytical processing): drilling, dicing, slicing, etc. </li></ul><ul><li>Discovery-driven exploration of data cubes </li></ul><ul><li>From OLAP to OLAM: A multi-dimensional view for on-line analytical mining </li></ul>
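As a rough illustration of the data cube and iceberg cube bullets above, the following sketch counts records for every subset of dimensions and keeps only the cells that meet a minimum count. The function name `iceberg_cube`, the `SALES` records, and the threshold are hypothetical, and this brute-force enumeration is not one of the efficient cube algorithms the slide alludes to.

```python
from collections import Counter
from itertools import combinations


def iceberg_cube(rows, dims, min_count):
    """Count rows for every subset of dims; keep cells with count >= min_count."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            # One group-by per subset of dimensions (a "cuboid").
            counts = Counter(tuple(row[d] for d in group) for row in rows)
            for key, n in counts.items():
                if n >= min_count:  # the iceberg condition
                    cube[(group, key)] = n
    return cube


# Hypothetical toy sales records, invented for this example.
SALES = [
    {"city": "Vancouver", "item": "milk", "year": 2000},
    {"city": "Vancouver", "item": "milk", "year": 2001},
    {"city": "Vancouver", "item": "bread", "year": 2001},
    {"city": "Toronto", "item": "milk", "year": 2001},
]

cube = iceberg_cube(SALES, ("city", "item", "year"), min_count=2)
for (group, key), n in sorted(cube.items(), key=lambda cell: len(cell[0][0])):
    print(group, key, n)
```

The empty group `()` is the fully aggregated apex cell; slicing, dicing, and drilling then amount to looking up cells in different cuboids.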
  10. 10. Association and Frequent Pattern Analysis <ul><li>Efficient mining of frequent patterns and association rules: </li></ul><ul><ul><li>Apriori and FP-growth algorithms </li></ul></ul><ul><ul><li>Multi-level, multi-dimensional, quantitative association mining </li></ul></ul><ul><li>From association to correlation, sequential patterns, partial periodicity, cyclic rules, ratio rules, etc. </li></ul><ul><li>Query and constraint-based association analysis </li></ul>
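To make the Apriori bullet concrete, here is a minimal level-wise sketch: count candidates, keep the frequent ones, join them into larger candidates, and prune any candidate with an infrequent subset. The basket data and the absolute support threshold are made up for illustration, and the code favors readability over the optimizations a real implementation would use.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical market baskets, invented for this example.
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

frequent = apriori(BASKETS, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

FP-growth reaches the same frequent itemsets without generating candidates, by compressing the transactions into a prefix tree; it is not shown here.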
  11. 11. Classification: Scalable Methods and Handling of Complex Types of Data <ul><li>Classification has been an essential theme in machine learning and statistics research </li></ul><ul><ul><li>Decision trees, Bayesian classification, neural networks, k-nearest neighbors, etc. </li></ul></ul><ul><ul><li>Tree pruning, boosting, and bagging techniques </li></ul></ul><ul><li>Efficient and scalable classification methods </li></ul><ul><ul><li>Exploration of attribute-class pairs </li></ul></ul><ul><ul><li>SLIQ, SPRINT, RainForest, BOAT, etc. </li></ul></ul><ul><li>Classification of semi-structured and non-structured data </li></ul><ul><ul><li>Classification by clustering association rules (ARCS) </li></ul></ul><ul><ul><li>Association-based classification </li></ul></ul><ul><ul><li>Web document classification </li></ul></ul>
  12. 12. Clustering and Outlier Analysis <ul><li>Partitioning methods </li></ul><ul><ul><li>k-means, k-medoids, CLARANS </li></ul></ul><ul><li>Hierarchical methods: micro-clusters </li></ul><ul><ul><li>BIRCH, CURE, Chameleon </li></ul></ul><ul><li>Density-based methods: </li></ul><ul><ul><li>DBSCAN, OPTICS, and DENCLUE </li></ul></ul><ul><li>Grid-based methods </li></ul><ul><ul><li>STING, CLIQUE, WaveCluster </li></ul></ul><ul><li>Outlier analysis: </li></ul><ul><ul><li>statistics-based, distance-based, deviation-based </li></ul></ul><ul><li>Constraint-based clustering </li></ul><ul><ul><li>COD (Clustering with Obstructed Distance) </li></ul></ul><ul><ul><li>User-specified constraints </li></ul></ul>
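Of the partitioning methods listed, k-means is the simplest to sketch: alternate between assigning each point to its nearest center and moving each center to the mean of its points. The points, the choice of the first two points as initial centers, and the fixed iteration count are assumptions made for this example only.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def kmeans(points, centers, iterations=10):
    """Run a fixed number of assignment/update rounds from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical 2-D points forming two visible groups.
POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

centers, clusters = kmeans(POINTS, centers=POINTS[:2])
print(centers)
print(clusters)
```

k-medoids follows the same loop but restricts each center to be an actual data point, which makes it less sensitive to outliers.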
  13. 13. Sequential Patterns and Time-Series Analysis <ul><li>Trend analysis </li></ul><ul><ul><li>Trend movement vs. cyclic variations, seasonal variations and random fluctuations </li></ul></ul><ul><li>Similarity search in time-series database </li></ul><ul><ul><li>Handling gaps, scaling, etc. </li></ul></ul><ul><ul><li>Indexing methods and query languages for time-series </li></ul></ul><ul><li>Sequential pattern mining </li></ul><ul><ul><li>Various kinds of sequences, various methods </li></ul></ul><ul><ul><li>From GSP to PrefixSpan </li></ul></ul><ul><li>Periodicity analysis </li></ul><ul><ul><li>Full periodicity, partial periodicity, cyclic association rules </li></ul></ul>
  14. 14. Similarity Search: Similar Curves, Trends, Images, and Texts <ul><li>Various kinds of data, various similarity mining methods </li></ul><ul><li>Discovery of similar trends in time-series data </li></ul><ul><ul><li>Data transformation & high-dimensional structures </li></ul></ul><ul><li>Finding similar images based on color, texture, etc. </li></ul><ul><ul><li>Content-based vs. keyword-based retrieval </li></ul></ul><ul><ul><li>Color histogram-based signature </li></ul></ul><ul><ul><li>Multi-feature composed signature </li></ul></ul><ul><li>Finding documents with similar texts </li></ul><ul><ul><li>Similar keywords (synonymy & polysemy) </li></ul></ul><ul><ul><li>Term frequency matrix </li></ul></ul><ul><ul><li>Latent semantic indexing </li></ul></ul>
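The term frequency bullet can be illustrated with a small sketch: represent each document as a vector of term counts and compare vectors by cosine similarity. The whitespace tokenizer, the function names, and the three sample documents are assumptions for this example; it does not address synonymy or polysemy, which is what latent semantic indexing targets.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Map each lowercase whitespace-separated term to its count."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, invented for this example.
DOCS = [
    "data mining finds patterns in data",
    "mining patterns from web data",
    "stock prices rose today",
]

vectors = [term_frequencies(d) for d in DOCS]
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(round(sim_01, 3), round(sim_02, 3))
```

Stacking these vectors as rows gives the term frequency matrix; the same signature-and-distance pattern applies to color histograms for images.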
  15. 15. Spatial, Multimedia, Scientific Data Analysis <ul><li>Multi-dimensional analysis of spatial, multimedia and scientific data </li></ul><ul><ul><li>Geo-spatial data cube and spatial OLAP </li></ul></ul><ul><ul><li>The curse of dimensionality problem </li></ul></ul><ul><li>Association analysis </li></ul><ul><ul><li>A progressive refinement methodology </li></ul></ul><ul><ul><li>Micro-clustering can be used for preprocessing in the analysis of complex types of data </li></ul></ul><ul><li>Classification </li></ul><ul><ul><li>Association-based for handling high-dimensionality and sparse data </li></ul></ul>
  16. 16. Data Mining Industry and Applications <ul><li>From research prototypes to data mining products, languages, and standards </li></ul><ul><ul><li>IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, DBMiner, BlueMartini, MineIt, DigiMine, etc. </li></ul></ul><ul><ul><li>A few data mining languages and standards (esp. MS OLEDB for Data Mining). </li></ul></ul><ul><li>Application achievements in many domains </li></ul><ul><ul><li>Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc. </li></ul></ul>
  17. 17. Web Mining: A Fast Expanding Frontier in Data Mining <ul><li>Mine what Web search engines find </li></ul><ul><li>Automatic classification of Web documents </li></ul><ul><li>Discovery of authoritative Web pages, Web structures and Web communities </li></ul><ul><li>Meta-Web Warehousing: Web yellow page service </li></ul><ul><li>Web usage mining </li></ul>
  18. 18. Mine What Web Search Engines Find <ul><li>Current Web search engines: A convenient source for mining </li></ul><ul><ul><li>keyword-based, return too many answers, often of low quality, still miss a lot, not customized, etc. </li></ul></ul><ul><li>Data mining will help: </li></ul><ul><ul><li>coverage: “Enlarge and then shrink,” using synonyms and conceptual hierarchies </li></ul></ul><ul><ul><li>better search primitives: user preferences/hints </li></ul></ul><ul><ul><li>linkage analysis: authoritative pages and clusters </li></ul></ul><ul><ul><li>Web-based languages: XML + WebSQL + WebML </li></ul></ul><ul><ul><li>customization: home page + Weblog + user profiles </li></ul></ul>
  19. 19. Discovery of Authoritative Pages in WWW <ul><li>PageRank method (Brin and Page, 1998): </li></ul><ul><ul><li>Rank the "importance" of Web pages, based on a model of a "random browser." </li></ul></ul><ul><li>Hub/authority method (Kleinberg, 1998): </li></ul><ul><ul><li>Prominent authorities often do not endorse one another directly on the Web. </li></ul></ul><ul><ul><li>Hub pages have a large number of links to many relevant authorities. </li></ul></ul><ul><ul><li>Thus hubs and authorities exhibit a mutually reinforcing relationship: a good hub points to many good authorities, and a good authority is pointed to by many good hubs. </li></ul></ul><ul><li>Both the PageRank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW. </li></ul>
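Both link-analysis methods on this slide can be sketched as simple iterations over a link graph. The code below is an illustrative approximation, not the authors' implementations: the tiny `WEB` graph is hypothetical, the damping factor of 0.85 is a commonly used value assumed here, and every link target is assumed to appear as a key in the graph.

```python
import math


def pagerank(links, damping=0.85, iterations=50):
    """Approximate random-browser visit probabilities by repeated propagation."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}  # random jump to any page
        for p, outs in links.items():
            # A page with no out-links spreads its rank evenly over all pages.
            targets = outs if outs else pages
            share = damping * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    return rank


def hits(links, iterations=50):
    """Iterate the mutually reinforcing hub and authority scores."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to p.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # Hub score: sum of authority scores of pages p links to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: two hub pages pointing at candidate authorities.
WEB = {"h1": ["a", "b"], "h2": ["a", "b", "c"], "a": [], "b": [], "c": []}

rank = pagerank(WEB)
hub, auth = hits(WEB)
print(sorted(rank, key=rank.get, reverse=True))
print(sorted(auth, key=auth.get, reverse=True))
```

In this toy graph the authorities `a` and `b` never link to each other, yet both score highly because the same hubs point at them, which is the pattern the hub/authority bullets describe.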
  20. 20. Automatic Classification of Web Documents <ul><li>Web document classification: </li></ul><ul><ul><li>Good human classification: Yahoo!, CS term hierarchies </li></ul></ul><ul><ul><li>These classifications can be used as training sets to build a learning model </li></ul></ul><ul><li>Keyword-based classification is different from multi-dimensional classification </li></ul><ul><ul><li>Association or clustering-based classification is often more effective </li></ul></ul><ul><ul><li>Multi-level classification is important </li></ul></ul>
  21. 21. Web Usage (Click-Stream) Mining <ul><li>Weblog provides rich information about Web dynamics </li></ul><ul><li>Multidimensional Weblog analysis: </li></ul><ul><ul><li>disclose potential customers, users, markets, etc. </li></ul></ul><ul><li>Plan mining (mining general Web accessing regularities): </li></ul><ul><ul><li>Web linkage adjustment, performance improvements </li></ul></ul><ul><li>Web accessing association/sequential pattern analysis: </li></ul><ul><ul><li>Web caching, prefetching, swapping </li></ul></ul><ul><li>Trend analysis: </li></ul><ul><ul><li>Dynamics of the Web: what has been changing? </li></ul></ul><ul><li>Customized to individual users </li></ul>
  22. 22. Querying and Mining: An Integrated Information Analysis Environment <ul><li>Data mining as a component of DBMS, data warehouse, or Web information system </li></ul><ul><ul><li>Integrated information processing environment </li></ul></ul><ul><ul><ul><li>MS/SQLServer-2000 (Analysis service) </li></ul></ul></ul><ul><ul><ul><li>IBM IntelligentMiner on DB2 </li></ul></ul></ul><ul><ul><ul><li>SAS EnterpriseMiner: data warehousing + mining </li></ul></ul></ul><ul><li>Query-based mining </li></ul><ul><ul><li>Querying database/DW/Web knowledge </li></ul></ul><ul><ul><li>Efficiency and flexibility: preprocessing, on-line processing, optimization, integration, etc. </li></ul></ul>
  23. 23. Basic Mining Operations and Mining Query Optimization <ul><li>Relational databases: There is a set of basic relational operations and a standard query language, SQL </li></ul><ul><ul><li>E.g., selection, projection, join, set difference, intersection, Cartesian product, etc. </li></ul></ul><ul><li>Is there a set of standard data mining operations on which optimizations can be done? </li></ul><ul><ul><li>Difficulty: different definitions of operations </li></ul></ul><ul><ul><li>Importance: optimization can be performed on them systematically; standardization facilitates information exchange and system interoperability </li></ul></ul>
  24. 24. “Vertical” Data Mining <ul><li>Generic data mining tools? Too simple to match domain-specific, sophisticated applications </li></ul><ul><ul><li>Expert knowledge and business logic represent many years of work in their own fields! </li></ul></ul><ul><ul><li>Data mining + business logic + domain experts </li></ul></ul><ul><li>A multi-dimensional view of data miners </li></ul><ul><ul><li>Complexity of data: Web, sequence, spatial, multimedia, … </li></ul></ul><ul><ul><li>Complexity of domains: DNA, astronomy, market, telecom, … </li></ul></ul><ul><li>Domain-specific data mining tools </li></ul><ul><ul><li>Provide concrete, killer solutions to specific problems </li></ul></ul><ul><ul><li>Feedback to build more powerful tools </li></ul></ul>
  25. 25. One Picture May Be Worth 1000 Words! <ul><li>Visual Data Mining </li></ul><ul><ul><li>Visualization of data </li></ul></ul><ul><ul><li>Visualization of data mining results </li></ul></ul><ul><ul><li>Visualization of data mining processes </li></ul></ul><ul><ul><li>Interactive data mining: visual classification </li></ul></ul><ul><li>One melody may be worth 1000 words too! </li></ul><ul><ul><li>Audio data mining: turn data into music and melody! </li></ul></ul><ul><ul><li>Use audio signals to indicate the patterns of data or the features of data mining results </li></ul></ul>
  26. 26. Visualization of data mining results in SAS Enterprise Miner: scatter plots
  27. 27. Visualization of association rules in MineSet 3.0
  28. 28. Visualization of a decision tree in MineSet 3.0
  29. 29. Visualization of Data Mining Processes by Clementine
  30. 30. Interactive Visual Mining by Perception-Based Classification (PBC)
  31. 31. Constraint-Based Mining <ul><li>What kinds of constraints can be used in mining? </li></ul><ul><ul><li>Knowledge type constraint: classification, association, etc. </li></ul></ul><ul><ul><li>Data constraint: SQL-like queries </li></ul></ul><ul><ul><ul><li>Find products sold together in Vancouver in Feb. ’01. </li></ul></ul></ul><ul><ul><li>Dimension/level constraints: </li></ul></ul><ul><ul><ul><li>in relevance to region, price, brand, customer category. </li></ul></ul></ul><ul><ul><li>Rule constraints: </li></ul></ul><ul><ul><ul><li>small sales (price < $10) trigger big sales (sum > $200). </li></ul></ul></ul><ul><ul><li>Interestingness constraints: </li></ul></ul><ul><ul><ul><li>E.g., strong rules (min_support ≥ 3%, min_confidence ≥ 60%, min_lift > 3.0). </li></ul></ul></ul>
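The interestingness constraints on this slide rest on three standard rule measures: support, confidence, and lift. The sketch below computes them for a rule A → C and applies the slide's thresholds, treating the 3% and 60% figures as lower bounds (an assumption). The transactions and the helper names `rule_metrics` and `is_strong` are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    transactions = [set(t) for t in transactions]
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)
    n_c = sum(1 for t in transactions if c <= t)
    n_ac = sum(1 for t in transactions if (a | c) <= t)
    support = n_ac / n                            # P(A and C)
    confidence = n_ac / n_a if n_a else 0.0       # P(C | A)
    lift = confidence / (n_c / n) if n_c else 0.0  # P(C | A) / P(C)
    return support, confidence, lift


def is_strong(metrics, min_support=0.03, min_confidence=0.60, min_lift=3.0):
    """Apply the slide's example thresholds (lower bounds assumed for the first two)."""
    support, confidence, lift = metrics
    return support >= min_support and confidence >= min_confidence and lift > min_lift


# Hypothetical transactions: 2 x {pen, notebook}, 1 x {pen}, 7 x {milk}.
TRANSACTIONS = [{"pen", "notebook"}] * 2 + [{"pen"}] + [{"milk"}] * 7

metrics = rule_metrics(TRANSACTIONS, {"pen"}, {"notebook"})
print(metrics, is_strong(metrics))
```

A lift above 1 means the consequent is more likely when the antecedent is present than it is overall, so the `min_lift > 3.0` constraint filters for rules well beyond chance co-occurrence.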
  32. 32. Conclusions <ul><li>Data mining: A promising research frontier </li></ul><ul><li>Data mining research has made great strides in the last decade </li></ul><ul><li>However, data mining as an industry has not flown as high as expected </li></ul><ul><li>Much more research and application exploration is needed </li></ul><ul><ul><li>Web mining </li></ul></ul><ul><ul><li>Towards integrated data mining environments and tools </li></ul></ul><ul><ul><li>Towards intelligent, efficient, and scalable data mining methods </li></ul></ul>
  33. 33. <ul><li>Thank you !!! </li></ul>
  34. 34. References <ul><li>J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001. </li></ul><ul><li>J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based, Multidimensional Data Mining", Computer (special issue on Data Mining), 32(8): 46-50, 1999. </li></ul>