Data Mining-Current Status and Research Directions

Transcript

  • 1. Data Mining: Current Status and Research Directions
    • Jiawei Han
    • Intelligent Database Systems Research Lab
    • School of Computing Science
    • Simon Fraser University, Canada
    • http://www.cs.sfu.ca/~han
  • 2. Why Is Data Mining Hot?
    • Data mining (knowledge discovery in databases)
      • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories
    • Necessity is the mother of invention
      • Data is everywhere—data mining should be everywhere, too!
      • Understand and use data—an imminent task!
  • 3. Data, Data, Everywhere!!
    • Relational database—A commodity of every enterprise
    • Huge data warehouses are under construction
    • POS (Point of Sales): Transactional DBs in terabytes
    • Object-relational databases, distributed, heterogeneous, and legacy databases
    • Spatial databases (GIS), remote sensing database (EOS), and scientific/engineering databases
    • Time-series data (e.g., stock trading) and temporal data
    • Text (documents, emails) and multimedia databases
    • WWW: A huge, hyper-linked, dynamic, global information system
  • 4. Data Mining Is Everywhere, too! — A Multi-Dimensional View of Data Mining
    • Databases to be mined
      • Relational, transactional, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
    • Knowledge to be mined
      • Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
    • Techniques utilized
      • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
    • Applications adapted
      • Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
  • 5. Data Mining: Confluence of Multiple Disciplines
    • Database Technology
    • Statistics
    • Machine Learning (AI)
    • Visualization
    • Information Science
    • Other Disciplines
  • 6. Data Mining—One Can Trace Back to Early Civilization
    • Most scientific discoveries involve “data mining”
      • Kepler’s Law, Newton’s Laws, periodic table of chemical elements, …, from “big bang” to DNA
    • Statistics: A discipline dedicated to data analysis
    • Then why data mining? What are the differences?
      • Huge amounts of data, from gigabytes to terabytes
      • Fast computers: quick response, interactive analysis
      • Multi-dimensional, powerful, thorough analysis
      • High-level, “declarative”: user’s ease and control
      • Automated or semi-automated: mining functions hidden or built into many systems
  • 7. A Brief History of Data Mining Activities
    • 1989 IJCAI Workshop on Knowledge Discovery in Databases
      • Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
    • 1991-1994 Workshops on Knowledge Discovery in Databases
      • Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
    • 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
      • Journal of Data Mining and Knowledge Discovery (1997)
    • 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
    • More conferences on data mining
      • PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.
  • 8. Research Progress in the Last Decade
    • Multi-dimensional data analysis: Data warehouse and OLAP (on-line analytical processing)
    • Association, correlation, and causality analysis
    • Classification: scalability and new approaches
    • Clustering and outlier analysis
    • Sequential patterns and time-series analysis
    • Similarity analysis: curves, trends, images, texts, etc.
    • Text mining, Web mining and Weblog analysis
    • Spatial, multimedia, scientific data analysis
    • Data preprocessing and database compression
    • Data visualization and visual data mining
    • Many others, e.g., collaborative filtering
  • 9. Multi-Dimensional Data Analysis
    • Data warehousing: integration from heterogeneous or semi-structured databases
    • Multi-dimensional modeling of data: star & snowflake schemas
    • Efficient and scalable computation of data cubes or iceberg cubes
    • OLAP (on-line analytical processing): drilling, dicing, slicing, etc.
    • Discovery-driven exploration of data cubes
    • From OLAP to OLAM: A multi-dimensional view for on-line analytical mining
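The data cube and iceberg cube ideas on this slide can be illustrated with a minimal sketch. It counts every group-by over subsets of the dimensions and keeps only cells whose count reaches a threshold. The function name `iceberg_cube`, the sample `sales` rows, and the threshold are hypothetical. This is a naive full enumeration, not one of the efficient cube algorithms the slide alludes to.

```python
from collections import Counter
from itertools import combinations

def iceberg_cube(rows, dims, min_count):
    """Count every group-by over subsets of dims; keep cells with count >= min_count."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            # One GROUP BY per subset of dimensions (a "cuboid").
            counts = Counter(tuple(row[d] for d in group) for row in rows)
            for cell, n in counts.items():
                if n >= min_count:  # the iceberg condition
                    cube[(group, cell)] = n
    return cube

# Hypothetical sample data, for illustration only.
sales = [
    {"city": "Vancouver", "item": "milk", "month": "Feb"},
    {"city": "Vancouver", "item": "milk", "month": "Mar"},
    {"city": "Vancouver", "item": "bread", "month": "Feb"},
    {"city": "Burnaby", "item": "milk", "month": "Feb"},
]
cube = iceberg_cube(sales, ("city", "item", "month"), min_count=2)
```

Rolling up corresponds to looking up a cell in a cuboid with fewer dimensions, for example `cube[(("city",), ("Vancouver",))]`. Drilling down moves to a cuboid with more dimensions.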
  • 10. Association and Frequent Pattern Analysis
    • Efficient mining of frequent patterns and association rules:
      • Apriori and FP-growth algorithms
      • Multi-level, multi-dimensional, quantitative association mining
    • From association to correlation, sequential patterns, partial periodicity, cyclic rules, ratio rules, etc.
    • Query and constraint-based association analysis
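As a rough illustration of the Apriori approach named on this slide, the sketch below mines frequent itemsets level by level. It joins frequent k-itemsets into (k+1)-candidates and prunes any candidate that has an infrequent subset. The function name `apriori`, the absolute `min_support` count, and the sample baskets are assumptions made for the example. It is a teaching sketch, not an optimized implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical market baskets, for illustration only.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]
frequent = apriori(baskets, min_support=3)
```

With these baskets every single item and every pair is frequent, while the triple appears in only two baskets and is dropped.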
  • 11. Classification: Scalable Methods and Handling of Complex Types of Data
    • Classification has been an essential theme in machine learning and statistics research
      • Decision trees, Bayesian classification, neural networks, k-nearest neighbors, etc.
      • Tree pruning, boosting, and bagging techniques
    • Efficient and scalable classification methods
      • Exploration of attribute-class pairs
      • SLIQ, SPRINT, RainForest, BOAT, etc.
    • Classification of semi-structured and non-structured data
      • Classification by clustering association rules (ARCS)
      • Association-based classification
      • Web document classification
  • 12. Clustering and Outlier Analysis
    • Partitioning methods
      • k-means, k-medoids, CLARANS
    • Hierarchical methods: micro-clusters
      • BIRCH, CURE, Chameleon
    • Density-based methods:
      • DBSCAN, OPTICS, DENCLUE
    • Grid-based methods
      • STING, CLIQUE, WaveCluster
    • Outlier analysis:
      • statistics-based, distance-based, deviation-based
    • Constraint-based clustering
      • COD (Clustering with Obstructed Distance)
      • User-specified constraints
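Of the partitioning methods listed on this slide, k-means is the simplest to sketch. The version below alternates between assigning each point to its nearest center and moving each center to the mean of its members. The function names, the fixed iteration count, the explicit initial centers, and the sample points are all assumptions chosen to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, iterations=10):
    """Run a fixed number of k-means iterations from the given initial centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else old
            for members, old in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical 2-D points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(1, 1), (8, 8)])
```

Practical implementations usually pick initial centers at random and stop when assignments no longer change. Both are left out here for brevity.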
  • 13. Sequential Patterns and Time-Series Analysis
    • Trend analysis
      • Trend movement vs. cyclic variations, seasonal variations and random fluctuations
    • Similarity search in time-series database
      • Handling gaps, scaling, etc.
      • Indexing methods and query languages for time-series
    • Sequential pattern mining
      • Various kinds of sequences, various methods
      • From GSP to PrefixSpan
    • Periodicity analysis
      • Full periodicity, partial periodicity, cyclic association rules
  • 14. Similarity Search: Similar Curves, Trends, Images, and Texts
    • Various kinds of data, various similarity mining methods
    • Discovery of similar trends in time-series data
      • Data transformation & high-dimensional structures
    • Finding similar images based on color, texture, etc.
      • Content-based vs. keyword-based retrieval
      • Color histogram-based signature
      • Multi-feature composed signature
    • Finding documents with similar texts
      • Similar keywords (synonymy & polysemy)
      • Term frequency matrix
      • Latent semantic indexing
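The term-frequency idea for finding documents with similar texts can be sketched as follows. Each document becomes a bag-of-words count vector, and two documents are compared by cosine similarity. Cosine similarity is one common choice and is an assumption here, since the slide does not name a measure. The whitespace tokenizer and the sample documents are also hypothetical, and latent semantic indexing is not covered.

```python
import math
from collections import Counter

def term_frequency(text):
    """Bag-of-words term counts using a naive lowercase whitespace tokenizer."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical documents, for illustration only.
docs = [
    "data mining finds patterns in data",
    "mining patterns in large data",
    "the weather is sunny",
]
vectors = [term_frequency(d) for d in docs]
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
```

Because this compares exact terms only, it cannot handle the synonymy and polysemy problems mentioned on the slide. That limitation is what motivates techniques such as latent semantic indexing.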
  • 15. Spatial, Multimedia, Scientific Data Analysis
    • Multi-dimensional analysis of spatial, multimedia and scientific data
      • Geo-spatial data cube and spatial OLAP
      • The curse of dimensionality problem
    • Association analysis
      • A progressive refinement methodology
      • Micro-clustering can be used for preprocessing in the analysis of complex types of data
    • Classification
      • Association-based for handling high-dimensionality and sparse data
  • 16. Data Mining Industry and Applications
    • From research prototypes to data mining products, languages, and standards
      • IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, DBMiner, BlueMartini, MineIt, DigiMine, etc.
      • A few data mining languages and standards (esp. MS OLEDB for Data Mining).
    • Application achievements in many domains
      • Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.
  • 17. Web Mining: A Fast Expanding Frontier in Data Mining
    • Mine what Web search engines find
    • Automatic classification of Web documents
    • Discovery of authoritative Web pages, Web structures and Web communities
    • Meta-Web Warehousing: Web yellow page service
    • Web usage mining
  • 18. Mine What Web Search Engines Find
    • Current Web search engines: A convenient source for mining
      • Keyword-based; return too many answers, often of low quality; still miss a lot; not customized, etc.
    • Data mining will help:
      • coverage: “Enlarge and then shrink,” using synonyms and conceptual hierarchies
      • better search primitives: user preferences/hints
      • linkage analysis: authoritative pages and clusters
      • Web-based languages: XML + WebSQL + WebML
      • customization: home page + Weblog + user profiles
  • 19. Discovery of Authoritative Pages in WWW
    • Page-rank method (Brin and Page, 1998):
      • Rank the "importance" of Web pages, based on a model of a "random browser."
    • Hub/authority method (Kleinberg, 1998):
      • Prominent authorities often do not endorse one another directly on the Web.
      • Hub pages have a large number of links to many relevant authorities.
      • Thus hubs and authorities exhibit a mutually reinforcing relationship: a good hub points to many good authorities, and a good authority is pointed to by many good hubs.
    • Both the page-rank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW.
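The mutually reinforcing relationship between hubs and authorities can be sketched as an iterative computation in the spirit of the hub/authority method. A page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to, with normalization after each step. The function names, the fixed iteration count, and the tiny link graph are assumptions for illustration.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}

def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to p.
        auth = normalize({
            p: sum(hub[q] for q in links if p in links[q]) for p in pages
        })
        # Hub: sum of authority scores of the pages that p links to.
        hub = normalize({
            p: sum(auth[q] for q in links.get(p, ())) for p in pages
        })
    return hub, auth

# Hypothetical link graph: three hub-like pages pointing at three targets.
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2", "a3"],
    "h3": ["a1"],
}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this graph the page linked from all three hubs gets the highest authority score, and the page linking to all three targets gets the highest hub score.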
  • 20. Automatic Classification of Web Documents
    • Web document classification:
      • Good human classification: Yahoo!, CS term hierarchies
      • These classifications can be used as training sets to build learning models
    • Key-word based classification is different from multi-dimensional classification
      • Association or clustering-based classification is often more effective
      • Multi-level classification is important
  • 21. Web Usage (Click-Stream) Mining
    • Weblog provides rich information about Web dynamics
    • Multidimensional Weblog analysis:
      • disclose potential customers, users, markets, etc.
    • Plan mining (mining general Web accessing regularities):
      • Web linkage adjustment, performance improvements
    • Web accessing association/sequential pattern analysis:
      • Web caching, prefetching, swapping
    • Trend analysis:
      • Dynamics of the Web: what has been changing?
    • Customized to individual users
  • 22. Querying and Mining: An Integrated Information Analysis Environment
    • Data mining as a component of DBMS, data warehouse, or Web information system
      • Integrated information processing environment
        • MS/SQLServer-2000 (Analysis service)
        • IBM IntelligentMiner on DB2
        • SAS EnterpriseMiner: data warehousing + mining
    • Query-based mining
      • Querying database/DW/Web knowledge
      • Efficiency and flexibility: preprocessing, on-line processing, optimization, integration, etc.
  • 23. Basic Mining Operations and Mining Query Optimization
    • Relational databases: There is a set of basic relational operations and a standard query language, SQL
      • E.g., selection, projection, join, set difference, intersection, Cartesian product, etc.
    • Is there a set of standard data mining operations on which optimizations can be done?
      • Difficulty: different definitions of operations
      • Importance: optimization can be performed on them systematically; standardization facilitates information exchange and system interoperability
  • 24. “Vertical” Data Mining
    • Generic data mining tools? —Too simple to match domain-specific, sophisticated applications
      • Expert knowledge and business logic represent many years of work in their own fields!
      • Data mining + business logic + domain experts
    • A multi-dimensional view of data miners
      • Complexity of data: Web, sequence, spatial, multimedia, …
      • Complexity of domains: DNA, astronomy, market, telecom, …
    • Domain-specific data mining tools
      • Provide concrete, killer solutions to specific problems
      • Feedback to build more powerful tools
  • 25. One Picture May Be Worth 1000 Words!
    • Visual Data Mining
      • Visualization of data
      • Visualization of data mining results
      • Visualization of data mining processes
      • Interactive data mining: visual classification
    • One melody may be worth 1000 words too!
      • Audio data mining: turn data into music and melody!
      • Uses audio signals to indicate the patterns of data or the features of data mining results
  • 26. Visualization of data mining results in SAS Enterprise Miner: scatter plots
  • 27. Visualization of association rules in MineSet 3.0
  • 28. Visualization of a decision tree in MineSet 3.0
  • 29. Visualization of Data Mining Processes by Clementine
  • 30. Interactive Visual Mining by Perception-Based Classification (PBC)
  • 31. Constraint-Based Mining
    • What kinds of constraints can be used in mining?
      • Knowledge type constraint: classification, association, etc.
      • Data constraint: SQL-like queries
        • Find products sold together in Vancouver in Feb. ’01.
      • Dimension/level constraints:
        • in relevance to region, price, brand, customer category.
      • Rule constraints:
        • small sales (price < $10) trigger big sales (sum > $200).
      • Interestingness constraints:
        • E.g., strong rules (min_support ≥ 3%, min_confidence ≥ 60%, min_lift > 3.0).
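The interestingness constraints above rest on three standard measures for a rule A ⇒ C: support, confidence, and lift. The sketch below computes them from a list of transactions and checks a rule against thresholds. The function names and sample baskets are hypothetical. The default thresholds mirror the slide's example, and treating support and confidence as "at least the threshold" is an assumption.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_c = sum(1 for t in transactions if c <= set(t))
    n_ac = sum(1 for t in transactions if (a | c) <= set(t))
    support = n_ac / n                        # P(A and C)
    confidence = n_ac / n_a if n_a else 0.0   # P(C | A)
    lift = confidence / (n_c / n) if n_c else 0.0  # P(C | A) / P(C)
    return support, confidence, lift

def is_strong(measures, min_support=0.03, min_confidence=0.60, min_lift=3.0):
    """Check a rule against thresholds (>= for support/confidence is assumed)."""
    support, confidence, lift = measures
    return (support >= min_support
            and confidence >= min_confidence
            and lift > min_lift)

# Hypothetical baskets, for illustration only.
baskets = [
    {"pen", "printer"}, {"pen", "printer"}, {"pen"}, {"printer"},
    {"paper"}, {"paper"}, {"paper"}, {"paper"},
]
measures = rule_measures(baskets, {"pen"}, {"printer"})
```

A lift above 1 means the antecedent and consequent occur together more often than they would if independent. Requiring a lift above 3.0, as in the slide's example, is a much stricter filter.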
  • 32. Conclusions
    • Data mining—A promising research frontier
    • Data mining research has made great strides in the last decade
    • However, data mining, as an industry, has not been flying as high as expected
    • Much research and application exploration is still needed
      • Web mining
      • Towards integrated data mining environments and tools
      • Towards intelligent, efficient, and scalable data mining methods
  • 33. http://www.cs.sfu.ca/~han http://db.cs.sfu.ca
    • Thank you !!!
  • 34. References
    • J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
    • J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based, Multidimensional Data Mining", COMPUTER (special issue on Data Mining), 32(8): 46-50, 1999.
