Data Mining-Current Status and Research Directions
 


    • Data Mining: Current Status and Research Directions
      • Jiawei Han
      • Intelligent Database Systems Research Lab
      • School of Computing Science
      • Simon Fraser University, Canada
      • http://www.cs.sfu.ca/~han
    • Outline
      • Why is data mining hot?
      • Current status: Major technical progress
      • Is data mining flying high, or not?
      • How to fly data mining high?—Research directions on data mining
    • Why Is Data Mining Hot?
      • Data mining (knowledge discovery in databases)
        • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories
      • Necessity is the mother of invention
        • Data is everywhere—data mining should be everywhere, too!
        • Understand and use data—an imminent task!
    • Data, Data, Everywhere!!
      • Relational database—A commodity of every enterprise
      • Huge data warehouses are under construction
      • POS (Point of Sales): Transactional DBs in terabytes
      • Object-relational databases, distributed, heterogeneous, and legacy databases
      • Spatial databases (GIS), remote sensing database (EOS), and scientific/engineering databases
      • Time-series data (e.g., stock trading) and temporal data
      • Text (documents, emails) and multimedia databases
      • WWW: A huge, hyper-linked, dynamic, global information system
    • Data Mining Is Everywhere, too! — A Multi-Dimensional View of Data Mining
      • Databases to be mined
        • Relational, transactional, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
      • Knowledge to be mined
        • Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
      • Techniques utilized
        • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
      • Applications adapted
        • Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
    • Data Mining: Confluence of Multiple Disciplines
      • Database technology
      • Statistics
      • Machine learning (AI)
      • Visualization
      • Information science
      • Other disciplines
    • Data Mining—One Can Trace Back to Early Civilization
      • Most scientific discoveries involve “data mining”
        • Kepler’s laws, Newton’s laws, periodic table of chemical elements, …, from “big bang” to DNA
      • Statistics: A discipline dedicated to data analysis
      • Then why data mining? What are the differences?
        • Huge amounts of data: gigabytes to terabytes
        • Fast computer—quick response, interactive analysis
        • Multi-dimensional, powerful, thorough analysis
        • High-level, “declarative”—user’s ease and control
        • Automated or semi-automated: mining functions are hidden or built into many systems
    • A Brief History of Data Mining Activities
      • 1989 IJCAI Workshop on Knowledge Discovery in Databases
        • Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
      • 1991-1994 Workshops on Knowledge Discovery in Databases
        • Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
      • 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
        • Journal of Data Mining and Knowledge Discovery (1997)
      • 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
      • More conferences on data mining
        • PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.
    • Research Progress in the Last Decade
      • Multi-dimensional data analysis: Data warehouse and OLAP (on-line analytical processing)
      • Association, correlation, and causality analysis
      • Classification: scalability and new approaches
      • Clustering and outlier analysis
      • Sequential patterns and time-series analysis
      • Similarity analysis: curves, trends, images, texts, etc.
      • Text mining, Web mining and Weblog analysis
      • Spatial, multimedia, scientific data analysis
      • Data preprocessing and database compression
      • Data visualization and visual data mining
      • Many others, e.g., collaborative filtering
    • Multi-Dimensional Data Analysis
      • Data warehousing: integration from heterogeneous or semi-structured databases
      • Multi-dimensional modeling of data: star & snowflake schemas
      • Efficient and scalable computation of data cubes or iceberg cubes
      • OLAP (on-line analytical processing): drilling, dicing, slicing, etc.
      • Discovery-driven exploration of data cubes
      • From OLAP to OLAM: A multi-dimensional view for on-line analytical mining
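A minimal sketch of what a data cube and an iceberg cube contain. The fact table, the dimension names, and the `min_total` threshold are illustrative assumptions, not taken from the slides, and this brute-force version recomputes every group-by from scratch, so it shows the result rather than an efficient or scalable algorithm.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, product, quarter, sales)
ROWS = [
    ("West", "TV", "Q1", 100),
    ("West", "TV", "Q2", 150),
    ("West", "PC", "Q1", 200),
    ("East", "TV", "Q1", 50),
    ("East", "PC", "Q2", 300),
]
DIMS = ("region", "product", "quarter")


def compute_cube(rows, dims, min_total=0):
    """Return {group-by dimensions: {cell key: total sales}} for every subset of dims.

    Cells whose total is below min_total are dropped (the iceberg condition).
    """
    cube = {}
    for k in range(len(dims) + 1):
        for idx in combinations(range(len(dims)), k):
            cells = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in idx)
                cells[key] += row[-1]  # the last column is the measure
            name = tuple(dims[i] for i in idx)
            cube[name] = {c: v for c, v in cells.items() if v >= min_total}
    return cube


cube = compute_cube(ROWS, DIMS)
iceberg = compute_cube(ROWS, DIMS, min_total=300)

# Rolling up from (region, product) to (region,) means reading a coarser cuboid.
west_tv = cube[("region", "product")][("West", "TV")]
west_all = cube[("region",)][("West",)]
```

Three dimensions give 2^3 = 8 cuboids, from the single grand-total cell up to the full (region, product, quarter) table; the iceberg version keeps only the cells that pass the threshold.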
    • Association and Frequent Pattern Analysis
      • Efficient mining of frequent patterns and association rules:
        • Apriori and FP-growth algorithms
        • Multi-level, multi-dimensional, quantitative association mining
      • From association to correlation, sequential patterns, partial periodicity, cyclic rules, ratio rules, etc.
      • Query and constraint-based association analysis
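A small sketch of the Apriori idea named above: frequent itemsets are grown level by level, and a candidate is counted only if all of its smaller subsets are already known to be frequent. The basket data and the absolute `min_support` count are made up for illustration, and the join step is written for clarity rather than speed.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical market baskets.
BASKETS = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
frequent_itemsets = apriori(BASKETS, min_support=3)
```

With a minimum support of 3 baskets, four single items and four pairs come out as frequent; {bread, milk, diaper} is generated as a candidate but occurs in only two baskets, so it is dropped.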
    • Classification: Scalable Methods and Handling of Complex Types of Data
      • Classification has been an essential theme in machine learning and statistics research
        • Decision trees, Bayesian classification, neural networks, k-nearest neighbors, etc.
        • Tree pruning, boosting, and bagging techniques
      • Efficient and scalable classification methods
        • Exploration of attribute-class pairs
        • SLIQ, SPRINT, RainForest, BOAT, etc.
      • Classification of semi-structured and non-structured data
        • Classification by clustering association rules (ARCS)
        • Association-based classification
        • Web document classification
    • Clustering and Outlier Analysis
      • Partitioning methods
        • k-means, k-medoids, CLARANS
      • Hierarchical methods: micro-clusters
        • BIRCH, CURE, Chameleon
      • Density-based methods:
        • DBSCAN, OPTICS, DENCLUE
      • Grid-based methods
        • STING, CLIQUE, WaveCluster
      • Outlier analysis:
        • statistics-based, distance-based, deviation-based
      • Constraint-based clustering
        • COD (Clustering with Obstructed Distance)
        • User-specified constraints
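As one concrete instance of the partitioning methods listed above, here is a sketch of k-means (Lloyd's iteration) on 2-D points. The points and the starting centroids are invented, and the caller supplies the initial centroids so that the run is deterministic; real implementations usually choose them randomly or with a seeding heuristic.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm on 2-D points with caller-supplied initial centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups.
POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = kmeans(POINTS, [(0, 0), (10, 10)])
```

k-medoids differs mainly in the update step, where the cluster representative must be one of the actual data points rather than the mean.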
    • Sequential Patterns and Time-Series Analysis
      • Trend analysis
        • Trend movement vs. cyclic variations, seasonal variations and random fluctuations
      • Similarity search in time-series database
        • Handling gaps, scaling, etc.
        • Indexing methods and query languages for time-series
      • Sequential pattern mining
        • Various kinds of sequences, various methods
        • From GSP to PrefixSpan
      • Periodicity analysis
        • Full periodicity, partial periodicity, cyclic association rules
    • Similarity Search: Similar Curves, Trends, Images, and Texts
      • Various kinds of data, various similarity mining methods
      • Discovery of similar trends in time-series data
        • Data transformation & high-dimensional structures
      • Finding similar images based on color, texture, etc.
        • Content-based vs. keyword-based retrieval
        • Color histogram-based signature
        • Multi-feature composed signature
      • Finding documents with similar texts
        • Similar keywords (synonymy & polysemy)
        • Term frequency matrix
        • Latent semantic indexing
    • Spatial, Multimedia, Scientific Data Analysis
      • Multi-dimensional analysis of spatial, multimedia and scientific data
        • Geo-spatial data cube and spatial OLAP
        • The curse of dimensionality problem
      • Association analysis
        • A progressive refinement methodology
        • Micro-clustering can be used for preprocessing in the analysis of complex types of data
      • Classification
        • Association-based for handling high-dimensionality and sparse data
    • Data Mining Industry and Applications
      • From research prototypes to data mining products, languages, and standards
        • IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, DBMiner, BlueMartini, MineIt, DigiMine, etc.
        • A few data mining languages and standards (esp. MS OLEDB for Data Mining).
      • Application achievements in many domains
        • Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.
    • Is Data Mining Flying? Or Not??
      • Data mining is flying
        • R & D have been striding forward greatly
        • Applications have been broadened substantially
      • But not as high as some may have hoped. Why not?
        • Hope to see billions of $’s within years?
          • A young, up-and-coming technology, not hype!
        • Not bread-and-butter but value-added service
          • DBMS, WWW, and other information systems will still be a “data mining” aircraft-carrier
        • Not off-the-shelf in nature
          • Needs training, understanding, and customizing (redevelopment)
        • Young technology—need much R&D to fly high
          • Much research, development, and real problem solving!
    • How to Fly Data Mining High?—Research Directions
      • Web mining
      • Towards integrated data mining environments and tools
        • “Vertical” (or application-specific) data mining
        • Invisible data mining
      • Towards intelligent, efficient, and scalable data mining methods
    • Web Mining: A Fast Expanding Frontier in Data Mining
      • Mine what Web search engines find
      • Automatic classification of Web documents
      • Discovery of authoritative Web pages, Web structures and Web communities
      • Meta-Web Warehousing: Web yellow page service
      • Web usage mining
    • Mine What Web Search Engines Find
      • Current Web search engines: A convenient source for mining
        • Keyword-based; return too many answers, often of low quality; still miss a lot; not customized; etc.
      • Data mining will help:
        • coverage: “Enlarge and then shrink,” using synonyms and conceptual hierarchies
        • better search primitives: user preferences/hints
        • linkage analysis: authoritative pages and clusters
        • Web-based languages: XML + WebSQL + WebML
        • customization: home page + Weblog + user profiles
    • Discovery of Authoritative Pages in WWW
      • Page-rank method (Brin and Page, 1998):
        • Rank the "importance" of Web pages, based on a model of a "random browser."
      • Hub/authority method (Kleinberg, 1998):
        • Prominent authorities often do not endorse one another directly on the Web.
        • Hub pages have a large number of links to many relevant authorities.
        • Thus hubs and authorities exhibit a mutually reinforcing relationship: a good hub links to many good authorities, and a good authority is linked to by many good hubs.
      • Both the page-rank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW.
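The mutually reinforcing relationship between hubs and authorities can be sketched as a simple iteration: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The link graph below is hypothetical, and the fixed iteration count and Euclidean normalization are assumptions made for this sketch.

```python
import math


def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (hub, authority) score dicts."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: three hub pages point at two authorities,
# and the authorities do not link to each other.
LINKS = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub_scores, auth_scores = hits(LINKS)
```

Here `a1` ends up with the highest authority score because all three hubs link to it, and `h1` and `h2` outrank `h3` as hubs because they point to both authorities.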
    • Automatic Classification of Web Documents
      • Web document classification:
        • Good human classification: Yahoo!, CS term hierarchies
        • These classifications can be used as training sets to build a learning model
      • Keyword-based classification is different from multi-dimensional classification
        • Association or clustering-based classification is often more effective
        • Multi-level classification is important
    • A Multiple Layered Meta-Web Architecture
      • Layer 0
      • Layer 1: Generalized Descriptions
      • ...
      • Layer n: More Generalized Descriptions
    • Web Yellow Page Service: A Multi-Layer, Meta-Web Approach
      • XML: facilitates structured and meta-information extraction
      • Automatic classification of Web documents:
        • based on Yahoo!, etc. as training set + keyword-based correlation/classification analysis (IR/AI assistance)
      • Automatic ranking of important Web pages
        • authoritative site recognition and clustering Web pages
      • Generalization-based multi-layer meta-Web construction
        • With the assistance of clustering and classification analysis
      • Meta-Web can be warehoused and incrementally updated
      • Querying and mining can be performed on or assisted by meta-Web
    • Importance of Constructing Multi-Layer Meta Web
      • Benefits of Multi-Layer Meta-Web:
        • Multi-dimensional Web info summary analysis
        • Approximate and intelligent query answering
        • Web high-level query answering (WebSQL, WebML)
        • Web content and structure mining
        • Observing the dynamics/evolution of the Web
      • Is it realistic to construct such a meta-Web?
        • It benefits even if it is partially constructed
        • The benefit may justify the cost of tool development, standardization, and partial restructuring
    • Web Usage (Click-Stream) Mining
      • Weblog provides rich information about Web dynamics
      • Multidimensional Weblog analysis:
        • disclose potential customers, users, markets, etc.
      • Plan mining (mining general Web accessing regularities):
        • Web linkage adjustment, performance improvements
      • Web accessing association/sequential pattern analysis:
        • Web caching, prefetching, swapping
      • Trend analysis:
        • Dynamics of the Web: what has been changing?
      • Customized to individual users
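One simple way to turn Weblog access patterns into a prefetching hint is to count which page most often follows each page within a session. The sketch below assumes the log has already been split into per-user sessions; the session data, the function names, and the "most frequent successor" rule are illustrative choices, not a method from the slides.

```python
from collections import Counter, defaultdict


def next_page_model(sessions):
    """Count, for each page, how often each other page is requested right after it."""
    follows = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            follows[current][following] += 1
    return follows


def prefetch_candidate(model, page):
    """Return the page most often requested right after `page`, or None if unseen."""
    if page not in model:
        return None
    return model[page].most_common(1)[0][0]


# Hypothetical click-stream sessions, already extracted from a Weblog.
SESSIONS = [
    ["/", "/products", "/cart"],
    ["/", "/products", "/reviews"],
    ["/", "/about"],
    ["/products", "/cart"],
]
model = next_page_model(SESSIONS)
after_home = prefetch_candidate(model, "/")
after_products = prefetch_candidate(model, "/products")
after_cart = prefetch_candidate(model, "/cart")
```

A fuller treatment would mine longer sequential patterns and apply a minimum support threshold before acting on any of them.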
    • Towards Integrated Data Mining Environments and Tools
      • OLAP Mining: Integration of Data Warehousing and Data Mining
      • Querying and Mining: An Integrated Information Analysis Environment
      • Basic Mining Operations and Mining Query Optimization
      • “Vertical” (or application-specific) data mining
      • Invisible data mining
    • OLAP Mining: An Integration of Data Mining and Data Warehousing
      • Data mining systems, DBMS, Data warehouse systems coupling
        • No coupling, loose-coupling, semi-tight-coupling, tight-coupling
      • On-line analytical mining of data
        • integration of mining and OLAP technologies
      • Interactive mining of multi-level knowledge
        • Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
      • Integration of multiple mining functions
        • Characterized classification, first clustering and then association
    • An OLAM Architecture
      • Layer 4, User Interface: User GUI API (mining query in, mining result out)
      • Layer 3, OLAP/OLAM: OLAM Engine, OLAP Engine, Data Cube API
      • Layer 2, MDDB: MDDB, Meta Data, Filtering & Integration, Database API, Filtering
      • Layer 1, Data Repository: Databases, Data Warehouse, Data cleaning, Data integration
    • Querying and Mining: An Integrated Information Analysis Environment
      • Data mining as a component of DBMS, data warehouse, or Web information system
        • Integrated information processing environment
          • MS/SQLServer-2000 (Analysis service)
          • IBM IntelligentMiner on DB2
          • SAS EnterpriseMiner: data warehousing + mining
      • Query-based mining
        • Querying database/DW/Web knowledge
        • Efficiency and flexibility: preprocessing, on-line processing, optimization, integration, etc.
    • Basic Mining Operations and Mining Query Optimization
      • Relational databases: There is a set of basic relational operations and a standard query language, SQL
        • E.g., selection, projection, join, set difference, intersection, Cartesian product, etc.
      • Is there a set of standard data mining operations on which optimizations can be done?
        • Difficulty: different definitions on operations
        • Importance: optimization can be performed on them systematically, standardization to facilitate information exchange and system interoperability
    • “Vertical” Data Mining
      • Generic data mining tools? —Too simple to match domain-specific, sophisticated applications
        • Expert knowledge and business logic represent many years of work in their own fields!
        • Data mining + business logic + domain experts
      • A multi-dimensional view of data miners
        • Complexity of data: Web, sequence, spatial, multimedia, …
        • Complexity of domains: DNA, astronomy, market, telecom, …
      • Domain-specific data mining tools
        • Provide concrete, killer solutions to specific problems
        • Feedback to build more powerful tools
    • Invisible Data Mining
      • Build mining functions into daily information services
        • Web search engine (link analysis, authoritative pages, user profiles)—adaptive web sites, etc.
        • Improvement of query processing: history + data
        • Making service smart and efficient
      • Benefits from/to data mining research
        • Data mining research has produced many scalable, efficient, novel mining solutions
        • Applications feed new challenge problems to research
    • Towards Intelligent Tools for Data Mining
      • Integration paves the way to intelligent mining
      • Smart interface brings intelligence
        • Easy to use, understand and manipulate
      • One picture may be worth 1,000 words
        • Visual and audio data mining
      • Human-Centered Data Mining
      • Towards self-tuning, self-managing, self-triggering data mining
    • Integrated Mining: A Booster for Intelligent Mining
      • Integration paves the way to intelligent mining
        • Data mining integrates with DBMS, DW, WebDB, etc
        • Integration inherits the power of up-to-date information technology: querying, MD analysis, similarity search, etc.
        • Mining can be viewed as querying database knowledge
      • Integration leads to standard interface/language, function/process standardization, utility, and reachability
      • Efficiency and scalability bring intelligent mining to reality
    • One Picture May Be Worth 1,000 Words!
      • Visual Data Mining
        • Visualization of data
        • Visualization of data mining results
        • Visualization of data mining processes
        • Interactive data mining: visual classification
      • One melody may be worth 1,000 words too!
        • Audio data mining: turn data into music and melody!
        • Uses audio signals to indicate the patterns of data or the features of data mining results
    • Visualization of data mining results in SAS Enterprise Miner: scatter plots
    • Visualization of association rules in MineSet 3.0
    • Visualization of a decision tree in MineSet 3.0
    • Visualization of Data Mining Processes by Clementine
    • Interactive Visual Mining by Perception-Based Classification (PBC)
    • Human-Centered Data Mining
      • Finding all the patterns autonomously in a database? — unrealistic because the patterns could be too many but uninteresting
      • Data mining should be an interactive process
        • The user directs what is to be mined
      • Users must be provided with a set of primitives to be used to communicate with the data mining system — using a data mining query language
      • Users should provide constraints on what is to be mined
      • System should use such constraints to guide the mining process (constraint-based mining or mining query optimization)
    • Constraint-Based Mining
      • What kinds of constraints can be used in mining?
        • Knowledge type constraint: classification, association, etc.
        • Data constraint: SQL-like queries
          • Find products sold together in Vancouver in Feb. ’01.
        • Dimension/level constraints:
          • In relevance to region, price, brand, customer category.
        • Rule constraints:
          • Small sales (price < $10) trigger big sales (sum > $200).
        • Interestingness constraints:
          • E.g., strong rules (min_support ≥ 3%, min_confidence ≥ 60%, min_lift > 3.0).
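The interestingness measures in the last bullet can be computed directly from transaction counts. The sketch below uses the usual definitions (support = P(A and C), confidence = P(C | A), lift = confidence / P(C)) and assumes the support and confidence thresholds are lower bounds; the transactions and function names are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    n_c = sum(1 for t in transactions if c <= t)
    n_ac = sum(1 for t in transactions if (a | c) <= t)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


def is_strong(metrics, min_support=0.03, min_confidence=0.60, min_lift=3.0):
    """Apply the thresholds from the slide (assumed to be lower bounds)."""
    support, confidence, lift = metrics
    return support >= min_support and confidence >= min_confidence and lift > min_lift


# Hypothetical transactions: pen and ink always sell together but are rare overall.
TRANSACTIONS = [{"pen", "ink"}] * 2 + [{"milk"}] * 8
pen_ink = rule_metrics(TRANSACTIONS, {"pen"}, {"ink"})
milk_pen = rule_metrics(TRANSACTIONS, {"milk"}, {"pen"})
```

For `pen -> ink` the support is 0.2, the confidence is 1.0, and the lift is about 5, so the rule passes all three thresholds; `milk -> pen` never occurs and is rejected on support alone.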
    • Rule Constraints: A Classification
      • Succinctness
      • Anti-monotonicity
      • Monotonicity
      • Convertible constraints
      • Inconvertible constraints
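To illustrate one of these classes: a constraint is anti-monotone if, once an itemset violates it, every superset violates it as well, so the search never needs to extend a violating itemset. With non-negative prices, "total price <= budget" is such a constraint. The price list and function names below are hypothetical, and the sketch ignores support counting to keep the pruning visible.

```python
# Hypothetical item prices (all non-negative, which is what makes
# "sum of prices <= budget" anti-monotone).
PRICES = {"pen": 2, "ink": 5, "paper": 4, "laptop": 900}


def total_price(itemset):
    return sum(PRICES[item] for item in itemset)


def itemsets_within_budget(items, budget):
    """Enumerate, level by level, all itemsets whose total price is <= budget.

    Only itemsets that satisfy the constraint are extended; anything built
    on a violating itemset is never generated.
    """
    level = [frozenset([i]) for i in items if total_price([i]) <= budget]
    result = list(level)
    while level:
        next_level = set()
        for itemset in level:
            for item in items:
                if item not in itemset:
                    candidate = itemset | {item}
                    if total_price(candidate) <= budget:
                        next_level.add(candidate)
        level = list(next_level)
        result.extend(level)
    return result


within_budget = itemsets_within_budget(sorted(PRICES), budget=10)
```

With a budget of 10, the three cheap items and their three pairs qualify; {pen, ink, paper} costs 11 and is rejected, and no itemset containing the laptop is ever extended.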
    • Constraint-Based Clustering Analysis
      • User-specified constraints: no cluster has fewer than 1,000 gold customers
      • Resource allocation (clustering) with obstacles
    • Towards Automated Data Mining?
      • It is not realistic to automatically find all the knowledge in a large database
      • Thus we promote human-centered, constraint-based mining
      • However, to achieve genuinely intelligent data mining, the data mining process should be self-tuning, self-managing, and self-triggering
      • Functions should be developed to achieve such performance
    • Conclusions
      • Data mining—A promising research frontier
      • Data mining research has been striding forward greatly in the last decade
      • However, data mining, as an industry, has not been flying as high as expected
      • Much research and application exploration are needed
        • Web mining
        • Towards integrated data mining environments and tools
        • Towards intelligent, efficient, and scalable data mining methods
    • http://www.cs.sfu.ca/~han http://db.cs.sfu.ca
      • Thank you !!!
    • References
      • J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
      • J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based, Multidimensional Data Mining", COMPUTER (special issue on Data Mining), 32(8): 46-50, 1999.