Data Mining, Xuequn Shang, Northwestern Polytechnical University, September 2006
About the Course
Tue. 7:00 pm ~9:00 pm
Fri. 7:00 pm~9:00 pm
Room XA107 West building
Xuequn Shang, Ph.D.
How many of you have taken a database course before?
How many of you have taken a statistics course?
How many of you have taken a machine learning course before?
Textbook and Reference
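Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann, 2001.
Chinese edition: 数据挖掘概念与技术 (Data Mining: Concepts and Techniques), translated by Fan Ming, Meng Xiaofeng, et al., China Machine Press, August 2001.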
Principles of Data Mining (Adaptive Computation and Machine Learning), David J. Hand, Heikki Mannila, Padhraic Smyth, MIT Press, 2001
Many research papers
Data that has relevance for managerial decisions is accumulating at an incredible rate due to a host of technological advances.
Electronic data capture has become inexpensive and ubiquitous as a by-product of innovations such as the internet, e-commerce, electronic banking, point-of-sale devices, bar-code readers, and intelligent machines.
Such data is often stored in data warehouses and data marts specifically intended for management decision support.
Data mining is a rapidly growing field concerned with developing techniques that help managers make intelligent use of these repositories.
Application areas include credit rating, fraud detection, database marketing, customer relationship management, and stock market investments.
This course will examine methods that have emerged from both fields and proven to be of value in recognizing patterns and making predictions from an applications perspective. We will survey applications and provide an opportunity for hands-on experimentation with algorithms for data mining using easy-to-use software and cases.
To provide an introduction to knowledge discovery in databases and complex data repositories, and to present basic concepts relevant to real data mining applications, as well as reveal important research issues germane to the knowledge discovery domain and advanced mining applications.
Students will understand the fundamental concepts underlying knowledge discovery in databases and gain hands-on experience with implementation of some data mining algorithms applied to real world cases.
Assignments (2) 20%
Class participation 10%
Final Exam 50%
– Quality of presentation + quality of report + quality of demos
About the Project
Implement and experimentally evaluate the major method in the paper (60%)
If possible, improve the method in effectiveness or efficiency, implement and experimentally evaluate your improvement
Professional patients, ring of doctors, and ring of references
Unnecessary or correlated screening tests
Telecommunications: phone-call fraud
Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
Analysts estimate that 38% of retail shrink is due to dishonest employees
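The phone-call model above flags calls that deviate from an expected norm. A minimal sketch of that idea follows, assuming a single feature (call duration in minutes) and a simple z-score test. The function name `is_unusual`, the threshold of 3 standard deviations, and the sample durations are illustrative assumptions, not part of the lecture.

```python
from statistics import mean, stdev

def is_unusual(history, new_value, threshold=3.0):
    """Return True if new_value lies more than `threshold` sample
    standard deviations away from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past values are identical: any different value deviates.
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold

# Hypothetical past call durations (minutes) for one subscriber.
past_durations = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5, 3.0]

typical_call = is_unusual(past_durations, 4.2)     # close to the norm
very_long_call = is_unusual(past_durations, 95.0)  # far from the norm
print(typical_call, very_long_call)
```

A real model would combine several features (destination, time of day or week) and would likely use a more robust notion of "norm" than a mean and standard deviation.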
The KDD Process
Data mining—core of knowledge discovery process
[Figure: the KDD process. Databases pass through data cleaning and data integration into a data warehouse; selection yields task-relevant data; data mining and pattern evaluation then produce knowledge.]
KDD Process Steps
Confluence of Multiple Disciplines
[Figure: data mining at the confluence of database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.]
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
What Kind of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
The World-Wide Web
Tables, records, and attributes
Indexes & SQL
Online transactional processing (OLTP)
Insert a student “Jennet” into class CMPT 741, fall 2005
Online analytical processing (OLAP)
Find the average class size of CMPT 700 level courses in the last 3 years, grouped by semesters
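The two workloads above can be contrasted in a few lines. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `enrollment` table, its columns, and the sample rows are hypothetical, and the "last 3 years" filter is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrollment (student TEXT, course TEXT, semester TEXT)"
)

# Hypothetical existing rows so the analytical query has data to scan.
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    [
        ("Ann", "CMPT 741", "fall 2005"),
        ("Bob", "CMPT 741", "fall 2005"),
        ("Ann", "CMPT 726", "fall 2005"),
        ("Cy", "CMPT 741", "spring 2005"),
    ],
)

# OLTP: a short transaction that touches a single record.
with conn:
    conn.execute(
        "INSERT INTO enrollment VALUES (?, ?, ?)",
        ("Jennet", "CMPT 741", "fall 2005"),
    )

# OLAP: an aggregate over many records, grouped by semester.
# Inner query: size of each class. Outer query: average size per semester.
olap_sql = """
    SELECT semester, AVG(class_size)
    FROM (
        SELECT semester, course, COUNT(*) AS class_size
        FROM enrollment
        WHERE course LIKE 'CMPT 7%'
        GROUP BY semester, course
    )
    GROUP BY semester
    ORDER BY semester
"""
avg_sizes = dict(conn.execute(olap_sql).fetchall())
print(avg_sizes)
```

The OLTP statement reads and writes one row, while the OLAP query scans the whole table and summarizes it, which is why the two workloads are usually served by differently organized systems.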
A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process [Inmon]
Data Warehouses
[Figure: source data is cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools.]
A Multi-dimensional Database
Data Cube
[Figure: a three-dimensional data cube with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3); its 64 cells are numbered 1 to 64.]
Transactional Databases
What kinds of product combinations do customers like to buy together?
TID    Itemset
T100   Milk, bread, beer, diaper
T200   Beer, cook, fish, potato, orange, apple
…      …
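One simple way to approach the question above is to count how often each pair of items appears in the same transaction. The sketch below reuses transactions T100 and T200 from the slide; T300 and T400 are hypothetical rows added only so that some pairs repeat, and the cutoff of 3 transactions is an arbitrary illustrative choice.

```python
from collections import Counter
from itertools import combinations

transactions = {
    "T100": {"milk", "bread", "beer", "diaper"},
    "T200": {"beer", "cook", "fish", "potato", "orange", "apple"},
    # The rows below are hypothetical, added only for illustration.
    "T300": {"milk", "bread", "diaper"},
    "T400": {"beer", "diaper", "bread"},
}

pair_counts = Counter()
for items in transactions.values():
    # Sorting gives each pair one canonical form, e.g. ("beer", "diaper").
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# Keep the pairs that were bought together in at least 3 transactions.
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= 3}
print(frequent_pairs)
```

Counting every pair is only feasible for small item catalogs; practical association mining algorithms avoid enumerating combinations that cannot be frequent.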
Geographic databases (map)
VLSI chip design databases
Satellite image databases
What are the changes of the forest in the last 10 years?
Find clusters of homes with kids of age 5-10
A sequence of values that change over time
The sequences of stock price at every 5 minutes
The daily temperature
Time Series Data
HTML web documents
Annotated multimedia databases
Image, audio and video data
DNA, gene, protein: very long sequences
Medical documents and images
Typically very noisy
Data cleaning and integration are challenging
What can be discovered depends upon the data mining task employed.
Descriptive DM tasks
Characterize the general properties of the data
Predictive DM tasks
Make inferences from the available data in order to make predictions
What Can Be Discovered?
What Kinds of Patterns?
Association rules and sequential patterns
Other data mining tasks
Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands or even millions of patterns, and not all of them are interesting
What makes a pattern interesting?
Can a data mining system generate all of the interesting patterns?
Can a data mining system generate only interesting patterns?
What makes a pattern interesting?
A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, etc.
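Support and confidence, the objective measures named above, can be computed directly from a set of transactions. The sketch below assumes the usual definitions for a rule X -> Y: support is the fraction of transactions containing all items of X and Y, and confidence is that support divided by the support of X. The function names and the sample baskets are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent together with consequent) / support(antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "diaper", "beer"},
    {"bread", "diaper", "beer"},
    {"milk", "bread", "diaper", "beer"},
]

# Rule: diaper -> beer
rule_support = support(baskets, {"diaper", "beer"})
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})
print(rule_support, rule_confidence)
```

A rule is typically reported only when both values exceed user-chosen minimum thresholds; subjective measures such as unexpectedness need additional input from the user and are not captured here.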
Find All Interesting Patterns?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
Heuristic vs. exhaustive search
Association vs. classification vs. clustering
Find Only Interesting Patterns?
Search for only interesting patterns: An optimization problem
Can a data mining system find only the interesting patterns?
First generate all the patterns and then filter out the uninteresting ones
Generate only the interesting patterns—mining query optimization
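The two strategies above can be compared on a toy problem. The sketch below takes "interesting" to mean "contained in at least `min_count` transactions" and uses a level-wise, Apriori-style search as one possible way to avoid generating uninteresting patterns. Both choices are assumptions made for illustration, as are the function names and the sample baskets.

```python
from itertools import combinations

def count(transactions, itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def generate_then_filter(transactions, min_count):
    """Strategy 1: enumerate every non-empty itemset, then filter."""
    items = sorted(set().union(*transactions))
    examined = 0
    frequent = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            examined += 1
            if count(transactions, frozenset(combo)) >= min_count:
                frequent.add(frozenset(combo))
    return frequent, examined

def levelwise(transactions, min_count):
    """Strategy 2: only extend itemsets already known to be frequent,
    because a superset of an infrequent itemset cannot be frequent."""
    items = sorted(set().union(*transactions))
    examined = 0
    frequent = set()
    level = {frozenset([i]) for i in items}
    while level:
        examined += len(level)
        survivors = {c for c in level if count(transactions, c) >= min_count}
        frequent |= survivors
        # Next-level candidates: unions of two survivors that are
        # exactly one item larger than the current level.
        size = len(next(iter(level))) + 1
        level = {a | b for a in survivors for b in survivors
                 if len(a | b) == size}
    return frequent, examined

# Hypothetical market baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "diaper", "beer"},
    {"bread", "diaper", "beer"},
    {"milk", "bread", "diaper", "beer"},
]

all_first, n_all = generate_then_filter(baskets, min_count=3)
pruned, n_pruned = levelwise(baskets, min_count=3)
print(all_first == pruned, n_all, n_pruned)
```

On this data both strategies return the same five itemsets, but the first examines all 15 candidates while the level-wise search examines 10; the gap grows quickly as the number of items increases.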
Research Issues in Data Mining
What kind of patterns to mine?
Propose interesting data mining problems
How to identify interesting patterns
Visualization and interaction
Presentation of mining results
Interactive, adaptive mining
Develop fast data mining algorithms
Identify effective heuristics for mining
Theoretical and/or empirical justification
Parallel, distributed, and incremental mining
Integration into product systems
Data mining module in DBMS and data warehouses
Handle noisy or incomplete data
Incorporate background knowledge
Data mining algebra and language
Integration of multiple mining tasks/DBMS
Open for new data/knowledge
Interaction and visualization
Data mining query optimization
Automatic optimization by construct rewriting
Foundation for Data Mining
Major Issues in Data Mining
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge fusion
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
Data mining: Discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining
What is data mining?
Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories. It is a young interdisciplinary field, drawing from areas such as database systems, data warehousing, statistics, machine learning, data visualization, information retrieval, and high-performance computing. Other contributing areas include neural networks, pattern recognition, spatial data analysis, image databases, signal processing, and many application fields, such as business, economics, and bioinformatics.
Define each of the following data mining functionalities: association and correlation analysis, classification, prediction, clustering, and evolution analysis. Give an example of each data mining functionality, using a real-life database with which you are familiar.
Association analysis: showing attribute-value conditions that occur frequently together in a given set of data
Classification: finding a set of models that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown
Clustering: analyzing data objects without consulting a known class label
Outlier analysis: finding data objects that do not comply with the general behavior or model of the data
Evolution analysis: describing and modeling regularities or trends for objects whose behavior changes over time
A student asked me what the difference between data mining and information retrieval is
There is really no clear difference
Actually, some recent information retrieval systems do discover associations between words and paragraphs
What is the difference between data mining (DM) and pattern recognition (PR)?
Both aim to find useful relations
In PR, we typically deal with data sets of moderate size, while in a typical DM application, we are concerned with data sets that are large in terms of dimension and number of clusters
PR is an important technique used in DM
Data mining involves an integration of techniques from multiple disciplines
Architecture: Typical Data Mining System
[Figure: components of a typical data mining system: graphical user interface, pattern evaluation, data mining engine, knowledge base, database or data warehouse server, and a data cleaning, integration, and selection step over the underlying sources (database, data warehouse, World-Wide Web, other information repositories).]