1. February 21, 2024 Data Mining: Concepts and Techniques 1
Data Mining:
Concepts and Techniques
â Chapter 1 â
â Introduction â
Prof. Jianlin Cheng
Department of Computer Science
University of Missouri, Columbia
Slides are adapted from
ŠJiawei Han and Micheline Kamber. All rights reserved.
2. February 21, 2024 Data Mining: Concepts and Techniques 2
Brodsky, 2013
8. February 21, 2024 Data Mining: Concepts and Techniques 8
Brodsky, 2013
9. Syllabus
īŽ Instructor: Prof. Jianlin Cheng
īŽ My Teaching
īŽ My Research
īŽ Office Hours: EBW 109, MoWe: 4 â 5
īŽ TA (Xiaokai Qian)
īŽ Objectives
īŽ Text Book
īŽ Assignments
īŽ Projects
īŽ Grading
īŽ Course web site:
http://calla.rnet.missouri.edu/cheng_courses/datamining20
16/
February 21, 2024 Data Mining: Concepts and Techniques 9
10. February 21, 2024 Data Mining: Concepts and Techniques 10
Coverage of Topics
īŽ The book will be covered in two courses at CS, UIUC
īŽ Introduction to data warehousing and data mining
īŽ Data mining: Principles and algorithms
īŽ Our Coverage (both introductory and advanced materials)
īŽ Introduction
īŽ Data Preprocessing
īŽ Mining Frequent Patterns, Association and Correlations
īŽ Classification and Prediction
īŽ Cluster Analysis
īŽ Network mining (advanced)
11. February 21, 2024 Data Mining: Concepts and Techniques 11
Advanced Topics (projects)
īŽ Mining data streams, time-series, and sequence data
īŽ Mining graphs, social networks
īŽ Mining object, spatial, multimedia, text and Web data
īŽ Mining complex data objects
īŽ Spatial and spatiotemporal data mining
īŽ Multimedia data mining
īŽ Text mining
īŽ Web mining
īŽ Applications and trends of data mining
īŽ Mining business & biological data
īŽ Visual data mining
īŽ Data mining and society: Privacy-preserving data mining
12. February 21, 2024 Data Mining: Concepts and Techniques 12
Chapter 1. Introduction
īŽ Motivation: Why data mining?
īŽ What is data mining?
īŽ Data Mining: On what kind of data?
īŽ Data mining functionality
īŽ Classification of data mining systems
īŽ Top-10 most popular data mining algorithms
īŽ Major issues in data mining
īŽ Overview of the course
13. February 21, 2024 Data Mining: Concepts and Techniques 13
Why Data Mining?
īŽ The Explosive Growth of Data: from terabytes to petabytes
īŽ Data collection and data availability
īŽ Automated data collection tools, database systems, Web,
computerized society
īŽ Major sources of abundant data
īŽ Business: Web, e-commerce, transactions, stocks, âĻ
īŽ Science: Remote sensing, bioinformatics, scientific simulation, âĻ
īŽ Society and everyone: news, digital cameras, YouTube
īŽ We are drowning in data, but starving for knowledge!
īŽ âNecessity is the mother of inventionââData miningâAutomated
analysis of massive data sets
14. February 21, 2024 Data Mining: Concepts and Techniques 14
Evolution of Sciences
īŽ Before 1600, empirical science
īŽ 1600-1950s, theoretical science
īŽ Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
īŽ 1950s-1990s, computational science
īŽ Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
īŽ Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
īŽ 1990-now, data science
īŽ The flood of data from new scientific instruments and simulations
īŽ The ability to economically store and manage petabytes of data online
īŽ The Internet and computing Grid that makes all these archives universally accessible
īŽ Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!
īŽ Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
Comm. ACM, 45(11): 50-54, Nov. 2002
15. February 21, 2024 Data Mining: Concepts and Techniques 15
Evolution of Database Technology
īŽ 1960s:
īŽ Data collection, database creation, IMS and network DBMS
īŽ 1970s:
īŽ Relational data model, relational DBMS implementation
īŽ 1980s:
īŽ RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
īŽ Application-oriented DBMS (spatial, scientific, engineering, etc.)
īŽ 1990s:
īŽ Data mining, data warehousing, multimedia databases, and Web
databases
īŽ 2000s
īŽ Stream data management and mining
īŽ Data mining and its applications
īŽ Web technology (XML, data integration) and global information systems
16. February 21, 2024 Data Mining: Concepts and Techniques 16
What Is Data Mining?
īŽ Data mining (knowledge discovery from data)
īŽ Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amount of data
īŽ Data mining: a misnomer?
īŽ Alternative names
īŽ Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
īŽ Watch out: Is everything âdata miningâ?
īŽ Simple search and query processing
īŽ (Deductive) expert systems
17. February 21, 2024 Data Mining: Concepts and Techniques 17
Knowledge Discovery (KDD) Process
īŽ Data miningâcore of
knowledge discovery
process
Data Cleaning
Data Integration
Databases
Data Warehouse
Task-relevant Data
Selection
Data Mining
Pattern Evaluation
18. February 21, 2024 Data Mining: Concepts and Techniques 18
Data Mining and Business Intelligence
Increasing potential
to support
business decisions End User
Business
Analyst
Data
Analyst
DBA
Decision
Making
Data Presentation
Visualization Techniques
Data Mining
Information Discovery
Data Exploration
Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
Ex: Cancer research
19. February 21, 2024 Data Mining: Concepts and Techniques 19
Data Mining: Confluence of Multiple Disciplines
Data Mining
Database
Technology Statistics
Machine
Learning
Pattern
Recognition
Algorithm
Other
Disciplines
Visualization
20. February 21, 2024 Data Mining: Concepts and Techniques 20
Why Not Traditional Data Analysis?
īŽ Tremendous amount of data
īŽ Algorithms must be highly scalable to handle such as tera-bytes of
data
īŽ High-dimensionality of data
īŽ Micro-array may have tens of thousands of dimensions
īŽ High complexity of data
īŽ Data streams and sensor data
īŽ Time-series data, temporal data, sequence data
īŽ Structure data, graphs, social networks and multi-linked data
īŽ Heterogeneous databases and legacy databases
īŽ Spatial, spatiotemporal, multimedia, text and Web data
īŽ Software programs, scientific simulations
īŽ New and sophisticated applications
21. 3V of Big Data
February 21, 2024 Data Mining: Concepts and Techniques 21
http://www.datasciencecentral.com/forum/topics/the-3vs-that-
define-big-data
22. February 21, 2024 Data Mining: Concepts and Techniques 22
Multi-Dimensional View of Data Mining
īŽ Data to be mined
īŽ Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW
īŽ Knowledge to be mined
īŽ Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
īŽ Multiple/integrated functions and mining at multiple levels
īŽ Techniques utilized
īŽ Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
īŽ Applications adapted
īŽ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
23. February 21, 2024 Data Mining: Concepts and Techniques 23
Data Mining: Classification Schemes
īŽ General functionality
īŽ Descriptive data mining (Democrat <-> Republican)
īŽ Predictive data mining
īŽ Different views lead to different classifications
īŽ Data view: Kinds of data to be mined
īŽ Knowledge view: Kinds of knowledge to be discovered
īŽ Method view: Kinds of techniques utilized
īŽ Application view: Kinds of applications adapted
24. February 21, 2024 Data Mining: Concepts and Techniques 24
Data Mining: On What Kinds of Data?
īŽ Database-oriented data sets and applications
īŽ Relational database, data warehouse, transactional database
īŽ Advanced data sets and advanced applications
īŽ Data streams and sensor data
īŽ Time-series data, temporal data, sequence data (incl. bio-sequences)
īŽ Structure data, graphs, social networks and multi-linked data
īŽ Object-relational databases
īŽ Heterogeneous databases and legacy databases
īŽ Spatial data and spatiotemporal data
īŽ Multimedia database
īŽ Text databases
īŽ The World-Wide Web
25. February 21, 2024 Data Mining: Concepts and Techniques 25
Data Mining Functionalities
īŽ Multidimensional concept description: Characterization and
discrimination
īŽ Generalize, summarize, and contrast data characteristics, e.g.,
human and monkey?
īŽ Good income VS poor?
īŽ Frequent patterns, association, correlation vs. causality
īŽ Diaper ī Beer [0.5%, 75%], Education -> Income
īŽ Classification and prediction
īŽ Construct models (functions) that describe and distinguish classes
or concepts for future prediction
īŽ E.g., classify countries based on (economy), cars based on (gas
mileage), internet news (Google News), product (Amazon)
īŽ Predict some stock price, traffic jam
26. February 21, 2024 Data Mining: Concepts and Techniques 26
Data Mining Functionalities (2)
īŽ Cluster analysis
īŽ Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns, terrain images?
īŽ Maximizing intra-cluster similarity & minimizing intercluster
similarity
īŽ Outlier analysis
īŽ Outlier: Data object that does not comply with the general behavior
of the data
īŽ Noise or exception? Useful in fraud detection, rare events analysis
īŽ Trend and evolution analysis
īŽ Trend and deviation: e.g., political polls? Who will win republican
nomination in 2016? Who will win presidential election?
īŽ Sequential pattern mining: e.g., video mining -> identify objects
īŽ Periodicity analysis: climate change?
īŽ Similarity-based analysis: future of the Apple company?
27. February 21, 2024 Data Mining: Concepts and Techniques 27
Top-10 Most Popular DM Algorithms:
18 Identified Candidates (I)
īŽ Classification
īŽ #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan
Kaufmann., 1993. (e.g. 2012 presidential election)
īŽ #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification
and Regression Trees. Wadsworth, 1984.
īŽ #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996.
Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
īŽ #4. Naive Bayes Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid
After All? Internat. Statist. Rev. 69, 385-398.
īŽ Statistical Learning
īŽ #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory.
Springer-Verlag.
īŽ #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J.
Wiley, New York. Association Analysis
īŽ #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms
for Mining Association Rules. In VLDB '94.
īŽ #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns
without candidate generation. In SIGMOD '00.
28. February 21, 2024 Data Mining: Concepts and Techniques 28
The 18 Identified Candidates (II)
īŽ Link Mining
īŽ #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a
large-scale hypertextual Web search engine. In WWW-7, 1998.
īŽ #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a
hyperlinked environment. SODA, 1998.
īŽ Clustering
īŽ #11. K-Means: MacQueen, J. B., Some methods for classification
and analysis of multivariate observations, in Proc. 5th Berkeley
Symp. Mathematical Statistics and Probability, 1967.
īŽ #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996.
BIRCH: an efficient data clustering method for very large
databases. In SIGMOD '96.
īŽ Bagging and Boosting
īŽ #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-
theoretic generalization of on-line learning and an application to
boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.
29. February 21, 2024 Data Mining: Concepts and Techniques 29
The 18 Identified Candidates (III)
īŽ Sequential Patterns
īŽ #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns:
Generalizations and Performance Improvements. In Proceedings of the
5th International Conference on Extending Database Technology, 1996.
īŽ #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U.
Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by
Prefix-Projected Pattern Growth. In ICDE '01.
īŽ Integrated Mining
īŽ #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and
association rule mining. KDD-98.
īŽ Rough Sets
īŽ #17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of
Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992
īŽ Graph Mining
īŽ #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure
Pattern Mining. In ICDM '02.
31. February 21, 2024 Data Mining: Concepts and Techniques 31
Major Issues in Data Mining
īŽ Mining methodology
īŽ Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
Web
īŽ Performance: efficiency, effectiveness, and scalability
īŽ Pattern evaluation: the interestingness problem
īŽ Incorporation of background knowledge
īŽ Handling noise and incomplete data
īŽ Parallel, distributed and incremental mining methods
īŽ Integration of the discovered knowledge with existing one: knowledge fusion
īŽ User interaction
īŽ Data mining query languages and ad-hoc mining
īŽ Expression and visualization of data mining results
īŽ Interactive mining of knowledge at multiple levels of abstraction
īŽ Applications and social impacts
īŽ Domain-specific data mining
īŽ Protection of data security, integrity, and privacy
32. February 21, 2024 Data Mining: Concepts and Techniques 32
A Brief History of Data Mining Society
īŽ 1989 IJCAI Workshop on Knowledge Discovery in Databases
īŽ Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
1991)
īŽ 1991-1994 Workshops on Knowledge Discovery in Databases
īŽ Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
īŽ 1995-1998 International Conferences on Knowledge Discovery in Databases
and Data Mining (KDDâ95-98)
īŽ Journal of Data Mining and Knowledge Discovery (1997)
īŽ ACM SIGKDD conferences since 1998 and SIGKDD Explorations
īŽ More conferences on data mining
īŽ PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), etc.
īŽ ACM Transactions on KDD starting in 2007
33. February 21, 2024 Data Mining: Concepts and Techniques 33
Conferences and Journals on Data Mining
īŽ KDD Conferences
īŽ ACM SIGKDD Int. Conf. on
Knowledge Discovery in
Databases and Data Mining
(KDD)
īŽ SIAM Data Mining Conf. (SDM)
īŽ (IEEE) Int. Conf. on Data
Mining (ICDM)
īŽ Conf. on Principles and
practices of Knowledge
Discovery and Data Mining
(PKDD)
īŽ Pacific-Asia Conf. on
Knowledge Discovery and Data
Mining (PAKDD)
īŽ Other related conferences
īŽ ACM SIGMOD
īŽ VLDB
īŽ (IEEE) ICDE
īŽ WWW, SIGIR
īŽ ICML, CVPR, NIPS
īŽ Journals
īŽ Data Mining and Knowledge
Discovery (DAMI or DMKD)
īŽ IEEE Trans. On Knowledge
and Data Eng. (TKDE)
īŽ KDD Explorations
īŽ ACM Trans. on KDD
34. February 21, 2024 Data Mining: Concepts and Techniques 34
Where to Find References? DBLP, CiteSeer, Google
īŽ Data mining and KDD (SIGKDD: CDROM)
īŽ Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
īŽ Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
īŽ Database systems (SIGMOD: ACM SIGMOD AnthologyâCD ROM)
īŽ Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
īŽ Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
īŽ AI & Machine Learning
īŽ Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
īŽ Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems,
IEEE-PAMI, etc.
īŽ Web and IR
īŽ Conferences: SIGIR, WWW, CIKM, etc.
īŽ Journals: WWW: Internet and Web Information Systems,
īŽ Statistics
īŽ Conferences: Joint Stat. Meeting, etc.
īŽ Journals: Annals of statistics, etc.
īŽ Visualization
īŽ Conference proceedings: CHI, ACM-SIGGraph, etc.
īŽ Journals: IEEE Trans. visualization and computer graphics, etc.
35. February 21, 2024 Data Mining: Concepts and Techniques 35
Architecture: Typical Data Mining System
data cleaning, integration, and selection
Database or Data
Warehouse Server
Data Mining Engine
Pattern Evaluation
Graphical User Interface
Knowl
edge-
Base
Database
Data
Warehouse
World-Wide
Web
Other Info
Repositories
36. February 21, 2024 Data Mining: Concepts and Techniques 36
Recommended Reference Books
īŽ S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan
Kaufmann, 2002
īŽ R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
īŽ T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
īŽ U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and
Data Mining. AAAI/MIT Press, 1996
īŽ U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge
Discovery, Morgan Kaufmann, 2001
īŽ J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006
īŽ D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
īŽ T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, Springer-Verlag, 2001
īŽ B. Liu, Web Data Mining, Springer 2006.
īŽ T. M. Mitchell, Machine Learning, McGraw Hill, 1997
īŽ G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
īŽ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
īŽ S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
īŽ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd ed. 2005
37. February 21, 2024 Data Mining: Concepts and Techniques 37
Summary
īŽ Data mining: Discovering interesting patterns from large amounts of
data
īŽ A natural evolution of database technology and machine learning, in
great demand, with wide applications
īŽ A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
īŽ Mining can be performed in a variety of information repositories
īŽ Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
īŽ Data mining systems and architectures
īŽ Major issues in data mining
38. Programming Tools
īŽ Any general programming languages: C/C++,
Java, Perl, Python
īŽ Specialized language packages: R, Matlab (or
Octave), Mathematica
īŽ Machine learning and data mining packages:
Weka, NNClass, SVMlight
īŽ Homework submission:
mudatamining@gmail.com
īŽ Due by Feb. 8, 2016
February 21, 2024 Data Mining: Concepts and Techniques 38
39. February 21, 2024 Data Mining: Concepts and Techniques 39
Why Data Mining?âPotential Applications
īŽ Data analysis and decision support
īŽ Market analysis and management
īŽ Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
īŽ Risk analysis and management
īŽ Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
īŽ Fraud detection and detection of unusual patterns (outliers)
īŽ Other Applications
īŽ Text mining (news group, email, documents) and Web mining
īŽ Stream data mining
īŽ Bioinformatics and bio-data analysis
40. February 21, 2024 Data Mining: Concepts and Techniques 40
Ex. 1: Market Analysis and Management
īŽ Where does the data come from?âCredit card transactions, loyalty cards,
discount coupons, customer complaint calls, plus (public) lifestyle studies
īŽ Target marketing
īŽ Find clusters of âmodelâ customers who share the same characteristics: interest,
income level, spending habits, etc.
īŽ Determine customer purchasing patterns over time
īŽ Cross-market analysisâFind associations/co-relations between product sales,
& predict based on such association
īŽ Customer profilingâWhat types of customers buy what products (clustering
or classification)
īŽ Customer requirement analysis
īŽ Identify the best products for different groups of customers
īŽ Predict what factors will attract new customers
īŽ Provision of summary information
īŽ Multidimensional summary reports
īŽ Statistical summary information (data central tendency and variation)
41. February 21, 2024 Data Mining: Concepts and Techniques 41
Ex. 2: Corporate Analysis & Risk Management
īŽ Finance planning and asset evaluation
īŽ cash flow analysis and prediction
īŽ contingent claim analysis to evaluate assets
īŽ cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
īŽ Resource planning
īŽ summarize and compare the resources and spending
īŽ Competition
īŽ monitor competitors and market directions
īŽ group customers into classes and a class-based pricing procedure
īŽ set pricing strategy in a highly competitive market
42. February 21, 2024 Data Mining: Concepts and Techniques 42
Ex. 3: Fraud Detection & Mining Unusual Patterns
īŽ Approaches: Clustering & model construction for frauds, outlier analysis
īŽ Applications: Health care, retail, credit card service, telecomm.
īŽ Auto insurance: ring of collisions
īŽ Money laundering: suspicious monetary transactions
īŽ Medical insurance
īŽ Professional patients, ring of doctors, and ring of references
īŽ Unnecessary or correlated screening tests
īŽ Telecommunications: phone-call fraud
īŽ Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
īŽ Retail industry
īŽ Analysts estimate that 38% of retail shrink is due to dishonest
employees
īŽ Anti-terrorism