SlideShare a Scribd company logo
1 of 61
Download to read offline
CSE5243 INTRO. TO DATA MINING
Chapter 1. Introduction
Yu Su, CSE@The Ohio State University
Slides adapted from UIUC CS412 by Prof. Jiawei Han and OSU CSE5243 by
Prof. Huan Sun
2
CSE 5243. Course Page & Schedule
¨ Class Homepage:
https://ysu1989.github.io/courses/sp20/cse5243/
¨ Class Schedule:
9:35-10:55 AM, Wed/Fri, Caldwell Lab 171
¨ Office hours:
¤ Instructor: Yu Su @ DL783, Fri 11:00am-12:15pm (right after class)
First week: No office hours
¤ TA: Jiaqi Xu (xu.1629), Wed 03:00pm-04:00pm, Baker 406
3
CSE 5243. Textbook
¨ Recommended but not required
¨ (Primary) Jiawei Han, Micheline Kamber and Jian Pei, Data
Mining: Concepts and Techniques (3rd ed), 2011
¤ More resources:
https://wiki.illinois.edu/wiki/display/cs412/2.+Course+Syllabus+and
+Schedule
¨ (Primary) Pang-Ning Tan, Michael Steinbach, and Vipin Kumar,
Introduction to Data Mining, 2006
¨ (Supplementary) Mohammed J. Zaki and Wagner Meira, Jr.,
Data Mining Analysis and Concepts, 2014
¨ (Supplementary) Jure Leskovec, Anand Rajaraman, Jeff Ullman,
Mining of Massive Datasets
¤ More resources: http://www.mmds.org/
4
CSE 5243. Course Work and Grading
¨ Homework, Course Projects, and Exams
¤ Participation: 10% (Online discussion and/or class participation)
¤ Homework: 50% (No Late Submissions!)
¤ Midterm exam: 20%
¤ Final exam: 20%
¨ Need help and/or discussions?
¤ Carmen: https://osu.instructure.com/courses/76423/discussion_topics
n Receive credits: answer questions on Carmen and engage in class discussion.
¨ Check your homework/exam scores
¤ Carmen: https://osu.instructure.com/courses/76423/gradebook
5
Videos
¨ 10 TED talks on Big Data and Analytics
¤ https://www.promptcloud.com/blog/top-ted-talks-on-big-data/
¤ Shyan Sanker (Director at Palantir Technologies):
n https://www.youtube.com/watch?time_continue=19&v=ltelQ3iKybU
¨ 5 TED talks on Data analytics for business leaders
¤ https://bigdata-madesimple.com/5-best-ted-talks-on-data-analytics-for-business-
leaders/
¨ Data analytics for beginners
¤ https://www.youtube.com/watch?v=66ko_cWSHBU (If you love sports, this
TED Talk on data analytics is going to be an interesting watch)
6
Chapter 1. Introduction
¨ What is Data Mining?
¨ Why Data Mining?
¨ A Multi-Dimensional View of Data Mining
¨ What Kinds of Data Can Be Mined?
¨ What Kinds of Patterns Can Be Mined?
¨ What Kinds of Technologies Are Used?
¨ What Kinds of Applications Are Targeted?
¨ Major Issues in Data Mining
¨ A Brief History of Data Mining and Data Mining Society
¨ Summary
7
What is Data Mining?
¨ Data mining (knowledge discovery from data, KDD)
¤ Extraction of interesting (non-trivial, implicit, previously
unknown, and potentially useful) patterns or knowledge
from huge amount of data
¨ Alternative names
¤ Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
8
What is Data Mining?
¨ Data mining (knowledge discovery from data, KDD)
¤ Extraction of interesting (non-trivial, implicit, previously
unknown, and potentially useful) patterns or knowledge
from huge amount of data
¨ Alternative names
¤ Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
One of the best conferences to publish your research work:
SIGKDD (check resources)
9
Knowledge Discovery (KDD) Process
¨ (Narrow view) Data mining plays
an essential role in the
knowledge discovery process
¨ (Broad view) Data mining is the
knowledge discovery process
Data Cleaning
Data Integration
Databases
Data Warehouse
Task-relevant Data
Selection
Data
Mining
Pattern
Evaluation
10
Example: A Web Mining Framework
¨ Web mining usually involves
¤ Data crawling and cleaning
¤ Data integration from multiple sources
¤ (Optional) Warehousing the data
¤ (Optional) Data cube construction
¤ Data selection for data mining
¤ Data mining
¤ Presentation of the mining results
¤ Patterns and knowledge to be used or stored into
knowledge base
12
KDD Process: A View from ML and Statistics
¨ This is a view from typical machine learning and statistics
communities
Input Data Data
Mining
Data
Pre-
Processing
Post-
Processing
Data integration
Normalization
Feature selection
Dimension
reduction
Pattern discovery
Classification
Clustering
Outlier analysis
… … … …
Pattern evaluation
Pattern selection
Pattern interpretation
Pattern visualization
13
Data Science
Figure from: https://www.datasciencecentral.com/profiles/blogs/difference-
of-data-science-machine-learning-and-data-mining
14
Chapter 1. Introduction
¨ What Is Data Mining?
¨ Why Data Mining?
¨ A Multi-Dimensional View of Data Mining
¨ What Kinds of Data Can Be Mined?
¨ What Kinds of Patterns Can Be Mined?
¨ What Kinds of Technologies Are Used?
¨ What Kinds of Applications Are Targeted?
¨ Major Issues in Data Mining
¨ A Brief History of Data Mining and Data Mining Society
¨ Summary
15
Why Data Mining?
¨ The Explosive Growth of Data: from terabytes to petabytes
¤ Data collection and data availability
n Automated data collection tools, database systems, Web,
computerized society
16
Why Data Mining?
¨ The Explosive Growth of Data: from terabytes to petabytes
¤ Data collection and data availability
n Automated data collection tools, database systems, Web,
computerized society
¤ Major sources of data
n Business: Web, e-commerce, transactions, stocks, …
n Science: Remote sensing, bioinformatics, scientific simulation, …
n Society and everyone: news, digital cameras, YouTube
17 “How much data is generated each day?” – World Economic Forum
18
Why Data Mining?
¨ The Explosive Growth of Data: from terabytes to petabytes
¤ Data collection and data availability
n Automated data collection tools, database systems, Web,
computerized society
¤ Major sources of data
n Business: Web, e-commerce, transactions, stocks, …
n Science: Remote sensing, bioinformatics, scientific simulation, …
n Society and everyone: news, digital cameras, YouTube
¨ We are drowning in data, but starving for knowledge!
¨ “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
19
Chapter 1. Introduction
¨ Why Data Mining?
¨ What Is Data Mining?
¨ A Multi-Dimensional View of Data Mining
¨ What Kinds of Data Can Be Mined?
¨ What Kinds of Patterns Can Be Mined?
¨ What Kinds of Technologies Are Used?
¨ What Kinds of Applications Are Targeted?
¨ Major Issues in Data Mining
¨ A Brief History of Data Mining and Data Mining Society
¨ Summary
20
Multi-Dimensional View of Data Mining
¨ Data to be mined
¤ Database data (extended-relational, object-oriented, heterogeneous), data
warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text
and web, multi-media, graphs & social and information networks
21
Multi-Dimensional View of Data Mining
¨ Data to be mined
¤ Database data (extended-relational, object-oriented, heterogeneous), data
warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text
and web, multi-media, graphs & social and information networks
¨ Knowledge to be mined (or: Data mining functions)
¤ Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, …
¤ Descriptive vs. predictive data mining
¤ Multiple/integrated functions and mining at multiple levels
22
Multi-Dimensional View of Data Mining
¨ Data to be mined
¤ Database data (extended-relational, object-oriented, heterogeneous), data
warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text
and web, multi-media, graphs & social and information networks
¨ Knowledge to be mined (or: Data mining functions)
¤ Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, …
¤ Descriptive vs. predictive data mining
¤ Multiple/integrated functions and mining at multiple levels
¨ Techniques utilized
¤ Data warehousing (OLAP), machine learning, statistics, pattern recognition,
visualization, high-performance computing, etc.
23
Multi-Dimensional View of Data Mining
¨ Data to be mined
¤ Database data (extended-relational, object-oriented, heterogeneous), data
warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text
and web, multi-media, graphs & social and information networks
¨ Knowledge to be mined (or: Data mining functions)
¤ Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, …
¤ Descriptive vs. predictive data mining
¤ Multiple/integrated functions and mining at multiple levels
¨ Techniques utilized
¤ Data warehousing (OLAP), machine learning, statistics, pattern recognition,
visualization, high-performance computing, etc.
¨ Applications adapted
¤ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market
analysis, text mining, Web mining, etc.
24
Chapter 1. Introduction
¨ Why Data Mining?
¨ What Is Data Mining?
¨ A Multi-Dimensional View of Data Mining
¨ What Kinds of Data Can Be Mined?
¨ What Kinds of Patterns Can Be Mined?
¨ What Kinds of Technologies Are Used?
¨ What Kinds of Applications Are Targeted?
¨ Major Issues in Data Mining
¨ A Brief History of Data Mining and Data Mining Society
¨ Summary
25
Data Mining: On What Kinds of Data?
¨ Database-oriented data sets and applications
¤ Relational database, data warehouse, transactional database
¤ Object-relational databases, Heterogeneous databases and legacy
databases
¨ Advanced data sets and advanced applications
¤ Data streams and sensor data
¤ Time-series data, temporal data, sequence data (incl. bio-sequences)
¤ Structure data, graphs, social networks and information networks
¤ Spatial data and spatiotemporal data
¤ Multimedia database
¤ Text databases
¤ The World-Wide Web
26
Survey
Your Name, ID, Major
Question 1: What do you think Data Mining is?
Question 2: What project have you done so far that you think is most relevant to
Data Mining?
• Not necessarily research project; can be your course project or any hackathon
event you participated in.
Question 3: What do you expect to learn from this course?
Briefly answer each question with a few sentences.
27
Chapter 1. Introduction
¨ Why Data Mining?
¨ What Is Data Mining?
¨ A Multi-Dimensional View of Data Mining
¨ What Kinds of Data Can Be Mined?
¨ What Kinds of Patterns Can Be Mined?
¨ What Kinds of Technologies Are Used?
¨ What Kinds of Applications Are Targeted?
¨ Major Issues in Data Mining
¨ A Brief History of Data Mining and Data Mining Society
¨ Summary
28
Data Mining Functions: Pattern Discovery
¨ Frequent patterns
¤ What items do you frequently purchase together on Amazon?
29
Data Mining Functions: Pattern Discovery
¨ Frequent patterns
¤ What items do you frequently purchase together on Amazon?
¨ Association and Correlation Analysis
30
Data Mining Functions: Pattern Discovery
¨ Frequent patterns
¤ What items do you frequently purchase together on Amazon?
¨ Association and Correlation Analysis
q A typical association rule
q Diaper à Beer [0.5%, 75%] (support, confidence)
q How to mine such patterns and rules efficiently in large datasets?
q How to use such patterns for classification, clustering, and other applications?
q More: friend recommendation, motif discovery, malware detection, fraud
detection, etc.
31
Data Mining Functions: Classification
¨ Classification and label prediction
¤ Construct models (functions) based on some training examples
¤ Describe and distinguish classes or concepts for future prediction
n Ex. 1. Classify countries based on (climate)
n Ex. 2. Classify cars based on (gas mileage)
¤ Predict some unknown class labels
32
Data Mining Functions: Classification
¨ Classification and label prediction
¤ Construct models (functions) based on some training examples
¤ Describe and distinguish classes or concepts for future prediction
n Ex. 1. Classify countries based on (climate)
n Ex. 2. Classify cars based on (gas mileage)
¤ Predict some unknown class labels
¨ Typical methods
¤ Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
33
Data Mining Functions: Classification
¨ Classification and label prediction
¤ Construct models (functions) based on some training examples
¤ Describe and distinguish classes or concepts for future prediction
n Ex. 1. Classify countries based on (climate)
n Ex. 2. Classify cars based on (gas mileage)
¤ Predict some unknown class labels
¨ Typical methods
¤ Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
¨ Typical applications:
¤ Credit card fraud detection, direct marketing, classifying stars,
diseases, web pages, …
34
Data Mining Functions: Cluster Analysis
¨ Unsupervised learning (i.e., Class
label is unknown)
¨ Group data to form new
categories (i.e., clusters), e.g.,
cluster houses to find distribution
patterns
35
Data Mining Functions: Cluster Analysis
¨ Unsupervised learning (i.e., Class
label is unknown)
¨ Group data to form new
categories (i.e., clusters), e.g.,
cluster houses to find distribution
patterns
¨ Principle: Maximizing intra-class
similarity & minimizing interclass
similarity
¨ Many methods and applications
36
Data Mining Functions: Outlier Analysis
¨ Outlier analysis
¤ Outlier: A data object that does not comply with the
general behavior of the data
¤ Noise or exception?―One person’s garbage could
be another person’s treasure
37
Data Mining Functions: Outlier Analysis
¨ Outlier analysis
¤ Outlier: A data object that does not comply with the
general behavior of the data
¤ Noise or exception?―One person’s garbage could
be another person’s treasure
¤ Methods: by product of clustering or regression
analysis, …
¤ Useful in fraud detection, rare events analysis
38
Data Mining Functions: Time and Ordering:
Sequential Pattern, Trend and Evolution Analysis
¨ Sequence, trend and evolution analysis
¤ Trend, time-series, and deviation analysis
n e.g., regression and value prediction
¤ Sequential pattern mining
n e.g., buy digital camera, then buy large
memory cards
¤ Periodicity analysis
¤ Motifs and biological sequence analysis
n Approximate and consecutive motifs
¤ Similarity-based analysis
¨ Mining data streams
¤ Ordered, time-varying, potentially infinite, data
streams
39
Data Mining Functions: Structure and
Network Analysis
¨ Graph mining
¤ Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
40
Data Mining Functions: Structure and
Network Analysis
¨ Graph mining
¤ Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
¨ Information network analysis
¤ Social networks: actors (objects, nodes) and relationships (edges)
n e.g., author networks in CS, terrorist networks
¤ Multiple heterogeneous networks
n A person could be multiple information networks: friends, family,
classmates, …
¤ Knowledge graphs: knowledge backbone of AI systems
41
Data Mining Functions: Structure and
Network Analysis
¨ Graph mining
¤ Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
¨ Information network analysis
¤ Social networks: actors (objects, nodes) and relationships (edges)
n e.g., author networks in CS, terrorist networks
¤ Multiple heterogeneous networks
n A person could be multiple information networks: friends, family,
classmates, …
¤ Knowledge graphs: knowledge backbone of AI systems
¨ Web mining
¤ Web is a big information network: from PageRank to Google
¤ Analysis of Web information networks
n Web community discovery, opinion mining, usage mining, …
42
Evaluation of Knowledge
¨ Are all mined knowledge interesting?
¤ One can mine tremendous amounts of “patterns”
¤ Some may fit only certain dimension space (time, location,
…)
¤ Some may not be representative, may be transient, …
43
Evaluation of Knowledge
¨ Are all mined knowledge interesting?
¤ One can mine tremendous amount of “patterns”
¤ Some may fit only certain dimension space (time, location, …)
¤ Some may not be representative, may be transient, …
¨ Evaluation of mined knowledge → directly mine only interesting
knowledge?
¤ Descriptive vs. predictive
¤ Coverage
¤ Typicality vs. novelty
¤ Accuracy
¤ Timeliness
¤ …
44
Chapter 1. Introduction
¨ Why Data Mining?
¨ What Is Data Mining?
¨ A Multi-Dimensional View of Data Mining
¨ What Kinds of Data Can Be Mined?
¨ What Kinds of Patterns Can Be Mined?
¨ What Kinds of Technologies Are Used?
¨ What Kinds of Applications Are Targeted?
¨ Major Issues in Data Mining
¨ A Brief History of Data Mining and Data Mining Society
¨ Summary
45
Data Mining: Confluence of Multiple Disciplines
Data Mining
Machine
Learning
Statistics
Applications
Algorithm
Pattern
Recognition
High-Performance
Computing
Visualization
Database
Technology
46
Why Confluence of Multiple Disciplines?
¨ Tremendous amount of data
¤ Algorithms must be scalable to handle big data
¨ High-dimensionality of data
¤ Micro-array may have tens of thousands of dimensions
¨ High complexity of data
¤ Data streams and sensor data
¤ Time-series data, temporal data, sequence data
¤ Structure data, graphs, social and information networks
¤ Spatial, spatiotemporal, multimedia, text and Web data
¤ Software programs, scientific simulations
¨ New and sophisticated applications
47
Chapter 1. Introduction
¨ Why Data Mining?
¨ What Is Data Mining?
¨ A Multi-Dimensional View of Data Mining
¨ What Kinds of Data Can Be Mined?
¨ What Kinds of Patterns Can Be Mined?
¨ What Kinds of Technologies Are Used?
¨ What Kinds of Applications Are Targeted?
¨ Major Issues in Data Mining
¨ A Brief History of Data Mining and Data Mining Society
¨ Summary
48
Applications of Data Mining
¨ Web page analysis: classification, clustering, ranking
¨ Collaborative analysis & recommender systems
¨ Biological and medical data analysis
¨ Data mining and software engineering
¨ Data mining and text analysis
¨ Data mining and social and information network analysis
¨ Built-in (invisible data mining) functions in Google, MS, Yahoo!,
Linked, Facebook, …
49
Chapter 1. Introduction
¨ Why Data Mining?
¨ What Is Data Mining?
¨ A Multi-Dimensional View of Data Mining
¨ What Kinds of Data Can Be Mined?
¨ What Kinds of Patterns Can Be Mined?
¨ What Kinds of Technologies Are Used?
¨ What Kinds of Applications Are Targeted?
¨ Major Issues in Data Mining
¨ A Brief History of Data Mining and Data Mining Society
¨ Summary
50
Major Issues in Data Mining (1)
¨ Mining Methodology
¤ Mining various and new kinds of knowledge
¤ Mining knowledge in multi-dimensional space
¤ Data mining: An interdisciplinary effort
¤ Boosting the power of discovery in a networked environment
¤ Handling noise, uncertainty, and incompleteness of data
¤ Pattern evaluation and pattern- or constraint-guided mining
51
Major Issues in Data Mining (1)
¨ Mining Methodology
¤ Mining various and new kinds of knowledge
¤ Mining knowledge in multi-dimensional space
¤ Data mining: An interdisciplinary effort
¤ Boosting the power of discovery in a networked environment
¤ Handling noise, uncertainty, and incompleteness of data
¤ Pattern evaluation and pattern- or constraint-guided mining
¨ User Interaction & Human-Machine Collaboration
¤ Interactive mining
¤ Incorporation of background knowledge
¤ Presentation and visualization of data mining results
52
¨ Efficiency and Scalability
¤ Efficiency and scalability of data mining algorithms
¤ Parallel, distributed, stream, and incremental mining methods
¨ Diversity of data types
¤ Handling complex types of data
¤ Mining dynamic, networked, and global data repositories
¨ Data mining and society
¤ Social impacts of data mining
¤ Privacy-preserving data mining
Major Issues in Data Mining (2)
53
Chapter 1. Introduction
¨ Why Data Mining?
¨ What Is Data Mining?
¨ A Multi-Dimensional View of Data Mining
¨ What Kinds of Data Can Be Mined?
¨ What Kinds of Patterns Can Be Mined?
¨ What Kinds of Technologies Are Used?
¨ What Kinds of Applications Are Targeted?
¨ Major Issues in Data Mining
¨ A Brief History of Data Mining and Data Mining Society
¨ Summary
54
A Brief History of Data Mining Society
¨ 1989 IJCAI Workshop on Knowledge Discovery in Databases
¤ Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W.
Frawley, 1991)
¨ 1991-1994 Workshops on Knowledge Discovery in Databases
¤ Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
¨ 1995-1998 International Conferences on Knowledge Discovery in
Databases and Data Mining (KDD’95-98)
¤ Journal of Data Mining and Knowledge Discovery (1997)
¨ ACM SIGKDD conferences since 1998 and SIGKDD Explorations
¨ More conferences on data mining
¤ PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), WSDM (2008), etc.
¨ ACM Transactions on KDD (2007)
55
Conferences and Journals on Data Mining
¨ KDD Conferences
¤ ACM SIGKDD Int. Conf. on
Knowledge Discovery in
Databases and Data Mining (KDD)
¤ SIAM Data Mining Conf. (SDM)
¤ (IEEE) Int. Conf. on Data Mining
(ICDM)
¤ European Conf. on Machine
Learning and Principles and
practices of Knowledge Discovery
and Data Mining (ECML-PKDD)
¤ Pacific-Asia Conf. on Knowledge
Discovery and Data Mining
(PAKDD)
¤ Int. Conf. on Web Search and
Data Mining (WSDM)
n Other related conferences
n DB conferences: ACM SIGMOD,
VLDB, ICDE, EDBT, ICDT, …
n Web and IR conferences:
WWW, SIGIR, WSDM
n ML conferences: ICML, NIPS
n PR conferences: CVPR,
n Journals
n Data Mining and Knowledge
Discovery (DAMI or DMKD)
n IEEE Trans. On Knowledge and
Data Eng. (TKDE)
n KDD Explorations
n ACM Trans. on KDD
56
Where to Find References? DBLP, CiteSeer, Google
¨ Data mining and KDD (SIGKDD)
¤ Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
¤ Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM
TKDD
¨ Database systems (SIGMOD)
¤ Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT,
DASFAA
¤ Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
¨ AI & Machine Learning
¤ Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory),
CVPR, NIPS, etc.
¤ Journals: Machine Learning, Artificial Intelligence, Knowledge and
Information Systems, IEEE-PAMI, etc.
57
Where to Find References? DBLP, CiteSeer, Google
¨ Web and IR
¤ Conferences: SIGIR, WWW, CIKM, etc.
¤ Journals: WWW: Internet and Web Information Systems
¨ Statistics
¤ Conferences: Joint Stat. Meeting, etc.
¤ Journals: Annals of statistics, etc.
¨ Visualization
¤ Conference proceedings: CHI, ACM-SIGGraph, etc.
¤ Journals: IEEE Trans. visualization and computer graphics, etc.
58
Future of Data Science
Figure from: https://www.datasciencecentral.com/profiles/blogs/difference-
of-data-science-machine-learning-and-data-mining
https://www.youtube.com/watc
h?v=hxXIJnjC_HI (Future of
Data Science @ Stanford)
Related events in OSU:
DataFest
Hackathon
Conduct research in labs
59
Chapter 1. Introduction
¨ Why Data Mining?
¨ What Is Data Mining?
¨ A Multi-Dimensional View of Data Mining
¨ What Kinds of Data Can Be Mined?
¨ What Kinds of Patterns Can Be Mined?
¨ What Kinds of Technologies Are Used?
¨ What Kinds of Applications Are Targeted?
¨ Major Issues in Data Mining
¨ A Brief History of Data Mining and Data Mining Society
¨ Summary
60
Summary
¨ Data mining: Discovering interesting patterns and knowledge from
massive amount of data
¨ A natural evolution of science and information technology, in great
demand, with wide applications
¨ A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
¨ Mining can be performed in a variety of data
¨ Data mining functionalities: characterization, discrimination,
association, classification, clustering, trend and outlier analysis, etc.
¨ Data mining technologies and applications
¨ Major issues in data mining
61
Recommended Reference Books
¨ Charu C. Aggarwal, Data Mining: The Textbook, Springer, 2015
¨ E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011
¨ R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-
Interscience, 2000
¨ U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining
and Knowledge Discovery, Morgan Kaufmann, 2001
¨ J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan
Kaufmann, 3rd ed. , 2011
¨ T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning:
Data Mining, Inference, and Prediction, 2nd ed., Springer, 2009
¨ T. M. Mitchell, Machine Learning, McGraw Hill, 1997
¨ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
(2nd ed. 2016)
¨ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005
¨ Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis:
Fundamental Concepts and Algorithms 2014
62
Future of Data Science
Figure from: https://www.datasciencecentral.com/profiles/blogs/difference-
of-data-science-machine-learning-and-data-mining
https://www.youtube.com/watc
h?v=hxXIJnjC_HI
Related events in OSU:
DataFest
Hackathon
Conduct research in labs

More Related Content

Similar to 01datamining.pdf

Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptSangrangBargayary3
 
Data mining Introduction
Data mining IntroductionData mining Introduction
Data mining IntroductionVijayasankariS
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introductionbutest
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesasnaparveen414
 
Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1Mahmoud Alfarra
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Data Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesData Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesDeepaR42
 
datamining_Lecture_1(introduction).pptx
datamining_Lecture_1(introduction).pptxdatamining_Lecture_1(introduction).pptx
datamining_Lecture_1(introduction).pptxHASHEMHASH
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learningGiuseppe Manco
 

Similar to 01datamining.pdf (20)

01 intro
01 intro01 intro
01 intro
 
Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .ppt
 
Introduction to data warehouse
Introduction to data warehouseIntroduction to data warehouse
Introduction to data warehouse
 
Data mining Introduction
Data mining IntroductionData mining Introduction
Data mining Introduction
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notes
 
Data Mining
Data MiningData Mining
Data Mining
 
Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1
 
Unit 1
Unit 1Unit 1
Unit 1
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Cs501 dm intro
Cs501 dm introCs501 dm intro
Cs501 dm intro
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
10 problems 06
10 problems 0610 problems 06
10 problems 06
 
Data Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesData Mining : Concepts and Techniques
Data Mining : Concepts and Techniques
 
datamining_Lecture_1(introduction).pptx
datamining_Lecture_1(introduction).pptxdatamining_Lecture_1(introduction).pptx
datamining_Lecture_1(introduction).pptx
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learning
 
isd314-01
isd314-01isd314-01
isd314-01
 
Big dataorig
Big dataorigBig dataorig
Big dataorig
 
Dwdm
DwdmDwdm
Dwdm
 

Recently uploaded

VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 

Recently uploaded (20)

VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 

01datamining.pdf

  • 1. CSE5243 INTRO. TO DATA MINING Chapter 1. Introduction Yu Su, CSE@The Ohio State University Slides adapted from UIUC CS412 by Prof. Jiawei Han and OSU CSE5243 by Prof. Huan Sun
  • 2. 2 CSE 5243. Course Page & Schedule ¨ Class Homepage: https://ysu1989.github.io/courses/sp20/cse5243/ ¨ Class Schedule: 9:35-10:55 AM, Wed/Fri, Caldwell Lab 171 ¨ Office hours: ¤ Instructor: Yu Su @ DL783, Fri 11:00am-12:15pm (right after class) First week: No office hours ¤ TA: Jiaqi Xu (xu.1629), Wed 03:00pm-04:00pm, Baker 406
  • 3. 3 CSE 5243. Textbook ¨ Recommended but not required ¨ (Primary) Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques (3rd ed), 2011 ¤ More resources: https://wiki.illinois.edu/wiki/display/cs412/2.+Course+Syllabus+and +Schedule ¨ (Primary) Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, 2006 ¨ (Supplementary) Mohammed J. Zaki and Wagner Meira, Jr., Data Mining Analysis and Concepts, 2014 ¨ (Supplementary) Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining of Massive Datasets ¤ More resources: http://www.mmds.org/
  • 4. 4 CSE 5243. Course Work and Grading ¨ Homework, Course Projects, and Exams ¤ Participation: 10% (Online discussion and/or class participation) ¤ Homework: 50% (No Late Submissions!) ¤ Midterm exam: 20% ¤ Final exam: 20% ¨ Need help and/or discussions? ¤ Carmen: https://osu.instructure.com/courses/76423/discussion_topics n Receive credits: answer questions on Carmen and engage in class discussion. ¨ Check your homework/exam scores ¤ Carmen: https://osu.instructure.com/courses/76423/gradebook
  • 5. 5 Videos ¨ 10 TED talks on Big Data and Analytics ¤ https://www.promptcloud.com/blog/top-ted-talks-on-big-data/ ¤ Shyan Sanker (Director at Palantir Technologies): n https://www.youtube.com/watch?time_continue=19&v=ltelQ3iKybU ¨ 5 TED talks on Data analytics for business leaders ¤ https://bigdata-madesimple.com/5-best-ted-talks-on-data-analytics-for-business- leaders/ ¨ Data analytics for beginners ¤ https://www.youtube.com/watch?v=66ko_cWSHBU (If you love sports, this TED Talk on data analytics is going to be an interesting watch)
  • 6. 6 Chapter 1. Introduction ¨ What is Data Mining? ¨ Why Data Mining? ¨ A Multi-Dimensional View of Data Mining ¨ What Kinds of Data Can Be Mined? ¨ What Kinds of Patterns Can Be Mined? ¨ What Kinds of Technologies Are Used? ¨ What Kinds of Applications Are Targeted? ¨ Major Issues in Data Mining ¨ A Brief History of Data Mining and Data Mining Society ¨ Summary
  • 7. 7 What is Data Mining? ¨ Data mining (knowledge discovery from data, KDD) ¤ Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amount of data ¨ Alternative names ¤ Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  • 8. 8 What is Data Mining? ¨ Data mining (knowledge discovery from data, KDD) ¤ Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amount of data ¨ Alternative names ¤ Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. One of the best conferences to publish your research work: SIGKDD (check resources)
  • 9. 9 Knowledge Discovery (KDD) Process ¨ (Narrow view) Data mining plays an essential role in the knowledge discovery process ¨ (Broad view) Data mining is the knowledge discovery process Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation
  • 10. 10 Example: A Web Mining Framework ¨ Web mining usually involves ¤ Data crawling and cleaning ¤ Data integration from multiple sources ¤ (Optional) Warehousing the data ¤ (Optional) Data cube construction ¤ Data selection for data mining ¤ Data mining ¤ Presentation of the mining results ¤ Patterns and knowledge to be used or stored into knowledge base
  • 11. 12 KDD Process: A View from ML and Statistics ¨ This is a view from typical machine learning and statistics communities Input Data Data Mining Data Pre- Processing Post- Processing Data integration Normalization Feature selection Dimension reduction Pattern discovery Classification Clustering Outlier analysis … … … … Pattern evaluation Pattern selection Pattern interpretation Pattern visualization
  • 12. 13 Data Science Figure from: https://www.datasciencecentral.com/profiles/blogs/difference- of-data-science-machine-learning-and-data-mining
  • 13. 14 Chapter 1. Introduction ¨ What Is Data Mining? ¨ Why Data Mining? ¨ A Multi-Dimensional View of Data Mining ¨ What Kinds of Data Can Be Mined? ¨ What Kinds of Patterns Can Be Mined? ¨ What Kinds of Technologies Are Used? ¨ What Kinds of Applications Are Targeted? ¨ Major Issues in Data Mining ¨ A Brief History of Data Mining and Data Mining Society ¨ Summary
  • 14. 15 Why Data Mining? ¨ The Explosive Growth of Data: from terabytes to petabytes ¤ Data collection and data availability n Automated data collection tools, database systems, Web, computerized society
  • 15. 16 Why Data Mining? ¨ The Explosive Growth of Data: from terabytes to petabytes ¤ Data collection and data availability n Automated data collection tools, database systems, Web, computerized society ¤ Major sources of data n Business: Web, e-commerce, transactions, stocks, … n Science: Remote sensing, bioinformatics, scientific simulation, … n Society and everyone: news, digital cameras, YouTube
  • 16. 17 “How much data is generated each day?” – World Economic Forum
  • 17. 18 Why Data Mining? ¨ The Explosive Growth of Data: from terabytes to petabytes ¤ Data collection and data availability n Automated data collection tools, database systems, Web, computerized society ¤ Major sources of data n Business: Web, e-commerce, transactions, stocks, … n Science: Remote sensing, bioinformatics, scientific simulation, … n Society and everyone: news, digital cameras, YouTube ¨ We are drowning in data, but starving for knowledge! ¨ “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets
  • 18. 19 Chapter 1. Introduction ¨ Why Data Mining? ¨ What Is Data Mining? ¨ A Multi-Dimensional View of Data Mining ¨ What Kinds of Data Can Be Mined? ¨ What Kinds of Patterns Can Be Mined? ¨ What Kinds of Technologies Are Used? ¨ What Kinds of Applications Are Targeted? ¨ Major Issues in Data Mining ¨ A Brief History of Data Mining and Data Mining Society ¨ Summary
  • 19. 20 Multi-Dimensional View of Data Mining ¨ Data to be mined ¤ Database data (extended-relational, object-oriented, heterogeneous), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
  • 20. 21 Multi-Dimensional View of Data Mining ¨ Data to be mined ¤ Database data (extended-relational, object-oriented, heterogeneous), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks ¨ Knowledge to be mined (or: Data mining functions) ¤ Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, … ¤ Descriptive vs. predictive data mining ¤ Multiple/integrated functions and mining at multiple levels
  • 21. 22 Multi-Dimensional View of Data Mining ¨ Data to be mined ¤ Database data (extended-relational, object-oriented, heterogeneous), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks ¨ Knowledge to be mined (or: Data mining functions) ¤ Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, … ¤ Descriptive vs. predictive data mining ¤ Multiple/integrated functions and mining at multiple levels ¨ Techniques utilized ¤ Data warehousing (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance computing, etc.
  • 22. 23 Multi-Dimensional View of Data Mining ¨ Data to be mined ¤ Database data (extended-relational, object-oriented, heterogeneous), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks ¨ Knowledge to be mined (or: Data mining functions) ¤ Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, … ¤ Descriptive vs. predictive data mining ¤ Multiple/integrated functions and mining at multiple levels ¨ Techniques utilized ¤ Data warehousing (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance computing, etc. ¨ Applications adapted ¤ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
  • 23. 24 Chapter 1. Introduction ¨ Why Data Mining? ¨ What Is Data Mining? ¨ A Multi-Dimensional View of Data Mining ¨ What Kinds of Data Can Be Mined? ¨ What Kinds of Patterns Can Be Mined? ¨ What Kinds of Technologies Are Used? ¨ What Kinds of Applications Are Targeted? ¨ Major Issues in Data Mining ¨ A Brief History of Data Mining and Data Mining Society ¨ Summary
  • 24. 25 Data Mining: On What Kinds of Data? ¨ Database-oriented data sets and applications ¤ Relational database, data warehouse, transactional database ¤ Object-relational databases, Heterogeneous databases and legacy databases ¨ Advanced data sets and advanced applications ¤ Data streams and sensor data ¤ Time-series data, temporal data, sequence data (incl. bio-sequences) ¤ Structure data, graphs, social networks and information networks ¤ Spatial data and spatiotemporal data ¤ Multimedia database ¤ Text databases ¤ The World-Wide Web
  • 25. 26 Survey Your Name, ID, Major Question 1: What do you think Data Mining is? Question 2: What project have you done so far that you think is most relevant to Data Mining? • Not necessarily research project; can be your course project or any hackathon event you participated in. Question 3: What do you expect to learn from this course? Briefly answer each question with a few sentences.
  • 26. 27 Chapter 1. Introduction ¨ Why Data Mining? ¨ What Is Data Mining? ¨ A Multi-Dimensional View of Data Mining ¨ What Kinds of Data Can Be Mined? ¨ What Kinds of Patterns Can Be Mined? ¨ What Kinds of Technologies Are Used? ¨ What Kinds of Applications Are Targeted? ¨ Major Issues in Data Mining ¨ A Brief History of Data Mining and Data Mining Society ¨ Summary
  • 27. 28 Data Mining Functions: Pattern Discovery ¨ Frequent patterns ¤ What items do you frequently purchase together on Amazon?
  • 28. 29 Data Mining Functions: Pattern Discovery ¨ Frequent patterns ¤ What items do you frequently purchase together on Amazon? ¨ Association and Correlation Analysis
  • 29. 30 Data Mining Functions: Pattern Discovery ¨ Frequent patterns ¤ What items do you frequently purchase together on Amazon? ¨ Association and Correlation Analysis q A typical association rule q Diaper à Beer [0.5%, 75%] (support, confidence) q How to mine such patterns and rules efficiently in large datasets? q How to use such patterns for classification, clustering, and other applications? q More: friend recommendation, motif discovery, malware detection, fraud detection, etc.
  • 30. 31 Data Mining Functions: Classification ¨ Classification and label prediction ¤ Construct models (functions) based on some training examples ¤ Describe and distinguish classes or concepts for future prediction n Ex. 1. Classify countries based on (climate) n Ex. 2. Classify cars based on (gas mileage) ¤ Predict some unknown class labels
  • 31. 32 Data Mining Functions: Classification ¨ Classification and label prediction ¤ Construct models (functions) based on some training examples ¤ Describe and distinguish classes or concepts for future prediction n Ex. 1. Classify countries based on (climate) n Ex. 2. Classify cars based on (gas mileage) ¤ Predict some unknown class labels ¨ Typical methods ¤ Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern- based classification, logistic regression, …
  • 32. 33 Data Mining Functions: Classification ¨ Classification and label prediction ¤ Construct models (functions) based on some training examples ¤ Describe and distinguish classes or concepts for future prediction n Ex. 1. Classify countries based on (climate) n Ex. 2. Classify cars based on (gas mileage) ¤ Predict some unknown class labels ¨ Typical methods ¤ Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern- based classification, logistic regression, … ¨ Typical applications: ¤ Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …
  • 33. 34 Data Mining Functions: Cluster Analysis ¨ Unsupervised learning (i.e., Class label is unknown) ¨ Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
  • 34. 35 Data Mining Functions: Cluster Analysis ¨ Unsupervised learning (i.e., Class label is unknown) ¨ Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns ¨ Principle: Maximizing intra-class similarity & minimizing interclass similarity ¨ Many methods and applications
  • 35. 36 Data Mining Functions: Outlier Analysis ¨ Outlier analysis ¤ Outlier: A data object that does not comply with the general behavior of the data ¤ Noise or exception?―One person’s garbage could be another person’s treasure
  • 36. 37 Data Mining Functions: Outlier Analysis ¨ Outlier analysis ¤ Outlier: A data object that does not comply with the general behavior of the data ¤ Noise or exception?―One person’s garbage could be another person’s treasure ¤ Methods: by product of clustering or regression analysis, … ¤ Useful in fraud detection, rare events analysis
  • 37. 38 Data Mining Functions: Time and Ordering: Sequential Pattern, Trend and Evolution Analysis ¨ Sequence, trend and evolution analysis ¤ Trend, time-series, and deviation analysis n e.g., regression and value prediction ¤ Sequential pattern mining n e.g., buy digital camera, then buy large memory cards ¤ Periodicity analysis ¤ Motifs and biological sequence analysis n Approximate and consecutive motifs ¤ Similarity-based analysis ¨ Mining data streams ¤ Ordered, time-varying, potentially infinite, data streams
  • 38. 39 Data Mining Functions: Structure and Network Analysis ¨ Graph mining ¤ Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
  • 39. 40 Data Mining Functions: Structure and Network Analysis ¨ Graph mining ¤ Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments) ¨ Information network analysis ¤ Social networks: actors (objects, nodes) and relationships (edges) n e.g., author networks in CS, terrorist networks ¤ Multiple heterogeneous networks n A person could be multiple information networks: friends, family, classmates, … ¤ Knowledge graphs: knowledge backbone of AI systems
  • 40. 41 Data Mining Functions: Structure and Network Analysis ¨ Graph mining ¤ Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments) ¨ Information network analysis ¤ Social networks: actors (objects, nodes) and relationships (edges) n e.g., author networks in CS, terrorist networks ¤ Multiple heterogeneous networks n A person could be multiple information networks: friends, family, classmates, … ¤ Knowledge graphs: knowledge backbone of AI systems ¨ Web mining ¤ Web is a big information network: from PageRank to Google ¤ Analysis of Web information networks n Web community discovery, opinion mining, usage mining, …
  • 41. 42 Evaluation of Knowledge ¨ Are all mined knowledge interesting? ¤ One can mine tremendous amounts of “patterns” ¤ Some may fit only certain dimension space (time, location, …) ¤ Some may not be representative, may be transient, …
  • 42. 43 Evaluation of Knowledge ¨ Are all mined knowledge interesting? ¤ One can mine tremendous amount of “patterns” ¤ Some may fit only certain dimension space (time, location, …) ¤ Some may not be representative, may be transient, … ¨ Evaluation of mined knowledge → directly mine only interesting knowledge? ¤ Descriptive vs. predictive ¤ Coverage ¤ Typicality vs. novelty ¤ Accuracy ¤ Timeliness ¤ …
  • 43. 44 Chapter 1. Introduction ¨ Why Data Mining? ¨ What Is Data Mining? ¨ A Multi-Dimensional View of Data Mining ¨ What Kinds of Data Can Be Mined? ¨ What Kinds of Patterns Can Be Mined? ¨ What Kinds of Technologies Are Used? ¨ What Kinds of Applications Are Targeted? ¨ Major Issues in Data Mining ¨ A Brief History of Data Mining and Data Mining Society ¨ Summary
  • 44. 45 Data Mining: Confluence of Multiple Disciplines Data Mining Machine Learning Statistics Applications Algorithm Pattern Recognition High-Performance Computing Visualization Database Technology
  • 45. 46 Why Confluence of Multiple Disciplines? ¨ Tremendous amount of data ¤ Algorithms must be scalable to handle big data ¨ High-dimensionality of data ¤ Micro-array may have tens of thousands of dimensions ¨ High complexity of data ¤ Data streams and sensor data ¤ Time-series data, temporal data, sequence data ¤ Structure data, graphs, social and information networks ¤ Spatial, spatiotemporal, multimedia, text and Web data ¤ Software programs, scientific simulations ¨ New and sophisticated applications
  • 46. 47 Chapter 1. Introduction ¨ Why Data Mining? ¨ What Is Data Mining? ¨ A Multi-Dimensional View of Data Mining ¨ What Kinds of Data Can Be Mined? ¨ What Kinds of Patterns Can Be Mined? ¨ What Kinds of Technologies Are Used? ¨ What Kinds of Applications Are Targeted? ¨ Major Issues in Data Mining ¨ A Brief History of Data Mining and Data Mining Society ¨ Summary
  • 47. 48 Applications of Data Mining ¨ Web page analysis: classification, clustering, ranking ¨ Collaborative analysis & recommender systems ¨ Biological and medical data analysis ¨ Data mining and software engineering ¨ Data mining and text analysis ¨ Data mining and social and information network analysis ¨ Built-in (invisible data mining) functions in Google, MS, Yahoo!, Linked, Facebook, …
  • 48. 49 Chapter 1. Introduction ¨ Why Data Mining? ¨ What Is Data Mining? ¨ A Multi-Dimensional View of Data Mining ¨ What Kinds of Data Can Be Mined? ¨ What Kinds of Patterns Can Be Mined? ¨ What Kinds of Technologies Are Used? ¨ What Kinds of Applications Are Targeted? ¨ Major Issues in Data Mining ¨ A Brief History of Data Mining and Data Mining Society ¨ Summary
  • 49. 50 Major Issues in Data Mining (1) ¨ Mining Methodology ¤ Mining various and new kinds of knowledge ¤ Mining knowledge in multi-dimensional space ¤ Data mining: An interdisciplinary effort ¤ Boosting the power of discovery in a networked environment ¤ Handling noise, uncertainty, and incompleteness of data ¤ Pattern evaluation and pattern- or constraint-guided mining
  • 50. 51 Major Issues in Data Mining (1) ¨ Mining Methodology ¤ Mining various and new kinds of knowledge ¤ Mining knowledge in multi-dimensional space ¤ Data mining: An interdisciplinary effort ¤ Boosting the power of discovery in a networked environment ¤ Handling noise, uncertainty, and incompleteness of data ¤ Pattern evaluation and pattern- or constraint-guided mining ¨ User Interaction & Human-Machine Collaboration ¤ Interactive mining ¤ Incorporation of background knowledge ¤ Presentation and visualization of data mining results
  • 51. 52 ¨ Efficiency and Scalability ¤ Efficiency and scalability of data mining algorithms ¤ Parallel, distributed, stream, and incremental mining methods ¨ Diversity of data types ¤ Handling complex types of data ¤ Mining dynamic, networked, and global data repositories ¨ Data mining and society ¤ Social impacts of data mining ¤ Privacy-preserving data mining Major Issues in Data Mining (2)
  • 52. 53 Chapter 1. Introduction ¨ Why Data Mining? ¨ What Is Data Mining? ¨ A Multi-Dimensional View of Data Mining ¨ What Kinds of Data Can Be Mined? ¨ What Kinds of Patterns Can Be Mined? ¨ What Kinds of Technologies Are Used? ¨ What Kinds of Applications Are Targeted? ¨ Major Issues in Data Mining ¨ A Brief History of Data Mining and Data Mining Society ¨ Summary
  • 53. 54 A Brief History of Data Mining Society ¨ 1989 IJCAI Workshop on Knowledge Discovery in Databases ¤ Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991) ¨ 1991-1994 Workshops on Knowledge Discovery in Databases ¤ Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996) ¨ 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98) ¤ Journal of Data Mining and Knowledge Discovery (1997) ¨ ACM SIGKDD conferences since 1998 and SIGKDD Explorations ¨ More conferences on data mining ¤ PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM (2008), etc. ¨ ACM Transactions on KDD (2007)
  • 54. 55 Conferences and Journals on Data Mining ¨ KDD Conferences ¤ ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD) ¤ SIAM Data Mining Conf. (SDM) ¤ (IEEE) Int. Conf. on Data Mining (ICDM) ¤ European Conf. on Machine Learning and Principles and practices of Knowledge Discovery and Data Mining (ECML-PKDD) ¤ Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD) ¤ Int. Conf. on Web Search and Data Mining (WSDM) n Other related conferences n DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, … n Web and IR conferences: WWW, SIGIR, WSDM n ML conferences: ICML, NIPS n PR conferences: CVPR, n Journals n Data Mining and Knowledge Discovery (DAMI or DMKD) n IEEE Trans. On Knowledge and Data Eng. (TKDE) n KDD Explorations n ACM Trans. on KDD
  • 55. 56 Where to Find References? DBLP, CiteSeer, Google ¨ Data mining and KDD (SIGKDD) ¤ Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc. ¤ Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD ¨ Database systems (SIGMOD) ¤ Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA ¤ Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc. ¨ AI & Machine Learning ¤ Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc. ¤ Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
  • 56. 57 Where to Find References? DBLP, CiteSeer, Google ¨ Web and IR ¤ Conferences: SIGIR, WWW, CIKM, etc. ¤ Journals: WWW: Internet and Web Information Systems ¨ Statistics ¤ Conferences: Joint Stat. Meeting, etc. ¤ Journals: Annals of statistics, etc. ¨ Visualization ¤ Conference proceedings: CHI, ACM-SIGGraph, etc. ¤ Journals: IEEE Trans. visualization and computer graphics, etc.
  • 57. 58 Future of Data Science Figure from: https://www.datasciencecentral.com/profiles/blogs/difference- of-data-science-machine-learning-and-data-mining https://www.youtube.com/watc h?v=hxXIJnjC_HI (Future of Data Science @ Stanford) Related events in OSU: DataFest Hackathon Conduct research in labs
  • 58. 59 Chapter 1. Introduction ¨ Why Data Mining? ¨ What Is Data Mining? ¨ A Multi-Dimensional View of Data Mining ¨ What Kinds of Data Can Be Mined? ¨ What Kinds of Patterns Can Be Mined? ¨ What Kinds of Technologies Are Used? ¨ What Kinds of Applications Are Targeted? ¨ Major Issues in Data Mining ¨ A Brief History of Data Mining and Data Mining Society ¨ Summary
  • 59. 60 Summary ¨ Data mining: Discovering interesting patterns and knowledge from massive amount of data ¨ A natural evolution of science and information technology, in great demand, with wide applications ¨ A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation ¨ Mining can be performed in a variety of data ¨ Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc. ¨ Data mining technologies and applications ¨ Major issues in data mining
  • 60. 61 Recommended Reference Books ¨ Charu C. Aggarwal, Data Mining: The Textbook, Springer, 2015 ¨ E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011 ¨ R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley- Interscience, 2000 ¨ U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001 ¨ J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed. , 2011 ¨ T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, 2009 ¨ T. M. Mitchell, Machine Learning, McGraw Hill, 1997 ¨ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005 (2nd ed. 2016) ¨ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005 ¨ Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms 2014
  • 61. 62 Future of Data Science Figure from: https://www.datasciencecentral.com/profiles/blogs/difference- of-data-science-machine-learning-and-data-mining https://www.youtube.com/watc h?v=hxXIJnjC_HI Related events in OSU: DataFest Hackathon Conduct research in labs