This document introduces the concept of data mining. It defines data mining as the process of discovering novel and useful patterns in large amounts of data. It discusses the growth of data from sources such as business, science, and digital media, and explains how data mining helps turn this mass of data into useful knowledge. As an example, it describes how patterns in search engine query data can be used to estimate flu activity. It outlines the evolution of database technology and the relationship between data mining, machine learning, and statistics. Finally, it summarizes the multi-step data mining process and the different types of data that can be mined.
3. Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
4. Data Mining
Why Data Mining
A search engine (e.g., Google) receives hundreds of millions of queries every day. Each query can be
viewed as a transaction where the user describes her or his information need.
What novel and useful knowledge can a search engine learn from such a huge collection of queries
collected from users over time? Some patterns found in user search queries can disclose invaluable
knowledge that cannot be obtained by reading individual data items alone.
For example, Google's Flu Trends uses specific search terms as indicators of flu activity. It found a
close relationship between the number of people who search for flu-related information and the
number of people who actually have flu symptoms. A pattern emerges when all of the search
queries related to flu are aggregated. Using aggregated Google search data, Flu Trends can estimate
flu activity up to two weeks faster than traditional systems can. This example shows how data
mining can turn a large collection of data into knowledge that can help meet a current global
challenge.
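The aggregation idea above can be illustrated with a minimal sketch. It is not how Flu Trends was implemented; the query log, the weekly case counts, and the keyword list are all invented for illustration. It counts flu-related queries per week and checks how closely that series tracks reported cases using a Pearson correlation.

```python
from collections import Counter
from math import sqrt

# Hypothetical query log: (week number, query text). Invented for illustration.
QUERY_LOG = [
    (1, "flu symptoms"), (1, "cheap flights"), (1, "fever remedy"),
    (2, "flu symptoms"), (2, "flu shot near me"), (2, "fever remedy"), (2, "weather"),
    (3, "flu symptoms"), (3, "flu symptoms"), (3, "flu shot near me"),
    (3, "fever remedy"), (3, "sore throat flu"),
    (4, "flu symptoms"), (4, "news"),
]
# Hypothetical reported flu cases per week (also invented).
REPORTED_CASES = {1: 20, 2: 35, 3: 60, 4: 12}
# Assumed indicator terms; a real system would select these from data.
FLU_TERMS = ("flu", "fever")


def weekly_flu_query_counts(log, terms=FLU_TERMS):
    """Aggregate individual queries into a per-week count of flu-related queries."""
    counts = Counter()
    for week, query in log:
        if any(term in query for term in terms):
            counts[week] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


counts = weekly_flu_query_counts(QUERY_LOG)
weeks = sorted(REPORTED_CASES)
query_series = [counts[w] for w in weeks]
case_series = [REPORTED_CASES[w] for w in weeks]
correlation = pearson(query_series, case_series)
print(query_series, case_series, round(correlation, 3))
```

No single query says anything about flu activity; the signal appears only once the queries are aggregated by week and compared with the case counts.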
5. Data Mining
Evolution of Database Technology
• 1960s:
• Data collection, database creation, IMS and network DBMS
• 1970s:
• Relational data model, relational DBMS implementation
• 1980s:
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
• Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global information systems
6. Data Mining
What Is Data Mining?
• Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Data mining: a misnomer?
• Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern
analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
• Simple search and query processing
• (Deductive) expert systems
7. Data Mining
What is (not) Data Mining?
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
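The contrast above can be sketched in a few lines. The directory records below are invented, and the frequency count is only a toy stand-in for real pattern discovery: the first function retrieves a record by a known key, while the second summarizes all records to surface something no single record states.

```python
from collections import Counter, defaultdict

# Hypothetical directory records: (surname, city, phone). Invented for illustration.
DIRECTORY = [
    ("O'Brien", "Boston", "555-0101"),
    ("O'Reilly", "Boston", "555-0102"),
    ("O'Brien", "Boston", "555-0103"),
    ("Smith", "Boston", "555-0104"),
    ("Smith", "Denver", "555-0105"),
    ("Garcia", "Denver", "555-0106"),
    ("Garcia", "Denver", "555-0107"),
    ("O'Brien", "Denver", "555-0108"),
]


def lookup_phone(surname, city, records=DIRECTORY):
    """Not data mining: return the stored values that match a known key."""
    return [phone for s, c, phone in records if s == surname and c == city]


def most_common_surname_by_city(records=DIRECTORY):
    """Closer to data mining: summarize all records to find a per-city pattern."""
    per_city = defaultdict(Counter)
    for surname, city, _phone in records:
        per_city[city][surname] += 1
    return {city: names.most_common(1)[0][0] for city, names in per_city.items()}


phones = lookup_phone("Smith", "Denver")
pattern = most_common_surname_by_city()
print(phones, pattern)
```

A more careful analysis would compare each name's share in a city with its share overall, rather than relying on raw counts.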
8. Data Mining
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
• Enormity of data
• High dimensionality of data
• Heterogeneous, distributed nature of data
[Figure: Data Mining shown at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]
9. Data Mining
Data Mining in Business Intelligence
[Figure: pyramid of business intelligence layers; potential to support business decisions increases toward the top]
Layers, from top to bottom:
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
10. Data Mining
Data Mining/KDD Process
Input Data → Data Pre-Processing → Data Mining → Post-Processing
• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
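The pre-processing, mining, and post-processing stages can be sketched end to end on a toy example. This is one possible instance under stated assumptions, not a prescribed implementation: the customer records are invented, min-max scaling stands in for normalization, a plain k-means with caller-chosen starting centroids stands in for clustering, and the within-cluster sum of squared errors stands in for pattern evaluation.

```python
# Hypothetical input data: (age, annual spend) per customer. Invented for illustration.
RAW = [(22, 300), (25, 350), (24, 280), (58, 4100), (61, 3900), (63, 4300)]


def min_max_normalize(rows):
    """Pre-processing: rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, centroids, iterations=10):
    """Data mining: plain k-means from caller-supplied starting centroids."""
    def assign(current):
        # Label each point with the index of its nearest centroid.
        return [min(range(len(current)), key=lambda i: squared_distance(p, current[i]))
                for p in points]

    for _ in range(iterations):
        labels = assign(centroids)
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old centroid if a cluster received no points.
            updated.append(
                tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            )
        centroids = updated
    return assign(centroids), centroids


def within_cluster_sse(points, labels, centroids):
    """Post-processing: evaluate the clustering (lower means tighter clusters)."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))


normalized = min_max_normalize(RAW)
labels, centroids = k_means(normalized, [normalized[0], normalized[-1]])
sse = within_cluster_sse(normalized, labels, centroids)
cluster_sizes = {label: labels.count(label) for label in set(labels)}
print(labels, cluster_sizes, round(sse, 4))
```

Without the normalization step the spend column, with its much larger numeric range, would dominate the distance computation, which is one reason pre-processing comes before mining.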
11. Data Mining
Multi-Dimensional View of Data Mining
• Data to be mined
• Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse,
transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media,
graphs & social and information networks
• Knowledge to be mined (or: Data mining functions)
• Characterization, discrimination, association, classification, clustering, trend/deviation, outlier
analysis, etc.
• Descriptive vs. predictive data mining
• Multiple/integrated functions and mining at multiple levels
• Techniques utilized
• Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition,
visualization, high-performance, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text
mining, Web mining, etc.
12. Data Mining
Data Mining & Machine Learning
According to Tom M. Mitchell, Chair of Machine Learning at Carnegie Mellon University and author of the book Machine Learning (McGraw-Hill):
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
This definition gives three elements for describing machine learning: Task (T), Experience (E), and Performance (P).
For a computer running a set of tasks, experience should lead to performance improvements (to satisfy the definition).
Many data mining tasks are executed successfully with the help of machine learning.
Source: Machine Learning: Hands-on for Developers and Technical Professionals, by Jason Bell, John Wiley & Sons
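The three elements of the definition can be made concrete with a small sketch. Everything here is an assumption chosen for illustration: the task T is deciding whether a number is "high", the experience E is a short list of labeled examples, and the performance measure P is accuracy on a fixed test set. The learner simply places its threshold midway between the largest "low" and the smallest "high" example seen so far.

```python
# Hidden rule the learner must approximate (hypothetical, chosen for illustration).
TRUE_THRESHOLD = 50


def true_label(x):
    """Ground truth for task T: is x 'high'?"""
    return x >= TRUE_THRESHOLD


# Experience E: examples in the order the learner sees them (invented values).
EXPERIENCE = [5, 60, 20, 95, 40, 52, 48]
# Fixed test set used by the performance measure P.
TEST_SET = range(100)


def learn_threshold(examples):
    """Midpoint between the largest 'low' and the smallest 'high' example seen."""
    lows = [x for x in examples if not true_label(x)]
    highs = [x for x in examples if true_label(x)]
    return (max(lows) + min(highs)) / 2


def accuracy(threshold):
    """Performance P: fraction of test inputs classified correctly."""
    correct = sum((x >= threshold) == true_label(x) for x in TEST_SET)
    return correct / len(TEST_SET)


# Measure P after the learner has seen 2, 4, 6, and 7 examples.
history = []
for n in (2, 4, 6, 7):
    threshold = learn_threshold(EXPERIENCE[:n])
    history.append((n, accuracy(threshold)))

for n, acc in history:
    print(f"examples seen: {n}  accuracy: {acc:.2f}")
```

On this data the accuracy rises as more examples are seen, so the program "learns" in Mitchell's sense. With other example orderings the improvement need not be strictly monotone.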
13. Data Mining
Data Mining on Diverse kinds of Data
Besides relational database data (from operational or analytical systems), there are many other kinds of data that have diverse forms and structures and different semantic meanings.
Examples of such data include:
• time-related or sequence data (e.g., historical records, stock exchange data, and time-series and biological sequence data),
• data streams (e.g., video surveillance and sensor data, which are continuously transmitted),
• spatial data (e.g., maps),
• engineering design data (e.g., the design of buildings, system components, or integrated circuits),
• hypertext and multimedia data (including text, image, video, and audio data),
• graph and networked data (e.g., social and information networks), and
• the Web (a widely distributed information repository).
Diversity of data brings new challenges, such as handling special structures (e.g., sequences, trees, graphs, and networks) and specific semantics (such as ordering, image, audio and video contents, and connectivity).
14. Data Mining
Prescribed Text Books
Author(s), Title, Edition, Publishing House
T1 Tan P. N., Steinbach M. & Kumar V., “Introduction to Data Mining”, Pearson Education
T2 Han J., Kamber M. & Pei J., “Data Mining: Concepts and Techniques”, Third Edition, Morgan Kaufmann Publishers