CS501: DATABASE AND DATA
MINING
Data Mining Introduction1
WHY DATA MINING?
 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability Data collection and data availability
 Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation,
…
 Society and everyone: news, digital cameras, YouTube Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets 2
EVOLUTION OF SCIENCES
 Before 1600, empirical science
 1600-1950s, theoretical science
E h di i li h h i l t Th ti l d l ft ti t Each discipline has grown a theoretical component. Theoretical models often motivate
experiments and generalize our understanding.
 1950s-1990s, computational science
 Over the last 50 years, most disciplines have grown a third, computational branch (e.g.Over the last 50 years, most disciplines have grown a third, computational branch (e.g.
empirical, theoretical, and computational ecology, or physics, or linguistics.)
 Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
 1990-now, data science
 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data online
 The Internet and computing Grid that makes all these archives universally accessible The Internet and computing Grid that makes all these archives universally accessible
 Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!
 Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,y y, p yp f ,
Comm. ACM, 45(11): 50-54, Nov. 2002 3
EVOLUTION OF DATABASE TECHNOLOGY
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s: 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases
2000 2000s
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems 4
WHAT IS DATA MINING?
 Data mining (knowledge discovery from data)
E i f i i i i l i li i i l Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amount of datauge a ou t o data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.g g, g, g ,
 Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems 5
PROCESS OF KNOWLEDGE DISCOVERY
6
STEPS FOR KNOWLEDGE DISCOVERY PROCESSSTEPS FOR KNOWLEDGE DISCOVERY PROCESS
 Data cleaningg
 to remove noise or irrelevant data
 Data integration
 where multiple data sources are combined where multiple data sources are combined
 Data selection
 where data relevant to the task analysis is retrieved
 Data transformation
 where data are transformed or consolidated into
forms appropriate for mining by performing
summary or aggregation operations
 Data mining
 an essential process where intelligent methods arean essential process where intelligent methods are
applied in order to extract data patterns 7
S K D PSTEPS FOR KNOWLEDGE DISCOVERY PROCESS
 Pattern evaluation
 To identify the truly interesting patterns
representing knowledge based on some
interestingness measureinterestingness measure
 Knowledge representation
 Where visualization and knowledge representation
techniques are used to present the mined knowledge
to the user
Data mining is the process of discovering interesting knowledge from
large amount of data stored either in the data bases, data
warehouses, or other information repositories
8
, p
ARCHITECTURE OF DATA MINING SYSTEM
 Based on this view, the architecture of data,
mining system has the following major
components-
D t b d t h th i f ti Database, data warehouse or other information
repository
 Database or data warehouse server
 Knowledge base
 Data mining engine
 Pattern evaluation modulePattern evaluation module
 Graphical user interface
9
ARCHITECTURE OF A DATA MINING SYSTEM
10
DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES
Database
S i iTechnology Statistics
Data Mining
Machine
Learning
Visualization
Learning
PatternPattern
Recognition
Algorithm
Other
Disciplines
11
WHY NOT TRADITIONAL DATA ANALYSIS?
 Tremendous amount of data
 Algorithms must be highly scalable to handle such as tera-bytesAlgorithms must be highly scalable to handle such as tera bytes
of data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor dataData streams and sensor data
 Time-series data, temporal data, sequence data
 Structure data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs scientific simulations Software programs, scientific simulations
 New and sophisticated applications
12
DATA MINING- ON WHAT KIND OF DATA
 Different data store on which data mining can beg
done
 It is applicable to any kind of information
itrepository
 However, the challenges and techniques of
mining may differ for each of the repositorymining may differ for each of the repository
systems
13
DATA MINING: ON WHAT KINDS OF DATA?
 Database-oriented data sets and applications
 Relational database data warehouse transactional database Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structure data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal dataSpatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web 14
RELATIONAL DATABASES
15
DATA WAREHOUSE
16
MULTIDIMENSIONAL DATA CUBE
Showing
summarized
data
17
TRANSACTIONAL DATABASE
 Consists of files where
each record represent
a transaction
T i ll i l d Typically includes a
transaction identity
number and a list of
items making up the
transaction
18
DATA MINING FUNCTIONALITIES
 Multidimensional concept description: Characterization and
discriminationdiscrimination
 Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet regions
 Frequent patterns, association, correlation vs. causality
 Diaper  Beer [0.5%, 75%] (Correlation or causality?)
Cl ifi i d di i Classification and prediction
 Construct models (functions) that describe and distinguish
classes or concepts for future predictionp p
 E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
 Predict some unknown or missing numerical values 19
DATA MINING FUNCTIONALITIES (2)
 Cluster analysis
 Class label is unknown: Group data to form new classes, e.g., clusterp , g ,
houses to find distribution patterns
 Maximizing intra-class similarity & minimizing interclass similarity
 Outlier analysisy
 Outlier: Data object that does not comply with the general behavior
of the data
 Noise or exception? Useful in fraud detection, rare events analysiso se o e cep o ? Use a e ec o , a e eve s a a ys s
 Trend and evolution analysis
 Trend and deviation: e.g., regression analysis
 Sequential pattern mining: e g digital camera  large SD memory Sequential pattern mining: e.g., digital camera  large SD memory
 Periodicity analysis
 Similarity-based analysis
O h di d i i l l Other pattern-directed or statistical analyses 20

Cs501 dm intro

  • 1.
    CS501: DATABASE ANDDATA MINING Data Mining Introduction1
  • 2.
    WHY DATA MINING? The Explosive Growth of Data: from terabytes to petabytes  Data collection and data availability Data collection and data availability  Automated data collection tools, database systems, Web, computerized society  Major sources of abundant data  Business: Web, e-commerce, transactions, stocks, …  Science: Remote sensing, bioinformatics, scientific simulation, …  Society and everyone: news, digital cameras, YouTube Society and everyone: news, digital cameras, YouTube  We are drowning in data, but starving for knowledge!  “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets 2
  • 3.
    EVOLUTION OF SCIENCES Before 1600, empirical science  1600-1950s, theoretical science E h di i li h h i l t Th ti l d l ft ti t Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.  1950s-1990s, computational science  Over the last 50 years, most disciplines have grown a third, computational branch (e.g.Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)  Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.  1990-now, data science  The flood of data from new scientific instruments and simulations  The ability to economically store and manage petabytes of data online  The Internet and computing Grid that makes all these archives universally accessible The Internet and computing Grid that makes all these archives universally accessible  Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!  Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,y y, p yp f , Comm. ACM, 45(11): 50-54, Nov. 2002 3
  • 4.
    EVOLUTION OF DATABASETECHNOLOGY  1960s:  Data collection, database creation, IMS and network DBMS  1970s:  Relational data model, relational DBMS implementation  1980s: 1980s:  RDBMS, advanced data models (extended-relational, OO, deductive, etc.)  Application-oriented DBMS (spatial, scientific, engineering, etc.)  1990 1990s:  Data mining, data warehousing, multimedia databases, and Web databases 2000 2000s  Stream data management and mining  Data mining and its applications  Web technology (XML, data integration) and global information systems 4
  • 5.
    WHAT IS DATAMINING?  Data mining (knowledge discovery from data) E i f i i i i l i li i i l Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of datauge a ou t o data  Data mining: a misnomer?  Alternative names  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.g g, g, g ,  Watch out: Is everything “data mining”?  Simple search and query processing  (Deductive) expert systems 5
  • 6.
  • 7.
    STEPS FOR KNOWLEDGEDISCOVERY PROCESSSTEPS FOR KNOWLEDGE DISCOVERY PROCESS  Data cleaningg  to remove noise or irrelevant data  Data integration  where multiple data sources are combined where multiple data sources are combined  Data selection  where data relevant to the task analysis is retrieved  Data transformation  where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations  Data mining  an essential process where intelligent methods arean essential process where intelligent methods are applied in order to extract data patterns 7
  • 8.
    S K DPSTEPS FOR KNOWLEDGE DISCOVERY PROCESS  Pattern evaluation  To identify the truly interesting patterns representing knowledge based on some interestingness measureinterestingness measure  Knowledge representation  Where visualization and knowledge representation techniques are used to present the mined knowledge to the user Data mining is the process of discovering interesting knowledge from large amount of data stored either in the data bases, data warehouses, or other information repositories 8 , p
  • 9.
    ARCHITECTURE OF DATAMINING SYSTEM  Based on this view, the architecture of data, mining system has the following major components- D t b d t h th i f ti Database, data warehouse or other information repository  Database or data warehouse server  Knowledge base  Data mining engine  Pattern evaluation modulePattern evaluation module  Graphical user interface 9
  • 10.
    ARCHITECTURE OF ADATA MINING SYSTEM 10
  • 11.
    DATA MINING: CONFLUENCEOF MULTIPLE DISCIPLINES Database S i iTechnology Statistics Data Mining Machine Learning Visualization Learning PatternPattern Recognition Algorithm Other Disciplines 11
  • 12.
    WHY NOT TRADITIONALDATA ANALYSIS?  Tremendous amount of data  Algorithms must be highly scalable to handle such as tera-bytesAlgorithms must be highly scalable to handle such as tera bytes of data  High-dimensionality of data  Micro-array may have tens of thousands of dimensions  High complexity of data  Data streams and sensor dataData streams and sensor data  Time-series data, temporal data, sequence data  Structure data, graphs, social networks and multi-linked data  Heterogeneous databases and legacy databases  Spatial, spatiotemporal, multimedia, text and Web data  Software programs scientific simulations Software programs, scientific simulations  New and sophisticated applications 12
  • 13.
    DATA MINING- ONWHAT KIND OF DATA  Different data store on which data mining can beg done  It is applicable to any kind of information itrepository  However, the challenges and techniques of mining may differ for each of the repositorymining may differ for each of the repository systems 13
  • 14.
    DATA MINING: ONWHAT KINDS OF DATA?  Database-oriented data sets and applications  Relational database data warehouse transactional database Relational database, data warehouse, transactional database  Advanced data sets and advanced applications  Data streams and sensor data  Time-series data, temporal data, sequence data (incl. bio-sequences)  Structure data, graphs, social networks and multi-linked data  Object-relational databases  Heterogeneous databases and legacy databases  Spatial data and spatiotemporal dataSpatial data and spatiotemporal data  Multimedia database  Text databases  The World-Wide Web 14
  • 15.
  • 16.
  • 17.
  • 18.
    TRANSACTIONAL DATABASE  Consistsof files where each record represent a transaction T i ll i l d Typically includes a transaction identity number and a list of items making up the transaction 18
  • 19.
    DATA MINING FUNCTIONALITIES Multidimensional concept description: Characterization and discriminationdiscrimination  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions  Frequent patterns, association, correlation vs. causality  Diaper  Beer [0.5%, 75%] (Correlation or causality?) Cl ifi i d di i Classification and prediction  Construct models (functions) that describe and distinguish classes or concepts for future predictionp p  E.g., classify countries based on (climate), or classify cars based on (gas mileage)  Predict some unknown or missing numerical values 19
  • 20.
    DATA MINING FUNCTIONALITIES(2)  Cluster analysis  Class label is unknown: Group data to form new classes, e.g., clusterp , g , houses to find distribution patterns  Maximizing intra-class similarity & minimizing interclass similarity  Outlier analysisy  Outlier: Data object that does not comply with the general behavior of the data  Noise or exception? Useful in fraud detection, rare events analysiso se o e cep o ? Use a e ec o , a e eve s a a ys s  Trend and evolution analysis  Trend and deviation: e.g., regression analysis  Sequential pattern mining: e g digital camera  large SD memory Sequential pattern mining: e.g., digital camera  large SD memory  Periodicity analysis  Similarity-based analysis O h di d i i l l Other pattern-directed or statistical analyses 20