This is part of the
ADVANCED ALGORITHMS IN
COMPUTATIONAL BIOLOGY (C3) course.




                              1
Class Info
• Lecturer: Chi-Yao Tseng ( 曾祺堯 )
            cytseng@citi.sinica.edu.tw
• Grading:
  – No assignments
  – Midterm:
     • 2012/04/20
     • I’m in charge of 17x2 points out of 120
     • No take-home questions


                                                 2
Outline
• Introduction
  – From data warehousing to data mining
• Mining Capabilities
  – Association rules
  – Classification
  – Clustering
• More about Data Mining


                                           3
Main Reference
• Jiawei Han, Micheline Kamber, Data Mining:
  Concepts and Techniques, 2nd Edition, Morgan
  Kaufmann, 2006.
  – Official website:
    http://www.cs.uiuc.edu/homes/hanj/bk2/




                                                 4
Why Data Mining?
•   The Explosive Growth of Data: from terabytes to petabytes (10^15 B = 1 million GB)
     – Data collection and data availability
          • Automated data collection tools, database systems, Web,
            computerized society
     – Major sources of abundant data
          • Business: Web, e-commerce, transactions, stocks, …
          • Science: Remote sensing, bioinformatics, scientific simulation, …
          • Society and everyone: news, digital cameras, YouTube, Facebook

• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”—Data mining—
  Automated analysis of massive data sets
                                                                                       5
Why Not Traditional Data Analysis?
• Tremendous amount of data
   – Algorithms must be highly scalable to handle data on the scale of terabytes

• High-dimensionality of data
   – Microarray data may have tens of thousands of dimensions

• High complexity of data
• New and sophisticated applications




                                                                          6
Evolution of Database Technology
•   1960s:
     – Data collection, database creation, IMS and network DBMS
•   1970s:
     – Relational data model, relational DBMS implementation
•   1980s:
     – RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
     – Application-oriented DBMS (spatial, scientific, engineering, etc.)
•   1990s:
     – Data mining, data warehousing, multimedia databases, and Web databases
•   2000s
     – Stream data management and mining
     – Data mining and its applications
     – Web technology (XML, data integration) and global information systems


                                                                                7
What is Data Mining?
• Knowledge discovery in databases
  – Extraction of interesting (non-trivial, implicit,
    previously unknown and potentially useful)
    patterns or knowledge from huge amounts of data.
• Alternative names:
  – Knowledge discovery (mining) in databases (KDD),
    knowledge extraction, data/pattern analysis, data
    archeology, data dredging, information
    harvesting, business intelligence, etc.

                                                        8
Data Mining: On What Kinds of Data?
•   Database-oriented data sets and applications
     – Relational database, data warehouse, transactional database
•   Advanced data sets and advanced applications
     – Data streams and sensor data
     – Time-series data, temporal data, sequence data (incl. bio-sequences)
     – Structured data, graphs, social networks and multi-linked data
     – Object-relational databases
     – Heterogeneous databases and legacy databases
     – Spatial data and spatiotemporal data
     – Multimedia database
     – Text databases
     – The World-Wide Web


                                                                              9
Knowledge Discovery (KDD) Process
[Figure: the KDD pipeline]
  Databases → Data Cleaning & Integration → Data Warehouse → Selection & Transformation
  → Transformed Data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge!

• This is a view from typical database systems and data warehousing communities.
• Data mining plays an essential role in the knowledge discovery process.
                                                                          10
Data Mining and Business Intelligence
Increasing potential to support business decisions (from bottom layer to top), with the
typical users at each layer:

• Decision Making (End User)
• Data Presentation: visualization techniques (Business Analyst)
• Data Mining: information discovery (Data Analyst)
• Data Exploration: statistical summary, querying, and reporting (Data Analyst)
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: paper, files, Web documents, scientific experiments, database systems
                                                                                  11
Data Mining: Confluence of Multiple Disciplines




                                            12
Typical Data Mining System

• Graphical User Interface
• Pattern Evaluation          ←→ Knowledge Base
• Data Mining Engine          ←→ Knowledge Base
• Database or Data Warehouse Server
  (fed by data cleaning, integration, and selection)
• Data sources: database, data warehouse, World-Wide Web, other info. repositories
                                                                  13
Data Warehousing
• A data warehouse is a subject-oriented,
  integrated, time-variant, and nonvolatile
  collection of data in support of management's
  decision making process. —W. H. Inmon




                                              14
Data Warehousing
• Subject-oriented:
   – Provide a simple and concise view around particular subject issues by
     excluding data that are not useful in the decision support process.
• Integrated:
   – Constructed by integrating multiple, heterogeneous data sources.
• Time-variant:
   – Provide information from a historical perspective (e.g., past 5-10
     years.)
• Nonvolatile:
   – Operational update of data does not occur in the data warehouse
     environment
   – Usually requires only two operations: load data & access data.
                                                                          15
Data Warehousing
• The process of constructing and using data
  warehouses
• A decision support database that is maintained
  separately from the organization’s operational
  database
• Support information processing by providing a solid
  platform of consolidated, historical data for analysis
• Set up stages for effective data mining


                                                           16
Illustration of Data Warehousing

  Data sources (Taipei, New York, London, ...)
      → Clean, Transform, Integrate, Load
      → Data Warehouse
      → Query and Analysis Tools
      → clients
                                                               17
OLTP vs. OLAP
OLTP (On-line Transaction Processing)
  • Short online transactions: update, insert, delete
  • Works on the transaction database: current & detailed data, versatile

OLAP (On-line Analytical Processing)
  • Analytics, data mining, decision making through complex queries
  • Works on the data warehouse: aggregated & historical data, static and low volume
                                                                                               18
Multi-Dimensional View of Data Mining

•   Data to be mined
     – Relational, data warehouse, transactional, stream, object-oriented/relational,
       active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
•   Knowledge to be mined
     – Characterization, discrimination, association, classification, clustering,
       trend/deviation, outlier analysis, etc.
     – Multiple/integrated functions and mining at multiple levels
•   Techniques utilized
     – Database-oriented, data warehouse (OLAP), machine learning, statistics,
       visualization, etc.
•   Applications adapted
     – Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
       market analysis, text mining, Web mining, etc.

                                                                                    19
Mining Capabilities (1/4)
• Multi-dimensional concept description:
  Characterization and discrimination
  – Generalize, summarize, and contrast data
    characteristics, e.g., dry vs. wet regions
• Frequent patterns (or frequent itemsets),
  association
  – Diaper → Beer [0.5%, 75%] (support, confidence)



                                                      20
Mining Capabilities (2/4)
• Classification and prediction
  – Construct models (functions) that describe and distinguish classes or
    concepts for future prediction
      • E.g., classify countries based on (climate), or classify cars based on
        (gas mileage)
  – Predict some unknown or missing numerical values




                                                                            21
Mining Capabilities (3/4)
• Clustering
   – Class label is unknown: Group data to form new categories
     (i.e., clusters), e.g., cluster houses to find distribution
     patterns
   – Maximizing intra-class similarity & minimizing interclass
     similarity
• Outlier analysis
   – Outlier: Data object that does not comply with the general
     behavior of the data
   – Noise or exception? Useful in fraud detection, rare events
     analysis
                                                                 22
Mining Capabilities (4/4)
• Time and ordering, trend and evolution
  analysis
  – Trend and deviation: e.g., regression analysis
  – Sequential pattern mining: e.g., digital camera → large SD memory
  – Periodicity analysis
  – Motifs and biological sequence analysis
     • Approximate and consecutive motifs
  – Similarity-based analysis

                                                        23
More Advanced Mining
                     Techniques
•   Data stream mining
     – Mining data that is ordered, time-varying, potentially infinite.
•   Graph mining
     – Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
        substructures (web fragments)
•   Information network analysis
     – Social networks: actors (objects, nodes) and relationships (edges)
          • e.g., author networks in CS, terrorist networks
     – Multiple heterogeneous networks
          • A person could be in multiple information networks: friends, family,
            classmates, …
     – Links carry a lot of semantic information: Link mining
•   Web mining
     – Web is a big information network: from PageRank to Google
     – Analysis of Web information networks
          • Web community discovery, opinion mining, usage mining, …

                                                                                24
Challenges for Data Mining
•   Handling of different types of data
•   Efficiency and scalability of mining algorithms
•   Usefulness and certainty of mining results
•   Expression of various kinds of mining results
•   Interactive mining at multiple abstraction levels
•   Mining information from different sources of data
•   Protection of privacy and data security


                                                    25
Brief Summary
• Data mining: Discovering interesting patterns and knowledge from
   massive amounts of data
• A natural evolution of database technology, in great demand, with wide
  applications
• A KDD process includes data cleaning, data integration, data selection,
  transformation, data mining, pattern evaluation, and knowledge
  presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination, association,
  classification, clustering, outlier and trend analysis, etc.



                                                                                26
A Brief History of Data Mining Society
 •   1989 IJCAI Workshop on Knowledge Discovery in Databases
      – Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
        1991)
 •   1991-1994 Workshops on Knowledge Discovery in Databases
      – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-
        Shapiro, P. Smyth, and R. Uthurusamy, 1996)
 •   1995-1998 International Conferences on Knowledge Discovery in Databases and
     Data Mining (KDD’95-98)
      – Journal of Data Mining and Knowledge Discovery (1997)
 •   ACM SIGKDD conferences since 1998 and SIGKDD Explorations
 •   More conferences on data mining
      – PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001),
         etc.
More details here: http://www.kdnuggets.com/gpspubs/sigkdd-explorations-kdd-10-years.html
 • ACM Transactions on KDD starting in 2007
                                                                                     27
Conferences and Journals on Data Mining
•   KDD Conferences                       • Other related conferences
     – ACM SIGKDD Int. Conf. on              – DB: ACM SIGMOD, VLDB,
       Knowledge Discovery in                  ICDE, EDBT, ICDT
       Databases and Data Mining
       (KDD)                                 – WEB & IR: CIKM, WWW,
     – SIAM Data Mining Conf. (SDM)            SIGIR
     – (IEEE) Int. Conf. on Data Mining      – ML & PR: ICML, CVPR, NIPS
       (ICDM)                             • Journals
     – European Conf. on Machine
       Learning and Principles and           – Data Mining and Knowledge
       practices of Knowledge Discovery        Discovery (DAMI or DMKD)
       and Data Mining (ECML-PKDD)           – IEEE Trans. On Knowledge
     – Pacific-Asia Conf. on Knowledge         and Data Eng. (TKDE)
       Discovery and Data Mining             – KDD Explorations
       (PAKDD)
                                             – ACM Trans. on KDD
     – Int. Conf. on Web Search and
       Data Mining (WSDM)
                                                                           28
CAPABILITIES OF DATA MINING


                              29
FREQUENT PATTERNS &
ASSOCIATION RULES

                      30
Basic Concepts
• Frequent pattern: a pattern (a set of items, subsequences,
  substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the
  context of frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
   – What products were often purchased together?— Beer and diapers?!
   – What are the subsequent purchases after buying a PC?
   – What kinds of DNA are sensitive to this new drug?
   – Can we automatically classify web documents?
• Applications
   – Basket data analysis, cross-marketing, catalog design, sale campaign
     analysis, Web log (click stream) analysis, and DNA sequence analysis

                                                                            31
Mining Association Rules
• Transaction data analysis. Given:
  – A database of transactions (Each tx. has a list of
    items purchased)
  – Minimum confidence and minimum support
• Find all association rules: the presence of one
  set of items implies the presence of another
  set of items
                Diaper → Beer [0.5%, 75%]
                   (support, confidence)
                                                         32
Two Parameters
• Confidence (how true)
   – The rule X&Y ⇒Z has 90% confidence:
     means 90% of customers who bought X and Y also
     bought Z.
• Support (how useful is the rule)
   – Useful rules should have some minimum
     transaction support.



                                                  33
Mining Strong Association Rules in
       Transaction Databases (1/2)
• Measurement of rule strength in a transaction
  database.
                      A→B [support, confidence]


     support(A→B) = Pr(A ∪ B) = (# of tx containing all items in A ∪ B) / (total # of tx)

     confidence(A→B) = Pr(B | A) = (# of tx containing all items in A ∪ B) / (# of tx containing A)



                                                                      34
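A minimal Python sketch (my own illustration, not from the slides) of how these two measures can be computed over a small transaction list; the helper names `support` and `confidence` are hypothetical, and the data is the five-transaction example that appears on slide 36.

```python
# Sketch: computing support and confidence for an association rule A -> B.
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def support(itemset, txs):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in txs) / len(txs)

def confidence(antecedent, consequent, txs):
    """Pr(consequent | antecedent) = sup(antecedent ∪ consequent) / sup(antecedent)."""
    return support(antecedent | consequent, txs) / support(antecedent, txs)

print(support({"A", "D"}, transactions))       # 0.6  (A and D appear together in 3 of 5 tx's)
print(confidence({"D"}, {"A"}, transactions))  # 0.75 (3 of the 4 tx's containing D also contain A)
```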
Mining Strong Association Rules in
      Transaction Databases (2/2)
• We are often interested in only strong
  associations, i.e.,
  – support ≥ min_sup
  – confidence ≥ min_conf

• Examples:
  – milk → bread [5%, 60%]
  – tire and auto_accessories → auto_services [2%,
    80%].

                                                     35
Example of Association Rules
              Transaction-id   Items bought
                    1            A, B, D
                    2            A, C, D
                    3            A, D, E
                    4             B, E, F
                    5          B, C, D, E, F


   Let min. support = 50%, min. confidence = 50%
       Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
       Association rules: A → D (s = 60%, c = 100%)
                          D → A (s = 60%, c = 75%)

                                                       36
Two Steps for Mining Association Rules

• Determining “large (frequent) itemsets”
  – The main factor for overall performance
  – The downward closure property of frequent
    patterns
     • Any subset of a frequent itemset must be frequent
     • If {beer, diaper, nuts} is frequent, so is {beer, diaper}
     • i.e., every transaction having {beer, diaper, nuts} also
       contains {beer, diaper}

• Generating rules
                                                                   37
The Apriori Algorithm
• Apriori (R. Agrawal and R. Srikant. Fast algorithms
  for mining association rules. VLDB'94.)
   – Derivation of large 1-itemsets L1: At the first
     iteration, scan all the transactions and count the
     number of occurrences for each item.
    – Level-wise derivation: At the kth iteration, the
      candidate set Ck consists of the itemsets whose every
      (k-1)-item subset is in Lk-1. Scan the DB and count the # of
      occurrences for each candidate itemset.

                                                          38
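The level-wise loop can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the original implementation: candidate generation below simply enumerates k-subsets of the currently frequent items and prunes them with the subset test, which is simpler (and slower) than the join step of the paper.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent-itemset mining (sketch).
    `transactions` is a list of sets; `min_sup` is an absolute count."""
    # Derivation of L1: one scan, count every single item
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(L)

    k = 2
    while L:
        items = sorted({i for itemset in L for i in itemset})
        # Candidates: k-subsets of frequent items whose every (k-1)-subset is in L(k-1)
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in L for sub in combinations(c, k - 1))]
        # One database scan to count the surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        L = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1
    return frequent
```

Running this on the TDB of the next slide with min_sup = 2 reproduces L1 = {A, B, C, E}, L2 = {AC, BC, BE, CE}, and L3 = {BCE}.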
The Apriori Algorithm—An Example
min. support = 2 tx's (50%)

Database TDB:
  Tid    Items
  100    A, C, D
  200    B, C, E
  300    A, B, C, E
  400    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
           L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
           L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E}
3rd scan → L3: {B,C,E}:2
                                                                         39
From Large Itemsets to Rules
• For each large itemset m
    – For each subset p of m
      if ( sup(m) / sup(m-p) ≥ min_conf )
        • output the rule (m-p)→p
          – conf. = sup(m)/sup(m-p)
          – support = sup(m)

•    Example: m = {a,c,d,e,f,g} appears in 2000 tx's, p = {c,e,f,g},
     so m-p = {a,d}, which appears in 5000 tx's
     – conf. = #{a,c,d,e,f,g} / #{a,d} = 2000 / 5000 = 40%
     – rule: {a,d} → {c,e,f,g} with confidence 40% and support 2000 tx's
                                            40
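A sketch of this rule-generation loop in Python (the names are mine, not the slides'); it assumes a `frequent` dictionary that maps each frequent itemset, stored as a frozenset, to its absolute support count, e.g., as returned by an Apriori-style miner.

```python
from itertools import chain, combinations

def generate_rules(frequent, min_conf):
    """Turn frequent itemsets into rules (m - p) -> p with
    confidence sup(m) / sup(m - p) and support sup(m)."""
    rules = []
    for m, sup_m in frequent.items():
        if len(m) < 2:
            continue
        # every non-empty proper subset p of m
        subsets = chain.from_iterable(combinations(m, r) for r in range(1, len(m)))
        for p in map(frozenset, subsets):
            conf = sup_m / frequent[m - p]   # m - p is frequent by downward closure
            if conf >= min_conf:
                rules.append((m - p, p, sup_m, conf))
    return rules
```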
Redundant Rules
• For the same support and confidence, if we
  have a rule {a,d} →{c,e,f,g}, do we have
  [agga98a]:
  – {a,d} →{c,e,f} ?     Yes!
  – {a} →{c,e,f,g} ?     Yes!
  – {a,d,c} →{e,f,g} ?    No!
  – {a} →{c,d,e,f,g} ?    No!




                                               41
Practice
• Suppose we additionally have
  – 500 ACE
  – 600 BCD
  – Support = 3 tx’s (50%), confidence = 66%
• Repeat the large itemset generation
  – Identify all large itemsets
• Derive up to 4 rules
  – Generate rules from the large itemsets with the
    biggest number of elements (from big to small)
                                                      42
Discussion of The Apriori Algorithm
• Apriori (R. Agrawal and R. Srikant. Fast algorithms for mining
  association rules. VLDB'94.)
   – Derivation of large 1-itemsets L1: At the first iteration, scan
     all the transactions and count the number of occurrences
     for each item.
   – Level-wise derivation: At the kth iteration, the candidate set
     Ck are those whose every (k-1)-item subset is in Lk-1. Scan DB
     and count the # of occurrences for each candidate
     itemset.
• The cardinality (number of elements) of C2 is huge.
• The execution time for the first 2 iterations is the
  dominating factor to overall performance!
• Database scan is expensive.                                      43
Improvement of the Apriori Algorithm

• Reduce passes of transaction database scans

• Shrink the number of candidates

• Facilitate the support counting of candidates




                                                  44
Example Improvement 1- Partition: Scan
           Database Only Twice
• Any itemset that is potentially frequent in DB
  must be frequent in at least one of the
  partitions of DB
   – Scan 1: partition database and find local frequent
     patterns
   – Scan 2: consolidate global frequent patterns
• A. Savasere, E. Omiecinski, and S. Navathe. An efficient
  algorithm for mining association rules in large databases. In
  VLDB’95

                                                             45
Example Improvement 2- DHP
• DHP (direct hashing and pruning): Apriori + hashing
    – Uses a hash-based method to reduce the size of C2.
    – Allows effective reduction of the tx database size (both the number of
      tx's and the size of each tx.)

        Tid       Items
        100       A, C, D
        200       B, C, E
        300     A, B, C, E
        400        B, E




 J. Park, M.-S. Chen, and P. Yu.
 An effective hash-based algorithm for mining association rules. In SIGMOD’95.   46
Mining Frequent Patterns w/o Candidate
              Generation
• A highly compact data structure: frequent
  pattern tree.
• An FP-tree-based pattern fragment growth
  mining method.
• Search technique in mining: partitioning-
  based, divide-and-conquer method.
• J. Han, J. Pei, Y. Yin, Mining Frequent Patterns without
  Candidate Generation, in SIGMOD’2000.

                                                             47
Frequent Pattern Tree (FP-tree)
• 3 parts:
   – One root labeled as ‘null’
   – A set of item prefix subtrees
   – Frequent item header table
• Each node in the prefix subtree consists of
   – Item name
   – Count
   – Node-link
• Each entry in the frequent-item header table consists of
   – Item-name
   – Head of node-link

                                                        48
The FP-tree Structure
frequent item header table: f, c, a, b, m, p (each entry keeps the head of its node-links)

Tree (each node is item:count):
  root
    f:4
      c:3
        a:3
          m:2
            p:2
          b:1
            m:1
      b:1
    c:1
      b:1
        p:1
                                                 49
FP-tree Construction: Step1
• Scan the transaction database DB once (the
  first time), and derives a list of frequent
  items.
• Sort frequent items in frequency descending
  order.
• This ordering is important since each path of a
  tree will follow this order.

                                                50
Example (min. support = 3)
Tx ID   Items Bought          (ordered) Frequent Items
100     f,a,c,d,g,i,m,p       f,c,a,m,p
200     a,b,c,f,l,m,o         f,c,a,b,m
300     b,f,h,j,o             f,b
400     b,c,k,s,p             c,b,p
500     a,f,c,e,l,p,m,n       f,c,a,m,p

List of frequent items: (f:4), (c:4), (a:3), (b:3), (m:3), (p:3)
Frequent item header table entries (in this order): f, c, a, b, m, p
                                                                         51
FP-tree Construction: Step 2
• Create a root of a tree, label with “null”
• Scan the database the second time. The scan of the first tx
  leads to the construction of the first branch of the tree.

 Scan of 1st transaction: f,a,c,d,g,i,m,p
 The 1st branch of the tree: <(f:1), (c:1), (a:1), (m:1), (p:1)>

  root
    f:1
      c:1
        a:1
          m:1
            p:1
                                                                      52
FP-tree Construction: Step 2 (cont’d)
• Scan of 2nd transaction: a,b,c,f,l,m,o → f,c,a,b,m
• The shared prefix f,c,a has its counts increased; two new nodes
  (b:1) and (m:1) are created.

  root
    f:2
      c:2
        a:2
          m:1
            p:1
          b:1
            m:1
                                            53
The FP-tree

Tx ID   Items Bought          (ordered) Frequent Items
100     f,a,c,d,g,i,m,p       f,c,a,m,p
200     a,b,c,f,l,m,o         f,c,a,b,m
300     b,f,h,j,o             f,b
400     b,c,k,s,p             c,b,p
500     a,f,c,e,l,p,m,n       f,c,a,m,p

frequent item header table: f, c, a, b, m, p (each with the head of its node-links)

  root
    f:4
      c:3
        a:3
          m:2
            p:2
          b:1
            m:1
      b:1
    c:1
      b:1
        p:1
                                                                          54
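A minimal Python sketch of the two-scan construction just illustrated (my own simplification, not the paper's code): node-links are kept as plain lists in a header dictionary, and ties between equally frequent items are broken alphabetically, so the exact item order and tree shape may differ slightly from the slides.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """Two scans: (1) count items, (2) insert ordered frequent items into the tree."""
    freq = defaultdict(int)
    for t in transactions:                       # scan 1: item frequencies
        for item in t:
            freq[item] += 1
    order = {i: f for i, f in freq.items() if f >= min_sup}

    root = FPNode(None, None)
    header = defaultdict(list)                   # item -> list of nodes (node-links)
    for t in transactions:                       # scan 2: build the prefix tree
        items = sorted((i for i in t if i in order),
                       key=lambda i: (-order[i], i))   # frequency-descending order
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1   # shared prefix: just bump the count
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header
```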
Mining Process
• Starts from the least frequent item p
  – Mining order: p -> m -> b -> a -> c -> f
  (frequent item header table, top to bottom: f, c, a, b, m, p)
                                               55
Mining Process for item p
• Starts from the least frequent item p
  (min. support = 3)

  Following p's node-links, there are two paths in the FP-tree:
    <f:4, c:3, a:3, m:2, p:2>
    <c:1, b:1, p:1>

  Conditional pattern base of "p":
    <f:2, c:2, a:2, m:2>
    <c:1, b:1>

  Conditional frequent pattern: <c:3>

  So we have two frequent patterns: {p:3}, {cp:3}
                                                             56
Mining Process for Item m

  (min. support = 3)

  Two paths:
    <f:4, c:3, a:3, m:2>
    <f:4, c:3, a:3, b:1, m:1>

  Conditional pattern base of "m":
    <f:2, c:2, a:2>
    <f:1, c:1, a:1, b:1>

  Conditional frequent pattern: <f:3, c:3, a:3>
                                                       57
Mining m’s Conditional FP-tree
Mine(<f:3, c:3, a:3> | m), one suffix item at a time:
  on a → (am:3), then Mine(<f:3, c:3> | am)
  on c → (cm:3), then Mine(<f:3> | cm)
  on f → (fm:3)

Mine(<f:3, c:3> | am): on c → (cam:3), then Mine(<f:3> | cam); on f → (fam:3)
Mine(<f:3> | cm):      on f → (fcm:3)
Mine(<f:3> | cam):     on f → (fcam:3)

So we have frequent patterns:
{m:3}, {am:3}, {cm:3}, {fm:3}, {cam:3}, {fam:3}, {fcm:3}, {fcam:3}
                                                                                 58
Analysis of the FP-tree-based method
• Find the complete set of frequent itemsets
• Efficient because
  – Works on a reduced set of pattern bases
  – Performs mining operations less costly than
    generation & test
• Cons:
  – No advantage if most transactions are short
  – The FP-tree does not always fit into main memory

                                                    59
Generalized Association Rules
• Given the class hierarchy (taxonomy), one would
  like to choose proper data granularities for
  mining.
• Different confidence/support may be
  considered.
• R. Srikant and R. Agrawal, Mining generalized association
  rules, VLDB’95.




                                                              60
Concept Hierarchy

  Clothes:  Outerwear (Jackets, Ski Pants), Shirts
  Footwear: Shoes, Hiking Boots

Tx ID   Items Bought
100     Shirt
200     Jacket, Hiking Boots
300     Ski Pants, Hiking Boots
400     Shoes
500     Shoes
600     Jacket

Frequent itemsets (support counts):
  Jacket: 2, Outerwear: 3, Clothes: 4, Shoes: 2, Hiking Boots: 2, Footwear: 4,
  {Outerwear, Hiking Boots}: 2, {Clothes, Hiking Boots}: 2,
  {Outerwear, Footwear}: 2, {Clothes, Footwear}: 2

Rules (min. support 30%, min. confidence 60%):
  Outerwear → Hiking Boots    [sup 33%, conf 66%]
  Outerwear → Footwear        [sup 33%, conf 66%]
  Hiking Boots → Outerwear    [sup 33%, conf 100%]
  Hiking Boots → Clothes      [sup 33%, conf 100%]
  Jacket → Hiking Boots       [sup 16%, conf 50%]
                                                                     61
Generalized Association Rules
• Uniform support: the same minimum support at all levels
   – Level 1 (min_sup = 5%): Milk [support = 10%]
   – Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]
• Reduced support: lower minimum support at lower levels
   – Level 1: min_sup = 5%
   – Level 2: min_sup = 3%
• Level filtering: if the Level-1 threshold were raised (e.g., min_sup = 12%),
  Milk [support = 10%] would be infrequent and its children (2% Milk,
  Skim Milk) would not be examined.
                                                                        62
Other Relevant Topics
• Max patterns
   – R. J. Bayardo. Efficiently mining long patterns from databases.
     SIGMOD'98.
• Closed patterns
   – N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent
     closed itemsets for association rules. ICDT'99.
• Sequential Patterns
   – What items one will purchase if he/she has bought some certain
     items.
   – R. Srikant and R. Agrawal, Mining sequential patterns, ICDE’95
• Traversal Patterns
   – Mining path traversal patterns in a web environment where
     documents or objects are linked together to facilitate interactive
     access.
   – M.-S. Chen, J. Park and P. Yu. Efficient Data Mining for Path Traversal
     Patterns. TKDE’98.
and more…                                                                      63
CLASSIFICATION


                 64
Classification
• Classifying tuples in a database.
• Each tuple has some attributes with known
  values.
• In training set E
  – Each tuple consists of the same set of multiple
    attributes as the tuples in the large database W.
  – Additionally, each tuple has a known class
    identity.

                                                        65
Classification (cont’d)
• Derive the classification mechanism from the
  training set E, and then use this mechanism to
  classify general data (in testing set.)
• A decision tree based approach has been
  influential in machine learning studies.




                                               66
Classification –
         Step 1: Model Construction
   • Train model from the existing data pool
  Training Data  →  Classification algorithm  →  Classification rules

  name    age      income   own cars?
  Sandy   <=30     low      no
  Bill    <=30     low      yes
  Fox     31…40    high     yes
  Susan   >40      med      no
  Claire  >40      med      no
  Andy    31…40    high     yes
                                                            67
Classification –
               Step 2: Model Usage

  Testing Data  →  Classification rules  →  predicted label

  name    age      income   own cars?   prediction
  John    >40      high     ?           No
  Sally   <=30     low      ?           No
  Annie   31…40    high     ?           Yes
                                                        68
What is Prediction?
• Prediction is similar to classification
   – First, construct model
   – Second, use model to predict future of unknown
     objects
• Prediction is different from classification
   – Classification refers to predict categorical class
     label.
   – Prediction refers to predict continuous values.
      • Major method: regression
                                                          69
Supervised vs. Unsupervised
                Learning
• Supervised learning (e.g., classification)
   – Supervision: The training data (observations,
     measurements, etc.) are accompanied by labels
     indicating the class of the observations.
• Unsupervised learning (e.g., clustering)
   – We are given a set of measurements, observations,
     etc. with the aim of establishing the existence of
     classes or clusters in the data.
   – No training data, or the “training data” are not
     accompanied by class labels.

                                                          70
Evaluating Classification Methods
• Predictive accuracy
• Speed
   – Time to construct the model and time to use the model
• Robustness
   – Handling noise and missing values
• Scalability
   – Efficiency in large databases (not memory resident data)
• Goodness of rules
   – Decision tree size
   – The compactness of classification rules

                                                                71
A Decision-Tree Based Classification
• A decision tree of whether going to play tennis or not:

  outlook?
    sunny    → humidity?
                 high → N
                 low  → P
    overcast → P
    rainy    → windy?
                 Yes → N
                 No  → P

• ID3 and its extended version C4.5 (Quinlan '93):
  a top-down decision tree generation algorithm
                                                                   72
Algorithm for Decision Tree Induction
                 (1/2)
• Basic algorithm (a greedy algorithm)
   – Tree is constructed in a top-down recursive divide-and-
     conquer manner.
   – Attributes are categorical.
      (if an attribute is a continuous number, it needs to be discretized in
      advance), e.g., 0 <= age <= 100 discretized into
      0~20, 21~40, 41~60, 61~80, 81~100

   – At start, all the training examples are at the root.
   – Examples are partitioned recursively based on selected
     attributes.
                                                                              73
Algorithm for Decision Tree Induction
                 (2/2)
• Basic algorithm (a greedy algorithm)
   – Test attributes are selected on the basis of a heuristic or
     statistical measure (e.g., information gain): maximizing an
     information gain measure, i.e., favoring the partitioning
     which makes the majority of examples belong to a single
     class.
   – Conditions for stopping partitioning:
       • All samples for a given node belong to the same class
       • There are no remaining attributes for further partitioning –
         majority voting is employed for classifying the leaf
       • There are no samples left


                                                                        74
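A compact Python sketch of this greedy top-down induction (an illustration only, not the ID3/C4.5 code): it assumes categorical attributes, tuples represented as dictionaries, and information gain as the goodness function.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj)."""
    total = len(labels)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                         # all samples in one class
        return labels[0]
    if not attrs:                                     # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {}
    for value in set(r[best] for r in rows):          # only observed values, so no empty branch
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[value] = build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx],
                                 [a for a in attrs if a != best])
    return {best: tree}
```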
Decision Tree Induction: Training Dataset
     age    income student credit_rating   buys_computer
   <=30    high       no fair                   no
   <=30    high       no excellent              no
   31…40   high       no fair                   yes
   >40     medium     no fair                   yes
   >40     low       yes fair                   yes
   >40     low       yes excellent              no
   31…40   low       yes excellent              yes
   <=30    medium     no fair                   no
   <=30    low       yes fair                   yes
   >40     medium    yes fair                   yes
   <=30    medium    yes excellent              yes
   31…40   medium     no excellent              yes
   31…40   high      yes fair                   yes
   >40     medium     no excellent              no
                                                           75
Age?

<= 30     31…40   > 40




                         76
Primary Issues in Tree Construction (1/2)
• Split criterion: Goodness function
   – Used to select the attribute to be split at a tree
     node during the tree generation phase
   – Different algorithms may use different goodness
     functions:
      • Information gain (used in ID3/C4.5)
      • Gini index (used in CART)




                                                          77
Primary Issues in Tree Construction (2/2)
• Branching scheme:
   – Determining the tree branch to which a sample
     belongs
   – Binary vs. k-ary splitting (e.g., a 3-ary split on income: high / medium / low)


• When to stop the further splitting of a node?
  e.g. impurity measure
• Labeling rule: a node is labeled as the class to
  which most samples at the node belong.
                                                      78
How to Use a Tree?
• Directly
   – Test the attribute value of unknown sample against the
     tree.
   – A path is traced from root to a leaf which holds the label.

• Indirectly
   – Decision tree is converted to classification rules.
   – One rule is created for each path from the root to a leaf.
   – IF-THEN is easier for humans to understand .


                                                            79
Attribute Selection Measure:
          Information Gain (ID3/C4.5)
    Select the attribute with the highest information gain
    Let pi be the probability that an arbitrary tuple in D belongs to
     class Ci, estimated by |Ci,D| / |D|
    Expected information (entropy) needed to classify a tuple in D:

         Info(D) = −Σ_{i=1..m} pi log2(pi)

    About entropy:
       Entropy is a measure of how "mixed up" an attribute is.
       It is sometimes equated to the purity or impurity of a variable.
       High entropy means that we are sampling from a uniform (boring) distribution.
                                                                         80
Expected Information (Entropy)
     Expected information (entropy) needed to classify a tuple in D
      (m: number of labels):

         Info(D) = −Σ_{i=1..m} pi log2(pi)

     Example with 3 tuples of one class and 2 of the other:
         Info(D) = I(3,2) = −(3/5) log2(3/5) − (2/5) log2(2/5)
                          ≈ −(3/5)(−0.737) − (2/5)(−1.322) ≈ 0.971

     Example with all 5 tuples in one class:
         Info(D) = I(5,0) = −(5/5) log2(5/5) − (0/5) log2(0/5) = 0 − 0 = 0
                                                                        81
Attribute Selection Measure:
          Information Gain (ID3/C4.5)
    Select the attribute with the highest information gain
    Let pi be the probability that an arbitrary tuple in D belongs to
     class Ci, estimated by |Ci,D| / |D|
    Expected information (entropy) needed to classify a tuple in D:

         Info(D) = I(D) = −Σ_{i=1..m} pi log2(pi)

    Information needed (after using A to split D into v partitions) to classify D:

         Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × I(Dj)

    Information gained by branching on attribute A:

         Gain(A) = Info(D) − Info_A(D)
                                                                        82
Expected Information (Entropy)
     Information needed (after using A to split D into v partitions) to classify D:

         Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × I(Dj)

     Examples (a 5-tuple set split into partitions of size 2 and 3):

         Info_A(D) = (2/5) Info(1,1) + (3/5) Info(2,1)
         Info_A(D) = (2/5) Info(2,0) + (3/5) Info(3,0)
                                                                  83
Attribute Selection: Information Gain
     Class P: buys_computer = "yes" (9 tuples);  Class N: buys_computer = "no" (5 tuples)

         Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

     Splitting on age (using the training data on slide 75):

         age      pi    ni    I(pi, ni)
         <=30     2     3     0.971
         31…40    4     0     0
         >40      3     2     0.971

         Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

         ((5/14) I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es
          and 3 no's.)

         Gain(age) = Info(D) − Info_age(D) = 0.246

     Similarly,
         Gain(income) = 0.029
         Gain(student) = 0.151
         Gain(credit_rating) = 0.048
                                                                              84
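These numbers can be reproduced with a few lines of Python (a sketch; the class counts are taken from the training data on slide 75, and `info` is a hypothetical helper name):

```python
import math

def info(counts):
    """Expected information I(c1, c2, ...) for a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

info_D = info([9, 5])                                  # 9 "yes", 5 "no"  ->  ~0.940

# age partitions: <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2)
parts = [(5, [2, 3]), (4, [4, 0]), (5, [3, 2])]
info_age = sum(n / 14 * info(c) for n, c in parts)     # ~0.694

print(round(info_D - info_age, 3))                     # Gain(age) = 0.246
```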
Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a
  large number of values.
• C4.5 (a successor of ID3) uses gain ratio to overcome the
  problem (normalization to information gain.)
         SplitInfo_A(D) = −Σ_{j=1..v} (|Dj| / |D|) × log2(|Dj| / |D|)

   – GainRatio(A) = Gain(A) / SplitInfo_A(D)
   – Example: income splits the 14 tuples into groups of 4, 6, and 4, so

         SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) ≈ 1.557

         GainRatio(income) = 0.029 / 1.557 ≈ 0.019
• The attribute with the maximum gain ratio is selected as the
  splitting attribute.                                                          85
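A quick check of the gain-ratio arithmetic (a sketch; the 4 / 6 / 4 split of income over the 14 training tuples and Gain(income) = 0.029 come from the earlier slides):

```python
import math

# income splits the 14 tuples into 4 (high), 6 (medium), and 4 (low)
fractions = [4 / 14, 6 / 14, 4 / 14]
split_info = -sum(f * math.log2(f) for f in fractions)   # ~1.557
print(round(0.029 / split_info, 3))                       # GainRatio(income) ~ 0.019
```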
Gini index (CART, IBM
                    IntelligentMiner)
•   If a data set D contains examples from n classes, the gini index Gini(D) is
    defined as
                         Gini(D) = 1 − Σ_{j=1..n} pj²
    where pj is the relative frequency of class j in D
•   If a data set D is split on A into two subsets D1 and D2, the gini index of the
    split is defined as:
                 Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
•   Reduction in impurity:
                         ΔGini(A) = Gini(D) − Gini_A(D)

•   The attribute that provides the smallest Gini_A(D) (or the largest reduction in
    impurity) is chosen to split the node (need to enumerate all the possible
    splitting points for each attribute.)

                                                                                  86
Gini index (CART, IBM
                       IntelligentMiner)
•   Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":

                    Gini(D) = 1 − (9/14)² − (5/14)² = 0.459

•   Suppose the attribute income partitions D into 10 tuples in D1: {low, medium}
    (7 yes, 3 no) and 4 tuples in D2: {high} (2 yes, 2 no):

        Gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
                                    = (10/14) [1 − (7/10)² − (3/10)²] + (4/14) [1 − (2/4)² − (2/4)²]
                                    ≈ 0.443
                                    = Gini_income∈{high}(D)

    Similarly, Gini_income∈{medium,high}(D) ≈ 0.450 and Gini_income∈{low,high}(D) ≈ 0.458,
    so the binary split on {low, medium} (vs. {high}) is chosen since it gives the
    lowest gini index.

                                                                                            87
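The same arithmetic as a short Python sketch (the per-class counts for each income value are taken from the training data on slide 75; `gini` is a helper name of mine):

```python
def gini(counts):
    """Gini(D) = 1 - sum_j p_j^2 for a list of class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 3))                            # whole data set: 0.459

# binary split on income: D1 = {low, medium} has 7 yes / 3 no, D2 = {high} has 2 yes / 2 no
gini_split = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])
print(round(gini_split, 3))                              # ~0.443
```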
Other Attribute Selection Measures
•   CHAID: a popular decision tree algorithm, measure based on χ2 test for
    independence
•   C-SEP: performs better than info. gain and gini index in certain cases
•   G-statistics: has a close approximation to χ2 distribution
•   MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
     – The best tree as the one that requires the fewest # of bits to both (1)
       encode the tree, and (2) encode the exceptions to the tree
•   Multivariate splits (partition based on multiple variable combinations)
     – CART: finds multivariate splits based on a linear combination of
       attributes.
              Which attribute selection measure is the best?
  Most give good results, but none is significantly superior to the others.
                                                                                        88
Other Types of Classification Methods

• Bayes Classification Methods
• Rule-Based Classification
• Support Vector Machine (SVM)

• Some of these methods will be taught in the
  following lessons.



                                                89
CLUSTERING


             90
What is Cluster Analysis?
• Cluster: a collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
• Cluster Analysis
  – Grouping a set of data objects into clusters
• Typical applications:
  – As a stand-alone tool to get insight into data
    distribution
  – As a preprocessing step for other algorithms
                                                     91
General Applications of Clustering
• Spatial data analysis
     – Create thematic maps in GIS by clustering feature spaces.
     – Detect spatial clusters and explain them in spatial data mining.
•   Image Processing
•   Pattern recognition
•   Economic Science (especially market research)
•   WWW
     – Document classification
     – Cluster Web-log data to discover groups of similar access
       patterns

                                                                          92
Examples of Clustering
                Applications
• Marketing: Help marketers discover distinct groups in their
  customer bases, and then use this knowledge to develop
  targeted marketing programs.

• Land use: Identification of areas of similar land use in an earth
  observation database.

• Insurance: Identifying groups of motor insurance policy
  holders with a high average claim cost.

• City-planning: Identifying groups of houses according to their
  house type, value, and geographical location.

                                                                 93
What is Good Clustering?
• A good clustering method will produce high quality
  clusters with
   – High intra-class similarity
   – Low inter-class similarity

• The quality of a clustering result depends on both the
  similarity measure used by the method and its
  implementation.
• The quality of a clustering method is also measured by
  its ability to discover hidden patterns.

                                                       94
Requirements of Clustering
           in Data Mining (1/2)
• Scalability
• Ability to deal with different types of
  attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements of domain knowledge
  for input
• Able to deal with outliers


                                               95
Requirements of Clustering
            in Data Mining (2/2)
• Insensitive to order of input records
• High dimensionality
  – Curse of dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability




                                                96
Clustering Methods (I)
• Partitioning Method
   – Construct various partitions and then evaluate them by some
     criterion, e.g., minimizing the sum of square errors
   – K-means, k-medoids, CLARANS

• Hierarchical Method
   – Create a hierarchical decomposition of the set of data (or objects)
     using some criterion
   – Diana, Agnes, BIRCH, ROCK, CHAMELEON

• Density-based Method
   – Based on connectivity and density functions
    – Typical methods: DBSCAN, OPTICS, DenClue

                                                                           97
Clustering Methods (II)
•   Grid-based approach
     – based on a multiple-level granularity structure
     – Typical methods: STING, WaveCluster, CLIQUE
•   Model-based approach
      – A model is hypothesized for each of the clusters, and the aim is to find the
        best fit of the data to the given model
     – Typical methods: EM, SOM, COBWEB
•   Frequent pattern-based
     – Based on the analysis of frequent patterns
     – Typical methods: pCluster
•   User-guided or constraint-based
     – Clustering by considering user-specified or application-specific constraints
     – Typical methods: cluster-on-demand, constrained clustering
                                                                                       98
Typical Alternatives to
              Calculate the Distance between Clusters
• Single link: smallest distance between an element in one cluster and an
   element in the other, i.e., dis(Ki, Kj) = min(dist(tip, tjq))

• Complete link: largest distance between an element in one cluster and
   an element in the other, i.e., dis(Ki, Kj) = max(dist(tip, tjq))

• Average: average distance between an element in one cluster and an
   element in the other, i.e., dis(Ki, Kj) = avg(dist(tip, tjq))

• Centroid: distance between the centroids of two clusters,
   i.e., dis(Ki, Kj) = dist(Ci, Cj)

• Medoid: distance between the medoids of two clusters,
   i.e., dis(Ki, Kj) = dist(Mi, Mj)
    – Medoid: one chosen, centrally located object in the cluster
                                                                            99
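To make these alternatives concrete, here is a minimal NumPy sketch (the function name, the Euclidean metric, and the omission of the medoid variant are illustrative choices, not part of the slides):

```python
import numpy as np

def cluster_distance(Ki, Kj, linkage="single"):
    """Distance between two clusters, given as (ni, d) and (nj, d) arrays.

    A minimal sketch of the single / complete / average / centroid
    alternatives from the slide; not an optimized routine.
    """
    # Pairwise Euclidean distances between every element of Ki and every element of Kj
    pairwise = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=-1)
    if linkage == "single":    # smallest element-to-element distance
        return pairwise.min()
    if linkage == "complete":  # largest element-to-element distance
        return pairwise.max()
    if linkage == "average":   # mean element-to-element distance
        return pairwise.mean()
    if linkage == "centroid":  # distance between the cluster centroids
        return np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))
    raise ValueError("unknown linkage: " + linkage)
```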
Centroid, Radius and Diameter of a Cluster
                 (for numerical data sets)
• Centroid: the “middle” of a cluster
     Cm = ( Σ_{i=1}^{N} t_ip ) / N

• Radius: square root of the average squared distance from any point of
  the cluster to its centroid
     Rm = sqrt( Σ_{i=1}^{N} (t_ip − Cm)² / N )

• Diameter: square root of the average squared distance between all
  pairs of points in the cluster
     Dm = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (t_ip − t_jq)² / ( N(N−1) ) )
     (note: diameter ≠ 2 × radius)
                                                                       100
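A small sketch of these three quantities, assuming Euclidean distances over the full attribute space and an (N, d) NumPy array of points (names are illustrative, N > 1 assumed):

```python
import numpy as np

def cluster_summary(points):
    """Centroid, radius and diameter of a numerical cluster (points: (N, d) array)."""
    N = len(points)
    centroid = points.mean(axis=0)
    # Radius: sqrt of the average squared point-to-centroid distance
    radius = np.sqrt(np.mean(np.sum((points - centroid) ** 2, axis=1)))
    # Diameter: sqrt of the average squared pairwise distance over all ordered pairs (i != j)
    diff = points[:, None, :] - points[None, :, :]
    sq = np.sum(diff ** 2, axis=-1)
    diameter = np.sqrt(sq.sum() / (N * (N - 1)))
    return centroid, radius, diameter
```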
Partitioning Algorithms: Basic Concept

• Partitioning method: construct a partition of a
  database D of n objects into a set of k clusters.
• Given a number k, find a partition of k clusters
  that optimizes the chosen partitioning criterion.
   – Global optimal: exhaustively enumerate all partitions.
   – Heuristic methods: k-means, k-medoids
      • k-means (MacQueen’67)
       • k-medoids or PAM, partitioning around medoids (Kaufman &
         Rousseeuw’87)


                                                              101
The K-Means Clustering Method
       • Given k, the k-means algorithm is implemented in
         four steps:
         1. Arbitrarily choose k points as the initial cluster centroids.
         2. Re-assign Points: Assign each object to the cluster with
            the nearest centroid.
loop     3. Update Means (Centroids): Recompute each centroid as the
            mean point of its cluster.
         4. Go back to Step 2; stop when no assignments change.
                                                                 102
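The four steps map directly onto a few lines of NumPy; the sketch below is illustrative only (random initialization, Euclidean distances, and the handling of empty clusters are my own choices, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means over an (n, d) array X; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily choose k points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid as the mean of its cluster
        # (keep the old centroid if a cluster becomes empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when no centroid (hence no assignment) changes
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```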
Example of the K-Means Clustering Method
• Given k = 2: arbitrarily choose k objects as the initial cluster centroids.
• Assign each object to the most similar (nearest) centroid.
• Update the cluster means, then re-assign the objects; repeat the
  update / re-assign cycle until the assignments no longer change.
  (The original slide illustrates these iterations on 2-D scatter plots.)
                                                                       103
Comments on the K-Means Clustering

• Time Complexity: O(tkn), where n is # of objects, k is
  # of clusters, and t is # of iterations. Normally,
  k,t<<n.
• Often terminates at a local optimum.
  (The global optimum may be found using techniques such as:
  deterministic annealing and genetic algorithms)
• Weakness:
   – Applicable only when mean is defined, how about
     categorical data?
   – Need to specify k, the number of clusters, in advance
   – Unable to handle noisy data and outliers                104
Why is K-Means Unable to
           Handle Outliers?
• The k-means algorithm is sensitive to outliers
  – An object with an extremely large value may substantially distort
    the distribution of the data, pulling the mean toward it.

• K-Medoids: Instead of taking the mean value of the
  object in a cluster as a reference point, medoids can
  be used, which is the most centrally located object in
  a cluster.
                                                      105
PAM: The K-Medoids Method
       • PAM: Partition Around Medoids
       • Use real object to represent the cluster
         1. Randomly select k representative objects as the initial medoids.
         2. Assign each data point to the closest medoid.
loop     3. For each medoid m and each non-medoid data point o:
            swap m and o, and compute the total cost of the resulting
            configuration.
         4. Select the configuration with the lowest cost.
         5. Repeat steps 2–4 until there is no change in the medoids.
                                                                     106
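A minimal, unoptimized sketch of PAM following these steps (the function name, the Euclidean metric, and the full pairwise distance matrix are illustrative assumptions):

```python
import numpy as np

def pam(X, k, max_iter=100, seed=0):
    """PAM (k-medoids) over an (n, d) array X; returns (labels, medoid_indices)."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)         # step 1: random medoids

    def total_cost(meds):
        # cost = sum of distances from every point to its closest medoid
        return D[:, meds].min(axis=1).sum()

    best = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        # steps 2-4: try every medoid / non-medoid swap, keep the cheapest configuration
        for i in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o
                cost = total_cost(candidate)
                if cost < best:
                    best, medoids, improved = cost, candidate, True
        if not improved:          # step 5: stop when the medoids no longer change
            break
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids
```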
A Typical K-Medoids Algorithm (PAM)
• Given k = 2: arbitrarily choose k objects as the initial medoids (m1, m2).
• Assign each remaining object to the nearest medoid.
• Swap each medoid with each non-medoid data point, compute the total
  cost of each resulting configuration, and keep the best one.
  (The original slide illustrates these steps on 2-D scatter plots.)
                                                                       107
PAM Clustering: Total swapping cost TCih = Σj Cjih
• Notation: t and i are the current medoids, h is the non-medoid that is
  swapped with i, and j is any other (non-selected) object.
• The contribution Cjih of each object j depends on how its assignment
  changes after the swap:
   – j was assigned to i and moves to h (d(j,h) < d(j,t)): Cjih = d(j,h) − d(j,i)
   – j was and remains assigned to t: Cjih = 0
   – j was assigned to i and moves to t (d(j,h) > d(j,t)): Cjih = d(j,t) − d(j,i)
   – j was assigned to t and moves to h: Cjih = d(j,h) − d(j,t)
  (The original slide illustrates the four cases on 2-D scatter plots.)
                                                                       108
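Equivalently, the four cases collapse to "new nearest-medoid distance minus old nearest-medoid distance" for each object j; a tiny illustrative helper (hypothetical names, not from the slides):

```python
def swap_cost(d, t, i, h, others):
    """Total swapping cost TCih = sum_j Cjih for replacing medoid i by medoid h.

    `d` is a pairwise distance matrix, `t` and `i` the current medoids,
    `h` the candidate replacement for i, `others` the remaining objects.
    """
    total = 0.0
    for j in others:
        old = min(d[j][i], d[j][t])   # cost of j before the swap
        new = min(d[j][h], d[j][t])   # cost of j after the swap (i replaced by h)
        total += new - old            # collapses to the four cases above
    return total
```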
What is the Problem with PAM?
• PAM is more robust than k-means in the presence of
  noise and outliers because a medoid is less influenced
  by outliers or other extreme values than a mean.

• PAM works efficiently for small data sets but does not
  scale well for large data sets.
    – O(k(n−k)²) for each iteration,
      where n is # of data points and k is # of clusters
   – Improvements: CLARA (uses a sampled set to determine
     medoids), CLARANS

                                                            109
Hierarchical Clustering
• Uses a distance matrix as the clustering criterion.
• This method does not require the number of clusters k as an
  input, but needs a termination condition.
          Step 0   Step 1   Step 2 Step 3 Step 4
                                                   agglomerative
                                                   (AGNES)
          a
                   ab
          b
                                        abcde
          c
                                  cde
          d
                            de
          e
                                                   divisive
                                                   (DIANA)
          Step 4   Step 3   Step 2 Step 1 Step 0
                                                                   110
AGNES (Agglomerative Nesting)
•            Introduced in Kaufmann and Rousseeuw (1990)
•            Use the Single-Link method and the dissimilarity matrix.
•            Merge nodes that have the least dissimilarity
•            Go on in a non-descending fashion
•            Eventually all nodes belong to the same cluster

  (The original slide shows three 2-D scatter plots illustrating successive
  single-link merges.)
                                                                       111
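As a concrete illustration of the AGNES idea, a naive single-link agglomerative sketch (the stopping criterion of merging until k clusters remain, and all names, are illustrative choices, not from the slides):

```python
import numpy as np

def agnes_single_link(X, k):
    """Agglomerative clustering with single-link merging over an (n, d) array X.

    Starts with every object in its own cluster and repeatedly merges the
    two clusters with the smallest single-link distance until k clusters
    remain; returns a list of index lists.
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(X))]      # each node starts as its own cluster
    while len(clusters) > k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: smallest pairwise distance between the two clusters
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]  # merge the least-dissimilar pair
        del clusters[b]
    return clusters
```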
Dendrogram:
    Shows How the Clusters are Merged
   Decompose data objects into several levels of nested partitioning
    (a tree of clusters), called a dendrogram.
   A clustering of the data objects is obtained by cutting the
    dendrogram at the desired level; each connected component then
    forms a cluster.




                                                                  112
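In practice a library routine is usually used to build and cut the dendrogram; for example, with SciPy (an external library not mentioned on the slides, shown here only as a hedged illustration on synthetic data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # hypothetical 2-D data

Z = linkage(X, method="single")                  # build the dendrogram (AGNES-style, single link)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut it so that 3 clusters remain
print(labels)
```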
DIANA (Divisive Analysis)
• Introduced in Kaufmann and Rousseeuw (1990)
• Inverse order of AGNES
• Eventually each node forms a cluster on its own.




  (The original slide shows 2-D scatter plots illustrating successive
  divisive splits.)
                                                                       113
More on Hierarchical Clustering
• Major weakness:
    – Do not scale well: time complexity is at least O(n²), where n is
      the number of total objects.
   – Can never undo what was done previously.

• Integration of hierarchical with distance-based clustering
   – BIRCH(1996): uses CF-tree data structure and incrementally
     adjusts the quality of sub-clusters.
   – CURE(1998): selects well-scattered points from the cluster and
     then shrinks them towards the center of the cluster by a
     specified fraction.


                                                                        114
Density-Based Clustering Methods

• Clustering based on density (local cluster criterion), such as
  density-connected points
• Major features:
   – Discover clusters of arbitrary shape
   – Handle noise
   – One scan
   – Need density parameters as termination condition
• Several interesting studies:
   – DBSCAN: Ester, et al. (KDD’96)
   – OPTICS: Ankerst, et al (SIGMOD’99).
   – DENCLUE: Hinneburg & D. Keim (KDD’98)
   – CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
                                                                   115
Density-Based Clustering: Basic Concepts
• Two parameters:
   – Eps: Maximum radius of the neighborhood
   – MinPts: Minimum number of points in an Eps-neighborhood
     of that point



  (The original slide shows a diagram of an Eps-neighborhood of radius Eps.)




                                                         116
Density-Based Clustering: Basic Concepts
• Two parameters:
   – Eps: Maximum radius of the neighborhood
   – MinPts: Minimum number of points in an Eps-neighborhood
     of that point
• NEps(q): {p | dist(p,q) <= Eps} // p, q are two data points
• Directly density-reachable: A point p is directly density-
  reachable from a point q w.r.t. Eps, MinPts if
    – p belongs to NEps(q)
    – core point condition: |NEps(q)| >= MinPts
      (illustration on the slide: MinPts = 5, Eps = 1 cm)
                                                                    117
Density-Reachable and Density-Connected
• Density-reachable:
   – A point p is density-reachable from a point q w.r.t. Eps, MinPts
     if there is a chain of points p1, …, pn with p1 = q and pn = p
     such that pi+1 is directly density-reachable from pi.
• Density-connected:
   – A point p is density-connected to a point q w.r.t. Eps, MinPts
     if there is a point o such that both p and q are density-reachable
     from o w.r.t. Eps and MinPts.
  (The original slide includes small diagrams illustrating both notions.)
                                                                      118
DBSCAN: Density Based Spatial Clustering of
               Applications with Noise

• Relies on a density-based notion of cluster: A cluster is defined
  as a maximal set of density-connected points.
• Discovers clusters of arbitrary shape in spatial databases with
  noise.
  (The original slide shows a diagram of core and border points for a
   cluster with Eps = 1 cm, MinPts = 5.)

                                                                119
DBSCAN: The Algorithm
• Arbitrarily select an unvisited point p.
• Retrieve all points density-reachable from p w.r.t. Eps and
  MinPts.
• If p is a core point, a cluster is formed. Mark all these points
  as visited.
• If p is a border point (no points are density-reachable from
  p), mark p as visited and DBSCAN visits the next point of the
  database.
• Continue the process until all of the points have been visited.


                                                                 120
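A compact sketch of this procedure (illustrative only; it materializes the full pairwise distance matrix for simplicity, which costs O(n²) memory, and all names are my own choices):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """DBSCAN over an (n, d) array X; returns labels (-1 = noise, 0..k-1 = clusters)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    labels = np.full(n, -1)                  # -1 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = list(np.where(D[p] <= eps)[0])
        if len(neighbors) < min_pts:         # border or noise point: move on
            continue
        labels[p] = cluster_id               # p is a core point: grow a new cluster
        seeds = neighbors
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:              # claim q for this cluster if unassigned
                labels[q] = cluster_id
            if visited[q]:
                continue
            visited[q] = True
            q_neighbors = np.where(D[q] <= eps)[0]
            if len(q_neighbors) >= min_pts:  # q is also a core point: expand further
                seeds.extend(q_neighbors)
        cluster_id += 1
    return labels
```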
References (1)
•   R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional
    data for data mining applications. SIGMOD'98
•   M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
•   M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering
    structure, SIGMOD’99.
•   P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996
•   Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02
•   M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD 2000.
•   M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large
    spatial databases. KDD'96.
•   M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for
    efficient class identification. SSD'95.
•   D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172,
    1987.
•   D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic
    systems. VLDB’98.




                                                                                                            121
References (2)
•   V. Ganti, J. Gehrke, R. Ramakrishan. CACTUS Clustering Categorical Data Using Summaries. KDD'99.
•   D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. In
    Proc. VLDB’98.
•   S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98.
•   S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In ICDE'99, pp. 512-
    521, Sydney, Australia, March 1999.
•   A. Hinneburg, D. A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD’98.
•   A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
•   G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling.
    COMPUTER, 32(8): 68-75, 1999.
•   L. Kaufman and P. J. Rousseeuw, 1987. Clustering by Means of Medoids. In: Dodge, Y. (Ed.), Statistical Data Analysis
    Based on the L1 Norm, North Holland, Amsterdam. pp. 405-416.
•   L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons,
    1990.
•   E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98.
•   J. B. MacQueen (1967): "Some Methods for classification and Analysis of Multivariate Observations", Proceedings
    of 5-th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press,
    1:281-297
•   G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons,
    1988.
•   P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
•   R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.


                                                                                                                      122
References (3)
•   L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A Review , SIGKDD
    Explorations, 6(1), June 2004
•   E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996
    Int. Conf. on Pattern Recognition.
•   G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for
    very large spatial databases. VLDB’98.
•   A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering in Large Databases,
    ICDT'01.
•   A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles , ICDE'01
•   H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data sets, SIGMOD’ 02.
•   W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB’97.
•   T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for very large
    databases. SIGMOD'96.
•   Wikipedia: DBSCAN. http://en.wikipedia.org/wiki/DBSCAN.




                                                                                                               123
MORE ABOUT DATA MINING


                         124
ICDM ’10 KEYNOTE SPEECH
“10 YEARS OF DATA MINING RESEARCH: RETROSPECT AND PROSPECT”
Xindong Wu, University of Vermont, USA
http://www.cs.uvm.edu/~xwu/PPT/ICDM10-Sydney/ICDM10-Keynote.pdf

                                                       125
The Top 10 Algorithms
      The 3-Step Identification Process
1. Nominations. ACM KDD Innovation Award and IEEE
   ICDM Research Contributions Award winners were
   invited in September 2006 to each nominate up to 10
   best-known algorithms.

2. Verification. Each nomination was verified for its
   citations on Google Scholar in late October 2006, and
   those nominations that did not have at least 50
   citations were removed. 18 nominations survived and
   were then organized in 10 topics.

3. Voting by the wider community.
                                                      126
Top-10 Most Popular DM Algorithms:
               18 Identified Candidates (I)
•   Classification
     – #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann.,
         1993.
     – #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and
         Regression Trees. Wadsworth, 1984.
     – #3. K Nearest Neighbors (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant
         Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
      – #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All?
         Internat. Statist. Rev. 69, 385-398.
•   Statistical Learning
     – #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-
         Verlag.
      – #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New
          York.
•   Association Analysis
     – #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining
         Association Rules. In VLDB '94.
     – #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without
         candidate generation. In SIGMOD '00.



                                                                                      127
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
Data mining - GDi Techno Solutions

More Related Content

What's hot

A Practical Approach To Data Mining Presentation
A Practical Approach To Data Mining PresentationA Practical Approach To Data Mining Presentation
A Practical Approach To Data Mining Presentationmillerca2
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data MiningSi Krishan
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an IntroductionAli Abbasi
 
Data Mining : Concepts
Data Mining : ConceptsData Mining : Concepts
Data Mining : ConceptsPragya Pandey
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesSaif Ullah
 
Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data miningDevakumar Jain
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 
Data Mining methodology
 Data Mining methodology  Data Mining methodology
Data Mining methodology rebeccatho
 
Data Mining With Excel 2007 And SQL Server 2008
Data Mining With Excel 2007 And SQL Server 2008Data Mining With Excel 2007 And SQL Server 2008
Data Mining With Excel 2007 And SQL Server 2008Mark Tabladillo
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining Sushil Kulkarni
 
data mining
data miningdata mining
data mininguoitc
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CSThanveen
 
Data mining-2
Data mining-2Data mining-2
Data mining-2Nit Hik
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 

What's hot (20)

Data mining
Data miningData mining
Data mining
 
Data mining
Data miningData mining
Data mining
 
A Practical Approach To Data Mining Presentation
A Practical Approach To Data Mining PresentationA Practical Approach To Data Mining Presentation
A Practical Approach To Data Mining Presentation
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Lecture - Data Mining
Lecture - Data MiningLecture - Data Mining
Lecture - Data Mining
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
 
Data mining 1
Data mining 1Data mining 1
Data mining 1
 
Introduction data mining
Introduction data miningIntroduction data mining
Introduction data mining
 
Data Mining : Concepts
Data Mining : ConceptsData Mining : Concepts
Data Mining : Concepts
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 
Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data mining
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
Data Mining methodology
 Data Mining methodology  Data Mining methodology
Data Mining methodology
 
Data Mining With Excel 2007 And SQL Server 2008
Data Mining With Excel 2007 And SQL Server 2008Data Mining With Excel 2007 And SQL Server 2008
Data Mining With Excel 2007 And SQL Server 2008
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining
 
Data mining
Data miningData mining
Data mining
 
data mining
data miningdata mining
data mining
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
 
Data mining-2
Data mining-2Data mining-2
Data mining-2
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 

Viewers also liked

Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data MiningAmritanshu Mehra
 
Calculus Final Review Joshua Conyers
Calculus Final Review Joshua ConyersCalculus Final Review Joshua Conyers
Calculus Final Review Joshua Conyersjcon44
 
Data mining process powerpoint presentation templates
Data mining process powerpoint presentation templatesData mining process powerpoint presentation templates
Data mining process powerpoint presentation templatesSlideTeam.net
 
Big Data in Learning Analytics - Analytics for Everyday Learning
Big Data in Learning Analytics - Analytics for Everyday LearningBig Data in Learning Analytics - Analytics for Everyday Learning
Big Data in Learning Analytics - Analytics for Everyday LearningStefan Dietze
 
Application of KDD & its future scope
Application of KDD & its future scopeApplication of KDD & its future scope
Application of KDD & its future scopeTanmay Sethi
 
The 8 Step Data Mining Process
The 8 Step Data Mining ProcessThe 8 Step Data Mining Process
The 8 Step Data Mining ProcessMarc Berman
 
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kambererror007
 
Data Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data VisualisationData Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data VisualisationSunderland City Council
 
Data Mining Techniques for CRM
Data Mining Techniques for CRMData Mining Techniques for CRM
Data Mining Techniques for CRMShyaamini Balu
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean
 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methodsProf.Nilesh Magar
 
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic ConceptsData Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic ConceptsSalah Amean
 
Wi+ss+14+vl+05+pdf
Wi+ss+14+vl+05+pdfWi+ss+14+vl+05+pdf
Wi+ss+14+vl+05+pdfcmstobi
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessingSalah Amean
 

Viewers also liked (16)

Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
 
Calculus Final Review Joshua Conyers
Calculus Final Review Joshua ConyersCalculus Final Review Joshua Conyers
Calculus Final Review Joshua Conyers
 
Data mining process powerpoint presentation templates
Data mining process powerpoint presentation templatesData mining process powerpoint presentation templates
Data mining process powerpoint presentation templates
 
Big Data in Learning Analytics - Analytics for Everyday Learning
Big Data in Learning Analytics - Analytics for Everyday LearningBig Data in Learning Analytics - Analytics for Everyday Learning
Big Data in Learning Analytics - Analytics for Everyday Learning
 
Application of KDD & its future scope
Application of KDD & its future scopeApplication of KDD & its future scope
Application of KDD & its future scope
 
The 8 Step Data Mining Process
The 8 Step Data Mining ProcessThe 8 Step Data Mining Process
The 8 Step Data Mining Process
 
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 
Data Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data VisualisationData Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data Visualisation
 
Data Mining Techniques for CRM
Data Mining Techniques for CRMData Mining Techniques for CRM
Data Mining Techniques for CRM
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methods
 
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic ConceptsData Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
Data Mining:Concepts and Techniques, Chapter 8. Classification: Basic Concepts
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 
Wi+ss+14+vl+05+pdf
Wi+ss+14+vl+05+pdfWi+ss+14+vl+05+pdf
Wi+ss+14+vl+05+pdf
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 

Similar to Data mining - GDi Techno Solutions

Supporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data ManagementSupporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data ManagementMarieke Guy
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1malathieswaran29
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...BigMine
 
Business Intelligence Data Analytics June 28 2012 Icpas V4 Final 20120625 8am
Business Intelligence  Data Analytics June 28 2012 Icpas V4  Final 20120625 8amBusiness Intelligence  Data Analytics June 28 2012 Icpas V4  Final 20120625 8am
Business Intelligence Data Analytics June 28 2012 Icpas V4 Final 20120625 8amBarrett Peterson
 
MIS: Business Intelligence
MIS: Business IntelligenceMIS: Business Intelligence
MIS: Business IntelligenceJonathan Coleman
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendIntroducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendCaserta
 
Big Data and Implications on Platform Architecture
Big Data and Implications on Platform ArchitectureBig Data and Implications on Platform Architecture
Big Data and Implications on Platform ArchitectureOdinot Stanislas
 
DataScienceIntroduction.pptx
DataScienceIntroduction.pptxDataScienceIntroduction.pptx
DataScienceIntroduction.pptxKannanThangavelu2
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
Streaming Hadoop for Enterprise Adoption
Streaming Hadoop for Enterprise AdoptionStreaming Hadoop for Enterprise Adoption
Streaming Hadoop for Enterprise AdoptionDATAVERSITY
 
Pragmatics Driven Issues in Data and Process Integrity in Enterprises
Pragmatics Driven Issues in Data and Process Integrity in EnterprisesPragmatics Driven Issues in Data and Process Integrity in Enterprises
Pragmatics Driven Issues in Data and Process Integrity in EnterprisesAmit Sheth
 

Similar to Data mining - GDi Techno Solutions (20)

Data warehousing
Data warehousingData warehousing
Data warehousing
 
Supporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data ManagementSupporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data Management
 
Dw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhanDw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhan
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1
 
Ch03
Ch03Ch03
Ch03
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
 
Data mining & column stores
Data mining & column storesData mining & column stores
Data mining & column stores
 
Business Intelligence Data Analytics June 28 2012 Icpas V4 Final 20120625 8am
Business Intelligence  Data Analytics June 28 2012 Icpas V4  Final 20120625 8amBusiness Intelligence  Data Analytics June 28 2012 Icpas V4  Final 20120625 8am
Business Intelligence Data Analytics June 28 2012 Icpas V4 Final 20120625 8am
 
MIS: Business Intelligence
MIS: Business IntelligenceMIS: Business Intelligence
MIS: Business Intelligence
 
Dbm630_lecture02-03
Dbm630_lecture02-03Dbm630_lecture02-03
Dbm630_lecture02-03
 
Dbm630_Lecture02-03
Dbm630_Lecture02-03Dbm630_Lecture02-03
Dbm630_Lecture02-03
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
Dbm630_lecture01
Dbm630_lecture01Dbm630_lecture01
Dbm630_lecture01
 
Dbm630 Lecture01
Dbm630 Lecture01Dbm630 Lecture01
Dbm630 Lecture01
 
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendIntroducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
 
Big Data and Implications on Platform Architecture
Big Data and Implications on Platform ArchitectureBig Data and Implications on Platform Architecture
Big Data and Implications on Platform Architecture
 
DataScienceIntroduction.pptx
DataScienceIntroduction.pptxDataScienceIntroduction.pptx
DataScienceIntroduction.pptx
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
Streaming Hadoop for Enterprise Adoption
Streaming Hadoop for Enterprise AdoptionStreaming Hadoop for Enterprise Adoption
Streaming Hadoop for Enterprise Adoption
 
Pragmatics Driven Issues in Data and Process Integrity in Enterprises
Pragmatics Driven Issues in Data and Process Integrity in EnterprisesPragmatics Driven Issues in Data and Process Integrity in Enterprises
Pragmatics Driven Issues in Data and Process Integrity in Enterprises
 

Data mining - GDi Techno Solutions

  • 1. This is part of your ADVANCED ALGORITHMS IN COMPUTATIONAL BIOLOGY (C3), 1
  • 2. Class Info • Lecturer: Chi-Yao Tseng ( 曾祺堯 ) cytseng@citi.sinica.edu.tw • Grading: – No assignments – Midterm: • 2012/04/20 • I’m in charge of 17x2 points out of 120 • No take-home questions 2
  • 3. Outline • Introduction – From data warehousing to data mining • Mining Capabilities – Association rules – Classification – Clustering • More about Data Mining 3
  • 4. Main Reference • Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann, 2006. – Official website: http://www.cs.uiuc.edu/homes/hanj/bk2/ 4
  • 5. Why Data Mining? • The Explosive Growth of Data: from terabytes to petabytes (1015 B= 1 million GB) – Data collection and data availability • Automated data collection tools, database systems, Web, computerized society – Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, … • Science: Remote sensing, bioinformatics, scientific simulation, … • Society and everyone: news, digital cameras, YouTube, Facebook • We are drowning in data, but starving for knowledge! • “Necessity is the mother of invention”—Data mining— Automated analysis of massive data sets 5
  • 6. Why Not Traditional Data Analysis? • Tremendous amount of data – Algorithms must be highly scalable to handle such as terabytes of data • High-dimensionality of data – Micro-array may have tens of thousands of dimensions • High complexity of data • New and sophisticated applications 6
  • 7. Evolution of Database Technology • 1960s: – Data collection, database creation, IMS and network DBMS • 1970s: – Relational data model, relational DBMS implementation • 1980s: – RDBMS, advanced data models (extended-relational, OO, deductive, etc.) – Application-oriented DBMS (spatial, scientific, engineering, etc.) • 1990s: – Data mining, data warehousing, multimedia databases, and Web databases • 2000s – Stream data management and mining – Data mining and its applications – Web technology (XML, data integration) and global information systems 7
  • 8. What is Data Mining? • Knowledge discovery in databases – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data. • Alternative names: – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. 8
  • 9. Data Mining: On What Kinds of Data? • Database-oriented data sets and applications – Relational database, data warehouse, transactional database • Advanced data sets and advanced applications – Data streams and sensor data – Time-series data, temporal data, sequence data (incl. bio-sequences) – Structure data, graphs, social networks and multi-linked data – Object-relational databases – Heterogeneous databases and legacy databases – Spatial data and spatiotemporal data – Multimedia database – Text databases – The World-Wide Web 9
  • 10. Knowledge Discovery (KDD) Process
    – Pipeline: Databases → Data Cleaning & Integration → Data Warehouse → Selection & Transformation → Transformed Data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge
    – This is a view from typical database systems and data warehousing communities.
    – Data mining plays an essential role in the knowledge discovery process.
  • 11. Data Mining and Business Intelligence: increasing potential to support business decisions (bottom to top)
    – Decision Making (End User)
    – Data Presentation, Visualization Techniques (Business Analyst)
    – Data Mining, Information Discovery (Data Analyst)
    – Data Exploration: Statistical Summary, Querying, and Reporting
    – Data Preprocessing/Integration, Data Warehouses (DBA)
    – Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
  • 12. Data Mining: Confluence of Multiple Disciplines 12
  • 13. Typical Data Mining System (layered architecture, top to bottom)
    – Graphical User Interface
    – Pattern Evaluation (interacting with a Knowledge Base)
    – Data Mining Engine
    – Database or Data Warehouse Server (data cleaning, integration, and selection)
    – Data sources: Database, Data Warehouse, World-Wide Web, other info. repositories
  • 14. Data Warehousing • A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. —W. H. Inmon
  • 15. Data Warehousing • Subject-oriented: – Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. • Integrated: – Constructed by integrating multiple, heterogeneous data sources. • Time-variant: – Provide information from a historical perspective (e.g., past 5-10 years.) • Nonvolatile: – Operational update of data does not occur in the data warehouse environment – Usually requires only two operations: load data & access data. 15
  • 16. Data Warehousing • The process of constructing and using data warehouses • A decision support database that is maintained separately from the organization’s operational database • Support information processing by providing a solid platform of consolidated, historical data for analysis • Set up stages for effective data mining 16
  • 17. Illustration of Data Warehousing
    – Data sources in Taipei, New York, London, … are cleaned, transformed, integrated, and loaded into the data warehouse.
    – Clients access the warehouse through query and data analysis tools.
  • 18. OLTP vs. OLAP
    – OLTP (On-line Transaction Processing): short online transactions (update, insert, delete) on current and detailed data; versatile and high-volume; runs on the transaction database.
    – OLAP (On-line Analytical Processing): complex queries, analytics, data mining, and decision making on aggregated and historical data; static and low-volume; runs on the data warehouse.
  • 19. Multi-Dimensional View of Data Mining • Data to be mined – Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW • Knowledge to be mined – Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. – Multiple/integrated functions and mining at multiple levels • Techniques utilized – Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc. • Applications adapted – Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc. 19
  • 20. Mining Capabilities (1/4) • Multi-dimensional concept description: Characterization and discrimination – Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions • Frequent patterns (or frequent itemsets), association – Diaper → Beer [0.5%, 75%] (support, confidence)
  • 21. Mining Capabilities (2/4) • Classification and prediction – Construct models (functions) that describe and distinguish classes or concepts for future prediction • E.g., classify countries based on (climate), or classify cars based on (gas mileage) – Predict some unknown or missing numerical values 21
  • 22. Mining Capabilities (3/4) • Clustering – Class label is unknown: Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns – Maximizing intra-class similarity & minimizing interclass similarity • Outlier analysis – Outlier: Data object that does not comply with the general behavior of the data – Noise or exception? Useful in fraud detection, rare events analysis 22
  • 23. Mining Capabilities (4/4) • Time and ordering, trend and evolution analysis – Trend and deviation: e.g., regression analysis – Sequential pattern mining: e.g., digital camera → large SD memory – Periodicity analysis – Motifs and biological sequence analysis • Approximate and consecutive motifs – Similarity-based analysis
  • 24. More Advanced Mining Techniques • Data stream mining – Mining data that is ordered, time-varying, potentially infinite. • Graph mining – Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments) • Information network analysis – Social networks: actors (objects, nodes) and relationships (edges) • e.g., author networks in CS, terrorist networks – Multiple heterogeneous networks • A person could be in multiple information networks: friends, family, classmates, … – Links carry a lot of semantic information: Link mining • Web mining – Web is a big information network: from PageRank to Google – Analysis of Web information networks • Web community discovery, opinion mining, usage mining, …
  • 25. Challenges for Data Mining • Handling of different types of data • Efficiency and scalability of mining algorithms • Usefulness and certainty of mining results • Expression of various kinds of mining results • Interactive mining at multiple abstraction levels • Mining information from different sources of data • Protection of privacy and data security
  • 26. Brief Summary • Data mining: Discovering interesting patterns and knowledge from massive amounts of data • A natural evolution of database technology, in great demand, with wide applications • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation • Mining can be performed on a variety of data • Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
  • 27. A Brief History of Data Mining Society • 1989 IJCAI Workshop on Knowledge Discovery in Databases – Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991) • 1991-1994 Workshops on Knowledge Discovery in Databases – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky- Shapiro, P. Smyth, and R. Uthurusamy, 1996) • 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98) – Journal of Data Mining and Knowledge Discovery (1997) • ACM SIGKDD conferences since 1998 and SIGKDD Explorations • More conferences on data mining – PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc. More details here: http://www.kdnuggets.com/gpspubs/sigkdd-explorations-kdd-10-years.html • ACM Transactions on KDD starting in 2007 27
  • 28. Conferences and Journals on Data Mining
    – KDD conferences: ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD); SIAM Data Mining Conf. (SDM); (IEEE) Int. Conf. on Data Mining (ICDM); European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD); Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD); Int. Conf. on Web Search and Data Mining (WSDM)
    – Other related conferences: DB: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT; Web & IR: CIKM, WWW, SIGIR; ML & PR: ICML, CVPR, NIPS
    – Journals: Data Mining and Knowledge Discovery (DAMI or DMKD); IEEE Trans. on Knowledge and Data Eng. (TKDE); KDD Explorations; ACM Trans. on KDD
  • 29. CAPABILITIES OF DATA MINING 29
  • 31. Basic Concepts • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set • First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining • Motivation: Finding inherent regularities in data – What products were often purchased together?— Beer and diapers?! – What are the subsequent purchases after buying a PC? – What kinds of DNA are sensitive to this new drug? – Can we automatically classify web documents? • Applications – Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis 31
  • 32. Mining Association Rules • Transaction data analysis. Given: – A database of transactions (each tx has a list of items purchased) – Minimum confidence and minimum support • Find all association rules: the presence of one set of items implies the presence of another set of items, e.g., Diaper → Beer [0.5%, 75%] (support, confidence)
  • 33. Two Parameters • Confidence (how true the rule is) – The rule X & Y ⇒ Z has 90% confidence: 90% of customers who bought X and Y also bought Z. • Support (how useful the rule is) – Useful rules should have some minimum transaction support.
  • 34. Mining Strong Association Rules in Transaction Databases (1/2)
    – Measurement of rule strength in a transaction database: A → B [support, confidence]
    – support = Pr(A ∪ B) = (# of tx containing all items in A ∪ B) / (total # of tx)
    – confidence = Pr(B | A) = (# of tx containing all items in A ∪ B) / (# of tx containing A)
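  Both measures are easy to compute directly from a transaction list. Below is a minimal Python sketch (my own illustration, not code from the slides); the item names in the demo are made up.

    def support(itemset, transactions):
        """Fraction of transactions containing every item in `itemset`."""
        itemset = set(itemset)
        hits = sum(1 for tx in transactions if itemset <= set(tx))
        return hits / len(transactions)

    def confidence(antecedent, consequent, transactions):
        """Pr(consequent | antecedent) = support(A ∪ B) / support(A)."""
        return (support(set(antecedent) | set(consequent), transactions)
                / support(antecedent, transactions))

    if __name__ == "__main__":
        txs = [{"milk", "bread"}, {"milk", "bread", "butter"},
               {"bread"}, {"milk", "diaper", "beer"}, {"diaper", "beer"}]
        print(support({"diaper", "beer"}, txs))       # 0.4
        print(confidence({"diaper"}, {"beer"}, txs))  # 1.0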
  • 35. Mining Strong Association Rules in Transaction Databases (2/2) • We are often interested in only strong associations, i.e., – support ≥ min_sup – confidence ≥ min_conf • Examples: – milk → bread [5%, 60%] – tire and auto_accessories → auto_services [2%, 80%]. 35
  • 36. Example of Association Rules
    – Transactions: 1: {A, B, D}; 2: {A, C, D}; 3: {A, D, E}; 4: {B, E, F}; 5: {B, C, D, E, F}
    – Let min. support = 50%, min. confidence = 50%
    – Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
    – Association rules: A → D (s = 60%, c = 100%); D → A (s = 60%, c = 75%)
  • 37. Two Steps for Mining Association Rules • Determining “large (frequent) itemsets” – The main factor for overall performance – The downward closure property of frequent patterns • Any subset of a frequent itemset must be frequent • If {beer, diaper, nuts} is frequent, so is {beer, diaper} • i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper} • Generating rules 37
  • 38. The Apriori Algorithm • Apriori (R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.) – Derivation of large 1-itemsets L1: At the first iteration, scan all the transactions and count the number of occurrences of each item. – Level-wise derivation: At the kth iteration, the candidate set Ck consists of the k-itemsets whose every (k-1)-item subset is in Lk-1. Scan the DB and count the number of occurrences of each candidate itemset.
  • 39. The Apriori Algorithm—An Example (min. support = 2 tx's, i.e., 50%)
    – Database TDB: 100: {A, C, D}; 200: {B, C, E}; 300: {A, B, C, E}; 400: {B, E}
    – 1st scan, C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3 → L1: {A}:2, {B}:3, {C}:3, {E}:3
    – C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}; 2nd scan: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2 → L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
    – C3: {B,C,E}; 3rd scan → L3: {B,C,E}:2
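  The level-wise search of the previous two slides can be written compactly. The following Python sketch is my own illustration (not the lecturer's or the VLDB'94 code); run on the four-transaction TDB above, it reproduces L1, L2, and L3.

    from itertools import combinations

    def apriori(transactions, min_sup):
        transactions = [frozenset(t) for t in transactions]
        # L1: frequent 1-itemsets from the first scan
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent = dict(Lk)
        k = 2
        while Lk:
            # Candidate generation: join L(k-1) with itself, then prune candidates
            # that have an infrequent (k-1)-subset (downward closure property).
            prev = list(Lk)
            candidates = set()
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k and all(
                            frozenset(sub) in Lk
                            for sub in combinations(union, k - 1)):
                        candidates.add(union)
            # One database scan to count all candidates of size k
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            Lk = {s: c for s, c in counts.items() if c >= min_sup}
            frequent.update(Lk)
            k += 1
        return frequent

    if __name__ == "__main__":
        tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
        for itemset, sup in sorted(apriori(tdb, 2).items(),
                                   key=lambda x: (len(x[0]), sorted(x[0]))):
            print(set(itemset), sup)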
  • 40. From Large Itemsets to Rules
    – For each large itemset m, for each subset p of m: if sup(m) / sup(m−p) ≥ min_conf, output the rule (m−p) → p, with conf. = sup(m)/sup(m−p) and support = sup(m)
    – Example: m = {a,c,d,e,f,g} with 2000 tx's, p = {c,e,f,g}, m−p = {a,d} with 5000 tx's
    – conf. = #{a,c,d,e,f,g} / #{a,d} = 2000/5000 = 40%; rule: {a,d} → {c,e,f,g}, confidence 40%, support 2000 tx's
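  A matching sketch for the rule-generation step, again my own illustration. It assumes a dictionary mapping frequent itemsets to their support counts; the demo hard-codes the frequent itemsets found in the slide-39 example.

    from itertools import combinations

    def generate_rules(frequent, min_conf):
        rules = []
        for m, sup_m in frequent.items():
            if len(m) < 2:
                continue
            # every non-empty proper subset p of m can serve as the right-hand side
            for r in range(1, len(m)):
                for p in combinations(m, r):
                    p = frozenset(p)
                    antecedent = m - p
                    conf = sup_m / frequent[antecedent]   # sup(m) / sup(m - p)
                    if conf >= min_conf:
                        rules.append((antecedent, p, sup_m, conf))
        return rules

    if __name__ == "__main__":
        freq = {
            frozenset("A"): 2, frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
            frozenset("AC"): 2, frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
            frozenset("BCE"): 2,
        }
        for a, c, sup, conf in generate_rules(freq, 0.7):
            print(set(a), "->", set(c), f"sup={sup}, conf={conf:.2f}")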
  • 41. Redundant Rules • For the same support and confidence, if we have a rule {a,d} →{c,e,f,g}, do we have [agga98a]: – {a,d} →{c,e,f} ? Yes! – {a} →{c,e,f,g} ? Yes! – {a,d,c} →{e,f,g} ? No! – {a} →{c,d,e,f,g} ? No! 41
  • 42. Practice • Suppose we additionally have – 500 ACE – 600 BCD – Support = 3 tx’s (50%), confidence = 66% • Repeat the large itemset generation – Identify all large itemsets • Derive up to 4 rules – Generate rules from the large itemsets with the biggest number of elements (from big to small) 42
  • 43. Discussion of the Apriori Algorithm • Apriori (R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.) – Derivation of large 1-itemsets L1: At the first iteration, scan all the transactions and count the number of occurrences of each item. – Level-wise derivation: At the kth iteration, the candidate set Ck consists of the k-itemsets whose every (k-1)-item subset is in Lk-1. Scan the DB and count the number of occurrences of each candidate itemset. • The cardinality (number of elements) of C2 is huge. • The execution time of the first 2 iterations is the dominating factor for overall performance! • Database scans are expensive.
  • 44. Improvement of the Apriori Algorithm • Reduce passes of transaction database scans • Shrink the number of candidates • Facilitate the support counting of candidates 44
  • 45. Example Improvement 1- Partition: Scan Database Only Twice • Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB – Scan 1: partition database and find local frequent patterns – Scan 2: consolidate global frequent patterns • A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB’95 45
  • 46. Example Improvement 2- DHP • DHP (direct hashing with pruning): Apriori + hashing – Use hash-based method to reduce the size of C2. – Allow effective reduction on tx database size (tx number and each tx size.) Tid Items 100 A, C, D 200 B, C, E 300 A, B, C, E 400 B, E J. Park, M.-S. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95. 46
  • 47. Mining Frequent Patterns w/o Candidate Generation • A highly compact data structure: frequent pattern tree. • An FP-tree-based pattern fragment growth mining method. • Search technique in mining: partitioning- based, divide-and-conquer method. • J. Han, J. Pei, Y. Yin, Mining Frequent Patterns without Candidate Generation, in SIGMOD’2000. 47
  • 48. Frequent Pattern Tree (FP-tree) • 3 parts: – One root labeled as 'null' – A set of item prefix subtrees – Frequent item header table • Each node in the prefix subtree consists of – Item name – Count – Node-link • Each entry in the frequent-item header table consists of – Item-name – Head of node-link
  • 49. The FP-tree Structure
    – Frequent-item header table: f, c, a, b, m, p (each entry points to the head of that item's node-links)
    – Tree (rooted at null): f:4 → c:3 → a:3 → {m:2 → p:2, b:1 → m:1}; f:4 → b:1; and a second branch c:1 → b:1 → p:1
  • 50. FP-tree Construction: Step1 • Scan the transaction database DB once (the first time), and derives a list of frequent items. • Sort frequent items in frequency descending order. • This ordering is important since each path of a tree will follow this order. 50
  • 51. Example (min. support = 3)
    – Tx 100: items bought f,a,c,d,g,i,m,p; (ordered) frequent items f,c,a,m,p
    – Tx 200: a,b,c,f,l,m,o; f,c,a,b,m
    – Tx 300: b,f,h,j,o; f,b
    – Tx 400: b,c,k,s,p; c,b,p
    – Tx 500: a,f,c,e,l,p,m,n; f,c,a,m,p
    – List of frequent items: (f:4), (c:4), (a:3), (b:3), (m:3), (p:3); the frequent-item header table has entries f, c, a, b, m, p
  • 52. FP-tree Construction: Step 2 • Create a root of a tree, label with “null” • Scan the database the second time. The scan of the first tx leads to the construction of the first branch of the tree. Scan of 1st transaction: f,a,c,d,g,i,m,p root The 1st branch of the tree <(f:1),(c:1),(a:1),(m:1),(p:1)> f:1 c:1 a:1 m:1 p:1 52
  • 53. FP-tree Construction: Step 2 (cont'd)
    – Scan of the 2nd transaction a,b,c,f,l,m,o → ordered as f,c,a,b,m: the shared prefix f,c,a increments existing counts (f:2, c:2, a:2), and two new nodes (b:1) and (m:1) are created.
  • 54. The Resulting FP-tree
    – After inserting all five (ordered) transactions, the tree rooted at null is: f:4 → c:3 → a:3 → {m:2 → p:2, b:1 → m:1}; f:4 → b:1; plus a second branch c:1 → b:1 → p:1.
    – The frequent-item header table (f, c, a, b, m, p) links each item to its occurrences in the tree.
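  The two construction passes can be sketched as follows. This is an illustrative Python implementation of my own (not code from the FP-growth paper); ties between items with equal support are broken by a fixed alphabetical rule, so the printed tree may order such items slightly differently from the figure while still being an equivalent FP-tree.

    from collections import defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.count, self.parent = item, 1, parent
            self.children = {}          # item -> FPNode

    def build_fptree(transactions, min_sup):
        # Pass 1: count items and keep only the frequent ones.
        freq = defaultdict(int)
        for t in transactions:
            for item in t:
                freq[item] += 1
        order = {i: c for i, c in freq.items() if c >= min_sup}

        # Pass 2: insert each transaction's frequent items in frequency-descending order.
        root = FPNode(None, None)
        header = defaultdict(list)      # item -> list of nodes (stand-in for node-links)
        for t in transactions:
            items = sorted((i for i in t if i in order), key=lambda i: (-order[i], i))
            node = root
            for item in items:
                if item in node.children:
                    node.children[item].count += 1
                else:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)
                node = node.children[item]
        return root, header

    def dump(node, depth=0):
        for child in node.children.values():
            print("  " * depth + f"{child.item}:{child.count}")
            dump(child, depth + 1)

    if __name__ == "__main__":
        db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
              list("bcksp"), list("afcelpmn")]
        root, header = build_fptree(db, min_sup=3)
        dump(root)   # prints the constructed prefix tree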
  • 55. Mining Process
    – Start from the least frequent item p and work bottom-up through the header table; mining order: p → m → b → a → c → f
  • 56. Mining Process for Item p (min. support = 3)
    – p lies on two paths: <f:4, c:3, a:3, m:2, p:2> and <c:1, b:1, p:1>
    – Conditional pattern base of "p": <f:2, c:2, a:2, m:2> and <c:1, b:1>
    – Conditional frequent pattern: <c:3>
    – So we have two frequent patterns: {p:3}, {cp:3}
  • 57. Mining Process for Item m (min. support = 3)
    – m lies on two paths: <f:4, c:3, a:3, m:2> and <f:4, c:3, a:3, b:1, m:1>
    – Conditional pattern base of "m": <f:2, c:2, a:2> and <f:1, c:1, a:1, b:1>
    – Conditional frequent pattern: <f:3, c:3, a:3>
  • 58. Mining m's Conditional FP-tree
    – Mine(<f:3, c:3, a:3> | m) yields (am:3), (cm:3), (fm:3); recursing, Mine(<f:3, c:3> | am) yields (cam:3) and (fam:3), Mine(<f:3> | cm) yields (fcm:3), and Mine(<f:3> | cam) yields (fcam:3).
    – So we have the frequent patterns: {m:3}, {am:3}, {cm:3}, {fm:3}, {cam:3}, {fam:3}, {fcm:3}, {fcam:3}
  • 59. Analysis of the FP-tree-based Method • Finds the complete set of frequent itemsets • Efficient because – It works on a reduced set of pattern bases – Its mining operations are less costly than candidate generation and testing • Cons: – No advantage if most tx's are short – The FP-tree does not always fit into main memory
  • 60. Generalized Association Rules • Given the class hierarchy (taxonomy), one would like to choose proper data granularities for mining. • Different confidence/support may be considered. • R. Srikant and R. Agrawal, Mining generalized association rules, VLDB’95. 60
  • 61. Generalized Association Rules: Example
    – Concept hierarchy: Clothes → {Outerwear, Shirts}; Outerwear → {Jackets, Ski Pants}; Footwear → {Shoes, Hiking Boots}
    – Transactions: 100: Shirt; 200: Jacket, Hiking Boots; 300: Ski Pants, Hiking Boots; 400: Shoes; 500: Shoes; 600: Jacket
    – Frequent itemsets (support): Jacket 2, Outerwear 3, Clothes 4, Shoes 2, Hiking Boots 2, Footwear 4, {Outerwear, Hiking Boots} 2, {Clothes, Hiking Boots} 2, {Outerwear, Footwear} 2, {Clothes, Footwear} 2
    – Rules (min sup 30%, min conf 60%): Outerwear → Hiking Boots (33%, 66%); Outerwear → Footwear (33%, 66%); Hiking Boots → Outerwear (33%, 100%); Hiking Boots → Clothes (33%, 100%); Jacket → Hiking Boots (16%, 50%) fails the thresholds
  • 62. Generalized Association Rules: Multi-Level Support Thresholds
    – Example items: Milk [support = 10%], 2% Milk [support = 6%], Skim Milk [support = 4%]
    – Uniform support: the same min_sup at every level (e.g., Level 1 and Level 2 both 5%); if min_sup is set too high (e.g., 12%), lower-level items are not examined at all (level filtering).
    – Reduced support: a lower min_sup at deeper levels (e.g., Level 1 min_sup = 5%, Level 2 min_sup = 3%), so items such as Skim Milk are still found.
  • 63. Other Relevant Topics • Max patterns – R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98. • Closed patterns – N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99. • Sequential Patterns – What items one will purchase if he/she has bought some certain items. – R. Srikant and R. Agrawal, Mining sequential patterns, ICDE’95 • Traversal Patterns – Mining path traversal patterns in a web environment where documents or objects are linked together to facilitate interactive access. – M.-S. Chen, J. Park and P. Yu. Efficient Data Mining for Path Traversal Patterns. TKDE’98. and more… 63
  • 65. Classification • Classifying tuples in a database. • Each tuple has some attributes with known values. • In training set E – Each tuple consists of the same set of multiple attributes as the tuples in the large database W. – Additionally, each tuple has a known class identity. 65
  • 66. Classification (cont’d) • Derive the classification mechanism from the training set E, and then use this mechanism to classify general data (in testing set.) • A decision tree based approach has been influential in machine learning studies. 66
  • 67. Classification – Step 1: Model Construction
    – Train a model (e.g., classification rules) from the existing data pool using a classification algorithm.
    – Training data (name, age, income, own cars?): Sandy, <=30, low, no; Bill, <=30, low, yes; Fox, 31…40, high, yes; Susan, >40, med, no; Claire, >40, med, no; Andy, 31…40, high, yes
  • 68. Classification – Step 2: Model Usage
    – Apply the classification rules to the testing data (name, age, income, own cars?): John, >40, high, ? → No; Sally, <=30, low, ? → No; Annie, 31…40, high, ? → Yes
  • 69. What is Prediction? • Prediction is similar to classification – First, construct model – Second, use model to predict future of unknown objects • Prediction is different from classification – Classification refers to predict categorical class label. – Prediction refers to predict continuous values. • Major method: regression 69
  • 70. Supervised vs. Unsupervised Learning • Supervised learning (e.g., classification) – Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations. • Unsupervised learning (e.g., clustering) – We are given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data. – No training data, or the “training data” are not accompanied by class labels. 70
  • 71. Evaluating Classification Methods • Predictive accuracy • Speed – Time to construct the model and time to use the model • Robustness – Handling noise and missing values • Scalability – Efficiency in large databases (not memory resident data) • Goodness of rules – Decision tree size – The compactness of classification rules 71
  • 72. A Decision-Tree Based Classification
    – A decision tree for whether to play tennis: split on outlook; if sunny, test humidity (high → N, low → P); if overcast → P; if rainy, test windy (yes → N, no → P).
    – ID3 and its extended version C4.5 (Quinlan '93): a top-down decision tree generation algorithm
  • 73. Algorithm for Decision Tree Induction (1/2)
    – Basic algorithm (a greedy algorithm)
    – The tree is constructed in a top-down, recursive, divide-and-conquer manner.
    – Attributes are categorical (if an attribute is continuous, it is discretized in advance; e.g., 0 <= age <= 100 is binned into 0~20, 21~40, 41~60, 61~80, 81~100).
    – At the start, all the training examples are at the root.
    – Examples are partitioned recursively based on selected attributes.
  • 74. Algorithm for Decision Tree Induction (2/2) • Basic algorithm (a greedy algorithm) – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain): maximizing an information gain measure, i.e., favoring the partitioning which makes the majority of examples belong to a single class. – Conditions for stopping partitioning: • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left 74
  • 75. Decision Tree Induction: Training Dataset (age, income, student, credit_rating → buys_computer)
    – <=30, high, no, fair → no; <=30, high, no, excellent → no; 31…40, high, no, fair → yes; >40, medium, no, fair → yes; >40, low, yes, fair → yes; >40, low, yes, excellent → no; 31…40, low, yes, excellent → yes
    – <=30, medium, no, fair → no; <=30, low, yes, fair → yes; >40, medium, yes, fair → yes; <=30, medium, yes, excellent → yes; 31…40, medium, no, excellent → yes; 31…40, high, yes, fair → yes; >40, medium, no, excellent → no
  • 76. The first split: Age? with branches <=30, 31…40, >40
  • 77. Primary Issues in Tree Construction (1/2) • Split criterion: Goodness function – Used to select the attribute to be split at a tree node during the tree generation phase – Different algorithms may use different goodness functions: • Information gain (used in ID3/C4.5) • Gini index (used in CART) 77
  • 78. Primary Issues in Tree Construction (2/2)
    – Branching scheme: determining the tree branch to which a sample belongs; binary vs. k-ary splitting (e.g., Income: high / medium / low)
    – When to stop further splitting of a node? e.g., an impurity measure
    – Labeling rule: a node is labeled with the class to which most samples at the node belong.
  • 79. How to Use a Tree? • Directly – Test the attribute value of unknown sample against the tree. – A path is traced from root to a leaf which holds the label. • Indirectly – Decision tree is converted to classification rules. – One rule is created for each path from the root to a leaf. – IF-THEN is easier for humans to understand . 79
  • 80. Attribute Selection Measure: Information Gain (ID3/C4.5)
    – Select the attribute with the highest information gain.
    – Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
    – Expected information (entropy) needed to classify a tuple in D: Info(D) = −Σ_{i=1..m} pi log2(pi)
    – Entropy measures how "mixed up" an attribute is; it is sometimes equated with the purity or impurity of a variable. High entropy means we are sampling from a (near-)uniform distribution.
  • 81. Expected Information (Entropy)
    – Info(D) = −Σ_{i=1..m} pi log2(pi), where m is the number of class labels
    – Example: Info(D) = I(3,2) = −(3/5) log2(3/5) − (2/5) log2(2/5) ≈ −(3/5)(−0.737) − (2/5)(−1.322) ≈ 0.971
    – Example: Info(D) = I(5,0) = −(5/5) log2(5/5) − (0/5) log2(0/5) = 0 − 0 = 0
  • 82. Attribute Selection Measure: Information Gain (ID3/C4.5), cont'd
    – Expected information (entropy) needed to classify a tuple in D: Info(D) = −Σ_{i=1..m} pi log2(pi)
    – Information needed after using A to split D into v partitions: Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)
    – Information gained by branching on attribute A: Gain(A) = Info(D) − Info_A(D)
  • 83. Expected Information (Entropy) after a Split
    – Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)
    – Example: Info_A(D) = (2/5) Info(1,1) + (3/5) Info(2,1)
    – Example: Info_A(D) = (2/5) Info(2,0) + (3/5) Info(3,0)
  • 84. Attribute Selection: Information Gain
    – Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)
    – Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
    – For age: <=30 has 2 yes / 3 no, I(2,3) = 0.971; 31…40 has 4 yes / 0 no, I(4,0) = 0; >40 has 3 yes / 2 no, I(3,2) = 0.971
    – Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694, so Gain(age) = Info(D) − Info_age(D) = 0.246
    – Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048; age is therefore selected as the splitting attribute.
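  The worked example is easy to verify programmatically. The sketch below (mine, not the lecturer's) recomputes Info(D) and the four gains on the training table from slide 75.

    from math import log2
    from collections import Counter, defaultdict

    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def info_gain(rows, attr, target="buys_computer"):
        base = entropy([r[target] for r in rows])            # Info(D)
        parts = defaultdict(list)
        for r in rows:
            parts[r[attr]].append(r[target])
        after = sum(len(p) / len(rows) * entropy(p) for p in parts.values())  # Info_A(D)
        return base - after

    if __name__ == "__main__":
        data = [
            ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
            ("31...40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
            (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
            ("31...40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
            ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
            ("<=30","medium","yes","excellent","yes"), ("31...40","medium","no","excellent","yes"),
            ("31...40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
        ]
        cols = ["age", "income", "student", "credit_rating", "buys_computer"]
        rows = [dict(zip(cols, r)) for r in data]
        for a in cols[:-1]:
            # expected: age 0.246, income 0.029, student 0.151, credit_rating 0.048
            print(a, round(info_gain(rows, a), 3))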
  • 85. Gain Ratio for Attribute Selection (C4.5)
    – The information gain measure is biased towards attributes with a large number of values.
    – C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain): SplitInfo_A(D) = −Σ_{j=1..v} (|Dj| / |D|) × log2(|Dj| / |D|), and GainRatio(A) = Gain(A) / SplitInfo_A(D)
    – Example for income (4 high, 6 medium, 4 low tuples): SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) ≈ 1.557, so GainRatio(income) = 0.029 / 1.557 ≈ 0.019
    – The attribute with the maximum gain ratio is selected as the splitting attribute.
  • 86. Gini Index (CART, IBM IntelligentMiner)
    – If a data set D contains examples from n classes, the gini index is Gini(D) = 1 − Σ_{j=1..n} pj², where pj is the relative frequency of class j in D.
    – If D is split on A into two subsets D1 and D2: Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
    – Reduction in impurity: ΔGini(A) = Gini(D) − Gini_A(D)
    – The attribute that provides the smallest Gini_A(D) (i.e., the largest reduction in impurity) is chosen to split the node (all possible splitting points for each attribute need to be enumerated).
  • 87. Gini Index: Example
    – D has 9 tuples with buys_computer = "yes" and 5 with "no": Gini(D) = 1 − (9/14)² − (5/14)² = 0.459
    – Suppose the attribute income partitions D into D1: {low, medium} with 10 tuples and D2: {high} with 4 tuples: Gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.45 = Gini_income∈{high}(D)
    – But Gini_income∈{medium,high}(D) is 0.30 and thus the best, since it is the lowest.
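  A tiny sketch (my own) of the Gini computations; the binary-split counts passed in the demo are tallied by hand from the slide-75 training table, which is why that figure differs slightly from the one quoted on the slide.

    def gini(counts):
        """Gini index from a list of per-class counts, e.g. [9, 5] -> 0.459."""
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    def gini_split(part1_counts, part2_counts):
        """Weighted Gini index of a binary split given per-class counts in each part."""
        n1, n2 = sum(part1_counts), sum(part2_counts)
        n = n1 + n2
        return n1 / n * gini(part1_counts) + n2 / n * gini(part2_counts)

    if __name__ == "__main__":
        print(round(gini([9, 5]), 3))              # 0.459, the whole data set D
        # income in {low, medium} (7 yes / 3 no) vs. {high} (2 yes / 2 no):
        print(round(gini_split([7, 3], [2, 2]), 3))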
  • 88. Other Attribute Selection Measures • CHAID: a popular decision tree algorithm; its measure is based on the χ2 test for independence • C-SEP: performs better than info. gain and gini index in certain cases • G-statistic: has a close approximation to the χ2 distribution • MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): – The best tree is the one that requires the fewest # of bits to both (1) encode the tree and (2) encode the exceptions to the tree • Multivariate splits (partition based on multiple variable combinations) – CART: finds multivariate splits based on a linear combination of attributes. Which attribute selection measure is the best? Most give good results; none is significantly superior to the others.
  • 89. Other Types of Classification Methods • Bayes Classification Methods • Rule-Based Classification • Support Vector Machine (SVM) • Some of these methods will be taught in the following lessons. 89
  • 91. What is Cluster Analysis? • Cluster: a collection of data objects – Similar to one another within the same cluster – Dissimilar to the objects in other clusters • Cluster Analysis – Grouping a set of data objects into clusters • Typical applications: – As a stand-alone tool to get insight into data distribution – As a preprocessing step for other algorithms 91
  • 92. General Applications of Clustering • Spatial data analysis – Create thematic maps in GIS by clustering feature spaces. – Detect spatial clusters and explain them in spatial data mining. • Image Processing • Pattern recognition • Economic Science (especially market research) • WWW – Document classification – Cluster Web-log data to discover groups of similar access patterns 92
  • 93. Examples of Clustering Applications • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs. • Land use: Identification of areas of similar land use in an earth observation database. • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost. • City-planning: Identifying groups of houses according to their house type, value, and geographical location. 93
  • 94. What is Good Clustering? • A good clustering method will produce high quality clusters with – High intra-class similarity – Low inter-class similarity • The quality of a clustering result depends on both the similarity measure used by the method and its implementation. • The quality of a clustering method is also measured by its ability to discover hidden patterns. 94
  • 95. Requirements of Clustering in Data Mining (1/2) • Scalability • Ability to deal with different types of attributes • Discovery of clusters with arbitrary shape • Minimal requirements of domain knowledge for input • Able to deal with outliers 95
  • 96. Requirements of Clustering in Data Mining (2/2) • Insensitive to order of input records • High dimensionality – Curse of dimensionality • Incorporation of user-specified constraints • Interpretability and usability 96
  • 97. Clustering Methods (I) • Partitioning Method – Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors – K-means, k-medoids, CLARANS • Hierarchical Method – Create a hierarchical decomposition of the set of data (or objects) using some criterion – Diana, Agnes, BIRCH, ROCK, CHAMELEON • Density-based Method – Based on connectivity and density functions – Typical methods: DBSACN, OPTICS, DenClue 97
  • 98. Clustering Methods (II) • Grid-based approach – Based on a multiple-level granularity structure – Typical methods: STING, WaveCluster, CLIQUE • Model-based approach – A model is hypothesized for each of the clusters, and the aim is to find the best fit of the data to that model – Typical methods: EM, SOM, COBWEB • Frequent pattern-based – Based on the analysis of frequent patterns – Typical methods: pCluster • User-guided or constraint-based – Clustering by considering user-specified or application-specific constraints – Typical methods: cluster-on-demand, constrained clustering
  • 99. Typical Alternatives to Calculate the Distance between Clusters • Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(tip, tjq) • Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(tip, tjq) • Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(tip, tjq) • Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj) • Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj) – Medoid: one chosen, centrally located object in the cluster 99
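  These alternatives translate directly into code. A sketch of the pairwise-linkage distances (my own illustration) for clusters of numeric tuples:

    from math import dist          # Euclidean distance, Python 3.8+

    def single_link(c1, c2):       # smallest pairwise distance
        return min(dist(p, q) for p in c1 for q in c2)

    def complete_link(c1, c2):     # largest pairwise distance
        return max(dist(p, q) for p in c1 for q in c2)

    def average_link(c1, c2):      # average pairwise distance
        return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

    def centroid_distance(c1, c2): # distance between the two centroids
        cen = lambda c: tuple(sum(x) / len(c) for x in zip(*c))
        return dist(cen(c1), cen(c2))

    if __name__ == "__main__":
        a = [(0, 0), (0, 1)]
        b = [(3, 0), (4, 1)]
        print(single_link(a, b), complete_link(a, b),
              average_link(a, b), centroid_distance(a, b))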
  • 100. Centroid, Radius and Diameter of a Cluster (for numerical data sets)
    – Centroid, the "middle" of a cluster: Cm = (Σ_{i=1..N} t_ip) / N
    – Radius, the square root of the average squared distance from the points of the cluster to its centroid: Rm = sqrt( Σ_{i=1..N} (t_ip − Cm)² / N )
    – Diameter, the square root of the average squared distance between all pairs of points in the cluster: Dm = sqrt( Σ_{i=1..N} Σ_{j=1..N} (t_ip − t_jq)² / (N(N−1)) )
    – Note: diameter != 2 × radius
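  A sketch (mine) of the three definitions for a cluster of d-dimensional points:

    from math import sqrt, dist

    def centroid(points):
        return tuple(sum(x) / len(points) for x in zip(*points))

    def radius(points):
        c = centroid(points)
        return sqrt(sum(dist(p, c) ** 2 for p in points) / len(points))

    def diameter(points):
        n = len(points)
        # sum over all ordered pairs; p == q contributes 0
        total = sum(dist(p, q) ** 2 for p in points for q in points)
        return sqrt(total / (n * (n - 1)))

    if __name__ == "__main__":
        cluster = [(0, 0), (2, 0), (0, 2), (2, 2)]
        print(centroid(cluster), round(radius(cluster), 3), round(diameter(cluster), 3))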
  • 101. Partitioning Algorithms: Basic Concept • Partitioning method: construct a partition of a database D of n objects into a set of k clusters. • Given a number k, find a partition of k clusters that optimizes the chosen partitioning criterion. – Global optimum: exhaustively enumerate all partitions. – Heuristic methods: k-means, k-medoids • k-means (MacQueen '67) • k-medoids or PAM, partition around medoids (Kaufman & Rousseeuw '87)
  • 102. The K-Means Clustering Method • Given k, the k-means algorithm is implemented in four steps: 1. Arbitrarily choose k points as initial cluster centroids. 2. Update Means (Centroids): Compute seed points as the center of the clusters of the current partition. loop (center: mean point of the cluster) 3. Re-assign Points: Assign each object to the cluster with the nearest seed point. 4. Go back to Step 2, stop when no more new assignment. 102
  • 103. Example of the K-Means Clustering Method (figure)
    – Given k = 2: arbitrarily choose k objects as initial cluster centroids, assign each object to the most similar centroid, update the cluster means, and re-assign; repeat until the assignments no longer change.
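  The four steps translate into a short program. A compact k-means sketch for 2-D points (my own, not the lecturer's):

    import random
    from math import dist

    def kmeans(points, k, max_iter=100, seed=0):
        rng = random.Random(seed)
        centroids = rng.sample(points, k)            # step 1: arbitrary initial centroids
        for _ in range(max_iter):
            # step 2: assign each point to the nearest centroid
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda j: dist(p, centroids[j]))
                clusters[i].append(p)
            # step 3: recompute centroids as the mean of each cluster
            new_centroids = [
                tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
                for i, c in enumerate(clusters)
            ]
            if new_centroids == centroids:           # step 4: stop when nothing moves
                break
            centroids = new_centroids
        return centroids, clusters

    if __name__ == "__main__":
        pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
        centers, groups = kmeans(pts, k=2)
        print(centers)
        print(groups)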
  • 104. Comments on the K-Means Clustering • Time Complexity: O(tkn), where n is # of objects, k is # of clusters, and t is # of iterations. Normally, k,t<<n. • Often terminates at a local optimum. (The global optimum may be found using techniques such as: deterministic annealing and genetic algorithms) • Weakness: – Applicable only when mean is defined, how about categorical data? – Need to specify k, the number of clusters, in advance – Unable to handle noisy data and outliers 104
  • 105. Why is K-Means Unable to Handle Outliers? • The k-means algorithm is sensitive to outliers – Since an object with an extremely large value may substantially distort the distribution of the data. X • K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster. 105
  • 106. PAM: The K-Medoids Method
    – PAM: Partition Around Medoids; uses real objects to represent the clusters
    – 1. Randomly select k representative objects as medoids.
    – 2. Assign each data point to the closest medoid.
    – 3. For each medoid m and each non-medoid data point o, swap m and o and compute the total cost of the configuration.
    – 4. Select the configuration with the lowest cost.
    – 5. Repeat steps 2 to 4 until there is no change in the medoids.
  • 107. A Typical K-Medoids Algorithm (PAM) (figure, k = 2)
    – Arbitrarily choose k objects as initial medoids (m1, m2), assign each remaining object to the nearest medoid, then swap each medoid with each data point and compute the total cost of the resulting configuration.
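  A simple PAM-style sketch (my own, using a greedy first-improvement swap rather than the full swapping-cost bookkeeping detailed on the next slide); the total cost of a configuration is the sum of distances from each point to its nearest medoid.

    import random
    from math import dist

    def total_cost(points, medoids):
        return sum(min(dist(p, m) for m in medoids) for p in points)

    def pam(points, k, seed=0):
        rng = random.Random(seed)
        medoids = rng.sample(points, k)
        best = total_cost(points, medoids)
        improved = True
        while improved:
            improved = False
            for i in range(k):
                for o in points:
                    if o in medoids:
                        continue
                    candidate = medoids[:i] + [o] + medoids[i + 1:]
                    cost = total_cost(points, candidate)
                    if cost < best:                 # keep the cheaper configuration
                        medoids, best, improved = candidate, cost, True
        # final assignment of points to their nearest medoid
        clusters = {m: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: dist(p, m))
            clusters[nearest].append(p)
        return medoids, clusters

    if __name__ == "__main__":
        pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (25, 25)]  # (25, 25) acts as an outlier
        meds, clus = pam(pts, k=2)
        print(meds)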
  • 108. PAM Clustering: Total Swapping Cost TCih = Σj Cjih
    – Notation: i is the medoid considered for removal, h is the non-medoid it is swapped with, t is another current medoid, and j is any non-selected object.
    – If j currently belongs to i and, after the swap, is closest to h (d(j,h) < d(j,t)): Cjih = d(j,h) − d(j,i)
    – If j currently belongs to i but, after the swap, is closest to another medoid t (d(j,h) > d(j,t)): Cjih = d(j,t) − d(j,i)
    – If j belongs to another medoid t and stays with t: Cjih = 0
    – If j belongs to another medoid t but becomes closer to h after the swap: Cjih = d(j,h) − d(j,t)
  • 109. What is the Problem with PAM? • PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean. • PAM works efficiently for small data sets but does not scale well for large data sets. – O( k(n-k)(n-k) ) for each iteration, where n is # of data, k is # of clusters – Improvements: CLARA (uses a sampled set to determine medoids), CLARANS 109
  • 110. Hierarchical Clustering
    – Uses the distance matrix as the clustering criterion.
    – Does not require the number of clusters k as an input, but needs a termination condition.
    – Agglomerative (AGNES) works bottom-up: {a}, {b}, {c}, {d}, {e} → {ab}, {de} → {cde} → {abcde}; divisive (DIANA) runs the same steps in reverse, top-down.
  • 111. AGNES (Agglomerative Nesting)
    – Introduced in Kaufmann and Rousseeuw (1990)
    – Uses the single-link method and the dissimilarity matrix.
    – Merges the nodes that have the least dissimilarity, and goes on in a non-descending fashion.
    – Eventually all nodes belong to the same cluster.
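  A bare-bones single-link agglomerative sketch (my own illustration): start with singleton clusters and repeatedly merge the pair with the smallest single-link distance until the desired number of clusters remains.

    from math import dist

    def single_link(c1, c2):
        return min(dist(p, q) for p in c1 for q in c2)

    def agnes(points, target_k=1):
        clusters = [[p] for p in points]       # every object starts as its own cluster
        merges = []                            # bottom-up merge history (the dendrogram)
        while len(clusters) > target_k:
            # find the closest pair of clusters under single-link distance
            i, j = min(
                ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]))
            merges.append((clusters[i], clusters[j]))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return clusters, merges

    if __name__ == "__main__":
        pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
        final, history = agnes(pts, target_k=2)
        print(final)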
  • 112. Dendrogram: Shows How the Clusters are Merged  Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.  A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
  • 113. DIANA (Divisive Analysis)
    – Introduced in Kaufmann and Rousseeuw (1990)
    – Inverse order of AGNES: start with all objects in one cluster and split.
    – Eventually each node forms a cluster on its own.
  • 114. More on Hierarchical Clustering • Major weaknesses: – Does not scale well: time complexity is at least O(n²), where n is the total number of objects. – Can never undo what was done previously. • Integration of hierarchical with distance-based clustering – BIRCH (1996): uses the CF-tree data structure and incrementally adjusts the quality of sub-clusters. – CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.
  • 115. Density-Based Clustering Methods • Clustering based on density (local cluster criterion), such as density-connected points • Major features: – Discover clusters of arbitrary shape – Handle noise – One scan – Need density parameters as termination condition • Several interesting studies: – DBSCAN: Ester, et al. (KDD’96) – OPTICS: Ankerst, et al (SIGMOD’99). – DENCLUE: Hinneburg & D. Keim (KDD’98) – CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based) 115
  • 116. Density-Based Clustering: Basic Concepts • Two parameters: – Eps: Maximum radius of the neighborhood – MinPts: Minimum number of points in an Eps-neighborhood of that point Eps 116
  • 117. Density-Based Clustering: Basic Concepts
    – Two parameters: Eps (maximum radius of the neighborhood) and MinPts (minimum number of points in an Eps-neighborhood of a point)
    – NEps(q) = {p | dist(p, q) <= Eps}, where p and q are data points
    – Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if p belongs to NEps(q) and q satisfies the core point condition |NEps(q)| >= MinPts (figure: MinPts = 5, Eps = 1 cm)
  • 118. Density-Reachable and Density-Connected
    – Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that each pi+1 is directly density-reachable from pi.
    – Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
  • 119. DBSCAN: Density-Based Spatial Clustering of Applications with Noise
    – Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
    – Discovers clusters of arbitrary shape in spatial databases with noise.
    – (Figure: core vs. border points, with Eps = 1 cm, MinPts = 5)
  • 120. DBSCAN: The Algorithm
    – Arbitrarily select an unvisited point p.
    – Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
    – If p is a core point, a cluster is formed; mark all these points as visited.
    – If p is a border point (no points are density-reachable from p), mark p as visited and move on to the next point of the database.
    – Continue the process until all points have been visited.
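  The algorithm above can be sketched as follows (a didactic O(n²) implementation of my own, not the original DBSCAN code); points labeled -1 are noise.

    from math import dist

    def region_query(points, i, eps):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    def dbscan(points, eps, min_pts):
        UNVISITED, NOISE = None, -1
        labels = [UNVISITED] * len(points)
        cluster_id = 0
        for i in range(len(points)):
            if labels[i] is not UNVISITED:
                continue
            neighbors = region_query(points, i, eps)
            if len(neighbors) < min_pts:             # not a core point: tentatively noise
                labels[i] = NOISE
                continue
            labels[i] = cluster_id                   # start a new cluster from this core point
            queue = [j for j in neighbors if j != i]
            while queue:
                j = queue.pop()
                if labels[j] == NOISE:               # border point reached from a core point
                    labels[j] = cluster_id
                if labels[j] is not UNVISITED:
                    continue
                labels[j] = cluster_id
                j_neighbors = region_query(points, j, eps)
                if len(j_neighbors) >= min_pts:      # j is itself a core point: keep expanding
                    queue.extend(j_neighbors)
            cluster_id += 1
        return labels                                # -1 marks noise, 0..k-1 mark clusters

    if __name__ == "__main__":
        pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (20, 20)]
        print(dbscan(pts, eps=1.5, min_pts=3))       # two clusters plus one noise point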
  • 121. References (1) • R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98 • M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973. • M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering structure, SIGMOD’99. • P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996 • Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02 • M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD 2000. • M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96. • M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95. • D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987. • D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB’98. 121
  • 122. References (2) • V. Ganti, J. Gehrke, R. Ramakrishnan. CACTUS: Clustering Categorical Data Using Summaries. KDD'99. • D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. In Proc. VLDB'98. • S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98. • S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March 1999. • A. Hinneburg, D. A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD'98. • A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988. • G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999. • L. Kaufman and P. J. Rousseeuw, 1987. Clustering by Means of Medoids. In: Dodge, Y. (Ed.), Statistical Data Analysis Based on the L1 Norm, North Holland, Amsterdam, pp. 405-416. • L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990. • E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98. • J. B. MacQueen (1967): "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297. • G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988. • P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997. • R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
  • 123. References (3) • L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A Review , SIGKDD Explorations, 6(1), June 2004 • E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition,. • G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB’98. • A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering in Large Databases, ICDT'01. • A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles , ICDE'01 • H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data sets, SIGMOD’ 02. • W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to Spatial Data Mining, VLDB’97. • T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for very large databases. SIGMOD'96. • Wikipedia: DBSCAN. http://en.wikipedia.org/wiki/DBSCAN. 123
  • 124. MORE ABOUT DATA MINING 124
  • 125. http://www.cs.uvm.edu/~xwu/PPT/ICDM10-Sydney/ICDM10- Keynote.pdf ICDM ’10 KEYNOTE SPEECH “10 YEARS OF DATA MINING RESEARCH: RETROSPECT AND PROSPECT” Xindong Wu, University of Vermont, USA 125
  • 126. The Top 10 Algorithms The 3-Step Identification Process 1. Nominations. ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners were invited in September 2006 to each nominate up to 10 best-known algorithms. 2. Verification. Each nomination was verified for its citations on Google Scholar in late October 2006, and those nominations that did not have at least 50 citations were removed. 18 nominations survived and were then organized in 10 topics. 3. Voting by the wider community. 126
  • 127. Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)
    – Classification: #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984. #3. K Nearest Neighbors (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI 18(6). #4. Naive Bayes: Hand, D. J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.
    – Statistical Learning: #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag. #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
    – Association Analysis: #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94. #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.

Editor's Notes

  1. IMS = Information Management System
  2. Primary goal: discovering knowledge. Thus, we want to find proper values of confidence and support at different hierarchical levels.
  3. I: the expected information needed to classify a given sample. E (entropy): expected information based on the partitioning into subsets by A.
  4. median: the middle value (中位數)