Document Clustering in Amharic
   for information browsing and retrieval
             Yalemisew Mintesinot Abgaz

              Yabgaz@computing.dcu.ie




                     Dec 1, 2011
Introduction
• The rate of production of information is growing exponentially
• Documents produced in the Amharic language are increasingly
       available in digital format
       accessible online
• The number of Amharic web documents is larger than ever before
• Growing number of Amharic language users
• Increasing number of applications available in Amharic
Introduction
• Challenges ahead
    – Searching and accessing the information in Amharic is difficult
        • From the language perspective
        • From the knowledge perspective
        • Availability of tools
    – Identifying the relevant documents from the available ones is challenging
        • Searching and Search results
    – Browsing the documents through a concept map is not yet possible
• The challenges call for a solution
Agenda Items
•   Introduction
•   Document clustering
•   Document clustering process
•   Experimental results
•   Conclusion
•   Future work
Document clustering
• Document clustering is a process of identifying groups or clusters of
  documents with common features.
• Groups documents based on similarities of the contents of the documents
• Used for information organization and information retrieval
• Supports the design of a retrieval mechanism that searches through the clusters
• Can be
   – Hierarchical
   – Non-hierarchical
• Is different from document classification
Document clustering
• Hierarchical document clustering
   – Is a widely used method
   – Generates hierarchical classes with generalization at the top and
     specialization at the bottom
• Clustering algorithms
   – Divisive
   – Agglomerative
       • Single link, complete link, group average link and Ward’s method
       • Frequent itemset-based hierarchical clustering (FIHC)
Document clustering process
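[Figure: process flow]
   Document collection -> Document text collection -> Indexing -> Index words
   Index words -> Stemming -> Stemmed index words
   Stemmed index words -> Vector representation -> Document term vectors
   Document term vectors -> Clustering -> Cluster representation
   Query -> Query processing -> Query-cluster matching -> Output documents
   Supporting resources: stop word list, suffix list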
Document clustering process
1. Document collection
   -   Amharic news documents collected from Walta Information Centre
   -   Similar documents were selected by previous researchers
   -   The documents cover various domains such as
       -   Governance
       -   Market
       -   Politics
       -   Sport
       -   Education etc.
Document clustering process
2. Document pre-processing
   - Indexing the documents
      -   Word identification (Amharic word separators considered)
      -   Smoothing (characters with the same sound were mapped to a single character)
      -   ጸሃይ፣ ጸኅይ፣ ጸሀይ፣ ፀሃይ፣ ፀኃይ፣ ፀሀይ… -> ፀሐይ
   - Stop word removal
      -   Words like [ለ፣ ወደ]=to, [ከ]=from, [የ] are removed [non-content-bearing
          words]
      -   Stop words in the news domain such as [ገልጿል]=disclosed, [አመልክቷል] etc.
   - Stop words are validated against their frequency in the document
     collection [a threshold of 100 is used]
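The pre-processing steps above can be sketched as follows. This is a minimal illustration, not the tool used in the study: the homophone table contains only the characters from the slide's example, the separator set is an assumed list of common Ethiopic punctuation marks, and the names normalize and index_terms are hypothetical.

```python
import re

# Homophone mapping taken only from the slide's example (assumption: the
# table used in the study covers more characters than these five).
HOMOPHONES = str.maketrans({"ጸ": "ፀ", "ሃ": "ሐ", "ኅ": "ሐ", "ሀ": "ሐ", "ኃ": "ሐ"})

# Assumed word separators: whitespace plus the Ethiopic wordspace,
# full stop, comma and semicolon.
SEPARATORS = re.compile(r"[\s፡።፣፤]+")

# Stop words quoted on the slide; a real list would be much longer.
STOP_WORDS = {"ለ", "ወደ", "ከ", "የ", "ገልጿል", "አመልክቷል"}


def normalize(text):
    """Map characters with the same sound to a single character."""
    return text.translate(HOMOPHONES)


def index_terms(text):
    """Normalize, split on word separators and drop stop words."""
    tokens = SEPARATORS.split(normalize(text))
    return [t for t in tokens if t and t not in STOP_WORDS]
```

Normalizing before the stop word lookup means that spelling variants of the same word are compared in one canonical form.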
Document clustering process
3. Stemming of indexed terms
    -     The Amharic language is morphologically complex
    -     Nouns have inflection [prefix and suffix]
    -     አስተማሩ፣ አስተማረ፣ አስተማረች፣ አስተማርኩ -> አስተማረ
    -     Verbs have inflection [prefix, suffix and infix]
    -     ሰበረ፣ ሰበረች፣ ሰበርክ፣ ሰበርሽ፣ አሰበረ -> ሰበር -> ስብር
    -     Stemming reduces the variants of a word to a common form
Document clustering process
4. Representing documents using document vector
   -  Term weighting is applied to the raw term frequency
   -  Weight(d_i, j) = tf_ij * (log N - log n) + 1
      • tf_ij is the frequency of term j in document i
      • N is the number of documents in the collection and
      • n is the number of documents containing term j.
   – Weighted term frequency for index terms
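The weighting above can be sketched in a few lines. This is an illustration under assumptions: the slide does not state the logarithm base (base 10 is used here), the "+ 1" is applied outside the product exactly as the formula is written on the slide, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_weight(tf, n_docs, doc_freq):
    """Weight = tf * (log N - log n) + 1, as written on the slide.

    Assumption: base-10 logarithm; the slide does not state the base.
    """
    return tf * (math.log10(n_docs) - math.log10(doc_freq)) + 1


def document_vectors(docs):
    """Build one {term: weight} vector per document.

    docs is a list of token lists (stemmed index terms).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: term_weight(f, n_docs, doc_freq[t]) for t, f in tf.items()})
    return vectors
```

A term that occurs in every document gets the factor log N - log n = 0, so its weight falls back to the constant 1 regardless of its frequency.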
Document clustering process
5. Clustering the documents
   -   Constructing the initial clusters
       -   Following the FIHC algorithm, initial clusters are constructed by setting the
           global support threshold to a value between 0 and 1
       -   Initial clustering groups similar documents together and creates a new cluster
           whenever it encounters a dissimilar document
       -   Global support was used
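A rough sketch of initial cluster construction is given below. It is a simplification and not the implementation used in the experiments: only single terms are treated as frequent items (FIHC also works with larger itemsets), a term counts as frequent when its document share is at least the threshold, and the function name is hypothetical.

```python
from collections import defaultdict


def initial_clusters(docs, min_global_support):
    """Return {frequent term: set of ids of the documents containing it}.

    docs is a list of sets of terms; min_global_support is a fraction
    between 0 and 1. Simplification: only 1-itemsets are considered.
    """
    postings = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in terms:
            postings[term].add(doc_id)
    threshold = min_global_support * len(docs)
    # One initial cluster per globally frequent term; clusters may overlap.
    return {term: ids for term, ids in postings.items() if len(ids) >= threshold}
```

Lowering the threshold admits more frequent terms and therefore more initial clusters, which matches the wider and deeper trees reported for small global support values.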
Document clustering process
5. Clustering the documents
   -   Making the clusters disjoint
       -   The score function is used to measure how well a cluster fits the document at
           hand.
   -   Hierarchical tree construction
       -   The cluster tree is built using inter-cluster similarity
       -   Centroid calculation
   -   Tree pruning
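The step of making clusters disjoint can be illustrated with a score function of the kind described in the FIHC literature. The sketch below is an assumption about its form, not the code used in this work: terms that are frequent inside the cluster reward the assignment, other globally frequent terms penalize it, and each document is kept only in its best-scoring cluster. The notion of cluster support, the parameter min_cluster_support and all names are introduced here for illustration.

```python
def score(doc_tf, cluster_support, global_support, min_cluster_support):
    """Score how well one cluster fits one document.

    doc_tf:          term -> frequency (or weight) in the document
    cluster_support: term -> share of the cluster's documents containing it
    global_support:  term -> share of all documents containing it
                     (globally frequent terms only)
    """
    total = 0.0
    for term, freq in doc_tf.items():
        if term not in global_support:
            continue  # terms that are not globally frequent are ignored
        cs = cluster_support.get(term, 0.0)
        if cs >= min_cluster_support:
            total += freq * cs  # term is typical of the cluster: reward
        else:
            total -= freq * global_support[term]  # term is untypical: penalty
    return total


def best_cluster(doc_tf, clusters, global_support, min_cluster_support):
    """Return the label of the highest-scoring cluster for the document.

    clusters maps a label to that cluster's cluster_support dictionary.
    """
    return max(
        clusters,
        key=lambda label: score(doc_tf, clusters[label], global_support, min_cluster_support),
    )
```

Keeping each document only in the cluster returned by best_cluster removes the overlap left by the initial clustering step.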
Experimental result
 • Tuning the global support to obtain a document hierarchy
        – More than 10% global support gives a flat hierarchy
        – Less than 1% global support gives a single vertical hierarchy
        – 5% global support shows better performance
Global Support   Width    Depth   Remark
>=20%            <=9      0       Flat hierarchy
10%              61       2       1-level hierarchy (only for 2 classes)
5%               92       10      10-level hierarchy for two classes; 5-level hierarchy for five classes
<=1%             >=120    25      25-level hierarchy [took too much time to cluster]
Experimental result
[Figure: recall-precision plot titled "global support = 10%", with series for 10% and 5% global support; x-axis: recall, y-axis: precision]
Discussion of results
• Tuning the global support threshold plays a significant role in
  creating the required clusters
• Stemming affects the clusters and creates overlapping clusters
• High precision can be achieved if only frequent items (terms) are used
• High recall can be achieved when all index terms are used,
  but this greatly reduces precision
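For reference, the two measures discussed above can be computed as in the following sketch. It uses the standard set-based definitions; the function name and the convention of returning 0.0 for an empty set are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of document ids returned for the query
    relevant:  set of document ids judged relevant to the query
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Returning more documents can only keep or raise recall, while precision drops whenever the extra documents are not relevant, which is the trade-off noted in the last bullet.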
Future directions
• Developing a standard corpus
• Using ontologies as a concept map
• Standardizing Amharic language resources, such as a standard
  stop word list
• Further research in stemming [cross domain research]
• Comparison with other document clustering algorithms
• Comparison with other information retrieval methods
Thank you!




             Questions?
