Amharic document clustering


© Yalemisew Mintesinot Abgaz

Published in: Technology

  1. 1. Document Clustering in Amharic for Information Browsing and Retrieval
     Yalemisew Mintesinot Abgaz
     Dec 1, 2011
  2. 2. Introduction
     • The rate of production of information is growing exponentially
     • Documents produced in the Amharic language are increasingly available in digital format and accessible online
     • More Amharic web documents than before
     • Growing number of Amharic language users
     • Increasing number of applications available in Amharic
  3. 3. Introduction
     • Challenges ahead
       – Searching and accessing the information in Amharic is difficult
         • From the language perspective
         • From the knowledge perspective
         • Availability of tools
       – Identifying the relevant documents from the available ones is challenging
         • Searching and search results
       – Browsing the documents in a concept map is not available
     • The challenges call for a solution
  4. 4. Agenda Items
     • Introduction
     • Document clustering
     • Document clustering process
     • Experimental results
     • Conclusion
     • Future work
  5. 5. Document clustering
     • Document clustering is the process of identifying groups (clusters) of documents with common features
     • Groups documents based on the similarity of their contents
     • Used for information organization and information retrieval
     • Used to design a retrieval mechanism that searches through the clusters
     • Can be
       – Hierarchical
       – Non-hierarchical
     • Is different from document classification
  6. 6. Document clustering
     • Hierarchical document clustering
       – Is a widely used method
       – Generates hierarchical classes with generalization at the top and specialization at the bottom
     • Clustering algorithms
       – Divisive
       – Agglomerative
         • Single link, complete link, group average link and Ward's method
         • Frequent item based hierarchical clustering (FIHC)
  7. 7. Document clustering process
     [Flow diagram. Processing stages: document collection → indexing → stemming → vector representation → clustering → cluster representation. A query passes through query processing and query-cluster matching to produce the output documents. Data shown between stages: document text, index words, stop word list, suffix list, stemmed index words, document term vectors.]
  8. 8. Document clustering process
     1. Document collection
        - Amharic news documents collected from Walta Information Centre
        - Similar documents were selected by previous researchers
        - The documents cover various domains such as
          - Governance
          - Market
          - Politics
          - Sport
          - Education etc.
  9. 9. Document clustering process
     2. Document pre-processing
        - Indexing the documents
          - Word identification (Amharic word separators considered)
          - Smoothing (characters with the same sound are mapped to a single character)
            - ጸሃይ፣ ጸኅይ፣ ጸሀይ፣ ፀሃይ፣ ፀኃይ፣ ፀሀይ … → ፀሐይ
        - Stop word removal
          - Non-content-bearing words such as [ለ፣ ወደ] = to, [ከ] = from, [የ] are removed
          - Stop words specific to the news domain, such as [ገልጿል] = disclosed, [አመልክቷል] etc.
          - Stop words are validated against their frequency in the document collection [a threshold of 100 is used]
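The pre-processing steps on slide 9 could be sketched roughly as below. This is only an illustration, not the tool used in the study: the character table covers just the ፀሐይ example from the slide, the function names are hypothetical, and treating a candidate as a stop word when it occurs at least `threshold` times is an assumed reading of "a threshold of 100 is used".

```python
import re
from collections import Counter

# Variant -> canonical character, taken only from the ፀሐይ example on the slide.
# A complete table would cover every vowel order of each same-sound letter.
SAME_SOUND = str.maketrans({"ጸ": "ፀ", "ሃ": "ሐ", "ኅ": "ሐ", "ሀ": "ሐ", "ኃ": "ሐ"})

# Whitespace plus Ethiopic punctuation U+1361..U+1368 (wordspace, full stop, comma, ...).
SEPARATORS = re.compile(r"[\s\u1361-\u1368]+")


def normalize(text):
    """Map same-sound characters to one canonical character."""
    return text.translate(SAME_SOUND)


def tokenize(text):
    """Normalize, then split on whitespace and Ethiopic punctuation."""
    return [tok for tok in SEPARATORS.split(normalize(text)) if tok]


def validate_stop_words(candidates, documents, threshold=100):
    """Keep candidates whose collection frequency reaches the threshold (assumed rule)."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return {normalize(w) for w in candidates if counts[normalize(w)] >= threshold}


def index_terms(text, stop_words):
    """Tokens of one document with stop words removed."""
    return [tok for tok in tokenize(text) if tok not in stop_words]
```

For example, `index_terms("ወደ ፀሃይ", {"ወደ"})` keeps only the normalized content word.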
  10. 10. Document clustering process
     3. Stemming of indexed terms
        - The Amharic language is morphologically complex
        - Nouns have inflection [prefix and suffix]
          - አስተማሩ፣ አስተማረ፣ አስተማረች፣ አስተማርኩ → አስተማረ
        - Verbs have inflection [prefix, suffix and infix]
          - ሰበረ፣ ሰበረች፣ ሰበርክ፣ ሰበርሽ፣ አሰበረ → ሰበር → ስብር
        - Stemming brings the word into its common form
  11. 11. Document clustering process
     4. Representing documents using document vectors
        - Term weighting is used to weight the term frequency
        - Weight(d_i, j) = tf_ij × (log N − log n) + 1
          • tf_ij is the frequency of term j in document i
          • N is the number of documents in the collection
          • n is the number of documents containing the term
        – Weighted term frequency is used for the index terms
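A minimal sketch of the weighting formula on slide 11, implemented exactly as written there (the `+ 1` is added after the product). The natural logarithm and the function name `term_weights` are assumptions; the slide does not state the log base.

```python
import math
from collections import Counter


def term_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n in the slide: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: frequency of each term in this document
        weights.append({
            term: freq * (math.log(n_docs) - math.log(doc_freq[term])) + 1
            for term, freq in tf.items()
        })
    return weights
```

Under this formula a term that occurs in every document gets weight 1 regardless of its frequency, because log N − log n is zero.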
  12. 12. Document clustering process
     5. Clustering the documents
        - Constructing the initial clusters
          - Following the FIHC algorithm, initial clusters are constructed by setting a global support threshold between 0 and 1
          - The initial clustering groups similar documents together and creates a new cluster whenever it encounters a different document
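The initial-cluster step on slide 12 might look like the sketch below. It is a simplification, not the full FIHC procedure: only single terms are treated as frequent itemsets (FIHC also mines multi-term itemsets), and the function names are hypothetical. Each globally frequent term labels one initial cluster that holds every document containing it, so the clusters can overlap.

```python
from collections import defaultdict


def global_frequent_terms(docs, min_global_support):
    """Return {term: global support} for terms found in enough documents.

    docs is a list of token lists; min_global_support is a fraction in [0, 1].
    """
    n_docs = len(docs)
    doc_freq = defaultdict(int)
    for doc in docs:
        for term in set(doc):
            doc_freq[term] += 1
    return {
        term: count / n_docs
        for term, count in doc_freq.items()
        if count / n_docs >= min_global_support
    }


def initial_clusters(docs, min_global_support):
    """One (possibly overlapping) cluster of document ids per frequent term."""
    frequent = global_frequent_terms(docs, min_global_support)
    clusters = {term: [] for term in frequent}
    for doc_id, doc in enumerate(docs):
        for term in set(doc):
            if term in frequent:
                clusters[term].append(doc_id)
    return clusters
```

Lowering `min_global_support` admits more terms and therefore produces more initial clusters.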
  13. 13. Document clustering process
     5. Clustering the documents
        - Making the clusters disjoint
          - The score function measures how well a cluster fits the documents at hand
        - Hierarchical tree construction
          - The cluster tree is built using inter-cluster similarity
          - Centroid calculation
        - Tree pruning
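One way to picture the "making the clusters disjoint" step on slide 13 is the FIHC-style score below: terms that are frequent inside the candidate cluster add to the score, and globally frequent terms that are not frequent in that cluster subtract from it. This is a sketch under assumptions: clusters are labelled by single terms, cluster supports are computed from the initial clusters only, ties go to the alphabetically first label, and all names are hypothetical.

```python
def cluster_support(term, member_ids, docs):
    """Fraction of the cluster's documents that contain the term."""
    return sum(1 for i in member_ids if term in docs[i]) / len(member_ids)


def score(doc, member_ids, docs, global_support, min_cluster_support):
    """Score one document (a {term: weight} dict) against one candidate cluster."""
    total = 0.0
    for term, weight in doc.items():
        if term not in global_support:
            continue  # only globally frequent terms take part in the score
        support = cluster_support(term, member_ids, docs)
        if support >= min_cluster_support:
            total += weight * support               # reward: cluster-frequent term
        else:
            total -= weight * global_support[term]  # penalty: not frequent here
    return total


def make_disjoint(docs, clusters, global_support, min_cluster_support):
    """Assign each document to the best-scoring initial cluster that contains it."""
    assignment = {}
    for doc_id, doc in enumerate(docs):
        candidates = sorted(label for label, ids in clusters.items() if doc_id in ids)
        if candidates:
            assignment[doc_id] = max(
                candidates,
                key=lambda label: score(doc, clusters[label], docs,
                                        global_support, min_cluster_support),
            )
    return assignment
```

After this step every document belongs to exactly one cluster, which is the precondition for building the cluster tree.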
  14. 14. Experimental result
     • Tuning the global support to get a document hierarchy
       – More than 10% global support gives a flat hierarchy
       – Less than 1% global support gives a single vertical hierarchy
       – 5% global support shows better performance

     Global support | Width  | Depth | Remark
     >= 20%         | <= 9   | 0     | Flat hierarchy
     10%            | 61     | 2     | 1-level hierarchy (only for 2 classes)
     5%             | 92     | 10    | 10-level hierarchy for two classes; 5-level hierarchy for five classes
     <= 1%          | >= 120 | 25    | 25-level hierarchy [took too much time to cluster]
  15. 15. Experimental result
  16. 16. Experimental result
     [Figure: recall-precision curves for global support of 10% and 5%; x-axis: recall, y-axis: precision.]
  17. 17. Discussion of results
     • Tuning the global support threshold plays a significant role in creating the required clusters
     • Stemming affects the clusters and creates overlapping clusters
     • High precision can be achieved if only frequent items (terms) are used
     • High recall can be achieved when all index terms are used, but this greatly reduces precision
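For reference, the precision and recall discussed on slide 17 can be computed per query with the standard set-based definitions, as in the sketch below. The function name and the document ids in the usage note are made up for illustration; they are not results from the study.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query.

    retrieved: ids of the documents returned for the query.
    relevant: ids of the documents judged relevant to the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Returning more documents (for instance `precision_recall([1, 2, 3, 4], [2, 4, 6])` versus `precision_recall([2], [2, 4, 6])`) can only keep or raise recall, while precision may fall, which is the trade-off the slide describes.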
  18. 18. Future directions
     • Developing a standard corpus collection
     • Using ontologies as a concept map
     • Standardizing Amharic language resources, such as a standard stop word list
     • Further research in stemming [cross-domain research]
     • Comparison with other document clustering algorithms
     • Comparison with other information retrieval methods
  19. 19. Thank you! Questions?