Amharic document clustering
© Yalemisew Mintesinot Abgaz


Published in: Technology

Transcript

  • 1. Document Clustering in Amharic for information browsing and retrieval
      Yalemisew Mintesinot Abgaz
      Yabgaz@computing.dcu.ie
      Dec 1, 2011
  • 2. Introduction
      • The rate of information production is growing exponentially
      • Documents written in Amharic are increasingly available in digital format and accessible online
      • More Amharic web documents than before
      • Growing number of Amharic language users
      • Increasing number of applications available in Amharic
  • 3. Introduction
      • Challenges ahead
          – Searching and accessing information in Amharic is difficult
              • From the language perspective
              • From the knowledge perspective
              • Availability of tools
          – Identifying the relevant documents among the available ones is challenging
              • Searching and search results
          – Browsing the documents in a concept map is not available
      • The challenges call for a solution
  • 4. Agenda Items
      • Introduction
      • Document clustering
      • Document clustering process
      • Experimental results
      • Conclusion
      • Future work
  • 5. Document clustering
      • Document clustering is the process of identifying groups (clusters) of documents with common features
      • Groups documents based on the similarity of their contents
      • Used for information organization and information retrieval
      • Allows a retrieval mechanism to be designed that searches through the clusters
      • Can be
          – Hierarchical
          – Non-hierarchical
      • Is different from document classification
  • 6. Document clustering
      • Hierarchical document clustering
          – Is a widely used method
          – Generates hierarchical classes with generalization at the top and specialization at the bottom
      • Clustering algorithms
          – Divisive
          – Agglomerative
              • Single link, complete link, group average link, Ward's method, and
              • Frequent itemset-based hierarchical clustering (FIHC)
  • 7. Document clustering process
      [Flow diagram. Document side: document collection (text) → indexing (index words) → stemming, using a stop word list and a suffix list (stemmed index words) → vector representation (document term vectors) → clustering (cluster representation). Query side: query → query processing → query-cluster matching → output documents.]
  • 8. Document clustering process
      1. Document collection
          - Amharic news documents collected from Walta Information Centre
          - Similar documents were selected by previous researchers
          - The documents cover various domains such as
              - Governance
              - Market
              - Politics
              - Sport
              - Education, etc.
  • 9. Document clustering process
      2. Document pre-processing
          - Indexing the documents
              - Word identification (Amharic word separators considered)
              - Smoothing (characters with the same sound were mapped to a single character)
                  - ጸሃይ፣ ጸኅይ፣ ጸሀይ፣ ፀሃይ፣ ፀኃይ፣ ፀሀይ … → ፀሐይ
          - Stop word removal
              - Words like [ለ፣ ወደ] = to, [ከ] = from, [የ] are removed [non-content-bearing words]
              - Stop words specific to the news domain, such as [ገልጿል] disclosed, [አመልክቷል], etc.
              - Stop words are validated against their frequency in the document collection [a threshold of 100 is used]
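The smoothing and stop word steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the author's implementation. The character table `SAME_SOUND` is hypothetical and contains only the substitutions implied by the ፀሐይ example above, and reading the threshold of 100 as "a term occurring at least 100 times in the collection is a stop word candidate" is an assumption.

```python
from collections import Counter

# Hypothetical mapping: only the substitutions implied by the slide's example.
# A real table would cover every group of same-sound Amharic characters.
SAME_SOUND = str.maketrans({"ጸ": "ፀ", "ሃ": "ሐ", "ኅ": "ሐ", "ሀ": "ሐ", "ኃ": "ሐ"})


def smooth(word):
    """Map characters with the same sound to a single representative."""
    return word.translate(SAME_SOUND)


def frequent_terms(documents, threshold=100):
    """Return smoothed terms occurring at least `threshold` times overall.

    Assumption: the slide's threshold is a collection-wide occurrence count.
    """
    counts = Counter(smooth(w) for doc in documents for w in doc)
    return {term for term, count in counts.items() if count >= threshold}


def remove_stop_words(doc, stop_words):
    """Smooth every token and drop the ones found in `stop_words`."""
    return [w for w in (smooth(t) for t in doc) if w not in stop_words]
```

Each document is assumed to be a list of already separated words. In practice the frequency-based candidates would be checked against a hand-made stop word list rather than used blindly.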
  • 10. Document clustering process
      3. Stemming of indexed terms
          - Amharic is a morphologically complex language
          - Nouns have inflection [prefix and suffix]
              - አስተማሩ፣ አስተማረ፣ አስተማረች፣ አስተማርኩ → አስተማረ
          - Verbs have inflection [prefix, suffix and infix]
              - ሰበረ፣ ሰበረች፣ ሰበርክ፣ ሰበርሽ፣ አሰበረ → ሰበር፣ ስብር
          - Stemming brings the word into its common form
  • 11. Document clustering process
      4. Representing documents using document vectors
          - Term weighting is used to weight the term frequency
          - $\mathrm{Weight}(d_{i,j}) = tf_{ij} \cdot (\log N - \log n) + 1$
              • $tf_{ij}$ is the frequency of term $j$ in document $i$
              • $N$ is the number of documents in the collection
              • $n$ is the number of documents containing the term
          - The result is a weighted term frequency for each index term
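As a rough sketch, the weighting formula on this slide could be applied to a small collection like this. The formula is evaluated exactly as written, with the `+ 1` added after the multiplication. The slide does not state the base of the logarithm, so base 10 is an assumption, and the function names are illustrative only.

```python
import math
from collections import Counter


def term_weight(tf, n_docs, doc_freq):
    """Weight = tf * (log N - log n) + 1, read literally from the slide.

    Assumption: base-10 logarithm; the slide does not state the base.
    """
    return tf * (math.log10(n_docs) - math.log10(doc_freq)) + 1


def document_vectors(documents):
    """Build one {term: weight} dictionary per document."""
    n_docs = len(documents)
    # Number of documents that contain each term (n in the formula).
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # term frequency within this document
        vectors.append(
            {t: term_weight(f, n_docs, doc_freq[t]) for t, f in tf.items()}
        )
    return vectors
```

Under this reading, a term that appears in every document has log N - log n = 0 and so receives a weight of exactly 1, regardless of how often it occurs.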
  • 12. Document clustering process
      5. Clustering the documents
          - Constructing the initial clusters
              - Following the FIHC algorithm, initial clusters are constructed using a global support threshold set between 0 and 1
              - The initial clustering groups similar documents together and creates a new cluster whenever it encounters a different document
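A simplified sketch of how a global support threshold can produce initial clusters is shown below. It is not the full FIHC procedure: FIHC mines frequent itemsets of several terms, while this illustration uses single terms only. The function names and the document representation (lists of stemmed terms) are assumptions.

```python
from collections import Counter


def global_frequent_terms(documents, min_support):
    """Terms whose share of containing documents is at least `min_support`."""
    n_docs = len(documents)
    doc_freq = Counter(t for doc in documents for t in set(doc))
    return {t for t, df in doc_freq.items() if df / n_docs >= min_support}


def initial_clusters(documents, min_support):
    """One initial cluster per globally frequent term.

    Simplification: single terms stand in for FIHC's frequent itemsets.
    A document joins every cluster whose term it contains, so the
    initial clusters may overlap.
    """
    frequent = global_frequent_terms(documents, min_support)
    clusters = {t: [] for t in sorted(frequent)}
    for doc_id, doc in enumerate(documents):
        for t in set(doc) & frequent:
            clusters[t].append(doc_id)
    return clusters
```

Lowering `min_support` admits more frequent terms and therefore more initial clusters. Because a document can land in several clusters at once, a later step is needed to make the clusters disjoint.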
  • 13. Document clustering process
      5. Clustering the documents
          - Making the clusters disjoint
              - A score function measures how well a cluster fits the documents at hand
          - Hierarchical tree construction
              - The cluster tree is built using inter-cluster similarity
          - Centroid calculation
          - Tree pruning
  • 14. Experimental result
      • Tuning the global support to get hierarchical documents
          – More than 10% global support gives a flat hierarchy
          – Less than 1% global support gives a single vertical hierarchy
          – 5% global support shows better performance

      Global support | Width  | Depth | Remark
      >= 20%         | <= 9   | 0     | Flat hierarchy
      10%            | 61     | 2     | 1-level hierarchy (only for 2 classes)
      5%             | 92     | 10    | 10-level hierarchy for two classes; 5-level hierarchy for five classes
      <= 1%          | >= 120 | 25    | 25-level hierarchy [took too much time to cluster]
  • 15. Experimental result
  • 16. Experimental result
      [Figure: recall-precision curves for global support of 10% and 5%; x-axis: recall (0.1 to 1), y-axis: precision (0 to 0.9)]
  • 17. Discussion of results
      • Tuning the global support threshold plays a significant role in creating the required clusters
      • Stemming affects the clusters and creates overlapping clusters
      • High precision can be achieved if frequent items (terms) are used
      • High recall can be achieved when all index terms are used, but this greatly affects precision
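For reference, the precision and recall discussed here follow the standard information retrieval definitions, which can be sketched as below. The function name and the use of document identifiers are illustrative assumptions; the slides do not show how the evaluation was implemented.

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Returning more documents for a query can only keep or raise recall, but every non-relevant document added lowers precision, which is the trade-off the last two bullets describe.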
  • 18. Future directions
      • Developing a standard corpus collection
      • Using ontologies as a concept map
      • Standardization of Amharic language resources, such as a standard stop word list
      • Further research in stemming [cross-domain research]
      • Comparison with other document clustering algorithms
      • Comparison with other information retrieval methods
  • 19. Thank you! Questions?