Amharic document clustering

Document Clustering in Amharic
for information browsing and retrieval
Yalemisew Mintesinot Abgaz

Yabgaz@computing.dcu.ie

Dec 1, 2011

Introduction
• The rate of production of information is growing exponentially
• Documents produced in the Amharic language are increasing
– available in digital format
– accessible online
• More Amharic web documents than before
• Growing number of Amharic language users
• Increasing number of applications available in Amharic

Introduction
• Challenges ahead
– Searching and accessing the information in Amharic is difficult
• From the language perspective
• From the knowledge perspective
• Availability of tools
– Identifying the relevant documents from the available ones is challenging
• Searching and Search results
– Browsing the documents through a concept map is not supported
• The challenges call for a solution

Agenda Items
• Introduction
• Document clustering
• Document clustering process
• Experimental results
• Conclusion
• Future work

Document clustering
• Document clustering is the process of identifying groups, or clusters, of
documents with common features.
• Groups documents based on the similarity of their contents
• Used for information organization and information retrieval
• Enables a retrieval mechanism that searches through the clusters
• Can be
– Hierarchical
– Non-hierarchical
• Is different from document classification

Document clustering
• Hierarchical document clustering
– Is a widely used method
– Generates hierarchical classes with generalization at the top and
specialization at the bottom
• Clustering algorithms
– Divisive
– Agglomerative
• Single link, complete link, group average link, and Ward's method
• Frequent item-based hierarchical clustering (FIHC)

Document clustering process
[Figure: process flow. Document collection → indexing → index words →
stemming (using a stop word list and a suffix list) → stemmed index words →
vector representation → document term vectors → clustering → cluster
representation. At query time: query → query processing → query-cluster
matching → output documents.]

1. Document collection
- Amharic news documents collected from Walta Information Centre
- Similar documents were selected by previous researchers
- The documents cover various domains such as
- Governance
- Market
- Politics
- Sport
- Education etc.

2. Document pre-processing
- Indexing the documents
- Word identification (Amharic word separators considered)
- Smoothing (characters with the same sound are mapped to a single character)
- ጸሃይ፣ ጸኅይ፣ ጸሀይ፣ ፀሃይ፣ ፀኃይ፣ ፀሀይ… → ፀሐይ
- Stop word removal
- Non-content-bearing words such as [ለ፣ ወደ] = to, [ከ] = from, and [የ] are removed
- News-domain stop words such as [ገልጿል] = disclosed, [አመልክቷል], etc.
- Stop words are validated against their frequency in the document
collection [a threshold of 100 is used]
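The smoothing and stop word validation steps above can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the `HOMOPHONES` table only covers the characters in the slide's ፀሐይ example, the function names are hypothetical, and reading the threshold as "a candidate must occur at least 100 times in the collection" is an assumption.

```python
from collections import Counter

# Assumed mapping, limited to the characters in the slide's example.
# A full table would cover every series of same-sound characters.
HOMOPHONES = {"ጸ": "ፀ", "ሃ": "ሐ", "ሀ": "ሐ", "ኃ": "ሐ", "ኅ": "ሐ"}


def normalize(word):
    """Map each same-sound character to one canonical character."""
    return "".join(HOMOPHONES.get(ch, ch) for ch in word)


def frequent_stop_words(candidates, documents, threshold=100):
    """Keep only candidate stop words that occur at least `threshold`
    times in the whole collection (documents are lists of tokens)."""
    counts = Counter(token for doc in documents for token in doc)
    return {word for word in candidates if counts[word] >= threshold}
```

With this mapping, all six spelling variants listed above normalize to ፀሐይ.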

3. Stemming of indexed terms
- Amharic language is morphologically complex
- Nouns have inflection [prefix and suffix]
- አስተማሩ፣ አስተማረ፣ አስተማረች፣ አስተማርኩ → አስተማረ
- Verbs have inflection [prefix, suffix, and infix]
- ሰበረ፣ ሰበረች፣ ሰበርክ፣ ሰበርሽ፣ አሰበረ → ሰበር → ስብር
- Stemming reduces the inflected forms of a word to a common form

4. Representing documents using document vector
- Term weighting is applied to the raw term frequencies
- Weight(i, j) = tf_ij * (log N - log n) + 1
• tf_ij is the frequency of term j in document i
• N is the number of documents in the collection
• n is the number of documents containing term j
– Each document is represented by the weighted term frequencies of its index terms
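The weighting formula can be written out as a short sketch. The slide does not state the logarithm base, so the natural logarithm is assumed here, and the function names are illustrative only.

```python
import math
from collections import Counter


def term_weight(tf, n_docs, doc_freq):
    """Weight = tf * (log N - log n) + 1, with natural log assumed."""
    return tf * (math.log(n_docs) - math.log(doc_freq)) + 1


def document_vectors(docs):
    """Build one {term: weight} vector per document.
    `docs` is a list of token lists (already stemmed and filtered)."""
    n_docs = len(docs)
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: term_weight(tf, n_docs, doc_freq[term])
         for term, tf in Counter(doc).items()}
        for doc in docs
    ]
```

A term that occurs in every document gets weight 1 regardless of its frequency, since log N - log n is then zero.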

5. Clustering the documents
- Constructing the initial clusters
- Following the FIHC algorithm, initial clusters are constructed by choosing a
global support threshold between 0 and 1
- The initial clustering groups similar documents together and creates a new
cluster whenever it encounters a different document
- Used global support

5. Clustering the documents
- Making the clusters disjoint
- The score function measures how well a cluster fits a given document.
- Hierarchical tree construction
- The cluster tree is built using inter-cluster similarity
- Centroid calculation
- Tree pruning

Experimental result
• Tuning the global support to obtain a hierarchy of documents
– More than 10% global support gives a flat hierarchy
– Less than 1% global support gives a single vertical hierarchy
– 5% global support performs better
Global support | Width  | Depth | Remark
>= 20%         | <= 9   | 0     | Flat hierarchy
10%            | 61     | 2     | 1-level hierarchy (only for 2 classes)
5%             | 92     | 10    | 10-level hierarchy for two classes, 5-level hierarchy for five classes
<= 1%          | >= 120 | 25    | 25-level hierarchy (took too much time to cluster)

Experimental result
[Figure: recall-precision curves for global support = 10% and 5%.
x-axis: recall (0.1 to 1); y-axis: precision (0 to 0.9).]

Discussion of results
• Tuning the global support threshold plays a significant role in
creating the required clusters
• Stemming affects the clusters and creates overlapping clusters
• High precision can be achieved if frequent items (terms) are used
• High recall can be achieved when all index terms are used,
but this greatly reduces precision

Future directions
• Developing a standard corpus
• Using ontologies as a concept map
• Standardizing Amharic language resources, such as a standard
stop word list
• Further research in stemming [cross domain research]
• Comparison with other document clustering algorithms
• Comparison with other information retrieval methods

Thank you!

Questions?
