Document Clustering in Amharic
   for information browsing and retrieval
             Yalemisew Mintesinot Abgaz

              Yabgaz@computing.dcu.ie




                     Dec 1, 2011
Introduction
• The rate of production of information is growing exponentially
• A growing number of documents produced in Amharic are
       available in digital format
       accessible online
• There are more Amharic web documents than before
• Growing number of Amharic language users
• Increasing number of applications available in Amharic
Introduction
• Challenges ahead
    – Searching and accessing the information in Amharic is difficult
        • From the language perspective
        • From the knowledge perspective
        • Availability of tools
    – Identifying the relevant documents from the available ones is challenging
        • Searching and Search results
    – Browsing the documents in a concept map is not available
• The challenges call for a solution
Agenda Items
•   Introduction
•   Document clustering
•   Document clustering process
•   Experimental results
•   Conclusion
•   Future work
Document clustering
• Document clustering is a process of identifying groups or clusters of
  documents with common features.
• Groups documents based on similarities of the contents of the documents
• Used for information organization and information retrieval
• To design a retrieval mechanism for searching through the clusters
• Can be
   – Hierarchical
   – Non-hierarchical
• Is different from document classification
Document clustering
• Hierarchical document clustering
   – Is a widely used method
   – Generates hierarchical classes with generalization at the top and
     specialization at the bottom
• Clustering algorithms
   – Divisive
   – Agglomerative
       • Single link, complete link, group average link and Ward’s method
       • Frequent item based hierarchical clustering (FIHC)
Document clustering process
[Figure: processing pipeline. Document collection → document text collection → indexing (index words) → stemming (stemmed index words) → vector representation (document term vectors) → clustering (cluster representation). Supporting resources: stop word list, suffix list. Query → query processing → query-cluster matching → output documents.]
Document clustering process
1. Document collection
   -   Amharic news documents collected from Walta Information Centre
   -   Similar documents were selected by previous researchers
   -   The documents cover various domains such as
       -   Governance
       -   Market
       -   Politics
       -   Sport
       -   Education etc.
Document clustering process
2. Document pre-processing
   - Indexing the documents
      -   Word identification (Amharic word separators considered)
      -   Smoothing (characters with the same sound are mapped to a single character)
      -   e.g. ጸሃይ፣ ጸኅይ፣ ጸሀይ፣ ፀሃይ፣ ፀኃይ፣ ፀሀይ… → ፀሐይ
   - Stop word removal
      -   Non-content-bearing words such as [ለ፣ ወደ] = to, [ከ] = from and [የ] are removed
      -   News-domain stop words such as [ገልጿል] = disclosed, [አመልክቷል], etc.
   - Stop words are validated against their frequency in the document
     collection [a threshold of 100 is used]
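The pre-processing steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the homophone map covers only the characters in the slide's example, the function names are hypothetical, and reading the threshold of 100 as "a candidate stays a stop word only if it occurs at least 100 times in the collection" is an assumption.

```python
import re
from collections import Counter

# Assumed homophone map: only the characters from the slide's example.
# A full Amharic normalizer would cover every order of each homophone series.
HOMOPHONES = str.maketrans({
    "ጸ": "ፀ",
    "ሀ": "ሐ",
    "ሃ": "ሐ",
    "ኅ": "ሐ",
    "ኃ": "ሐ",
})

# Whitespace plus the Ethiopic wordspace, full stop and comma.
SEPARATORS = re.compile(r"[\s፡።፣]+")

# Candidate stop words taken from the slide.
STOP_CANDIDATES = {"ለ", "ወደ", "ከ", "የ", "ገልጿል", "አመልክቷል"}


def normalize(text):
    """Map characters with the same sound to one canonical character."""
    return text.translate(HOMOPHONES)


def tokenize(text):
    """Normalize, then split on Amharic word separators."""
    return [tok for tok in SEPARATORS.split(normalize(text)) if tok]


def validated_stop_words(documents, candidates=STOP_CANDIDATES, threshold=100):
    """Keep a candidate only if its collection frequency reaches the threshold
    (assumed interpretation of the validation step)."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return {w for w in (normalize(c) for c in candidates) if counts[w] >= threshold}


def index_terms(text, stop_words):
    """Return the tokens of a document that are not stop words."""
    return [tok for tok in tokenize(text) if tok not in stop_words]
```

For example, `normalize` maps each of the six spellings listed on the slide to ፀሐይ, so they are indexed as one term.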
Document clustering process
3. Stemming of indexed terms
    -     Amharic language is morphologically complex
    -     Nouns have inflection [prefix and suffix]
    -     አስተማሩ፣ አስተማረ፣ አስተማረች፣ አስተማርኩ → አስተማረ
    -     Verbs have inflection [prefix, suffix and infix]
    -     ሰበረ፣ ሰበረች፣ ሰበርክ፣ ሰበርሽ፣ አሰበረ → ሰበር → ስብር
    -     Stemming brings the word into its common form
Document clustering process
4. Representing documents using document vector
   -  Term weighting is used to weight the term frequency
   -  Weight(d_i, j) = tf_ij × (log N − log n) + 1
      • tf_ij is the frequency of term j in document i
      • N is the number of documents in the collection and
      • n is the number of documents containing term j.
   – Weighted term frequency for index terms
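A small sketch of this weighting, assuming documents are given as term-frequency dictionaries. The formula is applied with the operator precedence as written on the slide (the +1 is added after the multiplication), and the natural logarithm is an assumption because the slide does not state a base. The function name `term_weights` is hypothetical.

```python
import math


def term_weights(doc_term_counts):
    """doc_term_counts: list of dicts mapping term -> raw frequency.
    Returns a list of dicts mapping term -> weight, one per document."""
    n_docs = len(doc_term_counts)

    # n: number of documents containing each term.
    doc_freq = {}
    for counts in doc_term_counts:
        for term in counts:
            doc_freq[term] = doc_freq.get(term, 0) + 1

    weighted = []
    for counts in doc_term_counts:
        weighted.append({
            # tf * (log N - log n) + 1, natural log assumed.
            term: tf * (math.log(n_docs) - math.log(doc_freq[term])) + 1
            for term, tf in counts.items()
        })
    return weighted
```

A term that occurs in every document gets log N − log n = 0, so its weight is 1 regardless of its frequency, while rarer terms are weighted up.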
Document clustering process
5. Clustering the documents
   -   Constructing the initial clusters
       -   Following the FIHC algorithm, initial clusters are constructed by setting the
           global support between 0 and 1
       -   The initial cluster groups similar documents together and creates a new cluster
           whenever it gets a different document
       -   Used global support
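The role of global support can be illustrated with a simplified sketch. FIHC as published builds one initial cluster per global frequent itemset; the code below restricts itself to single terms, which is a simplification and not necessarily what was implemented here. The function names are hypothetical.

```python
def global_frequent_items(docs, min_global_support):
    """docs: list of sets of index terms.
    Returns a dict term -> global support (fraction of documents containing it)
    for the terms whose support reaches min_global_support (between 0 and 1)."""
    n_docs = len(docs)
    counts = {}
    for terms in docs:
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {
        term: count / n_docs
        for term, count in counts.items()
        if count / n_docs >= min_global_support
    }


def initial_clusters(docs, frequent_items):
    """One initial cluster per frequent term, holding the indices of all
    documents that contain it. Initial clusters may overlap."""
    return {
        item: [i for i, terms in enumerate(docs) if item in terms]
        for item in frequent_items
    }
```

Lowering `min_global_support` admits more terms and therefore more initial clusters, which is why the threshold has such a strong effect on the resulting hierarchy.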
Document clustering process
5. Clustering the documents
   -   Making the clusters disjoint
       -   The score function is used to measure how well a cluster fits the documents at
           hand.
   -   Hierarchical tree construction
       -   The cluster tree is built using inter-cluster similarity
       -   Centroid calculation
   -   Tree pruning
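The step of making the clusters disjoint can be sketched as below. The score follows the form given in the FIHC literature: global frequent terms of the document that are also frequent in the cluster add weight × cluster support, and the others subtract weight × global support. The single-term clusters, the `min_cluster_support` default and the function names are assumptions made for illustration.

```python
def cluster_support(item, members, docs):
    """Fraction of the cluster's documents that contain the item."""
    if not members:
        return 0.0
    return sum(1 for i in members if item in docs[i]) / len(members)


def score(doc_vector, members, docs, global_support, min_cluster_support):
    """How well a cluster (list of document indices) fits one document.
    doc_vector maps term -> weight; global_support maps term -> support."""
    total = 0.0
    for item, weight in doc_vector.items():
        if item not in global_support:
            continue  # only global frequent terms contribute
        support = cluster_support(item, members, docs)
        if support >= min_cluster_support:
            total += weight * support               # term is cluster frequent
        else:
            total -= weight * global_support[item]  # penalty otherwise
    return total


def make_disjoint(doc_vectors, clusters, global_support, min_cluster_support=0.5):
    """Assign each document to the best-scoring initial cluster it belongs to.
    Ties go to the first candidate, a simplification."""
    docs = [set(vec) for vec in doc_vectors]
    assignment = {}
    for i, vec in enumerate(doc_vectors):
        candidates = [label for label, members in clusters.items() if i in members]
        if candidates:
            assignment[i] = max(
                candidates,
                key=lambda label: score(
                    vec, clusters[label], docs, global_support, min_cluster_support
                ),
            )
    return assignment
```

Scores are computed against the overlapping initial clusters, so the result does not depend on the order in which documents are processed.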
Experimental result
 • Tuning the global support to get hierarchical documents
        – More than 10% global support gives flat hierarchy
        – Less than 1% global support gives a single vertical hierarchy
        – 5% global support shows a better performance
Global support   Width    Depth   Remark
>= 20%           <= 9     0       Flat hierarchy
10%              61       2       1-level hierarchy (only for 2 classes)
5%               92       10      10-level hierarchy for two classes, 5-level hierarchy for five classes
<= 1%            >= 120   25      25-level hierarchy (took too much time to cluster)
Experimental result
[Figure: recall-precision plot, titled "global support = 10%". x-axis: recall; y-axis: precision; legend entries: 10% and 5%.]
Discussion of results
• Tuning the global support threshold plays a significant role in
  creating the required clusters
• Stemming affects the clusters and creates overlapping clusters
• High precision can be achieved if frequent items (terms) are used
• High recall can be achieved when all index terms are used,
  but this greatly affects precision
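For reference, precision and recall for a single query can be computed as in this short sketch. The function name and the use of document identifiers are assumptions; the slides do not describe the evaluation code.

```python
def precision_recall(retrieved, relevant):
    """retrieved: ids of the documents returned for a query.
    relevant: ids of the documents judged relevant to that query.
    Returns (precision, recall)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Returning more documents can only keep or raise recall, but each non-relevant document added lowers precision, which is the trade-off noted in the last bullet.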
Future directions
• Developing a standard corpus collection
• Using ontologies as a concept map
• Standardization for Amharic language resources such as standard
  stop word list
• Further research in stemming [cross domain research]
• Comparison with other document clustering algorithms
• Comparison with other information retrieval methods
Thank you!




             Questions?

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 

Recently uploaded (20)

ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 

Amharic document clustering

  • 1. Document Clustering in Amharic for information browsing and retrieval Yalemisew Mintesinot Abgaz Yabgaz@computing.dcu.ie Dec 1, 2011
  • 2. Introduction • The rate of production of information is growing exponentially • Documents produced in the Amharic language are increasingly – available in digital format – accessible online • More Amharic web documents than before • Growing number of Amharic language users • Increasing number of applications available in Amharic
  • 3. Introduction • Challenges ahead – Searching and accessing the information in Amharic is difficult • From the language perspective • From the knowledge perspective • Availability of tools – Identifying the relevant documents from the available ones is challenging • Searching and Search results – Browsing the documents in a concept map is not available • The challenges call for a solution
  • 4. Agenda Items • Introduction • Document clustering • Document clustering process • Experimental results • Conclusion • Future work
  • 5. Document clustering • Document clustering is a process of identifying groups or clusters of documents with common features. • Groups documents based on similarities of the contents of the documents • Used for information organization and information retrieval • To design a retrieval mechanism for searching through the clusters • Can be – Hierarchical – Non-hierarchical • Is different from document classification
  • 6. Document clustering • Hierarchical document clustering – Is a widely used method – Generates hierarchical classes with generalization at the top and specialization at the bottom • Clustering algorithms – Divisive – Agglomerative • Single link, complete link, group average link and Ward's method • Frequent itemset-based hierarchical clustering (FIHC)
  • 7. Document clustering process [Flow diagram of the pipeline. Box labels: document collection, document text collection, indexing, index words, stop word list, suffix list, stemming, stemmed index words, vector representation, document term vectors, clustering, cluster representation, query, query processing, query-cluster matching, output documents]
  • 8. Document clustering process 1. Document collection - Amharic news documents collected from Walta Information Centre - Similar documents were selected by previous researchers - The documents cover various domains such as - Governance - Market - Politics - Sport - Education etc.
  • 9. Document clustering process 2. Document pre-processing - Indexing the documents - Word identification (Amharic word separators considered) - Smoothing (characters with the same sound were mapped to a single character) - ጸሃይ፣ ጸኅይ፣ ጸሀይ፣ ፀሃይ፣ ፀኃይ፣ ፀሀይ… → ፀሐይ - Stop word removal - Non-content-bearing words like [ለ፣ ወደ] = to, [ከ] = from, [የ] are removed - Stop words in the news domain such as [ገልጿል] disclosed, [አመልክቷል] etc. - Stop words are validated against their frequency in the document collection [a threshold of 100 is used]
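The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the homophone table covers only the characters in the ፀሐይ example, the separator set is an assumed subset of Ethiopic punctuation, and reading the threshold of 100 as "a candidate stop word must occur at least 100 times in the collection" is an assumption.

```python
import re
from collections import Counter

# Assumed, partial homophone table: only the characters from the slide's example.
HOMOPHONES = str.maketrans({"ጸ": "ፀ", "ሃ": "ሐ", "ኅ": "ሐ", "ሀ": "ሐ", "ኃ": "ሐ"})
# Whitespace plus common Ethiopic word and sentence separators (assumed set).
SEPARATORS = re.compile(r"[\s፡።፣፤፥፦]+")


def tokenize(text):
    """Map homophone characters to one form, then split on separators."""
    return [tok for tok in SEPARATORS.split(text.translate(HOMOPHONES)) if tok]


def validated_stop_words(docs, candidates, threshold=100):
    """Keep only candidate stop words whose collection frequency reaches the threshold."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    normalized = {word.translate(HOMOPHONES) for word in candidates}
    return {word for word in normalized if counts[word] >= threshold}


def index_terms(text, stop_words):
    """Tokenize a document and drop the validated stop words."""
    return [tok for tok in tokenize(text) if tok not in stop_words]
```

A real system would need the full homophone table and a curated stop word list; the functions only show where each step sits in the pipeline.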
  • 10. Document clustering process 3. Stemming of indexed terms - Amharic language is morphologically complex - Nouns have inflection [prefix and suffix] - አስተማሩ፣ አስተማረ፣ አስተማረች፣ አስተማርኩ → አስተማረ - Verbs have inflection [prefix, suffix and infix] - ሰበረ፣ ሰበረች፣ ሰበርክ፣ ሰበርሽ፣ አሰበረ → ሰበር፣ ስብር - Stemming brings the word into its common form
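To make the idea of a "common form" concrete, here is a rough sketch that strips one affix and then reduces every character to its sixth-order (bare consonant) form, relying on the Unicode Ethiopic block being laid out in rows of eight per consonant. This is a swapped-in simplification, not the suffix-list stemmer used in the study: the affix lists are hypothetical and tiny, and the skeleton step over-stems many words.

```python
# Hypothetical, deliberately tiny affix lists for illustration only.
SUFFIXES = ("ች", "ክ", "ሽ", "ኩ")
PREFIXES = ("አ",)
MIN_STEM_LENGTH = 3  # never strip an affix if fewer characters would remain


def sixth_order(ch):
    """Return the sixth-order form of an Ethiopic syllable (rows of 8 from U+1200)."""
    cp = ord(ch)
    if 0x1200 <= cp <= 0x1357:
        return chr(cp - (cp - 0x1200) % 8 + 5)
    return ch


def stem(word):
    """Strip at most one suffix and one prefix, then keep the consonant skeleton."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[: -len(suffix)]
            break
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    return "".join(sixth_order(ch) for ch in word)
```

Under these assumptions the five verb forms on the slide (ሰበረ, ሰበረች, ሰበርክ, ሰበርሽ, አሰበረ) all reduce to ስብር.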
  • 11. Document clustering process 4. Representing documents using document vector - Term weighting is used to weight the term frequency - $\mathrm{Weight}(d_i, j) = tf_{ij} \times (\log N - \log n) + 1$ • $tf_{ij}$ is the frequency of term $j$ in document $i$ • $N$ is the number of documents in the collection • $n$ is the number of documents containing the term – Weighted term frequency for index terms
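The weighting formula above can be computed directly from tokenized documents. The sketch below follows the formula as written on the slide, with the +1 added outside the parentheses; the natural logarithm is an assumption, since the slide does not state the base.

```python
import math
from collections import Counter


def term_weights(docs):
    """Return one {term: weight} dict per document.

    weight = tf * (log N - log n) + 1, where tf is the term's frequency in the
    document, N the number of documents and n the number containing the term.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {
            term: tf * (math.log(n_docs) - math.log(doc_freq[term])) + 1
            for term, tf in Counter(doc).items()
        }
        for doc in docs
    ]
```

A term that occurs in every document gets a weight of exactly 1 regardless of its frequency, because log N - log n is then zero.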
  • 12. Document clustering process 5. Clustering the documents - Constructing the initial clusters - Following the FIHC algorithm, initial clusters are constructed by setting the global support between 0 and 1 - The initial cluster groups similar documents together and creates a new cluster whenever it gets a different document - Used global support
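As a rough illustration of initial cluster construction, the sketch below finds term sets that occur in at least a given fraction of the documents (the global support) and treats each such set as the label of one initial cluster. It is a brute-force toy: the function name, the `max_size` limit and the exhaustive enumeration are assumptions for readability, and a real implementation would use a frequent itemset mining algorithm instead.

```python
from itertools import combinations


def initial_clusters(docs, min_support, max_size=2):
    """Map each globally frequent term set to the indices of documents containing it.

    min_support is a fraction between 0 and 1 of the whole collection.
    Initial clusters may overlap: one document can appear under several term sets.
    """
    doc_sets = [set(doc) for doc in docs]
    min_count = min_support * len(doc_sets)
    vocabulary = sorted(set().union(*doc_sets))
    clusters = {}
    for size in range(1, max_size + 1):
        for items in combinations(vocabulary, size):
            members = [i for i, terms in enumerate(doc_sets) if terms.issuperset(items)]
            if len(members) >= min_count:
                clusters[frozenset(items)] = members
    return clusters
```

Lowering `min_support` admits more term sets and therefore more initial clusters, which is the knob tuned in the experiments.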
  • 13. Document clustering process 5. Clustering the documents - Making the clusters disjoint - The score function is used to measure how well a cluster fits the documents at hand. - Hierarchical tree construction - The cluster tree is built using inter cluster similarity - Centroid calculation - Tree pruning
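The disjoint step can be sketched as: score a document against every candidate cluster and keep it only in the best one. The score below is written in the spirit of the FIHC score function (reward globally frequent terms that are also frequent in the cluster, penalize those that are not); the `min_cluster_support` value and the dictionary-based data layout are assumptions for illustration.

```python
def score(doc_tf, cluster_support, global_support, min_cluster_support=0.25):
    """Score how well a cluster fits a document.

    doc_tf: {term: frequency or weight in the document}
    cluster_support: {term: fraction of the cluster's documents containing it}
    global_support: {term: fraction of all documents containing it},
                    holding only the globally frequent terms.
    """
    total = 0.0
    for term, freq in doc_tf.items():
        if term not in global_support:
            continue  # terms that are not globally frequent are ignored
        support_in_cluster = cluster_support.get(term, 0.0)
        if support_in_cluster >= min_cluster_support:
            total += freq * support_in_cluster   # term is typical of the cluster
        else:
            total -= freq * global_support[term]  # term points to another cluster
    return total


def best_cluster(doc_tf, clusters, global_support):
    """Return the name of the cluster with the highest score for this document."""
    return max(clusters, key=lambda name: score(doc_tf, clusters[name], global_support))
```

For example, a document dominated by "sport" would be kept in a cluster where "sport" has high cluster support and removed from a "market" cluster where it does not.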
  • 14. Experimental result • Tuning the global support to get hierarchical documents – More than 10% global support gives flat hierarchy – Less than 1% global support gives a single vertical hierarchy – 5% global support shows a better performance
      Global support | Width  | Depth | Remark
      >= 20%         | <= 9   | 0     | Flat hierarchy
      10%            | 61     | 2     | 1 level hierarchy (only for 2 classes)
      5%             | 92     | 10    | 10 level hierarchy for two classes; 5 level hierarchy for five classes
      <= 1%          | >= 120 | 25    | 25 level hierarchy [took too much time to cluster]
  • 16. Experimental result [Figure: recall-precision curves for global support of 10% and 5%; x-axis: recall, y-axis: precision]
  • 17. Discussion of results • Tuning the global support threshold plays a significant role in creating the required clusters • Stemming affects the clusters and creates overlapping clusters • High precision can be achieved if frequent items (terms) are used • High recall can be achieved when all index terms are used, but this greatly affects precision
  • 18. Future directions • Developing a standard corpus collection • Using ontologies as a concept map • Standardization of Amharic language resources such as a standard stop word list • Further research in stemming [cross domain research] • Comparison with other document clustering algorithms • Comparison with other information retrieval methods
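For reference, precision and recall for a single query can be computed as below. This is the standard definition rather than the study's evaluation script, and the document identifiers in the usage line are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 relevant in the collection, 2 in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 6])
```

Returning more documents can only keep recall the same or raise it, while precision usually falls, which matches the trade-off described in the discussion.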
  • 19. Thank you! Questions?