This document discusses document clustering in the Amharic language for information browsing and retrieval. It introduces the challenges of searching and accessing information in Amharic due to the growing amount of digital documents. The document then describes the process of document clustering, which groups documents based on similarities to organize information. Key steps in the clustering process include document preprocessing, vector representation, and hierarchical clustering. Experimental results show that tuning the global support threshold is important for creating the desired hierarchy, and stemming affects cluster overlap. Future work could involve developing standard Amharic language resources and comparing different clustering and information retrieval methods.
Amharic document clustering
1. Document Clustering in Amharic
for information browsing and retrieval
Yalemisew Mintesinot Abgaz
Yabgaz@computing.dcu.ie
Dec 1, 2011
2. Introduction
• The rate of production of information is growing exponentially
• Documents produced in the Amharic language are increasingly
– available in digital format
– accessible online
• More Amharic web documents than before
• Growing number of Amharic language users
• Increasing number of applications available in Amharic
3. Introduction
• Challenges ahead
– Searching and accessing the information in Amharic is difficult
• From the language perspective
• From the knowledge perspective
• Availability of tools
– Identifying the relevant documents from the available ones is challenging
• Searching and Search results
– Browsing the documents in a concept map is not available
• The challenges call for a solution
4. Agenda Items
• Introduction
• Document clustering
• Document clustering process
• Experimental results
• Conclusion
• Future work
5. Document clustering
• Document clustering is a process of identifying groups or clusters of
documents with common features.
• Groups documents based on similarities of the contents of the documents
• Used for information organization and information retrieval
• To design a retrieval mechanism for searching through the clusters
• Can be
– Hierarchical
– Non-hierarchical
• Is different from document classification
6. Document clustering
• Hierarchical document clustering
– Is a widely used method
– Generates hierarchical classes with generalization at the top and
specialization at the bottom
• Clustering algorithms
– Divisive
– Agglomerative
• Single link, complete link, group average link, and Ward's method
• Frequent itemset-based hierarchical clustering (FIHC)
7. Document clustering process
[Figure: flow diagram of the document clustering process. Recoverable labels: document collection, indexing, index words, stop word list, suffix list, stemming, stemmed index words, vector representation, document term vectors, clustering, cluster representation, query, query processing, query-cluster matching, output documents.]
8. Document clustering process
1. Document collection
- Amharic news documents collected from Walta Information Centre
- Similar documents were selected by previous researchers
- The documents cover various domains such as
- Governance
- Market
- Politics
- Sport
- Education etc.
9. Document clustering process
2. Document pre-processing
- Indexing the documents
- Word identification (Amharic word separators considered)
- Smoothing (characters with the same sound were mapped to a single character)
- ጸሃይ፣ ጸኅይ፣ ጸሀይ፣ ፀሃይ፣ ፀኃይ፣ ፀሀይ… → ፀሐይ
- Stop word removal
- Words like [ለ፣ ወደ] = to, [ከ] = from, and [የ] are removed [non-content-bearing words]
- Stop words in the news domain, such as [ገልጿል] disclosed, [አመልክቷል], etc.
- Stop words are validated against their frequency in the document
collection [a threshold of 100 is used]
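The pre-processing steps above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the character mapping is inferred only from the single example on the slide, the separator set is an assumed list of common Ethiopic punctuation marks, and the stop word set contains only the standalone words quoted on the slide. The names `HOMOPHONE_MAP`, `SEPARATORS`, `STOP_WORDS`, `normalize`, and `preprocess` are hypothetical.

```python
import re

# Assumption: mapping inferred from the slide's example only; a real table
# would cover every group of same-sounding characters.
HOMOPHONE_MAP = str.maketrans({"ጸ": "ፀ", "ሃ": "ሐ", "ኅ": "ሐ", "ሀ": "ሐ", "ኃ": "ሐ"})

# Assumption: whitespace plus common Ethiopic punctuation act as word separators.
SEPARATORS = re.compile(r"[\s፡።፣፤፥፦]+")

# Only the standalone stop words quoted on the slide (hypothetical, incomplete list).
STOP_WORDS = {"ወደ", "ገልጿል", "አመልክቷል"}


def normalize(word):
    """Map same-sounding characters to one canonical character."""
    return word.translate(HOMOPHONE_MAP)


def preprocess(text, stop_words=STOP_WORDS):
    """Split text into words, normalize each word, and drop stop words."""
    tokens = [normalize(t) for t in SEPARATORS.split(text) if t]
    return [t for t in tokens if t not in stop_words]
```

For example, `preprocess("ጸሃይ ወደ ፀሀይ።")` yields two tokens, both in the canonical spelling, with the stop word removed.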
10. Document clustering process
3. Stemming of indexed terms
- The Amharic language is morphologically complex
- Nouns have inflection [prefix and suffix]
- አስተማሩ፣ አስተማረ፣ አስተማረች፣ አስተማርኩ → አስተማረ
- Verbs have inflection [prefix, suffix, and infix]
- ሰበረ፣ ሰበረች፣ ሰበርክ፣ ሰበርሽ፣ አሰበረ → ሰበር, ስብር
- Stemming brings the word into its common form
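As a rough illustration of suffix-list stemming, the toy sketch below strips the longest matching suffix from a word. It is not the stemmer used in the study: the suffix list is hypothetical and holds only endings visible in the slide's examples, and the function ignores prefixes, infixes, and the vowel changes inside Amharic characters that a real stemmer must handle.

```python
# Hypothetical suffix list: only endings that appear in the slide's examples.
SUFFIXES = ("ች", "ክ", "ሽ")


def strip_suffix(word, suffixes=SUFFIXES, min_stem_len=2):
    """Remove the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word
```

With this list, `strip_suffix("አስተማረች")` gives `አስተማረ` and `strip_suffix("ሰበርክ")` gives `ሰበር`, matching two of the reductions shown on the slide; forms such as አሰበረ need prefix handling that this sketch does not attempt.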
11. Document clustering process
4. Representing documents using document vector
- Term weighting is used to weight the term frequency
- $\mathrm{Weight}(d_i, j) = tf_{ij} \times (\log N - \log n) + 1$
• $tf_{ij}$ is the frequency of term j in document i
• $N$ is the number of documents in the collection and
• $n$ is the number of documents containing the term.
– Weighted term frequency for index terms
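The weighting formula can be sketched in Python as below. Two points are assumptions: the slide does not state the logarithm base (natural log is used here), and the formula is read literally, with the +1 added after the product. The function name `term_weights` is hypothetical.

```python
import math
from collections import Counter


def term_weights(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. The weight follows the slide's formula,
    tf * (log N - log n) + 1, read literally and with natural logs (assumed).
    """
    n_docs = len(docs)
    # n: number of documents containing each term
    doc_freq = Counter(term for terms in docs for term in set(terms))
    weighted = []
    for terms in docs:
        tf = Counter(terms)  # frequency of each term in this document
        weighted.append({
            term: freq * (math.log(n_docs) - math.log(doc_freq[term])) + 1
            for term, freq in tf.items()
        })
    return weighted
```

A term that occurs in every document has log N - log n = 0, so its weight is 1 regardless of its frequency; rarer terms receive weights that grow with their in-document frequency.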
12. Document clustering process
5. Clustering the documents
- Constructing the initial clusters
- Following the FIHC algorithm, initial clusters are constructed by setting the
global support between 0 and 1
- The initial clustering groups similar documents together and creates a new cluster whenever it encounters a different document
- Used global support
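The role of the global support threshold in building initial clusters can be sketched as follows. This is a simplification under stated assumptions: FIHC works with frequent itemsets, while this sketch uses single terms only, and the function names are hypothetical. Each term whose document frequency reaches the threshold defines one initial cluster containing every document that uses the term.

```python
from collections import defaultdict


def global_frequent_terms(docs, min_global_support):
    """Terms that occur in at least min_global_support (a fraction in 0..1) of docs."""
    doc_freq = defaultdict(int)
    for terms in docs:
        for term in set(terms):
            doc_freq[term] += 1
    return {t for t, df in doc_freq.items() if df / len(docs) >= min_global_support}


def initial_clusters(docs, min_global_support):
    """Map each globally frequent term to the indices of the documents containing it."""
    frequent = global_frequent_terms(docs, min_global_support)
    clusters = {term: [] for term in frequent}
    for index, terms in enumerate(docs):
        for term in set(terms) & frequent:
            clusters[term].append(index)
    return clusters
```

Because a document can contain several frequent terms, these initial clusters overlap; a lower threshold admits more terms and therefore produces more initial clusters.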
13. Document clustering process
5. Clustering the documents
- Making the clusters disjoint
- The score function is used to measure how well a cluster fits the documents at hand.
- Hierarchical tree construction
- The cluster tree is built using inter cluster similarity
- Centroid calculation
- Tree pruning
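The slides do not give the score function, so the sketch below follows the form usually cited for FIHC and should be read as an assumption: a document gains its weight times the cluster support for each globally frequent term that is also frequent in the cluster, and loses its weight times the global support for each globally frequent term that is not. Documents are {term: weight} dicts; the names and the `min_cluster_support` default are hypothetical.

```python
def cluster_support(cluster_docs, term):
    """Fraction of the cluster's documents that contain the term."""
    if not cluster_docs:
        return 0.0
    return sum(1 for d in cluster_docs if term in d) / len(cluster_docs)


def score(doc, cluster_docs, global_support, min_cluster_support=0.5):
    """Score how well a cluster fits a document (assumed FIHC-style form)."""
    total = 0.0
    for term, weight in doc.items():
        if term not in global_support:
            continue  # only globally frequent terms contribute
        support = cluster_support(cluster_docs, term)
        if support >= min_cluster_support:
            total += weight * support  # reward: term is frequent in the cluster
        else:
            total -= weight * global_support[term]  # penalty: term is not
    return total


def best_cluster(doc, clusters, global_support, min_cluster_support=0.5):
    """Name of the highest-scoring cluster; keeping only this one makes clusters disjoint."""
    return max(
        clusters,
        key=lambda name: score(doc, clusters[name], global_support, min_cluster_support),
    )
```

Assigning every document to its `best_cluster` and removing it from all others is one way to turn the overlapping initial clusters into disjoint ones.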
14. Experimental result
• Tuning the global support to get hierarchical documents
– More than 10% global support gives a flat hierarchy
– Less than 1% global support gives a single vertical hierarchy
– 5% global support shows better performance
Global support   Width    Depth   Remark
>= 20%           <= 9     0       Flat hierarchy
10%              61       2       1-level hierarchy (only for 2 classes)
5%               92       10      10-level hierarchy for two classes; 5-level hierarchy for five classes
<= 1%            >= 120   25      25-level hierarchy [took too much time to cluster]
17. Discussion of results
• Tuning the global support threshold plays a significant role in
creating the required clusters
• Stemming affects the clusters and creates overlapping clusters
• High precision can be achieved if frequent items (terms) are used
• High recall can be achieved when all index terms are used,
but this greatly affects precision
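For reference, precision and recall for a single query can be computed as in the short sketch below, using the standard definitions. The function name and the use of document identifiers are illustrative assumptions; the slides do not describe how the measures were computed in the experiments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of documents returned by the system
    relevant:  ids of documents judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Returning more documents can only keep or raise the number of hits, so recall does not fall, while precision drops whenever the extra documents are not relevant; this is the trade-off the bullets describe.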
18. Future directions
• Developing a standard corpus collection
• Using ontologies as a concept map
• Standardization of Amharic language resources, such as a standard
stop word list
• Further research in stemming [cross domain research]
• Comparison with other document clustering algorithms
• Comparison with other information retrieval methods