The slides present a method for the automatic detection and correction of malapropism errors in documents, using the WordNet lexical database, a search engine (Google), and a dictionary of paronyms.
Comparative Study of Ant Colony Optimization And Gang Scheduling - IJTET Journal
Abstract— Ant Colony Optimization (ACO) is a well-known and rapidly evolving meta-heuristic technique. Many optimization problems have already taken advantage of ACO, and many more applications are emerging; in particular, ACO has been used as an effective algorithm for solving the scheduling problem in grid computing. Gang scheduling, by contrast, is a scheduling algorithm for parallel systems that schedules related threads or processes to run simultaneously on different processors. The scheduled threads usually belong to the same process, but in some cases they come from different processes, for example when the processes have a producer-consumer relationship or when all processes come from the same MPI program.
The slides present a text recovery method based on probabilistic post-recognition processing of the output of an Optical Character Recognition system. The proposed method aims to fill in the gaps of missing text resulting from the recognition of degraded documents. For this task, a corpus of up to 5-grams provided by Google is used. After presenting the general problem and alternative solutions, several heuristics for applying this corpus to the task are described. These heuristics have been validated through a set of experiments, which are discussed together with the results obtained.
We are developing a web-based system to detect plagiarism in written Arabic documents. This paper describes the proposed framework of our plagiarism detection system, which comprises two main components, one global and one local. The global component is heuristics-based: from a potentially plagiarized document, it constructs a set of representative queries using several best-performing heuristics, and submits them to Google via Google's search API to retrieve candidate source documents from the Web. The local component then carries out detailed similarity computations, combining different similarity computation techniques to determine which parts of the given document are plagiarized and from which of the source documents retrieved from the Web. Since this is an ongoing research project, the overall system has not yet been evaluated.
Swarm intelligence is a biologically inspired field that studies how social behaviors emerge from the interactions between individuals in a decentralized system. It draws inspiration from natural systems like bird flocking and ant colonies. Particle swarm optimization and ant colony optimization are two popular swarm intelligence algorithms. PSO mimics bird flocking by having particles update their velocities based on their own experience and the swarm's experience. ACO mimics ant foraging behavior by having artificial ants deposit and follow pheromone trails to iteratively find optimal solutions. Both algorithms have been applied to a broad range of optimization and routing problems.
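As a concrete illustration of the PSO update just described, here is a minimal sketch; the inertia and acceleration coefficients, the bounds, and the sphere objective are illustrative assumptions rather than values from the document.

```python
# Minimal PSO sketch: velocity = inertia + cognitive pull toward the
# particle's own best + social pull toward the swarm's best.
import random

def pso(objective, dim=2, n_particles=30, iters=100,
        w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0):
    pos = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # each particle's best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # swarm-wide best

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # inertia + cognitive (own experience) + social (swarm experience)
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Example: minimize the sphere function.
print(pso(lambda p: sum(x * x for x in p)))
```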
Welcome to International Journal of Engineering Research and Development (IJERD) - IJERD Editor
The document compares the ant colony optimization (ACO) and bee colony optimization (BCO) algorithms for detecting spam hosts. It first provides background on ACO, which is inspired by how ants find food sources, and BCO, which is inspired by honeybee foraging behavior. The document then describes applying both algorithms to a spam host detection problem. Features are extracted from normal and spam hosts in a dataset to train classification models using ACO and BCO. The optimal solutions from ACO and BCO are then compared to determine which algorithm performs better at detecting spam hosts.
An Improved Ant Colony System Algorithm for Solving Shortest Path Network Pro... - Lisa Riley
This document presents an improved ant colony system algorithm for solving shortest path network problems. The improvements include introducing dynamic programming into the heuristic information and applying a ratio approach to the local pheromone update process. The algorithm is tested on a hypothetical network of 10 nodes and 20 edges. The results show that the improved ant colony algorithm outperforms the existing one by requiring fewer iterations to converge to the optimal solution.
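The specific improvements above (the dynamic-programming heuristic and the ratio-based local pheromone update) are particular to the paper, but the baseline they build on can be sketched. The following is a hedged, textbook-style ant colony sketch for shortest paths on an invented graph, not the improved algorithm itself; all parameters are illustrative.

```python
# Textbook ant-colony sketch for shortest paths: ants pick the next node with
# probability proportional to pheromone^alpha * (1/edge_weight)^beta, deposit
# pheromone inversely proportional to path length, and pheromone evaporates.
import random

graph = {0: {1: 2.0, 2: 5.0}, 1: {2: 1.0, 3: 4.0}, 2: {3: 1.0}, 3: {}}
tau = {(u, v): 1.0 for u in graph for v in graph[u]}   # pheromone per edge
alpha, beta, rho, Q = 1.0, 2.0, 0.5, 1.0

def build_path(src, dst):
    path, node = [src], src
    while node != dst:
        nbrs = [v for v in graph[node] if v not in path]
        if not nbrs:
            return None                                # dead end, discard ant
        weights = [tau[(node, v)] ** alpha * (1.0 / graph[node][v]) ** beta
                   for v in nbrs]
        node = random.choices(nbrs, weights=weights)[0]
        path.append(node)
    return path

best, best_len = None, float("inf")
for _ in range(100):                                   # iterations
    for _ in range(10):                                # ants per iteration
        path = build_path(0, 3)
        if path is None:
            continue
        length = sum(graph[u][v] for u, v in zip(path, path[1:]))
        if length < best_len:
            best, best_len = path, length
        for u, v in zip(path, path[1:]):               # deposit pheromone
            tau[(u, v)] += Q / length
    for e in tau:                                      # evaporation
        tau[e] *= (1.0 - rho)
print(best, best_len)
```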
Optimized Robot Path Planning Using Parallel Genetic Algorithm Based on Visib... - IJERA Editor
An analysis is made of optimized path planning for a mobile robot using a parallel genetic algorithm. The parallel genetic algorithm (PGA) is applied to the visible midpoint approach to find the shortest path for the mobile robot, and the hybrid of these two algorithms provides a better optimized solution for a smooth and shortest path. In this problem, the visible midpoint approach is used for its effectiveness in avoiding local minima: it yields optimum paths that always consist of collision-free trajectories. The proposed hybrid parallel genetic algorithm converges very fast to the shortest route from source to destination because the population is shared: the total population is partitioned into a number of subgroups to run the GA in parallel, with a master thread acting as the center of information exchange and performing selection with fitness evaluation. The cell-to-cell crossover makes the algorithm significantly better, and the problem converges quickly within a small number of iterations.
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS - gerogepatton
Document similarity is an important part of Natural Language Processing and is most commonly used for plagiarism detection and text summarization. Thus, finding the overall most effective document similarity algorithm could have a major positive impact on the field of Natural Language Processing. This report sets out to examine the numerous document similarity algorithms and determine which ones are the most useful. It addresses the most effective document similarity algorithm by categorizing them into three types: statistical algorithms, neural networks, and corpus/knowledge-based algorithms. The most effective algorithms in each category are also compared in our work using a series of benchmark datasets and evaluations that test every area in which each algorithm could be used.
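For orientation, the statistical category mentioned above is often exemplified by TF-IDF vectors compared with cosine similarity; a minimal sketch with invented sample texts follows, using scikit-learn as one common implementation.

```python
# TF-IDF + cosine similarity: a basic statistical document-similarity measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Plagiarism detection compares documents for overlapping passages.",
    "Text summarization condenses documents while preserving meaning.",
    "Detecting plagiarism relies on measuring overlap between documents.",
]
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf))   # pairwise similarity matrix, 1.0 on diagonal
```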
The document summarizes Sentimatrix, a multilingual sentiment analysis service that can extract sentiments from text and associate them with named entities. It uses a combination of rule-based classification, statistics, and machine learning. The system has modules for preprocessing text, detecting the language, recognizing named entities, and identifying sentiments. It was evaluated on Romanian texts and achieved promising results, with an F-measure of 90.72% for named entity extraction and 66.73% for named entity classification. The system represents sentiments as weights and uses sentiment triggers, modifiers, and negation words to determine the overall sentiment expressed towards an entity.
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING - csandit
The proposed approach deals with the detection of jargon words in electronic data across different communication media such as the Internet and mobile services. In real life, however, jargon words are not always used in their complete forms; most of the time they appear in abbreviated forms, such as sounds-alike forms or taboo morphemes. The proposed approach detects these abbreviated forms as well, using a semi-supervised learning methodology that derives the probability of a suspicious word being a jargon word from synset and concept analysis of the text.
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING - cscpconf
This paper proposes a semi-supervised learning approach to detect jargon words in text. It handles jargon words directly in the text as well as abbreviated forms like sounds-alike words. It uses a sliding window technique to detect suspicious words that partially match jargon words. A learning methodology assigns probabilities to suspicious words based on the concept derived from the text and stores them with a counter. Words are marked as jargon when the probability passes a threshold.
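A hedged sketch of the sliding-window flow described above follows; the jargon list, window size, and threshold are invented, and the paper's synset/concept-based probability update is simplified here to a plain per-word counter.

```python
# Sliding-window partial matching against a jargon list, with a counter that
# must pass a threshold before a word is marked as jargon.
JARGON = {"noob", "pwned", "gr8"}
THRESHOLD = 3
counts = {}   # suspicious word -> times flagged so far

def partial_match(word, jargon, window=3):
    """Slide a character window over the word and check whether any
    fragment matches the start of a known jargon form."""
    w = word.lower()
    for i in range(max(1, len(w) - window + 1)):
        frag = w[i:i + window]
        if any(j.startswith(frag) for j in jargon):
            return True
    return False

def scan(text):
    flagged = []
    for word in text.split():
        if word.lower() in JARGON or partial_match(word, JARGON):
            counts[word] = counts.get(word, 0) + 1
            if counts[word] >= THRESHOLD:
                flagged.append(word)   # confident enough to mark as jargon
    return flagged

for msg in ["u got pwned m8", "gr8 game noob", "gr8 stuff noob again"]:
    print(scan(msg))
```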
A Survey of String Matching Algorithms - IJERA Editor
String matching algorithms play an important role in finding the places where one or several strings (patterns) occur within a larger body of text (e.g., a data stream, a sentence, a paragraph, or a book). Their applications cover a wide range, including intrusion detection systems (IDS) in computer networks, bioinformatics, plagiarism detection, information security, pattern recognition, document matching, and text mining. In this paper we present a short survey of well-known, recently updated, and hybrid string matching algorithms. These algorithms can be divided into two major categories: exact string matching and approximate string matching. The classification criteria were selected to highlight important features of the matching strategies, in order to identify challenges and vulnerabilities.
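As a concrete instance from the exact-matching category surveyed above, here is a self-contained Knuth-Morris-Pratt implementation; it is one classic algorithm among the many the survey covers.

```python
# Knuth-Morris-Pratt exact string matching: precompute a failure table so the
# search never re-examines text characters, giving O(n + m) time.
def kmp_search(text, pattern):
    # Failure table: longest proper prefix of pattern that is also a suffix.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    hits, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):          # full match ends at position i
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

print(kmp_search("abracadabra", "abra"))   # [0, 7]
```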
The authors present two novel progressive duplicate detection algorithms called Progressive Sorted Neighborhood Method (PSNM) and Progressive Blocking (PB) that improve the efficiency of duplicate detection over traditional approaches. PSNM works best on small, clean datasets by sorting records and comparing those within a sliding window, prioritizing nearby records. PB works best on large, dirty datasets by progressively combining blocks of records based on likelihood of matching. Experiments show these algorithms can double the efficiency of traditional methods and outperform related work by finding more duplicate pairs earlier within a given time frame.
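The progressive prioritization is the authors' contribution, but the underlying sorted-neighborhood idea can be sketched briefly; the records, sort key, and similarity test below are invented stand-ins.

```python
# Sorted neighborhood (non-progressive baseline of PSNM): sort records by a
# key so likely duplicates land near each other, then compare only records
# inside a sliding window instead of all pairs.
import difflib

records = ["John Smith", "Jon Smith", "Alice Brown", "J. Smith", "Alyce Brown"]

def is_duplicate(a, b, cutoff=0.8):
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff

def sorted_neighborhood(recs, window=3):
    ordered = sorted(recs)                    # sorting key: the record itself
    pairs = []
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            if is_duplicate(rec, ordered[j]):
                pairs.append((rec, ordered[j]))
    return pairs

print(sorted_neighborhood(records))
```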
The document proposes using text distortion and algorithmic clustering based on string compression to analyze the effects of progressively destroying text structure on the information contained in texts. Several experiments are carried out on text and artificially generated datasets. The results show that clustering results worsen as structure is destroyed in strongly structural datasets, and that using a compressor that enables context size choice helps determine a dataset's nature. These results are consistent with those from a method based on multidimensional projections.
This document provides an overview of bioinformatics and discusses key concepts like:
- Bioinformatics combines biology, computer science, and information technology to analyze large amounts of biological data.
- High-throughput DNA sequencing has generated vast genomic data that requires bioinformatics tools and databases accessible via the internet to analyze and share.
- Popular sequence alignment tools like BLAST, FASTA, and ClustalW are used to search databases and compare sequences, helping researchers analyze genes and genomes.
Comparing Three Plagiarism Tools (Ferret, Sherlock, and Turnitin) - Waqas Tariq
Abstract: An experiment was carried out with three plagiarism detection tools (two free/open-source tools, Ferret and Sherlock, and one commercial web-based tool, Turnitin) on Clough and Stevenson's corpus, which contains documents classified into three types of plagiarism and one type of non-plagiarism. The experiment targeted extrinsic/external plagiarism detection. The goal was to observe the performance of the tools on the corpus; to analyze, compare, and discuss their outputs; and finally to see whether the tools' identification of documents matches that of Clough and Stevenson. It appeared that Ferret and Sherlock produce, in most cases, the same plagiarism detection results, whereas Turnitin reported results very different from the other two tools, showing a higher percentage of similarity between the documents and the source. After investigating the reason (only Ferret and Turnitin were checked, because Sherlock does not provide a view of two documents with their overlapping and distinct parts), it was discovered that Turnitin performs quite acceptably and it is Ferret that does not show the expected percentage: Ferret takes the longer text (for this corpus, always the source) as the base and then measures how much of that text is overlapped by the shorter one, reporting the result as the percentage of similarity between the two documents, which leads to misleading results. From this it can also be speculated that Sherlock does not report its results properly.
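The asymmetry described above is easy to reproduce. A worked toy example follows, contrasting a Ferret-style score (overlap divided by the longer document's trigram count) with containment of the shorter document and symmetric Jaccard resemblance; the texts are invented.

```python
# Word-trigram overlap scored three ways, showing how the choice of
# denominator changes the reported "similarity".
def trigrams(text):
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

short = "the cat sat on the mat"
long_src = "the cat sat on the mat " + "while the dog slept by the warm fire " * 5

a, b = trigrams(short), trigrams(long_src)
overlap = a & b
print("overlap / longer set :", len(overlap) / len(b))      # Ferret-like, low
print("overlap / shorter set:", len(overlap) / len(a))      # containment, high
print("Jaccard              :", len(overlap) / len(a | b))  # symmetric
```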
This document outlines the development of OpenDiscovery, an open source platform for automated docking of ligands to proteins and molecular simulation. It describes using freely available tools like AutoDock Vina, OBabel and PyMol to screen a library of compounds against a target protein, generate similar compounds, prepare ligand and receptor files, and visualize and summarize results. Issues around parameterizing novel ligands for molecular dynamics simulation are also discussed. The goal is to create a high-throughput virtual screening workflow to propose ligands for further testing, and contribute to OpenDiscovery which will integrate cheminformatics and additional analysis tools.
Structure based drug design - kiranmayi (KiranmayiKnv)
This presentation provides a detailed look at structure-based drug design. It covers the types of structure-based drug design and a detailed study of docking and de novo drug design.
Polymorphic worm detection using structural infor (control flow gra... - zeinabmovasaghinia
This paper presents a technique for detecting polymorphic worms by analyzing the structural properties of executable code in network flows. The technique generates fingerprints based on identifying common subgraphs in the control flow graphs (CFGs) of executable regions, rather than comparing byte strings. This makes the fingerprints more robust against code modifications aimed at evading signature-based detection. The technique satisfies properties of uniqueness, robustness to code insertions/deletions, and partial robustness to code modifications, allowing it to correlate variations of the same polymorphic worm in different network flows. A prototype system is implemented to evaluate the technique.
The slides present the steps followed in order to build a POS tagger for the Romanian language for a special kind of text: chats. We show the main differences that can be observed between chats and novel writing, and present comparatively the results obtained with a chat-trained model and a novel-trained one when tagging a corpus of chats.
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext... - Christophe Tricot
This document compares a knowledge-poor probabilistic approach and a knowledge-rich rule-based approach for multilingual terminology extraction. It describes the methodology and resources used for each approach. The knowledge-poor approach uses distributional clustering to induce part-of-speech tags and conditional random fields to extract candidate terms from small annotated corpora. The knowledge-rich approach relies on hand-crafted patterns and rules based on part-of-speech tags to identify terms and variants. Both approaches are evaluated on terminology extraction for six languages from comparable corpora in the domains of wind energy and mobile technologies, using reference term lists.
[IROS2017] Online Spatial Concept and Lexical Acquisition with Simultaneous L... - Akira Taniguchi
○Akira Taniguchi, Yoshinobu Hagiwara, Tadahiro Taniguchi, and Tetsunari Inamura, "Online Spatial Concept and Lexical Acquisition with Simultaneous Localization and Mapping", IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS2017), 2017.
Video: https://youtu.be/hVKQCdbRQVM
This document is a thesis that proposes using word embeddings to improve information retrieval by addressing term mismatch issues. It discusses word2vec, a technique for learning word embeddings from large text corpora that capture semantic relationships between words. The thesis proposes two approaches: 1) incorporating word embedding similarities into a probabilistic language model for retrieval and 2) a vector space model. Due to time constraints, only the first approach is implemented, which integrates word embeddings into ALMasri and Chevallet's probabilistic language model. Experiments are conducted to evaluate the impact of using semantic features from word embeddings on retrieval effectiveness.
The peer-reviewed International Journal of Engineering Inventions (IJEI) was started with a mission to encourage contributions to research in Science and Technology, and to encourage and motivate researchers in challenging areas of Science and Technology.
Because of the ubiquity of metaphors in language, metaphor processing is an important task in the field of natural language processing. The first step towards metaphor processing, and probably the most difficult one, is metaphor detection. In the first part of this paper, we review the theoretical background on metaphors and the models and implementations that have been proposed for their detection. We then build corpora for detecting three types of metaphors: IS-A metaphors, metaphors formed with the preposition ‘of’, and metaphors formed with a verb. For the first two tasks, we train supervised classifiers using semantic features; for the third, we use features commonly used in text categorization.
Congestion Control in Wireless Sensor Networks Using Genetic Algorithm - Editor IJCATR
A sensor network consists of a large number of small nodes that interact strongly with the physical environment, collect environmental data through sensors, and react after processing the information. Wireless network technologies are widely used in most applications, and since wireless sensor networks carry much information-transmission activity, network congestion cannot be avoided. New methods therefore seem necessary to control congestion and use existing resources to serve traffic demands better. Congestion increases packet loss and the retransmission of dropped packets, and also wastes energy. In this paper, a novel method for congestion control in wireless sensor networks using a genetic algorithm is presented. Simulation results show that the proposed method, in comparison with the LEACH algorithm, can significantly improve congestion control at high speeds.
The scope of the current thesis is within the Natural Language Understanding sub-field of Natural Language Processing. Among the many possible tasks in this domain, we focused on Discourse Analysis. We analyzed the main approaches existing in this field and identified the flaws of each. Starting from them, we proposed an adaptation of an existing framework (the Polyphonic framework) using ideas derived from the theory of a well-known linguist (Tannen) regarding the importance of repetition in discourse. After presenting our adaptation, we showed how it would solve most of the problems indicated for the other approaches. In order to verify the effectiveness of the adapted framework, we presented several applications meant to demonstrate its utility for discourse visualization, for the identification and classification of the important moments of a discourse, for the assessment of chat conversations based on repetition and rhythmicity, for malapropism detection and correction, and for text recovery.
- The document describes using time series analysis models like ARIMA to forecast daily sales quantities of products like paintings for an online retailer.
- The best model was found to be an ARIMA(7,0,2) model, which uses the previous 7 days' values to predict future values without differencing the data.
- This model provided more accurate predictions than the Facebook Prophet model based on error metrics, while converging during both training and testing.
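A hedged sketch of fitting such an ARIMA(7, 0, 2) model with statsmodels follows; the synthetic daily series stands in for the retailer's sales data, which is not available here.

```python
# Fit ARIMA(7, 0, 2): 7 autoregressive lags, no differencing, 2 MA terms.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
days = pd.date_range("2023-01-01", periods=200, freq="D")
# Synthetic daily sales with a weekly pattern; stationary, so d=0 is apt.
sales = pd.Series(20 + 5 * np.sin(2 * np.pi * np.arange(200) / 7)
                  + rng.normal(0, 1, 200), index=days)

model = ARIMA(sales, order=(7, 0, 2))
fitted = model.fit()
print(fitted.forecast(steps=14))        # two-week-ahead point forecasts
```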
The document describes improvements made to an existing application used to identify important moments in student collaborative chats. The improvements include: 1) Implementing a redirection system to analyze utterance timestamps to identify intense discussion periods, 2) Overlapping graphics to correlate concepts with disputed chat parts to identify more important concepts, 3) Increasing availability by creating a web application and avoiding user intervention for moment detection. The improved application can better identify important moments by considering both concept distribution and dialogue intensity over time.
The document summarizes a research paper on developing digital services to emphasize pollution phenomena using statistics and time series analysis. The paper presented at the 8th International Conference on Exploring Services Science discusses how it extracts concepts related to pollution from literature, analyzes frequency of concepts over time, and identifies peaks that correspond to pollution events. It finds that awareness of pollution threats increased in the late 1960s and presents limitations such as delays in reporting events and difficulty identifying all factors influencing time series. The methodology could be improved by better distinguishing yearly events and developing predictive models.
These slides present an application for identifying English words whose use is cyclic or varies regularly over time. The purpose of the developed application was to build a cross-platform system for indexing and analyzing graphs of word usage over time. For word indexing, we used the data provided by the Google Books N-grams Corpus, filtered afterwards using the WordNet lexical database. For identifying the cyclic or regularly varying words, we used two different algorithms: autocorrelation and dynamic time warping. The results of the analysis can be visualized through a web interface, and the application also offers the possibility to view the evolution of the usage frequency of different words over time.
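A minimal sketch of the autocorrelation test follows; the yearly frequency series is synthetic (standing in for Google Books N-gram counts), and the peak-detection rule is an illustrative heuristic rather than the slides' exact criterion.

```python
# Detect cyclic word usage via normalized autocorrelation of a yearly series.
import numpy as np

years = np.arange(1900, 2000)
freq = 100 + 30 * np.sin(2 * np.pi * (years - 1900) / 10)   # 10-year cycle

x = freq - freq.mean()
acf = np.correlate(x, x, mode="full")[len(x) - 1:]          # lags 0..N-1
acf /= acf[0]                                               # 1.0 at lag 0

# Report lags where autocorrelation peaks strongly: candidate cycle lengths.
peaks = [lag for lag in range(2, 40)
         if acf[lag] > 0.5 and acf[lag] >= acf[lag - 1] and acf[lag] >= acf[lag + 1]]
print(peaks)   # expect multiples of 10 for this synthetic series
```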
These slides present an application designed to analyze news articles from the Romanian mass media and extract opinions about political entities relevant to the major political stage. The application was created with the desire to study media polarization around important political events, such as legislative or presidential elections. The application uses different crawlers to extract the data from online newspapers and save it in the database. Then, it uses several Machine Learning techniques for identifying and classifying opinions about given entities over a long span of time. Based on this classification, it generates reports and charts that could be used not only to study political polarization, but also to identify partisan media.
Language is a living corpus: words tend to be created or to disappear over time, and even the degree of usage of certain words fluctuates due to historical events, cultural movements, or scientific discoveries. These changes in the language are reflected in written texts, and thus, by tracking them, one can determine the moment when those texts were written. In this paper, we present an application that uses time series analysis built on top of the Google Books N-gram corpus to determine the time period during which a text was written. The application is based on fingerprinting words to find the time interval when they were most probably used, and on each word's importance for the given text. Combining the fingerprints of all the text's words according to their importance allows the time stamping of that text.
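A hedged sketch of the combination step follows: each word contributes a per-year usage distribution (its fingerprint), weighted by its importance in the text, and the peak of the combined curve gives the estimated period. The fingerprints and weights here are invented.

```python
# Combine per-word temporal fingerprints, weighted by word importance,
# and take the peak of the combined curve as the estimated writing period.
import numpy as np

years = np.arange(1800, 2000)

def fingerprint(peak, width=15.0):
    """Toy per-year usage distribution for a word, normalized to sum to 1."""
    f = np.exp(-0.5 * ((years - peak) / width) ** 2)
    return f / f.sum()

# word -> (fingerprint, importance weight in the analyzed text)
words = {
    "telegraph": (fingerprint(1870), 0.5),
    "locomotive": (fingerprint(1885), 0.3),
    "gramophone": (fingerprint(1900), 0.2),
}

combined = sum(w * fp for fp, w in words.values())
print("estimated period:", years[int(np.argmax(combined))])
```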
These slides address the issue of predicting the resale price of cars based on ads extracted from popular car-reselling websites. To obtain the most accurate predictions, we used two machine learning algorithms (multiple linear regression and random forest) to build multiple models reflecting the importance of different combinations of features in the final price of the cars. The predictions are generated by models trained on ads extracted from such sites. The developed system provides the user with an interface that allows navigating through ads to assess the fairness of prices compared to the predicted ones.
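A minimal sketch of the two-model setup follows, fitting multiple linear regression and a random forest on invented car-ad features and comparing their test scores; the feature set and toy pricing rule are assumptions for illustration.

```python
# Compare linear regression and random forest on synthetic car-ad data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.integers(2000, 2023, n),          # year of manufacture
    rng.uniform(0, 300_000, n),           # mileage (km)
    rng.uniform(50, 300, n),              # engine power (hp)
])
# Toy price rule: newer, lower-mileage, more powerful cars cost more.
y = (X[:, 0] - 2000) * 500 - X[:, 1] * 0.02 + X[:, 2] * 40 + rng.normal(0, 500, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, "R^2 =", round(model.score(X_te, y_te), 3))
```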
These slides address the problem of capturing, processing and analyzing images from the video stream of the Hearthstone game in order to obtain relevant information on the conduct of parties in this game. Since the information needs to be presented to the user in real-time, we needed to find the most suitable methods of extracting this information. Therefore, techniques such as background subtraction, histograms comparisons, key points matching, optical character recognition were investigated. Driven by the required processing speed, we ended up using optical character recognition on limited areas of interest from the captured image. After developing the application, we tested it in real-world context, while real games were played and presented the obtained results. In the end, we also provided two examples where the application would prove useful for better decision making during the game.
These slides present Movie Recommender, a system which provides movie recommendations based on the information known about the users. These recommendations are made by analyzing the users' psychological profiles, their watching history, and the movies' scores on other websites, and are based on aggregate similarity calculation. The system uses both collaborative filtering and content filtering (with an approach based on different features of the movies in the database). Although similar applications are available, they tend to ignore the data specific to the user, which in our opinion is essential for modeling his/her behavior.
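A minimal sketch of the collaborative-filtering half of such a system follows: user-user cosine similarity over a tiny invented ratings matrix, with predictions taken as similarity-weighted averages. The matrix and weighting are illustrative assumptions, not the system's actual aggregate similarity formula.

```python
# User-based collaborative filtering on a toy ratings matrix.
import numpy as np

# rows = users, cols = movies; 0 means "not rated"
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    mask = (u > 0) & (v > 0)               # compare only co-rated movies
    if not mask.any():
        return 0.0
    return u[mask] @ v[mask] / (np.linalg.norm(u[mask]) * np.linalg.norm(v[mask]))

def predict(user, movie):
    sims = np.array([cosine(R[user], R[v]) if v != user and R[v, movie] > 0 else 0.0
                     for v in range(len(R))])
    if sims.sum() == 0:
        return 0.0
    return sims @ R[:, movie] / sims.sum()  # similarity-weighted average rating

print(predict(0, 2))                        # user 0's predicted rating for movie 2
```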
Language undergoes an everlasting process of change, both at a semantic level, where existing words acquire new meanings, and at a lexical level, where new concepts appear and old ones disappear or are used less frequently. New words (terms/concepts) may be added as a result of scientific discoveries or socio-cultural influences, while other words are "forgotten" or assigned alternative meanings. Such changes in a vocabulary usually mark important shifts in the environment or the domain where it is used. For experts there is an evident connection between a new concept and some of the existing ones, but for regular people these relations remain hidden and need to be identified. In particular, in the medical domain new terms appear as a result of new discoveries, and it becomes an important challenge to establish the connections between different concepts, and indeed to detect whether such a relation even exists. In this paper, we present a graph-based approach to identify the semantic path (a chain of semantically related words) between concepts that appeared in the biomedical publications available in the PubMed corpus over a period of 20 years.
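A hedged sketch of the graph idea follows: terms are nodes, related terms are edges, and a shortest path between two concepts serves as the semantic path. The toy edges are invented; the paper derives its graph from PubMed.

```python
# Shortest "semantic path" between two concepts via breadth-first search.
from collections import deque

edges = {
    "aspirin": ["inflammation", "platelets"],
    "platelets": ["aspirin", "thrombosis"],
    "thrombosis": ["platelets", "stroke"],
    "inflammation": ["aspirin", "cytokines"],
    "stroke": ["thrombosis"],
    "cytokines": ["inflammation"],
}

def semantic_path(src, dst):
    prev, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:    # walk back through predecessors
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nbr in edges.get(node, []):
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)
    return None

print(semantic_path("aspirin", "stroke"))
# ['aspirin', 'platelets', 'thrombosis', 'stroke']
```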
Public data can be considered a large and important source of data that can be used for different purposes. In this paper we present a method for collecting and analyzing data within urban settlements. For a more focused analysis and the gathering of a large amount of data, we considered a case study of Bucharest. The main purpose of this analysis is to pick up important information about different streets, points of interest, details about urban planning, etc., with the goal of facilitating a quick and correct evaluation of specific areas and identifying suitable locations for adding new points of interest. The prediction of suitable locations involves heuristics and data mining techniques such as clustering algorithms and association rules.
These slides present an application for identifying archaisms and neologisms in texts. The application also provides the ability to view graphically the evolution trends of these words, for a better interpretation of the results. The presented solution consists of two phases: a learning phase, in which we identify the general evolution trends of three categories of words (archaisms, neologisms, and common words), and a classification phase, in which we label new words with their corresponding category. For both phases, the application requires Internet access because it uses the Google Books N-gram Viewer to generate the images that back up the decisions.
These slides present an automatic system for the evaluation of the Bachelor's and Master's theses of Computer Science students. To fulfill this task, we used text complexity measures along with other factors. Text complexity has mainly been used to predict the grade level to which a specific reading passage or text should be assigned, and it has also been used in evaluating students' writing in language classes. We decided to try text complexity measures for evaluating students' graduation theses. The main challenges of this task are to select the features that best reflect a student's performance in a specific domain, and to identify the optimal classifier to predict the student's score. Firstly, we investigated four sets of text complexity measures (lexical, syntactic, semantic, and character measures), cohesion metrics, and a couple of features related to the thesis organization and to the references and bibliography. Secondly, we computed the correlation between the proposed features and excluded the highly inter-correlated ones. After that, we used several classifiers to predict the students' grade levels and compared their performances. Finally, we tested our work on a corpus of Bachelor's and Master's theses written in English by students of the Computer Science Department of the University Politehnica of Bucharest (as English has a high availability of open-source natural language processing tools). We evaluated the quality of the presented application using Pearson's rank correlation to compare our results with the grades assigned to the theses by the evaluation committee.
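A minimal sketch of two of the steps above follows: dropping one of each pair of highly inter-correlated features, then checking each remaining feature's Pearson correlation with the grades. The feature values, grades, and 0.9 cutoff are invented for illustration.

```python
# Feature screening by inter-correlation, then correlation with grades.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
grades = rng.uniform(5, 10, 40)                       # committee grades
features = {
    "lexical_diversity": grades * 0.08 + rng.normal(0, 0.1, 40),
    "avg_sentence_len": rng.normal(20, 4, 40),
}
# word_count is deliberately redundant with avg_sentence_len.
features["word_count"] = features["avg_sentence_len"] * 110 + rng.normal(0, 5, 40)

# Step 1: exclude features that are highly inter-correlated (|r| > 0.9).
kept = []
for name in features:
    if all(abs(pearsonr(features[name], features[k])[0]) <= 0.9 for k in kept):
        kept.append(name)

# Step 2: correlate each remaining feature with the grades.
for name in kept:
    r, p = pearsonr(features[name], grades)
    print(f"{name}: r={r:.2f} (p={p:.3f})")
```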
Every country has its own topics of interest and its hot topics at different moments in time. In this paper we present a system that helps to understand and compare different countries, starting from the topics debated by their members. To do that, we recorded and analyzed the content of messages sent on Twitter by people living in several countries, hoping to capture the topics of interest for each culture and to predict its hot topics. We performed our analysis on English-written tweets only, based on the fact that English has become a global language, spoken even by Internet users from non-English-speaking countries when they want to share their thoughts with a global audience. Our study tries to capture topic models both for the tweets and for the URLs shared in them; we then compare the distribution of topics across countries for both tweets and URLs to check how consistent these models are. For the topic modelling task, we designed a specialized way of building the models that is adapted to tweets (which, at a maximum of 140 characters, are too short for classical topic modelling methods). Our system was tested on a corpus of English tweets, collected using the Twitter streaming API, that have a location attached and also contain a URL. To eliminate bias, we extracted tweets without any restrictions (including tweets written in other languages, tweets without URLs, and tweets without an attached location) and then checked the percentage of our targeted tweets for each country. As a consequence, we extended the tweet-collection period to decrease the risk of dealing with abnormal events occurring in a certain country.
These slides present a text segmentation system based on the sentiments expressed in the text. The system takes as input plain text (a product review, for instance) and uses two different resources for tagging the sentiment words: a sentiment-word dictionary and SentiWordNet. Once the sentiment words are identified, the initial text is annotated with segmentation markers wherever the polarity shifts. The system also outputs the counts of positive and negative sentiment words found in the text and can optionally annotate them with their valence.
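A hedged sketch of the segmentation rule follows: walk through the text, tag words found in a small sentiment lexicon, and emit a marker whenever the polarity flips. The tiny lexicon stands in for the dictionary and SentiWordNet resources the system actually uses.

```python
# Insert a segment marker at every polarity shift; also count polar words.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"terrible", "hate", "awful", "bad"}

def segment(text, marker="||"):
    out, prev = [], 0                      # prev polarity: +1, -1, or 0 (none yet)
    pos_count = neg_count = 0
    for word in text.split():
        w = word.strip(".,!?").lower()
        cur = 1 if w in POSITIVE else -1 if w in NEGATIVE else 0
        if cur > 0:
            pos_count += 1
        elif cur < 0:
            neg_count += 1
        if cur and prev and cur != prev:   # polarity shift -> new segment
            out.append(marker)
        if cur:
            prev = cur
        out.append(word)
    return " ".join(out), pos_count, neg_count

review = "The screen is excellent and I love the colors, but the battery is awful."
print(segment(review))
```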
In these slides we present a model that was intended to discriminate creative from non-creative news articles. In order to build the classifier, we have combined nine different measures using a stepwise logistic regression model. The obtained model was tested in two experiments: the first one tried to discriminate between news articles about the US 2012 Elections from different newspapers versus articles taken from The Onion (a website providing satiric news) on the same subject, while the second one evaluated the capacity of the model to generalize over different topics and text genres. The experiments showed that the system achieves 80% accuracy, but the lack of true positives from the second experiment raised the question of whether we really identified creativity or in fact we detected satire (as the assumption for the training corpus was that the satiric news from The Onion were also creative).
The document presents a methodology for automatically assessing participants in chat conversations used for computer-supported collaborative learning (CSCL). It uses natural language processing techniques and heuristics to evaluate conversations based on participants' involvement, knowledge, and innovation. The heuristics were tested on a corpus of 7 chat conversations involving 35 students discussing web collaboration technologies. Correlations between the heuristic evaluations and expert human evaluations were generally high, particularly for involvement and innovation. The knowledge heuristic was less reliable. The methodology can help identify effective participation criteria and rank learners and conversations.
In this poster paper we propose a new method for identifying creativity that is based on analyzing a corpus of chat conversations on the same topic and extracting the new ideas expressed by participants. The application is a first step in supporting creativity in online group discussions by highlighting the novel concepts present in conversations (new ideas) and also by identifying topics that could have become important, if not forgotten during the debates (lost ideas).
The main objective of this paper is to compare the sentiments that prevailed before and after the presidential elections held in both the US and France in 2012. To achieve this objective, we extracted content from a social medium, Twitter, and used the tweets from electoral candidates and public users (voters), collected by crawling during the course of the elections. To gain useful insights about the US elections, we scored the sentiments for each tweet using different metrics and performed a time series analysis for candidates and different topics (identified by specific keywords). In addition, we compared some of our insights from the US election with what we observed for the French election. This deep-dive analysis was done in order to understand the inherent nature of elections and to bring out the influence of social media on them.
Phenomics assisted breeding in crop improvementIshaGoswami9
The global population is increasing and will reach about 9 billion by 2050; together with climate change, this makes it difficult to meet the food requirements of such a large population. Facing the challenges presented by resource shortages, climate change, and an increasing global population, crop yield and quality need to be improved in a sustainable way over the coming decades. Genetic improvement by breeding is the best way to increase crop productivity. With the rapid progress of functional genomics, an increasing number of crop genomes have been sequenced and dozens of genes influencing key agronomic traits have been identified. However, current genome sequence information has not been adequately exploited for understanding the complex characteristics of multi-gene traits, owing to a lack of crop phenotypic data. Efficient, automatic, and accurate technologies and platforms that can capture phenotypic data linkable to genomics information for crop improvement at all growth stages have become as important as genotyping. Thus, high-throughput phenotyping has become the major bottleneck restricting crop breeding. Plant phenomics has been defined as the high-throughput, accurate acquisition and analysis of multi-dimensional phenotypes during crop growing stages at the organism level, including the cell, tissue, organ, individual plant, plot, and field levels. With the rapid development of novel sensors, imaging technology, and analysis methods, numerous infrastructure platforms have been developed for phenotyping.
Authoring a personal GPT for your research and practice: How we created the Q...Leonel Morgado
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done in teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis, and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on portion, participants will try out QUAL-E and develop ideas for their own qualitative coding ChatGPT. Participants who have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and a slide deck that participants will be able to utilize to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, just for trying out personal GPTs during it.
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz), I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long-standing, and ongoing, scientific development as an exemplar. And so, I chose the ever-evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech to aerospace and cosmology. I can think of no better story to illustrate the breadth of scientific methodologies and applications at their best.
ESPP presentation to EU Waste Water Network, 4th June 2024: “EU policies driving nutrient removal and recycling and the revised UWWTD (Urban Waste Water Treatment Directive)”
Or: Beyond linear.
Abstract: Equivariant neural networks are neural networks that incorporate symmetries. The nonlinear activation functions in these networks result in interesting nonlinear equivariant maps between simple representations, and motivate the key player of this talk: piecewise linear representation theory.
Disclaimer: No one is perfect, so please mind that there might be mistakes and typos.
dtubbenhauer@gmail.com
Corrected slides: dtubbenhauer.com/talks.html
Immersive Learning That Works: Research Grounding and Paths ForwardLeonel Morgado
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a 'Uses, Practices & Strategies' model operationalized by the 'Immersive Learning Brain' and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences, spotlighting research frontiers along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Leonel Morgado
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersive learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxMAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation, makes them the most convenient, least labor-intensive live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poor-quality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larvae. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
The binding of cosmological structures by massless topological defectsSérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or a modified gravity theory is mitigated, at least in part.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
Basics of crystallography, crystal systems, classes and different forms
Malapropisms detection and correction presentation
1. Malapropisms Detection and Correction Using a Paronyms Dictionary, a Search Engine and WordNet
Universitatea Politehnica București, Faculty of Automatic Control and Computers, Computer Science Department
Author / Scientific supervisor:
Costin-Gabriel Chiru - costin.chiru@cs.pub.ro
Valentin Cojocaru
Traian Rebedea
Ştefan Trăuşan-Matu
2. Contents
• Introduction
• Used tools
• Application architecture
– Malapropisms detection
– Malapropisms correction
• Walkthrough example
• Experiments and results
• Conclusions and further development
3. Introduction
• Purpose: detection and correction of malapropos words (unintentional misuse of a word by confusion with another one).
• Methodology: evaluate the local cohesion of a text in order to identify the possible malapropisms, then use the whole-text coherence, evaluated in terms of lexical chains built using the linguistic ontology, in order to correct them.
4. Tools
• Google search engine – to estimate the probability of co-appearance of two words or blocks of words; used for the detection of malapropos words;
• A paronyms dictionary – to extract the possible replacements for the malapropos words;
• WordNet – to detect how closely related two words are; used for malapropisms correction.
6. Malapropisms Detection
• Responsible for detecting anomalies in the local text cohesion – using Google.
• Two chunks of text are sent to Google:
– The number of hits for the 1st chunk (no_pages1);
– The number of hits for the 2nd chunk (no_pages2);
– The number of hits for the co-occurrence of the two chunks – the 2nd chunk right after the 1st one (no_combined).
• Based on the mutual information inequality, it evaluates whether their co-appearance is statistically correct.
Why chunks?
7. Malapropisms Detection (2)
• Content words are rarely adjacent; to check whether the local text cohesion is damaged, we also need the functional words that connect them.
• A chunker decomposes the phrase into chunks, which are then sequentially evaluated using Google.
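As a rough illustration of this step, here is a minimal Python sketch. The get_hit_count function is a hypothetical stand-in for querying a search engine for exact-phrase page counts (the original system used Google); the statistical check itself is sketched after the filters below.

```python
# Minimal sketch of the hit-count queries used for local-cohesion checking.
# get_hit_count() is a hypothetical placeholder: any search-engine or
# web-scale n-gram API returning exact-phrase page counts could stand in.

def get_hit_count(phrase: str) -> int:
    """Hypothetical: number of indexed pages containing the exact phrase."""
    raise NotImplementedError("plug in a search / n-gram API here")

def chunk_counts(chunk1: str, chunk2: str):
    no_pages1 = get_hit_count(chunk1)                  # hits for the 1st chunk
    no_pages2 = get_hit_count(chunk2)                  # hits for the 2nd chunk
    no_combined = get_hit_count(f"{chunk1} {chunk2}")  # 2nd chunk right after the 1st
    return no_pages1, no_pages2, no_combined
```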
8. Malapropisms Detection - Filters
• Cohesion evaluation is done based on six progressive filters.
• The assumptions behind these six filters are:
– The fewer hits for the co-occurrence of the two chunks, the greater the probability of a malapropism;
– The more pages for the individual chunks – given the same number of co-occurrences of the two chunks – the greater the probability of a malapropism.
9. Malapropisms Detection - Filters (2)
• 1st filter – no_combined has a very small value (less than 20) – signals a possible malapropism; used to eliminate noise.
• For the next five filters, a possible malapropism is signaled if the following formula is true:
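(The formula itself appeared as an image in the original slides and is not recoverable verbatim. Based on the mutual-information framing above, and the later note that pages denotes the number of indexed pages written in the language, a plausible reconstruction is:

\[ \frac{no\_combined}{pages} \;<\; \beta \cdot \frac{no\_pages_1}{pages} \cdot \frac{no\_pages_2}{pages} \]

that is, a malapropism is signaled when the observed co-occurrence frequency of the two chunks falls below beta times the frequency expected if the chunks were independent.)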
10. Malapropisms Detection - Filters (3)
Filter selection by hit-count range (recovered from the slide's threshold diagram):
• 20 – 500: 2nd filter, beta = 1.05 – higher permission;
• 500 – 12000: 3rd filter, beta = 1 – normal permission (most often used!);
• 12000 – 14000: 4th filter, beta = .95 – smaller permission;
• 14000 – 15000: 5th filter, beta = .9 – even smaller permission;
• 15000 – 16000: 6th filter, beta = .8 – much smaller permission;
• 16000 +: 7th filter – the formula is not used anymore and no malapropism is signaled!
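A compact Python sketch of how these thresholds and beta values might drive the check. The slide's diagram does not label which count the ranges apply to, so keying them off no_combined is an assumption here, as is the inequality reconstructed from slide 9.

```python
# Sketch of the progressive filters (assumption: ranges key off no_combined).
# Thresholds and beta values are taken from the slides above.
FILTERS = [
    (500, 1.05),    # 2nd filter: higher permission
    (12000, 1.0),   # 3rd filter: normal permission (most often used)
    (14000, 0.95),  # 4th filter: smaller permission
    (15000, 0.9),   # 5th filter: even smaller permission
    (16000, 0.8),   # 6th filter: much smaller permission
]

def is_possible_malapropism(no_pages1, no_pages2, no_combined, pages):
    if no_combined < 20:        # 1st filter: very few co-occurrences -> suspect
        return True
    for upper_bound, beta in FILTERS:
        if no_combined < upper_bound:
            # Reconstructed mutual-information inequality (see slide 9):
            # observed co-occurrence frequency vs. expectation under independence.
            return no_combined / pages < beta * (no_pages1 / pages) * (no_pages2 / pages)
    return False                # 7th filter (16000+): formula not used, never signal
```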
11. Malapropisms Detection - Final Remarks (1)
• The filters depend on:
– Thresholds (20, 500, 12k, 14k, 15k, 16k) and
– Beta – the coefficient for the co-occurrence of the two chunks (1.05, 1, .95, .9, .8).
• These values have been empirically determined and they are:
– Language dependent – the number of hits is different for each language;
– Time dependent – the web is continuously growing;
– Text independent – no feature of the text has been considered.
12. Malapropisms Detection - Final Remarks (2)
• The purpose of this module is to limit as much as possible the number of misses in malapropism detection.
• The module also signals a lot of fake malapropisms, but these will be evaluated in the next module and some of them will be ignored.
13. Malapropisms Correction
• Purposes:
– Identify and eliminate the false alarms;
– Detect the most probable candidates for the remaining malapropisms and correct them.
• Uses all three technologies.
• Works sequentially – analyzes every pair of two chunks of words and decides whether a malapropism or a false alarm has been found.
14. Malapropisms Correction - Methodology
• Correction is done in three stages:
– The replacement candidates that ensure the local cohesion are identified using the paronyms dictionary;
– These words are filtered against the local context, using the search engine in the same manner as for detection;
– The replacement word is chosen from the remaining words, based on the text logic (represented by lexical chains), so that the whole-text coherence is maintained.
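The three stages map naturally onto a small pipeline. The sketch below is an illustration only: get_paronyms, passes_local_context, and fits_lexical_chain are hypothetical helpers standing in for the paronyms-dictionary lookup, the search-engine filter from the detection module, and the WordNet lexical-chain test, respectively.

```python
def correct_malapropism(suspect_word, left_chunk, right_chunk, lexical_chains):
    """Three-stage correction as described above (all helpers hypothetical)."""
    # Stage 1: candidate replacements from the paronyms dictionary.
    candidates = get_paronyms(suspect_word)
    # Stage 2: keep only candidates that restore local cohesion
    # (search-engine hit-count test, as in the detection module).
    candidates = [w for w in candidates
                  if passes_local_context(w, left_chunk, right_chunk)]
    # Stage 3: pick the candidate that fits the text's lexical chains (WordNet),
    # so that the whole-text coherence is maintained.
    for word in candidates:
        if fits_lexical_chain(word, lexical_chains):
            return word   # real malapropism: return the correction
    return None           # no fitting candidate: treat the alarm as false
```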
17. Malapropisms Correction - Possible Situations (3)
• A malapropisms chain: multiple consecutive chunks signaled as possible malapropisms.
• First, try to correct only one of them – the one that corrects both malapropisms (the 2 chunks are corrected together) – figure a;
• If this is impossible, each malapropism is treated separately in order to correct both – figure b;
• If still impossible, we correct only one of them.
19. Walkthrough Example (1)
• I am travelling around the word [world].
• Chunker: I; am travelling; around the word.
• Google: “I am travelling” – 1.6 million hits; “am travelling around the word” – 3 hits.
– The first combination is considered to be correct, while the second signals a possible malapropism.
• Paronyms dictionary: word – cord, ford, lord, sword, ward, wyrd, woad, wold, wood, wordy, work, worm, worn, wort, world.
20. Walkthrough Example (2)
• Google again: “word” is replaced by each of its paronyms and the number of hits for every combination “am travelling around the <paronym>” is retrieved.
• Filters: the only combination that passes the filters is “am travelling around the world”, which has 4120 hits – it passes the 3rd filter (beta = 1).
• WordNet: it is verified that world is part of a lexical chain that starts from travelling.
• A malapropism is signalled and the corrected form is given: “I am travelling around the world.”
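Tying the walkthrough back to the earlier sketches, the candidate scoring could look roughly like this, again using the hypothetical get_hit_count; the hit values and outcome quoted in the comments are the ones reported on the slide.

```python
# Re-running the walkthrough with the sketched helpers.
paronyms = ["cord", "ford", "lord", "sword", "ward", "wyrd", "woad",
            "wold", "wood", "wordy", "work", "worm", "worn", "wort", "world"]

hits = {p: get_hit_count(f"am travelling around the {p}") for p in paronyms}
# Per the slides, "am travelling around the world" (4120 hits) is the only
# combination that passes the filters (3rd filter, beta = 1); WordNet then
# confirms that "world" belongs to a lexical chain starting from "travelling",
# so the malapropism is signaled and corrected.
best = max(hits, key=hits.get)  # illustration only: the real system applies
                                # the progressive filters, not a simple arg-max
```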
21. Experiments
• 3 types of corpora have been used for testing:
– 1st corpus – built from individual phrases containing malapropisms;
– 2nd corpus – contained no malapropisms at all;
– 3rd corpus – consisted of parts of texts published on the Internet (parts of some Fox News articles), modified to introduce malapropisms as suggested by (Hirst and St-Onge, 1998) and (Hirst and Budanitsky, 2005).
22. Results (1)
• 1st corpus:
– 27 out of the 31 examples were correctly detected (87.05%) and
– 25 of them were properly corrected (80.64%).
• 2nd corpus (587 words):
– 1 false alarm was inserted (.17%)
• Due to the POS tagger wrongfully identifying “while” as a noun; the application replaced it with the more probable “white”.
23. Results (2)
• 3rd corpus:
– Smaller text (199 words, 1 malapropism):
• corrected the malapropism, but introduced a false alarm (.5%) – it seems we underestimated the false alarm rate.
– Larger text (2083 words, 25 malapropisms):
• 21 malapropisms have been detected (84%);
• 17 malapropisms have been corrected (68%);
• 10 false alarms were introduced (.48%)
– 6 of these were in the vicinity of a proper noun (e.g. Iran was replaced by Iraq, the two countries having similar contexts).
24. Conclusions
• Our approach:
– Combines three technologies (WordNet, Google, a paronyms dictionary);
– Uses thresholds that do not depend on the analyzed texts;
– Uses chunks of text in order to capture the local cohesion of texts;
– Is fully automated.
25. Limitations
• Limitations:
– The application has problems with the proper nouns, numbers, and metaphors found in the analyzed texts;
– The WordNet structure and the accuracy of the lexical chains construction;
– The paronyms dictionary (at the moment only first-level paronyms are used).
26. Possible Improvements
• Possible improvements:
– Construct the phrases' syntactic trees in order to consider the dependencies between the chunks of text, instead of evaluating them sequentially;
– Evaluate whether the empirically chosen thresholds hold for any language by verifying them on a different language;
– Multi-threading.
POS tagger – Qtag. The dictionary has 77,503 words, 22,020 of them (28.4%) having at least one first-level paronym.
The pages parameter from the formula above represents the number of indexed pages written in the language used.
Every paronym replaces the malapropos word and the local cohesion of the phrase is tested considering the next/previous chunk of text. If the new word fits better, it is then tested whether it fits in one of the lexical chains of the text. If so, it becomes the replacement candidate and the malapropism is signaled as a real one.
Here, the local cohesion of the phrase is tested considering both the next and the previous chunks of text. If the candidate fits with only one chunk, then it is marked as a possible replacement, but the malapropism is not yet marked as being real, nor is it ignored.
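For the lexical-chain membership test mentioned in these notes, a crude approximation in Python using NLTK's WordNet interface might look like the following; the original system's lexical-chain construction is more elaborate, so treat this purely as an illustrative relatedness check.

```python
# Requires the WordNet corpus: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def related_in_wordnet(word_a: str, word_b: str, threshold: float = 0.1) -> bool:
    """Crude stand-in for the lexical-chain membership test: checks whether
    any same-POS pair of synsets of the two words is sufficiently similar."""
    best = 0.0
    for sa in wn.synsets(word_a):
        for sb in wn.synsets(word_b):
            if sa.pos() == sb.pos():            # path similarity needs matching POS
                sim = sa.path_similarity(sb)
                if sim is not None:
                    best = max(best, sim)
    return best >= threshold

# e.g. related_in_wordnet("world", "travel") is expected to pass with a low
# threshold, mirroring the walkthrough example above.
```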
A small one – 199 words – and a larger one – 2083 words.