IJRET: International Journal of Research in Engineering and Technology is an international, peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
The document discusses text normalization, which involves segmenting and standardizing text for natural language processing. It describes tokenizing text into words and sentences, lemmatizing words into their root forms, and standardizing formats. Tokenization involves separating punctuation, normalizing word formats, and segmenting sentences. Lemmatization determines that words have the same root despite surface differences. Sentence segmentation identifies sentence boundaries, which can be ambiguous without context. Overall, text normalization prepares raw text for further natural language analysis.
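The tokenization, normalization and sentence-segmentation steps described above can be sketched with naive rules (an illustrative sketch only, not the document's actual implementation; real segmenters need context to handle abbreviations like "Dr."):

```python
import re

def tokenize(text):
    # Separate punctuation from words: runs of word characters,
    # or single non-space punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(tokens):
    # Lowercasing as a simple stand-in for full word-format normalization.
    return [t.lower() for t in tokens]

def segment_sentences(text):
    # Naive boundary detection: split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(tokenize("Hello, world!"))            # ['Hello', ',', 'world', '!']
print(segment_sentences("One. Two! Three?"))  # ['One.', 'Two!', 'Three?']
```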
This document discusses methods for evaluating language models, including intrinsic and extrinsic evaluation. Intrinsic evaluation involves measuring a model's performance on a test set using metrics like perplexity, which is based on how well the model predicts the test set. Extrinsic evaluation embeds the model in an application and measures the application's performance. The document also covers techniques for dealing with unknown words like replacing low-frequency words with <UNK> and estimating its probability from training data.
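Perplexity, the intrinsic metric mentioned above, is the inverse probability of the test set normalized by the number of tokens; a minimal sketch of the computation (assuming the model's per-token probabilities are already available):

```python
import math

def perplexity(model_probs):
    # model_probs: the probability the model assigned to each token in
    # the test set. Perplexity = exp(-mean log-probability); a model
    # that predicts the test set better gets a lower perplexity.
    n = len(model_probs)
    return math.exp(-sum(math.log(p) for p in model_probs) / n)

print(perplexity([0.25] * 4))  # 4.0 — uniform over 4 outcomes
```

A uniform model over k outcomes has perplexity exactly k, which is why perplexity is often read as a "branching factor".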
Machine Translation (MT) refers to the use of computers for the task of translating
automatically from one language to another. The differences between languages and
especially the inherent ambiguity of language make MT a very difficult problem. Traditional
approaches to MT have relied on humans supplying linguistic knowledge in the form of rules
to transform text in one language to another. Given the vastness of language, this is a highly
knowledge intensive task. Statistical MT is a radically different approach that automatically
acquires knowledge from large amounts of training data. This knowledge, which is typically
in the form of probabilities of various language features, is used to guide the translation
process. This report provides an overview of MT techniques, and looks in detail at the basic
statistical model.
Finite-state morphological parsing uses finite-state transducers to parse words into their morphological components like stems and affixes. It requires a lexicon of stems and affixes, morphotactic rules describing valid morpheme combinations, and orthographic rules for spelling changes. The parser is built as a cascade of finite-state automata representing the lexicon, morphotactics and spelling rules. It maps surface word forms onto their underlying lexical representations including stems and morphological features. This allows morphological analysis of both regular and irregular forms.
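The cascade idea (lexicon, morphotactics, orthographic rules) can be caricatured in a few lines. This is a toy stand-in for a real transducer cascade, with a hypothetical lexicon and a single orthographic rule (undoing e-deletion before -ing/-ed):

```python
# Hypothetical stem lexicon and suffix-to-feature morphotactic rules.
LEXICON = {"walk", "fox", "move"}
SUFFIXES = {"s": "+Pl", "ing": "+Prog", "ed": "+Past"}

def parse(surface):
    # Map a surface form onto stem + morphological feature.
    for suffix, feature in SUFFIXES.items():
        if surface.endswith(suffix):
            stem = surface[: -len(suffix)]
            # Orthographic rule: the stem may have lost a final 'e'.
            for candidate in (stem, stem + "e"):
                if candidate in LEXICON:
                    return candidate + feature
    return surface if surface in LEXICON else None

print(parse("moving"))  # move+Prog — e-deletion undone
print(parse("walks"))   # walk+Pl
```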
New Quantitative Methodology for Identification of Drug Abuse Based on Featur... - Carrie Wang
This project developed a new quantitative methodology using feature-based context-free grammar to analyze discourse semantics from social media discussions in order to identify potential drug abuse. The methodology was able to parse YouTube comments about recreational cough syrup use and perform anaphora resolution. This computational representation of discourse contributes to understanding human language structure and has applications in public health monitoring and clinical research.
Implementation Of Syntax Parser For English Language Using Grammar Rules - IJERA Editor
For many years we have been using Chomsky's generative system of grammars, particularly context-free grammars (CFGs) and regular expressions (REs), to express the syntax of programming languages and protocols. Syntactic parsing works mainly with the syntactic structure of a sentence. 'Syntax' refers to the grammatical arrangement of words in a sentence and their relationships with other words. The main focus of syntactic analysis is to find the syntactic structure of a sentence, which is usually represented as a tree; identifying this structure is useful in determining the meaning of a sentence. Natural language processing processes data through lexical analysis, syntax analysis, semantic analysis, discourse processing and pragmatic analysis. This paper surveys various parsing methods. The algorithm in this paper splits English sentences into parts using a POS (Parts Of Speech) tagger, identifies the type of sentence (simple, complex, interrogative, factual, active, passive, etc.) and then parses these sentences using grammar rules of natural language. As natural language processing becomes increasingly relevant, there is a need for treebanks catered to the specific needs of more individualized systems. Here, we present an open-source technique to check and correct grammar. The methodology gives appropriate grammatical suggestions.
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ... - Waqas Tariq
Words of a sentence do not follow the same ordering in different languages. This paper proposes certain Parts-of-Speech (POS) based rules for reordering a given English sentence to obtain its Telugu translation. Added rules for adverbs and exceptional conjunctions, together with improved handling of inflections, enable the system to achieve more accurate translation. The proposed rules, along with the existing system, gave a score of 0.6190 on the BLEU evaluation metric when translating sentences from English to Telugu. The paper handles simple sentence forms particularly well.
This document provides an overview of constraint satisfaction problems (CSPs). It defines a CSP as a problem where variables must be assigned values from their domains to satisfy constraints. Examples of CSPs include the n-queens puzzle, map coloring, Boolean satisfiability, and cryptarithmetic problems. A CSP is represented as a constraint graph with nodes as variables and edges as binary constraints. The goal is to assign values to each variable to satisfy all constraints.
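A minimal backtracking search over a constraint graph makes the CSP definition concrete. This sketch solves a three-region map-coloring instance (region names and adjacency are illustrative, not from the document):

```python
def solve_csp(variables, domains, neighbors, assignment=None):
    # Backtracking search: extend a partial assignment one variable at a
    # time, keeping every binary constraint on the graph's edges satisfied.
    assignment = dict(assignment or {})
    if len(assignment) == len(variables):
        return assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        # Constraint: adjacent variables must take different values.
        if all(assignment.get(n) != value for n in neighbors[var]):
            result = solve_csp(variables, domains, neighbors,
                               {**assignment, var: value})
            if result:
                return result
    return None  # dead end: backtrack

regions = ["WA", "NT", "SA"]  # three mutually adjacent regions
doms = {r: ["red", "green", "blue"] for r in regions}
adj = {"WA": ["NT", "SA"], "NT": ["WA", "SA"], "SA": ["WA", "NT"]}
print(solve_csp(regions, doms, adj))
```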
Natural Language Processing (NLP) involves using computational techniques to analyze and understand human languages. Key techniques in NLP include sentiment analysis to classify emotions in text, text classification to categorize text into predefined tags or categories, and tokenization which breaks text into discrete words and punctuation. NLP is used to teach machines how to read and understand human languages by identifying relationships between words and entities. Other areas of NLP include parts of speech tagging, constituent structure analysis, and analysis of pronunciation, morphology, syntax, semantics, and pragmatics.
Arabic morphology encapsulates many valuable features such as a word's root. Arabic roots are being utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivative languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots can be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence model: it captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, at least a 9.4% improvement over other stemmers.
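The general idea behind smoothed PMI can be sketched as PMI over co-occurrence counts with add-k smoothing (a sketch of the generic technique only; the paper's exact SPMI formulation may differ):

```python
import math
from collections import Counter

def spmi(pairs, k=1.0):
    # pairs: observed (word, context) co-occurrences.
    # Add-k smoothing keeps unseen pairs from scoring -infinity.
    pair_c = Counter(pairs)
    word_c = Counter(w for w, _ in pairs)
    ctx_c = Counter(c for _, c in pairs)
    total = len(pairs)
    vocab = len(word_c) * len(ctx_c)

    def score(w, c):
        p_wc = (pair_c[(w, c)] + k) / (total + k * vocab)
        p_w = (word_c[w] + k) / (total + k * len(word_c))
        p_c = (ctx_c[c] + k) / (total + k * len(ctx_c))
        return math.log(p_wc / (p_w * p_c))

    return score

score = spmi([("cat", "purr"), ("cat", "purr"), ("dog", "bark")])
# Frequently co-occurring pairs score higher than unseen combinations.
```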
International Journal of Engineering and Science Invention (IJESI) - inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of Engineering, Science and Technology, including new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance and readability. The articles published in our journal can be accessed online.
Artificial intelligence and first order logic - parsa rafiq
The document discusses knowledge representation and first order logic. It defines knowledge representation as how knowledge is encoded in artificial systems. It discusses representing objects, events, performance, meta-knowledge and facts. It also discusses types of knowledge like meta knowledge, heuristic knowledge, procedural knowledge and declarative knowledge. The document then discusses first order logic syntax including logical symbols, terms, formulas, quantifiers and predicates. It also discusses semantics and the uses and history of first order logic.
This document discusses natural language processing (NLP) from a developer's perspective. It provides an overview of common NLP tasks like spam detection, machine translation, question answering, and summarization. It then discusses some of the challenges in NLP like ambiguity and new forms of written language. The document goes on to explain probabilistic models and language models that are used to complete sentences and rearrange phrases based on probabilities. It also covers text processing techniques like tokenization, regular expressions, and more. Finally, it discusses spelling correction techniques using noisy channel models and confusion matrices.
The document discusses parts-of-speech (POS) tagging. It defines POS tagging as labeling each word in a sentence with its appropriate part of speech. It provides an example tagged sentence and discusses the challenges of POS tagging, including ambiguity and open/closed word classes. It also discusses common tag sets and stochastic POS tagging using hidden Markov models.
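The stochastic HMM tagging mentioned above decodes the most probable tag sequence with the Viterbi algorithm; a compact sketch (the tiny model probabilities below are illustrative, not from the document):

```python
def viterbi(words, tags, start, trans, emit):
    # Find the tag sequence maximizing P(tags) * P(words | tags).
    # start, trans, emit are probability dicts; a tiny floor stands in
    # for smoothing of unseen events.
    floor = 1e-12
    v = [{t: (start.get(t, floor) * emit.get((t, words[0]), floor), [t])
          for t in tags}]
    for w in words[1:]:
        layer = {}
        for t in tags:
            best_prev = max(tags,
                            key=lambda p: v[-1][p][0] * trans.get((p, t), floor))
            prob, path = v[-1][best_prev]
            layer[t] = (prob * trans.get((best_prev, t), floor)
                        * emit.get((t, w), floor),
                        path + [t])
        v.append(layer)
    return max(v[-1].values())[1]

tags = ["DT", "NN", "VB"]
start = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
trans = {("DT", "NN"): 0.9, ("NN", "VB"): 0.8}
emit = {("DT", "the"): 0.9, ("NN", "dog"): 0.8, ("VB", "barks"): 0.7}
print(viterbi(["the", "dog", "barks"], tags, start, trans, emit))
# ['DT', 'NN', 'VB']
```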
The MSR-NLP Chinese word segmentation system is part of a full sentence analyzer. It uses a dictionary and rules for basic segmentation, morphology, and named entity recognition to build a word lattice. The system proposes new words, prunes the lattice, and uses a parser to produce the final segmentation. It participated in four segmentation bakeoff tracks, ranking highly in each. An analysis found that parameter tuning, morphology/NER, and lattice pruning contributed most to performance, while the parser helped less. Problems included inconsistent annotations and differences in defining new words.
The document discusses two paradigms for natural language processing: knowledge engineering and machine learning. It provides examples of how each approach handles tasks like parsing, translation, and question formation. While knowledge engineering relies on hand-coded rules and representations, machine learning trains statistical models on large datasets. The document also notes Microsoft's interests in using NLP for applications like search and summarization.
Author Credits - Maaz Anwar Nomani
Semantic Role Labeler (SRL) is a semantic parser which can automatically identify and then classify the arguments of a verb in a natural language sentence, for Hindi and Urdu. For example, in the sentence “Sara won the competition because of her hard work.”, ‘won’ is the main verb and it has three arguments: ‘Sara’ (Agent), ‘hard work’ (Reason) and ‘competition’ (Theme). The problem statement of an SRL is how to make a machine identify and then classify the arguments of a verb in a natural language sentence.
Since there are two sub-problems here (identification and classification), our SRL has a pipeline architecture in which a binary classifier (Logistic Regression) is first trained to identify whether a word is an argument of a verb in a sentence (yes or no), and subsequently a multi-class classifier (SVM with a linear kernel) is trained to classify the arguments identified by the binary classifier into one of 20 classes. These 20 classes are the various notions present in a natural language sentence (e.g. Agent, Theme, Location, Time, Purpose, Reason, Cause, etc.). These ‘notions’ are called PropBank labels, or semantic labels, present in a Proposition Bank, which is a collection of hand-annotated sentences.
In essence, SRL facilitates Semantic Parsing, which essentially is the research investigation of identifying WHO did WHAT to WHOM, WHERE, HOW, WHY and WHEN in a natural language sentence.
Shallow parsing is a technique that divides text such as sentences into constituent parts and describes the syntactic relationships between those parts, but does not fully analyze internal structure or function. It aims to infer as much structure as possible from morphological and word order information. Typical modules include part-of-speech tagging, chunking of phrases, and relation finding between chunks. Shallow parsers are useful for processing large texts and are more robust to noise than deep parsers.
Multisets are very powerful and yet simple control mechanisms in regulated rewriting systems. In this paper, we review the main results on the generative power of multiset controlled grammars introduced in recent research. It was proven that multiset controlled grammars are at least as powerful as additive valence grammars and at most as powerful as matrix grammars. Here, we mainly investigate the closure properties of multiset controlled grammars. We show that the family of languages generated by multiset controlled grammars is closed under union, concatenation, Kleene star, homomorphism and mirror image.
Designing A Rule Based Stemming Algorithm for Kambaata Language Text - CSCJournals
This document describes the design and evaluation of a rule-based stemming algorithm for the Kambaata language. The algorithm uses a single-pass, longest-matching approach and is context-sensitive. It was evaluated on a test set of 2,425 words, correctly stemming 2,349 words (96.87%), overstemming 63 words (2.60%), and understemming 13 words (0.54%). The stemmer achieved a 65.86% reduction in dictionary size. Errors are due to the language's complex morphology involving suffixation, infixation, compounding, blending and reduplication. This represents the first stemming algorithm developed for Kambaata words.
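The reported figures are internally consistent (the three outcomes partition the test set); a quick arithmetic check:

```python
# Reported Kambaata stemmer evaluation counts on a 2,425-word test set.
total = 2425
correct, over, under = 2349, 63, 13

# The three outcomes account for every test word.
assert correct + over + under == total

rates = [round(100 * n / total, 2) for n in (correct, over, under)]
print(rates)  # [96.87, 2.6, 0.54] — matching the reported percentages
```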
This document provides an overview of mathematical logic and set theory concepts. It discusses topics such as mathematical logic and its subfields, basic set theory concepts like membership and subsets, set operations like union and intersection, and logical concepts like negation, conjunction, and syllogisms. It also explains logical form and provides examples of open and closed sentences as well as categorical and compound sentences.
Dr. Lotfi Ali Asker Zadeh is considered the father of fuzzy logic. In the 1960s and 1970s, he developed the concept of fuzzy sets and fuzzy logic to deal with imprecise data and approximations. Fuzzy logic uses membership values between 0 and 1 rather than binary logic of true and false. It allows partial truth values to model uncertainty. Fuzzy logic has been applied in areas like control systems, decision making, and pattern recognition to handle imprecise concepts.
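The contrast with binary logic is easy to see in code: a fuzzy set assigns each element a membership degree between 0 and 1. A common choice (an illustrative sketch; the "warm" set below is hypothetical) is the triangular membership function:

```python
def triangular(a, b, c):
    # Membership rises linearly from 0 at a to 1 at the peak b,
    # then falls back to 0 at c; all degrees lie in [0, 1].
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

warm = triangular(15, 22, 30)  # hypothetical "warm temperature" fuzzy set
print(warm(22))  # 1.0 — fully warm
print(warm(26))  # 0.5 — partially warm, a truth value binary logic lacks
```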
This document discusses several issues in part-of-speech (POS) tagging for Tamil language texts. It identifies challenges such as resolving lexical ambiguity, normalization of corpora, and analyzing compound words and multi-word expressions. It also discusses issues like determining the fineness versus coarseness of linguistic analysis, distinguishing syntactic function from lexical category, and whether to use new tags or tags from standard tagsets. Specific examples are provided to illustrate POS ambiguity depending on context, like the word "paṭi" which can be a noun, verb, adjective, adverb, or particle.
Chinese Grammar vs English Grammar in Universal Dependency - Jinho Choi
This document compares Chinese and English grammar within the framework of Universal Dependency (UD). It finds that while UD was originally developed using English grammar, many core dependency structures are similar between Chinese and English, such as subjects, objects, and coordination. However, some English-specific structures like expletives and prepositions/postpositions are handled differently in Chinese. The document aims to adapt UD to better fit Chinese by clarifying language-specific structures and building a Chinese UD treebank.
PRONOUN DISAMBIGUATION: WITH APPLICATION TO THE WINOGRAD SCHEMA CHALLENGE - kevig
A value-based approach to Natural Language Understanding, in particular, the disambiguation of
pronouns, is illustrated with a solution to a typical example from the Winograd Schema Challenge. The
worked example uses a language engine, Enguage, to support the articulation of the advocation and
fearing of violence. The example illustrates the indexical nature of pronouns, and how their values, their
referent objects, change because they are set by contextual data. It must be noted that Enguage is not a
suitable candidate for addressing the Winograd Schema Challenge as it is an interactive tool, whereas
the Challenge requires a preconfigured, unattended program.
Regular expressions provide a concise way to match patterns in text. They work by converting the regex into a state machine that can efficiently search a string to find matches. Important regex syntax includes quantifiers like *, +, ?, character classes like [a-z], and anchors like ^ and $. Regular expression engines turn the regex pattern into a program that can search strings. Thompson's NFA construction algorithm is commonly used to build the state machine from a regex for efficient matching.
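The syntax elements listed above combine naturally; a small example in Python (note that Python's `re` engine uses backtracking rather than Thompson's NFA construction, but the pattern syntax is the same):

```python
import re

# Anchors (^, $), a character class ([A-Za-z]), and a quantifier (*)
# combined: match identifiers that start with a letter.
pattern = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

print(bool(pattern.match("state_1")))  # True
print(bool(pattern.match("1state")))   # False — must not start with a digit
```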
Lecture: Regular Expressions and Regular Languages - Marina Santini
This document provides an introduction to regular expressions and regular languages. It defines the key operations used in regular expressions: union, concatenation, and Kleene star. It explains how regular expressions can be converted into finite state automata and vice versa. Examples of regular expressions are provided. The document also defines regular languages as those languages that can be accepted by a deterministic finite automaton. It introduces the pumping lemma as a way to determine if a language is not regular. Finally, it includes some practical activities for readers to practice converting regular expressions to automata and writing regular expressions.
Available transfer capability computations in the indian southern e.h.v power... - eSAT Publishing House
This document summarizes a research paper that compares three methods for computing available transfer capability (ATC) in the Indian southern extra high voltage power system: 1) the conventional continuation repeated power flow method, 2) a radial basis function neural network technique, and 3) an adaptive neuro fuzzy inference system technique. It provides background on ATC calculations and describes the radial basis function neural network and adaptive neuro fuzzy inference system techniques in detail. The paper aims to show that intelligent techniques can more accurately compute ATC online compared to conventional methods. It tests the three ATC computation methods on the 72-bus Indian southern power system.
Intelligent two axis dual-ccd image-servo shooting platform design - eSAT Publishing House
This document describes the design of an intelligent two-axis dual-CCD image-servo shooting platform. It uses two cameras and image processing techniques to dynamically track targets in 3D space. The system calculates the precise 3D spatial coordinate of the target based on pixel differences between the dual camera images. Experimental results showed the system could hit dynamic targets with precision of 5mm.
Natural Language Processing (NLP) involves using computational techniques to analyze and understand human languages. Key techniques in NLP include sentiment analysis to classify emotions in text, text classification to categorize text into predefined tags or categories, and tokenization which breaks text into discrete words and punctuation. NLP is used to teach machines how to read and understand human languages by identifying relationships between words and entities. Other areas of NLP include parts of speech tagging, constituent structure analysis, and analysis of pronunciation, morphology, syntax, semantics, and pragmatics.
Arabic morphology encapsulates many valuable features such as word’s root. Arabic roots are beingutilized for many tasks; the process of extracting a word’s root is referred to as stemming. Stemming is anessential part of most Natural Language Processing tasks, especially for derivative languages such asArabic. However, stemming is faced with the problem of ambiguity, where two or more roots could beextracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence
model. It captures the meaning of a word based on its context. In this paper, a distributional semantics
model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate itseffectiveness on the stemming analysis task. It showed an accuracy of 81.5%, with a at least 9.4%improvement over other stemmers.
International Journal of Engineering and Science Invention (IJESI)inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online
Artificial intelligence and first order logicparsa rafiq
The document discusses knowledge representation and first order logic. It defines knowledge representation as how knowledge is encoded in artificial systems. It discusses representing objects, events, performance, meta-knowledge and facts. It also discusses types of knowledge like meta knowledge, heuristic knowledge, procedural knowledge and declarative knowledge. The document then discusses first order logic syntax including logical symbols, terms, formulas, quantifiers and predicates. It also discusses semantics and the uses and history of first order logic.
This document discusses natural language processing (NLP) from a developer's perspective. It provides an overview of common NLP tasks like spam detection, machine translation, question answering, and summarization. It then discusses some of the challenges in NLP like ambiguity and new forms of written language. The document goes on to explain probabilistic models and language models that are used to complete sentences and rearrange phrases based on probabilities. It also covers text processing techniques like tokenization, regular expressions, and more. Finally, it discusses spelling correction techniques using noisy channel models and confusion matrices.
The document discusses parts-of-speech (POS) tagging. It defines POS tagging as labeling each word in a sentence with its appropriate part of speech. It provides an example tagged sentence and discusses the challenges of POS tagging, including ambiguity and open/closed word classes. It also discusses common tag sets and stochastic POS tagging using hidden Markov models.
The MSR-NLP Chinese word segmentation system is part of a full sentence analyzer. It uses a dictionary and rules for basic segmentation, morphology, and named entity recognition to build a word lattice. The system proposes new words, prunes the lattice, and uses a parser to produce the final segmentation. It participated in four segmentation bakeoff tracks, ranking highly in each. An analysis found that parameter tuning, morphology/NER, and lattice pruning contributed most to performance, while the parser helped less. Problems included inconsistent annotations and differences in defining new words.
The document discusses two paradigms for natural language processing: knowledge engineering and machine learning. It provides examples of how each approach handles tasks like parsing, translation, and question formation. While knowledge engineering relies on hand-coded rules and representations, machine learning trains statistical models on large datasets. The document also notes Microsoft's interests in using NLP for applications like search and summarization.
Author Credits - Maaz Anwar Nomani
A Semantic Role Labeler (SRL) is a semantic parser that can automatically identify and then classify the arguments of a verb in a natural language sentence, here for Hindi and Urdu. For example, in the sentence “Sara won the competition because of her hard work.”, ‘won’ is the main verb and it has three arguments: ‘Sara’ (Agent), ‘hard work’ (Reason) and ‘competition’ (Theme). The core problem of SRL is how to make a machine identify, and then classify, the arguments of a verb in a natural language sentence.
Since there are two sub-problems here (identification and classification), our SRL has a pipeline architecture: a binary classifier (logistic regression) is first trained to decide whether a word is an argument of a verb in a sentence (yes or no), and a multi-class classifier (SVM with a linear kernel) is then trained to classify the arguments identified by the binary classifier into one of 20 classes. These 20 classes are the various notions present in a natural language sentence (e.g. Agent, Theme, Location, Time, Purpose, Reason, Cause). These ‘notions’ are called PropBank labels, or semantic labels, from a Proposition Bank, a collection of hand-annotated sentences.
In essence, SRL facilitates semantic parsing, which is essentially the investigation of WHO did WHAT to WHOM, WHERE, HOW, WHY and WHEN in a natural language sentence.
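The two-stage pipeline above can be sketched in a few lines. This is a minimal, runnable illustration with synthetic features and labels, not the real SRL feature set: logistic regression handles the binary identification step, and a linear-kernel SVM labels only the words the first stage accepted.

```python
# Sketch of the two-stage SRL pipeline: logistic regression identifies
# arguments, a linear-kernel SVM classifies them. Features/labels are
# synthetic placeholders, not real Hindi/Urdu SRL data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # per-word feature vectors
is_arg = (X[:, 0] > 0).astype(int)          # stage-1 gold labels: argument or not
role = rng.integers(0, 3, size=200)         # stage-2 gold labels (3 toy roles)

identifier = LogisticRegression().fit(X, is_arg)                  # stage 1
classifier = LinearSVC().fit(X[is_arg == 1], role[is_arg == 1])   # stage 2

# At prediction time, stage 2 only labels the words stage 1 accepted.
accepted = identifier.predict(X) == 1
predicted_roles = classifier.predict(X[accepted])
```

The design choice the abstract describes, training the multi-class classifier only on identified arguments, is what the `is_arg == 1` mask implements.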
Shallow parsing is a technique that divides text such as sentences into constituent parts and describes the syntactic relationships between those parts, but does not fully analyze internal structure or function. It aims to infer as much structure as possible from morphological and word order information. Typical modules include part-of-speech tagging, chunking of phrases, and relation finding between chunks. Shallow parsers are useful for processing large texts and are more robust to noise than deep parsers.
Multisets are powerful yet simple control mechanisms in regulated rewriting systems. In this paper, we review the main results on the generative power of multiset controlled grammars introduced in recent research. It was proven that multiset controlled grammars are at least as powerful as additive valence grammars and at most as powerful as matrix grammars. Here we mainly investigate the closure properties of multiset controlled grammars, and show that the family of languages they generate is closed under union, concatenation, Kleene star, homomorphism and mirror image.
Designing A Rule Based Stemming Algorithm for Kambaata Language Text - CSCJournals
This document describes the design and evaluation of a rule-based stemming algorithm for the Kambaata language. The algorithm uses a single-pass, longest-matching approach and is context-sensitive. It was evaluated on a test set of 2,425 words, correctly stemming 2,349 words (96.87%), overstemming 63 words (2.60%), and understemming 13 words (0.54%). The stemmer achieved a 65.86% reduction in dictionary size. Errors are due to the language's complex morphology involving suffixation, infixation, compounding, blending and reduplication. This represents the first stemming algorithm developed for Kambaata words.
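The single-pass, longest-matching idea described above can be made concrete with a small sketch. The suffix list and the example word here are hypothetical stand-ins, not the actual Kambaata rule set:

```python
# Minimal longest-match suffix stripper in the spirit of the algorithm
# described above. SUFFIXES and the sample word are hypothetical, not
# the real Kambaata rules.
SUFFIXES = sorted(["anne", "ata", "aa", "u"], key=len, reverse=True)

def stem(word, min_stem=2):
    # Single pass: try suffixes from longest to shortest, strip the first
    # match, but never leave a stem shorter than min_stem characters.
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: -len(suf)]
    return word

print(stem("lokkata"))  # strips "ata" -> "lokk"
```

A real context-sensitive stemmer would additionally condition each rule on the characters around the suffix; this sketch only shows the longest-match control flow.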
This document provides an overview of mathematical logic and set theory concepts. It discusses topics such as mathematical logic and its subfields, basic set theory concepts like membership and subsets, set operations like union and intersection, and logical concepts like negation, conjunction, and syllogisms. It also explains logical form and provides examples of open and closed sentences as well as categorical and compound sentences.
Dr. Lotfi Ali Asker Zadeh is considered the father of fuzzy logic. In the 1960s and 1970s, he developed the concept of fuzzy sets and fuzzy logic to deal with imprecise data and approximations. Fuzzy logic uses membership values between 0 and 1 rather than binary logic of true and false. It allows partial truth values to model uncertainty. Fuzzy logic has been applied in areas like control systems, decision making, and pattern recognition to handle imprecise concepts.
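The membership-value idea is easy to show with a standard triangular membership function (the numbers below are an illustrative "warm temperature" set, not from any particular system):

```python
def tri(x, a, b, c):
    """Triangular fuzzy membership: 0 at a and c, rising to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# A "warm" temperature set peaking at 25: 20 degrees is warm to degree 0.5,
# a partial truth value rather than binary true/false.
print(tri(20, 15, 25, 35))  # 0.5
```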
This document discusses several issues in part-of-speech (POS) tagging for Tamil language texts. It identifies challenges such as resolving lexical ambiguity, normalization of corpora, and analyzing compound words and multi-word expressions. It also discusses issues like determining the fineness versus coarseness of linguistic analysis, distinguishing syntactic function from lexical category, and whether to use new tags or tags from standard tagsets. Specific examples are provided to illustrate POS ambiguity depending on context, like the word "paṭi" which can be a noun, verb, adjective, adverb, or particle.
Chinese Grammar vs English Grammar in Universal Dependency - Jinho Choi
This document compares Chinese and English grammar within the framework of Universal Dependency (UD). It finds that while UD was originally developed using English grammar, many core dependency structures are similar between Chinese and English, such as subjects, objects, and coordination. However, some English-specific structures like expletives and prepositions/postpositions are handled differently in Chinese. The document aims to adapt UD to better fit Chinese by clarifying language-specific structures and building a Chinese UD treebank.
PRONOUN DISAMBIGUATION: WITH APPLICATION TO THE WINOGRAD SCHEMA CHALLENGE - kevig
A value-based approach to Natural Language Understanding, in particular, the disambiguation of
pronouns, is illustrated with a solution to a typical example from the Winograd Schema Challenge. The
worked example uses a language engine, Enguage, to support the articulation of the advocation and
fearing of violence. The example illustrates the indexical nature of pronouns, and how their values, their
referent objects, change because they are set by contextual data. It must be noted that Enguage is not a
suitable candidate for addressing the Winograd Schema Challenge as it is an interactive tool, whereas
the Challenge requires a preconfigured, unattended program.
Regular expressions provide a concise way to match patterns in text. They work by converting the regex into a state machine that can efficiently search a string to find matches. Important regex syntax includes quantifiers like *, +, ?, character classes like [a-z], and anchors like ^ and $. Regular expression engines turn the regex pattern into a program that can search strings. Thompson's NFA construction algorithm is commonly used to build the state machine from a regex for efficient matching.
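The syntax elements listed above (quantifiers, character classes, anchors) can be demonstrated directly with Python's standard `re` module:

```python
import re

# Quantifiers (*, +), a character class ([a-z]) and anchors (^, $) in action.
print(bool(re.search(r"ab*c", "ac")))          # True: * matches zero b's
print(bool(re.fullmatch(r"[a-z]+", "hello")))  # True: + means one or more
print(bool(re.search(r"^end$", "the end")))    # False: ^ anchors at the start
```

Under the hood, engines like the one Thompson's construction produces compile such patterns into a state machine before scanning the string, which is what makes matching efficient.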
Lecture: Regular Expressions and Regular Languages - Marina Santini
This document provides an introduction to regular expressions and regular languages. It defines the key operations used in regular expressions: union, concatenation, and Kleene star. It explains how regular expressions can be converted into finite state automata and vice versa. Examples of regular expressions are provided. The document also defines regular languages as those languages that can be accepted by a deterministic finite automaton. It introduces the pumping lemma as a way to determine if a language is not regular. Finally, it includes some practical activities for readers to practice converting regular expressions to automata and writing regular expressions.
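The equivalence between regular expressions and finite automata that the lecture covers can be made concrete with a hand-built DFA. This toy machine (states and transitions chosen here for illustration) accepts the language of the regular expression (ab)*:

```python
# A deterministic finite automaton for the regular language (ab)*.
# State "s" is both the start and the only accepting state.
DELTA = {("s", "a"): "t", ("t", "b"): "s"}

def accepts(word):
    state = "s"
    for ch in word:
        state = DELTA.get((state, ch))  # undefined transition -> reject
        if state is None:
            return False
    return state == "s"

print(accepts("abab"), accepts("aba"))  # True False
```

This is exactly the "regular language = language accepted by some DFA" definition from the lecture; converting the regex to this machine mechanically is what the regex-to-automaton exercises practice.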
Available transfer capability computations in the indian southern e.h.v power... - eSAT Publishing House
This document summarizes a research paper that compares three methods for computing available transfer capability (ATC) in the Indian southern extra high voltage power system: 1) the conventional continuation repeated power flow method, 2) a radial basis function neural network technique, and 3) an adaptive neuro fuzzy inference system technique. It provides background on ATC calculations and describes the radial basis function neural network and adaptive neuro fuzzy inference system techniques in detail. The paper aims to show that intelligent techniques can more accurately compute ATC online compared to conventional methods. It tests the three ATC computation methods on the 72-bus Indian southern power system.
Intelligent two axis dual-ccd image-servo shooting platform design - eSAT Publishing House
This document describes the design of an intelligent two-axis dual-CCD image-servo shooting platform. It uses two cameras and image processing techniques to dynamically track targets in 3D space. The system calculates the precise 3D spatial coordinate of the target based on pixel differences between the dual camera images. Experimental results showed the system could hit dynamic targets with precision of 5mm.
Embedded based sensorless control of pmbldc motor with voltage controlled pfc... - eSAT Publishing House
This document summarizes a research paper that proposes a sensorless control method for a permanent magnet brushless DC motor (PMBLDC) used in an air conditioner. The system includes a Cuk converter for power factor correction between the AC mains and the PMBLDC motor. Sensorless control is achieved by detecting the zero crossings of the back electromotive force (EMF) difference between motor phases rather than using position sensors. The system is designed, modeled and simulated in MATLAB. Results show the sensorless control method provides commutation signals comparable to hall sensors for controlling the inverter switches. The Cuk converter helps improve the power factor and total harmonic distortion drawn from the mains.
1. Pounding is a major cause of damage to adjacent buildings during earthquakes when they are constructed close together without sufficient separation.
2. The study analyzes seismic pounding forces between buildings of different heights and floor levels using software. It finds that pounding damage increases when buildings have different dynamic properties or are inadequately separated.
3. Pounding can cause non-structural to severe structural damage. The required minimum separation between buildings according to codes is 15-30 mm depending on building type, but may need to be larger.
Optimized mapping and navigation of remote area through an autonomous robot - eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Adaptive transmit diversity selection (ATDS) based on STBC and SFBC for 2x1 ... - eSAT Publishing House
This document discusses using GIS to assess topographical aspects for locating infrastructure facilities in hilly regions. It notes that traditional 2D maps and sketches used by engineers do not fully consider topography. The study develops a GIS-based methodology to analyze topographical factors and locate proposed facilities at a college campus in India as a case study. The objectives are to model the existing topography and facilities in 3D to identify suitable locations for new infrastructure that consider adverse conditions posed by the hilly landscape.
Lake sediment thickness estimation using ground penetrating radar - eSAT Publishing House
This document summarizes a study that used ground penetrating radar (GPR) to estimate the thickness and volume of sediments in Punem Lake, India. GPR profiles identified two sediment layers in the lake. The thickness of the first layer ranged from 0.02 m to 1.16 m with an average of 0.5 m, while the second layer ranged from 0.08 m to 0.99 m with an average of 0.37 m. Sediment volume was estimated at 260,303 cubic meters for the first layer and 188,171 cubic meters for the second layer. Thicker sediments were found toward the north, northeast, and east of the lake for the first layer and toward the north, northeast, and west for the second layer.
Wear behaviour of SiC reinforced Al6061 alloy metal matrix composites by usi... - eSAT Publishing House
Bit error rate analysis of MISO system in Rayleigh fading channel - eSAT Publishing House
Process design features of a 5 tonnes/day multi-stage, intermittent drainage... - eSAT Publishing House
This document summarizes the process design features of a 5 tonnes per day vegetable oil solvent extraction plant using n-hexane. It describes the 7-stage continuous full immersion extraction process. Material and energy balances were performed and the plant efficiency was determined to be 0.70. Process calculations found the energy requirement to be 23.17 kJ per kg of extracted oil, and the diffusivity of oils in the solvent averaged 4.0 × 10⁻⁹ m²/s. The mass transfer coefficient was calculated to be 3.2 × 10⁻⁵ kmol/(m²·s).
Optimization of energy use intensity in a design build framework - eSAT Publishing House
Rate adaptive resource allocation in OFDMA using Bees Algorithm - eSAT Publishing House
This document discusses using the Bees Algorithm to allocate resources in an OFDMA system. The Bees Algorithm is a bio-inspired optimization technique modeled after the foraging behavior of honey bees. It aims to maximize the data rate of each user while keeping power consumption below a threshold. The algorithm represents an alternative to other nature-inspired approaches that have been applied to this resource allocation problem, such as genetic algorithms and ant colony optimization. It is suggested that the Bees Algorithm may provide fast convergence for practical resource allocation in OFDMA systems.
1) The document discusses the Roucairol and Carvalho optimization approach for the Ricart-Agrawala distributed mutual exclusion algorithm.
2) The optimization allows a site to enter the critical section multiple times without re-requesting permission, reducing the message count to between 0 and 2(N-1) per critical-section entry.
3) However, this compromises fairness by allowing a site to monopolize the critical section, potentially causing starvation for other sites' requests.
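The message-count saving in point 2 can be illustrated with a toy single-process simulation (not a real distributed run; the class and counts below are illustrative): once a site holds a peer's permission, re-entering the critical section costs it no further messages until that peer itself requests entry.

```python
# Toy illustration of the Roucairol-Carvalho optimization: a site keeps the
# permissions it has received, so re-entering the critical section without an
# intervening request from a peer costs zero messages.
class Site:
    def __init__(self, n, me):
        self.has_perm = {j: False for j in range(n) if j != me}
        self.msgs = 0

    def enter_cs(self):
        for j, held in self.has_perm.items():
            if not held:                 # REQUEST + REPLY = 2 messages
                self.msgs += 2
                self.has_perm[j] = True

a = Site(3, me=0)
a.enter_cs()   # first entry: messages exchanged with both peers
a.enter_cs()   # re-entry: zero messages, permissions still held
print(a.msgs)  # 4
```

The fairness problem in point 3 is visible here too: nothing forces `a` to give the permissions back, so it can monopolize the critical section.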
This document discusses accelerating the seam carving algorithm for image resizing using CUDA (Compute Unified Device Architecture) on a GPU (graphics processing unit). Seam carving is a content-aware image resizing technique that identifies paths of least importance (seams) through an image that can be removed or inserted to change the image size. The document explains that seam carving involves large matrix calculations that can be significantly accelerated by implementing them in parallel on a CUDA-enabled GPU. It presents the seam carving algorithm, describes CUDA and GPU architecture, proposes implementing seam carving using CUDA to achieve speed-ups, and concludes that parallelizing seam carving calculations on a GPU exploits its massive parallelism for faster execution compared to a sequential CPU implementation.
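The "large matrix calculation" at the heart of seam carving is a dynamic-programming pass over the energy map. A sequential CPU sketch of that pass (the part a CUDA version would parallelize row by row) looks like this; the 3×3 energy matrix is a made-up example:

```python
import numpy as np

def min_seam(energy):
    """DP pass of seam carving: accumulate minimal energy top-to-bottom,
    then backtrack the lowest-energy vertical seam (one column per row)."""
    M = energy.astype(float).copy()
    rows, cols = M.shape
    for i in range(1, rows):
        for j in range(cols):
            lo, hi = max(j - 1, 0), min(j + 2, cols)
            M[i, j] += M[i - 1, lo:hi].min()   # cheapest of the 3 parents
    seam = [int(M[-1].argmin())]
    for i in range(rows - 2, -1, -1):
        j = seam[-1]
        lo, hi = max(j - 1, 0), min(j + 2, cols)
        seam.append(lo + int(M[i, lo:hi].argmin()))
    return seam[::-1]

e = np.array([[1, 9, 9], [9, 1, 9], [9, 9, 1]])
print(min_seam(e))  # [0, 1, 2]: the diagonal of low-energy pixels
```

In the CUDA formulation, all columns `j` of a row can be computed in parallel since each depends only on the previous row, which is exactly the parallelism the document proposes to exploit.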
This document discusses various topics related to programming languages including:
- It defines different types of programming languages like procedural, functional, scripting, logic, and object-oriented.
- It explains why studying programming languages is important for expressing ideas, choosing appropriate languages, learning new languages, and better understanding implementation.
- It describes programming paradigms like imperative, object-oriented, functional, and logic programming.
- It covers concepts like syntax, semantics, pragmatics, grammars, and translation models in programming languages.
Statistically-Enhanced New Word Identification - Andi Wu
This document discusses a method for identifying new words in Chinese text using a combination of rule-based and statistical approaches. Candidate character strings are selected as potential new words based on their independent word probability being below a threshold. Parts of speech are then assigned to candidate strings by examining the part of speech patterns of their component characters and comparing them to existing words in a dictionary to determine the most likely part of speech based on word formation patterns in Chinese. This hybrid approach avoids the overgeneration of rule-based systems and data sparsity issues of purely statistical approaches.
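The candidate-selection step described above can be sketched as a threshold test on independent word probability. The character probabilities below are invented for illustration, not taken from the paper:

```python
# Sketch of candidate selection for new-word identification: a character
# string is proposed as a potential new word only if each of its characters
# rarely occurs as an independent word. Probabilities here are made up.
iwp = {"明": 0.05, "天": 0.30, "的": 0.95}   # P(char stands alone as a word)

def is_candidate(chars, threshold=0.4):
    return all(iwp.get(c, 0.0) < threshold for c in chars)

print(is_candidate("明天"))  # True: both characters below the threshold
print(is_candidate("的明"))  # False: 的 usually appears as a word on its own
```

The subsequent part-of-speech assignment would then score each surviving candidate against dictionary word-formation patterns, as the summary describes.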
Phonetic Recognition In Words For Persian Text To Speech Systems - paperpublications3
Abstract: Interest in text-to-speech synthesis has increased worldwide. Text-to-speech systems have been developed for many popular languages such as English, Spanish and French, and much research and development has been applied to those languages. Persian, on the other hand, has been given little attention compared to other languages of similar importance, and research in Persian is still in its infancy. The Persian language has many difficulties and exceptions that increase the complexity of text-to-speech systems: for example, short vowels are absent in written text, and homograph words exist. In this paper we propose a new method for Persian text-to-phonetic conversion based on pronunciation by analogy (PbA) in words, semantic relations and grammatical rules for finding the proper phonetics.
Keywords: PbA, text to speech, Persian language, phonetic recognition.
Title: Phonetic Recognition In Words For Persian Text To Speech Systems
Author: Ahmad Musavi Nasab, Ali Joharpour
International Journal of Recent Research in Mathematics Computer Science and Information Technology (IJRRMCSIT)
Paper Publications
Part-of-speech tagging is one of the basic steps in natural language processing. Although it has been investigated for many languages around the world, very little has been done for the Setswana language. Setswana is written disjunctively, and some words play multiple functions in a sentence. These features make part-of-speech tagging more challenging. This paper presents a finite state method for identifying one of the compound parts of speech, the relative. Results show an 82% identification rate, which is lower than for other languages. The results also show that the model can identify the start of a relative 97% of the time but fails to identify where it stops 13% of the time. The model fails due to the limitations of the morphological analyser and due to more complex sentences not accounted for in the model.
The document describes a machine learning approach for language identification, named entity recognition, and transliteration on query words. It discusses:
1) Using supervised machine learning classifiers like random forest, decision trees, and SVMs along with contextual, character n-gram, and gazetteer features for language identification of Hindi-English and Bangla-English words.
2) Applying an IOB tagging scheme and features like character n-grams, context words, and typographic properties for named entity recognition and classification.
3) A statistical machine transliteration model that segments, aligns, and maps source and target language transliteration units based on context and probabilities learned from parallel training data.
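The language-identification step in point 1 can be sketched with character n-gram features feeding an ensemble classifier. The training words below are illustrative toy data, not the paper's corpus:

```python
# Toy sketch of word-level language identification using character n-gram
# features and a random forest, as in point 1 above. Training data is made up.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

words  = ["khana", "gaana", "paani", "water", "going", "table"]
labels = ["hi", "hi", "hi", "en", "en", "en"]

clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),  # char n-grams
    RandomForestClassifier(n_estimators=50, random_state=0),
).fit(words, labels)

print(clf.predict(["naana"]))  # one of "hi"/"en", based on shared n-grams
```

The real system additionally uses contextual and gazetteer features and compares several classifiers (decision trees, SVMs); this sketch only shows the n-gram feature pipeline.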
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES - kevig
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the most qualitative word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed with existing benchmark datasets for word pairs similarities. The research in this paper conducts a correlation analysis between ground truth word similarities and similarities obtained by different word embedding methods.
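The intrinsic evaluation described above comes down to correlating model similarities with human word-pair ratings, typically via Spearman's rank correlation. A self-contained sketch (the scores are invented, and this simple formula ignores tied ranks):

```python
# Spearman rank correlation between gold word-pair similarity ratings and an
# embedding model's cosine similarities. Scores are illustrative; no tie
# correction is applied.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a, b):
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

human = [9.2, 7.1, 3.4, 1.0]        # gold ratings for 4 word pairs
model = [0.81, 0.66, 0.40, 0.05]    # cosine similarities from an embedding

print(spearman(human, model))  # 1.0: model ranks the pairs like the annotators
```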
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON - IJCSEA Journal
Search in massive modern information retrieval systems (IRS), especially in natural languages, is becoming more difficult, and search in Arabic, as a natural language, is not yet good enough. This paper tries to build a similarity thesaurus based on Arabic using two mechanisms, one on full words and the other on stemmed words, and then compares them. The comparison made by this study shows that the similarity thesaurus built with the stemmed mechanism gets better results than the full-word mechanism, and that the similarity thesaurus improves recall and precision over a traditional information retrieval system at the same recall and precision levels.
Paraphrasing refers to writing that differs in wording or in the arrangement of words but conveys the same meaning. Identifying a paraphrase is exceptionally important in various real-life applications
such as Information Retrieval, Plagiarism Detection, Text Summarization and Question Answering. A large amount
of work in Paraphrase Detection has been done in English and many Indian Languages. However, there is no
existing system to identify paraphrases in Marathi. This is the first such endeavor in the Marathi Language.
A paraphrase has differently structured sentences, and since Marathi is a semantically strong language, this
system is designed for checking both statistical and semantic similarities of Marathi sentences. Statistical similarity
measure does not need any prior knowledge as it is based only on the factual data of sentences. The factual data is calculated from the degree of closeness between the word-set, word-order, word-vector and word-distance. Universal Networking Language (UNL) captures the semantic significance of a sentence without any syntactic points of interest. Hence, the semantic similarity calculated from the UNL graphs generated for two Marathi sentences renders their semantic equality. The total paraphrase score is calculated by combining the statistical and semantic similarity scores, giving a judgment on whether the Marathi sentences in question are a paraphrase or not.
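The combination step can be sketched as a weighted sum of a statistical overlap score and a semantic score. Here the statistical part is a word-set (Jaccard) overlap, and the semantic score is passed in as a stand-in for the UNL-graph comparison; the weight, threshold, and sample sentences are all hypothetical:

```python
# Hypothetical combination of a statistical word-overlap score with a
# semantic score (stand-in for the UNL-graph similarity), averaged 50/50.
def word_set_sim(s1, s2):
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b)          # Jaccard overlap of word sets

def paraphrase_score(s1, s2, semantic_sim, w=0.5):
    return w * word_set_sim(s1, s2) + (1 - w) * semantic_sim

# Reordered Marathi words (romanized): identical word set, high semantic score.
score = paraphrase_score("ti shala la geli", "ti geli shala la", semantic_sim=0.9)
print(score)  # 0.95
```

A real system would add the word-order, word-vector and word-distance components to the statistical side and threshold the final score to decide paraphrase vs non-paraphrase.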
Abstract
Part-of-speech tagging plays an important role in developing natural language processing software. It means assigning a part-of-speech tag to each word of a sentence: the tagger takes a sentence as input and assigns the appropriate tag to each word. This article surveys the different work that has been done on Odia POS tagging.
________________________________________________
This document summarizes research that has been done on computational morphology for the Odia language. It begins with an abstract that outlines how morphological analysis, generation, and parsing are important tools for natural language processing. The document then reviews different works that have developed morphological analyzers and generators for Odia. It describes various methods that have been used, including suffix stripping, finite state transducers, two-level morphology, corpus-based approaches, and paradigm-based approaches. Finally, it outlines several applications of morphology like machine translation, spelling checking, and part-of-speech tagging.
Dependency Analysis of Abstract Universal Structures in Korean and English - Jinho Choi
This thesis gives two contributions in the form of lexical resources to (1) dependency parsing in Korean and (2) semantic parsing in English. First, we describe our methodology for building three dependency treebanks in Korean derived from existing treebanks and pseudo-annotated according to the latest guidelines from the Universal Dependencies (UD). The original Google Korean UD Treebank is re-tokenized to ensure morpheme-level annotation consistency with other corpora while maintaining linguistic validity of the revised tokens. Phrase structure trees in the Penn Korean Treebank and the Kaist Treebank are automatically converted into UD dependency trees by applying head-percolation rules and linguistically motivated heuristics. A total of 38K+ dependency trees are generated. To the best of our knowledge, this is the first time that the three Korean treebanks are converted into UD dependency treebanks following the latest annotation guidelines. Second, we introduce an on-going project for augmenting the OntoNotes phrase structure treebank with semantic features found in the Abstract Meaning Representation (AMR), as part of an effort to build an accurate AMR parser. We propose a novel technique for AMR parsing that first trains a dependency parser on the OntoNotes corpus augmented with numbered arguments in the Proposition Bank (PropBank), and then does a transfer learning of the trained dependency parser for the AMR parsing task. A preliminary step is to prepare dependency data by performing an automatic replacement of dependencies that define predicate argument structure with their corresponding PropBank argument labels during constituent-to-dependency conversion.
To the best of our knowledge, this is the first time that the PropBank labels are directly inserted into dependency structure, producing a new dependency corpus with rich syntactic information as well as semantic role information provided by PropBank that fully describes the predicate-argument structure, making it an ideal resource for AMR parsing and, broadly, semantic parsing.
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE) - ijnlc
Extracting key phrases from documents is a common task in many applications. In general, a noun phrase extractor consists of three modules: tokenization, part-of-speech tagging, and noun phrase identification. These are used as the three main steps in building the new system, ANPE. This paper aims at extracting Arabic noun phrases from a corpus of documents, with the relevant criteria (recall and precision) used as evaluation measures. On the one hand, when using NPs rather than single terms, the system yields more relevant documents among those retrieved; on the other hand, it gives lower precision because the number of retrieved documents decreases. Finally, the researchers conclude and recommend improvements for more effective and efficient research in the future.
Unsupervised Software-Specific Morphological Forms Inference from Informal Di... - Chunyang Chen
The paper was accepted at ICSE'17 and TSE'19. https://se-thesaurus.appspot.com/ https://pypi.org/project/DomainThesaurus/ Informal discussions on social platforms (e.g., Stack Overflow) accumulate a large body of programming knowledge in natural language text. Natural language processing (NLP) techniques can be exploited to harvest this knowledge base for software engineering tasks. To make effective use of NLP techniques, consistent vocabulary is essential. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms in informal discussions, such as abbreviations, synonyms and misspellings. Existing techniques to deal with such morphological forms are either designed for general English or predominantly rely on domain-specific lexical rules. A thesaurus of software-specific terms and commonly used morphological forms is desirable for normalizing software engineering text, but very difficult to build manually. In this work, we propose an automatic approach to build such a thesaurus. Our approach identifies software-specific terms by contrasting software-specific and general corpora, and infers morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations, and graph analysis of morphological relations. We evaluate the coverage and accuracy of the resulting thesaurus against community-curated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our thesaurus in a case study of normalizing questions from Stack Overflow and CodeProject.
Regular Expressions, most often known as “RegEx” is one of the most popular and widely accepted technology used for parsing the specific data contents from large text.
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGkevig
This paper describes the use of Naive Bayes to address the task of assigning function tags and context free
grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function tagging for
Myanmar sentences comes from the fact that Myanmar has free-phrase-order and a complex
morphological system. Function tagging is a pre-processing step for parsing. In the task of function
tagging, we use the functional annotated corpus and tag Myanmar sentences with correct segmentation,
POS (part-of-speech) tagging and chunking information. We propose Myanmar grammar rules and apply
context free grammar (CFG) to find out the parse tree of function tagged Myanmar sentences. Experiments
show that our analysis achieves a good result with parsing of simple sentences and three types of complex
sentences
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGkevig
This paper describes the use of Naive Bayes to address the task of assigning function tags and context free
grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function tagging for
Myanmar sentences comes from the fact that Myanmar has free-phrase-order and a complex
morphological system. Function tagging is a pre-processing step for parsing. In the task of function tagging, we use the functional annotated corpus and tag Myanmar sentences with correct segmentation, POS (part-of-speech) tagging and chunking information. We propose Myanmar grammar rules and apply context free grammar (CFG) to find out the parse tree of function tagged Myanmar sentences. Experiments
show that our analysis achieves a good result with parsing of simple sentences and three types of complex sentences.
Parsing of Myanmar Sentences With Function Taggingkevig
This paper describes the use of Naive Bayes to address the task of assigning function tags and context free
grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function tagging for
Myanmar sentences comes from the fact that Myanmar has free-phrase-order and a complex
morphological system. Function tagging is a pre-processing step for parsing. In the task of function
tagging, we use the functional annotated corpus and tag Myanmar sentences with correct segmentation,
POS (part-of-speech) tagging and chunking information. We propose Myanmar grammar rules and apply
context free grammar (CFG) to find out the parse tree of function tagged Myanmar sentences. Experiments
show that our analysis achieves a good result with parsing of simple sentences and three types of complex
sentences.
A POS Tagger for Tamil Language”, Proceedings of the IJCNLP-2009, Suntec,
Singapore.
Dhanalakshmi V, Anand Kumar M, Soman K P and Rajendran S (2011), “Dependency
Parsing for Tamil using Malt Parser”, Proceedings of the International Conference on
Asian Language Processing (IALP), Bali, Indonesia.
Gimenez J and Marquez L (2004), “SVMTool: A general POS tagger generator based on
Support Vector Machines”, Proceedings of the 4th International Conference on Language
Resources and Evaluation (LREC 2004), Lisbon, Portugal.
Joakim Nivre and Johan Hall (
Similar to Usage of regular expressions in nlp (20)
Hudhud cyclone caused extensive damage in Visakhapatnam, India in October 2014, especially to tree cover. This will likely impact the local environment in several ways: increased air pollution as trees absorb less; higher temperatures without tree canopy; increased erosion and landslides. It also created large amounts of waste from destroyed trees. Proper management of solid waste is needed to prevent disease spread. Suggested measures include restoring damaged plants, building fountains to reduce heat, mandating light-colored buildings, improving waste management, and educating public on health risks. Overall, changes are needed to water, land, and waste practices to rebuild the environment after the cyclone removed green cover.
Impact of flood disaster in a drought prone area – case study of alampur vill...eSAT Publishing House
1) In September-October 2009, unprecedented heavy rainfall and dam releases caused widespread flooding in Alampur village in Mahabub Nagar district, a historically drought-prone area.
2) The flood damaged or destroyed homes, buildings, infrastructure, crops, and documents. It displaced many residents and cut off the village.
3) The socioeconomic conditions and mud-based construction of homes in the village exacerbated the flood's impacts, making damage more severe and recovery more difficult.
The document summarizes the Hudhud cyclone that struck Visakhapatnam, India in October 2014. It describes the cyclone's formation, rapid intensification to winds of 175 km/h, and landfall near Visakhapatnam. The cyclone caused extensive damage estimated at over $1 billion and at least 109 deaths in India and Nepal. Infrastructure like buildings, bridges, and power lines were destroyed. Crops and fishing boats were also damaged. The document then discusses coping strategies and improvements needed to disaster management plans to better prepare for future cyclones.
Groundwater investigation using geophysical methods a case study of pydibhim...eSAT Publishing House
This document summarizes the results of a geophysical investigation using vertical electrical sounding (VES) methods at 13 locations around an industrial area in India. The VES data was interpreted to generate geo-electric sections and pseudo-sections showing subsurface resistivity variations. Three main layers were typically identified - a high resistivity topsoil, a weathered middle layer, and a basement rock. Pseudo-sections revealed relatively more weathered areas in the northwest and southwest. Resistivity sections helped identify zones of possible high groundwater potential based on low resistivity anomalies sandwiched between more resistive layers. The study concluded the electrical resistivity method was useful for understanding subsurface geology and identifying areas prospective for groundwater exploration.
Flood related disasters concerned to urban flooding in bangalore, indiaeSAT Publishing House
1. The document discusses urban flooding in Bangalore, India. It describes how factors like heavy rainfall, population growth, and improper land use have contributed to increased flooding in the city.
2. Flooding events in 2013 are analyzed in detail. A November rainfall caused runoff six times higher than the drainage capacity, inundating low-lying residential areas.
3. Impacts of urban flooding include disrupted daily life, damaged infrastructure, and decreased economic activity in affected areas. The document calls for improved flood management strategies to better mitigate urban flooding risks in Bangalore.
Enhancing post disaster recovery by optimal infrastructure capacity buildingeSAT Publishing House
This document discusses enhancing post-disaster recovery through optimal infrastructure capacity building. It presents a model to minimize the cost of meeting demand using auxiliary capacities when disaster damages infrastructure. The model uses genetic algorithms to select optimal capacity combinations. The document reviews how infrastructure provides vital services supporting recovery activities and discusses classifying infrastructure into six types. When disaster reduces infrastructure services, a gap forms between community demands and available support, hindering recovery. The proposed research aims to identify this gap and optimize capacity selection to fill it cost-effectively.
Effect of lintel and lintel band on the global performance of reinforced conc...eSAT Publishing House
This document analyzes the effect of lintels and lintel bands on the seismic performance of reinforced concrete masonry infilled frames through non-linear static pushover analysis. Four frame models are considered: a frame with a full masonry infill wall; a frame with a central opening but no lintel/band; a frame with a lintel above the opening; and a frame with a lintel band above the opening. The results show that the full infill wall model has 27% higher stiffness and 32% higher strength than the model with just an opening. Models with lintels or lintel bands have slightly higher strength and stiffness than the model with just an opening. The document concludes lintels and lintel
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...eSAT Publishing House
1) A cyclone with wind speeds of 175-200 kph caused massive damage to the green cover of Gitam University campus in Visakhapatnam, India. Thousands of trees were uprooted or damaged.
2) A study assessed different types of damage to trees from the cyclone, including defoliation, salt spray damage, damage to stems/branches, and uprooting. Certain tree species were more vulnerable than others.
3) The results of the study can help in selecting more wind-resistant tree species for future planting and reducing damage from future storms.
Wind damage to buildings, infrastrucuture and landscape elements along the be...eSAT Publishing House
1) A visual study was conducted to assess wind damage from Cyclone Hudhud along the 27km Visakha-Bheemli Beach road in Visakhapatnam, India.
2) Residential and commercial buildings suffered extensive roof damage, while glass facades on hotels and restaurants were shattered. Infrastructure like electricity poles and bus shelters were destroyed.
3) Landscape elements faced damage, including collapsed trees that damaged pavements, and debris in parks. The cyclone wiped out over half the city's green cover and caused beach erosion around protected areas.
1) The document reviews factors that influence the shear strength of reinforced concrete deep beams, including compressive strength of concrete, percentage of tension reinforcement, vertical and horizontal web reinforcement, aggregate interlock, shear span-to-depth ratio, loading distribution, side cover, and beam depth.
2) It finds that compressive strength of concrete, tension reinforcement percentage, and web reinforcement all increase shear strength, while shear strength decreases as shear span-to-depth ratio increases.
3) The distribution and amount of vertical and horizontal web reinforcement also affects shear strength, but closely spaced stirrups do not necessarily enhance capacity or performance.
Role of voluntary teams of professional engineers in dissater management – ex...eSAT Publishing House
1) A team of 17 professional engineers from various disciplines called the "Griha Seva" team volunteered after the 2001 Gujarat earthquake to provide technical assistance.
2) The team conducted site visits, assessments, testing and recommended retrofitting strategies for damaged structures in Bhuj and Ahmedabad. They were able to fully assess and retrofit 20 buildings in Ahmedabad.
3) Factors observed that exacerbated the earthquake's impacts included unplanned construction, non-engineered buildings, improper prior retrofitting, and defective materials and workmanship. The professional engineers' technical expertise was crucial for effective post-disaster management.
This document discusses risk analysis and environmental hazard management. It begins by defining risk, hazard, and toxicity. It then outlines the steps involved in hazard identification, including HAZID, HAZOP, and HAZAN. The document presents a case study of a hypothetical gas collecting station, identifying potential accidents and hazards. It discusses quantitative and qualitative approaches to risk analysis, including calculating a fire and explosion index. The document concludes by discussing hazard management strategies like preventative measures, control measures, fire protection, relief operations, and the importance of training personnel on safety.
Review study on performance of seismically tested repaired shear wallseSAT Publishing House
This document summarizes research on the performance of reinforced concrete shear walls that have been repaired after damage. It begins with an introduction to shear walls and their failure modes. The literature review then discusses the behavior of original shear walls as well as different repair techniques tested by other researchers, including conventional repair with new concrete, jacketing with steel plates or concrete, and use of fiber reinforced polymers. The document focuses on evaluating the strength retention of shear walls after being repaired with various methods.
Monitoring and assessment of air quality with reference to dust particles (pm...eSAT Publishing House
This document summarizes a study on monitoring and assessing air quality with respect to dust particles (PM10 and PM2.5) in the urban environment of Visakhapatnam, India. Sampling was conducted in residential, commercial, and industrial areas from October 2013 to August 2014. The average PM2.5 and PM10 concentrations were within limits in residential areas but moderate to high in commercial and industrial areas. Exceedance factor levels indicated moderate pollution for residential areas and moderate to high pollution for commercial and industrial areas. There is a need for management measures like improved public transport and green spaces to combat particulate air pollution in the study areas.
Low cost wireless sensor networks and smartphone applications for disaster ma...eSAT Publishing House
This document describes a low-cost wireless sensor network and smartphone application system for disaster management. The system uses an Arduino-based wireless sensor network comprising nodes with various sensors to monitor the environment. The sensor data is transmitted to a central gateway and then to the cloud for analysis. A smartphone app connected to the cloud can detect disasters from the sensor data and send real-time alerts to users to help with early evacuation. The system aims to provide low-cost localized disaster detection and warnings to improve safety.
Coastal zones – seismic vulnerability an analysis from east coast of indiaeSAT Publishing House
This document summarizes an analysis of seismic vulnerability along the east coast of India. It discusses the geotectonic setting of the region as a passive continental margin and reports some moderate seismic activity from offshore in recent decades. While seismic stability cannot be assumed given events like the 2004 tsunami, no major earthquakes have been recorded along this coast historically. The document calls for further study of active faults, neotectonics, and implementation of improved seismic building codes to mitigate vulnerability.
Can fracture mechanics predict damage due disaster of structureseSAT Publishing House
This document discusses how fracture mechanics can be used to better predict damage and failure of structures. It notes that current design codes are based on small-scale laboratory tests and do not account for size effects, which can lead to more brittle failures in larger structures. The document outlines how fracture mechanics considers factors like size effect, ductility, and minimum reinforcement that influence the strength and failure behavior of structures. It provides examples of how fracture mechanics has been applied to problems like evaluating shear strength in deep beams and investigating a failure of an oil platform structure. The document argues that fracture mechanics provides a more scientific basis for structural design compared to existing empirical code provisions.
This document discusses the assessment of seismic susceptibility of reinforced concrete (RC) buildings. It begins with an introduction to earthquakes and the importance of vulnerability assessment in mitigating earthquake risks and losses. It then describes modeling the nonlinear behavior of RC building elements and performing pushover analysis to evaluate building performance. The document outlines modeling RC frames and developing moment-curvature relationships. It also summarizes the results of pushover analyses on sample 2D and 3D RC frames with and without shear walls. The conclusions emphasize that pushover analysis effectively assesses building properties but has limitations, and that capacity spectrum method provides appropriate results for evaluating building response and retrofitting impact.
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...eSAT Publishing House
1) A 6.0 magnitude earthquake occurred off the coast of Paradip, Odisha in the Bay of Bengal on May 21, 2014 at a depth of around 40 km.
2) Analysis of magnetic and bathymetric data from the area revealed the presence of major lineaments in NW-SE and NE-SW directions that may be responsible for seismic activity through stress release.
3) Movements along growth faults at the margins of large Bengal channels, due to large sediment loads, could also contribute to seismic events by triggering movements along the faults.
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...eSAT Publishing House
This document discusses the effects of Cyclone Hudhud on the development of Visakhapatnam as a smart and green city through a case study and preliminary surveys. The surveys found that 31% of participants had experienced cyclones, 9% floods, and 59% landslides previously in Visakhapatnam. Awareness of disaster alarming systems increased from 14% before the 2004 tsunami to 85% during Cyclone Hudhud, while awareness of disaster management systems increased from 50% before the tsunami to 94% during Hudhud. The surveys indicate that initiatives after the tsunami improved awareness and preparedness. Developing Visakhapatnam as a smart, green city should consider governance
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSIJNSA Journal
The smart irrigation system represents an innovative approach to optimize water usage in agricultural and landscaping practices. The integration of cutting-edge technologies, including sensors, actuators, and data analysis, empowers this system to provide accurate monitoring and control of irrigation processes by leveraging real-time environmental conditions. The main objective of a smart irrigation system is to optimize water efficiency, minimize expenses, and foster the adoption of sustainable water management methods. This paper conducts a systematic risk assessment by exploring the key components/assets and their functionalities in the smart irrigation system. The crucial role of sensors in gathering data on soil moisture, weather patterns, and plant well-being is emphasized in this system. These sensors enable intelligent decision-making in irrigation scheduling and water distribution, leading to enhanced water efficiency and sustainable water management practices. Actuators enable automated control of irrigation devices, ensuring precise and targeted water delivery to plants. Additionally, the paper addresses the potential threat and vulnerabilities associated with smart irrigation systems. It discusses limitations of the system, such as power constraints and computational capabilities, and calculates the potential security risks. The paper suggests possible risk treatment methods for effective secure system operation. In conclusion, the paper emphasizes the significant benefits of implementing smart irrigation systems, including improved water conservation, increased crop yield, and reduced environmental impact. Additionally, based on the security analysis conducted, the paper recommends the implementation of countermeasures and security approaches to address vulnerabilities and ensure the integrity and reliability of the system. 
By incorporating these measures, smart irrigation technology can revolutionize water management practices in agriculture, promoting sustainability, resource efficiency, and safeguarding against potential security threats.
Low power architecture of logic gates using adiabatic techniquesnooriasukmaningtyas
The growing significance of portable systems to limit power consumption in ultra-large-scale-integration chips of very high density, has recently led to rapid and inventive progresses in low-power design. The most effective technique is adiabatic logic circuit design in energy-efficient hardware. This paper presents two adiabatic approaches for the design of low power circuits, modified positive feedback adiabatic logic (modified PFAL) and the other is direct current diode based positive feedback adiabatic logic (DC-DB PFAL). Logic gates are the preliminary components in any digital circuit design. By improving the performance of basic gates, one can improvise the whole system performance. In this paper proposed circuit design of the low power architecture of OR/NOR, AND/NAND, and XOR/XNOR gates are presented using the said approaches and their results are analyzed for powerdissipation, delay, power-delay-product and rise time and compared with the other adiabatic techniques along with the conventional complementary metal oxide semiconductor (CMOS) designs reported in the literature. It has been found that the designs with DC-DB PFAL technique outperform with the percentage improvement of 65% for NOR gate and 7% for NAND gate and 34% for XNOR gate over the modified PFAL techniques at 10 MHz respectively.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsVictor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
Embedded machine learning-based road conditions and driving behavior monitoringIJECEIAES
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 03 Issue: 01 | Jan-2014, Available @ http://www.ijret.org 168
USAGE OF REGULAR EXPRESSIONS IN NLP
Gaganpreet Kaur
Computer Science Department, Guru Nanak Dev University, Amritsar, gagansehdev204@gmail.com
Abstract
The use of regular expressions to search text is a well-known and useful technique. Regular expressions are generic
representations for a string or a collection of strings, and regular expressions (regexps) are one of the most useful tools in
computer science. NLP, as an area of computer science, has greatly benefited from regexps: they are used in phonology,
morphology, text analysis, information extraction, and speech recognition. This paper gives the reader a general review of the
usage of regular expressions, illustrated with examples from natural language processing. In addition, different approaches to
regular expressions in NLP are discussed.
Keywords— Regular Expression, Natural Language Processing, Tokenization, Longest common subsequence alignment,
POS tagging
------------------------------------------------------------------------***----------------------------------------------------------------------
1. INTRODUCTION
Natural language processing (NLP) is a large and
multidisciplinary field; a natural language contains
infinitely many sentences. NLP began in the 1950s at
the intersection of artificial intelligence and linguistics.
Natural language is also highly ambiguous: many
words have several meanings, such as can, bear, fly,
and orange, and sentences have different meanings in
different contexts. This makes the creation of programs
that understand a natural language a challenging
task [1] [5] [8].
The steps in NLP are [8]:
1. Morphology: Morphology concerns the way words
are built up from smaller meaning-bearing units.
2. Syntax: Syntax concerns how words are put together
to form correct sentences and what structural role each
word has.
3. Semantics: Semantics concerns what words mean and
how these meanings combine in sentences to form
sentence meanings.
4. Pragmatics: Pragmatics concerns how sentences are
used in different situations and how use affects the
interpretation of the sentence.
5. Discourse: Discourse concerns how the immediately
preceding sentences affect the interpretation of the
next sentence.
Fig 1: Steps in Natural Language Processing [8]
Figure 1 illustrates the stages followed in natural
language processing: the input surface text is
converted into tokens by the parsing or tokenisation
phase, and then its syntax and semantics are checked.
2. REGULAR EXPRESSION
Regular expressions (regexps) are one of the most useful tools
in computer science. A regular expression (RE) is a formal
language for specifying a set of strings, most commonly called a
search expression. NLP, as an area of computer science, has
greatly benefited from regexps: they are used in phonology,
morphology, text analysis, information extraction, and speech
recognition. Regular expressions are conventionally written
inside a pair of matching slashes. A regular expression
describes strings of characters (words or phrases or any
arbitrary text): it is a pattern that matches certain strings and
does not match others. A regular expression is a set of
characters that specify a pattern. Regular expressions are
case-sensitive. SQL provides several regular-expression
functions:
REGEXP_LIKE: determine whether a pattern matches
REGEXP_SUBSTR: determine what string matches the pattern
REGEXP_INSTR: determine where the match occurred in the
string
REGEXP_REPLACE: search for and replace a pattern
How can RE be used in NLP?
1) Validate data fields (e.g., dates, email address, URLs,
abbreviations)
2) Filter text (e.g., spam, disallowed web sites)
3) Identify particular strings in a text (e.g., token
boundaries)
4) Convert the output of one processing component into
the format required for a second component [4].
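The four uses listed above can be sketched in Python with the standard re module. The patterns below, including the deliberately simplified email regex and the example spam words, are illustrative assumptions rather than production-grade rules:

```python
import re

# 1) Validate a data field (a simplified email pattern, not RFC-complete)
email_re = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")
assert email_re.match("gagansehdev204@gmail.com")
assert not email_re.match("not-an-email")

# 2) Filter text: flag lines containing a disallowed word
spam_re = re.compile(r"\b(viagra|lottery)\b", re.IGNORECASE)
lines = ["Meeting at noon", "You won the LOTTERY!"]
flagged = [ln for ln in lines if spam_re.search(ln)]

# 3) Identify token boundaries: a crude word tokenizer
tokens = re.findall(r"\w+|[^\w\s]", "Don't panic!")

# 4) Convert one format into another: reorder "Surname, Name"
converted = re.sub(r"(\w+), (\w+)", r"\2 \1", "Kaur, Gaganpreet")
```

Each use is a single pattern plus one of re's matching functions; real systems would need far more careful patterns, but the division of labour is the same.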
3. RELATED WORK
(A. N. Arslan and Dan, 2006) proposed the constrained
sequence alignment process. Regular expression constrained
sequence alignment was introduced for this purpose. An
alignment satisfies the constraint if part of it matches a given
regular expression in each dimension (i.e., in each sequence
aligned). There is a method that rewards alignments that
include a region matching the given regular expression, but
this method does not always guarantee satisfaction of the
constraint.
(M. Shahbaz, P. McMinn and M. Stevenson, 2012) presented a
novel approach of finding valid values by collating suitable
regular expressions dynamically that validate the format of the
string values, such as an email address. The regular expressions
are found using web searches that are driven by the identifiers
appearing in the program, for example a string parameter called
email Address. The identifier names are processed through
natural language processing techniques to tailor the web queries.
Once a regular expression has been found, a secondary web
search is performed for strings matching the regular expression.
(Xiaofei Wang and Yang Xu, 2013) discussed deep
packet inspection, which has become a key
component in network intrusion detection systems
(NIDSes), where every packet in the incoming data
stream needs to be compared with patterns in an
attack database, byte-by-byte, using either string
matching or regular expression matching. Regular
expression matching, despite its flexibility and
efficiency in attack identification, brings significantly
high computation and storage complexities to
NIDSes, making line-rate packet processing a
challenging task. The authors present stride finite
automata (StriFA), a novel finite automata family, to
accelerate both string matching and regular
expression matching.
4. BASIC RE PATTERNS IN NLP [8]
Here is a discussion of regular-expression syntax and
patterns and their meanings in various contexts.
Table 1 Regular expression meaning [8]

Character   Regular-expression meaning
.           Any character, including whitespace or numeric
?           Zero or one of the preceding character
*           Zero or more of the preceding character
+           One or more of the preceding character
^           Negation or complement
In table 1, all the different characters have their own meaning.
Table 2 Regular expression string matching [8]

RE              String matched
/woodchucks/    "interesting links to woodchucks and lemurs"
/a/             "Sammy Ali stopped by Neha's"
/Ali says,/     "My gift please," Ali says,"
/book/          "all our pretty books"
/!/             "Leave him behind!" said Sammy to Neha.

Table 3 Regular expression matching [8]

RE               Match
/[wW]oodchuck/   Woodchuck or woodchuck
/[abc]/          "a", "b", or "c"
/[0123456789]/   Any digit
Table 2 and Table 3 illustrate regular-expression string
matching; for example, /woodchucks/ matches the string
"interesting links to woodchucks and lemurs".
IJRET: International Journal of Research in Engineering and Technology  eISSN: 2319-1163 | pISSN: 2321-7308
Volume: 03 Issue: 01 | Jan-2014, Available @ http://www.ijret.org

Table 4 Regular expression descriptions [8]

RE          Description
/a*/        Zero or more a’s
/a+/        One or more a’s
/a?/        Zero or one a
/cat|dog/   ‘cat’ or ‘dog’
In Table 4, there is a description of various regular
expressions that are used in natural language
processing.
Table 5 Regular expression Kleene operators [8]

Pattern   Description
colou?r   Optional (zero or one of the) previous character
oo*h!     Zero or more of the previous character
o+h!      One or more of the previous character

Table 5 shows the three Kleene operators with their meanings.
Regular expression: Kleene *, Kleene +, wildcard
The special character + (aka Kleene +) specifies one or more
occurrences of the regular expression that comes right before it.
The special character * (aka Kleene *) specifies zero or more
occurrences of the regular expression that comes right before it.
The special character . (wildcard) specifies any single character.
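These operators can be tried directly with a stock regex engine; a minimal sketch in Python, using patterns that mirror Table 5 (the test strings are illustrative):

```python
import re

# Kleene ? : zero or one of the preceding character
assert re.search(r'colou?r', 'color')
assert re.search(r'colou?r', 'colour')

# Kleene * : zero or more of the preceding character
assert re.search(r'oo*h!', 'oh!')       # "o" followed by zero extra o's
assert re.search(r'oo*h!', 'oooh!')

# Kleene + : one or more of the preceding character
assert re.search(r'o+h!', 'oooh!')
assert not re.search(r'o+h!', 'h!')     # needs at least one "o"

# . (wildcard): any single character
assert re.search(r'beg.n', 'begin')
assert re.search(r'beg.n', 'began')
```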
Regular expression: Anchors
Anchors are special characters that anchor a regular expression
to a specific position in the text it is matched against. The
anchors ^ and $ anchor regular expressions at the beginning
and end of the text, respectively [8].
Regular expression: Disjunction
Disjunction of characters inside a regular expression is done
with matching square brackets [ ]. All characters inside [ ]
are part of the disjunction.
Regular expression: Negation [^] in disjunction
The caret means negation only when it is the first character inside [ ].
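Anchors, disjunction, and negation can be demonstrated the same way; a short Python sketch with illustrative test strings:

```python
import re

# ^ and $ anchor the match to the start and end of the text
assert re.search(r'^The', 'The end.')
assert not re.search(r'^The', 'At The end.')
assert re.search(r'end\.$', 'The end.')

# [ ] disjunction: match any one of the characters listed inside
assert re.findall(r'[wW]oodchuck', 'Woodchuck and woodchuck') == \
    ['Woodchuck', 'woodchuck']
assert re.findall(r'[0123456789]', 'room 42') == ['4', '2']

# [^ ] negation: the caret complements the set only when it comes first
assert re.findall(r'[^aeiou]', 'baa') == ['b']
```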
5. GENERATION OF RE FROM SHEEP
LANGUAGE
In NLP, the concept of a regular expression (RE) can be
illustrated using the Sheep Language [8]: a sheep talks to
another sheep with utterances such as “baaaaa!”. This section
shows how a regular expression can be generated from the
Sheep Language.
The strings of the Sheep Language are:
“ba!”, “baa!”, “baaaaa!”
A finite state automaton accepting these strings can be drawn,
with a double circle indicating the accept state.
The corresponding regular expression is:
/baa*!/ (equivalently, /ba+!/)
In this way a regular expression can be generated from a
language by first creating a finite state automaton and deriving
the RE from it.
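The derived pattern can be verified with a regex engine. Note that to accept all three utterances, including “ba!”, the pattern needs at least one “a” and the final “!”, i.e. /baa*!/; a sketch in Python:

```python
import re

# the Sheep Language: "b", one or more "a", then "!"
sheep = re.compile(r'^baa*!$')   # equivalently r'^ba+!$'

for utterance in ['ba!', 'baa!', 'baaaaa!']:
    assert sheep.match(utterance)

assert not sheep.match('b!')     # at least one "a" is required
assert not sheep.match('baa')    # the final "!" is required
```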
6. DIFFERENT APPROACHES OF RE IN NLP [2] [3]
[10]
6.1 An Improved Algorithm for the Regular
Expression Constrained Multiple Sequence Alignment
Problems [2]
This is the first approach, illustrated with two synthetic
sequences S1 = TGFPSVGKTKDDA and S2 = TFSVAKDDDGKSA,
which are aligned so as to maximize the number of matches
(the longest common subsequence problem). An optimal
alignment with 8 matches is shown in Fig 2(a).
Fig 2(a): An optimal alignment with 8 matches [2]
For the regular expression constrained sequence alignment
problem in NLP with regular expression R = (G + A) Σ Σ Σ Σ
G K (S + T), where Σ denotes any symbol of the fixed alphabet
over which the sequences are defined, the alignments sought
change. The alignment in Fig 2(a) does not satisfy the regular
expression constraint. Fig 2(b) shows an alignment with which
the constraint is satisfied.
Fig 2(b): An alignment with the constraint [2]
The alignment includes a region (shown with a rectangle
drawn in dashed lines in the figure 2b) where the substring
GFPSVGKT of S1 is aligned with substring AKDDDGKS of
S2, and both substrings match R. In this case, the optimal number
of matches achievable with the constraint decreases to 4 [2].
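Whether a substring satisfies the constraint R can be checked with a standard regex engine; a sketch assuming Python, writing each “any alphabet symbol” position (Σ) as a dot:

```python
import re

# R = (G + A) Σ Σ Σ Σ G K (S + T): first symbol G or A, four arbitrary
# symbols, then "GK", then S or T
R = re.compile(r'^[GA]....GK[ST]$')

# both substrings aligned in the constrained region of Fig 2(b) match R
assert R.match('GFPSVGKT')   # substring of S1 = TGFPSVGKTKDDA
assert R.match('AKDDDGKS')   # substring of S2 = TFSVAKDDDGKSA
```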
6.2 Automated Discovery of Valid Test Strings from
the Web using Dynamic Regular Expressions
Collation and Natural Language Processing [3]
In this second approach, there is a combination of Natural
Language Processing (NLP) techniques and dynamic regular
expressions collation for finding valid String values on the
Internet. The rest of the section provides details for each part of
the approach [3].
Fig 3: Overview of the approach [3]
6.2.1 Extracting Identifiers
The key idea behind the approach is to extract important
information from program identifiers and use them to generate
web queries. An identifier is a name that identifies either a
unique object or a unique class of objects. An identifier may be
a word, number, letter, symbol, or any combination of those.
For example, an identifier name including the string “email” is
a strong indicator that its value is expected to be an email
address. A web query containing “email” can be used to
retrieve example email addresses from the Internet [3].
6.2.2 Processing Identifier Names
Once the identifiers are extracted, their names are processed
using the following NLP techniques.
1) Tokenisation: Identifier names are often formed from
concatenations of words and need to be split into separate
words (or tokens) before they can be used in web queries.
Identifiers are split into tokens by replacing underscores with
whitespace and adding a whitespace before each sequence of
upper case letters.
For example, “an_Email_Address_Str” becomes “an email
address str” and “parseEmailAddressStr” becomes “parse email
address str”. Finally, all characters are converted to lowercase
[3].
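The tokenisation step described above can be sketched in a few lines of Python (the function name is ours; the splitting rules follow the description: underscores become spaces, a space is added before each run of capitals, and the result is lowercased):

```python
import re

def tokenize_identifier(name):
    """Split an identifier on underscores and camel-case boundaries,
    then lowercase the result (a sketch of the tokenisation step)."""
    s = name.replace('_', ' ')
    s = re.sub(r'([A-Z]+)', r' \1', s)   # space before each run of capitals
    return re.sub(r'\s+', ' ', s).strip().lower()

assert tokenize_identifier('an_Email_Address_Str') == 'an email address str'
assert tokenize_identifier('parseEmailAddressStr') == 'parse email address str'
```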
2) PoS Tagging: Identifier names often contain words such as
articles (“a”, “an”, “the”) and prepositions (“to”, “at”, etc.) that
are not useful when included in web queries. In addition,
method names often contain verbs as a prefix to describe the
action they are intended to perform.
For example, “parseEmailAddressStr” is supposed to parse an
email address. The key information for the input value is
contained in the noun “email address”, rather than the verb
“parse”. The part-of-speech category in the Identifier names
can be identified using a NLP tool called Part-of-Speech (PoS)
tagger, and thereby removing any non-noun tokens. Thus, “an
email address str” and “parse email address str” both become
“email address str” [3].
3) Removal of Non-Words: Identifier names may include non-
words which can reduce the quality of search results. Therefore,
names are filtered so that the web query should entirely consist
of meaningful words. This is done by removing any word in the
processed identifier name that is not a dictionary word.
For example, “email address str” becomes “email address”,
since “str” is not a dictionary word [3].
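The two filtering steps can be sketched together. This is illustrative only: a real pipeline would call an actual PoS tagger (e.g. NLTK's) and a real dictionary, both of which are hard-coded here for the “email address str” example:

```python
# Hypothetical tag lookup and dictionary, hard-coded for illustration.
POS_TAGS = {'an': 'DT', 'parse': 'VB',
            'email': 'NN', 'address': 'NN', 'str': 'NN'}
DICTIONARY = {'an', 'parse', 'email', 'address'}   # "str" is not a word

def process(tokens):
    nouns = [t for t in tokens if POS_TAGS.get(t) == 'NN']  # drop non-nouns
    return [t for t in nouns if t in DICTIONARY]            # drop non-words

assert process('an email address str'.split()) == ['email', 'address']
assert process('parse email address str'.split()) == ['email', 'address']
```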
6.2.3 Obtaining Regular Expressions [3]
The regular expressions for the identifiers are obtained
dynamically in two ways:
1) RegExLib Search, and
2) Web Search.
RegExLib: RegExLib [9] is an online regular expression
library that currently indexes around 3,300 expressions for
different types (e.g., email, URL, postcode) and scientific
notations. It provides an interface to search for a regular
expression.
Web Search: When RegExLib is unable to produce regular
expressions, the approach performs a simple web search. The
search query is formulated by prefixing the processed identifier
names with the string “regular expression”.
For example, the regular expressions for “email address” are
searched by applying the query “regular expression” email
address. The regular expressions are collected by identifying
any string that starts with ^ and ends with $ symbols.
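The collection rule (“any string that starts with ^ and ends with $”) can itself be expressed as a regular expression; a sketch in Python, with a made-up page snippet standing in for downloaded search results:

```python
import re

# hypothetical text scraped from a search-result page
page = ('Try ^[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,}$ for emails, '
        'or ^\\d{5}$ for US zip codes.')

# collect any whitespace-delimited token starting with ^ and ending with $
candidates = re.findall(r'\^\S*\$', page)
assert candidates == ['^[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,}$',
                      '^\\d{5}$']
```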
6.2.4 Generating Web Queries
Once the regular expressions are obtained, valid values can be
generated automatically, e.g., using an automaton. However, the
objective here is to generate not only valid values but also
realistic and meaningful values. Therefore, a secondary web
search is performed to identify values on the Internet matching
the regular expressions. This section explains the generation of
web queries for the secondary search to identify valid values.
The web queries include different versions of pluralized and
quoting styles explained in the following.
6.2.4.1 Pluralization
The approach generates pluralized versions of the processed
identifier names by pluralizing the last word according to the
grammar rules.
For example, “email address” becomes “email addresses”.
6.2.4.2 Quoting
The approach generates queries with or without quotes. The
former style enforces the search engine to target web pages that
contain all words in the query as a complete phrase. The latter
style is a general search to target web pages that contain the
words in the query. In total, 4 queries are generated for each
identifier name, representing all combinations of pluralisation
and quoting styles. For a processed identifier name “email
address”, the generated web queries are: email address, email
addresses; “email address”, “email addresses”.
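The query generation described above can be sketched as follows (the pluralization rule here is deliberately naive; real grammar rules need many more cases, e.g. irregular nouns):

```python
def pluralize_last(phrase):
    """Naively pluralize the final word of a phrase (a sketch)."""
    head, _, last = phrase.rpartition(' ')
    last += 'es' if last.endswith(('s', 'x', 'ch', 'sh')) else 's'
    return (head + ' ' + last).strip()

def web_queries(name):
    """All four combinations of pluralisation and quoting styles."""
    plural = pluralize_last(name)
    return [name, plural, f'"{name}"', f'"{plural}"']

assert web_queries('email address') == [
    'email address', 'email addresses',
    '"email address"', '"email addresses"']
```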
6.2.4.3 Identifying Values
Finally, the regular expressions and the downloaded web pages
are used to identify valid values.
6.3 StriDFA for Multistring Matching [10]
In this method, multistring matching is performed; of the
approaches discussed here it is one of the better ones for
pattern matching, but it still has limitations.
Deterministic finite automaton (DFA) and nondeterministic
finite automaton (NFA) are two typical finite automata used to
implement regex matching. DFA is fast and has deterministic
matching performance, but suffers from the memory explosion
problem. NFA, on the other hand, requires less memory, but
suffers from slow and nondeterministic matching performance.
Therefore, neither of them is suitable for implementing high
speed regex matching in environments where the fast memory
is limited.
Fig 4: Traditional DFA for patterns “reference” and
“replacement” (some transitions are partly ignored for
simplicity) [10]
Suppose two patterns are to be matched, “reference” (P1)
and “replacement” (P2). The matching process is performed by
sending the input stream to the automaton byte by byte. If the
DFA reaches any of its accept states (the states with double
circles), a match is reported.
Fig 5: Tag “e” and a sliding window used to convert an input
stream into an SL stream with tag “e.” [10]
Fig 6: Sample StriDFA of patterns “reference” and
“replacement” with character “e” set as the tag (some
transitions are partly ignored) [10]
The construction of the StriDFA in this example is very simple.
First, the patterns need to be converted to SL streams. As shown
in Fig 6, the SLs of patterns P1 and P2 are Fe(P1) = 2 2 3 and
Fe(P2) = 5 2, respectively. Then the classical DFA construction
algorithm is applied to build the StriDFA.
It is easy to see that the number of states to be visited during
the processing is equal to the length of the input stream (in
units of bytes), and this number determines the time required
for finishing the matching process (each state visit requires a
memory access, which is a major bottleneck in today’s
computer systems). This scheme is designed to reduce the
number of states to be visited during the matching process. If
this objective is achieved, the number of memory accesses
required for detecting a match can be reduced, and
consequently, the pattern matching speed can be improved. One
way to achieve this objective is to reduce the number of
characters sent to the DFA. For example, if “e” is selected as the
tag and the input stream is “referenceabcdreplacement,” as
shown in Fig 5, then the corresponding stride length (SL)
stream is Fe(S) = 2 2 3 6 5 2, where Fe(S) denotes the SL
stream of the input stream S and “e” denotes the tag character in
use. The underscore is used to indicate an SL, to distinguish it
as not being a character. As shown in Figure 6, the SL of the
input stream is fed into the StriDFA and compared with the SL
streams extracted from the rule set.
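The SL stream computation, as described for Figs 5 and 6, amounts to recording the gaps between consecutive occurrences of the tag character; a sketch in Python (the function name is ours):

```python
def sl_stream(s, tag='e'):
    """Stride-length stream: gaps between consecutive occurrences of
    the tag character (a sketch of the StriFA front end)."""
    pos = [i for i, c in enumerate(s) if c == tag]
    return [b - a for a, b in zip(pos, pos[1:])]

# SL streams of the two patterns
assert sl_stream('reference')   == [2, 2, 3]   # Fe(P1)
assert sl_stream('replacement') == [5, 2]      # Fe(P2)

# SL stream of the input from Fig 5
assert sl_stream('referenceabcdreplacement') == [2, 2, 3, 6, 5, 2]
```

Matching against the much shorter SL streams, rather than the raw byte stream, is what reduces the number of state visits (and hence memory accesses) per input.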
7. SUMMARY OF APPROACHES
Table 6 Summary of approaches

Approach 1: An improved algorithm for the regular expression
constrained multiple sequence alignment problems
  Description: A particular RE constraint is followed during
  alignment.
  Problem: Follows constraints for string matching, is used in
  limited areas, and the results produced are arbitrary.

Approach 2: Automated discovery of valid test strings from the
Web using dynamic regular expressions collation and natural
language processing
  Description: Gives valid values; another benefit is that the
  generated values are realistic rather than arbitrary.
  Problem: Multiple strings cannot be processed or matched at
  the same time.

Approach 3: StriDFA for multistring matching
  Description: Multistring matching is performed; the main
  scheme is to reduce the number of states visited during the
  matching process.
  Problem: Does not process large alphabet sets.
In Table 6, the approaches are different and each has its own
importance. The first approach follows constraints for the RE
and is used where a particular RE pattern is given; the second
and third approaches differ from the first in that there is no
such concept of constraints. The second approach is used to
extract and match patterns from web queries, and these queries
can be of any pattern. We suggest the second approach as better
than the first and third, since it generally gives valid values, and
another benefit is that the generated values are also realistic
rather than arbitrary. This is because the values are obtained
from the Internet, which is a rich source of human-recognizable
data. The third approach achieves an ultrahigh matching speed
with relatively low memory usage, but it does not work for
large alphabet sets.
8. APPLICATIONS OF RE IN NLP [8]
1. Web search
2. Word processing, find, substitute (MS-WORD)
3. Validate fields in a database (dates, email address,
URLs)
4. Information extraction (e.g. people & company names)
CONCLUSIONS & FUTURE WORK
A regular expression, or RE, describes strings of characters
(words or phrases or any arbitrary text). An attempt has been
made to present the theory of regular expressions in natural
language processing in a unified way, combining the results of
several authors and showing the different approaches for
regular expressions: the first is the regular expression
constrained multiple sequence alignment problem; the second
is a combination of Natural Language Processing (NLP)
techniques and dynamic regular expressions collation for
finding valid String values on the Internet; and the third is
multistring matching with regexes. All these approaches have
their own importance, and finally practical applications are
discussed.
Future work is to develop advanced algorithms for obtaining
and filtering regular expressions, which shall be investigated to
improve the precision of valid values.
REFERENCES
[1]. K.R. Chowdhary, “Introduction to Parsing - Natural
Language Processing” http://krchowdhary.com/me-nlp/nlp-
01.pdf, April, 2012.
[2]. A. N. Arslan and D. He, “An improved algorithm for the
regular expression constrained multiple sequence alignment
problem”, Proc. Sixth IEEE Symposium on Bioinformatics
and Bioengineering (BIBE'06), 2006.
[3]. M. Shahbaz, P. McMinn and M. Stevenson, "Automated
Discovery of Valid Test Strings from the Web using Dynamic
Regular Expressions Collation and Natural Language
Processing". Proceedings of the International Conference on
Quality Software (QSIC 2012), IEEE, pp. 79–88.
[4]. Sai Qian, “Applications for NLP -Lecture 6: Regular
Expression”
http://talc.loria.fr/IMG/pdf/Lecture_6_Regular_Expression-
5.pdf , October, 2011
[5]. P. M Nadkarni, L.O. Machado, W. W. Chapman
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3168328/
[6]. H. A. Muhtaseb, “Lecture 3: REGULAR EXPRESSIONS
AND AUTOMATA”
[7]. A.McCallum, U. Amherst, “Introduction to Natural
Language Processing-Lecture 2”
CMPSCI 585, 2007
[8]. D. Jurafsky and J. H. Martin, “Speech and Language
Processing”, Prentice Hall, 2nd edition, 2008.
[9]. RegExLib: Regular Expression Library.
http://regexlib.com/.
[10]. Xiaofei Wang and Yang Xu, “StriFA: Stride Finite
Automata for High-Speed Regular Expression Matching in
Network Intrusion Detection Systems” IEEE Systems Journal,
Vol. 7, No. 3, September 2013.