PATTERN RECOGNITION OF CHILDREN STORIES BASED ON THEMES USING TIME SERIES DATA (ijnlc)
This paper presents a study of three languages, namely Assamese, Bengali and English. The main objective of the study is to recognize language-model-based patterns, with special reference to children's stories, in order to find the distinctions among these languages. We consider only children's stories because they are found to be similar all over the world, with different flavors produced by different cultures, languages and times. Sixteen stories based on the themes of fairy tales, fables, Jack tales and formula tales are considered for analysis. The significant differences among the types of stories written by different authors are verified. An autocorrelation test is performed using the Box-Ljung statistic to determine the randomness of the data, treating the language data as time series data. However, for some of the stories the data are found to be random, and hence the Kolmogorov goodness-of-fit test and the Smirnov test are conducted to test for significant differences only for those stories. It is shown that there exist significant differences among the writing patterns of children's stories written by different authors in different languages.
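A minimal sketch (not the authors' code) of the statistical workflow the abstract describes, assuming each story is reduced to a numeric sequence such as word lengths in order of appearance (that representation and the file names are our assumptions): Ljung-Box for autocorrelation, then the one-sample Kolmogorov goodness-of-fit test and the two-sample Smirnov test.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import acorr_ljungbox

def word_length_series(text):
    """Turn a story into a crude time series: the length of each word in order."""
    return np.array([len(w) for w in text.split()], dtype=float)

# Placeholder file names; any two stories to be compared would do.
story_a = word_length_series(open("story_a.txt", encoding="utf-8").read())
story_b = word_length_series(open("story_b.txt", encoding="utf-8").read())

# Ljung-Box test: small p-values reject the null of no autocorrelation,
# i.e. the sequence does not behave like random noise.
print(acorr_ljungbox(story_a, lags=[10]))

# For series that do look random, fall back on distributional tests:
# one-sample Kolmogorov goodness-of-fit against a reference distribution ...
print(stats.kstest(story_a, "norm", args=(story_a.mean(), story_a.std())))
# ... and the two-sample (Smirnov) test to compare two stories directly.
print(stats.ks_2samp(story_a, story_b))
```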
A Neural Probabilistic Language Model.pptx
Bengio, Yoshua, et al. "A neural probabilistic language model." Journal of Machine Learning Research 3 (Feb 2003): 1137-1155.
A goal of statistical language modeling is to learn the joint probability function of sequences of
words in a language. This is intrinsically difficult because of the curse of dimensionality: a word
sequence on which the model will be tested is likely to be different from all the word sequences seen
during training. Traditional but very successful approaches based on n-grams obtain generalization
by concatenating very short overlapping sequences seen in the training set. We propose to fight the
curse of dimensionality by learning a distributed representation for words which allows each
training sentence to inform the model about an exponential number of semantically neighboring
sentences. The model learns simultaneously (1) a distributed representation for each word along
with (2) the probability function for word sequences, expressed in terms of these representations.
Generalization is obtained because a sequence of words that has never been seen before gets high
probability if it is made of words that are similar (in the sense of having a nearby representation) to
words forming an already seen sentence. Training such large models (with millions of parameters)
within a reasonable time is itself a significant challenge. We report on experiments using neural
networks for the probability function, showing on two text corpora that the proposed approach
significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to
take advantage of longer contexts.
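A minimal PyTorch sketch of the feed-forward architecture this abstract describes (shared word embeddings for the previous n-1 words, a tanh hidden layer, and a softmax over the vocabulary); the dimensions, toy data and single training step are illustrative assumptions, not the original setup.

```python
import torch
import torch.nn as nn

class NeuralProbabilisticLM(nn.Module):
    """Bengio-style feed-forward LM: embed the previous `context` words,
    concatenate, pass through a tanh hidden layer, and score the next word."""
    def __init__(self, vocab_size, context=3, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(context * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, ctx_ids):              # ctx_ids: (batch, context)
        e = self.emb(ctx_ids).flatten(1)      # (batch, context * emb_dim)
        h = torch.tanh(self.hidden(e))
        return self.out(h)                    # unnormalised next-word scores

# Toy usage with made-up token ids; real training would iterate over a corpus.
model = NeuralProbabilisticLM(vocab_size=1000)
ctx = torch.randint(0, 1000, (8, 3))          # 8 contexts of 3 previous words
target = torch.randint(0, 1000, (8,))         # the words that actually follow
loss = nn.CrossEntropyLoss()(model(ctx), target)
loss.backward()
```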
TEXT DATA MINING OF ENGLISH BOOKS ON ENVIRONMENTOLOGY (cscpconf)
Recently, to confront environmental problems, a system of “environmentology” has been under construction. In order to study environmentology, reading materials in English are considered indispensable. In this paper, we investigated several English books on environmentology, comparing them with journalism in terms of metrical linguistics. In short, the frequency characteristics of character- and word-appearance were investigated using a program written in C++. These characteristics were approximated by an exponential function. Furthermore, we calculated the percentage of Japanese junior high school required vocabulary and American basic vocabulary to obtain the difficulty level, as well as the K-characteristic, of each material. As a result, it was clearly shown that English materials on environmentology have a similar tendency to literary writings in the characteristics of character appearance. Besides, the values of the K-characteristic for the materials on environmentology are high, and some books are more difficult than TIME magazine.
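A rough Python sketch, not the authors' C++ program, of one step the abstract mentions: counting character frequencies and fitting an exponential curve to the rank-ordered counts. The input file name is a placeholder.

```python
from collections import Counter
import numpy as np
from scipy.optimize import curve_fit

# Placeholder file name standing in for one of the environmentology books.
text = open("environmentology_book.txt", encoding="utf-8").read().lower()

# Rank-ordered character counts (letters only), most frequent first.
counts = np.array(sorted(Counter(c for c in text if c.isalpha()).values(),
                         reverse=True), dtype=float)
ranks = np.arange(1, len(counts) + 1, dtype=float)

def exponential(x, a, b):
    return a * np.exp(-b * x)

params, _ = curve_fit(exponential, ranks, counts, p0=(counts[0], 0.1))
print("fitted a=%.2f, b=%.4f" % tuple(params))
```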
Document Author Classification Using Parsed Language Structure (kevig)
Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the
text, such as by using occurrence rates of noncontextual words. In previous work, these techniques have been used,
for example, to determine authorship of all of The Federalist Papers. Such methods may be useful in more modern
times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of
using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship
using grammatical structural information extracted using a statistical natural language parser. This paper provides a
proof of concept, testing author classification based on grammatical structure on a set of “proof texts,” The Federalist
Papers and Sanditon, which have been used as test cases in previous authorship detection studies. Several features extracted
from the statistical natural language parser were explored: all subtrees of some depth from any level; rooted subtrees of
some depth; part of speech; and part of speech by level in the parse tree. It was found to be helpful to project the
features into a lower dimensional space. Statistical experiments on these documents demonstrate that information
from a statistical parser can, in fact, assist in distinguishing authors.
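A simplified sketch of the pipeline outlined above, using plain part-of-speech frequencies as a stand-in for the richer parse-subtree features in the paper, with PCA as one possible projection to a lower-dimensional space; the documents, labels and classifier choice are assumptions.

```python
# Requires nltk data: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")
from collections import Counter
import nltk
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def pos_profile(text):
    """Relative frequency of each part-of-speech tag in a document."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    return {tag: count / len(tags) for tag, count in Counter(tags).items()}

docs = ["text known to be by author A ...", "text known to be by author B ..."]  # placeholders
labels = ["A", "B"]

features = [pos_profile(d) for d in docs]
X = DictVectorizer(sparse=False).fit_transform(features)
X = PCA(n_components=min(10, len(docs))).fit_transform(X)   # project to a low-dimensional space
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```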
A COMPUTATIONAL APPROACH FOR ANALYZING INTER-SENTENTIAL ANAPHORIC PRONOUNS IN... (ijnlc)
This paper presents a strategy and a computational model for solving inter-sentential anaphoric pronouns
in Vietnamese paragraphs composing simple sentences. The strategy is proposed based on grammatical
features of nouns and the focus phenomenon when using pronouns in Vietnamese. In this research, we
consider only nouns and pronouns which are human objects in the paragraph, and each anaphoric
pronoun will appear one time in one sentence and can appear in adjacent sentences. The computational
model is implemented in Prolog and is based on applying and improving the models of Mark Johnson and
Ewan Klein, as later improved by Covington and Schmitz, with the theoretical background of Discourse
Representation Theory. Analysis of the test results shows that this linguistically grounded approach
solves inter-sentential anaphoric pronouns in Vietnamese paragraphs well.
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us... (pathsproject)
Aletras, Nikolaos and Stevenson, Mark (2013) “Evaluating Topic Coherence Using Distributional Semantics”, Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013), Long Papers, Potsdam, Germany.
Defining the generative probabilistic topic model for text summarization that aims at extracting a small subset of sentences from the corpus with respect to some given query.
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text (kevig)
The article presents Sentiment Analysis (SA) and Tense Classification using the skip-gram model for word-to-vector encoding of Nepali text. The experiment on SA for positive-negative classification is carried out in two ways. In the first experiment the vector representation of each sentence is generated using the skip-gram model followed by Multi-Layer Perceptron (MLP) classification, and an F1 score of 0.6486 is achieved for positive-negative classification with an overall accuracy of 68%. In the second experiment verb chunks are extracted using a Nepali parser and the same experiment is carried out on the verb chunks; an F1 score of 0.6779 is observed for positive-negative classification with an overall accuracy of 85%. Hence, chunker-based sentiment analysis proves better than sentiment analysis using whole sentences. This paper also proposes using a skip-gram model to identify the tenses of Nepali sentences and verbs. In the third experiment, tense classification is carried out on verb chunks using the same skip-gram and Multi-Layer Perceptron (MLP) pipeline, and the verb chunks give a very low overall accuracy of 53%. The fourth experiment, conducted for tense classification using full sentences, results in improved performance with an overall accuracy of 89%. Past tenses were identified and classified more accurately than other tenses. Hence, sentence-based tense classification proves better than verb-chunk-based tense classification.
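A minimal sketch of the skip-gram plus MLP pipeline described above, with toy tokens standing in for the Nepali corpus; the vector size, the sentence-averaging step and all data are assumptions, not the authors' setup.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.neural_network import MLPClassifier

sentences = [["ramro", "chha"], ["naramro", "chha"]]       # placeholder tokens
labels = [1, 0]                                             # 1 = positive sentiment

# Skip-gram (sg=1) word vectors trained on the toy corpus.
w2v = Word2Vec(sentences, vector_size=50, sg=1, min_count=1, epochs=50)

def sentence_vector(tokens):
    """Average the skip-gram vectors of the tokens in a sentence (or verb chunk)."""
    return np.mean([w2v.wv[t] for t in tokens if t in w2v.wv], axis=0)

X = np.vstack([sentence_vector(s) for s in sentences])
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, labels)
print(clf.predict(X))
```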
STRESS TEST FOR BERT AND DEEP MODELS: PREDICTING WORDS FROM ITALIAN POETRY (kevig)
In this paper we present a set of experiments carried out with BERT on a number of Italian sentences taken from the poetry domain. The experiments are organized around the hypothesis of a very high level of difficulty in predictability at the three levels of linguistic complexity that we intend to monitor: the lexical, syntactic and semantic levels. To test this hypothesis we ran the Italian version of BERT on 80 sentences, for a total of 900 tokens, mostly extracted from Italian poetry of the first half of the last century. We then alternated canonical and non-canonical versions of the same sentence before processing them with the same DL model, and we also used sentences from the newswire domain containing similar syntactic structures. The results show that the DL model is highly sensitive to the presence of non-canonical structures. However, DLs are also very sensitive to word frequency and to local non-literal compositional meaning effects. This is also apparent from the preference for predicting function over content words, and collocates over infrequent word phrases. In the paper, we focus our attention on BERT's use of subword units for out-of-vocabulary words.
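A small sketch of the kind of cloze probe the abstract describes, using the Hugging Face fill-mask pipeline; the Italian checkpoint name and the canonical vs non-canonical example sentences are our assumptions, not the paper's materials.

```python
from transformers import pipeline

# Public Italian BERT checkpoint (assumed; the paper's exact model may differ).
fill = pipeline("fill-mask", model="dbmdz/bert-base-italian-cased")

# Canonical vs non-canonical word order for the "same" toy sentence.
canonical = "Il poeta scrive una [MASK] ogni mattina."
non_canonical = "Una [MASK] scrive il poeta ogni mattina."

for sent in (canonical, non_canonical):
    top = fill(sent, top_k=3)
    print(sent, [(p["token_str"], round(p["score"], 3)) for p in top])
```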
Effect of Query Formation on Web Search Engine Results (kevig)
A query in a search engine is generally based on natural language. A query can be expressed in more than one way without changing its meaning, as it depends on the thinking of the human being at a particular moment. The aim of the searcher is to get the most relevant results irrespective of how the query has been expressed. In the present paper, we examine the results of a search engine for changes in coverage and in the similarity of the first few results when a query is entered in two semantically equivalent but differently worded forms. Searching was done through the Google search engine, and fifteen pairs of queries were chosen for the study. The t-test was used for the purpose, and the results were checked on the basis of the total documents found and the similarity of the first five and first ten documents returned for a query entered in the two different forms. It was found that the total coverage is the same but the first few results are significantly different.
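A hedged sketch of the comparison described above: total-hit counts for two phrasings of each query are compared with a paired t-test, and the top results are compared by simple overlap; all numbers and URLs are placeholders, not the study's data.

```python
from scipy import stats

hits_form1 = [1200, 530, 89000, 410, 2300]      # total results, phrasing 1
hits_form2 = [1150, 540, 91000, 395, 2280]      # total results, phrasing 2
print(stats.ttest_rel(hits_form1, hits_form2))   # paired t-test on coverage

def overlap(urls_a, urls_b, k=5):
    """Fraction of the first k results shared by both phrasings."""
    return len(set(urls_a[:k]) & set(urls_b[:k])) / k

print(overlap(["u1", "u2", "u3", "u4", "u5"],
              ["u2", "u9", "u1", "u8", "u7"]))
```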
Investigations of the Distributions of Phonemic Durations in Hindi and Dogri (kevig)
Speech generation is one of the most important areas of research in speech signal processing and is now gaining serious attention. Speech is a natural form of communication in all living things. Computers with the ability to understand speech and speak with a human-like voice are expected to contribute to the development of more natural man-machine interfaces. However, in order to give them functions that are even closer to those of human beings, we must learn more about the mechanisms by which speech is produced and perceived, and develop speech information processing technologies that can generate more natural-sounding systems. This field of study, also called speech synthesis and more prominently acknowledged as text-to-speech synthesis, originated in the mid eighties with the emergence of DSP and the rapid advancement of VLSI techniques. To understand this field, it is necessary to understand the basic theory of speech production. Every language has a different phonetic alphabet and a different set of possible phonemes and their combinations.
For the analysis of the speech signal, we carried out recordings of five speakers of Dogri (3 male and 5 female) and eight speakers of Hindi (4 male and 4 female). For estimating the durational distributions, the mean of the means of ten instances of each vowel for each speaker in both languages was calculated. The investigations have shown that the two durational distributions differ significantly with respect to mean and standard deviation. The duration of a phoneme is speaker dependent. The whole investigation can be concluded with the end result that almost all the Dogri phonemes have shorter durations than the Hindi phonemes. The durations in milliseconds of the same phonemes uttered in Hindi were found to be longer than when they were spoken by a person with Dogri as their mother tongue. There are many applications directly or indirectly related to this research. For instance, the main application may be transforming Dogri speech into Hindi and vice versa, and, further utilizing this application, a speech aid can be developed to teach Dogri to children. The results may also be useful for synthesizing the phonemes of Dogri using the parameters of the phonemes of Hindi and for building large-vocabulary speech recognition systems.
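A small illustration of the kind of comparison implied above: per-language vowel durations summarised by mean and standard deviation and compared with a two-sample test; the duration values are invented placeholders, not the measured data.

```python
import numpy as np
from scipy import stats

hindi_a = np.array([112.0, 118.5, 109.2, 121.7, 115.3])   # /a/ durations (ms), Hindi speakers
dogri_a = np.array([96.4, 101.2, 93.8, 99.5, 97.1])        # /a/ durations (ms), Dogri speakers

print("Hindi mean %.1f ms, sd %.1f" % (hindi_a.mean(), hindi_a.std(ddof=1)))
print("Dogri mean %.1f ms, sd %.1f" % (dogri_a.mean(), dogri_a.std(ddof=1)))
print(stats.ttest_ind(hindi_a, dogri_a, equal_var=False))   # Welch's two-sample t-test
```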
May 2024 - Top10 Cited Articles in Natural Language Computing (kevig)
Natural Language Processing is a programmed approach to analyzing text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that will analyze, understand, and generate the languages that humans use naturally to address computers.
Effect of Singular Value Decomposition Based Processing on Speech Perception (kevig)
Speech is an important biological signal, the primary mode of communication among human beings, and the most natural and efficient form of exchanging information. Speech processing is therefore a central aspect of signal processing. In this paper the linear-algebra technique of singular value decomposition (SVD) is applied to the speech signal. SVD is a technique for deriving the important parameters of a signal. The parameters derived using SVD may be further reduced by perceptual evaluation of speech synthesized using only the perceptually important parameters, so that the speech signal can be compressed and the information transformed into compressed form without losing quality. This technique finds wide application in speech compression, speech recognition, and speech synthesis. The objective of this paper is to investigate the effect of SVD-based feature selection of the input speech on the perception of the processed speech signal. Speech signals in the form of the vowels /a/, /e/ and /u/ were recorded from each of six speakers (3 males and 3 females). The vowels for the six speakers were analyzed using SVD-based processing, and the effect of reducing the number of singular values on the perception of the vowels resynthesized with the reduced singular values was investigated. The investigations have shown that the number of singular values can be drastically reduced without significantly affecting the perception of the vowels.
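A bare-bones illustration of SVD-based parameter reduction on a matrix of speech frames; the random frame matrix stands in for the recorded vowels, and the number of retained singular values is arbitrary.

```python
import numpy as np

frames = np.random.randn(128, 40)                 # 128 frames x 40 coefficients (placeholder data)
U, s, Vt = np.linalg.svd(frames, full_matrices=False)

k = 8                                             # keep only k singular values
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # reduced-parameter reconstruction

energy_kept = s[:k].sum() / s.sum()
error = np.linalg.norm(frames - approx) / np.linalg.norm(frames)
print(f"kept {k} of {len(s)} singular values, "
      f"{energy_kept:.1%} of singular-value mass, relative error {error:.3f}")
```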
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models (kevig)
Relevance evaluation of a query and a passage is essential in Information Retrieval (IR). Recently, numerous studies have been conducted on tasks related to relevance judgment using Large Language Models (LLMs) such as GPT-4,
demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to
identify which specific terms in prompts positively or negatively impact relevance
evaluation with LLMs. We employed two types of prompts: those used in previous
research and those generated automatically by LLMs. By comparing the performance of
these prompts in both few-shot and zero-shot settings, we analyze the influence of
specific terms in the prompts. We have observed two main findings from our study.
First, we discovered that prompts using the term ‘answer’ lead to more effective
relevance evaluations than those using ‘relevant.’ This indicates that a more direct
approach, focusing on answering the query, tends to enhance performance. Second,
we noted the importance of appropriately balancing the scope of ‘relevance.’ While
the term ‘relevant’ can extend the scope too broadly, resulting in less precise evaluations, an optimal balance in defining relevance is crucial for accurate assessments.
The inclusion of few-shot examples helps in more precisely defining this balance.
By providing clearer contexts for the term ‘relevance,’ few-shot examples contribute
to refining the relevance criteria. In conclusion, our study highlights the significance of
carefully selecting terms in prompts for relevance evaluation with LLMs.
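A schematic sketch of the prompt contrast described above; `judge` is a hypothetical stand-in for whatever LLM call is used, and the two templates only illustrate the 'relevant' vs 'answer' wording, not the paper's exact prompts.

```python
PROMPT_RELEVANT = (
    "Query: {query}\nPassage: {passage}\n"
    "Is the passage relevant to the query? Answer Yes or No."
)
PROMPT_ANSWER = (
    "Query: {query}\nPassage: {passage}\n"
    "Does the passage answer the query? Answer Yes or No."
)

def judge(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return 'Yes' or 'No'."""
    raise NotImplementedError

def accuracy(template, pairs, gold):
    """Score one prompt wording against gold relevance labels."""
    preds = [judge(template.format(query=q, passage=p)) == "Yes" for q, p in pairs]
    return sum(int(pred == g) for pred, g in zip(preds, gold)) / len(gold)
```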
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word
sense disambiguator systems that, given a word appearing in a certain context, can identify the sense of
that word. In this paper we consider the problem of deciding whether same words contained in different
documents are related to the same meaning or are homonyms. Our goal is to improve the estimate of the
similarity of documents in which some words may be used with different meanings. We present three new
strategies for solving this problem, which are used to filter out homonyms from the similarity computation.
Two of them are intrinsically non-semantic, whereas the other one has a semantic flavor and can also be
applied to word sense disambiguation. The three strategies have been embedded in an article document
recommendation system that one of the most important Italian ad-serving companies offers to its customers.
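A toy sketch of the filtering idea described above: compute document similarity while excluding words flagged as ambiguous, so homonyms do not inflate the score; the ambiguous-word list and documents are placeholders, and the paper's actual filtering strategies are not reproduced.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ambiguous = {"bank", "seal"}          # words suspected to be homonyms (placeholder list)
docs = ["the bank raised interest rates",
        "they sat on the river bank at dusk"]

# Treat ambiguous words as stop words so they drop out of the similarity computation.
vectorizer = TfidfVectorizer(stop_words=list(ambiguous))
X = vectorizer.fit_transform(docs)
print(cosine_similarity(X[0], X[1]))  # similarity with homonyms filtered out
```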
Genetic Approach For Arabic Part Of Speech Tagging (kevig)
With the growing number of textual resources available, the ability to understand them becomes critical.
An essential first step in understanding these sources is the ability to identify the parts-of-speech in each
sentence. Arabic is a morphologically rich language, which presents a challenge for part of speech
tagging. In this paper, our goal is to propose, improve, and implement a part-of-speech tagger based on a
genetic algorithm. The accuracy obtained with this method is comparable to that of other probabilistic
approaches.
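A minimal genetic-algorithm tagger sketch in the spirit of the abstract, not the authors' system: a chromosome assigns one tag per word, fitness combines a toy lexicon with tag-bigram scores, and the population evolves by selection, crossover and mutation; all data are placeholders.

```python
import random

TAGS = ["NOUN", "VERB", "ADJ"]
lexicon = {"kitab": {"NOUN": 0.9, "ADJ": 0.1},        # word -> tag probabilities (toy)
           "jadid": {"ADJ": 0.8, "NOUN": 0.2},
           "qaraa": {"VERB": 0.9, "NOUN": 0.1}}
bigram = {("NOUN", "ADJ"): 0.5, ("ADJ", "VERB"): 0.4, ("NOUN", "VERB"): 0.3}
sentence = ["kitab", "jadid", "qaraa"]

def fitness(chrom):
    score = sum(lexicon[w].get(t, 0.01) for w, t in zip(sentence, chrom))
    score += sum(bigram.get((a, b), 0.01) for a, b in zip(chrom, chrom[1:]))
    return score

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.1):
    return [random.choice(TAGS) if random.random() < rate else t for t in chrom]

population = [[random.choice(TAGS) for _ in sentence] for _ in range(30)]
for _ in range(50):                                    # evolve for 50 generations
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                          # elitist selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(20)]
    population = parents + children

print(max(population, key=fitness))                    # best tag sequence found
```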
Rule Based Transliteration Scheme for English to Punjabi (kevig)
Machine transliteration has emerged as a very important research area in the field of machine translation. Transliteration basically aims to preserve the phonological structure of words, and proper transliteration of named entities plays a very significant role in improving the quality of machine translation. In this paper we perform machine transliteration for the English-Punjabi language pair using a rule-based approach. We have constructed rules for syllabification, the process of extracting or separating the syllables from words. We calculate probabilities for named entities (proper names and locations). For words that do not fall under the category of named entities, separate probabilities are calculated using relative frequency through a statistical machine translation toolkit known as MOSES. Using these probabilities we transliterate the input text from English to Punjabi.
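A tiny sketch of the relative-frequency estimate mentioned above: given aligned (English unit, Punjabi transliteration) pairs, estimate P(Punjabi | English) by counting; the pairs are invented, and the syllabification rules and the MOSES toolkit used in the paper are not reproduced here.

```python
from collections import Counter, defaultdict

pairs = [("sing", "ਸਿੰਘ"), ("sing", "ਸਿੰਗ"), ("sing", "ਸਿੰਘ"),
         ("pal", "ਪਾਲ")]                       # placeholder aligned pairs

counts = defaultdict(Counter)
for eng, pun in pairs:
    counts[eng][pun] += 1

# Relative-frequency estimate of P(punjabi | english).
prob = {eng: {pun: c / sum(cnt.values()) for pun, c in cnt.items()}
        for eng, cnt in counts.items()}
print(prob["sing"])     # probabilities of roughly 0.67 and 0.33 for the two variants
```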
Improving Dialogue Management Through Data Optimization (kevig)
In task-oriented dialogue systems, the ability for users to effortlessly communicate with machines and computers through natural language stands as a critical advancement. Central to these systems is the dialogue manager, a pivotal component tasked with navigating the conversation to effectively meet user goals by selecting the most appropriate response. Traditionally, the development of sophisticated dialogue management has embraced a variety of methodologies, including rule-based systems, reinforcement learning, and supervised learning, all aimed at optimizing response selection in light of user inputs. This research casts a spotlight on the pivotal role of data quality in enhancing the performance of dialogue managers. Through a detailed examination of prevalent errors within acclaimed datasets, such as Multiwoz 2.1 and SGD, we introduce an innovative synthetic dialogue generator designed to control the introduction of errors precisely. Our comprehensive analysis underscores the critical impact of dataset imperfections, especially mislabeling, on the challenges inherent in refining dialogue management processes.
Rag-Fusion: A New Take on Retrieval Augmented Generation (kevig)
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product information. This problem is traditionally addressed with retrieval-augmented generation (RAG) chatbots, but in this study, I evaluated the use of the newly popularized RAG-Fusion method. RAG-Fusion combines RAG and reciprocal rank fusion (RRF) by generating multiple queries, reranking them with reciprocal scores and fusing the documents and scores. Through manually evaluating answers on accuracy, relevance, and comprehensiveness, I found that RAG-Fusion was able to provide accurate and comprehensive answers due to the generated queries contextualizing the original query from various perspectives. However, some answers strayed off topic when the generated queries' relevance to the original query is insufficient. This research marks significant progress in artificial intelligence (AI) and natural language processing (NLP) applications and demonstrates transformations in a global and multi-industry context.
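A compact sketch of reciprocal rank fusion as described above: each generated query contributes its own ranking, and documents are scored by summing 1/(k + rank) across rankings; the document ids are placeholders, and k = 60 is the commonly used constant, not necessarily the study's setting.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:                      # one ranked doc-id list per generated query
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

rankings = [["d3", "d1", "d7"],                   # results for generated query 1
            ["d1", "d3", "d9"],                   # results for generated query 2
            ["d1", "d2", "d3"]]                   # results for generated query 3
print(rrf(rankings))                              # fused order, d1 and d3 near the top
```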
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati... (kevig)
The common practice in Machine Learning research is to evaluate the top-performing models based on their performance. However, this often leads to overlooking other crucial aspects that should be given careful consideration. In some cases, the performance differences between various approaches may be insignificant, whereas factors like production costs, energy consumption, and carbon footprint should be taken into account. Large Language Models (LLMs) are widely used in academia and industry to address NLP problems. In this study, we present a comprehensive quantitative comparison between traditional approaches (SVM-based) and more recent approaches such as LLM (BERT family models) and generative models (GPT2 and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only performance parameters (standard indices), but also alternative measures such as timing, energy consumption and costs, which collectively contribute to the carbon footprint. To ensure a complete analysis, we separately considered the prototyping phase (which involves model selection through training-validation-test iterations) and the in-production phases. These phases follow distinct implementation procedures and require different resources. The results indicate that simpler algorithms often achieve performance levels similar to those of complex models (LLM and generative models), consuming much less energy and requiring fewer resources. These findings suggest that companies should consider additional considerations when choosing machine learning (ML) solutions. The analysis also demonstrates that it is increasingly necessary for the scientific world to also begin to consider aspects of energy consumption in model evaluations, in order to be able to give real meaning to the results obtained using standard metrics (Precision, Recall, F1 and so on).
Evaluation of Medium-Sized Language Models in German and English Language (kevig)
Large language models (LLMs) have garnered significant attention, but the definition of “large” lacks clarity. This paper focuses on medium-sized language models (MLMs), defined as having at least six billion parameters but less than 100 billion. The study evaluates MLMs on zero-shot generative question answering, which requires models to provide elaborate answers without external document retrieval. The paper introduces its own test dataset and presents results from human evaluation. Results show that combining the best answers from different MLMs yielded an overall correct answer rate of 82.7%, which is better than the 60.9% of ChatGPT. The best MLM achieved 71.8% and has 33B parameters, which highlights the importance of using appropriate training data for fine-tuning rather than relying solely on the number of parameters. More fine-grained feedback should be used to further improve the quality of answers. The open source community is quickly closing the gap to the best commercial models.
Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm (kevig)
Information Retrieval (IR) is a very important and vast area. When searching for a context, the web returns all results related to the query, and identifying the relevant result is a most tedious task for the user. Word Sense Disambiguation (WSD) is the process of identifying the sense of a word in its textual context when the word has multiple meanings, and we use WSD approaches here. This paper presents a Proposed Dynamic Page Rank algorithm that is an improved version of the PageRank algorithm. The Proposed Dynamic Page Rank algorithm gives much better results than the existing Google PageRank algorithm. To show this, we calculated the Reciprocal Rank for both algorithms and present comparative results.
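For reference, a minimal power-iteration PageRank sketch of the baseline the paper improves on; the proposed dynamic modifications are not reproduced, and the link graph is a toy placeholder.

```python
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}       # page -> pages it links to (toy graph)
n, d = 4, 0.85                                    # number of pages, damping factor

# Column-stochastic transition matrix: M[j, i] = P(jump to j | currently at i).
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):                              # power iteration
    rank = (1 - d) / n + d * M @ rank
print(rank / rank.sum())
```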
Effect of MFCC Based Features for Speech Signal Alignments (kevig)
The fundamental techniques used for man-machine communication include speech synthesis, speech recognition, and speech transformation. Feature extraction techniques provide a compressed representation of the speech signal, and HNM analysis and synthesis provides high-quality speech with a small number of parameters. Dynamic time warping is a well-known technique for aligning two given multidimensional sequences: it locates an optimal match between them, and the improvement in alignment is estimated from the corresponding distances. The objective of this research is to investigate the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments. Speech signals in the form of twenty-five phrases were recorded. The recorded material was segmented manually and aligned at the sentence, word, and phoneme levels, and the Mahalanobis distance (MD) was computed between the aligned frames. The investigation has shown better alignment in the HNM parametric domain, and it has been seen that effective speech alignment can be carried out even at the phrase level.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL). A Govt. owned Company of Bangladesh Chemical Industries Corporation under Ministry of Industries.
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNN)s, to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), itsignificantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSveerababupersonal22
It consists of cw radar and fmcw radar ,range measurement,if amplifier and fmcw altimeterThe CW radar operates using continuous wave transmission, while the FMCW radar employs frequency-modulated continuous wave technology. Range measurement is a crucial aspect of radar systems, providing information about the distance to a target. The IF amplifier plays a key role in signal processing, amplifying intermediate frequency signals for further analysis. The FMCW altimeter utilizes frequency-modulated continuous wave technology to accurately measure altitude above a reference point.
We have compiled the most important slides from each speaker's presentation. This year’s compilation, available for free, captures the key insights and contributions shared during the DfMAy 2024 conference.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
HEAP SORT ILLUSTRATED WITH HEAPIFY, BUILD HEAP FOR DYNAMIC ARRAYS.
Heap sort is a comparison-based sorting technique based on Binary Heap data structure. It is similar to the selection sort where we first find the minimum element and place the minimum element at the beginning. Repeat the same process for the remaining elements.
Cosmetic shop management system project report.pdfKamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it's thought to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. It includes various function programs to do the above mentioned tasks.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system should deal with the automation of general workflow and administration process of the shop. The main processes of the system focus on customer's request where the system is able to search the most appropriate products and deliver it to the customers. It should help the employees to quickly identify the list of cosmetic product that have reached the minimum quantity and also keep a track of expired date for each cosmetic product. It should help the employees to find the rack number in which the product is placed.It is also Faster and more efficient way.
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdffxintegritypublishin
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
Hierarchical Digital Twin of a Naval Power SystemKerry Sado
A hierarchical digital twin of a Naval DC power system has been developed and experimentally verified. Similar to other state-of-the-art digital twins, this technology creates a digital replica of the physical system executed in real-time or faster, which can modify hardware controls. However, its advantage stems from distributing computational efforts by utilizing a hierarchical structure composed of lower-level digital twin blocks and a higher-level system digital twin. Each digital twin block is associated with a physical subsystem of the hardware and communicates with a singular system digital twin, which creates a system-level response. By extracting information from each level of the hierarchy, power system controls of the hardware were reconfigured autonomously. This hierarchical digital twin development offers several advantages over other digital twins, particularly in the field of naval power systems. The hierarchical structure allows for greater computational efficiency and scalability while the ability to autonomously reconfigure hardware controls offers increased flexibility and responsiveness. The hierarchical decomposition and models utilized were well aligned with the physical twin, as indicated by the maximum deviations between the developed digital twin hierarchy and the hardware.
Gen AI Study Jams _ For the GDSC Leads in India.pdf
PATTERN RECOGNITION OF CHILDREN STORIES BASED ON THEMES USING TIME SERIES DATA
International Journal on Natural Language Computing (IJNLC) Vol. 7, No. 5, October 2018
DOI: 10.5121/ijnlc.2018.7504
PATTERN RECOGNITION OF CHILDREN STORIES
BASED ON THEMES USING TIME SERIES DATA
Ms. Menaka Sikdar¹ and Ms. Pranita Sarma²
¹ Ph.D. Research Scholar, Department of Statistics, Gauhati University, Guwahati, Assam, India
² Professor, Department of Statistics, Gauhati University, Guwahati, Assam, India
ABSTRACT
This paper presents a study for three languages namely Assamese, Bengali and English. The main objective
of this study is to recognize the patterns based on language model with special reference to children
stories in order to find the distinction among all these languages. We consider only the children stories
because they are found to be similar all over the world with different flavors produced by different
cultures, languages and time. Sixteen stories based on themes namely Fairy tales, Fables, Jack tales and
Formula tales are considered for analysis. The significant differences among the types of stories
written by different authors are verified. An autocorrelation test is performed by using the Box-Ljung statistic to determine the randomness of the data by observing the language data as time series data. However, for some of the stories, the data are found to be random, and hence the Kolmogorov goodness of fit
test and Smirnov test are conducted to test the significant differences only for those stories. It has been
shown that there exist significant differences among the writing patterns of the children stories written by
different authors in different languages.
KEYWORDS
Autocorrelation, Box-Ljung statistic, Empirical distribution, Kolmogorov goodness-of-fit test, Smirnov test.
1. INTRODUCTION
Most children stories are compiled from our classical folklores and folktales. The structures of these stories may change, but the skeleton of the stories, along with the moral lessons they are intended to impart to society, remains largely the same across time and place. Conventional Indo-European stories may be structurally different in many respects, but their bases and themes are the same. Variation in the writing patterns of stories with respect to different languages, cultures and authors seems quite natural. The skeleton of the stories being the same, it is difficult to say at first glance whether any significant differences exist amongst them. Children stories are mainly classified into four categories, namely fairy tales¹, fables², Jack tales³ and formula tales⁴.
Understanding the complexity associated with the patterns of writing offered by different authors in various languages is not an easy task. Various mathematical tools have been adopted to gain an understanding of the complexity of human language. One of these tools is time series analysis (D.S.G. Pollock, 1999). Time series analysis plays a key role not only in the physical sciences but also in the statistical sciences. Time series analysis has been used to investigate written human language texts by Kosmidis et al. (2006) and Deng et al. (2011). Sikdar and Sarmah (2017A, 2017B, 2017C and 2017D) presented four articles on three languages, namely Assamese, Bengali and English, for pattern recognition of language model with special reference to children stories, of which one is based on direct speech. In this article, sixteen stories based
on themes, namely fairy tales, fables, Jack tales and formula tales, are considered for analysis. An autocorrelation test is performed using the Box-Ljung statistic to determine the randomness of the data, treating the language data as time series data. For the stories whose data are found to be random, the Kolmogorov goodness-of-fit test and the Smirnov test are conducted to test the significant differences. The significant differences among the types of stories written by different authors are verified. Section 2 of this article states the objective of the study, whereas the source of data is presented in Section 3. The materials and methods used to perform the analysis are given in Section 4. Section 5 contains the results of the analysis.
2. OBJECTIVE OF THE STUDY
The main objectives of our study are:
a) To gain an understanding of the process generating the sentence length time series (SLTS) [described in Section 4.1].
b) To recognize the patterns of writing presented by different authors in three different languages, namely Assamese, Bengali and English, with respect to the types of stories.
c) To study the significant differences among the stories which are theme-wise similar.
3. SOURCES OF DATA
For analyzing the children stories based on themes, the data have been collected from the stories by Sahityo-rothi Laxminath Bezbarua, Upendrakishore Roy Choudhury and the Grimm brothers. The stories from each category (fairy tales, fables, Jack tales and formula tales) which are theme-wise similar in 'Burhi Aai'r Xaadhu' (Assamese), 'Tuntunir Boi' (Bengali) and Grimm's Fairy Tales (English) are selected for the purpose of analysis. The numbers of words in the sentences and the numbers of sentences of the corpus under consideration are counted by using Microsoft Office Excel 2007. The list of theme-wise similar stories is given in Table 1.
Table 1. List of selected stories
4. MATERIAL AND METHODS
To achieve the objectives of our study, we have performed an autocorrelation test using the Box-Ljung statistic to determine the randomness of the data. For the stories whose data are found to be random, the Kolmogorov goodness-of-fit test, the Smirnov test, the Kruskal-Wallis test
and the Squared rank test (Conover W.J., 2006) have been conducted to test the significant differences among the stories.
4.1. Language Time Series
At a glance, it does not seem that language and time series have any relation between them.
Before applying the time series analysis method, we can observe the written document (story) as
time series data. Let us take a document (story) having n sentences and let S_i (i = 1, 2, …, n) be the length (total number of words) of the i-th sentence. The position of a sentence in a written document depends on time. Let t_i be the time epoch at which the completion of the i-th sentence takes place, ∀ i = 1, 2, …, n. Hence our data set contains (t_i, S_i), ∀ i = 1, 2, …, n.
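As an illustration of how such a sentence length time series might be constructed, the following sketch uses base R; the sentence-splitting rule and the sample text are assumptions made only for illustration (the authors report counting words and sentences with Microsoft Office Excel 2007).

# Minimal sketch (base R): build a sentence length time series (SLTS)
# from a plain-text story. Splitting sentences on ". ! ?" is an assumed rule.
story <- "Once upon a time there lived a king. He had three sons. They set out on a journey."
sentences <- unlist(strsplit(story, "[.!?]+"))
sentences <- trimws(sentences)
sentences <- sentences[nchar(sentences) > 0]
S <- sapply(strsplit(sentences, "\\s+"), length)   # S_i = number of words in sentence i
t <- seq_along(S)                                  # t_i = completion epoch of sentence i
slts <- data.frame(t = t, S = S)                   # the data set (t_i, S_i), i = 1, ..., n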
4.2. Autocorrelation Test
An autocorrelation test has been performed to detect the randomness of the distribution of the sentence lengths of different stories written in three different languages, namely Assamese, Bengali
and English. The autocorrelation function [ACF] can be used to detect non-randomness in data
and also to identify an appropriate time series model if the data are not random. For our data set
(S_i, t_i), i = 1, 2, …, n, the lag-k autocorrelation function may be defined as
r_k = \frac{\sum_{i=1}^{n-k} (S_i - \bar{S})(S_{i+k} - \bar{S})}{\sum_{i=1}^{n} (S_i - \bar{S})^2}
The value of r_k ranges from -1 to 1, and the larger the absolute value of r_k, the stronger the correlation in the time series. Randomness is ascertained by computing the autocorrelations of the data set at varying time lags. If the data are random, such autocorrelations should be near zero for any and all time-lag separations; if non-random, one or more of the autocorrelations will be significantly non-zero. The Box-Ljung test (1978) is used to test whether or not observations over time are random and independent. In particular, for a given k, it tests the following hypotheses:
H0j: the autocorrelations of the lengths of sentences up to lag k under the j-th story are all zero, for j = 1, 2, …, m.
H1j: the autocorrelations of the lengths of sentences of the j-th story at one or more lags differ from zero.
The test statistic is calculated as
Q = n(n+2) \sum_{j=1}^{k} \frac{r_j^2}{n-j}
which approximately follows a chi-square distribution with k degrees of freedom.
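For readers who wish to reproduce this step, the lag-wise autocorrelations and the Box-Ljung statistic can be computed in base R as sketched below; the series S is the sentence length vector from Section 4.1 (the results reported in Table 2 were obtained with SPSS).

# Minimal sketch (base R): lag-k autocorrelations r_1..r_5 of the sentence
# lengths and the Box-Ljung test at lags 1 to 5, as reported in Table 2.
r <- acf(S, lag.max = 5, plot = FALSE)$acf[-1]      # drop the lag-0 value
print(round(r, 3))
for (k in 1:5) {
  print(Box.test(S, lag = k, type = "Ljung-Box"))   # Q statistic, df = k, p-value
}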
4.3. The Empirical Distribution Function
The concept of the empirical distribution function is used for studying the probabilistic structure of the distributions of sentence lengths of the various stories which are randomly distributed. It is interesting to note that, except for formula tales, randomness is exhibited in the other categories of stories. In the case of the distribution of length (number of words) of sentences of a
particular story, our data consist of a random sample S_1, S_2, …, S_n of size n. The empirical distribution function is
F_S(x) = (number of S_i ≤ x) / n.
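A short base R sketch of this empirical distribution function is given below; ecdf() implements exactly F_S(x) = (number of S_i ≤ x)/n, and the plot labels match those used in Figures 1 to 7.

# Minimal sketch (base R): empirical distribution function of the sentence lengths.
Fs <- ecdf(S)                 # F_S(x) = (number of S_i <= x) / n
Fs(10)                        # proportion of sentences with at most 10 words
plot(Fs, main = "ecdf(s)", xlab = "x", ylab = "Fn(x)")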
4.4. Kolmogorov Goodness-of-Fit Test (One-Sample Kolmogorov-Smirnov Test)
Our objective is to verify whether these random variables are normally distributed. The Kolmogorov goodness-of-fit test is used for this purpose. The lengths (numbers of words) of the sentences of a particular story constitute a random sample S_1, S_2, …, S_n of size n associated with some unknown distribution function, denoted by Q(s).
Test Statistic
Let F(s) be the empirical distribution function based on the random sample S_1, S_2, …, S_n of size n, and let Q*(s) be a completely specified hypothesized distribution function, which is taken here to be a normal probability distribution function.
(Two-Sided Test) Let the test statistic T_1 be the greatest vertical distance between F(s) and Q*(s); mathematically,
T_1 = \sup_s | Q^*(s) - F(s) |
Null Distribution: When Q(s) is continuous and the null hypothesis is true, the approximate distribution function of T_1 is P(T_1 \le s) \approx [H(s)]^2, where H(s) is the null distribution function of the corresponding one-sided statistic,
H(s) = 1 - s \sum_{j=0}^{[n(1-s)]} \binom{n}{j} \left(s + \frac{j}{n}\right)^{j-1} \left(1 - s - \frac{j}{n}\right)^{n-j}
and [n(1-s)] is the greatest integer less than or equal to n(1-s).
Hypotheses
The null hypothesis to be tested is
H0: Q(s) = Q*(s) for all s from -∞ to +∞
H1: Q(s) ≠ Q*(s) for at least one value of s
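In R, this test of normality can be sketched as follows; estimating the mean and standard deviation of the hypothesized normal distribution from the sample is an assumption made here for illustration, and with estimated parameters the resulting p-value is only approximate.

# Minimal sketch (base R): one-sample Kolmogorov-Smirnov test of the sentence
# lengths against a normal distribution, as in Table 3. The hypothesized Q*(s)
# is N(mean(S), sd(S)); using estimated parameters makes the p-value approximate.
ks.test(S, "pnorm", mean = mean(S), sd = sd(S))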
4.5. Smirnov Test (Two-Sample Kolmogorov-Smirnov Test)
Kolmogorov and Smirnov developed a statistical procedure that uses the maximum vertical distance between two empirical distribution functions as a measure of how well the functions resemble each other. Our objective is to compare the distributions of two stories which are theme-wise similar but written by different authors in different languages. The lists of such similar stories given in Table 1 are considered for our analysis. Here we try to determine, by applying the Smirnov test, whether the two distribution functions of the lengths of sentences under two similar stories are identical.
Let S_{11}, S_{12}, …, S_{1n_1} be the lengths of the n_1 sentences under story 1, associated with some unknown distribution function denoted by Q_1(s), and let S_{21}, S_{22}, …, S_{2n_2} be the lengths of the n_2 sentences under story 2, associated with some unknown distribution function denoted by Q_2(s).
Test Statistic
Let F_1(s) and F_2(s) be the empirical distribution functions of the lengths of sentences under story 1 and story 2, respectively. The test statistic for the two-sided test is defined as
T_2 = \sup_s | F_1(s) - F_2(s) |
Null Distribution
The exact null distribution of T_2 is obtained by considering all orderings of the S_1's and S_2's to be equally likely under the null hypothesis and computing T_2, as appropriate, for each ordering. Quantiles of the null distribution are given in Table 19 for equal sample sizes and Table 20 for unequal sample sizes. [Table 19 and Table 20 are given in Conover W.J. (2006).]
Hypotheses (Two-Sided Test)
H0: Q_1(s) = Q_2(s) for all s from -∞ to +∞
H1: Q_1(s) ≠ Q_2(s) for at least one value of s
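A base R sketch of this comparison is given below; S1 and S2 stand for the sentence length vectors of two theme-wise similar stories (the results reported in Table 4 were obtained with SPSS).

# Minimal sketch (base R): Smirnov (two-sample Kolmogorov-Smirnov) test comparing
# the sentence length distributions of two theme-wise similar stories.
# S1 and S2 are the sentence length vectors of story 1 and story 2.
ks.test(S1, S2)    # T2 = sup_s |F1(s) - F2(s)| and its p-value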
5. ANALYSIS OF DATA
Autocorrelations and results of the Box-Ljung test for the sentence length time series of the different stories under the different languages are obtained by using SPSS⁵ software and are given in Table 2.
Table 2. Results of autocorrelation and Box-Ljung test

Language   Category      Story   Lag   Autocorrelation   Box-Ljung value   d.f.   p-value
English    Fairy tales   R       1     -0.044            0.112             1      0.74
                                 2     -0.044            0.227             2      0.90
                                 3     -0.208            2.826             3      0.42
                                 4     -0.235            2.900             4      0.58
                                 5      0.245            6.673             5      0.25
Assamese   Fairy tales   SJX     1      0.183            4.602             1      0.03
                                 2      0.083            5.558             2      0.06
                                 3     -0.030            5.685             3      0.13
                                 4      0.018            5.731             4      0.22
                                 5     -0.065            6.329             5      0.28
Assamese   Fable         BS      1      0.140            1.402             1      0.24
                                 2      0.002            1.402             2      0.50
                                 3      0.027            1.455             3      0.69
                                 4      0.104            2.256             4      0.69
                                 5      0.117            3.284             5      0.66
Bengali    Fable         MS      1      0.143            2.380             1      0.12
                                 2      0.116            3.963             2      0.14
                                 3     -0.023            4.025             3      0.26
                                 4     -0.201            8.827             4      0.07
                                 5     -0.022            8.886             5      0.15
Assamese   Fable         BAS     1      0.218            3.410             1      0.07
                                 2      0.292            9.650             2      0.008*
                                 3      0.077            10.091            3      0.018
                                 4      0.088            10.676            4      0.030
                                 5      0.134            12.058            5      0.034
It is interesting to note that the autocorrelations between the lengths of the sentences do not follow any particular pattern. At certain lags they are found to be very small, whereas at other lags their absolute values are slightly bigger than 0.2. For this reason the Box-Ljung statistic is used to verify the randomness of the data. Three categories, namely fairy tales, fables and Jack tales, are found to be random in nature, whereas formula tales show non-randomness in the data under consideration. As indicated in the note under Table 2, the randomness of the lengths of the sentences of the stories under the different categories may be decided. Randomness of the data is exhibited for stories of all categories except formula tales.
The empirical distribution functions have been plotted for similar (theme-wise) stories in the same graph by using the R⁶ language, and they are given below.
[Legend: Assamese, Bengali, English]
Figure 1. (R & SJX)   Figure 2. (BS & MS)   Figure 3. (TDATS & CABK)   Figure 4. (Sj & DK)
Figure 5. (BAS & CAMIP)   Figure 6. (PBK & CG)   Figure 7. (BBAS & KBK)
[The empirical distribution functions Fn(x) of the sentence lengths x of the different stories under Assamese, Bengali and English are represented in Figures 1 to 7.]
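A sketch of how two such empirical distribution functions can be overlaid in R is given below; S1 and S2 denote the sentence length vectors of a theme-wise similar pair, and the colours and legend labels are placeholders.

# Minimal sketch (base R): overlay the ECDFs of two theme-wise similar stories,
# as in Figures 1 to 7. S1 and S2 are the two sentence length vectors.
plot(ecdf(S1), main = "ecdf(s)", xlab = "x", ylab = "Fn(x)", col = "blue")
lines(ecdf(S2), col = "red")
legend("bottomright", legend = c("Story 1", "Story 2"), col = c("blue", "red"), lty = 1)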
Results of the Kolmogorov goodness-of-fit test are obtained by using SPSS software and are given in Table 3.
Table 3. Results of the one-sample Kolmogorov-Smirnov test for normality

Language   Story   Sample size (n)   Mean    S.D.    Absolute difference (D)   K-S test statistic (Z)   p-value
English    R       55                25.51   13.60   0.119                     0.880                    0.421
English    CAMIP   52                18.75   13.42   0.167                     1.204                    0.110
English    TDATS   59                22.81   11.48   0.155                     1.193                    0.166
English    DK      35                22.23   13.28   0.190                     1.123                    0.160
English    CG      41                22.73   14.25   0.161                     1.033                    0.236
Assamese   SJX     134               14.05   8.35    0.130                     1.509                    0.021
Assamese   BS      68                13.28   8.738   0.147                     1.216                    0.104
Assamese   BAS     69                12.61   7.81    0.154                     1.279                    0.076
Assamese   Sj      88                11.85   7.46    0.149                     1.398                    0.040
Assamese   BBAS    47                14.02   9.95    0.179                     1.228                    0.098
Bengali    MS      113               9.97    5.34    0.083                     0.879                    0.422
Bengali    CABK    50                8.8     4.45    0.131                     0.929                    0.354
Bengali    KBK     74                9.70    5.18    0.141                     1.212                    0.106
Bengali    PBK     55                8.33    4.45    0.136                     1.010                    0.259
From Table 3, it can be seen that the p-values of the test statistics for the distributions of sentence lengths of the different stories (except for SJX and Sj) under the different languages are greater than 0.05. Therefore we may accept our null hypotheses at the 5% level of significance. On the other hand, the p-values of the test statistics for the stories SJX and Sj (Assamese) are greater than 0.01, and hence the null hypotheses for these stories may be accepted at the 1% level of significance. Hence we may conclude that the distributions of sentence lengths of the different stories under the different languages, namely Assamese, Bengali and English, are normally distributed.
Results of the Smirnov test are obtained by using SPSS software and are given in Table 4.
Table 4. Results of the two-sample Kolmogorov-Smirnov (Smirnov) test

Stories        Languages            Category   Sample sizes (n)   Absolute difference (D)   K-S Z   p-value
R & SJX        English & Assamese   Fairy      55 & 134           0.430                     2.686   0.000
BS & MS        Assamese & Bengali   Fable      68 & 113           0.211                     1.377   0.045
TDATS & CABK   English & Bengali    Fable      59 & 50            0.707                     3.680   0.000
BAS & CAMIP    Assamese & English   Fable      69 & 52            0.307                     1.672   0.007
Sj & DK        Assamese & English   Jack       88 & 35            0.425                     2.215   0.000
BBAS & KBK     Assamese & Bengali   Jack       47 & 74            0.252                     1.349   0.053
CG & PBK       English & Bengali    Jack       41 & 55            0.561                     2.721   0.000
From Table 4, it can be seen that the p-values of the test statistics for the distributions of sentence lengths of similar stories under English and Assamese, as well as Bengali and English, are less than 0.05, and we may reject our null hypotheses at the 5% level of significance. Therefore we may conclude that the distributions of sentence lengths of stories under English are significantly different from those under Assamese and Bengali. On the other hand, the p-value of the test statistic for the stories BS (Assamese) and MS (Bengali) is greater than 0.01, and the null hypothesis may be accepted at the 1% level of significance. Again, the p-value of the test statistic for the stories BBAS (Assamese) and KBK (Bengali) is greater than 0.05, and the null hypothesis may be accepted at the 5% level of significance. Therefore we may conclude that the distributions of sentence lengths of the Assamese and Bengali stories are not significantly different.
6. CONCLUSION
This study presents the variation of children stories with respect to language, similar themes and authors. The stories were written during the 19th and 20th centuries, and we observed similarity as well as dissimilarity in many cases. The autocorrelation test shows that only for formula tales are the sentence structures found to be non-random. While analyzing the children stories in all three languages, it is observed that there is a striking dissimilarity between the modern Indian languages and English. The sentence structures of the English stories are significantly different from those of the Assamese and Bengali stories. It has been noticed that the patterns corresponding to the Assamese and Bengali stories are similar to a great extent. One possible reason is that both originate from the same Sanskrit source, and hence further investigation of the grammatical structure is necessary. Our future work will be devoted to pattern recognition based on grammar.
REFERENCES
[1] Amit M., Shemerler Y. and Eisenberg E. (1994) "Language and codification dependence of long-range correlation in texts," Fractals, vol. 2, pp 7-13.
[2] Ausloos M. (2008) "Equilibrium and dynamic methods when comparing an English text and its Esperanto translation," Physica A, vol. 387, pp 6411-6420.
[3] Conover W.J. (2006) Practical Nonparametric Statistics (3rd ed.), John Wiley and Sons Inc.
[4] Deng W.B., Wang D.J., Li W. et al. (2011) "English and Chinese language frequency time series analysis," Chinese Science Bulletin, vol. 56, no. 34, pp 3717-3722.
[5] Duda Richard O., Hart Peter E. and Stork David G. (2000) Pattern Classification (2nd ed.), John Wiley and Sons Inc.
[6] Gujarati D.N., Porter D.C. and Gunasekar S. (2012) Basic Econometrics (5th ed.), McGraw Hill Education (India) Private Limited.
[7] Kosmidis K., Kalampokis A. and Argyrakis P. (2006) "Language time series analysis," Physica A, vol. 370, pp 808-816.
[8] Ljung G.M. and Box G.E.P. (1978) "On a measure of a lack of fit in time series models," Biometrika, vol. 65, pp 297-303.
[9] Maurice G. Kendall and William R. Buckland (1971) A Dictionary of Statistical Terms, Hafner Publishing Company, New York.
[10] Mukhopadhyay P. (2005) Applied Statistics (2nd ed.), Books and Allied (P) Ltd.
[11] Schenkel A., Zhang J. and Zhang Y.C. (1993) "Long range correlation in human writings," Fractals, vol. 1, pp 47-57.
[12] Sikdar M. and Sarmah P. (2017A) "Pattern Recognition in Language Model with Special Reference to Children Stories," International Journal of Innovative Research and Advanced Studies (IJIRAS), vol. 4, issue 3, pp 397-406.
[13] Sikdar M. and Sarmah P. (2017B) "Statistical Pattern Classification of Direct Speeches in Children Stories," International Journal of Applied Mathematics and Statistical Sciences, ISSN (P): 2319-3972; ISSN (E): 2319-3980, vol. 6, issue 5, pp 7-18.
[14] Sikdar M. and Sarmah P. (2017C) "Pattern Recognition of Children Stories with Special Reference to 'Jack Tales'," International Journal of Applied Mathematics and Statistical Sciences (IJAMSS), ISSN (P): 2319-3972; ISSN (E): 2319-3980, vol. 6, issue 5, pp 81-90.
¹ Fairy tale: A fairy tale is usually about magic. There may be a witch, but good always triumphs over bad. Often the characters are children, who beat the evil person and 'live happily ever after'.
² Fable: A fable uses animals, legendary creatures, plants, inanimate objects or forces of nature that are anthropomorphized to explain moral values to children.
³ Jack tale: A Jack tale is a category of 'folk tale' originating in the Middle Ages; such tales usually have a character called Jack who appears lazy or stupid but actually wins in the end because he is 'tricky'.
⁴ Formula tale: A formula tale comes under a category where a chain is created in the story that goes on increasing in size towards the end of the story.
⁵ www.ibm.com/analytics/us/en/technology/spss
⁶ www.r-project.org
AUTHORS
Ms. Menaka Sikdar is a research scholar in the Department of Statistics, Gauhati University. She is pursuing her research work in language pattern classification.
Ms. Pranita Sarmah is a retired professor in the Department of Statistics, Gauhati University. Her research interests are reliability theory, statistical inference and stochastic modeling. She has several publications in the areas of health, finance and various social issues. Recently she has published some papers on stochastic modeling in Indian classical music. Presently, she is working on language pattern classification.