This document discusses novel cochlear-filter-based cepstral coefficients (CFCC) for classification of unvoiced fricatives. The authors propose using CFCC features derived from an auditory transform, implemented as a bank of cochlear filters, to model the human auditory system. Experimental results show that CFCC performs better than MFCC for individual fricative classification, with an average 3.41% higher accuracy in clean conditions and lower error rates. CFCC also shows better noise robustness, with classification accuracy dropping less in noisy conditions than with MFCC. The document provides background on previous work classifying fricatives, details of the proposed CFCC feature extraction method, and comparisons of auditory transforms to Fourier transforms.
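Both MFCC and CFCC share the same back end: filterbank energies are log-compressed and decorrelated with a DCT; the features differ in the front-end filterbank (mel-spaced triangular filters vs. cochlear filters). A minimal sketch of that shared cepstral step, with conventional but assumed sizes (26 filterbank channels, 13 coefficients):

```python
import numpy as np

def cepstral_coefficients(filterbank_energies, num_coeffs=13):
    """Generic cepstral extraction: log compression of filterbank
    energies followed by DCT-II decorrelation. The filterbank itself
    (mel vs. cochlear) is assumed to have been applied upstream."""
    log_energies = np.log(filterbank_energies + 1e-10)  # avoid log(0)
    n = len(log_energies)
    # DCT-II basis, written out explicitly to keep the sketch dependency-free
    k = np.arange(num_coeffs)[:, None]
    m = np.arange(n)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return dct_basis @ log_energies

# Example: 26 simulated filterbank energies -> 13 cepstral coefficients
energies = np.abs(np.random.default_rng(0).normal(size=26)) + 1.0
coeffs = cepstral_coefficients(energies)
print(coeffs.shape)  # (13,)
```

The channel count and coefficient count here are illustrative defaults, not values taken from the paper.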
A Comparative Study: Gammachirp Wavelets and Auditory Filter Using Prosodic F...CSCJournals
Modern automatic speech recognition (ASR) systems typically use a bank of linear filters as the first step in performing frequency analysis of speech. The cochlea, on the other hand, which is responsible for frequency analysis in the human auditory system, is known to have a compressive non-linear frequency response that depends on the input stimulus level. This paper presents a new method based on the gammachirp auditory filter implemented through continuous wavelet analysis. The essential characteristic of this model is that it performs an analysis by wavelet packet transformation on frequency bands that approximate the critical bands of the ear, unlike the existing model based on short-term Fourier transform (STFT) analysis. Prosodic features such as pitch, formant frequency, jitter and shimmer are extracted from the fundamental frequency contour and added to baseline spectral features, specifically Mel Frequency Cepstral Coefficients (MFCC), Gammachirp Filterbank Cepstral Coefficients (GFCC) and Gammachirp Wavelet Frequency Cepstral Coefficients (GWFCC). The results show that the gammachirp wavelet gives results comparable to those obtained with MFCC and GFCC. Experimental results show the best performance of this architecture. This paper implements the GW and examines its application to a specific example of speech. Implications for noise-robust speech analysis are also discussed using the AURORA databases.
A NOVEL METHOD FOR OBTAINING A BETTER QUALITY SPEECH SIGNAL FOR COCHLEAR IMPL...acijjournal
Cochlear implant devices have existed for a long time. The purpose of the present work is to develop a speech algorithm for obtaining robust speech. In this paper, the technique of cochlear implants is first introduced, followed by discussions of some of the existing techniques for obtaining speech. The next section introduces a new technique for obtaining robust speech. The key feature of this technique lies in an integrated approach combining an estimation technique, such as a Kalman filter, with a non-linear filter bank strategy using the Dual Resonance Non-Linear (DRNL) filter and the Single Side Band (SSB) encoding method. A comparative study indicates that the proposed method performs well compared to the existing method.
Text mining is a new and exciting research area that tries to solve the information overload problem by using techniques from machine learning, natural language processing (NLP), data mining, information retrieval (IR), and knowledge management. Text mining involves the pre-processing of document collections, such as information extraction, term extraction, text categorization, and storage of intermediate representations. Techniques such as clustering, distribution analysis, association rules, and visualisation of the results are then used to analyse these intermediate representations.
Kamba Part of Speech Tagger Using Memory Based Approach (ijnlc)
Part of speech tagging is very important and is the initial work towards machine translation and text manipulation. Though much has been done in this regard for the Indo-European and Asiatic languages, development of part of speech tagging tools for African languages is wanting. As a result, these languages are classified as under-resourced languages. This paper presents data-driven part of speech tagging tools for Kikamba, an under-resourced language spoken mostly in Machakos, Makueni and Kitui. The tool is built using the lazy learner called Memory Based Tagger (MBT) with a corpus of approximately thirty thousand words. The corpus is collected, cleaned, formatted for MBT, and the experiment run. Very encouraging performance is reported despite the small corpus, which clearly shows that, using state-of-the-art data-driven methods, tools can be developed for under-resourced languages. We report a precision of 83%, a recall of 72% and an F-score of 75%; in terms of accuracy for known and unknown words, we report 94.65% and 71.93% respectively, with an overall accuracy of 90.68%. This suggests that, with a small corpus and a data-driven approach, we can generate tools for the under-resourced languages of Kenya.
Construction of Resources Using Japanese-Spanish Medical Data (ijnlc)
In recent years, many NLP researchers have focused on constructing medical ontologies. This paper introduces a technique for extracting medical information from Wikipedia pages using a dictionary, which we then evaluate on a Japanese-Spanish SMT system. The study shows an increase in the BLEU score.
Identification of prosodic features of Punjabi for enhancing the pronunciatio... (ijnlc)
Voice browsing requires a speech interface framework. Pronunciation Lexicon Specification (PLS) 1.0 is a recommendation of the Voice Browser Working Group of the W3C (World Wide Web Consortium): a machine-readable specification of pronunciation information that can be used for speech technology development. This global PLS standard is applicable across European and Asian languages, and the specification is extendable to all human languages. However, it currently does not cover morphological, syntactic and semantic information associated with pronunciations. In Indian languages, grammatical information is encoded relatively more in morphology than in syntax, unlike English, where grammatical information is an integral part of syntax. In this paper, PLS 1.0 has been examined from the perspective of augmentation with prosodic features of Punjabi such as tone, gemination, etc.
In this paper I first compare Single Label Text Categorization with Multi Label Text Categorization in detail, and then compare Document Pivoted Categorization with Category Pivoted Categorization in detail. For this purpose I give the general definition of Text Categorization with its mathematical notation, for the sake of frugality and cost effectiveness. Then, with the help of mathematical notation and set theory, I convert the general definitions of Single Label and Multi Label Text Categorization into their respective mathematical representations, and discuss Binary Text Categorization as a special case of Single Label Text Categorization. After comparing Single Label with Multi Label Text Categorization, I find that Single Label (or Binary) Text Categorization is more general than Multi Label Text Categorization. Thereafter I discuss an algorithm for transforming Multi Label Classification into Binary Classification and explain the conditions under which this transformation holds. In the second step I compare Document Pivoted Categorization with Category Pivoted Categorization in detail, finding that Category Pivoted Categorization is more complex than Document Pivoted Categorization; it becomes more complicated still when a new category is added to the predefined set of categories and recurrent classification of documents takes place. Finally I compare Hard Categorization with Ranking Categorization. Hard Categorization incorporates 'hard decisions' about the relevance or belonging of a document to a category; such a decision is either completely true or completely false. Ranking Categorization, by contrast, ranks the belonging of a document to a category according to its estimated appropriateness; the final ranked list is then used by a human expert for the final Text Categorization decision.
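The transformation of a multi-label problem into binary problems that the abstract describes can be sketched as a one-vs-rest decomposition; the document names and category labels below are hypothetical:

```python
def multilabel_to_binary(documents):
    """Transform a multi-label categorization problem into a set of
    independent binary problems, one per category: each document is
    labeled True for a category iff that category is in its label set."""
    categories = sorted({c for _, labels in documents for c in labels})
    binary_problems = {}
    for cat in categories:
        binary_problems[cat] = [(doc, cat in labels) for doc, labels in documents]
    return binary_problems

docs = [("doc1", {"sports"}), ("doc2", {"sports", "politics"}), ("doc3", set())]
problems = multilabel_to_binary(docs)
print(problems["politics"])
# [('doc1', False), ('doc2', True), ('doc3', False)]
```

Each binary problem can then be handed to an ordinary binary classifier, which is the standard reduction behind this kind of transformation.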
A Syntactic Analysis Model for Vietnamese Questions in V-DLG~TABL System (ijnlc)
This paper introduces a syntactic analysis model that we propose to parse and process Vietnamese questions about tablets in the V-DLG~TABL system, a Vietnamese Question-Answering system based on an automatic dialog mechanism. The V-DLG~TABL system is built to support clients searching for tablets with Vietnamese questions, through interaction between the clients and the system. We apply the "Phrase Structure Grammar" of Noam Chomsky to develop a syntactic analysis model that is specific and suitable for the V-DLG~TABL system. This syntactic analysis model is used to implement the "V-DLG~TABL Syntactic Parsing and Processing" component of the system.
In this paper, a novel hierarchical Persian stemming approach based on the Part-Of-Speech (POS) of the word in a sentence is presented. The implemented stemmer includes hash tables and several deterministic finite automata (DFA) at different levels of its hierarchy for removing the prefixes and suffixes of words. We had two intentions in using hash tables in our method. The first is that the DFAs do not support some special words, and hash tables can partly address this problem. The second is to speed up the implemented stemmer by omitting the time the DFAs need. Because of the hierarchical organization, this method is fast and flexible. Our experiments on test sets from the Hamshahri Collection and Security News from the ICTna.ir site show that our method has an average accuracy of 95.37%, which improves further when the method is used on a test set with common topics.
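The two-level lookup described above can be sketched as follows; the entries are hypothetical English stand-ins (the real stemmer works on Persian words with DFA-based affix levels), with a dict playing the hash-table role and a suffix list standing in for one DFA level:

```python
# Hash-table level: irregular words the automata cannot handle
# (hypothetical entries for illustration).
EXCEPTIONS = {"better": "good"}
# Suffix-stripping pass standing in for one DFA level of the hierarchy.
SUFFIXES = ["ing", "ed", "s"]

def stem(word):
    if word in EXCEPTIONS:              # O(1) hash-table hit first
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:             # then affix removal
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(stem("better"))   # good
print(stem("walking"))  # walk
```

Checking the hash table first is what gives the speed-up the authors mention: most words short-circuit before any automaton runs.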
GENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGING (ijnlc)
With the growing number of textual resources available, the ability to understand them becomes critical.
An essential first step in understanding these sources is the ability to identify the parts-of-speech in each
sentence. Arabic is a morphologically rich language, which presents a challenge for part of speech
tagging. In this paper, our goal is to propose, improve, and implement a part-of-speech tagger based on a
genetic algorithm. The accuracy obtained with this method is comparable to that of other probabilistic
approaches.
An Implementation of Apertium Based Assamese Morphological Analyzer (ijnlc)
Morphological analysis is an important branch of linguistics for any Natural Language Processing technology. Morphology studies word structure and word formation in a language. In the current NLP research scenario, morphological analysis techniques are becoming ever more popular. To process any language, the morphology of its words should first be analyzed. The Assamese language has a very complex morphological structure. In our work we have used Apertium-based Finite State Transducers to develop a morphological analyzer for Assamese within a limited domain, achieving 72.7% accuracy.
A Hybrid Approach to Word Sense Disambiguation With and With... (ijnlc)
Word Sense Disambiguation is the classification of the meaning of a word in a precise context, a tricky task in Natural Language Processing that, because of its semantic nature, is used in applications like machine translation, information extraction and retrieval, and automatic or closed-domain question answering. Researchers have tried unsupervised and knowledge-based learning approaches, but such approaches have not proved very helpful. Various supervised learning algorithms have been tried, but in vain, as creating the training corpus, a tagged sense-marked corpus, is itself tricky. This paper presents a hybrid approach for resolving ambiguity in a sentence based on integrating lexical knowledge and world knowledge. The English WordNet developed at Princeton University, the SemCor corpus and the JAWS library (Java API for WordNet Searching) have been used for this purpose.
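The paper's exact hybrid method is not spelled out in the abstract, but the lexical-knowledge side of such approaches is often illustrated with a Lesk-style gloss-overlap heuristic: pick the sense whose dictionary gloss shares the most words with the sentence context. A minimal sketch, with hypothetical glosses standing in for WordNet entries:

```python
def lesk_overlap(context_tokens, sense_glosses):
    """Simplified Lesk-style disambiguation: choose the sense whose
    gloss has the largest word overlap with the context. The glosses
    here are hypothetical stand-ins for WordNet definitions."""
    context = set(context_tokens)
    return max(sense_glosses,
               key=lambda s: len(context & set(sense_glosses[s].split())))

glosses = {
    "bank#1": "financial institution that accepts deposits money",
    "bank#2": "sloping land beside a body of water river",
}
print(lesk_overlap(["deposit", "money", "account"], glosses))  # bank#1
```

A real system would draw the glosses from WordNet (e.g. via JAWS in Java, as the paper does) and combine this lexical signal with world knowledge.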
Sentiment Analysis for Modern Standard Arabic and Colloquial (ijnlc)
The rise of social media such as blogs and social networks has fueled interest in sentiment analysis. With the proliferation of reviews, ratings, recommendations and other forms of online expression, online opinion has turned into a kind of virtual currency for businesses looking to market their products, identify new opportunities and manage their reputations; therefore many are now looking to the field of sentiment analysis. In this paper, we present a feature-based, sentence-level approach for Arabic sentiment analysis. Our approach uses an Arabic idioms/saying-phrases lexicon as a key component for improving the detection of sentiment polarity in Arabic sentences, together with a novel and rich set of linguistically motivated features (contextual intensifiers, contextual shifters and negation handling) and syntactic features for conflicting phrases, which enhance the sentiment classification accuracy. Furthermore, we introduce an automatically expandable, wide-coverage polarity lexicon of Arabic sentiment words. The lexicon is built with manually collected and annotated gold-standard sentiment words as a seed, and it automatically expands and detects the sentiment orientation of new sentiment words using a synset aggregation technique and free online Arabic lexicons and thesauruses. Our data focus on Modern Standard Arabic (MSA) and Egyptian dialectal Arabic tweets and microblogs (hotel reservations, product reviews, etc.). The experimental results using our resources and techniques with an SVM classifier indicate high performance levels, with accuracies of over 95%.
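The feature families the abstract lists (negation handling, contextual intensifiers, a polarity lexicon) can be illustrated with a minimal lexicon-based scoring sketch; the lexicon entries, scores, and English tokens are hypothetical stand-ins for the paper's Arabic resources:

```python
def sentence_polarity(tokens, lexicon,
                      negators={"not"}, intensifiers={"very": 2.0}):
    """Toy polarity scorer: a negator flips the sign of the next
    sentiment word, an intensifier scales its weight. Illustrates the
    feature families only; the paper feeds such features to an SVM."""
    score, flip, weight = 0.0, 1, 1.0
    for tok in tokens:
        if tok in negators:
            flip = -1                    # negation shifts the next polarity
        elif tok in intensifiers:
            weight = intensifiers[tok]   # contextual intensifier
        elif tok in lexicon:
            score += flip * weight * lexicon[tok]
            flip, weight = 1, 1.0        # reset after a sentiment word
    return score

lex = {"good": 1.0, "bad": -1.0}
print(sentence_polarity(["very", "good"], lex))  # 2.0
print(sentence_polarity(["not", "good"], lex))   # -1.0
```

In the paper these signals are features for an SVM classifier rather than a direct score, but the mechanics of shifting and intensifying polarity are the same.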
Survey on Machine Transliteration and Machine Learning Models (ijnlc)
Globalization and the growth of Internet users truly demand that almost all internet-based applications support local languages. Support for local languages can be provided in all internet-based applications by means of machine transliteration and machine translation. This paper provides a thorough survey of machine transliteration models and the machine learning approaches used for machine transliteration over a period of more than two decades, for internationally used languages as well as Indian languages. The survey shows that the linguistic approach provides better results for closely related languages, and that probability-based statistical approaches are good when one of the languages is phonetic and the other is non-phonetic. Better accuracy can be achieved only by using hybrid and combined models.
International Journal on Natural Language Computing (IJNLC) Vol. 4, No. 2, Apri... (ijnlc)
Building interactive dialogue systems has recently gained considerable attention, but most of the resources and systems built so far are tailored to English and other Indo-European languages. The need to design systems for other languages, such as Arabic, is increasing. For these reasons, there is growing interest in the Arabic dialogue act classification task, because it is a key player in Arabic language understanding for building such systems. This paper surveys different techniques for dialogue act classification for Arabic. We describe the main existing techniques for utterance segmentation and classification, the annotation schemas, and the test corpora for Arabic dialogue understanding that have been introduced in the literature.
Arabic morphology encapsulates many valuable features, such as a word's root. Arabic roots are being utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivative languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots could be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence model: it captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, with at least a 9.4% improvement over other stemmers.
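Smoothed PMI as commonly defined in distributional semantics dampens PMI's bias toward rare contexts by raising context counts to a power alpha; a minimal sketch over (word, context) co-occurrence pairs, with toy data (the paper's exact SPMI formulation and Arabic data are not reproduced here):

```python
import math
from collections import Counter

def smoothed_pmi(pairs, alpha=0.75):
    """Build a smoothed-PMI scorer from (word, context) pairs.
    Context probabilities use counts raised to alpha, the usual
    smoothing that reduces the bias of PMI toward rare contexts."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    ctx_counts = Counter(c for _, c in pairs)
    total = sum(pair_counts.values())
    ctx_total = sum(n ** alpha for n in ctx_counts.values())

    def pmi(word, ctx):
        p_wc = pair_counts[(word, ctx)] / total
        p_w = word_counts[word] / total
        p_c = ctx_counts[ctx] ** alpha / ctx_total  # smoothed context prob.
        return math.log(p_wc / (p_w * p_c))

    return pmi

pairs = [("bank", "money"), ("bank", "river"), ("bank", "money")]
pmi = smoothed_pmi(pairs)
print(pmi("bank", "money") > pmi("bank", "river"))  # True
```

In a stemmer, such scores would rank candidate roots by how well each root's typical contexts match the ambiguous word's observed context.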
Turn Segmentation into Utterances for Arabic Spontaneous Dialogues... (ijnlc)
Text segmentation is an essential processing task for many Natural Language Processing (NLP) applications, such as text summarization, text translation, and dialogue language understanding, among others. Turn segmentation is considered the key player in the dialogue understanding task for building automatic Human-Computer systems. In this paper, we introduce a novel approach to segmenting turns into utterances for Egyptian spontaneous dialogues and Instant Messages (IM) using a Machine Learning (ML) approach, as part of the task of automatically understanding Egyptian spontaneous dialogues and IM. Due to the lack of an Egyptian-dialect dialogue corpus, the system is evaluated on our own corpus of 3001 turns, collected, segmented, and annotated manually from Egyptian call centers. The system achieves an F1 score of 90.74% and an accuracy of 95.98%.
Machine Translation Development for Indian Languages and Its Approa... (ijnlc)
This paper presents a survey of machine translation systems for Indian regional languages. Machine translation is one of the central areas of Natural Language Processing (NLP). Machine translation (henceforth referred to as MT) is important for breaking the language barrier and facilitating inter-lingual communication. For a multilingual country like India, the largest democracy in the world, there is a big requirement for automatic machine translation systems. With the advent of Information Technology, many documents and web pages are appearing in local languages, so there is a large need for good MT systems to address all these issues, in order to establish proper communication between state and union governments and to exchange information amongst the people of different states. This paper focuses on different machine translation projects done in India, along with their features and domains.
A Novel Approach for Recognizing Text in Arabic Ancient Manuscripts (ijnlc)
In this paper a system for recognizing Arabic ancient manuscripts is presented. The system has been
divided into four parts. The first part is the image pre-processing where the text in the Arabic ancient
manuscript will be recognized as a collection of Arabic characters through three phases of processing. The
second part is the Arabic text analysis which consists of lexical analyzer; syntax analyzer; and semantic
analyzer. The output of this subsystem is an XML file format that represents the ancient manuscript text.
The third part is the intermediate text generation, in this part an intermediate presentation of the Arabic
text is generated from the XML text file. The fourth part of the system is the Arabic text generation, which
converts the generated text to a modern standard Arabic (MSA) language (this part has four phases: text
organizer; pre-optimizer; semantics generator; and post-optimizer).
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models (ijnlc)
Displaying a document in Middle Eastern languages requires contextual analysis due to different presentational forms for each character of the alphabet. The words of the document will be formed by the joining of the correct positional glyphs representing corresponding presentational forms of the
characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by the software developers.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
Large amounts of information lie dormant in historical documents and manuscripts. This information would go to waste if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text form by doing optical character
recognition (OCR). For indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternate approach using word spotting can be effective to access large collections of document images. We propose a word spotting
technique based on codes for matching the word images of Devanagari script. The shape information is utilised for generating integer codes for words in the document image and these codes are matched for final retrieval of relevant documents. The technique is illustrated using Marathi document images.
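Matching integer shape codes, as the word-spotting technique above describes, can be sketched with an edit-distance comparison over code sequences; the codes and the distance threshold below are hypothetical illustrations, not the paper's actual coding scheme:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance, here over sequences of integer
    shape codes rather than characters."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def spot(query_code, document_codes, max_dist=1):
    """Return indices of word images whose code sequence lies within
    max_dist edits of the query's code sequence."""
    return [i for i, code in enumerate(document_codes)
            if edit_distance(query_code, code) <= max_dist]

# Hypothetical integer shape codes for four word images in a page
doc = [[3, 1, 4], [3, 1, 5], [2, 7, 1], [3, 1, 4, 2]]
print(spot([3, 1, 4], doc))  # [0, 1, 3]
```

Allowing a small edit distance rather than exact matching gives tolerance to code variation from print quality, size, and font, which is the motivation for word spotting over OCR here.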
Accents of English have been investigated for many years from the perspective of both native and non-native speakers of the language. Various research results imply that non-native speakers of English produce certain speech characteristics which are uncommon in native speakers' speech, because non-native speakers do not produce the same tongue movements as native speakers. This paper presents an isolated English word recognition system devised with the speech of local Bangladeshi people, who are non-native speakers of English. Here, we have also noticed a speech characteristic which is not present in the speech of native English speakers. Two acoustic features, pitch and formants, have been utilized to develop the system. The system is speaker-independent and is based on a template-matching approach. The recognition method applied here is very simple, and the recognition accuracy is also very satisfactory.
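A template-based recognizer over pitch and formant features can be sketched as a nearest-template search; the feature values and word templates below are hypothetical illustrations, not measurements from the paper:

```python
import math

def recognize(features, templates):
    """Template-matching sketch: the word whose stored
    (pitch, F1, F2) template is nearest in Euclidean distance
    to the input feature vector wins."""
    return min(templates, key=lambda w: math.dist(features, templates[w]))

# Hypothetical (pitch Hz, F1 Hz, F2 Hz) templates for two words
templates = {"yes": (210.0, 300.0, 2300.0), "no": (190.0, 450.0, 900.0)}
print(recognize((200.0, 440.0, 950.0), templates))  # no
```

A real system would store one template per word (often averaged over speakers for speaker independence) and normalize the features before comparison.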
developed at Princeton University, SemCor corpus an
d the JAWS library (Java API for WordNet
searching) has been used for this purpose.
S ENTIMENT A NALYSIS F OR M ODERN S TANDARD A RABIC A ND C OLLOQUIAlijnlc
The rise of social media such as blogs and social n
etworks has fueled interest in sentiment analysis.
With
the proliferation of reviews, ratings, recommendati
ons and other forms of online expression, online op
inion
has turned into a kind of virtual currency for busi
nesses looking to market their products, identify n
ew
opportunities and manage their reputations, therefo
re many are now looking to the field of sentiment
analysis. In this paper, we present a feature-based
sentence level approach for Arabic sentiment analy
sis.
Our approach is using Arabic idioms/saying phrases
lexicon as a key importance for improving the
detection of the sentiment polarity in Arabic sente
nces as well as a number of novels and rich set of
linguistically motivated features (contextual Inten
sifiers, contextual Shifter and negation handling),
syntactic features for conflicting phrases which en
hance the sentiment classification accuracy.
Furthermore, we introduce an automatic expandable w
ide coverage polarity lexicon of Arabic sentiment
words. The lexicon is built with gold-standard sent
iment words as a seed which is manually collected a
nd
annotated and it expands and detects the sentiment
orientation automatically of new sentiment words us
ing
synset aggregation technique and free online Arabic
lexicons and thesauruses. Our data focus on modern
standard Arabic (MSA) and Egyptian dialectal Arabic
tweets and microblogs (hotel reservation, product
reviews, etc.). The experimental results using our
resources and techniques with SVM classifier indica
te
high performance levels, with accuracies of over 95
%.
S URVEY O N M ACHINE T RANSLITERATION A ND M ACHINE L EARNING M ODELSijnlc
Globalization and growth of Internet users truly demands for almost all internet based applications to
support
l
oca
l l
anguages. Support
of
l
oca
l
l
anguages can be
given in all internet based applications by
means of Machine Transliteration
and
Machine Translation
.
This paper provides the thorough survey on
machine transliteration models and machine learning
approaches
used for machine transliteration
over the
period
of more than two decades
for internationally used languages as well as Indian languages.
Survey
shows that linguistic approach provides better results for the closely related languages and probability
based statistical approaches are good when one of the
languages is phonetic and other is non
-
phonetic.
B
etter accuracy can be achieved only by using Hybrid and Combined models.
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...ijnlc
Building
dialogues systems
interaction
has recently gained considerable
attention, but most of the
resourc
es and systems built so far are
tailored to
English and other Indo
-
European languages. The need
for designing
systems for
other languages is increasing such as Arabic language.
For this reasons, there
are more int
erest for Arabic dialogue acts classification
task because it
a key player in Arabic language
under
standing
to
bu
ilding this systems
.
This paper surveys
different techniques
for dialogue acts classification
for Arabic.
W
e describe the
main existing techniques for utterances segmentations and
classification, annotation schemas, and
test corpora for Arabic
dialogues understanding
that have introduced
in the literature
Arabic morphology encapsulates many valuable features such as word’s root. Arabic roots are beingutilized for many tasks; the process of extracting a word’s root is referred to as stemming. Stemming is anessential part of most Natural Language Processing tasks, especially for derivative languages such asArabic. However, stemming is faced with the problem of ambiguity, where two or more roots could beextracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence
model. It captures the meaning of a word based on its context. In this paper, a distributional semantics
model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate itseffectiveness on the stemming analysis task. It showed an accuracy of 81.5%, with a at least 9.4%improvement over other stemmers.
T URN S EGMENTATION I NTO U TTERANCES F OR A RABIC S PONTANEOUS D IALOGUES ...ijnlc
ext segmentation task is an essential processing task for many of Natural Language Processing (NLP)
such as text summarization, text translation, dialogue language understanding, among others. Turns
segmentation consi
dered the key player in dialogue understanding task for building automatic Human
-
Computer systems. In this paper, we introduce a novel approach to turn segmentation into utterances for
Egyptian spontaneous dialogues and Instance Messages (IM) using Machine
Learning (ML) approach as a
part of automatic understanding Egyptian spontaneous dialogues and IM task. Due to the lack of Egyptian
dialect
dialogue
corpus
the system evaluated by our
corpus
includes 3001 turns, which
are collected,
segmented, and annotat
ed manually from Egyptian call
-
centers. The system achieves F
1
scores
of 90.74%
and accuracy of 95.98%
M ACHINE T RANSLATION D EVELOPMENT F OR I NDIAN L ANGUAGE S A ND I TS A PPROA...ijnlc
This paper presents a survey of Machine translation system for Indian Regional languages. Machine
translation is one of the central areas of Natural language processing (NLP).
Machine translation
(henceforth
referred as MT)
is important for breaking the language barrier and facilitating inter
-
lingual
communication. For a multilingual country like INDIA which is largest democratic country in whole world,
there is a big requirement of automatic machine translation system.
With
the advent of Information
Technology many documents and web pages are coming
up in a local language so
there is
a large need of
good M
T
systems to address all these issue
s in order to establish a
proper
communication between states
and union governments to
exchange information amongst the people of different states.
This paper focuses
on different Machine translation projects done in India along with their features and domain
A Novel Approach for Recognizing Text in Arabic Ancient Manuscripts ijnlc
In this paper a system for recognizing Arabic ancient manuscripts is presented. The system has been
divided into four parts. The first part is the image pre-processing where the text in the Arabic ancient
manuscript will be recognized as a collection of Arabic characters through three phases of processing. The
second part is the Arabic text analysis which consists of lexical analyzer; syntax analyzer; and semantic
analyzer. The output of this subsystem is an XML file format that represents the ancient manuscript text.
The third part is the intermediate text generation, in this part an intermediate presentation of the Arabic
text is generated from the XML text file. The fourth part of the system is the Arabic text generation, which
converts the generated text to a modern standard Arabic (MSA) language (this part has four phases: text
organizer; pre-optimizer; semantics generator; and post-optimizer).
Contextual Analysis for Middle Eastern Languages with Hidden Markov Modelsijnlc
Displaying a document in Middle Eastern languages requires contextual analysis due to different presentational forms for each character of the alphabet. The words of the document will be formed by the joining of the correct positional glyphs representing corresponding presentational forms of the
characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by the software developers.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGESijnlc
Large amount of information is lying dormant in historical documents and manuscripts. This information would go futile if not stored in digital form. Searching some relevant information from these scanned images would ideally require converting these document images to text form by doing optical character
recognition (OCR). For indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternate approach using word spotting can be effective to access large collections of document images. We propose a word spotting
technique based on codes for matching the word images of Devanagari script. The shape information is utilised for generating integer codes for words in the document image and these codes are matched for final retrieval of relevant documents. The technique is illustrated using Marathi document images.
Accents of English have been investigated for many years both from the perspective of native and non-native speakers of the language. Various research results imply that non-native speakers of English language produce certain speech characteristics which are uncommon in native speakers’ speech. This is because non-native speakers do not produce the same tongue movement as native speakers. This paper presents an isolated English word recognition system devised with the speech of local Bangladeshi people, who are also non-native speakers of English language. Here, we have also noticed a different speech characteristic which is not available within the speech of native English speakers. Two acoustic features, ‘pitch’ and ‘formants’ have been utilized to develop the system. The system is speaker-independent and stands on Template based approach. The recognition method applied here is very simple and the recognition accuracy is also very satisfactory.
STUDY OF ACOUSTIC PROPERTIES OF NASAL AND NONNASAL VOWELS IN TEMPORAL DOMAINcscpconf
There has been considerable amount of work done in exploring the acoustic correlates of nasalized and non-nasalized vowels in the frequency domain. Nasalized vowels are characterized by the presence of extra pole-zero pairs near the first formant region and across thespectrum. Several other automatically extractable acoustic features have been proposed by researchers across the globe. This area has not been explored much in the temporal domain. In this study we have tried to find quantifiable differences/similarities between the nasal and non-nasal vowel /a/ in the temporal domain at the pitch synchronous level. The results show significant differences between nasalized and non-nasalized vowel /a/
Study of acoustic properties of nasal and nonnasal vowels in temporal domaincsandit
There has been considerable amount of work done in exploring the acoustic correlates of nasalized and
non-nasalized vowels in the frequency domain. Nasalized vowels are characterized by the presence of extra
pole-zero pairs near the first formant region and across the spectrum. Several other automatically
extractable acoustic features have been proposed by researchers across the globe. This area has not been
explored much in the temporal domain. In this study we have tried to find quantifiable
differences/similarities between the nasal and non-nasal vowel /a/ in the temporal domain at the pitch
synchronous level. The results show significant differences between nasalized and non-nasalized vowel /a/.
Characterization of Arabic sibilant consonants IJECEIAES
The aim of this study is to develop an automatic speech recognition system in order to classify sibilant Arabic consonants into two groups: alveolar consonants and post-alveolar consonants. The proposed method is based on the use of the energy distribution, in a consonant-vowel type syllable, as an acoustic cue. The application of this method on our own corpus reveals that the amount of energy included in a vocal signal is a very important parameter in the characterization of Arabic sibilant consonants. For consonants classifications, the accuracy achieved to identify consonants as alveolar or post-alveolar is 100%. For post-alveolar consonants, the rate is 96% and for alveolar consonants, the rate is over 94%. Our classification technique outperformed existing algorithms based on support vector machines and neural networks in terms of classification rate.
Investigation of the Effect of Obstacle Placed Near the Human Glottis on the ...kevig
Simulation of human vocal tract model is necessary to understand the effect of different conditions on speech generation. In natural calamities such as earthquakes person may swallow soil due to large amount of dust around. The effect of soil is investigated on the sound pressure in the human vocal tract. The generation of speech starts from the glottis, so the soil obstacle is positioned near it. The whole model is designed and analyzed using FEM. As the speech signal requires a bandwidth near about 4 kHz, the investigation carried out to obtained sound pressure in the human vocal tract from 100-5000 Hz sound signals.
Efficiency of the energy contained in modulators in the Arabic vowels recogni...IJECEIAES
The speech signal is described as many acoustic properties that may contribute differently to spoken word recognition. Vowel characterization is an important process of studying the acoustic characteristics or behaviors of speech within different contexts. This current study focuses on the modulators characteristics of three Arabic vowels, we proposed a new approach to characterize the three Arabic vowels /a/, /i/ and /u/. The proposed method is based on the energy contained in the speech modulators. The coherent subband demodulation method related to the spectral center of gravity (COG) was used to calculate the energy of the speech modulators. The obtained results showed that the modulators energy help characterize the Arabic vowels /a/, /i/ and /u/ with an interesting recognition rate ranging from 86% to 100%.
FORMANT ANALYSIS OF BANGLA VOWEL FOR AUTOMATIC SPEECH RECOGNITIONsipij
To provide new technological benefits to the mass people, nowadays, regional and local language
recognition draws attention to the researchers. Similarly to other languages, Bangla speech recognition
scheme is demandable. A formant is considered as the resonance frequency of vocal tract. Formant
frequencies play an important role for the purpose of automatic speech recognition, due to its noise robust
characteristics. In this paper, Bangla vowels are investigated to acquire formant frequencies and its
corresponding bandwidth from continuous Bangla sentences, which are considered as potential parameters
for wide voice applications. For the purpose of formant analysis, cepstrum based formant estimation and
Linear Predictive Coding (LPC) techniques are used. In order to acquire formant characteristics, enrich
continuous sentences and widely available Bangla language corpus namely “SHRUTI” is considered.
Intensive experimentation is carried out to determine formant characteristics (frequency and bandwidth) of
Bangla vowels for both male and female speakers. Finally, vowel recognition accuracy of Bangla language
is reported considering first three formants
Novel cochlear filter based cepstral coefficients for classification of unvoiced fricatives
International Journal on Natural Language Computing (IJNLC) Vol. 3, No. 4, August 2014
NOVEL COCHLEAR FILTER BASED CEPSTRAL COEFFICIENTS FOR CLASSIFICATION OF UNVOICED FRICATIVES
Namrata Singh1, Nikhil Bhendawade2, Hemant A. Patil3
1 Software Engineer, LG Soft India Pvt. Ltd., Embassy Tech Square, Bangalore, 560103, India.
2 Design Engineer, Redpine Signals Pvt. Ltd., Hitech City, Hyderabad, 500081, India.
3 Dhirubhai Ambani Institute of Information Technology (DA-IICT), Gandhinagar, 382007, India.
ABSTRACT
In this paper, the use of new auditory-based features derived from cochlear filters has been proposed for the classification of unvoiced fricatives. Classification attempts have been made to classify sibilants (i.e., /s/, /sh/) vs. non-sibilants (i.e., /f/, /th/) as well as fricatives within each sub-category (i.e., intra-sibilants and intra-non-sibilants). Our experimental results indicate that the proposed feature set, viz., Cochlear Filter-based Cepstral Coefficients (CFCC), performs better for individual fricative classification (i.e., a jump of 3.41 % in average classification accuracy and a fall of 6.59 % in EER) in clean conditions than the state-of-the-art feature set, viz., Mel Frequency Cepstral Coefficients (MFCC). Furthermore, under signal degradation conditions (i.e., with additive white noise), classification accuracy using the proposed feature set drops much more slowly (i.e., from 86.73 % in clean conditions to 77.46 % at an SNR of 5 dB) than with MFCC (i.e., from 82.18 % in clean conditions to 46.93 % at an SNR of 5 dB).
KEYWORDS
Unvoiced fricative sound, auditory transform, cochlear filter cepstral coefficients, Mel cepstrum, sibilants, non-sibilants.
1. INTRODUCTION
Classification of appropriate short regions of a speech signal into different phoneme classes (e.g., fricatives vs. plosives) based on their acoustic characteristics is an interesting and challenging research problem. In this paper, we present an effective feature set for the classification of one particular class of phonemes, viz., unvoiced fricatives. Fricative sounds are a unique class of phonemes in the sense that, for fricatives, the sound source occurs at the point of constriction in the vocal tract rather than at the glottis. There are two types of fricatives, viz., voiced and unvoiced (having different speech production mechanisms). For example, in the case of voiced fricatives, the noisy characteristics caused by the constriction in the vocal tract are accompanied by vibrations of the vocal folds, thereby imparting some periodicity into the produced sound. However, during the production of unvoiced fricatives, the vocal folds are relaxed and not vibrating. This lack of periodicity results in a relatively more random waveform pattern. Furthermore, voiceless fricatives, being noise-like with a highly turbulent source, are dynamic, relatively short, and weak (i.e., having low energy), which makes classification even more difficult, especially due to severe masking of fricative sounds by noise (i.e., under signal degradation conditions).

DOI : 10.5121/ijnlc.2014.3402
Since production of unvoiced fricatives is governed by source (e.g., frication noise originating
from constriction in vocal tract) - filter (i.e., oral cavity) model theory [1, 2], they may be
distinguished depending on location of constriction in oral cavity. This constriction at different
locations accounts for distinct acoustical characteristics. To reliably predict the characteristics of
fricative sounds, two approaches could be considered, viz., modeling the production mechanism
of fricatives [3] or modeling response of human ear corresponding to the acoustical characteristics
of each fricative class [4,5]. Our study focuses on second approach. To that effect, we propose use
of cochlear filters to model response of human ear. Since among the various acoustic cues (e.g.,
amplitude, spectral, durational and transitional characteristics) used previously, spectral cues were
found to be the most efficient; we have used spectral information as a basis of classification of
four unvoiced fricatives, viz., /f/, /th/, /s/ and /sh/.
The rest of the paper is organized as follows: Section 2 gives brief discussion of relevant literature
that deals with the earlier attempts to classify fricative sounds using various acoustic features.
Section 3 discusses the details of proposed feature set and gives the comparison between Fourier
transform and auditory transform and its significance for unvoiced fricative classification. Section
4 describes the experimental setup which is followed by the comparison of classification results
using proposed and baseline features under various experimental evaluation factors (e.g., cross-validation,
dimension of feature vector, number of sub-band filters and signal degradation
conditions) in Section 5. Finally, Section 6 concludes the paper along with future research
directions.
2. LITERATURE REVIEW
The earlier studies in the area of fricative sound classification used Root Mean Square (RMS)
amplitude of fricative sound as an acoustic cue to distinguish between sibilants and non-sibilants
[7, 8].
Study reported in [9] used duration of fricative noise as a perceptual cue to distinguish between
sibilants and non-sibilants, as they found that sibilants are on average 33 ms longer than non-sibilants.
However, the approach had several issues, as durational features often vary with
speaking rate and contextual complexity. In a different experiment, it was also found that
listeners identify fricative sound using only the initial fraction of utterance contrary to the earlier
conclusion reported in [9] that absolute fricative noise duration can be used as a perceptual cue
[10]. Instead relative duration (i.e., duration of fricative relative to entire word duration) was
proposed in further studies [11]. This study found significant difference among all the places of
articulation for fricative using relative duration as a cue, however, with the exception of unvoiced
non-sibilants.
Various spectral features have been investigated and used for a long time since the hypothesis
presented in [12], that spectrum of fricatives is governed by size and shape of resonance chamber
in front of constriction point. Work presented in [13] supported this finding when the spectral
characteristics of front (near-flat spectrum), middle (spectral peak around 3.5 kHz) and back
(spectral peak around 1.5 kHz) unvoiced fricatives were examined. Though the locations of
spectral peaks are influenced by speaker differences [14] and age differences among speakers
[15], it was consistently observed in many studies that the spectral peaks of sibilants always lie
in the 1-6 kHz range while non-sibilants show an almost flat spectrum extending beyond 8 kHz.
Previous studies depict that various acoustic cues have been found effective for distinguishing
between sibilant and non-sibilant class as a whole and between fricatives within sibilant class.
However, analyzing the characteristics of fricatives within non-sibilant class has proved less
conclusive, resulting in poor classification accuracy. In this paper, we propose an auditory-based
approach for relatively better analysis and distinction of non-sibilant sounds in both clean and
noisy environments by using cochlear filters (which resemble the impulse response of the human cochlea
to any sound event). As the human ear can distinguish between fricative sounds better than any
other classification system (both in clean and noisy conditions), spectral cues derived from the
application of cochlear filters have been used for distinction between all four unvoiced fricatives
(i.e., /f/, /th/, /s/ and /sh/). Results have also been reported for classification of sibilant vs. non-sibilant
sounds and for fricatives within each subcategory (i.e., /f/ vs. /th/ and /s/ vs. /sh/).
3. COCHLEAR FILTER-BASED CEPSTRAL COEFFICIENTS
(CFCC)
CFCC features (derived from the auditory transform) were first proposed in [4] for the speaker
recognition application. The auditory transform is basically a wavelet transform; however, the mother
wavelet (i.e., ψ(t)) is chosen in such a manner that the sub-band filters (whose impulse responses
correspond to dilated versions of the mother wavelet) emulate the cochlear filters present in the cochlea
of the human ear. Cochlear filters are responsible for perception of sound by the human auditory system
and would thus be expected to include properties of robustness under noisy or signal degradation
conditions (i.e., they may be better than most other artificial speech recognition or classification
systems in noisy environments). The auditory transform is implemented as a bank of sub-band
filters where each sub-band filter corresponds to a cochlear filter present along the basilar
membrane (BM) in the cochlea of the human ear. These cochlear filters have been found to have a
bandwidth that varies with their central frequency. In particular, the bandwidth of these filters
increases with increasing central frequency (i.e., f_c) while the quality factor (i.e., Q) remains
almost constant. These filters thus provide a range of analysis window durations and bandwidths for
analyzing the speech signal, so that rapidly varying signal components are analyzed with shorter
window durations than slowly varying components, preserving the time-frequency resolution in
both cases. Fig. 1 shows the block diagram for implementation of CFCC [4, 5].
[Block diagram: Input Speech Signal -> Auditory Transform -> Hair Cell/Window -> Non-Linearity -> Discrete Cosine Transform -> CFCC]
Fig. 1. Auditory-based feature extraction technique, viz., Cochlear Filter based Cepstral Coefficients
(CFCC) [4, 5].
We have chosen a logarithmic nonlinearity instead of the cubic root nonlinearity used in earlier
studies [4, 5], as it resulted in better classification, i.e.,

y(i, j) = ln(S(i, j)),    (1)

where S(i, j) is the nerve spike density, obtained from the hair cell output for each sub-band, with
the duration for the nerve spike density count taken as 12 ms and calculated with a window shift
duration of 5 ms.
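The pipeline of Fig. 1 can be sketched in code. This is a minimal illustrative implementation, not the exact formulation of [4]: the cochlear filter impulse response, the hair-cell stage, and the helper names (`cochlear_filter`, `dct_ortho`, `cfcc`) are simplified assumptions for exposition.

```python
import numpy as np

def cochlear_filter(fc, fs, alpha=2.0, beta=0.45, dur=0.02):
    """Simplified cochlear-filter impulse response (shape parameters
    alpha and beta as in Section 3.1; the exact form in [4] differs)."""
    t = np.arange(1, int(dur * fs) + 1) / fs
    h = t ** alpha * np.exp(-2 * np.pi * beta * fc * t) * np.cos(2 * np.pi * fc * t)
    return h / np.max(np.abs(h))

def dct_ortho(M):
    """Orthonormal DCT-II along axis 0 (stand-in for a library DCT)."""
    n = M.shape[0]
    k = np.arange(n)[:, None]
    basis = np.cos(np.pi * (2 * np.arange(n)[None, :] + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    basis[0] *= np.sqrt(0.5)
    return basis @ M

def cfcc(signal, fs, center_freqs, n_ceps=13, win=0.012, hop=0.005):
    """CFCC sketch: cochlear filterbank -> hair-cell rectification and
    windowing (nerve spike density, 12 ms window / 5 ms shift) ->
    logarithmic nonlinearity (eq. 1) -> DCT across sub-bands."""
    wlen, hlen = int(win * fs), int(hop * fs)
    feats = []
    for fc in center_freqs:
        sub = np.convolve(signal, cochlear_filter(fc, fs), mode="same")
        energy = sub ** 2                              # hair-cell rectification
        n_frames = 1 + (len(energy) - wlen) // hlen
        S = [energy[i * hlen:i * hlen + wlen].mean() for i in range(n_frames)]
        feats.append(np.log(np.asarray(S) + 1e-12))    # eq. (1): log, not cube root
    return dct_ortho(np.array(feats))[:n_ceps].T       # frames x cepstral coeffs
```

Each sub-band contributes one log-energy trajectory, and the DCT decorrelates across sub-bands, mirroring the Mel-cepstrum construction but with cochlear filters in place of triangular Mel filters.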
3.1. Details of cochlear filters
Fig. 2 shows the frequency response of the cochlear filters used in the proposed feature set. Filters have
been placed according to the Mel scale, and the central frequencies of the filters are calculated according to

f_mel = 1127 ln(1 + f_lin / 700),    (2)

where f_mel is the central frequency along the Mel scale and f_lin is the corresponding central frequency along
the linear scale (i.e., in Hz). Filters are placed uniformly along the Mel scale, so the distribution appears
exponential along the linear scale. The parameters a and b, which decide the filter shape, have been
optimized as 2 and 0.45, respectively (for the database used in the present work).

Fig. 2. Frequency response of 13 cochlear filters placed along the Mel scale with a = 2 and b = 0.45.
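The uniform Mel-scale placement of eq. (2) can be sketched as follows; `mel_spaced_centers` and its frequency limits are illustrative helpers, not taken from the paper.

```python
import numpy as np

def hz_to_mel(f_lin):
    """Eq. (2): f_mel = 1127 ln(1 + f_lin / 700)."""
    return 1127.0 * np.log(1.0 + np.asarray(f_lin, float) / 700.0)

def mel_to_hz(f_mel):
    """Inverse of eq. (2)."""
    return 700.0 * (np.exp(np.asarray(f_mel, float) / 1127.0) - 1.0)

def mel_spaced_centers(f_low, f_high, n_filters=13):
    """Center frequencies of n_filters placed uniformly on the Mel scale;
    on the linear (Hz) scale the spacing grows roughly exponentially."""
    mels = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters)
    return mel_to_hz(mels)
```

For example, `mel_spaced_centers(100.0, 22050.0, 13)` returns 13 frequencies whose spacing widens toward higher Hz, matching the exponential appearance noted above.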
Though 13 cochlear filters have been used in our work (for the reasons described in Section 5.3),
we experimented with number of sub-band filters to find the minimum number of cochlear filters
required to capture the distinctive spectral characteristics of the unvoiced fricatives. Six filters
have been found to be significant in our analysis (giving classification accuracy of 84.07 %). Fig.
3 shows the frequency responses of these significant filters. Corresponding impulse responses
have been shown in Fig. 4. It is noted that as the central frequency of the filters increases, the bandwidth also
increases, maintaining a near-constant Q factor of about 2.15 (as shown in Table 1). Furthermore, higher
frequency components are analyzed with finer time resolution (shorter analysis window
durations) while higher frequency resolution is used for analyzing lower frequency components.
As shown in Fig. 4, frequency components near 13.1 kHz are analyzed with a window of
approximately 0.561 ms duration¹ while a window of approximately 11.4 ms is used for analyzing
frequency components near 451 Hz. This is known as constant-Q filtering, and this is what
happens in the basilar membrane of the human cochlea during speech perception.
¹ Only half of the analysis window has been displayed in Fig. 4, since the window is symmetric.
Fig. 3. Frequency response of six cochlear filters found significant for unvoiced fricative classification.
Table 1: Central frequencies of the cochlear filters found significant for unvoiced fricative classification

Cochlear filter index | Center frequency (Hz) | -3 dB Bandwidth B (Hz) | Quality factor (f_c / B)
1 |   451 |  210 | 2.1476
2 |  1191 |  550 | 2.1654
3 |  2408 | 1120 | 2.1500
4 |  4408 | 2050 | 2.1502
5 |  7696 | 3580 | 2.1497
6 | 13100 | 6450 | 2.04
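The near-constant Q factor can be verified directly from the values tabulated in Table 1:

```python
# Center frequencies and -3 dB bandwidths of the six significant
# cochlear filters, taken from Table 1.
centers    = [451, 1191, 2408, 4408, 7696, 13100]   # Hz
bandwidths = [210,  550, 1120, 2050, 3580,  6450]   # Hz

# Quality factor Q = f_c / B; constant-Q filtering means Q is nearly
# the same for every filter in the bank.
qs = [fc / b for fc, b in zip(centers, bandwidths)]
assert all(2.0 < q < 2.2 for q in qs)   # all six Q values cluster near 2.1
```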
Fig.4. Impulse response of six cochlear filters found significant for unvoiced fricative classification with
central frequencies (a) 451 Hz, (b) 1191 Hz, (c) 2408 Hz, (d) 4408 Hz, (e) 7696 Hz, and (f) 13.1 kHz.
3.2. Short-time Fourier transform vs. Auditory transform:
Short-time Fourier transform (STFT) is the most widely used technique for analyzing the
frequency-domain characteristics of localized regions of a speech signal. Though efficient, it uses a
fixed-length window for signal analysis, resulting in constant time-frequency resolution; hence,
improving resolution in the time domain will degrade resolution in the frequency domain
(i.e., Heisenberg's uncertainty principle in the signal processing framework [16]). In addition,
several optimized algorithms used in evaluating the STFT via the Fast Fourier Transform (FFT) add
computational noise by increasing computational speed at the expense of a slight compromise
in accuracy. This might seriously affect spectral cues in the case of non-sibilants, as they have weak
resonances (i.e., formants) in their spectrum. Fig. 5-Fig. 8 give the comparison between the
spectrum derived from auditory transform and traditional Fourier transform. Each spectrum is
averaged from initial, middle and end regions of fricative sounds for each fricative class such that
it represents the overall average spectral characteristics for that class. Hamming window with
window duration of 12 ms with frame rate of 5 ms has been used for FFT-based computation of
Fourier transform while auditory transform is computed using 13 cochlear filters of variable
length by the procedure described in [17]. The Fourier transform spectrum is affected by regular
spikes because of the fixed window duration for all frequency bands (seen in the form of periodic
spikes in the spectra of Fig. 5-Fig. 8). On the other hand, the spectrum
generated from the auditory transform provides flexible time-scale resolution by employing variable
length filters; hence it is free from these spikes and also preserves information about formant
frequencies [4, 5]. From Fig. 7 and Fig. 8, it is also clear that sibilants show spectral peaks around
5 kHz, while such energy concentration at a particular frequency is absent in non-sibilants, which
tend to have a near-flat spectrum (as shown in Fig. 5 and Fig. 6). The reason for this could be
explained from the speech production mechanism. In particular, during production of sibilant sounds,
the point of constriction lies near the alveolar ridge, resulting in a considerable length of front cavity
(created between the point of constriction and the lips), which in turn is responsible for spectral filtering
of the turbulent sound produced at the constriction, introducing resonances into the spectrum. Such
spectral filtering is almost absent in the case of labiodental (/f/) and interdental (/th/) non-sibilants,
as the point of constriction itself lies at the lips in the former case and between the upper and
lower teeth in the latter ([18], [19]).
3.3. Noise suppression capability of CFCC
Fig. 5. (a) Waveform for fricative sound /f/ (i.e., non-sibilant) and corresponding spectrum using (b) Fourier
transform and (c) auditory transform.
Fig. 6. (a) Waveform for fricative sound /th/ (i.e., non-sibilant) and corresponding spectrum using (b) Fourier
transform and (c) auditory transform.
Fig. 7. (a) Waveform for fricative sound /s/ (i.e., sibilant) and corresponding spectrum using (b) Fourier
transform and (c) auditory transform.
Fig. 8. (a) Waveform for fricative sound /sh/ (i.e., sibilant) and corresponding spectrum using (b) Fourier
transform and (c) auditory transform.
The Mel scale filterbank has triangular-shaped sub-band filters, which are not smooth at the vertex
of each triangle [20]. On the other hand, from Fig. 3, it is evident that cochlear filters have a bell-shaped
frequency response and hence are much smoother than the Mel filters.
This smoothness of the cochlear filters may help in suppressing the noise.
Robustness of CFCC features could also be explained from the similarity of the auditory transform to the
signal processing abstraction of the cochlea in the human ear. In a noisy acoustic environment, human
listeners perform robustly. In particular, the human hearing system is robust to noise because of an
amplification mechanism that takes care of mechanical vibrations of the eardrum
at the threshold of hearing (i.e., 2 x 10^-5 N/m^2) [21]. To support this observation, the study reported
in [22] claims that two or more rows of outer hair cells (OHC) in the cochlea are pumping fluid,
which accelerates the process of detecting sub-band energies in speech sound. In addition, those
OHC might be setting up their own vortex to act as the amplifier [21]. The sub-band-based
processing and energy detection come from the original studies reported in [23], which are
based on the belief that the human ear is a frequency analyzer, except for detection of transient sounds. In
this context, CFCC employs the continuous-time wavelet transform (CWT), whose mother
wavelet ψ(t) aids noise suppression and detection of transitional sounds such as fricatives.
This is analyzed below.
We have eq. (3) from [4],

\int_{-\infty}^{+\infty} \psi(t)\, dt = 0,    (3)

\int_{-\infty}^{+\infty} t\, \psi(t)\, dt \neq 0.    (4)

This means that ψ(t) has exactly one vanishing moment and it will suppress polynomials of degree zero
[16]. Let f(t) be the clean speech signal and w(t) be the additive white noise signal; then the
noisy speech signal, x(t), is given by

x(t) = f(t) + w(t).    (5)
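The vanishing-moment property of eq. (3) and the suppression of degree-zero polynomials can be checked numerically. Here a Haar wavelet stands in for the CFCC mother wavelet (an assumption; the actual ψ(t) of [4] has a cochlear shape), since any zero-mean wavelet annihilates constant signals.

```python
import numpy as np

# Haar wavelet on 256 samples: zero mean (eq. 3) and unit norm.
# It stands in for the CFCC mother wavelet purely for illustration.
n = 256
psi = np.where(np.arange(n) < n // 2, 1.0, -1.0) / np.sqrt(n)

assert abs(psi.sum()) < 1e-12                 # one vanishing moment: integral of psi = 0
assert abs(np.dot(psi, psi) - 1.0) < 1e-12    # unit norm, as assumed before eq. (14)

dc = np.full(n, 3.7)                          # degree-zero polynomial (a constant)
coeff = float(np.dot(dc, psi))                # inner product <w, psi>
assert abs(coeff) < 1e-9                      # the constant component is annihilated
```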
Taking the wavelet transform on both sides and using the linearity property of the CWT, we get

W_x(a, b) = W_f(a, b) + W_w(a, b),    (6)

where

W_f(a, b) = \langle f, \psi_{a,b} \rangle = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} f(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt,

the symbol \langle \cdot, \cdot \rangle denotes the inner product operation and W_f(a, b) means the CWT of signal f(t).
Hence, eq. (6) becomes

\langle x, \psi_{a,b} \rangle = \langle f, \psi_{a,b} \rangle + \langle w, \psi_{a,b} \rangle,    (7)

|\langle x, \psi_{a,b} \rangle| \leq |\langle f, \psi_{a,b} \rangle| + |\langle w, \psi_{a,b} \rangle|.    (8)
It is well known that the Taylor formula relates the differentiability of a signal to local
polynomial approximation. Let us assume that the signal w(t) is m times differentiable in
[v - h, v + h]. If P_v(t) is the Taylor polynomial in the neighborhood of point v, then

w(t) = P_v(t) + \epsilon_v(t),    (9)

where the approximation error \epsilon_v(t) is refined by a non-integer exponent \alpha (called the Lipschitz
exponent or Holder exponent in the mathematical literature). In particular, there exists K > 0 such
that

\forall t \in \mathbb{R}, \quad |\epsilon_v(t)| = |w(t) - P_v(t)| \leq K |t - v|^{\alpha}.    (10)

Let the mother wavelet \psi(t) have n vanishing moments and let the signal w(t) \in L^2(\mathbb{R}) (i.e., the Hilbert
space of finite energy signals) have non-integer Lipschitz exponent \alpha. In this case, we have the
following two theorems [chapter 6, pp. 169-171, 24], [25].

Theorem 1: If the signal w(t) \in L^2(\mathbb{R}) is uniformly Lipschitz \alpha \leq n over the closed interval
[b_1, b_2], then there exists K > 0 such that

\forall (a, b) \in \mathbb{R}^{+} \times [b_1, b_2], \quad |\langle w, \psi_{a,b} \rangle| \leq K a^{\alpha + \frac{1}{2}}.    (11)

Theorem 2 (JAFFARD): If the signal w(t) \in L^2(\mathbb{R}) is Lipschitz \alpha \leq n at a point v, then
there exists K > 0 such that

\forall (a, b) \in \mathbb{R}^{+} \times \mathbb{R}, \quad |\langle w, \psi_{a,b} \rangle| \leq K a^{\alpha + \frac{1}{2}} \left( 1 + \left| \frac{b - v}{a} \right|^{\alpha} \right),    (12)

where a and b are the scale and translation parameters in the definition of the CWT. It should be noted
that the converse is also true for both of the above theorems. Since in the present case, from eq. (4), we
have n = 1, it follows that \alpha \leq 1. The above two theorems give a guarantee that the wavelet
transform of the noise signal will decay faster as the scale parameter goes to zero (i.e., at fine
scales). On the other hand, for larger values of the scale parameter, they do not introduce any
constraint. In particular, due to the Cauchy-Schwarz inequality, we have

|\langle w, \psi_{a,b} \rangle| \leq \|w\| \cdot \|\psi_{a,b}\| = \|w\| \cdot \|\psi\|.    (13)

Since, due to normalization of the mother wavelet, \|\psi_{a,b}\| = \|\psi\| = 1 [chapter 4, 24], we have

|\langle w, \psi_{a,b} \rangle| \leq \|w\|.    (14)
Hence, the wavelet transform of the noise signal is bounded by \|w\| at larger scale parameters. From
eq. (8) and eq. (11), we have

|\langle x, \psi_{a,b} \rangle| \leq K_1 a^{\alpha_1 + \frac{1}{2}} + K_2 a^{\alpha_2 + \frac{1}{2}}.

Similarly, from eq. (8) and eq. (12), we have

|\langle x, \psi_{a,b} \rangle| \leq K_1 a^{\alpha_1 + \frac{1}{2}} \left( 1 + \left| \frac{b - v}{a} \right|^{\alpha_1} \right) + K_2 a^{\alpha_2 + \frac{1}{2}} \left( 1 + \left| \frac{b - v}{a} \right|^{\alpha_2} \right),    (15)
where K_1, K_2 > 0 and \alpha_1 and \alpha_2 are the Lipschitz exponents of the clean speech signal and the additive
white noise, respectively. Since the wavelet transform of the noise signal decays at fine scales, it is evident from
eq. (14) and eq. (15) that additive noise is suppressed in the wavelet domain. Since CFCC inherently
employs the CWT representation to mimic the cochlear filters in the human ear, it is expected that CFCC
will have noise suppression capability. This is also demonstrated with experimental results for
unvoiced fricative classification under noisy conditions in Section 5.4.
4. EXPERIMENTAL SETUP
4.1. Database used in this study
Preparation of sufficient training and testing data for each fricative involves extracting fricative
sounds from continuous speech in different contexts (of speech recordings) from different
speakers. All the fricatives have been manually extracted (using Audacity software [26]) from the
CHAINS database [27] of continuous speech in solo reading style (recorded using a Neumann
U87 condenser microphone). The database is publicly available having 4 extracts (viz., rainbow
text, members of the body text, north wind text and Cinderella text), a set of 24 sentences having
text material corresponding to TIMIT database and a set of other 9 CSLU's Speaker Identification
Corpus sentences.
Table 2 summarizes the details (such as number of speakers and contexts of fricative sounds) of
the dataset for each fricative sound used in this work. Words for segmenting fricative samples are
collected such that samples consist of variety of contexts. Column 5 in Table 2 gives this
contextual information (i.e., underlined region in a word indicates the location of fricative sound).
Table 2. Training and testing data extraction for each unvoiced fricative class (M - Male, F - Female)

Fricative | # of samples | # of Speakers (Male + Female) | Context associated with training and testing samples
Non-sibilants:
/f/ | 208 | 5 (1 M + 4 F) | for, of, affirmative, find, enough, fire, fish, frantically, fortune, frightened, fairy, forgot, fifth, if, fires, off, beautiful, form, refused, few, fell, from, food, roof, centrifuge and Jeff
/th/ | 143 | 21 (11 M + 10 F) | thought, teeth, North, tooth, think, throughout, everything, path, something and mouth
Sibilants:
/s/ | 305 | 6 (2 M + 4 F) | sun, sunlight, say, looks, support, must, necessary, receive, dance, sing, small, surface, cost, sloppy, appearance, same, atmosphere, escape, sermon, subdued, task, rescue, ask, suit, saw, system, loss and centrifuge
/sh/ | 254 | 14 (4 M + 10 F) | wash, under-wash, discussion, condition, shine, share, shape, shelter, shotguns, action, Trish and she
Total | 910 | 23 (10 M + 13 F) |
4.2. Front end analysis
To evaluate the relative performance of the proposed feature set, state-of-the-art feature set, viz.,
MFCC is used as the baseline feature set. Front end analysis involves computation of both CFCC
and MFCC features from corresponding spectra. Spectral analysis is done using Discrete Fourier
Transform (DFT) up to 22.05 kHz (corresponding to sampling frequency of 44.1 kHz) as it was
observed previously that spectral information of non-sibilants extend above 10 kHz [28]. Frame
size of 12 ms along with Hamming window and frame rate of 5 ms is used for computation of
MFCC features, while CFCC features are computed as described in Section 3. Though such a small
window size of 12 ms reduces the resolution in frequency-domain in case of MFCC, we observed
that temporal development of fricative sounds can be better modeled using larger number of
feature vectors per fricative sound (i.e., small window size) thereby increasing time resolution,
especially for non-sibilant /th/ which has average duration as small as 71.86 ms (computed over
143 samples used in this study). Cepstral Mean Subtraction (CMS) is performed after MFCC and
CFCC computation to take care of variations in recording devices and transmission channels.
Furthermore, use of CMS also resulted in considerable increase in % classification accuracy.
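A minimal sketch of Cepstral Mean Subtraction as described above, assuming the feature matrix is arranged as frames × cepstral coefficients:

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """CMS: subtract the per-coefficient mean over all frames of an
    utterance (rows = frames, columns = cepstral coefficients). This
    removes stationary convolutional channel effects, which become
    additive constants in the cepstral domain."""
    features = np.asarray(features, float)
    return features - features.mean(axis=0, keepdims=True)
```

After CMS, every cepstral coefficient has zero mean over the utterance, so any constant offset introduced by the recording device or transmission channel is cancelled.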
4.3 Hidden Markov Model (HMM)
In this work, HMM is used as a pattern classifier since it preserves the temporal development of
the fricative utterance which is often important in perception of fricative sounds. On the other
hand, temporal variation is irrelevant in other widely used techniques, such as discriminatively-trained
pattern classifiers, viz., support vector machines (SVMs), in which classification is done
independently for each frame in an utterance [31]. HMM evaluates the probability of an utterance
being a particular fricative sound based on observation and transition probabilities of the observed
sequence. A 3-state continuous density HMM has been employed for modeling each fricative
class.
4.4 Performance measures
To facilitate the performance comparison between proposed and baseline feature sets, three
performance measures, viz., classification accuracy, % Equal Error Rate (EER) and minimum
Detection Cost Function (DCF) have been employed. The % classification accuracy is defined as

% Classification Accuracy = (# of test samples correctly identified (N_c)) / (Total # of test samples (N_t)) x 100.    (16)
Error is a measure of misclassification probability. Classification error could be due to failure of
a classifier to detect a true test sample or due to acceptance of false test sample. We have used
Detection Error Trade-off (DET) curve for analyzing the error rates which gives the trade-offs
between missed detection rate (i.e., miss probability) and false acceptance rate (i.e., false alarm
probability) [32]. Two performance measures, viz., % Equal Error Rate (EER) and minimum
Detection Cost Function (DCF) have been employed for quantifying the error associated with
classification task. % EER corresponds to an optimal classification threshold at which both the
errors (i.e., false acceptance and missed detection) are equal while DCF calculates the minimum
cost associated with the errors by penalizing each error according to its relative significance. DCF
is given by,
DCF = C_miss * P_miss * P_true + C_fa * P_fa * P_false,    (17)

where P_miss and P_fa are the missed detection and false alarm probabilities, while C_miss and C_fa are the
costs associated with them. P_true and P_false denote the prior probabilities of true and false samples,
respectively, which in turn depend upon the number of genuine and imposter trials performed. We
have employed equal penalties for both errors (i.e., C_miss = C_fa = 1) for evaluating DCF. We
have also reported 95 % confidence intervals of classification accuracy to quote the statistical
significance of our experimental results. Confidence intervals have been estimated by parametric
techniques [33].
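The % EER and the minimum DCF of eq. (17) can be computed from genuine and imposter score lists by a simple threshold sweep. This is an illustrative sketch (function name and sweep strategy are assumptions), not the evaluation code used in the paper:

```python
import numpy as np

def eer_and_min_dcf(genuine, imposter, c_miss=1.0, c_fa=1.0):
    """Sweep the decision threshold over all observed scores; return the
    % EER and the minimum DCF of eq. (17), with priors taken from the
    numbers of genuine and imposter trials."""
    genuine, imposter = np.asarray(genuine, float), np.asarray(imposter, float)
    n_true, n_false = len(genuine), len(imposter)
    p_true = n_true / (n_true + n_false)
    p_false = 1.0 - p_true
    best_dcf, best_gap, eer = np.inf, np.inf, 0.0
    for thr in np.sort(np.concatenate([genuine, imposter])):
        p_miss = np.mean(genuine < thr)       # missed detection probability
        p_fa = np.mean(imposter >= thr)       # false alarm probability
        dcf = c_miss * p_miss * p_true + c_fa * p_fa * p_false   # eq. (17)
        best_dcf = min(best_dcf, dcf)
        if abs(p_miss - p_fa) < best_gap:     # operating point where both errors are equal
            best_gap = abs(p_miss - p_fa)
            eer = 100.0 * (p_miss + p_fa) / 2.0
    return eer, best_dcf
```

For perfectly separated score distributions (e.g., genuine scores all above imposter scores), both the EER and the minimum DCF are zero, as expected.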
5. EXPERIMENTAL RESULTS
In this section, experiments are performed to evaluate the proposed feature set for various
experimental evaluation factors such as cross-validation, effect of feature dimension, number of
sub-band filters and robustness against signal degradations. The details of these experiments and
analysis of results are presented in next sub-sections.
5.1. Fricative Classification using CFCC and MFCC
Using a 13-dimensional feature vector (for both CFCC and MFCC feature sets), the following three
classification tasks are performed on 2-fold cross-validated data:
1. Modeling sibilants and non-sibilants as different classes,
2. Modeling fricatives within sibilants and non-sibilants as different classes (e.g., /s/ vs. /sh/
and /f/ vs. /th/),
3. Modeling each kind of fricative sound as a different class.
Table 3 shows the overall classification results for above classification tasks followed by
individual class analysis depicted via confusion matrices (shown in Table 4 -Table11).
Corresponding DET curves have been shown in Fig. 9, Fig. 10 and Fig. 11, respectively.
Following observations could be made from the results.
a. CFCC features perform consistently superior to the baseline feature set (i.e., MFCC) in all
three classification tasks mentioned above (Table 3 to Table 11).
b. CFCC improves the overall % classification accuracy of sibilant vs. non-sibilant
classification (i.e., 92.01 %, as shown in Table 3) by improving the rate of identifying
genuine non-sibilant samples (i.e., 90.15 %, as shown in Table 5), while genuine sibilant
samples have been identified equally well using both MFCC and CFCC feature sets (Table
4 and Table 5). The DET curve (shown in Fig. 9) indicates that CFCC performs better than
MFCC at all operating points of the curve (i.e., by varying the classification threshold),
reducing % EER by 6.37 %.
c. Classification within sibilant class is much more accurate than within non-sibilant class in
case of both feature sets (i.e., MFCC and CFCC). Furthermore, classification accuracy
within sibilant class is almost same for both features, while % EER has been significantly
reduced in case of CFCC (by 5.37 %) suggesting that overlapping score distribution of
genuine and imposter test samples in case of MFCC has been considerably reduced by
using proposed CFCC (Table 3, Table 8 and Table 9, Fig. 10(b)).
d. Though classification accuracy within non-sibilant class has been improved in case of
CFCC (because of better identification of genuine /th/ test samples), the % EER is much
higher in case of both features (Table 3, Table 6 and Table 7, Fig. 10(a)).
e. Individual classification analysis of all fricatives also shows the effectiveness of proposed
feature set to better identify genuine /th/ test samples than MFCC resulting in overall
superior performance (Table 3, Table 10 and Table 11). The DET curve (shown in Fig. 11) also
depicts the superiority of CFCC which performs better than MFCC for most of the
operating points of the DET curve (of varying classification threshold) reducing % EER by
6.59 %.
Table 3: Comparison of classification results using CFCC and MFCC

Task | Average % classification accuracy [95 % confidence interval] (MFCC / CFCC) | % EER (MFCC / CFCC) | Minimum DCF (MFCC / CFCC)
Sibilants vs. Non-sibilants | 89.44 [86.62, 92.26] / 92.01 [89.52, 94.5] | 27.91 / 21.54 | 0.2780 / 0.2121
/f/ vs. /th/ (i.e., within non-sibilant class) | 76.42 [70.15, 82.69] / 83.18 [77.66, 88.7] | 31.77 / 25.52 | 0.3151 / 0.2782
/s/ vs. /sh/ (i.e., within sibilant class) | 96.45 [94.28, 98.62] / 97.55 [95.74, 99.36] | 21.14 / 15.77 | 0.13 / 0.10
All four fricatives (i.e., /f/, /s/, /sh/, /th/) | 85.73 [82.52, 88.94] / 89.14 [86.29, 92] | 26.37 / 19.78 | 0.2148 / 0.1549
To summarize, sibilants are classified accurately by using both feature sets, MFCC and CFCC.
Interestingly, within non-sibilants, /f/ is classified equally well with both feature sets; however,
classification accuracy of /th/ is much higher in the case of CFCC as compared to MFCC. The
reason for this could be the large spectral variation in the /th/ sound. The /f/ sound is found to occupy weak
spectral resonances around 1.5 kHz and 8.5 kHz. However, such energy concentration is not
observed consistently across all the /th/ test samples. On the other hand, the spectral distribution of the /th/
sound is highly variable (especially above 8 kHz) across different speakers and contexts. As
CFCC incorporates cochlear filters and several processes involved in auditory perception of
sound (e.g., neural firings, nerve spike density, etc.), the spectral variability in the /th/ sound may
be better modeled (as it happens in the human auditory system) by CFCC, resulting in a considerable
increase in classification accuracy of /th/ as compared to MFCC.
Table 4: Confusion matrix showing % classification accuracy for sibilant vs. non-sibilant
classification using MFCC features

Actual \ Identified | Non-sibilants | Sibilants
Non-sibilants | 83.68 | 16.32
Sibilants | 6.96 | 93.05

Table 5: Confusion matrix showing % classification accuracy for sibilant vs. non-sibilant
classification using CFCC features

Actual \ Identified | Non-sibilants | Sibilants
Non-sibilants | 90.15 | 9.85
Sibilants | 6.82 | 93.18
Fig.9. DET curves for sibilant vs. non-sibilant classification using baseline and proposed feature sets.
Table 6: Confusion matrix showing % classification accuracy of classification within
non-sibilants using MFCC

Actual \ Identified | /f/ | /th/
/f/ | 85.34 | 14.66
/th/ | 36.46 | 63.54

Table 7: Confusion matrix showing % classification accuracy of classification within
non-sibilants using CFCC

Actual \ Identified | /f/ | /th/
/f/ | 86.11 | 13.89
/th/ | 21.04 | 78.96

Fig. 10. DET curves for classification using baseline (MFCC) and proposed (CFCC) feature sets (a)
within non-sibilant class (i.e., /f/ vs. /th/) (b) within sibilant class (i.e., /s/ vs. /sh/).
Table 8: Confusion matrix showing % classification accuracy of classification within
sibilants using MFCC

Actual \ Identified | /s/ | /sh/
/s/ | 95.66 | 4.34
/sh/ | 2.60 | 97.40

Table 9: Confusion matrix showing % classification accuracy of classification within
sibilants using CFCC

Actual \ Identified | /s/ | /sh/
/s/ | 96.64 | 3.36
/sh/ | 1.34 | 98.66

Table 10: Confusion matrix showing % classification accuracy of unvoiced fricative
classification using MFCC

Actual \ Identified | /f/ | /th/ | /s/ | /sh/
/f/ | 84.95 | 7.64 | 3.8 | 2.64
/th/ | 24.49 | 56.49 | 10.56 | 7.04
/s/ | 2.39 | 2.98 | 92.69 | 1.94
/sh/ | 2.28 | 1.41 | 1.69 | 94.60

Table 11: Confusion matrix showing % classification accuracy of unvoiced fricative
classification using CFCC

Actual \ Identified | /f/ | /th/ | /s/ | /sh/
/f/ | 82.83 | 15.19 | 1.9 | 1.15
/th/ | 22.63 | 72.50 | 2.48 | 0.98
/s/ | 0.625 | 1.8 | 95.77 | 1.81
/sh/ | 1.62 | 0.82 | 1.73 | 95.83

Fig. 11. DET curves for unvoiced fricative classification (for four classes, viz., /f/, /th/, /s/ and /sh/) using
MFCC and CFCC.
16. International Journal on Natural Language Computing (IJNLC) Vol. 3, No.4, August 2014
36
5.2. Analysis of data independence via 4-fold cross-validation
Classification results should not be data-dependent (i.e., specific to a particular set of training and testing samples); rather, they should be consistent for any dataset as long as the datasets are valid (i.e., represent samples from the respective classes). In this paper, this is ensured by evaluating classification results using 4-fold cross-validation. The data for each fricative class is randomly divided into 4 sets (as shown in Table 12), and each set is used once for testing while the remaining sets are used for training. Four such trials have been performed, and the corresponding experimental results for individual fricative classification are shown in Table 13 and Fig. 12. Table 13 shows the overall classification results for each fold, while the results for each fricative (averaged over all 4 folds) are shown in Fig. 12.
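The fold construction described above can be sketched as follows. This is an illustrative partition scheme, not the authors' exact code; the random seed and the split routine are assumptions.

```python
import numpy as np

def four_fold_splits(n_samples, n_folds=4, seed=0):
    """Randomly partition sample indices into n_folds disjoint test sets;
    each set is used once for testing while the rest form the training set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, n_folds)
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, test

# e.g. a class with 143 samples (like /th/) yields test sets of sizes
# 36/36/36/35, a near-equal split of the kind shown in Table 12
fold_sizes = [len(test) for _, test in four_fold_splits(143)]
```

Note that `numpy.array_split` tolerates a sample count that is not divisible by the number of folds, which is why the test-set sizes in Table 12 differ by at most one.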
CFCC proves to be the better front-end feature set for classification as the training and testing datasets are varied across the 4 folds (as shown in Table 13). It is also clear that both % EER and minimum DCF are reduced in the 4-fold cross-validation analysis, with a slight reduction in accuracy as well, compared to the 2-fold cross-validation analysis performed in Section 5.1 (as shown in Table 3). One possible reason for this difference is the trade-off between the number of training and testing samples: only half of the total samples were used for training in the 2-fold analysis, whereas 75 % of the samples are used for training in the 4-fold case, leading to better estimation of the HMM parameters.
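For reference, the % EER reported here corresponds to the operating point on the DET curve where the false-acceptance rate (FAR) and false-rejection rate (FRR) coincide. A minimal sketch of computing such a value from classifier scores follows; the score arrays are hypothetical placeholders, not the paper's data.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Sweep decision thresholds over the pooled scores and return the
    rate at the point where FAR and FRR are closest to each other."""
    best = (2.0, 1.0)  # (|FAR - FRR|, EER candidate)
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)  # impostor scores accepted
        frr = np.mean(genuine < t)    # genuine scores rejected
        best = min(best, (abs(far - frr), (far + frr) / 2))
    return best[1]

# perfectly separated scores give 0 % EER
eer = equal_error_rate(np.array([0.9, 0.8, 0.7]), np.array([0.1, 0.2, 0.3]))
```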
Table 12. Division of database into four sets via 4-fold cross-validation

    Fricative   # of test samples in   # of test samples in   # of test samples in   # of test samples in   Total # of
                Dataset 1              Dataset 2              Dataset 3              Dataset 4              samples
    /f/              52                     52                     52                     52                  208
    /th/             36                     35                     36                     36                  143
    /s/              77                     76                     76                     76                  305
    /sh/             63                     64                     63                     64                  254

Table 13. % classification accuracy for different training and testing sets (using 4-fold cross-validation) using CFCC and MFCC for classification of /f/, /th/, /s/ and /sh/

    Results                       Fold-1            Fold-2            Fold-3            Fold-4            Average
                               MFCC    CFCC      MFCC    CFCC      MFCC    CFCC      MFCC    CFCC      MFCC    CFCC
    % classification accuracy  84.11   87.00     83.67   89.13     85.93   90.28     87.58   85.65     85.32   88.01
    % EER                      24.64   18.07     27.75   18.94     24.86   16.75     26.60   19.60     25.96   18.34
    Minimum DCF                0.1937  0.148     0.209   0.141     0.2050  0.141     0.196   0.151     0.2010  0.145
Fig. 12. 4-fold averaged classification accuracy for individual fricative class for classification of /f/, /th/, /s/ and /sh/ using CFCC.
However, this is accompanied by a less conclusive classification analysis, as the number of testing samples is reduced. The averaged individual fricative class accuracy (shown in Fig. 12) reveals a significant difference in the accuracy of the non-sibilant /th/ between CFCC and MFCC features (i.e., 74.53 % for CFCC vs. 54.53 % for MFCC), confirming the dataset independence of the experimental results reported in Section 5.1.
5.3. Effect of number of sub-band filters and feature dimensions
Both the proposed (CFCC) and baseline (MFCC) features are evaluated with different numbers of sub-band filters applied to the corresponding spectral information, in order to estimate the optimum number of Mel and cochlear filters required to capture the distinct acoustic characteristics of each class. In wavelet analysis, there is always a trade-off between the number of sub-band filters used and the associated computational complexity. Since more filters tend to provide more resolution (in both the time- and frequency-domains), it is intuitive that this number should be chosen based on the particular application (i.e., the minimum number providing sufficient temporal and spectral detail). Initially, we varied the number of sub-band filters used to estimate the feature vector along with the dimension of the feature set; in particular, if the number of sub-band filters used is N, then the dimension of the feature vector is also kept at N. Fig. 13 (a) shows the plot of % classification accuracy vs. number of sub-band filters (with fixed feature dimension), whereas Fig. 13 (b) shows the plot of % classification accuracy vs. dimension of the feature vector (with a fixed number of sub-band filters).
Fig. 13. (a) % classification accuracy with variation in the number of filters employed and with fixed dimension of the feature vector, for classification of /f/, /th/, /s/ and /sh/; (b) % classification accuracy by varying the feature dimension with a fixed number of filters, for classification of /f/, /th/, /s/ and /sh/.
A feature dimension of 13 (with 13 cochlear sub-band filters) is found to be optimum for both CFCC and MFCC features, as both show near-maximum classification accuracy there (i.e., 89.14 % for CFCC and 85.73 % for MFCC when the number of filters is varied while keeping the dimension of the feature vector fixed). Hence, all the other experiments reported in this work have been performed using 13-dimensional feature vectors for both CFCC and MFCC. In the next experiment, we fixed the number of sub-band filters and reduced the number of cepstral coefficients (i.e., the feature dimension) from 13 in order to examine how many cepstral coefficients are vital. Fig. 13 (b) shows the results obtained as the feature dimension is varied alone (i.e., with a fixed number of sub-band filters). It is observed that employing only 6 cepstral coefficients of CFCC still yields considerable classification accuracy in both experiments (i.e., 86.48 % when 6 filters are employed, and 86.77 % when the number of filters is fixed at 13), followed by a rapid fall in accuracy on reducing the feature dimension further. Therefore, it can be concluded that these 6 cochlear filters provide enough spectral resolution to capture the distinctive spectral characteristics of the given unvoiced fricatives. The impulse and frequency responses of these 6 cochlear filters have been discussed in Section 3 (as shown in Fig. 3 and Fig. 4, respectively).
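The truncation experiment above amounts to keeping only the low-order cepstral coefficients after the decorrelating DCT. A minimal sketch follows; the log-energy values are placeholder data, whereas the actual CFCC pipeline derives them from the cochlear filter-bank outputs.

```python
import numpy as np

def dct_ii(x):
    """Type-II DCT: the transform that converts log sub-band energies
    into cepstral coefficients."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                     for k in range(N)])

# 13 log sub-band energies -> 13 cepstral coefficients,
# then retain only the first 6 (the dimension found sufficient above)
log_energies = np.log(np.arange(1, 14, dtype=float))  # placeholder values
cepstrum = dct_ii(log_energies)
cepstrum_truncated = cepstrum[:6]
```

Discarding the high-order coefficients removes fine spectral detail while keeping the broad spectral envelope, which is why accuracy degrades only gradually down to 6 coefficients.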
5.4. Robustness under signal degradation conditions
To study the robustness of the proposed feature set under noisy conditions, white noise was added to the testing samples of fricative sounds at various SNR levels, while training is performed with clean fricative samples. The white noise samples are obtained from the NOISEX-92 database [29] (which has a sampling frequency of 19.98 kHz). These noise samples have been up-sampled to 44.1 kHz so that the up-sampled white noise contains all frequencies up to 22.05 kHz. Analysis is performed on these test samples using both MFCC and CFCC features, starting from clean conditions and at SNR levels varying from 15 dB to -5 dB in steps of 5 dB. Fig. 14 shows the performance of both features at the various SNR levels. Though overall classification accuracy decreases for both features, the decrease is much steeper with MFCC, whose accuracy falls to 46.93 % at an SNR of 5 dB while CFCC accuracy remains at 77.46 %. Similar behavior is observed in % EER, which increases considerably with SNR degradation for MFCC (from 26.37 % EER in clean conditions to 40.49 % EER at an SNR of 5 dB), while the increase is less steep for CFCC (from 19.78 % EER in clean conditions to 26.85 % EER at an SNR of 5 dB).
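The degradation procedure can be sketched as follows: scale the noise so that the mixture attains the target SNR, then add it to the clean signal. This is a minimal illustration; the paper uses up-sampled NOISEX-92 white noise rather than the synthetic noise generated here.

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal power / scaled-noise power equals
    the requested SNR in dB, then return the noisy mixture."""
    noise = noise[:len(signal)]
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

rng = np.random.default_rng(0)
fs = 44100
clean = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s test tone
noisy = add_noise_at_snr(clean, rng.standard_normal(fs), snr_db=5.0)
```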
Fig. 14. (a) Degradation of average classification accuracy in the presence of additive white noise using baseline (MFCC) and proposed (CFCC) feature sets; (b) increase in classification error in the presence of white noise using baseline (MFCC) and proposed (CFCC) feature sets.
As discussed in Section 3.3, the robustness of CFCC is due to the fact that:
1. CFCC employs smooth bell-shaped cochlear filters as opposed to triangular-shaped Mel filters,
2. CFCC is designed to mimic human auditory processing, which has an inherent noise suppression mechanism to take care of the mechanical vibration of the eardrum at the threshold of hearing,
3. CFCC employs the CWT, whose mother wavelet aids noise suppression in the wavelet-domain.
Decreasing the SNR below 5 dB results in a rapid fall of accuracy for both feature domains, as the fricative sounds are almost completely masked by the added white noise and the front-end features no longer reflect distinct acoustic characteristics at such high noise levels.
6. SUMMARY AND CONCLUSIONS
The application of recently developed auditory-based cochlear filters for identifying spectral cues in the unvoiced class of fricatives has been proposed. The study was motivated by the need to develop effective acoustic cues using the auditory transform, given the similarity of the auditory transform to the human cochlear response, thereby distinguishing effectively between fricative sounds. Our experimental results indicate that the proposed CFCC features outperform MFCC features in both clean and noisy conditions. One possible limitation of this study is that classification depends solely on the spectral characteristics of manually segmented fricative sounds. Including contextual information may result in better classification, since the proposed CFCC feature set is itself modeled on the human auditory system, and contextual information greatly helps human listeners perceive fricative utterances [30]. Global optimization of the HMM parameters is another issue, as the Baum-Welch re-estimation algorithm guarantees only local optimization. Auditory transform-based CFCC features present an alternative to state-of-the-art front-end features (viz., MFCC) for robust phoneme classification. Our future research will be directed towards extending the present study to the application of the proposed robust feature (i.e., CFCC) in a phoneme identification task.
REFERENCES:
[1] Fant, G., Acoustic Theory of Speech Production, Mouton, The Hague, 1960.
[2] Stevens, K.N., Acoustic Phonetics (Current Studies in Linguistics), M.I.T. Press, 1999.
[3] J. D. Markel and A. H. Gray Jr., Linear Prediction of Speech, Springer-Verlag, 1976.
[4] Q. Li, An auditory-based transform for audio signal processing, Proc. IEEE Workshop App. Signal
Process. Audio Acoust., New Paltz, NY, pp. 181–184, Oct. 2009.
[5] Qi Li, An auditory-based feature extraction algorithm for robust speaker identification under
mismatched conditions, IEEE Trans. on Audio, Speech and Lang. Process., vol. 19, no. 6, pp.1791-
1801, Aug. 2011.
[6] McCasland, G. P., Noise intensity and spectrum cues for spoken fricatives, J. Acoust. Soc. Am.
Suppl., vol. 65, pp. S78–S79, 1979.
[7] Behrens, S. and S. E. Blumstein, Acoustic characteristics of English voiceless fricatives: a descriptive
analysis, J. Phonetics, vol. 16, no.3, pp. 295–298, 1988.
[8] Stevens, K. N., Evidence for the role of acoustic boundaries in the perception of speech sounds, J.
Acoust. Soc. Am., vol. 69, no. S1, pp. S116-S116, 1981.
[9] Behrens, S. and S. E. Blumstein, On the role of the amplitude of the fricative noise in the perception
of place of articulation in voiceless fricative consonants, J. Acoust. Soc. Am., vol. 84, no. 3, pp. 861–
867, 1988.
[10] Jongman, A., Duration of fricative noise required for identification of English fricatives, J. Acoust.
Soc. Am., vol. 85, no. 4, pp. 1718–1725, 1989.
[11] Jongman, A., R. Wayland, and S. Wong, Acoustic characteristics of English fricatives, J. Acoust. Soc.
Am., vol. 108, no.3, pp. 1252–1263, 2000.
[12] Hughes, G. W. and M. Halle, Spectral properties of fricative consonants, J. Acoust. Soc. Am., vol. 28,
no.2, pp. 303–310, 1956.
[13] Strevens, P., Spectra of fricative noise in human speech, Lang. Speech, vol. 3, no.1, pp. 32–49,
1960.
[14] Pentz, A., H. R. Gilbert, and P. Zawadzki, Spectral properties of fricative consonants in children, J.
Acoust. Soc. Am., vol. 66, no. 6, pp. 1891–1893, 1979.
[15] Nissen, S., An acoustic analysis of voiceless obstruents produced by adults and typically developing
children, Ph.D. Thesis, Ohio State University, Columbus, OH, 2003.
[16] S. Mallat, A Wavelet Tour of Signal Processing, 3rd Ed., New York: Academic, 2007.
[17] Qi Li, An auditory-based transform for audio signal processing, IEEE workshop on applications of
signal processing to audio and acoustics – WASPAA, pp. 181-184, 2009.
[18] J.L. Flanagan, Speech Analysis, Synthesis and Perception, 2nd Ed., Springer-Verlag, New York,
1972.
[19] R.K. Potter, G.A. Kopp, and H.C. Green, Visible Speech, D.Van Nostrand Co., New York, 1947.
Republished by Dover Publications, Inc., 1966.
[20] S. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word
recognition in continuously spoken sentences, IEEE Trans. on Acoustics, Speech and Signal Process.,
vol. 28, no 4, pp. 357-366, Aug. 1980.
[21] H. M. Teager and S. M. Teager, Evidence for nonlinear production mechanisms in the vocal tract, in
Speech Production and Speech Modeling, Norwell, MA: Kluwer, vol. 55, pp. 241–261, 1989.
[22] Brownell, W. E., Bader, C. R., Bertrand, D. and Ribaupierre, Y. d. , Evoked Mechanical Responses of
Isolated Cochlear Outer Hair Cells, Science, vol. 227, pp. 194-196, 1985.
[23] Helmholtz, H. L. F. v. On the Sensations of Tone, Dover Publications, Inc., New York, NY, 1954.
[24] S. Mallat, A Wavelet Tour of Signal Processing, 3rd Ed., New York: Academic, 2007.
[25] Jaffard, S., Pointwise smoothness, two-microlocalization and wavelet coefficients, Publications
Mathematiques, vol. 35, pp. 155–168, 1991.
[26] Audacity software: Available Online: http://audacity.sourceforge.net/ {Last accessed : July 22,
2013}.
[27] CHAINS Corpus: Available online: http://chains.ucd.ie/ftpaccess.php .{Last accessed : July 22,2013}.
[28] Marija Tabain and Catherine Watson, A study on classification of fricatives, 6th Australian
International Conference on Speech Science and Technology, Adelaide, pp. 623-628, Dec. 1996.
[29] White Noise Source: NOISEX-92 database , Available online :
http://spib.rice.edu/spib/data/signals/noise/white.html {Last Accessed : July 22, 2013}.
[30] Brian C. J. Moore, An Introduction to the Psychology of Hearing, Academic Press, 4th Ed., 1997.
[31] Frid, A. and Lavner, Y., Acoustic-phonetic analysis of fricatives for classification using SVM based
algorithm, 26th IEEE Convention of Electrical and Electronics Engineers in Israel (IEEEI'10),
pp. 751-755, 2010.
[32] A.F. Martin, G. Doddington, T. Kamm, M. Ordowski and M. Przybocki, The DET curve in
assessment of detection error performance, Proc. EUROSPEECH’97, Rhodes Greece, vol.4, pp.1899-
1903, Sept. 1997.
[33] Bolle R.M., Pankanti S., Ratha N.K., Evaluation techniques for biometrics-based authentication
systems (FRR), Proc. 15th International Conference on Pattern Recognition , vol.2, pp.831-837,
2000.