T Silva, D D Karunaratna, G N Wikramanayake, K P Hewagamage, G K A Dias (2004) "Speaker Search and Indexing for Multimedia Databases" In:6th International Information Technology Conference, Edited by:V.K. Samaranayake et al. pp. 157-162. Infotel Lanka Society, Colombo, Sri Lanka: IITC Nov 29-Dec 1, ISBN: 955-8974-01-3
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...IDES Editor
This document summarizes a research paper that proposes a multimodal biometric system using palmprint and speech signals. It extracts features from each modality using different methods. For speech, it uses Subband Cepstral Coefficients extracted via a wavelet packet transform. For palmprint, it uses a Modified Canonical Form method. The features are fused at the score level using a weighted sum rule. The system is tested on a database of over 300 subjects, and results show improved recognition rates compared to single modalities.
We propose a model for carrying out deep learning based multimodal sentiment analysis. The MOUD dataset is taken for experimentation purposes. We developed two parallel text based and audio basedmodels and further, fused these heterogeneous feature maps taken from intermediate layers to complete thearchitecture. Performance measures–Accuracy, precision, recall and F1-score–are observed to outperformthe existing models.
A Novel Optimum Technique for JPEG 2000 Post Compression Rate Distortion Algo...IDES Editor
The new technique we proposed in this paper based
on Hidden Markov Model in the field of post compression rate
distortion algorithms certainly meet the requirements of high
quality still images. The existing technology has been
extensively applied in modern image processing. Development
of image compression algorithms is becoming increasingly
important for obtaining a more informative image from several
source images captured by different modes of imaging systems
or multiple sensors. The JPEG 2000 image compression
standard is very sensitive to errors. The JPEG2000 system
provides scalability with respect to quality, resolution and
color component in the transfer of images. But some of the
applications need certainly qualitative images at the output
ends. In our architecture the Proto-object also introduced as
the input ant bit rate allocation and rate distortion has been
discussed for the output image with high resolution. In this
paper, we have also discussed our novel response dependent
condensation image compression which has given scope to go
for this post compression Rate Distortion Algorithm (PCRD)
of JPEG 2000 standard. This proposed technique outperforms
the existing methods in terms of increasing efficiency,
optimum PSNR values at different bpp levels. The proposed
technique involves Hidden Markov Model to meet the
requirements for higher scalability and also to increase the
memory storage capacity.
Speech Emotion Recognition is a recent research topic in the Human Computer Interaction (HCI) field. The need has risen for a more natural communication interface between humans and computer, as computers have become an integral part of our lives. A lot of work currently going on to improve the interaction between humans and computers. To achieve this goal, a computer would have to be able to distinguish its present situation and respond differently depending on that observation. Part of this process involves understanding a user‟s emotional state. To make the human computer interaction more natural, the objective is that computer should be able to recognize emotional states in the same as human does. The efficiency of emotion recognition system depends on type of features extracted and classifier used for detection of emotions. The proposed system aims at identification of basic emotional states such as anger, joy, neutral and sadness from human speech. While classifying different emotions, features like MFCC (Mel Frequency Cepstral Coefficient) and Energy is used. In this paper, Standard Emotional Database i.e. English Database is used which gives the satisfactory detection of emotions than recorded samples of emotions. This methodology describes and compares the performances of Learning Vector Quantization Neural Network (LVQ NN), Multiclass Support Vector Machine (SVM) and their combination for emotion recognition.
The document summarizes an automatic text extraction system for complex images. The system uses discrete wavelet transform for text localization. Morphological operations like erosion and dilation are used to enhance text identification and segmentation. Text regions are segmented using connected component analysis and properties like area and bounding box shape. The extracted text is recognized and shown in a text file. The system allows modifying the recognized text and shows better performance than existing techniques.
A novel automatic voice recognition system based on text-independent in a noi...IJECEIAES
Automatic voice recognition system aims to limit fraudulent access to sensitive areas as labs. Our primary objective of this paper is to increasethe accuracy of the voice recognition in noisy environment of the Microsoft Research (MSR) identity toolbox. The proposed system enabled the user tospeak into the microphone then it will match unknown voice with other human voices existing in the database using a statistical model, in order togrant or deny access to the system. The voice recognition was done in twosteps: training and testing. During the training a Universal BackgroundModel as well as a Gaussian Mixtures Model: GMM-UBM models arecalculated based on different sentences pronounced by the human voice (s) used to record the training data. Then the testing of voice signal in noisyenvironment calculated the Log-Likelihood Ratio of the GMM-UBM models in order to classify user's voice. However, before testing noise and de-noisemethods were applied, we investigated different MFCC features of the voiceto determine the best feature possible as well as noise filter algorithmthat subsequently improved the performance of the automatic voicerecognition system.
Significant Data Hiding through Discrete Wavelet Transformation ApproachEswar Publications
The methods of communication of invisible information have become need in the today’s digital era. The network connectivity and high speed devices made easy passing massive data instantly. As boom of the huge data transmission has taken place due to easy use of the technology, the protection of the data has become prime issue. Steganography hides messages inside some other digital media. Cryptography, on the other hand obscures the content of the message. We proposed a novel integration of an incorporating text and image steganography to find a solution for improve security and protect data. The proposed methods shows a high level of efficiency and robustness by combining text and image which involves the scheme of discrete wavelet transformation combining text and image by secretly embeds encrypted secrete message text data (cipher text) or text image in the content of a digital image. A comparative study of the different techniques has been illustrated by computing Mean square error (MSE) and Peak Signal to Noise Ratio (PSNR).
The document discusses the relational model of data for large shared data banks. It introduces the concept of n-ary relations and the universal data sublanguage as an alternative to tree-structured files or network models. The relational model provides independence between data representation and programs/queries by removing ordering, indexing, and access path dependencies from the user's data model. This allows changes to the internal data representation without affecting user activities.
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...IDES Editor
This document summarizes a research paper that proposes a multimodal biometric system using palmprint and speech signals. It extracts features from each modality using different methods. For speech, it uses Subband Cepstral Coefficients extracted via a wavelet packet transform. For palmprint, it uses a Modified Canonical Form method. The features are fused at the score level using a weighted sum rule. The system is tested on a database of over 300 subjects, and results show improved recognition rates compared to single modalities.
We propose a model for carrying out deep learning based multimodal sentiment analysis. The MOUD dataset is taken for experimentation purposes. We developed two parallel text based and audio basedmodels and further, fused these heterogeneous feature maps taken from intermediate layers to complete thearchitecture. Performance measures–Accuracy, precision, recall and F1-score–are observed to outperformthe existing models.
A Novel Optimum Technique for JPEG 2000 Post Compression Rate Distortion Algo...IDES Editor
The new technique we proposed in this paper based
on Hidden Markov Model in the field of post compression rate
distortion algorithms certainly meet the requirements of high
quality still images. The existing technology has been
extensively applied in modern image processing. Development
of image compression algorithms is becoming increasingly
important for obtaining a more informative image from several
source images captured by different modes of imaging systems
or multiple sensors. The JPEG 2000 image compression
standard is very sensitive to errors. The JPEG2000 system
provides scalability with respect to quality, resolution and
color component in the transfer of images. But some of the
applications need certainly qualitative images at the output
ends. In our architecture the Proto-object also introduced as
the input ant bit rate allocation and rate distortion has been
discussed for the output image with high resolution. In this
paper, we have also discussed our novel response dependent
condensation image compression which has given scope to go
for this post compression Rate Distortion Algorithm (PCRD)
of JPEG 2000 standard. This proposed technique outperforms
the existing methods in terms of increasing efficiency,
optimum PSNR values at different bpp levels. The proposed
technique involves Hidden Markov Model to meet the
requirements for higher scalability and also to increase the
memory storage capacity.
Speech Emotion Recognition is a recent research topic in the Human Computer Interaction (HCI) field. The need has risen for a more natural communication interface between humans and computer, as computers have become an integral part of our lives. A lot of work currently going on to improve the interaction between humans and computers. To achieve this goal, a computer would have to be able to distinguish its present situation and respond differently depending on that observation. Part of this process involves understanding a user‟s emotional state. To make the human computer interaction more natural, the objective is that computer should be able to recognize emotional states in the same as human does. The efficiency of emotion recognition system depends on type of features extracted and classifier used for detection of emotions. The proposed system aims at identification of basic emotional states such as anger, joy, neutral and sadness from human speech. While classifying different emotions, features like MFCC (Mel Frequency Cepstral Coefficient) and Energy is used. In this paper, Standard Emotional Database i.e. English Database is used which gives the satisfactory detection of emotions than recorded samples of emotions. This methodology describes and compares the performances of Learning Vector Quantization Neural Network (LVQ NN), Multiclass Support Vector Machine (SVM) and their combination for emotion recognition.
The document summarizes an automatic text extraction system for complex images. The system uses discrete wavelet transform for text localization. Morphological operations like erosion and dilation are used to enhance text identification and segmentation. Text regions are segmented using connected component analysis and properties like area and bounding box shape. The extracted text is recognized and shown in a text file. The system allows modifying the recognized text and shows better performance than existing techniques.
A novel automatic voice recognition system based on text-independent in a noi...IJECEIAES
Automatic voice recognition system aims to limit fraudulent access to sensitive areas as labs. Our primary objective of this paper is to increasethe accuracy of the voice recognition in noisy environment of the Microsoft Research (MSR) identity toolbox. The proposed system enabled the user tospeak into the microphone then it will match unknown voice with other human voices existing in the database using a statistical model, in order togrant or deny access to the system. The voice recognition was done in twosteps: training and testing. During the training a Universal BackgroundModel as well as a Gaussian Mixtures Model: GMM-UBM models arecalculated based on different sentences pronounced by the human voice (s) used to record the training data. Then the testing of voice signal in noisyenvironment calculated the Log-Likelihood Ratio of the GMM-UBM models in order to classify user's voice. However, before testing noise and de-noisemethods were applied, we investigated different MFCC features of the voiceto determine the best feature possible as well as noise filter algorithmthat subsequently improved the performance of the automatic voicerecognition system.
Significant Data Hiding through Discrete Wavelet Transformation ApproachEswar Publications
The methods of communication of invisible information have become need in the today’s digital era. The network connectivity and high speed devices made easy passing massive data instantly. As boom of the huge data transmission has taken place due to easy use of the technology, the protection of the data has become prime issue. Steganography hides messages inside some other digital media. Cryptography, on the other hand obscures the content of the message. We proposed a novel integration of an incorporating text and image steganography to find a solution for improve security and protect data. The proposed methods shows a high level of efficiency and robustness by combining text and image which involves the scheme of discrete wavelet transformation combining text and image by secretly embeds encrypted secrete message text data (cipher text) or text image in the content of a digital image. A comparative study of the different techniques has been illustrated by computing Mean square error (MSE) and Peak Signal to Noise Ratio (PSNR).
The document discusses the relational model of data for large shared data banks. It introduces the concept of n-ary relations and the universal data sublanguage as an alternative to tree-structured files or network models. The relational model provides independence between data representation and programs/queries by removing ordering, indexing, and access path dependencies from the user's data model. This allows changes to the internal data representation without affecting user activities.
This document provides course details for the Microwave Engineering course offered at Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal. The course code is EC-604 and it is worth 6 credits. It is a departmental core course offered at the third year level. The course contains 5 units covering topics such as microwave transmission systems, microwave networks and components, microwave solid state devices and applications, microwave vacuum tube devices, and microwave measurements. The document lists the course contents in detail for each unit along with 6 experiments related to microwave engineering concepts. It also provides 6 references for further reading.
Rate Distortion Performance for Joint Source Channel Coding of JPEG image Ove...CSCJournals
1) The document discusses joint source-channel coding (JSCC) schemes for transmitting JPEG images over AWGN channels.
2) It proposes using JSCC with unequal error protection (UEP), applying stronger channel coding to important DC coefficients and weaker coding to less important AC coefficients.
3) The goal is to improve received image quality compared to equal error protection (EEP) schemes by providing varying levels of protection according to data importance.
This document contains slides related to Chapter 3 of the textbook "Distributed Systems: Concepts and Design" covering networking and internetworking concepts. The slides include diagrams of network performance over distance, conceptual layering of protocol software, encapsulation in layered protocols, the OSI reference model layers, examples of protocols for each OSI layer, internetwork layers, routing in wide area networks, routing tables, a routing algorithm pseudocode, examples of network topologies, IP addressing structures, IP packet layout, network address translation, IPv6 improvements, mobile IP routing, and firewall configurations.
A Relational Model of Data for Large Shared Data Banksrenguzi
1. The paper introduces a relational model of data as a superior alternative to existing network and tree models for large shared databases. It allows data to be described in its natural form without imposing additional structure.
2. The relational model provides a basis for treating issues like redundancy, derivability, and consistency of data relations. It also allows clearer evaluation of existing data systems and representations.
3. Existing systems still impose dependencies between data representation and user interaction models related to ordering, indexing, and access paths that limit changes without impacting applications. The relational model aims to reduce these dependencies.
1. The document introduces a relational model of data as an alternative to tree-structured and network models for large shared databases.
2. Key advantages of the relational model include independence between data representation and user applications, and clearer treatment of issues like redundancy and consistency.
3. Existing systems still impose dependencies between data representation characteristics like ordering, indexing, and access paths, and user applications and queries. The relational model aims to reduce these dependencies.
Design and implementation of a java based virtual laboratory for data communi...IJECEIAES
Students in this modern age find engineering courses taught in the university very abstract and difficult, and cannot relate theoretical calculations to real life scenarios. They consequently lose interest in their coursework and perform poorly in their grades. Simulation of classroom concepts with simulation software like MATLAB, were developed to facilitate learning experience. This paper involves the development of a virtual laboratory simulation package for teaching data communication concepts such as coding schemes, modulation and filtering. Unlike other simulation packages, no prior knowledge of computer programming is required for students to grasp these concepts.
The document describes a method for enriching image labels through automatic annotation. The method involves extending existing image labels using synonyms from a thesaurus. Candidate extended labels are evaluated and filtered using three metrics: user log data, semantic vectors, and context vectors. Experimental results showed that the proposed label extension method improved search effectiveness and efficiency by bridging the gap between short image labels and textual queries. The method provides a simple but effective and efficient way to auto-annotate images without extensive manual labeling.
The growing trend of online image sharing and downloads today mandate the need for better encoding and
decoding scheme. This paper looks into this issue of image coding. Multiple Description Coding is an
encoding and decoding scheme that is specially designed in providing more error resilience for data
transmission. The main issue of Multiple Description Coding is the lossy transmission channels. This work
attempts to address the issue of re-constructing high quality image with the use of just one descriptor
rather than the conventional descriptor. This work compare the use of Type I quantizer and Type II
quantizer. We propose and compare 4 coders by examining the quality of re-constructed images. The 4
coders are namely JPEG HH (Horizontal Pixel Interleaving with Huffman Coding) model, JPEG HA
(Horizontal Pixel Interleaving with Arithmetic Encoding) model, JPEG VH (Vertical Pixel Interleaving
with Huffman Encoding) model, and JPEG VA (Vertical Pixel Interleaving with Arithmetic Encoding)
model. The findings suggest that the use of horizontal and vertical pixel interleavings do not affect the
results much. Whereas the choice of quantizer greatly affect its performance.
This document proposes a method for detecting, localizing, and extracting text from videos with complex backgrounds. It uses two techniques independently - corner metric detection and Laplacian filtering - to detect text regions. The results of the two techniques are then multiplied to reduce noise. Localized text is then binarized by determining seed pixels. The method aims to improve on existing approaches by combining two detection methods to more accurately extract text from videos while reducing noise.
Analysis of Impact of Channel Error Rate on Average PSNR in Multimedia TrafficIOSR Journals
Abstract : The performance of the multimedia traffic in Ad-Hoc networks is highly impacted with the Signal to Noise Ratio. The Average PSNR (Peak Signal to Noise Ratio) is an important parameter for the evaluation of multimedia traffic in Ad-Hoc Networks. With the increase of bandwidth of the channels, it becomes necessary to take care of other network parameters like PSNR and ASNR( Average Signal to Noise Ratio) .Enhanced bandwidth with higher channel error rates demand a careful analysis of signal to noise ratio for optimum performance. In this paper, we have evaluated the effect of channel error rate on Average PSNR for the MPEG-4 traffic in Ad-hoc Networks. Keywords: MANETs, Evalvid, MPEG-4, Fragmentation, PSNR
The document contains 14 figures summarizing key concepts of peer-to-peer systems and distributed hash tables. Figure 10.1 compares IP and overlay routing, noting differences in scale, load balancing, dynamics, fault tolerance, identification, and security. Figures 10.2 through 10.5 illustrate examples of peer-to-peer architectures like Napster, distributed hash tables, and object location services.
This document provides an overview of the applications of error-control coding over the past 50 years. It begins by discussing early applications in deep space communications, where coding provided significant power savings. The first coding scheme used for deep space, known as the Mariner code, achieved a power gain of 3.2 dB over uncoded BPSK but required higher bandwidth. More advanced codes were later developed that achieved gains close to the theoretical limits. The document then discusses how coding has been widely used in other areas such as satellite communications, data transmission, data storage, mobile communications and more to improve performance.
HGS-Assisted Detection Algorithm for 4G and Beyond Wireless Mobile Communicat...Rosdiadee Nordin
This document presents a novel hybrid algorithm called HGS for detecting transmitted data in 4G and beyond mobile communication systems. The HGS algorithm uses metaheuristic approaches to achieve optimal detection performance with low complexity. It aims to develop adaptive detection methods that have faster convergence speeds and require fewer parameters over time-varying MIMO channels. Simulation results show the HGS algorithm achieves better bit error rate than other detection methods like MMSE-MUD and GACE, with lower computational complexity. Future work could involve applying the metaheuristic approaches to systems using interleaved division multiple access or independent component analysis.
ROBUST IMAGE WATERMARKING METHOD USING WAVELET TRANSFORMsipij
In this paper a robust watermarking method operating in the wavelet domain for grayscale digital imagesis developed. The method first acomputes the differences between the watermark and the HH1 sub-band ofthe cover image values and then embed these differences in one of the frequency sub-bands. The resultsshow that embedding the watermark in the LH1 sub-band gave the best results. The results were evaluatedusing the RMSE and the PSNR of both the original and the watermarked image. Although the watermarkwas recovered perfectly in the ideal case, the addition of Gaussian noise, or compression of the imageusing JPEG with quality less than 100 destroys the embedded watermark. Different experiments werecarried out to test the performance of the proposed method and good results were obtained.
The document discusses clustering images based on their properties. Images are converted into intensity, contrast, Weibull and fractal images. Eight properties are calculated for each image type, including brightness, standard deviation, entropy, skewness, kurtosis, separability, spatial frequency and visibility. The properties are normalized and clustered using k-means clustering. Tables show normalized property values for different image types. The clustering groups similar images based on their discriminative properties.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONijma
This document summarizes a study that compares different acoustic feature extraction methods (LPC, MFCC, PLP) for a Bangla speech recognition system using LSTM neural networks. It finds that PLP outperforms MFCC and LPC based on statistical distance measurements of phoneme coefficients. PLP shows better distinction between phonemes compared to MFCC and LPC. While RNN/LSTM are inherently slow, combining PLP with faster networks like Transformers may improve performance for large datasets.
A robust audio watermarking in cepstrum domain composed of sample's relation ...ijma
This document proposes a robust audio watermarking technique that embeds watermarks in the cepstrum domain based on the relationship between mean values of consecutive sample groups. The host audio is divided into frames, which are then divided into four equal-sized sub-frames. The sub-frames are transformed to the cepstrum domain. Watermarks are embedded by either interchanging or updating the differences between the mean values of the first two sub-frames and last two sub-frames, selectively distorting sample values in the sub-frames. This allows imperceptible embedding while making extraction computationally simple without needing to know distortion amounts. Simulation results show the technique is robust and imperceptible against attacks.
This paper proposes an approach for indexing multimedia clips by speaker using audio segmentation, speaker recognition, and metadata indexing. Segmentation is done using Bayesian Information Criterion (BIC) and silence detection. Gaussian Mixture Models (GMMs) are trained for each speaker using Mel-Frequency Cepstral Coefficients (MFCC) as features. An ensemble of 3 GMMs is used to reduce errors from stochastic GMM training. Indexing utilizes sampled MFCC features from segments as metadata linked to speaker models. The system achieves a 20% true miss rate and 10% false alarm rate on segments 15-25 seconds, with performance decreasing for shorter segments.
Management of Evolving Constraints in a Computerised Engineering Design Envir...Gihan Wikramanayake
Presentation by Gihan Wikramanayake on 8 July 2004 at 23rd National Information Technology Conference of the Computer Society of Sri Lanka, Colombo, Sri Lanka.
Reference:
J S Goonethillake, G N Wikramanayake (2004) Management of Evolving Constraints in a Computerised Engineering Design Environment In: 23rd National Information Technology Conference 43-54 Computer Society of Sri Lanka Colombo, Sri Lanka: CSSL Jul 8-9, ISBN: 955-9155-12-1
Article: http://www.slideshare.net/wikramanayake/management-of-evolving-constraints-in-a-computerised-engineering-design-environment
Design and Development of a Resource Allocation Mechanism for the School Educ...Gihan Wikramanayake
Presentation by Gihan Wikramanayake on 1st Dec 2000 based on output of GEP2 project
Reference:
G N Wikramanayake (2000) Design and Development of a Resource Allocation Mechanism for the School Education Sector In: Annual Sessions, Faculty of Science, University of Colombo 19 UoC Dec, vol. 1
Article: http://www.slideshare.net/wikramanayake/design-and-development-of-a-resource-allocation-mechanism-for-the-school-education-sector
This document provides course details for the Microwave Engineering course offered at Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal. The course code is EC-604 and it is worth 6 credits. It is a departmental core course offered at the third year level. The course contains 5 units covering topics such as microwave transmission systems, microwave networks and components, microwave solid state devices and applications, microwave vacuum tube devices, and microwave measurements. The document lists the course contents in detail for each unit along with 6 experiments related to microwave engineering concepts. It also provides 6 references for further reading.
Rate Distortion Performance for Joint Source Channel Coding of JPEG image Ove...CSCJournals
1) The document discusses joint source-channel coding (JSCC) schemes for transmitting JPEG images over AWGN channels.
2) It proposes using JSCC with unequal error protection (UEP), applying stronger channel coding to important DC coefficients and weaker coding to less important AC coefficients.
3) The goal is to improve received image quality compared to equal error protection (EEP) schemes by providing varying levels of protection according to data importance.
This document contains slides related to Chapter 3 of the textbook "Distributed Systems: Concepts and Design" covering networking and internetworking concepts. The slides include diagrams of network performance over distance, conceptual layering of protocol software, encapsulation in layered protocols, the OSI reference model layers, examples of protocols for each OSI layer, internetwork layers, routing in wide area networks, routing tables, a routing algorithm pseudocode, examples of network topologies, IP addressing structures, IP packet layout, network address translation, IPv6 improvements, mobile IP routing, and firewall configurations.
A Relational Model of Data for Large Shared Data Banksrenguzi
1. The paper introduces a relational model of data as a superior alternative to existing network and tree models for large shared databases. It allows data to be described in its natural form without imposing additional structure.
2. The relational model provides a basis for treating issues like redundancy, derivability, and consistency of data relations. It also allows clearer evaluation of existing data systems and representations.
3. Existing systems still impose dependencies between data representation and user interaction models related to ordering, indexing, and access paths that limit changes without impacting applications. The relational model aims to reduce these dependencies.
1. The document introduces a relational model of data as an alternative to tree-structured and network models for large shared databases.
2. Key advantages of the relational model include independence between data representation and user applications, and clearer treatment of issues like redundancy and consistency.
3. Existing systems still impose dependencies between data representation characteristics like ordering, indexing, and access paths, and user applications and queries. The relational model aims to reduce these dependencies.
Design and implementation of a java based virtual laboratory for data communi...IJECEIAES
Students in this modern age find engineering courses taught in the university very abstract and difficult, and cannot relate theoretical calculations to real life scenarios. They consequently lose interest in their coursework and perform poorly in their grades. Simulation of classroom concepts with simulation software like MATLAB, were developed to facilitate learning experience. This paper involves the development of a virtual laboratory simulation package for teaching data communication concepts such as coding schemes, modulation and filtering. Unlike other simulation packages, no prior knowledge of computer programming is required for students to grasp these concepts.
The document describes a method for enriching image labels through automatic annotation. The method involves extending existing image labels using synonyms from a thesaurus. Candidate extended labels are evaluated and filtered using three metrics: user log data, semantic vectors, and context vectors. Experimental results showed that the proposed label extension method improved search effectiveness and efficiency by bridging the gap between short image labels and textual queries. The method provides a simple but effective and efficient way to auto-annotate images without extensive manual labeling.
The growing trend of online image sharing and downloads today mandate the need for better encoding and
decoding scheme. This paper looks into this issue of image coding. Multiple Description Coding is an
encoding and decoding scheme that is specially designed in providing more error resilience for data
transmission. The main issue of Multiple Description Coding is the lossy transmission channels. This work
attempts to address the issue of re-constructing high quality image with the use of just one descriptor
rather than the conventional descriptor. This work compare the use of Type I quantizer and Type II
quantizer. We propose and compare 4 coders by examining the quality of re-constructed images. The 4
coders are namely JPEG HH (Horizontal Pixel Interleaving with Huffman Coding) model, JPEG HA
(Horizontal Pixel Interleaving with Arithmetic Encoding) model, JPEG VH (Vertical Pixel Interleaving
with Huffman Encoding) model, and JPEG VA (Vertical Pixel Interleaving with Arithmetic Encoding)
model. The findings suggest that the use of horizontal and vertical pixel interleavings do not affect the
results much. Whereas the choice of quantizer greatly affect its performance.
This document proposes a method for detecting, localizing, and extracting text from videos with complex backgrounds. It uses two techniques independently - corner metric detection and Laplacian filtering - to detect text regions. The results of the two techniques are then multiplied to reduce noise. Localized text is then binarized by determining seed pixels. The method aims to improve on existing approaches by combining two detection methods to more accurately extract text from videos while reducing noise.
Analysis of Impact of Channel Error Rate on Average PSNR in Multimedia TrafficIOSR Journals
Abstract : The performance of the multimedia traffic in Ad-Hoc networks is highly impacted with the Signal to Noise Ratio. The Average PSNR (Peak Signal to Noise Ratio) is an important parameter for the evaluation of multimedia traffic in Ad-Hoc Networks. With the increase of bandwidth of the channels, it becomes necessary to take care of other network parameters like PSNR and ASNR( Average Signal to Noise Ratio) .Enhanced bandwidth with higher channel error rates demand a careful analysis of signal to noise ratio for optimum performance. In this paper, we have evaluated the effect of channel error rate on Average PSNR for the MPEG-4 traffic in Ad-hoc Networks. Keywords: MANETs, Evalvid, MPEG-4, Fragmentation, PSNR
The document contains 14 figures summarizing key concepts of peer-to-peer systems and distributed hash tables. Figure 10.1 compares IP and overlay routing, noting differences in scale, load balancing, dynamics, fault tolerance, identification, and security. Figures 10.2 through 10.5 illustrate examples of peer-to-peer architectures like Napster, distributed hash tables, and object location services.
This document provides an overview of the applications of error-control coding over the past 50 years. It begins by discussing early applications in deep space communications, where coding provided significant power savings. The first coding scheme used for deep space, known as the Mariner code, achieved a power gain of 3.2 dB over uncoded BPSK but required higher bandwidth. More advanced codes were later developed that achieved gains close to the theoretical limits. The document then discusses how coding has been widely used in other areas such as satellite communications, data transmission, data storage, mobile communications and more to improve performance.
HGS-Assisted Detection Algorithm for 4G and Beyond Wireless Mobile Communicat...Rosdiadee Nordin
This document presents a novel hybrid algorithm called HGS for detecting transmitted data in 4G and beyond mobile communication systems. The HGS algorithm uses metaheuristic approaches to achieve optimal detection performance with low complexity. It aims to develop adaptive detection methods that have faster convergence speeds and require fewer parameters over time-varying MIMO channels. Simulation results show the HGS algorithm achieves better bit error rate than other detection methods like MMSE-MUD and GACE, with lower computational complexity. Future work could involve applying the metaheuristic approaches to systems using interleaved division multiple access or independent component analysis.
ROBUST IMAGE WATERMARKING METHOD USING WAVELET TRANSFORMsipij
In this paper a robust watermarking method operating in the wavelet domain for grayscale digital imagesis developed. The method first acomputes the differences between the watermark and the HH1 sub-band ofthe cover image values and then embed these differences in one of the frequency sub-bands. The resultsshow that embedding the watermark in the LH1 sub-band gave the best results. The results were evaluatedusing the RMSE and the PSNR of both the original and the watermarked image. Although the watermarkwas recovered perfectly in the ideal case, the addition of Gaussian noise, or compression of the imageusing JPEG with quality less than 100 destroys the embedded watermark. Different experiments werecarried out to test the performance of the proposed method and good results were obtained.
The document discusses clustering images based on their properties. Images are converted into intensity, contrast, Weibull and fractal images. Eight properties are calculated for each image type, including brightness, standard deviation, entropy, skewness, kurtosis, separability, spatial frequency and visibility. The properties are normalized and clustered using k-means clustering. Tables show normalized property values for different image types. The clustering groups similar images based on their discriminative properties.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONijma
This document summarizes a study that compares different acoustic feature extraction methods (LPC, MFCC, PLP) for a Bangla speech recognition system using LSTM neural networks. It finds that PLP outperforms MFCC and LPC based on statistical distance measurements of phoneme coefficients. PLP shows better distinction between phonemes compared to MFCC and LPC. While RNN/LSTM are inherently slow, combining PLP with faster networks like Transformers may improve performance for large datasets.
A robust audio watermarking in cepstrum domain composed of sample's relation ...ijma
This document proposes a robust audio watermarking technique that embeds watermarks in the cepstrum domain based on the relationship between mean values of consecutive sample groups. The host audio is divided into frames, which are then divided into four equal-sized sub-frames. The sub-frames are transformed to the cepstrum domain. Watermarks are embedded by either interchanging or updating the differences between the mean values of the first two sub-frames and last two sub-frames, selectively distorting sample values in the sub-frames. This allows imperceptible embedding while making extraction computationally simple without needing to know distortion amounts. Simulation results show the technique is robust and imperceptible against attacks.
This paper proposes an approach for indexing multimedia clips by speaker using audio segmentation, speaker recognition, and metadata indexing. Segmentation is done using Bayesian Information Criterion (BIC) and silence detection. Gaussian Mixture Models (GMMs) are trained for each speaker using Mel-Frequency Cepstral Coefficients (MFCC) as features. An ensemble of 3 GMMs is used to reduce errors from stochastic GMM training. Indexing utilizes sampled MFCC features from segments as metadata linked to speaker models. The system achieves a 20% true miss rate and 10% false alarm rate on segments 15-25 seconds, with performance decreasing for shorter segments.
Management of Evolving Constraints in a Computerised Engineering Design Envir...Gihan Wikramanayake
Presentation by Gihan Wikramanayake on 8 July 2004 at 23rd National Information Technology Conference of the Computer Society of Sri Lanka, Colombo, Sri Lanka.
Reference:
J S Goonethillake, G N Wikramanayake (2004) Management of Evolving Constraints in a Computerised Engineering Design Environment In: 23rd National Information Technology Conference 43-54 Computer Society of Sri Lanka Colombo, Sri Lanka: CSSL Jul 8-9, ISBN: 955-9155-12-1
Article: http://www.slideshare.net/wikramanayake/management-of-evolving-constraints-in-a-computerised-engineering-design-environment
Design and Development of a Resource Allocation Mechanism for the School Educ...Gihan Wikramanayake
Presentation by Gihan Wikramanayake on 1st Dec 2000 based on output of GEP2 project
Reference:
G N Wikramanayake (2000) Design and Development of a Resource Allocation Mechanism for the School Education Sector In: Annual Sessions, Faculty of Science, University of Colombo 19 UoC Dec, vol. 1
Article: http://www.slideshare.net/wikramanayake/design-and-development-of-a-resource-allocation-mechanism-for-the-school-education-sector
Asia eBIT @ UCSC: Implementing the paradigm shift from Teaching to Learning t...Gihan Wikramanayake
G N Wikramanayake, K P Hewagamage, G I Gamage, A R Weerasinghe (2007) "Asia eBIT @ UCSC: Implementing the paradigm shift from Teaching to Learning through e-learning framework" In:25th National Information Technology Edited by:Chrisantha Silva et al. pp. 68-81. Computer Society of Sri Lanka, Colombo, Sri Lanka: CSSL Sep 19-20 ISBN: 978-955-9155-15-7
Improving B2B Transactions by Exploiting Business RegistriesGihan Wikramanayake
Manjith Gunatilaka, G N Wikramanayake, D D Karunaratna (2004) "Improving B2B Transactions by Exploiting Business Registries" In:23rd National Information Technology Conference, pp. 103-109. Computer Society of Sri Lanka, Colombo: CSSL Jul 8-9, ISBN: 955-9155-12-1
Improving Your Web Services Thorough Semantic Web TechniquesGihan Wikramanayake
J P Liyanage, G N Wikramanayake (2006) "Improving Your Web Services Thorough Semantic Web Techniques" In: 8th International Information Technology Conference on Innovations for a Knowledge Economy, pp. 14-23 Infotel Lanka Society, Colombo, Sri Lanka: IITC Oct 12-13, ISBN: 955-8974-04-8
Evaluation of English and IT skills of new entrants to Sri Lankan universitiesGihan Wikramanayake
Gihan N. Wikramanayake, Damitha D. Karunartna, Dilkushi S. Wettewe, "Evaluation of English and IT skills of new entrants to Sri Lankan universities", International Conference on Information and Educational Technology (ICIET), Mumbai, 15 Jan 2012.
This study presents our experiences in designing, implementing and deploying an on-line evaluation scheme to measure the English and information technology skills of new entrants to Sri Lankan universities at point of entry in 2011. Over 15,000 students from 25 districts of the country were subjected to the on-line evaluation. The test was
conducted by using a learning management system, in 24 consecutive days in twenty six centres scattered across the country. This paper sums up the experiences we gathered in conducting the evaluation of a larger group of students spread across a wide geographical area and the lessons learned.
Allocation of Educational Funds to Provinces: Options 1999Gihan Wikramanayake
Presentation made by Gihan Wikramanayake for the Development of a Norm-Based Cost Allocation Mechanism, Joint Workshop, Rural Bank & Staff Training College, Rajagiriya, 16th February 1999.
Management of Evolving Constraints in a Computerised Engineering Design Envir...Gihan Wikramanayake
1) The document discusses managing evolving constraints in engineering design. Constraints often change during the iterative design process due to changes in requirements, technology, costs, or performance goals.
2) It proposes a framework using Constraint Version Objects (CVOs) to independently capture changing constraints over time without modifying class definitions. Each CVO contains a set of constraints that versions of a design must satisfy.
3) The latest CVO created becomes the default CVO, and new versions are automatically validated against it. This allows different versions to adhere to different constraint sets over the evolution of the design process.
B P Manage, G N Wikramanayake (1998) "Integrated Sri Lankan University Information System" In: Conference, Exhibition and Business Directory of 1st International Information Technology Conference, pp. 43. Infotel Lanka Society, Colombo, Sri Lanka: IITC Oct 7-8
Re-Engineering Databases using Meta-Programming TechnologyGihan Wikramanayake
G N Wikramanayake (1997) "Re-engineering Databases using Meta-Programming Technology" In:16th National Information Technology Conference on Information Technology for Better Quality of Life Edited by:R. Ganepola et al. pp. 1-14. Computer Society of Sri Lanka, Colombo: CSSL Jul 11-13, ISBN 955-9155-05-9
The document discusses the degree progress pathways at the University of Colombo School of Computing for students pursuing a BIT (Bachelor of Information Technology) degree. It outlines a multi-year pathway that begins with an aptitude test and one-year diploma program, followed by a two-year higher diploma or certificate program, culminating in a three-year BIT degree program. It also provides statistics on enrollment numbers, pass rates, and employment outcomes at each stage from 2000 to 2006, showing increasing interest and participation in the BIT program over time.
G N Wikramanayake (2005) Impact of Digital Technology on Education In: 24th National Information Technology Conference 82-91 Computer Society of Sri Lanka Colombo, Sri Lanka: CSSL Aug 15-16, ISBN: 955-9155-13-X
Article: http://www.slideshare.net/wikramanayake/impact-of-digital-technology-on-education
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive functioning. Exercise boosts blood flow, releases endorphins, and promotes changes in the brain which help regulate emotions and stress levels.
Analysis and Development of Curriculum to Build the Foundation for eLearning ...Gihan Wikramanayake
The document discusses curriculum development for e-learning courses for a BIT degree program. It outlines the current issues with low registration, pass rates, and high dropout rates. Stage 1 of an e-learning pilot project is evaluated, finding weaknesses in learning content, assessment alignment, and lack of a virtual learning environment. The objectives of further curriculum development are then presented, including developing a program roadmap, implementing constructive alignment of syllabus, learning resources and assessment, and introducing a collaborative learning model based on constructivism and learning activities.
Sabesan Manivasakan, Risch Tore, G N Wikramanayake (2006) "Querying Mediated Web Services" In:8th International Information Technology Conference on Innovations for a Knowledge Economy, pp. 39-44. Infotel Lanka Society, Colombo, Sri Lanka: IITC Oct 12-13, ISBN 955-8974-04-8
Profile based Video segmentation system to support E-learningGihan Wikramanayake
S C Premaratne, D D Karunaratna, G N Wikramanayake, K P Hewagamage, G K A Dias (2004) "Profile Based Video Segmentation System to Support e-Learning" In:6th International Information Technology Conference, Edited by:V.K. Samaranayake et al. pp. 74-81. Infotel Lanka Society, Colombo, Sri Lanka: IITC Nov 29-Dec 1, ISBN: 955-8974-01-3
Comparative Study of Different Techniques in Speaker Recognition: ReviewIJAEMSJORNAL
The speech is most basic and essential method of communication used by person.On the basis of individual information included in speech signals the speaker is recognized. Speaker recognition (SR) is useful to identify the person who is speaking. In recent years speaker recognition is used for security system. In this paper we have discussed the feature extraction techniques like Mel frequency cepstral coefficient (MFCC), Linear predictive coding (LPC), Dynamic time wrapping (DTW), and for classification Gaussian Mixture Models (GMM), Artificial neural network (ANN)& Support vector machine (SVM).
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILEScscpconf
In this paper we have focused on an efficient feature selection method in classification of audio files.
The main objective is feature selection and extraction. We have selected a set of features for further
analysis, which represents the elements in feature vector. By extraction method we can compute a
numerical representation that can be used to characterize the audio using the existing toolbox. In this
study Gain Ratio (GR) is used as a feature selection measure. GR is used to select splitting attribute
which will separate the tuples into different classes. The pulse clarity is considered as a subjective
measure and it is used to calculate the gain of features of audio files. The splitting criterion is
employed in the application to identify the class or the music genre of a specific audio file from
testing database. Experimental results indicate that by using GR the application can produce a
satisfactory result for music genre classification. After dimensionality reduction best three features
have been selected out of various features of audio file and in this technique we will get more than
90% successful classification result.
In this paper we have focused on an efficient feature selection method in classification of audio files.
The main objective is feature selection and extraction. We have selected a set of features for further
analysis, which represents the elements in feature vector. By extraction method we can compute a
numerical representation that can be used to characterize the audio using the existing toolbox. In this
study Gain Ratio (GR) is used as a feature selection measure. GR is used to select splitting attribute
which will separate the tuples into different classes. The pulse clarity is considered as a subjective
measure and it is used to calculate the gain of features of audio files. The splitting criterion is
employed in the application to identify the class or the music genre of a specific audio file from
testing database. Experimental results indicate that by using GR the application can produce a
satisfactory result for music genre classification. After dimensionality reduction best three features
have been selected out of various features of audio file and in this technique we will get more than
90% successful classification result.
The advents in this technological era have resulted into enormous pool of information. This information is
stored at multiple places globally, in multiple formats. This article highlights a methodology for extracting
the video lectures delivered by experts in the domain of Computer Science by using Generalized Gamma
Mixture Model. The feature extraction is based on the DCT transformations. In order to propose the model,
the data set is pooled from the YouTube video lectures in the domain of Computer Science. The outputs
generated are evaluated using Precision and Recall.
System analysis and design for multimedia retrieval systemsijma
Due to the extensive use of information technology and the recent developments in multimedia systems, the
amount of multimedia data available to users has increased exponentially. Video is an example of
multimedia data as it contains several kinds of data such as text, image, meta-data, visual and audio.
Content based video retrieval is an approach for facilitating the searching and browsing of large
multimedia collections over WWW. In order to create an effective video retrieval system, visual perception
must be taken into account. We conjectured that a technique which employs multiple features for indexing
and retrieval would be more effective in the discrimination and search tasks of videos. In order to validate
this, content based indexing and retrieval systems were implemented using color histogram, Texture feature
(GLCM), edge density and motion..
Tim Malthus_Towards standards for the exchange of field spectral datasetsTERN Australia
This document discusses the development of standards for the exchange of field spectral datasets. It notes the importance of metadata for determining the quality and representativeness of spectral data obtained in the field. A workshop was held in 2012 to discuss best practices for data collection and exchange and key conclusions included the need for standards to facilitate accurate comparison across studies and the role of thorough metadata. Work is ongoing to enhance the SPECCHIO system for hosting spectral libraries and metadata and establishing it as the international tool for storage and exchange of spectral datasets.
RECURRENT FEATURE GROUPING AND CLASSIFICATION MODEL FOR ACTION MODEL PREDICTI...IJDKP
Content based retrieval has an advantage of higher prediction accuracy as compared to tagging based approach. However, the complexity in its representation and classification approach, results in lower processing accuracy and computation overhead. The correlative nature of the feature data are un-explored in the conventional modeling, where all the data features are taken as a set of feature values to give a decision. The recurrent feature class attribute is observed for the feature regrouping in action model prediction. In this paper a co-relative information, bounding grouping approach is suggested for action model prediction in CBMR application. The co-relative recurrent feature mapping results in faster retrieval process as compared to the conventional retrieval system.
RECURRENT FEATURE GROUPING AND CLASSIFICATION MODEL FOR ACTION MODEL PREDICTI...IJDKP
Content based retrieval has an advantage of higher prediction accuracy as compared to tagging based approach. However, the complexity in its representation and classification approach, results in lower processing accuracy and computation overhead. The correlative nature of the feature data are un-explored in the conventional modeling, where all the data features are taken as a set of feature values to give a decision. The recurrent feature class attribute is observed for the feature regrouping in action model prediction. In this paper a co-relative information, bounding grouping approach is suggested for action model prediction in CBMR application. The co-relative recurrent feature mapping results in faster retrieval process as compared to the conventional retrieval system.
Investigating Multi-Feature Selection and Ensembling for Audio Classificationijaia
Deep Learning (DL) algorithms have shown impressive performance in diverse domains. Among
them, audio has attracted many researchers over the last couple of decades due to some interesting
patterns–particularly in classification of audio data. For better performance of audio classification,
feature selection and combination play a key role as they have the potential to make or break the
performance of any DL model. To investigate this role, we conduct an extensive evaluation of the
performance of several cutting-edge DL models (i.e., Convolutional Neural Network, EfficientNet,
MobileNet, Supper Vector Machine and Multi-Perceptron) with various state-of-the-art audio features (i.e., Mel Spectrogram, Mel Frequency Cepstral Coefficients, and Zero Crossing Rate) either
independently or as a combination (i.e., through ensembling) on three different datasets (i.e., Free
Spoken Digits Dataset, Audio Urdu Digits Dataset, and Audio Gujarati Digits Dataset). Overall, results suggest feature selection depends on both the dataset and the model. However, feature
combinations should be restricted to the only features that already achieve good performances when
used individually (i.e., mostly Mel Spectrogram, Mel Frequency Cepstral Coefficients). Such feature
combination/ensembling enabled us to outperform the previous state-of-the-art results irrespective
of our choice of DL model
IRJET- Segmentation in Digital Signal ProcessingIRJET Journal
1) The document discusses audio signal segmentation techniques in digital signal processing. It analyzes algorithms for segmenting audio signals into homogeneous segments for applications like audio event recognition and speaker identification.
2) A hybrid classification approach is proposed that first classifies audio frames as natural sound, noise, or silence using bagged SVM. Rule-based classification is then used to further segment silence frames.
3) The approach involves pre-classifying each windowed portion of the audio clip, extracting normalized feature vectors, and then applying the hybrid classifier. This requires less training data while achieving high accuracy for segmenting into four main audio types.
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...CSCJournals
In today's era of digitization and fast internet, many video are uploaded on websites, a mechanism is required to access this video accurately and efficiently. Semantic concept detection achieve this task accurately and is used in many application like multimedia annotation, video summarization, annotation, indexing and retrieval. Video retrieval based on semantic concept is efficient and challenging research area. Semantic concept detection bridges the semantic gap between low level extraction of features from key-frame or shot of video and high level interpretation of the same as semantics. Semantic Concept detection automatically assigns labels to video from predefined vocabulary. This task is considered as supervised machine learning problem. Support vector machine (SVM) emerged as default classifier choice for this task. But recently Deep Convolutional Neural Network (CNN) has shown exceptional performance in this area. CNN requires large dataset for training. In this paper, we present framework for semantic concept detection using hybrid model of SVM and CNN. Global features like color moment, HSV histogram, wavelet transform, grey level co-occurrence matrix and edge orientation histogram are selected as low level features extracted from annotated groundtruth video dataset of TRECVID. In second pipeline, deep features are extracted using pretrained CNN. Dataset is partitioned in three segments to deal with data imbalance issue. Two classifiers are separately trained on all segments and fusion of scores is performed to detect the concepts in test dataset. The system performance is evaluated using Mean Average Precision for multi-label dataset. The performance of the proposed framework using hybrid model of SVM and CNN is comparable to existing approaches.
Towards an objective comparison of feature extraction techniques for automati...journalBEEI
A common limitation of the previous comparative studies on speaker-features extraction techniques lies in the fact that the comparison is done independently of the used speaker modeling technique and its parameters. The aim of the present paper is twofold. Firstly, it aims to review the most significant advancements in feature extraction techniques used for automatic speaker recognition. Secondly, it seeks to evaluate and compare the currently dominant ones using an objective comparison methodology that overcomes the various limitations and drawbacks of the previous comparative studies. The results of the carried out experiments underlines the importance of the proposed comparison methodology.
Multimedia Semantics:Metadata, Analysis and InteractionRaphael Troncy
Multimedia Semantics:Metadata, Analysis and Interaction. Keynote Talk at the Latin-American Conference on Networked Electronic Media (LACNEM), August 2009, Bogota, Colombia
A computationally efficient learning model to classify audio signal attributesIJECEIAES
The era of machine learning has opened up groundbreaking realities and opportunities in the field of medical diagnosis. However, it is also observed that faster and proper diagnosis of any diseases/medical conditions require proper analysis and classification of digital signal data. It indicates the proper identification of tumors in the brain. Brain magnetic resonance imaging (MRI) data has to be appropriately classified, and similarly, pulse signal analysis is required to evaluate the human heart operating condition. Several studies have used machine learning (ML) modeling to classify speech signals, but very few studies have explored the classification of audio signal attributes in the context of intelligent healthcare monitoring. The study thereby aims to introduce novel mathematical modeling to analyze and classify synthetic pulse audio signal attributes with cost-effective computation. The numerical modeling is composed of several functional blocks where deep neural network-based learning (DNNL) plays a crucial role during the training phase, and also it is further combined with a recurrent structure of long-short term memory (R-LSTM) feedback connections (FCs). The design approaches further experiment in a numerical computing environment in terms of accuracy and computational aspects. The classification outcome of the proposed approach shows that it attains approximately 85% accuracy, which is comparable to the baseline approaches and execution time.
Content Based Video Retrieval Using Integrated Feature Extraction and Persona...IJERD Editor
This document describes a content-based video retrieval system that extracts features from videos and uses those features to retrieve matching videos from a database. The system first segments videos into frames, applies optical character recognition (OCR) to extract text and automatic speech recognition (ASR) to extract keywords. It then extracts additional low-level visual features like color, texture and edges. All the extracted keywords and features are stored in a database. When a query video is input, the same features are extracted and used to search the database for similar videos. The results are then re-ranked based on the user's past viewing history to personalize the results. The system is evaluated on a database of 15 videos and is able to retrieve matching videos
Mr Bhusan Chettri is a researcher in AI and Machine learning applied to sound and speech technology. He earned his PhD from the prestigious Queen Mary University in London and a master's degree in speech and language processing from The University of Sheffield.
Computing for Human Experience and WellnessAmit Sheth
Talk at Venture Panel in Nov. 2005. Since this very early start, the ideas have substantially matured: a more recent version is at: http://www.slideshare.net/knoesis/computing-for-human-experience-v3
Content based video retrieval using discrete cosine transformnooriasukmaningtyas
A content based video retrieval (CBVR) framework is built in this paper.
One of the essential features of video retrieval process and CBVR is a color
value. The discrete cosine transform (DCT) is used to extract a query video
features to compare with the video features stored in our database. Average
result of 0.6475 was obtained by using the DCT after implementing it to the
database we created and collected, and on all categories. This technique was
applied on our database of video, check 100 database videos, 5 videos in
Keywords: each category.
Text Mining with Automatic Annotation from Unstructured ContentIIRindia
This document summarizes text mining techniques for automatic annotation of unstructured content such as images, video and audio. It discusses using multimedia mining to extract structured information from multimedia data through techniques like automatic image annotation and video annotation. Automatic image annotation assigns keywords or tags to images using computer vision and machine learning. Video annotation creates text transcripts from spoken narration in videos and segments videos into clips linked to corresponding text. The document explores models for content-based multimedia retrieval based on visual features and annotations.
Similar to Speaker Search and Indexing for Multimedia Databases (20)
G N Wikramanayake (2010) Learning beyond the classroom In: Humanitarian Technology Challenges of the 21st Century, Trivandrum, Kerala, 20-21 Feb. IEEE Kerala Section
The document discusses different types of broadcasting technologies and media, including electromagnetic waves, radio, TV, video tapes, satellites, and digital technologies. It also categorizes media as one-way or two-way, with examples such as print, audio, images, video, email, and blogs. Finally, it outlines asynchronous tools like discussion boards and blogs that enable delayed communication, as well as synchronous tools like audio/video conferencing and chat that allow for live interaction.
Seminar on Sports and Information Technology held at UCSC on 10th July 2010 under the distinguish patronage of Hon. C.B. Rathnayake Minister of Sports, Member of Parliament Thilanga Sumithipala and Professor Kshanika Hirimburegama Vice-Chancellor, University of Colombo
Improving student learning through assessment for learning using social media...Gihan Wikramanayake
This document summarizes a study on improving student learning through assessment using social media and e-Learning 2.0 on a distance education degree program in Sri Lanka. Specifically:
- The study examines the Bachelor of Information Technology (BIT) program at the University of Colombo School of Computing, which has high failure and dropout rates.
- Currently exams focus on factual recall through multiple choice questions, encouraging rote learning. Language barriers also negatively impact some students' performance.
- The program aims to improve learning and reduce failure/dropout rates by designing new assessment methods using social media and e-Learning 2.0 to promote higher-order thinking. Data on student experiences will inform the redesign.
M C Siriwardena, G N Wikramanayake (2005) Exploiting Tourism through Data Warehousing IS Engineer, The Bulletin of the British Computer Society Sri Lanka Section, Oct, pp. 23-25.
Authropometry of Sri Lankan Sportsmen and Sportswomen, with Special Reference...Gihan Wikramanayake
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive function. Exercise causes chemical changes in the brain that may help alleviate symptoms of mental illness and boost overall mental well-being.
Analysis of Multiple Choice Question Papers with Special Reference to those s...Gihan Wikramanayake
V K Samaranayake, G N Wikramanayake, A P S R Somasiri, M G N A S Fernando (1985) Analysis of Multiple Choice Question Papers with Special Reference to those set at the G.C.E. (Advanced Level) Examination The Journal of the Mathematical and Astronomical Society 12: 17-25
This document discusses how modern teaching methods focus on producing employable graduates through active learning. It provides the example of the University of Western Sydney, which identifies learning objectives before content to emphasize application skills. Their assessments evaluate continuous learning through practical work rather than only testing theoretical knowledge via exams. This process allows graduates to directly apply their skills in employment without additional training.
P G Punchihewa, G N Wikramanayake, D D Karunaratna (2003) Balanced Scorecard and its relationship to UMM IS Engineer, The Bulletin of the British Computer Society Sri Lanka Section 7-8 Oct
H A Caldera, Y Deshpande, G N Wikramanayake (2005) Web Usage Mining Based on Heuristics: Drawbacks. IS Engineer, The Bulletin of the British Computer Society Sri Lanka Section, Apr, pp. 27-28.
P N P Fernando, G N Wikramanayake (1998) "Development of a Web site with Dynamic Data" In: 54th Annual Sessions of Sri Lanka Association for the Advancement of Science, pp. 246-247. Colombo: SLAAS Dec 14-19, Part 1 – abstracts
O N N Fernando, G N Wikramanayake (1998) "Web Based Agriculture Information System" In: Conference, Exhibition and Business Directory of 1st International Information Technology Conference, p. 36. Infotel Lanka Society, Colombo, Sri Lanka: IITC Oct 7-8
Design and Development of a Resource Allocation Mechanism for the School Educ...Gihan Wikramanayake
G N Wikramanayake (2000) "Design and Development of a Resource Allocation Mechanism for the School Education Sector" In: Annual Sessions, Faculty of Science, University of Colombo, p. 19. UoC Dec, vol. 1
Presentation Slides: http://www.slideshare.net/wikramanayake/design-and-development-of-a-resource-allocation-mechanism-for-the-school-education-sector-2000-presentation-883152
S S Sooriarachchi, G N Wikramanayake, G K A Dias (2003) "A Tool for the Management of ebXML Resources" In:5th International Information Technology Conference, pp. 142-151. Infotel Lanka Society Ltd., Colombo, Sri Lanka: IITC Dec 1-7, ISBN: 955-8974-00-5
J P Sathiadas, G N Wikramanayake (2003) "Document Management Techniques and Technologies" In:5th International Information Technology Conference, pp. 40-48. Infotel Lanka Society Ltd., Colombo, Sri Lanka: IITC Dec 1-7, ISBN: 955-8974-00-5
Temple of Asclepius in Thrace. Excavation resultsKrassimira Luka
The temple and the sanctuary around were dedicated to Asklepios Zmidrenus. This name has been known since 1875 when an inscription dedicated to him was discovered in Rome. The inscription is dated in 227 AD and was left by soldiers originating from the city of Philippopolis (modern Plovdiv).
Elevate Your Nonprofit's Online Presence_ A Guide to Effective SEO Strategies...TechSoup
Whether you're new to SEO or looking to refine your existing strategies, this webinar will provide you with actionable insights and practical tips to elevate your nonprofit's online presence.
Chapter wise All Notes of First year Basic Civil Engineering.pptxDenish Jangid
Chapter wise All Notes of First year Basic Civil Engineering
Syllabus
Chapter-1
Introduction to objective, scope and outcome the subject
Chapter 2
Introduction: Scope and Specialization of Civil Engineering, Role of civil Engineer in Society, Impact of infrastructural development on economy of country.
Chapter 3
Surveying: Object Principles & Types of Surveying; Site Plans, Plans & Maps; Scales & Unit of different Measurements.
Linear Measurements: Instruments used. Linear Measurement by Tape, Ranging out Survey Lines and overcoming Obstructions; Measurements on sloping ground; Tape corrections, conventional symbols. Angular Measurements: Instruments used; Introduction to Compass Surveying, Bearings and Longitude & Latitude of a Line, Introduction to total station.
Levelling: Instrument used Object of levelling, Methods of levelling in brief, and Contour maps.
Chapter 4
Buildings: Selection of site for Buildings, Layout of Building Plan, Types of buildings, Plinth area, carpet area, floor space index, Introduction to building byelaws, concept of sun light & ventilation. Components of Buildings & their functions, Basic concept of R.C.C., Introduction to types of foundation
Chapter 5
Transportation: Introduction to Transportation Engineering; Traffic and Road Safety: Types and Characteristics of Various Modes of Transportation; Various Road Traffic Signs, Causes of Accidents and Road Safety Measures.
Chapter 6
Environmental Engineering: Environmental Pollution, Environmental Acts and Regulations, Functional Concepts of Ecology, Basics of Species, Biodiversity, Ecosystem, Hydrological Cycle; Chemical Cycles: Carbon, Nitrogen & Phosphorus; Energy Flow in Ecosystems.
Water Pollution: Water Quality standards, Introduction to Treatment & Disposal of Waste Water. Reuse and Saving of Water, Rain Water Harvesting. Solid Waste Management: Classification of Solid Waste, Collection, Transportation and Disposal of Solid. Recycling of Solid Waste: Energy Recovery, Sanitary Landfill, On-Site Sanitation. Air & Noise Pollution: Primary and Secondary air pollutants, Harmful effects of Air Pollution, Control of Air Pollution. . Noise Pollution Harmful Effects of noise pollution, control of noise pollution, Global warming & Climate Change, Ozone depletion, Greenhouse effect
Text Books:
1. Palancharmy, Basic Civil Engineering, McGraw Hill publishers.
2. Satheesh Gopi, Basic Civil Engineering, Pearson Publishers.
3. Ketki Rangwala Dalal, Essentials of Civil Engineering, Charotar Publishing House.
4. BCP, Surveying volume 1
This presentation was provided by Racquel Jemison, Ph.D., Christina MacLaughlin, Ph.D., and Paulomi Majumder. Ph.D., all of the American Chemical Society, for the second session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session Two: 'Expanding Pathways to Publishing Careers,' was held June 13, 2024.
How Barcodes Can Be Leveraged Within Odoo 17Celine George
In this presentation, we will explore how barcodes can be leveraged within Odoo 17 to streamline our manufacturing processes. We will cover the configuration steps, how to utilize barcodes in different manufacturing scenarios, and the overall benefits of implementing this technology.
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumMJDuyan
(𝐓𝐋𝐄 𝟏𝟎𝟎) (𝐋𝐞𝐬𝐬𝐨𝐧 𝟏)-𝐏𝐫𝐞𝐥𝐢𝐦𝐬
𝐃𝐢𝐬𝐜𝐮𝐬𝐬 𝐭𝐡𝐞 𝐄𝐏𝐏 𝐂𝐮𝐫𝐫𝐢𝐜𝐮𝐥𝐮𝐦 𝐢𝐧 𝐭𝐡𝐞 𝐏𝐡𝐢𝐥𝐢𝐩𝐩𝐢𝐧𝐞𝐬:
- Understand the goals and objectives of the Edukasyong Pantahanan at Pangkabuhayan (EPP) curriculum, recognizing its importance in fostering practical life skills and values among students. Students will also be able to identify the key components and subjects covered, such as agriculture, home economics, industrial arts, and information and communication technology.
𝐄𝐱𝐩𝐥𝐚𝐢𝐧 𝐭𝐡𝐞 𝐍𝐚𝐭𝐮𝐫𝐞 𝐚𝐧𝐝 𝐒𝐜𝐨𝐩𝐞 𝐨𝐟 𝐚𝐧 𝐄𝐧𝐭𝐫𝐞𝐩𝐫𝐞𝐧𝐞𝐮𝐫:
-Define entrepreneurship, distinguishing it from general business activities by emphasizing its focus on innovation, risk-taking, and value creation. Students will describe the characteristics and traits of successful entrepreneurs, including their roles and responsibilities, and discuss the broader economic and social impacts of entrepreneurial activities on both local and global scales.
🔥🔥🔥🔥🔥🔥🔥🔥🔥
إضغ بين إيديكم من أقوى الملازم التي صممتها
ملزمة تشريح الجهاز الهيكلي (نظري 3)
💀💀💀💀💀💀💀💀💀💀
تتميز هذهِ الملزمة بعِدة مُميزات :
1- مُترجمة ترجمة تُناسب جميع المستويات
2- تحتوي على 78 رسم توضيحي لكل كلمة موجودة بالملزمة (لكل كلمة !!!!)
#فهم_ماكو_درخ
3- دقة الكتابة والصور عالية جداً جداً جداً
4- هُنالك بعض المعلومات تم توضيحها بشكل تفصيلي جداً (تُعتبر لدى الطالب أو الطالبة بإنها معلومات مُبهمة ومع ذلك تم توضيح هذهِ المعلومات المُبهمة بشكل تفصيلي جداً
5- الملزمة تشرح نفسها ب نفسها بس تكلك تعال اقراني
6- تحتوي الملزمة في اول سلايد على خارطة تتضمن جميع تفرُعات معلومات الجهاز الهيكلي المذكورة في هذهِ الملزمة
واخيراً هذهِ الملزمة حلالٌ عليكم وإتمنى منكم إن تدعولي بالخير والصحة والعافية فقط
كل التوفيق زملائي وزميلاتي ، زميلكم محمد الذهبي 💊💊
🔥🔥🔥🔥🔥🔥🔥🔥🔥
Speaker Search and Indexing for Multimedia Databases
1. Speaker Search and Indexing for Multimedia Databases
T. Silva, D. D. Karunaratna, G. N. Wikramanayake
K. P. Hewagamage and G. K. A. Dias.
University of Colombo School of Computing,
35, Reid Avenue, Colombo 7, Sri Lanka.
E-mail: tyro@sltnet.lk, {ddk, gnw, kph, gkd }@ucsc.cmb.ac.lk
Abstract What is required in this context is a ‘true’ multimedia
database with the ability to query the actual content for
This paper proposes an approach for indexing a the purpose of locating a given person, object or topic
collection of multimedia clips by a speaker in an audio area.
track. A Bayesian Information Criterion (BIC) procedure
is used for segmentation and Mel-Frequency Cepstral Currently there are no commercial database products
Coefficients (MFCC) are extracted and sampled as supporting content based querying, although there are
metadata for each segment. Silence detection is also several ongoing research projects such as the Combined
carried out during segmentation. Gaussian Mixture Image and Word Spotting (CIMWOS) programme
Models (GMM) are trained for each speaker, and an focused on this area. Many avenues for research exist
ensemble technique is proposed to reduce errors caused such as face recognition, object recognition and keyword
by the probabilistic nature of GMM training. The spotting. This paper describes the work undertaken in one
indexing system utilizes sampled MFCC features as such avenue, that of speaker recognition, as part of a
segment metadata and maintains the metadata of the research programme at the University of Colombo School
speakers separately, allowing modification or additions to of Computing, Sri Lanka.
be done independently. The system achieves a True Miss
Rate (TMR) of around 20% and a False Alarm Rate 2.0 Background and Architecture
(FAR) of around 10% for segments between 15 and 25
seconds in length with performance decreasing with In order to provide the capability to carry out a speaker
reduction in segment size. based search of multimedia clips three main tasks have to
be achieved. These tasks are respectively Segmentation,
1.0 Introduction Recognition and Indexing. The architecture of the system
incorporating these three tasks is shown in figure 1.
The use of multimedia has become more widespread due
to advances in technology and the growing popularity of Segmentation is carried out after initial feature extraction
“rich’ presentation mediums such as the Internet. As a on the new multimedia clip. A labeling procedure is used
result of this growth we are now faced with the problem after segmentation to classify segments as voice or silence
of managing ever expanding collections of audio and and the latter are marked and excluded from the later
video content. This problem is particularly acute in the steps involved in querying.
area of e-learning where it is not uncommon to find
thousands of archived video and voice clips. Users Training speakers within this system is done in a
accessing such multimedia archives need to be able to sift supervised manner by specifying audio files containing
through this wealth of visual and aural data, both spatially speech purely from the given speaker. The mean and
and temporally, in order to find the material they really variance of the log-likelihood for the specified test data is
need [19]. obtained in order to establish a decision criterion for the
later search.
The management of such multimedia content has not
traditionally been a strong point in the field of databases Both activities of training and segmentation result in
where numerical and textual data has been the focus for metadata which is stored in a relational database for the
many years. For this purpose such systems utilize purpose of Indexing. This metadata is combined at query
descriptive structured metadata, which may not always be time to give the result of the search.
available for the multimedia [18].
157
2. Initial Labelling
New Multimedia Feature
Segmentation (Silence /
Object Extraction
Voiced)
Multimedia Segment
Object Server MetaData
Comparison
Query Query Result
Engine
Speaker
Model Data
Testing /
New Speaker
Feature Confidence
Training Data GMM Training
Extraction Interval
Estimation
Figure 1 : Architecture of the Speaker Search and Indexing System
The components of this system are described in more 2.2 Segmentation
detail in the following sections.
For the task of segmentation three broad methods have
2.1 Feature Extraction been identified [3]. These are namely Energy Based
Segmentation, Model Based Segmentation and Metric
The success of all three tasks identified previously is Based Segmentation.. The first approach is more
dependent on the extraction of a suitable feature set from traditional and based on identifying changes in amplitude
the audio. Features such as intensity, pitch, formants, and power which occur at speaker change points. The
harmonic features and spectral correlation have been used second method requires prior speaker models to be
in the past [1],[17] but in recent speech-related research trained and segmentation carried out as a classification
such as [10],[12],[5] the two dominant features used are procedure on each section of audio as it is being
Linear Predictive Coding Cepstral Coefficients (LPCC) processed.
and Mel Frequency Cepstral Coefficients (MFCC).
The final method, metric based segmentation uses
Both methods utilize the concept of the “cepstrum” based estimated “differences” in adjacent audio frames in order
on successive fourier transforms and a log transform to discover speaker change points. Typical metrics used
applied to the original signal. LPCC uses linear predictive include the Euclidian, Kullback-Leiblar [4], Mahalanobis
analysis whereas MFCC utilizes a triangular filter bank to and Gish distance measures [5],[6] as well the Bayesian
derive the actual coefficients. Of these LPCC is the older Information Criterion (BIC). The BIC is a form of
method and has been cited as being superior for certain Minimum Descriptor Length criterion that has been
types of recognition topologies such as simplified Hidden popular in recent times. Various extensions such as
Markov Models [2]. MFCC features are however more automatically adjusting window sizes [7] and multiple
noise resistant and also incorporate the mel-frequency passes at decreasing granularity levels [8] have also been
scale, which simulates the non-uniform sensitivity the proposed.
human ear has towards various sound frequencies [16].
BIC is employed to test the hypothesis that there is a
In this system the audio is downsampled to 11025 Hz and significant different in speakers in two adjacent sections
cepstral coefficients are drawn from each 128 byte frames. of audio. The equation used is shown below:
The zeroth (energy) coefficient is dropped from this
feature vector. Non-overlapping hamming windows are
used and a feature vector size of 15 and 23 is used for BIC = N*log| | - ( i*log| 1| + (N-i)*log| 2| )
recognition and segmentation respectively. At the initial - ½ (d+½d(d+1))logN
stages of this research project various combinations of
MFCC and LPCC coefficients were experimented with where N(µ, ) denotes a normal/Gaussian distribution
and it was discovered empirically that a pure MFCC with covariance and mean µ, N is the total number of
feature vector gave the best results. frames in both segments, i is the number of frames in the
first segment and d is the dimensionality of each sample
158
3. (feature vector) drawn from the audio. is a penalty In this system GMM’s with full covariance matrices are
factor which is set to unity in statistical theory. used with an initial 20 mixtures per GMM. Low-weight
mixtures are dropped from the model after training.
A result less than zero for BIC indicates that the two Rather than utilizing a single GMM per speaker it was
segments are similar. decided to incorporate three such models within an
ensemble classifier, as the stochastic nature of the EM
The actual technique used in this system is based on the algorithm resulted in diverse models at each run. By
variable window scheme by Tritschler [8]. A speaker averaging three models it was hoped that any bias that
change point is sought within an initial window, and if occurred during training due to the random initialization
such a point could not be found the window is allowed to of the EM algorithm could be reduced.
grow in size. When a speaker change point is detected
then that location is selected as the starting point for the Following the training phase the model is tested with data
next window. To avoid the calculations becoming too for the same speaker. The mean and standard deviation of
costly it was decided to introduce an upper limit to the the log-likelihoods obtained for the test data is calculated
size of the window. Beyond this limit the window starts and used for the recognition criteria as described in the
“sliding” rather than growing. next section.
An intentional bias towards over-segmenting was 2.4 Indexing
introduced into the BIC equation to ensure than all
possible speaker change points were accounted for. This One method used for the Indexing task involves the use of
was done by setting to a value less than 1. Since this pre-trained anchor or reference models. Distance or
caused a lot of spurious boundaries to be detected a Characterization vectors are calculated for each utterance
second pass was carried out in which adjacent segments based on its location relative to the anchor models.
were compared once again using the BIC and merged if Similar vectors are used to specify the relative location
they were found to be similar. for speakers within the feature space. Searching for a
specific speaker is simply a matter of finding the
After segments are obtained the Zero Crossing Frequency utterances that have a characterization vector close (in
(ZCF) is used to detect silence segments. Voice segments terms of Euclidean or any other distance measure) to that
have a very much higher variance for the ZCF and a of the target speaker [15]. While anchor models are
simple threshold was used to differentiate them from the efficient in terms of search time, they have a much higher
segments containing silence. reported error rate than direct utilization of GMM
recognition.
2.3 Speaker Model Training
In the design of the indexing scheme for this system, the
Several methods have been utilized in the literature for following assumptions and decisions were made
the purpose of speaker recognition. Text-dependent
recognition, such as that required for voice password 1. A query was assumed to always be for a known
systems, has been carried out using Hidden Markov speaker. Hence a trained GMM classifier would
Models (HMM) and Dynamic Time Warping (DTW). In have to exist beforehand for any speaker for
the text-independent approach, which is more relevant to whom a query was formulated. Queries using
the research question addressed in this project, the Vector example audio tracks could also be incorporated
Quantization method has been used in the past [9], and into this system, as it would simply require that a
more recently combined with neural networks in the form speaker model be trained using the given sample
of Learning Vector Quantization (LVQ) [10]. audio track before the query process was carried
out.
However Gaussian Mixture Models (GMM) are at present
the most popular method used. This method models the 2. No prior labeling on the basis of speaker would
features for each speaker by using multiple weighted be carried out on the audio. The reason for this
Gaussian distributions [11]. The speaker dependent was that such a labeling would be inflexible as it
parameters for each Gaussian (mean, variance and the would not allow new speakers to be added to the
relative weight) can be estimated through the Expectation system without extensive recalculation and
Maximization (EM) algorithm. Various forms of this annotation of the existing audio. Therefore it was
model have been tried, including variations in the type of decided to defer the work involved in labeling to
covariance matrix used [12],[13] and a hierarchical form query time itself.
[14].
159
4. Speaker Metadata
Segment Metadata
Is log-likelihood value within Speaker ID
Clip ID decision boundary ? Component 1
Left Boundary
m + sd * k {
Right Boundary
Mean Vector
Covariance Matrix
{
Weight Vector
MFCC 1
Yes }
MFCC 2
Component 2 { … }
………..
Component 3 { … }
}
Segment contains Log-Likelihood Mean
Speaker Log-Likelihood SD
Figure 2 : The Metadata used for Indexing
Following the segmentation phase, information on
segment boundaries discovered through the BIC 2.5 Architecture
procedure is held as metadata, along with a label
specifying if the segment is silence. The MFCC features While recent research into speaker recognition has often
of that segment are also extracted and saved. This utilized MFCC feature vectors and the BIC metric, in the
constitutes the metadata for the audio clips. majority of these cases these systems depend on having
pre-trained models of all possible speakers in the domain.
In order to avoid the excessive storage requirements Hence the segmentation phase is followed by an
inherent in storing all the MFCC features of a segment, a annotation of the segments with the id of known speakers,
sampling procedure is carried out for segments larger than and then carrying out text based searches of this field.
a certain size. Feature vectors are obtained from the
segments at linear intervals. Experiments show that the In this respect our system provides the same capability as
effect on log-likelihood caused by using the sampled the anchor modeling done by Sturim [15], although there
feature vectors is relatively low. is no indication whether his system utilized automated
segmentation. In addition, the use of direct evaluation of
Metadata for each speaker is derived from the speaker speaker GMM against sampled MFCC features of the
training procedure explained in section 2.3 , This consists segments has not been encountered as an indexing
of the parameters derived for the GMM’s as well as the technique in the literature.
mean and the standard deviation of the log-likelihood test
data for that speaker Similarly no studies have reported utilizing an ensemble
GMM classifier to reduce the possible bias introduced
Carrying out a search is simply a case of retrieving the during the stochastic training process.
GMM data for the queried speaker and subsequently
calculating the probability of the MFCC vectors for each
segment against this GMM. The overall structure of the 3.0 Evaluation
indexing scheme is shown in figure 2.
The algorithms for Speaker Recognition and
The constant K shown in the decision criteria in figure 2 Segmentation were implemented in Matlab, translated
is actually a user specified value which represents how into C code and were compiled as COM objects. The
much “relaxation” the system is allowed when evaluating indexing and the user interface were created using
the query. Higher values for K result in a higher C#.NET and integrated with the COM objects.
probability that existing segments for that speaker are
found, along with an increased probability that false Experiments were performed using a small sized speaker
speakers are also discovered. database consisting of 60 second audio clips and 9
speakers. Each speaker had 3 clips of lone speech. In
One feature of this system is that metadata for speakers addition 8 clips of mixed speech were obtained, ensuring
and audio are maintained independently of each other. that each speaker was represented in at least 3 mixed clips.
This allows either one to be changed (i.e. when adding or The recording environment contained background
removing audio clips and speakers) without requiring conversations and noise due to activities of students in the
changes to be done to the other.
160
5. laboratory and air conditioning hum. Both segmentation 4.0 Conclusions
and recognition were tested on this data.
This paper has addressed the problem of speaker
The criteria used for evaluation were the False Alarm recognition and indexing using a segmentation procedure
Rates (FAR) and True Miss Rates (TMR). They are based on BIC, a GMM recognition system and an
defined as shown below: indexing scheme utilizing metadata based on sampled
MFCC features. The techniques outlined here have given
FAR% = Erroneously Detected Segments x 100 good segmentation results and a satisfactory recognition
Total Detected Segments rate when used with medium to large sized segments
recorded under the same conditions.
TMR% = Missed True Segments x 100 Due to the independence of the segmentation and speaker
True Segments training components the actual work of speaker
recognition may be delayed till a query is carried out.
This means that knowledge of speakers is not required
The results obtained for segmentation were quite good
when a new audio clip is segmented. We only require that
with an FAR of 3.2% and a TMR of 6.7%.
a speaker model exist at the time of the actual search. This
creates the possibility of extending the system so that a
Recognition tests were carried out on these same search can be carried out on the criteria that the speaker in
segments and the results are shown in figures 3 and 4. the returned segment matches that of an “example” audio
The audio segments were divided into three classes, clip provided by the user at search time.
namely large (60 seconds), medium (15-25 seconds) and
small (under 15 seconds). These segments were those However some further evaluation has shown that the
obtained as the result of the previous segmentation performance of this system gets worse when applied to
algorithm applied to the speech data. Each segment segments extracted from audio recorded under different
therefore corresponded to the continuous speech of one conditions. This shows that this method is sensitive to
speaker. The results for large and medium segments are channel and noise effects in the audio. There remains
promising with the former showing error rates close to much scope to explore various forms of filtering as well
0% and the latter having a TMR of around 20-30% along as channel normalization in order to improve the
with an FAR of under 10% for values of the relaxation recognition rates of this system.
constant around 1.5. Small segments however showed a
much higher error rate with the TMR being as high as In addition the indexing scheme outlined is not as fast as
50% and FAR between 20-30%. the use of anchor models. A method by which the speed
advantage of anchor models and the accuracy of GMM
based indexing could be to utilize the anchor model based
indexing scheme as the first stage of the search procedure
in order to cut down on the number of segments
considered.
Recognition Performance (FAR) Recognition Performance (TMR)
1
1
0.9 0.9
0.8 0.8
Large Large
0.7 0.7 Segments
Segments
0.6
TMR %
0.6
FAR %
Medium Medium
0.5 Segments 0.5 Segments
0.4 Small 0.4 Small
0.3 Segments 0.3 Segments
0.2 0.2
0.1 0.1
0 0
0.5 1 1.5 2 2.5 0.5 1 1.5 2 2.5
K (Relaxation Constant) K (Relaxation Constant)
Figure 3 : The FAR achieved for recognition Figure 4 : The TMR achieved for recognition
according to segment size and the value of the according to segment size and the value of the
relaxation constant relaxation constant
161
6. References [13] Bonastre, J., Delacourt, P., Fredouille, C., Merlin, T. &
Wellekens, C. (2000), “A Speaker Tracking System Based on
Speaker Turn Detection for NIST Evaluation”, IEEE
[1] Pool, J. (2002) “Investigation on the Impact of High
International Conference on Acoustics, Speech, and Signal
Frequency Transmitted Speech on Speaker Recognition”, Msc.
Processing, June 2000, Istanbul, Turkey, pp. 1177-1180
Thesis, University of Stellenbosch
[14] Liu, M., Chang, E. & Dai, B. (2002), “Hierarchical
[2] Bimbot, F., Hutter, H. & Jaboulet, C. (1998) “An Overview
Gaussian Mixture Model for Speaker Verification”, Microsoft
of the Cave Project Research Activities in Speaker
Research Asia publication
Verification”, Speech Communication, Vol 31, Number 2-3, pp.
155-180
[15] Sturim, D.E., Reynolds, D.A., Singer, E. & Campbell, J.P.
(2001), “Speaker Indexing in Large Audio Databases Using
[3] Kemp, T., Schmidt, M., Westphal, M. & Waibel, A, (2000),
Anchor Models”, Proceedings of the International Conference
“Strategies for Automatic Segmentation of Audio Data”,
on Acoustics, Speech, and Signal Processing, pp. 429-432
Proceedings of the IEEE International Conference on Acoustics,
Speech, and Signal Processing ICASSP, Volume 3, pp. 1423-
[16] Skowronsk, M.D. & Harris, J.G. (2002), “Human Factor
1426
Cepstral Coefficients”, Proceedings of the First Pan
American/Iberian Meeting on Acoustics
[4] Lu, L. & Zhang, H. (2002), “Real-Time Unsupervised
Speaker Change Detection”, Proceedings of the 16th
[17] Matsui, T. & Furai, S. (1990) “Text Independent Speaker
International Conference on Pattern Recognition (ICPR), Vol.
Recognition Using Vocal Tract and Pitch Information”,
II, pp. 358-361
Proceedings of the International Conference on Spoken
Language Processing, Vol. 4, pp. 137-140
[5] Pietquin, O, Couvreur, L & Covreur, P (2002),
“Applied Clustering for Automatic Speaker-Based Segmentation
[18] Wen, J., Li, Q., Ma, W. & Zhang, H. (2002) “A Multi-
of Audio Material”, Belgian Journal of Operations Research,
Paradigm Querying Approach for a Generic Multimedia
Statistics and Computer Science (JORBEL) - Special Issue on
Database System”, SIGMOD Record, Vol. 32, No. 1
OR and Statistics in the Universities of Mons, vol. 41, no. 1-2,
pp. 69-81
[19] Klas, W. & Aberer, K. (1995), “Multimedia Applications
and Their Implications on Database Architectures”,
[6] Siegler, M.A., Jain, U., Raj, B. & Stern, R.M. (1997),
“Automatic Segmenting, Classification and Clustering of
Broadcast News Audio”, Proceedings of Darpa Speech
Recognition Workshop, February 1997, Chantilly, Virginia
[7] Tritschler, A. & Gopinath, R (1999), “Improved Speaker
Segmentation and Segment Clustering Using the Bayesian
Information Criterion”, Proceedings of Eurospeech 1999,
September 1999, Budapest
[8] Delacourt, P, Kryze, D & Wellekens, C.J (1999), “Speaker
Based Segmentation for Audio Data Indexing”, ESCA
Workshop on Accessing Information in Audio Data, 1999,
Cambridge, UK
[9] Orman, O.D. (1996), “Frequency Analysis of Speaker
Identification Performance”, MSc Thesis, Bogazici University
[10] Cordella, L.P., Foggia, P., Sanson, C. & Vento, M. (2003)
“A Real-Time Text-Independent Speaker Identification
System”, Proceedings of the 12th International Conference on
Image Analysis and Processing, IEEE Computer Society Press,
Mantova, pp. 632-637
[11] Reynolds, D.A. (1995), “Robust Text-Independent Speaker
Identification Using Gaussian Mixture Models”, IEEE
Transactions on Speech & Audio Processing, Volume 3, pp.72-
83
[12] Sarma, S.V. & Sridevi V. (1997), “A Segment Based
Speaker Verification System Using SUMMIT”, Proceedings of
Eurospeech 1997, September 1997, Rhodes, Greece
162