Speaker Search and Indexing for Multimedia Databases

                                T. Silva, D. D. Karunaratna, G. N. Wikramanayake
                                       K. P. Hewagamage and G. K. A. Dias.
                                   University of Colombo School of Computing,
                                     35, Reid Avenue, Colombo 7, Sri Lanka.

                         E-mail: tyro@sltnet.lk, {ddk, gnw, kph, gkd }@ucsc.cmb.ac.lk

Abstract

This paper proposes an approach for indexing a collection of multimedia clips by the speakers in their audio tracks. A Bayesian Information Criterion (BIC) procedure is used for segmentation, and Mel-Frequency Cepstral Coefficients (MFCC) are extracted and sampled as metadata for each segment. Silence detection is also carried out during segmentation. Gaussian Mixture Models (GMM) are trained for each speaker, and an ensemble technique is proposed to reduce errors caused by the probabilistic nature of GMM training. The indexing system utilizes sampled MFCC features as segment metadata and maintains the metadata of the speakers separately, allowing modifications or additions to be made independently. The system achieves a True Miss Rate (TMR) of around 20% and a False Alarm Rate (FAR) of around 10% for segments between 15 and 25 seconds in length, with performance decreasing as segment size is reduced.

1.0 Introduction

The use of multimedia has become more widespread due to advances in technology and the growing popularity of “rich” presentation mediums such as the Internet. As a result of this growth we are now faced with the problem of managing ever expanding collections of audio and video content. This problem is particularly acute in the area of e-learning, where it is not uncommon to find thousands of archived video and voice clips. Users accessing such multimedia archives need to be able to sift through this wealth of visual and aural data, both spatially and temporally, in order to find the material they really need [19].

The management of such multimedia content has not traditionally been a strong point in the field of databases, where numerical and textual data have been the focus for many years. For this purpose such systems utilize descriptive structured metadata, which may not always be available for the multimedia [18].

What is required in this context is a ‘true’ multimedia database with the ability to query the actual content for the purpose of locating a given person, object or topic area.

Currently there are no commercial database products supporting content based querying, although there are several ongoing research projects, such as the Combined Image and Word Spotting (CIMWOS) programme, focused on this area. Many avenues for research exist, such as face recognition, object recognition and keyword spotting. This paper describes the work undertaken in one such avenue, that of speaker recognition, as part of a research programme at the University of Colombo School of Computing, Sri Lanka.

2.0 Background and Architecture

In order to provide the capability to carry out a speaker based search of multimedia clips, three main tasks have to be achieved. These tasks are Segmentation, Recognition and Indexing. The architecture of the system incorporating these three tasks is shown in figure 1.

Segmentation is carried out after initial feature extraction on the new multimedia clip. A labeling procedure is used after segmentation to classify segments as voice or silence, and the latter are marked and excluded from the later steps involved in querying.

Training speakers within this system is done in a supervised manner by specifying audio files containing speech purely from the given speaker. The mean and variance of the log-likelihood for the specified test data are obtained in order to establish a decision criterion for the later search.

Both training and segmentation result in metadata which is stored in a relational database for the purpose of Indexing. This metadata is combined at query time to give the result of the search.

[Figure 1 (block diagram). Components shown: New Multimedia Object; Feature Extraction; Segmentation; Initial Labelling (Silence / Voiced); Segment MetaData; Multimedia Object Server; Query; Comparison Engine; Query Result; Speaker Model Data; New Speaker Training Data; Feature Extraction; GMM Training; Testing / Confidence Interval Estimation.]

Figure 1 : Architecture of the Speaker Search and Indexing System

The components of this system are described in more detail in the following sections.

2.1 Feature Extraction

The success of all three tasks identified previously is dependent on the extraction of a suitable feature set from the audio. Features such as intensity, pitch, formants, harmonic features and spectral correlation have been used in the past [1],[17], but in recent speech-related research such as [10],[12],[5] the two dominant features used are Linear Predictive Coding Cepstral Coefficients (LPCC) and Mel Frequency Cepstral Coefficients (MFCC).

Both methods utilize the concept of the “cepstrum”, based on successive Fourier transforms and a log transform applied to the original signal. LPCC uses linear predictive analysis whereas MFCC utilizes a triangular filter bank to derive the actual coefficients. Of these, LPCC is the older method and has been cited as being superior for certain types of recognition topologies such as simplified Hidden Markov Models [2]. MFCC features are however more noise resistant and also incorporate the mel-frequency scale, which simulates the non-uniform sensitivity the human ear has towards various sound frequencies [16].

In this system the audio is downsampled to 11025 Hz and cepstral coefficients are drawn from each 128-byte frame. The zeroth (energy) coefficient is dropped from this feature vector. Non-overlapping Hamming windows are used, and feature vectors of size 15 and 23 are used for recognition and segmentation respectively. At the initial stages of this research project various combinations of MFCC and LPCC coefficients were experimented with, and it was discovered empirically that a pure MFCC feature vector gave the best results.

2.2 Segmentation

For the task of segmentation three broad methods have been identified [3]. These are Energy Based Segmentation, Model Based Segmentation and Metric Based Segmentation. The first approach is more traditional and is based on identifying changes in amplitude and power which occur at speaker change points. The second method requires prior speaker models to be trained, and segmentation is carried out as a classification procedure on each section of audio as it is being processed.

The final method, metric based segmentation, uses estimated “differences” in adjacent audio frames in order to discover speaker change points. Typical metrics used include the Euclidean, Kullback-Leibler [4], Mahalanobis and Gish distance measures [5],[6] as well as the Bayesian Information Criterion (BIC). The BIC is a form of Minimum Description Length criterion that has been popular in recent times. Various extensions such as automatically adjusting window sizes [7] and multiple passes at decreasing granularity levels [8] have also been proposed.

BIC is employed to test the hypothesis that there is a significant difference in speakers in two adjacent sections of audio. The equation used is shown below:

$$\mathrm{BIC} = N \log|\Sigma| - \bigl( i \log|\Sigma_1| + (N-i) \log|\Sigma_2| \bigr) - \tfrac{1}{2}\,\lambda \bigl( d + \tfrac{1}{2} d(d+1) \bigr) \log N$$

where N(µ, Σ) denotes a normal/Gaussian distribution with covariance Σ and mean µ; Σ, Σ1 and Σ2 are the covariance matrices estimated from both segments taken together, from the first segment and from the second segment respectively; N is the total number of frames in both segments, i is the number of frames in the first segment and d is the dimensionality of each sample
(feature vector) drawn from the audio. λ is a penalty factor which is set to unity in statistical theory.

A result less than zero for BIC indicates that the two segments are similar.

The actual technique used in this system is based on the variable window scheme by Tritschler [8]. A speaker change point is sought within an initial window, and if such a point cannot be found the window is allowed to grow in size. When a speaker change point is detected, that location is selected as the starting point for the next window. To avoid the calculations becoming too costly it was decided to introduce an upper limit to the size of the window. Beyond this limit the window starts “sliding” rather than growing.

An intentional bias towards over-segmenting was introduced into the BIC equation to ensure that all possible speaker change points were accounted for. This was done by setting λ to a value less than 1. Since this caused a lot of spurious boundaries to be detected, a second pass was carried out in which adjacent segments were compared once again using the BIC and merged if they were found to be similar.

After segments are obtained the Zero Crossing Frequency (ZCF) is used to detect silence segments. Voice segments have a much higher variance for the ZCF, and a simple threshold was used to differentiate them from the segments containing silence.

2.3 Speaker Model Training

Several methods have been utilized in the literature for the purpose of speaker recognition. Text-dependent recognition, such as that required for voice password systems, has been carried out using Hidden Markov Models (HMM) and Dynamic Time Warping (DTW). In the text-independent approach, which is more relevant to the research question addressed in this project, the Vector Quantization method has been used in the past [9], and more recently combined with neural networks in the form of Learning Vector Quantization (LVQ) [10].

However Gaussian Mixture Models (GMM) are at present the most popular method used. This method models the features for each speaker by using multiple weighted Gaussian distributions [11]. The speaker dependent parameters for each Gaussian (mean, variance and the relative weight) can be estimated through the Expectation Maximization (EM) algorithm. Various forms of this model have been tried, including variations in the type of covariance matrix used [12],[13] and a hierarchical form [14].

In this system GMMs with full covariance matrices are used, with an initial 20 mixtures per GMM. Low-weight mixtures are dropped from the model after training. Rather than utilizing a single GMM per speaker it was decided to incorporate three such models within an ensemble classifier, as the stochastic nature of the EM algorithm resulted in diverse models at each run. By averaging three models it was hoped that any bias that occurred during training due to the random initialization of the EM algorithm could be reduced.

Following the training phase the model is tested with data for the same speaker. The mean and standard deviation of the log-likelihoods obtained for the test data are calculated and used for the recognition criterion as described in the next section.

2.4 Indexing

One method used for the Indexing task involves the use of pre-trained anchor or reference models. Distance or Characterization vectors are calculated for each utterance based on its location relative to the anchor models. Similar vectors are used to specify the relative location for speakers within the feature space. Searching for a specific speaker is simply a matter of finding the utterances that have a characterization vector close (in terms of Euclidean or any other distance measure) to that of the target speaker [15]. While anchor models are efficient in terms of search time, they have a much higher reported error rate than direct utilization of GMM recognition.

In the design of the indexing scheme for this system, the following assumptions and decisions were made:

    1.   A query was assumed to always be for a known speaker. Hence a trained GMM classifier would have to exist beforehand for any speaker for whom a query was formulated. Queries using example audio tracks could also be incorporated into this system, as it would simply require that a speaker model be trained using the given sample audio track before the query process was carried out.

    2.   No prior labeling on the basis of speaker would be carried out on the audio. The reason for this was that such a labeling would be inflexible, as it would not allow new speakers to be added to the system without extensive recalculation and annotation of the existing audio. Therefore it was decided to defer the work involved in labeling to query time itself.
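The speaker model that a query relies on (section 2.3) can be illustrated with a short sketch. The NumPy code below is only an illustration under stated assumptions, not the implementation used in this system: it evaluates the average per-frame log-likelihood of a block of feature vectors under a full-covariance GMM, and then averages the scores of several independently trained models. The function names, the (weights, means, covariances) parameter layout and the choice to combine the ensemble by averaging log-likelihood scores are all assumptions, since the exact combination rule is not specified above.

```python
import numpy as np

def gmm_avg_loglik(X, weights, means, covs):
    """Average per-frame log-likelihood of X (n_frames x d) under a
    full-covariance GMM given as lists of weights, means and covariances."""
    X = np.atleast_2d(X)
    n, d = X.shape
    log_terms = np.empty((n, len(weights)))
    for k, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        diff = X - mu
        _, logdet = np.linalg.slogdet(cov)
        # Squared Mahalanobis distance of every frame to component k
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        log_terms[:, k] = np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    # Numerically stable log-sum-exp over the mixture components
    m = log_terms.max(axis=1, keepdims=True)
    frame_ll = m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))
    return float(frame_ll.mean())

def ensemble_avg_loglik(X, models):
    """Assumed combination rule: mean of the member models' scores.
    `models` is a list of (weights, means, covs) tuples."""
    return float(np.mean([gmm_avg_loglik(X, *m) for m in models]))
```

With three trained models for a speaker, `ensemble_avg_loglik(segment_features, [m1, m2, m3])` would give the single score that is compared against that speaker's stored log-likelihood statistics.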

[Figure 2 (diagram). Contents:

    Segment Metadata
        Clip ID
        Left Boundary
        Right Boundary
        { MFCC 1, MFCC 2, ... }

    Decision
        Is log-likelihood value within decision boundary (m + sd * k)?
        Yes: Segment contains Speaker

    Speaker Metadata
        Speaker ID
        Component 1 { Mean Vector, Covariance Matrix, Weight Vector }
        Component 2 { … }
        Component 3 { … }
        Log-Likelihood Mean
        Log-Likelihood SD
]

Figure 2 : The Metadata used for Indexing
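The decision step in figure 2 can be sketched as follows. This is a minimal illustration under assumptions rather than the system's actual code. It assumes each segment's log-likelihood against the queried speaker's model has already been computed, and it reads the boundary "m + sd * k" as a symmetric interval of k standard deviations around the speaker's mean test log-likelihood. A one-sided lower bound would be another plausible reading. The function names, the tuple layout and the default k = 1.5 are hypothetical.

```python
def segment_matches(seg_loglik, ll_mean, ll_sd, k=1.5):
    """True if the segment's log-likelihood lies within k standard
    deviations of the speaker's mean test log-likelihood (assumed
    symmetric reading of the decision boundary in figure 2)."""
    return abs(seg_loglik - ll_mean) <= k * ll_sd

def search(segments, ll_mean, ll_sd, k=1.5):
    """segments: iterable of (clip_id, left, right, loglik) tuples, where
    loglik was computed against the queried speaker's model.
    Returns (clip_id, left, right) for every segment accepted."""
    return [(clip, left, right)
            for clip, left, right, ll in segments
            if segment_matches(ll, ll_mean, ll_sd, k)]
```

Raising k widens the accepted interval, so more true segments and more false alarms are returned.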


Following the segmentation phase, information on segment boundaries discovered through the BIC procedure is held as metadata, along with a label specifying if the segment is silence. The MFCC features of that segment are also extracted and saved. This constitutes the metadata for the audio clips.

In order to avoid the excessive storage requirements inherent in storing all the MFCC features of a segment, a sampling procedure is carried out for segments larger than a certain size. Feature vectors are obtained from the segments at linear intervals. Experiments show that the effect on log-likelihood caused by using the sampled feature vectors is relatively low.

Metadata for each speaker is derived from the speaker training procedure explained in section 2.3. This consists of the parameters derived for the GMMs as well as the mean and the standard deviation of the log-likelihood of the test data for that speaker.

Carrying out a search is simply a case of retrieving the GMM data for the queried speaker and subsequently calculating the probability of the MFCC vectors for each segment against this GMM. The overall structure of the indexing scheme is shown in figure 2.

The constant K shown in the decision criterion in figure 2 is a user specified value which represents how much “relaxation” the system is allowed when evaluating the query. Higher values for K result in a higher probability that existing segments for that speaker are found, along with an increased probability that false speakers are also discovered.

One feature of this system is that metadata for speakers and audio are maintained independently of each other. This allows either one to be changed (e.g. when adding or removing audio clips and speakers) without requiring changes to be made to the other.

2.5 Architecture

While recent research into speaker recognition has often utilized MFCC feature vectors and the BIC metric, in the majority of cases these systems depend on having pre-trained models of all possible speakers in the domain. Hence the segmentation phase is followed by an annotation of the segments with the id of known speakers, after which text based searches of this field are carried out.

In this respect our system provides the same capability as the anchor modeling done by Sturim [15], although there is no indication whether his system utilized automated segmentation. In addition, the use of direct evaluation of speaker GMM against sampled MFCC features of the segments has not been encountered as an indexing technique in the literature.

Similarly no studies have reported utilizing an ensemble GMM classifier to reduce the possible bias introduced during the stochastic training process.

3.0 Evaluation

The algorithms for Speaker Recognition and Segmentation were implemented in Matlab, translated into C code and compiled as COM objects. The indexing and the user interface were created using C#.NET and integrated with the COM objects.

Experiments were performed using a small sized speaker database consisting of 60 second audio clips and 9 speakers. Each speaker had 3 clips of lone speech. In addition 8 clips of mixed speech were obtained, ensuring that each speaker was represented in at least 3 mixed clips. The recording environment contained background conversations and noise due to activities of students in the
laboratory and air conditioning hum. Both segmentation and recognition were tested on this data.

The criteria used for evaluation were the False Alarm Rate (FAR) and the True Miss Rate (TMR). They are defined as shown below:

$$\mathrm{FAR}\% = \frac{\text{Erroneously Detected Segments}}{\text{Total Detected Segments}} \times 100$$

$$\mathrm{TMR}\% = \frac{\text{Missed True Segments}}{\text{True Segments}} \times 100$$

The results obtained for segmentation were quite good, with an FAR of 3.2% and a TMR of 6.7%.

Recognition tests were carried out on these same segments and the results are shown in figures 3 and 4. The audio segments were divided into three classes, namely large (60 seconds), medium (15-25 seconds) and small (under 15 seconds). These segments were those obtained as the result of the previous segmentation algorithm applied to the speech data. Each segment therefore corresponded to the continuous speech of one speaker. The results for large and medium segments are promising, with the former showing error rates close to 0% and the latter having a TMR of around 20-30% along with an FAR of under 10% for values of the relaxation constant around 1.5. Small segments however showed a much higher error rate, with the TMR being as high as 50% and the FAR between 20% and 30%.

4.0 Conclusions

This paper has addressed the problem of speaker recognition and indexing using a segmentation procedure based on BIC, a GMM recognition system and an indexing scheme utilizing metadata based on sampled MFCC features. The techniques outlined here have given good segmentation results and a satisfactory recognition rate when used with medium to large sized segments recorded under the same conditions.

Due to the independence of the segmentation and speaker training components, the actual work of speaker recognition may be delayed until a query is carried out. This means that knowledge of speakers is not required when a new audio clip is segmented. We only require that a speaker model exist at the time of the actual search. This creates the possibility of extending the system so that a search can be carried out on the criterion that the speaker in the returned segment matches that of an “example” audio clip provided by the user at search time.

However some further evaluation has shown that the performance of this system gets worse when applied to segments extracted from audio recorded under different conditions. This shows that this method is sensitive to channel and noise effects in the audio. There remains much scope to explore various forms of filtering as well as channel normalization in order to improve the recognition rates of this system.

In addition the indexing scheme outlined is not as fast as the use of anchor models. One way to combine the speed advantage of anchor models with the accuracy of GMM based indexing would be to utilize the anchor model based indexing scheme as the first stage of the search procedure in order to cut down on the number of segments considered.

[Plot: Recognition Performance (FAR). x-axis: K (Relaxation Constant), 0.5 to 2.5; y-axis: FAR %, 0 to 1; series: Large Segments, Medium Segments, Small Segments.]

Figure 3 : The FAR achieved for recognition according to segment size and the value of the relaxation constant

[Plot: Recognition Performance (TMR). x-axis: K (Relaxation Constant), 0.5 to 2.5; y-axis: TMR %, 0 to 1; series: Large Segments, Medium Segments, Small Segments.]

Figure 4 : The TMR achieved for recognition according to segment size and the value of the relaxation constant
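The FAR and TMR values reported in section 3.0 and plotted in figures 3 and 4 follow directly from the two definitions given there. As a small sketch, assuming segments are represented by hashable identifiers (the function name and this representation are illustrative choices, not part of the system described), the two rates could be computed like this:

```python
def far_tmr(detected, true):
    """Return (FAR %, TMR %) given the segments returned by a query
    (`detected`) and the segments that really belong to the speaker (`true`)."""
    detected, true = set(detected), set(true)
    # FAR: share of returned segments that do not belong to the speaker
    far = 100.0 * len(detected - true) / len(detected) if detected else 0.0
    # TMR: share of the speaker's segments that were not returned
    tmr = 100.0 * len(true - detected) / len(true) if true else 0.0
    return far, tmr
```

For example, if a query returns four segments of which one is wrong, and two of the speaker's five true segments are missed, `far_tmr({1, 2, 3, 4}, {1, 2, 3, 5, 6})` evaluates to an FAR of 25% and a TMR of 40%.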
References

[1] Pool, J. (2002), “Investigation on the Impact of High Frequency Transmitted Speech on Speaker Recognition”, MSc Thesis, University of Stellenbosch

[2] Bimbot, F., Hutter, H. & Jaboulet, C. (1998), “An Overview of the CAVE Project Research Activities in Speaker Verification”, Speech Communication, Vol. 31, No. 2-3, pp. 155-180

[3] Kemp, T., Schmidt, M., Westphal, M. & Waibel, A. (2000), “Strategies for Automatic Segmentation of Audio Data”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 3, pp. 1423-1426

[4] Lu, L. & Zhang, H. (2002), “Real-Time Unsupervised Speaker Change Detection”, Proceedings of the 16th International Conference on Pattern Recognition (ICPR), Vol. II, pp. 358-361

[5] Pietquin, O., Couvreur, L. & Couvreur, P. (2002), “Applied Clustering for Automatic Speaker-Based Segmentation of Audio Material”, Belgian Journal of Operations Research, Statistics and Computer Science (JORBEL) - Special Issue on OR and Statistics in the Universities of Mons, Vol. 41, No. 1-2, pp. 69-81

[6] Siegler, M.A., Jain, U., Raj, B. & Stern, R.M. (1997), “Automatic Segmentation, Classification and Clustering of Broadcast News Audio”, Proceedings of the DARPA Speech Recognition Workshop, February 1997, Chantilly, Virginia

[13] Bonastre, J., Delacourt, P., Fredouille, C., Merlin, T. & Wellekens, C. (2000), “A Speaker Tracking System Based on Speaker Turn Detection for NIST Evaluation”, IEEE International Conference on Acoustics, Speech, and Signal Processing, June 2000, Istanbul, Turkey, pp. 1177-1180

[14] Liu, M., Chang, E. & Dai, B. (2002), “Hierarchical Gaussian Mixture Model for Speaker Verification”, Microsoft Research Asia publication

[15] Sturim, D.E., Reynolds, D.A., Singer, E. & Campbell, J.P. (2001), “Speaker Indexing in Large Audio Databases Using Anchor Models”, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 429-432

[16] Skowronski, M.D. & Harris, J.G. (2002), “Human Factor Cepstral Coefficients”, Proceedings of the First Pan American/Iberian Meeting on Acoustics

[17] Matsui, T. & Furui, S. (1990), “Text Independent Speaker Recognition Using Vocal Tract and Pitch Information”, Proceedings of the International Conference on Spoken Language Processing, Vol. 4, pp. 137-140

[18] Wen, J., Li, Q., Ma, W. & Zhang, H. (2002), “A Multi-Paradigm Querying Approach for a Generic Multimedia Database System”, SIGMOD Record, Vol. 32, No. 1

[19] Klas, W. & Aberer, K. (1995), “Multimedia Applications and Their Implications on Database Architectures”,

[7] Tritschler, A. & Gopinath, R (1999), “Improved Speaker
Segmentation and Segment Clustering Using the Bayesian
Information Criterion”, Proceedings of Eurospeech 1999,
September 1999, Budapest

[8] Delacourt, P, Kryze, D & Wellekens, C.J (1999), “Speaker
Based Segmentation for Audio Data Indexing”, ESCA
Workshop on Accessing Information in Audio Data, 1999,
Cambridge, UK

[9] Orman, O.D. (1996), “Frequency Analysis of Speaker
Identification Performance”, MSc Thesis, Bogazici University

[10] Cordella, L.P., Foggia, P., Sanson, C. & Vento, M. (2003)
“A Real-Time Text-Independent Speaker Identification
System”, Proceedings of the 12th International Conference on
Image Analysis and Processing, IEEE Computer Society Press,
Mantova, pp. 632-637

[11] Reynolds, D.A. (1995), “Robust Text-Independent Speaker
Identification Using Gaussian Mixture Models”, IEEE
Transactions on Speech & Audio Processing, Volume 3, pp.72-
83

[12] Sarma, S.V. & Sridevi V. (1997), “A Segment Based
Speaker Verification System Using SUMMIT”, Proceedings of
Eurospeech 1997, September 1997, Rhodes, Greece
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...RKavithamani
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 

Recently uploaded (20)

Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 

Speaker Search and Indexing for Multimedia Databases

T. Silva, D. D. Karunaratna, G. N. Wikramanayake, K. P. Hewagamage and G. K. A. Dias
University of Colombo School of Computing, 35, Reid Avenue, Colombo 7, Sri Lanka.
E-mail: tyro@sltnet.lk, {ddk, gnw, kph, gkd}@ucsc.cmb.ac.lk

Abstract

This paper proposes an approach for indexing a collection of multimedia clips by the speakers in the audio track. A Bayesian Information Criterion (BIC) procedure is used for segmentation, and Mel-Frequency Cepstral Coefficients (MFCC) are extracted and sampled as metadata for each segment. Silence detection is also carried out during segmentation. Gaussian Mixture Models (GMM) are trained for each speaker, and an ensemble technique is proposed to reduce errors caused by the probabilistic nature of GMM training. The indexing system utilizes sampled MFCC features as segment metadata and maintains the metadata of the speakers separately, allowing modifications or additions to be made independently. The system achieves a True Miss Rate (TMR) of around 20% and a False Alarm Rate (FAR) of around 10% for segments between 15 and 25 seconds in length, with performance decreasing as segment size is reduced.

1.0 Introduction

The use of multimedia has become more widespread due to advances in technology and the growing popularity of “rich” presentation mediums such as the Internet. As a result of this growth we are now faced with the problem of managing ever expanding collections of audio and video content. This problem is particularly acute in the area of e-learning, where it is not uncommon to find thousands of archived video and voice clips. Users accessing such multimedia archives need to be able to sift through this wealth of visual and aural data, both spatially and temporally, in order to find the material they really need [19].

The management of such multimedia content has not traditionally been a strong point in the field of databases, where numerical and textual data has been the focus for many years. For this purpose such systems utilize descriptive structured metadata, which may not always be available for the multimedia [18].

What is required in this context is a ‘true’ multimedia database with the ability to query the actual content for the purpose of locating a given person, object or topic area.

Currently there are no commercial database products supporting content based querying, although there are several ongoing research projects, such as the Combined Image and Word Spotting (CIMWOS) programme, focused on this area. Many avenues for research exist, such as face recognition, object recognition and keyword spotting. This paper describes the work undertaken in one such avenue, that of speaker recognition, as part of a research programme at the University of Colombo School of Computing, Sri Lanka.

2.0 Background and Architecture

In order to provide the capability to carry out a speaker based search of multimedia clips, three main tasks have to be achieved. These tasks are Segmentation, Recognition and Indexing. The architecture of the system incorporating these three tasks is shown in figure 1.

Segmentation is carried out after initial feature extraction on the new multimedia clip. A labeling procedure is used after segmentation to classify segments as voice or silence, and the latter are marked and excluded from the later steps involved in querying.

Training speakers within this system is done in a supervised manner by specifying audio files containing speech purely from the given speaker. The mean and variance of the log-likelihood for the specified test data are obtained in order to establish a decision criterion for the later search.

Both training and segmentation result in metadata which is stored in a relational database for the purpose of Indexing. This metadata is combined at query time to give the result of the search.
[Figure 1: Architecture of the Speaker Search and Indexing System. Components shown: New Multimedia Object, Feature Extraction, Segmentation, Initial Labelling (Silence / Voiced), Multimedia Object Server, Segment MetaData, New Speaker Training Data, Feature Extraction, GMM Training, Testing / Confidence Interval Estimation, Speaker Model Data, Comparison Engine, Query, Query Result.]

The components of this system are described in more detail in the following sections.

2.1 Feature Extraction

The success of all three tasks identified previously is dependent on the extraction of a suitable feature set from the audio. Features such as intensity, pitch, formants, harmonic features and spectral correlation have been used in the past [1],[17], but in recent speech-related research such as [10],[12],[5] the two dominant features used are Linear Predictive Coding Cepstral Coefficients (LPCC) and Mel Frequency Cepstral Coefficients (MFCC).

Both methods utilize the concept of the “cepstrum”, based on successive Fourier transforms and a log transform applied to the original signal. LPCC uses linear predictive analysis whereas MFCC utilizes a triangular filter bank to derive the actual coefficients. Of these, LPCC is the older method and has been cited as being superior for certain types of recognition topologies such as simplified Hidden Markov Models [2]. MFCC features are however more noise resistant and also incorporate the mel-frequency scale, which simulates the non-uniform sensitivity the human ear has towards various sound frequencies [16].

In this system the audio is downsampled to 11025 Hz and cepstral coefficients are drawn from each 128-byte frame. The zeroth (energy) coefficient is dropped from this feature vector. Non-overlapping Hamming windows are used, and feature vector sizes of 15 and 23 are used for recognition and segmentation respectively. At the initial stages of this research project various combinations of MFCC and LPCC coefficients were experimented with, and it was discovered empirically that a pure MFCC feature vector gave the best results.

2.2 Segmentation

For the task of segmentation three broad methods have been identified [3]. These are Energy Based Segmentation, Model Based Segmentation and Metric Based Segmentation. The first approach is more traditional and is based on identifying changes in amplitude and power which occur at speaker change points. The second method requires prior speaker models to be trained, and segmentation is carried out as a classification procedure on each section of audio as it is being processed.

The final method, metric based segmentation, uses estimated “differences” in adjacent audio frames in order to discover speaker change points. Typical metrics used include the Euclidean, Kullback-Leibler [4], Mahalanobis and Gish distance measures [5],[6] as well as the Bayesian Information Criterion (BIC). The BIC is a form of Minimum Description Length criterion that has been popular in recent times. Various extensions such as automatically adjusting window sizes [7] and multiple passes at decreasing granularity levels [8] have also been proposed.

BIC is employed to test the hypothesis that there is a significant difference in speakers between two adjacent sections of audio. The equation used is shown below:

$$\mathrm{BIC} = N \log|\Sigma| - \bigl( i \log|\Sigma_1| + (N-i) \log|\Sigma_2| \bigr) - \lambda \cdot \tfrac{1}{2}\bigl( d + \tfrac{1}{2} d(d+1) \bigr) \log N$$

where N(µ, Σ) denotes a normal/Gaussian distribution with covariance Σ and mean µ (Σ is estimated from both segments together, Σ1 from the first segment and Σ2 from the second), N is the total number of frames in both segments, i is the number of frames in the first segment and d is the dimensionality of each sample
(feature vector) drawn from the audio. λ is a penalty factor which is set to unity in statistical theory.

A result less than zero for BIC indicates that the two segments are similar.

The actual technique used in this system is based on the variable window scheme by Tritschler [7]. A speaker change point is sought within an initial window, and if such a point cannot be found the window is allowed to grow in size. When a speaker change point is detected, that location is selected as the starting point for the next window. To avoid the calculations becoming too costly it was decided to introduce an upper limit to the size of the window. Beyond this limit the window starts “sliding” rather than growing.

An intentional bias towards over-segmenting was introduced into the BIC equation to ensure that all possible speaker change points were accounted for. This was done by setting λ to a value less than 1. Since this caused a lot of spurious boundaries to be detected, a second pass was carried out in which adjacent segments were compared once again using the BIC and merged if they were found to be similar.

After segments are obtained the Zero Crossing Frequency (ZCF) is used to detect silence segments. Voice segments have a very much higher variance for the ZCF, and a simple threshold was used to differentiate them from the segments containing silence.

2.3 Speaker Model Training

Several methods have been utilized in the literature for the purpose of speaker recognition. Text-dependent recognition, such as that required for voice password systems, has been carried out using Hidden Markov Models (HMM) and Dynamic Time Warping (DTW). In the text-independent approach, which is more relevant to the research question addressed in this project, the Vector Quantization method has been used in the past [9], and more recently combined with neural networks in the form of Learning Vector Quantization (LVQ) [10].

However, Gaussian Mixture Models (GMM) are at present the most popular method used. This method models the features for each speaker by using multiple weighted Gaussian distributions [11]. The speaker dependent parameters for each Gaussian (mean, variance and the relative weight) can be estimated through the Expectation Maximization (EM) algorithm. Various forms of this model have been tried, including variations in the type of covariance matrix used [12],[13] and a hierarchical form [14].

In this system GMMs with full covariance matrices are used, with an initial 20 mixtures per GMM. Low-weight mixtures are dropped from the model after training. Rather than utilizing a single GMM per speaker it was decided to incorporate three such models within an ensemble classifier, as the stochastic nature of the EM algorithm resulted in diverse models at each run. By averaging three models it was hoped that any bias that occurred during training due to the random initialization of the EM algorithm could be reduced.

Following the training phase the model is tested with data for the same speaker. The mean and standard deviation of the log-likelihoods obtained for the test data are calculated and used for the recognition criteria as described in the next section.

2.4 Indexing

One method used for the Indexing task involves the use of pre-trained anchor or reference models. Distance or characterization vectors are calculated for each utterance based on its location relative to the anchor models. Similar vectors are used to specify the relative location for speakers within the feature space. Searching for a specific speaker is simply a matter of finding the utterances that have a characterization vector close (in terms of Euclidean or any other distance measure) to that of the target speaker [15]. While anchor models are efficient in terms of search time, they have a much higher reported error rate than direct utilization of GMM recognition.

In the design of the indexing scheme for this system, the following assumptions and decisions were made:

1. A query was assumed to always be for a known speaker. Hence a trained GMM classifier would have to exist beforehand for any speaker for whom a query was formulated. Queries using example audio tracks could also be incorporated into this system, as it would simply require that a speaker model be trained using the given sample audio track before the query process was carried out.

2. No prior labeling on the basis of speaker would be carried out on the audio. The reason for this was that such a labeling would be inflexible, as it would not allow new speakers to be added to the system without extensive recalculation and annotation of the existing audio. Therefore it was decided to defer the work involved in labeling to query time itself.
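The BIC test of section 2.2 can be illustrated with a short sketch. This is not the authors' Matlab/C implementation: the function names (`log_det_cov`, `bic_value`, `best_change_point`) and the `margin` and `step` parameters are hypothetical, the covariance is the plain maximum-likelihood estimate, and the search scans a single fixed window rather than the growing and sliding window described above. A value of `lam` below 1 lowers the penalty term and so biases the test towards over-segmenting.

```python
import math

def log_det_cov(frames):
    """Log-determinant of the maximum-likelihood covariance of d-dimensional frames."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(d)]
    cov = [[sum((f[a] - mean[a]) * (f[b] - mean[b]) for f in frames) / n
            for b in range(d)] for a in range(d)]
    # Cholesky factorisation: log|cov| = 2 * sum of log of the diagonal entries.
    low = [[0.0] * d for _ in range(d)]
    total = 0.0
    for r in range(d):
        for c in range(r + 1):
            s = cov[r][c] - sum(low[r][k] * low[c][k] for k in range(c))
            if r == c:
                low[r][c] = math.sqrt(s)
                total += 2.0 * math.log(low[r][c])
            else:
                low[r][c] = s / low[c][c]
    return total

def bic_value(frames, i, lam=1.0):
    """BIC for splitting `frames` after the first i frames.

    Values below zero indicate that the two sides are similar;
    positive values favour a speaker change at i.
    """
    n, d = len(frames), len(frames[0])
    penalty = lam * 0.5 * (d + 0.5 * d * (d + 1)) * math.log(n)
    return (n * log_det_cov(frames)
            - (i * log_det_cov(frames[:i]) + (n - i) * log_det_cov(frames[i:]))
            - penalty)

def best_change_point(frames, margin, step=1, lam=1.0):
    """Return the split index with the largest positive BIC, or None.

    `margin` keeps enough frames on each side for a non-singular
    covariance estimate (it must exceed the feature dimension).
    """
    best_index, best_score = None, 0.0
    for i in range(margin, len(frames) - margin + 1, step):
        score = bic_value(frames, i, lam)
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

The same `bic_value` function could serve the second pass described in section 2.2: two adjacent segments are concatenated, the BIC is evaluated at their shared boundary, and the segments are merged when the result is below zero.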
[Figure 2: The Metadata used for Indexing. Speaker Metadata: Speaker ID; Component 1 { Mean Vector, Covariance Matrix, Weight Vector }; Component 2 { … }; Component 3 { … }; Log-Likelihood Mean; Log-Likelihood SD. Segment Metadata: Clip ID; Left Boundary; Right Boundary; MFCC 1; MFCC 2; … Decision step: "Is log-likelihood value within decision boundary? m + sd * k"; if yes, the segment contains the speaker.]

Following the segmentation phase, information on segment boundaries discovered through the BIC procedure is held as metadata, along with a label specifying if the segment is silence. The MFCC features of that segment are also extracted and saved. This constitutes the metadata for the audio clips.

In order to avoid the excessive storage requirements inherent in storing all the MFCC features of a segment, a sampling procedure is carried out for segments larger than a certain size. Feature vectors are obtained from the segments at linear intervals. Experiments show that the effect on log-likelihood caused by using the sampled feature vectors is relatively low.

Metadata for each speaker is derived from the speaker training procedure explained in section 2.3. This consists of the parameters derived for the GMMs as well as the mean and the standard deviation of the log-likelihood of the test data for that speaker.

Carrying out a search is simply a case of retrieving the GMM data for the queried speaker and subsequently calculating the probability of the MFCC vectors for each segment against this GMM. The overall structure of the indexing scheme is shown in figure 2.

The constant K shown in the decision criteria in figure 2 is a user specified value which represents how much “relaxation” the system is allowed when evaluating the query. Higher values for K result in a higher probability that existing segments for that speaker are found, along with an increased probability that false speakers are also discovered.

One feature of this system is that metadata for speakers and audio are maintained independently of each other. This allows either one to be changed (i.e. when adding or removing audio clips and speakers) without requiring changes to be made to the other.

2.5 Architecture

While recent research into speaker recognition has often utilized MFCC feature vectors and the BIC metric, in the majority of cases these systems depend on having pre-trained models of all possible speakers in the domain. Hence the segmentation phase is followed by an annotation of the segments with the id of known speakers, and then by text based searches of this field.

In this respect our system provides the same capability as the anchor modeling done by Sturim [15], although there is no indication whether his system utilized automated segmentation. In addition, the use of direct evaluation of speaker GMMs against sampled MFCC features of the segments has not been encountered as an indexing technique in the literature.

Similarly, no studies have reported utilizing an ensemble GMM classifier to reduce the possible bias introduced during the stochastic training process.

3.0 Evaluation

The algorithms for Speaker Recognition and Segmentation were implemented in Matlab, translated into C code and compiled as COM objects. The indexing and the user interface were created using C#.NET and integrated with the COM objects.

Experiments were performed using a small speaker database consisting of 60 second audio clips and 9 speakers. Each speaker had 3 clips of lone speech. In addition 8 clips of mixed speech were obtained, ensuring that each speaker was represented in at least 3 mixed clips. The recording environment contained background conversations and noise due to activities of students in the
laboratory and air conditioning hum. Both segmentation and recognition were tested on this data.

The criteria used for evaluation were the False Alarm Rate (FAR) and the True Miss Rate (TMR). They are defined as shown below:

$$\mathrm{FAR}\% = \frac{\text{Erroneously Detected Segments}}{\text{Total Detected Segments}} \times 100$$

$$\mathrm{TMR}\% = \frac{\text{Missed True Segments}}{\text{True Segments}} \times 100$$

The results obtained for segmentation were quite good, with an FAR of 3.2% and a TMR of 6.7%.

Recognition tests were carried out on these same segments and the results are shown in figures 3 and 4. The audio segments were divided into three classes, namely large (60 seconds), medium (15-25 seconds) and small (under 15 seconds). These segments were those obtained as the result of the previous segmentation algorithm applied to the speech data. Each segment therefore corresponded to the continuous speech of one speaker. The results for large and medium segments are promising, with the former showing error rates close to 0% and the latter having a TMR of around 20-30% along with an FAR of under 10% for values of the relaxation constant around 1.5. Small segments however showed a much higher error rate, with the TMR being as high as 50% and the FAR between 20-30%.

[Figure 3: The FAR achieved for recognition according to segment size and the value of the relaxation constant. Panel title: Recognition Performance (FAR); x-axis: K (Relaxation Constant); y-axis: FAR %; series: Large Segments, Medium Segments, Small Segments.]

[Figure 4: The TMR achieved for recognition according to segment size and the value of the relaxation constant. Panel title: Recognition Performance (TMR); x-axis: K (Relaxation Constant); y-axis: TMR %; series: Large Segments, Medium Segments, Small Segments.]

4.0 Conclusions

This paper has addressed the problem of speaker recognition and indexing using a segmentation procedure based on BIC, a GMM recognition system and an indexing scheme utilizing metadata based on sampled MFCC features. The techniques outlined here have given good segmentation results and a satisfactory recognition rate when used with medium to large sized segments recorded under the same conditions.

Due to the independence of the segmentation and speaker training components, the actual work of speaker recognition may be delayed until a query is carried out. This means that knowledge of speakers is not required when a new audio clip is segmented. We only require that a speaker model exist at the time of the actual search. This creates the possibility of extending the system so that a search can be carried out on the criterion that the speaker in the returned segment matches that of an “example” audio clip provided by the user at search time.

However, some further evaluation has shown that the performance of this system gets worse when applied to segments extracted from audio recorded under different conditions. This shows that the method is sensitive to channel and noise effects in the audio. There remains much scope to explore various forms of filtering as well as channel normalization in order to improve the recognition rates of this system.

In addition, the indexing scheme outlined is not as fast as the use of anchor models. One way to combine the speed advantage of anchor models with the accuracy of GMM based indexing would be to use the anchor model based indexing scheme as the first stage of the search procedure, in order to cut down on the number of segments considered.
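The query-time decision of figure 2 and the FAR and TMR definitions of section 3.0 can be sketched together as follows. Several things here are assumptions rather than details from the paper: the decision boundary "m + sd * k" is read as accepting a segment whose log-likelihood lies within K standard deviations of the speaker's mean, each segment's log-likelihood under the speaker's GMM ensemble is taken as already computed, and the function names and toy values are purely illustrative.

```python
def segment_matches(segment_ll, ll_mean, ll_sd, k):
    """Assumed decision rule: accept when the segment's log-likelihood is
    within k standard deviations of the speaker's test-data mean."""
    return abs(segment_ll - ll_mean) <= k * ll_sd

def search(segment_lls, ll_mean, ll_sd, k):
    """Return the ids of segments accepted for the queried speaker.

    `segment_lls` maps segment id -> log-likelihood of that segment's
    sampled MFCC vectors under the speaker model (assumed precomputed).
    """
    return {seg for seg, ll in segment_lls.items()
            if segment_matches(ll, ll_mean, ll_sd, k)}

def far_tmr(detected, truth):
    """False Alarm Rate and True Miss Rate, both as percentages."""
    far = 100.0 * len(detected - truth) / len(detected) if detected else 0.0
    tmr = 100.0 * len(truth - detected) / len(truth) if truth else 0.0
    return far, tmr

# Toy values, invented for illustration only.
segment_lls = {"s1": -19.0, "s2": -24.5, "s3": -21.5, "s4": -30.0}
truth = {"s1", "s2"}          # segments that really contain the speaker
strict = search(segment_lls, ll_mean=-20.0, ll_sd=2.0, k=1.0)
relaxed = search(segment_lls, ll_mean=-20.0, ll_sd=2.0, k=2.5)
```

With these toy numbers the stricter query misses one true segment, while the relaxed query recovers it at the cost of keeping a false alarm, which mirrors the trade-off governed by K.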
References

[1] Pool, J. (2002), “Investigation on the Impact of High Frequency Transmitted Speech on Speaker Recognition”, MSc Thesis, University of Stellenbosch

[2] Bimbot, F., Hutter, H. & Jaboulet, C. (1998), “An Overview of the Cave Project Research Activities in Speaker Verification”, Speech Communication, Vol. 31, Number 2-3, pp. 155-180

[3] Kemp, T., Schmidt, M., Westphal, M. & Waibel, A. (2000), “Strategies for Automatic Segmentation of Audio Data”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP, Volume 3, pp. 1423-1426

[4] Lu, L. & Zhang, H. (2002), “Real-Time Unsupervised Speaker Change Detection”, Proceedings of the 16th International Conference on Pattern Recognition (ICPR), Vol. II, pp. 358-361

[5] Pietquin, O., Couvreur, L. & Couvreur, P. (2002), “Applied Clustering for Automatic Speaker-Based Segmentation of Audio Material”, Belgian Journal of Operations Research, Statistics and Computer Science (JORBEL) - Special Issue on OR and Statistics in the Universities of Mons, vol. 41, no. 1-2, pp. 69-81

[6] Siegler, M.A., Jain, U., Raj, B. & Stern, R.M. (1997), “Automatic Segmenting, Classification and Clustering of Broadcast News Audio”, Proceedings of Darpa Speech Recognition Workshop, February 1997, Chantilly, Virginia

[7] Tritschler, A. & Gopinath, R. (1999), “Improved Speaker Segmentation and Segment Clustering Using the Bayesian Information Criterion”, Proceedings of Eurospeech 1999, September 1999, Budapest

[8] Delacourt, P., Kryze, D. & Wellekens, C.J. (1999), “Speaker Based Segmentation for Audio Data Indexing”, ESCA Workshop on Accessing Information in Audio Data, 1999, Cambridge, UK

[9] Orman, O.D. (1996), “Frequency Analysis of Speaker Identification Performance”, MSc Thesis, Bogazici University

[10] Cordella, L.P., Foggia, P., Sansone, C. & Vento, M. (2003), “A Real-Time Text-Independent Speaker Identification System”, Proceedings of the 12th International Conference on Image Analysis and Processing, IEEE Computer Society Press, Mantova, pp. 632-637

[11] Reynolds, D.A. (1995), “Robust Text-Independent Speaker Identification Using Gaussian Mixture Models”, IEEE Transactions on Speech & Audio Processing, Volume 3, pp. 72-83

[12] Sarma, S.V. & Sridevi V. (1997), “A Segment Based Speaker Verification System Using SUMMIT”, Proceedings of Eurospeech 1997, September 1997, Rhodes, Greece

[13] Bonastre, J., Delacourt, P., Fredouille, C., Merlin, T. & Wellekens, C. (2000), “A Speaker Tracking System Based on Speaker Turn Detection for NIST Evaluation”, IEEE International Conference on Acoustics, Speech, and Signal Processing, June 2000, Istanbul, Turkey, pp. 1177-1180

[14] Liu, M., Chang, E. & Dai, B. (2002), “Hierarchical Gaussian Mixture Model for Speaker Verification”, Microsoft Research Asia publication

[15] Sturim, D.E., Reynolds, D.A., Singer, E. & Campbell, J.P. (2001), “Speaker Indexing in Large Audio Databases Using Anchor Models”, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 429-432

[16] Skowronski, M.D. & Harris, J.G. (2002), “Human Factor Cepstral Coefficients”, Proceedings of the First Pan American/Iberian Meeting on Acoustics

[17] Matsui, T. & Furui, S. (1990), “Text Independent Speaker Recognition Using Vocal Tract and Pitch Information”, Proceedings of the International Conference on Spoken Language Processing, Vol. 4, pp. 137-140

[18] Wen, J., Li, Q., Ma, W. & Zhang, H. (2002), “A Multi-Paradigm Querying Approach for a Generic Multimedia Database System”, SIGMOD Record, Vol. 32, No. 1

[19] Klas, W. & Aberer, K. (1995), “Multimedia Applications and Their Implications on Database Architectures”,