Konversa.docx - konversa.googlecode.com



Konversa : A Personal Audiolog

Guru Gopalakrishnan, Krishnakanth Chimalamarri, Midhun Achuthan
University of Southern California
{ggopalak,chimalam,achuthan}@usc.edu

Abstract

In recent years, the computing capability of mobile phones has increased drastically. With the advent of 3G, these devices also remain connected to the internet over high-bandwidth connections for most of the day. We present Konversa (derived from "conversa", the Portuguese word for conversation), a system that logs a mobile phone user's non-phone conversations over the day. These conversations are time-stamped and location-stamped, so that the user can keep track of when and where each conversation took place.

1. Introduction

Mobile phones are unobtrusive devices that are carried by the user throughout the day. They have microphones capable of capturing sound of average quality, although the quality varies across environments: outdoor conversations are often subject to heavy noise interference, and even indoor conversations may be disturbed by music or crowd noise. There are also situations where the user is not participating in a conversation but is close enough to another conversation that the phone's microphone picks up its audio.

Konversa discards such conversations and, by the end of the day, presents the user with a time- and location-stamped log of all the conversations he took part in during the day. This is done by opportunistically sending the recorded audio samples to a remote server on the internet. Classification algorithms run on the server to determine which of the samples contain relevant conversations, and the server sends its decision back to the phone so that irrelevant samples can be discarded.
Konversa runs as a service on the Android G1 phone and provides an interactive application to play back the audio clips based on time and on location on a map. The backend server is a Linux machine that listens for requests from the phone. Konversa is written in Java and MATLAB.

2. Design and Implementation

2.1 Design Issues

We faced a number of challenges while designing Konversa. Although the Android G1 is computationally superior to most other phones on the market, it is not powerful enough to run the classification algorithms. In addition, the input audio must be filtered and processed before features can be extracted. Doing this on the phone is not computationally feasible given the phone's limited memory allocation for user applications, and it would interfere with basic operations such as calls and text messaging. To overcome this limitation, we set up a backend server to which the phone opportunistically connects and uploads recorded samples. The server processes these samples and sends the results back to the phone.

2.2 System Specification

Konversa was deployed on the developer version of the Android G1 phone. The Android platform supports applications written in Java that are compiled for and run on the Dalvik VM. The phone comes with various built-in sensors; for this experiment we use the microphone to record audio and the GPS service to track the location of the phone. The backend server runs Ubuntu Linux 8 with a JRE. The classification algorithms are written in MATLAB, which runs on the server.

2.3 Architecture

2.3.1 Network Architecture

Konversa communicates with the backend server via 802.11g Wi-Fi or 3G, and it does so opportunistically. When there is no connectivity, it caches all recorded samples in local phone memory. When a connection is available, a separate thread attempts to upload as many samples as it can via SFTP.
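To illustrate, the opportunistic caching and upload logic can be sketched as below. This is a minimal sketch, not the actual service code: the class and method names (UploadCache, cache, drain) are hypothetical, and the SFTP transfer is stubbed out with an in-memory list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of the delay-tolerant upload queue: recorded clips are cached
// while the phone is offline and drained opportunistically when a
// Wi-Fi/3G connection appears. The real service uploads over SFTP;
// here the transfer itself is replaced by an in-memory list.
public class UploadCache {
    private final Deque<String> pending = new ArrayDeque<>();
    private final List<String> uploaded = new ArrayList<>();

    // Called by the recorder thread after each clip is saved locally.
    public void cache(String clipPath) {
        pending.addLast(clipPath);
    }

    // Called periodically by the communications thread; uploads as many
    // cached clips as possible while connectivity lasts.
    public int drain(boolean connected) {
        int count = 0;
        while (connected && !pending.isEmpty()) {
            uploaded.add(pending.removeFirst()); // stand-in for the SFTP put
            count++;
        }
        return count;
    }

    public int pendingCount() { return pending.size(); }
    public int uploadedCount() { return uploaded.size(); }
}
```

The key design point this captures is that recording never blocks on the network: the recorder only appends to the cache, and a separate thread empties it whenever connectivity allows.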
After processing, the server sends back a decision based on which the phone saves or discards the sample.

Fig 1. Konversa Network

2.3.2 System Architecture

Konversa runs on the phone as a service and uses the phone's microphone to capture audio. The recorder thread wakes up at specified intervals, enables the microphone and starts capturing. It saves the audio in the native 3GP format, along with a time-stamp and location-stamp, in a temporary cache that holds the unclassified, unprocessed files. The communications thread periodically picks an unprocessed file and sends it to the backend server via Secure FTP (SFTP). We chose SFTP because personal audio samples may contain private information and need to be sent over a secure channel.

Fig 2. System Architecture

On receiving a file, the server performs some pre-processing before features can be extracted and classified. The following processing takes place at the server.

a. Conversion to raw format

Android (like most other phones) records audio in the 3GP format specified by the Third Generation Partnership Project. This format stores video as MPEG-4 Part 2, H.263 or MPEG-4 Part 10 (AVC/H.264), and audio as AMR-NB, AMR-WB, AMR-WB+, AAC-LC or HE-AAC [3]. In order to apply signal processing techniques, the audio has to be decoded and uncompressed to a raw format. The server decodes the AMR audio channel to raw WAV, which is accepted by the audio signal processing tools used in the later stages. The open-source ffmpeg tool is used for this decoding step.

b. Noise removal

Audio captured from the phone is often subject to environmental noise. Konversa reduces noise using a band-pass filter: a low-pass filter removes noise above 3400 Hz, and a high-pass filter removes noise below 300 Hz.
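As a sketch of what such a band-pass stage does, the following cascades a first-order high-pass with a first-order low-pass. This is illustrative only: the actual system delegates filtering to SoX, and the coefficients below are the standard RC approximations, not SoX's filter design.

```java
// Illustrative 300-3400 Hz band-pass stage: a first-order high-pass
// followed by a first-order low-pass, using standard RC-derived
// coefficients. (The real pipeline uses SoX for this step.)
public class BandPass {
    public static double[] highPass(double[] x, double cutoffHz, double sampleRate) {
        double rc = 1.0 / (2 * Math.PI * cutoffHz);
        double dt = 1.0 / sampleRate;
        double beta = rc / (rc + dt);
        double[] y = new double[x.length];
        y[0] = x[0];
        for (int n = 1; n < x.length; n++)
            y[n] = beta * (y[n - 1] + x[n] - x[n - 1]); // attenuates low frequencies
        return y;
    }

    public static double[] lowPass(double[] x, double cutoffHz, double sampleRate) {
        double rc = 1.0 / (2 * Math.PI * cutoffHz);
        double dt = 1.0 / sampleRate;
        double alpha = dt / (rc + dt);
        double[] y = new double[x.length];
        y[0] = alpha * x[0];
        for (int n = 1; n < x.length; n++)
            y[n] = y[n - 1] + alpha * (x[n] - y[n - 1]); // attenuates high frequencies
        return y;
    }

    // Keep roughly the 300-3400 Hz voice band.
    public static double[] bandPass(double[] x, double sampleRate) {
        return lowPass(highPass(x, 300.0, sampleRate), 3400.0, sampleRate);
    }
}
```

For example, a constant (0 Hz) input is driven toward zero by the high-pass half of the cascade, which is exactly the behaviour wanted for low-frequency rumble.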
The resulting sample has a frequency range between 300 Hz and 3400 Hz. Two different tools were tested, JSyn and Sound eXchange (SoX); SoX gave better performance in terms of speed.

c. Silence removal

Most conversation samples contain silence for irregular periods of time. The presence of silence affects the features extracted from the audio sample, which in turn affects classification accuracy. After experimenting with various thresholds, we obtained the best results by trimming anything below 0.3% of the sample's peak amplitude. Removing silence also shortens the clip by a varying amount. Sound eXchange was used to remove silence from the clips.

Fig 3. Comparison of waveforms after each step of filtering

Fig 4. Comparison of waveforms (a) before removing silence and (b) after removing silence

d. Splitting up samples for conversation detection

The recorded clips originally have a duration of 60 s. In order to detect a conversation, these samples have to be split up further. We found that splitting them into 5 s parts was sufficient for the classifier to identify the speaker's presence. Since removing silence shortens the clip, we discard the last part if it is shorter than 5 s.

e. Extracting features

In order to classify audio, we need to extract features that are unique to a speaker. The most widely used features are MFCCs (Mel-Frequency Cepstral Coefficients). These coefficients collectively make up an MFC, a representation of the short-term power spectrum of a sound based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency; they are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum") [4]. The features for each part are extracted using MATLAB.

f. Classification

The resulting feature vector is classified using the VQ codebook that was initially trained on the user's voice. The algorithm is explained in detail in the next section.

The result is sent back to the phone. If it is "yes", the communication thread moves the file into the classified files list, which the user can access through the GUI via the native SQLite database. Otherwise the file is removed from the cache.

3. Classification

Konversa has a fairly simple multi-threaded architecture in which separate threads handle capturing and communication, as explained in section 2.3. Most of the development work was devoted to identifying and implementing the right kind of classifier. After experimenting with a few different models, we chose the one that gave the maximum accuracy with the fewest implementation constraints and that would integrate seamlessly with the underlying client-server architecture. In this section we describe the models we experimented with and why we chose the vector quantization model.

a. Artificial Neural Networks

Artificial neural networks implement a discrimination-based learning procedure. We used a 3-layer multilayer perceptron model trained with backpropagation. 20 MFCC features were extracted for each of 20 windows of the sample, each window of size 512. These features were fed into 400 perceptrons in the input layer; the hidden layer consists of 300 perceptrons, and the output layer has a single perceptron which outputs the result of the classification. A similar model has been used previously for text-dependent speech recognition, in which case the number of output perceptrons corresponds to the categories of phonemes to identify [2]. Although we were skeptical that this model would work well for text-independent classification, we decided to experiment and see what the results would look like. The network was trained with samples from 4 different speakers.
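For illustration, the forward pass of such a 400-300-1 network can be sketched as follows. This is a sketch only: the weights below are random placeholders rather than trained values, the real network was trained with backpropagation, and the class name Mlp is hypothetical.

```java
import java.util.Random;

// Toy sketch of the 400-300-1 multilayer perceptron's forward pass
// (20 MFCCs x 20 windows = 400 inputs). Weights are random placeholders;
// the real network was trained with backpropagation.
public class Mlp {
    final double[][] w1; final double[] b1;  // input -> hidden
    final double[] w2; double b2;            // hidden -> output

    public Mlp(int in, int hidden, long seed) {
        Random r = new Random(seed);
        w1 = new double[hidden][in];
        b1 = new double[hidden];
        w2 = new double[hidden];
        for (int h = 0; h < hidden; h++) {
            for (int i = 0; i < in; i++) w1[h][i] = r.nextGaussian() * 0.05;
            w2[h] = r.nextGaussian() * 0.05;
        }
    }

    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // Returns the target-speaker score, always in the open interval (0, 1).
    public double forward(double[] x) {
        double out = b2;
        for (int h = 0; h < w1.length; h++) {
            double a = b1[h];
            for (int i = 0; i < x.length; i++) a += w1[h][i] * x[i];
            out += w2[h] * sigmoid(a);
        }
        return sigmoid(out);
    }
}
```

During training, the target output would be 1.0 for samples of the target speaker and 0.0 otherwise, matching the single output perceptron described above.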
We will refer to the target speaker as Starget; the remaining participants are referred to as S1, S2 and S3.

Fig 5. Structure of the neural network

For each of the 4 speakers, we collected 6 samples of length 10 s to create the training set. The ideal output was {1.0} for Starget and {0.0} for the remaining speakers. The network was trained for 500 epochs per sample, or until the error rate dropped to 0.001. After training, the network tried to classify the samples from the training set and from a validation set containing 4 samples per speaker that did not belong to the training set. The results were not impressive: classification accuracy was around 60-70% even on the training set and 50-60% on the validation set, clearly indicating that the neural network model was not suitable in this scenario. Moreover, as noted in [8], the optimal structure of such networks has to be selected by trial-and-error, the available data must be split into training and cross-validation sets, and the temporal structure of speech signals remains difficult to handle, all of which make this model disadvantageous.

b. Gaussian Mixture Models

A mixture model is a probabilistic model for density estimation using a mixture distribution, and fitting one can be regarded as a type of unsupervised learning or clustering. Voice is considered to be a mixture of Gaussian components, and GMMs are known to perform well in speaker recognition. A mixture model consists of several Gaussian components, each with a mean, variance and weight. These have to be initialized to some value and then trained using EM. The initialization algorithm we used was k-means clustering with 19 clusters over MFCC vectors of 20 dimensions. We used a library called COMIRVA as a starting point for our design.
This library was optimized for musical instruments, so we had to tweak it: in particular, we added a floor value of 0.01 to the covariances [6] and restricted the MFCC frequency range to the voice band between 300 and 3000 Hz. We initially recorded a 90 s training sample on the phone [6] and trained our GMM using EM (Expectation-Maximization) over the initialized values. We then classify an input sample by computing the log-likelihood that the trained model generated it.

On each EM iteration, re-estimation formulas are used which guarantee a monotonic increase in the model's likelihood value [5]: for each component i with posterior probability Pr(i | x_t), the mixture weight is re-estimated as the average of Pr(i | x_t) over the T training vectors, the mean as the Pr(i | x_t)-weighted average of the vectors x_t, and the variance as the Pr(i | x_t)-weighted average of x_t squared minus the square of the new mean.

From our experiments we decided not to go with GMMs, due to their poor performance in the presence of noise and our limited training set. From [6,7] we find that GMMs are usually trained on 16 speakers or more; we were testing with 3 speakers, which could be one reason why the GMM did not work well. Also, due to singularities in the covariance matrix during training, we had to apply a floor value. Although this technique is standard in speech processing, it might have removed the subtle differences between speakers in our limited training set.

c. Vector Quantization (LBG)

Vector quantization (VQ) is a lossy data compression method based on the principle of block coding; it is a fixed-to-fixed length algorithm. In 1980, Linde, Buzo and Gray (LBG) proposed a VQ design algorithm based on a training sequence. Before this, VQ design was considered difficult because it required evaluating multidimensional integrals; using training sequences eliminates the need for these integrals. This is the LBG VQ algorithm, which we used in our implementation.

A VQ is an approximation algorithm.
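The LBG procedure can be sketched on one-dimensional toy data as follows. This is a sketch, not the production code: the real system quantizes 20-dimensional MFCC vectors into a 16-entry codebook, and the class name Lbg is hypothetical.

```java
import java.util.Arrays;

// Toy LBG codebook training on 1-D data: grow the codebook by splitting,
// then refine with nearest-neighbour assignment and centroid updates.
public class Lbg {
    public static double[] train(double[] data, int codebookSize, int iters) {
        double mean = 0;
        for (double v : data) mean += v;
        double[] code = { mean / data.length };  // start from the global centroid
        while (code.length < codebookSize) {
            // Splitting stage: each code-vector becomes a perturbed pair.
            double[] split = new double[code.length * 2];
            for (int i = 0; i < code.length; i++) {
                split[2 * i]     = code[i] * (1 + 1e-3);
                split[2 * i + 1] = code[i] * (1 - 1e-3);
            }
            code = split;
            for (int it = 0; it < iters; it++) {
                double[] sum = new double[code.length];
                int[] count = new int[code.length];
                for (double v : data) {          // nearest-neighbour condition
                    int best = 0;
                    for (int c = 1; c < code.length; c++)
                        if (Math.abs(v - code[c]) < Math.abs(v - code[best])) best = c;
                    sum[best] += v;
                    count[best]++;
                }
                for (int c = 0; c < code.length; c++)  // centroid condition
                    if (count[c] > 0) code[c] = sum[c] / count[c];
            }
        }
        Arrays.sort(code);
        return code;
    }
}
```

Classification then amounts to computing the average distortion of a sample's feature vectors against each speaker's codebook and picking the smallest.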
Similar to rounding off, the VQ design problem is: "Given a vector source with its statistical properties known, given a distortion measure, and given the number of code-vectors, find a codebook and a partition which result in the smallest average distortion." [10] Two conditions have to be satisfied by this algorithm:

* Nearest Neighbor Condition: the encoding region of a code-vector should consist of all vectors that are closer to that code-vector than to any other code-vector.
* Centroid Condition: a code-vector should be the average of all training vectors in its encoding region. We should ensure at least one training vector per region to avoid divide-by-zero problems.

LBG solves the design problem by applying these two conditions iteratively. Initially, the average of all training vectors is calculated; then a splitting stage doubles the number of code-vectors, and the two conditions are applied iteratively until the algorithm terminates. We chose a codebook size of 16. Fig 6 shows an example of VQ clustering [9].

Fig 6. LBG-VQ clustering, from [9]

4. Training

Once the classifier was determined, we performed a systematic training process. The samples collected for training go through all the pre-processing steps outlined in the architecture. The training was performed on 3 different speakers.

In the first round of training, 4 samples of length 10 s were collected for each speaker. Clips 1-4 were mapped to the target speaker; clips 5-12 correspond to the other 2 speakers. The results were good when classifying samples recorded close to the mouth, but there was some confusion when classifying samples recorded at a distance.

We then changed the training set. The sample length was increased to 45 seconds to model more realistic conversations. After removing silence, these lengths were reduced by varying amounts.
Each speaker talked for 45 seconds in two different positions. In the first position, the phone was held close to the mouth; in the second, the phone was kept on a desk about 1.5 m away from the speaker. This training set gave better classification accuracy in both positions.

One major question remained: how would the model scale to participants who were not part of the training set? To answer it, it was necessary to collect as many audio samples as possible, and we managed to increase the training set to 6 speakers. The results are provided in section 6.

5. Modelling Conversations

In order to detect a conversation, we primarily look for the presence of the selected speaker Starget in the samples. We assume that in an active conversation, Starget speaks for at least 7 to 10 s at a stretch. We split the initial 60 s sample (shorter after removing silence) into sub-samples of 5 s duration. If Starget speaks during most of a sub-sample (Fig 7.a), classification accuracy is higher; if Starget's speech is divided between two sub-samples, as in Fig 7.b, there is a higher probability of classification error. After splitting the original sample into sub-samples, the features are extracted and classified. The output for each original sample is a list of 'yes' or 'no' decisions, one per sub-sample. We define the conversation probability Pconversation as the percentage of positives (sub-samples classified as 'yes') over the total number of sub-samples in the original sample.

We set a minimum and a maximum threshold on the conversation probability. In our experiments the minimum threshold was set to 20% and the maximum to 80%. The minimum threshold overcomes some of the false negatives (sub-samples the classifier wrongly classified as 'no').
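The thresholding just described can be sketched as follows (the class and method names are hypothetical; only the 20%/80% thresholds come from the experiments above):

```java
// Sketch of the conversation decision: the fraction of 5 s sub-samples
// classified as the target speaker must fall between the minimum (20%)
// and maximum (80%) thresholds to count as a conversation.
public class ConversationDetector {
    public static double conversationProbability(boolean[] subSampleHits) {
        int yes = 0;
        for (boolean hit : subSampleHits) if (hit) yes++;
        return 100.0 * yes / subSampleHits.length;
    }

    public static boolean isConversation(boolean[] subSampleHits) {
        double p = conversationProbability(subSampleHits);
        // Below 20%: likely no real participation; above 80%: likely
        // one-way speech such as a lecture.
        return p >= 20.0 && p <= 80.0;
    }
}
```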
Such false negatives can arise from the sub-sample boundary effect described at the beginning of this section or from the inherent inaccuracies of the classifier. The maximum threshold distinguishes a conversation from one-way speech, such as a lecture, since it is highly unlikely that a person talks for most of the duration of a 60 s conversation sample.

Fig 7. Different scenarios which can affect classification accuracy of sub-samples

6. Results

The following are the results of classification using the VQ codebook, trained on 45 s samples with 2 samples per speaker. Starget is the speaker we wish to identify in the conversation; S1 and S2 are the other speakers who participated in the training set. Each 45 s sample was split into 5 s sub-samples. The tables below show the error rate on a 5 s sub-sample as the distance from the microphone increases.

a) Starget speaks alone (false negative rate)

Distance (m)   False Negative (%)
0.1            0
0.2            0
0.3            0
0.4            10
0.5            15
0.6            35
0.7            40
0.8            45
0.9            55
1.0            60

Table 1

Fig 8. Increase in false negative % with distance

b) Starget speaks with S1 (false positive rate)

Distance (m)   False Positive (%)
0.1            0
0.2            0
0.3            0
0.4            20
0.5            30
0.6            35
0.7            40
0.8            45
0.9            50
1.0            55

Table 2

Fig 9. Increase in false positive % with distance

c) Starget speaks with S2 (false positive rate)

Distance (m)   False Positive (%)
0.1            0
0.2            0
0.3            5
0.4            15
0.5            25
0.6            35
0.7            45
0.8            55
0.9            60
1.0            65

Table 3

Fig 10. Increase in false positive % with distance

In general, the error rate in detecting conversations increases as the distance from the microphone increases.

7. Conclusion and Future Work

We developed a phone application that provides a convenient way for users to keep track of important information they come across during their daily conversations. Textual conversations such as short messages and emails can be logged easily.
Providing a framework to log audio clips based on contextual information, along with their spatio-temporal characteristics, requires robust classification algorithms and network connectivity. While we were able to provide a delay-tolerant communication framework, there are still improvements to be made in the classification algorithms. During the course of this work we evaluated several classification methods. Artificial neural networks performed poorly because of the improper selection of training data. GMMs, although claimed to work well with noiseless data, had very limited data to work with in our setting (3 speakers), which could have led to their poor performance. VQ performs exceptionally well for very few speakers and also handles noise better; hence we chose VQ.

Future work, apart from improving classification accuracy, includes tuning how often the microphone samples audio. Since only a few conversations take place during the day, keeping the microphone on can severely drain the device's battery. Choosing the right sampling frequency is a challenge and involves a trade-off between resource constraints and missing all or parts of important conversations. Another idea is a multi-user system in which data from multiple users is analyzed to determine relationships between them, based on the spatio-temporal characteristics of their conversations; in such systems, however, user privacy is a major concern.

8. Acknowledgement

First and foremost, we would like to thank Prof. Gaurav Sukhatme for this opportunity and for his constant encouragement, which helped us move forward and complete the project. We are also extremely grateful to Dr. Sameera Poduri and Karthik Dantu for their time and support throughout the course of this work. We also thank Prof. Fei Sha for enlightening us on speech recognition models.

9. References

1. On the effectiveness of MFCCs and their statistical distribution properties in speaker identification.
2. Speech Recognition Using Neural Networks, Center for Spoken Language Understanding. [http://speech.bme.ogi.edu/tutordemos/nnet_recog/recog.html]
3. 3rd Generation Partnership Project. [http://www.3gpp.org]
4. Mel Frequency Cepstrum Coefficient. [http://en.wikipedia.org/wiki/Mel_frequency_cepstral_coefficient]
5. D. A. Reynolds and R. C. Rose. Robust Text-Independent Speaker Identification Using Gaussian Mixture Models (1995).
6. M. Ben, M. Betser, F. Bimbot and G. Gravier. Speaker Diarization Using Bottom-Up Clustering Based on a Parameter-Derived Distance Between Adapted GMMs (2004).
7. J. P. Campbell, Jr. Speaker Recognition (book chapter).
8. F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier and T. Merlin. A Tutorial on Text-Independent Speaker Verification.
9. Vector Quantization. [http://www.data-compression.com/vq.shtml]
10. H. B. Kekre and T. K. Sarode. Speech Data Compression Using Vector Quantization.