HOME ASSIGNMENT
SUBMITTED BY:
MANVI PRIYA BE/10007/14
CHIRAG JAIN BE/10038/14
ANKITA SINGH BE/10136/14
MEC2163 - SPEECH PROCESSING AND RECOGNITION
EFFECT OF WATERMARKING IN SPEECH SIGNAL
INTRODUCTION
• Watermarking is the art and technique of hiding additional data (such as watermark bits, a logo or a text message) in a host signal, which may be an image, video, audio, speech or text, without any perceptible trace of the additional information. The information embedded in the host signal should be extractable and must resist various intentional and unintentional attacks. The digital speech watermarking process is depicted in Fig.
TYPES OF DIGITAL SPEECH WATERMARKING
• There are two main types of digital speech watermarking in terms of robustness:
1. Robust digital speech watermarking, in which the embedded additional information must resist channel attacks.
2. Fragile digital speech watermarking, in which the additional information must be destroyed if any attack or transformation takes place, much like the paper watermarks in bank notes.
• In terms of the information required by the extraction module, there are three main categories:
1. Blind speech watermarking, which does not need any extra information such as the original signal, logo or watermark bits.
2. Semi-blind speech watermarking, which may need extra information for the extraction phase, such as access to the published watermarked signal (the original signal with only the watermark added).
3. Non-blind speech watermarking, which needs both the original signal and the watermarked signal.
APPLICATIONS OF DIGITAL SPEECH WATERMARKING
• Several applications of digital speech watermarking are known:
1. Copy control: cryptographic algorithms are slow, and a cracker may use software such as DeCSS or reverse-engineering techniques to recover a valid key. Watermarking, by contrast, can be bound to the content itself so that the watermark bits are easily detectable and a compliant recording device refuses to copy the material.
2. Device control: this is the broader category, of which copy control is one application. For example, Digimarc’s MediaBridge lets action toys interact with a TV programme, and functions such as skipping advertisements can be turned on and off automatically.
3. Owner identification: under American law the rights holder can act against misuse of the material even when no copyright notice is present; an embedded watermark therefore helps to protect the holder’s rights in distributed copies without relying on a textual copyright notice.
4. Proof of ownership: maintaining a central repository for every copyright is too costly when a textual copyright record is required, so watermarking can be used as an alternative proof of ownership.
For authentication, a fragile watermark is embedded in the original data. If an impostor manipulates the content, the watermark is altered and, as a consequence, the media is no longer taken as genuine.
Another watermarking application is fingerprinting, which enables the holder to detect and trace the authorized copy from which an unauthorized version originated, and so to restrict unauthorized users. Other applications of watermarking are broadcast monitoring, copy prevention, access control and transaction tracking.
WATERMARK DESIGN
• Each audio signal is watermarked with a unique codeword (a minimal sketch of generating such a codeword is given after this list).
• Our watermarking scheme is based on a repeated application
of a basic watermarking operation on processed versions of
the audio signal.
• The basic method uses three steps to watermark an audio
segment as shown in Fig.
• The complete watermarking scheme is shown in Fig. Below we provide a detailed explanation of the basic watermarking step and of the complete watermarking technique.
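As a minimal, illustrative sketch of deriving a unique noise-like codeword per audio signal (the key-seeded pseudo-random generator, the function name unique_codeword and its parameters are assumptions, not the exact scheme described above), in Python:

import numpy as np

def unique_codeword(signal_id: int, length: int = 1024, key: int = 7) -> np.ndarray:
    # Derive a reproducible, noise-like +/-1 codeword for a given signal id.
    # Seeding the generator with (key, signal_id) is an illustrative choice only.
    seed = (key * 1_000_003 + signal_id) % (2**32)
    rng = np.random.RandomState(seed)
    return rng.choice([-1.0, 1.0], size=length)

w = unique_codeword(signal_id=42)   # one codeword per audio signal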
METHODS OF DIGITAL
WATERMARKING
Different methods are used for digital speech watermarking; the figure presents an overview of these methods.
AUDITORY MASKING
• Auditory masking in general is defined by the American Standards Association as ‘the process by which the threshold of audibility for one sound is raised by the presence of another sound’ and as ‘the amount by which the threshold of audibility of a sound is raised by the presence of another sound’.
TEMPORAL MASKING
• Temporal masking consists of pre-masking and post-masking. In post-masking, a weaker maskee remains inaudible for roughly 50 to 200 ms after a stronger masker. In pre-masking, the weaker signal becomes inaudible about 5 to 20 ms before the stronger masker begins. The pre-masking effect is much harder to detect than the post-masking effect. Temporal masking is analysed in the time domain.
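A toy sketch of these masking windows (the fixed 20 ms / 200 ms limits and the function name is_temporally_masked are simplifying assumptions, not a full psychoacoustic model), in Python:

def is_temporally_masked(t_event_ms: float, t_masker_ms: float,
                         pre_ms: float = 20.0, post_ms: float = 200.0) -> bool:
    # True if an event at t_event_ms could be hidden by a strong masker at
    # t_masker_ms, using the rough pre-masking (up to ~20 ms before) and
    # post-masking (up to ~200 ms after) windows described in the text.
    dt = t_event_ms - t_masker_ms
    return -pre_ms <= dt <= post_ms

# e.g. a weak component 100 ms after a loud masker falls in the post-masking window
print(is_temporally_masked(t_event_ms=600.0, t_masker_ms=500.0))   # True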
WATERMARK EMBEDDING:
• By applying the masking effect in the frequency and temporal domains, the watermark, which is a noise-like sequence, is shaped as follows:
1. The speech signal is segmented into blocks of a predefined size.
2. The power spectrum of each block is calculated by FFT or DWT.
3. The frequency masking threshold of the block is computed.
4. The masking weights are applied to shape the watermark bits (the noise-like sequence).
5. The inverse power spectrum (inverse FFT or inverse DWT) is computed.
6. The temporal masking for shaping the noise-like sequence is calculated.
7. The temporal and frequency masking are used together to embed the watermark into the speech signal.
The process is shown in Fig.; a minimal embedding sketch is given below.
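As a minimal, illustrative sketch of this masking-shaped embedding (block segmentation, FFT, a crude stand-in for a real frequency masking model, and a simple temporal envelope; the function name embed_watermark and the parameters block_size, alpha and key are assumptions, not the exact scheme on the slides), in Python:

import numpy as np

def embed_watermark(speech, watermark_bits, block_size=1024, alpha=0.1, key=42):
    # Shape a noise-like watermark with a crude spectral mask and a temporal
    # envelope, then add it to the speech signal block by block.
    rng = np.random.RandomState(key)
    out = speech.astype(float)
    for b, start in enumerate(range(0, len(out) - block_size + 1, block_size)):
        block = out[start:start + block_size].copy()
        spectrum = np.fft.rfft(block)                  # spectrum of the block
        mask = alpha * np.abs(spectrum)                # crude stand-in for a frequency masking threshold
        bit = 1.0 if watermark_bits[b % len(watermark_bits)] else -1.0
        noise = rng.choice([-1.0, 1.0], size=len(spectrum))   # noise-like sequence
        spectrum = spectrum + bit * mask * noise       # shape the watermark by the mask
        shaped = np.fft.irfft(spectrum, n=block_size)  # back to the time domain
        envelope = np.abs(block) / (np.max(np.abs(block)) + 1e-12)  # crude temporal mask
        out[start:start + block_size] = block + envelope * (shaped - block)
    return out

# usage (illustrative): watermarked = embed_watermark(np.random.randn(16000), [1, 0, 1, 1])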
Using the temporal masking model guarantees that the watermark cannot be heard. Applying the frequency-domain masking by itself may not be enough, for example when a fixed Fourier-transform window does not provide a suitable time resolution. In such cases, when the FFT is applied to the watermark, it can smear over the whole block; if the block’s energy is concentrated in a sub-interval shorter than the block under analysis, the watermark spread over the rest of the block is not masked, and this situation causes distortion.
WATERMARK EXTRACTION
• The watermark bits must be detectable even if the speech signal has undergone various signal-processing attacks. Although the watermark is treated as noise relative to the speech, an attacker may still attempt to destroy it blindly. Let N be the number of recovered speech samples and assume the extraction algorithm knows the proper location within the received speech signal; the samples may or may not contain watermark bits. It can be assumed that r(i) = s(i) + d(i), where d(i) is a contaminant consisting of either noise alone or watermark plus noise. The watermark bits are extracted by hypothesis testing as in Eq. (1) below, where n(i) is noise and w′(i) is the modified watermark.
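The slide’s Eq. (1) is not reproduced here; a standard two-hypothesis formulation consistent with the definitions above (an assumption about the exact form used) is:

H0: r(i) = s(i) + n(i)              (watermark absent)
H1: r(i) = s(i) + w′(i) + n(i)      (watermark present),   0 ≤ i ≤ N−1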
• In another paper (Swanson et al. 1998), a similar measure is used to evaluate the robustness of the algorithm by comparing the original watermark w(i) with the extracted watermark w′(i), as in the following equation.
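The exact expression is not shown on the slide; a common normalized-correlation similarity of this kind (an assumption about the precise form used in Swanson et al. 1998) is:

Sim(w, w′) = Σi w(i)·w′(i) / √( Σi w′(i)² )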
• The similarity value is compared to a threshold to evaluate the system’s robustness. In some cases, however, the extraction system may not find the exact location of the watermark bits, as in r(i) = s(i+τ) + d(i), 0 ≤ i ≤ N−1, where the parameters are as before and τ is a delay corresponding to a time shift of the samples.
• In this case, to evaluate robustness, a generalized likelihood ratio test (Swanson et al. 1998) is performed to determine whether or not the received speech contains the watermark, as in the following equation. The ratio is again compared to a threshold; a higher ratio means that the watermark is present. The generalized likelihood ratio test is also performed when the speech signal is suspected of having undergone a time-scaling attack.
Perceptual distance between watermarked and original speech
• Many methods are available for calculating the perceptual distance; the most common is the Lp-norm. As p increases, high-energy regions are given more weight in the measurement. Applying the L1 norm is shown in the following equation,
where c2 is an additional calibration constant that improves the sensitivity of the model. Equation (5) leads to an analytical expression for the masking threshold, as seen in Eq. (6). Most quantization-noise speech watermarking schemes assume that X and ε are uncorrelated, i.e. E(Xε) = 0. Equation (6) is shown as follows:
Another assumption relates to a masking situation in which only negligible errors corrupt the clean speech signal.
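For reference, a generic Lp-norm perceptual distance between the original speech s and the watermarked speech ŝ (a standard definition, not the calibrated model with c2 referenced above) is:

Dp(s, ŝ) = ( (1/N) Σi |ŝ(i) − s(i)|^p )^(1/p),   0 ≤ i ≤ N−1

with p = 1 giving the L1 case; larger p gives more weight to regions with larger error.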
QUESTIONS:
1. What is the use of dynamic time warping?
2. What are the merits and demerits of the silence portion of a speech signal?
3. Consider an HMM representation of a coin tossing experiment. Assume a
three state model corresponding to three different coins with
probabilities
            State 1   State 2   State 3
P(heads)    0.50      0.25      0.25
P(tails)    0.50      0.75      0.75
And with all state transition probabilities equal to 1/3 (assume initial state
probabilities of 1/3)
Sequence O = {HHHHTHTTTT}
What state sequence is most likely? What is the probability of the
observation sequence and this most likely state sequence?
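As a worked sketch for Question 3 (the standard Viterbi recursion is assumed; variable names such as viterbi, A, B and pi are illustrative), in Python:

import numpy as np

# Model from the question: uniform initial and transition probabilities (1/3),
# and per-state emission probabilities for Heads (column 0) and Tails (column 1).
pi = np.full(3, 1/3)
A = np.full((3, 3), 1/3)
B = np.array([[0.50, 0.50],    # state 1: P(H), P(T)
              [0.25, 0.75],    # state 2
              [0.25, 0.75]])   # state 3

def viterbi(obs):
    # Return the most likely state sequence (1-indexed) and the joint
    # probability of that sequence together with the observations.
    T = len(obs)
    delta = np.zeros((T, 3))
    psi = np.zeros((T, 3), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(3):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.insert(0, int(psi[t, path[0]]))
    return [p + 1 for p in path], float(np.max(delta[-1]))

O = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]   # H = 0, T = 1 for O = {HHHHTHTTTT}
states, prob = viterbi(O)
print(states, prob)
# With uniform priors and transitions, the best state at each step is simply the
# one with the largest emission probability: state 1 for H, state 2 (tied with
# state 3) for T, and the joint probability is (1/3)^10 · 0.5^5 · 0.75^5 ≈ 1.26e-7.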
QUESTIONS (contd.)
4. Consider the observation sequence O’= {HTTHTHHTTH}.
How would your answer to the question change?
5. Differentiate between LPC and LPCC. How is LPCC superior for speech recognition?