To understand the terminologies of multimedia
To study protocols used in multimedia applications.
To know different hardware and software
components required to run multimedia.
To evaluate multimedia services that satisfy user
The way we utilize audio and video has evolved as a result of
recent technological advancements.
In the past, we would listen to an audio broadcast on the radio
and watch a video show on the television.
People nowadays desire to utilize the Internet for audio and
video services in addition to text and image communications.
This chapter focuses on programs that provide audio and video
services via the Internet.
Audio and video services may be divided into three categories:
Streaming Stored Audio/Video
In this category, the files are compressed and stored on a server.
A client downloads the files through the Internet.
This is sometimes called on-demand audio/video.
Stored audio files: songs, symphonies, books on tape, and
Stored video files: movies, TV shows, and music video clips.
Streaming Live Audio/Video
Streaming live audio/video refers to the broadcasting
of radio and TV programs through the Internet.
A user listens to broadcast audio and video through the Internet.
E.g. Internet Radio. Some radio stations solely transmit
their programming via the Internet, while others
broadcast them both over the Internet and over the air.
Digitizing Audio and Video
Before audio or video signals can be transmitted over
the Internet, they must first be digitized.
When sound is supplied into a microphone, an
electrical analog signal is produced that represents
the amplitude of the sound as a function of time.
These signals are named analog audio signals.
An analog signal, such as audio, can be digitized to
produce a digital signal.
According to the Nyquist theorem, if the highest
frequency of the signal is f, we need to sample the
signal 2f times per second.
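As a rough worked example of the sampling rule, the sketch below derives the two audio bit rates quoted later in this chapter. The 4,000 Hz voice bandwidth with 8-bit samples, and the CD parameters (44,100 samples per second, 16 bits per sample, two channels), are assumptions taken from common practice rather than from the text above.

```python
def nyquist_rate(highest_frequency_hz):
    """Minimum number of samples per second for a signal whose highest frequency is f."""
    return 2 * highest_frequency_hz


def bit_rate(samples_per_second, bits_per_sample, channels=1):
    """Bits per second produced by uncompressed sampling."""
    return samples_per_second * bits_per_sample * channels


# Assumed parameters: telephone-quality voice and CD-quality stereo music.
voice_bps = bit_rate(nyquist_rate(4000), 8)
music_bps = bit_rate(44100, 16, channels=2)

print(voice_bps)  # 64000   (64 kbps)
print(music_bps)  # 1411200 (about 1.411 Mbps)
```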
A video is made up of a series of frames. We receive
the sensation of motion if the frames are presented on
the screen quickly enough.
The reason for this is that our eyes cannot differentiate
between the quickly flashing frames and individual
There is no standard for the number of frames per second;
nevertheless, 25 frames per second is popular in North
A frame must be refreshed to avoid a situation known as
flickering (a visible change in brightness).
Each frame is repainted twice in the television industry.
This implies 50 frames must be delivered, or 25 frames if
memory is available at the sender site, with each frame
repainted from memory.
Each frame is subdivided into picture elements, or pixels, which are
Each 8-bit pixel on black-and-white television represents one of
256 distinct grey levels. Each pixel on a color TV is 24 bits, with 8
bits for each basic color (red, green, and blue).
We can calculate the number of bits per second for a specific resolution.
At the lowest resolution, a color frame is 1024 × 768 pixels. This requires
2 × 25 × 1024 × 768 × 24 = 943,718,400 bps ≈ 944 Mbps.
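The arithmetic above can be checked with a short sketch. The function name and parameter names are illustrative only; the factor of 2 reflects each frame being repainted twice, as described earlier.

```python
def raw_video_bps(width, height, bits_per_pixel, frames_per_second, repaints=2):
    """Uncompressed video bit rate when every frame is sent `repaints` times."""
    return repaints * frames_per_second * width * height * bits_per_pixel


rate = raw_video_bps(1024, 768, 24, 25)
print(rate)               # 943718400
print(round(rate / 1e6))  # 944 (Mbps)
```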
Audio and Video Compression
Compression is required when sending audio or
video over the Internet.
1. Audio Compression
Speech and music may both benefit from audio
compression. For speech, we need to compress a 64-kbps
digitized signal; for music, a 1.411-Mbps signal.
There are two kinds of techniques for audio compression: predictive encoding and perceptual encoding.
Instead of storing all of the sampled values, predictive
encoding encodes the changes between the samples.
Speech compression is the most common use for this sort
GSM (13 kbps), G.729 (8 kbps), and G.723.1 (6.3 or
5.3 kbps) are some of the standards that have been
established.
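The idea of encoding differences rather than samples can be sketched in its simplest form, often called delta encoding. This is only an illustration: the standards listed above use far more elaborate prediction models, and the sample values below are made up.

```python
def delta_encode(samples):
    """Keep the first sample, then store only the change from the previous sample."""
    if not samples:
        return []
    encoded = [samples[0]]
    for previous, current in zip(samples, samples[1:]):
        encoded.append(current - previous)
    return encoded


def delta_decode(encoded):
    """Rebuild the samples by accumulating the stored differences."""
    samples = []
    for index, value in enumerate(encoded):
        samples.append(value if index == 0 else samples[-1] + value)
    return samples


samples = [100, 102, 105, 105, 103]   # hypothetical sampled values
encoded = delta_encode(samples)
print(encoded)                        # [100, 2, 3, 0, -2]
print(delta_decode(encoded))          # [100, 102, 105, 105, 103]
```

The differences are small numbers, so they can be stored in fewer bits than the original samples.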
Perceptual Encoding: MP3
The perceptual encoding approach is the most popular
compression technique used to generate CD-quality
This kind of audio requires at least 1.411 Mbps, which
cannot be sent without compression via the Internet.
This method is used by MP3 (MPEG audio layer 3), which
is part of the MPEG standard.
Perceptual audio coding is a type of audio signal
compression method that is based on human ear
Perceptual encoding is based on psychoacoustics, the
branch of science concerned with how people perceive
sound and with its physiological effects.
The concept is based on defects in our auditory system,
which allows some sounds to hide other sounds. Masking
can occur in both frequency and time.
Frequency masking: A loud sound in one
frequency band can partially or completely hide a
softer sound in another frequency band.
E.g. We cannot hear the words of a person who is
sitting beside us in a room where an orchestra is
playing loudly.
Temporal masking: A loud sound can affect our
hearing for a short period after it has ended in
MP3 compresses audio signals by using frequency
and temporal masking. MP3 has three different
data rates: 96 kbps, 128 kbps, and 160 kbps.
The rate is determined by the frequency range of
the original analog audio.
Video is comprised of multiple frames, and each of the frames is an
Video can be compressed by compressing the images.
The market is dominated by two standards:
Joint Photographic Experts Group (JPEG) and
Moving Picture Experts Group (MPEG).
Images are compressed using the Joint Photographic Experts Group
Video is compressed using the Moving Picture Experts Group (MPEG).
Image Compression: JPEG
In the grayscale picture, each pixel can be
represented by an 8-bit integer (256 levels).
If the picture is in color, each pixel can be represented
by 24 bits (3 × 8 bits), with each 8 bits representing
red, green, or blue (RGB).
A grayscale image is split into 8 × 8-pixel blocks in
The goal of splitting the image into blocks is to reduce
the number of computations since the number of
mathematical operations for each picture is equal to
the square of the number of units.
JPEG's entire concept is to convert the image into a
linear (vector) set of numbers that reveals the
redundancies. Using one of the text compression methods,
the redundancies (lack of changes) may then be removed.
Discrete Cosine Transform (DCT)
During this phase, each block of 64 pixels is
transformed using the discrete cosine transform (DCT).
The transformation modifies the 64 values, preserving
the relative connections between pixels while revealing
We present the transformation outcomes for
three different situations.
Case 1:Uniform Gray Scale
Case 2:Two Sections
Case 3:Gradient Gray Scale
Case 1:Uniform Gray Scale
In this case, we have a grayscale block with a value of 20
for each pixel.
When we perform the transformation, we get a nonzero
value for the first element (upper left corner); the
remaining values are 0.
The value of T(0,0) is the average (multiplied by a
constant) of the P(x,y) values and is called the dc value
(direct current, borrowed from electrical engineering).
The remaining values are called ac values, in which T(m,n)
represents changes in the pixel values. As shown in Figure the
rest of the values are 0s.
Case 2: Two Sections
In the second example, we have a block that has two
distinct uniform greyscale sections.
The pixel values have changed significantly (from 20 to
We receive a dc value as well as nonzero ac values when
we perform the transformations.
However, the dc value is surrounded by just a few nonzero
values. As per Figure 2.5, the majority of the values are
Case 3:Gradient Gray Scale
In the third case, we have a block that slowly
That is, there is no significant difference in the
values of nearby pixels.
When we do the transformations, we obtain a dc
value along with several nonzero ac values as
shown in Figure
From the above cases we can conclude that:
The transformation creates table T from table P.
The dc value is the average value (multiplied by a
constant) of the pixels.
The ac values are the changes.
Lack of changes in neighboring pixels creates 0s.
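These conclusions can be checked numerically. The sketch below assumes the standard 8 × 8 two-dimensional DCT formula commonly used for JPEG (the text above does not spell the formula out). The names P and T follow the tables mentioned above; the value 50 in the two-section block is an assumed value for illustration.

```python
import math

N = 8  # block size


def dct_2d(P):
    """Transform an 8 x 8 pixel table P into a table T of DCT coefficients."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    T = [[0.0] * N for _ in range(N)]
    for m in range(N):
        for n in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (P[x][y]
                              * math.cos((2 * x + 1) * m * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * n * math.pi / (2 * N)))
            T[m][n] = 0.25 * c(m) * c(n) * total
    return T


# Case 1: uniform block, every pixel is 20.
uniform = [[20] * N for _ in range(N)]
T1 = dct_2d(uniform)
print(round(T1[0][0], 6))  # 160.0, the dc value (average 20 times a constant 8)

# Case 2: two uniform sections (20 on the left, an assumed 50 on the right).
two_sections = [[20] * 4 + [50] * 4 for _ in range(N)]
T2 = dct_2d(two_sections)
nonzero = sum(1 for row in T2 for value in row if abs(value) > 1e-9)
print(nonzero)             # only a handful of the 64 values are nonzero
```

In the uniform case every ac value is zero up to floating-point rounding, which matches the statement that a lack of change in neighboring pixels creates 0s.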
Quantization is the process of reducing the number of
bits needed to store an integer value by reducing the
precision of the integer.
Previously, when we quantized a number, we
dropped the fraction and kept the integer part.
Here, the number is first divided by a constant, and then
the fraction is dropped.
This further reduces the number of bits required.
A quantizing table (8 x 8) is used in most
implementations to specify how to quantize each
The divisor is determined by the value's position in
the T table.
This is done to optimize the number of bits and 0s
for each specific application.
The quantizing step is the only part of the process
that cannot be reversed.
We lose some information here that cannot be
recovered.
In fact, this quantization phase is the only reason
JPEG is called lossy compression.
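A minimal sketch of this step, using a small 2 × 2 table instead of the full 8 × 8 one to keep the example short. The coefficient values and the divisors in Q are invented for illustration and are not taken from any real quantizing table.

```python
def quantize(T, Q):
    """Divide each value by the divisor at the same position and drop the fraction."""
    return [[int(t / q) for t, q in zip(t_row, q_row)]
            for t_row, q_row in zip(T, Q)]


def dequantize(quantized, Q):
    """Multiply back by the divisors; the dropped fractions cannot be restored."""
    return [[v * q for v, q in zip(v_row, q_row)]
            for v_row, q_row in zip(quantized, Q)]


T = [[160.0, -35.4],
     [12.7, 3.2]]    # hypothetical transformed values
Q = [[16, 11],
     [12, 14]]       # hypothetical divisors, one per position

quantized = quantize(T, Q)
print(quantized)                 # [[10, -3], [1, 0]]
print(dequantize(quantized, Q))  # [[160, -33], [12, 0]]
```

The restored table differs from the original T, which is exactly the information loss described above.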
The values are read from the table after quantization,
and redundant 0s are eliminated.
The table is read diagonally in a zigzag way rather
than row by row or column by column to cluster the 0s
The reason behind this is if the picture changes
smoothly, the bottom right corner of the T table is all 0s.
Figure depicts the process of reading the table.
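The zigzag reading order can be sketched as follows. This is a simplified illustration: it only drops the trailing run of 0s, whereas a real JPEG encoder applies further encoding to the resulting sequence. The 4 × 4 table is a made-up example.

```python
def zigzag_order(n=8):
    """Return (row, column) positions of an n x n table in zigzag order."""
    order = []
    for s in range(2 * n - 1):           # s = row + column of each anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:                   # even diagonals run from bottom-left to top-right
            rows = reversed(rows)
        order.extend((r, s - r) for r in rows)
    return order


def zigzag_scan(table):
    """Read the table in zigzag order and drop the trailing run of 0s."""
    values = [table[r][c] for r, c in zigzag_order(len(table))]
    while values and values[-1] == 0:
        values.pop()
    return values


table = [[10, -3, 0, 0],
         [1,  0, 0, 0],
         [0,  0, 0, 0],
         [0,  0, 0, 0]]          # hypothetical quantized values
print(zigzag_order(3)[:6])       # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
print(zigzag_scan(table))        # [10, -3, 1]
```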
Video Compression: MPEG
A motion picture is a fast sequence of frames, each of
which represents an image.
To put it another way, a frame is a spatial combination
of pixels, whereas a video is a temporal combination of
frames transmitted one after the other.
Compressing video means spatially compressing each
frame and temporally compressing a set of frames.
JPEG is used to compress each frame's spatial data.
Each frame is an image that may be compressed
Redundant frames are removed during temporal compression.
We get 50 frames per second when we watch television.
However, the majority of the frames in a sequence are nearly
E.g. When someone is speaking, the majority of the frame
remains the same from one frame to the next, with the exception
of the segment of the frame around the lips, which varies from
one frame to the next.
For temporal data compression, the MPEG method
divides frames into three types:
I-Frames (Intracoded Frame)
An I-frame is an independent frame that is not related to any other
frame (neither the frame sent before nor the frame sent after).
It is not constructed from other frames.
I-frames appear at regular intervals (e.g., every ninth frame is an I-frame).
An I-frame must appear on a regular basis to manage a
rapid change in the frame that the preceding and
subsequent frames are unable to display
A viewer may tune in at any moment when a video
If there is only one I-frame at the start of the show,
late viewers will not get a complete picture.
P-Frames (Predicted Frame)
It is related to the previous I-frame or P-frame.
Each P-frame only contains the differences from the
E.g. if an object is moving quickly, the new changes may not
be recorded in a P-frame. P-frames can only be built from
previous I- or P-frames.
P-frames carry significantly less information than other
frame types and even fewer bits after compression.
B-Frames (Bidirectional Frame)
It is related to the I-frame or P-frame that comes
before and after it. Each B-frame is relative to the
past and future. It should be noted that a B-frame is
never related to another B-frame.
Figure depicts a sample frame sequence.
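The relationship between I-frames and P-frames can be illustrated with a toy encoder. This is a deliberately simplified sketch: frames are short lists of pixel values, a P-frame just records the pixels that changed since the previous frame, and B-frames are left out. Real MPEG encoders work on blocks of pixels with motion prediction, which is not modeled here.

```python
def encode(frames, i_interval=9):
    """Store every i_interval-th frame whole (I) and the others as changes (P)."""
    encoded = []
    previous = None
    for index, frame in enumerate(frames):
        if index % i_interval == 0:
            encoded.append(("I", list(frame)))
        else:
            changes = {pos: new for pos, (old, new) in enumerate(zip(previous, frame))
                       if old != new}
            encoded.append(("P", changes))
        previous = frame
    return encoded


def decode(encoded):
    """Rebuild every frame; a P-frame is applied on top of the frame before it."""
    frames = []
    current = None
    for kind, data in encoded:
        if kind == "I":
            current = list(data)
        else:
            current = list(current)
            for pos, value in data.items():
                current[pos] = value
        frames.append(current)
    return frames


frames = [[5, 5, 5, 5], [5, 5, 6, 5], [5, 5, 7, 5]]   # hypothetical 4-pixel frames
encoded = encode(frames)
print(encoded)   # [('I', [5, 5, 5, 5]), ('P', {2: 6}), ('P', {2: 7})]
```

A receiver that joins in the middle cannot rebuild a P-frame until the next I-frame arrives, which is why I-frames must appear regularly.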
First Approach: Using a Web Server
A compressed audio/video file can be downloaded like a text file.
To download the file, the client (browser) can use HTTP services and
send a GET message.
The compressed file can be sent to the browser by the Web server.
The browser can then play the file using an application, referred to
as a media player.
This method is very simple and clear and does not require any
This method is depicted in Figure 2.10.
This method has several drawbacks.
Even after compression, an audio/video file is usually quite
Audio and video files require many megabits of storage.
The file must be completely downloaded before it can be
With today's data rates, the user will have to wait a few
seconds or even tens of seconds before the file can be
Second Approach: Using a Web Server with a Metafile
This approach involves connecting the media player
directly to the Web server and downloading the
The audio/video file and a metafile containing
information about the audio/video file are both
stored on the Web server.
The steps in this approach are depicted in Figure
1. The HTTP client accesses the Web server by using the GET message.
2. The information about the metafile comes in the response.
3. The metafile is passed to the media player.
4. The media player uses the URL in the metafile to access
the audio/video file.
5. The Web server responds.
Third Approach: Using a Media Server
The issue with the second approach is that both the
browser and the media player rely on HTTP services.
HTTP is intended to operate over TCP.
This is appropriate for retrieving the metafile but not
the audio/video file.
The reason for this is that TCP retransmits a lost or
damaged segment, which goes against the idea of streaming.
TCP and its error control must be dropped in favor of UDP.
However, HTTP, which accesses the Web server, and the Web
server itself are designed to work with TCP.
We therefore need a separate server, a media server,
for the processing of the audio and video files.
1. The HTTP client accesses the Web server by using a GET message.
2. The information about the metafile comes in the response.
3. The metafile is forwarded to the media player.
4. The media player uses the URL in the metafile to access
the media server to download the file.
5. The media server sends reply.
Fourth Approach: Using a Media Server and RTSP
The Real-Time Streaming Protocol (RTSP) is a control
protocol that was created to enhance the functionality
of the streaming process.
We can control the playback of audio/video using
RTSP is an out-of-band control protocol similar to FTP's
A media server and RTSP are depicted in Figure
1. The HTTP client accesses the Web server by using a GET message.
2. The information about the metafile comes in the response.
3. The metafile is passed to the media player.
4. The media player sends a SETUP message to create a
connection with the media server.
5. The media server responds.
6. The media player sends a PLAY message to start
7. The audio/video file is downloaded by using
another protocol that runs over UDP.
8. The connection is broken by using the TEARDOWN message.
9. The media server responds.
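The control messages in steps 4 to 9 can be pictured as a small state machine on the media server. The sketch below is an illustration only: the class name, state names, and reply strings are chosen for this example, no network traffic is involved, and other RTSP messages are left out.

```python
class MediaSession:
    """Toy model of how a media server might react to RTSP control messages."""

    # (current state, message) -> next state
    TRANSITIONS = {
        ("INIT", "SETUP"): "READY",
        ("READY", "PLAY"): "PLAYING",
        ("READY", "TEARDOWN"): "INIT",
        ("PLAYING", "TEARDOWN"): "INIT",
    }

    def __init__(self):
        self.state = "INIT"

    def handle(self, message):
        """Apply one control message and return the server's reply."""
        key = (self.state, message)
        if key not in self.TRANSITIONS:
            return "455 Method Not Valid in This State"
        self.state = self.TRANSITIONS[key]
        return "200 OK"


session = MediaSession()
for message in ["SETUP", "PLAY", "TEARDOWN"]:
    print(message, "->", session.handle(message), "| state:", session.state)
```

While the session is in the PLAYING state, the audio/video data itself travels separately from these control messages, which is what makes the control channel out-of-band.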
Streaming Live Audio/Video
Streaming live audio/video follows the same
strategy to broadcast audio and video on radio
and television stations.
The only difference is that the station broadcasts
through the Internet instead of over the air.
Streaming stored audio/video and streaming live
audio/video are both affected by delays, and neither
can accept retransmission.
There is one difference, however.
In the first application, the communication is unicast and on-demand.
In the second, the communication is multicast and live.
Live streaming is better suited to IP multicast
services and protocols like UDP and RTP.
However, live streaming is still using TCP and
multiple unicasting rather than multicasting.
We discuss several characteristics of real-time audio/video communication:
1. Time Relationship
3. Playback Buffer
The preservation of the time relationship between
packets of a session is required for real-time data on a
For example, let us assume that a real-time video server
creates live video images and sends them online.
The video is digitized and packetized.
There are only three packets and each packet holds
10s of video information.
But what if the packets arrive at different times?
The first packet arrives at 00:00:01 (1-s delay),
the second at 00:00:15 (5-s delay),
and the third at 00:00:27 (7-s delay).
If the receiver begins to play the first packet at 00:00:01,
it will end at 00:00:11.
The next packet, however, has not yet arrived; it will arrive
4 seconds later.
As the video is viewed at the remote site, there is a
gap between the first and second packets, and
between the second and third.
This is referred to as jitter.
The delay between packets causes jitter in real-time
The situation is depicted in Figure
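The gaps in this example can be computed with a short sketch. Times are written as seconds after 00:00:00, and the receiver is assumed to play each packet as soon as it arrives (or as soon as the previous packet has finished playing).

```python
def playback_gaps(arrivals, packet_seconds):
    """Seconds of silence before each packet when packets are played on arrival."""
    gaps = []
    free_at = arrivals[0]            # moment the player finishes the previous packet
    for arrival in arrivals:
        start = max(arrival, free_at)
        gaps.append(start - free_at)
        free_at = start + packet_seconds
    return gaps


arrivals = [1, 15, 27]               # arrival times from the example above
print(playback_gaps(arrivals, 10))   # [0, 4, 2]
```

The 4-second and 2-second gaps are the jitter the viewer experiences.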
Assume, for example, that a real-time video server
generates and distributes live video images over the
Video has been digitized and packetized.
There are only three packets, and each packet contains
10s of video data.
The first packet begins at 00:00:00, the second packet
at 00:00:10, and the third packet at 00:00:20.
Assume that each packet takes 1 second to reach its
destination (equal delay).
The first packet can be played back at 00:00:01, the second
packet at 00:00:11, and the third packet at 00:00:21.
Despite the fact that there is a 1s time difference between
what the server sends and what the client sees on the computer
screen, the action is taking place in real-time.
The packets' time relationship is maintained. The 1s lag is
To prevent Jitter, we can time-stamp the packets and
separate the arrival time from the playback time.
The use of a timestamp is one solution to Jitter. If each
packet contains a timestamp indicating the time it was
created in relation to the first (or previous) packet, the
receiver can add this time to the time it begins
In other words, the receiver knows when to play each
Consider the previous example, where the first packet has a
timestamp of 0, the second has a timestamp of 10, and the
third has a timestamp of 20.
If the receiver begins playing the first packet at 00:00:08,
the second is played at 00:00:18 and the third at 00:00:28.
There are no gaps between packets. The situation is
depicted in Figure
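The playback times in this example follow directly from the timestamps. A minimal sketch, with times in seconds after 00:00:00 and the arrival times taken from the earlier jitter example:

```python
def playback_schedule(first_playback, timestamps):
    """Playback time of each packet: start of playback plus the packet's timestamp."""
    return [first_playback + ts for ts in timestamps]


timestamps = [0, 10, 20]
arrivals = [1, 15, 27]

schedule = playback_schedule(8, timestamps)
print(schedule)   # [8, 18, 28]

# Every packet has already arrived by the time it is due to be played.
on_time = all(arrival <= play for arrival, play in zip(arrivals, schedule))
print(on_time)    # True
```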
We need a buffer to store the data until it is played back so that
we can separate the arrival time from the playback time.
The buffer is known as a playback buffer.
When a session starts (the first bit of the first packet arrives), the
receiver defers playing the data until a certain threshold is reached.
The first bit of the first packet arrives at 00:00:01 in the preceding
example; the threshold is 7 s, and the playback time is 00:00:08.
The threshold is measured in data time units.
The replay does not begin until the data time units reach the
The data is stored in the buffer at a variable rate, but
it is extracted and played back at a constant rate.
The amount of data in the buffer shrinks or expands,
but there is no jitter as long as the delay is less than the
time it takes to playback the threshold amount of data.
For our example, Figure depicts the buffer at various
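The buffer behavior can be simulated roughly as follows. The sketch assumes, for simplicity, that each 10-second packet enters the buffer all at once at its arrival time, which is coarser than a bit-by-bit arrival. Times are in seconds after 00:00:00.

```python
def buffer_level(t, arrivals, packet_seconds, playback_start):
    """Seconds of data waiting in the playback buffer at time t.

    A negative result means the player would have run out of data.
    """
    arrived = packet_seconds * sum(1 for a in arrivals if a <= t)
    total = packet_seconds * len(arrivals)
    played = min(max(t - playback_start, 0), total)
    return arrived - played


arrivals = [1, 15, 27]

# Playback starts at 00:00:08, as in the example: the buffer never runs dry.
levels = [buffer_level(t, arrivals, 10, 8) for t in range(0, 39)]
print(min(levels))                          # 0

# Starting too early (at 00:00:02) would leave the buffer empty before packet 2 arrives.
print(buffer_level(14, arrivals, 10, 2))    # -2
```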
One more feature is required in addition to time relationship
information and timestamps for real-time traffic.
Each packet requires a sequence number.
If a packet is lost, the timestamp alone will not alert the
Let's pretend the timestamps are 0, 10, and 20.
The receiver receives only two packets with timestamps 0
and 20 if the second packet is lost.
The receiver assumes the packet with the timestamp
20 is the second packet, which was sent 20 seconds
after the first.
The receiver has no way of knowing whether or not
the second packet was lost.
To deal with this situation, you'll need a sequence
number to order the packets.
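A small sketch of how sequence numbers expose the loss that timestamps alone cannot. It assumes packets arrive in order and ignores wrap-around of the sequence number; the sequence numbers 1, 2, 3 are chosen for illustration.

```python
def missing_sequence_numbers(received):
    """Sequence numbers skipped between consecutive received packets."""
    missing = []
    for previous, current in zip(received, received[1:]):
        missing.extend(range(previous + 1, current))
    return missing


# (sequence number, timestamp) of the packets that actually arrived.
packets = [(1, 0), (3, 20)]

lost = missing_sequence_numbers([seq for seq, _ in packets])
print(lost)   # [2]
```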
Audio and video conferencing rely heavily on
The data is distributed using multicasting methods
because the traffic can be heavy.
Two-way communication between receivers and senders
is required for conferencing.
A translator is a computer that can change the format
of a high-bandwidth video signal to a lower-quality
This is required, for example, when a source generates
a high-quality video signal at 5 Mbps and sends it to a
recipient with a bandwidth of less than 1 Mbps.
A translator is required to decode the signal and
encode it again at a lower quality that requires less
bandwidth in order to receive it.
When multiple sources can send data at the same
time (as in a video or audio conference), the traffic
is divided into multiple streams.
Data from various sources can be mixed to
converge traffic to a single stream.
A mixer mathematically combines signals from
various sources to produce a single signal.
Support from Transport Layer Protocol
Some of the procedures in real-time applications
are preferable to implement in the transport layer
Let's take a look at which of the existing transport
layers is appropriate for this type of traffic.
Mainly TCP and UDP are two transport layer protocols. TCP is not
appropriate for interactive traffic.
It does not support time-stamping and multicasting.
The error control mechanism supported by TCP is not suitable for
interactive traffic as retransmission of the lost or corrupted packet is not
The concept of time-stamping and playback is thrown off by retransmission.
Today's audio and video signals have so much redundancy (even with
compression) that we can simply ignore a lost packet.
The listener or viewer at the remote location may miss it.
For interactive multimedia traffic, UDP is better.
Multicasting is supported by UDP, but there is no retransmission
UDP, on the other hand, does not support time-stamping, sequencing,
These features are provided by the Real-time Transport Protocol
(RTP), a new transport protocol.
For interactive traffic, UDP is preferable to TCP.
However, we require the services of RTP, a different transport layer
protocol, to compensate for UDP's shortcomings.
RTP (Real-time Transport Protocol)
The Real-time Transport Protocol (RTP) is a protocol
designed to handle real-time Internet traffic.
RTP lacks a delivery mechanism (multicasting, port numbers,
and so on).
It must be used in conjunction with UDP. RTP acts as a bridge
between UDP and the application program.
RTP's primary contributions are time-stamping, sequencing,
and mixing capabilities.
RTP's position in the protocol suite is sketched in Figure
The format is simple and broad enough to cover a
wide range of real-time applications.
If an application requires additional data, it adds
it to the beginning of its payload.
The RTP packet header is shown in Figure
Ver (2-bits): It defines the version number. The current
version is 2.
P (1-bit): If this field is set to 1, it indicates the presence
of padding at the end of the packet. The value of the last
byte in the padding defines the length of the padding.
There is no padding if the value of the P field is 0.
X (1-bit): If this field is set to 1, it indicates an extra
extension header between the basic header and the data. If
this field is set to 0, there is no extra extension header.
Contributor Count (4-bits): It gives the number of
contributors. We can have a maximum of 15
contributors (the count ranges from 0 to 15).
M (1-bit): It is used by the application as a marker. It
indicates, for example, the end of its data.
Payload Type (7-bits): It gives the type of payload.
Several Payload Types are defined but Table 2.1
describes some of the payload types and the
Sequence Number (16-bits)
This field is used to give the number to the RTP packets. The first
packet's sequence number is chosen at random, and it is increased
by one for each subsequent packet. The receiver uses the sequence
number to detect lost or out-of-order packets.
Timestamp (32-bits)
This field indicates the time relationship between the packets. The
first packet's timestamp is a random number. The value for each
subsequent packet is the sum of the preceding timestamp plus the
time the first byte is produced.
Synchronization Source Identifier (32-bits)
In the case of only one source, this field defines the source. If there
are multiple sources, the mixer serves as the synchronization source,
while the other sources serve as contributors. The source identifier's
value is a random number chosen by the source.
Contributor Identifier (32-bits)
Each of these 32-bit identifiers (up to 15 in total) defines a source.
When there are multiple sources in a session, the mixer serves as the
synchronization source, while the remaining sources serve as contributors.
Despite the fact that RTP is a transport layer protocol, the RTP
packet is not directly encapsulated in an IP datagram. Instead, RTP
is encapsulated in a UDP user datagram and treated as an
RTP does not have a well-known port assigned to it.
The port can be chosen on demand, with one restriction: the port
number must be an even number.
RTP's companion, Real-time Transport Control Protocol (RTCP), uses
the next number (an odd number).
RTP uses a temporary even-numbered UDP port.
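The header fields described above can be packed into bytes as sketched below. The field order and widths follow the description in this section; the function names, the dictionary keys, and the sample values (sequence number, timestamp, source identifier, port 5004) are assumptions made for the example.

```python
import struct


def pack_rtp_header(payload_type, sequence_number, timestamp, ssrc,
                    marker=0, padding=0, extension=0, csrcs=()):
    """Build an RTP header: 12 fixed bytes plus 4 bytes per contributor."""
    version = 2
    first = (version << 6) | (padding << 5) | (extension << 4) | len(csrcs)
    second = (marker << 7) | payload_type
    header = struct.pack("!BBHII", first, second, sequence_number, timestamp, ssrc)
    for csrc in csrcs:
        header += struct.pack("!I", csrc)
    return header


def parse_rtp_header(data):
    """Split the header bytes back into named fields."""
    first, second, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    count = first & 0x0F
    csrcs = [struct.unpack("!I", data[12 + 4 * i:16 + 4 * i])[0] for i in range(count)]
    return {
        "version": first >> 6,
        "padding": (first >> 5) & 1,
        "extension": (first >> 4) & 1,
        "contributor_count": count,
        "marker": second >> 7,
        "payload_type": second & 0x7F,
        "sequence_number": seq,
        "timestamp": ts,
        "ssrc": ssrc,
        "csrcs": csrcs,
    }


def rtcp_port(rtp_port):
    """RTCP uses the odd port that follows the even RTP port."""
    if rtp_port % 2 != 0:
        raise ValueError("RTP port must be even")
    return rtp_port + 1


header = pack_rtp_header(payload_type=0, sequence_number=4660,
                         timestamp=160, ssrc=0xDEADBEEF, marker=1)
print(len(header))                 # 12
print(parse_rtp_header(header)["sequence_number"])   # 4660
print(rtcp_port(5004))             # 5005
```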
RTCP (Real-time Transport Control Protocol)
Real-time Transport Control Protocol (RTCP) is a
protocol implemented to facilitate messages which
regulate the flow and quality of data while also
allowing the recipient to provide feedback to the
source or sources.
Figure depicts the five types of messages supported by
RTCP. The number next to each box denotes the
The active senders in a conference send the sender report
on a regular basis to report transmission and reception
statistics for all RTP packets sent during the interval.
The sender report includes an absolute timestamp, which is
the number of seconds since midnight on January 1, 1900 (the NTP epoch).
The absolute timestamp enables the receiver to synchronize
different RTP streams.
It is especially critical when both audio and video are
The receiver report is intended for passive
participants who do not send RTP packets.
The report informs the sender and other recipients
about the service's quality.
Source Description Message
A source description message is sent by the source
on a regular basis to provide additional information
The name, e-mail address, phone number, and
address of the source's owner or controller can be
included in this information.
To close a stream, a source sends a bye message. It
enables the source to announce its departure from
the conference. Other sources can detect a lack of
a source, but this message is a direct announcement.
A packet for an application that wants to use new
applications is called an application-specific
message. It enables the creation of new message
RTCP, like RTP, uses a temporary port rather than a well-known one:
the odd-numbered UDP port that follows the port
number selected for RTP.
Voice Over IP
Voice over IP or Internet telephony is a real-time interactive
The concept here is to use the Internet as a telephone
network with some added features.
This application allows two parties to communicate over a
SIP and H.323 are two protocols designed specifically for
this type of communication.
They are discussed briefly here.
SIP (Session Initiation Protocol)
Session Initiation Protocol (SIP) is an application layer
protocol designed by the IETF.
It establishes, manages, and terminates a multimedia
It allows you to create two-party, multi-party, or
SIP is designed to be independent of the underlying
transport layer; it can run on UDP, TCP, or SCTP.
A header and a body are included in each SIP message. The header is made up of several
lines that describe the message's structure, caller capability, media type, and other details.
SIP messages are described as follows.
INVITE: The caller initializes a session with the INVITE message.
ACK: After the callee answers the call, the caller sends an ACK message for confirmation.
BYE: The BYE message terminates a session.
OPTIONS: The OPTIONS message queries a machine about its capabilities.
CANCEL: The CANCEL message cancels an already started initialization process.
REGISTER: The REGISTER message makes a connection when the callee is not available.
SIP is a very adaptable protocol. To identify the
sender and receiver in SIP, an e-mail address, an IP
address, a phone number, and other types of
addresses can be used.
However, the address must be in SIP format. Some
common formats are shown in Figure
A basic SIP session comprises three modules: establishing,
communicating, and terminating.
Figure depicts a simple session using SIP.
Establishing a Session
In order to establish a session in SIP, a three-way
handshake is required. To initiate communication, the
caller sends an INVITE message via UDP, TCP, or SCTP. If
the callee agrees to begin the session, she sends a
reply message. The caller sends an ACK message to
confirm that a reply code has been received.
After the session is established, the caller and callee
can communicate via two temporary ports.
Terminating the Session
The session can be ended by either party sending a BYE message.
Tracking the Callee
SIP has a mechanism (similar to DNS) for determining the IP
address of the terminal where the callee is seated.
SIP employs the concept of registration to carry out this
Some servers are designated as registrars by SIP.
At any given time, a user is registered with at least one
registrar server, which is aware of the callee's IP address.
When a caller needs to communicate with the callee, the caller can use the
e-mail address in the INVITE message instead of the IP address.
The message is routed through a proxy server.
The proxy server sends a lookup message to the registrar server that has
the callee's information.
When the proxy server receives a reply message from the registrar server,
it inserts the newly discovered IP address of the callee into the caller's INVITE message.
This message is then delivered to the callee.
The procedure is depicted in Figure
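The lookup performed by the proxy server can be sketched as a simple table lookup. This is a toy model: the registrar is a dictionary, the INVITE is a dictionary rather than a real SIP message, and the addresses are made-up example values.

```python
# Hypothetical registrar contents: SIP address of the user -> current IP address.
REGISTRAR = {"bob@example.com": "192.0.2.10"}


def proxy_forward(invite, registrar=REGISTRAR):
    """Look up the callee in the registrar and add the IP address to the INVITE."""
    ip_address = registrar.get(invite["to"])
    if ip_address is None:
        return None                      # callee is not registered anywhere we know
    forwarded = dict(invite)             # leave the caller's original message untouched
    forwarded["destination_ip"] = ip_address
    return forwarded


invite = {"method": "INVITE", "from": "alice@example.com", "to": "bob@example.com"}
print(proxy_forward(invite))
print(proxy_forward({"method": "INVITE", "from": "alice@example.com",
                     "to": "carol@example.com"}))   # None
```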
H.323 is a standard developed by the ITU that allows
telephones on the public telephone network to
communicate with computers connected to the
Internet (referred to as terminals in H.323).
The general architecture of H.323 is depicted in
A gateway is a device that connects the Internet to the telephone network.
In general, a gateway is a five-layer device that can convert a
message from one protocol stack to another.
The gateway in this case does the same thing.
It converts a message from a telephone network to an Internet message.
The gatekeeper server on the local area network plays the role of
the registrar server, as discussed for SIP.
To establish and maintain voice (or video) communication,
H.323 employs several protocols.
These protocols are depicted in Figure
H.323 uses G.711 or G.723.1 for compression.
It employs the H.245 protocol, which allows the parties
to negotiate the compression method.
Q.931 protocol is used to establish and terminate
For registration with the gatekeeper, another protocol
called H.225, or RAS (Registration, Administration,
Status), is used.
Let us use a simple example to demonstrate the operation of
telephone communication using H.323.
Figure 2.27 depicts the steps that a terminal takes to communicate
with a telephone.
1. The gatekeeper receives a broadcast message from the terminal.
The gatekeeper responds by providing its IP address.
2. The terminal and gatekeeper communicate via H.225, which is used
to negotiate bandwidth.
3. Q.931 is used to establish a connection between the terminal,
gatekeeper, gateway, and telephone.
4. To negotiate the compression method, the terminal,
gatekeeper, gateway, and telephone use H.245.
5. RTP is used by the terminal, gateway, and telephone to
exchange audio under the management of RTCP.
6. To terminate the communication, the terminal,
gatekeeper, gateway, and telephone use Q.931.