Migrating Visual Communications from H.323 to SIP
by Stefan Karapetkov
April 2, 2008
The H.323 protocol was developed by the International Telecommunication Union (ITU) - an
international standardization body based in Geneva, Switzerland - with video conferencing in
mind, and most traditional video conferencing systems are based on H.323. However, the
convergence of voice, video, and data into what is often referred to as Unified
Communications (UC) has a dramatic impact on how people use video, and presents a new set of
requirements for solutions in the emerging visual communications market.
In order to meet these new requirements, Polycom is working on a seamless migration from
H.323 to the Session Initiation Protocol (SIP). This process will take a long time, and H.323 and
SIP will coexist in customer networks for years to come.
The content of this paper is based on Polycom’s presentation ‘A New Paradigm for SIP-based
Video Communications’ at the International SIP Conference in Paris (January 29 – February 1,
2008). The paper provides an overview of H.323 and SIP, and compares the two protocols. The
paper also refers to specific technologies that Polycom is deploying to guarantee a
smooth migration of the installed customer base from H.323 to SIP.
Visual Communications Market
Polycom envisions a dramatically different marketplace for video in the years ahead. Social,
economic, and technological trends are aligning to create a unique opportunity for new and
innovative forms of visual communication. This combination of factors will bring video into the
mainstream and make visual communication essential in both our personal and professional lives.
Polycom calls this transformation VC2.
Visual communications today include applications such as telepresence, which provides an
immersive experience for users; group video conferencing, which is now available with High-
Definition audio and video and provides a new level of user experience; and personal video,
which brings visual communication to the individual user’s desktop or project space. Figure 1 is
an overview of these applications.
POLYCOM, Inc. 1
Figure 1: Enterprise Video Communications Today
While dedicated personal video systems integrate monitor, camera, microphones, speakers, and
codec into one and are optimized for video communication, soft clients rely on the PC video and
voice processing capabilities.
Today, visual communications solutions are widely deployed in education, medical, and
government organizations. Deployments in general enterprises were recently revitalized as a
result of travel restrictions and green policies.
Two major market trends are driving the visual communications market. The first trend is the
shift from reserved to on-demand conferencing. Both audio and video conferencing started as
scheduled events with reserved resources, e.g. ports on the Multipoint Conferencing Unit (MCU)
and bandwidth, e.g. B channels in the ISDN network. Audio conferencing made the transition to
reservation-less, operator-less systems and is now 96% on-demand. Figure 2 summarizes the
trend to on-demand conferencing.
Figure 2: Trend from Reserved to On-demand Conferencing
Video has stayed scheduled for longer, and even today, about 80% of video conferencing is
scheduled. However, there is a clear trend to on-demand video, and strong indicators that future
conferencing will be even richer and more flexible - with presence integration and a growing
number of ways to access the services, e.g. from desktop computers and mobile phones.
Looking at this trend on a higher level, reserved operator-attended services are becoming
presence-enabled customer-initiated services. Note that audio conferencing is running ahead of
video conferencing – with higher desktop/mobile penetration and a higher percentage of on-demand conferencing.
This trend has a huge impact on the choice of communication protocols in visual communication
systems. The trend requires more scalability because desktop video drives up the number of users.
It also requires that new features such as presence and instant messaging are seamlessly
integrated with audio, video and content.
The second major market trend is from overlay video networks to unified collaboration. Video
systems have been deployed as overlay networks (over the organization’s IP network) for years,
and video has been a stand-alone application, separate from the mainstream IT applications.
Video also required separate management tools and directories, and has in general been poorly connected
to the rest of the IT infrastructure. With the emergence of the Unified Communications concept,
enterprises, service providers and other organizations started morphing their voice, video, and
data communication systems into one. Figure 3 describes the trend towards Unified Collaboration.
Figure 3: Trend towards Unified Collaboration
This trend creates an interesting technical challenge. Telephony call control servers have started
the migration from proprietary protocols to standard SIP, and there are already a large number of
standards-based implementations, some of them open source. Even the remaining proprietary IP-
PBX systems on the market provide some level of SIP interoperability and allow third-party
equipment to connect to the IP-PBX, or even control it.
Many Presence and Instant Messaging systems support SIP via the SIP for Instant Messaging and
Presence Leveraging Extensions (SIMPLE) protocol. Other implementations are based on the
eXtensible Messaging and Presence Protocol (XMPP).
Enterprise video today is mostly H.323-based, although video endpoints, video soft clients and
even MCUs support basic SIP connectivity. For example, all Polycom endpoints can run in SIP
mode, while conference servers such as Polycom RMX 2000 and MGC support SIP, H.323,
The technical challenge that UC poses is how to connect all of the elements in Figure 3 into one
system that provides the full range of services to users. Based on the current state of the
networking technology, SIP is the most functional common denominator that could interconnect
the different applications within the organization.
In order to compare SIP and H.323, we will need a brief description of the H.323 protocol. H.323
is an umbrella signaling protocol, i.e. it refers to a set of other protocols such as H.225 and H.245
which are known as ‘the H.323 family of protocols’. H.323 was originally defined for multimedia
communications and perfectly fits the video conferencing application because it had from the
very beginning mechanisms for audio and video call setup. It also has the so-called capability
exchange procedure (often referred to as CAPS) that is very important for finding communication
parameters acceptable for both communication sides, as well as a master-slave determination
mechanism that is very useful when MCUs are involved in the communication.
H.323 is optimized for machine communication. It uses ASN.1 notation, and the H.323
messages are binary-encoded using the Packed Encoding Rules (PER). This means that very few people
can read captured H.323 messages without a protocol decoder.
H.323 Elements and Call Flow
H.323 defines H.323 Terminals, which can initiate or receive calls, and H.323 Gatekeepers, which
register H.323 terminals and provide call admission control and call routing. Gatekeepers can be
very simple or very complex – depending on how many of the optional functions in H.323 they
implement. H.323 also defines Gateways to other networks, e.g. H.320/ISDN. While gateways
are optional in H.323, they play a central role when migration to H.323 (e.g. from H.320/ISDN to
H.323) or from H.323 (e.g. to SIP) is required. Since the topic of this paper is migration from
H.323 to SIP, we will discuss the H.323-SIP gateway in more detail later in this paper. Figure 4
looks at the interaction of the two critical and mandatory elements in the H.323 network:
Terminals and Gatekeeper.
Figure 4: H.323 Basic Call Flow
H.323 describes the call setup procedure, and refers to the H.225 and H.245 protocols for
signaling message formats and some additional functions. The signaling messages are described
in H.225. The H.225 SETUP message includes information about the source, i.e. who is sending
the message (in Figure 4, this is Terminal A) and about the destination (Terminal B). The
Gatekeeper then uses this information to locate the destination (Terminal B).
After receiving the SETUP message, Terminal B stores the information about the request (IP
addresses, port numbers, etc.), and sends back the CONNECT message. The most important
information in the CONNECT message is about the setup of an H.245 control channel, which is
used for three main functions: capability exchange (CAPS), master-slave determination (MS),
and opening logical channels (OLC), i.e. creating media streams for audio, video and content.
H.245 Terminal Capability Exchange is a procedure for exchanging preferred codecs and settings
between the two H.323 terminals. For example, Terminal A may suggest H.264 or H.263 video
and Siren 22 Stereo or Siren 14 Mono audio, and Terminal B may respond that it only
supports H.263 and Siren 14. Once both sides agree on common parameters the ‘conversation’
moves to its next phase - H.245 Master Slave Determination - which is useful for avoiding
conflicts during call control operations. H.245 Master Slave Determination is very important
when an H.323 Terminal connects to an MCU (the MCU is the master), and when one MCU
connects to another MCU through so-called ‘cascading’ – in this case one of the MCUs has to
be the master.
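The codec example above can be sketched in a few lines of Python. This is only an illustration of the idea of finding common parameters, not the H.245 procedure itself: real capability sets are ASN.1 structures exchanged in both directions, and the function name select_common and the dictionary layout are assumptions made for this sketch. The string ‘Siren 14 Mono’ is used on both sides so that the comparison is an exact match.

```python
from typing import Dict, List


def select_common(offered: Dict[str, List[str]],
                  supported: Dict[str, List[str]]) -> Dict[str, str]:
    """Pick, per media type, the first offered codec the far end also supports."""
    chosen = {}
    for media, preferences in offered.items():
        remote = set(supported.get(media, []))
        # Walk the offer in preference order and stop at the first match.
        match = next((codec for codec in preferences if codec in remote), None)
        if match is None:
            raise ValueError(f"no common {media} codec")
        chosen[media] = match
    return chosen


# Terminal A lists its codecs in order of preference (hypothetical layout).
terminal_a = {"video": ["H.264", "H.263"],
              "audio": ["Siren 22 Stereo", "Siren 14 Mono"]}
# Terminal B only supports one codec per media type.
terminal_b = {"video": ["H.263"], "audio": ["Siren 14 Mono"]}

common = select_common(terminal_a, terminal_b)
print(common)
```

With the values above, both sides settle on H.263 video and Siren 14 Mono audio, which matches the outcome described in the text.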
After capabilities have been exchanged and the connection master has been determined, the H.245 Open
Logical Channel Request procedure creates media channels (voice, video, or content/data)
between the communication parties. Note that these channels are always created in pairs, i.e. the
video channel from Terminal A to Terminal B is different and separate from the video channel
from Terminal B to Terminal A. Therefore, communication can be asymmetric: Terminal A can
send high quality video to B, and receive lower quality video from B, and vice versa.
The H.245 control channel is also used to transmit the Flow Control command, which is used by the
receiver to set an upper limit for the transmitter bit rate on any logical channel, and the Fast
Update command, which is used by the receiver to request a full (intra) video frame after video
frames were lost in transmission.
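A receiver-side decision for these two commands might look like the sketch below. It is a minimal illustration under stated assumptions: the 2% loss threshold, the 80% step, the 128 kbps floor, and both function names are hypothetical values chosen for this example and do not come from H.245 or from any Polycom product.

```python
def flow_control_limit(current_kbps: int, loss_percent: float,
                       loss_threshold_percent: float = 2.0,
                       step_percent: int = 80,
                       floor_kbps: int = 128) -> int:
    """Return the bit rate cap (kbps) the receiver would ask the sender to respect."""
    if not 0.0 <= loss_percent <= 100.0:
        raise ValueError("loss_percent must be between 0 and 100")
    if loss_percent <= loss_threshold_percent:
        # Loss is acceptable: keep the current limit.
        return current_kbps
    # Loss is high: lower the cap, but never below the floor.
    return max(floor_kbps, current_kbps * step_percent // 100)


def needs_fast_update(frames_lost_since_last_intra: int) -> bool:
    """Ask for a full (intra) frame as soon as any video frame was lost."""
    return frames_lost_since_last_intra > 0


print(flow_control_limit(1000, 5.0))   # cap lowered because loss exceeds the threshold
print(needs_fast_update(0))            # no loss, so no Fast Update is needed
```

The point of the sketch is the direction of the messages: both decisions are taken by the receiver and sent back to the transmitter over the control channel.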
Audio streams and video streams are transmitted via the Real-time Transport Protocol (RTP, RFC 3550),
and for each RTP stream there is an associated RTP Control Protocol (RTCP, also RFC
3550) channel which is used to periodically transmit control packets to participants in a
multimedia session. The primary function of RTCP is to provide feedback on the quality of
service being provided by RTP.
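One concrete piece of this feedback is the ‘fraction lost’ field of an RTCP receiver report. The sketch below follows the calculation outlined in RFC 3550 (packets lost in the reporting interval, scaled to an 8-bit value); the function name and the clamp to 255 are choices made for this illustration.

```python
def fraction_lost(expected_interval: int, received_interval: int) -> int:
    """Return the 8-bit 'fraction lost' value for one RTCP reporting interval.

    The value is lost/expected expressed as a fixed-point number with the
    binary point at the left edge, i.e. a count of 1/256ths.
    """
    lost_interval = expected_interval - received_interval
    if expected_interval == 0 or lost_interval <= 0:
        # Nothing expected, or duplicates made 'received' exceed 'expected'.
        return 0
    # Clamp so the result always fits in the 8-bit field.
    return min(255, (lost_interval << 8) // expected_interval)


# 100 packets expected, 75 received: 25% loss is reported as 64/256.
print(fraction_lost(100, 75))
```

A sender that sees this value rise in successive reports knows that the path is dropping packets and can react, for example by lowering its bit rate.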
H.323 for Enterprise Video
H.323 has been widely deployed in visual communication equipment. The H.323 Terminal
function is implemented in video endpoints such as Polycom HDX and VSX. The H.323
Gatekeeper function is implemented in products such as Polycom SE 200 and PathNavigator. The
H.323 MCU function is implemented in products such as Polycom RMX 2000 and MGC.
In addition to basic call and DTMF tones, these systems support a range of additional features.
The most important ones are listed in Figure 5.
Figure 5: H.323 Enterprise Video
Multipoint conferencing is very natural in H.323 because every call in H.323 (including point-to-
point calls) is defined as a ‘conference’. It is therefore assumed from the start that parties will be
added to the conference.
H.323 has its own set of security mechanisms. Early implementations used DES and 3DES
encryption, while the latest generation of equipment supports the Advanced Encryption Standard
(AES). H.323 also has a mechanism for traversing firewalls and NATs – it is described in
the H.460.17, H.460.18, and H.460.19 standards.
Vendors embraced the H.323 protocol and added functions that are quite unique to visual
communications. Examples are Dual Video Streams (based on the H.239 protocol), Video
Channel Control (implemented in the H.245 protocol) and Far End Camera Control (FECC, based
on the H.224 and H.281 protocols). We will discuss each of these features later in this paper.
The Session Initiation Protocol (SIP, RFC 3261) was developed by the Internet Engineering Task
Force (IETF), an organization that sets the technical standards for the Internet. In many ways SIP
is similar to H.323 as it can also be used to set up audio and video calls, and it also refers to a long
list of other standards (called ‘Request for Comments’ or RFCs in the IETF lingo) that constitute
‘the SIP family of protocols’. For example, SIP refers to the Session Description Protocol (SDP,
RFC 2327) as the format for describing media parameters.
IETF envisioned SIP to be a generic protocol that can set up any kind of session, not just audio and
video, i.e. SIP can be used for instant messaging, data transfer, etc. In addition, SIP was designed
to be similar to the Hyper Text Transfer Protocol (HTTP) which is used for web browsing in the
Internet. The idea was that HTTP developers should be able to easily learn the SIP protocol and
develop Voice over IP and Video over IP applications, the same way they develop web
applications. While this did not exactly happen, SIP became easier to read and understand than
H.323, mainly because it uses readable clear-text messages (in comparison, H.323 uses ASN.1
Since IETF develops standards for the Internet, it is very concerned about the scalability of
networking protocols. Therefore, SIP was designed to be lightweight and scale well. While a wave
of extensions, mainly for VoIP applications, increased the complexity of the protocol, the core
SIP specification (RFC 3261) and a few closely related specs - such as SDP (RFC 2327) and RTP
(RFC 3550) - are sufficient for a functional SIP implementation.
SIP Elements and Call Flow
The equivalent of H.323 Terminal in SIP is the SIP User Agent (UA). The name ‘user agent’
leans towards mobile communication and user mobility, i.e. the ability of the user to log on at a
communication device which then becomes the user’s agent. Unlike H.323, SIP splits the
server functions (concentrated in the H.323 Gatekeeper) into several entities: SIP Redirect Server,
SIP Proxy Server, and SIP Registrar. This is also in line with the Internet philosophy that the
server that registers and authenticates you (the Registrar) need not be the server that gets
your requests (the Proxy), and need not be the server that knows the current location of the
destination (the Redirect Server). Figure 6 shows the basic SIP message exchange necessary to
set up an audio/video call.
Figure 6: SIP Basic Call Flow
The UAs learn the SIP servers’ addresses (a domain name like www.sipregistrar1.com or an IP
address like 192.168.1.2) by configuration/provisioning or dynamically, i.e., by sending a DNS
SRV request asking the Internet ‘What SIP servers are there?’ and receiving a list of servers.
Subsequently, UAs register with their home Registrars (registration procedure not shown here),
and get authenticated, i.e., the Registrar queries a user database to verify the user name, the user
password, and an additional authentication parameter called the ‘SIP Realm’.
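The role of the realm becomes clearer when looking at how a UA proves knowledge of the password. The sketch below assumes the HTTP Digest scheme that RFC 3261 adopts for SIP authentication, in its simplest form without the optional ‘qop’ parameter. The user name, password, realm, and nonce values are made up for this example.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the lowercase hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username: str, realm: str, password: str,
                    method: str, uri: str, nonce: str) -> str:
    """Compute a Digest 'response' value (simplest form, no qop)."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")  # who the user is, within this realm
    ha2 = md5_hex(f"{method}:{uri}")                 # what is being requested
    return md5_hex(f"{ha1}:{nonce}:{ha2}")           # bound to the server's challenge


# Hypothetical values for illustration only.
response = digest_response("userA", "home.com", "secret",
                           "REGISTER", "sip:home.com", "abc123")
print(response)
```

Because the realm is hashed together with the user name and password, the same credentials produce a different response in a different realm, and the password itself never travels over the network.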
While H.323 uses E.164 phone numbers (e.g. +14085551212) or aliases to identify the
destination, SIP uses a Uniform Resource Identifier (URI) in the format user@<domain name>. In
our example, UA A is in the domain home.com and wants to reach ‘userB’ which is currently in a
different domain visited.com. UA A starts the session (call) by sending an INVITE message (the
equivalent of an H.323 SETUP message) for userB@home.com to the local Redirect Server asking
for the current location of ‘userB’. The Redirect Server responds with response code 302 (SIP response
codes are similar and often equivalent to the HTTP status codes) which means that the user has
moved temporarily. The response includes the new domain of the user: visited.com.
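Because SIP messages are plain text, the exchange above is easy to sketch. The example below builds a minimal INVITE and reads the new location from a 302 response. It is an illustration under simplifying assumptions: the branch, tag, and Call-ID values are placeholders, the INVITE carries no SDP body, and the helper names build_invite and parse_redirect are invented for this sketch.

```python
from typing import Optional


def build_invite(from_uri: str, to_uri: str, call_id: str, via_host: str) -> str:
    """Build a minimal SIP INVITE request (no SDP body) as clear text."""
    lines = [
        f"INVITE sip:{to_uri} SIP/2.0",
        f"Via: SIP/2.0/UDP {via_host};branch=z9hG4bK-example",
        "Max-Forwards: 70",
        f"From: <sip:{from_uri}>;tag=1234",
        f"To: <sip:{to_uri}>",
        f"Call-ID: {call_id}",
        "CSeq: 1 INVITE",
        f"Contact: <sip:{from_uri}>",
        "Content-Length: 0",
    ]
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_redirect(response: str) -> Optional[str]:
    """Return the Contact URI of a 302 response, or None for any other response."""
    head = response.split("\r\n\r\n", 1)[0]
    status_line, *header_lines = head.split("\r\n")
    parts = status_line.split(" ", 2)
    if len(parts) < 2 or parts[0] != "SIP/2.0" or parts[1] != "302":
        return None
    for line in header_lines:
        name, _, value = line.partition(":")
        if name.strip().lower() == "contact":
            return value.strip().strip("<>")
    return None


invite = build_invite("userA@home.com", "userB@home.com",
                      "call-1@home.com", "192.168.1.10")
redirect = ("SIP/2.0 302 Moved Temporarily\r\n"
            "Contact: <sip:userB@visited.com>\r\n"
            "Content-Length: 0\r\n\r\n")
print(invite)
print(parse_redirect(redirect))
```

The second print shows the address in the visited.com domain that UA A would use for its next INVITE.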
UA A then sends a new INVITE to the local Proxy Server (for simplicity Proxy and Registrar are
residing in the same server in Figure 6), and the Proxy server routes the INVITE through the
network to the destination. A handshake procedure including the SIP messages 200 OK and ACK
makes sure both communicating partners and the proxy server know that the session is established.
Similar to H.323, the signaling procedure ends with the setup of media streams, e.g. for audio and
video. As in H.323, audio streams and video streams are transmitted via the Real-time Transport Protocol
(RTP, RFC 3550), and for each RTP stream there is an associated RTP Control Protocol
(RTCP, also RFC 3550) channel. The importance of the RTP use in both H.323 and SIP will be
highlighted later in the discussion around SIP-H.323 gateways.
SIP for Enterprise Video
As mentioned above, the H.323 community invested much effort adding new functionality to
H.323 for the purposes of visual communication. SIP on the other hand was embraced by the
Voice over IP community and extended in many ways to support voice communications - both
to replicate some traditional telephony functions and to create new ones. For example, IETF created a
set of mechanisms (STUN, TURN, and ICE) that allow RTP streams to traverse firewalls
and Network Address Translation (NAT) boxes – very common elements in IP networks, and a
huge problem for both Voice over IP and Video over IP. As Figure 7 below shows, SIP is
available today in visual communication equipment (endpoints, MCUs) but the list of features
available in SIP – from visual communications perspective – is still shorter than in H.323.
Figure 7: SIP Enterprise Video
The major difference between SIP and H.323 is in the area of security and Firewall/NAT
traversal. While H.323 systems deploy AES for media encryption, i.e. all RTP packets carrying
audio and video are encrypted by the sender using AES, SIP refers to the Secure Real-time Transport
Protocol (SRTP, RFC 3711) for encrypting media. While signaling messages in H.323 are transmitted
unencrypted, SIP – maybe because it is a clear text protocol that can be read easily – relies on
Transport Layer Security (TLS, RFC 4346) to encrypt SIP signaling messages.
The other major delta – also related to security - is in the area of Firewall and NAT traversal.
H.323 relies on H.460.17, H.460.18, and H.460.19 standards for Firewall and NAT traversal.
IETF originally developed STUN (Simple Traversal of UDP through NATs), then added the
TURN (Traversal Using Relay NAT) mechanism to increase the firewall traversal success rate,
and finally created the ICE (Interactive Connectivity Establishment) specification that combines
STUN and TURN functions into one. Firewall traversal has long been considered the forte of
IETF and the hope is that through the newly developed traversal mechanisms, SIP-based
communication will be able to flow across enterprise (including healthcare, government, and
education) and service provider networks.
What is SIP Used for Today?
Although video network elements today support SIP, they are rarely deployed in a complete SIP
video solution. The reason is that SIP still cannot match the H.323 functionality and an all-H.323
solution can provide great interoperability and more functionality than an all-SIP solution.
SIP gained ground from proprietary protocols from Avaya, Nortel, Siemens, etc. – mostly
because it allows better interoperability across vendors, i.e. the ability to mix and match
components. But in the H.323 video communications market, interoperability is great, and H.323
interoperability events (bakeoffs, cookouts, for some reason culinary terminology was widely
adopted) are as efficient as SIP interoperability events such as SIPit.
SIP for Integration with IM and Presence
SIP is however irreplaceable in integrations with IM/Presence systems such as IBM Sametime
and Microsoft LCS and OCS. The idea is that since SIP is used for exchanging Presence
information and for setting up IM sessions (based on the SIMPLE specifications) it makes sense
to integrate video systems via SIP. The reality is however that SIMPLE is not the leading approach
to Presence and IM. Microsoft added proprietary extensions to SIP for MS Office Communicator
and LCS/OCS. Even within IETF, the competing XMPP protocol is gaining momentum, and
seems to have eclipsed SIMPLE for Internet applications. Nevertheless, SIP is today the only
common denominator that allows integration of video into IM and Presence systems. Figure 8 is
an example of such integration.
Figure 8: Integration with IM/Presence
In the diagram, two IM/Presence clients communicate with an IM/Presence server which is
connected through a gateway function - translation software that runs on a standard server. The
SIP protocol is used for the communication among video components: video soft clients
(associated with the IM/Presence clients), video endpoints (such as the room system displayed in
Figure 8) and conferencing servers (MCUs). A SIP Registrar/Proxy (marked ‘SIP Server’ here)
handles registration, call setup, and call tear-down.
A video client can be connected to another video client or to a video endpoint such as a room
system. All video clients and endpoints can be part of a multipoint call through the conferencing
server. Note that once video soft clients and video endpoints connect in a multi-party conference
call, additional participants from H.323, H.320 (ISDN), and PSTN (voice only) can also join the call.
SIP for Integration with IP-PBXs
Early versions of IP-PBXs supported basic H.323 and allowed registering H.323 clients.
However, as SIP became more important to IP-PBX interoperability, IP-PBXs started supporting
SIP registrations, SIP trunking, etc. H.323 support was dropped or was not updated to the latest
H.323 versions. Since most IP-PBXs in the market support SIP (and do not support H.323), SIP is
irreplaceable in integrations with systems such as Avaya Communication Manager, Nortel MCS 5100, and
Cisco CallManager. Note that since most IP-PBXs are based on proprietary architectures, the SIP
interfaces provide only limited functions, i.e. registration, basic call, and DTMF. Hold is usually
also supported because Hold is a part of the base SIP standard (RFC 3261). With the development
of a new generation of IP communication systems based on SIP soft switches (such as Nortel
MCS 5100), the SIP functionality became richer and included features such as Transfer, Forward,
and Conference. Video endpoints can now support such functions, and mirror the functionality of
desktop phones. These features mainly apply to personal video users and are less attractive to
users of group conferencing systems.
If the IP-PBX does not support SIP, integration is still possible through a CTI server with SIP
plug-ins. While one can argue that using SIP or H.323 for such integrations is equally efficient,
almost all integrations are done via SIP since it is unlikely that H.323 will be supported
natively in IP-PBXs. There is hope that over time the proprietary solutions will migrate to SIP, so
the protocol selection is often based on which protocol looks more future-proof. Figure 9 shows
an example of an integration of video equipment with a SIP-based communication system.
Figure 9: Integration with SIP Communication Server
The SIP Communication Server in Figure 9 acts as SIP Proxy and Registrar for all user agents:
SIP soft clients, SIP phones, video endpoints in SIP mode (HDX 4000 and 9000 in Figure 9), and
the conferencing server that supports multiple protocols simultaneously.
Similar to the integration with IM/Presence systems, the conferencing server (RMX 2000 in this
example) allows H.323, H.320/ISDN, and PSTN (voice-only) participants to join a multiparty
conference. Further benefits of using the conferencing server in such configurations are discussed
in the SIP-H.323 gateway section below.
SIP for Integration with IMS
Integration of video systems (endpoints, application servers, conferencing servers/MCUs) with IP
Multimedia Subsystem (IMS) networks is also based on SIP. IMS uses SIP for communication
among network elements but has defined extensions (most visibly in the form of Privacy P-
headers), so that seamless integration with IMS networks requires a bit more than plain SIP. More
information about Polycom’s involvement in IMS is in the white paper ‘Polycom and IMS’
Implementing Visual Communications Features in SIP
In this section, we will look at the implementation approaches for three major video features –
Dual Stream, FECC, and Video Channel Control – in SIP. As discussed in the H.323 section of
this paper, the H.323 community developed these mechanisms, which became very popular
among video users. A migration from H.323 to SIP therefore requires replication of the
functionality in the new environment.
Dual Video Stream
Dual Video Streams allows a ‘presentation’ (sometimes also called ‘content’) audio-video stream
to be created in parallel to the primary ‘live’ audio-video stream. This second stream is used to
share any type of content: slides, spreadsheets, X-rays, video clips, etc. Polycom’s pre-standard
version of this technology is called People+Content. H.239 is heavily based on intellectual
property from Polycom People+Content and became the ITU-T standard that allows
interoperability between different vendors. Figure 10 summarizes the Dual Video Streams
Figure 10: Dual Video Streams
While the function works well on single-monitor systems, it is especially powerful in multi-
screen setups (video endpoints can support up to 4 monitors). In the example in Figure 10, a
Polycom HDX 4000 personal video system is on a live call with a Polycom HDX 9000 Executive
Collection with two flat screen monitors. The live stream is shown on the right monitor.
The user of the HDX 4000 uses a laptop directly connected to HDX 4000 or running Polycom
content sharing software to activate content sharing to the HDX 9000 Executive Collection. A
‘presentation’ stream is created in parallel to the ‘live’ stream, and the content is displayed on the
left screen of the receiver system.
The benefit of this functionality is that users can share not just slides or spreadsheets but also
moving images: Flash video, movie clips, commercials, etc. The ‘presentation’ channel has
flexible resolution, frame rates, and bit rates. For dynamic images, it can support full High-
Definition video at 30 frames per second, and for static content, such as slides, it can run at,
for example, 3 frames per second and save bandwidth in the IP network. Another major benefit of
using a video channel for content sharing is that the media is encrypted (by AES in H.323 and by
SRTP in SIP). In addition, once the firewall and NAT traversal works for the ‘live’ stream, it
works for the ‘presentation’ channel as well and there is no need for a separate traversal solution.
The first issue with supporting Dual Video Streams in SIP is describing the content/presentation
stream. As discussed above, the Session Description Protocol (SDP, RFC 2327) is used to
describe media stream parameters. SIP endpoints and conferencing servers have to support RFC
4574, which defines the SDP ‘label’ attribute, and RFC 4796, which defines the SDP ‘content’
attribute. Now that we can describe the content stream, we have to be able to associate the content
stream with a live stream – this can be done by supporting RFC 3388 ‘Grouping of Media Lines
in the Session Description Protocol’.
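As an illustration, the sketch below generates a session description with two video streams, one marked as the live stream and one as the presentation stream, using the ‘label’ and ‘content’ attributes. It is a simplified example, not a complete offer: the audio stream and the RFC 3388 grouping lines are omitted, and the port numbers, the H.264 payload type 96, the label values, and the function names are assumptions made for this sketch. The content values ‘main’ and ‘slides’ are among those defined in RFC 4796.

```python
from typing import List


def video_media(port: int, label: str, content: str) -> List[str]:
    """Describe one video stream with its 'label' and 'content' attributes."""
    return [
        f"m=video {port} RTP/AVP 96",
        "a=rtpmap:96 H264/90000",
        f"a=label:{label}",      # RFC 4574: a handle other protocols can point to
        f"a=content:{content}",  # RFC 4796: what the stream is used for
    ]


def dual_stream_sdp(host: str) -> str:
    """Build a minimal SDP body with a live and a presentation video stream."""
    lines = [
        "v=0",
        f"o=- 0 0 IN IP4 {host}",
        "s=-",
        f"c=IN IP4 {host}",
        "t=0 0",
    ]
    lines += video_media(49170, "1", "main")    # the 'live' stream
    lines += video_media(49172, "2", "slides")  # the 'presentation' stream
    return "\r\n".join(lines) + "\r\n"


print(dual_stream_sdp("192.0.2.1"))
```

A receiver that understands the ‘content’ attribute can route the stream marked ‘slides’ to a second monitor, as in the two-screen setup described earlier.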
The remaining issue is how to identify who is sending the content and who is receiving it. This is
usually done by tokens (the party that has the token can send content), and token management
protocols make sure that there is only one token in the session, and that anyone can request and
receive the token. RFC 4582 ‘Binary Floor Control Protocol (BFCP)’ defines a token management
mechanism, and can be used for Dual Video Stream implementation in SIP. And since everything
has to be described in SDP, we also need a way to describe the BFCP streams in SDP. This can
be done by supporting RFC 4583 ‘SDP Format for Binary Floor Control Protocol Streams’.
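The token idea itself can be sketched without any protocol detail. The class below only models the rule that a single token exists and that its holder alone may send content; it does not implement BFCP messages, queuing of requests, or a floor chair, and the class and method names are invented for this illustration.

```python
from typing import Optional


class ContentFloor:
    """A single token: only its current holder may send the presentation stream."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None

    def request(self, party: str) -> bool:
        """Grant the token if it is free (or already held by the requester)."""
        if self.holder is None:
            self.holder = party
            return True
        return self.holder == party

    def release(self, party: str) -> bool:
        """Give the token back; only the current holder can release it."""
        if self.holder == party:
            self.holder = None
            return True
        return False


floor = ContentFloor()
print(floor.request("HDX 4000"))   # token is free, so the request is granted
print(floor.request("HDX 9000"))   # refused while the HDX 4000 holds the token
print(floor.release("HDX 4000"))   # holder releases the token
print(floor.request("HDX 9000"))   # now the other side can present
```

In a multipoint call the conferencing server would typically play the role of this object and decide which participant currently holds the token.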
Since it takes 5 specifications (RFCs) to implement the equivalent of H.239 functionality in SIP,
Polycom created a specification that describes how to glue these RFCs together. This
specification is now Internet Draft ‘Role Management and Multiple Stream Functionality in SIP’
Far End Camera Control
FECC is a popular feature in visual communications – if H.323 Terminals A and B are on a
call, the feature allows Terminal A to control the camera of Terminal B: zoom, pan (move the
camera left and right), and tilt (move the camera up and down). The assumption is that Terminal
B has a PTZ (Pan, Tilt, and Zoom) camera, and has the FECC feature enabled. Figure 11 explains
Figure 11: Far End Camera Control (FECC)
In a group conferencing setting, the key FECC benefit is that users can adjust the image that they
get from the remote site, focus on a particular person or a group of people, and then move to
another part of the room. In a personal video setting, the feature can be used to adjust the camera if
the remote party is sitting too close to or too far from the camera.
In H.323, FECC is implemented via two ITU standards: H.281 defines the binary data that is
transmitted between Terminal A and B to control the camera while H.224 defines the format of
the frames that carry the binary data.
In SIP, RFC 4573 ‘MIME Type Registration for RTP Payload Format for H.224’ (authored by
Polycom) registers the H.224 media type, and defines the syntax and the semantics of the Session
Description Protocol (SDP) parameters needed to support the far-end camera control protocol using
H.224 in SIP. In effect, RFC 4573 creates a tunnel through the SIP-based network, and allows
video endpoints to exchange H.224/H.281 information exactly as they do in H.323-based
Video Channel Control
Video channel control is embedded in H.245 and was discussed in detail earlier in this paper. The
protocol allows sending messages such as ‘Flow Control’ from the receiver of live and
presentation streams back to the sender of these streams, and telling the sender to modify the bit
rate, usually to reduce the bit rate when the receiver detects high packet loss. By sending a ‘Fast
Update’ message, the receiver asks the sender to send a full (intra) video frame, usually
when a video frame is lost in transmission. Figure 12 provides a graphical description of the
Figure 12: Video Channel Control
There is still no standard solution for replicating the video channel control functionality in SIP.
Polycom uses the SIP INFO message because it allows easy mapping of the H.245 messages into
SIP. This approach has been embraced by other vendors in the market. However, IETF is in favor
of an RTCP-based mechanism, and there is work on the so-called Audio-Visual Profile with
Feedback (AVPF) - an extension to RTCP that will allow for video channel control functionality.
The choice between these approaches has a substantial impact on the SIP-H.323 gateway function. While H.245-INFO
interworking is simple to implement and only touches the H.323-SIP signaling, RTCP is always
associated with RTP and using RTCP for video channel control means touching the media
stream. We will discuss that in more detail in the SIP-H.323 gateway section that follows.
Although we expect SIP deployments to grow rapidly in the future, the installed base of H.323
endpoints and infrastructure is here to stay in the healthcare, government, education, and general
enterprise markets. Interworking between the two protocols becomes an important issue. In
general, there are two ways to bridge the SIP and H.323 networks: through a signaling gateway
and through a conferencing server/MCU. Figure 13 provides a visual representation of the
interworking concept and lists the functions that have to be considered in the SIP-H.323
Figure 13: SIP - H.323 Interworking
SIP and H.323 are different protocols with different message formats, but both can be used in
similar ways. Comparing the call flows in Figure 4 and Figure 6 shows many similarities in the
call setup process. Similarities also exist in the call tear down process (not covered in this
paper) and in the mechanisms to spontaneously exchange information during the call. A signaling
gateway is a piece of software that takes incoming SIP messages, extracts the communication
parameters, creates H.323 messages, and sends them to the H.323 network. It also takes the
incoming H.323 messages, extracts the communication parameters, creates corresponding SIP
messages, and sends them to the SIP network. The gateway therefore looks like a SIP user agent
to the SIP network and like an H.323 terminal to the H.323 network.
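The SIP-to-H.323 direction of such a gateway can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the message correspondence in SIP_TO_H225 is the commonly cited one, not a table taken from this paper; the function and class names are hypothetical; and the returned dictionary stands in for a real H.225.0 message, which would be ASN.1-encoded. Capability negotiation (SDP versus H.245) is left out entirely.

```python
from dataclasses import dataclass

# Assumed correspondence between SIP messages and H.225.0 call signaling.
SIP_TO_H225 = {
    "INVITE": "Setup",
    "180": "Alerting",
    "200": "Connect",
    "BYE": "ReleaseComplete",
}


@dataclass
class CallParams:
    caller: str
    callee: str
    call_id: str


def parse_sip(message: str):
    """Return (kind, CallParams); kind is a method name or a status code."""
    head = message.partition("\r\n\r\n")[0]
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    start = lines[0].split()
    if start[0].startswith("SIP/"):
        # Response line: this sketch only maps responses to INVITE.
        if not headers.get("cseq", "").endswith("INVITE"):
            raise ValueError("only responses to INVITE are mapped")
        kind = start[1]
    else:
        kind = start[0]
    params = CallParams(headers.get("from", ""),
                        headers.get("to", ""),
                        headers.get("call-id", ""))
    return kind, params


def sip_to_h323(message: str) -> dict:
    """Extract the call parameters and describe the matching H.323 message."""
    kind, params = parse_sip(message)
    if kind not in SIP_TO_H225:
        raise ValueError(f"no H.225.0 mapping for {kind!r}")
    return {
        "h225_message": SIP_TO_H225[kind],
        "sourceAddress": params.caller,
        "destinationAddress": params.callee,
        "callIdentifier": params.call_id,
    }
```

A complete gateway would implement the reverse direction the same way and keep per-call state so that later messages can be matched to the call they belong to.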
Luckily, both SIP and H.323 rely on the same protocols (RTP and RTCP) for transmitting media
streams. The signaling gateway can then focus on mediating between the H.323 and SIP signaling
and does not need to touch the media. This is very important because media processing is very
resource-intensive. While signaling messages generate traffic on the order of a few kilobits per
second, video media streams can reach megabits per second (HD 720p video starts at 1.2 Mbps).
The base RFC relevant to SIP-H.323 signaling interworking is RFC 4123 ‘SIP - H.323
Interworking Requirements’. Since a lot of the audio and video codecs used in visual
communication are ITU-T standards, it was necessary to define RTP payload formats for each of
them: G.722.1, G.722.1 Annex C, H.261 Video, H.263 Video, and H.264 Video.
There are, however, several issues with the signaling gateway approach. First, media security
is broken because H.323-based video networks use AES encryption while SIP relies on SRTP
for encryption. These two standards are completely different – the encryption algorithms and the
key exchange procedures are incompatible. The consequence is that deploying a signaling
gateway would result in failure of the media encryption, i.e. the audio and video streams will be
As we mentioned in the video channel control section, another issue is the IETF-backed approach
that requires the use of RTCP, which is associated with RTP media. This goes against the
concept of a signaling-only gateway because H.245 messages must somehow be mapped into
RTCP messages. There are currently no implementations where RTCP is independent from an
RTP media stream, so media has to traverse the gateway in order to follow the IETF approach.
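The coupling between RTCP feedback and the media stream becomes visible when such a packet is built. The Python sketch below constructs a Picture Loss Indication (PLI) as defined in RFC 4585, one of the RTCP feedback messages that could play the role of the H.245 'Fast Update'. Choosing PLI as the counterpart is an assumption made for illustration, not a mapping stated in this paper, and the function name is hypothetical. A real sender would also place the PLI inside a compound RTCP packet rather than send it on its own.

```python
import struct

RTCP_VERSION = 2
PT_PSFB = 206   # RTCP packet type: payload-specific feedback
FMT_PLI = 1     # feedback message type: Picture Loss Indication


def build_rtcp_pli(sender_ssrc: int, media_ssrc: int) -> bytes:
    """Return the 12-byte PLI feedback packet.

    sender_ssrc identifies the endpoint reporting the loss; media_ssrc
    identifies the RTP video stream that the request refers to.
    """
    first_byte = (RTCP_VERSION << 6) | FMT_PLI  # V=2, P=0, FMT=1
    length_words = 2  # packet length in 32-bit words, minus one
    return struct.pack("!BBHII", first_byte, PT_PSFB, length_words,
                       sender_ssrc, media_ssrc)
```

Both SSRC values belong to the RTP session, and the packet is sent on the RTCP flow paired with that session. A gateway that only sees H.225.0, H.245, and SIP messages has neither the identifiers nor the transport path needed to emit it.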
The third issue is that signaling gateways only address the SIP-H.323 interworking; ISDN and
PSTN have different media (e.g. B channels in ISDN), and ISDN/PSTN users cannot use this
gateway to connect to the SIP network.
Because of these limitations, using the conferencing server as a gateway has been seriously
considered as an alternative concept for H.323-SIP interworking. Conferencing servers can
originate and terminate H.323 and SIP calls, and have sufficient processing power to handle the
media. They already support AES, and can easily add support for SRTP encryption. Mechanisms
for video channel control that use RTCP can be accommodated as well, since RTP and RTCP
streams go through the conferencing server. This approach has two main disadvantages: it creates
a bottleneck, because even point-to-point calls between SIP and H.323 domains have to go through
the conferencing server, and it carries the high cost of the additional conferencing server ports
needed to support SIP-H.323 interworking.
The Future of Visual Communications
In the long run, visual communications will migrate from H.323 to SIP, and will seamlessly
integrate with other communications network components: IP-PBXs, IM/Presence servers, etc.
The legacy H.323 equipment will continue to connect to the SIP network through gateways and
conferencing servers. Figure 14 displays the configuration of the future network.
Figure 14: Future Visual Communications
The migration to SIP will allow not only better interoperability with other communication
systems but also increased scalability, better traversal of firewalls and NATs, and better security.
With regard to scalability, servers handling tens of thousands of users and providing voice,
video, IM, presence, and directory services are feasible. Through federation, these servers can
support large networks of personal video systems, group conferencing systems, immersive
telepresence systems, soft clients, and mobile clients.
Firewalls and NATs have always been barriers to IP communication but current video solutions
are intranet-based and predominately used for internal company communication where firewalls
are less of a problem. Future networks will connect companies with their suppliers, customers,
and partners, all of which are separated by multiple firewalls. SIP in combination with ICE will
provide an efficient way to connect people across networks and to make visual
communication ubiquitous, similar to voice communication today.
With the ubiquity of SIP visual communications, security becomes of utmost importance. Once
SRTP is universally adopted and deployed for media security and TLS is supported across
vendors for signaling security, visual communications will become fully protected.
Visual communication is expanding beyond enterprise conference rooms to the user’s desktop.
The trend towards Unified Communications requires integrating video with a variety of SIP-based
systems in enterprises, hospitals, universities, and government organizations.
SIP is a new protocol that can meet the requirements for scalable distributed visual
communications. SIP has already been deployed for visual communication in certain scenarios.
Once the missing functionality is added to SIP, it will become a solid foundation for visual
communication solutions. Transition from H.323 to SIP will be gradual, and interoperability with
the installed H.323 base throughout the process is a key requirement and main technical
Polycom is uniquely positioned to leverage its broad product portfolio, market leadership and
extensive partner network to lead customers through the migration process from H.323 to SIP,
and deliver on the VC2 promise: transform traditional video conferencing into tomorrow’s visual