Migrating Visual Communications from H.323 to SIP

White Paper
by Stefan Karapetkov

April 2, 2008

Figure 13: SIP - H.323 Interworking

SIP and H.323 are different protocols with different message formats but they both ca...
Due to these limitation, using the conferencing server as a gateway has been seriously considered
as an alternative concep...
Visual communication is expanding beyond enterprise conference rooms to the user’s desktop.
The trend towards U...

Introduction

The H.323 protocol was developed by the International Telecommunication Union (ITU), an international standardization body based in Geneva, Switzerland, with video conferencing in mind, and most traditional video conferencing systems are based on H.323. However, the convergence of voice, video, and data into what is often referred to as Unified Communications (UC) has a dramatic impact on how people use video, and presents a new set of requirements to solutions for the emerging visual communications market. In order to meet these new requirements, Polycom is working on a seamless migration from H.323 to the Session Initiation Protocol (SIP). This process will take a long time, and H.323 and SIP will coexist in customer networks for years to come.

The content of this paper is based on Polycom’s presentation ‘A New Paradigm for SIP-based Video Communications’ at the International SIP Conference in Paris (January 29 – February 1, 2008). The paper provides an overview of H.323 and SIP, and compares the two protocols. The paper also makes references to specific technologies that Polycom is deploying to guarantee smooth migration of the installed customer base from H.323 to SIP.

Visual Communications Market

Polycom envisions a dramatically different marketplace for video in the years ahead. Social, economic, and technological trends are aligning to create a unique opportunity for new and innovative forms of visual communication. This combination of factors will bring video into the mainstream and make visual communication essential in both our personal and professional lives. Polycom calls this transformation VC2.
Visual communications today include applications such as telepresence, which provides an immersive experience for users; group video conferencing, which is now available with High-Definition audio and video and provides a new level of user experience; and personal video, which brings visual communication to the individual user’s desktop or project space. Figure 1 is an overview of these applications.

Figure 1: Enterprise Video Communications Today

While dedicated personal video systems integrate monitor, camera, microphones, speakers, and codec into one unit and are optimized for video communication, soft clients rely on the PC video and voice processing capabilities.

Today, visual communications solutions are widely deployed in education, medical, and government organizations. Deployments in general enterprises were recently revitalized as a result of travel restrictions and green policies.

Market Trends

Two major market trends are driving the visual communications market. The first trend is the shift from reserved to on-demand conferencing. Both audio and video conferencing started as scheduled events with reserved resources, e.g. ports on the Multipoint Conferencing Unit (MCU), and reserved bandwidth, e.g. B channels in the ISDN network. Audio conferencing made the transition to reservation-less, operator-less systems and is now 96% on-demand. Figure 2 summarizes the trend to on-demand conferencing.

Figure 2: Trend from Reserved to On-demand Conferencing

Video has stayed scheduled for longer, and even today, about 80% of video conferencing is scheduled. However, there is a clear trend to on-demand video, and there are strong indicators that future conferencing will be even richer and more flexible, with presence integration and an increased number of ways to access the services, e.g. from desktop computers and mobile phones.

Looking at this trend on a higher level, reserved operator-attended services are becoming presence-enabled customer-initiated services. Note that audio conferencing is running ahead of
video conferencing – with higher desktop/mobile penetration and a higher percentage of on-demand conferences.

This trend has a huge impact on the choice of communication protocols in visual communication systems. The trend requires more scalability because desktop video drives up the number of users. It also requires that new features such as presence and instant messaging are seamlessly integrated with audio, video and content.

The second major market trend is from overlay video networks to unified collaboration. Video systems have been deployed as overlay networks (over the organization’s IP network) for years, and video has been a stand-alone application, separate from the mainstream IT applications. Video also required separate management tools and directories, and has in general been poorly connected to the rest of the IT infrastructure.

With the emergence of the Unified Communications concept, enterprises, service providers and other organizations started morphing their voice, video, and data communication systems into one. Figure 3 describes the trend towards Unified Communications.

Figure 3: Trend towards Unified Collaboration

This trend creates an interesting technical challenge. Telephony call control servers have started the migration from proprietary protocols to standard SIP, and there are already a large number of standards-based implementations, some of them open source. Even the remaining proprietary IP-PBX systems on the market provide some level of SIP interoperability and allow third-party equipment to connect to the IP-PBX, or even control it.

Many Presence and Instant Messaging systems support SIP via the SIP for Instant Messaging and Presence Leveraging Extensions (SIMPLE) protocol. Other implementations are based on the eXtensible Messaging and Presence Protocol (XMPP).

Enterprise video today is mostly H.323-based, although video endpoints, video soft clients and even MCUs support basic SIP connectivity.
For example, all Polycom endpoints can run in SIP mode, while conference servers such as Polycom RMX 2000 and MGC support SIP, H.323, H.320, etc.

The technical challenge that UC poses is how to connect all of the elements in Figure 3 into one system that provides the full range of services to users. Based on the current state of the networking technology, SIP is the most functional common denominator that could interconnect the different applications within the organization.

H.323 Basics

In order to compare SIP and H.323, we will need a brief description of the H.323 protocol. H.323 is an umbrella signaling protocol, i.e. it refers to a set of other protocols such as H.225 and H.245, which are known as ‘the H.323 family of protocols’. H.323 was originally defined for multimedia communications and fits the video conferencing application well because it had mechanisms for audio and video call setup from the very beginning. It also has the so-called capability exchange procedure (often referred to as CAPS), which is very important for finding communication parameters acceptable to both sides, as well as a master-slave determination mechanism that is very useful when MCUs are involved in the communication.

H.323 is optimized for machine communication. It uses ASN.1 notation, and the H.323 messages are encoded using the Packed Encoding Rules (PER). This means that very few people can actually read captured H.323 messages.

H.323 Elements and Call Flow

H.323 defines H.323 Terminals, which can initiate or receive calls, and H.323 Gatekeepers, which register H.323 terminals and provide call admission control and call routing. Gatekeepers can be very simple or very complex, depending on how many of the optional functions in H.323 they implement.

H.323 also defines Gateways to other networks, e.g. H.320/ISDN. While gateways are optional in H.323, they play a central role when migration to H.323 (e.g. from H.320/ISDN) or from H.323 (e.g. to SIP) is required. Since the topic of this paper is migration from H.323 to SIP, we will discuss the H.323-SIP gateway in more detail later in this paper.

Figure 4 looks at the interaction of the two critical and mandatory elements in the H.323 network: Terminals and Gatekeeper.

Figure 4: H.323 Basic Call Flow

H.323 describes the call setup procedure, and refers to the H.225 and H.245 protocols for signaling message formats and some additional functions.
The signaling messages are described in H.225. The H.225 SETUP message includes information about the source, i.e. who is sending the message (in Figure 4, this is Terminal A), and about the destination (Terminal B). The Gatekeeper then uses this information to locate the destination (Terminal B). After receiving the SETUP message, Terminal B stores the information about the request (IP addresses, port numbers, etc.), and sends back the CONNECT message. The most important information in the CONNECT message is about the setup of an H.245 control channel, which is
used for three main functions: capability exchange (CAPS), master-slave determination (MS), and opening logical channels (OLC), i.e. creating media streams for audio, video and content.

H.245 Terminal Capability Exchange is a procedure for exchanging preferred codecs and settings between the two H.323 terminals. For example, Terminal A may suggest H.264 or H.263 video and Siren 22 Stereo or Siren 14 Mono audio, and Terminal B may respond that it only supports H.263 and Siren 14.

Once both sides agree on common parameters, the ‘conversation’ moves to its next phase, H.245 Master Slave Determination, which is useful for avoiding conflicts during call control operations. H.245 Master Slave Determination is very important when an H.323 Terminal connects to an MCU (the MCU is the master), and when one MCU connects to another MCU through so-called ‘cascading’, in which case one of the MCUs has to be the master.

After capabilities have been exchanged and the connection master determined, the H.245 Open Logical Channel Request procedure creates media channels (voice, video, or content/data) between the communicating parties. Note that these channels are always created in pairs, i.e. the video channel from Terminal A to Terminal B is different and separate from the video channel from Terminal B to Terminal A. Therefore, communication can be asymmetric: Terminal A can send high quality video to B and receive lower quality video from B, and vice versa.

The H.245 control channel is also used to transmit the Flow Control command, which is used by the receiver to set an upper limit for the transmitter bit rate on any logical channel, and the Fast Update command, which is used by the receiver to request resending video frames that were lost in transmission.
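
The capability exchange and the paired logical channels described above can be illustrated with a minimal Python sketch. It is only a model of the idea, not of the H.245 message syntax: the `Terminal` class and the function names are hypothetical, codec names are shortened to ‘Siren 22’ and ‘Siren 14’, and for simplicity both directions use the same codec, although in practice each direction is negotiated separately.

```python
from dataclasses import dataclass


@dataclass
class Terminal:
    name: str
    video: list  # video codecs in order of preference
    audio: list  # audio codecs in order of preference


def first_common(preferred, supported):
    """Return the first codec in `preferred` that also appears in `supported`."""
    for codec in preferred:
        if codec in supported:
            return codec
    return None


def exchange_capabilities(a, b):
    """Pick one video and one audio codec that both terminals support."""
    return {
        "video": first_common(a.video, b.video),
        "audio": first_common(a.audio, b.audio),
    }


def open_logical_channels(a, b, agreed):
    """Create one unidirectional channel per direction for each agreed media type."""
    channels = []
    for media, codec in agreed.items():
        if codec is None:
            continue  # no common codec, so no channel for this media type
        channels.append((a.name, b.name, media, codec))
        channels.append((b.name, a.name, media, codec))
    return channels


# The example from the text: A offers two codecs per media type, B supports one.
terminal_a = Terminal("Terminal A", ["H.264", "H.263"], ["Siren 22", "Siren 14"])
terminal_b = Terminal("Terminal B", ["H.263"], ["Siren 14"])

agreed = exchange_capabilities(terminal_a, terminal_b)
for channel in open_logical_channels(terminal_a, terminal_b, agreed):
    print(channel)
```

With the capabilities from the example, the two terminals settle on H.263 and Siren 14, and four unidirectional channels are opened (audio and video in each direction).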

Audio streams and video streams are transmitted via the Real-time Transport Protocol (RTP, RFC 3550), and for each RTP stream there is an associated RTP Control Protocol (RTCP, also RFC 3550) channel, which is used to periodically transmit control packets to participants in a multimedia session. The primary function of RTCP is to provide feedback on the quality of service being provided by RTP.

H.323 for Enterprise Video

H.323 has been widely deployed in visual communication equipment. The H.323 Terminal function is implemented in video endpoints such as Polycom HDX and VSX. The H.323 Gatekeeper function is implemented in products such as Polycom SE 200 and PathNavigator. The H.323 MCU function is implemented in products such as Polycom RMX 2000 and MGC. In addition to basic call and DTMF tones, these systems support a range of additional features. The most important ones are listed in Figure 5.

Figure 5: H.323 Enterprise Video

Multipoint conferencing is very natural in H.323 because every call in H.323 (including point-to-point calls) is defined as a ‘conference’. It is therefore assumed from the start that parties will be added to the conference.

H.323 has its own set of security mechanisms. Early implementations used DES and 3DES encryption, while the latest generation of equipment supports the Advanced Encryption Standard (AES). H.323 also has a mechanism for traversing firewalls and NATs, which is described in the H.460.17, H.460.18, and H.460.19 standards.

Vendors embraced the H.323 protocol and added functions that are quite unique to visual communications. Examples are Dual Video Streams (based on the H.239 protocol), Video Channel Control (implemented in the H.245 protocol) and Far End Camera Control (FECC, based on the H.224 and H.281 protocols). We will discuss each of these features later in this paper.

SIP Basics

The Session Initiation Protocol (SIP, RFC 3261) was developed by the Internet Engineering Task Force (IETF), an organization that sets the technical standards for the Internet. In many ways SIP is similar to H.323, as it can also be used to set up audio and video calls, and it also refers to a long list of other standards (called ‘Requests for Comments’ or RFCs in the IETF lingo) that constitute ‘the SIP family of protocols’. For example, SIP refers to the Session Description Protocol (SDP, RFC 2327) as the format for describing media parameters.

IETF envisioned SIP to be a generic protocol that can set up any kind of session, not just audio and video, i.e. SIP can be used for instant messaging, data transfer, etc. In addition, SIP was designed to be similar to the Hypertext Transfer Protocol (HTTP), which is used for web browsing in the Internet. The idea was that HTTP developers should be able to easily learn the SIP protocol and develop Voice over IP and Video over IP applications, the same way they develop web applications.
While this did not exactly happen, SIP became easier to read and understand than H.323, mainly because it uses readable clear-text messages (in comparison, H.323 uses ASN.1 and PER).

Since IETF develops standards for the Internet, it is very concerned about the scalability of networking protocols. Therefore, SIP was designed to be lightweight and to scale well. While a wave of extensions, mainly for VoIP applications, increased the complexity of the protocol, the core SIP specification (RFC 3261) and a few closely related specs, such as SDP (RFC 2327) and RTP (RFC 3550), are sufficient for a functional SIP implementation.
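
To illustrate the point about readable clear-text messages, the following Python sketch takes a small SIP request and splits it into its start line and headers with plain string operations. The message is a hand-written illustration in the general shape defined by RFC 3261; the users, host names, tag, and Call-ID are hypothetical, and the parser is deliberately minimal (it ignores repeated headers, compact header forms, and line folding).

```python
# A hand-written SIP request; all addresses and identifiers are made up.
SAMPLE_REQUEST = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP pc1.example.com;branch=z9hG4bK776asdhds\r\n"
    "Max-Forwards: 70\r\n"
    "To: <sip:bob@example.com>\r\n"
    "From: <sip:alice@example.com>;tag=1928301774\r\n"
    "Call-ID: a84b4c76e66710@pc1.example.com\r\n"
    "CSeq: 314159 INVITE\r\n"
    "Contact: <sip:alice@pc1.example.com>\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)


def parse_sip_request(text):
    """Split a SIP request into method, request URI, version, headers, and body."""
    head, _, body = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, uri, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Only the first colon separates the header name from its value.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, uri, version, headers, body


method, uri, version, headers, body = parse_sip_request(SAMPLE_REQUEST)
print(method, uri, version)
print(headers["From"], "->", headers["To"])
```

Nothing comparable is possible with a captured H.323 message without an ASN.1 decoder, which is the practical difference the text refers to.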

SIP Elements and Call Flow

The equivalent of the H.323 Terminal in SIP is the SIP User Agent (UA). The name ‘user agent’ leans towards mobile communication and user mobility, i.e. the ability of the user to log on at a communication device which then becomes the user’s agent. Unlike H.323, SIP splits the server functions (concentrated in the H.323 Gatekeeper) into several entities: SIP Redirect Server, SIP Proxy Server, and SIP Registrar. This is also in line with the Internet philosophy that the server that registers and authenticates you (the Registrar) need not be the server that gets your requests (the Proxy) and need not be the server that knows the current location of the destination (the Redirect Server). Figure 6 shows the basic SIP message exchange necessary to set up an audio/video call.

Figure 6: SIP Basic Call Flow

The UAs learn the SIP servers’ addresses (a domain name like www.sipregistrar1.com, or an IP address) by configuration/provisioning or dynamically, i.e. by sending a DNS SRV request asking the Internet ‘What SIP servers are there?’ and receiving a list of servers. Subsequently, UAs register with their home Registrars (registration procedure not shown here) and get authenticated, i.e. the Registrar queries a user database to verify the user name, the user password, and an additional authentication parameter called the ‘SIP Realm’.

While H.323 uses E.164 phone numbers (e.g. +14085551212) or aliases to identify the destination, SIP uses a Uniform Resource Identifier (URI) in the format user@<domain name>. In our example, UA A is in the domain home.com and wants to reach ‘userB’, which is currently in a different domain, visited.com. UA A starts the session (call) by sending an INVITE message (the equivalent of an H.323 SETUP message) for userB@home.com to the local Redirect Server, asking for the current location of ‘userB’.
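
The addressing scheme and the location lookup that this request triggers can be sketched in Python as follows. This is a toy model under stated assumptions: the `RedirectServer` class, its in-memory location table, and the function names are hypothetical, and only the 302 (‘Moved Temporarily’) and 404 (‘Not Found’) response codes are taken from RFC 3261. A real Redirect Server would parse full SIP messages and return the new location in a Contact header.

```python
def split_address(address):
    """Split an address of the form user@domain into its two parts."""
    user, separator, domain = address.partition("@")
    if not separator or not user or not domain:
        raise ValueError(f"not a user@domain address: {address!r}")
    return user, domain


class RedirectServer:
    """Toy redirect server that answers from an in-memory location table."""

    def __init__(self, locations):
        # Maps the address a user is known by to the user's current address.
        self.locations = dict(locations)

    def handle_invite(self, target):
        """Return (response code, new address) for an INVITE to `target`."""
        current = self.locations.get(target)
        if current is None:
            return 404, None  # Not Found: the server knows nothing about the user
        return 302, current   # Moved Temporarily: retry at the returned address


# The scenario from the text: userB belongs to home.com but is in visited.com.
server = RedirectServer({"userB@home.com": "userB@visited.com"})
code, new_address = server.handle_invite("userB@home.com")
print(code, new_address)
print(split_address(new_address))
```

The caller is expected to issue a second INVITE towards the address returned with the 302 response.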

The Redirect Server responds with response code 302 (SIP response codes are similar and often equivalent to the HTTP status codes), which means that the user has moved temporarily. The response includes the new domain of the user: visited.com. UA A then sends a new INVITE to the local Proxy Server (for simplicity, Proxy and Registrar reside in the same server in Figure 6), and the Proxy Server routes the INVITE through the network to the destination. A handshake procedure including the SIP messages 200 OK and ACK makes sure both communicating partners and the Proxy Server know that the session is successfully set up.

Similar to H.323, the signaling procedure ends with the setup of media streams, e.g. for audio and video. As in H.323, audio streams and video streams are transmitted via the Real-time Transport Protocol (RTP, RFC 3550), and for each RTP stream there is an associated RTP Control Protocol
(RTCP, also RFC 3550) channel. The importance of the RTP use in both H.323 and SIP will be highlighted later in the discussion around SIP-H.323 gateways.

SIP for Enterprise Video

As mentioned above, the H.323 community invested much effort in adding new functionality to H.323 for the purposes of visual communication. SIP, on the other hand, was embraced by the Voice over IP community and extended in many ways to support voice communications, both to replicate some traditional telephony functions and to create new ones. For example, IETF created a set of mechanisms (STUN, TURN, and ICE) that allow RTP streams to traverse firewalls and Network Address Translation (NAT) boxes, which are very common elements in IP networks and a huge problem for both Voice over IP and Video over IP.

As Figure 7 below shows, SIP is available today in visual communication equipment (endpoints, MCUs), but the list of features available in SIP is, from a visual communications perspective, still shorter than in H.323.

Figure 7: SIP Enterprise Video

The major difference between SIP and H.323 is in the area of security and Firewall/NAT traversal. While H.323 systems deploy AES for media encryption, i.e. all RTP packets carrying audio and video are encrypted by the sender using AES, SIP refers to the Secure Real-time Transport Protocol (SRTP, RFC 3711) for encrypting media. While signaling messages in H.323 are transmitted unencrypted, SIP (maybe because it is a clear-text protocol that can be read easily) enforces the use of Transport Layer Security (TLS, RFC 4346) to encrypt SIP signaling messages.

The other major delta, also related to security, is in the area of Firewall and NAT traversal. H.323 relies on the H.460.17, H.460.18, and H.460.19 standards for Firewall and NAT traversal.
IETF originally developed STUN (Simple Traversal of UDP through NATs), then added the TURN (Traversal Using Relay NAT) mechanism to increase the firewall traversal success rate, and finally created the ICE (Interactive Connectivity Establishment) specification that combines STUN and TURN functions into one. Firewall traversal has long been considered the forte of IETF, and the hope is that through the newly developed traversal mechanisms, SIP-based communication will be able to flow across enterprise (including healthcare, government, and education) and service provider networks.

What is SIP Used for Today?

Although video network elements today support SIP, they are rarely deployed in a complete SIP video solution. The reason is that SIP still cannot match the H.323 functionality, and an all-H.323 solution can provide great interoperability and more functionality than an all-SIP solution.
SIP gained ground against proprietary protocols from Avaya, Nortel, Siemens, etc., mostly because it allows better interoperability across vendors, i.e. the ability to mix and match components. But in the H.323 video communications market, interoperability is great, and H.323 interoperability events (bakeoffs, cookouts; for some reason culinary terminology was widely adopted) are as efficient as SIP interoperability events such as SIPit.

SIP for Integration with IM and Presence

SIP is, however, irreplaceable in integrations with IM/Presence systems such as IBM Sametime and Microsoft LCS and OCS. The idea is that since SIP is used for exchanging Presence information and for setting up IM sessions (based on the SIMPLE specifications), it makes sense to integrate video systems via SIP. The reality is, however, that SIMPLE is not the leading approach to Presence and IM. Microsoft added proprietary extensions to SIP for MS Office Communicator and LCS/OCS. Even within IETF, the competing XMPP protocol is gaining momentum, and seems to have eclipsed SIMPLE for Internet applications. Nevertheless, SIP is today the only common denominator that allows integration of video into IM and Presence systems. Figure 8 is an example of such integration.

Figure 8: Integration with IM/Presence

In the diagram, two IM/Presence clients communicate with an IM/Presence server, which is connected through a gateway function, i.e. translation software that runs on a standard server. The SIP protocol is used for the communication among video components: video soft clients (associated with the IM/Presence clients), video endpoints (such as the room system displayed in Figure 8) and conferencing servers (MCUs). A SIP Registrar/Proxy (marked ‘SIP Server’ here) handles registration, call setup, and call tear-down. A video client can be connected to another video client or to a video endpoint such as a room system.
All video clients and endpoints can be part of a multipoint call through the conferencing server. Note that once video soft clients and video endpoints connect in a multi-party conference call, additional participants from H.323, H.320 (ISDN), and PSTN (voice only) can also join the conference.

SIP for Integration with IP-PBXs

Early versions of IP-PBXs supported basic H.323 and allowed registering H.323 clients. However, as SIP became more important to IP-PBX interoperability, IP-PBXs started supporting SIP registrations, SIP trunking, etc. H.323 support was dropped or was not updated to the latest H.323 versions. Since most IP-PBXs in the market support SIP (and do not support H.323), SIP is
irreplaceable in integrations with systems such as Avaya Call Manager, Nortel MCS 5100, and Cisco Call Manager. Note that since most IP-PBXs are based on proprietary architectures, the SIP interfaces provide only limited functions, i.e. registration, basic call, and DTMF. Hold is usually also supported because Hold is a part of the base SIP standard (RFC 3261).

With the development of a new generation of IP communication systems based on SIP soft switches (such as Nortel MCS 5100), the SIP functionality became richer and included features such as Transfer, Forward, and Conference. Video endpoints can now support such functions and mirror the functionality of desktop phones. These features mainly apply to personal video users and are less attractive to users of group conferencing systems.

If the IP-PBX does not support SIP, integration is still possible through a CTI server with SIP plug-ins. While one can argue that using SIP or H.323 for such integrations is equally efficient, almost all integrations are done via SIP since it is not probable that H.323 will be supported natively in IP-PBXs. There is hope that over time the proprietary solutions will migrate to SIP. So the protocol selection is often based on which protocol looks more future-proof. Figure 9 shows an example of an integration of video equipment with a SIP-based communication system.

Figure 9: Integration with SIP Communication Server

The SIP Communication Server in Figure 9 acts as SIP Proxy and Registrar for all user agents: SIP soft clients, SIP phones, video endpoints in SIP mode (HDX 4000 and 9000 in Figure 9), and the conferencing server that supports multiple protocols simultaneously. Similar to the integration with IM/Presence systems, the conferencing server (RMX 2000 in this example) allows H.323, H.320/ISDN, and PSTN (voice-only) participants to join a multiparty conference.
Further benefits of using the conferencing server in such configurations are discussed in the SIP-H.323 gateway section below.

SIP for Integration with IMS

Integration of video systems (endpoints, application servers, conferencing servers/MCUs) with IP Multimedia Subsystem (IMS) networks is also based on SIP. IMS uses SIP for communication among network elements but has defined extensions (most visibly in the form of private P-headers), so that seamless integration with IMS networks requires a bit more than plain SIP. More information about Polycom’s involvement in IMS is in the white paper ‘Polycom and IMS’ http://www.polycom.com/common/documents/whitepapers/polycom_ims_1.pdf.

Implementing Visual Communications Features in SIP

In this section, we will look at the implementation approaches for three major video features – Dual Stream, FECC, and Video Channel Control – in SIP. As discussed in the H.323 section of this paper, the H.323 community developed these mechanisms, which became very popular among video users. A migration from H.323 to SIP therefore requires replication of the functionality in the new environment.

Dual Video Stream

Dual Video Streams allows a ‘presentation’ (sometimes also called ‘content’) audio-video stream to be created in parallel to the primary ‘live’ audio-video stream. This second stream is used to share any type of content: slides, spreadsheets, X-rays, video clips, etc. Polycom’s pre-standard version of this technology is called People+Content. H.239 is heavily based on intellectual property from Polycom People+Content and became the ITU-T standard that allows interoperability between different vendors. Figure 10 summarizes the Dual Video Streams concept.

Figure 10: Dual Video Streams

While the function works well on single-monitor systems, it is especially powerful in multi-screen setups (video endpoints can support up to 4 monitors). In the example in Figure 10, a Polycom HDX 4000 personal video system is on a live call with a Polycom HDX 9000 Executive Collection with two flat screen monitors. The live stream is shown on the right monitor. The user of the HDX 4000 uses a laptop, either directly connected to the HDX 4000 or running Polycom content sharing software, to activate content sharing to the HDX 9000 Executive Collection. A ‘presentation’ stream is created in parallel to the ‘live’ stream, and the content is displayed on the left screen of the receiver system.

The benefit of this functionality is that users can share not just slides or spreadsheets but also moving images: Flash video, movie clips, commercials, etc. The ‘presentation’ channel has flexible resolution, frame rates, and bit rates.
For dynamic images, it can support full High-Definition video at 30 frames per second; for static content such as slides, it can run at, for example, 3 frames per second and save bandwidth in the IP network.

Another major benefit of using a video channel for content sharing is that the media is encrypted (by AES in H.323 and by SRTP in SIP). In addition, once the firewall and NAT traversal works for the ‘live’ stream, it works for the ‘presentation’ channel as well, and there is no need for a separate traversal solution.

The first issue with supporting Dual Video Streams in SIP is describing the content/presentation stream. As discussed above, the Session Description Protocol (SDP, RFC 2327) is used to describe media stream parameters. SIP endpoints and conferencing servers have to support RFC 4574, which defines the ‘label’ attribute in SDP, and RFC 4796, which defines the ‘content’ attribute.

Now that we can describe the content stream, we have to be able to associate the content stream with a live stream. This can be done by supporting RFC 3388 ‘Grouping of Media Lines in the Session Description Protocol’.

The remaining issue is how to identify who is sending the content and who is receiving it. This is usually done by tokens (the party that has the token can send content), and token management protocols make sure that there is only one token in the session, and that anyone can request and receive the token. RFC 4582 ‘Binary Floor Control Protocol (BFCP)’ defines a token management mechanism, and can be used for a Dual Video Stream implementation in SIP. And since everything has to be described in SDP, we also need a way to describe the BFCP streams in SDP. This can be done by supporting RFC 4583 ‘SDP Format for Binary Floor Control Protocol Streams’.

Since it takes 5 specifications (RFCs) to implement the equivalent of H.239 functionality in SIP, Polycom created a specification that describes how to glue these RFCs together. This specification is now the Internet Draft ‘Role Management and Multiple Stream Functionality in SIP’ (draft-even-xcon-pnc).

Far End Camera Control

FECC is a popular feature in visual communications. If H.323 Terminals A and B are on a call, the feature allows Terminal A to control the camera of Terminal B: zoom, pan (move the camera left and right), and tilt (move the camera up and down). The assumption is that Terminal B has a PTZ (Pan, Tilt, and Zoom) camera, and has the FECC feature enabled. Figure 11 explains the concept.

Figure 11: Far End Camera Control (FECC)

In a group conferencing setting, the key FECC benefit is that users can adjust the image that they get from the remote site, focus on a particular person or a group of people, and then move to another part of the room. In a personal video setting, the feature can be used to adjust the camera if the remote party is sitting too close to or too far from the camera.

In H.323, FECC is implemented via two ITU standards: H.281 defines the binary data that is transmitted between Terminals A and B to control the camera, while H.224 defines the format of the frames that carry the binary data.
In SIP, RFC 4573 ‘MIME Type Registration for RTP Payload Format for H.224’ (authored by Polycom) registers the H.224 media type and defines the syntax and semantics of the Session Description Protocol (SDP) parameters needed to support far-end camera control using H.224 in SIP. In effect, RFC 4573 creates a tunnel through the SIP-based network and allows video endpoints to exchange H.224/H.281 information exactly as they do in H.323-based networks.

Video Channel Control
Video channel control is embedded in H.245 and was discussed in detail earlier in this paper. The protocol allows the receiver of live and presentation streams to send messages back to the sender of these streams. A ‘Flow Control’ message tells the sender to modify the bit rate, usually to reduce it when the receiver detects high packet loss. A ‘Fast Update’ message asks the sender to resend a full (intra) video frame, usually when a video frame has been lost in transmission. Figure 12 provides a graphical description of the functionality.

Figure 12: Video Channel Control

There is still no standard solution for replicating the video channel control functionality in SIP. Polycom uses the SIP INFO message because it allows easy mapping of the H.245 messages into SIP. This approach has been embraced by other vendors in the market. However, the IETF favors an RTCP-based mechanism, and there is ongoing work on the so-called Audio-Visual Profile with Feedback (AVPF), an extension to RTCP that will allow for video channel control functionality. The RTCP-based approach has a substantial impact on the SIP-H.323 gateway function. While H.245-INFO interworking is simple to implement and only touches the H.323-SIP signaling, RTCP is always associated with RTP, so using RTCP for video channel control means touching the media stream. We will discuss that in more detail in the SIP-H.323 gateway section that follows.
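To make the contrast between the two approaches concrete, the following Python sketch builds a ‘Fast Update’ request in both forms. It is an illustration under stated assumptions, not a description of any vendor's implementation: the XML body follows the media-control schema documented in RFC 5168 (the paper does not name the exact INFO payload Polycom uses), the RTCP packet is the Picture Loss Indication defined in RFC 4585, and the function names, URI, and SSRC values are hypothetical.

```python
import struct

# Assumption: picture fast update body as documented in RFC 5168, carried in
# a SIP INFO request with Content-Type application/media_control+xml.
FAST_UPDATE_XML = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<media_control><vc_primitive><to_encoder>"
    "<picture_fast_update/>"
    "</to_encoder></vc_primitive></media_control>\n"
)


def build_info_fast_update(target_uri: str, call_id: str, cseq: int) -> str:
    """Hypothetical helper: abbreviated SIP INFO (Via, From, To omitted).

    This request travels on the signaling path only.
    """
    body = FAST_UPDATE_XML
    headers = [
        f"INFO {target_uri} SIP/2.0",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} INFO",
        "Content-Type: application/media_control+xml",
        f"Content-Length: {len(body.encode('utf-8'))}",
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + body


def build_rtcp_pli(sender_ssrc: int, media_ssrc: int) -> bytes:
    """RTCP Picture Loss Indication (RFC 4585).

    This packet travels alongside the RTP media stream, not the signaling.
    """
    version = 2
    fmt = 1            # feedback message type: PLI
    packet_type = 206  # payload-specific feedback (PSFB)
    length = 2         # packet size in 32-bit words, minus one
    first_byte = (version << 6) | fmt  # padding bit left at zero
    return struct.pack("!BBHII", first_byte, packet_type, length,
                       sender_ssrc, media_ssrc)


info = build_info_fast_update("sip:terminal-b@example.com", "abc123", 2)
pli = build_rtcp_pli(sender_ssrc=0x11111111, media_ssrc=0x22222222)
```

The INFO request is ordinary SIP text that a signaling-only gateway could generate from an H.245 message, whereas the 12-byte PLI packet has to be injected into the RTCP flow that accompanies the media.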
SIP-H.323 Interworking
Although we expect SIP deployments to grow rapidly in the future, the installed base of H.323 endpoints and infrastructure is here to stay in the healthcare, government, education, and general enterprise markets. Interworking between the two protocols therefore becomes an important issue. In general, there are two ways to bridge the SIP and H.323 networks: through a signaling gateway and through a conferencing server/MCU. Figure 13 provides a visual representation of the interworking concept and lists the functions that have to be considered in the SIP-H.323 interworking scenario.
Figure 13: SIP - H.323 Interworking

SIP and H.323 are different protocols with different message formats, but both can be used in similar ways. Comparing the call flows in Figure 4 and Figure 6 shows many similarities in the call setup process. Similarities also exist in the call tear-down process (not covered in this paper) and in the mechanisms to spontaneously exchange information during the call.

A signaling gateway is a piece of software that takes incoming SIP messages, extracts the communication parameters, creates H.323 messages, and sends them to the H.323 network. It also takes incoming H.323 messages, extracts the communication parameters, creates corresponding SIP messages, and sends them to the SIP network. The gateway therefore looks like a SIP user agent to the SIP network and like an H.323 terminal to the H.323 network.

Luckily, both SIP and H.323 rely on the same protocols (RTP and RTCP) for transmitting media streams. The signaling gateway can therefore focus on mediating between the H.323 and SIP signaling and does not need to touch the media. This is very important because media processing is very resource-intensive. While signaling messages generate traffic on the order of a few kilobits per second, video media streams can run to megabits per second (HD 720p video starts at 1.2 Mbps).

The base RFC relevant to SIP-H.323 signaling interworking is RFC 4123 ‘SIP - H.323 Interworking Requirements’. Since many of the audio and video codecs used in visual communication are ITU-T standards, it was also necessary to define RTP payload formats for each of them: G.722.1, G.722.1 Annex C, H.261 Video, H.263 Video, and H.264 Video.

There are, however, several issues with the signaling gateway approach. First, media security gets broken because H.323-based video networks use AES encryption while SIP refers to SRTP for encryption. These two standards are completely different: the encryption algorithms and the key exchange procedures are incompatible.
The consequence is that deploying a signaling gateway would result in failure of the media encryption, i.e. the audio and video streams would be transmitted unencrypted.

As we mentioned in the video channel control section, the second issue is the IETF-backed approach that requires the use of RTCP, which is associated with the RTP media. This goes against the concept of a signaling-only gateway because H.245 messages must somehow be mapped into RTCP messages. There are currently no implementations where RTCP is independent from an RTP media stream, so media has to traverse the gateway in order to follow the IETF approach.

The third issue is that signaling gateways only address SIP-H.323 interworking. ISDN and PSTN use different media transport (e.g. B channels in ISDN), so ISDN/PSTN users cannot use this gateway to connect to the SIP network.
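The signaling gateway concept described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not an implementation of RFC 4123: the message pairs are rough correspondences between SIP and H.225.0 call signaling, the `SignalingMessage` type and `translate` function are hypothetical, and a real gateway would also have to map H.245 capability exchange and logical channel signaling to SDP.

```python
from dataclasses import dataclass

# Rough, simplified correspondences between SIP and H.225.0 call signaling.
SIP_TO_H323 = {
    "INVITE": "Setup",
    "180 Ringing": "Alerting",
    "200 OK": "Connect",
    "BYE": "Release Complete",
}
H323_TO_SIP = {h323: sip for sip, h323 in SIP_TO_H323.items()}


@dataclass(frozen=True)
class SignalingMessage:
    """Hypothetical, protocol-neutral view of one signaling message."""
    protocol: str       # "SIP" or "H.323"
    kind: str           # e.g. "INVITE" or "Setup"
    caller: str
    callee: str
    media_address: str  # where the sender expects RTP, e.g. "192.0.2.10:49170"


def translate(msg: SignalingMessage) -> SignalingMessage:
    """Re-create a message on the other side of the gateway.

    The media address is copied unchanged, so RTP and RTCP can flow directly
    between the endpoints without passing through the gateway.
    """
    if msg.protocol == "SIP":
        table, target = SIP_TO_H323, "H.323"
    elif msg.protocol == "H.323":
        table, target = H323_TO_SIP, "SIP"
    else:
        raise ValueError(f"unsupported protocol: {msg.protocol}")
    return SignalingMessage(target, table[msg.kind], msg.caller,
                            msg.callee, msg.media_address)


invite = SignalingMessage("SIP", "INVITE", "alice", "bob", "192.0.2.10:49170")
setup = translate(invite)
```

The sketch also shows why the issues listed above arise: the gateway never sees the media, so it can neither re-encrypt between AES and SRTP nor insert RTCP feedback messages into the stream.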
Due to these limitations, using the conferencing server as a gateway has been seriously considered as an alternative concept for H.323-SIP interworking. Conferencing servers can originate and terminate H.323 and SIP calls, and have sufficient processing power to handle the media. They already support AES and can easily add support for SRTP encryption. Mechanisms for video channel control that use RTCP can be accommodated as well, since RTP and RTCP streams go through the conferencing server. The main disadvantages of this approach are that it creates a bottleneck, because even point-to-point calls between SIP and H.323 domains have to go through the conferencing server, and that the additional conferencing server ports needed to support SIP-H.323 interworking are costly.

The Future of Visual Communications
In the long run, visual communications will migrate from H.323 to SIP and will seamlessly integrate with other communications network components: IP-PBXs, IM/Presence servers, etc. Legacy H.323 equipment will continue to connect to the SIP network through gateways and conferencing servers. Figure 14 displays the configuration of the future network.

Figure 14: Future Visual Communications

The migration to SIP will allow not only better interoperability with other communication systems but also increased scalability, better traversal of firewalls and NATs, and better security.

With regard to scalability, servers handling tens of thousands of users and providing voice, video, IM, presence, and directory services are feasible. Through federation, these servers can support large networks of personal video systems, group conferencing systems, immersive telepresence systems, soft clients, and mobile clients.

Firewalls and NATs have always been barriers to IP communication, but current video solutions are intranet-based and predominantly used for internal company communication, where firewalls are less of a problem.
Future networks will connect companies with their suppliers, customers, and partners, all of which are separated by multiple firewalls. SIP in combination with ICE will provide an efficient way of connecting people across networks and of making visual communication ubiquitous, similar to voice communication today.

With the ubiquity of SIP visual communications, security becomes of utmost importance. Once SRTP is universally adopted and deployed for media security and TLS is supported across vendors for signaling security, visual communications will become fully protected.
Conclusion
Visual communication is expanding beyond enterprise conference rooms to the user’s desktop. The trend towards Unified Communications requires integrating video with a variety of SIP-based systems in enterprises, hospitals, universities, and government organizations. SIP is a new protocol that can meet the requirements for scalable distributed visual communications. SIP has already been deployed for visual communication in certain scenarios. Once the missing functionality is added to SIP, it will become a solid foundation for visual communication solutions.

The transition from H.323 to SIP will be gradual, and interoperability with the installed H.323 base throughout the process is a key requirement and the main technical challenge. Polycom is uniquely positioned to leverage its broad product portfolio, market leadership, and extensive partner network to lead customers through the migration process from H.323 to SIP, and deliver on the VC2 promise: transform traditional video conferencing into tomorrow’s visual communications.