You Don't Know Jack About VoIP
ACM Queue vol. 2, no. 6 - September 2004
by Phil Sherburne and Cary FitzGerald, Cisco
The Communications they are a-changin'.
Telecommunications worldwide has experienced a significant revolution over recent
years. The long-held promise of network convergence is occurring at an increasing pace.
This convergence of data, voice, and video using IP-based networks is delivering
advanced services at lower cost across the spectrum, including residential users, business
customers of varying sizes, and service providers.
One of the key technologies driving this convergence is VoIP (voice over IP), which has
evolved from what many viewed as experimental to a fundamental technology on which
businesses from small to Fortune 500 are running their enterprises. VoIP has moved to a
level of reliability and capability such that mainstream users are adopting it at a rapidly
increasing pace. For this to happen, a number of technical innovations were required to
solve issues such as quality of service and reliability.
This article explores key principles and technology innovations underlying VoIP, and
describes the implications of these innovations for software developers.
FROM ANALOG TO VoIP
Telecommunications technology is entering its third wave with VoIP. It began with analog
signals carried by the first telephones and evolved into digital networks decades later.
Now, with the increasing sophistication of the Internet, VoIP is coming into its own.
From the invention of the telephone in 1876 to today’s modern communications
infrastructure, voice has been carried by analog wave signals. Human speech is an analog
wave signal. In the initial telephone networks, speech was converted to electrical wave
forms (microphone) and converted back to speech at the other end of the conversation
(speaker), traveling the distance between the phones as this analog wave form.
While an obvious leap forward over previous methods of communication, this early
technology had severe limitations that included introduction of “noise” in the signal. This
noise increases with distance traveled. Although various methods of reducing noise were
developed over the years, it remained a noticeable problem (remember the amount of
static on long distance calls?). Another significant problem was one of economics. As the
demand for communications increased dramatically post–World War II, the need to
increase the carrying capacity of a pair of copper wires was significant. This led to the
development of digital transmission capabilities in the long distance network.
The early 1960s saw the introduction of technology that converted speech into digital
signals. Specifically, the invention and deployment of T1 lines allowed for transmission of
voice at 1.544 megabits per second (Mbps). (The counterpart used in Europe and other
places outside the United States is E1, at a rate of 2.048 Mbps.) Among other benefits, T1 lines
addressed the two primary problems with analog voice transmission—noise and
economics. Because the digital signals contained either 0s or 1s, the digital “repeaters”
that were used to regenerate the signals over distances could also re-form the signals in
a near-perfect rendition of the original. Thus, the impact of distance on the quality of the
speech was virtually eliminated. The issue of economics was alleviated since a T1 line
carries twenty-four 64-Kbps channels (32 for E1 lines). The mechanism to place multiple
calls on the T1 line is known as TDM (time division multiplexing).
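The channel arithmetic above can be checked in a few lines. This is a minimal sketch that assumes the usual TDM frame layout, which the article does not spell out: one eight-bit sample per channel per frame, 8,000 frames per second, one extra framing bit per T1 frame, and E1 framing carried inside its timeslots. The function names are illustrative.

```python
# Sketch: how TDM timeslots add up to the T1 and E1 line rates.
# Assumed frame layout (not stated in the article): one 8-bit sample per
# channel per frame, 8,000 frames per second.

FRAMES_PER_SECOND = 8_000
BITS_PER_SAMPLE = 8


def channel_rate_bps() -> int:
    """Bit rate of a single TDM voice channel."""
    return FRAMES_PER_SECOND * BITS_PER_SAMPLE


def line_rate_bps(channels: int, framing_bits_per_frame: int) -> int:
    """Line rate for a frame carrying `channels` timeslots plus framing bits."""
    bits_per_frame = channels * BITS_PER_SAMPLE + framing_bits_per_frame
    return bits_per_frame * FRAMES_PER_SECOND


if __name__ == "__main__":
    print(channel_rate_bps())     # 64000
    print(line_rate_bps(24, 1))   # 1544000 (T1: 24 channels + 1 framing bit)
    print(line_rate_bps(32, 0))   # 2048000 (E1: framing uses a timeslot)
```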
T1 technology (as well as higher digital transmission rates, including fiber optics) has
been deployed extensively over the past 40-plus years. With the exception of the access
portion of the network (i.e., lines from homes to the phone company), virtually all voice
is carried over digital lines worldwide. Again, a great step forward in communications but
with its own set of limitations. Critical among these is the nature of TDM connections,
known as circuit-switched connections. Fundamentally, this means that a call from one
end of the circuit-switched connection to the other always follows the same path through
the network and consumes the same amount of bandwidth, whether there is useful data
to be transmitted or not. For example, during a silent pause on a phone call, 64 Kbps of
data are still being transmitted in each direction. From an economic viewpoint, this is
Since the 1970s there has been an increasing use of packet networks for transmitting
data. Today the most obvious use of this technology is the Internet. The nature of packet
networks in general and IP (Internet protocol) in particular is that the data to be
transmitted is split into small packets that include small amounts of address information
added to each packet. These packets are sent out over the network—quite possibly taking
different paths through the network, unlike a legacy TDM connection where data is simply
data, and routing is established at call setup time. The packets are then reassembled at
the destination node.
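The split, send, and reassemble cycle can be sketched as follows. The Packet fields here are hypothetical and far simpler than a real IP header; in particular, the seq field stands in for whatever ordering information the protocols above IP provide.

```python
# Sketch: split a message into addressed packets, deliver them out of
# order, and reassemble them at the destination. Field names are
# illustrative, not a real header layout.
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str        # sender address
    dst: str        # destination address
    seq: int        # position of this chunk in the original message
    payload: bytes


def packetize(data: bytes, src: str, dst: str, size: int) -> list[Packet]:
    """Cut `data` into chunks of at most `size` bytes, one per packet."""
    return [
        Packet(src, dst, seq, data[offset:offset + size])
        for seq, offset in enumerate(range(0, len(data), size))
    ]


def reassemble(packets: list[Packet]) -> bytes:
    """Restore the original byte order using the sequence numbers."""
    return b"".join(p.payload for p in sorted(packets, key=lambda p: p.seq))


if __name__ == "__main__":
    message = b"voice and data share one packet network"
    packets = packetize(message, "10.0.0.1", "10.0.0.2", size=8)
    random.shuffle(packets)   # packets may take different paths and reorder
    assert reassemble(packets) == message
```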
Packet-switched networks have significant advantages over circuit-switched networks.
Among these is the ability of packets to take different routes through the network. In the
case of network failures (transmission lines being cut, etc.), this allows the data still to
reach the destination. In addition, the only bandwidth used is that required for useful
data (other than a small amount of control information such as address bits).
People recognized that if a means could be found to use packet technology for the
transmission of voice, then the limitations of TDM networks could be overcome. Voice
packets could take different routes through the network, and only necessary bandwidth
would be used rather than always transmitting even in the face of silence. Even more
significantly, both data and voice could be carried on a common, packet-based network.
This would simplify management by reducing the number of networks to manage, and
lowering network facility and hardware costs.
By the early 1990s certain fundamental technologies were developed that allowed for
initial efforts in VoIP. The rest of this article outlines those technologies and their
implications for software developers.
Any discussion of VoIP must begin with a discussion of both bearer and signaling
components. Bearer refers to the actual voice being sent over the network. Signaling
refers to the information necessary for successful setup and teardown of the call. This
includes the dialed digits, off-hook and on-hook information, originating number, etc. The
separation of signaling from bearer information began in the circuit-switched digital
networks—for example, ISDN. The concepts behind this were leveraged for VoIP.
One difference between data and voice transmission is the sensitivity to delay associated
with transmission across the network. Data is far less sensitive to delay than voice is.
Anyone who has experienced an international call over satellite will recognize this
sensitivity. This is partially solved by the use of RTP (realtime transport protocol) for the
transmission of voice.
RTP is the standard protocol designed for realtime sensitive data transmission. Because
of the realtime nature of voice, all VoIP traffic is carried as RTP packets. RTP “rides” on
top of the standard UDP (user datagram protocol) and provides information to the
endpoints not available in UDP. Specifically, RTP provides packet sequence information so
endpoints can determine arrival order and time-stamping to allow endpoints to help
manage “jitter” (discussed later in this article).
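To make the sequence and time-stamp fields concrete, here is a minimal sketch of the 12-byte fixed RTP header as laid out in RFC 3550. It assumes no padding, no header extension, and no CSRC entries; the function names and the SSRC value in the usage note are illustrative.

```python
# Sketch: pack and unpack the fixed RTP header (RFC 3550 layout).
# Assumes no padding, no extension header, and a CSRC count of zero.
import struct

RTP_VERSION = 2
HEADER = struct.Struct("!BBHII")   # 12 bytes, network byte order


def pack_rtp(payload_type: int, seq: int, timestamp: int, ssrc: int,
             payload: bytes, marker: bool = False) -> bytes:
    """Prefix `payload` with a fixed RTP header."""
    byte0 = RTP_VERSION << 6                       # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return HEADER.pack(byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF) + payload


def unpack_rtp(packet: bytes) -> dict:
    """Read the fixed header; the rest is treated as payload (CC assumed 0)."""
    byte0, byte1, seq, timestamp, ssrc = HEADER.unpack_from(packet)
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,               # lets the receiver detect loss and reordering
        "timestamp": timestamp,   # lets the receiver schedule playout
        "ssrc": ssrc,
        "payload": packet[HEADER.size:],
    }
```

For example, pack_rtp(0, 1, 160, 0x1234, bytes(160)) yields a 172-byte packet: 12 header bytes plus 160 payload bytes, which is 20 ms of audio at 8,000 one-byte samples per second.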
Voice coding standards. A number of different voice-encoding algorithms—codecs—are
used in VoIP networks. These are standardized as a set of G-series recommendations by
the ITU (International Telecommunication Union). Common ones are G.711, which
encodes at 64 Kbps, and G.729, which encodes at 8 Kbps. Each of the codecs has
different attributes, including compression level, quality, etc.
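The codec rate is not the whole story on an IP network, because every voice packet also carries IP, UDP, and RTP headers. The following sketch estimates the resulting per-call bandwidth in one direction. The 20 ms packet interval is an assumption (a common default, not something the article states), the header sizes assume IPv4 with no options, and link-layer overhead is ignored.

```python
# Sketch: per-call IP bandwidth for a codec, one direction only.
# Assumptions: IPv4 without options (20 bytes), UDP (8), RTP (12),
# and a packet interval that divides 1000 ms evenly.

IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12


def payload_bytes(codec_bps: int, packet_ms: int) -> int:
    """Bytes of encoded voice carried in one packet."""
    return codec_bps * packet_ms // 8000   # bits/s * ms / 1000 / 8


def ip_bandwidth_bps(codec_bps: int, packet_ms: int = 20) -> int:
    """Bandwidth at the IP layer, including header overhead."""
    packets_per_second = 1000 // packet_ms
    packet_bytes = payload_bytes(codec_bps, packet_ms) + IP_UDP_RTP_HEADER_BYTES
    return packet_bytes * 8 * packets_per_second


if __name__ == "__main__":
    print(ip_bandwidth_bps(64_000))   # 80000 (G.711 at 20 ms)
    print(ip_bandwidth_bps(8_000))    # 24000 (G.729 at 20 ms)
```

Under these assumptions the headers add 16 Kbps to either codec, so the relative overhead is much larger for the low-rate codec.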
Considerations for different bearer traffic. Although the discussion so far has
focused on voice, in reality, other types of information are transmitted over traditional
voice networks. For VoIP to be practical and gain common usage, these types of traffic
must also be handled effectively:
• DTMF (dual-tone multi-frequency). This refers to the tones generated by a common
touch-tone phone. These are used for not only initiating a phone call but also
communicating during a phone call—such as for voice-mail and IVR (interactive voice
response) systems. When used for making a phone call, DTMF is part of the signaling
information and not transmitted as part of the bearer information. When used mid-call,
however, it is transmitted as part of the bearer data.
• Fax. The use of fax machines—although less common today than before the common
use of e-mail—remains a critical form of data communication (such as in the legal
profession). For broad market acceptance, VoIP networks and equipment must be able to
handle traditional fax machines, given the large number deployed worldwide. The issue in
handling fax on VoIP networks is that fax transmissions are much more sensitive to
packet loss than voice is. Different methods (Fax Passthru and T.38 Fax Relay) have been
developed to ensure successful fax transmission over VoIP.
An architectural model has evolved within the VoIP industry (see figure 1). Just as with
any reference model, specific products or protocols do not necessarily strictly adhere to
the model, but it has proven to be a useful framework for characterizing components and
their roles. The model is a PSTN (public switched telephone network) gateway, with a set
of interfaces looking into telephone networks and a set of interfaces looking into VoIP
networks, but it equally applies to IP phones and other VoIP endpoints.
The heart of the system is the MGC (media gateway controller). An MGC is an
“intelligent” endpoint; it interacts with its peers to establish, modify, and destroy
connections within a network. The manipulation of these connections
results in various end-user services: call establishment, features such as transfer and
park and hold, and call forwarding. The MGC is the component that supervises calls and
services from end to end. Often it is implemented as a highly reliable system component,
so call-related information must be mirrored across a complex of MGCs.
The MG (media gateway) is responsible for the media interfaces to the PSTN and to the
IP network. Typically, an MG is implemented with a complex of DSPs (digital signal
processors) to lower system costs, but general-purpose processors are also sometimes
used, depending on the application.
An MG is a simple endpoint. It does only what it is told to do. It does not understand
signaling to either the PSTN or the IP network. It does not understand services or even
calls. It creates, modifies, and destroys connections as instructed by an MGC. These
connections can be between the PSTN and the IP network, between PSTN ports, or even
between IP-based endpoints. Since an MG does not understand the end-to-end nature of
a call, it needs to concern itself only with the connections it is holding up, so the system
reliability requirements for an MG can be somewhat relaxed.
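The division of labor between the two components can be sketched in code. This is only an illustration of the idea: the class and method names are hypothetical, the endpoint strings are made up, and nothing here follows the message syntax of a real control protocol.

```python
# Sketch of the MGC/MG split: the gateway only manipulates connections,
# while the controller owns the notion of a "call". All names hypothetical.


class MediaGateway:
    """Simple endpoint: creates, modifies, and destroys connections on request."""

    def __init__(self) -> None:
        self._connections: dict[int, tuple[str, str]] = {}
        self._next_id = 1

    def create(self, a: str, b: str) -> int:
        conn_id = self._next_id
        self._next_id += 1
        self._connections[conn_id] = (a, b)
        return conn_id

    def modify(self, conn_id: int, a: str, b: str) -> None:
        if conn_id not in self._connections:
            raise KeyError(conn_id)
        self._connections[conn_id] = (a, b)

    def delete(self, conn_id: int) -> None:
        del self._connections[conn_id]

    def endpoints(self, conn_id: int) -> tuple[str, str]:
        return self._connections[conn_id]


class MediaGatewayController:
    """Intelligent endpoint: maps calls and services onto gateway connections."""

    def __init__(self, gateway: MediaGateway) -> None:
        self.gateway = gateway
        self.calls: dict[str, int] = {}   # call id -> gateway connection id

    def setup_call(self, call_id: str, caller: str, callee: str) -> None:
        self.calls[call_id] = self.gateway.create(caller, callee)

    def transfer(self, call_id: str, new_callee: str) -> None:
        # A "transfer" service is just a connection change from the MG's view.
        conn_id = self.calls[call_id]
        caller, _ = self.gateway.endpoints(conn_id)
        self.gateway.modify(conn_id, caller, new_callee)

    def teardown(self, call_id: str) -> None:
        self.gateway.delete(self.calls.pop(call_id))
```

Note that only the controller holds call state, which is why its reliability requirements are the stricter of the two.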
MGCs and MGs interact with each other over a control plane, which can be a proprietary
interface such as an internal API; standardized protocols have also been developed. Both
the ITU and the IETF (Internet Engineering Task Force) saw the need for this protocol
and cooperated to produce the MEGACO (H.248/Media Gateway Controller) protocol, a
first-of-its-kind cooperation between the two standards bodies. It is published in the IETF
as RFC3525 and in the ITU as H.248. H.248/MEGACO is in an early adoption phase of
deployment. Eventually, this is expected to replace an earlier effort to standardize the
MGC/MG control plane that has been known as MGCP (media gateway control protocol).
MGCP (IETF RFC3435) has been deployed in a number of networks and has been adopted
in the ITU for application in VoIP over cable.
MGCs interact with their peers using an intelligent signaling protocol. Two intelligent
protocols have emerged—from the ITU, H.323, and from the IETF, SIP (session initiation
protocol). They share many concepts; both suppose that the endpoints are intelligent—
but they also differ in significant ways.
H.323 is derived from the PSTN protocols used to access PSTN services—Q.931. VoIP
connections in H.323 follow the ISDN model: the same message sequences are used to
establish and tear down calls. H.323 has been extended to support a number of services;
again, these follow an established model from established TDM network architectures. For
example, a number of services are described in the H.450 series; these are modeled on
the corresponding services from QSIG.
SIP is a methods-based protocol, whose roots are in HTTP. In general, services are not
explicitly exposed in the protocol; rather, the designer can use a set of well-defined
methods to implement services. So, for example, SIP does not have a transfer primitive
per se, but executing a set of SIP transactions will result in the user experiencing a
transfer. A significant amount of work is going on in the standards communities with
respect to SIP, as well as a significant increase in market adoption of SIP-based
equipment. SIP-based equipment is clearly expected to achieve a significant share of
installed VoIP equipment over the next few years.
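The HTTP heritage shows in the message format: a SIP request is plain text with a start line followed by headers. The sketch below builds and parses a bare INVITE. The addresses, tag, branch, and Call-ID values in the usage note are hypothetical, and a real INVITE would normally carry a session description in its body, which is omitted here.

```python
# Sketch: build and parse a minimal SIP INVITE (text format per RFC 3261).
# No message body is included; all identifiers are supplied by the caller.


def build_invite(caller: str, callee: str, host: str, call_id: str,
                 branch: str, tag: str, cseq: int = 1) -> str:
    """Return an INVITE from `caller` to `callee`, sent from `host`."""
    user = caller.split("@")[0]
    lines = [
        f"INVITE sip:{callee} SIP/2.0",
        f"Via: SIP/2.0/UDP {host};branch=z9hG4bK{branch}",
        "Max-Forwards: 70",
        f"To: <sip:{callee}>",
        f"From: <sip:{caller}>;tag={tag}",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} INVITE",
        f"Contact: <sip:{user}@{host}>",
        "Content-Length: 0",
    ]
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_message(message: str) -> tuple[str, dict[str, str]]:
    """Split a message into its start line and a header dictionary."""
    head = message.split("\r\n\r\n", 1)[0]
    start_line, *header_lines = head.split("\r\n")
    headers = dict(line.split(": ", 1) for line in header_lines)
    return start_line, headers
```

For instance, build_invite("alice@example.com", "bob@example.net", "pc1.example.com", "a84b4c76e66710", "776asdhds", "1928301774") produces a request whose start line is "INVITE sip:bob@example.net SIP/2.0". A service such as transfer would be composed from several such transactions rather than expressed as a single primitive.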
A number of exciting new services and concepts are coming out of the VoIP community.
We highlight just a couple as follows: the impact of IM (instant messaging) and presence
on converged communications; and ENUM, a mechanism for telephone number resolution
in VoIP networks.
Instant messaging and presence. A significant number of features in the telephone
network are devoted to the concept of increasing the probability that a call will be
completed to the right recipient at a time that is acceptable to both the caller and the
called party. IM and presence have recently emerged as important business and personal
communications tools. Combining IM and presence with VoIP yields some valuable new
features. Presence information can be used to determine whether offering a new call to a
party is likely to be successful—there is no point in placing the call until the called party is
available and willing to take the call. Instant messages can be used as part of the alerting
process, which allows both called and calling parties to provide more information to each
other about the nature of their communications. The two systems—VoIP and IM/presence
—working in concert are more valuable than either one alone. VoIP deployments for
these applications are in their very early stages.
ENUM. The best-understood and most widely deployed name resolution system today is
the DNS (domain name system). In the DNS, names are written from right to left, with
the most general part of the address on the right, and more specific names written to the
left (e.g., www.ietf.org). In the PSTN, telephone numbers are written from left to right,
with the most general part of the number written on the left and the more specific toward
the right (e.g., 1.212.543.6789). ENUM calls for telephone numbers to be written DNS-style,
rooted at the domain e164.arpa. So, 1.212.543.6789 becomes
9.8.7.6.3.4.5.2.1.2.1.e164.arpa. Interestingly, each digit is treated as a subdomain. This
allows ENUM to ignore the nuances of country codes, city codes, etc. that vary broadly
worldwide. When this address is queried, the DNS can return a specific IP address
corresponding to the telephone number, or it can return a rule for rewriting the original
number into some other form. For example, rules can be returned to rewrite
1.212.543.6789 as sip:firstname.lastname@example.org, sip:email@example.com.
ENUM offers the possibility to reuse the worldwide DNS for VoIP. ENUM is a standard set
by the IETF as RFC3761.
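The number-to-domain step is mechanical: keep the digits, reverse them, join them with dots, and append the root. A minimal sketch follows; the function name is illustrative, and the DNS query that would come next is not shown.

```python
# Sketch: convert a telephone number into its ENUM domain name.
# Only the name construction is shown; no DNS lookup is performed.


def enum_domain(number: str, root: str = "e164.arpa") -> str:
    """Reverse the digits of `number` and root them at `root`."""
    digits = [ch for ch in number if "0" <= ch <= "9"]   # drop separators
    if not digits:
        raise ValueError("telephone number contains no digits")
    return ".".join(reversed(digits)) + "." + root


if __name__ == "__main__":
    print(enum_domain("1.212.543.6789"))
    # 9.8.7.6.3.4.5.2.1.2.1.e164.arpa
```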
Managing VoIP Quality of Service
Voice quality. The fundamental concern for VoIP QoS (quality of service) is voice quality.
Unfortunately, objective measurements for this have been elusive. That said, the major
factors that affect voice quality are delay, packet loss, and treatment at the endpoints.
Voice codecs are unevenly tolerant of packet loss, but loss above 2 to 5 percent will have
a perceptible effect on quality. Loss is rarely random and is often associated with high
jitter (simply defined as the variation in packet arrival times at the destination).
When one-way delay through a voice network exceeds about 150 milliseconds, natural
conversational communication is strained, so most network deployments attempt to keep
the delay well below that threshold. There are a number of components to delay: codecs
have an intrinsic delay; it takes time to prepare and route a packet to the IP interface on
the phone or gateway; various access networks have intrinsic delays; and transit
networks contribute both in terms of routing delay and propagation delay.
Further, packets are generated at regular intervals, but because of the vagaries of
routing across the IP network, they are delivered to the endpoint with a certain amount
of jitter. Endpoints have built into the software a “jitter buffer” where packets are
buffered and then played out at a constant rate. This, of course, works fine unless the
amount of jitter exceeds what can be absorbed by the jitter buffer. Software-based
mechanisms exist in the endpoints to automatically adjust buffer sizes, etc. as jitter
increases or decreases. Jitter can be a major component of the delay budget.
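The trade-off between jitter absorption and added delay can be seen in a small simulation. This sketch models a fixed (non-adaptive) jitter buffer: playout starts a fixed time after the first packet arrives, and any packet that arrives after its playout slot is counted as late. The arrival times are invented for illustration.

```python
# Sketch: which packets miss their playout deadline for a given
# fixed jitter-buffer depth. Arrival times below are made-up examples.


def late_packets(arrivals_ms: list[float], interval_ms: float,
                 buffer_ms: float) -> list[int]:
    """Return sequence numbers that arrive after their playout time.

    arrivals_ms[i] is the arrival time of packet i, which was generated
    at i * interval_ms. Playout of packet 0 starts buffer_ms after it arrives.
    """
    start = arrivals_ms[0] + buffer_ms
    return [
        seq for seq, arrival in enumerate(arrivals_ms)
        if arrival > start + seq * interval_ms
    ]


if __name__ == "__main__":
    # Packets sent every 20 ms; the third one is delayed by the network.
    arrivals = [50, 72, 125, 112, 131]
    print(late_packets(arrivals, interval_ms=20, buffer_ms=20))   # [2]
    print(late_packets(arrivals, interval_ms=20, buffer_ms=40))   # []
```

Doubling the buffer rescues the delayed packet, but every packet in the call is then played 20 ms later, which comes straight out of the delay budget.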
QoS tools. The basic idea for controlling QoS revolves around two aspects: the first is
ensuring that the network has enough capacity (bandwidth) to allow for high-quality
calls; the second is establishing priorities such that the more realtime-sensitive packets
are given higher priority for transit through the network.
To ensure enough capacity, there are mechanisms such as RSVP (resource reservation protocol,
RFC 2205). This allows bandwidth to be reserved through the network. Using RSVP, the
endpoints or MGCs signal through the network, reserving capacity. This is done in
advance of a call being set up.
For priority management, the mechanism is the use of different packet-queueing methods within the
MGs and routers. A variety of algorithms are available, with the best choice depending on
the customer network and traffic types. Associated with this is the concept of TOS (type
of service) bits. Within each packet there are three bits at the IP level that indicate up to
eight levels of precedence. These are used to ensure that higher-priority packets make it
through the network first.
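As an illustration, the sketch below reads the three precedence bits from the TOS byte and uses them to drive strict priority queueing, which is only one of the possible queueing algorithms and can starve low-priority traffic under load. Marking voice with precedence 5 is an assumption made for the example, and the class name is hypothetical.

```python
# Sketch: strict priority queueing keyed on the IP precedence bits
# (the top three bits of the IPv4 TOS byte).
import heapq
import itertools


def precedence(tos_byte: int) -> int:
    """Extract the precedence level, 0 (lowest) to 7 (highest)."""
    return (tos_byte & 0xFF) >> 5


def tos_for(precedence_level: int) -> int:
    """Build a TOS byte carrying only the given precedence level."""
    if not 0 <= precedence_level <= 7:
        raise ValueError("precedence must be in 0..7")
    return precedence_level << 5


class PriorityScheduler:
    """Always sends the highest-precedence packet; FIFO within a level."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._order = itertools.count()   # tie-breaker preserves arrival order

    def enqueue(self, tos_byte: int, packet: str) -> None:
        heapq.heappush(self._heap,
                       (-precedence(tos_byte), next(self._order), packet))

    def dequeue(self) -> str:
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    voice, data = tos_for(5), tos_for(0)   # assumed markings
    queue = PriorityScheduler()
    for tos, name in [(data, "data1"), (voice, "voice1"),
                      (data, "data2"), (voice, "voice2")]:
        queue.enqueue(tos, name)
    print([queue.dequeue() for _ in range(4)])
    # ['voice1', 'voice2', 'data1', 'data2']
```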
System and software designers for VoIP equipment and networks face myriad challenges.
Common concerns are QoS; security; manageability and operations; reliability,
redundancy, and sparing; scalability; deployability, installability, and upgradability;
serviceability, capacity management, fault detection, diagnostics, reparability, and
metrics; testability and regressions; internationalization; performance and graceful
degradation under fault conditions and load; extensibility; interoperability with both IP
and legacy systems; modularity; manufacturability and costs; open systems and
standards compliance; ease of use for end users; consistent/normalized database use;
billing and audits; and feature interactions.
Expectations for the reliability of VoIP are as high as those for the traditional voice
networks. Although there are different measures of reliability (such as the oft-misused “5
9s”), for our discussion the assumption is that the VoIP system must work all the time (7
x 24 x 365). While there are occasional “maintenance windows,” the expectation is that the
system is always operational (think about a 911 call center in a major metropolitan area).
At a high level this means the software designer must “design for failure”—that is, the
designer must consider potential failures in three domains:
1. In the network, which might be caused by external events such as power failures or
2. In the hardware, including processor, memory failures, etc.
3. In the software, which may be the result of bugs, corrupted data, etc.
Thus, the software designer must include capabilities such as:
• No upgrade downtime. This typically implies some form of duplicated active/standby
systems with data synchronization between the active/standby systems.
• Software audits. There should be separate software components that audit the
primary system software. This includes validating the internal data structures for
accuracy as well as consistency among data structures. Corrective action may include
automatic correction of the invalid data.
• Process monitoring. This means having system monitors that ensure the primary
system software is operating correctly. This includes techniques such as watchdog timers
—that is, having the primary software send a message on a regular basis to the
monitoring software indicating proper functioning. Corrective action by the system
monitor may range from process restart to system failure over to a standby system.
• Automatic failover. As a response to certain types of failures, including a full system
failure, the system automatically fails over to a standby system.
• Geographic redundancy. This is the ability to have the active and standby systems
separated by hundreds of miles.
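The process-monitoring and failover items in the list above can be sketched as a heartbeat watchdog with escalating corrective action. The class name, the returned action strings, and the escalation policy (restart a limited number of times, then fail over) are assumptions for illustration. Time is passed in explicitly so the logic can be exercised without real clocks.

```python
# Sketch: a watchdog that expects regular heartbeats from the primary
# software and escalates from process restart to failover.


class Watchdog:
    def __init__(self, timeout_s: float, restart_limit: int, now: float) -> None:
        self.timeout_s = timeout_s
        self.restart_limit = restart_limit
        self.last_heartbeat = now
        self.restarts = 0

    def heartbeat(self, now: float) -> None:
        """Called by the primary software to report proper functioning."""
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Return the action to take: 'ok', 'restart', or 'failover'."""
        if now - self.last_heartbeat <= self.timeout_s:
            return "ok"
        if self.restarts < self.restart_limit:
            self.restarts += 1
            self.last_heartbeat = now   # give the restarted process a fresh window
            return "restart"
        return "failover"               # hand over to the standby system


if __name__ == "__main__":
    wd = Watchdog(timeout_s=3, restart_limit=1, now=0)
    wd.heartbeat(now=2)
    print(wd.check(now=4))    # ok
    print(wd.check(now=6))    # restart
    print(wd.check(now=10))   # failover
```

A production monitor would also need a policy for clearing the restart count after a period of healthy operation; that is omitted here for brevity.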
The need to manage their networks is critical for all customers, whether large or small. In
many cases the cost of operating and managing voice systems—whether traditional TDM
or VoIP—far outweighs the cost of the equipment. Therefore, the need for effective tools
allowing for cost-effective management is important for successful deployments.
Manageability, as used here, covers many different areas, including accurate and flexible
billing systems, error reporting and resolution, call tracing, adds/moves/changes, etc.
Although VoIP does not create new concerns, manageability takes on additional roles.
Consider the need for call tracing, which typically arises when an end user complains
about a dropped call, noisy lines, etc. A system administrator will then typically look at
the call traces—the route the call has taken through the network—to identify the source
of the trouble. As noted earlier in the context of a traditional TDM circuit-switched
network, when a call is set up, the voice takes the same path through the network for the
duration of the call. This makes tracing calls through the network reasonably
straightforward by collating call detail records, etc.
In a VoIP network, the packets containing the voice may take very different routes
through the network, which makes the issue of call tracing and diagnosing of intermittent
problems much more challenging. This requires not only good instrumentation on the
MGCs, MGs, and routers in the network, but also very sophisticated management tools
that provide the correlation and reporting of the information.
THE IMPACT OF VoIP
In recent years, we’ve seen increasing adoption of VoIP networks for customers of
varying sizes on a global basis. The cost advantage resulting from convergence and the
value of new applications offered by this convergence are the primary drivers of this
adoption. With this comes the need for increasingly sophisticated systems and
management tools to allow for the extensive adoption and deployment of VoIP.
VoIP’s increasing adoption will have a significant impact on our communications and the
products that provide those communications. Therefore, software developers across the
industry will increasingly need to be aware of and understand the challenges that come
with this latest change in the communications infrastructure.
PHIL SHERBURNE is senior director of the Voice Technology Group, Cisco Systems. He runs the Call Control
Division, and his team is responsible for the development and deployment of Call Control technologies including
the Cisco Call Manager, BTS 10200, PGW 2200, and the SIP Proxy Server products. Previously, he was general
manager for the Packet Telephony Call Control Business Unit responsible for the Softswitch products from Cisco.
During his career with Cisco, he has been involved with a number of VoIP products and offerings.
Prior to joining Cisco in 2000, Sherburne spent more than 20 years at AT&T and Lucent Technologies Bell
Laboratories, where he was involved in development of both PBX and messaging products. He has a B.Sc. in
computer science from the University of Oregon and an M.Sc. in computer science from Ohio State University.
CARY FITZGERALD is senior director of the Voice Technology Group at Cisco Systems. He joined Cisco in 1996
and formed the team that built the first commercial VoIP gateway. He is a key contributor setting Cisco’s VoIP
architectural directions. Prior to joining Cisco, FitzGerald was a distinguished member of technical staff at AT&T
Bell Laboratories, where he led architecture and design teams for voice-response and voice-mail systems. He has
a B.S. in computer science from Purdue University.
Not Your Father's PBX?
ACM Queue vol. 2, no. 6 - September 2004
by James E. Coffman, Avaya
Integrating VoIP into the enterprise could mean the end of telecom
Perhaps no piece of office equipment is more taken for granted than the common
business telephone. The technology behind this basic communication device, however, is
in the midst of a major transformation. Businesses are now converging their voice and
data networks in order to simplify their network operations and take advantage of the
new functional benefits and capabilities that a converged network delivers—from greater
productivity and cost savings to enhanced mobility.
Convergence involves much more than simply sending voice packets over an IP (Internet
protocol) network. It involves a significant new architecture that introduces advanced IP
applications into the framework of an enterprise, with ramifications for communications
that will play out over the years to come.
The discussion that follows describes what’s behind some major changes in
communication systems design. New systems are evolving to become much more
distributed, open, and made up of common, off-the-shelf components. (For an overview
on VoIP, read Phil Sherburne and Cary FitzGerald’s “You Don’t Know Jack About VoIP” on
page 30 of this issue).
WHAT IS A PBX?
Most of us are familiar with a PBX (private branch exchange) only as users who pick up
the phone at the office to call someone inside or outside our business. In fact, although
the most basic function of a PBX is to provide communications (usually voice) to the
employees of a business, there is more to it than that. Sitting behind the familiar
telephone is a sophisticated system of components that provides the functions necessary
These functions can be roughly divided into six major groups:
1. Feature operations. These consist of both those functions available to the phone user
(placing a call, hold, conference, transfer, etc.) and those functions utilized by the
operator of the system to control its use and how it is organized—phone number
2. Endpoints. These allow users to access the functions of the system. The most common
endpoints are telephones, but also included are fax machines, modems, PDAs, telephony
applications running on laptop computers, and more. Most of the “non-voice” endpoints
send signals to the PBX just as if they were simple analog phones. The media streams
they send, however, are not voice-like and often require special handling by the system.
For example, modems send special tones that tell the system not to perform echo
suppression on the information they transmit. PBXs often have special features to
suppress “call waiting” tones, which might be useful to human users but would disrupt
3. Gateway interfaces. Gateway interfaces allow users inside the business to talk to the
outside world. This requires conversion of both the call signaling (how calls are set up)
and the voice stream. If someone in a business wants to call an outside phone, the PBX
must signal to the outside system—typically the PSTN (public switched telephone
network)—and must convert the voice stream inside the business to a voice stream
expected by the PSTN.
4. Switching. Making a call requires that a path be established between the calling and
called endpoints. This process is “switching,” and it can be accomplished in a variety of
ways. It requires a network to tie the components (endpoints and gateways) together.
5. Media processing. Media processing functions combine and transform the voice
streams in a call—to provide conferencing, music on hold, announcements, etc. Media
processing is also needed to make sure the connection between two phones results in a
path where both users can hear each other.
6. Application interfaces. The PBX also provides a voice network used to deliver voice
services beyond simple endpoint-to-endpoint calling, including voice mail, interactive
voice response (IVR), and other applications.
Additional Attributes. Several other important attributes of PBXs affect the way they are
• Reliability. Most businesses expect their communication system to be available
essentially all of the time. As a result, redundancy of components is built into all large
• Scalability. Most businesses expect to grow over time and don’t want to switch
communication systems as they do so. Supporting more users by “adding on” is an
aspect of most systems.
• Cost effectiveness at various sizes. To be effective in the marketplace, communication
systems must be cost-effective for businesses of all sizes.
• Interoperability. It is important for communications systems to work with systems and
devices made by other manufacturers.
Traditional Private Branch Exchanges
YOUR FATHER’S PBX: THE TRADITIONAL ARCHITECTURE
How do you design a system that can provide the capabilities and has the reliability and
scalability attributes described in the previous section?
The technology currently available obviously impacts system design. For example, in the
early 1980s when PBX systems were developed, computing was an expensive resource.
Microprocessors were available but were limited in function and also expensive. Data
networking was relatively unknown and often based on circuit-switched models such as
Most PBXs developed at this time have a common architecture:
1. A control processor to run the software that operates system features. This processor
is typically built to support the reliability and scalability expected of PBXs.
2. The communication software that runs on the control processor. This application drives
all of the system components and determines the functions that it provides.
3. Endpoints used to access the features and functions of the system. There are two
kinds of endpoints: Digital phones provide convenient access to calling functions through
buttons that are used to tell the system what the user wants (hold, transfer call, etc.)
and through a small display used to show who is calling. These endpoints are usually
proprietary to a single manufacturer. All traditional PBX systems also support analog
phones that provide basic calling functions. Both digital and analog endpoints connect to
interface cards in modules.
4. Modules. Sometimes called shelves, these house the interface cards that provide
endpoint or gateway interfaces. An individual interface is generally called a “port.” Ports
for digital phones usually provide enough power for the phone to operate even if the
power to the office fails, assuming that the module itself has backup power. Interfaces to
the PSTN are provided by a variety of interface cards. These interfaces convert from
signaling and voice formats expected by the PSTN to those used internally to the PBX.
Modules also provide a certain amount of switching among the interface cards held
within. Media processing for conferencing, music sources (for music on hold), and
announcements is either built as a card that fits into a module or is built right into the
5. Inter-module switching. This allows the interconnection of ports in different modules.
Traditional PBX systems accomplish this via circuit switching. In circuit switching a
dedicated path is set up between the two ports for the duration of the call. Calls to
phones not within the business are switched to an interface in a module that enables
connection to the PSTN. Often inter-module switching does not have enough capacity to
connect all possible calls simultaneously, and its success depends on the fact that not
everyone is on the phone at the same time. When call volume exceeds capability, calls
The components of the system must be networked together for two purposes. First, a
voice network is needed to create a voice path between devices. The voice network is
created from switching elements within and between the modules. This is usually done
via layers of TDM (time division multiplexing) switching, which is a technology that
transmits multiple signals simultaneously over a single transmission path.
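
The TDM switching described above can be illustrated with a minimal sketch. It assumes byte-interleaved framing in which every frame carries one eight-bit sample per channel; the function names and the slot map are hypothetical and do not come from any particular PBX.

```python
def tdm_multiplex(channels):
    """Interleave equal-length byte streams: each frame holds one sample per channel."""
    if len({len(c) for c in channels}) > 1:
        raise ValueError("all channels must supply the same number of samples")
    # zip(*channels) yields one tuple of samples (one frame) at a time.
    return bytes(sample for frame in zip(*channels) for sample in frame)


def tdm_demultiplex(stream, num_channels):
    """Recover each channel by reading every num_channels-th byte from its timeslot."""
    return [stream[slot::num_channels] for slot in range(num_channels)]


def timeslot_interchange(stream, num_channels, slot_map):
    """Switch by copying input slot slot_map[out] into output slot out, frame by frame."""
    if len(stream) % num_channels:
        raise ValueError("stream must contain whole frames")
    out = bytearray(len(stream))
    for start in range(0, len(stream), num_channels):
        for out_slot in range(num_channels):
            out[start + out_slot] = stream[start + slot_map[out_slot]]
    return bytes(out)
```

Swapping two entries in the slot map exchanges the samples of two ports in every frame, which is the sense in which a shared TDM medium can act as a switch.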
Second, a control network is required so the components can communicate with each
other to implement system operations. For example, when a user pushes a button on a
digital phone, a message indicating the operation requested is sent from the phone to the
module it connects to and then to the communication application software. The control
network is implemented in a variety of ways, often by stealing some of the TDM timeslots
in each module and dedicating them to the transmission of control messages.
The data representation of voice is a 64-kilobits-per-second (64-Kbps, or 8,000 eight-bit
samples per second) isochronous stream. Each voice sample is generally encoded in one of
two formats: Mu-law (used mainly in North America and Japan) and A-law (used almost
everywhere else). This format matches the one used in digital interfaces to the PSTN.
Using the same voice representation in a PBX as in digital interfaces to the PSTN reduces
the work needed for the PBX to interoperate with the PSTN.
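
As a worked example of the figures above, the sketch below computes the stream rate and applies Mu-law companding. It uses the continuous companding formula with a constant of 255, which is an assumption made for illustration; the segmented bit layout used on real PSTN interfaces differs in detail, and the function names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 8000  # samples per second
BITS_PER_SAMPLE = 8    # one eight-bit sample
MU = 255               # companding constant (assumed for this sketch)


def bit_rate_bps():
    """Rate of one voice stream: 8,000 samples/s times 8 bits."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE


def mu_law_compress(x):
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y):
    """Invert mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def to_code(y):
    """Quantize a companded value to an 8-bit code 0..255 (illustrative layout only)."""
    return min(255, int((y + 1.0) / 2.0 * 256))
```

Companding spends more of the eight bits on quiet signals: an input of 0.01 compresses to roughly 0.23, so small amplitudes are spread over many more code values than a linear scale would give them.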
A module, along with a control processor (often housed in a special slot in a module) and
a few interface cards, can provide service to a small number of users.
Systems grow by adding modules and interconnecting them with inter-module switching.
The capacity of the modules and of the inter-module switching determines how small or
large a system can be economically designed. This is typically based on the amount of
switching capacity in these components (see figure 1).
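
One common way to size switching capacity that cannot connect every call at once is the Erlang B formula from traffic engineering. The article does not name this formula; it is brought in here as an illustration, and the function names are hypothetical.

```python
def erlang_b(offered_erlangs, circuits):
    """Probability that a new call is blocked, given offered traffic and circuit count."""
    if offered_erlangs < 0 or circuits < 0:
        raise ValueError("traffic and circuit count must be non-negative")
    blocking = 1.0  # with zero circuits every call is blocked
    for n in range(1, circuits + 1):
        # Standard recurrence: B(n) = A*B(n-1) / (n + A*B(n-1))
        blocking = (offered_erlangs * blocking) / (n + offered_erlangs * blocking)
    return blocking


def circuits_needed(offered_erlangs, max_blocking):
    """Smallest number of circuits whose blocking probability is within the target."""
    if not 0 < max_blocking <= 1:
        raise ValueError("max_blocking must be in (0, 1]")
    n = 0
    while erlang_b(offered_erlangs, n) > max_blocking:
        n += 1
    return n
```

For example, one erlang of offered traffic on a single circuit blocks half of all call attempts, and adding circuits drives that probability down quickly.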
Technology and Architecture
PBX systems have been present in businesses since the early 1980s with little change
and predate data networking and PC technologies. After 15 years of relative stability,
however, virtually all PBX vendors are now introducing radical changes to their
What technologies are enabling this change? Perhaps the most important is the
development of packet switching into an IP-based network with the bandwidth, speed
(low delay), and reliability to support voice communications. The development of this
technology and its use in data networking have both enabled the change and provided a
driver for it. Since most business data networks span the breadth of their organization, it
became possible and advantageous to offer voice communication throughout an
enterprise while using only one network.
Another essential enabler was the creation of inexpensive DSP (digital signal processor)
technology. For voice streams to ride an IP network, they must be packetized and
perhaps compressed. These operations require digital signal processing. Without the
availability of inexpensive DSP technology, IP phones would have been too expensive
compared with their traditional counterparts.
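
The cost of packetizing can be estimated with a little arithmetic. The sketch below assumes an uncompressed 64-Kbps stream, 20 milliseconds of audio per packet, and the minimum header sizes for RTP (12 bytes), UDP (8 bytes), and IPv4 (20 bytes); link-layer overhead is ignored, and the packet interval is an assumption rather than a figure from the article.

```python
CODEC_BPS = 64_000      # uncompressed voice stream
RTP_HEADER_BYTES = 12   # minimum RTP header
UDP_HEADER_BYTES = 8
IPV4_HEADER_BYTES = 20  # IPv4 header without options


def payload_bytes(ms_per_packet=20):
    """Bytes of voice carried in one packet."""
    return CODEC_BPS // 8 * ms_per_packet // 1000


def split_into_packets(samples, ms_per_packet=20):
    """Cut a byte stream of voice samples into packet-sized payloads."""
    size = payload_bytes(ms_per_packet)
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def ip_bandwidth_bps(ms_per_packet=20):
    """Bandwidth of one voice stream at the IP layer, headers included."""
    headers = RTP_HEADER_BYTES + UDP_HEADER_BYTES + IPV4_HEADER_BYTES
    packets_per_second = 1000 / ms_per_packet
    return (payload_bytes(ms_per_packet) + headers) * 8 * packets_per_second
```

With these assumptions a 64-Kbps stream needs 80 Kbps at the IP layer, and halving the packet interval to 10 milliseconds raises that to 96 Kbps, which is one reason compression is attractive on limited links.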
Another technology lowering the cost of IP devices was the creation of inexpensive
network interface chip sets. Fast and inexpensive microprocessors also allowed more
intelligence to be distributed to phone modules and interface boards and enabled the use
of IP to distribute these functions.
THE NEW ARCHITECTURE
These technology changes have led to a re-design of PBX components. They are evolving
to distribute components farther apart, to incorporate more off-the-shelf components,
and to use an IP network to transmit both control information and voice. The common
term for a PBX with this new architecture is IP-PBX.
Control Processor and Communication Software
The control processor often is an off-the-shelf server that runs communication application
software on a standard operating system (Microsoft Windows, Unix, or Linux). The benefits of
moving to commercially available hardware and software are substantial, allowing
vendors to lower their development costs.
Digital endpoints become IP phones and connect to the IP network rather than to
dedicated interfaces in a module. The phones use the IP network to communicate both
control and voice streams.
IP phones put some of the same strain on the data infrastructure as does a PC. Each
requires an IP address and generally needs DHCP (dynamic host configuration protocol)
service to acquire that address. IP phones often use an FTP server to get a new version
of their firmware, and—for security or voice-quality reasons—they may be put into a
special VLAN (virtual LAN), etc.
Phones require an IP address so that they can be identified by the communication
application. Using standard IP network mechanisms, the phone acquires an IP address
and the address of the application server, and “registers” itself. Registration allows the
communication application to establish a correspondence between a phone number and
the IP address used by the phone. Thus, users still dial familiar phone numbers, and the
communication application uses the IP address of the phone to communicate with it.
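
The registration step can be sketched as a simple lookup table. The class and method names below are hypothetical, and the sketch ignores authentication, expiry, and the signaling protocol that would carry the registration.

```python
import ipaddress


class Registrar:
    """Keeps the correspondence between phone numbers and current IP addresses."""

    def __init__(self):
        self._address_by_number = {}

    def register(self, phone_number, ip_address):
        """Record (or update) the address a phone number is reachable at."""
        # ip_address() raises ValueError if the string is not a valid address.
        self._address_by_number[phone_number] = ipaddress.ip_address(ip_address)

    def unregister(self, phone_number):
        """Forget a phone; unknown numbers are ignored."""
        self._address_by_number.pop(phone_number, None)

    def lookup(self, phone_number):
        """Return the registered address as a string, or None if not registered."""
        address = self._address_by_number.get(phone_number)
        return None if address is None else str(address)
```

Because the table is keyed by phone number, a phone that moves to another office and acquires a new address simply registers again, and callers keep dialing the same number.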
IP phones use the IP network to carry voice streams directly to each other. Unlike the
traditional architecture, when two IP devices are talking directly together, they do not
use communication system resources to create the voice path.
Interfaces are still needed to provide access to the PSTN and to analog endpoints such as
analog phones and fax machines. These “interfaces” are housed in gateways operating in
much the same way as do modules in the traditional architecture, but they convert the
signaling and voice streams to IP.
Some manufacturers also provide gateway interface cards allowing customers to continue
using existing digital phones, protecting their investment in their existing infrastructure
and reducing the cost of migrating to the new system.
Inter-module switching is done over the IP network where bandwidth can be limited in
certain circumstances (across the wide area, for example). Without provisions to limit the
number of calls across a limited resource, the voice quality will degrade. This is
analogous to “blocking” in circuit-switched networks, but all calls degrade in quality
instead of some being blocked. Some systems enforce “call admission control” so that
only a limited number of calls are allowed across the limited links, allowing those calls
that do get through to maintain optimum voice quality.
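
A minimal sketch of call admission control follows. It assumes every call consumes the same fixed bandwidth and that the system knows the capacity reserved for voice on the limited link; the class name and the figures in the usage note are hypothetical.

```python
class LinkAdmissionControl:
    """Admits calls across a limited link only while bandwidth remains."""

    def __init__(self, link_bps, per_call_bps):
        if per_call_bps <= 0:
            raise ValueError("per_call_bps must be positive")
        self.max_calls = link_bps // per_call_bps
        self.active_calls = 0

    def try_admit(self):
        """Return True and count the call if it fits; otherwise reject it."""
        if self.active_calls >= self.max_calls:
            return False
        self.active_calls += 1
        return True

    def release(self):
        """Free the bandwidth of a finished call."""
        if self.active_calls > 0:
            self.active_calls -= 1
```

With 256 Kbps reserved for voice and an assumed 80 Kbps per call, three calls are admitted and the fourth is rejected outright, so the three in progress keep their quality.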
Gateways may also provide the media processing found in traditional modules. Logically,
however, this is a separate function, and specialized media processors may be used for
Collectively the endpoints, modules, media processors, control processor, and
communication software use the IP network to provide the same realtime voice
communication functions as provided by a traditional PBX. The new architecture is a
client-server approach: the clients are gateways and endpoints, and the server provides
the communication application that operates the features. This approach is similar to the
way e-mail or Web services are implemented, with a central server providing service to a
set of client PCs.
It is important that the IP-PBX be “well-behaved” from a network administration point of
view, with common tools and protocols for operation and management.
The other voice applications found in a traditional voice network—voice-mail and IVR—are
also migrating to the IP network used for communications (see figure 2).
CHALLENGES TO THE NEW ARCHITECTURE
As customers migrate to this new converged network architecture, they generally expect
to keep all the positive functions and attributes of their traditional PBX while gaining new
advantages. The IP-PBX has some challenges in this regard.
Traditional PBXs are highly reliable systems (many manufacturers claim 99.999 percent
reliability—about five minutes of outage per year). The traditional PBX architecture
achieves this with highly reliable components and with redundancy built in by the
A major question is how to provide reliability for the control processor of an IP network.
It is a key component because if it fails, all the users of the system (who may number
into the tens of thousands) will be without service. One approach to this problem is to
rely on the fact that gateways and phones can register to multiple servers. If the server
to which they are registered fails, they can register to a backup. Depending on how this
is done and the intelligence of the gateway and the phone, the re-registration may or
may not affect the active connection (voice path between the devices).
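
The re-registration approach can be sketched as follows. The sketch assumes each endpoint holds an ordered list of servers and has some way to test whether a server is reachable (passed in here as a function); the names are hypothetical, and what happens to calls in progress is outside its scope.

```python
class Endpoint:
    """A phone or gateway that can register to any of several servers."""

    def __init__(self, servers):
        self.servers = list(servers)  # in priority order, primary first
        self.registered_to = None

    def ensure_registered(self, is_reachable):
        """Stay on the current server if it is up; otherwise pick the first reachable one."""
        if self.registered_to is not None and is_reachable(self.registered_to):
            return self.registered_to
        self.registered_to = next(
            (server for server in self.servers if is_reachable(server)), None
        )
        return self.registered_to
```

In this sketch an endpoint that has moved to a backup stays there until the backup itself fails; returning to a recovered primary is a policy decision that is left out.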
If the main processor and the backup share call control information, then after re-
registration the callers can continue their conversation and conference additional parties,
etc. This call continuity can be particularly critical in contact center operations where it is
important not to disconnect callers “in queue” who are waiting to be answered.
Keep in mind that IP networks can be designed to be highly reliable—with multiple paths
from device to device. In many real-world environments, however, there is a single IP
link between the control processor and a gateway. If this link fails, then the area served
by the gateway will no longer have service. This problem can be addressed by building
intelligence into the gateway or into a separate processor so that control functions
continue in the event that a primary link is lost.
Voice is more demanding than traditional data communications (such as e-mail, Web
pages, etc.) because of its realtime nature. To ensure voice quality, the following
attributes of the IP network must be managed for all possible voice paths:
• Bandwidth. For the expected number of simultaneous IP voice calls
• Round-trip delay. The time it takes a packet to go from one IP device to another and back
• Jitter. Variability in delay
• Packet loss. The number of packets lost (usually expressed as a percent)
Techniques for establishing an IP network suitable for voice must be addressed before
the new architecture can be adopted. This usually requires incorporating various quality-
of-service capabilities into the IP network, as well as additional bandwidth.
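
Two of the attributes listed above can be computed directly from packet measurements. The sketch below assumes per-packet delay samples in milliseconds and simple packet counts; it uses the mean absolute difference between consecutive delays as a basic jitter measure, which is a simplification and not the smoothed estimator that RTP implementations report. The function names are hypothetical.

```python
def packet_loss_percent(sent, received):
    """Share of packets that never arrived, as a percentage."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


def mean_jitter_ms(delays_ms):
    """Average absolute change in delay between consecutive packets."""
    changes = [abs(later - earlier) for earlier, later in zip(delays_ms, delays_ms[1:])]
    return sum(changes) / len(changes) if changes else 0.0
```

A path with a constant delay has zero jitter by this measure no matter how long the delay is, which is why delay and jitter have to be managed as separate attributes.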
The traditional PBX architecture implements echo suppression mechanisms that assume a
circuit-switched network. Within an IP network, the delay increases, requiring changes in
echo-suppression capabilities. These considerations affect IP endpoints and gateways.
Traditional PBXs carry much more than voice over their “voice” networks. For example,
modem traffic, fax, and multiple 64-Kbps channels for video are all found in a large
enterprise. The equipment using these streams may not work well if the streams are
transformed into packets and back into a continuous stream. The delay and possible
packet loss introduced by the data network make it impossible for endpoints to maintain
the synchronization they expect from a circuit-switched network. These limitations are
being addressed as vendors create encoding and error-correcting techniques suitable for
IP Telephone Operations
Traditional PBX interface cards provide power to analog and digital phones. This job now
goes to the IP network. There are several ways to deliver power to the endpoints, but the
most convenient is to have the data switch in the closet provide “inline” power over the
IP network. Standards for doing so have recently been ratified so that data-switching
equipment from one vendor can power phones from another vendor. If communication is
to be preserved through a power outage, the data switches need to be on uninterruptible
One advantage of IP telephones is they can be easily moved from one office to another.
One difficulty with moving phones is that 911 services require information on the location
of the phone placing an emergency call. Again, standards are emerging that allow the IP-
PBX placing the emergency call to query the data network for the identity of the data port
to which the phone is connected. That port can be associated with a physical location for
As PBX components “disaggregate” and become attached to an IP network, they also
become potential targets for intrusion, denial of service, and other hacking threats. These
voice communication system components must be hardened against attacks, like other
parts of the network infrastructure. Some vendors offer encryption of voice packets to
prevent eavesdropping via tools commonly found on the Internet.
WHY: BUSINESS DRIVERS
Moving to IP telephony over a converged network offers several important advantages
over the traditional PBX approach, leading vendors to insist that IP telephony is the
future and that virtually all PBX systems sold in coming years will use this new
Using the IP network to link IP-PBX components together gives an enterprise substantial
flexibility in how a system can be configured. Remote locations can be incorporated into a
single enterprise-wide communication system. Remote workers can have the same
communications capabilities as those working in a headquarters facility. This can improve
the communication capabilities within an enterprise, while lowering the total cost of
system implementation and operation.
Software packages such as databases, SNMP (simple network management protocol)
development environments, and Web servers are available on standard platforms. Thus,
the communication system vendor can more easily integrate these components with the
telephony application in an IP environment. This allows operators of the IP-PBX to use
familiar tools (Web browsers, SNMP management interfaces, etc.) to operate the system,
resulting in lower administrative costs.
Open servers and IP network bandwidth also enable organizations to scale their
communication systems to larger sizes. This can increase efficiency, particularly in
contact-center implementations. Businesses can link employees in distributed locations to
deliver “follow-the-sun” customer service and to take advantage of lower labor costs
available in some parts of the world.
It is also possible to use the resiliency of the data network to increase the availability of
voice communications. Businesses are moving their critical data and voice communication
components to hardened, geographically dispersed “bunkers.” This makes the business
more resilient in the face of fires, floods, and other disasters. The architecture of the IP-
PBX lends itself naturally to this structure as the control processor can be located at a
distance from the endpoints and modules.
Perhaps more important than these network-based advantages is that the communication
system is an application on the data network just like the other applications used in
business. Thus, this application can be integrated with other business services such as
directories, e-mail services, etc. Many systems allow dialing from corporate directories or
personal information managers and integration of voice and e-mail.
Now that the market has begun to evolve toward a new PBX architecture, what changes
can we expect to see?
First, the difficulties and limits of the new architecture will be overcome. Enterprises
expect the new systems to be as reliable as traditional PBX systems and to accomplish all of
their functions. Thus, modem and other nonvoice TDM traffic will move over the IP network
as it has moved over traditional voice networks. Standards are being defined and the
increasing capacity of DSPs will bring this about in a cost-effective way.
The expectations of reliability for IP-PBXs will drive developments in the reliability and
availability of the new architecture. Since an essential component of the new architecture
is the IP network, improved diagnostic and network analysis tools will enable the quick
diagnosis and repair of network problems impairing voice communications. Since security
breaches will be able to disable both voice and data applications, techniques to protect
critical business networks from denial-of-service and other attacks will be deployed. IP
networks will become more resilient for all applications, not just communications.
Communication systems will take advantage of the new IP-based architecture by scaling
larger and reaching farther. Even large enterprises will likely be able to implement a
single communication system that ties all their employees together around the world.
Rich collaboration and video communication applications will merge with voice
applications—becoming as easy to use and ubiquitous as traditional voice
communications. Voice quality will no longer be tied to traditional network bandwidths;
video room systems will provide stereo sound so listeners can locate talkers by position,
improving audibility and “liveness.”
Audio capabilities will merge into PCs and into other mobile devices. No longer will mobile
workers have to carry a “tool belt” of different communication devices.
We can expect such new capabilities to continue to drive the evolution from traditional
PBX solutions to new, full-featured IP-PBX models that will change the way businesses
communicate—delivering greater productivity, cost savings, and mobility.
Definitions and Acronyms
Isochronous. A communication link characterized by both ends using a common clock
source to send a constant bit stream.
IVR (interactive voice response). An automated system that can understand human
speech and provide prerecorded information to the caller.
PSTN (public switched telephone network). The worldwide telephone network. The
standards for PSTN interfaces are specified by the CCITT (now ITU, International
TDM (time division multiplexing). A multiplexing technique by which a communication
medium is divided into discrete time slots. Each time slot can be used as a
communication channel between two devices. If multiple devices attach to a TDM
medium, then the medium can be used as a switch.
JAMES E. COFFMAN has been with Avaya Labs for more than 20 years, working in a variety of areas in
telecommunications such as multimedia communications, VoIP systems development, VoIP standards, Web access
to call centers, CTI (computer telephony integration) systems development, and operating systems. He currently
is a director in the vertical markets development group responsible for vertical market technology. Previously, he
directed MultiVantage (now called Avaya Communication Manager) technical architecture and planning at Avaya
Labs. Before that, Coffman was responsible for planning and architecture in bringing IP telephony to the Definity
communication platform. Coffman has a B.A. in mathematics from Reed College and an M.S.E.E. and Ph.D. in
computer science from the University of Pennsylvania. He holds several patents in the telecommunications area.
VoIP: What is it good for?
ACM Queue vol. 2, no. 6 - September 2004
by SUDHIR R. AHUJA AND J. ROBERT ENSOR, BELL LABS/LUCENT TECHNOLOGIES
If you think VoIP is just an IP version of telecom-as-usual, think again. A
host of applications are changing the phone call as we know it.
VoIP (voice over IP) technology is a rapidly expanding field. More and more VoIP
components are being developed, while existing VoIP technology is being deployed at a
rapid—and still increasing—pace. This growth is fueled by two goals: decreasing costs
and increasing revenues.
Network and service providers see VoIP technology as a means of reducing their cost of
offering existing voice-based services and new multimedia services. Service providers
also view VoIP infrastructure as an economical base on which to build new revenue-
generating services. As deployment of VoIP technology becomes widespread and part of
a shared competitive landscape, this second goal will become more important, with
service providers working to increase their market bases.
Most current and envisioned VoIP services are so-called converged services, integrating
features and functions from multiple existing services. Often, features from conventional
voice-based telephony services are combined with those found in data network services.
For example, click-to-dial services allow users to control telephone calls from Web
browsers running on their personal computers. Converged services may also provide
users with new media integration. For example, multimedia conference services allow
users to interact with each other through calls in which they exchange both audio and
video information (i.e., new versions of videophones).
The growing opportunities for converged telephony-Web services are motivating
convergence of telephony and data networks. VoIP services are also driving another
network convergence: integration of wireless and wireline networks. More general
network convergence seems likely. Because IP networks can be relatively inexpensive,
network providers are encouraged to build common IP core networks surrounded by
various access networks. These access networks (wireless, wireline, cable, etc.) can
share the IP core resources, and thus reduce the costs of providing common services to
customers with different access devices.
Many engaging VoIP services are already available, and service providers are planning
even more exciting services. Continued deployment of IP networks and IP endpoint
devices will enable further development of new services. Also, as the processing capacity
of IP endpoints increases—allowing them to deal directly with network access controls,
multiple data formats, and transformations—more innovative and convenient services will
become possible. This article introduces some noteworthy services that are being
deployed today and highlights a few of the interesting future services.
CREATING NEW SERVICES
Conventional telephony services—those available to customers through the public
switched telephone network (PSTN)—are built upon a highly structured technology base.
This base was created and optimized to support voice calls using analog telephones. The
base provides application developers with integrated signaling/media transport (in-band
signaling) and a limited set of signal handlers and media processors, which are isolated
from other networks through their switched circuit connections. Since telephones support
very limited signaling mechanisms, invocation and control of PSTN services have been
awkward. Some services are invoked by dialing special phone numbers such as 800 or
900 numbers. PSTN services are often invoked and controlled through in-band signaling,
which is typically activated through touch tones (DTMF, dual-tone multi-frequency) or
voice (IVR, interactive voice response).
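
To make in-band signaling concrete, the sketch below generates the audio for a touch tone. Each DTMF key is the sum of one low-group and one high-group frequency from the standard keypad grid; the 8,000-Hz sample rate, the 50-millisecond duration, and the function names are assumptions made for illustration.

```python
import math

LOW_GROUP_HZ = (697, 770, 852, 941)        # one frequency per keypad row
HIGH_GROUP_HZ = (1209, 1336, 1477, 1633)   # one frequency per keypad column
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key):
    """Return the (low, high) frequency pair for a single keypad character."""
    if len(key) != 1:
        raise ValueError("key must be a single character")
    for row_index, row in enumerate(KEYPAD):
        column_index = row.find(key)
        if column_index != -1:
            return LOW_GROUP_HZ[row_index], HIGH_GROUP_HZ[column_index]
    raise ValueError(f"not a DTMF key: {key!r}")


def dtmf_tone(key, duration_s=0.05, sample_rate=8000):
    """Generate samples in [-1, 1] for the tone that represents the key."""
    low, high = dtmf_frequencies(key)
    count = round(duration_s * sample_rate)
    return [
        0.5 * (math.sin(2 * math.pi * low * n / sample_rate)
               + math.sin(2 * math.pi * high * n / sample_rate))
        for n in range(count)
    ]
```

Because the tone travels in the same channel as speech, the receiving end has to detect these frequency pairs in the audio itself, which is part of what makes control over the voice path awkward.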
Fundamental control and media handling needed by PSTN service providers must be
performed by special network elements (signaling control points, service nodes, etc.).
Figure 1 illustrates key components of a call-center service. In this figure, the 800 server
is a service node; it is an application server that communicates with the class 5 and 4
switches via SS7 (Signaling System 7) signaling protocols. This server deals only with
control messages and not with voice itself. It helps establish the final route for the voice
call based on the features it has implemented. For example, it can determine whether a
call is routed to a company’s call center or to one of its retail outlets.
Service providers may require control and media processing not supported by network
elements. This additional processing must be handled at call endpoints. The flow of
information into and from an endpoint is through the voice channel itself, and therefore
specialized controls must be built on audio controls (e.g., conversations with human
operators, DTMF, or IVR). In figure 1, these endpoint application servers are represented
by the call-center IVR server, which terminates voice connections and communicates via
in-band signaling using DTMF or voice recognition.
VoIP technology provides richer, more flexible foundations for building communication
services. IP networks support independent connections for signaling and media traffic.
This decoupling of signal and bearer traffic eliminates interference between the
information flows; in-band signaling is not required. Thus, communication with
application servers is simplified.
In addition, IP network topology allows any node to act as a server. Therefore, multiple
application servers and user endpoints—located in one or several service provider
domains—can communicate via IP to participate in service support.
Finally, IP transport is provided by various underlying networks, and different network
technologies can support different sets of services. For example, DSL and cable networks
provide broadband IP connections that support realtime voice, data, and video services.
Hence, these network providers can offer “triple-play” services to their customers.
Figure 2 shows how to implement a call-center service using VoIP technology. In the
figure, user endpoints are telephones (not IP-based devices) attached to wireline or
wireless access networks. An IP backbone interfaces to these specific access networks
through border elements (e.g., media gateways). These gateways terminate voice calls
for the users; they handle all TDM (time division multiplexing) voice traffic to and from
users. The gateways recognize DTMF signals from the users and convert them to SIP
(session initiation protocol) messages for the IP-based application servers. In addition,
they convert between the users’ TDM voice payload and RTP (realtime transport protocol)
media packets, which are used by the media processors. Several IP-based application
servers work in concert, coordinating their activities through SIP signaling to provide the
call-center service. The softswitch contains a SIP proxy to support this SIP coordination,
and it contains media control functions to support coordination of media processing. The
application servers may be geographically distributed and separated from endpoints and
switches. For example, Web sites can use stored voice or music files to provide
announcements. They can act as music-on-hold servers; a single announcement server is
VoIP technology provides a foundation for creating many new converged services through
different combinations of components. For example, IVR and Web components can
combine—using SIP as a common signaling protocol—to create call-center services with
access from Web browsers or IP phones, as well as voice-only telephones. Similarly, IVR
servers and SMS (short message service) can combine to create call-center services that
include SMS messages. Users will be able to access these call-center services via any of
their access mechanisms or even simultaneously use multiple access technologies to
provide better service. Alternatively, SMS systems can combine with Web-based
information servers to create MMS (multimedia message services) in which messages
may contain Web-based information and be retrieved by Web browsers.
NEW CONTROLS AND COORDINATION
Converged services can employ features from one set of services to control aspects of
other sets of services. For example, click-to-dial services combine Web-based user
interfaces with telephony servers to create Web-controllable phones. These services allow
users to select (highlight) phone numbers embedded in Web pages, indicating that these
numbers should be called. Such services are built by combining the PSTN, IP networks,
and IP-based servers.
Figure 3 shows how a typical click-to-dial service works. When customers use their Web
browsers to click on a telephone number within a Web page, their computer sends a
message over a packet network to an IP-based click-to-dial server. This server, in turn,
uses its connections to the PSTN to make telephone calls to the customer and to the
number that customer is dialing. These calls are then bridged into a single call by a PSTN
This example illustrates an important characteristic of VoIP services: they can be made
as collections of multiple servers. These servers typically base their coordination on SIP
signaling. SIP, however, provides a means only to locate and synchronize the initial
interaction among the appropriate servers. Once the servers have rendezvoused through
SIP, they must then exchange application-specific signaling through appropriate
specialized protocols. In this example, the click-to-dial client and the Web server must
use agreed-upon protocols (typically including HTTP) so that Web pages can be
transferred to the user. Also, the click-to-dial client and the click-to-dial server must
use an agreed-upon protocol to request and control the required telephony
Service coordination and composition become important issues in the development and
execution of VoIP services, as multiple application servers are often involved. The
industry must develop techniques to coordinate distinct service elements within sessions.
One fundamental problem is that service behavior is difficult to describe both formally
and conveniently, which makes service coordination labor-intensive. A related problem is
that the multiple servers used to create a service might not be in the same network.
Therefore, one service provider might not be willing to publish details of its server for
another provider. Another difficulty is that services can interfere with each other. For
example, if a conference participant temporarily leaves, generating music on hold, this
behavior can interfere with or even block continuation of the conference by the remaining
NEW MEDIA INTEGRATION
Many VoIP services are based on integration of multiple media. One such service is
multimedia conferencing, which can be implemented by taking advantage of both SIP
signaling and IP transport. SIP messages are available for server registration and
rendezvous, as well as the controls that are needed to set up, conduct, and end sessions.
Additional IP control messages are used to send media-specific commands. For example,
service customers can use these commands to select video feeds, change codecs, change
multicast groups, etc. IP transport is used to move the data representing the various
media to and from servers and among users.
Figure 4 illustrates a conferencing service. Similar in overall structure to the IVR service
depicted in figure 3, this system is based on a different set of servers: a multimedia
conference server, an audio bridge, video server, and data-sharing server. The
conferencing server coordinates the activities of the data-specific servers, which
manipulate different sets of packet data corresponding to appropriate media. For
example, the audio bridge receives encoded voice from all participants and distributes
combined voice data back to the participants. As the figure illustrates, uniformity of
endpoint devices is not required: each customer can participate in a conference through
a different type of endpoint (e.g., cellphone, analog phone, or laptop). The media
transmitted to/from each participant depends upon the capabilities of the participant’s
QoS (quality of service) is an important issue for IP-based multimedia services. Many
current IP services have been deployed without QoS guarantees from underlying network
providers. These services are successful because transport quality is sufficient to meet
customer demands. Providers of these services, however, do not have assurances that
their services can grow to meet the needs of larger customer bases while also meeting
time constraints for the services. For example, IP-based voice and video services are
being deployed in enterprises without explicit QoS support. Since the enterprise LANs
used for transport have enough bandwidth to allow over-provisioning for realtime voice
and video, these services are successful. Timely transport of time-sensitive data,
however, to support realtime multimedia conversations across worldwide networks, is
harder to ensure.
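The over-provisioning argument can be made concrete with a little arithmetic. The sketch below is illustrative only and rests on assumptions not taken from the article: a G.711 codec at 64 kbit/s, 20 ms of speech per packet, 40 bytes of IPv4, UDP, and RTP headers per packet, no link-layer overhead, and a hypothetical 100 Mbit/s LAN on which voice may use a quarter of the capacity.

```python
# Assumed per-packet header overhead: IPv4 (20) + UDP (8) + RTP (12) bytes,
# with no IP options or RTP extensions and no link-layer framing.
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12


def voice_stream_kbps(codec_kbps: float, packet_ms: float) -> float:
    """One-way IP-layer bandwidth of a single voice stream, in kbit/s."""
    packets_per_second = 1000.0 / packet_ms
    payload_bytes = codec_kbps * 1000.0 / 8.0 / packets_per_second
    packet_bytes = payload_bytes + IP_UDP_RTP_HEADER_BYTES
    return packet_bytes * 8.0 * packets_per_second / 1000.0


def max_streams(link_kbps: float, per_stream_kbps: float, utilization: float) -> int:
    """Streams that fit if voice may use only `utilization` of the link."""
    return int(link_kbps * utilization // per_stream_kbps)


if __name__ == "__main__":
    g711 = voice_stream_kbps(codec_kbps=64, packet_ms=20)
    print(f"G.711 with 20 ms packets: {g711:.1f} kbit/s per direction")
    print("Streams on 100 Mbit/s at 25% utilization:",
          max_streams(100_000, g711, 0.25))
```

Under these assumptions a single stream needs about 80 kbit/s in each direction, so a LAN has ample headroom. The same calculation on a congested wide-area link gives a much smaller budget, which is where explicit QoS support starts to matter.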
We must solve these problems by using adequate transport performance and servers
within the signaling and media transport paths that can react to messages within realtime
constraints. These servers must process both signaling and bearer traffic within time
bounds to meet processing needs associated with transcoding, composition, distribution,
etc. Currently, servers capable of this processing are economical only for certain
NEW USES OF SESSIONS
SIP sessions can be long-lived, and persistent sessions provide the foundation for some
interesting new VoIP services. One example is an enhanced chat-room service, called
Telechat, illustrated in figure 5.
In this application users can interact through voice, video, and data during multimedia
conferences. They can also exchange private and public (broadcast) messages. Users can
create and access stored data in a shared repository. The data can be imported from
other applications, generated during chat sessions, and accessed during or outside of
multiparty conferences. Service sessions are not restricted to calls, so they can be long-
lived, extending over multiple calls or over other, shorter sessions. These longer sessions
can form the basis for persistent state and data storage.
Persistent sessions support long-term interactions—and can serve as the rendezvous
point for multiple calls. In addition, a persistent session can provide storage for data used
in these calls. Hence, a persistent session can act as a direct representation for a long-
term group effort. Enhanced chat-room services can be built upon persistent sessions,
which can maintain a room state that is stable over the span of several chat sessions.
This persistent state creates a context or surrounding environment for a series of chat sessions.
Persistent sessions create new challenges for system designers. Developers must decide
where to maintain session state, which can be distributed among network servers and
endpoints or restricted to subsets of these elements. Designers must also decide where
to store the data associated with the sessions. In Telechat, for example, session state is
stored on multiple servers. In a related issue, service providers must decide who owns
what data. Billing for the resources needed to store persistent state is also a source of
several design decisions. For example, service providers must specify whether a person
who joins a long-term session pays for the session or pays for the connection/interaction
with the session.
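One way to picture a persistent session is as a small data structure that outlives the calls it hosts. The sketch below is a hypothetical model, not the Telechat design: the class names, fields, and single-process storage are all assumptions made for illustration, and the questions of where state lives and who pays for it are left open.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Call:
    """A short-lived call that takes place inside a longer session."""
    call_id: str
    participants: Set[str]
    active: bool = True


@dataclass
class PersistentSession:
    """Hypothetical room state that survives across individual calls."""
    room: str
    members: Set[str] = field(default_factory=set)
    repository: Dict[str, str] = field(default_factory=dict)  # shared stored data
    calls: List[Call] = field(default_factory=list)

    def join(self, user: str) -> None:
        self.members.add(user)

    def start_call(self, call_id: str, participants: Set[str]) -> Call:
        # Only session members may rendezvous in a call.
        unknown = set(participants) - self.members
        if unknown:
            raise ValueError(f"not session members: {sorted(unknown)}")
        call = Call(call_id, set(participants))
        self.calls.append(call)
        return call

    def end_call(self, call_id: str) -> None:
        # Ending a call leaves the session and its repository intact.
        for call in self.calls:
            if call.call_id == call_id:
                call.active = False


if __name__ == "__main__":
    session = PersistentSession(room="design-review")
    for user in ("alice", "bob"):
        session.join(user)
    session.start_call("call-1", {"alice", "bob"})
    session.repository["notes"] = "agreed on codec list"
    session.end_call("call-1")
    # The stored data is still there for the next call.
    session.start_call("call-2", {"alice"})
    print(session.repository["notes"], len(session.calls))
```

Keeping the repository on the session rather than on any one call is what lets data be accessed during or outside of conferences. A real service would also have to decide how this state is replicated across servers.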
ONLY THE BEGINNING
VoIP is a disruptive technology that is causing significant change in the way voice
communication services are delivered. It is providing future roadmaps for telecom
networks. This is only the beginning of a more significant move to convergence. As the
world moves to a common IP-based data network as backbone, VoIP is only one of the
realtime services offered on such networks, along with many data services. The same
network will also support video services from videoconferencing to entertainment video.
More important, these services allow convergence at the control and user levels. A user
can initiate a call or TV program from the Web and then send a video from a camera
phone to the user’s home Web site. Common Web-based services can be used for
provisioning the user’s personal choices. Clearly, this is only the beginning of exciting
services offered by full multimedia on IP.
An important architectural change is that all application servers will move out of specific
networks and become more access-independent. Networks will become multiservice
platforms. To do this effectively, networks have to provide flexible QoS mechanisms and
the ability to create virtual networks to match the services being deployed. This is where
many of VoIP’s challenges remain to be solved. Specifically, we still need ways to specify
network requirements of a particular application (e.g., multiparty audio-conferencing)
and we need to be able to map that to the multiservice network. Finally, we need to be
able to provision such services and monitor their execution to guarantee delivery.
Last, but not least, is the challenge of integrating the ever-smarter endpoint and
endpoint-based applications with the network-centric view presented earlier. Besides new
service interaction issues, this raises many new concerns about ownership of the user’s
data, authentication, billing for services, and responsibility for security.
VoIP is here and already leading the way not just to cheaper voice calls but also to a host
of new applications. We need to focus on the challenges to enable a host of new
What Is SIP?
SIP (session initiation protocol) is a text-based protocol for initiating communication
sessions between users. These sessions may include calls with conventional telephones,
voice, video, and data calls, multimedia conferencing, streaming media services, games,
etc. SIP is defined by a collection of Requests for Comment managed by the Internet
Engineering Task Force (IETF).
SIP messages are exchanged among two or more peers (IP nodes) for rendezvous and
synchronization, thus supporting initiation of interactive communication sessions.
Once communicating parties have started their session through SIP messages, they are
able to conduct the session through session-specific message exchange. These parties
may also use SIP for additional session events, such as adding and dropping session
members, changing media, and ending sessions.
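Because SIP is text-based, a request can be assembled and inspected with ordinary string handling. The following minimal sketch builds an INVITE in the general layout defined by RFC 3261 and splits it back into a start line and headers. The host name, branch, and tag values are placeholders, and the parser is deliberately simplified: it ignores header folding, compact header names, repeated headers, and case-insensitive matching, all of which a real implementation must handle.

```python
from typing import Dict, Tuple


def build_invite(caller: str, callee: str, call_id: str) -> str:
    """Return a bare-bones INVITE request with no message body."""
    lines = [
        f"INVITE sip:{callee} SIP/2.0",
        # Placeholder transport address and branch parameter.
        "Via: SIP/2.0/UDP client.example.com;branch=z9hG4bK776asdhds",
        "Max-Forwards: 70",
        f"To: <sip:{callee}>",
        f"From: <sip:{caller}>;tag=1928301774",  # placeholder tag
        f"Call-ID: {call_id}",
        "CSeq: 1 INVITE",
        f"Contact: <sip:{caller}>",
        "Content-Length: 0",
    ]
    # Header lines end with CRLF; an empty line separates headers from body.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_message(raw: str) -> Tuple[str, Dict[str, str], str]:
    """Split a SIP message into start line, header dict, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    start_line, *header_lines = head.split("\r\n")
    headers: Dict[str, str] = {}
    for line in header_lines:
        name, _, value = line.partition(":")  # split on the first colon only
        headers[name.strip()] = value.strip()
    return start_line, headers, body


if __name__ == "__main__":
    message = build_invite("alice@example.com", "bob@example.com", "a84b4c76e66710")
    start_line, headers, _ = parse_message(message)
    print(start_line)
    print(headers["To"], headers["CSeq"])
```

A session description, typically SDP, would normally travel in the body of the INVITE. It is omitted here to keep the focus on the signaling envelope.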
SIP is fundamentally a protocol for communication among peers. SIP sessions are
conducted by two or more communicating parties. These parties may be network
endpoints—IP nodes associated with end-user devices—as well as network servers. If one
SIP node knows the address of another node, the first may invite the second to join a SIP
session. Thus, SIP sessions do not require support from network servers, but network
intermediates typically help endpoints find one another. Users register their network
addresses with SIP registrars. Users usually send session invitations to one another
through SIP proxies, which use registration information to locate invitees.
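The division of labor between registrar and proxy can be sketched in a few lines. This is a toy, in-memory model under stated assumptions: the `Registrar` and `Proxy` classes and their methods are hypothetical names, bindings never expire, nothing is authenticated, and no SIP messages are actually sent. It only shows how registration information lets a proxy locate an invitee.

```python
from typing import Dict, Optional


class Registrar:
    """Maps a user's public address to the contact where the user can be reached."""

    def __init__(self) -> None:
        self._bindings: Dict[str, str] = {}

    def register(self, address: str, contact: str) -> None:
        # A later registration for the same address replaces the earlier one.
        self._bindings[address] = contact

    def lookup(self, address: str) -> Optional[str]:
        return self._bindings.get(address)


class Proxy:
    """Uses registration data to decide where an invitation should go."""

    def __init__(self, registrar: Registrar) -> None:
        self._registrar = registrar

    def route_invite(self, callee: str) -> str:
        contact = self._registrar.lookup(callee)
        if contact is None:
            return "404 Not Found"
        return f"forward INVITE to {contact}"


if __name__ == "__main__":
    registrar = Registrar()
    registrar.register("sip:bob@example.com", "sip:bob@192.0.2.4:5060")
    proxy = Proxy(registrar)
    print(proxy.route_invite("sip:bob@example.com"))
    print(proxy.route_invite("sip:carol@example.com"))
```

If the caller already knew Bob's contact address, neither component would be needed, which is the sense in which SIP sessions do not require support from network servers.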
SIP sessions provide an extensible framework for a wide variety of interactions. They do
not define—hence, do not constrain—specialized service behavior. Thus, they form the
basis for many different communication services. SIP sessions support services typically
accessed through packet data networks (e.g., streaming video-on-demand service). They
also support conventional telephony services (e.g., conference voice calls).
Because SIP is a framework in which both telephony and nontelephony services have
been developed, SIP has encouraged convergence of services. In particular, SIP is
encouraging convergence of telephony and Web-based services. These converged
services include Web phones, Web-based management of telephony services, and
interactive games in which players can talk with one another in conference calls.
Additional information is available from the SIP working group of the IETF at
SUDHIR AHUJA is vice president of the Converged Networks and Services Research Laboratory at Bell
Labs/Lucent Technologies, where he is leading research in converged networks, services, speech recognition, text-
to-speech coding techniques, video-based communication, and novel multimedia applications. He designed and
developed the first large-scale multiprocessor at Bell Labs and championed the first Internet-based video
conferencing system. His current interests are in the field of communication applications over the Internet.
Ahuja obtained his M.S. and Ph.D. degrees in electrical engineering from Rice University. His undergraduate
education was at the Indian Institute of Technology, Bombay, where he received the President’s Gold Medal for
outstanding academic performance. He is a Fellow of Bell Labs and has served as chairman for the Multimedia
Services and Terminals Committee of the IEEE Society, area editor for the IEEE Communications Committee, and
editor for Transactions on Networking, a joint publication of IEEE and ACM.
BOB ENSOR is a technical manager in the Services Infrastructure Research Department at Bell Labs/Lucent
Technologies. He leads research and development efforts in next-generation network architectures and
components. Earlier, he served as principal researcher in several projects at Bell Labs, including broadband
service data centers, multimedia messaging systems, shared virtual worlds for the Internet, and multimedia
conferencing systems. Ensor holds several patents and has published numerous papers. He received his Ph.D. in
computer science from SUNY at Stony Brook.