Microsoft PowerPoint - ccnc10_voip

CCNC 2010 Tutorial: Towards Glitch Free VoIP and Video Conferencing 1/12/2010

TOWARDS GLITCH-FREE
VOIP AND VIDEO CONFERENCING
JIN LI
MICROSOFT RESEARCH

Outline

Introduction
Anatomy of VoIP and Video Conferencing Systems
Audio/Video Components
Network Components
Summary



Introduction

Booming of IP Based Communication

Advanced voice over IP (VoIP)
Web-, audio-, video-conferencing
Tele-presence
Instant messaging
Calendar and other PIM functions
Email, fax and voice mail



Worldwide VoIP subscribers

• Worldwide VoIP service revenue was $24.1B in 2007, up 52% over 2006.
• Worldwide VoIP service revenue is expected to more than double over the next 4 years, to
$61.3B in 2011, an annual growth rate of 26%.

Source: 2008 Infonetics Research Inc.

US Broadband Telephony Forecast, 2007-2013

The VoIP subscriber base is predicted to double from 2007 to 2013.
Source: Jupiter Research, US Broadband Telephony Forecast, 2008 to 2013



VoIP Trend

IP networks are the next gen networks for all forms of
communication.
Broadband penetration is a key driver of VoIP expansion
Worldwide DSL subscriptions were at 205.9M at the end of
2007, up 23% from the previous year, and are predicted to increase to 363.6M
in 2011.
Cable subscriptions were up 15% annually to 68M at the end of
2007, climbing to 97.3M in 2011.
Passive Optical Network (PON) subscribers were at 10.9M in
2007
Ethernet FTTH subscribers were at 1.7M in 2007
2004/2005 were breakthrough years for VoIP adoption

High End Systems – Tele-Presence

Cisco Telepresence: $299K
Tandberg Experia: $225K
HP Halo: $425K + $18K/mo
Polycom RPX210M: $269K + $18.5K/mo



Worldwide Tele-presence Forecast (2006-2012)

# of end points

Revenue forecast

Source: 2008 IDC Research

Desktop Video Conferencing

Multiple solutions, often acting as add-ons to VoIP

Benefit
See faces of people you may not have met before
See facial expressions & gestures
Easier to follow a conversation
More interactive than phone
Get the general mood and ambience
See and show documents/objects
Drawback
Difficult to set up and plan
Network reliability
Without (or with poor) video, people talk; without (or with poor) audio, people walk.
Interpersonal factors



Anatomy of VoIP and Video Conferencing Systems

Infrastructure vs. P2P

Infrastructure based: Microsoft Unified Communication
P2P based: Skype

Cisco

Gtalk



Infrastructure Based VoIP: Microsoft Unified Communication

Unified Communication: Architecture



Unified Communication: P2P Call

Key Steps

Alice calls Bob

Find Bob’s registered SIP endpoints



Unified Communication: To VoiceMail

Key Steps

Alice calls Bob

Find Bob’s registered SIP endpoints

Bob doesn’t answer after a certain period, call re-routes

Voicemail system plays a greeting, records Alice’s message, sends it
to Bob’s email, and uses a speech server to transcribe it



Unified Communication: PSTN UC

Key Steps

PSTN user Alice calls Bob

IP-PSTN gateway terminates the call

MS/Gateway routes call to mediation server, which
performs transcoding, ICE, etc.
Through director, the proper UC client is found




P2P VoIP: Skype

Information
Debut: 08/2003, by N. Zennstrom and J. Friis, who
founded KaZaA
A P2P overlay network for VoIP and other apps
Free Skype-to-Skype VoIP and fee-based
SkypeOut/SkypeIn



Skype Usage (Apr. 2008)

11 million concurrent Skype users on line in peak time
(180,000+ simultaneous calls)
309 million registered users worldwide, the largest
registered user base within eBay portfolio (33 million
added users for Q1FY08)
$126M revenue in Q1FY08 (61% YOY growth, 5.6
billion SkypeOut minutes in FY2007)
100 billion cumulative Skype-to-Skype minutes

Skype Share of International VoIP Traffic



Skype Gadget

IPDRUM mobile Skype
Cable

Motorola CN620 IPEVO Free-1
WiFi Cellphone USB Skype Phone

Netgear Skype
Wi-Fi Phone

USB Mouse with Phone
50 hardware partners, 150+ Skype-certified devices.

Skype vs. VoIP

Public VoIP standard
H.323, SIP
Skype is a proprietary VoIP solution
Rely on P2P network for user directory
Scalable without costly infrastructure
Route calls through supernodes in Skype
Universal firewall/NAT traversal
Encrypted traffic (but you have to trust eBay/Skype)



Skype Ingredient (1)

User retrieves ID from
a Skype server

Skype Network

Skype
Server
authentication

Supernode Overlay:

any computer w/ sufficient CPU, memory
& network bw & not behind firewall
For distributed directory service
Relay traffic for computer behind
NAT/firewall



NAT Traversal (Skype)

NAT/Firewall detection
Try UDP connection
Try TCP connection (arb port, 80 (http), 443(https) )
Traversal
Direct connection if a) both clients have no NAT, b) one
client has no NAT, and one behind cone-NAT
Relay by supernode otherwise
Because Skype does not pay for the relay cost, it can afford a
high-bitrate wideband voice codec (>24 kbps)
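The traversal rule above can be sketched as a small decision function. This is only an illustration of the two direct-connection cases listed on the slide, not Skype's actual (proprietary) logic; the `Nat` categories and the name `choose_path` are assumptions made for the example.

```python
from enum import Enum


class Nat(Enum):
    """Hypothetical NAT categories used only for this sketch."""
    NONE = "none"    # publicly reachable, no NAT
    CONE = "cone"    # cone NAT
    OTHER = "other"  # anything else (e.g. symmetric NAT, strict firewall)


def choose_path(a: Nat, b: Nat) -> str:
    """Return 'direct' or 'relay' for a call between clients a and b."""
    # Case a) on the slide: neither client is behind a NAT.
    if a is Nat.NONE and b is Nat.NONE:
        return "direct"
    # Case b) on the slide: one client has no NAT, the other is behind a cone NAT.
    if {a, b} == {Nat.NONE, Nat.CONE}:
        return "direct"
    # Everything else is relayed through a supernode.
    return "relay"
```

For example, `choose_path(Nat.CONE, Nat.CONE)` falls through to the relay case under these rules.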

Skype : Call Routing Through Supernode

Skype
Server
authentication

Supernode Overlay:

Route call through
supernodes
High bitrate wideband voice
codec (>24kbps)



Skype Encryption

Peer 1
Peer 2

256-bit AES over 128 bit data block
1536/2048 RSA for key negotiation (2048/2048
for paid service)

Skype: Complete Black Box (Security by Obfuscation)

Almost everything is obfuscated
Many protections, anti-debugging tricks, ciphered code
Avoid static disassembly: XOR the binary with a hard-coded key,
erase the beginning of the code, use its own packer
Code integrity check: use checksum to avoid breakpoint
Anti-debugging technique: anti softice, integrity check
Code obfuscation
Network obfuscation




Audio/Video Component

Audio Codec
Video Codec
Acoustic Echo Cancellation



Audio Codec

G.711 (PCM)
Still widely used today: PSTN interface
If uniform quantization
12 bits * 8 k/sec = 96 kbps
Non-uniform quantization
8 bits * 8 k/sec = 64 kbps (DS0 rate)
North America: µ-law

Other countries: A-law

MOS of about 4.3
µ = 255 , A = 87.6
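A minimal sketch of µ-law companding with µ = 255, to show why 8 bits can stand in for roughly 12 uniform bits. It uses the continuous companding curve, not the segmented, bit-exact encoding of G.711; the function names are illustrative.

```python
import math

MU = 255.0  # mu-law constant from the slide


def mu_law_compress(x: float) -> float:
    """Map a sample x in [-1, 1] to [-1, 1] on a logarithmic scale."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize_8bit(y: float) -> int:
    """Uniformly quantize a companded value in [-1, 1] to a signed 8-bit code."""
    return max(-127, min(127, round(y * 127)))


# 8 bits per sample at 8000 samples per second gives the 64 kbps DS0 rate.
BITRATE_BPS = 8 * 8000
```

Small amplitudes get most of the code space: `mu_law_compress(0.01)` is already above 0.2, so quiet speech keeps a fine step size after 8-bit quantization.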



G.722.1: Siren

Audio bandwidth: 14 kHz
Sample rate: 32 kHz
Bit rate: 24, 32, and 48 kbit/s
Algorithm: Transform coding (Siren14™)
Frame size: 20 ms
Algorithmic delay: 40 ms
Complexity: <11 WMOPS (enc/dec)
Available on royalty-free licensing terms (from Polycom)

Siren Encoder



Siren Decoder

Siren Codec

Audio sampled at 32kHz
Operates on frames of 20 ms corresponding to 640
samples
Based on transform coding, using a Modulated
Lapped Transform (MLT)
A Look-ahead of 20 ms due to 50% overlap between
frames
Total algorithmic delay of 40 ms



MLT - Modulated Lapped Transforms

Spatial Response Frequency Domain

Categorization & SQVH

Quantization Used by SQVH
Expected # of Bits For Each Category

Vector Property Used in SQVH



AMR-WB Basics
“Wideband coding of speech at around 16kbit/s using
adaptive multi-rate wideband (AMR-WB)”
Adopted as ITU-T G.722.2, and also as 3GPP spec TS
26.190.
“Foreseen applications are: VoIP and internet
applications, Mobile Com., PSTN app, ISDN wideband
telephony, ISDN videophone and videoconf.”
Sampling rate 16 kHz;
Bitrate: 6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85,
23.05, and 23.85 kbit/s.
20 ms frame.
ACELP (algebraic code excited LPC).

Pre-processing

Sampling rate conversion: 16 kHz to 12.8 kHz (a
20 ms frame now has 256 samples)
HP filter (cut-off at 50 Hz)
Pre-emphasis filter ($1 - 0.68 z^{-1}$)
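As a small worked example, the frame-size arithmetic and the first-order pre-emphasis filter can be written as follows. This is a floating-point sketch of the filter equation only; the real codec uses fixed-point arithmetic and carries filter state across frames, and the function name is illustrative.

```python
SAMPLE_RATE_HZ = 12800  # internal sampling rate after conversion
FRAME_MS = 20

# 12800 samples/s * 20 ms = 256 samples per frame
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000


def pre_emphasis(x, coeff=0.68):
    """Apply y[n] = x[n] - coeff * x[n-1], taking x[-1] as 0."""
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - coeff * prev)
        prev = sample
    return y
```

A constant (DC-like) input is attenuated to 32% of its level after the first sample, which is the high-frequency boost the filter is meant to provide.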



LP analysis and Quant.
One 30 ms asymmetric window
5 ms look-ahead
Obtain LPC Coef.:
Compute correlation;
Multiply by window (adds 60 Hz BW expansion);
R(0) = 1.0001 R(0) (adds a noise floor at -40 dB);
Levinson-Durbin recursion to compute LP coefficients.
LP to ISP
Quantize in ISP q-domain.
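The correlation, noise-floor and Levinson-Durbin steps listed above can be sketched in a few lines. This is a floating-point illustration that omits the analysis window, the lag window (60 Hz bandwidth expansion) and the ISP conversion; the default order of 16 and all function names are assumptions made for the example.

```python
def autocorrelation(x, order):
    """Return r[0..order] with r[lag] = sum_i x[i] * x[i + lag]."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for A(z) = 1 + sum_i a[i] z^-i.

    Returns (a, err) where a[0] == 1 and err is the final prediction error.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def lp_coefficients(frame, order=16):
    """Autocorrelation method with the 1.0001 white-noise correction."""
    r = autocorrelation(frame, order)
    r[0] *= 1.0001  # noise floor at about -40 dB
    return levinson_durbin(r, order)
```

Note that an all-zero frame would give `r[0] == 0` and a division by zero; a real implementation guards against that case.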

LP analysis and Quant. (2)
Quantization bottom line:
46 bits/frame on most modes;
36 bits/frame on 6.60 kbit/s mode;
M.A. prediction with 1/3 gain;
Quantizer: S-MSVQ (split multistage VQ)
Both quantized and unquantized coefs will be used in
algorithm.



Subframes
Each 20 ms (256 samples) frame is divided into 4 subframes
(64 samples each).
Interpolated LPC coefficients are obtained for each subframe
Interpolation is done in the ISP q-domain

Perceptual weighting
Weighting filter is:
$W(z) = A(z/\gamma_1)\, H_{\text{de-emph}}(z)$

This helps solve the tilt problem, which is worse in
WB speech.



Excitation
Searched for each 5ms sub-frame.
Two components:
Adaptive codebook (past excitation)
Algebraic codebook
“target” signal obtained by filtering the LPC residual
(for the sub-frame) through the synthesis LPC filter
and weighting filter.

Adaptive codebook
Start with “open loop” pitch estimation
based on cross correlation;
Low-value bias;
‘last value’ bias (actually 5-frame median), if voiced.
Re-compute with “closed loop”, around initial value ±7, and up to ¼ sample precision.
“Analysis by synthesis” based;
Restrict to values allowed by encoding step.



Algebraic codebook

Remove contribution of (unquantized) prediction
from adaptive codebook from the “target signal”
to obtain new target.
Divide sub-frame into 4 alternating tracks.

Algebraic codebook (2)
Select best pulses, for a total of 24 (6),
18(5-4), 16 (4), 12(3), 10(3-2), 8(2), 4(1), 2(.5),
depending on bitrate.
Pulses + Two filters:
Periodicity enhancement: $1/(1 - 0.85 z^{-T})$;
Tilt: $1/(1 - \beta_1 z^{-1})$
Tricks to save bits in encoding pulse position;
Tricks to save computation on pulse search.



Wrap up
High pass, de-emphasis;
Upsample back to 16KHz;
Add high frequency components.

High Freq. Components
Random noise used as excitation
LP filter is extended to 8KHz.
Energy of excitation based on energy of base-band
residual, and voicing info, except in highest bitrate
mode.
Extension of the LPC filter is equivalent to mapping 5.1-5.6 kHz
to 6.4-7.0 kHz;
Band-pass filtered to 6-7 kHz, and added to output
signal.



Video Codec

H.264/AVC Encoder



H.264/AVC Decoder

Reference Picture Management

Reference pictures are stored in decoded picture buffer (DPB)
Short/long term reference picture, a decoded frame may be
marked as
unused for reference
short term picture
long term picture
“Sliding window” memory management
Bound the total number of long-term plus short-term pictures
Remove the oldest short-term picture when space runs out
Adaptive memory control
issued by encoder
change the type of the ref frame
IDR (Instantaneous Decoder Refresh)
clear ref buffer
I frame
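The sliding-window behaviour described above can be sketched with a small buffer class. This is a simplified illustration, not the H.264 marking process itself: pictures are represented by plain identifiers, and the class and method names are hypothetical.

```python
from collections import deque


class DecodedPictureBuffer:
    """Toy reference buffer with sliding-window eviction of short-term pictures."""

    def __init__(self, max_refs):
        self.max_refs = max_refs
        self.short_term = deque()  # oldest picture on the left
        self.long_term = []

    def add_short_term(self, pic):
        # Sliding window: drop the oldest short-term pictures until there is room.
        while (len(self.short_term) + len(self.long_term) >= self.max_refs
               and self.short_term):
            self.short_term.popleft()
        self.short_term.append(pic)

    def mark_long_term(self, pic):
        # Adaptive memory control: the encoder changes the type of a reference.
        self.short_term.remove(pic)
        self.long_term.append(pic)

    def mark_unused(self, pic):
        if pic in self.short_term:
            self.short_term.remove(pic)
        elif pic in self.long_term:
            self.long_term.remove(pic)

    def idr(self):
        # Instantaneous Decoder Refresh clears the whole reference buffer.
        self.short_term.clear()
        self.long_term.clear()
```

Long-term pictures are never evicted by the sliding window in this sketch; they leave only through `mark_unused` or `idr`.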



Slice Group

Formerly called “FMO” (Flexible Macroblock
Ordering)
A subset of the macroblocks; may contain one or
more slices
Error resilience

Inter Prediction

Variable block size
¼ pixel motion compensation
Interpolation



Motion Vector (MV) Prediction

Efficiently encode correlated MV
Other than 16×8 and 8×16, MVp = median(MVA, MVB, MVC)
16×8: MVp of the upper = MVB; MVp of the lower = MVA
8×16: MVp of the left = MVA; MVp of the right = MVC
For skipped macroblocks, do as in 16×16 Inter mode
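A short sketch of the prediction rules above. In the H.264 standard the general case is a component-wise median of the three neighboring vectors, which is what this example computes; neighbor availability handling is left out, and the function names and the `partition`/`part_index` parameters are assumptions for illustration.

```python
def median3(a, b, c):
    """Median of three numbers."""
    return sorted((a, b, c))[1]


def predict_mv(mv_a, mv_b, mv_c, partition="other", part_index=0):
    """Predict a motion vector from neighbors A (left), B (above), C (above-right).

    Vectors are (x, y) tuples. part_index 0 is the upper/left partition,
    1 is the lower/right partition.
    """
    if partition == "16x8":
        return mv_b if part_index == 0 else mv_a
    if partition == "8x16":
        return mv_a if part_index == 0 else mv_c
    # All other partition sizes (and skipped macroblocks): component-wise median.
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))
```

Only the difference between the actual vector and this prediction needs to be coded, which is cheap when neighboring motion is correlated.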

Intra Prediction

For Luma samples
4×4 block: 9 prediction modes
16×16 block: 4 modes
I_PCM: transmit the samples w/o prediction &
transform



Prediction Modes

4x4 Luma

Intra 16x16
8x8 Chroma is similar to 16x16 luma intra

Signaling of Intra Prediction Modes

Mode choices need to be signaled to the decoder, but compactly
The prediction mode for luma coded in Intra-16×16 mode or
chroma coded in Intra mode is signaled in the macroblock header
Intra modes for neighboring 4×4 blocks are often correlated
Neighbors of the current block C: A (left) and B (above)
If A and B are available, C = min (A,B)
else if (neither A nor B are available) C = 2 (DC)
else C = available (A,B)
Use prev_intra4x4_pred_mode flag & rem_intra4x4_pred_mode
flag to indicate mode selected.



Deblocking filter

Filter 4 vertical/horizontal boundaries of luma
Filter 2 vertical/horizontal boundaries of chroma
Affects up to 3 samples on either side.
The filter is stronger at places where there is likely to be
significant blocking distortion
e.g., the boundary of an intra coded macroblock or a boundary
between blocks that contain coded coefficients.

Transform and Quantisation

3 transforms
DCT-based transform for all 4×4 residual blocks

$a = 1/2$, $b = \sqrt{2/5}$, $d = 1/2$
Hadamard transform for 4×4 luma DC coefficients (in
16×16 intra)
Hadamard transform for 2×2 chroma DC coefficients
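For illustration, the 4×4 integer core transform of H.264 can be computed as Y = Cf · X · Cfᵀ with a small integer matrix; the scaling factors built from a and b are not applied here because the codec folds them into quantization. This is a plain-Python sketch and the helper names are illustrative.

```python
# Integer core transform matrix of the H.264 4x4 forward transform.
CF = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def matmul4(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose4(m):
    return [list(row) for row in zip(*m)]


def core_transform(block):
    """Return Cf * block * Cf^T for a 4x4 residual block (no scaling)."""
    return matmul4(matmul4(CF, block), transpose4(CF))
```

Because the matrix contains only 1 and 2 in magnitude, the transform needs just additions and shifts, and encoder and decoder cannot drift apart through rounding.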



Combine Quantization into Scaling of Transform

4x4 DC Intra Luma

$|Z_D(i,j)| = (|Y_D(i,j)| \cdot MF(0,0) + 2f) \gg (qbits + 1)$
$\operatorname{sign}(Z_D(i,j)) = \operatorname{sign}(Y_D(i,j))$

CAVLC: Context-Based Adaptive Variable Length Coding
Characteristics:
Run-level coding to compact zero string
Trailing ones (+1, -1 after 0)
Numbers of nonzero coefficients in neighboring blocks are
correlated
The choice of VLC lookup table for the level parameter adapts to
recent level magnitudes



CAVLC Encoding

1. Encode the number of coefficients and trailing ones (coeff token)
TotalCoeffs : 0 ~ 16
TrailingOnes : 0 ~ 3
if more than 3 TrailingOnes, only the last three are treated as ‘special cases’
Four look-up tables
Three variable-length, one fixed-length
Choice depends on neighboring blocks
2. Encode the sign of each TrailingOne: in reverse order
3. Encode the levels of the remaining nonzero coefficients
level_prefix, level_suffix
4. Encode the total number of zeros before the last coefficient
Zero-runs at the start of the array need not be encoded
5. Encode each run of zeros
If fewer than 3 TrailingOnes, the first nonzero coefficient is adjusted
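The five steps above operate on a handful of symbols derived from the zigzag-ordered coefficients. The sketch below only extracts those symbols; it does not perform the table look-ups or the level adjustment, and the function name and dictionary keys are chosen for illustration.

```python
def cavlc_symbols(coeffs):
    """Derive CAVLC symbols from the zigzag-ordered coefficients of a 4x4 block."""
    nz = [i for i, c in enumerate(coeffs) if c != 0]  # positions of nonzeros
    total_coeffs = len(nz)

    # Trailing ones: up to three consecutive +/-1 values, scanning from the
    # highest-frequency nonzero coefficient backwards.
    trailing_ones = 0
    for i in reversed(nz):
        if abs(coeffs[i]) == 1 and trailing_ones < 3:
            trailing_ones += 1
        else:
            break

    reversed_values = [coeffs[i] for i in reversed(nz)]
    trailing_signs = ["+" if v > 0 else "-" for v in reversed_values[:trailing_ones]]
    levels = reversed_values[trailing_ones:]  # remaining levels, reverse order

    # Zeros located before the last nonzero coefficient.
    total_zeros = (nz[-1] + 1 - total_coeffs) if nz else 0

    # Run of zeros preceding each coefficient, in reverse order. The run before
    # the lowest-frequency coefficient is implied by total_zeros and is omitted.
    runs = [nz[k] - nz[k - 1] - 1 for k in range(len(nz) - 1, 0, -1)]

    return {
        "total_coeffs": total_coeffs,
        "trailing_ones": trailing_ones,
        "trailing_signs": trailing_signs,
        "levels": levels,
        "total_zeros": total_zeros,
        "runs": runs,
    }
```

For the block `0, 3, 0, 1, -1, -1, 0, 1, 0, ...` this yields five coefficients, three trailing ones, and three zeros before the last coefficient.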

Acoustic Echo Cancellation




From Audio
Decoder

To Audio
Encoder


Acoustic Echo Cancellation Module



Adaptive Transversal Filter

FIR filter – inherently stable
Filter length affects convergence speed, quality of the final solution,
and complexity.
The filter introduces errors since it approximates an IIR response.
Short Filters
128 – 256 coefficients (taps)
Faster convergence, but final solution has more residual error
Less complex O(N).
Long Filters
512-1024
Slower convergence, but final solution has less error.
More complex, as the algorithm can be O(N^2)
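A minimal adaptive FIR echo canceller, to make the filter-length trade-off concrete. The slides do not name the adaptation algorithm; normalized LMS (NLMS) is assumed here because it is a common O(N) choice. The step size `mu`, the regularizer `eps`, and the toy echo path used with `simulate_echo` are all illustrative.

```python
def simulate_echo(far_end, echo_path):
    """Microphone signal: far-end signal convolved with a hypothetical room response."""
    mic = []
    for n in range(len(far_end)):
        mic.append(sum(h * far_end[n - k]
                       for k, h in enumerate(echo_path) if n - k >= 0))
    return mic


def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Return (residual, weights) after adapting an FIR filter with NLMS."""
    w = [0.0] * taps
    history = [0.0] * taps  # most recent far-end sample first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(wi * hi for wi, hi in zip(w, history))
        e = d - echo_estimate  # what is sent on to the audio encoder
        norm = sum(h * h for h in history) + eps
        w = [wi + mu * e * hi / norm for wi, hi in zip(w, history)]
        residual.append(e)
    return residual, w
```

Each sample costs O(N) multiply-adds for N taps, so a 1000-tap filter for a long echo tail is both slower to converge and more expensive per sample than the 8-tap toy used here.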

Challenges

Dynamic range of the human ear = 120dB.
Even quiet echoes can be heard.
Longer delays from satellite (300-500ms), VoIP
Ear is more sensitive to longer delays.
More difficult to find the beginning of the echo.
Long filters (~1000 taps) are needed (complexity &
convergence)
Near-end noise: corrupts the echo, decreasing the
canceller's ability to converge.
Acoustic echo paths can change rapidly
More difficult for the AEC to remain converged.
Nonlinear echo components
Speakers driven beyond linear region.



Network Component

IP-based VoIP / Video Conference



Internet Primer

Internet : Grand View



Impact on ISPs

Economics of ISP relationships
Transit relationship: one ISP pays another
Peering relationship: mutually beneficial, free agreement (to a certain extent)
Sibling relationship: several ISPs belong to the same organization

Inside ISP



ISP POP (Point of Presence)

Home Networking



Network Characteristics

Under-provisioned Links

Branch Branch



Growth Trends

Packet Loss vs. Jitter (vs. Delay?)



The Usual Suspects

Packet Bursts



What kind of Enterprise User?

How QoS can help



QoS helps inside and between branches!

Observation

IP-based communication in the enterprise is growing
Empirical results show poor calls for Wireless and
VPN users
QoS (DiffServ) is both used and useful!



Available Bandwidth Estimation

What is Available Bandwidth (ABW)?

ABW is the left-over capacity along an Internet
path



Why Is It Useful?
Maximizing QoE (Quality of Experience) in A/V
conferencing
Audio prefers minimum delay (high priority)
Video prefers maximum rate (low priority)

One Way Delay (OWD) = propagation delay (constant) + queuing delay (variable)

One solution: measure ABW, encode and send
video at the ABW rate

Typical Targeting Scenario

First hop is the bottleneck
Cable modem, DSL, high-speed link…
Timescale for the ABW estimation: 2-4 seconds



Why Is Measuring ABW Hard?
Available bandwidth changes over time
ABW measurements must be quick

Audio packets (along the same path) should
experience minimum delay
Measurement must be non-intrusive

Two Models
Probe Rate Model (PRM) based solutions
Pathload, TOPP, Pathchirp, Bfind, PTR …
Probe Gap Model (PGM) based solutions
Spruce, Delphi, IGI, Moseab …



Pathload (PRM) [Jain & Dovrolis]
Send probe trains at various rates
ABW is the probe rate at transition, where OWD is
increasing (queuing delay is observed)

Spruce (PGM) [Strauss et al.]
Send probe pairs/trains at Ri (Ri > A), measure
sending gaps and receiving gaps
Compute A directly
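A sketch of the probe-gap computation, using the estimator published for Spruce: each probe pair gives A = C · (1 − (Δout − Δin) / Δin), and the samples are averaged. It assumes a single bottleneck whose capacity C is known and a sending gap equal to the bottleneck transmission time of one probe packet; the function and parameter names are illustrative.

```python
def spruce_estimate(capacity_bps, gap_in_s, gaps_out_s):
    """Average available-bandwidth estimate over a set of probe pairs.

    capacity_bps: bottleneck capacity C (assumed known).
    gap_in_s:     spacing of each probe pair at the sender, in seconds.
    gaps_out_s:   measured spacings at the receiver, in seconds.
    """
    samples = [
        capacity_bps * (1.0 - (gap_out - gap_in_s) / gap_in_s)
        for gap_out in gaps_out_s
    ]
    return sum(samples) / len(samples)
```

For example, on a 1 Mbps bottleneck a 1500-byte probe takes 12 ms to transmit; if the pair arrives 18 ms apart, cross traffic filled half of the gap and the estimate is about 500 kbps.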



Advantages/Disadvantages of the Approaches

PGM based approaches
Advantages: fast estimation; estimation can be done in a single probe.
Disadvantages: assumptions are not easy to verify in practice.
PRM based approaches
Advantages: no assumption.
Disadvantages: slow estimation; iterative probes.

Forward Error Correction



Block Based Erasure Resilient Coding

Original data: 1 2 3 ... k (k messages)

ERC: 1 2 3 ... k k+1 ... n

At a certain instance, some of the n coded blocks (marked X in the figure) are lost

Some of the blocks may be lost in delivery. However, as long as there
are at least k blocks delivered, the original data can be reconstructed.

ERC in VoIP and Video Conferencing

VoIP
Mainly packet replication, due to small VoIP packet size
& low delay requirement
Video Conferencing
Packet loss protection (for I frame or P frame in HD)
Each frame is separated into k messages and protected by n-k
messages. As long as there are no more than n-k losses, the
transmission succeeds



ERC Terms

Number of Original Block: k
Number of Coded Block: n
Rate of ERC: k/n
MDS: Maximum Distance Separable
Any k of n coded block may recover the original
The theoretical optimal performance

Erasure Encoding: Mathematics
Original data: $x_1, x_2, \ldots, x_k$

Coded data: $y_1, y_2, \ldots, y_n$

$x_i$, $y_j$: vectors over a Galois field.



Example: ERC of 10MB
Original data (10 MB): $x_1, x_2, \ldots, x_k$ with k = 10 over GF($2^8$); each vector is 1 MB.
Coded data: n = 30 vectors of 1 MB each (30 × 10 generator matrix).

Erasure Decoding: Mathematics


Available

Code select






Original data can be recovered if the sub-generator matrix
has a full rank k.
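The encode/decode relationship can be sketched with a small MDS code: coded symbols are y = G·x for an n × k generator matrix G, and any k received symbols select a k × k sub-generator matrix that is inverted to recover x. To keep the example short it works over the prime field GF(257) with a Vandermonde generator matrix rather than GF(2^8) as on the slides; that substitution, the non-systematic form, and the function names are assumptions of this sketch.

```python
P = 257  # prime modulus (assumption: stands in for GF(2^8) to keep arithmetic simple)


def encode(data, n):
    """Encode k symbols into n symbols: y_j = sum_i data[i] * j**i mod P, j = 1..n.

    Requires n < P so that the evaluation points j are distinct and nonzero.
    """
    return [sum(d * pow(j, i, P) for i, d in enumerate(data)) % P
            for j in range(1, n + 1)]


def decode(received, k):
    """Recover the k data symbols from at least k (j, y_j) pairs."""
    # Rows of the sub-generator matrix, augmented with the received symbols.
    rows = [[pow(j, i, P) for i in range(k)] + [y] for j, y in received[:k]]
    # Gauss-Jordan elimination modulo P.
    for col in range(k):
        pivot = next(r for r in range(col, k) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], P - 2, P)  # modular inverse via Fermat
        rows[col] = [(v * inv) % P for v in rows[col]]
        for r in range(k):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(a - f * b) % P for a, b in zip(rows[r], rows[col])]
    return [rows[i][k] for i in range(k)]
```

Any k rows of a Vandermonde matrix with distinct evaluation points are linearly independent, so every k-subset of the coded symbols has full rank. A coded symbol here can take the value 256, which does not fit in a byte; that is one practical reason real systems use GF(2^8) instead.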

Systematic vs Non-Systematic ERC

Original data: 1 2 3 k k messages

Non systematic 1 2 3 k k+1 n
ERC:

Systematic 1 2 3 k k+1 n
ERC:

Systematic ERC
Slightly lower encoding & decoding complexity
Even if recovery fails, the original messages that arrived can still be used



Reed-Solomon

Has been around for decades
Has systematic form
Cauchy Reed-Solomon Code


Reed-Solomon Decoding

Inverse

Receive




Dejitter Buffer

Variable Delay & Dejitter Buffer
Queuing Queuing Queuing
Delay Delay Delay

Dejitter
Buffer

Queuing delay
Dejitter buffers
Variable packet sizes



Fixed Dejitter Buffer – Budget For Worst Case

Site A to Site B (128 kbps bandwidth)
Coder delay: 40 ms
Queuing delay: 4-50 ms
Propagation delay: 8 ms
Dejitter buffer: 50 ms

Total End-to-End Delay
Codec delay: 40ms
Propagation delay: 8ms
Dejitter buffer: 50ms
To accommodate queuing delay: 0-50 ms
Total delay: 98ms

Dejitter Buffer Size & Late Loss

late loss

buffering delay

Fixed playout deadline and jitter absorption:
The playout rate is constant
The tradeoff is between dejitter
buffer size and late loss



Adaptive Playout and Dejitter Buffer Adaptation

buffering delay

Adaptive playout and jitter adaptation
Scaling of voice/video packets in a highly dynamic
way
Playout schedule set according to past delays
recorded
Usually the dejitter buffer expands quickly on late
packet arrival, and shrinks slowly when jitter reduces
Improved tradeoff between buffering delay and
late loss
Playout rate is not constant
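The "expand quickly, shrink slowly" rule can be sketched as a per-packet update of the target playout delay. This is an illustrative policy only, not the algorithm of any particular product; the safety margin, the shrink factor, and the function name are assumptions.

```python
def adaptive_playout_delays(network_delays_ms, initial_ms=20.0,
                            margin_ms=5.0, shrink=0.02):
    """Track a target playout delay from observed per-packet network delays.

    Returns (targets, late) where targets[i] is the target delay after packet i
    and late counts packets that arrived after the current target (late loss).
    """
    target = initial_ms
    targets = []
    late = 0
    for delay in network_delays_ms:
        desired = delay + margin_ms
        if delay > target:
            late += 1  # packet missed its playout deadline
        if desired >= target:
            target = desired  # expand immediately
        else:
            target += shrink * (desired - target)  # shrink slowly
        targets.append(target)
    return targets, late
```

After a single 60 ms delay spike the target jumps to 65 ms at once, then decays by 2% of the remaining gap per packet, trading a short period of extra buffering delay for fewer late losses.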

Adaptive Play Out

Audio Adaptive
Playout

Packets are pushed into the Adaptive Playout module
The renderer requests a new waveform segment for playout
The playout module passes packets to the audio decoder



Packet Loss Concealment

Audio Packet Loss Concealment
L ∆L

i-2 i-1 i lost i+1 i+2

alignment found by correlation time

i-2 i-1 i+1 i+2
time
2L
1.3 L

Depend on voiced & unvoiced segment



Voiced segments

Unvoiced segments



Concealment as (bi-directional)
stretching

Video Packet Loss Concealment

Spatial Concealment
Use spatial correlation
E.g., bilinear interpolation
Projection onto convex sets
Temporal Concealment
Use the correlation that exists between consecutive frames
Temporal replacement
Boundary matching
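As a minimal example of temporal concealment, temporal replacement copies the co-located block from the previous frame into each lost block. Frames are plain lists of pixel rows here, and the block size and function name are assumptions for illustration; boundary matching would refine this by searching for a better-aligned block.

```python
def temporal_replace(prev_frame, cur_frame, lost_blocks, block=16):
    """Fill lost blocks of cur_frame with co-located pixels from prev_frame.

    lost_blocks is a list of (block_row, block_col) indices.
    Returns a new frame; the inputs are not modified.
    """
    out = [row[:] for row in cur_frame]
    for block_row, block_col in lost_blocks:
        x0 = block_col * block
        x1 = x0 + block
        y0 = block_row * block
        y1 = min(y0 + block, len(out))
        for y in range(y0, y1):
            out[y][x0:x1] = prev_frame[y][x0:x1]
    return out
```

This works well for static or slowly moving content and degrades when the lost region contains fast motion, which is where spatial or combined spatial-temporal concealment helps.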



Spatial-Temporal Concealment




Summary

VoIP/Video Conference Systems
Infrastructure based
P2P based
Audio/Video Components
Audio codec
Video codec
Acoustic echo cancellation
Network components
Primer of the Internet
Network characteristics
Available bandwidth estimation
Forward error correction (FEC)
Dejitter buffer
Packet loss concealment

