WebRTC infrastructures in the large (with experiences on real cloud deployments)

Luis Lopez
lulop@kurento.org
IIT RTC Conference & Expo, October 2015

Speaker
• Coordinator of Kurento.org
– FOSS project
– WebRTC Media Server
– WebRTC Media APIs
– WebRTC Cloud Infrastructure
• Software developer
• Software trainer
• Software learner
• FOSS enthusiast
http://www.kurento.org
http://twitter/@kurentoms
https://www.youtube.com/channel/UCFtGhWYqahVlzMgGNtEmKug

WebRTC infrastructures
Peer-to-Peer WebRTC Application (without media infrastructure)
WebRTC video stream
WebRTC Application based on media infrastructure
media infrastructure

Function of WebRTC infrastructures
• Processing (VP8, H.264)
• Group communications
• Archiving

WebRTC infrastructures in the large
From the hundreds to the millions: the scalability problem
WebRTC Cloud

WebRTC cloud models
• IaaS: high flexibility, complex development, low hourly costs
• PaaS
• APIaaS
• SaaS: low flexibility, simple development, high hourly costs
• Figure annotations: "Computing resources"; "No WebRTC-specific players here"

WebRTC cloud architectures
• IaaS: virtual infrastructure
• PaaS: WebRTC platform
• APIaaS: WebRTC API
• SaaS: WebRTC application
• Figure annotations: "No new science here"; "The science for the scalability problem is here"

WebRTC vs. traditional WWW platforms: the three tiers
• Application Server Container
– Application 1 … Application N
– Service Layer
– Signaling
• WebRTC Media Server
• Database Server

Vertical scalability on monolithic WebRTC platforms
• One Application Server instance (Application 1 … Application N) and one Media Server instance
• Typical scalability curve for SFU media servers: quality of service against number of WebRTC legs
• ~500 to 1000 legs on commodity hardware
• Figure annotation: "The bottleneck is here"
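As a rough sizing sketch for the figures above: once a single instance saturates, the number of media server instances grows linearly with the number of legs. The function name and the one-million-leg target are illustrative assumptions; only the 500 to 1000 legs per instance range comes from the slide.

```python
import math

def media_servers_needed(total_legs: int, legs_per_server: int) -> int:
    """Minimum number of media server instances for a given number of WebRTC legs."""
    if legs_per_server <= 0:
        raise ValueError("legs_per_server must be positive")
    return math.ceil(total_legs / legs_per_server)

# Hypothetical target of one million legs, sized with both ends of the range.
pessimistic = media_servers_needed(1_000_000, 500)    # 2000 instances
optimistic = media_servers_needed(1_000_000, 1000)    # 1000 instances
print(pessimistic, optimistic)
```

Even the optimistic case needs on the order of a thousand instances, which is why the rest of the deck deals with horizontal scaling.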

Horizontal scalability of WebRTC Media Servers
[Figure: a Load Balancer, several Application Servers, a Media Resource Broker (RFC 6917) and a pool of Media Servers]

• Functions
– MS registration
• MS instances register on the MRB
– MS brokering
• Query model
– AS instances query the MRB for locating a MS instance
– MRB is explicit for the AS
• In-line model
– MRB routes signaling (control requests)
– MRB is transparent for the AS
• MRB does not hold state about MS instances
– MS instances are independent
– MS instances are equivalent
– We say it’s stateless
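A minimal sketch of the query model described above, assuming a toy in-process broker. The class and method names (`StatelessMrb`, `register`, `locate`) and the addresses are hypothetical and not taken from RFC 6917 or from Kurento. The broker keeps only the list of registered instances and a round-robin cursor: no load figures and no per-session data, which is what lets every MS instance be treated as independent and equivalent.

```python
from itertools import count

class StatelessMrb:
    """Toy Media Resource Broker using the query model (illustrative only)."""

    def __init__(self):
        self._instances = []      # addresses of registered MS instances
        self._cursor = count()    # round-robin cursor, not per-session state

    def register(self, ms_address: str) -> None:
        """Called by an MS instance when it starts."""
        if ms_address not in self._instances:
            self._instances.append(ms_address)

    def unregister(self, ms_address: str) -> None:
        """Called when an MS instance goes away."""
        if ms_address in self._instances:
            self._instances.remove(ms_address)

    def locate(self) -> str:
        """Called by an AS instance to find an MS; any instance will do."""
        if not self._instances:
            raise LookupError("no media server registered")
        return self._instances[next(self._cursor) % len(self._instances)]

mrb = StatelessMrb()
for address in ("ms-1:8888", "ms-2:8888", "ms-3:8888"):
    mrb.register(address)

picked = [mrb.locate() for _ in range(4)]
print(picked)
```

In the in-line model the same selection would happen inside the broker while it routes the control request, so the AS would never see the chosen address.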

Stateless MRB use cases
• Independent MS
– B2B calls
– WebRTC GW
– Room servers
– Media recording
– Etc.
[Figure: an Application Server instance, a stateless MRB and four Media Server instances hosting two calls]

Deploying in public and private clouds
• Amazon Web Services EC2
– Most popular public cloud
• OpenStack
– Used by popular public clouds (e.g. Rackspace)
– Popular for private clouds
• Deployment
– Cloud deployment templates
• CloudFormation (Amazon)
• Heat (OpenStack)

Templates
– Declarative language for
• Declaration of resources and relationships
– Images, computing nodes, networks, volumes, load balancers, autoscaling groups, etc.
• Deployment
– Instantiation of resources
• Runtime
– Provisioning
– Autoscaling
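To make "declarative" concrete, here is a small sketch that builds a CloudFormation-style template as a Python dictionary and prints it as JSON. It declares two resources and one relationship (the group refers to the launch configuration). The logical names, the image id placeholder, the instance type and the group sizes are all assumptions for illustration; this is not the template used in the deployments described here.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Sketch: autoscaling group of media server instances",
    "Resources": {
        "MediaServerLaunchConfig": {
            "Type": "AWS::AutoScaling::LaunchConfiguration",
            "Properties": {
                "ImageId": "ami-PLACEHOLDER",   # hypothetical media server image
                "InstanceType": "c4.large",     # hypothetical instance size
            },
        },
        "MediaServerGroup": {
            "Type": "AWS::AutoScaling::AutoScalingGroup",
            "Properties": {
                "AvailabilityZones": {"Fn::GetAZs": ""},
                # Relationship between resources: the group uses the launch config.
                "LaunchConfigurationName": {"Ref": "MediaServerLaunchConfig"},
                "MinSize": "1",    # hypothetical bounds
                "MaxSize": "10",
            },
        },
    },
}

print(json.dumps(template, indent=2))
```

The template only states what should exist. Instantiating the resources and reacting to autoscaling rules at runtime is left to the cloud's orchestration service.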

Deploying in public clouds
[Figure: deployment pipeline]
• Chef + Packer: build the Media Server, Application Server and Broker images
• AWS AMI / OpenStack Glance: image storage
• CloudFormation / Heat: stack definition template with launch configurations, autoscaling rules, autoscaling groups and an Elastic Load Balancer
• AWS EC2 / OpenStack Nova: Application Server, Broker and Media Server instances
• Source code

Experiences deploying large WebRTC infrastructures in public clouds
• Lessons learnt: fault-resilience is hard
– AS & MRB layers
• Are stateless => use distributed cache systems
– MS layer
• Is stateful => lots of problems
[Figure: Application Servers on top of a pool of Media Servers]

Lessons learnt: avoid single points of failure
• The wrong way (single point of failure)
– A single MRB in front of the computing nodes running the MS instances
• The right way (fault-tolerant MRB)
– Several MRB instances behind an Elastic Load Balancer
– Distributed cache shared by the MRB instances
– Computing nodes running the MS instances
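The fault-tolerant layout above can be sketched as follows, under the assumption that MRB replicas keep no local copy of the registry and read it from a shared store. `SharedCache` is an in-memory stand-in for a real distributed cache, and all names here are hypothetical.

```python
class SharedCache:
    """In-memory stand-in for a distributed cache shared by all MRB replicas."""

    def __init__(self):
        self._sets = {}

    def add(self, key, member):
        self._sets.setdefault(key, set()).add(member)

    def members(self, key):
        return sorted(self._sets.get(key, set()))


class MrbReplica:
    """MRB replica that keeps nothing locally, so any replica can serve any request."""

    def __init__(self, cache):
        self._cache = cache

    def register(self, ms_address):
        self._cache.add("media-servers", ms_address)

    def known_media_servers(self):
        return self._cache.members("media-servers")


cache = SharedCache()
replica_a = MrbReplica(cache)
replica_b = MrbReplica(cache)

replica_a.register("ms-1:8888")
replica_b.register("ms-2:8888")

del replica_a  # simulate losing one replica; the registry lives in the cache
survivors = replica_b.known_media_servers()
print(survivors)
```

Because the registry survives the loss of a replica, the load balancer can route requests to whichever replica is still alive.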

Lessons learnt: fault-recovery at the MS layer
• Fault-tolerance on the MS layer
– Stateful problem
• MS instances hold specific resources that cannot be “serialized” to a distributed cache:
– Specific sockets
• Machine failure => session failure
– Our proposed solution
• Reconstruct the session
– Detect failure
– Notify failure
– Reconnect
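The three steps above (detect, notify, reconnect) can be sketched as a heartbeat check followed by re-placement of the affected sessions. This is an illustrative sketch, not the authors' implementation: the `SessionDirectory` class, the 5 second timeout and the `notify` callback are assumptions. In a real system the reconnect step implies a new media negotiation with each client.

```python
from dataclasses import dataclass, field

@dataclass
class SessionDirectory:
    """Hypothetical bookkeeping of which session runs on which media server."""
    heartbeat_timeout: float = 5.0                       # assumed value, in seconds
    last_heartbeat: dict = field(default_factory=dict)   # ms name -> timestamp
    placement: dict = field(default_factory=dict)        # session id -> ms name

    def heartbeat(self, ms: str, now: float) -> None:
        self.last_heartbeat[ms] = now

    def failed_servers(self, now: float) -> list:
        # Step 1: detect failure through missing heartbeats.
        return sorted(ms for ms, seen in self.last_heartbeat.items()
                      if now - seen > self.heartbeat_timeout)

    def recover(self, now: float, notify) -> list:
        failed = self.failed_servers(now)
        healthy = sorted(set(self.last_heartbeat) - set(failed))
        for ms in failed:
            del self.last_heartbeat[ms]
        orphans = sorted(s for s, ms in self.placement.items() if ms in failed)
        if orphans and not healthy:
            raise RuntimeError("no healthy media server left")
        for index, session in enumerate(orphans):
            target = healthy[index % len(healthy)]
            notify(session, self.placement[session], target)  # Step 2: notify failure
            self.placement[session] = target                  # Step 3: reconnect
        return orphans


directory = SessionDirectory()
directory.heartbeat("ms-1", now=0.0)
directory.heartbeat("ms-2", now=0.0)
directory.placement.update({"call-a": "ms-1", "call-b": "ms-2"})
directory.heartbeat("ms-2", now=10.0)   # ms-1 stays silent

events = []
directory.recover(now=10.0, notify=lambda s, old, new: events.append((s, old, new)))
print(events)
```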
[Figure: an Application Server instance and an MRB over four Media Server instances hosting two calls, with the steps failure detection, failure notification and session reconnection]

Autoscaling

Lessons learnt: lack of optimal scale-out events and metrics
• Firing scale-out events
– Which metric?
– The bottleneck depends on the application: network, CPU, memory, etc.
– Our recommendation: define a synthetic metric (i.e. scaling points) and be conservative
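One way to read the "scaling points" recommendation is sketched below: each kind of WebRTC leg costs a number of points, each media server has a point budget, and scale-out fires well before the budget is exhausted. Every number here (the point costs, the budget of 1000 points, the 0.5 threshold) is a made-up assumption that would have to be calibrated per application.

```python
# Hypothetical point costs per kind of WebRTC leg; real values need measuring.
POINTS_PER_LEG = {"forwarded": 1, "recorded": 2, "transcoded": 10}
CAPACITY_POINTS = 1000        # assumed budget one media server can sustain
SCALE_OUT_THRESHOLD = 0.5     # conservative: add capacity at half the budget

def used_points(legs: dict) -> int:
    """Synthetic load of one media server, given its leg counts by kind."""
    return sum(POINTS_PER_LEG[kind] * n for kind, n in legs.items())

def should_scale_out(fleet: list) -> bool:
    """Fire a scale-out event when average utilization reaches the threshold."""
    if not fleet:
        return True
    total = sum(used_points(legs) for legs in fleet)
    return total / (CAPACITY_POINTS * len(fleet)) >= SCALE_OUT_THRESHOLD

fleet = [
    {"forwarded": 300, "transcoded": 20},   # 300 + 200 = 500 points
    {"forwarded": 100, "recorded": 50},     # 100 + 100 = 200 points
]
quiet = should_scale_out(fleet)             # 700 / 2000 = 0.35

fleet.append({"transcoded": 80})            # 800 points
busy = should_scale_out(fleet)              # 1500 / 3000 = 0.5
print(quiet, busy)
```

A single synthetic number hides whether the real bottleneck is network, CPU or memory, which is the reason for keeping the threshold conservative.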
[Figure: typical scalability curve for SFU media servers, quality of service against number of WebRTC legs, with the 50% and 40% levels marked]

Lessons learnt: scaling-in is harder than scaling-out
• The options (none of them good)
– Expose # sessions as a metric
• Depends on cloud capabilities
• AS needs to be made cloud aware
– Session migration
• AS needs to be made cloud aware
• Renegotiations
– Retain period
• Sub-optimal utilization
• The simplest
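The retain-period option above might look like the following sketch. It assumes one particular reading of the term: an instance chosen for removal stops receiving new sessions and is terminated once it is empty or once the retain period has expired. The one-hour period, the "least loaded first" choice and all names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

RETAIN_PERIOD = 3600.0  # hypothetical retain period, in seconds

@dataclass
class MediaServer:
    name: str
    sessions: int
    draining_since: Optional[float] = None   # None while accepting new sessions

def start_scale_in(fleet, now):
    """Mark the least loaded active server as draining; the MRB stops using it."""
    active = [ms for ms in fleet if ms.draining_since is None]
    if len(active) <= 1:
        return None   # always keep one server accepting sessions
    victim = min(active, key=lambda ms: ms.sessions)
    victim.draining_since = now
    return victim

def removable(fleet, now):
    """Draining servers that are empty or whose retain period has expired."""
    return [ms.name for ms in fleet
            if ms.draining_since is not None
            and (ms.sessions == 0 or now - ms.draining_since >= RETAIN_PERIOD)]

fleet = [MediaServer("MS1", 40), MediaServer("MS2", 3),
         MediaServer("MS3", 25), MediaServer("MS4", 12)]

victim = start_scale_in(fleet, now=0.0)
early = removable(fleet, now=60.0)
late = removable(fleet, now=3600.0)
print(victim.name, early, late)
```

While it drains, the instance is paid for but mostly idle, which is the sub-optimal utilization the slide mentions.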
[Figure: an MRB and an Application Server instance over MS1, MS2, MS3 and MS4. Which one would you remove?]

Limits of the (stateless) MRB
[Figure: one-to-many distribution of a media stream]

Stateful MRB
[Figure: an Application Server instance, a stateful MRB and five Media Server instances]

Why?

Stateful because …
• MRB
– Must be aware of media topology
• Stateful information about MS relationships
– Request routing depends on topology
• Where to place a new viewer?
– Request routing depends on internal state
• CPU load
• QoS
• Memory
• Etc.
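A minimal sketch of why routing needs state, assuming a one-to-many tree in which media servers relay the stream to each other. To answer "where to place a new viewer?" the broker has to know the current topology and the load of every node. The classes, the capacity figure and the breadth-first policy are illustrative assumptions, not the placement algorithm used by the authors.

```python
from dataclasses import dataclass, field

@dataclass
class MsNode:
    name: str
    capacity: int                      # max outgoing legs (hypothetical figure)
    viewers: int = 0
    children: list = field(default_factory=list)   # relay media servers

    def load(self) -> int:
        return self.viewers + len(self.children)


class StatefulMrb:
    """Toy broker that tracks the media topology (illustrative only)."""

    def __init__(self, root, spare_pool):
        self.root = root
        self.spare_pool = list(spare_pool)   # idle MS instances

    def _nodes(self):
        queue = [self.root]                  # breadth-first walk of the tree
        while queue:
            node = queue.pop(0)
            yield node
            queue.extend(node.children)

    def place_viewer(self) -> str:
        # Prefer a node near the source, keeping one leg free for a future relay.
        for node in self._nodes():
            if node.load() < node.capacity - 1:
                node.viewers += 1
                return node.name
        # Tree is full: grow it by attaching a spare MS as a relay.
        for node in self._nodes():
            if node.load() < node.capacity and self.spare_pool:
                relay = self.spare_pool.pop(0)
                node.children.append(relay)
                relay.viewers += 1
                return relay.name
        raise RuntimeError("no capacity left")


mrb = StatefulMrb(MsNode("ms-1", capacity=3), [MsNode("ms-2", capacity=3)])
placements = [mrb.place_viewer() for _ in range(4)]
print(placements)
```

A stateless broker could not make either decision, since both depend on which nodes already exist and how many legs each one carries.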

Experiences with stateful MRB in AWS EC2 & OpenStack
• Lessons learnt: beware of WebRTC internals
– Differentiated quality
• SVC is the solution
– But it's not ready
• Plain SFU forwarding models are not an option
– RTCP feedback of viewers with bad connectivity destroys QoE
• Simulcast may be an option
– Suppress feedback of viewers with really bad connectivity
• Layered transcoding works nicely
– But it's expensive
– Churn and the generation of key-frames
• Periodic key-frame generation is an option
– In VP8 expect a significant increase in BW consumption
• Layered transcoding works nicely
– But it's again expensive
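The feedback problem above can be illustrated with a simplified model in which the bitrate requested from the presenter follows the lowest bandwidth estimate reported by any viewer. Suppressing the feedback of viewers below a floor keeps one bad connection from degrading everyone. The function, the floor of 300 kbps and the sample estimates are assumptions; a real media server works on RTCP messages, not on a list of numbers.

```python
def effective_bitrate(viewer_estimates_kbps, floor_kbps):
    """Bitrate to request from the presenter, ignoring estimates below the floor."""
    usable = [e for e in viewer_estimates_kbps if e >= floor_kbps]
    if not usable:
        return floor_kbps   # everyone is below the floor: do not go lower
    return min(usable)

estimates = [2500, 1800, 90, 2200]   # hypothetical per-viewer estimates, in kbps

plain = min(estimates)                                    # 90: follows the worst viewer
filtered = effective_bitrate(estimates, floor_kbps=300)   # 1800
print(plain, filtered)
```

The viewer at 90 kbps is not helped by this filter. Serving that viewer properly still needs a lower-quality layer, from simulcast or from transcoding.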

Experiences with stateful MRB in AWS EC2 & OpenStack
• Lessons learnt: the cloud is evil
– Placement of incoming WebRTC legs
• New science required here
– Ideas?
• Our solution
– Count number of WebRTC legs (points mechanism)
– Ad-hoc, hard and error prone
– Fault-resilience
• New science required here
– Ideas?
• Our solution
– Reconstruct internal parts of the tree, but never leaves
– Requires client renegotiation
– Ad-hoc, hard and error prone
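Rebuilding the internal part of the tree can be sketched as re-parenting: when an internal media server fails, the media servers that were fed by it are attached to its own parent. This is one possible reading of the slide, written as an assumption-laden sketch. The tree is a plain child-to-parent map, capacity checks are omitted, and viewers attached directly to the failed server are not modelled because they have to renegotiate from the client side.

```python
def repair_tree(parent_of: dict, failed: str):
    """Return a repaired child->parent map and the list of re-parented nodes."""
    if failed not in parent_of:
        raise KeyError(failed)
    new_parent = parent_of[failed]
    if new_parent is None:
        raise ValueError("the root of the tree cannot be rebuilt this way")
    repaired, moved = {}, []
    for node, parent in parent_of.items():
        if node == failed:
            continue                       # drop the failed server
        if parent == failed:
            repaired[node] = new_parent    # re-attach one level up
            moved.append(node)
        else:
            repaired[node] = parent
    return repaired, sorted(moved)


# Hypothetical topology: ms-root feeds ms-a and ms-d; ms-a feeds ms-b and ms-c.
tree = {"ms-root": None, "ms-a": "ms-root", "ms-b": "ms-a",
        "ms-c": "ms-a", "ms-d": "ms-root"}

repaired, moved = repair_tree(tree, failed="ms-a")
print(moved)
print(repaired)
```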

Thanks
Luis Lopez
lulop@kurento.org
