distributed highly available
stasis app
Jöran Vinzens
@vinzens81
linkedin.com/in/jvinzens/
● motivation
● idea
● implementation
● performance
● open source
schedule
.
current setup
https://www.youtube.com/watch?v=7im9_mUY768&t
● two different "roles" for Asterisk servers
○ billing, announcement
○ pbx functionality
● pbx part in cluster with sharding of customers by ID
● up to 4 Asterisk in one call flow
● many perl AGI
● old, fragile FastAGI monolith
● patched Asterisk 11 :-/
current setup
.
demands
● high availability
● performance
● scalability
● continuous deployment
● no untested code
● low barrier to entry
● resilience
demands
.
decision
complete rebuild
Asterisk 15/16 with ARI
decision
.
change
today
● AGI / AMI
● evolved monolith
● complex call routing
tomorrow
● ARI
● modular system
● one Asterisk
change
.
idea
kamailio
dispatcher module
- dispatch calls
- observe Asterisk
- re-routing on failure
Asterisk
Stasis app / media
- determine Stasis app
- handle all SIP/media
- carrier handling
callcontroller
call logic
- PBX function
- routing logic
- back-end requests
idea
.
how does it work?
SIP / REST / ARI over Kafka
route {
    # select an Asterisk from dispatcher set 1,
    # algorithm 4 = round-robin
    $var(a) = 4;
    if (ds_select("1", "$var(a)")) {
        t_on_failure("handleFailedCalls");
        ds_next_domain();
        t_relay();
        exit;
    }
}

failure_route[handleFailedCalls] {
    # mark the failed destination as inactive/probing and
    # re-route the call to the next Asterisk in the set
    ds_mark_dst("ip");
    if (ds_select("1", "$var(a)")) {
        ds_next_domain();
        t_relay();
        exit;
    }
}
https://www.kamailio.org/docs/modules/5.1.x/modules/dispatcher.html
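The dispatcher set "1" used above is defined in a destination list file. A minimal sketch of such a file (hostnames are placeholders, not the actual setup):

```
# /etc/kamailio/dispatcher.list
# format: setid destination
1 sip:asterisk-01.example.net:5060
1 sip:asterisk-02.example.net:5060
```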
kamailio
[default]
exten => _X.,1,Stasis(callcontroller)
 same => n,Hangup(20)
extensions.conf
asterisk
asterisk
ARI
asterisk
ARI
{
  "type": "StasisStart",
  "timestamp": "2018-09-17T14:34:12.576+0200",
  "args": [],
  "channel": {
    "id": "1537187652.36",
    "name": "PJSIP/proxy-00000012",
    ...
ARI
envelope
Asterisk ID
routing key
asterisk
ARI over kafka
ARI events -> kafka event envelope
ARI commands <- kafka command envelope
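The envelope idea can be sketched as follows — a minimal Python illustration, where the field names (asteriskId, routingKey, payload) are chosen for the example rather than taken from the actual retel.io wire format:

```python
import json

def wrap_event(asterisk_id: str, routing_key: str, ari_event: dict) -> str:
    """Wrap a raw ARI event in a Kafka event envelope (illustrative field names)."""
    return json.dumps({
        "asteriskId": asterisk_id,   # which Asterisk emitted the event
        "routingKey": routing_key,   # groups all events of one call
        "payload": ari_event,        # the unmodified ARI JSON event
    })

def unwrap(envelope: str) -> tuple:
    """Read an envelope back; commands travel the same way in the other direction."""
    data = json.loads(envelope)
    return data["asteriskId"], data["routingKey"], data["payload"]
```

Because the ARI payload stays untouched, the plain ARI documentation remains valid on both ends of the Kafka pipe.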
asterisk
ARI over kafka
● one ARI-Proxy for each stasis app/asterisk
asterisk
ARI over kafka
● multiple ARI Proxy on different stasis apps for each Asterisk
asterisk
ARI over kafka
● one combined Kafka topic for each type of stasis app on all Asterisks (Events)
● one separate Kafka topic for each stasis app/Asterisk (Commands)
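Under these assumptions the topic layout boils down to two naming rules, sketched here with invented topic names:

```python
def event_topic(app_type: str) -> str:
    # all Asterisks producing events for one stasis-app type share a topic
    return f"ari-events-{app_type}"

def command_topic(app_type: str, asterisk_id: str) -> str:
    # commands target exactly one stasis app on one Asterisk
    return f"ari-commands-{app_type}-{asterisk_id}"
```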
asterisk
ARI over kafka
asterisk
ARI over kafka
asterisk
kafka routing
● all events of one call to the same call-controller
● reply to the sender
asterisk
kafka routing
StasisStart
- generate routing key (UUID)
- save channel ID -> routing key
asterisk
kafka routing
ChannelStateChange
get routing key with channel ID
asterisk
kafka routing
- get routing key on channel ID
- save routing key -> playback ID
- get routing key on playback ID
asterisk
kafka routing
PlaybackStarted
command: channel/playback
● one routing key for an entire call ("callcontext")
● all commands generating new resources will update the key map
● only StasisStart event will generate key if absent
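Put together, the routing-key bookkeeping described on these slides behaves roughly like this — a Python sketch, not the actual ARI-Proxy code:

```python
import uuid

class RoutingKeyMap:
    """Maps ARI resource IDs (channels, playbacks, ...) to one routing key per call."""

    def __init__(self):
        self._keys = {}  # resource ID -> routing key

    def on_stasis_start(self, channel_id):
        # only StasisStart generates a key, and only if absent
        return self._keys.setdefault(channel_id, str(uuid.uuid4()))

    def key_for(self, resource_id):
        # e.g. ChannelStateChange or PlaybackStarted: look up the existing key
        return self._keys.get(resource_id)

    def on_creating_command(self, known_id, new_id):
        # commands that create resources (e.g. a playback on a channel)
        # register the new resource under the same key
        key = self._keys[known_id]
        self._keys[new_id] = key
        return key
```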
still todo:
● generate channel out of nothing (ARI originate)
not finished
asterisk
kafka routing
goal
● transparent Asterisk server farm
● use ARI documentation for development
● restart-safe
● easy to scale
not finished
callcontroller
architecture
● akka
● actor system
● akka finite state machine (FSM) for "call application"
not finished
callcontroller
callcontroller
callcontroller
routing to correct app
- different apps doing different call types
- each call generates new instance
callcontroller
commands topic is taken from ARI envelope
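The per-call dispatch can be sketched like this — a Python stand-in for the Akka actor-per-call design, with all names chosen for illustration:

```python
class CallControllerRouter:
    """Routes every envelope to the app instance that owns its routing key."""

    def __init__(self, pick_app):
        self._pick_app = pick_app  # chooses the app class for a call type
        self._calls = {}           # routing key -> app instance

    def handle(self, envelope):
        key = envelope["routingKey"]
        event = envelope["payload"]
        if event["type"] == "StasisStart":
            # each new call gets a fresh instance of the right app
            self._calls[key] = self._pick_app(event)()
        app = self._calls.get(key)
        if app is not None:
            app.on_event(event)
```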
.
load capacity & timing
● all measurements taken without load
● time from event "StasisStart" / "INVITE" to command "answer" / "200 OK"
● median value taken
● basic call to an announcement
timing
- 9 ms
- 11 ms (+2 ms)
- 33 ms (+22 ms)
- 54 ms (+21 ms)
- 67 ms (+13 ms)
- 202 ms (+135 ms)
- 2 ms: SIP INVITE -> ARI StasisStart
- 133 ms: SIP 200 OK <- REST command "answer"
(chart legend: "ARI over kafka" / "SIP REST")
load capacity
● produce load with sipp locally on the Asterisk
● increase load until the system breaks
● basic call to announcement
● 13 events for each call
● 7 commands / reply for each call
● 30 sec call duration with RTP
each call -> answer, playback, recording, playback, playback, hangup, delete file
load capacity
● 2 hardware machines running Asterisk / ARI-Proxy
● 3 VM running kafka
● 2 VM running callcontroller
● 1 VM running redis
load capacity
● test run 1
○ high load over long time to find memory leaks
● test run 2
○ increased load until PDD (post-dial delay) increased or something crashed
load capacity
test run 1
each Asterisk
- 4 cps
- ~15h run
- 213k calls
- 0.0141% failure
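The test-run-1 numbers are consistent with each other; a quick sanity check, assuming the quoted rate and call count:

```python
cps = 4            # calls per second per Asterisk
hours = 15
calls = cps * 3600 * hours
print(calls)                     # 216000, in line with the ~213k calls observed

failed = 213_000 * 0.000141      # 0.0141 % failure rate
print(round(failed))             # roughly 30 failed calls over the whole run
```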
load capacity
test run 1
load capacity
test run 2
load capacity
test run 2
load capacity
test run 2
- limit at 4.8 cps
- limit due to single-threaded Java
load capacity
.
fitting the demands
● high availability
● performance
● scalability
● continuous deployment
● no untested code
● low barrier to entry
● resilience
fitting the demands
.
opensource
https://retel.io
reasons
● clean code without "sipgate" stuff
● giving back to the community
opensource
.
alternatives
connectors
- go ARI Proxy -> github.com/CyCoreSystems/ari-proxy -> NATS
- go ARI Proxy -> github.com/nvisibleinc/go-ari-proxy -> RabbitMQ, NATS
- AsterNET -> github.com/skrusty/AsterNET-ARI-Proxy -> RabbitMQ
alternatives
library
- go -> github.com/CyCoreSystems/ari
- phpari -> github.com/greenfieldtech-nirs/phpari
- python -> github.com/asterisk/ari-py
- node -> github.com/asterisk/node-ari-client
- ruby -> github.com/svoboda-jan/asterisk-ari (no commits in >2 years)
...and many more...
alternatives
.
questions
astricon2018