When using GStreamer to create media middleware and media infrastructures, performance becomes critical for achieving the appropriate scalability without degrading end-user QoE. However, GStreamer does not provide off-the-shelf tools for that objective.
In this talk, we present the efforts carried out during the last year to improve the performance of the Kurento Media Server. We present our main principle: “you cannot improve what you cannot measure”. Building on it, we introduce different techniques for benchmarking large GStreamer pipelines, including callgrind, time profiling, gst-meta profiling, chain profiling, etc. We present results for different pipeline configurations and topologies. After that, we introduce some evolutions of GStreamer that could help optimize performance, such as the pervasive use of buffer lists, the introduction of thread pools, and the appropriate management of queues.
To conclude, we present some preliminary work carried out in the GStreamer community to implement such optimizations, and we discuss their advantages and drawbacks.
Improving GStreamer performance on large pipelines: from profiling to optimization
1. Improving GStreamer performance on large pipelines: from profiling to optimization
GStreamer Conference 2015, 8-9 October 2015, Dublin, Ireland
Miguel París
mparisdiaz@gmail.com
2. Who I am
Miguel París
● Software Engineer
● Telematic Systems Master's
● Researcher at Universidad Rey Juan Carlos (Madrid, Spain)
● Kurento real-time manager
● mparisdiaz@gmail.com
● Twitter: @mparisdiaz
3. Overview
GStreamer is quite good for developing multimedia apps, tools, etc. in an easy way.
However, it could be more efficient.
The first step: measuring / profiling
● Main principle: “you cannot improve what you cannot measure”
  – Detecting bottlenecks
  – Measuring the gain of the possible solutions
  – Comparing different solutions
● In large pipelines, a “small” performance improvement can make a “big” difference
  – The same holds when running a lot of pipelines on the same machine
4. Profiling levels
● Different levels of detail: the more detailed, the more overhead (typically)
● High level
  – Number of threads: ps -o nlwp <pid>
  – CPU: top, perf stat -p <pid>
● Medium level
  – time-profiling: how much time is spent in each GstElement (using GstTracer)
    ● An easy way to determine which elements are the bottlenecks (a tracer sketch follows this list)
    ● do_push_buffer_(pre|post), do_push_buffer_list_(pre|post)
    ● Reducing the overhead as much as possible
      – Avoid memory alloc/free: all timestamps are stored in statically pre-allocated memory
      – Avoid logs: all entries are logged at the end of the execution
      – Post-processing: logs are written in a CSV format that can be processed by an R script
  – latency-profiling: latency added by each Kurento element (using GstMeta)
● Low level: which functions consume more CPU (using callgrind)
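To make the time-profiling idea concrete, here is a minimal sketch of such a tracer on top of the GstTracer hook API (which was still settling at the time of this talk). The type name MyTimeTracer and the qdata-based storage are assumptions for illustration, not the actual Kurento tool, which accumulates its measurements in pre-allocated memory and dumps CSV at the end.

    #include <gst/gst.h>

    /* Hypothetical tracer type; the name MyTimeTracer is illustrative */
    typedef struct { GstTracer parent; } MyTimeTracer;
    typedef struct { GstTracerClass parent_class; } MyTimeTracerClass;

    G_DEFINE_TYPE (MyTimeTracer, my_time_tracer, GST_TYPE_TRACER);

    static GQuark ts_quark;

    /* Right before gst_pad_push(): remember when the buffer left this pad.
     * Stashing the timestamp as qdata avoids per-buffer allocations
     * (assumes 64-bit pointers for brevity) */
    static void
    do_push_buffer_pre (GObject *self, GstClockTime ts, GstPad *pad,
        GstBuffer *buffer)
    {
      g_object_set_qdata (G_OBJECT (pad), ts_quark, GSIZE_TO_POINTER (ts));
    }

    /* Right after gst_pad_push() returns: the elapsed time was spent in the
     * downstream (peer) element's chain function */
    static void
    do_push_buffer_post (GObject *self, GstClockTime ts, GstPad *pad,
        GstFlowReturn res)
    {
      GstClockTime pre = GPOINTER_TO_SIZE (g_object_get_qdata (G_OBJECT (pad),
          ts_quark));
      GstPad *peer = gst_pad_get_peer (pad);
      GstElement *elem = peer ? gst_pad_get_parent_element (peer) : NULL;

      if (elem != NULL) {
        /* The real tool accumulates these values in pre-allocated memory and
         * dumps CSV at the end; per-buffer logging is just for simplicity */
        GST_INFO ("%s: %" G_GUINT64_FORMAT " ns",
            GST_ELEMENT_NAME (elem), (guint64) (ts - pre));
        gst_object_unref (elem);
      }
      if (peer != NULL)
        gst_object_unref (peer);
    }

    static void
    my_time_tracer_class_init (MyTimeTracerClass *klass)
    {
      ts_quark = g_quark_from_static_string ("my-time-tracer-ts");
    }

    static void
    my_time_tracer_init (MyTimeTracer *self)
    {
      gst_tracing_register_hook (GST_TRACER (self), "pad-push-pre",
          G_CALLBACK (do_push_buffer_pre));
      gst_tracing_register_hook (GST_TRACER (self), "pad-push-post",
          G_CALLBACK (do_push_buffer_post));
    }

Such a tracer would be registered from a plugin with gst_tracer_register() and enabled at runtime through the GST_TRACERS environment variable.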
5. Applying solutions
● Functions: work top-down, repeating this process:
  1) Remove unnecessary code
  2) Reduce calls
     a) Is the call needed more than once?
     b) Reuse results (CPU vs memory trade-off; see the sketch after this list)
  3) Go into more low-level functions
● GstElements
  1) Remove unnecessary elements
  2) Reduce/reuse elements
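A hypothetical micro-example of step 2.b): if a value is re-computed for every buffer but only changes rarely, cache it. All names below are invented for the illustration.

    #include <gst/gst.h>

    /* Cache for the current caps of a single pad (kept global for brevity) */
    static GstCaps *cached_caps = NULL;

    static GstCaps *
    get_caps_cached (GstPad *pad)
    {
      /* Previously this query ran once per buffer; now only on a cache miss */
      if (G_UNLIKELY (cached_caps == NULL))
        cached_caps = gst_pad_get_current_caps (pad);
      return cached_caps;
    }

    /* Call this from the CAPS event handler so the next buffer re-queries */
    static void
    invalidate_caps_cache (void)
    {
      gst_caps_replace (&cached_caps, NULL);
    }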
6. Case study I
● The one2many case (a simplified sketch of the topology follows)
● What do we want to improve?
  – Increase the number of senders on a machine
  – Reduce the resources consumed for a fixed number of viewers
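As a rough, hypothetical stand-in for the one2many topology: one live source fanned out to N viewers through a tee. videotestsrc and fakesink are placeholders for the real WebRTC endpoints, and gst_init() is assumed to have been called.

    #include <gst/gst.h>

    /* Build "videotestsrc ! tee" with one "queue ! fakesink" branch per viewer */
    static GstElement *
    build_one2many (guint n_viewers)
    {
      GString *desc = g_string_new ("videotestsrc is-live=true ! tee name=t");
      GstElement *pipeline;
      guint i;

      for (i = 0; i < n_viewers; i++)
        g_string_append_printf (desc, " t. ! queue ! fakesink name=viewer%u", i);

      pipeline = gst_parse_launch (desc->str, NULL);
      g_string_free (desc, TRUE);
      return pipeline;
    }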
8. Case study III
● Analyzing the sender part of the pipeline
● We detected that:
  – funnel is quite inefficient
    ● https://bugzilla.gnome.org/show_bug.cgi?id=749315
  – srtpenc does unnecessary work
    ● https://bugzilla.gnome.org/show_bug.cgi?id=752774
12. funnel: solution
● Applying solution type 2.a): send sticky events only once
● Add a property to the funnel element (“forward-sticky-events”)
  – If set to FALSE, sticky events are not forwarded on sink pad changes
Results
  – CPU improvement: ~100%
  – Time before: 147166 ns
  – Time after: 5829 ns
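From application code, using the new property is a one-liner; the snippet below assumes only the “forward-sticky-events” property introduced above.

    GstElement *funnel = gst_element_factory_make ("funnel", NULL);

    /* Do not re-send sticky events every time the active sink pad changes */
    g_object_set (funnel, "forward-sticky-events", FALSE, NULL);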
15. latency-profiling
● Mark buffers with a timestamp using GstMeta (a minimal sketch follows this list)
● It adds a considerable overhead; to reduce it:
  – Sampling (do not profile every buffer)
  – A GstMeta pool?
● DEMO (WebRtcEp + FaceOverlay)
  – Real-time profiling
  – WebRTC, decoding, video processing, encoding...
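A minimal sketch of the mechanism, with invented names (MarkerMeta, mark_buffer, measure_latency) rather than Kurento's actual implementation: a custom GstMeta carries an entry timestamp, a pad probe at the start of the measured section stamps (a sample of) the buffers, and a probe at the end reads the stamp back.

    #include <gst/gst.h>

    /* Hypothetical meta carrying the time a buffer entered the measured section */
    typedef struct {
      GstMeta meta;
      GstClockTime entry_ts;
    } MarkerMeta;

    static gboolean
    marker_meta_init (GstMeta *meta, gpointer params, GstBuffer *buffer)
    {
      ((MarkerMeta *) meta)->entry_ts = GST_CLOCK_TIME_NONE;
      return TRUE;
    }

    static GType
    marker_meta_api_get_type (void)
    {
      static GType type = 0;
      static const gchar *tags[] = { NULL };

      if (g_once_init_enter (&type))
        g_once_init_leave (&type,
            gst_meta_api_type_register ("MarkerMetaAPI", tags));
      return type;
    }

    static const GstMetaInfo *
    marker_meta_get_info (void)
    {
      static const GstMetaInfo *info = NULL;

      if (g_once_init_enter (&info)) {
        /* No transform function for brevity: the meta is not copied along */
        const GstMetaInfo *mi = gst_meta_register (marker_meta_api_get_type (),
            "MarkerMeta", sizeof (MarkerMeta), marker_meta_init, NULL, NULL);
        g_once_init_leave (&info, mi);
      }
      return info;
    }

    /* Probe at the entry pad: stamp the buffer (sample here to cut overhead;
     * assumes the buffer is writable at this point) */
    static GstPadProbeReturn
    mark_buffer (GstPad *pad, GstPadProbeInfo *info, gpointer user_data)
    {
      GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER (info);
      MarkerMeta *m = (MarkerMeta *)
          gst_buffer_add_meta (buf, marker_meta_get_info (), NULL);

      m->entry_ts = gst_util_get_timestamp ();
      return GST_PAD_PROBE_OK;
    }

    /* Probe at the exit pad: the difference is the latency added in between */
    static GstPadProbeReturn
    measure_latency (GstPad *pad, GstPadProbeInfo *info, gpointer user_data)
    {
      GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER (info);
      MarkerMeta *m = (MarkerMeta *)
          gst_buffer_get_meta (buf, marker_meta_api_get_type ());

      if (m != NULL)
        GST_INFO ("added latency: %" GST_TIME_FORMAT,
            GST_TIME_ARGS (gst_util_get_timestamp () - m->entry_ts));
      return GST_PAD_PROBE_OK;
    }

The probes would be attached with gst_pad_add_probe (pad, GST_PAD_PROBE_TYPE_BUFFER, mark_buffer, NULL, NULL) on the first pad of the section being measured, and likewise with measure_latency on the last one.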
16. General remarks (BufferLists)
● Use BufferLists whenever you can (see the sketch after this list)
  – Pushing buffers through pads is not free
  – Really important in large pipelines
● Pushing a BufferList through pads costs the same CPU as pushing a single buffer
● Pushing a BufferList through some elements costs the same CPU as pushing a single buffer, e.g. tee, queue
● Kurento has funded and contributed to the BufferList support of a lot of elements
● Open discussion: queue: add a property to allow pushing all queued buffers together
  – https://bugzilla.gnome.org/show_bug.cgi?id=746524
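A minimal sketch of the batching idea with the core GstBufferList API; the helper name push_as_list is invented for the example.

    #include <gst/gst.h>

    /* Push n buffers downstream with one gst_pad_push_list() instead of n pushes */
    static GstFlowReturn
    push_as_list (GstPad *srcpad, GstBuffer **buffers, guint n)
    {
      GstBufferList *list = gst_buffer_list_new_sized (n);
      guint i;

      for (i = 0; i < n; i++)
        gst_buffer_list_add (list, buffers[i]);   /* list takes ownership */

      /* One traversal of the downstream chain for the whole batch; elements
       * such as tee and queue can forward the list as a single unit */
      return gst_pad_push_list (srcpad, list);
    }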
17. General remarks (BufferPool)
● Extending the usage of BufferPool
  – A significant share of CPU is spent allocating / freeing buffers
  – Nowadays, memory is much cheaper than CPU; let's take advantage of this
● Example (sketched below)
  – Buffers of different sizes, but always smaller than 1500 bytes, are allocated
  – Configure a BufferPool to generate buffers of 1500 bytes and reuse them in a BaseSrc, queue, RTP payloader, etc.
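A sketch of that example with the stock GstBufferPool API; the helper names and the min/max buffer counts are illustrative assumptions.

    #include <gst/gst.h>

    /* One pool of fixed-size 1500-byte buffers, recycled instead of
     * allocated and freed per packet (NULL caps = plain memory buffers) */
    static GstBufferPool *
    make_packet_pool (void)
    {
      GstBufferPool *pool = gst_buffer_pool_new ();
      GstStructure *config = gst_buffer_pool_get_config (pool);

      gst_buffer_pool_config_set_params (config, NULL, 1500, 16, 0);
      gst_buffer_pool_set_config (pool, config);
      gst_buffer_pool_set_active (pool, TRUE);
      return pool;
    }

    /* Producer side: acquire from the pool instead of gst_buffer_new_allocate();
     * unreffing the buffer later returns it to the pool automatically */
    static GstBuffer *
    get_packet_buffer (GstBufferPool *pool)
    {
      GstBuffer *buf = NULL;

      gst_buffer_pool_acquire_buffer (pool, &buf, NULL);
      return buf;
    }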
18. General remarks (Threading)
● GStreamer could be improved a lot in its threading
  – Each GstTask has its own thread
  – That thread is idle most of the time
  – A lot of threads → too many context switches → wasted CPU
● The Kurento team proposes using thread pools and avoiding blocking threads (see the sketch after this list)
● Kurento has funded the development of the first implementation of TaskPool (thanks, Sebastian ;) )
  – http://cgit.freedesktop.org/~slomo/gstreamer/log/?h=task-pool
  – It is not finished; let's try to push it forward
● Ambitious architecture change
  – Sync vs async
  – Move to a reactive architecture
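The referenced branch reworks how elements schedule their streaming threads; independently of it, core GStreamer already exposes a GstTaskPool that shows the direction. A minimal sketch, with invented function names:

    #include <gst/gst.h>

    /* One short, non-blocking unit of work */
    static void
    work_item (gpointer user_data)
    {
      /* process one chunk and return; never block here */
    }

    static void
    run_on_shared_pool (void)
    {
      GstTaskPool *pool = gst_task_pool_new ();
      GError *err = NULL;

      gst_task_pool_prepare (pool, &err);

      /* Work items share the pool's threads instead of each owning one */
      gst_task_pool_push (pool, work_item, NULL, &err);

      gst_task_pool_cleanup (pool);   /* waits for outstanding items */
      gst_object_unref (pool);
    }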
19. Conclusion/Future work
● Take performance into account
  – Performance can be as important as having a feature work properly
  – Processing-time restrictions
  – Embedded devices
● Automatic profiling
  – Reduce manual work
  – Continuous integration: pass criteria for accepting a commit
  – Warnings