When using GStreamer to create media middleware and media infrastructures, performance becomes critical for achieving the appropriate scalability without degrading end-user QoE. However, GStreamer does not provide off-the-shelf tools for that objective.
In this talk, we present the efforts carried out during the last year to improve the performance of the Kurento Media Server. We present our main principle: “you cannot improve what you cannot measure”. Building on it, we introduce different techniques for benchmarking large GStreamer pipelines, including callgrind, time profiling, gst-meta profiling, chain profiling, etc. We present results for different pipeline configurations and topologies. After that, we introduce some evolutions of GStreamer that could help optimize performance, such as the pervasive use of buffer lists, the introduction of thread pools, and the appropriate management of queues.
To conclude, we present some preliminary work carried out in the GStreamer community to implement such optimizations, and we discuss their advantages and drawbacks.
Improving GStreamer performance on large pipelines: from profiling to optimization
1. Improving GStreamer performance on large pipelines: from profiling to optimization
GStreamer Conference 2015, 8-9 October 2015, Dublin, Ireland
Miguel París
mparisdiaz@gmail.com
2. Who I am
Miguel París
● Software Engineer
● Telematic Systems Master's
● Researcher at Universidad Rey Juan Carlos (Madrid, Spain)
● Kurento real-time manager
● mparisdiaz@gmail.com
● Twitter: @mparisdiaz
3. Overview
GStreamer is quite good for developing multimedia apps, tools, etc. in an easy way.
However, it could be more efficient.
The first step: measuring / profiling
● Main principle: “you cannot improve what you cannot measure”
  – Detecting bottlenecks
  – Measuring the gain of the possible solutions
  – Comparing different solutions
● In large pipelines, a “small” performance improvement can make a “big” difference
  – The same holds when running a lot of pipelines on the same machine
4. Profiling levels
● Different levels of detail: the more detailed, the more overhead (typically)
● High level
  – Number of threads: ps -o nlwp <pid>
  – CPU: top, perf stat -p <pid>
● Medium level
  – time-profiling: how much time is spent in each GstElement (using GstTracer)
    ● An easy way to determine which elements are the bottlenecks (a tracer sketch follows this list)
    ● do_push_buffer_(pre|post), do_push_buffer_list_(pre|post)
    ● Reducing the overhead as much as possible
      – Avoid memory alloc/free: all timestamps are stored in statically pre-allocated memory
      – Avoid logs: all entries are logged at the end of the execution
      – Post-processing: logs are written in a CSV format that can be processed by an R script
  – latency-profiling: latency added by each Kurento element (using GstMeta)
● Low level: which functions consume more CPU (using callgrind)
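To make the time-profiling idea concrete, here is a minimal sketch of such a tracer on top of the GstTracer hook API (which was still settling at the time of this talk). The type name MyTimeTracer and the qdata-based storage are assumptions for illustration, not the actual Kurento tool, which accumulates its measurements in pre-allocated memory and dumps CSV at the end.

    #include <gst/gst.h>

    /* Hypothetical tracer type; the name MyTimeTracer is illustrative */
    typedef struct { GstTracer parent; } MyTimeTracer;
    typedef struct { GstTracerClass parent_class; } MyTimeTracerClass;

    G_DEFINE_TYPE (MyTimeTracer, my_time_tracer, GST_TYPE_TRACER);

    static GQuark ts_quark;

    /* Right before gst_pad_push(): remember when the buffer left this pad.
     * Stashing the timestamp as qdata avoids per-buffer allocations
     * (assumes 64-bit pointers for brevity) */
    static void
    do_push_buffer_pre (GObject *self, GstClockTime ts, GstPad *pad,
        GstBuffer *buffer)
    {
      g_object_set_qdata (G_OBJECT (pad), ts_quark, GSIZE_TO_POINTER (ts));
    }

    /* Right after gst_pad_push() returns: the elapsed time was spent in the
     * downstream (peer) element's chain function */
    static void
    do_push_buffer_post (GObject *self, GstClockTime ts, GstPad *pad,
        GstFlowReturn res)
    {
      GstClockTime pre = GPOINTER_TO_SIZE (g_object_get_qdata (G_OBJECT (pad),
          ts_quark));
      GstPad *peer = gst_pad_get_peer (pad);
      GstElement *elem = peer ? gst_pad_get_parent_element (peer) : NULL;

      if (elem != NULL) {
        /* The real tool accumulates these values in pre-allocated memory and
         * dumps CSV at the end; per-buffer logging is just for simplicity */
        GST_INFO ("%s: %" G_GUINT64_FORMAT " ns",
            GST_ELEMENT_NAME (elem), (guint64) (ts - pre));
        gst_object_unref (elem);
      }
      if (peer != NULL)
        gst_object_unref (peer);
    }

    static void
    my_time_tracer_class_init (MyTimeTracerClass *klass)
    {
      ts_quark = g_quark_from_static_string ("my-time-tracer-ts");
    }

    static void
    my_time_tracer_init (MyTimeTracer *self)
    {
      gst_tracing_register_hook (GST_TRACER (self), "pad-push-pre",
          G_CALLBACK (do_push_buffer_pre));
      gst_tracing_register_hook (GST_TRACER (self), "pad-push-post",
          G_CALLBACK (do_push_buffer_post));
    }

Such a tracer would be registered from a plugin with gst_tracer_register() and enabled at runtime through the GST_TRACERS environment variable.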
5. Applying solutions
● Functions: work top-down, repeating this process:
  1) Remove unnecessary code
  2) Reduce calls
     a) Is the call needed more than once?
     b) Reuse results (CPU vs memory trade-off; see the sketch after this list)
  3) Go into more low-level functions
● GstElements
  1) Remove unnecessary elements
  2) Reduce/reuse elements
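A hypothetical micro-example of step 2.b): if a value is re-computed for every buffer but only changes rarely, cache it. All names below are invented for the illustration.

    #include <gst/gst.h>

    /* Cache for the current caps of a single pad (kept global for brevity) */
    static GstCaps *cached_caps = NULL;

    static GstCaps *
    get_caps_cached (GstPad *pad)
    {
      /* Previously this query ran once per buffer; now only on a cache miss */
      if (G_UNLIKELY (cached_caps == NULL))
        cached_caps = gst_pad_get_current_caps (pad);
      return cached_caps;
    }

    /* Call this from the CAPS event handler so the next buffer re-queries */
    static void
    invalidate_caps_cache (void)
    {
      gst_caps_replace (&cached_caps, NULL);
    }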
6. Case study I
● The one2many case (a simplified sketch of the topology follows)
● What do we want to improve?
  – Increase the number of senders on a machine
  – Reduce the resources consumed for a fixed number of viewers
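As a rough, hypothetical stand-in for the one2many topology: one live source fanned out to N viewers through a tee. videotestsrc and fakesink are placeholders for the real WebRTC endpoints, and gst_init() is assumed to have been called.

    #include <gst/gst.h>

    /* Build "videotestsrc ! tee" with one "queue ! fakesink" branch per viewer */
    static GstElement *
    build_one2many (guint n_viewers)
    {
      GString *desc = g_string_new ("videotestsrc is-live=true ! tee name=t");
      GstElement *pipeline;
      guint i;

      for (i = 0; i < n_viewers; i++)
        g_string_append_printf (desc, " t. ! queue ! fakesink name=viewer%u", i);

      pipeline = gst_parse_launch (desc->str, NULL);
      g_string_free (desc, TRUE);
      return pipeline;
    }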
8. Case study III
● Analyzing the sender part of the pipeline
● We detected that:
  – funnel is quite inefficient
    ● https://bugzilla.gnome.org/show_bug.cgi?id=749315
  – srtpenc does unnecessary work
    ● https://bugzilla.gnome.org/show_bug.cgi?id=752774
12. funnel: solution
● Applying solution type 2.a): send sticky events only once
● Add a property to the funnel element (“forward-sticky-events”)
  – If set to FALSE, sticky events are not forwarded on sink pad changes
Results
  – CPU improvement: ~100%
  – Time before: 147166 ns
  – Time after: 5829 ns
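From application code, using the new property is a one-liner; the snippet below assumes only the “forward-sticky-events” property introduced above.

    GstElement *funnel = gst_element_factory_make ("funnel", NULL);

    /* Do not re-send sticky events every time the active sink pad changes */
    g_object_set (funnel, "forward-sticky-events", FALSE, NULL);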
15. latency-profiling
● Mark buffers with a timestamp using GstMeta (a minimal sketch follows this list)
● It adds a considerable overhead; to reduce it:
  – Sampling (do not profile every buffer)
  – A GstMeta pool?
● DEMO (WebRtcEp + FaceOverlay)
  – Real-time profiling
  – WebRTC, decoding, video processing, encoding...
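A minimal sketch of the mechanism, with invented names (MarkerMeta, mark_buffer, measure_latency) rather than Kurento's actual implementation: a custom GstMeta carries an entry timestamp, a pad probe at the start of the measured section stamps (a sample of) the buffers, and a probe at the end reads the stamp back.

    #include <gst/gst.h>

    /* Hypothetical meta carrying the time a buffer entered the measured section */
    typedef struct {
      GstMeta meta;
      GstClockTime entry_ts;
    } MarkerMeta;

    static gboolean
    marker_meta_init (GstMeta *meta, gpointer params, GstBuffer *buffer)
    {
      ((MarkerMeta *) meta)->entry_ts = GST_CLOCK_TIME_NONE;
      return TRUE;
    }

    static GType
    marker_meta_api_get_type (void)
    {
      static GType type = 0;
      static const gchar *tags[] = { NULL };

      if (g_once_init_enter (&type))
        g_once_init_leave (&type,
            gst_meta_api_type_register ("MarkerMetaAPI", tags));
      return type;
    }

    static const GstMetaInfo *
    marker_meta_get_info (void)
    {
      static const GstMetaInfo *info = NULL;

      if (g_once_init_enter (&info)) {
        /* No transform function for brevity: the meta is not copied along */
        const GstMetaInfo *mi = gst_meta_register (marker_meta_api_get_type (),
            "MarkerMeta", sizeof (MarkerMeta), marker_meta_init, NULL, NULL);
        g_once_init_leave (&info, mi);
      }
      return info;
    }

    /* Probe at the entry pad: stamp the buffer (sample here to cut overhead;
     * assumes the buffer is writable at this point) */
    static GstPadProbeReturn
    mark_buffer (GstPad *pad, GstPadProbeInfo *info, gpointer user_data)
    {
      GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER (info);
      MarkerMeta *m = (MarkerMeta *)
          gst_buffer_add_meta (buf, marker_meta_get_info (), NULL);

      m->entry_ts = gst_util_get_timestamp ();
      return GST_PAD_PROBE_OK;
    }

    /* Probe at the exit pad: the difference is the latency added in between */
    static GstPadProbeReturn
    measure_latency (GstPad *pad, GstPadProbeInfo *info, gpointer user_data)
    {
      GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER (info);
      MarkerMeta *m = (MarkerMeta *)
          gst_buffer_get_meta (buf, marker_meta_api_get_type ());

      if (m != NULL)
        GST_INFO ("added latency: %" GST_TIME_FORMAT,
            GST_TIME_ARGS (gst_util_get_timestamp () - m->entry_ts));
      return GST_PAD_PROBE_OK;
    }

The probes would be attached with gst_pad_add_probe (pad, GST_PAD_PROBE_TYPE_BUFFER, mark_buffer, NULL, NULL) on the first pad of the section being measured, and likewise with measure_latency on the last one.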
16. General remarks (BufferLists)
● Use BufferLists whenever you can (see the sketch after this list)
  – Pushing buffers through pads is not free
  – Really important in large pipelines
● Pushing a BufferList through pads costs the same CPU as pushing a single buffer
● Pushing a BufferList through some elements costs the same CPU as pushing a single buffer, e.g. tee, queue
● Kurento has funded and contributed to the BufferList support of a lot of elements
● Open discussion: queue: add a property to allow pushing all queued buffers together
  – https://bugzilla.gnome.org/show_bug.cgi?id=746524
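A minimal sketch of the batching idea with the core GstBufferList API; the helper name push_as_list is invented for the example.

    #include <gst/gst.h>

    /* Push n buffers downstream with one gst_pad_push_list() instead of n pushes */
    static GstFlowReturn
    push_as_list (GstPad *srcpad, GstBuffer **buffers, guint n)
    {
      GstBufferList *list = gst_buffer_list_new_sized (n);
      guint i;

      for (i = 0; i < n; i++)
        gst_buffer_list_add (list, buffers[i]);   /* list takes ownership */

      /* One traversal of the downstream chain for the whole batch; elements
       * such as tee and queue can forward the list as a single unit */
      return gst_pad_push_list (srcpad, list);
    }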
17. General remarks (BufferPool)
● Extending the usage of BufferPool
  – A significant share of CPU is spent allocating / freeing buffers
  – Nowadays, memory is much cheaper than CPU; let's take advantage of this
● Example (sketched below)
  – Buffers of different sizes, but always smaller than 1500 bytes, are allocated
  – Configure a BufferPool to generate buffers of 1500 bytes and reuse them in a BaseSrc, queue, RTP payloader, etc.
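A sketch of that example with the stock GstBufferPool API; the helper names and the min/max buffer counts are illustrative assumptions.

    #include <gst/gst.h>

    /* One pool of fixed-size 1500-byte buffers, recycled instead of
     * allocated and freed per packet (NULL caps = plain memory buffers) */
    static GstBufferPool *
    make_packet_pool (void)
    {
      GstBufferPool *pool = gst_buffer_pool_new ();
      GstStructure *config = gst_buffer_pool_get_config (pool);

      gst_buffer_pool_config_set_params (config, NULL, 1500, 16, 0);
      gst_buffer_pool_set_config (pool, config);
      gst_buffer_pool_set_active (pool, TRUE);
      return pool;
    }

    /* Producer side: acquire from the pool instead of gst_buffer_new_allocate();
     * unreffing the buffer later returns it to the pool automatically */
    static GstBuffer *
    get_packet_buffer (GstBufferPool *pool)
    {
      GstBuffer *buf = NULL;

      gst_buffer_pool_acquire_buffer (pool, &buf, NULL);
      return buf;
    }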
18. General remarks (Threading)
● GStreamer could be improved a lot in its threading
  – Each GstTask has its own thread
  – That thread is idle most of the time
  – A lot of threads → too many context switches → wasted CPU
● The Kurento team proposes using thread pools and avoiding blocking threads (see the sketch after this list)
● Kurento has funded the development of the first implementation of TaskPool (thanks, Sebastian ;) )
  – http://cgit.freedesktop.org/~slomo/gstreamer/log/?h=task-pool
  – It is not finished; let's try to push it forward
● Ambitious architecture change
  – Sync vs async
  – Move to a reactive architecture
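The referenced branch reworks how elements schedule their streaming threads; independently of it, core GStreamer already exposes a GstTaskPool that shows the direction. A minimal sketch, with invented function names:

    #include <gst/gst.h>

    /* One short, non-blocking unit of work */
    static void
    work_item (gpointer user_data)
    {
      /* process one chunk and return; never block here */
    }

    static void
    run_on_shared_pool (void)
    {
      GstTaskPool *pool = gst_task_pool_new ();
      GError *err = NULL;

      gst_task_pool_prepare (pool, &err);

      /* Work items share the pool's threads instead of each owning one */
      gst_task_pool_push (pool, work_item, NULL, &err);

      gst_task_pool_cleanup (pool);   /* waits for outstanding items */
      gst_object_unref (pool);
    }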
19. Conclusion/Future work
● Take performance into account
  – Performance can be as important as having a feature work properly
  – Processing-time restrictions
  – Embedded devices
● Automatic profiling
  – Reduce manual work
  – Continuous integration: pass criteria for accepting a commit
  – Warnings