systemd @ FB – a year later
Davide Cavalca
Production Engineer
Agenda
• Recap
• Tracking upstream
• Resource management
• Service monitoring
• Case studies
• Advocacy
Recap
Recap
CentOS 7 migration
• 100% of the bare metal fleet on CentOS 7!
• Migrated countless services to systemd
• libsystemd integration in our build system
• Containers: see Zeal’s talk later today!
Tracking upstream
Tracking upstream
Staying up to date
• systemd 231 → 232 → 233 → (234 → 235)
• Also tracking util-linux, dbus, etc.
• Published our Rawhide-based backports on:
https://github.com/facebookincubator/rpm-backports
• Binary RPMs based on it on:
https://copr.fedorainfracloud.org/coprs/jsynacek/systemd-backports-for-centos-7/
Tracking upstream
RPM issues
• Not specific to systemd
• Duplicate systemd RPMs: package-cleanup wrapper
• rpmdb corruption: dcrpm
• Mismatch between systemd and systemd-libs
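# Reinstall if the systemd binary cannot resolve its own shared libraries
# ($systemd_packages is left unquoted so it expands to several package names)
if ldd /usr/lib/systemd/systemd | grep 'systemd.*not found$'; then
  yum reinstall -y $systemd_packages
fi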
Tracking upstream
Meson and compat-libs
• Rebuild packaging for the Meson transition
• Backported meson, ninja-build in CentOS
• Standalone systemd-compat-libs
https://github.com/facebookincubator/systemd-compat-libs
Tracking upstream
tty woes with 234
• When rolling 234 we discovered a race in the kernel tty
subsystem (repros all the way back to 4.0)
• Turns out both systemd and Tupperware use the real tty0
• Investigation still in progress, likely a use-after-free bug
• Tupperware should probably just use a pty here
Resource management
Resource management
Rolling out cgroup2
• See Chris’s talk tomorrow for all things cgroup2!
• Using systemd to partition services and apply limits
• Lightweight daemon to collect metrics from /sys/fs/cgroup
• Chef API to apply configurations and manage experiments
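The metrics collection mentioned above could look roughly like the following. This is only a sketch, not the daemon described in the talk: it assumes a cgroup2 (unified) hierarchy, and the function names and the choice of `memory.current` and `cpu.stat` are illustrative assumptions.

```python
from pathlib import Path


def parse_flat_keyed(text):
    """Parse a cgroup2 flat-keyed file such as cpu.stat ("key value" per line)."""
    metrics = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        value = value.strip()
        if key and value.isdigit():
            metrics[key] = int(value)
    return metrics


def collect(cgroup_dir):
    """Collect a few metrics from one cgroup directory (hypothetical helper).

    Files that do not exist (for example because the controller is not
    enabled for this cgroup) are skipped.
    """
    base = Path(cgroup_dir)
    metrics = {}
    memory_current = base / "memory.current"
    if memory_current.exists():
        metrics["memory.current"] = int(memory_current.read_text().strip())
    cpu_stat = base / "cpu.stat"
    if cpu_stat.exists():
        for key, value in parse_flat_keyed(cpu_stat.read_text()).items():
            metrics["cpu.stat." + key] = value
    return metrics
```

For example, `collect("/sys/fs/cgroup/workload.slice")` would return a flat dictionary that a daemon could forward to a monitoring system; the path depends on the slice layout in use.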
Resource management
Slice hierarchy
/
|
|-system.slice
|
|-workload.slice
| |
| +-critical-wdb.slice
|
+-tbd.slice
Service monitoring
Service monitoring
Getting metrics out of systemd
• systemd exposes lots of useful metrics over dbus
• Unit properties (e.g. *Timestamp*, NRestarts)
• Status events (e.g. unit state changes)
• Options: python-dbus, sd-bus, coreos/go-systemd/dbus
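As a rough illustration of reading unit properties, the sketch below avoids a D-Bus client library and shells out to `systemctl show`, which prints the same properties as `KEY=VALUE` lines. The helper names are assumptions for this example; a real collector would more likely use one of the D-Bus options listed above.

```python
import subprocess


def parse_show_output(text):
    """Turn `systemctl show` output (KEY=VALUE per line) into a dict of strings."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def unit_properties(unit, names):
    """Fetch the named properties of a unit (hypothetical helper)."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=" + ",".join(names)],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_show_output(result.stdout)
```

A call such as `unit_properties("sshd.service", ["NRestarts", "ActiveEnterTimestamp"])` would return both values as strings. Properties that the running systemd version does not know are simply absent from the result.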
Service monitoring
systemdmon
• Lightweight daemon to feed systemd metrics to various
monitoring systems
• Polling for unit properties, subscriptions for status events
• Initial implementation in golang
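The polling half of that design can be sketched as a loop that takes periodic snapshots and forwards only what changed. This is not systemdmon's code (which the slide says was written in Go). The `fetch` and `emit` callables, the interval and the `rounds` limit are all assumptions made for illustration, and the subscription side is not shown.

```python
import time


def changed(previous, current):
    """Return the entries of `current` that are new or differ from `previous`."""
    return {key: value for key, value in current.items() if previous.get(key) != value}


def poll(fetch, emit, interval=10.0, rounds=None, sleep=time.sleep):
    """Call fetch() every `interval` seconds and emit() only the changed values.

    `rounds` limits the number of iterations (None means run forever);
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    previous = {}
    done = 0
    while rounds is None or done < rounds:
        current = fetch()
        delta = changed(previous, current)
        if delta:
            emit(delta)
        previous = current
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)
```

Here `fetch` would wrap whatever reads the unit properties and `emit` whatever writes to the monitoring system.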
Service monitoring
pystemd
• Thin Cython wrapper on top of sd-bus
• Exposes the systemd dbus object model
• ipython REPL for prototyping
• Will be open sourced together with systemdmon
Case studies
Case studies
dbus reliability
• Issues with dbus-daemon or the system bus affect systemd
• systemctl hanging or failing → Chef failing
• Easy to DoS the bus, especially with user services
• Hard to remediate without a reboot
• Looking forward to dbus-broker!
Case studies
rpm macros for systemd services
• By default RPM macros will restart units on upgrade...
• …which is a problem if you’ve also set up Chef to restart them
• Solution: knob in our internal packaging tool to optionally
disable the restart macro
Case studies
Logging
• Journald setup: 10 MB of in-memory logging feeding rsyslog
• journalctl is awesome
• Double writing problem
• No way to set per-unit limits
Case studies
Unit loops
• Easy to create loops with x-systemd-requires in fstab
• systemd will delete a random unit to break loops
• Solution: add _netdev to the fstab entry
• systemd-analyze to help debugging
systemd-tmpfiles-setup.service: Job systemd-tmpfiles-setup.service/start deleted to break ordering cycle starting with smc_proxy.service/start
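Because systemd breaks such cycles on its own, they are easy to miss. A minimal sketch of spotting them in log output follows. The regular expression is modelled only on the message shown above, so it is an assumption that other systemd versions word the message the same way, and the function name is made up for this example.

```python
import re

# Pattern derived from the log line quoted on the slide.
CYCLE_RE = re.compile(
    r"(?P<unit>\S+): Job (?P<job>\S+) deleted to break ordering cycle "
    r"starting with (?P<start>\S+)"
)


def find_broken_cycles(lines):
    """Return one dict per log line reporting a job deleted to break a cycle."""
    hits = []
    for line in lines:
        match = CYCLE_RE.search(line)
        if match:
            hits.append(match.groupdict())
    return hits
```

The input could be the lines of `journalctl -b` output; each hit names the unit whose job was dropped and the job where the cycle started.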
Case studies
Transient unit creep
• systemd-run creates units in /run/systemd/transient
• If the unit fails, it sticks around in ‘failed’ state
• 10k failed units → 50% CPU usage for pid 1
• 30k failed units → 100% CPU usage for pid 1
• Fix: call systemctl reset-failed periodically
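One way to run that fix periodically is sketched below. It counts failed units and only calls `systemctl reset-failed` past a threshold. The threshold value and the function names are assumptions for illustration, and the `run` parameter exists only so the logic can be exercised without a live systemd.

```python
import subprocess


def count_failed(list_units_output):
    """Count non-empty lines of `systemctl list-units --no-legend` output."""
    return sum(1 for line in list_units_output.splitlines() if line.strip())


def reset_if_needed(threshold=1000, run=subprocess.run):
    """Reset failed units once at least `threshold` of them have accumulated.

    Returns the number of failed units seen before any reset.
    """
    listing = run(
        ["systemctl", "list-units", "--state=failed", "--no-legend", "--plain"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    failed = count_failed(listing)
    if failed >= threshold:
        run(["systemctl", "reset-failed"], check=True)
    return failed
```

Note that `systemctl reset-failed` without arguments clears the failed state of every unit, so anything that needs those states for debugging should read them first.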
Case studies
KillMode=process
• KillMode=process may leave stray processes in the cgroup
• Changes to unit slices don’t apply unless the old slice is
empty
• Fix: move to use KillMode=control-group
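To see whether a unit is affected, one can compare the PIDs in its cgroup with its main PID. The following is a minimal sketch under the assumption of a cgroup2 layout where the unit's processes are listed in a `cgroup.procs` file; the function names and the example path are hypothetical.

```python
from pathlib import Path


def stray_pids(cgroup_procs_text, main_pid):
    """Return PIDs listed in cgroup.procs content other than the main PID."""
    pids = [int(token) for token in cgroup_procs_text.split()]
    return [pid for pid in pids if pid != main_pid]


def stray_pids_in(cgroup_dir, main_pid):
    """Read cgroup.procs from a unit's cgroup directory and list stray PIDs."""
    text = Path(cgroup_dir, "cgroup.procs").read_text()
    return stray_pids(text, main_pid)
```

For instance, `stray_pids_in("/sys/fs/cgroup/system.slice/foo.service", main_pid)` with the value of the unit's `MainPID` property would list the processes that a stop with KillMode=process would leave behind. A service that legitimately forks workers will also show up here, so the result needs interpreting per service.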
Case studies
Unit escaping
• Escape logic relies on shell control characters:
/dev/dm-1 → dev-dm\x2d1.swap
• Chef fix: https://github.com/chef/chef/pull/6230
• path_to_unit wrapper in fb_systemd
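The escaping rule behind this can be sketched in a few lines. The function below is an illustrative Python approximation of what `systemd-escape --path` does, not the fb_systemd wrapper itself, and it does not handle "." or ".." path components.

```python
def escape_path(path):
    """Approximate systemd's path escaping for unit names.

    Leading, trailing and duplicate slashes are dropped, "/" becomes "-",
    and every byte outside ASCII alphanumerics, ":", "_" and "." (or a "."
    in first position) becomes a \\xNN escape. The root path maps to "-".
    """
    parts = [part for part in path.split("/") if part]
    if not parts:
        return "-"
    escaped = []
    for index, byte in enumerate("/".join(parts).encode("utf-8")):
        char = chr(byte)
        if char == "/":
            escaped.append("-")
        elif (byte < 128 and char.isalnum()) or char in ":_" or (char == "." and index > 0):
            escaped.append(char)
        else:
            escaped.append("\\x%02x" % byte)
    return "".join(escaped)
```

With this, `escape_path("/dev/dm-1") + ".swap"` gives `dev-dm\x2d1.swap`. The backslash in the result is what makes such names awkward to pass through a shell unquoted.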
Advocacy
Advocacy
• Announce core package updates widely
• Tailor documentation to customer use cases
• Encourage people to engage upstream directly
• Tech talks
Questions?