- PID 1 and splitting
- Rexec (or resource exhaustion)
- systemd and APIs
This post is a reply to the latest article on the EWONTFIX blog: Broken by design: systemd.
Update (2014-04-26): I rewrote part of this article to include more arguments, links and spelling fixes.
First, I’m not a systemd developer, thus everything here is based on my understanding of the code I read. I think some of the points raised in the linked blog post are valid and should be discussed, and the systemd-devel mailing list would be a good place to do so. If the original author does not want to get involved, I’ll probably bring some of his remarks to the mailing list.
PID 1 and splitting
The main issue here is that, as systemd is PID 1, the system goes down if it crashes. Before discussing how systemd could work as a non-PID 1 process, let’s review its duties:
Among the reasons systemd wants/needs to run as PID 1 is getting parenthood of badly-behaved daemons that orphan themselves, preventing their immediate parent from knowing their PID to signal or wait on them.
Another reason is to be able to get the return status of all those processes.
systemd deals with all sorts of inputs, including device insertion and removal, changes to mount points and watched points in the filesystem, and even a public DBus-based API.
Device insertion and removal is mostly managed by udev. All systemd does is associate a unit with each device found by udev. When a device requires further actions to actually work (filesystem checks for example), systemd forks a separate process to do the work.
Watching for changes to mount points, files/directories, sockets and timers requires an initial setup. Then, when anything happens, systemd will most of the time fork to handle the change in a separate process.
The infamous DBus-based API mostly triggers forks to handle requests like most events managed by systemd.
These in turn entail resource allocation, file parsing, message parsing, string handling, and so on.
File parsing is very limited in scope, as all exotic files are parsed outside of systemd by “generators” which create units in an INI-style format that systemd then parses. This is a very limited piece of code.
There is only one way to move the “process return code” feature out of PID 1: have init (kept as PID 1) tell systemd which processes died and how. This would imply communication from init to another process. This is doable but would increase complexity. Moreover, if the non-PID 1 systemd process dies, we need a way to tell PID 1 to report to the newly spawned systemd process. So the more you split into processes, the more new edge cases you need to handle in PID 1 and the harder it is to show a consistent view of the system at once.
We cannot rely on the PR_SET_CHILD_SUBREAPER process flag in a non-PID 1 systemd process: if that process were to crash, setting the flag again on the new systemd process would not make already running processes report to the new systemd upon death. They would report to the non-systemd PID 1 process instead.
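To show what the subreaper flag does while the supervising process is alive, here is a minimal Python sketch. It is not systemd code: the function names, the exit code 7 and the use of ctypes to reach prctl(2) are my own choices for illustration, and it assumes Linux 3.4 or later. A “daemon” double-forks to orphan itself, and the kernel hands the orphan to the subreaper, which can then collect its exit status.

```python
import ctypes
import os
import time

PR_SET_CHILD_SUBREAPER = 36  # constant from <linux/prctl.h>


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process."""
    libc = ctypes.CDLL(None, use_errno=True)
    args = [ctypes.c_ulong(v) for v in (1, 0, 0, 0)]
    if libc.prctl(PR_SET_CHILD_SUBREAPER, *args) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def adopt_orphan():
    """Return (our pid, parent pid seen by the orphan, orphan's exit code)."""
    become_subreaper()
    supervisor = os.getpid()
    read_end, write_end = os.pipe()
    middle = os.fork()
    if middle == 0:
        # Intermediate process: fork the worker, then exit immediately,
        # which is how a badly-behaved daemon orphans itself.
        os.close(read_end)
        middle_pid = os.getpid()
        if os.fork() == 0:
            # Worker: wait until the intermediate parent is gone.
            while os.getppid() == middle_pid:
                time.sleep(0.01)
            report = "{} {}".format(os.getppid(), os.getpid())
            os.write(write_end, report.encode())
            os._exit(7)  # arbitrary exit code for the demonstration
        os._exit(0)
    os.close(write_end)
    os.waitpid(middle, 0)  # reap the intermediate process
    new_parent, worker = map(int, os.read(read_end, 64).decode().split())
    os.close(read_end)
    # The worker is now our child, so we can wait on it directly.
    _, status = os.waitpid(worker, 0)
    return supervisor, new_parent, os.WEXITSTATUS(status)
```

The limitation discussed above is not visible in this sketch: the flag belongs to the process that set it, so if the supervisor itself dies, its orphans move on to the next ancestor subreaper or to PID 1.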
All of the other features could be split in a non PID 1 process, leaving a very small PID 1 process that does almost nothing.
But this raises a new issue: how should systemd’s death be handled, and who would be responsible for restarting it?
There could be another process whose sole purpose is to monitor systemd’s status and respawn it if necessary. However, for this to work properly, systemd would have to permanently maintain a full copy of its own state and of the state of every service in temporary storage available to the new systemd process after a crash. This data could be in an inconsistent state, and anything may have happened on the system during the short time systemd was down. The new systemd instance would therefore have to try to figure out the system state from the data available to it, with no guarantee that it is complete or correct.
Having the system crash when systemd dies is actually a simple solution here.
On a hardened system without systemd, you have at most one root-privileged process with any exposed surface: sshd. Using systemd then more than doubles the attack surface.
I do not understand how systemd doubles the attack surface as it does not read untrusted input from the network and does not perform any computation on it.
The new attack vectors introduced by systemd are:
- sockets listened to in place of daemons (as xinetd does): this is a rather limited attack surface, as systemd mostly accepts the connection and then forks to let the actual daemon handle it. No untrusted input is read. I’m not even sure this qualifies as an attack vector;
- DBus API: the first thing done here is validating whether the calling process has the right to perform an action. This is also a rather limited piece of code (a group check and some SELinux checks if enabled).
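The first point can be illustrated with a minimal Python sketch of the inetd-style “accept, then fork” pattern. This is a model of the idea under my own assumptions (a Unix socket at a caller-supplied path, a toy `handle` function standing in for the real daemon), not systemd’s implementation. The thing to notice is that the supervising side never calls `recv` on the connection.

```python
import os
import socket


def make_listener(path):
    """Listen on a Unix stream socket at `path` (hypothetical service socket)."""
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(path)
    listener.listen(8)
    return listener


def handle(conn):
    """Stand-in for the real daemon: only this code reads client bytes."""
    data = conn.recv(1024)
    conn.sendall(data.upper())
    conn.close()


def accept_and_fork(listener):
    """Supervisor side: accept the connection, fork, read nothing."""
    conn, _peer = listener.accept()
    pid = os.fork()
    if pid == 0:
        # Child: take over the connection and act as the daemon.
        listener.close()
        handle(conn)
        os._exit(0)
    conn.close()  # the parent drops its copy of the client socket
    return pid
```

A real supervisor would exec the daemon binary in the child rather than call a Python function, but the division of labour is the same.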
Most Linux systems have more than one process running as root with an exposed attack surface. Discretionary Access Control (DAC) has been proved insufficient. If you care about security at this point you should use Mandatory Access Control (MAC) and systemd has integrated support for most of those (SELinux, SMACK…). Have a look at grsecurity too.
Everything else is either running as unprivileged users or does not have any channel for providing it input except local input from root.
systemd actually helps here, as it makes it much easier for administrators to confine processes to different users or to restrict access to resources. Have a look at the PrivateTmp, PrivateDevices and PrivateNetwork features.
Rexec (or resource exhaustion)
However, failure of execve is not entirely atomic: The kernel may fail setting up the VM for the new process image after the original VM has already been destroyed; the main situation under which this would happen is resource exhaustion.
In addition, systemd might fail to restore its serialized state due to resource allocation failures (…)
This is a valid point, but even splitting systemd into several processes won’t resolve this issue. Resource usage must be restricted one way or another for any system to function properly. Interestingly, systemd does a lot in that area, as it splits resource access priority evenly between all services and makes it easy to allow more for a particular service. You could even enforce a high resource priority for systemd to make sure other processes would be killed first in case of resource exhaustion. This is something that should be discussed upstream to make sure the failure case remains “impossible”.
As a reminder, have a look at the old init man page and the current Upstart man page, which detail the telinit u command that asks init to re-exec itself. I can’t find any complaints about this behavior anywhere.
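To make the re-exec discussion more concrete, here is a minimal Python sketch of the general pattern: serialize the state to a file descriptor that survives exec, replace the process image, and let the new image read the state back. Everything in it is an assumption made for illustration (the JSON format, the function names, the `--restore-fd` argument); it does not reflect the format or code used by systemd, sysvinit or Upstart.

```python
import json
import os
import tempfile


def serialize_state(state):
    """Store `state` in an unlinked temp file; return an inheritable fd."""
    fd, path = tempfile.mkstemp()
    os.unlink(path)  # the open descriptor keeps the data alive
    os.write(fd, json.dumps(state).encode())
    os.lseek(fd, 0, os.SEEK_SET)
    os.set_inheritable(fd, True)  # let the descriptor survive execv
    return fd


def restore_state(fd):
    """Read back what serialize_state() stored, then close the fd."""
    with os.fdopen(fd, "rb") as stream:
        return json.loads(stream.read().decode())


def reexec(state, argv):
    """Replace the current image with argv[0], keeping the same PID."""
    fd = serialize_state(state)
    try:
        # A failure reported by execv raises OSError and leaves the old
        # image running. Failures after the kernel has committed to the
        # new image cannot be caught here.
        os.execv(argv[0], argv + ["--restore-fd", str(fd)])
    except OSError:
        os.close(fd)
        raise
```

The new image would parse the hypothetical `--restore-fd` argument and call `restore_state()` on it. The two failure windows quoted in this section sit on either side of the `execv` call.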
Even after the kernel successfully sets up the new VM and transfers execution to the new process image, it’s possible to have failures prior to the transfer of control to the actual application program. This could happen in the dynamic linker (resource exhaustion or other transient failures mapping required libraries or loading configuration files) or libc startup code. Using musl libc with static linking or even dynamic linking with no additional libraries eliminates these failure cases, but systemd is intended to be used with glibc.
I don’t think systemd is fixed on glibc at all. Again, discussion should take place upstream. This answer perfectly summarizes the issue and I think their stance is completely understandable. GNU C extensions and glibc extensions are extremely useful. systemd developers cannot be blamed for relying on them.
(…) systemd might fail to restore its serialized state (…) if the old and new versions have diverged sufficiently that the old state is not usable by the new version.
This is unlikely, and it is the distribution’s job to make sure the upgrade path does not fail.
However, for PID 1, if re-executing itself fails, the whole system goes down (kernel panic).
Unfortunately, by moving large amounts of functionality that’s likely to need to be upgraded into PID 1, systemd makes it impossible to upgrade without rebooting.
Fundamentally, upgrading should never require rebooting unless the component being upgraded is the kernel. Even then, for security updates, it’s ideal to have a “hot-patch” that can be applied as a loadable kernel module to mitigate the security issue until rebooting with the new kernel is appropriate.
Current implementations of live kernel patching are ksplice and the recently introduced kGraft. In my opinion, rexec in PID 1 is safer than live kernel patching. Really, recommending that people live patch their kernel instead of rebooting is weird, especially when you can avoid long boot process issues (often related to BIOS firmware) by using kexec.
If you cannot reboot a service or system because it is a critical component in your infrastructure, you’ve done something wrong there (i.e. you’ve got a single point of failure). The “Mighty Quest for Epic Uptime” is futile.
systemd and APIs
The intended audience for that sort of thing is clearly servers.
The desktop is quickly becoming irrelevant. The future platform is going to be mobile and is going to be dealing with the reality of running untrusted applications.
Both GNOME and KDE want to rely on user process management using systemd. Some mobile phones already use systemd (Jolla). Security features integrated in systemd would be a perfect fit for untrusted mobile application sandboxing for example.
Engulfing other “essential” system components like udev and making them difficult or impossible to use without systemd (but see eudev).
eudev developers have already proved several times that they do not understand the reasons behind the changes that were made in udev. This does not make the udev “inclusion” in systemd valid. As far as I know it is still possible to use udev without systemd. For logind, it’s hard to blame the systemd developers here as ConsoleKit development had stopped and no one was doing the work.
Setting up for API lock-in (having the DBus interfaces provided by systemd become a necessary API that user-level programs depend on).
The elephant in the room here is the GNOME dependency on systemd. What has been said many times is that those interfaces are not systemd bound (not even Linux bound) and may be reimplemented by anyone. Sure, this takes time, but maintaining this code in each application also takes time. This is similar to depending on a library: it is a compromise.
By providing public APIs intended to be used by other applications, systemd has set itself up to be difficult not to use once it achieves a certain adoption threshold.
So POSIX is bad because it’s an API “available” on most *NIXes and on which a lot of applications depend? As far as I know, no API provided by systemd is fundamentally systemd or even Linux bound.
DJB’s daemontools, runit, and Supervisor, among others, have solved the “legacy init is broken” problem over and over again (though each with some of their own flaws).
One of the interesting bits in systemd is the cgroup-based process management. This has a lot of advantages over all the alternatives you mentioned, as all of them simply fail to properly track a process whose children detach from the main process.
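The reason cgroups help here is that a child inherits its parent’s cgroup at fork time, so a daemon that double-forks still stays in the cgroup of its service. As a minimal sketch (the function names are mine, and the sample paths in the comments are only examples), here is how the membership recorded in `/proc/<pid>/cgroup` can be read in Python. Each line of that file has the form `hierarchy-ID:controller-list:cgroup-path`.

```python
def parse_cgroup(text):
    """Map each controller list to its cgroup path.

    The empty controller list ('') denotes the cgroup v2 unified hierarchy.
    """
    membership = {}
    for line in text.splitlines():
        if not line:
            continue
        # The path may itself contain ':', so split at most twice.
        _hierarchy_id, controllers, path = line.split(":", 2)
        membership[controllers] = path
    return membership


def cgroup_of(pid="self"):
    """Return the cgroup membership of `pid`, as seen by the kernel."""
    with open("/proc/{}/cgroup".format(pid)) as handle:
        return parse_cgroup(handle.read())
```

Two processes belong to the same service when they report the same path, for example `/system.slice/sshd.service`, whatever their parent PIDs are.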
Their failure to displace legacy sysvinit in major distributions had nothing to do with whether they solved the problem, and everything to do with marketing.
None of the alternatives properly handle the whole life cycle of a system. It’s illusory to do process management without doing device management. Those two features could be split into two different processes, but what you would gain in separation of duties would be lost in communication, synchronization and complexity.
Dictating policy rather than being scoped such that the user, administrator, or systems integrator (distribution) has to provide glue.
I find the lack of glue a definite advantage.
This eliminates bikesheds and thereby fast-tracks adoption at the expense of flexibility and diversity.
I cannot see what’s not flexible in systemd. Please give us some examples. The boot process is flexible, the process management is flexible, the shutdown process is flexible.
If none of them are ready for prime time, then the folks eager to replace legacy init in their favorite distributions need to step up and either polish one of the existing solutions or write a better implementation based on the same principles. Either of these options would be a lot less work than fixing what’s wrong with systemd.
This post is here to convince you that it is the other way around. The base design of systemd is good, so let’s fix the hard cases one by one.
For 30+ years, the choice of init system used has been completely irrelevant to everybody but system integrators and administrators.
And it will still be. Daemons do not really need special changes to work with systemd. Any integration is a bonus. The parts of GNOME and KDE that (will) rely on systemd-implemented APIs are not related to user applications but to system management (power management, session management), so user applications will never have to care about that.
Administrators had to put up with all the little differences between init systems for those 30+ years. I find it refreshing that we’re almost done with this fragmentation.
Ironically, this sort of modularity and interchangibility is what made systemd possible; if we were starting from the kind of monolithic, API-lock-in-oriented product systemd aims to be, swapping out the init system for something new and innovative would not even be an option.
We went from a situation where every single distribution had a different init system to the current one, where a single init system fits almost everyone’s needs. This is the best example of systemd’s flexibility.
GNU/Linux/Free/Open Source software is not always about choice when you’re not the one doing the work.
Choice has a cost and Debian is paying it right now.
I’m glad we’re done with this one.