The way capabilities work in Linux is documented in man 7 capabilities.
Permission checks are done against the capabilities in a process's effective set. File capabilities are used during an execv call (which happens when you want to run another program[1]) to calculate the new capability sets for the process.
Files have two capability sets, permitted and inheritable, plus an effective bit.
Processes have three capability sets: effective, permitted and inheritable. There is also a bounding set, which limits which capabilities may later be added to a process's inheritable set and affects how capabilities are calculated during a call to execv. Capabilities can only be dropped from the bounding set, not added.
Permission checks for a process are made against its effective set. A process can raise capabilities from its permitted set to its effective set using the capget and capset syscalls; the recommended APIs are cap_get_proc and cap_set_proc, respectively.
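Besides capget, the sets of a running process can also be read from the CapInh, CapPrm, CapEff and CapBnd lines of /proc/<pid>/status, which hold each set as a hexadecimal bit mask. The sketch below parses those lines; the function names and the short name table are made up for illustration and cover only a handful of capabilities.

```python
# Bit numbers as defined in <linux/capability.h>; only a few are listed here.
CAP_NAMES = {
    0: "cap_chown",
    1: "cap_dac_override",
    10: "cap_net_bind_service",
    12: "cap_net_admin",
    13: "cap_net_raw",
}


def parse_cap_masks(status_text):
    """Return {"CapInh": int, "CapPrm": int, ...} from /proc/<pid>/status text."""
    masks = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("Cap"):
            masks[key] = int(value.strip(), 16)
    return masks


def decode(mask):
    """Names of the capabilities from CAP_NAMES whose bit is set in mask."""
    return [name for bit, name in sorted(CAP_NAMES.items()) if (mask >> bit) & 1]


def current_caps(path="/proc/self/status"):
    """Capability masks of the calling process (Linux only)."""
    with open(path) as f:
        return parse_cap_masks(f.read())
```

For example, decode(current_caps()["CapEff"]) lists which of the capabilities in the table are currently effective.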
The inheritable and bounding sets and the file capabilities come into play during an execv syscall. During execv, new effective and permitted sets are calculated, and the inheritable and bounding sets stay unchanged. The algorithm is described in the capabilities man page:
P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & cap_bset)
P'(effective) = F(effective) ? P'(permitted) : 0
P'(inheritable) = P(inheritable) [i.e., unchanged]
Where P is the old capability set, P' is the capability set after execv and F is the file capability set.
If a capability is in both the process's inheritable set and the file's inheritable set (intersection/logical AND), it is added to the new permitted set. The file's permitted set, limited to what the bounding set allows, is then added to it (union/logical OR).
If the effective bit in file capabilities is set, all permitted capabilities are raised to effective after execv.
Capabilities in the kernel are actually set per thread, but regarding file capabilities this distinction is usually relevant only if the process alters its own capabilities.
In your example, the capabilities cap_net_raw, cap_net_admin and cap_dac_override are added to the file's inheritable and permitted sets and the effective bit is set. When your binary is executed, the process will have those capabilities in its effective and permitted sets, unless they are limited by the bounding set.
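The rule can be checked by hand with plain bit masks. What follows is a small sketch of the quoted formula only (it leaves out anything the formula does not mention, such as ambient capabilities), applied to the three capabilities from the example. The function names are made up for illustration; the bit numbers are the ones from <linux/capability.h>.

```python
CAP_DAC_OVERRIDE, CAP_NET_ADMIN, CAP_NET_RAW = 1, 12, 13


def mask(*bits):
    """Build a capability bit mask from capability numbers."""
    result = 0
    for bit in bits:
        result |= 1 << bit
    return result


def caps_after_exec(p_inheritable, f_permitted, f_inheritable, f_effective, cap_bset):
    """Apply the quoted execv rule; returns (permitted, effective, inheritable)."""
    new_permitted = (p_inheritable & f_inheritable) | (f_permitted & cap_bset)
    new_effective = new_permitted if f_effective else 0
    new_inheritable = p_inheritable  # unchanged by execv
    return new_permitted, new_effective, new_inheritable


# File from the example: three capabilities in permitted and inheritable,
# effective bit set. The process starts with an empty inheritable set.
file_caps = mask(CAP_DAC_OVERRIDE, CAP_NET_ADMIN, CAP_NET_RAW)
full_bset = (1 << 64) - 1
permitted, effective, inheritable = caps_after_exec(
    0, file_caps, file_caps, True, full_bset
)
```

With a full bounding set, permitted and effective both end up equal to file_caps. Passing a bounding set without CAP_NET_RAW instead removes that one capability from both results.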
[1] For the fork syscall, all the capabilities and the bounding set are copied from the parent process. Changes in uid also have their own semantics for how capabilities are set in the effective and permitted sets.
Answer from sebasth on Stack Exchange
Setting a capability on a file
sudo setcap 'cap_net_bind_service=ep' file_name
Setting multiple capabilities on a file
sudo setcap 'cap_net_bind_service=ep cap_sys_admin=ep' file_name
Removing all capabilities from a file
sudo setcap -r file_name
Checking capabilities for a file
getcap file_name
List of possible capabilities (some are really interesting)
https://linux.die.net/man/7/capabilities
Pitfall: setting capabilities does not really work for scripts, because the kernel takes the file capabilities from the interpreter binary rather than from the script file. If you want your Python script to work, you need to set the capabilities on the Python executable itself. It's not ideal, since every script run by that interpreter then gets them.
Note: setcap always overwrites the entire capability set when you run it. Most of the time, you see examples using setcap with the + or - syntax, which I believe is a confusing piece of junk and does NOT work as you would expect from other tools like chmod. You can't run setcap multiple times to add different capabilities; it needs to be done in a single command.
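Because each run replaces whatever was set before, a script that manages capabilities has to pass the complete set in one invocation. A minimal sketch of assembling that single command follows. The helper name is hypothetical, and it only builds the argument list; running it (with root privileges) is left to the caller.

```python
def setcap_command(caps, path):
    """Build one setcap invocation for all capabilities at once.

    caps maps a capability name to its flags, e.g. {"cap_net_bind_service": "ep"}.
    Capabilities that share the same flags are grouped into one comma-separated clause.
    """
    by_flags = {}
    for name, flags in caps.items():
        by_flags.setdefault(flags, []).append(name)
    clauses = [",".join(names) + "=" + flags for flags, names in by_flags.items()]
    return ["setcap", " ".join(clauses), path]
```

The returned list can be handed to subprocess.run(cmd, check=True), which passes the clause string as a single argument just like the quoted form in the commands above.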
The most common method to provide extra capabilities to a process is to assign filesystem capabilities to its binary.
For example, if you want the processes executing /sbin/yourprog to have the CAP_CHOWN capability, add that capability to the permitted and effective sets of that file: sudo setcap cap_chown=ep /sbin/yourprog.
The setcap utility is provided by the libcap2-bin package on Debian and Ubuntu (other distributions usually ship it with their libcap package), and is installed by default on most Linux distributions.
It is also possible to provide the capabilities to the original process, and have that process manipulate its effective capability set as needed. For example, Wireshark's dumpcap is typically installed with CAP_NET_ADMIN and CAP_NET_RAW filesystem capabilities in the effective, permitted, and inheritable sets.
I dislike the idea of adding any filesystem capabilities to the inheritable set. When the capabilities are not in the inheritable set, executing another binary causes the kernel to drop those capabilities (assuming KEEPCAPS is zero; see prctl(PR_SET_KEEPCAPS) and man 7 capabilities for details).
As an example, if you granted /sbin/yourprog only the CAP_CHOWN capability and only in the permitted set (sudo setcap cap_chown=p /sbin/yourprog), then the CAP_CHOWN capability will not be automatically effective, and it will be dropped if the process executes some other binary. To use the CAP_CHOWN capability, a thread can add the capability to its effective set for the duration of the operations needed, then remove it from the effective set (but keep it in the permitted set), via capset() calls or libcap's cap_get_proc()/cap_set_proc(); prctl() cannot change the effective set. Note that capabilities are per-thread: a plain cap_set_proc() changes only the calling thread, whereas a program linked with libpsx applies the change to all threads in the process, which may not be what you want.
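The "raise it only while it is needed" pattern can be sketched as follows. This is an illustration under several assumptions, not a reference implementation: it calls the capget and capset wrappers exported by the C library through ctypes instead of using libcap, the structure layouts and version constant are taken from <linux/capability.h>, and the helper names (get_caps, set_caps, raised) are hypothetical. It affects only the calling thread.

```python
import contextlib
import ctypes
import os

LINUX_CAPABILITY_VERSION_3 = 0x20080522  # from <linux/capability.h>
CAP_CHOWN = 0


class CapHeader(ctypes.Structure):
    _fields_ = [("version", ctypes.c_uint32), ("pid", ctypes.c_int)]


class CapData(ctypes.Structure):
    _fields_ = [("effective", ctypes.c_uint32),
                ("permitted", ctypes.c_uint32),
                ("inheritable", ctypes.c_uint32)]


libc = ctypes.CDLL(None, use_errno=True)
FIELDS = ("effective", "permitted", "inheritable")


def _check(result):
    if result != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def get_caps():
    """Return (effective, permitted, inheritable) masks of the calling thread."""
    header = CapHeader(LINUX_CAPABILITY_VERSION_3, 0)  # pid 0 = calling thread
    data = (CapData * 2)()  # version 3 stores each set as two 32-bit halves
    _check(libc.capget(ctypes.byref(header), data))
    return tuple(getattr(data[0], f) | (getattr(data[1], f) << 32) for f in FIELDS)


def set_caps(effective, permitted, inheritable):
    """Install the given masks for the calling thread."""
    header = CapHeader(LINUX_CAPABILITY_VERSION_3, 0)
    data = (CapData * 2)()
    for field, value in zip(FIELDS, (effective, permitted, inheritable)):
        setattr(data[0], field, value & 0xFFFFFFFF)
        setattr(data[1], field, (value >> 32) & 0xFFFFFFFF)
    _check(libc.capset(ctypes.byref(header), data))


@contextlib.contextmanager
def raised(cap):
    """Hold capability number cap in the effective set only inside the block."""
    effective, permitted, inheritable = get_caps()
    if not (permitted >> cap) & 1:
        raise PermissionError(f"capability {cap} is not in the permitted set")
    set_caps(effective | (1 << cap), permitted, inheritable)
    try:
        yield
    finally:
        # Restore the earlier effective set; the permitted set is left alone.
        set_caps(effective, permitted, inheritable)
```

A program whose binary carries cap_chown=p could then wrap just the privileged call, for example: with raised(CAP_CHOWN): os.chown(path, uid, gid).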
For temporarily granting a capability, a worker sub-process can be used. This makes sense for a complex process, as it allows delegating the privileged operations to a separate binary. A child process is forked, connected to the parent via a Unix domain stream or datagram socket created with socketpair(), and executes the helper binary that grants it the necessary capabilities. It then uses the Unix domain socket to verify the identity of its peer (process ID, user ID, group ID, and via the process ID, the executable the other end of the socket is running). A pipe is not used because the SO_PEERCRED socket option, which asks the kernel for the identity of the other end of the socket, requires a Unix domain socket.
There are known attack patterns that need to be anticipated and thwarted. The most common attack pattern is causing the parent process to execute a compromised binary immediately after forking and executing the privileged child process, timed just right so that the privileged child trusts that the other end is its proper parent executing the proper binary, when in fact control has been transferred to a completely different, compromised or untrustworthy binary.
The details of exactly how to do this securely are a software engineering question much more than a programming question, but the key steps are using socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fdpair) and verifying, more than just once at the beginning, that the socket peer is the parent process still executing the expected binary.
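The identity check itself can be sketched in a few lines. This is only the SO_PEERCRED query plus a lookup of the peer's executable through /proc, with hypothetical helper names; it does not address the race described above, which is why a real helper would repeat the check around every privileged operation.

```python
import os
import socket
import struct


def make_channel():
    """Create a connected pair of Unix domain stream sockets.

    Python creates sockets non-inheritable (close-on-exec) by default,
    which corresponds to SOCK_CLOEXEC in the C call.
    """
    return socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)


def peer_credentials(sock):
    """Return (pid, uid, gid) of the other end, as recorded by the kernel."""
    fmt = "iII"  # struct ucred: pid_t pid; uid_t uid; gid_t gid
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(fmt))
    return struct.unpack(fmt, raw)


def peer_executable(pid):
    """Path of the binary that process pid is executing right now."""
    return os.readlink(f"/proc/{pid}/exe")


def peer_is_trusted(sock, expected_uid, expected_exe):
    """True if the peer has the expected user ID and runs the expected binary."""
    pid, uid, _gid = peer_credentials(sock)
    return uid == expected_uid and peer_executable(pid) == expected_exe
```

For a socket pair, the kernel records the credentials of the process that called socketpair(), so a forked helper querying its end sees the parent's process ID.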
The simplest example I can think of is using capset() (or cap_set_proc()) with the CAP_NET_BIND_SERVICE filesystem capability only in the permitted set, so that an otherwise unprivileged process can use a privileged port (1-1023, preferably a system-wide subset defined/listed in a root or admin-owned configuration file somewhere under /etc) to provide a network service. If the service will close and reopen its listening socket when told to do so (perhaps via the SIGUSR1 signal), the capability cannot simply be used once at the beginning to create the listening socket and then dropped for good. It is a pretty good match for the "keep in permitted set, but only add to effective set of the thread that actually needs it, then drop it immediately afterwards" pattern.
For CAP_CHOWN, an example program might acquire it into its effective and permitted sets via the filesystem capability, but use a trusted configuration file (root/admin modifiable only) to list the ownership changes it is allowed to do based on the real user and group identity running the process. (Consider a dedicated "sudo"-style "chown" utility, intended, say, for organizations to allow team leads to shift file ownership between their team members, but one that does not use sudo.)
It is not realistically possible to gain capabilities during runtime. The capabilities need to be already set before your software is started.
Some API functions like capset and cap_set_proc exist, but don't expect magic because the situation in which you could gain more capabilities will be both rare and a security oversight.
There are a few general ways of giving your software the required capabilities.
- Set a specific capability on your binary with the setcap tool.
- Use sudo to call your program. You already mentioned this yourself.
- Set the setuid bit on your binary and set ownership to root. In this particular case that will be largely equivalent to calling your program with sudo.
- Create a utility program that you apply one of the other methods on. Typically you would find such a utility in a place like /usr/libexec. You then call the utility as a subprocess. I would consider this unnecessarily complex for simple situations. However, depending on the situation, this may be preferred over the potential security risk of your software constantly running with too many privileges.
The first method should be considered the desired way. Your software should drop the capability as soon as it no longer requires it.
CAP_CHOWN could be used, for example, to change the ownership of /etc/shadow. The new owner could then change the passwords of other users such as root, so effectively it could be equivalent to granting all capabilities. Hence this capability is, like many others, potentially dangerous.
And one last desperate syntax guess pays off:
# setcap cap_net_bind_service,cap_sys_boot=+ep /usr/bin/nodejs
# getcap /usr/bin/nodejs
/usr/bin/nodejs = cap_net_bind_service,cap_sys_boot+ep
For anyone who arrived at this question: if you want to specify different actions for different capabilities, this is the syntax you need:
# this is for illustrative purposes only
# obviously it's not how you would formally define the syntax
setcap "<CAP_0,CAP_1,...><ACTIONS_A> <CAP_2,...><ACTION_B> ..."
So using OP's example, let's assume that you want to set =+eip for cap_net_bind_service instead of =+ep, this is what you would do:
setcap "cap_net_bind_service=+eip cap_sys_boot=+ep" /usr/bin/nodejs
More formally, each "capability description" as accepted by setcap consists of multiple space-separated "clauses", where each clause has the form <CAP_NAME_LIST><OPS><FLAGS> (an <OPS><FLAGS> pair is also referred to as an "action").
So if we take the aforementioned example:
cap_net_bind_service=+eip cap_sys_boot=+ep
Here cap_net_bind_service=+eip is the first clause, where cap_net_bind_service is the comma-separated list of capability names (only one item in this list), =+ are the operators (reset then raise), and eip are the flags (effective, inheritable, and permitted). Similarly cap_sys_boot=+ep is the second clause etc.
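To make the clause structure concrete, here is a deliberately simplified parser for the subset of the syntax described above. It is a sketch, not a reimplementation of cap_from_text: it accepts only lowercase capability names, does not know the special name all, and does not check that names are real capabilities. The function and pattern names are made up for illustration.

```python
import re

# <CAP_NAME_LIST> followed by one or more <OP><FLAGS> actions
CLAUSE = re.compile(r"^([a-z0-9_]+(?:,[a-z0-9_]+)*)((?:[=+-][eip]*)+)$")
ACTION = re.compile(r"([=+-])([eip]*)")


def parse_caps(text):
    """Return {capability name: set of raised flags} for a clause string."""
    result = {}
    for clause in text.split():
        match = CLAUSE.match(clause)
        if not match:
            raise ValueError(f"cannot parse clause: {clause!r}")
        names = match.group(1).split(",")
        for op, flags in ACTION.findall(match.group(2)):
            for name in names:
                current = result.setdefault(name, set())
                if op == "=":
                    current.clear()  # reset all three flags first
                if op == "-":
                    current.difference_update(flags)  # lower the listed flags
                else:
                    current.update(flags)  # "=" and "+" raise the listed flags
    return result
```

Applied to the example string, this gives the flags e, i and p for cap_net_bind_service, and e and p for cap_sys_boot.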
If you want a detailed description of the full syntax, you should refer to the manual page of cap_from_text. Here's a link if you prefer reading from die.net.