|
| 1 | +--- |
| 2 | +content_type: reference |
| 3 | +title: Seccomp and Kubernetes |
| 4 | +weight: 80 |
| 5 | +--- |
| 6 | + |
| 7 | +<!-- overview --> |
| 8 | + |
| 9 | +Seccomp stands for secure computing mode and has been a feature of the Linux |
| 10 | +kernel since version 2.6.12. It can be used to sandbox the privileges of a |
| 11 | +process, restricting the calls it is able to make from userspace into the |
| 12 | +kernel. Kubernetes lets you automatically apply seccomp profiles loaded onto a |
| 13 | +{{< glossary_tooltip text="node" term_id="node" >}} to your Pods and containers. |
| 14 | + |
| 15 | +## Seccomp fields |
| 16 | + |
| 17 | +{{< feature-state for_k8s_version="v1.19" state="stable" >}} |
| 18 | + |
| 19 | +There are four ways to specify a seccomp profile for a |
| 20 | +{{< glossary_tooltip text="pod" term_id="pod" >}}: |
| 21 | + |
| 22 | +- for the whole Pod using [`spec.securityContext.seccompProfile`](/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context) |
| 23 | +- for a single container using [`spec.containers[*].securityContext.seccompProfile`](/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context-1) |
| 24 | +- for a (restartable / sidecar) init container using [`spec.initContainers[*].securityContext.seccompProfile`](/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context-1) |
| 25 | +- for an [ephemeral container](/docs/concepts/workloads/pods/ephemeral-containers) using [`spec.ephemeralContainers[*].securityContext.seccompProfile`](/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context-2) |
| 26 | + |
| 27 | +{{% code_sample file="pods/security/seccomp/fields.yaml" %}} |
| 28 | + |
| 29 | +The Pod in the example above runs as `Unconfined`, while the |
| 30 | +`ephemeral-container` and `init-container` explicitly define |
| 31 | +`RuntimeDefault`. If the ephemeral or init container did not set the |
| 32 | +`securityContext.seccompProfile` field explicitly, then the value would be |
| 33 | +inherited from the Pod. The same applies to the container, which overrides the |
| 34 | +Pod-level value with the `Localhost` profile `my-profile.json`. |
| 35 | + |
| 36 | +Generally speaking, fields from (ephemeral) containers have a higher priority |
| 37 | +than the Pod level value, while containers which do not set the seccomp field |
| 38 | +inherit the profile from the Pod. |
| 39 | + |
| 40 | +{{< note >}} |
| 41 | +It is not possible to apply a seccomp profile to a Pod or container running with |
| 42 | +`privileged: true` set in the container's `securityContext`. Privileged |
| 43 | +containers always run as `Unconfined`. |
| 44 | +{{< /note >}} |
| 45 | + |
| 46 | +The following values are possible for the `seccompProfile.type`: |
| 47 | + |
| 48 | +`Unconfined` |
| 49 | +: The workload runs without any seccomp restrictions. |
| 50 | + |
| 51 | +`RuntimeDefault` |
| 52 | +: A default seccomp profile defined by the |
| 53 | +{{< glossary_tooltip text="container runtime" term_id="container-runtime" >}} |
| 54 | +is applied. The default profiles aim to provide a strong set of security |
| 55 | +defaults while preserving the functionality of the workload. It is possible that |
| 56 | +the default profiles differ between container runtimes and their release |
| 57 | +versions, for example when comparing those from |
| 58 | +{{< glossary_tooltip text="CRI-O" term_id="cri-o" >}} and |
| 59 | +{{< glossary_tooltip text="containerd" term_id="containerd" >}}. |
| 60 | + |
| 61 | +`Localhost` |
| 62 | +: The profile referenced by `localhostProfile` is applied. It has to be available on the node |
| 63 | +disk (on Linux, under `/var/lib/kubelet/seccomp`). The availability of the seccomp |
| 64 | +profile is verified by the |
| 65 | +{{< glossary_tooltip text="container runtime" term_id="container-runtime" >}} |
| 66 | +on container creation. If the profile does not exist, then the container |
| 67 | +creation will fail with a `CreateContainerError`. |
| 68 | + |
| 69 | +### `Localhost` profiles |
| 70 | + |
| 71 | +Seccomp profiles are JSON files following the schema defined by the |
| 72 | +[OCI runtime specification](https://github.com/opencontainers/runtime-spec/blob/f329913/config-linux.md#seccomp). |
| 73 | +A profile defines actions based on matched syscalls, and can also match on |
| 74 | +specific values passed as arguments to those syscalls. For example: |
| 75 | + |
| 76 | +```json |
| 77 | +{ |
| 78 | + "defaultAction": "SCMP_ACT_ERRNO", |
| 79 | + "defaultErrnoRet": 38, |
| 80 | + "syscalls": [ |
| 81 | + { |
| 82 | + "names": [ |
| 83 | + "adjtimex", |
| 84 | + "alarm", |
| 85 | + "bind", |
| 86 | + "waitid", |
| 87 | + "waitpid", |
| 88 | + "write", |
| 89 | + "writev" |
| 90 | + ], |
| 91 | + "action": "SCMP_ACT_ALLOW" |
| 92 | + } |
| 93 | + ] |
| 94 | +} |
| 95 | +``` |
| 96 | + |
| 97 | +The `defaultAction` in the profile above is defined as `SCMP_ACT_ERRNO` and |
| 98 | +acts as the fallback for every syscall not matched by an entry in `syscalls`. The error is |
| 99 | +defined as code `38` via the `defaultErrnoRet` field. |
| 100 | + |
| 101 | +The following actions are generally possible: |
| 102 | + |
| 103 | +`SCMP_ACT_ERRNO` |
| 104 | +: Return the specified error code. |
| 105 | + |
| 106 | +`SCMP_ACT_ALLOW` |
| 107 | +: Allow the syscall to be executed. |
| 108 | + |
| 109 | +`SCMP_ACT_KILL_PROCESS` |
| 110 | +: Kill the process. |
| 111 | + |
| 112 | +`SCMP_ACT_KILL_THREAD` and `SCMP_ACT_KILL` |
| 113 | +: Kill only the thread. |
| 114 | + |
| 115 | +`SCMP_ACT_TRAP` |
| 116 | +: Throw a `SIGSYS` signal. |
| 117 | + |
| 118 | +`SCMP_ACT_NOTIFY` and `SECCOMP_RET_USER_NOTIF` |
| 119 | +: Notify user space. |
| 120 | + |
| 121 | +`SCMP_ACT_TRACE` |
| 122 | +: Notify a tracing process with the specified value. |
| 123 | + |
| 124 | +`SCMP_ACT_LOG` |
| 125 | +: Allow the syscall to be executed after the action has been logged to syslog or |
| 126 | +auditd. |
| 127 | + |
| 128 | +Some actions like `SCMP_ACT_NOTIFY` or `SECCOMP_RET_USER_NOTIF` may not be |
| 129 | +supported depending on the container runtime, OCI runtime or Linux kernel |
| 130 | +version being used. There may also be further limitations, for example that |
| 131 | +`SCMP_ACT_NOTIFY` cannot be used as `defaultAction` or for certain syscalls like |
| 132 | +`write`. All those limitations are defined by either the OCI runtime |
| 133 | +([runc](https://github.com/opencontainers/runc), |
| 134 | +[crun](https://github.com/containers/crun)) or |
| 135 | +[libseccomp](https://github.com/seccomp/libseccomp). |
| 136 | + |
| 137 | +The `syscalls` JSON array contains a list of objects referencing syscalls by |
| 138 | +their respective `names`. For example, the action `SCMP_ACT_ALLOW` can be used |
| 139 | +to create an allowlist of permitted syscalls as outlined in the example above. It |
| 140 | +would also be possible to define another list using the action `SCMP_ACT_ERRNO` |
| 141 | +but with a different return (`errnoRet`) value. |
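|  | + |
|  | +As a sketch of that second case, the profile below keeps a small allowlist and |
|  | +adds a separate list that fails with a different error code. The syscall names |
|  | +and the `errnoRet` value of `1` are illustrative assumptions, not a recommended |
|  | +policy: |
|  | + |
|  | +```json |
|  | +{ |
|  | +  "defaultAction": "SCMP_ACT_ERRNO", |
|  | +  "defaultErrnoRet": 38, |
|  | +  "syscalls": [ |
|  | +    { |
|  | +      "names": ["read", "write", "exit_group"], |
|  | +      "action": "SCMP_ACT_ALLOW" |
|  | +    }, |
|  | +    { |
|  | +      "names": ["mount", "umount2"], |
|  | +      "action": "SCMP_ACT_ERRNO", |
|  | +      "errnoRet": 1 |
|  | +    } |
|  | +  ] |
|  | +} |
|  | +``` |
|  | + |
|  | +With this layout, syscalls in the second list would return error code `1`, while |
|  | +any syscall in neither list falls back to `defaultErrnoRet` (`38`). |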
| 142 | + |
| 143 | +It is also possible to specify the arguments (`args`) passed to certain |
| 144 | +syscalls. More information about those advanced use cases can be found in the |
| 145 | +[OCI runtime spec](https://github.com/opencontainers/runtime-spec/blob/f329913/config-linux.md#seccomp) |
| 146 | +and the [Seccomp Linux kernel documentation](https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt). |
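|  | + |
|  | +A minimal sketch of an `args` entry, assuming the `index`, `value` and `op` |
|  | +fields described in the OCI runtime spec. The choice of `personality` is only an |
|  | +illustration, and the fragment is far too restrictive to run a real workload: |
|  | + |
|  | +```json |
|  | +{ |
|  | +  "defaultAction": "SCMP_ACT_ERRNO", |
|  | +  "syscalls": [ |
|  | +    { |
|  | +      "names": ["personality"], |
|  | +      "action": "SCMP_ACT_ALLOW", |
|  | +      "args": [ |
|  | +        { "index": 0, "value": 0, "op": "SCMP_CMP_EQ" } |
|  | +      ] |
|  | +    } |
|  | +  ] |
|  | +} |
|  | +``` |
|  | + |
|  | +Here the syscall should only be allowed when its first argument (`index` 0) |
|  | +equals `0`; calls with any other value fall through to the `defaultAction`. |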
| 147 | + |
| 148 | +## Further reading |
| 149 | + |
| 150 | +- [Restrict a Container's Syscalls with seccomp](/docs/tutorials/security/seccomp/) |
| 151 | +- [Pod Security Standards](/docs/concepts/security/pod-security-standards/) |