26, K8s system hardening 2: seccomp and sysdig

1, Experimental environment


The underlying system is Ubuntu 18.04. Kubernetes is installed on each node to build the cluster. The IP address of the master node is 192.168.26.71/24, and the IP addresses of the two worker nodes are 192.168.26.72/24 and 192.168.26.73/24.

2, Seccomp

1. Seccomp concept
seccomp (short for secure computing mode) is a security mechanism of the Linux kernel. It first appeared in version 2.6.12, gained its prctl() interface in 2.6.23, and gained filter mode (seccomp-bpf) in 3.5. On a Linux system, a large number of system calls are directly exposed to user programs.

For example, any command we run on a Linux machine makes a large number of syscalls behind the scenes. You can see them with the strace -fqc command, as shown below:

[root@localhost ~]# strace -fqc cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.26.72 www.ck8s.top www


% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 36.58    0.000698          22        31        13 openat
 21.54    0.000411          21        19           fstat
 18.03    0.000344          17        20           close
 17.92    0.000342          18        19           mmap
  2.10    0.000040          40         1           write
  1.78    0.000034           5         6           read
  1.52    0.000029          29         1           fadvise64
  0.52    0.000010           5         2           munmap
  0.00    0.000000           0         1           lseek
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         4           brk
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2         1 arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.001908                   112        15 total

As you can see, even this simple command makes 112 system calls. Some of them, such as read, are indispensable. Often, however, a program does not need every system call, and unsafe code that abuses system calls threatens the system. With seccomp we can restrict which system calls a program may use, which reduces the attack surface of the system and puts the program into a "safe" state.
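To turn a summary like the one above into a list of system calls, the table can be parsed with a few lines of Python. This is only a sketch: the function name parse_strace_summary and the regular expression are illustrative assumptions, and the column layout is assumed to match the strace -c output shown above.

```python
import re

# One data row of `strace -c` summary output, for example:
# " 36.58    0.000698          22        31        13 openat"
# The "errors" column is optional.
ROW = re.compile(
    r"^\s*(?P<pct>\d+\.\d+)\s+(?P<seconds>\d+\.\d+)\s+(?P<usecs>\d+)\s+"
    r"(?P<calls>\d+)\s+(?:(?P<errors>\d+)\s+)?(?P<name>[a-z_][a-z0-9_]*)\s*$"
)


def parse_strace_summary(text):
    """Return {syscall_name: call_count} parsed from `strace -c` summary text."""
    counts = {}
    for line in text.splitlines():
        match = ROW.match(line)
        # Header and separator lines do not match; the "total" row is skipped.
        if match and match.group("name") != "total":
            counts[match.group("name")] = int(match.group("calls"))
    return counts


if __name__ == "__main__":
    sample = """\
 36.58    0.000698          22        31        13 openat
 21.54    0.000411          21        19           fstat
100.00    0.001908                   112        15 total
"""
    print(sorted(parse_strace_summary(sample)))
```

The resulting names are a starting point for an allowlist, not a complete one: a single run only shows the system calls made on that particular code path.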

seccomp supports the following actions:

  1. SCMP_ACT_KILL: when the process makes a matching system call, the kernel terminates it with SIGSYS, and the process cannot catch this signal
  2. SCMP_ACT_TRAP: when the process makes a matching system call, the kernel sends it a SIGSYS signal, which the process can catch and handle
  3. SCMP_ACT_ERRNO: when the process makes a matching system call, the call is not executed and fails, and the process receives the configured errno value
  4. SCMP_ACT_TRACE: when the process makes a matching system call, a tracer attached to the process (via ptrace) is notified
  5. SCMP_ACT_ALLOW: the process is allowed to make the matching system call
  6. SCMP_ACT_LOG: the system call is allowed and recorded in the log

2. seccomp in docker
By default, the host places no restriction on system calls. A container shares the kernel with the host, so without restrictions a container could make every system call available on the host. In practice, however, Docker applies a default seccomp profile to containers that blocks some system calls.

The default seccomp profile provides a sane default for running containers with seccomp and disables around 44 of the 300+ system calls. It is moderately protective while providing wide application compatibility. The default Docker profile can be found at https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.

The profile is an allowlist: it denies access to system calls by default and then allows specific system calls. It works by defining a defaultAction of SCMP_ACT_ERRNO and overriding that action only for specific system calls. With SCMP_ACT_ERRNO, a matching system call fails and the process receives an errno value. Next, the profile defines a list of system calls that are fully allowed, because their action is overridden to SCMP_ACT_ALLOW. Finally, some specific rules cover individual system calls such as personality, allowing variants of those system calls with specific arguments.
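The allowlist logic described above can be sketched in Python as a lookup that falls back to defaultAction. This is a simplified model, not how Docker enforces the profile: the real check runs in the kernel as a BPF filter, and the function name resolve_action is a hypothetical helper. Argument filters ("args") and conditional fields are deliberately ignored.

```python
def resolve_action(profile, syscall):
    """Return the action a Docker-style seccomp profile assigns to `syscall`.

    Simplification: only the "names" list of each rule is checked;
    per-argument conditions are not modelled.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # No rule matched, so the default applies.
    return profile["defaultAction"]


if __name__ == "__main__":
    profile = {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": ["read", "write", "close"], "action": "SCMP_ACT_ALLOW"},
        ],
    }
    for name in ("read", "mount"):
        print(name, resolve_action(profile, name))
```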

seccomp is instrumental for running Docker containers with least privilege. Changing the default seccomp profile is not recommended. When a container runs, it uses the default profile unless it is overridden with the --security-opt option. For example, the following explicitly specifies a profile:

 docker run --rm -it --security-opt seccomp=/path/to/seccomp/profile.json  hello-world

Profiles that allow all system calls and deny all system calls look like this:

cat aa1.json
{
"defaultAction": "SCMP_ACT_ALLOW"
}
cat bb.json
{
"defaultAction": "SCMP_ACT_ERRNO"
}

Note that neither of these profiles should be used in practice. Allowing all system calls leaves too much attack surface. With the deny-all profile, the container cannot even be created, because the process inside it is not allowed to make any system call.

In addition, seccomp=unconfined disables seccomp altogether, which has the same effect as allowing all system calls.
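A small check can flag the two degenerate profiles discussed above before they are handed to Docker. This is a sketch only: degenerate_profile_warning is a hypothetical helper, not part of Docker or any seccomp library, and it looks only at defaultAction and whether any rules exist.

```python
import json


def degenerate_profile_warning(profile):
    """Return a warning string for allow-all or deny-all profiles, else None."""
    if profile.get("syscalls"):
        # At least one rule overrides the default, so the profile is not trivial.
        return None
    default = profile.get("defaultAction")
    if default == "SCMP_ACT_ALLOW":
        return "allow-all: no system call is restricted"
    if default in ("SCMP_ACT_ERRNO", "SCMP_ACT_KILL"):
        return "deny-all: the container cannot start"
    return None


if __name__ == "__main__":
    for text in ('{"defaultAction": "SCMP_ACT_ALLOW"}',
                 '{"defaultAction": "SCMP_ACT_ERRNO"}'):
        print(degenerate_profile_warning(json.loads(text)))
```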

3. seccomp in k8s cluster
When creating a workload in a K8s cluster, you can load a seccomp profile to control which system calls the containers may make.

To use RuntimeDefault, the container runtime's default seccomp profile (in this environment, the default seccomp profile of Docker), for a workload in the K8s cluster, add the following under the spec of the workload's yaml file:

spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault

Of course, we can also write our own seccomp JSON file. Take the example given on the official website:

{
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
    ],
    "syscalls": [
        {
            "names": [
                "accept4",
                "epoll_wait",
                "pselect6",
                "futex",
                "madvise",
                "epoll_ctl",
                "getsockname",
                "setsockopt",
                "vfork",
                "mmap",
                "read",
                "write",
                "close",
                "arch_prctl",
                "sched_getaffinity",
                "munmap",
                "brk",
                "rt_sigaction",
                "rt_sigprocmask",
                "sigaltstack",
                "gettid",
                "clone",
                "bind",
                "socket",
                "openat",
                "readlinkat",
                "exit_group",
                "epoll_create1",
                "listen",
                "rt_sigreturn",
                "sched_yield",
                "clock_gettime",
                "connect",
                "dup2",
                "epoll_pwait",
                "execve",
                "exit",
                "fcntl",
                "getpid",
                "getuid",
                "ioctl",
                "mprotect",
                "nanosleep",
                "open",
                "poll",
                "recvfrom",
                "sendto",
                "set_tid_address",
                "setitimer",
                "writev"
            ],
            "action": "SCMP_ACT_ALLOW"
        }
    ]
}

This profile denies all system calls by default and then allows the system calls listed in "names". Note that profiles we write ourselves must be placed in the /var/lib/kubelet/seccomp/ directory (which we have to create ourselves) or one of its subdirectories on every worker node in the cluster. Here we save the profile as fine-grained.json in that directory. Then we test it with a Pod. The yaml file is as follows:

apiVersion: v1
kind: Pod
metadata:
  name: fine-pod
  labels:
    app: fine-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: fine-grained.json
  containers:
  - name: test-container
    image: hashicorp/http-echo:0.2.3
    imagePullPolicy: IfNotPresent
    args:
    - "-text=just made some syscalls!"
    securityContext:
      allowPrivilegeEscalation: false

The following section sets the seccomp rules to be called:

  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: fine-grained.json

The Localhost type means the profile is loaded from the node on which the Pod runs, and localhostProfile: fine-grained.json refers to the fine-grained.json file in the /var/lib/kubelet/seccomp directory on that node. If fine-grained.json were in the subdirectory aa, we would write aa/fine-grained.json instead.
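The path mapping described above can be expressed as a short Python sketch. The helper name profile_path is hypothetical, and the rejection of absolute paths and ".." components is a defensive assumption of this sketch, not a statement about what the kubelet itself validates.

```python
from pathlib import PurePosixPath

# Directory the kubelet reads Localhost profiles from, as described above.
SECCOMP_ROOT = PurePosixPath("/var/lib/kubelet/seccomp")


def profile_path(localhost_profile, root=SECCOMP_ROOT):
    """Map a localhostProfile value to the file expected on the node."""
    rel = PurePosixPath(localhost_profile)
    # Assumption of this sketch: only relative paths below the root are accepted.
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError("localhostProfile must be a relative path below the seccomp directory")
    return root / rel


if __name__ == "__main__":
    print(profile_path("fine-grained.json"))
    print(profile_path("aa/fine-grained.json"))
```

Running a check like this on each worker node (combined with a file-existence test) helps catch the common mistake of copying the profile to only one node.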

After creating the Pod from the yaml file, we can test whether the seccomp profile takes effect. In the first example of the official tutorial, the profile sets "defaultAction": "SCMP_ACT_LOG". In fine-grained.json, the profile sets "defaultAction": "SCMP_ACT_ERRNO" but explicitly allows a set of system calls in the "action": "SCMP_ACT_ALLOW" block. Ideally, the container runs successfully and no messages are sent to syslog.

To start the test, open a new terminal window on the node and use the tail command to watch for log entries about system calls made by http-echo:

tail -f /var/log/syslog | grep 'http-echo'

Then use the NodePort service to open a port for the Pod:

kubectl expose pod/fine-pod --type NodePort --port 5678

Check which port was assigned to the service on the node. Here it is 32028:

root@vms71:/var/lib/kubelet/seccomp# kubectl get svc/fine-pod
NAME       TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
fine-pod   NodePort   10.105.125.123   <none>        5678:32028/TCP   16m

We can now access the container we just deployed from a browser, and it responds normally (192.168.26.72 is the address of worker1, the node running this container).

Finally, you will see no output in syslog, because the profile allows all required system calls and specifies that any system call outside the list fails with an error. From a security point of view this is ideal, but it takes more effort to analyze the program first.

3, Sysdig

Writing a seccomp profile ourselves in a K8s environment is troublesome, because we need to work out which system calls the whole container must be allowed to make. Sysdig is a ready-made tool that helps us monitor which system calls a container uses.

First, run sysdig as a container on a node of the K8s cluster. The command is as follows:

docker run -i -dt --name sysdig --restart=always --privileged -v /var/run/docker.sock:/host/var/run/docker.sock -v /dev:/host/dev -v /proc:/host/proc:ro -v /boot:/host/boot:ro -v /lib/modules:/host/lib/modules:ro -v /usr:/host/usr:ro sysdig/sysdig

This command creates a sysdig container. When we run sysdig commands in this container, for example docker exec -it sysdig sysdig, we actually get information about the host, because the host directories were mounted as volumes when the container was created.

Next, we can define an alias on the node so that the sysdig command maps to docker exec -it sysdig sysdig. This lets us call sysdig directly from outside the sysdig container.

alias sysdig='docker exec -it sysdig sysdig '

Then we can use sysdig to view the system calls made by the fine-pod deployed in the previous section. First, find the container ID of this Pod on worker1, the node running it:

root@vms72:/var/lib/kubelet/seccomp# docker ps | grep k8s_test-container
f15f04664e8a   a6838e9a6ff6                                        "/http-echo '-text=j..."   28 minutes ago   Up 28 minutes             k8s_test-container_fine-pod_default_3df3bcde-3141-47d6-b8f9-cd0aba50df2e_0

As you can see, the ID is f15f04664e8a. Then, on worker1, print the system calls of this container in a custom format with the following command:

sysdig -p "*%evt.time,%proc.name,%evt.type" container.id=f15f04664e8a

Common output fields are as follows:

  1. evt.num: incremental event number;
  2. evt.time: the time when the event occurred;
  3. evt.cpu: the CPU on which the event was captured, that is, the CPU on which the system call was executed. For example, the value 0 represents the first CPU of the machine;
  4. proc.name: the name of the process that generated the event;
  5. thread.tid: the ID of the thread. For a single-threaded program, this is also the pid of the process;
  6. evt.dir: the direction of the event, where > represents an enter event and < represents an exit event;
  7. evt.type: the name of the event, such as open or stat, which is generally a system call;
  8. evt.args: the arguments of the event. For a system call, these correspond to the arguments of the system call.

While the command is running, it shows the container's system calls in real time. When we revisit the container's home page, more system calls appear. We can then write the seccomp rules based on the system calls observed.
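As a rough illustration of that last step, the following Python sketch collects the event types from lines in the "%evt.time,%proc.name,%evt.type" format used above and builds an allowlist profile from them. The function name profile_from_sysdig and the sample lines are hypothetical. Since evt.type is generally, but not always, a system call name, the generated list should be reviewed by hand before use.

```python
import json


def profile_from_sysdig(lines, architectures=("SCMP_ARCH_X86_64",)):
    """Build an allowlist seccomp profile from 'time,proc,evt.type' lines."""
    names = set()
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        names.add(parts[2])
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": list(architectures),
        "syscalls": [{"names": sorted(names), "action": "SCMP_ACT_ALLOW"}],
    }


if __name__ == "__main__":
    # Hypothetical sample lines, not captured from the cluster above.
    sample = [
        "03:12:45.100000001,http-echo,epoll_pwait",
        "03:12:45.100000002,http-echo,accept4",
        "03:12:45.100000003,http-echo,epoll_pwait",
    ]
    print(json.dumps(profile_from_sysdig(sample), indent=4))
```

A profile built this way only covers what was observed during the capture, so the application should be exercised thoroughly while sysdig is recording.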

Sources:
Old section CKS course
docker seccomp: https://docs.docker.com/engine/security/seccomp/
K8s seccomp: https://kubernetes.io/zh/docs/tutorials/clusters/seccomp/


Added by viraj on Tue, 14 Dec 2021 03:53:54 +0200