-
Notifications
You must be signed in to change notification settings - Fork 652
Description
HI, I found a corner case that could make unix domain socket restore fail.
Here is a rough description of the reproduction steps:
- process has a dead process group leader.
This will make CRIU create a helper process to restore the pgid. The helper process will share fs info with clone flag CLONE_FS.
int init_pstree_helper(struct pstree_item *ret)
{
BUG_ON(!ret->parent);
ret->pid->state = TASK_HELPER;
rsti(ret)->clone_flags = CLONE_FILES | CLONE_FS;
if (shared_fdt_prepare(ret) < 0)
return -1;
task_entries->nr_helpers++;
return 0;
}
- process uses an unconnected unix socket.
This will make CRIU restore call prep_unix_sk_cwd(), and final call switch_ns_by_fd() to switch mount namespace.
static int prep_unix_sk_cwd(struct unix_sk_info *ui, int *prev_cwd_fd, int *prev_root_fd, int *prev_mntns_fd)
{
...
if (prev_mntns_fd && ui->name[0] && ui->ue->mnt_id >= 0) {
struct ns_id *mntns = lookup_nsid_by_mnt_id(ui->ue->mnt_id);
...
if (switch_ns_by_fd(ns_fd, &mnt_ns_desc, prev_mntns_fd)) {
close(ns_fd);
goto err;
}
...
}
...
}
- switch mount namespace with setns will return error.
switch_ns_by_fd() will call setns() to switch mount ns, and linux kernel returns EINVAL.
When switching mount ns, linux requires that fs_struct->user should be 1. But step 1 created a helper process, which increases the user count to 2, so linux reject to switch mount ns.
The corresponding code is here.
With AI's help, I have constructed a program that can trigger this bug.
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <sys/wait.h>
#define SOCKET_PATH "/dev/shm/uds_socket_example"
void create_and_connect(int listen_fd) {
struct sockaddr_un addr;
int client_fd = socket(AF_UNIX, SOCK_STREAM, 0);
if (client_fd == -1) {
perror("socket");
exit(EXIT_FAILURE);
}
memset(&addr, 0, sizeof(addr));
addr.sun_family = AF_UNIX;
strncpy(addr.sun_path, SOCKET_PATH, sizeof(addr.sun_path) - 1);
if (connect(client_fd, (struct sockaddr*)&addr, sizeof(addr)) == -1) {
perror("connect");
close(client_fd);
exit(EXIT_FAILURE);
}
printf("Connected to the server socket. client_fd: %d\n", client_fd);
pause();
}
int main() {
int status;
pid_t pid = fork();
if (pid == -1) {
perror("fork");
exit(EXIT_FAILURE);
}
if (pid == 0) {
int listen_fd = socket(AF_UNIX, SOCK_STREAM, 0);
if (listen_fd == -1) {
perror("socket");
exit(EXIT_FAILURE);
}
struct sockaddr_un addr;
memset(&addr, 0, sizeof(addr));
addr.sun_family = AF_UNIX;
strncpy(addr.sun_path, SOCKET_PATH, sizeof(addr.sun_path) - 1);
unlink(SOCKET_PATH);
if (bind(listen_fd, (struct sockaddr*)&addr, sizeof(addr)) == -1) {
perror("bind");
close(listen_fd);
exit(EXIT_FAILURE);
}
if (listen(listen_fd, 5) == -1) {
perror("listen");
close(listen_fd);
exit(EXIT_FAILURE);
}
printf("Child process: pid=%d, pgid=%d\n", getpid(), getpgid(0));
create_and_connect(listen_fd);
pause();
close(listen_fd);
unlink(SOCKET_PATH);
exit(EXIT_SUCCESS);
} else {
exit(0);
}
return 0;
}
Here are the specific reproduction steps.
The environment for execution: Linux v6.16-rc5, runc v1.1.12
- Prepare a rootfs.
For example, usepodman export
to export a rootfs from centos container image. - Generate a default runc spec.
runc spec
- Compile the program and copy it into the rootfs
gcc -o test test.c
cp test rootfs/root/
- Clone the runc repository and compile the recvtty tool.
- Run a detached container.
./recvtty ./console.sock
runc run --console-socket console.sock -d test
- Run test program in container
runc exec -t test bash
/root/test
- Checkpoint and restore
runc checkpoint test
runc restore -b . test
Checkpoint success, but restore failed.
Here is the restore log:
restore.log
BTW, I found that a recently merged commit will causes criu restore fail earlier.
So, to reproduce the bug, the commit need to be reverted first.
Finally, my question is:
How should we solve this problem elegantly?
First, the logic of the Linux kernel appears to be correct, we should not modify it here.
Then, are there other ways for CRIU to restore the pgid?
Or maybe we need to avoid use CLONE_FS flag fro CRIU's helper process? But would doing so have any other side effects?