How does joining a thread actually work?

As part of my research I’ve been playing around with something similar to dynamic linking. In particular, part of my work has led me to having programs that may need multiple copies of libc in the same address space (yes, it’s not safe, I know). I noticed that this would sometimes cause multithreaded programs to hang while calling pthread_join. Join (pun intended) me to understand a bit about how threads work and what was happening under the hood.


Seeing a hang on pthread_join, my first instinct was that the child thread was not exiting. I was able to confirm that it truly was exiting by running the program with gdb and setting a breakpoint on the child thread’s start function. I could then observe the thread successfully exit while the parent thread remained stuck at pthread_join. This left two possibilities: either the child was not properly notifying the parent of the exit, or the parent wasn’t correctly listening for the child’s exit. In my case, all of my libc reinitialization shenanigans were happening in the child, so I was more suspicious of the child exit path than the parent wait path. So how does a thread signal its parent of completion?

In Linux, threads and processes signal completion in very different ways. When a process exits, its parent is sent a SIGCHLD signal, but a thread exit sends no such signal. Instead, threads use the futex mechanism. A futex is similar to a condition variable in that it is a synchronization mechanism that typically tracks state changes. A waiter registers itself by issuing a FUTEX_WAIT operation on a particular address, while a notifier can wake the waiter by performing a FUTEX_WAKE on the same address. Usually, the notifier writes to the memory at that address before it emits the WAKE. Unlike a condition variable, there is no additional user-space object behind that address: the word itself is the only state the program sees, and the queue of waiters is managed by the kernel.

So when a thread runs pthread_join, what it really does is call FUTEX_WAIT on some address, and it expects the child thread to call FUTEX_WAKE before it exits. Or at least, that’s what ChatGPT told me. As a good scientist I had to confirm this by writing a test program and running it with strace to check which system calls were being emitted. Surprisingly, I saw the FUTEX_WAIT call but no FUTEX_WAKE! My first guess was that futexes might have special behavior that’s managed by the kernel, similar to how file descriptors are closed on exit. A bit of googling revealed that this is indeed the case for a futex that represents the TID.
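To make the explicit FUTEX_WAIT / FUTEX_WAKE handshake concrete, here is a minimal sketch in C. It is not what pthread_join does internally; it only shows a waiter and a notifier sharing one address. The helper name futex, the variable done_flag, and the function futex_demo are made up for illustration, and the sketch assumes a typical 64-bit Linux system where SYS_futex is available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: glibc has no futex() wrapper, so use syscall(2). */
static long futex(_Atomic uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* The futex word: 0 = notifier still running, 1 = notifier finished. */
static _Atomic uint32_t done_flag;

static void *notifier(void *arg)
{
    (void)arg;
    atomic_store(&done_flag, 1);       /* write the new state first */
    futex(&done_flag, FUTEX_WAKE, 1);  /* then wake at most one waiter */
    return NULL;
}

/* Returns the final value of the flag. */
static uint32_t futex_demo(void)
{
    pthread_t t;
    atomic_store(&done_flag, 0);
    int rc = pthread_create(&t, NULL, notifier, NULL);
    assert(rc == 0);

    /* FUTEX_WAIT only sleeps if the word still holds the expected value (0).
     * The loop covers spurious wakeups and the case where the notifier ran
     * before we went to sleep (the call then returns immediately). */
    while (atomic_load(&done_flag) == 0)
        futex(&done_flag, FUTEX_WAIT, 0);

    pthread_join(t, NULL);
    return atomic_load(&done_flag);
}
```

Calling futex_demo() from a main function and running the program under strace -f should show both the FUTEX_WAIT and the FUTEX_WAKE calls, which is the pattern I expected to see for pthread_join as well.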

But what is the TID, and how does the kernel know that a particular futex represents one? The TID, or thread ID, is how a thread distinguishes itself from the other threads in the same process. While each thread has a distinct process context in the kernel, the process ID is shared in userspace, so the thread ID is used to tell threads apart. Typically this thread ID is stored in thread-local memory (on x86 this is reached via the fs register). A thread can tell the kernel where its TID is by calling the set_tid_address syscall. The final piece of the puzzle is that when threads are created, they use the clone syscall with the CLONE_CHILD_CLEARTID flag set, which tells the kernel to treat the TID address as a futex. When the thread exits, the kernel writes a 0 to the TID address and wakes up the futex. When the parent wakes up, it can use the TID like a condition variable and check that it’s 0 before continuing.
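The kernel-side wakeup can be seen without pthreads at all. The sketch below is an assumption-laden illustration, not how glibc is actually written: it creates a thread with glibc’s clone() wrapper, passes the flag (spelled CLONE_CHILD_CLEARTID in the headers), and then "joins" by waiting on the TID word. The names child_tid, child_ran, child_fn and clone_join_demo are invented, and the child shares the parent’s TLS, so it deliberately does almost nothing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

static _Atomic int child_tid;  /* kernel stores the TID here, then 0 on exit */
static _Atomic int child_ran;

static int child_fn(void *arg)
{
    (void)arg;
    atomic_store(&child_ran, 1);
    /* Returning makes the clone() wrapper issue the exit syscall.
     * Note that the child never calls FUTEX_WAKE itself. */
    return 0;
}

/* Returns the value left in child_tid once the child is gone. */
static int clone_join_demo(void)
{
    char *stack = malloc(STACK_SIZE);
    assert(stack != NULL);

    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                CLONE_THREAD | CLONE_SYSVSEM |
                CLONE_PARENT_SETTID |   /* write the TID before clone returns */
                CLONE_CHILD_CLEARTID;   /* zero it and wake the futex on exit */

    /* The stack grows down on common architectures, so pass its top. */
    int tid = clone(child_fn, stack + STACK_SIZE, flags, NULL,
                    &child_tid, NULL, &child_tid);
    assert(tid > 0);

    /* "Join": sleep while the word still holds the TID. The kernel's wake
     * is a shared (non-private) FUTEX_WAKE, so wait with plain FUTEX_WAIT. */
    int seen;
    while ((seen = atomic_load(&child_tid)) != 0)
        syscall(SYS_futex, &child_tid, FUTEX_WAIT, seen, NULL, NULL, 0);

    free(stack);  /* safe now: the child no longer uses its stack */
    return atomic_load(&child_tid);
}
```

Under strace -f, a program calling clone_join_demo() should show the parent’s FUTEX_WAIT and the child’s exit, with no FUTEX_WAKE from userspace in between, matching what I saw with pthread_join.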

The Fix

In my case, I was breaking this process because reinitializing libc was causing the fs register to change, and a new address was registered as the TID address. While I was tracking the fs register change and restoring it to the original value when appropriate, I wasn’t calling set_tid_address to fix the TID address tracked by the kernel. As a result, the kernel never cleared and woke the address the parent was waiting on, so the parent thread was never woken up on exit!
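Here is a rough sketch of the failure and the fix in isolation. It is not my actual code: the second libc is stood in for by a single set_tid_address call that points the kernel at a decoy word, and the fix is a second call that points it back. The names joined_tid, decoy_tid, child_fn and restore_tid_address_demo are hypothetical. Without the second call in child_fn, the wait loop in the parent would never finish.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

static _Atomic int joined_tid;  /* the word the parent waits on */
static _Atomic int decoy_tid;   /* stands in for the address registered
                                   after reinitialization (assumption) */

static int child_fn(void *arg)
{
    (void)arg;
    /* Simulated breakage: the kernel would now clear decoy_tid on exit. */
    syscall(SYS_set_tid_address, &decoy_tid);
    /* The fix: point the kernel back at the word the parent is watching. */
    syscall(SYS_set_tid_address, &joined_tid);
    return 0;
}

/* Returns 0 if the parent was woken through joined_tid and the decoy
 * word was left untouched, -1 otherwise. */
static int restore_tid_address_demo(void)
{
    char *stack = malloc(STACK_SIZE);
    assert(stack != NULL);
    atomic_store(&decoy_tid, -1);  /* sentinel: must survive the exit */

    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                CLONE_THREAD | CLONE_SYSVSEM |
                CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;
    int tid = clone(child_fn, stack + STACK_SIZE, flags, NULL,
                    &joined_tid, NULL, &joined_tid);
    assert(tid > 0);

    /* Same wait loop a join would use: sleep until the word becomes 0. */
    int seen;
    while ((seen = atomic_load(&joined_tid)) != 0)
        syscall(SYS_futex, &joined_tid, FUTEX_WAIT, seen, NULL, NULL, 0);

    free(stack);
    return (atomic_load(&joined_tid) == 0 &&
            atomic_load(&decoy_tid) == -1) ? 0 : -1;
}
```

The key point is that set_tid_address simply overwrites the per-thread pointer the kernel keeps, so whichever call happens last before the thread exits decides which word is cleared and woken.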

How to debug these kinds of issues?

I learned a couple of new debugging tricks while figuring this out! The first was debugging threaded programs with gdb. There’s a neat trick of running set scheduler-locking on, which keeps every thread except the current one paused; the current thread can be switched with thread <number>. The other trick was to use catch syscall in gdb, which effectively sets a breakpoint on every syscall and reports details about the syscall executed, like an interactive strace. Beyond that, there was a fair amount of staring at assembly in gdb, rubber ducking with other lab members, and questioning my life choices.
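If you want something small to try these tricks on, a target along the following lines should do. This is only a sketch of the kind of test program I mean, with placeholder names (child_start, spawn_and_join); it has no main, so call spawn_and_join from one of your own. In gdb, break child_start stops in the child, catch syscall futex narrows the catchpoint to futex calls, and set scheduler-locking on lets you step one thread while the other stays parked in pthread_join.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Child thread start function: a convenient place for a gdb breakpoint. */
static void *child_start(void *arg)
{
    /* Hand back input + 1 so the parent can check the join result. */
    return (void *)((intptr_t)arg + 1);
}

/* Spawns one thread, joins it, and returns the value the child produced. */
static intptr_t spawn_and_join(intptr_t input)
{
    pthread_t t;
    void *result = NULL;

    int rc = pthread_create(&t, NULL, child_start, (void *)input);
    assert(rc == 0);

    /* This is the call that hung for me: it blocks in FUTEX_WAIT until
     * the kernel clears the child's TID word. */
    rc = pthread_join(t, &result);
    assert(rc == 0);

    return (intptr_t)result;
}
```

Running the same program under strace -f is a quick way to compare what gdb’s catchpoints report against the raw syscall trace.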

Hopefully this post will help some other lost night watchers. ChatGPT was actually pretty helpful in explaining thread internals to me, but without testing things out myself and debugging things by hand, there are a lot of nuances I would have missed. While I’m still on the fence about the role of AI/LLMs in society and software development, it seems like they’re here to stay. Maybe posts like this will help provide more training data on niche topics like operating system internals.

Written on January 16, 2026