How does joining a thread actually work?
As part of my research I’ve been playing around with something similar to
dynamic linking. In particular, part of my work has led me to having programs
that may need multiple copies of libc in the same address space (yes, it’s not
safe, I know). I noticed that this would sometimes cause multithreaded programs
to hang while calling pthread_join. Join (pun intended) me to understand a bit
about how threads work and what was happening under the hood.
Seeing a hang on pthread_join, my first instinct was that the child thread is
not exiting. I was able to confirm that it truly was exiting by running the
program with gdb and setting a breakpoint on the child thread's start function.
I was able to then observe the thread successfully exit while the parent thread
remained stuck at pthread_join. This left two possibilities - either
the child is not properly notifying the parent of the exit, or the parent isn’t
correctly listening for the child’s exit. In my case, all of my libc
reinitialization shenanigans were happening in the child so I was more
suspicious of the child exit path than the parent wait path. So how does a
thread notify its parent of completion?
In Linux, threads and processes signal completion in very distinct ways. When a
process exits, the kernel sends a SIGCHLD signal to its parent, but a thread
exit sends no such signal. Instead, threads use the futex
mechanism. A futex is similar to a condition variable in that it is a
synchronization mechanism that typically tracks state changes. A waiter
registers itself by issuing a FUTEX_WAIT operation on a particular address,
while some notifier can wake up the waiter by performing a FUTEX_WAKE on the
same address. Usually, the notifier writes to the memory at that address before
emitting the WAKE. Unlike a condition variable, there is no additional state
stored at the address - the queue of waiters is managed by the kernel instead. So when a thread
runs pthread_join, what it really does is call FUTEX_WAIT on some address
and it expects the child thread to call FUTEX_WAKE before it exits. Or at
least, that’s what ChatGPT told me. As a good scientist I had to confirm this
by writing a test program and running it with strace to check which system
calls were being emitted. Surprisingly, I saw the FUTEX_WAIT call, but no
FUTEX_WAKE! My first instinct was that futexes might have special behavior
that’s managed by the kernel - similar to file descriptors which are closed on
exit. A bit of googling revealed that this was the case for a futex that
represents the TID.
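To make the WAIT/WAKE handshake concrete, here is a minimal sketch of an explicit futex pair in C. It is not what pthread_join does internally, just the general pattern described above. The names ready_word, futex, notifier and wait_until_ready are made up for illustration, and the raw syscall wrapper is needed because glibc does not export a futex() function.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The futex word (hypothetical name): 0 = not ready, 1 = ready. */
static _Atomic uint32_t ready_word;

/* Thin wrapper around the raw futex syscall (glibc has no futex() function). */
static long futex(_Atomic uint32_t *addr, int op, uint32_t val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Notifier: write the new state first, then wake one waiter on that address. */
static void *notifier(void *arg) {
    (void)arg;
    atomic_store(&ready_word, 1);
    futex(&ready_word, FUTEX_WAKE, 1);
    return NULL;
}

/* Waiter: sleep until the word becomes non-zero, then return its value. */
static uint32_t wait_until_ready(void) {
    pthread_t t;
    atomic_store(&ready_word, 0);
    if (pthread_create(&t, NULL, notifier, NULL) != 0) {
        return 0; /* could not start the notifier thread */
    }
    /* FUTEX_WAIT only sleeps if the word still equals the expected value (0),
     * so a wake that happens before we get here is not lost. The loop also
     * covers spurious wakeups. */
    while (atomic_load(&ready_word) == 0) {
        futex(&ready_word, FUTEX_WAIT, 0);
    }
    pthread_join(t, NULL);
    return atomic_load(&ready_word);
}
```

Calling wait_until_ready() from main and building with cc -pthread should be enough to try it. Running the result under strace -f shows both the FUTEX_WAIT and the FUTEX_WAKE, because here the wake is issued from user space rather than by the kernel.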
But what is the TID and how does the kernel know that a particular futex
represents one? The TID, or thread ID, is how a thread distinguishes itself from
another thread in the same process. While each thread has a distinct process
context in the kernel, the process id is shared in userspace, so the thread id
is used to distinguish threads. Typically this thread id is stored in thread
local memory (on x86-64 this is accomplished via the fs register). A thread can
tell the kernel where its TID is by calling the set_tid_address syscall. The
final piece of the puzzle is that when threads are created, they use the clone
syscall with the CLONE_CHILD_CLEARTID flag set to tell the kernel to treat the
tid address as a futex. When the thread exits, the kernel will write a 0 to the
tid address and wake up the futex. When the parent wakes up, it can use the TID
like a condition variable and check if it’s 0 before continuing.
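The kernel-driven wake can be reproduced without pthreads. The sketch below is an assumption-laden illustration, not glibc's implementation: it uses glibc's clone() wrapper with CLONE_CHILD_CLEARTID (the flag's spelling in the kernel headers) and waits on the TID word with FUTEX_WAIT. To keep it small, the child shares memory with the parent (CLONE_VM) but is not a full thread (no CLONE_THREAD or CLONE_SETTLS), so it must not touch libc state. The names child_tid_word, child_result, child_main and spawn_and_join are hypothetical, and __atomic_load_n is a GCC/Clang builtin.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (64 * 1024)

/* TID word (hypothetical name): clone stores the child's TID here, and the
 * kernel writes 0 and does a futex wake on it when the child exits. */
static pid_t child_tid_word;
static int child_result;

/* Child body: only touches shared memory, no libc calls. */
static int child_main(void *arg) {
    (void)arg;
    __atomic_store_n(&child_result, 42, __ATOMIC_RELEASE);
    return 0; /* the clone wrapper exits this task after we return */
}

/* Spawns the child, waits on the TID word, returns child_result (or -1). */
static int spawn_and_join(void) {
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL) {
        return -1;
    }
    child_result = 0;

    /* CLONE_PARENT_SETTID fills the word before clone returns;
     * CLONE_CHILD_CLEARTID registers the same word for clearing at exit. */
    int flags = CLONE_VM | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD;
    pid_t pid = clone(child_main, stack + CHILD_STACK_SIZE, flags, NULL,
                      &child_tid_word, NULL, &child_tid_word);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    /* The "join": sleep while the word still holds the TID. Nobody in user
     * space ever calls FUTEX_WAKE here; the kernel does it at child exit. */
    pid_t seen;
    while ((seen = __atomic_load_n(&child_tid_word, __ATOMIC_ACQUIRE)) != 0) {
        syscall(SYS_futex, &child_tid_word, FUTEX_WAIT, seen, NULL, NULL, 0);
    }

    waitpid(pid, NULL, 0); /* reap the child before releasing its stack */
    free(stack);
    return __atomic_load_n(&child_result, __ATOMIC_ACQUIRE);
}
```

If the word passed as the last argument to clone were a different address from the one the parent waits on, the loop would never see a 0 and the parent would sleep forever, which is the same shape as the hang described in this post.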
The Fix
In my case, I was breaking this process because reinitializing libc was causing
the fs register to change, and a new address was registered as the TID
address. While I was tracking the fs register change and restoring it to the
original value when appropriate, I wasn’t calling set_tid_address to fix the
TID address tracked by the kernel. On exit, the kernel was therefore clearing
and waking the new address rather than the one the parent thread was waiting on!
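For reference, re-registering a TID address looks roughly like the sketch below. This is not the actual patch from my setup: tid_slot and register_tid_address are hypothetical names, and in the real fix the pointer would have to be the original address that the joining thread waits on. glibc does not provide a wrapper for this syscall, so it goes through syscall().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the original TID word the parent is waiting on. */
static pid_t tid_slot;

/* Tell the kernel to clear and futex-wake *slot when the calling thread
 * exits. The syscall returns the caller's TID; it does not write to *slot. */
static pid_t register_tid_address(pid_t *slot) {
    return (pid_t)syscall(SYS_set_tid_address, slot);
}
```

Note that calling this on a thread created by pthread_create with an address other than the one libc registered recreates the hang, since pthread_join keeps waiting on the old address.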
How to debug these kinds of issues?
I learned a couple of new debugging tricks to figure this out! The first was
debugging threaded programs with gdb. There’s a neat trick of running set
scheduler-locking on, which pauses all threads except for the current thread
(which can be switched with thread <number>). The other trick was to use catch
syscall in gdb, which effectively sets a breakpoint on every syscall
instruction and provides details about the syscall executed - like an
interactive strace. Beyond that, there was a fair amount of staring at
assembly in gdb, rubber ducking with other lab members, and questioning my
life choices.
Hopefully this post will help some other lost night watchers. ChatGPT
was actually pretty helpful in explaining thread internals to me, but without
testing things out myself and debugging things by hand, there are a lot of nuances
I would have missed. While I’m still on the fence about the role of AI/LLMs in
society and software development, it seems like it’s here to stay. Maybe posts
like this will help provide more training data on more niche topics like
operating system internals.
