There's a CoW holding up my HugePages
Hugepages can lead to performance issues when mapped in Copy-on-Write (CoW) mode. In this post, I’ll describe the problem with some benchmarks.
CoW is a great technique that helps concurrent programs share data transparently, which is especially important to UNIX programs that rely on fork. fork works by duplicating a process and all of its memory mappings to create a new process, with the only difference being that the return value of fork differs (0 in the child, the child's pid in the parent). When a process forks, all of its writable pages are marked both shared and read-only, and a flag is set to note that they should be considered for CoW. When a process writes to a shared page, it triggers a write fault, which causes the OS to copy the contents of the page to a new location and update the mapping for the writing process, while the original page is marked writable and exclusive. This allows the write to be visible only to the process that wrote, while also allowing the other process to retain its view of the page. For shared pages that never get written (or shared pages that are only written to by the parent after the child exits), CoW allows the OS to save on both memory usage and time, since unnecessary copies are avoided.
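To make the return-value convention and the CoW behavior above concrete, here is a minimal sketch of my own (not from any particular codebase): the parent's post-fork write is private to it, so the child still sees the pre-fork value.

// Minimal sketch of fork semantics: a write in the parent after fork
// triggers a CoW copy, so the child keeps seeing the original value.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int value = 42;            // lives on a page that becomes CoW-shared at fork
  pid_t p = fork();
  if (p == 0) {
    sleep(1);                // crude ordering: let the parent write first
    std::printf("child sees %d\n", value);   // still prints 42
    return 0;
  }
  value = 7;                 // write fault: parent gets a private copy of the page
  std::printf("parent (child pid %d) sees %d\n", (int)p, value);  // prints 7
  waitpid(p, nullptr, 0);
  return 0;
}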
While fork isn’t the best abstraction for all workloads (see this), it’s particularly well-suited for checkpoint/restore workloads.
Checkpointing
In many programs there’s a need for a mechanism that can save the current state of the program in some way that can be resumed or examined later. For example, video games (where there is an explicit notion of “saving”), machine learning (where intermediate model weights are checkpointed for either resuming training - perhaps when compute is sufficiently inexpensive, or for debugging), or even entire operating systems (see CRIU or VM migration). Regular readers of my blog (who probably don’t really exist tbh) probably know that this is an area that I’ve been interested in recently.
For a concrete example, let's take a look at redis (an in-memory key-value store). redis has a checkpointing command called BGSAVE, which works by forking the process, then more or less dumping memory to disk.
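To make the BGSAVE idea concrete, here is a rough sketch of my own (not redis code) of fork-based checkpointing: the child's address space is a CoW snapshot taken at the moment of fork, so it can dump a consistent view while the parent keeps mutating data.

// Rough sketch of fork-based checkpointing (not redis code). For brevity the
// child ignores short writes and error handling.
#include <sys/wait.h>
#include <unistd.h>
#include <fcntl.h>
#include <vector>

std::vector<char> data(64 * 1024 * 1024, 'x');  // the "database"

int main() {
  pid_t p = fork();
  if (p == 0) {
    // Child: write the snapshot and exit.
    int fd = open("checkpoint.bin", O_CREAT | O_TRUNC | O_WRONLY, 0644);
    write(fd, data.data(), data.size());
    close(fd);
    _exit(0);
  }
  // Parent: keeps serving writes; each page it touches is copied on demand.
  data[0] = 'y';
  waitpid(p, nullptr, 0);
  return 0;
}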
HugePages
fork-based checkpointing tends to perform poorly in the presence of HugePages.
For a quick recap: in Linux, every process has its own virtual address space that is mapped to physical memory through page tables, with recent translations cached in a hardware TLB (translation lookaside buffer). By default every mapping is done in 4KB "pages". However, this has some drawbacks. For programs that map large regions of memory, 4KB mappings lead to many page table entries, which can make fault handling slower, and worse, for large objects there is no guarantee that contiguous regions of virtual address space are actually physically contiguous beyond 4KB chunks.
HugePages are a more recent addition to the Linux kernel that allow a single mapping to cover a region of physical memory larger than 4KB. By default a hugepage is 2MB, but they can be configured to be 1GB. Linux also has a mode that enables hugepages "transparently", but for a plethora of reasons this is commonly disabled, so today a process typically has to request hugepages explicitly.
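For reference, one common way to request hugepages explicitly is an anonymous mmap with MAP_HUGETLB (or madvise(MADV_HUGEPAGE) to opt a regular mapping into transparent hugepages). A minimal sketch, assuming the system has hugepages reserved (e.g. via vm.nr_hugepages):

// Minimal sketch of explicitly requesting a 2MB hugepage-backed mapping.
// Assumes hugepages have been reserved ahead of time.
#include <sys/mman.h>
#include <cstdio>

int main() {
  const size_t len = 2 * 1024 * 1024;  // one 2MB hugepage
  void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    std::perror("mmap(MAP_HUGETLB)");  // fails if no hugepages are available
    return 1;
  }
  // Alternatively, madvise(p, len, MADV_HUGEPAGE) on a regular anonymous
  // mapping asks THP to back it with hugepages when possible.
  munmap(p, len);
  return 0;
}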
As far as other popular OSes go, Windows has a similar feature called large pages, and macOS has one it calls superpages, though macOS also uses 16KB pages by default instead of 4KB.
Why are HugePages problematic for checkpointing?
HugePages come with some huge tradeoffs. In terms of checkpointing, the child will usually never write to any pages from the parent. However, the parent may continue writing. For a 4KB page, this would trigger a 4KB copy. However, for a 2MB HugePage, this will trigger a 2MB copy, even if the write only changed a single byte! This can introduce huge latency spikes during the copies, and also results in increased memory pressure. For this reason redis recommends disabling transparent huge pages.
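To put numbers on the amplification (my own arithmetic, using the sizes from the benchmark below): touching one byte in each of 512 hugepages forces 512 × 2MB = 1GB of copying, while the same sparse writes under 4KB pages copy only 512 × 4KB = 2MB, a 512× difference in data copied for identical application writes.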
How can we verify this for ourselves?
I wanted to see if I could observe the latency spikes reported by redis for myself, so I built a simple test program:
// This benchmark can be configured to either use HugePages or use regular
// (small) pages with `-DUSE_SMALLPAGES` at compile time.
...
// This can be modified during compilation to determine how different numbers
// of allocated hugepages affect latency
#ifndef NPAGES
#define NPAGES 1
#endif

constexpr size_t HUGEPAGESIZE = 2048 * 1024;
constexpr size_t PAGESIZE = 4 * 1024;
constexpr size_t TESTSIZE = NPAGES * HUGEPAGESIZE;

int main() {
#ifndef USE_SMALLPAGES
  // Not strictly necessary, but I was using memfd_create here so that I could
  // more easily see the huge pages in the memory mapping
  int hpfd = memfd_create("hugememfd", MFD_HUGETLB);
  ftruncate(hpfd, TESTSIZE);
  auto *hp_ptr = mmap(NULL, TESTSIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_HUGETLB, hpfd, 0);
#else
  auto *hp_ptr = mmap(NULL, TESTSIZE, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
#endif
  // Setting up some pipes so I can control where I take measurements
  int fds0[2];
  int fds1[2];
  if (pipe(fds0) != 0) {
    std::cerr << "Error creating pipe\n";
    exit(1);
  }
  if (pipe(fds1) != 0) {
    std::cerr << "Error creating pipe\n";
    exit(1);
  }
  // I don't actually have to write to every byte here, but I want to ensure
  // that in the regular pages setting every page is actually mapped into
  // memory.
  for (size_t i = 0; i < TESTSIZE; i++) {
    ((char *)hp_ptr)[i] = i;
  }
  // This shows stats on how many HugePages are allocated
  // system("cat /proc/meminfo | grep ^Huge");
  pid_t p = fork();
  if (p != 0) {
    close(fds0[1]);
    close(fds1[0]);
    char b = '\0';
    read(fds0[0], &b, 1);
    std::cout << "got byte " << b << std::endl;
    // trigger a huge page write - note that this *should* be a lot cheaper in
    // the regular pages setting because we should only copy NPAGES*4KB instead
    // of NPAGES*2MB
    for (size_t i = 0; i < TESTSIZE; i += HUGEPAGESIZE) {
      ((char *)hp_ptr)[i] = b;
    }
    std::cout << "in mem: " << (int)*(char *)hp_ptr << std::endl;
    // This is useful when not benchmarking to confirm that the HugePage count
    // really did double.
    // system("cat /proc/meminfo | grep ^Huge");
    // let the child exit
    write(fds1[1], &b, 1);
    std::cout << "parent done\n";
  } else {
    close(fds0[0]);
    close(fds1[1]);
    std::cout << "child in mem: " << (int)*(char *)hp_ptr << std::endl;
    char b = 'a';
    // Inform the parent that the child has spawned
    write(fds0[1], &b, 1);
    // Wait until the parent signals to exit (ensures that hp_ptr's page is
    // mapped shared until we're ready)
    read(fds1[0], &b, 1);
    std::cout << "child done\n";
  }
}
I set up the following Makefile to build all my test configurations:
all: test_small test

test: test.cpp
	clang++ test.cpp -o test
	clang++ test.cpp -D NPAGES=128 -o test_128
	clang++ test.cpp -D NPAGES=256 -o test_256
	clang++ test.cpp -D NPAGES=512 -o test_512

test_small: test.cpp
	clang++ test.cpp -D USE_SMALLPAGES -o test_small
	clang++ test.cpp -D USE_SMALLPAGES -D NPAGES=128 -o test_small_128
	clang++ test.cpp -D USE_SMALLPAGES -D NPAGES=256 -o test_small_256
	clang++ test.cpp -D USE_SMALLPAGES -D NPAGES=512 -o test_small_512

enable_hugepages:
	# By default there are no huge pages available for allocation on my system.
	# This can be persisted by setting it in a config file, but I don't want to.
	sudo sysctl -w vm.nr_hugepages=1024
I ran all the benchmarks with hyperfine and got the following measurements:
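(The exact invocation isn't shown here; running each binary separately, something like the commands below, matches the repeated "Benchmark 1" headers in the output.)

hyperfine ./test
hyperfine ./test_small
hyperfine ./test_128
hyperfine ./test_small_128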
Benchmark 1: ./test
Time (mean ± σ): 10.3 ms ± 3.1 ms [User: 7.1 ms, System: 2.2 ms]
Range (min … max): 2.3 ms … 17.0 ms 317 runs
Benchmark 1: ./test_small
Time (mean ± σ): 10.3 ms ± 3.1 ms [User: 6.2 ms, System: 3.7 ms]
Range (min … max): 2.2 ms … 15.5 ms 298 runs
Benchmark 1: ./test_128
Time (mean ± σ): 276.7 ms ± 21.7 ms [User: 234.0 ms, System: 18.0 ms]
Range (min … max): 245.0 ms … 316.6 ms 12 runs
Benchmark 1: ./test_small_128
Time (mean ± σ): 200.0 ms ± 4.2 ms [User: 141.8 ms, System: 57.2 ms]
Range (min … max): 194.2 ms … 206.5 ms 14 runs
Benchmark 1: ./test_256
Time (mean ± σ): 563.2 ms ± 33.6 ms [User: 481.4 ms, System: 35.3 ms]
Range (min … max): 513.4 ms … 614.3 ms 10 runs
Benchmark 1: ./test_small_256
Time (mean ± σ): 383.6 ms ± 8.4 ms [User: 277.8 ms, System: 104.6 ms]
Range (min … max): 361.4 ms … 390.0 ms 10 runs
Benchmark 1: ./test_512
Time (mean ± σ): 1.161 s ± 0.091 s [User: 1.002 s, System: 0.068 s]
Range (min … max): 1.023 s … 1.329 s 10 runs
Benchmark 1: ./test_small_512
Time (mean ± σ): 773.3 ms ± 9.2 ms [User: 560.5 ms, System: 210.4 ms]
Range (min … max): 761.3 ms … 788.0 ms 10 runs
This confirms that CoW for hugepages is a significant source of latency!
However, I’d like to go a step further to confirm that my benchmark doesn’t have any other sources of latency. To check, I modified the test program so that the parent writes to every mapped 4KB page, and got the following numbers:
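The modification is just the stride of the parent's write loop (my paraphrase of the change, not the exact diff):

// Touch every 4KB page instead of one byte per 2MB hugepage, so both
// configurations end up copying the same total amount of memory under CoW.
for (size_t i = 0; i < TESTSIZE; i += PAGESIZE) {
  ((char *)hp_ptr)[i] = b;
}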
Benchmark 1: ./test
Time (mean ± σ): 12.2 ms ± 4.7 ms [User: 9.4 ms, System: 1.9 ms]
Range (min … max): 2.8 ms … 20.2 ms 298 runs
Benchmark 1: ./test_small
Time (mean ± σ): 15.1 ms ± 4.1 ms [User: 9.2 ms, System: 3.5 ms]
Range (min … max): 5.1 ms … 22.4 ms 166 runs
Benchmark 1: ./test_128
Time (mean ± σ): 369.3 ms ± 6.3 ms [User: 323.7 ms, System: 20.8 ms]
Range (min … max): 362.5 ms … 382.0 ms 10 runs
Benchmark 1: ./test_small_128
Time (mean ± σ): 392.4 ms ± 6.1 ms [User: 268.4 ms, System: 56.4 ms]
Range (min … max): 384.2 ms … 400.8 ms 10 runs
Benchmark 1: ./test_256
Time (mean ± σ): 732.2 ms ± 7.7 ms [User: 647.6 ms, System: 37.5 ms]
Range (min … max): 724.2 ms … 747.8 ms 10 runs
Benchmark 1: ./test_small_256
Time (mean ± σ): 784.4 ms ± 10.7 ms [User: 541.9 ms, System: 107.3 ms]
Range (min … max): 766.0 ms … 803.4 ms 10 runs
Benchmark 1: ./test_512
Time (mean ± σ): 1.467 s ± 0.010 s [User: 1.300 s, System: 0.076 s]
Range (min … max): 1.452 s … 1.487 s 10 runs
Benchmark 1: ./test_small_512
Time (mean ± σ): 1.577 s ± 0.008 s [User: 1.080 s, System: 0.225 s]
Range (min … max): 1.567 s … 1.589 s 10 runs
Here we see that the small-page tests take slightly longer than the HugePage tests, which is probably because we're triggering more faults, though the dominant factor is still the time spent copying page contents, so the numbers are roughly similar between the two variants.
Conclusion
If you’re using fork and HugePages, the latency of copying pages during a CoW fault can be expensive, but as the second benchmark above shows, it’s only significant if your writes are sparse. If you were going to touch every 4KB sub-region of a given HugePage anyway, HugePages are probably slightly better than regular-sized pages. However, for checkpointing, I think it’s pretty common for the checkpointing child process to be short-lived and only require a read-only view of the parent’s memory at a snapshot in time, so the cost imposed by HugePages is probably significant.
Interested in solving this problem?
This is a problem that I am actively working on and hope to address soon! I’ve already got some work in progress towards this, so if you’re also interested in this area, or would like to collaborate, please contact me either below or at aneeshd (at) cs.utexas.edu and I would love to talk!