GPU Accelerated PCAP Filtering

This post is a part of a series. Click here for the previous post.

On my quest to build a GPU-accelerated PCAP parser/filtering engine, I turn to CUDA and explore Rust’s FFI capabilities.


In my last post I showed my “completed” GPU CSV parser. One of the biggest issues with my implementation was that rows spanning multiple buffers weren’t parsed correctly. To solve this I would need to map two buffers per call and treat them as a logical concatenation, keeping track of the offset of the last line in the first buffer. That’s a hack to work around the limits on how big each buffer can be. Another issue is that implementing every operation on the GPU is probably not optimal: some operations, like the initial line count, are probably more efficient on the CPU than on the GPU.

Using CUDA instead of writing a compute shader

My experience so far has felt like I’ve been fighting against the tools and APIs. There’s a lot of boilerplate just to get code running on the GPU, and even then rust-gpu added its own share of complexity. For this phase of the project, I decided to use CUDA so I could focus on implementing parallel algorithms and improving my understanding of GPU architecture. However, I still wanted to use Rust for most of the non-GPU pieces: working with Rust has been so much nicer than working with C++, and I’d like to keep building my fluency with it. I found this repo with an example of Rust code that links against a CUDA kernel, and I had to make a few tweaks to get it running on my system. Namely, I added .ccbin(false) to the cc::Build instance (this omits the -ccbin flag so nvcc picks the host compiler on its own) and removed the -gencode flag (so nvcc chooses the architecture and code itself; for my purposes I don’t care much about compatibility with anything other than the hardware on this machine).

Parsing PCAPs

I found this link explaining the PCAP file format, which was super useful in helping me understand what the challenges of parsing PCAPs on the GPU would be. While running filters or extracting data from captured packets might be embarrassingly parallel, actually finding the packets in the file is far from trivial. This is because there’s no global index of packet offsets in the file. The only way to determine where a packet starts and stops is to know where the previous packet begins, then read its captured-length (caplen) field to see how many bytes were recorded for it. That means the only reliable way to find every packet is to start from the file header and walk through each packet in turn, building an index of offsets as you go. Maybe it’s possible to do this in parallel with some constraints like a maximum packet length, but I just don’t see a better way. It would be ideal for tcpdump itself to emit this index while it’s capturing the packets, but that’s out of scope for now. Building this index is pretty fast, even for captures that are several gigabytes in size.
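
To make that concrete, the walk looks roughly like the sketch below. This is illustrative C++ rather than my actual host code (which is Rust); it assumes the classic libpcap layout of a 24-byte file header followed by records, each with a 16-byte header whose bytes 8-11 hold the captured length, and that the file’s byte order matches the host.

#include <cstdint>
#include <cstring>
#include <vector>

// Sequential index build: record the byte offset of every packet record.
// Assumes the classic pcap layout (24-byte file header, 16-byte record
// headers with incl_len at bytes 8..11) and host-native byte order.
std::vector<uint64_t> build_packet_index(const uint8_t *pcap, uint64_t size) {
  std::vector<uint64_t> offsets;
  uint64_t off = 24;  // skip the global file header
  while (off + 16 <= size) {
    offsets.push_back(off);
    uint32_t incl_len;
    std::memcpy(&incl_len, pcap + off + 8, sizeof(incl_len));
    off += 16 + incl_len;  // record header + captured bytes
  }
  return offsets;
}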

Once I know the offsets, each thread can read a specific packet and evaluate the filter. For now, the only supported operation is filtering packets by source IP. On the GPU there was an additional complication: all reads must be aligned to a 32-bit boundary, but the source IP is not guaranteed to sit at an aligned offset in a PCAP. To solve this, I wrote the following function to allow reading a u32 from an arbitrary offset:

__device__ uint32_t access_u32(char *d_pcap, uint64_t offset) {
  // only aligned accesses are allowed, so we need to align the offset down
  // to a 32-bit boundary
  auto rem = offset % 4;
  auto start = offset - rem;

  // Read the first aligned 32-bit word with the correct endianness
  auto first = bswap(*(uint32_t *)(d_pcap + start));
  if (rem == 0) {
    return first;
  }
  auto last = bswap(*(uint32_t *)(d_pcap + start + 4));

  // keep the last `4 - rem` bytes of `first` and the first `rem` bytes
  // of `last`
  first <<= 8 * rem;
  last >>= 8 * (4 - rem);

  return first | last;
}

Here, bswap is my implementation of __builtin_bswap32, because CUDA doesn’t allow functions that aren’t marked as __device__ to be used from a device function. This is a lot more restrictive/annoying than rust-gpu, where any function could be called from the kernel and would simply be compiled for the GPU target.
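
The swap itself is just a few shifts and masks; something along these lines (a sketch, not necessarily the exact code in the repo):

// A __device__ 32-bit byte swap standing in for __builtin_bswap32:
// shifts and masks reverse the four bytes of x.
__device__ uint32_t bswap(uint32_t x) {
  return ((x & 0x000000ffu) << 24) | ((x & 0x0000ff00u) << 8) |
         ((x & 0x00ff0000u) >> 8) | ((x & 0xff000000u) >> 24);
}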

The full code can be found here.
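
For a sense of how the pieces fit together, a filter kernel built on access_u32 can be pretty small. The sketch below uses hypothetical names rather than the exact signature in the repo, and assumes each index entry points at a packet’s 16-byte record header, that packets are IPv4 over Ethernet (so the source IP sits 16 + 14 + 12 = 42 bytes past that), and that the target IP is passed as a plain u32 (e.g. 192.168.1.1 as 0xC0A80101):

// Illustrative filter kernel (hypothetical names, not the repo's exact code).
// d_offsets holds the byte offset of each packet's record header and d_match
// gets one flag per packet.
__global__ void filter_src_ip(char *d_pcap, const uint64_t *d_offsets,
                              uint32_t n_packets, uint32_t target_ip,
                              uint8_t *d_match) {
  uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n_packets) {
    return;
  }
  // 16-byte record header + 14-byte Ethernet header + 12 bytes into the
  // IPv4 header = source address, read via the unaligned-safe helper.
  uint32_t src_ip = access_u32(d_pcap, d_offsets[i] + 42);
  d_match[i] = (src_ip == target_ip) ? 1 : 0;
}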

Comparison with a multi-threaded implementation

To establish a baseline, I also implemented a threaded version on the CPU. To benchmark, I generated a relatively small PCAP that’s mostly just a trace of me watching a YouTube video (this one to be precise). Comparing the performance was…disappointing. Currently the GPU version takes 1s while the multi-threaded CPU version takes 10ms. I haven’t looked into why the GPU version is so much slower yet, but I also want to try larger PCAPs to see if I’m just not hitting the scale where the GPU shines. However, I’m not confident that my code as-is would work with very large files: I think I still need some kind of chunking to avoid trying to allocate buffers larger than the largest available contiguous region of device memory.
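
The CPU baseline itself is written in Rust, but structurally it just splits the packet-offset index across threads and checks each packet’s source IP. Roughly this shape, sketched here in C++ with made-up names and the same 42-byte offset assumption as the kernel:

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Rough shape of the threaded CPU filter (the real baseline is Rust).
// Each thread scans a contiguous slice of the packet index and flags matches.
void filter_cpu(const uint8_t *pcap, const std::vector<uint64_t> &offsets,
                uint32_t target_ip, std::vector<uint8_t> &match,
                unsigned n_threads) {
  match.assign(offsets.size(), 0);
  std::vector<std::thread> workers;
  size_t chunk = (offsets.size() + n_threads - 1) / n_threads;
  for (unsigned t = 0; t < n_threads; ++t) {
    size_t begin = t * chunk;
    size_t end = std::min(offsets.size(), begin + chunk);
    workers.emplace_back([&, begin, end] {
      for (size_t i = begin; i < end; ++i) {
        uint32_t src_be;
        std::memcpy(&src_be, pcap + offsets[i] + 42, sizeof(src_be));
        // Packet bytes are big-endian; byte-swap (little-endian host assumed).
        match[i] = (__builtin_bswap32(src_be) == target_ip) ? 1 : 0;
      }
    });
  }
  for (auto &w : workers) {
    w.join();
  }
}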

Thoughts/Conclusions

  • CUDA isn’t as painful to use as I’d expected
  • Linking C/C++/CUDA and Rust is relatively straightforward
  • rust-gpu’s biggest strength, IMO, was how easy it made sharing code between the CPU and GPU, and even running the GPU code on the CPU
  • Learning about shaders before trying CUDA was probably the best way to learn the core concepts

Using CUDA and linking to Rust has been a better experience than using wgpu and rust-gpu so far. I still believe in rust-gpu’s vision, but it’s definitely not ready for production use. Hopefully this phase of the project will help me become more familiar with CUDA, and make better use of the GPU.

Written on January 6, 2025