GPU Accelerated PCAP Filtering

This post is a part of a series. Click here for the previous post.

In the final installment of my quest to build a GPU-accelerated PCAP parsing/filtering engine, I turn to CUDA and explore Rust’s FFI capabilities.


In my last post I showed my “completed” GPU CSV parser. One of the biggest issues with my implementation was that rows spanning multiple buffers weren’t parsed correctly. To solve this, I would need to map two buffers per call and treat them as a logical concatenation, keeping track of the offset of the last line in the first buffer. This is a hack to work around the limits on how big each buffer can be. Another issue is that implementing every operation on the GPU is probably not optimal - some operations, like the initial line count, are probably more efficient on the CPU than on the GPU.

Using CUDA instead of writing a compute shader

My experience so far has felt like I’ve been fighting against the tools and APIs. There’s a lot of boilerplate just to get code running on the GPU, and even then I ran into extra friction and complexity using rust-gpu. For this phase of the project, I decided to use CUDA so I could focus on implementing parallel algorithms and improving my understanding of GPU architecture. However, I still wanted to use Rust for most of the non-GPU pieces. Working with Rust has been so much nicer than working with C++, and I’d like to keep building my familiarity with it. I found this repo with an example of Rust code that links against a CUDA kernel. I had to make a few tweaks to get it running on my system: I added .ccbin(false) to the cc::Build instance (this omits the -ccbin flag so nvcc selects the best available compiler), and I removed the -gencode flag (letting nvcc determine the architecture and code automatically - for my purposes I don’t care much about compatibility with anything other than the hardware on this machine).

Parsing PCAPs

I found this link that explained the PCAP file format. This was super useful in helping me understand what the challenges of parsing PCAPs on the GPU would be. While running filters or extracting data from captured packets might be embarrassingly parallel, actually finding the packets in the file is far from trivial. This is because there’s no global index of packet offsets in the file. The only way to determine where a packet starts and stops is to know where the previous packet begins and then read its captured-length field to see how many bytes were recorded for it. This means the only reliable way to find every packet’s offset is to start from the file header and walk through every packet, building an index as you go. Maybe it’s possible to do this in parallel with some constraints like a maximum packet length, but I just don’t see a better way. It would be ideal for tcpdump itself to emit this index while it’s capturing the packets, but that’s out of scope for now. Building this index is pretty fast - even for very large captures that are several gigabytes in size.
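
As a rough sketch of that sequential scan (the function name and exact code here are mine, not from the repo), assuming the classic PCAP layout - a 24-byte global file header, then a 16-byte per-packet record header whose third field is the captured length, followed by that many bytes of packet data - and a capture whose byte order matches the host:

#include <cstdint>
#include <cstring>
#include <vector>

// Walk the capture sequentially, recording where each packet record starts.
// NOTE: hypothetical sketch; byte-swapped captures (magic 0xd4c3b2a1) are not
// handled here.
std::vector<uint64_t> build_packet_index(const char *pcap, uint64_t size) {
  std::vector<uint64_t> offsets;
  uint64_t off = 24; // skip the global file header
  while (off + 16 <= size) {
    offsets.push_back(off);
    uint32_t caplen;
    // the captured length lives 8 bytes into the record header
    std::memcpy(&caplen, pcap + off + 8, sizeof(caplen));
    off += 16 + caplen; // next record starts after the header and captured bytes
  }
  return offsets;
}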

Once I know where the offsets are, each thread can read a specific packet and evaluate the filter. For now, the only supported operation is filtering packets by source IP. On the GPU there was an additional complication: all reads must be aligned to a 32-bit boundary, but the source IP is not guaranteed to sit at an aligned offset in a PCAP. To solve this, I wrote the following function to allow reading a u32 from an arbitrary offset:

__device__ uint32_t access_u32(char *d_pcap, uint64_t offset) {
  // only aligned accesses are allowed, so we need to align offset to a 32b
  // boundary
  auto rem = offset % 4;
  auto start = offset - rem;

  // Read the IP address with the correct endianness
  auto first = bswap(*(uint32_t *)(d_pcap + start));
  if (rem == 0) {
    return first;
  }
  auto last = bswap(*(uint32_t *)(d_pcap + start + 4));

  // get the last `rem` bytes from `first` and the first `4 - rem` bytes from
  // last
  first <<= 8 * rem;
  last >>= 8 * (4 - rem);

  return first | last;
}

Here, bswap is my implementation of __builtin_bswap32, because CUDA doesn’t allow functions that aren’t marked as __device__ to be used from a device function. This is a lot more restrictive/annoying than rust-gpu where any function could be called from the kernel and it would simply be compiled for the GPU target.
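
For reference, a byte-swap helper along these lines is all that’s needed (a sketch of the idea; the actual implementation in the repo may differ):

// Reverse the byte order of a 32-bit word on the device; equivalent to
// __builtin_bswap32 on the host. CUDA's __byte_perm(x, 0, 0x0123) intrinsic
// would also work here.
__device__ uint32_t bswap(uint32_t x) {
  return ((x & 0x000000FFu) << 24) |
         ((x & 0x0000FF00u) << 8)  |
         ((x & 0x00FF0000u) >> 8)  |
         ((x & 0xFF000000u) >> 24);
}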

The full code can be found here.

Comparison with a multi-threaded implementation

To establish a baseline, I also implemented a threaded version on the CPU. To benchmark, I generated a relatively small PCAP that’s mostly just a trace of me watching a YouTube video (this one to be precise). Comparing the performance was…disappointing. Currently, the GPU version takes about 1s while the multi-threaded CPU version takes about 10ms.

To see if this was just a matter of IO overhead dwarfing any speedup in compute, I tried again with a larger PCAP: a capture of some traffic over a Docker network running a graph database, a bit over 9GB. This didn’t work with my code as-is because the memory allocations on the device failed. I was able to get things to at least run by telling the kernel to only consider the first 20 million packets (one fourth of the total). This ran, and showed the GPU version taking around 600ms while the threaded version took 500ms to process all packets.

At this point, I added a few timers to the kernel and measured various phases of the program to get the following report:

Alloc time: 55ms
Copy time: 414ms
Compute time: 256us
Copy to host/free time: 40ms
Finalization time: 62ms
elapsed = 577.202005ms

The elapsed measurement is done from the driving code in Rust, while the rest of the timers are in my CUDA implementation. There are a few milliseconds unaccounted for. Some of that is probably rounding error, since my C++/CUDA timers only report whole milliseconds/microseconds. The rest likely comes from a sum operation on the CPU that takes the results of the filtering and turns them into a single count (this could be done on the GPU), plus possibly some overhead from converting buffers to/from Rust slices. What’s most interesting here is the compute time itself - only 256us!!!
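
As an aside, moving that final count onto the GPU could be as simple as an atomic counter that each thread bumps when its packet matched the filter. A sketch of that idea (count_matches and the matches buffer are hypothetical names, not from my code):

#include <cstdint>

// Each thread checks one entry of the per-packet match flags and atomically
// increments a single device-side counter if its packet matched.
__global__ void count_matches(const uint8_t *matches, uint64_t n,
                              unsigned int *count) {
  uint64_t i = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
  if (i < n && matches[i]) {
    atomicAdd(count, 1u);
  }
}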

This gives us a sense of what the numbers would look like if I implemented some kind of chunking and processed a piece of the PCAP per iteration. Since we ran the above with one fourth of the file, let’s assume it would take four iterations. If we just ran the compute portion four times, we’d still only be around 1ms, and even if copying the results back to the host took 40ms per iteration, that would be about 161ms total (160ms of copies plus ~1ms of compute) - still faster than my threaded implementation. However, the real killer is the cost of copying the PCAP into GPU memory: the allocation would only need to happen once, but the copy itself would need to be repeated for each chunk, taking at least 1.6s just to move the data onto the GPU.
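
To make that concrete, a chunked loop might look roughly like this (a sketch under the assumptions above, not the repo’s actual code; the kernel launch and the handling of packets that straddle a chunk boundary are left out):

#include <algorithm>
#include <cstdint>
#include <cuda_runtime.h>

// Allocate the device buffer once and reuse it; the host-to-device copy is the
// part that has to be paid for every chunk.
void filter_in_chunks(const char *pcap, uint64_t size, uint64_t chunk_size) {
  char *d_chunk = nullptr;
  cudaMalloc((void **)&d_chunk, chunk_size); // paid once (~55ms in the run above)
  for (uint64_t off = 0; off < size; off += chunk_size) {
    uint64_t len = std::min(chunk_size, size - off);
    // paid per chunk (~414ms for a quarter of the 9GB capture above)
    cudaMemcpy(d_chunk, pcap + off, len, cudaMemcpyHostToDevice);
    // ...launch the filter kernel over the packets that start in this chunk,
    // then copy the per-packet results back to the host...
  }
  cudaFree(d_chunk);
}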

Thoughts/Conclusions

  • CUDA isn’t as painful to use as I’d expected
  • Linking C/C++/CUDA and Rust is relatively straightforward
  • rust-gpu’s biggest strength, IMO, is how easy it makes sharing code between the CPU and GPU, and even running the GPU code on the CPU
  • Learning about shaders before trying CUDA was probably the best way to learn the core concepts

Using CUDA and linking it to Rust has been a better experience than using wgpu and rust-gpu so far. I still believe in rust-gpu’s vision, but it’s definitely not ready for production use. One really cool library I found was cuco. Reading through some of the container implementations there was pretty fun, and it’s always good to know what tools are already available.

At the end of the day, filtering PCAPs just isn’t compute-heavy enough to be a good workload for the GPU. Revisiting my original inspiration for this project, Compilation on the GPU?: a feasibility study, it’s worth noting that the authors specifically omit IO time from their measurements, stating that their goal is only to measure the feasibility of the parallel algorithms. I do wish they had included a section on the IO speeds they were dealing with, though, to better show where work is needed to take full advantage of these algorithms.

My other key takeaway is that, as a file format, PCAP doesn’t make parallel processing particularly easy. The index I built to enable parallel processing was pretty simple and only adds a little overhead - it seems to me that not generating this index during packet collection is a missing feature in tools like tcpdump. As far as I can tell, newer formats like PCAP-ng don’t provide anything equivalent within the format itself either. Maybe someday I’ll try to patch tcpdump to generate this index file alongside its output as a proof of concept.

I think this will be my last post in this series - it was a fun journey, and I learned some new skills! The conclusion was pretty predictable, but this was good motivation to try out new approaches to GPU programming, learn more Rust, and understand the PCAP file format better.

Written on January 21, 2025