GPU Accelerated PCAP Filtering
On my quest to build a GPU accelerated PCAP parser/filtering engine, I turn to CUDA and explore Rust’s FFI capabilities.
In my last post I showed my “completed” GPU CSV parser. One of the biggest issues with my implementation was that rows spanning multiple buffers weren’t parsed correctly. To solve this I would need to map two buffers per call and treat them as a logical concatenation, keeping track of the offset of the last line in the first buffer. This is a hack to work around the fact that there are limits on how big each buffer can be. Another issue is that implementing every operation on the GPU is probably not optimal - some operations, like the initial line count, are probably more efficient on the CPU than on the GPU.
Using CUDA instead of writing a compute shader
My experience so far has felt like I’ve been fighting against the tools and APIs provided. There’s a lot of boilerplate just to get code running on the GPU, and even then I found myself facing extra complexity with rust-gpu. For this phase of the project, I decided to use CUDA so I could focus on implementing parallel algorithms and improving my understanding of GPU architecture. However, I still wanted to use Rust for most of the non-GPU pieces. Working with Rust has been so much nicer than working with C++, and I’d like to continue honing my familiarity with Rust. I found this repo with an example of writing Rust code that links a CUDA kernel. I had to make a few tweaks to get it running on my system. Namely, I had to add .ccbin(false) to the cc::Build instance (this omits the -ccbin flag to nvcc, which lets nvcc select the best available compiler), and remove the -gencode flag (this lets nvcc pick the architecture and code on its own - for my purposes I don’t really care too much about compatibility with anything other than the hardware on this machine).
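For reference, here’s a minimal sketch of what a build.rs with those tweaks can look like, using the cc crate’s CUDA support. The kernel path and library name below are placeholders, not the actual ones from the repo.

// build.rs - sketch only; the .cu path and library name are placeholders
fn main() {
    // Rebuild when the kernel source changes
    println!("cargo:rerun-if-changed=src/filter.cu");
    cc::Build::new()
        // Compile with nvcc rather than the host C++ compiler
        .cuda(true)
        // Omit the -ccbin flag so nvcc selects the best available host compiler
        .ccbin(false)
        // No -gencode flag here, so nvcc picks the architecture/code itself
        .file("src/filter.cu")
        .compile("filter");
}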
Parsing PCAPs
I found this link that explained the PCAP file format. This was super useful in helping me understand what the challenges of parsing PCAPs on the GPU would be. While running filters or extracting data from captured packets might be embarrassingly parallel, actually finding the packets in the file is far from trivial. This is because there’s no global index of packet offsets in the file. The only way to determine where a packet starts and stops is by knowing where the previous packet begins, then reading the cap length field to know how many bytes were captured for that packet. This means that the only reliable way to find all packet offsets is to start from the header and advance through every packet, building an index with the offset of every packet. Maybe it’s possible to do this in parallel with some constraints like a maximum packet length, but I just don’t see a better way. It would be ideal for tcpdump itself to emit this index while it’s capturing the packets, but that’s out of scope for now. Building this index is pretty fast - even for very large captures that are several gigabytes in size.
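To make that sequential scan concrete, here’s a rough sketch of how the index can be built. It assumes the classic pcap format in little-endian byte order, with a 24-byte global header and 16-byte per-packet record headers where the cap length sits at offset 8; the function name is just illustrative, and the real code differs in the details.

// Sketch of the sequential index build over a pcap loaded into memory.
fn build_packet_index(pcap: &[u8]) -> Vec<u64> {
    let mut offsets = Vec::new();
    // Skip the 24-byte global header
    let mut pos: usize = 24;
    while pos + 16 <= pcap.len() {
        // incl_len ("cap length") = number of bytes actually captured
        let incl_len =
            u32::from_le_bytes(pcap[pos + 8..pos + 12].try_into().unwrap()) as usize;
        // Record where this packet's data starts (just past the record header)
        offsets.push((pos + 16) as u64);
        // The next record header begins right after the captured bytes
        pos += 16 + incl_len;
    }
    offsets
}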
Once I know where the offsets are, each thread can read a specific packet and evaluate the filter. For now, the only supported operation is filtering packets by source IP. On the GPU there was an additional complication: all reads must be aligned to a 32-bit boundary. However, the source IP is not guaranteed to be at an aligned offset in a PCAP. To solve this, I wrote the following function to allow reading a u32 from an arbitrary offset:
__device__ uint32_t access_u32(char *d_pcap, uint64_t offset) {
  // only aligned accesses are allowed, so we need to align offset to a 32b
  // boundary
  auto rem = offset % 4;
  auto start = offset - rem;
  // Read the aligned word and byte-swap it into the correct endianness
  auto first = bswap(*(uint32_t *)(d_pcap + start));
  if (rem == 0) {
    return first;
  }
  auto last = bswap(*(uint32_t *)(d_pcap + start + 4));
  // keep the last `4 - rem` bytes from `first` and the first `rem` bytes from
  // `last`
  first <<= 8 * rem;
  last >>= 8 * (4 - rem);
  return first | last;
}
Here, bswap is my implementation of __builtin_bswap32, because CUDA doesn’t allow functions that aren’t marked as __device__ to be used from a device function. This is a lot more restrictive/annoying than rust-gpu, where any function could be called from the kernel and it would simply be compiled for the GPU target.
The full code can be found here.
Comparison with a multi-threaded implementation
To establish a baseline, I also implemented a threaded version on the CPU. To benchmark, I generated a relatively small PCAP that’s mostly just a trace of me watching a YouTube video (this one to be precise). Comparing the performance was…disappointing. Currently the GPU version takes 1s while the multi-threaded CPU version takes 10ms. I haven’t looked into why the GPU version is so much slower yet, but I also want to try larger PCAP sizes to see if I’m just not hitting the right scale for the GPU to shine. However, I’m not confident that my code as-is would work with very large files - I think I still need some kind of chunking to avoid attempting to allocate buffers that are larger than the largest available contiguous region in device memory.
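For context, the CPU baseline is conceptually just the packet index split across threads, with each thread comparing the source IP bytes of its packets. The sketch below is an approximation of that idea rather than the actual code: it assumes plain Ethernet + IPv4 framing (so the source address sits 26 bytes into the packet data), uses a hypothetical filter_by_src_ip function, and only counts matching packets.

// Sketch of a threaded CPU filter: split the packet index across threads and
// count packets whose IPv4 source address matches. Offsets point at packet
// data; 26 = 14-byte Ethernet header + 12-byte offset of the source IP field.
fn filter_by_src_ip(pcap: &[u8], offsets: &[u64], src_ip: [u8; 4], threads: usize) -> usize {
    let chunk = ((offsets.len() + threads - 1) / threads).max(1);
    std::thread::scope(|s| {
        let handles: Vec<_> = offsets
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    part.iter()
                        .filter(|&&off| {
                            let ip_off = off as usize + 26;
                            ip_off + 4 <= pcap.len() && pcap[ip_off..ip_off + 4] == src_ip
                        })
                        .count()
                })
            })
            .collect();
        // Sum the per-thread match counts
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}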
Thoughts/Conclusions
- CUDA isn’t as painful to use as I’d expected
- Linking C/C++/CUDA and Rust is relatively straightforward
- rust-gpu’s biggest strength IMO was that it was easy to share code between the CPU/GPU, but also to easily run the GPU code on the CPU
- Learning about shaders before trying CUDA was probably the best way to learn the core concepts
Using CUDA and linking to Rust has been a better experience than using wgpu and rust-gpu so far. I still believe in rust-gpu’s vision, but it’s definitely not ready for production use. Hopefully this phase of the project will help me become more familiar with CUDA, and make better use of the GPU.