GPU Accelerated PCAP Filtering
In the final installment of my quest to build a GPU accelerated PCAP parser/filtering engine, I turn to CUDA and explore Rust's FFI capabilities.
In my last post I showed my "completed" GPU CSV parser. One of the biggest issues with my implementation was that rows spanning multiple buffers weren't parsed correctly. To solve this, I would need to map two buffers per call and treat them as a logical concatenation by keeping track of the offset of the last line in the first buffer - a hack to work around the limits on how big each buffer can be. Another issue is that implementing every operation on the GPU is probably not optimal: some operations, like the initial line count, are probably more efficient on the CPU than on the GPU.
Using CUDA instead of writing a compute shader
My experience so far has felt like I've been fighting against the tools and APIs provided. There's a lot of boilerplate just to get code running on the GPU, and even then I found myself facing additional boilerplate and complexity with rust-gpu. For this phase of the project, I decided to use CUDA so I could focus on implementing parallel algorithms and improving my understanding of GPU architecture. However, I still wanted to use Rust for most of the non-GPU pieces. Working with Rust has been so much nicer than working with C++, and I'd like to continue honing my familiarity with Rust. I found this repo with an example of writing Rust code that links a CUDA kernel. I had to make a few tweaks to get it running on my system. Namely, I had to add .ccbin(false) to the cc::Build instance (this omits the -ccbin flag to nvcc, which lets nvcc select the best available compiler), and remove the -gencode flag (this tells nvcc to determine the architecture and code automatically - for my purposes I don't care too much about compatibility with anything other than the hardware on this machine).
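For reference, the build setup ends up looking roughly like the sketch below; the kernel file name and library name are placeholders of mine, not the actual paths from that repo.

// build.rs - a rough sketch of the build setup described above (requires the
// cc crate as a build dependency); file and library names are placeholders.
fn main() {
    cc::Build::new()
        .cuda(true)
        // Omit the -ccbin flag so nvcc picks the best available host compiler
        .ccbin(false)
        // No -gencode flag: let nvcc target the architecture of the local GPU
        .file("kernels/filter.cu")
        .compile("pcap_filter");
    // Depending on the setup, the CUDA runtime may also need to be linked:
    // println!("cargo:rustc-link-lib=cudart");
    println!("cargo:rerun-if-changed=kernels/filter.cu");
}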
Parsing PCAPs
I found this link that explained the PCAP file format. This was super useful in helping me understand what the challenges of parsing PCAPs on the GPU would be. While running filters or extracting data from captured packets might be embarrassingly parallel, actually finding the packets in the file is far from trivial. This is because there's no global index of packet offsets in the file. The only way to determine where a packet starts and stops is by knowing where the previous packet begins, then reading the cap length field to know how many bytes were captured for that packet. This means the only reliable way to find all packet offsets is to start from the header and advance through every packet, building an index of each packet's offset. Maybe it's possible to do this in parallel with some constraints like a maximum packet length, but I just don't see a better way. It would be ideal for tcpdump itself to emit this index while it's capturing the packets, but that's out of scope for now. Building this index is pretty fast - even for very large captures that are several gigabytes in size.
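As a rough sketch of what that index-building pass boils down to (simplified to assume a classic little-endian PCAP, and not the exact code from my implementation):

// Build a host-side index of packet offsets by walking the capture once.
// Assumes a classic pcap file: a 24-byte global header, then a 16-byte
// per-packet record header with the captured length at bytes 8..12.
fn build_packet_index(pcap: &[u8]) -> Vec<u64> {
    let mut offsets = Vec::new();
    let mut pos = 24usize; // skip the global file header
    while pos + 16 <= pcap.len() {
        offsets.push(pos as u64);
        // "cap length": how many bytes of this packet were actually captured
        let caplen = u32::from_le_bytes(pcap[pos + 8..pos + 12].try_into().unwrap());
        // the captured bytes follow the 16-byte record header
        pos += 16 + caplen as usize;
    }
    offsets
}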
Once I know the offsets, each thread can read a specific packet and evaluate the filter. For now, the only supported operation is filtering packets by source IP. On the GPU there was an additional complication: 32-bit reads must be aligned to a 4-byte boundary, but the source IP is not guaranteed to be at an aligned offset in a PCAP. To solve this, I wrote the following function to allow reading a u32 from an arbitrary offset:
__device__ uint32_t access_u32(char *d_pcap, uint64_t offset) {
  // only aligned accesses are allowed, so we need to align offset to a 32b
  // boundary
  auto rem = offset % 4;
  auto start = offset - rem;
  // Read the ip address with the correct endianness
  auto first = bswap(*(uint32_t *)(d_pcap + start));
  if (rem == 0) {
    return first;
  }
  auto last = bswap(*(uint32_t *)(d_pcap + start + 4));
  // get the last `rem` bytes from `first` and the first `4 - rem` bytes from
  // `last`
  first <<= 8 * rem;
  last >>= 8 * (4 - rem);
  return first | last;
}
Here, bswap is my implementation of __builtin_bswap32, because CUDA doesn't allow functions that aren't marked as __device__ to be used from a device function. This is a lot more restrictive/annoying than rust-gpu, where any function could be called from the kernel and it would simply be compiled for the GPU target.
The full code can be found here.
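To give a sense of what the Rust side of the FFI boundary can look like, here is a hypothetical declaration and safe wrapper; the function name and signature are illustrative only, not the actual interface in my code:

// Hypothetical FFI surface for the kernel driver; the entry point name and
// signature are illustrative, not the real interface from the repository.
use std::os::raw::{c_char, c_ulonglong};

extern "C" {
    // Implemented in CUDA/C++ and linked in by the build script; returns the
    // number of packets whose source IP matched.
    fn filter_by_src_ip(
        pcap: *const c_char,
        pcap_len: c_ulonglong,
        offsets: *const c_ulonglong,
        num_packets: c_ulonglong,
        src_ip: u32,
    ) -> c_ulonglong;
}

fn count_matches(pcap: &[u8], offsets: &[u64], src_ip: u32) -> u64 {
    unsafe {
        filter_by_src_ip(
            pcap.as_ptr() as *const c_char,
            pcap.len() as c_ulonglong,
            offsets.as_ptr() as *const c_ulonglong,
            offsets.len() as c_ulonglong,
            src_ip,
        )
    }
}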
Comparison with a multi-threaded implementation
To establish a baseline, I also implemented a threaded version on the CPU. To benchmark, I generated a relatively small PCAP that's mostly just a trace of me watching a youtube video (this one to be precise). Comparing the performance was… disappointing. Currently the GPU version takes 1s while the multi-threaded CPU version takes 10ms.
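For reference, the threaded baseline boils down to something like the sketch below. The scoped-thread structure is representative, but the hard-coded Ethernet + IPv4 offsets (source IP at byte 16 + 14 + 12 of each record) are simplifying assumptions of this sketch rather than the exact code I benchmarked.

// Count packets matching a source IP by splitting the packet index across
// worker threads. Assumes Ethernet + IPv4 with no VLAN tags.
fn count_matches_threaded(pcap: &[u8], offsets: &[u64], src_ip: u32, threads: usize) -> u64 {
    std::thread::scope(|s| {
        let chunk = offsets.len().div_ceil(threads).max(1);
        let handles: Vec<_> = offsets
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    part.iter()
                        .filter(|&&off| {
                            // 16-byte record header + 14-byte Ethernet header
                            // + 12 bytes into the IPv4 header = source IP
                            let ip_at = off as usize + 16 + 14 + 12;
                            pcap.get(ip_at..ip_at + 4)
                                .map(|b| u32::from_be_bytes(b.try_into().unwrap()) == src_ip)
                                .unwrap_or(false)
                        })
                        .count() as u64
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}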
To see if this was just a matter of IO overheads dwarfing any speedup in compute, I tried again with a larger PCAP: a capture of some traffic over a docker network running a graph database, which was a bit over 9GB. This didn't work with my code as-is because the memory allocations on the device failed. I was able to get things to at least run by telling the kernel to only consider the first 20 million packets (one fourth of the total). This ran, and showed the GPU version taking around 600ms while the threaded version took 500ms to process all packets.
At this point, I added a few timers to the kernel and measured various phases of the program to get the following report:
Alloc time: 55ms
Copy time: 414ms
Compute time: 256us
Copy to host/free time: 40ms
Finalization time: 62ms
elapsed = 577.202005ms
The elapsed measurement is done from the driving code in Rust, while the rest of the timers are in my CUDA implementation. There are a few milliseconds unaccounted for. Some of that is probably rounding error, since my C++/CUDA timers only report whole milliseconds/microseconds. The rest probably comes from a sum operation on the CPU that takes the results of the filtering and turns them into a single count (this could be done on the GPU), plus possibly some overhead from converting buffers to/from Rust slices. What's most interesting here is the compute time itself - only 256us!!!
This gives us a sense of what the numbers would look like if I implemented some kind of chunking and processed one piece of the PCAP per iteration. Since the run above covered one fourth of the file, let's assume it would take four iterations. If we just ran the compute portion itself four times, we'd still only be at around 1ms, and even if copying the results back to the host took 40ms per iteration, that would be 161ms total, still faster than my threaded implementation. However, the real killer is the cost of copying the PCAP into GPU memory - we'd only need to do the allocation once, but the copy itself would need to be repeated for each chunk, taking at least 1.6s (four copies at ~414ms each) just to get the data onto the GPU.
Thoughts/Conclusions
- CUDA isn't as painful to use as I'd expected
- Linking C/C++/CUDA and Rust is relatively straightforward
- rust-gpu's biggest strength IMO was that it was easy to share code between the CPU/GPU, but also to easily run the GPU code on the CPU
- Learning about shaders before trying CUDA was probably the best way to learn the core concepts
Using CUDA and linking to Rust has been a better experience than using wgpu and rust-gpu so far. I still believe in rust-gpu's vision, but it's definitely not ready for production use. One really cool library I found was cuco. Reading through some of the container implementations there was pretty fun, and it's always good to know what tools are already available to use.
At the end of the day, filtering PCAPs just isn't compute heavy enough to be a good workload for the GPU. Revisiting my original inspiration for this project, Compilation on the GPU?: a feasibility study, it's worth noting that the authors specifically omit IO time from their measurements, stating that their goal is only to measure the feasibility of the parallel algorithms. I do wish they had included a section showing the IO speeds they were dealing with, though, to better inform where work needs to be done to take full advantage of these algorithms.
My other key takeaway is that, as a file format, PCAP doesn't make parallel processing particularly easy. The index I built to enable parallel processing was pretty simple and only adds a bit of overhead - it seems to me that not generating this index during packet collection is a missing feature in tools like tcpdump. As far as I can tell, newer formats like PCAP-ng don't provide anything equivalent within the format itself either. Maybe someday I'll try to patch tcpdump to generate this index file along with the output as a proof of concept.
I think this will be my last post in this series - it was a fun journey, and I learned some new skills! The conclusion was pretty predictable, but this was a good motivator to try out new approaches to GPU programming, learn more Rust, and understand the PCAP file format better.