Parsing files on the GPU
GPUs are good at massively parallel operations, and parsing is often parallelizable. Can parsers be written to run on the GPU?
Background/Motivation
Recently I came across the paper Compilation on the GPU?: a feasibility study, which made me think about the ways GPUs could be used to speed up complex parsers. In particular, some of my work involves processing large packet capture files, which could be a fun format to try parsing on the GPU. I’m not confident that parsing pcaps on the GPU would actually be faster than a multi-threaded CPU implementation, so part of the goal of this project is to find out. From what I understand, pcap (not pcap-ng) is a relatively simple format, so while implementing the parser might not be too difficult, the parsing itself might not be computationally expensive enough to outweigh the cost of transferring data between the CPU and GPU. In this series of posts, I hope to document my journey towards building this parser.
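To make "relatively simple" concrete, here is a rough sketch of the classic pcap layout: a fixed 24-byte global header followed by length-prefixed records. The struct names are my own shorthand for illustration, not taken from any particular crate.

// Sketch of the classic pcap (not pcap-ng) on-disk layout.
#[repr(C)]
struct PcapGlobalHeader {
    magic_number: u32,  // 0xa1b2c3d4 (or byte-swapped, indicating endianness)
    version_major: u16, // 2
    version_minor: u16, // 4
    thiszone: i32,      // GMT offset
    sigfigs: u32,       // timestamp accuracy
    snaplen: u32,       // max captured length per packet
    network: u32,       // link-layer type
}

#[repr(C)]
struct PcapRecordHeader {
    ts_sec: u32,   // timestamp, seconds
    ts_usec: u32,  // timestamp, microseconds
    incl_len: u32, // bytes of packet data that follow this header
    orig_len: u32, // original packet length on the wire
}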
Running code on the GPU in 2024
Till now, most of my experience programming for the GPU has been via WebGL. However, WebGL isn’t the best interface for writing compute shaders, and with the standardization of WebGPU, it seems like the browser is also looking in other directions for general-purpose GPU (GPGPU) computing. This seems to be largely because OpenGL development has halted in favor of Vulkan/Metal and other native GPU libraries. On a personal level, I wanted to try building a project that could run natively instead of having to use the browser to submit work to the GPU. Additionally, I’ve been learning Rust, so I wanted to use Rust as the primary language for this project.
I did some initial research, which led me to nvParse, a repo with an implementation of a CSV parser on the GPU (via CUDA). I thought a good place to start would be to port this to Rust, to learn what options the Rust ecosystem has for GPU programming. As part of the port, the first step was to get a kernel running that could count the number of lines in a file.
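As a reference point for what that kernel has to do: counting lines is just counting newline bytes. A trivial CPU version (essentially what wc -l does) looks like this:

// Baseline for comparison: count newline bytes on the CPU.
fn count_lines(data: &[u8]) -> usize {
    data.iter().filter(|&&b| b == b'\n').count()
}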
In terms of Rust libraries for GPU computing, there’s wgpu, which provides the device abstractions but doesn’t provide any features for actually writing code targeting the GPU. In theory, one could also use CUDA directly, but that seemed less fun.
For writing code targeting the GPU I found two promising options: gpgpu and rust-gpu. gpgpu hides a lot of the boilerplate encountered when using wgpu, and additionally provides a layer for compiling and running wgsl (more on that below) on the GPU. rust-gpu is an effort to allow writing shaders in Rust. It provides a toolchain to compile Rust to SPIR-V, but the resulting SPIR-V binary still needs to be launched with wgpu or through gpgpu’s abstractions. Initially, gpgpu + wgsl seemed to be the best choice to get something up and running quickly.
What is WGSL?
WGSL seems to be the evolution of GLSL with GPGPU workloads in mind. The syntax is vaguely Rust-esque. While it was easy to get up and running with wgsl, there were a few warts. The first is that wgsl seems to be evolving rapidly, and it was mildly annoying to determine whether code examples online could be run via the gpgpu crate, or whether they used unsupported syntax features. The other annoying bit was that the only supported scalar datatypes are i32, u32, f32, and bool. Ultimately, I decided to try using rust-gpu instead, for u8 support and because the tooling is better (e.g. I can use rust-analyzer for language server features, the compiler gives better error messages - most of the time - and it’s easier to hack on/extend the compiler, either via custom preprocessing phases in build.rs or by modifying the compiler backend itself).
I was able to get some code that did what I wanted, but the lack of 8-bit integer support meant that I had to read 4 chars at a time (into a u32) and use some bit-shifting and modulo arithmetic to operate on the bytes I wanted to check. One thing that was nice about the development process was that changing the WGSL code didn’t require re-compiling, which kept iteration fast.
Here’s the code I produced:
struct Vector32 {
    data: array<u32>,
};

[[group(0), binding(0)]] var<storage, read> a: Vector32;
[[group(0), binding(1)]] var<storage, read> chunk_size: Vector32;
[[group(0), binding(2)]] var<storage, read> data_len: Vector32;
[[group(0), binding(3)]] var<storage, read> char: Vector32;
[[group(0), binding(4)]] var<storage, read_write> c: Vector32;

[[stage(compute), workgroup_size(1)]]
fn main([[builtin(global_invocation_id)]] global_id: vec3<u32>) {
    c.data[global_id.x] = 0u;
    var start: u32 = global_id.x * chunk_size.data[0];
    var nelems: u32 = min(chunk_size.data[0], data_len.data[0] - start);
    var rem: u32 = start % 4u;
    if (rem != 0u) {
        var src: u32 = a.data[start / 4u];
        // unpack a u32 into 4 x u8
        var c0: u32 = src & 255u;
        src = src / 256u;
        var c1: u32 = src & 255u;
        src = src / 256u;
        var c2: u32 = src & 255u;
        src = src / 256u;
        var c3: u32 = src & 255u;
        if (rem <= 0u && c0 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        if (rem <= 1u && c1 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        if (rem <= 2u && c2 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        if (rem <= 3u && c3 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        start = start + 4u - rem;
        nelems = nelems - (4u - rem);
    }
    for (var i: u32 = 0u; i < nelems; i = i + 4u) {
        var src: u32 = a.data[(start + i) / 4u];
        var c0: u32 = src & 255u;
        src = src / 256u;
        var c1: u32 = src & 255u;
        src = src / 256u;
        var c2: u32 = src & 255u;
        src = src / 256u;
        var c3: u32 = src & 255u;
        if ((i + 0u) < nelems && c0 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        if ((i + 1u) < nelems && c1 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        if ((i + 2u) < nelems && c2 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        if ((i + 3u) < nelems && c3 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
    }
}
What is rust-gpu and SPIR-V?
SPIR-V is an intermediate representation (IR) for GPU compilation targets, similar in spirit to LLVM IR. By having a common IR for languages to target, languages like WGSL or GLSL need not be the only frontends for GPU programming. rust-gpu aims to make Rust a first-class language for writing GPU shaders by providing a way to compile Rust to SPIR-V. The project claims that Rust is a good choice for writing GPU shaders due to its ability to benefit from crates with no-std capability (that is, libraries that don’t use Rust’s standard library, which can’t be supported on the GPU. I think this is because it requires dynamic memory allocation? This makes me wonder if some other language like zig, which could be configured to use an arena allocator globally, could be a better choice). The project involves many components, including a codegen backend which emits SPIR-V. The compiled code is then available via a compile-time environment variable to be inlined into your final binary (alternatively, it’s also available as a file in the build directory).
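Concretely, inlining the compiled shader tends to look something like the snippet below. Treat it as a sketch: it assumes a shader crate named shader built by a spirv-builder build script with full metadata printing enabled, which is what exports the artifact path under an environment variable named after that crate.

// Sketch: embed the SPIR-V emitted by the build script into the host binary.
// "shader.spv" is the env var spirv-builder sets for a shader crate named
// `shader` (an assumed name); wgpu can then consume these bytes as a module.
const SHADER_SPIRV: &[u8] = include_bytes!(env!("shader.spv"));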
Here’s roughly equivalent code to the wgsl code above (some notable changes are that some “storage” buffers have been changed to “uniform”s - this could have been done in the wgsl version as well):
#![cfg_attr(target_arch = "spirv", no_std)]

use glam::UVec3;
use spirv_std::{glam, spirv};

fn min(a: u32, b: u32) -> u32 {
    if a < b {
        a
    } else {
        b
    }
}

fn to_bytes(a: u32) -> [u8; 4] {
    let mut v = [0u8; 4];
    let mut a = a;
    v[0] = (a % 256) as u8;
    a /= 256;
    v[1] = (a % 256) as u8;
    a /= 256;
    v[2] = (a % 256) as u8;
    a /= 256;
    v[3] = (a % 256) as u8;
    v
}

// LocalSize/numthreads of (x = 64, y = 1, z = 1)
#[spirv(compute(threads(64)))]
pub fn main_cc(
    #[spirv(global_invocation_id)] id: UVec3,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] input: &mut [u32],
    #[spirv(uniform, descriptor_set = 0, binding = 1)] chunk_size: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 2)] data_len: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 3)] char: &u8,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 4)] count: &mut [u32],
) {
    let index = id.x as usize;
    count[index] = 0;
    let mut start: usize = index * (*chunk_size as usize);
    let mut nelems: usize = min(*chunk_size, *data_len - start as u32) as usize;
    let rem = start % 4;
    if rem != 0 {
        let src: u32 = input[start / 4];
        let c = to_bytes(src);
        for i in 0..4 {
            if rem <= i {
                if c[i] == *char {
                    count[index] += 1;
                }
            }
        }
        start = start + 4 - rem;
        nelems = nelems - (4 - rem);
    }
    let mut i = 0;
    while i < nelems {
        let src: u32 = input[(start + i) / 4];
        let c = to_bytes(src);
        for j in 0..4 {
            if (i + j) < nelems && c[j] == *char {
                count[index] += 1;
            }
        }
        i += 4;
    }
}
Interlude: Enabling Int8 Support in rust-gpu
While SPIR-V does not support 8-bit integers by default (likely due to platform compatibility concerns?), it has a way to enable 8-bit support via its OpCapability instruction, which defines the necessary types. I went down a bit of a rabbit hole trying to get this enabled (this probably could have been avoided if I had just read the rust-gpu crate docs more closely, but whatever). My initial attempt was to use inline assembly via the asm! macro, but that wouldn’t work, as rust-gpu still rejected any code that used the u8 type. I tried using global_asm!, but found that the SPIR-V codegen backend doesn’t yet support global inline assembly.
After poking around, I read the SpirvBuilder source code and realized that the correct way to do this is to register the capability with the builder, i.e. include a SpirvBuilder::capability(Capability::Int8) call in the build setup.
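Here’s a minimal build.rs sketch of what that registration looks like. Treat the details as assumptions: the shader crate path ("shader"), the target triple, and exactly where Capability is imported from can all differ between spirv-builder versions.

// build.rs of the host crate (sketch; API details vary by spirv-builder version).
use spirv_builder::{Capability, MetadataPrintout, SpirvBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    SpirvBuilder::new("shader", "spirv-unknown-vulkan1.1")
        .print_metadata(MetadataPrintout::Full) // exports the "<crate>.spv" env var
        .capability(Capability::Int8)           // emit OpCapability Int8
        .build()?;
    Ok(())
}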
Here’s the code with u8 support enabled:
#![cfg_attr(target_arch = "spirv", no_std)]
// #![deny(warnings)]

use glam::UVec3;
use spirv_std::{glam, spirv};

fn min(a: u32, b: u32) -> u32 {
    if a < b {
        a
    } else {
        b
    }
}

// Kept from the earlier version; no longer needed now that bytes can be read directly.
fn to_bytes(a: u32) -> [u8; 4] {
    let mut v = [0u8; 4];
    let mut a = a;
    v[0] = (a % 256) as u8;
    a /= 256;
    v[1] = (a % 256) as u8;
    a /= 256;
    v[2] = (a % 256) as u8;
    a /= 256;
    v[3] = (a % 256) as u8;
    v
}

// LocalSize/numthreads of (x = 64, y = 1, z = 1)
#[spirv(compute(threads(64)))]
pub fn main_cc(
    #[spirv(global_invocation_id)] id: UVec3,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] input: &mut [u8],
    #[spirv(uniform, descriptor_set = 0, binding = 1)] chunk_size: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 2)] data_len: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 3)] char: &u8,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 4)] count: &mut [u32],
) {
    let index = id.x as usize;
    count[index] = 0;
    let start: usize = index * (*chunk_size as usize);
    let nelems: usize = min(*chunk_size, *data_len - start as u32) as usize;
    for i in start..(start + nelems) {
        if input[i] == *char {
            count[index] += 1;
        }
    }
}
Note that I can’t directly cast a u32 to [u8; 4] via to_ne_bytes or mem::transmute, because those produce a BitCast instruction in Rust’s IR, which currently doesn’t have a corresponding instruction in the SPIR-V backend. This could change over time though.
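For illustration, this is the direct conversion I’d normally reach for in ordinary Rust; per the above it is rejected by rustc_codegen_spirv today, which is why the manual divide-and-modulo to_bytes exists:

// Fine on the CPU, but currently rejected when compiling to SPIR-V because it
// lowers to a BitCast the backend doesn't implement.
fn to_bytes_direct(a: u32) -> [u8; 4] {
    a.to_ne_bytes()
}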
While rust-gpu suited my needs better than wgsl, I also began to hit the limits of what gpgpu could offer. Its abstractions were welcome in terms of buffer management, but it didn’t provide a good way to query device limits directly, and it relied on a pretty old version of wgpu.
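For contrast, with wgpu in hand (see the next section) querying limits is direct. A small sketch, assuming the usual instance/adapter setup has already produced an adapter:

// Sketch: ask the adapter what the hardware/driver actually supports,
// e.g. to pick a chunk size that fits in one storage buffer binding.
fn print_limits(adapter: &wgpu::Adapter) {
    let limits = adapter.limits();
    println!(
        "max storage buffer binding size: {} bytes",
        limits.max_storage_buffer_binding_size
    );
}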
Using wgpu directly
I decided that it was time to use wgpu directly. I started by reading the code in rust-gpu-wgpu-compute-minimal. However, this repo uses pretty old versions of all its dependencies. It also references dependencies by their git repos, but some of those repos have moved active development elsewhere (e.g. rust-gpu itself has moved to a repo under the rust-gpu org instead of the org of the company/studio that started the project). I submitted a PR that updates the repo to use the latest versions of its dependencies here.
Using wgpu was a bit more painful than gpgpu as it required me to better understand the concept of buffer “mapping”. Essentially, data cannot be directly copied from the CPU to buffers that are used by the shader. Instead, a buffer on the GPU needs to be “mapped” - made available to the CPU for writing. Mapped buffers cannot be referenced by a shader and must be unmapped before compute can begin. This seems to be because mapped memory is much slower to access on the GPU.
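Here’s a small sketch of that upload pattern with wgpu. The buffer label and surrounding setup are assumptions, and exact API details vary a bit between wgpu versions:

// Sketch: create the input buffer already mapped, write into it from the CPU,
// then unmap it so the compute shader is allowed to bind and read it.
fn upload_input(device: &wgpu::Device, input_bytes: &[u8]) -> wgpu::Buffer {
    // Mapped-at-creation buffers must have a size that is a multiple of 4.
    let padded_size = (input_bytes.len() as u64 + 3) & !3;
    let buffer = device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("input"),
        size: padded_size,
        usage: wgpu::BufferUsages::STORAGE,
        mapped_at_creation: true, // CPU-visible right now, unusable by shaders
    });
    {
        // While mapped, the buffer's memory can be written directly from the CPU.
        let mut view = buffer.slice(..).get_mapped_range_mut();
        view[..input_bytes.len()].copy_from_slice(input_bytes);
    } // the mapped view must be dropped before unmapping
    buffer.unmap(); // hand the buffer back to the GPU before dispatching compute
    buffer
}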
I was able to produce a working version; the code can be found here.
Performance
Is my implementation fast? To benchmark, I took the lineitem_small.tbl file included in the nvParse repo, and created larger versions of the file by concatenating multiple copies of the file with itself:
aneesh@orion:~/nvparse_rs$ wc -l lineitem_*.tbl
10000000 lineitem_huge.tbl
  100000 lineitem_large.tbl
    1000 lineitem_small.tbl
10101000 total
Then, I ran time wc -l $f and ./target/release/nvparse_rs $f -n 1024 for each file:
aneesh@orion:~/nvparse_rs$ for f in *.tbl
figlet $f
time wc -l $f
echo "GPU"
./target/release/nvparse_rs $f -n 1024
echo
end
[figlet banner: lineitem_huge.tbl]
10000000 lineitem_huge.tbl
________________________________________________________
Executed in  224.67 millis    fish           external
   usr time   27.91 millis    0.00 micros   27.91 millis
   sys time  196.88 millis  534.00 micros  196.34 millis
GPU
Initialization time: 1.236729058s
Compute time: 1.673913908s
10000000
[figlet banner: lineitem_large.tbl]
100000 lineitem_large.tbl
________________________________________________________
Executed in    5.94 millis    fish           external
   usr time    1.11 millis    0.00 micros    1.11 millis
   sys time    4.91 millis  467.00 micros    4.45 millis
GPU
Initialization time: 148.243567ms
Compute time: 55.777128ms
100000
[figlet banner: lineitem_small.tbl]
1000 lineitem_small.tbl
________________________________________________________
Executed in    2.62 millis    fish           external
   usr time    0.41 millis  409.00 micros    0.00 millis
   sys time    2.33 millis  216.00 micros    2.11 millis
GPU
Initialization time: 170.137512ms
Compute time: 39.899839ms
1000
Oof, even on the largest file, my GPU implementation is far slower than wc! To better understand this, I ran cargo flamegraph with the largest file. This is the output svg (it should be interactive below!):
Almost \(70\%\) of the time is spent in wgpu::api::queue::Queue::write_buffer, which suggests that the cost of transferring data is the biggest bottleneck. To check this hypothesis, I commented out all writes to the GPU buffers and timed execution.
aneesh@orion:~/nvparse_rs$ cargo r --release -- lineitem_huge.tbl -n 1024
Compiling nvparse_rs v0.1.0 (/home/aneesh/nvparse_rs)
Finished `release` profile [optimized + debuginfo] target(s) in 15.79s
Running `target/release/nvparse_rs lineitem_huge.tbl -n 1024`
Initialization time: 1.20581354s
Compute time: 52.181916ms
0
This confirms that the bottleneck is writing data to the GPU, and suggests that the actual execution on the GPU should be faster than execution on the CPU once the transfer cost is addressed.
My next steps will be to try to mitigate this by overlapping compute and data transfer, and by introducing more compute operations on the GPU to take better advantage of GPU parallelism.
Thoughts/Conclusions
- Compiling to SPIR-V is a better option for compute shaders than GLSL.
- WGSL is an interesting development, but isn’t the most ergonomic option for compute shaders.
- rust-gpu is very cool: Rust is a far better option for expressing computation than WGSL, and the ecosystem of libraries certainly helps.
- I’m not convinced that Rust is the best language for this task though. I’d like to try other languages, like zig, in the future.
- Data transfer to the GPU remains a critical bottleneck. While crates like gpgpu help avoid boilerplate, I’d like abstractions that also set up efficient input and output pipelines.