Parsing files on the GPU
GPUs are good at massively parallel operations, and parsing is often parallelizable. Can parsers be written to run on the GPU?
Background/Motivation
Recently I came across the paper Compilation on the GPU?: a feasibility study, which made me think about the ways GPUs could be used to speed up complex parsers. In particular, some of my work involves processing large packet capture files, which could be a fun format to try parsing on the GPU. I’m not confident that parsing pcaps on the GPU would actually be faster than a multi-threaded CPU implementation, so part of the goal of this project is to find out. From what I understand, pcap (not pcap-ng) is a relatively simple format, so while implementing the parser might not be too difficult, the parsing itself might not be computationally expensive enough to outweigh the cost of transferring data between the CPU and GPU. In this series of posts, I hope to document my journey towards building this parser.
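To make "relatively simple" concrete, here is a rough sketch of the classic pcap layout: a fixed 24-byte global header followed by length-prefixed records. The struct names are my own shorthand for illustration, not taken from any particular crate.

// Sketch of the classic pcap (not pcap-ng) on-disk layout.
#[repr(C)]
struct PcapGlobalHeader {
    magic_number: u32,  // 0xa1b2c3d4 (or byte-swapped, indicating endianness)
    version_major: u16, // 2
    version_minor: u16, // 4
    thiszone: i32,      // GMT offset
    sigfigs: u32,       // timestamp accuracy
    snaplen: u32,       // max captured length per packet
    network: u32,       // link-layer type
}

#[repr(C)]
struct PcapRecordHeader {
    ts_sec: u32,   // timestamp, seconds
    ts_usec: u32,  // timestamp, microseconds
    incl_len: u32, // bytes of packet data that follow this header
    orig_len: u32, // original packet length on the wire
}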
Running code on the GPU in 2024
Till now, most of my experience programming for the GPU has been via WebGL. However, WebGL isn’t the best interface for writing compute shaders, and with the standardization of WebGPU, it seems like the browser is also looking in other directions for general-purpose GPU (GPGPU) computing. This seems to be largely because OpenGL development has halted in favor of Vulkan/Metal and other native GPU libraries. On a personal level, I wanted to try building a project that could run natively instead of having to use the browser to submit work to the GPU. Additionally, I’ve been learning Rust, so I wanted to use Rust as the primary language for this project.
I did some initial research, which led me to nvParse, a repo with an implementation of a CSV parser on the GPU (via CUDA). I thought a good place to start would be to port this to Rust, to learn what options the Rust ecosystem has for GPU programming. As part of the port, the first step was to get a kernel running that could count the number of lines in a file.
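As a reference point for what that kernel has to do: counting lines is just counting newline bytes. A trivial CPU version (essentially what wc -l does) looks like this:

// Baseline for comparison: count newline bytes on the CPU.
fn count_lines(data: &[u8]) -> usize {
    data.iter().filter(|&&b| b == b'\n').count()
}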
In terms of Rust libraries for GPU computing, there’s wgpu, which provides the device abstractions but doesn’t provide any features for actually writing code targeting the GPU. In theory, one could also use CUDA directly, but that seemed less fun.
For writing code targeting the GPU I found two promising options: gpgpu and rust-gpu. gpgpu hides a lot of the boilerplate encountered when using wgpu, and additionally provides a layer for compiling and running wgsl (more on that below) on the GPU. rust-gpu is an effort to allow writing shaders in Rust. It provides a toolchain to compile Rust to SPIR-V, but the resulting SPIR-V binary still needs to be launched with wgpu or through gpgpu’s abstractions. Initially, gpgpu + wgsl seemed to be the best choice to get something up and running quickly.
What is WGSL?
WGSL seems to be the evolution of GLSL with GPGPU workloads in mind. The syntax is vaguely Rust-esque. While it was easy to get up and running with wgsl, there were a few warts. The first is that wgsl seems to be evolving rapidly, and it was mildly annoying to determine whether code examples online could be run via the gpgpu crate, or whether they used unsupported syntax features. The other annoying bit was that the only supported scalar datatypes are i32, u32, f32, and bool. Ultimately, I decided to try using rust-gpu instead, for u8 support and because the tooling is better (e.g. I can use rust-analyzer for language server features, the compiler gives better error messages - most of the time - and it’s easier to hack on/extend the compiler, either via custom preprocessing phases in build.rs or by modifying the compiler backend itself).
I was able to get some code that did what I wanted, but the lack of 8-bit integer support meant that I had to read 4 chars at a time (into a u32) and use some bit-shifting and modulo arithmetic to operate on the bytes I wanted to check. One thing that was nice about the development process was that changing the WGSL code didn’t require re-compiling, which kept iteration fast.
Here’s the code I produced:
struct Vector32 {
    data: array<u32>,
};

[[group(0), binding(0)]] var<storage, read> a: Vector32;
[[group(0), binding(1)]] var<storage, read> chunk_size: Vector32;
[[group(0), binding(2)]] var<storage, read> data_len: Vector32;
[[group(0), binding(3)]] var<storage, read> char: Vector32;
[[group(0), binding(4)]] var<storage, read_write> c: Vector32;

[[stage(compute), workgroup_size(1)]]
fn main([[builtin(global_invocation_id)]] global_id: vec3<u32>) {
    c.data[global_id.x] = 0u;
    var start: u32 = global_id.x * chunk_size.data[0];
    var nelems: u32 = min(chunk_size.data[0], data_len.data[0] - start);
    var rem: u32 = start % 4u;
    if (rem != 0u) {
        var src: u32 = a.data[start / 4u];
        // unpack a u32 into 4 x u8
        var c0: u32 = src & 255u;
        src = src / 256u;
        var c1: u32 = src & 255u;
        src = src / 256u;
        var c2: u32 = src & 255u;
        src = src / 256u;
        var c3: u32 = src & 255u;
        if (rem <= 0u && c0 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        if (rem <= 1u && c1 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        if (rem <= 2u && c2 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        if (rem <= 3u && c3 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        start = start + 4u - rem;
        nelems = nelems - (4u - rem);
    }
    for (var i: u32 = 0u; i < nelems; i = i + 4u) {
        var src: u32 = a.data[(start + i) / 4u];
        var c0: u32 = src & 255u;
        src = src / 256u;
        var c1: u32 = src & 255u;
        src = src / 256u;
        var c2: u32 = src & 255u;
        src = src / 256u;
        var c3: u32 = src & 255u;
        if ((i + 0u) < nelems && c0 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        if ((i + 1u) < nelems && c1 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        if ((i + 2u) < nelems && c2 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
        if ((i + 3u) < nelems && c3 == char.data[0]) {
            c.data[global_id.x] = c.data[global_id.x] + 1u;
        }
    }
}
What is rust-gpu and SPIR-V?
SPIR-V is an intermediate representation (IR) for GPU compilation targets, similar in spirit to LLVM IR. By having a common IR for languages to target, languages like WGSL or GLSL need not be the only frontends for GPU programming. rust-gpu aims to make Rust a first-class language for writing GPU shaders by providing a way to compile Rust to SPIR-V. The project claims that Rust is a good choice for writing GPU shaders due to its ability to benefit from crates with no-std capability (that is, libraries that don’t use Rust’s standard library, which can’t be supported on the GPU. I think this is because it requires dynamic memory allocation? This makes me wonder if some other language like zig, which could be configured to use an arena allocator globally, could be a better choice). The project involves many components, including a codegen backend which emits SPIR-V. The compiled code is then available via a compile-time environment variable to be inlined into your final binary (alternatively, it’s also available as a file in the build directory).
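Concretely, inlining the compiled shader tends to look something like the snippet below. Treat it as a sketch: it assumes a shader crate named shader built by a spirv-builder build script with full metadata printing enabled, which is what exports the artifact path under an environment variable named after that crate.

// Sketch: embed the SPIR-V emitted by the build script into the host binary.
// "shader.spv" is the env var spirv-builder sets for a shader crate named
// `shader` (an assumed name); wgpu can then consume these bytes as a module.
const SHADER_SPIRV: &[u8] = include_bytes!(env!("shader.spv"));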
Here’s roughly equivalent code to the wgsl code above (some notable changes are that some “storage” buffers have been changed to “uniform”s - this could have been done in the wgsl version as well):
#![cfg_attr(target_arch = "spirv", no_std)]

use glam::UVec3;
use spirv_std::{glam, spirv};

fn min(a: u32, b: u32) -> u32 {
    if a < b {
        a
    } else {
        b
    }
}

fn to_bytes(a: u32) -> [u8; 4] {
    let mut v = [0u8; 4];
    let mut a = a;
    v[0] = (a % 256) as u8;
    a /= 256;
    v[1] = (a % 256) as u8;
    a /= 256;
    v[2] = (a % 256) as u8;
    a /= 256;
    v[3] = (a % 256) as u8;
    v
}

// LocalSize/numthreads of (x = 64, y = 1, z = 1)
#[spirv(compute(threads(64)))]
pub fn main_cc(
    #[spirv(global_invocation_id)] id: UVec3,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] input: &mut [u32],
    #[spirv(uniform, descriptor_set = 0, binding = 1)] chunk_size: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 2)] data_len: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 3)] char: &u8,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 4)] count: &mut [u32],
) {
    let index = id.x as usize;
    count[index] = 0;
    let mut start: usize = index * (*chunk_size as usize);
    let mut nelems: usize = min(*chunk_size, *data_len - start as u32) as usize;
    let rem = start % 4;
    if rem != 0 {
        let src: u32 = input[start / 4];
        let c = to_bytes(src);
        for i in 0..4 {
            if rem <= i {
                if c[i] == *char {
                    count[index] += 1;
                }
            }
        }
        start = start + 4 - rem;
        nelems = nelems - (4 - rem);
    }
    let mut i = 0;
    while i < nelems {
        let src: u32 = input[(start + i) / 4];
        let c = to_bytes(src);
        for j in 0..4 {
            if (i + j) < nelems && c[j] == *char {
                count[index] += 1;
            }
        }
        i += 4;
    }
}
Interlude: Enabling Int8 Support in rust-gpu
While SPIR-V does not support 8-bit integers by default (likely due to platform compatibility concerns?), it has a way to enable 8-bit support via its OpCapability instruction, which defines the necessary types. I went down a bit of a rabbit hole trying to get this enabled (this probably could have been avoided if I had just read the rust-gpu crate docs more closely, but whatever). My initial attempt was to use inline assembly via the asm! macro, but that wouldn’t work, as rust-gpu still rejected any code that used the u8 type. I tried using global_asm!, but found that the SPIR-V codegen backend doesn’t yet support global inline assembly.
After poking around, I read the SpirvBuilder source code and realized that the correct way to do this is to register the capability with the builder, i.e. include a SpirvBuilder::capability(Capability::Int8) call in the build setup.
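Here’s a minimal build.rs sketch of what that registration looks like. Treat the details as assumptions: the shader crate path ("shader"), the target triple, and exactly where Capability is imported from can all differ between spirv-builder versions.

// build.rs of the host crate (sketch; API details vary by spirv-builder version).
use spirv_builder::{Capability, MetadataPrintout, SpirvBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    SpirvBuilder::new("shader", "spirv-unknown-vulkan1.1")
        .print_metadata(MetadataPrintout::Full) // exports the "<crate>.spv" env var
        .capability(Capability::Int8)           // emit OpCapability Int8
        .build()?;
    Ok(())
}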
Here’s the code with u8 support enabled:
#![cfg_attr(target_arch = "spirv", no_std)]
// #![deny(warnings)]

use glam::UVec3;
use spirv_std::{glam, spirv};

fn min(a: u32, b: u32) -> u32 {
    if a < b {
        a
    } else {
        b
    }
}

// Kept from the earlier version; no longer needed now that bytes can be read directly.
fn to_bytes(a: u32) -> [u8; 4] {
    let mut v = [0u8; 4];
    let mut a = a;
    v[0] = (a % 256) as u8;
    a /= 256;
    v[1] = (a % 256) as u8;
    a /= 256;
    v[2] = (a % 256) as u8;
    a /= 256;
    v[3] = (a % 256) as u8;
    v
}

// LocalSize/numthreads of (x = 64, y = 1, z = 1)
#[spirv(compute(threads(64)))]
pub fn main_cc(
    #[spirv(global_invocation_id)] id: UVec3,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] input: &mut [u8],
    #[spirv(uniform, descriptor_set = 0, binding = 1)] chunk_size: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 2)] data_len: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 3)] char: &u8,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 4)] count: &mut [u32],
) {
    let index = id.x as usize;
    count[index] = 0;
    let start: usize = index * (*chunk_size as usize);
    let nelems: usize = min(*chunk_size, *data_len - start as u32) as usize;
    for i in start..(start + nelems) {
        if input[i] == *char {
            count[index] += 1;
        }
    }
}
Note that I can’t directly cast a u32 to [u8; 4] via to_ne_bytes or mem::transmute, because those produce a BitCast instruction in Rust’s IR, which currently doesn’t have a corresponding instruction in the SPIR-V backend. This could change over time though.
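For illustration, this is the direct conversion I’d normally reach for in ordinary Rust; per the above it is rejected by rustc_codegen_spirv today, which is why the manual divide-and-modulo to_bytes exists:

// Fine on the CPU, but currently rejected when compiling to SPIR-V because it
// lowers to a BitCast the backend doesn't implement.
fn to_bytes_direct(a: u32) -> [u8; 4] {
    a.to_ne_bytes()
}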
While rust-gpu suited my needs better than wgsl, I also began to hit the limits of what gpgpu could offer. Its abstractions were welcome in terms of buffer management, but it didn’t provide a good way to query device limits directly, and it relied on a pretty old version of wgpu.
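For contrast, with wgpu in hand (see the next section) querying limits is direct. A small sketch, assuming the usual instance/adapter setup has already produced an adapter:

// Sketch: ask the adapter what the hardware/driver actually supports,
// e.g. to pick a chunk size that fits in one storage buffer binding.
fn print_limits(adapter: &wgpu::Adapter) {
    let limits = adapter.limits();
    println!(
        "max storage buffer binding size: {} bytes",
        limits.max_storage_buffer_binding_size
    );
}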
Using wgpu directly
I decided that it was time to use wgpu directly. I started by reading the code in rust-gpu-wgpu-compute-minimal. However, this repo uses pretty old versions of all its dependencies. It also references dependencies by their git repos, but some of those repos have moved active development elsewhere (e.g. rust-gpu itself has moved to a repo under the rust-gpu org instead of the org of the company/studio that started the project). I submitted a PR that updates the repo to use the latest versions of its dependencies here.
Using wgpu was a bit more painful than gpgpu as it required me to better understand the concept of buffer “mapping”. Essentially, data cannot be directly copied from the CPU to buffers that are used by the shader. Instead, a buffer on the GPU needs to be “mapped” - made available to the CPU for writing. Mapped buffers cannot be referenced by a shader and must be unmapped before compute can begin. This seems to be because mapped memory is much slower to access on the GPU.
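Here’s a small sketch of that upload pattern with wgpu. The buffer label and surrounding setup are assumptions, and exact API details vary a bit between wgpu versions:

// Sketch: create the input buffer already mapped, write into it from the CPU,
// then unmap it so the compute shader is allowed to bind and read it.
fn upload_input(device: &wgpu::Device, input_bytes: &[u8]) -> wgpu::Buffer {
    // Mapped-at-creation buffers must have a size that is a multiple of 4.
    let padded_size = (input_bytes.len() as u64 + 3) & !3;
    let buffer = device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("input"),
        size: padded_size,
        usage: wgpu::BufferUsages::STORAGE,
        mapped_at_creation: true, // CPU-visible right now, unusable by shaders
    });
    {
        // While mapped, the buffer's memory can be written directly from the CPU.
        let mut view = buffer.slice(..).get_mapped_range_mut();
        view[..input_bytes.len()].copy_from_slice(input_bytes);
    } // the mapped view must be dropped before unmapping
    buffer.unmap(); // hand the buffer back to the GPU before dispatching compute
    buffer
}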
I was able to produce a working version; the code can be found here.
Performance
Is my implementation fast? To benchmark, I took the lineitem_small.tbl file included in the nvParse repo, and created larger versions of the file by concatenating multiple copies of the file with itself:
aneesh@orion:~/nvparse_rs$ wc -l lineitem_*.tbl
10000000 lineitem_huge.tbl
  100000 lineitem_large.tbl
    1000 lineitem_small.tbl
10101000 total
Then, I ran time wc -l $f and ./target/release/nvparse_rs $f -n 1024 for each file:
aneesh@orion:~/nvparse_rs$ for f in *.tbl
figlet $f
time wc -l $f
echo "GPU"
./target/release/nvparse_rs $f -n 1024
echo
end
[figlet banner: lineitem_huge.tbl]
10000000 lineitem_huge.tbl
________________________________________________________
Executed in  224.67 millis    fish           external
   usr time   27.91 millis    0.00 micros   27.91 millis
   sys time  196.88 millis  534.00 micros  196.34 millis
GPU
Initialization time: 1.236729058s
Compute time: 1.673913908s
10000000
[figlet banner: lineitem_large.tbl]
100000 lineitem_large.tbl
________________________________________________________
Executed in    5.94 millis    fish           external
   usr time    1.11 millis    0.00 micros    1.11 millis
   sys time    4.91 millis  467.00 micros    4.45 millis
GPU
Initialization time: 148.243567ms
Compute time: 55.777128ms
100000
[figlet banner: lineitem_small.tbl]
1000 lineitem_small.tbl
________________________________________________________
Executed in    2.62 millis    fish           external
   usr time    0.41 millis  409.00 micros    0.00 millis
   sys time    2.33 millis  216.00 micros    2.11 millis
GPU
Initialization time: 170.137512ms
Compute time: 39.899839ms
1000
Oof, even on the largest file, my GPU implementation is far slower than wc! To better understand this, I ran cargo flamegraph with the largest file. This is the output svg (it should be interactive below!):
Almost \(70\%\) of the time is spent in wgpu::api::queue::Queue::write_buffer, which suggests that the cost of transferring data is the biggest bottleneck. To check this hypothesis, I commented out all writes to the GPU buffers and timed execution.
aneesh@orion:~/nvparse_rs$ cargo r --release -- lineitem_huge.tbl -n 1024
Compiling nvparse_rs v0.1.0 (/home/aneesh/nvparse_rs)
Finished `release` profile [optimized + debuginfo] target(s) in 15.79s
Running `target/release/nvparse_rs lineitem_huge.tbl -n 1024`
Initialization time: 1.20581354s
Compute time: 52.181916ms
0
This confirms that the bottleneck is writing data to the GPU, and suggests that the actual execution on the GPU should be faster than execution on the CPU once the transfer cost is addressed.
My next steps will be to try to mitigate this by overlapping compute and data transfer, and by introducing more compute operations on the GPU to take better advantage of GPU parallelism.
Thoughts/Conclusions
- Compiling to SPIR-V is a better option for compute shaders than GLSL.
- WGSL is an interesting development, but isn’t the most ergonomic option for compute shaders.
- rust-gpu is very cool: Rust is a far better option for expressing computation than WGSL, and the ecosystem of libraries certainly helps.
- I’m not convinced that Rust is the best language for this task though. I’d like to try other languages, like zig, in the future.
- Data transfer to the GPU remains a critical bottleneck. While crates like gpgpu help avoid boilerplate, I’d like abstractions that also set up efficient input and output pipelines.