Optimizing IO and Simplifying Compute Shader Development

This post is a part of a series. Click here for the previous post.

In this post, I continue implementing a parser on the GPU and take steps to hide the latency of transferring data to the GPU. Additionally, I explore Rust’s code generation mechanisms to simplify running compute shaders.


Codegen’ing my way out of boilerplate hell

One thing that frustrated me was the amount of boilerplate needed to run a compute kernel, with no real compile-time checks to ensure that your CPU driver code matches the semantics of your compute kernels. I decided to take a stab at solving this problem by leveraging the fact that rust-gpu uses the same language for both the GPU and CPU. My plan was to write a macro that decorates the entry point of the compute kernel and generates a struct with metadata about the kernel: the buffers that need to be bound, the workgroup dimensions, and so on.

This required learning how Rust’s macros work. Unlike C/C++, Rust macros are given access to a stream of tokens rather than raw text, which makes them far more powerful than the C preprocessor. Implementing this macro was rather non-trivial; the code can be found here. It definitely reduced the amount of boilerplate and made it easier to implement the other kernels needed to cover all of nvParse’s functionality.
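
To give a concrete picture of the idea, here is a hypothetical sketch of the kind of metadata such a macro could generate for a kernel. The names (BindingDesc, CountCharKernel, COUNT_CHAR_KERNEL) are invented for illustration; the real macro in the linked repo differs in its details.

// Hypothetical sketch of macro output: a description of one buffer binding...
pub struct BindingDesc {
    pub descriptor_set: u32,
    pub binding: u32,
    pub storage: bool, // storage buffer vs. uniform
}

// ...and the metadata the CPU driver code needs to set up and dispatch the kernel.
pub struct CountCharKernel {
    pub entry_point: &'static str,
    pub workgroup_dim: (u32, u32, u32),
    pub bindings: &'static [BindingDesc],
}

pub const COUNT_CHAR_KERNEL: CountCharKernel = CountCharKernel {
    entry_point: "main_cc",
    workgroup_dim: (256, 1, 1),
    bindings: &[
        BindingDesc { descriptor_set: 0, binding: 0, storage: true },  // input bytes
        BindingDesc { descriptor_set: 0, binding: 1, storage: false }, // chunk_size uniform
    ],
};

With something along these lines, the CPU driver code can build its bind group layouts and dispatch calls from the generated constant instead of hand-writing them for every kernel.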

I plan to continue refining my codegen system to work in more generic cases. I think this could be a valuable tool in the rust-gpu ecosystem and could help simplify GPU development, and hopefully I can publish the crate and share this work with the rust-gpu community soon.

Optimizing GPU IO

In my last post I found the main bottleneck to be writing data to the GPU. While there isn’t much we can do to make communication itself faster, we can try to hide the latency by overlapping some of the compute with IO. At a high level, our current pipeline is:

  1. Load and compile the GPU compute shader
  2. Read a chunk (the size of the maximum sized GPU buffer) of data from disk
  3. Copy data to the GPU
  4. Run the compute kernel
  5. Go back to (2) and repeat until all data is consumed

However, the maximum buffer size is much smaller than the total amount of GPU memory available. So what if we allocated a bunch of buffers up front and streamed data through them, so that the CPU can be filling one buffer while the GPU consumes another?

To implement this, I created an array of buffers and spawned a thread which handles the GPU side of the compute. On the main thread, I write data into the buffers. To synchronize the producer's and consumer's use of the buffers, I opted to create two channels:

const N_INPUT_BUFS: usize = 8;
let mut input_bufs = Vec::new();
for i in 0..N_INPUT_BUFS {
    input_bufs.push(device.create_buffer(&wgpu::BufferDescriptor {
        label: Some(&format!("File Input {}", i)),
        size: max_buffer_size as wgpu::BufferAddress,
        usage: wgpu::BufferUsages::STORAGE
            | wgpu::BufferUsages::MAP_WRITE
            | wgpu::BufferUsages::COPY_SRC
            | wgpu::BufferUsages::COPY_DST,
        mapped_at_creation: false,
    }));
}
let input_bufs = std::sync::Arc::new(input_bufs);

// This channel marks input buffers in the vector above as "free" for writing or "allocated" for
// compute. A producer will need to allocate buffers and transfer them to the consumer, which
// will then free the buffer.
let (free_buffer, allocate_buffer) = mpsc::channel();
for i in 0..input_bufs.len() {
    free_buffer
        .send(i)
        .expect("semaphore initialization failed");
}
// This channel is used to send tasks from the producer to the consumer. Each task includes a
// buffer id to identify which buffer should be bound to the GPU's compute pipeline
let (sender, receiver) = mpsc::channel();

On the producer side, populating chunks looks like this:

// Copy chunks into buffers that aren't currently in-use
let mut offset = 0;
while offset < total_len {
    // Get a buffer that is not in use
    let input_buf_id = allocate_buffer.recv().unwrap();
    let end = std::cmp::min(offset + max_buffer_size as usize, total_len);
    let slice = &input[offset..end];

    let input_buf = &input_bufs[input_buf_id];
    // Write the slice to input_buf, mapping and unmapping input_buf as required
    ...

    sender
        .send((offset, end, input_buf_id))
        .expect("send failed");

    offset = end;
}

Correspondingly on the consumer:

loop {
    let (offset, end, input_buf_id) = receiver.recv().unwrap();
    let input_buf = &input_bufs[input_buf_id];
    // Bind the appropriate buffers and invoke the compute shader
    ...

    if end == total_len {
        break;
    }
    // Mark the input buffer as ready for writing again
    free_buffer
        .send(input_buf_id)
        .expect("semaphore add failed");
}

In order to satisfy Rust’s restrictions on sharing data across threads, input_bufs is wrapped in Arc - the Rust equivalent of C++’s std::shared_ptr. Arc doesn’t allow taking mutable references to the data it contains, but that’s fine here: we never mutate the Vec itself - the buffer handles stay constant, and it’s the data behind them on the GPU that gets modified.
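
For reference, the threading itself is small - here is a simplified sketch of how the two pieces are wired together (the real code also moves the wgpu device, queue, and pipeline handles into the closure):

// Clone the Arc and move one end of each channel into the consumer thread;
// the producer loop runs on the main thread and the consumer is joined at the end.
let consumer_bufs = std::sync::Arc::clone(&input_bufs);
let consumer = std::thread::spawn(move || {
    // ... the consumer loop from above, using `consumer_bufs`, `receiver`,
    // and `free_buffer` ...
});

// ... the producer loop from above runs here on the main thread ...

consumer.join().expect("consumer thread panicked");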

Before settling on this implementation, I tried using async tasks instead of threads, but found the performance benefits to be minimal. I suspect that was mainly because the consumer task busy-waits, or otherwise waits in a way that prevents other async tasks from running. Perhaps there are other APIs within wgpu that I’m not aware of that interact better with the async ecosystem, but for my purposes, using multiple threads was perfectly acceptable.

Benchmarking

In the last post I compared my GPU implementation to wc. I was curious whether wc was more optimized than I expected, and whether that made it an unfair comparison. To level the playing field a bit, I instead implemented my own line counting function to benchmark against:

// Baseline CPU implementation: count occurrences of `char` in `data`, updating a
// progress bar every 1024 bytes to mirror the reporting overhead of the GPU version.
fn cpu_count_char(data: &[u8], char: u8) -> u32 {
    let mut acc = 0;
    let mut pbar = tqdm::pbar(Some(data.len()));
    let mut i = 0;
    let mut last_update = 0;
    for c in data {
        acc += if *c == char { 1 } else { 0 };
        i += 1;
        if i % 1024 == 0 {
            let _ = pbar.update(1024);
            last_update = i;
        }
    }
    let _ = pbar.update(data.len() - last_update);
    acc
}

There are some extra things happening here, like the progress bar, which add some overhead, but the GPU version also uses a progress bar, so it’s not too unfair of a comparison. This function does run slower than wc, even with compiler optimizations enabled.

aneesh@orion:~/nvparse_rs$ cargo r --release -- lineitem_huge.tbl -n 1024
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /home/aneesh/nvparse_rs/nvparse_rs/Cargo.toml
workspace: /home/aneesh/nvparse_rs/Cargo.toml
   Compiling nvparse_rs v0.1.0 (/home/aneesh/nvparse_rs/nvparse_rs)
    Finished `release` profile [optimized] target(s) in 12.36s
     Running `target/release/nvparse_rs lineitem_huge.tbl -n 1024`
max_buffer_size 134217728
100%|███████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:01<00:00, 1394924706.13it/s]
GPU time: 1.202456962s (res=10000000)
100%|███████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:00<00:00, 1553880686.78it/s]
CPU time: 805.946597ms (res=10000000)
10000000
aneesh@orion:~/nvparse_rs$ cargo r -- lineitem_huge.tbl -n 1024
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /home/aneesh/nvparse_rs/nvparse_rs/Cargo.toml
workspace: /home/aneesh/nvparse_rs/Cargo.toml
   Compiling nvparse_rs v0.1.0 (/home/aneesh/nvparse_rs/nvparse_rs)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 13.81s
     Running `target/debug/nvparse_rs lineitem_huge.tbl -n 1024`
max_buffer_size 134217728
100%|███████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:05<00:00, 1040667173.64it/s]
GPU time: 5.681436518s (res=10000000)
100%|████████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:07<00:00, 161695836.83it/s]
CPU time: 7.695374128s (res=10000000)
10000000

In the above you can see that there’s quite a large difference between optimized and unoptimized performance, for both the CPU and GPU implementations. I was curious to see what kinds of optimizations were happening here, so I looked at the generated assembly (cargo rustc -p nvparse_rs -- --emit asm). As far as I could tell, the biggest differences were the following:

  • checks for integer addition overflow were removed
  • function call overhead from creating iterators over the slices of the buffers was replaced with looping over the bytes directly

I suspect that eliminating the explicit construction of the iterators accounts for most of the speedup. I didn’t poke at the instructions generated for the GPU driver code, but I imagine that a lot of the overhead in the non-release build comes from heavy use of iterators there as well.

Out of curiosity, I removed the progress bar and found that the optimized version was able to then achieve faster speeds than wc. The generated assembly showed that this was done by leveraging SIMD instructions and inlining cpu_count_char entirely.
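
For reference, with the progress bar gone the whole function collapses to something like the following, which gives the optimizer an easy target for inlining and auto-vectorization (a sketch of the idea, not the exact code I benchmarked):

// Sketch: the progress-bar-free version reduces to a single iterator chain.
fn cpu_count_char_simple(data: &[u8], char: u8) -> u32 {
    data.iter().filter(|&&c| c == char).count() as u32
}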

I also noticed a huge difference in performance when my laptop is running off of battery instead of being plugged in. This is likely due to slower clock speeds:

aneesh@orion:~/nvparse_rs$ cargo r -- lineitem_huge.tbl -n 1024
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /home/aneesh/nvparse_rs/nvparse_rs/Cargo.toml
workspace: /home/aneesh/nvparse_rs/Cargo.toml
   Compiling nvparse_rs v0.1.0 (/home/aneesh/nvparse_rs/nvparse_rs)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 3.23s
     Running `target/debug/nvparse_rs lineitem_huge.tbl -n 1024`
max_buffer_size 134217728
100%|████████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:08<00:00, 925878113.63it/s]
GPU time: 8.463448308s (res=10000000)
100%|████████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:11<00:00, 102090426.30it/s]
CPU time: 11.999583947s (res=10000000)
10000000
aneesh@orion:~/nvparse_rs$ cargo r --release -- lineitem_huge.tbl -n 1024
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /home/aneesh/nvparse_rs/nvparse_rs/Cargo.toml
workspace: /home/aneesh/nvparse_rs/Cargo.toml
   Compiling nvparse_rs v0.1.0 (/home/aneesh/nvparse_rs/nvparse_rs)
    Finished `release` profile [optimized] target(s) in 31.47s
     Running `target/release/nvparse_rs lineitem_huge.tbl -n 1024`
max_buffer_size 134217728
100%|███████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:01<00:00, 1233205328.14it/s]
GPU time: 1.545687833s (res=10000000)
100%|████████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:02<00:00, 509911651.26it/s]
CPU time: 2.511051421s (res=10000000)
10000000

What’s cool is that in this low power mode, the GPU implementation is significantly faster than the CPU implementation!

When IO latency isn’t actually IO latency

At this point, I wanted to know how my implementation compared to nvParse, so I compiled and ran it on my machine. This showed that nvParse was computing the number of lines in just \(0.5s\) on the GPU! While it seems reasonable to me that using wgpu/rust-gpu might have some overhead in loading code and setting up the necessary pipelines, my implementation being more than 3x slower was odd.

To understand this better, I started adding finer-grained timers to my code. These showed that a significant portion of time was being spent reading output from the GPU. I realized that instead of reading output for every chunk, I could have my kernel increment values in the existing output buffer and only read it back after all compute is done. Doing this had no effect on overall runtime - instead, my timers began reporting that all of the latency had moved to submitting the compute pipeline to the GPU!

Submitting the compute pipeline shouldn’t do much - it doesn’t compile or allocate, it just tells the GPU to run modules against buffers that already exist. At this point, I realized I’d seen similar things when debugging WebGL, where the true time spent on the GPU was only revealed once I added gl.finish calls. Without explicitly waiting for the GPU to finish, timing the first iteration of my rendering pass looked very fast, but the second iteration would be much slower, because submitting a task to the GPU implicitly waits for the first iteration’s compute to completely finish. Since mapping/reading a GPU buffer requires computation to finish, I realized that the slow read times were actually being caused by slow compute times. This made me revisit my write IO as well, and I realized that most of the time in the producer thread was spent waiting for the GPU to be ready for writing.
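
The practical takeaway for timing wgpu code is to force the submitted work to complete before stopping the clock. Here is a minimal sketch of the pattern, assuming a device, queue, and recorded command encoder are already in scope (exact API details vary slightly between wgpu versions):

// Block until the GPU actually finishes before reading the timer.
let start = std::time::Instant::now();
queue.submit(Some(encoder.finish()));
// Without this, `submit` returns almost immediately and the real cost shows up
// in whatever call happens to wait on the GPU next (e.g. a buffer map).
device.poll(wgpu::Maintain::Wait);
println!("dispatch took {:?}", start.elapsed());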

So why was my GPU code so slow? Let’s take a look at the compute shader I initially wrote:

#![cfg_attr(target_arch = "spirv", no_std)]
use glam::UVec3;
use spirv_std::{glam, spirv};

#[spirv(compute(threads(1024)))]
pub fn main_cc(
    #[spirv(global_invocation_id)] id: UVec3,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] input: &mut [u8],
    #[spirv(uniform, descriptor_set = 0, binding = 1)] chunk_size: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 2)] data_len: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 3)] char: &u8,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 4)] count: &mut [u32],
) {
    let index = id.x as usize;
    count[index] = 0;

    let start: usize = index * (*chunk_size as usize);
    let nelems: usize = core::cmp::min(*chunk_size, *data_len - start as u32) as usize;

    for i in start..(start + nelems) {
        if input[i] == *char {
            count[index] += 1;
        }
    }
}

Here, each thread accesses chunk_size elements, but I call this kernel once per chunk of data. Remember that a single chunk of data is the maximum buffer size, so for sufficiently large inputs the chunk_size is \(\frac{max\_buffer\_size}{n\_threads}\) - this is in the tens of thousands on my machine. GPUs perform poorly when each thread strides over a large region of data like this. Ideally, each thread should only access a single index from each input or output buffer; threads can still cooperate by using barriers and atomics, with various options for which set of threads participates in a synchronization operation.

With that in mind, I wrote the following:

#![cfg_attr(target_arch = "spirv", no_std)]
use glam::UVec3;
use kernelcodegen::generate_kernel;
use spirv_std::{arch, glam, memory, spirv};

#[generate_kernel()]
#[spirv(compute(threads(256)))]
pub fn main_cc(
    #[spirv(local_invocation_id)] lid: UVec3,
    #[spirv(global_invocation_id)] id: UVec3,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] input: &mut [u8],
    #[spirv(uniform, descriptor_set = 0, binding = 1)] chunk_size: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 2)] data_len: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 3)] char: &u8,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 4)] count: &mut [u32],
) {
    let index = id.x as usize;
    let lindex = lid.x as usize;

    let start: usize = index * (*chunk_size as usize);

    let mut acc = 0;
    for i in start..(start + *chunk_size as usize) {
        if i < *data_len as usize && input[i] == *char {
            acc += 1;
        }
    }

    // Each thread per workgroup adds to a unique index in the output - this needs to be
    // synchronized across all workgroup instances though.
    unsafe {
        arch::atomic_i_add::<
            u32,
            { memory::Scope::Device as u32 },
            { memory::Semantics::OUTPUT_MEMORY.bits() },
        >(&mut count[lindex], acc)
    };
}

This might look pretty similar, but the corresponding code on the CPU side was changed as well:

...
let n_dispatches = std::cmp::min(1 + data_len / countchar_gen.workgroup_dim.0, limits.max_compute_workgroups_per_dimension);
let chunk_size: u32 = data_len / (n_dispatches * countchar_gen.workgroup_dim.0) + 1;
...

Here n_dispatches controls how many workgroups we spawn on the GPU. From the GPU code above, each workgroup has 256 threads, so effectively we have \(n\_threads = 256 * n\_dispatches\) threads. Ideally we would have \(data\_len = 256 * n\_dispatches\); however, there’s a limit on how many workgroups can be spawned, so we still need a chunk_size for the cases where we’re already spawning the maximum number of threads. Given the device limits of my hardware, the maximum value of chunk_size seems to be 17, which is small enough that there’s no noticeable degradation compared to accessing a single element. In the GPU code, lid (the local_invocation_id) provides an index in \([0, 256)\), while id (the global_invocation_id) provides an index in \([0, 256 * n\_dispatches)\).

We use Device scoped atomics to synchronize across workgroups and write to a small output buffer. I experimented with Device vs Workgroup scoped atomics (with one unique output location per workgroup), but didn’t see a significant difference in performance. This makes sense: the maximum number of conflicting accesses in the Device scope is still less than 500, while Workgroup has 256 conflicting accesses - likely not enough of a difference to impact performance (especially since it’s not guaranteed that all workgroups run at once, so the contention in the Device case could even be lower than Workgroup on some hardware with this setup).
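
As a rough sanity check on that number (using a workgroup-per-dimension limit of 32768, one plausible value for this limit - not a confirmed figure for my hardware): a full chunk is \(134217728\) bytes, so \(chunk\_size = \lfloor 134217728 / (32768 * 256) \rfloor + 1 = 16 + 1 = 17\).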

This brought the total compute time down to around 500ms! Without compiler optimizations this beats the CPU; with SIMD enabled, the CPU is still around 200ms faster, and once the time spent on IO and initialization is included, the CPU ends up around 400-500ms faster overall. This brings my implementation close enough to the nvParse numbers for me to call this portion of the project done. I think I could still tweak things a bit to match or beat nvParse’s performance, but I’d rather move on and implement the rest of the functionality - like the actual CSV parser itself.

Conclusion

Just for fun, I wanted to see how many lines of code it took to get to this point. The current version of the code can be found here:

aneesh@orion:~/nvparse_rs$ cloc main --git
      23 text files.
      19 unique files.                              
       4 files ignored.

github.com/AlDanial/cloc v 1.98  T=0.04 s (487.5 files/s, 27430.7 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Rust                             9            113            112            821
TOML                             9             18              0             89
Markdown                         1              2              0              3
-------------------------------------------------------------------------------
SUM:                            19            133            112            913
-------------------------------------------------------------------------------

This is quite a lot of code for something that ultimately doesn’t do that much! About half of the lines come from the codegen system, but the commits introducing it added roughly as many lines as they removed. This suggests to me that writing compute shaders in Rust has a fair bit of overhead compared to writing CUDA - but keep in mind that, unlike CUDA, this implementation is portable.

rust-gpu really shines when you are actually using your shader crates as regular Rust crates. This opens up possibilities for sharing struct definitions, utility functions, and even testing your shaders. However, from what I’ve read online there can be some gotchas (e.g. usize isn’t guaranteed to be the same size across the CPU target and the SPIRV target - it seems usize is an alias for u64 on most CPU targets, but u32 on SPIRV). I’ve also been reading up on SYCL, which follows a similar idea and looks like a more portable take on CUDA, though it still requires using C++. Understanding more about the GPU compilation landscape has also given me a better appreciation for the problems MLIR claims to solve, and that’s something I want to read more about.
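
To make the usize gotcha concrete, here is a hypothetical example of the kind of struct I’d want to share between the CPU and shader crates - explicitly sized fields sidestep the mismatch:

// Hypothetical shared struct: #[repr(C)] keeps the layout predictable on both
// sides, and u32 fields avoid usize (8 bytes on a typical CPU target, 4 on SPIRV).
#[repr(C)]
#[derive(Copy, Clone)]
pub struct ChunkParams {
    pub chunk_size: u32,
    pub data_len: u32,
}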

This post is a part of a series. Click here for the next post.
Written on December 14, 2024