Optimizing IO and Simplifying Compute Shader Development
In this post, I continue implementing a parser on the GPU and take steps to hide the latency of transferring data to the GPU. Additionally, I explore Rust’s code generation mechanisms to simplify running compute shaders.
Codegen’ing my way out of boilerplate hell
One thing that frustrated me was the amount of boilerplate needed to run a compute kernel, with no
real compile-time checks to ensure that your CPU driver code matches the semantics of your compute
kernels. I decided to take a stab at solving this problem by leveraging the fact that rust-gpu
uses the same language for both the GPU and CPU. My plan was to write a macro that decorates the
entry point of the compute kernel and generates a struct with metadata about the buffers that need
to be bound for the kernel, the workgroup sizes, and so on.
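To make this concrete, here is roughly the shape of metadata I wanted the macro to emit for each kernel. The struct below is an illustrative sketch; the field names are assumptions, not the exact output of my macro:
// Illustrative sketch only: roughly the metadata a kernel macro could emit.
pub struct KernelInfo {
    /// Name of the SPIR-V entry point to look up in the compiled module.
    pub entry_point: &'static str,
    /// Workgroup dimensions declared in #[spirv(compute(threads(...)))].
    pub workgroup_dim: (u32, u32, u32),
    /// (descriptor_set, binding, is_uniform) for each buffer argument, so the
    /// CPU driver code can build matching bind group layouts.
    pub bindings: &'static [(u32, u32, bool)],
}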
This required learning how Rust's macros work. Unlike C/C++, Rust macros are given access to a
stream of tokens rather than raw text, which makes them far more powerful than the C preprocessor.
Implementing this macro was rather non-trivial. The code can be found
here.
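As a rough illustration of the mechanics (not the actual implementation linked above): an attribute proc-macro receives the decorated function as a token stream, parses it with syn, and re-emits it alongside generated items via quote:
use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, ItemFn};

// Minimal skeleton: re-emit the kernel unchanged and append a generated
// constant. A real macro would also walk the #[spirv(...)] attributes on each
// argument to collect binding metadata, not just the entry point name.
#[proc_macro_attribute]
pub fn generate_kernel(_attr: TokenStream, item: TokenStream) -> TokenStream {
    let kernel = parse_macro_input!(item as ItemFn);
    let entry_point = kernel.sig.ident.to_string();
    quote! {
        #kernel
        pub const KERNEL_ENTRY_POINT: &str = #entry_point;
    }
    .into()
}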
This definitely reduced the amount of boilerplate, and made it easier to implement the other
kernels needed to cover the rest of nvParse.
I plan to continue refining my codegen system to work in more generic cases. I think this could be a
valuable tool in the rust-gpu ecosystem and could help simplify GPU development, and hopefully I
can publish the crate and share this work with the rust-gpu community soon.
Optimizing GPU IO
In my last post I found the main bottleneck to be writing data to the GPU. While there isn’t much we can do to make communication itself faster, we can try to hide the latency by overlapping some of the compute with IO. At a high level, our current pipeline is:
- Load and compile the GPU compute shader
- Read a chunk of data from disk (sized to the largest GPU buffer we can allocate)
- Copy data to the GPU
- Run the compute kernel
- Go back to (2) and repeat until all data is consumed
However, the maximum buffer size is much smaller than the total amount of GPU memory available. So what if we allocated a bunch of buffers up front, and then created a stream of filled buffers that are ready for consumption by the GPU?
To implement this, I created an array of buffers and spawned a thread which handles the GPU side of the compute. On the main thread, I write data into the buffers. To synchronize the producer and consumer usages of the buffers, I opted to create two channels:
const N_INPUT_BUFS: usize = 8;
let mut input_bufs = Vec::new();
for i in 0..N_INPUT_BUFS {
    input_bufs.push(device.create_buffer(&wgpu::BufferDescriptor {
        label: Some(&format!("File Input {}", i)),
        size: max_buffer_size as wgpu::BufferAddress,
        usage: wgpu::BufferUsages::STORAGE
            | wgpu::BufferUsages::MAP_WRITE
            | wgpu::BufferUsages::COPY_SRC
            | wgpu::BufferUsages::COPY_DST,
        mapped_at_creation: false,
    }));
}
let input_bufs = std::sync::Arc::new(input_bufs);

// This channel marks input buffers in the vector above as "free" for writing or "allocated" for
// compute. A producer will need to allocate buffers and transfer them to the consumer, which
// will then free the buffer.
let (free_buffer, allocate_buffer) = mpsc::channel();
for i in 0..input_bufs.len() {
    free_buffer
        .send(i)
        .expect("semaphore initialization failed");
}

// This channel is used to send tasks from the producer to the consumer. Each task includes a
// buffer id identifying which buffer should be bound to the GPU's compute pipeline.
let (sender, receiver) = mpsc::channel();
On the producer side, populating chunks looks like this:
// Copy chunks into buffers that aren't currently in-use
let mut offset = 0;
while offset < total_len {
    // Get a buffer that is not in use
    let input_buf_id = allocate_buffer.recv().unwrap();
    let end = std::cmp::min(offset + max_buffer_size as usize, total_len);
    let slice = &input[offset..end];
    let input_buf = &input_bufs[input_buf_id];
    // Write the slice to input_buf, mapping and unmapping input_buf as required
    ...
    sender
        .send((offset, end, input_buf_id))
        .expect("send failed");
    offset = end;
}
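The elided map-and-write step could look something like the following. This is just a sketch of one way to do it with wgpu's buffer mapping API (it relies on the buffers being host-mappable, which is why MAP_WRITE is in the usage flags above, and assumes the device handle is in scope), not necessarily the exact code I ended up with:
// Request a write mapping, block until the buffer is available,
// copy the chunk in, and unmap so the GPU can use the buffer again.
input_buf
    .slice(..)
    .map_async(wgpu::MapMode::Write, |r| r.expect("map failed"));
device.poll(wgpu::Maintain::Wait);
{
    let mut view = input_buf.slice(..).get_mapped_range_mut();
    view[..slice.len()].copy_from_slice(slice);
} // the mapped view must be dropped before unmapping
input_buf.unmap();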
Correspondingly on the consumer:
loop {
    let (offset, end, input_buf_id) = receiver.recv().unwrap();
    let input_buf = &input_bufs[input_buf_id];
    // Bind the appropriate buffers and invoke the compute shader
    ...
    if end == total_len {
        break;
    }
    // Mark the input buffer as ready for writing again
    free_buffer
        .send(input_buf_id)
        .expect("semaphore add failed");
}
In order to satisfy Rust's restrictions on sharing data across threads, input_bufs is wrapped in
Arc - the Rust equivalent of C++'s std::shared_ptr. However, this doesn't allow taking mutable
references to the data contained in the array. That's fine for us, because we aren't mutating the
array: the buffers themselves remain constant, and only the data within them is modified (through
wgpu's mapping APIs rather than through Rust references).
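Tying it together, the consumer loop above runs on its own thread with a clone of the Arc. The sketch below uses a hypothetical consume_chunks wrapper around that loop just to show the shape:
// Sketch: the consumer thread gets shared, read-only access to the buffer array.
// `consume_chunks` is a hypothetical wrapper around the consumer loop above.
let consumer_bufs = std::sync::Arc::clone(&input_bufs);
let gpu_thread = std::thread::spawn(move || {
    consume_chunks(&consumer_bufs, receiver, free_buffer);
});
// ... the producer loop above runs here on the main thread ...
gpu_thread.join().expect("GPU thread panicked");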
Before settling on this implementation, I tried using async tasks instead of threads, but found the
performance benefits to be minimal. I suspect that was mainly because the consumer task busy-waits,
or otherwise waits in a way that prevents other async tasks from running. Perhaps there are other
APIs within wgpu that I'm not aware of that interact better with the async ecosystem, but for my
purposes using multiple threads was perfectly acceptable.
Benchmarking
In the last post I compared my GPU implementation against wc. I was curious whether wc was more
optimized than I expected, and whether that made the comparison unfair. To level the playing field a
bit, I instead implemented my own line-counting function to benchmark against:
fn cpu_count_char(data: &[u8], char: u8) -> u32 {
    let mut acc = 0;
    let mut pbar = tqdm::pbar(Some(data.len()));
    let mut i = 0;
    let mut last_update = 0;
    for c in data {
        acc += if *c == char { 1 } else { 0 };
        i += 1;
        if i % 1024 == 0 {
            let _ = pbar.update(1024);
            last_update = i;
        }
    }
    let _ = pbar.update(data.len() - last_update);
    acc
}
There’s some extra things happening here like a progress bar which adds some overhead, but the GPU
version also uses a progress bar so it’s not too unfair of a comparison. This function does run
slower than wc
, even with compiler optimizations enabled.
aneesh@orion:~/nvparse_rs$ cargo r --release -- lineitem_huge.tbl -n 1024
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package: /home/aneesh/nvparse_rs/nvparse_rs/Cargo.toml
workspace: /home/aneesh/nvparse_rs/Cargo.toml
Compiling nvparse_rs v0.1.0 (/home/aneesh/nvparse_rs/nvparse_rs)
Finished `release` profile [optimized] target(s) in 12.36s
Running `target/release/nvparse_rs lineitem_huge.tbl -n 1024`
max_buffer_size 134217728
100%|███████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:01<00:00, 1394924706.13it/s]
GPU time: 1.202456962s (res=10000000)
100%|███████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:00<00:00, 1553880686.78it/s]
CPU time: 805.946597ms (res=10000000)
10000000
aneesh@orion:~/nvparse_rs$ cargo r -- lineitem_huge.tbl -n 1024
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package: /home/aneesh/nvparse_rs/nvparse_rs/Cargo.toml
workspace: /home/aneesh/nvparse_rs/Cargo.toml
Compiling nvparse_rs v0.1.0 (/home/aneesh/nvparse_rs/nvparse_rs)
Finished `dev` profile [unoptimized + debuginfo] target(s) in 13.81s
Running `target/debug/nvparse_rs lineitem_huge.tbl -n 1024`
max_buffer_size 134217728
100%|███████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:05<00:00, 1040667173.64it/s]
GPU time: 5.681436518s (res=10000000)
100%|████████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:07<00:00, 161695836.83it/s]
CPU time: 7.695374128s (res=10000000)
10000000
In the above you can see that there's quite a large difference between optimized and unoptimized
performance, for both the CPU and GPU implementations. I was curious what kinds of
optimizations were happening here, so I looked at the generated assembly (cargo rustc -p nvparse_rs --
--emit asm). As far as I could tell, the biggest differences were the following:
- checks for integer addition overflow were removed
- function-call overhead from creating iterators over slices of the buffers was replaced with loops over the bytes directly
I suspect that avoiding explicit construction of the iterators accounts for most of the speedup. I didn't poke at the instructions generated for the GPU driver code, but I imagine that a lot of the overhead in the non-release build comes from heavy use of iterators.
Out of curiosity, I removed the progress bar and found that the optimized version was then able to
beat wc. The generated assembly showed that this was achieved by leveraging SIMD instructions and
inlining cpu_count_char entirely.
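For reference, the progress-bar-free version is essentially just a fold over the bytes - a shape the optimizer is happy to vectorize. This is my reconstruction of it; I haven't re-checked this exact form against the assembly:
// Same counting logic as cpu_count_char, minus the progress bar. In --release
// builds the optimizer typically turns this fold into a SIMD loop.
fn cpu_count_char_fast(data: &[u8], char: u8) -> u32 {
    data.iter().map(|&c| (c == char) as u32).sum()
}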
I also noticed a huge difference in performance when my laptop runs on battery instead of being plugged in, likely due to lower clock speeds:
aneesh@orion:~/nvparse_rs$ cargo r -- lineitem_huge.tbl -n 1024
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package: /home/aneesh/nvparse_rs/nvparse_rs/Cargo.toml
workspace: /home/aneesh/nvparse_rs/Cargo.toml
Compiling nvparse_rs v0.1.0 (/home/aneesh/nvparse_rs/nvparse_rs)
Finished `dev` profile [unoptimized + debuginfo] target(s) in 3.23s
Running `target/debug/nvparse_rs lineitem_huge.tbl -n 1024`
max_buffer_size 134217728
100%|████████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:08<00:00, 925878113.63it/s]
GPU time: 8.463448308s (res=10000000)
100%|████████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:11<00:00, 102090426.30it/s]
CPU time: 11.999583947s (res=10000000)
10000000
aneesh@orion:~/nvparse_rs$ cargo r --release -- lineitem_huge.tbl -n 1024
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package: /home/aneesh/nvparse_rs/nvparse_rs/Cargo.toml
workspace: /home/aneesh/nvparse_rs/Cargo.toml
Compiling nvparse_rs v0.1.0 (/home/aneesh/nvparse_rs/nvparse_rs)
Finished `release` profile [optimized] target(s) in 31.47s
Running `target/release/nvparse_rs lineitem_huge.tbl -n 1024`
max_buffer_size 134217728
100%|███████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:01<00:00, 1233205328.14it/s]
GPU time: 1.545687833s (res=10000000)
100%|████████████████████████████████████████████████████████████████████████████| 1238480000/1238480000 [00:02<00:00, 509911651.26it/s]
CPU time: 2.511051421s (res=10000000)
10000000
What’s cool is that in this low power mode, the GPU implementation is significantly faster than the CPU implementation!
When IO latency isn’t actually IO latency
At this point, I wanted to know how my implementation compared to
nvParse, so I compiled and ran it on my machine. This showed
that nvParse
was computing the number of lines in just \(0.5s\) on the GPU!
While it seems reasonable to me that using wgpu/rust-gpu
might have some overhead in loading code
and setting up the necessary pipelines, my implementation being more than 3x slower was odd.
To understand this better, I started adding finer-grained timers to my code. These showed that a significant portion of time was being spent reading output from the GPU. I realized that instead of reading output for every chunk, I could have my kernel increment values in the existing buffer and only read it back after all the compute is done. Doing this had no effect on the overall runtime; instead, my timers began reporting that all of the latency had moved to submitting the compute pipeline to the GPU!
Submitting the compute pipeline shouldn't do much - it doesn't compile or allocate, it just
tells the GPU to run modules against buffers that already exist. At this point, I remembered seeing
something similar when debugging WebGL, where the true time spent on the GPU was only revealed once
I added gl.finish calls. Without explicitly waiting for the GPU to finish, timing the first
iteration of my rendering pass looked very fast, but the second iteration would be much slower,
because submitting a task to the GPU would implicitly wait for the first iteration's work to
completely finish. Since mapping/reading a GPU buffer requires the compute to finish, I realized
that the slow read times were actually caused by slow compute times. This made me revisit my write
IO as well, and I realized that most of the time in the producer thread was spent waiting for the
GPU to be ready for writing.
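In other words, timing the submit call alone is misleading; the real cost only shows up when you explicitly wait on the device. A minimal sketch of this kind of measurement, assuming the wgpu device/queue/encoder from the driver code and a wgpu version where Maintain::Wait is the blocking poll:
let submit_start = std::time::Instant::now();
queue.submit(Some(encoder.finish()));
println!("submit: {:?}", submit_start.elapsed()); // cheap: just queues work

let gpu_start = std::time::Instant::now();
device.poll(wgpu::Maintain::Wait); // blocks until the GPU actually finishes
println!("gpu work: {:?}", gpu_start.elapsed()); // the real compute time shows up here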
So why was my GPU code so slow? Let’s take a look at the compute shader I initially wrote:
#![cfg_attr(target_arch = "spirv", no_std)]
use glam::UVec3;
use spirv_std::{glam, spirv};

#[spirv(compute(threads(1024)))]
pub fn main_cc(
    #[spirv(global_invocation_id)] id: UVec3,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] input: &mut [u8],
    #[spirv(uniform, descriptor_set = 0, binding = 1)] chunk_size: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 2)] data_len: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 3)] char: &u8,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 4)] count: &mut [u32],
) {
    let index = id.x as usize;
    count[index] = 0;
    let start: usize = index * (*chunk_size as usize);
    let nelems: usize = core::cmp::min(*chunk_size, *data_len - start as u32) as usize;
    for i in start..(start + nelems) {
        if input[i] == *char {
            count[index] += 1;
        }
    }
}
Here, each thread accesses chunk_size
elements, but I call this kernel once per chunk of data.
Remember that a single chunk
of data is the maximum buffer size, so for sufficiently large inputs
the chunk_size
is \(\frac{max\_buffer\_size}{n\_threads}\) - this is in the tens of thousands on my
machine. GPUs are not designed to allow threads to access large regions of data. Each thread should
ideally only access a single index from each input or output buffer, but threads can communicate by
using barriers and atomics, with various options for which set of threads should participate in
synchronization operations.
With that in mind, I wrote the following:
#![cfg_attr(target_arch = "spirv", no_std)]
use glam::UVec3;
use kernelcodegen::generate_kernel;
use spirv_std::{arch, glam, memory, spirv};

#[generate_kernel()]
#[spirv(compute(threads(256)))]
pub fn main_cc(
    #[spirv(local_invocation_id)] lid: UVec3,
    #[spirv(global_invocation_id)] id: UVec3,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] input: &mut [u8],
    #[spirv(uniform, descriptor_set = 0, binding = 1)] chunk_size: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 2)] data_len: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 3)] char: &u8,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 4)] count: &mut [u32],
) {
    let index = id.x as usize;
    let lindex = lid.x as usize;
    let start: usize = index * (*chunk_size as usize);
    let mut acc = 0;
    for i in start..(start + *chunk_size as usize) {
        if i < *data_len as usize && input[i] == *char {
            acc += 1;
        }
    }
    // Each thread per workgroup adds to a unique index in the output - this needs to be
    // synchronized across all workgroup instances though.
    unsafe {
        arch::atomic_i_add::<
            u32,
            { memory::Scope::Device as u32 },
            { memory::Semantics::OUTPUT_MEMORY.bits() },
        >(&mut count[lindex], acc)
    };
}
This might look pretty similar, but the corresponding code on the CPU side was changed as well:
...
let n_dispatches = std::cmp::min(1 + data_len / countchar_gen.workgroup_dim.0, limits.max_compute_workgroups_per_dimension);
let chunk_size: u32 = data_len / (n_dispatches * countchar_gen.workgroup_dim.0) + 1;
...
Here n_dispatches controls how many workgroups we spawn on the GPU. From the GPU code above, each
workgroup has 256 threads, so effectively we have \(n\_threads = 256 * n\_dispatches\) threads.
Ideally we would have \(data\_len = 256 * n\_dispatches\); however, there's still a limit on how many
workgroups can be spawned, so we still need a chunk_size in the cases where we're already spawning
the maximum number of threads. For the device limits of my hardware, the maximum value of
chunk_size seems to be 17. This is small enough that there is no degradation in performance
compared to accessing a single element. In the GPU code, lid (the local_invocation_id) provides an
index in \([0, 256)\), while id (the global_invocation_id) provides an index in \([0, data\_len)\).
We use Device scoped atomics to synchronize across workgroups and write to a small output buffer. I
experimented with Device vs Workgroup scoped atomics (with one unique output location per
workgroup), but didn't see a significant difference in performance. This makes sense, because the
maximum number of conflicting accesses in the Device scope is still less than 500, while Workgroup
will have 256 conflicting accesses; there's likely not enough difference between these numbers to
impact performance (especially since it's not guaranteed that all workgroups are spawned at once -
the contention in the Device case could even be better than Workgroup on some hardware with this
setup).
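As a sanity check on where a number like 17 could come from: plugging the max_buffer_size of 134217728 bytes from the logs above into the chunk_size formula, a hypothetical max_compute_workgroups_per_dimension of 32768 (I haven't gone back to check my device's exact limit) would give
\[ chunk\_size = \left\lfloor \frac{134217728}{32768 \cdot 256} \right\rfloor + 1 = 16 + 1 = 17. \]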
This brought the total compute time down to around 500ms! Without optimizations this beats the
CPU, though with SIMD enabled the CPU is still around 200ms faster; add in the time spent on IO and
initialization, and overall the CPU ends up around 400-500ms faster with SIMD.
This brings my implementation close enough to the nvParse numbers for me to call this portion of
the project done. I think I could still tweak things a bit to match or beat nvParse's performance,
but I'd like to move on and implement the rest of the functionality - like the actual CSV parser
itself.
Conclusion
Just for fun, I wanted to see how many lines of code it took to get to this point. The current version of the code can be found here:
aneesh@orion:~/nvparse_rs$ cloc main --git
23 text files.
19 unique files.
4 files ignored.
github.com/AlDanial/cloc v 1.98 T=0.04 s (487.5 files/s, 27430.7 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Rust 9 113 112 821
TOML 9 18 0 89
Markdown 1 2 0 3
-------------------------------------------------------------------------------
SUM: 19 133 112 913
-------------------------------------------------------------------------------
This is quite a lot of code for something that ultimately doesn't do that much! About half of the lines come from the codegen system, but the commits introducing it added roughly as many lines as they removed. This suggests to me that writing compute shaders in Rust has a fair bit of overhead compared to writing CUDA - but keep in mind that unlike CUDA, this implementation is portable.
rust-gpu really shines when you actually use your shader crates as regular Rust crates. This
opens up possibilities for sharing struct definitions, utility functions, and even testing your
shaders. However, from what I've read online there can be some gotchas (e.g. usize isn't
guaranteed to be the same size across the CPU and SPIR-V targets - it seems to be an alias for u64
on most CPU targets and u32 on SPIR-V).
I've also been reading up on SYCL, which seems to be a similar idea, though it's closer to a more
portable, improved CUDA and still requires writing C++. Understanding more about the GPU compilation
landscape has also given me a better appreciation for the problems MLIR claims to solve, and is
something I want to read more about.