CSV Parsing on GPUs
In this post, I (mostly) complete my port of nvParse to Rust.
After my last post, implementing the final kernel necessary for parsing most of a CSV file was almost trivial. Here’s the code:
#![cfg_attr(target_arch = "spirv", no_std)]
use glam::UVec3;
use kernelcodegen::generate_kernel;
use spirv_std::{glam, spirv};
fn parse_u32(input: &[u8], start_offset: usize, end_offset: usize) -> u32 {
let mut val: u32 = 0;
for i in start_offset..end_offset {
let b = input[i];
val *= 10;
if b < b'0' || b > b'9' {
return u32::max_value();
}
val += (b - b'0') as u32;
};
val
}
#[generate_kernel()]
#[spirv(compute(threads(256)))]
pub fn main_cc(
#[spirv(global_invocation_id)] id: UVec3,
#[spirv(storage_buffer, descriptor_set = 0, binding = 2)] input: &mut [u8],
#[spirv(uniform, descriptor_set = 0, binding = 3)] input_len: &u32,
#[spirv(uniform, descriptor_set = 0, binding = 4)] delimiter: &u8,
#[spirv(uniform, descriptor_set = 0, binding = 5)] chunk_lines: &u32,
#[spirv(storage_buffer, descriptor_set = 0, binding = 6)] line_start_offsets: &mut [u32],
#[spirv(storage_buffer, descriptor_set = 0, binding = 7)] parsed: &mut [u32],
) {
let index = (id.x * *chunk_lines) as usize;
for i in 0..(*chunk_lines as usize) {
if (index + i) >= (line_start_offsets.len() - 1) {
break;
}
let start_offset = line_start_offsets[index + i] as usize;
let res = if start_offset == 0 {
// TODO access fields from the previous chunk
u32::max_value()
} else {
// start_offset as u32
let mut end_offset = start_offset;
let mut found = true;
while input[end_offset] != *delimiter && input[end_offset] != b'\n' {
end_offset += 1;
if end_offset >= (*input_len as usize) {
found = false;
break;
}
}
if found {
parse_u32(input, start_offset, end_offset)
} else {
// Failed to parse the value - this is probably because start_offset == *input_len
1111000000 + (start_offset as u32)
}
};
parsed[index + i] = res;
}
}
There’s a few pieces missing here. One is that it’s possible that a chunk doesn’t end on a new line exactly. In that case, when processing the previous chunk, we cannot parse the final CSV entry, and when processing the current chunk, we need to access the last row of the previous chunk and concatenate that with the first line. This isn’t too complicated to implement, but it didn’t seem worth the effort to me. This highlights one of the key difficulties in using compute shaders instead of higher level APIs like CUDA. One must manually control how data is moved to the GPU, and buffer size limits can be pretty restrictive. However, the reference implementation for this effort (nvParse) seems to segfault on sufficiently large inputs, so I suppose my version is actually more robust.
Another missing piece is that I’m only extracting a single field from the CSV.
This is also just to make my implementation simpler. nvParse
hardcodes the
schema it’s parsing and extracts all fields - it’s completely feasible for me to
do the same, but I decided to just implement a single field.
A pain point I hit in rust-gpu
here was that parsing string as integers. The
codegen backend doesn’t support taking slices of a buffer, which means that just
calling .parse()
on a string isn’t allowed, and to make matters worse, methods
that I implement must take start/end offsets to avoid creating slices. This
seems like something that can be improved as the project matures though - but
further suggests that rust-gpu
isn’t the best choice for GPGPU
yet.
Comparing with nvParse
For a file with 10000000 lines, nvParse
runs in around 1.6s. My implementation
takes about 2s. I think the most likely source of slower performance in my
implementation is in reading data back to the CPU. nvParse
only reads once,
but my implementation does one read per chunk. This is definitely possible to
avoid by filling up buffers across multiple chunks and then reading only at the
end.
nvParse
claims that it has better performance than a single-core CPU
implementation. However this “implementation” is measuring the time it takes
cut
to extract a field from the CSV and print all rows. This is probably
dominated by IO time more than parsing time. To understand what CPU speeds are
like, I wrote a small program that extracts a field from a CSV, but does not
print it out. On a single thread, it took around 5s, and on 16 threads, it took
700ms. Here’s the code I used:
use std::io::{self, Read, Seek, SeekFrom};
use std::sync::{Arc, Mutex};
use std::thread;
fn main() -> io::Result<()> {
let f = std::fs::File::open("./lineitem.tbl")?;
let filelen = f.metadata()?.len();
let nfields: Arc<Mutex<usize>> = Arc::new(Mutex::new(0));
let nthreads = 1;
let nbytes_per_thread = 1 + filelen / nthreads as u64;
let mut child_handles = Vec::new();
for i in 0..nthreads {
let nfields = nfields.clone();
child_handles.push(thread::spawn(move || {
let mut f = std::fs::File::open("./lineitem.tbl").unwrap();
let mut start = nbytes_per_thread * i as u64;
let mut c = [0u8; 1];
f.seek(SeekFrom::Start(start)).unwrap();
f.read(&mut c).unwrap();
while start != 0 && c[0] != b'\n' {
start -= 1;
f.seek(SeekFrom::Start(start)).unwrap();
f.read(&mut c).unwrap();
}
if start != 0 {
start += 1;
}
let mut end = std::cmp::min(nbytes_per_thread * (i as u64 + 1), filelen);
c[0] = 0;
f.seek(SeekFrom::Start(end)).unwrap();
f.read(&mut c).unwrap();
while end != filelen && c[0] != b'\n' {
end -= 1;
f.seek(SeekFrom::Start(end)).unwrap();
f.read(&mut c).unwrap();
}
f.seek(SeekFrom::Start(start)).unwrap();
let mut bytes = vec![0u8; (end - start) as usize];
f.read_exact(&mut bytes).unwrap();
let data = String::from_utf8(bytes).unwrap();
let mut d: Vec<u32> = Vec::new();
for l in data.split('\n') {
let fields: Vec<&str> = l.split('|').into_iter().collect();
if fields.len() == 17 {
let v = fields[0].parse().unwrap();
d.push(v);
}
}
let mut s = nfields.lock().unwrap();
*s += d.len();
}));
}
for c in child_handles {
c.join().unwrap();
}
println!("parsed {} lines", nfields.lock().unwrap());
Ok(())
}
If I do all the IO upfront, on a single thread, I measured that just reading the file takes 1.6
s.
This seems to suggest that the dominant cost in nvParse
is moving data to the GPU, and highlights
that parsing CSV files isn’t an ideal workload to parallelize using GPUs, but it also seems
reasonable that a more complex format could fare better.
Thoughts/Conclusions
This was a fun exercise to learn more about GPU programming. My next steps will be to continue this work to try to build a GPU accelerated PCAP parser, and hopefully provide more opportunities to take advantage of GPU parallelism by evaluate packet filters on the GPU.
This project has re-inspired the artist in me though, and in my next post I’ll show some fun art I made using compute shaders.
Overall, using rust-gpu
is fun, but not super practical. I was recently
exposed to chapel’s support
for GPU programming, and found their approach to single source GPU programming
to be pretty alluring, though as a more obscure language there’s less support
for it than something more widely adopted like Rust.