CSV Parsing on GPUs

This post is a part of a series. Click here for the previous post.

In this post, I (mostly) complete my port of nvParse to Rust.


After my last post, implementing the final kernel necessary for parsing most of a CSV file was almost trivial. Here’s the code:

#![cfg_attr(target_arch = "spirv", no_std)]
use glam::UVec3;
use kernelcodegen::generate_kernel;
use spirv_std::{glam, spirv};

// Parse the ASCII digits in input[start_offset..end_offset] as a u32.
// Returns u32::MAX if any byte in the range is not a digit.
fn parse_u32(input: &[u8], start_offset: usize, end_offset: usize) -> u32 {
    let mut val: u32 = 0;
    for i in start_offset..end_offset {
        let b = input[i];
        val *= 10;
        if b < b'0' || b > b'9' {
            return u32::max_value();
        }
        val += (b - b'0') as u32;
    }
    val
}

#[generate_kernel()]
#[spirv(compute(threads(256)))]
pub fn main_cc(
    #[spirv(global_invocation_id)] id: UVec3,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 2)] input: &mut [u8],
    #[spirv(uniform, descriptor_set = 0, binding = 3)] input_len: &u32,
    #[spirv(uniform, descriptor_set = 0, binding = 4)] delimiter: &u8,
    #[spirv(uniform, descriptor_set = 0, binding = 5)] chunk_lines: &u32,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 6)] line_start_offsets: &mut [u32],
    #[spirv(storage_buffer, descriptor_set = 0, binding = 7)] parsed: &mut [u32],
) {
    // Each invocation handles `chunk_lines` consecutive lines.
    let index = (id.x * *chunk_lines) as usize;
    for i in 0..(*chunk_lines as usize) {
        if (index + i) >= (line_start_offsets.len() - 1) {
            break;
        }

        let start_offset = line_start_offsets[index + i] as usize;
        let res = if start_offset == 0 {
            // TODO access fields from the previous chunk
            u32::max_value()
        } else {
            // Scan forward to the next delimiter or newline to find the end of the field.
            let mut end_offset = start_offset;
            let mut found = true;
            while input[end_offset] != *delimiter && input[end_offset] != b'\n' {
                end_offset += 1;
                if end_offset >= (*input_len as usize) {
                    found = false;
                    break;
                }
            }

            if found {
                parse_u32(input, start_offset, end_offset)
            } else {
                // Failed to parse the value - this is probably because start_offset == *input_len
                1111000000 + (start_offset as u32)
            }
        };
        parsed[index + i] = res;
    }
}

There are a few pieces missing here. One is that a chunk may not end exactly on a newline boundary. In that case, when processing the previous chunk we cannot parse its final CSV entry, and when processing the current chunk we need to fetch the last row of the previous chunk and concatenate it with the first line. This isn’t too complicated to implement, but it didn’t seem worth the effort to me. It does highlight one of the key difficulties of using compute shaders instead of a higher-level API like CUDA: you must manually control how data is moved to the GPU, and buffer size limits can be pretty restrictive. However, the reference implementation for this effort (nvParse) seems to segfault on sufficiently large inputs, so I suppose my version is actually more robust.
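
For completeness, here’s a rough sketch of how I’d handle that boundary case on the CPU side if I did implement it: carry the bytes after the last newline of each chunk over to the next one, so every buffer the GPU sees starts and ends on a line boundary. This is a sketch only - upload_and_parse is a placeholder for the real buffer upload and kernel dispatch:

// Sketch only: carry the partial trailing line of each chunk into the next
// chunk before uploading, so the kernel never sees a line split across chunks.
// `upload_and_parse` is a placeholder for the actual buffer upload + dispatch.
fn parse_in_chunks(data: &[u8], chunk_size: usize, mut upload_and_parse: impl FnMut(&[u8])) {
    let mut carry: Vec<u8> = Vec::new();
    for chunk in data.chunks(chunk_size) {
        let mut buf = std::mem::take(&mut carry);
        buf.extend_from_slice(chunk);
        match buf.iter().rposition(|&b| b == b'\n') {
            Some(last_nl) => {
                // Everything after the last newline belongs to the next chunk.
                carry = buf[last_nl + 1..].to_vec();
                buf.truncate(last_nl + 1);
                upload_and_parse(&buf);
            }
            // No newline at all: the whole buffer is a partial line.
            None => carry = buf,
        }
    }
    if !carry.is_empty() {
        // Final line if the file doesn't end with a newline.
        upload_and_parse(&carry);
    }
}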

Another missing piece is that I’m only extracting a single field from the CSV. This is also just to keep my implementation simple. nvParse hardcodes the schema it’s parsing and extracts all fields - it’s completely feasible to do the same here, but I decided to stick to a single field.
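
If I did want every field, the inner scan would just need to know which delimiter-separated column it’s looking for. Something along these lines would do it - a hypothetical helper in the same offset-based style as parse_u32, not code from my kernel:

// Hypothetical helper: return the start/end offsets of the `field_index`-th
// field of the line beginning at `line_start`, by skipping that many
// delimiters first. Extracting a whole schema is one call per column.
fn field_bounds(
    input: &[u8],
    line_start: usize,
    input_len: usize,
    delimiter: u8,
    field_index: u32,
) -> (usize, usize) {
    let mut start = line_start;
    let mut skipped = 0;
    while skipped < field_index && start < input_len && input[start] != b'\n' {
        if input[start] == delimiter {
            skipped += 1;
        }
        start += 1;
    }
    let mut end = start;
    while end < input_len && input[end] != delimiter && input[end] != b'\n' {
        end += 1;
    }
    (start, end)
}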

A pain point I hit in rust-gpu here was parsing strings as integers. The codegen backend doesn’t support taking slices of a buffer, which means that simply calling .parse() on a string slice isn’t allowed, and to make matters worse, any helper I write must take start/end offsets into the whole buffer to avoid creating slices. This seems like something that can improve as the project matures, but it further suggests that rust-gpu isn’t the best choice for GPGPU yet.
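
To make the contrast concrete, this is the one-liner I’d write on the CPU - the slice expression is exactly the part the codegen backend rejects, which is why parse_u32 above takes offsets into the whole buffer instead:

// What I'd normally write: slice out the field and call .parse().
// The &input[start..end] slice is what rust-gpu currently won't compile.
fn parse_field(input: &[u8], start: usize, end: usize) -> Option<u32> {
    core::str::from_utf8(&input[start..end]).ok()?.parse().ok()
}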

Comparing with nvParse

For a file with 10,000,000 lines, nvParse runs in around 1.6s, while my implementation takes about 2s. The most likely source of the slowdown is reading data back to the CPU: nvParse reads back only once, but my implementation does one read per chunk. This is definitely avoidable by filling up output buffers across multiple chunks and only reading them back at the end.
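
The shape of that fix is straightforward even though I haven’t wired it up: give each chunk its own region of one large output buffer (via an extra offset uniform) and map the buffer back once at the end. A sketch, with closures standing in for the real dispatch and readback calls:

// Sketch only: `dispatch_chunk` and `read_back` are placeholders for the real
// GPU dispatch and buffer-mapping code. The kernel would take the output
// offset as another uniform and write to parsed[offset + i].
fn parse_all_chunks(
    num_chunks: usize,
    lines_per_chunk: usize,
    mut dispatch_chunk: impl FnMut(usize, usize), // (chunk index, output offset)
    read_back: impl FnOnce() -> Vec<u32>,
) -> Vec<u32> {
    for chunk in 0..num_chunks {
        dispatch_chunk(chunk, chunk * lines_per_chunk);
    }
    // One readback for the whole file instead of one per chunk.
    read_back()
}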

nvParse claims better performance than a single-core CPU implementation. However, the CPU “implementation” it compares against measures the time it takes cut (the Unix tool) to extract a field from the CSV and print every row, which is probably dominated by IO rather than parsing. To get a sense of real CPU speeds, I wrote a small program that extracts a field from a CSV but does not print it out. On a single thread it took around 5s, and on 16 threads it took 700ms. Here’s the code I used:

use std::io::{self, Read, Seek, SeekFrom};
use std::sync::{Arc, Mutex};
use std::thread;

fn main() -> io::Result<()> {
    let f = std::fs::File::open("./lineitem.tbl")?;
    let filelen = f.metadata()?.len();
    let nfields: Arc<Mutex<usize>> = Arc::new(Mutex::new(0));
    let nthreads = 1;

    // Split the file into roughly equal byte ranges, one per thread.
    let nbytes_per_thread = 1 + filelen / nthreads as u64;
    let mut child_handles = Vec::new();
    for i in 0..nthreads {
        let nfields = nfields.clone();
        child_handles.push(thread::spawn(move || {
            let mut f = std::fs::File::open("./lineitem.tbl").unwrap();
            // Walk backwards from the nominal chunk start to the nearest newline,
            // so this thread begins on a line boundary.
            let mut start = nbytes_per_thread * i as u64;
            let mut c = [0u8; 1];
            f.seek(SeekFrom::Start(start)).unwrap();
            f.read(&mut c).unwrap();
            while start != 0 && c[0] != b'\n' {
                start -= 1;
                f.seek(SeekFrom::Start(start)).unwrap();
                f.read(&mut c).unwrap();
            }
            if start != 0 {
                start += 1;
            }

            // Do the same for the chunk end: back up to the last newline before the
            // nominal boundary. The next thread picks up the line that straddles it.
            let mut end = std::cmp::min(nbytes_per_thread * (i as u64 + 1), filelen);
            c[0] = 0;
            f.seek(SeekFrom::Start(end)).unwrap();
            f.read(&mut c).unwrap();
            while end != filelen && c[0] != b'\n' {
                end -= 1;
                f.seek(SeekFrom::Start(end)).unwrap();
                f.read(&mut c).unwrap();
            }

            // Read this thread's whole byte range in one shot and parse it in memory.
            f.seek(SeekFrom::Start(start)).unwrap();
            let mut bytes = vec![0u8; (end - start) as usize];
            f.read_exact(&mut bytes).unwrap();
            let data = String::from_utf8(bytes).unwrap();

            let mut d: Vec<u32> = Vec::new();
            for l in data.split('\n') {
                // Only keep complete rows (lineitem.tbl rows split into 17 pieces here),
                // and parse the first field of each.
                let fields: Vec<&str> = l.split('|').collect();
                if fields.len() == 17 {
                    let v = fields[0].parse().unwrap();
                    d.push(v);
                }
            }

            let mut s = nfields.lock().unwrap();
            *s += d.len();
        }));
    }

    for c in child_handles {
        c.join().unwrap();
    }

    println!("parsed {} lines", nfields.lock().unwrap());
    Ok(())
}

If I do all the IO upfront on a single thread, just reading the file takes about 1.6s. This suggests that the dominant cost in nvParse is getting data off disk and onto the GPU rather than parsing it, and highlights that parsing CSV files isn’t an ideal workload to offload to a GPU - though it seems reasonable that a more complex format could fare better.
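
Measuring that raw read is as simple as timing std::fs::read on the same file:

use std::time::Instant;

fn main() -> std::io::Result<()> {
    // Time how long it takes just to pull the whole file into memory.
    let start = Instant::now();
    let bytes = std::fs::read("./lineitem.tbl")?;
    println!("read {} bytes in {:?}", bytes.len(), start.elapsed());
    Ok(())
}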

Thoughts/Conclusions

This was a fun exercise to learn more about GPU programming. My next step is to continue this work and build a GPU-accelerated PCAP parser, which should hopefully provide more opportunities to take advantage of GPU parallelism by evaluating packet filters on the GPU.

This project has re-inspired the artist in me though, and in my next post I’ll show some fun art I made using compute shaders.

Overall, using rust-gpu is fun, but not super practical. I was recently exposed to Chapel’s support for GPU programming, and found its approach to single-source GPU programming pretty alluring, though as a more obscure language it has less support than something more widely adopted like Rust.

This post is a part of a series. Click here for the next post.
Written on December 30, 2024