Paper Review 01 - Xorbits

My thoughts on Xorbits: Automating Operator Tiling for Distributed Data Science


What is this post?

This year, one of my goals is to rekindle my academic aspirations, and I figured a good way to do that might be to not only read more papers, but also to document my thoughts about them (and possibly even identify related problems along the way). For this post's paper in question (linked above), I just went on arXiv and looked at recent papers in the field of databases. I have mixed feelings about this approach to selecting papers, and for the next few publications I intend to write about, I instead looked at papers accepted to reputable conferences.

What is Xorbits?

Xorbits is an open source framework that aims to enable multicore/distributed processing for existing code targeting dataframe APIs (e.g. pandas). Its stated goals are to handle datasets that are too large to process on a single machine and to improve processing speed. Xorbits is primarily maintained by a startup called either XProbe Inc or Xorbits Inc. It seems that they've renamed themselves at some point, and I'm not sure which name represents the current entity. I'd probably guess the former, since they seem to be branching out with a new product called XInference that appears to be some kind of compatibility layer for LLMs.

The space occupied by Xorbits is not unique; several other companies/OSS projects, such as modin, polars, and bodo, have also identified the poor scaling of Python-based dataframe implementations.

For full transparency, I am currently a full-time engineer at bodo - the similarity of my employer’s product to Xorbits is what made me click on this paper, but the thoughts in this post are entirely my own and do not necessarily reflect the views or intentions of my employer.

The fact that this space has so many players is definitely a sign that Xorbits is attempting to solve a very real problem.

How does Xorbits improve the state of dataframe processing in python?

The general workflow of a program accelerated with Xorbits is that API calls are collected into a graph until a method that triggers computation (or an explicit request for computation) is encountered. At that point, the Xorbits runtime optimizes the graph, chunks execution (partitioning input data/operators into per-logical-thread operations), and executes. The paper focused on three main contributions.
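
To make the deferred-execution idea concrete, here's a toy sketch of the pattern. This is my own illustration, not Xorbits' actual code: the LazyFrame class, its methods, and the example pipeline are all made up, and only show how API calls can build a graph that runs nothing until an "action" forces computation.

```python
# Toy sketch of deferred execution (not Xorbits' implementation):
# API calls record nodes in a graph; nothing runs until an "action"
# (here, sum()) needs a concrete result.

class LazyFrame:
    def __init__(self, op, inputs=()):
        self.op = op          # callable that produces this node's result
        self.inputs = inputs  # upstream LazyFrame nodes

    def map(self, fn):
        # Deferred: just add a new node to the graph instead of computing.
        return LazyFrame(lambda vals: [fn(v) for v in vals[0]], (self,))

    def sum(self):
        # An "action": in a real system this is where the graph would be
        # optimized, chunked across workers, and executed.
        return sum(self._execute())

    def _execute(self):
        child_results = [child._execute() for child in self.inputs]
        return self.op(child_results)


source = LazyFrame(lambda _: list(range(10)))
result = source.map(lambda x: x * 2).sum()  # graph built lazily, run here
print(result)  # 90
```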

Dynamic Tiling

Many operators that reshape data do so in a way that makes it impossible to statically determine what the output shape of the data will be. This is problematic for scheduling because it's hard to determine the optimal approach for chunking the data. Some systems wash their hands of the problem and rely on the user to supply a fixed chunk size, which can lead to unexpected performance or increase the burden on the user, who has to try different chunk sizes to find the optimal one.

Xorbits' dynamic tiling is implemented by running parts of the input graph until an operator that reshapes data is encountered. Such operators are assigned an initial chunking strategy, but after a few chunks are computed, their metadata is extracted and the plan is modified to better reflect the real state of the data. This is done using Python's built-in generator/coroutine mechanisms, which is a pretty clean way of implementing it.
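
Here's a minimal sketch of how I understand the generator-based pause/resume pattern. Everything below (the tile_filter function, the chunk counts, and the way metadata is fed back via send()) is hypothetical and only illustrates the idea, not Xorbits' actual implementation.

```python
# Sketch: a generator tiles a few sample chunks, yields so the scheduler can
# run them, then resumes with the observed metadata to tile the remainder.

def tile_filter(num_input_chunks, sample_size=2):
    # Tile an initial sample using a default strategy.
    sampled = [f"chunk-{i}" for i in range(sample_size)]
    # Pause here; the scheduler executes the sample and sends back the
    # observed output sizes via generator.send(...).
    observed_rows = yield sampled
    # Use the observed selectivity to pick how many input chunks to merge
    # into each output chunk for the rest of the data.
    avg_rows = sum(observed_rows) / len(observed_rows)
    rows_per_output_chunk = 100_000
    merge_factor = max(1, int(rows_per_output_chunk // max(avg_rows, 1)))
    remaining = [f"chunk-{i}" for i in range(sample_size, num_input_chunks)]
    yield [remaining[i:i + merge_factor]
           for i in range(0, len(remaining), merge_factor)]


# Driving the coroutine the way a scheduler might:
tiler = tile_filter(num_input_chunks=10)
sample = next(tiler)               # chunks to execute first
plan = tiler.send([5_000, 7_500])  # feed back real metadata, get the rest
print(sample, plan)
```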

Operator Fusion

There's also an optimization stage for operator fusion, where nodes in the compute graph are combined based on a graph coloring algorithm. The paper claims that using graph coloring to implement this is a novel contribution, and it certainly seems like a really cool way to approach the problem.
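
I don't know the paper's exact coloring rules, but the rough idea as I understand it is to assign each node a color and merge nodes that share a color into a single fused task. Here's a toy sketch under that assumption; the example graph, the fusable set, and the fusion rule are all made up for illustration.

```python
# Sketch: walk the compute graph in topological order and assign colors.
# A node inherits its predecessor's color when the pair is safe to fuse
# (elementwise op feeding a single consumer); otherwise it starts a new
# color. Each color group becomes one fused task.

from collections import defaultdict

# Hypothetical graph: node -> list of successor nodes.
graph = {"read": ["filter"], "filter": ["add_col"], "add_col": ["groupby"],
         "groupby": ["agg"], "agg": []}
# Hypothetical set of cheap, elementwise-style operators eligible for fusion.
fusable = {"filter", "add_col", "agg"}

def color_graph(graph, fusable):
    predecessors = defaultdict(list)
    for node, succs in graph.items():
        for s in succs:
            predecessors[s].append(node)

    colors, next_color = {}, 0
    for node in graph:  # assumes dict order is already topological
        preds = predecessors[node]
        if (node in fusable and len(preds) == 1
                and len(graph[preds[0]]) == 1):
            colors[node] = colors[preds[0]]  # fuse into predecessor's group
        else:
            colors[node] = next_color        # start a new fused group
            next_color += 1
    return colors

print(color_graph(graph, fusable))
# {'read': 0, 'filter': 0, 'add_col': 0, 'groupby': 1, 'agg': 1}
```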

Storage Backends

To be honest, I only skimmed this section, so my brief summary is that they support multiple storage backends and even have some notion of tiering, choosing the storage mechanism that best matches how the data is being used. I did learn about the existence of pickle5 though, which was neat.
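
For context, pickle protocol 5 (PEP 574, available in the standard library since Python 3.8 and backported to older versions by the pickle5 package) adds out-of-band buffer support, so large binary payloads can be shipped without extra copies. A minimal sketch of what that looks like with the standard library:

```python
import pickle
import numpy as np

# Protocol 5 lets large buffers be serialized out-of-band: the buffers are
# handed to a callback instead of being copied into the pickle stream.
arr = np.arange(1_000_000, dtype=np.float64)

buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# The out-of-band buffers can be transferred separately (e.g. over shared
# memory) and supplied back at load time.
restored = pickle.loads(payload, buffers=buffers)
assert np.array_equal(arr, restored)
```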

Side note - a non-technical explanation

While I was initially reading the paper, my partner offhandedly asked me to summarize what I was reading in non-technical terms, so here's my best non-technical explanation of how chunked processing works:

It's kinda like when you're eating a bunch of food: if you eat too much too fast, you choke and throw up. If you take really small bites, you probably won't throw up (ignoring a full stomach), but you'll take a lot longer to finish the meal. Computers can only handle computing over some maximum amount of data, and going beyond that leads to OOMs (out-of-memory errors). The goal is to look at the "digestive system," figure out how fast each part of it can work, and then determine the biggest bites you can take at any given moment to finish your plate as fast as possible without choking.

My thoughts

When I initially read the paper, I missed that it relates to a commercial product. I'm not entirely sure what Xorbits' business model is, but the paper definitely reads a bit like an advertisement, and the benchmarks also seem slightly cherry-picked (more on that later). I don't think that automatically makes this a bad paper, but it does change how I interpret the results.

That said, I think some of the contributions in this paper are really interesting. I especially liked the elegance of using Python's coroutines (yield) to implement dynamic tiling. I also found the details of how graph fusion worked for them rather interesting. Prior systems I've worked on or read about seem to focus more on fusing sequences of operators to replace them with more efficient implementations, but few seem to focus on reducing inter-operator overheads (which may or may not have been the goal of the implementation here, but it was interesting to think about nonetheless). I'm a little surprised by how large the impact of graph fusion was (I would have expected operator fusion to be more impactful), and I'd love to dive into more detail about what exactly makes operator fusion fast.

I couldn't help but compare Xorbits to bodo, given my familiarity with the latter. There are several interesting differences in how the two approach the problem. Xorbits seems to manage most of its infrastructure as a service on top of kubernetes, while bodo's community edition relies on users configuring MPI. Another interesting difference is that Xorbits replaces the pandas/numpy runtimes with its own implementation that defers computation, which allows for extremely high compatibility with existing python scripts, while bodo only supports the subset of those scripts that can be compiled by numba. There are pros and cons to both of these approaches, and choosing one comes down to whether you want to be explicit in targeting only the subset of workflows where your product can make a meaningful impact, or to prioritize compatibility so that users don't need to take many actions to use the product.

I felt like the paper was lacking detailed comparisons to other state-of-the-art systems. I think the paper's first author would agree with me on that, given that they wrote this blog post, which compares Xorbits to a few other systems including bodo. However, that comparison only examines differences on relatively small datasets and only uses their laptop's compute resources. It does show that Xorbits is far from perfect scaling, and it would be interesting to know more about where the bottlenecks are encountered.

Overall, cool to learn about more approaches in this space. I’ve already selected the next paper I plan to write about, and hopefully I’ll be able to find time before the end of this month to finish that write-up as well!

Written on January 5, 2024