Automatically Checkpointing Numpy/Pandas objects

Have you ever ran a long-running script, only to decide after it completes that there’s some other operation you wish you’d included, or maybe there’s a bug causing the script to crash, throwing away your intermediate results in the process? Automatic checkpointing could be a viable solution for you!


Often, I find myself having wasted minutes or even hours of work because of a stupid bug, or because I only realized after looking at some data that the metrics I wanted to compute were slightly different. In those cases, I often wish that I could just rewind time a bit and modify the program. What I want is for my script to have some kind of checkpointing feature - every now and then the intermediate state of the program should be saved in some way that can be restored or at least interrogated. However, I don’t want to modify my scripts. I know there is already sufficient functionality within the python standard library to achieve this, but the up front cost of using it, or even the friction of just determining what should be included in the checkpoint is often enough to lead me to ignore it.

What I want is a checkpointing system that accomplishes the following:

  • Be lightweight enough to where I can run any script with it all the time
  • Have some mechanism to limit how big the checkpoints can be
  • Have mechanisms to control the frequency of checkpoints
  • Automatically determine some set of salient objects to include in the checkpoint.

So, I set out to build it, and currently have a prototype. It seems to work in some basic test cases. Getting to this point was a fun learning adventure. My prototype doesn’t allow resuming the program, but it does allow saving salient variables, which is good enough for my usage.

The first challenge was finding some way to trigger code that runs automatically. In the past, I’ve played around a bit with sys.monitoring and I’ve seen some academic papers and blog posts that mentioned how it has relatively low overhead, approaching zero overhead with some more advanced techniques. For my usage, I’m okay with some amount of performance degradation in exchange for safety. The following code gave me a basic example that printed out a message for every line that the interpreter executed:

import atexit
import sys.monitoring

def line_handler(code: CodeType, line_number: int) -> Any:
    print("[line_handler]", code, line_number)

    # We need some logic here to determine what to checkpoint and where to put
    # it

sys.monitoring.use_tool_id(sys.monitoring.OPTIMIZER_ID, "dbg")
sys.monitoring.set_events(sys.monitoring.OPTIMIZER_ID, sys.monitoring.events.LINE)
sys.monitoring.register_callback(
    sys.monitoring.OPTIMIZER_ID, sys.monitoring.events.LINE, line_handler
)
atexit.register(lambda: sys.monitoring.free_tool_id(sys.monitoring.OPTIMIZER_ID))

# Any code here causes `line_handler` to be called!
print("Hello world")

This is great and within the handler, I can every poke around the scope of the line that was executed by doing:

import inspect
...

def line_handler(code: CodeType, line_number: int) -> Any:
    ...
    cf = inspect.currentframe()
    for n, v in cf.f_back.f_globals.items():
        # Save global variables here - the variable name is n, and current value
        # is v
        ...
    for n, v in cf.f_back.f_locals.items():
        # local variables
        ...

Unfortunately, this alone isn’t quite what I want. I don’t want to run my checkpointing logic for every line that the interpreter evaluates, because that causes importing libraries to become incredibly slow. What I really want is to only run this for code that’s in the ‘top-level’ of the script being executed. My workaround was to check the filename in code.co_filename against some file that I want to “trace”. Problem solved, and performance doesn’t seem to impacted for initializing large imports (e.g. numpy and pandas).

But this brings me to my next problem. I don’t want to checkpoint every object. For starters, some datatypes are not trivially serializable. What I’d really like to do is only checkpoint numpy types and pandas types. Furthermore, in many situations, I’d actually like to only checkpoint large arrays/dataframes - it would be cool to filter the objects by size. I’d ideally like to avoid having my checkpointing framework itself depend on numpy or pandas, and I especially wanted to avoid only checking for some set of types because I didn’t feel confident that I’d be able to enumerate all the types I care about. To solve these problems, I came up with the following:

# size in bytes of the smallest object I want to checkpoint
min_capture_size = 512

def should_capture(v: Any) -> bool:
    sz = sys.getsizeof(v)
    if sz < min_capture_size:
        return False
    m = inspect.getmodule(type(v))
    if m is None:
        return False
    module_name = m.__name__

    if module_name == "builtins":
        if type(v) in [str, int, list, dict, set]:
            return True
    return module_name.startswith("pandas") or module_name.startswith("numpy")

This code looks at a value, finds it’s size by using sys.getsizeof, it’s type, and then what module the type was defined in, then checks if that module is likely to be numpy or pandas. This is a bit hacky, and I’m not super confident that sys.getsizeof can be accurate in every case. In particular, for libraries that store data as opaque pointers to internal objects implemented in some other langauge, I don’t think sys.getsizeof can tell me anything other than the size of those pointers. But that’s okay, because I can always just set min_capture_size to 0, and deal with capturing everything. Additionally, in my prototype, I added an option to control the frequency of checkpoints, to further cut down on the overheads.

The final piece that was interesting to me was figuring out how to get my monitoring tool to be installed before any “user” code runs. I figured it must be possible because tools like cProfile can do it. So I read the cProfile source code and modified it a bit to get this (annotations are entirely mine):

# modify sys.argv to be the argv that we want the user code to see
progname = ...
sys.path.insert(0, os.path.dirname(progname))

# Compile the file to python bytecode, informing it to compile for python's exec
with io.open_code(progname) as fp:
    code = compile(fp.read(), progname, "exec")

# Create a spec for the compiled code to see when exec'd - this helps "trick"
# the user code into behaving like it was launched directly
spec = importlib.machinery.ModuleSpec(name="__main__", loader=None, origin=progname)
globs = {
    "__spec__": spec,
    "__file__": spec.origin,
    "__name__": spec.name,
    "__package__": None,
    "__cached__": None,
}

# This is where I install the monitoring tool
install_hooks(progname)

# Execute the user code
exec(code, globs, None)

Conclusion/Future work

Overall the prototype is cool, but there’s a number of things that make it incomplete:

  • For starters, I’d like to do a bit more in-depth performance testing to understand what the overhead of this actually is.
  • I’d like to make it easier to specify the libraries and datatypes that should be captured (e.g. it would be nice to also support polars).
  • Currently I can only really track data that exists as a top-level variable in the user code. It does work for variables that are in functions, but not for variables that are members or otherwise nested in other non-tracked objects. It seems possible to do something even more involved - maybe by integrating with the garbage collection mechanism?
  • It would also be cool to only save data that’s actually changed since the last snapshot. There’s a bunch of different ways to go about this, but the only I’m most interested in would be to use a combination of different monitoring endpoints - I could use the start/enter hooks to statically analyze a code block for variables I intend to save, then I could use instruction level monitoring to determine when tracked objects are modified, and use that information in my per-line execution monitoring callback to determine what to save. A much simpler but more expensive option could be to compute some kind of hash instead, but that basically trades time for space saved.
  • It would be nice for the tool to default to storing checkpoints in some globally shared directory (e.g. ~/.cache or something like that), and to put a limit on the total space that directory can occupy - if it starts hitting the limit, the oldest checkpoints can be deleted, thus making it easy for me to alias python to python -m chkpt and forget about this tool being present until I realize that I need it.

Overall, this was a cool prototype that I was able to bang out in an evening thanks to the super cool sys.monitoring. I definitely learned some cool things about python internals along the way too!

Written on February 21, 2025