Reinitializing PySpark

Sometimes I look at the code I’ve written and I know I’m going to hell.

At $WORK I encountered a weird bug - some tests that relied on pyspark for executing SQL were failing. The failure mode was the following message:

java.lang.IllegalArgumentException: Cannot initialize FileIO implementation org.apache.iceberg.aws.s3.S3FileIO: Cannot find constructor for interface org.apache.iceberg.io.FileIO

This was odd, because the code we used to create the SparkSession instance looked like this:

SparkSession.builder.appName("foo")
.config(
    "spark.jars.packages",
    "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.2,"
    "software.amazon.awssdk:bundle:2.19.13,"
    "software.amazon.awssdk:url-connection-client:2.19.13,",
)
...
.getOrCreate()

We’re definitely including the right set of jars to provide S3FileIO so what gives? After poking around I realized that the culprit wasn’t the setup for this particular test, but the test before it, which did:

SparkSession.builder.appName("bar")
.config(
    "spark.jars.packages",
    "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.2"
)
...
.getOrCreate()

This lead me to the extremely fun realization that getOrCreate will do exactly what it says - it will always try to get an existing instance, even if the configuration differs. Looking around online, I found other reports matching the same behavior, and a general sense of defeat - having differing configurations in pyspark is simply a no-go. Even if you stop an existing session, the underlying JVM process will live on until the python process ends, and future SparkSession.Builders will not be able to change the set of jars. While for our particular usecase, it was okay to just use the super set of jars everywhere, I couldn’t help but feel like there must be a way to reset the state and create a totally new instance. After a bit of poking, I found one:

from pyspark import SparkContext

if SparkContext._gateway:
    SparkContext._gateway.proc.kill()
    SparkContext._gateway = None
    SparkContext._jvm = None

…so cursed. This will invalidate all existing references to sessions, and will do weird things if you have an sessions that aren’t stopped, so use at your own risk.

Written on June 20, 2024