Loading Data in JanusGraph

Lately, I’ve been playing around with JanusGraph. I’ve been trying to determine the best way to load the LDBC SNB dataset, and it hasn’t been as straightforward as I expected.

JanusGraph is an open-source graph database that supports pluggable storage backends and is capable of distributed execution. I’ve been trying to benchmark it in order to better understand the characteristics of graph-based applications.

The LDBC SNB dataset/workloads are an industry-standard benchmark that exercises many common operations against graphs, and its generator synthesizes graphs whose properties resemble many real-world graphs. In order to run this workload against JanusGraph, I need to load the dataset into JanusGraph. Unfortunately, the official LDBC data generator only outputs CSV and parquet files, neither of which has built-in support in JanusGraph. This means my options are either to convert the CSVs (or parquet, but I’m working with relatively small graphs, so CSV is good enough for me) into a supported format, or to write a script that imports from CSV directly. This post aims to summarize my learnings in attempting this.


GraphML

GraphML is a format for representing graphs using XML. JanusGraph is capable of reading and writing GraphML files. Here is an example file created by asking JanusGraph to dump a graph constructed in memory:

<?xml version='1.0' encoding='UTF-8'?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.1/graphml.xsd">
  <key id="date" for="node" attr.name="date" attr.type="string"/>
  <key id="labelV" for="node" attr.name="labelV" attr.type="string"/>
  <key id="hello" for="node" attr.name="hello" attr.type="string"/>
  <key id="id" for="node" attr.name="id" attr.type="long"/>
  <key id="labelE" for="edge" attr.name="labelE" attr.type="string"/>
  <key id="count" for="edge" attr.name="count" attr.type="int"/>
  <graph id="G" edgedefault="directed">
    <node id="4128">
      <data key="labelV">PERSON</data>
      <data key="date">Sat Jul 05 00:00:00 UTC 1986</data>
      <data key="hello">world</data>
      <data key="id">0</data>
    </node>
    <node id="4312">
      <data key="labelV">PERSON</data>
      <data key="date">Mon Jan 02 00:00:01 UTC 2023</data>
      <data key="hello">self</data>
      <data key="id">1</data>
    </node>
    <edge id="2e3-3bs-3yt-36o" source="4312" target="4128">
      <data key="labelE">KNOWS</data>
      <data key="count">1</data>
    </edge>
  </graph>
</graphml>

This format is pretty easy to read, and it’s also easy to see how it could be generated by any environment that supports XML (e.g. from a Python script). However, while JanusGraph will create GraphML files that have date types, it doesn’t support reading files with a date type. In theory I could work around this by encoding dates as strings for the initial import and then running an update query that parses the dates, but that feels like more effort than just using a different format.
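To illustrate how easy it is to generate this format, here is a minimal sketch in Python using only the standard library. The function name, the vertex data, and the decision to declare every property as a string (to sidestep JanusGraph's inability to read date-typed keys) are my own illustrative choices, not anything from JanusGraph's tooling:

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def rows_to_graphml(rows):
    """Serialize a list of vertex property dicts into a GraphML string.

    Every property is declared with attr.type="string", since JanusGraph
    cannot read back GraphML files that declare a date type.
    """
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare one <key> per distinct property name across all rows.
    prop_names = sorted({k for row in rows for k in row})
    for name in prop_names:
        ET.SubElement(root, "key", attrib={
            "id": name, "for": "node",
            "attr.name": name, "attr.type": "string"})
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for i, row in enumerate(rows):
        node = ET.SubElement(graph, "node", id=str(i))
        for key, val in row.items():
            data = ET.SubElement(node, "data", key=key)
            data.text = str(val)
    return ET.tostring(root, encoding="unicode")

xml_doc = rows_to_graphml([{"labelV": "PERSON", "hello": "world"}])
```

A real converter would also emit edges and deduplicate key declarations against a schema, but the shape of the output is the same.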


GraphSON

GraphSON is a format built on top of JSON. It is not itself a single valid JSON document; instead, each line is expected to be a valid JSON object. It also has a way to encode the types of values. Here’s what the same graph we saw above looks like in GraphSON:

{"id":{"@type":"g:Int64","@value":4208}, "label":"PERSON", "outE":{"KNOWS":[{"inV":{"@type":"g:Int64","@value":4200}}]}, "properties":{"date":[{"value":{"@type":"g:Date","@value":1672617601234}}],"hello":[{"value": "self"}],"id":[{"value": 1}]}}
{"id":{"@type":"g:Int64","@value":4200}, "label":"PERSON", "inE":{"KNOWS":[{"outV":{"@type":"g:Int64","@value":4208}}]}, "properties":{"date":[{"value":{"@type":"g:Date","@value":520905600000}}],"hello":[{"value":"world"}],"id":[{"value":{"@type":"g:Int32","@value":0}}]}}

While this handles dates perfectly, there’s one huge problem that makes it a non-starter for me: every edge has to be represented twice, once as an out-edge on its source vertex and once as an in-edge on its target vertex. Any edge properties must also be present twice and kept in sync.

In my opinion this is terrible design. Edges outnumber vertices in many real-world graphs, so the cost of this replication can be high. Moreover, many other formats (like the LDBC CSV output) represent edges and vertices separately, meaning that converting those sources into GraphSON requires an expensive pass to collect all edges per source and destination vertex. GraphSON is, in my opinion, not a serious contender, and is only useful as a toy format for creating small graphs. Additionally, JanusGraph emits many unnecessary fields when writing GraphSON, which provide no real value to me.
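To make the duplication concrete, here is a small Python sketch (with a hypothetical flat edge list, labels hard-coded to match the example above) of the grouping step that converting separately-stored edges into GraphSON forces on you. Note how each edge is stored twice, once under the source vertex's "outE" and once under the target vertex's "inE":

```python
import json
from collections import defaultdict

def edges_to_graphson_lines(edge_list):
    """Group a flat (src, dst) edge list into per-vertex GraphSON lines.

    Every edge appears twice in the output: in the source vertex's
    "outE" map and again in the target vertex's "inE" map.
    """
    out_edges = defaultdict(list)
    in_edges = defaultdict(list)
    for src, dst in edge_list:
        out_edges[src].append({"inV": {"@type": "g:Int64", "@value": dst}})
        in_edges[dst].append({"outV": {"@type": "g:Int64", "@value": src}})
    lines = []
    for vid in sorted(set(out_edges) | set(in_edges)):
        vertex = {"id": {"@type": "g:Int64", "@value": vid}, "label": "PERSON"}
        if vid in out_edges:
            vertex["outE"] = {"KNOWS": out_edges[vid]}
        if vid in in_edges:
            vertex["inE"] = {"KNOWS": in_edges[vid]}
        lines.append(json.dumps(vertex))
    return lines

lines = edges_to_graphson_lines([(4312, 4128)])
```

For a large edge file this grouping means buffering (or externally sorting) the whole edge set twice, once keyed by source and once by target, before a single line of output can be finalized.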


Kryo

I have not explored this option much, but Kryo appears to be the Java equivalent of Python’s pickle, except that it is not part of the standard library. While it seems like a useful thing to support, I am not interested in figuring out how to convert my CSVs to Kryo, and it is very Java-centric. It’s also not a format I’ve ever seen mentioned by other graph databases, so it’s not the most portable.

Writing a bespoke script

One could always write their own script to convert every row in the CSV files into an equivalent Gremlin program that creates the desired node or edge. I briefly explored generating such a Gremlin script, but it’s easier and faster to just read the CSV within Gremlin itself. While running this script on its own is fast, there seems to be no way to invoke it efficiently from the REPL short of writing the functionality in Java and importing it, since it appears impossible to turn off output echoing in the Groovy interpreter packaged with the default JanusGraph client Docker image (newer versions of Groovy do seem to support this). Here are some relevant snippets from the script I ended up writing (when I’m done with my current efforts I’ll try to put the full code somewhere):

def CSVToMap(filename) {
  file = new File(filename)
  lines = file.readLines()
  // Get the CSV header
  header = lines[0].tokenize("|")
  // Load each CSV row as a map using the header as keys
  rows = []
  lines[1..-1].each { line ->
      def row = [:]
      def props = line.tokenize("|")
      for (int i = 0; i < props.size(); i++) {
          if (header[i] == "id") {
            row.put(header[i], props[i].toLong())
          } else {
            row.put(header[i], props[i])
          }
      }
      rows << row
  }
  return rows
}

def constructVertices(g, label, allVertices) {
  allVertices.each { vertexProps ->
      v = g.addV(label).next()
      vertexProps.each { key, val -> v.property(key, val) }
  }
}

This could probably be made even simpler by using a real CSV library instead of implementing the parsing myself. Admittedly, it’s not too much effort, and I can now serialize the graph into either GraphSON (which would still have ridiculously high overhead due to duplicated edges) or Kryo, to avoid reconstructing the graph for every clean run I want to do. Still, it feels unsatisfying, and it’s unclear whether this scales well. Posts from community members on the JanusGraph forum suggest that Groovy scripts are too slow for production usage, and projects on GitHub seem to use Java for this instead.
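For comparison, the same parsing is only a few lines with a real CSV library. Here is a sketch using Python's standard csv module with the LDBC '|' delimiter, purely as an illustration (the sample column names and values are made up; the id-to-integer coercion mirrors the Groovy version above):

```python
import csv
import io

def csv_to_maps(text):
    """Parse '|'-delimited CSV text into a list of dicts, coercing ids."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    rows = []
    for row in reader:
        if "id" in row:
            row["id"] = int(row["id"])
        rows.append(row)
    return rows

rows = csv_to_maps("id|firstName\n0|Jose\n1|Jun\n")
```

A library also handles the edge cases my tokenize-based Groovy version silently gets wrong, such as quoted fields containing the delimiter.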

Directly importing into the storage backend

This isn’t an option that I’ve explored very deeply, but in theory, if you understand how JanusGraph models data on its backing stores, you could write directly to the backing store and take advantage of the data ingest mechanisms it provides (e.g. loading the data via SQL queries). While this seems to me like it would be the most performant approach, JanusGraph’s internal data model does not appear to be part of its public API (though, poking around, it seems relatively feasible to figure out). For the scale I’m currently interested in, writing my own importer is probably less effort and easier to do, but in my opinion this is the only “real” option for serious usage.


Conclusion

JanusGraph is a really cool piece of free software. Being able to choose your own backend and have control over the infrastructure of your graph database is valuable for many use cases. However, the story around importing data is a mess. There are a number of forum posts where community members have struggled with the same problems, or resorted to writing Java classes that handle importing for them. Compared to my experience with Neo4J, or even my experience implementing some of these things at KatanaGraph, this was far more painful.

GraphML and GraphSON both have their selling points, but their flaws make them unusable for many use cases, which is a shame because both could be powerful if tweaked slightly - GraphML needs better support for custom datatypes, and GraphSON needs a saner format for encoding edges (which, weirdly enough, GraphSON version 0.2 seems to have been better at). It’s possible that the true panacea was Kryo all along, but its deep connection to Java pushed me away - while I could bite the bullet and set up a Java development environment, scripting languages like Groovy, or ideally Python, make for a nicer experience in the context of data exploration.

While the script I wrote to import data was not too hard to write, I can definitely see data ingest being a reason some teams might not proceed with JanusGraph as a production database. Hopefully this is something that will improve with time, but it was definitely an adventure exploring and evaluating the different options that are present today.

Written on April 8, 2024