Loading Data in JanusGraph
Lately, I’ve been playing around with JanusGraph. I’ve been trying to determine the best way to load the LDBC SNB dataset, and it hasn’t been as straightforward as I expected.
JanusGraph is an open-source graph database that allows plugging in various storage backends and is capable of distributed execution. I've been trying to benchmark it in order to better understand the characteristics of graph-based applications.
The LDBC SNB dataset/workloads are an industry-standard benchmark that exercises many common operations against graphs, and the project also provides a generator that synthesizes graphs exhibiting properties similar to many real-world graphs. In order to run this workload against JanusGraph, I need to load the dataset into JanusGraph. Unfortunately, the official LDBC data generator only outputs CSV and Parquet files, neither of which JanusGraph supports out of the box. This means the only options are to either convert the CSVs (or Parquet, but I'm working with relatively small graphs, so CSV is good enough for me) into a supported format, or to write a script capable of importing from CSV. This post aims to summarize my learnings from attempting this.
GraphML
GraphML is a format for representing graphs using XML. JanusGraph is capable of reading and writing GraphML files - here is an example file created by asking JanusGraph to dump a graph constructed in memory:
<?xml version='1.0' encoding='UTF-8'?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.1/graphml.xsd">
<key id="date" for="node" attr.name="date" attr.type="string"/>
<key id="labelV" for="node" attr.name="labelV" attr.type="string"/>
<key id="hello" for="node" attr.name="hello" attr.type="string"/>
<key id="id" for="node" attr.name="id" attr.type="long"/>
<key id="labelE" for="edge" attr.name="labelE" attr.type="string"/>
<key id="count" for="edge" attr.name="count" attr.type="int"/>
<graph id="G" edgedefault="directed">
<node id="4128">
<data key="labelV">PERSON</data>
<data key="date">Sat Jul 05 00:00:00 UTC 1986</data>
<data key="hello">world</data>
<data key="id">0</data>
</node>
<node id="4312">
<data key="labelV">PERSON</data>
<data key="date">Mon Jan 02 00:00:01 UTC 2023</data>
<data key="hello">self</data>
<data key="id">1</data>
</node>
<edge id="2e3-3bs-3yt-36o" source="4312" target="4128">
<data key="labelE">KNOWS</data>
<data key="count">1</data>
</edge>
</graph>
</graphml>
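For reference, dumps like this can be produced (and read back) through TinkerPop's io() API from the Gremlin console - a minimal sketch, where the file name is a placeholder:

// Write the in-memory graph out as GraphML, then read it back in
// (file name is a placeholder):
graph.io(IoCore.graphml()).writeGraph('dump.xml')
graph.io(IoCore.graphml()).readGraph('dump.xml')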
This format is pretty easy to read, and it's also easy to see how it could be generated by any environment that supports XML (e.g. from a Python script).
However, while JanusGraph will create GraphML files that have date types, it doesn't support reading files with a date type. In theory I could work around this by encoding dates as strings for the initial import and then running an update query that parses the dates, but that feels like more effort than just using a different format.
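(For concreteness, that workaround might look something like the following - a minimal, untested sketch, assuming the dates were imported under a string-valued key I'm calling dateStr; both property names are mine.)

import java.text.SimpleDateFormat

// Parse string-encoded dates (in java.util.Date.toString() form, as seen in
// the GraphML dump above) into a fresh 'date' property. 'dateStr' and 'date'
// are hypothetical key names.
def fmt = new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy")
g.V().has('dateStr').toList().each { v ->
    v.property('date', fmt.parse(v.value('dateStr')))
}
graph.tx().commit()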
GraphSON
GraphSON is a format built on top of JSON. However, a GraphSON file is not itself valid JSON; instead, every line is expected to be a valid JSON blob. It has a way to encode the types of values. Here's what the same graph we saw above looks like in GraphSON:
{"id":{"@type":"g:Int64","@value":4208}, "label":"PERSON", "outE":{"KNOWS":[{"inV":{"@type":"g:Int64","@value":4200}}]}, "properties":{"date":[{"value":{"@type":"g:Date","@value":1672617601234}}],"hello":[{"value": "self"}],"id":[{"value": 1}]}}
{"id":{"@type":"g:Int64","@value":4200}, "label":"PERSON", "inE":{"KNOWS":[{"outV":{"@type":"g:Int64","@value":4208}}]}, "properties":{"date":[{"value":{"@type":"g:Date","@value":520905600000}}],"hello":[{"value":"world"}],"id":[{"value":{"@type":"g:Int32","@value":0}}]}}
While this handles dates perfectly, there's one huge problem that makes it a non-starter for me - edges have to be represented twice, once as an out-edge and once as an in-edge. All edge properties must also be present twice to match.
In my opinion this is terrible design. Edges usually outnumber vertices in real-world graphs, so the cost of this replication can be high. Not to mention, many other formats (like the LDBC CSV output) represent edges and vertices separately, meaning that converting those sources into GraphSON requires a very high-overhead process to collect all edges per source and destination vertex. GraphSON is, in my opinion, not a serious contender, and is only useful as a toy format for creating small graphs. Additionally, JanusGraph outputs many unnecessary fields when using GraphSON, which provide no real value to me.
Kryo?!
I have not explored this option much, but Kryo appears to be the Java equivalent of Python's pickle, although unlike pickle it is not part of the standard library. While it seems like a useful thing to support, I am not interested in understanding how to convert my CSVs to Kryo, and it seems very Java-centric. It's also not a format that I've ever seen mentioned by other graph databases, so it's not the most portable.
Writing a bespoke script
One could always write their own script to convert every row of the CSV files into an equivalent Gremlin program that creates the desired node/edge. I briefly explored writing a program that generates such a Gremlin script, but it's easier and faster to just read the CSV within Gremlin itself. While running this script on its own is fast, there seems to be no way to invoke it efficiently in the REPL, short of writing the functionality in Java and importing it, since it seems to be impossible to turn off output echoing in the Groovy interpreter packaged with the default JanusGraph client Docker image (newer versions of Groovy appear to support this). Here are some relevant snippets from the script I ended up writing (when I'm done with my current efforts I'll try to put the full code somewhere):
def CSVToMap(filename) {
    def file = new File(filename)
    def lines = file.readLines()
    // The LDBC CSVs are pipe-delimited; the first line is the header
    def header = lines[0].tokenize("|")
    // Load the remaining rows as maps, using the header fields as keys
    def rows = []
    lines[1..-1].each { line ->
        def row = [:]
        def props = line.tokenize("|")
        for (int i = 0; i < props.size(); i++) {
            if (header[i] == "id") {
                row.put(header[i], props[i].toLong())
            } else {
                row.put(header[i], props[i])
            }
        }
        rows << row
    }
    return rows
}
def constructVertices(g, label, allVertices) {
    allVertices.each { vertexProps ->
        // Create the vertex, then attach each CSV column as a property
        def v = g.addV(label).next()
        vertexProps.each { key, val -> v.property(key, val) }
    }
}
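For illustration, invoking these helpers on one of the LDBC vertex files looks something like this (the path reflects my local datagen output; adjust for your setup):

// Load the LDBC person vertices (path is specific to my datagen output):
persons = CSVToMap('social_network/dynamic/person_0_0.csv')
constructVertices(g, 'PERSON', persons)
graph.tx().commit()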
This could probably be made even simpler by using a real CSV library instead of implementing the parsing myself. Admittedly, it's not too much effort, and I can now serialize the graph into either GraphSON (which would still have ridiculously high overhead due to duplicated edges) or Kryo to avoid having to reconstruct the graph for every clean run I want to do, but it feels unsatisfying, and it's unclear whether this scales well. Posts from community members on the JanusGraph forum suggest that Groovy scripts are too slow for production usage, and projects on GitHub seem to use Java for this instead.
Directly importing into the storage backend
This isn’t an option that I’ve explored very deeply, but in theory, if you
understand how JanusGraph
models data on it’s backing stores, you could write
directly to the backing store and take advantage of data ingest mechanisms
present in the backing store (e.g. load the data via SQL
queries). While this
seems to me like it would be the most performant approach, JanusGraph
’s
internal data model does not seem to be a part of it’s public API (though poking
around, it seems relatively feasible to figure out how to do this). For the
scale I’m currently interested in, writing my own importer is probably less
effort and easier to do, but in my opinion, this is the only “real” option for
serious usage.
Conclusion
JanusGraph is a really cool piece of free software. Being able to choose your own backend and have control over the infrastructure of your graph database solution is a good thing for many use cases. However, the story around importing data is a mess. There are a number of forum posts where community members seem to have struggled with the same problems, or resorted to writing Java classes that handle importing for them. Compared to my experience with Neo4J, or even my experience implementing some of these things at KatanaGraph, this was far more painful. GraphML and GraphSON both have their selling points, but their flaws make them unusable for many use cases, which is sad because they could both be so powerful if tweaked slightly - GraphML needs better support for custom datatypes, and GraphSON needs a less insane format for encoding edges (which, weirdly enough, GraphSON version 0.2 seems to have been better at?). It's possible that the true panacea was Kryo all along, but the deep connection to Java pushed me away from it - while I could bite the bullet and set up a Java development environment, using scripting languages like Groovy, or ideally Python, makes for a nicer experience in the context of data exploration. While the script I wrote to import data was not too hard to write, I can definitely see data ingest being a reason some teams might not proceed with JanusGraph as a production database. Hopefully this is something that will improve with time, but it was definitely an adventure exploring and evaluating the different options that are present today.