Using openCypher as a webscraper
openCypher is a query language for property graph databases. If you squint slightly, an HTML document describes a property graph. Can we use openCypher for webscraping?
I’ve spent the last two years working professionally on a query engine that supports openCypher. Over those two years, I’ve come to really like a few things about the language. It’s very visual, which makes it relatively easy for non-technical folks to get started with. It’s also pretty ergonomic to write, especially when you want to query relationships between objects.
Oftentimes, when I’m working on something involving webscraping, what I really want is to query the structure of the webpage. I’m sometimes looking for a specific descendant of a div with some class, or maybe I want to extract all elements of a certain tag and then get their children. These are tasks that can be expressed easily in openCypher. We could imagine something like this:
// Get all links within a `foo` div.
MATCH (parent:div {class: "foo"})-[*]->(link:a) RETURN link.href
// Get the children of all `code` elements and collect their text contents
MATCH (parent:code)-->(child) RETURN collect(child.innerText)
Expressing the same thing in JavaScript or Python is also possible, but it’s tedious, and the syntax isn’t as intuitive in my opinion.
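For comparison, here’s a rough sketch of what the first query above could look like with BeautifulSoup (the filename page.html is just a placeholder):

from bs4 import BeautifulSoup

# Get all links within a `foo` div, the long way.
with open("page.html") as f:
    soup = BeautifulSoup(f.read(), features="html5lib")

hrefs = [
    link.get("href")
    for div in soup.find_all("div", class_="foo")
    for link in div.find_all("a")  # find_all recurses into all descendants
]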
Here’s a proof of concept (initialization can take a minute or two):
Before I started thinking about this problem, I was already writing my own open source implementation of openCypher in Python. The project is called sPyCy, and it supports most of the language. One of sPyCy’s key features is that it allows easily swapping out components of the engine, such as the underlying graph representation, the pattern matcher, and even the expression evaluator. This means that rather than just being a database engine, it’s more of a library for integrating openCypher as a frontend to whatever you want. In the context of webscraping, this means we could, in theory, write a high-performance, low-memory-overhead graph backend that parses HTML documents and doesn’t require an explicit preprocessing step. (The example we’ll implement in this blog post won’t actually beat preprocessing on efficiency; it’s just a proof of concept.)
Implementing a webscraper using sPyCy
The entry point to sPyCy is a class called CypherExecutorBase. This class is responsible for interpreting the query and invoking the relevant subcomponents. Here’s a salient snippet from its definition:
@dataclass
class CypherExecutorBase(Generic[NodeType, EdgeType]):
    graph: Graph[NodeType, EdgeType]
    table: pd.DataFrame = field(
        default_factory=lambda: CypherExecutorBase._default_table()
    )
    expr_eval: type[
        ExpressionEvaluator[NodeType, EdgeType]
    ] = ConcreteExpressionEvaluator[NodeType, EdgeType]
    matcher: type[Matcher[NodeType, EdgeType]] = DFSMatcher[NodeType, EdgeType]
In this definition, Graph, ExpressionEvaluator, and Matcher are all abstract base classes. expr_eval and matcher are not instances of a class but rather the class types themselves, and they are instantiated at runtime. graph, by contrast, is an instance. One thing to note is that the whole class is parameterized on NodeType and EdgeType, which are the types the underlying Graph will use to represent nodes and edges respectively.
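To give a feel for the swappability, here’s a hypothetical sketch of constructing an executor with a custom pattern matcher (MyCustomMatcher and my_graph are invented names; any class implementing the Matcher interface would do):

# Hypothetical: swap the pattern matcher while keeping everything else.
executor = CypherExecutorBase(
    graph=my_graph,           # an instance of some Graph implementation
    matcher=MyCustomMatcher,  # the class itself, not an instance
)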
Let’s take a look at a Graph implementation that will enable webscraping. Note that this isn’t quite what’s running in the demo above, for reasons discussed later.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Mapping, Tuple

from bs4 import BeautifulSoup

from spycy.graph import Graph

NodeType = int
EdgeType = Tuple[int, int]

@dataclass
class SoupGraph(Graph[NodeType, EdgeType]):
    """A Graph implementation that maps graph access to an HTML DOM"""

    soup: BeautifulSoup
    node_map: Dict[int, Any] = field(default_factory=dict)
    edge_map: Dict[Tuple[int, int], Any] = field(default_factory=dict)

    def __post_init__(self):
        self.node_map = {
            hash(s): {
                "labels": [s.name],
                "properties": {"attrs": s.attrs, "text": s.text},
                "obj": s,
            }
            for s in self.soup.find_all()
        }
        for n, node in self.node_map.items():
            for c in node["obj"].children:
                # Skip text nodes; only tags are nodes in our graph.
                if c.name is None:
                    continue
                self.edge_map[(n, hash(c))] = {"type": "child", "properties": {}}

    @property
    def nodes(self) -> Mapping[NodeType, Any]:
        return self.node_map

    @property
    def edges(self) -> Mapping[EdgeType, Any]:
        return self.edge_map

    def out_edges(self, node: NodeType) -> List[EdgeType]:
        el = self.node_map[node]["obj"]
        return [(node, hash(child)) for child in el.children if child.name is not None]

    def in_edges(self, node: NodeType) -> List[EdgeType]:
        el = self.node_map[node]["obj"]
        # Edge tuples point from parent to child, so the parent comes first.
        return [(hash(el.parent), node)]

    ...
This implementation uses BeautifulSoup4 to do the actual parsing of the HTML, and then builds maps to represent nodes and edges. Strictly speaking, this isn’t actually necessary. All we really need is a way to provide a unique identifier for nodes and edges, but materializing the maps makes the example simpler. This is made possible because the return types of Graph.nodes and Graph.edges are declared as Mapping instead of Dict. This means the returned object doesn’t have to materialize all nodes or edges; it could instead be an object that defines __getitem__, so that only the nodes or edges that are actually accessed get loaded.
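As a sketch of what that could look like (this isn’t part of the demo, and LazyNodeMap is an invented name), a lazy node map might defer building each node’s record until it’s looked up:

from typing import Any, Iterator, Mapping

from bs4 import BeautifulSoup

class LazyNodeMap(Mapping[int, Any]):
    """Builds each node's record on access rather than up front."""

    def __init__(self, soup: BeautifulSoup):
        # Keep only a lightweight id -> element index around.
        self._elements = {hash(el): el for el in soup.find_all()}

    def __getitem__(self, node_id: int) -> Any:
        # Construct the record on demand instead of precomputing it.
        el = self._elements[node_id]
        return {
            "labels": [el.name],
            "properties": {"attrs": el.attrs, "text": el.text},
            "obj": el,
        }

    def __iter__(self) -> Iterator[int]:
        return iter(self._elements)

    def __len__(self) -> int:
        return len(self._elements)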
Some other methods are omitted here for brevity, but a link to the full code is available below.
Next, we need to actually create an implementation of CypherExecutorBase.
from pathlib import Path

from bs4 import BeautifulSoup

from spycy.spycy import CypherExecutorBase

class SoupCypherExecutor(CypherExecutorBase[NodeType, EdgeType]):
    """Enable openCypher queries against a SoupGraph"""

    def __init__(self, input_: Path):
        """input_ should be a Path to an HTML document"""
        with input_.open() as f:
            soup = BeautifulSoup(f.read(), features="html5lib")
        soup_graph = SoupGraph(soup)
        super().__init__(graph=soup_graph)
And that’s it! Now we can load a file and run queries against it, like so:
exe = SoupCypherExecutor(args.filename)
all_div_children: List[NodeType] = list(exe.exec("match (a:div)-->(b) return b")["b"])
The full code for this example can be found here.
Making a browser demo
To make this work in the browser, we can run sPyCy via pyodide. As a result, instead of using BeautifulSoup, we can use the HTML parser within the browser itself! The code for doing so can be found here.
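As a rough sketch of that idea (assuming pyodide’s js module is available; this isn’t the demo’s exact code), the parsing step could be handed off to the browser like this:

from js import DOMParser  # pyodide exposes browser globals via the js module

def parse_html(html_text: str):
    parser = DOMParser.new()  # instantiate the JS class from Python
    doc = parser.parseFromString(html_text, "text/html")
    return doc.querySelectorAll("*")  # every element node in the document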
While this demo works, if we replaced the pattern matcher implementation entirely, we could build a more performant scraper that translates each pattern into querySelector/querySelectorAll calls.
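As a hedged sketch of that translation (pattern_to_selector is an invented helper; a real Matcher would need to handle the full pattern grammar):

def pattern_to_selector(parent_tag: str, parent_class: str, child_tag: str) -> str:
    # (:div {class: "foo"})-[*]->(:a)  =>  "div.foo a"
    # CSS's descendant combinator (a space) mirrors a variable-length path.
    return f"{parent_tag}.{parent_class} {child_tag}"

selector = pattern_to_selector("div", "foo", "a")  # "div.foo a"
# In the browser, this could then be evaluated natively:
# links = document.querySelectorAll(selector)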
Future work?
It blows my mind that with sPyCy this was achievable in less than 100 lines of Python. I think this could be extended to do so much more. For example, I can imagine that one could write a custom Graph implementation that supports webcrawling as well. There’s still work to be done on sPyCy itself, but I think it has the potential to be actually useful!