Thoughts on implementation methods for scalable computational biology software. Updates about Genotype Representation Graphs (GRGs).
by Drew DeHaas
One of the challenges with non-linear data representations, like graphs, is getting reasonable performance out of modern computers with multiple levels of caches. Caches access data in chunks called “lines” (or “blocks”; see Wikipedia), so you pay a latency penalty the first time you access a line of data, but subsequent accesses do not pay this penalty (IF they occur before the line is evicted). So random access tends to be much more expensive than linear access: e.g., accessing 1000 items in a linked list (whose nodes were allocated non-adjacently in the heap) can be much slower than accessing 1000 items in an array.
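To make that concrete, here is a minimal sketch (plain C++, not GRGL code) that sums 1000 integers from a contiguous std::vector and from a std::list whose nodes are individually heap-allocated:

```cpp
// Naive timing sketch: array traversal vs. linked-list traversal.
// A real benchmark would repeat the measurement and defeat reordering.
#include <chrono>
#include <iostream>
#include <list>
#include <vector>

int main() {
    constexpr std::size_t kCount = 1000;
    std::vector<int> arr(kCount, 1); // contiguous: many items per cache line
    std::list<int> lst(kCount, 1);   // one heap allocation per node

    auto timeSum = [](const auto& container, const char* label) {
        auto start = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int v : container) sum += v;
        auto stop = std::chrono::steady_clock::now();
        std::cout << label << ": sum=" << sum << " in "
                  << std::chrono::duration_cast<std::chrono::nanoseconds>(
                         stop - start).count()
                  << "ns\n";
    };
    timeSum(arr, "vector"); // a 64-byte line holds sixteen 4-byte ints
    timeSum(lst, "list");   // each node can incur its own cache miss
}
```

(In a long-running program the list’s nodes end up scattered across the heap; a list built in one go may happen to be nearly contiguous, which understates the gap.)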
Cache efficiency is one of the reasons that GRGL
has both a mutable
(pygrgl.load_mutable_grg)
and an immutable (pygrgl.load_immutable_grg)
format. The mutable format is essentially an edge-list format, which allows
appending and deleting edges for each node, but at the cost of each edge-list being
somewhat randomly located in the heap. The immutable GRG is in compressed sparse
row (CSR) format (a nice overview is
here),
which has (at least) two advantages. First, it is a much more compact
representation. Each edge-list in the mutable GRG has 24 bytes of overhead from
std::vector’s metadata, but the CSR format has only 4 bytes of overhead for
each “list”. Second, CSR presents all the edge lists in a single array, which
is ideal for cache efficiency, IF we can ensure that most operations will
access the data in the same order that we have stored it in. This is why, on
writing a GRG to disk, we always renumber all the nodes according to a
topological order, and then store the edge lists in the same order (as per
CSR). Almost all useful graph traversals on a GRG can be done in topological
(or reverse topological) order. When all nodes are visited, the order will be
exactly linear in our edge lists. When a subset of nodes is visited, we may
skip over large “chunks” of our CSR array, so it is possible that we only access
one node per cache line, but we are still more likely to access multiple nodes
per cache line than if we had randomly allocated our edge lists.
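A stripped-down sketch of the two layouts (the struct and function names here are illustrative, not GRGL’s actual classes) might look like this:

```cpp
#include <cstdint>
#include <vector>

using NodeID = uint32_t;

// Mutable layout: one heap-allocated vector per node. Each edge list
// carries ~24 bytes of std::vector metadata and lives wherever the
// allocator put it.
struct EdgeListGraph {
    std::vector<std::vector<NodeID>> downEdges;
};

// Immutable CSR layout: every edge list stored back to back in one
// array. Node n's edges are edges[offsets[n]] .. edges[offsets[n+1]],
// so the only per-node overhead is a single 4-byte offset entry.
struct CsrGraph {
    std::vector<uint32_t> offsets; // length = numNodes + 1
    std::vector<NodeID> edges;     // all edge lists, concatenated
};

// Concatenate the per-node lists in node-ID order, so a traversal
// that visits nodes in that same order scans `edges` linearly.
CsrGraph toCsr(const EdgeListGraph& g) {
    CsrGraph csr;
    csr.offsets.reserve(g.downEdges.size() + 1);
    csr.offsets.push_back(0);
    for (const auto& list : g.downEdges) {
        csr.edges.insert(csr.edges.end(), list.begin(), list.end());
        csr.offsets.push_back(static_cast<uint32_t>(csr.edges.size()));
    }
    return csr;
}

int main() {
    EdgeListGraph g;
    g.downEdges = {{1, 2}, {2}, {}};
    CsrGraph csr = toCsr(g); // offsets = [0, 2, 3, 3], edges = [1, 2, 2]
    return csr.edges.size() == 3 ? 0 : 1;
}
```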
There are many topological orders for most graphs. We choose a topological order based on a depth-first search (DFS) from the (arbitrarily ordered) roots of the graph. When users visit only a subset of nodes in a GRG, they are likely interested in the sub-trees beneath a set of nodes, and given our DFS-based ordering those nodes are likely to have similar node IDs and thus be nearby in the CSR array.
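In code, the renumbering idea looks roughly like the following simplified sketch (not GRGL’s implementation). A DFS post-order numbers every node below a given node before the node itself, so the result is a valid topological order in which each subtree occupies a nearly contiguous range of IDs:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

using NodeID = uint32_t;

// Map old node IDs to new ones such that every node's new ID is larger
// than the new IDs of everything reachable below it.
std::vector<NodeID> dfsTopologicalIds(
        const std::vector<std::vector<NodeID>>& downEdges,
        const std::vector<NodeID>& roots) {
    std::vector<NodeID> newId(downEdges.size());
    std::vector<bool> visited(downEdges.size(), false);
    NodeID next = 0;
    std::function<void(NodeID)> dfs = [&](NodeID node) {
        visited[node] = true;
        for (NodeID child : downEdges[node]) {
            if (!visited[child]) dfs(child);
        }
        newId[node] = next++; // post-order: children are numbered first
    };
    for (NodeID root : roots) {
        if (!visited[root]) dfs(root);
    }
    return newId;
}

int main() {
    // Tiny DAG: nodes 0 and 1 are roots; 2 and 3 are leaves.
    std::vector<std::vector<NodeID>> down = {{2, 3}, {3}, {}, {}};
    auto ids = dfsTopologicalIds(down, {0, 1});
    // ids[2] == 0, ids[3] == 1, ids[0] == 2, ids[1] == 3
    return ids[0] == 2 ? 0 : 1;
}
```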
A more recent improvement to our immutable GRG is the use of variable-length
integer encoding in the CSR edge array. For each edge list (corresponding,
e.g., to the “down edges” for node n) we sort the edges (which are just the
node IDs that each edge connects to). Then we use a variable-length integer
encoding (libvbyte) to store the differences between consecutive edges. So if
node 1 is connected to [20001, 20009, 30100, 39000], we take the differences
to get the integers [20001, 8, 10091, 8900], which can be stored in 3, 1, 2,
and 2 bytes respectively. Again, because our node numbering is based on
the topological order, most edges are between nodes that are “nearby” in the
node ordering, thus we can get something near 50% storage savings by encoding
our edges this way, even for really large graphs (e.g., with 10 million+ nodes
and a billion edges). By encoding our integers we are trading CPU effort
(we now need to decode each integer before using it) for reduced memory latency
(we now need half the memory we did before, and thus incur fewer cache misses).
libvbyte helps somewhat with this, by
using AVX instructions
to do the decoding when possible. We still end up with slightly slower graph
traversals by doing this encoding, but the 50% RAM savings seems worth it: we
can now analyze biobank-scale datasets entirely in RAM on a cheap consumer
laptop.
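For the curious, the scalar version of the delta-plus-varint idea looks roughly like this (libvbyte’s real API and its AVX decoder differ; the helper names below are made up for illustration):

```cpp
// Each byte stores 7 payload bits; the high bit says "more bytes follow".
#include <cstdint>
#include <vector>

// Append one integer as a variable-length byte sequence.
void vbyteEncode(uint32_t value, std::vector<uint8_t>& out) {
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>(value) | 0x80);
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

// Encode a sorted edge list as deltas: [20001, 20009, 30100, 39000]
// becomes [20001, 8, 10091, 8900] -> 3+1+2+2 = 8 bytes instead of 16.
std::vector<uint8_t> encodeEdgeList(const std::vector<uint32_t>& sortedEdges) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t edge : sortedEdges) {
        vbyteEncode(edge - prev, out);
        prev = edge;
    }
    return out;
}

// Decode `count` deltas and undo the differencing to recover node IDs.
std::vector<uint32_t> decodeEdgeList(const std::vector<uint8_t>& bytes,
                                     std::size_t count) {
    std::vector<uint32_t> edges;
    edges.reserve(count);
    std::size_t pos = 0;
    uint32_t prev = 0;
    for (std::size_t i = 0; i < count; i++) {
        uint32_t value = 0;
        int shift = 0;
        while (bytes[pos] & 0x80) { // continuation bit set: more to come
            value |= static_cast<uint32_t>(bytes[pos++] & 0x7F) << shift;
            shift += 7;
        }
        value |= static_cast<uint32_t>(bytes[pos++]) << shift;
        prev += value; // running sum turns deltas back into node IDs
        edges.push_back(prev);
    }
    return edges;
}

int main() {
    auto bytes = encodeEdgeList({20001, 20009, 30100, 39000});
    auto edges = decodeEdgeList(bytes, 4);
    bool ok = bytes.size() == 8 &&
              edges == std::vector<uint32_t>{20001, 20009, 30100, 39000};
    return ok ? 0 : 1;
}
```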