Drew DeHaas

Thoughts on implementation methods for scalable computational biology software. Updates about Genotype Representation Graphs (GRGs).

View My GitHub Profile

24 March 2025

Compressed CSR representation for efficient graph traversal

by Drew DeHaas

One of the challenges with non-linear data representations, like graphs, is getting reasonable performance out of modern computers with multiple levels of caches. Caches access data in chunks called “lines” (or “blocks” - see wikipedia), so you pay a latency penalty on accessing a line of data the first time, but subsequent accesses do not pay this penalty (IF they are performed before eviction). So random access tends to be much more expensive than linear access: e.g., accessing 1000 items in a linked list (which were allocated non-adjacently in the heap) can be much slower than accessing 1000 items in an array.
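As a back-of-the-envelope illustration (assuming 64-byte cache lines and 8-byte items; the real numbers are machine-dependent, and this toy model ignores prefetching), we can count how many distinct cache lines each access pattern touches:

```python
# Hypothetical sketch: count the distinct 64-byte cache lines covering a
# set of byte addresses. Assumes 8-byte items and a 64-byte line size.
LINE_SIZE = 64
ITEM_SIZE = 8

def lines_touched(addresses):
    """Number of distinct cache lines covering the given byte addresses."""
    return len({addr // LINE_SIZE for addr in addresses})

# 1000 items stored contiguously (array-like): addresses 0, 8, 16, ...
sequential = [i * ITEM_SIZE for i in range(1000)]

# 1000 items scattered across the heap (linked-list-like): pretend each
# node happened to land 4 KiB from the previous one.
scattered = [i * 4096 for i in range(1000)]

print(lines_touched(sequential))  # 125 -> ~8 items served per line fetch
print(lines_touched(scattered))   # 1000 -> one line fetch per item
```

Under this model the scattered layout fetches eight times as many cache lines for the same 1000 items, which is roughly the effect described above.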

Cache efficiency is one of the reasons that GRGL has both a mutable (pygrgl.load_mutable_grg) and immutable (pygrgl.load_immutable_grg) format. The mutable format is essentially an edge-list format, which allows appending/deleting edges for each node, but at the cost of each edge list being somewhat randomly located in the heap. The immutable GRG is in compressed sparse row (CSR) format (a nice overview is here), which has (at least) two advantages. First, it is a much more compact representation: each edge list in the mutable GRG has 24 bytes of overhead from std::vector’s metadata, but the CSR format has only 4 bytes of overhead for each “list”. Second, CSR presents all the edge lists in a single array, which is ideal for cache efficiency, IF we can ensure that most operations will access the data in the same order that we have stored it in.

This is why, on writing a GRG to disk, we always renumber all the nodes according to a topological order, and then store the edge lists in the same order (as per CSR). Almost all useful graph traversals on a GRG can be done in topological (or reverse topological) order. When all nodes are visited, the access order is exactly linear over our edge lists. When only a subset of nodes is visited, we may skip large “chunks” of the CSR array, so it is possible that we only access one node per cache line, but we are still more likely to access multiple nodes per cache line than if we had randomly allocated our edge lists.
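A minimal CSR sketch (illustrative only, not the GRGL implementation) makes the layout concrete: all edge lists live in one flat array, and a per-node offset array marks where each list starts, so each node costs a single offset entry rather than a full vector header:

```python
# Toy CSR builder: edge_lists[i] holds the edge targets for node i.
def build_csr(edge_lists):
    offsets = [0]   # offsets[i]..offsets[i+1] spans node i's edges
    edges = []      # all edge lists, concatenated in node order
    for lst in edge_lists:
        edges.extend(lst)
        offsets.append(len(edges))
    return offsets, edges

def out_edges(offsets, edges, node):
    """Slice out one node's edge list from the flat array."""
    return edges[offsets[node]:offsets[node + 1]]

offsets, edges = build_csr([[1, 2], [2], [], [0, 1, 2]])
print(offsets)                        # [0, 2, 3, 3, 6]
print(out_edges(offsets, edges, 3))   # [0, 1, 2]
```

Visiting nodes 0, 1, 2, 3 in order walks the `edges` array strictly left to right, which is the linear access pattern the paragraph above is after.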

example CSR graph

An example graph (a) and the corresponding CSR node/edge representation (b)



There are many topological orders for most graphs. We choose the topological order that is based on a depth-first search (DFS) from the (arbitrarily ordered) roots of the graph. When people only visit a subset of nodes in a GRG, they are likely interested in sub-trees beneath a set of nodes, and given our DFS-based ordering those nodes are likely to have similar node IDs and thus be nearby in the CSR array.
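One standard way to get such an ordering is a DFS that records nodes in post-order and then reverses the list, so every parent is numbered before its children and nodes in the same subtree tend to get nearby IDs. This is a hypothetical sketch of that idea, not the GRGL code:

```python
# Assign topological IDs via reverse DFS post-order (toy example; node
# names and the `children` adjacency dict are made up for illustration).
def dfs_topo_ids(roots, children):
    post = []
    seen = set()
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for child in children.get(node, []):
            visit(child)
        post.append(node)  # post-order: children recorded before parents
    for root in roots:
        visit(root)
    # Reversing post-order puts every parent before all of its children.
    return {node: i for i, node in enumerate(reversed(post))}

children = {"r": ["a", "b"], "a": ["x", "y"], "b": ["y"]}
print(dfs_topo_ids(["r"], children))
# {'r': 0, 'b': 1, 'a': 2, 'y': 3, 'x': 4}
```

Note how the subtree under "a" ({a, y, x}) ends up with the consecutive IDs 2, 3, 4 here, so a traversal of just that subtree stays within one contiguous region of the CSR array.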

A more recent improvement to our immutable GRG is the use of variable-length integer encoding in the CSR edge array. For each edge list (corresponding, e.g., to the “down edges” for node n) we sort the edges (which are just the node IDs that each edge connects to). Then we use a variable-length integer encoding (libvbyte) to store the differences between consecutive sorted edges. So if node 1 is connected to [20001, 20009, 30100, 39000], we take the differences to get the integers [20001, 8, 10091, 8900], which can be stored in 3, 1, 2, 2 bytes respectively. Again, because our node numbering is based on the topological order, most edges are between nodes that are “nearby” in the node ordering, so we get something near 50% storage savings by encoding our edges this way, even for really large graphs (e.g., with 10 million+ nodes and a billion edges).

By encoding our integers we are trading CPU effort (each integer must now be decoded before use) for reduced memory latency (we need half the memory we did before, and thus incur fewer cache misses). libvbyte helps somewhat with the CPU cost by using AVX instructions to do the decoding when possible. We still end up with a slightly slower graph traversal by doing this encoding, but the 50% RAM savings seems worth it: we can now analyze biobank-scale datasets entirely in RAM on a cheap consumer laptop.
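The delta-plus-varint scheme can be sketched in a few lines. This is a simplified stand-in, not the actual libvbyte code (which uses SIMD for decoding), but it follows the standard 7-data-bits-per-byte varint layout:

```python
# Encode one unsigned integer as a varint: 7 payload bits per byte, with
# the high bit set on every byte except the last.
def varint_encode(value):
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

# Encode a sorted edge list by storing the gap to the previous edge
# instead of the absolute node ID.
def encode_edge_list(sorted_targets):
    encoded = bytearray()
    prev = 0
    for target in sorted_targets:
        encoded += varint_encode(target - prev)
        prev = target
    return bytes(encoded)

edges = [20001, 20009, 30100, 39000]
# Deltas [20001, 8, 10091, 8900] take 3 + 1 + 2 + 2 = 8 bytes,
# versus 4 * 4 = 16 bytes for raw 32-bit node IDs.
print(len(encode_edge_list(edges)))  # 8
```

Small gaps fit in one byte, and because topological numbering keeps most edge endpoints close together, small gaps dominate, which is where the roughly 50% savings comes from.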

tags: GRG - performance - caching