Though it's pretty minor, it is a little involved. Might be worth making a coffee.
Here's the old copy:
http://aidanhogan.com/docs/blank_nodes_jws-old.pdf
On the right hand column in page 8, it describes how SPARQL SELECT queries can be used to discover if an RDF-graph is non-lean or not. The idea is to create a SPARQL query representing the graph itself (converting the blank nodes in the graph to distinguished variables and making it into a BGP), and run it against the graph itself. There is an example provided in the paper above.
Funnily enough, the problematic part starts with the word "Trivially", which I'm starting to understand is sort of a danger word in CS papers, even in my own writing. The old text says:
"Trivially, if a query variable is bound to a term in G other than the blank node for which it was originally created by ?, then that term is a witness for the non-leanness of the blank node and thus G is non-lean. Since blank nodes may be arbitrarily relabelled during SPARQL query evaluation, checking that the surrogate variable corresponds with its original blank node may be impossible. However, since a surrogate variable will always bind its original blank node, to check whether or not G is non-lean, one can check the solutions to see if there is any surrogate variable bound to more than one unique term."
There is an example given for which the idea works, but in general, the above method will sometimes label lean graphs as non-lean ...
... graphs like the following ...
_:a p _:b .
_:b p _:a .
We create a SPARQL query from this:
SELECT * WHERE {
?a p ?b .
?b p ?a
}
(And we remember the mapping _:a -> ?a, ?b -> ?b and its inverse.)
We run the SPARQL query against the graph itself:
=============
== ?a | ?b ==
=============
| _:x | _:y |
| _:y | _:x |
=============
Here the SPARQL engine relabelled the results in the answer, as it is entitled to do (however, the blank nodes are scoped to the entire result set, so the _:x in the first solution must correspond with that of the second solution).
Now for each "surrogate" variable (variable representing a blank node) we have multiple terms in the answers. By our original idea, the graph would be considered non-lean. Obviously it's not. The problem is that there are automorphisms in G that cause multiple "orientations" of the graph to be returned, but do not indicate non-leanness.
So okay ... taking a step back ... the solutions to the SPARQL query are a list of the homomorphisms from the graph G to itself. Now we need to find at least one homomorphism (one solution) that indicates that the graph G is non-lean, meaning that the homomorphism "maps" G to a(n isomorphic copy of a) proper subgraph of G. These solutions are ones where either (i) a ground term appears [indicating a blank node can be mapped to a ground term in the graph], and/or (ii) multiple variables are mapped to the same blank node [indicating two or more blank nodes can be "collapsed"]. This is the fix I applied in the new version. Non-trivial automorphisms (as above) should no longer cause a problem.
There was a knock-on effect for Algorithm 5, which needs not only to discover which graphs are non-lean using this method, but also which blank nodes cause the non-leanness. This was a bit trickier. In the case where a surrogate variable is mapped to a ground term, the blank node is trivially non-lean, but in the case where multiple variables are mapped to the same blank node, it's a little trickier and needed some head scratching. Consider a graph:
_:a p _:b .
_:a p _:c .
_:c p _:d .
=====================
= ?a | ?b | ?c | ?d =
=====================
| _w | _x | _y | _z |
| _w | _y | _y | _z |
=====================
(Just dropping ':' from the blank node syntax to keep the formatting correct: w ... z correspond to how the SPARQL engine renames the blank nodes a ... d respectively. The first solution is effectively the identity mapping.)
In the second solution, we see that the graph is non-lean since _:b and _:c map to the same term. But which blank node is causing the non-leanness? Is it _:b or _:c? Well, let M(bn) denote all the terms bound to the blank node bn in all solutions, e.g., M(_:b) = { _:x, _:y } above. For every pair of blank nodes (b1,b2) bound to the same term in a given solution, if |M(b1)| \geq |M(b2)|, then b1 is causing non-leanness. Otherwise, any blank node not found to be non-lean by this method, or by being bound to a ground term, is lean.
Note that in cases such as:
_:a p _:b .
_:a p _:c .
We would count both _:b and _:c as non-lean blank nodes. This is why the condition is \geq and not just >.
This is the fix I applied to Algorithm 5. There's no proof listed, but just to sketch why I think it's correct ... which is maybe not *so* direct, first let's just ignore the trivial part where blank nodes can be mapped to ground terms. Try for a contradiction ... assume that a blank node b1 is non-lean, and that there does *not* exist a pair (b1,b2) such that b1 and b2 map to the same term in a given solution and |M(b1)| \geq |M(b2)|. But (again ignoring the ground case) we can show that b1 must map to at least one other blank node b' in the solutions for b1 to be non-lean. Now by the original assumption, b' must have fewer terms in its solutions than b1. But how can b1 have strictly fewer terms in its solutions than b': a blank node it maps to? A contradiction.
In terms of the experiments themselves, I did not fix the code to rerun the new algorithm. However, the number of blank nodes that are possibly affected are like 189 out of 88.6 *million*, i.e., less than 0.00002% of all blank nodes, or something like that. So I feel it's not worth the hassle, especially given that the sorts of cases that cause problems (a subset of non-trivial automorphisms) are also probably very rare even within that small fraction. The bug has negligible effects on the empirical results presented.