I have a dataset with more than 20 million instances of a class (obscured in this example). I load the list of predicates associated with instances of that class using this query:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?predicate
FROM <https://example.org/data/>
WHERE
{
    ?instance <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://example.org/SomeClass> .
    ?instance ?predicate ?any .
}
The result is correct (it contains 15 values, as expected), but the execution time is quite bad (20 seconds). My guess is that the merge join leads to a full scan of the whole database.
I took a look at the query hints, hoping to find a way to modify the query plan, but I don't think any of the available hints can help in this case.
Any suggestions on how to solve this (other than persisting or caching the list)?
Yes, I acknowledge the problem. It's not really a full scan of the whole database, but it is an iteration over all assertions for instances of this type. Even that isn't the main problem, though: DISTINCT is. The join returns all triples for instances of SomeClass sorted by instance, not by predicate, so filtering duplicates out of a large unsorted stream is what takes most of the time. You can verify this by adding a LIMIT to your query, such as LIMIT 15; the query will probably return quite fast.
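As a sketch of that check, the original query with only a LIMIT clause appended might look like the following. The value 15 is taken from the number of predicates reported above; with a different dataset the limit would need adjusting.

```sparql
# Same query as in the question, with LIMIT added so evaluation
# can stop once 15 distinct predicates have been produced.
SELECT DISTINCT ?predicate
FROM <https://example.org/data/>
WHERE
{
    ?instance <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://example.org/SomeClass> .
    ?instance ?predicate ?any .
}
LIMIT 15
```

If this version returns quickly while the unlimited one takes around 20 seconds, that would be consistent with duplicate elimination, rather than the join itself, dominating the run time.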
Of course, this is not the right approach for this query. It would be much faster to iterate over all predicates in the database (there are probably only a handful of those, perhaps hundreds) and, for each one, check whether it occurs in some triple for some instance of SomeClass. Unfortunately, there are no hints yet to force the optimiser to do that. The closest you can get is to write it like this:
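The query itself is not reproduced here. One possible shape for it, assuming a subquery that enumerates the distinct predicates and a FILTER EXISTS that checks each one against instances of SomeClass, is sketched below. The variable names ?s and ?o are placeholders, and this is a guess at the form described, not necessarily the exact query that was posted.

```sparql
# Hypothetical reconstruction: enumerate predicates first, then test each one.
SELECT ?predicate
FROM <https://example.org/data/>
WHERE
{
    # Inner query: every distinct predicate used anywhere in the data.
    { SELECT DISTINCT ?predicate WHERE { ?s ?predicate ?o } }

    # Keep a predicate only if some instance of SomeClass uses it.
    FILTER EXISTS {
        ?instance <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://example.org/SomeClass> .
        ?instance ?predicate ?any .
    }
}
```

Written this way, the intent is that the existence check runs once per predicate and can stop at the first matching triple, instead of deduplicating every triple of every instance.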
Unfortunately, the optimiser won't execute it like that. It will rewrite the filter into a join, so the performance will be roughly the same. That transformation is known to improve performance in almost all cases, just not this one. We'll add a hint to disable it as a short-term solution (before the next release), and then address the issue in a more principled way.
I'm sorry that I don't have a good workaround for you at this time. I'll update the ticket if we get a better idea.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?predicate
FROM <https://example.org/data/>
WHERE
{
    ?instance <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://example.org/SomeClass> .
    ?instance ?predicate ?any .
}
GROUP BY ?predicate
Stardog 6.1, released yesterday, improves the situation here. The query will now evaluate as you'd expect, and much faster, when the number of distinct predicates is low but the number of instances is high: