I am normally very happy with Stardog's performance. However, sometimes operations that look simple (at least from a SQL perspective) take a very long time. For example, I am trying to delete some triples. Before deleting, I check that the selection is correct by running
SELECT ?s ?p ?o
WHERE {
?s a ex:Document .
FILTER NOT EXISTS { ?ss ?pp ?s } .
}
So: give me all documents that are not referenced by anything.
This runs until a "Server Error" is raised (timeout of 300000).
The query uses reasoning, since Document has subclasses that have to be obtained.
Without the FILTER, the query takes around 250 ms.
There are around 200,000 ex:Documents.
What I need seems quite simple and straightforward for my small data set. How can I make this query run in a reasonable time?
The problem here is the ?ss ?pp ?s pattern, specifically, the unbound predicate. With reasoning this is extremely expensive (in fact it's not even valid w.r.t. the SPARQL Entailment Regimes spec as it's not a first-order query atom and is known to cause performance problems). What you probably want is to evaluate ?s a ex:Document with reasoning and FILTER NOT EXISTS { ?ss ?pp ?s } without reasoning, which would be efficient. Unfortunately it's currently not possible, but we have an open ticket for such a thing.
I see two possibilities:
i) you turn off reasoning and capture some of it in the query, e.g. ?s a/rdfs:subClassOf* ex:Document, or use a suitable rewriting in terms of subclasses.
ii) you keep reasoning on but rewrite the FILTER body so that all predicates are bound and all triple patterns with rdf:type have a bound object. Then it should also work fine.
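A minimal sketch of option (i), to be run with reasoning turned off. The ex: namespace URI is an assumption, and the property path only covers the subclass hierarchy, not any other inference the original query may have relied on:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/>   # assumed namespace, replace with your own

# Run WITHOUT reasoning: the subclass closure is spelled out in the path,
# so the unbound-predicate pattern is evaluated over plain data only.
SELECT ?s
WHERE {
  ?s a/rdfs:subClassOf* ex:Document .
  FILTER NOT EXISTS { ?ss ?pp ?s }
}
```

Only ?s is projected here, because ?p and ?o are never bound in the original query.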
Would a SERVICE call possibly work as well? I can't recall whether reasoning is applied to SERVICE calls, but if it is, could you reverse the setup: execute the outer query without reasoning and turn reasoning on for the SERVICE query via the service URL?
You'd be executing a SERVICE query against the same DB, which is a little strange, and you'd have the usual SERVICE query performance caveats, but I figured it might work.
SELECT ?s ?p ?o
WHERE {
?s a ex:Document .
FILTER NOT EXISTS { ?ss ex:XXX ?s } .
}
This actually finishes much faster, with the drawback that I have to list each XXX predicate instead of using a single pattern. One solution is to create a parent property YYY that has all the XXX properties as sub-properties. This also seems to work.
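As a rough sketch, the parent-property variant could look like this, with reasoning kept on. It assumes the ex: namespace shown and that each referencing property has been declared as a sub-property of the placeholder ex:YYY:

```sparql
PREFIX ex: <http://example.org/>   # assumed namespace, replace with your own

# Assumes the schema already contains, for every referencing property:
#   ex:XXX rdfs:subPropertyOf ex:YYY .
# Run WITH reasoning so that ex:YYY matches all of its sub-properties.
SELECT ?s
WHERE {
  ?s a ex:Document .
  FILTER NOT EXISTS { ?ss ex:YYY ?s }
}
```

The predicate inside the FILTER is now bound, which is the condition named in option ii) above.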
I know that in theory the predicate is very expensive. But, in practice, if you know that your dataset has only 20 properties, it seems feasible to work with it. What I mean is that sometimes even the simplest things (in practice) entail extremely complicated theoretical problems, but because we manage a finite, limited, known dataset these theoretical problems can be easily solved. But again I am not an expert.
If the number of properties is small, I'd always go for enumeration by means of either FILTER(?p in (:p1, :p2, ...)) or VALUES ?p {:p1 :p2 ...} such that an index can be used.
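For illustration, the VALUES form of that enumeration might be written as below. The ex: namespace and the property names ex:p1 and ex:p2 are placeholders, not properties from the actual dataset:

```sparql
PREFIX ex: <http://example.org/>   # assumed namespace, replace with your own

SELECT ?s
WHERE {
  ?s a ex:Document .
  FILTER NOT EXISTS {
    # Enumerate the known predicates so that ?pp is bound
    # before the triple pattern is matched.
    VALUES ?pp { ex:p1 ex:p2 }   # placeholder property list
    ?ss ?pp ?s
  }
}
```

The list has to be kept in sync with the schema by hand, which is manageable when there are only a couple of dozen properties.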
And yes, a parent property should also work, but then one might also consider adding ?p a owl:ObjectProperty (or rdf:Property) to the pattern, if this typing information is available in the data.