Hi, I am trying to generate some dataset statistics based on class, properties and countries.
Although class / property statistics are relatively easily computed, when I put the country dimension, the query cannot be evaluated and timeouts.
The original query was:
SELECT ?class ?country ?property (count(distinct ?s) AS ?total) WHERE {
?s a ?class .
?s ?p2 ?country .
?country a ex:country .
?s ?p ?o .
BIND(?p AS ?property)
} GROUP BY ?class ?country ?property
this provides the number of properties per class for each country.
Trying to improve this, I split this one into multiple queries by injecting the country in each query
e.g.
SELECT ?class ?country ?property (count(distinct ?s) AS ?total) WHERE {
?s a ?class .
?s ?p2 ex:nl.
ex:nl a ex:country .
BIND(ex:nl AS ?country)
?s ?p ?o .
BIND(?p AS ?property)
#FILTER
} GROUP BY ?class ?country ?property
but the query still timeouts.
before trying to split the query further, is there any other way I can optimize this query?
This is injected by the auto-query splitting code. I am reusing the original query and in order to project the country into the select variables I do this trick.
hmm, wondering if I could remove it completely and re-inject it manually when I read the results
One possibility might be to statically enumerate the predicates which connect classes to countries instead of using a variable (?p2). If the number of such predicates is low and known in advance, you can use a non-recursive property path expression: ?s (:p1 | :p2 | ... | :pn ) ?country.
If this works out, we can think how to avoid the pre-computation of predicates.
@lorenz_b I was using this approach with FILTER in the first place but it didn’t work out and tried to do direct injection in case I can get some extra gain. However, I had to introduce the extra bind statements which I am not sure if they create more delay.
@pavel, looks like hardcoding the property path works much faster, thanks for the tip!
I can live with the hardcoded string for now but would be also good to automate it somehow. If it cannot happen in the same query, I could also generate it from the ontology using the rdfs:range statements
select distinct ?s ?country where {
?s ?p2 ?country .
?country a ex:country
}
to join with the rest of the query. If the number of distinct <class, country> combinations isn’t too big, it might work.
But, in general, queries where the predicate variable isn’t bound to a constant (or a set of constants, as above) are often pretty hard so you might end up using a separate query to get the predicates and use them in the query string for the main query.
It’s a possibiity but not sure it’d be more efficient than the subquery for both ?s and ?country. The engine would have to search for countries twice (first to get the predicates, second to match triples connecting classes to countries via those predicates). But I get surprised by these things all the time, so this can be tried too. Especially if the number of relevant predicates is very small but the number of <class, country> pairs is very large.