Aggregate query optimization

Hi, I am trying to generate some dataset statistics based on class, properties and countries.

Although class / property statistics are relatively easily computed, when I put the country dimension, the query cannot be evaluated and timeouts.

The original query was:

SELECT ?class ?country ?property (count(distinct ?s) AS ?total) WHERE { 
   ?s a ?class .  
    ?s ?p2 ?country . 
 ?country a ex:country . 
 
    ?s ?p ?o .  
BIND(?p AS ?property)
  
} GROUP BY ?class ?country ?property

this provides the number of properties per class for each country.
Trying to improve this, I split this one into multiple queries by injecting the country in each query
e.g.

SELECT ?class ?country ?property (count(distinct ?s) AS ?total) WHERE { 
   ?s a ?class .  
    ?s ?p2 ex:nl. 
 ex:nl a ex:country . 
BIND(ex:nl AS ?country)
 
    ?s ?p ?o .  
BIND(?p AS ?property)
   
#FILTER
} GROUP BY ?class ?country ?property

but the query still timeouts.

before trying to split the query further, is there any other way I can optimize this query?

Thank you in advance, DImitris

Your second query looks odd.
If you have the country, why do you need the part

ex:nl a ex:country . 
BIND(ex:nl AS ?country)

It is already bound by the previous triple pattern

?s ?p2 ex:nl.

This is injected by the auto-query splitting code. I am reusing the original query and in order to project the country into the select variables I do this trick.
hmm, wondering if I could remove it completely and re-inject it manually when I read the results

Can you try this query manually

SELECT  ?class (SAMPLE(?c) AS ?country) (?p AS ?property) (COUNT(DISTINCT ?s) AS ?total)
WHERE
  { ?s  rdf:type  ?class ;
        ?p2       ?c
    FILTER ( ex:nl = ?c )
    ?s  ?p  ?o
  }
GROUP BY ?class ?p

And what is the size of the dataset?
And what is your setup for Stardog?

Lorenz,

Your query is syntactically invalid because the BIND introduces a variable (?country) that’s already used in the scope.

Cheers,
Pavel

Hi Pavel,

Oh yes, you’re right. There was also another mistake with the GROUPING.
Fixed now, thanks.

Regards,
Lorenz

Dimitris,

One possibility might be to statically enumerate the predicates which connect classes to countries instead of using a variable (?p2). If the number of such predicates is low and known in advance, you can use a non-recursive property path expression: ?s (:p1 | :p2 | ... | :pn ) ?country.

If this works out, we can think how to avoid the pre-computation of predicates.

Cheers,
Pavel

@lorenz_b I was using this approach with FILTER in the first place but it didn’t work out and tried to do direct injection in case I can get some extra gain. However, I had to introduce the extra bind statements which I am not sure if they create more delay.

@pavel, looks like hardcoding the property path works much faster, thanks for the tip!
I can live with the hardcoded string for now but would be also good to automate it somehow. If it cannot happen in the same query, I could also generate it from the ontology using the rdfs:range statements

Well, you can use a subquery

select distinct ?s ?country where {
  ?s ?p2 ?country . 
  ?country a ex:country
}

to join with the rest of the query. If the number of distinct <class, country> combinations isn’t too big, it might work.

But, in general, queries where the predicate variable isn’t bound to a constant (or a set of constants, as above) are often pretty hard so you might end up using a separate query to get the predicates and use them in the query string for the main query.

Cheers,
Pavel

@pavel Would it work to get the predicates in a subquery? I mean would it be more efficient and help the query optimizer?

It’s a possibiity but not sure it’d be more efficient than the subquery for both ?s and ?country. The engine would have to search for countries twice (first to get the predicates, second to match triples connecting classes to countries via those predicates). But I get surprised by these things all the time, so this can be tried too. Especially if the number of relevant predicates is very small but the number of <class, country> pairs is very large.

Cheers,
Pavel

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.