Aggregate query optimization

jimkont · July 31, 2017, 8:50am

Hi, I am trying to generate some dataset statistics based on class, properties and countries.

Although class / property statistics are relatively easily computed, when I put the country dimension, the query cannot be evaluated and timeouts.

The original query was:

SELECT ?class ?country ?property (count(distinct ?s) AS ?total) WHERE { 
   ?s a ?class .  
    ?s ?p2 ?country . 
 ?country a ex:country . 
 
    ?s ?p ?o .  
BIND(?p AS ?property)
  
} GROUP BY ?class ?country ?property

this provides the number of properties per class for each country.
Trying to improve this, I split this one into multiple queries by injecting the country in each query
e.g.

SELECT ?class ?country ?property (count(distinct ?s) AS ?total) WHERE { 
   ?s a ?class .  
    ?s ?p2 ex:nl. 
 ex:nl a ex:country . 
BIND(ex:nl AS ?country)
 
    ?s ?p ?o .  
BIND(?p AS ?property)
   
#FILTER
} GROUP BY ?class ?country ?property

but the query still timeouts.

before trying to split the query further, is there any other way I can optimize this query?

Thank you in advance, DImitris

lorenz_b · July 31, 2017, 9:05am

Your second query looks odd.
If you have the country, why do you need the part

ex:nl a ex:country . 
BIND(ex:nl AS ?country)

It is already bound by the previous triple pattern

?s ?p2 ex:nl.

jimkont · July 31, 2017, 9:17am

This is injected by the auto-query splitting code. I am reusing the original query and in order to project the country into the select variables I do this trick.
hmm, wondering if I could remove it completely and re-inject it manually when I read the results

lorenz_b · July 31, 2017, 9:23am

Can you try this query manually

SELECT  ?class (SAMPLE(?c) AS ?country) (?p AS ?property) (COUNT(DISTINCT ?s) AS ?total)
WHERE
  { ?s  rdf:type  ?class ;
        ?p2       ?c
    FILTER ( ex:nl = ?c )
    ?s  ?p  ?o
  }
GROUP BY ?class ?p

And what is the size of the dataset?
And what is your setup for Stardog?

pavel · July 31, 2017, 11:11am

Lorenz,

Your query is syntactically invalid because the BIND introduces a variable (?country) that’s already used in the scope.

Cheers,
Pavel

lorenz_b · July 31, 2017, 11:22am

Hi Pavel,

Oh yes, you’re right. There was also another mistake with the GROUPING.
Fixed now, thanks.

Regards,
Lorenz

pavel · July 31, 2017, 11:25am

Dimitris,

One possibility might be to statically enumerate the predicates which connect classes to countries instead of using a variable (?p2). If the number of such predicates is low and known in advance, you can use a non-recursive property path expression: ?s (:p1 | :p2 | ... | :pn ) ?country.

If this works out, we can think how to avoid the pre-computation of predicates.

Cheers,
Pavel

jimkont · August 1, 2017, 5:40am

@lorenz_b I was using this approach with FILTER in the first place but it didn’t work out and tried to do direct injection in case I can get some extra gain. However, I had to introduce the extra bind statements which I am not sure if they create more delay.

@pavel, looks like hardcoding the property path works much faster, thanks for the tip!
I can live with the hardcoded string for now but would be also good to automate it somehow. If it cannot happen in the same query, I could also generate it from the ontology using the rdfs:range statements

pavel · August 1, 2017, 10:34am

Well, you can use a subquery

select distinct ?s ?country where {
  ?s ?p2 ?country . 
  ?country a ex:country
}

to join with the rest of the query. If the number of distinct <class, country> combinations isn’t too big, it might work.

But, in general, queries where the predicate variable isn’t bound to a constant (or a set of constants, as above) are often pretty hard so you might end up using a separate query to get the predicates and use them in the query string for the main query.

Cheers,
Pavel

lorenz_b · August 1, 2017, 11:15am

@pavel Would it work to get the predicates in a subquery? I mean would it be more efficient and help the query optimizer?

pavel · August 1, 2017, 11:24am

It’s a possibiity but not sure it’d be more efficient than the subquery for both ?s and ?country. The engine would have to search for countries twice (first to get the predicates, second to match triples connecting classes to countries via those predicates). But I get surprised by these things all the time, so this can be tried too. Especially if the number of relevant predicates is very small but the number of <class, country> pairs is very large.

Cheers,
Pavel

system · August 15, 2017, 11:24am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Query Optimization Support	3	440	June 9, 2017
Query endpoint optimisation Support	4	475	July 22, 2020
Issues with tutorials in Stardog Studio Stardog Studio	1	376	January 31, 2023
SELECT query using 'GROUP BY' returning unexpected results (application/sparql-results+json) Bug	1	342	February 28, 2022
Query performance when using VALUES goes from 30 ms to > 5 minutes Support	4	505	February 28, 2017

Aggregate query optimization

Related topics