Poor performance on query for predicates

rnavarropiris · December 19, 2018, 3:17pm

Given a dataset with more than 20 million instances of a class (obscured in this example), I load the list of predicates associated with instances of that class using this query:

PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?predicate
FROM <https://example.org/data/>
WHERE
{   
    ?instance <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://example.org/SomeClass> .
    ?instance ?predicate ?any.
}

The following query plan is produced:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

From <https://example.org/data/>
Distinct [#474]
`─ Projection(?predicate) [#21.9M]
   `─ MergeJoin(?instance) [#21.9M]
      +─ Scan[POSC](?instance, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, <https://example.org/SomeClass>) [#2.5M]
      `─ Scan[SPOC](?instance, ?predicate, ?any) [#34.0M]

The result is returned (15 values, as expected), but the execution time is quite bad (20 seconds). My guess is that the merge join leads to a full scan of the whole database.

I took a look at the query hints hoping to find any advice on how to modify the query plan, but I don't think that any of the hints provided can help in this case.

Any proposal on how to solve this (other than persisting/caching this list)?

pavel · December 19, 2018, 8:47pm

Hi Ruben,

Yes, I acknowledge the problem. It's not really a full scan of the whole database, but it is an iteration over all assertions for instances of this type. The main problem isn't even that, though; it's DISTINCT. The join returns all triples for instances of SomeClass sorted by the instance, not by the predicate, so filtering out duplicates in a large unsorted stream is what takes most of the time. You can verify this by adding a LIMIT to your query, e.g. LIMIT 15; the query will probably return quite fast.
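As a rough illustration of why the LIMIT helps, here is a toy in-memory sketch in Python. It is not Stardog's implementation: the `join_rows` and `distinct_predicates` helpers, the hash-set deduplication, and the `ex:` data are all assumptions made up for this example. It only shows that a DISTINCT over a stream ordered by instance has to drain the whole stream, while a LIMIT lets it stop as soon as enough distinct values have been seen.

```python
from itertools import islice

RDF_TYPE = "rdf:type"

def join_rows(triples, cls):
    """Yield (instance, predicate) for every triple of every instance of cls,
    ordered by instance (mimicking the order of the merge join output)."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    for s, p, _ in sorted(triples):
        if s in instances:
            yield s, p

def distinct_predicates(rows, consumed):
    """Deduplicate predicates with a hash set; consumed[0] counts rows read."""
    seen = set()
    for _, p in rows:
        consumed[0] += 1
        if p not in seen:
            seen.add(p)
            yield p

# Hypothetical toy data: 1000 instances, each with a type triple and two
# other predicates, so the join produces 3000 rows but only 3 predicates.
triples = []
for n in range(1000):
    inst = f"ex:i{n}"
    triples += [
        (inst, RDF_TYPE, "ex:SomeClass"),
        (inst, "ex:name", "x"),
        (inst, "ex:age", "y"),
    ]

# Without a limit: every joined row must be read before DISTINCT is done.
consumed_full = [0]
all_predicates = list(
    distinct_predicates(join_rows(triples, "ex:SomeClass"), consumed_full)
)

# With a limit of 3: the pipeline stops once 3 distinct predicates are found.
consumed_limited = [0]
first_three = list(
    islice(distinct_predicates(join_rows(triples, "ex:SomeClass"), consumed_limited), 3)
)
```

In this toy run the unlimited version reads all 3000 joined rows, while the limited one stops after 3, because the first instance already carries every predicate. With real data the limit only helps if the distinct predicates show up early in the stream.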

Of course, this is not the right approach for this query. It'd be much faster to iterate over all predicates in the database (there are probably only a handful of those, hundreds?) and, for each one, check whether it occurs in some triple for some instance of SomeClass. Unfortunately, there are no hints yet to force the optimiser to do that. The closest you can get is to write it like this:

SELECT ?predicate
FROM <https://example.org/data/>
WHERE
{   
   { select distinct ?predicate { [] ?predicate [] } }
   FILTER EXISTS {
    ?instance <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://example.org/SomeClass> .
    ?instance ?predicate ?any.
  }
}

Unfortunately, the optimiser won't execute it like that: it will rewrite the filter into a join, so the performance will be roughly the same. That transformation is known to improve performance in almost all cases, just not this one. We'll add a hint to disable it as a short-term solution (before the next release) before addressing the issue in a more principled way.
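To make the intended evaluation strategy concrete (iterate over the distinct predicates and probe for one matching instance each), here is a minimal Python sketch over an in-memory list of triples. It is an illustration under assumptions, not how Stardog evaluates the query: the `predicates_of_class` function, the per-predicate subject index, and the `ex:` data are all hypothetical.

```python
RDF_TYPE = "rdf:type"

def predicates_of_class(triples, cls):
    """Return (predicates used by instances of cls, number of probes made).

    Outer loop: distinct predicates. Inner loop: stop at the first subject
    that is an instance of cls, which is what FILTER EXISTS asks for.
    """
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}

    # Assumed index: predicate -> subjects using it (stand-in for a real index).
    subjects_by_predicate = {}
    for s, p, _ in triples:
        subjects_by_predicate.setdefault(p, []).append(s)

    result, probes = [], 0
    for p, subjects in subjects_by_predicate.items():
        for s in subjects:
            probes += 1
            if s in instances:
                result.append(p)
                break  # one witness is enough, no need to scan the rest
    return result, probes

# Hypothetical toy data: 1000 instances of ex:SomeClass plus one unrelated
# resource whose ex:label predicate must not appear in the result.
triples = []
for n in range(1000):
    inst = f"ex:i{n}"
    triples += [
        (inst, RDF_TYPE, "ex:SomeClass"),
        (inst, "ex:name", "x"),
        (inst, "ex:age", "y"),
    ]
triples += [("ex:o0", RDF_TYPE, "ex:Other"), ("ex:o0", "ex:label", "z")]

predicates, probes = predicates_of_class(triples, "ex:SomeClass")
```

On this data the loop makes 4 membership probes (one per distinct predicate) instead of streaming thousands of joined rows through a DISTINCT. That is the situation where few distinct predicates and many instances favour the predicate-driven plan.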

I'm sorry that I don't have a good workaround for you at this time, I'll update the ticket if we get a better idea.

Best,
Pavel

rnavarropiris · December 21, 2018, 1:54pm

Hi Pavel,

Thanks for the response! Looking forward to the mentioned hint. Together with your query approach, it should be a satisfying solution.

Let me know if you think of a way of tricking the server into a more effective execution plan (ideally, for Stardog 5.3.6).

Best regards

Ruben

hmottestad · December 22, 2018, 12:46pm

Does group by result in the same plan?

PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT  ?predicate
FROM <https://example.org/data/>
WHERE
{   
    ?instance <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://example.org/SomeClass> .
    ?instance ?predicate ?any.
} group by ?predicate

pavel · December 24, 2018, 1:52pm

Yeah, that'd suffer from the same issue (aggregation over an unordered stream is expensive).

pavel · January 17, 2019, 8:32am

Hi Ruben,

Stardog 6.1, released yesterday, improves the situation here. The following query will now be evaluated as you'd expect, and much faster when the number of distinct predicates is low but the number of instances is high:

SELECT ?predicate
FROM <https://example.org/data/>
WHERE
{  #pragma optimizer.filters.exists off 
   { select distinct ?predicate { [] ?predicate [] } }
   FILTER EXISTS {
    ?instance <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://example.org/SomeClass> .
    ?instance ?predicate ?any.
  }
}

In the future we'll make sure the hint won't be needed and also that the optimiser could rewrite your original query into this form.

Best,
Pavel
