Poor performance querying predicates for a class

joergunbehauen · November 27, 2019, 10:44am

Hi all,

with stardog 7 the problem described in

seems to surface again.
On a dataset of ~60 million triples, the time to query for all distinct predicates of a class varies by query strategy on both Stardog 6.2.3 and Stardog 7.0.2.
(Stardog 7 has less off-heap memory than 6, but according to htop neither of them hits its limit.)

select distinct ?predicate {
#pragma optimizer.filters.exists off
#the pragma does not seem to influence the query plan/retrieval speed
{ SELECT DISTINCT ?predicate { [] ?predicate [] } }
FILTER EXISTS {?instance a :MyClass.}
?instance ?predicate ?any .
}

stardog 6: 18sec
stardog 7: 200sec

select distinct ?predicate
WHERE
{
FILTER EXISTS {?instance a :MyClass.}
?instance ?predicate ?any .
}

stardog 6: 200sec
stardog 7: 220sec

select distinct ?predicate
WHERE
{
?instance a :MyClass.
?instance ?predicate ?any .
}

stardog 6: 150 sec
stardog 7: 220 sec

So my question is this: with Stardog 6 there was a way to get those distinct predicates in an acceptable time frame, but with Stardog 7 this no longer seems to work.

Can this be solved by reformulating the query?

pavel · November 27, 2019, 10:52am

Hm, we haven't yet done the work to eliminate the need for the hint, but the query

select distinct ?predicate {
#pragma optimizer.filters.exists off
{ SELECT DISTINCT ?predicate {[] ?predicate []} }
FILTER EXISTS {?instance a :MyClass.}
?instance ?predicate ?any .
}

should still work. It might be a little slower than on 6, but not >200s. It seems to work for me with 7.0.3 and BSBM 100M data.

Could you show the plan and make sure your client/library preserves the #pragma hint?

Thanks,
Pavel

joergunbehauen · November 27, 2019, 3:50pm

I was using Stardog Studio to execute the queries directly; in my experience the pragma is preserved there.

The data queried here differs from BSBM in two ways: we have many more (and sparse) properties (400+) and many graphs (600+), with only a few of them containing the data.

While compiling the information about the query plans I discovered what is hopefully the root cause. All of the following were executed on Stardog 6.

With no FROM clause given, the query has the following plan and executes in ~4 sec:

From all
Distinct [#1.4K]
`─ Projection(?predicate) [#14.0M]
   `─ BindJoin(?predicate) [#14.0M]
      +─ Distinct [#1.4K]
      │  `─ Projection(?predicate) [#11K]
      │     `─ Scan[PC](_, ?predicate, _) [#11K]
      `─ MergeJoin(?instance) [#16.7M]
         +─ Distinct [#2.3M]
         │  `─ Scan[POSC](?instance, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, :myClass) [#2.6M]
         `─ Scan[SPOC](?instance, ?predicate, ?any) [#126.9M]

With a single FROM clause, the query has the following plan and executes in ~10 sec:

From <mygraph1>
Distinct [#1.4K]
`─ Projection(?predicate) [#12.0M]
   `─ HashJoin(?instance) [#12.0M]
      +─ MergeJoin(?predicate) [#12.0M]
      │  +─ Distinct [#1.4K]
      │  │  `─ Projection(?predicate) [#2.5K]
      │  │     `─ Scan[PC](_, ?predicate, _) [#2.5K]
      │  `─ Scan[PSC](?instance, ?predicate, _) [#21.0M]
      `─ Distinct [#501K]
         `─ Scan[POSC](?instance, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, :myClass) [#501K]

With ~200 named graphs, the query produces the following plan and executes in ~30 seconds (with the pragma):

From <mygraph2>
From <mygraph3>
From <mygraph4>
Distinct [#1.4K]
`─ Projection(?predicate) [#12.0M]
   `─ HashJoin(?instance) [#12.0M]
      +─ MergeJoin(?predicate) [#12.0M]
      │  +─ Distinct [#1.4K]
      │  │  `─ Projection(?predicate) [#2.7K]
      │  │     `─ Scan[PC](_, ?predicate, _) [#2.7K]
      │  `─ Scan[PSC](?instance, ?predicate, _) [#22.7M]
      `─ Distinct [#501K]
         `─ Scan[POSC](?instance, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, :myClass) [#501K]

For the sake of completeness, here is the query plan without the pragma:

Distinct [#1.4K]
`─ Projection(?predicate) [#12.0M]
   `─ HashJoin(?instance) [#12.0M]
      +─ MergeJoin(?predicate) [#12.0M]
      │  +─ Distinct [#1.4K]
      │  │  `─ Projection(?predicate) [#2.7K]
      │  │     `─ Scan[PC](_, ?predicate, _) [#2.7K]
      │  `─ Scan[PSC](?instance, ?predicate, _) [#22.7M]
      `─ Distinct [#501K]
         `─ Scan[POSC](?instance, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, :myClass) [#501K]

pavel · November 27, 2019, 4:07pm

Thanks, I think I see what's happening. The hint is ignored in all your queries. The problem is the top-level DISTINCT keyword in your query: there is indeed a bug where hints are ignored when top-level DISTINCT or aggregation is used (we should prioritise it since it's an easy fix). The query

select ?predicate {
#pragma optimizer.filters.exists off
{ SELECT DISTINCT ?predicate {[] ?predicate []} }
FILTER EXISTS {?instance a :MyClass.}
?instance ?predicate ?any .
}

should work and should not need the top-level DISTINCT, since there is a nested DISTINCT on ?predicate and FILTER EXISTS preserves multiplicity.

Let me know if it works. If the hint is not ignored, you should see FILTER EXISTS in the query plan, not a join.

Best,
Pavel
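The condition Pavel describes (a `#pragma` hint combined with a top-level DISTINCT) can be caught before a query is sent. The following is only a rough sketch: the function name `hint_may_be_ignored` and both regular expressions are illustrative assumptions, not part of Stardog or any client library. It does not detect the aggregation case and does not handle comment lines before SELECT.

```python
import re

# Any "#pragma <name> ..." line, as used for the optimizer hint in this thread.
PRAGMA_RE = re.compile(r"^\s*#\s*pragma\s+\S+", re.IGNORECASE | re.MULTILINE)

# "SELECT DISTINCT" as the first keyword of the query, optionally preceded
# by PREFIX/BASE declarations. Nested sub-selects are deliberately not matched.
TOP_DISTINCT_RE = re.compile(
    r"\A\s*(?:(?:prefix|base)\b[^\n]*\n\s*)*select\s+distinct\b",
    re.IGNORECASE,
)


def hint_may_be_ignored(query: str) -> bool:
    """Return True if the query has a pragma hint and a top-level DISTINCT."""
    return bool(PRAGMA_RE.search(query)) and bool(TOP_DISTINCT_RE.match(query))


# The two formulations from this thread.
QUERY_WITH_TOP_DISTINCT = """select distinct ?predicate {
#pragma optimizer.filters.exists off
{ SELECT DISTINCT ?predicate {[] ?predicate []} }
FILTER EXISTS {?instance a :MyClass.}
?instance ?predicate ?any .
}"""

QUERY_WITHOUT_TOP_DISTINCT = """select ?predicate {
#pragma optimizer.filters.exists off
{ SELECT DISTINCT ?predicate {[] ?predicate []} }
FILTER EXISTS {?instance a :MyClass.}
?instance ?predicate ?any .
}"""

print(hint_may_be_ignored(QUERY_WITH_TOP_DISTINCT))
print(hint_may_be_ignored(QUERY_WITHOUT_TOP_DISTINCT))
```

The nested `SELECT DISTINCT` in the second query is not flagged because the pattern is anchored at the start of the query text.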

joergunbehauen · November 27, 2019, 4:40pm

Can confirm: omitting the top-level DISTINCT reduces execution time to 3 seconds on Stardog 6 and to 13 seconds on Stardog 7, in all cases with the FROM clauses given.

In both cases a nice linear join plan is generated:

Projection(?predicate) [#29.6M]
`─ Filter(EXISTS {  {   ?instance rdf:type :myClass .   ?instance ?predicate ?any .  } }) [#29.6M]
   `─ Distinct [#327]
      `─ Projection(?predicate) [#327]
         `─ Scan[P_NOC](_, ?predicate, _) [#327]
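Pavel's check (the plan should show FILTER EXISTS rather than a join) can be applied to plan text mechanically. This is a minimal sketch under stated assumptions: `hint_was_honoured` is a hypothetical helper, the operator names are simply those that appear in the plans quoted in this thread, and the rule only makes sense for this particular query shape, where an honoured hint leaves no join in the plan.

```python
import re

# Join operators seen in the plans quoted in this thread. Other Stardog
# versions may print further operator names (assumption: this list is enough).
JOIN_OPERATORS = ("BindJoin", "MergeJoin", "HashJoin")

EXISTS_FILTER_RE = re.compile(r"Filter\(\s*EXISTS")


def hint_was_honoured(plan_text: str) -> bool:
    """Return True if the plan has an EXISTS filter and no join operator."""
    has_exists_filter = EXISTS_FILTER_RE.search(plan_text) is not None
    has_join = any(op + "(" in plan_text for op in JOIN_OPERATORS)
    return has_exists_filter and not has_join


# Plan lines copied from the thread, with the tree-drawing glyphs removed.
PLAN_WITH_FILTER = "\n".join([
    "Projection(?predicate) [#29.6M]",
    "Filter(EXISTS {  {   ?instance rdf:type :myClass .   ?instance ?predicate ?any .  } }) [#29.6M]",
    "Distinct [#327]",
    "Projection(?predicate) [#327]",
    "Scan[P_NOC](_, ?predicate, _) [#327]",
])

PLAN_WITH_JOIN = "\n".join([
    "Distinct [#1.4K]",
    "Projection(?predicate) [#12.0M]",
    "HashJoin(?instance) [#12.0M]",
    "MergeJoin(?predicate) [#12.0M]",
])

print(hint_was_honoured(PLAN_WITH_FILTER))
print(hint_was_honoured(PLAN_WITH_JOIN))
```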

system · December 11, 2019, 4:40pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
