Poor performance querying predicates for a class

Hi all,

with stardog 7 the problem described in

seems to surface again.
On a dataset of ~60 million triples, the time needed to query all distinct predicates on Stardog 6.2.3 and Stardog 7.0.2 varies with the query strategy:
(Stardog 7 has less off-heap memory than 6, but according to htop neither of them hits its limit.)

select distinct ?predicate {
#pragma optimizer.filters.exists off
#the pragma does not seem to influence the query plan/retrieval speed
{ SELECT DISTINCT ?predicate {[] ?predicate []} }
FILTER EXISTS {?instance a :MyClass.}
?instance ?predicate ?any .
}

stardog 6: 18sec
stardog 7: 200sec

select distinct ?predicate
WHERE
{
FILTER EXISTS {?instance a :MyClass.}
?instance ?predicate ?any .
}

stardog 6: 200sec
stardog 7: 220sec

select distinct ?predicate
WHERE
{
?instance a :MyClass.
?instance ?predicate ?any .
}

stardog 6: 150 sec
stardog 7: 220 sec

So the issue is: with Stardog 6 there was a way of getting those distinct predicates in an acceptable time frame; with Stardog 7 this no longer seems to work.

Can this be solved by reformulating the query?

Hm, we haven't yet done the work to eliminate the need for the hint, but the

select distinct ?predicate {
#pragma optimizer.filters.exists off
{ SELECT DISTINCT ?predicate {[] ?predicate []} }
FILTER EXISTS {?instance a :MyClass.}
?instance ?predicate ?any .
}

should still work. It might be a little slower than in 6, but not >200s. It seems to work for me with 7.0.3 and the BSBM 100M data.

Could you show the plan and make sure your client/library preserves the #pragma hint?

Thanks,
Pavel

I was using Stardog Studio to execute the queries directly; in my experience the pragma is preserved there.

The data queried here differs from BSBM in two ways: we have many more (and sparse) properties (400+) and many graphs (600+), with only a few containing the data.

While compiling the information about the query plans I hopefully discovered the root cause. All of the following was executed on Stardog 6:

With no FROM clause given, the query has the following plan and executes in ~4 sec:

From all
Distinct [#1.4K]
`─ Projection(?predicate) [#14.0M]
   `─ BindJoin(?predicate) [#14.0M]
      +─ Distinct [#1.4K]
      │  `─ Projection(?predicate) [#11K]
      │     `─ Scan[PC](_, ?predicate, _) [#11K]
      `─ MergeJoin(?instance) [#16.7M]
         +─ Distinct [#2.3M]
         │  `─ Scan[POSC](?instance, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, :myClass) [#2.6M]
         `─ Scan[SPOC](?instance, ?predicate, ?any) [#126.9M]

With a single FROM clause, the query has the following plan and executes in ~10 sec:

From <mygraph1>
Distinct [#1.4K]
`─ Projection(?predicate) [#12.0M]
   `─ HashJoin(?instance) [#12.0M]
      +─ MergeJoin(?predicate) [#12.0M]
      │  +─ Distinct [#1.4K]
      │  │  `─ Projection(?predicate) [#2.5K]
      │  │     `─ Scan[PC](_, ?predicate, _) [#2.5K]
      │  `─ Scan[PSC](?instance, ?predicate, _) [#21.0M]
      `─ Distinct [#501K]
         `─ Scan[POSC](?instance, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, :myClass) [#501K]

With ~200 named graphs, the query produces the following plan and executes in ~30 seconds.
With the pragma:

From <mygraph2>
From <mygraph3>
From <mygraph4>
Distinct [#1.4K]
`─ Projection(?predicate) [#12.0M]
   `─ HashJoin(?instance) [#12.0M]
      +─ MergeJoin(?predicate) [#12.0M]
      │  +─ Distinct [#1.4K]
      │  │  `─ Projection(?predicate) [#2.7K]
      │  │     `─ Scan[PC](_, ?predicate, _) [#2.7K]
      │  `─ Scan[PSC](?instance, ?predicate, _) [#22.7M]
      `─ Distinct [#501K]
         `─ Scan[POSC](?instance, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, :myClass) [#501K]

For the sake of completeness, here is the query plan without the pragma:

Distinct [#1.4K]
`─ Projection(?predicate) [#12.0M]
   `─ HashJoin(?instance) [#12.0M]
      +─ MergeJoin(?predicate) [#12.0M]
      │  +─ Distinct [#1.4K]
      │  │  `─ Projection(?predicate) [#2.7K]
      │  │     `─ Scan[PC](_, ?predicate, _) [#2.7K]
      │  `─ Scan[PSC](?instance, ?predicate, _) [#22.7M]
      `─ Distinct [#501K]
         `─ Scan[POSC](?instance, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, :myClass) [#501K]

Thanks, I think I see what's happening. The hint is ignored in all your queries. The problem is the DISTINCT keyword in your query: there is indeed a bug where hints are ignored when a top-level DISTINCT or aggregation is used (we should prioritise it since it's an easy fix).

select ?predicate {
#pragma optimizer.filters.exists off
{ SELECT DISTINCT ?predicate {[] ?predicate []} }
FILTER EXISTS {?instance a :MyClass.}
?instance ?predicate ?any .
}

should work and should not need DISTINCT, since there's a nested DISTINCT on ?predicate and FILTER EXISTS preserves multiplicity.

Let me know if it works. If the hint is not ignored, you should see FILTER EXISTS in the query plan, not a join.

Best,
Pavel

Can confirm: omitting the DISTINCT reduces execution time to 3 seconds on Stardog 6 and to 13 seconds on Stardog 7, in all cases with the FROM clauses given.

In both cases a nice linear join plan is generated:

Projection(?predicate) [#29.6M]
`─ Filter(EXISTS {  {   ?instance rdf:type :myClass .   ?instance ?predicate ?any .  } }) [#29.6M]
   `─ Distinct [#327]
      `─ Projection(?predicate) [#327]
         `─ Scan[P_NOC](_, ?predicate, _) [#327]
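
For reference, here is a sketch of a query shape consistent with the plan above, in which both triple patterns sit inside the EXISTS block. The PREFIX IRI is a hypothetical placeholder, and the graph names are the ones shown in the earlier plans; the rest is taken from the queries in this thread.

```sparql
# Hypothetical namespace for :MyClass; substitute the database's own prefix.
PREFIX : <http://example.org/>

# No top-level DISTINCT, so the hint is not dropped by the optimizer.
SELECT ?predicate
FROM <mygraph2>
FROM <mygraph3>
FROM <mygraph4>
{
  #pragma optimizer.filters.exists off

  # Enumerate every predicate once.
  { SELECT DISTINCT ?predicate { [] ?predicate [] } }

  # Keep a predicate only if some instance of the class uses it.
  FILTER EXISTS {
    ?instance a :MyClass .
    ?instance ?predicate ?any .
  }
}
```

Since the subquery already yields each predicate once, and FILTER EXISTS only keeps or drops a row, the result should contain no duplicates. If the hint is honoured, the plan should show a Filter(EXISTS ...) node rather than a join.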