Visualisation queries OPTIONAL clauses performance

Hi

In Stardog Studio i happened to try pulling up the visualisation for a specific resource, however it eventually timed out after five minutes. The dataset is a large so i was expecting to see some performance issues as i evaluated things.

Having looked at the query generated in studio and played around a bit i've narrowed it down to a specific part which seems to be causing undue overhead for this dataset. However i'm not sure what's the best route going forward in general for the type of query pattern and of course then applying it to visualising items in such a dataset, so interested in some input.

I narrowed things down to this:

SELECT ?subject ?predicate ?predicate_2 ?object ?object_type ?object_label_1 
WHERE {
VALUES (?iri) {
    (<http://example.com/123>)
  }
  { 
    ?iri ?predicate ?object . 
    OPTIONAL { ?object a ?object_type . }
    OPTIONAL { ?object rdfs:label ?object_label_1 . }
  }
}
LIMIT 1000

If you run the query without the two OPTIONAL clauses it runs <400ms and returns 17 rows. Adding in either OPTIONAL will run in about the same time and may add a few rows due to repeated use of the properties. However when both are used together is where things go bad. The query plan for that query is shown below, as you can see the combination of the two clauses with the primary match on the IRI involve the hash join with 3B in the first instance and 450M in the second. This is with Stardog 7.6.4.

From all
Slice(offset=0, limit=1000) [#1.0K]
`─ Projection(?subject, ?predicate, ?predicate_2, ?object, ?object_type, ?object_label_1) [#119.5M]
   `─ HashJoinOuter(?object) [#119.5M]
      +─ DirectHashJoinOuter(?object) [#119.5M]
      │  +─ Bind(<http://example.com/123> AS ?iri) [#17]
      │  │  `─ Scan[SPOC](<http://example.com/123>, ?predicate, ?object) [#17]
      │  `─ Scan[PSOC](?object, rdf:type, ?object_type) [#2956.0M]
      `─ Scan[PSOC](?object, rdfs:label, ?object_label_1) [#494.2M]

So would an appropriate hint in this case help, are there alternative server or db configuration options which might help the query planner?

Thanks

Tony

Thanks, Tony. This is a well spotted issue indeed which we need to look into. I cannot offer any immediate workaround which'd work with the Viz feature in Studio, but I'll appreciate if you try the following hint when you run the query directly:

SELECT ?subject ?predicate ?predicate_2 ?object ?object_type ?object_label_1 
WHERE {
#pragma join.directHash off
VALUES (?iri) {
    (<http://example.com/123>)
  }
  { 
    ?iri ?predicate ?object . 
    OPTIONAL { ?object a ?object_type . }
    OPTIONAL { ?object rdfs:label ?object_label_1 . }
  }
}
LIMIT 1000

Please show the plan for that one too.

Best,
Pavel

Hi Pavel

Gave that one a try and didn't finish in time either. Query plan below. Would any other of the statistics generation effect the query planner decision?

Thanks

Tony

From all
Slice(offset=0, limit=1000) [#1.0K]
`─ Projection(?subject, ?predicate, ?predicate_2, ?object, ?object_type, ?object_label_1) [#119.5M]
   `─ HashJoinOuter(?object) [#119.5M]
      +─ BindJoinOuter(?object) [#119.5M]
      │  +─ Bind(<http://example.com/123> AS ?iri) [#17]
      │  │  `─ Scan[SPOC](<http://example.com/123>, ?predicate, ?object) [#17]
      │  `─ Scan[PSOC](?object, rdf:type, ?object_type) [#2956.0M]
      `─ Scan[PSOC](?object, rdfs:label, ?object_label_1) [#494.2M]

Well, kind of, we don't have accurate cardinality estimations when the variable in the predicate position is not bound (?predicate). But it's mostly about choosing wrong join algorithms here. OK, can you also add #pragma join.bind off to the next line after the previous pragma? I am just trying to verify that it can run faster with proper joins...

Thanks,
Pavel

that one did the trick, plan below.

From all
Slice(offset=0, limit=1000) [#1.0K]
`─ Projection(?subject, ?predicate, ?predicate_2, ?object, ?object_type, ?object_label_1) [#119.5M]
   `─ MergeJoinOuter(?object) [#119.5M]
      +─ MergeJoinOuter(?object) [#119.5M]
      │  +─ Sort(?object) [#17]
      │  │  `─ Bind(<http://example.com/123> AS ?iri) [#17]
      │  │     `─ Scan[SPOC](<http://example.com/123>, ?predicate, ?object) [#17]
      │  `─ Scan[PSOC](?object, rdf:type, ?object_type) [#2956.0M]
      `─ Scan[PSOC](?object, rdfs:label, ?object_label_1) [#494.2M]
1 Like

I went back to the larger query and added in the other two OPTIONAL clauses in the first part of the UNION from the visualisation query and that then fails to return. I can play with those and the other related hints and see if some combination works, i wasnt sure whether i might need to place them at a different line to effect certain levels of the query.

Tony

Can you post the full query and the plan?

Hi Pavel

I went back to the full visualisation query and got it to work, i had to reorder the OPTIONAL clauses. The one with "tag:stardog:studio:label" was causing a different join (HashJoinOuter), but moving it last made it use a merge join throughout.

Tony

# Expand From IRIs
CONSTRUCT {
  ?iri ?predicate ?object .
  ?object a ?object_type .
  ?object <tag:stardog:studio:label> ?object_label_0 .
  ?object rdfs:label ?object_label_1 .
  ?object <http://purl.org/dc/elements/1.1/title> ?object_label_2 .
  ?subject ?predicate_2 ?iri .
  ?subject a ?subject_type .
  ?subject <tag:stardog:studio:label> ?subject_label_0 .
  ?subject rdfs:label ?subject_label_1 .
  ?subject <http://purl.org/dc/elements/1.1/title> ?subject_label_2 .
} WHERE {
#pragma join.directHash off
#pragma join.bind off 
  VALUES (?iri) {
    (<http://example.com/123>)
  }
  { 
    ?iri ?predicate ?object . 
    OPTIONAL { ?object a ?object_type . }
    OPTIONAL { ?object rdfs:label ?object_label_1 . }
    OPTIONAL { ?object <http://purl.org/dc/elements/1.1/title> ?object_label_2 . }
    OPTIONAL { ?object <tag:stardog:studio:label> ?object_label_0 . }
  }
  UNION {
    ?subject ?predicate_2 ?iri .
    OPTIONAL { ?subject a ?subject_type . }
    OPTIONAL { ?subject rdfs:label ?subject_label_1 . }
    OPTIONAL { ?subject <http://purl.org/dc/elements/1.1/title> ?subject_label_2 . }
    OPTIONAL { ?subject <tag:stardog:studio:label> ?subject_label_0 . }
  }
}
LIMIT 1000

Right, this query should run without any hash joins. It's very possible that adding #pragma join.hash off as well would have worked too, and you wouldn't need to reorder optionals (that generally doesn't guarantee any specific plan).

I have made a note about this. Obviously you shouldn't be using these hints for this query, we will look into the issue.

Thanks,
Pavel

Thanks for your help with this one.

Tony

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

I know it's been a long time but I just came across this old ticket and decided to check how the newly released Stardog 8.0 does on it. It works as expected (I used the Yago2s KG):

From local
From named local named
Slice(offset=0, limit=1000) [#21]
`─ Projection(?subject, ?predicate, ?predicate_2, ?object, ?object_type, ?object_label_1) [#21]
   `─ DirectHashJoinOuter(?object) [#21]
      +─ DirectHashJoinOuter(?object) [#21]
      │  +─ Bind(<http://yago-knowledge.org/resource/wordnet_person_100007846> AS ?iri) sortedBy=?iri [#21]
      │  │  `─ Scan[SPOC](<http://yago-knowledge.org/resource/wordnet_person_100007846>, ?predicate, ?object) [#21]
      │  `─ Scan[PSOC](?object, rdf:type, ?object_type) [#9.0M]
      `─ Scan[PSOC](?object, rdfs:label, ?object_label_1) [#7.7M]

This runs in milliseconds.

Best,
Pavel