Poor query plan for reasoning of subPropertyOf

Hi,

I have only 5 Citation resources that are linked to organizations through an isFrom object property. I want the names of those 5 organizations, but the names are asserted with different data sub-properties, so I query the super-property and rely on inference to get all of them.

That query should be very fast, because only 5 Citations and 5 Organizations are involved, but it is very slow. The query plan shows why: all the organizations are scanned instead of going directly to the 5 needed organizations.

It seems that the indexes are not used. I am only starting to use reasoning in Stardog, and maybe I am missing a necessary index, statistics, or something else. However, I already have the following covered:
(1) All the needed ontologies are in different named graphs and those are registered in the Reasoning Schema Graphs database parameter.
(2) Since I am using named graphs, the index named graphs parameter is set.
(3) The automatic statistics parameter is set.
(4) Since I am doing simple reasoning for now, I have set the reasoning type to RDFS.

Thanks

Distinct [#10]
`─ Projection(?eventDate, ?fromOrg, ?fromOrgNm, ?author, ?authorNm, ?eventText, ?eventLocfrom) [#10]
   `─ LoopJoinOuter(_) [#10]
      +─ Union [#10]
      │  +─ Union [#5]
      │  │  +─ Union [#2]
      │  │  │  +─ MergeJoin(?fromOrg) [#1]
      │  │  │  │  +─ Sort(?fromOrg) [#1]
      │  │  │  │  │  `─ MergeJoin(?author) [#1]
      │  │  │  │  │     +─ Sort(?author) [#1]
      │  │  │  │  │     │  `─ NaryJoin(?cit) [#1]
      │  │  │  │  │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-text, ?eventText) [#5]
      │  │  │  │  │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-Organization-isFrom, ?fromOrg) [#5]
      │  │  │  │  │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-Person-hasAuthor, ?author) [#5]
      │  │  │  │  │     │     `─ Scan[PSOC](?cit, onto:PersonCitation-eventDate, ?eventDate) [#5]
      │  │  │  │  │     `─ Scan[PSOC](?author, onto:firstName, ?authorNm) [#2]
      │  │  │  │  `─ Scan[PSOC](?fromOrg, <http://ld.thomsonreuters.com/feed/schema/commonName>, ?fromOrgNm) [#5.0M]
      │  │  │  `─ MergeJoin(?author) [#1]
      │  │  │     +─ Sort(?author) [#1]
      │  │  │     │  `─ MergeJoin(?fromOrg) [#1]
      │  │  │     │     +─ Sort(?fromOrg) [#1]
      │  │  │     │     │  `─ NaryJoin(?cit) [#1]
      │  │  │     │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-text, ?eventText) [#5]
      │  │  │     │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-Organization-isFrom, ?fromOrg) [#5]
      │  │  │     │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-Person-hasAuthor, ?author) [#5]
      │  │  │     │     │     `─ Scan[PSOC](?cit, onto:PersonCitation-eventDate, ?eventDate) [#5]
      │  │  │     │     `─ Scan[PSOC](?fromOrg, <http://permid.org/ontology/organization/hasAKAName>, ?fromOrgNm) [#1.3M]
      │  │  │     `─ Scan[PSOC](?author, onto:firstName, ?authorNm) [#2]
      │  │  `─ Union [#3]
      │  │     +─ MergeJoin(?author) [#1]
      │  │     │  +─ Sort(?author) [#1]
      │  │     │  │  `─ MergeJoin(?fromOrg) [#1]
      │  │     │  │     +─ Sort(?fromOrg) [#1]
      │  │     │  │     │  `─ NaryJoin(?cit) [#1]
      │  │     │  │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-text, ?eventText) [#5]
      │  │     │  │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-Organization-isFrom, ?fromOrg) [#5]
      │  │     │  │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-Person-hasAuthor, ?author) [#5]
      │  │     │  │     │     `─ Scan[PSOC](?cit, onto:PersonCitation-eventDate, ?eventDate) [#5]
      │  │     │  │     `─ Scan[PSOC](?fromOrg, <http://permid.org/ontology/organization/hasOfficialName>, ?fromOrgNm) [#5.3M]
      │  │     │  `─ Scan[PSOC](?author, onto:firstName, ?authorNm) [#2]
      │  │     `─ Union [#2]
      │  │        +─ MergeJoin(?author) [#1]
      │  │        │  +─ Sort(?author) [#1]
      │  │        │  │  `─ MergeJoin(?fromOrg) [#1]
      │  │        │  │     +─ Sort(?fromOrg) [#1]
      │  │        │  │     │  `─ NaryJoin(?cit) [#1]
      │  │        │  │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-text, ?eventText) [#5]
      │  │        │  │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-Organization-isFrom, ?fromOrg) [#5]
      │  │        │  │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-Person-hasAuthor, ?author) [#5]
      │  │        │  │     │     `─ Scan[PSOC](?cit, onto:PersonCitation-eventDate, ?eventDate) [#5]
      │  │        │  │     `─ Scan[PSOC](?fromOrg, <http://permid.org/ontology/organization/hasShortName>, ?fromOrgNm) [#4.9M]
      │  │        │  `─ Scan[PSOC](?author, onto:firstName, ?authorNm) [#2]
      │  │        `─ MergeJoin(?author) [#1]
      │  │           +─ Sort(?author) [#1]
      │  │           │  `─ MergeJoin(?fromOrg) [#1]
      │  │           │     +─ Sort(?fromOrg) [#1]
      │  │           │     │  `─ NaryJoin(?cit) [#1]
      │  │           │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-text, ?eventText) [#5]
      │  │           │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-Organization-isFrom, ?fromOrg) [#5]
      │  │           │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-Person-hasAuthor, ?author) [#5]
      │  │           │     │     `─ Scan[PSOC](?cit, onto:PersonCitation-eventDate, ?eventDate) [#5]
      │  │           │     `─ Scan[PSOC](?fromOrg, onto:Organization-name, ?fromOrgNm) [#1]
      │  │           `─ Scan[PSOC](?author, onto:firstName, ?authorNm) [#2]
      │  `─ MergeJoin(?author) [#5]
      │     +─ Sort(?author) [#5]
      │     │  `─ Union [#5]
      │     │     +─ Union [#2]
      │     │     │  +─ MergeJoin(?fromOrg) [#1]
      │     │     │  │  +─ Sort(?fromOrg) [#1]
      │     │     │  │  │  `─ NaryJoin(?cit) [#1]
      │     │     │  │  │     +─ Scan[PSOC](?cit, onto:PersonCitation-text, ?eventText) [#5]
      │     │     │  │  │     +─ Scan[PSOC](?cit, onto:PersonCitation-Organization-isFrom, ?fromOrg) [#5]
      │     │     │  │  │     +─ Scan[PSOC](?cit, onto:PersonCitation-Person-hasAuthor, ?author) [#5]
      │     │     │  │  │     `─ Scan[PSOC](?cit, onto:PersonCitation-eventDate, ?eventDate) [#5]
      │     │     │  │  `─ Scan[PSOC](?fromOrg, <http://ld.thomsonreuters.com/feed/schema/commonName>, ?fromOrgNm) [#5.0M]
      │     │     │  `─ MergeJoin(?fromOrg) [#1]
      │     │     │     +─ Sort(?fromOrg) [#1]
      │     │     │     │  `─ NaryJoin(?cit) [#1]
      │     │     │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-text, ?eventText) [#5]
      │     │     │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-Organization-isFrom, ?fromOrg) [#5]
      │     │     │     │     +─ Scan[PSOC](?cit, onto:PersonCitation-Person-hasAuthor, ?author) [#5]
      │     │     │     │     `─ Scan[PSOC](?cit, onto:PersonCitation-eventDate, ?eventDate) [#5]
      │     │     │     `─ Scan[PSOC](?fromOrg, <http://permid.org/ontology/organization/hasAKAName>, ?fromOrgNm) [#1.3M]
      │     │     `─ Union [#3]
      │     │        +─ MergeJoin(?fromOrg) [#1]
      │     │        │  +─ Sort(?fromOrg) [#1]
      │     │        │  │  `─ NaryJoin(?cit) [#1]
      │     │        │  │     +─ Scan[PSOC](?cit, onto:PersonCitation-text, ?eventText) [#5]
      │     │        │  │     +─ Scan[PSOC](?cit, onto:PersonCitation-Organization-isFrom, ?fromOrg) [#5]
      │     │        │  │     +─ Scan[PSOC](?cit, onto:PersonCitation-Person-hasAuthor, ?author) [#5]
      │     │        │  │     `─ Scan[PSOC](?cit, onto:PersonCitation-eventDate, ?eventDate) [#5]
      │     │        │  `─ Scan[PSOC](?fromOrg, <http://permid.org/ontology/organization/hasOfficialName>, ?fromOrgNm) [#5.3M]
      │     │        `─ Union [#2]
      │     │           +─ MergeJoin(?fromOrg) [#1]
      │     │           │  +─ Sort(?fromOrg) [#1]
      │     │           │  │  `─ NaryJoin(?cit) [#1]
      │     │           │  │     +─ Scan[PSOC](?cit, onto:PersonCitation-text, ?eventText) [#5]
      │     │           │  │     +─ Scan[PSOC](?cit, onto:PersonCitation-Organization-isFrom, ?fromOrg) [#5]
      │     │           │  │     +─ Scan[PSOC](?cit, onto:PersonCitation-Person-hasAuthor, ?author) [#5]
      │     │           │  │     `─ Scan[PSOC](?cit, onto:PersonCitation-eventDate, ?eventDate) [#5]
      │     │           │  `─ Scan[PSOC](?fromOrg, <http://permid.org/ontology/organization/hasShortName>, ?fromOrgNm) [#4.9M]
      │     │           `─ MergeJoin(?fromOrg) [#1]
      │     │              +─ Sort(?fromOrg) [#1]
      │     │              │  `─ NaryJoin(?cit) [#1]
      │     │              │     +─ Scan[PSOC](?cit, onto:PersonCitation-text, ?eventText) [#5]
      │     │              │     +─ Scan[PSOC](?cit, onto:PersonCitation-Organization-isFrom, ?fromOrg) [#5]
      │     │              │     +─ Scan[PSOC](?cit, onto:PersonCitation-Person-hasAuthor, ?author) [#5]
      │     │              │     `─ Scan[PSOC](?cit, onto:PersonCitation-eventDate, ?eventDate) [#5]
      │     │              `─ Scan[PSOC](?fromOrg, onto:Organization-name, ?fromOrgNm) [#1]
      │     `─ Scan[PSOC](?author, <http://www.w3.org/2006/vcard/ns#given-name>, ?authorNm) [#4.4M]
      `─ Empty [#1]

Are you able to share the data set and/or query with us as well? Sharing privately will work if the data is sensitive, and we also have an obfuscation feature.

I am sorry but we are using external data sources that I am not allowed to share. That includes a set of ontologies and tons of assertions.

I am under the impression that the indexes are not used because of some missing setup, and I was hoping to identify the missing pieces from what I have described and supplied.

Is that possible?

Thanks for your help

We might be able to help with something like a query optimizer hint, but we would need to see the original query in order to do so. The actual data is less necessary.

select ?eventDate ?fromOrg ?fromOrgNm ?author ?authorNm ?eventText ?eventLocfrom
from list of different graphs ...
where {
  ?cit rdf:type onto:PersonCitation .
  ?cit onto:PersonCitation-eventDate ?eventDate .
  ?cit onto:PersonCitation-text ?eventText .
  ?cit onto:PersonCitation-Organization-isFrom ?fromOrg .
  ?fromOrg onto:Organization-name ?fromOrgNm .
  ?cit onto:PersonCitation-Person-hasAuthor ?author .
  ?author onto:firstName ?authorNm .
  optional { ?cit onto:PersonCitation-location ?eventLoc . }
}

The onto:Organization-name and onto:firstName data properties have sub-properties. Four out of five assertions are done at the sub-property level.
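As a rough illustration of why the plan above contains so many Union nodes, here is a small Python sketch of the expansion. It assumes the sub-property lists can be read off the scans in the plan (that mapping is my reading of the plan, not something taken from the ontologies), and the `expand` helper is hypothetical, not how Stardog actually rewrites queries.

```python
from itertools import product

# Assumed sub-property map, read off the Scan nodes in the plan above.
# Each super-property is listed together with itself.
SUB_PROPERTIES = {
    "onto:Organization-name": [
        "onto:Organization-name",
        "<http://ld.thomsonreuters.com/feed/schema/commonName>",
        "<http://permid.org/ontology/organization/hasAKAName>",
        "<http://permid.org/ontology/organization/hasOfficialName>",
        "<http://permid.org/ontology/organization/hasShortName>",
    ],
    "onto:firstName": [
        "onto:firstName",
        "<http://www.w3.org/2006/vcard/ns#given-name>",
    ],
}


def expand(patterns):
    """Return one list of triple patterns per conceptual UNION branch."""
    choices = []
    for s, p, o in patterns:
        # A predicate without known sub-properties stays as a single choice.
        choices.append([(s, q, o) for q in SUB_PROPERTIES.get(p, [p])])
    return [list(branch) for branch in product(*choices)]


PATTERNS = [
    ("?cit", "onto:PersonCitation-Organization-isFrom", "?fromOrg"),
    ("?fromOrg", "onto:Organization-name", "?fromOrgNm"),
    ("?cit", "onto:PersonCitation-Person-hasAuthor", "?author"),
    ("?author", "onto:firstName", "?authorNm"),
]

branches = expand(PATTERNS)
print(len(branches))  # 5 name properties x 2 first-name properties
```

Under these assumptions the query fans out into 5 x 2 = 10 combinations, which matches the number of organization-name scans in the plan.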

Hi Daniel,

This query plan should actually be very fast if the cardinality estimations are correct. There are scans with cardinalities around 4.5M in the plan, but those scans are merge-joined with very selective operators, where Sideways Information Passing (SIP) is used to skip over unrelated portions of the scan.
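To illustrate the general idea of skipping (this is only a sketch of the principle, not Stardog's SIP implementation; the function name and data are made up), a merge join over a sorted scan can seek to each key of the selective side instead of reading the large side linearly:

```python
from bisect import bisect_left


def merge_join_with_skipping(selective_keys, big_scan):
    """Join a small sorted key list against a large sorted (key, value) scan.

    Returns the matching entries and the number of scan entries read
    after seeking, which stays small when the key list is selective.
    """
    keys = [k for k, _ in big_scan]  # stands in for a sorted index
    matches, touched, pos = [], 0, 0
    for key in selective_keys:
        # Seek to the first entry >= key rather than reading every entry.
        pos = bisect_left(keys, key, pos)
        while pos < len(big_scan) and big_scan[pos][0] == key:
            matches.append(big_scan[pos])
            touched += 1
            pos += 1
    return matches, touched


# Hypothetical data: a large sorted scan and 5 selective keys.
big_scan = [(i, f"name-{i}") for i in range(100_000)]
orgs = [10, 5_000, 20_000, 70_000, 99_999]
matches, touched = merge_join_with_skipping(orgs, big_scan)
print(len(matches), touched)
```

With 5 keys on the selective side, only the 5 matching entries of the large scan are read, no matter how large the scan is.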

Based on the information you provided, we think the problem is related to named graphs. How many graphs do you have in the FROM clause? What kind of performance do you see if you use a single graph in the FROM clause? Which version of Stardog are you using?

Best,
Evren

Hi Evren,

Here are my answers:

(1) Included Graphs:
My query has the following graphs in the FROM clause:

  • 1 graph for our ontologies
  • 1 graph for the external source ontologies
  • 1 graph for the ontologies mapping our ontologies to the external source ontologies with subPropertyOf axioms
  • 1 graph for our internal assertions for citations, some persons and some organizations
  • 4 graphs for the external source assertions including most of the persons and organizations

Most of the graphs have only tens or hundreds of triples, except the external source assertion graphs, which contain 350 million triples overall. In fact, all the graphs of that source together contain more than a billion triples, and we intend to use all of them later.

(2) Performance on a single graph:

Since the ontologies and the assertions are not mixed in any graph, it is impossible to run a query doing reasoning with only one graph. I did a similar test using the external source ontology graph and one of the assertion graphs, and the performance was really good.

The test consisted of doing something like:

  • ?org onto:superProperty ?propValue
  • where superProperty has about 20 different sub-properties and the assertions are at the level of the sub-properties.

I also did a second test, for which the results were surprising. I replaced the above SPARQL line with two lines:

  • ?subProperty rdfs:subPropertyOf* onto:superProperty .
  • ?org ?subProperty ?propValue .

For that second test, I also set reasoning to OFF, since I was performing it in SPARQL myself. The performance was really poor.

By accident, in one of the executions of that second test, I left reasoning ON. That time, the performance was excellent.

It seems that whether we perform inference in SPARQL ourselves or let Stardog perform it, reasoning always needs to be set to ON. Right?
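For reference, the `rdfs:subPropertyOf*` path in the second test asks for the reflexive-transitive closure of the sub-property hierarchy, so the super-property itself is included. A minimal Python sketch of that closure, with made-up property names (`ex:name` and friends are placeholders, not our real ontology):

```python
def sub_properties_of(super_property, sub_property_of):
    """Return every property reachable via subPropertyOf*, including itself."""
    # Invert the (child, parent) edges into a parent -> children lookup.
    children = {}
    for child, parent in sub_property_of:
        children.setdefault(parent, set()).add(child)

    found, stack = {super_property}, [super_property]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:  # also guards against cycles
                found.add(child)
                stack.append(child)
    return found


# Hypothetical axioms: (sub-property, super-property) pairs.
AXIOMS = [
    ("ex:shortName", "ex:name"),
    ("ex:officialName", "ex:name"),
    ("ex:legalName", "ex:officialName"),
]

print(sorted(sub_properties_of("ex:name", AXIOMS)))
```

Each property in the resulting set then corresponds to one binding of ?subProperty in the second line of the test.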

(3) We are using version 6.1.0

Thanks for your help Evren!

Hi Daniel,

Two quick things to note:

  1. The schema graphs do not need to be included in the FROM clause. Schema is processed independently of queries so the axioms and rules in your schemas will be taken into account even if they are not in the FROM clause.

  2. Reasoning can be set on or off for each query separately. When reasoning is on for a query, you can selectively turn it off for some parts of the query using a query hint: https://www.stardog.com/docs/#_query_answering

We have a good idea of what is going wrong here, but to make sure, we might have a couple of follow-up questions that I'll send in private.

Thanks,
Evren
