Unexpected SPARQL FILTER results

Hi folks,

I am having trouble with unexpected SPARQL query results that use FILTER. I am using Stardog 7.04 on Windows OS.

I am constructing a query to find Subjects that have the exact same "set" of triples. In this example data, :Set2 should be identified as the only exact match with :Set1, while :Set1 and :Set3 are NOT an exact match because of the value :VAL_E.

@prefix :  <https://www.example.org/Eg#>.

:Set1 :hasValue  :VAL_A, :VAL_B, :VAL_C, :VAL_D .
:Set2 :hasValue  :VAL_A, :VAL_B, :VAL_C, :VAL_D .
:Set3 :hasValue  :VAL_A, :VAL_B, :VAL_C, :VAL_D, :VAL_E .
:Set4 :hasValue  :VAL_A, :VAL_B .
:Set5 :hasValue  :VAL_F, :VAL_G, :VAL_H, :VAL_I, :VAL_J .

I am told this query returns the correct result of :Set1, :Set2 when executed on Apache Jena Fuseki and on Ontotext GraphDB.

PREFIX  :  <https://www.example.org/Eg#>
SELECT DISTINCT ?s1 ?s2 {
  ?s1 ?p ?o .
  ?s2 ?p ?o .
  FILTER NOT EXISTS { ?s1 ?p1 ?o1 . FILTER NOT EXISTS { ?s2 ?p1 ?o1 } }
  FILTER NOT EXISTS { ?s2 ?p2 ?o2 . FILTER NOT EXISTS { ?s1 ?p2 ?o2 } }
  FILTER (STR(?s1) < STR(?s2))
}

But in Stardog Studio, I get:

:Set1 :Set2  
:Set2 :Set4  
:Set1 :Set4  
:Set3 :Set4  
:Set2 :Set3  
:Set1 :Set3  

Similarly, I want to find "What subject in the data has an identical set of P,O to :Set1?" The answer should be :Set2 and only :Set2. But when I run this query:

PREFIX  :  <https://www.example.org/Eg#>

SELECT DISTINCT  ?s2 {
  :Set1 ?p ?o .
  ?s2 ?p ?o .
  FILTER NOT EXISTS { :Set1 ?p1 ?o1 . FILTER NOT EXISTS { ?s2 ?p1 ?o1 } }
  FILTER NOT EXISTS { ?s2 ?p2 ?o2 . FILTER NOT EXISTS { :Set1 ?p2 ?o2 } } # omit match from :Set3 to :Set1
  FILTER (STR(:Set1) < STR(?s2))
}

I get the result:

:Set2
:Set4

Why is :Set4 present? I posted this question on StackOverflow, which is where I was told that the expected result of only :Set2 is returned by Fuseki and GraphDB.

Thanks for your help!

Tim

Can you try this:

SELECT DISTINCT ?s1 ?s2 {
#pragma optimizer.filters.notexists off
  ?s1 ?p ?o .
  ?s2 ?p ?o .
  FILTER NOT EXISTS { ?s1 ?p1 ?o1 . FILTER NOT EXISTS { ?s2 ?p1 ?o1 } }
  FILTER NOT EXISTS { ?s2 ?p2 ?o2 . FILTER NOT EXISTS { ?s1 ?p2 ?o2 } }
  FILTER (STR(?s1) < STR(?s2))
}

Hi Jess,

Is it a stardog-admin command to set the optimizer? I can't find that in the user manual.

As-is, your query for me yields:
:Set1 :Set2
:Set1 :Set3
:Set2 :Set3
:Set1 :Set4
:Set2 :Set4
:Set3 :Set4

T
PS: On vacation today so my replies will be delayed.

Thanks Tim, we'll take a look.

Speaking of pragmas: I don't think there's a comprehensive list of all the available pragmas in Stardog. They're in the docs but scattered around. It's just a suggestion, but there are getting to be quite a few and it might be nice to have them in a list. I don't use them often enough to ever remember them.

We can definitely create a list of the main ones (describe strategy, reasoning control, etc). The fine-grained optimizer hint used here likely won't ever appear in the documentation. This type of hint is primarily useful to confirm bugs and to provide a temporary workaround for them until a fix is released. The set of optimizers changes over time and there are no plans to stabilize these hints.

Here is the list so far:

Describe Directionality
#pragma describe.strategy bidirectional
#pragma describe.strategy cbd

Equality
#pragma equality.identity ?p1,?p2
#pragma assume.iri ?o, ?o2

Joins
#pragma group.joins
#pragma join.bind off

Optimizer
#pragma optimizer.filters.exists off
#pragma optimizer.filters.notexists off
#pragma optimizer.inline.from off

Filters
#pragma push.filters off
#pragma push.filters default
#pragma push.filters aggressive

Indexes
#pragma literal.index off
#pragma literal.index default
#pragma literal.index aggressive

Virtual Transparency
#pragma virtual.transparency off

Reasoning
#pragma reasoning off

Hi Jess,

Are you able to recreate the issue on your side? This will help me to know that it is not my SPARQL that is somehow the cause.

Cheers!

Hi Tim,

Not yet, sorry about the delay, just busy with something else. Should look into it this week, feel free to ping me if that doesn't happen.

Best,
Pavel

Thanks, Pavel. It is good to know you folks will be looking into it. I hope you are doing well. - Tim

So, I tried this on AG 7.0.3 and got the same results:

s1 s2
Set1 Set2
Set2 Set3
Set3 Set4
Set1 Set3
Set2 Set4
Set1 Set4

So clearly something strange is going on. I was, however, able to approach this differently by doing the following:

    PREFIX  :  <https://www.example.org/Eg#>
    select (group_concat(?s) as ?same) where {
      {
        SELECT ?s (group_concat(?prop_val;separator='|') as ?values) where {
          {
            select ?s ?p ?o where {
              ?s ?p ?o .
            } order by ?s ?p ?o
          }
          bind(concat(str(?p),':', str(?o)) as ?prop_val)
        } group by ?s
      }
    } group by ?values having(count(distinct(?s)) > 1)

I know it's not exactly what you were looking for, but it has the upside of putting all the subjects that have the same members into the same bucket (as opposed to many pairs), and it will handle general entity comparison with multiple properties. For the simple case you can strip it down to:

PREFIX  :  <https://www.example.org/Eg#>
select (group_concat(?s) as ?same) where {
  {
    SELECT ?s (group_concat(?o;separator='|') as ?values) where {
      {
        select ?s ?o where {
          ?s :hasValue ?o .
        } order by ?s ?o
      }
    } group by ?s
  }
} group by ?values having(count(distinct(?s)) > 1)

Your query has one major pitfall: group_concat does not make any assumption about the ordering, which means it's never guaranteed that the concatenated bindings will be in the same order for each entity.

For example, you might get

<p1>:<o1>|<p2>:<o2> for <s1>
and
<p2>:<o2>|<p1>:<o1> for <s2>
Those would clearly be the same set of property-value pairs, but the comparison in your query would fail.

It also doesn't help to use a subquery to sort the property-value bindings first as it doesn't matter for group_concat - you might get lucky depending on the implementation, or you might not ...
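
One order-independent alternative is to compare sizes instead of concatenated strings: two subjects have the same set of property-value pairs exactly when each subject's pair count equals the number of pairs they share. The query below is only a sketch of that idea and has not been run in this thread. The variable names ?n1, ?n2 and ?shared are made up for illustration, and it assumes all triples sit in a single graph so that COUNT(*) counts each property-value pair once.

```sparql
PREFIX : <https://www.example.org/Eg#>

SELECT ?s1 ?s2 WHERE {
  # Number of (p, o) pairs per subject, computed once for each side of the pair.
  { SELECT ?s1 (COUNT(*) AS ?n1) WHERE { ?s1 ?p ?o } GROUP BY ?s1 }
  { SELECT ?s2 (COUNT(*) AS ?n2) WHERE { ?s2 ?p ?o } GROUP BY ?s2 }
  # Number of (p, o) pairs the two subjects have in common.
  { SELECT ?s1 ?s2 (COUNT(*) AS ?shared)
    WHERE { ?s1 ?p ?o . ?s2 ?p ?o }
    GROUP BY ?s1 ?s2 }
  # Equal sets: both sizes match the size of the overlap.
  # The STR comparison drops self-pairs and mirrored duplicates.
  FILTER (?n1 = ?shared && ?n2 = ?shared && STR(?s1) < STR(?s2))
}
```

On the sample data this should leave only the :Set1 / :Set2 pair, since :Set3 has five pairs but shares only four with :Set1, and :Set4 has two. It relies neither on group_concat ordering nor on nested FILTER NOT EXISTS.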

Good point. I guess I have gotten lucky so far, as both AGraph and Neptune have preserved sort order in group_concat.

Does the following query work for you? It works for me in Blazegraph, in which your original query also doesn't work.

SELECT DISTINCT ?s1 ?s2 {
  ?s1 ?p ?o . ?s2 ?p ?o .
  FILTER NOT EXISTS {
    ?s1 ?p ?o . ?s2 ?p ?o . 
    ?s1 ?p1 ?o1 . FILTER NOT EXISTS { ?s2 ?p1 ?o1 } }
  FILTER NOT EXISTS {
    ?s1 ?p ?o . ?s2 ?p ?o .
    ?s2 ?p2 ?o2 . FILTER NOT EXISTS { ?s1 ?p2 ?o2 } }
  FILTER (STR(?s1) < STR(?s2))
}

If so, then this article is probably related: The Problem of Correlation and Substitution in SPARQL.

OK, I just had a chance to look into this. Again, sorry that it took a bit. Here's what happens. Let's look at

  ?s1 ?p ?o .
  ?s2 ?p ?o .
  FILTER NOT EXISTS { ?s1 ?p1 ?o1 . FILTER NOT EXISTS { ?s2 ?p1 ?o1 } }

For the query to work as you expect, when the first FILTER NOT EXISTS is evaluated for specific values of ?s1 and ?s2, both these variables should be replaced by their values. This is where the spec gets imprecise because there's no exact definition of variable substitution or even if any occurrence of a variable should be substituted or not. If we look at

?s1 ?p1 ?o1 . FILTER NOT EXISTS { ?s2 ?p1 ?o1 }

(the first FILTER's body), ?s1 is a distinguished variable (projected to the results if this pattern is evaluated in isolation) so it's the same variable as ?s1 in the outer scope. Thus it has to be replaced by its value from the outer scope. The case of ?s2 is, however, less clear. The above pattern, if viewed alone, is equivalent (in terms of matched results) to

?s1 ?p1 ?o1 . FILTER NOT EXISTS { [] ?p1 ?o1 }

i.e. with ?s2 being anonymised, so one could argue either way if it's the same ?s2 as in the outer scope or not. If it is, then it should be replaced by a value when the top-level FILTER NOT EXISTS is evaluated. If not, then not.

Stardog does not replace ?s2 by a value coming from ?s1 ?p ?o . ?s2 ?p ?o, which breaks the intended behaviour. I'd guess neither do Blazegraph and AG, while Ontotext and Jena do the replacements. The way Stanislav fixed the query is by pushing ?s2 manually into the first-level filters so it does get replaced. The article he linked to is a good reference on this sort of issue. It has been acknowledged as an issue in SPARQL and there was even a W3C Community Group to fix it, but I'm not sure it delivered any useful resolutions.

This is something we'd happily consider changing since Stardog's view on variable substitution is that it should work purely syntactically without any regard for variable scoping.
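
The same manual push can be applied to the single-subject query from the first post. The version below is an untested adaptation of Stanislav's rewrite, not something confirmed in this thread. It assumes that only the first filter needs the extra pattern, because ?s2 already appears at the first level of the second filter's body (which would also explain why :Set3 was excluded but :Set4 was not).

```sparql
PREFIX : <https://www.example.org/Eg#>

SELECT DISTINCT ?s2 {
  :Set1 ?p ?o .
  ?s2 ?p ?o .
  FILTER NOT EXISTS {
    # Repeating the outer pattern puts ?s2 at the first level of the
    # filter body, so it is substituted before the inner filter runs.
    ?s2 ?p ?o .
    :Set1 ?p1 ?o1 . FILTER NOT EXISTS { ?s2 ?p1 ?o1 } }
  FILTER NOT EXISTS {
    # ?s2 is already a first-level variable here, so nothing is added.
    ?s2 ?p2 ?o2 . FILTER NOT EXISTS { :Set1 ?p2 ?o2 } }
  FILTER (STR(:Set1) < STR(?s2))
}
```

If the substitution behaves as described above, this should return only :Set2 on the sample data.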

Hope it helps,
Pavel

UPD: I created an internal ticket PLAT-1704 to track this.

Confirmed that this code provides the expected result. Thank you - I never would have come up with this on my own.

Thanks Pavel. Your explanation is very helpful and I look forward to the future fix in Stardog.