Broken variable scope semantics with sub-select

Problem description

Variable scope semantics (see SPARQL 1.1 Query language specification) seems to be violated in the presence of sub-select and optionals.

Reproduction steps

Given the following dataset:

INSERT DATA {
	GRAPH <urn:test:graph> {
        <urn:test:s> <urn:test:p> <urn:test:o> .
        <urn:test:x> <urn:test:y> <urn:test:z> .
    }
}

the following query:

SELECT DISTINCT  ?s ?unbound ?unexpectedResult
WHERE
  { {
      SELECT DISTINCT ?s ?unbound
      WHERE { 
        ?s <urn:test:p> ?o .
        OPTIONAL {?s <urn:whatever:DoesNotExist> ?unbound }
      } LIMIT 1
     }
     OPTIONAL { ?unbound ?someProperty ?unexpectedResult . }
   
 }

leads to this results:

s,unbound,unexpectedResult
urn:test:s,urn:test:s,urn:test:o
urn:test:s,urn:test:x,urn:test:z

Expected behavior

The documented query should lead to this results:

s,unbound,unexpectedResult
urn:test:s,,

according to my understanding of the variable scope specification.

Is this correct? Or am I missing something?

Hi Ruben,

First of all, thanks again for a very well-structured, easy to reproduce case report. Definitely makes our life easier!

Second, the case is certainly not straightforward, but here's why I believe the result conforms to the spec. The subquery produces the following solution (IRIs shortened): ?s=:s, i.e. there's no value for ?unbound. The last pattern in the last OPTIONAL produces two solutions: ?unbound=:s, ?someProperty=:p, ?unexpectedResult=:o and ?unbound=:x, ?someProperty=:y, ?unexpectedResult=:z. Now the question is what the left outer join should return.

The key here is the definition of compatible solutions in https://www.w3.org/TR/sparql11-query/#BasicGraphPattern Both inner and outer join's semantics in SPARQL is based on it. The definition says that 2 solutions are compatible if they agree on values of all variables which are present in both solutions. The non-obvious corollary here is that if some solution does not have a value for a variable, e.g. ?unbound in this case, then it's compatible with any solution with any value for ?unbound (provided there're no other shared variables).

It's possible to see this effect without OPTIONAL and a subquery, just with something like

select * {
values (?x ?y) { (<urn:s> UNDEF) }
values (?y ?z) { (<urn:o> 1) (<urn:t> 2) }
}

This issue does indeed come up more often in the presence of OPTIONALs because people often write something like

?subject :p ?o .
optional { ?o a ?type }
optional { ?type rdfs:label ?label }

The intention is clear: to obtain labels for types which may not be present. But the problem is that ?type may be unbound, and then the join on it potentially returns rdfs:label values for everything. That leads to both correctness and performance problems.

Does this make sense?
Pavel

Hi Pavel!

Thanks for the quick reply.
A colleague of mine also advocated in that direction, arguing that the semantics of unbound mean that other graph patterns can freely "put" values into the variable (bind it), I however disagree.

I my opinion, unbound means that there is no mapping for the variable for the specific solution mapping. We could say that there is no RDF term for the variable (or that the value is null, in programming language jargon).

So considering this definition:

Two solution mappings μ1 and μ2 are compatible if, for every variable v in dom(μ1) and in dom(μ2), μ1(v) = μ2(v).
Here, μ1(v) = μ2(v) means that μ1(v) and μ2(v) are the same RDF term.

and considering:

  • μ1 as the projection variables of the subselect
  • μ2 the variables of the optional pattern

for each mapping solution in dom(μ1) the RDF term μ1(unbound) would not be given (empty). This means that for solution mappings in dom(μ2) to be compatible with μ2, the RDF term of μ2(unbound) should also be unbound (empty), otherwise the 2 solution mappings are not compatible and hence cannot be a solution to the graph pattern.

(Sidenote: I tested with another triplestore just to compare, and the behavior there was the one I expected.

I my opinion, unbound means that there is no mapping for the variable for the specific solution mapping. We could say that there is no RDF term for the variable (or that the value is null , in programming language jargon).

This, this is correct.

  • μ1 as the projection variables of the subselect

No, μ1 is not a set of variables. It's a solution, that is a function from a set of variables to RDF terms. In this case it's a single mapping of ?s to :s. The set of variables mapped by it, i.e. its domain or dom(μ1), is {?s}. The ?unbound variable is not mapped by μ1 despite it being in the projection. It's an in-scope variable according to 18.2.1 you cited, but it's not in dom(μ1). Note that 18.2.1 says "... if there is a way for a variable to be in the domain...", i.e. a variable may be in-scope but still not in the domain of the mapping.

Then, since as ?unbound is not in dom(μ1), μ1 is compatible with any solution which does not map ?s to anything but :s. There're no restrictions on μ(?unbound) for the other solution. Thus μ2 is compatible with μ1.

You can run my simpler query against sparql.org with something like

curl -G -v "http://sparql.org/sparql" --data-urlencode "query=select * {
> values (?x ?y) { (<urn:s> UNDEF) }
> values (?y ?z) { (<urn:o> 1) (<urn:t> 2) }
> }"

and should get this

{
  "head": {
    "vars": [ "x" , "y" , "z" ]
  } ,
  "results": {
    "bindings": [
      {
        "x": { "type": "uri" , "value": "urn:s" } ,
        "y": { "type": "uri" , "value": "urn:o" } ,
        "z": { "type": "literal" , "datatype": "http://www.w3.org/2001/XMLSchema#integer" , "value": "1" }
      } ,
      {
        "x": { "type": "uri" , "value": "urn:s" } ,
        "y": { "type": "uri" , "value": "urn:t" } ,
        "z": { "type": "literal" , "datatype": "http://www.w3.org/2001/XMLSchema#integer" , "value": "2" }
      }
    ]
  }
}

that is, the solution in the first VALUES which does not map ?y is compatible with both solutions in the second VALUES which do.

Cheers,
Pavel

Hi Pavel!

Ok, that is true:

So I reformulate to:

  • μ1 as the solution mappings of the projection variables of the subselect
  • μ2 as the solution mappings of the variables of the optional pattern

According to the spec this must apply

Two solution mappings μ1 and μ2 are compatible if, for every variable v in dom(μ1) and in dom(μ2), μ1(v) = μ2(v).

So if this must apply to every variable, then this must be fulfilled:

  • μ1(?unbound) = μ2(?unbound)

Following the definition:

μ(x) for the solution mapping variable x to RDF term t : { (x, t) }

I understand that the RDF term μ1(?unbound) = UNDEF cannot be equals to neither possible RDF term values of μ2(?unbound) (see RDF term-equality):

  • UNDEF = urn:test:s (not compatible)
  • UNDEF = urn:test:x (not compatible)

which ultimately should lead to the OPTIONAL pattern not matching.


Bottom line

What we are discussing here then is the definition of equality (μ1(v) = μ2(v)) between specific RDF terms (μ(v)), more specifically the UNDEF RDF term.

Intuitively, your proposed interpretation of the specification is imo equivalent to saying "the undefined value is equal to any value", which is not logical.

(By comparison, e.g. in a programming language like Java this would mean that any expression of the like anyPossibleObject == null would always be true).

I don't quite agree with the "bottom line". There's no such thing as "UNDEF RDF Term". UNDEF is a purely syntactic construct. Having no value for a variable in a solution is not the same as having some special value which isn't equal to anything (except of itself). The analogy with null does not work.

Most importantly, this is false:

this must be fulfilled: μ1(?unbound) = μ2(?unbound)

because μ1(v) = μ2(v) applies to all (and only) variables which are both in the domain of μ1 and in the domain of μ2. ?unbound is not in the domain of μ1. It is not mapped by μ1. Again, having no value for ?unbound is not the same as mapping it to some special null value.

"the undefined value is equal to any value" also isn't quite what the semantics says. If you have no value, you cannot technical speak of it being equal to something or not. You need to have 2 values to determine equality.

Best,
Pavel

Hmm alright, let me try to reduce the discussion to the core.

  • Your interpretation:

    • An unbound value means that the μ1(v) mapping is not in the domain (dom(μ1)) of the solution mapping. This means, that no μ1(?unbound) = μ2(?unbound) check takes place
  • My interpretation

    • A μ1(v) mapping is always part of the domain of the solution mapping, leading possibly to an undefined/unbound value. This means, μ1(?unbound) exists and is empty/undefined/unbound (meaning that it maps to an undefined RDF term), which cannot be equal to a defined one (μ2(?unbound))

I think that in the current state of the specification, your interpretation seems indeed to be the correct one, since the definition of μ(v) does not point to the concept of "an undefined/empty RDF term t". I'm still though inclined to think that this is an error in the specification (and as you also said, real usage e.g. of OPTIONAL follows this intuition).

I will start a discussion in the SPARQL 1.2 community group to improve the specification of the expected behaviour.

@pavel Many thanks for your patience and support! :slight_smile:

Right, this is the difference. But before you propose the change, let me ask if you want these two queries:

# Q1
select * {
values (?x ?y) { (<urn:s> UNDEF) }
values (?y ?z) { (<urn:o> 1) (<urn:t> 2) }
}

and

# Q2
select * {
values ?x { <urn:s> }
values (?y ?z) { (<urn:o> 1) (<urn:t> 2) }
}

to return the same or different results? This is based on whether all in-scope variables are always in the domain or simply all query's variables are in the domain. The choice is actually important for what optimiser can and cannot do to the query under your semantics.

Well, in the same lines as my previous answer, imo:

  • In order to be consistent with your interpretation, both queries should return the same results (urn:s,urn:o,1 and urn:s,urn:t,2), since unbound variables (in this case ?y) are not in the domain
  • In order to be consistent with my interpretation, the queries should return different results (the first one, would return an empty result, the second would return the results urn:s,urn:o,1 and urn:s,urn:t,2, since all variables (even if unbound as is ?y) are in the domain

(And just as in my previous answer, I like the second one better, but over anything I prefer consistency)

OK, agreed, but then, if you really want to distinguish UNDEF (or null) from no value, and treat it as a first-class RDF term, you need to specify what happens in other situations where the SPARQL semantics says something about a variable having no value. For example, cases like BIND(expression as ?x). Currently an error raised during evaluation of expression means ?x is unbound, i.e. no value, not in the domain of the solution. Now, under your proposal, there's a choice of no value vs UNDEF.

Same for filters, say, there's FILTER(?x = expression). Currently, if the optimizer can statically figure out that ?x cannot be bound (or that expression would always evaluate to error), it can just remove the whole graph pattern under that filter since it won't produce solutions. If you start using UNDEF there, it won't be possible because the other operand of the equality can be evaluated to UNDEF too, and then the filter can succeed. You can maybe work around it by still using the unbound value semantics for expression evaluation but only using UNDEF for OPTIONALs, but it can quickly get hairy given the semantics of FILTERs in OPTIONALs.