Actually, it probably doesn’t use a loop join; the plan is just misleading. What happens is that for SPARQL Update queries, as opposed to read queries, we defer planning until runtime. There are several reasons for that, mostly update sequences (multiple update queries in a sequence). In those cases it’s often easier to find the best plan for each query after seeing the changes made by the previous queries. It also interacts with reasoning in some subtle ways (e.g. what happens when a previous query in the sequence changes the schema).
We have an open ticket to make the explain command / API useful for update queries (#3297). I can’t promise any particular date when this is going to be fixed, except that we’ll give it another thought. Maybe we’ll just improve it for singleton sequences since it’s a common special case.
I couldn’t get acceptable performance out of the query I wrote initially, so I’ve ended up just doing a SELECT and generating a Turtle file.
This is the query I ended up using:

select ?a (min(?dato) as ?publisertDato) ?g where {
  graph ?g {
    ?a a arkiv:Saksmappe .
    ?b arkiv:parent ?a .
    ?b arkiv:publisertDato ?dato .
    FILTER(NOT EXISTS { ?a arkiv:publisertDato ?q })
  }
} group by ?a ?g
This query made Stardog crash after pulling around 2 000 000 results from the iterator in Java. It seems to be a memory issue; I presume the GROUP BY is stopping Stardog from releasing memory while handling the query.
We ended up putting a LIMIT on it, inserting the generated data into Stardog, and then running it again, over and over, until it stopped returning data.
Two hash joins certainly do create some (native) memory pressure here. However, I don’t really understand how making it a SELECT, generating a Turtle file, and then adding that back to Stardog can work any better. The main problem is that if you pull results out of Stardog in this way, you also pull all RDF terms (IRIs, literals, etc.), whereas if you run it as an INSERT, that won’t happen (at least for the values of ?a and ?g). Stardog then wouldn’t decode them and would operate directly on the numerical representation.
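For reference, a sketch of what the INSERT variant could look like (SPARQL Update requires the aggregation to sit in a subselect; the arkiv:publisertDato property used for the new triples is an assumption based on the variable names in your query):

INSERT { GRAPH ?g { ?a arkiv:publisertDato ?publisertDato } }
WHERE {
  SELECT ?a (MIN(?dato) AS ?publisertDato) ?g
  WHERE {
    GRAPH ?g {
      ?a a arkiv:Saksmappe .
      ?b arkiv:parent ?a .
      ?b arkiv:publisertDato ?dato .
      FILTER(NOT EXISTS { ?a arkiv:publisertDato ?q })
    }
  }
  GROUP BY ?a ?g
}

This keeps ?a and ?g in their internal encoded form end to end, with no round trip through a result set and a Turtle file.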
You may want to experiment a little with -XX:MaxDirectMemorySize for the server, maybe increase it, and see if INSERT performance changes in any way. If it seems frozen for a long time, thread dumps may help (it might be spilling intermediate results to disk).
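For example, something along these lines in the server environment before starting Stardog (the sizes are purely illustrative and depend on your data and hardware; jstack is the standard JDK tool for taking a thread dump of the server process):

export STARDOG_SERVER_JAVA_ARGS="-Xmx4g -Xms4g -XX:MaxDirectMemorySize=8g"
stardog-admin server start

# If the INSERT seems frozen, a thread dump of the server JVM can show
# whether it is busy spilling intermediate results to disk:
jstack <stardog-server-pid>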
Also, how exactly does Stardog crash? Does the OOM happen on the server side?
Also, an independent problem here is that the optimizer may have wrongly decided not to push the NOT EXISTS filter down before rewriting it into an outer join. Can you quickly try something like:
select ?a (min(?dato) as ?publisertDato) ?g where {
  graph ?g {
    { ?a a arkiv:Saksmappe . FILTER(NOT EXISTS { ?a arkiv:publisertDato ?q }) }
    ?b arkiv:parent ?a .
    ?b arkiv:publisertDato ?dato .
  }
} group by ?a ?g
Possibly in the INSERT version as well. Please show the plan if it doesn’t make a difference.
Does it actually run faster? It should also use much less memory (hash joins materialize intermediate results, merge joins don’t), so if the original INSERT problem was caused by insufficient memory and spilling to disk, this could make it work inside the INSERT. But since aggregation also accumulates some results in memory, a bump to -XX:MaxDirectMemorySize may still be necessary. One needs to try, though.
Thanks Pavel. The production team just wiped our production database, so I can’t try anything more today. We are moving off the old SQL-based system today and bringing up our new system with Stardog over the coming days.
To test any more I’ll have to set everything up locally, and since that takes a while, I’m going to start working on the ICV performance with reasoning instead.