Insert query always using loop join

Hi,

My INSERT queries always use a loop join:


PREFIX arkiv: <http://www.arkivverket.no/standarder/noark5/arkivstruktur/>

insert {
	graph ?g {
		?a arkiv:publisertDato ?publisertDato .
	}
}
where {
	graph ?g {
		?a a arkiv:Saksmappe .
		?b arkiv:parent ?a .
		?b arkiv:publisertDato ?dato .
	}
}

####################################################################

The Query Plan:

prefix arkiv: <http://www.arkivverket.no/standarder/noark5/arkivstruktur/>

From all
From all
INSERT
`─ Projection(?a AS ?subject, ?genbslfk AS ?predicate, ?publisertDato AS ?object, ?g AS ?context)
   `─ Bind(<http://www.arkivverket.no/standarder/noark5/arkivstruktur/publisertDato> AS ?genbslfk)
WHERE
`─ LoopJoin(_) [#9223372036854.8M]
   +─ LoopJoin(_) [#165065601.9M]
   │  +─ Scan[SPO](?a, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, arkiv:Saksmappe){?g} [#5.6M]
   │  `─ Scan[SPO](?b, arkiv:parent, ?a){?g} [#29.4M]
   `─ Scan[SPO](?b, arkiv:publisertDato, ?dato){?g} [#25.4M]

When I change it to a SELECT query, it uses a merge join.


PREFIX arkiv: <http://www.arkivverket.no/standarder/noark5/arkivstruktur/>

select * where {
	graph ?g {
		?a a arkiv:Saksmappe .
		?b arkiv:parent ?a .
		?b arkiv:publisertDato ?dato .
	}
}

#######################################################################

The Query Plan:

prefix arkiv: <http://www.arkivverket.no/standarder/noark5/arkivstruktur/>

From all
Projection(?g, ?a, ?b, ?dato) [#20.7M]
`─ HashJoin(?a) [#20.7M]
   +─ MergeJoin(?b) [#25.4M]
   │  +─ Scan[PSOC](?b, arkiv:publisertDato, ?dato){?g} [#25.4M]
   │  `─ Scan[PSOC](?b, arkiv:parent, ?a){?g} [#29.4M]
   `─ Scan[POSC](?a, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, arkiv:Saksmappe){?g} [#5.6M]

These are simplified versions of a more complex INSERT query that I have, but I'm seeing the same behavior in that query.

Cheers,
Håvard

PS: Using Stardog 5.1.0.

Actually, it probably doesn't use a loop join; the plan is just misleading. What happens is that for SPARQL Update queries, as opposed to read queries, we defer planning until runtime. There are several reasons for that, mostly update sequences (multiple update queries executed in a sequence). In those cases it's often easier to find the best plan for each query after seeing the changes made by the previous queries. It also interacts with reasoning in some subtle ways (e.g. what happens when a previous query in the sequence changes the schema).
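For illustration, here's a hypothetical two-operation sequence (the prefix and predicates are made up for this example), where the best plan for the second operation depends on data written by the first:

PREFIX : <http://example.org/>

# The first operation changes the data...
INSERT DATA { :doc1 :parent :folder1 } ;
# ...so the best plan for the second one is only known once the first has run.
INSERT { ?child :inFolder ?folder }
WHERE { ?child :parent ?folder . ?folder a :Folder }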

We have an open ticket to make the explain command / API useful for update queries (#3297). I can't promise a date for when this will be fixed, except that we'll give it another thought. Maybe we'll just improve it for singleton sequences, since that's a common special case.

Thanks,
Pavel

OK. Thank you, Pavel.

I couldn't get acceptable performance out of the query I wrote initially, so I've ended up just doing a SELECT and generating a Turtle file.

This is the query I ended up using:

PREFIX arkiv: <http://www.arkivverket.no/standarder/noark5/arkivstruktur/>

select ?a (min(?dato) as ?publisertDato) ?g where {
	graph ?g {
		?a a arkiv:Saksmappe .
		?b arkiv:parent ?a .
		?b arkiv:publisertDato ?dato .
		FILTER(NOT EXISTS { ?a arkiv:publisertDato ?q })
	}
} group by ?a ?g

This query made Stardog crash after we had pulled around 2,000,000 results from the iterator in Java. It seems to be a memory issue; I presume the GROUP BY is stopping Stardog from releasing memory while handling the query.

We ended up putting a LIMIT on it, inserting the generated data back into Stardog, and then running it again, over and over, until it stopped returning data; see the sketch below.
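Roughly, each round ran something like this (a sketch; the LIMIT value here is arbitrary):

PREFIX arkiv: <http://www.arkivverket.no/standarder/noark5/arkivstruktur/>

select ?a (min(?dato) as ?publisertDato) ?g where {
	graph ?g {
		?a a arkiv:Saksmappe .
		?b arkiv:parent ?a .
		?b arkiv:publisertDato ?dato .
		# Mappes handled in earlier rounds drop out here, so the loop terminates.
		FILTER(NOT EXISTS { ?a arkiv:publisertDato ?q })
	}
} group by ?a ?g
limit 100000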

The Query Plan:

From all
Projection(?a, ?publisertDato, ?g) [#2.4M]
`─ Group(by=[?g, ?a] aggregates=[(MIN(?dato) AS ?publisertDato)]) [#2.4M]
   `─ Filter(!(Bound(?q))) [#6.1M]
      `─ HashJoinOuter(?a) [#12.2M]
         +─ HashJoin(?a) [#12.2M]
         │  +─ MergeJoin(?b) [#15.5M]
         │  │  +─ Scan[PSOC](?b, <http://www.arkivverket.no/standarder/noark5/arkivstruktur/parent>, ?a){?g} [#15.6M]
         │  │  `─ Scan[PSOC](?b, <http://www.arkivverket.no/standarder/noark5/arkivstruktur/publisertDato>, ?dato){?g} [#12.3M]
         │  `─ Scan[POSC](?a, rdf:type, <http://www.arkivverket.no/standarder/noark5/arkivstruktur/Saksmappe>){?g} [#3.3M]
         `─ Scan[PSOC](?a, <http://www.arkivverket.no/standarder/noark5/arkivstruktur/publisertDato>, ?q){?g} [#12.3M]

Or maybe it’s the two hash joins that are causing the issue?

Two hash joins certainly do create some (native) memory pressure here. However, I don't really understand how making it a SELECT, generating a Turtle file, and then adding that back to Stardog can work any better. The main problem is that if you pull results from Stardog this way, you also pull all the RDF terms (IRIs, literals, etc.), whereas if you run it as an INSERT, that won't happen (at least for the values of ?a and ?g): Stardog wouldn't decode them and would operate directly on their numerical representation.
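For reference, the INSERT counterpart of your aggregate query would look something like this (just a sketch, untested):

PREFIX arkiv: <http://www.arkivverket.no/standarder/noark5/arkivstruktur/>

insert {
	graph ?g { ?a arkiv:publisertDato ?publisertDato . }
}
where {
	# The aggregation moves into a subquery inside the WHERE clause.
	select ?g ?a (min(?dato) as ?publisertDato) where {
		graph ?g {
			?a a arkiv:Saksmappe .
			?b arkiv:parent ?a .
			?b arkiv:publisertDato ?dato .
			FILTER(NOT EXISTS { ?a arkiv:publisertDato ?q })
		}
	} group by ?g ?a
}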

You may want to experiment a little with -XX:MaxDirectMemorySize for the server, maybe increase it, and see if INSERT performance changes in any way. If it seems frozen for a long time, thread dumps may help (it might be spilling intermediate results to disk).
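For example, via the environment variable that the launch scripts read when starting the server (the sizes below are placeholders, not recommendations):

# Picked up by the Stardog launch scripts at server start.
export STARDOG_SERVER_JAVA_ARGS="-Xmx8g -Xms8g -XX:MaxDirectMemorySize=16g"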

Also, how exactly does Stardog crash? Does the OOM happen on the server side?

Cheers,
Pavel

Also, an independent problem here is that the optimizer may have wrongly decided not to push the NOT EXISTS filter before rewriting it into an outer join. Can you quickly try something like this:

PREFIX arkiv: <http://www.arkivverket.no/standarder/noark5/arkivstruktur/>

select ?a (min(?dato) as ?publisertDato) ?g where {
	graph ?g {
		{ ?a a arkiv:Saksmappe . FILTER(NOT EXISTS { ?a arkiv:publisertDato ?q }) }
		?b arkiv:parent ?a .
		?b arkiv:publisertDato ?dato .
	}
} group by ?a ?g

Possibly in the INSERT version as well. Please show the plan if it doesn’t make a difference.

Thanks,
Pavel


The Query Plan:

From all
Projection(?a, ?publisertDato, ?g) [#2.4M]
`─ Group(by=[?g, ?a] aggregates=[(MIN(?dato) AS ?publisertDato)]) [#2.4M]
   `─ HashJoin(?b) [#6.1M]
      +─ MergeJoin(?a) [#6.1M]
      │  +─ Filter(!(Bound(?q))) [#1.7M]
      │  │  `─ MergeJoinOuter(?a) [#3.3M]
      │  │     +─ Scan[POSC](?a, rdf:type, <http://www.arkivverket.no/standarder/noark5/arkivstruktur/Saksmappe>){?g} [#3.3M]
      │  │     `─ Scan[PSOC](?a, <http://www.arkivverket.no/standarder/noark5/arkivstruktur/publisertDato>, ?q){?g} [#12.3M]
      │  `─ Scan[POSC](?b, <http://www.arkivverket.no/standarder/noark5/arkivstruktur/parent>, ?a){?g} [#15.6M]
      `─ Scan[PSOC](?b, <http://www.arkivverket.no/standarder/noark5/arkivstruktur/publisertDato>, ?dato){?g} [#12.3M]

That does look better. There is still a hash join there though.

Does it actually run faster? It must also use much less memory (hash joins materialize intermediate results; merge joins don't), so if the original INSERT problem was caused by insufficient memory and spilling to disk, this version could work inside an INSERT. But since aggregation also accumulates some results in memory, a bump to -XX:MaxDirectMemorySize may still be necessary. One needs to try it, though.

Cheers,
Pavel

Thanks, Pavel. The production team just wiped our production database, so I can't try anything more today. We are moving off the old SQL-based system today and bringing up our new Stardog-based system over the coming days.

To test any further I'll have to set everything up locally, but since that takes a while, I'm going to start working on ICV performance with reasoning instead.

Thanks again for all the help, Pavel.

By the way, the guy from our production team says there's nothing in the logs from the crash; they just show that the server started up again.
