Optimize query caching: possible?

We have a query template we need to execute often, of which this query is an example:

select distinct ?s ?score ?expr ?desc
where {
  ## get synsets from the upper layer (i.e. Wordnet synsets) ...
  {
    ?s ?p ?l .
    ?s a wno:Synset .
    ## children (direct hyponyms) if any
    optional { ?s wno:hyponym ?desc }
  }
  union
  ## ... or get nodes from the named graphs, incl. both CG extracts and dbpedia
  {
    graph ?g {
      ?s ?p ?l .
      ## children (direct subclasses) if any
      optional { ?desc rdfs:subClassOf ?s }
    }
  } .
  ## returns the "official" labels of anything
  { ?s rdfs:label ?expr }
  union
  { graph ?g { ?s rdfs:label ?expr } } .
  ## ... all matching 'encryption' -- this is the lucene extension
  (?l ?score) <tag:stardog:api:property:textMatch> "encryption" .
}

As you can see, the db we query is search-enabled. The server runs on a 16-core/92G RAM box, with STARDOG_JAVA_ARGS memory settings at 16/16/72 respectively. This particular db contains ~250M triples; about 90% of those come from a partial mirror of dbpedia stored in a named graph.

This query takes ~8 secs to run the first time. I am not sure whether the query is optimally formulated (I assume it isn't), and the data itself could probably be reshaped effectively in many ways – iow, no complaint about performance in and of itself :slight_smile: Regardless of that, though, I would have expected it to run considerably faster on subsequent executions – i.e., I'd have thought query reuse would be aggressive. However, re-running only shaves off a second or so (it doesn't get much faster than 7 secs). Is there a way to increase or facilitate caching of partial results or some such mechanism? Is query reuse hindered in this case by the Lucene query? (BTW, the Lucene search predicate makes it a lot faster than SPARQL CONTAINS or similar.)

NOTE: this is on a dev-only server, with a very sparse query load so far.

NOTE2: I have tested with both read_optimized and default as the memory.mode; no real difference – if anything, default was slightly faster.

You might find a recent blog post by the Stardog folks on query optimization helpful: 7 Steps to Fast SPARQL Queries | Stardog

I'm guessing that the one-second savings you're seeing on subsequent runs is just the query plan being cached.

Can you send a copy of the query plan? ("stardog query explain ...")

Stardog does not cache query results but normally subsequent runs are still faster due to the OS and disk caches (in addition to the query plan cache).

So your best course of action is to try to make the query generally faster. The query plan would help.

Best,
Pavel


Pavel,

this is an interesting statement you make there. I assume this also holds for intermediate data you materialise while processing queries, i.e., none of it is cached explicitly?

I’m asking because we can clearly feel positive effects when we run a set of representative queries after starting up our stardog instance. So these effects are only due to OS, disk & query plan cache?

What does the 'Page Cache Hit Ratio' refer to that is shown by the 'stardog-admin server status' command?

Thanks,
Marcel


Hi Marcel,

Yes, Stardog does cache data (in pages) that it reads from disk. Those, however, are not intermediate query results; it's just indexed data (e.g. triples). The Page Cache Hit Ratio refers to that.

The plan cache, however, often makes a big difference. It's not at all uncommon to have queries for which optimization takes longer than the subsequent execution.

Cheers,
Pavel

PS. The env variable for the server should be STARDOG_SERVER_JAVA_ARGS.

query plan here:

Slice(offset=0, limit=100) [#100]
└─ Distinct [#7.4K]
   └─ Projection(?s, ?score, ?expr, ?desc) [#7.4K]
      └─ MergeJoin(?s) [#7.4K]
         ├─ HashJoin(?l) [#3.7K]
         │  ├─ Union [#243.2M]
         │  │  ├─ MergeJoinOuter(?s) [#910K]
         │  │  │  ├─ MergeJoin(?s) [#910K]
         │  │  │  │  ├─ Scan[POS](?s, rdf:type, <http://wordnet-rdf.princeton.edu/ontology#Synset>) [#118K]
         │  │  │  │  └─ Scan[SPO](?s, ?p, ?l) [#5.9M]
         │  │  │  └─ Scan[PSO](?s, <http://wordnet-rdf.princeton.edu/ontology#hyponym>, ?desc) [#89K]
         │  │  └─ MergeJoinOuter(?s) [#242.3M]
         │  │     ├─ Scan[SPOC](?s, ?p, ?l){?g} [#242.3M]
         │  │     └─ Scan[POSC](?desc, rdfs:subClassOf, ?s){?g} [#609K]
         │  └─ Full-Text(query='encryption') -> (results=?l, scores=?score) [#1.8K]
         └─ Union [#14.6M]
            ├─ Scan[PSO](?s, rdfs:label, ?expr) [#209K]
            └─ Scan[PSOC](?s, rdfs:label, ?expr){?g} [#14.4M]

(Anecdotal: without the OPTIONAL clauses, this takes half the time.)

Thanks for the STARDOG_SERVER… correction! I guess the old (or wrong) form is still accepted? I have had that var in the environment for months and it seems to operate as expected.

It definitely isn't wrong, per se. STARDOG_JAVA_ARGS applies to every stardog/stardog-admin command, while STARDOG_SERVER_JAVA_ARGS applies exclusively to stardog-admin server start.

Does it make sense to set both?

The query can be optimized by setting the query.all.graphs option to true, which will allow you to avoid duplicating graph patterns over the default and named graphs.
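To illustrate the idea (a sketch only, not tested against this dataset): with query.all.graphs=true, a plain triple pattern already matches the default graph and every named graph, so the GRAPH-wrapped duplicates can be dropped. Note this sketch also replaces the first union with two OPTIONALs and omits the explicit `?s a wno:Synset` restriction, so it only approximates the original semantics; keep the union if that distinction matters.

```sparql
## Sketch assuming query.all.graphs=true on the database.
select distinct ?s ?score ?expr ?desc
where {
  ?s ?p ?l .
  (?l ?score) <tag:stardog:api:property:textMatch> "encryption" .
  ## labels now found in any graph without a GRAPH wrapper
  ?s rdfs:label ?expr .
  ## children, whichever modelling applies (hyponyms or subclasses)
  optional { ?s wno:hyponym ?desc }
  optional { ?desc rdfs:subClassOf ?s }
}
```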

Jess

It looks like you could benefit from duplicating the textMatch pattern into both sides of the first union:

e.g.

  ## get synsets from the upper layer (i.e. Wordnet synsets) ...
  {
    ?s ?p ?l .
    ?s a wno:Synset .
    (?l ?score) <tag:stardog:api:property:textMatch> "encryption" .
    ## children (direct hyponyms) if any
    optional { ?s wno:hyponym ?desc }
  }
  union
  ## ... or get nodes from the named graphs, incl. both CG extracts and dbpedia
  {
    graph ?g {
      ?s ?p ?l .
      (?l ?score) <tag:stardog:api:property:textMatch> "encryption" .
      ## children (direct subclasses) if any
      optional { ?desc rdfs:subClassOf ?s }
    }
  } .

Can you share the plan without the OPTIONALs?


If you have JVM args that you need to pass to any other commands that aren't stardog-admin server start, sure.

Here is the no-opts plan:

The Query Plan:

prefix wno: <http://wordnet-rdf.princeton.edu/ontology#>

Slice(offset=0, limit=100) [#100]
└─ Distinct [#7.4K]
   └─ Projection(?s, ?score, ?expr, ?desc) [#7.4K]
      └─ MergeJoin(?s) [#7.4K]
         ├─ HashJoin(?l) [#3.7K]
         │  ├─ Union [#243.2M]
         │  │  ├─ MergeJoin(?s) [#910K]
         │  │  │  ├─ Scan[POS](?s, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, wno:Synset) [#118K]
         │  │  │  └─ Scan[SPO](?s, ?p, ?l) [#5.9M]
         │  │  └─ Scan[SPOC](?s, ?p, ?l){?g} [#242.3M]
         │  └─ Full-Text(query='encryption') -> (results=?l, scores=?score) [#1.8K]
         └─ Union [#14.6M]
            ├─ Scan[PSO](?s, <http://www.w3.org/2000/01/rdf-schema#label>, ?expr) [#209K]
            └─ Scan[PSOC](?s, <http://www.w3.org/2000/01/rdf-schema#label>, ?expr){?g} [#14.4M]

Tried the query with your suggestion (distributing the textMatch inside the disjuncts) and that was significantly faster as well.
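For readers following along, the full query as tested above (textMatch duplicated into both branches of the first union, prefixes as in the original post) would look roughly like this:

```sparql
select distinct ?s ?score ?expr ?desc
where {
  ## upper layer (Wordnet synsets), text-matched directly
  {
    ?s ?p ?l .
    ?s a wno:Synset .
    (?l ?score) <tag:stardog:api:property:textMatch> "encryption" .
    optional { ?s wno:hyponym ?desc }
  }
  union
  ## named graphs (CG extracts and dbpedia), text-matched directly
  {
    graph ?g {
      ?s ?p ?l .
      (?l ?score) <tag:stardog:api:property:textMatch> "encryption" .
      optional { ?desc rdfs:subClassOf ?s }
    }
  } .
  ## "official" labels of anything
  { ?s rdfs:label ?expr }
  union
  { graph ?g { ?s rdfs:label ?expr } } .
}
```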

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.