Query timeout, simple query, medium dataset

semanticfire · July 4, 2017, 1:14pm

Hi,

We have a database containing 19M triples in various graphs.
The server runs on 2 cores with 4GB of memory.

If we query for resources with a specific URI pattern the query will not finish and end up with the 5m timeout.

select distinct ?s where 
        { graph <http://data.resc.info/kro> 
        { ?s ?p ?o .         
         filter ( strstarts(str(?s), 'http://data.resc.info/kro/object/0758200000000003'))
      	}
}

If we copy all the data from the data.resc.info/kro graph into a separate database the performance is fine.
I've looked at the query plans and for both DB's they look the same.
Any hints on how to improve the performance are welcome.

Bart

pavel · July 4, 2017, 2:29pm

Hi Bart,

I suspect there’s a difference in the query plans, specifically in the scan index: one scan uses SC and the other SPO. Could you check this and send the plans along if that’s not the case.

Is it true that your named graph contains most of the 19M triples?

Cheers,
Pavel

semanticfire · July 4, 2017, 2:36pm

Hi Pavel,

no the database consists of 12 Named graphs, having between 4.3M and 15k triples each.
The graph we select from contains about 4.3M triples, moving these to a separate DB helps with performance

Full db:

Distinct [#541K]
`─ Projection(?s) [#2,1M]
   `─ Filter(STRSTARTS(Str(?s), "http://data.resc.info/kro/object/0758200000000003")) [#2,1M]
      `─ Scan[SPO](?s, ?p, ?o){<http://data.resc.info/kro>} [#4,3M]

Paritial db:

Distinct [#666K]
`─ Projection(?s) [#2,1M]
   `─ Filter(STRSTARTS(Str(?s), "http://data.resc.info/kro/object/0758200000000003")) [#2,1M]
      `─ Scan[SPO](?s, ?p, ?o){<http://data.resc.info/kro>} [#4,2M]

Thanks
Bart

pavel · July 5, 2017, 9:52am

This is indeed strange. Usually such large performance difference manifest themselves in plans. Any chance we can get the data (possibly in the obfuscated form) to reproduce this on our end? How long does the query take on the separate databases?

Also, depending on how the data looks like (e.g. the number of distinct subject IRIs), it might make sense to push the distinct into a subquery to first get unique subjects before the expensive filtering.

Best,
Pavel

semanticfire · July 5, 2017, 10:21am

Hi Pavel,

We can arrange the transfer of the dataset, could we go offline about the conditions ?

In the meantime I’ve tried your proposal and that was a huge improvement!

select ?s ?p ?o where 
        { 
          {select distinct ?s where {
          graph <http://data.resc.info/kro> 
        { ?s ?p ?o .         
         
      	}}} .
          ?s ?p ?o
        filter ( strstarts(str(?s), 'http://data.resc.info/kro/object/0758200000000003'))
}

Result in a query that finishes, although not super fast.

Bart

pavel · July 5, 2017, 10:29am

OK, but that query isn't equivalent to the original one, which only asks for subjects. If you only need subjects, then the second ?s ?p ?o isn't needed and things would be faster. If you also need predicates and objects, then you need it (you can also use DESCRIBE, by the way).

Sure, drop me a line at pavel@stardog.com and we can discuss how to do it, I can also explain the obfuscation thing. I'm on the road today so might be some delays in responding, sorry about that!

Cheers,
Pavel

semanticfire · July 5, 2017, 10:38am

You are right, the original query was a describe, but we broke that down to get a minimal query to replicate the problem.
Would there be a performance benefit in describe ?

If I omit the graph clause then the query runs into problems again. In the end causing unrecoverable memory errors where the only resolution is to completely reboot the graph server. This is something which is obviously bothering us.
Any advice is greatly appreciated.

Log output:

Exception in thread "XNIO-1 task-5" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:88)
        at java.util.zip.ZipFile$ZipFileInflaterInputStream.<init>(ZipFile.java:393)
        at java.util.zip.ZipFile.getInputStream(ZipFile.java:374)
        at java.util.jar.JarFile.getManifestFromReference(JarFile.java:199)
        at java.util.jar.JarFile.getManifest(JarFile.java:180)
        at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:981)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.logging.log4j.core.impl.MutableLogEvent.getThrownProxy(MutableLogEvent.java:338)
        at org.apache.logging.log4j.core.pattern.ExtendedThrowablePatternConverter.format(ExtendedThrowablePatternConverter.java:61)
        at org.apache.logging.log4j.core.pattern.PatternFormatter.format(PatternFormatter.java:38)
        at org.apache.logging.log4j.core.layout.PatternLayout$PatternSerializer.toSerializable(PatternLayout.java:333)
        at org.apache.logging.log4j.core.layout.PatternLayout.toText(PatternLayout.java:232)
        at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:217)
        at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:57)
        at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:177)
        at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:170)
        at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:161)
        at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
        at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
        at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
        at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
        at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:448)
        at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:433)
        at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
Exception in thread "XNIO-1 I/O-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.HashMap.newNode(HashMap.java:1742)
        at java.util.HashMap.putVal(HashMap.java:630)
        at java.util.HashMap.put(HashMap.java:611)
        at java.util.HashSet.add(HashSet.java:219)
        at sun.nio.ch.EPollSelectorImpl.updateSelectedKeys(EPollSelectorImpl.java:131)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:98)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.xnio.nio.WorkerThread.run(WorkerThread.java:522)
Exception in thread "XNIO-1 task-9" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "XNIO-1 task-2" java.lang.OutOfMemoryError: GC overhead limit exceeded

system · July 19, 2017, 10:38am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Check all named graphs Support	9	340	February 3, 2021
Query performance issues Support	8	1246	February 28, 2017
Scalability issues in Stardog Cloud Support	16	413	November 8, 2024
Very different query plans for very similar queries, involving blank nodes Support	2	264	May 9, 2022
Query Throughtput Stardog Support	5	525	May 16, 2017

Query timeout, simple query, medium dataset

Related topics