“GC overhead limit exceeded” on local DBpedia SPARQL endpoint

Hi all,

I just want to reopen the question asked in this previous related issue:

As we suggest in pyrdf2vec to set up a local SPARQL endpoint using Stardog, several people might run into this issue (cf. this blog post).

Currently, we send simple SELECT queries to our local SPARQL endpoint:

SELECT ?p ?o ?dt WHERE { BIND(IRI("' + noi + '") AS ?s) ?s ?p ?o . BIND(datatype(?o) AS ?dt) }

As the node of interest (noi) varies a lot, we just use string concatenation to build the query.
Together with our host description, we send requests with a Python HTTP client to:

self.host + '/' + self.db + '/query?query=' + query
headers=[("Accept", "application/sparql-results+json")]

As we have a long list of nodes of interest, all of our code can be run in a multiprocessing fashion, roughly as sketched below.
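To make this concrete, here is a minimal sketch of that client loop (the host, database name and node list are placeholders, and I'm using the well-known requests library here for illustration rather than our actual HTTP client):

import multiprocessing

import requests  # for illustration; our real code uses a different HTTP client

HOST = "http://localhost:5820"  # placeholder host
DB = "dbpedia"                  # placeholder database name

def fetch(noi):
    """Run the per-node SELECT query and return the parsed SPARQL JSON response."""
    query = ('SELECT ?p ?o ?dt WHERE { BIND(IRI("' + noi + '") AS ?s) '
             '?s ?p ?o . BIND(datatype(?o) AS ?dt) }')
    resp = requests.get(
        HOST + "/" + DB + "/query",
        params={"query": query},  # URL-encodes the query string for us
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    nodes = ["http://dbpedia.org/resource/Ghent"]  # placeholder list of nodes of interest
    with multiprocessing.Pool() as pool:           # one worker process per core by default
        results = pool.map(fetch, nodes)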

When making many of these HTTP requests from multiple processes, we notice that after a while the response rate of the local SPARQL endpoint starts to decrease. Using the stardog-admin server status command, I can see the rate/sec dropping over time, and the Memory Heap value gets close to the configured 8G.

As mentioned in the related issue, our script stops after a while with the following errors:

Unexpected error encountered: invalid http version, `-Error-Code: 000012`
Unexpected error encountered: invalid http version, `Stardog-Error-Message: GC overhead limit exceeded`

At that point, I can no longer stop the Stardog server with the stardog-admin server stop command and have to kill the Java process manually.

We did not notice these issues when using a single core (although it then takes a long while to get all the results).

Hi Bram,

Can you collect the heap dump when the server runs out of memory (using -XX:+HeapDumpOnOutOfMemoryError) and share it with us? You can run the server with less heap so the dump is smaller and the OOM happens faster. You can put it up somewhere like Dropbox or Google Drive.
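For example, something along these lines before starting the server (assuming you pass JVM options through the STARDOG_SERVER_JAVA_ARGS environment variable; adjust the heap size and dump path as convenient):

export STARDOG_SERVER_JAVA_ARGS="-Xms4g -Xmx4g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"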

Also, a couple of questions:

  • what Stardog version are you using?
  • does your client close result sets and connections?

Best,
Pavel

Hi Pavel,

Thanks for your quick reply.

I'm using Stardog 7.4.5.
It might indeed be a problem that our client does not properly close the connection.
We are using the faster_than_request Python library to make these requests, and I'm not sure whether it handles this properly.

I have now switched to another implementation and will also check our setup with the pystardog client, to determine whether the problem is related to the Stardog server or to our current client setup.

Best,
Bram

So I changed some of our application code to use pystardog, to make sure the connections are closed; roughly, each request now goes through the sketch below.
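A minimal sketch of the new per-node query (endpoint and credentials are placeholders, and I'm assuming pystardog's Connection/select API here):

import stardog  # pystardog

conn_details = {
    'endpoint': 'http://localhost:5820',  # placeholder endpoint
    'username': 'admin',                  # placeholder credentials
    'password': 'admin',
}

def fetch(noi):
    query = ('SELECT ?p ?o ?dt WHERE { BIND(IRI("' + noi + '") AS ?s) '
             '?s ?p ?o . BIND(datatype(?o) AS ?dt) }')
    # the context manager closes the underlying connection when the block exits
    with stardog.Connection('dbpedia', **conn_details) as conn:
        return conn.select(query)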
The OOM issue might be resolved now (not sure, still testing), but I still notice a large drop in query performance.
With a freshly restarted server:

Backup Storage Directory : .backup
CPU Load                 : 95.78 %
Connection Timeout       : 10m
Export Storage Directory : .exports
Memory Heap              : 5.3G (Max: 7.9G)
Memory Mode              : DEFAULT{Starrocks.dict_block_cache=10, Starrocks.block_cache=20, Native.starrocks=70, Heap.dict_value=50, Starrocks.txn_block_cache=5, Heap.dict_index=50, Starrocks.untracked_memory=20, Starrocks.memtable=40, Starrocks.buffer_pool=5, Native.query=30}
Memory Query Blocks      :   0B (Max: 8.5G)
Memory RSS               : 15G
Named Graph Security     : false
Platform Arch            : amd64
Platform OS              : Linux 4.15.0-23-generic, Java 1.8.0_275
Query All Graphs         : false
Query Timeout            : 5m
Security Disabled        : true
Stardog Home             : /users/bsteenwi
Stardog Version          : 7.4.5
Strict Parsing           : false
Uptime                   : 3 minutes 51 seconds
Databases                :
+----------+-------------+------------------------------------------------------+-----------------------------------------------+
|          |             |                     Transactions                     |                    Queries                    |
+----------+-------------+-------+-------+-----------+---------------+----------+---------+---------+---------------+-----------+
| Database | Connections | Open  | Total | Avg. Size | Avg. Time (s) | Rate/sec | Running |  Total  | Avg. Time (s) | Rate/sec  |
+----------+-------------+-------+-------+-----------+---------------+----------+---------+---------+---------------+-----------+
| dbpedia  |          18 |     0 |     0 |         0 |         0.000 |    0.000 |       6 | 329,360 |         0.010 | 1,466.114 |
+----------+-------------+-------+-------+-----------+---------------+----------+---------+---------+---------------+-----------+

But after a while:

Backup Storage Directory : .backup
CPU Load                 : 90.88 %
Connection Timeout       : 10m
Export Storage Directory : .exports
Memory Heap              : 7.5G (Max: 7.8G)
Memory Mode              : DEFAULT{Starrocks.dict_block_cache=10, Starrocks.block_cache=20, Native.starrocks=70, Heap.dict_value=50, Starrocks.txn_block_cache=5, Heap.dict_index=50, Starrocks.untracked_memory=20, Starrocks.memtable=40, Starrocks.buffer_pool=5, Native.query=30}
Memory Query Blocks      :   0B (Max: 8.5G)
Memory RSS               : 16G
Named Graph Security     : false
Platform Arch            : amd64
Platform OS              : Linux 4.15.0-23-generic, Java 1.8.0_275
Query All Graphs         : false
Query Timeout            : 5m
Security Disabled        : true
Stardog Home             : /users/bsteenwi
Stardog Version          : 7.4.5
Strict Parsing           : false
Uptime                   : 18 minutes 24 seconds
Databases                :
+----------+-------------+------------------------------------------------------+----------------------------------------------+
|          |             |                     Transactions                     |                   Queries                    |
+----------+-------------+-------+-------+-----------+---------------+----------+---------+---------+---------------+----------+
| Database | Connections | Open  | Total | Avg. Size | Avg. Time (s) | Rate/sec | Running |  Total  | Avg. Time (s) | Rate/sec |
+----------+-------------+-------+-------+-----------+---------------+----------+---------+---------+---------------+----------+
| dbpedia  |          13 |     0 |     0 |         0 |         0.000 |    0.000 |      18 | 542,516 |         0.215 |  494.014 |
+----------+-------------+-------+-------+-----------+---------------+----------+---------+---------+---------------+----------+

That's interesting. The database is not being updated, right?

I'd suggest printing GC info, and if you see that GC activity increases over time, share heap dumps with us.
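On Java 8 something like the following options would do, again assuming they are passed via STARDOG_SERVER_JAVA_ARGS:

export STARDOG_SERVER_JAVA_ARGS="$STARDOG_SERVER_JAVA_ARGS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/stardog-gc.log"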

Best,
Pavel

Hi Pavel,

No updates, just select queries.

Kind regards,
Bram

I don't really see any evidence of leaks or GC-related issues in this heap dump. Strongly reachable objects amount to only about 2.5GB, so there's a lot of room to play with.

Do you see increased GC activity when running your test? If not, but queries still get slower, we'll need to resort to profiling (I personally prefer async-profiler) to see how the time distribution changes over time. We can take a look at that if you describe the exact sequence of steps to reproduce the behaviour (might take a while).

Thanks,
Pavel

I suspect I had an OOM last night:

https://drive.google.com/file/d/18wU8qWKEgjkBk8Ad-VDG4w2Z6O9LHbnX/view?usp=sharing

java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to java_pid12785.hprof ...
Heap dump file created [30495245066 bytes in 114.905 secs]
WARN  2021-05-06 18:17:53,570 [stardog-user-11] com.complexible.stardog.protocols.http.server.StardogUndertowErrorHandler:accept(65): Unexpected exception was handled by the server
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) ~[?:1.8.0_275]
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) ~[?:1.8.0_275]
	at io.undertow.server.DefaultByteBufferPool.allocate(DefaultByteBufferPool.java:148) ~[undertow-core-2.2.1.Final.jar:2.2.1.Final]
	at io.undertow.conduits.DeflatingStreamSinkConduit.deflateData(DeflatingStreamSinkConduit.java:483) ~[undertow-core-2.2.1.Final.jar:2.2.1.Final]
	at io.undertow.conduits.DeflatingStreamSinkConduit.write(DeflatingStreamSinkConduit.java:137) ~[undertow-core-2.2.1.Final.jar:2.2.1.Final]
	at io.undertow.conduits.DeflatingStreamSinkConduit.write(DeflatingStreamSinkConduit.java:158) ~[undertow-core-2.2.1.Final.jar:2.2.1.Final]
	at org.xnio.conduits.ConduitStreamSinkChannel.write(ConduitStreamSinkChannel.java:158) ~[xnio-api-3.8.0.Final.jar:3.8.0.Final]
	at io.undertow.channels.DetachableStreamSinkChannel.write(DetachableStreamSinkChannel.java:179) ~[undertow-core-2.2.1.Final.jar:2.2.1.Final]
	at io.undertow.server.HttpServerExchange$WriteDispatchChannel.write(HttpServerExchange.java:2146) ~[undertow-core-2.2.1.Final.jar:2.2.1.Final]
	at org.xnio.channels.Channels.writeBlocking(Channels.java:202) ~[xnio-api-3.8.0.Final.jar:3.8.0.Final]
	at io.undertow.io.UndertowOutputStream.write(UndertowOutputStream.java:164) ~[undertow-core-2.2.1.Final.jar:2.2.1.Final]
	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) ~[?:1.8.0_275]
	at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282) ~[?:1.8.0_275]
	at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125) ~[?:1.8.0_275]
	at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207) ~[?:1.8.0_275]
	at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator._flushBuffer(WriterBasedJsonGenerator.java:2016) ~[jackson-core-2.11.2.jar:2.11.2]
	at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator._writeFieldName(WriterBasedJsonGenerator.java:174) ~[jackson-core-2.11.2.jar:2.11.2]
	at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator.writeFieldName(WriterBasedJsonGenerator.java:152) ~[jackson-core-2.11.2.jar:2.11.2]
	at com.fasterxml.jackson.core.JsonGenerator.writeStringField(JsonGenerator.java:1691) ~[jackson-core-2.11.2.jar:2.11.2]
	at com.stardog.stark.query.io.json.AbstractSPARQLJSONWriter.writeValue(AbstractSPARQLJSONWriter.java:233) ~[stardog-stark-query-json-7.4.5.jar:?]
	at com.stardog.stark.query.io.json.SPARQLJSONSelectResultWriter.handle(SPARQLJSONSelectResultWriter.java:92) ~[stardog-stark-query-json-7.4.5.jar:?]
	at com.stardog.stark.query.io.json.SPARQLJSONSelectResultWriter.handle(SPARQLJSONSelectResultWriter.java:32) ~[stardog-stark-query-json-7.4.5.jar:?]
	at com.stardog.stark.query.io.QueryResultWriters$$Lambda$888/1754351350.accept(Unknown Source) ~[?:?]
	at com.stardog.stark.query.ClosingSpliterator.forEachRemaining(ClosingSpliterator.java:38) ~[stardog-stark-query-api-7.4.5.jar:?]
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) ~[?:1.8.0_275]
	at com.stardog.stark.query.io.QueryResultWriters.write(QueryResultWriters.java:142) ~[stardog-stark-query-io-7.4.5.jar:?]
	at com.stardog.stark.query.io.QueryResultWriters.write(QueryResultWriters.java:127) ~[stardog-stark-query-io-7.4.5.jar:?]
	at com.complexible.stardog.protocols.http.server.ProtocolUtils.writeTupleResponse(ProtocolUtils.java:705) ~[stardog-protocols-http-server-7.4.5.jar:?]
	at com.complexible.stardog.protocols.http.server.ProtocolUtils.executeReadQuery(ProtocolUtils.java:597) ~[stardog-protocols-http-server-7.4.5.jar:?]
	at com.complexible.stardog.protocols.http.server.ProtocolUtils.executeReadQuery(ProtocolUtils.java:568) ~[stardog-protocols-http-server-7.4.5.jar:?]
	at com.complexible.stardog.protocols.http.server.ProtocolUtils.executeReadQuery(ProtocolUtils.java:526) ~[stardog-protocols-http-server-7.4.5.jar:?]
	at com.complexible.stardog.protocols.http.server.SPARQLProtocol.executeQuery(SPARQLProtocol.java:227) ~[stardog-protocols-http-server-7.4.5.jar:?]
Exception in thread "Stardog.Executor-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOf(Arrays.java:3308)
	at com.codahale.metrics.UniformSnapshot.getValues(UniformSnapshot.java:93)
	at com.stardog.metrics.KGMetricsModule$ReachMeasurer.handleEvent(KGMetricsModule.java:329)
	at com.stardog.metrics.KGMetricsModule$QueryEventMeasurer.lambda$handleEvent$1(KGMetricsModule.java:407)
	at com.stardog.metrics.KGMetricsModule$QueryEventMeasurer$$Lambda$892/1120522861.accept(Unknown Source)
	at java.util.Optional.ifPresent(Optional.java:159)
	at com.stardog.metrics.KGMetricsModule$QueryEventMeasurer.handleEvent(KGMetricsModule.java:407)
	at com.stardog.metrics.KGMetricsModule$FilteredEventMeasurer.handleEvent(KGMetricsModule.java:350)
	at com.stardog.metrics.KGMetricsModule$KGMetricsKernelModule.lambda$null$5(KGMetricsModule.java:235)
	at com.stardog.metrics.KGMetricsModule$KGMetricsKernelModule$$Lambda$799/1363006922.accept(Unknown Source)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1556)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
	at com.stardog.metrics.KGMetricsModule$KGMetricsKernelModule.lambda$handleEvent$6(KGMetricsModule.java:235)
	at com.stardog.metrics.KGMetricsModule$KGMetricsKernelModule$$Lambda$795/51722490.run(Unknown Source)
	at com.complexible.stardog.security.ShiroUtils.lambda$executeAs$0(ShiroUtils.java:284)
	at com.complexible.stardog.security.ShiroUtils$$Lambda$796/1637751568.get(Unknown Source)
	at com.complexible.stardog.security.ShiroUtils.executeAs(ShiroUtils.java:243)
	at com.complexible.stardog.security.ShiroUtils.executeAs(ShiroUtils.java:283)
	at com.complexible.stardog.security.ShiroUtils.executeAsSuperUser(ShiroUtils.java:279)
	at com.stardog.metrics.KGMetricsModule$KGMetricsKernelModule.handleEvent(KGMetricsModule.java:231)
	at com.complexible.stardog.KernelEventNotifier$Wrapper.handle(KernelEventNotifier.java:153)
	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.google.common.eventbus.Subscriber.invokeSubscriberMethod(Subscriber.java:87)
Exception in thread "Stardog.Executor-5" java.lang.OutOfMemoryError: GC overhead limit exceeded

Our setup is rather straightforward:
We just install a DBpedia dump as described in our blog post above.
Then we pick some cities from DBpedia and execute the following steps (a rough sketch follows the list):

  1. Execute the defined SPARQL query for a given city
  2. For all resulting objects that are URIs, execute the same SPARQL query again, now for that object
  3. Stop at a certain predefined depth (e.g. 3)
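A sketch of that traversal (the per-node SELECT is repeated here for completeness; the endpoint, seed city and depth are placeholders):

import requests  # for illustration

ENDPOINT = "http://localhost:5820/dbpedia/query"  # placeholder endpoint

def fetch(noi):
    """Per-node SELECT query, returning the SPARQL JSON bindings."""
    query = ('SELECT ?p ?o ?dt WHERE { BIND(IRI("' + noi + '") AS ?s) '
             '?s ?p ?o . BIND(datatype(?o) AS ?dt) }')
    resp = requests.get(ENDPOINT, params={"query": query},
                        headers={"Accept": "application/sparql-results+json"})
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

def crawl(seeds, max_depth=3):
    """Breadth-first walk: query each node, follow URI objects, stop at max_depth."""
    visited = set(seeds)
    frontier = list(seeds)
    for _ in range(max_depth):
        next_frontier = []
        for noi in frontier:
            for binding in fetch(noi):
                obj = binding["o"]
                if obj["type"] == "uri" and obj["value"] not in visited:
                    visited.add(obj["value"])
                    next_frontier.append(obj["value"])
        frontier = next_frontier
    return visited

crawl(["http://dbpedia.org/resource/Ghent"], max_depth=3)  # placeholder seed city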

I can provide the scripts to test this.

The first few thousand queries are really fast (as if some of the results were already cached from previous runs, even though we restarted the server).

Best,
Bram

Yes, this is an OOM for sure. Let's do one of the following:

  • reproduce the same on a smaller heap (let's say up to 8G) so it's easier for us to analyse. I don't see why that use case would need so much heap.
  • provide the scripts to reproduce the issue on our side

Reproducing with 7.6.2 (or 7.6.3, which should be released later today) would make sense too.

Thanks,
Pavel

Hi Pavel,

A new OOM with Stardog 7.6.2 (smaller heap size):

I have also attached the script that causes this OOM crash:
bench_star.py.txt (11.0 KB)

Kind regards,
and thanks in advance,
Bram

Thanks, this is useful. There's definitely a problem with queries not being properly terminated and GC'd, and thus accumulating on the heap, but we need to look a little deeper into it.

Best,
Pavel

Hi Pavel,

If we need to provide some additional scripts/code, please let me know.

Best,
Bram

I can run the script, but it would help if you pointed me to the DBpedia dump that you're using. I have a small version locally, but your script terminates after running about 11K queries, so I guess you're using a bigger subset.

Thanks,
Pavel

PS. I'm not great at reading Python, but it looks like your script reuses the same connection for multiple concurrent queries, doesn't it? This is generally discouraged. For Java clients we recommend client-side connection pooling to avoid creating a connection every time while still avoiding sharing a connection across threads. It would also be good to close result sets. These are just general suggestions; I don't yet see how not following them would lead to that kind of heap dump.

Hi Pavel,

I'm using the 2015-10 dump (only the en version, following these steps).

I also noticed our script was reusing the same connection, but I see similar behaviour when closing it directly and using fewer cores. I'm not sure how we can close the result set in Python; maybe I should start refreshing my Java skills :smile:
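For what it's worth, giving each worker process its own connection (instead of sharing one) looks roughly like this in our setup (again a sketch with placeholder details, assuming pystardog):

import multiprocessing

import stardog  # pystardog

CONN_DETAILS = {'endpoint': 'http://localhost:5820',  # placeholder endpoint
                'username': 'admin', 'password': 'admin'}

_conn = None  # one connection per worker process, never shared between processes

def init_worker():
    global _conn
    _conn = stardog.Connection('dbpedia', **CONN_DETAILS)

def run_query(noi):
    query = ('SELECT ?p ?o ?dt WHERE { BIND(IRI("' + noi + '") AS ?s) '
             '?s ?p ?o . BIND(datatype(?o) AS ?dt) }')
    return _conn.select(query)  # select() returns the fully read result

if __name__ == '__main__':
    nodes = ["http://dbpedia.org/resource/Ghent"]  # placeholder
    with multiprocessing.Pool(initializer=init_worker) as pool:
        results = pool.map(run_query, nodes)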

Best,
Bram

One quick question: do you run Stardog server on the same computer that runs the client script?

No, two separate nodes.

(so not related to this issue anymore)

OK, I think I have reproduced the problem. The issue is pretty specific: it requires extremely fast queries arriving with a high degree of concurrency, such that the queries complete faster than a background thread can perform one particular kind of cleanup. I will create a ticket for it.

As a workaround I can suggest two options:

  • less concurrency (how much less depends on the number of cores on your machine)
  • add kernel.event.dispatch.async=false to your stardog.properties and restart the server. The queries will probably run a little slower (since the cleanup will be synchronous), but that should eliminate the backlog.

Let us know if that helps,
Pavel

Hi Pavel,

kernel.event.dispatch.async=false does the trick!

Thanks for all the help.

Best,
Bram
