Stardog crashing when uploading data

Joel,

Just a quick update to say there's nothing new to report. I was on vacation from Friday to Wednesday of this week and am now back working the issue full-time.

You might want to consider upgrading to an 8-core machine while we continue to work the problem. It would reduce the chances of hitting the error. It's not a cure, just a band-aid.

Thank you for the router / gateway info. It supports my colleague's thoughts.

... that said, I have a test that just now caught a heap-management error that might be part of what you're experiencing. I can only hope. More on this Monday after I review.

Matthew

Yes, I finally have a reproduction of your crash conditions.

Joel,

We have a fix in test now. Might have a testing release for you Wednesday. It will be a full product release, not just libStarrocks.so. There were changes needed in the Java layer to better handle rollbacks that occur after a broken connection, i.e. after a router decides a connection has been idle too long.

Matthew

Sounds like great news! Looking forward to trying it out!

Hi Joel,

I have uploaded a recent nightly snapshot to a shared location. If you send me your email in a DM, I'll give you access.

Thanks,
Pavel

Thanks, I've sent you a DM with my email!

Brief update,

I've updated Stardog with the nightly build. I'll let it run overnight and we'll see what the logs say tomorrow.

So the nightly build looks promising: no crash seems to have occurred! The log shows no exceptions or errors during the night, so I triggered our clients to upload this morning. Below is the output from stardog.log for the most recent upload from our clients; the log now shows exceptions regarding transactions and rollbacks.

stardog-08-17.log (562.5 KB)

Thanks for letting us know, these messages are indeed expected. They're not always ideal, and we have an open ticket to improve the messaging, but they all have the same underlying cause: a tx was successfully rolled back by your client's concurrent rollback call, but the server thread that started the tx tried to continue and could not.
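
To make the sequence concrete, here's a schematic client-side sketch of the kind of race that produces those warnings (the endpoint paths, host/db and data are illustrative, not your client or our server code): one thread is still waiting on an add inside a tx while another path, e.g. a timeout handler, rolls that same tx back. If the rollback wins, the server thread handling the add can no longer continue the tx.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RollbackRace {

    static final HttpClient HTTP = HttpClient.newHttpClient();
    static final String BASE = "http://stardog:5820/mydb";   // placeholder server/db

    public static void main(String[] args) throws Exception {
        // begin a tx; the tx id is assumed to come back in the response body
        String txId = post(BASE + "/transaction/begin", "").body();

        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Thread 1: an add that takes the server a while to parse.
        pool.submit(() -> { post(BASE + "/" + txId + "/add", "<urn:s> <urn:p> <urn:o> ."); });

        // Thread 2: a cleanup path (e.g. a client-side timeout handler) rolls the
        // same tx back while the add is still in flight. If the rollback completes
        // first, the server thread still processing the add can no longer continue
        // the tx and logs the warning seen in stardog.log.
        pool.submit(() -> { post(BASE + "/transaction/rollback/" + txId, ""); });

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    static HttpResponse<String> post(String url, String body) {
        try {
            HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                    .timeout(Duration.ofMinutes(5))
                    .header("Content-Type", "text/turtle")   // only relevant for the add call
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            return HTTP.send(req, HttpResponse.BodyHandlers.ofString());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```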

Thanks for your patience and continued testing of this; please keep sending us reports if you experience crashes. The fix will be part of the 8.1 release in early September.

Best,
Pavel

It seems like this still doesn't solve the uploading issues, or at least not all of them. Looking through the clients' logs, Stardog still behaves strangely from time to time. For example, one of the clients sends its add-data requests successfully, and then milliseconds later gets the commit request rejected with the reason "There is no current transaction":

Could not commit transaction: 9ce61d2c-d603-46f3-ae3e-b657c389d337. Failed with status 400 and message {"code":"0D0TU2","message":"There is no current transaction"}

Just to be clear: if you mean that the transaction was already rolled back and the message simply isn't accurate, the client didn't send any rollback request.

Here is an example from a different upload with the same result: successful adds, a rejected commit, but with a different message:

Could not commit transaction: f3955d3b-0e60-4154-ba0c-43ea9a614359. Failed with status 400 and message {"code":"0D0TU2","message":"f3955d3b-0e60-4154-ba0c-43ea9a614359 is not a known transaction."}

On second thought, this could also be an issue with our clients. Perhaps Stardog is simply taking too long to answer, so the client repeats the request. Looking through stardog.log again, this could be the case, as the same transaction id occurs multiple times in the exceptions:

```
ubuntu@ip-10-1-3-55:~$ sudo cat /var/opt/stardog/stardog.log | grep -B 3 -A 15 f3955d3b-0e60-4154-ba0c-43ea9a614359
INFO  2022-08-17 07:41:14,893 [Stardog.Executor-1413] com.complexible.stardog.index.Index:printInternal(324): Parsing triples: 100% complete in 00:03:58 (257K triples - 1.1K triples/sec)
INFO  2022-08-17 07:41:14,941 [Stardog.Executor-1459] com.complexible.stardog.index.Index:printInternal(324): Parsing triples: 100% complete in 00:03:58 (210K triples - 0.9K triples/sec)
WARN  2022-08-17 07:41:15,399 [stardog-user-56] com.complexible.stardog.protocols.http.server.StardogUndertowErrorHandler:accept(68): Unexpected exception was handled by the server
com.complexible.stardog.server.UnknownTransactionException: f3955d3b-0e60-4154-ba0c-43ea9a614359 is not a known transaction.
        at com.complexible.stardog.db.DatabaseImpl.openConnection(DatabaseImpl.java:995) ~[stardog-8.1.0-SNAPSHOT-20220814.jar:?]
        at com.complexible.stardog.StardogKernel.getConnection(StardogKernel.java:1265) ~[stardog-8.1.0-SNAPSHOT-20220814.jar:?]
        at com.complexible.stardog.protocols.http.server.KernelHttpService$ConnectionOptions.connect(KernelHttpService.java:128) ~[stardog-protocols-http-server-8.1.0-SNAPSHOT-20220814.jar:?]
        at com.complexible.stardog.protocols.http.server.TransactionService.commitTransaction(TransactionService.java:188) ~[stardog-protocols-http-server-8.1.0-SNAPSHOT-20220814.jar:?]
        at com.stardog.http.server.undertow.jaxrs.ExtractRoutes.lambda$handleIt$5(ExtractRoutes.java:192) ~[stardog-protocols-http-server-8.1.0-SNAPSHOT-20220814.jar:?]
        at org.apache.shiro.subject.support.SubjectRunnable.doRun(SubjectRunnable.java:120) [shiro-core-1.8.0.jar:1.8.0]
        at org.apache.shiro.subject.support.SubjectRunnable.run(SubjectRunnable.java:108) [shiro-core-1.8.0.jar:1.8.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_342]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_342]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
INFO  2022-08-17 07:41:15,420 [Stardog.Executor-1461] com.complexible.stardog.index.Index:printInternal(324): Computing statistics: 13% complete in 00:00:20 (159.0K triples/sec)
WARN  2022-08-17 07:41:15,435 [stardog-user-56] com.complexible.stardog.protocols.http.server.StardogUndertowErrorHandler:accept(68): Unexpected exception was handled by the server
com.complexible.stardog.server.UnknownTransactionException: ec9f6ee8-896b-476e-b6be-518570727375 is not a known transaction.
        at com.complexible.stardog.db.DatabaseImpl.openConnection(DatabaseImpl.java:995) ~[stardog-8.1.0-SNAPSHOT-20220814.jar:?]
        at com.complexible.stardog.StardogKernel.getConnection(StardogKernel.java:1265) ~[stardog-8.1.0-SNAPSHOT-20220814.jar:?]
--
INFO  2022-08-17 07:41:18,750 [Stardog.Executor-1459] com.complexible.stardog.index.Index:printInternal(324): Parsing triples: 96% complete in 00:03:49 (291K triples - 1.3K triples/sec)
INFO  2022-08-17 07:41:18,925 [Stardog.Executor-1413] com.complexible.stardog.index.Index:printInternal(324): Parsing triples: 95% complete in 00:03:44 (310K triples - 1.4K triples/sec)
WARN  2022-08-17 07:41:19,028 [stardog-user-46] com.complexible.stardog.protocols.http.server.StardogUndertowErrorHandler:accept(68): Unexpected exception was handled by the server
com.complexible.stardog.server.UnknownTransactionException: f3955d3b-0e60-4154-ba0c-43ea9a614359 is not a known transaction.
        at com.complexible.stardog.db.DatabaseImpl.openConnection(DatabaseImpl.java:995) ~[stardog-8.1.0-SNAPSHOT-20220814.jar:?]
        at com.complexible.stardog.StardogKernel.getConnection(StardogKernel.java:1265) ~[stardog-8.1.0-SNAPSHOT-20220814.jar:?]
        at com.complexible.stardog.protocols.http.server.KernelHttpService$ConnectionOptions.connect(KernelHttpService.java:128) ~[stardog-protocols-http-server-8.1.0-SNAPSHOT-20220814.jar:?]
        at com.complexible.stardog.protocols.http.server.TransactionService.rollbackTransaction(TransactionService.java:206) ~[stardog-protocols-http-server-8.1.0-SNAPSHOT-20220814.jar:?]
        at com.stardog.http.server.undertow.jaxrs.ExtractRoutes.lambda$handleIt$5(ExtractRoutes.java:192) ~[stardog-protocols-http-server-8.1.0-SNAPSHOT-20220814.jar:?]
        at org.apache.shiro.subject.support.SubjectRunnable.doRun(SubjectRunnable.java:120) [shiro-core-1.8.0.jar:1.8.0]
        at org.apache.shiro.subject.support.SubjectRunnable.run(SubjectRunnable.java:108) [shiro-core-1.8.0.jar:1.8.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_342]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_342]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
INFO  2022-08-17 07:41:19,063 [Stardog.Executor-1459] com.complexible.stardog.index.Index:printInternal(324): Parsing triples: 100% complete in 00:02:42 (30K triples - 0.2K triples/sec)
INFO  2022-08-17 07:41:19,583 [Stardog.Executor-1457] com.complexible.stardog.index.Index:printInternal(324): Computing statistics: 14% complete in 00:00:24 (142.7K triples/sec)
INFO  2022-08-17 07:41:20,115 [Stardog.Executor-1413] com.complexible.stardog.index.Index:printInternal(324): Parsing triples: 100% complete in 00:02:30 (3.4K triples - 0.0K triples/sec)
INFO  2022-08-17 07:41:20,128 [Stardog.Executor-1467] com.complexible.stardog.index.Index:printInternal(324): Parsing triples: 100% complete in 00:04:14 (152K triples - 0.6K triples/sec)
INFO  2022-08-17 07:41:20,152 [Stardog.Executor-1469] com.complexible.stardog.index.Index:printInternal(324): Parsing triples: 100% complete in 00:02:51 (60K triples - 0.3K triples/sec)
```
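
For context, the commit/retry pattern in our clients is roughly the following (a simplified sketch with illustrative endpoint paths, timeouts and data, not our actual code). If the first commit succeeds server-side but the response is slow or lost, the retried commit would see exactly these "is not a known transaction" errors:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class NaiveCommitRetry {

    static final HttpClient HTTP = HttpClient.newHttpClient();
    static final String BASE = "http://stardog:5820/mydb";   // placeholder server/db

    public static void main(String[] args) throws Exception {
        // begin a transaction; the tx id is assumed to come back in the body
        // (auth headers omitted for brevity)
        String txId = post(BASE + "/transaction/begin", "", Duration.ofSeconds(30)).body();

        // add data inside the transaction
        post(BASE + "/" + txId + "/add", "<urn:s> <urn:p> <urn:o> .", Duration.ofMinutes(5));

        // commit with a timeout-based retry -- this is the suspect part:
        // if the first commit actually succeeded server-side but the response
        // was slow or lost, the retried commit hits a tx that no longer exists
        // and gets "There is no current transaction" / "... is not a known transaction"
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                HttpResponse<String> resp =
                        post(BASE + "/transaction/commit/" + txId, "", Duration.ofSeconds(10));
                System.out.println("commit attempt " + attempt + ": " + resp.statusCode()
                        + " " + resp.body());
                if (resp.statusCode() == 200) break;
            } catch (HttpTimeoutException e) {
                System.out.println("commit attempt " + attempt + " timed out, retrying...");
            }
        }
    }

    static HttpResponse<String> post(String url, String body, Duration timeout) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                .timeout(timeout)
                .header("Content-Type", "text/turtle")   // only relevant for the add call
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString());
    }
}
```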

It does look like the tx was concurrently closed. Technically there can only be three reasons: i) a commit, ii) a rollback, or iii) closure due to inactivity.
Rollback is usually the most common. If you enable DEBUG logging in log4j2.xml for com.complexible.stardog.db.DatabaseConnectionImpl, we should be able to see which one it was. Start with an empty stardog.log, though, because there will be a lot of output.
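
For example, something along these lines in log4j2.xml should do it (the appender name here is a placeholder; point it at whatever appender your existing config already defines):

```xml
<!-- add inside the <Loggers> section of log4j2.xml -->
<Logger name="com.complexible.stardog.db.DatabaseConnectionImpl" level="DEBUG" additivity="false">
    <!-- "stardogAppender" is a placeholder; use the appender name from your existing config -->
    <AppenderRef ref="stardogAppender"/>
</Logger>
```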

Yeah, it looks like you have different HTTP requests trying to commit and rollback the same tx, possibly concurrently. They both fail in the same way, which makes me think that an earlier request (a commit or a rollback) succeeded and closed the tx. The DEBUG log would show it (at the expense of some effort sifting through a lot of output).

Posting an update to let you know I haven't forgotten about this. I'm working on a change to our clients that adjusts when and how requests are repeated; see the sketch below. I'll let you know when I've tested it, and we can hopefully close this.
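
The gist of the change (a rough sketch with an assumed commit endpoint and placeholder names, not our production code): a timed-out commit, or a commit answered with "is not a known transaction", is treated as "outcome unknown" and verified, rather than blindly re-sent.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class SaferCommit {

    enum Outcome { COMMITTED, FAILED, OUTCOME_UNKNOWN }

    static final HttpClient HTTP = HttpClient.newHttpClient();
    static final String BASE = "http://stardog:5820/mydb";   // placeholder server/db

    // Send the commit exactly once. A timeout, or an "unknown transaction"
    // style reply, is reported as OUTCOME_UNKNOWN so the caller can verify
    // whether the data actually landed (e.g. with a SPARQL ASK) instead of
    // blindly re-sending the commit.
    static Outcome commitOnce(String txId) {
        HttpRequest req = HttpRequest.newBuilder(
                        URI.create(BASE + "/transaction/commit/" + txId))   // path assumed
                .timeout(Duration.ofMinutes(2))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        try {
            HttpResponse<String> resp = HTTP.send(req, HttpResponse.BodyHandlers.ofString());
            if (resp.statusCode() == 200) {
                return Outcome.COMMITTED;
            }
            if (resp.body().contains("is not a known transaction")
                    || resp.body().contains("There is no current transaction")) {
                // The tx is gone: an earlier attempt may already have committed it,
                // or the server closed it. Don't retry the commit; verify the data.
                return Outcome.OUTCOME_UNKNOWN;
            }
            return Outcome.FAILED;
        } catch (HttpTimeoutException e) {
            return Outcome.OUTCOME_UNKNOWN;   // ambiguous: the commit may have gone through
        } catch (Exception e) {
            return Outcome.FAILED;            // request most likely never reached the server
        }
    }
}
```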

Hi again,

After running our new clients for a while now, it seems safe to say the nightly build you provided solved the problem! The nightly build has other bugs, like Stardog Explorer not being able to show all the data, but as long as we can query the data with SPARQL, it's good enough until 8.1 is released.

Thank you Matthew, Pavel, and anyone else who worked on solving this; we really appreciate how much effort you've put into helping us.