Stardog crashing when uploading data

Hi,

We have a setup where about 60 applications are uploading data to Stardog using the HTTP API. They sometimes upload data simultaneously, and when Stardog is receiving data from several sources at the same time, it crashes with what appears to be memory issues. Sometimes a stacktrace is logged (two of which are attached below), and sometimes Java produces a core dump (one of which is also attached).

The Stardog version is 7.8.0. It runs on a server with 16GB RAM, of which Stardog is given

-Xms6g -Xmx6g -XX:MaxDirectMemorySize=12g

Any help, ideas, or feedback is appreciated, whether that's a solution or suggestions for how I can troubleshoot this further.


Below is output from stardog.log and other hopefully useful info.

The stardog.log file doesn't contain more info than this:

INFO  2022-07-07 05:07:33,586 [Stardog.Executor-101] com.complexible.stardog.index.Index:printInternal(319): Parsing triples: 99% complete in 00:04:03 (749K triples - 3.1K triples/sec)
malloc(): unsorted double linked list corrupted

where the malloc() line is the only info Stardog seems to log before crashing. Some other samples of the one-line error message are:

malloc(): smallbin double linked list corrupted
# or
corrupted size vs. prev_size
# or 
# [ timer expired, abort... ]
# or 
malloc(): invalid size (unsorted)

Stardog is logging the following memory options when it's starting:

INFO  2022-07-07 08:09:26,964 [main] com.complexible.stardog.cli.impl.ServerStart:call(266): Memory options
INFO  2022-07-07 08:09:26,964 [main] com.complexible.stardog.cli.impl.ServerStart:call(267): Memory mode: DEFAULT{Starrocks.block_cache=20, Starrocks.dict_block_cache=10, Native.starrocks=70, Heap.dict_value=50, Starrocks.txn_block_cache=5, Heap.dict_index=50, Starrocks.memtable=40, Starrocks.untracked_memory=20, Starrocks.buffer_pool=5, Native.query=30}
INFO  2022-07-07 08:09:26,965 [main] com.complexible.stardog.cli.impl.ServerStart:call(268): Min Heap Size: 6.0G
INFO  2022-07-07 08:09:26,966 [main] com.complexible.stardog.cli.impl.ServerStart:call(269): Max Heap Size: 5.8G
INFO  2022-07-07 08:09:26,966 [main] com.complexible.stardog.cli.impl.ServerStart:call(270): Max Direct Mem: 12G
INFO  2022-07-07 08:09:26,967 [main] com.complexible.stardog.cli.impl.ServerStart:call(271): System Memory: 15G

These are set with the JVM args

-Xms6g -Xmx6g -XX:MaxDirectMemorySize=12g

I've included two stacktraces that occurred during the upload and a Java core dump:

error_adding_data.txt (9.4 KB)
errpr_adding_data2.txt (20.9 KB)
hs_err_pid8735.log (184.4 KB)

  • server has 16g
  • -Xms6g -Xmx6g -XX:MaxDirectMemorySize=12g -> 6 + 12 = 18g

I recommend reserving 2g for the operating system and minor overflow, so bring MaxDirectMemorySize down to 8g --> 6 + 8 = 14g.
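
With that change, the server would be started with:

-Xms6g -Xmx6g -XX:MaxDirectMemorySize=8g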

What a simple solution! I thought the 12GB was the total memory and that the 6GB heap was taken out of that pool. I'll try changing the config.

If the problem continues, the follow-up question will be:

How many databases are the 60 applications feeding? All into one, each into a different one, or somewhere in between? It makes a difference for the RocksDB buffer allocations.

So I changed the JVM args to

-Xms6g -Xmx6g -XX:MaxDirectMemorySize=8g

as you suggested, restarted Stardog, and about 10 minutes later Stardog crashed again. So the problem continues.

There is only one database, and it "only" has 26 million triples.

Same error as before,

ERROR 2022-07-07 14:16:40,805 [stardog-user-11] com.complexible.stardog.db.DatabaseConnectionImpl:apply(796): There was an error adding data
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 2000
...

corrupted double-linked list

Single database. Thank you.

There is a known issue in 7.8 where internal iterators (query) are using oversized buffers. My colleague noted that the stack trace from your first crash was utilizing those impacted iterators. The preferred solution is to upgrade to 7.9. A potential workaround is to again lower all three memory parameters by 2g in your 7.8 installation. This will add another 4g of free space as cushion against the oversized buffers.
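
On your current settings, that would mean something along the lines of:

-Xms4g -Xmx4g -XX:MaxDirectMemorySize=6g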

Upgrading Stardog to 8.0 was already on my todo list, so I'll give that a try!

So I upgraded the server to 8.0.0 by simply running sudo apt-get install -y stardog=8.0.0. Stardog 8.0.0 gave me a new warning that it was using more than 90% of system memory, so I lowered the heap to 5g and tried uploading again, but Stardog still crashes.

This time the error was

ERROR 2022-07-07 15:45:33,226 [stardog-user-15] com.complexible.stardog.db.DatabaseConnectionImpl:apply(796): There was an error adding data
com.stardog.starrocks.StarrocksException: std::bad_alloc
...

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fe6d97d9add, pid=760, tid=0x00007fe622f19700
#
# JRE version: OpenJDK Runtime Environment (8.0_312-b07) (build 1.8.0_312-8u312-b07-0ubuntu1~20.04-b07)
# Java VM: OpenJDK 64-Bit Server VM (25.312-b07 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libStarrocks.so+0x834add]  rocksdb::LRUHandleTable::FindPointer(rocksdb::Slice const&, unsigned int)+0x3d

I'll try a clean install of 8.0.0 instead and see if that changes anything.

Hi Joel,

I am not sure a different installation would change things here, but it doesn't hurt to try. Could you do the following:

  1. attach the full dump file from 8.0
  2. describe the workload to help us reproduce the behaviour. For example, a previous stacktrace suggests that you sometimes delete a named graph from the data. The more details you include here, the faster we can figure out what's going on. Of course, if your code/data is open source, that would be ideal :slight_smile:

Thanks,
Pavel

Hi Pavel,

Thank you for reaching out.

The problem still persists in 8.0. A dump file along with a stacktrace is provided below. They are from separate crashes, but both occurred with Stardog 8.0.

About the workload, our setup is like this:
Each of these 60 applications uploads RDF describing an underlying system, to its own named graph in the Stardog database. As these systems change quite often, and we can't always tell what has changed, each application sends a clear request for its named graph before uploading the new data. So, with the HTTP API (API Reference | ReDoc), each application does the following:

  1. Begin a transaction
  2. Clear the data for its named graph (using /clear with the graph-uri query parameter)
  3. Upload the current version of the system to the cleared named graph. The system is split into smaller graphs and uploaded in separate requests. The graphs are in RDF/XML syntax, and each request should contain no more than one or two MB of data. This can amount to hundreds or even thousands of requests per application.
  4. Commit the transaction
  5. If anything goes wrong with the upload (from the application's perspective), roll back the transaction

Depending on the system, step 3 can take anywhere from a few seconds up to an hour. The data is uploaded every three hours.

Unfortunately, neither the code nor the data is open source. But the code is fairly simple, as it just sends HTTP requests with a given RDF/XML payload.
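
To give a rough idea, here is a simplified, hypothetical sketch of one application's upload cycle, written in Python with the requests library. Our clients don't actually look like this, and the host, database name, credentials, graph URI and the exact commit/rollback paths below are placeholders/assumptions rather than our real setup:

import requests

BASE = "http://stardog-host:5820"   # placeholder host and port
DB = "mydb"                         # placeholder database name
AUTH = ("user", "password")         # placeholder credentials
GRAPH = "urn:app:42"                # each application writes to its own named graph


def upload_system(rdf_xml_chunks):
    # 1. Begin a transaction; the transaction id comes back in the response body
    tx = requests.post(f"{BASE}/{DB}/transaction/begin", auth=AUTH).text.strip()

    try:
        # 2. Clear this application's named graph inside the transaction
        requests.post(f"{BASE}/{DB}/{tx}/clear",
                      params={"graph-uri": GRAPH},
                      auth=AUTH).raise_for_status()

        # 3. Upload the current version in many small requests (1-2 MB each)
        for chunk in rdf_xml_chunks:
            requests.post(f"{BASE}/{DB}/{tx}/add",
                          params={"graph-uri": GRAPH},
                          headers={"Content-Type": "application/rdf+xml"},
                          data=chunk,
                          auth=AUTH).raise_for_status()

        # 4. Commit the transaction (path as I recall it from the docs; may differ)
        requests.post(f"{BASE}/{DB}/transaction/commit/{tx}",
                      auth=AUTH).raise_for_status()
    except Exception:
        # 5. Roll back if anything goes wrong from the application's perspective
        requests.post(f"{BASE}/{DB}/transaction/rollback/{tx}", auth=AUTH)
        raise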

I'll just point out for clarity that for a single application, or even a few of them uploading at the same time, this setup works perfectly fine. It is when we increase the number of applications uploading data that Stardog begins to crash. If I manually trigger the uploads a few applications at a time, Stardog has no problem handling all of the data.

Stacktrace from a separate crash from 8.0:
bad_alloc.txt (9.4 KB)
Dump file from 8.0:
hs_err_pid761.log (220.9 KB)

OK, thanks. We will try to figure out what's going on. Can you also detail the environment, e.g. CPU, OS version, disk type, Java version, and whether it's running in a container or on bare metal?

Thanks,
Pavel

Hello,
Joel is on vacation and asked me to follow up on this.
We are running Stardog on an m5.xlarge instance on AWS.
It has 4 vCPUs and 16GB of memory.
lsb_release -a on the instance gives me:

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:        20.04
Codename:       focal

Disk type: 80GB gp2 SSD with 240 IOPS. No encryption is used at the moment.
Java version: 1.8.0_312
We are not running it in Docker or any other container.

Thank you. Just letting you know that I am actively working on the issue. Your input helps. Sadly, I do not yet have an explanation.

Is it possible to gather the logs from the server? Specifically, from the $STARDOG_HOME directory:

  • stardog.log
  • starrocks.log
  • data/LOG*

Hello,
Thanks for reaching out. Due to the limitations on uploading files, I had to split the logs into several zipped files.
logs.zip (7.5 MB)
LOG.old.zip (6.5 MB)
LOG.old03-05.zip (5.3 MB)
LOG.old06-09.zip (6.0 MB)
LOG.old10-13.zip (5.9 MB)

Thank you. Reviewing now.

Hi Amir,

A question: when your clients use the Stardog HTTP API to add/remove data, am I correct to assume that you use explicit /{db}/transaction/begin requests to begin a tx and then use /{db}/{txid}/add or /{db}/{txid}/remove requests to add/remove data within that tx? If so, is it ever the case that two or more clients are concurrently adding/removing data within the same transaction, i.e. using the same /{db}/{txid} part of the request?

Thanks,
Pavel

Hello Pavel and Matthew,

Each client starts by requesting "beginTransaction", then adds data, and ends by committing the transaction.
As long as we don't receive the same transactionId from /{db}/transaction/begin, and all the clients request transaction/begin individually (which they do), they shouldn't be adding/removing data within the same transaction, right?

Status update: Your logs show 3 distinct failure modes. My investigation has so far isolated two bugs that at least contribute to the failures. One bug might be the root cause, but we have not yet reproduced your exact failures. The fixes for the two bugs are already scheduled for the upcoming 8.0.1 maintenance release. The timing of that release is still to be determined.

We are willing to supply a replacement libStarrocks.so that contains the two fixes. Please let us know if you are interested in testing the fixes in your environment.