Stardog crashing when uploading data

Hi,

We have a setup where about 60 applications are uploading data to Stardog using the HTTP API. They sometimes upload data simultaneously, and when Stardog is receiving data from several sources at the same time, it crashes with what appears to be memory issues. Sometimes a stacktrace is logged (two of which are attached below), and sometimes Java produces a core dump (one of which is also attached).

The Stardog version is 7.8.0. It runs on a server with 16GB RAM, of which Stardog is given

-Xms6g -Xmx6g -XX:MaxDirectMemorySize=12g

Any help, ideas, or feedback is appreciated, whether that's a solution or suggestions for how I can troubleshoot this further.


Below is output from stardog.log and other hopefully useful info.

The stardog.log file doesn't contain more info than this:

INFO  2022-07-07 05:07:33,586 [Stardog.Executor-101] com.complexible.stardog.index.Index:printInternal(319): Parsing triples: 99% complete in 00:04:03 (749K triples - 3.1K triples/sec)
malloc(): unsorted double linked list corrupted

where the malloc() line is the only info Stardog seems to log before crashing. Some other samples of the one-line error message are:

malloc(): smallbin double linked list corrupted
# or
corrupted size vs. prev_size
# or 
# [ timer expired, abort... ]
# or 
malloc(): invalid size (unsorted)

Stardog is logging the following memory options when it's starting:

INFO  2022-07-07 08:09:26,964 [main] com.complexible.stardog.cli.impl.ServerStart:call(266): Memory options
INFO  2022-07-07 08:09:26,964 [main] com.complexible.stardog.cli.impl.ServerStart:call(267): Memory mode: DEFAULT{Starrocks.block_cache=20, Starrocks.dict_block_cache=10, Native.starrocks=70, Heap.dict_value=50, Starrocks.txn_block_cache=5, Heap.dict_index=50, Starrocks.memtable=40, Starrocks.untracked_memory=20, Starrocks.buffer_pool=5, Native.query=30}
INFO  2022-07-07 08:09:26,965 [main] com.complexible.stardog.cli.impl.ServerStart:call(268): Min Heap Size: 6.0G
INFO  2022-07-07 08:09:26,966 [main] com.complexible.stardog.cli.impl.ServerStart:call(269): Max Heap Size: 5.8G
INFO  2022-07-07 08:09:26,966 [main] com.complexible.stardog.cli.impl.ServerStart:call(270): Max Direct Mem: 12G
INFO  2022-07-07 08:09:26,967 [main] com.complexible.stardog.cli.impl.ServerStart:call(271): System Memory: 15G

These are set with the JVM args

-Xms6g -Xmx6g -XX:MaxDirectMemorySize=12g

I've included two stacktraces that occurred during the upload and a Java core dump:

error_adding_data.txt (9.4 KB)
errpr_adding_data2.txt (20.9 KB)
hs_err_pid8735.log (184.4 KB)

  • server has 16g
  • -Xms6g -Xmx6g -XX:MaxDirectMemorySize=12g -> 6 + 12 = 18g

I recommend reserving 2g for the operating system and minor overflow, so bring MaxDirectMemorySize down to 8g --> 6 + 8 = 14g.
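
With that change, the server would be started with:

-Xms6g -Xmx6g -XX:MaxDirectMemorySize=8g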

What a simple solution! I thought the 12GB was the total memory and that the 6GB heap was taken out of that pool. I'll try changing the config.

If the problem continues, the follow-up question will be:

How many databases are the 60 applications feeding? All into one, each into a different one, or somewhere in between? It makes a difference for the RocksDB buffer allocations.

So I changed the JVM args to

-Xms6g -Xmx6g -XX:MaxDirectMemorySize=8g

as you suggested, restarted Stardog, and about 10 minutes later Stardog crashed again. So the problem continues.

There is only one database, and it "only" has 26 million triples.

Same error as before,

ERROR 2022-07-07 14:16:40,805 [stardog-user-11] com.complexible.stardog.db.DatabaseConnectionImpl:apply(796): There was an error adding data
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 2000
...

corrupted double-linked list

Single database. Thank you.

There is a known issue in 7.8 where internal iterators (query) are using oversized buffers. My colleague noted that the stack trace from your first crash was utilizing those impacted iterators. The preferred solution is to upgrade to 7.9. A potential workaround is to again lower all three memory parameters by 2g in your 7.8 installation. This will add another 4g of free space as cushion against the oversized buffers.
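
On your current settings, that would mean something along the lines of:

-Xms4g -Xmx4g -XX:MaxDirectMemorySize=6g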

Upgrading Stardog to 8.0 was already on my todo list, so I'll give that a try!

So I upgraded the server to 8.0.0 by simply running sudo apt-get install -y stardog=8.0.0. Stardog 8.0.0 gave me a new warning that it was using more than 90% of system memory, so I lowered the heap to 5g and tried uploading again, but Stardog still crashes.

This time the error was

ERROR 2022-07-07 15:45:33,226 [stardog-user-15] com.complexible.stardog.db.DatabaseConnectionImpl:apply(796): There was an error adding data
com.stardog.starrocks.StarrocksException: std::bad_alloc
...

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fe6d97d9add, pid=760, tid=0x00007fe622f19700
#
# JRE version: OpenJDK Runtime Environment (8.0_312-b07) (build 1.8.0_312-8u312-b07-0ubuntu1~20.04-b07)
# Java VM: OpenJDK 64-Bit Server VM (25.312-b07 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libStarrocks.so+0x834add]  rocksdb::LRUHandleTable::FindPointer(rocksdb::Slice const&, unsigned int)+0x3d

I'll try a clean install of 8.0.0 instead and see if that changes anything.

Hi Joel,

I am not sure a different installation would change things here, but it doesn't hurt to try. Could you do the following:

  1. attach the full dump file from 8.0
  2. describe the workload to help us reproduce the behaviour. For example, a previous stacktrace suggests that you sometimes delete a named graph from the data. The more details you include here, the faster we can figure out what's going on. Of course, if your code/data is open source, that would be ideal :slight_smile:

Thanks,
Pavel

Hi Pavel,

Thank you for reaching out.

The problem still persists in 8.0. A dump file along with a stacktrace is provided below. They are from separate crashes, but both occurred with Stardog 8.0.

About the workload, our setup is like this:
Each of these 60 applications uploads RDF describing an underlying system, to its own named graph in the Stardog database. As these systems change quite often, and we can't always tell what has changed, each application sends a clear request for its named graph before uploading the new data. So, with the HTTP API (API Reference | ReDoc), each application does the following:

  1. Begin a transaction
  2. Clear the data for its named graph (using /clear with the graph-uri query parameter)
  3. Upload the current version of the system to the cleared named graph. The system is split into smaller graphs and uploaded in separate requests. The graphs are in RDF/XML syntax, and each request should contain no more than one or two MB of data. This can amount to hundreds or even thousands of requests per application.
  4. Commit the transaction
  5. If anything goes wrong with the upload (from the application's perspective), roll back the transaction

Depending on the system, step 3 can take anywhere from a few seconds up to an hour. The data is uploaded every three hours.

Unfortunately, neither the code nor the data is open source. But the code is fairly simple, as it just sends HTTP requests with a given RDF/XML payload.
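
To give a rough idea, here is a simplified, hypothetical sketch of one application's upload cycle, written in Python with the requests library. Our clients don't actually look like this, and the host, database name, credentials, graph URI and the exact commit/rollback paths below are placeholders/assumptions rather than our real setup:

import requests

BASE = "http://stardog-host:5820"   # placeholder host and port
DB = "mydb"                         # placeholder database name
AUTH = ("user", "password")         # placeholder credentials
GRAPH = "urn:app:42"                # each application writes to its own named graph


def upload_system(rdf_xml_chunks):
    # 1. Begin a transaction; the transaction id comes back in the response body
    tx = requests.post(f"{BASE}/{DB}/transaction/begin", auth=AUTH).text.strip()

    try:
        # 2. Clear this application's named graph inside the transaction
        requests.post(f"{BASE}/{DB}/{tx}/clear",
                      params={"graph-uri": GRAPH},
                      auth=AUTH).raise_for_status()

        # 3. Upload the current version in many small requests (1-2 MB each)
        for chunk in rdf_xml_chunks:
            requests.post(f"{BASE}/{DB}/{tx}/add",
                          params={"graph-uri": GRAPH},
                          headers={"Content-Type": "application/rdf+xml"},
                          data=chunk,
                          auth=AUTH).raise_for_status()

        # 4. Commit the transaction (path as I recall it from the docs; may differ)
        requests.post(f"{BASE}/{DB}/transaction/commit/{tx}",
                      auth=AUTH).raise_for_status()
    except Exception:
        # 5. Roll back if anything goes wrong from the application's perspective
        requests.post(f"{BASE}/{DB}/transaction/rollback/{tx}", auth=AUTH)
        raise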

I'll just point out for clarity that for a single application, or even a few of them uploading at the same time, this setup works perfectly fine. It is when we increase the number of applications uploading data that Stardog begins to crash. If I manually trigger the uploads a few applications at a time, Stardog has no problem handling all of the data.

Stacktrace from a separate crash from 8.0:
bad_alloc.txt (9.4 KB)
Dump file from 8.0:
hs_err_pid761.log (220.9 KB)

OK, thanks. We will try to figure out what's going on. Can you also detail the environment, e.g. CPU, OS version, disk type, Java version, and whether it's running in a container or on bare metal?

Thanks,
Pavel

Hello,
Joel is on vacation and asked me to follow up on this.
We are running Stardog on an m5.xlarge instance on AWS.
It has 4 vCPUs and 16GB of memory.
lsb_release -a on the instance gives me:

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:        20.04
Codename:       focal

Disk type: 80GB gp2 SSD with 240 IOPS. No encryption is used at the moment.
Java version: 1.8.0_312
We are not running it in Docker or any other container.

Thank you. Just letting you know that I am actively working on the issue. Your input helps. Sadly, I do not yet have an explanation.

Is it possible to gather the logs from the server? Specifically, from the $STARDOG_HOME directory:

  • stardog.log
  • starrocks.log
  • data/LOG*

Hello,
Thanks for reaching out. Due to the limitations on uploading files, I had to split the logs into several zipped files.
logs.zip (7.5 MB)
LOG.old.zip (6.5 MB)
LOG.old03-05.zip (5.3 MB)
LOG.old06-09.zip (6.0 MB)
LOG.old10-13.zip (5.9 MB)

Thank you. Reviewing now.

Hi Amir,

A question: when your clients use the Stardog HTTP API to add/remove data, am I correct to assume that you use explicit /{db}/transaction/begin requests to begin a tx and then use /{db}/{txid}/add or /{db}/{txid}/remove requests to add/remove data within that tx? If so, is it ever the case that two or more clients are concurrently adding/removing data within the same transaction, i.e. using the same /{db}/{txid} part of the request?

Thanks,
Pavel

Hello Pavel and Matthew,

Each client starts by requesting "beginTransaction", then adds data, and ends by committing the transaction.
As long as we don't receive the same transactionId from /{db}/transaction/begin, and all the clients request transaction/begin individually (which they do), they shouldn't be adding/removing data within the same transaction, right?

Status update: Your logs show 3 distinct failure modes. My investigation has so far isolated two bugs that at least contribute to the failures. One bug might be the root cause, but we have not yet reproduced your exact failures. The fixes for the two bugs are already scheduled for the upcoming 8.0.1 maintenance release. The timing of that release is still to be determined.

We are willing to supply a replacement libStarrocks.so that contains the two fixes. Please let us know if you are interested in testing the fixes in your environment.