Stardog crashing when uploading data

Hello Matthew,
Sorry for the delayed reply.
I would love to try it out and see if it works. :slightly_smiling_face:

Amir, I have sent you a direct message with information on how to retrieve a pre-release version of the fix.

Please continue discussions in this public forum once you have a chance to retrieve and install the file.

Good luck,
Matthew

Hello Matthew,
Thank you for the information.
I started the Stardog service with the new release and will check the server later today to see if it still crashes.
I'll get back here tomorrow if we haven't had any crashes, and sooner if it still crashes.

Thanks!

If it crashes, please include the set of log files like before.

Hello Matthew,
I checked the server this morning and saw that it crashed again during the night.
logs.zip (1.8 MB)

Sad. Thank you for the logs.

Amir,

I have posted another libStarrocks.so. The download information is again part of a direct message.

Your recent logs show all crashes occurring at one place, not at several places as in the previous logs. I believe this to be progress and a strong suggestion of a thread timing problem within our code. This new library makes one subtle change with that in mind.

Again, your logs have been extremely helpful given that your data and source code are not available for public release / viewing. Please gather stardog.log and starrocks.log again after the new library runs.

That said, I have not yet reproduced your crash. It would be helpful if you could give a more detailed sequence of the API calls your client is making.

Thank you,
Matthew

Hi Matthew,

We updated Stardog with the new libStarrocks.so last Friday. Unfortunately Stardog still crashes. Below are the logs from the weekend.

I'll try to provide a more detailed sequence.

  • Each application has a constant variable with its named graph URI, called namedGraphURI. Each named graph URI is unique, based on a UUID.
  • Each application does the following when uploading data to Stardog:
  1. Begin a transaction using {db}/transaction/begin. Save the transaction ID returned from Stardog into a variable, let's call it transactionID
  2. Clear data using {db}/transactionID/clear with graph-uri=namedGraphURI
  3. Add batches of data with {db}/transactionID/add with "Content-Type" set to "application/rdf+xml" and graph-uri=namedGraphURI. Request body is rdf/xml content as a string. No compression is used.
  4. Repeat step 3 until all data is uploaded
  5. Commit transaction with {db}/transaction/commit/transactionID

If any step fails, a rollback request is sent with {db}/transaction/rollback/transactionID, and the uploading ends.

After the execution is done, successful or not, the application will wait 3 hours before uploading again.
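For readers following along, the upload cycle described above can be sketched roughly as below. This is only an illustration in Python, not the actual application code (which uses Unirest for Java). The `send` transport, the `StepFailed` exception, and the "HTTP 200 means success" check are assumptions made for the sketch; the request paths and the graph-uri parameter are copied from the steps as listed.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical transport: send(path, params, headers, body) -> (status, text).
# A real client would issue an HTTP request here.
Send = Callable[[str, dict, dict, Optional[str]], Tuple[int, str]]


class StepFailed(Exception):
    """Raised when any step of the upload does not succeed."""


def _ok(response: Tuple[int, str]) -> str:
    status, text = response
    if status != 200:  # assumption: success is reported as HTTP 200
        raise StepFailed(status)
    return text


def upload(send: Send, db: str, named_graph_uri: str, batches: Iterable[str]) -> bool:
    """Run one upload cycle. Returns True on commit, False on failure."""
    tx_id = None
    graph = {"graph-uri": named_graph_uri}
    try:
        # Step 1: begin a transaction and keep the returned ID.
        tx_id = _ok(send(f"{db}/transaction/begin", {}, {}, None)).strip()
        # Step 2: clear the application's named graph.
        _ok(send(f"{db}/{tx_id}/clear", graph, {}, None))
        # Steps 3 and 4: add uncompressed RDF/XML batches until all data is sent.
        for rdf_xml in batches:
            headers = {"Content-Type": "application/rdf+xml"}
            _ok(send(f"{db}/{tx_id}/add", graph, headers, rdf_xml))
        # Step 5: commit.
        _ok(send(f"{db}/transaction/commit/{tx_id}", {}, {}, None))
        return True
    except StepFailed:
        # Roll back only if a transaction was actually opened.
        if tx_id is not None:
            send(f"{db}/transaction/rollback/{tx_id}", {}, {}, None)
        return False
```

Passing the transport in as a function keeps the sequence easy to replay against a stub or a simulator. Network-level exceptions are not handled here; a real client would presumably treat them as failed steps too.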

data_logs_22-25.zip (4.7 MB)
stardog_logs_22-25.zip (4.3 MB)

Joel,

Thank you for the update. The crashes are now from entirely new and different places. I am still analyzing.

I have a simulator already that mostly matches your 5 step packet sequence. My step 3 only happens once per transaction, and it uses gzip data. I will update my simulator.

Do you know if your 5 steps happen over a single TCP connection to the server, or does each HTTP API call start an independent TCP connection? I am currently using one connection for all five steps. That could be a significant difference.

Thank you,
Matthew

The application has a connection pool of 200, with a limit of 20 connections per HTTP route/target host. (Using the basic Unirest for Java)

Ok. But for a given transaction will only one connection be used, or possibly several? i.e. one connection for the entire set of transaction steps, or maybe one connection per step?

(P.S. just looked at unirest. it does not explicitly state one way or the other. maybe the easiest answer is to run a packet trace that does not include data/payload, such as:

sudo tcpdump port 5820

thank you again, Matthew)

(P.P.S. are you using unirest's asynchronous capability such that the one or more API calls in your step 3 might actually overlap?)
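One rough way to read such a trace is to count the distinct client endpoints (address and port) that talk to port 5820: if all steps of a transaction come from a single endpoint, they most likely share one TCP connection. The sketch below assumes tcpdump's usual one-line text output with numeric addresses (for example from running it with -n); the regular expression and the function name `client_endpoints` are made up for illustration.

```python
import re

# Matches the address part of a tcpdump text line such as
# "IP 10.0.0.5.51234 > 10.0.0.9.5820:" (line format is an assumption).
LINE = re.compile(r"IP6? (\S+)\.(\d+) > (\S+)\.(\d+):")


def client_endpoints(lines, server_port=5820):
    """Return the set of (host, port) client endpoints seen talking to server_port."""
    clients = set()
    for line in lines:
        match = LINE.search(line)
        if not match:
            continue  # skip lines that are not TCP/IP packet summaries
        src, sport = match.group(1), int(match.group(2))
        dst, dport = match.group(3), int(match.group(4))
        if dport == server_port:
            clients.add((src, sport))   # packet from client to server
        elif sport == server_port:
            clients.add((dst, dport))   # reply from server to client
    return clients
```

Feeding it the lines of a saved trace and checking the size of the returned set gives a quick upper bound on how many connections were in use during the capture.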

Like you said, Unirest doesn't state how it uses the connections, but I assume it's using more than one when requests are sent so frequently.

I'll look into it more and try the packet trace you suggested.

All requests are done on a single thread, and we're not using any asynchronous capabilities from Unirest

I sent the tcpdump in a pm to you Matthew

Packet trace says one connection. Ok, that means my current load simulator already matches your environment in that respect. I am working on a new area of investigation and will report back.

Thank you again for all the help.
Matthew

I am continuing to build a simulator that will hopefully repeat the bug your code stimulates. So far the bug does not show up. I am now using an address sanitizer tool from Google so that even one appearance of the bug, even if mild, should show up (AddressSanitizer · google/sanitizers Wiki · GitHub).

This implies that I might ask you to deploy a new libStarrocks that includes the address sanitizer tool later Thursday. That decision partially depends on whether or not your server already has the needed tool library. Would you execute the following command on your server:

find /lib/. -name 'libasan*'

Example output from my server:
/lib/./gcc/x86_64-linux-gnu/10/libasan_preinit.o
/lib/./gcc/x86_64-linux-gnu/10/libasan.a
/lib/./gcc/x86_64-linux-gnu/10/libasan.so
/lib/./x86_64-linux-gnu/libasan.so.6.0.0
/lib/./x86_64-linux-gnu/libasan.so.6

The ".6" version is not important. My other server has ".5".

Thank you,
Matthew

Alright, sounds exciting!

find /lib/. -name libasan* returns no results. Does that mean a new libStarrocks is required, or do I need to install the Address Sanitizer?

I believe I have worked out how to make our software build process include the sanitizer within libStarrocks. We use a non-normal build tool. I will report again soon.

My overnight simulation ran without problem, which is not helpful.

Joel,

I have been putting increasing levels of load against a 4-core, 15 GB RAM server. Even at 13,200 simultaneous transactions, nothing breaks. (Garbage collection gets really mad, but crashes do not happen and ASAN sees nothing bad.)

A colleague has suggested that maybe I should explore the rollback conditions. Example: I am on a local network so long delays in responding to the client are not severing the TCP connection. And Stardog's underlying storage library RocksDB is known to stall the application if it gets too far behind.

I am going to pick up the rollback thread of research over the Address Sanitizer (libasan). I do have a libasan version built for you, but the tool has its own issues. Going to hold that back for now.

Are there any timeout rules that you directly set with the Unirest library, or are you just taking its defaults? And are there network routers/gateways between your Stardog server and clients? And are you setting any TCP KeepAlive requirements at the client?

Thank you,
Matthew
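For reference on the KeepAlive question: client-side TCP keepalive is a socket option, so "setting it" would look something like the Python sketch below. This is only an illustration, not something either side of this thread is known to be doing; the function name and the timing values are made up, and the three tuning options are the Linux names, which may be missing on other platforms.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive; tune timing only where the platform exposes it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux option names; skipped silently where the socket module lacks them.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection drops
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With keepalive left at its default (off), an idle connection that crosses several routers or gateways can be dropped by a middlebox without either end noticing until the next write.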

Thank you for your continued effort Matthew,

It feels strange that the crashes are not occurring for you; even if only 20 applications upload at the same time, Stardog crashes for us. Perhaps it is the amount of data each transaction handles that puts the strain on Stardog? If 13,200 transactions is fine, I'm leaning more towards the amount of data being uploaded. I'll look further into how much data we actually try to upload and get back to you.

I'm not sure looking into rollbacks is the way forward, as Stardog has already crashed before our applications even send a rollback request.

We're simply using the default settings of Unirest. There could be quite a few routers/gateways between Stardog and the applications. Each application is set up in a different place, spread out in more or less all of Sweden. TCP KeepAlive is not being fiddled with.

So I went through one of the applications, and it tries to upload a total of 151 MB of data to Stardog. It's still not uploading as much data as others, so I think it could be up to 300-400 MB for a single application. The data is split into multiple requests as explained in previous posts.