AWS Marketplace Stardog disk usage during virtual graph import

Hello, I have launched Stardog on AWS following the instructions from here. I am using an r5.2xlarge instance with 500GB of disk.

I imported a PostgreSQL virtual graph into Stardog with no problems. However, when I tried to import a second virtual graph I got the following error:

Error importing into orfium for source mlc. com.complexible.stardog.plan.eval.ExecutionException: An error occurred adding RDF to the index: com.complexible.stardog.index.IndexException: com.complexible.stardog.index.IndexException: No space left on device

I ran df -h to check the disk usage and got this output:

Filesystem Size Used Avail Use% Mounted on
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 432K 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/nvme0n1p1 8,0G 8,0G 20K 100% /
/dev/nvme1n1 493G 19G 449G 5% /var/opt/stardog
tmpfs 6,3G 0 6,3G 0% /run/user/1000

The partition that holds the Stardog data is only 5% used. However, the root partition is full, which seems to be caused by 5.8G in the /tmp folder.
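
For reference, something like the following shows what is filling the root filesystem (the du flags assume GNU coreutils):

# list the largest directories on the root filesystem, staying on that filesystem
sudo du -xh --max-depth=1 / 2>/dev/null | sort -h | tail
# break down what is sitting in /tmp
sudo du -sh /tmp/* 2>/dev/null | sort -h | tail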

This is controlled by another Java setting: java.io.tmpdir. The goal is to move your Java temp directory to a subdirectory of /var/opt/stardog, for example /var/opt/stardog/java_tmp.

Here is an example from one of my servers:
export STARDOG_SERVER_JAVA_ARGS="-Xms30g -Xmx30g -XX:MaxDirectMemorySize=64g -Djava.io.tmpdir=/raid/java_tmp"

More here: Server Configuration | Stardog Documentation 7.5.0
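
For completeness, the change would look roughly like this on your box (the stardog user/group name and the assumption that you start the server from a shell are mine; if the Marketplace image runs Stardog as a service, put the export in that service's environment instead):

# create the new temp directory and give it to the user that runs Stardog
sudo mkdir -p /var/opt/stardog/java_tmp
sudo chown stardog:stardog /var/opt/stardog/java_tmp

# point java.io.tmpdir there, then restart the server so it picks up the setting
export STARDOG_SERVER_JAVA_ARGS="-Xms30g -Xmx30g -XX:MaxDirectMemorySize=64g -Djava.io.tmpdir=/var/opt/stardog/java_tmp"
stardog-admin server stop
stardog-admin server start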

Thanks. This worked.

However, importing the virtual graph was really slow, and after some time I got this error:

Error importing into orfium for source mlc. com.complexible.stardog.plan.eval.ExecutionException: com.complexible.stardog.plan.eval.operator.OperatorException: JDBC driver internal error: Max retry reached for the download of #chunk335 (Total chunks: 1045) retry=10, error=net.snowflake.client.jdbc.SnowflakeSQLException: JDBC driver encountered communication error. Message: Error encountered when downloading a result chunk: HTTP status=403.
at net.snowflake.client.jdbc.DefaultResultStreamProvider.getInputStream(DefaultResultStreamProvider.java:65)
at net.snowflake.client.jdbc.SnowflakeChunkDownloader$2.call(SnowflakeChunkDownloader.java:867)
at net.snowflake.client.jdbc.SnowflakeChunkDownloader$2.call(SnowflakeChunkDownloader.java:781)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

There was some sort of communications failure between Stardog and Snowflake. If the call was taking a while, it's possible that Snowflake killed it or otherwise timed it out. Getting a 403 Forbidden after it was already working suggests something like that. We've seen users hit throttling limits on cloud services before. There may be some settings you can tweak in Snowflake.
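
If it turns out to be a timeout rather than throttling, one place to experiment is the JDBC URL in the data source properties file the virtual graph uses. The parameter names below (networkTimeout in milliseconds, queryTimeout in seconds) come from the Snowflake JDBC driver documentation, and the account/warehouse values are placeholders, so treat this as a sketch rather than a known fix:

jdbc.url=jdbc:snowflake://<account>.snowflakecomputing.com/?warehouse=MY_WH&db=MY_DB&schema=PUBLIC&networkTimeout=600000&queryTimeout=3600
jdbc.driver=net.snowflake.client.jdbc.SnowflakeDriver
jdbc.username=<username>
jdbc.password=<password>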

Roughly how much data are you expecting to materialize? There might be better options than virtual import. And why not leave the data in place and query it as needed?

I want to validate quality constraints on some data sources, and it was suggested to me that importing the data into Stardog would perform better.
Also, I wanted to connect these data sources via reasoning rules, and I wasn't able to achieve this across different virtual graphs.
Finally, as I have described here, Stardog Explorer does not visualize virtual graphs.

The size of the data source that caused the error is around 15GB. We also want to import two more data sources, one around 10GB and another around 1GB.

Constraint validation can be done over a virtual source, but like anything else, it's faster when the data is local.
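
For example, once the data is loaded into a local database, a validation run is just a couple of CLI calls (the database and file names are placeholders; check the ICV section of the docs for the exact commands in your version):

# load the SHACL/ICV constraints and produce a validation report
stardog icv add mydb constraints.ttl
stardog icv report mydb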

If you are bulk-moving data, the best way to do it right now is via our NiFi or Spark integration. Virtual import is best for CSVs or for materializing/caching precise parts of an upstream source. Connecting virtual import to Spark/NiFi is future work.
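
For comparison, a single virtual import run looks roughly like this (database, mapping, and properties file names are placeholders):

# materialize one SQL source into the database using its mappings and data source properties
stardog-admin virtual import mydb mappings.sms source.properties
# the same command also handles plain CSV files
stardog-admin virtual import mydb mappings.sms data.csv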

Re: reasoning over multiple named graphs. This is supported, but without your mappings and rules/data model we can't say what the problem is.
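
One thing worth checking, purely as a guess since we don't have your setup: if each source was imported into its own named graph, make sure queries (and therefore reasoning) actually see those graphs, e.g. via the query.all.graphs database option:

# let queries consider all named graphs, not just the default graph
stardog-admin metadata set -o query.all.graphs=true mydb
# then run your rule-backed query with reasoning enabled
stardog query execute --reasoning mydb query.rq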

And as noted in the other thread, you just need to change a configuration option to use Explorer with virtual data.