Slow data loading even with bulk_load settings

I am trying to load a Freebase .gz file of approx. 30 GB using bulk_load settings.
My system configuration:
96 GB RAM
0.5 TB storage
40 CPU cores
Linux OS
I have set strict.parsing=false and memory.mode=bulk_load.
I first created an empty database using a database.properties file and
then added data to the database using the add functionality.
I have set the JVM configuration as: STARDOG_SERVER_JAVA_ARGS="-Xmx24g -Xms24g -XX:MaxDirectMemorySize=40g"
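The two options, as I have them in the properties files (assuming memory.mode belongs in the server's stardog.properties and strict.parsing in the database.properties passed to db create -c):

# stardog.properties (server-wide setting)
memory.mode=bulk_load

# database.properties (per-database option)
strict.parsing=false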
Freebase has almost 3.5 billion triples.
It has already been 8.5 hours but parsing is only 33% complete.
Snapshot of stardog.log:

(1080.0M triples - 33.0K triples/sec)
INFO 2019-10-08 11:39:53,969 [Stardog.Executor-8] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 34% complete in 09:05:32 (1081.0M triples - 33.0K triples/sec)
INFO 2019-10-08 11:40:39,286 [Stardog.Executor-8] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 34% complete in 09:06:17 (1082.0M triples - 33.0K triples/sec)

The speed is just 33.0K triples/sec.
Please tell me what the problem is and how I can speed it up.

The right thing to do is to use the stardog-admin db create -n {name} /path/to/file. That is the command that's meant to be used with memory.mode=bulk_load. You can specify database properties using the -o option.
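For example (a sketch with a placeholder path; strict.parsing=false is the option from your setup, and -- separates the options from the file list):

stardog-admin db create -n KnowledgeBaseDB -o strict.parsing=false -- /path/to/freebase.nt.gz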

Best,
Pavel

I had a similar issue with loading from a single uncompressed file via the command line and observed two problems: the temp file got really big, and loading started fast but got very slow.
The solution for me was chunking the input into smaller files.

The import is consequently no longer in a single transaction.
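Since N-Triples is a line-oriented format (one complete triple per line), the chunking can be done mechanically. A sketch, assuming GNU coreutils:

zcat freebase-rdf-latest.gz | split -l 100000000 --additional-suffix=.nt - chunk_

Each resulting chunk_aa.nt, chunk_ab.nt, ... is then itself valid N-Triples and can be loaded separately.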

The problem with this command is that whenever I give the path of the .gz file,
it gives an "unknown file format" kind of error.
I thought it was not recognizing the .gz file format.
What is the solution for this?

Is the file on the server side or not? If not, given the size, it's better to copy it there. Normally .gz should be handled as an archive without problems. You can of course unpack and repack it as .zip or something.

@joergunbehauen Hi Jörg, good to see you around. Is this with Stardog 7 or 6? The temp file problem would be interesting to reproduce but the data add command should let you add data in multiple files in a single transaction. The main benefit would be multi-threaded parsing of RDF.
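E.g., a sketch with placeholder file names:

stardog data add myDb part1.nt.gz part2.nt.gz part3.nt.gz

That's still one transaction, but the files can be parsed in parallel.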

Best,
Pavel

It is giving this error:
[shishir18106@compute-0-2 ~]$ stardog-admin db create -n KnowledgeBaseDB /scratch/shishir18106/freebase_database/freebase-rdf-latest.gz
No known format matches the file: /scratch/shishir18106/freebase_database/freebase-rdf-latest.gz

where /scratch is the shared partition across servers.

You need to add the file type extension before the .gz, e.g. latest.rdf.gz, assuming it's RDF/XML.

Do I need to write it like this:
stardog-admin db create -n db -- /scratch/shishir18106/freebase_database/freebase-rdf-latest.gz

Where do I need to give the type extension?
Could you please write the exact create command for me?

This was using Stardog 6.2.1 on an EC2 m5d.2xlarge instance loading ~10B triples of BSBM data, so this was already some time ago. The initial loading speed was 200K triples/sec and deteriorated to 20K triples/sec.

I haven't checked with Stardog 7 yet.

(Apologies to @Shishir_Singhal for hijacking your thread, I just wanted to point out that splitting the input file might increase ingestion speed.)

You need to rename the file:

mv freebase-rdf-latest.gz freebase-latest.rdf.gz

And then load:

stardog-admin db create -n db -- /scratch/shishir18106/freebase_database/freebase-latest.rdf.gz

Zach means you should rename the input file to include the RDF extension before .gz. E.g. it'd be *.rdf.gz if the data is in RDF/XML, *.ttl.gz if the data is in Turtle, and so on.

In Freebase, the data is in N-Triples format, so what would the create command be in this case?

Jörg, the speed deterioration is most likely due to a lack of memory. It's still a little surprising that it was faster for you in separate transactions with 6; it usually isn't. But anyway, very little of what was true about Stardog 6 regarding write performance applies to Stardog 7. Virtually everything (except the RDF parsers) is new when it comes to writing data.
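If memory was the constraint, the usual first knob to try (just a sketch; the right split depends on the data and the machine) is shifting more of the RAM towards direct memory, which is where most of the bulk loading work happens in 7:

STARDOG_SERVER_JAVA_ARGS="-Xmx16g -Xms16g -XX:MaxDirectMemorySize=64g"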

Then freebase.nt.gz
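I.e., following the same rename-then-load pattern as above:

mv freebase-rdf-latest.gz freebase-rdf-latest.nt.gz

stardog-admin db create -n db -- /scratch/shishir18106/freebase_database/freebase-rdf-latest.nt.gz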

What is the difference between the following two commands in bulk_load settings for loading a .gz file?

  1. stardog-admin db create -n db -- <.gz file>, and
  2. stardog data add -f turtle --compression gzip myDb file.bin (where the database is created using stardog-admin db create -c database.properties).

Does the 2nd one load all the data into RAM while parsing? Because I used the 2nd way first, and it gave problems and the server shut down.

Does the 1st method keep data in storage while parsing?
Or is there some other difference?
Please clear up my doubt.

The main difference is that data add is a transactional update of an existing database, during which the database is fully operational and can serve other read and write requests. The db create command creates a new database, and the database is not available until all data is parsed, loaded, and indexed. Its throughput is thus substantially higher. They work in pretty different ways, but neither keeps all data in memory. Various caches are used to keep some frequently used data in memory.

Best,
Pavel

Thanks @pavel and @zachary.whitley

Why does the parsing speed during bulk_load decrease over time?
Here is stardog.log:

(230.0M triples - 88.1K triples/sec)
INFO 2019-10-08 16:51:32,251 [Stardog.Executor-10] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 7% complete in 00:43:36 (231.0M triples - 88.3K triples/sec)
INFO 2019-10-08 16:51:38,863 [Stardog.Executor-10] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 7% complete in 00:43:43 (232.0M triples - 88.4K triples/sec)
INFO 2019-10-08 16:51:45,321 [Stardog.Executor-10] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 7% complete in 00:43:49 (233.0M triples - 88.6K triples/sec)
INFO 2019-10-08 16:51:52,470 [Stardog.Executor-10] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 7% complete in 00:43:56 (234.0M triples - 88.7K triples/sec)
INFO 2019-10-08 16:52:12,820 [Stardog.Executor-10] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 7% complete in 00:44:17 (235.0M triples - 88.4K triples/sec)
INFO 2019-10-08 16:52:31,329 [Stardog.Executor-10] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 7% complete in 00:44:35 (236.0M triples - 88.2K triples/sec)
INFO 2019-10-08 16:52:46,024 [Stardog.Executor-10] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 7% complete in 00:44:50 (237.0M triples - 88.1K triples/sec)
INFO 2019-10-08 16:52:59,456 [Stardog.Executor-10] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 7% complete in 00:45:03 (238.0M triples - 88.0K triples/sec)
INFO 2019-10-08 16:53:20,684 [Stardog.Executor-10] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 7% complete in 00:45:25 (239.0M triples - 87.7K triples/sec)
INFO 2019-10-08 16:53:33,085 [Stardog.Executor-10] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 7% complete in 00:45:37 (240.0M triples - 87.7K triples/sec)

It starts at 165K triples/sec and drops to 87K.
Why is that?
