I am trying to load a Freebase .gz file of approximately 30 GB using bulk_load settings.
My system configuration:
96 GB RAM
0.5 TB storage
40-core CPU
Linux OS
I have set strict.parsing=false and memory.mode=bulk_load.
I first created an empty database using a database.properties file and
then added data to it using the add functionality.
I have set the JVM configuration as: STARDOG_SERVER_JAVA_ARGS="-Xmx24g -Xms24g -XX:MaxDirectMemorySize=40g"
Freebase has almost 3.5 billion triples.
It has already been 8.5 hours but parsing is only 33% complete.
A snapshot of stardog.log:
(1080.0M triples - 33.0K triples/sec)
INFO 2019-10-08 11:39:53,969 [Stardog.Executor-8] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 34% complete in 09:05:32 (1081.0M triples - 33.0K triples/sec)
INFO 2019-10-08 11:40:39,286 [Stardog.Executor-8] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 34% complete in 09:06:17 (1082.0M triples - 33.0K triples/sec)
The speed is just 33.0K triples/sec.
Please tell me what the problem is and how I can speed this up.
The right thing to do is to use the stardog-admin db create -n {name} /path/to/file command. That is the command that's meant to be used with memory.mode=bulk_load. You can specify database properties using the -o option.
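For example, something like the following could work (the database name, option, and file path are only illustrative and should be adapted to your setup):

# hypothetical invocation: create the database and bulk-load the data in one step,
# passing a database property via -o and separating options from the file list with --
stardog-admin db create -n KnowledgeBaseDB -o strict.parsing=false -- /path/to/freebase-rdf-latest.nt.gz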
I had a similar issue with loading from a single uncompressed file via the command line and observed two problems: the temp file got really big, and loading started fast but got very slow.
The solution for me was chunking the input into smaller files (a rough sketch is below).
The import is consequently no longer in a single transaction.
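A sketch of what that chunking could look like, assuming the dump is line-based N-Triples (the chunk size, file names, and database name are made up):

# decompress and split into chunks of 100M lines each (N-Triples has one triple per line)
gunzip -c freebase-rdf-latest.nt.gz | split -l 100000000 --additional-suffix=.nt - chunk_

# load each chunk in its own transaction (hypothetical database name)
for f in chunk_*.nt; do
    stardog data add KnowledgeBaseDB "$f"
done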
The problem with this command is that whenever I try to give the path of the .gz file,
it gives an "unknown file format" kind of error.
I thought it was not recognizing the .gz file format.
What is the solution for this?
Is the file on the server side or not? If not, given the size, it's better to copy it there. Normally a .gz file should be handled as an archive without problems. You can of course unpack and repack it as .zip or something else.
@joergunbehauen Hi Jörg, good to see you around. Is this with Stardog 7 or 6? The temp file problem would be interesting to reproduce, but the data add command should let you add data from multiple files in a single transaction. The main benefit would be multi-threaded parsing of RDF.
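For reference, something like this should keep several files in one transaction with parallel parsing (the database and file names are illustrative):

# single transaction over multiple input files (hypothetical names)
stardog data add myDb part-01.nt.gz part-02.nt.gz part-03.nt.gz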
It is giving this error:
[shishir18106@compute-0-2 ~]$ stardog-admin db create -n KnowledgeBaseDB /scratch/shishir18106/freebase_database/freebase-rdf-latest.gz
No known format matches the file: /scratch/shishir18106/freebase_database/freebase-rdf-latest.gz
where /scratch is a partition shared across the servers.
This was using Stardog 6.2.1 on an EC2 m5d.2xlarge instance loading ~10B triples of BSBM data, so this was already some time ago. The initial loading speed was 200k triples/sec and deteriorated to around 20k triples/sec.
I haven't checked with Stardog 7 yet.
(Apologies to @Shishir_Singhal for hijacking your thread, I just wanted to point out that splitting the input file might increase ingestion speed.)
Zach means you should rename the input file to include the RDF extension before .gz. E.g. it'd be *.rdf.gz if the data is in RDF/XML, *.ttl.gz if the data is in Turtle, and so on.
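Something along these lines, assuming the Freebase dump is in N-Triples (the extension would differ for another serialization):

# rename so the format can be detected from the file extension (assumes N-Triples content)
mv freebase-rdf-latest.gz freebase-rdf-latest.nt.gz
stardog-admin db create -n KnowledgeBaseDB /scratch/shishir18106/freebase_database/freebase-rdf-latest.nt.gz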
Jörg, the speed deterioration is most likely due to lack of memory. It's still a little surprising that it was faster for you in separate transactions with 6; it usually isn't. But anyway, very little of what was true about Stardog 6 regarding write performance applies to Stardog 7. Virtually everything (except for the RDF parsers) is new when it comes to writing data.
The main difference is that data add is a transactional update of an existing database, during which the database is fully operational and can serve other read and write requests. The db create command creates a new database, and that database is not available until all data is parsed, loaded, and indexed. Its throughput is thus substantially higher. They work in pretty different ways, but neither keeps all data in memory. Various caches are used to keep some frequently used data in memory.
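In command-line terms the difference is roughly the following (database and file names are illustrative):

# bulk load at creation time: the database is unavailable until loading finishes, but throughput is highest
stardog-admin db create -n myDb data1.nt.gz data2.nt.gz

# transactional update of an existing database that stays online for other reads and writes
stardog data add myDb data3.nt.gz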