I am trying to load a Freebase .gz file of approximately 30 GB using bulk_load settings.
My system configuration:
96 GB RAM
0.5 TB storage
40-core CPU
Linux OS
I have set strict.parsing=false and memory.mode=bulk_load.
I first created an empty database using a database.properties file and
then added data to it using the add functionality.
I have set the JVM configuration as: STARDOG_SERVER_JAVA_ARGS="-Xmx24g -Xms24g -XX:MaxDirectMemorySize=40g"
Freebase has almost 3.5 billion triples.
It has already been 8.5 hours but parsing is only 33% complete.
A snapshot of stardog.log:
(1080.0M triples - 33.0K triples/sec)
INFO 2019-10-08 11:39:53,969 [Stardog.Executor-8] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 34% complete in 09:05:32 (1081.0M triples - 33.0K triples/sec)
INFO 2019-10-08 11:40:39,286 [Stardog.Executor-8] com.complexible.stardog.index.Index:printInternal(314): Parsing triples: 34% complete in 09:06:17 (1082.0M triples - 33.0K triples/sec)
The speed is just 33.0K triples/sec.
Please tell me what the problem is and how I can speed this up.
The right thing to do is to use the stardog-admin db create -n {name} /path/to/file command. That is the command that's meant to be used with memory.mode=bulk_load. You can specify database properties using the -o option.
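For example, something like the following could work (the database name, option, and file path are only illustrative and should be adapted to your setup):

# hypothetical invocation: create the database and bulk-load the data in one step,
# passing a database property via -o and separating options from the file list with --
stardog-admin db create -n KnowledgeBaseDB -o strict.parsing=false -- /path/to/freebase-rdf-latest.nt.gz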
I had a similar issue with loading from a single uncompressed file via the command line and observed two problems: the temp file got really big, and loading started fast but got very slow.
The solution for me was chunking the input into smaller files (a rough sketch is below).
The import is consequently no longer in a single transaction.
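A sketch of what that chunking could look like, assuming the dump is line-based N-Triples (the chunk size, file names, and database name are made up):

# decompress and split into chunks of 100M lines each (N-Triples has one triple per line)
gunzip -c freebase-rdf-latest.nt.gz | split -l 100000000 --additional-suffix=.nt - chunk_

# load each chunk in its own transaction (hypothetical database name)
for f in chunk_*.nt; do
    stardog data add KnowledgeBaseDB "$f"
done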
The problem with this command is that whenever I try to give the path of the .gz file,
it gives an "unknown file format" kind of error.
I thought it was not recognizing the .gz file format.
What is the solution for this?
Is the file on the server side or not? If not, given the size, it's better to copy it there. Normally a .gz file should be handled as an archive without problems. You can of course unpack and repack it as .zip or something else.
@joergunbehauen Hi Jörg, good to see you around. Is this with Stardog 7 or 6? The temp file problem would be interesting to reproduce, but the data add command should let you add data from multiple files in a single transaction. The main benefit would be multi-threaded parsing of RDF.
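For reference, something like this should keep several files in one transaction with parallel parsing (the database and file names are illustrative):

# single transaction over multiple input files (hypothetical names)
stardog data add myDb part-01.nt.gz part-02.nt.gz part-03.nt.gz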
It is giving this error:
[shishir18106@compute-0-2 ~]$ stardog-admin db create -n KnowledgeBaseDB /scratch/shishir18106/freebase_database/freebase-rdf-latest.gz
No known format matches the file: /scratch/shishir18106/freebase_database/freebase-rdf-latest.gz
where /scratch is a partition shared across the servers.
This was using Stardog 6.2.1 on an EC2 m5d.2xlarge instance loading ~10B triples of BSBM data, so this was already some time ago. The initial loading speed was 200k triples/sec and deteriorated to around 20k triples/sec.
I haven't checked with Stardog 7 yet.
(Apologies to @Shishir_Singhal for hijacking your thread, I just wanted to point out that splitting the input file might increase ingestion speed.)
Zach means you should rename the input file to include the RDF extension before .gz. E.g. it'd be *.rdf.gz if the data is in RDF/XML, *.ttl.gz if the data is in Turtle, and so on.
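Something along these lines, assuming the Freebase dump is in N-Triples (the extension would differ for another serialization):

# rename so the format can be detected from the file extension (assumes N-Triples content)
mv freebase-rdf-latest.gz freebase-rdf-latest.nt.gz
stardog-admin db create -n KnowledgeBaseDB /scratch/shishir18106/freebase_database/freebase-rdf-latest.nt.gz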
Jörg, the speed deterioration is most likely due to lack of memory. It's still a little surprising that it was faster for you in separate transactions with 6; it usually isn't. But anyway, very little of what was true about Stardog 6 regarding write performance applies to Stardog 7. Virtually everything (except for the RDF parsers) is new when it comes to writing data.
The main difference is that data add is a transactional update of an existing database, during which the database is fully operational and can serve other read and write requests. The db create command creates a new database, and that database is not available until all data is parsed, loaded, and indexed. Its throughput is thus substantially higher. They work in pretty different ways, but neither keeps all data in memory. Various caches are used to keep some frequently used data in memory.
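In command-line terms the difference is roughly the following (database and file names are illustrative):

# bulk load at creation time: the database is unavailable until loading finishes, but throughput is highest
stardog-admin db create -n myDb data1.nt.gz data2.nt.gz

# transactional update of an existing database that stays online for other reads and writes
stardog data add myDb data3.nt.gz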