Uploading Freebase: memory config

I have been trying to load the Freebase data (triples; the legacy data dumps from the deprecated Freebase API, published on Google Developers).

I set memory.mode=bulk_upload in stardog.properties, with a max heap of 16G and max direct memory of 40G, on a machine with 52G of RAM and 8 cores. Disk is 0.5T. Linux (Ubuntu 17.04).

The data is currently loading at about 100K triples/sec, which (absolute value aside) is no faster than when I tried with memory.mode=default (or rather, left unset). This is on a fresh database creation with strict.parsing=false. Is bulk_load expected to run faster? Do you have any suggestions for improving loading speed significantly?

Thank you

Hi,

Try setting memory.mode=bulk_load instead of “bulk_upload”. Since “bulk_upload” doesn’t match one of the predefined values, the setting simply falls back to the default.
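
For reference, a minimal sketch of the relevant stardog.properties entry (value as discussed in this thread):

    # stardog.properties -- selects the bulk-loading memory layout
    memory.mode=bulk_load

Heap and direct memory are JVM settings rather than stardog.properties keys; they are typically passed through the server's startup environment (variable name assumed from Stardog's standard startup scripts, sizes matching the setup described above):

    # JVM memory settings for the Stardog server process
    export STARDOG_SERVER_JAVA_ARGS="-Xmx16g -Xms16g -XX:MaxDirectMemorySize=40g"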

You can confirm the setting took effect at server start: the log should show Memory mode: BULK_LOAD. Chances are your current log has a message about ignoring “bulk_upload” and using DEFAULT instead.
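
A quick way to check, assuming the default server log location under $STARDOG_HOME:

    # look for the memory-mode line in the server log
    grep -i "memory mode" "$STARDOG_HOME/stardog.log"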

Thanks! Miraculously, I had set it correctly and had only misspelled it in the message above. It definitely worked better in the end: parsing the triples took about 8 hours:

INFO 2017-12-01 06:54:05,304 [XNIO-1 task-9] com.complexible.stardog.StardogKernel:printInternal(314): Parsing triples: 100% complete in 08:03:40 (3130.8M triples - 107.9K triples/sec)

The parsing rate was steady (increasing slightly toward the end), and indexing and computing statistics took much less time. What's the most direct way to improve data load performance? More CPUs?

You can try loading from multiple files (i.e., split the input) so the parser can work on them in parallel; see the sketch below.
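
For example, a sketch assuming GNU coreutils and an N-Triples dump (file names and chunk size are illustrative):

    # N-Triples is line-oriented, so a line-based split is safe (not so for Turtle)
    split -l 100000000 -d --additional-suffix=.nt freebase.nt fb-part-

    # pass all chunks at creation time so they can be parsed in parallel
    stardog-admin db create -n freebase fb-part-*.nt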

Cheers,
Pavel
