Missing data in repository after import?

Hello,

I imported data from Yago4 into Stardog.
I loaded the same data into both Stardog and another triplestore, and also checked it against the online endpoint.

It now seems that Stardog skipped (?) some triples. If I run the following query on my Stardog endpoint I get an empty result set, even though the results exist (you can also check online [1]).

PREFIX schema: <http://schema.org/>
SELECT * FROM <http://yago-knowledge.org> WHERE {
  ?v0 schema:worksFor <http://yago-knowledge.org/resource/Dominick_Daly> .
  ?v0 schema:birthDate ?v3 .
} LIMIT 10

The result should be:

http://yago-knowledge.org/resource/Robert_Dalrymple_Ross  "1827"^^xsd:gYear

Note that this simpler query does return results:

PREFIX schema: <http://schema.org/>
SELECT * FROM <http://yago-knowledge.org> WHERE {
  ?v0 schema:birthDate ?v3 .
} LIMIT 10

[1] Sparql | Yago Project

Hi Matteo,

Thanks for your report. Can you share or inspect your stardog.log? It sounds like Stardog encountered some parsing errors during the data load.

Best,
Noah

Hi,
I've checked the log, and there are a couple of entries like this:

INFO  2020-08-31 09:04:10,854 [stardog-user-1] com.complexible.stardog.StardogKernel:write(77): /data/yago/import/part-yago-wd-facts-lite.01.nt: '0000' is not a valid value for datatype http://www.w3.org/2001/XMLSchema#gYear [L8885804]

Does this mean that it skips the entire file when it encounters an error?

Hi Matteo,

Stardog will stop adding data from the data file during db create when the parser reports an error. In this case, the parser detected an invalid value for the XML Schema datatype:

/Users/noahgorstein/Downloads/yago-wd-facts.nt.gz: '0000' is not a valid value for datatype http://www.w3.org/2001/XMLSchema#gYear [L3511416]

You can disable this default parsing behavior and allow Stardog to ingest this data with invalid values by setting the database configuration option strict.parsing=false at database creation time like so:

stardog-admin db create -n yago3 -o strict.parsing=false -- ~/Downloads/yago-wd-facts.nt.gz
Bulk loading data to new database yago3.
Loaded 22,219,358 triples to yago3 from 1 file(s) in 00:05:11.530 @ 71.3K triples/sec.
Successfully created database 'yago3'.

I was able to run your original query once I set strict.parsing=false (as seen above) and get back your expected result:

http://yago-knowledge.org/resource/Robert_Dalrymple_Ross  "1827"^^xsd:gYear

Please note: strict.parsing is an immutable database configuration option. Its value cannot be changed after the database is created.
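
If you would rather keep strict parsing on and find the offending lines before loading, here is a minimal Python sketch. It only looks for the one value the log reported ('0000' typed as xsd:gYear) and is not a full datatype validator. The function name find_zero_gyear and the sample resources A and B are made up for illustration.

```python
import re

# Matches an N-Triples literal typed as xsd:gYear and captures its lexical value.
GYEAR_LITERAL = re.compile(
    r'"([^"]*)"\^\^<http://www\.w3\.org/2001/XMLSchema#gYear>'
)


def find_zero_gyear(lines):
    """Return (line_number, line) pairs that contain a '0000' xsd:gYear literal."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if "0000" in GYEAR_LITERAL.findall(line):
            hits.append((number, line.rstrip("\n")))
    return hits


# Hypothetical sample lines; resources A and B are placeholders, not real YAGO data.
sample = [
    '<http://yago-knowledge.org/resource/A> <http://schema.org/birthDate> '
    '"1827"^^<http://www.w3.org/2001/XMLSchema#gYear> .\n',
    '<http://yago-knowledge.org/resource/B> <http://schema.org/birthDate> '
    '"0000"^^<http://www.w3.org/2001/XMLSchema#gYear> .\n',
]

for number, line in find_zero_gyear(sample):
    print(number, line)
```

For a real file you would pass an open file handle instead of the sample list (for a .gz file, one opened in text mode with gzip.open).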

Let us know if that helps.

Cheers,
Noah

Hi Noah,
you've been most helpful, thanks a lot!

I've seen this policy in other systems as well, and it always confuses me.
Can you point me to anything that explains why a system would stop loading and report the error only in the log?

Was I supposed to know whether my data was loaded without inspecting the log line by line? Or did I miss something?

Thanks again!

Hi Matteo,

You're very welcome. In addition to the error being reported in the stardog.log, you should have had this parser error reported in your console. For example, when I first attempted to load the data (without modifying strict.parsing) I got the following error:

❯ stardog-admin db create -n yago2 ~/Downloads/yago-wd-facts.nt.gz
Bulk loading data to new database yago2.
Errors were encountered during loading:
/Users/noahgorstein/Downloads/yago-wd-facts.nt.gz: '0000' is not a valid value for datatype http://www.w3.org/2001/XMLSchema#gYear [L3511416]
Loaded 3,500,139 triples to yago2 from 1 file(s) in 00:00:55.446 @ 63.1K triples/sec.
Successfully created database 'yago2'.

I think we can do a better job of documenting how this command behaves on parser errors like this one. I can see how it could be confusing.

Best,
Noah

I see. My workflow runs these steps automatically, so I probably missed the error and only read the last line.

Thanks again!
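
For an automated workflow like this, one option is to inspect the captured console output rather than only its last line, since the command still ends with "Successfully created database" after a partial load. Below is a minimal Python sketch, assuming the output looks like the console transcript quoted earlier in this thread (the format may differ between Stardog versions). The function name check_load_output is hypothetical, and capturing the output from the command is left out.

```python
import re

# Header and summary line as they appear in the console transcript in this thread.
ERROR_HEADER = "Errors were encountered during loading:"
LOADED = re.compile(r"Loaded ([\d,]+) triples")


def check_load_output(output):
    """Return (ok, triples_loaded, error_lines) for captured db create output."""
    errors = []
    loaded = None
    in_errors = False
    for raw in output.splitlines():
        line = raw.strip()
        match = LOADED.search(line)
        if match:
            # The summary line ends the error section.
            loaded = int(match.group(1).replace(",", ""))
            in_errors = False
        elif line == ERROR_HEADER:
            in_errors = True
        elif in_errors and line:
            errors.append(line)
    return (not errors, loaded, errors)


# Console output copied from the failed load shown in this thread.
sample_output = """Bulk loading data to new database yago2.
Errors were encountered during loading:
/Users/noahgorstein/Downloads/yago-wd-facts.nt.gz: '0000' is not a valid value for datatype http://www.w3.org/2001/XMLSchema#gYear [L3511416]
Loaded 3,500,139 triples to yago2 from 1 file(s) in 00:00:55.446 @ 63.1K triples/sec.
Successfully created database 'yago2'.
"""

ok, loaded, errors = check_load_output(sample_output)
if not ok:
    print(f"Load incomplete: {loaded} triples loaded, {len(errors)} error(s)")
    for error in errors:
        print(error)
```

A pipeline step could fail whenever ok is false, so a partial load is not mistaken for a successful one.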