New language detection function

I just posted a new release of the Stardog-extension functions that includes a new language detection function. RDF has nice support for language tags, but most data doesn't include this information, and it can be impossible to add for anything other than a trivial data set.

The functions come in two types. The first detects the most likely language of a string and returns the string with the appropriate language tag.


select ?result where { bind(lang:detect("Stardog graph database") AS ?result) }

should return

"Stardog graph database"@en
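
Once a literal carries a language tag, the standard SPARQL functions lang() and langMatches() can filter on it. The sketch below uses only built-in SPARQL 1.1 functions; the two hard-coded literals are made-up stand-ins for values that a detection function would have tagged, not output from the extension itself.

```sparql
# Illustrative only: the tagged literals are hypothetical sample data.
select ?label where {
  values ?label { "Stardog graph database"@en "Graphdatenbank"@de }
  # Keep only literals whose language tag matches "en"
  filter(langMatches(lang(?label), "en"))
}
```

This keeps only the literal tagged @en, which is the kind of query that becomes possible once untagged strings have been given a tag.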

The second version returns a score for each language.

prefix array: <http://semantalytics.com/2017/09/ns/stardog/kibble/array/>

select ?result where { bind(array:toString(lang:score("Stardog graph database")) AS ?result) }

should return

[ [ "en"^^<http://www.w3.org/2001/XMLSchema#string> "1.0"^^<http://www.w3.org/2001/XMLSchema#double> ] [ "xh"^^<http://www.w3.org/2001/XMLSchema#string> "0.9931761961991691"^^<http://www.w3.org/2001/XMLSchema#double> ] [ "zu"^^<http://www.w3.org/2001/XMLSchema#string> 
.....

The library used detects 74 languages, and you can get a list of them with the function lang:detectbleLanguages. A word of caution: the complete model is approximately 3.4 GB and takes about 10 seconds to load. I've done what I can to cache the model, so given sufficient memory it should only take the 10 seconds the first time you run it. There are also separate functions, like lang:detectFrom, that take a list of languages to detect and would load much faster.

There are also new functions for computing ngrams in both the array and strings packages, which return arrays of tokens and strings respectively.
