About textMatch (Lucene) usage

(Taeho Hwang) #1

Hi Stardog team,

Suppose that I have some instances having different rdfs labels.
<http://test.com/TEST16> rdfs:Label "username" .
<http://test.com/TEST17> rdfs:Label "user_name" .
<http://test.com/TEST18> rdfs:Label "user-name" .
<http://test.com/TEST19> rdfs:Label "user" .
where {}

What I need to do is retrieve all instances using textMatch, ordered by text similarity score.
However, the following query returns only “user-name” and “user”, both with the same score, as shown below.

select *
where {
?s rdfs:Label ?o .
(?o ?score) <tag:stardog:api:property:textMatch> ("user~" 0 100).
}

s o score
:TEST18 user-name 2.54044508934021
:TEST19 user 2.54044508934021

So I added the wildcard *:
select *
where {
?s rdfs:Label ?o .
(?o ?score) <tag:stardog:api:property:textMatch> ("user*~" 0 100).
}

Now it returns all of them, but all with the same score.

s o score
:TEST16 username 1.0
:TEST17 user_name 1.0
:TEST18 user-name 1.0
:TEST19 user 1.0

Can you provide a proper way to differentiate them with different scores?

Thank you in advance.

(Lorenz B.) #2

Since the full-text index is based on Lucene, its default score is just the standard information retrieval score, which considers only term frequency and document frequency (plus some boosting). String similarity, such as edit distance, is not taken into account; computing it on top of the index lookup would be too expensive.

(Pedro Oliveira) #3

Hi Hwang,

As Lorenz mentioned, the Lucene score is not a proper text similarity score; it’s just a value Lucene uses to decide whether a result is relevant to a query.
If you need an actual similarity score, you can pass the results through a similarity metric, such as the ones provided by the kibbles-string-metric mentioned in this post. Just add the release jar to Stardog’s classpath and restart the server, and several distance metrics will be available in SPARQL.
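
To illustrate what "passing the results through a similarity metric" could look like, here is a minimal client-side sketch in Python. It is not the kibbles-string-metric API: the function names `levenshtein`, `similarity`, and `rank_by_similarity` are hypothetical, and the choice of a normalized Levenshtein distance is only an assumption about which metric fits. The idea is to fetch the labels with the wildcard textMatch query and then re-rank them by string similarity to the search term.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def rank_by_similarity(query: str, labels: list[str]) -> list[tuple[str, float]]:
    """Return (label, score) pairs, most similar first; ties keep input order."""
    scored = [(label, similarity(query, label)) for label in labels]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Labels as returned by the wildcard textMatch query in the question.
    labels = ["username", "user_name", "user-name", "user"]
    for label, score in rank_by_similarity("user", labels):
        print(f"{label}\t{score:.3f}")
```

With the four labels from the question and the search term "user", this ranks "user" first (1.0), then "username" (0.5), then "user_name" and "user-name" tied at about 0.44, so the results are at least partially differentiated. A different metric (for example Jaro-Winkler) would give different scores.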