About textMatch (Lucene) usage

Hi Stardog team,

Suppose I have some instances with different rdfs labels:
insert{
<http://test.com/TEST16> rdfs:Label "username" .
<http://test.com/TEST17> rdfs:Label "user_name" .
<http://test.com/TEST18> rdfs:Label "user-name" .
<http://test.com/TEST19> rdfs:Label "user" .
}
where {}

What I need to do is retrieve all of these instances using textMatch, ordered by text similarity score.
But the following query returns only "user-name" and "user", both with the same score, as shown below.

select *
where {
?s rdfs:Label ?o .
(?o ?score) tag:stardog:api:property:textMatch ("user~" 0 100).
}

s o score
:TEST18 user-name 2.54044508934021
:TEST19 user 2.54044508934021

So I added a wildcard (*):
select *
where {
?s rdfs:Label ?o .
(?o ?score) tag:stardog:api:property:textMatch ("user*~" 0 100).
}

Now it returns all of them, but again with the same score.

s o score
:TEST16 username 1.0
:TEST17 user_name 1.0
:TEST18 user-name 1.0
:TEST19 user 1.0

Is there a proper way to make the scores differ, so the results can be ranked?

Thank you in advance.

Given that the full-text index is based on Lucene, its default score is just the common information retrieval score, which considers only term frequency and document frequency (plus some boosting). A string similarity measure such as edit distance is not taken into account; computing it on top of the index lookup would be too expensive.

Hi Hwang,

As Lorenz noted, the Lucene score is not a proper text similarity score; it's just a value Lucene uses to decide whether a result is relevant to a query.
If you need an actual similarity score, you can pass the results through a similarity metric, like the ones provided by kibbles-string-metric, mentioned in this post. Just add the release jar to Stardog's classpath, restart the server, and several distance metrics will be available in SPARQL.
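
To illustrate what "passing the results through a similarity metric" means, here is a minimal sketch that re-ranks the four labels on the client side with a normalized Levenshtein similarity. This is plain Python, not the kibbles-string-metric API; the function names (levenshtein, similarity, rank_by_similarity) are made up for this example, and the choice of normalizing by the longer string's length is an assumption, not something Stardog or Lucene prescribes.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def rank_by_similarity(query: str, labels: list[str]) -> list[tuple[str, float]]:
    """Return (label, score) pairs, most similar first."""
    scored = [(label, similarity(query, label)) for label in labels]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Labels as returned by the "user*~" textMatch query above.
labels = ["username", "user_name", "user-name", "user"]
for label, score in rank_by_similarity("user", labels):
    print(f"{label}\t{score:.3f}")
```

With a metric like this, "user" scores highest, "username" next, and "user_name" and "user-name" tie below it, so the results are no longer indistinguishable. Doing the same inside SPARQL with the plugin would avoid the round trip, but the exact function names depend on the release you install.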

-pedro
