Similarity Search: Values in the query

Hi Folks,
To help me learn Similarity Search, I created some example data for people who have an eye color, hair color, gender, and the type of vehicle they drive (attached). I successfully created a model and then a query that looks for people similar to eg:Bob:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX : <http://schema.org/>
PREFIX eg: <http://www.example.org/people/>
PREFIX spa: <tag:stardog:api:analytics:>

SELECT ?similarPersonLabel ?confidence
WHERE {
    graph spa:model {
        :simModel spa:arguments (?eyes ?hair ?gender ?drives) ;
                  spa:confidence ?confidence ;
                  spa:parameters [ spa:limit 5 ] ;
                  spa:predict ?similarPerson .
    }
    { ?similarPerson rdfs:label ?similarPersonLabel }
    {
        SELECT
        (spa:set(?eyes) as ?eyes)
        (spa:set(?hair) as ?hair)
        (spa:set(?gender) as ?gender)
        (spa:set(?drives) as ?drives)
        ?person
        {
            ?person eg:eyes   ?eyes ;
                    eg:hair   ?hair ;
                    eg:gender ?gender ;
                    eg:drives ?drives .
            VALUES ?person { eg:Bob } # Who is similar to Bob?
        }
        GROUP BY ?person
    }
}
ORDER BY DESC(?confidence)

How can I create a query that, instead of asking "Who is similar to Bob?", looks for similarity based on characteristics specified within the query, like:

eg:eyes eg:blue ;
eg:hair eg:black ;
eg:gender eg:male ;
eg:drives eg:truck .

Later I want to develop a UI that allows the user to identify similar nodes based on parameters they choose (Female, drives a truck ; Male, brown hair; etc.).

Thanks!

People-Similarity-InstanceData-Stardog.TTL (756 Bytes)

Hi Tim,

Thanks for trying out the similarity search feature. Let me first provide a few pieces of background which I think will make things clearer.

The spa:set function is an interim implementation of Stardog's extended solutions feature. It is an aggregate function which collects the values of several bindings into a single set. For example, consider the following set of triples in the database:
eg:Bob eg:eyes eg:blue
eg:Bob eg:eyes eg:brown
If we collect these using spa:set, we logically get a new binding which contains both values in a set, something like {eg:blue, eg:brown}. When doing a similarity search, Bob will be similar to somebody with either blue or brown eyes and more similar to somebody with both blue and brown eyes, but not similar to somebody with green eyes. If you don't have more than one triple in your database supplying the eye color for each person (i.e. this is a functional property), then you don't need to use spa:set to create the aggregate of bindings.

The similarity search query is executed once for each set of bindings you provide to spa:arguments. In the example you shared, there is only one set of bindings as input. This is because you group by ?person and bind ?person to the constant eg:Bob. If you want to perform a search based on constant values, you can substitute them into the spa:arguments list directly, e.g.

graph spa:model {
:simModel spa:arguments (eg:blue eg:black eg:male eg:truck) ;
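For reference, here is a minimal sketch of what the complete query might look like with those constants substituted in. It assumes the model name, prefixes, and feature order (eyes, hair, gender, drives) from the original question; the rdfs: prefix declaration is added here on the assumption that it is not already predefined for the database.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX : <http://schema.org/>
PREFIX eg: <http://www.example.org/people/>
PREFIX spa: <tag:stardog:api:analytics:>

SELECT ?similarPersonLabel ?confidence
WHERE {
    graph spa:model {
        # Constants replace the ?eyes ?hair ?gender ?drives variables,
        # so no sub-select over an existing person is needed.
        :simModel spa:arguments (eg:blue eg:black eg:male eg:truck) ;
                  spa:confidence ?confidence ;
                  spa:parameters [ spa:limit 5 ] ;
                  spa:predict ?similarPerson .
    }
    ?similarPerson rdfs:label ?similarPersonLabel .
}
ORDER BY DESC(?confidence)
```

The argument order has to match the order used when the model was built, since the search only sees a positional list of feature values.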

This integrates naturally with SPARQL and you could also perform a similarity search over all people who drive trucks by replacing:

VALUES ?person { eg:Bob }

with

VALUES ?drives { eg:truck }

In this case, you would conceptually get one solution for each person who drives a truck. Each solution is used as input to the similarity search with the features as specified, and the similarity result includes one solution for each similar entity. The result typically includes the entity whose features are given as input, which can be excluded here with filter(?similarPerson != ?person).
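Put together, the truck-driver variant could be sketched roughly as follows. This reuses the query from the original question with only the VALUES clause changed and the filter added. The alias names ending in "Set" and the extra ?personLabel output are illustrative choices, not something the model requires; distinct aliases are used only to avoid reusing a variable name that is already bound in the pattern.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX : <http://schema.org/>
PREFIX eg: <http://www.example.org/people/>
PREFIX spa: <tag:stardog:api:analytics:>

SELECT ?personLabel ?similarPersonLabel ?confidence
WHERE {
    graph spa:model {
        :simModel spa:arguments (?eyesSet ?hairSet ?genderSet ?drivesSet) ;
                  spa:confidence ?confidence ;
                  spa:parameters [ spa:limit 5 ] ;
                  spa:predict ?similarPerson .
    }
    ?person rdfs:label ?personLabel .
    ?similarPerson rdfs:label ?similarPersonLabel .
    {
        SELECT
        (spa:set(?eyes) as ?eyesSet)
        (spa:set(?hair) as ?hairSet)
        (spa:set(?gender) as ?genderSet)
        (spa:set(?drives) as ?drivesSet)
        ?person
        {
            ?person eg:eyes   ?eyes ;
                    eg:hair   ?hair ;
                    eg:gender ?gender ;
                    eg:drives ?drives .
            VALUES ?drives { eg:truck } # One input solution per truck driver
        }
        GROUP BY ?person
    }
    # Drop the trivial match of each person with themselves
    FILTER(?similarPerson != ?person)
}
ORDER BY ?personLabel DESC(?confidence)
```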

Hope this helps. Let us know if you have any further thoughts.

Jess

Your detailed explanation is very helpful. My example data is too artificial. IRL, I will have several bindings with the same predicate. I will create a new example for my learning purposes and try it out using your advice here.

Does the model also work with inferencing? This is something I will also test and may result in a follow up post if I get stuck.

Cheers!

Tim

Here is my follow up question for how to use reasoning with Similarity Search. See the attached files for Ontology, Instance Data, Model, and Query.

With the instance data and ontology loaded, and Reasoning "On", I can query to determine who drives a vehicle ( ?vehicle a veh:Vehicle ) and return the vehicle they drive:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX veh: <http://www.example.org/vehicle#>
PREFIX eg: <http://www.example.org/people/>

SELECT ?personName ?vehicle
WHERE {
    ?person veh:drives ?vehicle ;
            rdfs:label ?personName .
    ?vehicle a veh:Vehicle .
}

It is unclear to me how to leverage reasoning in the Model and Query for Similarity Search. Each person drives a unique vehicle, so the spa:arguments will be a unique list for each person. Not surprisingly, the results returned by my model and query (attached) match "Bob" with "Bob" and no one else.

Instead of matching on the list of vehicle instance values, how do I find similarities based on the type of vehicle? In other words, how do I use the superclass veh:Vehicle in the Model and Query? Or subclasses like veh:Car or veh:Truck?

I hope I have explained my question adequately. The data is still quite artificial but is closer to a real-life use case.
People-VehicleInference-InstanceData.TTL (2.1 KB)
People-VehicleInference-Ont.ttl (3.1 KB)
People-VehicleInference-Similarity-Drives-Model.rq (1.5 KB)
People-VehicleInference-Similarity-Drives-Query.rq (1.3 KB)

Tim,

Logical (OWL) reasoning and similarity search work independently. The similarity index uses only what you give it as input to perform the search. This means that if you insert "Bob drives a truck" into the similarity index, it will have no knowledge of some other fact which may be inferred by your ontology, such as "Bob drives a vehicle".

That said, it is not impossible for these two features to interact. One possibility is to include schema/derived information in the features of the index. The example you provided builds the index like so:

    SELECT
    (spa:set(?drives) as ?drives)
    ?person
    {
        ?person veh:drives ?drives .
    }
    GROUP BY ?person

This allows for a person to be declared as driving more than one vehicle and the set of vehicles will be accumulated in the ?drives binding. If you want to include schema information, you can query the types of the vehicle like so:

?drives a ?vehicleType

You can add another feature to the index by including the spa:set(?vehicleType) binding in the spa:arguments list. If you run this query with reasoning, you will get bindings for both the direct and the inferred assertions of the vehicle types. This effectively builds the similarity index taking the ontology into account. The similarity scores will be highest when people drive the exact same type of car, since the asserted type as well as the inferred types will be included in the set. The score will decrease as there is less similarity, i.e. fewer common superclasses.
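As a rough sketch, the extended sub-select might look like the following. The veh: prefix is taken from the query posted earlier in the thread; the alias names ?drivesSet and ?vehicleTypes are illustrative, and the spa:arguments list of the model (and of any search against it) would need to name the same two bindings in the same order.

```sparql
PREFIX veh: <http://www.example.org/vehicle#>
PREFIX spa: <tag:stardog:api:analytics:>

# Intended as the WHERE part of the model-building update,
# executed with reasoning enabled so that inferred types are included.
SELECT
(spa:set(?drives) as ?drivesSet)
(spa:set(?vehicleType) as ?vehicleTypes)
?person
{
    ?person veh:drives ?drives .
    ?drives a ?vehicleType .   # asserted types, plus inferred ones under reasoning
}
GROUP BY ?person
```

With reasoning off, ?vehicleTypes would hold only the asserted types, so two people driving different kinds of car would share no type values.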

Jess

Jess,
Your explanation is very good and I almost get it. I still have something wrong in my Model, Query, or Ontology (or all of those?). When I execute the query, Bob shows no similarity with Sally, even though both drive types of cars (veh:SportsCar and veh:SedanCar, respectively). I've attached the model and query, which I execute using the Ontology and Instance Data posted previously. Thanks for your patience. If I can get this figured out I will be using it extensively.
People-VehicleInference-Similarity-Drives-Model.rq (1.7 KB)
People-VehicleInference-Similarity-Drives-Query.rq (1.5 KB)

Tim,

Did you execute the update to build the index with reasoning on? What results do you get for this fragment of the query:

        ?person veh:drives ?drives .
        ?drives a          ?vehicleType .

Do you see the types inferred by the ontology?
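One way to check this is to run the fragment as a standalone query, once with reasoning off and once with it on, and compare the rows. A minimal sketch, assuming the veh: prefix from the earlier query:

```sparql
PREFIX veh: <http://www.example.org/vehicle#>

SELECT ?person ?drives ?vehicleType
WHERE {
    ?person veh:drives ?drives .
    ?drives a          ?vehicleType .
}
ORDER BY ?person ?vehicleType
```

If the ontology is being applied, the run with reasoning on should return additional rows per vehicle for its superclasses (for example veh:Vehicle, going by the class names mentioned earlier in the thread).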

Jess

I did NOT have Reasoning = On when I built the model. So I turned it on and ran the model only to receive the attached error "Logical service query required. ..."

In answer to your question, I do get the inferred vehicle types in the query result.

It sounds like the fact that reasoning was not applied when building the index explains why the results were not as you expected.

The error is a bug. Let me dig into it and see if there's a workaround I can suggest. I will get back to you.

Jess

Replying to keep this topic alive. Any work-around or a fix planned for the next release?
Cheers,
Tim

Hi Tim,
Thanks for the bump. We're actively working on this and plan to have a fix available in the May release.
Jess
