Merge nodes by rdfs label

I have the following challenge. After extracting product information from documents, I end up with more products than needed, for example "Product A", "Product A Enterprise", and "Product A Platform". All three nodes really describe a single product. So my questions:

  • When should I merge the extracted information?
  • Before inserting triples or after?
  • If after, can I use some of the similarity functionality that Stardog has?

The names above are rdfs:labels. I wish we could have an array of rdfs:labels, so that I retain the names after merging. Maybe SKOS prefLabel and altLabel can be useful in this situation?

Thanks for your help,
Radu

Like everything, it depends. Here are some things you might want to consider. If you're going to generate 1 billion triples only to merge them down to 10k, that's a lot of work to load all that data just to throw most of it away, and you might be better off merging before you load. If, on the other hand, you have a moderate amount of data, SPARQL is very flexible and should make the merge fairly easy, unless it's already easy to do as part of your pipeline. If you need to merge it with other data sources, then you'll be doing it in SPARQL.

I created a set of functions for Stardog for doing string metric comparisons like the ones you're looking for: https://github.com/semantalytics-stardog/kibbles-string-metric

You can have multiple rdfs:label values on the same node.
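
To illustrate keeping every extracted name on one merged node, here is a minimal Python sketch that writes the names out as N-Triples. It attaches all names as rdfs:label and, following the SKOS idea from the question, marks one as skos:prefLabel and the rest as skos:altLabel. The node IRI, the function names, and the choice of which name is preferred are assumptions made up for the example, not anything prescribed by Stardog.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"


def escape_literal(text):
    """Escape characters that may not appear raw in an N-Triples string."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def label_triples(node_iri, preferred, alternatives):
    """Return N-Triples lines that keep every name on a single node."""
    names = [preferred] + [n for n in alternatives if n != preferred]
    lines = [f'<{node_iri}> <{SKOS_PREF_LABEL}> "{escape_literal(preferred)}" .']
    for name in names:
        # Every name stays available through plain rdfs:label.
        lines.append(f'<{node_iri}> <{RDFS_LABEL}> "{escape_literal(name)}" .')
    for name in names[1:]:
        # The non-preferred names are additionally marked as alternatives.
        lines.append(f'<{node_iri}> <{SKOS_ALT_LABEL}> "{escape_literal(name)}" .')
    return lines


# Hypothetical IRI for the merged product node.
triples = label_triples(
    "http://example.org/product/a",
    "Product A",
    ["Product A Enterprise", "Product A Platform"],
)
print("\n".join(triples))
```

Since N-Triples is a subset of Turtle, the output should load with either parser.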

You might want to look into Stardog's sameAs reasoning.
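
As a rough sketch of that route: instead of physically merging nodes, the pipeline could keep each extracted node and emit owl:sameAs links to one canonical node, leaving it to sameAs reasoning to treat them as one. The IRIs and the function name below are hypothetical, and how sameAs reasoning is switched on for a database is something to check in the Stardog documentation.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triples(canonical_iri, duplicate_iris):
    """Return N-Triples lines linking each duplicate node to the canonical one."""
    return [
        f"<{dup}> <{OWL_SAME_AS}> <{canonical_iri}> ."
        for dup in duplicate_iris
        if dup != canonical_iri  # a node does not need a link to itself
    ]


# Hypothetical IRIs for the three extracted product nodes.
links = same_as_triples(
    "http://example.org/product/a",
    [
        "http://example.org/product/a-enterprise",
        "http://example.org/product/a-platform",
    ],
)
print("\n".join(links))
```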

Zach,

Thanks for sharing your approach. It makes more sense for me to do it beforehand.

As for string metrics - really nice Java library. For Python I used fuzzywuzzy, with some degree of success. I hope this is useful for others who have a Python NLP pipeline.
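
For anyone who wants to try the idea without installing anything, below is a minimal sketch of grouping product names before loading. It uses difflib from the Python standard library rather than fuzzywuzzy, with a hand-written partial match that is only similar in spirit to fuzzywuzzy's partial_ratio. The function names, the 0.95 threshold, and the greedy shortest-name-first grouping are assumptions for illustration and would need tuning on real data.

```python
from difflib import SequenceMatcher


def partial_ratio(a, b):
    """Best similarity between the shorter string and any equally long
    window of the longer one (case-insensitive), in the range 0.0 to 1.0."""
    a, b = a.lower().strip(), b.lower().strip()
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0.0
    best = 0.0
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return best


def group_names(names, threshold=0.95):
    """Greedily group names; the shortest name of a group acts as its key."""
    groups = []
    # Shortest names first, so "Product A" is seen before its longer variants.
    for name in sorted(set(names), key=lambda n: (len(n), n)):
        for group in groups:
            if partial_ratio(group[0], name) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


extracted = ["Product A Enterprise", "Product A", "Product B", "Product A Platform"]
for group in group_names(extracted):
    print(group[0], "<-", group[1:])
```

The high threshold is deliberate: a plain similarity ratio scores "Product A" and "Product B" as close, so a loose cutoff would merge distinct products. The substring-style match has its own blind spot (a name such as "Product AB" would also fall into the "Product A" group), so the resulting groups are worth reviewing before inserting triples.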
