Merge nodes by rdfs label

R3000 · December 18, 2019, 5:46pm

I have the following challenge. After extracting product information from documents, I end up with more products than needed, for example "Product A", "Product A Enterprise", and "Product A Platform". All three nodes really describe a single product. So my questions:

When should I merge the extracted information?
Before inserting triples or after?
If after, can I use some similarity functionality that Stardog has?

The above names are rdfs:labels. I wish we could have an array of rdfs:labels so that I retain the names after merging. Maybe SKOS prefLabel and altLabel would be useful in this situation?

Thanks for your help,
Radu

zachary.whitley · December 18, 2019, 6:20pm

Like everything, it depends. Here are some things you might want to consider. If you're going to generate 1 billion triples only to merge them down to 10k, that's a lot of work to load all that data and then throw most of it away, so you might be better off merging before you load. If, on the other hand, you have a moderate amount of data, SPARQL is very flexible and should make the merge fairly easy, unless it's already easy to do as part of your pipeline. If you need to merge it with other data sources, then you're doing it in SPARQL.

I created a set of functions for Stardog that do string metric comparisons like what you're looking for: https://github.com/semantalytics-stardog/kibbles-string-metric

You can have multiple rdfs:label values on the same node.
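As a rough illustration of that point, here is a minimal Python sketch that writes out Turtle for one merged node carrying all three names as separate rdfs:label values. The helper name `merged_product_turtle` and the `http://example.com/product/A` IRI are made up for the example; only the rdfs namespace is real.

```python
def merged_product_turtle(iri, labels):
    """Return Turtle for one node with every name attached as its own rdfs:label."""

    def literal(text):
        # Escape backslashes and double quotes so the label is a valid Turtle string.
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    # Turtle lets several objects share a subject and predicate, separated by commas.
    objects = " ,\n        ".join(literal(label) for label in labels)
    return (
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n\n"
        f"<{iri}> rdfs:label {objects} .\n"
    )


# Hypothetical IRI; the three names come from the question above.
turtle = merged_product_turtle(
    "http://example.com/product/A",
    ["Product A", "Product A Enterprise", "Product A Platform"],
)
print(turtle)
```

The single statement with comma-separated objects is just shorthand for three rdfs:label triples on the same subject, so none of the original names is lost in the merge.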

You might want to look into Stardog's sameAs reasoning.

R3000 · December 18, 2019, 6:28pm

Zach,

Thanks for sharing your approach. It makes more sense for me to do it beforehand.

As for string metrics, that's a really nice Java library. For Python I used fuzzywuzzy, with some degree of success. I hope this is useful for others who have a Python NLP pipeline.
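For anyone who wants to try the merge-before-loading idea without extra dependencies, here is a minimal sketch. It is not the fuzzywuzzy code from my pipeline: it swaps in the standard library's difflib, and the scoring rule (longest common block divided by the length of the shorter name), the function names, and the 0.95 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def partial_similarity(a, b):
    """Longest common block divided by the length of the shorter string (0.0 to 1.0)."""
    a, b = a.lower(), b.lower()
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0.0
    matcher = SequenceMatcher(None, shorter, longer, autojunk=False)
    match = matcher.find_longest_match(0, len(shorter), 0, len(longer))
    return match.size / len(shorter)


def group_names(names, threshold=0.95):
    """Greedily put each name into the first group whose canonical name it resembles."""
    groups = []  # groups[i][0] is the canonical (shortest) name of group i
    for name in sorted(names, key=len):  # shortest first, so base names become canonical
        for group in groups:
            if partial_similarity(group[0], name) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


names = ["Product A Enterprise", "Product B", "Product A", "Product A Platform"]
groups = group_names(names)
for group in groups:
    print(group[0], "<-", group[1:])
```

With these sample names the three "Product A" variants end up in one group and "Product B" stays on its own. The rule is crude (it would also pull "Product AB" into the "Product A" group), so the threshold and the metric need tuning against real data before anything is merged.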
