SMS2 mappings how to remove spaces from a field

Hi all,

This is a simple topic but I cant find any example of how to do that.

I'm creating a VG from a MSQL Server DB and to that end I'm using the following template.

BIND(template("http://test.com/ontology-v1/onto1-{field}") AS ?entity)

The point is that the "field" contains data as follows:

"1    "
"2    "

and I just want to remove any space out of it.

I have tried to use SPARQL queries but also I haven't found a good example for that.

What would be the best practice here to remove the trailing spaces so that I can form the without them?

Hi Irlan,

you should be able to get rid of the spaces in the projection of the SQL query using the REPLACE function. Something like:

SELECT REPLACE(field, ' ', '') AS fieldNoSpace ...

and then you can use it in your template:

BIND(template("http://test.com/ontology-v1/onto1-{fieldNoSpace}") AS ?entity)

I hope this solves your problem.

Best regards

Lars

1 Like

Hi Lars,

Thanks a lot for your quick response!
It does resolve my problem.

One question regarding the approach, so that means that when this kind of things happen one has to do the cleaning in the SQL, JSON, CSV side of the mapping not in the SPARQL side right?

Hi Irlan,

Let me try to give a more complete answer.

Generally, the amount of transformations you can perform in the WHERE clause of SMS bindings for SQL sources is limited to creating TEMPLATEs for IRIs and casting fields to different RDF types. The functions that can be used when importing CSV or JSON is much richer. We go into the details on that here.

Also, in general, we encourage users to not think of IRIs as being intended for human consumption. You can map that column to an IRI directly and the spaces will be percent-encoded such that the training spaces will be converted to %20 character sequences. This is a bit ugly, but it will lead to the best performance. We go into the performance details here.

That said, if you're using Stardog to unify disparate data sources and in one source you have the trailing spaces and in the other you don't, you'll need these columns to match or else the IRI's won't match. In this case you can do the transformation in the FROM SQL clause. Lars' suggestion is good in this case. I also did some experimenting and found that casting to an integer works too, but you'll only want to try that if all your columns are numeric-only.

My tests were:

create TABLE numeric_space (
  id INT NOT NULL,
  text_id VARCHAR(5) NOT NULL,
  primary key (id)
);

INSERT INTO numeric_space (id, text_id) VALUES (1, '1');
INSERT INTO numeric_space (id, text_id) VALUES (2, '2    ');
INSERT INTO numeric_space (id, text_id) VALUES (3, '');

MAPPING
FROM SQL {
  SELECT id, text_id, cast(text_id as int) intid FROM numeric_space
}
TO {
  ?s <urn:has> ?o.
  ?s <urn:intid> ?intid.
  ?s <urn:trimmed> ?trimmed.
}
WHERE {
  BIND(TEMPLATE("urn:id:{id}") AS ?s)
  BIND(TEMPLATE("urn:textid:{text_id}") AS ?o)
  BIND(TEMPLATE("urn:trimmed:{intid}") AS ?trimmed)
}

stardog $ sq foo 'select * from <virtual://textid> { ?s ?p ?o }'
+----------+-------------+--------------------------+
|    s     |      p      |            o             |
+----------+-------------+--------------------------+
| urn:id:1 | urn:has     | urn:textid:1             |
| urn:id:1 | urn:intid   | "1"                      |
| urn:id:1 | urn:trimmed | urn:trimmed:1            |
| urn:id:2 | urn:has     | urn:textid:2%20%20%20%20 |
| urn:id:2 | urn:intid   | "2"                      |
| urn:id:2 | urn:trimmed | urn:trimmed:2            |
| urn:id:3 | urn:has     | urn:textid:              |
| urn:id:3 | urn:intid   | "0"                      |
| urn:id:3 | urn:trimmed | urn:trimmed:0            |
+----------+-------------+--------------------------+

Query returned 9 results in 00:00:00.183

stardog $ sq foo 'select * from <virtual://textid> { ?s <urn:trimmed> <urn:trimmed:2> }'
+----------+
|    s     |
+----------+
| urn:id:2 |
+----------+

Regards,
-Paul

1 Like

Hi Paul, and thanks a lot for your answer as well.
I will take into account your suggestions.

Regarding IRIs not suitable for human consumption I agree with that. The issue here is that sometimes we need to build an IRI and we need to create it in a way where we are combining data coming from more that one field since this is the only way we can guarantee to have something unique when we unify sources. And since this is an ETL process we haven't found so far the way of creating something unique that is also associated to the entity that we need to create. Any further suggestions here are welcome. But, maybe that is a topic for another thread :wink: which I will definitely need in the near future. All in all, great support :slight_smile:

Hi Irlan,

sure, let us know if you need further assistance. Regarding your point on the creation of unique IRIs

we need to build an IRI and we need to create it in a way where we are combining data coming from more that one field

You can use multiple fields/columns in an IRI template. For more info see our documentation or the R2RML docs. So for example, the following template would work as well:

BIND(template("http://test.com/ontology-v1/onto1-{field1}-{field2}") AS ?entity)

Best regards
Lars

1 Like

Hi Irlan,

Since this is an ETL process, the performance concerns are not a priority. Also, you can copy or import your virtual graph into a temporary named graph in one step and use SPARQL on the Stardog side to clean and transform your data as you move it into your final named graph. This is just another option - the way you're dealing with IRIs in the mappings is fine.

-Paul

Hi Paul,

Thanks for your answer.
I disagree with you here. There are some "detractors" of KG-based approaches and we really need to make from the ETL to the performance of the main named graph optimal, and that is the reason of my request since some time ago we implemented this ETL process with another tool and now we are doing it with Stardog.

Best,
Irlan

There won't be a performance hit if you go with the SQL-based approach that I mentioned.

The performance hit I'm referring to occurs when you try to find a node in the graph based on a SQL-calculated field. I'm my example above, it was when I queried for ?s <urn:trimmed> <urn:trimmed:1>. Because the trimmed column is a calculated field, the database in this case has to do a full table scan to find rows where the casted value of the id_text field equates to 1. In an ETL scenario, we're alreading scanning the whole table, adding a minimal additional cost of converting the data on the fly.

-Paul