Percent encoding in SMS?

Hello Stardog folks,

I currently use hash encoding for date and character source data to create IRI-safe values. This seems like overkill. Is there percent encoding in SMS, as there is in R2RML, to convert values like " 12 Dec 2016" to:

12%20Dec%202016 

Similarly for IRIs that come from character fields containing spaces, like “Hispanic or Latino”, to:

Hispanic%20or%20Latino

Cheers,

Tim
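
For reference, the conversion being asked for can be sketched in a few lines of Python. This is only an illustration of the target encoding, not something SMS does; the helper name `iri_safe` is hypothetical, and stripping surrounding whitespace is an assumption based on the expected output shown above.

```python
from urllib.parse import quote


def iri_safe(value: str) -> str:
    """Percent-encode a raw field value so it can be used inside an IRI.

    Hypothetical helper for illustration only.
    """
    # strip() drops leading/trailing whitespace (an assumption);
    # safe="" makes quote() encode "/" as well, which it leaves alone by default.
    return quote(value.strip(), safe="")


print(iri_safe(" 12 Dec 2016"))        # 12%20Dec%202016
print(iri_safe("Hispanic or Latino"))  # Hispanic%20or%20Latino
```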

You can make it part of your SQL query by calling the URLENCODE() function (or something similar, depending on your database). I think there is a property setting to specify non-standard functions that you’d like to call. I can’t remember it off the top of my head, but give me a second and I’ll find it.

The parameter is sql.functions

“A comma-separated list of SQL function names to register with the parser. If an R2RML view (using rr:sqlQuery) fails to parse, this option can be set to allow use of non-standard functions.”

Hi Zach,

I am uploading a source CSV file into the graph using an SMS TTL file. Here is how I currently hash the value for ethnicity:

study:ethnicity code:Enthnicity_{#ethnicity}  ;

Is there a corresponding way to percent encode this incoming value instead of hashing?

Cheers

Tim

I’m a little confused by the ‘#’ in there. Was that a typo? Can you elaborate a little bit more on what you mean by hashing? I believe that if the value is included as part of a URL template it will automatically be percent encoded. Is that not happening?

Not a typo. 🙂 I am likely not approaching this in the correct way.

If I leave out the hash I get the message:

...HttpClientException: IRI included an encoded space: '32' 

So I was hashing the incoming values to enable IRI creation.

Perhaps my question should be: How do I use a URL template to automatically percent encode these values?

My upload command is:

call stardog-admin virtual import myGraph DM_mappings.TTL DM_subset.csv

Hi Tim,

This is something that we currently don’t support, but R2RML defines IRI-safe versions of templates, and we have a ticket in place to implement that on the SMS side.

“I am disappoint” 😥 but glad it is on the radar. Do you have an ETA for implementation? We are defining data upload/conversion scripts and documentation for an industry project and:

study:ethnicity code:Ethnicity_Hispanic%20or%20Latino .

would be much easier for our new project staff to understand than the corresponding:

study:ethnicity code:Ethnicity_gnja37oohiiipittns2ro9rma4k5q8i5 .

It’s slightly outside your current workflow, but you might try importing your CSV files into an RDBMS and using the URLENCODE function that I suggested earlier. Or you could get all fancy with a bash script and awk and do the URL encoding there.
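
In the same spirit as the bash/awk idea, here is a minimal Python sketch of preprocessing the CSV before the virtual import. It is an assumption-laden illustration, not a Stardog feature: the function name `add_encoded_column` and the `usubjid` column are hypothetical, and `en_ethnicity` is just a name for the new column. The original column is left untouched and the encoded value is written alongside it.

```python
import csv
import io
from urllib.parse import quote


def add_encoded_column(src, dst, column, new_column):
    """Copy CSV rows from src to dst, adding a percent-encoded copy of one column.

    Hypothetical helper; src and dst are open text file objects.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames) + [new_column]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # safe="" encodes every reserved character, including "/".
        row[new_column] = quote(row[column].strip(), safe="")
        writer.writerow(row)


# Small in-memory demo; the column names and values are made up for illustration.
source = io.StringIO("usubjid,ethnicity\n01,Hispanic or Latino\n")
target = io.StringIO()
add_encoded_column(source, target, "ethnicity", "en_ethnicity")
print(target.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by `open(..., newline="")` handles, and the mapping template would then reference the new column instead of the raw one.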

No ETA at the moment. However, I would make the point that something like rdfs:label is more appropriate for a “human-readable” name of something. So if you had, elsewhere in your mapping…

code:Ethnicity_gnja37oohiiipittns2ro9rma4k5q8i5 rdfs:label "Hispanic or Latino" .

You could extend existing queries to include it without much work:

BEFORE
select ?study ?ethnicity { ?study study:ethnicity ?ethnicity }

WITH BNODE
select ?study ?ethnicity { ?study study:ethnicity [ rdfs:label ?ethnicity ] }

WITH VARIABLE
select ?study ?ethnicity { 
  ?study study:ethnicity  ?eth.
  ?eth rdfs:label ?ethnicity ;
    other:predicate ?here .
}

I like the way you think, Zach. 👍

I’m already processing the data using R to convert the source SAS Transport Format (XPT) to CSV for the SMS process, so I could URL encode at the same time. I had hoped to leave the source data as pristine as possible, but this is likely the best kludge for me until SMS gets URL encoding in place.

R code mock up:

library(utils)  # URLencode() is in utils, which R attaches by default
ethnicity <- "Hispanic or Latino"
en_ethnicity <- URLencode(ethnicity)  # "Hispanic%20or%20Latino"

then in the SMS:

study:ethnicity code:Enthnicity_{en_ethnicity}  ;

Thanks for taking the time to walk me through this. Case closed and have a great day!

Tim

Stephen, you make a good point regarding rdfs:label for human readability. The hashed IRI is fine from the machine’s perspective. The “decode” of our terms is in another graph. We’re trying to strike a compromise between machine processing and human readability, mainly because we need to sell this approach to entry-level RDF folks. “See - that value here in the instance graph is linked to the terminology graph over here using this IRI…” 🙂

But yeah, I totally get what you are saying!

Tim

Thanks, but I have to agree with Stephen that something queryable would be good. I’m not sure what your exact use case is so I’m just throwin’ options out there.
