SPARQL best practice: Filter by time

borgendorf · September 26, 2018, 12:00am

Hi there. I have a database of events that have occurred, and each event has a datetime associated with it, and is in its own named graph. The following query runs hourly to build a report of what has happened in the last hour. The problem I'm having is that the more data in the database, the longer this query takes (for instance, it took 15 seconds this time). Is there a more efficient way to say, "Only bring back objects from the last hour"?

select distinct ?username where {
  graph ?g {

  ?user a <tag:champtc:enterprise#Employee> .
  ?user <tag:champtc:enterprise#hasAccountName> ?username .
  ?t <tag:champtc:core#hasEventTime> ?time .

  BIND( (NOW() - "P0DT1H"^^xsd:dayTimeDuration) AS ?beginTime ) .
  FILTER( ?time >= ?beginTime )
   }
} GROUP BY ?username

pavel · September 26, 2018, 1:39pm

Hi Ian,

One strange thing in your query is the lack of connection between the

?user a <tag:champtc:enterprise#Employee> .
?user <tag:champtc:enterprise#hasAccountName> ?username .

and the

?t <tag:champtc:core#hasEventTime> ?time .

pattern. That means the query should return all combinations of users and time records. That's typically inefficient. Is this intended?

Best,
Pavel

borgendorf · September 26, 2018, 3:03pm

Thanks for the reply, Pavel. There are multiple connected objects in the graph that is being searched, and the object that has the time has an unknown type (it could be any number of things) and is not necessarily the same object that the user property is on. The query is trying to say, "Return the unique usernames of any named graphs that have a time within the last hour."

Even if I can make that part of the query more specific, is there a better way to do the "in the last hour" part without having to first select everything and then filter out the ones I don't want?

Thanks again.

zachary.whitley · September 26, 2018, 3:21pm

I hope you don't mind my jumping on your question but it's related to some things I've been thinking about lately. I don't think Stardog has any sort of indexing for datatype. Is that correct? I think you're going to have a similar problem filtering with dates that you have filtering on strings that is solved for the latter by full text search. I had asked a question a while ago about hooking into bulk loading because I had wanted to explore timeseries indexes and possibly see if you could use http://www.chronix.io/ for time.

Would it work better if you translated dateTime into epoc and do a numeric filter? I thought you might be able to do a xsd:integer(now()) but it looks like that doesn't work so you'd need to write a custom function.

You might be able to use something like https://www.timescale.com/ and virtual graphs.

pavel · September 26, 2018, 3:25pm

Well, the "within the last hour" condition is tricky because Stardog does not maintain indexes over dates. There're, of course, tricks you can do but they'd require changes in modeling, e.g. storing the date and the hour for each time instances (as integer literals), and then using it in the query, e.g.

?t <tag:champtc:core#hasEventHour> "10"^^xsd:int .

(10 is just an example, you can compute it in the query using now(), etc., just avoid >= filters)

But I'm suspecting there might be other problems here as well. Can you show the query plan and tell the version of Stardog?

Thanks,
Pavel

borgendorf · September 26, 2018, 3:30pm

Here is the query plan. I'm pretty sure we are using Stardog 5.2.3.

Distinct [#54]
`─ Projection(?username) [#54]
   `─ Group(by=[?username]) [#54]
      `─ Filter(?time >= ?beginTime) [#63K]
         `─ Bind((NOW() - "P0DT1H"^^xsd:dayTimeDuration) AS ?beginTime) [#126K]
            `─ LoopJoin(_) [#126K]
               +─ MergeJoin(?user) [#56]
               │  +─ Scan[POSC](?user, rdf:type, <tag:champtc:enterprise#Employee>){?g} [#56]
               │  `─ Scan[PSOC](?user, <tag:champtc:enterprise#hasAccountName>, ?username){?g} [#56]
               `─ Scan[PSO](?t, <tag:champtc:core#hasEventTime>, ?time){?g} [#2.3K]

pavel · September 26, 2018, 3:34pm

OK, would it be easy for you to try 5.3.4 and see how it performs? There's a query optimization issue which we recently fixed and which is relevant to your case (the top join should be on the graph variable and not a loop join).

Cheers,
Pavel

borgendorf · September 26, 2018, 3:37pm

Fantastic. I'll get it on the ToDo board and get it into our app. Stay tuned!

pavel · September 26, 2018, 3:40pm

Just to be clear: it won't completely solve your problem, i.e. query runtime will increase as the amount of data grows. But it should be better than now, especially if named graphs are relatively small compared to the size of the database.

Cheers,
Pavel

borgendorf · September 26, 2018, 3:51pm

No worries. Support is part of my job, too. I understand not overpromising.

zachary.whitley · September 26, 2018, 4:06pm

It looks like you could convert your dateTime to a lexicographically sortable string and possibly use full text search. DateTools (Lucene 7.3.1 API)

I'm sure it wouldn't take much to wrap that into a function.

borgendorf · October 9, 2018, 9:49pm

Upgrading to 5.3.4 helped, and what really speeds it up is knowing that our named graphs are all named after the primary type of object they are, so I added a ?g a ... statement in there so it wasn't looking through all of the subjects.

system · October 23, 2018, 9:55pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
How to write SPARQL query for JSON data? Support	2	411	August 23, 2021
SPARQL Query Optimization Tips and Tricks	5	589	November 2, 2022
All data in a SPARQL query Support	2	674	April 26, 2017
SPARQL query to SELECT available Named Graphs Stardog Studio	10	138	January 6, 2025
Some inference rules work, others don't. Unable to query for rules Support	11	310	January 19, 2025

SPARQL best practice: Filter by time

Related topics