SPARQL best practice: Filter by time

Hi there. I have a database of events that have occurred, and each event has a datetime associated with it, and is in its own named graph. The following query runs hourly to build a report of what has happened in the last hour. The problem I'm having is that the more data in the database, the longer this query takes (for instance, it took 15 seconds this time). Is there a more efficient way to say, "Only bring back objects from the last hour"?

select distinct ?username where {
  graph ?g {

  ?user a <tag:champtc:enterprise#Employee> .
  ?user <tag:champtc:enterprise#hasAccountName> ?username .
  ?t <tag:champtc:core#hasEventTime> ?time .

  BIND( (NOW() - "P0DT1H"^^xsd:dayTimeDuration) AS ?beginTime ) .
  FILTER( ?time >= ?beginTime )
   }
} GROUP BY ?username

Hi Ian,

One strange thing in your query is the lack of connection between the

?user a <tag:champtc:enterprise#Employee> .
?user <tag:champtc:enterprise#hasAccountName> ?username .

and the

?t <tag:champtc:core#hasEventTime> ?time .

pattern. That means the query should return all combinations of users and time records. That's typically inefficient. Is this intended?

Best,
Pavel

Thanks for the reply, Pavel. There are multiple connected objects in the graph that is being searched, and the object that has the time has an unknown type (it could be any number of things) and is not necessarily the same object that the user property is on. The query is trying to say, "Return the unique usernames of any named graphs that have a time within the last hour."

Even if I can make that part of the query more specific, is there a better way to do the "in the last hour" part without having to first select everything and then filter out the ones I don't want?

Thanks again.

I hope you don't mind my jumping on your question but it's related to some things I've been thinking about lately. I don't think Stardog has any sort of indexing for datatype. Is that correct? I think you're going to have a similar problem filtering with dates that you have filtering on strings that is solved for the latter by full text search. I had asked a question a while ago about hooking into bulk loading because I had wanted to explore timeseries indexes and possibly see if you could use http://www.chronix.io/ for time.

Would it work better if you translated dateTime into epoc and do a numeric filter? I thought you might be able to do a xsd:integer(now()) but it looks like that doesn't work so you'd need to write a custom function.

You might be able to use something like https://www.timescale.com/ and virtual graphs.

Well, the "within the last hour" condition is tricky because Stardog does not maintain indexes over dates. There're, of course, tricks you can do but they'd require changes in modeling, e.g. storing the date and the hour for each time instances (as integer literals), and then using it in the query, e.g.

?t <tag:champtc:core#hasEventHour> "10"^^xsd:int .

(10 is just an example, you can compute it in the query using now(), etc., just avoid >= filters)

But I'm suspecting there might be other problems here as well. Can you show the query plan and tell the version of Stardog?

Thanks,
Pavel

Here is the query plan. I'm pretty sure we are using Stardog 5.2.3.

Distinct [#54]
`─ Projection(?username) [#54]
   `─ Group(by=[?username]) [#54]
      `─ Filter(?time >= ?beginTime) [#63K]
         `─ Bind((NOW() - "P0DT1H"^^xsd:dayTimeDuration) AS ?beginTime) [#126K]
            `─ LoopJoin(_) [#126K]
               +─ MergeJoin(?user) [#56]
               │  +─ Scan[POSC](?user, rdf:type, <tag:champtc:enterprise#Employee>){?g} [#56]
               │  `─ Scan[PSOC](?user, <tag:champtc:enterprise#hasAccountName>, ?username){?g} [#56]
               `─ Scan[PSO](?t, <tag:champtc:core#hasEventTime>, ?time){?g} [#2.3K]

OK, would it be easy for you to try 5.3.4 and see how it performs? There's a query optimization issue which we recently fixed and which is relevant to your case (the top join should be on the graph variable and not a loop join).

Cheers,
Pavel

Fantastic. I'll get it on the ToDo board and get it into our app. Stay tuned!

Just to be clear: it won't completely solve your problem, i.e. query runtime will increase as the amount of data grows. But it should be better than now, especially if named graphs are relatively small compared to the size of the database.

Cheers,
Pavel

No worries. Support is part of my job, too. I understand not overpromising. :smile:

It looks like you could convert your dateTime to a lexicographically sortable string and possibly use full text search. https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/document/DateTools.html

I'm sure it wouldn't take much to wrap that into a function.

Upgrading to 5.3.4 helped, and what really speeds it up is knowing that our named graphs are all named after the primary type of object they are, so I added a ?g a ... statement in there so it wasn't looking through all of the subjects.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.