Batch loading & incremental loads

I am sure this is a very basic question that has been solved many times. I just don't seem to be able to find any good guidance.

We batch-load a pile of data, then repeat this at regular intervals, say once a day. We want to delete what is not in the new load, update what was updated, add what is new, and leave the rest alone. Fairly typical, I assume.

The questions now become:

  1. Are there efficient patterns to do this?
  2. How would you keep lineage and time-stamps? Not sure if PROV is granular enough.
  3. How would you arrange it so that the temporal data is still available and one could run "as-of" queries?

Any pointers, tips, code snippets would be much appreciated.

You've got a bunch of questions in there, so I'll start by getting some information about your particular situation. Let's start with your data. Are you in control of how the data is delivered, or are you stuck working with what you are given? It sounds like you're being given a sliding window of the data. What format is it being delivered in? Are these CSV files that you're then mapping, or are you being given RDF?

Assume I get CSV or the like. Could be JSON too. Basically a batch of data units that are then transformed into instances of one or more RDFS/OWL classes. To take a simple example: assume I get a list of installed computers each day. Some dropped off since yesterday (decommissioned), some are new, and some may have changed something (e.g., installed memory). Similarly for employees, departments, applications, ...

I should have asked how large the files you expect to work with are. 500 GB vs. 5 MB would make a big difference in how you might approach it. If it were CSV or some columnar format, assuming it was normalized, you could just do a diff to get removed/added/unchanged. I don't think you could do that with JSON. Are you going to have consistent identifiers? If not, then you've got an entity resolution problem and things just got a lot more interesting.
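For example, a minimal diff sketch in Python, assuming each snapshot is a CSV with a stable identifier column (the file names and the `computer_id` column here are just placeholders):

```python
import csv

def load_rows(path, key="computer_id"):
    """Read a CSV snapshot into a dict keyed by its identifier column."""
    with open(path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}

old = load_rows("computers_yesterday.csv")
new = load_rows("computers_today.csv")

added   = [new[k] for k in new.keys() - old.keys()]
removed = [old[k] for k in old.keys() - new.keys()]
changed = [new[k] for k in new.keys() & old.keys() if new[k] != old[k]]

# 'added' rows get mapped and inserted, 'removed' rows drive deletes, and
# 'changed' rows drive a delete-then-insert for the affected resources.
```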

If the data isn't too large, you could just map it and load it into separate named graphs. You'd still have to be careful with blank nodes, so I'd avoid them if possible.
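A rough sketch of what that could look like with rdflib (the graph IRI scheme and the file name are made up; against an actual store you'd typically load into the named graph through its SPARQL endpoint or bulk loader instead of an in-memory Dataset):

```python
from rdflib import Dataset, URIRef

ds = Dataset()

# One named graph per daily batch; the graph IRI encodes the load date.
batch = ds.graph(URIRef("urn:batch:2024-01-15"))
batch.parse("computers_2024-01-15.ttl", format="turtle")
```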

I'm not sure exactly what your requirements would be for lineage and time-stamps. What granularity are you looking for? Or a better question: what type of use case are you looking to support?

"As-of" queries would probably be easiest to do if you simply stored a complete snapshot in separate graphs and stored the valid dates in another graph or the default graph but depending on how quickly the data was turning over there would be a lot of redundant information. There's probably something clever you could do here with breaking it down into non-overlapping graphs and tracking the set of named graphs but it would get complex to track.

That's just off the top of my head. It seems like there are a lot of options but without knowing more about your exact use case it's tough to say what direction you should go. I'd say try a few out and iterate quickly. The nice thing about RDF is it's flexible enough that you can try out a lot of ideas quickly without having to start from scratch each time.

Leaving size aside, as I'd rather move towards a repeatable recipe:

I get the idea of doing the work "outside" of the triple store. That's the same as what you would do with any other database ETL.

I have predictable identifiers and no blank nodes (at least not yet).

It does happen regularly that a resource is "assembled" from more than one source. So, I might have files A, B, and C that are merged into a resource (say computer basics from A, cost/accounting from B, ownership/support teams from C). I suppose we'll split that into multiple classes and link instances, or merge them through multiple class membership.
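If the mappings for A, B, and C all mint the same IRI for a given computer, the merge largely takes care of itself: parsing the mapped outputs into one graph (or one named graph per source) yields a single resource description. A tiny illustration with rdflib and made-up file names:

```python
from rdflib import Graph

g = Graph()
# Each file contributes different predicates about the same subject IRIs,
# so loading them together merges them into one description per computer.
for mapped_file in ["computers_basics_A.ttl",
                    "computers_costs_B.ttl",
                    "computers_owners_C.ttl"]:
    g.parse(mapped_file, format="turtle")
```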

But people do want to know: where did this field come from? When was it last updated? In some sense, it calls for reification. Possibly this could be done by mixing graphs that are suitably named. But it all "feels" rather messy, and I would like to think that this is a fairly common scenario. Maybe people solve this and keep the housekeeping outside of the graph?
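One common compromise, short of full reification, is to keep each source's triples in their own named graph and describe the graph itself with standard Dublin Core terms; "which graph is this statement in?" then answers both "where did this field come from?" and "when was it loaded?". A sketch of that idea, with made-up graph IRIs and file names:

```python
from rdflib import Dataset, Literal, URIRef
from rdflib.namespace import DCTERMS, XSD

ds = Dataset()

# Triples mapped from file A go into their own named graph ...
g_a = ds.graph(URIRef("urn:source:fileA:2024-01-15"))
g_a.parse("computers_basics_A.ttl", format="turtle")

# ... and the graph itself is described in the default graph.
ds.add((g_a.identifier, DCTERMS.source, Literal("fileA")))
ds.add((g_a.identifier, DCTERMS.modified,
        Literal("2024-01-15T06:00:00", datatype=XSD.dateTime)))
```

A query like `GRAPH ?g { ?computer ?field ?value }` followed by a lookup of `?g`'s `dct:source` and `dct:modified` then gives per-field lineage without reifying individual statements.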

The "easiest" solution would be to have (bi-)temporal support and just load :slight_smile:

I'm all for semantic web technologies, but I'll take the quick win when I get the chance. 🙂

There is a possibility that snapshots might be coming up in the 6 release. I seem to remember something mentioned about that. You might also want to look into the Stardog statement identifiers for tracking more granular provenance. There isn't much info on that functionality. Let me know if you're interested and I could help track down what there is.
