We are doing a huge data conversion. The data goes into Kafka and is then read by several consumers, including one that loads the data into Stardog. Bulk loading is not an option in this scenario, since the data lives in Kafka; this is effectively a streaming job. Hence my question: do you have any advice on how to do this?
I am using SPARQL Update over HTTP and parallelizing the calls, but it is still not that fast. My second idea is to buffer records and send bigger payloads at once: since I am using INSERT DATA, instead of sending one record at a time per connection, I could send multiple records in a single INSERT DATA.
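As a rough sketch of that buffering idea, here is one way to combine several records into a single INSERT DATA request body. The triple serialization, example IRIs, and helper names are assumptions for illustration, not anything Stardog-specific:

```python
def format_triple(s, p, o):
    """Serialize one triple in N-Triples style (IRIs assumed already absolute,
    objects assumed to be plain string literals)."""
    return f"<{s}> <{p}> \"{o}\" ."

def build_insert_data(triples):
    """Combine a batch of triples into one SPARQL INSERT DATA update string."""
    body = "\n".join(format_triple(*t) for t in triples)
    return f"INSERT DATA {{\n{body}\n}}"

# Hypothetical batch accumulated from Kafka messages.
batch = [
    ("http://example.com/r1", "http://example.com/name", "alpha"),
    ("http://example.com/r2", "http://example.com/name", "beta"),
]
query = build_insert_data(batch)
```

The resulting string would then be sent as one HTTP update request instead of two, amortizing the per-request and per-transaction overhead across the batch.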
I wonder if you have any suggestions on the matter. I’d like to make this process as fast as possible.
We are also loading data from Kafka into Stardog and have run into similar issues. Batching has helped quite a bit compared to a one-to-one mapping of Kafka message to Stardog insert.
The approach we are just embarking on this week is to commit adaptively based on how far we are from the head of the Kafka topic. Something like:
partitionLag = partitions.map( partition.maxOffset - partition.currentOffset )
commitAfterMessages = partitionLag.map( lag => Math.min(lag / 10, 100000) )
The idea is that bigger batches give higher throughput, but when we are at the head of the topic we prefer small batches, or possibly no batching at all.
I see, smart. I was thinking about batching indeed, but there is clearly more to it. Thank you for the tip.
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.