Query Throughput in Stardog

Hi all, I would like to reach a throughput of 100k queries per second for queries of the type:

DESCRIBE <http://geophy.io/buildings/ID>

where ID ranges from 1 to 100M.

Currently I can reach a max throughput of 1,000 queries per second. The queries return 26 triples on average.

This translates to about 26K triples per second, which seems very slow.

My task is to return all triples for a large number of specific resources.

We have tried these queries as well:
0 - Triples: 31186, time elapsed (1000 runs): 1.73s: `DESCRIBE <http://geophy.io/buildings/id>`
1 - Triples: 31250, time elapsed (1000 runs): 1.48s: `SELECT * WHERE { <http://geophy.io/buildings/id> ?p ?o }`
2 - Triples: 31250, time elapsed (1000 runs): 2.75s: `CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o FILTER (?s = <http://geophy.io/buildings/id>) }`
3 - Triples: 31250, time elapsed (1000 runs): 2.69s: `CONSTRUCT { <http://geophy.io/buildings/id> ?p ?o } WHERE { <http://geophy.io/buildings/id> ?p ?o }`

Is there any strategy I could use to speed up the query throughput?

My suggestion would be to batch resources on the client side and run fewer queries of the form

SELECT * WHERE {
?s ?p ?o  .
VALUES ?s { IRI_1 IRI_2 ... IRI_n }
}
ORDER BY ?s

This should improve throughput by reducing the number of client/server exchanges. You can run experiments for different values of n.
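For illustration, here is a minimal Python sketch of assembling such a batched query client-side. The `build_batch_query` helper and the example IDs are hypothetical, not part of any Stardog client API; it just produces the SPARQL string you would then send to the server:

```python
def build_batch_query(iris):
    """Build a single SELECT that fetches all triples for a batch of subject IRIs."""
    values = " ".join(f"<{iri}>" for iri in iris)
    return (
        "SELECT * WHERE {\n"
        "  ?s ?p ?o .\n"
        f"  VALUES ?s {{ {values} }}\n"
        "}\n"
        "ORDER BY ?s"
    )

# Example: a batch of n = 3 building IRIs (hypothetical IDs)
batch = [f"http://geophy.io/buildings/{i}" for i in (1, 2, 3)]
print(build_batch_query(batch))
```

ORDER BY ?s keeps each subject's triples contiguous in the result set, which makes it easy to regroup them per resource on the client.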

Cheers,
Pavel

Note that you can also combine VALUES with DESCRIBE:

DESCRIBE ?s VALUES ?s { IRI_1 IRI_2 ... IRI_n }

Other possible ways to improve performance:

  1. Execute queries in parallel. Depending on how many cores you have on the client and the server, you can use 4, 8, or more threads, which should improve throughput.
  2. Experiment with different result formats. N-Triples vs. Turtle might make a difference, but which one is faster depends on your triples.
  3. Increasing the server heap size might improve caching efficiency, depending on your data size.
  4. Alternatively, a caching layer between Stardog and your app could help too.
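Point 1 (parallel execution) can be sketched with Python's `concurrent.futures`. `run_query` below is a placeholder for whatever client call you actually use (e.g. an HTTP request to the Stardog SPARQL endpoint); it is stubbed out here so the sketch is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(query):
    """Placeholder: replace with your real client call, e.g. an HTTP POST to
    Stardog's SPARQL endpoint. Here it just returns the query's length."""
    return len(query)

def run_queries_in_parallel(queries, max_workers=8):
    """Fan batched queries out over a thread pool. Since the real work is
    I/O-bound (HTTP round-trips), threads overlap well even under the GIL."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_query, queries))

results = run_queries_in_parallel([
    "DESCRIBE <http://geophy.io/buildings/1>",
    "DESCRIBE <http://geophy.io/buildings/2>",
])
```

Benchmark with different `max_workers` values; past the point where the server's cores are saturated, adding threads stops helping.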

If none of these suggestions works, you might change your data layout so that the resources you want to export are stored in a specific named graph; then you can use `data export -g named-graph` or the equivalent SPARQL query.

Best,
Evren

Thank you all. This definitely improved the performance.

Running without transactions will also speed things up: if you are calling .begin() and .commit() around your queries, removing those calls will help.
