Would the performance be an issue in production?

Some of the example queries in this tutorial seem to take a long time. For example:

SELECT ?actor ?name (count(?movie) as ?numMovies)
WHERE {
    ?actor :hasName ?name .
    ?actor :actedIn ?movie .
}
GROUP BY ?actor ?name
ORDER BY DESC(?numMovies)

This query takes over 10k ms, and most of the path queries take over 1k ms. In a production system, that would be too slow to be useful. Ideally, a single query should take no more than a few ms; several hundred milliseconds is usually not tolerable.

Am I missing something in terms of performance?

You can try restarting your Stardog server with more memory allocated (the defaults are somewhat low). Set the $STARDOG_SERVER_JAVA_ARGS variable to something like -Xms4g -Xmx4g -XX:MaxDirectMemorySize=8g (or higher) depending on the memory limitations of your machine, and you should see increased performance.

Is there a configuration file under the root directory of the installation? Where do I set $STARDOG_SERVER_JAVA_ARGS? Thanks.

That is set as an environment variable in the shell in which Stardog is started. If Stardog is running under systemd, you would set it in /etc/stardog.env.sh.
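A minimal sketch of what that looks like in a bash-like shell. The memory values are the ones suggested above; the restart commands are left commented out so the snippet can be pasted without touching a running server.

```shell
# Export in the same shell session that will launch the server.
export STARDOG_SERVER_JAVA_ARGS="-Xms4g -Xmx4g -XX:MaxDirectMemorySize=8g"

# A child process only sees exported variables, so check from a subshell.
sh -c 'echo "server JVM args: ${STARDOG_SERVER_JAVA_ARGS:-<not set>}"'

# The JVM reads the value only at startup, so restart afterwards:
# stardog-admin server stop
# stardog-admin server start
```

If the subshell prints `<not set>`, the variable was not exported in this session and the server will start with its defaults.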

I added the following to my ~/.bashrc on Ubuntu and it doesn't seem to take effect:

export STARDOG_JAVA_ARGS="-Xms4g -Xmx4g -XX:MaxDirectMemorySize=8g"

Is it set in the shell that you're running stardog from?


And did you restart Stardog? You probably want to set STARDOG_SERVER_JAVA_ARGS, not STARDOG_JAVA_ARGS.

Yes. I changed it to STARDOG_SERVER_JAVA_ARGS and restarted the server. With the movie data, the time for this query dropped from over 10k ms to over 6k ms:

CONSTRUCT {
    ?domain ?prop ?range
}
WHERE {
    ?subject ?prop ?object .
    ?subject a ?domain .
    optional {
        ?object a ?oClass .
    }
    bind(if(bound(?oClass), ?oClass, datatype(?object)) as ?range)
    filter (?prop != rdf:type && ?prop != rdfs:domain && ?prop != rdfs:range)
}

But the following query still takes over 10k ms to complete.

PATHS
    START ?x {?x :hasName "Kevin Bacon"}
    END ?y {?y :hasName "Nick Offerman"}
    VIA {
        ?movie a :Movie ;
          :hasTitle ?title ;
          :hasYear ?year .
        ?x :actedIn ?movie ;
            :hasName ?xName .
        ?y :actedIn ?movie ;
            :hasName ?yName .
        FILTER (?year >= 2010)
    }

Both queries seem slow. My Ubuntu machine has 64G of memory and 8 cores. Is this performance expected? Could you test them on your computer? Thanks.

I am testing in Stardog Studio. After I restart the server from the command line and relaunch Studio, is there anything special I need to do to reconfigure Studio to work with the newly started server? I just relaunched Studio and clicked 'connect' to reconnect to the database.

I'm seeing 27s and 21s for the two queries with the same 4G/8G memory allocation you're running with.

You can look over the query plan to see what's happening.

Reduced [#3.8M]
`─ Projection(?domain AS ?subject, ?prop AS ?predicate, ?range AS ?object, <tag:stardog:api:context:default> AS ?context) [#3.8M]
   `─ Bind(IF(Bound(?oClass), ?oClass, Datatype(?object)) AS ?range) [#3.8M]
      `─ HashJoinOuter(?object) [#3.8M]
         +─ MergeJoin(?subject) [#3.8M]
         │  +─ Scan[PSOC](?subject, rdf:type, ?domain) [#921K]
         │  `─ Filter((?prop != rdf:type && (?prop != rdfs:domain && ?prop != rdfs:range))) [#3.8M]
         │     `─ Scan[SPO](?subject, ?prop, ?object) [#3.8M]
         `─ Scan[PSOC](?object, rdf:type, ?oClass) [#921K]

This is the explanation file for the query. HashJoinOuter was highlighted in red in the plan panel.

Hi Martin,

I'd say that the SELECT and the CONSTRUCT queries probably can't run <1 sec because they process a lot of data (pretty much all of the database). The SELECT query also returns >500K results so there's a noticeable ORDER BY and HTTP overhead, too. But the PATHS query I'd expect to be faster and we'll take a look at what's going on. One thing to notice there is that the problem is related to querying for attributes of intermediate nodes in paths, i.e. this reduced version:

PATHS
    START ?x {?x :hasName "Kevin Bacon"}
    END ?y {?y :hasName "Nick Offerman"}
    VIA {
        ?movie a :Movie ;
          :hasYear ?year .
        ?x :actedIn ?movie .
        ?y :actedIn ?movie .
        FILTER (?year >= 2010)
    }

completes in <1s for me.

Thanks for the report,

Hi, Pavel:

I know 1ms is too demanding. My problem is that I can't reproduce the performance you and Zachary get on these two queries. Zachary said he got 27ms and 21ms for the two queries above. And for your reduced version, I still get 2925 ms (see the screenshot). I did set the variables as below:

export STARDOG_HOME="/home/martin/stardog-7.0.3"
export PATH="/home/martin/stardog-7.0.3/bin:$PATH"
export PATH="/opt/gradle-6.0.1/bin:$PATH"
export STARDOG_SERVER_JAVA_ARGS="-Xms4g -Xmx4g -XX:MaxDirectMemorySize=8g"

-Xms4g -Xmx4g -XX:MaxDirectMemorySize=8g

I am working on Kubuntu 18.04, with 64G ram and 8 cores.

Well, Zach said 21s, not 21ms. It's about half that for me, but in the same ballpark. 3s for one path on your machine is definitely too much compared to 500ms on my OSX laptop (16G RAM). Is Stardog running locally for you? Does the time change if you run the query multiple times?
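One way to check whether repeated runs get faster is a small timing loop like the sketch below. `time_runs` is a hypothetical helper, not part of Stardog, and it relies on GNU `date` for nanosecond timestamps, so it assumes Linux. The database name `movies` and the query file `reduced-paths.rq` in the commented usage line are assumptions as well.

```shell
# time_runs N CMD...: run CMD N times and print the wall-clock time of each run.
time_runs() {
  n=$1; shift
  i=1
  while [ "$i" -le "$n" ]; do
    start=$(date +%s%N)               # nanoseconds since epoch (GNU date only)
    "$@" >/dev/null 2>&1              # discard output; only timing matters here
    end=$(date +%s%N)
    echo "run $i: $(( (end - start) / 1000000 )) ms"
    i=$((i + 1))
  done
}

# Hypothetical usage against a local server:
# time_runs 5 stardog query execute movies reduced-paths.rq
```

The numbers include CLI startup and HTTP overhead, so they are only useful for comparing runs with each other, not with the time Studio reports.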

Oh, also I have an SSD disk which is quite important. If your home is on an HDD it could affect the results.

I restarted Studio and the reduced version now takes 389ms. I am running Stardog locally. Is this normal? But you still get 1s, much lower.

Again, I think there's some confusion re: units here. 389ms is consistent with my experience of <1s, that is, 1000ms. I normally get the path back in about 500ms.

As I said, we created a ticket to look into performance for the full version of this query. Generally querying for properties of intermediate nodes in paths comes at a cost when many paths go through the same nodes (so not just the nodes are repeated but their associated properties too). But it doesn't seem to be the case here since there's only a single path.


Hi, Pavel:

Based on feedback from your current customers, is it a real problem when a single query takes about 300-500 ms in a production system? If the KG is part of an analysis tool, it may be OK for an analyst to wait several hundred milliseconds or even a few seconds for a query to complete. If instead the KG supports a real-time system, e.g. a chatbot interacting with many customers, would that be too slow? (In such a case, assume the KG query time alone should be under 100ms.) It would be great if you could share some information on this based on your customers' experience and feedback.

Well, I don't think people would use path queries to provide data for real-time tasks, e.g. for UI stuff. Searching for paths is naturally seen as more of an analytic task which requires possibly deep graph traversals (unless you constrain it in some specific ways, e.g. with the MAX LENGTH keyword). We have customers who use Stardog to power UI but those queries are typically more like finding properties/connections of specific nodes in the graphs, i.e. low latency requires selective patterns. And then there's typically stuff between a database and the UI layer, e.g. caches, to ensure responsiveness.
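As a rough illustration of the MAX LENGTH constraint, the sketch below writes a length-bounded variant of the reduced path query from earlier in the thread to a file. The bound of 2 and the file name `bounded-paths.rq` are arbitrary choices for illustration, the `:` prefix is assumed to come from the database's stored namespaces as in the other queries here, and `movies` in the commented line is an assumed database name.

```shell
# Write a length-bounded path query to a file (the bound of 2 is illustrative).
cat > bounded-paths.rq <<'EOF'
PATHS
    START ?x {?x :hasName "Kevin Bacon"}
    END ?y {?y :hasName "Nick Offerman"}
    VIA {
        ?movie a :Movie .
        ?x :actedIn ?movie .
        ?y :actedIn ?movie .
    }
    MAX LENGTH 2
EOF

# Hypothetical invocation against a local server:
# stardog query execute movies bounded-paths.rq
```

Whether a bound like this helps depends on the data; it limits how deep the traversal can go, which is the part that makes unconstrained path searches expensive.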