PATHS ALL query memory leak / unresponsive?

Hello!

I'm doing some custom machine learning work with Stardog, and as part of that I need to run hundreds of PATHS ALL queries in sequence. Each individual query works well, returning results in ~1 second, but after many of them have run, Stardog becomes unresponsive.

As the queries run, Stardog uses more and more memory. Once it reaches its memory allocation, the CPU pegs at 100% and the server stops responding, requiring a restart. (Seems to me like there may be a memory leak?)

Running the server with more memory (I've tried up to -Xmx16g -Xms16g with 32g of direct memory) lets the system process more queries before it locks up, but does not solve the problem. I'm running Stardog through Docker, and the container has a higher memory limit than Stardog is configured to use.

The database is really small too, only ~6K triples, so it shouldn't need anywhere near that much memory.

The query comes from the following Python format string, where {s} and {t} are two specific entities and {max_length} is usually 5:

    query = f"""
        PATHS ALL
        START ?s = {s}
        END   ?t = {t}
        VIA   {{
            GRAPH <###subgraph_name###> {{
                ?s a ?sc . ?t a ?tc .
            }}

            ?sc a <http://www.w3.org/ns/shacl#NodeShape> .
            ?tc a <http://www.w3.org/ns/shacl#NodeShape> .

            # In the case of triples of the form  s --(r)-> t
            {{
                ?sc <http://www.w3.org/ns/shacl#property> ?_prop .
                ?_prop <http://www.w3.org/ns/shacl#class> ?tc ;
                    <http://www.w3.org/ns/shacl#path> ?p .
                GRAPH <###subgraph_name###> {{ ?s ?p ?t . }}
                BIND('forward' AS ?direction)
            }}
            UNION
            # In the case of triples of the form  t --(r)-> s
            {{
                ?tc <http://www.w3.org/ns/shacl#property> ?_prop .
                ?_prop <http://www.w3.org/ns/shacl#class> ?sc ;
                    <http://www.w3.org/ns/shacl#path> ?p .
                GRAPH <###subgraph_name###> {{ ?t ?p ?s . }}
                BIND('inverse' AS ?direction)
            }}
        }}
        MAX LENGTH {max_length}
    """

Any guidance would be very much appreciated. Thank you!

Hi, the first thing I'd suggest is to rule out the possibility that query results and/or connections aren't being closed, if your client executes these queries in some sort of loop. If the client does not fully consume the results or close the connection, the server will keep the result set around, which could lead to the behaviour you're observing. Feel free to share your code.
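
For example, with pystardog the loop can be written so that each connection is guaranteed to be released, along these lines (just a sketch; the database name and credentials are placeholders for your setup):

    import stardog

    conn_details = dict(
        endpoint='http://localhost:5820',
        username='admin',
        password='admin'
    )

    queries = []  # your PATHS ALL queries, built elsewhere

    for query in queries:
        # The context manager closes the connection even if processing
        # the result raises, so nothing is left open on the server.
        with stardog.Connection('db_name', **conn_details) as conn:
            res = conn.select(query)  # select() consumes the full result
        # ... process res here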

You can also run stardog-admin server metrics | grep queries.running to check whether the number of active queries is growing.

If all seems well, the next step would be to collect a Java heap dump when the server is stuck and share it with us. You don't need a large heap for that; feel free to lower -Xmx to get a smaller dump file.
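
Alternatively, you can have the JVM write the dump automatically when it runs out of memory by adding flags along these lines to the server's JVM arguments (a sketch; adjust the dump path to a location you can reach from outside the container):

    -Xmx256m -Xms256m -XX:MaxDirectMemorySize=512m
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/opt/stardog/heap.hprof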

Cheers,
Pavel

Hi Pavel,

Thanks for the quick response!

I changed my code to use the following pattern for all Stardog access, and stardog-admin server metrics was reporting 0-1 active queries at a time, so I don't think it's leaving resources open for too long:

    import stardog

    def stardog_conn():
        conn_details = dict(
            endpoint='http://localhost:5820',
            username='admin',
            password='admin'
        )
        return stardog.Connection('db_name', **conn_details)

    def processing_function(query):
        with stardog_conn() as conn:
            res = conn.select(query)
        # process res
        return processed_res

Here is a link to a heap dump taken with -Xmx256m -Xms256m and max direct memory 512m; hopefully that's enough to gain some insight. Let me know if you'd like me to run it again with more memory.

I did notice this time that the Stardog Docker container reports an explicit java.lang.OutOfMemoryError: GC overhead limit exceeded, too.

Cheers!

Thanks a lot. There is indeed a leak of some sort here.

While we're investigating, try the following: add query.plan.reuse=never to your database options (either with metadata set, or as an argument to db create if you reload the data). You can also restart the server to clean up the heap.
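
Concretely, that would be something like the following (assuming the database name from your snippet; data.ttl is a placeholder for your data file):

    stardog-admin metadata set -o query.plan.reuse=never db_name

or, if you recreate the database:

    stardog-admin db create -n db_name -o query.plan.reuse=never -- data.ttl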

If it doesn't help, please share another dump (you can now delete the previous one).

Thanks,
Pavel

Hi Pavel,

Thank you, setting query.plan.reuse=never is working.

Cheers!
