Clickhouse JDBC Support For Virtual Graphs?

Clickhouse has a JDBC driver, so I configured Clickhouse as a Generic SQL data source:

  • clickhouse-jdbc-0.2.4-jar-with-dependencies.jar in the external jar directory
  • ru.yandex.clickhouse.ClickHouseDriver as JDBC Driver Class
  • jdbc:clickhouse://:8123/ as URL
  • toDateTime registered as SQL Function Names
  • enabled debug log4j logging for virtual graphs

The connection works, and mappings were made for the tables in the database.
But SPARQL queries fail because the generated SQL uses FETCH instead of LIMIT.

How do I configure the SQL query planner to use LIMIT?
com.complexible.stardog.plan.eval.operator.OperatorException: Unable to execute virtual graph query. SQL string:

    SELECT "t0"."ip"
    FROM (SELECT "ip", "hits"
    FROM "ghost_stats"."ghost_load_v0"
    WHERE "pdate" = '2020-07-15' AND "timestamp" = toDateTime('2020-07-15 00:00:00')) AS "t0"
    INNER JOIN (SELECT "ip", "hits"
    FROM "ghost_stats"."ghost_load_v0"
    WHERE "pdate" = '2020-07-15' AND "timestamp" = toDateTime('2020-07-15 00:00:00')) AS "t2" ON "t0"."ip" = "t2"."ip"
    FETCH NEXT 1000 ROWS ONLY

I'll start out by saying that Clickhouse isn't officially supported by Stardog. Your luck in getting an unofficially supported database working is going to depend on how far it strays from a supported one. The default SQL dialect is Oracle, which is where the FETCH is coming from. You may have better luck choosing one of the other dialects that is closer to the Clickhouse implementation.

The virtual graph property is sql.dialect; supported values are listed in the docs. That might get you close enough to play around with it, but any serious use will probably require an officially supported dialect.
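To illustrate, the dialect override goes in the virtual graph's properties; a minimal sketch, with HIVE shown only as an example value:

```properties
# Override the SQL dialect used when generating queries.
# Pick whichever supported dialect is closest to your database.
sql.dialect=HIVE
```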

You may have better luck with Druid if that's an option https://druid.apache.org

Zachary, thanks for the hint.
I got a simple query to work with Clickhouse by adding this property under "Other Options":
sql.dialect => HIVE
Hive uses LIMIT and not FETCH.

But the next problem is that filtering does not work: the generated SQL puts the filter predicate in the ON clause rather than the WHERE clause, and Clickhouse wants the predicate in the WHERE clause:

FROM (SELECT `ip`, `hits`
FROM `ghost_stats`.`ghost_load_v0`
WHERE `pdate` = '2020-07-15' AND `timestamp` = toDateTime('2020-07-15 00:00:00')) `t0`
INNER JOIN (SELECT `ip`, `hits`
FROM `ghost_stats`.`ghost_load_v0`
WHERE `pdate` = '2020-07-15' AND `timestamp` = toDateTime('2020-07-15 00:00:00')) `t2` ON `t0`.`ip` = `t2`.`ip` AND 400 < `t0`.`hits`
INNER JOIN (SELECT `ip`, `hits`
FROM `ghost_stats`.`ghost_load_v0`
WHERE `pdate` = '2020-07-15' AND `timestamp` = toDateTime('2020-07-15 00:00:00')) `t4` ON `t0`.`ip` = `t4`.`ip` AND `t2`.`ip` = `t4`.`ip`
LIMIT 1000
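For reference, a form Clickhouse accepts would move that predicate into the subquery's WHERE clause. A hand-edited sketch of the same join (not Stardog output, and with the third self-join omitted for brevity):

```sql
SELECT `t0`.`ip`
FROM (SELECT `ip`, `hits`
      FROM `ghost_stats`.`ghost_load_v0`
      WHERE `pdate` = '2020-07-15'
        AND `timestamp` = toDateTime('2020-07-15 00:00:00')
        AND 400 < `hits`) `t0`
INNER JOIN (SELECT `ip`, `hits`
      FROM `ghost_stats`.`ghost_load_v0`
      WHERE `pdate` = '2020-07-15'
        AND `timestamp` = toDateTime('2020-07-15 00:00:00')) `t2`
  ON `t0`.`ip` = `t2`.`ip`
LIMIT 1000
```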

Clickhouse has a documented set of ANSI SQL compatibility limitations, which raises two questions:

  1. Are there generic SQL features that can be declared as unsupported, so that the planner will not use them?
  2. Also, why did the planner create all those joins for this simple SPARQL query:
    SELECT ?ghost ?label ?hits {
      GRAPH <virtual://clickhouse> {
        ?ghost a          loc:Ghost ;
               rdfs:label ?label ;
               load:hits  ?hits
        FILTER(?hits > 400)
      }
    }

I would have expected the SQL Query:

SELECT `ip`, `hits`
FROM `ghost_stats`.`ghost_load_v0`
WHERE `pdate` = '2020-07-15' AND 
`timestamp` = toDateTime('2020-07-15 00:00:00') AND
`hits` > 400
LIMIT 1000

The joins are introduced because several ?ghost bindings might match multiple bindings for ?label and ?hits and we need to generate the combination of those bindings. Do you have a unique key in this table? I don't have experience with Clickhouse, but we do ask the JDBC driver for this information. If you have a unique key, you can define it in your virtual graph using the unique.key.sets property.
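For illustration, unique.key.sets takes one parenthesized column list per key set; the table and column names below are just the ones from this thread:

```properties
# Declare composite unique keys manually when the JDBC driver
# does not report them automatically.
unique.key.sets=(ghost_load_v0.pdate, ghost_load_v0.ip, ghost_load_v0.timestamp)
```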

I have given up on Clickhouse and Hive for now, but will need to get them working if Stardog is ever to be used in my production environment.

For the proof of concept, I loaded a slice of data into Postgres. The auto-generated mappings conflicted across Clickhouse and Postgres, so I just removed all the virtual graphs and started over.

The Postgres table defined primary keys, which gave the ?subject (per-row node) the id:

BIND(template("http://api.stardog.com/ghost_load_v0/timestamp={timestamp};pdate={pdate};ip={ip}") AS ?subject)

Instead of a ?subject with all of the table's columns in the row id, as seen in the Clickhouse auto-generated mappings.

The resulting query was the simple form I expected:

SELECT "ip", "hits"
FROM "public"."ghost_load_v0"
WHERE "timestamp" = 1595030400 AND 400 < "hits"
GROUP BY "ip", "hits"
FETCH NEXT 10 ROWS ONLY

Can I assume that this simplification will happen when I set the unique.key.sets property in Other Options, once I resume the Clickhouse integration?

The set of unique keys is what we use to prune joins. With Postgres, we can obtain them from the database schema. If we can't get them automatically from Clickhouse, you can provide them manually in the form of the unique.key.sets property.

Regarding the ON vs WHERE filter, this would be a bit more work. What exact error are you getting back from Clickhouse? Which specific item in the list of limitations were you mentioning?

Setting the unique.key.sets property with the indexes for the tables allowed the auto-generated mappings to create a ?subject template with only the index columns. The Clickhouse auto-generated mapping looks to be the same as for the Postgres version of the table. This is good!

With the sql.dialect property set to HIVE or MYSQL and the Virtual Graph Type set to "Generic SQL", the generated SQL uses LIMIT and not FETCH. This is good!

But now the generated SQL matches strings against empty strings (''). This is bad.

The SPARQL query:

    PREFIX : <http://api.stardog.com/>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX stardog: <tag:stardog:api:>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX gconf: <http://api.stardog.com/ghost_config_v0#>
    SELECT ?ip ?city{
    GRAPH <virtual://clickhouse_test> {
    ?subject rdf:type :ghost_config_v0;
    gconf:ip ?ip ;
    gconf:pdate "2020-07-18";
    gconf:city ?city.
    FILTER(?city = "LONDON" )
    }
    }
    LIMIT 10

Gives this plan (notice the empty strings):

    VirtualGraphSql<virtual://clickhouse_test> [#2] {
    +─ RelNode=
    +─    LogicalSort(fetch=[10])
    +─      LogicalFilter(condition=[AND(=($0, ''), =($8, ''))])
    +─        JdbcTableScan(table=[[ghost_stats, ghost_config_v0]])
    +─ Vars=
    +─    ?ip <- COLUMN($1)^^xsd:string
    `─    ?city <- CONSTANT("LONDON"^^xsd:string)

And this SQL query (notice the empty strings):

    SELECT *
    FROM `ghost_stats`.`ghost_config_v0`
    WHERE `pdate` = '' AND `city` = ''
    LIMIT 10

This SPARQL query gives the expected results when using the Postgres virtual graph.

Can you share the template for the ?subject variable from the mapping? Or just the entire mapping...

Clickhouse
BIND(template("http://api.stardog.com/ghost_config_v0/pdate={pdate};ip={ip}") AS ?subject)

Postgres

BIND(template("http://api.stardog.com/ghost_config_v0/pdate={pdate};ip={ip}") AS ?subject)

And the source SQL query is just select * from table without WHERE?

yes
SELECT *
FROM ghost_stats.ghost_config_v0
clickhouse_conf.sparql (4.6 KB)

Is this using sql.dialect=HIVE? Can you share the query plan after setting sql.dialect=MYSQL, and also after commenting it out? I know the query will generate FETCH instead of LIMIT, but it will help narrow down why the pdate/city fields are being constrained erroneously.

Can you also run the virtual source_metadata command with the arguments vg.properties ghost_stats ghost_load_v0? This will print some metadata which might be influencing how we generate the specific queries here.

sql.dialect set to HIVE or MYSQL gives the same results.
After removing the option, the 'Save' fails to load.

I configured the "clickhouse_test" database and the "clickhouse_test" virtual graph using the Stardog Studio. I did a:
find /opt/stardog -name "*click*"
and found the database file
/opt/stardog/data/data/db_meta/clickhouse_test.properties
But not the virtual graph properties file, where should I look for that file?
stardog-7.3.2/bin/stardog-admin virtual source_metadata ./data/data/db_meta/clickhouse_test.properties ghost_stats ghost_load_v0
yields:
The url cannot be null

Ah, I see. If you run stardog-admin virtual options clickhouse_test > clickhouse_test.properties, it will create the properties file you need to run virtual source_metadata.

Sorry, I can't get stardog-admin to run with the right environment to pick up the jar files:

stardog-admin virtual source_metadata --verbose clickhouse_test_vg.properties ghost_stats ghost_load_v0
Loading driver ru.yandex.clickhouse.ClickHouseDriver
ru.yandex.clickhouse.ClickHouseDriver
The detailed stack trace for the error is:
java.lang.ClassNotFoundException: ru.yandex.clickhouse.ClickHouseDriver
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:259)
at com.complexible.stardog.virtual.cli.VirtualGraphSourceMetadata.openConnection(VirtualGraphSourceMetadata.java:63)
at com.complexible.stardog.virtual.cli.VirtualGraphSourceMetadata.call(VirtualGraphSourceMetadata.java:147)
at com.complexible.stardog.virtual.cli.VirtualGraphSourceMetadata.call(VirtualGraphSourceMetadata.java:29)
at com.complexible.stardog.cli.CLIBase.execute(CLIBase.java:56)
at com.complexible.stardog.cli.admin.CLI.main(CLI.java:200)

more clickhouse_test_vg.properties
#Configuration options for the virtual graph clickhouse_test
#Tue Jul 28 15:26:40 EDT 2020
sql.dialect=HIVE
query.translation=DEFAULT
sql.functions=toDateTime
namespaces=\=http\://api.stardog.com/\u0002stardog\=tag\:stardog\:api\:\u0002owl\=http\://www.w3.org/2002/07/owl\#\u0002rdf\=http\://www.w3.org/1999/02/22-rdf-syntax-ns\#\u0002xsd\=http\://www.w3.org/2001/XMLSchema\#\u0002fn\=http\://www.w3.org/2005/xpath-functions\#\u0002rdfs\=http\://www.w3.org/2000/01/rdf-schema\#
percent.encode=true
default.mapping.include.tables=ghost_config_v0, ghost_load_v0
mappings.syntax=sms2
parser.sql.quoting=NATIVE
jdbc.url=jdbc\:clickhouse\://opal.akamai.com\:8123/ghost_stats
optimize.import=true
unique.key.sets=(ghost_config_v0.pdate, ghost_config_v0.ip), (ghost_load_v0.pdate, ghost_load_v0.ip, ghost_load_v0.timestamp)
jdbc.driver=ru.yandex.clickhouse.ClickHouseDriver

echo $STARDOG_EXT
/opt/stardog/ext

ls $STARDOG_EXT/*click*
/opt/stardog/ext/clickhouse-jdbc-0.2.4-jar-with-dependencies.jar

Can you copy that jar file to the client/cli directory under your Stardog binary installation?

Alternatively, you can try adding this in log4j2.xml:

<Logger name="com.complexible.stardog.virtual.vega.calcite.VegaJdbcSchema"
      level="DEBUG" additivity="false">
   <AppenderRef ref="stardogAppender"/>
</Logger>

and then send us the stardog.log output