Clickhouse JDBC Support For Virtual Graphs?

Clickhouse has a JDBC driver, so I configured Clickhouse as a Generic SQL data source:

  • clickhouse-jdbc-0.2.4-jar-with-dependencies.jar in the external jar directory
  • ru.yandex.clickhouse.ClickHouseDriver as JDBC Driver Class
  • jdbc:clickhouse://:8123/ as URL
  • toDateTime registered as SQL Function Names
  • enabled debug log4j logging for virtual graphs

The connection works, and mappings were made for the tables in the database.
But SPARQL queries fail because the generated SQL uses FETCH instead of LIMIT.

How do I configure the SQL query planner to use LIMIT?
com.complexible.stardog.plan.eval.operator.OperatorException: Unable to execute virtual graph query. SQL string:

    SELECT "t0"."ip"
    FROM (SELECT "ip", "hits"
    FROM "ghost_stats"."ghost_load_v0"
    WHERE "pdate" = '2020-07-15' AND "timestamp" = toDateTime('2020-07-15 00:00:00')) AS "t0"
    INNER JOIN (SELECT "ip", "hits"
    FROM "ghost_stats"."ghost_load_v0"
    WHERE "pdate" = '2020-07-15' AND "timestamp" = toDateTime('2020-07-15 00:00:00')) AS "t2" ON "t0"."ip" = "t2"."ip"
    FETCH NEXT 1000 ROWS ONLY

I'll start out by saying that Clickhouse isn't officially supported by Stardog. Your luck in getting an unofficially supported database working is going to depend on how far it strays from a supported one. The default SQL dialect is Oracle, which is where the FETCH is coming from. You may have better luck choosing one of the other dialects that is closer to the Clickhouse implementation.

The virtual graph property is sql.dialect; supported values are listed in the docs. That might get you close enough to play around with it, but any serious use will probably require an officially supported dialect.
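To illustrate, the dialect override goes in the virtual graph's properties; a minimal sketch, with HIVE shown only as an example value:

```properties
# Override the SQL dialect used when generating queries.
# Pick whichever supported dialect is closest to your database.
sql.dialect=HIVE
```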

You may have better luck with Druid if that's an option https://druid.apache.org

Zachary, thanks for the hint.
I got a simple query to work with Clickhouse by adding this property under "Other Options":
sql.dialect => HIVE
Hive uses LIMIT and not FETCH.

But the next problem is that filtering does not work: the generated SQL puts the filter predicate in the ON clause rather than the WHERE clause, and Clickhouse wants the predicate in the WHERE clause:

FROM (SELECT `ip`, `hits`
FROM `ghost_stats`.`ghost_load_v0`
WHERE `pdate` = '2020-07-15' AND `timestamp` = toDateTime('2020-07-15 00:00:00')) `t0`
INNER JOIN (SELECT `ip`, `hits`
FROM `ghost_stats`.`ghost_load_v0`
WHERE `pdate` = '2020-07-15' AND `timestamp` = toDateTime('2020-07-15 00:00:00')) `t2` ON `t0`.`ip` = `t2`.`ip` AND 400 < `t0`.`hits`
INNER JOIN (SELECT `ip`, `hits`
FROM `ghost_stats`.`ghost_load_v0`
WHERE `pdate` = '2020-07-15' AND `timestamp` = toDateTime('2020-07-15 00:00:00')) `t4` ON `t0`.`ip` = `t4`.`ip` AND `t2`.`ip` = `t4`.`ip`
LIMIT 1000
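For reference, a form Clickhouse accepts would move that predicate into the subquery's WHERE clause. A hand-edited sketch of the same join (not Stardog output, and with the third self-join omitted for brevity):

```sql
SELECT `t0`.`ip`
FROM (SELECT `ip`, `hits`
      FROM `ghost_stats`.`ghost_load_v0`
      WHERE `pdate` = '2020-07-15'
        AND `timestamp` = toDateTime('2020-07-15 00:00:00')
        AND 400 < `hits`) `t0`
INNER JOIN (SELECT `ip`, `hits`
      FROM `ghost_stats`.`ghost_load_v0`
      WHERE `pdate` = '2020-07-15'
        AND `timestamp` = toDateTime('2020-07-15 00:00:00')) `t2`
  ON `t0`.`ip` = `t2`.`ip`
LIMIT 1000
```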

Clickhouse has a documented set of ANSI SQL compatibility limitations, which raises two questions:

  1. Are there generic SQL features that can be declared as unsupported, so that the planner will not use them?
  2. Also, why did the planner create all those joins for this simple SPARQL query:
    SELECT ?ghost ?label ?hits {
      GRAPH <virtual://clickhouse> {
        ?ghost a          loc:Ghost ;
               rdfs:label ?label ;
               load:hits  ?hits
        FILTER(?hits > 400)
      }
    }

I would have expected the SQL Query:

SELECT `ip`, `hits`
FROM `ghost_stats`.`ghost_load_v0`
WHERE `pdate` = '2020-07-15' AND 
`timestamp` = toDateTime('2020-07-15 00:00:00') AND
`hits` > 400
LIMIT 1000

The joins are introduced because several ?ghost bindings might match multiple bindings for ?label and ?hits and we need to generate the combination of those bindings. Do you have a unique key in this table? I don't have experience with Clickhouse, but we do ask the JDBC driver for this information. If you have a unique key, you can define it in your virtual graph using the unique.key.sets property.
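For illustration, unique.key.sets takes one parenthesized column list per key set; the table and column names below are just the ones from this thread:

```properties
# Declare composite unique keys manually when the JDBC driver
# does not report them automatically.
unique.key.sets=(ghost_load_v0.pdate, ghost_load_v0.ip, ghost_load_v0.timestamp)
```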

I have given up on Clickhouse and Hive for now, but will need to get them working if Stardog is ever to be used in my production environment.

For the proof of concept, I loaded a slice of data into Postgres. The auto-generated mappings conflicted across Clickhouse and Postgres, so I just removed all the virtual graphs and started over.

The Postgres table defined primary keys, which gave the ?subject (per-row node) the id:

BIND(template("http://api.stardog.com/ghost_load_v0/timestamp={timestamp};pdate={pdate};ip={ip}") AS ?subject)

Instead of a ?subject with all of the table's columns in the row id, as seen in the Clickhouse auto-generated mappings.

The resulting query was the simple form I expected:

SELECT "ip", "hits"
FROM "public"."ghost_load_v0"
WHERE "timestamp" = 1595030400 AND 400 < "hits"
GROUP BY "ip", "hits"
FETCH NEXT 10 ROWS ONLY

Can I assume that this simplification will happen when I set the unique.key.sets property in Other Options, once I resume the Clickhouse integration?

The set of unique keys is what we use to prune joins. With Postgres, we can obtain them from the database schema. If we can't get them automatically from Clickhouse, you can provide them manually in the form of the unique.key.sets property.

Regarding the ON vs WHERE filter, this would be a bit more work. What exact error are you getting back from Clickhouse? Which specific item in the list of limitations were you mentioning?

Setting the unique.key.sets property with the indexes for the tables allowed the auto-generated mappings to create a ?subject template with only the index columns. The Clickhouse auto-generated mapping looks to be the same as for the Postgres version of the table. This is good!

With the sql.dialect property set to HIVE or MYSQL and the Virtual Graph Type set to "Generic SQL", the generated SQL uses LIMIT and not FETCH. This is good!

But now the generated SQL matches strings against empty strings (''). This is bad.

The SPARQL query:

    PREFIX : <http://api.stardog.com/>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX stardog: <tag:stardog:api:>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX gconf: <http://api.stardog.com/ghost_config_v0#>
    SELECT ?ip ?city{
    GRAPH <virtual://clickhouse_test> {
    ?subject rdf:type :ghost_config_v0;
    gconf:ip ?ip ;
    gconf:pdate "2020-07-18";
    gconf:city ?city.
    FILTER(?city = "LONDON" )
    }
    }
    LIMIT 10

Gives this plan (notice the empty strings):

    VirtualGraphSql<virtual://clickhouse_test> [#2] {
    +─ RelNode=
    +─    LogicalSort(fetch=[10])
    +─      LogicalFilter(condition=[AND(=($0, ''), =($8, ''))])
    +─        JdbcTableScan(table=[[ghost_stats, ghost_config_v0]])
    +─ Vars=
    +─    ?ip <- COLUMN($1)^^xsd:string
    `─    ?city <- CONSTANT("LONDON"^^xsd:string)

And this SQL query (notice the empty strings):

    SELECT *
    FROM `ghost_stats`.`ghost_config_v0`
    WHERE `pdate` = '' AND `city` = ''
    LIMIT 10

This SPARQL query gives the expected results when using the Postgres virtual graph.

Can you share the template for the ?subject variable from the mapping? Or just the entire mapping...

Clickhouse
BIND(template("http://api.stardog.com/ghost_config_v0/pdate={pdate};ip={ip}") AS ?subject)

Postgres

BIND(template("http://api.stardog.com/ghost_config_v0/pdate={pdate};ip={ip}") AS ?subject)

And the source SQL query is just select * from table without WHERE?

yes
SELECT *
FROM ghost_stats.ghost_config_v0
clickhouse_conf.sparql (4.6 KB)

Is this using sql.dialect=HIVE? Can you share the query plan after setting sql.dialect=MYSQL, and also after commenting it out? I know the query will generate FETCH instead of LIMIT, but it will help narrow down why the pdate/city fields are being constrained erroneously.

Can you also run the virtual source_metadata command with the arguments vg.properties ghost_stats ghost_load_v0? This will print some metadata which might be influencing how we generate the specific queries here.

sql.dialect set to HIVE or MYSQL gives the same results.
After removing the option, the 'Save' fails to load.

I configured the "clickhouse_test" database and the "clickhouse_test" virtual graph using the Stardog Studio. I did a:
find /opt/stardog -name "*click*"
and found the database file
/opt/stardog/data/data/db_meta/clickhouse_test.properties
But not the virtual graph properties file, where should I look for that file?
stardog-7.3.2/bin/stardog-admin virtual source_metadata ./data/data/db_meta/clickhouse_test.properties ghost_stats ghost_load_v0
yields:
The url cannot be null

Ah, I see. If you run stardog-admin virtual options clickhouse_test > clickhouse_test.properties, it will create the properties file you need to run virtual source_metadata.

Sorry, I can't get stardog-admin to run with the right environment to pick up the jar files:

stardog-admin virtual source_metadata --verbose clickhouse_test_vg.properties ghost_stats ghost_load_v0
Loading driver ru.yandex.clickhouse.ClickHouseDriver
ru.yandex.clickhouse.ClickHouseDriver
The detailed stack trace for the error is:
java.lang.ClassNotFoundException: ru.yandex.clickhouse.ClickHouseDriver
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:259)
at com.complexible.stardog.virtual.cli.VirtualGraphSourceMetadata.openConnection(VirtualGraphSourceMetadata.java:63)
at com.complexible.stardog.virtual.cli.VirtualGraphSourceMetadata.call(VirtualGraphSourceMetadata.java:147)
at com.complexible.stardog.virtual.cli.VirtualGraphSourceMetadata.call(VirtualGraphSourceMetadata.java:29)
at com.complexible.stardog.cli.CLIBase.execute(CLIBase.java:56)
at com.complexible.stardog.cli.admin.CLI.main(CLI.java:200)

more clickhouse_test_vg.properties
#Configuration options for the virtual graph clickhouse_test
#Tue Jul 28 15:26:40 EDT 2020
sql.dialect=HIVE
query.translation=DEFAULT
sql.functions=toDateTime
namespaces=\=http\://api.stardog.com/\u0002stardog\=tag\:stardog\:api\:\u0002owl\=http\://www.w3.org/2002/07/owl\#\u0002rdf\=http\://www.w3.org/1999/02/22-rdf-syntax-ns\#\u0002xsd\=http\://www.w3.org/2001/XMLSchema\#\u0002fn\=http\://www.w3.org/2005/xpath-functions\#\u0002rdfs\=http\://www.w3.org/2000/01/rdf-schema\#
percent.encode=true
default.mapping.include.tables=ghost_config_v0, ghost_load_v0
mappings.syntax=sms2
parser.sql.quoting=NATIVE
jdbc.url=jdbc\:clickhouse\://opal.akamai.com\:8123/ghost_stats
optimize.import=true
unique.key.sets=(ghost_config_v0.pdate, ghost_config_v0.ip), (ghost_load_v0.pdate, ghost_load_v0.ip, ghost_load_v0.timestamp)
jdbc.driver=ru.yandex.clickhouse.ClickHouseDriver

echo $STARDOG_EXT
/opt/stardog/ext

ls $STARDOG_EXT/*click*
/opt/stardog/ext/clickhouse-jdbc-0.2.4-jar-with-dependencies.jar

Can you copy that jar file to the client/cli directory under your Stardog binary installation?

Alternatively, you can try adding this in log4j2.xml:

<Logger name="com.complexible.stardog.virtual.vega.calcite.VegaJdbcSchema"
      level="DEBUG" additivity="false">
   <AppenderRef ref="stardogAppender"/>
</Logger>

and then send us the stardog.log output