I would like to make a virtual graph to Hive 1.1. The Stardog server accesses Hive through a secure HAProxy to a Hive JDBC port on the remote Hadoop cluster:
Stardog Studio -> Stardog server -> HAProxy -> Hive on the Hadoop cluster
To get the Stardog server to access Hive, putting the recommended hive-jdbc-2.3.2.jar into an external directory did not work. Instead I copied all the Hive jars from the Hadoop server to the Stardog external directory.
The Hive JDBC jar might be:
hive-jdbc-1.1.0-cdh5.9.0.jar
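For reference, the virtual graph connection options might look like the following properties file (a sketch; the host, port, database, and credentials here are placeholders, not my real setup):

# Hive JDBC connection, pointing at the HAProxy front end
jdbc.url=jdbc:hive2://haproxy.example.com:10000/default
jdbc.driver=org.apache.hive.jdbc.HiveDriver
jdbc.username=hive_user
jdbc.password=hive_password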
The virtual graph mapping definition loaded:
PREFIX : <http://test.akamai.com/location/entities#>
PREFIX i: <http://test.akamai.com/location/entities#>
PREFIX loc: <http://test.akamai.com/location#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX stardog: <tag:stardog:api:>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# Simple mapping to make Metro instances from a hive table with only 725 rows
MAPPING :TestMetro
FROM SQL {
  SELECT * FROM metros
}
TO {
  ?metro a loc:Metro .
  ?country a loc:Country .
  ?continent a loc:Continent .
  ?metro loc:LocatedIn ?country .
  ?country loc:LocatedIn ?continent .
}
WHERE {
  BIND(template("http://test.akamai.com/location/entities#Metro_{metro}") as ?metro)
  BIND(template("http://test.akamai.com/location/entities#Country_{country}") as ?country)
  BIND(template("http://test.akamai.com/location/entities#Continent_{continent}") as ?continent)
}
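Assuming the mapping is saved as metros.sms and the connection options as metros.properties (the file names here are mine), the virtual graph was registered along these lines:

# register the virtual graph with the Stardog server
stardog-admin virtual add metros.properties metros.sms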
But a more complicated query does NOT work, because the Stardog-generated SQL query is not valid:
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 5:11 Table not found 'pp'
Here is the /opt/stardog/data/stardog.log output: Stardog_hive.log (9.8 KB)
Thanks for the query plan and the log. I'm assuming you've made locatedIn a transitive property and executed this query with reasoning? The reason for the error is that Stardog uses recursive CTEs in SQL to answer queries over transitive properties. Hive doesn't support recursive CTEs and thus fails on this query.
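To illustrate (a sketch, not the exact SQL Stardog generates; the table name raw_located_in is made up), computing the transitive closure of a located-in relation in standard SQL requires a recursive CTE along these lines, which Hive cannot compile:

WITH RECURSIVE closure (child, ancestor) AS (
  -- base case: the direct located-in edges
  SELECT child, parent FROM raw_located_in
  UNION ALL
  -- recursive step: extend each path by one more edge
  SELECT c.child, r.parent
  FROM closure c
  JOIN raw_located_in r ON c.ancestor = r.child
)
SELECT child, ancestor FROM closure;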
Is there a workaround, like using generic SQL instead of Hive?
Is there any documentation on the restrictions of virtual graphs?
For example, how do you support transitive properties for a non-SQL database, or for limited/read-only SQL-like systems such as ClickHouse or Presto?
I would like to show the power of the has_a relationship, which relies on the transitive property.
How about a workaround where the transitive inference is done inside Stardog and not in Hive?
Is there a way to define a "dumb" binding with indirection?
For example, the Hive query would just return the "rawLocatedIn" relations without the transitive property.
Then something would transform the returned raw relations into the real LocatedIn relationships, which have the transitive property, as sketched below.
Is there a pattern for moving inference into Stardog?
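For what it's worth, the indirection I have in mind would look something like this in the ontology, using the prefixes from the mapping above (a sketch in Turtle; I don't know whether Stardog supports this over virtual graphs):

# map only the raw edges from Hive; let reasoning derive the closure
loc:rawLocatedIn rdfs:subPropertyOf loc:LocatedIn .
loc:LocatedIn a owl:TransitiveProperty .

The virtual graph would then emit only loc:rawLocatedIn triples, and the transitive loc:LocatedIn facts would be inferred inside Stardog.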
We are planning to allow transparently querying transitive predicates over databases that don't support recursive SQL, but we don't have a concrete timeline for implementing this feature. In the meantime, we recommend materializing the relevant subset of the virtual graph into Stardog, which does support this type of query.
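For example (a sketch; the database name myDb, the file names, and the Metro_Boston IRI are placeholders):

# materialize the mapped triples into a Stardog database
stardog-admin virtual import myDb metros.properties metros.sms
# the transitive query then runs locally with reasoning
stardog query --reasoning myDb "PREFIX i: <http://test.akamai.com/location/entities#> PREFIX loc: <http://test.akamai.com/location#> SELECT ?place WHERE { i:Metro_Boston loc:LocatedIn ?place }"

Once the data is local, the recursive evaluation happens inside Stardog rather than being pushed down to Hive.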
So the Stardog planner depends on SQL features that might not be supported by the SQL server. I am surprised, because working around constraints seems like a generic problem for a planner. The plans I have looked at so far have some really clever parts.
Is there a way to mark SQL servers with constraint properties like "supports_recursive_CTE", and then have Stardog fail with a warning when a constraint is not met?
That's certainly something we could (and should) be doing. We already handle things similarly in a lot of cases for upstream platform differences, including SQL syntax deviations (LIMIT/FETCH is a big one), the ability to join, which is not supported in the general case in MongoDB or Cassandra, etc. When we implement this for recursive SQL / transitive properties, it will extend to evaluating the transitive property inside Stardog, which requires additional round trips to the upstream data source.