I reviewed the "Trillion Edge Knowledge Graph" report (Trillion Triple KG Benchmark| Stardog) and it is impressive. From this report, I sense that queries should likely be sub-second regardless of size or complexity of data. Is that a correct assumption?
In general, is there a formula for or way to estimate how long queries should take to execute from virtualized data stores (in my case from Databricks, which is subsequently stored on Azure)?
Would the factors below effect on query time:
a) Number of virtualized data tables (e.g., 5, 10, 50, 100)?
b) Size of the virtualized data being accessed (e.g., KB, MB, GB)?
b) Executing query from Stardog Explorer, Stardog Studio, or Databricks (via pystardog
)
On the flip side, I have experienced ~50 second query times using Stardog Explorer and Studio, with virtualized data on Databricks. And I sense, it is something that I am doing. For example, the query below often takes ~48k ms to complete. This virtual graph has one table stored on Azure (customer
) with the rest stored on Databricks.
The query plan:
prefix : <tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:model:>
prefix xsd: <http://www.w3.org/2001/XMLSchema>
From virtual
Slice(offset=0, limit=1000) [#475]
`─ OrderBy(DESC(?price), offset=0, limit=1000) [#475]
`─ Distinct [#475]
`─ Projection(?full_name, ?city, ?price, ?product_name) [#475]
`─ Bind("Texas" AS ?state) sortedBy=?state [#475]
`─ HashJoin(?customer) [#475]
+─ Union [#860]
│ +─ Union [#360]
│ │ +─ Union [#270]
│ │ │ +─ Union [#180]
│ │ │ │ +─ VirtualGraphSql<virtual://TJC-C360-Demo-Update-Azure-DB__data__Databricks-C360-Demo-Database> [#90] {
│ │ │ │ │ +─ RelNode=
│ │ │ │ │ +─ LogicalProject(cid=[$1])
│ │ │ │ │ +─ LogicalFilter(condition=[IS NOT NULL($1)])
│ │ │ │ │ +─ JdbcTableScan(table=[[jamie_costabile, stardog, rewards]])
│ │ │ │ │ +─ Query=
│ │ │ │ │ +─ SELECT `cid`
│ │ │ │ │ +─ FROM `jamie_costabile`.`stardog`.`rewards`
│ │ │ │ │ +─ WHERE `cid` IS NOT NULL
│ │ │ │ │ +─ Vars=
│ │ │ │ │ +─ ?customer <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:stardog_demo.c360.Customers:{cid/0})
│ │ │ │ │ } [#90]
│ │ │ │ `─ VirtualGraphSql<virtual://TJC-C360-Demo-Update-Azure-DB__data__Databricks-C360-Demo-Database> [#90] {
│ │ │ │ +─ RelNode=
│ │ │ │ +─ LogicalProject(cid=[$1])
│ │ │ │ +─ LogicalFilter(condition=[IS NOT NULL($1)])
│ │ │ │ +─ JdbcTableScan(table=[[jamie_costabile, stardog, creditcards]])
│ │ │ │ +─ Query=
│ │ │ │ +─ SELECT `cid`
│ │ │ │ +─ FROM `jamie_costabile`.`stardog`.`creditcards`
│ │ │ │ +─ WHERE `cid` IS NOT NULL
│ │ │ │ +─ Vars=
│ │ │ │ +─ ?customer <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:stardog_demo.c360.Customers:{cid/0})
│ │ │ │ } [#90]
│ │ │ `─ VirtualGraphSql<virtual://TJC-C360-Demo-Update-Azure-DB__data__Databricks-C360-Demo-Database> [#90] {
│ │ │ +─ RelNode=
│ │ │ +─ LogicalProject(cid=[$0])
│ │ │ +─ LogicalFilter(condition=[IS NOT NULL($0)])
│ │ │ +─ JdbcTableScan(table=[[jamie_costabile, stardog, us_customers]])
│ │ │ +─ Query=
│ │ │ +─ SELECT `cid`
│ │ │ +─ FROM `jamie_costabile`.`stardog`.`us_customers`
│ │ │ +─ WHERE `cid` IS NOT NULL
│ │ │ +─ Vars=
│ │ │ +─ ?customer <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:stardog_demo.c360.Customers:{cid/0})
│ │ │ } [#90]
│ │ `─ VirtualGraphSql<virtual://TJC-C360-Demo-Update-Azure-DB__data__Databricks-C360-Demo-Database> [#90] {
│ │ +─ RelNode=
│ │ +─ LogicalProject(cid=[$6])
│ │ +─ LogicalFilter(condition=[IS NOT NULL($6)])
│ │ +─ JdbcTableScan(table=[[jamie_costabile, stardog, purchases]])
│ │ +─ Query=
│ │ +─ SELECT `cid`
│ │ +─ FROM `jamie_costabile`.`stardog`.`purchases`
│ │ +─ WHERE `cid` IS NOT NULL
│ │ +─ Vars=
│ │ +─ ?customer <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:stardog_demo.c360.Customers:{cid/0})
│ │ } [#90]
│ `─ VirtualGraphSql<virtual://TJC-C360-Demo-Update-Azure-DB__data__Azure-C360-Demo-Database> [#500] {
│ +─ RelNode=
│ +─ LogicalProject(cid=[$0])
│ +─ JdbcTableScan(table=[[stardog_demo, c360, Customers]])
│ +─ Query=
│ +─ SELECT [cid]
│ +─ FROM [stardog_demo].[c360].[Customers]
│ +─ Vars=
│ +─ ?customer <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:stardog_demo.c360.Customers:{cid/0})
│ } [#500]
`─ HashJoin(?address) [#280]
+─ Union [#630]
│ +─ Union [#540]
│ │ +─ VirtualGraphSql<virtual://TJC-C360-Demo-Update-Azure-DB__data__TJC-Databricks> [#90] {
│ │ │ +─ RelNode=
│ │ │ +─ LogicalProject(location=[$47])
│ │ │ +─ LogicalFilter(condition=[IS NOT NULL($47)])
│ │ │ +─ JdbcTableScan(table=[[hive_metastore, temp, t_hco_to_delete]])
│ │ │ +─ Query=
│ │ │ +─ SELECT `location`
│ │ │ +─ FROM `hive_metastore`.`temp`.`t_hco_to_delete`
│ │ │ +─ WHERE `location` IS NOT NULL
│ │ │ +─ Vars=
│ │ │ +─ ?address <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:jamie_costabile.stardog.us_locations:{location/0})
│ │ │ } [#90]
│ │ `─ VirtualGraphSql<virtual://TJC-C360-Demo-Update-Azure-DB__data__Azure-C360-Demo-Database> [#450] {
│ │ +─ RelNode=
│ │ +─ LogicalProject(location=[$6])
│ │ +─ LogicalFilter(condition=[IS NOT NULL($6)])
│ │ +─ JdbcTableScan(table=[[stardog_demo, c360, Customers]])
│ │ +─ Query=
│ │ +─ SELECT [location]
│ │ +─ FROM [stardog_demo].[c360].[Customers]
│ │ +─ WHERE [location] IS NOT NULL
│ │ +─ Vars=
│ │ +─ ?address <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:jamie_costabile.stardog.us_locations:{location/0})
│ │ } [#450]
│ `─ VirtualGraphSql<virtual://TJC-C360-Demo-Update-Azure-DB__data__Databricks-C360-Demo-Database> [#90] {
│ +─ RelNode=
│ +─ LogicalProject(id=[$0])
│ +─ LogicalFilter(condition=[IS NOT NULL($0)])
│ +─ JdbcTableScan(table=[[jamie_costabile, stardog, us_locations]])
│ +─ Query=
│ +─ SELECT `id`
│ +─ FROM `jamie_costabile`.`stardog`.`us_locations`
│ +─ WHERE `id` IS NOT NULL
│ +─ Vars=
│ +─ ?address <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:jamie_costabile.stardog.us_locations:{id/0})
│ } [#90]
`─ HashJoin(?customer) [#180]
+─ VirtualGraphSql<virtual://TJC-C360-Demo-Update-Azure-DB__data__Azure-C360-Demo-Database> [#405] {
│ +─ RelNode=
│ +─ LogicalProject(cid=[$0], location=[$6], full_name=[$7])
│ +─ LogicalFilter(condition=[AND(IS NOT NULL($6), IS NOT NULL($7))])
│ +─ JdbcTableScan(table=[[stardog_demo, c360, Customers]])
│ +─ Query=
│ +─ SELECT [cid], [location], [full_name]
│ +─ FROM [stardog_demo].[c360].[Customers]
│ +─ WHERE [location] IS NOT NULL AND [full_name] IS NOT NULL
│ +─ Vars=
│ +─ ?customer <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:stardog_demo.c360.Customers:{cid/0})
│ +─ ?address <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:jamie_costabile.stardog.us_locations:{location/1})
│ +─ ?full_name <- COLUMN($2)^^xsd:string
│ } [#405]
`─ HashJoin(?product) [#180]
+─ VirtualGraphSql<virtual://TJC-C360-Demo-Update-Azure-DB__data__Databricks-C360-Demo-Database> [#222] {
│ +─ RelNode=
│ +─ LogicalProject(id=[$0], id1=[$18], product=[$19], cid1=[$25], phone=[$30], price2=[$47], id4=[$51], city=[$57])
│ +─ LogicalJoin(condition=[=($25, $73)], joinType=[inner])
│ +─ LogicalJoin(condition=[=($25, $65)], joinType=[inner])
│ +─ LogicalJoin(condition=[AND(=($51, $58), =($62, 'Texas'))], joinType=[inner])
│ +─ LogicalJoin(condition=[AND(IS NOT NULL($51), IS NOT NULL($57))], joinType=[inner])
│ +─ LogicalJoin(condition=[AND(=($0, $42), >($47, '10000':VARCHAR(2048)))], joinType=[inner])
│ +─ LogicalJoin(condition=[AND(=($0, $33), =($25, $39))], joinType=[inner])
│ +─ LogicalJoin(condition=[IS NOT NULL($30)], joinType=[inner])
│ +─ LogicalJoin(condition=[AND(=($10, $18), =('highlighters':VARCHAR(2048), LOWER($19)), IS NOT NULL($19))], joinType=[inner])
│ +─ LogicalJoin(condition=[=($0, $9)], joinType=[inner])
│ +─ JdbcTableScan(table=[[jamie_costabile, stardog, purchases]])
│ +─ JdbcTableScan(table=[[jamie_costabile, stardog, purchases]])
│ +─ JdbcTableScan(table=[[jamie_costabile, stardog, products]])
│ +─ JdbcTableScan(table=[[jamie_costabile, stardog, us_customers]])
│ +─ JdbcTableScan(table=[[jamie_costabile, stardog, purchases]])
│ +─ JdbcTableScan(table=[[jamie_costabile, stardog, purchases]])
│ +─ JdbcTableScan(table=[[jamie_costabile, stardog, us_locations]])
│ +─ JdbcTableScan(table=[[jamie_costabile, stardog, us_locations]])
│ +─ JdbcTableScan(table=[[jamie_costabile, stardog, us_customers]])
│ +─ JdbcTableScan(table=[[jamie_costabile, stardog, us_customers]])
│ +─ Query=
│ +─ SELECT `purchases`.`id`, `products`.`id` `id1`, `products`.`product`, `us_customers`.`cid` `cid1`, `us_customers`.`phone`, `purchases2`.`price` `price2`, `us_locations`.`id` `id4`, `us_locations`.`city`
│ +─ FROM `jamie_costabile`.`stardog`.`purchases`
│ +─ INNER JOIN `jamie_costabile`.`stardog`.`purchases` `purchases0` ON `purchases`.`id` = `purchases0`.`id`
│ +─ INNER JOIN `jamie_costabile`.`stardog`.`products` ON `purchases0`.`pid` = `products`.`id` AND 'highlighters' = LOWER(`products`.`product`) AND `products`.`product` IS NOT NULL
│ +─ INNER JOIN `jamie_costabile`.`stardog`.`us_customers` ON `us_customers`.`phone` IS NOT NULL
│ +─ INNER JOIN `jamie_costabile`.`stardog`.`purchases` `purchases1` ON `purchases`.`id` = `purchases1`.`id` AND `us_customers`.`cid` = `purchases1`.`cid`
│ +─ INNER JOIN `jamie_costabile`.`stardog`.`purchases` `purchases2` ON `purchases`.`id` = `purchases2`.`id` AND `purchases2`.`price` > '10000'
│ +─ INNER JOIN `jamie_costabile`.`stardog`.`us_locations` ON `us_locations`.`id` IS NOT NULL AND `us_locations`.`city` IS NOT NULL
│ +─ INNER JOIN `jamie_costabile`.`stardog`.`us_locations` `us_locations0` ON `us_locations`.`id` = `us_locations0`.`id` AND `us_locations0`.`state` = 'Texas'
│ +─ INNER JOIN `jamie_costabile`.`stardog`.`us_customers` `us_customers0` ON `us_customers`.`cid` = `us_customers0`.`cid`
│ +─ INNER JOIN `jamie_costabile`.`stardog`.`us_customers` `us_customers1` ON `us_customers`.`cid` = `us_customers1`.`cid`
│ +─ Vars=
│ +─ ?customer <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:stardog_demo.c360.Customers:{cid/3})
│ +─ ?address <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:jamie_costabile.stardog.us_locations:{id/6})
│ +─ ?city <- COLUMN($7)^^xsd:string
│ +─ ?contactInfo <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:ContactInfo:{cid/3})
│ +─ ?phone <- COLUMN($4)^^xsd:string
│ +─ ?order <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:Orders:{id/0})
│ +─ ?price <- COLUMN($5)^^xsd:double
│ +─ ?product <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:jamie_costabile.stardog.products:{id/1})
│ +─ ?product_name <- COLUMN($2)^^xsd:string
│ } [#222]
`─ Union [#180]
+─ VirtualGraphSql<virtual://TJC-C360-Demo-Update-Azure-DB__data__Databricks-C360-Demo-Database> [#90] {
│ +─ RelNode=
│ +─ LogicalProject(id=[$0])
│ +─ LogicalFilter(condition=[IS NOT NULL($0)])
│ +─ JdbcTableScan(table=[[jamie_costabile, stardog, products]])
│ +─ Query=
│ +─ SELECT `id`
│ +─ FROM `jamie_costabile`.`stardog`.`products`
│ +─ WHERE `id` IS NOT NULL
│ +─ Vars=
│ +─ ?product <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:jamie_costabile.stardog.products:{id/0})
│ } [#90]
`─ VirtualGraphSql<virtual://TJC-C360-Demo-Update-Azure-DB__data__Databricks-C360-Demo-Database> [#90] {
+─ RelNode=
+─ LogicalProject(pid=[$1])
+─ LogicalFilter(condition=[IS NOT NULL($1)])
+─ JdbcTableScan(table=[[jamie_costabile, stardog, purchases]])
+─ Query=
+─ SELECT `pid`
+─ FROM `jamie_costabile`.`stardog`.`purchases`
+─ WHERE `pid` IS NOT NULL
+─ Vars=
+─ ?product <- TEMPLATE(tag:stardog:designer:TJC-C360-Demo-Update-Azure-DB:data:jamie_costabile.stardog.products:{pid/0})
} [#90]
Aside from what surfaces inspecting the query plan above, what are the most likely reasons for queries to or from Databricks requiring > 10 seconds to complete?