Correlated Subqueries, ranking, Stardog Arrays

mbarbieri77 · July 29, 2020, 12:04pm

Hi,
I have been playing with an e-commerce sample database I migrated from SQL Server to Stardog.
I have been translating SQL queries to SPARQL and have a few of them working already. However, I'm finding it difficult to do things like calculating running totals, moving averages, row numbers, and rankings.
Before I explain the two subqueries below, here is a snippet of the model:
product -> hasCategory -> category
order -> hasCustomer -> customer
The first query is working fine. However, I need to find an alternative for the second one, where I need to return the two most recent orders for each customer in the outer query. In SQL, one solution would be to use a correlated subquery (passing customerID to the inner query for each record of the outer query).

I have noticed a few related issues with SPARQL, logged below:

github.com/w3c/sparql-12

JOIN LATERAL or Correlated Subquery

opened 11:13AM - 17 Jul 19 UTC

VladimirAlexiev

## Why?

GraphQL is a hot topic amongst developers and tool vendors. There are several implementations of GraphQL over RDF (HyperGraphQL, Comunica, TopQuadrant, StarDog), and we at Onto are also working on an implementation.

GraphQL queries are hierarchical and quite regular. Following the logic of GraphQL resolvers, you do the parent-level query then turn to the child-level. Assuming a simple companies graph structure and some reasonable `order` and `limit` syntax (GraphQL "input objects"), a query "Give me the top 2 BG cities, and the top 2 companies in each" could be expressed like this in GraphQL:

```graphql
{
  country(id:"...bulgaria") {
    city (order: {population:DESC}, limit: 2) {
      id name population
      company (order: {revenue:DESC}, limit: 2) {
        id name revenue
      }
    }
  }
}
```

If you try to implement this with a SPARQL subquery, you'll run into what I call the "distributed limit" problem. `limit` in the company subquery will apply globally, so even if you use a limit of 2*2 or even 50k, the first city (Sofia) will gobble up all companies, leaving none for the other city.

We at Onto believe that to implement this efficiently, you need the subquery to run **in a loop** for every row of the parent query.

## Previous work

This is a common problem in databases. Eg see [StackOverflow: Grouped LIMIT in PostgreSQL: show the first N rows for each group?](https://stackoverflow.com/questions/1124603/grouped-limit-in-postgresql-show-the-first-n-rows-for-each-group), which gives the following solutions:

1. `<child-order> OVER (PARTITION BY <parent-id> ORDER BY <parent-order>)` aka using Windowing functions
2. `<parent-query> JOIN LATERAL (<child-query>)`, see PostgreSQL ([FROM](https://www.postgresql.org/docs/devel/sql-select.html#SQL-FROM), [Lateral](https://www.postgresql.org/docs/current/static/queries-table-expressions.html#QUERIES-LATERAL), [SELECT](http://www.postgresql.org/docs/current/static/sql-select.html), [heap.io blog](https://heap.io/blog/engineering/postgresqls-powerful-new-join-type-lateral) Dec 2014), SQL Server ([cross apply](http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/))
3. `WITH` Common Table Expression
4. `JOIN (COUNT...GROUP BY) WHERE <=5`

A Correlated Subquery is like `LEFT JOIN LATERAL ... ON true`, see [StackOverflow: Difference between Lateral and Subquery](https://stackoverflow.com/questions/28550679/what-is-the-difference-between-lateral-and-a-subquery-in-postgresql):

- Correlated subquery has limitations: it cannot return multiple columns and multiple rows
- LATERAL is made for that: it's assumed automatically for cross-joins `FROM x1,x2` where `x2` is a table function (eg `unnest`)

Somewhat related SPARQL issues were posted before: #47 (windowing), #9 (partitioning). However, LATERAL is usually faster than Windowing functions (depending on data and indexing).

Two of the leading GraphQL implementations for RDBMS use LATERAL: [Hasura](https://hasura.io/) and [Join Monster](https://join-monster.readthedocs.io/)

## Proposed solution

A key question is how to return the results, to ensure that child rows don't mess up the limit on parent rows.

- We've tried structuring with CONSTRUCT but it doesn't support ordering and we don't want to mess with `rdf:List`
- StarDog's [Extending the Solution](https://www.stardog.com/blog/extending-the-solution/) with ARRAY may be relevant, but will require too large extension of SPARQL
- So we're thinking of using SELECT with a UNION-style distribution of parent and child values

(Edited Sep 2020 to interleave the Company rows with the City rows, which makes reconstructing the nested objects easier and enables potential streaming. Previously I had all city rows, then all company rows)

| country | city | city_name | population | company | company_name | revenue | COMMENT |
| --- | --- | --- | --- | --- | --- | --- | --- |
| geo:732800/ | geo:727011/ | Sofia | 1152556 | | | | |
| | geo:727011/ | | | co:123 | Sofia Foo Co | 987 | Sofia company 1 |
| | geo:727011/ | | | co:456 | Sofia Bar Co | 123 | Sofia company 2 |
| geo:732800/ | geo:728193/ | Plovdiv | 340494 | | | | |
| | geo:728193/ | | | co:789 | Plovdiv Foo Co | 987 | Plovdiv company 1 |
| | geo:728193/ | | | co:012 | Plovdiv Bar Co | 123 | Plovdiv company 2 |

Assume this construct (other syntax suggestions are welcome!):

```sparql
{foo} UNION LATERAL(?var) {bar}
```

- `?var` must be bound by `foo` (and exported in its `select` if it's a subquery)
- `?var` must also be exported by `bar`
- `bar` is iterated for every binding of `?var`
- the results of `bar` are appended to the results of `foo`

Then we could use it to implement the query in question:

```sparql
select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  {select ?country ?city ?city_name ?population {
    bind(<http://sws.geonames.org/732770/> as ?country)
    ?country x:city ?city.
    ?city x:name ?city_name.
    ?city x:population ?population.
  } order by desc(?population) limit 2}
  UNION LATERAL(?city)
  {select ?city ?company ?company_name ?revenue {
    ?city x:company ?company.
    ?company x:name ?company_name.
    ?company x:revenue ?revenue
  } order by desc(?revenue) limit 2}
}
```

(It's more likely to have inverse links `?city x:country ?country` and `?company x:city ?city` but for simplicity we use straight links)

## Considerations for backward compatibility

None?

github.com/w3c/sparql-12

Support window functions

opened 09:05PM - 03 Apr 19 UTC

kasei

query function

SPARQL should add support for window functions. This would increase expressivity and address some existing use cases such as "limit per resource".

# Why

Window functions would allow computing values that are unavailable in SPARQL 1.1 queries:

* row numbering and ranking
* quantiles
* moving averages
* running totals

These can be used to address use cases such as limiting the result set to a specific number of results for each resource ("limit per resource"). For example, consider a query to retrieve information about web posts:

```sparql
SELECT ?post ?title ?author ?date WHERE {
  ?post a sioc:Post ;
    dc:title ?title ;
    sioc:has_creator ?author
}
```

Given that a post can have any number of titles and authors, we might wish to restrict our query to only providing information about at most 3 authors for any individual post. This isn't easily done using standard SPARQL, but can be addressed using window functions.

# Previous work

* I've implemented window functions (with the strawman syntax shown below) in [Kineo](https://github.com/kasei/kineo/).
* Window functions in [SQLServer](https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql?view=sql-server-2017)
* Window functions in [SQLite](https://sqlite.org/windowfunctions.html)
* Window functions in [PostgreSQL](https://www.postgresql.org/docs/9.1/tutorial-window.html)

# Proposed solution

Using a `RANK` window function, we can filter the result set of the example query above with a `HAVING` clause:

```sparql
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
SELECT ?post ?title ?author ?date WHERE {
  ?post a sioc:Post ;
    dc:title ?title ;
    sioc:has_creator ?author
}
HAVING (RANK() OVER (PARTITION BY ?post ORDER BY ?author) <= 2)
```

This will take the result set from matching the basic graph pattern, and partition it into groups based on the value of `?post`. Within each partition, rows will be sorted by `?author`, and then assigned an increasing integer rank. Finally, these rows will be filtered to keep only those with a rank less than or equal to `2`. The final result set will be the concatenation of rows in each partition.

Beyond this use case, existing aggregates (e.g. `AVG` and `SUM`) can be used with windows to support things like moving averages and running totals.

# Considerations for backward compatibility

None.

github.com/w3c/sparql-12

GROUP_CONCAT sorting

opened 05:47AM - 03 Apr 19 UTC

kasei

query

There should be a way to sort group rows before aggregation, particularly for GROUP_CONCAT.

# Why

The use of `GROUP_CONCAT` is limited by the output not being deterministic with respect to the ordering of values in the group. For example, a `GROUP_CONCAT(?name)` aggregate might produce `"Alice Bob Eve"` in one implementation, and `"Bob Alice Eve"` in another (or even within the same implementation from different query evaluations). Allowing users to specify an implicit/explicit ordering for rows in an aggregation group would improve interoperability.

# Previous work

This was raised as [ISSUE-66](https://www.w3.org/2009/sparql/track/issues/66) and subsequently postponed during work on SPARQL 1.1. Jindřich Mynarz discusses this as a [potential SPARQL 1.2 feature](http://blog.mynarz.net/2017/06/what-i-would-like-to-see-in-sparql-12.html).

# Considerations for backward compatibility

At the user-level, this is purely additive as SPARQL 1.1 aggregate groups do not have any explicit ordering. However, it would require updates to the existing definition of [SPARQL Algebra Set Functions](https://www.w3.org/TR/sparql11-query/#setFunctions) which are defined in terms of multisets, not ordered sequences.

I see you guys have proposed Arrays:
Extending the Solution | Stardog

Query 1: Select all products that belong to the Seafood category

SELECT
  ?productName
  ?unitPrice
  ?unitsInStock
WHERE { # outer query
  ?product a :product ;
             :productName ?productName ;
             :unitPrice ?unitPrice ;
             :unitsInStock ?unitsInStock ;
             :hasCategory ?category . 
    { # inner query
      SELECT 
        ?category
      WHERE {
        ?category a :category ;
                    :categoryID ?categoryID ;
                    :name "Seafood" .
      }
    }
  }
  ORDER BY
    ?productName

Query 2 (doesn't work): Select the two most recent orders of each customer

SELECT DISTINCT 
  ?customerID 
  ?city
WHERE { # outer query
  ?customer a :customer ;
              :customerID ?customerID ;
              :city ?city ;
              ^:hasCustomer ?order .
  { # inner query
    SELECT
        ?order
    WHERE {
      ?order a :order ;
               :orderID ?orderID ;
               :orderDate ?orderDate ;
               :hasCustomer ?customer .
    }
      ORDER BY
        DESC(?orderDate)
      LIMIT 2
    }
  }
ORDER BY
  ?customerID 
  ?city 
  DESC(?orderDate)
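
A likely reason Query 2 misbehaves: in SPARQL 1.1 a subquery is evaluated on its own and only then joined with the outer pattern, so the LIMIT 2 picks the two most recent orders overall, not two per customer, and ?customer never reaches the outer query because the inner SELECT does not project it. A standard SPARQL 1.1 workaround, sketched below, is to count for each order how many orders of the same customer are at least as recent and keep the ones where that count is at most 2. The prefix IRI is a placeholder, and the sketch assumes the property names from the queries above.

```sparql
PREFIX : <http://example.com/ecommerce#>  # placeholder; use the database's own default namespace

SELECT ?customerID ?city ?orderID ?orderDate
WHERE {
  ?customer a :customer ;
            :customerID ?customerID ;
            :city ?city .
  ?order a :order ;
         :orderID ?orderID ;
         :orderDate ?orderDate ;
         :hasCustomer ?customer .
  # Every order of the same customer that is at least as recent as ?order
  # (this always includes ?order itself).
  ?newer a :order ;
         :orderDate ?newerDate ;
         :hasCustomer ?customer .
  FILTER (?newerDate >= ?orderDate)
}
GROUP BY ?customerID ?city ?order ?orderID ?orderDate
HAVING (COUNT(DISTINCT ?newer) <= 2)
ORDER BY ?customerID DESC(?orderDate)
```

Two caveats: the self-join is quadratic in the number of orders per customer, and orders sharing the same date share the same count, so a customer with ties at the cut-off can get fewer than two rows.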

zachary.whitley · July 29, 2020, 1:43pm

There are a bunch of questions here, so I'll just answer a couple of the easy ones, and then I or someone else can follow up on the query question.

The items you posted from github.com/w3c/sparql-12 aren't really issues; they are proposals for a SPARQL 1.2 recommendation.

Stardog has implemented arrays. See the Stardog documentation.

Do you have some sample data you can share for these queries?

mbarbieri77 · July 29, 2020, 5:30pm

Yes, please just load the data in the zip file into Stardog and you will be able to execute the queries.
dumpdataNTRIPLE7.nt.zip (217.6 KB)

I have this same database on SQL Server, so I'm just leaving here the things I can do in SQL that I'm finding difficult to implement in SPARQL.

-- Query: Select the 3 most recent orders of each customer
-- For each customer record, go and get the three most recent orders.
-- An INNER JOIN could have been used; however, CROSS APPLY is more efficient when combined with SELECT TOP.

SELECT 
    cst.CustomerID, 
    cst.City,
    cpp.OrderID, 
    cpp.OrderDate   
FROM 
    Customer AS cst
CROSS APPLY (
    SELECT TOP 3 
        ord.OrderID, ord.OrderDate, cst.CustomerID
    FROM 
        [Order] AS ord
    WHERE 
        ord.customerid = cst.customerid -- reference to the outer query (correlated subquery)
    ORDER BY 
        ord.OrderDate DESC
) AS cpp
ORDER BY 
    cst.CustomerID, 
    cst.City,
    cpp.OrderDate DESC

-- Windowed Functions

-- Calculating row numbering and ranking, quantiles, moving averages, and running totals.
-- Reference: OVER Clause (Transact-SQL) - SQL Server | Microsoft Learn

-- Query: Select the 3 most recent orders of each customer
-- This query replaces the previous one by using the more efficient Windowed Function.

SELECT 
    ptt.*
FROM
(
    SELECT
        cst.CustomerID,
        cst.City,
        ord.OrderID,
        ord.OrderDate,
        ROW_NUMBER() OVER(PARTITION BY cst.CustomerID ORDER BY ord.OrderDate DESC) AS [RowNumber]
    FROM
        Customer AS cst
        INNER JOIN [Order] AS ord
        ON cst.CustomerID = ord.CustomerID
) ptt
WHERE 
    ptt.[RowNumber] <= 3

-- Query: Top 3 most expensive products in each product category

SELECT 
    ptt.*
FROM
(
    SELECT
        ctg.CategoryName,
        prd.ProductName, 
        prd.UnitPrice,
        ROW_NUMBER() OVER(PARTITION BY ctg.CategoryID ORDER BY prd.UnitPrice DESC) AS [RowNumber]
    FROM 
        Product prd
        INNER JOIN Category ctg 
        ON prd.CategoryID = ctg.CategoryID  
) ptt
WHERE 
    ptt.[RowNumber] <= 3
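
For comparison, a rank column can be approximated in standard SPARQL 1.1 by counting, for each product, the products in the same category with a strictly higher price. This is only a sketch: the prefix IRI is a placeholder, the property names are the ones from Query 1, and the result behaves like RANK() rather than ROW_NUMBER(), because products with equal prices share a rank.

```sparql
PREFIX : <http://example.com/ecommerce#>  # placeholder namespace

SELECT ?categoryName ?productName ?unitPrice
       ((COUNT(DISTINCT ?pricier) + 1) AS ?rank)
WHERE {
  ?product a :product ;
           :productName ?productName ;
           :unitPrice ?unitPrice ;
           :hasCategory ?category .
  ?category :name ?categoryName .
  # Products in the same category with a strictly higher price.
  # OPTIONAL keeps the most expensive product, which has none.
  OPTIONAL {
    ?pricier :hasCategory ?category ;
             :unitPrice ?pricierPrice .
    FILTER (?pricierPrice > ?unitPrice)
  }
}
GROUP BY ?category ?categoryName ?product ?productName ?unitPrice
HAVING (COUNT(DISTINCT ?pricier) < 3)
ORDER BY ?categoryName ?rank
```

With ties at the boundary this can return more than three products per category, which is also what RANK() would do in SQL.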

-- Query: Order total quantity and percentage by product

SELECT 
    ord.OrderID, 
    ord.ProductID, 
    ord.Quantity,  
    SUM(ord.Quantity) OVER(PARTITION BY ord.OrderID) AS Total,  
    CAST(1. * ord.Quantity / SUM(ord.Quantity) OVER(PARTITION BY ord.OrderID) * 100 AS DECIMAL(5,2)) AS "PercByProduct"  
FROM 
    OrderDetail ord 
WHERE 
    ord.OrderID IN(10248,10249, 10250);

GO
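
The partitioned SUM in the last query does not need window functions in SPARQL: an aggregating subquery grouped by order can be joined back to the detail rows. A minimal sketch, assuming the :belongsToOrder, :hasProduct and :orderID properties used elsewhere in this thread, a hypothetical :quantity property on order details, numeric order IDs, and a placeholder prefix IRI:

```sparql
PREFIX : <http://example.com/ecommerce#>  # placeholder namespace

SELECT ?orderID ?product ?quantity ?total
       ((100 * ?quantity / ?total) AS ?percByProduct)
WHERE {
  ?orderDetail :belongsToOrder ?order ;
               :hasProduct ?product ;
               :quantity ?quantity .      # :quantity is an assumed property name
  ?order :orderID ?orderID .
  FILTER (?orderID IN (10248, 10249, 10250))
  {
    # One row per order with its total quantity; joined on ?order.
    SELECT ?order (SUM(?detailQuantity) AS ?total)
    WHERE {
      ?detail :belongsToOrder ?order ;
              :quantity ?detailQuantity .
    }
    GROUP BY ?order
  }
}
ORDER BY ?orderID ?product
```

This works because the subquery is uncorrelated: it computes totals for all orders once, and the join on ?order attaches the right total to each row. The percentage is not rounded to two decimals as in the SQL version.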

mbarbieri77 · August 3, 2020, 1:19pm

A solution could be to run the following query for each customer in a loop and union the result sets at the end.

However, there are no loop constructs in SPARQL either. The only way is to code the loop in the application.

SELECT *
WHERE {
  ?orderDetail :hasProduct ?product ; 
               :belongsToOrder ?order .
  ?order       :hasCustomer ?customer ;
               :orderDate ?orderDate .
  FILTER (?customer = :customer-ALFKI)
}
ORDER BY
  ?customer
  DESC(?orderDate)
LIMIT 3
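
If the application generates the query text anyway, the per-customer runs can also be sent as one request by emitting one limited subquery per customer and combining them with UNION. A sketch for two customers, with a placeholder prefix IRI; :customer-ANATR is a hypothetical second customer, and this version selects orders directly rather than order details:

```sparql
PREFIX : <http://example.com/ecommerce#>  # placeholder namespace

SELECT ?customer ?order ?orderDate
WHERE {
  {
    SELECT ?customer ?order ?orderDate
    WHERE {
      VALUES ?customer { :customer-ALFKI }
      ?order :hasCustomer ?customer ;
             :orderDate ?orderDate .
    }
    ORDER BY DESC(?orderDate)
    LIMIT 3
  }
  UNION
  {
    SELECT ?customer ?order ?orderDate
    WHERE {
      VALUES ?customer { :customer-ANATR }  # hypothetical customer IRI
      ?order :hasCustomer ?customer ;
             :orderDate ?orderDate .
    }
    ORDER BY DESC(?orderDate)
    LIMIT 3
  }
}
ORDER BY ?customer DESC(?orderDate)
```

Each branch has its own LIMIT, so no customer can use up another customer's rows. The query grows with the number of customers, so this is only practical for small sets.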

pavel · August 3, 2020, 2:30pm

Hi Marcelo,

Yes, you are correct about correlated subqueries; their absence in SPARQL causes this sort of issue. For now, your application would have to loop externally.

The Stardog feature which comes closest is the recent Stored Query Service (see the Stardog documentation). Right now it executes subqueries in the uncorrelated way, i.e. as normal subqueries in SPARQL, but it'd be relatively easy for us to support correlated evaluation, i.e. like LATERAL joins in Postgres.

We're monitoring sparql-12/100 and generally welcome it as a candidate for SPARQL 1.2.

Cheers,
Pavel

system · August 17, 2020, 2:30pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
