Limit and sort per level?

lapidus · September 6, 2021, 1:11pm

Perhaps a basic SPARQL question but ...

We have:
:BlogPost :author :Author
:Author :interest :Topic

How do we retrieve max 3 Blog posts with max 2 Authors each and max 1 Topic each? (Ideally with sorting possible at each level.)
How would we retrieve the above so that nesting it (into something like the below) would be facilitated:

items = [
  {
    id: "my-blog-post",
    type: "BlogPost",
    author: [
      {
        id: "john-stevenson",
        type: "Author",
        interest: [
          { id: "economy", type: "Topic" }
        ]
      },
      ... 1 more
    ]
  },
  ... 2 more
]

Thanks!

pavel · September 6, 2021, 1:28pm

As basic as it may sound, you cannot do it in pure SPARQL with a single query. A per-level LIMIT requires a subquery, but subqueries in SPARQL are uncorrelated, i.e. each one is executed once, whereas you need the (say, authors) subquery to execute once per blog post (so that you get 2 authors per post, not 2 authors in total).

In other words, you need correlated subqueries, which in Stardog means the Stored Query Service (SQS). Something like:

select * {
  { select * { ?post a :BlogPost } # some pattern to select posts, e.g. by date range, etc.
    limit 3 }
  service <query://authors> { [] sqs:input ?post ; sqs:vars ?author } # select max 2 authors per post
  service <query://topics> { [] sqs:input ?author ; sqs:vars ?topic } # select max 1 topic per author
}

I haven't tried it so syntax could be slightly off but hopefully you get the idea.
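
To make "once per blog post" concrete, here is a minimal Python sketch that emulates the correlated evaluation over in-memory data and builds the nested shape from the question. The data, the `top_n` helper, and the alphabetical sort order are all assumptions made for illustration; nothing here is Stardog API.

```python
# Hypothetical in-memory data standing in for the :author and :interest triples.
AUTHORS_BY_POST = {
    "post-a": ["ann", "bob", "cid"],
    "post-b": ["ann"],
    "post-c": ["dan", "eve"],
    "post-d": ["bob"],
}
INTERESTS_BY_AUTHOR = {
    "ann": ["economy", "art"],
    "bob": ["sports"],
    "cid": [],
    "dan": ["music", "economy"],
    "eve": ["art"],
}


def top_n(values, n):
    """Sort, then cut: plays the role of ORDER BY ... LIMIT n at one level."""
    return sorted(values)[:n]


def nested_items(max_posts=3, max_authors=2, max_topics=1):
    items = []
    for post in top_n(AUTHORS_BY_POST, max_posts):
        authors = []
        # The inner limits are re-applied for every parent, which is
        # exactly what an uncorrelated subquery cannot do.
        for author in top_n(AUTHORS_BY_POST[post], max_authors):
            topics = top_n(INTERESTS_BY_AUTHOR.get(author, []), max_topics)
            authors.append({
                "id": author,
                "type": "Author",
                "interest": [{"id": t, "type": "Topic"} for t in topics],
            })
        items.append({"id": post, "type": "BlogPost", "author": authors})
    return items


items = nested_items()
```

Each level sorts and truncates independently, so a post with many authors cannot use up the quota of another post.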

Cheers,
Pavel

lapidus · September 6, 2021, 1:41pm

Aha! Thank you for the clear explanation!

After some more digging I also found this:

Is the array() an option to SQS?

Was also mentioned here:

github.com/w3c/sparql-12

JOIN LATERAL or Correlated Subquery

opened 11:13AM - 17 Jul 19 UTC

VladimirAlexiev

## Why?

GraphQL is a hot topic amongst developers and tool vendors. There are several implementations of GraphQL over RDF (HyperGraphQL, Comunica, TopQuadrant, StarDog), and we at Onto are also working on an implementation.

GraphQL queries are hierarchical and quite regular. Following the logic of GraphQL resolvers, you do the parent-level query then turn to the child-level. Assuming a simple companies graph structure and some reasonable `order` and `limit` syntax (GraphQL "input objects"), a query "Give me the top 2 BG cities, and the top 2 companies in each" could be expressed like this in GraphQL:

```graphql
{
  country(id:"...bulgaria") {
    city (order: {population:DESC}, limit: 2) {
      id name population
      company (order: {revenue:DESC}, limit: 2) {
        id name revenue
      }
    }
  }
}
```

If you try to implement this with a SPARQL subquery, you'll run into what I call the "distributed limit" problem. `limit` in the company subquery will apply globally, so even if you use a limit of 2*2 or even 50k, the first city (Sofia) will gobble up all companies, leaving none for the other city.

We at Onto believe that to implement this efficiently, you need the subquery to run **in a loop** for every row of the parent query.

## Previous work

This is a common problem in databases. Eg see [StackOverflow: Grouped LIMIT in PostgreSQL: show the first N rows for each group?](https://stackoverflow.com/questions/1124603/grouped-limit-in-postgresql-show-the-first-n-rows-for-each-group), which gives the following solutions:

1. `<child-order> OVER (PARTITION BY <parent-id> ORDER BY <parent-order>)` aka using Windowing functions
2. `<parent-query> JOIN LATERAL (<child-query>)`, see PostgreSQL ([FROM](https://www.postgresql.org/docs/devel/sql-select.html#SQL-FROM), [Lateral](https://www.postgresql.org/docs/current/static/queries-table-expressions.html#QUERIES-LATERAL), [SELECT](http://www.postgresql.org/docs/current/static/sql-select.html), [heap.io blog](https://heap.io/blog/engineering/postgresqls-powerful-new-join-type-lateral) Dec 2014), SQL Server ([cross apply](http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/))
3. `WITH` Common Table Expression
4. `JOIN (COUNT...GROUP BY) WHERE <=5`

A Correlated Subquery is like `LEFT JOIN LATERAL ... ON true`, see [StackOverflow: Difference between Lateral and Subquery](https://stackoverflow.com/questions/28550679/what-is-the-difference-between-lateral-and-a-subquery-in-postgresql):

- Correlated subquery has limitations: it cannot return multiple columns and multiple rows
- LATERAL is made for that: it's assumed automatically for cross-joins `FROM x1,x2` where `x2` is a table function (eg `unnest`)

Somewhat related SPARQL issues were posted before: #47 (windowing), #9 (partitioning). However, LATERAL is usually faster than Windowing functions (depending on data and indexing). Two of the leading GraphQL implementations for RDBMS use LATERAL: [Hasura](https://hasura.io/) and [Join Monster](https://join-monster.readthedocs.io/)

## Proposed solution

A key question is how to return the results, to ensure that child rows don't mess up the limit on parent rows.

- We've tried structuring with CONSTRUCT but it doesn't support ordering and we don't want to mess with `rdf:List`
- StarDog's [Extending the Solution](https://www.stardog.com/blog/extending-the-solution/) with ARRAY may be relevant, but will require too large extension of SPARQL
- So we're thinking of using SELECT with a UNION-style distribution of parent and child values

(Edited Sep 2020 to interleave the Company rows with the City rows, which makes reconstructing the nested objects easier and enables potential streaming. Previously I had all city rows, then all company rows)

| country | city | city_name | population | company | company_name | revenue | COMMENT |
| --- | --- | --- | --- | --- | --- | --- | --- |
| geo:732800/ | geo:727011/ | Sofia | 1152556 | | | | |
| | geo:727011/ | | | co:123 | Sofia Foo Co | 987 | Sofia company 1 |
| | geo:727011/ | | | co:456 | Sofia Bar Co | 123 | Sofia company 2 |
| geo:732800/ | geo:728193/ | Plovdiv | 340494 | | | | |
| | geo:728193/ | | | co:789 | Plovdiv Foo Co | 987 | Plovdiv company 1 |
| | geo:728193/ | | | co:012 | Plovdiv Bar Co | 123 | Plovdiv company 2 |

Assume this construct (other syntax suggestions are welcome!):

```sparql
{foo} UNION LATERAL(?var) {bar}
```

- `?var` must be bound by `foo` (and exported in its `select` if it's a subquery)
- `?var` must also be exported by `bar`
- `bar` is iterated for every binding of `?var`
- the results of `bar` are appended to the results of `foo`

Then we could use it to implement the query in question:

```sparql
select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  {select ?country ?city ?city_name ?population {
      bind(<http://sws.geonames.org/732770/> as ?country)
      ?country x:city ?city.
      ?city x:name ?city_name.
      ?city x:population ?population.
    } order by desc(?population) limit 2}
  UNION LATERAL(?city)
  {select ?city ?company ?company_name ?revenue {
      ?city x:company ?company.
      ?company x:name ?company_name.
      ?company x:revenue ?revenue
    } order by desc(?revenue) limit 2}
}
```

(It's more likely to have inverse links `?city x:country ?country` and `?company x:city ?city` but for simplicity we use straight links)

## Considerations for backward compatibility

None?
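
The issue notes that interleaving city and company rows makes it easier to rebuild nested objects. A minimal Python sketch of that reconstruction follows; the row dictionaries and the `nest_interleaved` helper are assumptions for illustration, with keys named after the columns of the example result table in the issue.

```python
def nest_interleaved(rows):
    """Fold interleaved parent/child rows into nested city objects.

    A row carrying city_name is treated as a parent (city) row; a row
    carrying company is treated as a child row of the city it names.
    """
    cities = {}  # dicts keep insertion order, so the parent ordering survives
    for row in rows:
        city = cities.setdefault(row["city"], {"id": row["city"], "company": []})
        if row.get("city_name"):
            city["name"] = row["city_name"]
            city["population"] = row["population"]
        if row.get("company"):
            city["company"].append({
                "id": row["company"],
                "name": row["company_name"],
                "revenue": row["revenue"],
            })
    return list(cities.values())


rows = [
    {"city": "geo:727011/", "city_name": "Sofia", "population": 1152556},
    {"city": "geo:727011/", "company": "co:123", "company_name": "Sofia Foo Co", "revenue": 987},
    {"city": "geo:727011/", "company": "co:456", "company_name": "Sofia Bar Co", "revenue": 123},
    {"city": "geo:728193/", "city_name": "Plovdiv", "population": 340494},
    {"city": "geo:728193/", "company": "co:789", "company_name": "Plovdiv Foo Co", "revenue": 987},
    {"city": "geo:728193/", "company": "co:012", "company_name": "Plovdiv Bar Co", "revenue": 123},
]
cities = nest_interleaved(rows)
```

Because every child row repeats the `city` key, the fold works in a single pass and could run on a streamed result.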

pavel · September 6, 2021, 1:58pm

Yeah, it's an old post which we wrote at the time when we first realised that supporting path queries as subqueries (in, say, SELECT queries) was going to be difficult because their results do not quite fit into the SPARQL notion of a "solution" (a fixed-size set of variable bindings). That was before SQS. SQS works a bit differently than was envisioned in the post: the solution is still fixed-size but the actual values can be arrays. If you execute path (sub)queries and project ?path, you may see those arrays in their internal representation. There are some functions to handle them: str, stardog:length, stardog:any, and stardog:all.

Yes, the SPARQL 1.2 issue is exactly there to add support for correlated subqueries (actually "lateral" in the Postgres sense would be more accurate because they're not limited to a single column value). As you discovered, its use cases can be very basic. It'd be a pretty major extension though, so we decided not to wait and instead added support to SQS. Supporting correlated subqueries without a service (i.e. beyond SQS) would require extensions to the SPARQL syntax, which we decided to avoid for the time being.

lapidus · September 6, 2021, 2:56pm

Thank you.

Hmm, I wonder if I can make this generic enough to work for any combination of classes and properties without having to predefine Stored Queries?

It also needs to tap into the language fallback functionality:

So a complete, more generic example would be:

The class :Country has properties :name, :headOfState.
The class :Person has properties :name, :birthDate
We can determine a priori that :name is an rdf:langString and want to use that information to apply a fallback.
We want to get the top 5 Countries ranked by population and we want to get 3 Heads of State ranked by birthDate.

So basically:

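# Top-level query:
SELECT ?country ?population ?name {
  ?country a :Country ;
     :population ?population ;
     :name ?name .   # apply language fallback
}
ORDER BY DESC(?population)
LIMIT 5

# Correlated subquery (to run once per ?country):
SELECT ?person ?birthDate ?personName {
  ?country :headOfState ?person .
  ?person :birthDate ?birthDate ;
     :name ?personName .   # apply language fallback
}
ORDER BY ?birthDate
LIMIT 3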

Can this be achieved? I.e. how customizable is the SQS? Can it accept entire graph patterns?
(We have 1000s of variations on this theme so a generic solution is important )

Thanks!

lapidus · September 6, 2021, 6:40pm

Just one more thought ... if SQS is the only option ... is it possible to create those temporarily on the fly without much penalty?

Ex:

"Scan Query For Correlated Subqueries"
Create SQS
Run query
Clean up SQS if not needed

(It seems to take too many milliseconds using stardog-admin for this workaround approach to be valid in the wild ... But trying to think of generic solutions ...)

pavel · September 6, 2021, 8:03pm

SQS is currently the only way to use correlated subqueries in Stardog. We're definitely interested in supporting correlated execution for general subqueries but we aren't yet ready to offer a syntactic extension for that. I guess it's technically possible to provide a SERVICE similar to SQS but where the query would be represented explicitly in the body (rather than referenced by name), and evaluate that in the correlated way... but we don't support it yet.

SQS is very flexible since you can use arbitrary graph patterns in your stored queries.

I understand the inconvenience of having to store queries on the fly. However if you're concerned with latency, you should do it programmatically using one of our supported APIs (Java or directly over HTTP) since invoking stardog-admin requires a launch of a client JVM which is probably where the milliseconds are spent.

Best,
Pavel

lapidus · September 7, 2021, 6:21am

> We're definitely interested in supporting correlated execution for general subqueries but we aren't yet ready to offer a syntactic extension for that.

Thanks! Many +1 on this one

> I guess it's technically possible to provide a SERVICE similar to SQS but where the query would be represented explicitly in the body (rather than referenced by name)

Hehe, that was what I was thinking. Good to know.

> However if you're concerned with latency, you should do it programmatically using one of our supported APIs

Nice, yes, I tried and it seems to give me a decent 100-200ms performance to store a query. So we could perhaps split these queries into 2 steps:

Hash each type of subquery on the fly and upsert it by hashed ID to SQS
Run the query using said hashes
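
A minimal Python sketch of the hashing step, under stated assumptions: `stored_query_name`, the `q_` prefix, and the `StoredQueryCache` class are hypothetical names, and the `store` callable stands in for whatever HTTP or Java call actually uploads the query to SQS.

```python
import hashlib


def stored_query_name(query_text, prefix="q_"):
    """Derive a stable name from the query text."""
    # Collapse whitespace so formatting-only differences map to one name.
    # Caveat: this also ignores whitespace differences inside string literals.
    normalized = " ".join(query_text.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return prefix + digest[:16]


class StoredQueryCache:
    """Remembers which names were already stored, so each query is uploaded once."""

    def __init__(self, store):
        self._store = store  # hypothetical callable(name, query_text) doing the upload
        self._known = set()

    def ensure(self, query_text):
        name = stored_query_name(query_text)
        if name not in self._known:
            self._store(name, query_text)
            self._known.add(name)
        return name


# Usage with a fake uploader that only records the names it was given.
uploaded = []
cache = StoredQueryCache(lambda name, text: uploaded.append(name))
first = cache.ensure("select ?author { ?post :author ?author } limit 2")
second = cache.ensure("select ?author {   ?post :author ?author }\nlimit 2")
```

Since the name is derived from the text, a changed subquery gets a new name rather than overwriting an old one, so the scheme only ever needs to add stored queries, not update them.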

P.S. I was able to add a new stored query with POST, but PUT doesn't update the query body for me (it still returns 204 as if the update had worked). Could it be a bug?
https://stardog-union.github.io/http-docs/#operation/updateStoredQuery

Thanks!

lapidus · September 7, 2021, 7:47am

One more related question ...

Does this limitation also prevent us from getting the correct number of results at the top level?

For example, what if we want to get exactly 5 blog posts but also fetch all related authors in the same request?

SELECT * {
   
   ?blogPost a :BlogPost ;
       :author ?author .
}
LIMIT 5

How could we ensure that we always get at most 5 blog posts here? Would this use a "normal subquery"?

LorenzB · September 7, 2021, 8:29am

LIMIT only works on the number of bindings; it doesn't know the semantics you might be interested in. If there are multiple authors for a single blog post, the number of bindings for that post alone might already exceed the LIMIT. If a post has 5 authors, it will produce 5 bindings (i.e. rows).

In that case the way to go is to get the blog posts in a subquery first, then fetch the data for those blog posts:

SELECT * {
{SELECT * {?blogPost a :BlogPost } LIMIT 5} #get 5 blog posts here
?blogPost :author ?author .
}
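
The difference between the two queries can be simulated in a few lines of Python. This is only an illustration over hypothetical in-memory data (`AUTHORS_BY_POST` and `join_rows` are made-up names), where list slicing plays the role of LIMIT.

```python
# Hypothetical data: blog post -> its authors.
AUTHORS_BY_POST = {
    "p1": ["a1", "a2", "a3"],
    "p2": ["a4", "a5"],
    "p3": ["a6"],
    "p4": ["a7"],
    "p5": ["a8", "a9"],
    "p6": ["a10"],
}


def join_rows(posts):
    """One row (binding) per post/author pair, like the basic graph pattern."""
    return [(post, author) for post in posts for author in AUTHORS_BY_POST[post]]


# LIMIT 5 on the joined rows: 5 bindings, but not 5 blog posts.
flat = join_rows(AUTHORS_BY_POST)[:5]
posts_in_flat = {post for post, _ in flat}

# Subquery first: limit the posts to 5, then join the authors.
limited_posts = list(AUTHORS_BY_POST)[:5]
nested = join_rows(limited_posts)
posts_in_nested = {post for post, _ in nested}
```

With this data the flat version returns 5 rows covering only 2 posts, while the subquery version returns 9 rows covering 5 posts.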

lapidus · September 7, 2021, 10:10am

Thank you! That clarifies this part

system · September 21, 2021, 10:10am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
