I'm testing Voicebox via Knowledge kits. I'm wondering how you defined the number of example queries that a user must create to enhance Voicebox's accuracy. In the documentation, you state that the minimum is 100 queries, but that 500 queries per database are recommended for best accuracy.
Shouldn't the number of queries depend purely on the complexity of the data model? For a very simple data model, 500 queries seems like overkill; on the other hand, for a complex data model, 500 may be an underestimate.
If Voicebox uses the stored queries as few-shot examples in LLM prompts, how are the "correct" stored queries selected as few-shot examples? I assume not all queries are given as context to the LLM.
Finally, I'm wondering whether it's possible to automate the generation of the example queries, or whether the user needs to create each query manually.
You are absolutely right that the number of questions depends on the data model complexity as well as the kinds of questions relevant to the use case. The numbers we mention in the documentation are general guidance, not hard requirements.
We support auto-generating example queries. The number of queries we auto-generate is adjusted based on the data model size. See our docs for more details:
Once you publish your data model, example stored queries will be created automatically for you. You will be notified via email when these examples are ready. In order to generate example queries, Voicebox will inspect the instance data within the database. If you are publishing your data model in Designer with associated CSV files, no further action is needed. If you would like to import the instance data by other means, you can publish your data model again after the data is imported for the examples to be recreated.
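To give a rough sense of what inspecting the instance data can look like, here is a minimal Python sketch that counts instances per class over a Stardog SPARQL endpoint. The endpoint URL, database name, and credentials are placeholders, and this is only an illustration of the kind of inspection involved, not Voicebox's actual implementation.

```python
# Hypothetical sketch: inspecting instance data with a SPARQL query, roughly the
# kind of signal example-query generation could draw on. The endpoint, database
# name, and credentials are placeholders, not Voicebox internals.
import requests

STARDOG_QUERY_ENDPOINT = "http://localhost:5820/mydb/query"  # assumed local Stardog
AUTH = ("admin", "admin")  # placeholder credentials

# Count instances per class to see which parts of the data model are populated.
INSPECTION_QUERY = """
SELECT ?class (COUNT(?instance) AS ?count)
WHERE { ?instance a ?class }
GROUP BY ?class
ORDER BY DESC(?count)
"""

response = requests.post(
    STARDOG_QUERY_ENDPOINT,
    data={"query": INSPECTION_QUERY},
    headers={"Accept": "application/sparql-results+json"},
    auth=AUTH,
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    print(binding["class"]["value"], binding["count"]["value"])
```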
With respect to the selection of few-shot examples: when Voicebox example questions are added, they are automatically indexed with vector embeddings. Stardog's semantic search capability is then used to determine which examples are relevant to the user's question based on semantic similarity.
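As an illustration of that selection step, here is a minimal sketch of similarity-based few-shot selection using a local embedding model and cosine similarity. The model name, example questions, and queries are placeholders; Voicebox relies on Stardog's built-in vector indexing and semantic search rather than ad-hoc code like this.

```python
# Minimal sketch, assuming a sentence-transformers embedding model; the stored
# examples below are placeholders, not real Voicebox data.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Stored example questions paired with their queries (placeholders).
examples = [
    ("How many customers are in Germany?", "SELECT (COUNT(?c) AS ?n) WHERE { ... }"),
    ("List orders shipped last month", "SELECT ?order WHERE { ... }"),
    ("Which products are out of stock?", "SELECT ?product WHERE { ... }"),
]

# Embed the stored questions once; normalized vectors make dot product = cosine similarity.
example_vectors = model.encode([q for q, _ in examples], normalize_embeddings=True)

def select_few_shot(user_question: str, k: int = 2):
    """Return the k stored examples most similar to the user question."""
    query_vector = model.encode([user_question], normalize_embeddings=True)[0]
    scores = example_vectors @ query_vector          # cosine similarities
    top = np.argsort(scores)[::-1][:k]               # indices of the k best matches
    return [examples[i] for i in top]

# The selected pairs would then be placed into the LLM prompt as few-shot examples.
for question, query in select_few_shot("How many clients do we have in France?"):
    print(question, "->", query)
```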