The configuration is used both to adapt the speech recognition model and to train models that detect intents and entities for your specific application.
A Speechly configuration contains training data for machine learning models. It describes a number of example utterances that your users might be saying, and from which an intent and possibly a number of entities should be parsed. The example utterances are written in a Markdown-like syntax:
*search do you have [blue](color) [jackets](product)
The above example defines the user utterance “do you have blue jackets”, assigns it the intent search, and defines two entities named color and product, with the values blue and jackets, respectively. The intent and entities are returned to your application, and based on these your voice UI can carry out the action requested by the user. In this case the UI should update a search result view to show only blue jackets.
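To make the structure of an annotated example concrete, here is a minimal Python sketch that splits the line above into its intent, its plain spoken text, and its entities. It is an illustration only and not Speechly's parser: the function name `parse_example` and the shape of the returned dictionary are assumptions made for this example, and the sketch handles only the `*intent` prefix and `[value](name)` annotations shown here.

```python
import re

# Matches one entity annotation of the form [value](name).
ENTITY_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_example(line):
    """Split an annotated example into intent, plain text, and entities.

    Hypothetical helper for illustration; not part of any Speechly library.
    """
    intent_token, _, body = line.strip().partition(" ")
    if not intent_token.startswith("*") or len(intent_token) < 2:
        raise ValueError("example must start with *intent_name")
    entities = [
        {"name": match.group(2), "value": match.group(1)}
        for match in ENTITY_RE.finditer(body)
    ]
    # Replace each annotation with its bare value to recover the spoken text.
    text = ENTITY_RE.sub(r"\1", body)
    return {"intent": intent_token[1:], "text": text, "entities": entities}


parsed = parse_example("*search do you have [blue](color) [jackets](product)")
print(parsed["intent"])    # search
print(parsed["text"])      # do you have blue jackets
print(parsed["entities"])
```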
A configuration must contain at least a few example utterances for every functionality of your voice UI. In general, the more example utterances you can provide, the better.
However, this is not as tedious as you might think! Even simple Speechly configurations can be written as compact Templates that are then expanded into a large set of example utterances during model training. For example, the configuration
product = [t shirts | hoodies | jackets | jeans | slacks | shorts | sneakers | sandals]
color = [black | white | blue | red | green | yellow | purple | brown | gray]
*search do you have $color(color) $product(product)
declares two variables, product and color, and assigns to both a list of relevant values. The 3rd line defines a Template that generates 72 example utterances that each start with “do you have”, followed by a color entity and a product entity, with their values taken from the respective lists:
*search do you have [black](color) [t shirts](product)
*search do you have [white](color) [t shirts](product)
*search do you have [blue](color) [t shirts](product)
*search do you have [gray](color) [sandals](product)
All of these 72 example utterances are compactly defined just by the three lines of “code” above.
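The count of 72 follows from combining 9 colors with 8 products. The Python sketch below reproduces that expansion as a plain Cartesian product. It is only a model of the idea, not how Speechly expands Templates internally; the function name `expand` is hypothetical, and the iteration order was chosen simply to mirror the listing above.

```python
from itertools import product as cartesian_product

products = ["t shirts", "hoodies", "jackets", "jeans",
            "slacks", "shorts", "sneakers", "sandals"]
colors = ["black", "white", "blue", "red", "green",
          "yellow", "purple", "brown", "gray"]


def expand(products, colors):
    """Generate one annotated example per (product, color) combination."""
    # Products vary slowest and colors fastest, so the output starts with
    # every color of "t shirts" and ends with "gray sandals".
    return [
        f"*search do you have [{color}](color) [{item}](product)"
        for item, color in cartesian_product(products, colors)
    ]


utterances = expand(products, colors)
print(len(utterances))   # 9 colors x 8 products
print(utterances[0])
print(utterances[-1])
```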
It is useful to think of preparing the example utterances as the task of “programming” a data generator. You can learn more about how this is done from the Speechly Annotation Language Syntax Reference as well as Speechly Annotation Language Semantics.
Note! You can see the example utterances that are generated from the Templates using either the “show sample” button in the Speechly dashboard or the sample command in the command line tool.
The intent of an utterance indicates what the user wants in general. It is defined at the beginning of an example with the syntax *intent_name, i.e. the name of the intent prefixed by an asterisk. Every example utterance must have an intent assigned to it.
Intents capture the various functionalities of your voice UI. For example, a shopping application might use different intents for searching products, adding products to the cart, removing products from the cart, and going to the checkout.
Entities are “local snippets of information” in an utterance that describe details relevant to the user’s need. An entity has a name and a value. An utterance can contain several entities.
They are defined using the following syntax:
[entity value](entity name)
In the shopping example above, the entities are color and product that have the values blue and jackets, respectively. An entity can take different values, and your configuration should give a variety of examples of these.
Our spoken language understanding system extracts intents and entities from the user’s speech input, and returns these to your application. When using one of our Client Libraries, handling of intents and entities is done via our Client API. The same API provides your application with a raw transcript of the user’s speech.
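As a rough illustration of what an application does with the returned intent and entities, the Python sketch below routes a result to a handler by intent name. It does not use the Client API: the `result` dictionary, the `view` dictionary, and the functions `handle_search` and `handle_result` are all hypothetical stand-ins, so consult the Client Library documentation for the actual data structures.

```python
def handle_search(entities, view):
    """Turn the entities into search filters keyed by entity name."""
    updated = dict(view)
    updated["filters"] = {e["name"]: e["value"] for e in entities}
    return updated


def handle_result(result, view, handlers):
    """Route a recognized intent to the handler registered for it."""
    handler = handlers.get(result["intent"])
    if handler is None:
        return view  # Unknown intent: leave the UI state unchanged.
    return handler(result["entities"], view)


# One handler per intent; a real application would register more.
HANDLERS = {"search": handle_search}

# Hypothetical result for the utterance "do you have blue jackets".
result = {
    "intent": "search",
    "entities": [
        {"name": "color", "value": "blue"},
        {"name": "product", "value": "jackets"},
    ],
}

view = handle_result(result, {"filters": {}}, HANDLERS)
print(view)
```

Keeping one handler per intent mirrors the idea that each intent corresponds to one functionality of the voice UI.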
Since Speechly is a spoken language understanding system, it is important to use example utterances that reflect as precisely as possible how users talk. An example utterance is (probably) good if it sounds natural when spoken aloud.
Notice that how something is spoken can depend on the context. For example, the number 16500 could be either the price of a car or a (US) ZIP code. However, it is spoken quite differently depending on the context: “sixteen thousand five hundred” (price) vs “one six five zero zero” (ZIP code). A good configuration takes such contextual details into account.
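If you generate example utterances programmatically, the two readings of 16500 can be produced with two different verbalizers. The Python sketch below is one possible way to do this under simple assumptions: it covers only English, only amounts from 0 to 999999, and the function names `as_amount` and `as_digits` are made up for this example.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_thousand(n):
    """Spell out an integer in the range 0..999."""
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
        if n:
            words.append(ONES[n])
    elif n or not words:
        # Covers 1..19, and a bare "zero" when nothing precedes it.
        words.append(ONES[n])
    return " ".join(words)


def as_amount(n):
    """Spell out 0..999999 as a quantity, such as a price."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        parts.append(below_thousand(thousands) + " thousand")
    if rest or not parts:
        parts.append(below_thousand(rest))
    return " ".join(parts)


def as_digits(code):
    """Spell out a digit string one digit at a time, such as a ZIP code."""
    return " ".join(ONES[int(digit)] for digit in code)


print(as_amount(16500))    # sixteen thousand five hundred
print(as_digits("16500"))  # one six five zero zero
```

`as_digits` takes a string rather than an integer so that codes with leading zeros keep all of their digits.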
Last updated by Mathias Lindholm on February 16, 2022 at 11:37 +0200