
[New Sample] Request for a speech-focused sample #1981

Open
stevengum opened this issue Nov 15, 2019 · 6 comments

Labels: feature-request (A request for new functionality or an enhancement to an existing one.), needs-triage (The issue has just been created and it has not been reviewed by the team.)

stevengum (Member) commented:

Is your feature request related to a problem? Please describe.
We currently enable speech out of the box for the C# Echo Bot and Core Bot samples and generators. These are our "getting-started" samples and don't delve into the nuances of using speech with the protocol.

Now that Direct Line Speech is GA, we should have a speech-focused sample.

Describe the solution you'd like
A sample focused on speech (which may be through a headless device) should be created.

Features of the sample:

  • Discussion of setting InputHints
  • Dialog design that doesn't rely on suggested actions, lists, or cards
  • Examples of, and links to, using SSML instead of plain text for the Activity.Speak property (see the sketch after this list).
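
A rough sketch of the InputHints and SSML items above, using the Bot Builder .NET MessageFactory and InputHints helpers (the message wording and the surrounding turn handler are illustrative, not from an existing sample):

// Inside a bot's turn handler. One message sets all three speech-relevant
// pieces explicitly: Text (displayed), Speak (synthesized), and an InputHint
// telling the client what to do with the microphone after speaking.
var reply = MessageFactory.Text(
    text: "What time works for you?",
    ssml: "<speak version='1.0' xml:lang='en-US'>What time works for you?</speak>",
    inputHint: InputHints.ExpectingInput); // open the mic; we expect an answer

await turnContext.SendActivityAsync(reply, cancellationToken);

For comparison, InputHints.AcceptingInput signals the bot can take input but isn't actively waiting (the mic stays closed), and InputHints.IgnoringInput tells the client to keep the mic closed entirely, e.g. midway through a multi-message response.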

FYI @ryanlengel, @darrenj, @lauren-mills, @gabog, who have experience in designing headless device solutions: are there any "gotchas" that should be discussed in a speech-focused sample?

[enhancement]

gabog (Contributor) commented Nov 15, 2019

Hi @stevengum, here are some notes from my end:

  • Not setting InputHints properly can cause conversations to stop or clients like Cortana to crash. This is often omitted because devs test on the Emulator.
  • The speech track (the Speak property) doesn't need to match the Text property, and in some cases it will be very different depending on the channel. A channel with a screen could say "here are your appointments for today" while displaying them; a channel without a screen would probably read the appointments out loud.
  • Pluralization: in some cases we can get away with some formatting in text, but with speech we need to put extra thought into how we say "for one person" or "for two people" (see the helper sketch after this list).
  • Making dates more natural: in text we can present something like "for 11/15/2019 at 5 PM", which reads OK but is not very natural in speech. Sometimes it is better to have logic that parses dates and times and says something like "for tomorrow at 5 PM" or "next Monday at 5 in the afternoon", etc.
  • QnAMaker responses: some QnAMaker responses have a lot of text, which sounds horrible in headless bots because we don't support barge-in in many cases, and you need to wait until the bot finishes talking before you can ask something else (I would say that QnAMaker is not speech friendly in general).
  • Enumerations of suggested actions: in some cases we created logic to read suggested actions out loud in the form "You can say X, Y or Z" or "You can say X, Y and Z"; the default would be to read the list without "and" or "or", which sounds very weird.
  • Not sure if this changed in Web Chat lately, but Adaptive Cards have a Speak property that is not used by Web Chat and may be confusing to some devs.
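
A minimal C# sketch of the pluralization, natural-date, and enumeration points above; the helper names (ToSpokenDate, ForPeople, ToSpokenList) are hypothetical, not SDK APIs:

using System;
using System.Collections.Generic;
using System.Linq;

public static class SpeechText
{
    // Hypothetical helper: say a date the way a person would.
    public static string ToSpokenDate(DateTime date, DateTime now)
    {
        var days = (date.Date - now.Date).Days;
        if (days == 0) return "today";
        if (days == 1) return "tomorrow";
        if (days > 1 && days < 7) return $"on {date.DayOfWeek}"; // "on Monday"
        return $"on {date:MMMM d}";                              // "on November 15"
    }

    // Hypothetical helper: pluralize the party size for speech.
    public static string ForPeople(int count) =>
        count == 1 ? "for one person" : $"for {count} people";

    // Hypothetical helper: read choices as "X, Y or Z" instead of a bare list.
    public static string ToSpokenList(IReadOnlyList<string> choices) =>
        choices.Count == 1
            ? choices[0]
            : string.Join(", ", choices.Take(choices.Count - 1)) + " or " + choices[^1];
}

For example, ToSpokenList(new[] { "coffee", "tea", "water" }) yields "coffee, tea or water", which sounds far more natural read aloud than the raw list.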

On the understanding side:

  • Speech normalization may add an extra "." at the end of an utterance, which can confuse LUIS.
  • Soundex: sometimes the suggested actions may want you to choose the name of a person or a place that is not trained in LUIS (like a restaurant name). Speech will not always return the exact string you are looking for, so you will need to match the utterance against the suggested actions list using Soundex or some other fuzzy algorithm rather than straight string matching (see the sketch after this list).
  • Utterances can be long and vague, so LUIS is super important when using speech input.
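
A sketch of the normalization and Soundex points, assuming a simplified Soundex (the standard algorithm has an extra rule for 'h'/'w' adjacency that this omits):

using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class SpeechMatching
{
    // Strip the trailing "." speech normalization can append, before the
    // utterance is sent to LUIS or compared against choices.
    public static string CleanUtterance(string utterance) =>
        utterance.Trim().TrimEnd('.');

    // Simplified Soundex: first letter plus up to three digit codes.
    public static string Soundex(string word)
    {
        const string codes = "01230120022455012623010202"; // codes for 'a'..'z'
        var sb = new StringBuilder();
        var prev = '0';
        foreach (var ch in word.ToLowerInvariant().Where(char.IsLetter))
        {
            var code = codes[ch - 'a'];
            if (sb.Length == 0) sb.Append(char.ToUpperInvariant(ch));
            else if (code != '0' && code != prev) sb.Append(code);
            prev = code;
        }
        return sb.Length == 0 ? "0000" : sb.ToString().PadRight(4, '0').Substring(0, 4);
    }

    // Match what speech heard against suggested-action titles phonetically
    // instead of by exact string equality.
    public static string MatchChoice(string heard, IEnumerable<string> choices) =>
        choices.FirstOrDefault(c => Soundex(c) == Soundex(CleanUtterance(heard)));
}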

This is all I can think of so far.
Will update this post if I can think of anything else.

darrenj commented Nov 15, 2019

Gabo has most things covered.

  • Decorating any Speak property with SSML enables you to control the voice and even the tone of voice (see the SSML sketch after this list).
  • The sample will need to include the steps required to enable WebSockets (for Direct Line Speech) on the App Service. We do this automatically as part of the VA and the ARM template.
  • There is a test harness for speech; you can see this and some other instructions here.
  • I think the sample should use Language Generation (albeit in preview form), as this will allow us to show how to provide speech- and text-friendly responses, e.g.:
# NewUserIntroCard
[Activity
    Text = Some text
    Speak = Speech friendly response
]
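
To illustrate the SSML point above, a sketch of a Speak payload that selects a voice and adjusts prosody (the voice name "en-US-JennyNeural" is an assumption, not something called out in this thread; substitute a voice supported by your Speech resource):

// Sketch: SSML in the Speak property to pick a voice and adjust delivery.
var activity = MessageFactory.Text("Welcome! How can I help?");
activity.Speak =
    "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>" +
        "<voice name='en-US-JennyNeural'>" +
            "<prosody rate='-10%' pitch='+5%'>Welcome! How can I help?</prosody>" +
        "</voice>" +
    "</speak>";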

stevengum (Member, Author) commented:

  • The sample will need to include the steps required to enable WebSockets (for Direct Line Speech) on the App Service. We do this automatically as part of the VA and the ARM template.

The C# Core Bot and Echo Bot ARM templates were updated to support WebSockets, and Startup.cs was updated to include the necessary app.UseWebSockets(); call. So we should be set with regard to enabling WebSocket usage from the bot and on the App Service; we just need to mirror this work in the speech-first sample.
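
For reference, a sketch of roughly what that looks like in the C# samples' Startup.cs (details elided; treat this as illustrative rather than the exact sample code):

public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
{
    app.UseDefaultFiles()
       .UseStaticFiles()
       .UseWebSockets()   // required for the Direct Line Speech streaming connection
       .UseRouting()
       .UseAuthorization()
       .UseEndpoints(endpoints => endpoints.MapControllers());
}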

There is work to be done on the Resource Provider to enable creation of the DLS channel via ARM templates and the Azure CLI, which I believe @DDEfromOR is working on.

  • There is a test harness for speech; you can see this and some other instructions here.

We do need to update the Core and Echo bot READMEs to mention the test DLS client and the Speech SDKs.

  • The speech track (the Speak property) doesn't need to match the Text property, and in some cases it will be very different depending on the channel. A channel with a screen could say "here are your appointments for today" while displaying them; a channel without a screen would probably read the appointments out loud.

For DLS, the current behavior is that the Speak property needs to be set; the channel does not use the Activity's Text property for speech generation.
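
Given that behavior, a speech-first sample could guard against missing speech tracks with something like the following (a hypothetical middleware sketch, not existing SDK code):

using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Bot.Builder;
using Microsoft.Bot.Schema;

// Hypothetical middleware: fall back to Text when Speak was not set,
// since Direct Line Speech synthesizes only the Speak property.
public class EnsureSpeakMiddleware : IMiddleware
{
    public async Task OnTurnAsync(ITurnContext turnContext, NextDelegate next, CancellationToken cancellationToken = default)
    {
        turnContext.OnSendActivities(async (ctx, activities, nextSend) =>
        {
            foreach (var activity in activities.Where(a => a.Type == ActivityTypes.Message && string.IsNullOrEmpty(a.Speak)))
            {
                activity.Speak = activity.Text;
            }

            return await nextSend();
        });

        await next(cancellationToken);
    }
}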

  • Enumerations of suggested actions: in some cases we created logic to read suggested actions out loud in the form "You can say X, Y or Z" or "You can say X, Y and Z"; the default would be to read the list without "and" or "or", which sounds very weird.

For non-headless devices/UIs (headful? heady?) it is important to preserve any use of the GUI as applicable. However, if possible, I think building for one channel (DLS, Web Chat with Speech, or Cortana) and then generalizing is the better approach. We've seen this approach with MS Teams, which has a lot more channel-specific functionality.

  • Not sure if this changed in Web Chat lately, but Adaptive Cards have a Speak property that is not used by Web Chat and may be confusing to some devs.

@compulim?

ryanisgrig commented:
For reference, we have a tutorial on enabling DLS with the VA at https://microsoft.github.io/botframework-solutions/clients-and-channels/tutorials/enable-speech/1-intro/

Most of the steps cover turning on the resources the bot needs to work, but there is also some guidance on how to change the voice with SSML.

johnataylor (Member) commented:

We agreed to postpone major new samples until after we target .NET Core 3.1.

cleemullins (Contributor) commented:

@johnataylor What are we doing with this? Can Monica, Michael, Ashley, or Eric drive this one?

@johnataylor johnataylor added R10 and removed R9 labels Apr 17, 2020
@tracyboehrer tracyboehrer removed the R10 label Jun 22, 2020
@gabog gabog added feature-request A request for new functionality or an enhancement to an existing one. needs-triage The issue has just been created and it has not been reviewed by the team. and removed enhancement labels Sep 10, 2020