Understanding AI and Speech Recognition with Azure Cognitive Services

Introduction

In this tutorial, we’re going to create a voice-controlled game where you move a landing Mars rover. We’ll be using two different services from Microsoft’s Azure Cognitive Services: LUIS and Speech to Text.

Voice controlled game made with Unity

You can download the project files here.

Setting up LUIS

LUIS (Language Understanding Intelligent Service) is a Microsoft machine learning service that converts phrases and sentences into intents and entities. This allows us to easily talk to a computer and have it understand what we want. In our case, we’re going to be moving an object on the screen.

To set up LUIS, go to www.luis.ai and sign up.

LUIS homepage for Microsoft Cognitive Services

Once you sign up, it should take you to the My Apps page. Here, we want to create a new app.

My Apps page for Azure Cognitive Services

When your app is created, you will be taken to the Intents screen. Here, we want to create a new intent called Move. An intent is basically the context of a phrase; in our case, the intent is to move the object.

Intents options for the VoiceControlApp

Then go to the Entities screen and create two new entities: MoveDirection and MoveDistance (both of the Simple entity type). LUIS will look for these entities within a phrase.

Entities for Voice Control App

Now let’s go back to the Intents screen and select our Move intent. This will bring us to a screen where we can enter example phrases. We need to provide examples so that LUIS can learn about our intent and entities. The more, the better.

Make sure that you reference all the types of move directions at least once:

  • forwards
  • backwards
  • back
  • left
  • right

Move elements for LUIS project

Now for each phrase, select the direction and attach a MoveDirection entity to it. For the distance (numbers), attach a MoveDistance entity. The more phrases you have, and the more varied they are, the better the final results will be.

Move direction setup for voice controlled app

Once that’s done, click on the Train button to train the app. This shouldn’t take too long.

LUIS app setup with Train button highlighted

When complete, you can click on the Test button to test out the app. Try entering a phrase and look at the resulting intent and entities; verify that they’re what you expect.

Once that’s all good to go, click on the Publish button to publish the app – allowing us to use the API.

LUIS app setup with Text button selected

The info we need when using the API is found by clicking on the Manage button and going to the Keys and Endpoints tab. Here, we need to copy the Endpoint URL.

Authoring Key for voice control app in LUIS

Testing the API with Postman

Before we jump into Unity, let’s test out the API using Postman. For the URL, paste in the Endpoint up until the first question mark (?).

Postman page with GET key highlighted

Then for the parameters, we want to have the following (a complete example request is shown after this list):

  • verbose – if true, will return all intents instead of just the top scoring intent
  • timezoneOffset – the timezone offset for the location of the request in minutes
  • subscription-key – your authoring key (at the end of the Endpoint)
  • q – the question to ask
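
Put together, the full GET request might look something like this. This is just a sketch for reference: the region and app ID come from your own Endpoint URL, and the v2 path below is the format the LUIS portal provided at the time of writing.

    https://westus.api.cognitive.microsoft.com/luis/v2.0/apps/<your-app-id>?verbose=true&timezoneOffset=0&subscription-key=<your-key>&q=move left 5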

Postman with Params tab open

Then if we press Send, a JSON response with our results should be returned.

Postman with JSON response shown

Speech Services

LUIS converts a phrase into intents and entities, but we still need something to convert our voice to text. For this, we’re going to use Microsoft’s Speech Services, another part of Azure Cognitive Services.

What we need from the dashboard is the Endpoint location (in my case westus) and Key 1.

Endpoint and Key for Microsoft Speech-to-Text services

We then want to download the Speech SDK for Unity here. This is a .unitypackage file we can just drag and drop into the project.

Speech SDK installation instructions

Creating the Unity Project

Create a new Unity project or use the included project files (we’ll be using those). Import the Speech SDK package.

New Unity project with Speech SDK folder highlighted

For the SDK to work, we need to go to our Project Settings (Edit > Project Settings…) and set the Scripting Runtime Version to .NET 4.x Equivalent. This is because both the SDK and our own scripts use some newer C# features.

Project Settings in Unity with Player Configuration adjusted

LUIS Manager Script

Create a new C# script (right click Project > Create > C# Script) and call it LUISManager.

We’re going to need to access a few outside namespaces for this script.
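
As a sketch, these are the namespaces the script will need: System for the Action delegates, System.Collections for the coroutine, and UnityEngine.Networking for the web request.

    using System;
    using System.Collections;
    using UnityEngine;
    using UnityEngine.Networking;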

For our variables, we have the url and subscription key. These are used to connect to LUIS.

Our resultTarget will be the object we’re moving. It’s of type Mover, which we haven’t created yet, so just comment that line out for now.

We then have our events. onSendCommand is called when the command is ready to be sent. onStartRecordVoice is called when we start to record our voice and onEndRecordVoice is called when we stop recording our voice.

Finally, we have our instance – allowing us to easily access the script.
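
Putting those together, the top of the class might look like this (a minimal sketch; the exact field names are my assumption of the tutorial’s naming):

    public class LUISManager : MonoBehaviour
    {
        [Header("LUIS API")]
        public string url;              // endpoint URL up to the first '?'
        public string subscriptionKey;  // LUIS authoring key

        // the object we're going to move
        //public Mover resultTarget;    // commented out until the Mover script exists

        // events (plain delegates so other scripts can invoke them)
        public Action<string> onSendCommand;
        public Action onStartRecordVoice;
        public Action onEndRecordVoice;

        // singleton instance for easy access from other scripts
        public static LUISManager instance;

        void Awake () { instance = this; }
    }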

Let’s subscribe to the onSendCommand event – calling the OnSendCommand function.

The OnSendCommand function will simply start the CalculateCommand coroutine, which is the main aspect of this script.
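
A sketch of that wiring, assuming the names above:

    void OnEnable ()
    {
        // run OnSendCommand whenever a command is ready to be sent
        onSendCommand += OnSendCommand;
    }

    void OnDisable ()
    {
        onSendCommand -= OnSendCommand;
    }

    void OnSendCommand (string command)
    {
        // the coroutine does the actual work
        StartCoroutine(CalculateCommand(command));
    }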

Calculating the Command

In the LUISManager script, create a coroutine called CalculateCommand which takes in a string.

The first thing we do is check if the command is empty. If so, we return.

Then we create our web request, download handler, set the url and send the request.

Once we get the response, we need to convert it from JSON into our custom LUISResult class, then inform the mover object.
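
Here’s a sketch of the whole coroutine (the resultTarget call stays commented out until the Mover script exists):

    IEnumerator CalculateCommand (string command)
    {
        // don't send empty commands
        if (command.Length == 0)
            yield break;

        // create the web request and download handler, then set the URL
        UnityWebRequest webRequest = new UnityWebRequest();
        webRequest.downloadHandler = new DownloadHandlerBuffer();
        webRequest.url = url + "?verbose=true&timezoneOffset=0&subscription-key=" + subscriptionKey
            + "&q=" + UnityWebRequest.EscapeURL(command);

        // send the request and wait for the response
        yield return webRequest.SendWebRequest();

        // convert the JSON response to our custom LUISResult class...
        LUISResult result = JsonUtility.FromJson<LUISResult>(webRequest.downloadHandler.text);

        // ...and inform the mover object
        //resultTarget.ReadResult(result);
    }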

LUIS Result

The LUIS result class is basically a collection of three classes that build up the structure of the LUIS JSON response. Create three new scripts: LUISResult, LUISIntent and LUISEntity.

LUISResult:
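
A sketch, with fields matching the JSON keys we saw in the Postman response (JsonUtility needs the names to match exactly):

    [System.Serializable]
    public class LUISResult
    {
        public string query;                 // the phrase that was sent
        public LUISIntent topScoringIntent;  // the best matching intent
        public LUISEntity[] entities;        // the entities found in the phrase
    }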

LUISIntent:
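
Same idea for the top scoring intent:

    [System.Serializable]
    public class LUISIntent
    {
        public string intent;  // e.g. "Move"
        public float score;    // confidence between 0 and 1
    }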

LUISEntity:
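
And for the entities found in the phrase:

    [System.Serializable]
    public class LUISEntity
    {
        public string entity;  // the matched text, e.g. "left" or "3"
        public string type;    // e.g. "MoveDirection" or "MoveDistance"
        public int startIndex;
        public int endIndex;
        public float score;
    }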

Back in Unity, let’s create a new game object (right click Hierarchy > Create Empty) and call it LUISManager. Attach the LUISManager script to it and fill in the details.

  • Url – the url endpoint we entered into Postman (endpoint url up to the first question mark)
  • Subscription Key – your authoring key (same one we entered in Postman)

LUISManager object in the Unity Inspector

Recording our Voice

The next step in the project is to create the script that will listen to our voice and convert it to text. Create a new C# script called VoiceRecorder.

Like with the last one, we need to include the outside namespaces we’re going to access.
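
Along with UnityEngine, that means the Speech SDK namespace and the task system:

    using System.Threading.Tasks;
    using Microsoft.CognitiveServices.Speech;
    using UnityEngine;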

Our first variables are the subscription key and service region for the speech service.

Then we need to know if we’re currently recording, what the current command to send is, and whether that command is ready to be sent.

Finally, we have our completion task. This is part of the C# task system, which we’re going to use as an alternative to coroutines, since tasks are what the Speech SDK uses.
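
A sketch of the fields (the names are my assumption):

    public class VoiceRecorder : MonoBehaviour
    {
        [Header("Speech Service")]
        public string subscriptionKey;  // Key 1 from the Azure dashboard
        public string serviceRegion;    // e.g. "westus"

        private bool isRecording;       // are we currently recording?
        private string curCommand;      // the current command to send
        private bool commandReady;      // is the command ready to be sent?

        // completion task, triggered when we want to stop recording
        private TaskCompletionSource<int> stopRecognition;
    }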

Let’s start with the RecordAudio function. async functions are basically an alternative to coroutines: they allow you to pause a function and wait before continuing. In our case, we need this to give the SDK time to convert the audio to text.

First, let’s say we’re recording and create a config class which holds our data.

Then we can create our recognizer. This is what’s going to convert the voice to text.

Inside the using, we’re going to create an event handler that stores the recognized text in curCommand. Then we start listening to the voice. When the completion task is triggered, we stop listening and tag the command as ready to be sent.
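
Here’s one way the full function could look, as a sketch using the SDK’s SpeechConfig, SpeechRecognizer and continuous recognition:

    public async void RecordAudio ()
    {
        // say we're recording and create the config holding our data
        isRecording = true;
        curCommand = "";
        SpeechConfig config = SpeechConfig.FromSubscription(subscriptionKey, serviceRegion);
        stopRecognition = new TaskCompletionSource<int>();

        // the recognizer converts our voice to text
        using (SpeechRecognizer recognizer = new SpeechRecognizer(config))
        {
            // when a phrase has been recognized, store the text
            recognizer.Recognized += (s, e) =>
            {
                if (e.Result.Reason == ResultReason.RecognizedSpeech)
                    curCommand += e.Result.Text + " ";
            };

            // start listening to the voice
            await recognizer.StartContinuousRecognitionAsync();

            // wait here until the completion task is triggered
            await stopRecognition.Task;

            // stop listening and tag the command as ready to be sent
            await recognizer.StopContinuousRecognitionAsync();
            isRecording = false;
            commandReady = true;
        }
    }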

The CommandCompleted function gets called when the command is ready to be sent.
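
It just triggers the completion task so RecordAudio can finish up:

    public void CommandCompleted ()
    {
        stopRecognition.TrySetResult(0);
    }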

In the Update function, we want to check for the keyboard input on the return key. This will toggle the recording.

Then underneath that (still in the Update function) we check for when we’re ready to send a command, and do so.
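
A sketch of the whole Update function, invoking the LUISManager events along the way:

    void Update ()
    {
        // the return key toggles recording on and off
        if (Input.GetKeyDown(KeyCode.Return))
        {
            if (!isRecording)
            {
                RecordAudio();
                LUISManager.instance.onStartRecordVoice?.Invoke();
            }
            else
            {
                CommandCompleted();
                LUISManager.instance.onEndRecordVoice?.Invoke();
            }
        }

        // once the command is ready, send it off
        if (commandReady)
        {
            commandReady = false;
            LUISManager.instance.onSendCommand?.Invoke(curCommand);
        }
    }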

Attach this script also to the LUISManager object.

  • Subscription Key – Speech service key 1
  • Service Region – Speech service region

Voice Recorder Script component added to LUIS Manager

Mover Script

Create a new C# script called Mover. This is going to control the player.

Our variables are just our move speed, fall speed, default move distance, the floor Y position for the player, and the target position.

In the Start function, let’s set the target position to be our position.
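
A sketch of the fields and the Start function:

    using UnityEngine;

    public class Mover : MonoBehaviour
    {
        public float moveSpeed;        // how fast we move towards the target
        public float fallSpeed;        // how fast we fall
        public float defaultMoveDist;  // distance used when no number is given
        public float floorYPos;        // Y position of the ground

        private Vector3 targetPosition;

        void Start ()
        {
            // start off targeting our current position
            targetPosition = transform.position;
        }
    }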

In the Update function, we’ll move towards the target position and fall downwards until we hit the floor Y position.
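
For example:

    void Update ()
    {
        // move towards the target position
        transform.position = Vector3.MoveTowards(transform.position, targetPosition,
            moveSpeed * Time.deltaTime);

        // fall downwards until we hit the floor Y position
        if (transform.position.y > floorYPos)
        {
            transform.position += Vector3.down * fallSpeed * Time.deltaTime;
            targetPosition.y = transform.position.y;
        }
    }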

ReadResult takes in a LUISResult and figures out a move direction and move distance – updating the target position.

GetEntityDirection takes in a direction as a string and converts it to a Vector3 direction.
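
A sketch of both functions, matching the entity types and direction phrases we set up in LUIS:

    public void ReadResult (LUISResult result)
    {
        Vector3 direction = Vector3.zero;
        float distance = defaultMoveDist;

        // pull the direction and distance out of the entities
        foreach (LUISEntity entity in result.entities)
        {
            if (entity.type == "MoveDirection")
                direction = GetEntityDirection(entity.entity);
            else if (entity.type == "MoveDistance")
            {
                float parsed;
                if (float.TryParse(entity.entity, out parsed))
                    distance = parsed;
            }
        }

        // update the target position
        targetPosition += direction * distance;
    }

    Vector3 GetEntityDirection (string direction)
    {
        switch (direction)
        {
            case "forwards": return Vector3.forward;
            case "backwards":
            case "back": return Vector3.back;
            case "left": return Vector3.left;
            case "right": return Vector3.right;
            default: return Vector3.zero;
        }
    }

Now that Mover exists, remember to go back to LUISManager and uncomment the resultTarget field and the ReadResult call.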

Scene Setup

Back in the editor, create a new cube (right click Hierarchy > 3D Object > Cube) and call it Mover. Attach the Mover script and set the properties:

  • Move Speed – 2
  • Fall Speed – 1
  • Default Move Dist – 1
  • Floor Y Pos – 0

Also set the Y position to 15.

Script added to Rover object in Unity

For the ground, let’s create a new empty game object called Environment. Then create a new plane as a child called Ground.

  • Set the scale to 8

White plane created within Unity project

Drag the MarsMaterial (Textures folder) onto the plane. Then parent the camera to the mover.

Camera options in the Unity Inspector

Let’s now rotate the directional light so it’s facing directly downwards. This makes it so we can see where the mover is going to land on the ground – allowing the player to finely position it.

  • Set the Rotation to 90, -30, 0

Unity Light rotated on the X-axis

Let’s also add in a target object. There won’t be any logic behind it; it’s just a goal for the player to aim for.

Target object within Unity project

Creating the UI

Create a canvas with two text elements – showing the altitude and current move phrase.

Unity UI setup for voice controlled game

Now let’s create the UI script and attach it to the LUISManager object. Since we’re using TextMeshPro, we’ll need to reference the namespace.

For our variables, we’re just going to have our info text, speech text and mover object.

In the OnEnable function, let’s subscribe to the events we need, and unsubscribe from them in the OnDisable function.

In the Update function, we’re going to just update the info text to display the player’s Y position.

Here are the functions the events call. They update the speech text to show when you’re recording your voice, when the command is being calculated, and when it’s being executed.
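
Here’s a sketch of the whole script (it assumes LUISManager.instance is set in Awake, before our OnEnable runs):

    using TMPro;
    using UnityEngine;

    public class UI : MonoBehaviour
    {
        public TextMeshProUGUI infoText;    // shows the altitude
        public TextMeshProUGUI speechText;  // shows the current move phrase
        public Mover mover;

        void OnEnable ()
        {
            LUISManager.instance.onStartRecordVoice += OnStartRecordVoice;
            LUISManager.instance.onEndRecordVoice += OnEndRecordVoice;
            LUISManager.instance.onSendCommand += OnSendCommand;
        }

        void OnDisable ()
        {
            LUISManager.instance.onStartRecordVoice -= OnStartRecordVoice;
            LUISManager.instance.onEndRecordVoice -= OnEndRecordVoice;
            LUISManager.instance.onSendCommand -= OnSendCommand;
        }

        void Update ()
        {
            // display the player's current altitude (Y position)
            infoText.text = "Altitude: " + mover.transform.position.y.ToString("F1");
        }

        void OnStartRecordVoice () { speechText.text = "Recording voice..."; }
        void OnEndRecordVoice () { speechText.text = "Calculating command..."; }
        void OnSendCommand (string command) { speechText.text = "Executing: " + command; }
    }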

Make sure to connect the text elements and mover object.

Unity UI Script component from the Unity Inspector

Conclusion

Now we’re done! You can press play and command the cube with your voice! If you missed any of the links, you can find them in the sections above.
