Twilio: How to Connect with an Existing Voice Assistant

Written by charnull | Published 2020/10/09
Tech Story Tags: twilio | nodejs | websockets | voice-assistant | rasa | chatbot-development | chatbots | cloud-telephony

Today's post explains Twilio's recent bidirectional streaming feature, the feature that made it possible to receive the call audio while still being able to send audio for playback asynchronously.
This was not possible before, when communication had to be unidirectional: either you could receive the voice audio but could not respond to it, or you could send audio to the call but not get anything back.
Let's get right into the tutorial. You need to understand three things if you hope to use Twilio in conjunction with your own Voice Assistant (abbreviated VA).
1. How the backend of the VA receives the user's voice and returns audio responses (HTTP or Websocket? Streaming or one-off? In which audio format?)
2. How to serve calls using a node.js server (How can a phone call to a Twilio number perform a request to a Node server?)
3. How to open and use a bidirectional communication channel between node.js code and the actual phone call (Through which protocol?)
We will tackle these questions one at a time. But first, let's see the prerequisites:
* Some experience with node.js HTTP servers and websockets
* A public server with an open port
* SSL keys for the domain of the server
* A Twilio account with a Twilio number, either toll-free or local
Now, with that said, let's move on to the tutorial!

The VA backend

Most Voice Assistants work in turns. That is, they receive a user utterance (a user phrase), they generate the answer as voice audio, they send it back to the user and then wait for the next utterance, which starts the next dialogue turn.
The user cannot speak again before getting the response from the VA. So, we mainly have two data flows: the voice data from the client to the VA (flow A) and the synthesized voice response from the VA back to the client (flow B).
The next thing to know about our presupposed VA model is that flow A is an audio stream, whereas flow B is sent as a single message. This follows from the way the audio is generated: user voice audio is produced sequentially and can thus be streamed. The VA's synthesized voice response, on the other hand, is generated as a whole, not incrementally: the first second of audio only becomes available once the whole message has been synthesized.
Finally, turn taking is managed using two signals. The first one, RESTART_COMMUNICATION, is sent by the client at the beginning of the user's turn (start of flow A). The other signal is SINGLE_UTTERANCE_END. It is sent by the VA backend as soon as it recognizes the end of a single phrase in the incoming audio (alternatively, this can be controlled by the client as well). The SINGLE_UTTERANCE_END signal is sent to the client to prevent it from sending further audio.
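To make the turn-taking concrete, here is a minimal sketch of one dialogue turn as seen from a client, using the websocket npm package. The VA endpoint URL and the playAudio and getMicChunk helpers are hypothetical stand-ins, not part of any real API.

const WebSocketClient = require('websocket').client;

// Hypothetical helpers: play a response through the speakers, read mic audio
const playAudio = (audio) => { /* hand the synthesized response to the speakers */ };
const getMicChunk = () => Buffer.alloc(320); // placeholder for real microphone data

const client = new WebSocketClient();

client.on('connect', (connection) => {
  let recording = false;

  connection.on('message', (message) => {
    if (message.type !== 'utf8') return;
    if (message.utf8Data === 'SINGLE_UTTERANCE_END') {
      recording = false;                            // end of flow A: stop sending audio
    } else {
      playAudio(message.utf8Data);                  // flow B: the whole response at once
      connection.sendUTF('RESTART_COMMUNICATION');  // the next turn begins
      recording = true;
    }
  });

  connection.sendUTF('RESTART_COMMUNICATION');      // start of the first turn
  recording = true;
  setInterval(() => {
    if (recording) connection.sendBytes(getMicChunk()); // stream user audio (flow A)
  }, 20);
});

client.connect('wss://va.example.com/');            // hypothetical VA endpoint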

Audio encodings, what a mess!

When I first tried to connect our team's VA to Twilio, one of the first things I checked was Twilio's audio format. I had hoped for an option for wav or mp3 encodings. Alas! The only supported format is that mysterious mulaw (or u-law or μ-law) encoding, which, I learned, is aimed specifically at speech audio.
I also learned that it is not widely used outside of telephony; chances are that in your system you won't have mulaw, which means that you will have to do some audio conversion. Although this might take a significant amount of work, I will not cover this part. I will only mention that, for Python, you can convert wav to mulaw by using the audioop module, as in this code snippet:
import audioop
import wave

wavfile = "va_response.wav"  # path to the wav file to convert (illustrative)

with wave.open(wavfile, "rb") as wav:
    # load all frames as raw linear PCM bytes
    raw_wav = wav.readframes(wav.getnframes())
    # downsample to the 8 kHz rate used in telephony;
    # arguments: fragment, sample width in bytes, number of channels,
    # input sample rate, desired sample rate, filter state
    raw_wav_8khz, _ = audioop.ratecv(
        raw_wav,
        wav.getsampwidth(),
        wav.getnchannels(),
        wav.getframerate(),
        8000,
        None,
    )
    # convert the linear PCM samples to mulaw
    raw_ulaw = audioop.lin2ulaw(raw_wav_8khz, wav.getsampwidth())

Twilio node.js server

From this chapter onwards, we will be discussing the Twilio server itself.
To insert programming logic into your voice call, you have to use a node server. The project structure of the tutorial looks like this:

twilio_server.js
package.json
templates/
    babyapp.xml
    app.xml
keys/

This is a node project and the server is its entry point. There are some xml files with the server's TwiML responses (don't worry if you don't know what that is) in the templates directory. Also, there is a keys folder with the server's HTTPS keys. Last, but not least, the npm dependencies are:
* httpdispatcher
* websocket
If you are experienced with node.js, you can skip to the last part of the article ("putting it all together").

Taking it one step at a time: part 1 - TwiML without Streams

Before understanding what the node server does, let's create a simple Twilio server and connect it to a Twilio number in a baby example. This baby example will later grow into our fully functional Twilio front-end for our VA!
Every node.js server for Twilio has an endpoint returning some TwiML. In our case, the server will have an endpoint on the path /twiml, accepting HTTP POST and responding with the contents of the file templates/babyapp.xml, i.e.:
<Response>
  <Say>This is your Voice Assistant speaking!</Say>
</Response>
What does the TwiML do? Simply put, it tells Twilio how to respond to the phone call. In this case, it will merely say "This is your Voice Assistant speaking!" with a synthesized voice, before hanging up. This is the essential file for Twilio.
So, before you continue, make sure you have this file. Also, you need to make sure the dependencies are installed (as mentioned above, httpdispatcher and websocket). Finally, it is necessary to spin up a server to serve this TwiML.
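The code for serving the TwiML can look like the following minimal sketch; the SSL key file names under keys/ are placeholders, so adapt them to your setup.

const fs = require('fs');
const https = require('https');
const HttpDispatcher = require('httpdispatcher');

const dispatcher = new HttpDispatcher();

// Respond to Twilio's webhook with the TwiML stored on disk
dispatcher.onPost('/twiml', (req, res) => {
  const twiml = fs.readFileSync('templates/babyapp.xml');
  res.writeHead(200, { 'Content-Type': 'text/xml' });
  res.end(twiml);
});

const options = {
  key: fs.readFileSync('keys/privkey.pem'),    // placeholder file names
  cert: fs.readFileSync('keys/fullchain.pem'),
};

https
  .createServer(options, (req, res) => dispatcher.dispatch(req, res))
  .listen(1314, () => console.log('TwiML server listening on port 1314'));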
Store the code in twilio_server.js. The server's work is done in the dispatcher's onPost method: templates/babyapp.xml is read and then sent back to the client as the HTTP POST response body. Now you have to install the dependencies (npm install) and run the server:
node twilio_server.js
Now you should be able to access the TwiML using curl:
curl -X POST https://your.host:1314/twiml
This should return:
<Response>
  <Say>This is your Voice Assistant speaking!</Say>
</Response>
To make this work, one final step is needed. First, install the twilio-cli and set up your Twilio credentials. Then, connect the endpoint of the server you just made with your Twilio number:
twilio phone-numbers:update +111234567890 --voice-url https://your.host:1314/twiml
Now, if you have configured your Twilio number correctly, you should be able to call it and listen to the message you have specified with the <Say> tag!

Taking it one step at a time: part 2 - pipe them messages!

Now it's time to change the TwiML to allow bidirectional communication with the call:
<?xml version="1.0" encoding="UTF-8" ?>
<Response>
  <Say>This is your Voice Assistant speaking!</Say>
  <Connect>
      <Stream url="wss://your.host:1314/socket"></Stream>
  </Connect>
</Response>
Save the new TwiML in a new file named app.xml, inside the templates dir.
What is this new sorcery? In simple terms, <Stream> is used to set the websocket to which the call will connect, and <Connect> is used to make this websocket bidirectional. The latter tag is one of the additions made in July 2020 which made it easy to use Twilio with VAs.
As you can see, the websocket address has the same host and port as before. The reason for this is that we will handle this connection through the same server that we use for serving the TwiML. We will call this websocket the Twilio websocket. You need one more websocket, for communicating with the VA. We will call it the VA websocket. Note the role of the node server in the middle of this setup:
In essence, the Twilio server is just middleware between the Twilio cloud service (which sends voice audio and receives what should be played back to the human on the phone) and our VA. It is, at the same time, a server for Twilio to connect to and a VA client.
Both websockets are mostly described by their message handlers; they do not perform any special actions when closed, nor anything special when opened, apart from binding the message handlers.
Let's jump right into the code that handles incoming Twilio messages.
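What follows is a sketch of such a handler, written as a method of the MediaStream class introduced later. The event structure follows Twilio's documented stream messages; this.mode, this.vaConnection and this.streamSid are names assumed for this sketch.

// Sketch: handler for messages arriving on the Twilio websocket
processTwilioMessage(message) {
  if (message.type !== 'utf8') return;        // Twilio sends JSON as text frames
  const data = JSON.parse(message.utf8Data);

  switch (data.event) {
    case 'connected':                         // one-off, at the start of the stream
      break;
    case 'start':
      this.streamSid = data.start.streamSid;  // needed to address Twilio later
      this.mode = 'recording';
      this.vaConnection.sendUTF('RESTART_COMMUNICATION');
      break;
    case 'media':
      if (this.mode === 'recording') {
        // data.media.payload is base64-encoded 8 kHz mulaw audio;
        // any re-encoding for the VA would happen here
        this.vaConnection.sendBytes(Buffer.from(data.media.payload, 'base64'));
      }
      break;
    case 'mark':                              // playback of our response finished
      this.mode = 'recording';                // a new conversation turn starts
      this.vaConnection.sendUTF('RESTART_COMMUNICATION');
      break;
    case 'stop':                              // one-off, at the end of the call
      break;
  }
}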
Twilio's messages are named events, and there are five types of events: connected, start, media, mark and stop. They are described in detail in Twilio's docs.
The connected and stop events are just one-off events sent at the beginning and the end of the call.
The start event only occurs once too. When using the <Stream> tag, Twilio sends voice audio in base64-encoded chunks, preceded by a single start message. This start message is handy for getting the streamSid of the call (a unique ID used when sending data back to Twilio) and, of course, it signifies the start of the voice stream.
This is a good time to start the communication with the VA using the RESTART_COMMUNICATION signal. It is also necessary to set the mode attribute to "recording". This attribute has two values: "recording" or "waiting". Mode "recording" applies while flow A is being sent, while mode "waiting" covers the rest of the time. "Waiting" is not a really good name though, as there is stuff going on in the waiting phase too, as we will see in the next section.
The media event is sent many times: it contains chunks of streaming voice audio, recorded from the user's phone. This kind of message should be forwarded to the VA. Any necessary preprocessing, e.g. changing the encoding, should also happen here.
Lastly, the mark event carries a message. In Twilio it is often used for notifications. In this case, it is received from the Twilio call when playback of a VA response has ended (if you're feeling confused, things will clear up when you read the next section). Here, a new conversation turn starts: the user can speak again. The this.mode attribute is set back to "recording" and the RESTART_COMMUNICATION signal is sent to the VA.

Messages on the VA websocket

Next is the handler of the VA websocket.
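Here is a matching sketch, again as a MediaStream method. It assumes that what follows the _MULAW prefix is already base64-encoded mulaw audio, and the mark name response_played is an arbitrary label.

// Sketch: handler for messages arriving on the VA websocket
processVAMessage(message) {
  if (message.type !== 'utf8') return;
  const text = message.utf8Data;

  if (text === 'SINGLE_UTTERANCE_END') {
    this.mode = 'waiting';                       // stop forwarding call audio to the VA
  } else if (text.startsWith('_MULAW')) {
    const payload = text.slice('_MULAW'.length); // assumed base64 mulaw audio
    // Send the audio to Twilio for playback on the call...
    this.twilioConnection.sendUTF(JSON.stringify({
      event: 'media',
      streamSid: this.streamSid,
      media: { payload },
    }));
    // ...then ask Twilio to notify us once playback has finished
    this.twilioConnection.sendUTF(JSON.stringify({
      event: 'mark',
      streamSid: this.streamSid,
      mark: { name: 'response_played' },         // echoed back in the mark event
    }));
  }
}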
Similarly to the handler of the Twilio socket, this one accepts only string messages: it handles the SINGLE_UTTERANCE_END messages we talked about before, as well as the audio responses from the VA. The former are used to stop the recorded call audio from being sent to the VA.
The latter are prefixed with _MULAW and are sent to Twilio, to be played back to the user. But this is not enough: a mark message is also sent right after the audio. This step is important, because it tells Twilio to notify the server as soon as the sound clip that was just sent has been played to the user. Notice the "name" field: its value is the same as the name in the mark message Twilio will send afterwards.

Putting it all together

Now it's time to unify things into one big working server. The suggested solution is to create a class named MediaStream, and to create an instance of this class for every incoming Twilio connection. The two websockets we already described are attributes of MediaStream. The setting up of the sockets, i.e. the connection of the client to the VA and the binding of onMessage(), onOpen(), onError() etc., is done in the class constructor.
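A sketch of such a class, assuming the websocket npm package and a placeholder VA endpoint URL:

const WebSocketClient = require('websocket').client;

class MediaStream {
  constructor(twilioConnection) {
    this.twilioConnection = twilioConnection;
    this.mode = 'waiting';
    this.streamSid = null;

    // Connect to the VA as a websocket client
    const vaClient = new WebSocketClient();
    vaClient.on('connect', (vaConnection) => {
      this.vaConnection = vaConnection;
      this.prepareWebsockets();  // safe to handle Twilio messages only now
    });
    vaClient.connect('wss://va.example.com/');  // placeholder VA endpoint
  }

  prepareWebsockets() {
    this.twilioConnection.on('message', (m) => this.processTwilioMessage(m));
    this.vaConnection.on('message', (m) => this.processVAMessage(m));
  }

  // processTwilioMessage() and processVAMessage() as sketched earlier
}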

Some details of importance:

1. The <Stream> tag only supports wss (the secure websocket protocol). That means that you must have an HTTPS server, with the keys ready in a separate keys folder.
By the way, the options object used to create the server also contains the path for the Twilio websocket endpoint provided in the TwiML. This HTTPS server is fed into the constructor of the WebSocketServer object, which will listen on the endpoint specified in the TwiML; a sketch of this wiring follows after the list.
2. Part of the server code is the same as in the first part: it serves the TwiML contained in the .xml file.
3. The remaining part might look a bit perplexing. Essentially, it is the code that establishes the connections with Twilio and the VA. A bit more explanation is due:
Everything starts when Twilio connects to the websocket endpoint. This causes the creation of a MediaStream object, with the connection object as an argument. The code in the constructor ensures that the processing of messages from Twilio will start only after the websocket with the VA has been opened.
This is achieved by binding the prepareWebsockets() function to the onConnected event: when the connection with the VA is complete, it is safe to bind the message handlers to the websocket connections.
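As a rough sketch, reusing the options object and the dispatcher from the first part (here the TwiML-declared endpoint path is checked in the request handler, rather than stored in the options object):

const WebSocketServer = require('websocket').server;

// The HTTPS server serves the TwiML and carries the websocket endpoint
const httpsServer = https
  .createServer(options, (req, res) => dispatcher.dispatch(req, res))
  .listen(1314);

const wsServer = new WebSocketServer({ httpServer: httpsServer });

wsServer.on('request', (request) => {
  // Accept only connections to the endpoint declared in the TwiML
  if (request.resourceURL.pathname !== '/socket') return request.reject();
  const connection = request.accept(null, request.origin);
  new MediaStream(connection);  // one MediaStream per incoming Twilio call
});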
That's all for today. Let me know how it went, and feel free to ask any questions in the comments section. Happy coding!
