Making audio searchable with Cloud Speech

Written by srobtweets | Published 2017/09/21

Last month Cloud Speech introduced a new word-level timestamps feature: audio transcriptions now include the start and end timestamp for each word. This opens up tons of possibilities: developers can now skip to the exact moment in an audio file where a word was spoken, display the relevant text while audio is playing, or search a library of audio for a specific term.

With the ability to search an audio file, I wanted to try this feature out on videos. To do this I extracted the audio track from a video file, sent it to Cloud Speech, and built a frontend for searching the audio transcription JSON. The result is the following demo, which searches my recent ML API presentation (it’s best viewed with sound):

In addition to the Speech API, the demo also uses Cloud Functions, Cloud Storage, and App Engine for hosting. Here’s a diagram of how the backend works:

*Architecture of the Speech Timestamps demo*

Step 1: Extracting audio with ffmpeg and Cloud Functions

Because Cloud Speech lets you provide the Cloud Storage URL for an audio file to transcribe, I decided to store all of my video and audio content in Cloud Storage. Then in my App Engine frontend I could get the video and associated transcription JSON directly from Cloud Storage.

I wanted to be able to drop a video file into Cloud Storage and automatically have the transcription show up in another storage bucket. Sounds magical, right? I implemented this with Cloud Functions: a compute solution for writing functions that are automatically triggered by certain cloud events. Functions are written in Node.js, and you specify the type of event that will trigger each function. In this case I triggered my function every time a new file was added to my video bucket. I split the transcription process into two functions:

  1. **extractAudio**: Extract the audio from a video and transcode it into a format the Speech API accepts (I used FLAC encoding)

  2. **transcribeAudio**: Send the FLAC file to the Speech API and upload transcriptions to Cloud Storage

The extractAudio function uses the google-cloud Node module for accessing Cloud Storage and fluent-ffmpeg for extracting and transcoding audio. To get ffmpeg working in my Cloud Functions environment I needed to upload the ffmpeg binaries when I deployed my function and tell fluent-ffmpeg the path to those binaries.

Here’s the full list of npm dependencies:
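A sketch of what the package.json might look like (the version numbers are illustrative, not necessarily what the original demo pinned):

```json
{
  "name": "speech-timestamps-functions",
  "dependencies": {
    "@google-cloud/speech": "^0.10.0",
    "@google-cloud/storage": "^1.2.0",
    "fluent-ffmpeg": "^2.1.0",
    "rimraf": "^2.6.0"
  }
}
```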

We’ll also define variables for each of our Cloud Storage buckets: one for videos, one for the FLAC audio files, and one for the transcription JSON:
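Something like the following, where the bucket names are placeholders you’d swap for your own:

```js
// Instantiate a Cloud Storage client (0.x-era google-cloud style)
const storage = require('@google-cloud/storage')();

// Bucket names below are hypothetical — substitute your own
const videoBucket = storage.bucket('my-video-bucket');
const audioBucket = storage.bucket('my-flac-audio-bucket');
const transcriptionBucket = storage.bucket('my-transcription-bucket');
```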

The function will do the following:

  • Download the video file from Cloud Storage
  • Extract the audio and transcode it to FLAC format for Cloud Speech
  • Upload the FLAC file to Cloud Storage

The function receives an event parameter, which gives us data on the file that triggered it. Here’s an outline of our function:
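A rough sketch, following the background-function signature Cloud Functions used at the time; the helper names (downloadVideo, transcodeToFlac, uploadFlac) are invented for illustration and filled in below:

```js
exports.extractAudio = function (event, callback) {
  const file = event.data; // metadata for the Cloud Storage file that fired the trigger
  if (file.resourceState === 'not_exists') {
    return callback(); // ignore file deletions
  }

  downloadVideo(file)        // 1. copy the video into /tmp
    .then(transcodeToFlac)   // 2. extract the audio and transcode it to FLAC
    .then(uploadFlac)        // 3. upload the .flac file to the audio bucket
    .then(() => callback())
    .catch(callback);
};
```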

Next we’ll write the function to download our video from Cloud Storage. We can save the file to local disk in our Cloud Functions environment by writing it to the /tmp directory:
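A minimal version, reusing the videoBucket variable defined above:

```js
const path = require('path');

// Download the triggering video file to local disk under /tmp
function downloadVideo(file) {
  const localPath = path.join('/tmp', path.basename(file.name));
  return videoBucket
    .file(file.name)
    .download({ destination: localPath })
    .then(() => localPath);
}
```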

Once we’ve got the video file available locally in Cloud Functions, we’re ready to extract and transcode the audio with ffmpeg:
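A sketch using fluent-ffmpeg; the path to the bundled ffmpeg binary is an assumption and depends on how you package your function:

```js
const ffmpeg = require('fluent-ffmpeg');

// Point fluent-ffmpeg at the ffmpeg binary we deployed alongside the function
// (assumes the binary sits next to index.js — adjust for your layout)
ffmpeg.setFfmpegPath(path.join(__dirname, 'ffmpeg'));

// Strip the video stream and transcode the audio track to mono FLAC
function transcodeToFlac(videoPath) {
  const audioPath = videoPath.replace(path.extname(videoPath), '.flac');
  return new Promise((resolve, reject) => {
    ffmpeg(videoPath)
      .noVideo()          // drop the video stream
      .audioChannels(1)   // Cloud Speech expects single-channel audio
      .format('flac')
      .on('end', () => resolve(audioPath))
      .on('error', reject)
      .save(audioPath);
  });
}
```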

The last step is to upload the FLAC file to a new Cloud Storage bucket. You can find the code for uploading files in the google-cloud documentation.
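For reference, the upload itself can be as small as this:

```js
// Upload the transcoded audio into the FLAC bucket
function uploadFlac(audioPath) {
  return audioBucket.upload(audioPath);
}
```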

Step 2: Transcribing audio with Cloud Speech

To get our audio transcription and timestamp data we’ll write a Cloud Function called transcribeAudio, which will be triggered whenever a FLAC file is added to our audio bucket. For this function we’ll need to instantiate a speech client with google-cloud Node and then write our transcription function:
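A sketch of the skeleton (client construction varies slightly between versions of the google-cloud library; the helper names transcribe, writeTranscription, and uploadTranscription are invented for illustration and filled in below):

```js
// Instantiate a Cloud Speech client (0.x-era google-cloud style)
const speech = require('@google-cloud/speech')();

exports.transcribeAudio = function (event, callback) {
  const file = event.data; // the FLAC file that triggered the function
  if (file.resourceState === 'not_exists') {
    return callback(); // ignore file deletions
  }

  transcribe(file)              // 1. send the FLAC file to Cloud Speech
    .then(writeTranscription)   // 2. write the results to a local JSON file
    .then(uploadTranscription)  // 3. upload the JSON to the transcription bucket
    .then(() => callback())
    .catch(callback);
};
```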

We just need to call longRunningRecognize() on our client to make a request to Cloud Speech. This will kick off a long-running speech operation and return the final transcription results when it finishes:
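Roughly like this — enableWordTimeOffsets is the config flag that turns on word-level timestamps, and we point Cloud Speech at the FLAC file’s Cloud Storage URI rather than uploading the audio ourselves:

```js
// Ask Cloud Speech to transcribe the FLAC file directly from Cloud Storage
function transcribe(file) {
  const request = {
    config: {
      encoding: 'FLAC',
      languageCode: 'en-US',
      enableWordTimeOffsets: true // include start/end timestamps for each word
    },
    audio: {
      uri: `gs://${file.bucket}/${file.name}`
    }
  };

  return speech
    .longRunningRecognize(request)
    .then(responses => responses[0].promise()) // wait for the operation to finish
    .then(responses => responses[0].results);
}
```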

We can then write the transcriptions to a local JSON file:
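For example (the /tmp filename is arbitrary):

```js
const fs = require('fs');

// Serialize the transcription results and write them under /tmp
function writeTranscription(results) {
  const localPath = '/tmp/transcription.json';
  return new Promise((resolve, reject) => {
    fs.writeFile(localPath, JSON.stringify(results), err => {
      if (err) return reject(err);
      resolve(localPath);
    });
  });
}
```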

The last step is uploading our JSON file to Cloud Storage in the same way we did in the first function.

Woohoo! Now we’ve got an entirely serverless solution that generates timestamped transcriptions from a video. Note that you’ll want to periodically delete the contents of /tmp in your Cloud Functions file system to avoid hitting a memory limit. You can do this with the rimraf npm module:
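A cleanup sketch (the glob pattern is an assumption about where your function writes its scratch files):

```js
const rimraf = require('rimraf');

// /tmp in Cloud Functions is an in-memory filesystem, so leftover files
// count against the function's memory allocation — clean up after each run
function cleanTmp() {
  return new Promise((resolve, reject) => {
    rimraf('/tmp/*', err => (err ? reject(err) : resolve()));
  });
}
```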

Get Started

To start using the timestamps functionality in your own apps, head over to the Speech API timestamp docs. For details on Cloud Functions, check out the docs or watch my teammate Bret’s awesome talk on Cloud Functions.

I’d love to see what you build with the Speech API and Cloud Functions. Let me know what you think in the comments or find me on Twitter @SRobTweets.

