This AI Can Separate Speech, Music and Sound Effects from Movie Soundtracks

Written by whatsai | Published 2021/10/25
Tech Story Tags: ai | artificial-intelligence | music | music-industry | youtube-transcripts | youtubers | hackernoon-top-story | machine-learning | web-monetization

TLDR: Mitsubishi and Indiana University have published a new model, along with a new dataset, for separating a soundtrack into speech, music, and sound effects. The problem is isolating independent sound sources from a complex acoustic scene, like a movie scene or a YouTube video where some sounds are not well balanced. If you can successfully isolate the different categories in a soundtrack, you can also turn any one of them up or down on its own, like turning down the music a bit to hear the actors clearly.

Have you ever tuned in to a video or a TV show where the actors were completely inaudible, or the music was way too loud? Well, this problem, also called the cocktail party problem, may never happen again. Mitsubishi and Indiana University just published a new model, along with a new dataset, tackling the task of splitting a soundtrack into its components. For example, if we take the same audio clip we just ran with the music way too loud, you can simply turn the individual audio tracks up or down to give more importance to the speech than to the music.
The problem here is isolating any independent sound source from a complex acoustic scene like a movie scene or a YouTube video where some sounds are not well balanced.
Sometimes you simply cannot hear some actors because of the music playing or explosions or other ambient sounds in the background.
Well, if you successfully isolate the different categories in a soundtrack, it means you can also turn any one of them up or down on its own, like turning down the music a bit to hear the actors clearly.
This is exactly what the researchers achieved. Learn more in the video!
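
To give a concrete feel for what those volume controls look like once the three stems are separated, here is a minimal sketch; the stem arrays and gain values are placeholders for illustration, not code from the paper.

```python
# Minimal sketch: once a separator gives you the three stems, per-category
# volume control is just a weighted sum back into one soundtrack.
# The stem arrays and gain values here are illustrative, not from the paper.
import numpy as np

def remix(speech: np.ndarray, music: np.ndarray, sfx: np.ndarray,
          speech_gain: float = 1.0, music_gain: float = 1.0,
          sfx_gain: float = 1.0) -> np.ndarray:
    """Recombine separated stems with independent gains (the 'three sliders')."""
    mix = speech_gain * speech + music_gain * music + sfx_gain * sfx
    # Avoid clipping if the boosted mix exceeds the [-1, 1] range.
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix

# Example: keep the speech as-is, pull the music down, leave the effects untouched.
t = np.linspace(0, 1.0, 44100, endpoint=False)
speech = 0.3 * np.sin(2 * np.pi * 220 * t)   # stand-ins for real separated stems
music = 0.8 * np.sin(2 * np.pi * 440 * t)
sfx = 0.1 * np.random.randn(44100)
quieter_music = remix(speech, music, sfx, music_gain=0.3)
```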

Learn more:

References

►Read the full article:
https://www.louisbouchard.ai/isolate-voice-music-and-sound-effects-with-ai/
►Petermann, D., Wichern, G., Wang, Z., & Le Roux, J. (2021). The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks. https://arxiv.org/pdf/2110.09958.pdf
►Project page: https://cocktail-fork.github.io/
►DnR dataset: https://github.com/darius522/dnr-utils#overview
►My Newsletter (A new AI application explained weekly to your emails!):
https://www.louisbouchard.ai/newsletter/

Video transcript

00:01
Have you ever tuned in to a video or a TV show and the sound was like this, where the actors are completely inaudible, or like this, where the music is way too loud? Well, this problem, also called the cocktail party problem, may never happen again. Mitsubishi and Indiana University just published a new model as well as a new dataset tackling the task of splitting a soundtrack into its components. For example, if we take the same audio clip we just ran with the music way too loud, you can simply turn the audio track you want up or down, to give more importance to the speech than to the music.

00:42
Diving straight into it: the problem here is isolating any independent sound source from a complex acoustic scene like a movie scene or a YouTube video where some sounds are not well balanced. Sometimes you simply cannot hear some actors because of the music playing, or because of explosions and other ambient sounds in the background. Well, if you successfully isolate the different categories in a soundtrack, it means that you can also turn up or down only one of them, like turning down the music a bit to hear all the actors correctly, as we just did. Coming from someone who isn't a native English speaker, this will be incredibly useful when listening to videos with loud background music and actors or speakers with a strong accent I am not used to. Just imagine having these three sliders in a YouTube video to manually tweak them. How cool would that be?
01:39
[Audio clip: "You have a metal arm!"] It could also be incredibly useful for translation or speech-to-speech applications, where we could simply isolate the speaker to improve the task's results.

01:49
Here, the researchers focused on the task of splitting a soundtrack into three categories: music, speech, and sound effects, three categories that often appear in movies or TV shows. They called this task the cocktail fork problem, and you can clearly see where they got the name from. I'll spoil the results for you: they are quite amazing, as we will hear in the next few seconds. But first, let's take a look at how they take a movie soundtrack and transform it into three independent soundtracks.

02:18
This is the architecture of the model. You can see the input mixture y, which is the complete soundtrack, at the top, and at the bottom our three output sources x, which, I repeat, are the separated speech, music, and other sound effects.
02:33
The first step is to encode the soundtrack using a Fourier transform at different resolutions, called the STFT, or short-time Fourier transform. This means that the input, a soundtrack whose frequencies evolve over time, is first split into shorter segments; here, for example, it is split with windows of either 32, 64, or 256 milliseconds. Then we compute the Fourier transform on each of these shorter segments, stepping 8 milliseconds at a time for each window or segment. This gives us the Fourier spectrum of each segment, analyzed at different segment sizes for the same soundtrack, providing both short-term and long-term information on the soundtrack, for example by emphasizing specific frequencies from the initial input if they appear more often within a longer segment. The information, initially represented as a signal over time, is now replaced by the Fourier phase and magnitude components, or Fourier spectrum, which can be shown in a spectrogram similar to this one. Note that the example here uses a single overlapping segment length of 0.10 seconds, but it is the same thing in our case with three different, also overlapping, segment sizes.
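
To make that front end more concrete, here is a rough sketch of a multi-resolution STFT in PyTorch using the 32, 64, and 256 millisecond windows and the 8 millisecond hop described above. The 44.1 kHz sample rate and the specific library calls are my own illustrative choices, not the authors' code.

```python
# Sketch of the multi-resolution STFT front end described above: the same
# waveform is analyzed with 32, 64 and 256 ms windows, all stepping 8 ms at a
# time. The 44.1 kHz sample rate is an assumption for illustration.
import torch

def multi_resolution_stft(wave: torch.Tensor, sample_rate: int = 44100,
                          win_ms=(32, 64, 256), hop_ms: int = 8):
    """Return one complex spectrogram per window size for the same signal."""
    hop = int(sample_rate * hop_ms / 1000)
    specs = []
    for ms in win_ms:
        win_length = int(sample_rate * ms / 1000)
        window = torch.hann_window(win_length)
        spec = torch.stft(wave, n_fft=win_length, hop_length=hop,
                          win_length=win_length, window=window,
                          return_complex=True)
        specs.append(spec)  # shape: (freq_bins, frames); bins grow with the window
    return specs

wave = torch.randn(44100)             # one second of placeholder audio
short, medium, long = multi_resolution_stft(wave)
print(short.shape, medium.shape, long.shape)
```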
03:46
Then, this transformed representation, which simply contains more information about the soundtrack, is sent into a fully connected block to be transformed into the same dimension for all branches. This transformation is the first module learned during training of the algorithm. We then average the results, as it is shown to improve the model's capacity to consider these multiple sources as a whole rather than independently. Here, the multiple sources are the transformed soundtracks obtained with the differently sized windows. Don't give up yet, we just have a few steps left before hearing the final results!
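
Here is a minimal sketch of that first learned module as described: one fully connected projection per STFT resolution mapping each magnitude spectrum to a shared feature size, then an average over the three branches. The layer sizes are illustrative assumptions, not the paper's exact dimensions.

```python
# Sketch of the first learned module described above: one fully connected
# layer per STFT resolution maps each magnitude spectrum to a shared feature
# size, and the three branches are then averaged. Sizes are illustrative.
import torch
import torch.nn as nn

class MultiResolutionEncoder(nn.Module):
    def __init__(self, freq_bins=(706, 1412, 5645), hidden: int = 256):
        super().__init__()
        # One projection per window size, all landing in the same dimension.
        self.projections = nn.ModuleList([nn.Linear(b, hidden) for b in freq_bins])

    def forward(self, magnitudes):
        # magnitudes: list of tensors shaped (frames, freq_bins_i), one per STFT
        # resolution; with a shared hop, each resolution yields the same frame count.
        projected = [proj(mag) for proj, mag in zip(self.projections, magnitudes)]
        return torch.stack(projected).mean(dim=0)   # (frames, hidden)

encoder = MultiResolutionEncoder()
mags = [torch.rand(100, 706), torch.rand(100, 1412), torch.rand(100, 5645)]
features = encoder(mags)   # (100, 256)
```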
04:19
This averaged information is then sent into a bidirectional long short-term memory, a type of recurrent neural network that allows the model to understand the inputs over time, just like a convolutional neural network understands images over space. If you are not familiar with recurrent neural networks, I invite you to watch the video I made introducing them. This is the second module learned during training.
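
As a small sketch of that second module, this is what a bidirectional LSTM over the averaged frame features looks like in PyTorch; the hidden size and number of layers are illustrative assumptions.

```python
# Sketch of the second learned module described above: a bidirectional LSTM
# reads the averaged features frame by frame so the model can use context over
# time. Hidden size and layer count are illustrative assumptions.
import torch
import torch.nn as nn

hidden = 256
bilstm = nn.LSTM(input_size=hidden, hidden_size=hidden,
                 num_layers=2, batch_first=True, bidirectional=True)

features = torch.rand(1, 100, hidden)       # (batch, frames, features)
context, _ = bilstm(features)               # (1, 100, 2 * hidden): both directions
```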
04:44
We average the results once again and finally send them to each of our three branches, which will extract the appropriate sounds for their category. Here, the decoder is simply fully connected layers again, as you can see on the right. They are responsible for extracting only the wanted information from our encoded information. This is, of course, the third and last module that learns during training to achieve this, and all three modules are trained simultaneously.
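
And here is a rough sketch of what those three decoder branches could look like: one small stack of fully connected layers per source on top of the shared features. Having each branch output a sigmoid mask over frequency bins is an assumption for illustration, one common way such decoder heads are used in source separation.

```python
# Sketch of the three decoder branches described above: one small stack of
# fully connected layers per source (speech, music, effects) reads the shared
# BiLSTM features. Producing a sigmoid mask over frequency bins is an assumed
# formulation here, not necessarily the paper's exact output.
import torch
import torch.nn as nn

class SourceDecoder(nn.Module):
    def __init__(self, in_dim: int = 512, freq_bins: int = 706):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim), nn.ReLU(),
            nn.Linear(in_dim, freq_bins), nn.Sigmoid(),   # mask values in [0, 1]
        )

    def forward(self, context):
        return self.net(context)                          # (frames, freq_bins)

decoders = nn.ModuleDict({name: SourceDecoder()
                          for name in ("speech", "music", "sfx")})
context = torch.rand(100, 512)                            # BiLSTM output per frame
masks = {name: dec(context) for name, dec in decoders.items()}
```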
05:13
Finally, we just reverse the first step, taking the spectral data back into the time domain, and voila, we have our final soundtrack divided into three categories.
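
Under that same assumed mask formulation, the reconstruction step might look like the sketch below: scale the mixture's complex STFT by one source's mask and invert it back into a waveform. The window, hop, and placeholder tensors are illustrative.

```python
# Sketch of the final reconstruction step, under the same assumed mask
# formulation: scale the mixture's complex STFT by one source's mask, then
# invert the STFT to get that source back as a waveform.
import torch

win_length, hop = 1411, 352                                # ~32 ms window, ~8 ms hop at 44.1 kHz
window = torch.hann_window(win_length)

wave = torch.randn(44100)                                  # placeholder mixture
mix_spec = torch.stft(wave, n_fft=win_length, hop_length=hop,
                      win_length=win_length, window=window, return_complex=True)

speech_mask = torch.rand(mix_spec.shape)                   # stand-in for a decoder output
speech_spec = mix_spec * speech_mask                       # keep only the speech energy
speech_wave = torch.istft(speech_spec, n_fft=win_length, hop_length=hop,
                          win_length=win_length, window=window,
                          length=wave.shape[-1])
```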
05:22
As I said earlier in the video, this research allows you to turn the volume of each category up or down independently. But an example is always better than words, so let's quickly hear that on two different clips.
05:35
[Music]

05:41
[Audio clip: "Hi, Phil Swift here for Flex Tape, the super strong waterproof tape that can instantly patch, bond..."]

05:53
As if it wasn't already cool enough, the separation also allows you to edit a specific track independently, for example to add some filters or reverb.

06:03
[Audio clip: "...strong, and once it's on, it holds on tight. And for emergency auto repairs, it keeps its grip even in the toughest conditions."]
06:13
They also released the dataset for this new task, built by merging three separate datasets: one for speech, one for music, and another for sound effects. This way, they created soundtracks for which they already had the real, separated audio channels, and could train their model to replicate this ideal separation. Of course, the merging, or mixing, step wasn't as simple as it sounds: they had to make the final soundtrack as challenging as a real movie scene. This means they had to apply transformations to the independent audio tracks to get a realistic-sounding blend, in order to be able to train a model on this dataset and then use it in the real world.
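
The sketch below shows the general idea of such a mixing step, not the authors' exact recipe: take one clean excerpt per category, scale each by a random gain, and sum them, keeping the clean stems as training targets. The gain range and placeholder signals are illustrative assumptions.

```python
# Sketch of the general idea behind the mixing step, not the authors' exact
# recipe: take one excerpt per category, scale each with a random gain, and
# sum them, keeping the clean stems as training targets. Gain ranges and
# placeholder signals are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def make_training_example(speech, music, sfx, gain_db_range=(-10.0, 0.0)):
    """Mix three clean stems into one 'movie-like' soundtrack plus its targets."""
    stems = {"speech": speech, "music": music, "sfx": sfx}
    scaled = {}
    for name, stem in stems.items():
        gain_db = rng.uniform(*gain_db_range)             # random per-stem level
        scaled[name] = stem * (10.0 ** (gain_db / 20.0))  # dB to linear amplitude
    mixture = sum(scaled.values())
    return mixture, scaled                                # model input and ideal outputs

speech = rng.standard_normal(44100) * 0.1                 # placeholder one-second stems
music = rng.standard_normal(44100) * 0.1
sfx = rng.standard_normal(44100) * 0.1
mixture, targets = make_training_example(speech, music, sfx)
```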
06:52
I invite you to read their paper for more technical details about their implementation and the new dataset they introduced, if you'd like to tackle this task as well. If you do so, please let me know and send me your progress; I'd love to see that, or rather, to hear that! Both are linked in the description below. Thank you very much for watching, for those of you who are still here, and a huge thanks to Anthony Manilow, the most recent YouTube member supporting the videos.



Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2021/10/25