Artificial Intelligence
New work with neural networks has created a system that can generate artificial sounds based on short video clips from YouTube. Some of the audio is convincing enough that it is difficult to tell that it has been created by an algorithm.
Artificial intelligence has already been built to match artistic styles and compose its own music, but now Yipin Zhou and colleagues at the University of North Carolina at Chapel Hill, in collaboration with Adobe Research, have brought the creation of audio to the brink of the uncanny valley.
The researchers trained a machine-learning algorithm to generate realistic soundtracks for short video clips. As you can see in the video below, some of the sounds are so realistic that it is difficult to tell whether they are real or not.
Vision and hearing, two of the five traditional human senses, are the primary channels through which humans understand the world. Compared to other animals, we devote a large share of our biological processing to seeing and hearing.
In our environment, sound and vision are often correlated during events. Even when we watch a film or television, these two modalities combine to jointly affect our perceptions.
Now, in a new paper, the authors put forth the task of artificially generating sound given just visual input.
They applied a neural network called SampleRNN to generate raw waveform samples directly from the raw input video frames. SampleRNN is a hierarchically structured recurrent neural network: its coarse-to-fine structure enables the model to generate extremely long sequences, while the recurrent structure of each layer captures dependencies between distant samples. SampleRNN has previously been applied to speech synthesis and music generation tasks.
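To make the coarse-to-fine idea concrete, here is a minimal sketch of a two-tier, SampleRNN-style recurrent generator conditioned on per-frame video features. The tier sizes, the use of GRU layers, and the way the visual features are injected are illustrative assumptions for this sketch, not the authors' exact architecture.

```python
# Minimal sketch (PyTorch) of a two-tier, SampleRNN-style hierarchical recurrent
# generator conditioned on video features. Layer sizes and conditioning scheme
# are illustrative assumptions, not the published model.
import torch
import torch.nn as nn

class VideoConditionedSampleRNNSketch(nn.Module):
    def __init__(self, frame_size=16, hidden=256, quant_levels=256, video_dim=512):
        super().__init__()
        self.frame_size = frame_size
        # Coarse tier: one step per frame of `frame_size` waveform samples,
        # conditioned on a per-frame visual feature vector.
        self.coarse_rnn = nn.GRU(frame_size + video_dim, hidden, batch_first=True)
        # Fine tier: predicts one quantized sample at a time within each frame,
        # conditioned on the coarse tier's hidden state.
        self.sample_embed = nn.Embedding(quant_levels, hidden)
        self.fine_rnn = nn.GRU(hidden * 2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, quant_levels)

    def forward(self, prev_frames, prev_samples, video_feats):
        # prev_frames:  (B, T, frame_size)   previous waveform frames (floats)
        # prev_samples: (B, T*frame_size)    previous quantized samples (int64)
        # video_feats:  (B, T, video_dim)    per-frame visual features
        B, T, _ = prev_frames.shape
        coarse_in = torch.cat([prev_frames, video_feats], dim=-1)
        coarse_out, _ = self.coarse_rnn(coarse_in)          # (B, T, hidden)
        # Broadcast each coarse state over the samples in its frame.
        cond = coarse_out.unsqueeze(2).expand(-1, -1, self.frame_size, -1)
        cond = cond.reshape(B, T * self.frame_size, -1)
        samp = self.sample_embed(prev_samples)              # (B, T*frame_size, hidden)
        fine_out, _ = self.fine_rnn(torch.cat([samp, cond], dim=-1))
        return self.out(fine_out)                           # logits over quant_levels
```

In this kind of setup the fine tier's output is a distribution over quantized sample values, so audio is generated one sample at a time while the coarse tier and the visual conditioning keep the long-range structure aligned with the video.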
“Evaluations show that over 70% of the generated sound from our models can fool humans into thinking that they are real”
Zhou and colleagues used a subset of video clips from a Google collection called AudioSet, which consists of over two million 10-second clips from YouTube. The videos are divided into human-labeled categories covering sound sources such as dogs, chainsaws, helicopters, and so forth.
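As an illustration of how such a category-restricted subset might be assembled, the sketch below filters a listing of labeled 10-second clips down to a few categories. The file layout, column names, and label strings are assumptions for this example, not the actual AudioSet release format.

```python
# Illustrative sketch: select a category-restricted subset of labeled clips.
# The CSV columns (youtube_id, start_s, label) are assumed for this example.
import csv

WANTED = {"dog", "chainsaw", "helicopter"}  # example categories

def load_subset(csv_path):
    clips = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["label"] in WANTED:
                clips.append((row["youtube_id"], float(row["start_s"])))
    return clips
```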
“Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs,” they conclude.
They tested whether the artificial audio could fool listeners by setting up an experiment on Amazon’s Mechanical Turk. The evaluations showed that over 70% of the sounds generated by the models could fool people into thinking they were real.
The researchers surmise that such capabilities could enable applications in virtual reality, by generating sound for virtual scenes automatically, or could make images and videos more accessible to people with visual impairments.
Source: 33rd Square