
Humans naturally learn by making connections between vision and sound. For example, we can watch someone playing a cello and recognize that the cellist’s movements are producing the music we hear.

A new approach developed by researchers at MIT and elsewhere improves an AI model's ability to learn in this same fashion. This could be useful in applications such as journalism and film production, where the model could help curate multimodal content through automatic video and audio retrieval.

In the longer run, this work could be used to improve a robot's ability to understand real-world environments, where auditory and visual information are often closely connected.

The researchers improved upon their group's prior work to create a method that helps machine-learning models align corresponding audio and visual data from video clips without the need for human labels.

They adjusted how their original model is trained so it learns a finer-grained correspondence between a particular video frame and the audio that occurs in that moment. The researchers also made some architectural tweaks that help the system balance two distinct learning objectives, which improves performance.

Taken together, these relatively simple improvements boost the accuracy of their approach in video retrieval tasks and in classifying the action in audiovisual scenes. For instance, the new method could automatically and precisely match the sound of a door slamming with the visual of it closing in a video clip.

“We are building AI systems that can process the world like humans do, in terms of having both audio and visual information coming in at once and being able to seamlessly process both modalities,” says Andrew Rouditchenko, an MIT graduate student and co-author of a paper on this research.

Rouditchenko is joined on the paper by lead author Edson Araujo, a graduate student at Goethe University in Germany; Yuan Gong, a former MIT postdoc; Saurabhchand Bhati, a current MIT postdoc; Samuel Thomas, Brian Kingsbury, and Leonid Karlinsky of IBM Research; Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Hilde Kuehne, professor of computer science at Goethe University and an affiliated professor at the MIT-IBM Watson AI Lab. The work will be presented at the Conference on Computer Vision and Pattern Recognition.

Syncing up

This work builds upon a machine-learning method the researchers developed a few years ago, which provided an efficient way to train a multimodal model to simultaneously process audio and visual data without the need for human labels.

The researchers fed this model, called CAV-MAE, unlabeled video clips, and it encoded the visual and audio data separately into representations called tokens. Using the natural audio from the recording, the model automatically learned to map corresponding pairs of audio and visual tokens close together within its internal representation space.
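To make the idea of tokens concrete, here is a minimal sketch, in PyTorch, of how audio and visual inputs are typically turned into token sequences in models of this kind: video frames and audio spectrograms are cut into patches, and each patch is projected to an embedding. The class name, patch size, and dimensions are illustrative assumptions, not details taken from CAV-MAE itself.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Split a 2-D input (an image frame or an audio spectrogram) into
    non-overlapping patches and project each patch to a token embedding.
    Sizes below are illustrative, not the paper's actual settings."""

    def __init__(self, in_channels, patch_size, embed_dim):
        super().__init__()
        # A strided convolution is a standard way to patchify and embed in one step.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                          # x: (batch, channels, height, width)
        tokens = self.proj(x)                      # (batch, embed_dim, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)   # (batch, num_tokens, embed_dim)

# Hypothetical usage: RGB video frames and 1-channel log-mel spectrograms.
visual_tokenizer = PatchTokenizer(in_channels=3, patch_size=16, embed_dim=768)
audio_tokenizer = PatchTokenizer(in_channels=1, patch_size=16, embed_dim=768)

frame = torch.randn(1, 3, 224, 224)            # one video frame
spectrogram = torch.randn(1, 1, 128, 1024)     # roughly 10 s of audio as a spectrogram

visual_tokens = visual_tokenizer(frame)        # (1, 196, 768)
audio_tokens = audio_tokenizer(spectrogram)    # (1, 512, 768)
```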

They found that using two learning objectives balances the model's learning process, which enables CAV-MAE to understand the corresponding audio and visual data while improving its ability to recover video clips that match user queries.

But CAV-MAE treats the audio and visual samples as one unit, so a 10-second video clip and the sound of a door slamming are mapped together, even if that audio event happens in just one second of the video.

In their improved model, called CAV-MAE Sync, the researchers split the audio into smaller windows before the model computes its representations of the data, so it generates separate representations that correspond to each smaller window of audio.

During training, the model learns to associate one video frame with the audio that occurs during just that frame.

“By doing that, the model learns a finer-grained correspondence, which helps with performance later when we aggregate this information,” Araujo says.
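As a rough illustration of that windowing step (a sketch under assumed settings, not the researchers' code), the snippet below splits a clip's audio spectrogram into equal time windows and pairs each window with the video frame sampled at the same moment; all shapes and counts are hypothetical.

```python
import torch

def split_into_windows(spectrogram, num_windows):
    """Split an audio spectrogram along its time axis into equal windows,
    one per sampled video frame (illustrative, not the paper's exact scheme)."""
    # spectrogram: (batch, mel_bins, time_steps)
    return torch.chunk(spectrogram, num_windows, dim=-1)

# Hypothetical 10-second clip: 10 sampled frames, one audio window per frame.
num_frames = 10
spectrogram = torch.randn(1, 128, 1000)          # 1,000 time steps over 10 s
frames = torch.randn(1, num_frames, 3, 224, 224)

audio_windows = split_into_windows(spectrogram, num_frames)

# Each (frame, audio window) pair shares a timestamp, so training can pull
# their representations together instead of matching the whole 10-s clip at once.
for t, window in enumerate(audio_windows):
    frame_t = frames[:, t]                       # the frame at time t
    print(t, frame_t.shape, window.shape)        # (1, 3, 224, 224), (1, 128, 100)
```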

They also incorporated architectural improvements that help the model balance its two learning objectives.

Adding “wiggle room”

The model incorporates a contrastive objective, where it learns to associate similar audio and visual data, and a reconstruction objective, which aims to recover specific audio and visual data based on user queries.
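For readers who want to see what such a pair of objectives can look like in code, here is a generic sketch of an InfoNCE-style contrastive loss combined with a masked-autoencoder-style reconstruction loss; the function names, temperature, and loss weighting are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """InfoNCE-style loss: matching audio/visual pairs in a batch should be
    more similar to each other than to any mismatched pair (illustrative)."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)
    logits = audio_emb @ visual_emb.t() / temperature    # (batch, batch)
    targets = torch.arange(logits.size(0))               # diagonal entries are the true pairs
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def reconstruction_loss(predicted_patches, original_patches):
    """Masked-autoencoder-style objective: rebuild the original audio/visual
    patches from the model's internal representation (illustrative)."""
    return F.mse_loss(predicted_patches, original_patches)

# Hypothetical combined objective; the relative weight of 1.0 is an assumption.
audio_emb, visual_emb = torch.randn(8, 768), torch.randn(8, 768)
pred, target = torch.randn(8, 196, 768), torch.randn(8, 196, 768)
total_loss = contrastive_loss(audio_emb, visual_emb) + 1.0 * reconstruction_loss(pred, target)
```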

In CAV-MAE Sync, the researchers introduced two new types of data representations, or tokens, to improve the model's learning ability.

They include dedicated “global tokens” that help with the contrastive learning objective and dedicated “register tokens” that help the model focus on important details for the reconstruction objective.

“Essentially, we add a bit more wiggle room to the model so it can perform each of these two tasks, contrastive and reconstructive, a bit more independently. That benefited overall performance,” Araujo adds.
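The snippet below sketches the general idea of adding learnable extra tokens to a transformer encoder's input, with the “global” token outputs feeding a contrastive objective and the patch outputs feeding reconstruction; the token counts, layer sizes, and class name are illustrative assumptions, not taken from CAV-MAE Sync.

```python
import torch
import torch.nn as nn

class EncoderWithExtraTokens(nn.Module):
    """Prepend learnable "global" and "register" tokens to the patch tokens
    before a transformer encoder. A sketch of the general idea; counts and
    sizes are illustrative, not the paper's settings."""

    def __init__(self, embed_dim=768, num_global=1, num_register=4, depth=2):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global, embed_dim))
        self.register_tokens = nn.Parameter(torch.zeros(1, num_register, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_global = num_global
        self.num_register = num_register

    def forward(self, patch_tokens):                 # (batch, num_patches, embed_dim)
        b = patch_tokens.size(0)
        extras = torch.cat([self.global_tokens.expand(b, -1, -1),
                            self.register_tokens.expand(b, -1, -1)], dim=1)
        out = self.encoder(torch.cat([extras, patch_tokens], dim=1))
        global_out = out[:, :self.num_global]                     # would feed the contrastive objective
        patch_out = out[:, self.num_global + self.num_register:]  # would feed the reconstruction objective
        return global_out, patch_out

encoder = EncoderWithExtraTokens()
global_repr, patch_repr = encoder(torch.randn(2, 196, 768))
```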

While the researchers had some intuition that these enhancements would improve the performance of CAV-MAE Sync, it took a careful combination of strategies to shift the model in the direction they wanted it to go.

“Because we have multiple modalities, we need a good model for both modalities by themselves, but we also need to get them to fuse together and collaborate,” Rouditchenko says.

In the end, their enhancements improved the model's ability to retrieve videos based on an audio query and to predict the class of an audio-visual scene, like a dog barking or an instrument playing.
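Audio-to-video retrieval of this kind usually boils down to ranking candidate clips by the similarity of their embeddings to an audio query's embedding. The sketch below shows that generic ranking step with hypothetical embeddings; it is not the researchers' evaluation code.

```python
import torch
import torch.nn.functional as F

def retrieve_videos(audio_query_emb, video_embs, top_k=5):
    """Rank candidate videos by cosine similarity to an audio query embedding.
    A generic retrieval sketch; real embeddings would come from trained encoders."""
    audio_query_emb = F.normalize(audio_query_emb, dim=-1)   # (embed_dim,)
    video_embs = F.normalize(video_embs, dim=-1)             # (num_videos, embed_dim)
    scores = video_embs @ audio_query_emb                    # (num_videos,)
    return torch.topk(scores, k=top_k)

# Hypothetical usage: find the clips that best match a door-slam sound.
audio_query = torch.randn(768)
video_bank = torch.randn(1000, 768)
top_scores, top_indices = retrieve_videos(audio_query, video_bank)
```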

Its results were more accurate than their prior work, and it also performed better than more complex, state-of-the-art methods that require larger amounts of training data.

“Sometimes, very simple ideas, or small patterns you see in the data, have a lot of value,” Araujo says.

In the future, the researchers want to incorporate new models that generate better data representations into CAV-MAE Sync, which could improve performance. They also want to enable their system to handle text data, which would be an important step toward generating an audiovisual large language model.

This work was funded, in part, by the German Federal Ministry of Education and Research and the MIT-IBM Watson AI Lab.
