AI Learns to Connect Vision and Sound Without Human Intervention

MIT researchers have developed a machine-learning model that learns to associate audio and visual data without human intervention, allowing an AI system to connect sight and sound and make sense of the world much as humans do.

The new model, called CAV-MAE Sync, improves upon previous work by learning a finer-grained correspondence between specific video frames and the audio that accompanies them. This lets the AI retrieve videos from audio queries more accurately and predict the class of an audio-visual scene, such as the sound of a roller coaster or an airplane taking off.
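To make the retrieval idea concrete, here is a minimal sketch of audio-to-video retrieval in a shared embedding space. The encoders below are fixed random projections standing in for the real audio and visual networks (they are hypothetical placeholders, not the CAV-MAE Sync architecture, and the dimensions are assumptions); the step that mirrors the described capability is embedding both modalities, normalizing, and ranking videos by cosine similarity to an audio query.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 256
AUDIO_DIM = 512      # e.g. a flattened spectrogram (assumed size)
FRAME_DIM = 1024     # e.g. a flattened video frame (assumed size)

# Placeholder "encoders": fixed random projections into a shared space.
AUDIO_PROJ = rng.standard_normal((AUDIO_DIM, EMBED_DIM))
VIDEO_PROJ = rng.standard_normal((FRAME_DIM, EMBED_DIM))

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def encode_audio(clip: np.ndarray) -> np.ndarray:
    """Project an audio clip into the shared embedding space."""
    return l2_normalize(clip @ AUDIO_PROJ)

def encode_video(frames: np.ndarray) -> np.ndarray:
    """Embed each frame, then mean-pool over time into one video embedding."""
    return l2_normalize((frames @ VIDEO_PROJ).mean(axis=0))

# Toy "database" of three videos, each a stack of 8 flattened frames.
videos = [rng.standard_normal((8, FRAME_DIM)) for _ in range(3)]
video_embeddings = np.stack([encode_video(v) for v in videos])

# An audio query, e.g. the sound of a roller coaster.
audio_query = rng.standard_normal(AUDIO_DIM)
query_embedding = encode_audio(audio_query)

# With unit-norm embeddings, cosine similarity is just a dot product.
scores = video_embeddings @ query_embedding
ranking = np.argsort(scores)[::-1]
print("Videos ranked by similarity to the audio query:", ranking.tolist())
```

In a trained system the two encoders would be learned jointly so that matching audio and video land near each other in the shared space; the ranking step itself stays the same.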

Because the model balances its learning objectives while training directly on unlabeled audio-visual data, it marks a notable step forward for artificial intelligence. With potential applications in robotics, multimodal content curation, and future AI development, the technology could influence a wide range of industries.

The researchers behind the project are affiliated with the MIT-IBM Watson AI Lab and aim to further improve CAV-MAE Sync by incorporating better data representations and extending the system to handle text data.

About the author

TOOLHUNT

Effortlessly find the right tools for the job.
