Researchers from the University of Texas San Antonio have created an AI-based application capable of observing the actions taking place in a video and creating artificial sound effects to match those actions. The sound effects generated by the program are reportedly so realistic that when human observers were polled, they typically thought the sound effects were legitimate.
The program responsible for generating the sound effects, AudioFoley, was detailed in a study recently published in IEEE Transactions on Multimedia. According to IEEE Spectrum, the AI program was developed by Jeff Provost, professor at UT San Antonio, and Ph.D. student Sanchita Ghose. The researchers created the program utilizing multiple machine learning models joined together.
The first task in generating sound effects appropriate to the actions on a screen was recognizing those actions and mapping them to sound effects. To accomplish this, the researchers designed two different machine learning models and tested their different approaches. The first model operates by extracting frames from the videos it is fed and analyzing these frames for relevant features like motions and colors. Afterward, a second model was employed to analyze how the position of an object changes across frames, to extract temporal information. This temporal information is used to anticipate the next likely actions in the video. The two models have different methods of analyzing the actions in the clip, but they both use the information contained in the clip to guess what sound would best accompany it.
The next task is to synthesize the sound, and this is accomplished by matching activities/predicted motions to possible sound samples. According to Ghose and Prevost, AutoFoley was used to generate sound for 1000 short clips, featuring actions and items like a fire, a running horse, ticking clocks, and rain falling on plants. While AutoFoley was most successful in creating sound for clips where there didn’t need to be a perfect match between the actions and sounds, and it had trouble matching clips where actions happened with more variation, the program was still able to fool many human observers into picking its generated sounds over the sound that originally accompanied a clip.
Prevost and Ghose recruited 57 college students and had them watch different clips. Some clips contained the original audio, some contained audio generated by AutoFoley. When the first model was tested, approximately 73% of the students selected the synthesized audio as the original audio, neglecting the true sound that accompanied the clip. The other model performed slightly worse, with only 66% of the participants selecting the generated audio over the original audio.
Prevost explained that AutoFoley could potentially be used to expedite the process of producing movies, television, and other pieces of media. Prevost notes that a realistic Foley track is important to making media engaging and believable, but that the Foley process often takes a significant amount of time to complete. Having an automated system that could handle the creation of basic Foley elements could make producing media cheaper and quicker.
Currently, AutoFoley has some notable limitations. For one, while the model seems to perform well while observing events that have stable, predictable motions, it suffers when trying to generate audio for events with variation in time (like thunderstorms). Beyond this, it also requires that the classification subject is present in the entire clip and doesn’t leave the frame. The research team is aiming to address these issues with future versions of the application.