Deep learning has become a buzz word in many endeavors, and broadcasting organizations are also among those that have to start to explore all the potential it has to offer, from news reporting to feature films and programs, both in the cinemas and on TV.
As TechRadar reported, the number of opportunities deep learning presents in the field of video production, editing and cataloging are already quite high. But as is noted, this technology is not just limited to what is considered repetitive tasks in broadcasting, since it can also “enhance the creative process, improve video delivery and help preserve the massive video archives that many studios keep.”
As far as video generation and editing are concerned, it is mentioned that Warner Bros. recently had to spend $25M on reshoots for ‘Justice League’ and part of that money went to digitally removing a mustache that star Henry Cavill had grown and could not shave due to an overlapping commitment. The use of deep learning in such time-consuming and financially taxing processes in post-production will certainly be put to good use.
Even widely available solutions like Flo make it possible to use deep learning in creating automatically a video just by describing your idea. The software then searches for possible relevant videos that are stored in a certain library and edits them together automatically.
Flo is also able to sort and classify videos, making it easier to find a particular part of the footage. Such technologies also make it possible to easily remove undesirable footage or make a personal recommendation list based on a video somebody has expressed an interest in.
Google has come up with a neural network “that can automatically separate the foreground and background of a video. What used to require a green screen can now be done with no special equipment.”
The deep fake has already made a name for itself, both good and bad, but its potential use in special effects has already reached quite a high level.
The area where deep learning will certainly make a difference in the restoration of classic films, as the UCLA Film & Television Archive, nearly half of all films produced prior to 1950 have disappeared and 90% of the classic film prints are currently in a very poor condition.
Colorizing black and white footage is still a controversial subject among the filmmakers, but those who decide to go that route can now use Nvidia tools, which will significantly shorten such a lengthy process as it now requires that the artist colors only one frame of a scene and deep learning will do the rest from there. On the other hand, Google has come up with a technology that is able to recreate part of a video-recorded scene based on start and end frames.
Face/Object recognition is already actively used, from classifying a video collection or archive, searching for clips with a given actor or newsperson, or counting the exact time of an actor in a video or film. TechRadar mentions that Sky News recently used facial recognition to identify famous faces at the royal wedding.
This technology is now becoming widely used in sports broadcasting to, say, “track the movements of the ball, or to identify other key elements to the game, such as the goal.” In soccer (football) this technology, given the name VAR is actually used in many official tournaments and national leagues as a referee’s tool during the game.
Streaming is yet another aspect of broadcasting that can benefit from deep learning. Neural networks can recreate high definition frames from low definition input, making it possible for the viewer to benefit from better viewing, even if the original input signal is not fully up to the standard.
Experts Overcome Major Obstacle in AI Technology Using Brain Mechanism
A group of artificial intelligence (AI) experts from various institutions have overcome a “major, long-standing obstacle to increasing AI capabilities.” The team looked toward the human brain, which is the case for many AI developments. Specifically, the team focused on the human brain memory mechanism known as “replay.”
Gido van de Ven is the first author and a postdoctoral researcher. He was joined by principal investigator Andreas Tolias at Baylor, as well as Hava Siegelmann at UMass Amherst.
The research was published in Nature Communications.
The New Method
According to the researchers, they have come up with a new method that efficiently protects deep neural networks from “catastrophic forgetting.” When a neural network takes on new learning, it can forget what was previously learned.
This obstacle is what stops many AI advancements from taking place.
“One solution would be to store previously encountered examples and revisit them when learning something new. Although such ‘replay’ or ‘rehearsal’ solves catastrophic forgetting, constantly retraining on all previously learned tasks is highly inefficient and the amount of data that would have to be stored becomes unmanageable quickly,” the researchers wrote.
The Human Brain
The researchers drew inspiration from the human brain, since it is able to build up information without forgetting, which is not the case for AI neural networks. The current development was built on previous work done by the researchers, including findings regarding a mechanism in the brain that is believed to be responsible for preventing memories from being forgotten. This mechanism is the replay of neural activity patterns.
According to Siegelmann, the major development comes from “recognizing that replay in the brain does not store data,” but “the brain generated representations of memories at a high, more abstract level with no need to generate detailed memories.”
Siegelmann took this information and joined her colleagues in order to develop a brain-like replay with artificial intelligence, where there was no data stored. As is the case for the human brain, the artificial network takes what it has seen before in order to generate high-level representations.
The method was highly efficient, with even just a few replayed generated representations resulting in older memories being remembered while new ones were learned. Generative replay is effective at preventing catastrophic forgetting, and one of the major benefits is that it allows the system to generalize from one situation to another.
According to van de Ven, “If our network with generative replay first learns to separate cats from dogs, and then to separate bears from foxes, it will also tell cats from foxes without specifically being trained to do so. And notably, the more the system learns, the better it becomes at learning new tasks.”
“We propose a new, brain-inspired variant of replay in which internal or hidden representations are replayed that are generated by the network’s own, context-modulated feedback connections,” the team writes. “Our method achieved state-of-the-art performance on challenging continual learnings benchmarks without storing data, and it provides a novel model for abstract replay in the brain.”
“Our method makes several interesting predictions about the way replay might contribute to memory consolidation in the brian,” Van de Ven continues. “We are already running an experiment to test some of these predictions.”
Researchers Use Deep Learning to Turn Landmark Photos 4D
Researchers at Cornell University have developed a new method that utilizes deep learning in order to turn world landmark photos 4D. The team relied on publicly available tourist photos of major points like the Trevi Fountain in Rome, and the end results are 3D images that are maneuverable and can show changes in appearance over time.
The newly developed method takes in and synthesizes tens of thousands of untagged and undated photos, and it is a big step forward for computer vision.
The work is titled “Crowdsampling the Plenoptic Function,” and it was presented at the virtual European Conference on Computer Vision, which took place between Aug. 23-28.
Noah Snavely is an associate professor of computer science at Cornell Tech and senior author of the paper. Other contributors include Cornell doctoral student Zhengqi Li, first author of the paper, as well as Abe Davis, assistant professor of computer science in the Faculty of Computing and Information Science, and Cornell Tech doctoral student Wenqi Xian.
“It’s a new way of modeling scene that not only allows you to move your head and see, say, the fountain from different viewpoints, but also gives you controls for changing the time,” Snavely said.
“If you really went to the Trevi Fountain on your vacation, the way it would look would depend on what time you went — at night, it would be lit up by floodlights from the bottom. In the afternoon, it would be sunlit, unless you went on a cloudy day,” he continued. “We learned the whole range of appearances, based on time of day and weather, from these unorganized photo collections, such that you can explore the whole range and simultaneously move around the scene.”
Traditional Computer Vision Limitations
Since there can be so many different textures present that need to be reproduced, it is difficult for traditional computer vision to represent places accurately through photos.
“The real world is so diverse in its appearance and has different kinds of materials — shiny things, water, thin structures,” Snavely said.
Besides those barriers, traditional computer vision also struggles with inconsistent data. Plenoptic function is how something appears from every possible viewpoint in space and time, but in order to reproduce this, hundreds of webcams are required at the scene. Not only that, but they would have to be recording all throughout the day and night. This could be done, but it is an extremely resource-heavy task when looking at the number of scenes where this method would be required.
Learning from Other Photos
In order to get around this, the team of researchers developed the new method.
“There may not be a photo taken at 4 p.m. from this exact viewpoint in the data set. So we have to learn from a photo taken at 9. p.m. at one location, and a photo taken at 4:03 from another location,” said Snavely. “And we don’t know the granularity of when these photos were taken. But using deep learning allows us to infer what the scene would have looked like at any given time and place.”
A new scene representation called Deep Multiplane Images was introduced by the researchers in order to interpolate appearance in four dimensions, which are 3D and changes over time.
According to Snavely, “We use the same idea invented for creating 3D effects in 2D animation to create 3D effects in real-world scenes, to create this deep multilayer image by fitting it to all these disparate measurements from the tourists’ photos. It’s interesting that it kind of stems from this very old, classic technique used in animation.”
The study demonstrated that the trained model could create a scene with 50,000 publicly available images from various sites. The team believes that it could have implications in many areas, including computer vision research and virtual tourism.
“You can get the sense of really being there,” Snavely said. “It works surprisingly well for a range of scenes.”
The project received support from former Google CEO and philanthropist Eric Schmidt, as well as Wendt Schmidt.
AI Researchers Design Program To Generate Sound Effects For Movies and Other Media
Researchers from the University of Texas San Antonio have created an AI-based application capable of observing the actions taking place in a video and creating artificial sound effects to match those actions. The sound effects generated by the program are reportedly so realistic that when human observers were polled, they typically thought the sound effects were legitimate.
The program responsible for generating the sound effects, AudioFoley, was detailed in a study recently published in IEEE Transactions on Multimedia. According to IEEE Spectrum, the AI program was developed by Jeff Provost, professor at UT San Antonio, and Ph.D. student Sanchita Ghose. The researchers created the program utilizing multiple machine learning models joined together.
The first task in generating sound effects appropriate to the actions on a screen was recognizing those actions and mapping them to sound effects. To accomplish this, the researchers designed two different machine learning models and tested their different approaches. The first model operates by extracting frames from the videos it is fed and analyzing these frames for relevant features like motions and colors. Afterward, a second model was employed to analyze how the position of an object changes across frames, to extract temporal information. This temporal information is used to anticipate the next likely actions in the video. The two models have different methods of analyzing the actions in the clip, but they both use the information contained in the clip to guess what sound would best accompany it.
The next task is to synthesize the sound, and this is accomplished by matching activities/predicted motions to possible sound samples. According to Ghose and Prevost, AutoFoley was used to generate sound for 1000 short clips, featuring actions and items like a fire, a running horse, ticking clocks, and rain falling on plants. While AutoFoley was most successful in creating sound for clips where there didn’t need to be a perfect match between the actions and sounds, and it had trouble matching clips where actions happened with more variation, the program was still able to fool many human observers into picking its generated sounds over the sound that originally accompanied a clip.
Prevost and Ghose recruited 57 college students and had them watch different clips. Some clips contained the original audio, some contained audio generated by AutoFoley. When the first model was tested, approximately 73% of the students selected the synthesized audio as the original audio, neglecting the true sound that accompanied the clip. The other model performed slightly worse, with only 66% of the participants selecting the generated audio over the original audio.
Prevost explained that AutoFoley could potentially be used to expedite the process of producing movies, television, and other pieces of media. Prevost notes that a realistic Foley track is important to making media engaging and believable, but that the Foley process often takes a significant amount of time to complete. Having an automated system that could handle the creation of basic Foley elements could make producing media cheaper and quicker.
Currently, AutoFoley has some notable limitations. For one, while the model seems to perform well while observing events that have stable, predictable motions, it suffers when trying to generate audio for events with variation in time (like thunderstorms). Beyond this, it also requires that the classification subject is present in the entire clip and doesn’t leave the frame. The research team is aiming to address these issues with future versions of the application.
- Andrew Stein, Software Engineer Waymo – Interview Series
- Michael Schrage, Author of Recommendation Engines (The MIT Press) – Interview Series
- Scientists Detect Loneliness Through The Use Of AI And NLP
- Engineers Develop New Machine-Learning Method Capable of Cutting Energy Use
- Artificial Intelligence Enhances Speed of Discoveries For Particle Physics