Researchers at Cornell University have developed a method that uses deep learning to turn photos of world landmarks into 4D scenes. The team relied on publicly available tourist photos of major landmarks such as the Trevi Fountain in Rome, and the end results are navigable 3D images whose appearance can also be changed over time.
The newly developed method ingests and synthesizes tens of thousands of untagged and undated photos, a significant step forward for computer vision.
The work, titled “Crowdsampling the Plenoptic Function,” was presented at the virtual European Conference on Computer Vision, which took place Aug. 23-28.
Noah Snavely is an associate professor of computer science at Cornell Tech and senior author of the paper. Other contributors include Cornell doctoral student Zhengqi Li, first author of the paper, as well as Abe Davis, assistant professor of computer science in the Faculty of Computing and Information Science, and Cornell Tech doctoral student Wenqi Xian.
“It’s a new way of modeling a scene that not only allows you to move your head and see, say, the fountain from different viewpoints, but also gives you controls for changing the time,” Snavely said.
“If you really went to the Trevi Fountain on your vacation, the way it would look would depend on what time you went — at night, it would be lit up by floodlights from the bottom. In the afternoon, it would be sunlit, unless you went on a cloudy day,” he continued. “We learned the whole range of appearances, based on time of day and weather, from these unorganized photo collections, such that you can explore the whole range and simultaneously move around the scene.”
Traditional Computer Vision Limitations
Since there can be so many different textures present that need to be reproduced, it is difficult for traditional computer vision to represent places accurately through photos.
“The real world is so diverse in its appearance and has different kinds of materials — shiny things, water, thin structures,” Snavely said.
Besides those barriers, traditional computer vision also struggles with inconsistent data. The plenoptic function describes how a scene appears from every possible viewpoint in space and at every moment in time; capturing it directly would require hundreds of webcams at the scene, recording throughout the day and night. That could be done for a single site, but it is an extremely resource-heavy approach given the number of scenes where such a method would be needed.
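In its classic formulation (due to Adelson and Bergen; the notation below is a rough sketch, not taken from the Cornell paper), the plenoptic function gives observed light intensity as a function of viewing position, viewing direction, wavelength, and time:

\[
P = P(V_x, V_y, V_z, \theta, \phi, \lambda, t)
\]

Recovering even a restricted slice of this seven-dimensional function from scattered, uncoordinated tourist photos is the problem the Cornell method tackles.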
Learning from Other Photos
To get around these obstacles, the team of researchers developed the new method.
“There may not be a photo taken at 4 p.m. from this exact viewpoint in the data set. So we have to learn from a photo taken at 9 p.m. at one location, and a photo taken at 4:03 from another location,” said Snavely. “And we don’t know the granularity of when these photos were taken. But using deep learning allows us to infer what the scene would have looked like at any given time and place.”
To interpolate appearance in four dimensions (3D space plus change over time), the researchers introduced a new scene representation called Deep Multiplane Images.
According to Snavely, “We use the same idea invented for creating 3D effects in 2D animation to create 3D effects in real-world scenes, to create this deep multilayer image by fitting it to all these disparate measurements from the tourists’ photos. It’s interesting that it kind of stems from this very old, classic technique used in animation.”
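The multiplane idea borrowed from animation can be illustrated with a minimal alpha-compositing sketch (the function name, shapes, and example values here are illustrative, not from the paper): the scene is represented as a stack of semi-transparent RGBA planes at different depths, rendered back to front with the standard “over” operator.

```python
import numpy as np

def composite_multiplane(layers):
    """Composite a stack of RGBA planes, ordered far -> near.

    layers: list of (H, W, 4) float arrays with straight (unpremultiplied)
    alpha in [0, 1]. Returns an (H, W, 3) RGB image.
    """
    h, w, _ = layers[0].shape
    out = np.zeros((h, w, 3))
    for layer in layers:                         # back to front
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)  # "over" operator
    return out

# A far opaque red plane behind a near half-transparent blue plane:
far = np.array([[[1.0, 0.0, 0.0, 1.0]]])
near = np.array([[[0.0, 0.0, 1.0, 0.5]]])
print(composite_multiplane([far, near]))  # 50/50 blend: [[[0.5 0.  0.5]]]
```

Shifting the planes relative to each other before compositing is what produces the parallax (“3D”) effect; the Cornell work learns the planes’ contents, and their time-dependent appearance, from photos rather than hand-authoring them as animators did.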
The study demonstrated that the model, trained on roughly 50,000 publicly available images from various sites, could convincingly recreate those scenes. The team believes the technique could have implications in many areas, including computer vision research and virtual tourism.
“You can get the sense of really being there,” Snavely said. “It works surprisingly well for a range of scenes.”
The project received support from former Google CEO Eric Schmidt and philanthropist Wendy Schmidt.
AI Researchers Design Program To Generate Sound Effects For Movies and Other Media
Researchers from the University of Texas at San Antonio have created an AI-based application capable of observing the actions taking place in a video and creating artificial sound effects to match those actions. The sound effects generated by the program are reportedly so realistic that polled human observers typically believed they were the real recordings.
The program responsible for generating the sound effects, AutoFoley, was detailed in a study recently published in IEEE Transactions on Multimedia. According to IEEE Spectrum, the AI program was developed by Jeff Prevost, professor at UT San Antonio, and Ph.D. student Sanchita Ghose. The researchers created the program by joining multiple machine learning models together.
The first task in generating sound effects appropriate to the actions on a screen was recognizing those actions and mapping them to sound effects. To accomplish this, the researchers designed two different machine learning models and tested their different approaches. The first model operates by extracting frames from the videos it is fed and analyzing these frames for relevant features like motions and colors. Afterward, a second model was employed to analyze how the position of an object changes across frames, to extract temporal information. This temporal information is used to anticipate the next likely actions in the video. The two models have different methods of analyzing the actions in the clip, but they both use the information contained in the clip to guess what sound would best accompany it.
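The two kinds of features the models extract can be sketched with a toy example (purely illustrative stand-ins; the actual models are learned neural networks, not these hand-written statistics): a per-frame appearance summary, and a frame-to-frame change signal that proxies motion over time.

```python
import numpy as np

def spatial_features(frames):
    """Per-frame appearance summary (stand-in for learned color/texture
    features): the mean RGB value of each frame."""
    return np.stack([f.mean(axis=(0, 1)) for f in frames])   # shape (T, 3)

def temporal_features(frames):
    """Frame-to-frame change (stand-in for learned temporal features):
    mean absolute difference between consecutive frames."""
    return np.array([np.abs(frames[i + 1] - frames[i]).mean()
                     for i in range(len(frames) - 1)])       # shape (T - 1,)

# Three 4x4 RGB frames: dark, bright, bright (a sudden "event", then stillness)
frames = [np.zeros((4, 4, 3)), np.ones((4, 4, 3)), np.ones((4, 4, 3))]
print(spatial_features(frames).shape)  # (3, 3)
print(temporal_features(frames))       # [1. 0.]
```

In the real system, features like these feed a predictor that selects or synthesizes a matching sound; the toy temporal signal above already shows where in the clip something audible is likely to happen.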
The next task is to synthesize the sound, which is accomplished by matching activities and predicted motions to possible sound samples. According to Ghose and Prevost, AutoFoley was used to generate sound for 1,000 short clips featuring actions and items like a fire, a running horse, ticking clocks, and rain falling on plants. AutoFoley was most successful on clips that did not require a perfect match between actions and sounds, and it had trouble with clips where the timing of actions varied more, but the program was still able to fool many human observers into picking its generated sounds over the sound that originally accompanied a clip.
Prevost and Ghose recruited 57 college students and had them watch different clips, some containing the original audio and some containing audio generated by AutoFoley. When the first model was tested, approximately 73% of the students selected the synthesized audio as the original, passing over the true sound that accompanied the clip. The second model performed slightly worse, with 66% of the participants selecting the generated audio over the original.
Prevost explained that AutoFoley could potentially be used to expedite the process of producing movies, television, and other pieces of media. Prevost notes that a realistic Foley track is important to making media engaging and believable, but that the Foley process often takes a significant amount of time to complete. Having an automated system that could handle the creation of basic Foley elements could make producing media cheaper and quicker.
Currently, AutoFoley has some notable limitations. For one, while the model performs well on events with stable, predictable motions, it struggles to generate audio for events whose timing varies (like thunderstorms). It also requires that the classification subject be present in the entire clip and never leave the frame. The research team aims to address these issues in future versions of the application.
Astronomers Apply AI to Discover and Classify Galaxies
A research group of astronomers, most of them from the National Astronomical Observatory of Japan (NAOJ), is now applying artificial intelligence (AI) to ultra-wide field-of-view images of the universe captured by the Subaru Telescope. The group has achieved a high accuracy rate in finding and classifying spiral galaxies in those images.
This technique is used along with citizen science, and the two are expected to lead to more discoveries in the future.
The researchers applied a deep-learning technique to classify galaxies in a large dataset of images obtained with the Subaru Telescope. Thanks to its extremely high sensitivity, the telescope has detected around 560,000 galaxies in the images.
AI is crucial here because identifying that many galaxies by eye for morphological classification would be nearly impossible. With AI, the team was able to process the information without human intervention.
The work was published in Monthly Notices of the Royal Astronomical Society.
Automated Processing Techniques
Since around 2012, automated techniques for extracting and judging features with deep-learning algorithms have developed rapidly. They are often more accurate than humans and are used in autonomous vehicles, security cameras, and various other applications.
Dr. Ken-ichi Tadaki, a Project Assistant Professor at NAOJ, came up with the idea behind the work: if AI can classify images of cats and dogs, it should also be able to distinguish “galaxies with spiral patterns” from “galaxies without spiral patterns.”
Using training data prepared by humans, the AI successfully classified galaxy morphologies with an accuracy of 97.5%. Applied to the full data set, it identified spirals in about 80,000 galaxies.
Since the new technique proved effective at identifying galaxies, the group can now use it to sort galaxies into more detailed classes, by training the AI on many galaxies that have already been classified by humans.
NAOJ also runs a newly created citizen-science project called “GALAXY CRUISE,” in which citizens examine galaxy images taken with the Subaru Telescope, looking for features that suggest the galaxy is merging or colliding with another galaxy.
Associate Professor Masayuki Tanaka is the advisor of “GALAXY CRUISE,” and he strongly believes in the study of galaxies through artificial intelligence.
“The Subaru Strategic Program is serious Big Data containing an almost countless number of galaxies. Scientifically, it is very interesting to tackle such big data with a collaboration of citizen astronomers and machines,” Tanaka says. “By employing deep-learning on top of the classifications made by citizen scientists in GALAXY CRUISE, chances are, we can find a great number of colliding and merging galaxies.”
The new technique created by the group of astronomers has big implications for the field. It is another example of how artificial intelligence will not only change life on our planet, but how it will also help us expand our knowledge beyond.
Researchers Create AI Model Capable Of Singing In Both Chinese and English
A team of researchers from Microsoft and Zhejiang University have recently created an AI model capable of singing in multiple languages. As VentureBeat reported, the DeepSinger AI developed by the team was trained on data from various music websites, using algorithms that captured the timbre of each singer’s voice.
Generating the “voice” of an AI singer requires algorithms capable of predicting and controlling both the pitch and duration of audio. When people sing, the sounds they produce have vastly more complex rhythms and patterns than simple speech. Another problem for the team to overcome was that while a fair amount of speech training data is available, singing data sets are fairly rare. Combine these challenges with the fact that songs need both their sound and their lyrics analyzed, and the problem of generating singing becomes incredibly complex.
The DeepSinger system overcame these challenges with a data pipeline that mined and transformed audio data. Clips of singing were extracted from various music websites; the singing was then isolated from the rest of the audio and divided into sentences. The next step was to determine the duration of every phoneme within the lyrics, resulting in a series of samples, each representing a unique phoneme. After the lyrics and accompanying audio samples were sorted according to confidence score, the data was cleaned to remove any distorted training samples.
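The final sorting-and-cleaning step can be sketched in a few lines (the `Segment` fields and the 0.8 threshold are assumptions for illustration, not the paper's actual data structures or values): each phoneme-aligned sample carries an alignment confidence, and poorly aligned samples are dropped before training.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    phoneme: str
    duration_s: float
    confidence: float  # lyric/audio alignment confidence (hypothetical field)

def clean(segments, min_confidence=0.8):
    """Sort samples by alignment confidence, then drop poorly aligned ones."""
    ranked = sorted(segments, key=lambda s: s.confidence, reverse=True)
    return [s for s in ranked if s.confidence >= min_confidence]

raw = [Segment("a", 0.21, 0.95), Segment("b", 0.40, 0.30), Segment("c", 0.18, 0.88)]
print([s.phoneme for s in clean(raw)])  # ['a', 'c']
```

Filtering on an automatic confidence score is what lets a pipeline like this scale to web-mined audio: no human ever has to listen to the discarded samples.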
The same methods seem to work across languages. DeepSinger was trained on Chinese, Cantonese, and English vocal samples comprising over 92 hours of singing from 89 different singers. The study found that DeepSinger reliably generated high-quality “singing” samples according to metrics like pitch accuracy and how natural the singing sounded. The researchers had 20 people rate both songs generated by DeepSinger and the training songs on these metrics, and the gap was quite small: the mean opinion scores for the generated samples deviated from those for genuine audio by between 0.34 and 0.76.
Looking forward, the researchers want to improve the quality of the generated voices by jointly training the various submodels that make up DeepSinger, with the assistance of specialized technologies like WaveNet that are designed for generating natural-sounding audio waveforms.
The DeepSinger system could help singers and other musical artists correct their work without having to head back into the studio for another recording session. It could also potentially be used to create audio deepfakes, making it seem as though an artist sang a song they never actually performed. While such uses could count as parody or satire, they are also of dubious legality.
DeepSinger is just one of a wave of new AI-based music and audio systems that could transform how music and software interact. OpenAI recently released its own AI system, dubbed JukeBox, which is capable of producing original music tracks in the style of a certain genre or even a specific artist. Other musical AI tools include Google’s Magenta and Amazon’s DeepComposer. Magenta is an open-source audio (and image) manipulation library that can be used to produce everything from automated drum backing to simple music-based video games. Meanwhile, Amazon’s DeepComposer is targeted at those who want to train and customize their own music-based deep learning models, allowing users to take pre-trained sample models and tweak them to their needs.
You can listen to some of the audio samples generated by DeepSinger at this link.