A team of researchers from Microsoft and Zhejiang University has recently created an AI model capable of singing in numerous languages. As VentureBeat reported, the DeepSinger AI developed by the team was trained on data from various music websites, using algorithms that captured the timbre of the singer’s voice.
Generating the “voice” of an AI singer requires algorithms capable of predicting and controlling both the pitch and duration of audio. When people sing, the sounds they produce have vastly more complex rhythms and patterns than ordinary speech. Another problem for the team to overcome was that while a fair amount of speech training data is available, singing training data sets are fairly rare. Combine these challenges with the fact that songs need to have both sound and lyrics analyzed, and the problem of generating singing becomes incredibly complex.
The researchers overcame these challenges by building DeepSinger around a data pipeline that mines and transforms audio data. Clips of singing were extracted from various music websites, then the singing was isolated from the rest of the audio and divided into sentences. The next step was to determine the duration of every phoneme within the lyrics, resulting in a series of samples, each representing a unique phoneme in the lyrics. After the lyrics and accompanying audio samples were sorted by confidence score, the data was cleaned to remove any distorted training samples.
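The filtering stage at the end of that pipeline can be sketched in a few lines. This is a minimal, runnable illustration only: the sample data, field names, and confidence threshold are assumptions for the sake of the example, not values or code from the researchers' actual system.

```python
# Toy sketch of confidence-based data cleaning: each training sample pairs a
# phoneme with its estimated sung duration and an alignment confidence score.
# Samples are sorted by confidence and low-confidence (likely distorted)
# samples are dropped. All names and numbers here are illustrative.

def filter_samples(samples, min_confidence=0.8):
    """Sort samples by alignment confidence and drop distorted ones."""
    ranked = sorted(samples, key=lambda s: s["confidence"], reverse=True)
    return [s for s in ranked if s["confidence"] >= min_confidence]

samples = [
    {"phoneme": "AH", "duration_ms": 420, "confidence": 0.95},
    {"phoneme": "N",  "duration_ms": 110, "confidence": 0.55},  # likely distorted
    {"phoneme": "S",  "duration_ms": 180, "confidence": 0.88},
]

clean = filter_samples(samples)
print([s["phoneme"] for s in clean])  # only high-confidence samples survive
```

A real pipeline would compute the confidence score from the phoneme-alignment model itself; here it is supplied by hand to keep the sketch self-contained.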
The same methods seem to work across a variety of languages. DeepSinger was trained on Chinese, Cantonese, and English vocal samples comprising more than 92 hours of singing from 89 different singers. The study found that the DeepSinger system could reliably generate high-quality “singing” samples according to metrics like pitch accuracy and naturalness. The researchers had 20 people rate both songs generated by DeepSinger and the training songs according to these metrics, and the gap between scores for the generated samples and genuine audio was quite small: DeepSinger's mean opinion score deviated from that of the genuine audio by between 0.34 and 0.76.
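The mean-opinion-score comparison works by averaging listener ratings for each condition and taking the difference. The ratings below are invented purely to show the arithmetic; they are not the study's data.

```python
# Toy mean opinion score (MOS) comparison: listeners rate genuine and
# generated clips (e.g. on a 1-5 scale), and the reported gap is simply the
# difference between the two mean ratings. The numbers are made up.

def mean_opinion_score(ratings):
    return sum(ratings) / len(ratings)

real_ratings      = [4.5, 4.2, 4.4, 4.6]  # ratings of genuine recordings
generated_ratings = [4.0, 3.9, 4.1, 3.8]  # ratings of DeepSinger output

gap = mean_opinion_score(real_ratings) - mean_opinion_score(generated_ratings)
print(gap)  # a small gap means the generated singing sounds close to real
```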
Looking forward, the researchers want to improve the quality of the generated voices by jointly training the various submodels that comprise DeepSinger, assisted by specialized technologies like WaveNet that are designed specifically to generate natural-sounding speech from audio waveforms.
The DeepSinger system could be used to help singers and other musical artists make corrections to their work without having to head back into the studio for another recording session. It could also potentially be used to create audio deepfakes, making it seem as if an artist sang a song they never actually did. While such output could be used for parody or satire, it is also of dubious legality.
DeepSinger is just one of a wave of new AI-based music and audio systems that could transform how music and software interact. OpenAI recently released its own AI system, dubbed Jukebox, capable of producing original music tracks in the style of a certain genre or even a specific artist. Other musical AI tools include Google’s Magenta and Amazon’s DeepComposer. Magenta is an open-source audio (and image) manipulation library that can be used to produce everything from automated drum backing to simple music-based video games. Meanwhile, Amazon’s DeepComposer is targeted at those who want to train and customize their own music-based deep learning models, allowing users to take pre-trained sample models and tweak them to their needs.
You can listen to some of the audio samples generated by DeepSinger at this link.