The independent research organization OpenAI has recently released a generative AI model dubbed Jukebox, so named for its ability to generate music. Jukebox can generate audio conditioned on attributes like instrumentation and even lyrics, and the OpenAI research team built it by training on compressed audio clips paired with snippets of lyrics.
As TechCrunch reported, the OpenAI researchers trained the model on raw audio clips, which gives the model the ability to produce audio directly. This is in contrast to the approach behind many other music generation applications, which often rely on “symbolic music” (like MIDI), a representation that carries information about notes and pitches but no actual audio. The researchers used convolutional neural networks to compress the audio and encode it into a format the model could work with. A transformer was then used to generate new compressed audio, which was upsampled to convert it back into an audio format.
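The compress-then-upsample flow described above can be pictured with a toy sketch. The functions `compress` and `upsample` and the `hop` parameter are hypothetical stand-ins introduced for illustration, not OpenAI's networks: averaging replaces the convolutional encoder, simple repetition replaces the learned upsampler, and the transformer stage that would generate new compressed sequences is omitted entirely.

```python
# Toy stand-ins for the stages described in the article; not OpenAI's actual networks.

def compress(samples, hop):
    """Stand-in for the encoder: average each window of `hop` samples."""
    windows = [samples[i:i + hop] for i in range(0, len(samples), hop)]
    return [sum(window) / len(window) for window in windows]


def upsample(latents, hop):
    """Stand-in for the upsampler: repeat each compressed value `hop` times."""
    return [value for value in latents for _ in range(hop)]


raw_audio = [0.0, 0.2, 0.4, 0.6, 0.5, 0.3, 0.1, -0.1]  # made-up waveform samples
latents = compress(raw_audio, hop=4)    # short sequence a generative model could work on
restored = upsample(latents, hop=4)     # back to the original number of samples

print(len(raw_audio), len(latents), len(restored))
```

The point of the sketch is only the change in sequence length: the generative model works on the short compressed sequence, and a separate step restores the full audio resolution.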
When creating Jukebox, OpenAI had to create a method of dealing with the complex, dense nature of audio. The researchers dealt with the continuous nature of audio by breaking it up into more discrete, digestible sections, dividing songs up into bits that are 1/128th of a second long. The goal was to create an AI model capable of breaking down songs into chunks large enough that the problem doesn’t become intractable, yet small and precise enough that the models can learn the pattern of a song and reconstruct that pattern.
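To make the idea of turning continuous audio into discrete pieces more concrete, here is a minimal sketch that assumes nearest-neighbor quantization against a small codebook. The article does not detail the mechanism, so the codebook values, the frame values, and the function names are all hypothetical; a real system would learn its codebook from data.

```python
def quantize(frames, codebook):
    """Map each continuous frame value to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - value))
        for value in frames
    ]


def dequantize(codes, codebook):
    """Turn discrete codes back into (approximate) continuous values."""
    return [codebook[k] for k in codes]


codebook = [-0.5, 0.0, 0.5, 1.0]   # hypothetical; a real codebook would be learned
frames = [0.1, 0.45, 0.9, -0.3]    # made-up continuous values, one per audio chunk

codes = quantize(frames, codebook)
approx = dequantize(codes, codebook)
print(codes)   # [1, 2, 3, 0]
print(approx)  # [0.0, 0.5, 1.0, -0.5]
```

The trade-off mentioned above shows up here in miniature: a larger codebook or shorter chunks would reconstruct the signal more precisely, but would give the model a longer, harder sequence to learn.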
The technique used by OpenAI shares some commonalities with an older music-generation AI the company produced, called MuseNet. MuseNet was trained on MIDI files and was capable of generating music in a variety of styles, though it focused on the overall melody of a song and couldn’t produce lyrics. In contrast, Jukebox is able to write its own lyrics to accompany the music. The lyrics are “co-written” by the OpenAI researchers, who guide the model towards creating lyrics in certain styles. The Jukebox system was trained on lyrics scraped from LyricWiki, with the training data consisting of text and metadata on 1.2 million songs.
To match lyrics to audio, the researchers first tried a simple heuristic that stretched the lyrics out to roughly the duration of a song, then took the text that corresponded to a particular chunk of the song. This simple approach worked well in general, although the researchers found that it broke down when the lyrics were particularly fast. To deal with this problem, vocals were extracted from the song and aligned with the lyrical text to obtain word-level alignments for the lyrics. An encoding layer was then used for the lyrics, along with an attention layer that mapped sections of the music to lyrics using key-value pairs. The result was a fairly precise match between lyrics and vocals.
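The simple stretching heuristic can be sketched as follows, assuming the lyric characters are spread evenly over the song's duration. The function name `lyric_window`, its parameters, and the sample lyrics are hypothetical and chosen only for illustration; this is not the researchers' code.

```python
def lyric_window(lyrics, song_seconds, chunk_start, chunk_end):
    """Return the lyric text assumed to fall within [chunk_start, chunk_end) seconds.

    Assumes characters are distributed evenly across the whole song.
    """
    chars_per_second = len(lyrics) / song_seconds
    first = int(chunk_start * chars_per_second)
    last = int(chunk_end * chars_per_second)
    return lyrics[first:last]


lyrics = "one two three four five six"  # 27 characters of placeholder lyrics
song_seconds = 27.0                     # made-up duration: one character per second

print(lyric_window(lyrics, song_seconds, 8.0, 13.0))  # three
```

The even-spread assumption is what fails on fast passages: if many words are sung within a few seconds, the window returned for that chunk covers too little text, which is why a word-level alignment against the extracted vocals was needed.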
The researchers are candid about the model's current limitations: “While Jukebox represents a step forward in musical quality, coherence, length of audio sample, and ability to condition on artist, genre, and lyrics, there is a significant gap between these generations and human-created music. For example, while the generated songs show local musical coherence, follow traditional chord patterns, and can even feature impressive solos, we do not hear familiar larger musical structures such as choruses that repeat.”
Right now, the model is capable of producing a song that is recognizably in the style of a specific genre or even a specific artist. For example, it can produce songs in the style of Elvis Presley, Katy Perry, or Rage Against the Machine. Although the songs are recognizably within a genre or themed around a singer’s style, they are also fairly rough, often sounding like a parody or a poor cover version of a song. Nonetheless, the technical achievement is impressive. The researchers chose to work on music generation specifically because the task is difficult, and they plan to continue refining their techniques. OpenAI has published samples of the generated songs for listening.