Text-to-Music Generative AI: Stability Audio, Google's MusicLM and More - Unite.AI

Music, an art form that resonates with the human soul, has been a constant companion to us all. Attempts to create music with artificial intelligence began several decades ago. The earliest efforts were simple, relying on basic algorithms that produced monotonous tunes. As the technology advanced, so did the complexity and capabilities of AI music generators, paving the way for deep learning and Natural Language Processing (NLP) to play pivotal roles in the field.

Today, platforms like Spotify leverage AI to fine-tune their users' listening experiences. Their deep-learning algorithms analyze individual preferences across musical elements such as tempo and mood to craft personalized song suggestions. They also analyze broader listening patterns and scour the internet for song-related discussions to build detailed song profiles.

The Origin of AI in Music: A Journey from Algorithmic Composition to Generative Modeling

In the early stages of AI in music, spanning the 1950s to the 1970s, the focus was primarily on algorithmic composition, a method in which computers follow a defined set of rules to create music. The first notable creation of this period was the Illiac Suite for String Quartet in 1957. It used a Monte Carlo approach, in which random numbers dictate pitch and rhythm within the confines of traditional music theory and statistical probabilities.
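To make the idea of random numbers constrained by musical rules concrete, here is a minimal generate-and-test sketch. The scale, the maximum-leap rule, and the function name are hypothetical choices for illustration; they are not the rules used in the Illiac Suite.

```python
import random

# Hypothetical rule set for illustration only; NOT the Illiac Suite's rules.
C_MAJOR = [60, 62, 64, 65, 67, 69, 71, 72]  # MIDI note numbers, C4 to C5
MAX_LEAP = 5  # largest allowed jump between consecutive notes, in semitones


def generate_melody(length, seed=None):
    """Propose random pitches and keep only those that satisfy the rules."""
    rng = random.Random(seed)
    melody = [rng.choice(C_MAJOR)]
    while len(melody) < length:
        candidate = rng.choice(C_MAJOR)  # random proposal
        step = abs(candidate - melody[-1])
        if 0 < step <= MAX_LEAP:         # rule: move, but never leap too far
            melody.append(candidate)     # accepted
        # rejected candidates are simply discarded and a new one is drawn
    return melody


melody = generate_melody(8, seed=1957)
print(melody)
```

The random draw supplies variety, while the acceptance test keeps the result inside the chosen musical constraints.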

Image generated by the author using Midjourney

During this time, another pioneer, Iannis Xenakis, utilized stochastic processes, a concept involving random probability distributions, to craft music. He used computers and the FORTRAN language to connect multiple probability functions, creating a pattern where different graphical representations corresponded to diverse sound spaces.

The Complexity of Translating Text into Music

Music is a rich, multi-dimensional form of data that encompasses elements such as melody, harmony, rhythm, and tempo, which makes translating text into music highly complex. In a computer, a standard song is represented by millions of numbers (audio samples), a figure significantly higher than for other data formats such as images or text.
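A rough count shows the scale involved. The exact figure depends on duration, sample rate, and channel count, so the values below (a three-minute track at the CD-quality rate of 44.1 kHz, a 40-second clip at 24 kHz, and a 512x512 RGB image for comparison) are assumptions chosen for illustration.

```python
def num_samples(duration_s, sample_rate_hz, channels=1):
    """Number of raw values needed to store uncompressed audio."""
    return int(duration_s * sample_rate_hz * channels)


# Assumed examples, not figures from any specific model:
three_min_mono = num_samples(180, 44_100)                # 7,938,000
three_min_stereo = num_samples(180, 44_100, channels=2)  # 15,876,000
forty_sec_24khz = num_samples(40, 24_000)                # 960,000

# For comparison: a 512x512 image with 3 color channels.
image_values = 512 * 512 * 3                             # 786,432

print(three_min_mono, three_min_stereo, forty_sec_24khz, image_values)
```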

The field of audio generation is witnessing innovative approaches to overcome the challenges of creating realistic sound. One method involves generating a spectrogram, and then converting it back into audio.

Another strategy leverages the symbolic representation of music, like sheet music, which can be interpreted and played by musicians. This method has been digitized successfully, with tools like Magenta's Chamber Ensemble Generator creating music in the MIDI format, a protocol that facilitates communication between computers and musical instruments.

While these approaches have advanced the field, they come with their own set of limitations, underscoring the complex nature of audio generation.

Transformer-based autoregressive models and U-Net-based diffusion models are at the forefront of the technology, producing state-of-the-art (SOTA) results in generating audio, text, music, and much more. OpenAI's GPT series and almost all other current LLMs are powered by transformers that use an encoder, a decoder, or both. On the art and image side, Midjourney, Stability AI's Stable Diffusion, and DALL-E 2 all leverage diffusion frameworks. These two core technologies have been key in achieving SOTA results in the audio sector as well. In this article, we will delve into Google's MusicLM and Stable Audio, which stand as a testament to the remarkable capabilities of these technologies.

Google's MusicLM

Google's MusicLM was released in May 2023. It can generate high-fidelity music that matches the sentiment described in the text. Using hierarchical sequence-to-sequence modeling, MusicLM transforms text descriptions into 24 kHz audio that stays coherent over extended durations.

The model operates on a multi-dimensional level, not just adhering to the textual inputs but also demonstrating the ability to be conditioned on melodies. This means it can take a hummed or whistled melody and transform it according to the style delineated in a text caption.

Technical Insights

MusicLM builds on the principles of AudioLM, a framework for audio generation introduced in 2022. AudioLM frames audio synthesis as a language modeling task in a discrete representation space, using a hierarchy of coarse-to-fine discrete audio units, also known as tokens. This approach delivers high fidelity and long-term coherence over substantial durations.

To facilitate the generation process, MusicLM extends the capabilities of AudioLM to incorporate text conditioning, a technique that aligns the generated audio with the nuances of the input text. This is achieved through a shared embedding space created using MuLan, a joint music-text model trained to project music and its corresponding text descriptions close to each other in an embedding space. This strategy effectively eliminates the need for captions during training, allowing the model to be trained on massive audio-only corpora.

MusicLM also uses SoundStream as its audio tokenizer. SoundStream can reconstruct 24 kHz music at 6 kbps with impressive fidelity, leveraging residual vector quantization (RVQ) for efficient, high-quality audio compression.
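The core idea of RVQ can be sketched in a few lines: each stage quantizes whatever error the previous stage left behind, so the codes go from coarse to fine. This is a toy illustration, not SoundStream's implementation. The codebooks here are random rather than learned, and each includes a zero entry so that adding a stage can never make the approximation worse. The bitrate helper uses 50 frames per second, 12 quantizers, and 1024 entries per codebook; these figures are assumptions not stated in this article, chosen because they multiply out to 6 kbps.

```python
import math
import random


def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)))


def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage encodes the previous residual."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def make_toy_codebooks(num_stages, size, dim, seed=0):
    """Random stand-ins for learned codebooks; later stages are finer."""
    rng = random.Random(seed)
    books = []
    for stage in range(num_stages):
        scale = 0.5 ** stage
        book = [[0.0] * dim]  # zero entry: a stage may leave the residual as-is
        book += [[rng.uniform(-scale, scale) for _ in range(dim)]
                 for _ in range(size - 1)]
        books.append(book)
    return books


def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bitrate_bps(frames_per_second, num_quantizers, codebook_size):
    """Bits per second when every frame stores one index per quantizer."""
    return frames_per_second * num_quantizers * math.log2(codebook_size)


codebooks = make_toy_codebooks(num_stages=4, size=16, dim=4)
frame = [0.3, -0.7, 0.1, 0.5]  # stand-in for one encoder output frame
codes = rvq_encode(frame, codebooks)

coarse_error = sq_error(frame, rvq_decode(codes[:1], codebooks[:1]))
fine_error = sq_error(frame, rvq_decode(codes, codebooks))
print(codes, coarse_error, fine_error)

print(bitrate_bps(50, 12, 1024))  # 6000.0 bits per second, i.e. 6 kbps
```

Decoding with only the first code gives a coarse approximation; adding the remaining stages refines it, which is the coarse-to-fine behaviour that makes these tokens useful for hierarchical modeling.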

An illustration of the independent pretraining process for the foundational models of MusicLM: SoundStream, w2v-BERT, and MuLan | Image source: here

MusicLM also supports melody conditioning: even a simple hummed tune can serve as the basis for a full piece, rendered in the style given by the text description.

The developers of MusicLM have also open-sourced MusicCaps, a dataset featuring 5.5k music-text pairs, each accompanied by rich text descriptions crafted by human experts. You can check it out here: MusicCaps on Hugging Face.

Ready to create AI soundtracks with Google's MusicLM? Here's how to get started:

  1. Visit the official MusicLM website and click “Get Started.”
  2. Join the waitlist by selecting “Register your interest.”
  3. Log in using your Google account.
  4. Once granted access, click “Try Now” to begin.

Below are a few example prompts I experimented with:

“Meditative song, calming and soothing, with flutes and guitars. The music is slow, with a focus on creating a sense of peace and tranquility.”

“jazz with saxophone”

In a qualitative evaluation against previous SOTA models such as Riffusion and Mubert, MusicLM was preferred over the others, with participants favorably rating how well its 10-second audio clips matched the text captions.

MusicLM performance comparison | Image source: here

Stability Audio

In September 2023, Stability AI introduced “Stable Audio”, a latent diffusion model architecture conditioned on text metadata alongside audio file duration and start time. Like Google's MusicLM, this approach gives control over the content and length of the generated audio, allowing the creation of audio clips with specified lengths up to the training window size.

Technical Insights

Stable Audio comprises several components including a Variational Autoencoder (VAE) and a U-Net-based conditioned diffusion model, working together with a text encoder.

Stable Audio architecture: an illustration of the integration of a variational autoencoder (VAE), a text encoder, and a U-Net-based conditioned diffusion model | Image source: here

The VAE facilitates faster generation and training by compressing stereo audio into a data-compressed, noise-resistant, and invertible lossy latent encoding, bypassing the need to work with raw audio samples.

The text encoder, derived from a CLAP model, plays a pivotal role in understanding the intricate relationships between words and sounds, offering an informative representation of the tokenized input text. This is achieved through the utilization of text features from the penultimate layer of the CLAP text encoder, which are then integrated into the diffusion U-Net through cross-attention layers.

An important aspect is the incorporation of timing embeddings, which are calculated based on two properties: the start second of the audio chunk and the total duration of the original audio file. These values, translated into per-second discrete learned embeddings, are combined with the prompt tokens and fed into the U-Net’s cross-attention layers, empowering users to dictate the overall length of the output audio.
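The timing conditioning described above can be sketched as two lookup tables with one row per second, whose outputs are appended to the prompt features. This is a simplified illustration, not Stable Audio's code: the embedding size, the cap on seconds, the random tables (stand-ins for learned weights), and all function names are assumptions.

```python
import random

EMBED_DIM = 8     # hypothetical; real models use much larger dimensions
MAX_SECONDS = 95  # hypothetical cap on the values the tables can represent


def make_table(num_rows, dim, rng):
    """Stand-in for a learned embedding table (random here, trained in practice)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


rng = random.Random(0)
start_table = make_table(MAX_SECONDS + 1, EMBED_DIM, rng)  # one row per start second
total_table = make_table(MAX_SECONDS + 1, EMBED_DIM, rng)  # one row per total duration


def timing_embeddings(seconds_start, seconds_total):
    """Look up one discrete embedding for each timing property."""
    start = min(max(int(seconds_start), 0), MAX_SECONDS)
    total = min(max(int(seconds_total), 0), MAX_SECONDS)
    return [start_table[start], total_table[total]]


def build_conditioning(prompt_tokens, seconds_start, seconds_total):
    """Append the two timing vectors to the prompt token features."""
    return list(prompt_tokens) + timing_embeddings(seconds_start, seconds_total)


# Placeholder for the text encoder's output: 5 token feature vectors.
prompt_tokens = make_table(5, EMBED_DIM, rng)
conditioning = build_conditioning(prompt_tokens, seconds_start=0, seconds_total=30)
print(len(conditioning))  # 7: five prompt tokens plus two timing embeddings
```

Because the timing vectors sit in the same sequence as the prompt features, the cross-attention layers can attend to them exactly as they attend to the text.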

The Stable Audio model was trained on a dataset of more than 800,000 audio files, provided through a collaboration with stock music provider AudioSparx.

Stable audio commercials

Stable Audio offers a free version, allowing 20 generations of up to 20-second tracks per month, and a $12/month Pro plan, permitting 500 generations of up to 90-second tracks.

Below is an audio clip that I created using Stable Audio.

Image generated by the author using Midjourney

“Cinematic, Soundtrack Gentle Rainfall, Ambient, Soothing, Distant Dogs Barking, Calming Leaf Rustle, Subtle Wind, 40 BPM”

The applications of such finely crafted audio pieces are endless. Filmmakers can leverage this technology to create rich and immersive soundscapes. In the commercial sector, advertisers can utilize these tailored audio tracks. Moreover, this tool opens up avenues for individual creators and artists to experiment and innovate, offering a canvas of unlimited potential to craft sound pieces that narrate stories, evoke emotions, and create atmospheres with a depth that was previously hard to achieve without a substantial budget or technical expertise.

Prompting Tips

Craft the perfect audio using text prompts. Here's a quick guide to get you started:

  1. Be Detailed: Specify genres, moods, and instruments. For example: Cinematic, Wild West, Percussion, Tense, Atmospheric
  2. Mood Setting: Combine musical and emotional terms to convey the desired mood.
  3. Instrument Choice: Enhance instrument names with adjectives, like “Reverberated Guitar” or “Powerful Choir”.
  4. BPM: Align the tempo with the genre for a harmonious output, such as “170 BPM” for a Drum and Bass track.
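The tips above can be folded into a small helper that assembles a comma-separated prompt in a consistent order. This is only a convenience sketch: `build_prompt` is a hypothetical function, not part of any Stable Audio API, and the resulting string is simply pasted into the prompt box.

```python
def build_prompt(genres, moods=(), instruments=(), bpm=None):
    """Join genre, mood, and instrument terms, then append the tempo if given."""
    parts = list(genres) + list(moods) + list(instruments)
    if bpm is not None:
        parts.append(f"{int(bpm)} BPM")
    return ", ".join(parts)


prompt = build_prompt(
    ["Drum and Bass"],
    moods=["Tense", "Atmospheric"],
    instruments=["Reverberated Guitar"],
    bpm=170,
)
print(prompt)  # Drum and Bass, Tense, Atmospheric, Reverberated Guitar, 170 BPM
```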

Closing Notes

Image generated by the author using Midjourney

In this article, we have explored AI-generated music and audio, from early algorithmic composition to today's sophisticated generative AI frameworks such as Google's MusicLM and Stability AI's Stable Audio. These technologies, leveraging deep learning and SOTA compression models, not only enhance music generation but also fine-tune listeners' experiences.

Yet, it is a domain in constant evolution, with hurdles like maintaining long-term coherence and the ongoing debate on the authenticity of AI-crafted music challenging the pioneers in this field. Just a week ago, the buzz was all about an AI-crafted song channeling the styles of Drake and The Weeknd, which had initially caught fire online earlier this year. However, it faced removal from the Grammy nomination list, showcasing the ongoing debate surrounding the legitimacy of AI-generated music in the industry (source). As AI continues to bridge gaps between music and listeners, it is surely promoting an ecosystem where technology coexists with art, fostering innovation while respecting tradition.

I have spent the past five years immersing myself in the fascinating world of Machine Learning and Deep Learning. My passion and expertise have led me to contribute to over 50 diverse software engineering projects, with a particular focus on AI/ML. My ongoing curiosity has also drawn me toward Natural Language Processing, a field I am eager to explore further.