Hearing, which involves the perception and understanding of generic auditory information, is crucial for AI agents in real-world environments. This auditory information encompasses three primary sound types: music, audio events, and speech. Recently, text-based Large Language Model (LLM) frameworks have shown remarkable abilities, achieving human-level performance in a wide range of Natural Language Processing (NLP) tasks. Additionally, instruction tuning, a training method using pairs of reference responses and user prompts, has become popular. This approach trains large language models to more effectively follow open-ended user instructions. Text-only LLMs, however, cannot perceive anything beyond text, which is why current research is increasingly focused on equipping large language models with the capability to perceive multimodal content.
With that focus in mind, in this article we will be talking about SALMONN, or Speech Audio Language Music Open Neural Network, a state-of-the-art open speech-audio-language-music neural network built by incorporating speech and audio encoders with a pre-trained text-based large language model into a single audio-text multimodal model. The SALMONN model enables Large Language Models to understand and process generic audio inputs directly, and to deliver competitive performance on a wide array of audio and speech tasks used in training, including auditory-information-based question answering, speech recognition and translation, speaker verification, emotion recognition, audio and music captioning, and much more. We will take a deeper dive into the SALMONN framework and explore its workings, architecture, and results across a wide array of tasks. So let’s get started.
SALMONN: An Introduction to Single Audio-Text Multimodal Large Language Models
SALMONN stands for Speech Audio Language Music Open Neural Network, and it is a single audio-text multimodal large language model framework capable of perceiving and understanding three basic audio or sound types including speech, audio events, and music. The SALMONN model enables Large Language Models to understand and process generic audio inputs directly, and deliver competitive performance on a wide array of audio & speech tasks.
To boost its performance on both speech and non-speech audio tasks, the SALMONN framework employs a dual encoder structure consisting of a BEATs audio encoder and a speech encoder sourced from the Whisper speech model. Additionally, the SALMONN framework uses a window-level Q-Former, or query Transformer, as a connection module to convert the variable-length encoder output sequence into a variable number of augmented audio tokens, ultimately achieving the high temporal resolution needed for audio-text alignment. LoRA, or Low-Rank Adaptation, is applied as a cross-modal adaptor to the Vicuna LLM to align its output space with its augmented input space in an attempt to further boost its performance. In the SALMONN framework, the abilities to perform cross-modal tasks unseen during training, referred to as cross-modal emergent abilities, can be lost during instruction tuning, which is the primary reason why the SALMONN framework implements an additional, few-shot activation stage to regain the LLM framework’s general emergent abilities.
Furthermore, the framework makes use of a wide array of audio-event, music, and speech benchmarks to evaluate its cognitive hearing abilities, and divides the benchmarks into three levels. The first level comprises the eight tasks used in instruction tuning, including translation, audio captioning, and speech recognition. The other two levels consist of untrained tasks: the second level contains five speech-based Natural Language Processing tasks, such as slot filling and translation to untrained languages, which rely on high-quality multilingual alignments between text and speech tokens, while the third level contains tasks, such as speech-audio co-reasoning and audio-based storytelling, that require understanding both speech and non-speech auditory information.
To sum it up, the SALMONN framework is:
- The first multimodal large language model capable of perceiving and understanding general audio inputs, including audio events, speech, and music.
- An attempt to analyze cross-modal emergent abilities by discounting the LoRA scaling factor, and to use an extra, low-cost activation stage during training to activate the framework's cross-modal emergent abilities.
SALMONN: Architecture and Methodology
In this section, we will look at the architecture, training method, and experimental setup of the SALMONN framework.
At the core of its architecture, the SALMONN framework synchronizes and combines the outputs of its two auditory encoders, after which it applies a window-level Q-Former as a connection module. The output sequence generated by the Q-Former is merged with the text instruction prompts and then provided as input to the LoRA-adapted Vicuna LLM to generate the required response.
The SALMONN framework makes use of two auditory encoders: a non-speech BEATs audio encoder, and a speech encoder sourced from OpenAI’s Whisper framework. The BEATs encoder is trained with a self-supervised, iterative learning approach to extract high-level non-speech audio semantics: the input audio is first tokenized, and the tokens are then masked and predicted during training. The speech encoder, in contrast, is trained on a large amount of weakly supervised data for speech recognition and speech translation tasks, so its output features capture both speech content and background noise. The resulting auditory features of the two encoders complement each other, and are suitable for both speech and non-speech information.
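Since both encoders emit frame-level feature sequences, combining them can be as simple as concatenating frame-aligned features along the feature dimension. The sketch below assumes the two encoders share the same frame rate, and the feature sizes are illustrative, not SALMONN's exact configuration.

```python
import numpy as np

def fuse_encoder_outputs(whisper_feats, beats_feats):
    """Concatenate frame-aligned features from the two auditory encoders.

    whisper_feats: (T, d_speech) speech-encoder output
    beats_feats:   (T, d_audio)  BEATs audio-encoder output
    Both are assumed to share the same frame rate, so frames align one-to-one.
    """
    T = min(len(whisper_feats), len(beats_feats))  # guard against off-by-one lengths
    return np.concatenate([whisper_feats[:T], beats_feats[:T]], axis=-1)

# ~10 seconds of audio at roughly 50 frames/sec -> 500 frames from each encoder
speech = np.random.randn(500, 1280)  # illustrative Whisper-style feature size
audio = np.random.randn(500, 768)    # illustrative BEATs-style feature size
fused = fuse_encoder_outputs(speech, audio)
print(fused.shape)  # (500, 2048)
```

The fused sequence keeps one feature vector per audio frame, so the downstream connection module still sees the full temporal resolution of the input.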
Window Level Q-Former
Implementing the Q-Former structure is a common approach in visual LLM frameworks for converting the output of an image encoder into textual input tokens, but some modification is needed when dealing with audio encoder outputs of varying lengths. To be more specific, in the image case the encoder output for an input image is treated as a single encoder output sequence, and the Q-Former deploys a fixed number of trainable queries to transform that sequence into textual tokens using stacked Q-Former blocks. A Q-Former block resembles a Transformer decoder block, with the exceptions being the removal of the causal masks in the self-attention layers and the use of a fixed number of trainable static queries in the initial block. SALMONN instead applies the Q-Former at the window level: the variable-length encoder output is segmented into windows, the same trainable queries are applied to each window, and the per-window audio tokens are concatenated, so the number of tokens scales with the audio length and the temporal resolution required for audio-text alignment is preserved.
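The window-level idea can be sketched with a toy single-head cross-attention standing in for the full stacked Q-Former. The window length of 17 frames and the single query per window are illustrative choices (roughly a third of a second at a 50 Hz frame rate), not necessarily SALMONN's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qformer_window(queries, window, Wq, Wk, Wv):
    """Toy single-head cross-attention: N trainable queries attend to one window."""
    Q, K, V = queries @ Wq, window @ Wk, window @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return attn @ V  # (N, d): N audio tokens for this window

def window_level_qformer(feats, queries, Wq, Wk, Wv, win=17):
    """Apply the toy Q-Former to each window of `win` frames and concatenate,
    so the number of output tokens grows with the audio length."""
    tokens = []
    for start in range(0, len(feats), win):
        tokens.append(qformer_window(queries, feats[start:start + win], Wq, Wk, Wv))
    return np.concatenate(tokens, axis=0)

d, N = 64, 1                                   # feature size and queries per window (illustrative)
queries = rng.standard_normal((N, d))          # trainable static queries, shared across windows
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
feats = rng.standard_normal((500, d))          # ~10 s of fused encoder frames
tokens = window_level_qformer(feats, queries, Wq, Wk, Wv, win=17)
print(tokens.shape)  # 500 frames / 17 per window -> 30 windows x 1 query = (30, 64)
```

A 10-second clip yields 30 audio tokens while a 30-second clip would yield roughly 90, which is the behavior a single whole-sequence Q-Former with a fixed query count cannot provide.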
LoRA and LLM
The SALMONN framework also deploys the Vicuna LLM, a LLaMA large language model fine-tuned to follow instructions more accurately and effectively. LoRA is a common method for parameter-efficient fine-tuning, and it is included in the SALMONN framework to adapt the query and value weight matrices in the self-attention layers of the LLM.
The SALMONN framework makes use of a three-stage cross-modal training approach: a pre-training stage and an instruction tuning stage, as included in most visual LLM frameworks, plus an additional activation tuning stage implemented to resolve the over-fitting issues encountered on audio captioning and speech recognition tasks.
Pre-Training Stage
To bridge the gap between the pre-trained parameters (the encoders and the LLM) and the randomly initialized parameters (the adaptor and connection modules), the SALMONN framework uses a large amount of audio captioning and speech recognition data to pre-train the LoRA and Q-Former components. Both tasks contain vital auditory information about the key contents of the audio, whether speech or non-speech, and neither requires complex understanding or reasoning, which makes them suitable for learning the alignment between auditory and textual information.
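The division of labor across the three stages can be summarized as a small configuration sketch. The component and task labels below are illustrative names based on the description above, not identifiers from the SALMONN codebase; throughout all stages, only the connection module and the LoRA adaptor are updated.

```python
# Sketch of the three-stage training recipe; all names are illustrative labels.
FROZEN = ["whisper_encoder", "beats_encoder", "vicuna_llm"]      # never updated
TRAINABLE = ["window_qformer", "lora_adaptor"]                   # updated in every stage

STAGES = [
    ("pre_training",       "speech recognition and audio captioning data"),
    ("instruction_tuning", "the full mix of audio, music, and speech instruction tasks"),
    ("activation_tuning",  "long, diverse responses such as audio-based stories"),
]

for name, data in STAGES:
    print(f"{name}: train {TRAINABLE} on {data}")
```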
Instruction Fine-Tuning Stage
The instruction fine-tuning stage implemented in the SALMONN framework resembles the one used in NLP and visual LLM frameworks, fine-tuning on audio-text instructions drawn from a list of audio-event, music, and speech tasks. The tasks are prioritized on the basis of their importance across different tests, including phone recognition, overlapping speech recognition, and music captioning. Furthermore, the textual information paired with the audio data forms the basis for generating the instruction prompts.
Even with only the first two training stages, the SALMONN framework delivers competitive results on the instruction tuning tasks, although its performance falls short on cross-modal tasks, especially those that require cross-modal co-reasoning abilities. Specifically, the model occasionally violates the instruction prompts and generates irrelevant or incorrect responses. This phenomenon is referred to as task overfitting, and the activation tuning stage is implemented to resolve it.
Activation Tuning Stage
An effective approach to resolving these overfitting issues is to regularize the intrinsic conditional language model using longer and more diverse responses, such as storytelling or auditory-information-based question answering. The framework generates the paired training data for such tasks using the text paired with the audio, speech, or music captions.
To evaluate SALMONN’s zero-shot cross-modal emergent abilities, developers have included 15 speech, audio and music tasks divided across three levels.
In the first level, tasks are used for instruction tuning, and therefore, they are the easiest set of tasks that the SALMONN framework has to perform.
The second level consists of untrained tasks, and their complexity is higher than that of the Level 1 tasks. The Level 2 tasks are speech-based Natural Language Processing tasks, including speech keyword extraction, which evaluates the framework's accuracy when extracting certain keywords from speech; SQQA, or Spoken Query-based Question Answering, which evaluates the common-sense knowledge the framework extracts from spoken questions; an SF, or Speech-based Slot Filling, task to evaluate the accuracy of slot values; and finally, two AST tasks for English-to-German and English-to-Japanese speech translation.
The complexity of the Level 3 tasks is the highest of the three levels, and they include SAC, or Speech Audio Co-Reasoning, and audio-based storytelling. The SAC task requires the SALMONN framework to understand a question embedded in the audio clip fed to the model, find supporting evidence in the background audio events or music, and finally generate an appropriately reasoned answer to the question. The audio-based storytelling task requires the model to generate a meaningful story based on the auditory information sourced from general audio inputs.
Level 1 Tasks
The following table demonstrates the results on Level 1 tasks, and as can be observed, the SALMONN framework returns competitive results on Level 1 tasks with or without activation tuning.
Level 2 and 3 Tasks
Although the SALMONN framework returns competitive results on Level 1 tasks even without activation tuning, the same cannot be said for Level 2 and Level 3 tasks: without activation tuning, the SALMONN framework suffers heavily from task overfitting. The performance dips even further on the SQQA, SAC, and storytelling tasks, which emphasize multimodal interactions, and the SALMONN framework struggles to follow instructions without activation tuning. With activation tuning, however, the results improve considerably, as included in the following image.
Discounting LoRA Scaling Factor
This ablation evaluates the influence of test-time discounting of the LoRA scaling factor as a way to minimize task overfitting. As can be observed in the following figure, decreasing the LoRA scaling factor to 2.0 elevates the cross-modal reasoning ability of the SALMONN framework on the ASR and PR, SQQA, storytelling, and SAC tasks.
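Mechanically, discounting at test time just means evaluating the LoRA-adapted layer with a smaller alpha than was used in training, which shrinks the low-rank path and pulls the layer back towards the frozen pretrained weights. The sketch below uses illustrative sizes and an illustrative training-time alpha of 32, then compares it with a discounted alpha of 2.0.

```python
import numpy as np

rng = np.random.default_rng(1)

def lora_output(x, W, A, B, alpha, r):
    """Adapted projection: pretrained path plus a low-rank path scaled by alpha / r."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d, r = 256, 8                       # illustrative sizes
W = rng.standard_normal((d, d))     # frozen pretrained projection
A = rng.standard_normal((r, d))     # trained low-rank factors (random stand-ins here)
B = rng.standard_normal((d, r))
x = rng.standard_normal((4, d))

trained = lora_output(x, W, A, B, alpha=32.0, r=r)    # scaling used in training (illustrative)
discounted = lora_output(x, W, A, B, alpha=2.0, r=r)  # discounted scaling at test time

# Shrinking alpha pulls the output back towards the unadapted pretrained layer,
# relaxing the task overfitting introduced during instruction tuning.
gap_trained = np.abs(trained - x @ W.T).mean()
gap_discounted = np.abs(discounted - x @ W.T).mean()
assert gap_discounted < gap_trained
```

Because the discount only rescales the existing LoRA update, it requires no retraining: the same checkpoint can be evaluated at several scaling factors to trade instruction-following ability against performance on the trained tasks.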
To examine the effect of activation tuning, the SALMONN framework's developers analyze the changes in perplexity across the three training stages. As can be seen in the following image, the perplexities of the AAC and ASR tasks reach small final values after the first training stage, indicating that the model learns the cross-modal alignments early.
Furthermore, the perplexity of the PR task also drops after instruction tuning, owing to its reliance on the LoRA component to learn the output tokens. It is also observed that although instruction tuning helps reduce the perplexity on the storytelling and SAC tasks, the gap remains too large to perform those tasks successfully unless an additional activation stage is added or the LoRA component is removed.
The SALMONN framework explores different activation methods, including training the model on text-based question-answer pairs with long answers, on audio-based long written stories, or on ASR with long speech transcriptions; both the Q-Former and LoRA components are fine-tuned with these three methods. Furthermore, the framework tries ignoring the audio inputs and the Q-Former, fine-tuning only the LoRA component so that Vicuna acts as an adapted text-based large language model. The results are demonstrated in the following image, and as can be seen, the model cannot be activated by ASR (training ASR with long labels), nor by the Story or Text-based methods that train the LoRA component using text prompt inputs.
In this article, we have talked about SALMONN or Speech Audio Language Music Open Neural Network, a single audio-text multimodal large language model framework capable of perceiving and understanding three basic audio or sound types including speech, audio events, and music. The SALMONN model enables Large Language Models to understand and process generic audio inputs directly, and deliver competitive performance on a wide array of audio & speech tasks.
The SALMONN framework delivers competitive performance on a wide array of trained tasks, including audio captioning, speech translation and recognition, and more, while generalizing to a host of untrained understanding tasks, including speech keyword extraction and speech translation to untrained languages. Owing to these abilities, the SALMONN framework can be regarded as the next step towards enhancing the generic hearing abilities of large language models.