MosaicML’s LLM journey started in May 2023 with the release of MPT-7B (Mosaic Pretrained Transformer), which came with three variants:
- MPT-7B-StoryWriter-65k+ (for long-form story generation)
- MPT-7B-Instruct (for short-form instruction following)
- MPT-7B-Chat (for dialogue generation)
The models were widely adopted in the ML community because they are open source, usable commercially, and able to handle extended context windows.
Most importantly, the model was on par with, and in some cases outperformed, comparable models such as LLaMA-7B and StableLM 7B. By June, the MPT-7B series had been downloaded over 3 million times. On June 22, MosaicML released MPT-30B, which raised the bar even further for open-source foundation models.
The MPT-30B: A Powerful LLM That Exceeds GPT-3
MPT-30B is an open-source, commercially licensed, decoder-only LLM. With 30B parameters, only about 17% of the 175B in GPT-3, it outperforms GPT-3 on several tasks.
MPT-30B builds upon the previous MPT-7B model and improves on it across a range of tasks. It is also computationally efficient to train compared to models of similar size. For instance, LLaMA-30B used a FLOPs budget approximately 1.44 times that of MPT-30B, while Falcon-40B used a FLOPs budget about 1.27 times that of MPT-30B.
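The FLOPs ratios quoted above can be roughly reproduced with the common rule of thumb that training compute is about 6 x parameters x training tokens. The sketch below is only an estimate under stated assumptions: the MPT-30B figures (30B parameters, 1T + 50B tokens) come from this article, while the parameter and token counts for LLaMA-30B and Falcon-40B are assumed values that the article does not state.

```python
def train_flops(params, tokens):
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6 * params * tokens

# MPT-30B figures from the article: 30B parameters, 1T + 50B training tokens.
mpt_30b = train_flops(30e9, 1.05e12)

# Assumed figures for the other models (not given in the article).
llama_30b = train_flops(32.5e9, 1.4e12)
falcon_40b = train_flops(40e9, 1.0e12)

llama_ratio = llama_30b / mpt_30b
falcon_ratio = falcon_40b / mpt_30b

print(f"LLaMA-30B / MPT-30B:  {llama_ratio:.2f}x")
print(f"Falcon-40B / MPT-30B: {falcon_ratio:.2f}x")
```

Under these assumptions the estimate lands close to the 1.44x and 1.27x figures, which suggests the comparison is based on parameter count times training tokens rather than on measured hardware time.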
Some special features of MPT-30B are as follows:
8k Token Context Window
The context window of an LLM is the range of tokens the model can consider before generating its output. MPT-30B had a context window of 8,000 tokens at training time. It was first trained on 1T tokens using 2k-token sequences and then on an additional 50B tokens of 8k-token sequences (roughly 6,000 words).
To explain this feature, let’s consider a question:
How can MPT-30B understand and make predictions for longer sequences than what it was trained on?
MPT-30B uses a technique called Attention with Linear Biases (ALiBi) to handle longer sequences and extend the context window beyond 8k tokens during finetuning or inference.
Instead of using positional embeddings, in which a vector is assigned to each position in the sequence, ALiBi adds a penalty to the attention score between each query and key token, and the penalty grows linearly with the distance between them. When the key and query tokens are close together, the penalty is low; when they are far apart, it is higher. As a result, the underlying transformer architecture can extrapolate to long-form inputs.
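The distance penalty described above can be sketched in a few lines of plain Python. This is a minimal illustration that follows the general ALiBi recipe, not MPT-30B's actual implementation: the function names are hypothetical, the slope formula assumes the number of heads is a power of two, and the bias is shown for a single causal attention head.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: the geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumes num_heads is a power of two (the simple case).
    """
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to the query-key scores of one causal attention head.

    Entry [i][j] is -slope * (i - j) for keys at or before the query (j <= i).
    Future keys (j > i) get -inf, standing in for the causal mask.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])  # the first head's slope is 2^-1 = 0.5
for row in bias:
    print(row)
```

Because the bias depends only on the distance between positions, the same rule applies unchanged to sequences longer than those seen in training, which is what makes extrapolation possible.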
Efficient Inference & Training Performance via FlashAttention
Attention, i.e., focusing on relevant parts of the input sequence, is a critical component of transformers, but it can be slow and memory-intensive, especially when processing long text sequences.
FlashAttention is an approach proposed by researchers at Stanford University that addresses this problem. Using a technique called tiling, FlashAttention reduces the number of times the GPU needs to read from or write to its main memory, which speeds up processing. MPT-30B employs FlashAttention together with NVIDIA’s FasterTransformer optimization library for efficient training and inference.
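The idea behind tiling can be illustrated with a simplified sketch for a single query vector. It processes keys and values one block at a time and keeps only a running maximum, a running normalizer, and a running output, so the full row of attention scores is never stored. This is not the real FlashAttention kernel (which runs on the GPU and tiles queries as well); the function names are hypothetical and the usual 1/sqrt(d) scaling is omitted for brevity.

```python
import math

def naive_attention(q, keys, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def tiled_attention(q, keys, values, block_size=2):
    """Same result, visiting keys/values one block (tile) at a time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
print(full)
print(tiled)
```

Both functions produce the same output up to floating-point rounding; the tiled version simply never needs all the scores in memory at once.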
Ease of Training & Deployment
Developers can train MPT-30B from scratch or use MosaicML’s checkpoints for quicker deployments. Also, it can be finetuned for domain-specific use cases on a particular dataset.
The model's size was chosen to enable effortless deployment on a single GPU, specifically 1xA100-80GB in 16-bit precision or 1xA100-40GB in 8-bit precision. This means that the model was designed to fit within the memory limitations of these GPUs.
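A quick back-of-the-envelope calculation shows why these two GPU configurations work. The sketch below counts memory for the weights only and assumes 1 GB = 1e9 bytes; activations, the attention cache, and framework overhead need additional room, so the numbers are a lower bound rather than a full memory budget.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

PARAMS = 30e9  # 30B parameters, as stated in the article

fp16_gb = weight_memory_gb(PARAMS, 2)  # 16-bit precision: 2 bytes per parameter
int8_gb = weight_memory_gb(PARAMS, 1)  # 8-bit precision: 1 byte per parameter

print(f"16-bit weights: {fp16_gb:.0f} GB (target: A100-80GB)")
print(f"8-bit weights:  {int8_gb:.0f} GB (target: A100-40GB)")
```

In both cases the weights leave some headroom on the target GPU for the memory used during inference.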
MPT-30B has strong coding capabilities as well. HumanEval is a dataset released by OpenAI that contains 164 handcrafted programming problems. On the HumanEval dataset, the model surpasses purpose-built coding models, such as the StarCoder series.
Fine-Tuned Variants: MPT-30B-Instruct & MPT-30B-Chat
LLMs are often used to follow instructions for tasks such as question answering, text summarization, and language translation. MPT-30B-Instruct is a commercially usable variant of MPT-30B (released under the CC-By-SA-3.0 license) that is fine-tuned specifically for instruction-following tasks. For fine-tuning, the following datasets were used:
The Dolly dataset was further augmented with Anthropic’s Helpful and Harmless dataset for instruction finetuning. Additionally, a diverse range of datasets were used for data augmentation, which are as follows:
MPT-30B-Chat is a fine-tuned version of MPT-30B for dialogue generation. It is a research artifact released under the CC-By-NC-SA-4.0 license, allowing only non-commercial use. The model was fine-tuned using various language datasets, including:
LLMs account for a large share of the multi-billion-dollar generative AI market, which has grown rapidly since ChatGPT reshaped the landscape last year. The MPT family is a foundational part of this shift. In the near future, we can expect to see commercially available open-source models that are far more powerful and efficient than the MPT family.
For the latest AI news, visit unite.ai.