data2vec: A Milestone in Self-Supervised Learning

Machine learning models have traditionally relied on labeled data for training, and training on labeled data generally yields accurate results. The main downside is the cost of annotation, which rises with the size of the training set. High annotation costs are a major hurdle for developers, especially on large projects that need substantial amounts of training data.

To tackle the annotation problem, researchers developed self-supervised learning (SSL). Self-supervised learning is a machine learning process in which the model learns to predict one part of the input from another part of the input. Instead of relying on the supervisory signal of labels, a self-supervised model exploits the relationships within the data itself.

In addition to self-supervised learning, there are several other methods and models for training machine learning models without labeled data. However, most of these methods have two major issues:

  1. They are often specialized for a single modality like an image or a text. 
  2. They require a high amount of computational power. 

These limitations are a major reason why the human mind learns from different kinds of data far more effectively than AI systems, which rely on separate models and training data for images, text, and speech.

To tackle the single-modality issue, Meta AI released data2vec, a first-of-its-kind, high-performance self-supervised algorithm that learns patterns from three different modalities: images, text, and speech. With data2vec, the same learning method that is used for text understanding can also be applied to an image segmentation problem or deployed in a speech recognition task.

In this article, we will discuss the data2vec model in depth. We will cover the method overview, related work, architecture, and results so that you have a clear understanding of the data2vec algorithm.

Data2vec Introduction: The Core Idea

Although the fundamental concept of self-supervised learning applies across modalities, the actual objectives and algorithms differ from each other because each was designed with a single modality in mind. This is why the same self-supervised learning algorithm cannot work effectively across different kinds of training data.

To overcome the challenge presented by single-modality models and algorithms, Meta AI released data2vec, an algorithm that uses the same learning method for computer vision, NLP, and speech.

The core idea behind data2vec is to use a masked view of the input to predict latent representations of the full input in a self-distillation setup, using a standard Transformer architecture. So, instead of predicting modality-specific targets such as words, visual tokens, or units of speech, which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input.

Why Does the AI Industry Need the Data2Vec Algorithm?

Self-supervised learning models build representations of the training data without human-annotated labels, and this is one of the major reasons behind recent advances in natural language processing (NLP), speech processing, and computer vision. Self-supervised representations have even enabled fully unsupervised learning for tasks like speech recognition and machine translation.

Until now, self-supervised learning algorithms have focused on individual modalities, which results in modality-specific designs and learning biases. This focus on individual modalities creates challenges across AI applications, including computer vision and NLP.

For example, NLP has a vocabulary of words that can be used to define a self-supervised learning task, whereas speech processing has no such vocabulary, so several speech models first learn an inventory of speech units. Similarly, in computer vision, developers can either regress the input, learn discrete visual tokens, or learn representations that are invariant to data augmentation. Although these learning biases are handy, it is difficult to confirm whether they will generalize to other modalities.

The data2vec algorithm is a major milestone in the self-supervised learning industry as it aims at improving multiple modalities rather than just one. Furthermore, the data2vec algorithm is not reliant on reconstructing the input or contrastive learning. 

So the reason why the world needs data2vec is that the algorithm has the potential to accelerate progress in AI, and it contributes to developing AI models that can learn about different aspects of their surroundings seamlessly. Scientists hope that data2vec will allow them to develop more adaptable AI and ML models that are capable of performing highly advanced tasks beyond what today’s AI models can do.

What is the Data2Vec Algorithm?

data2vec is a unified framework that aims to implement self-supervised machine learning across different data modalities, including images, speech, and text.

The data2vec algorithm aims at developing ML models that can learn the general patterns in the environment much better by keeping the learning objective uniform across different modalities. The data2vec model unifies the learning algorithm, but it still learns the representations for each modality individually. 

With the introduction of data2vec, Meta AI hopes to make multimodal learning more effective and much simpler.

How Does the Data2Vec Algorithm Work?

The data2vec algorithm combines masked prediction with the learning of latent target representations, and it generalizes the latter by using multiple network layers as targets. Specifically, the model trains an off-the-shelf Transformer network that is used in either teacher mode or student mode.

In teacher mode, the model first builds representations of the full input data, which serve as targets in the learning task. In student mode, the model encodes a masked version of the input and uses it to predict the full-data representations.

The above picture represents how the data2vec model uses the same learning process for different modalities. In the first step, the model produces representations of the input data (teacher mode). The model then regresses these representations on the basis of a masked version of the input. 

Furthermore, because data2vec uses latent representations of the input data as targets, it can be viewed as a simplified version of modality-specific designs such as constructing suitable targets by normalizing the input or learning a fixed set of visual tokens. The crucial difference between data2vec and other algorithms is that data2vec uses self-attention to make its target representations contextualized and continuous. Other self-supervised learning models, by contrast, use a fixed set of targets that are based on local context.

Data2vec: Model Method

The data2vec model is trained by predicting the model representations of the input data given a partial view of the input. As you can see in the given figure, the dog’s face is masked, a particular section of the voice note is masked, and the word “with” is masked in the text. 

The model first encodes a masked version of the training sample (student mode). It then constructs the training targets by encoding the unmasked version of the input with the same model, this time parameterized as an exponentially moving average of the model weights (teacher mode). The target representations encode all of the information in the training sample, and the learning task is for the student to predict these representations when given only a partial view of the input.

Model Architecture

The data2vec model uses a standard Transformer architecture with modality-specific encoding of the input data. For computer vision, the model uses the ViT strategy of encoding an image as a sequence of patches, where each patch spans 16×16 pixels and is fed to a linear transformation.

For speech, the model encodes the data using a multi-layer 1-D convolutional neural network that maps 16 kHz waveforms to 50 Hz representations. For text, the model preprocesses the data to extract sub-word units, which are then embedded in distributional space via learned embedding vectors.

Once the model has embedded the input as a sequence of tokens, it masks part of these units by replacing them with a learned mask embedding token and then feeds the sequence to the Transformer network. For computer vision, the model follows a block-wise masking strategy. For speech, it masks spans of latent speech representations, and for language tasks, it masks tokens.

Training Targets

The data2vec model is trained to predict the model representations of the original unmasked training sample based on an encoding of the masked sample. The model predicts these representations only for the time-steps that are masked.

The representations it predicts are contextualized: because the Transformer network uses self-attention, they encode not only the particular time-step but also other information from the sample. These contextualized targets are what distinguish data2vec from existing models such as BERT, wav2vec 2.0, BEiT, SimMIM, MAE, and MaskFeat, which predict targets that lack contextual information.

Here is how the data2vec model parameterizes the teacher mode to predict the network representations that then serve as targets. 

Teacher Parameterization

The data2vec model parameterizes the encoding of the unmasked training sample with an exponential moving average (EMA) of the model parameters (θ), where the weights of the model in teacher mode (∆) are updated as follows:

                                           ∆ ← τ∆ + (1 − τ ) θ


Furthermore, the model uses a schedule for τ that linearly increases it from τ0 to the target value τe over the first τn updates, after which the value is kept constant for the remainder of training. This strategy updates the teacher much more frequently at the beginning of training, when the model is random, and less frequently later in training, once good parameters have been learned.
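The EMA update and the linear schedule for τ can be sketched in a few lines of Python. This is a minimal illustration rather than the original implementation: the function names `tau_schedule` and `ema_update`, and the toy parameter dictionaries, are assumptions made for this example. The τ values are the ones the article reports for speech.

```python
import numpy as np

def tau_schedule(step, tau_0, tau_e, tau_n):
    """Linearly increase tau from tau_0 to tau_e over the first tau_n updates, then hold it."""
    if step >= tau_n:
        return tau_e
    return tau_0 + (tau_e - tau_0) * step / tau_n

def ema_update(teacher, student, tau):
    """Apply teacher <- tau * teacher + (1 - tau) * student to every parameter."""
    for name in teacher:
        teacher[name] = tau * teacher[name] + (1.0 - tau) * student[name]

# Toy parameters (hypothetical): one weight vector per network.
student = {"w": np.ones(3)}
teacher = {"w": np.zeros(3)}

tau = tau_schedule(step=0, tau_0=0.999, tau_e=0.9999, tau_n=30_000)
ema_update(teacher, student, tau)
print(tau, teacher["w"])
```

A smaller τ early in training lets the teacher follow the student closely, and a larger τ later on makes the teacher change more slowly.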

The results show that the model is more efficient & accurate when it shares the parameters of the feature encoder & positional encoder between the student & the teacher mode. 


The training targets are constructed from the output of the top K blocks of the teacher network for the time-steps that are masked in student mode. The output of block l at time-step t is denoted a_t^l. The model applies a normalization to each block to obtain \hat{a}_t^l and then averages the top K blocks

    y_t = \frac{1}{K} \sum_{l=L-K+1}^{L} \hat{a}_t^l

to obtain the training target y_t for time-step t for a network with L blocks in total.
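As a rough illustration of how such targets could be built, the NumPy sketch below normalizes the outputs of the top K blocks and averages them at the masked time-steps. The function names, the array shapes, and the random "block outputs" are assumptions made for this example. The normalization shown is a parameter-free instance normalization over the time axis, which is one of the options the article mentions.

```python
import numpy as np

def instance_norm(a, eps=1e-5):
    """Normalize each feature channel over the time axis, with no learned parameters."""
    mean = a.mean(axis=0, keepdims=True)
    var = a.var(axis=0, keepdims=True)
    return (a - mean) / np.sqrt(var + eps)

def build_targets(block_outputs, k, masked):
    """Average the normalized outputs of the top-k blocks at the masked time-steps.

    block_outputs: list of L arrays of shape (T, H), one per Transformer block
    masked: boolean array of shape (T,), True where the student input is masked
    """
    normed = [instance_norm(a) for a in block_outputs[-k:]]
    y = np.mean(normed, axis=0)  # shape (T, H)
    return y[masked]

# Hypothetical teacher outputs: L = 12 blocks, T = 8 time-steps, H = 4 features.
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(8, 4)) for _ in range(12)]
masked = np.array([False, True, True, False, False, True, False, False])

targets = build_targets(blocks, k=6, masked=masked)
print(targets.shape)
```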

This creates the training targets that the model regresses in student mode. In initial experiments, simply averaging the blocks performed as well as predicting each block separately with a dedicated projection, while being much more efficient.

Furthermore, normalizing the targets helps prevent the model from collapsing into a constant representation for all time-steps, and it prevents layers with a high norm from dominating the target features. For speech, the model uses instance normalization over the current input sample without any learned parameters. This is mainly because the stride over the input data is small, so neighboring representations are highly correlated.

Additionally, the researchers found that for computer vision and NLP, parameter-less layer normalization is sufficient. The problem can also be addressed with Variance-Invariance-Covariance regularization, but the strategy above works well and does not introduce any additional hyper-parameters.


For the contextualized training targets y_t, the model uses a Smooth L1 loss to regress the targets:

    \mathcal{L}(y_t, f_t(x)) = \begin{cases} \frac{1}{2} (y_t - f_t(x))^2 / \beta & \text{if } |y_t - f_t(x)| \le \beta \\ |y_t - f_t(x)| - \frac{1}{2} \beta & \text{otherwise} \end{cases}

Here, β controls the transition from a squared loss to an L1 loss, depending on the size of the gap between the target y_t and the model prediction f_t(x) at time-step t. The advantage of this loss is that it is comparatively less sensitive to outliers; however, the setting of β needs to be tuned.
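For readers who prefer code, here is a small NumPy sketch of the standard Smooth L1 loss with a β parameter. The function name `smooth_l1` is chosen for this example. The loss is squared for gaps up to β and linear beyond that.

```python
import numpy as np

def smooth_l1(y, pred, beta):
    """Element-wise Smooth L1 loss: quadratic for small gaps, linear for large ones."""
    diff = np.abs(y - pred)
    return np.where(diff <= beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)

y = np.array([0.0, 0.0, 0.0])
pred = np.array([1.0, 2.0, 5.0])
print(smooth_l1(y, pred, beta=2.0))  # gaps of 1 and 2 fall in the squared region, 5 in the linear one
```

The two branches meet at a gap of exactly β, where both evaluate to β/2, so the loss is continuous.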

Experimental Setup

The data2vec model is evaluated in two sizes: data2vec Base and data2vec Large. The models contain L = 12 or L = 24 Transformer blocks with hidden dimension H = 768 or H = 1024, and the EMA updates are done in fp32 for numerical stability. Let’s take a detailed look at the experimental setup for each modality.

Computer Vision

The data2vec model embeds images of 224×224 pixels as patches of 16×16 pixels. Each of these patches is transformed linearly, and a sequence with 196 representations is fed to the standard Transformer. 
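The patch arithmetic can be checked with a short NumPy sketch: a 224×224 image split into 16×16 patches gives 14 × 14 = 196 patches. The `patchify` helper, the RGB channel count, and the random projection matrix `W` are assumptions made for this example. In a real model the projection is learned.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into non-overlapping patches, flattened row by row."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

image = np.zeros((224, 224, 3), dtype=np.float32)
patches = patchify(image)  # one row per patch

# Hypothetical linear projection to a hidden size of 768 (random here, learned in practice).
rng = np.random.default_rng(0)
W = rng.normal(size=(patches.shape[1], 768)) * 0.02
tokens = patches @ W
print(patches.shape, tokens.shape)
```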

The model follows BEiT in masking blocks of multiple adjacent patches, where each block contains at least 16 patches and has a random aspect ratio. However, instead of masking 40% of the patches as in the original BEiT model, data2vec masks 60% of the patches for better accuracy.

Furthermore, the model applies randomly resized image crops, horizontal flipping, and color jittering. Finally, the data2vec model uses the same modified image in both teacher and student mode.

The ViT-B models are pre-trained for 800 epochs, and data2vec uses a batch size of 8,192 for the ViT-L model and 2,048 for the ViT-B model. The model is trained with Adam and a cosine schedule with a single cycle, warming up the learning rate for 80 epochs to 0.001 for ViT-L and for 40 epochs to 0.001 for ViT-B.
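A single-cycle cosine schedule with linear warmup can be sketched as follows. The function name `cosine_lr`, the warmup starting at zero, and the decay ending at zero are assumptions made for this illustration, since the article does not state the start and end values.

```python
import math

def cosine_lr(epoch, total_epochs, warmup_epochs, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then one cosine cycle down to min_lr."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# ViT-B setting from the text: 800 epochs, 40 warmup epochs, peak learning rate 0.001.
for epoch in (0, 20, 40, 420, 800):
    print(epoch, cosine_lr(epoch, total_epochs=800, warmup_epochs=40, peak_lr=0.001))
```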

For both ViT-B, and ViT-L, the data2vec model uses β = 2, K = 6 and τ = 0.9998 as constant with no schedule. The model further uses the stochastic depth rate 0.2. 

Furthermore, for ViT-L, the model trains for 1,600 epochs in total: the first 800 epochs use τ = 0.9998, after which the model resets the learning rate schedule and continues for the final 800 epochs with τ = 0.9999.

For image classification, the model mean-pools the output of the last Transformer block and feeds it to a softmax-normalized classifier. ViT-L is fine-tuned for 50 epochs and ViT-B for 100 epochs, using Adam and a cosine schedule with learning rate warmup.

Speech Processing

For speech processing, data2vec is implemented in fairseq, a sequence-modeling toolkit used to train custom models for summarization, translation, and text generation. The model takes a 16 kHz waveform as input, which is processed by a feature encoder containing temporal convolutions with 512 channels, kernel widths (10,3,3,3,3,2,2), and strides (5,2,2,2,2,2,2).

As a result, the encoder output frequency is 50 Hz, with a stride of about 20 ms between samples. The receptive field comprises 400 input samples, or 25 ms of audio. The raw waveform fed to the encoder is normalized to zero mean and unit variance.
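These figures follow directly from the kernel widths and strides listed above, as the short calculation below shows. The variable names are chosen for this example, and the receptive field is accumulated with the usual formula for stacked convolutions without dilation.

```python
kernels = (10, 3, 3, 3, 3, 2, 2)
strides = (5, 2, 2, 2, 2, 2, 2)
sample_rate = 16_000  # Hz

total_stride = 1      # input samples between consecutive output frames
receptive_field = 1   # input samples seen by one output frame
for k, s in zip(kernels, strides):
    receptive_field += (k - 1) * total_stride
    total_stride *= s

frame_rate = sample_rate / total_stride                      # output frames per second
stride_ms = 1000 * total_stride / sample_rate                # time between output frames
receptive_field_ms = 1000 * receptive_field / sample_rate    # audio seen by one frame

print(total_stride, frame_rate, stride_ms)
print(receptive_field, receptive_field_ms)
```

The product of the strides is 5 × 2⁶ = 320 input samples per output frame, which at 16 kHz gives the 50 Hz output rate.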

The masking strategy used for the Base model follows the approach of Baevski et al. for self-supervised learning in speech recognition. The model samples each time-step with probability p = 0.065 to be a starting index and then masks the following ten time-steps. For a typical training sequence, this results in approximately 49% of all time-steps being masked.
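The 49% figure can be reproduced with a small simulation. In this sketch, the `span_mask` helper and the assumption that each span covers the start index plus the nine steps after it are illustrative choices. Because spans may overlap, the masked fraction is lower than p × 10 = 65%.

```python
import numpy as np

def span_mask(num_steps, p=0.065, span=10, rng=None):
    """Pick each time-step as a span start with probability p and mask `span` steps from it."""
    if rng is None:
        rng = np.random.default_rng()
    starts = rng.random(num_steps) < p
    mask = np.zeros(num_steps, dtype=bool)
    for i in np.flatnonzero(starts):
        mask[i:i + span] = True  # overlapping spans simply merge
    return mask

mask = span_mask(200_000, rng=np.random.default_rng(0))

# A step stays unmasked only if none of the 10 possible start positions covering it was chosen.
expected_fraction = 1.0 - (1.0 - 0.065) ** 10
print(mask.mean(), expected_fraction)
```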

During training, the data2vec model linearly anneals τ using τ0 = 0.999, τe = 0.9999, and τn = 30,000. The model uses the Adam optimizer with a peak learning rate of 5×10⁻⁴ for the Base model. Furthermore, the Base model uses a tri-stage scheduler that warms up the learning rate linearly for the first 3% of updates, holds it for the next 90%, and then decays it linearly for the remaining 7%.
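The tri-stage schedule described here can be sketched as a simple function of the update step. This follows the description in the text (linear warmup, hold, linear decay) and is not taken from any library. The function name, the warmup starting at zero, and the decay ending at zero are assumptions made for this example.

```python
def tri_stage_lr(step, total_steps, peak_lr, warmup_frac, hold_frac):
    """Linear warmup to peak_lr, hold at peak_lr, then linear decay to zero."""
    warmup_steps = round(total_steps * warmup_frac)
    hold_steps = round(total_steps * hold_frac)
    decay_steps = total_steps - warmup_steps - hold_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + hold_steps:
        return peak_lr
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps

# Speech Base setting from the text: 3% warmup, 90% hold, 7% decay, peak 5e-4.
# The total of 1,000 updates is a toy value chosen for readability.
for step in (0, 15, 30, 500, 965, 1000):
    print(step, tri_stage_lr(step, total_steps=1000, peak_lr=5e-4, warmup_frac=0.03, hold_frac=0.90))
```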

Natural Language Processing

The data2vec model uses the byte-pair encoding of 50K types to tokenize the input, and the model then learns an embedding for each type. After the data is encoded, the model applies the BERT masking strategy to 15% of uniformly selected tokens in which 80% are replaced by learned mask tokens, 10% are replaced by random vocabulary tokens, and the remaining 10% are unchanged. 
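The BERT-style masking described above can be sketched with NumPy as follows. The function name `bert_mask`, the choice of selecting each token independently with probability 0.15, and the use of an id just outside the vocabulary as the mask token are assumptions made for this illustration.

```python
import numpy as np

def bert_mask(tokens, vocab_size, mask_id, rng, select_prob=0.15):
    """Select ~15% of tokens; of those, 80% -> mask token, 10% -> random token, 10% unchanged."""
    tokens = np.asarray(tokens)
    corrupted = tokens.copy()
    selected = rng.random(tokens.shape) < select_prob
    roll = rng.random(tokens.shape)
    use_mask = selected & (roll < 0.8)
    use_random = selected & (roll >= 0.8) & (roll < 0.9)
    corrupted[use_mask] = mask_id
    corrupted[use_random] = rng.integers(0, vocab_size, size=int(use_random.sum()))
    return corrupted, selected  # `selected` marks the positions the model must predict

rng = np.random.default_rng(0)
tokens = rng.integers(0, 50_000, size=20)  # hypothetical token ids
corrupted, selected = bert_mask(tokens, vocab_size=50_000, mask_id=50_000, rng=rng)
print(tokens)
print(corrupted)
print(selected)
```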

During pre-training, the model uses τ0 = 0.999, τe = 0.9999, τn = 100,000, K = 10, and β = 4. The model uses the Adam optimizer with a tri-stage learning rate schedule that warms up the learning rate linearly for the first 5% of updates, holds it for the next 80%, and then decays it linearly for the remaining 15%, with a peak learning rate of 2×10⁻⁴.

Furthermore, the model trains on 16 GPUs with a batch size of 256 sequences, each containing up to 512 tokens. For downstream tasks, the pre-trained model is fine-tuned with four different learning rates (1×10⁻⁴, 2×10⁻⁴, 3×10⁻⁴, 4×10⁻⁴), and the one that performs best is selected for each downstream NLP task.


Let’s have a look at how the data2vec model performs when it implements the strategies discussed above for different modalities. 

Computer Vision

To evaluate the results for computer vision, the data2vec model is pre-trained on the images obtained from the ImageNet-1K dataset. The resulting model is fine-tuned using the labeled data of the same benchmark. As per the standard practice, the model is then evaluated in terms of top-1 accuracy on validation data. 

The results are compared against methods that use a single self-supervised model, as well as methods that train a separate visual tokenizer on additional data or distill other self-supervised learning models.

The table below compares the performance of data2vec with other existing methods on computer vision, for both the ViT-B and ViT-L models.

The results from the above table can be summarized as follows. 

  • The data2vec model outperforms prior work with both the ViT-B and ViT-L models in the single-model setting. 
  • The masked prediction setup used in data2vec to predict contextualized latent representations performs better than methods that predict local targets such as engineered image features, input pixels, or visual tokens. 
  • The data2vec model also outperforms self-distillation methods that regress the final layer of the student network while taking two different augmented versions of an image as inputs. 

Audio & Speech Processing

For speech & audio processing, the data2vec model is trained on about 960 hours of audio data obtained from the Librispeech(LS-960) dataset. The dataset contains clean speech audio from audiobooks in English, and it is treated as a standard benchmark in the speech & audio processing industry. 

To analyze the model’s performance in different resource settings, the researchers fine-tuned data2vec for automatic speech recognition using different amounts of labeled data, ranging from 10 minutes to 960 hours. data2vec is compared against wav2vec 2.0 and HuBERT, two of the most popular algorithms for speech representation learning, both of which rely on discrete speech units.

The above table compares the performance of data2vec with other existing models in terms of word error rate for speech recognition. LM denotes the language model used for decoding. The results can be summarized as follows.

  • The data2vec model shows improvements in most labeled data setups, with the largest gain for 10 minutes of labeled data for the Base models. 
  • For the Large models, data2vec performs significantly better with small amounts of labeled data, and performance is comparable in the resource-rich settings of 100 and 960 hours of labeled data. This is because performance generally saturates on resource-rich labeled datasets for most models. 
  • After analyzing the performance, it can be deduced that when the model uses rich contextualized targets, it’s not essential to learn discrete units. 
  • Learning contextualized targets during training helps in improving the overall performance significantly. 

Furthermore, to validate data2vec’s approach for audio, the model is also trained on the AudioSet benchmark. Although the pre-training setup for AudioSet is similar to Librispeech, the model is trained with K = 12 for 200K updates, with a batch size of 94.5 minutes of audio.

The model then applies the DeepNorm framework and layer normalization to the targets to help stabilize training. It is then fine-tuned on the balanced subset with a batch size of 21.3 minutes for 13K updates. The model uses linear softmax pooling and mixup with a probability of 0.7. A single linear projection into 527 audio classes is added, with the projection learning rate set to 2e-4.

Furthermore, the pre-trained parameters use a learning rate of 3e-5, and the model applies masking during fine-tuning. The table below summarizes the results, and it can be seen that data2vec outperforms a comparable setup that uses the same fine-tuning and pre-training data.

Natural Language Processing

To analyze data2vec’s performance on text, the model follows the same training setup as BERT, pre-training on the English Wikipedia dataset for 1M updates with a batch size of 256 sequences. The model is evaluated on the General Language Understanding Evaluation (GLUE) benchmark, which includes natural language inference (MNLI, or Multi-Genre Natural Language Inference), sentence similarity (QQP, or Quora Question Pairs; MRPC, or Microsoft Research Paraphrase Corpus; and STS-B, or Semantic Textual Similarity Benchmark), sentiment analysis (SST-2, or Stanford Sentiment Treebank), and grammaticality (CoLA, or Corpus of Linguistic Acceptability).

Furthermore, to fine-tune the data2vec model, each task provides its own labeled data, and the average accuracy over 5 fine-tuning runs is reported on the development sets. The following table summarizes the performance of the data2vec model on natural language processing tasks and compares it with other models.

  • The above data shows that data2vec outperforms the baseline RoBERTa model. Results improve further with a span-masking strategy that leaves no selected tokens unmasked and does not use random targets. 
  • The data2vec model is the first successful pre-trained NLP model that does not use discrete units like characters, words or sub-words as training targets. Instead, the data2vec framework predicts contextualized latent representation over the complete unmasked text sequence. 
  • This creates a learning task in which the model is required to predict targets with specific properties of the current sequence, rather than representations that are generic to every text sequence in which a particular discrete unit occurs. 
  • Furthermore, the set of training targets is not fixed: the model is free to define new targets, which makes it an open-vocabulary setting. 

Data2Vec: Ablations Study

Ablation refers to the removal of a component from an AI or ML system. An ablation study investigates the performance of a model after removing certain key components, which allows researchers to understand the contribution of each component to the overall system.

Layer Averaged Targets

A major difference between data2vec and other self-supervised learning models is that data2vec uses targets based on averaging multiple layers of the teacher network. The idea comes from the observation that the top layers of the wav2vec 2.0 model do not perform as well on downstream tasks as the middle layers.

In the following experiment, performance on all three modalities is measured when averaging K = 1, 2, …, 12 layers, where K = 1 corresponds to predicting only the top layer. For faster turnaround, the experiments use Base models with 12 layers in total. For speech recognition, the model is pre-trained for two hundred thousand updates on Librispeech and then fine-tuned on the 10-hour labeled split of Libri-light. For NLP, the average GLUE score on the validation set is reported. For computer vision, the model is pre-trained for 300 epochs, and the top-1 accuracy on ImageNet is reported.

The above figure shows that, for all modalities, targets based on multiple layers generally improve over using only the top layer (K = 1). Using all available layers is a good practice because neural networks build features over many layers, and different types of features are extracted at different layers.

Using features from multiple layers helps in boosting accuracy, and enriches the self-supervised learning process. 

Target Feature Type

The Transformer blocks in data2vec contain several layers that could each serve as targets. To analyze how the choice of layer affects performance, speech models are pre-trained on Librispeech using different layers as target features.

The figure below clearly indicates that the output of the feed-forward network (FFN) works best, whereas the output of the self-attention block does not result in a usable model.

Target Contextualization

Teacher representations in the data2vec model use self-attention over the entire input to produce contextualized targets. It’s what separates data2vec from other self-supervised learning models that construct a learning task by reconstructing or predicting local parts of the input. It evidently poses the question: does the data2vec model require contextualized targets to work well? 

To answer the question, the researchers construct target representations that do not have access to the entire input sample but only to a predetermined fraction of it. Specifically, the self-attention mechanism of the teacher is restricted so that it can access only a portion of the input surrounding the current time-step. After the model has been trained, it is fine-tuned with access to the full context size.
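One way to picture this restriction is as a band-shaped attention mask. The sketch below is only an illustration of that idea: the helper names and the symmetric window of `context` steps on each side of the current time-step are assumptions, since the article does not specify how the window is defined.

```python
import numpy as np

def local_attention_mask(num_steps, context):
    """True where query step i may attend to key step j, i.e. |i - j| <= context."""
    idx = np.arange(num_steps)
    return np.abs(idx[:, None] - idx[None, :]) <= context

def masked_softmax(scores, allowed):
    """Softmax over the last axis that gives zero weight to disallowed positions."""
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

allowed = local_attention_mask(num_steps=6, context=1)
weights = masked_softmax(np.zeros((6, 6)), allowed)  # uniform scores, for illustration
print(allowed.astype(int))
print(weights.round(2))
```

Setting `context` to the sequence length or more recovers full self-attention, which corresponds to the setting where the entire input sample is visible.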

The figure below indicates that larger context sizes often lead to a better performance, and when the entire input sample is visible, it yields the best accuracy. It further proves that richer target representations can yield better performance. 

Modality Specific Feature Extractors and Masking

The primary objective of data2vec is to design a single learning mechanism that works across different modalities. However, despite the unified learning regime, data2vec still uses modality-specific feature extractors and masking strategies.

This makes sense given that the nature of the input data varies vastly between modalities. For example, speech models operate on a very high-resolution input (such as a 16 kHz waveform) that typically contains thousands of samples. The waveform is then processed by a multi-layer convolutional neural network to obtain feature sequences at 50 Hz.

Structured and Contextualized Targets

The main differentiating point between data2vec and other masked prediction models is that in data2vec, the features of the training targets are contextualized. These features are built using self-attention over the entire unmasked input in teacher mode.

Some other frameworks, such as BYOL (Bootstrap Your Own Latent) and DINO, also use latent representations like data2vec, but their primary focus is to learn transformation-invariant representations.

Final Thoughts

Recent work in the AI and ML industry has indicated that uniform model architectures can be an effective approach to tackling multiple modalities. The data2vec model uses a single self-supervised learning approach for three modalities: speech, images, and language.

The key concept behind data2vec is to use a partial view of the input to regress contextualized representations of the full input. The approach is effective: the model performs better than prior self-supervised learning models on the ImageNet-1K dataset for both ViT-B and ViT-L single models.

Data2vec is truly a milestone in the self-supervised learning industry, as it demonstrates that a single learning method applied to multiple modalities can indeed make it easier for models to learn across modalities.

"An engineer by profession, a writer by heart". Kunal is a technical writer with a deep love & understanding of AI and ML, dedicated to simplifying complex concepts in these fields through his engaging and informative documentation.