Training an Adapter for RoBERTa Model for Sequence Classification Task

Drishti Sharma 12 Apr, 2023 • 13 min read

Introduction

The current trend in NLP includes downloading and fine-tuning pre-trained models with millions or even billions of parameters. However, storing and sharing such large trained models is time-consuming, slow, and expensive. These constraints hinder the development of more multi-purpose and adaptable NLP techniques with the RoBERTa model that can learn from and for multiple tasks; in this article, we will be focusing on the sequence classification tasks. Considering this, adapters were proposed, which are small, lightweight, and parameter-efficient alternatives to full fine-tuning. They are basically small bottleneck layers that can be dynamically added with a pre-trained model based on different tasks and languages.

RoBERTa Model training

In this article, we will train an adapter for ROBERTa model on the Amazon polarity dataset for sequence classification tasks with the help of adapter-transformers, the AdapterHub adaptation of Hugging Face’s transformers library. Additionally, we will compare the performance of the adapter module to a fully fine-tuned RoBERTa model trained on the same dataset.

By the end of this article, you will have learned the following:

How to train an adapter for the RoBERTa model on the Amazon Polarity dataset for the Sequence Classification task?
How can a trained adapter with the Hugging Face pipeline be used to help make quick predictions?
How to extract the adapter from the trained model and save it for later use?
How can the base model’s weights be restored to their original form by deactivating and deleting the adapter?
Push the trained model to the Hugging Face hub for later use. Additionally, we will see the comparison between the adapters and full fine-tuning.

This article was published as a part of the Data Science Blogathon.

Project Description

This project includes training a task adapter for the RoBERTa model on the Amazon polarity dataset for sequence classification tasks, specifically sentiment analysis. To train, we will use the RoBERTa base model from the Hugging Face hub and the AdapterHub adaptation of Hugging Face’s transformers library. Additionally, we will compare the performance of the adapter module to a fully fine-tuned RoBERTa model trained on the same dataset.

What are Adapters?

Adapters are lightweight alternatives to fully fine-tuned pre-trained models. Currently, adapters are implemented as small feedforward neural networks that are inserted between layers of a pre-trained model. They provide a parameter-efficient, computationally efficient, and modular approach to transfer learning. The following image shows added adapter.

Source: Adapterhub

During training, all the weights of the pre-trained model are frozen such that only the adapter weights are updated, resulting in modular knowledge representations. They can be easily extracted, interchanged, independently distributed, and dynamically plugged into a language model. These properties highlight the potential of adapters in advancing the NLP field astronomically.

Significance of Adapters in NLP Transfer Learning

The following are some important points regarding the significance of adapters in NLP transfer learning:

Efficient Use of Pretrained Models: Pretrained language models such as BERT, GPT-2, and RoBERTa have been proven effective in various NLP tasks. However, fine-tuning the entire model can be computationally expensive and time-consuming. Adapters allow for more efficient use of these pretrained models by enabling the insertion of task-specific functionality without modifying the original architecture.
Improved Adaptability: Adapters allow for greater flexibility in adapting pretrained models to new tasks. Rather than fine-tuning the entire model, adapters enable selective modification of specific layers, improving model adaptation to new tasks and leading to better performance.
Cost-Effective: Adapters can be trained with fewer data than required for training a full model, reducing the cost of training and improving the model’s scalability.
Reduced Memory Requirements: Since adapters require fewer parameters than a full model, they can be easily added to a pre-existing model without requiring significant additional memory.
Transfer Learning Across Languages: Adapters can also enable knowledge transfer across languages, allowing models to be trained on a source language and then adapted to a target language with minimal additional training. And hence they can also prove to be very effective in low-resource settings.

Overview of the RoBERTa Model

Roberta is a large pre-trained language model developed by Facebook AI and released in 2019. It shares the same architecture as the BERT model. It is a revised version of BERT with minor adjustments to the key hyperparameters and embeddings.

Except for the output layers, BERT’s pre-training and fine-tuning procedures use the same architecture. The pre-trained model parameters are utilized to initialize models for various downstream tasks, and during fine-tuning, all parameters are adjusted. The following diagram illustrates BERT’s pre-training and fine-tuning procedures. The following figure shows the BERT Architecture.

Source: Arxiv

In contrast, RoBERTa does not employ the next-sentence pretraining objective but utilizes much larger mini-batches and learning rates during training. RoBERTa adopts a different pretraining method and replaces the byte-level BPE tokenizer (similar to GPT-2) with a character-level BPE vocabulary. Moreover, RoBERTa uses “dynamic masking,” which helps the model learn more robust representations of the input text by forcing it to predict a diverse set of tokens rather than just predicting a fixed subset of tokens.

In this article, we will train an adapter for RoBERTa base model for the sequence classification task (more precisely, sentiment analysis). Simply put, a sequence classification task is a task that involves assigning a label or category to a sequence of words or tokens, such as a sentence or document.

Overview of the Dataset

We will use the Amazon Reviews Polarity dataset constructed by Xiang Zhang. This dataset was created by classifying reviews with scores of 1 and 2 as negative and reviews with scores of 4 and 5 as positive. Moreover, the samples with a score of 3 were ignored. Each class has 1,800,000 training samples and 200,000 testing samples.

Training the Adapter for RoBERTa Model on Amazon Polarity Dataset

To start we will begin with installing the libraries:

!pip install -U adapter-transformers datasets

And now, we will load the Amazon Reviews Polarity dataset using the HuggingFace dataset:

from datasets import load_dataset

#Loading the dataset
dataset = load_dataset("amazon_polarity")

Now let’s see what our dataset consists of:

dataset

Output: DatasetDict({
train: Dataset({
features: [‘label’, ‘title’, ‘content’],
num_rows: 3600000
})
test: Dataset({
features: [‘label’, ‘title’, ‘content’],
num_rows: 400000
})
})

So from the above output, we can see that the Amazon Reviews Polarity dataset consists of 3,600,000 training samples and 400,000 testing samples. Now let’s take a look at what a sample from the train set and test set looks like.

dataset["train"][0]

Output: {‘label’: 1, ‘title’: ‘Stunning even for the ‘non-gamer’, ‘content’: ‘This soundtrack was beautiful! It paints the scenery in your mind so good I would recommend it even to people who hate video game music! I have played the game Chrono Cross, but out of all of the games I have ever played, it has the best music! It backs away and takes a fresher step with great guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^’}

dataset["test"][0]

Output: {‘label’: 1, ‘title’: ‘Great CD’, ‘title’: ‘Great CD’, ‘content’: ‘My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and still LOVE IT. When I\’m in a good mood, it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. The vocals are just STUNNING, and the lyrics just kill. One of life\’s hidden gems. This is a desert island CD in my book. Why she never made it big is just beyond me. Every time I play this, no matter male or female, EVERYBODY says one thing “Who was that singing ?”‘}

From the output of print(dataset), dataset[“train”][0], and dataset[“test”][0], we can see that the dataset consists of three columns, i.e., “label”, “title”, and “content”. Considering this, we need to drop the column named title since we won’t require this to train the adapter.

#Removing the column "title" from the dataset
dataset = dataset.remove_columns("title")

Let’s check whether the column “title” has been dropped!

dataset

Below is a Screenshot showing the composition of the dataset after dropping the column “title”.

Output:

Fig. 3 Screenshot showing the composition of dataset after dropping the column

So clearly, the column “title” has been successfully dropped and no longer exists.

Now we will encode all the dataset samples. For this, we will use RobertaTokenizer and dataset.map() function for encoding the input data. Moreover, we will rename the target column class as “labels” since that is what a transformer model takes. Furthermore, we will use set_format() function to set the dataset format to be compatible with PyTorch.

from transformers import AutoTokenizer, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

#Encoding a batch of input data with the help of tokenizer
def encode_batch(batch):
  return tokenizer(batch["content"], max_length=100, truncation = True, padding="max_length")  
  
dataset = dataset.map(encode_batch, batched=True)

#Renaming the column "label" to "labels"
dataset = dataset.rename_column("label", "labels")

#Setting the dataset format to torch and mentioning the columns we want to format
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

Now, we will use RobertaModelWithHeads class, which is unique to adapter-transformers and allows us to easily add and configure prediction heads.

from transformers import RobertaConfig, RobertaModelWithHeads

#Defining the configuration for the model
config = RobertaConfig.from_pretrained("roberta-base", num_labels=2)

#Setting up the model
model = RobertaModelWithHeads.from_pretrained("roberta-base", config=config)

We will now add an adapter with the help of the add_adapter() method. For this, we will pass an adapter name; we passed “amazon_polarity”. Following this, we will also add a matching classification head. Lastly, we will activate the adapter and prediction head using train_adapter().

Basically, train_adapter() method performs two functions majorly:

It freezes all the weights of the pre-trained model such that only the adapter weights are updated during the training.
It also activates the adapter and prediction head to use both in every forward pass.

#Adding adapter to the RoBERTa model
model.add_adapter("amazon_polarity")

# Adding a matching classification head
model.add_classification_head(
    "amazon_polarity",
    num_labels=2,
    id2label={ 0: "negative", 1: "positive"}
  )
  
# Activating the adapter
model.train_adapter("amazon_polarity")

We will configure the training process with the help of TraniningArguments class. Following this, we will also write a function to calculate evaluation accuracy. Lastly, we will pass the arguments to the AdapterTrainer, a class optimized for only training adapters.

import numpy as np
from transformers import TrainingArguments, AdapterTrainer, EvalPrediction

training_args = TrainingArguments(
    learning_rate=3e-4,
    max_steps=80000,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=1000,
    output_dir="adapter-roberta-base-amazon-polarity",
    overwrite_output_dir=True,
    remove_unused_columns=False,
)

def compute_accuracy(eval_pred):
  preds = np.argmax(eval_pred.predictions, axis=1)
  return {"acc": (preds == eval_pred.label_ids).mean()}

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_accuracy,
)

Let’s start training now!

trainer.train()

Fig. 4 Image depicting the training run (Source: Author)

TrainOutput(global_step=80000, training_loss=0.13133217878341674, metrics={‘train_runtime’: 7884.1676, ‘train_samples_per_second’: 324.701, ‘train_steps_per_second’: 10.147, ‘total_flos’: 1.33836672e+17, ‘train_loss’: 0.13133217878341674, ‘epoch’: 0.71})

Evaluating the Trained Model

Now let’s evaluate the adapter’s performance on the dataset’s test split.

trainer.evaluate()

ROBERTa Model Evaluation | classification task

We can use the trained model with the help of the Hugging Face pipeline to make quick predictions.

from transformers import TextClassificationPipeline
classifier = TextClassificationPipeline(model=model,
                                        tokenizer=tokenizer,
                                        device=training_args.device.index)
                                        
classifier("I came across a lot of reviews stating that it is the best book out there.")#import csv

Output: [{‘label’: ‘positive’, ‘score’: 0.5589291453361511}]

Extracting and Saving the Adapter

Ultimately, we can also extract the adapter from the trained model and save it for later use. save_adapter() creates a file for saving adapter weights and adapter configuration.

model.save_adapter("./final_adapter", "amazon_polarity")

!ls -lh final_adapter

Fig. 7 The files present in final_adapter folder — Fig. 7 The files present in the final_adapter folder

Deactivating and Deleting the Adapter

Once we are done working with the adapters, and they are no longer needed, we can restore the weights of the base model in its original form by deactivating and deleting the adapter.

#Deactivating the adapter
model.set_active_adapters(None)

#Deleting the added adapter
model.delete_adapter("amazon_polarity")

Pushing the Trained Model to the Hub

We can also push the trained model to the Hugging Face hub for later use. For this, we will import the libraries and install git, and then we will push the model to the hub.

from huggingface_hub import notebook_login
notebook_login()

!apt install git-lfs 
!git config --global credential.helper store

trainer.push_to_hub()

Link to the Model Card: https://huggingface.co/DrishtiSharma/adapter-roberta-base-amazon-polarity

Comparison of Adapter with Full Fine-tuning

Since the finetuning of adapters involves only the updation of adapter parameters while the parameters of the pre-trained models are frozen, this greatly reduces the training time, computational cost of fine-tuning, and memory footprint of the adapter module when compared to full fine-tuning.
The adapter module can be easily integrated with the pre-trained models to adapt them to new tasks without the need to retrain the whole model. Notably, the size of the file, which contains adapter weights, is just 3.5 MB. Both of these aspects highlight its potential for ease of reusability for multiple tasks.
While trying to fine-tune the RoBERTa model on Amazon Review Polarity dataset, I ran into memory-related issues, which caused the training session to end abruptly at around 40k steps. This highlights the advantage of adapters, i.e., in scenarios where computational resources are limited; adapters are a lot more promising approach than full-fine-tuning.
To draw further conclusions, I trained the adapter and RoBERTa model on a smaller dataset, i.e., “Rotten Tomatoes”. I was pleasantly surprised that adapters scored better than the full fine-tuned model. Notably, after training the adapter for around 113 epochs, the eval_acc was 88.93%, and the model had started to overfit. On the other hand, when the RoBERTa model was trained for the same number of epochs, the eval_acc was 50%, and the train_loss and eval_loss were around 0.693, and these were still going down. Regardless, to draw a more fair and concrete conclusion, a lot more experiments need to be conducted.

Applications of the Trained Adapter

Following are some of the potential applications of an Adapter trained on the Amazon Polarity dataset for sequence classification tasks:

Social Media Analysis: The trained adapter can analyze the underlying sentiment in social media posts or comments. Businesses can use this to gauge customer sentiment and effectively respond to negative/constrictive feedback in time.
Customer Service: The trained adapter can be used to automatically classify the raised customer support tickets into positive or negative, allowing the support team to address and prioritize customer complaints more effectively and timely.
Product/Service Reviews: The trained adapter can automatically classify product/service reviews as positive or negative, helping businesses quickly gauge customer satisfaction with their offerings.
Market Research: The trained adapter can also be used for analyzing sentiment in customer feedback surveys, market research forms, etc., which can be further utilized to draw insights about customer sentiment toward their product/service/brand.
Brand Monitoring: The trained model can be used to monitor online mentions of a brand or product and classify them by sentiment, allowing businesses to track their online reputation and respond to negative feedback or complaints.

Pros of the Adapters

Adapters have several advantages over traditional methods. Here are some of the advantages of adapters in NLP:

Efficient Fine-tuning: Adapters can be fine-tuned on new tasks with fewer parameters than training an entire model from scratch.
Modular: Adapters are modular/interchangeable; they can be easily swapped or added to a pre-trained model.
Domain-specific Adaptations: Adapters can be fine-tuned on domain-specific tasks, resulting in better performance at those tasks.
Incremental Learning: Adapters can be used for incremental learning, allowing for efficient continuous learning and adapting the pre-trained model to new data.
Faster Training: Adapters can be trained faster than training the entire model from scratch, which helps in faster experimentation and prototyping.
Smaller Size: Adapters are significantly smaller than a fine-tuned model, allowing for faster inference and less memory consumption.

Cons of the Adapters

While adapters have several advantages, they have some disadvantages too. Here are some of the disadvantages of adapters:

Reduced Performance: Since an additional adapter layer is added on top of a pre-trained model, this can add computational overhead to the model and affect the model’s performance regarding inference speed and accuracy.
Increased Complexity: Again, as the adapters are added to a pre-trained model, the model must be modified to accept inputs and outputs from the adapter layer. This can, in turn, make the overall architecture of the model more complex.
Limited Expressiveness: Adapters are task-specific and may not be as expressive as a fully-trained model fine-tuned for certain tasks, especially for complex tasks or those requiring domain-specific knowledge.
Limited Transferability: Adapters are trained on limited task-specific data, which may not enable them to generalize well to new tasks or domains, reducing their usefulness when the task or domain differs from the one the adapter was trained on.
Potential for Overfitting: The experiments we performed in this article itself showed that the adapter started to overfit after certain steps, which can lead to poor performance on a downstream task.

Future Research Directions

Following are some of the potential research directions which can help in furthering the advanced development and usage of Adapters:

Exploring Different Adapter Architectures: Adapters are currently implemented as small feedforward neural networks inserted between layers of a pre-trained model. There is huge potential for exploring different architectures for adapters that may offer better performance for specific tasks. This could include investigating new methods for parameter sharing, designing adapters with multiple layers, exploring different activation functions, incorporating attention, etc.
Studying the Impact of Adapter Size: Larger adapters have been shown to work better than smaller ones. But there’s a caveat here the “largeness” of the model affects the inference speed and the computational cost/requirement. Hence further research could be done to explore the optimal size of the adapters based on specific tasks.
Investigating Multi-Layer Adapters: Currently, adapters are added to a single layer of a pre-trained model. There is a scope for exploring multi-layer adapters that can adapt multiple layers of a model for a given task.
Adapting to Other Modalities: Although adapters have been developed, studied, and tested primarily in the context of NLP, there is a scope for studying their use for other modalities like image, audio processing, etc.
Improving Efficiency and Scalability: The efficiency and scalability of adapter training could be improved much more than it currently is.
Multi-domain Adaptation and Multi-task Learning: Adapters have been shown to adapt to new domains and tasks quickly. Future research can help develop adapters that can simultaneously adapt to multiple domains.
Compression and Pruning with Adapters: The efficiency of the adapters can be further increased by developing methods for compressing or pruning adapters while maintaining their effectiveness.
Adapters for Reinforcement Learning: Investigating the use of adapters for reinforcement learning can enable agents to learn more quickly and effectively in complex environments.

Conclusion

This article presents how we can train an adapter model to alter the weights of a given pre-trained model based on the task at hand. And we also saw that once the task is complete, we can easily restore the weights of the base model in its original form by deactivating and deleting the adapter.

To summarize, the key takeaways from this article are:

Adapters are small bottleneck layers that can be dynamically added to a pre-trained model based on different tasks and languages.
We trained an adapter for the RoBERTa model on the Amazon polarity dataset for the sentiment classification task with the help of adapter-transformers, the AdapterHub adaptation of HuggingFace’s transformers library.
train_adapter() method freezes all the weights of the pre-trained model such that only the adapter weights are updated during the training. It also activates the adapter and prediction head to use both in every forward pass.
The adapter from the trained model can be extracted and saved for later use. save_adapter() creates a file for saving adapter weights and adapter configuration.
When the adapter is not needed, we can restore the weights of the base model in its original form by deactivating and deleting the adapter.
Adapters seemed to perform better than the fully fine-tuned RoBERTa model, but, to have a concrete conclusion, more experiments must be conducted.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Drishti Sharma 12 Apr 2023

Advanced Classification Deep Learning Guide NLP