Introduction

This is the first post of a series that will go through the "NLP with Transformers" book. I am following the reading group organized by Hugging Face, with the authors themselves! So I thought the best way to properly understand each chapter would be to try to apply the concepts to a different example (and that is when the problems happen 😂). My idea is to post one of these every two weeks, before the reading group session. However, life gets in the way, so I cannot promise anything. Do not expect fancy or innovative things; the code will be similar to the one in the book. Of course I would like to go deeper, but time is limited ⏳ and this is just to enhance my learning (and the learning of whoever wants to follow the book). Enough introduction, let's go with my first dive into NLP!

Warning: These posts are not attempting to instruct the reader and they will probably contain errors. Reading the corresponding chapter of the book is needed for a correct understanding. They will also contain questions to be asked in the reading group session.

The dataset

Instead of the emotion dataset used in the book, we will use PAWS-X. This dataset contains multiple languages, so we specify the English subset by passing "en".

from datasets import load_dataset

paws = load_dataset("paws-x", "en")
paws
DatasetDict({
    train: Dataset({
        features: ['id', 'sentence1', 'sentence2', 'label'],
        num_rows: 49401
    })
    test: Dataset({
        features: ['id', 'sentence1', 'sentence2', 'label'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['id', 'sentence1', 'sentence2', 'label'],
        num_rows: 2000
    })
})

So we have a dataset where each example is a pair of sentences together with a label. The label indicates whether one of the sentences is a paraphrase of the other: 0 means the pair has a different meaning, while 1 means the pair is a paraphrase. Our goal will be to use a pre-trained DistilBERT model to distinguish them.
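We can double-check this encoding by inspecting the dataset features; since `label` is a `ClassLabel`, its class names are available too (we will reuse them later when plotting).

print(paws["train"].features)
print(paws["train"].features["label"].names)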

Let's get a pandas DataFrame to easily explore our data.

paws.set_format(type="pandas")

df=paws["train"][:]
df.head()
id sentence1 sentence2 label
0 1 In Paris , in October 1560 , he secretly met t... In October 1560 , he secretly met with the Eng... 0
1 2 The NBA season of 1975 -- 76 was the 30th seas... The 1975 -- 76 season of the National Basketba... 1
2 3 There are also specific discussions , public p... There are also public discussions , profile sp... 0
3 4 When comparable rates of flow can be maintaine... The results are high when comparable flow rate... 1
4 5 It is the seat of Zerendi District in Akmola R... It is the seat of the district of Zerendi in A... 1

Here we can see an example of a paraphrase:

print(df.iloc[1]["sentence1"] + "\n")
print(df.iloc[1]["sentence2"] + "\n")
print("Label: " + str(df.iloc[1]["label"]))
The NBA season of 1975 -- 76 was the 30th season of the National Basketball
Association .

The 1975 -- 76 season of the National Basketball Association was the 30th season
of the NBA .

Label: 1

And here is a "non-paraphrase" pair:

print(df.iloc[0]["sentence1"] + "\n")
print(df.iloc[0]["sentence2"] + "\n")
print("Label: " + str(df.iloc[0]["label"]))
In Paris , in October 1560 , he secretly met the English ambassador , Nicolas
Throckmorton , asking him for a passport to return to England through Scotland .

In October 1560 , he secretly met with the English ambassador , Nicolas
Throckmorton , in Paris , and asked him for a passport to return to Scotland
through England .

Label: 0

Class distribution

Now we check whether we have an imbalanced dataset.

Note: Check your understanding: How many bars will the frequency chart show when checking the class distribution?
import matplotlib.pyplot as plt

df["label"].value_counts(ascending=True).plot.barh()
plt.title("Frequency of Classes")
plt.show()

[Bar chart: Frequency of Classes]

We see that we have more examples from the negative class, but I would not consider the dataset imbalanced. Nevertheless, it is good to keep it in mind.
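To put a number on it, we can also look at the relative class frequencies:

df["label"].value_counts(normalize=True)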

How long are the sentences?

We will take the two sentences together as input to our model, so we will consider the length to be the total number of words in the pair. This is only an approximation of the input size; the actual number of tokens depends on the tokenization.

df["Words Per Pair"] = df["sentence1"].str.split().apply(len) +\
                        df["sentence2"].str.split().apply(len)
df.boxplot("Words Per Pair", by="label", grid=False, showfliers=False,
           color="black")
plt.suptitle("")
plt.xlabel("")
plt.show()
/usr/local/lib/python3.7/dist-packages/matplotlib/cbook/__init__.py:1376: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  X = np.atleast_1d(X.T if isinstance(X, np.ndarray) else np.asarray(X))
[Boxplot: Words Per Pair, grouped by label]

In both cases we can see the number of words is between 40 and 50 on average, and it does not go beyond 70.

Note: For applications using DistilBERT, the maximum context size is 512 tokens.
paws.reset_format()

Tokenization

We will jump directly into sub-word tokenization.

from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

We can get some information about the tokenizer.

print("Vocabulary size: " + str(tokenizer.vocab_size))
print("Model max length: " + str(tokenizer.model_max_length))
print("Model input names: " + str(tokenizer.model_input_names))
Vocabulary size: 30522
Model max length: 512
Model input names: ['input_ids', 'attention_mask']
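As a quick sanity check (not from the book), we can tokenize the pair with the most words and confirm it stays well below the 512-token limit mentioned above. This assumes the `df` DataFrame with the "Words Per Pair" column from before is still available.

longest = df.sort_values("Words Per Pair", ascending=False).iloc[0]
n_tokens = len(tokenizer(longest["sentence1"] + " " + longest["sentence2"])["input_ids"])
print("Tokens in the longest pair: " + str(n_tokens))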

The tokenizer function

As I mentioned earlier, we will concatenate the two sentences and give them as a single input to our model. We could do this while tokenizing, but since this is an example meant to build understanding, we will do it as a separate step.
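As a side note, the tokenizer can handle sentence pairs natively: passing the two sentences as separate arguments makes it insert the special tokens for us. A minimal sketch of that alternative (we will not use it here):

def tokenize_pairs(batch):
    # The tokenizer builds "[CLS] sentence1 [SEP] sentence2 [SEP]" by itself
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     padding=True, truncation=True)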

First we write a function to merge the two sentences. Note that we insert a [SEP] token to separate them; this way we let the model know that there are two different sentences.

Question: Should we insert [SEP][CLS] instead, so that we would have one classification token per sentence?

Answer: Coming soon.

import numpy as np

def fuse_sentences(batch):
    s1 = np.core.defchararray.add(np.array(batch["sentence1"]), np.array("[SEP]"))
    s2 = np.array(batch["sentence2"])
    return {"sentences":  np.core.defchararray.add( s1, s2 ) }
paws_fused = paws.map(fuse_sentences, batched=True, batch_size=None)
print(paws_fused["train"].column_names)
['id', 'label', 'sentence1', 'sentence2', 'sentences']
print(paws_fused["train"][0]["sentence1"] + "\n")
print(paws_fused["train"][0]["sentence2"] + "\n")
print(paws_fused["train"][0]["sentences"] + "\n")
In Paris , in October 1560 , he secretly met the English ambassador , Nicolas
Throckmorton , asking him for a passport to return to England through Scotland .

In October 1560 , he secretly met with the English ambassador , Nicolas
Throckmorton , in Paris , and asked him for a passport to return to Scotland
through England .

In Paris , in October 1560 , he secretly met the English ambassador , Nicolas
Throckmorton , asking him for a passport to return to England through Scotland
.[SEP]In October 1560 , he secretly met with the English ambassador , Nicolas
Throckmorton , in Paris , and asked him for a passport to return to Scotland
through England .

Now we define the function to tokenize the two sentences together.

def tokenize(batch):
    return tokenizer(batch["sentences"], padding=True, truncation=True)
paws_encoded = paws_fused.map(tokenize, batched=True, batch_size=None)
print(paws_encoded["train"].column_names)
['attention_mask', 'id', 'input_ids', 'label', 'sentence1', 'sentence2',
'sentences']

Let's check if the tokens have been inserted correctly.

tokenizer.convert_ids_to_tokens(paws_encoded["train"][0]["input_ids"])
['[CLS]',
 'in',
 'paris',
 ',',
 'in',
 'october',
 '1560',
 ',',
 'he',
 'secretly',
 'met',
 'the',
 'english',
 'ambassador',
 ',',
 'nicolas',
 'th',
 '##rock',
 '##mo',
 '##rton',
 ',',
 'asking',
 'him',
 'for',
 'a',
 'passport',
 'to',
 'return',
 'to',
 'england',
 'through',
 'scotland',
 '.',
 '[SEP]',
 'in',
 'october',
 '1560',
 ',',
 'he',
 'secretly',
 'met',
 'with',
 'the',
 'english',
 'ambassador',
 ',',
 'nicolas',
 'th',
 '##rock',
 '##mo',
 '##rton',
 ',',
 'in',
 'paris',
 ',',
 'and',
 'asked',
 'him',
 'for',
 'a',
 'passport',
 'to',
 'return',
 'to',
 'scotland',
 'through',
 'england',
 '.',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]']

The model

We import the pre-trained model from the Hub. It does not include the masked language modeling head used during pre-training, which maps the latent representations back to tokens to predict the masked words.

import torch
from transformers import AutoModel

model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)

Classifier

First, we will use a simple logistic regression on the latent representations of the inputs.

Here is a function to extract the last hidden state. Specifically, we are interested in the representation of the [CLS] token, which serves as a "summary" of the whole input.

def extract_hidden_states(batch):
    # Place model inputs on the GPU
    inputs = {k:v.to(device) for k,v in batch.items() 
              if k in tokenizer.model_input_names}
    # Extract last hidden states
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    # Return vector for [CLS] token
    return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}

Question: Would it make more sense to have two [CLS] tokens (one per sentence) and concatenate their representations?

Answer: Coming soon.

The inputs to DistilBERT and the labels have to be PyTorch tensors, so we have to change the format of those columns.

tokenizer.model_input_names
['input_ids', 'attention_mask']
paws_encoded.set_format("torch", 
                        columns=["input_ids", "attention_mask", "label"])
paws_hidden = paws_encoded.map(extract_hidden_states, batched=True)
paws_hidden["train"].column_names
['attention_mask',
 'hidden_state',
 'id',
 'input_ids',
 'label',
 'sentence1',
 'sentence2',
 'sentences']

Bridge to Scikit-Learn

Convert the output of the model to NumPy arrays, since Scikit-Learn works with those rather than with PyTorch tensors.

Note: Check your understanding: What will be the dimensions of X_train as defined in the following code cell?
import numpy as np

X_train = np.array(paws_hidden["train"]["hidden_state"])
X_valid = np.array(paws_hidden["validation"]["hidden_state"])
y_train = np.array(paws_hidden["train"]["label"])
y_valid = np.array(paws_hidden["validation"]["label"])
X_train.shape, X_valid.shape

((49401, 768), (2000, 768))

Dimensionality Reduction

A good practice is to take a look at the vectors representing each pair of sentences. But they are 768-dimensional! Dimensionality reduction will allow us to project them into a lower-dimensional space.

from umap import UMAP
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Scale features to [0,1] range
X_scaled = MinMaxScaler().fit_transform(X_train)
# Initialize and fit UMAP
mapper = UMAP(n_components=2, metric="cosine").fit(X_scaled)
# Create a DataFrame of 2D embeddings
df_emb = pd.DataFrame(mapper.embedding_, columns=["X", "Y"])
df_emb["label"] = y_train
df_emb.head()
X Y label
0 -1.060788 4.195936 0
1 -5.594001 14.602888 1
2 2.748996 3.122407 0
3 6.510664 7.842440 1
4 5.438829 0.523094 1

Now we can visualize how the data points are distributed in a 2D space for each class.

fig, axes = plt.subplots(1, 2, figsize=(7,5))
axes = axes.flatten()
cmaps = ["Blues", "Reds"]
labels = paws["train"].features["label"].names

for i, (label, cmap) in enumerate(zip(labels, cmaps)):
    df_emb_sub = df_emb.query(f"label == {i}")
    axes[i].hexbin(df_emb_sub["X"], df_emb_sub["Y"], cmap=cmap,
                   gridsize=20, linewidths=(0,))
    axes[i].set_title(label)
    axes[i].set_xticks([]), axes[i].set_yticks([])

plt.tight_layout()
plt.show()
[Hexbin plots of the 2D UMAP embeddings, one panel per class]

Important: In our case the two classes look similar, but remember that this does not mean they are not separable in the original higher-dimensional space.

Training the classifier

# We increase `max_iter` to guarantee convergence 
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(max_iter=3000)
lr_clf.fit(X_train, y_train)
lr_clf.score(X_valid, y_valid)
0.5785

We can compare it with a dummy classifier. In our case, the dummy strategy that performed best was always predicting the most frequent class of the training set.

from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_valid, y_valid)
0.5685

So our classifier does not perform very well; it is almost as bad as the dummy one! Lastly, we can plot the confusion matrix of the predictions to see what is going on.

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_confusion_matrix(y_preds, y_true, labels):
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()
    
y_preds = lr_clf.predict(X_valid)
plot_confusion_matrix(y_preds, y_valid, labels)
[Normalized confusion matrix of the logistic regression predictions]

We see that our classifier tends to predict the 0 label no matter what the input is, similarly to the dummy one 🤔
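To see this with numbers rather than colors, we could also print a per-class report (an optional check, not in the book):

from sklearn.metrics import classification_report

print(classification_report(y_valid, y_preds, target_names=labels))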

Finetuning

After this setback, let's see if we can do better by using a small fully connected network as the classifier and backpropagating the gradients through the whole model.

from transformers import AutoModelForSequenceClassification

num_labels = 2
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt, num_labels=num_labels)
         .to(device))
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
from huggingface_hub import notebook_login

notebook_login()
Login successful
Your token has been saved to /root/.huggingface/token
from transformers import Trainer, TrainingArguments

batch_size = 64
logging_steps = len(paws_encoded["train"]) // batch_size
model_name = f"{model_ckpt}-finetuned-paws"
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  push_to_hub=True, 
                                  log_level="error")

Question: Which loss are we using to train the model? How could we change it?

Answer: Coming soon.
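My guess while waiting for the session: with `num_labels=2`, `AutoModelForSequenceClassification` computes a standard cross-entropy loss internally. If we wanted a different loss (for example, a class-weighted cross-entropy to compensate the slight imbalance), one option would be to subclass `Trainer` and override `compute_loss`. A rough sketch, where the class weights are made-up placeholders:

import torch
from torch import nn

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Hypothetical class weights; they would need to be tuned
        weights = torch.tensor([1.0, 1.3], device=logits.device)
        loss = nn.CrossEntropyLoss(weight=weights)(logits, labels)
        return (loss, outputs) if return_outputs else loss

We would then pass `WeightedLossTrainer` instead of `Trainer` in the cell below.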

from transformers import Trainer

trainer = Trainer(model=model, args=training_args, 
                  compute_metrics=compute_metrics,
                  train_dataset=paws_encoded["train"],
                  eval_dataset=paws_encoded["validation"],
                  tokenizer=tokenizer)
trainer.train();
/content/notebooks/distilbert-base-uncased-finetuned-paws is already a clone of https://huggingface.co/XaviXva/distilbert-base-uncased-finetuned-paws. Make sure you pull the latest changes with `repo.git_pull()`.
[1544/1544 20:05, Epoch 2/2]
Epoch Training Loss Validation Loss Accuracy F1
1 0.671500 0.598167 0.678500 0.679950
2 0.427800 0.385005 0.835500 0.836158


Several commits (4) will be pushed upstream.

This looks more promising than the simple classifier. We can take a deeper look by checking the predictions and the confusion matrix on the validation set.

preds_output = trainer.predict(paws_encoded["validation"])
preds_output.metrics
{'test_loss': 0.38500452041625977,
 'test_accuracy': 0.8355,
 'test_f1': 0.8361579553422098,
 'test_runtime': 6.5148,
 'test_samples_per_second': 306.991,
 'test_steps_per_second': 4.912}
y_preds = np.argmax(preds_output.predictions, axis=1)
plot_confusion_matrix(y_preds, y_valid, labels)
[Normalized confusion matrix of the fine-tuned model's predictions]

Note: I am impressed that we achieve relatively good results with only two epochs of training.

So it definitely performs better. We can look inside the model to see exactly which classification head has been added.

model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (1): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (2): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (3): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (4): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (5): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
      )
    )
  )
  (pre_classifier): Linear(in_features=768, out_features=768, bias=True)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)

So the head we have added consists of:

  • pre_classifier
  • classifier
  • dropout
from torch import nn

# Group the head modules just to count their parameters
# (not the actual forward order, which applies a ReLU and dropout before the classifier)
head = nn.Sequential(model.pre_classifier, model.classifier, model.dropout)
head
Sequential(
  (0): Linear(in_features=768, out_features=768, bias=True)
  (1): Linear(in_features=768, out_features=2, bias=True)
  (2): Dropout(p=0.2, inplace=False)
)
num_counted_elements = 0
for param in head.parameters():
  num_counted_elements += param.numel()

print("The number of parameters of the head is: " + str(num_counted_elements))
The number of parameters of the head is: 592130

So we have added a much more powerful classifier than the logistic regression one. Moreover, remember that we are now backpropagating through the whole model, so we are training far more parameters.

num_counted_elements = 0
for param in model.parameters():
  num_counted_elements += param.numel()

print("The number of parameters of the model is: " + str(num_counted_elements))
The number of parameters of the model is: 66955010
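For contrast, if we wanted to stay closer to the feature-extraction approach while still using the new head, we could freeze the pretrained body so that only the head is updated. A minimal sketch of that idea (we do not do this here):

# Hypothetical: freeze the DistilBERT body so only the head is trained
for param in model.distilbert.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Trainable parameters after freezing: " + str(trainable))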

Error analysis

For our error analysis we will use the binary cross-entropy loss, since we only have two classes.

from torch.nn.functional import binary_cross_entropy_with_logits

def forward_pass_with_label(batch):
    # Place all input tensors on the same device as the model
    inputs = {k:v.to(device) for k,v in batch.items() 
              if k in tokenizer.model_input_names}

    with torch.no_grad():
        output = model(**inputs)
        pred_label = torch.argmax(output.logits, axis=-1)
        # note we use only the logit of class 1, i.e. output.logits[:,1]
        loss = binary_cross_entropy_with_logits(output.logits[:,1],
                                                batch["label"].float().to(device), reduction="none")

    # Place outputs on CPU for compatibility with other dataset columns   
    return {"loss": loss.cpu().numpy(), 
            "predicted_label": pred_label.cpu().numpy()}
# Convert our dataset back to PyTorch tensors
paws_encoded.set_format("torch", 
                            columns=["input_ids", "attention_mask", "label"])
# Compute loss values
paws_encoded["validation"] = paws_encoded["validation"].map(
    forward_pass_with_label, batched=True, batch_size=16)
paws_encoded.set_format("pandas")
cols = ["sentence1","sentence2", "label", "predicted_label", "loss"]
df_test = paws_encoded["validation"][:][cols]
df_test.sort_values("loss", ascending=False).head(10)

sentence1 sentence2 label predicted_label loss
45 ACVM is headquartered in Edinburgh and has off... ACVM is based in Glasgow and has subsidiaries ... 0 1 1.982430
939 Another way to control the population of deers... Another way to regulate the population of deer... 1 0 1.923286
1257 1i Productions is an American board game publi... 1i Productions is an American board game , fou... 0 1 1.902109
357 The medals were presented by Barbara Kendall ,... The medals were handed over by Carlo Croce , I... 1 0 1.883666
728 In other articles , it applauded the economic ... In other articles it praised the economic idea... 1 0 1.853208
1317 The medals were presented by Barbara Kendall ,... The medals have been presented by Carlo Croce ... 1 0 1.831070
1468 As a small composer in the French school , he ... A minor composer in the French school , as a c... 1 0 1.725615
867 Following her success , Jane Campion hired Jon... Following her success , Jane Campion Jones was... 1 0 1.725041
360 In November 1989 , Delaney became a member of ... In November 1989 , Delaney became a member of ... 0 1 1.698835
1559 Brewarrina Shire comprises Brewarrina and the ... Brewarrina Shire and the villages of Gongolgon... 0 1 1.681595

We can see that the model confuses both classes similarly, as is also shown in the confusion matrix.

df_test.sort_values("loss", ascending=True).head(10)

sentence1 sentence2 label predicted_label loss
784 The racial Rubber Bowl was used by the Nationa... The historic Rubber Bowl was used by the Natio... 0 0 0.113665
16 The racial Rubber Bowl was used by the Nationa... The historic Rubber Bowl was used by the Natio... 0 0 0.115467
20 Earl St Vincent was a British ship that was ca... Earl St Vincent was a French ship that was cap... 0 0 0.117033
1440 The historic Rubber Bowl was used by the Natio... The racial Rubber Bowl was used by the Nationa... 0 0 0.117698
1854 Mark Knowles and Daniel Nestor won the title ,... Jonathan Erlich and Andy Ram won the title and... 0 0 0.120314
377 Ruby died in Woodland Hills , California and w... Ruby died in Los Angeles and was buried in the... 0 0 0.120729
111 In 284 BC , King Qi met King Zhao of Qin in we... In 284 BC , King Xi met with King Zhao of Qin ... 0 0 0.121336
786 Between 1940 and 1945 he served as Canada 's f... Between 1940 and 1945 , he served as Australia... 0 0 0.122137
1281 In 284 BC , King Qi met King Zhao of Qin in we... In 284 BCE , King Xi met with King Zhao of Qin... 0 0 0.122312
1226 In the summer of 1924 he went to Smackover in ... In the summer of 1924 , he went to Union Count... 0 0 0.122760

Nevertheless, we also see that the predictions it is most confident about are the ones where the sentences are not paraphrases of each other. This may be caused by the small imbalance in our dataset, so we should account for it when using the model's predictions.
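One simple way to account for it at prediction time would be to move the decision threshold instead of taking a plain argmax. A rough sketch; the 0.4 threshold is an arbitrary assumption that would have to be tuned on the validation set:

import torch

# Convert the logits from the earlier predict() call into probabilities
probs = torch.softmax(torch.tensor(preds_output.predictions), dim=-1).numpy()
# Predict the paraphrase class whenever its probability exceeds 0.4 (instead of 0.5)
y_preds_adjusted = (probs[:, 1] >= 0.4).astype(int)
print("Adjusted accuracy: " + str(accuracy_score(y_valid, y_preds_adjusted)))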

Saving and sharing

Time to share the model! How to share and fetch models and datasets from the Hub is one of the parts I am most interested in learning about. These things are essential for working in today's ML industry and research!

trainer.push_to_hub(commit_message="Training completed!")
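Once the push finishes, anyone should be able to load the fine-tuned checkpoint back, for example through the pipeline API. A minimal sketch, using the repository name from the training logs above and a made-up pair of sentences fused with [SEP] exactly as during training:

from transformers import pipeline

classifier = pipeline("text-classification",
                      model="XaviXva/distilbert-base-uncased-finetuned-paws")
classifier("He was born in Paris and moved to London in 1890 ."
           "[SEP]He was born in London and moved to Paris in 1890 .")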