NLP with Transformers: Text Classification
First post of the "NLP with Transformers" series. We will apply the concepts of Chapter 2 to a different problem. Will DistilBERT be capable of distinguishing between a paraphrase and a sentence with a different meaning?
- Introduction
- The dataset
- Tokenization
- The model
- Classifier
- Finetuning
- Error analysis
- Saving and sharing
This is the first post of a series that will go through the "NLP with Transformers" book. I am following the reading group organized by Hugging Face, with the authors themselves! So I thought the best way to properly understand each chapter would be to try to apply the concepts to a different example (that is when the problems happen 😅). My idea is to post one of these every two weeks, before the reading group session. However, life gets in the way, so I cannot promise anything. Do not expect fancy or innovative things; the code will be similar to the one in the book. Of course I would like to go deeper, but time is limited ⏳ and this is just to enhance my learning (and the learning of whoever wants to follow the book). Enough introduction, now let's go with my first dive into NLP!
from datasets import load_dataset
paws = load_dataset("paws-x", "en")
paws
So we have a dataset where each example is a pair of sentences, with a label. The label indicates whether or not one of the sentences is a paraphrase of the other: 0 indicates the pair has a different meaning, while 1 indicates the pair is a paraphrase. Our goal will be to use a pre-trained DistilBERT model to distinguish them.
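Just to get a feel for the raw data, we can peek at a single example first:
# Each example is a dict with the two sentences and the label
paws["train"][0]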
Let's get a pandas DataFrame to easily explore our data.
paws.set_format(type="pandas")
df = paws["train"][:]
df.head()
Here we can see an example of a paraphrase:
print(df.iloc[1]["sentence1"] + "\n")
print(df.iloc[1]["sentence2"] + "\n")
print("Label: " + str(df.iloc[1]["label"]))
And here a "non-paraphrase" pair:
print(df.iloc[0]["sentence1"] + "\n")
print(df.iloc[0]["sentence2"] + "\n")
print("Label: " + str(df.iloc[0]["label"]))
Now we check whether we have an imbalanced dataset.
import matplotlib.pyplot as plt
df["label"].value_counts(ascending=True).plot.barh()
plt.title("Frequency of Classes")
plt.show()
We see that we have more examples of the negative class, but I would not consider the dataset imbalanced. Nevertheless, it is good to keep in mind.
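To put a number on it, we can also print the exact class proportions:
# Fraction of examples per class, to quantify the imbalance seen in the plot
df["label"].value_counts(normalize=True)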
We will feed the two sentences together as input to our model. Hence we will take the length of an example to be the sum of the word counts of the two sentences. This is just an approximation of the input size; depending on the tokenization it may differ.
df["Words Per Pair"] = df["sentence1"].str.split().apply(len) +\
df["sentence2"].str.split().apply(len)
df.boxplot("Words Per Pair", by="label", grid=False, showfliers=False,
color="black")
plt.suptitle("")
plt.xlabel("")
plt.show()
In both cases we can see the number of words is between 40 and 50 on average, and it does not go beyond 70.
paws.reset_format()
We will jump directly into sub-word tokenization.
from transformers import AutoTokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
We can get some information about the tokenizer.
print("Vocabulary size: " + str(tokenizer.vocab_size))
print("Model max length: " + str(tokenizer.model_max_length))
print("Model input names: " + str(tokenizer.model_input_names))
As mentioned earlier, we will concatenate the two sentences and give them as input to our model. We could do it while tokenizing them, but because this is an example to gain understanding we will do it separately.
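For reference, here is a sketch of the all-in-one alternative: the tokenizer accepts sentence pairs directly and inserts the special tokens for us (we will not use this in the rest of the post):
# Encoding a sentence pair directly: [CLS] and [SEP] are added automatically
pair = tokenizer(paws["train"][0]["sentence1"], paws["train"][0]["sentence2"])
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))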
First we make a function to merge the two sentences. Note that we insert a [SEP] token to separate them; this way we let the model know that they are two different sentences.
Question: Should we insert [SEP][CLS] instead, so that we would have one classification token per sentence?
Answer: Coming soon.
import numpy as np
def fuse_sentences(batch):
    # Concatenate the two sentences of each pair with a [SEP] token in between
    s1 = np.char.add(np.array(batch["sentence1"]), "[SEP]")
    s2 = np.array(batch["sentence2"])
    return {"sentences": np.char.add(s1, s2)}
paws_fused = paws.map(fuse_sentences, batched=True, batch_size=None)
print(paws_fused["train"].column_names)
print(paws_fused["train"][0]["sentence1"] + "\n")
print(paws_fused["train"][0]["sentence2"] + "\n")
print(paws_fused["train"][0]["sentences"] + "\n")
Now we define the function to tokenize the two sentences together.
def tokenize(batch):
return tokenizer(batch["sentences"], padding=True, truncation=True)
paws_encoded = paws_fused.map(tokenize, batched=True, batch_size=None)
print(paws_encoded["train"].column_names)
Let's check if the tokens have been inserted correctly.
tokenizer.convert_ids_to_tokens(paws_encoded["train"][0]["input_ids"])
We import the pre-trained model from the Hub. It does not include the masked language modeling head, which uses the latent representations of the input to predict the masked tokens; we only need the body that produces those representations.
import torch
from transformers import AutoModel
model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)
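As a quick sanity check (a sketch, not needed for what follows), we can run a single fused example through the model and look at the shape of its output:
# The last hidden state has shape [batch_size, seq_len, hidden_dim]
sample = tokenizer(paws_fused["train"][0]["sentences"], return_tensors="pt").to(device)
with torch.no_grad():
    print(model(**sample).last_hidden_state.shape)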
First, we will train a simple logistic regression classifier on the latent representations of the inputs.
Here is a function to extract the last hidden state. Specifically, we are interested in the [CLS] token, which will serve as a "summary" of the whole input.
def extract_hidden_states(batch):
# Place model inputs on the GPU
inputs = {k:v.to(device) for k,v in batch.items()
if k in tokenizer.model_input_names}
# Extract last hidden states
with torch.no_grad():
last_hidden_state = model(**inputs).last_hidden_state
# Return vector for [CLS] token
return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}
Question: Would it make more sense if we had two [CLS] tokens (one per sentence) and concatenated them?
Answer: Coming soon.
The inputs to DistilBERT and the labels have to be torch tensors, so we have to change the type of those columns.
tokenizer.model_input_names
paws_encoded.set_format("torch",
columns=["input_ids", "attention_mask", "label"])
paws_hidden = paws_encoded.map(extract_hidden_states, batched=True)
paws_hidden["train"].column_names
Convert the output of the model to NumPy arrays. This is because scikit-learn works with those, not with PyTorch tensors.
import numpy as np
X_train = np.array(paws_hidden["train"]["hidden_state"])
X_valid = np.array(paws_hidden["validation"]["hidden_state"])
y_train = np.array(paws_hidden["train"]["label"])
y_valid = np.array(paws_hidden["validation"]["label"])
X_train.shape, X_valid.shape
A good practice is to take a look at the vectors that we have representing each pair of sentences. But they are 768-dimensional! Dimensionality reduction will allow us to project them into a lower-dimensional space.
from umap import UMAP
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Scale features to [0,1] range
X_scaled = MinMaxScaler().fit_transform(X_train)
# Initialize and fit UMAP
mapper = UMAP(n_components=2, metric="cosine").fit(X_scaled)
# Create a DataFrame of 2D embeddings
df_emb = pd.DataFrame(mapper.embedding_, columns=["X", "Y"])
df_emb["label"] = y_train
df_emb.head()
Now we can visualize how the data points are distributed in a 2D space for each class.
fig, axes = plt.subplots(1, 2, figsize=(7,5))
axes = axes.flatten()
cmaps = ["Blues", "Reds"]
labels = paws["train"].features["label"].names
for i, (label, cmap) in enumerate(zip(labels, cmaps)):
df_emb_sub = df_emb.query(f"label == {i}")
axes[i].hexbin(df_emb_sub["X"], df_emb_sub["Y"], cmap=cmap,
gridsize=20, linewidths=(0,))
axes[i].set_title(label)
axes[i].set_xticks([]), axes[i].set_yticks([])
plt.tight_layout()
plt.show()
# We increase `max_iter` to guarantee convergence
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression(max_iter=3000)
lr_clf.fit(X_train, y_train)
lr_clf.score(X_valid, y_valid)
We can compare it with a dummy classifier. In our case the strategy that performed best was always predicting the most frequent class of the training set.
from sklearn.dummy import DummyClassifier
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_valid, y_valid)
So our classifier does not perform really well. It is almost like the dummy one! Lastly, we can plot the confusion matrix of the predictions to see what is going on.
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
def plot_confusion_matrix(y_preds, y_true, labels):
cm = confusion_matrix(y_true, y_preds, normalize="true")
fig, ax = plt.subplots(figsize=(6, 6))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
plt.title("Normalized confusion matrix")
plt.show()
y_preds = lr_clf.predict(X_valid)
plot_confusion_matrix(y_preds, y_valid, labels)
We see that our classifier tends to output the 0 label no matter what the input is, similarly to the dummy one 🤔
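To quantify this per class, we can print a classification report for the baseline (a small sketch using scikit-learn):
from sklearn.metrics import classification_report
# Per-class precision, recall and F1 of the logistic regression baseline
print(classification_report(y_valid, y_preds, target_names=labels))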
After this setback, let's see if we can perform better by using a fully connected NN as a classifier and backpropagating the gradient through the whole model.
from transformers import AutoModelForSequenceClassification
num_labels = 2
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = (AutoModelForSequenceClassification
.from_pretrained(model_ckpt, num_labels=num_labels)
.to(device))
from sklearn.metrics import accuracy_score, f1_score
def compute_metrics(pred):
labels = pred.label_ids
preds = pred.predictions.argmax(-1)
f1 = f1_score(labels, preds, average="weighted")
acc = accuracy_score(labels, preds)
return {"accuracy": acc, "f1": f1}
from huggingface_hub import notebook_login
notebook_login()
from transformers import Trainer, TrainingArguments
batch_size = 64
logging_steps = len(paws_encoded["train"]) // batch_size
model_name = f"{model_ckpt}-finetuned-paws"
training_args = TrainingArguments(output_dir=model_name,
num_train_epochs=2,
learning_rate=2e-5,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
weight_decay=0.01,
evaluation_strategy="epoch",
disable_tqdm=False,
logging_steps=logging_steps,
push_to_hub=True,
log_level="error")
Question: Which loss are we using to train the model? How could we change it?
Answer: Coming soon.
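While the full answer will come later, here is a sketch of the mechanism: when labels are provided, the sequence classification model computes a cross-entropy loss internally, and one way to change it is to subclass Trainer and override compute_loss (the class name and weighting below are just illustrative assumptions):
from transformers import Trainer
import torch.nn as nn
class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Illustrative: a weighted cross-entropy instead of the default unweighted one
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 1.2], device=model.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss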
from transformers import Trainer
trainer = Trainer(model=model, args=training_args,
compute_metrics=compute_metrics,
train_dataset=paws_encoded["train"],
eval_dataset=paws_encoded["validation"],
tokenizer=tokenizer)
trainer.train();