Fine Tuning Embedding Models using Sentence Transformers: Code Included


Introduction

Have you ever wondered how Google text/image search works, or how recommendation systems work under the hood? They use embedding models. Embedding models convert text into vector embeddings, and these vector embeddings represent the semantic similarity between two things, whether they are images or text. If you want to learn about vector embeddings in more detail, I have already written a blog about it here.

There are many embedding models available on the market, but what if you run an ecommerce business and want to implement your own recommendation system based on the similarity of your products? Then you have to fine-tune or train your own embedding model. Sounds hard, right 👀? Actually, it’s not. The Sentence Transformers library makes it easy to train or fine-tune your own embedding model.

In this blog, we will first learn about the Sentence Transformers library and then train our own model from scratch.

💡 You can get the full source code discussed in this blog from our GitHub repo

What are embedding models?

Embedding models are a type of language model that can convert a given text or piece of media into vector embeddings. These embeddings are stored in a vector space, where you can perform different operations on them to get the desired results. For example, you can run a semantic search to find results similar to a given sentence, or compute a similarity score that shows how similar two sentences are.

This is what vector embeddings look like:

As you can see in the diagram, similar sentences end up closer to each other in the vector space, which makes it easy to find the sentences similar to a given sentence. Also, feel free to check out my other blog, in which I explain vector embeddings in more detail.

What are sentence transformers?

Sentence Transformers is a library built specifically for creating and fine-tuning embedding models for sentences. You can use it to generate embeddings for your sentences, get the similarity score between two or more sentences, or run a semantic search for a sentence. You can also easily fine-tune an existing embedding model for a specific task or train your own model from scratch.

Let’s see some of the features of Sentence Transformers in action!

Converting text to embeddings

Let’s first try converting a given text into embeddings. You might have used OpenAI embeddings to get the embeddings of a given text, but they charge money for their embedding model. As an alternative, you can use models from Hugging Face or any other open-source embedding model with Sentence Transformers to generate vector embeddings for your sentences.

Let’s see how we can generate embeddings using the all-MiniLM-L6-v2 model.

First, install the Sentence Transformers library using pip:


!pip install sentence-transformers

Now let’s load our model. You can also use another model of your choice.


# Load a pretrained embedding model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

Now we will define the sentences for which we want to generate embeddings in a list, and then use the “encode” method of our model to generate the embeddings.


# sentences we like to encode
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of string.",
    "The quick brown fox jumps over the lazy dog.",
]

# Sentences are encoded by calling model.encode()
sentence_embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

After running the above code, we can see the 384-dimensional embedding generated for each sentence.
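As a quick sanity check, you can also print the shape of the returned array; the exact dimension depends on the model you loaded:


# One vector per sentence; all-MiniLM-L6-v2 produces 384-dimensional embeddings
print(sentence_embeddings.shape)  # expected: (3, 384)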

Cosine Similarity between Sentences

You can use cosine similarity to find out how similar two sentences are. Sentence Transformers lets us compute the cosine similarity score between two sentences, so let’s see it in action!

First, we will import the required modules and convert our sentences into embeddings using the same model we used before:


 # Finding cosine similarity
from sentence_transformers import SentenceTransformer, util

# Sentences are encoded by calling model.encode()
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my red cat?")

Now we can find the cosine similarity between these two embeddings using the “util.cos_sim” method:


cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

After running the above code, you will see a similarity score that shows how similar these two sentences are (the closer it is to 1, the more similar the sentences; the closer it is to 0, the less similar they are).
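If you are curious what “util.cos_sim” computes under the hood, here is a rough sketch of the same calculation done manually with NumPy (the dot product of the two vectors divided by the product of their norms):


import numpy as np

# Cosine similarity = dot(a, b) / (||a|| * ||b||)
manual_cos_sim = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
print("Manual Cosine-Similarity:", manual_cos_sim)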

Semantic Search

We discussed Google search and Google image search earlier, and both are built on semantic search. In semantic search, you have a query (it can be a sentence or an image), you convert that query into an embedding, and then you find the sentence embeddings most similar to the query embedding using cosine similarity.

Once we have the similarity scores for the different sentences, we sort the sentences by score in descending order, so the most similar sentence (the one with the highest similarity score) is at the top, and we can specify the number of similar sentences we want as “k”.

Let’s see it in action!

First, we will define the existing sentences that act as our database, meaning we want to find the top k similar sentences from this list. We have to convert these sentences into embeddings so that we can compute cosine similarity on them.


from sentence_transformers import SentenceTransformer, util
import torch

# Corpus with example sentences
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

Now we will define our queries, and for each query we will find the top 3 most similar sentences from the corpus.


# Query sentences:
queries = [
    "A man is eating pasta.",
    "Someone in a gorilla costume is playing a set of drums.",
    "A cheetah chases prey on across a field.",
]
top_k = 3
# Traverse queries
for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 3 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:")
    # Printing the top 3 similar sentences with their similarity scores
    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))

Instead of using “util.cos_sim” and then getting the top k results yourself, you can use the “util.semantic_search” method to do the same thing more easily.


top_k = 3
for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)
    # semantic_search computes cosine similarity and returns the top_k hits per query
    similar_results = util.semantic_search(
        query_embeddings=query_embedding, corpus_embeddings=corpus_embeddings, top_k=top_k
    )
    print("===============\n")
    print(f"Similar Sentences for '{query}'")
    for result in similar_results[0]:
        print(f"{corpus[result['corpus_id']]} (score: {result['score']})")

Now that we know how to use Sentence Transformers, let’s take a look at how to train or fine-tune our own model.

💡 You can get the full source code discussed above from our GitHub repo

Before we train: Preparing the Model

Before we dive into training, let’s first prepare the model we want to fine-tune:

Selecting a Model

You can use any “sentence similarity” model from Hugging Face, or any other open-source model, with Sentence Transformers. Once you pick a model, you can load it just like we loaded one above.

We will use the “bert-base-uncased” model as our base model and limit it to a maximum sequence length of 256; texts longer than that will be truncated.


# Load BERT as the base transformer (word embedding) model
from sentence_transformers import SentenceTransformer, models
import torch

word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)

Check out the list of models on the Sentence Transformers website to pick the model of your choice.

Pooling

BERT produces contextualized word embeddings for all input tokens in our text, so the output size varies with the number of tokens. If you want a fixed-size output representation, you need to add a pooling layer.

Here we want 768-dimensional sentence embeddings (the hidden size of BERT base), so we pass that size to the pooling layer. You can easily get the embedding dimension of any model using the “get_word_embedding_dimension()” method.


pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

Additionally, if you want to reduce the dimensionality of your output embeddings, you can add a dense layer after pooling.


from torch import nn

dense_model = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=256,
    activation_function=nn.Tanh(),
)
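Finally, you can chain these modules together into a single SentenceTransformer model. A minimal sketch (the dense layer is optional, as noted above):


from sentence_transformers import SentenceTransformer

# Chain the modules: transformer -> pooling -> (optional) dense layer
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
print(model.get_sentence_embedding_dimension())  # 256, because of the dense layer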

Preparing a Dataset and Loss Function

To train a SentenceTransformer model, you need to inform it somehow that two sentences have a certain degree of similarity. Therefore, each example in the data requires a label or structure that allows the model to understand whether two sentences are similar or different.

The training data totally depends on your goal and the structure of your data. There are different types of datasets that you can prepare for your model, but the main goal of every dataset is to define the similarity between two or more sentences.

Here are some of the popular dataset types:

  • Pair of sentences with label: Every example in this dataset has a pair of sentences and a label that shows whether they are similar or not. This applies to datasets originally prepared for Natural Language Inference (NLI), since they contain pairs of sentences with a label indicating whether one entails the other.
  • Pair of sentences without label: Every example in this dataset has a pair of sentences that are implied to be similar. For example, pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (query, response), or pairs of (source_language, target_language).
  • Sentence with an integer label: Every example in this dataset has a sentence with an integer label indicating its class. Loss functions can convert this data into triplets consisting of an anchor sentence, positive sentences from the same class as the anchor, and negative sentences from a different class.
  • Triplets without class: Every example in this dataset has an anchor, positive sentences that are similar to the anchor, and negative sentences that are not similar to the anchor. These triplets don’t have any class or label.

For this blog, we are going to use the triplets dataset type and the pair-of-sentences-with-a-label dataset type to train our models.
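To make these dataset types more concrete, here is a rough sketch of how each one maps to the InputExample format that Sentence Transformers expects (the sentences below are made-up placeholders, and the exact label semantics depend on the loss you pair it with):


from sentence_transformers import InputExample

# Pair of sentences with a label (e.g. a similarity score or a class id)
labeled_pair = InputExample(texts=["A man is eating food.", "A man is eating a meal."], label=0.9)

# Pair of sentences without a label (the pairing itself implies similarity)
unlabeled_pair = InputExample(texts=["What is the capital of France?", "Paris is the capital of France."])

# Triplet without a class: (anchor, positive, negative)
triplet = InputExample(texts=[
    "A man is eating food.",              # anchor
    "A man is eating a piece of bread.",  # positive (similar to the anchor)
    "A monkey is playing drums.",         # negative (not similar to the anchor)
])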

Loss Functions

The loss function plays a critical role in model training because it determines how well our embedding model will work for the specific task.

There is no single loss function that works for every model, so you have to pick one that suits your training data and target task.
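As a rough guide (a sketch based on the loss functions that ship with Sentence Transformers, not an exhaustive table), the dataset types above map to losses roughly like this:


from sentence_transformers import losses

# A rough, non-exhaustive mapping:
# - Pair of sentences with a label        -> losses.SoftmaxLoss or losses.CosineSimilarityLoss
# - Pair of sentences without a label     -> losses.MultipleNegativesRankingLoss
# - Sentence with an integer class label  -> losses.BatchHardTripletLoss
# - Triplets without a class              -> losses.TripletLoss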

Training an embedding model

Now we know everything we need to know before training an embedding model so it’s time to get our hands dirty 🚀 !

Sentence Transformers was designed in such a way that fine-tuning your own sentence/text embedding models is easy. It provides most of the building blocks that you can stick together to tune embeddings for your specific task.

Here we will fine-tune a “bert-base-uncased” model using two different types of datasets and then evaluate the performance and results of both models.

Training a model using a triplets dataset

Let’s first load our base model and add pooling to it so that we get a fixed 768-dimensional embedding as output.


from sentence_transformers import SentenceTransformer, models
import torch

word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

Now let’s load our dataset. We are going to use “embedding-data/QQP_triplets”, but you can use any other triplet dataset if you want.


# Load the QQP triplets dataset for training
from datasets import load_dataset

dataset_id = "embedding-data/QQP_triplets"
dataset = load_dataset(dataset_id)

Let’s take a look at what each example in the dataset looks like.
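A quick way to inspect an example is to print it directly. Each row stores its data under a “set” key (this matches the structure we rely on in the code below):


# Each example is stored under a 'set' key containing a query,
# a list of positive sentences ('pos') and a list of negative sentences ('neg')
print(dataset['train'][0])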

As we can see, each example has a query, a list of positive sentences (“pos”) that are similar to the query, and a list of negative sentences (“neg”) that are not similar to it.

We can’t pass these dataset examples directly into our model; we first have to convert them into a format that Sentence Transformers and the model can understand. Every training example must be an “InputExample” in Sentence Transformers, so we will convert our dataset into this format.

We will also take only the first sentence from both the “pos” and “neg” lists to keep things simple, but in a production scenario you might want to use the full lists for better performance and accuracy.


from sentence_transformers import InputExample

train_examples = []
train_data = dataset['train']['set']
# We will only use half of the available data for training
n_examples = dataset['train'].num_rows // 2
for i in range(n_examples):
    example = train_data[i]
    # Each InputExample is a (query, positive, negative) triplet
    train_examples.append(
        InputExample(texts=[example['query'], example['pos'][0], example['neg'][0]])
    )

Now let’s create our dataloader


from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

Now let’s define our loss function. We can use the “losses” module from Sentence Transformers, which gives us the different loss functions we discussed above.

We just have to attach our model to the triplet loss function.


from sentence_transformers import losses
train_loss = losses.TripletLoss(model)

And now we are ready. Let’s combine everything we prepared and fine-tune the model using the “model.fit” method, which takes the dataloader and loss function as training objectives.


# epochs=4 means we make 4 full passes over the training data
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4)
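By the way, “model.fit” also accepts a few optional arguments, such as “warmup_steps” and “output_path” (to save the fine-tuned model locally). Here is a minimal sketch of what that could look like (the output path is just a placeholder):


# Optional: warm up the learning rate for ~10% of the training steps and save the model locally
warmup_steps = int(len(train_dataloader) * 4 * 0.1)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    warmup_steps=warmup_steps,
    output_path="output/bert-base-qqp-triplets",  # placeholder path, change it as you like
)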

Once the training is completed, you will see output like this:

Now let’s push this fine-tuned model to Hugging Face so that we can share it with other people and they can see what we cooked!

First, log in to Hugging Face using your access token.


from huggingface_hub import notebook_login

notebook_login()

After that, call the “save_to_hub” method to push your model to Hugging Face.


model.save_to_hub(
    "distilroberta-base-sentence-transformer-triplets",  # Give a name to your model
    organization="0xSH1V4M",  # Your Hugging Face username
    train_datasets=["embedding-data/QQP_triplets"],
)

And we have successfully fine-tuned and pushed the embedding model!
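Once the model is on the Hugging Face Hub, you (or anyone else) can load it back by its repo id and start encoding right away. A quick sketch, using the username and model name from the snippet above (replace them with your own):


from sentence_transformers import SentenceTransformer

# Load the fine-tuned model back from the Hub (replace with your own username/model name)
finetuned_model = SentenceTransformer("0xSH1V4M/distilroberta-base-sentence-transformer-triplets")
print(finetuned_model.encode(["A man is eating food."]).shape)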

Training a model using labeled sentences dataset

Now let’s try fine-tuning a model using a different dataset. This time we will use a dataset in which each example contains a pair of sentences with a label that defines the relationship between the two sentences.

Let’s first load our model and add pooling to it


from sentence_transformers import SentenceTransformer, models
import torch

word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

We will use the “snli” dataset to train this model, which has the data in the format we discussed above.


from datasets import load_dataset

# Using snli as a dataset
snli = load_dataset('snli', split='train')
# Removing rows without a valid label (label == -1)
snli = snli.filter(lambda x: x['label'] != -1)

Let’s take a look at what each example in the dataset looks like.
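Each row of snli has a premise, a hypothesis, and an integer label (0 = entailment, 1 = neutral, 2 = contradiction), so a quick way to inspect an example is to print it:


# Inspect one training example: a premise, a hypothesis, and an integer label
print(snli[0])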

Now let’s convert each example into the InputExample format.


from sentence_transformers import InputExample
from tqdm.auto import tqdm  # so we see progress bar

train_samples = []
for row in tqdm(snli):
    train_samples.append(InputExample(
        texts=[row['premise'], row['hypothesis']],
        label=row['label']
    ))

Now let’s define our dataloader and loss function. For this type of dataset, we will use the softmax loss function.


from torch.utils.data import DataLoader
from sentence_transformers import losses

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
# Pass the model, the sentence embedding dimension, and the number of labels in the dataset
train_loss = losses.SoftmaxLoss(
    model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

Now let’s train our model!


epochs = 1
# Warm up for 10% of the training steps (you can adjust this to your needs)
warmup_steps = int(len(train_dataloader) * epochs * 0.1)
# Train the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=epochs, warmup_steps=warmup_steps)

Once the training is completed, you will see output like this:

Now let’s push this model to Hugging Face.


model.save_to_hub(
    "distilroberta-base-sentence-transformer-snli",  # Give a name to your model
    organization="0xSH1V4M",  # Your Hugging Face username
    train_datasets=["snli"],
)

And we have successfully fine-tuned a model using both datasets!

💡 You can get the full source code discussed above from our GitHub repo

Evaluation

Now it’s time to compare our fine-tuned models against the base model and analyze their accuracy and performance.

We will first get the vector embeddings of some sentences using each model, reduce the dimensions of these embeddings to 2 using the t-SNE technique, and then plot the embeddings on a 2D graph using matplotlib.

We will use these sentences for testing


sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

Let’s first get the embeddings of these sentences using the “bert-base-uncased” model, which is our base model.


model = SentenceTransformer("bert-base-uncased")
# Get the embeddings
sentence_embeddings = model.encode(sentences)

Now let’s reduce the embedding dimensions using t-SNE.


import numpy as np
from sklearn.manifold import TSNE

embeddings = np.array(sentence_embeddings)
# Perplexity must be less than the total number of sentences
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
embeddings_2d = tsne.fit_transform(embeddings)

Now that we have the 2D embeddings, we will also run clustering on the sentence embeddings to group them into classes. This makes it easier to visualize how each model organizes the embeddings and where they sit in the vector space.


from sklearn.cluster import KMeans

# Perform k-means clustering
num_clusters = 3  # Group the sentences into 3 clusters
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(sentence_embeddings)
cluster_assignment = clustering_model.labels_

If you print the “cluster_assignment” array, you will see the cluster label assigned to each sentence.
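For example, something like this (the exact cluster ids will vary between runs because k-means initialization is random):


# Print the cluster id assigned to each sentence
for sentence, label in zip(sentences, cluster_assignment):
    print(label, "-", sentence)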

Now let’s plot these embeddings in the 2D vector space using matplotlib.


import matplotlib.pyplot as plt

# Create a scatter plot
plt.figure(figsize=(6, 4))  # Adjust figure size as needed
colors = ["red", "green", "blue"]

for index, embedding in enumerate(embeddings_2d):
    plt.scatter(embedding[0], embedding[1], color=colors[cluster_assignment[index]])

# Add labels and title
plt.xlabel("X")
plt.ylabel("Y")
plt.title("BERT Base Model")

# Add sentence labels (consider using for small datasets)
for i, sentence in enumerate(sentences):
    plt.annotate(sentence, (embeddings_2d[i, 0], embeddings_2d[i, 1]))

plt.grid(False)
plt.show()

And here are the results!

Now let’s get the embedding plot for our fine-tuned models in the same way and compare them side by side
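To avoid repeating ourselves, we can wrap the encode → t-SNE → k-means → plot steps from above into a small helper and call it once per model. This is just a convenience sketch; the fine-tuned model ids below are the ones we pushed earlier, so replace them with your own repos:


import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def plot_model_embeddings(model_name, title):
    # Encode the test sentences (the `sentences` list defined above),
    # reduce to 2D, cluster, and plot them for one model
    model = SentenceTransformer(model_name)
    embeddings = np.array(model.encode(sentences))
    embeddings_2d = TSNE(n_components=2, random_state=42, perplexity=5).fit_transform(embeddings)
    labels = KMeans(n_clusters=3).fit(embeddings).labels_

    plt.figure(figsize=(6, 4))
    colors = ["red", "green", "blue"]
    for i, point in enumerate(embeddings_2d):
        plt.scatter(point[0], point[1], color=colors[labels[i]])
        plt.annotate(sentences[i], (point[0], point[1]))
    plt.title(title)
    plt.show()

# Compare the base model against the fine-tuned models (replace the repo ids with your own)
plot_model_embeddings("bert-base-uncased", "BERT Base Model")
plot_model_embeddings("0xSH1V4M/distilroberta-base-sentence-transformer-snli", "Fine-tuned (SNLI)")
plot_model_embeddings("0xSH1V4M/distilroberta-base-sentence-transformer-triplets", "Fine-tuned (Triplets)")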

Here is the comparison of the BERT base model with the model fine-tuned on the SNLI dataset:

As we can see from the image above, the BERT base model is not able to cluster the sentence embeddings properly and they are not placed where we would expect. The model fine-tuned on the SNLI dataset clusters them better, and similar sentences are also closer to each other.

But it’s still not that good 🤔. “The girl is carrying a baby” and “A woman is playing violin” should be closer to each other but are still far apart, and why is “A monkey is playing drums” closer to “The girl is carrying a baby” 💀? This can happen because we trained the model on a limited number of examples from the dataset.

Do we get better results with triplets? Let’s check it out 👀

Here is the comparison of the BERT base model with the model fine-tuned on the triplets dataset:

Now you can see that the clustering is done properly and the nearby sentences make much more sense. So we can say that the model trained on the triplet dataset gave better results than the model trained on the SNLI dataset. But all of this depends on the examples in your dataset and how many of them you use for training, so it really comes down to your use case.

Conclusion

Embedding models are very useful for search, recommendation systems, and finding similar results for a query, and they are widely used across different domains. We also saw how easy it is to fine-tune or train your own model from scratch using Sentence Transformers in a few lines of code, and how much better the models perform after fine-tuning. So it is always advisable to fine-tune a model for your own use case to get better results.

Want to train your own LLM?

As we all know, public large language models (LLMs) like GPT or Llama are trained on public data and might not perform well for your specific use case. To make them efficient and accurate for your specific task, you have to fine-tune a base model with your own dataset. If you have a business idea for which you might need to train or fine-tune an LLM or embedding model, then kindly book a call with us and we will be happy to convert your ideas into reality.

Thanks for reading 😀

Behind the Blog 👀
Shivam Danawale
Writer

Shivam is an AI Researcher & Full Stack Engineer at Ionio.

Pranav Patel
Editor

Good boi. He is a good boi & does ML/AI. AI Lead.