Suppose you were given a challenge to pick out a horror novel from a small collection of books, without any prior information about them. How would you go about this task as efficiently as possible? A sensible strategy would be to flick through each book and infer its theme from its words, sentences, and paragraphs. Another approach could be reading the reviews on the cover page for context.
This article discusses how we can use a pre-trained BERT model to accomplish a similar task. Kaggle recently released an Open Research Dataset Challenge called CORD-19, with over 52,000 research articles on COVID-19, SARS-CoV-2, and other related topics. The problem statement was to provide insights on the “tasks” using this vast research corpus. The required solution must fetch all the relevant research articles based on the questions a user submits.
Our approach to finding the most relevant research articles (related to the task) was to compare the question with the abstracts of all the research articles in the corpus. By computing a similarity score for each abstract, we could easily rank the top-N articles. Comparing the question with the entire corpus then becomes as easy as flicking through the small collection of novels to pick out the horror genre. We automate this with the Sentence-Transformers library, a pre-trained BERT model, and the spaCy library.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, a 2019 EMNLP paper by Nils Reimers and Iryna Gurevych, describes the architecture and advantages of siamese and triplet network structures, some of which are:
- Finding the most similar pair of sentences in a corpus of 10,000 sentences requires around 50 million inference computations (~65 hours) with BERT/RoBERTa, because every pair has to be passed through the network. Sentence-BERT reduces the embedding construction time for the same 10,000 sentences to ~5 seconds!
- It fine-tunes a pre-trained BERT network with siamese/triplet network structures to derive semantically meaningful sentence embeddings, which can be compared using cosine similarity.
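The figure of roughly 50 million follows from counting unordered sentence pairs. The short sketch below is only that arithmetic, not a benchmark; the variable names are illustrative and do not come from the paper.

```python
from math import comb

n_sentences = 10_000

# Plain BERT scores sentences pair by pair, so every unordered pair
# needs its own forward pass: n * (n - 1) / 2 of them.
pairwise_inferences = comb(n_sentences, 2)

# Sentence-BERT encodes each sentence once; the resulting vectors
# are then compared with a cheap cosine-similarity computation.
embedding_inferences = n_sentences

print(pairwise_inferences)   # 49995000
print(embedding_inferences)  # 10000
```

The gap between the two counts grows quadratically with the size of the corpus, which is why one embedding per abstract is practical for tens of thousands of articles.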
1. Install sentence-transformers (for example, with pip install sentence-transformers) and load a pre-trained BERT model. We used ‘bert-base-nli-mean-tokens’ because it performs well on the STS benchmark dataset.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
2. Vectorize/encode the abstract of each article using the pre-trained BERT model. Suppose the abstracts are stored in a list of strings named ‘abstract’.
abstract_embeddings = model.encode(abstract)
Each abstract is encoded as a 768-dimensional vector, so the result has one row per abstract and 768 columns, i.e. its shape is (number of abstracts) × 768.
3. Dump the abstract_embeddings as pickle.
import pickle
with open("my.pkl", "wb") as f:
    pickle.dump(abstract_embeddings, f)
4. Run inference against the stored embeddings in real time.
i) Load the embedding file “my.pkl”.
import pickle
with open("my.pkl", "rb") as f:
    df = pickle.load(f)
ii) Encode the question into a 768-D embedding with the same model and encode call used in steps 1-2.
For example, the question “What do we know about COVID-19 risk factors?” would be represented by a 768-dimensional embedding.
iii) Compare the question embedding with all the abstract embeddings, then rank the abstracts and return the top-N results. Because the question and the abstracts are both numerical vectors, this can be done with SciPy's cdist function. Note that it returns cosine distances (1 minus cosine similarity), so smaller values mean closer matches.
import scipy.spatial.distance
distances = scipy.spatial.distance.cdist([query_embedding], df, "cosine")
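The ranking step can be sketched as follows. This is a minimal illustration rather than the exact code used for the challenge: the helper name top_n_abstracts and the toy 3-dimensional vectors are assumptions made for the example, and the cosine similarity is computed with NumPy instead of SciPy so that the block has a single dependency.

```python
import numpy as np

def top_n_abstracts(query_embedding, abstract_embeddings, n=5):
    """Return (index, cosine_similarity) pairs for the n closest abstracts.

    Hypothetical helper for illustration; assumes no embedding is all zeros.
    """
    query = np.asarray(query_embedding, dtype=float)
    abstracts = np.asarray(abstract_embeddings, dtype=float)

    # Cosine similarity is the dot product of the L2-normalised vectors.
    query_unit = query / np.linalg.norm(query)
    abstract_units = abstracts / np.linalg.norm(abstracts, axis=1, keepdims=True)
    similarities = abstract_units @ query_unit

    # Highest similarity first, i.e. smallest cosine distance first.
    order = np.argsort(-similarities)[:n]
    return [(int(i), float(similarities[i])) for i in order]

# Toy 3-dimensional stand-ins for the real 768-dimensional embeddings.
toy_abstracts = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
toy_query = np.array([1.0, 0.0, 0.0])

ranking = top_n_abstracts(toy_query, toy_abstracts, n=2)
print(ranking)  # abstract 0 ranks first, abstract 2 second
```

With the variables from the steps above, the call would look like top_n_abstracts(query_embedding, df, n=10). The returned indices are positions in the original list of abstracts, so they can be used to look up the matching articles.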
This article discussed how to use a pre-trained Sentence-BERT model for a downstream NLP task: finding relevant research articles based on a user's questions.
We could also compare the question with the title or the body of each article. Comparing against the title alone is likely to be less informative, because titles are short. Conversely, comparing against the full research body requires heavy computation, and compressing a long body of text into 768 dimensions loses a lot of information.
However, feel free to experiment with embeddings of the title and research body.
Your findings may be fascinating.