Daily Dose of Data Science – Day 6 – An easy approach to extract Keywords from paragraphs using KeyBERT

Welcome everyone to yet another Daily Dose of Data Science! For day 6, we will explore the world of natural language processing through the use case of keyword extraction from text, using a BERT-based technique called KeyBERT. Let’s get started!

KeyBERT is an open-source Python framework offering a minimal and easy-to-use keyword extraction technique that leverages BERT (Bidirectional Encoder Representations from Transformers) embeddings. It embeds the document as a whole along with candidate N-gram phrases drawn from the document, then uses cosine similarity to find the words or phrases that are most similar to the document itself, treating those as its keywords and keyphrases.
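
To make that intuition concrete, here is a minimal sketch of the idea (not KeyBERT’s actual internals), assuming sentence-transformers and scikit-learn are installed; the model name "all-MiniLM-L6-v2" is just one example choice:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

doc = "Supervised learning is the machine learning task of learning a function that maps an input to an output."

# Collect unigram and bigram candidates from the document itself
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit([doc])
candidates = vectorizer.get_feature_names_out()

# Embed the document and every candidate phrase with the same model
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(list(candidates))

# Rank candidates by cosine similarity to the document embedding
scores = cosine_similarity(candidate_embeddings, doc_embedding).flatten()
top_5 = [candidates[i] for i in scores.argsort()[::-1][:5]]
print(top_5)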

Installing KeyBERT is as simple as any other Python package and can be done with pip:

pip install keybert

or, to install all backend dependencies, use

pip install keybert[all]

Using KeyBERT is also very simple. Please take a look at the official GitHub repository of KeyBERT to learn more about the package in detail. An example usage is given as follows:

from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).
      """
kw_model = KeyBERT()

# Default extraction returns single-word keywords
keywords = kw_model.extract_keywords(doc)

# Extract keyphrases of one or two words instead
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
print(keywords)

Output:

[('learning algorithm', 0.6978),
 ('machine learning', 0.6305),
 ('supervised learning', 0.5985),
 ('algorithm analyzes', 0.5860),
 ('learning function', 0.5850)]
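
Beyond the defaults, extract_keywords exposes parameters for tuning the results. For instance, top_n controls how many keyphrases are returned, and use_mmr together with diversity applies Maximal Marginal Relevance to reduce redundancy among them. A sketch, reusing the kw_model and doc from above:

# Return 10 keyphrases, diversified with Maximal Marginal Relevance (MMR);
# a higher diversity value trades some relevance for less redundant results.
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 2),
    stop_words="english",
    top_n=10,
    use_mmr=True,
    diversity=0.5,
)
print(keywords)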

KeyBERT also supports several other embedding backends that can be used to embed the documents and words (see the sketch after this list):

  • Sentence-Transformers
  • Flair
  • Spacy
  • Gensim
  • USE
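
As a sketch, a Sentence-Transformers model can be passed in explicitly instead of relying on the default; any model name from sbert.net should work, and "all-MiniLM-L6-v2" here is just an example:

from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Load a Sentence-Transformers model and hand it to KeyBERT
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)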

Visit the detailed documentation at https://maartengr.github.io/KeyBERT/ for complete information on all supported embedding models and for more on how to use the package. That’s all, folks, for today’s dose! Thank you for staying tuned until day 6! Keep following https://aditya-bhattacharya.net/ for another daily dose of data science, and please feel free to like, share, comment, and subscribe to my posts if you find them helpful!
