Thursday, September 29, 2022

What is BERT? and How BERT Work?

Must Read

Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google. BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google. In 2019, Google announced that it had begun leveraging BERT in its search engine, and by late 2020 it was using BERT in almost every English-language query. A 2020 literature survey concluded that “in a little over a year, BERT has become a ubiquitous baseline in NLP experiments”, counting over 150 research publications analyzing and improving the model. Here we Briefly Explain that how BERT Work?

The original English-language BERT has two models: (1) the BERTBASE: 12 encoders with 12 bidirectional self-attention heads, and (2) the BERTLARGE: 24 encoders with 16 bidirectional self-attention heads. Both models are pre-trained from unlabeled data extracted from the Books Corpus with 800M words and English Wikipedia with 2,500M words.


BERT is at its core a transformer language model with a variable number of encoder layers and self-attention heads. The architecture is “almost identical” to the original transformer implementation in Vaswani et al. (2017).

BERT was pretrained on two tasks: language modelling (15% of tokens were masked and BERT was trained to predict them from context) and next sentence prediction (BERT was trained to predict if a chosen next sentence was probable or not given the first sentence). As a result of the training process, BERT learns contextual embeddings for words. After pretraining, which is computationally expensive, BERT can be finetuned with less resources on smaller datasets to optimize its performance on specific tasks.


When BERT was published, it achieved state-of-the-art performance on a number of natural language understanding tasks:

  • GLUE (General Language Understanding Evaluation) task set (consisting of 9 tasks)
  • SQuAD (Stanford Question Answering Dataset) v1.1 and v2.0
  • SWAG (Situations With Adversarial Generations)
  • Sentiment Analysis: sentiment classifiers based on BERT achieved remarkable perfomance in several languages


The reasons for BERT’s state-of-the-art performance on these natural language understanding tasks are not yet well understood. Current research has focused on investigating the relationship behind BERT’s output as a result of carefully chosen input sequences, analysis of internal vector representations through probing classifiers, and the relationships represented by attention weights.


BERT has its origins from pre-training contextual representations including semi-supervised sequence learning, generative pre-training, ELMo, and ULMFit. Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. We Explained that How BERT Work

On October 25, 2019, Google Search announced that they had started applying BERT models for English language search queries within the US. On December 9, 2019, it was reported that BERT had been adopted by Google Search for over 70 languages. In October 2020, almost every single English-based query was processed by BERT.


The research paper describing BERT won the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

How BERT work

The goal of any given NLP technique is to understand human language as it is spoken naturally. In BERT’s case, this typically means predicting a word in a blank. To do this, models typically need to train using a large repository of specialized, labeled training data. This necessitates laborious manual data labeling by teams of linguists.

This is contrasted against the traditional method of language processing, known as word embedding, in which previous models like GloVe and word2vec would map every single word to a vector, which represents only one dimension, a sliver, of that word’s meaning.

These word embedding models require large datasets of labeled data. In BERT words are defined by their surroundings, not by a pre-fixed identity. In the words of English linguist John Rupert Firth, “You shall know a word by the company it keeps.”

This is significant because often, a word may change meaning as a sentence develops. Each word added augments the overall meaning of the word being focused on by the NLP algorithm.  We Explained that How BERT Work.

Latest News

CareCloud EHR- The Path To Endless Success

EHR solutions have always played a vital role in running smooth medical practices. These are imperative for achieving success...

More Articles Like This