BERT vs. XLNet


Transformer-based models have been key to recent advances in the field of Natural Language Processing. The reason behind this success is a technique called Transfer Learning. Although computer vision practitioners are well versed in this technique, it is relatively new to NLP. In Transfer Learning, a model (in our case, a Transformer model) is pre-trained on a vast dataset using an unsupervised pre-training objective, and the same model is then fine-tuned on the actual downstream task. This approach works exceptionally well even with as few as 500–1,000 training samples.

Two pre-training objectives that have proven successful for pre-training neural networks used in transfer learning NLP are autoregressive (AR) language modeling and autoencoding (AE). An AR model can use only forward context or only backward context at a time, whereas an AE language model can use both simultaneously. BERT is an AE language model, whereas GPT is an AR one.
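The difference in visible context can be sketched in a few lines of toy Python. The function names here are illustrative, not from any library; the point is only what each objective is allowed to see when predicting the word at position `i`:

```python
def ar_context(tokens, i, backward=False):
    """Autoregressive (GPT-style): only one direction of context at a time."""
    return tokens[i + 1:] if backward else tokens[:i]

def ae_context(tokens, i):
    """Autoencoding (BERT-style): both sides at once, with the target masked out."""
    return tokens[:i] + ["[MASK]"] + tokens[i + 1:]

sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(ar_context(sentence, 2))         # ['the', 'cat']
print(ar_context(sentence, 2, True))   # ['on', 'the', 'mat']
print(ae_context(sentence, 2))         # ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
```

An AR model predicting "sat" sees only one of the two lists at a time; an AE model sees the whole masked sentence at once.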


BERT (Bidirectional Encoder Representations from Transformers), as its name suggests, is a bidirectional autoencoding (AE) language model. It obtained state-of-the-art results on 11 Natural Language Processing tasks when it was published.

How does BERT work?


BERT currently has two variants.

  • BERT Base: 12 layers, 12 attention heads, and 110 million parameters

  • BERT Large: 24 layers, 16 attention heads, and 340 million parameters

Processing and Pre-training

BERT represents each input token as the sum of three embeddings (token, segment, and position embeddings), which preserves both the meaning and the order of the input text.

BERT is pre-trained on two NLP tasks:

  • Masked Language Modelling: roughly 15% of the input tokens are replaced with a [MASK] token, and the model is trained to predict the missing words.

  • Next Sentence Prediction: given two sentences A and B, the model must decide whether B actually follows A in the corpus or is just a random sentence.
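The way training examples for these two objectives are constructed can be sketched in plain Python. This is a simplified, illustrative sketch (the function names and details such as the exact masking strategy are assumptions, not BERT's full recipe, which also sometimes keeps or swaps the chosen tokens):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Masked LM sketch: replace ~15% of tokens with [MASK]; the model
    must predict the original token at each masked position."""
    rng = rng or random.Random(0)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)       # target the model must recover
        else:
            masked.append(tok)
            labels.append(None)      # no loss at unmasked positions
    return masked, labels

def make_nsp_pair(sentences, i, rng=None):
    """NSP sketch: 50% of the time pair sentence i with its true
    successor (label True), otherwise with a random sentence (False)."""
    rng = rng or random.Random(0)
    if rng.random() < 0.5 and i + 1 < len(sentences):
        return sentences[i], sentences[i + 1], True
    j = rng.randrange(len(sentences))
    return sentences[i], sentences[j], False
```

During pre-training, both objectives are optimized jointly: the MLM loss is computed at the masked positions, and the NSP label is predicted from the special [CLS] token.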

Lastly, we fine-tune this pre-trained model to perform specific NLP tasks.


XLNet is a "generalized" autoregressive (AR) language model that learns bidirectional contexts using Permutation Language Modeling. XLNet borrows ideas from both AE and AR language models while avoiding their limitations. As per the paper, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

How does XLNet work?


Like BERT, XLNet currently has two variants.

  • XLNet-Base cased: 12 layers, 12 attention heads, and 110 million parameters

  • XLNet-Large cased: 24 layers, 16 attention heads, and 340 million parameters

Processing and Pre-training

Permutation Language Modeling (PLM) trains a bidirectional AR model over permutations of the factorization order of the words in a sentence: each token is predicted from the tokens that precede it in the sampled order, so across permutations every token conditions on both its left and right context. XLNet uses PLM to achieve state-of-the-art (SOTA) results. In addition, XLNet uses Transformer-XL as its main pre-training framework, adopting its segment-level recurrence mechanism and relative positional encoding.
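A toy sketch of the idea (illustrative only; real XLNet samples factorization orders and implements this with attention masks rather than by reordering tokens): for one factorization order, list which tokens each target is predicted from.

```python
def plm_contexts(tokens, order):
    """For one factorization order (a permutation of positions),
    yield (target, context) pairs: each token is predicted from the
    tokens whose positions came earlier in that order."""
    pairs = []
    for k, pos in enumerate(order):
        seen = sorted(order[:k])                  # positions revealed so far
        pairs.append((tokens[pos], [tokens[p] for p in seen]))
    return pairs

tokens = ["New", "York", "is", "a", "city"]
order = (0, 3, 2, 4, 1)                           # one sampled order
for target, context in plm_contexts(tokens, order):
    print(f"predict {target!r} from {context}")
```

With this order, "York" (position 1) is predicted last, from tokens on both its left and its right, which is exactly how an AR objective ends up seeing bidirectional context.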

Which one should you choose?

Both BERT and XLNet are impressive language models. I recommend you start with BERT and Transformer-XL, then move on to XLNet.

  • February 18, 2021
  • Rupesh Gelal