There are a ton of language models out there today, and many of them have their own unique way of learning "self-supervised" language representations that can then be used by downstream tasks.
In this article, I summarize the current trends and share some key insights to glue all these novel approaches together. 😃 (Slide credits: Devlin et al., Stanford CS224n)
Problem: Context-free/Atomic Word Representations
We started with context-free approaches like word2vec and GloVe embeddings in my previous post. The drawback of these approaches is that they do not account for the context a word appears in, e.g. "open a bank account" vs. "on the river bank". The word "bank" has a different meaning depending on the context it is used in, yet it gets a single fixed vector.
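To make the drawback concrete, here is a toy illustration (not word2vec itself, and the vectors are made up): a context-free embedding table maps each word to one fixed vector, so "bank" gets the identical vector in both sentences.

```python
# Hypothetical 3-d embedding table -- values are made up for illustration.
embeddings = {
    "bank": [0.2, -0.5, 0.1],
    "river": [0.7, 0.3, -0.2],
    "account": [-0.1, 0.4, 0.6],
}

def embed(sentence):
    """Look up each known word's vector -- the surrounding context is ignored."""
    return [embeddings[w] for w in sentence.split() if w in embeddings]

vec_financial = embed("open a bank account")[0]   # vector for "bank"
vec_geographic = embed("on the river bank")[1]    # vector for "bank"
assert vec_financial == vec_geographic  # identical, despite different senses
```

The assertion passes: a static lookup has no way to distinguish the two senses, which is exactly the problem contextual representations set out to fix.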
Solution #1: Contextual Word Representations
With ELMo, the community started building forward (left-to-right) and backward (right-to-left) sequence language models, and used the concatenated embeddings extracted from both models as pre-trained embeddings for downstream modeling tasks like classification (sentiment analysis, etc.).
Potential drawback: ELMo can be considered a "weakly bi-directional" model, since two separate unidirectional models are trained and only their outputs are combined.
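The concatenation step can be sketched as follows. This is a toy stand-in, not a real bi-LSTM: the "hidden states" are fake 1-d vectors, and the function names are hypothetical.

```python
# Minimal sketch of the ELMo idea: run a forward LM and a backward LM
# over the tokens, then concatenate each token's two hidden states.

def fake_forward_states(tokens):
    # Hypothetical stand-in for a left-to-right LM: each "state" is just
    # the count of tokens seen so far, as a 1-d vector.
    return [[float(i + 1)] for i in range(len(tokens))]

def fake_backward_states(tokens):
    # Stand-in for a right-to-left LM: tokens remaining, counted from the end.
    return [[float(len(tokens) - i)] for i in range(len(tokens))]

def elmo_style_embeddings(tokens):
    fwd = fake_forward_states(tokens)
    bwd = fake_backward_states(tokens)
    # Concatenate per token: [forward state ; backward state]
    return [f + b for f, b in zip(fwd, bwd)]

print(elmo_style_embeddings("the river bank".split()))
# each token gets a 2-d embedding combining both directions
```

Note that neither direction ever sees the other during training, which is exactly why the result is only "weakly" bi-directional.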
Solution #2: Truly bi-directional Contextual Representations
BERT adds two extra embeddings to every input token:
- Segment embedding: simply tells the model which sentence a token belongs to, e.g. "Sentence A: The man went to buy milk. Sentence B: The store was closed."
- Position embedding: can be thought of as the token's index in the sequence, e.g. The → 0, man → 1, and so on.
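The two bullets above can be sketched as a sum of embeddings. This is a toy version with made-up 2-d vectors and hypothetical names, but the structure matches the idea: each token's input vector is the element-wise sum of its token, segment, and position embeddings.

```python
# Made-up 2-d embedding tables for illustration.
token_emb = {"the": [0.1, 0.2], "man": [0.3, 0.1], "store": [0.0, 0.5]}
segment_emb = {0: [0.01, 0.0], 1: [0.0, 0.01]}  # 0 = sentence A, 1 = sentence B

def position_emb(pos):
    # Hypothetical position encoding: grows with the token index.
    return [0.001 * pos, 0.0]

def input_embedding(word, segment, position):
    t, s, p = token_emb[word], segment_emb[segment], position_emb(position)
    # Element-wise sum of the three embeddings.
    return [ti + si + pi for ti, si, pi in zip(t, s, p)]

# The same word gets a different input vector in a different segment/position:
print(input_embedding("the", 0, 0))
print(input_embedding("the", 1, 5))
```

Unlike the context-free lookup earlier, the same word now produces different input vectors depending on where (and in which sentence) it appears.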
BERT is a huge model (110M parameters, ~1 GB file size). Alright, how do we do better?
"The bigger the LM, the better it is."
XLNet introduced relative position embeddings in place of the static (absolute) position embeddings we saw earlier. These start out as linear relationships between positions and are combined in deeper layers to learn a non-linear attention function.
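A quick sketch of the difference (toy code, no attention math): absolute embeddings index each token by its fixed slot, while relative embeddings index every pair of tokens by their offset, so the same embedding is reused wherever that offset recurs.

```python
def absolute_positions(n):
    # One fixed index per token slot.
    return list(range(n))

def relative_positions(n):
    # Offset from query token i to key token j, for every (i, j) pair.
    return [[j - i for j in range(n)] for i in range(n)]

print(absolute_positions(4))   # [0, 1, 2, 3]
print(relative_positions(4))
# Row i holds the offsets [-i, ..., 0, ..., n-1-i]; offset 0 always means
# "the token itself", no matter where it sits in the sentence.
```

Because attention sees offsets rather than absolute slots, the same "two tokens apart" relationship is encoded identically at the start and end of a sequence.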
Additionally, instead of predicting tokens only left-to-right, XLNet introduced Permutation Language Modelling (PLM), which randomly permutes the factorization order for every training sentence, as shown in the figure. You are still predicting one "masked" word at a time, given some permutation of the input, and this gives much better sample efficiency.
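The PLM idea can be sketched as follows. This toy (the function name is hypothetical, and no model is trained) just shows which tokens are visible when predicting each target under a random factorization order.

```python
import random

def plm_prediction_contexts(tokens, seed=0):
    """For each target token, list the tokens visible when predicting it."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # random factorization order, e.g. [2, 0, 3, 1]
    contexts = []
    for step, target in enumerate(order):
        # Only positions revealed earlier IN THIS ORDER are visible --
        # they may lie on either side of the target in the sentence.
        visible = sorted(order[:step])
        contexts.append((tokens[target], [tokens[i] for i in visible]))
    return contexts

for target, context in plm_prediction_contexts("the store was closed".split()):
    print(f"predict {target!r} given {context}")
```

Since the visible positions can come from both left and right of the target, every prediction step exercises truly bi-directional context while remaining a one-token-at-a-time language modeling objective.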
Slide credits - Jacob Devlin, Google AI Language