author: @himanshustwts
I’m currently working on this article based on N-gram Language Models from Stanford’s Speech and Language Processing Course by Daniel Jurafsky & James H. Martin.
I’ve tried to explain this briefly but from scratch, building up the intuition as we go.
PS: This will be a series where I’ll be updating the site daily with new topics. Keep reading :)
The first question should be, what is an n-gram?
An n-gram is a sequence of n words:
A 2-gram (bigram) is a sequence of two words, for example: “hello world”.
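To make this concrete, here’s a minimal Python sketch (my own illustration, not code from the book) that extracts n-grams by sliding a window of size n over the words of a sentence:

```python
def ngrams(sentence, n):
    """Return the list of n-grams (as tuples of words) in a sentence."""
    words = sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("the water of Walden Pond", 2))
# [('the', 'water'), ('water', 'of'), ('of', 'Walden'), ('Walden', 'Pond')]
```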
N-gram language models estimate the probability of a word given its preceding words.
They are based on the Markov assumption, which posits that the probability of a word depends only on a limited history of preceding words.
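As a rough sketch of what the Markov assumption buys us (again, a toy illustration of mine, not the book’s code), a bigram model only ever conditions on the single previous word, so its probabilities can be estimated from simple counts:

```python
from collections import Counter

# A tiny toy corpus, just to show the mechanics.
corpus = "the water of the pond is blue and the sky is blue".split()

# Count each bigram and each context (the word that starts a bigram).
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def bigram_prob(word, prev):
    """P(word | prev) estimated as count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_prob("blue", "is"))    # 2/2 = 1.0
print(bigram_prob("water", "the"))  # 1/3 ≈ 0.33
```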
Let’s say we have a history (h), “The water of Walden Pond is so beautifully”, and we want to know the probability that the next word is “blue”:
P(blue | The water of Walden Pond is so beautifully)
We could intuitively estimate the probability of “blue” appearing after “The water of Walden Pond is so beautifully” by calculating its relative frequency in a corpus: count how often the full history is followed by “blue” and divide by how often the history occurs at all. But this approach is flawed. Language is “creative” and constantly evolving, with new sentences being created all the time, so we can never obtain accurate counts for every possible sequence.
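For intuition, here’s a hypothetical sketch of that naive relative-frequency estimate (the corpus string is made up, and the substring counting is deliberately crude), showing why it breaks down: most long histories simply never occur, even in very large corpora.

```python
def history_prob(corpus_text, history, word):
    """Estimate P(word | history) as count(history + word) / count(history)."""
    history_count = corpus_text.count(history)  # crude substring count
    if history_count == 0:
        return None  # the history never occurs, so the estimate is undefined
    return corpus_text.count(history + " " + word) / history_count

# A toy "corpus": even web-scale text rarely contains this exact 8-word history.
corpus_text = "the water of the pond is so blue today"
print(history_prob(corpus_text,
                   "The water of Walden Pond is so beautifully", "blue"))
# None -> zero counts, so relative frequency gives no usable estimate
```

The same zero-count problem shows up for virtually any long history, which is exactly what motivates the Markov assumption above.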