author: @himanshustwts

I’m currently working on this article based on the N-gram Language Models chapter from Stanford’s Speech and Language Processing course by Daniel Jurafsky & James H. Martin.

I’ve tried to keep the explanation brief but build it up from scratch, with an emphasis on intuition.

PS: This will be a series where I’ll be updating the site daily with new topics. Keep reading :)

What are N-gram Language Models?

The first question should be, what is an n-gram?

An n-gram is a sequence of n words:

A 2-gram (bigram) is a sequence of two words, for example “hello world”.
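
To make the idea concrete, here is a minimal Python sketch (not from the course, just an illustration) that slices a tokenized sentence into its n-grams:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the water of walden pond is so beautifully blue".split()

print(ngrams(tokens, 2))  # bigrams:  ('the', 'water'), ('water', 'of'), ...
print(ngrams(tokens, 3))  # trigrams: ('the', 'water', 'of'), ...
```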

N-gram language models estimate the probability of a word given its preceding words.

They are based on the Markov assumption, which posits that the probability of a word depends only on a limited history of preceding words.

Say we have a history h, “The water of Walden Pond is so beautifully”, and we want to know the probability that the next word is “blue”:

P(blue | The water of Walden Pond is so beautifully)
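
The Markov assumption is what makes this tractable: a bigram model, for example, conditions only on the single preceding word, so the probability above is approximated as

$$
P(\text{blue} \mid \text{The water of Walden Pond is so beautifully}) \approx P(\text{blue} \mid \text{beautifully})
$$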


Understanding N-Grams and Their Calculation

We could intuitively estimate the probability of "blue" appearing after "The water of Walden Pond is so beautifully" by computing its relative frequency in a corpus: take a very large collection of text, count how many times we see the history, and count how many of those times it is followed by "blue". But this approach is flawed. Language is creative and constantly evolving, with new sentences being created all the time, so even a very large corpus cannot give us good counts for every possible sequence.
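
To see why, here is a rough Python sketch of that relative-frequency estimate; the toy corpus is made up purely for illustration:

```python
# A toy corpus; in practice this would be billions of words,
# and even then most long histories would never appear.
corpus = (
    "the water of walden pond is so beautifully blue . "
    "the water of the lake is so calm ."
).split()

history = "the water of walden pond is so beautifully".split()
word = "blue"

h = len(history)
# Count occurrences of the history, and of the history followed by the word.
history_count = sum(corpus[i:i + h] == history for i in range(len(corpus) - h + 1))
joint_count = sum(
    corpus[i:i + h + 1] == history + [word] for i in range(len(corpus) - h)
)

# Relative-frequency estimate: P(word | history) ≈ C(history, word) / C(history)
if history_count:
    print(joint_count / history_count)  # 1.0 on this toy corpus
else:
    print("history never appears: the estimate is undefined")
```

On a realistic corpus, a history as long as "The water of Walden Pond is so beautifully" would almost certainly have a count of zero, which is exactly the problem the Markov assumption addresses by shortening the history.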
