Consider the following sentence \(s\), where <bs> and <es> are padding tokens:
<bs> <bs> I would like to go West . <es>
Using the vocabulary \({\cal V} = \{\) crazy, killer, clown \(\}\), define a language model satisfying a particular constraint: \(p(x_1, \ldots, x_n) = \gamma_n \times 0.5^n\) for any \(x_1, \ldots, x_n\) such that \(x_i \in {\cal V}\) for \(i = 1, \ldots, n-1\) and \(x_n =\) STOP.
Here \(\gamma_n\) is an expression that may depend on \(n\).
Choose one of the following definitions of \(\gamma_n\) so that \(p(x_1, \ldots, x_n)\) is a valid language model.
1. \(3^{n-1}\)
2. \(3^n\)
3. \(1\)
4. \(\frac{1}{3^n}\)
5. \(\frac{1}{3^{n-1}}\)
Hint: \(\sum_{n=1}^\infty 0.5^n = 1\).
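It can help to check a candidate numerically before committing to an answer. Below is a minimal Python sketch (mine, not part of the original exercise) that sums the total probability mass for any candidate \(\gamma_n\); the vocabulary size 3 and the STOP convention come straight from the problem statement.

```python
def total_probability(gamma, max_n=60):
    """Sum p(x_1, ..., x_n) over all sequences up to length max_n.

    There are 3^(n-1) distinct sequences of length n (positions 1..n-1
    each drawn from the 3-word vocabulary, x_n = STOP), and each is
    assigned probability gamma(n) * 0.5^n, so a valid language model
    must make the total come out to 1.
    """
    return sum(3 ** (n - 1) * gamma(n) * 0.5 ** n for n in range(1, max_n + 1))

# Example: candidate 3, gamma_n = 1.  A valid model would print ~1.0;
# plug in each of the five candidates and see which one does.
print(total_probability(lambda n: 1))
```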
The perplexity of a language model on a test corpus is defined as
\[2^{-\ell},\]
where
\[\ell = \frac{1}{M} \sum_{i=1}^m \log_2 p(x^{(i)}),\]
\(m\) is the number of sentences in the corpus, \(M\) is the total number of words in the corpus, \(\log_2\) is the base-2 logarithm, and \(x^{(i)}\) is the \(i\)-th sentence in the corpus.
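As a quick illustration of the definition, here is a small Python sketch (with hypothetical numbers) that computes perplexity from per-sentence log probabilities:

```python
def perplexity(sentence_log2_probs, total_words):
    """2^(-l), where l = (1/M) * sum of log2 p(x^(i)) over sentences.

    sentence_log2_probs: one log2 probability per test sentence.
    total_words: M, the total word count of the test corpus.
    """
    l = sum(sentence_log2_probs) / total_words
    return 2 ** (-l)

# Hypothetical example: 2 sentences, 10 words in total, each sentence
# assigned probability 2^-20 by the model -> perplexity 2^(40/10) = 16.
print(perplexity([-20.0, -20.0], total_words=10))  # 16.0
```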
Assume we have a bigram language model where
\[p(w_1, \ldots, w_n) = \prod_{i=1}^n q(w_i \mid w_{i-1}),\]
where \(w_0 = \ast\) and \(w_n =\) STOP. Each parameter \(q(w \mid v)\) is calculated as follows:
\[q(w \mid v) = \frac{C(v,w)}{C(v)},\]
where \(C(\cdot)\) is the count of that item in the training data. Write down a training corpus and a test corpus such that the perplexity of the model trained on the training corpus takes the maximum possible value on the test corpus.
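To experiment with this question, the sketch below (my own helpers, assuming whitespace-tokenised sentences and the \(\ast\)/STOP padding above) trains the bigram MLE model and evaluates perplexity on a test corpus; note what happens to the value when a test bigram never appears in training.

```python
import math
from collections import defaultdict

def train_bigram(corpus):
    """MLE estimates q(w | v) = C(v, w) / C(v) from tokenised sentences."""
    count_v, count_vw = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        words = ["*"] + sentence + ["STOP"]
        for v, w in zip(words, words[1:]):
            count_v[v] += 1
            count_vw[(v, w)] += 1
    return lambda w, v: count_vw[(v, w)] / count_v[v] if count_v[v] else 0.0

def bigram_perplexity(q, corpus):
    """2^(-l), with l the average log2 sentence probability per word."""
    log2_prob, M = 0.0, 0
    for sentence in corpus:
        words = ["*"] + sentence + ["STOP"]
        M += len(sentence) + 1  # STOP is counted as a word here
        for v, w in zip(words, words[1:]):
            p = q(w, v)
            log2_prob += math.log2(p) if p > 0 else float("-inf")
    return 2 ** (-log2_prob / M)

# Hypothetical corpora: identical train and test give perplexity 1.0;
# try a test corpus containing a bigram unseen in training.
train = [["the", "dog", "barks"]]
test = [["the", "dog", "barks"]]
print(bigram_perplexity(train_bigram(train), test))  # 1.0
```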
We define a trigram language model as follows. Take \(C(w), C(v, w)\) and \(C(u, v, w)\) to be unigram, bigram and trigram counts taken from a training corpus (here \(w\) is a single word, \(v, w\) is a bigram, and \(u, v, w\) is a trigram). Take \(N\) to be the total number of words seen in the corpus. Then the unigram, bigram and trigram maximum-likelihood estimates are:
\[q(w) = \frac{C(w)}{N}\]
\[q(w \mid v) = \frac{C(v,w)}{C(v)}\]
\[q(w \mid u, v) = \frac{C(u,v,w)}{C(u,v)}\]
The final estimate of the trigram probability is:
\[p(w \mid u, v) = \alpha \times q(w \mid u, v) + (1 - \alpha) \times \left( \beta \times q(w \mid v) + (1 - \beta) \times q(w) \right)\]
Assume that \(\alpha = \beta = 0.5\). Show that the above model is equivalent to the following model by providing values for \(\lambda_1, \lambda_2, \lambda_3\):
\[p(w \mid u, v) = \lambda_1 \times q(w \mid u, v) + \lambda_2 \times q(w \mid v) + \lambda_3 \times q(w)\]
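One way to sanity-check candidate \(\lambda\) values (a sketch of mine, not required by the exercise): both forms are linear in the three estimates, so plugging random values for \(q(w \mid u, v)\), \(q(w \mid v)\) and \(q(w)\) into each form and comparing the results is a quick test.

```python
import random

ALPHA = BETA = 0.5

def nested(q_uvw, q_vw, q_w):
    """Original form: alpha * trigram + (1 - alpha) * (beta * bigram + (1 - beta) * unigram)."""
    return ALPHA * q_uvw + (1 - ALPHA) * (BETA * q_vw + (1 - BETA) * q_w)

def flat(q_uvw, q_vw, q_w, lam1, lam2, lam3):
    """Flat form: lambda_1 * trigram + lambda_2 * bigram + lambda_3 * unigram."""
    return lam1 * q_uvw + lam2 * q_vw + lam3 * q_w

# Placeholder guesses (not the answer): with the correct lambdas the two
# columns agree for every random draw of the three estimates.
lam1, lam2, lam3 = 0.5, 0.3, 0.2
for _ in range(3):
    qs = [random.random() for _ in range(3)]
    print(nested(*qs), flat(*qs, lam1, lam2, lam3))
```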
Consider the following corpus of sentences, where <bs> and <es> are padding tokens:
<bs> I am Sam <es>
<bs> Sam I am <es>
<bs> I do not like green eggs and ham <es>
Fill in the following table:
P(I | <bs>) =
P(Sam | <bs>) =
P(am | I) =
P(<es> | Sam) =
P(Sam | am) =
P(do | I) =
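The table entries can be verified mechanically. Here is a short sketch that counts bigrams directly over the padded sentences above and prints each requested estimate:

```python
from collections import defaultdict

corpus = [
    "<bs> I am Sam <es>",
    "<bs> Sam I am <es>",
    "<bs> I do not like green eggs and ham <es>",
]

# Bigram MLE: q(w | v) = C(v, w) / C(v), with <bs>/<es> as padding.
count_v, count_vw = defaultdict(int), defaultdict(int)
for line in corpus:
    words = line.split()
    for v, w in zip(words, words[1:]):
        count_v[v] += 1
        count_vw[(v, w)] += 1

queries = [("I", "<bs>"), ("Sam", "<bs>"), ("am", "I"),
           ("<es>", "Sam"), ("Sam", "am"), ("do", "I")]
for w, v in queries:
    print(f"P({w} | {v}) = {count_vw[(v, w)]}/{count_v[v]}")
```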
Some of these questions are modified versions of questions from the Columbia University course COMS 4705 by Michael Collins.