The EM Algorithm Homework 1

The EM algorithm is an approach to unsupervised learning where the missing labels are considered to be data hidden from the learner and the EM algorithm assumes a posterior distribution over the missing labels and learns the parameters for the complete data using the posterior.

## Inputs ##

output labels ${\cal Y}$: the output label set ${\cal Y} = \{ 1, \ldots, k \}$. Each label $y \in {\cal Y}$.
training examples: $x^{(i)}$ for $i = 1, \ldots, n$ where each $x^{(i)} \in {\cal X}$
model $p(x, y; \theta)$: assigns a probability to each $(x,y)$ such that $x \in {\cal X}$ and $y \in {\cal Y}$ using parameters $\theta$
possible parameter values $\Omega$: the set $\Omega$ is the set of possible parameters: $\theta \in \Omega$.

## Initialization ##

set $\theta^0$ to an initial value from $\Omega$

## Iterative Re-estimation ##

for $t = 1, \ldots, T$
$$ \theta^t = \arg\max_{\theta \in \Omega} Q(\theta, \theta^{t-1})$$
where
$$ Q(\theta, \theta^{t-1}) = \sum_{i=1}^n \sum_{y \in {\cal Y}} \delta(y \mid i) \log p(x^{(i)}, y; \theta) $$
and
$$ \delta(y \mid i) = p(y \mid x^{(i)}; \theta^{t-1}) = \frac{ p(x^{(i)}, y; \theta^{t-1}) }{ \sum_{y \in {\cal Y}} p(x^{(i)}, y; \theta^{t-1}) } $$

## Output ##

return $\theta^T$

Taking the $\arg\max$ of $Q(\theta, \theta^{t-1})$ involves taking the derivative of $Q$ wrt the parameters and setting it to zero to find the parameters values that would take the gradient to zero (which gives the local maxima of the $Q$ function). The difficulty of writing out the derivative depends on the model $p(x, y; \theta)$.

Getting Started

You must have git installed on your system. Once you’ve confirmed this, run this command:

git clone https://github.com/anoopsarkar/decipherment-class-hw.git

In the repository directory you will see an em-algo directory. This directory contains some data files you will use for this homework.

The Challenge

Your task is to use the EM algorithm to solve the following hidden data problems.

Three Coins

Implement EM for the three coins problem which is discussed in:

The EM algorithm. Michael Collins. 1997.

Write down a Python program (or equivalent) that can be run using the following commands:

python three_coins.py number-of-EM-iterations init-for-coin0 init-for-coin1 init-for-coin2 observation-list ...

Run your program on the following inputs:

python three_coins.py 5 0.3 0.3 0.6 HHH TTT HHH TTT HHH
python three_coins.py 5 0.3 0.3 0.6 HHH TTT HHH TTT
python three_coins.py 5 0.3 0.7 0.7 HHH TTT HHH TTT 
python three_coins.py 11 0.3 0.7001 0.7 HHH TTT HHH TTT 
python three_coins.py 11 0.3 0.6999 0.7 HHH TTT HHH TTT 

Explore other settings of $\theta^0$ and observation sequences $x^{(i)}$.

Naive Bayes

Implement EM for Naive Bayes which is discussed in:

The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm. Michael Collins. 2014.

Warning do not re-estimate the class prior $q(y)$. Keep it fixed to be uniform over the labels $y$.

You should be careful about underflow when you multiply many probabilities. Use the following code fragment to use logarithms to avoid underflow due to the multiplication of the model parameters for the Naive Bayes model.

def logadd(items):
    return functools.reduce(numpy.logaddexp, items[:-1], items[-1])

You should check in your code that you do not take $\log(0)$ because in some cases the EM algorithm might assign a zero probability posterior.

Run your implementation on example-data.txt and voynich.txt (this is the so-called Currier transcription). Use only two class labels and name them A and B. After EM training of the parameters, compute the $\arg\max$ for each example in training.

The output of your program on the example-data.txt should look like this:

FINAL 01P Obama/McCain B -0.693147528283
FINAL 02P Obama/McCain B -0.693147528283
FINAL 03P Obama/McCain B -0.693147528283
FINAL 04S Giants/Patriots A -0.778657762454
FINAL 05S Giants/Patriots A -0.778657762454
FINAL 06S Giants/Patriots A -0.778657762454
corpus likelihood -249.945083149

The Voynich data looks like this:

15201B 4CPC89/ZC89/4OPOE/O8AE/SAE/8AD/4OBZC89/4OPC89/4OPC89/4OFC89/8AE-
15202B CEZC89/4OEZCC89/49RS2/98CC89/OFC89/EZC89/SR/Z8AE/SC89/4OEF9-
15203B 4/FCCC9/ZC89/4OFCC89/4O89FC9/4OFC89/2ZCC89/PA3/ZCOE/PCC89/9R9-
15204B 8SCOE/FZC9/4OESZ9/SCOE/OESC89/OES2/OEZCC89/4OFC89/SC89-
15205B PSC8AR/AEAN/SCX9/2AE/ZC89/4OFAE/SC89/4OFC89/4OE/ZC89/EOE-

Ignore the first column, the rest of the line has “words” separated by /. Your output on the Voynich data should look the same as the output on example-data.txt.

Hidden Markov Models optional

Use a two-state Hidden Markov model to model words in the Voynich data: voynich.txt. After EM training compute the $\arg\max$ over the data, essentially tagging the data using the two-state HMM. See if you can replicate the famous Currier result about evidence of “two hands” in Voynich. You can use a suitable HMM package. You do not have to implement it yourself.

Ground Rules

Each person should work on their own. You can share ideas and help each other understand the material but all coding should be your own work.
You must turn in the following:
1. The python program three_coins.py which implements EM for Three Coins problem and your python program em-nb.py which implements EM for Naive Bayes.
2. The trace of running your implementation of the EM algorithm for Naive Bayes on the following two data sets:
  - example-data.txt
  - voynich.txt
3. Optionally your analysis of using Hidden Markov Models on the Voynich manuscript.
4. Upload a zip file or tar file with the above files to Coursys as the submission for Homework 1. The code should be self-contained, self-documenting, and easy to use.
You can optionally submit a written description to help me understand your homework solution. You can use plain ASCII but for math equations it is better to use either latex or kramdown. Do not use any proprietary or binary file formats such as Microsoft Word.

If you have any questions or you’re confused about anything, just ask.