The EM algorithm is an approach to unsupervised learning in which the missing labels are treated as data hidden from the learner. The algorithm computes a posterior distribution over the missing labels and uses that posterior to learn the parameters of the complete-data model.
## Inputs ##
- output labels \({\cal Y}\)
- training examples \(x^{(1)}, \ldots, x^{(n)}\)
- model \(p(x, y; \theta)\)
- possible parameter values \(\Omega\)

## Initialization ##

Choose an initial parameter value \(\theta^0 \in \Omega\).
## Iterative Re-estimation ##
$$ \theta^t = \arg\max_{\theta \in \Omega} Q(\theta, \theta^{t-1})$$
$$ Q(\theta, \theta^{t-1}) = \sum_{i=1}^n \sum_{y \in {\cal Y}} \delta(y \mid i) \log p(x^{(i)}, y; \theta) $$
$$ \delta(y \mid i) = p(y \mid x^{(i)}; \theta^{t-1}) = \frac{ p(x^{(i)}, y; \theta^{t-1}) }{ \sum_{y' \in {\cal Y}} p(x^{(i)}, y'; \theta^{t-1}) } $$
## Output ##

The parameter estimate \(\theta^t\) after the final iteration.
Taking the \(\arg\max\) of \(Q(\theta, \theta^{t-1})\) involves taking the derivative of \(Q\) with respect to the parameters and setting it to zero, i.e. finding the parameter values at which the gradient vanishes (which gives a local maximum of the \(Q\) function). How difficult it is to write out this derivative depends on the model \(p(x, y; \theta)\).
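As an illustration (not part of the general algorithm statement): for a model that factors into multinomial distributions, such as a Naive-Bayes-style model with \(p(x, y; \theta) = q(y) \prod_j q(x_j \mid y)\), adding a Lagrange multiplier for each sum-to-one constraint and setting the derivative of \(Q\) to zero gives the familiar "expected counts, then normalize" update:

$$ q(v \mid y) = \frac{\sum_{i=1}^n \delta(y \mid i)\, \mathrm{count}(v, x^{(i)})}{\sum_{i=1}^n \delta(y \mid i)\, |x^{(i)}|} $$

where \(\mathrm{count}(v, x^{(i)})\) is the number of times the value \(v\) occurs in \(x^{(i)}\) and \(|x^{(i)}|\) is its length.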
You must have git installed on your system. Once you’ve confirmed this, run this command:
git clone https://github.com/anoopsarkar/decipherment-class-hw.git
In the repository directory you will see an em-algo directory. This directory contains some data files you will use for this homework.
Your task is to use the EM algorithm to solve the following hidden data problems.
Implement EM for the three coins problem which is discussed in:
The EM algorithm. Michael Collins. 1997.
Write a Python program (or equivalent) that can be run using the following command (a sketch of the EM updates appears after the example runs below):
python three_coins.py number-of-EM-iterations init-for-coin0 init-for-coin1 init-for-coin2 observation-list ...
Run your program on the following inputs:
python three_coins.py 5 0.3 0.3 0.6 HHH TTT HHH TTT HHH
python three_coins.py 5 0.3 0.3 0.6 HHH TTT HHH TTT
python three_coins.py 5 0.3 0.7 0.7 HHH TTT HHH TTT
python three_coins.py 11 0.3 0.7001 0.7 HHH TTT HHH TTT
python three_coins.py 11 0.3 0.6999 0.7 HHH TTT HHH TTT
Explore other settings of \(\theta^0\) and observation sequences \(x^{(i)}\).
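Here is a minimal sketch of the EM updates for this problem, assuming the usual three coins setup from the Collins note (coin 0 selects whether coin 1 or coin 2 generates each observed H/T sequence); the variable names, the progress printout, and any smoothing are illustrative only and left for you to adapt:

import sys

def em_three_coins(iterations, lam, p1, p2, observations):
    # lam: probability that coin 0 comes up heads (i.e. that coin 1 is used)
    # p1, p2: head probabilities of coins 1 and 2
    counts = [(obs.count('H'), obs.count('T')) for obs in observations]
    for t in range(iterations):
        # E-step: posterior probability that each sequence came from coin 1
        posteriors = []
        for h, tails in counts:
            like1 = lam * (p1 ** h) * ((1 - p1) ** tails)
            like2 = (1 - lam) * (p2 ** h) * ((1 - p2) ** tails)
            posteriors.append(like1 / (like1 + like2))
        # M-step: re-estimate parameters from expected counts
        n = len(observations)
        lam = sum(posteriors) / n
        heads1 = sum(d * h for d, (h, _) in zip(posteriors, counts))
        total1 = sum(d * (h + tl) for d, (h, tl) in zip(posteriors, counts))
        heads2 = sum((1 - d) * h for d, (h, _) in zip(posteriors, counts))
        total2 = sum((1 - d) * (h + tl) for d, (h, tl) in zip(posteriors, counts))
        p1 = heads1 / total1 if total1 > 0 else p1
        p2 = heads2 / total2 if total2 > 0 else p2
        print(t, lam, p1, p2)   # progress trace; adapt to whatever output you want
    return lam, p1, p2

if __name__ == '__main__':
    iters = int(sys.argv[1])
    c0, c1, c2 = map(float, sys.argv[2:5])
    em_three_coins(iters, c0, c1, c2, sys.argv[5:])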
Implement EM for Naive Bayes which is discussed in:
The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm. Michael Collins. 2014.
Warning: do not re-estimate the class prior \(q(y)\). Keep it fixed to be uniform over the labels \(y\).
You should be careful about underflow when you multiply many probabilities. The following code fragment uses logarithms to avoid underflow from multiplying many model parameters of the Naive Bayes model:
import functools
import numpy

def logadd(items):
    # stable log(sum(exp(items))), folding the list with numpy.logaddexp
    return functools.reduce(numpy.logaddexp, items[:-1], items[-1])
You should check in your code that you never take \(\log(0)\), because in some cases the EM algorithm might assign a posterior probability of zero.
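As an illustration, the per-example posterior \(\delta(y \mid i)\) and the corpus log-likelihood can be computed entirely in log space by reusing the logadd function above; the function below is only a sketch, and its argument names (log_joint, labels) are hypothetical:

import math

def posterior_for_example(log_joint, labels):
    # log_joint[y] = log q(y) + sum_j log q(x_j | y) for one example,
    # with the class prior held fixed at log q(y) = -log(2)
    log_marginal = logadd([log_joint[y] for y in labels])   # log p(x_i; theta)
    delta = {y: math.exp(log_joint[y] - log_marginal) for y in labels}
    # delta[y] may be exactly 0.0; skip such terms (or clamp them to a tiny
    # floor) before taking their log again in the M-step, to avoid log(0)
    return delta, log_marginal   # summing log_marginal over examples gives the corpus likelihood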
Run your implementation on example-data.txt and on voynich.txt (this is the so-called Currier transcription). Use only two class labels and name them A and B. After EM training of the parameters, compute the \(\arg\max\) for each example in the training data. The output of your program on example-data.txt should look like this:
0 FINAL 01P Obama/McCain B -0.693147528283
1 FINAL 02P Obama/McCain B -0.693147528283
2 FINAL 03P Obama/McCain B -0.693147528283
3 FINAL 04S Giants/Patriots A -0.778657762454
4 FINAL 05S Giants/Patriots A -0.778657762454
5 FINAL 06S Giants/Patriots A -0.778657762454
corpus likelihood -249.945083149
The Voynich data looks like this:
15201B 4CPC89/ZC89/4OPOE/O8AE/SAE/8AD/4OBZC89/4OPC89/4OPC89/4OFC89/8AE-
15202B CEZC89/4OEZCC89/49RS2/98CC89/OFC89/EZC89/SR/Z8AE/SC89/4OEF9-
15203B 4/FCCC9/ZC89/4OFCC89/4O89FC9/4OFC89/2ZCC89/PA3/ZCOE/PCC89/9R9-
15204B 8SCOE/FZC9/4OESZ9/SCOE/OESC89/OES2/OEZCC89/4OFC89/SC89-
15205B PSC8AR/AEAN/SCX9/2AE/ZC89/4OFAE/SC89/4OFC89/4OE/ZC89/EOE-
Ignore the first column; the rest of the line contains "words" separated by /. Your output on the Voynich data should look the same as the output on example-data.txt.
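A minimal parsing sketch under those assumptions: the trailing - at the end of each sample line looks like a line-end marker in this transcription, so it is stripped here (adjust if that assumption is wrong), and the function name is only illustrative.

def read_voynich(path):
    # return a list of word lists, one per line of the transcription
    examples = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            _, rest = line.split(None, 1)             # drop the first (locator) column
            words = rest.strip().rstrip('-').split('/')
            examples.append([w for w in words if w])  # drop empty tokens
    return examples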
Use a two-state Hidden Markov Model to model words in the Voynich data (voynich.txt). After EM training, compute the \(\arg\max\) over the data, essentially tagging the data using the two-state HMM. See if you can replicate the famous Currier result about evidence of "two hands" in the Voynich manuscript. You can use a suitable HMM package; you do not have to implement it yourself.
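One possible (assumed, not required) route is sketched below using the hmmlearn package; its API varies by version (recent versions expose CategoricalHMM for discrete emissions, older ones call it MultinomialHMM), so treat this as a starting point. It maps each word to an integer symbol, fits a two-state HMM with EM, and Viterbi-tags every word:

import numpy as np
from hmmlearn import hmm          # assumed third-party package, one option among many

examples = read_voynich('voynich.txt')   # parsing sketch from above

# map each distinct word to an integer symbol
vocab = {w: i for i, w in enumerate(sorted({w for ex in examples for w in ex}))}
X = np.concatenate([[vocab[w] for w in ex] for ex in examples]).reshape(-1, 1)
lengths = [len(ex) for ex in examples]

# two hidden states with categorical (word) emissions, trained with EM
model = hmm.CategoricalHMM(n_components=2, n_iter=50, random_state=0)
model.fit(X, lengths)

# Viterbi decoding assigns one of the two states to every word
states = model.predict(X, lengths)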
Submit your Python program three_coins.py, which implements EM for the three coins problem, and your Python program em-nb.py, which implements EM for Naive Bayes. The data files example-data.txt and voynich.txt are provided in the em-algo directory of the repository.
If you have any questions or you’re confused about anything, just ask.