Word Alignment Homework 3

Due on Friday, November 10, 2017

Word alignment is a key task in building a machine translation system. We start with a large corpus of aligned sentences called a parallel corpus. For example, we might have the following sentence pair from the Canadian Hansards (the published proceedings of the Canadian parliament):

monsieur le Orateur , ma question se adresse à le ministre chargé de les transports .
Mr. Speaker , my question is directed to the Minister of Transport .

Your task is to find the alignments between the words of the two languages. For example, given the sentence pair above, your program should output a word alignment in the following format, which uses the word indices from the two languages (French index first, then English index, both starting at 0):

0-0 2-1 3-2 4-3 5-4 7-6 8-7 9-8 10-9 12-10 14-11 15-12

This corresponds to an alignment of the words shown in the following table:

French                English
0  monsieur           0  Mr.
1  le
2  Orateur            1  Speaker
3  ,                  2  ,
4  ma                 3  my
5  question           4  question
6  se
                      5  is
7  adresse            6  directed
8  à                  7  to
9  le                 8  the
10 ministre           9  Minister
11 chargé
12 de                 10 of
13 les
14 transports         11 Transport
15 .                  12 .
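
As a quick illustration of the output format, the following sketch parses an alignment line into (French index, English index) pairs and looks up the aligned words. The function name `parse_alignment` is hypothetical and is not part of the provided code.

```python
# -*- coding: utf-8 -*-

def parse_alignment(line):
    """Parse 'i-j' tokens into (french_index, english_index) integer pairs."""
    pairs = []
    for token in line.split():
        i, j = token.split("-")
        pairs.append((int(i), int(j)))
    return pairs

french = "monsieur le Orateur , ma question se adresse à le ministre chargé de les transports .".split()
english = "Mr. Speaker , my question is directed to the Minister of Transport .".split()

alignment = parse_alignment("0-0 2-1 3-2 4-3 5-4 7-6 8-7 9-8 10-9 12-10 14-11 15-12")

# Pair up the aligned words, e.g. aligned_words[0] is ('monsieur', 'Mr.')
aligned_words = [(french[i], english[j]) for i, j in alignment]
print("{0} links".format(len(alignment)))
```

French words whose index never appears on the left of a pair (1, 6, 11 and 13 here) are simply left unaligned.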

Getting Started

You must have git and python (2.7) on your system to run the assignments.

If you have already cloned the nlp-class-hw repository, then do the following to get the files for Homework 3:

# go to the directory where you did a git clone before
cd nlp-class-hw
git pull origin master

Or you can create a new directory that does a fresh clone of the repository:

git clone

In the aligner directory you will find several python programs and data sets that you will use for this assignment. Warning: the size of the aligner directory is 51MB. contains a default training algorithm for the alignment task.

python -m default.model implements a complete but very simple alignment algorithm. For every word, it computes the set of sentences that the word appears in. Intuitively, word pairs that appear in similar sets of sentences are likely to be translations. Our aligner first computes the similarity of these sets with Dice’s coefficient. Given the co-occurrence count $C(e_i, f_j)$ of words $e_i$ and $f_j$ in the parallel corpus, Dice’s coefficient for each pair of words is:

$$ \delta(i,j) = \frac{2 \times C(e_i, f_j)}{C(e_i) +C(f_j)} $$

The default aligner will align any word pair with a coefficient over 0.5. You can experiment with different thresholds using -t.
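
The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the provided implementation: the function name `dice_align`, the toy bitext, and the choice to count each word once per sentence are all assumptions made here.

```python
from collections import defaultdict

def dice_align(bitext, threshold=0.5):
    """Align word pairs whose Dice coefficient exceeds the threshold.

    bitext is a list of (french_words, english_words) pairs.
    Counts are per sentence: a word is counted once per sentence it occurs in.
    """
    f_count = defaultdict(int)
    e_count = defaultdict(int)
    fe_count = defaultdict(int)
    for f_sent, e_sent in bitext:
        for f in set(f_sent):
            f_count[f] += 1
            for e in set(e_sent):
                fe_count[(f, e)] += 1
        for e in set(e_sent):
            e_count[e] += 1

    alignments = []
    for f_sent, e_sent in bitext:
        pairs = []
        for i, f in enumerate(f_sent):
            for j, e in enumerate(e_sent):
                dice = 2.0 * fe_count[(f, e)] / (f_count[f] + e_count[e])
                if dice > threshold:
                    pairs.append((i, j))
        alignments.append(pairs)
    return alignments

# Hypothetical toy data, for illustration only
toy_bitext = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
strict = dice_align(toy_bitext, threshold=0.9)
loose = dice_align(toy_bitext, threshold=0.5)
```

On such a tiny corpus the 0.5 threshold links every word pair in each sentence, while a stricter threshold keeps only the diagonal links, which shows why the threshold is worth tuning.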

The Challenge

The goal of this homework is to train a word alignment model using a given set of sentence pairs in two languages and then produce a word alignment for the data sets given to you.

You are only provided sentence pairs without any alignments, hence this is an example of unsupervised learning. The plan is to learn a conditional probability model $\Pr(\mathbf{f} \mid \mathbf{e})$ of a French sentence $\mathbf{f}$ given an English sentence $\mathbf{e}$. You are given a dataset $\cal D$ of $N$ sentence pairs that are known to be translation pairs:

$${\cal D} = \{ (\textbf{f}^{(1)}, \textbf{e}^{(1)}), \ldots, (\textbf{f}^{(N)}, \textbf{e}^{(N)}) \}$$

There are two data sets provided to you. You will develop your aligner using French-English sentence pairs:


Use the default Dice alignment to align the first 100,000 lines of the training data.

python -n 100000 > dice.a

You can check the validity of your alignment file:

python -i dice.a

This will print out all the valid alignments in your input file. Ignore the following warning:

WARNING:root:WARNING ( bitext is longer than alignment

Evaluate the output alignments against the reference French-English alignments:

python -i dice.a

You will see an ASCII-based graphical view of each alignment compared to the true alignment (guessed alignment versus sure and possible alignments from truth). At the end you will see the precision, recall and the alignment error rate (AER) scores of your alignments. For precision and recall, the higher the better. For AER the lower the better.

 Alignment 8  KEY: ( ) = guessed, * = sure, ? = possible
 |(*)( )                                  | monsieur
 |    ? ( )            ( )( )   ( )   ( ) | le
 |    *                                   | Orateur
 |      (*)            ( )( )   ( )   ( ) | ,
 |          *                             | ma
 |            (*)                         | question
 |                ?  ?                    | se
 |                ?  ?                    | adresse
 |                     (*)( )   ( )   ( ) | à
 |      ( )            ( )(*)   ( )   ( ) | le
 |                           (*)          | ministre
 |                                        | chargé
 |      ( )            ( )( )   (*)   ( ) | de
 |      ( )            ( )( )   (?) ? ( ) | les
 |                                  *     | transports
 |      ( )            ( )( )   ( )   (*) | .
   M  S  ,  m  q  i  d  t  t  M  o  T  . 
   r  p     y  u  s  i  o  h  i  f  r    
   .  e        e     r     e  n     a    
      a        s     e        i     n    
      k        t     c        s     s    
      e        i     t        t     p    
      r        o     e        e     o    
               n     d        r     r    

You can also do it all at once:

python -n 100000 | python | python

Alignment Error Rate

AER is used to evaluate the output. The alignments created by humans for the evaluation are divided into two types:

  • Sure alignments, $S$, which use - as the separator, e.g. 0-0
  • Unsure alignments, which use ? as the separator, e.g. 0?0
  • Possible alignments, $P$, which are the union of the Sure and Unsure alignments

The quality of an alignment $A$ produced by your aligner is computed using precision and recall:

$$ \textrm{precision} = \frac{ | A \cap P | }{ |A| } $$

$$ \textrm{recall} = \frac{ | A \cap S | }{ |S| } $$

Alignment error rate (AER) combines precision and recall:

$$ \textrm{AER} = 1 - \left( \frac{ |A \cap S| + |A \cap P| }{ |A| + |S| } \right) $$
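
The three formulas translate directly into set operations. The following is a minimal sketch, not the provided scoring program: the function name `score_alignment` and the toy link sets are hypothetical, and it assumes that both the guessed set and the sure set are non-empty.

```python
def score_alignment(guess, sure, unsure):
    """Return (precision, recall, AER) for sets of (i, j) alignment links."""
    a = set(guess)
    s = set(sure)
    p = s | set(unsure)  # possible = sure union unsure
    precision = len(a & p) / float(len(a))
    recall = len(a & s) / float(len(s))
    aer = 1.0 - (len(a & s) + len(a & p)) / float(len(a) + len(s))
    return precision, recall, aer

# Hypothetical toy example: one guessed link is only "possible"
guess = [(0, 0), (1, 1), (2, 3)]
sure = [(0, 0), (1, 1), (2, 2)]
unsure = [(2, 3)]
precision, recall, aer = score_alignment(guess, sure, unsure)
```

Here every guessed link is at least possible, so precision is 1, but one sure link is missed, so recall is 2/3 and AER is 1 - 5/6.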

The Leaderboard

Important: You need to upload the alignments on the German-English data to the leaderboard.

In this homework, you will be developing your aligner on French-English data, but you will be uploading your alignment file for the provided German-English data. The German-English sentence pairs are in the files:


To create the alignment file for upload, run:

python -p europarl -f de -n 100000 > output.a
head -1000 output.a > upload.a

There is a size limit to your uploads to the leaderboard. Make sure you upload only the first 1000 lines of the alignment file to the leaderboard.

When you develop your own aligner called answer/ you have to make sure you use the same command line arguments as

python answer/ -p europarl -f de -n 100000 > output.a
head -1000 output.a > upload.a

Then upload the file upload.a to the leaderboard for Homework 3 on

The Baseline

The word alignment model

The baseline model is a simple model that uses a word-to-word or lexical translation model. It has only one set of parameters: a conditional probability $t(f \mid e)$ where $f$ is a French word and $e$ is an English word. We will build the model $\Pr(\mathbf{f} \mid \mathbf{e})$ using $t(f \mid e)$.

Pick one translation pair $(\mathbf{f}, \mathbf{e})$ from the data $\cal D$. Let the French sentence be $\mathbf{f} = (f_1, \ldots, f_I)$ and the English sentence $\mathbf{e} = (e_1, \ldots, e_J)$. Now we make a big assumption: that each French word maps to exactly one English word. This means that we can represent an alignment of the French words by $\mathbf{a} = (a_1, \ldots, a_I)$. There is one $a_i$ for each French word $f_i$, which corresponds to an English word $e_{a_i}$. If $a_i$ is allowed to be $0$ then it means that a French word is not mapped to any English word. Setting $a_i$ to $0$ is called aligning to null.

$$ \Pr(\mathbf{f}, \textbf{a} \mid \mathbf{e}) = \prod_{i=1}^I t(f_i \mid e_{a_i})$$

Assume we have a three-word French sentence ($f_1 f_2 f_3$) sentence-aligned to a three-word English sentence ($e_1 e_2 e_3$). If the alignment is $\mathbf{a} = (1, 3, 2)$, that is, $a_1 = 1$, $a_2 = 3$, and $a_3 = 2$, we can derive the probability of this sentence pair alignment to be:

$$\Pr(\mathbf{f}, \textbf{a} \mid \mathbf{e}) = t(f_1 \mid e_1) \cdot t(f_2 \mid e_3) \cdot t(f_3 \mid e_2)$$

In this simple model, we allow any alignment function that maps any word in the source sentence to any word in the target sentence (no matter how far apart they are). The alignments are not provided to us, so we remove the alignments by summing over them (see note 1):

$$ \begin{eqnarray*} \Pr(\mathbf{f} \mid \mathbf{e}, t) & = & \sum_{\textbf{a}} \Pr(\mathbf{f}, \textbf{a} \mid \mathbf{e}, t) \\ & = & \sum_{a_1=1}^J \cdots \sum_{a_I=1}^J \prod_{i=1}^I t(f_i \mid e_{a_i}) \\ && \textrm{(this computes all possible alignments)} \\ & = & \prod_{i=1}^I \sum_{j=1}^J t(f_i \mid e_j) \\ && \textrm{(after conversion of $J^I$ terms into $I \cdot J$ terms)} \end{eqnarray*} $$
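
To make the last step concrete, the following sketch checks numerically that the sum over all $J^I$ alignments equals the product of per-word sums. The function names and the toy translation table `t` are hypothetical and chosen only for illustration.

```python
import itertools

def sum_over_alignments(f_sent, e_sent, t):
    """Brute force: enumerate all J^I alignments and sum their probabilities."""
    total = 0.0
    for a in itertools.product(range(len(e_sent)), repeat=len(f_sent)):
        prob = 1.0
        for i, j in enumerate(a):
            prob *= t[(f_sent[i], e_sent[j])]
        total += prob
    return total

def product_of_sums(f_sent, e_sent, t):
    """Factored form: only I * J table lookups are needed."""
    result = 1.0
    for f in f_sent:
        result *= sum(t[(f, e)] for e in e_sent)
    return result

# Hypothetical toy parameters t(f | e)
t = {
    ("la", "the"): 0.7, ("la", "house"): 0.1,
    ("maison", "the"): 0.2, ("maison", "house"): 0.8,
}
f_sent, e_sent = ["la", "maison"], ["the", "house"]
brute = sum_over_alignments(f_sent, e_sent, t)
fast = product_of_sums(f_sent, e_sent, t)
```

The two values agree up to floating point error, but the brute force version grows exponentially with the length of the French sentence.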

We wish to learn the parameters $t$ that maximize the log-likelihood of the training data:

$$ \arg\max_{t} L(t) = \arg\max_{t} \sum_s \log \Pr(\mathbf{f}^{(s)} \mid \mathbf{e}^{(s)}, t) $$

Training the model

In order to estimate the parameters $t(f \mid e)$ we start with an initial estimate $t_0$ and modify it iteratively to get $t_1, t_2, \ldots$. The parameter updates are derived for each French word $f_i$ and English word $e_j$ as follows:

$$t_k(f_i \mid e_j) = \sum_{s=1}^N \sum_{(f_i, e_j) \in (\textbf{f}^{(s)}, \textbf{e}^{(s)})} \frac{ \textrm{count}(f_i, e_j, \textbf{f}^{(s)}, \textbf{e}^{(s)}) }{ \textrm{count}(e_j, \textbf{f}^{(s)}, \textbf{e}^{(s)}) }$$

These counts are expected counts over all possible alignments, and each alignment has a probability computed using $t_{k-1}$. Using maximum likelihood, the new estimate for an alignment between $f_i$ and $e_j$ is the number of times we observe an alignment between $f_i$ and $e_j$, weighted by the probability of that alignment, divided by the total over all alignments of any French word to $e_j$, each weighted by its probability.

$$ \begin{eqnarray*} \textrm{count}(f_i, e_j, \textbf{f}, \textbf{e}) & = & \frac{ t_{k-1}(f_i \mid e_j) }{ \Pr(\textbf{f} \mid \textbf{e}, t_{k-1}) } \\ & = & \frac{ t_{k-1}(f_i \mid e_j) }{ \sum_{a_i=1}^J t_{k-1}(f_i \mid e_{a_i}) } \\ \textrm{count}(e_j, \textbf{f}, \textbf{e}) & = & \sum_{i=1}^I \textrm{count}(f_i, e_j, \textbf{f}, \textbf{e}) \end{eqnarray*} $$

The description of the training algorithm is very compressed here. You will have to work through the background reading below in order to fully understand the steps. Pseudo-code for the training algorithm is given below.

Algorithm: Training a lexical word alignment model

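  • $k$ = 0
  • Initialize $t_0$ ## Easy choice: initialize uniformly ##
  • repeat
    • $k$ += 1
    • Initialize all counts to zero
    • for each $(\mathbf{f}, \mathbf{e})$ in $\cal D$
      • for each $f_i$ in $\mathbf{f}$
        • $Z$ = 0 ## Z commonly denotes a normalization term ##
        • for each $e_j$ in $\mathbf{e}$
          • $Z$ += $t_{k-1}(f_i \mid e_j)$
        • for each $e_j$ in $\mathbf{e}$
          • c = $t_{k-1}(f_i \mid e_j)$ / $Z$
          • count($f_i$, $e_j$) += c
          • count($e_j$) += c
    • for each ($f$, $e$) in count
      • Set new parameters: $t_k(f \mid e)$ = count($f$, $e$) / count($e$)
  • until convergence ## See below for convergence tests ##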
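
The pseudo-code above can be sketched in Python roughly as follows. This is a minimal illustration rather than a reference solution: the function name `train_model1`, the fixed iteration count used in place of a convergence test, and the toy bitext are assumptions made here, and no attention is paid to speed or memory.

```python
from collections import defaultdict

def train_model1(bitext, iterations=5):
    """Estimate t(f | e) with EM; bitext is a list of (french, english) word lists."""
    # Uniform initialization: 1 / (French vocabulary size) for every pair
    f_vocab = set(f for f_sent, _ in bitext for f in f_sent)
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count_fe = defaultdict(float)  # expected count(f, e)
        count_e = defaultdict(float)   # expected count(e)
        for f_sent, e_sent in bitext:
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_sent)  # normalization term Z
                for e in e_sent:
                    c = t[(f, e)] / z
                    count_fe[(f, e)] += c
                    count_e[e] += c
        # Re-estimate the parameters from the expected counts
        new_t = defaultdict(float)
        for (f, e), c in count_fe.items():
            new_t[(f, e)] = c / count_e[e]
        t = new_t
    return t

# Hypothetical toy data, for illustration only
toy_bitext = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
t = train_model1(toy_bitext, iterations=5)
```

Because "la" co-occurs with "the" in both toy sentences, its probability given "the" grows with each iteration while the probabilities of "maison" and "fleur" given "the" shrink.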


Initializing uniformly means that every French word is equally likely for every English word: for all $f_i, e_j$ we initialize $t_0(f_i \mid e_j) = \frac{1}{V_f}$ where $V_f$ is the French vocabulary size. This ensures that $\sum_{f} t_0(f \mid e) = 1$.


The theory behind this algorithm states that the iterative updates have the following property:

$$L(t_k) \geq L(t_{k-1})$$

We can check for convergence by checking if the value of $L(t_k)$ does not change much from the previous iteration (for example, if the difference from the previous iteration is less than a small threshold).
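
A convergence test of this kind might look like the sketch below. It uses the factored form of the likelihood, $\log \Pr(\mathbf{f} \mid \mathbf{e}, t) = \sum_i \log \sum_j t(f_i \mid e_j)$. The function names, the `tolerance` parameter, and the toy values are hypothetical, and the code assumes every co-occurring word pair has a non-zero probability in `t`.

```python
import math

def log_likelihood(bitext, t):
    """L(t): sum over sentence pairs of log Pr(f | e, t)."""
    total = 0.0
    for f_sent, e_sent in bitext:
        for f in f_sent:
            total += math.log(sum(t.get((f, e), 0.0) for e in e_sent))
    return total

def has_converged(prev_ll, curr_ll, tolerance):
    """True when the log-likelihood changed by less than the tolerance."""
    return abs(curr_ll - prev_ll) < tolerance

# Hypothetical toy data with a uniform table over a 3-word French vocabulary
toy_bitext = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
uniform_t = {}
for f_sent, e_sent in toy_bitext:
    for f in f_sent:
        for e in e_sent:
            uniform_t[(f, e)] = 1.0 / 3.0
ll = log_likelihood(toy_bitext, uniform_t)
```

In a training loop you would compute this value after each iteration and stop once `has_converged` returns true for the tolerance you picked.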

The objective for the baseline method, $L(t)$, can be shown to be an example of convex optimization, which means we are guaranteed to find the value of $t$ that maximizes $L(t)$ in the limit. However, this could mean hundreds or thousands of iterations for any given data set.

Most practitioners simply iterate over the training data for 3 to 5 iterations.

Decoding: compute the word alignment

We have trained an alignment model so far, but what we really need is the alignment for a given translation pair.

$$ \hat{\textbf{a}} = \arg\max_{\textbf{a}} \Pr(\textbf{a} \mid \textbf{e}, \textbf{f}) $$

The best alignment to a target sentence in our simple baseline model is obtained by simply finding the best alignment for each word in the source sentence independently of the other words. For each French word $f_i$ in the source sentence the best alignment is given by:

$$ \hat{a_i} = \arg\max_{a_i} t(f_i \mid e_{a_i}) $$

Pseudo-code for this search is given below.

Algorithm: Decoding the best alignment

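  • for each $(\mathbf{f}, \mathbf{e})$ in $\cal D$
    • for each $f_i$ in $\mathbf{f}$
      • bestp = 0
      • bestj = 0
      • for each $e_j$ in $\mathbf{e}$
        • if $t(f_i \mid e_j)$ > bestp
          • bestp = $t(f_i \mid e_j)$
          • bestj = $j$
      • align $f_i$ to $e_{bestj}$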
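
In Python the search could be sketched as below. The names `decode` and `format_alignment` and the toy table are hypothetical, and the sketch assumes `t` is a dictionary keyed by (French word, English word) pairs; it writes links in the `i-j` format with the French index first.

```python
def decode(f_sent, e_sent, t):
    """For each French word pick the English position with the highest t(f | e)."""
    pairs = []
    for i, f in enumerate(f_sent):
        best_p, best_j = 0.0, 0
        for j, e in enumerate(e_sent):
            p = t.get((f, e), 0.0)
            if p > best_p:
                best_p, best_j = p, j
        pairs.append((i, best_j))
    return pairs

def format_alignment(pairs):
    """Render links as 'i-j' tokens separated by spaces."""
    return " ".join("{0}-{1}".format(i, j) for i, j in pairs)

# Hypothetical toy parameters t(f | e)
t = {
    ("la", "the"): 0.7, ("la", "house"): 0.1,
    ("maison", "the"): 0.2, ("maison", "house"): 0.8,
}
line = format_alignment(decode(["la", "maison"], ["the", "house"], t))
```

Note that every French word receives exactly one link in this scheme, which is one reason the baseline has low precision on words that should stay unaligned.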

The following output shows you how much time it takes to run the baseline algorithm for training over 5 iterations and then decoding (aligning) the data:

$ time python answer/ -n 100000 | python 
Training IBM Model 1 (no nulls) with Expectation Maximization...
Iteration 0........................................................................................................
Iteration 1........................................................................................................
Iteration 2........................................................................................................
Iteration 3........................................................................................................
Iteration 4........................................................................................................

Precision = 0.597732
Recall = 0.774889
AER = 0.341639

real    10m54.083s
user    10m21.104s
sys     0m10.252s

Background reading

The model, the training and decoding algorithms, and the theory behind the algorithms are covered in the following basic tutorial (which is easier to understand than the original research papers):

Adam Lopez. Word Alignment and the Expectation-Maximization Algorithm.

Easier to understand, but considerably longer, is the following workbook:

Kevin Knight. A Statistical MT Tutorial Workbook. 1999.

Your Task

Developing an aligner using the simple alignment algorithms (described in the above pseudo-code) is good enough to get an alignment error rate (AER) that is close to the performance of the baseline system on the leaderboard. But getting closer to the best known accuracy on this task (see note 2) is a more interesting challenge. To get full credit you must experiment with at least one extension of the baseline and document your work. Here are some ideas:

  • There are better ways to find the best alignment:
    • Align using $\Pr(\mathbf{f} \mid \mathbf{e})$ and also align using $\Pr(\mathbf{e} \mid \mathbf{f})$, then decode the best alignment using each model independently and report the alignments that are in the intersection of these two alignment sets.
    • Use the posterior probability to decode:
  • Add null words to the source sentence.
  • There are better ways to initialize the parameters that lead to better alignments especially if you run only for 5 iterations.
  • Implement a HMM-based alignment model.
  • Add part of speech tags to one language or both and use them, for example, to separate and from the alignment .
  • Add phrasal chunks to one language or both and reward alignments within phrasal chunks and/or penalize alignments across phrasal chunks.
  • Implement the more sophisticated alignment models from the Statistical MT Tutorial Workbook.
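
As an example of the first idea, intersecting the alignments from the two translation directions is a small set operation once both decoders have run. The sketch below assumes the reverse model reports its links as (English index, French index) pairs; the function name `intersect_alignments` and the toy links are hypothetical.

```python
def intersect_alignments(f_to_e, e_to_f):
    """Keep only the links proposed by both directional models.

    f_to_e holds (french_index, english_index) pairs.
    e_to_f holds (english_index, french_index) pairs from the reverse model.
    """
    flipped = set((i, j) for j, i in e_to_f)
    return sorted(set(f_to_e) & flipped)

# Hypothetical toy links for one sentence pair
f_to_e = [(0, 0), (1, 0), (2, 1)]
e_to_f = [(0, 0), (1, 2), (2, 2)]
both = intersect_alignments(f_to_e, e_to_f)
```

Intersection usually trades recall for precision, since a link survives only if both models agree on it.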

But the sky’s the limit! You are welcome to design your own model, as long as you follow the ground rules:

Ground Rules

  • Each group should submit using one person as the designated uploader. Ideally use the same person across all homeworks.
  • Follow these step-by-step instructions to submit your homework solution:
    1. Your solution to this homework should be in the answer directory in a file called The code should be self-contained, self-documenting, and easy to use. It should read the data exactly like does. Your program should run like this:

         python answer/ -p europarl -f de -n 100000 > output.a
         head -1000 output.a > upload.a
    2. Upload this file upload.a to the leaderboard submission site according to the Homework 0 instructions. Your score on the leaderboard is the score on the development data set and the test data set, which is shown to you immediately after you upload your output file. The program will give you a good idea of how well you’re doing without uploading to the leaderboard.
    3. Run the program: python This will create a zip file called Each group should assign one member to upload to Coursys as the submission for this homework. It should use the same input and output assumptions of Only use to prepare your zip file.
    4. A clear, mathematical description of your algorithm and its motivation written in scientific style. This needn’t be long, but it should be clear enough that one of your fellow students could re-implement it exactly. You are given a dummy file in the answer directory. Update this file with your description.
    5. Also in the answer directory include for each group member with a user name username a file in your submission called README.username which contains a description of your contribution to the homework solution along with the commit identifiers from either svn or git. If you have only one member in your group then create an empty file.
  • You cannot use data or code resources outside of what is provided to you. You can use NLTK but not the NLTK tokenizer class.
  • For the written description of your algorithm, you can use plain ASCII but for math equations it is better to use either latex or kramdown. Do not use any proprietary or binary file formats such as Microsoft Word.

If you have any questions or you’re confused about anything, just ask.


Your AER score should be equal to or lower than the score listed for the corresponding marks.

AER (en-de)   Marks   Grade
.80           0       F
.75           55      D
.70           60      C-
.65           65      C
.60           70      C+
.55           75      B-
.50           80      B
.44           85      B+
.34           90      A-
.24           95      A
.14           100     A+


This assignment is adapted from the word alignment homework developed by Matt Post and Adam Lopez based on an original homework developed by Philipp Koehn and later modified by John DeNero. It incorporates some ideas from Chris Dyer.

  1. For each assignment of $\mathbf{a}$, which is a sum over $J^I$ terms, we have to do $I$ multiplications, so the total number of terms is $I \cdot J^I$. However, if you allow assignment of $a_i$ to $0$ (alignment to null) then the number of terms is $I \cdot (J+1)^I$.

  2. The best known alignment error rate on this task using the data provided to you for French-English is around 19 and for German-English the best error rate is approximately 12.5 according to this comparison of different alignment models