Translation Reranking Homework 5

Due on Tuesday, November 29, 2016

In Homework 2 you learned to train a discriminative learner, and in Homework 4 you built a decoder that can produce n-best outputs in the target language given a source language sentence.

In this assignment we will give you some sentences from French news articles, and you will use the BLEU metric to improve a model that chooses from a set of possible English translations generated by a state-of-the-art machine translation decoder. Your challenge is to use discriminative learning to choose the best translation from each n-best list.

Getting Started

You must have git and python (2.7) on your system to run the assignments.

If you have already cloned the nlp-class-hw repository, then do the following to get the files for Homework 5:

# go to the directory where you did a git clone before
cd nlp-class-hw
git pull origin master

Or you can create a new directory that does a fresh clone of the repository:

git clone https://github.com/anoopsarkar/nlp-class-hw.git

In the reranker directory you will find several python programs and data sets that you will use for this assignment.

The reranker reads candidate translations from the file data/test.nbest. Every candidate translation $e$ of an input sentence $f$ has an associated feature vector $h(e,f)$. $E(f)$ is the n-best list of candidate translations for $f$. The reranker takes a parameter vector $\theta$ whose length is equal to that of $h(e,f)$. By default, $\theta$ assigns uniform weight to each feature, so each weight is equal to $\frac{1}{|h(e,f)|}$. For each $f$, the reranker returns $\hat{e}$ according to the following decision function:

$$\hat{e} = \arg\max_{e\in E(f)} \theta \cdot h(e,f)$$
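As a concrete illustration, here is a minimal Python sketch of this decision rule. This is not the provided rerank.py; the candidate list and feature values below are made up for the example:

# Minimal sketch of the reranking decision rule: pick the candidate
# whose feature vector h has the highest dot product with theta.
def dot(theta, h):
    return sum(t * x for t, x in zip(theta, h))

def rerank_one(theta, candidates):
    # candidates: list of (translation, feature_vector) pairs for one sentence
    return max(candidates, key=lambda pair: dot(theta, pair[1]))[0]

# Hypothetical example with two candidates and two features:
theta = [0.5, 0.5]
candidates = [("a translation", [-25.0, -1.0]), ("another one", [-26.5, -1.0])]
print(rerank_one(theta, candidates))  # prints "a translation"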

Reranking with default weights

The default learner, default.py, simply produces uniform weights for all the features in the file data/train.nbest:

python default.py > default.weights

default.py simply ignores the training data; see below for how to use the training data to learn weights.
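For intuition, the default behavior amounts to something like the sketch below. The output format (six features, one weight per line) is an assumption made for illustration; check default.py for the actual format it writes:

# Sketch of uniform weight assignment, assuming six features and one
# weight per line in the weights file (an assumption; see default.py).
num_features = 6
for _ in range(num_features):
    print(1.0 / num_features)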

You can use the default weights with the reranker program provided to you:

python rerank.py -w default.weights > default.output

The reranker code tries to predict the best translation for each sentence in the file data/test.nbest. The output file contains the best translations according to the weights provided to the reranker.

Score the output

Score the output using score-reranker.py:

python score-reranker.py < default.output

This prints out the BLEU score for the output file by comparing it with data/test.en. The score is for the first 250 sentences of the 500 sentences in the test set.

Do not use test.nbest or test.en to learn your weights, or you will get a zero mark on this assignment. Using the reference data in test.en we can obtain an oracle score (the best possible BLEU score achievable by reranking). See below for how to compute the oracle score.

You can do it all at once:

python rerank.py -w default.weights | python score-reranker.py

The features

The files train.nbest and test.nbest have six feature functions:

LMScore ReorderingScore p(f|e) lex(f|e) p(e|f) lex(e|f) 

The feature functions are explained below:

  1. LMScore: the language model score for this candidate
  2. ReorderingScore: the sum of the distortion penalties for this translation
  3. p(f|e): the inverse phrase translation probability
  4. lex(f|e): the inverse lexical weighting
  5. p(e|f): the direct phrase translation probability
  6. lex(e|f): the direct lexical weighting

For example, here are the 5-best outputs for the first sentence in the training data. Each has a feature vector with values for the above feature functions:

0 ||| barack obama will become the fourth american president to receive the nobel peace prize ||| -25.014 -1.000 -5.000 -2.000 -14.000 -6.080
0 ||| barack obama will be the fourth american president to receive the nobel peace prize ||| -26.457 -1.000 -4.000 -2.000 -14.000 -6.080
0 ||| barack obama will be the fourth us president to receive the nobel peace prize ||| -26.781 -1.000 -5.000 -2.000 -14.000 -6.080
0 ||| barack obama would be the fourth american president to receive the nobel peace prize ||| -26.311 -1.000 -5.000 -2.000 -14.000 -6.080
0 ||| barack obama will become the fourth american president to receive the nobel peace prize winner ||| -25.899 -1.000 -5.000 -2.000 -15.000 -6.514
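Each line has the form: sentence id ||| candidate translation ||| feature values. Here is a minimal sketch of parsing one such line, with the field layout taken directly from the examples above:

# Minimal sketch of parsing one n-best line of the form
# "id ||| translation ||| f1 f2 ... f6".
line = "0 ||| barack obama will become the fourth american president to receive the nobel peace prize ||| -25.014 -1.000 -5.000 -2.000 -14.000 -6.080"
(i, translation, feats) = line.strip().split(" ||| ")
features = [float(x) for x in feats.split()]
print("%s: %s -> %s" % (i, translation, features))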

Your task is to find weights for these features to pick a better translation out of the n-best list for each input sentence. This task is called reranking.

The oracle score

How much better can you get by reranking? The program oracle.py uses the reference sentences to show you the upper bound on the improvement.

python oracle.py | python score-reranker.py

The oracle uses the references in test.en but your reranker cannot use that information to learn the weights. Learn your weights only on the training data.

Your reranker

The program default.py simply ignores the training data:

  • train.nbest: n-best lists for training
  • train.en: target language references for learning how to rank the n-best list
  • train.fr: source language sentences used as input to a decoder to create the n-best list

Your job is to use the training data to learn weights for all the features in train.nbest. Then you can use your weights to produce a better output using rerank.py:

python rerank.py -w weights > output

You can then score your output in the same way as shown above:

python score-reranker.py < output

Leaderboard

To get on the leaderboard, produce your output weights:

python answer/learn.py > weights

You should upload the output of rerank.py, using weights learned from the following data files:

  • train.nbest (n-best list for training),
  • train.en (the reference target language output for training) and
  • train.fr (the source language input for training)

Run rerank.py:

python rerank.py -w weights > output

Upload the output of rerank.py to the leaderboard on sfu-nlp-class.appspot.com.

The score will be almost identical to the local score reported by score-reranker.py so please do not upload excessively to the leaderboard. The leaderboard scores on all 500 sentences in data/test.nbest while the local score is on the first 250 sentences.

Options

python default.py -h

This shows the different options you can use in your training algorithm implementation.

Data

The data directory contains several files derived from a French-English newswire translation task.

  • train.fr: Training data consisting of French news text
  • train.en: Human translations of the French training data
  • train.nbest: N-best machine translations of the French training data
  • test.fr: 500 sentences of French test data
  • test.en: Human translations of the first 250 sentences of the French test
  • test.nbest: N-best machine translations of the French test data

The Challenge

The oracle should convince you that it is possible to do much better than the default reranker. Maybe you can improve it by changing the parameter vector $\theta$. Do this using command-line arguments to rerank.py. Try a few different settings. How close can you get to the oracle BLEU score?

You can improve the parameter vector by trial and error, but that will not be very efficient. To really improve the system you need automation. There are two components you can add: informative features that correlate with BLEU, and effective learning algorithms that optimize for BLEU. Your task is to improve translation quality on the blind test set as much as possible by improving these components.

The Baseline

For the baseline learning algorithm we will use an algorithm called PRO (pairwise ranking optimization) which is described in the following paper:

Tuning as Ranking. Mark Hopkins and Jonathan May. EMNLP 2011.

One caveat is that we will not be generating new candidates using a decoder in each iteration. This means we will iterate over the n-best lists provided to you rather than re-decoding to produce n-best lists for each iteration. This approach is called batch tuning and explained further in:

Batch Tuning Strategies for Statistical Machine Translation. Colin Cherry and George Foster. In NAACL 2012.

Here is a pseudo code version of the PRO algorithm to learn weights for reranking:

Parameters:
    tau: samples generated from n-best list per input sentence (set to 5000)
    alpha: sampler acceptance cutoff (set to 0.1)
    xi: number of training pairs kept from the tau samples (set to 100)
    eta: perceptron learning rate (set to 0.1)
    epochs: number of epochs for perceptron training (set to 5)

for each sentence i:
    collect all the n-best outputs for i
    for each candidate c in the n-best list:
        compute the bleu score b (using bleu.py) for c
        append (c,b) to nbests[i]

for epoch = 1 to epochs:
    for nbest in nbests:
        get_sample():
            initialize sample to empty list 
            loop tau times:
                randomly choose two items from nbest list, s1 and s2:
                if fabs(s1.smoothed_bleu - s2.smoothed_bleu) > alpha:
                    if s1.smoothed_bleu > s2.smoothed_bleu:
                        sample += (s1, s2)
                    else:
                        sample += (s2, s1)
                else:
                    continue
            return sample
        sort the samples from get_sample() in descending order of s1.smoothed_bleu - s2.smoothed_bleu
        keep the top xi (s1, s2) values from the sorted list of samples
        do a perceptron update of the parameters theta:
            if theta * s1.features <= theta * s2.features:
                mistakes += 1
                theta += eta * (s1.features - s2.features) # this is vector addition!
return theta
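Below is a hedged, self-contained Python sketch of this procedure. It assumes nbests has already been built per the first loop above, with each candidate stored as a (features, smoothed_bleu) pair; the uniform initialization of theta is an assumption, and the provided bleu.py should be used for the per-sentence smoothed BLEU scores. Treat this as a starting point, not the reference implementation:

import random
from math import fabs

def dot(theta, h):
    # dot product between weight vector and feature vector
    return sum(t * x for t, x in zip(theta, h))

def pro_train(nbests, tau=5000, alpha=0.1, xi=100, eta=0.1, epochs=5,
              num_features=6):
    """Sketch of batch PRO. Each entry of nbests is one sentence's n-best
    list; each candidate is a (features, smoothed_bleu) pair where
    features is a list of floats, one per feature function."""
    # uniform initialization is an assumption; the pseudocode above does
    # not specify how theta starts out
    theta = [1.0 / num_features] * num_features

    def get_sample(nbest):
        sample = []
        for _ in range(tau):
            s1, s2 = random.choice(nbest), random.choice(nbest)
            # keep only pairs whose smoothed BLEU difference beats the cutoff
            if fabs(s1[1] - s2[1]) > alpha:
                sample.append((s1, s2) if s1[1] > s2[1] else (s2, s1))
        return sample

    for _ in range(epochs):
        for nbest in nbests:
            sample = get_sample(nbest)
            # keep the xi pairs with the largest BLEU difference
            sample.sort(key=lambda pair: pair[0][1] - pair[1][1], reverse=True)
            for s1, s2 in sample[:xi]:
                # perceptron update: if the worse candidate scores at least
                # as high as the better one, move theta toward the better one
                if dot(theta, s1[0]) <= dot(theta, s2[0]):
                    theta = [t + eta * (a - b)
                             for t, a, b in zip(theta, s1[0], s2[0])]
    return theta

# Usage sketch (nbests must first be built from train.nbest and train.en,
# scoring each candidate with the smoothed sentence-level BLEU in bleu.py):
#   theta = pro_train(nbests)
#   print("\n".join(str(w) for w in theta))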

Implementing a batch tuning version of PRO will be enough to match the baseline score. However, there will still be substantial room for improvement. Here are some ideas:

  • Improve the learning algorithm in PRO (for instance, from the perceptron to the averaged perceptron).
  • You can add features to train.nbest and test.nbest:
    • Add a feature that counts the number of words (see the sketch after this list).
    • Add a feature that counts words that appear to be untranslated.
    • Add an IBM Model 1 score (summed over all alignments) as a feature.
  • Use ordinal regression or uneven margins.
  • Find a consensus translation using minimum Bayes risk.
  • There are many, many more ideas for improving reranking, such as training with hope and fear candidates.
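As an example of the first feature idea above, here is a minimal sketch that appends a word-count feature to every candidate in an n-best file. The script name in the usage comment is a placeholder, and the exact whitespace conventions of the provided files should be verified before relying on it:

# Minimal sketch: append a word-count feature to every candidate in an
# n-best file of the form "id ||| translation ||| f1 ... f6".
import sys

for line in sys.stdin:
    (i, translation, feats) = line.strip().split(" ||| ")
    word_count = len(translation.split())
    print("%s ||| %s ||| %s %d" % (i, translation, feats, word_count))

# Usage sketch (add_wordcount.py is a hypothetical name for this script):
#   python add_wordcount.py < data/train.nbest > train.nbest.plus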

But the sky’s the limit! You can try anything you want, as long as you follow the ground rules.

Ground Rules

  • Each group should submit using one person as the designated uploader. Ideally use the same person across all homeworks.
  • Follow these step-by-step instructions to submit your homework solution:
    1. Your solution to this homework should be in the answer directory in a file called learn.py. The code should be self-contained, self-documenting, and easy to use. It should read the data exactly like default.py does. Your program should run like this:

         python answer/learn.py > weights
      
    2. Use the weights to call rerank.py to get the output file.

         python rerank.py -w weights > output
      
    3. Upload this file output to the leaderboard submission site according to the Homework 0 instructions.
    4. Run the program: python zipsrc.py. This will create a zip file called source.zip. Each group should assign one member to upload source.zip to Coursys as the submission for this homework. Your code should use the same input and output assumptions as default.py. Only use zipsrc.py to prepare your zip file.
    5. A clear, mathematical description of your algorithm and its motivation written in scientific style. This needn’t be long, but it should be clear enough that one of your fellow students could re-implement it exactly. You are given a dummy README.md file in the answer directory. Update this file with your description.
    6. Also in the answer directory, for each group member with user name username, include a file called README.username containing a description of that member's contribution to the homework solution along with the commit identifiers from either svn or git. If you have only one member in your group then create an empty file.
  • You cannot use data or code resources outside of what is provided to you. You can use NLTK but not the NLTK tokenizer class.
  • For the written description of your algorithm, you can use plain ASCII but for math equations it is better to use either latex or kramdown. Do not use any proprietary or binary file formats such as Microsoft Word.

If you have any questions or you’re confused about anything, just ask.

Acknowledgements

This assignment is adapted from the decoding homework developed by Adam Lopez and subsequently improved by Chris Dyer.