Due on Friday, December 8, 2017 (no grace days)
Automatic evaluation is a key problem in machine translation. Suppose that we have two machine translation systems. On one sentence, system A outputs:
This type of zápisníku was very ceněn writers and cestovateli.
And system B outputs:
This type of notebook was very prized by writers and travellers.
We suspect that system B is better, though we don’t necessarily know that its translations of the words zápisníku, ceněn, and cestovateli are correct. But suppose that we also have access to the following reference translation.
This type of notebook is said to be highly prized by writers and travellers.
We can easily judge that system B is better. Your challenge is to write a program that makes this judgement automatically.
You must have git and python (2.7) on your system.
If you have already cloned the nlp-class-hw repository, then do the following to get the files for Homework 3:
# go to the directory where you did a git clone before
cd nlp-class-hw
git pull origin master
Or you can create a new directory that does a fresh clone of the repository:
git clone https://github.com/anoopsarkar/nlp-class-hw.git
In the evaluator directory you will find several python programs
and data sets that you will use for this assignment.
default.py contains the default method for evaluation:
python default.py > output
This program uses a very simple evaluation method. Given machine translations $h_1$ and $h_2$ and a reference translation $e$, it computes $|h_1 \cap e|$ and $|h_2 \cap e|$, where $|h \cap e|$ is the count of words in $h$ that are also in $e$, and it predicts that the hypothesis with the larger count is the better translation (or a tie if the counts are equal).
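For illustration, here is a minimal sketch of that comparison; the helper names and the -1/0/1 label convention are assumptions for illustration, and default.py remains the authoritative version of the input handling and output format.

```python
def word_matches(h, ref_words):
    # count hypothesis tokens that also appear in the reference
    return sum(1 for w in h if w in ref_words)

def compare(h1, h2, ref):
    ref_words = set(ref)
    m1 = word_matches(h1, ref_words)
    m2 = word_matches(h2, ref_words)
    if m1 > m2:
        return 1    # hypothesis 1 judged better (assumed convention)
    if m1 < m2:
        return -1   # hypothesis 2 judged better (assumed convention)
    return 0        # tie
```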
It is a good idea to check the output of your evaluation program using check.py. This will avoid your upload hanging on the leaderboard and other nasty consequences.
python check.py < output
It will report errors if your evaluation output is malformed.
Once you have checked the output of the evaluation program for errors, you can compare the results of this function with those of a human annotator who rated the same translations:
python score-evaluation.py < output
Or you can do it all at once:
python default.py | python check.py | python score-evaluation.py

              Pred. y=-1   Pred. y=0   Pred. y=1
True y=-1           5245        2556        3161
True y= 0           1412        1108        1413
True y= 1           3090        2448        5135

Accuracy = 0.449312
Your task for this assignment is to improve the accuracy of automatic evaluation as much as possible.
A good way to start improving the metric is to use the simple METEOR metric (Wikipedia also has a nice description) with the chunking penalty in place of $|h \cap e|$. METEOR computes the harmonic mean of precision and recall, penalized by the number of chunks. That is:

$$\mathrm{METEOR}(h, e) = \left(1 - \gamma \left(\frac{c}{m}\right)^{\beta}\right) \cdot \frac{P \cdot R}{\alpha P + (1 - \alpha) R}$$

where $P$ and $R$ are precision and recall, defined as:

$$P = \frac{m}{|h|} \qquad R = \frac{m}{|e|}$$

Here $\alpha$, $\beta$, and $\gamma$ are tunable parameters, $c$ is the number of chunks, and $m$ is the number of matched unigrams.
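The sketch below computes this score for a single hypothesis, assuming exact unigram matches and approximating a chunk as a maximal run of adjacent matched hypothesis words (full METEOR defines chunks over an alignment to the reference). The function name and default parameter values are placeholders, not part of the assignment.

```python
def meteor_score(h, e, alpha=0.9, beta=3.0, gamma=0.5):
    """Sketch of simple METEOR for one hypothesis h against reference e
    (both token lists). Uses exact unigram matches only; alpha, beta,
    and gamma are placeholder values to be tuned."""
    ref_words = set(e)
    matched = [w in ref_words for w in h]
    m = sum(matched)
    if m == 0:
        return 0.0
    precision = float(m) / len(h)
    recall = float(m) / len(e)
    f_mean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)
    # a "chunk" here starts wherever a matched word follows an unmatched one
    chunks = sum(1 for i, flag in enumerate(matched)
                 if flag and (i == 0 or not matched[i - 1]))
    penalty = gamma * (float(chunks) / m) ** beta
    return f_mean * (1 - penalty)
```

To turn this into a -1/0/1 prediction, score both hypotheses against the reference and compare the two scores, as in the word-overlap sketch above.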
Be sure to tune the parameter $\alpha$ that balances precision and recall. This is a very simple baseline to implement. However, evaluation is not solved, and the goal of this assignment is for you to experiment with methods that yield improved predictions of relative translation accuracy.
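One simple way to do this tuning is a small grid search scored by accuracy against the human judgments on the dev set. The helper below is a hypothetical sketch that reuses the meteor_score function sketched above; the grid values, data structures, and label convention are assumptions.

```python
import itertools

def grid_search(dev_pairs, dev_labels):
    """Hypothetical tuning helper: pick (alpha, beta, gamma) by accuracy
    against the human judgments on the dev set. dev_pairs is a list of
    (h1, h2, ref) token-list triples and dev_labels the human labels."""
    best_params, best_acc = None, -1.0
    for alpha, beta, gamma in itertools.product(
            [0.7, 0.8, 0.9], [1.0, 2.0, 3.0], [0.2, 0.5, 0.8]):
        correct = 0
        for (h1, h2, ref), label in zip(dev_pairs, dev_labels):
            s1 = meteor_score(h1, ref, alpha, beta, gamma)
            s2 = meteor_score(h2, ref, alpha, beta, gamma)
            # label convention assumed to match default.py's output
            pred = 1 if s1 > s2 else (-1 if s1 < s2 else 0)
            correct += (pred == label)
        acc = float(correct) / len(dev_labels)
        if acc > best_acc:
            best_params, best_acc = (alpha, beta, gamma), acc
    return best_params, best_acc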
Some things that you might try:
But the sky’s the limit! Automatic evaluation is far from solved, and there are many different solutions you might invent. You can try anything you want as long as you follow the ground rules (see below).
You should feel free to use additional data resources such as thesauruses, WordNet, or parallel data, although you do not need any data beyond what we provide. You are also free to use additional codebases and libraries except for those expressly intended to evaluate machine translation output: you must write your own evaluation metric. If you want your evaluation to depend on lemmatizers, stemmers, part-of-speech taggers, syntactic or semantic parsers, or any other off-the-shelf resources, or you would like to learn a metric using a general machine learning toolkit, that is fine. But translation metrics, including (but not limited to) available implementations of BLEU, METEOR, TER, NIST, and their many variants, are off-limits. You may of course inspect these systems if it helps you understand how they work, although they tend to include other functionality that is not the focus of this assignment. It is possible to complete the assignment with a very modest amount of python code. If you want to do system combination, join forces with your classmates (but only use the output from other groups, not their source code!). If you aren't sure whether something is permitted, ask us.
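For example, one permitted extension is to relax exact word matching with an off-the-shelf stemmer before counting matches. A hedged sketch, assuming NLTK is installed and using its Porter stemmer (the function name is hypothetical):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stemmed_matches(h, e):
    # count hypothesis tokens whose stem appears among the reference stems,
    # so that e.g. "prized" and "prize" can match
    ref_stems = set(stemmer.stem(w.lower()) for w in e)
    return sum(1 for w in h if stemmer.stem(w.lower()) in ref_stems)
```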
In this homework, the score produced by score-evaluation.py will be computed on the test set. Please do run your evaluation on the dev set many times before uploading your results.
To get on the leaderboard, produce your output file:
python answer/evaluate.py > output
Then upload the file output to the leaderboard for the Project.
Your solution to this homework should be in the
answer directory in a file called
evaluate.py. The code should be self-contained, self-documenting, and easy to use. It should read the data exactly like
default.py does. Your program should run like this:
python answer/evaluate.py > output
Upload the file output to the leaderboard submission site according to the Homework 0 instructions.
To prepare your source code submission, run python zipsrc.py. This will create a zip file called source.zip. Each group should assign one member to upload source.zip to Coursys as the submission for this homework. Your program should use the same input and output assumptions as default.py. Only use zipsrc.py to prepare your zip file.
In the answer directory, include for each group member with a user name username a file in your submission called README.username which contains a description of your contribution to the homework solution along with the relevant commit identifiers from git. If you have only one member in your group then create an empty file. Your write-up should also be included in source.zip. More details about the course project write-up are given below.
If you have any questions or you’re confused about anything, just ask.
In addition to performing well on the leaderboard, your write-up is also a very important part of your project submission. It must have the following sections:
For your write-up, you can use plain ASCII, but for math equations and tables it is better to use either LaTeX or kramdown. Do not use any proprietary or binary file formats such as Microsoft Word unless you use the template from the following page on Submission Formats. If you do use Word or any other proprietary format then you must submit the PDF file of your report as report.pdf in your submission.
If you use LaTeX then use the style files from the Submission Format section of this page.
To get a PDF file you should run the following commands (which assume you have installed LaTeX on your system):
pdflatex project
bibtex project
pdflatex project
pdflatex project
You can then open
project.pdf using any PDF viewer. Please submit
the LaTeX source and the PDF file along with your source code when
you submit your project to Coursys.
The final projects will be graded using the following criteria:
Here is one example of a good project submission. You can take inspiration for how to write a polished report from this example, but make sure you have the sections required by the Write-up section above.
This assignment was designed by Chris Dyer based on an original task by Adam Lopez which also inspired a whole series of papers. It is based on the shared task for evaluation metrics that is run at the annual Workshop on Statistical Machine Translation. The task was introduced by Chris Callison-Burch.