The script uses Python 3. To clone this repository and install all of the requirements listed below, run:
git clone
cd terminology_evaluation
pip install -r requirements.txt
List of requirements:
- stanza
- argparse
- sacrebleu
- bs4
- lxml (needed on macOS)
The main script is
that receives the following arguments:
- --language - The language code (e.g. fr for French) of the target language.
- --hypothesis - This is the hypothesis file. Example file:
- --source - This is a file with the source references. An example file is provided at
- --target_reference - This is a file with the target references. An example file is provided at
- --BLEU [True/False]. Default: True. If True, shows the BLEU score.
- --EXACT_MATCH [True/False]. Default: True. If True, shows the Exact Match score.
- --WINDOW_OVERLAP [True/False]. Default: True. If True, shows the Window Overlap score.
- --MOD_TER [True/False]. Default: True. If True, shows the TERm score.
- --TER [True/False]. Default: False. If True, shows the TER score.
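As a rough illustration of how flags of the form `[True/False]` can be parsed, here is a minimal sketch using `argparse`. It is not taken from the main script: the helper `str2bool`, the function `build_parser`, the choice to mark the file arguments as required, and the file names in the usage line are all assumptions made for the example. The explicit converter matters because `type=bool` would treat the string `"False"` as true.

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert 'True'/'False' style strings to a bool (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "1"}:
        return True
    if lowered in {"false", "f", "no", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the arguments listed above (assumed layout)."""
    parser = argparse.ArgumentParser(description="Terminology evaluation (sketch)")
    # Assumption: the language and file arguments are mandatory.
    parser.add_argument("--language", required=True)
    parser.add_argument("--hypothesis", required=True)
    parser.add_argument("--source", required=True)
    parser.add_argument("--target_reference", required=True)
    # Metric switches with the defaults documented above.
    parser.add_argument("--BLEU", type=str2bool, default=True)
    parser.add_argument("--EXACT_MATCH", type=str2bool, default=True)
    parser.add_argument("--WINDOW_OVERLAP", type=str2bool, default=True)
    parser.add_argument("--MOD_TER", type=str2bool, default=True)
    parser.add_argument("--TER", type=str2bool, default=False)
    return parser


# Usage with placeholder file names; TER is switched on, BLEU switched off.
args = build_parser().parse_args([
    "--language", "fr",
    "--hypothesis", "hyp.sgm",
    "--source", "src.sgm",
    "--target_reference", "ref.sgm",
    "--TER", "True",
    "--BLEU", "False",
])
print(args.TER, args.BLEU, args.EXACT_MATCH)
```

With this approach, any metric left out of the command line keeps its documented default.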
You can test that your metrics work by running the following command on the sample data we provide.
python3 \
--language fr \
--hypothesis data/ \
--source data/dev.en-fr.en.sgm \
--target_reference data/
Running the above command will:
- Download the French Stanza models, if they are not available locally already
- Compute four metrics and print the following:
BLEU score: 45.33867641150976
Exact-Match Statistics
Total correct: 759
Total wrong: 127
Total correct (lemma): 15
Total wrong (lemma): 0
Exact-Match Accuracy: 0.8590455049944506
Window Overlap Accuracy :
Window 2:
Exact Window Overlap Accuracy: 0.29693757867032844
Window 3:
Exact Window Overlap Accuracy: 0.2907071747339513
1 - TERm Score: 0.5976316319523398
Note that computing TER or TERm can take a long time if your data contains very long sentences.
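The Exact-Match figures in the sample output are consistent with the accuracy being the share of correct matches (surface form plus lemma) among all counted terms. This reading is inferred from the printed numbers, not confirmed from the script itself, and the function name `exact_match_accuracy` is hypothetical. A short sketch of that arithmetic:

```python
def exact_match_accuracy(correct: int, wrong: int,
                         correct_lemma: int, wrong_lemma: int) -> float:
    """Share of correct term matches among all counted terms (assumed formula)."""
    total = correct + wrong + correct_lemma + wrong_lemma
    if total == 0:
        return 0.0  # avoid dividing by zero when no terms were counted
    return (correct + correct_lemma) / total


# Counts copied from the sample output above.
accuracy = exact_match_accuracy(correct=759, wrong=127,
                                correct_lemma=15, wrong_lemma=0)
print(accuracy)  # (759 + 15) / 901, roughly 0.859
```

Checking the reported accuracy against the four counts in this way is a quick sanity test when comparing runs.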