Monday, 01 April 2019 00:00

ParaCrawl works!

We ran NMT experiments on several language pairs, comparing models trained on WMT data alone with models trained on WMT data plus the released ParaCrawl corpora. All experiments use shallow NMT models trained with Marian. The following table shows that adding ParaCrawl v4 data improves the BLEU score in every direction except en-cs, where the score is essentially unchanged (20.4 against 20.5). The ParaCrawl pipeline has improved considerably since release 1, and the results reflect this: the v4 data is much cleaner, and its BLEU gains are clearly larger than those of v1, which even lowered the scores for en-cs, en-de and de-en.

 

Pair               Direction   BLEU (WMT)   BLEU (ParaCrawl v1)   BLEU (ParaCrawl v4)
Finnish-English    en-fi       17.5         17.5                  18.7
Finnish-English    fi-en       21.7         24.2                  26.3
Latvian-English    en-lv       13.2         13.9                  15.1
Latvian-English    lv-en       15.6         16.5                  18.1
Romanian-English   en-ro       25.9         26.5                  27.2
Romanian-English   ro-en       31.1         33.5                  35.1
Czech-English      en-cs       20.5         19.1                  20.4
Czech-English      cs-en       25.7         26.3                  26.8
German-English     en-de       24.0         20.8                  25.2
German-English     de-en       29.8         28.8                  32.9
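
To make the comparison easier to read, the per-direction BLEU differences against the WMT baseline can be computed directly from the table above. The snippet below is only a small sketch: the scores are copied from the table, while the names `scores`, `deltas`, `worse_v1`, `worse_v4` and `mean_v4` are illustrative and not part of any ParaCrawl or Marian tooling. It does plain subtraction and says nothing about statistical significance, which the table does not report.

```python
# BLEU scores copied from the table above: (WMT, ParaCrawl v1, ParaCrawl v4).
scores = {
    "en-fi": (17.5, 17.5, 18.7),
    "fi-en": (21.7, 24.2, 26.3),
    "en-lv": (13.2, 13.9, 15.1),
    "lv-en": (15.6, 16.5, 18.1),
    "en-ro": (25.9, 26.5, 27.2),
    "ro-en": (31.1, 33.5, 35.1),
    "en-cs": (20.5, 19.1, 20.4),
    "cs-en": (25.7, 26.3, 26.8),
    "en-de": (24.0, 20.8, 25.2),
    "de-en": (29.8, 28.8, 32.9),
}


def deltas(table):
    """Return {direction: (v1 - WMT, v4 - WMT)}, rounded to one decimal."""
    return {
        direction: (round(v1 - wmt, 1), round(v4 - wmt, 1))
        for direction, (wmt, v1, v4) in table.items()
    }


gains = deltas(scores)

# Print the signed change in BLEU for each direction.
for direction, (d1, d4) in gains.items():
    print(f"{direction}: v1 {d1:+.1f}, v4 {d4:+.1f}")

# Directions where adding ParaCrawl data lowered BLEU relative to WMT only.
worse_v1 = [d for d, (d1, _) in gains.items() if d1 < 0]
worse_v4 = [d for d, (_, d4) in gains.items() if d4 < 0]

# Unweighted mean gain of v4 over the WMT baseline across all directions.
mean_v4 = round(sum(d4 for _, d4 in gains.values()) / len(gains), 2)

print("lower than WMT with v1:", worse_v1)
print("lower than WMT with v4:", worse_v4)
print("mean v4 gain:", mean_v4)
```

The mean here is an unweighted average over the ten directions, chosen only for simplicity; it is a rough summary rather than a figure reported by the experiments.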
Last modified on Tuesday, 26 November 2019 10:58