Monday, 03 May 2021 15:48

Evaluation of Synthesized Data from ParaCrawl v8.0

The latest release of the ParaCrawl corpus, version 8, comes with synthesized data for four domains: Financial, IT, Law and Medical. The data covers three language pairs: English-German, English-French and English-Italian.

These corpora are made by taking existing sentence pairs from the ParaCrawl corpora and replacing a word and its translation with a glossary term and that term's translation, which yields a new sentence pair. The tool developed for this purpose uses word embeddings to assess word similarity, and word alignments computed by fast_align to identify the translations of words in the existing parallel corpus. More details about the tool and how it works can be found at https://github.com/paracrawl/synthesis
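The substitution idea described above can be illustrated with a minimal sketch. This is not the code of the ParaCrawl synthesis tool: the function names, the similarity threshold, the toy embedding vectors and the restriction to single-word terms with one-to-one alignments are all assumptions made for illustration. Only the alignment format (`i-j` pairs, as written by fast_align) is taken from the real tool chain.

```python
import math

def parse_alignment(line):
    """Parse fast_align-style 'i-j' pairs into {source index: [target indices]}."""
    links = {}
    for pair in line.split():
        i, j = pair.split("-")
        links.setdefault(int(i), []).append(int(j))
    return links

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def synthesize(src, tgt, alignment, term_src, term_tgt, vectors, threshold=0.8):
    """Swap the first source word similar to the glossary term, and its aligned
    target word, for the glossary term pair. Returns (new_src, new_tgt) or None.
    Hypothetical helper: the threshold and the one-to-one restriction are assumptions."""
    links = parse_alignment(alignment)
    term_vec = vectors.get(term_src)
    if term_vec is None:
        return None
    for i, word in enumerate(src):
        vec = vectors.get(word)
        targets = links.get(i, [])
        # Skip unknown words, the term itself, and words without a single aligned target.
        if vec is None or word == term_src or len(targets) != 1:
            continue
        if cosine(vec, term_vec) >= threshold:
            j = targets[0]
            new_src = src[:i] + [term_src] + src[i + 1:]
            new_tgt = tgt[:j] + [term_tgt] + tgt[j + 1:]
            return new_src, new_tgt
    return None

# Toy embeddings, invented for this example only.
vectors = {
    "laptop": [0.9, 0.1],
    "server": [0.85, 0.2],
    "the": [0.0, 1.0],
    "restarted": [0.1, 0.3],
}
src = "the laptop restarted".split()
tgt = "der Laptop startete neu".split()
pair = synthesize(src, tgt, "0-0 1-1 2-2 2-3", "server", "Server", vectors)
print(pair)
```

Here the glossary pair server/Server replaces laptop/Laptop because the two source words have similar vectors and "laptop" is aligned to exactly one target word. A real system would also need to handle multi-word terms, inflection and agreement, which this sketch ignores.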

The corpora can be found on the Manufactured Data web page.

Evaluation

The synthesized data was evaluated in terms of both translation quality and terminology translation accuracy. We created test sets containing terms and their specified translations by subsampling domain-relevant corpora, and compared the performance of baseline ParaCrawl models with that of models augmented with synthesized data.
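The exact definition of term accuracy is not given here. One plausible reading, sketched below, is the share of specified target terms that actually appear in the system output for the sentences that contain them. The function name, the case-insensitive substring match and the example sentences are assumptions for illustration, not the evaluation script behind the reported numbers.

```python
def term_accuracy(examples):
    """Fraction of required target terms found in their hypothesis.

    examples: iterable of (hypothesis, required_terms) pairs, where
    required_terms lists the specified translations expected in that output.
    Matching is a case-insensitive substring test (an assumption)."""
    found = 0
    total = 0
    for hypothesis, required_terms in examples:
        hyp = hypothesis.lower()
        for term in required_terms:
            total += 1
            if term.lower() in hyp:
                found += 1
    return found / total if total else 0.0

# Invented system outputs and required terms, for illustration only.
examples = [
    ("Der Server wurde neu gestartet .", ["Server"]),
    ("Die Datei liegt im Verzeichnis .", ["Ordner", "Datei"]),
]
acc = term_accuracy(examples)
print(f"{acc:.1%}")
```

Substring matching is deliberately crude: it counts inflected forms only when they contain the base form and can match inside longer words, so a stricter evaluation might use tokenization or lemmatization instead.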

The results are presented in the following table:

| Setup             | Baseline BLEU | Baseline term accuracy | Synthesized BLEU | Synthesized term accuracy |
|-------------------|---------------|------------------------|------------------|---------------------------|
| French IT         | 31.2          | 80.5%                  | 31.4 (+0.2)      | 81.6% (+1.1%)             |
| French Medical    | 35.8          | 80.4%                  | 37.4 (+1.6)      | 83.3% (+2.9%)             |
| French Law        | 37.5          | 81.3%                  | 38.1 (+0.6)      | 83.7% (+2.4%)             |
| French Financial  | 36.6          | 80.7%                  | 38.4 (+1.8)      | 83.5% (+2.8%)             |
| German IT         | 17.7          | 69.5%                  | 19.5 (+1.8)      | 79.6% (+10.1%)            |
| German Medical    | 31.3          | 47.7%                  | 32.2 (+0.9)      | 52.0% (+4.3%)             |
| German Law        | 27.2          | 67.1%                  | 29.1 (+1.9)      | 75.1% (+8.0%)             |
| German Financial  | 25.9          | 66.3%                  | 27.1 (+1.2)      | 75.8% (+9.5%)             |
| Italian IT        | 29.6          | 75.9%                  | 29.4 (-0.2)      | 77.8% (+1.9%)             |
| Italian Medical   | 37.7          | 78.9%                  | 40.6 (+2.9)      | 85.1% (+6.2%)             |
| Italian Law       | 32.8          | 77.8%                  | 33.5 (+0.7)      | 81.1% (+3.3%)             |
| Italian Financial | 32.2          | 80.2%                  | 33.6 (+1.4)      | 83.4% (+3.2%)             |
 
Last modified on Monday, 03 May 2021 16:20