ParaCrawl Synthesized Data

The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility.

The synthesized data corpora are built from ParaCrawl releases. Each new sentence pair is created by taking an existing sentence pair from a ParaCrawl corpus and replacing a word and its translation with a glossary term and that term's translation. The tool developed for this purpose uses word embeddings to assess word similarity, and word alignments computed by fast-align to identify the translations of words in the existing parallel corpus. More details about the tool and how it works can be found at https://github.com/paracrawl/synthesis. There have been several releases of synthetic data (see download section below):

  • Release 1 focuses on the COVID-19 domain: it uses parallel sentences from ParaCrawl release 7.0 in combination with a COVID-19 glossary sourced from Tico-19.
  • Release 2 focuses on the Financial, IT, Law and Medical domains: it uses parallel sentences from ParaCrawl release 8 in combination with automatically computed terminology for the four covered domains.

Process Description

Assume that a given glossary contains the following translation pair:

  • Guangdong
  • Гуандун

The similarity model based on word embeddings may find the following similar translation pair in the corpus:

  • Hubei
  • Hubei

Note that both "Guangdong" and "Гуандун", as well as "Hubei" and "Hubei", must be similar.
The existing parallel corpus may contain this word translation pair in the following sentence pair:

  • Project monitoring and management are handled by our office in Hubei city .
  • Мониторинг и управление проектом осуществляется в нашем офисе в городе Hubei .

Note that the tool assumes that all data is already tokenized; it does not perform any additional pre-processing.
Based on all this information, the following synthetic sentence pair is generated:

  • Project monitoring and management are handled by our office in Guangdong city .
  • Мониторинг и управление проектом осуществляется в нашем офисе в городе Гуандун .
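
The substitution step in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the actual tool: the function names `parse_alignment` and `synthesize_pair` are hypothetical, the alignment is written by hand in the "i-j" token-index format that fast_align outputs, and the similarity check based on word embeddings is assumed to have already selected the corpus pair (Hubei, Hubei) for the glossary pair (Guangdong, Гуандун).

```python
def parse_alignment(line):
    """Parse an 'i-j' alignment line into (source_index, target_index) pairs."""
    pairs = []
    for item in line.split():
        src_idx, tgt_idx = item.split("-")
        pairs.append((int(src_idx), int(tgt_idx)))
    return pairs


def synthesize_pair(src_sentence, tgt_sentence, alignment_line,
                    corpus_pair, glossary_pair):
    """Swap an aligned corpus word pair for a glossary pair.

    Both sentences are assumed to be tokenized already (tokens separated
    by single spaces). Returns the new (source, target) sentence pair, or
    None when the corpus pair is not aligned anywhere in the sentence pair.
    """
    src_tokens = src_sentence.split()
    tgt_tokens = tgt_sentence.split()
    old_src, old_tgt = corpus_pair
    new_src, new_tgt = glossary_pair

    replaced = False
    for i, j in parse_alignment(alignment_line):
        # Replace only where the alignment links the two corpus words.
        if src_tokens[i] == old_src and tgt_tokens[j] == old_tgt:
            src_tokens[i] = new_src
            tgt_tokens[j] = new_tgt
            replaced = True

    if not replaced:
        return None
    return " ".join(src_tokens), " ".join(tgt_tokens)


src = "Project monitoring and management are handled by our office in Hubei city ."
tgt = "Мониторинг и управление проектом осуществляется в нашем офисе в городе Hubei ."
# Hand-written partial alignment, for illustration only.
alignment = "8-7 10-10 11-9 12-11"

result = synthesize_pair(src, tgt, alignment,
                         corpus_pair=("Hubei", "Hubei"),
                         glossary_pair=("Guangdong", "Гуандун"))
print(result[0])
print(result[1])
```

Working on token indices rather than on raw strings means a word is replaced only where the alignment links it to the expected translation, so other occurrences of the same string are left untouched.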

The following sets of synthesized data have been made available:

Release 2 (April 2021): Financial, IT, Law and Medical corpora

Language    Domain     Source corpus   Sentences
German      Financial  ParaCrawl V8    6,141,225
French      Financial  ParaCrawl V8   11,589,149
Italian     Financial  ParaCrawl V8    6,957,297
German      Law        ParaCrawl V8    6,006,154
French      Law        ParaCrawl V8   11,801,739
Italian     Law        ParaCrawl V8    7,250,261
German      Medical    ParaCrawl V8    2,220,814
French      Medical    ParaCrawl V8   10,200,900
Italian     Medical    ParaCrawl V8    7,574,006

Release 1 (September 2020): COVID-19 corpora

Language    Domain     Source corpus   Sentences
Bulgarian   COVID-19   ParaCrawl V7    1,872,208
Czech       COVID-19   ParaCrawl V7   28,872,079
Danish      COVID-19   ParaCrawl V7    3,967,918
Estonian    COVID-19   ParaCrawl V7      675,770
Finnish     COVID-19   ParaCrawl V7    1,367,263
Greek       COVID-19   ParaCrawl V7      541,089
Icelandic   COVID-19   ParaCrawl V7      920,263
Latvian     COVID-19   ParaCrawl V7      194,573
Lithuanian  COVID-19   ParaCrawl V7    2,924,916
Portuguese  COVID-19   ParaCrawl V7   43,204,196
Romanian    COVID-19   ParaCrawl V7    1,282,807
Russian     COVID-19   ParaCrawl V7      516,767
Slovak      COVID-19   ParaCrawl V7      164,490
Slovenian   COVID-19   ParaCrawl V7       74,015
Swedish     COVID-19   ParaCrawl V7   16,526,006