ParaCrawl Synthesized Data
The synthesized data corpora are built from ParaCrawl releases. Each new sentence pair is created by taking an existing sentence pair from a ParaCrawl corpus and replacing a word and its translation with a glossary term and its translation. The tool developed for this purpose uses word embeddings to assess word similarity, and word alignments computed by fast-align to identify the translations of words in the existing parallel corpus. More details about the tool and how it works can be found at https://github.com/paracrawl/synthesis.
There have been several releases of synthetic data (see download section below):
- Release 1 focuses on the COVID-19 domain: it uses parallel sentences from ParaCrawl release 7.0 in combination with a COVID-19 glossary sourced from Tico-19.
- Release 2 focuses on the Financial, IT, Law and Medical domains: it uses parallel sentences from ParaCrawl release 8 in combination with automatically computed terminology for the four covered domains.
Assume that a specific glossary contains the following translation pair:
- Guangdong
- Гуандун
The similarity model based on word embeddings may find the following similar translation pair in the corpus:
- Hubei
- Hubei
Note that the two pairs must be similar on both sides: "Guangdong" and "Hubei" in the source language, as well as "Гуандун" and "Hubei" in the target language.
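The similarity requirement can be illustrated with a small sketch. It assumes that similarity is measured as cosine similarity between word embeddings, separately for the source and target languages; the vectors, the threshold of 0.8 and the function names are made up for illustration and are not taken from the tool.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy embeddings, invented for this example only (not real model output).
SRC_VECS = {
    "Guangdong": [0.9, 0.1, 0.2],
    "Hubei": [0.8, 0.2, 0.3],
    "office": [0.1, 0.9, 0.1],
}
TGT_VECS = {
    "Гуандун": [0.7, 0.3, 0.1],
    "Hubei": [0.6, 0.4, 0.2],
    "офисе": [0.1, 0.2, 0.9],
}

def pairs_are_similar(glossary_pair, corpus_pair, threshold=0.8):
    """True only if the two pairs are similar on BOTH language sides.

    The threshold is a hypothetical value chosen for the toy vectors.
    """
    g_src, g_tgt = glossary_pair
    c_src, c_tgt = corpus_pair
    src_ok = cosine(SRC_VECS[g_src], SRC_VECS[c_src]) >= threshold
    tgt_ok = cosine(TGT_VECS[g_tgt], TGT_VECS[c_tgt]) >= threshold
    return src_ok and tgt_ok
```

With these toy vectors, the glossary pair (Guangdong, Гуандун) passes the check against (Hubei, Hubei) but not against (office, офисе).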
The existing parallel corpus may contain this word translation pair in the following sentence pair:
- Project monitoring and management are handled by our office in Hubei city .
- Мониторинг и управление проектом осуществляется в нашем офисе в городе Hubei .
Note that the tool assumes that all data is already tokenized; it does not perform any additional pre-processing.
Based on all this information, the following synthetic sentence pair is generated:
- Project monitoring and management are handled by our office in Guangdong city .
- Мониторинг и управление проектом осуществляется в нашем офисе в городе Гуандун .
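The replacement step in this example can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: it assumes whitespace-tokenized input and word alignments in the Pharaoh-style `i-j` format (source index, target index) that fast-align emits. The function names and the alignment string are hand-written for this example.

```python
def parse_alignment(line):
    """Parse a Pharaoh-style alignment line such as '0-0 1-2' into index pairs."""
    return [tuple(int(x) for x in link.split("-")) for link in line.split()]

def synthesize(src, tgt, alignment, corpus_pair, glossary_pair):
    """Replace an aligned occurrence of corpus_pair with glossary_pair.

    src and tgt are tokenized sentences (space-separated strings).
    Returns the new sentence pair, or None if the corpus pair is not
    found as an aligned word pair.
    """
    src_tokens, tgt_tokens = src.split(), tgt.split()
    c_src, c_tgt = corpus_pair
    g_src, g_tgt = glossary_pair
    replaced = False
    for i, j in alignment:
        # Only substitute where the alignment links the two words of the pair.
        if src_tokens[i] == c_src and tgt_tokens[j] == c_tgt:
            src_tokens[i], tgt_tokens[j] = g_src, g_tgt
            replaced = True
    if not replaced:
        return None
    return " ".join(src_tokens), " ".join(tgt_tokens)

SRC = "Project monitoring and management are handled by our office in Hubei city ."
TGT = "Мониторинг и управление проектом осуществляется в нашем офисе в городе Hubei ."
# Hand-written alignment for illustration; real links would come from fast-align.
ALIGNMENT = parse_alignment("1-0 2-1 3-2 0-3 5-4 7-6 8-7 9-8 11-9 10-10 12-11")

result = synthesize(SRC, TGT, ALIGNMENT, ("Hubei", "Hubei"), ("Guangdong", "Гуандун"))
```

Requiring the alignment link, rather than matching each side independently, avoids replacing a source word whose translation in that particular sentence is something other than the expected target word.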