Manufactured Data

ParaCrawl Synthesized Data

The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility.

The synthesized data corpora are built from ParaCrawl releases. Each new sentence pair is created by taking an existing sentence pair from a ParaCrawl corpus and replacing a word and its translation with a glossary term and its translation. The tool developed for this purpose uses word embeddings to assess word similarity, and word alignments computed by fast-align to identify the translations of words in the existing parallel corpus. More details about the tool and how it works can be found at https://github.com/paracrawl/synthesis. There have been several releases of synthetic data (see the download section below):

Release 1 focuses on the COVID-19 domain: it uses parallel sentences from ParaCrawl release 7.0 in combination with a COVID-19 glossary sourced from Tico-19.
Release 2 focuses on the Financial, IT, Law and Medical domains: it uses parallel sentences from ParaCrawl release 8 in combination with automatically-computed terminology in the four covered domains.
Release 3 focuses on improving the translation of rare words: it also uses parallel sentences from ParaCrawl release 8 and adds data for words that are rare in this corpus. Parallel data for the eight lowest-resourced official European languages is available for download.
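
As a rough illustration of how word embeddings can be used to assess word similarity, the sketch below ranks candidate words by cosine similarity. The choice of cosine similarity, the three-dimensional toy vectors, and the function names are all assumptions made for this example; the source does not state which similarity measure or embedding model the tool uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings):
    """Return (other_word, score) for the word closest to `word`."""
    target = embeddings[word]
    candidates = [
        (other, cosine_similarity(target, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return max(candidates, key=lambda item: item[1])


# Hypothetical toy vectors, made up for illustration only.
toy_embeddings = {
    "Guangdong": [0.9, 0.1, 0.3],
    "Hubei": [0.8, 0.2, 0.35],
    "office": [0.1, 0.9, 0.0],
}

best_word, best_score = most_similar("Guangdong", toy_embeddings)
```

With these toy vectors, the place name "Hubei" ranks well above "office" as a neighbour of "Guangdong", which is the kind of signal needed to pick a word that can be swapped for a glossary term.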

Process Description

Assume that a glossary contains the following translation pair:

Guangdong
Гуандун

The similarity model based on word embeddings may find the following similar translation pair in the corpus:

Hubei
Hubei

Note that both "Guangdong" and "Гуандун", as well as "Hubei" and "Hubei", must be similar.
The existing parallel corpus may contain this word translation pair in the following sentence pair:

Project monitoring and management are handled by our office in Hubei city .
Мониторинг и управление проектом осуществляется в нашем офисе в городе Hubei .

Note that the tool assumes that all data is tokenized - it does not perform any additional pre-processing.
Based on all this information, the following synthetic sentence pair is generated:

Project monitoring and management are handled by our office in Guangdong city .
Мониторинг и управление проектом осуществляется в нашем офисе в городе Гуандун .
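
The substitution step in this example can be sketched as follows. This is a simplified illustration, not the tool's implementation: the function name `synthesize_pair` is hypothetical, and plain token matching stands in for the fast-align word alignments that the tool uses to locate the translated word.

```python
def synthesize_pair(src_sentence, tgt_sentence, corpus_pair, glossary_pair):
    """Replace an existing word pair with a glossary pair in a tokenized sentence pair.

    Returns the new (source, target) pair, or None when the existing word
    pair does not occur on both sides.
    """
    old_src, old_tgt = corpus_pair
    new_src, new_tgt = glossary_pair

    # The data is assumed to be tokenized already, so whitespace splitting suffices.
    src_tokens = src_sentence.split()
    tgt_tokens = tgt_sentence.split()

    # Simplification: exact token match instead of word alignments.
    if old_src not in src_tokens or old_tgt not in tgt_tokens:
        return None

    new_src_tokens = [new_src if tok == old_src else tok for tok in src_tokens]
    new_tgt_tokens = [new_tgt if tok == old_tgt else tok for tok in tgt_tokens]
    return " ".join(new_src_tokens), " ".join(new_tgt_tokens)


src = "Project monitoring and management are handled by our office in Hubei city ."
tgt = "Мониторинг и управление проектом осуществляется в нашем офисе в городе Hubei ."

result = synthesize_pair(src, tgt, ("Hubei", "Hubei"), ("Guangdong", "Гуандун"))
```

Here `result` holds the synthetic sentence pair from the example, with "Hubei" replaced by "Guangdong" on the English side and by "Гуандун" on the Russian side.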

The following sets of synthesized data have been made available:

Release 3 (Sep 2021): Translation of rare words

Language     Domain       Source corpus   Sentences
Estonian     Rare words   ParaCrawl V8    2,425,32
Irish        Rare words   ParaCrawl V8    568,214
Croatian     Rare words   ParaCrawl V8    3,974,757
Icelandic    Rare words   ParaCrawl V8    790,846
Lithuanian   Rare words   ParaCrawl V8    2,719,698
Latvian      Rare words   ParaCrawl V8    1,752,229
Maltese      Rare words   ParaCrawl V8    438,183
Slovenian    Rare words   ParaCrawl V8    3,077,244

Release 2 (April 2021): Financial, IT, Law and Medical corpora

Language   Domain      Source corpus   Sentences
German     Financial   ParaCrawl V8    6,141,225
French     Financial   ParaCrawl V8    11,589,149
Italian    Financial   ParaCrawl V8    6,957,297
German     Law         ParaCrawl V8    6,006,154
French     Law         ParaCrawl V8    11,801,739
Italian    Law         ParaCrawl V8    7,250,261
German     Medical     ParaCrawl V8    2,220,814
French     Medical     ParaCrawl V8    10,200,900
Italian    Medical     ParaCrawl V8    7,574,006

Release 1 (September 2020): COVID-19 corpora

Language     Domain     Source corpus   Sentences
Bulgarian    COVID-19   ParaCrawl V7    1,872,208
Czech        COVID-19   ParaCrawl V7    28,872,079
Danish       COVID-19   ParaCrawl V7    3,967,918
Estonian     COVID-19   ParaCrawl V7    675,770
Finnish      COVID-19   ParaCrawl V7    1,367,263
Greek        COVID-19   ParaCrawl V7    541,089
Icelandic    COVID-19   ParaCrawl V7    920,263
Latvian      COVID-19   ParaCrawl V7    194,573
Lithuanian   COVID-19   ParaCrawl V7    2,924,916
Portuguese   COVID-19   ParaCrawl V7    43,204,196
Romanian     COVID-19   ParaCrawl V7    1,282,807
Russian      COVID-19   ParaCrawl V7    516,767
Slovak       COVID-19   ParaCrawl V7    164,490
Slovenian    COVID-19   ParaCrawl V7    74,015
Swedish      COVID-19   ParaCrawl V7    16,526,006

Bitextor

ParaCrawl Open Source pipeline

Bitextor is a tool for automatically harvesting bitexts from multilingual websites.

Quickstart

Docker

The easiest way to install Bitextor is to use the Docker version:

docker pull bitextor/bitextor
docker run --name bitextor bitextor/bitextor

Corset

Filtering

Searching

Perform searches on ParaCrawl corpora or get filtered subsets from it using Corset.

KEOPS

Quality Evaluation

KEOPS provides a complete tool for manual evaluation of parallel sentences and other linguistic tasks.

Bicleaner

Classifier

Bicleaner (bicleaner-classify) is a Python tool that aims to detect noisy sentence pairs in a parallel corpus. It estimates the likelihood that a pair of sentences are mutual translations: a value close to 1 indicates a likely translation, and a value close to 0 an unlikely one. Sentence pairs considered very noisy are scored 0.
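
Scores of this kind are typically used to filter a corpus by threshold. The sketch below shows one way to do that on already-scored data. It makes several assumptions that are not taken from the source: each line is tab-separated with the source sentence first, the target sentence second and the score last; the 0.5 threshold is an arbitrary illustrative value; and the sample lines are made up.

```python
def filter_scored_pairs(lines, threshold=0.5):
    """Keep sentence pairs whose score (last tab-separated field) meets the threshold."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed line: expected source, target and score
        try:
            score = float(fields[-1])
        except ValueError:
            continue  # last field is not a number, so skip the line
        if score >= threshold:
            kept.append((fields[0], fields[1], score))
    return kept


# Made-up sample in the assumed layout: source <TAB> target <TAB> score
sample = [
    "Hello world .\tBonjour le monde .\t0.93\n",
    "Click here\tMentions légales\t0.00\n",
    "Good morning .\tBonjour .\t0.71\n",
]

kept_pairs = filter_scored_pairs(sample)
```

A higher threshold keeps fewer but cleaner pairs; the right value depends on the language pair and the downstream task.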

Third Party Releases

JParaCrawl

JParaCrawl is the largest publicly available English-Japanese parallel corpus, created by NTT Communication Science Laboratories. It was built by crawling the web at large scale and automatically aligning parallel sentences.

Citing ParaCrawl

Research

If you want to cite ParaCrawl, please refer to: ParaCrawl: Web-Scale Acquisition of Parallel Corpora

Broader/Continued Web-Scale Provision of Parallel Corpora for European Languages
