ParaCrawl Corpus Bonus Release

This bonus release covers two language pairs: Dutch-French and Polish-German. These language pairs were crawled in collaboration with an industry partner. The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility.

The BiCleaner files are provided in two formats. You can download either the TMX format or the plain TXT format. To convert a TMX file to a tab-separated text file, download the TMXT tool.
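For readers who want to see what the TMX-to-text conversion involves, here is a minimal Python sketch. It is not the TMXT tool and makes no claim about how that tool works. It assumes a standard TMX layout (tu elements containing tuv elements with an xml:lang or lang attribute and a seg child) and exact, case-insensitive language codes such as "nl" and "fr". The function names and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny made-up sample used only to illustrate the assumed TMX layout.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="nl"><seg>Goedemorgen.</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def tmx_to_tsv_rows(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) pairs from a TMX path or binary file object."""
    # iterparse streams the file, so large corpora need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                text = "".join(seg.itertext())
                # Collapse tabs and newlines so each pair stays on one TSV line.
                segs[lang] = " ".join(text.split())
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release the children of the processed translation unit


def write_tsv(source, out, src_lang, tgt_lang):
    """Write one tab-separated sentence pair per line; return the number of pairs."""
    count = 0
    for src, tgt in tmx_to_tsv_rows(source, src_lang, tgt_lang):
        out.write(f"{src}\t{tgt}\n")
        count += 1
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    write_tsv(io.BytesIO(SAMPLE_TMX), buffer, "nl", "fr")
    print(buffer.getvalue(), end="")
```

To convert a downloaded file, pass its path in place of the in-memory sample and open the output file with UTF-8 encoding. Translation units that lack either language are skipped.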
Language         Crawled Websites   Sentences    Source Words   Sentences     Source Words   Sentences    Source Words
Dutch-French     7,700              2,687,331    60,504,313     38,164,560    770,141,393    2,687,331    60,504,313
Polish-German    5,549              916,522      18,883,576     11,060,105    202,765,359    916,522      18,883,576

For WMT 2019, Release 3 of the corpus was used. For WMT 2018, the FILTERED v1.0 version of Release 1 was used.
Extra Languages in release v1 - Last Updated on Jan 2018
Russian          14,035             12,061,155   157,061,045    1,078,819,759