ParaCrawl Corpus release v9
Release 9 is the final release for ParaCrawl Action 3: "Continued Web-Scale Provision of Parallel Corpora for European Languages".
ParaCrawl 9 brings new content and higher quality, the result of an improved pipeline with:
- better PDF processing
- language identification based on the full version of CLD2 instead of the lite version
- improved machine translation models (almost all neural) used to parallelize sentences
- neural cleaning applied for the first time
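These notes do not describe how the neural cleaning step works internally. As a rough illustration only, cleaning of this kind can be pictured as scoring each sentence pair and dropping the pairs that score too low. The sketch below assumes a score in [0, 1] has already been attached to each pair by some classifier; `ScoredPair`, `filter_pairs`, and the 0.5 threshold are hypothetical names and values chosen for the example, not part of the ParaCrawl pipeline.

```python
from typing import Iterable, Iterator, NamedTuple


class ScoredPair(NamedTuple):
    """A sentence pair with a quality score (hypothetical structure)."""
    source: str
    target: str
    score: float  # assumed to lie in [0, 1]; higher means more likely a good translation


def filter_pairs(pairs: Iterable[ScoredPair], threshold: float = 0.5) -> Iterator[ScoredPair]:
    """Yield only the pairs whose score is at or above the threshold."""
    for pair in pairs:
        if pair.score >= threshold:
            yield pair


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        ScoredPair("Good morning.", "Bonjour.", 0.93),
        ScoredPair("Click here to subscribe", "Lorem ipsum dolor", 0.08),
    ]
    for pair in filter_pairs(sample, threshold=0.5):
        print(f"{pair.score:.2f}\t{pair.source}\t{pair.target}")
```

The threshold trades corpus size against noise: a higher value keeps fewer but cleaner pairs.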
As a bonus, we release an English-Chinese corpus and monolingual data (coming soon!).