ParaCrawl Corpus release v7.1
The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility. Release 7 is the final release of ParaCrawl Action 2: "Broader Web-Scale Provision of Parallel Corpora for European Languages".
ParaCrawl 7 uses a new version of Bicleaner (v0.14, see changes). Restorative cleaning with Bifixer was applied beforehand. ParaCrawl 7 also comes with a new version of the corpus in which personal data has been filtered with Biroamer, a tool that applies anonymisation or, more precisely, a full ROAM (Random, Omit, Anonymize and Mix) process to a parallel corpus. For the purposes of the ParaCrawl project, sentences identified as containing personal data are removed.
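As a rough illustration of the removal step described above, the following Python sketch drops any sentence pair in which either side matches a personal-data pattern. The regular expressions and the function names (`contains_personal_data`, `drop_personal_data`) are assumptions made for this example only; they are not Biroamer's API or its actual detection rules.

```python
import re

# Hypothetical patterns for illustration; Biroamer's real detection
# rules are not described in this release note.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def contains_personal_data(text):
    """Return True if the text matches any of the assumed patterns."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))


def drop_personal_data(pairs):
    """Keep only sentence pairs where neither side matches a pattern."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if not (contains_personal_data(src) or contains_personal_data(tgt))
    ]


pairs = [
    ("Contact us at info@example.com.", "Contáctenos en info@example.com."),
    ("The weather is nice today.", "Hoy hace buen tiempo."),
]
kept = drop_personal_data(pairs)
```

Checking both sides of a pair means a match in either language removes the whole pair, so the remaining corpus stays aligned.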
Three new language pairs are also part of ParaCrawl 7: Spanish-Basque, Spanish-Catalan and Spanish-Galician.