Tuesday, 15 September 2020 12:48

ParaCrawl Corpus Release 7

ParaCrawl 7 is the final release of ParaCrawl Action 2, "Broader Web-Scale Provision of Parallel Corpora for European Languages". It uses a new version of Bicleaner, version 0.14 (see full log of changes). Some highlights are as follows:

  • new rules filter out noise such as sentences containing many glued words or inappropriate language
  • the classifier now uses a different technology: extremely randomised trees replace random forest as the default classifier
  • classifier features have been improved to cope better with out-of-vocabulary words (OOVs) and to make the most of the probabilistic dictionaries
  • the training procedure has been simplified and logging messages are now more informative
  • access to pre-trained language packs has been made easier
  • the 29 available language packs have been updated
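
To illustrate the kind of noise rule mentioned in the first highlight, here is a minimal Python sketch of a glued-words check. It is not Bicleaner's actual rule: the names `looks_glued` and `filter_pairs`, the lowercase-followed-by-uppercase pattern and both thresholds are assumptions chosen for illustration only.

```python
import re

# Hypothetical heuristic, not Bicleaner's implementation: a token counts as
# "glued" if a lowercase letter is directly followed by an uppercase one
# (e.g. "priceShipping") or if the token is implausibly long.
GLUED_CASE = re.compile(r"[a-z][A-Z]")
MAX_TOKEN_LEN = 30      # assumed threshold
MAX_GLUED_RATIO = 0.3   # assumed threshold


def looks_glued(sentence: str) -> bool:
    """Return True if too many whitespace-separated tokens look glued."""
    tokens = sentence.split()
    if not tokens:
        return False
    glued = sum(
        1 for t in tokens
        if GLUED_CASE.search(t) or len(t) > MAX_TOKEN_LEN
    )
    return glued / len(tokens) > MAX_GLUED_RATIO


def filter_pairs(pairs):
    """Keep only sentence pairs where neither side looks glued."""
    return [
        (src, tgt) for src, tgt in pairs
        if not looks_glued(src) and not looks_glued(tgt)
    ]
```

A rule like this is cheap to run before the classifier, so obviously broken pairs never need to be scored.
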
The restorative cleaning used previously was also applied to ParaCrawl 7 using Bifixer. ParaCrawl 7 also comes with a new version of the corpus in which personal data has been filtered with Biroamer, a tool that performs anonymisation or, more precisely, a full ROAM (Random, Omit, Anonymise and Mix) process on a parallel corpus. For the purposes of the ParaCrawl project, sentences identified as containing personal data are removed during the full ROAM process.
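
As a rough illustration of removing sentences that contain personal data, the following Python sketch drops sentence pairs in which either side matches a personal-data pattern. It does not reproduce Biroamer: the regular expressions (e-mail addresses and phone-like digit runs) and the function names `has_personal_data` and `omit_personal_data` are assumptions made for this example.

```python
import re

# Assumed patterns for illustration only; they are not Biroamer's detection logic.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def has_personal_data(sentence: str) -> bool:
    """Return True if the sentence contains an e-mail or phone-like string."""
    return bool(EMAIL.search(sentence) or PHONE.search(sentence))


def omit_personal_data(pairs):
    """Drop every sentence pair in which either side contains personal data."""
    return [
        (src, tgt) for src, tgt in pairs
        if not (has_personal_data(src) or has_personal_data(tgt))
    ]
```

Dropping the whole pair, rather than masking the match, keeps the remaining corpus strictly parallel.
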
Three new language pairs are also part of ParaCrawl 7: Spanish-Basque, Spanish-Catalan and Spanish-Galician.

Corpora sizes and download links are available from ParaCrawl's website (https://paracrawl.eu/v7).

The latest release of the ParaCrawl OpenSource Pipeline (Bitextor) is available on GitHub.
The corpus and software are released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages and more.
The corpora are released under the Creative Commons CC0 license ("no rights reserved"): https://creativecommons.org/share-your-work/public-domain/cc0/
