Tuesday, 07 April 2020 08:38

ParaCrawl Corpus Release 6

Release 6 adds a new language pair, English-Icelandic, and substantially more data for many other languages. Restorative cleaning with Bifixer yields more data by improving sentence splitting; better data by fixing wrong encoding, HTML issues, alphabet issues, and typos; and unique data by identifying not only duplicates but also near-duplicates. Improved Bicleaner models have also been applied in this release to filter out noisy parallel sentences.
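
To illustrate what identifying near-duplicates can mean in practice, here is a minimal Python sketch. It is not Bifixer's actual implementation: the functions normalize, pair_key, and deduplicate are hypothetical names, and the normalization rule (lowercasing, stripping accents, dropping punctuation and whitespace) is an assumption chosen for illustration. Sentence pairs that hash to the same key after normalization are treated as near-duplicates, and only the first is kept.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and drop everything that is not a letter or digit."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Combining accent marks are not alphanumeric, so they are dropped here too.
    return "".join(ch for ch in decomposed if ch.isalnum())


def pair_key(source: str, target: str) -> str:
    """Hash the normalized source and target together (illustrative key only)."""
    payload = normalize(source) + "\t" + normalize(target)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def deduplicate(pairs):
    """Keep the first sentence pair seen for each normalized key."""
    seen = set()
    unique = []
    for source, target in pairs:
        key = pair_key(source, target)
        if key not in seen:
            seen.add(key)
            unique.append((source, target))
    return unique


if __name__ == "__main__":
    pairs = [
        ("Hello, world!", "Halló, heimur!"),
        ("hello world", "Hallo heimur"),  # differs only in case, accents, punctuation
        ("Good morning", "Góðan daginn"),
    ]
    for source, target in deduplicate(pairs):
        print(source, "|||", target)
```

Exact duplicates are a special case of this scheme, since identical pairs always share a key. A more aggressive normalization would merge more pairs at the risk of discarding genuinely different sentences.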

Corpora sizes and download links are available from ParaCrawl's website (https://paracrawl.eu/v5-1).

This is the official release to be used in WMT20. Stay tuned for more news and follow us on Twitter @ParaCrawl.

The latest release of the ParaCrawl open-source pipeline (Bitextor) is available on GitHub.

The corpus and software are released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English).

The corpora are released under the Creative Commons CC0 license ("no rights reserved"): https://creativecommons.org/share-your-work/public-domain/cc0/

Last modified on Tuesday, 07 April 2020 08:41