September 2020
Augmented Data Release including Manufactured Data and Implementation of Deferred Crawling
Continued Web-Scale Provision of Parallel Corpora for European Languages

This Action will offer improved extraction software capable of efficiently processing an even larger portion of the Web (more than 1 petabyte, compressed). At the same time, it will apply state-of-the-art neural methods to the detection of parallel sentences and to the processing of the extracted corpora. Special emphasis will be placed on collecting larger corpora for language pairs that are currently under-resourced.

Implementation schedule: October 2019 to September 2021

Broader Web-Scale Provision of Parallel Corpora for European Languages

This Action aims to collect translated sentences from the web for all 24 official EU languages plus Icelandic, Norwegian, Basque, Catalan/Valencian, and Galician. These translations will be mined from a large collection of web pages, approximately 1 petabyte in size. The system will extract web pages in Hypertext Markup Language (HTML) as well as files in Portable Document Format (PDF), using embedded text where available and optical character recognition (OCR) otherwise.

Implementation schedule: September 2018 to September 2020

Provision of Web-Scale Parallel Corpora for Official European Languages

The Action aims to develop parallel corpora (collections of translated text) for all official EU languages. Through web crawling, the Action will provide the CEF building block Automated Translation (CEF-AT) with the same type of large-scale data that is available to large commercial Machine Translation (MT) engines. By the end of the Action, parallel corpora for all 24 official languages will be made available to CEF-AT. For 8 languages (English, German, Spanish, French, Polish, Italian, Portuguese, and Czech), the parallel corpora will have more than 1 billion tokens. For the other 16 languages, the Action aims to collect more than 100 million tokens.

Implementation schedule: September 2017 to March 2019

