The kick-off meeting of the third CEF-funded Action took place last week. The Action aims to improve and expand the parallel corpora developed in two previous Actions (ParaCrawl-1-Action no 2016-EU-IA-0114 and ParaCrawl-2-Action no 2017-EU-IA-0178). Those Actions have already resulted in the release of the largest publicly available parallel corpora to date, covering all official EU/EEA languages paired with English, as well as a complete open-source software toolkit for end-to-end crawling and extraction.
ParaCrawl 3 will offer improved extraction software capable of efficiently processing an even larger portion of the Web (more than 1 petabyte of compressed data). At the same time, it will apply state-of-the-art neural methods to the detection of parallel sentences and to the processing of the extracted corpora. Special emphasis will be placed on collecting larger corpora for language pairs that are currently under-resourced. The corpora will be made more useful for training machine translation (MT) systems by post-processing the data to split long sentences, repair broken sentences and synthesise new sentences.
The new corpus releases will be made available via a data portal that will allow users building machine translation systems to select the types of text that best fit their purpose.