Broader Web-Scale Provision of Parallel Corpora for European Languages
Scope & Objectives of ParaCrawl
Action: Broader Web-Scale Provision of Parallel Corpora for European Languages
This action builds upon the ongoing efforts to collect parallel corpora for the 24 official EU languages by adding Icelandic, Norwegian (Bokmål and Nynorsk), Basque, Catalan/Valencian, and Galician. It goes beyond HTML by ingesting PDFs and word processing formats, and expands the current crawling efforts to 1 petabyte of compressed web pages from the latest Internet Archive crawl. It adds domain filtering and weighting with a freely provided open-source tool, along with better document and segment alignment, better cleaning, and corpus postprocessing. We expect to create the largest parallel corpus for many of the languages, focusing on the needs of CEF-AT.
Action: Provision of Web-Scale Parallel Corpora for Official European Languages
ParaCrawl will create and release large parallel corpora to/from English for all official EU languages through a broad web crawling effort. State-of-the-art methods will be applied for the entire processing chain, from identifying web sites with translated text all the way to collecting, cleaning and delivering parallel corpora that are ready as training data for CEF-AT and as translation memories for DG Translation. It will also make available consortium partners’ open-source tools to CEF Automated Translation and all other interested parties. Throughout the project there will be four large parallel corpora releases and two software releases.
The target is to collect parallel corpora for the 24 official languages of the European Union.
Multilingual Web Crawling
ParaCrawl will discover multilingual content from candidate websites and crawl it.
Improved document and sentence-level alignment. Cleaned, anonymised and annotated translation units.
Open Source Data Collection Pipeline
The ParaCrawl project will make use of state-of-the-art open-source tools to crawl, align and clean data, and bundle them together in an open-source pipeline.
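To give a rough idea of what the cleaning stage of such a pipeline does, here is a minimal sketch in Python. It is not ParaCrawl's actual tooling: the function name `clean_pairs`, the `max_length_ratio` parameter and its default of 3.0 are assumptions made purely for illustration. It normalises whitespace, drops empty segments, discards pairs whose lengths differ implausibly, and removes exact duplicates.

```python
from typing import Iterable, Iterator, Tuple

SentencePair = Tuple[str, str]


def clean_pairs(
    pairs: Iterable[SentencePair],
    max_length_ratio: float = 3.0,  # hypothetical threshold, not a ParaCrawl setting
) -> Iterator[SentencePair]:
    """Yield sentence pairs that pass a few simple, illustrative checks."""
    seen = set()
    for source, target in pairs:
        # Collapse runs of whitespace and trim both sides.
        source = " ".join(source.split())
        target = " ".join(target.split())

        # Drop pairs where either side is empty.
        if not source or not target:
            continue

        # Drop pairs whose character lengths differ too much to be translations.
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_length_ratio:
            continue

        # Drop exact duplicates, keeping the first occurrence.
        key = (source, target)
        if key in seen:
            continue
        seen.add(key)

        yield source, target
```

A real pipeline would run language identification, alignment scoring and anonymisation on top of checks like these; the sketch only shows the general shape of a filter applied to aligned segments.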
September 2017, Kickoff meeting in Alicante, Spain
November 2017, Website up and running
January 2018, Release of Corpus v1
April 2018, Release of Corpus v1 in ELRC-SHARE
June 2018, Release of Corpus v2
June 2018, Release of Software v1
October 2018, Release of Corpus v3
March 2019, Release of Corpus v4, also in ELRC-SHARE
March 2019, Release of Software v2