Based on feedback from users and evaluators, we have recently been working on improving our ability to detect and remove machine-translated (MT) content from ParaCrawl corpora. Although most of the feedback concerned large chunks of MT content in low-resource languages, the problem also affects high-resource languages, though it accounts for a smaller percentage of their total content.

The approach taken is “bottom-up”: it focuses on the structure of the raw data that we process (i.e. the crawled web pages) rather than on the textual content itself. Limiting our attention to the most common MT services and plugins, we analysed the structure of many websites to identify traces of such tools that allow us to reliably detect pages consisting solely of MT content. To avoid excessive filtering, we followed a conservative approach: we concentrated on patterns that point to explicit and extensive use of translation tools, rather than filtering out whole sites, because many websites with human-translated content fall back to MT for some of their content. Based on these findings, a new filtering tool has been developed and integrated into the processing pipeline as an additional pre-processing step.
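The idea of structure-based, conservative filtering can be illustrated with a minimal Python sketch. It is not the ParaCrawl filtering tool: the marker strings and the function names `is_fully_mt_page` and `filter_pages` are assumptions made for illustration, and the real tool's patterns are not listed in this post. The sketch only flags a page when a marker sits on the top-level `<html>` tag, so a page that merely contains a translated fragment is kept.

```python
import re

# Matches the opening <html ...> tag of a page.
HTML_TAG = re.compile(r"<html\b[^>]*>", re.IGNORECASE)

# ASSUMPTION: illustrative page-level markers only. The patterns used by the
# actual ParaCrawl filter are not given in the source text.
PAGE_LEVEL_MARKERS = ("translated-ltr", "translated-rtl")


def is_fully_mt_page(html):
    """Return True only if the <html> tag itself carries an MT marker."""
    match = HTML_TAG.search(html)
    if match is None:
        return False
    tag = match.group(0).lower()
    return any(marker in tag for marker in PAGE_LEVEL_MARKERS)


def filter_pages(pages):
    """Keep the pages that are not flagged as fully machine translated."""
    return [page for page in pages if not is_fully_mt_page(page)]


pages = [
    '<html class="translated-ltr" lang="fi"><body>text</body></html>',
    '<html lang="fi"><body><div class="translated-ltr">text</div></body></html>',
]
kept = filter_pages(pages)
```

Checking only the page-level tag mirrors the conservative choice described above: the second page has a marker on an inner element and is therefore not removed.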

For low-resource languages, this reduced the size of the raw corpora by up to 25% (i.e. for the same quantity of raw data). For higher-resource languages the reduction was smaller, between 10% and 15% depending on the language. We believe that the new tool is a valuable addition to our processing pipeline, and we hope that users of our corpora will welcome this improvement.

The newly released filtered corpora, “ParaCrawl Corpus - release 8.0”, have been fully reprocessed using this tool. As always, the new release includes all the material collected and processed for the previous data release, plus all new data collected since.

The latest release of the ParaCrawl corpus, version 8, comes with synthesized data for four domains: Financial, IT, Law and Medical. It covers three language combinations: English-German, English-French and English-Italian.

These corpora are made by taking existing sentence pairs from ParaCrawl corpora and replacing a word and its translation with a glossary term and that term's translation, creating a new sentence pair. The tool developed for this purpose uses word embeddings to assess word similarity, and word alignments computed by fast_align to identify the translations of words in the existing parallel corpus. More details about the tool and how it works can be found at https://github.com/paracrawl/synthesis
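The substitution step can be sketched in a few lines of Python. This is a simplified illustration, not the code from the synthesis repository: the function names are hypothetical, the embedding-based choice of which word to replace is left out, and a word is only replaced when it is aligned to exactly one target token. The alignment string uses the `i-j` (source index, target index) format that fast_align writes.

```python
def parse_alignment(line):
    """Parse 'i-j' pairs (source index, target index) into a list of tuples."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def substitute_term(src_tokens, tgt_tokens, alignment, old_src, new_src, new_tgt):
    """Swap old_src for new_src and its aligned target token for new_tgt.

    Returns None when old_src is absent or is not aligned to exactly one
    target token (a conservative choice made for this sketch).
    """
    if old_src not in src_tokens:
        return None
    i = src_tokens.index(old_src)
    targets = [j for (s, j) in alignment if s == i]
    if len(targets) != 1:
        return None
    new_src_tokens = list(src_tokens)
    new_tgt_tokens = list(tgt_tokens)
    new_src_tokens[i] = new_src
    new_tgt_tokens[targets[0]] = new_tgt
    return new_src_tokens, new_tgt_tokens


# Hypothetical example pair and glossary entry (antibiotic -> Antibiotikum).
src = "the patient took the drug".split()
tgt = "der Patient nahm das Medikament".split()
links = parse_alignment("0-0 1-1 2-2 3-3 4-4")
result = substitute_term(src, tgt, links, "drug", "antibiotic", "Antibiotikum")
```

A real system also has to handle multi-word terms and morphological agreement on the target side, which this sketch ignores.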

The corpora can be found on the Manufactured Data web page.

Evaluation

The synthesized data has been evaluated in terms of both translation quality and terminology translation accuracy. We created test sets that contain terms and their specified translations by subsampling domain-relevant corpora, and compared the performance of baseline ParaCrawl models with that of models augmented with synthesized data.
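The post does not define how term accuracy is computed. One plausible reading, sketched below in Python as an assumption, is the share of specified term translations that appear in the system output. The function name `term_accuracy` and the case-insensitive substring match are illustrative choices, not the evaluation script used for the results reported here.

```python
def term_accuracy(hypotheses, required_terms):
    """Fraction of required target terms found in the matching hypothesis.

    hypotheses: list of translated sentences (str).
    required_terms: list of lists; required_terms[k] holds the specified
    target-side terms for sentence k.
    ASSUMPTION: a term counts as correct if it occurs as a
    case-insensitive substring of the hypothesis.
    """
    total = 0
    hits = 0
    for hyp, terms in zip(hypotheses, required_terms):
        hyp_lower = hyp.lower()
        for term in terms:
            total += 1
            if term.lower() in hyp_lower:
                hits += 1
    return hits / total if total else 0.0


# Hypothetical system output and specified terms.
hyps = [
    "Der Patient nahm das Antibiotikum",
    "Die Bank erhöht den Zinssatz",
]
terms = [["Antibiotikum"], ["Zinssatz", "Darlehen"]]
score = term_accuracy(hyps, terms)  # 2 of the 3 terms are present
```

Substring matching is lenient towards inflected forms that contain the term and strict towards those that do not, so a lemmatised comparison may be preferable for morphologically rich languages.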

The results are presented in the following table: 

 
Setup              Baseline BLEU   Baseline term accuracy   Synth. BLEU    Synthesized term accuracy
French IT          31.2            80.5%                    31.4 (+0.2)    81.6% (+1.1%)
French Medical     35.8            80.4%                    37.4 (+1.6)    83.3% (+2.9%)
French Law         37.5            81.3%                    38.1 (+0.6)    83.7% (+2.4%)
French Financial   36.6            80.7%                    38.4 (+1.8)    83.5% (+2.8%)
German IT          17.7            69.5%                    19.5 (+1.8)    79.6% (+10.1%)
German Medical     31.3            47.7%                    32.2 (+0.9)    52.0% (+4.3%)
German Law         27.2            67.1%                    29.1 (+1.9)    75.1% (+8.0%)
German Financial   25.9            66.3%                    27.1 (+1.2)    75.8% (+9.5%)
Italian IT         29.6            75.9%                    29.4 (-0.2)    77.8% (+1.9%)
Italian Medical    37.7            78.9%                    40.6 (+2.9)    85.1% (+6.2%)
Italian Law        32.8            77.8%                    33.5 (+0.7)    81.1% (+3.3%)
Italian Financial  32.2            80.2%                    33.6 (+1.4)    83.4% (+3.2%)
 
Tuesday, 15 September 2020 12:48

ParaCrawl Corpus Release 7

ParaCrawl 7 is the final release of ParaCrawl Action 2: "Broader Web-Scale Provision of Parallel Corpora for European Languages" and it uses a brand new version of Bicleaner, namely version 0.14 (see full log of changes). Some highlights are as follows:

  • new rules have been implemented to filter out noise, e.g. sentences containing many glued words or inappropriate language
  • the classifier now uses a different technology: extremely randomised trees, rather than random forest, are the default classifier
  • classifier features have been improved to better cope with OOVs and make the most of the probabilistic dictionaries
  • training procedure has been simplified and logging info messages are now more informative
  • access to pre-trained language packs has also been eased
  • the 29 available language packs have been updated

The previous restorative cleaning was also applied to ParaCrawl 7 using Bifixer. ParaCrawl 7 also comes with a new version of the corpus in which personal data has been filtered with Biroamer, a tool that performs anonymisation or, more precisely, applies a full ROAM (Random, Omit, Anonymise and Mix) process to a parallel corpus. For the purposes of the ParaCrawl project, sentences that are identified as containing personal data are removed during the full ROAM process.
 
Three new language pairs are also part of ParaCrawl 7: Spanish-Basque, Spanish-Catalan, Spanish-Galician.

Corpora sizes and download links are available from ParaCrawl's website (https://paracrawl.eu/v7).

The latest release of the ParaCrawl OpenSource Pipeline (Bitextor) is available on Github.
 
The corpus and software are released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages and more.
The corpora are released under the Creative Commons CC0 license ("no rights reserved"). (https://creativecommons.org/share-your-work/public-domain/cc0/)

The main goal of the ParaCrawl project is to create the largest publicly available corpora by crawling hundreds of thousands of websites, using open source tools. As part of this effort, several open source components have been developed and integrated into the open-source tool Bitextor, a highly modular pipeline that allows harvesting parallel corpora from multilingual websites or from preexisting or historical web crawls such as Common Crawl or the one available as part of the Internet Archive.

The processing pipeline consists of the following steps: crawling, text extraction, document alignment, sentence alignment, and sentence pair filtering. The ACL paper describes these steps in detail and evaluates alternative methods empirically in terms of their impact on machine translation quality. The Hunalign, Bleualign and Vecalign tools are evaluated for the sentence alignment step. Similarly, Zipporah, Bicleaner and LASER are evaluated for the sentence pair filtering step. Benchmarking data sets for these evaluations are also published.

The released parallel corpora are also described in the paper, with statistics on the size of the corpora before and after cleaning for different languages. The quality and usefulness of the data are measured by training Transformer-based machine translation models with Marian for five different languages. Improvements in BLEU scores are reported against models trained on WMT data sets. Furthermore, the energy cost of running and maintaining such a computationally expensive pipeline is discussed and positive environmental impacts are highlighted. The paper aims to contribute to the further development of novel methods for processing raw parallel data and for training neural machine translation on noisy data, especially for low-resource languages.

 Read the paper

Watch our pre-recorded talk on ACL2020 Virtual Conference website
and join the live Q&A sessions on Tuesday, July 7, 2020:
Session 8A: Resources and Evaluation-7 14:00–15:00 CEST
Session 9A: Resources and Evaluation-9 19:00–20:00 CEST

Tuesday, 07 April 2020 08:38

ParaCrawl Corpus Release 6

Release 6 includes a new language pair, English-Icelandic, and a lot more data for many other languages. Restorative cleaning with Bifixer yields more data by improving sentence splitting; better data by fixing wrong encoding, HTML issues, alphabet issues and typos; and unique data by identifying not only duplicates but also near-duplicates. Improved Bicleaner models have also been applied to filter out noisy parallel sentences for this release.
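To make the difference between duplicates and near-duplicates concrete, here is a minimal Python sketch. It does not reproduce Bifixer's implementation: the normalisation steps (Unicode normalisation, lowercasing, stripping ASCII punctuation, collapsing whitespace) and the function names are assumptions chosen for illustration. Sentence pairs that differ only in those respects get the same key, and only the first occurrence is kept.

```python
import hashlib
import string
import unicodedata

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def _normalise(text):
    """ASSUMED normalisation: NFKC, lowercase, no ASCII punctuation, single spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(_PUNCT_TABLE)
    return " ".join(text.split())


def near_duplicate_key(src, tgt):
    """Hash of the normalised sentence pair; equal keys mean near-duplicates."""
    joined = _normalise(src) + "\t" + _normalise(tgt)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def drop_near_duplicates(pairs):
    """Keep the first sentence pair seen for each near-duplicate key."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        key = near_duplicate_key(src, tgt)
        if key not in seen:
            seen.add(key)
            kept.append((src, tgt))
    return kept


pairs = [
    ("Hello, world!", "Hallo, Welt!"),
    ("hello world", "hallo welt"),
    ("Goodbye", "Auf Wiedersehen"),
]
unique_pairs = drop_near_duplicates(pairs)
```

Exact deduplication would keep all three pairs here, while the normalised key treats the first two as the same sentence pair.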

Corpora sizes and download links are available from ParaCrawl's website (https://paracrawl.eu/v6).

The latest release of the ParaCrawl OpenSource Pipeline (Bitextor) is available on Github.

The corpus and software are released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English).

The corpora are released under the Creative Commons CC0 license ("no rights reserved"). (https://creativecommons.org/share-your-work/public-domain/cc0/)

Tuesday, 25 February 2020 10:30

ParaCrawl Corpus Release 5.1

Version 5.1 builds upon the same raw corpus as version 5. Thanks to improvements in the filtering procedure, the official subset extracted as version 5.1 is now larger for almost all language pairs (all but ga, de, sl and et). Extrinsic evaluation through MT for several language pairs also shows an improvement in quality.

Corpora sizes and download links are available from ParaCrawl's website (https://paracrawl.eu/v5-1).

This is the official release to be used in WMT20. Stay tuned for more news and follow us on twitter @ParaCrawl.

The latest release of the ParaCrawl OpenSource Pipeline (Bitextor) is available on Github.

The corpus and software are released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English).

The corpora are released under the Creative Commons CC0 license ("no rights reserved"). (https://creativecommons.org/share-your-work/public-domain/cc0/)

Wednesday, 01 January 2020 09:05

ParaCrawl - A CEF Digital Success Story

EU funding supports ParaCrawl, the largest collection of language resources for many European languages – significantly improving machine translation quality. Read the Success Story published by CEF Digital, titled "ParaCrawl taps the World Wide Web for language resources".

Continue Reading the Article

The kick-off meeting of the third CEF-funded Action took place last week. The Action aims to improve and expand the parallel corpora developed in two previous Actions (ParaCrawl-1-Action no 2016-EU-IA-0114 and ParaCrawl-2-Action no 2017-EU-IA-0178). These previous Actions have already resulted in the release of the largest ever publicly available parallel corpora, for all EU/EEA official languages paired with English, as well as a complete end-to-end crawling and extraction open-source software toolkit.

ParaCrawl 3 will offer improved extraction software capable of efficiently processing an even larger portion of the Web (more than 1 compressed petabyte). At the same time, it will apply state-of-the-art neural methods to the detection of parallel sentences, and the processing of the extracted corpora. Special emphasis will be placed on collecting larger corpora for language pairs that are currently under-resourced. The corpora will be made more useful for training machine translation (MT) systems by post-processing the data to split long sentences, repair broken sentences and synthesise new sentences.

The new corpus releases will be made available via a data portal which will allow the users building the machine translation systems to select the types of text which best fit their purpose.

Keep posted!

Friday, 13 September 2019 09:58

ParaCrawl corpus release 5

The fifth version of the ParaCrawl corpus has been released. It is the first release under the ParaCrawl action "Broader Web-Scale Provision of Parallel Corpora for European Languages". The latest release of the corpora contains newly crawled data, including data from the Internet Archive. Enhancements in the document and sentence aligners, together with an updated BiCleaner strategy, resulted in corpora twice the size of release v4 for all the official EU languages (23 languages paired with English).

Corpora sizes and download links are available from ParaCrawl's website (https://paracrawl.eu/releases.html).

The following chart shows an overview of the corpora sizes in terms of English word counts:

The latest release of the ParaCrawl OpenSource Pipeline (Bitextor) is available on Github.

The ParaCrawl efforts will continue with the Broader Web-Scale Provision of Parallel Corpora for European Languages, focusing on more language pairs, ingesting more file formats beyond HTML, expanding the crawl coverage and applying domain filtering. Stay tuned for more news and follow us on twitter @ParaCrawl.

The corpus and software are released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English).

The corpora are released under the Creative Commons CC0 license ("no rights reserved"). (https://creativecommons.org/share-your-work/public-domain/cc0/)

 

Monday, 01 April 2019 00:00

ParaCrawl works!

NMT experiments were performed for various language pairs, comparing models trained on WMT data with and without the addition of released ParaCrawl corpora. Shallow NMT models, trained with Marian, were used for these experiments. The following table shows that in almost all cases, the exception being en-cs, adding ParaCrawl data significantly improves the BLEU scores. The ParaCrawl pipeline has improved significantly since release 1, and the results reflect this: because v4 of the ParaCrawl data is much cleaner, the improvement in BLEU scores is much more evident.

 

Pair               Direction   BLEU (WMT)   BLEU (ParaCrawl v1)   BLEU (ParaCrawl v4)
Finnish-English    en-fi       17.5         17.5                  18.7
Finnish-English    fi-en       21.7         24.2                  26.3
Latvian-English    en-lv       13.2         13.9                  15.1
Latvian-English    lv-en       15.6         16.5                  18.1
Romanian-English   en-ro       25.9         26.5                  27.2
Romanian-English   ro-en       31.1         33.5                  35.1
Czech-English      en-cs       20.5         19.1                  20.4
Czech-English      cs-en       25.7         26.3                  26.8
German-English     en-de       24.0         20.8                  25.2
German-English     de-en       29.8         28.8                  32.9

Thursday, 14 March 2019 00:00

ParaCrawl corpus release v4.0

The fourth version of the ParaCrawl corpus has been released. It is the final release for the first ParaCrawl project, Provision of Web-Scale Parallel Corpora for Official European Languages, and contains parallel corpora for 23 European languages paired with English. The latest release of the corpora brings cutting-edge improvements to the processing pipeline, mainly focusing on getting high-quality bilingual sentences. To that end, extensive cleaning techniques have been applied, such as character-based language model filtering and safe restorative cleaning.

Corpora sizes and download links are available from ParaCrawl's website (http://paracrawl.eu/releases.html).
The ParaCrawl corpus is hosted by the Registry of Open Data on AWS.

The source code of the ParaCrawl OpenSource Pipeline (Bitextor) is available on Github.

The ParaCrawl efforts will continue with the second iteration, Broader Web-Scale Provision of Parallel Corpora for European Languages, focusing on more language pairs, ingesting more file formats beyond HTML, expanding the crawl coverage and applying domain filtering. Stay tuned for more news and follow us on twitter @ParaCrawl.

The corpus and software are released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English).

The corpora are released under the Creative Commons CC0 license ("no rights reserved"). (https://creativecommons.org/share-your-work/public-domain/cc0/)

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

Monday, 19 November 2018 00:00

ParaCrawl corpus release v3.0

The third version of the ParaCrawl corpus has been released. It contains parallel corpora for 23 languages paired with English. Six new languages are added in the v3 release, namely Bulgarian, Danish, Greek, Slovak, Slovenian and Swedish. For the previously released languages, more data is added to the corpus. For each language, two different versions of the corpus are released, based on two cleaning tools, i.e. BiCleaner and Zipporah. The ParaCrawl corpus is crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand new crawl which has much higher coverage of these selected websites than CommonCrawl.

Corpus size and download links are available from ParaCrawl's website (http://paracrawl.eu/releases.html). The corpus will soon be uploaded to other public data repositories as well.

The source code of the ParaCrawl OpenSource Pipeline (Bitextor) is also available on Github.

The corpus and software are released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English). Next release is scheduled for March 2019.

The corpora are released under the Creative Commons CC0 license ("no rights reserved"). (https://creativecommons.org/share-your-work/public-domain/cc0/)

Thursday, 27 September 2018 00:00

ParaCrawl corpus release v2.0

The second version of the ParaCrawl corpus has been released. It contains parallel corpora for 17 languages paired with English. Six new languages are added in the v2 release, namely Irish, Croatian, Maltese, Lithuanian, Hungarian and Estonian. For the previously released languages (German, French, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Romanian and Finnish), more data is added to the corpus. For each language, two different versions of the corpus are released, based on two cleaning tools, i.e. BiCleaner and Zipporah. The ParaCrawl corpus is crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand new crawl which has much higher coverage of these selected websites than CommonCrawl.

Corpus size and download links are available from ParaCrawl's website (http://paracrawl.eu/releases.html). The corpus will soon be uploaded to other public data repositories as well.

The source code of the ParaCrawl OpenSource Pipeline (Bitextor) is also available on Github.

The corpus and software are released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English). Updated releases are scheduled for June 2018, October 2018, and March 2019.

The corpora are released under the Creative Commons CC0 license ("no rights reserved"). (https://creativecommons.org/share-your-work/public-domain/cc0/)

Wednesday, 07 March 2018 00:00

Meet ParaCrawl at AMTA Technology Forum!

Prompsit, one of our partners, will attend and exhibit on behalf of ParaCrawl at the next AMTA Conference in Boston (17-21 March 2018). The exhibition is part of the Technology Forum organised inside the AMTA Conference which will take place on 18th March 2018 from 12:30 to 17:30.

By visiting us at AMTA’s Technology Forum:

  • you will learn more about the 11 parallel corpora that we already released
  • you will see a live demo of some of the tools that we will soon release: Bicleaner, a web-based TMX cleaner and KEOPS, an evaluation toolkit for parallel sentences.

Come and visit us at AMTA’s Technology Forum for free!
If you are coming only for the Technology Forum, you just need to select Complimentary Registration on the AMTA registration site.

Sunday, 14 January 2018 00:00

1st corpus release for ParaCrawl

The first version of the ParaCrawl corpus has been released. It contains parallel corpora for 11 languages paired with English, namely German, French, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Romanian, Finnish and Latvian, crawled from a large number of web sites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand new crawl which has much higher coverage of these selected websites than CommonCrawl.

Corpus size, BLEU score evaluations and download links are available from ParaCrawl's website (http://paracrawl.eu/releases.html). The corpus will soon be uploaded to other public data repositories as well.

The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility (CEF). This release used an existing toolchain that will be refined throughout the project and expanded to cover all official EU languages (23 languages parallel with English). Updated releases are scheduled for June 2018, October 2018, and March 2019.

The corpora are released under the Creative Commons CC0 license ("no rights reserved"). (https://creativecommons.org/share-your-work/public-domain/cc0/)