More monolingual and parallel data
This section covers additional sources of parallel or monolingual data, either produced with part or all of the ParaCrawl pipeline or derived from ParaCrawl data.

Monolingual data from ParaCrawl V8: English
The corpus has 96,470,655,818 lines, 1,337,127,886,176 tokens, and 9,153,226,323,307 characters of English. Text was extracted from HTML, classified, split, and deduplicated.
The corpus is available as 128 files, split by the hash of the line. The first and last URLs are:
https://neural.mt/data/paracrawl8-mono/en-000.gz
https://neural.mt/data/paracrawl8-mono/en-127.gz
#!/bin/bash
for i in {0..127}; do
  wget https://neural.mt/data/paracrawl8-mono/en-$(printf "%03i" $i).gz
done
Files are hosted on the Internet Archive. Because the Internet Archive limits each directory to 1 TB, the URLs above redirect to the appropriate directory.
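To picture what "split by the hash of the line" means, here is a minimal sketch. The hash function actually used for the release is not stated here, so `cksum` (the POSIX CRC) is only a stand-in, and the helper names `shard_of` and `shard_file` are hypothetical. The point is that a given line always maps to the same one of the 128 files, so identical lines end up in the same file.

```shell
#!/bin/bash
# Illustration only: cksum stands in for the real (unspecified) hash function.
NUM_SHARDS=128

# Print the shard number (0..127) that a line of text maps to.
shard_of() {
  local crc
  # cksum on stdin prints "<crc> <byte count>"; keep only the CRC.
  crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( crc % NUM_SHARDS ))
}

# Print the file name that shard would correspond to, in the en-NNN.gz pattern.
shard_file() {
  printf 'en-%03d.gz\n' "$(shard_of "$1")"
}
```

For example, `shard_file "Some English sentence."` prints one name in the `en-000.gz` to `en-127.gz` range, and prints the same name every time for the same input. With the stand-in hash, that name will generally not match the file the line actually lives in.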
Source data
This is all the English data used for ParaCrawl release 8, which is based on the following crawls:
- Internet Archive: wide00006, wide00015, and pages with en, is, hr, no, and ga in their URL.
- CommonCrawl: 2016-30, 2017-30, 2018-30, 2019-18, and 2019-35.
- Targeted: Philipp Koehn crawled domains that have a mix of multilingual content based on language classification in CommonCrawl. Marta Bañón aimed for sites in Basque, Catalan, Galician, and Spanish but picked up some English on the way. Hieu Hoang crawled sites that produced parallel sentences in earlier generations of ParaCrawl.
More languages
Coming, though ParaCrawl release 9 processing takes priority. That will have even more data!

EuroPat: Unleashing European Patent Translations
Patents provide a rich source of technical vocabulary, product names, and person names that complement other data sources used for machine translation.
This Action will mine parallel corpora from patents by aggregating, aligning, and converting patent data. Alignment and cleaning modules in the ParaCrawl pipeline will be enhanced and used to carry out this action.
The first release included English-German (12.6M parallel sentences) and English-French (9.2M parallel sentences) corpora, built by using information from the European Patent Organisation database to identify patents.
The second release includes 6 language combinations from various sources: English-German (15.5M parallel sentences), English-Spanish (44.4M parallel sentences), English-French (12M parallel sentences), English-Croatian (75k parallel sentences), English-Norwegian (4M parallel sentences), and English-Polish (89k parallel sentences).
Implementation schedule: September 2010 to September 2021
MultiParaCrawl v7.1
Parallel corpora from web crawls collected in the ParaCrawl project, further processed into a multi-parallel corpus by pivoting via English. MultiParaCrawl provides only the additional language pairs that came out of pivoting; the bitexts that include English are available from the ParaCrawl release. Statistics for MultiParaCrawl v7.1:
- 40 languages, 669 bitexts
- total number of files: 40
- total number of tokens: 10.14G
- total number of sentence fragments: 505.48M