ParaCrawl Corpus release v5.0

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release v5 is the first release for the Action "Broader Web-Scale Provision of Parallel Corpora for European Languages". Newly crawled data has been added, including data from the Internet Archive. Enhancements to the document and sentence aligners, together with an updated BiCleaner strategy, resulted in corpora twice the size of release v4 for all official EU languages (23 languages paired with English).

Language Crawled Websites Download Details
Release 3 of the corpus was used in the proceedings of WMT 2019. For WMT 2018, the FILTERED v1.0 release of the corpus was used.
The BiCleaner files are provided in two formats: you can download either the TMX format or the plain TXT format. To convert a TMX file to a tab-separated text file efficiently, download the TMXT tool.
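For readers who want to see what such a TMX-to-tab-separated conversion involves, here is a minimal sketch in Python using only the standard library. It is not the TMXT tool and makes no claim about how that tool works. The function names `tmx_pairs` and `tmx_to_tsv`, the sample document, and the language codes `en` and `fr` are assumptions made for illustration; the sketch also assumes each translation unit marks its language with the `xml:lang` attribute.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hypothetical TMX document, used only to exercise the sketch.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Hello world</seg></tuv>
    <tuv xml:lang="fr"><seg>Bonjour le monde</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Only one side</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Tabs\tand
newlines</seg></tuv>
    <tuv xml:lang="fr"><seg>Tabulations</seg></tuv></tu>
</body></tmx>
"""


def tmx_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) for each <tu> containing both languages.

    `source` is a file name or a binary file object. iterparse streams the
    document, so multi-gigabyte files need not fit in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release the finished translation unit


def tmx_to_tsv(source, out, src_lang, tgt_lang):
    """Write one 'source<TAB>target' line per pair; return the number written."""
    count = 0
    for pair in tmx_pairs(source, src_lang, tgt_lang):
        # Collapse internal tabs and newlines so each pair stays on one line.
        out.write("\t".join(" ".join(text.split()) for text in pair) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    written = tmx_to_tsv(io.BytesIO(SAMPLE), buffer, "en", "fr")
    print(written, "pairs written")
    print(buffer.getvalue(), end="")
```

For a downloaded file, `source` would be the path to the decompressed TMX and `out` an open text file; the language codes to pass depend on what the file actually declares.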
Bulgarian 4,762
  File Size Sentence Pairs English Words
RAW v5.0 7.4GB 248,555,951 1,564,051,100
BiCleaner v5.0 692MB 2,586,277 55,725,444
Croatian 8,889
  File Size Sentence Pairs English Words
RAW v5.0 8.33GB 273,330,006 1,738,164,401
BiCleaner v5.0 477MB 1,861,590 43,464,197
Czech 14,335
  File Size Sentence Pairs English Words
RAW v5.0 20GB 665,535,115 4,025,512,842
BiCleaner v5.0 1.2GB 5,280,149 117,385,158
Danish 19,776
  File Size Sentence Pairs English Words
RAW v5.0 17.8GB 447,743,455 3,347,135,236
BiCleaner v5.0 1066MB 4,606,183 106,565,546
Dutch 17,887
  File Size Sentence Pairs English Words
RAW v5.0 32GB 1,101,087,006 6,792,400,704
BiCleaner v5.0 2.8GB 10,596,717 233,087,345
Estonian 9,522
  File Size Sentence Pairs English Words
RAW v5.0 4.5GB 168,091,382 915,074,587
BiCleaner v5.0 338MB 1,387,869 30,858,140
Finnish 11,028
  File Size Sentence Pairs English Words
RAW v5.0 13.5GB 460,181,215 2,731,068,033
BiCleaner v5.0 704MB 3,097,223 66,385,933
French 48,498
  File Size Sentence Pairs English Words
RAW v5.0 128GB 4,273,819,421 24,983,683,983
BiCleaner v5.0 13.9GB 51,316,168 1,178,317,233
German 67,977
  File Size Sentence Pairs English Words
RAW v5.0 142.8GB 5,038,103,659 27,994,213,177
BiCleaner v5.0 11.07GB 36,936,714 929,818,868
Greek 11,343
  File Size Sentence Pairs English Words
RAW v5.0 16.2GB 640,502,801 3,768,712,672
BiCleaner v5.0 1122MB 3,830,643 88,669,279
Hungarian 9,522
  File Size Sentence Pairs English Words
RAW v5.0 14.47GB 461,181,772 3,208,285,083
BiCleaner v5.0 1069MB 4,187,051 104,292,635
Irish 1,283
  File Size Sentence Pairs English Words
RAW v5.0 4.66GB 64,628,733 667,211,260
BiCleaner v5.0 190MB 782,769 21,909,039
Italian 31,518
  File Size Sentence Pairs English Words
RAW v5.0 66GB 2,251,771,798 13,150,606,108
BiCleaner v5.0 6.02GB 22,100,078 533,512,632
Latvian 3,557
  File Size Sentence Pairs English Words
RAW v5.0 5GB 176,113,669 1,069,218,155
BiCleaner v5.0 286MB 1,019,003 23,656,140
Lithuanian 4,678
  File Size Sentence Pairs English Words
RAW v5.0 4.92GB 198,101,611 963,384,230
BiCleaner v5.0 375MB 1,270,933 27,214,054
Maltese 672
  File Size Sentence Pairs English Words
RAW v5.0 173MB 3,693,930 38,492,028
BiCleaner v5.0 38MB 177,244 4,252,814
Polish 13,357
  File Size Sentence Pairs English Words
RAW v5.0 22.6GB 723,052,912 4,123,972,411
BiCleaner v5.0 1.6GB 6,382,371 145,802,939
Portuguese 18,887
  File Size Sentence Pairs English Words
RAW v5.0 34.8GB 1,068,161,866 6,537,298,891
BiCleaner v5.0 3.3GB 13,860,663 299,634,135
Romanian 9,335
  File Size Sentence Pairs English Words
RAW v5.0 15.2GB 510,209,923 3,034,045,929
BiCleaner v5.0 728MB 2,870,687 62,189,306
Slovak 7,980
  File Size Sentence Pairs English Words
RAW v5.0 6.05GB 269,067,288 1,416,750,646
BiCleaner v5.0 568MB 2,365,339 45,636,383
Slovenian 5,016
  File Size Sentence Pairs English Words
RAW v5.0 4.07GB 175,682,959 1,003,867,134
BiCleaner v5.0 406MB 1,406,645 31,855,427
Spanish 36,211
  File Size Sentence Pairs English Words
RAW v5.0 80.4GB 2,674,900,280 16,598,620,402
BiCleaner v5.0 9.6GB 38,971,348 897,891,704
Swedish 13,616
  File Size Sentence Pairs English Words
RAW v5.0 16.54GB 620,338,561 3,496,650,816
BiCleaner v5.0 1542MB 6,079,175 138,264,978
Bonus Release
Dutch-French 7,700
  File Size Sentence Pairs Dutch Words French Words
RAW 1.8GB 38,164,560 770,141,393 817,973,481
BiCleaner 752MB 2,687,331 60,504,313 64,650,034
Polish-German 5,549
  File Size Sentence Pairs Polish Words German Words
RAW 479MB 11,060,105 202,765,359 198,442,547
BiCleaner 216MB 916,522 18,883,576 20,271,637
Extra Languages in release v1.0
Russian 14,035
  File Size Sentence Pairs English Words
RAW v1.0 38GB 1,078,819,759 -
Filtered v1.0 637MB 12,061,155 157,061,045
  • Releases 4 and earlier included unaligned sentences, with one side empty, in the raw file. Release 5 removes these sentences from the raw file, which is why the raw file sizes dropped.
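
To apply the same clean-up to a raw file from release 4 or earlier, a filter along the following lines could be used. This is a sketch under stated assumptions, not the project's own procedure: it assumes a tab-separated file in which the two sentences sit in known columns, and the function name `drop_unaligned` and the default column indices are hypothetical.

```python
def drop_unaligned(lines, src_col=0, tgt_col=1):
    """Yield only the lines whose source and target fields are both non-empty.

    `lines` is any iterable of tab-separated strings. The column indices are
    assumptions; adjust them to the layout of the file being processed.
    """
    needed = max(src_col, tgt_col) + 1
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < needed:
            continue  # malformed line: too few columns
        if fields[src_col].strip() and fields[tgt_col].strip():
            yield line


if __name__ == "__main__":
    sample = [
        "Hello world\tBonjour le monde\n",
        "Left side only\t\n",
        "\tRight side only\n",
    ]
    for kept in drop_unaligned(sample):
        print(kept, end="")
```

Because the function is a generator, it can be fed an open file object directly and will process the file line by line.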