ParaCrawl Corpus release v5.1

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Version 5.1 builds upon the same raw corpus as version 5. Thanks to improvements in the filtering procedure, the official subset extracted as version 5.1 is now larger for almost all language pairs (all except ga, de, sl and et). Extrinsic evaluation through MT for several language pairs also shows an improvement in quality.

Two formats of BiCleaner files are provided. You can download either the TMX format or the plain TXT format. To convert a TMX file to a tab-separated text file, download the TMXT tool.
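For readers who want to see what a TMX to tab-separated conversion involves, the following is a minimal Python sketch. It is not the TMXT tool and does not reproduce its behaviour. It assumes the standard TMX layout (`tu` units containing `tuv` elements with an `xml:lang` attribute and a `seg` child); the function names `iter_pairs` and `tmx_to_tsv` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def _one_line(text):
    """Collapse tabs, newlines and repeated spaces so the text fits one TSV field."""
    return " ".join(text.split())


def iter_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) for each translation unit that has both languages.

    `source` is a file name or a binary file object holding TMX data.
    """
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    # iterparse streams the file, so large corpora are not parsed in one go.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang.lower()] = _one_line("".join(seg.itertext()))
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # drop the unit's children to limit memory use


def tmx_to_tsv(source, out, src_lang, tgt_lang):
    """Write one 'source<TAB>target' line per unit to `out`; return the number of lines."""
    count = 0
    for src, tgt in iter_pairs(source, src_lang, tgt_lang):
        out.write(f"{src}\t{tgt}\n")
        count += 1
    return count
```

A call such as `tmx_to_tsv("corpus.tmx", open("corpus.tsv", "w", encoding="utf-8"), "en", "bg")` would write one sentence pair per line; the file names here are placeholders. Any extra metadata stored in the TMX units is ignored by this sketch.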

| Language | Crawled Websites | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words |
|---|---|---|---|---|---|---|---|
| Bulgarian | 4,762 | 2,775,256 | 60,246,308 | 248,555,951 | 1,564,051,100 | 2,775,256 | 60,246,308 |
| Croatian | 8,889 | 1,993,180 | 44,945,371 | 273,330,006 | 1,738,164,401 | 1,993,180 | 44,945,371 |
| Czech | 14,335 | 5,345,693 | 105,973,351 | 665,535,115 | 4,025,512,842 | 5,345,693 | 105,973,351 |
| Danish | 19,776 | 4,851,772 | 111,476,139 | 447,743,455 | 3,347,135,236 | 4,851,772 | 111,476,139 |
| Dutch | 17,887 | 11,272,396 | 247,536,605 | 1,101,087,006 | 6,792,400,704 | 11,272,396 | 247,536,605 |
| Estonian | 9,522 | 1,452,963 | 31,597,344 | 168,091,382 | 915,074,587 | 1,452,963 | 31,597,344 |
| Finnish | 11,028 | 3,421,382 | 66,385,933 | 460,181,215 | 2,731,068,033 | 3,421,382 | 66,385,933 |
| French | 48,498 | 63,634,915 | 1,518,457,124 | 4,273,819,421 | 24,983,683,983 | 63,634,915 | 1,518,457,124 |
| German | 67,977 | 34,371,306 | 708,068,143 | 5,038,103,659 | 27,994,213,177 | 34,371,306 | 708,068,143 |
| Greek | 11,343 | 4,038,777 | 93,473,163 | 640,502,801 | 3,768,712,672 | 4,038,777 | 93,473,163 |
| Hungarian | 9,522 | 4,782,328 | 115,330,046 | 461,181,772 | 3,208,285,083 | 4,782,328 | 115,330,046 |
| Irish | 1,283 | 521,768 | 12,089,677 | 64,628,733 | 667,211,260 | 521,768 | 12,089,677 |
| Italian | 31,518 | 24,089,063 | 587,087,473 | 2,251,771,798 | 13,150,606,108 | 24,089,063 | 587,087,473 |
| Latvian | 3,557 | 1,056,252 | 22,810,714 | 176,113,669 | 1,069,218,155 | 1,056,252 | 22,810,714 |
| Lithuanian | 4,678 | 1,368,514 | 27,894,906 | 198,101,611 | 963,384,230 | 1,368,514 | 27,894,906 |
| Maltese | 672 | 186,630 | 4,280,211 | 3,693,930 | 38,492,028 | 186,630 | 4,280,211 |
| Polish | 13,357 | 6,577,804 | 143,702,545 | 723,052,912 | 4,123,972,411 | 6,577,804 | 143,702,545 |
| Portuguese | 18,887 | 15,259,967 | 337,394,318 | 1,068,161,866 | 6,537,298,891 | 15,259,967 | 337,394,318 |
| Romanian | 9,335 | 3,176,488 | 69,998,913 | 510,209,923 | 3,034,045,929 | 3,176,488 | 69,998,913 |
| Slovak | 7,980 | 2,496,533 | 48,160,348 | 269,067,288 | 1,416,750,646 | 2,496,533 | 48,160,348 |
| Slovenian | 5,016 | 1,220,652 | 29,042,458 | 175,682,959 | 1,003,867,134 | 1,220,652 | 29,042,458 |
| Spanish | 36,211 | 44,587,162 | 1,072,236,916 | 2,674,900,280 | 16,598,620,402 | 44,587,162 | 1,072,236,916 |
| Swedish | 13,616 | 6,633,761 | 149,048,559 | 620,338,561 | 3,496,650,816 | 6,633,761 | 149,048,559 |

For WMT 2019, Release 3 of the corpus was used. For WMT 2018, the FILTERED v1.0 version of Release 1 was used.