ParaCrawl Corpus release v4.0

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release v4 is the final release for the Action "Provision of Web-Scale Parallel Corpora for Official European Languages" and covers all official EU languages (23 languages paired with English).

A newer version is available; see Latest Releases.
Two formats of BiCleaner files are provided: you can download either the TMX format or the plain TXT format. To transform a TMX file into a tab-separated text file, download the TMXT tool.
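As an illustration of what such a conversion involves, here is a minimal Python sketch. It is not the TMXT tool and makes no claim about how that tool works. The function name `tmx_to_tsv` and the default language codes are assumptions chosen for this example. It assumes a standard TMX layout in which each `<tu>` holds one `<tuv>` per language with the text in a `<seg>` child.

```python
import xml.etree.ElementTree as ET

# xml:lang belongs to the reserved XML namespace, so ElementTree exposes it
# under this qualified attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tsv(source, out, langs=("en", "bg")):
    """Write one tab-separated line per translation unit (<tu>).

    `source` is a path or a binary file object, `out` is a text file object.
    `langs` (an assumption of this sketch) sets the column order.
    Returns the number of lines written.
    """
    written = 0
    # iterparse streams the file, so large corpora are not loaded at once.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            # Fall back to a plain "lang" attribute if xml:lang is absent.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                text = "".join(seg.itertext())
                # Collapse tabs and newlines so each unit stays on one line.
                segs[lang.lower()] = " ".join(text.split())
        # Skip units that lack one of the requested languages.
        if all(code in segs for code in langs):
            out.write("\t".join(segs[code] for code in langs) + "\n")
            written += 1
        elem.clear()  # drop the unit's children once it has been written
    return written
```

With a hypothetical file name, a call such as `tmx_to_tsv("en-bg.tmx", open("en-bg.tsv", "w", encoding="utf-8"))` would write one sentence pair per line, English first.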
Language | Crawled Websites | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words
Bulgarian | 4,762 | 1,039,885 | 21,109,546 | 288,395,110 | 1,552,588,179 | 1,039,885 | 21,109,546
Croatian | 8,889 | 1,002,053 | 19,904,218 | 411,950,164 | 1,996,212,922 | 1,002,053 | 19,904,218
Czech | 14,335 | 2,981,949 | 48,918,151 | 1,189,317,247 | 5,621,562,488 | 2,981,949 | 48,918,151
Danish | 19,776 | 2,414,895 | 48,240,290 | 586,535,848 | 3,484,768,564 | 2,414,895 | 48,240,290
Dutch | 17,887 | 5,659,268 | 108,197,376 | 1,760,140,259 | 8,239,317,278 | 5,659,268 | 108,197,376
Estonian | 9,522 | 853,422 | 16,537,397 | 342,677,535 | 1,522,504,098 | 853,422 | 16,537,397
Finnish | 11,028 | 2,156,069 | 41,564,859 | 736,050,617 | 3,494,554,815 | 2,156,069 | 41,564,859
French | 48,498 | 31,374,161 | 664,924,148 | 6,429,921,903 | 28,529,875,306 | 31,374,161 | 664,924,148
German | 67,977 | 16,264,450 | 307,786,150 | 7,387,809,953 | 32,358,035,774 | 16,264,450 | 307,786,150
Greek | 11,343 | 1,985,233 | 38,322,532 | 740,094,469 | 3,340,324,438 | 1,985,233 | 38,322,532
Hungarian | 9,522 | 1,901,342 | 30,835,267 | 622,224,794 | 2,590,060,050 | 1,901,342 | 30,835,267
Irish | 1,283 | 357,399 | 8,241,515 | 156,189,807 | 1,194,451,883 | 357,399 | 8,241,515
Italian | 31,518 | 12,162,239 | 260,361,435 | 3,333,886,336 | 14,519,224,940 | 12,162,239 | 260,361,435
Latvian | 3,557 | 553,060 | 10,996,032 | 262,685,954 | 1,371,257,575 | 553,060 | 10,996,032
Lithuanian | 4,678 | 844,643 | 15,087,805 | 294,568,032 | 1,198,118,449 | 844,643 | 15,087,805
Maltese | 672 | 195,510 | 4,100,912 | 17,602,902 | 164,119,571 | 195,510 | 4,100,912
Polish | 13,357 | 3,503,276 | 65,618,419 | 1,259,312,618 | 5,555,536,170 | 3,503,276 | 65,618,419
Portuguese | 18,887 | 8,141,940 | 156,125,200 | 1,763,439,122 | 8,465,738,356 | 8,141,940 | 156,125,200
Romanian | 9,335 | 1,952,043 | 39,882,223 | 793,759,210 | 4,059,255,214 | 1,952,043 | 39,882,223
Slovak | 7,980 | 1,591,831 | 26,711,854 | 334,903,774 | 1,418,785,612 | 1,591,831 | 26,711,854
Slovenian | 5,016 | 660,161 | 14,489,659 | 208,466,320 | 967,461,921 | 660,161 | 14,489,659
Spanish | 36,211 | 21,987,267 | 476,409,854 | 3,959,845,706 | 18,128,847,778 | 21,987,267 | 476,409,854
Swedish | 13,616 | 3,476,729 | 70,088,534 | 739,146,200 | 3,217,514,612 | 3,476,729 | 70,088,534
In the proceedings of WMT 2019, Release 3 of the corpus was used. For WMT 2018, the FILTERED v1.0 version of Release 1 was used.
Bonus Release - Last Updated on Jun 2019
Dutch-French | 7,700 | 2,687,331 | 60,504,313 | 38,164,560 | 770,141,393 | 2,687,331 | 60,504,313
Polish-German | 5,549 | 916,522 | 18,883,576 | 11,060,105 | 202,765,359 | 916,522 | 18,883,576
Extra Languages in release v1 - Last Updated on Jan 2018
Russian | 14,035 | 12,061,155 | 157,061,045 | 1,078,819,759