ParaCrawl Corpus release v5.0

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release v5 is the first release for the Action "Broader Web-Scale Provision of Parallel Corpora for European Languages". Newly crawled data has been added, including data from the Internet Archive. Improvements to the document and sentence aligners, together with an updated BiCleaner strategy, resulted in corpora twice the size compared to release v4 for all official EU languages (23 languages paired with English).

A newer version is available
See Latest Releases
Two formats of BiCleaner files are provided. You can download either the TMX format or the plain TXT format. To convert a TMX file to a tab-separated text file, download the TMXT tool.
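For readers who only need a quick look at the data, the TMX-to-tab-separated conversion can be sketched in a few lines of Python. This is not the TMXT tool and makes no claim about how that tool works; it is a minimal illustration that assumes each translation unit (`tu`) holds one `tuv` per language, with the language code in the `xml:lang` attribute. The function name `tmx_to_tsv` and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hand-written sample (not taken from the corpus).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def tmx_to_tsv(tmx_source, src_lang, tgt_lang):
    """Yield 'source<TAB>target' lines from a TMX path or binary file object."""
    # iterparse streams the file, so large TMX files are not parsed in one go.
    for _, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                text = "".join(seg.itertext())
                # Collapse tabs and newlines so each pair stays on one line.
                segments[lang.lower()] = " ".join(text.split())
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang] + "\t" + segments[tgt_lang]
        elem.clear()  # release the children of the processed unit


if __name__ == "__main__":
    for line in tmx_to_tsv(io.BytesIO(SAMPLE), "en", "fr"):
        print(line)
```

Passing a file path instead of the in-memory sample works the same way, since `ET.iterparse` accepts either. Units that lack one of the two requested languages are skipped.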
Language      Crawled Websites  Sentences   Source Words
Bulgarian     4,762             2,586,277   55,725,444     248,555,951    1,564,051,100   2,586,277   55,725,444
Croatian      8,889             1,861,590   43,464,197     273,330,006    1,738,164,401   1,861,590   43,464,197
Czech         14,335            5,280,149   117,385,158    665,535,115    4,025,512,842   5,280,149   117,385,158
Danish        19,776            4,606,183   106,565,546    447,743,455    3,347,135,236   4,606,183   106,565,546
Dutch         17,887            10,596,717  233,087,345    1,101,087,006  6,792,400,704   10,596,717  233,087,345
Estonian      9,522             1,387,869   30,858,140     168,091,382    915,074,587     1,387,869   30,858,140
Finnish       11,028            3,097,223   66,385,933     460,181,215    2,731,068,033   3,097,223   66,385,933
French        48,498            51,316,168  1,178,317,233  4,273,819,421  24,983,683,983  51,316,168  1,178,317,233
German        67,977            36,936,714  929,818,868    5,038,103,659  27,994,213,177  36,936,714  929,818,868
Greek         11,343            3,830,643   88,669,279     640,502,801    3,768,712,672   3,830,643   88,669,279
Hungarian     9,522             4,187,051   104,292,635    461,181,772    3,208,285,083   4,187,051   104,292,635
Irish         1,283             782,769     21,909,039     64,628,733     667,211,260     782,769     21,909,039
Italian       31,518            22,100,078  533,512,632    2,251,771,798  13,150,606,108  22,100,078  533,512,632
Latvian       3,557             1,019,003   23,656,140     176,113,669    1,069,218,155   1,019,003   23,656,140
Lithuanian    4,678             1,270,933   27,214,054     198,101,611    963,384,230     1,270,933   27,214,054
Maltese       672               177,244     4,252,814      3,693,930      38,492,028      177,244     4,252,814
Polish        13,357            6,382,371   145,802,939    723,052,912    4,123,972,411   6,382,371   145,802,939
Portuguese    18,887            13,860,663  299,634,135    1,068,161,866  6,537,298,891   13,860,663  299,634,135
Romanian      9,335             2,870,687   62,189,306     510,209,923    3,034,045,929   2,870,687   62,189,306
Slovak        7,980             2,365,339   45,636,383     269,067,288    1,416,750,646   2,365,339   45,636,383
Slovenian     5,016             1,406,645   31,855,427     175,682,959    1,003,867,134   1,406,645   31,855,427
Spanish       36,211            38,971,348  897,891,704    2,674,900,280  16,598,620,402  38,971,348  897,891,704
Swedish       13,616            6,079,175   138,264,978    620,338,561    3,496,650,816   6,079,175   138,264,978
Release 3 of the corpus was used in the proceedings of WMT 2019. For WMT 2018, FILTERED v1.0 of Release 1 was used.
Bonus Release - Last Updated on Jun 2019
Language       Crawled Websites  Sentences   Source Words
Dutch-French   7,700             2,687,331   60,504,313     38,164,560     770,141,393     2,687,331   60,504,313
Polish-German  5,549             916,522     18,883,576     11,060,105     202,765,359     916,522     18,883,576
Extra Languages in release v1 - Last Updated on Jan 2018
Russian       14,035            12,061,155  157,061,045    1,078,819,759