ParaCrawl Corpus release v6

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release 6 adds a new language pair, English-Icelandic, and considerably more data for many other languages. Restorative cleaning with Bifixer yields more data by improving sentence splitting, better data by fixing wrong encoding, HTML issues, alphabet issues and typos, and unique data by identifying near duplicates as well as exact duplicates. Improved Bicleaner models have also been applied to filter out noisy parallel sentences for this release.
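
To illustrate what identifying near duplicates can mean in practice, here is a minimal Python sketch. It is not Bifixer's implementation: the normalization rules (lowercasing, stripping accents, dropping punctuation and digits) and the function names `normalize`, `pair_key` and `deduplicate` are assumptions chosen for illustration only.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and drop punctuation, digits and extra spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    kept = [
        ch for ch in decomposed
        if not unicodedata.combining(ch) and (ch.isalpha() or ch.isspace())
    ]
    return " ".join("".join(kept).split())


def pair_key(source: str, target: str) -> str:
    """Hash of the normalized pair; pairs that differ only superficially share a key."""
    joined = normalize(source) + "\t" + normalize(target)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def deduplicate(pairs):
    """Keep the first sentence pair seen for each normalized key (hypothetical policy)."""
    seen = set()
    unique = []
    for source, target in pairs:
        key = pair_key(source, target)
        if key not in seen:
            seen.add(key)
            unique.append((source, target))
    return unique
```

With this sketch, two pairs that differ only in casing, accents or punctuation collapse to one entry, while an exact-match comparison would keep both.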

Update July 6, 2020
Norwegian (Bokmål and Nynorsk) added to the release.
Bicleaner files are provided in two formats: TMX and plain TXT. To convert a TMX file to a tab-separated text file, download the TMXT tool.
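
For readers who want to see what a TMX to tab-separated conversion involves, the following Python sketch may help. It is not the TMXT tool. It assumes the standard TMX layout (`<tu>` units containing `<tuv>` variants with an `xml:lang` attribute and a `<seg>` child); the function names and the default language codes `en` and `is` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Name under which ElementTree exposes the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_rows(tmx_file, src_lang="en", tgt_lang="is"):
    """Yield (source, target) segment pairs, one per <tu> that has both languages."""
    for _event, elem in ET.iterparse(tmx_file, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # Older TMX versions use a plain "lang" attribute.
                lang = tuv.get(XML_LANG) or tuv.get("lang")
                text = "".join(seg.itertext())
                # Collapse tabs and newlines so each pair stays on one TSV line.
                segments[lang] = " ".join(text.split())
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # free memory; the files can be very large


def write_tsv(tmx_file, out, src_lang="en", tgt_lang="is"):
    """Write one 'source<TAB>target' line per translation unit; return the count."""
    count = 0
    for source, target in tmx_to_rows(tmx_file, src_lang, tgt_lang):
        out.write(f"{source}\t{target}\n")
        count += 1
    return count
```

The sketch streams the file with `iterparse` rather than loading it whole, since the files can be very large. `write_tsv` accepts a file name or a binary file object as input and any text stream as output.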
Language | Crawled Websites | | | Sentences | Source Words
Norwegian (Nynorsk) (New) | | 29,173,496 | 29,175,012 | 89,488 | 1,127,585
Norwegian (Bokmål) (New) | | 819,363,893 | 819,463,687 | 6,671,498 | 100,849,382
Icelandic (New) | | 58,636,060 | 296,521,104 | 739,634 | 12,311,516
Bulgarian | 4,762 | 792,650,137 | 4,171,340,562 | 4,111,172 | 75,792,750
Croatian | 8,889 | 787,104,627 | 4,122,379,934 | 3,626,644 | 68,316,261
Czech | 14,335 | 2,637,345,977 | 12,270,206,847 | 17,982,022 | 240,177,725
Danish | 19,776 | 1,540,349,007 | 8,493,249,524 | 6,370,432 | 118,125,307
Dutch | 17,887 | 3,770,164,572 | 17,692,040,366 | 17,224,292 | 306,084,573
Estonian | 9,522 | 502,447,307 | 2,320,269,980 | 1,785,161 | 34,224,039
Finnish | 11,028 | 1,682,897,782 | 7,874,984,823 | 4,286,642 | 74,267,239
French | 48,498 | 11,906,899,916 | 57,186,001,371 | 73,441,631 | 1,556,749,938
German | 67,977 | 14,178,789,728 | 65,840,909,321 | 58,815,994 | 1,088,844,020
Greek | 11,343 | 2,139,037,096 | 10,329,906,299 | 5,298,946 | 94,729,944
Hungarian | 9,522 | 1,568,167,268 | 7,683,974,817 | 4,963,481 | 94,916,290
Irish | 1,283 | 169,291,329 | 1,437,676,061 | 1,366,628 | 30,935,185
Italian | 31,518 | 6,675,259,847 | 31,565,506,353 | 29,944,287 | 608,597,210
Latvian | 4,678 | 502,217,938 | 2,606,048,793 | 2,195,650 | 41,705,066
Lithuanian | 3,557 | 588,418,497 | 2,523,516,512 | 2,747,344 | 49,839,021
Maltese | 672 | 35,103,487 | 247,691,151 | 510,083 | 8,514,287
Polish | 13,357 | 2,588,804,772 | 11,610,625,139 | 8,540,028 | 151,928,476
Portuguese | 18,887 | 3,403,007,862 | 16,650,042,721 | 20,677,300 | 389,850,230
Romanian | 9,335 | 1,512,908,718 | 7,562,845,197 | 4,220,231 | 76,757,913
Slovak | 7,980 | 950,013,068 | 4,133,766,029 | 3,303,841 | 55,865,617
Slovenian | 5,016 | 533,967,231 | 2,554,017,243 | 1,923,589 | 34,867,797
Spanish | 36,211 | 7,951,766,385 | 39,608,444,406 | 56,652,588 | 1,149,120,624
Swedish | 13,616 | 1,710,627,707 | 8,031,360,940 | 7,525,057 | 138,124,887
Release 3 of the corpus was used for WMT 2019. For WMT 2018, the FILTERED v1.0 version of Release 1 was used.
Bonus Release (Low resource languages) - Last Updated on Jul 2020
Language | Crawled Websites | | | Sentences | Source Words
Burmese | | 40,590,354 | 40,595,755 | 31,374 | 661,577
Khmer | | 21,560,446 | 21,565,078 | 65,113 | 1,511,950
Nepali | | 36,454,553 | 36,466,101 | 92,084 | 2,941,031
Pashto | | 2,587,950 | 2,593,163 | 26,321 | 692,651
Russian | | 491,941,804 | 492,260,972 | 5,377,911 | 101,312,142
Singhalese | | 38,720,907 | 38,724,422 | 217,407 | 5,791,982
Somali | | 28,387,922 | 28,396,227 | 14,879 | 506,201
Swahili | | 84,605,506 | 84,605,506 | 132,517 | 3,696,543
Tagalog | | 108,260,601 | 108,260,601 | 248,684 | 6,327,801
Bonus Release - Last Updated on Jun 2019
Language | Crawled Websites | | | Sentences | Source Words
Dutch-French | 7,700 | 38,164,560 | 770,141,393 | 2,687,331 | 60,504,313
Polish-German | 5,549 | 11,060,105 | 202,765,359 | 916,522 | 18,883,576
Extra Languages in release v1 - Last Updated on Jan 2018
Russian
14,035
12,061,155
157,061,045
1,078,819,759