ParaCrawl Corpus release v6

The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Release 6 adds a new language pair, English-Icelandic, and considerably more data for many other languages. Restorative cleaning with Bifixer yields more data by improving sentence splitting; better data by fixing wrong encoding, HTML issues, alphabet issues and typos; and unique data by identifying not only duplicates but also near-duplicates. Improved Bicleaner models have also been applied to filter out noisy parallel sentences for this release.
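
To make the idea of near-duplicate removal concrete, here is a minimal Python sketch. It is not Bifixer's actual algorithm: the normalization rules and the names `near_duplicate_key` and `drop_near_duplicates` are assumptions chosen for illustration. Sentence pairs that differ only in case, punctuation, digits or spacing map to the same hash, and only the first such pair is kept.

```python
import hashlib
import unicodedata


def near_duplicate_key(source: str, target: str) -> str:
    """Hash a sentence pair after aggressive normalization (illustrative only)."""

    def normalize(text: str) -> str:
        # Unify Unicode forms and ignore case.
        text = unicodedata.normalize("NFKC", text).casefold()
        # Keep only letters and whitespace; punctuation and digits are dropped.
        text = "".join(ch for ch in text if ch.isalpha() or ch.isspace())
        # Collapse runs of whitespace into single spaces.
        return " ".join(text.split())

    joined = normalize(source) + "\t" + normalize(target)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def drop_near_duplicates(pairs):
    """Keep the first pair seen for each normalized key, preserving order."""
    seen = set()
    kept = []
    for source, target in pairs:
        key = near_duplicate_key(source, target)
        if key not in seen:
            seen.add(key)
            kept.append((source, target))
    return kept
```

Hashing the normalized text rather than storing it keeps memory use roughly constant per pair, which matters at the corpus sizes listed on this page.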

Update July 6, 2020
Norwegian (Bokmål and Nynorsk) added to the release.
A newer version is available: see Latest Releases.
Language                    Sentences   Source Words   Sentences   Source Words   Sentences       Source Words
Bulgarian                   4,111,172   75,792,750     4,111,172   75,792,750     792,650,137     4,171,340,562
Czech                       17,982,022  240,177,725    17,982,022  240,177,725    2,637,345,977   12,270,206,847
Danish                      6,370,432   118,125,307    6,370,432   118,125,307    1,540,349,007   8,493,249,524
German                      58,815,994  1,088,844,020  58,815,994  1,088,844,020  14,178,789,728  65,840,909,321
Greek                       5,298,946   94,729,944     5,298,946   94,729,944     2,139,037,096   10,329,906,299
Spanish                     56,652,588  1,149,120,624  56,652,588  1,149,120,624  7,951,766,385   39,608,444,406
Estonian                    1,785,161   34,224,039     1,785,161   34,224,039     502,447,307     2,320,269,980
Finnish                     4,286,642   74,267,239     4,286,642   74,267,239     1,682,897,782   7,874,984,823
French                      73,441,631  1,556,749,938  73,441,631  1,556,749,938  11,906,899,916  57,186,001,371
Irish                       1,366,628   30,935,185     1,366,628   30,935,185     169,291,329     1,437,676,061
Croatian                    3,626,644   68,316,261     3,626,644   68,316,261     787,104,627     4,122,379,934
Hungarian                   4,963,481   94,916,290     4,963,481   94,916,290     1,568,167,268   7,683,974,817
Icelandic (New)             739,634     12,311,516     739,634     12,311,516     58,636,060      296,521,104
Italian                     29,944,287  608,597,210    29,944,287  608,597,210    6,675,259,847   31,565,506,353
Lithuanian                  2,747,344   49,839,021     2,747,344   49,839,021     588,418,497     2,523,516,512
Latvian                     2,195,650   41,705,066     2,195,650   41,705,066     502,217,938     2,606,048,793
Maltese                     510,083     8,514,287      510,083     8,514,287      35,103,487      247,691,151
Norwegian (Bokmål) (New)    6,671,498   100,849,382    6,671,498   100,849,382    819,363,893     819,463,687
Dutch                       17,224,292  306,084,573    17,224,292  306,084,573    3,770,164,572   17,692,040,366
Norwegian (Nynorsk) (New)   89,488      1,127,585      89,488      1,127,585      29,173,496      29,175,012
Polish                      8,540,028   151,928,476    8,540,028   151,928,476    2,588,804,772   11,610,625,139
Portuguese                  20,677,300  389,850,230    20,677,300  389,850,230    3,403,007,862   16,650,042,721
Romanian                    4,220,231   76,757,913     4,220,231   76,757,913     1,512,908,718   7,562,845,197
Slovak                      3,303,841   55,865,617     3,303,841   55,865,617     950,013,068     4,133,766,029
Slovenian                   1,923,589   34,867,797     1,923,589   34,867,797     533,967,231     2,554,017,243
Swedish                     7,525,057   138,124,887    7,525,057   138,124,887    1,710,627,707   8,031,360,940
Bonus Release (Low resource languages) - Last Updates on Sep 2021
Language      Sentences   Source Words   Sentences     Source Words   Sentences   Source Words
Khmer         65,113      1,511,950      21,560,446    21,565,078     65,113      1,511,950
Burmese       31,374      661,577        40,590,354    40,595,755     31,374      661,577
Nepali        92,084      2,941,031      36,454,553    36,466,101     92,084      2,941,031
Pashto        26,321      692,651        2,587,950     2,593,163      26,321      692,651
Singhalese    217,407     5,791,982      38,720,907    38,724,422     217,407     5,791,982
Somali        14,879      506,201        28,387,922    28,396,227     14,879      506,201
Swahili       132,517     3,696,543      84,605,506    84,605,506     132,517     3,696,543
Tagalog       248,684     6,327,801      108,260,601   108,260,601    248,684     6,327,801
Bonus Release - Last Updates on Sep 2021
Language            Sentences   Source Words   Sentences   Source Words   Sentences       Source Words
Polish-Czech (New)  24,001,403  288,826,678    24,001,403  288,826,678    6,055,618,075   28,559,061,699
Ukrainian (New)     13,354,365  505,831,880    13,354,365  505,831,880    235,700,383     5,832,658,894
Chinese (New)       14,170,585  217,604,664    14,170,585  217,604,664    1,207,487,761   8,953,713,029
Russian             5,377,911   101,312,142    5,377,911   101,312,142    491,941,804     492,260,972
English-Korean      4,002,441   61,963,744     4,002,441   61,963,744
Dutch-French        2,687,331   60,504,313     2,687,331   60,504,313     38,164,560      770,141,393
Polish-German       916,522     18,883,576     916,522     18,883,576     11,060,105      202,765,359
Bicleaner files are provided in two formats: TMX and plain TXT. To convert a TMX file to a tab-separated text file, download the TMXT tool.
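
For readers who want to see what a TMX to tab-separated conversion involves, the following Python sketch may help. It is not the TMXT tool and makes several assumptions: the function names `tmx_to_rows` and `tmx_to_tsv` are hypothetical, the file is assumed to follow the standard TMX layout (`tu` units containing `tuv` variants with an `xml:lang` attribute and a `seg` child), and the default language codes `en` and `is` are placeholders that may not match the codes used in the released files.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_rows(tmx_source, src_lang="en", tgt_lang="is"):
    """Yield (source, target) text pairs from a TMX file path or file object."""
    # iterparse streams the document, so large files are not loaded at once.
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # Collapse whitespace so tabs or newlines cannot break the TSV.
                segments[lang] = " ".join("".join(seg.itertext()).split())
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # release the children of the processed translation unit


def tmx_to_tsv(tmx_path, tsv_path, src_lang="en", tgt_lang="is"):
    """Write one 'source<TAB>target' line per translation unit."""
    with open(tsv_path, "w", encoding="utf-8") as out:
        for src, tgt in tmx_to_rows(tmx_path, src_lang, tgt_lang):
            out.write(f"{src}\t{tgt}\n")
```

A call such as `tmx_to_tsv("en-is.tmx", "en-is.tsv")` would write one sentence pair per line; both file names here are placeholders. Any additional per-unit metadata in the TMX file is ignored by this sketch.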