ParaCrawl Corpus release v8

The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility. Release 8 is the first release for ParaCrawl Action 3: "Continued Web-Scale Provision of Parallel Corpora for European Languages".

ParaCrawl 8 adds a large amount of data to previous releases, along with additional cleaning routines such as the removal of machine-translated content, which is detected through the use of MT plugins on websites. The corpus is the result of a full reprocessing of all content from previously crawled sources, in addition to new sources from the Internet Archive or new crawls.

This release relies on an updated and enhanced version of Bitextor, including minor fixes to Bifixer (fixing), Bicleaner (filtering) and Biroamer (anonymization). For the first time, Bitextor provides deferred crawled corpora as part of this release.

As a bonus, a corpus made of all the monolingual English data in V8 (96 billion sentences!) has been produced, along with a new version of the English-Russian corpus. New synthesized data for four domains (Financial, Law, IT and Medical) is also available as part of this release.

New version 8.1 for Spanish-Galician and Spanish-Catalan: we discovered that a processing error had left a lot of Spanish content in the Catalan and Galician sentences. We have produced new filtered versions of these two pairs to fix the issue.

| Language | | Sentences | Source Words | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bulgarian | | 11,927,063 | 203,375,210 | 11,927,063 | 203,375,210 | 854,733,138 | 18,604,567,822 | 3,009,916 | 50,070,366 | 11,927,063 | 203,375,210 |
| Czech | | 50,152,749 | 686,422,416 | 50,152,749 | 686,422,416 | 3,305,572,201 | 68,597,385,464 | 41,100,059 | 541,110,549 | 50,152,749 | 686,422,416 |
| Danish | | 41,939,140 | 615,918,838 | 41,939,140 | 615,918,838 | 3,347,601,977 | 70,654,954,456 | 34,772,889 | 485,480,287 | 41,939,140 | 615,918,838 |
| German | | 261,109,308 | 3,814,804,964 | 261,109,308 | 3,814,804,964 | 26,655,605,254 | 573,283,847,994 | 213,698,600 | 3,016,392,423 | 261,109,308 | 3,814,804,964 |
| Greek | | 34,586,041 | 499,466,027 | 34,586,041 | 499,466,027 | 3,441,314,226 | 73,822,945,203 | 28,665,368 | 397,368,723 | 34,586,041 | 499,466,027 |
| Spanish | | 396,501,181 | 5,603,516,317 | 396,501,181 | 5,603,516,317 | 41,499,613,769 | 867,946,490,916 | 327,845,105 | 4,414,531,025 | 396,501,181 | 5,603,516,317 |
| Estonian | | 8,585,187 | 144,690,364 | 8,585,187 | 144,690,364 | 642,176,845 | 12,878,933,565 | 7,375,531 | 121,286,878 | 8,585,187 | 144,690,364 |
| Finnish | | 15,301,979 | 250,528,619 | 15,301,979 | 250,528,619 | 1,570,205,909 | 35,656,029,925 | 12,762,014 | 200,937,547 | 15,301,979 | 250,528,619 |
| French | | 266,848,268 | 4,441,921,942 | 266,848,268 | 4,441,921,942 | 24,235,367,738 | 532,388,485,568 | 219,494,980 | 3,519,726,369 | 266,848,268 | 4,441,921,942 |
| Irish | | 1,995,659 | 39,491,428 | 1,995,659 | 39,491,428 | 1,072,936,263 | 24,357,497,495 | 1,702,825 | 33,425,533 | 1,995,659 | 39,491,428 |
| Croatian | | 11,063,212 | 164,345,030 | 11,063,212 | 164,345,030 | 1,715,050,486 | 36,278,818,443 | 9,062,845 | 130,636,396 | 11,063,212 | 164,345,030 |
| Hungarian | | 12,681,746 | 196,278,321 | 12,681,746 | 196,278,321 | 1,307,661,091 | 27,915,080,653 | 10,529,565 | 160,217,586 | 12,681,746 | 196,278,321 |
| Icelandic | | 5,724,258 | 79,645,858 | 5,724,258 | 79,645,858 | 209,761,975 | 4,400,395,928 | 4,557,278 | 60,363,762 | 5,724,258 | 79,645,858 |
| Italian | | 120,119,878 | 1,970,999,568 | 120,119,878 | 1,970,999,568 | 11,726,003,767 | 260,835,888,372 | 97,967,919 | 1,529,943,957 | 120,119,878 | 1,970,999,568 |
| Lithuanian | | 8,043,262 | 130,375,034 | 8,043,262 | 130,375,034 | 550,598,514 | 11,458,376,058 | 6,807,010 | 107,919,671 | 8,043,262 | 130,375,034 |
| Latvian | | 8,177,660 | 138,752,970 | 8,177,660 | 138,752,970 | 490,189,945 | 10,062,934,467 | 6,955,523 | 115,765,677 | 8,177,660 | 138,752,970 |
| Maltese | | 1,604,135 | 30,567,571 | 1,604,135 | 30,567,571 | 88,711,521 | 1,824,228,962 | 1,376,335 | 25,861,339 | 1,604,135 | 30,567,571 |
| Dutch | | 98,474,880 | 1,384,748,238 | 98,474,880 | 1,384,748,238 | 6,703,800,208 | 140,665,734,012 | 80,099,994 | 1,083,914,273 | 98,474,880 | 1,384,748,238 |
| Norwegian | | 59,090,389 | 785,275,124 | 59,090,389 | 785,275,124 | 4,990,065,958 | 101,779,188,277 | 47,373,466 | 607,754,127 | 59,090,389 | 785,275,124 |
| Polish | | 45,359,213 | 666,844,113 | 45,359,213 | 666,844,113 | 5,767,226,735 | 121,958,563,962 | 37,802,842 | 533,691,061 | 45,359,213 | 666,844,113 |
| Portuguese | | 102,631,451 | 1,562,012,122 | 102,631,451 | 1,562,012,122 | 10,122,731,111 | 225,822,517,114 | 85,102,715 | 1,237,102,517 | 102,631,451 | 1,562,012,122 |
| Romanian | | 13,376,424 | 220,420,304 | 13,376,424 | 220,420,304 | 1,570,816,827 | 34,846,884,817 | 11,270,962 | 179,121,809 | 13,376,424 | 220,420,304 |
| Slovak | | 13,010,434 | 202,080,757 | 13,010,434 | 202,080,757 | 1,002,007,836 | 20,601,640,521 | 11,208,541 | 170,049,628 | 13,010,434 | 202,080,757 |
| Slovenian | | 7,536,844 | 136,309,008 | 7,536,844 | 136,309,008 | 615,555,357 | 13,079,867,865 | 6,442,830 | 113,165,698 | 7,536,844 | 136,309,008 |
| Swedish | | 44,066,693 | 657,463,968 | 44,066,693 | 657,463,968 | 4,476,351,734 | 91,348,264,048 | 36,993,856 | 531,048,243 | 44,066,693 | 657,463,968 |
| Spanish-Catalan | New | 39,688,735 | 732,461,360 | 39,688,735 | 732,461,360 | 2,953,756,667 | 90,271,047,717 | 39,312,412 | 725,174,709 | 39,688,735 | 732,461,360 |
| Spanish-Basque | | 2,864,354 | 44,790,206 | 2,864,354 | 44,790,206 | 509,767,791 | 15,786,006,832 | 2,815,365 | 44,003,473 | 2,864,354 | 44,790,206 |
| Spanish-Galician | New | 5,261,521 | 78,298,840 | 5,261,521 | 78,298,840 | 1,912,182,856 | 65,242,661,658 | 5,232,239 | 77,721,927 | 5,261,521 | 78,298,840 |

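
The sentence and word counts listed above can be combined into a rough average sentence length per language. The snippet below is a minimal sketch: it copies a few Sentences and Source Words figures from the listing, and the names `COUNTS` and `avg_words_per_sentence` are illustrative assumptions, not part of any ParaCrawl tooling.

```python
# Figures copied from the Sentences / Source Words values listed above.
# COUNTS and avg_words_per_sentence are illustrative names (assumptions),
# not part of Bitextor or any other ParaCrawl tool.
COUNTS = {
    "Bulgarian": (11_927_063, 203_375_210),
    "German": (261_109_308, 3_814_804_964),
    "Irish": (1_995_659, 39_491_428),
    "Maltese": (1_604_135, 30_567_571),
}


def avg_words_per_sentence(sentences: int, source_words: int) -> float:
    """Return the mean number of source words per sentence."""
    if sentences <= 0:
        raise ValueError("sentence count must be positive")
    return source_words / sentences


if __name__ == "__main__":
    # Print one line per language with the average rounded to two decimals.
    for language, (sentences, words) in COUNTS.items():
        print(f"{language:<10} {avg_words_per_sentence(sentences, words):6.2f}")
```

For the Bulgarian figures this comes to roughly 17 source words per sentence.
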
Bonus Release (Low resource languages) - Last Updates on Sep 2021

| Language | Sentences | Source Words | | | | |
|---|---|---|---|---|---|---|
| Khmer | 65,113 | 1,511,950 | 21,560,446 | 21,565,078 | 65,113 | 1,511,950 |
| Burmese | 31,374 | 661,577 | 40,590,354 | 40,595,755 | 31,374 | 661,577 |
| Nepali | 92,084 | 2,941,031 | 36,454,553 | 36,466,101 | 92,084 | 2,941,031 |
| Pashto | 26,321 | 692,651 | 2,587,950 | 2,593,163 | 26,321 | 692,651 |
| Singhalese | 217,407 | 5,791,982 | 38,720,907 | 38,724,422 | 217,407 | 5,791,982 |
| Somali | 14,879 | 506,201 | 28,387,922 | 28,396,227 | 14,879 | 506,201 |
| Swahili | 132,517 | 3,696,543 | 84,605,506 | 84,605,506 | 132,517 | 3,696,543 |
| Tagalog | 248,684 | 6,327,801 | 108,260,601 | 108,260,601 | 248,684 | 6,327,801 |

Bonus Release - Last Updates on Sep 2021

| Language | | Sentences | Source Words | | | | |
|---|---|---|---|---|---|---|---|
| Polish-Czech | New | 24,001,403 | 288,826,678 | 24,001,403 | 288,826,678 | 6,055,618,075 | 28,559,061,699 |
| Ukrainian | New | 13,354,365 | 505,831,880 | 13,354,365 | 505,831,880 | 235,700,383 | 5,832,658,894 |
| Chinese | New | 14,170,585 | 217,604,664 | 14,170,585 | 217,604,664 | 1,207,487,761 | 8,953,713,029 |
| Russian | | 5,377,911 | 101,312,142 | 5,377,911 | 101,312,142 | 491,941,804 | 492,260,972 |
| English-Korean | | 4,002,441 | 61,963,744 | 4,002,441 | 61,963,744 | | |
| Dutch-French | | 2,687,331 | 60,504,313 | 2,687,331 | 60,504,313 | 38,164,560 | 770,141,393 |
| Polish-German | | 916,522 | 18,883,576 | 916,522 | 18,883,576 | 11,060,105 | 202,765,359 |