ParaCrawl Corpus release v7

The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility. Release 7 is the final release of ParaCrawl Action 2: "Broader Web-Scale Provision of Parallel Corpora for European Languages".

ParaCrawl 7 uses a new version of Bicleaner (v0.14, see changes). Before this filtering, a restorative cleaning step was applied with Bifixer. ParaCrawl 7 also comes with a new version of the corpus in which personal data has been filtered with Biroamer, a tool that performs anonymisation or, more precisely, applies a full ROAM (Random, Omit, Anonymize and Mix) process to a parallel corpus. For the purposes of the ParaCrawl project, sentences identified as containing personal data are removed.
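
The removal step described above can be pictured with a minimal sketch. This is not Biroamer's implementation or API: the regular expressions and the names `has_personal_data` and `drop_personal_data` are hypothetical stand-ins for a real personal-data detector, and the sketch assumes that a sentence pair is dropped when either side is flagged.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical patterns standing in for a real personal-data detector.
# They only catch e-mail addresses and phone-like digit runs.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def has_personal_data(text: str) -> bool:
    """Return True if the text matches one of the illustrative patterns."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))


def drop_personal_data(
    pairs: Iterable[Tuple[str, str]],
) -> Iterator[Tuple[str, str]]:
    """Yield only the sentence pairs in which neither side is flagged."""
    for src, tgt in pairs:
        if has_personal_data(src) or has_personal_data(tgt):
            continue  # assumption: the whole pair is removed
        yield src, tgt


if __name__ == "__main__":
    sample = [
        ("Contact us at info@example.com.", "Contacte con info@example.com."),
        ("The museum opens at nine.", "El museo abre a las nueve."),
    ]
    for src, tgt in drop_personal_data(sample):
        print(src, tgt, sep="\t")
```

Dropping the whole pair keeps the two sides of the corpus aligned, at the cost of losing some otherwise usable sentences.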

Three new language pairs are also part of ParaCrawl 7: Spanish-Basque, Spanish-Catalan, Spanish-Galician.

A newer version is available; see Latest Releases.
Language | Sentences | Source Words
Bulgarian | 3,921,418 | 69,937,358 | 268,998,640 | 1,346,310,221 | 3,427,139 | 59,308,327 | 3,921,418 | 69,937,358
Czech | 14,044,787 | 220,651,770 | 2,252,850,039 | 9,972,016,340 | 11,384,801 | 165,271,640 | 14,044,787 | 220,651,770
Danish | 10,373,509 | 181,990,085 | 1,364,713,133 | 6,308,498,796 | 8,605,706 | 138,988,648 | 10,373,509 | 181,990,085
German | 82,149,652 | 1,386,102,191 | 10,897,444,055 | 48,895,765,461 | 67,059,759 | 1,058,662,818 | 82,149,652 | 1,386,102,191
Greek | 9,371,553 | 165,417,622 | 1,820,373,508 | 8,812,514,487 | 7,764,218 | 125,044,203 | 9,371,553 | 165,417,622
Spanish | 78,089,925 | 1,399,352,074 | 6,666,898,872 | 32,736,785,453 | 64,244,795 | 1,081,927,625 | 78,089,925 | 1,399,352,074
Estonian | 2,854,323 | 53,754,536 | 386,118,636 | 1,674,227,657 | 2,415,907 | 43,559,691 | 2,854,323 | 53,754,536
Finnish | 7,240,755 | 125,249,277 | 1,373,321,548 | 6,238,234,293 | 6,089,997 | 98,213,431 | 7,240,755 | 125,249,277
French | 103,755,224 | 1,939,999,072 | 8,968,041,958 | 41,819,656,526 | 86,323,105 | 1,535,367,949 | 103,755,224 | 1,939,999,072
Irish | 2,671,526 | 56,226,938 | 577,193,486 | 4,236,208,024 | 2,255,879 | 46,280,346 | 2,671,526 | 56,226,938
Croatian | 6,625,734 | 108,797,444 | 900,550,321 | 5,016,159,268 | 5,429,788 | 84,834,596 | 6,625,734 | 108,797,444
Hungarian | 6,666,891 | 109,170,191 | 1,298,664,371 | 5,649,694,612 | 5,619,586 | 88,510,100 | 6,666,891 | 109,170,191
Icelandic | 2,122,253 | 32,638,786 | 85,126,625 | 477,912,937 | 1,749,484 | 25,679,914 | 2,122,253 | 32,638,786
Italian | 40,527,192 | 758,466,770 | 5,106,554,664 | 23,034,892,166 | 32,972,875 | 574,472,543 | 40,527,192 | 758,466,770
Lithuanian | 4,098,149 | 68,615,582 | 429,624,163 | 1,800,505,463 | 3,496,927 | 55,912,963 | 4,098,149 | 68,615,582
Latvian | 3,732,626 | 64,595,568 | 397,293,154 | 1,917,377,455 | 3,235,921 | 53,984,330 | 3,732,626 | 64,595,568
Maltese | 768,684 | 13,828,864 | 39,430,173 | 274,428,587 | 650,920 | 11,086,006 | 768,684 | 13,828,864
Norwegian (Bokmål) | 17,582,969 | 289,246,256 | 1,908,783,439 | 10,544,217,380 | 13,931,202 | 204,018,507 | 17,582,969 | 289,246,256
Dutch | 31,011,106 | 448,733,583 | 3,109,527,599 | 13,621,731,300 | 17,308,204 | 234,362,416 | 31,011,106 | 448,733,583
Norwegian (Nynorsk) | 323,519 | 3,587,725 | 38,272,065 | 160,524,882 | 255,753 | 2,662,188 | 323,519 | 3,587,725
Polish | 13,680,703 | 229,122,591 | 2,216,428,318 | 9,637,861,918 | 11,345,670 | 176,361,222 | 13,680,703 | 229,122,591
Portuguese | 31,306,077 | 531,543,618 | 2,853,539,902 | 13,341,568,456 | 26,111,348 | 417,025,428 | 31,306,077 | 531,543,618
Romanian | 6,113,657 | 104,913,932 | 1,123,175,889 | 5,160,868,853 | 5,198,999 | 83,895,491 | 6,113,657 | 104,913,932
Slovak | 4,866,054 | 79,227,777 | 791,959,183 | 3,387,908,021 | 4,179,081 | 65,109,372 | 4,866,054 | 79,227,777
Slovenian | 3,093,989 | 55,262,160 | 441,779,459 | 2,160,088,991 | 2,629,084 | 44,545,677 | 3,093,989 | 55,262,160
Swedish | 11,602,956 | 193,949,614 | 1,361,303,263 | 6,347,478,696 | 9,838,334 | 156,968,017 | 11,602,956 | 193,949,614
Spanish-Catalan (New) | 6,870,183 | 121,809,348 | 122,416,150 | 920,203,247 | 6,807,883 | 120,454,209 | 6,870,183 | 121,809,348
Spanish-Basque (New) | 514,610 | 10,417,432 | 14,668,769 | 110,842,012 | 508,693 | 10,283,207 | 514,610 | 10,417,432
Spanish-Galician (New) | 1,222,837 | 18,062,499 | 31,904,287 | 197,118,066 | 1,209,971 | 17,803,958 | 1,222,837 | 18,062,499
Bonus Release (Low resource languages) - Last Updates on Sep 2021
Khmer | 65,113 | 1,511,950 | 21,560,446 | 21,565,078 | 65,113 | 1,511,950
Burmese | 31,374 | 661,577 | 40,590,354 | 40,595,755 | 31,374 | 661,577
Nepali | 92,084 | 2,941,031 | 36,454,553 | 36,466,101 | 92,084 | 2,941,031
Pashto | 26,321 | 692,651 | 2,587,950 | 2,593,163 | 26,321 | 692,651
Singhalese | 217,407 | 5,791,982 | 38,720,907 | 38,724,422 | 217,407 | 5,791,982
Somali | 14,879 | 506,201 | 28,387,922 | 28,396,227 | 14,879 | 506,201
Swahili | 132,517 | 3,696,543 | 84,605,506 | 84,605,506 | 132,517 | 3,696,543
Tagalog | 248,684 | 6,327,801 | 108,260,601 | 108,260,601 | 248,684 | 6,327,801
Bonus Release - Last Updates on Sep 2021
Polish-Czech (New) | 24,001,403 | 288,826,678 | 24,001,403 | 288,826,678 | 6,055,618,075 | 28,559,061,699
Ukrainian (New) | 13,354,365 | 505,831,880 | 13,354,365 | 505,831,880 | 235,700,383 | 5,832,658,894
Chinese (New) | 14,170,585 | 217,604,664 | 14,170,585 | 217,604,664 | 1,207,487,761 | 8,953,713,029
Russian | 5,377,911 | 101,312,142 | 5,377,911 | 101,312,142 | 491,941,804 | 492,260,972
English-Korean | 4,002,441 | 61,963,744 | 4,002,441 | 61,963,744
Dutch-French | 2,687,331 | 60,504,313 | 2,687,331 | 60,504,313 | 38,164,560 | 770,141,393
Polish-German | 916,522 | 18,883,576 | 916,522 | 18,883,576 | 11,060,105 | 202,765,359