ParaCrawl Corpus release v7

The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility. Release 7 is the final release of ParaCrawl Action 2: "Broader Web-Scale Provision of Parallel Corpora for European Languages".

ParaCrawl 7 uses a new version of Bicleaner (v0.14, see changes). Restorative cleaning was applied beforehand using Bifixer. ParaCrawl 7 also comes with a new version of the corpus in which personal data has been filtered with Biroamer, a tool that performs anonymisation or, more precisely, applies a full ROAM (Random, Omit, Anonymize and Mix) process to a parallel corpus. For the purposes of the ParaCrawl project, sentences identified as containing personal data are removed.
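
As a rough illustration of the removal step described above, the sketch below drops any sentence pair in which either side appears to contain personal data. It is not Biroamer's actual method: the regular expressions and the names `contains_personal_data` and `drop_personal_data` are hypothetical and only stand in for whatever detection the real tool performs.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical, deliberately simple patterns (an assumption for illustration);
# real personal-data detection is considerably more involved.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def contains_personal_data(text: str) -> bool:
    """Return True if the text matches one of the illustrative patterns."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))


def drop_personal_data(
    pairs: Iterable[Tuple[str, str]]
) -> Iterator[Tuple[str, str]]:
    """Yield only the pairs in which neither side is flagged."""
    for source, target in pairs:
        if contains_personal_data(source) or contains_personal_data(target):
            continue  # remove the whole pair, as the release notes describe
        yield source, target


pairs = [
    ("Hola mundo", "Hello world"),
    ("Escríbeme a ana@example.com", "Write to me at ana@example.com"),
    ("Llama al +34 600 123 456", "Call +34 600 123 456"),
]
kept = list(drop_personal_data(pairs))
```

Removing the whole pair, rather than masking the matched span, mirrors the statement that flagged sentences are removed from this version of the corpus.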

Three new language pairs are also part of ParaCrawl 7: Spanish-Basque, Spanish-Catalan, Spanish-Galician.

A newer version is available
See Latest Releases
| Language | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words |
|---|---|---|---|---|---|---|---|---|
| Spanish-Basque (New) | 514,610 | 10,417,432 | 14,668,769 | 110,842,012 | 514,610 | 10,417,432 | 508,693 | 10,283,207 |
| Spanish-Catalan (New) | 6,870,183 | 121,809,348 | 122,416,150 | 920,203,247 | 6,870,183 | 121,809,348 | 6,807,883 | 120,454,209 |
| Spanish-Galician (New) | 1,222,837 | 18,062,499 | 31,904,287 | 197,118,066 | 1,222,837 | 18,062,499 | 1,209,971 | 17,803,958 |
| Bulgarian | 3,921,418 | 69,937,358 | 268,998,640 | 1,346,310,221 | 3,921,418 | 69,937,358 | 3,427,139 | 59,308,327 |
| Croatian | 6,625,734 | 108,797,444 | 900,550,321 | 5,016,159,268 | 6,625,734 | 108,797,444 | 5,429,788 | 84,834,596 |
| Czech | 14,044,787 | 220,651,770 | 2,252,850,039 | 9,972,016,340 | 14,044,787 | 220,651,770 | 11,384,801 | 165,271,640 |
| Danish | 10,373,509 | 181,990,085 | 1,364,713,133 | 6,308,498,796 | 10,373,509 | 181,990,085 | 8,605,706 | 138,988,648 |
| Dutch | 31,011,106 | 448,733,583 | 3,109,527,599 | 13,621,731,300 | 31,011,106 | 448,733,583 | 17,308,204 | 234,362,416 |
| Estonian | 2,854,323 | 53,754,536 | 386,118,636 | 1,674,227,657 | 2,854,323 | 53,754,536 | 2,415,907 | 43,559,691 |
| Finnish | 7,240,755 | 125,249,277 | 1,373,321,548 | 6,238,234,293 | 7,240,755 | 125,249,277 | 6,089,997 | 98,213,431 |
| French | 103,755,224 | 1,939,999,072 | 8,968,041,958 | 41,819,656,526 | 103,755,224 | 1,939,999,072 | 86,323,105 | 1,535,367,949 |
| German | 82,149,652 | 1,386,102,191 | 10,897,444,055 | 48,895,765,461 | 82,149,652 | 1,386,102,191 | 67,059,759 | 1,058,662,818 |
| Greek | 9,371,553 | 165,417,622 | 1,820,373,508 | 8,812,514,487 | 9,371,553 | 165,417,622 | 7,764,218 | 125,044,203 |
| Hungarian | 6,666,891 | 109,170,191 | 1,298,664,371 | 5,649,694,612 | 6,666,891 | 109,170,191 | 5,619,586 | 88,510,100 |
| Icelandic | 2,122,253 | 32,638,786 | 85,126,625 | 477,912,937 | 2,122,253 | 32,638,786 | 1,749,484 | 25,679,914 |
| Irish | 2,671,526 | 56,226,938 | 577,193,486 | 4,236,208,024 | 2,671,526 | 56,226,938 | 2,255,879 | 46,280,346 |
| Italian | 40,527,192 | 758,466,770 | 5,106,554,664 | 23,034,892,166 | 40,527,192 | 758,466,770 | 32,972,875 | 574,472,543 |
| Latvian | 3,732,626 | 64,595,568 | 397,293,154 | 1,917,377,455 | 3,732,626 | 64,595,568 | 3,235,921 | 53,984,330 |
| Lithuanian | 4,098,149 | 68,615,582 | 429,624,163 | 1,800,505,463 | 4,098,149 | 68,615,582 | 3,496,927 | 55,912,963 |
| Maltese | 768,684 | 13,828,864 | 39,430,173 | 274,428,587 | 768,684 | 13,828,864 | 650,920 | 11,086,006 |
| Norwegian (Bokmål) | 17,582,969 | 289,246,256 | 1,908,783,439 | 10,544,217,380 | 17,582,969 | 289,246,256 | 13,931,202 | 204,018,507 |
| Norwegian (Nynorsk) | 323,519 | 3,587,725 | 38,272,065 | 160,524,882 | 323,519 | 3,587,725 | 255,753 | 2,662,188 |
| Polish | 13,680,703 | 229,122,591 | 2,216,428,318 | 9,637,861,918 | 13,680,703 | 229,122,591 | 11,345,670 | 176,361,222 |
| Portuguese | 31,306,077 | 531,543,618 | 2,853,539,902 | 13,341,568,456 | 31,306,077 | 531,543,618 | 26,111,348 | 417,025,428 |
| Romanian | 6,113,657 | 104,913,932 | 1,123,175,889 | 5,160,868,853 | 6,113,657 | 104,913,932 | 5,198,999 | 83,895,491 |
| Slovak | 4,866,054 | 79,227,777 | 791,959,183 | 3,387,908,021 | 4,866,054 | 79,227,777 | 4,179,081 | 65,109,372 |
| Slovenian | 3,093,989 | 55,262,160 | 441,779,459 | 2,160,088,991 | 3,093,989 | 55,262,160 | 2,629,084 | 44,545,677 |
| Spanish | 78,089,925 | 1,399,352,074 | 6,666,898,872 | 32,736,785,453 | 78,089,925 | 1,399,352,074 | 64,244,795 | 1,081,927,625 |
| Swedish | 11,602,956 | 193,949,614 | 1,361,303,263 | 6,347,478,696 | 11,602,956 | 193,949,614 | 9,838,334 | 156,968,017 |

WMT 2019 used Release 3 of the corpus. WMT 2018 used the FILTERED v1.0 version of Release 1.

Bonus Release (Low resource languages) - Last Updated on Jul 2020

| Language | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words |
|---|---|---|---|---|---|---|
| Burmese | 31,374 | 661,577 | 40,590,354 | 40,595,755 | 31,374 | 661,577 |
| Khmer | 65,113 | 1,511,950 | 21,560,446 | 21,565,078 | 65,113 | 1,511,950 |
| Nepali | 92,084 | 2,941,031 | 36,454,553 | 36,466,101 | 92,084 | 2,941,031 |
| Pashto | 26,321 | 692,651 | 2,587,950 | 2,593,163 | 26,321 | 692,651 |
| Russian | 5,377,911 | 101,312,142 | 491,941,804 | 492,260,972 | 5,377,911 | 101,312,142 |
| Singhalese | 217,407 | 5,791,982 | 38,720,907 | 38,724,422 | 217,407 | 5,791,982 |
| Somali | 14,879 | 506,201 | 28,387,922 | 28,396,227 | 14,879 | 506,201 |
| Swahili | 132,517 | 3,696,543 | 84,605,506 | 84,605,506 | 132,517 | 3,696,543 |
| Tagalog | 248,684 | 6,327,801 | 108,260,601 | 108,260,601 | 248,684 | 6,327,801 |

Bonus Release - Last Updated on Jun 2019
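
| Language | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words |
|---|---|---|---|---|---|---|
| Dutch-French | 2,687,331 | 60,504,313 | 38,164,560 | 770,141,393 | 2,687,331 | 60,504,313 |
| Polish-German | 916,522 | 18,883,576 | 11,060,105 | 202,765,359 | 916,522 | 18,883,576 |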