Data formats: each corpus is provided in five variations: 1. Bicleaner TXT format, 2. Bicleaner TMX format, 3. RAW corpus, 4. ROAM (anonymised) format, and 5. deferred crawling format, which contains pointers to the source URLs so that you can recrawl the corpora on your end. To convert a TMX file to a tab-separated text file, download the TMXT tool. If you use the deferred crawling corpora, our reconstruction tool may also prove useful.
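As an illustration of the TMX to tab-separated conversion mentioned above, here is a minimal Python sketch. It is not the TMXT tool and does not reproduce its options. It assumes a standard TMX layout (tu / tuv / seg elements with an xml:lang attribute). The function names tmx_to_rows and write_tsv, and the embedded sample, are hypothetical and exist only for this example.

```python
import io
import xml.etree.ElementTree as ET

# xml:lang lives in the predefined XML namespace; some older TMX files
# use a plain "lang" attribute instead, so both are checked below.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny made-up sample, only used to demonstrate the functions.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="en"><seg>Hello world</seg></tuv>
  <tuv xml:lang="es"><seg>Hola mundo</seg></tuv>
</tu>
</body></tmx>"""


def tmx_to_rows(tmx_source, src_lang, tgt_lang):
    """Yield (source, target) pairs from a TMX path or binary file object.

    Language codes are compared in lower case, so pass e.g. "en", "es".
    """
    for _, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # drop the unit's children once it has been read


def write_tsv(rows, out):
    """Write one 'source<TAB>target' line per pair to a text stream."""
    for src, tgt in rows:
        # Collapse inner tabs and newlines so each pair stays on one line.
        cleaned = [" ".join(text.split()) for text in (src, tgt)]
        out.write("\t".join(cleaned) + "\n")


if __name__ == "__main__":
    import sys
    write_tsv(tmx_to_rows(io.BytesIO(SAMPLE_TMX), "en", "es"), sys.stdout)
```

For a real release file, pass the path of the decompressed TMX file instead of the in-memory sample. The parser streams the file unit by unit, which matters for corpora of this size.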
WMT warning: ParaCrawl has been a regular contributor to WMT since 2018. Data sets from various ParaCrawl releases have been used in different shared tasks over the years. To make sure you get the exact version of the data sets needed for these shared tasks, please download them directly from the links provided on the WMT website.

ParaCrawl Corpus release v7.1

The corpus is released as part of the ParaCrawl project co-financed by the European Union through the Connecting Europe Facility. Release 7.1 is the final release of ParaCrawl Action 2: "Broader Web-Scale Provision of Parallel Corpora for European Languages".

ParaCrawl 7.1 uses a brand new version of Bicleaner (v0.14). Before that filtering step, restorative cleaning was applied using Bifixer. ParaCrawl 7 also comes with a new version of the corpus in which personal data has been filtered with Biroamer, a tool that anonymises a parallel corpus or, more precisely, applies a full ROAM (Random, Omit, Anonymize and Mix) process to it. For the purposes of the ParaCrawl project, sentences that are identified as containing personal data are removed.
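To make the last step concrete, here is a toy Python sketch of removing sentence pairs that appear to contain personal data. It is not Biroamer and does not reflect how Biroamer detects personal data. The e-mail regular expression and the names contains_personal_data and drop_personal_pairs are assumptions made purely for illustration.

```python
import re

# Assumption: an e-mail address is treated as "personal data" here.
# A real tool would rely on much richer detection than one pattern.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def contains_personal_data(text):
    """Return True if the text contains something that looks like an e-mail."""
    return EMAIL_RE.search(text) is not None


def drop_personal_pairs(pairs):
    """Yield only the (source, target) pairs where neither side matches."""
    for src, tgt in pairs:
        if contains_personal_data(src) or contains_personal_data(tgt):
            continue  # the whole pair is removed, not just the matching side
        yield src, tgt


if __name__ == "__main__":
    pairs = [
        ("Write to jane@example.com for details.", "Escribe a jane@example.com."),
        ("Good morning.", "Buenos dias."),
    ]
    for src, tgt in drop_personal_pairs(pairs):
        print(src + "\t" + tgt)
```

Dropping the whole pair keeps the two sides of the corpus aligned, which is why the check is applied to both the source and the target sentence.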

Three new language pairs are also part of ParaCrawl 7.1: Spanish-Basque, Spanish-Catalan, Spanish-Galician.

A newer version is available
See Latest Releases
| Language | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bulgarian | 6,470,710 | 114,394,546 | 1,045,324,425 | 5,099,682,610 | 6,470,710 | 114,394,546 | 6,470,710 | 114,394,546 |
| Czech | 14,083,311 | 221,140,072 | 2,879,555,459 | 12,833,538,437 | 14,083,311 | 221,140,072 | 14,083,311 | 221,140,072 |
| Danish | 10,407,658 | 182,522,421 | 1,786,249,273 | 8,306,677,482 | 10,407,658 | 182,522,421 | 10,407,658 | 182,522,421 |
| German | 82,638,202 | 1,393,267,371 | 15,756,191,802 | 70,616,139,476 | 82,638,202 | 1,393,267,371 | 82,638,202 | 1,393,267,371 |
| Greek | 9,402,646 | 165,889,295 | 2,424,995,524 | 11,601,759,140 | 9,402,646 | 165,889,295 | 9,402,646 | 165,889,295 |
| Spanish | 78,662,122 | 1,408,083,200 | 9,188,413,555 | 44,890,422,388 | 78,662,122 | 1,408,083,200 | 78,662,122 | 1,408,083,200 |
| Estonian | 3,180,464 | 59,363,107 | 876,552,412 | 3,696,036,512 | 3,180,464 | 59,363,107 | 3,180,464 | 59,363,107 |
| Finnish | 7,268,808 | 125,635,485 | 1,817,128,839 | 8,375,529,450 | 7,268,808 | 125,635,485 | 7,268,808 | 125,635,485 |
| French | 104,351,522 | 1,949,176,195 | 13,062,153,970 | 60,890,972,539 | 104,351,522 | 1,949,176,195 | 104,351,522 | 1,949,176,195 |
| Irish | 2,694,760 | 56,502,345 | 640,477,782 | 4,806,250,299 | 2,694,760 | 56,502,345 | 2,694,760 | 56,502,345 |
| Croatian | 6,959,317 | 115,545,209 | 1,669,917,061 | 8,562,686,886 | 6,959,317 | 115,545,209 | 6,959,317 | 115,545,209 |
| Hungarian | 6,693,055 | 109,534,986 | 1,725,663,556 | 7,526,477,040 | 6,693,055 | 109,534,986 | 6,693,055 | 109,534,986 |
| Icelandic | 2,392,422 | 37,138,945 | 142,810,077 | 770,770,693 | 2,392,422 | 37,138,945 | 2,392,422 | 37,138,945 |
| Italian | 40,798,278 | 763,383,205 | 7,264,202,675 | 33,037,833,584 | 40,798,278 | 763,383,205 | 40,798,278 | 763,383,205 |
| Lithuanian | 4,433,562 | 75,244,311 | 1,009,795,012 | 4,046,066,698 | 4,433,562 | 75,244,311 | 4,433,562 | 75,244,311 |
| Latvian | 3,950,244 | 68,920,531 | 891,308,794 | 4,198,459,624 | 3,950,244 | 68,920,531 | 3,950,244 | 68,920,531 |
| Maltese | 856,983 | 15,468,047 | 76,445,881 | 516,566,419 | 856,983 | 15,468,047 | 856,983 | 15,468,047 |
| Norwegian (Bokmål) | 17,582,969 | 289,246,256 | 1,908,783,439 | 10,544,217,380 | 17,582,969 | 289,246,256 | 17,582,969 | 289,246,256 |
| Dutch | 31,295,016 | 451,270,740 | 4,160,122,214 | 18,737,404,479 | 31,295,016 | 451,270,740 | 31,295,016 | 451,270,740 |
| Norwegian (Nynorsk) | 323,519 | 3,587,725 | 38,272,065 | 160,524,882 | 323,519 | 3,587,725 | 323,519 | 3,587,725 |
| Polish | 13,744,860 | 230,136,309 | 2,915,757,884 | 12,852,722,856 | 13,744,860 | 230,136,309 | 13,744,860 | 230,136,309 |
| Portuguese | 31,486,963 | 533,811,233 | 3,872,346,382 | 18,115,535,915 | 31,486,963 | 533,811,233 | 31,486,963 | 533,811,233 |
| Romanian | 6,160,525 | 105,627,421 | 1,618,609,306 | 7,606,750,533 | 6,160,525 | 105,627,421 | 6,160,525 | 105,627,421 |
| Slovak | 4,883,378 | 79,462,934 | 1,050,522,553 | 4,504,281,925 | 4,883,378 | 79,462,934 | 4,883,378 | 79,462,934 |
| Slovenian | 3,737,732 | 67,476,332 | 984,755,785 | 4,510,844,698 | 3,737,732 | 67,476,332 | 3,737,732 | 67,476,332 |
| Swedish | 11,645,807 | 194,586,663 | 1,950,250,201 | 9,013,344,825 | 11,645,807 | 194,586,663 | 11,645,807 | 194,586,663 |
| Spanish-Catalan (New) | 6,870,183 | 121,809,348 | 122,416,150 | 920,203,247 | 6,807,883 | 120,454,209 | 6,870,183 | 121,809,348 |
| Spanish-Basque (New) | 514,610 | 10,417,432 | 14,668,769 | 110,842,012 | 508,693 | 10,283,207 | 514,610 | 10,417,432 |
| Spanish-Galician (New) | 1,222,837 | 18,062,499 | 31,904,287 | 197,118,066 | 1,209,971 | 17,803,958 | 1,222,837 | 18,062,499 |

Bonus Release (Low resource languages) - Last Updates on Sep 2021

| Language | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words |
| --- | --- | --- | --- | --- | --- | --- |
| Khmer | 65,113 | 1,511,950 | 21,560,446 | 21,565,078 | 65,113 | 1,511,950 |
| Burmese | 31,374 | 661,577 | 40,590,354 | 40,595,755 | 31,374 | 661,577 |
| Nepali | 92,084 | 2,941,031 | 36,454,553 | 36,466,101 | 92,084 | 2,941,031 |
| Pashto | 26,321 | 692,651 | 2,587,950 | 2,593,163 | 26,321 | 692,651 |
| Singhalese | 217,407 | 5,791,982 | 38,720,907 | 38,724,422 | 217,407 | 5,791,982 |
| Somali | 14,879 | 506,201 | 28,387,922 | 28,396,227 | 14,879 | 506,201 |
| Swahili | 132,517 | 3,696,543 | 84,605,506 | 84,605,506 | 132,517 | 3,696,543 |
| Tagalog | 248,684 | 6,327,801 | 108,260,601 | 108,260,601 | 248,684 | 6,327,801 |

Bonus Release - Last Updates on Sep 2021

| Language | Sentences | Source Words | Sentences | Source Words | Sentences | Source Words |
| --- | --- | --- | --- | --- | --- | --- |
| Polish-Czech (New) | 24,001,403 | 288,826,678 | 24,001,403 | 288,826,678 | 6,055,618,075 | 28,559,061,699 |
| Ukrainian (New) | 13,354,365 | 505,831,880 | 13,354,365 | 505,831,880 | 235,700,383 | 5,832,658,894 |
| Chinese (New) | 14,170,585 | 217,604,664 | 14,170,585 | 217,604,664 | 1,207,487,761 | 8,953,713,029 |
| Russian | 5,377,911 | 101,312,142 | 5,377,911 | 101,312,142 | 491,941,804 | 492,260,972 |
| English-Korean | 4,002,441 | 61,963,744 | 4,002,441 | 61,963,744 | | |
| Dutch-French | 2,687,331 | 60,504,313 | 2,687,331 | 60,504,313 | 38,164,560 | 770,141,393 |
| Polish-German | 916,522 | 18,883,576 | 916,522 | 18,883,576 | 11,060,105 | 202,765,359 |