ParaCrawl Corpus release v5.1
The corpus is released as part of the ParaCrawl project, co-financed by the European Union through the Connecting Europe Facility. Version 5.1 is built from the same raw corpus as version 5. Thanks to improvements in the filtering procedure, the official subset released as version 5.1 is larger for almost all language pairs (the exceptions are ga, de, sl and et). Extrinsic evaluation through machine translation on several language pairs also shows an improvement in quality.
A newer version is available
See Latest Releases
| Language | Sentences | Source Words | | |
|---|---|---|---|---|
| Bulgarian | 2,775,256 | 60,246,308 | 248,555,951 | 1,564,051,100 |
| Czech | 5,345,693 | 105,973,351 | 665,535,115 | 4,025,512,842 |
| Danish | 4,851,772 | 111,476,139 | 447,743,455 | 3,347,135,236 |
| German | 34,371,306 | 708,068,143 | 5,038,103,659 | 27,994,213,177 |
| Greek | 4,038,777 | 93,473,163 | 640,502,801 | 3,768,712,672 |
| Spanish | 44,587,162 | 1,072,236,916 | 2,674,900,280 | 16,598,620,402 |
| Estonian | 1,452,963 | 31,597,344 | 168,091,382 | 915,074,587 |
| Finnish | 3,421,382 | 66,385,933 | 460,181,215 | 2,731,068,033 |
| French | 63,634,915 | 1,518,457,124 | 4,273,819,421 | 24,983,683,983 |
| Irish | 521,768 | 12,089,677 | 64,628,733 | 667,211,260 |
| Croatian | 1,993,180 | 44,945,371 | 273,330,006 | 1,738,164,401 |
| Hungarian | 4,782,328 | 115,330,046 | 461,181,772 | 3,208,285,083 |
| Italian | 24,089,063 | 587,087,473 | 2,251,771,798 | 13,150,606,108 |
| Lithuanian | 1,368,514 | 27,894,906 | 198,101,611 | 963,384,230 |
| Latvian | 1,056,252 | 22,810,714 | 176,113,669 | 1,069,218,155 |
| Maltese | 186,630 | 4,280,211 | 3,693,930 | 38,492,028 |
| Dutch | 11,272,396 | 247,536,605 | 1,101,087,006 | 6,792,400,704 |
| Polish | 6,577,804 | 143,702,545 | 723,052,912 | 4,123,972,411 |
| Portuguese | 15,259,967 | 337,394,318 | 1,068,161,866 | 6,537,298,891 |
| Romanian | 3,176,488 | 69,998,913 | 510,209,923 | 3,034,045,929 |
| Slovak | 2,496,533 | 48,160,348 | 269,067,288 | 1,416,750,646 |
| Slovenian | 1,220,652 | 29,042,458 | 175,682,959 | 1,003,867,134 |
| Swedish | 6,633,761 | 149,048,559 | 620,338,561 | 3,496,650,816 |

Bonus Release (Low resource languages) - Last updated Oct. 2024

| Language | Sentences | Source Words | | |
|---|---|---|---|---|
| English-Azerbaijani | 3,158,025 | 47,117,416 | 336,067,622 | 5,513,087,127 |
| English-Tajik | 343,401 | 5,513,041 | 11,394,854 | 244,332,910 |
| English-Armenian | 1,988,287 | 35,997,571 | 31,671,924 | 233,771,297 |
| English-Khmer v1 | 65,113 | 1,511,950 | 21,560,446 | 21,565,078 |
| English-Burmese v1 | 31,374 | 661,577 | 40,590,354 | 40,595,755 |
| English-Nepali v1 | 92,084 | 2,941,031 | 36,454,553 | 36,466,101 |
| English-Pashto | 26,321 | 692,651 | 2,587,950 | 2,593,163 |
| English-Singhalese | 217,407 | 5,791,982 | 38,720,907 | 38,724,422 |
| English-Somali | 14,879 | 506,201 | 28,387,922 | 28,396,227 |
| English-Swahili | 132,517 | 3,696,543 | 84,605,506 | 84,605,506 |
| English-Tagalog | 248,684 | 6,327,801 | 108,260,601 | 108,260,601 |

Bonus Release - Last updated October 2024

| Language | Sentences | Source Words | | |
|---|---|---|---|---|
| English-Hindi (New) | 4,712,564 | 74,000,000 | | |
| English-Indonesian (New) | 7,133,323 | 109,000,000 | | |
| English-Khmer v2 (New) | 1,501,304 | 23,000,000 | | |
| English-Korean v2 (New) | 7,709,312 | 114,000,000 | | |
| English-Lao (New) | 1,994,053 | 27,000,000 | | |
| English-Burmese v2 (New) | 1,666,530 | 28,000,000 | | |
| English-Nepali v2 (New) | 2,243,954 | 32,000,000 | | |
| English-Thai (New) | 2,175,890 | 22,000,000 | | |
| English-Vietnamese (New) | 6,291,407 | 93,000,000 | | |
| Polish-Czech | 24,001,403 | 288,826,678 | 6,055,618,075 | 28,559,061,699 |
| English-Ukrainian | 13,354,365 | 505,831,880 | 235,700,383 | 5,832,658,894 |
| English-Chinese | 14,170,585 | 217,604,664 | 1,207,487,761 | 8,953,713,029 |
| English-Russian | 5,377,911 | 101,312,142 | 491,941,804 | 492,260,972 |
| English-Korean | 4,002,441 | 61,963,744 | 0 | 0 |
| Dutch-French | 2,687,331 | 60,504,313 | 38,164,560 | 770,141,393 |
| Polish-German | 916,522 | 18,883,576 | 11,060,105 | 202,765,359 |

The BiCleaner files are provided in two formats: TMX and plain TXT. To convert a TMX file to a tab-separated text file, download the TMXT tool.
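
For readers who only need a quick look at a TMX file, the conversion to tab-separated text can also be sketched with the Python standard library. This is not the TMXT tool and does not reproduce its options. It is a minimal illustration that assumes a standard TMX layout (`tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child). The function name `tmx_to_tsv_rows` and the inline sample are hypothetical, and the sample is a stripped-down fragment rather than real ParaCrawl data.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tsv_rows(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) pairs from a TMX file or binary file object.

    Hypothetical helper for illustration; assumes lowercase language codes
    such as "en" and "de" are passed in.
    """
    # iterparse streams the document, so large files are not loaded at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                text = "".join(seg.itertext())
                # Collapse tabs and newlines so each pair fits on one TSV line.
                segs[lang.lower()] = " ".join(text.split())
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release the children of the processed translation unit


# Stripped-down sample fragment (an assumption, not taken from the corpus).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="de"><seg>Guten Morgen.</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    for src, tgt in tmx_to_tsv_rows(io.BytesIO(SAMPLE), "en", "de"):
        print(f"{src}\t{tgt}")
```

To process a downloaded file, a path can be passed in place of the `io.BytesIO` object. Any metadata stored alongside the segments is ignored by this sketch.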