Two formats of BiCleaner files are provided. You can download either the TMX format or the plain TXT format. To convert a TMX file to a tab-separated text file, download the TMXT tool.
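For readers who want to see what the TMX to tab-separated conversion involves, here is a minimal Python sketch. It is not the TMXT tool and does not reproduce its behaviour; it only illustrates the idea using the standard library. It assumes a TMX 1.4-style file without a default XML namespace, where each `tu` element holds one `tuv` per language with the language code in `xml:lang`. The function name `tmx_to_tsv` and its parameters are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tsv(source, out, src_lang, tgt_lang):
    """Write one 'source<TAB>target' line per translation unit.

    `source` is a file name or a binary file object, `out` is a text
    file object. Returns the number of sentence pairs written.
    Assumption: TMX elements (tu, tuv, seg) carry no XML namespace.
    """
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    written = 0
    # iterparse streams the document instead of loading it all at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also picks up text nested in inline markup.
                text = "".join(seg.itertext())
                # Collapse tabs and newlines so each pair stays on one line.
                segments[lang] = " ".join(text.split())
        # Units that lack either language are skipped.
        if src_lang in segments and tgt_lang in segments:
            out.write(segments[src_lang] + "\t" + segments[tgt_lang] + "\n")
            written += 1
        elem.clear()  # drop the children of the processed unit
    return written
```

A call such as `tmx_to_tsv("corpus.tmx", sys.stdout, "en", "ne")` (after `import sys`) would print the pairs to standard output. The file name and language codes here are placeholders, so check the `xml:lang` values in the file you actually downloaded.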
Language | Crawled Websites | | Sentences | Source Words
Burmese | 31,374 | 661,577 | 40,590,354 | 40,595,755 | 31,374 | 661,577
Khmer | 65,113 | 1,511,950 | 21,560,446 | 21,565,078 | 65,113 | 1,511,950
Nepali | 92,084 | 2,941,031 | 36,454,553 | 36,466,101 | 92,084 | 2,941,031
Pashto | 26,321 | 692,651 | 2,587,950 | 2,593,163 | 26,321 | 692,651
Russian | 5,377,911 | 101,312,142 | 491,941,804 | 492,260,972 | 5,377,911 | 101,312,142
Singhalese | 217,407 | 5,791,982 | 38,720,907 | 38,724,422 | 217,407 | 5,791,982
Somali | 14,879 | 506,201 | 28,387,922 | 28,396,227 | 14,879 | 506,201
Swahili | 132,517 | 3,696,543 | 84,605,506 | 84,605,506 | 132,517 | 3,696,543
Tagalog | 248,684 | 6,327,801 | 108,260,601 | 108,260,601 | 248,684 | 6,327,801
Release 3 of the corpus was used in the proceedings of WMT 2019. For WMT 2018, the FILTERED v1.0 version of Release 1 was used.
Bonus Release - Last Updated on Jun 2019
Dutch-French | 7,700 | 2,687,331 | 60,504,313 | 38,164,560 | 770,141,393 | 2,687,331 | 60,504,313
Polish-German | 5,549 | 916,522 | 18,883,576 | 11,060,105 | 202,765,359 | 916,522 | 18,883,576
Extra Languages in release v1 - Last Updated on Jan 2018
Russian | 14,035 | 12,061,155 | 157,061,045 | 1,078,819,759