Based on feedback received from users and evaluators, we have recently been working on improving our capabilities for detecting and removing Machine Translated (MT) content from ParaCrawl corpora. Although most of the feedback concerned large chunks of MT content in low-resource languages, the problem is also relevant to high-resource languages (though at a lower percentage of the total content).
The approach taken is “bottom-up”, focusing on the structure of the raw data that we process (i.e. the crawled web pages) rather than on the textual content itself. Limiting our attention to the most common MT translation services and plugins, we analysed the structure of many websites in order to identify traces of such tools, allowing us to reliably detect pages that contain solely MT content. Since many websites that contain human-translated content fall back to MT for only part of their pages, filtering out an entire site would be excessive. We have therefore followed a conservative approach, concentrating on patterns that point to explicit and extensive use of translation tools. Based on these findings, a new filtering tool has been developed and integrated into the processing pipeline as an additional pre-processing step.
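To illustrate the idea of detecting structural traces of MT widgets in crawled HTML, here is a minimal sketch. The marker patterns below are illustrative assumptions (e.g. attributes injected by common client-side translation widgets), not the actual list used in the pipeline:

```python
import re

# Hypothetical markers left behind by common client-side MT widgets.
# A production tool would maintain a curated, validated list per service.
MT_MARKERS = [
    re.compile(r'class="[^"]*translated-(?:ltr|rtl)[^"]*"'),  # widget-injected class
    re.compile(r'id="google_translate_element"'),             # embedded translate element
    re.compile(r'gtranslate', re.IGNORECASE),                 # a translation plugin name
]

def looks_machine_translated(html: str) -> bool:
    """Return True if the raw HTML contains traces of an MT widget or plugin."""
    return any(pattern.search(html) for pattern in MT_MARKERS)
```

Such a check runs on the raw page before any text extraction, which is what makes the approach “bottom-up”: pages flagged this way can be dropped without ever inspecting the extracted sentences.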
For low-resource languages, this resulted in a reduction of the size of the raw corpora of up to 25% (i.e. for the same quantity of raw data). For higher-resource languages, the reduction was smaller, between 10% and 15%, depending on the language. We believe that the new tool is a valuable addition to our processing pipeline, and we hope that users of our corpora will welcome this improvement.
The newly released filtered corpora, “ParaCrawl Corpus - release 8.0”, have been fully reprocessed using this tool. As always, the new release includes all the material collected and processed for the previous data release, plus all new data collected since.