Word-level machine translation for bag-of-words text analysis

Cheap, fast, and surprisingly good

Authors

  • A. Maurits van der Veen, William & Mary

DOI:

https://doi.org/10.5117/CCR2023.2.8.VAND

Keywords:

machine translation, word embeddings, text-as-data, computational social science

Abstract

The quality of machine translation is rapidly approaching that of professional human translation. However, the best methods remain costly in terms of money, computational resources, and/or time, particularly when applied to large volumes of text. In contrast, word-level translation is both free and fast: it simply maps each word in a source language deterministically to a word in a target language. This paper demonstrates that high-quality word-level translation dictionaries can be generated cheaply and easily, and that they produce translations that can serve reliably as inputs into some of the most common automated text analysis methods. It advances the field on two fronts: it assesses different techniques for creating word-level translation dictionaries, and it systematically compares the results obtained from word-level translations with those obtained from either state-of-the-art neural machine translation or professional human translation. Comparisons are performed for three common text analysis tasks (sentiment analysis, dictionary-based content analysis, and topic modeling) across a total of eleven different source languages and two target languages (English and French). Across all languages and tasks, word-level dictionaries perform sufficiently well to make them an attractive alternative when resource constraints make neural machine translation inaccessible. The translation dictionaries, as well as the code used to generate and validate them, are available on GitHub.
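
To make the idea of deterministic word-level translation concrete, here is a minimal Python sketch. It is an illustration only and not the paper's released code: the toy Dutch-to-English dictionary, the function names, whitespace tokenization, and the choice to drop out-of-vocabulary words are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy Dutch-to-English dictionary (illustrative only;
# not taken from the paper's published dictionaries).
TOY_DICT = {
    "de": "the",
    "regering": "government",
    "is": "is",
    "goed": "good",
    "slecht": "bad",
}


def translate_words(text, dictionary):
    """Map each token to exactly one target-language word.

    Tokens are lowercased and split on whitespace (an assumption);
    tokens missing from the dictionary are dropped (also an assumption).
    """
    tokens = text.lower().split()
    return [dictionary[t] for t in tokens if t in dictionary]


def bag_of_words(tokens):
    """Count word occurrences, discarding word order."""
    return Counter(tokens)


translated = translate_words("De regering is goed", TOY_DICT)
bow = bag_of_words(translated)
print(translated)
print(bow)
```

Because bag-of-words methods ignore word order, the word counts produced this way can be passed directly to a sentiment lexicon, a content-analysis dictionary, or a topic model, which is the setting the abstract describes.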

Published

2023-09-28

How to Cite

van der Veen, A. M. (2023). Word-level machine translation for bag-of-words text analysis: Cheap, fast, and surprisingly good. Computational Communication Research, 5(2). https://doi.org/10.5117/CCR2023.2.8.VAND