Selecting Relevant Documents for Multilingual Content Analysis
An Evaluation of Keyword and Semantic Similarity Search Approaches
Keywords: information retrieval, sampling, machine translation, semantic search, multilingual text analysis, computational social science
Comparative research in communication often involves selecting and analyzing documents in multiple languages. Machine translation is an effective preprocessing step for automated content analysis, but its impact on data collection remains under-examined. Using a parallel language corpus of European Parliament debates, this paper evaluates machine translation as an approach to multilingual document retrieval, i.e., selecting documents for analysis. We compare several strategies for retrieving relevant multilingual documents: 1) expert-validated search queries, 2) machine-translated search queries, and 3) multilingual semantic similarity search. Each strategy is compared against monolingual searches, and we describe how the choice of strategy can affect results from topic modeling. Results show that expert-validated search queries achieve reliable results across languages, while the accuracy of machine-translated search queries varies significantly between languages and affects further analyses. Although semantic similarity search retrieved a similar subset of relevant documents across languages, its results were less accurate than those of the keyword approaches. In sum, validated translations of search queries can be effective for multilingual document retrieval, but translation errors can lead to systematic bias in the results of further analysis. These results are important for researchers seeking opportunities to introduce, validate, and generalize findings and theories beyond English-speaking countries.
Copyright (c) 2023 Sean Palicki, Stefanie Walter, Wouter van Atteveldt, Alice Beazer, Isaac Bravo
This work is licensed under a Creative Commons Attribution 4.0 International License.