Word Embedding Enrichment for Dictionary Construction

An Example of Incivility in Cantonese

Authors

  • Hai Liang, The Chinese University of Hong Kong
  • Yee Man Margaret Ng, Department of Journalism, University of Illinois Urbana-Champaign
  • Nathan L.T. Tsang, Department of Sociology, The University of Southern California

DOI:

https://doi.org/10.5117/CCR2023.1.10.LIAN

Keywords:

political incivility, machine learning, dictionary construction, Cantonese, swearing

Abstract

Dictionary-based methods remain valuable for measuring concepts in text, even though supervised machine learning is now widely used in communication research. The present study proposes a semi-automatic, easily implemented method for building and enriching dictionaries with word embeddings. As an example, we create a dictionary of political incivility that contains vulgarity and name-calling words in Cantonese. The study shows that dictionary-based classification outperforms supervised machine learning methods, including deep neural network models. Furthermore, a small number of random seed words can generate a highly accurate dictionary. However, as we demonstrate in a population-based survey experiment, the uncivil content detected is only weakly correlated with perceptions of incivility. The strengths and limitations of dictionary-based methods are discussed.
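The abstract does not spell out the enrichment procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that enrichment means adding the nearest embedding neighbours of each seed word, ranked by cosine similarity, and that classification means flagging any text containing a dictionary term. The function names, the `top_k` and `min_sim` parameters, and the toy three-dimensional vectors with placeholder words are all hypothetical. The manual review of candidates that a semi-automatic method would involve is not modelled here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def enrich_dictionary(seeds, embeddings, top_k=2, min_sim=0.8):
    """Add each seed's top_k nearest neighbours with similarity >= min_sim.

    Hypothetical helper: the thresholds are illustrative assumptions.
    """
    dictionary = {w for w in seeds if w in embeddings}
    candidates = set()
    for seed in dictionary:
        scored = [
            (cosine(embeddings[seed], vec), word)
            for word, vec in embeddings.items()
            if word not in dictionary
        ]
        scored.sort(reverse=True)  # most similar first
        for sim, word in scored[:top_k]:
            if sim >= min_sim:
                candidates.add(word)
    return dictionary | candidates

def is_flagged(tokens, dictionary):
    """Dictionary-based classification: flag a text with any dictionary term."""
    return any(token in dictionary for token in tokens)

# Toy embeddings with placeholder words (assumed data, not from the study).
TOY_EMBEDDINGS = {
    "insult_a": [0.9, 0.1, 0.0],
    "insult_b": [0.85, 0.15, 0.05],
    "insult_c": [0.8, 0.2, 0.1],
    "policy": [0.1, 0.9, 0.2],
    "budget": [0.0, 0.8, 0.3],
}

enriched = enrich_dictionary(["insult_a"], TOY_EMBEDDINGS)
print(sorted(enriched))
print(is_flagged(["the", "budget", "insult_c"], enriched))
```

With a single seed word, the sketch pulls in the two placeholder terms whose vectors point in nearly the same direction and leaves the unrelated words out. In practice the embeddings would be trained on a large corpus, and the candidate list would be checked by a human coder before being added.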

Published

2023-09-26

How to Cite

Liang, H., Ng, Y. M. M., & Tsang, N. L. (2023). Word Embedding Enrichment for Dictionary Construction: An Example of Incivility in Cantonese. Computational Communication Research, 5(1). https://doi.org/10.5117/CCR2023.1.10.LIAN

Section

Articles