How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models

  • Daniel Maier, Freie Universität Berlin (FU)
  • Andreas Niekler, University of Leipzig
  • Gregor Wiedemann, University of Hamburg
  • Daniela Stoltenberg, University of Münster
Keywords: Text analysis, topic model, latent Dirichlet allocation, preprocessing, model selection


Topic modeling enables researchers to explore large document corpora. However, accurate model specification requires the calculation of multiple models, which can become infeasibly costly in terms of time and computing resources. To circumvent this problem, we test and propose a strategy that introduces two easy-to-implement modifications to the modeling process: instead of modeling the full corpus and the whole vocabulary, we use (1) random document samples and (2) an extensively pruned vocabulary. Using three empirical corpora with different origins and characteristics (news articles, websites, and tweets), we investigate how different sample sizes and pruning strategies affect the resulting topic models compared to models of the full corpora. Our test provides evidence that sampling and pruning are cheap and viable strategies to accelerate model specification. Sample-based topic models closely resemble corpus-based models if the sample size is large enough (usually >10%). Also, extensive pruning does not compromise the quality of the resulting topics. Altogether, pruning and sample-based modeling lead to increased performance without impairing model quality.
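
To make the two modifications concrete, the following is a minimal sketch of how random document sampling and vocabulary pruning could be applied before fitting a topic model. It is not the authors' implementation: the function names, the toy corpus, and the relative document-frequency thresholds (`min_df`, `max_df`) are assumptions chosen for illustration only.

```python
import random
from collections import Counter


def sample_documents(docs, fraction, seed=0):
    """Draw a simple random sample of documents without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    k = max(1, round(len(docs) * fraction))
    return rng.sample(docs, k)


def prune_vocabulary(tokenized_docs, min_df=0.2, max_df=0.9):
    """Keep terms whose relative document frequency lies in [min_df, max_df].

    The thresholds are illustrative assumptions, not values from the article.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    keep = {t for t, df in doc_freq.items() if min_df <= df / n_docs <= max_df}
    pruned = [[t for t in tokens if t in keep] for tokens in tokenized_docs]
    return pruned, keep


# Hypothetical toy corpus of already tokenized documents.
toy_corpus = [
    ["the", "election", "vote"],
    ["the", "vote", "party"],
    ["the", "football", "goal"],
    ["the", "goal", "match"],
    ["the", "climate", "policy"],
    ["the", "policy", "vote"],
    ["the", "match", "party"],
    ["the", "climate", "goal"],
    ["the", "election", "xyzzy"],
    ["the", "football", "policy"],
]

sample = sample_documents(toy_corpus, fraction=0.5, seed=42)
pruned_docs, vocabulary = prune_vocabulary(toy_corpus)
```

In this sketch, the very frequent term "the" and the one-off term "xyzzy" are removed, so a model fitted on `pruned_docs` (or on a pruned sample) works with a smaller document-term matrix. Which thresholds and sample sizes are appropriate depends on the corpus.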

How to Cite
Maier, D., Niekler, A., Wiedemann, G., & Stoltenberg, D. (2020). How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models. Computational Communication Research, 2(2), 139-152. Retrieved from