TY - JOUR
AU - Maier, Daniel
AU - Niekler, Andreas
AU - Wiedemann, Gregor
AU - Stoltenberg, Daniela
PY - 2020/11/02
Y2 - 2024/03/29
TI - How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models
JF - Computational Communication Research
JA - CCR
VL - 2
IS - 2
SE - Articles
DO -
UR - https://computationalcommunication.org/ccr/article/view/32
SP - 139-152
AB - Topic modeling enables researchers to explore large document corpora. However, accurate model specification requires the calculation of multiple models, which can become infeasibly costly in terms of time and computing resources. To circumvent this problem, we test and propose a strategy introducing two easy-to-implement modifications to the modeling process: instead of modeling the full corpus and the whole vocabulary, we use (1) random document samples and (2) an extensively pruned vocabulary. Using three empirical corpora with different origins and characteristics (news articles, websites, and tweets), we investigate how different sample sizes and pruning strategies affect the resulting topic models as compared to fully modeled corpora. Our tests provide evidence that sampling and pruning are cheap and viable strategies to accelerate model specification. Sample-based topic models closely resemble corpus-based models if the sample size is large enough (usually >10%). Moreover, extensive pruning does not compromise the quality of the resulting topics. Altogether, pruning and sample-based modeling lead to increased performance without impairing model quality.
ER -