How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models
Keywords:
Text analysis, topic model, latent Dirichlet allocation, preprocessing, model selectionAbstract
Topic modeling enables researchers to explore large document corpora. However, accurate model specification requires the calculation of multiple models, which can become infeasibly costly in terms of time and computing resources. In order to circumvent this problem, we test and propose a strategy introducing two easy-to-implement modifications to the modeling process: Instead of modeling the full corpus and the whole vocabulary, we (1) use random document samples and (2) an extensively pruned vocabulary. Using three empirical corpora with different origins and characteristics (news articles, websites, and Tweets), we investigate how different sample sizes and pruning strategies affect the resulting topic models as compared to fully modeled corpora. Our test provides evidence that sampling and pruning are cheap and viable strategies to accelerate model specification. Sample-based topic models closely resemble corpus-based models, if the sample size is large enough (usually >10%). Also, extensive pruning does not compromise the quality of the resulting topics. Altogether, pruning and sample-based modeling leads to increased performance without impairing model quality.