Document representation for text clustering

Μπούργος, Νικόλαος

Clustering plays a crucial role in organizing and understanding large collections of documents. In this thesis, we conduct a comprehensive investigation of document clustering, examining clustering algorithms, text preprocessing techniques, similarity and distance measures, and evaluation metrics. We place particular emphasis on topic modeling and on document representation methods, especially those based on word embeddings, and we review the literature in detail to establish the current state of the art in document clustering. We then carry out an experimental study on the 20 Newsgroups dataset, pairing K-means clustering with a range of document representations: TF-IDF weighted bag-of-words, Doc2Vec, and both the plain average and the TF-IDF weighted average of Word2Vec, GloVe, and FastText word embeddings. We assess the clustering performance of each representation with both intrinsic and extrinsic evaluation metrics. We also assess Latent Dirichlet Allocation in the context of document clustering. Our findings reveal the strengths and weaknesses of the different document representation and topic modeling methods and offer insight into their effectiveness for document clustering. Despite some limitations, our study contributes to the understanding of document clustering by providing guidance on selecting and assessing document representation methods and on implementing a complete clustering pipeline.
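
The following is a minimal sketch of one pipeline of the kind summarised above: TF-IDF weighted bag-of-words vectors, K-means clustering, and an extrinsic evaluation. It is not the thesis implementation. The six-document toy corpus, the two-word stop list, the fixed seed documents, the smoothed IDF variant, and the choice of purity as the extrinsic metric are all assumptions made for illustration; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"the", "of"}  # assumed, deliberately tiny stop list


def tokenize(text):
    """Lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return L2-normalised TF-IDF vectors (sparse dicts) for tokenised docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: one common variant, chosen here as an assumption.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        vec = {t: f * idf[t] for t, f in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    return (sum(w * w for w in a.values())
            + sum(w * w for w in b.values()) - 2.0 * dot)


def mean(vectors):
    """Component-wise mean of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(vectors, seeds, max_iter=20):
    """Lloyd's K-means; initial centroids are the documents at `seeds`."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = mean(members)
    return labels


def purity(labels, truth):
    """Extrinsic metric: share of documents in their cluster's majority class."""
    clusters = defaultdict(list)
    for l, t in zip(labels, truth):
        clusters[l].append(t)
    majority = sum(Counter(ts).most_common(1)[0][1] for ts in clusters.values())
    return majority / len(truth)


# Toy corpus (assumed): three space documents, three hockey documents.
corpus = [
    "the rocket launch reached orbit",
    "nasa rocket orbit mission",
    "orbit of the space shuttle mission",
    "the hockey team won the game",
    "hockey playoff game tonight",
    "the team lost the playoff game",
]
truth = [0, 0, 0, 1, 1, 1]

vectors = tfidf_vectors([tokenize(d) for d in corpus])
labels = kmeans(vectors, seeds=[0, 3])
print(labels, purity(labels, truth))
```

Swapping the representation, for example replacing `tfidf_vectors` with an averaged word-embedding function, leaves the clustering and evaluation steps unchanged, which is what makes a side-by-side comparison of representations straightforward. Fixed seeds are used only to keep the sketch deterministic; a real experiment would typically use a randomised initialisation with several restarts.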



Document clustering, Bag-of-words, Word embeddings