Document representation for text clustering

Μπούργος, Νικόλαος

Files

Thesis_Nikos_Bourgos_HCI.pdf (6.01 MB)

Date

2023-06-23

Authors

Μπούργος, Νικόλαος

Abstract

Clustering plays a crucial role in organizing and understanding large collections of documents. In this thesis, we investigate document clustering end to end, examining clustering algorithms, text preprocessing techniques, similarity and distance measures, and evaluation metrics. We place particular emphasis on topic modeling and on document representation methods, especially those based on word embeddings, and review the literature in detail to establish the current state of the art in document clustering. We then conduct an experimental study on the 20newsgroups dataset, using K-means clustering to test a range of document representation methods: TF-IDF weighted bag-of-words, Doc2Vec, and the plain and TF-IDF weighted averages of Word2Vec, GloVe, and FastText word embeddings. We assess the clustering performance of each representation with both intrinsic and extrinsic evaluation metrics. Latent Dirichlet Allocation is also assessed in the context of document clustering. Our findings reveal the strengths and weaknesses of the different document representation and topic modeling methods and offer insight into their effectiveness for document clustering. Despite some limitations, our study contributes to the understanding of document clustering and provides guidance on selecting and assessing document representation methods and on implementing a complete clustering pipeline.
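
One of the representations named in the abstract, the TF-IDF weighted average of word embeddings, can be illustrated with a minimal sketch. This is not the thesis implementation: the toy two-dimensional vectors, the example tokens, the function names, and the smoothed IDF formula (ln((1 + N) / (1 + df)) + 1) are all assumptions chosen for illustration, and a real pipeline would load pretrained Word2Vec, GloVe, or FastText vectors instead.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency per term (assumed variant)."""
    n_docs = len(tokenized_docs)
    # Count in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}


def weighted_average_embedding(tokens, embeddings, idf, dim):
    """TF-IDF weighted mean of the word vectors of one document."""
    # Only terms that have a vector contribute to the average.
    counts = Counter(t for t in tokens if t in embeddings)
    total = [0.0] * dim
    weight_sum = 0.0
    for term, tf in counts.items():
        weight = tf * idf.get(term, 1.0)
        weight_sum += weight
        for i, value in enumerate(embeddings[term]):
            total[i] += weight * value
    if weight_sum == 0.0:
        return total  # no known words: fall back to the zero vector
    return [value / weight_sum for value in total]


# Hypothetical toy corpus and toy 2-d embeddings, for illustration only.
docs = [
    ["space", "nasa", "orbit"],
    ["hockey", "game", "nasa"],
    ["space", "orbit", "orbit"],
]
toy_embeddings = {
    "space": [1.0, 0.0],
    "nasa": [0.8, 0.2],
    "orbit": [0.9, 0.1],
    "hockey": [0.0, 1.0],
    "game": [0.1, 0.9],
}

idf = idf_weights(docs)
doc_vectors = [
    weighted_average_embedding(doc, toy_embeddings, idf, dim=2)
    for doc in docs
]
```

The resulting `doc_vectors` are fixed-length document representations that could be passed to a clustering algorithm such as K-means. Setting every weight to 1 would reduce this to the plain embedding average that the abstract lists alongside it.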

Keywords

Document clustering, Bag-of-words, Word embeddings

URI

https://hdl.handle.net/10889/25187

Collections

Department of Electrical and Computer Engineering (Master's theses)
