Development of a Python tool for graph based representation of multiple texts

Date

2024-07-10

Authors

Ασημακόπουλος, Διονύσιος

Abstract

In this thesis, we present the development of a Python tool for representing an entire corpus of documents as a single graph, following the graph-of-docs model proposed by Giarelis et al. (2020). This representation makes it possible to apply a wide variety of graph-based algorithms in fields such as natural language processing. The thesis also includes a bibliographical study of the underlying graph theory and of the field's evolution from earlier information retrieval (IR) techniques to the graph-based approaches prevalent today. Our tool takes a novel approach: it describes and creates a single graph that incorporates all the documents of a given set. This allows operations on the entire set that earlier approaches could not support, because they lacked metrics between different documents and a structural framework for analyzing words, sentences, and documents as a cohesive unit. The tool is implemented in Python 3.9, uses libraries such as NetworkX (Hagberg, Swart & Chult, 2008), and offers flexibility in creating the appropriate graph-of-docs. Chapter 2 presents the theoretical background of graph theory and state-of-the-art approaches, while Chapter 3 explains the code as an implementation of the graph-of-docs representation model and demonstrates use cases. Finally, Chapter 4 concludes our work and proposes future directions.
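To make the idea of a single graph for a whole document set more concrete, the following is a minimal sketch in NetworkX. It is not the code of the thesis. The function names `build_graph_of_docs` and `shared_words`, the whitespace tokenizer, the sliding-window co-occurrence rule, and the `kind` and `weight` attributes are all assumptions chosen for illustration.

```python
import networkx as nx


def build_graph_of_docs(docs, window_size=2):
    """Merge all documents into one graph (illustrative sketch only).

    docs: dict mapping a document id to its raw text.
    window_size: size of the sliding window; each word is linked to the
        window_size - 1 words that follow it (an assumed rule).
    """
    graph = nx.Graph()
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # naive tokenizer, an assumption
        doc_node = ("doc", doc_id)
        graph.add_node(doc_node, kind="document")
        for i, word in enumerate(tokens):
            word_node = ("word", word)
            graph.add_node(word_node, kind="word")
            # Link the document to every word it contains.
            graph.add_edge(doc_node, word_node, kind="includes")
            # Link words that co-occur inside the sliding window.
            for other in tokens[i + 1 : i + window_size]:
                if other == word:
                    continue  # skip self-loops
                other_node = ("word", other)
                graph.add_node(other_node, kind="word")
                if graph.has_edge(word_node, other_node):
                    graph[word_node][other_node]["weight"] += 1
                else:
                    graph.add_edge(
                        word_node, other_node, kind="cooccurs", weight=1
                    )
    return graph


def shared_words(graph, doc_a, doc_b):
    """Return the words that both documents are connected to."""
    words_a = set(graph.neighbors(("doc", doc_a)))
    words_b = set(graph.neighbors(("doc", doc_b)))
    return {word for _, word in words_a & words_b}


corpus = {
    "d1": "graphs model text",
    "d2": "graphs model documents",
}
g = build_graph_of_docs(corpus)
common = shared_words(g, "d1", "d2")
```

Because both documents live in the same graph, a cross-document question such as "which words do `d1` and `d2` share?" becomes a simple neighborhood intersection, and co-occurrence weights accumulate over the whole corpus rather than per document.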

Keywords

Graph-of-docs, Natural language processing, Graph-based algorithms, NetworkX
