Development of language models for the Mycenaean Linear B script and their application in the restoration of tablets

Παπαβασιλείου, Κατερίνα

Mycenaean Linear B is an ancient script used to write the earliest known form of the Greek language, referred to as Mycenaean Greek. It was used primarily during the Late Bronze Age, from the 15th to the 13th century BCE. This thesis investigates the problem of restoring Mycenaean Linear B tablets using text infilling methods based on machine learning models. To capture the statistical structure of the Mycenaean documents, we present a dataset of sequences focusing on series D and series A&B. We propose enlarging the dataset with data augmentation methods that take into account the structure and semantics of the domain described by the script. We investigate various Recurrent Neural Network architectures and compare their results on both synthetically generated and real gaps. To further address data scarcity, we investigate transferring knowledge between models trained on different series by applying different transfer learning configurations. We provide quantitative results on both synthetic and real cases of damaged sequences and compare them with experts' opinions, with promising results. The results can be extended to similar problems in Linear B or other ancient scripts, such as decipherment, location identification, or scribe identification. This is the first work of this kind on Mycenaean Linear B, and we hope it will bring the communities of machine learning experts, archaeologists, and linguists closer together.



Mycenaean Linear B script, Natural language processing, Machine learning, Language model, Recurrent neural network