Development of semi-supervised machine learning algorithms and applications

Φαζάκης, Νικόλαος
Supervised learning is a well-established branch of the broader science of artificial intelligence. Its aim is to develop computer programs that improve automatically with experience by extracting useful information from annotated examples. This approach is especially useful in real-world applications where large collections of data are available, but the relationship between the input data and the outcomes cannot be discovered or approximated by explicit mathematical formulations. Such problems commonly involve text, audio, or image data. Classic supervised learning, however, carries the cost of annotating the instances of a dataset, a process usually referred to as ‘labeling’ and often carried out by human experts in the field. Because modern big datasets can contain terabytes of data, manual labeling is very inefficient. Semi-supervised learning (SSL) addresses this intrinsic bottleneck by allowing the model to incorporate part or all of the available unlabeled data into its supervised learning. The goal of SSL is to maximize a model's learning performance while using such newly labeled instances to reduce the human labor required.

This thesis focuses on improving a sub-category of SSL algorithms referred to as self-labeled techniques, and on applying them to real-world problems. It answers several important questions, such as: Which learning algorithms can best utilize self-labeling schemes? Can combining ensemble learning with semi-supervised learning improve classification in real-world problems such as speaker identification or educational grade prediction? Is it possible to define a new multi-regressor learning scheme based on self-labeling that can rival existing semi-supervised regression algorithms? Can iterative data imputation be improved through the introduction of self-training? In health-related datasets, is it possible to take advantage of unlabeled test sets to offset the shortage of examples through semi-supervised transductive learning?
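For readers unfamiliar with self-labeled techniques, the following is a minimal sketch of generic self-training. It is not one of the algorithms developed in the thesis. It assumes a toy nearest-centroid base learner on one-dimensional data with at least two classes, and a confidence margin threshold. The function names (`fit_centroids`, `predict_with_confidence`, `self_train`) and the parameters `min_margin` and `max_rounds` are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_centroids(X, y):
    """Return the per-class mean of 1-D feature values."""
    return {c: mean(x for x, lab in zip(X, y) if lab == c) for c in set(y)}


def predict_with_confidence(centroids, x):
    """Predict the class of the nearest centroid.

    Confidence is the margin between the distances to the second-closest
    and the closest centroid (an illustrative choice, not from the thesis).
    """
    ranked = sorted(centroids, key=lambda c: abs(x - centroids[c]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin


def self_train(X_lab, y_lab, X_unlab, min_margin=1.0, max_rounds=10):
    """Iteratively move confidently predicted unlabeled points into the labeled set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        centroids = fit_centroids(X_lab, y_lab)
        confident, remaining = [], []
        for x in pool:
            label, margin = predict_with_confidence(centroids, x)
            if margin >= min_margin:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new was self-labeled, so further rounds change nothing
        for x, label in confident:
            X_lab.append(x)
            y_lab.append(label)
        pool = remaining
    # Return the final model and the points that were never confidently labeled.
    return fit_centroids(X_lab, y_lab), pool
```

Calling `self_train([0.0, 10.0], ["a", "b"], [1.0, 2.0, 9.0, 8.0, 5.0])` self-labels the four points near the two seeds and leaves the ambiguous point `5.0` unlabeled. The threshold controls the usual trade-off in self-labeling: a low value adds more data but risks reinforcing wrong labels.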
Machine learning, Semi-supervised learning