Dialogue Systems in the Greek Language

Author name: Ioannis Loumiotis
Title: Dialogue Systems in the Greek Language
Year: 2018-2019
Supervisor: Georgios Petasis

Summary

Recent advances in deep learning have increased the research community's interest in dialogue systems, and several architectures that achieve state-of-the-art results have been proposed. These dialogue systems are usually trained on publicly available datasets that were created specifically for this purpose and are also used for benchmarking. However, because most of these datasets are available only in English, research on dialogue systems for other languages faces obstacles. In particular, work on dialogue systems in the Greek language is limited and relies mainly on proprietary datasets used for commercial purposes. The aim of this thesis is to study the problem of training dialogue systems in the Greek language with a small dataset. Starting from a small dataset of open-domain questions and answers that was created manually for a social robot, we apply a corpus augmentation technique to enlarge the dataset and increase the performance of the dialogue system. The augmentation technique is based on the synonyms of the words in the initial dataset. A morphological lexicon is used to obtain the synonyms, which are then manually processed to select the appropriate word sense and the specific synonyms that fit the question-answer pairs in the dataset. These synonyms are then used to augment the dataset by applying appropriate permutations. The synonym-based augmentation method studied in this thesis is evaluated for both generative and utterance selection dialogue systems. Specifically, for the former, several architectures, such as Recurrent Neural Networks (RNNs) and Transformers, are investigated, while for the latter, the pretrained multilingual Bidirectional Encoder Representations from Transformers (BERT) model is applied.
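As a rough illustration of the synonym-based augmentation described above, the following Python sketch generates every variant of a sentence by substituting manually approved synonyms. It assumes that "appropriate permutations" means all combinations of per-word substitutions; the function name `augment`, the whitespace tokenization, and the English stand-in sentence and synonym lists are hypothetical and are not taken from the thesis.

```python
from itertools import product


def augment(sentence, synonyms):
    """Return all variants of `sentence` built from approved synonym swaps.

    `synonyms` maps a word to the list of synonyms that were approved for it.
    This is an illustrative assumption, not the thesis implementation.
    """
    tokens = sentence.split()
    # Each position can keep its original token or take one of its synonyms.
    options = [[tok] + synonyms.get(tok, []) for tok in tokens]
    # The Cartesian product enumerates every combination of choices.
    return [" ".join(choice) for choice in product(*options)]


# Hypothetical English stand-in for a Greek question; synonym lists are made up.
synonyms = {"big": ["large"], "house": ["home", "residence"]}
variants = augment("a big house", synonyms)

for variant in variants:
    print(variant)
```

The original sentence is always the first variant, so the augmented set is a superset of the initial data. Applying the same function to both the question and the answer of a pair, and pairing the results, would enlarge the dataset multiplicatively.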
The evaluation is performed automatically using word-overlap metrics. In particular, the Bilingual Evaluation Understudy (BLEU) score, the Metric for Evaluation of Translation with Explicit Ordering (METEOR) score and the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score are applied. The obtained results reveal that augmenting the small dataset can increase the performance of Greek dialogue systems. Specifically, the BLEU-2 score, which is recommended in the literature for evaluating responses in non-technical dialogues, increased by over 31.5% for the RNNs and by about 43% for the Transformer. For the utterance selection approach, the BLEU-2 score increased by 49.7%. To validate the obtained results, hypothesis testing is applied to the generative models. The outcome of the statistical tests confirms the validity of the results and shows that an augmentation approach based on synonyms obtained from a morphological lexicon can significantly increase the performance of a dialogue system in the Greek language. Finally, the automated selection of the appropriate synonyms is studied, and the results of an automated procedure for selecting the synonyms using language models are presented. In particular, a custom pretrained XLNet model and the multilingual BERT model are employed to automate the selection process. The obtained results reveal that these models did not meet our expectations under the current scenario and that further investigation in this research direction is required.
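For readers unfamiliar with BLEU-2, the following Python sketch shows one common sentence-level formulation: the geometric mean of the clipped unigram and bigram precisions, multiplied by a brevity penalty. It is a simplified illustration with a single reference and no smoothing; the function names are hypothetical, and the thesis may have used a different implementation or configuration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu2(candidate, reference):
    """Unsmoothed single-reference BLEU-2 (illustrative sketch only)."""
    p1 = modified_precision(candidate, reference, 1)
    p2 = modified_precision(candidate, reference, 2)
    if p1 == 0 or p2 == 0:
        return 0.0
    c, r = len(candidate), len(reference)
    # The brevity penalty lowers the score of candidates shorter than the reference.
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.sqrt(p1 * p2)


# Hypothetical English stand-in for a generated response and its reference.
score = bleu2("the cat sat".split(), "the cat sat down".split())
print(round(score, 4))
```

In this example every unigram and bigram of the candidate appears in the reference, so the score is determined entirely by the brevity penalty for the shorter response.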