Text Classification Based on Convolutional Neural Networks and Word Embedding for Low-Resource Languages: Tigrinya.

Item request has been placed! ×
Item request cannot be made. ×
loading   Processing Request
  • Additional Information
    • Subject Terms:
    • Abstract:
      This article studies convolutional neural networks for Tigrinya (also referred to as Tigrigna), which is a family of Semitic languages spoken in Eritrea and northern Ethiopia. Tigrinya is a "low-resource" language and is notable in terms of the absence of comprehensive and free data. Furthermore, it is characterized as one of the most semantically and syntactically complex languages in the world, similar to other Semitic languages. To the best of our knowledge, no previous research has been conducted on the state-of-the-art embedding technique that is shown here. We investigate which word representation methods perform better in terms of learning for single-label text classification problems, which are common when dealing with morphologically rich and complex languages. Manually annotated datasets are used here, where one contains 30,000 Tigrinya news texts from various sources with six categories of "sport", "agriculture", "politics", "religion", "education", and "health" and one unannotated corpus that contains more than six million words. In this paper, we explore pretrained word embedding architectures using various convolutional neural networks (CNNs) to predict class labels. We construct a CNN with a continuous bag-of-words (CBOW) method, a CNN with a skip-gram method, and CNNs with and without word2vec and FastText to evaluate Tigrinya news articles. We also compare the CNN results with traditional machine learning models and evaluate the results in terms of the accuracy, precision, recall, and F1 scoring techniques. The CBOW CNN with word2vec achieves the best accuracy with 93.41%, significantly improving the accuracy for Tigrinya news classification. [ABSTRACT FROM AUTHOR]
    • Abstract:
      Copyright of Information (2078-2489) is the property of MDPI and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)