Enhancing Natural Language Processing in Somali Text Classification: A Comprehensive Framework for Stop Word Removal

Abdirahman, Abdullahi Ahmed and Hashi, Abdirahman Osman and Dahir, Ubaid Mohamed and Elmi, Mohamed Abdirahman and Rodriguez, Octavio Ernest Romo (2023) Enhancing Natural Language Processing in Somali Text Classification: A Comprehensive Framework for Stop Word Removal. International Journal of Engineering Trends and Technology, 71 (12). pp. 40-49. ISSN 22315381

Full text not available from this repository.

Abstract

Text classification is a prominent field of study in information retrieval and natural language processing, and a stop word list is a crucial component of it. Such a list identifies frequently occurring words that carry little relevance for classification and are therefore removed during pre-processing. Although various stop word lists have been devised for English, a standardized stop word list tailored to Somali text classification has yet to be established. This research presents a comprehensive framework for stop word removal in the Somali language, aiming to enhance the effectiveness of various Natural Language Processing (NLP) tasks. The proposed methodology encompasses several essential steps: noise identification, noise removal, character normalization, data masking, tokenization, POS tagging, and lemmatization. By analysing a substantial dataset containing 79,741,231 tokens and 71,871,585 words, the framework demonstrates its capability to identify and eliminate stop words, thereby reducing the dimensionality of the vector space and improving the performance of NLP algorithms. The research highlights the unique linguistic features of Somali, such as contextual variations and morphological complexities, and discusses the potential applications of the developed stop word list in sentiment analysis, information retrieval, and document classification. This work contributes valuable insights to the field of language technology, particularly for underrepresented languages, and paves the way for further advancements in NLP models tailored to diverse linguistic contexts.
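
To make some of the pre-processing steps named in the abstract concrete, the following is a minimal Python sketch of noise removal, character normalization, tokenization, and stop word filtering. It is not the authors' implementation: the function names, the regular expressions, the document-frequency heuristic for picking candidate stop words, and the `min_doc_ratio` parameter are all assumptions made for illustration, and data masking, POS tagging, and lemmatization are left out.

```python
import re
import unicodedata
from collections import Counter

# Assumed noise pattern: only URLs are treated as noise in this sketch.
URL_RE = re.compile(r"https?://\S+")
# Letters, optionally joined by apostrophes (Somali orthography writes
# the glottal stop with an apostrophe inside words).
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def normalize(text):
    """Normalize characters, lowercase, and strip URLs."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.replace("\u2019", "'")  # typographic apostrophe -> ASCII
    return URL_RE.sub(" ", text)


def tokenize(text):
    """Split normalized text into word tokens."""
    return TOKEN_RE.findall(normalize(text))


def candidate_stop_words(docs, min_doc_ratio=0.8):
    """Return tokens that occur in at least min_doc_ratio of the documents.

    Document frequency is a hypothetical selection heuristic here,
    not the criterion used in the paper.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    threshold = min_doc_ratio * len(docs)
    return {tok for tok, df in doc_freq.items() if df >= threshold}


def remove_stop_words(text, stop_words):
    """Tokenize text and drop every token found in stop_words."""
    return [tok for tok in tokenize(text) if tok not in stop_words]


# Toy sentences for illustration only; they are not from the paper's corpus.
docs = [
    "Waa maalin wanaagsan iyo habeen",
    "Cali iyo Faarax waa saaxiibo",
    "Buug iyo qalin waa muhiim",
]
stop_words = candidate_stop_words(docs, min_doc_ratio=1.0)
filtered = remove_stop_words(docs[1], stop_words)
```

In this toy run the tokens shared by every sentence are dropped, so the vocabulary that a bag-of-words classifier would need to represent shrinks accordingly, which is the vector-space reduction the abstract refers to.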

Item Type: Article
Divisions: Faculty of Computing > Department of Information Technology
Depositing User: Center for Research and Development SIMAD University
Date Deposited: 26 May 2024 07:33
Last Modified: 26 May 2024 07:33
URI: https://repository.simad.edu.so/id/eprint/126
