Abdirahman, Abdullahi Ahmed (2023) Enhancing Natural Language Processing in Somali Text Classification: A Comprehensive Framework for Stop Word Removal. International Journal of Engineering Trends and Technology.
Abstract
Text classification is a prominent field of study in information retrieval and natural language processing, where a
crucial component is the use of a stop word list. This list identifies frequently occurring words that carry little
information for classification and are therefore removed during pre-processing. Although various stop word lists have been
devised for English, a standardized stop word list tailored specifically to Somali text classification is yet to be
established. This research presents a comprehensive framework for stop word removal in the context of the Somali language,
aiming to enhance the effectiveness of various Natural Language Processing (NLP) tasks. The proposed methodology
encompasses several essential steps, including noise identification, noise removal, character normalization, data masking,
tokenization, part-of-speech (POS) tagging, and lemmatization. By analysing a substantial dataset containing 79,741,231 tokens and
71,871,585 words, the framework demonstrates that it can identify and eliminate stop words, thereby reducing the size of the
vector space and improving the performance of NLP algorithms. The research highlights the unique linguistic features of Somali,
such as contextual variations and morphological complexities. It discusses the potential applications of the developed stop
word list in sentiment analysis, information retrieval, and document classification. This work contributes valuable insights to
the field of language technology, particularly in underrepresented languages, and paves the way for further advancements in
NLP models tailored to diverse linguistic contexts.
Keywords - Somali language, Stopword removal, Natural Language Processing, Stopword list, Ontology.
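
The abstract lists the pre-processing steps but not their rules, so the following is only a minimal sketch of how character normalization, data masking, tokenization, and stop word identification might fit together. Everything in it is an assumption made for illustration: the regular expressions, the `<URL>` and `<NUM>` placeholders, the function names, and the document-frequency threshold `min_doc_ratio` are hypothetical and are not taken from the paper. POS tagging and lemmatization are omitted because they would need a Somali-specific tagger and lexicon.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical patterns and placeholders; the paper's actual rules are not given.
URL_RE = re.compile(r"https?://\S+")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")
TOKEN_RE = re.compile(r"<[A-Z]+>|[^\W\d_]+(?:'[^\W\d_]+)*")


def normalize(text: str) -> str:
    """Character normalization: unify Unicode forms, apostrophes, and case."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    return text.lower()


def mask(text: str) -> str:
    """Data masking: replace URLs and numbers with placeholder tokens."""
    text = URL_RE.sub(" <URL> ", text)
    return NUM_RE.sub(" <NUM> ", text)


def tokenize(text: str) -> list[str]:
    """Normalize, mask, then split into word and placeholder tokens."""
    return TOKEN_RE.findall(mask(normalize(text)))


def candidate_stopwords(docs: list[str], min_doc_ratio: float = 0.8) -> list[str]:
    """Return words that occur in at least `min_doc_ratio` of the documents.

    Document frequency is an assumed criterion, chosen here for simplicity.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each word once per document
    threshold = min_doc_ratio * len(docs)
    hits = [
        tok for tok, df in doc_freq.items()
        if df >= threshold and not tok.startswith("<")  # skip placeholders
    ]
    return sorted(hits, key=lambda tok: (-doc_freq[tok], tok))


def remove_stopwords(tokens: list[str], stopwords: list[str]) -> list[str]:
    """Drop every token that appears in the stop word list."""
    stop = set(stopwords)
    return [tok for tok in tokens if tok not in stop]


if __name__ == "__main__":
    # Toy strings built from common Somali words; illustrative only,
    # not vetted sentences and not data from the paper.
    docs = [
        "Magaaladu waa weyn tahay iyo qurux badan.",
        "Buuggu waa cusub yahay iyo jaban 25 shilin.",
        "Eeg https://example.com waa war iyo sawir.",
    ]
    stops = candidate_stopwords(docs)
    print(stops)
    print(remove_stopwords(tokenize(docs[0]), stops))
```

In this sketch, any word present in at least 80% of the documents becomes a stop word candidate. A real list would normally be reviewed by a Somali speaker before use, since high frequency alone does not guarantee that a word is uninformative for classification.
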
| Item Type: | Article |
|---|---|
| Subjects: | A General Works > AC Collections. Series. Collected works |
| Divisions: | Faculty of Computing |
| Depositing User: | Unnamed user with email crd@smiad.edu.so |
| Date Deposited: | 20 Sep 2025 08:01 |
| Last Modified: | 20 Sep 2025 08:01 |
| URI: | https://repository.simad.edu.so/id/eprint/228 |
