Towards An Automatic Text Analysis and Summarization In Yoruba Language Using Transfer Learning Approach In Natural Language Processing
Keywords:
Extractive automatic text summarization, Transfer learning, Natural language processing, Machine learning, Yoruba language
Abstract
Text summarization is a tool that helps users efficiently find useful information in an immense
amount of information. It is increasingly used in both the public and private sectors, for example in the
telecommunication industry, research institutes and web-based information retrieval. Yoruba Language,
one of the three major languages of Nigeria, spoken mainly in the South-Western part of the country and in some
communities in other countries such as Brazil, Cuba, Haiti, Togo, Benin Republic, and Trinidad and Tobago, has
been classified as a language in serious danger of extinction by the UNESCO Red Book on endangered languages.
There are not many research works in the field of natural language processing for Yoruba Language and, as far
as this study is aware, none on text summarization. Moreover, when Internet searches are made on Yoruba
subjects, the response is often a large volume of information that is difficult for individuals to read patiently
and comprehend. Therefore, this study aims to develop a system that automatically retrieves, categorizes and
summarizes Yoruba documents according to users' needs. The design of the summarization system is divided into
three stages, namely pre-processing, feature extraction and summary generation. The developed system will
read a Yoruba document which will be broken into several paragraphs using a Paragraph Segmentation module.
In the Tokenization module, paragraphs will be broken into sentences. The Normalization module will then
eliminate punctuation, special characters and digits, and finally the sentences will be broken into words.
The Stop Word Filtering module will remove the stop words and reduce the text to more useful words. The
Yoruba Morphology Lexical Analyser module will process every sentence into a Subject-Verb-Object pattern.
The Part of Speech Tagging module will then assign every word in the sentence a tag representing its Part of
Speech (PoS). Feature extraction will begin with Keyword Frequency, which estimates the relevance of each
word by counting how many times it occurs in the document. The keywords with the highest frequency are likely
to be present in the generated summary. This study proposes an extractive automatic text summarization tool
for the Yoruba language using
Transfer Learning in Natural Language Processing. At the end of the study, it is expected that a summarization
system will have been developed that can generate a concise summary of any given Yoruba text. This paper also
presents views on recent techniques and approaches to automatic text summarization, with a focus on English,
Chinese, Persian, Arabic, Spanish and Hausa texts. Finally, it discusses the challenges and methodologies of
automatic text summarization.
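
The pre-processing and Keyword Frequency steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system proposed in the study: the function names, the sentence-scoring rule (average keyword frequency per sentence) and the tiny stop-word set are all hypothetical, and the morphology analysis, PoS tagging and transfer learning components are not covered.

```python
import re
import unicodedata
from collections import Counter

# Illustrative placeholder only (assumption): a real system would need a
# curated Yoruba stop-word list.
SAMPLE_STOP_WORDS = {"ni", "ti", "sí", "àti"}


def split_paragraphs(text):
    """Paragraph segmentation: split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Break a paragraph into sentences after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def normalize(sentence):
    """Lower-case and drop punctuation, digits and special characters.

    Letters (Unicode category L*) and combining marks (M*) are kept so that
    Yoruba tone marks and under-dots survive normalization.
    """
    sentence = unicodedata.normalize("NFC", sentence.lower())
    kept = [ch if unicodedata.category(ch)[0] in "LM" else " " for ch in sentence]
    return " ".join("".join(kept).split())


def tokenize(sentence, stop_words=SAMPLE_STOP_WORDS):
    """Normalize a sentence, split it into words and filter stop words."""
    stop = {unicodedata.normalize("NFC", w.lower()) for w in stop_words}
    return [w for w in normalize(sentence).split() if w not in stop]


def summarize(text, n_sentences=2, stop_words=SAMPLE_STOP_WORDS):
    """Return the n highest-scoring sentences in their original order.

    Score (assumed rule): mean document-level frequency of a sentence's
    remaining keywords.
    """
    sentences = [s for p in split_paragraphs(text) for s in split_sentences(p)]
    tokens = [tokenize(s, stop_words) for s in sentences]
    freq = Counter(w for sent in tokens for w in sent)
    scores = [
        sum(freq[w] for w in sent) / len(sent) if sent else 0.0
        for sent in tokens
    ]
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    return [sentences[i] for i in sorted(ranked[:n_sentences])]
```

Calling `summarize(document_text, n_sentences=3)` on a Yoruba document would return its three highest-scoring sentences. Averaging the keyword frequencies, rather than summing them, is one possible design choice that keeps long sentences from being favoured merely for their length.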