BantuBERTa : using language family grouping in multilingual language modeling for Bantu languages

dc.contributor.advisor	Marivate, Vukosi
dc.contributor.coadvisor	Akinyi, Verrah
dc.contributor.postgraduate	Parvess, Jesse
dc.date.accessioned	2023-10-09T08:00:41Z
dc.date.available	2023-10-09T08:00:41Z
dc.date.created	2023-04
dc.date.issued	2023
dc.description	Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2023.	en_US
dc.description.abstract	This study investigated whether a multilingual Bantu pretraining corpus could be created from freely available data. The dataset was built from Bantu text extracted from datasets that are freely available online (mainly from Hugging Face). The multilingual language model pretrained on this data (BantuBERTa) proved to be predictive across multiple Bantu languages on a higher-order NLP task (named entity recognition, NER) and on a simpler NLP task (classification). This shows that the dataset can be used for Bantu multilingual pretraining and for transfer to multiple Bantu languages. The study also investigated whether using this Bantu dataset could benefit transfer learning in downstream NLP tasks. BantuBERTa underperformed the other models (XLM-R, mBERT, and AfriBERTa) benchmarked on MasakhaNER’s Bantu language tests (Swahili, Luganda, and Kinyarwanda). In contrast, it produced state-of-the-art results for the Bantu language benchmarks (Zulu and Lingala) in the African News Topic Classification dataset. It was surmised that the pretraining dataset size (which was 30% smaller than AfriBERTa’s) and dataset quality were the main causes of the poor performance in the NER test. We believe this is a case-specific failure due to poor data quality, resulting from a pretraining dataset that consisted mainly of web-scraped pages, chiefly mC4 and CC-100 Bantu text. However, on lower-order NLP tasks, like classification, pretraining solely on languages within the language family seemed to benefit transfer to other similar languages within the family. This potentially opens a method for effectively including low-resourced languages in low-level NLP tasks.	en_US
dc.description.availability	Unrestricted	en_US
dc.description.degree	MIT (Big Data Science)	en_US
dc.description.department	Computer Science	en_US
dc.identifier.citation	*	en_US
dc.identifier.other	A2023	en_US
dc.identifier.uri	http://hdl.handle.net/2263/92766
dc.language.iso	en	en_US
dc.publisher	University of Pretoria
dc.rights	© 2021 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
dc.subject	UCTD	en_US
dc.subject	Multilingual language modeling	en_US
dc.subject	BantuBERTa	en_US
dc.subject	Bantu Languages	en_US
dc.title	BantuBERTa : using language family grouping in multilingual language modeling for Bantu languages	en_US
dc.type	Mini Dissertation	en_US

Files

Original bundle

Name: BantuBERTa__Using_Language_Family_Grouping_in_Multilingual_Language_Modeling_for_Bantu_Languages.pdf
Size: 1.56 MB
Format: Adobe Portable Document Format
Description: Mini Dissertation

Collections

Theses and Dissertations (University of Pretoria)
Theses and Dissertations (Computer Science)