The Landscape of AI in African Languages and Linguistics

Written by kingabimbola | Published 2023/05/20

TL;DR: NLP is already being used in many instances across Africa, including robotics and conversational AI. African researchers are now building large language models spanning multiple African languages, so that more advanced systems can be trained to do things in a stream: a complex process is narrowed down into a single action (modelling).

Some years back, I asked Google Assistant a simple question (I can’t remember what it was), and it returned an unrelated answer. I asked again and got a different answer. In the end, I had to type my question.

At that point, I concluded that voice recognition software was not made for Africans.

My spoken English and accent have improved through personal development and exposure, and I can tell that many voice recognition apps and software are becoming more adaptive to African accents. The truth, however, is that voice recognition for Africans still has a long way to go.

I still wonder why we do not have many apps that can be prompted in major African languages, many of which have over 10 million native speakers. So, I decided to speak with a field linguist and academic researcher about the current state of artificial intelligence and natural language processing in African contexts and languages.

Q: Hello, can you tell me about yourself, and as a linguist and NLP AI enthusiast, what are your interests?

Hi, I am Olanrewaju Samuel.

I am interested in computational phonology, dataset building, annotation and curation, Natural Language Processing and field linguistics.

My primary mentor is Dr Akinbo Samuel. Lately, I have found great pleasure in protein linguistics, protein folding, and mathematical linguistics. One scholar I respect a lot is Jeffrey Heinz. His works have influenced my research, and I have lent my expertise to deep learning NLP and robotics. My current study area is protein folding and the interrelationship between quantum physics, quantum chemistry, and linguistics.

I am not yet strict with my research goals, but I am focused on developing my expertise and exploring my possibilities for now. Not for the certifications per se, but for self-development. So, I am seeking to develop myself while also attempting to complete my programme here and move on to some other things.

Q: What are some of your publications in this field?

I have collaborated with many great individuals on different publications. One of my recent linguistics papers is “An acoustic study of vocal expression in two genres of Yoruba oral poetry.” Most of my featured NLP publications are with the Masakhane NLP group.

Q: What’s your current work in Kigali, Rwanda?

I am teaching a course entitled “Natural Language Processing for Linguists”. Basically, I am teaching natural language processing for linguists within African contexts, in Kigali, Rwanda.

I am tasked with demonstrating the nuances of building, annotating, curating, analysing, and publishing multilingual datasets for different NLP tasks, such as building large language models (LLMs). A large language model brings multiple language systems together to function within a single stream. We try to achieve that through lateralization, which is, roughly, training the AI system with a pattern or template. That pattern then becomes the basis for its other applications.
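To make the dataset side of that concrete, here is a minimal sketch of curating a few annotated examples in different African languages and tokenizing them with one shared multilingual tokenizer, so a single model could be trained on all of them in one stream. The libraries (Hugging Face datasets and transformers), the xlm-roberta-base checkpoint, and the example sentences are illustrative assumptions on my part, not tools or data named in the interview.

```python
# A minimal sketch of curating a small multilingual dataset and tokenizing it
# with a shared multilingual tokenizer. The libraries, checkpoint, and example
# sentences are illustrative choices, not resources named in the interview.
from datasets import Dataset
from transformers import AutoTokenizer

# Hand-annotated examples in three Nigerian languages (illustrative content).
examples = [
    {"text": "Bawo ni o se wa?", "lang": "yor", "label": "greeting"},
    {"text": "Kedu ka i mere?", "lang": "ibo", "label": "greeting"},
    {"text": "Ina kwana?", "lang": "hau", "label": "greeting"},
]

dataset = Dataset.from_list(examples)

# One shared subword vocabulary acts as the common "template" across languages,
# so a single model can consume all of them through the same input stream.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

tokenized = dataset.map(tokenize, batched=True)
print(tokenized[0]["input_ids"][:10])
```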

Beyond conversational AI, we are looking at doing something meaningful in generative AI, which still builds on lateralization: the model permutes its data and generates results through mathematical computation, such as probability.
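As a toy illustration of generation as probabilistic computation, the snippet below samples the next word from a made-up probability distribution; the vocabulary and weights are invented purely for illustration.

```python
# A toy sketch of generation as probability: sample the next word from a
# distribution a model might have learned. The words and probabilities here
# are invented for illustration.
import random

next_word_probs = {"ẹ": 0.5, "kú": 0.3, "àárọ̀": 0.2}
words = list(next_word_probs)
weights = list(next_word_probs.values())

# Each call may produce a different continuation, weighted by probability.
print(random.choices(words, weights=weights, k=1)[0])
```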

Q: Can you talk about the current situation and applications of AI/NLP within the African context?

NLP has been used in many instances across Africa, including robotics and conversational AI. A typical example of conversational AI is Lagos’ Alaye, which helps domestic tourists (Nigerians from other states) find their way around Lagos, a mega-city and state, and identify locations such as restaurants, clubs, and shops, and even check traffic conditions, using the popular Nigerian Pidgin (Naija Pidgin).

We are developing AI models that can be trained to perform tasks: a complex system or process is narrowed down into a simple command string (modelling). That is the practical application of NLP in robotics in Africa at the moment.
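As a rough illustration of what narrowing a request down to a command string can look like, here is a minimal rule-based sketch that maps a free-form, Pidgin-flavoured utterance to a single command a robot or chatbot could act on. The intents, keywords, and command strings are hypothetical and not part of any system mentioned in the interview; a real system would use a trained intent classifier rather than keyword matching.

```python
# A minimal sketch of "modelling" in this sense: mapping a free-form utterance
# to a single command string a robot or chatbot can act on. The intents,
# keywords, and commands are hypothetical illustrations.
from typing import Optional

INTENT_KEYWORDS = {
    "find_restaurant": ["chop", "restaurant", "food"],
    "check_traffic": ["traffic", "go-slow", "hold-up"],
    "find_route": ["how i go reach", "direction", "road to"],
}

INTENT_COMMANDS = {
    "find_restaurant": "SEARCH place=restaurant",
    "check_traffic": "QUERY traffic=current",
    "find_route": "ROUTE destination=<slot>",
}

def utterance_to_command(utterance: str) -> Optional[str]:
    """Narrow a complex request down to one simple command string."""
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return INTENT_COMMANDS[intent]
    return None  # no match: fall back to asking the user to rephrase

print(utterance_to_command("Abeg, which side I fit chop for Yaba?"))
# -> "SEARCH place=restaurant"
```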

Currently, in linguistics, the application of AI is mostly in automation, although linguistic models are also built into different AI applications, such as robots and chatbots, among others.

We have some folks doing really great stuff, like Masakhane, Mbaza NLP, NLPGhana, and Kenya NLP.

Q: What are the challenges of AI applications in the context of African languages?

A major challenge for Africa in finding global relevance in the AI industry is the limitation of language resources (data). Africa is highly multilingual, yet there are limited datasets to supply the vast amount of data needed for the various AI projects going on around the world. For instance, the largest speech dataset we have for an African language is about 2,000 hours, and the recognised datasets are far smaller still; that is vanishingly small compared with English, for which vastly more audio data exists.

Whatever happens next in AI will happen to high-resource languages first. Even if it were to happen to African languages, we don’t have the systems to power it. We are lagging behind because we do not have enough to work with, and that stems from an almost lifelong problem: our lack of documentation.

Take Nigeria, for example: over 200 ethnic groups, yet only three languages dominate. Unlike Yoruba, Igbo, and Hausa, the smaller groups and languages have little data (they are low-resource). That is what we are working on at Mbaza NLP: collecting data from low-resource languages and using it for programmable speech recognition, including speech-to-text (STT) and text-to-speech (TTS).
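For a sense of what the STT side involves, here is a minimal sketch using the Hugging Face transformers library to transcribe a recorded clip with a pretrained multilingual model. The whisper-small checkpoint and the audio path are illustrative assumptions, and in practice a low-resource language would need fine-tuning on locally collected audio like the data described above.

```python
# A minimal STT sketch, assuming the Hugging Face transformers library and a
# pretrained multilingual speech model; the checkpoint and audio file are
# illustrative assumptions, not resources named in the interview.
from transformers import pipeline

# Load a pretrained multilingual speech-recognition model. A low-resource
# African language would typically need fine-tuning on locally collected audio.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Transcribe a locally recorded clip (hypothetical path).
result = asr("recordings/hausa_sample.wav")
print(result["text"])
```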

AI and NLP technicians are not investing because they don’t believe in it, or they think there isn’t enough data to justify a return on investment. So, we are hoping our current behind-the-scenes work will be the breakthrough.

Moreover, Africa is marginalised in the global market for linguistic AI and NLP because the most popular search engines are Asian and Western (American, especially). Also, for some of the work we do here, we cannot take credit as Africans because of the sponsorship arrangements.

Q: What are the African countries with the most strides in the applications of African NLP?

The African countries that have made the most impact include South Africa, Kenya, and Rwanda (those guys are crazy!). Nigeria is also trying, but most people who ought to be exploring the space are chasing academic certifications rather than real development. We value our languages, but we are not building datasets with them. We would rather speak of our languages as private heritage when we ought to be investing in documentation to preserve and protect them.

Q: So, business-wise, where does Africa stand in the commercialization of NLP for African languages?

Honestly, there isn’t much beyond the business of selling datasets. Even then, those who fund the projects give a lot, but the amount that reaches the field agents is very small compared with what was originally put in.

Q: That brings me to the question of ethics. Is there any ethical value in collecting and selling people’s data? And is it fair that these projects attract large sums of money while the primary sources of these languages receive very little (sometimes nothing)? Are there protections for this data or for these sources?

There is no law against data collection. The most important thing is that the data is collected willingly from native speakers and that they are rewarded for their time. However, all activities are expected to align with the African Union’s data privacy regulations. Also, linguistics research that involves data collection usually requires consent from the native speakers or respondents.

And to your second question, there is nothing anyone can do about the amount of money that eventually reaches the people involved in this work. The most important thing is that everyone commits to the project willingly. The people are told that they will be recorded and rewarded, and as long as they are okay with the price, there is no “unfairness.”

Q: If anyone wants to get into NLP and language work as it stands, what do you recommend?

It is a wide field. Many areas already have foundations and are in the building stages, but we still have aspects that are barely foundational. What I’d recommend for anyone is to get involved in language data collection and analysis. We need analysis of datasets as much as we need the data itself.

Hence, I recommend joining or volunteering with enthusiastic, data-driven groups: volunteer for data collection and analysis, learn the nomenclature, and so on.
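As a small example of the kind of dataset analysis being recommended here, the sketch below summarises a handful of collected audio clips by language before publication; the records, language codes, and fields are hypothetical.

```python
# A tiny sketch of the "analysis" side: summarising a collected speech corpus
# by language before it is published. The records here are hypothetical.
from collections import Counter

records = [
    {"lang": "kin", "duration_s": 7.2},
    {"lang": "kin", "duration_s": 5.9},
    {"lang": "swa", "duration_s": 6.4},
]

clip_counts = Counter(r["lang"] for r in records)
hours = {
    lang: sum(r["duration_s"] for r in records if r["lang"] == lang) / 3600
    for lang in clip_counts
}

for lang in clip_counts:
    print(f"{lang}: {clip_counts[lang]} clips, {hours[lang]:.4f} hours")
```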

Final Thoughts

Africa continues to be under-represented in voice recognition software and in the commands and prompts of various AI and NLP systems. The narrative will change when Africans set out to build datasets, put their languages out there, and keep investing in documentation. Even so, you would be impressed by some of the AI and NLP creations coming out of Africa.

In my research and in following leads, I have seen robots being prompted in local African languages, more local chatbots fit for different African contexts (tourism, exploration), and some languages being used in IoT home appliances. However, I believe we should be doing more, considering the massive AI and NLP revolution going on in the world right now. For now, we have more datasets for text classification than we have audio data, yet we need more of both. Data is the new currency, and I honestly hope Africans will do this right before foreigners do it wrong. (Yes, I have read false historical data reported in a published book before; that is what improper documentation does to us.)

