Natural Language Processing

Natural language processing (NLP) is the subfield of AI concerned with enabling computers to understand, interpret, manipulate, and generate human language in both text and speech.

NLP combines linguistics, statistics, and deep learning to bridge the gap between human communication and computational processing, handling language in written, spoken, or structured form.

Core NLP Tasks

  • Tokenisation — splitting text into words or subwords (see the sketch after this list)
  • Part-of-speech tagging — labelling grammatical roles (noun, verb, adjective…)
  • Named entity recognition (NER) — identifying people, organisations, locations, dates
  • Sentiment analysis — classifying the emotional polarity of text
  • Text classification — assigning documents to predefined categories
  • Machine translation — converting text between languages
  • Summarisation — producing shorter versions of longer documents
  • Question answering (QA) — extracting or generating answers from context
  • Coreference resolution — determining which mentions (pronouns, names, noun phrases) refer to the same entity
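
As a toy illustration of tokenisation, the first task above, the minimal Python sketch below splits text with a regular expression. This is a simplification for exposition; production systems use trained subword tokenisers such as WordPiece or BPE.

  import re

  def tokenise(text: str) -> list[str]:
      # Lowercase, then capture runs of word characters or single
      # punctuation marks. Trained subword tokenisers (WordPiece, BPE)
      # replace this step in real systems.
      return re.findall(r"\w+|[^\w\s]", text.lower())

  print(tokenise("NLP bridges human language and computation."))
  # ['nlp', 'bridges', 'human', 'language', 'and', 'computation', '.']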

The Transformer Revolution

Before transformers, NLP relied on RNNs/LSTMs and hand-crafted features. The BERT model (Bidirectional Encoder Representations from Transformers, 2018) demonstrated that pre-training on large text corpora and fine-tuning on downstream tasks vastly outperformed prior approaches. This paradigm — pre-train then fine-tune — became standard.
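
To make the pre-training half of that paradigm concrete, the sketch below queries BERT's masked-language-model head, the objective it was pre-trained on. It assumes the Hugging Face transformers and PyTorch libraries, which the article itself does not prescribe.

  from transformers import AutoTokenizer, AutoModelForMaskedLM
  import torch

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

  # BERT was pre-trained to reconstruct [MASK] tokens from context.
  inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")
  with torch.no_grad():
      logits = model(**inputs).logits

  # Decode the most likely token at the masked position.
  mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
  print(tokenizer.decode([logits[0, mask_pos].argmax().item()]))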

Today, encoder-only models (BERT family) excel at classification and NER; decoder-only models (GPT family) excel at generation; encoder-decoder models (T5, mT5) excel at translation and summarisation.
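
The fine-tuning half looks like this in outline: load a pre-trained encoder, attach a task head, and train the combination on labelled data. The sketch again assumes Hugging Face transformers; the two-label head here is a hypothetical binary task, and it is freshly initialised, so its outputs are meaningless until fine-tuning.

  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  # num_labels=2 stands in for a binary task, e.g. sentiment polarity.
  model = AutoModelForSequenceClassification.from_pretrained(
      "bert-base-uncased", num_labels=2
  )

  inputs = tokenizer("A thoroughly enjoyable read.", return_tensors="pt")
  with torch.no_grad():
      logits = model(**inputs).logits  # shape (1, 2); head is untrained here
  print(logits)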

Multilingual NLP

mBERT and XLM-RoBERTa are pre-trained on text from 100+ languages, enabling cross-lingual transfer: a model fine-tuned on English data for a task can often perform the same task passably in other languages it saw during pre-training. However, low-resource languages (those with limited training data) often remain underserved.
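
One hedged illustration of that coverage: the fill-mask pipeline below runs a single xlm-roberta-base checkpoint on an English and a Malay sentence. It probes the shared pre-trained model rather than demonstrating fine-tuned transfer itself, and again assumes the Hugging Face transformers library.

  from transformers import pipeline

  # XLM-RoBERTa uses <mask> as its mask token.
  fill = pipeline("fill-mask", model="xlm-roberta-base")

  for sentence in (
      "The capital of Malaysia is <mask>.",   # English
      "Ibu negara Malaysia ialah <mask>.",    # Malay
  ):
      top = fill(sentence)[0]                 # highest-scoring completion
      print(f"{sentence} -> {top['token_str']} ({top['score']:.2f})")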

References

  1. Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019.
  2. MIMOS Berhad (2022). BERTi: Bahasa Malaysia BERT Technical Report.