Natural Language Processing

Natural language processing (NLP) is the subfield of AI concerned with enabling computers to understand, interpret, manipulate, and generate human language in both text and speech.

NLP combines linguistics, statistics, and deep learning to bridge the gap between human communication and computational processing, handling language in written, spoken, or structured form.

Core NLP Tasks

  • Tokenisation — splitting text into words or subwords (see the sketch after this list)
  • Part-of-speech tagging — labelling grammatical roles (noun, verb, adjective…)
  • Named entity recognition (NER) — identifying people, organisations, locations, dates
  • Sentiment analysis — classifying the emotional polarity of text
  • Text classification — assigning documents to predefined categories
  • Machine translation — converting text between languages
  • Summarisation — producing shorter versions of longer documents
  • Question answering (QA) — extracting or generating answers from context
  • Coreference resolution — determining which mentions (pronouns, names, noun phrases) refer to the same entity
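
As a toy illustration of tokenisation, the first task above, the minimal Python sketch below splits text with a regular expression. This is a simplification for exposition; production systems use trained subword tokenisers such as WordPiece or BPE.

  import re

  def tokenise(text: str) -> list[str]:
      # Lowercase, then capture runs of word characters or single
      # punctuation marks. Trained subword tokenisers (WordPiece, BPE)
      # replace this step in real systems.
      return re.findall(r"\w+|[^\w\s]", text.lower())

  print(tokenise("NLP bridges human language and computation."))
  # ['nlp', 'bridges', 'human', 'language', 'and', 'computation', '.']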

The Transformer Revolution

Before transformers, NLP relied on RNNs/LSTMs and hand-crafted features. The BERT model (Bidirectional Encoder Representations from Transformers, 2018) demonstrated that pre-training on large text corpora and fine-tuning on downstream tasks vastly outperformed prior approaches. This paradigm — pre-train then fine-tune — became standard.
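
To make the pre-training half of that paradigm concrete, the sketch below queries BERT's masked-language-model head, the objective it was pre-trained on. It assumes the Hugging Face transformers and PyTorch libraries, which the article itself does not prescribe.

  from transformers import AutoTokenizer, AutoModelForMaskedLM
  import torch

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

  # BERT was pre-trained to reconstruct [MASK] tokens from context.
  inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")
  with torch.no_grad():
      logits = model(**inputs).logits

  # Decode the most likely token at the masked position.
  mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
  print(tokenizer.decode([logits[0, mask_pos].argmax().item()]))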

Today, encoder-only models (BERT family) excel at classification and NER; decoder-only models (GPT family) excel at generation; encoder-decoder models (T5, mT5) excel at translation and summarisation.
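
The fine-tuning half looks like this in outline: load a pre-trained encoder, attach a task head, and train the combination on labelled data. The sketch again assumes Hugging Face transformers; the two-label head here is a hypothetical binary task, and it is freshly initialised, so its outputs are meaningless until fine-tuning.

  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  # num_labels=2 stands in for a binary task, e.g. sentiment polarity.
  model = AutoModelForSequenceClassification.from_pretrained(
      "bert-base-uncased", num_labels=2
  )

  inputs = tokenizer("A thoroughly enjoyable read.", return_tensors="pt")
  with torch.no_grad():
      logits = model(**inputs).logits  # shape (1, 2); head is untrained here
  print(logits)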

Multilingual NLP

mBERT and XLM-RoBERTa are pre-trained on text from 100+ languages, enabling cross-lingual transfer: a model fine-tuned on English data for a task can often perform the same task passably in other languages it saw during pre-training. However, low-resource languages (those with limited training data) often remain underserved.
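
One hedged illustration of that coverage: the fill-mask pipeline below runs a single xlm-roberta-base checkpoint on an English and a Malay sentence. It probes the shared pre-trained model rather than demonstrating fine-tuned transfer itself, and again assumes the Hugging Face transformers library.

  from transformers import pipeline

  # XLM-RoBERTa uses <mask> as its mask token.
  fill = pipeline("fill-mask", model="xlm-roberta-base")

  for sentence in (
      "The capital of Malaysia is <mask>.",   # English
      "Ibu negara Malaysia ialah <mask>.",    # Malay
  ):
      top = fill(sentence)[0]                 # highest-scoring completion
      print(f"{sentence} -> {top['token_str']} ({top['score']:.2f})")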

References

  1. Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019.
  2. MIMOS Berhad (2022). BERTi: Bahasa Malaysia BERT Technical Report.