Natural Language Processing

Natural language processing (NLP) enables computers to understand, interpret, and generate human language. It combines linguistics, computer science, and machine learning to bridge the gap between human communication and machine-readable representations.

Text Preprocessing

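Typical steps: tokenisation (splitting text into words or subwords), lowercasing, stop word removal (dropping very common words such as "the" and "of"), and stemming or lemmatisation (reducing words to a root or dictionary form). Pipelines often add part-of-speech (POS) tagging and named entity recognition (identifying names, places, and organisations).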
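The first few steps above can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the stop word list is a tiny hand-picked assumption, and crude_stem is a hypothetical suffix stripper standing in for a real stemmer or lemmatiser.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "on", "and", "of", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip one common suffix, keeping at least three characters.

    A toy stand-in for a real stemmer; it will mis-handle many words.
    """
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenise, lowercase, remove stop words, then stem."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]

print(preprocess("The cats are playing in the garden"))
# ['cat', 'play', 'garden']
```

POS tagging and named entity recognition need trained models or larger rule sets, so they are left out of this sketch.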

Text Representation

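Bag of Words: each document becomes a vector of word counts, ignoring word order. TF-IDF: term frequency weighted by inverse document frequency, so words that appear in most documents count for less. Word embeddings: dense vectors that place semantically similar words close together (Word2Vec, GloVe). Contextual embeddings (BERT, GPT): the vector for a word depends on the surrounding text, so the same word can have different representations in different sentences.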
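To make the first two representations concrete, here is a small sketch in plain Python. It assumes documents are already tokenised and non-empty, and it uses the unsmoothed variant idf = log(N / df); libraries often apply smoothing or normalisation, so their numbers will differ.

```python
import math
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

def tf_idf(docs):
    """Return (vocabulary, vectors) for a list of tokenised documents.

    tf  = count of the word in the document / document length
    idf = log(number of documents / documents containing the word)
    """
    vocabulary = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(t for doc in docs for t in set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in vocabulary}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([(counts[w] / len(doc)) * idf[w] for w in vocabulary])
    return vocabulary, vectors

# Toy corpus (made up for illustration).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
vocab, vectors = tf_idf(docs)
print(vocab)
print(bag_of_words(docs[0], vocab))
print(vectors[0])
```

In this toy corpus "the" occurs in every document, so its idf is log(1) = 0 and it contributes nothing to any TF-IDF vector, while "ran" and "dog" (one document each) get the largest weights.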

Classification and Sentiment

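Text classification assigns documents to categories (for example spam vs. non-spam, or topic labels). Sentiment analysis determines polarity: positive, negative, or neutral. Methods range from sentiment lexicons and classical classifiers (Naive Bayes, SVM) to fine-tuning pre-trained models such as BERT.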
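As one example of the classical methods, the sketch below implements a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. The class name, the training data, and the "spam"/"ham" labels are all made up for illustration; a real system would train on far more data.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.vocabulary = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus sum of log likelihoods (logs avoid underflow).
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for t in tokens:
                if t not in self.vocabulary:
                    continue  # skip words never seen in training
                count = self.word_counts[label][t]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training set (hypothetical).
train_docs = [
    ["win", "free", "prize", "now"],
    ["free", "offer", "click", "now"],
    ["meeting", "agenda", "for", "monday"],
    ["lunch", "on", "monday"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["free", "prize"]))      # spam
print(model.predict(["monday", "meeting"]))  # ham
```

The same structure works for sentiment analysis by swapping the labels for "positive" and "negative".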

Language Models

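Language models assign probabilities to word sequences, typically by predicting the next word from the preceding ones. Approaches have progressed from N-grams to RNN/LSTM networks to Transformers (GPT, BERT, T5), which use self-attention and achieve state-of-the-art results through large-scale pre-training.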
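The simplest case, a bigram (2-gram) model, estimates the probability of the next word from counts of adjacent word pairs. The sketch below is unsmoothed and uses a made-up corpus; the "<s>" and "</s>" boundary markers are a common convention adopted here as an assumption.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]  # sentence boundary markers
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Maximum-likelihood estimate of P(next | prev); empty if prev is unseen."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return {w: c / total for w, c in following.items()} if total else {}

# Toy corpus (hypothetical).
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]
counts = train_bigram(corpus)
print(next_word_probs(counts, "the"))
print(next_word_probs(counts, "cat"))
```

Here "cat" follows "the" in two of three sentences and "dog" in one, so the model gives them probabilities 2/3 and 1/3. Neural language models replace these count tables with learned parameters, which lets them generalise to word sequences never seen in training.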

Machine Translation

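Machine translation (MT) evolved from rule-based systems to statistical MT and then to neural MT (NMT). NMT uses sequence-to-sequence models with attention, which let the decoder focus on the most relevant source words at each output step. Google Translate moved to NMT in 2016 and has since adopted Transformer-based models.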
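The attention step can be illustrated with a few lines of plain Python. This sketch uses the scaled dot-product form found in Transformers (earlier sequence-to-sequence models used other scoring functions), and the vectors are made-up two-dimensional examples rather than learned representations.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Scores each key against the query, normalises the scores with
    softmax, and returns the weights plus the weighted sum of values.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    context = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(dim)
    ]
    return weights, context

# Two source positions (hypothetical encoder states).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
query = [2.0, 0.0]  # points in the same direction as the first key

weights, context = attention(query, keys, values)
print(weights)
print(context)
```

Because the query aligns with the first key, the first source position receives the larger weight and dominates the context vector. In a translation model, the query would come from the decoder state and the keys and values from the encoded source sentence.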

Chatbots

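Types: rule-based (pattern matching, as in ELIZA), retrieval-based (selecting the best response from a fixed set), and generative (producing responses with a language model). Modern assistants combine natural language understanding (NLU), dialogue management, and natural language generation (NLG).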
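A rule-based chatbot in the spirit of ELIZA can be sketched as an ordered list of patterns and response templates. The rules and wording below are invented for illustration and are not taken from the original ELIZA script, which also did things this sketch omits, such as swapping pronouns.

```python
import re

# Ordered (pattern, response template) pairs; the first match wins.
# These rules are hypothetical examples.
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\b(hello|hi)\b", re.IGNORECASE),
     "Hello. What would you like to talk about?"),
]
FALLBACK = "Please tell me more."

def respond(message):
    """Return the response for the first matching rule, else a fallback."""
    for pattern, template in RULES:
        match = pattern.search(message)
        if match:
            # Reuse the captured text, minus trailing punctuation.
            captured = [g.rstrip(".!?") for g in match.groups()]
            return template.format(*captured)
    return FALLBACK

print(respond("I feel tired."))        # Why do you feel tired?
print(respond("Hi there"))             # Hello. What would you like to talk about?
print(respond("The weather is nice"))  # Please tell me more.
```

A retrieval-based bot would replace the pattern list with a similarity search over stored responses, and a generative bot would replace it with a language model.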

Summary

NLP powers search engines, machine translation, chatbots, and content analysis by converting text into representations that models can classify, translate, and generate from.
