Natural Language Processing
NLP enables computers to understand, interpret, and generate human language. It combines linguistics, computer science, and machine learning to bridge human communication and computer understanding.
Text Preprocessing
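Common steps: tokenisation (splitting text into words or subwords), lowercasing, stop word removal, and stemming or lemmatisation (reducing words to a root or dictionary form). Pipelines often add annotation steps: part-of-speech (POS) tagging and named entity recognition (identifying names, places, organisations).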
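The first few steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the stop word list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip one common suffix if at least 3 characters remain (toy stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenise, lowercase, remove stop words, then stem."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The cats are chasing the mice"))
```

Note that the toy stemmer turns "chasing" into "chas" and leaves the irregular plural "mice" untouched; a lemmatiser with a dictionary would map these to "chase" and "mouse".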
Text Representation
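Bag of Words: vectors of word counts that ignore word order. TF-IDF: term frequency weighted by inverse document frequency, so words common across the whole corpus count for less. Word embeddings: dense vectors that place semantically similar words close together (Word2Vec, GloVe). Contextual embeddings (BERT, GPT): a word's vector depends on the sentence it appears in, so the same word gets different vectors in different contexts.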
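A small sketch of the two count-based representations, using only the standard library. It assumes one common textbook definition: tf is the count divided by document length, and idf is ln(N / df) without smoothing. Libraries often use smoothed or normalised variants, so their numbers will differ. The function names are illustrative.

```python
import math
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return the count of each vocabulary word in the token list."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


def tf_idf(docs):
    """docs is a list of token lists; returns (vocabulary, tf-idf vectors)."""
    vocabulary = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(t for doc in docs for t in set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in vocabulary}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[w] / length * idf[w] for w in vocabulary])
    return vocabulary, vectors


docs = [["cat", "sat"], ["cat", "ran"]]
vocabulary, vectors = tf_idf(docs)
print(vocabulary)
print(vectors)
```

In this example "cat" occurs in every document, so its idf is ln(1) = 0 and it contributes nothing to either vector, while "sat" and "ran" each get a positive weight in the one document that contains them.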
Classification and Sentiment
Text classification assigns documents to predefined categories (e.g. spam vs. not spam, topic labels). Sentiment analysis determines polarity (positive, negative, or neutral). Methods: sentiment lexicons, Naive Bayes, SVM, BERT fine-tuning.
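As an illustration of one of the listed methods, here is a minimal multinomial Naive Bayes classifier with add-one (Laplace) smoothing. The class name, the toy training data, and the choice to skip words never seen in training are all assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocabulary = set()

    def fit(self, documents, labels):
        for tokens, label in zip(documents, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for t in tokens:
                if t not in self.vocabulary:
                    continue  # ignore words never seen in training
                numerator = self.word_counts[label][t] + 1
                score += math.log(numerator / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes()
model.fit(
    [["win", "money", "now"], ["free", "money"],
     ["meeting", "at", "noon"], ["lunch", "meeting"]],
    ["spam", "spam", "ham", "ham"],
)
print(model.predict(["free", "money"]))
print(model.predict(["lunch", "meeting"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and smoothing keeps a single unseen word from forcing a class probability to zero.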
Language Models
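Language models assign probabilities to word sequences, typically by predicting the next (or a masked) word from its context. Approaches progressed from n-grams to RNN/LSTM to Transformers (GPT, BERT, T5), which use self-attention and achieve state-of-the-art results through large-scale pre-training.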
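The simplest of these, an n-gram model with n = 2, can be sketched in a few lines. This version uses unsmoothed maximum-likelihood estimates; the `<s>` and `</s>` boundary markers and the function names are conventions chosen for this example.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if prev is unseen."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return followers[nxt] / total if total else 0.0


def most_likely_next(counts, prev):
    """Return the most frequent follower of prev, or None if prev is unseen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
model = train_bigram(corpus)
print(next_word_probability(model, "the", "cat"))
print(most_likely_next(model, "the"))
```

Because the model only looks one word back, it cannot capture longer-range dependencies, which is the limitation that recurrent and Transformer models address.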
Machine Translation
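Approaches evolved from rule-based to statistical to neural MT (NMT). NMT uses sequence-to-sequence models with attention, which lets the decoder focus on the most relevant source words at each output step. Google Translate uses transformer-based NMT.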
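To make the attention step concrete, the sketch below computes scaled dot-product attention for a single decoder query over a set of encoder states. This is the Transformer-style formulation, chosen here for brevity; early sequence-to-sequence systems used other scoring functions. The vectors are made-up toy values, not learned representations.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Return (weights, context) for one query over key/value pairs."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Context vector: the weighted average of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Two toy "source word" states; the query is most similar to the first one.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, context = attend([1.0, 0.0], keys, values)
print(weights)
print(context)
```

The query matches the first key more closely, so the first source state receives the larger weight and dominates the context vector passed on to the decoder.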
Chatbots
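Types: rule-based (ELIZA), retrieval-based (selecting a reply from a fixed set of responses), and generative (language models). Modern assistants combine natural language understanding (NLU), dialogue management, and natural language generation (NLG).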
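A rule-based chatbot in the spirit of ELIZA can be approximated with a list of pattern and response pairs. The rules and replies below are invented for illustration and are not taken from the original ELIZA script.

```python
import re

# Each rule pairs a regular expression with a reply template.
# Captured text is substituted into the template's {0} placeholder.
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\b(hello|hi)\b", re.IGNORECASE),
     "Hello. What would you like to talk about?"),
]
DEFAULT_REPLY = "Please tell me more."


def respond(message):
    """Return the reply of the first matching rule, else a default reply."""
    text = message.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*match.groups())
    return DEFAULT_REPLY


print(respond("I feel tired."))
print(respond("The weather is nice"))
```

Rules are tried in order and the first match wins, so more specific patterns should come before general ones. The bot has no model of meaning: it reflects the user's words back through fixed templates.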
Summary
NLP powers search engines, translation, chatbots, and content analysis by converting text into representations that models can classify, translate, and generate from.