NLP Basics
Natural Language Processing (NLP) is the field concerned with enabling computers to process, analyze, and generate human language.
Text Pre-processing
Machine learning models operate on numbers, not raw text. Before any text can be fed into a model, it must be cleaned and encoded numerically.
- Tokenization: Splitting a paragraph into sentences, and sentences into individual tokens (words or sub-word units).
- Stop-word Removal: Dropping extremely common words that carry little meaning on their own, like 'the', 'is', 'at', and 'which'.
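- Stemming / Lemmatization: Reducing words to a base form. Stemming strips suffixes with heuristic rules (e.g., "running" becomes "run"), while lemmatization maps each word to its dictionary form and so also handles irregular forms (e.g., "ran" becomes "run").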
- Vectorization / Word Embeddings: Converting text into arrays of numbers. Simple vectorizers count word occurrences, whereas word embeddings are dense vectors in which related words (like "king" and "queen") lie close to each other in vector space.
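The steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not a production pipeline: the stop-word list, the suffix-stripping `naive_stem` helper, and the example sentences are all hypothetical, and the vectorizer is a simple bag-of-words count rather than a learned word embedding.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "at", "which", "a", "an", "and", "on"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip a few common suffixes. A toy heuristic, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


doc_a = preprocess("The cats jumped on the mat.")
doc_b = preprocess("A cat is jumping at the mat.")
vocabulary = sorted(set(doc_a) | set(doc_b))
vec_a = bag_of_words(doc_a, vocabulary)
vec_b = bag_of_words(doc_b, vocabulary)
similarity = cosine_similarity(vec_a, vec_b)
```

After pre-processing, both sentences reduce to the same three stems, so their count vectors point in the same direction. Libraries such as NLTK or spaCy provide far more robust tokenizers, stemmers, and lemmatizers than this sketch.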
Recurrent Neural Networks (RNNs)
While CNNs revolutionized computer vision, Recurrent Neural Networks long dominated NLP. Standard feed-forward networks have no "memory": they process each input independently. RNNs contain internal loops that carry a hidden state from one step to the next, so information from previous words persists and influences how the current word is processed. This makes them well suited to sequential data such as speech and text.
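The "internal loop" can be made concrete with a single recurrence, sketched here as a basic (Elman-style) cell: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). The weights, dimensions, and input vectors below are hypothetical values chosen only for illustration; a real network would learn them during training.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One recurrence step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    h_new = []
    for i in range(len(h_prev)):
        total = b[i]
        total += sum(w_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(w_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, w_xh, w_hh, b):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = [0.0] * len(b)  # the "memory" starts empty
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b)
    return h


# Hypothetical fixed weights: 2-dimensional inputs, 2-dimensional hidden state.
W_XH = [[0.5, -0.3], [0.1, 0.8]]
W_HH = [[0.2, 0.4], [-0.6, 0.3]]
B = [0.0, 0.0]

# The same two "word" vectors presented in opposite orders.
seq_ab = [[1.0, 0.0], [0.0, 1.0]]
seq_ba = [[0.0, 1.0], [1.0, 0.0]]

h_ab = run_rnn(seq_ab, W_XH, W_HH, B)
h_ba = run_rnn(seq_ba, W_XH, W_HH, B)
```

Because each step depends on the previous hidden state, the two orderings end in different final states, which is exactly what a network that treats every input independently cannot capture.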
Note: Today, RNNs have largely been superseded by Transformer architectures (the basis of models such as the GPT family behind ChatGPT), which process entire sequences in parallel rather than word by word.
Common NLP Use Cases
- Sentiment Analysis: Reading thousands of customer reviews and labeling each one as Positive or Negative.
- Named Entity Recognition (NER): Scanning corporate documents to automatically extract organizations, dates, and people's names.
- Machine Translation: Automatically translating text between languages, for example from English to Spanish.
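As a rough illustration of the first use case, sentiment can be approximated by counting words from a small lexicon. This is only a sketch: the `POSITIVE` and `NEGATIVE` word sets are hypothetical, the extra "Neutral" label for ties is an assumption added here, and practical systems typically rely on trained models rather than fixed word lists.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow"}


def classify_sentiment(review):
    """Label a review by comparing counts of positive and negative words."""
    tokens = re.findall(r"[a-z']+", review.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"  # tie or no lexicon words found (an added assumption)


reviews = [
    "Great product, fast shipping",
    "Terrible quality and slow support",
    "It arrived on Tuesday",
]
labels = [classify_sentiment(r) for r in reviews]
```

A word-counting approach like this misses negation ("not good") and sarcasm, which is one reason sequence-aware models are preferred for the task.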