NLP Basics

Natural Language Processing (NLP) is the field concerned with getting computers to interpret and generate human language.

Text Pre-processing

Machine Learning models operate on numbers, not raw characters. Before any text can be fed into a model, it must be cleaned and encoded numerically.

  1. Tokenization: Splitting a paragraph into sentences, and sentences into individual tokens (words or sub-word units).
  2. Stop-word Removal: Dropping extremely common words that carry little meaning on their own, like 'the', 'is', 'at', and 'which'.
  3. Stemming / Lemmatization: Reducing words to a common base form. Stemming strips suffixes with simple rules (e.g., "running" becomes "run"), while lemmatization maps each word to its dictionary form and so also handles irregular forms (e.g., "running" and "ran" both become "run").
  4. Vectorization / Word Embeddings: Converting tokens into arrays of numbers. Word embeddings are dense vectors in which related words (like "king" and "queen") are positioned close to each other in vector space.
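The four steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the stop-word list and the `LEMMAS` lookup table are tiny hand-written stand-ins for what an NLP library would provide, and the final step builds simple word-count vectors (a bag of words) rather than learned dense embeddings.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "is", "at", "which", "a", "an", "and", "on"}

# Hand-written lookup standing in for a real lemmatizer (hypothetical data).
LEMMAS = {"running": "run", "ran": "run", "runs": "run", "dogs": "dog"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(tokens):
    """Map each token to its base form when the lookup table knows it."""
    return [LEMMAS.get(t, t) for t in tokens]


def preprocess(text):
    """Run steps 1 to 3: tokenize, remove stop words, lemmatize."""
    return lemmatize(remove_stop_words(tokenize(text)))


def bag_of_words(docs):
    """Step 4: return (vocabulary, vectors), one count vector per document."""
    processed = [preprocess(d) for d in docs]
    vocab = sorted({tok for doc in processed for tok in doc})
    vectors = []
    for doc in processed:
        counts = Counter(doc)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


docs = ["The dog is running at the park.", "The dogs ran."]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['dog', 'park', 'run']
print(vectors)  # [[1, 1, 1], [1, 0, 1]]
```

Each position in a vector corresponds to one vocabulary word, so both documents share the "dog" and "run" dimensions even though their surface forms differ.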

Recurrent Neural Networks (RNNs)

While CNNs revolutionized Vision, Recurrent Neural Networks traditionally governed NLP. Standard feed-forward networks don't have "memory": they process each input completely independently. RNNs contain internal loops that carry a hidden state from one step to the next, allowing information from earlier words to persist and influence how the network processes the current word. This makes them well suited to sequential data like speech and text.
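The recurrence can be sketched as a single update rule applied once per word, h_t = tanh(W_x x_t + W_h h_(t-1) + b), where the hidden state h carries information forward. The weights and the two-dimensional "word" vectors below are made-up fixed values chosen for illustration; a real RNN would learn its weights during training.

```python
import math


def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_x @ x + W_h @ h_prev + b)."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        total += sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(sequence, W_x, W_h, b):
    """Feed the sequence one element at a time, reusing the hidden state."""
    h = [0.0] * len(b)  # the "memory" starts empty
    for x in sequence:
        h = rnn_step(x, h, W_x, W_h, b)
    return h


# Hypothetical, untrained weights: 2-d inputs, 2-d hidden state.
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.4, 0.2], [-0.6, 0.3]]
b = [0.0, 0.1]

word_a = [1.0, 0.0]
word_b = [0.0, 1.0]

# The same two words in a different order give a different final state,
# because each step depends on what came before it.
h_ab = run_rnn([word_a, word_b], W_x, W_h, b)
h_ba = run_rnn([word_b, word_a], W_x, W_h, b)
print(h_ab)
print(h_ba)
```

Note that the loop in `run_rnn` is inherently sequential: step t cannot start until step t-1 has produced its hidden state.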

Note: Today, RNNs have largely been superseded by Transformer architectures (the basis of models such as GPT, which powers ChatGPT), which process entire sequences in parallel rather than word-by-word.

Common NLP Use Cases

  • Sentiment Analysis: Reading thousands of customer reviews and automatically labeling each one as Positive or Negative.
  • Named Entity Recognition (NER): Scanning corporate documents to automatically extract organizations, dates, and people's names.
  • Machine Translation: Automatically converting text from one language to another, such as English to Spanish.
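As a rough illustration of the first use case, sentiment labeling can be sketched with a simple word-counting rule. This is a deliberately naive approach: the `POSITIVE` and `NEGATIVE` word sets are hypothetical mini-lexicons, the "Neutral" label for ties is an added assumption, and real systems typically use a trained classifier instead.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"great", "love", "excellent", "good", "fast"}
NEGATIVE = {"bad", "terrible", "slow", "broken", "hate"}


def label_sentiment(review):
    """Label a review by counting positive and negative words."""
    tokens = re.findall(r"[a-z']+", review.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"  # tie or no known words (assumption, not in the list above)


reviews = [
    "I love it, great battery life.",
    "Terrible screen and slow shipping.",
    "It arrived on Tuesday.",
]
for review in reviews:
    print(label_sentiment(review), "-", review)
```

A rule like this ignores negation ("not good") and context, which is one reason learned models are preferred in practice.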