Computer Vision Basics
Computer vision is the field concerned with enabling machines to extract and interpret information from digital images and videos.
How Computers "See"
To a computer, a 28x28 pixel grayscale image is not a picture; it is a 2D array (matrix) of 784 numbers. In the common 8-bit encoding, each number is a pixel intensity ranging from 0 (black) to 255 (white).
A color (RGB) image is a 3D array: it stacks three such 2D grids, one each for the red, green, and blue channels, so a 28x28 RGB image holds 28 x 28 x 3 = 2,352 numbers.
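The representation described above can be sketched in a few lines. This is a minimal illustration that assumes NumPy and uses synthetic random pixel values rather than a real image file; the variable names are chosen only for this example.

```python
import numpy as np

# Synthetic 8-bit grayscale image: 28 rows x 28 columns, values in 0..255.
# (Random values stand in for a real picture; this is an assumption.)
rng = np.random.default_rng(0)
gray = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

print(gray.shape)  # (28, 28)
print(gray.size)   # 784

# Synthetic RGB image: height x width x 3 color channels.
rgb = rng.integers(0, 256, size=(28, 28, 3), dtype=np.uint8)
print(rgb.shape)   # (28, 28, 3)

# Each channel on its own is an ordinary 2D grid of intensities.
red_channel = rgb[:, :, 0]
print(red_channel.shape)  # (28, 28)
```

The height x width x channels layout used here is one common convention; some libraries place the channel axis first instead.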
Convolutional Neural Networks (CNNs)
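Standard fully connected ("dense") neural networks are poorly suited to image processing because they require flattening the 2D image into a 1D vector, which discards its spatial layout. A pixel that is part of an eye is closely related to the pixel directly below it, but after flattening the two end up a full row width apart, and the network has no built-in notion that they are neighbors.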
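To make the flattening problem concrete, the sketch below (assuming NumPy and a 28x28 image in row-major order) shows how far apart two vertically adjacent pixels land once the image becomes a 1D vector. The pixel position is arbitrary and chosen only for illustration.

```python
import numpy as np

# A 28x28 "image" whose pixel values are simply their own flat indices,
# which makes it easy to see where each pixel ends up after flattening.
width = 28
image = np.arange(28 * width).reshape(28, width)
flat = image.reshape(-1)  # row-major flattening into a 784-element vector

# Pick a pixel and the pixel directly below it (arbitrary example position).
row, col = 10, 5
idx_here = row * width + col
idx_below = (row + 1) * width + col

# In 2D these pixels touch; in the flat vector they are a full row apart.
gap = idx_below - idx_here
print(gap)  # 28
```

Flattening does not delete any values, but nothing in the flat vector tells a dense layer that positions 285 and 313 were vertical neighbors; it would have to learn that from data.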
CNNs address this. Instead of connecting every pixel to every neuron, a CNN applies a mathematical operation called convolution: it slides a small filter (or "kernel") across the image one position at a time and, at each position, computes a weighted sum of the pixels under it. Early layers learn filters that respond to simple local features such as vertical edges, horizontal lines, and curves. Deeper layers combine these simple features into more complex concepts such as wheels, eyes, or faces.
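The sliding-filter idea can be sketched with a naive loop. This is an illustrative implementation, not how deep learning libraries do it internally: it assumes NumPy, no padding, and a stride of 1, and the helper name `convolve2d_valid` is made up for this example. The kernel is a hand-written Sobel-style vertical edge detector; in a real CNN the kernel values are learned during training.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return
    the weighted sum at each position. Like CNN layers, this does not
    flip the kernel (strictly speaking, it is cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# 6x6 test image: dark left half (0) and bright right half (255),
# which creates a single vertical edge down the middle.
image = np.zeros((6, 6), dtype=np.float64)
image[:, 3:] = 255.0

# Hand-set vertical edge kernel (Sobel-style), used here as an assumption.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])

response = convolve2d_valid(image, kernel)
print(response.shape)  # (4, 4)
print(response[0])     # [   0. 1020. 1020.    0.]
```

The response is zero over the flat dark and bright regions and large only where the window straddles the edge, which is what "detecting a vertical edge" means in practice.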
Key Vision Tasks
- Image Classification: "What is the primary subject of this image?" (e.g., Cat vs Dog).
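- Object Detection: "What subjects are in the image, and where is each one?" (outputs a bounding box and class label for each object, e.g., vehicles or pedestrians).
- Image Segmentation: "Which pixels belong to each subject?" Assigns a label to every pixel, separating a subject from the background at pixel-level precision (widely used in applications such as autonomous driving).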
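The three tasks above differ mainly in the shape of their output. The sketch below assumes NumPy and uses hand-written, made-up values in place of a trained model; the class names, scores, and mask are hypothetical and serve only to show what each kind of output looks like.

```python
import numpy as np

# Classification: one score per class for the whole image.
# (Scores are made up; a real model would produce them.)
class_names = ["cat", "dog"]
scores = np.array([0.9, 0.1])
predicted_label = class_names[int(np.argmax(scores))]
print(predicted_label)  # cat

# Segmentation: one label per pixel. Here 1 marks the subject and
# 0 marks the background in a hypothetical 8x8 image.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:5, 3:7] = 1

# Detection: a bounding box per object, as (x_min, y_min, x_max, y_max).
# Deriving it from the mask shows that a box is a coarser summary
# of the same region.
rows, cols = np.nonzero(mask)
box = (int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max()))
print(box)  # (3, 2, 6, 4)
```

Real detectors predict boxes directly rather than deriving them from a mask; the derivation here is only meant to relate the two output types.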