Part 01
Vectors
Machine learning runs on one move repeated millions of times: measuring how much two arrows agree. A vector is the arrow; everything else is built on it.
01 · The problem
What is a vector even for?
A machine can only do arithmetic. So before it can learn anything about a word, an image, or a person, that thing must become numbers — and not just any numbers. They must be arranged so that relationships survive: similar things end up close together, related things point the same way.
A vector is the object that pulls this off. It turns "similar", "relevant", and "aligned" into quantities a machine can compute and gradually adjust — and gradual adjustment is exactly what learning is.
02 · The object
A direction and a length — not a list
A vector is an arrow. It has a direction (which way it points) and a magnitude (how long it is). The list of numbers you usually see, like (3, 1), is just the arrow's address on a chosen grid — three steps along one axis, one along another. The arrow is the real thing; the list is bookkeeping.
Why this matters: the same arrow is written as different lists if you change the grid, yet it is the same arrow. In machine learning, choosing a good grid — a good set of coordinates — is half the game.
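To make "same arrow, different list" concrete, here is a minimal sketch. It assumes the new grid is simply the old one rotated; the 30-degree angle and the helper name `coords_in_rotated_grid` are illustrative choices, not part of the text above. The coordinates change while the length stays put.

```python
import math

def coords_in_rotated_grid(v, angle_deg):
    """Coordinates of the same 2-D arrow on a grid rotated counterclockwise by angle_deg."""
    a = math.radians(angle_deg)
    x, y = v
    # Rotating the grid by +a re-expresses the arrow as if it had turned by -a.
    return (x * math.cos(a) + y * math.sin(a),
            -x * math.sin(a) + y * math.cos(a))

def length(v):
    return math.hypot(v[0], v[1])

arrow = (3.0, 1.0)                               # the list on the original grid
relisted = coords_in_rotated_grid(arrow, 30.0)   # hypothetical second grid

print(arrow, length(arrow))
print(relisted, length(relisted))   # different numbers, same length
```

The two printed lists differ, but both lengths come out as √10 up to rounding: the arrow did not move, only its address did.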
+ the worked numbers
Two ways to combine arrows, both of which you will see constantly:
Adding = walk one then the other. (3,1) + (1,2) = (4,3): go 3 east and 1 north, then 1 east and 2 north — you land at 4 east, 3 north.
Scaling = stretch or shrink without turning. 2 × (3,1) = (6,2): same direction, twice as long. −1 × (3,1) = (−3,−1): same line, opposite way.
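The two operations above can be sketched in a few lines of plain Python. The function names `add` and `scale` are illustrative, and vectors are represented as tuples purely for convenience.

```python
def add(u, v):
    """Walk u, then v: add matching coordinates."""
    return tuple(a + b for a, b in zip(u, v))

def scale(c, v):
    """Stretch or shrink v by the factor c; a negative c flips it."""
    return tuple(c * a for a in v)

print(add((3, 1), (1, 2)))   # (4, 3)
print(scale(2, (3, 1)))      # (6, 2)
print(scale(-1, (3, 1)))     # (-3, -1)
```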
03 · The core operation
The dot product: how much do two arrows agree?
The dot product takes two vectors and returns a single number measuring how much they point the same way. Large and positive means strongly aligned; zero means perpendicular, no agreement at all; negative means they point against each other.
The clean way to picture it is projection: lay one arrow's shadow onto the other's direction, and read off the shadow's length. That length, multiplied by the length of the arrow it falls on, is the dot product.
dot product
- Meaning: One number saying how much two vectors point the same way.
- Why it exists: Every comparison in ML (similarity, relevance, a neuron firing) needs a single, cheap, adjustable agreement score.
- Example: (1,2)·(3,1) = 1×3 + 2×1 = 5, a positive score, so they partly agree.
+ the worked numbers
Take w = (1,2) and x = (3,1). Two routes, always the same answer.
By components: w·x = 1×3 + 2×1 = 5
By geometry: |w||x|cosθ = √5 · √10 · (1/√2) = √25 = 5 ✓ (here θ = 45°)
The shadow of x on w's direction has length w·x / |w| = 5/√5 ≈ 2.24 — how far x reaches in w's direction.
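As a rough check of these numbers, the sketch below computes the dot product both ways. The helper names `dot` and `norm` are illustrative. The angle is measured from each arrow's own heading with `math.atan2`, which only works in two dimensions, so that the geometric route does not quietly reuse the component route.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

w, x = (1.0, 2.0), (3.0, 1.0)

# Route 1: multiply matching coordinates and add.
by_components = dot(w, x)

# Route 2: lengths times the cosine of the angle between the arrows.
theta = math.atan2(w[1], w[0]) - math.atan2(x[1], x[0])
by_geometry = norm(w) * norm(x) * math.cos(theta)

# Length of x's shadow on w's direction.
shadow = by_components / norm(w)

print(by_components)            # 5.0
print(round(by_geometry, 6))    # agrees with route 1 up to rounding
print(round(shadow, 2))         # 2.24
```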
+ go deeper: the formal version
w·x = w₁x₁ + w₂x₂ + ... + wₙxₙ = |w||x|cosθ
The left form is what a computer runs; the right form is what it means. That the two are equal is forced: it follows from the geometry, not from any choice. Divide both lengths out and you keep only the angle: that is cosine similarity, cosθ = w·x / (|w||x|). Choosing cosine over the raw dot product is an empirical call: you do it when magnitude is noise (a long document versus a short one on the same topic).
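A small sketch of the long-versus-short document case, under made-up assumptions: each document is a two-entry vector of word counts, and the long document has the same topic mix as the short one at ten times the length. The names `short_doc`, `long_doc`, and `query` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product with both lengths divided out: only the angle is left."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

query = (1.0, 2.0)
short_doc = (1.0, 2.0)     # hypothetical word counts
long_doc = (10.0, 20.0)    # same direction, ten times the length

print(dot(query, short_doc), dot(query, long_doc))   # 5.0 50.0
print(cosine(query, short_doc), cosine(query, long_doc))
```

The raw dot product ranks the long document ten times higher purely because it is longer; the two cosines agree with each other, since both documents point the same way.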
04 · The payoff
A neuron is just a dot product
Here is where it becomes machine learning. An artificial neuron carries its own arrow, called its weights. When an input arrives, the neuron asks exactly one question: how much does this input point in my direction? That question is a dot product.
So learning is the search for good directions to ask about. Stack many neurons and you ask many questions at once; stack layers and you ask questions about the answers. The whole tower is built from this one move.
+ the worked numbers
A neuron computes w·x + b, where b is a bias that shifts the threshold. With w=(1,2), x=(3,1), b=−4: w·x + b = (1×3 + 2×1) − 4 = 5 − 4 = 1.
Change the weights and you change the question; that change is what training does.
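The same computation as a minimal sketch. The function name `neuron` is illustrative, no activation function is applied, and the second weight vector (1, −2) is a hypothetical alternative chosen only to show the output moving.

```python
def neuron(w, x, b):
    """Output of one neuron before any activation: dot product plus bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

x = (3, 1)

print(neuron((1, 2), x, -4))    # 1
# Same input, different weights: a different question, a different answer.
print(neuron((1, -2), x, -4))   # -3
```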
+ go deeper: layers, attention, the gradient
A layer of neurons stacks one weight-arrow per neuron into the rows of a matrix, so the layer is a batch of dot products computed at once: Wx + b. Writing data as columns and weights as rows is convention.
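A layer can be sketched as one dot product per row. In the example below only the first row and bias come from the worked numbers; the other two rows and biases are made up for illustration, and `layer` is an illustrative name.

```python
def layer(W, x, b):
    """Wx + b: one dot product per row of W, each shifted by its own bias."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

W = [(1, 2),     # the neuron from the worked numbers
     (0, 1),     # hypothetical second neuron
     (-1, 1)]    # hypothetical third neuron
b = [-4, 0, 2]
x = (3, 1)

print(layer(W, x, b))   # [1, 1, 0]
```

In practice a numerical library would do this as a single matrix-vector product; the loop is written out here only to show that nothing beyond dot products is involved.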
Attention scores relevance as Q·K — a dot product asking how much what one token seeks aligns with what another offers. And the gradient that training follows is itself a vector — the steepest-uphill direction — with each step moving against it.
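Both ideas can be sketched with the same `dot` helper. Everything here is a simplifying assumption: the query and keys are invented, the scaling and softmax that real attention applies afterwards are left out, and the gradient step uses a squared-error loss on a single neuron with a hypothetical target of 0 and learning rate of 0.01.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Attention-style relevance: one query scored against several keys.
query = (1.0, 0.0)
keys = [(2.0, 0.0), (0.0, 3.0), (-1.0, 0.0)]
scores = [dot(query, k) for k in keys]
print(scores)   # [2.0, 0.0, -1.0]

# One gradient step for loss = (w.x + b - target)^2.
w, x, b, target = (1.0, 2.0), (3.0, 1.0), -4.0, 0.0
error = dot(w, x) + b - target               # 1.0
grad = tuple(2 * error * xi for xi in x)     # uphill direction for the loss
lr = 0.01                                    # hypothetical learning rate
w_new = tuple(wi - lr * gi for wi, gi in zip(w, grad))   # step against it
new_error = dot(w_new, x) + b - target

print(error, round(new_error, 6))   # the error shrinks after the step
```

The gradient has one entry per weight, so it lives in the same space as w; stepping against it nudges the weight arrow toward a direction with lower loss.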
05 · The leap
From space you can see to space you can't
On paper a vector lives in two or three dimensions. In machine learning it lives in hundreds. Each axis is no longer "north" or "east"; it is a feature or a learned coordinate of meaning.
You lose the picture, but you lose nothing else. The angle between two 300-dimensional arrows is just as real, and just as computable, as the angle between two arrows on a page. The dot product does not care how many dimensions there are.
+ go deeper: why high dimensions are strange
Something with no everyday parallel happens up there: pick two directions at random in a high-dimensional space and they are almost always nearly perpendicular. The typical cosine shrinks roughly like 1/√d.
So when an embedding reports a cosine of 0.6 between two words, that is not "moderately similar" — it is astronomically non-random. Alignment is a rare, loud signal, which is exactly why the dot product carries so much information in ML.
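The near-perpendicularity claim is easy to probe numerically. This sketch assumes random directions are drawn with independent Gaussian coordinates, one common way to get directions with no preferred axis. The trial count, seed, and dimensions are arbitrary choices, and the output is an estimate, not an exact value.

```python
import math
import random

def cosine(u, v):
    d = sum(a * b for a, b in zip(u, v))
    return d / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def typical_abs_cosine(d, trials=200, seed=0):
    """Average |cosine| between pairs of random directions in d dimensions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        total += abs(cosine(u, v))
    return total / trials

results = {d: typical_abs_cosine(d) for d in (2, 30, 300, 1000)}
for d, avg in results.items():
    # Compare the measured average with the 1/sqrt(d) rule of thumb.
    print(d, round(avg, 3), round(1 / math.sqrt(d), 3))
```

The measured averages should fall steadily as d grows and stay within a small constant factor of 1/√d, which is why a cosine of 0.6 in a few hundred dimensions stands so far outside what chance produces.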
06 · Guardrails
What a vector is not
- Not just a list of numbers. The list is one address on one grid; change the grid and the numbers change while the arrow stays put.
- Aligned is not identical. The dot product rewards both pointing-the-same-way and being long; two arrows can agree in direction yet differ wildly in length.
- 3-D intuition does not transfer cleanly. In high dimensions almost everything is nearly perpendicular to everything — a fact that feels wrong but is the engine behind why similarity scores mean something.
07 · The fine print
The assumption hiding underneath
The dot product only measures meaning if the space was built so geometry lines up with meaning. Feed in raw pixels and two photos of the same cat, shifted by one pixel, sit far apart — the geometry barely tracks "same cat".
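A toy version of the shifted-photo problem, under heavy simplification: the "image" is a six-pixel row with one-pixel-wide stripes, which is the worst case for a one-pixel shift. Real photos overlap more after a shift, but the same effect appears along every sharp edge. All names and pixel values are made up.

```python
import math

def cosine(u, v):
    d = sum(a * b for a, b in zip(u, v))
    return d / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

stripes = [1, 0, 1, 0, 1, 0]    # hypothetical image
shifted = [0, 1, 0, 1, 0, 1]    # the same image moved one pixel right
flat = [1, 1, 1, 1, 1, 1]       # a different, featureless image

print(cosine(stripes, shifted))         # 0.0
print(round(cosine(stripes, flat), 3))  # 0.707
```

In raw pixel space the shifted copy looks perpendicular to the original, while an unrelated flat image scores higher. The geometry is answering "do the same pixels light up?", not "is this the same thing?".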
So the quiet assumption is that a good coordinate system already exists. The whole job of representation learning is to build one — to bend the space until plain dot products mean what we want. The dot product is exactly as smart as the space it runs in, and not one bit more.
✓ · Rebuild it from a blank page
Test yourself
If you can answer these without looking, the concept is yours.
- Reconstruct. In plain words then in symbols: what does a single neuron compute, what geometric question is it asking, and what does it mean when its output is exactly zero?
- Transfer. "king" is a frequent word (large arrow), "monarchy" is rarer (small arrow), and they point almost the same way. Why would the raw dot product understate their similarity, what would you compute instead, and what does that alternative throw away?
- What breaks. An engineer feeds raw 784-pixel digit images straight into cosine similarity and gets poor matches. Using the hidden assumption above, why does the geometry fail — and what must happen to the space first?