Tensors
Intro
Tensors have been around for centuries and happen to be the core of the current machine learning revolution. They are in all state-of-the-art models currently. Yet despite being fundamental building blocks, they're often misunderstood because they're abstracted away by APIs and frameworks such as PyTorch, Candle, JAX, etc. In this post, I want to develop a deep understanding of tensors and give you a solid paradigm for understanding tensors mathematically and computationally. I expect most people in the ML space have experience with tensors and linear algebra to some extent, so we'll start with familiar concepts and build toward more powerful ideas that should give you new insights.
Tensors
The Foundation: From One Equation to Many
A linear algebra class usually begins by describing a linear equation that looks like:
What is the value of ax + by + cz?
If we know a, b, c and x, y, z this is trivial to compute. Let's say a=1, b=2, c=3, x=4, y=5, z=6, then:
1*4 + 2*5 + 3*6 = 4 + 10 + 18 = 32
Now with just one equation this notation is perfectly fine, but as we move to many equations it becomes a bit cluttered:
a₁x + b₁y + c₁z
a₂x + b₂y + c₂z
a₃x + b₃y + c₃z
a₄x + b₄y + c₄z
In this case we need a total of 15 variables, so people came up with notation to make describing these systems more convenient.
Vectors: Our First Tensor (1D)
Moving back to our one equation case, we take a, b, c and put them into what we'll call a 'list of numbers' structure: W = [a, b, c]. And we take x, y, z and put them in a similar structure X = [x, y, z]. These structures are called vectors, but more generally they are called 1 dimensional tensors.
Now we can describe how to multiply them:
W*X = [a*x, b*y, c*z]
This takes us most of the way, but we expect a single number as output, not a list of numbers. If we could SUM all items in this list we would have an answer. I'll use SUM as shorthand for now and show you the proper tensor operation for this later in the post.
SUM(W*X) = SUM([a*x, b*y, c*z]) = a*x + b*y + c*z
All tensors have a shape. A shape is a list of numbers that describes the number of dimensions and how many elements are in each dimension. In the case of our 1D W tensor, its shape is [3], because we have 1 dimension with 3 numbers.
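To make this concrete, here is a minimal PyTorch sketch using the numbers from the example above; the elementwise product plays the role of W*X and .sum() plays the role of SUM:

import torch

W = torch.tensor([1.0, 2.0, 3.0])   # a, b, c
X = torch.tensor([4.0, 5.0, 6.0])   # x, y, z

print(W.shape)            # torch.Size([3])
print(W * X)              # tensor([ 4., 10., 18.])  -- elementwise product
print((W * X).sum())      # tensor(32.)              -- the SUM step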
Matrices: Scaling Up to 2D Tensors
Moving back to our multi-equation case we can make our W variable a 'list of lists of numbers' structure called a matrix, which is more generally a 2 dimensional tensor.
W = [
[a₁, b₁, c₁],
[a₂, b₂, c₂],
[a₃, b₃, c₃],
[a₄, b₄, c₄],
]
If you're wondering why I'm describing multiplication this way, I'll address that soon.
Since a vector is a 'list of numbers' structure, we can call a matrix a list of vectors. A matrix is a 2 dimensional object so it will have a shape with 2 numbers. In the case of our new W tensor it has a shape of [4, 3], showing that the first dimension (rows) has 4 elements and the second dimension (columns) has 3 elements.
Similar to last time we can do a multiplication.
W*X = [
[a₁*x, b₁*y, c₁*z],
[a₂*x, b₂*y, c₂*z],
[a₃*x, b₃*y, c₃*z],
[a₄*x, b₄*y, c₄*z],
]
And for now we can say SUM only applies to the innermost dimension (the columns).
SUM(W*X) = [
a₁*x + b₁*y + c₁*z,
a₂*x + b₂*y + c₂*z,
a₃*x + b₃*y + c₃*z,
a₄*x + b₄*y + c₄*z,
]
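As a quick PyTorch sketch with made-up numbers, the multiply-then-SUM pattern for the matrix case looks like this:

import torch

W = torch.arange(12, dtype=torch.float32).reshape(4, 3)  # shape [4, 3]
X = torch.tensor([4.0, 5.0, 6.0])                        # shape [3]

prod = W * X              # X is repeated across the rows -> shape [4, 3]
out  = prod.sum(dim=-1)   # SUM over the innermost dimension -> shape [4]
print(out)                # equal to W @ X (up to float rounding)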
Indexing: Looking Inward
Quickly, one last thing before moving on, let's discard our variable names. Instead we are going to index directly into the tensor to describe these equations.
In our example above, since we know the shape of the matrix, we can come up with a way to look up any number without having to name it. We use coordinate notation, also more widely called subscript notation.
Note: I'll use 1-based indexing throughout this post (W[1,1] for the first element) rather than the 0-based indexing common in programming. This creates nice symmetry when counting from the beginning (1, 2, 3) versus the end (-1, -2, -3).
So now the items inside of our W matrix above are:
W = [
[W[1,1], W[1,2], W[1,3]],
[W[2,1], W[2,2], W[2,3]],
[W[3,1], W[3,2], W[3,3]],
[W[4,1], W[4,2], W[4,3]],
]
You can see how each coordinate is unique, and each coordinate's leftmost index describes its position in the topmost list, and each subsequent index describes its position in the nested list at the next level.
Bringing in X from above using the same notation we can write our base equations like so:
W[1,1]*X[1] + W[1,2]*X[2] + W[1,3]*X[3]
W[2,1]*X[1] + W[2,2]*X[2] + W[2,3]*X[3]
W[3,1]*X[1] + W[3,2]*X[2] + W[3,3]*X[3]
W[4,1]*X[1] + W[4,2]*X[2] + W[4,3]*X[3]
Aside: A Note on Traditional Linear Algebra
If you are even vaguely familiar with linear algebra you may be confused as to why I'm using a different notation to describe this process.
W*X^T = [W[1]*X[1] + W[2]*X[2] + W[3]*X[3]]
Traditional linear algebra would write this as W*X^T, where the transpose (^T) artificially converts our 1D vector into a 2D matrix just to make the multiplication work.
This is a very useful notation for matrices and basic linear algebra because this is by far the most common operation and has a nice geometric intuition. So why are we discarding the notation that has been sufficient for the last 200 or so years? The issue is that in this notation matrices are the first-class citizens; everything is considered a 2 dimensional tensor.
For example, if we have a vector, we are allowed to "push" it into a higher dimension by taking the transpose. The transpose is well defined for matrices: we just "flip" the matrix over its top-left to bottom-right diagonal. But that is an inherently 2 dimensional operation. Thinking about this geometrically makes it clearer: you can imagine flipping a square over its diagonal, but a line doesn't have a diagonal to flip over. Linear algebra forces us to pretend it does, arbitrarily turning a horizontal arrangement into a vertical one. For a vector with n elements, this means treating it as a 1×n matrix and transposing it into an n×1 one, essentially creating a dimension out of nothing. We'll touch on this later, but if possible we'd like to avoid these extraneous operations.
As for the geometric intuition, the traditional explanation describes how a vector in a given basis would be represented in the canonical basis. It relies on the idea of projecting a vector onto a collection of other vectors that form a new basis. I think it's more intuitive to think about what's actually happening computationally: each basis vector is being scaled by its corresponding component of the input vector, and then these scaled vectors are added together tip-to-tail. This process essentially finds the vector pointing to a specific lattice point in the canonical basis. You can think of it as finding the diagonal path to the first grid intersection.
Shapes: Dynamic and Fixed
With vectors, our W vector has a shape of [3], but we could imagine a situation where a longer vector was needed.
For example:
W[1]*X[1] + W[2]*X[2] + W[3]*X[3] + W[4]*X[4] + W[5]*X[5] + W[6]*X[6]
In this case our W vector would have a different shape, [6], but would still be a 1 dimensional tensor. Rather than writing out specific examples for every possible length, we can insert a variable into the shape to represent a size for a dimension that is yet to be decided. Let's call this variable n. So the shape of W is now [n], which should also be the same shape as our X vector.
When n equals 3 we have our original situation:
W[1]*X[1] + W[2]*X[2] + W[3]*X[3]
When n equals 6 we have our new equation:
W[1]*X[1] + W[2]*X[2] + W[3]*X[3] + W[4]*X[4] + W[5]*X[5] + W[6]*X[6]
For arbitrary n:
W[1]*X[1] + W[2]*X[2] + W[3]*X[3] + … + W[n]*X[n]
This works for our matrices too. We can describe W to have shape [m,n] and X to have shape [n]. For arbitrary m, n:
W[1,1]*X[1] + W[1,2]*X[2] + … + W[1,n]*X[n]
W[2,1]*X[1] + W[2,2]*X[2] + … + W[2,n]*X[n]
…
W[m,1]*X[1] + W[m,2]*X[2] + … + W[m,n]*X[n]
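Spelled out as plain Python loops (a sketch using 0-based indexing, unlike the 1-based equations above), the arbitrary m, n pattern is:

def matvec(W, X):
    # W has shape [m, n], X has shape [n]; returns a list of length m.
    m, n = len(W), len(X)
    out = []
    for i in range(m):
        total = 0
        for j in range(n):
            total += W[i][j] * X[j]   # W[i,j] * X[j], summed over j
        out.append(total)
    return out

print(matvec([[1, 2, 3], [4, 5, 6]], [7, 8, 9]))  # [50, 122]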
The Pattern: Building Higher Dimensions
Before we extend this to 3 dimensions, let's move backwards. What is a zero dimensional tensor? A tensor without a dimension? So if a matrix is a 'list of lists of numbers', and a vector is a 'list of numbers', a 0 dimensional object must be a number, also called a scalar.
Now let's move up to 3 dimensions. If we think of vectors as lines and matrices as squares, then a tensor of 3 dimensions could be thought of as a cube. Sometimes it's hard to wrap your head around what a 3 dimensional tensor would look like, but extending our pattern:
0D - scalar = number
1D - vector = 'list of numbers' = list of scalars
2D - matrix = 'list of lists of numbers' = list of vectors
3D - 'list of lists of lists of numbers' = list of matrices
So thinking of a 3D tensor as a list of matrices, let me give you a concrete example: we can think of each matrix as a greyscale 2D image. We can stack three 2D greyscale images into a 3 color image where each image supplies a different color channel. That's one way people took pictures before color film: they took 3 pictures with red, green, and blue filters, then combined them.
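Here is a minimal sketch of that stacking idea in PyTorch, using random tensors as stand-ins for the three greyscale photographs:

import torch

h, w = 64, 64
red   = torch.rand(h, w)   # greyscale image taken through a red filter
green = torch.rand(h, w)   # ... green filter
blue  = torch.rand(h, w)   # ... blue filter

image = torch.stack([red, green, blue])  # shape [3, h, w]: a list of matrices
print(image.shape)                       # torch.Size([3, 64, 64])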
I hope you can imagine how this would extend to higher dimensional tensors.
Singleton Dimensions: A Fascinating Side Effect
Singleton dimensions are dimensions that have size one. While simple, they're surprisingly powerful for tensor operations. You can conceptualize it like "wrapping" lower dimensional tensors in another tensor.
Let's start with a scalar:
5.0 has shape []
And then let's add a singleton dimension and make it a vector:
[5.0] has shape [1]
And let's add another singleton dimension and make it a matrix:
[[5.0]] has shape [1, 1]
And for higher dimensional tensors:
[[[5.0]]] has shape [1, 1, 1]
[[[[5.0]]]] has shape [1, 1, 1, 1]
[[[[[5.0]]]]] has shape [1, 1, 1, 1, 1]
Adding a singleton dimension anywhere in a tensor doesn't change the data; it only changes how we index into the tensor. For example, while a scalar 5.0 is just that, the vector [5.0] requires indexing into its first entry to get the scalar back, and for [[5.0]] we would need to index into its first entry in both dimensions to get our scalar back. We'll get into more complex versions of these reindexing operations shortly.
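In PyTorch this wrapping and unwrapping looks like the following sketch; unsqueeze adds a singleton dimension and plain indexing removes one:

import torch

x = torch.tensor(5.0)        # shape []     (a scalar)
v = x.unsqueeze(0)           # shape [1]    -> tensor([5.])
m = v.unsqueeze(0)           # shape [1, 1] -> tensor([[5.]])

print(m.shape)               # torch.Size([1, 1])
print(m[0, 0])               # tensor(5.) -- index back down to the scalar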
Tensor Operations
Einop Notation: The language of shapes
Going forward we are going to be making extensive use of einop notation because it makes tensor operations much more readable than traditional approaches. I will introduce it using the identity operation, the operation that doesn't do anything.
If we have a vector with a shape of [n], we can write the identity operation in einop notation like so:
n -> n
The arrow shows the transformation from input shape to output shape. Here, 'n' represents any size, so this works for vectors of any length. This is a function that takes in one 1D tensor, and outputs one 1D tensor, in function notation it looks like this:
f(x) = x
In python it looks like this:
def f(x: Tensor1D) -> Tensor1D:  # Tensor1D stands in for any 1-dimensional tensor type
    return x
It is the simplest function.
The zero dimensional identity has a certain elegance:
->
There's something satisfying about how the notation strips away everything except the fundamental concept of transformation.
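If you want to try this in code, the einops library implements this notation; here is a minimal sketch of the 1D identity, assuming einops is installed:

import torch
from einops import rearrange

x = torch.tensor([1.0, 2.0, 3.0])
y = rearrange(x, 'n -> n')   # the identity: the same tensor back
print(torch.equal(x, y))     # True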
Permutation: The Simplest Tensor Operation
The first operation we are going to talk about is changing the order of tensor dimensions. If you've worked in any machine learning framework you've likely used this op and have an intuition for how it works. But just so we're on the same page, let me discuss this op extensively as a jumping off point for later. It's the perfect way to get comfortable with how einop notation works.
For a 1 dimensional vector there's only one dimension to work with, so as you move through the elements, you're always just incrementing that single coordinate. The only permutation of a 0 or 1 dimensional tensor is the identity.
e.g. as we discussed before, for a vector of size n:
0D: ->
1D: n -> n
Now once we start working with matrices and higher, permutation starts to make slightly more sense.
When we index into a tensor, let's say our matrix W, there is an "order" to the coordinates. For example, it is not necessarily true that W[i, j] = W[j, i]. If we want the matrix to be in that order we can permute the dimensions. For matrices this is called the "transpose" operation, often notated by a superscript T (^T) after a matrix, something like f(W) = W^T.
In einop notation we can write it like so:
m n -> n m
Once again it is important to stress that the einop notation is describing a function that takes in a 2D tensor and outputs a 2D tensor.
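In code this pattern maps directly onto einops.rearrange; here is a sketch (again assuming einops is available) showing it agrees with the usual transpose:

import torch
from einops import rearrange

W = torch.arange(12).reshape(4, 3)   # shape [4, 3]
Wt = rearrange(W, 'm n -> n m')      # shape [3, 4]
print(torch.equal(Wt, W.T))          # True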
There are only two permutations of a matrix: the identity and transpose. But when we move up to a 3 dimensional tensor with shape [k, m, n] there are 6 functions:
k m n -> k m n
k m n -> k n m
k m n -> m k n
k m n -> n k m
k m n -> m n k
k m n -> n m k
Side Note: For a tensor with d dimensions there are actually d factorial permutation functions, so:
0D -> 0! = 1
1D -> 1! = 1*0! = 1
2D -> 2! = 2*1! = 2
3D -> 3! = 3*2! = 6
4D -> 4! = 4*3! = 24
5D -> 5! = 5*4! = 120
But counting permutations is less important than understanding what they actually do to your data. Let's look at how permutation works under the hood.
When thinking about the list of lists analogy for matrices, it can get confusing to picture what is actually happening when you permute the dimensions, especially for 3D and higher. It's helpful to think of it this way: you aren't changing the underlying tensor, you are just changing how you view it, essentially transforming the coordinates, or reindexing.
For example, say you have a 3D tensor Q with shape [k m n]; you can index into it like so: Q[1, 2, 3], or whatever. Then let's say you permute the tensor using
f: k m n -> m k n
If you index into f(Q)[1, 2, 3], what is this equal to? We just treat the shapes like variables and fill in the indexes going backwards, like so:
k m n -> m k n
m = 1
k = 2
n = 3
So it would output the value at:
Q[2, 1, 3]
In fewer words:
Q[k, m, n] = f(Q)[m, k, n], or Q[i, j, k] = f(Q)[j, i, k]
This way of thinking about einops can be quite helpful sometimes.
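You can check this reindexing rule directly; here is a sketch using 0-based indices (so the post's [2, 1, 3] becomes [1, 0, 2]):

import torch
from einops import rearrange

Q = torch.arange(2 * 3 * 4).reshape(2, 3, 4)   # shape [k, m, n] = [2, 3, 4]
fQ = rearrange(Q, 'k m n -> m k n')            # shape [3, 2, 4]

# f(Q)[m, k, n] == Q[k, m, n]
print(fQ[0, 1, 2] == Q[1, 0, 2])               # tensor(True)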
Reduce:
Towards the beginning of this article we defined the SUM function and gave it some arbitrary properties; for example, in the 2D case we said it only applies to the final dimension. Now we can create a more precise definition.
The einop notation for a reduction performs a function "over an axis" and removes the axis from the tensor.
It is written like so:
f: m n -sum> m
This represents a function that takes in a tensor of shape [m n] and sums over the second dimension (n). Dimensions that don't appear in the output are the ones we are reducing over.
For example this function applied to our W matrix would be:
f(W) = [
W[1,1] + W[1,2] + W[1,3],
W[2,1] + W[2,2] + W[2,3],
W[3,1] + W[3,2] + W[3,3],
W[4,1] + W[4,2] + W[4,3],
]
You can see f(W)'s shape is [m].
We could also change the function to multiplication
g: m n -product> m
Where:
g(W) = [
W[1,1] * W[1,2] * W[1,3],
W[2,1] * W[2,2] * W[2,3],
W[3,1] * W[3,2] * W[3,3],
W[4,1] * W[4,2] * W[4,3],
]
For computational consistency, the inner function just has to be associative, meaning it shouldn't matter whether you do op(op(a,b),c) or op(a, op(b,c)). This ensures the reduction gives the same result regardless of how the underlying computation groups the operations.
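The einops library exposes this directly through its reduce function; here is a sketch (with a made-up 4×3 W) of the sum and product reductions above, assuming einops is installed:

import torch
from einops import reduce

W = torch.arange(1, 13, dtype=torch.float32).reshape(4, 3)   # shape [m, n] = [4, 3]

row_sums  = reduce(W, 'm n -> m', 'sum')    # shape [4], sums over n
row_prods = reduce(W, 'm n -> m', 'prod')   # shape [4], products over n
print(row_sums, row_prods)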
Amplect: The Core Multi-Tensor Operation
Amplect is an operation that doesn't show up often. It's a more general name for broadcast-op. Basically we take any number of tensors and perform an associative elementwise operation over corresponding dimensions. Let me give an example:
We have two tensors
A: [a b c]
B: [b]
Then we can write
a b c, b -add> a b c
This conceptually expands B to match A's shape by adding singleton dimensions: B becomes [1 b 1]. Then we perform elementwise addition between A and the expanded B.
This can be tricky to think about so I like to think about it in the "traditional" way.
We have an output with shape [a b c]; for each index i, j, k of this output we have the equation:
Output[i, j, k] = A[i, j, k] + B[j]
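Here is a sketch of that amplect in PyTorch: we insert singleton dimensions into B so it lines up as [1, b, 1] and let elementwise addition broadcast (random tensors as stand-ins):

import torch

a, b, c = 2, 3, 4
A = torch.rand(a, b, c)
B = torch.rand(b)

out = A + B[None, :, None]                 # B viewed as shape [1, b, 1]
print(out.shape)                           # torch.Size([2, 3, 4])
print(out[1, 2, 3] == A[1, 2, 3] + B[2])   # tensor(True)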
Aside: Why Broadcasting is Un-intuitive
Broadcast ops have become commonplace in every machine learning framework, but they are unintuitive. You start with two tensors of differing shapes, and combine them together by "broadcasting" them to a common size then doing the op elementwise. The way you broadcast to a common size is:
- Starting from the last (rightmost) dimension of both tensors, for each dimension:
  - Check if the sizes of the current dimension of both tensors are equal
    - If equal, continue
  - Check if the size of tensor 1's current dimension is 1
    - If yes, repeat tensor 1's current dimension to the size of tensor 2's current dimension and continue
  - Check if the size of tensor 2's current dimension is 1
    - If yes, repeat tensor 2's current dimension to the size of tensor 1's current dimension and continue
  - Otherwise the two tensors are not broadcastable
Now this creates problems if you want to broadcast add a tensor with shape [a b c] to a tensor with shape [b], because this algorithm fails; you need to add an explicit extra singleton dimension to the end of the second tensor, like so: [b 1]. Now the algorithm works.
You might have noticed that with singleton dimensions, technically any tensor is broadcastable to any other tensor.
For example, let's say you have a tensor A with shape [a b c] and a tensor B with shape [d e f]. These two tensors are considered "unbroadcastable" in the traditional sense, but if you add singleton dimensions to one of the tensors, let's say A, so that A now has shape [a b c 1 1 1], these two tensors broadcast to a common shape [a b c d e f], which is considered the outer product.
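A quick sketch of that outer-product style broadcast in PyTorch (shapes chosen arbitrarily):

import torch

A = torch.rand(2, 3, 4)                      # shape [a, b, c]
B = torch.rand(5, 6, 7)                      # shape [d, e, f]

outer = A[:, :, :, None, None, None] * B     # A viewed as [a, b, c, 1, 1, 1]
print(outer.shape)                           # torch.Size([2, 3, 4, 5, 6, 7])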
So technically any two tensors are always broadcastable to each other. We should be more specific about what we want when we broadcast, which is where amplect shines.
Matrix Multiplication: Finally
Now that we are here, we can finally build matrix multiplication from the beginning using our new operations:
For a matrix A with shape [a b] and a matrix B with shape [b c] we have:
a b, b c -multiply> a b c -sum> a c
The first arrow represents an Amplect op which is performing a broadcast multiplication, the second arrow represents a reduction op which is performing a sum over the b dimension (the middle dimension of the intermediate [a b c] result).
A more universally accepted way to write this particular expression would be using einsum notation, which is a standard mathematical notation for tensor operations:
a b, b c -> a c
This einsum notation is less verbose and has become the conventional way to express matrix multiplication in most frameworks.
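Here is a sketch that builds matrix multiplication exactly this way and checks it against torch.einsum and the built-in @ operator (random matrices as stand-ins):

import torch

A = torch.rand(2, 3)                             # shape [a, b]
B = torch.rand(3, 4)                             # shape [b, c]

# a b, b c -multiply> a b c  (amplect), then -sum> a c  (reduce over b)
intermediate = A[:, :, None] * B[None, :, :]     # shape [a, b, c]
result = intermediate.sum(dim=1)                 # shape [a, c]

print(torch.allclose(result, torch.einsum('ab,bc->ac', A, B)))  # True
print(torch.allclose(result, A @ B))                            # True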
Einop Notation: Pros and Cons
Where does einop notation shine:
Readability: Understanding what an operation is doing is much easier and standardized; you can quickly understand what is happening without having to look it up online or read a paper.
Simplicity: It's quite simple and therefore simple to teach
Expressiveness: You can describe complex operations that don't have standard names or that would require multiple steps in traditional linear algebra.
If you're like me you're probably wondering why this notation isn't more widespread considering its simplicity and expressiveness. It's mainly for 3 reasons:
Hardware: Certain operations run much faster on certain hardware, and having a dedicated operation for that hardware can be a major performance improvement.
Inertia: The current system works well enough for it not to matter; neural network architectures don't change that much, so a more expressive notation isn't needed.
Convolutions: The notation is incapable of elegantly expressing convolutions.
Final Word: Convolutions and Einops
I would argue that convolutions are one of the most important operations in machine learning: they are excellent at learning patterns in data that are adjacent and continuous, like images or audio. The issue is that there's no elegant way to express a convolution, with the same options as PyTorch, in this notation.
I don't expect this to fully make sense now, but here is the reason. For A with shape [colors, h, w] and B with shape [out_channels, colors, kh, kw], a convolution requires us to first pad B with zeros to the same spatial shape as A, then:
A_tilde = fft(A, axes=(-2, -1))
B_tilde = fft(B, axes=(-2, -1))
f: colors h w, out_channels colors h w -mul> out_channels colors h w -sum> out_channels h w
A_pre = f(A_tilde, B_tilde)
A_conv = ifft(A_pre, axes=(-2, -1))
A_conv: [out_channels h w]
Which totally works, but most people don't conceptualize convolutions as a multiplication in frequency space. The padding is also unintuitive, and practical concerns like kernel dilation, stride, and edge padding don't translate well to this approach.
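For completeness, here is a rough sketch of that recipe using torch.fft with made-up shapes. Note that multiplying in frequency space yields a circular convolution, which is part of why this doesn't map cleanly onto PyTorch's padded, strided conv2d:

import torch

colors, h, w = 3, 8, 8
out_channels, kh, kw = 4, 3, 3
A = torch.rand(colors, h, w)
B = torch.rand(out_channels, colors, kh, kw)

B_pad = torch.zeros(out_channels, colors, h, w)
B_pad[:, :, :kh, :kw] = B                       # pad B with zeros to A's spatial shape

A_tilde = torch.fft.fft2(A)                     # fft over the last two axes
B_tilde = torch.fft.fft2(B_pad)

# colors h w, out_channels colors h w -mul> out_channels colors h w -sum> out_channels h w
A_pre = (A_tilde[None] * B_tilde).sum(dim=1)

A_conv = torch.fft.ifft2(A_pre).real            # shape [out_channels, h, w]
print(A_conv.shape)                             # torch.Size([4, 8, 8])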
Outro
Hopefully, you now see tensors not as mysterious abstractions hidden behind PyTorch APIs, but as structured ways of organizing and transforming data. The fact that you really only need one binary tensor operation to build everything else reveals something profound about computation itself.
Tensors aren't just containers for numbers. They're the fundamental language of how information flows in our models. When you see a transformer's attention mechanism or CNN feature maps, you're no longer looking at black boxes, you're seeing coordinated tensor transformations you can reason about and improve.