    Linear Regression Is Actually a Projection Problem (Part 2: From Projections to Predictions)

By Awais · April 2, 2026 · 18 Mins Read

Most people think that linear regression is about fitting a line to data.

    But mathematically, that’s not what it’s doing.

It is finding the closest possible vector to your target within the space spanned by your features.

    To understand this, we need to change how we look at our data.


In Part 1, we got a basic idea of what a vector is and explored the concepts of dot products and projections.

    Now, let’s apply these concepts to solve a linear regression problem.

    We have this data.

    Image by Author

    The Usual Way: Feature Space

    When we try to understand linear regression, we generally start with a scatter plot drawn between the independent and dependent variables.

    Each point on this plot represents a single row of data. We then try to fit a line through these points, with the goal of minimizing the sum of squared residuals.

    To solve this mathematically, we write down the cost function equation and apply differentiation to find the exact formulas for the slope and intercept.

    As we already discussed in my earlier multiple linear regression (MLR) blog, this is the standard way to understand the problem.

This is what we call the feature space.

    Image by Author

After working through that process, we get values for the slope and intercept. Here, we need to observe one thing.

Let us say ŷᵢ is the predicted value at a certain point. We have the slope and intercept values, and now, according to our data, we need to predict the price.

    If ŷᵢ is the predicted price for House 1, we calculate it by using

    \[
    \beta_0 + \beta_1 \cdot \text{size}
    \]

    What have we done here? We have a size value, and we are scaling it with a certain number, which we call the slope (β₁), to get the value as near to the original value as possible.

    We also add an intercept (β₀) as a base value.
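This scale-and-shift step can be sketched in a few lines of Python. The coefficient values below are illustrative placeholders, not something we have derived yet:

```python
# Hedged sketch: beta0 and beta1 are placeholder values for illustration,
# not yet derived from the data.
beta0 = 2.0   # intercept: the base value
beta1 = 2.5   # slope: how much we scale the size

size_house_1 = 1                       # size of House 1 from the example data
y_hat = beta0 + beta1 * size_house_1   # scale the size, then shift by the base
print(y_hat)                           # 4.5
```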

    Now let’s remember this point, and we will move to the next perspective.


    A Shift in Perspective

    Let’s look at our data.

    Now, instead of considering Price and Size as axes, let’s consider each house as an axis.

We have three houses, which means we can treat House 1 as the X-axis, House 2 as the Y-axis, and House 3 as the Z-axis.

    Then, we simply plot our points.

    Image by Author

    When we consider the size and price columns as axes, we get three points, where each point represents the size and price of a single house.

    However, when we consider each house as an axis, we get two points in a 3-dimensional space.

    One point represents the sizes of all three houses, and the other point represents the prices of all three houses.

    This is what we call the column space, and this is where the linear regression happens.
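The same change of lens can be expressed in code. In this small sketch (using NumPy, which the article itself does not use), the feature-space points and the column-space vectors are just two ways of slicing one matrix:

```python
import numpy as np

# Each row is one house: (size, price)
data = np.array([[1, 4],
                 [2, 8],
                 [3, 9]])

# Feature space: one point per house (a row)
house_1 = data[0]          # array([1, 4])

# Column space: one vector per column, with one coordinate per house
size_vec  = data[:, 0]     # array([1, 2, 3])
price_vec = data[:, 1]     # array([4, 8, 9])
```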


    From Points to Directions

Now let's connect our two points to the origin and call them vectors.

    Image by Author

    Okay, let’s slow down and look at what we have done and why we did it.

    Instead of a normal scatter plot where size and price are the axes (Feature Space), we considered each house as an axis and plotted the points (Column Space).

    We are now saying that linear regression happens in this Column Space.

    You might be thinking: Wait, we learn and understand linear regression using the traditional scatter plot, where we minimize the residuals to find a best-fit line.

    Yes, that is correct! But in Feature Space, linear regression is solved using calculus. We get the formulas for the slope and intercept using partial differentiation.

    If you remember my previous blog on MLR, we derived the formulas for the slopes and intercepts when we had two features and a target variable.

    You can observe how messy it was to calculate those formulas using calculus. Now imagine if you have 50 or 100 features; it becomes complex.

    By switching to Column Space, we change the lens through which we view regression.

    We look at our data as vectors and use the concept of projections. The geometry remains exactly the same whether we have 2 features or 2,000 features.

So, if calculus gets that messy, what is the real benefit of this unchanging geometry? Let's discuss exactly what happens in Column Space.


    Why This Perspective Matters

    Now that we have an idea of what Feature Space and Column Space are, let’s focus on the plot.

    We have two points, where one represents the sizes and the other represents the prices of the houses.

    Why did we connect them to the origin and consider them vectors?

    Because, as we already discussed, in linear regression we are finding a number (which we call the slope or weight) to scale our independent variable.

    We want to scale the Size so it gets as close to the Price as possible, minimizing the residual.

You cannot scale a lone point floating in space; you can only scale something that has a length and a direction.

    By connecting the points to the origin, they become vectors. Now they have both magnitude and direction, and we already know that we can scale vectors.


    Image by Author

    Okay, we established that we treat these columns as vectors because we can scale them, but there is something even more important to learn here.

    Let’s look at our two vectors: the Size vector and the Price vector.

    First, if we look at the Size vector (1, 2, 3), it points in a very specific direction based on the pattern of its numbers.

    From this vector, we can understand that House 2 is twice as large as House 1, and House 3 is three times as large.

    There is a specific 1:2:3 ratio, which forces the Size vector to point in one exact direction.

    Now, if we look at the Price vector, we can see that it points in a slightly different direction than the Size vector, based on its own numbers.

    The direction of an arrow simply shows us the pure, underlying pattern of a feature across all our houses.

    If our prices were exactly (2, 4, 6), then our Price vector would lie exactly in the same direction as our Size vector. That would mean size is a perfect, direct predictor of price.
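We can check this perfect-predictor case numerically. In this sketch (NumPy, assumed purely for illustration), the cosine of the angle between the two vectors comes out as 1, meaning they point the same way:

```python
import numpy as np

size  = np.array([1, 2, 3])
price = np.array([2, 4, 6])   # hypothetical "perfect" prices: exactly 2x the size

# cos(theta) = (a . b) / (|a| |b|)
cos_theta = size @ price / (np.linalg.norm(size) * np.linalg.norm(price))
# cos_theta ≈ 1.0: the two vectors lie along exactly the same direction
```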

    Image by Author

    But in real life, this is rarely possible. The price of a house is not just dependent on size; there are various other factors that affect it, which is why the Price vector points slightly away.

    That angle between the two vectors (1,2,3) and (4,8,9) represents the real-world noise.
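For our actual data, that angle can be computed the same way (again a NumPy sketch for illustration):

```python
import numpy as np

size  = np.array([1, 2, 3])
price = np.array([4, 8, 9])

cos_theta = size @ price / (np.linalg.norm(size) * np.linalg.norm(price))
angle_deg = np.degrees(np.arccos(cos_theta))
# angle_deg ≈ 8.1 degrees: small, but not zero — that gap is the noise
```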


    The Geometry Behind Regression

    Image by Author

    Now, we use the concept of projections that we learned in Part 1.

    Let’s consider our Price vector (4, 8, 9) as a destination we want to reach. However, we only have one direction we can travel which is the path of our Size vector (1, 2, 3).

    If we travel along the direction of the Size vector, we can’t perfectly reach our destination because it points in a different direction.

    But we can travel to a specific point on our path that gets us as close to the destination as possible.

    The shortest path from our destination dropping down to that exact point makes a perfect 90-degree angle.

    In Part 1, we discussed this concept using the ‘highway and home’ analogy.

    We are applying the exact same concept here. The only difference is that in Part 1, we were in a 2D space, and here we are in a 3D space.

    I referred to the feature as a ‘way’ or a ‘highway’ because we only have one direction to travel.

    This distinction between a ‘way’ and a ‘direction’ will become much clearer later when we add multiple directions!


    A Simple Way to See This

    We can already observe that this is the exact same concept as vector projections.

    We derived a formula for this in Part 1. So, why wait?

    Let’s just apply the formula, right?

    No. Not yet.

    There is something crucial we need to understand first.

    In Part 1, we were dealing with a 2D space, so we used the highway and home analogy. But here, we are in a 3D space.

    To understand it better, let’s use a new analogy.

    Consider this 3D space as a physical room. There is a lightbulb hovering in the room at the coordinates (4, 8, 9).

The path from the origin to that bulb is our Price vector, which we call the target vector.

    We want to reach that bulb, but our movements are restricted.

    We can only walk along the direction of our Size vector (1, 2, 3), moving either forward or backward.

    Based on what we learned in Part 1, you might say, ‘Let’s just apply the projection formula to find the nearest point on our path to the bulb.’

    And you would be right. That is the absolute closest we can get to the bulb in that direction.
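That closest point along a single direction is exactly the projection formula from Part 1. A minimal sketch (NumPy, for illustration):

```python
import numpy as np

size  = np.array([1, 2, 3])   # the only direction we can walk
price = np.array([4, 8, 9])   # the bulb we want to reach

# Projection onto one direction: scale = (price . size) / (size . size)
scale = (price @ size) / (size @ size)   # 47/14 ≈ 3.36
closest = scale * size                   # nearest reachable point on the line

# The leftover gap is perpendicular to the direction we walked
residual = price - closest
# residual @ size ≈ 0
```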


Why Do We Need a Base Value?

    But before we move forward, we should observe one more thing here.

    We already discussed that we are finding a single number (a slope) to scale our Size vector so we can get as close to the Price vector as possible. We can understand this with a simple equation:

    Price = β₁ × Size

    But what if the size is zero? Whatever the value of β₁ is, we get a predicted price of zero.

    But is this right? We are saying that if the size of a house is 0 square feet, the price of the house is 0 dollars.

    This is not correct because there has to be a base value for each house. Why?

    Because even if there is no physical building, there is still a value for the empty plot of land it sits on. The price of the final house is heavily dependent on this base plot price.

We call this base value β₀. In traditional algebra, we already know this as the intercept, which is the term that shifts a line up and down.

    So, how do we add a base value in our 3D room? We do it by adding a Base Vector.


    Combining Directions

    GIF by Author

Now we have added a base vector (1, 1, 1), but what does this base vector actually do?

    From the above plot, we can observe that by adding a base vector, we have one more direction to move in that space.

    We can move in both the directions of the Size vector and the Base vector.

Don't think of them as separate "ways"; they are directions, and this will become clear once we reach a point by moving along both of them.

    Without the base vector, our base value was zero. We started with a base value of zero for every house. Now that we have a base vector, let’s first move along it.

    For example, let’s move 3 steps in the direction of the Base vector. By doing so, we reach the point (3, 3, 3). We are currently at (3, 3, 3), and we want to reach as close as possible to our Price vector.

    This means the base value of every house is 3 dollars, and our new starting point is (3, 3, 3).

    Next, let’s move 2 steps in the direction of our Size vector (1, 2, 3). This means calculating 2 * (1, 2, 3) = (2, 4, 6).

Therefore, from (3, 3, 3), we move 2 units along the House 1 axis, 4 units along the House 2 axis, and 6 units along the House 3 axis.

    Basically, we are adding the vectors here, and the order does not matter.

    Whether we move first through the base vector or the size vector, it gets us to the exact same point. We just moved along the base vector first to understand the idea better!
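Because this is just vector addition, the order of the moves really does not matter, which a two-line check makes obvious (NumPy sketch):

```python
import numpy as np

base = np.array([1, 1, 1])
size = np.array([1, 2, 3])

# 3 steps along the base vector, then 2 along the size vector...
path_a = 3 * base + 2 * size
# ...or 2 along the size vector first, then 3 along the base vector
path_b = 2 * size + 3 * base

print(path_a)   # [5 7 9]
print(path_b)   # [5 7 9]
```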


    The Space of All Possible Predictions

This way, we use both directions to get as close as possible to our Price vector. In the earlier example, we scaled the Base vector by 3, which means β₀ = 3, and we scaled the Size vector by 2, which means β₁ = 2.

From this, we can observe that we need the best combination of β₀ and β₁, so that we know how many steps to travel along the base vector and how many along the size vector to reach the point closest to our Price vector.

If we try all the different combinations of β₀ and β₁, we get an infinite number of points. Let's see what that looks like.

    GIF by Author

We can see that all the points formed by the different combinations of β₀ and β₁ along the directions of the Base vector and Size vector form a flat 2D plane in our 3D space.

    Now, we have to find the point on that plane which is nearest to our Price vector.

    We already know how to get to that point. As we discussed in Part 1, we find the shortest path by using the concept of geometric projections.
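We can also verify numerically that every combination of β₀ and β₁ lands on one flat plane. In this sketch (NumPy for illustration), the cross product gives the plane's normal, and every combination satisfies the plane equation:

```python
import numpy as np

base_vec = np.array([1, 1, 1])
size_vec = np.array([1, 2, 3])

# The normal of the plane spanned by the two direction vectors
normal = np.cross(base_vec, size_vec)   # [ 1 -2  1]

# Every combination b0*base + b1*size satisfies the plane equation x - 2y + z = 0
for b0, b1 in [(3, 2), (-1, 0.5), (10, -4)]:
    point = b0 * base_vec + b1 * size_vec
    assert abs(point @ normal) < 1e-9
```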


    Now we need to find the exact point on the plane which is nearest to the Price vector.

    We already discussed this in Part 1 using our ‘home and highway’ analogy, where the shortest path from the highway to the home formed a 90-degree angle with the highway.

    There, we moved in one dimension, but here we are moving on a 2D plane. However, the rule remains the same.

    The shortest distance between the tip of our price vector and a point on the plane is where the path between them forms a perfect 90-degree angle with the plane.

    GIF by Author

    From a Point to a Vector

    Before we dive into the math, let us clarify exactly what is happening so that it feels easy to follow.

    Until now, we have been talking about finding the specific point on our plane that is closest to the tip of our target price vector. But what do we actually mean by this?

    To reach that point, we have to travel across our plane.

    We do this by moving along our two available directions, which are our Base and Size vectors, and scaling them.

    When you scale and add two vectors together, the result is always a vector!

If we draw a straight line from the origin directly to that exact point on the plane, we create what is called the Prediction Vector.

    Moving along this single Prediction Vector gets us to the exact same destination as taking those scaled steps along the Base and Size directions.

Vector Subtraction

    Now we have two vectors.

    We want to know the exact difference between them. In linear algebra, we find this difference using vector subtraction.

    When we subtract our Prediction from our Target, the result is our Residual Vector, also known as the Error Vector.

    This is why that dotted red line is not just a measurement of distance. It is a vector itself!

When we work in feature space, we try to minimize the sum of squared residuals. Here, by finding the point on the plane closest to the price vector, we are indirectly looking for where the physical length of the residual vector is shortest!


    Linear Regression Is a Projection

    Now let’s start the math.

Let's start by representing everything in matrix form.

    \[
    X =
    \begin{bmatrix}
    1 & 1 \\
    1 & 2 \\
    1 & 3
    \end{bmatrix}
    \quad
    y =
    \begin{bmatrix}
    4 \\
    8 \\
    9
    \end{bmatrix}
    \quad
    \beta =
    \begin{bmatrix}
    b_0 \\
    b_1
    \end{bmatrix}
    \]
Here, the columns of X represent the base and size directions, and we are trying to combine them to reach y.
    \[
    \hat{y} = X\beta
    \]
    \[
    = b_0
    \begin{bmatrix}
    1 \\
    1 \\
    1
    \end{bmatrix}
    +
    b_1
    \begin{bmatrix}
    1 \\
    2 \\
    3
    \end{bmatrix}
    \]
Every prediction is just a combination of these two directions.
\[
e = y - X\beta
\]
This error vector is the gap between where we want to be and where we actually reach. For this gap to be as short as possible, it must be perfectly perpendicular to the plane formed by the columns of X.
    \[
    X^T e = 0
    \]
Now we substitute e into this condition.
    \[
    X^T (y – X\beta) = 0
    \]
    \[
    X^T y – X^T X \beta = 0
    \]
    \[
    X^T X \beta = X^T y
    \]
Solving for β, we get the Normal Equation.
    \[
    \beta = (X^T X)^{-1} X^T y
    \]
Now we compute each part step by step.
    \[
    X^T =
    \begin{bmatrix}
    1 & 1 & 1 \\
    1 & 2 & 3
    \end{bmatrix}
    \]
    \[
    X^T X =
    \begin{bmatrix}
    3 & 6 \\
    6 & 14
    \end{bmatrix}
    \]
    \[
    X^T y =
    \begin{bmatrix}
    21 \\
    47
    \end{bmatrix}
    \]
Computing the inverse of XᵀX:
    \[
    (X^T X)^{-1}
    =
    \frac{1}{(3 \times 14 – 6 \times 6)}
    \begin{bmatrix}
    14 & -6 \\
    -6 & 3
    \end{bmatrix}
    \]
    \[
    =
    \frac{1}{42 – 36}
    \begin{bmatrix}
    14 & -6 \\
    -6 & 3
    \end{bmatrix}
    \]
    \[
    =
    \frac{1}{6}
    \begin{bmatrix}
    14 & -6 \\
    -6 & 3
    \end{bmatrix}
    \]
Now multiply this with Xᵀy.
    \[
    \beta =
    \frac{1}{6}
    \begin{bmatrix}
    14 & -6 \\
    -6 & 3
    \end{bmatrix}
    \begin{bmatrix}
    21 \\
    47
    \end{bmatrix}
    \]
\[
=
\frac{1}{6}
\begin{bmatrix}
14 \cdot 21 - 6 \cdot 47 \\
-6 \cdot 21 + 3 \cdot 47
\end{bmatrix}
\]
\[
=
\frac{1}{6}
\begin{bmatrix}
294 - 282 \\
-126 + 141
\end{bmatrix}
=
\frac{1}{6}
\begin{bmatrix}
12 \\
15
\end{bmatrix}
\]
    \[
    =
    \begin{bmatrix}
    2 \\
    2.5
    \end{bmatrix}
    \]
With these values, we can finally compute the exact point on the plane.
    \[
    \hat{y} =
    2
    \begin{bmatrix}
    1 \\
    1 \\
    1
    \end{bmatrix}
    +
    2.5
    \begin{bmatrix}
    1 \\
    2 \\
    3
    \end{bmatrix}
    =
    \begin{bmatrix}
    4.5 \\
    7.0 \\
    9.5
    \end{bmatrix}
    \]
And this point is the closest possible point on the plane to our target.

    We got the point (4.5, 7.0, 9.5). This is our prediction.

    This point is the closest to the tip of the price vector, and to reach that point, we need to move 2 steps along the base vector, which is our intercept, and 2.5 steps along the size vector, which is our slope.
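The whole derivation can be checked in a few lines. A sketch using NumPy (just a numerical check of the article's math, not an official implementation):

```python
import numpy as np

X = np.array([[1, 1],
              [1, 2],
              [1, 3]])       # columns: base direction, size direction
y = np.array([4, 8, 9])      # the target price vector

# Normal equation: (X^T X) beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)   # ≈ [2.0, 2.5]

y_hat = X @ beta             # projection of y onto the column space: ≈ [4.5, 7.0, 9.5]

# The residual is perpendicular to both columns of X
residual = y - y_hat
# X.T @ residual ≈ [0, 0]
```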


    What Changed Was the Perspective

Let's recap what we have done in this blog. We haven't followed the usual method of solving linear regression, the calculus approach, where we differentiate the loss function to get formulas for the slope and intercept.

Instead, we chose another approach: the method of vectors and projections.

    We started with a Price vector, and we needed to build a model that predicts the price of a house based on its size.

    In terms of vectors, that meant we initially only had one direction to move in to predict the price of the house.

    Then, we also added the Base vector by realizing there should be a baseline starting value.

    Now we had two directions, and the question was how close can we get to the tip of the Price vector by moving in those two directions?

    We are not just fitting a line; we are working inside a space.

    In feature space: we minimize error

    In column space: we drop perpendiculars

    By using different combinations of the slope and intercept, we got an infinite number of points that created a plane.

    The closest point, which we needed to find, lies somewhere on that plane, and we found it by using the concept of projections and the dot product.

    Through that geometry, we found the perfect point and derived the Normal Equation!

    You may ask, “Don’t we get this normal equation by using calculus as well?” You are exactly right! That is the calculus view, but here we are dealing with the geometric linear algebra view to truly understand the geometry behind the math.

    Linear regression is not just optimization.

    It is projection.


    I hope you learned something from this blog!

    If you think something is missing or could be improved, feel free to leave a comment.

    If you haven’t read Part 1 yet, you can read it here. It covers the basic geometric intuition behind vectors and projections.

    Thanks for reading!
