Gradient Descent

What is Gradient Descent?

Gradient descent is an iterative optimization algorithm that lets a model learn its parameters by repeatedly taking small steps in the direction that reduces the cost (error), stopping when the cost can no longer be meaningfully reduced.

The Cost Function as a Landscape

Imagine the cost function as a physical landscape:

- Every possible combination of model parameters corresponds to a point on this landscape
- The height of that point is the cost (error) at those parameter values
- Training a model means starting wherever you are on this landscape and finding your way to the lowest point

Cost
  |
  |  \         /
  |   \       /
  |    \     /
  |     \   /
  |      \_/   ← global minimum
  |
  +-----------------> w (parameter)

The goal: reach the valley floor.
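To make the landscape concrete, here is a minimal sketch that evaluates a cost at several values of a single parameter w. The toy data, the choice of mean squared error as the cost, and the helper name `mse_cost` are assumptions made for illustration; the text above does not fix a particular cost function.

```python
# Toy data generated by y = 2 * x (illustrative assumption).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]


def mse_cost(w):
    """Mean squared error of the one-parameter model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Each candidate w is a point on the landscape; its cost is the height.
for w in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(f"w = {w:.1f}  cost = {mse_cost(w):.3f}")
```

With this data the cost falls as w approaches 2 and rises again beyond it, which is the bowl shape sketched in the diagram.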

The Update Rule

The Gradient

The key ingredient is the gradient — a measure of the slope at your current position.

- The gradient tells you which direction is uphill and how steep the slope is
- To reduce cost, you step in the opposite direction of the gradient (downhill)

Formally, the partial derivative:

$$\frac{\partial \text{cost}}{\partial w}$$

measures how much the cost changes for a tiny nudge to parameter w — the slope in w's direction.
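The "tiny nudge" reading of the partial derivative can be checked numerically. The sketch below estimates the slope with a central difference; the cost function and the helper name `numerical_gradient` are hypothetical, chosen only to illustrate the idea.

```python
def cost(w):
    """Hypothetical one-parameter cost with its minimum at w = 3."""
    return (w - 3.0) ** 2


def numerical_gradient(f, w, eps=1e-6):
    """Central-difference estimate of df/dw at w."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


slope_left = numerical_gradient(cost, 1.0)   # negative: cost falls as w grows
slope_right = numerical_gradient(cost, 5.0)  # positive: cost rises as w grows
print(slope_left, slope_right)
```

Left of the minimum the slope is negative, so stepping opposite to the gradient increases w; right of the minimum the slope is positive, so the same rule decreases w. Either way the step heads toward the valley floor.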

The Formula

Gradient descent updates each parameter at every step:

$$w \leftarrow w - \eta \cdot \frac{\partial \text{cost}}{\partial w}$$

$$b \leftarrow b - \eta \cdot \frac{\partial \text{cost}}{\partial b}$$

Where:

- w: weight parameter
- b: bias parameter
- η (eta): the learning rate, which controls how large each step is
- ∂cost/∂w and ∂cost/∂b: the gradient components, which give the slope of the cost with respect to each parameter

The minus sign is what makes it descent — we move opposite to the uphill direction.
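As a worked example under assumed values (the cost $(w - 3)^2$, the starting point $w = 1$, and $\eta = 0.1$ are illustrative and not taken from the text above), a single update looks like this:

$$\frac{\partial \text{cost}}{\partial w} = 2(w - 3) = 2(1 - 3) = -4$$

$$w \leftarrow 1 - 0.1 \cdot (-4) = 1.4$$

The gradient is negative here, so subtracting it increases w, moving it from 1 toward the minimum at 3.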

The Learning Rate η

The learning rate controls step size and is critical to get right:

| Learning rate | Effect |
|---|---|
| Too large | Overshoots the minimum and may diverge |
| Too small | Converges very slowly, so training is expensive |
| Just right | Steadily reaches the minimum |

Too large:          Too small:          Just right:
   *                  *
     *                 *
        *               *
  (misses min)    (crawls)            * * * * *_/
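The three regimes can be reproduced on a very simple cost. The sketch below assumes cost(w) = w², whose gradient is 2w and whose minimum is at w = 0; the specific learning rates, the starting point, and the function name `run_gradient_descent` are illustrative choices.

```python
def run_gradient_descent(eta, w=1.0, steps=20):
    """Run gradient descent on cost(w) = w**2, whose gradient is 2 * w."""
    for _ in range(steps):
        w = w - eta * 2 * w
    return w


for label, eta in [("too large", 1.1), ("too small", 0.001), ("moderate", 0.1)]:
    print(f"{label:>9}: eta = {eta:<5}  w after 20 steps = {run_gradient_descent(eta):.4f}")
```

With these assumed settings, the large rate ends farther from 0 than it started because every step overshoots and the error grows, the small rate has barely moved from 1, and the moderate rate is close to 0. Which rate counts as "just right" depends on the cost function, so these particular numbers should not be read as general recommendations.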

The Algorithm Step by Step

1. Initialize parameters w and b (often randomly or as zeros)
2. Compute the cost at current parameters
3. Compute the gradient: how much each parameter affects the cost
4. Update each parameter using the update rule
5. Repeat steps 2–4 until cost stops decreasing significantly
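The steps above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: it assumes a linear model y_hat = w * x + b, a mean-squared-error cost, toy data, and a stopping tolerance, none of which are specified in the text. The names `cost_and_gradients` and `fit` are hypothetical.

```python
# Toy data generated by y = 2 * x + 1 (illustrative assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def cost_and_gradients(w, b):
    """Return the MSE cost and its partial derivatives with respect to w and b."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    cost = sum(e ** 2 for e in errors) / n
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2 * sum(errors) / n
    return cost, grad_w, grad_b


def fit(eta=0.05, max_steps=10_000, tol=1e-12):
    w, b = 0.0, 0.0                       # step 1: initialize
    prev_cost = float("inf")
    for _ in range(max_steps):
        cost, grad_w, grad_b = cost_and_gradients(w, b)  # steps 2 and 3
        if prev_cost - cost < tol:        # step 5: stop when improvement is negligible
            break
        w -= eta * grad_w                 # step 4: update rule
        b -= eta * grad_b
        prev_cost = cost
    return w, b, cost


w, b, final_cost = fit()
print(f"w = {w:.3f}, b = {b:.3f}, cost = {final_cost:.2e}")
```

On this toy data the fitted values should approach w ≈ 2 and b ≈ 1. The `max_steps` cap is a safeguard so the loop ends even if the tolerance is never reached, for example when the learning rate is too large.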

Key Takeaways

- Gradient descent does not find the minimum in one step; it iterates toward it
- The gradient is a local measurement: it only sees the slope at the current point
- It can get stuck in local minima on non-convex landscapes (a known limitation)
- Widely used deep learning optimizers such as Adam and RMSProp are variations built on top of this core idea
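The local-minimum limitation can be seen on a small non-convex example. The sketch below uses a hypothetical cost with two valleys of different depth; the function, the learning rate, and the starting points are assumptions chosen for illustration.

```python
def cost(w):
    """Hypothetical non-convex cost with two valleys of different depth."""
    return w ** 4 - 3 * w ** 2 + w


def gradient(w):
    """Derivative of cost(w) with respect to w."""
    return 4 * w ** 3 - 6 * w + 1


def descend(w, eta=0.01, steps=500):
    """Apply the update rule w <- w - eta * gradient(w) a fixed number of times."""
    for _ in range(steps):
        w -= eta * gradient(w)
    return w


for start in (-2.0, 2.0):
    end = descend(start)
    print(f"start = {start:+.1f}  ->  w = {end:.3f}, cost = {cost(end):.3f}")
```

With these assumed settings, the run started at w = 2 settles in the shallower valley near w ≈ 1.13, while the run started at w = -2 reaches the deeper one near w ≈ -1.30. Because the gradient only reports the local slope, neither run can tell which valley it is in.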