Gradient Descent
What is Gradient Descent?
Gradient descent is an iterative optimization algorithm that lets a model learn its parameters by taking small steps in the direction that reduces error — repeating until the cost can no longer be meaningfully reduced.
The Cost Function as a Landscape
Imagine the cost function as a physical landscape:
- Every possible combination of model parameters corresponds to a point on this landscape
- The height of that point is the cost (error) at those parameter values
- Training a model means starting wherever you are on this landscape and finding your way to the lowest point
    Cost
      |
      | \         /
      |  \       /
      |   \     /
      |    \   /
      |     \_/   ← global minimum
      |
      +-----------------> w (parameter)
The goal: reach the valley floor.
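To make the landscape picture concrete, here is a minimal sketch that evaluates a cost at several values of a single parameter. The toy data, the one-parameter model `y_hat = w * x`, and the choice of mean squared error as the cost are assumptions made for illustration; none of them come from the text above.

```python
# Hypothetical toy data generated by y = 2x (an assumption for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def cost(w):
    """Mean squared error of the model y_hat = w * x on the toy data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Sample the landscape at evenly spaced values of w from 0.0 to 4.0.
candidates = [i * 0.5 for i in range(9)]
landscape = [(w, cost(w)) for w in candidates]

for w, c in landscape:
    print(f"w = {w:.1f}  cost = {c:.2f}")

# The lowest sampled point is the valley floor of this one-parameter landscape.
best_w, best_cost = min(landscape, key=lambda point: point[1])
print("lowest sampled point: w =", best_w, "cost =", best_cost)
```

Sampling every candidate like this only works for a handful of parameters. Gradient descent avoids the exhaustive search by following the slope instead.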
The Update Rule
The Gradient
The key ingredient is the gradient — a measure of the slope at your current position.
- The gradient tells you which direction is uphill and how steep the slope is
- To reduce cost, you step in the opposite direction of the gradient (downhill)
Formally, the partial derivative:
$$\frac{\partial \text{cost}}{\partial w}$$
measures how much the cost changes for a tiny nudge to parameter w — the slope in w's direction.
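The "tiny nudge" reading of the partial derivative can be checked numerically. The sketch below assumes a hypothetical one-parameter cost, `(w - 3)^2`, and compares its exact derivative with a central-difference estimate. The function names are illustrative only.

```python
def cost(w):
    """Hypothetical one-parameter cost with its minimum at w = 3."""
    return (w - 3.0) ** 2


def analytic_gradient(w):
    """Exact derivative of (w - 3)^2 with respect to w."""
    return 2.0 * (w - 3.0)


def numerical_gradient(f, w, h=1e-6):
    """Central-difference estimate of the slope of f at w."""
    return (f(w + h) - f(w - h)) / (2.0 * h)


for w in (0.0, 3.0, 5.0):
    exact = analytic_gradient(w)
    estimate = numerical_gradient(cost, w)
    print(f"w = {w}: exact slope = {exact:.4f}, estimated slope = {estimate:.4f}")
```

A negative slope (as at `w = 0`) means the cost falls as `w` increases, so the downhill step moves `w` to the right. A positive slope (as at `w = 5`) means the downhill step moves `w` to the left. At the minimum the slope is zero.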
The Formula
Gradient descent updates each parameter at every step:
$$w \leftarrow w - \eta \cdot \frac{\partial \text{cost}}{\partial w}$$
$$b \leftarrow b - \eta \cdot \frac{\partial \text{cost}}{\partial b}$$
Where:
- w — weight parameter
- b — bias parameter
- η (eta) — the learning rate, controls how large each step is
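- ∂cost/∂w: the partial derivative of the cost with respect to w (one component of the gradient); its sign gives the uphill direction and its magnitude gives the steepness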
The minus sign is what makes it descent — we move opposite to the uphill direction.
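A single application of the update rule might look like the sketch below. The starting values, gradient values, and learning rate are hypothetical numbers chosen for illustration. In practice the gradients would be computed from the cost function.

```python
def update(w, b, grad_w, grad_b, eta):
    """Apply one gradient descent step to both parameters."""
    new_w = w - eta * grad_w
    new_b = b - eta * grad_b
    return new_w, new_b


# Hypothetical current parameters and gradients (assumptions for illustration).
w, b = 1.0, 0.5
grad_w, grad_b = 4.0, -2.0
eta = 0.1

new_w, new_b = update(w, b, grad_w, grad_b, eta)
print("w:", w, "->", new_w)  # positive gradient, so w decreases
print("b:", b, "->", new_b)  # negative gradient, so b increases
```

Both parameters are updated from gradients evaluated at the same point, which is why the function computes the two new values before returning either.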
The Learning Rate η
The learning rate controls step size and is critical to get right:
| Learning Rate | Effect |
|---|---|
| Too large | Overshoots the minimum, may diverge |
| Too small | Converges very slowly, so training becomes computationally expensive |
| Just right | Steadily reaches the minimum |
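[Diagram: three panels showing the steps taken. Too large: the steps jump past the minimum (misses min). Too small: the steps crawl toward it. Just right: the steps descend steadily to the valley floor.]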
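The three regimes can be reproduced on a simple example. This sketch assumes the cost `w^2` (gradient `2w`, minimum at `w = 0`) and three hypothetical learning rates. The specific thresholds depend on the cost function, so these numbers are illustrative rather than general recommendations.

```python
def descend(eta, w=5.0, steps=20):
    """Run gradient descent on cost(w) = w^2, whose gradient is 2w."""
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w


# Hypothetical learning rates chosen to show each regime on this cost.
for label, eta in [("too large", 1.1), ("too small", 0.001), ("reasonable", 0.1)]:
    final_w = descend(eta)
    print(f"{label:>10}: eta = {eta}, w after 20 steps = {final_w:.4f}")
```

With the large rate each step overshoots the minimum and lands farther away than it started, so the iterates grow without bound. With the small rate `w` has barely moved from its starting value of 5 after 20 steps. With the moderate rate `w` is close to 0.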
The Algorithm Step by Step
1. Initialize parameters w and b (often randomly or as zeros)
2. Compute the cost at the current parameters
3. Compute the gradient: how much each parameter affects the cost
4. Update each parameter using the update rule
5. Repeat steps 2–4 until the cost stops decreasing significantly
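The steps above can be sketched end to end for a small linear model. The toy data, the model `y_hat = w * x + b`, the mean squared error cost, and the values of `eta`, `tolerance`, and `max_steps` are all assumptions made for this example, not something prescribed by the text.

```python
# Hypothetical toy data generated by y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
n = len(xs)


def cost(w, b):
    """Mean squared error of the model y_hat = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def gradients(w, b):
    """Partial derivatives of the mean squared error with respect to w and b."""
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2.0 * sum(errors) / n
    return grad_w, grad_b


def fit(eta=0.05, tolerance=1e-12, max_steps=100_000):
    w, b = 0.0, 0.0                           # initialize parameters
    previous = cost(w, b)                     # compute the cost
    for step in range(1, max_steps + 1):
        grad_w, grad_b = gradients(w, b)      # compute the gradient
        w -= eta * grad_w                     # update each parameter
        b -= eta * grad_b
        current = cost(w, b)
        if previous - current < tolerance:    # stop when the cost barely improves
            break
        previous = current
    return w, b, step


w, b, steps = fit()
print(f"w = {w:.4f}, b = {b:.4f}, steps taken = {steps}")
```

The stopping test compares consecutive cost values, which is one simple way to express "until the cost stops decreasing significantly". Other criteria, such as a threshold on the gradient's magnitude, are equally reasonable.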
Key Takeaways
- Gradient descent doesn't find the minimum in one step — it iterates toward it
- The gradient is a local measurement — it only sees the slope at the current point
- It can get stuck in local minima on non-convex landscapes (a known limitation)
- Widely used deep learning optimizers such as Adam and RMSProp are variations built on top of this core idea
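The local-minimum limitation is easy to observe on a small non-convex example. The sketch below assumes a hypothetical cost, `w^4 - 3w^2 + w`, which has two valleys of different depth, and runs the same update rule from two different starting points. The learning rate and step count are illustrative choices.

```python
def cost(w):
    """Hypothetical non-convex cost with two valleys of different depth."""
    return w ** 4 - 3.0 * w ** 2 + w


def gradient(w):
    """Derivative of the cost above with respect to w."""
    return 4.0 * w ** 3 - 6.0 * w + 1.0


def descend(w, eta=0.01, steps=1000):
    """Plain gradient descent from the starting point w."""
    for _ in range(steps):
        w = w - eta * gradient(w)
    return w


from_right = descend(2.0)   # starts on the right-hand slope
from_left = descend(-2.0)   # starts on the left-hand slope

print(f"start  2.0 -> w = {from_right:.4f}, cost = {cost(from_right):.4f}")
print(f"start -2.0 -> w = {from_left:.4f}, cost = {cost(from_left):.4f}")
```

Both runs stop where the slope is zero, but the run that starts on the right settles in the shallower valley and ends with a higher cost. Because the gradient only describes the slope at the current point, nothing in the update rule signals that a deeper valley exists elsewhere.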