ML by Learning Paradigm
A complementary lens to data domain: the same dataset can be approached with different learning paradigms depending on what labels you have, what you're optimizing for, and how feedback is structured.
There are three fundamental paradigms (supervised, unsupervised, and reinforcement learning), plus modern variants that blur the lines between them.
Supervised Learning
Learn from labeled examples.
You provide both the input data and the correct answers. The algorithm learns to map inputs to outputs by minimizing its error on the provided examples. At inference time, it generalizes this mapping to unseen inputs.
- Requires: labeled data (often expensive and time-consuming to collect)
- Feedback signal: the difference between predicted output and ground truth label (loss function)
Structure
Input (X) + Label (Y) → Train model → Predict Y for new X
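The flow above can be sketched in a few lines. This is a minimal illustration, not a recommended implementation: it assumes a single numeric feature, a linear model, and plain gradient descent on mean squared error. The dataset values and the names `train_linear` and `predict` are hypothetical.

```python
# Toy labeled dataset (hypothetical): size in 1000 sq ft -> sale price in $1000s.
X = [1.0, 1.5, 2.0, 2.5, 3.0]
Y = [200.0, 290.0, 410.0, 500.0, 590.0]

def train_linear(xs, ys, lr=0.05, epochs=5000):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the MSE loss with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Apply the learned mapping to a new input."""
    return w * x + b

w, b = train_linear(X, Y)          # train on labeled pairs (X, Y)
new_price = predict(w, b, 2.2)     # predict Y for an X not seen in training
```

The loss here plays the role of the feedback signal: each update moves `w` and `b` in the direction that shrinks the gap between predictions and labels.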
Examples
| Task | Input | Label |
|---|---|---|
| Email spam detection | Email text | Spam / not spam |
| X-ray diagnosis | Chest X-ray image | Healthy / pneumonia |
| House price prediction | Square footage, location, etc. | Sale price |
| Speech recognition | Audio waveform | Transcript |
Sub-tasks
- Classification — output is a discrete class (binary or multiclass)
- Regression — output is a continuous value
- Structured prediction — output is a structured object (sequence, tree, bounding box)
Common Algorithms
- Logistic / Linear Regression
- Decision Trees, Random Forest, Gradient Boosting (XGBoost)
- Support Vector Machines
- Neural networks (especially when inputs are images, text, or audio)
Unsupervised Learning
Find structure without a teacher.
No labels — only raw input data. There is no "right answer" to optimize against. The algorithm discovers patterns, groupings, or compressed representations inherent in the data.
- Requires: unlabeled data (cheap and abundant)
- Feedback signal: none external; optimization targets internal criteria (cluster cohesion, reconstruction error, etc.)
Structure
Input (X) only → Train model → Discover structure in X
Core Tasks
Clustering
Group similar data points together without predefined categories.
- Example: segment customers by purchasing behavior — model discovers natural groups without being told what the groups are
- Algorithms: k-Means, DBSCAN, Hierarchical clustering, Gaussian Mixture Models
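As a rough sketch of how k-Means finds groups without labels, the snippet below alternates an assignment step and an update step on a tiny, made-up customer dataset (visits per month, average spend). The data, the function names, and the choice of random initialization from data points are illustrative assumptions; in practice a library implementation would normally be used.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k random data points
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical customers: (visits per month, average spend).
customers = [(1, 10), (2, 12), (1, 9), (2, 11), (9, 80), (10, 85), (8, 78), (9, 82)]
centroids, labels = kmeans(customers, k=2)
```

No group names are supplied anywhere; the two clusters emerge only from distances between points. The cluster indices themselves are arbitrary, so only the grouping is meaningful.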
Dimensionality Reduction
Compress high-dimensional data into fewer dimensions while preserving structure. Used for visualization, denoising, and as preprocessing.
- Example: compress 10,000-gene expression profiles to 2D for visualization
- Algorithms: PCA, t-SNE, UMAP, Autoencoders
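To make the idea of "fewer dimensions, same structure" concrete, here is a small sketch of PCA's first component computed with power iteration on the covariance matrix. It is a simplification under stated assumptions: only one component is extracted, the toy data lies exactly on a line in 3D, and `first_principal_component` is a hypothetical helper rather than any library's API.

```python
import math

def first_principal_component(rows, iters=100):
    """Return (unit direction of greatest variance, 1-D projection of each row)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly applying cov pulls v toward the top eigenvector.
    v = [1.0] * d
    for _ in range(iters):
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # Project each centered row onto the direction: d numbers become 1.
    projected = [sum(c * x for c, x in zip(r, v)) for r in centered]
    return v, projected

# Toy 3-D data that lies on a line, so one dimension captures all the variance.
data = [(t, 2 * t, 2 * t) for t in range(5)]
direction, coords = first_principal_component(data)
```

Each 3D point is replaced by a single coordinate along `direction`, and because the toy data is one-dimensional at heart, nothing is lost. Real data is noisier, and nonlinear methods such as t-SNE or UMAP work quite differently.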
Anomaly Detection
Identify data points that don't fit the learned distribution of "normal."
- Example: flag unusual network traffic without labeled attack examples
- Algorithms: Isolation Forest, One-Class SVM, Autoencoders
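The simplest version of "learn what normal looks like, then flag deviations" is a z-score threshold, sketched below. Note that this is a deliberately simpler stand-in, not one of the algorithms listed above; the traffic numbers and the threshold of 3 standard deviations are assumptions chosen for illustration.

```python
import statistics

# Hypothetical requests-per-second readings observed during normal operation.
normal_traffic = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]

# "Training" here just means estimating the distribution of normal data.
mu = statistics.mean(normal_traffic)
sigma = statistics.stdev(normal_traffic)

def is_anomaly(x, threshold=3.0):
    """Flag x if it lies more than `threshold` standard deviations from the mean."""
    return abs(x - mu) / sigma > threshold

flags = [is_anomaly(x) for x in [100, 104, 250]]
```

No attack examples were needed: the model only ever saw normal traffic, and the reading of 250 is flagged because it is far outside that learned range.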
Generative Modeling
Learn the data distribution and sample new data points from it.
- Example: generate synthetic medical images for training data augmentation
- Algorithms: VAEs, GANs, Diffusion models
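At its smallest scale, "learn the distribution, then sample from it" can be illustrated by fitting a single Gaussian and drawing new values. This is only a toy under strong assumptions (one-dimensional data, a Gaussian shape, made-up measurements); VAEs, GANs, and diffusion models learn far richer distributions.

```python
import random
import statistics

# Hypothetical one-dimensional measurements.
observed = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

# Learn the distribution: estimate the parameters of a Gaussian.
mu = statistics.mean(observed)
sigma = statistics.stdev(observed)

# Sample new, synthetic data points from the learned distribution.
rng = random.Random(0)
synthetic = [rng.gauss(mu, sigma) for _ in range(1000)]
```

The synthetic values are not copies of the observed ones, but they follow roughly the same mean and spread.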
Reinforcement Learning
Learn through trial and error.
An agent takes actions in an environment and receives rewards for good actions and penalties for bad ones. Learning happens through repeated interaction; there is no labeled dataset and no static training set. The agent improves its policy (a mapping from states to actions) to maximize cumulative reward over time.
- Requires: a defined environment with a reward signal
- Feedback signal: scalar reward (often delayed, sparse, or noisy)
Structure
Agent observes state → Takes action → Environment returns reward + new state → Agent updates policy → Repeat
Key Concepts
- Policy — the agent's strategy; maps states to actions
- Reward — scalar signal indicating how good an action was
- Value function — estimated cumulative future reward from a given state
- Exploration vs. exploitation — balance between trying new actions and using what already works
- Delayed reward — the consequences of an action may only become clear many steps later (e.g., a bad move in chess)
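These concepts can be seen together in a small tabular Q-learning sketch. The environment is a hypothetical five-state corridor in which only the final step into the goal is rewarded, so the reward is delayed; the hyperparameters (`alpha`, `gamma`, `epsilon`) and all names are assumptions for illustration.

```python
import random

N_STATES = 5       # states 0..4; state 4 is the goal (terminal)
ACTIONS = [0, 1]   # 0 = move left, 1 = move right

def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0  # reward arrives only on reaching the goal
    return next_state, reward, done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # value estimate per (state, action)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore: try a random action
            else:
                best = max(q[state])          # exploit: pick a best-known action
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Q-learning update: move toward reward + discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# Greedy policy: the highest-valued action in each non-terminal state.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

Although only the last move is rewarded, the discounted update propagates value backward along the corridor, so the learned policy moves right from every state.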
Examples
| Domain | Agent | Actions | Reward |
|---|---|---|---|
| Games | Game-playing AI (AlphaGo, OpenAI Five) | Board moves / controller inputs | Win/loss, score |
| Robotics | Robot arm | Joint torques | Task completion, efficiency |
| LLM fine-tuning (RLHF) | Language model | Token generation | Score from a reward model trained on human preferences |
| Ad bidding | Bidding system | Bid amount | Revenue from click/conversion |
| Drug discovery | Molecule generator | Add/remove atoms | Predicted binding affinity |
Common Algorithms
| Algorithm | Notes |
|---|---|
| Q-Learning / DQN | Learn value of each action in each state; Deep Q-Network extends with neural nets |
| Policy Gradient (REINFORCE) | Directly optimize the policy; high variance |
| PPO (Proximal Policy Optimization) | Stable and widely used; a common choice for RLHF |
| SAC (Soft Actor-Critic) | Off-policy; sample efficient; strong in continuous action spaces (robotics) |
| AlphaZero / MuZero | Combines RL with tree search; superhuman game-playing |
Modern Variants
The three paradigms above are foundational but increasingly blended in practice.
Semi-Supervised Learning
A small amount of labeled data + a large amount of unlabeled data.
Labeling is expensive; unlabeled data is cheap. The model uses the unlabeled data to learn better representations, then refines with the small labeled set.
- Example: you have 200 labeled X-rays and 50,000 unlabeled scans — semi-supervised learning uses all 50,200 images
- Techniques: pseudo-labeling, consistency regularization, label propagation
- Methods: FixMatch, MixMatch, graph-based methods
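Pseudo-labeling, the first technique listed, can be sketched as follows: train on the few labeled points, label the unlabeled points the model is confident about, then retrain on both. The nearest-centroid classifier, the one-dimensional data, and the confidence margin of 2.0 are all illustrative assumptions, not part of any named method.

```python
def fit_centroids(xs, ys):
    """Nearest-centroid classifier on 1-D features: one mean per class."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
            for c in set(ys)}

def predict_with_margin(centroids, x):
    """Return (predicted class, margin between the two nearest centroids)."""
    dists = sorted((abs(x - m), c) for c, m in centroids.items())
    (d1, label), (d2, _) = dists[0], dists[1]
    return label, d2 - d1

# Two labeled examples and many unlabeled ones (hypothetical values).
labeled_x, labeled_y = [1.0, 9.0], [0, 1]
unlabeled_x = [0.5, 1.5, 2.0, 2.5, 4.9, 7.0, 7.5, 8.0, 8.5]

# Step 1: train on the labeled data only.
model = fit_centroids(labeled_x, labeled_y)

# Step 2: pseudo-label unlabeled points, keeping only confident predictions.
pseudo_x, pseudo_y = [], []
for x in unlabeled_x:
    label, margin = predict_with_margin(model, x)
    if margin >= 2.0:  # assumed confidence threshold
        pseudo_x.append(x)
        pseudo_y.append(label)

# Step 3: retrain on labeled plus pseudo-labeled data.
model = fit_centroids(labeled_x + pseudo_x, labeled_y + pseudo_y)
```

The ambiguous point at 4.9 is left out because the model cannot tell which class it belongs to, while the remaining unlabeled points shift the class centroids toward where the data actually lies.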
Self-Supervised Learning
The model creates its own labels from the structure of the data.
No human annotation required. The model is given a pretext task derived from the data itself — predicting a masked portion, predicting the next token, predicting if two image patches are from the same image. Solving this forces the model to learn rich, generalizable representations.
- This is how modern LLMs are pretrained: given a sequence of text, predict the next token. The "label" is simply the token that actually follows in the corpus, so no human annotation is needed.
- Also how vision foundation models train: mask patches of an image and predict the missing pixels (MAE), or learn that two augmented views of the same image should have similar embeddings (SimCLR, DINO)
| Model | Pretext task | Learned representation used for |
|---|---|---|
| GPT / LLaMA | Predict next token | Text generation, reasoning, classification |
| BERT | Predict masked tokens (MLM) | Text classification, NER, QA |
| MAE | Predict masked image patches | Image classification, detection, segmentation |
| SimCLR / DINO | Map two augmented views of an image to similar embeddings (contrastive in SimCLR, self-distillation in DINO) | Visual features without labels |
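The way labels fall out of the data itself is easiest to see for the two text pretext tasks in the table. The sketch below builds training examples only; it does not train a model. Whitespace tokenization, the fixed context size, the hand-picked mask positions, and the function names are simplifying assumptions (real systems use subword tokenizers and choose mask positions at random).

```python
def next_token_pairs(tokens, context_size=3):
    """(context, target) pairs for next-token prediction."""
    return [(tokens[i - context_size:i], tokens[i])
            for i in range(context_size, len(tokens))]

def masked_token_example(tokens, mask_positions, mask_token="[MASK]"):
    """Masked input plus {position: original token} targets."""
    masked = [mask_token if i in mask_positions else t
              for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in mask_positions}
    return masked, targets

tokens = "the cat sat on the mat".split()

pairs = next_token_pairs(tokens)
# First pair: (['the', 'cat', 'sat'], 'on')

masked, targets = masked_token_example(tokens, {2, 5})
# masked:  ['the', 'cat', '[MASK]', 'on', 'the', '[MASK]']
# targets: {2: 'sat', 5: 'mat'}
```

In both cases the "labels" are just tokens that were already in the text, which is why these objectives scale to very large unlabeled corpora.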
Transfer Learning
Not a paradigm on its own, but the dominant practical workflow: pretrain a model on a large task with abundant data (often self-supervised), then fine-tune on a small task-specific labeled dataset.
- Self-supervised pretraining → supervised fine-tuning
- Underlies most modern NLP and computer vision applications
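A heavily simplified sketch of the pretrain-then-fine-tune workflow: a linear model is first fit on a large source dataset, and its weights are then used as the starting point for a few gradient steps on a tiny, related target dataset. Everything here is a toy assumption (the linear model, the two synthetic tasks, the step counts), and the pretraining step is supervised rather than self-supervised to keep the example short.

```python
def gradient_descent(xs, ys, w=0.0, b=0.0, lr=0.05, epochs=10):
    """Fit y ~ w*x + b by gradient descent on MSE, starting from (w, b)."""
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

def mse(w, b, xs, ys):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# "Pretraining": abundant data from a related source task, y = 3x + 1.
source_x = [i / 10 for i in range(-20, 21)]
source_y = [3 * x + 1 for x in source_x]
pre_w, pre_b = gradient_descent(source_x, source_y, epochs=500)

# "Fine-tuning": only three labeled examples from the target task, y = 3.2x + 1.1.
target_x = [-1.0, 0.0, 1.0]
target_y = [3.2 * x + 1.1 for x in target_x]

# Same small training budget, two different starting points.
tuned = gradient_descent(target_x, target_y, w=pre_w, b=pre_b, epochs=10)
scratch = gradient_descent(target_x, target_y, epochs=10)

tuned_loss = mse(*tuned, target_x, target_y)
scratch_loss = mse(*scratch, target_x, target_y)
```

With the same ten update steps, the model that starts from pretrained weights ends much closer to the target task than the one trained from scratch, which is the practical appeal of transfer learning when labeled data and compute are limited. The loss is measured on the same three target examples, so this illustrates the head start rather than generalization.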
Summary
| Paradigm | Labeled data | Feedback signal | Typical use |
|---|---|---|---|
| Supervised | Yes (required) | Ground truth label | Classification, regression, detection |
| Unsupervised | No | Internal criteria | Clustering, dimensionality reduction, generation |
| Reinforcement | No (uses rewards) | Environment reward | Games, robotics, LLM alignment (RLHF) |
| Semi-supervised | Small amount | Labels + unlabeled structure | Low-label medical, vision tasks |
| Self-supervised | No (self-generated) | Pretext task error | LLM pretraining, vision foundation models |