
Supervised Learning

Overview

Supervised machine learning is based on learning a relationship between inputs and outputs using labeled data.

The core concepts of supervised learning are:

  • Data
  • Model
  • Training
  • Evaluating
  • Inference


Data

Data is the driving force of machine learning.

It can appear as:

  • Numbers and words stored in tables
  • Pixel values in images
  • Waveforms in audio files

Related data is stored in datasets.

Examples of datasets

  • Images of cats
  • Housing prices
  • Weather information

Dataset structure

A dataset is made up of individual examples.

An example is analogous to a single row in a spreadsheet.

Each example contains:

  • Features
  • Label

Features

  • Input values used by the model to make predictions

Label

  • The “answer”
  • The value the model is trying to predict

Example (weather prediction)

  • Features: latitude, longitude, temperature, humidity, cloud coverage, wind direction, atmospheric pressure
  • Label: rainfall amount

Examples that contain both features and a label are called labeled examples.
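
As a minimal sketch, a labeled example from the weather case can be written as a small Python structure. The class name `LabeledExample`, the helper `to_inputs_and_label`, the units, and all numeric values are assumptions made for illustration; they do not come from a real dataset.

```python
from dataclasses import dataclass

# Fixed feature order so every example maps to the same input layout.
FEATURE_NAMES = [
    "latitude",
    "longitude",
    "temperature",
    "humidity",
    "cloud_coverage",
    "wind_direction",
    "atmospheric_pressure",
]


@dataclass
class LabeledExample:
    features: dict  # feature name -> input value used for prediction
    label: float    # rainfall amount, the value the model tries to predict


def to_inputs_and_label(example):
    """Split a labeled example into an ordered feature list and its label."""
    inputs = [example.features[name] for name in FEATURE_NAMES]
    return inputs, example.label


# Made-up values for illustration only; units are assumptions.
example = LabeledExample(
    features={
        "latitude": 47.6,
        "longitude": -122.3,
        "temperature": 12.5,             # degrees Celsius (assumed)
        "humidity": 0.81,                # fraction between 0 and 1 (assumed)
        "cloud_coverage": 0.9,           # fraction between 0 and 1 (assumed)
        "wind_direction": 225.0,         # degrees from north (assumed)
        "atmospheric_pressure": 1009.0,  # hPa (assumed)
    },
    label=4.2,  # rainfall in millimetres (assumed)
)

inputs, label = to_inputs_and_label(example)
print(len(inputs), label)
```

Keeping the features and the label in separate fields mirrors how they are used: the features are what a model reads, and the label is what its prediction is compared against.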


Dataset characteristics

Datasets are characterized by:

  • Size: the number of examples
  • Diversity: the range of conditions covered by the examples

Good datasets are both large and highly diverse.

However:

  • A large dataset does not guarantee diversity
  • A diverse dataset does not guarantee enough examples

Examples

  • 100 years of data covering only July → poor predictions for January
  • A few years of data covering all months → poor predictions due to limited historical variability

Both size and diversity are required for reliable learning.
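
The two weather cases above can be sketched in Python by measuring size and diversity separately. This is only an illustration: the function `summarize` is hypothetical, month coverage stands in as one simple proxy for diversity, and the example counts (one example per day, 30-day months in the second dataset) are simplifying assumptions.

```python
from collections import Counter


def summarize(months):
    """Summarize a dataset given the month number (1-12) of each example.

    Returns (size, months_covered, missing_months).
    """
    counts = Counter(months)
    size = len(months)            # size: number of examples
    months_covered = len(counts)  # diversity proxy: distinct months seen
    missing_months = [m for m in range(1, 13) if m not in counts]
    return size, months_covered, missing_months


# Case 1: 100 years of daily data, but only for July (31 days per year).
july_only = [7] * (100 * 31)

# Case 2: 3 years covering all months (30 days per month, for simplicity).
few_years = [m for _ in range(3) for m in range(1, 13) for _ in range(30)]

print(summarize(july_only))  # large, but covers a single month
print(summarize(few_years))  # covers every month, but far fewer examples
```

The first dataset is almost three times larger yet contains no January examples at all, while the second covers every month with only 90 examples each. Neither number alone indicates whether the dataset is adequate.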


Model

In supervised learning, a model is a complex collection of numbers that defines a mathematical relationship between: - Input features