
Numerical Data and Feature Vectors

Numerical data

Numerical data represents measurable quantities expressed as numbers.

These values can be used directly in mathematical calculations and machine learning models.

Examples of numerical data:

  • Temperature
  • Weight
  • Height
  • Age
  • Price
  • Distance

Numerical data is commonly used as input features for machine learning models.


Feature vectors

A feature vector is a collection (array) of numerical values that represents an example in a machine learning model.

In practice:

  • Each example in a dataset is converted into a vector
  • Each element of the vector represents one feature
  • Values are typically stored as floating-point numbers

Example feature vector:

[temperature, humidity, wind_speed, pressure]
[22.5, 0.65, 12.3, 1013.2]

This numerical representation allows machine learning models to process and learn patterns from the data.
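
As a minimal sketch of that conversion, the snippet below turns one record into a feature vector with a fixed feature order. The record layout and the names `FEATURE_ORDER` and `to_feature_vector` are hypothetical, chosen only to mirror the example above.

```python
# Hypothetical feature order, mirroring the example vector above.
FEATURE_ORDER = ["temperature", "humidity", "wind_speed", "pressure"]


def to_feature_vector(record, feature_order=FEATURE_ORDER):
    """Return the record's values as floats, always in the same feature order."""
    return [float(record[name]) for name in feature_order]


# One example (row) of a hypothetical weather dataset.
record = {"temperature": 22.5, "humidity": 0.65, "wind_speed": 12.3, "pressure": 1013.2}
vector = to_feature_vector(record)
print(vector)  # [22.5, 0.65, 12.3, 1013.2]
```

Keeping the order fixed matters because a model identifies each feature by its position in the vector.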


Feature engineering

Feature engineering is the process of transforming raw data into a representation that a machine learning model can learn from effectively.

Goals of feature engineering:

  • Improve model performance
  • Convert raw data into numerical form
  • Highlight useful patterns for the model

Feature engineering often involves:

  • Transforming raw values
  • Scaling data
  • Creating new derived features


Common feature engineering techniques

Normalization

Normalization converts numerical values into a standard range, typically between 0 and 1.

Purpose:

  • Prevent features with large scales from dominating the model
  • Improve training stability

Example:

Original values: [0, 50, 100]
Normalized values: [0.0, 0.5, 1.0]
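
The example values can be reproduced with a short sketch of rescaling to the 0 to 1 range. The function name `min_max_scale` and the handling of a constant feature are assumptions made for illustration.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature has no spread, so map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_scale([0, 50, 100])
print(normalized)  # [0.0, 0.5, 1.0]
```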


Binning

Binning converts numerical values into ranges (buckets).

Purpose:

  • Simplify continuous data
  • Reduce noise
  • Capture broader patterns

Example: Age values → bins

0–18 → child
19–35 → young adult
36–60 → adult
61+ → senior

This transformation can make patterns easier for some models to learn.
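
A minimal sketch of this age binning follows. The names `UPPER_BOUNDS`, `LABELS`, and `age_bin` are hypothetical, and the code assumes that an age of exactly 60 belongs to the adult bin.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first three bins; larger ages fall into the last bin.
# Assumption: 60 counts as "adult", so "senior" starts at 61.
UPPER_BOUNDS = [18, 35, 60]
LABELS = ["child", "young adult", "adult", "senior"]


def age_bin(age):
    """Return the label of the bin that contains the given age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_left finds the first upper bound that is >= age.
    return LABELS[bisect_left(UPPER_BOUNDS, age)]


print([age_bin(a) for a in (7, 18, 19, 42, 60, 61)])
# ['child', 'child', 'young adult', 'adult', 'adult', 'senior']
```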


Summary

  • Numerical data represents measurable quantities expressed as numbers
  • A feature vector is an array of numerical values describing one example
  • Feature engineering transforms raw data into model-friendly representations
  • Normalization scales values into a standard range
  • Binning groups numerical values into ranges

Data Preprocessing

What is data preprocessing?

Data preprocessing is the process of preparing raw data so that it can be effectively used by a machine learning model.

It typically involves:

  • Cleaning the data
  • Transforming feature values
  • Handling missing or incorrect values
  • Scaling or normalizing data
  • Creating additional features

The goal is to convert raw datasets into high-quality, model-ready data.


Outliers

What is an outlier?

An outlier is a value that is significantly distant from most other values in a feature or label.

Outliers can negatively affect:

  • Model training
  • Statistical estimates
  • Model stability

Types of outliers

Outliers can be categorized as:

1. Outliers caused by mistakes

  • Measurement errors
  • Incorrect data entry
  • Sensor malfunction

These should usually be removed or corrected.

2. Legitimate outliers

  • Rare but valid observations
  • Natural extreme values

These may contain important information and should be carefully evaluated before removal.


Normalization

Why normalization is needed

Normalization transforms feature values so they exist on a similar scale.

Benefits include:

  • Helps models converge faster during training
  • Prevents features with large values from dominating others
  • Improves prediction stability
  • Reduces the risk of numerical problems during training (such as very large values overflowing and producing NaN)
  • Helps the model learn appropriate weights for each feature

Common normalization methods

Linear scaling

Linear scaling transforms values from their original range into a normalized range.

Two common approaches:

Min–Max scaling

Rescales values to a fixed interval (often 0–1):

X_scaled = (X - min) / (max - min)

Mean normalization

Centers values around the mean:

X_scaled = (X - mean) / (max - min)
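
To make the mean normalization formula concrete, here is a small sketch using only the Python standard library. The function name `mean_normalize` and the treatment of a constant feature are assumptions for illustration.

```python
from statistics import fmean


def mean_normalize(values):
    """Apply (X - mean) / (max - min) to every value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature has no spread, so map every value to 0.0.
        return [0.0 for _ in values]
    mean = fmean(values)
    return [(v - mean) / (hi - lo) for v in values]


scaled = mean_normalize([0, 50, 100])
print(scaled)  # [-0.5, 0.0, 0.5]
```

The scaled values average to zero, and their overall spread (maximum minus minimum) is 1.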

When to use linear scaling

Best used when:

  • Feature bounds do not change significantly over time
  • The dataset contains few or no outliers
  • Values are roughly uniformly distributed


Z-score scaling (standardization)

Z-score scaling converts values into standard deviation units from the mean.

Formula: Z = (X - μ) / σ

Where:

  • μ = mean
  • σ = standard deviation

The result expresses how many standard deviations a value lies above or below the mean. After scaling, the feature has a mean of 0 and a standard deviation of 1.

When to use Z-score scaling

Best when:

  • Data follows a normal or near-normal distribution
  • Features have different units or ranges
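
The Z-score formula can be sketched as follows. The notes do not say whether σ is the population or the sample standard deviation; this sketch assumes the population form (`statistics.pstdev`), and the function name `z_score_scale` is hypothetical.

```python
from statistics import fmean, pstdev


def z_score_scale(values):
    """Apply Z = (X - mu) / sigma to every value."""
    mu = fmean(values)
    sigma = pstdev(values)  # Assumption: population standard deviation.
    if sigma == 0:
        # Assumption: a constant feature has no spread, so map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Illustrative data with mean 5 and population standard deviation 2.
z = z_score_scale([2, 4, 4, 4, 5, 5, 7, 9])
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```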


Log scaling

Log scaling applies the logarithm to the raw value, so it requires X > 0:

X_scaled = log(X)

When log scaling is useful

Log scaling is helpful when the data follows a power-law distribution, meaning:

  • Small values occur very frequently
  • Large values occur rarely
  • Values span several orders of magnitude

Examples:

  • Movie rating counts
  • Website traffic
  • Income distribution
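
A brief sketch shows how log scaling compresses values that span several orders of magnitude. It assumes the natural logarithm and strictly positive inputs; the function name `log_scale` and the sample counts are made up for illustration.

```python
import math


def log_scale(values):
    """Apply the natural logarithm to every value; all inputs must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log scaling requires strictly positive values")
    return [math.log(v) for v in values]


# Hypothetical rating counts spanning four orders of magnitude.
counts = [1, 10, 100, 1000, 10000]
scaled = log_scale(counts)
print([round(s, 2) for s in scaled])  # [0.0, 2.3, 4.61, 6.91, 9.21]
```

Each tenfold increase in the raw value adds the same constant (about 2.3) to the scaled value. If a feature can be zero, one common workaround is `math.log1p`, which computes log(1 + x); whether that suits a given dataset is a judgment call.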


Handling outliers

Clipping

Clipping limits extreme values to a specified maximum or minimum threshold.

Example: If value > threshold → set value = threshold (and likewise, a value below a minimum threshold is raised to that minimum)

Purpose:

  • Reduce influence of extreme outliers
  • Improve model stability
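
Clipping on both sides can be sketched in a few lines. The function name `clip` and the thresholds 0 and 100 are illustrative assumptions, not values taken from these notes.

```python
def clip(values, lower, upper):
    """Limit every value to the closed interval [lower, upper]."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return [min(max(v, lower), upper) for v in values]


clipped = clip([-5, 3, 8, 120], lower=0, upper=100)
print(clipped)  # [0, 3, 8, 100]
```

Values inside the interval pass through unchanged; only the extremes are moved to the nearest threshold.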


Binning (Bucketing)

Binning converts numerical values into categorical ranges (bins).

Example:

Income bins:

  • 0–20k
  • 20k–50k
  • 50k–100k
  • 100k+

When to use binning

  • When the relationship between feature and label is weak or nonlinear
  • When feature values cluster naturally
  • When interpretability is important

Quantile bucketing

Quantile bucketing creates bins such that each bin contains roughly the same number of examples.

Benefits:

  • Helps balance the dataset across bins
  • Reduces the influence of extreme outliers
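
One way to sketch quantile bucketing with the Python standard library is to compute cut points with `statistics.quantiles` and look up each value with `bisect`. The helper names and the sample data (one extreme outlier among small values) are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_boundaries(values, n_buckets):
    """Return n_buckets - 1 cut points that split values into equal-count groups."""
    return quantiles(values, n=n_buckets)


def bucket_index(value, boundaries):
    """Return the 0-based bucket of a value; a value on a cut point goes to the upper bucket."""
    return bisect_right(boundaries, value)


# Hypothetical feature values with one extreme outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1000]
boundaries = quantile_boundaries(values, n_buckets=4)
buckets = [bucket_index(v, boundaries) for v in values]
print(boundaries)  # [3.25, 6.5, 9.75]
print(buckets)     # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```

The outlier 1000 simply lands in the last bucket with 10 and 11, so its magnitude no longer affects the feature. With many tied values, bucket sizes can become uneven.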


Data scrubbing (data cleaning)

Data scrubbing removes unreliable examples from datasets.

Common problems include:

Problem               Example
Omitted values        Missing age in a census record
Duplicate examples    Logs uploaded twice
Out-of-range values   Typing an extra digit
Incorrect labels      Mislabeling an image

Scripts or data pipelines can detect:

  • Missing values
  • Duplicate entries
  • Invalid feature ranges
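
A minimal sketch of such a script is shown below. It drops examples that have a missing value, repeat an earlier example, or fall outside an allowed range. The record layout, the `valid_ranges` mapping, and the function name `scrub` are assumptions for illustration; a real pipeline might correct or flag examples instead of dropping them.

```python
def scrub(records, valid_ranges):
    """Return the records that pass the missing-value, duplicate, and range checks."""
    seen = set()
    clean = []
    for record in records:
        # Missing values: a checked feature is absent or None.
        if any(record.get(name) is None for name in valid_ranges):
            continue
        # Duplicate entries: identical feature/value pairs seen before.
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        # Invalid feature ranges: a value outside its allowed [low, high] interval.
        if any(not (low <= record[name] <= high)
               for name, (low, high) in valid_ranges.items()):
            continue
        clean.append(record)
    return clean


# Hypothetical dataset and allowed ranges.
records = [
    {"age": 34, "height_cm": 172.0},
    {"age": None, "height_cm": 180.0},  # omitted value
    {"age": 34, "height_cm": 172.0},    # duplicate of the first record
    {"age": 340, "height_cm": 165.0},   # extra digit typed in age
    {"age": 51, "height_cm": 160.5},
]
valid_ranges = {"age": (0, 120), "height_cm": (30.0, 250.0)}
clean = scrub(records, valid_ranges)
print(clean)  # [{'age': 34, 'height_cm': 172.0}, {'age': 51, 'height_cm': 160.5}]
```

Incorrect labels are not covered here, since detecting them usually needs human review or a separate check.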

Qualities of good numerical features

Good numerical features should be:

  • Clearly named
  • Validated before training
  • Sensible and meaningful
  • Consistent across the dataset

High-quality features improve model accuracy and reliability.


Polynomial feature transformations

Sometimes it is useful to create synthetic features from existing numerical features.

Example:

Original feature: x

Polynomial features: x², x³

Purpose:

  • Capture nonlinear relationships
  • Improve model expressiveness
  • Allow linear models to learn more complex patterns
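
As a small sketch, polynomial features for a single numerical feature can be generated like this. The function name `polynomial_features` and the default degree of 3 are assumptions that match the x, x², x³ example.

```python
def polynomial_features(x, degree=3):
    """Return [x, x**2, ..., x**degree] for one feature value."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    return [x ** power for power in range(1, degree + 1)]


features = polynomial_features(2.0)
print(features)  # [2.0, 4.0, 8.0]
```

A linear model trained on these columns computes w1·x + w2·x² + w3·x³ + b. That expression is still linear in the weights, but it traces a curve in x.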


Summary

Data preprocessing prepares raw data for machine learning models.

Key techniques include:

  • Detecting and handling outliers
  • Normalizing feature values
  • Applying transformations like log scaling
  • Using binning and quantile bucketing
  • Cleaning unreliable data
  • Creating synthetic features such as polynomial features

Effective preprocessing significantly improves model performance and stability.