Applying Standardization vs. Normalization: A Primer for UXers Interested in Machine Learning


UX researchers who deal with quantitative data are familiar with standardization and normalization. The reasons for applying them in quantitative UX research are similar to the reasons for applying them in machine learning.

  • These methods help us control the influence of data points on the analysis such that one particular variable or a set of data points does not skew the results.
  • They also don’t drastically alter the shape of the data’s distribution, which matters depending on the questions we are trying to answer.

So, when should either of these methods be applied? Standardization and normalization should be applied when an analysis relies on distance calculations. This is a key point, and since machine learning requires a deeper understanding of distance than UX researchers typically need, I will provide a brief overview of it before diving into the details of standardization and normalization.

A Brief Introduction to Distance

In very simple terms, distance summarizes the difference between data points: the greater the distance, the more dissimilar the data points; the smaller the distance, the more similar they are. An easy example for UX researchers to understand is linear regression. When fitting a regression line to a data set, each data point sits at some distance from the line, and points close to the line have shorter distances than points far from it. You could then argue that shorter distances indicate a better fit of the regression line to the data.

Important Point: Distance-based analyses can be skewed (biased) by numerically larger values. These large values could be outliers depending on your data set.
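To make the point concrete, here is a minimal sketch in Python. The task-time and satisfaction values are hypothetical, invented purely for illustration: because task time is measured in hundreds of seconds while satisfaction lives on a 1–7 scale, the Euclidean distance between the two participants is driven almost entirely by task time.

```python
import numpy as np

# Hypothetical participants measured on task time (seconds) and
# satisfaction (1-7 scale); values invented for illustration.
a = np.array([320.0, 6.0])  # [task time, satisfaction]
b = np.array([95.0, 2.0])

# Euclidean distance between the two participants
distance = np.linalg.norm(a - b)
print(distance)  # ~225.04 -- almost entirely the task-time difference (225)
```

Standardizing or normalizing both variables first would let satisfaction contribute meaningfully to the distance.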

In my experience of UX research, analyses that use distances focus only on explaining observations, such as indicating how much variance in a dependent variable is accounted for by one or more independent variables. In machine learning we go beyond explanation to prediction or classification, where distance matters directly for model performance.

A High Level Comparison of Standardization and Normalization

Standardization: Creates a scale where the mean is 0 and the standard deviation is 1, thereby describing all values in the same units.

Normalization: Creates a scale where all values fall within the range [0, 1] or [-1, 1]. It essentially brings all values of numeric variables into a common scale.

Standardization

Based on the definition above, UX researchers will recognize this as Z-scoring. The formula for standardization is: $x_{sd} = \frac{x_{i} - \bar{x}}{\sigma}$, where $x_{i}$ is an individual value, $\bar{x}$ is the mean of the variable, and $\sigma$ is its standard deviation.
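As a quick illustration of the formula, here is a minimal sketch in Python using hypothetical task-time values:

```python
import numpy as np

# Hypothetical task-time measurements (seconds), invented for illustration
x = np.array([95.0, 120.0, 180.0, 320.0])

# Apply the formula directly: subtract the mean, divide by the
# standard deviation (np.std defaults to the population standard
# deviation, which is also what scikit-learn's StandardScaler uses)
x_sd = (x - x.mean()) / x.std()

print(x_sd)         # values now expressed in standard-deviation units
print(x_sd.mean())  # ~0 (up to floating-point error)
print(x_sd.std())   # 1.0
```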

When to Use Standardization

  • Your data set is normally distributed.
  • Measures (independent variables) have different units.
    • Variables on different scales do not equally contribute to analyses, which is a way of introducing bias.
    • Standardization gives equal consideration to each variable in the analysis.
  • Extreme values/outliers exist in the data set.
    • Since standardized values have no pre-defined range, standardization is less affected by outliers than min-max scaling.
    • Quick Note: Sometimes you want to remove outliers, sometimes you do not. This all depends on your research question, from my perspective.

Some machine learning methods where applying standardization makes sense include:

  • Regression analyses, such as linear regression and logistic regression (the latter used in classification). (Supervised learning)
  • Principal Component Analysis (PCA), used for dimensionality reduction. (Unsupervised learning)

Performing Standardization in R

R has a built-in function, scale(), that can be applied to columns of continuous data. You can learn more in R’s documentation for scale().

Performing Standardization in Python

scikit-learn has a preprocessing module with a class called StandardScaler. You can read the documentation on StandardScaler at the scikit-learn website.
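Something like the following should work; treat it as a sketch with hypothetical data rather than a full recipe:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one row per participant, with columns for
# task time (seconds) and satisfaction (1-7 scale)
X = np.array([[320.0, 6.0],
              [95.0, 2.0],
              [180.0, 5.0],
              [120.0, 3.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and sd 1
print(X_scaled)
```

In a real pipeline you would fit the scaler on training data only and reuse it to transform test data.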

Normalization

Some UX researchers (and many machine learning experts) will also refer to normalization as min-max scaling. The formula for normalization is: $x_{norm} = \frac{x_{i} - x_{min}}{x_{max} - x_{min}}$, where $x_{min}$ and $x_{max}$ are the minimum and maximum values of the variable.
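Again for illustration, here is the formula applied directly in Python to the same hypothetical task-time values used earlier:

```python
import numpy as np

# Hypothetical task-time measurements (seconds), invented for illustration
x = np.array([95.0, 120.0, 180.0, 320.0])

# Apply the formula directly: shift by the minimum, divide by the range
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # [0.0, ~0.11, ~0.38, 1.0]
```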

When to Use Normalization

  • No assumptions are being made about the distribution of your data set, or you cannot assume your data set is normally distributed.
    • This extends to any algorithm you might use in machine learning, not just the assumptions you as the practitioner might make.
  • Variables have different ranges.
  • The standard deviation is very small.

Some machine learning methods where applying normalization makes sense include:

  • K-Nearest Neighbors (KNN), used in classification and regression. (Supervised learning)
  • Neural networks. (Supervised and unsupervised learning)

Performing Normalization in R

R offers a package called caret that contains a function preProcess(). You can read about caret on CRAN. For specific information on preProcess(), visit RDocumentation.

Performing Normalization in Python

scikit-learn’s preprocessing module also offers a class for normalization, MinMaxScaler. You can read about it on the scikit-learn website.
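A minimal sketch, mirroring the StandardScaler example above with the same hypothetical data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: task time (seconds) and satisfaction (1-7 scale)
X = np.array([[320.0, 6.0],
              [95.0, 2.0],
              [180.0, 5.0],
              [120.0, 3.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)  # each column now spans [0, 1]
print(X_scaled)
```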

In future blog posts I will dive into more details on distance, standardization, and normalization, including providing specific examples of using each.