<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.2">Jekyll</generator><link href="https://lesliemcfarlin.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://lesliemcfarlin.com/" rel="alternate" type="text/html" /><updated>2022-12-11T13:45:54-08:00</updated><id>https://lesliemcfarlin.com/feed.xml</id><title type="html">Liv McFarlin | UX + Data + AI/ML</title><subtitle>personal description</subtitle><author><name>Liv (Leslie) McFarlin</name></author><entry><title type="html">Creating Hashes in R with the Hash Package</title><link href="https://lesliemcfarlin.com/posts/2022/11/blog-post-4/" rel="alternate" type="text/html" title="Creating Hashes in R with the Hash Package" /><published>2022-12-11T00:00:00-08:00</published><updated>2022-12-11T00:00:00-08:00</updated><id>https://lesliemcfarlin.com/posts/2022/11/blog-post-4</id><content type="html" xml:base="https://lesliemcfarlin.com/posts/2022/11/blog-post-4/">&lt;p&gt;R does not provide a native hash table structure, which is unfortunate because if you need a fast and efficient way to retrieve information without worrying about element order, the hash table is a decent data structure choice. R users are not without options, though. The first option involves using an environment variable.
&lt;img src=&quot;/images/R_env_var_hash.png&quot; title=&quot;New environment variable in R with hash = True.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The downside to this is that one cannot easily use vectors as keys or values. A viable alternative, however, can be found in an R package named hash. Hash is an easy way of implementing hashes without relying on environment variables.&lt;/p&gt;

&lt;h2 id=&quot;using-the-hash-package&quot;&gt;Using the Hash Package&lt;/h2&gt;
&lt;p&gt;As always, before you can use hash it has to be installed. Once installed, include it in your R file with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;library(&quot;hash&quot;)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To give a concrete example of how to use the hash package, imagine a vector of 10 names:
&lt;img src=&quot;/images/hash-key-vector.png&quot; title=&quot;Vector with 10 full names in it.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Then, create a second vector containing ages. The example below shows how to randomly generate 10 numbers between 18 and 70.
&lt;img src=&quot;/images/hash-value-vector.png&quot; title=&quot;Code for a vector with 10 randomly generated numbers between 18 and 70.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The random numbers generated for our example are 44, 40, 67, 35, 41, 53, 55, 56, 52, and 58. To map the keys (names) to the age values, use the function hash().
&lt;img src=&quot;/images/name-age-hash.png&quot; title=&quot;Code to create a hash named name_age_hash.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This should produce output like the following:
&lt;img src=&quot;/images/hash-output.png&quot; title=&quot;Output of the name_age_hash object.&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;useful-functions&quot;&gt;Useful Functions&lt;/h2&gt;
&lt;p&gt;Aside from the hash() function to create the actual hash, you can use the following functions:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;keys(), to retrieve all key values within a hash.
&lt;img src=&quot;/images/hash-keys.png&quot; title=&quot;Output showing all keys in the name_age_hash object.&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;values(), to retrieve all values in a hash or a single value.
&lt;img src=&quot;/images/hash-values.png&quot; title=&quot;Output showing how to use the function values().&quot; /&gt;
    &lt;ul&gt;
      &lt;li&gt;Note that you can also use double square brackets or the dollar sign to access a single value by its key.
&lt;img src=&quot;/images/hash-value-dollarsign.png&quot; title=&quot;Code showing how to use the dollar sign to access a value by its key.&quot; /&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;.set(), to add a new key-value pair to the hash.
&lt;img src=&quot;/images/set-key-value.png&quot; title=&quot;Code demonstrating use of the .set() function.&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;has.key(), to verify the existence of a key in a hash.
&lt;img src=&quot;/images/has-key.png&quot; title=&quot;Code example of the has.key() function.&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;invert(), to swap keys and values. Just a note of caution here: There may be repetition of values, so swapping them for keys could lead to problems in data retrieval.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To learn more about the hash package, view its documentation at the &lt;a href=&quot;https://cran.r-project.org/web/packages/hash/hash.pdf&quot;&gt;CRAN repository&lt;/a&gt;.&lt;/p&gt;</content><author><name>Liv (Leslie) McFarlin</name></author><category term="R" /><category term="Packages" /><summary type="html">R does not provide a native hash table structure, which is unfortunate because if you need a fast and efficient way to retrieve information without worrying about element order, the hash table is a decent data structure choice. R users are not without options, though. The first option involves using an environment variable.</summary></entry><entry><title type="html">Case Study: Creating Data Generator</title><link href="https://lesliemcfarlin.com/posts/2022/08/blog-post-3/" rel="alternate" type="text/html" title="Case Study: Creating Data Generator" /><published>2022-08-07T00:00:00-07:00</published><updated>2022-08-07T00:00:00-07:00</updated><id>https://lesliemcfarlin.com/posts/2022/08/blog-post-3</id><content type="html" xml:base="https://lesliemcfarlin.com/posts/2022/08/blog-post-3/">&lt;p&gt;A custom tool built off of multiple python scripts was created for an in-house UX team to make it easier to create data-heavy prototypes.&lt;/p&gt;

&lt;p&gt;Github location: &lt;a href=&quot;https://github.com/lammypi/data-generator&quot;&gt;Github link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; Updated 6 October 2022.&lt;/p&gt;

&lt;h2 id=&quot;scenario&quot;&gt;Scenario&lt;/h2&gt;

&lt;p&gt;When creating prototypes for data intensive products and services, the responsibility for the creation of dummy data falls on the UX architect. This makes sense because the UX architect knows the context for the prototype, such as:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Who are the users?&lt;/li&gt;
  &lt;li&gt;What are their needs?&lt;/li&gt;
  &lt;li&gt;What problem(s) are we solving for them?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;issue&quot;&gt;Issue&lt;/h2&gt;

&lt;p&gt;Creating dummy data can be very time-consuming, depending on the types of data and the number of data points needed. A small number of data points is quicker and easier to create than a large one. For example, 10 names or six-character alphanumeric IDs are easier to create than 30. More complex data points, such as VINs and vehicle descriptions, are difficult to create even in small numbers. Creating these more challenging data points may require:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Using client data from the system, which can be a legal concern if that data falls under the category of personally identifiable information (PII).&lt;/li&gt;
  &lt;li&gt;Using multiple tools across the internet to create realistic-seeming data points.&lt;/li&gt;
&lt;/ul&gt;
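
&lt;p&gt;As a rough illustration of the simpler kind of data point mentioned above, the sketch below generates unique six-character alphanumeric IDs using only the Python standard library. The helper name make_ids and its parameters are hypothetical and are not taken from Data Generator itself.&lt;/p&gt;

```python
import random
import string

# Characters an ID may contain: uppercase letters and digits.
ALPHABET = string.ascii_uppercase + string.digits


def make_ids(count, length=6, seed=None):
    """Return `count` unique alphanumeric IDs (hypothetical helper)."""
    rng = random.Random(seed)
    ids = set()
    # Keep drawing until enough distinct IDs exist; the set discards repeats.
    while len(ids) != count:
        ids.add("".join(rng.choices(ALPHABET, k=length)))
    return sorted(ids)


print(make_ids(5, seed=42))
```

&lt;p&gt;Collecting the IDs in a set is one simple way to avoid the repeated values that copying and pasting tends to introduce.&lt;/p&gt;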

&lt;h2 id=&quot;solution&quot;&gt;Solution&lt;/h2&gt;

&lt;p&gt;To solve the issues above, I created a tool for my team that generates sets of dummy data for their prototypes. The solution offers:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;One-stop shop for generating a variety of data:
    &lt;ul&gt;
      &lt;li&gt;Names&lt;/li&gt;
      &lt;li&gt;Contact information&lt;/li&gt;
      &lt;li&gt;Dates&lt;/li&gt;
      &lt;li&gt;IDs. More often than not, only 3 types of IDs appear in our prototypes, although some products have 5.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Low chance of exactly replicating PII sets that appear in the company’s systems.&lt;/li&gt;
  &lt;li&gt;Eliminates copying and pasting of blocks of data where repetition could be missed and break the immersion for those interacting with the prototype.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This single solution enables our UX architects to create prototypes faster and with fewer criticisms and cycles of correction than previously.&lt;/p&gt;

&lt;h2 id=&quot;tool-anatomy&quot;&gt;Tool Anatomy&lt;/h2&gt;

&lt;p&gt;Data Generator is a tool based on multiple Python scripts that runs from the command line interface (CLI). It produces a CSV file containing the following data points:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Company Name&lt;/li&gt;
  &lt;li&gt;Vehicle ID&lt;/li&gt;
  &lt;li&gt;Make&lt;/li&gt;
  &lt;li&gt;Model&lt;/li&gt;
  &lt;li&gt;Year&lt;/li&gt;
  &lt;li&gt;VIN&lt;/li&gt;
  &lt;li&gt;License Plate&lt;/li&gt;
  &lt;li&gt;Fleet ID&lt;/li&gt;
  &lt;li&gt;Division ID&lt;/li&gt;
  &lt;li&gt;First Name&lt;/li&gt;
  &lt;li&gt;Last Name&lt;/li&gt;
  &lt;li&gt;Email Address&lt;/li&gt;
  &lt;li&gt;Phone Number&lt;/li&gt;
  &lt;li&gt;Driver ID&lt;/li&gt;
  &lt;li&gt;Street Address&lt;/li&gt;
  &lt;li&gt;City&lt;/li&gt;
  &lt;li&gt;State/Province&lt;/li&gt;
  &lt;li&gt;Zipcode/Postal Code&lt;/li&gt;
&lt;/ul&gt;
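
&lt;p&gt;To make the shape of that output concrete, here is a minimal sketch that builds a few of the columns listed above and serializes them as CSV. It is not the actual Data Generator code: the name pools, the example.com email domain, the year range, and the functions build_rows and to_csv are all assumptions for illustration, and it uses the standard csv module rather than pandas to stay self-contained.&lt;/p&gt;

```python
import csv
import io
import random

# Placeholder pools (assumed values, not the contents of the real text files).
FIRST_NAMES = ["Ana", "Ben", "Chloe"]
LAST_NAMES = ["Diaz", "Evans", "Fong"]


def build_rows(count, seed=None):
    """Build `count` dummy rows with a handful of the listed columns."""
    rng = random.Random(seed)
    rows = []
    for _ in range(count):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        rows.append({
            "First Name": first,
            "Last Name": last,
            "Email Address": f"{first}.{last}@example.com".lower(),
            "Year": rng.randint(2010, 2022),
        })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text, starting with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(build_rows(3, seed=7)))
```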

&lt;h3 id=&quot;components&quot;&gt;Components&lt;/h3&gt;

&lt;p&gt;Data Generator relies on 2 Python scripts and 6 text files.&lt;/p&gt;

&lt;h4 id=&quot;python-scripts&quot;&gt;Python Scripts&lt;/h4&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Script&lt;/th&gt;
      &lt;th&gt;Description&lt;/th&gt;
      &lt;th&gt;Dependencies&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;data_generator.py&lt;/td&gt;
      &lt;td&gt;controls the creation of dummy data and its compilation into a CSV file.&lt;/td&gt;
      &lt;td&gt;pandas, random, string, random_names&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;random_names.py&lt;/td&gt;
      &lt;td&gt;generates names and locations.&lt;/td&gt;
      &lt;td&gt;pandas, random, string, datetime&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h4 id=&quot;text-files&quot;&gt;Text Files&lt;/h4&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;File Name&lt;/th&gt;
      &lt;th&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;can_provinces.txt&lt;/td&gt;
      &lt;td&gt;30 major cities paired with their provinces and postal codes.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;first_name.txt&lt;/td&gt;
      &lt;td&gt;2000 names, 1000 per gender.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;last_name.txt&lt;/td&gt;
      &lt;td&gt;1000 common last names in North America.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;street_names.txt&lt;/td&gt;
      &lt;td&gt;52 common North American street names.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;us_states.txt&lt;/td&gt;
      &lt;td&gt;150 major cities paired with their states and zipcodes.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;vehicles.txt&lt;/td&gt;
      &lt;td&gt;131 combinations of vehicle models with their respective manufacturers.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 id=&quot;remaining-issues&quot;&gt;Remaining Issues&lt;/h3&gt;

&lt;p&gt;Depending on the service, some of the individual data points may need to be concatenated. However, our team prototypes in Axure, so for some widgets (such as repeaters) it’s easy enough to do this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 1:&lt;/strong&gt; First and last names. In a repeater, ensure first and last name have their own columns in the underlying table. If they are named FirstName and LastName, respectively, you can concatenate these fields like [[Item.FirstName]] [[Item.LastName]]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 2:&lt;/strong&gt; Year, make, and model. These can be joined together in a similar manner as described for first and last names.&lt;/p&gt;

&lt;p&gt;Some products also use more detailed vehicle descriptions beyond the combination of year, make, and model. These typically include information like number of doors, trim level, or engine type. Data Generator does not currently support that.&lt;/p&gt;

&lt;p&gt;Some products may require additional ID numbers and descriptions not covered by Data Generator:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Request numbers&lt;/li&gt;
  &lt;li&gt;Request categories and subcategories&lt;/li&gt;
  &lt;li&gt;Order numbers&lt;/li&gt;
  &lt;li&gt;File attachment names and types (ex.: XLS, PDF, CSV, DOC)&lt;/li&gt;
&lt;/ul&gt;</content><author><name>Liv (Leslie) McFarlin</name></author><category term="Case Study" /><category term="Python" /><category term="UX" /><summary type="html">A custom tool built off of multiple python scripts was created for an in-house UX team to make it easier to create data-heavy prototypes.</summary></entry><entry><title type="html">Geometric Distances: A Crash Course</title><link href="https://lesliemcfarlin.com/posts/2022/08/blog-post-2/" rel="alternate" type="text/html" title="Geometric Distances: A Crash Course" /><published>2022-08-06T00:00:00-07:00</published><updated>2022-08-06T00:00:00-07:00</updated><id>https://lesliemcfarlin.com/posts/2022/08/blog-post-2</id><content type="html" xml:base="https://lesliemcfarlin.com/posts/2022/08/blog-post-2/">&lt;p&gt;In a &lt;a href=&quot;https://www.lesliemcfarlin.com/posts/2022/05/blog-post-1/&quot;&gt;previous blog post&lt;/a&gt; I introduced the concepts of standardization and normalization. A concept common to both of those techniques is distance. Simply put, distance is a mathematical summary of the differences between two objects. These objects can be data points or they can be full distributions. Distance measures fall into two categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Geometric Distances:&lt;/strong&gt; Measure of similarity between vectors based on the distance between them in multidimensional space.&lt;br /&gt;
&lt;strong&gt;Statistical Distances:&lt;/strong&gt; Distance between statistical objects denoting similarity between them.&lt;/p&gt;

&lt;p&gt;As the title of this post indicates, this post will focus on geometric distances. For most people, the concept of geometric distance is pretty easy to grasp as it is the one most similar to how we think about distance in the physical world. The common perception of distance is the length of the space between two points, such as one’s home and the grocery store. We either think of these distances as being direct, or as having to traverse streets or sidewalks that may require multiple turns. With these ideas in mind, let’s take a look at three common geometric distance measures: Euclidean Distance, Manhattan Distance, and Cosine Distance.&lt;/p&gt;

&lt;h2 id=&quot;euclidean-distance&quot;&gt;Euclidean Distance&lt;/h2&gt;
&lt;p&gt;Euclidean distance is the direct (shortest) distance between two points. Of all the geometric distance measures this is the easiest to understand. Also, as long as you passed high school geometry you already know how to calculate it, as it follows directly from the Pythagorean Theorem: the distance is the square root of the sum of the squared horizontal and vertical differences. Imagine in this scenario that you have two points, A and B. A can be identified at coordinates $(x_{A}, y_{A})$ and B can be identified at coordinates $(x_{B}, y_{B})$. The relevant equation for Euclidean distance then becomes as follows:&lt;/p&gt;

\[d(A,B) = \sqrt{(x_{A} - x_{B})^{2} + (y_{A} - y_{B})^{2}}\]

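&lt;p&gt;Using the equation above with points A and B, imagine a right triangle with a 90-degree angle at point C. The two legs of the triangle are the horizontal and vertical differences between A and B, and the Euclidean distance is the length of the hypotenuse AB.&lt;/p&gt;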

&lt;p&gt;&lt;img src=&quot;/images/euclidean_distance2.png&quot; title=&quot;Euclidean Distance Visualization&quot; /&gt;&lt;/p&gt;
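
&lt;p&gt;As a minimal sketch, Euclidean distance can be computed in Python as the square root of the summed squared coordinate differences. The function name and the sample points are hypothetical; the points were chosen to form a 3-4-5 right triangle so the result is easy to check by hand.&lt;/p&gt;

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points given as coordinate tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Horizontal difference is 3 and vertical difference is 4.
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```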

&lt;p&gt;&lt;strong&gt;When to Use It:&lt;/strong&gt; Clustering algorithms for normally distributed data, such as K-Means.&lt;/p&gt;

&lt;h2 id=&quot;manhattan-distance&quot;&gt;Manhattan Distance&lt;/h2&gt;
&lt;p&gt;The Manhattan Distance is also very easy to grasp as it is analogous to movement in the real world. Think of it like walking in a city. Instead of walking directly from your starting point to your destination as you would with Euclidean Distance, with Manhattan Distance you walk straight in one direction, then turn a corner and walk in another direction. You calculate the distance of each stretch: from your starting point to the first turn, and from each turn to the next point. At the end, you sum the stretches for the overall distance measure.&lt;/p&gt;

\[d(A,B) = |x_{A} - x_{B}| + |y_{A} - y_{B}|\]

&lt;p&gt;As before, imagine two points A and B.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/manhattan_distance.png&quot; title=&quot;Manhattan Distance Visualization&quot; /&gt;&lt;/p&gt;
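
&lt;p&gt;The Manhattan formula above translates almost directly into Python. This is only a sketch; the function name and sample points are hypothetical.&lt;/p&gt;

```python
def manhattan_distance(a, b):
    """Sum of the absolute coordinate differences between two points."""
    return sum(abs(p - q) for p, q in zip(a, b))


# Walk 3 units along x, then 4 units along y.
print(manhattan_distance((1, 2), (4, 6)))  # 7
```

&lt;p&gt;For the same pair of points the straight-line (Euclidean) distance is 5, so the city-block route is longer, as you would expect.&lt;/p&gt;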

&lt;p&gt;&lt;strong&gt;When to Use It:&lt;/strong&gt; Regression analyses.&lt;/p&gt;

&lt;h2 id=&quot;cosine-distance&quot;&gt;Cosine Distance&lt;/h2&gt;
&lt;p&gt;Of the three geometric distances presented in this post, cosine distance has the most complex calculation but is still easy to conceptualize. Think again about two points, each treated as a vector from some origin point. Cosine similarity is the cosine of the angle between those vectors, and cosine distance is 1 minus that similarity. This makes for some fairly easy ways to judge distance.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Cosine of 0 deg = 1, so the distance is 0: similar&lt;/li&gt;
  &lt;li&gt;Cosine of 90 deg = 0, so the distance is 1: dissimilar&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The formula can be fairly complicated, so let’s take a moment to break it down. Its initial form is below:&lt;/p&gt;

\[d(A,B) = 1 - \frac{A \cdot B}{||A|| * ||B||}\]

&lt;p&gt;The numerator refers to the dot product, so \(\sum_{i = 1}^{n} A_{i} * B_{i}\)&lt;/p&gt;

&lt;p&gt;The denominator refers to the product of the magnitudes (Euclidean norms) of A and B, so \(\sqrt{\sum_{i = 1}^n A_{i}^2} * \sqrt{\sum_{i = 1}^n B_{i}^2}\)&lt;/p&gt;

&lt;p&gt;Each component may seem complicated, but if you know how to factor radicals it simplifies as you see below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/cosine_distance.png&quot; title=&quot;Cosine Distance Visualization&quot; /&gt;&lt;/p&gt;
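
&lt;p&gt;Following the formula above, here is a small sketch of cosine distance as 1 minus the dot product divided by the product of the two magnitudes. The function name and vectors are hypothetical, and the sketch assumes neither vector is all zeros (otherwise the division fails).&lt;/p&gt;

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1 - dot / (norm_a * norm_b)


# Same direction, different length: angle of 0 degrees.
print(cosine_distance((3, 4), (6, 8)))  # 0.0
# Perpendicular vectors: angle of 90 degrees.
print(cosine_distance((1, 0), (0, 1)))  # 1.0
```

&lt;p&gt;Note that the first pair has a distance of 0 even though the vectors differ in length. Cosine distance only looks at direction, which is part of why it suits comparisons of documents of different sizes.&lt;/p&gt;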

&lt;p&gt;&lt;strong&gt;When to Use It:&lt;/strong&gt; Judging similarity of documents for text analyses.&lt;/p&gt;

&lt;p&gt;While this is by no means an exhaustive list of geometric distances, these are the ones I most commonly use. Also, I think these are the ones that offer the gentlest introduction to the concept of geometric distances. In a future post I will deepen the discussion of distance by talking about statistical distance.&lt;/p&gt;</content><author><name>Liv (Leslie) McFarlin</name></author><category term="Statistics" /><category term="Machine Learning" /><summary type="html">In a previous blog post I introduced the concepts of standardization and normalization. A concept common to both of those techniques is distance. Simply put, distance is a mathematical summary of the differences between two objects. These objects can be data points or they can be full distributions. Distance measures fall into two categories:</summary></entry><entry><title type="html">Applying Standardization vs. Normalization: A Primer for UXers Interested in Machine Learning</title><link href="https://lesliemcfarlin.com/posts/2022/05/blog-post-1/" rel="alternate" type="text/html" title="Applying Standardization vs. Normalization: A Primer for UXers Interested in Machine Learning" /><published>2022-05-30T00:00:00-07:00</published><updated>2022-05-30T00:00:00-07:00</updated><id>https://lesliemcfarlin.com/posts/2022/05/blog-post-1</id><content type="html" xml:base="https://lesliemcfarlin.com/posts/2022/05/blog-post-1/">&lt;p&gt;UX researchers who deal with quantitative data are familiar with standardization and normalization. The reasons we would apply them in quantitative UX research are similar to why we would apply them in machine learning.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;These methods help us control the influence of data points on the analysis such that one particular variable or a set of data points does not skew the results.&lt;/li&gt;
  &lt;li&gt;They also don’t alter the shape of the data very much, which matters according to the questions we are trying to answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, when should either of these methods be applied? Standardization and normalization should be applied when using an analysis that relies on distance calculations. This is a key point, and since machine learning requires a deeper understanding of distance than the one typically required of UX researchers, I will provide a brief overview of it before diving into the details of Standardization and Normalization.&lt;/p&gt;

&lt;h2 id=&quot;a-brief-introduction-to-distance&quot;&gt;A Brief Introduction to Distance&lt;/h2&gt;
&lt;p&gt;In very simple terms, distance summarizes the difference between data points. The greater the distance, the more dissimilar are the data points. The smaller the distance, the more similar the data points. An easy example for UX researchers to understand is linear regression. When fitting a regression line to a data set, data points closer to a regression line will have a shorter distance than those data points further away. You could then argue that shorter distances indicate a better fit of the regression line to the data than larger distances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important Point:&lt;/strong&gt; Distance-based analyses can be skewed (biased) by numerically larger values. These large values could be outliers depending on your data set.&lt;/p&gt;
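
&lt;p&gt;A small sketch can make this concrete. The participants and their values below are made up for illustration: each one has a task time in seconds and a satisfaction rating on a 1 to 5 scale, and distance is measured as straight-line (Euclidean) distance.&lt;/p&gt;

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points given as tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Hypothetical participants: (task time in seconds, satisfaction 1-5).
p1 = (300, 1)
p2 = (310, 5)  # opposite satisfaction, similar time
p3 = (340, 1)  # identical satisfaction, 40 seconds slower

d12 = euclidean_distance(p1, p2)
d13 = euclidean_distance(p1, p3)
print(round(d12, 2))  # 10.77
print(d13)            # 40.0
```

&lt;p&gt;Participants 1 and 2 gave opposite satisfaction ratings, yet they come out closer together than participants 1 and 3, who gave the same rating. The variable with the numerically larger values (time) dominates the distance.&lt;/p&gt;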

&lt;p&gt;Within my experience of UX research, analyses that use distances focus only on explaining observations, such as indicating how much variance in dependent variables is accounted for by one or more independent variables. In machine learning we go beyond this to prediction or classification, wherein distance matters for model performance of those activities.&lt;/p&gt;

&lt;h2 id=&quot;a-high-level-comparison-of-standardization-and-normalization&quot;&gt;A High Level Comparison of Standardization and Normalization&lt;/h2&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;strong&gt;Standardization&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Normalization&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Creates a scale where the mean is 0 and the standard deviation is 1, thereby describing all values in the same units.&lt;/td&gt;
      &lt;td&gt;Creates a scale where all values are within the range of [0, 1] or [-1, 1]. It essentially brings all values of numeric variables into a common scale.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 id=&quot;standardization&quot;&gt;Standardization&lt;/h3&gt;
&lt;p&gt;Based on the definition above, UX researchers will recognize this as Z-scoring. The formula for standardization is:
$x_{sd} = \frac{x_{i} - \bar{x}}{\sigma}$&lt;/p&gt;
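
&lt;p&gt;As a quick sketch of the formula, the snippet below Z-scores a small list of made-up values using only the Python standard library. It assumes the population standard deviation for $\sigma$; the function name standardize is hypothetical.&lt;/p&gt;

```python
import statistics


def standardize(values):
    """Z-score each value: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    # Assumption: sigma is the population standard deviation.
    sigma = statistics.pstdev(values)
    return [(v - mean) / sigma for v in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, population SD is 2
print(standardize(scores))
```

&lt;p&gt;After the transformation the values have a mean of 0 and a standard deviation of 1, so a value of 9 becomes 2 (two standard deviations above the mean).&lt;/p&gt;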

&lt;p&gt;&lt;strong&gt;When to Use Standardization&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Your data set is normally distributed.&lt;/li&gt;
  &lt;li&gt;Measures (independent variables) have different units.
    &lt;ul&gt;
      &lt;li&gt;Variables on different scales do not equally contribute to analyses, which is a way of introducing bias.&lt;/li&gt;
      &lt;li&gt;Standardization gives equal consideration to each variable in the analysis.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Extreme values/outliers exist in the data set.
    &lt;ul&gt;
      &lt;li&gt;Since there is no pre-defined range of transformed features, standardization isn’t overly affected by outliers.&lt;/li&gt;
      &lt;li&gt;&lt;em&gt;Quick Note&lt;/em&gt;: Sometimes you want to remove outliers, sometimes you do not. This all depends on your research question, from my perspective.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some machine learning methods where applying standardization makes sense include:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Regression analysis, such as linear regression and logistic regression, used in classification. (Supervised learning)&lt;/li&gt;
  &lt;li&gt;Principal Component Analysis (PCA), used for dimensionality reduction. (Unsupervised learning)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;performing-standardization-in-r&quot;&gt;Performing Standardization in R&lt;/h4&gt;
&lt;p&gt;R has a built-in function, scale(), that can be applied to columns of continuous data. You can learn more &lt;a href=&quot;https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/scale&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id=&quot;performing-standardization-in-python&quot;&gt;Performing Standardization in Python&lt;/h4&gt;
&lt;p&gt;scikit-learn has a preprocessing package with a constructor known as StandardScaler. You can read the documentation on StandardScaler at the &lt;a href=&quot;https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler&quot;&gt;scikit-learn website&lt;/a&gt;.&lt;/p&gt;
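
&lt;p&gt;A minimal usage sketch, assuming scikit-learn and NumPy are installed. The array X holds made-up values with one row per observation and one column per variable.&lt;/p&gt;

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: columns are task time (seconds) and satisfaction (1-5).
X = np.array([[300.0, 1.0], [310.0, 5.0], [340.0, 3.0]])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation, then rescales.
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))  # each column mean is approximately 0
print(X_std.std(axis=0))   # each column standard deviation is approximately 1
```

&lt;p&gt;If you are training a model, one common approach is to call fit on the training data only and then reuse the same scaler with transform on the test data.&lt;/p&gt;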

&lt;h3 id=&quot;normalization&quot;&gt;Normalization&lt;/h3&gt;
&lt;p&gt;Some UX researchers (and many machine learning experts) will also refer to normalization as min-max scaling. The formula for normalization is:
$x_{norm} = \frac{x_{i} - x_{min}}{x_{max} - x_{min}}$&lt;/p&gt;
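
&lt;p&gt;The formula can be sketched in a few lines of plain Python. The function name and sample values are hypothetical, and the sketch assumes the values are not all identical (otherwise the denominator is 0).&lt;/p&gt;

```python
def min_max_normalize(values):
    """Rescale values to the range [0, 1] using the minimum and maximum."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


print(min_max_normalize([10, 20, 25, 50]))  # [0.0, 0.25, 0.375, 1.0]
```

&lt;p&gt;The smallest value always maps to 0 and the largest to 1, which is why a single extreme value can squeeze everything else into a narrow band.&lt;/p&gt;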

&lt;p&gt;&lt;strong&gt;When to Use Normalization&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;No assumptions about your data set are being made, or you cannot assume your data set is normally distributed.
    &lt;ul&gt;
      &lt;li&gt;This extends to any algorithm you might use in machine learning, not just the assumptions you as the practitioner might make.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Variables have different ranges.&lt;/li&gt;
  &lt;li&gt;Standard Deviation is very small.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some machine learning methods where applying normalization makes sense include:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;K Nearest Neighbors (KNN), used in classification and regression. (Supervised learning)&lt;/li&gt;
  &lt;li&gt;Neural networks. (Supervised and unsupervised learning)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;performing-normalization-in-r&quot;&gt;Performing Normalization in R&lt;/h4&gt;
&lt;p&gt;R offers a package called &lt;em&gt;caret&lt;/em&gt; that contains a function preProcess(). You can read about caret on &lt;a href=&quot;https://cran.r-project.org/web/packages/caret/vignettes/caret.html&quot;&gt;CRAN&lt;/a&gt;. For specific information on preProcess(), visit &lt;a href=&quot;https://www.rdocumentation.org/packages/caret/versions/6.0-92/topics/preProcess&quot;&gt;rdocumentation&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id=&quot;performing-normalization-in-python&quot;&gt;Performing Normalization in Python&lt;/h4&gt;
&lt;p&gt;scikit-learn’s preprocessing package also offers a constructor for normalization, MinMaxScaler. You can read about it on the &lt;a href=&quot;https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html&quot;&gt;scikit-learn website&lt;/a&gt;.&lt;/p&gt;
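
&lt;p&gt;A minimal usage sketch, assuming scikit-learn and NumPy are installed. The array X holds made-up values with one row per observation and one column per variable.&lt;/p&gt;

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: columns are task time (seconds) and satisfaction (1-5).
X = np.array([[300.0, 1.0], [310.0, 5.0], [340.0, 3.0]])

# With default settings, each column is rescaled to the range [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

print(X_norm.min(axis=0))  # column minimums
print(X_norm.max(axis=0))  # column maximums
```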

&lt;p&gt;In future blog posts I will dive into more details on distance, standardization, and normalization, including providing specific examples of using each.&lt;/p&gt;</content><author><name>Liv (Leslie) McFarlin</name></author><category term="Statistics" /><category term="UX Research" /><category term="Machine Learning" /><summary type="html">UX researchers who deal with quantitative data are familiar with standardization and normalization. The reasons we would apply them in quantitative UX research are similar to why we would apply them in machine learning. These methods help us control the influence of data points on the analysis such that one particular variable or a set of data points does not skew the results. They also don’t alter the shape of the data very much, which matters according to the questions we are trying to answer.</summary></entry></feed>