Liv McFarlin | UX + Data + AI/ML

Creating Hashes in R with the Hash Package

2022-12-11T00:00:00-08:00

R does not provide a native hash table structure, which is unfortunate because if you need a fast and efficient way to retrieve information without worrying about element order, the hash table is a decent data structure choice. R users are not without options, though. The first option involves using an environment.

The downside to this is that one cannot easily use vectors as keys or values. A viable alternative, however, can be found in an R package named hash. The hash package is an easy way of implementing hashes without managing environments directly.

Using the Hash Package

As always, before you can use hash it has to be installed. Once installed, include it in your R file with library("hash").

To give a concrete example of how to use the hash package, imagine a vector of 10 names:

Then, create a second vector containing ages. The example below shows how to randomly generate 10 numbers between 18 and 70.

The random numbers generated for our example are 44, 40, 67, 35, 41, 53, 55, 56, 52, and 58. To map the keys (names) to the age values, use the function hash().

This should produce output like the following:

Useful Functions

Aside from the hash() function to create the actual hash, you can use the following functions:

keys(), to retrieve all key values within a hash.
values(), to retrieve all values in a hash or a single value.
- Note that you can also use double square brackets or the dollar sign to access a single value by its key.
.set(), to add a new key-value pair to the hash.
has.key(), to verify the existence of a key in a hash.
invert(), to swap keys and values. Just a note of caution here: There may be repetition of values, so swapping them for keys could lead to problems in data retrieval.

To learn more about the hash package, view its documentation at the CRAN repository.

Case Study: Creating Data Generator

2022-08-07T00:00:00-07:00

A custom tool built from multiple Python scripts was created for an in-house UX team to make it easier to create data-heavy prototypes.

GitHub location: GitHub link

NOTE: Updated 6 October 2022.

Scenario

When creating prototypes for data-intensive products and services, the responsibility for creating dummy data falls on the UX architect. This makes sense because the UX architect knows the context for the prototype, such as:

Who are the users?
What are their needs?
What problem(s) are we solving for them?

Issue

Creating dummy data can be very time-consuming, depending on the types and number of data points needed. A small set of data points is fairly quick and easy to create; a larger set is not. For example, 10 names or six-character alphanumeric IDs are easier to create than 30. More complex data points, such as VINs and vehicle descriptions, are difficult to create even in small amounts. Creating these more challenging data points may require:

Using client data from the system, which can be a legal concern if that data falls under the category of personally identifiable information (PII).
Using multiple tools across the internet to create realistic-seeming data points.

Solution

To solve the issues above, I created a tool for my team that generates sets of dummy data for their prototypes. The solution offers:

One-stop shop for generating a variety of data:
- Names
- Contact information
- Dates
- IDs. More often than not, only 3 types of IDs appear in our prototypes. However, some products will have 5.
Low chance of exactly replicating PII sets that appear in the company’s systems.
No copying and pasting of blocks of data, where repetition could be missed and break the immersion for those interacting with the prototype.

This single solution enables our UX architects to create prototypes faster and with fewer criticisms and cycles of correction than previously.

Tool Anatomy

Data Generator is a tool based on multiple Python scripts that runs from the command line interface (CLI). It produces a CSV file containing the following data points:

Company Name
Vehicle ID
Make
Model
Year
VIN
License Plate
Fleet ID
Division ID
First Name
Last Name
Email Address
Phone Number
Driver ID
Street Address
City
State/Province
Zipcode/Postal Code

Components

Data Generator relies on 2 Python scripts and 6 text files.

Python Scripts

Script	Description	Dependencies
data_generator.py	controls the creation of dummy data and its compilation into a CSV file.	pandas, random, string, random_names
random_names.py	generates names and locations.	pandas, random, string, datetime

Text Files

File Name	Description
can_provinces.txt	30 major Canadian cities paired with their provinces and postal codes.
first_name.txt	2000 names, 1000 per gender.
last_name.txt	1000 common last names in North America.
street_names.txt	52 common North American street names.
us_states.txt	150 major US cities paired with their states and zipcodes.
vehicles.txt	131 combinations of vehicle models with their respective manufacturers.
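
As a rough illustration of the approach (not the tool's actual code), the sketch below builds a few dummy rows and writes them out as CSV text using only the standard library. The function names `random_id`, `make_rows`, and `rows_to_csv`, the tiny inline name lists, and the three-column subset are all hypothetical; the real tool reads its names from the text files above and uses pandas.

```python
import csv
import io
import random
import string

# Hypothetical stand-ins for first_name.txt and last_name.txt.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev"]
LAST_NAMES = ["Nguyen", "Smith", "Garcia", "Tremblay"]
COLUMNS = ["First Name", "Last Name", "Driver ID"]


def random_id(rng, length=6):
    """Return an uppercase alphanumeric ID of the given length."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def make_rows(n, seed=None):
    """Build n dummy rows, each with a unique driver ID."""
    rng = random.Random(seed)
    seen_ids = set()
    rows = []
    while len(rows) < n:
        driver_id = random_id(rng)
        if driver_id in seen_ids:  # skip the rare duplicate ID
            continue
        seen_ids.add(driver_id)
        rows.append({
            "First Name": rng.choice(FIRST_NAMES),
            "Last Name": rng.choice(LAST_NAMES),
            "Driver ID": driver_id,
        })
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(make_rows(5, seed=42)))
```

Tracking the IDs already issued is one simple way to avoid the repeated values that copying and pasting tends to introduce.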

Remaining Issues

Depending on the service, some of the individual data points may need to be concatenated. However, our team prototypes in Axure, so for some widgets (such as repeaters) it’s easy enough to do this.

Example 1: First and last names. In a repeater, ensure first and last name have their own columns in the underlying table. If they are named FirstName and LastName, respectively, you can concatenate these fields like [[Item.FirstName]] [[Item.LastName]].

Example 2: Year, make, and model. These can be joined together in a similar manner as described for first and last names.

Some products also use more detailed vehicle descriptions beyond the combination of year, make, and model. These typically include information like number of doors, trim level, or engine type. Data Generator does not currently support that.

Some products may require additional ID numbers and descriptions not covered by Data Generator:

Request numbers
Request categories and subcategories
Order numbers
File attachment names and types (ex.: XLS, PDF, CSV, DOC)

Geometric Distances: A Crash Course

2022-08-06T00:00:00-07:00

In a previous blog post I introduced the concepts of standardization and normalization. A concept common to both of those techniques is distance. Simply put, distance is a mathematical summary of the differences between two objects. These objects can be data points or they can be full distributions. Distance measures fall into two categories:

Geometric Distances: Measure of similarity between vectors based on the distance between them in multidimensional space.
Statistical Distances: Distance between statistical objects denoting similarity between them.

As the title of this post indicates, this post will focus on geometric distances. For most people, the concept of geometric distance is pretty easy to grasp as it is the one most similar to how we think about distance in the physical world. The common perception of distance is the length of the space between two points, such as one’s home and the grocery store. We either think of these distances as being direct, or as having to traverse streets or sidewalks that may require multiple turns. With these ideas in mind, let’s take a look at three common geometric distance measures: Euclidean Distance, Manhattan Distance, and Cosine Distance.

Euclidean Distance

Euclidean distance is the direct (shortest) distance between two points. Of all the geometric distance measures this is the easiest to understand. Also, as long as you passed high school geometry you already know how to calculate it, as it comes from solving the Pythagorean Theorem for the hypotenuse. Imagine in this scenario that you have two points, A and B. A can be identified at coordinates $(x_{A}, y_{A})$ and B can be identified at coordinates $(x_{B}, y_{B})$. The relevant equation for Euclidean distance then becomes as follows:

\[d(A,B) = \sqrt{(x_{A} - x_{B})^{2} + (y_{A} - y_{B})^{2}}\]

To picture the equation above, imagine a right triangle with A and B at the ends of the hypotenuse and the 90-degree angle at a third point C. The legs AC and BC have lengths $|x_{A} - x_{B}|$ and $|y_{A} - y_{B}|$, and the hypotenuse is the distance $d(A,B)$.

When to Use It: Clustering algorithms for normally distributed data, such as K-Means.
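
A minimal Python sketch of the calculation: take the difference along each coordinate, square it, sum the squares, and take the square root. The function name `euclidean_distance` and the sample points are made up for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


point_a = (1, 2)
point_b = (4, 6)
print(euclidean_distance(point_a, point_b))  # 5.0 (a 3-4-5 right triangle)
```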

Manhattan Distance

The Manhattan Distance is also very easy to grasp as it is analogous to movement in the real world. Think of it like walking in a city. Instead of walking directly from your starting point to your destination as you would with Euclidean Distance, with Manhattan Distance you walk straight in one direction, then turn a corner and walk in another. For each stretch you walk between two points, such as from your starting point to the first turn, and then from that turn to the next point, you calculate the distance of that stretch. At the end, you sum the stretches together for the overall distance measure.

\[d(A,B) = |x_{A} - x_{B}| + |y_{A} - y_{B}|\]

As before, imagine two points A and B.

When to Use It: Regression analyses.
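
The same idea as a short Python sketch: sum the absolute difference along each coordinate. The function name `manhattan_distance` and the sample points are illustrative assumptions.

```python
def manhattan_distance(a, b):
    """City-block distance: sum of absolute differences per coordinate."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


point_a = (1, 2)
point_b = (4, 6)
print(manhattan_distance(point_a, point_b))  # 7 (3 blocks across plus 4 blocks up)
```

For the same two points, the straight-line (Euclidean) distance is 5, so the city-block route is longer, as you would expect.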

Cosine Distance

Of the three geometric distances presented in this post, cosine distance has the most complex calculation but is still easy to conceptualize. Think again about two points. Cosine similarity is the cosine of the angle between them, where the angle is measured from some origin point, and cosine distance is 1 minus that similarity. This makes for some fairly easy ways to judge distance:

- Cosine of 0 deg = 1 (distance of 0), similar
- Cosine of 90 deg = 0 (distance of 1), dissimilar

The formula can be fairly complicated, so let’s take a moment to break it down. Its initial form is below:

\[d(A,B) = 1 - \frac{A \cdot B}{||A|| * ||B||}\]

The numerator refers to the dot product, so $\sum_{i = 1}^{n} A_{i} * B_{i}$

The denominator refers to the product of the magnitudes (Euclidean norms) of A and B, so $\sqrt{\sum_{i = 1}^n A_{i}^2} * \sqrt{\sum_{i = 1}^n B_{i}^2}$

Each component may seem complicated, but if you know how to factor radicals it simplifies as you see below:
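
Substituting both sums into the formula and combining the two radicals in the denominator gives one way to write the simplified form (a reconstruction from the definitions above):

\[d(A,B) = 1 - \frac{\sum_{i = 1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i = 1}^{n} A_{i}^{2} \cdot \sum_{i = 1}^{n} B_{i}^{2}}}\]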

When to Use It: Judging similarity of documents for text analyses.
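
As a hedged sketch, the formula translates to Python as follows: compute the dot product, divide by the product of the two magnitudes, and subtract the result from 1. The function name `cosine_distance` and the example vectors are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1 - dot / (norm_a * norm_b)


# Same direction, different lengths: distance is approximately 0.
same_direction = cosine_distance((1, 2), (2, 4))

# Perpendicular vectors: distance is 1.
perpendicular = cosine_distance((1, 0), (0, 1))

print(same_direction, perpendicular)
```

Because only the angle matters, a long document and a short one that use words in the same proportions come out as near neighbors.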

While this is by no means an exhaustive list of geometric distances, these are the ones I most commonly use. Also, I think these are the ones that offer the gentlest introduction to the concept of geometric distances. In a future post I will deepen the discussion of distance by talking about statistical distance.

Applying Standardization vs. Normalization: A Primer for UXers Interested in Machine Learning

2022-05-30T00:00:00-07:00

UX researchers who deal with quantitative data are familiar with standardization and normalization. The reasons we would apply them in quantitative UX research are similar to why we would apply them in machine learning.

These methods help us control the influence of data points on the analysis such that one particular variable or a set of data points does not skew the results.
They also don’t alter the shape of the data very much, which matters according to the questions we are trying to answer.

So, when should either of these methods be applied? Standardization and normalization should be applied when using an analysis that relies on distance calculations. This is a key point, and since machine learning requires a deeper understanding of distance than the one typically required of UX researchers, I will provide a brief overview of it before diving into the details of Standardization and Normalization.

A Brief Introduction to Distance

In very simple terms, distance summarizes the difference between data points. The greater the distance, the more dissimilar the data points. The smaller the distance, the more similar the data points. An easy example for UX researchers to understand is linear regression. When fitting a regression line to a data set, data points closer to the regression line will have a shorter distance than data points farther away. You could then argue that shorter distances indicate a better fit of the regression line to the data than larger distances.

Important Point: Distance-based analyses can be skewed (biased) by numerically larger values. These large values could be outliers depending on your data set.
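
To make this concrete, here is a small sketch with made-up numbers: two hypothetical participants described by age (in years) and annual income (in dollars). The values and variable names are assumptions chosen only to show the effect.

```python
import math

# Hypothetical participants: (age in years, annual income in dollars).
participant_a = (25, 50_000)
participant_b = (60, 51_000)

age_gap = abs(participant_a[0] - participant_b[0])     # 35 years
income_gap = abs(participant_a[1] - participant_b[1])  # 1,000 dollars

# Straight-line distance over both variables.
distance = math.dist(participant_a, participant_b)

# Share of the squared distance contributed by income alone.
income_share = income_gap**2 / (age_gap**2 + income_gap**2)

print(round(distance, 1))
print(round(income_share, 4))
```

Even though a 35-year age gap is arguably the more meaningful difference here, income accounts for more than 99% of the squared distance simply because its numbers are larger.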

In my experience of UX research, analyses that use distances focus only on explaining observations, such as indicating how much variance in dependent variables is accounted for by one or more independent variables. In machine learning we go beyond this to prediction or classification, where distance affects how well the model performs those tasks.

A High Level Comparison of Standardization and Normalization

Standardization	Normalization
Creates a scale where the mean is 0 and the standard deviation is 1, thereby describing all values in the same units.	Creates a scale where all values are within the range of [0, 1] or [-1, 1]. It essentially brings all values of numeric variables into a common scale.

Standardization

Based on the definition above, UX researchers will recognize this as Z-scoring. The formula for standardization is: $x_{sd} = \frac{x_{i} - \bar{x}}{\sigma}$
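
The formula can be sketched in plain Python with the standard library. The function name `standardize` and the task times are hypothetical, and the sketch assumes the population standard deviation.

```python
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD; stdev() gives the sample SD
    if sd == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(x - mean) / sd for x in values]


task_times = [30, 45, 50, 65, 110]  # hypothetical task times in seconds
z_scores = standardize(task_times)
print([round(z, 2) for z in z_scores])
```

After the transformation the values have a mean of 0 and a standard deviation of 1. Note that R's scale() divides by the sample standard deviation (n - 1), while scikit-learn's StandardScaler uses the population standard deviation, so the two can differ slightly on small samples.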

When to Use Standardization

Your data set is normally distributed.
Measures (independent variables) have different units.
- Variables on different scales do not equally contribute to analyses, which is a way of introducing bias.
- Standardization gives equal consideration to each variable in the analysis.
Extreme values/outliers exist in the data set.
- Since there is no predefined range of transformed features, standardization isn’t overly affected by outliers.
- Quick Note: Sometimes you want to remove outliers, sometimes you do not. This all depends on your research question, from my perspective.

Some machine learning methods where applying standardization makes sense include:

Regression analysis, such as linear regression (used in prediction) and logistic regression (used in classification). (Supervised learning)
Principal Component Analysis (PCA), used for dimensionality reduction. (Unsupervised learning)

Performing Standardization in R

R has a built-in function, scale(), that can be applied to columns of continuous data. You can learn more here.

Performing Standardization in Python

scikit-learn has a preprocessing package with a constructor known as StandardScaler. You can read the documentation on StandardScaler at the scikit-learn website.

Normalization

Some UX researchers (and many machine learning experts) will also refer to normalization as min-max scaling. The formula for normalization is: $x_{norm} = \frac{x_{i} - x_{min}}{x_{max} - x_{min}}$
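
A minimal Python sketch of the min-max formula, rescaling values to the range [0, 1]. The function name `min_max_normalize` and the ages are illustrative assumptions.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("all values are equal; the range is zero")
    return [(x - lowest) / (highest - lowest) for x in values]


ages = [18, 31, 44, 70]  # hypothetical participant ages
print(min_max_normalize(ages))  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, so a single extreme value compresses everything else toward one end of the range.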

When to Use Normalization

No assumptions about your data set are being made, or you cannot assume your data set is normally distributed.
- This extends to any algorithm you might use in machine learning, not just the assumptions you as the practitioner might make.
Variables have different ranges.
The standard deviation is very small.

Some machine learning methods where applying normalization makes sense include:

K Nearest Neighbors (KNN), used in classification. (Supervised learning)
Neural networks. (Supervised and unsupervised learning)

Performing Normalization in R

R offers a package called caret that contains a function preProcess(). You can read about caret on CRAN. For specific information on preProcess(), visit rdocumentation.

Performing Normalization in Python

scikit-learn’s preprocessing package also offers a constructor for normalization, MinMaxScaler. You can read about it on the scikit-learn website.

In future blog posts I will dive into more details on distance, standardization, and normalization, including providing specific examples of using each.