Ok, we’re not dummies here, but dummy variables can be tricky business for some. As a data scientist, you’re going to be working with them a lot: when deploying models like k-nearest neighbors or ordinary least squares regression, your data set had better contain only numeric variables. But what if your data set has some categorical variables? Enter dummy variables.

Dummy Example

Let’s say you have a variable called Eye Color. It has the categories: brown, blue and green. You might decide to create a numerical column with only a 0 or 1 for each of the eye colors, so you end up with something like this:

brown  blue  green
1      0     0
0      1     0
0      0     1

The problem with this is that some models expect an intercept term and insert a column of 1s into your data set, which would change the above table to:

intercept  brown  blue  green
1          1      0     0
1          0      1     0
1          0      0     1

See the issue? The intercept column is the sum of the brown, blue, and green columns, so it is a linear combination of them. If you’re building a least squares regression model, you’ve just violated its assumption of full rank (meaning all columns in your data set are linearly independent), which means the coefficients are not uniquely determined and you can’t rely on the estimates that model gives you.
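One way to see the problem concretely is to check the rank of the matrix with NumPy. This is only a sketch: the six rows (two per eye color) and the two coefficient vectors are made-up values for illustration, not data from a real study.

```python
import numpy as np

# Six hypothetical observations: two each of brown, blue, green.
# Columns: intercept, brown, blue, green
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
], dtype=float)

# Four columns, but the rank is lower because intercept = brown + blue + green
rank = np.linalg.matrix_rank(X)
print(f"columns: {X.shape[1]}, rank: {rank}")

# Two different (made-up) coefficient vectors ...
beta_a = np.array([0.0, 1.0, 2.0, 3.0])
beta_b = np.array([1.0, 0.0, 1.0, 2.0])

# ... give identical fitted values, so least squares has no way
# to prefer one set of coefficients over the other.
same_fit = np.allclose(X @ beta_a, X @ beta_b)
print("same fitted values:", same_fit)
```

Shifting the intercept up by some amount and every dummy coefficient down by the same amount leaves the fit unchanged, which is exactly what a rank-deficient matrix allows.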

The better way to construct dummy variables is to choose a reference category (like brown, if brown appears most often in the data set) and leave its column out. Thus, you only need 2 columns for 3 categories. Note that brown can be inferred from a 0 in the blue column and a 0 in the green column, as in the first row of this table:

intercept  blue  green
1          0     0
1          1     0
1          0     1

Also, notice that the columns are now linearly independent; there’s no way to take a linear combination of some of the columns and reproduce another column.
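As a rough check, the same rank test can be run on the reference-coded table. The six rows (two per eye color) and the response values in y are invented purely for illustration.

```python
import numpy as np

# Six hypothetical observations, with brown as the reference category.
# Columns: intercept, blue, green
X_ref = np.array([
    [1, 0, 0],  # brown
    [1, 0, 0],  # brown
    [1, 1, 0],  # blue
    [1, 1, 0],  # blue
    [1, 0, 1],  # green
    [1, 0, 1],  # green
], dtype=float)

# Rank equals the number of columns, so the matrix has full column rank
rank_ref = np.linalg.matrix_rank(X_ref)
full_rank = rank_ref == X_ref.shape[1]
print(f"columns: {X_ref.shape[1]}, rank: {rank_ref}, full rank: {full_rank}")

# Made-up response values, only to show that the fit is now unique
y = np.array([1.0, 1.2, 2.0, 2.2, 3.0, 3.2])
beta, *_ = np.linalg.lstsq(X_ref, y, rcond=None)
print("intercept, blue, green:", beta)
```

With this coding, the intercept is the mean response for the reference group (brown), and the blue and green coefficients are each group's difference from brown.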

Many modeling functions in Python and R will take a data set that contains categorical variables and automatically dummify them for you, so it’s best not to write your own function to do this. However, if you must, just remember to create your dummy variables in a way that won’t create linear dependence in your data set. A good rule of thumb for how many dummy columns you’ll end up with for any given categorical variable is the number of categories in that variable minus 1.
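If you do end up rolling your own, a minimal sketch in plain Python might look like the following. The function name make_dummies and its reference argument are hypothetical, chosen just for this example; in practice a library routine is usually the safer choice (pandas, for instance, offers get_dummies with a drop_first option).

```python
def make_dummies(values, reference):
    """Return (column_names, rows) with one 0/1 column per non-reference category."""
    # Keep categories in first-seen order
    categories = list(dict.fromkeys(values))
    if reference not in categories:
        raise ValueError(f"reference {reference!r} not found in values")
    # Leave out the reference category so the columns stay
    # linearly independent of an intercept column
    columns = [c for c in categories if c != reference]
    rows = [[1 if v == c else 0 for c in columns] for v in values]
    return columns, rows


eye_colors = ["brown", "blue", "green", "brown", "blue"]
columns, rows = make_dummies(eye_colors, reference="brown")

print(columns)
for color, row in zip(eye_colors, rows):
    print(color, row)
```

Three categories go in and two dummy columns come out, matching the rule of thumb; a brown row shows up as 0 in both columns.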