How to Find and Handle Outliers
Outliers #
Outliers are a pain to deal with. They are observations in your data that stand out far from everything else. As a result they can skew how you interpret, model, and use your data. Let’s go over how to find outliers and how to mitigate the problems they carry.
Finding Outliers #
Smaller Datasets #
If you have a small dataset, outliers are usually easy to spot: chart your data and they stand out. On the bar chart below, Sam is a clear outlier. For medium-sized datasets, a scatter plot is a useful way to visualize them.
Bigger Datasets / Those With Unclear Outliers #
If you have a big dataset or a lot of “gray” (borderline) outliers, you can use statistics to help identify them. Here are my favorite methods:
Use z-score: A z-score measures how many standard deviations an observation lies from the mean, so a score of 2.5 means your observed value is 2.5 standard deviations above the mean. A common rule of thumb is to treat any value more than 3 standard deviations from the mean (a z-score above 3 or below -3) as an outlier. If you need a refresher on standard deviation, see here.
Inter Quartile Range (IQR): The IQR is derived from your data’s third and first quartiles, where IQR = Q3 - Q1. Once you find your IQR, any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR can be treated as an outlier. Penn State has a great post about this.
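As a rough sketch, both rules can be written in a few lines of pandas. The helper names `zscore_outliers` and `iqr_outliers`, the sample values, and the default cutoffs of 3 and 1.5 are illustrative assumptions, not something fixed by the methods themselves.

```python
import pandas as pd


def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask: True where the absolute z-score exceeds the threshold."""
    # pandas' std() uses the sample standard deviation (ddof=1) by default
    z = (values - values.mean()) / values.std()
    return z.abs() > threshold


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value falls outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Illustrative sample data (assumed for this sketch)
chairs = pd.Series([12, 19, 3, 5, 186, 7], name='Chairs Produced')
z_mask = zscore_outliers(chairs)
iqr_mask = iqr_outliers(chairs)
```

On a sample this small the two rules can disagree. The IQR rule flags 186, while the z-score rule flags nothing: with only six observations, no z-score computed from the sample standard deviation can exceed about 2.04. The z-score rule is better suited to larger datasets.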
Regardless of how you identify your outliers, you can mitigate them by dropping them, marking them, or rescaling them.
1 Dropping records means deleting them. This is not a great option: you lose information without knowing for sure that the records are actually bad.
2 Marking is your safest option as you can then test with and without the outliers to see their net impact on your data analysis.
3 Rescaling with a log value is a good option if you want to keep the data. With rescaling, outliers don’t have as great an effect.
Python and Pandas #
Now let’s look at how we’d use Python alongside the pandas and numpy libraries to get this done.
Preliminary Code #
# Import Pandas and Numpy Libraries
import pandas as pd
import numpy as np
# Create Data
data = pd.DataFrame()
data['Name'] = ['Jim', 'Mary', 'Joe', 'Randy', 'Sam','Phil']
data['Chairs Produced'] = [12, 19, 3, 5, 186, 7]
# Show DataFrame
data
Name | Chairs Produced |
---|---|
Jim | 12 |
Mary | 19 |
Joe | 3 |
Randy | 5 |
Sam | 186 |
Phil | 7 |
1. Dropping #
For simplicity, I am only showing how to filter your data to remove values greater than or equal to 100. If you used the z-score or IQR methodology from above, you’d filter accordingly.
# Keep only rows below the cutoff; the original DataFrame is left unchanged
dropped = data[data['Chairs Produced'] < 100]
dropped
Name | Chairs Produced |
---|---|
Jim | 12 |
Mary | 19 |
Joe | 3 |
Randy | 5 |
Phil | 7 |
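The cutoff of 100 is hard-coded. A minimal sketch of the same filter driven by the IQR rule might look like the following; the variable names (`lower`, `upper`, `kept`) are illustrative, and the multiplier of 1.5 is the conventional choice, not a requirement.

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['Jim', 'Mary', 'Joe', 'Randy', 'Sam', 'Phil'],
    'Chairs Produced': [12, 19, 3, 5, 186, 7],
})

# Compute the IQR fences from the column itself
q1 = data['Chairs Produced'].quantile(0.25)
q3 = data['Chairs Produced'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive on both ends by default; rows outside the fences are dropped
kept = data[data['Chairs Produced'].between(lower, upper)]
```

Because the result is assigned to a new variable, `data` itself still holds all six rows.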
2. Marking #
# Flag outliers with 1 and everything else with 0
data['Outlier'] = np.where(data['Chairs Produced'] < 100, 0, 1)
data
Name | Chairs Produced | Outlier |
---|---|---|
Jim | 12 | 0 |
Mary | 19 | 0 |
Joe | 3 | 0 |
Randy | 5 | 0 |
Sam | 186 | 1 |
Phil | 7 | 0 |
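With the flag in place, one way to see the outliers' net impact is to compute the same summary statistics with and without the flagged rows. This is a sketch only; comparing the mean and median is an assumption made for illustration, and any statistic or model metric could be substituted.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Name': ['Jim', 'Mary', 'Joe', 'Randy', 'Sam', 'Phil'],
    'Chairs Produced': [12, 19, 3, 5, 186, 7],
})
data['Outlier'] = np.where(data['Chairs Produced'] < 100, 0, 1)

# Same statistics on all rows and on the unflagged rows only
with_outliers = data['Chairs Produced'].agg(['mean', 'median'])
without_outliers = data.loc[data['Outlier'] == 0, 'Chairs Produced'].agg(['mean', 'median'])

comparison = pd.DataFrame({
    'With outliers': with_outliers,
    'Without outliers': without_outliers,
})
```

Here the mean falls from about 38.7 to 9.2 once Sam is excluded, while the median only moves from 9.5 to 7, which shows how much more sensitive the mean is to a single extreme value.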
3. Rescaling #
# Natural log of each value (every value must be greater than 0)
data['Log Value'] = np.log(data['Chairs Produced'])
data
Name | Chairs Produced | Outlier | Log Value |
---|---|---|---|
Jim | 12 | 0 | 2.48491 |
Mary | 19 | 0 | 2.94444 |
Joe | 3 | 0 | 1.09861 |
Randy | 5 | 0 | 1.60944 |
Sam | 186 | 1 | 5.22575 |
Phil | 7 | 0 | 1.94591 |
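To make the reduced effect of the outlier concrete, the sketch below compares how far the largest value sits from the median before and after the log transform. The max-to-median ratio is just one illustrative yardstick, chosen here as an assumption.

```python
import numpy as np
import pandas as pd

chairs = pd.Series([12, 19, 3, 5, 186, 7], name='Chairs Produced')
log_chairs = np.log(chairs)  # natural log; requires every value to be > 0

# How many times larger than the median is the largest value?
raw_ratio = chairs.max() / chairs.median()
log_ratio = log_chairs.max() / log_chairs.median()
```

On the raw scale Sam's output is roughly 20 times the median; on the log scale it is a little over twice the median. If the data could contain zeros, `np.log1p` (which computes log(1 + x)) is a common alternative, since `np.log(0)` evaluates to negative infinity.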