Outlier

Outliers are extreme values in the data. If the value of a variable is too large or too small, i.e., if it falls outside a certain acceptable range, we consider that value an outlier. A quick way to find outliers in the data is to use a box plot.

Outlier Treatment

Handling outlier values/cases is called Outlier Treatment. Typically, outlier treatment is done by capping or flooring.

  • Capping is replacing all values above a certain theoretical maximum, or upper control limit (UCL), with the UCL value. The statistical formula for UCL is UCL = Q3 + 1.5 * IQR
  • Flooring is replacing all values below a certain theoretical minimum, or lower control limit (LCL), with the LCL value. The statistical formula for LCL is LCL = Q1 - 1.5 * IQR
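
The two formulas above can be sketched together in Python. This is a minimal illustration, assuming pandas is available; the function name `cap_and_floor` and the sample values are hypothetical and are not part of the original series.

```python
import pandas as pd


def cap_and_floor(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values above the UCL and floor values below the LCL (IQR rule)."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1            # interquartile range
    ucl = q3 + k * iqr       # upper control limit
    lcl = q1 - k * iqr       # lower control limit
    # clip() replaces values below lcl with lcl and values above ucl with ucl
    return values.clip(lower=lcl, upper=ucl)


# Hypothetical sample with one low and one high extreme value
sample = pd.Series([-100.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 200.0])
treated = cap_and_floor(sample)
print(treated.tolist())
```

Values that already lie between LCL and UCL pass through unchanged; only the two extremes are pulled in to the limits.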

In some instances you may want to delete the record that has an outlier value. However, deleting a record should be considered only when other outlier treatment options are not acceptable.
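
If deletion does turn out to be the only acceptable option, it might look roughly like the sketch below. The DataFrame `records` and the `Cust_ID` column are hypothetical sample data made up for illustration; only the `Balance` column name is borrowed from this series.

```python
import pandas as pd

# Hypothetical sample data; the last record has an extreme Balance
records = pd.DataFrame({
    "Cust_ID": [1, 2, 3, 4, 5, 6],
    "Balance": [1000.0, 2500.0, 3000.0, 4200.0, 5000.0, 900000.0],
})

q1, q3 = records["Balance"].quantile([0.25, 0.75])
iqr = q3 - q1
ucl = q3 + 1.5 * iqr
lcl = q1 - 1.5 * iqr

# Keep only the records whose Balance lies within [LCL, UCL]
within_limits = records["Balance"].between(lcl, ucl)
trimmed = records[within_limits].copy()

print(len(records) - len(trimmed), "record(s) dropped")
```

Writing the result to a new DataFrame keeps the original records available in case the deletion needs to be revisited.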

Note: This blog is a continuation of our Logistic Regression Blog Series

Python code | Finding Outliers Using a Box Plot

import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter/IPython magic; remove this line when running as a plain script
%matplotlib inline

# `dev` is assumed to be the DataFrame loaded earlier in this series
plt.figure(figsize=(9, 5))

boxplot = sns.boxplot(x="Balance",
                 data=dev, showmeans=True,
                 width=0.5,
                 palette="colorblind")

plt.title("Box Plot of Balance", fontsize=20)
plt.xlabel("Balance", fontsize=15)
plt.show()

Outlier Treatment of Balance Variable

From the box plot, we observe that there are outlier values above 500,000.

We compute the Upper Control Limit using the formula: UCL = Q3 + 1.5 * IQR

Python code | Compute UCL

# Get the Upper Control Limit (UCL) value for Balance
Q1, Q3 = dev["Balance"].quantile([0.25, 0.75])
UCL = Q3 + 1.5 * (Q3 - Q1)
print("UCL =", round(UCL))
# Output: UCL = 506511

Python code | Capping of Outlier Values

# If a value is above 500000, replace it with 500000
# (500000 is used as a round-number cap close to the computed UCL of 506511)
####### Best Practice #######
# When you do outlier treatment, create a new variable
# so that the original values are preserved

dev["Bal_cap"] = dev["Balance"].clip(upper=500000)
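
After capping, it can help to check how many values were affected. The sketch below is only an illustration: `dev_demo` is a small hypothetical stand-in for the `dev` dataset, and the balances in it are made up.

```python
import pandas as pd

# Hypothetical stand-in for the `dev` dataset used in this series
dev_demo = pd.DataFrame(
    {"Balance": [12000, 85000, 499999, 500000, 750000, 1200000]}
)

cap = 500000
# Create the capped variable alongside the original column
dev_demo["Bal_cap"] = dev_demo["Balance"].clip(upper=cap)

# Count how many values were above the cap and therefore replaced
n_capped = int((dev_demo["Balance"] > cap).sum())

print("Values capped:", n_capped)
print("Max before:", dev_demo["Balance"].max(),
      "| Max after:", dev_demo["Bal_cap"].max())
```

Because the capped values go into a new column, the original `Balance` column stays available for comparison.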

R code for Outlier Treatment

The equivalent R code is given below.

# Box Plot
boxplot(dev$Balance,
        main = "Box Plot of Balance",
        xlab = "Balance",
        col = "royalblue",
        border = "black",
        horizontal = TRUE)

# UCL - Upper Control Limit
Q = quantile(dev$Balance, c( 0.25, 0.75))
Q1 = Q[1]
Q3 = Q[2]
UCL = Q3 + 1.5 * (Q3 - Q1)
cat("UCL =" , round(UCL,0))

# Capping the Balance variable
# Creating new variable Bal_cap

dev$Bal_cap = ifelse(dev$Balance > 500000, 500000, dev$Balance)
