Outliers are extreme values in the data. If the value of a variable is too large or too small, i.e., if it falls outside a certain acceptable range, we consider that value to be an outlier. A quick way to find outliers in the data is to draw a box plot.
The treatment of outlier values/cases is called outlier treatment. Typically, outlier treatment is done by capping and flooring.
- Capping is replacing all values that exceed a certain theoretical maximum, or upper control limit (UCL), by the UCL value. The statistical formula for the UCL is UCL = Q3 + 1.5 * IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1 is the interquartile range.
- Flooring is replacing all values that fall below a certain theoretical minimum, or lower control limit (LCL), by the LCL value. The statistical formula for the LCL is LCL = Q1 - 1.5 * IQR
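The two formulas above can be sketched in a few lines of plain Python. This is only an illustration: the function names `control_limits` and `cap_and_floor` and the sample values are made up for this example, and quartiles are computed with the standard library's "inclusive" method (other tools may use slightly different quartile conventions).

```python
from statistics import quantiles


def control_limits(values):
    """Return (LCL, UCL) using the 1.5 * IQR rule."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def cap_and_floor(values):
    """Replace values above the UCL by the UCL and values below the LCL by the LCL."""
    lcl, ucl = control_limits(values)
    return [min(max(v, lcl), ucl) for v in values]


# Hypothetical sample with one very low and one very high value.
sample = [-40, 10, 12, 13, 15, 16, 18, 20, 95]
lcl, ucl = control_limits(sample)
treated = cap_and_floor(sample)
print(lcl, ucl)
print(treated)
```

Here Q1 = 12 and Q3 = 18, so IQR = 6, LCL = 3 and UCL = 27. The value -40 is floored to 3 and the value 95 is capped to 27, while everything in between is left unchanged.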
There may be instances where you want to delete the record containing an outlier value. However, deleting a record should be considered only when other outlier treatment options are not acceptable.
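If deletion does turn out to be the right choice, one possible way to do it with pandas is shown below. The DataFrame, its column names (`customer_id`, `balance`) and its values are hypothetical, and the 1.5 * IQR limits are just one reasonable rule for deciding which records to drop.

```python
import pandas as pd

# Hypothetical data: column names and values are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "balance": [1200, 1500, 1700, 1800, 2100, 90000],
})

q1 = df["balance"].quantile(0.25)
q3 = df["balance"].quantile(0.75)
iqr = q3 - q1
lcl, ucl = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the records whose balance lies within [LCL, UCL].
df_trimmed = df[df["balance"].between(lcl, ucl)].reset_index(drop=True)
print(df_trimmed)
```

Note that the whole record is lost, not just the extreme value, which is why capping or flooring is usually tried first.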
Note: This blog is a continuation of our Logistic Regression Blog Series
Python code | Finding Outlier using Box Plot
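A minimal sketch of such a box plot with matplotlib could look like the following. The data here is synthetic (a right-skewed sample generated with NumPy) and the variable name `balance` is a placeholder, so the numbers will not match the dataset used in this blog series.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, right-skewed data standing in for the real variable.
rng = np.random.default_rng(seed=42)
balance = rng.lognormal(mean=11, sigma=0.8, size=1000)

fig, ax = plt.subplots(figsize=(4, 6))
box = ax.boxplot(balance)
ax.set_ylabel("balance (synthetic)")
ax.set_title("Box plot of balance")

# Points drawn beyond the whiskers are the candidate outliers.
fliers = box["fliers"][0].get_ydata()
print("Number of points beyond the whiskers:", len(fliers))
```

By default matplotlib draws the whiskers at 1.5 * IQR from the box, so the individual points plotted beyond them correspond to values outside the LCL/UCL range. In an interactive session, call `plt.show()` to display the figure.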
From the box plot, we observe that there are outlier values above 500000.
We compute the Upper Control Limit using the formula: UCL = Q3 + 1.5 * IQR
Python code | Compute UCL
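One way to compute the UCL with pandas is sketched below. The series `balance` and its values are hypothetical stand-ins for the variable being treated; `Series.quantile` uses linear interpolation by default.

```python
import pandas as pd

# Hypothetical values; substitute the column being treated.
balance = pd.Series([20000, 35000, 50000, 65000, 80000, 120000, 900000])

q1 = balance.quantile(0.25)  # first quartile
q3 = balance.quantile(0.75)  # third quartile
iqr = q3 - q1                # interquartile range
ucl = q3 + 1.5 * iqr         # upper control limit

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, UCL={ucl}")
```

For this sample, Q1 = 42500 and Q3 = 100000, which gives IQR = 57500 and UCL = 186250.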
Python code | Capping of Outlier Values
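Capping itself can be done in one step with `Series.clip`. The sketch below assumes a hypothetical DataFrame with a `balance` column and writes the result to a new column, `balance_capped`, so that the original values are kept for comparison.

```python
import pandas as pd

# Hypothetical data; the column name and values are illustrative only.
df = pd.DataFrame(
    {"balance": [20000, 35000, 50000, 65000, 80000, 120000, 900000]}
)

q1, q3 = df["balance"].quantile([0.25, 0.75])
ucl = q3 + 1.5 * (q3 - q1)

# Values above the UCL are replaced by the UCL; all others are unchanged.
df["balance_capped"] = df["balance"].clip(upper=ucl)
print(df)
```

Flooring works the same way with `clip(lower=lcl)`, and both limits can be applied together by passing `lower` and `upper` in a single call.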
R code for Outlier Treatment
The R equivalent of the Python code is given below.