We discussed various missing value imputation techniques in previous blogs. In this blog, we will impute the missing values of a continuous variable using the mean, the median, and a business-logic approach.
We will continue with the development sample created in the training and testing step. The continuous variable, Holding Period, has missing values. Let us check how many values are missing.
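As a minimal sketch of that check, assuming the development sample is a pandas DataFrame named `dev` with a column `Holding_Period` (both names are assumptions, and the data below is a toy stand-in, not the real sample):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the development sample; the names `dev` and
# `Holding_Period` are assumptions made for illustration only.
dev = pd.DataFrame({"Holding_Period": [12.0, np.nan, 20.0, 7.0, np.nan, 31.0]})

# Count and share of missing values in the column
n_missing = dev["Holding_Period"].isna().sum()
pct_missing = dev["Holding_Period"].isna().mean() * 100

print(f"Missing: {n_missing} of {len(dev)} rows ({pct_missing:.1f}%)")
```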
Imputation using Mean/Median Value
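The simplest approach to imputing a continuous variable is to replace all missing values with the mean or the median. The median is less sensitive to outliers, so it is usually the safer choice when the distribution is skewed.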
Python code to replace the missing values by Mean / Median
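A minimal sketch of mean and median imputation with pandas. The DataFrame `dev`, the column `Holding_Period`, and the two output column names are assumptions; the toy values are for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the development sample (names and values are assumptions)
dev = pd.DataFrame({"Holding_Period": [12.0, np.nan, 20.0, 7.0, np.nan, 31.0]})

# mean() and median() skip NaN by default, so they are computed
# on the non-missing values only
hp_mean = dev["Holding_Period"].mean()
hp_median = dev["Holding_Period"].median()

# Write the imputed values to new columns so the original stays intact
dev["HP_mean_imputed"] = dev["Holding_Period"].fillna(hp_mean)
dev["HP_median_imputed"] = dev["Holding_Period"].fillna(hp_median)

print(dev)
```

If you use this approach, it is generally advisable to compute the mean or median on the development sample only and reuse that same value when imputing the testing sample.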
Missing Value Imputation using Business Logic
Reminder – The data being analyzed is Personal Loans Cross-Sell data. As a marketer, I am more interested in the segment-wise response rate. I would like to see the response rate in the missing value segment and compare it with the other segments. I would then impute the missing values with the value of the segment whose response rate is closest to that of the missing segment. Let us see this using Python code.
In the code, we convert the continuous variable Holding Period into a categorical variable by coarse binning. The missing values are replaced by -9999 so that they fall into a bin of their own.
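The binning step can be sketched as follows. This is an illustration on randomly generated toy data, not the original code: the names `dev`, `Holding_Period`, and `Target`, and the bin edges, are all assumptions, and the output will not reproduce the response rates discussed in this blog.

```python
import numpy as np
import pandas as pd

# Toy data: names, sizes and rates are assumptions for illustration only
rng = np.random.default_rng(0)
n = 1000
dev = pd.DataFrame({
    "Holding_Period": rng.integers(1, 32, size=n).astype(float),
    "Target": rng.binomial(1, 0.04, size=n),  # 1 = responder
})
# Blank out 100 randomly chosen rows to simulate missing values
dev.loc[rng.choice(n, size=100, replace=False), "Holding_Period"] = np.nan

# Replace missing values with the sentinel -9999
hp_filled = dev["Holding_Period"].fillna(-9999)

# Hypothetical coarse bin edges; the first bin captures only the sentinel
edges = [-np.inf, -1, 7, 14, 21, 28, np.inf]
dev["HP_bin"] = pd.cut(hp_filled, bins=edges)

# Segment-wise counts, responders, average Holding Period and response rate
summary = (
    dev.assign(HP_filled=hp_filled)
       .groupby("HP_bin", observed=True)
       .agg(n=("Target", "size"),
            responders=("Target", "sum"),
            avg_hp=("HP_filled", "mean"))
)
summary["response_rate_pct"] = 100 * summary["responders"] / summary["n"]

print(summary)
```

The first row of `summary` is the missing value segment, so its response rate can be compared directly with the rows below it.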
Key takeaways from the above table
- The response rate of the missing value segment is 3.68%
- The segment with an average Holding Period of 15.5 has a response rate of 4.43%
- Imputing the missing values by 15 would merge them into a segment with a relatively higher response rate (4.43%) than the missing value response rate of 3.68%. This can lead to model overfitting.
- The segment with an average HP of 15.5 has a response rate of 4.43%, and the segment with an average HP of 22 has a response rate of 2.50%. The response rate of the missing segment, 3.68%, lies between these two.
The value to be used for imputing the missing values can be calculated by linear interpolation between these two segments. With x as the Holding Period value to impute:
(x - 22) / (3.68 - 2.50) = (22 - 15.5) / (2.50 - 4.43)
Solving for x gives approximately 18.03, so we use 18 to replace the missing values.
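The interpolation arithmetic can be checked with a few lines of Python. The segment values come from the takeaways above; the variable names are only illustrative.

```python
# Segment averages and response rates from the takeaways above
hp_low, rr_low = 15.5, 4.43    # segment with average HP of 15.5
hp_high, rr_high = 22.0, 2.50  # segment with average HP of 22
rr_missing = 3.68              # response rate of the missing value segment

# Change in Holding Period per percentage point of response rate
slope = (hp_high - hp_low) / (rr_high - rr_low)

# Holding Period at which the interpolated response rate equals rr_missing
hp_impute = hp_high + (rr_missing - rr_high) * slope

print(round(hp_impute, 2))  # 18.03
```

Rounding gives 18, which can then be passed to `fillna` on the Holding Period column in the same way as the mean or median.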
Great! You have learned various techniques of missing value imputation. We now move on to the next important topic – Visualization and Pattern Detection.