Missing Value Imputation of Continuous Variable

We have discussed various missing value imputation techniques in previous blogs. In this blog, we will do the missing value imputation of a continuous variable using mean, mode, and business logic approach.

Dataset

We will continue with the development sample as created in the training and testing step. The continuous variable, Holding Period, has missing values in it. Let us check the missing.

print("No. of rows =",  dev.shape[0])
print("Occupation count =",  dev["Holding_Period"].count())
print("No. of rows with missing values =", (dev.shape[0] - dev["Holding_Period"].count()))

No. of rows = 10000
Occupation count = 9484
No. of rows with missing values = 516

Imputation using Mean/Median Value

The simplest approach of imputing a continuous variable is to replace all missing values by Mean or Median.

print("Mean Holding Period = ", 
    dev["Holding_Period"].mean().round(1))

print("Median Holding Period = ", 
    dev["Holding_Period"].median().round(1))

Mean Holding Period =  15.3
Median Holding Period =  15.0

Python code to replace the missing by Mean / Median

dev["HP_Mean_Imputation"] = dev["Holding_Period"].fillna(
    dev["Holding_Period"].mean().round(1))


dev["HP_Median_Imputation"] = dev["Holding_Period"].fillna(
    dev["Holding_Period"].median().round(1))

Missing Value Imputation using Business Logic

Reminder – The data being analyzed is Personal Loans Cross-Sell data. As a marketer, I am more interested in the segment-wise response rate. I would like to see the response rate in the missing value segment and compare it with other segments. Accordingly, then impute the value of missing segment with the segment value where the response rate is almost the same. Let us see this using Python code.

In the code, we are converting the continuous variable Holding Period into a categorical variable by coarse binning. In the coarse bins, we have replaced the missing values by -9999.

dev['coarse_bins'] = pd.qcut(dev['Holding_Period'], 5, labels=False)

Rrate =  dev.fillna(-9999).groupby('coarse_bins').apply(lambda x: pd.Series([
        np.min(x['Holding_Period']),
        np.max(x['Holding_Period']),
        np.mean(x['Holding_Period']).round(1),
        np.size(x['Holding_Period']),
        np.sum(x['Target']),
        np.size(x['Target'][x['Target']==0]),
        ],
        index=(["min_HP","max_HP","avg_HP",
        "cnt","cnt_resp","cnt_non_resp"])
        )).reset_index()

Rrate["rrate"]=Rrate["cnt_resp"]*100/Rrate["cnt"]
Rrate["rrate"] = Rrate["rrate"].round(2)
Rrate

	coarse_bins	min_HP	max_HP	avg_HP	cnt	cnt_resp	cnt_non_resp	rrate
0	-9999.0	-9999.0	-9999.0	-9999.0	516.0	19.0	497.0	3.68
1	0.0	0.0	6.0	3.0	2119.0	183.0	1936.0	8.64
2	1.0	7.0	12.0	9.5	1788.0	94.0	1694.0	5.26
3	2.0	13.0	18.0	15.5	1873.0	83.0	1790.0	4.43
4	3.0	19.0	25.0	22.0	1999.0	50.0	1949.0	2.50
5	4.0	26.0	31.0	28.5	1705.0	31.0	1674.0	1.82

Key takeaways from the above table

Response Rate of missing value segment is 3.68%
The segment with an average Holding Period of 15.5 has a response rate of 4.43%
Imputing the missing value by 15 would imply that we will merge the missings in a segment having a relatively higher response rate (4.43%) as against the missing value response rate of 3.68%. This can lead to model overfitting.
The response rate of the segment having average HP as 15.5 is 4.43% and the response rate of the segment having average HP as 22 is 2.50%. The response rate of the missing segment is 3.68% and is between the segments having a 4.43% and 2.50% response rate.

HP	RR
15.5	4.43
22	2.5
??	3.68

Imputation step

The value to be used for imputing the missing can be calculated by interpolation.

(?-22) / (3.68 -2.5) = (22 -15.5) / (2.5 -4.43)

Based on interpolation we get the value 18 to be used for replacing the missing.

dev["HP_Imputed"] = dev["Holding_Period"].fillna(18)

Great! You have learned various techniques of missing value imputation. we now move on to the next important topic – Visualization and Pattern Detection

<<< previous blog | next blog >>>
Logistic Regression blog series home

Missing Value Imputation of Continuous Variable

Dataset

Imputation using Mean/Median Value

Python code to replace the missing by Mean / Median

Missing Value Imputation using Business Logic

Key takeaways from the above table

Imputation step

Submit a Comment Cancel reply

Recent Posts

Recent Comments

Share This