 +91 89396 94874 info@k2analytics.co.in
Select Page

We have discussed various missing value imputation techniques in previous blogs. In this blog, we will do the missing value imputation of a continuous variable using mean, mode, and business logic approach.

## Dataset

We will continue with the development sample as created in the training and testing step. The continuous variable, Holding Period, has missing values in it. Let us check the missing.

```print("No. of rows =",  dev.shape)
print("Occupation count =",  dev["Holding_Period"].count())
print("No. of rows with missing values =", (dev.shape - dev["Holding_Period"].count()))```
```No. of rows = 10000
Occupation count = 9484
No. of rows with missing values = 516```

## Imputation using Mean/Median Value

The simplest approach of imputing a continuous variable is to replace all missing values by Mean or Median.

```print("Mean Holding Period = ",
dev["Holding_Period"].mean().round(1))

print("Median Holding Period = ",
dev["Holding_Period"].median().round(1))
```
```Mean Holding Period =  15.3
Median Holding Period =  15.0```

### Python code to replace the missing by Mean / Median

```dev["HP_Mean_Imputation"] = dev["Holding_Period"].fillna(
dev["Holding_Period"].mean().round(1))

dev["HP_Median_Imputation"] = dev["Holding_Period"].fillna(
dev["Holding_Period"].median().round(1))
```

### Missing Value Imputation using Business Logic

Reminder – The data being analyzed is Personal Loans Cross-Sell data. As a marketer, I am more interested in the segment-wise response rate. I would like to see the response rate in the missing value segment and compare it with other segments. Accordingly, then impute the value of missing segment with the segment value where the response rate is almost the same. Let us see this using Python code.

In the code, we are converting the continuous variable Holding Period into a categorical variable by coarse binning. In the coarse bins, we have replaced the missing values by -9999.

```dev['coarse_bins'] = pd.qcut(dev['Holding_Period'], 5, labels=False)

Rrate =  dev.fillna(-9999).groupby('coarse_bins').apply(lambda x: pd.Series([
np.min(x['Holding_Period']),
np.max(x['Holding_Period']),
np.mean(x['Holding_Period']).round(1),
np.size(x['Holding_Period']),
np.sum(x['Target']),
np.size(x['Target'][x['Target']==0]),
],
index=(["min_HP","max_HP","avg_HP",
"cnt","cnt_resp","cnt_non_resp"])
)).reset_index()

Rrate["rrate"]=Rrate["cnt_resp"]*100/Rrate["cnt"]
Rrate["rrate"] = Rrate["rrate"].round(2)
Rrate
```
coarse_bins min_HP max_HP avg_HP cnt cnt_resp cnt_non_resp rrate
0 -9999.0 -9999.0 -9999.0 -9999.0 516.0 19.0 497.0 3.68
1 0.0 0.0 6.0 3.0 2119.0 183.0 1936.0 8.64
2 1.0 7.0 12.0 9.5 1788.0 94.0 1694.0 5.26
3 2.0 13.0 18.0 15.5 1873.0 83.0 1790.0 4.43
4 3.0 19.0 25.0 22.0 1999.0 50.0 1949.0 2.50
5 4.0 26.0 31.0 28.5 1705.0 31.0 1674.0 1.82

#### Key takeaways from the above table

• Response Rate of missing value segment is 3.68%
• The segment with an average Holding Period of 15.5 has a response rate of 4.43%
• Imputing the missing value by 15 would imply that we will merge the missings in a segment having a relatively higher response rate (4.43%) as against the missing value response rate of 3.68%. This can lead to model overfitting.
• The response rate of the segment having average HP as 15.5 is 4.43% and the response rate of the segment having average HP as 22 is 2.50%. The response rate of the missing segment is 3.68% and is between the segments having a 4.43% and 2.50% response rate.
 HP RR 15.5 4.43 22 2.5 ?? 3.68

#### Imputation step

The value to be used for imputing the missing can be calculated by interpolation.

(?-22) / (3.68 -2.5)   =   (22 -15.5) / (2.5 -4.43)

Based on interpolation we get the value 18 to be used for replacing the missing.

```dev["HP_Imputed"] = dev["Holding_Period"].fillna(18)
```

Great! You have learned various techniques of missing value imputation. we now move on to the next important topic – Visualization and Pattern Detection

How can we help?