
In this blog, we will see how to impute a categorical variable using the KNN technique in Python.

## Missing Value Imputation of Categorical Variable (with Python code)

### Dataset

We will continue with the development sample created in the training and testing step. The categorical variable, Occupation, has missing values. Let us check how many values are missing.

### Check for missingness

```
count_row = dev.shape[0]                      # total number of rows
count_occupation = dev["Occupation"].count()  # count() ignores missing values

print("No. of rows =", count_row)
print("Occupation count =", count_occupation)
print("No. of rows with missing occupation =", (count_row - count_occupation))
```
No. of rows = 10000
Occupation count = 7706
No. of rows with missing occupation = 2294
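The same check can also be done directly with `isna()`. The sketch below is only an illustration: `toy` is a hypothetical stand-in for `dev`, with made-up values, so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for `dev`; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [25, 40, 31, 58, 46],
    "Occupation": ["SAL", np.nan, "PROF", np.nan, "SELF-EMP"],
})

is_missing = toy["Occupation"].isna()  # True where Occupation is missing
n_missing = is_missing.sum()           # number of missing values
pct_missing = is_missing.mean()        # share of missing values

print("No. of rows with missing occupation =", n_missing)
print("Share of rows with missing occupation =", round(pct_missing, 2))
```

Applied to the figures above, the share of missing values is 2294 / 10000, i.e. about 23% of the rows.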

### Frequency Distribution

```
freq_table = dev["Occupation"].value_counts().to_frame()
freq_table.reset_index(inplace=True)  # reset index
freq_table.columns = ["Occupation", "Count"]  # rename columns
freq_table["Pct_Obs"] = round(freq_table["Count"] / sum(freq_table["Count"]), 2)
freq_table
```
|   | Occupation | Count | Pct_Obs |
|---|------------|-------|---------|
| 0 | SAL        | 2987  | 0.39    |
| 1 | PROF       | 2734  | 0.35    |
| 2 | SELF-EMP   | 1643  | 0.21    |
| 3 | SENP       | 342   | 0.04    |

### Imputation using KNN with Python code

```
# Select Age, Balance, No_OF_CR_TXNS and Occupation variables
# selecting only non missing records
dev_knn = dev.iloc[:, [2, 4, 6, 5]].dropna()
X_train = dev_knn.iloc[:, 0:3]  # predictors: Age, Balance, No_OF_CR_TXNS
y_train = dev_knn.iloc[:, 3]    # target: Occupation
```
```
## Creating the K Nearest Neighbour Classifier Object
## Note: the variables are not normalized here;
## neighbours are searched using the KD Tree algorithm

from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=21, weights='uniform',
                                 metric='euclidean', algorithm='kd_tree')
knn_model.fit(X_train, y_train)
```
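Euclidean distance is sensitive to the scale of each variable, so a variable with a much larger range (Balance is likely one here, although that is an assumption about the data) can dominate the neighbour search. The sketch below shows one possible way to standardize the predictors before KNN. It is not the approach used in this blog: the `toy` data is randomly generated and hypothetical, and `scaled_knn` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with the same column names as the blog's predictors.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(21, 60, size=n),
    "Balance": rng.uniform(0, 500_000, size=n),
    "No_OF_CR_TXNS": rng.integers(0, 50, size=n),
})
toy["Occupation"] = rng.choice(["SAL", "PROF", "SELF-EMP", "SENP"], size=n)

X_toy = toy[["Age", "Balance", "No_OF_CR_TXNS"]]
y_toy = toy["Occupation"]

# Standardize each predictor (zero mean, unit variance), then fit KNN
# with the same settings as in the blog.
scaled_knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=21, weights="uniform",
                         metric="euclidean", algorithm="kd_tree"),
)
scaled_knn.fit(X_toy, y_toy)

preds = scaled_knn.predict(X_toy.head())  # predicted labels for the first 5 rows
print(preds)
```

Because the scaler sits inside the pipeline, the same transformation is applied automatically at prediction time.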
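```
## Imputing the values for missing occupation
import pandas as pd

def impute_missing_occ(row):
    # Predict the occupation only when it is missing; otherwise keep the original value
    if pd.isnull(row["Occupation"]):
        features = row[["Age", "Balance", "No_OF_CR_TXNS"]].values.reshape((-1, 3))
        return knn_model.predict(features)[0]  # predict() returns an array; take the single label
    else:
        return row["Occupation"]  # return the scalar value, not a one-element Series

dev["Occ_KNN_Imputed"] = dev.apply(impute_missing_occ, axis=1)
```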
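Calling `predict` once per row can be slow on a large sample. A possible alternative, sketched below under stated assumptions, is to predict all missing rows in a single call. The `toy` data is hypothetical and randomly generated, `n_neighbors=5` is chosen only to suit the small toy sample, and the sketch assumes the three predictors have no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

features = ["Age", "Balance", "No_OF_CR_TXNS"]

# Hypothetical toy data; every 4th row has its Occupation blanked out.
rng = np.random.default_rng(1)
n = 100
toy = pd.DataFrame({
    "Age": rng.integers(21, 60, size=n),
    "Balance": rng.uniform(0, 500_000, size=n),
    "No_OF_CR_TXNS": rng.integers(0, 50, size=n),
    "Occupation": rng.choice(["SAL", "PROF", "SELF-EMP"], size=n),
})
toy.loc[toy.index[::4], "Occupation"] = np.nan

known = toy["Occupation"].notna()  # rows where Occupation is observed

# Fit on the observed rows only.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(toy.loc[known, features], toy.loc[known, "Occupation"])

# Copy the original column, then fill only the missing rows in one predict() call.
toy["Occ_KNN_Imputed"] = toy["Occupation"]
toy.loc[~known, "Occ_KNN_Imputed"] = model.predict(toy.loc[~known, features])

print(toy["Occ_KNN_Imputed"].isna().sum())  # number of values still missing
```

Observed values are left untouched; only the rows flagged by `~known` receive a predicted label.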

### Compare the distribution

Compare the frequency distribution of the imputed occupation with the original distribution. The imputation should not change the distribution much. If there is a significant change, the imputation logic is probably not correct.

```
freq_table = dev["Occ_KNN_Imputed"].value_counts().to_frame()
freq_table.reset_index(inplace=True)  # reset index
freq_table.columns = ["Occ_KNN_Imputed", "Count"]  # rename columns
freq_table["Pct_Obs"] = round(freq_table["Count"] / sum(freq_table["Count"]), 2)
freq_table
```
|   | Occ_KNN_Imputed | Count | Pct_Obs |
|---|-----------------|-------|---------|
| 0 | SAL             | 3979  | 0.40    |
| 1 | PROF            | 3780  | 0.38    |
| 2 | SELF-EMP        | 1875  | 0.19    |
| 3 | SENP            | 366   | 0.04    |
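To make the comparison easier, the two frequency tables can be placed side by side. The sketch below uses only the counts reported in this blog; the column names `Pct_Before`, `Pct_After`, `Imputed_Count`, `Pct_Imputed_Only` and `Pct_Shift` are illustrative.

```python
import pandas as pd

# Counts copied from the two frequency tables in this blog.
before = pd.Series({"SAL": 2987, "PROF": 2734, "SELF-EMP": 1643, "SENP": 342})
after = pd.Series({"SAL": 3979, "PROF": 3780, "SELF-EMP": 1875, "SENP": 366})

comparison = pd.DataFrame({
    "Pct_Before": before / before.sum(),  # share among observed values
    "Pct_After": after / after.sum(),     # share after imputation
    "Imputed_Count": after - before,      # labels assigned by the imputation
})
# Share of each category among the imputed rows only.
comparison["Pct_Imputed_Only"] = (
    comparison["Imputed_Count"] / comparison["Imputed_Count"].sum()
)
# Change in share caused by the imputation.
comparison["Pct_Shift"] = comparison["Pct_After"] - comparison["Pct_Before"]

print(comparison.round(3))
```

The imputed counts add up to the 2294 missing rows found earlier. Among the imputed rows alone, SELF-EMP and SENP receive a smaller share than they have among the observed values, which is worth keeping in mind because a majority vote among neighbours tends to favour the more frequent categories.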

### Conclusion

The frequency distribution changes only slightly after imputation (the share of each category moves by no more than about three percentage points), so the imputation appears reasonable.
