In this blog, we will see how to impute a categorical variable using the KNN technique in Python.

Pre-read: K Nearest Neighbour Machine Learning Algorithm

Missing Value Imputation of Categorical Variable (with Python code)

 

Dataset

We will continue with the development sample as created in the training and testing step. The categorical variable, Occupation, has missing values in it. Let us check the missing.
 

Check for missingness

count_row = dev.shape[0]

count_occupation = dev["Occupation"].count()

print("No. of rows =",  count_row)
print("Occupation count =",  count_occupation)
print("No. of rows with missing occupation =", (count_row - count_occupation))
No. of rows = 10000
Occupation count = 7706
No. of rows with missing occupation = 2294

 

Frequency Distribution

freq_table = dev["Occupation"].value_counts().to_frame()
freq_table.reset_index(inplace=True) # reset index
freq_table.columns = [ "Occupation" , "Count"] # rename columns
freq_table["Pct_Obs"] = round(freq_table['Count'] / sum(freq_table['Count']),2)
freq_table

  Occupation Count Pct_Obs
0 SAL 2987 0.39
1 PROF 2734 0.35
2 SELF-EMP 1643 0.21
3 SENP 342 0.04

 

Imputation using KNN with Python code

# Select Age, Balance, No_OF_CR_TXNS and Occupation variables
# selecting only non missing records

dev_knn = dev.iloc[:, [2, 4, 6, 5]].dropna()
X_train = dev_knn.iloc[:, 0:3]
y_train = dev_knn.iloc[:, 3]

## Creating the K Nearest Neighbour Classifier Object
## I have not normalized the variables as such using KD Tree algorithm 

from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors = 21, weights = 'uniform', 
metric = 'euclidean', algorithm = 'kd_tree')
knn_model.fit(X_train, y_train)
## Imputing the values for missing occupation

def impute_missing_occ (row):
    if pd.isnull(row['Occupation']) :
        return knn_model.predict(
            row[["Age","Balance","No_OF_CR_TXNS"]].values.reshape((-1, 3)))
    else:
        return row[['Occupation']]
dev["Occ_KNN_Imputed"] = dev.apply(impute_missing_occ,axis=1)

 

Compare the distribution

Compare the frequency distribution of imputed occupation with the original distribution. There mustn’t be much change in the distribution because of the imputation. If there is a significant change in then probably the imputation logic is not correct.

freq_table = dev["Occ_KNN_Imputed"].value_counts().to_frame()
freq_table.reset_index(inplace=True) # reset index
freq_table.columns = [ "Occ_KNN_Imputed" , "Count"] # rename columns
freq_table["Pct_Obs"] = round(freq_table['Count'] / sum(freq_table['Count']),2)
freq_table

  Occupation Count Pct_Obs
0 SAL 3979 0.40
1 PROF 3780 0.38
2 SELF-EMP 1875 0.19
3 SENP 366 0.04

 

Conclusion

As there is not much difference in frequency distribution before and after imputation, we may assume the imputation has happened correctly.

<<< previous blog         |         next blog >>>
Logistic Regression blog series home

How can we help?

Share This

Share this post with your friends!