
Transformation refers to replacing a variable with some function of that variable. In logistic regression, variable transformation is done to improve the fit of the model on the data. Some of the common variable transformation functions are Natural Log, Square, Square-root, Exponential, Scaling (Standardization and Normalization), and Binning (Bucketing). In this blog, we will take a practical example to understand the importance of variable transformation.
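As a rough illustration of the functions listed above, here is a minimal sketch using only the Python standard library. The `ages` values are made up for the example and are not taken from the dataset used in this blog; binning is left out of this sketch.

```python
import math
import statistics

# Illustrative values only (not from the blog's dataset)
ages = [22, 31, 38, 43, 47, 55, 64]

log_age = [math.log(a) for a in ages]      # natural log
square_age = [a ** 2 for a in ages]        # square
sqrt_age = [math.sqrt(a) for a in ages]    # square root

# Standardization: subtract the mean, divide by the standard deviation
mean_age = statistics.mean(ages)
sd_age = statistics.pstdev(ages)
standardized = [(a - mean_age) / sd_age for a in ages]

# Normalization (min-max scaling): rescale to the range [0, 1]
lowest, highest = min(ages), max(ages)
normalized = [(a - lowest) / (highest - lowest) for a in ages]

# Exponential, applied to the normalized values to keep the numbers small
exp_age = [math.exp(n) for n in normalized]
```

Which of these helps depends on the shape of the relationship between the variable and the log-odds, so it is worth plotting the variable before picking one.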

The featured image above shows the R-Squared value for the Age variable with and without transformation. The Age variable does not have a linear relationship with the log-odds, and the R-Squared value is just 0.08. With transformation, however, the trend becomes closer to linear and the R-Squared increases to 0.66.

## Logistic Regression with Age variable

```python
import statsmodels.formula.api as sm
import statsmodels.api as sma

# Fit a logistic regression (binomial GLM with the default logit link)
mylogit = sm.glm(
    formula="Target ~ Age",
    data=dev,
    family=sma.families.Binomial(),
).fit()

mylogit.summary()
```
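| Dep. Variable: | Target | No. Observations: | 10000 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 9998 |
| Model Family: | Binomial | Df Model: | 1 |
| Link Function: | logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -1863.2 |
| Date: | Fri, 03 Jul 2020 | Deviance: | 3726.5 |
| Time: | 15:23:56 | Pearson chi2: | 9.99e+03 |
| No. Iterations: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.4545 | 0.200 | -17.261 | 0.000 | -3.847 | -3.062 |
| Age | 0.0109 | 0.005 | 2.199 | 0.028 | 0.001 | 0.021 |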

The p-value of the Age variable is 0.028. If you have set the alpha value to 0.01 for variable selection in your model, the Age variable is not significant at that level. As such, we would have to drop the Age variable from the model.
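To make the comparison with alpha concrete, the sketch below recomputes a two-sided p-value from the z statistic reported for Age (2.199) using the standard normal distribution. The helper name `two_sided_p_value` is made up for this illustration; statsmodels reports this value directly in the summary.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value of a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

z_age = 2.199   # z statistic for Age, taken from the summary output
alpha = 0.01    # significance level assumed in this blog

p_age = two_sided_p_value(z_age)
is_significant = p_age < alpha

print(round(p_age, 3), is_significant)
```

The recomputed value rounds to the 0.028 shown in the summary. Note that the conclusion depends on the chosen alpha: at the more lenient 0.05 level, the same p-value would pass.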

### Visualization Chart of Age Variable

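The visualization chart of Age shows that the response rate decreases after 43. A single linear term cannot capture a trend that changes direction, which is why Age looks weak in the model above. We can probably make the Age variable significant by applying a transformation function. Possible transformation functions are: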

• Binning – Converting the Age variable from continuous to categorical
• Mirroring – Treat the red line shown in the image as a mirror and reflect the values above 43 to the lower side, so that, for example, 50 becomes 36.
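The effect of mirroring on linearity can be checked on a toy example. The sketch below uses made-up log-odds that peak exactly at 43 (an idealized shape, not the blog's data) and compares the R-Squared of a straight-line fit before and after mirroring. The function names `mirror` and `r_squared` are hypothetical helpers written for this illustration.

```python
def mirror(age, pivot=43):
    """Reflect ages above the pivot to the lower side, e.g. 50 -> 36."""
    return 2 * pivot - age if age > pivot else age

def r_squared(xs, ys):
    """R-Squared of a simple linear fit (squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Made-up data: log-odds rise up to age 43 and fall afterwards
ages = list(range(23, 64, 5))                          # 23, 28, ..., 63
log_odds = [-3.0 - 0.05 * abs(a - 43) for a in ages]

r2_raw = r_squared(ages, log_odds)
r2_mirrored = r_squared([mirror(a) for a in ages], log_odds)

print(round(r2_raw, 2), round(r2_mirrored, 2))
```

On this idealized data the raw fit explains almost nothing, while the mirrored variable is perfectly linear. Real data is noisier, so the improvement will be smaller.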

## Variable Transformation of Age in Python

The Python code and output below show that the Age variable, which was not statistically significant at the 0.01 level, becomes significant after the variable transformation.

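```python
# Variable transformation of the Age variable
# The DV_ prefix stands for Derived Variable

# Mirror ages above 43 around 43 (e.g. 50 -> 36); ages up to 43 stay unchanged
dev["DV_Age"] = dev["Age"].map(lambda x: 43 - (x - 43) if x > 43 else x)
```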
```python
# Logistic regression on the transformed Age variable

mylogit = sm.glm(
    formula="Target ~ DV_Age",
    data=dev,
    family=sma.families.Binomial(),
).fit()

mylogit.summary()
```
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -5.0110 | 0.325 | -15.411 | 0.000 | -5.648 | -4.374 |
| DV_Age | 0.0571 | 0.009 | 6.287 | 0.000 | 0.039 | 0.075 |
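To read these coefficients, it can help to convert them back to probabilities. The sketch below plugs the reported intercept (-5.0110) and DV_Age coefficient (0.0571) into the logistic function. The function name `predicted_probability` is a hypothetical helper for this illustration; on the fitted model itself you would normally call its `predict` method instead.

```python
import math

# Coefficients copied from the summary output of the transformed model
intercept = -5.0110
coef_dv_age = 0.0571

def predicted_probability(dv_age):
    """Predicted response probability for a given DV_Age value."""
    log_odds = intercept + coef_dv_age * dv_age
    return 1.0 / (1.0 + math.exp(-log_odds))

# Multiplicative change in the odds for a one-unit increase in DV_Age
odds_ratio_per_unit = math.exp(coef_dv_age)

p_at_43 = predicted_probability(43)   # peak of the mirrored variable
p_at_23 = predicted_probability(23)   # Age 23 and Age 63 both map to DV_Age 23

print(round(odds_ratio_per_unit, 3), round(p_at_43, 3), round(p_at_23, 3))
```

Because of the mirroring, the model predicts the same probability for ages that are equally far from 43 on either side.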

#### Practice Exercise

• Apply the binning approach of variable transformation to the Age variable, i.e., convert the Age variable from continuous to categorical
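As a possible starting point for the exercise, here is a minimal sketch of a binning function. The cut points in `BIN_EDGES`, the labels, and the column name `Age_Bin` mentioned afterwards are all assumptions chosen for illustration; suitable bins should come from looking at the response rate chart.

```python
from bisect import bisect_right

# Hypothetical cut points and labels; adjust them to your data
BIN_EDGES = [30, 40, 50]
BIN_LABELS = ["<30", "30-39", "40-49", "50+"]

def age_to_bin(age):
    """Map a numeric age to the label of the bin it falls in."""
    return BIN_LABELS[bisect_right(BIN_EDGES, age)]

sample_ages = [22, 30, 43, 49, 50, 67]
sample_bins = [age_to_bin(a) for a in sample_ages]
print(sample_bins)
```

With the `dev` DataFrame from this blog, something like `dev["Age_Bin"] = dev["Age"].map(age_to_bin)` followed by the formula `"Target ~ C(Age_Bin)"` should fit one coefficient per bin relative to a reference bin. pandas also offers `pandas.cut` for the same purpose.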

#### Final Note

Variable transformation is a legitimate step and a well-accepted industry practice. Some of the common reasons why we use transformations are:

• Scale the variable
• Convert non-linear relationship to linear relationship
• Reduce skewness
• Fit the best model on the given data

In the upcoming blog, we will learn about the Weight of Evidence concept to transform a categorical variable into a numerical variable.
