Logistic Regression Technique explained with R & Python

What is Logistic Regression?

Logistic Regression is a machine learning technique used to model the probability of an event or class when the outcome is binary. It is widely used in industry for binary classification problems. A binary outcome means the dependent variable can take only two possible values, e.g. Yes / No (1 or 0).

Applications of Logistic Regression Model:
Marketing – Whether the customer will respond to the offer or not
Risk in Lending Business – Whether the customer being given loan will repay or not
HR – Whether an employee will attrite or not
Machine – Whether an appliance will break down or not

When the above business problems are converted to mathematical form, the occurrence of an event is typically labeled as 1, and non-occurrence is labeled as 0.

Logistic vs Linear Regression?

Logistic regression is used when the dependent variable is binary (1 / 0).
Linear regression is used when the dependent variable is continuous (-∞ to +∞).

In a binary classification problem, the dependent variable takes only the values 0 and 1, and the quantity being modeled, the probability of the event, is bounded between 0 and 1. Linear regression can produce predictions outside this range, so it cannot be used directly. To restrict the predicted value of the regression model between 0 and 1, a generalized form of linear regression called logistic regression is used.

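The logistic regression equation takes the following form:

ln( p/(1-p) ) = β0 + β1·X1 + β2·X2 + ... + βn·Xn

Where:
p is the probability of event occurrence
1-p is the probability of event non-occurrence
X1, ..., Xn are the independent variables
β0, β1, ..., βn are the coefficients estimated by the model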
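
To make the roles of p and 1-p concrete, here is a minimal Python sketch, assuming the equation takes the standard log-odds form ln( p/(1-p) ). The function names odds and log_odds are illustrative and not taken from the original post. It shows that the ratio p/(1-p) grows without bound and that its logarithm covers the whole real line, which is what lets it be equated to a linear combination of the independent variables.

```python
import math

def odds(p):
    """Odds of the event: probability of occurrence over non-occurrence."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds, the left-hand side of the equation."""
    return math.log(odds(p))

# Probabilities stay inside (0, 1); log-odds range over negative and positive values.
for p in (0.1, 0.5, 0.9):
    print(p, round(odds(p), 4), round(log_odds(p), 4))
```

A probability of 0.5 corresponds to odds of 1 and log-odds of 0; probabilities below 0.5 give negative log-odds and probabilities above 0.5 give positive log-odds.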

Understanding logistic regression concept with data

We will use a hypothetical dataset, shown in the table below, to understand the concept of logistic regression.
Note: The entire data file named LR_DF.csv can be downloaded from our resources section.

Cust_ID	Target	Age
C1	0	30
C2	0	43
C3	0	53
C4	0	45
C5	0	37
C6	0	41
C7	1	46
C8	1	33
..	..	..
C20000	1	43

Independent Variable – Age is the independent variable in the above data.
Dependent Variable – Target is our binary dependent variable, where 1 is the responder class (the customer responded to the marketing offer) and 0 is the non-responder class.

Where is the probability?

The value in the Target column for each row is 0 or 1.
Now imagine that you aggregate the data by Age and compute the proportion of customers responding in each age group, i.e. the response probability. A sample of the resulting table is shown below.

Age	Target = 0	Target = 1	Total	Resp. Probability
21	207	5	212	0.024
22	241	7	248	0.028
23	375	9	384	0.023
24	375	21	396	0.053
25	531	13	544	0.024
26	591	21	612	0.034
27	600	12	612	0.020
28	718	30	748	0.040
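
The aggregation described above can be sketched in Python as follows. Since the full LR_DF.csv file is not reproduced here, the sketch rebuilds row-level (age, target) records from the counts in the table; the names table_counts, records and response_probability_by_age are illustrative assumptions, not part of the original post.

```python
from collections import defaultdict

# (Age, count with Target = 0, count with Target = 1), copied from the table above.
table_counts = [
    (21, 207, 5), (22, 241, 7), (23, 375, 9), (24, 375, 21),
    (25, 531, 13), (26, 591, 21), (27, 600, 12), (28, 718, 30),
]

# Rebuild one (age, target) record per customer, mimicking the shape of the raw data.
records = []
for age, non_responders, responders in table_counts:
    records += [(age, 0)] * non_responders + [(age, 1)] * responders

def response_probability_by_age(rows):
    """Group rows by age and return responders / total for each age."""
    totals = defaultdict(int)
    responders = defaultdict(int)
    for age, target in rows:
        totals[age] += 1
        responders[age] += target  # target is 1 for a responder, 0 otherwise
    return {age: responders[age] / totals[age] for age in sorted(totals)}

probabilities = response_probability_by_age(records)
for age, p in probabilities.items():
    print(age, round(p, 3))
```

Rounded to three decimals, the computed values agree with the Resp. Probability column.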

Logistic regression is designed to model the relationship between this probability and the independent variable.

Logistic Function (Sigmoid Function)

Let us now see the mathematical steps to express the logistic regression equation in probability form.

The function p= 1/(1+ e^(-z) ) is called the Logistic Function.
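
The intermediate algebra can be written out as below. This is a sketch of the standard derivation, assuming the logistic regression equation has the log-odds form and using z as shorthand for its linear right-hand side; the symbols β and X are notation chosen here for illustration.

```latex
\begin{aligned}
\ln\!\left(\frac{p}{1-p}\right) &= z,
  \qquad z = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n \\
\frac{p}{1-p} &= e^{z} \\
p &= e^{z}\,(1-p) \\
p\,(1 + e^{z}) &= e^{z} \\
p &= \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}
\end{aligned}
```

The last step divides the numerator and denominator by e^z, which gives the form p = 1/(1 + e^(-z)) stated above.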

S-Curve (Sigmoid Function)

If we plot p against z using the logistic function, p = 1/(1 + e^(-z)), we get an S-shaped curve. Because of this S-shape, the logistic function is also called a sigmoid function. The table below lists the values of p for z from -10 to 10.

z	p	z	p
0	0.500000	0	0.500000
-1	0.268941	1	0.731059
-2	0.119203	2	0.880797
-3	0.047426	3	0.952574
-4	0.017986	4	0.982014
-5	0.006693	5	0.993307
-6	0.002473	6	0.997527
-7	0.000911	7	0.999089
-8	0.000335	8	0.999665
-9	0.000123	9	0.999877
-10	0.000045	10	0.999955
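
The values in the table can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the names sigmoid and table are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function p = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# p for every integer z from -10 to 10, rounded to six decimals as in the table.
table = {z: round(sigmoid(z), 6) for z in range(-10, 11)}

# Print negative and positive z side by side, mirroring the table layout.
for z in range(0, 11):
    print(f"{-z:>4} {table[-z]:.6f}   {z:>3} {table[z]:.6f}")
```

Note that sigmoid(z) + sigmoid(-z) = 1 for every z, which is why the two halves of the table mirror each other around p = 0.5.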

The sigmoid function (S-curve) has two horizontal asymptotes: p = 0 at the lower end of the curve and p = 1 at the upper end.

What is an asymptote?

An asymptote is a straight line that a curve continually approaches but does not meet at any finite distance.

As z becomes more negative, the probability tends towards 0; conversely, as z takes larger positive values, the probability tends towards 1.
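
To tie the sigmoid function back to the age-wise response data, here is a rough Python sketch that fits p = 1/(1 + e^(-z)), with z = b0 + b1 × (Age - mean Age), to the counts from the response-probability table earlier in this post. The Newton-Raphson fitting procedure, the centring of Age on its mean and all function and variable names are assumptions made for this illustration; it is not the method used in the R and Python modules of the series, and only the eight ages shown in the sample table are used.

```python
import math

# Age-wise counts copied from the response-probability table earlier in this post.
ages = [21, 22, 23, 24, 25, 26, 27, 28]
responders = [5, 7, 9, 21, 13, 21, 12, 30]
totals = [212, 248, 384, 396, 544, 612, 612, 748]

def sigmoid(z):
    """Logistic function p = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_grouped_logistic(x, y, n, iterations=25):
    """Maximum-likelihood fit of p = sigmoid(b0 + b1 * x) to grouped counts.

    x: predictor value per group, y: responders per group, n: group size.
    Uses plain Newton-Raphson updates starting from b0 = b1 = 0.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood with respect to b0 and b1.
        g0 = sum(yi - ni * pi for yi, ni, pi in zip(y, n, p))
        g1 = sum((yi - ni * pi) * xi for xi, yi, ni, pi in zip(x, y, n, p))
        # Entries of the 2x2 information matrix.
        w = [ni * pi * (1.0 - pi) for ni, pi in zip(n, p)]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Centre Age on its mean (an assumption made here to keep the numbers well scaled).
mean_age = sum(ages) / len(ages)
x = [a - mean_age for a in ages]
b0, b1 = fit_grouped_logistic(x, responders, totals)

# Fitted response probability for each age.
fitted = [sigmoid(b0 + b1 * xi) for xi in x]
for age, p in zip(ages, fitted):
    print(age, round(p, 4))
```

Whatever the value of z, each fitted probability lies strictly between 0 and 1, which is the property that motivated using the sigmoid function in place of a straight line.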

Logistic Regression Blog Series Links

Business Objective Statement: MyBank wishes to develop a Direct Marketing Channel to sell their loan products to existing deposit account customers. The bank executed a pilot campaign to cross-sell personal loans to its existing customers. A random base of 20000 customers was targeted with an attractive personal loan offer and processing fee waiver. The data of the customers who were targeted and their response to the marketing offer has been provided. The data is in the file (LR_DF.csv) and it can be downloaded from our resources section.

We will use the above business case to explain the concepts of Logistic Regression along with R and Python code in this blog series. The links to various modules of the blog series are given below:

Sr. No.	Logistic Regression blog-series	R	Python
1.	Introduction to Logistic Regression
2.	Hypothesis Development	Link
3.	Single Variable Logistic Regression Model Development & Model Summary Interpretation	Link
4.	Training and Testing	Link
5.	Splitting Data in Dev – Validation – Holdout Sample	Link
6A.	Information Value Concept	Link
7.	Outlier Treatment	Link
8.	Missing Value Imputation	Importance of Missing Value Imputation Imputation using KNN in Python
9.	Visualization and Pattern Detection	Visualization using Double Axis Charts and Log-Odds Plot Variable Transformation & Trend Fitting
10.	Weight of Evidence	WoE
11.	Model Development	Multiple Logistic Regression
12.	Model Performance Measurement	Rank Order, KS, Lift Chart Classification Accuracy, AUC-ROC Concordance, Gini, Goodness of Fit
13.	Model Validation	Link	Link
14.	Hold-out Testing	Link	Link
15.	Model Implementation & Deployment Strategy	Link	Link
