
Dear Blog Reader – “Welcome to our Linear Regression in Machine Learning blog series”.

In this blog series, we provide a detailed, step-by-step guide to building a Linear Regression model in both R and Python. Using examples and datasets, we explain the linear regression concepts along with the R/Python code and its output.

## Linear Regression Basics

Example 1: Fill in the missing value in the data table below.

| Input  | 10 | 20 | 30 | 40 | 50 |
|--------|----|----|----|----|----|
| Output | 6  | 12 | 18 | 24 | ?  |

Ans: The value in the missing cell should be 30. We can quickly see a proportional relationship between Output and Input: Output = 0.6 * Input.

Example 2: Fill in the missing value for the Input-Output data given below.

| Input  | 10 | 20 | 30 | 40 | 50 |
|--------|----|----|----|----|----|
| Output | 8  | 14 | 20 | 26 | ?  |

Ans: The value in the missing cell should be 32. Output is proportional to Input plus a fixed constant: Output = 0.6 * Input + 2.
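The relationship in Example 2 is easy to verify in a few lines of plain Python (no libraries needed); the slope and intercept below are simply read off from the answer above:

```python
# Verify Output = 0.6 * Input + 2 against the observed data from Example 2
inputs = [10, 20, 30, 40]
outputs = [8, 14, 20, 26]

slope = 0.6      # change in Output per unit change in Input
intercept = 2    # fixed constant added to every Output

# round() guards against tiny floating-point noise
predicted = [round(slope * x + intercept, 2) for x in inputs]
missing = round(slope * 50 + intercept, 2)

print(predicted)  # matches the observed outputs
print(missing)    # the missing cell: 32.0
```

The same check fails for Example 3's data, which is why that example needs a fitted line rather than a relationship we can spot by eye.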

Example 3: Fill in the missing value for the Input-Output data given below.

| Input  | 10 | 20 | 30   | 40   | 50 |
|--------|----|----|------|------|----|
| Output | 8  | 15 | 18.5 | 26.5 | ?  |

Ans: 31.75

To predict the missing value, we have to plot the data and fit the Line of Best Fit. I hope you remember your school days working with graph paper and plotting the Line of Best Fit. With a few data points and only two columns, graph paper will do. However, when there are many rows and columns, we need tools like Python and R.
The linear equation in this case is Output = 0.59 * Input + 2.25. Using this equation, we get 31.75 as the value for the missing cell.
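For the curious, the Line of Best Fit can be computed by hand with the least-squares formulas: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², intercept = ȳ − slope · x̄. A minimal sketch in plain Python, using the four known points from Example 3:

```python
# Least-squares line of best fit for the four known points in Example 3
x = [10, 20, 30, 40]
y = [8, 15, 18.5, 26.5]

x_mean = sum(x) / len(x)   # 25.0
y_mean = sum(y) / len(y)   # 17.0

# Numerator: sum of cross-deviations; denominator: sum of squared x-deviations
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))  # 295.0
sxx = sum((xi - x_mean) ** 2 for xi in x)                         # 500.0

slope = sxy / sxx                    # 0.59
intercept = y_mean - slope * x_mean  # 2.25

print(slope, intercept)
print(round(slope * 50 + intercept, 2))  # 31.75, the missing value
```

The R and Python code below uses library routines (`lm` and `statsmodels`) to do exactly this computation.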

### Sample R code for Linear Regression

```r
Input = c(10, 20, 30, 40)
Output = c(8, 15, 18.5, 26.5)
linear_model = lm(Output ~ Input)
linear_model$coefficients
```
```
(Intercept)       Input
       2.25        0.59
```

### Sample Python code for Linear Regression

```python
import pandas as pd
import statsmodels.formula.api as smf

Input = [10, 20, 30, 40]
Output = [8, 15, 18.5, 26.5]

df = pd.DataFrame({"Input": Input, "Output": Output})
linear_model = smf.ols(formula="Output ~ Input", data=df).fit()
linear_model.params
```
```
Intercept    2.25
Input        0.59
dtype: float64
```

## Linear Regression Table of Contents

The above was a simple example to introduce linear regression. There are many assumptions and concepts to learn in linear regression, and we will cover all of them with R and Python code in this blog series.

| Sr. No | Linear Regression Blog Series | Python & R |
|--------|-------------------------------|------------|
| 1 | **Introduction to Linear Regression**: Regression is a statistical process for estimating the relationship between a dependent variable (usually denoted by y) and one or more independent variables (usually denoted by x). | Link |
| 2 | **Simple Linear Regression**: Linear Regression with only one independent variable is Simple Linear Regression. | Link |
| 3 | **R-Squared Concept Explained**: R-Squared is a measure of how well the Linear Regression model fits the data. | Link |
| 4 | **Multiple Linear Regression**: Linear Regression with more than one independent variable is Multiple Linear Regression. | Link |
| 5 | **Adjusted R-Squared Concept Explained**: Adjusted R-Squared is a modified form of R-Squared that has been adjusted for the number of predictor variables in the model. | Link |
| 6 | **Multi-Collinearity and Variance Inflation Factor**, with Python and R code: Multicollinearity is a phenomenon in which two or more independent variables are highly intercorrelated, meaning an independent variable can be linearly predicted from one or more of the other independent variables. | Link |
| 7 | **Importance of Variable Transformation in Model Development**: Transformation refers to the replacement of a variable by some function of it. In machine learning, we apply variable transformation to improve the fit of the regression model on the data and improve model performance. | Link |
| 8 | **No-Intercept Linear Regression Model and RMSE Measure**: A regression model with intercept = 0, i.e., one whose regression line passes through the origin, is a No-Intercept Regression Model. RMSE stands for Root Mean Squared Error; it is one of the model performance measures. | Link |
| 9 | **Assumptions of Linear Regression**: Linearity, Homoscedasticity, Normal Errors, No Autocorrelation of Residuals, No Perfect Multi-Collinearity, Exogeneity, and Adequate Sample Size. | Link |

## Data file

The data file used in the blog series is inc_exp_data.csv. You can download it from our section.

Happy Learning! If you liked this blog series, kindly drop in your comments and feedback, and remember to share it with your friends and colleagues.

Thank you.
Team K2 Analytics
