Exploratory Data Analysis

Exploratory Data Analysis (EDA), as the name suggests, means exploring data and analyzing it to generate information, insights, and inferences. The dictionary meaning of “explore” is to inquire into or discuss a subject in detail. In data science, EDA is done to uncover information hidden in data stored in row-and-column format.

Benefits of performing EDA:

  • helps us understand the data
  • validates our assumptions and hypotheses
  • detects outliers and missing values in data
  • helps uncover hidden patterns and insights in data
  • insights from EDA can lead to new hypotheses and further data collection
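
Two of the benefits above, detecting missing values and detecting outliers, can be sketched in a few lines of Python. This is only an illustration using the standard library: the records, the column names, and the helper functions `missing_counts` and `iqr_outliers` are made up for this example, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import statistics

# Hypothetical loan records (illustrative only); None marks a missing value.
rows = [
    {"age": 25, "income": 30000},
    {"age": 32, "income": 42000},
    {"age": None, "income": 38000},
    {"age": 41, "income": 35000},
    {"age": 29, "income": 900000},  # suspiciously large value
    {"age": 36, "income": None},
    {"age": 38, "income": 40000},
    {"age": 27, "income": 33000},
]


def missing_counts(rows):
    """Count missing (None) entries per column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (1 if value is None else 0)
    return counts


def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Drop missing entries before computing quartiles.
incomes = [r["income"] for r in rows if r["income"] is not None]

print("Missing values per column:", missing_counts(rows))
print("Income outliers:", iqr_outliers(incomes))
```

A flagged value is a prompt to investigate, not a verdict: it may be a data-entry error or a genuine but rare case.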

 

Case Study: Why Is Exploratory Data Analysis Important?

Let me share a real-life case study. A friend of mine who works at an NBFC (Non-Banking Financial Company) ran a hackathon for the company. The task was to build a risk model for its small-ticket unsecured loans portfolio. The training dataset shared with participants had anonymized column names: V1, V2, ..., Vn. The dependent variable was labeled “Target”. The variable V1 was actually “serial_number”, a unique identifier for each row.

Many people from all over the world participated in the hackathon. The organizers found that the models of 70% of the participants were not worth evaluating because they used the variable V1 (i.e. the serial number) as a predictor variable.

Note:

Do not use identifier columns such as customer ID, account number, card number, or serial number as predictor variables in a model. An identifier is unique to each row, so it carries no information that generalizes to new data.

 

Could the participants have avoided this silly mistake? How?

Yes, by performing exploratory data analysis.
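
As a rough illustration of how a basic check could have flagged the serial-number column, the sketch below looks for columns in which every row holds a distinct value. The data and the function name `find_identifier_like_columns` are hypothetical, and the check is only a heuristic: a continuous numeric variable can also be unique in every row, so anything it flags still needs a human look.

```python
def find_identifier_like_columns(rows):
    """Return columns where every row has a distinct, non-missing value."""
    if not rows:
        return []
    suspects = []
    for col in rows[0]:
        values = [row[col] for row in rows]
        if None not in values and len(set(values)) == len(values):
            suspects.append(col)
    return suspects


# Hypothetical anonymized data in the style of the hackathon dataset.
data = [
    {"V1": 1, "V2": 52000, "V3": "salaried", "Target": 0},
    {"V1": 2, "V2": 31000, "V3": "self-employed", "Target": 1},
    {"V1": 3, "V2": 52000, "V3": "salaried", "Target": 0},
    {"V1": 4, "V2": 47000, "V3": "salaried", "Target": 1},
]

# Columns worth inspecting before they go anywhere near a model.
print(find_identifier_like_columns(data))
```

On this toy data only V1 has a different value in every row, so it is the one column to inspect before modeling.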

Remember

  • Always perform exploratory data analysis before running machine learning algorithms.
  • If you do not understand a variable, then simply do not consider it in your model.
  • Pay due attention to every variable.

 

How to perform Exploratory Data Analysis?

Performing exploratory data analysis is all about attitude. A data scientist needs the patience to examine the data closely: summarizing it in visual or tabular form, interpreting the summaries, and deriving insights from them.

My approach to EDA is to start simple: analyze one variable at a time (univariate analysis), then move on to checking associations between two or more variables (bivariate and multivariate analysis).
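
A minimal sketch of this progression, using only the Python standard library, might look like the following. The data, the variable names, and the helpers `summarize_numeric` and `pearson` are invented for illustration. The univariate step summarizes each variable on its own, and the bivariate step uses the Pearson correlation as one possible measure of association between two numeric variables.

```python
import statistics
from collections import Counter

# Hypothetical applicant data (illustrative only).
ages = [25, 32, 41, 29, 36, 38, 27, 45]
incomes = [30000, 42000, 52000, 35000, 44000, 50000, 33000, 58000]
jobs = ["salaried", "salaried", "self-employed", "salaried",
        "self-employed", "salaried", "salaried", "self-employed"]


def summarize_numeric(values):
    """Univariate summary of a numeric variable."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "max": max(values),
        "stdev": statistics.stdev(values),
    }


def pearson(x, y):
    """Pearson correlation coefficient between two numeric variables."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


# Step 1: univariate analysis, one variable at a time.
print("age:", summarize_numeric(ages))
print("income:", summarize_numeric(incomes))
print("job counts:", Counter(jobs).most_common())

# Step 2: bivariate analysis, association between two variables.
print("corr(age, income):", round(pearson(ages, incomes), 3))
```

For a categorical variable such as `jobs`, a frequency table plays the role that the numeric summary plays for `ages` and `incomes`.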

There is a lot you can do in EDA. Descriptive statistics, using tabular and graphical methods, is the most common approach. Typical graphical techniques include the bar plot, histogram, box plot, pie chart, scatter plot, and line plot. We will learn the applications of these plots in our upcoming blogs.

 

Practicals

In the upcoming blog, we will work through a realistic EDA scenario, because these concepts are best understood through practice.

We at K2 Analytics believe in giving experiential learning to the students enrolling in our data science courses. We have seen many aspiring data scientists find statistics to be a difficult subject. To enable them to understand the concepts of statistics easily, we use exploratory data analysis as a way to introduce statistical thinking.

 
