Interview Questions on Exploratory Data Analysis (EDA)

Radhika 24 Jun, 2022 • 10 min read
This article was published as a part of the Data Science Blogathon.

Introduction

Are you aspiring to become a data analyst or data scientist, but struggling to crack the interviews? Getting a break in the data science field can be tough, doubly so if you are a fresher. So it’s better to be prepared before facing the interviews. There are several rounds one has to clear to land a data science job, and one of the most important is the technical round. But what kind of questions can be asked in the technical round? How can you prepare, and what resources should you refer to?

This article includes a list of plausible questions that are likely to come up in the technical round for a data science job.

I have seen candidates fail interviews because they had good knowledge of models but did not give enough importance to the Exploratory Data Analysis part; they failed to understand the importance of the balance between EDA and modeling. So, if you can answer and understand these EDA interview questions, rest assured, you will put up a tough fight in your job interview.

Happy learning and Good luck Guys !!

Questions:

1. What is the lifecycle of a data science project?

  • Data Collection

  • Exploratory Data Analysis

  • Model Training and Testing

  • Results Analysis from the models.

Figure: Data science project lifecycle

2. What is the Difference between Univariate, Bivariate, and Multivariate analysis?

  • Univariate – When we analyze one variable at a time, it is called univariate data analysis. This analysis aims to describe the variable in question and find patterns that exist within it. Example: height of students
  • Bivariate – Bivariate data involves two different variables. The analysis of this type of data deals with causes and relationships. The investigation determines the relationship between the two variables, where one of the variables is the target variable. Example: temperature and ice cream sales in the summer season.
  • Multivariate – Analyzing three or more variables together is categorized under multivariate data analysis. It is similar to bivariate analysis but involves more than one dependent variable.
    Example: data for house price prediction

3. Mention the two kinds of target variables for predictive modeling.

The two kinds of target variables are:

  • Numerical/Continuous variable – Variables whose values lie within a range and can take any value in that range; at prediction time, the values are not bound to come from that same range either.
    For example: heights of students – 5; 5.1; 6; 6.7; 7; 4.5; 5.11
    Here the range of the observed values is (4.5, 7),
    and the height of a new student may or may not fall within this range.
  • Categorical variable – Variables that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group on the basis of some qualitative property.
    A categorical variable that can take on exactly two values is termed a binary (or dichotomous) variable; categorical variables with more than two possible values are called polytomous variables.
    For example: Exam result: Pass, Fail (binary categorical variable)
    The blood type of a person: A, B, O, AB (polytomous categorical variable)

4. How to perform univariate analysis for numerical and categorical variables?

  • For the Numerical variables:
    One can plot a Box and Whiskers plot and KDE plot to better understand the data; below is an example of the Age column plotted using both box and KDE plot.

Figure: Box plot and KDE plot of the Age column

The box plot and the KDE plot both show that the age of the population lies roughly between 25 and 50 years, with a mean of about 38 years. The right skew in the KDE plot shows that more of the population was between 20 and 30 years old and very few aged people were in the sample, which can be verified from the box plot too, as the box is aligned more towards Q1 rather than being evenly distributed.
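A minimal sketch of how such univariate plots for a numerical column could be produced with pandas and seaborn; the Age values below are made up for illustration and are not the article's dataset.

```python
# Minimal sketch: univariate plots for a numerical column (illustrative data).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"Age": [22, 24, 25, 27, 29, 31, 33, 35, 38, 41, 45, 52, 60, 67]})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(x=df["Age"], ax=axes[0])       # median, quartiles, whiskers, potential outliers
sns.kdeplot(data=df, x="Age", ax=axes[1])  # smoothed shape of the distribution
plt.tight_layout()
plt.show()
```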

 

  • For the Categorical variables:
    Bar plots and pie charts are a great way to analyze and understand categorical data. The two plots below represent the number (bar chart) and proportion (pie chart) of individuals opting for each Course_Type.

Figure: Bar plot and pie chart of Course_Type

Here, the bar plot and pie chart show that the “Course” Course_Type was the highest in number, with 51.3% of people subscribing to such courses, followed by the “Program” Course_Type, while the “Degree” Course_Type had the fewest subscribers, at only 0.3%.
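A similar sketch for a categorical column; the subscriber counts below are made up to roughly match the proportions described above.

```python
# Minimal sketch: univariate plots for a categorical column (illustrative counts).
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.Series({"Course": 513, "Program": 484, "Degree": 3}, name="Course_Type")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind="bar", ax=axes[0])                  # absolute counts
axes[0].set_ylabel("Number of subscribers")
counts.plot(kind="pie", ax=axes[1], autopct="%.1f%%")  # proportions
axes[1].set_ylabel("")
plt.tight_layout()
plt.show()
```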

 

5. How to perform Bivariate analysis for Numerical-numerical, Categorical-Categorical, and Numerical-Categorical variables?

Univariate analysis is the analysis of one (“uni”) variable, while bivariate analysis is the analysis of exactly two variables. It is one of the simplest forms of statistical analysis, used to find out whether there is a relationship between two sets of values.

Though bivariate analysis can be performed on any two sets of variables, it is typically performed using an independent variable and the dependent (target) variable.

  • Numerical-Numerical – Here, one of the numerical variables is the target variable and the other is an independent numerical variable. A scatter plot is a great way of understanding numerical-numerical relationships. In the example shown, Sales is the target numerical variable plotted on the y-axis against the User_Traffic numerical variable on the x-axis.

Figure: Scatter plot of Sales against User_Traffic

The scatter plot helps us understand that Sales increases roughly linearly as User_Traffic increases.
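A minimal sketch of such a scatter plot, assuming a synthetic DataFrame with User_Traffic and Sales columns standing in for the article's data.

```python
# Minimal sketch: numerical-numerical bivariate analysis via a scatter plot (synthetic data).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
traffic = rng.uniform(100, 1000, size=50)
df = pd.DataFrame({"User_Traffic": traffic,
                   "Sales": 0.5 * traffic + rng.normal(0, 30, size=50)})

sns.scatterplot(data=df, x="User_Traffic", y="Sales")  # target variable on the y-axis
plt.show()
```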

 

  • Categorical-Categorical – One of the categorical variables is the target variable and the other can be an independent categorical variable. In the example below, the target variable is whether the customer defaults next month (represented by 0 or 1), plotted against the Education categorical independent variable.

Figure: Stacked bar chart of default payment next month by Education

Bivariate analysis for two categorical variables can easily be done with the help of grouped or stacked bar charts. In the above example, we can see that defaulters (represented by 1, in orange) form the largest share for the High School category, followed by University and then the Others category, even though these groups are comparatively small in overall number.
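A minimal sketch of a stacked bar chart built from a crosstab; the column names and the tiny dataset below are illustrative, not the article's data.

```python
# Minimal sketch: categorical-categorical bivariate analysis via a stacked bar chart.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Education": ["High School", "University", "Others", "High School", "University",
                  "High School", "University", "Others", "High School", "University"],
    "default_next_month": [1, 0, 0, 1, 1, 0, 0, 1, 1, 0],  # 1 = default, 0 = no default
})

table = pd.crosstab(df["Education"], df["default_next_month"])  # counts per combination
ax = table.plot(kind="bar", stacked=True)
ax.set_ylabel("Count")
plt.show()
```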

  • Numerical-Categorical – Here, the target variable can be either the categorical or the numerical one; in such cases, bar plots or strip plots are a great way of understanding the data. Below is an example of a bar plot and a strip plot where Sales, the numerical (target) variable, is on the y-axis and Course_Domain, the categorical variable, is on the x-axis.

Figure: Bar plot and strip plot of Sales by Course_Domain

Here, the bar plot helps in understanding that the “Business” Course_Domain gives the highest sales, followed by Finance and then Development, with the least sales coming from the Software Course_Domain. The strip plot for the Business category further shows that its minimum sale value is quite high compared with the other categories while its maximum sale value is lower than theirs, yet overall it still generates the most sales.
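A minimal sketch of the bar and strip plots, assuming a synthetic DataFrame with Course_Domain and Sales columns.

```python
# Minimal sketch: numerical-categorical bivariate analysis (synthetic data).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Course_Domain": np.repeat(["Business", "Finance", "Development", "Software"], 25),
    "Sales": np.concatenate([rng.normal(m, 5, 25) for m in (60, 50, 45, 40)]),
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.barplot(data=df, x="Course_Domain", y="Sales", ax=axes[0])    # mean sales per category
sns.stripplot(data=df, x="Course_Domain", y="Sales", ax=axes[1])  # individual observations
plt.tight_layout()
plt.show()
```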

 

6. What are the different tests that are used for verifying analysis/hypothesis for numerical-numerical, categorical-categorical, and numerical-categorical variables?

  • For numerical-numerical data, the correlation matrix is used to understand how strongly the independent variables are correlated with the target variable.

Figure: Correlation matrix of the numerical features

In this example, Sales is the target variable, and the features whose correlation values are closest to either +1 or -1 are the most strongly correlated with Sales. User_Traffic (in the orange box) has a correlation value of +0.8.
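A minimal sketch of computing and plotting a correlation matrix, with synthetic columns standing in for the article's features (Marketing_Spend is a hypothetical extra feature).

```python
# Minimal sketch: correlation matrix heatmap for numerical features (synthetic data).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
traffic = rng.uniform(100, 1000, size=100)
df = pd.DataFrame({
    "User_Traffic": traffic,
    "Marketing_Spend": rng.uniform(10, 100, size=100),     # hypothetical extra feature
    "Sales": 0.5 * traffic + rng.normal(0, 50, size=100),
})

corr = df.corr()  # Pearson correlation by default
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```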

  • For Categorical-Numerical variable testing, the T-test or Z-test is mostly used, depending upon whether the number of observations is below 30 (T-test) or above 30 (Z-test). And if the category column has more than two categories, the ANOVA test is preferred over the T/Z-test (see the code sketch after this list).
    These tests produce p-values, which help in accepting or rejecting the null hypothesis made for the test columns.
  • For Categorical-Categorical variables, the chi-square test is used. There are two types of chi-square tests.
    The chi-square goodness-of-fit test determines whether sample data matches a population.
    The chi-square test for independence checks whether the distributions of two categorical variables are related to each other; the expected frequencies are computed under the assumption that the variables are independent.
    A very small chi-square test statistic means that your observed data fits these expected (independence) values extremely well; in other words, there is no evidence of a relationship. A very large chi-square test statistic means that the data does not fit the independence assumption well; in other words, there is evidence of a relationship.
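A minimal sketch of these tests using scipy.stats; all groups and the contingency table below are synthetic, and the group and column names are illustrative only.

```python
# Minimal sketch: t-test, ANOVA, and chi-square test of independence with scipy.stats.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)

# Numerical target across a two-level category: two-sample t-test
group_a = rng.normal(50, 10, size=25)
group_b = rng.normal(55, 10, size=25)
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Numerical target across three or more categories: one-way ANOVA
group_c = rng.normal(52, 10, size=25)
f_stat, f_p = stats.f_oneway(group_a, group_b, group_c)

# Two categorical variables: chi-square test of independence on a contingency table
observed = pd.DataFrame({"default": [30, 45, 5], "no_default": [120, 260, 40]},
                        index=["High School", "University", "Others"])
chi2, chi_p, dof, expected = stats.chi2_contingency(observed)

print(f"t-test p={t_p:.3f}, ANOVA p={f_p:.3f}, chi-square p={chi_p:.3f}")
```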

Note: If the p-value ≤ 0.05, that indicates strong evidence against the null hypothesis, so you reject the null hypothesis. If the p-value > 0.05, that indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.

 


For more understanding of tests, one should be thoroughly familiar with basic statistics concepts.

 

7. During the data preprocessing step, how should one treat missing/null values? How will you deal with them?

  • There are three types of missing data:
  1. MCAR: Missing Completely At Random. It is the highest level of randomness. This means that the variable with the missing values is not dependent on any other variable/feature values. An example of MCAR is a weighing scale that ran out of batteries. Some of the data will be missing simply because of bad luck.
  2. MAR: Missing At Random. This means that the missing values in any column/feature are dependent on other feature values. For example, when placed on a soft surface, a weighing scale may produce more missing values than when placed on a hard surface. Such data are thus not MCAR. If, however, we know the surface type and if we can assume MCAR within the type of surface, then the data are MAR.
  3. MNAR: Missing Not At Random. Missing not at random data is a more serious issue and in this case, it is advisable to check the data gathering process further and understand the reason behind missing data. For example, the weighing scale mechanism may wear out over time, producing more missing data as time progresses, but we may fail to note this. If the heavier objects are measured later in time, then we obtain a distribution of the measurements that will be distorted. MNAR includes the possibility that the scale produces more missing values for the heavier objects (as above). Another example, if most people refuse to answer some particular questions, what was the reason? Was it an unclear question or some other issue? This helps in making better business decisions and saves time to do modeling as a basic issue might lie here.
  • Now that we have seen what type of missing data exists in our dataset, we should check what percentage of missing information exists for features.
  • If the missing data is missing completely at random (MCAR), then even a missing percentage of around 20% can be ignored, but for the other two types of missing data, the missing values should not be ignored.
  • If the missing data is MCAR with a high missing percentage, those features are advised to be dropped and not included in the modeling part. Similarly, if the missing data is MAR with a high missing percentage, the features can be dropped, but if the MAR percentage is low they shouldn’t be dropped. Features with a missing percentage of more than about 10% are generally advised not to be included in the modeling stage.
  • Now that unnecessary features have been dropped or ignored, the features still having missing values should be treated using Imputation. Imputation is the process of filling the missing data by some statistical methods. Imputation is useful as it replaces the missing data with an estimated value based on other available information.
  • If the missing values in a column or feature are numerical, the values can be imputed by the mean of the complete cases of the variable. Mean can be replaced by median if the feature is suspected to have outliers. For a categorical feature, the missing values could be replaced by the mode of the column.
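A minimal sketch of inspecting and imputing missing values with pandas; the columns and values below are made up for illustration.

```python
# Minimal sketch: missing-value inspection and simple imputation (illustrative data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25.0, 32.0, np.nan, 41.0, 29.0, np.nan, 55.0],
    "City": ["Delhi", "Mumbai", "Delhi", None, "Pune", "Delhi", None],
})

print(df.isna().mean())  # fraction of missing values per feature

# Numerical feature: impute with the mean (or the median if outliers are suspected)
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical feature: impute with the mode
df["City"] = df["City"].fillna(df["City"].mode()[0])
```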

8. What is an outlier and how to identify them?

An outlier is an observation point that is distant from the others. Outliers sometimes represent errors in measurement or bad data collection, may reflect variables not considered when collecting the data, or can simply be part of the data distribution; in any case, they can skew the results and the insights drawn from them.
There is no single method to detect outliers, as every dataset is different. One practice that helps is for you (the data analyst) to inspect the unfiltered, raw observations and decide whether a value is an outlier based on domain knowledge.

After performing the above step, one can understand the data and look at outliers using these two methods:

Figure: Outliers shown on a box plot

Box plot: In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers will appear separate from the plot. (Source: Wikipedia)

 

 

Scatter plot: A scatter plot graphs points on two axes using Cartesian coordinates. By graphing the points this way, we can visually identify points that fall outside the expected grouping; these points are likely to be outliers. (Source: Wikipedia)

 

Outliers can be dropped if they are garbage values. Example: height of an adult = 0 ft. This cannot be true, as the height cannot be zero, so in this case the outlier can be removed. Outliers with extreme values can also be removed: for example, if all the data points are clustered between zero and 10 but one point lies at 100, then we can remove this point. If you cannot drop outliers, you can normalize the data; this way, the extreme data points are pulled into a similar range.
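As a complement to the box plot, a minimal sketch of the usual 1.5 × IQR rule for flagging outliers, applied to a small made-up series.

```python
# Minimal sketch: flagging outliers with the IQR rule that a box plot is based on.
import pandas as pd

values = pd.Series([3, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 100])  # 100 is an obvious outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)  # flags the value 100
```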

9. How can the data be normalized?

Data can be normalized by either transforming the data or by scaling the data down in a particular range.

  • Transformation – If the data is right-skewed, a log transformation is a good way to make it behave more like a normal distribution, and if the data is left-skewed, an exponential (power) transformation helps in transforming it towards a normal distribution.
  • Scaling – There are two widely used scalers:
    • Normalization (Min-Max Scaler): This scales the data down to the range 0 to 1, where the minimum value corresponds to 0 and the maximum value to 1.

      A value is normalized as follows: y = (x – min) / (max – min), where min and max are the minimum and maximum values of the feature being normalized.

    • Standardization (Standard Scaler): This scaler rescales the data so that it has a mean of 0 and a standard deviation of 1; if the original data is normally distributed, the result follows the standard normal distribution.

A value is standardized as follows: y = (x – mean) / standard_deviation

 

Note: If the distribution of the quantity is normal, then it should be standardized, otherwise, the data should be normalized. Standardization can give values that are both positive and negative centered around zero. It may be desirable to normalize data after it has been standardized.
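A minimal sketch of both scalers using scikit-learn; the Age column below is made up for illustration.

```python
# Minimal sketch: Min-Max scaling and standardization with scikit-learn (illustrative data).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"Age": [22, 25, 31, 38, 45, 52, 60]})

df["Age_minmax"] = MinMaxScaler().fit_transform(df[["Age"]]).ravel()      # (x - min) / (max - min)
df["Age_standard"] = StandardScaler().fit_transform(df[["Age"]]).ravel()  # (x - mean) / std
print(df)
```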

Voila !!

End Notes !!

EDA constitutes a major part of the interview questions. I hope this was helpful. Do let me know here if there are more important EDA interview questions that you think I forgot to add to this article.

Thank you 🙂

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

