Why We Need Data Cleaning?

Data analysis is a time-consuming task, but are you prepared before the data analysis, and have you omitted the important step: data cleaning?

From Google

In the process of data analysis, data cleaning is such a preliminary preparation after data extraction. For data scientists, we will encounter all kinds of data. Before analyzing, we need to invest a lot of time and energy to “organize and trim” the data to the way we want or need.

Why? Because the data we collect often has many problems.

Let’s look at an example first. Suppose the boss gives you the following data to require you to do data analysis. How do you feel after seeing this data?

The data you just saw may be confusing because the data is missing annotations.

When we collect and organize data, we must mark the data. The data header is very important. For example, this data sheet lacks column names. In this way, we do not know the meaning of each column of data, we cannot understand the role of these values from the business meaning, and whether these values are correct.

Let me briefly explain what these numbers represent.

This is the membership data of a clothing store. The top row is column coordinates and the leftmost column is row coordinates.

In column coordinates, column 0 represents the serial number, column 1 represents the name of the member, column 2 represents the age, column 3 represents the weight, columns 4-6 represent the measurements of the male members, and columns 7-9 represent the measurements for female members.

After understanding the meaning, let’s look at the specific data in the middle part. There are many problems with these dirty data. Many of the values are NaN and there are blank lines.

This is only part of the membership data for a store. The data business in daily work will be much more complicated. Generally, we need to count more data dimensions, such as 100 indicators. The data volume usually exceeds the TB and EB levels, so the processing difficulty of the entire data analysis increases exponentially.  At this time, it is difficult to find the problem with the naked eye.

I’ve given such a simple example to show you why it is important to prepare for data cleaning before data analysis. Experienced data analysts know that a good data analyst must be a data cleaning expert. You must know that in the entire data analysis process, whether it is time or effort, data cleaning accounts for about 80%.

Data Quality Guidelines

In the above case of clothing store membership data, you can definitely find a few problems when you look at the data. Do you want to know if there are some guidelines to regulate the quality of these data?

Here, I summarize the data cleaning rules into the following 4 key points:

  1. Completeness: Whether there is a null value for a single piece of data and whether the statistical fields are complete.
  2. Comprehensiveness: Observe all the values ​​in a column. For example, in an Excel table, we can select a column, and you can see the average, maximum, and minimum values ​​of the column. We can use common sense to determine if there is a problem with the column, such as data definition, unit identification, value itself.
  3. Legality: The legality of the type, content and size of the data. For example, there are non-ASCII characters in the data, the gender is unknown, and the age is over 150.
  4. Uniqueness: Whether the data has duplicate records, because the data usually comes from the summary of different channels, and duplicates are common. Both row data and column data need to be unique. For example, a person cannot repeatedly record multiple times, and a person’s weight cannot be repeatedly recorded multiple times in the column index.

In accordance with the above principles, we can solve most of the problems encountered in data cleaning, making the data standard, clean, and continuous, preparing for subsequent data statistics and data mining. If you want to further optimize the data quality, you also need to use it flexibly in real cases.

Data Cleaning in Python

After understanding the data quality guidelines, let’s solve the problems in the above data case.

Here you need Python’s Pandas. We introduced this tool before. It is a NumPy-based tool created specifically to solve data analysis tasks. Pandas includes a large number of libraries, and we can use these libraries to efficiently clean up data.

Python has a wealth of libraries and can be called a good data mining helper. Of course, there are many other tools for data cleaning. Here we just take Pandas as an example to help you apply data cleaning guidelines and take you to a more intuitive understanding of how data cleaning is going on.

Below, we use Pandas for cleaning in accordance with the above four guidelines.

1. Completeness

Question 1: Missing values

Some age and weight values are missing in the table. This is often due to a large amount of data. In the process, some values were not collected. Generally, we can use the following three methods:

Delete: delete records with missing data;

Mean: use the mean of the current column;

High Frequency: Uses the most frequently occurring data in the current column.

For example, if we want to fill the missing values in df [‘Age’] with the average age, we can write:

df[‘Age’].fillna(df[‘Age’].mean(), inplace=True)

If we fill with the most frequently occurring data, we can first obtain the highest frequency age_maxf of the Age field through value_counts, and then fill the missing data in the Age field with age_maxf:

age_maxf = train_features[‘Age’].value_counts().index[0]

train_features[‘Age’].fillna(age_maxf, inplace=True)

Question 2: Blank lines

We found an empty row in the table, except for exception, all values are NaN. read_csv () in Pandas does not have optional parameters to ignore blank lines, so we need to use dropna () to process the data after it has been read in to delete blank lines. 

df.dropna(how=’all’,inplace=True)

2.Comprehensiveness

Problem: Units of column data are not uniform

Observing the values in the weight column, we can find that the units of the weight column are not uniform. Some units are kgs and some units are lbs.

Here I use kgs as a uniform unit of measurement to convert lbs to kgs:

# Get lbs data in weight column

rows_with_lbs = df [‘weight’]. str.contains (‘lbs’). fillna (False) print df [rows_with_lbs]

# Convert lbs to kgs, 2.2lbs = 1kgs

for i, lbs_row in df [rows_with_lbs].iterrows ():

# Truncate from the beginning to the third character before the last lbs.

weight = int (float (lbs_row [‘weight’] [:-3]) / 2.2) df.at [i, ‘weight’] = ‘() kgs’.format (weight)

3.Legality

Problem: Non-ASCII characters

We see that Firstname and Lastname have some non-ASCII characters in the dataset. We can use the delete or replace method to solve the non-ASCII problem. Here we use the delete method:

# Remove non-ASCII characters

df [‘first_name’]. replace ({r ‘[^ \ x00- \ x7F] +’: ”}, regex = True, inplace = True)

df [‘last_name’]. replace ({r ‘[^ \ x00- \ x7F] +’: ”}, regex = True, inplace = True)

4.Uniqueness

Question 1: Multiple parameters in a column

It is not difficult to find in the table that the Name column contains the two parameters Firstname and Lastname. For data cleanliness purposes, we split the Name column into Firstname and Lastname fields. We use Python’s split method, str.split (expand = True), to split the list into new columns, and then delete the original Name column.

Let’s check if there are duplicate records in the data. If duplicate records exist, use drop_duplicates () provided by Pandas to remove duplicate data.

df[[‘first_name’,’last_name’]] = df[‘name’].str.split(expand=True)

df.drop(‘name’, axis=1, inplace=True)

Question 2: Duplicate data

# Delete duplicate rows

df.drop_duplicates([‘first_name’,’last_name’],inplace=True)

In this way, we will clean up the membership data in the above case. Let’s see the results of the cleaned data. How is it? Is it clean and standard?

Get in the Habit of Data Cleaning

Now, can you feel the data problem is not trivial, there are 6 errors in the simple example above. That’s why we often say that real-world data is “dirty” and needs to be cleaned.

The data of third parties should be cleaned, and the data of own products also need data cleaning. It can be said that without high-quality data, there is no high-quality data mining, and data cleaning is a guarantee for high-quality data.

When you’re new to data science, you will definitely find data mining to be cool and valuable. This is true, but today I want to tell you that even cool things can’t be separated from basic work, like the data cleaning work we talked about today.

You might also be interested in…

Data Analysis Practice Guide——How to begin

What Data Analysis Tools Should I Learn to Start a Career as a Data Analyst?

6 Key Skills That Data Analysts Need to Master