Simply Explained

Data Cleaning – Part I

As you might know, data collected from various sources tends to be inconsistent, noisy and incomplete. Analyzing it in that state is likely to produce incorrect results.

Data cleaning is the first step in data preprocessing. It is essential that we clean the data before we do any kind of analysis. Data cleaning is the process of filling in missing values, reducing noise and correcting inconsistencies in the data.

Missing values are one of the most common issues when collecting data. There are three approaches that can be used to deal with them.

  1. Remove all records with missing values.
    This method is generally not preferred unless the records being discarded make up only a small proportion of the dataset.
  2. Replace missing values with a default value.
    If the attribute values are numerical, the missing value can be replaced by the mean or median of the observed values.
    If the attribute values are nominal, the missing value can be replaced by the mode, that is, the most frequently observed value. Note that this may introduce noise.
  3. Predict missing values based on the values of other attributes.
    Algorithms like Naive Bayes can be used to predict the missing attribute values.
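
The three approaches above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the `records` data, the column names and the helper functions (`drop_incomplete`, `fill_with_default`, `naive_bayes_predict`) are all made up for this example, and the Naive Bayes part is a bare-bones categorical version with add-one smoothing rather than a full implementation.

```python
from collections import Counter
from statistics import mean, mode

# Hypothetical toy dataset; None marks a missing value.
records = [
    {"age": 25,   "city": "Pune",  "commute": "bike"},
    {"age": None, "city": "Delhi", "commute": "metro"},
    {"age": 35,   "city": "Delhi", "commute": "metro"},
    {"age": 30,   "city": "Pune",  "commute": "bike"},
    {"age": 40,   "city": "Delhi", "commute": "metro"},
    {"age": 30,   "city": "Pune",  "commute": None},
]


def drop_incomplete(rows):
    """Approach 1: keep only the rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def fill_with_default(rows, column, strategy):
    """Approach 2: replace missing values in one column with a statistic
    (e.g. mean, median or mode) computed from the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    default = strategy(observed)
    # Build new dicts so the original rows are left untouched.
    return [{**r, column: default if r[column] is None else r[column]}
            for r in rows]


def naive_bayes_predict(rows, target, features, query):
    """Approach 3: predict a missing nominal value from other attributes
    using a tiny categorical Naive Bayes with add-one smoothing."""
    needed = features + [target]
    train = [r for r in rows if all(r[c] is not None for c in needed)]
    class_counts = Counter(r[target] for r in train)
    best_label, best_score = None, 0.0
    for label, n_label in class_counts.items():
        score = n_label / len(train)  # prior P(label)
        for f in features:
            distinct = {r[f] for r in train}  # values seen for feature f
            matches = sum(1 for r in train
                          if r[target] == label and r[f] == query[f])
            # Smoothed estimate of P(feature value | label).
            score *= (matches + 1) / (n_label + len(distinct))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


complete = drop_incomplete(records)                      # approach 1
filled_age = fill_with_default(records, "age", mean)     # approach 2, numerical
filled_mode = fill_with_default(records, "commute", mode)  # approach 2, nominal
predicted = naive_bayes_predict(records, "commute", ["city"], records[5])  # approach 3
```

On this toy data the mode fills the missing commute with "metro", while the Naive Bayes sketch predicts "bike", because every Pune record with a known commute uses a bike. This is one way a default value can introduce noise that a prediction based on other attributes avoids.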

There are other methods of cleaning data which we will discuss in the next article.
