The quality of the data we provide to an algorithm largely determines the quality of its results. Below, we will look at a few issues that commonly arise when working with large amounts of data.
**Missing Relevant Attributes**

Missing relevant attributes can be a serious problem, and it can occur for a number of reasons: an attribute may not have been considered important when the data was collected, or the instrument used for collection may have been faulty. Detecting which relevant attributes are missing is difficult when a dataset has hundreds of attributes, since it is hard to predict which variables are significant.
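Before worrying about entirely absent attributes, it helps to measure how incomplete the attributes we do have are. Below is a minimal sketch of profiling missing values per column; the `records` data and the use of `None` to mark a missing entry are illustrative assumptions, not from the original text.

```python
# Sketch: flag columns with a high proportion of missing values,
# assuming records are dicts that use None for missing entries.
def missing_ratio(records, column):
    """Fraction of records where the given column is missing (None)."""
    missing = sum(1 for row in records if row.get(column) is None)
    return missing / len(records)

# Illustrative toy dataset.
records = [
    {"age": 34,   "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29,   "income": None},
    {"age": None, "income": 48000},
]

# Report each column's missing ratio so suspicious columns stand out.
for col in ("age", "income"):
    print(col, missing_ratio(records, col))
```

A column with a very high missing ratio may be effectively absent from the dataset even though it nominally exists.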
**Inconsistent Data**

Inconsistency in attribute value names or scales is another issue. For example, distances recorded in kilometres in one dataset and miles in another can be problematic when combined without being checked. This problem is less common in companies that maintain a data warehouse, which integrates data from different departments into a consistent format.
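The kilometres-versus-miles example above can be handled by normalising units before merging. Here is a minimal sketch under the assumption that each record carries a unit tag; the datasets and tags are hypothetical.

```python
# Sketch: normalise inconsistent distance units before merging datasets.
KM_PER_MILE = 1.609344  # exact international definition of the mile

def to_km(value, unit):
    """Convert a distance to kilometres given its recorded unit."""
    if unit == "km":
        return value
    if unit == "mi":
        return value * KM_PER_MILE
    raise ValueError(f"unknown unit: {unit}")

dataset_a = [(100.0, "km"), (250.0, "km")]  # recorded in kilometres
dataset_b = [(62.0, "mi"), (10.0, "mi")]    # recorded in miles

# Convert everything to a single scale before combining the datasets.
combined = [to_km(value, unit) for value, unit in dataset_a + dataset_b]
```

Raising an error on an unknown unit, rather than silently passing the value through, surfaces inconsistencies instead of hiding them.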
**Redundant Data**

Redundant or duplicate data often occurs when integrating data from multiple databases. Detecting and removing it is particularly important when the data mining algorithm is sensitive to redundancy, for example Naive Bayes. The following are a few common types of redundant data.
- Same attribute with different names. For example, DoB and Birthdate can be present as two different columns when they represent the same information.
- One attribute derived from another attribute. For example, Age can be derived from birthdate but they can be present as two separate columns.
- Two different attributes that are strongly related. For example, a separate "retired" flag may be unnecessary when age can already be used to determine whether a person is retired.
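The first type above, the same attribute under different names, can be detected by comparing column values directly. Below is a minimal sketch; the column names and data are illustrative assumptions echoing the DoB/Birthdate example.

```python
# Sketch: detect redundant columns whose values match row-for-row,
# e.g. "DoB" and "Birthdate" holding identical data under two names.
from itertools import combinations

def duplicate_columns(table):
    """Return pairs of column names whose values are identical in every row."""
    pairs = []
    for a, b in combinations(table, 2):
        if table[a] == table[b]:
            pairs.append((a, b))
    return pairs

# Illustrative table: DoB and Birthdate carry the same information.
table = {
    "DoB":       ["1990-01-05", "1985-07-22", "2001-11-30"],
    "Birthdate": ["1990-01-05", "1985-07-22", "2001-11-30"],
    "Age":       [35, 39, 23],
}

print(duplicate_columns(table))  # the DoB/Birthdate pair is flagged
```

Derived attributes (such as Age computed from Birthdate) will not match value-for-value, so catching those requires domain knowledge or correlation analysis rather than an exact comparison like this.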
We may not be able to eliminate all the noise and inconsistency in our data, but we can reduce it. There are a few ways to do this, which we will discuss in the next article.