The quality of the data we provide to an algorithm plays a crucial role in data mining. If we feed the algorithm low-quality data, we get low-quality results, such as incorrect predictions. It is therefore important to spend a good amount of time collecting and preparing good-quality data.
When data is collected from users for the first time, several things can go wrong. The data can be noisy, error-prone, incomplete, inconsistent, or redundant.
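As a rough illustration of how some of these problems can be checked for, here is a minimal sketch in Python. The records, the field names (id, age, income), and the plausible age range of 0 to 120 are all assumptions made up for this example; real checks depend on the data set.

```python
# Hypothetical survey records; field names and values are invented for illustration.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 48000},  # incomplete: age is missing
    {"id": 3, "age": -5, "income": 61000},    # inconsistent: impossible age
    {"id": 1, "age": 34, "income": 52000},    # redundant: exact copy of the first record
]


def quality_report(rows):
    """Count incomplete, inconsistent, and redundant rows."""
    seen = set()
    report = {"incomplete": 0, "inconsistent": 0, "redundant": 0}
    for row in rows:
        # A row is redundant if an identical row was already seen.
        key = tuple(sorted(row.items()))
        if key in seen:
            report["redundant"] += 1
            continue
        seen.add(key)
        if any(value is None for value in row.values()):
            report["incomplete"] += 1
        elif not 0 <= row["age"] <= 120:  # assumed plausible age range
            report["inconsistent"] += 1
    return report


print(quality_report(records))
```

A report like this only counts the problems; deciding whether to drop, repair, or keep the affected rows is a separate step.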
Dealing with noisy data can be frustrating, and pre-processing data to reduce noise is often tedious. There are two types of errors that lead to noisy data.
1. Systematic Error
An error caused by incorrectly calibrated equipment. It can potentially be detected and corrected; that is, the recorded values can be adjusted.
2. Non-systematic Error
An error that is very hard to detect, pinpoint, and correct.
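The difference between the two error types can be sketched with simulated readings. This is only an illustration under stated assumptions: the true value, the constant offset standing in for miscalibration, and the Gaussian noise standing in for non-systematic error are all invented, and the sketch assumes the true value of a reference measurement is known.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_VALUE = 20.0  # hypothetical true quantity being measured
OFFSET = 1.5       # assumed constant miscalibration (systematic error)

# Each reading = true value + constant offset + random noise (non-systematic error).
readings = [TRUE_VALUE + OFFSET + random.gauss(0, 0.3) for _ in range(1000)]

# Systematic error: estimate the offset against the known reference value
# and subtract it from every reading.
estimated_offset = statistics.mean(readings) - TRUE_VALUE
corrected = [r - estimated_offset for r in readings]

print("estimated offset:", round(estimated_offset, 2))
print("mean after correction:", round(statistics.mean(corrected), 2))
# Non-systematic error: the scatter around the mean is unchanged by the correction.
print("spread after correction:", round(statistics.stdev(corrected), 2))
```

Subtracting the estimated offset moves the average back to the true value, but the reading-to-reading scatter remains, which is why the second kind of error is so much harder to remove.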
Sources of noise in data collection
- Faulty data collection equipment.
- Human error while entering data.
- Failure to register changes in the data.
- The class attribute value can be time-dependent. For example, a person whose financial condition is bad today may no longer be in bad financial condition five years from now.
- People sometimes lie about their personal data in surveys for security reasons.
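One simple way to guard against time-dependent class values is to flag labels that were recorded long ago so they can be re-checked. The sketch below is a hypothetical example: the record fields, the dates, and the five-year threshold are assumptions chosen to mirror the example above, not a recommended rule.

```python
from datetime import date

MAX_AGE_YEARS = 5  # assumed threshold after which a label is considered stale


def is_stale(recorded_on, today, max_age_years=MAX_AGE_YEARS):
    """Return True if the label is older than the allowed age (approximated in days)."""
    age_days = (today - recorded_on).days
    return age_days > max_age_years * 365


# Hypothetical records with a class label and the date it was recorded.
records = [
    {"person": "A", "financial_condition": "bad", "recorded_on": date(2015, 3, 1)},
    {"person": "B", "financial_condition": "good", "recorded_on": date(2023, 6, 15)},
]

today = date(2024, 1, 1)  # fixed date so the sketch gives the same result every run
stale = [r["person"] for r in records if is_stale(r["recorded_on"], today)]
print(stale)
```

Flagged records are not necessarily wrong; they are simply candidates for collecting the value again.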