Simply Explained

Introduction to Data Mining

  • We have data, which is transformed into useful information, which in turn provides the knowledge needed to make the right decision.
  • Raw data is available to us in abundance, but on its own it is not very useful.
  • A DBMS processes and transforms the data into useful information.
  • Information is data put into a more restricted format so that it can be more easily understood.
  • Knowledge goes beyond information. It’s understanding a higher level of detail.
  • The decision is made based on this knowledge.
  • We should be able to make clever decisions using this knowledge.
  • In data mining, we will learn how we can use knowledge to make intelligent decisions using data mining tools and techniques.
  • Knowledge is a highly summarized form of data.
  • We need a data mining expert to understand this data and use it to make intelligent decisions.
  • A prediction is based on probability.
  • Data mining draws on three areas: statistics, machine learning (ML), and databases.
  • We will cover data mining techniques from an ML perspective.
  • In classical statistics, we first propose a hypothesis and then we use statistical tests to verify to what extent it is true.
  • The statistics involved in data mining are similar, but the difference is that data mining automatically considers and tests many different hypotheses and then provides the best hypothesis or model to the user.
  • But we still need the user to specify the basics of what to do and what type of model the system should build.
  • The user needs to evaluate the quality of the knowledge that has been discovered.
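The idea of automatically testing many hypotheses and keeping the best one can be sketched in a few lines. This is only an illustration under stated assumptions: the records are invented, the hypothesis family ("income at or above a threshold means the loan is repaid") is hypothetical, and real data mining tools search far richer model spaces.

```python
# Toy data, invented for illustration: (income in thousands, repaid_loan).
records = [
    (20, False), (25, False), (30, True), (32, False), (35, False),
    (40, True), (48, True), (55, True), (60, True),
]

def accuracy(threshold, data):
    """Fraction of records where 'income >= threshold' matches the label."""
    hits = sum((income >= threshold) == repaid for income, repaid in data)
    return hits / len(data)

def best_threshold(data):
    """Test one hypothesis per distinct income value and keep the best."""
    candidates = sorted({income for income, _ in data})
    return max(candidates, key=lambda t: accuracy(t, data))

best = best_threshold(records)
print(best, round(accuracy(best, records), 3))  # 40 0.889
```

The user still chose the type of model (a single income threshold) and still has to judge whether the winning rule is meaningful; the algorithm only automated the search.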
Knowledge Discovery
  • This involves preparing data to be analysed.
  • Typically, the data that is stored in the company database is not immediately ready for data mining.
  • If we use such data as it is, we will likely get incorrect results and the project will fail.
  • Each data mining algorithm makes a different assumption about what kind of data should be analysed and what its structure would be.
  • We need to make sure that the data we have is suitable for the algorithm that we are using.
  • Then we apply one or more data mining algorithms.
  • Then we need to validate the results so that we can make sure that the knowledge we have extracted is actually useful for us.
  • The process is usually not successful on the first attempt.
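The steps above (prepare the data, apply an algorithm, validate the result) can be sketched as a small pipeline. Everything here is an assumption made for illustration: the field names `age` and `city`, the trivial "average age per city" stand-in for a mining algorithm, and the `min_support` cutoff used for validation.

```python
def prepare(raw_rows):
    """Preparation: drop rows with missing values and normalize the fields."""
    cleaned = []
    for row in raw_rows:
        if row.get("age") is None or row.get("city") is None:
            continue
        cleaned.append({"age": int(row["age"]), "city": row["city"].strip().lower()})
    return cleaned

def mine(rows):
    """Mining (a deliberately trivial stand-in): average age per city."""
    totals = {}
    for row in rows:
        total, count = totals.get(row["city"], (0, 0))
        totals[row["city"]] = (total + row["age"], count + 1)
    return {city: total / count for city, (total, count) in totals.items()}

def validate(patterns, rows, min_support=2):
    """Validation: keep only patterns backed by at least min_support rows."""
    counts = {}
    for row in rows:
        counts[row["city"]] = counts.get(row["city"], 0) + 1
    return {city: avg for city, avg in patterns.items() if counts[city] >= min_support}

# Raw data as it might sit in a company database: strings, gaps, stray spaces.
raw = [
    {"age": "34", "city": " London "},
    {"age": None, "city": "Paris"},
    {"age": "28", "city": "london"},
    {"age": "45", "city": "Paris"},
]
rows = prepare(raw)
patterns = validate(mine(rows), rows)
print(patterns)  # {'london': 31.0}
```

Skipping `prepare` here would make " London " and "london" look like different cities, which is the kind of incorrect result the notes warn about.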
Data Mining
  • Data Mining is the extraction of interesting patterns from data.
  • Knowledge Discovery in Databases is identifying useful and understandable patterns in data.
Data Mining Tasks or Problems
  • There are three types of tasks
    • Discovery of Association Rules
    • Classification
    • Clustering
  • Different data mining tasks require different algorithms.
  • In association rule discovery, we find associations between data items.
  • e.g., if a person buys coffee, then they will probably also buy bread.
  • In classification, we provide labelled data and, on the basis of that data, the algorithm predicts which class new incoming data belongs to.
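To make the coffee-and-bread association rule above concrete, here is a minimal sketch that measures how strong such a rule is using support and confidence, two standard association rule measures that the notes do not name explicitly. The shopping baskets are invented for illustration.

```python
# Invented shopping baskets; each set is one customer's purchase.
transactions = [
    {"coffee", "bread"},
    {"coffee", "bread", "milk"},
    {"coffee", "milk"},
    {"bread", "butter"},
    {"coffee", "bread", "butter"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)

rule_support = support({"coffee", "bread"}, transactions)
rule_confidence = confidence({"coffee"}, {"bread"}, transactions)
print(rule_support, rule_confidence)  # 0.6 0.75
```

In this toy data, 60% of baskets contain both items, and 75% of coffee buyers also bought bread, so the rule "coffee implies bread" holds often but not always.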

Classification figure

  • We provide the algorithm with data on good and bad credit customers so that, in the future, it can predict whether a new customer will be a good or a bad credit customer.
  • In clustering, we group items or objects with similar characteristics into small groups, or clusters.
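The credit example can be sketched with a simple k-nearest-neighbours classifier. This is one possible classification technique chosen for brevity, not necessarily the one the notes have in mind, and the features (income and debt, in thousands) and all values are invented.

```python
import math
from collections import Counter

# Invented training data: ((income, debt), credit label).
training = [
    ((60, 5), "good"),
    ((75, 10), "good"),
    ((52, 8), "good"),
    ((25, 30), "bad"),
    ((30, 40), "bad"),
    ((18, 22), "bad"),
]

def classify(customer, labelled, k=3):
    """Predict a label by majority vote among the k closest known customers."""
    nearest = sorted(labelled, key=lambda item: math.dist(item[0], customer))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((58, 12), training))  # good
print(classify((22, 35), training))  # bad
```

The algorithm is never told a rule such as "high income and low debt means good credit"; it infers the class of a new customer from the labelled examples it was given.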

Clustering figure

  • Here we have a cluster of people with a high salary at the top and a cluster of people with a low salary at the bottom. This information can help the company decide what kind of products to sell to each group.
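The salary clusters described above can be reproduced with a very small clustering routine. The sketch below is a one-dimensional, two-cluster version of k-means, chosen here as an assumption since the notes do not name an algorithm, and the salaries (in thousands) are invented.

```python
# Invented salaries in thousands; no labels are provided.
salaries = [22, 85, 25, 110, 27, 90, 30, 95]

def two_means(values, iterations=10):
    """Split values into a low and a high cluster (1-D k-means with k=2)."""
    if min(values) == max(values):
        raise ValueError("need at least two distinct values to form two clusters")
    low, high = min(values), max(values)  # initial cluster centres
    for _ in range(iterations):
        # Assign each value to the nearer centre.
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        # Move each centre to the mean of its group.
        low = sum(low_group) / len(low_group)
        high = sum(high_group) / len(high_group)
    return sorted(low_group), sorted(high_group)

low_cluster, high_cluster = two_means(salaries)
print(low_cluster)   # [22, 25, 27, 30]
print(high_cluster)  # [85, 90, 95, 110]
```

Unlike classification, no labels such as "high" or "low" were supplied; the groups emerge from the similarity of the values alone.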
Desirable Properties of Knowledge Discovery
  • The knowledge discovered should be accurate, comprehensible (understandable by humans), and useful, novel, or surprising.
  • In data mining, you don’t have to specify all the rules to the system.
  • You give data, some broad instructions and the algorithm does all the calculations and provides the best possible set of rules.
  • Prediction is very hard.
  • We need human judgment to interpret the results of the algorithm.
  • We always need to know and understand what the algorithm is doing and the way to do that is to have a classification model.
  • If we blindly trust the algorithm, it might result in a disaster.
  • We need to make sure that our data is not biased.
Applications
  • The applications of data mining are practically unlimited, but the most important thing is that we need to have the right data.
  • e.g., predicting the weather, predicting customer choices, etc.
  • Identifying fraud.
  • Identifying similar customers.
