To reason about data mining and its results, we need to understand how the process works. One important distinction is the difference between:
- Mining the data to find patterns and build models
- Using the results of data mining
Those new to data mining often confuse these two very different processes when studying data science, and managers sometimes confuse them when discussing business analytics. How the results will be used should influence and inform the data mining process itself, but the two should be kept distinct to prevent contamination of our results.
In the customer churn example that we have been using in our articles on data science, consider the deployment scenario in which the results will be used. We want to use the model to predict which of our customers will leave AT&T. Specifically, assume that data mining has created a class probability estimation model M. Given an existing AT&T customer, described by a set of characteristics, M takes these characteristics as input and produces a score or probability estimate of attrition. This is the use of the results of data mining. The data mining itself produces the model M from some other, often historical, data.
The diagram above illustrates these two phases. Data mining produces the probability estimation model (the top half of the diagram). In the use phase (the bottom half of the diagram), the model is applied to a new, unseen case and generates a probability estimate for it.
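The two phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method behind the churn example: the features (monthly support calls, years as a customer), the six historical records, and the function name `mine_model` are all made up, and a small hand-written logistic regression stands in for whatever technique would actually produce M.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability, avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mine_model(rows, labels, epochs=500, learning_rate=0.1):
    """Mining phase: fit a logistic regression to historical data.

    Returns the model M, a function from one customer's characteristics
    to an estimated probability of attrition.
    """
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - y  # gradient of the log loss w.r.t. the score
            bias -= learning_rate * error
            weights = [w - learning_rate * error * xi
                       for w, xi in zip(weights, x)]

    def M(x):
        return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

    return M


# Hypothetical historical data (invented for this sketch).
# Each row: [monthly support calls, years as a customer].
historical_customers = [[5, 1], [4, 2], [6, 1], [0, 6], [1, 5], [0, 8]]
historical_churned = [1, 1, 1, 0, 0, 0]  # 1 = customer left

# Phase 1: mine the historical data to build the model M.
M = mine_model(historical_customers, historical_churned)

# Phase 2: use the model on a new, unseen customer.
new_customer = [7, 1]
estimate = M(new_customer)
print(f"Estimated probability of attrition: {estimate:.2f}")
```

Note that `mine_model` runs once, on historical data, while `M` can then be applied to any number of new customers without touching the historical data again. That separation is the distinction the diagram draws.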