To understand data mining and its results, we first need to understand how the process works.  In particular, we need to distinguish between two things:

  • Mining the data to find patterns and build models
  • Using the results of data mining

Those new to data mining often confuse these two very different processes when studying data science, and managers often confuse them when discussing business analytics.  How the results will be used should influence and inform the data mining process itself, but the two should be kept distinct so that our results are not contaminated.

In the customer churn example that we have been using in our articles on data science, consider the deployment scenario in which the results will be used.  We want to use the model to predict which of our customers will leave AT&T.  Specifically, assume that data mining has created a class probability estimation model M.  Given an existing AT&T customer, described by a set of characteristics, M takes those characteristics as input and produces a score or probability estimate of attrition.  This is the use of the results of data mining.  The data mining itself produces the model M from some other data, often historical.

Diagram illustrating data mining versus the use of data mining results

Data mining versus the use of data mining results. The upper half of this diagram illustrates the mining of historical data to produce a model. Importantly, the historical data has the target (“class”) value specified for every data point. The bottom half shows the results of the data mining in use, where the model is applied to new data for which we do not know the class value. The model predicts both the class value and the probability that the class variable will take on that value.

The diagram above illustrates these two phases.  Data mining produces the probability estimation model (the top half of the diagram).  In the use phase (the bottom half of the diagram), the model is applied to a new, unseen case and generates a probability estimate for that case.
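To make the two phases concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the two features (monthly charge and number of support calls), the customer records, and the function name mine are all hypothetical, and a small hand-written logistic regression stands in for whatever technique actually produced M.

```python
import math

# Hypothetical historical data (an assumption for illustration only).
# Each row: ((monthly_charge_in_hundreds, support_calls), churned),
# where churned is 1 if the customer left and 0 if they stayed.
# The target value is known for every historical row.
historical = [
    ((0.3, 0), 0),
    ((0.4, 1), 0),
    ((0.5, 0), 0),
    ((0.6, 1), 0),
    ((0.7, 4), 1),
    ((0.8, 4), 1),
    ((0.9, 3), 1),
    ((1.0, 5), 1),
]


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def mine(data, epochs=2000, lr=0.1):
    """Phase 1 (mining): fit a logistic regression to labeled data.

    Returns the model M, a function from customer characteristics
    to an estimated probability of attrition.
    """
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in data:
            p = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            error = p - label  # gradient of the log loss w.r.t. the score
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error

    def M(features):
        return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

    return M


# Phase 1: mining the historical data produces the model M.
M = mine(historical)

# Phase 2 (use): apply M to new customers whose outcome is unknown.
new_customers = {"cust_a": (0.35, 0), "cust_b": (0.95, 5)}
scores = {name: M(features) for name, features in new_customers.items()}

for name, p in scores.items():
    print(f"{name}: estimated churn probability {p:.2f}")
```

Note that the labels appear only inside mine. Once M exists, the use phase needs nothing but the characteristics of the new customer, which is why the two phases can and should be kept separate.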

About jvaudio

I have master's degrees in information systems management, project management, and computer science. I have bachelor's degrees in technical management and finance.

I love to learn. I love to write. I love technology. I love math.
