Introduction to Supervised vs. Unsupervised Methods
In our previous article, Transforming Business Problems Into Data Mining Tasks, we outlined the different methods that are used to solve data mining tasks. Our exploration of data science will continue as we will illustrate these principles with examples using clustering, regression, classification, similarity matching, and the rest of the methods used to solve data mining tasks. Before we can decide the best way to approach a problem however, we must first introduce some important distinctions. Let’s take a look at supervised vs. unsupervised methods. In order to do so, we need a sample problem:
We need to think carefully about the question that we are being asked to solve:
How AT&T should choose a set of customers to receive the retention offer in order to best reduce churn within a particular incentive budget?
This might sound deceptively simple, but it is not.
Consider which types of data mining tasks might fit our example churn problem at AT&T. Data scientists will often formulate churn prediction as a problem of finding segments of customers that are more likely or less likely to leave. Initially, you might think that this clearly sounds like a classification problem, or possibly a clustering problem (or, some might make a case for regression), but before we can determine the best approach to solving the problem. Let’s consider two similar questions that we might ask about our customer population:
Do our customers naturally fall into different groups?
In this question, there is no specific purpose or target that has been specified for the grouping. When there is no such target specified, the data mining task is considered unsupervised. What happens in unsupervised methods is that the data mining algorithm searches for patterns and structure among the different variables. The most common unsupervised data mining method is clustering. Let’s compare that to a slightly different question that we could formulate about our population:
Can we find groups of customers who have a high likelihoods of cancelling their service when their contracts expire?
In this question, there is a specific target defined for a specific reason (to take action based upon the likelihood of churn). This is considered a supervised data mining problem. Supervised data mining methods will usually apply the following methodology when building and evaluating a model:
Therefore, the algorithm needs to guard against “memorizing” the training data set and blindly applying all of the patterns that are found int he training data set to the future data.
The adjusted model is itself adjusted in order to minimize the error rate on the validation set. Estimates of model performance for future, unseen data can then be computed by observing various evaluative measures applied to the validation data set.
- First, training set records are stored
- Next, given a new, unclassified record, classification is performed by comparing it to records in the training set that it is most similar to
- k-Nearest Neighbor is typically used for classification, although it is also applicable to estimation and prediction tasks
In our example, we have a training set with 200 patients with Na/K ratio, age, and the drug that they were prescribed. Our task is to classify the type of drug for a new patient, a 35 year old with an Na/K ratio of 29. The scatter plot shows the records of three different patients that are similar to Patient 1.
Which drug should be prescribed to Patient 1? Drug B because all of the points near him are prescribed drug B.
There is an important (albeit subtle) difference between these two questions that you need to understand. If a specific target can be provided then the problem can be understood to be a supervised one. Supervised tasks require different techniques than do unsupervised tasks and the results are typically much more useful. A supervised technique is given a specific purpose for the grouping, and that is to predict the target. Clustering, an unsupervised task, produces groupings that are based on similarities, but there is no guarantee that these similarities are meaningful or that they will be useful for any particular purpose.
Strictly speaking, another condition must be met for supervised data mining – there must be data on the target. It isn’t enough that the target information exists in principle – it must also exist on the data.
Acquiring data on the target is a key data science investment. The value for the target variable for an individual is often called the individual’s label, emphasizing that often (but not always) one must incur expense to actively label the data.
Data Mining Tasks: Supervised vs Unsupervised Methods
Classification, regression, and causal modeling are typically solved with supervised methods. Similarity matching, link prediction, and data reduction could be done with either supervised or unsupervised methods. Clustering, co-occurrence grouping, and profiling are typically addressed using unsupervised methods. The fundamental principles of data mining that I present underlie all these types of techniques.
Two main subclasses of supervised data mining, classification, and regression, are distinguished by the type of target. Regression involves a numeric target whereas classification involves a categorical (and usually binary) target. Consider these similar questions that we might address with supervised data mining:
There are some subtleties amongst these questions that we need to pay attention to. For business applications, we often want a numerical prediction over a categorical target. In our churn example, a basic yes/no prediction of whether or not a customer is likely to continue to subscribe to the service may not be sufficient. It is almost a certainty that we want to model the probability that the customer will continue. This is considered classification modeling rather than regression because our underlying target is categorical (a yes/no rather than a numeric value).
A vital part during the early stages of the data mining process is to:
- Decide whether the line of attack will be supervised or unsupervised
- If supervised, to produce a precise definition of a target variable. This variable must be a specific quantity that will be the focus of the data mining.
- A target variable is not specified
- The algorithm will instead search for patterns and structure among the variables
- Clustering is the most common unsupervised method
- A target variable is specified
- The algorithm learns from the examples by determining which values of the predictor variables are associated with the different values of the target variable