Introduction to Supervised vs. Unsupervised Methods

In our previous article, Transforming Business Problems Into Data Mining Tasks, we outlined the different methods that are used to solve data mining tasks.  We will continue our exploration of data science by illustrating these principles with examples using clustering, regression, classification, similarity matching, and the other methods used to solve data mining tasks.  Before we can decide on the best way to approach a problem, however, we must first introduce some important distinctions.  Let’s take a look at supervised vs. unsupervised methods.  To do so, we need a sample problem:

We need to think carefully about the question that we are being asked to solve:

How should AT&T choose a set of customers to receive the retention offer in order to best reduce churn within a particular incentive budget?

This might sound deceptively simple, but it is not.

Unsupervised Method

Consider which types of data mining tasks might fit our example churn problem at AT&T.  Data scientists will often formulate churn prediction as a problem of finding segments of customers that are more likely or less likely to leave.  Initially, you might think that this clearly sounds like a classification problem, or possibly a clustering problem (some might even make a case for regression).  Before we can determine the best approach to solving the problem, however, let’s consider two similar questions that we might ask about our customer population:

Do our customers naturally fall into different groups?

In this question, there is no specific purpose or target that has been specified for the grouping.  When there is no such target specified, the data mining task is considered unsupervised.  What happens in unsupervised methods is that the data mining algorithm searches for patterns and structure among the different variables.  The most common unsupervised data mining method is clustering.  Let’s compare that to a slightly different question that we could formulate about our population:

Supervised Method

Can we find groups of customers who have a high likelihood of cancelling their service when their contracts expire?

In this question, there is a specific target defined for a specific reason (to take action based upon the likelihood of churn).  This is considered a supervised data mining problem. Supervised data mining methods will usually apply the following methodology when building and evaluating a model:

First
The algorithm is provided with a training set of data which will include the pre-classified values of the target variable in addition to the predictor variables.
Second
A provisional data mining model is then constructed using the training samples provided in the training data set.  Keep in mind, however, that the training data set is necessarily incomplete: it doesn’t include the “new” or future data that the data modelers are really interested in classifying.

Therefore, the algorithm needs to guard against “memorizing” the training data set and blindly applying all of the patterns that are found in the training data set to the future data.

Third
The next step in supervised data mining methodology is to examine how the provisional data mining model performs on a test set of data.  In the test set (a holdout data set), the values of the target variable are hidden temporarily from the provisional model.  The provisional model then performs classification according to the patterns and structure that it learned from the training set.  The efficacy of the classifications is then evaluated by comparing them against the true values of the target variable.
Fourth
The provisional data mining model is then adjusted to minimize the error rate on the test set of data.
Fifth
The adjusted data mining model is then applied to a validation data set (another holdout data set) where the values of the target variable are again temporarily hidden from the model.

The adjusted model is itself adjusted in order to minimize the error rate on the validation set.  Estimates of model performance for future, unseen data can then be computed by observing various evaluative measures applied to the validation data set.
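The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is randomly generated, the single predictor (support calls) and the threshold "model" are hypothetical, and the helper names `error_rate` and `best_threshold` are invented for this example. The set names follow this article's usage; some other texts swap the roles of the "test" and "validation" sets.

```python
import random

def error_rate(threshold, records):
    """Fraction of records where the rule 'calls >= threshold' mispredicts churn."""
    wrong = sum((calls >= threshold) != churned for calls, churned in records)
    return wrong / len(records)

def best_threshold(records, candidates):
    """Pick the candidate threshold with the lowest error on the given records."""
    return min(candidates, key=lambda t: error_rate(t, records))

# Hypothetical labeled data: (support calls last quarter, churned?).
random.seed(0)
data = []
for _ in range(300):
    calls = random.randint(0, 10)
    churned = random.random() < calls / 10  # heavier callers churn more often
    data.append((calls, churned))

# First: partition the labeled data into training, test, and validation sets.
random.shuffle(data)
train, test, validation = data[:180], data[180:240], data[240:]

# Second: build a provisional model from the training set only.
provisional = best_threshold(train, range(0, 12))

# Third: score the provisional model on the test set, where the true
# target values are used only to check the predictions.
test_error = error_rate(provisional, test)

# Fourth: adjust the model to reduce error on the test set
# (here, by trying the neighboring thresholds).
adjusted = best_threshold(test, [provisional - 1, provisional, provisional + 1])

# Fifth: estimate performance on unseen data using the validation set.
validation_error = error_rate(adjusted, validation)
print(provisional, adjusted, test_error, validation_error)
```

Because the validation records played no part in choosing either threshold, `validation_error` is a more honest estimate of future performance than the training or test error.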

 

The k-Nearest Neighbor algorithm is an example of instance-based learning (also called memory-based learning), a family of learning algorithms that compare new problem instances with training instances that have been stored in memory.  It constructs its hypothesis directly from the training instances.

  • First, training set records are stored
  • Next, given a new, unclassified record, classification is performed by comparing it to records in the training set that it is most similar to
  • k-Nearest Neighbor is typically used for classification, although it is also applicable to estimation and prediction tasks

In our example, we have a training set of 200 patients, each with an Na/K ratio, an age, and the drug that they were prescribed.  Our task is to classify the type of drug for a new patient (Patient 1), a 35-year-old with an Na/K ratio of 29.  The scatter plot shows the records of the three patients that are most similar to Patient 1.

Which drug should be prescribed to Patient 1?  Drug B, because all of the nearby patients were prescribed drug B.
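Here is a minimal sketch of that classification in Python. The seven training records are made up for illustration (the article's 200-patient data set is not reproduced here), and the function name `knn_classify` is hypothetical. Only the new patient's values, age 35 and Na/K ratio 29, come from the example above.

```python
import math
from collections import Counter

def knn_classify(training, new_record, k=3):
    """Return the majority label among the k stored records nearest to new_record."""
    # Sort the stored training records by Euclidean distance to the new record.
    nearest = sorted(training, key=lambda rec: math.dist(rec[0], new_record))[:k]
    # Each of the k neighbors votes with its own label.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training records: ((age, Na/K ratio), prescribed drug).
patients = [
    ((33, 30), "B"), ((36, 28), "B"), ((37, 31), "B"),
    ((20, 12), "A"), ((24, 10), "A"),
    ((60, 15), "C"), ((55, 18), "C"),
]

new_patient = (35, 29)  # the 35-year-old with an Na/K ratio of 29
print(knn_classify(patients, new_patient))  # prints B
```

Note that this sketch uses raw, unscaled values. In practice, predictors measured on different scales are usually normalized first so that one variable does not dominate the distance.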

Figure: k-Nearest Neighbor scatter plot

There is an important (albeit subtle) difference between the two customer questions posed earlier (whether our customers naturally fall into groups, and whether we can find groups that are likely to cancel) that you need to understand.  If a specific target can be provided, then the problem can be understood to be a supervised one.  Supervised tasks require different techniques than unsupervised tasks do, and the results are typically much more useful.  A supervised technique is given a specific purpose for the grouping, and that is to predict the target.  Clustering, an unsupervised task, produces groupings that are based on similarities, but there is no guarantee that these similarities are meaningful or that they will be useful for any particular purpose.
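To make the unsupervised side concrete, here is a minimal k-means sketch in plain Python. It is an illustration only: k-means is just one of many clustering algorithms, and the customer features, the data values, and the function name `kmeans` are all hypothetical. Notice that no target variable appears anywhere; the algorithm groups customers purely by similarity.

```python
import math

def kmeans(points, k, iterations=20):
    """Group points into k clusters by similarity; no target variable is involved."""
    # Deterministic start for illustration: use the first k points as centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical customers: (monthly minutes used, monthly bill in dollars).
customers = [(100, 20), (900, 85), (120, 25), (950, 90), (110, 22), (880, 80)]
labels, centers = kmeans(customers, k=2)
print(labels)  # prints [0, 1, 0, 1, 0, 1]
```

The two clusters here separate light users from heavy users, but nothing in the output says whether either group is likely to churn. That is the sense in which the groupings may or may not be useful for a particular purpose.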

Figure: Supervised modeling methodology

Strictly speaking, another condition must be met for supervised data mining: there must be data on the target.  It isn’t enough that the target information exists in principle; it must also exist in the data.

Acquiring data on the target is a key data science investment.  The value for the target variable for an individual is often called the individual’s label, emphasizing that often (but not always) one must incur expense to actively label the data.

Data Mining Tasks: Supervised vs Unsupervised Methods

Classification, regression, and causal modeling are typically solved with supervised methods.  Similarity matching, link prediction, and data reduction could be done with either supervised or unsupervised methods.  Clustering, co-occurrence grouping, and profiling are typically addressed using unsupervised methods.  The fundamental principles of data mining that I present underlie all these types of techniques.

Two main subclasses of supervised data mining, classification and regression, are distinguished by the type of target.  Regression involves a numeric target whereas classification involves a categorical (and usually binary) target.  Consider these similar questions that we might address with supervised data mining:

  1. Will this customer purchase service S1 if given incentive #1?
  2. Which service package (S1, S2, or none) will a customer likely purchase if given incentive #1?
  3. How much will this customer use the service?
  1. This is a classification problem because it has a binary target (either the customer makes a purchase or they do not).
  2. This is also a classification problem, with a three-valued target.
  3. This is a regression problem because it has a numeric target. The target variable is the amount of usage (actual or predicted) per customer.
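The difference in target type can be shown with a small Python sketch that reuses the nearest-neighbor idea for both kinds of question. Everything here is hypothetical: the customer records, the two predictors, and the helper name `nearest` are invented for illustration. The same three neighbors are used twice; only the way their target values are combined changes.

```python
import math
from collections import Counter
from statistics import mean

def nearest(records, query, k=3):
    """Return the k records whose predictor values are closest to the query."""
    return sorted(records, key=lambda r: math.dist(r["x"], query))[:k]

# Hypothetical customers: x = (monthly bill, months as a customer),
# with one categorical target (package) and one numeric target (usage).
records = [
    {"x": (30, 2), "package": "none", "usage": 100},
    {"x": (35, 3), "package": "none", "usage": 120},
    {"x": (60, 12), "package": "S1", "usage": 400},
    {"x": (65, 14), "package": "S1", "usage": 450},
    {"x": (62, 10), "package": "S2", "usage": 500},
    {"x": (90, 30), "package": "S2", "usage": 900},
]

neighbors = nearest(records, (63, 12))

# Classification (question 2): categorical target, so take a majority vote.
predicted_package = Counter(n["package"] for n in neighbors).most_common(1)[0][0]

# Regression (question 3): numeric target, so average the neighbors' values.
predicted_usage = mean(n["usage"] for n in neighbors)

print(predicted_package, predicted_usage)  # prints S1 450
```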

Considerations

There are some subtleties amongst these questions that we need to pay attention to.  For business applications, we often want a numerical prediction over a categorical target.  In our churn example, a basic yes/no prediction of whether or not a customer is likely to continue to subscribe to the service may not be sufficient.  It is almost a certainty that we want to model the probability that the customer will continue.  This is considered classification modeling rather than regression because our underlying target is categorical (a yes/no rather than a numeric value).

Where necessary, for the sake of clarity, this is called class probability estimation.
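As a rough sketch of how class probability estimation connects back to the retention-offer question, the Python below estimates each customer's churn probability as the share of similar past customers who churned, then spends a fixed budget on the highest-probability customers. All of the data, the predictors, the costs, and the name `churn_probability` are hypothetical, and ranking purely by probability is a simplifying assumption rather than a recommended targeting policy.

```python
import math

def churn_probability(history, customer, k=3):
    """Estimate P(churn) as the share of the k most similar past customers who churned."""
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], customer))[:k]
    return sum(churned for _, churned in nearest) / k

# Hypothetical past customers: ((support calls, months left on contract), churned?).
history = [
    ((8, 1), True), ((7, 2), True), ((6, 1), False),
    ((1, 10), False), ((2, 12), False), ((0, 8), False),
]

# Hypothetical current customers to score.
current = {"cust_a": (7, 1), "cust_b": (1, 11), "cust_c": (4, 5)}

# A probability per customer, rather than a bare yes/no prediction.
scores = {name: churn_probability(history, x) for name, x in current.items()}

# Spend the incentive budget on the customers most likely to churn.
budget, offer_cost = 100, 50
ranked = sorted(scores, key=scores.get, reverse=True)
targets = ranked[: budget // offer_cost]
print(targets)  # prints ['cust_a', 'cust_c']
```

A yes/no classifier could not rank customers this way; the probability estimate is what lets a limited budget be prioritized.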

A vital step in the early stages of the data mining process is to:

  • Decide whether the line of attack will be supervised or unsupervised
  • If supervised, produce a precise definition of a target variable.  This variable must be a specific quantity that will be the focus of the data mining.

Summary

Unsupervised Methods

  • A target variable is not specified
  • The algorithm will instead search for patterns and structure among the variables
  • Clustering is the most common unsupervised method

Supervised Methods

  • A target variable is specified
  • The algorithm learns from the examples by determining which values of the predictor variables are associated with the different values of the target variable

About jvaudio

I have master’s degrees in information systems management, project management, and computer science. I have bachelor’s degrees in technical management and finance.

I love to learn. I love to write. I love technology. I love math.
