Turning business problems into data mining tasks

Every data-driven business problem is unique, with its own combination of goals and constraints.  As in any engineering field, though, there are sets of common tasks that are performed to address these types of business problems.  Data scientists work with the business stakeholders to break the business problem down into smaller sub-tasks.  The solutions to these sub-tasks together solve the larger business problem at hand.  As data has become more plentiful and the field of data science has exploded, we have come to learn a lot about solving common data mining tasks, both scientifically and practically.

A critical skill for data scientists is the ability to decompose the larger business problem into a group of sub-tasks such that each sub-task can be addressed with the tools available.  Over time, the data scientist will learn to recognize common problems, which leaves more time for solving the aspects that are unique to a particular business problem.

The data science industry has come up with many different algorithms to address business problems, but they all seek to address the same underlying problems.  In the text below, I will use the term “individual” to refer to any entity about which we have data (e.g. a consumer or a business).  Most business analytics projects seek to find correlations between a particular variable describing an individual and other variables.  For example, by looking at historical data, we could determine the number of customers that entered our store on Black Friday after a new type of advertising model was introduced.  We might then wish to know how this compares to an alternative model.

What follows are the most common types of tasks that can be used to solve any business problem.  Remember, the goal of the data scientist is to break the business problem down into sub-tasks that can be solved by using one of these methods.

Methods to Solve Data Mining Tasks:

Classification and Class Probability Estimation

Classification and class probability estimation attempt to predict, for each individual within a population, which of a small set of classes that particular individual belongs to.  Usually, the classes are mutually exclusive.  For example, if we want to know which customers are likely to respond to a given offer, there are two classes:  will respond and will not respond.

For a classification task, a data mining procedure produces a model that, when given a new individual, determines which class this individual belongs to.  Closely related to classification is scoring or class probability estimation.  When you apply a scoring model to an individual, rather than a class prediction, it returns a score representing the probability that that particular individual belongs to each class.  In our example above, a scoring model would be used to evaluate each customer and provide a score that tells us how likely each customer is to respond to the offer.  Classification and scoring are very similar and with some rudimentary tweaks, one model can be used interchangeably with the other.
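To make the relationship between scoring and classification concrete, here is a minimal sketch.  The customer names, the scores, and the 0.5 threshold are all made up for illustration; in practice the scores would come from whatever scoring model you have trained.

```python
def classify(score: float, threshold: float = 0.5) -> str:
    """Turn a class-probability score into a class label.

    The 0.5 threshold is an assumption; a business would normally
    choose it based on the costs of each kind of mistake.
    """
    return "will respond" if score >= threshold else "will not respond"


# Hypothetical output of a scoring model: customer -> P(will respond)
scores = {"customer_a": 0.82, "customer_b": 0.35, "customer_c": 0.50}

# Thresholding the scores gives a classification model for free.
labels = {customer: classify(score) for customer, score in scores.items()}
print(labels)
```

Going the other way (from a classifier to a scorer) depends on the model, but many classifiers already compute a score internally before they pick a class.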

The quintessential type of learning problem in machine learning is binary classification.  If you are given a sample of instances, each one labeled either “positive” or “negative”, the aim is to learn to predict the correct label for previously unseen (or future) instances.  A common example of this is your email provider predicting whether or not an email message is spam.  You can state a binary classification problem as an optimization:  find a function that minimizes the average number of misclassifications on new instances drawn from the distribution that generated your sample.  You can also think of it this way:  if we have to pay $1 every time that we predict a positive instance as a negative instance or a negative instance as a positive instance, then we will want to find a predictor that minimizes our expected loss.

Formally, we will say an instance x is positive if it has associated label y=1 and negative if its label y=0. We then define the 0–1 misclassification loss for a binary prediction p when the label is y to be

\[
\ell_{01}(y, p) =
\begin{cases}
0 & \text{if } p = y \\
1 & \text{if } p \neq y
\end{cases}
\]

Now suppose that an instance x has a positive label with probability η and we have made a prediction p. For that x the point-wise risk is

\[
L_{01}(\eta, p) = \eta \, \ell_{01}(1, p) + (1 - \eta) \, \ell_{01}(0, p)
\]

The first term is the 0–1 loss \ell_{01} of a prediction p on a positive example, which occurs with probability η, and the second term is the loss on a negative example, which occurs with probability 1 – η.

If we consider the spam email example mentioned earlier, suppose that with a probability of 0.95 a randomly chosen recipient says that a particular email is a spam message.  A prediction of “spam” for that email will incur an average loss of 0.95 x 0 + 0.05 x 1 = 0.05 whereas a prediction of “not spam” will incur a loss of 0.95.
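The arithmetic in the spam example can be checked with a few lines of Python.  This is only a sketch of the point-wise risk under the 0–1 loss; the function names are my own.

```python
def zero_one_loss(y: int, p: int) -> int:
    """0-1 misclassification loss: 0 for a correct prediction, 1 otherwise."""
    return 0 if y == p else 1


def pointwise_risk(eta: float, p: int) -> float:
    """Expected 0-1 loss of prediction p when the label is positive with probability eta."""
    return eta * zero_one_loss(1, p) + (1 - eta) * zero_one_loss(0, p)


eta = 0.95  # probability that a recipient calls the email spam
risk_spam = pointwise_risk(eta, 1)      # predict "spam" (label 1)
risk_not_spam = pointwise_risk(eta, 0)  # predict "not spam" (label 0)

print(round(risk_spam, 2), round(risk_not_spam, 2))
```

Predicting “spam” is the better choice here because it has the smaller point-wise risk.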

Let us now suppose that instead of merely predicting the correct label, we wanted to know the probability that an email would be considered spam.  This would be a related type of problem called binary class probability estimation.  Since the predictions here are probabilities rather than concrete predictions, there isn’t a sensible notion of a misclassification.  How can a prediction that an email message might be spam with a probability of 0.95 be wrong?  If it really isn’t spam, then it may just be one of the 5% of cases that are consistent with the probability estimate.

What we really want is a penalty with an expected value that is minimized if our probability estimates are consistent with the true probability of a positive label for a given instance.  This typical requirement on the loss for a probability estimation problem is known as Fisher consistency.

If \ell(y, p) is a loss for probability estimation, then the above requirement can be framed in terms of its associated point-wise risk \( L(\eta, p) = \eta \, \ell(1, p) + (1 - \eta) \, \ell(0, p) \). Stated formally, Fisher consistency says that no matter what true probability η we have,

\[
L(\eta, \eta) = \min_{p \in [0, 1]} L(\eta, p)
\]

That is, predicting p = η will always achieve the smallest possible point-wise risk.

We will call losses that have this Fisher consistency proper losses in line with the terminology of proper scoring rules used when probability elicitation is studied in economics.
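As a rough numerical illustration of Fisher consistency, the sketch below compares two candidate losses that I have picked for the example: the squared loss (y – p)², which is a standard proper loss, and the absolute loss |y – p|, which is not proper.  It searches a grid of predictions for the one with the smallest point-wise risk.  The true probability of 0.7 and the grid resolution are arbitrary assumptions.

```python
def pointwise_risk(loss, eta, p):
    """Expected loss of predicting p when the label is positive with probability eta."""
    return eta * loss(1, p) + (1 - eta) * loss(0, p)


def squared_loss(y, p):
    return (y - p) ** 2


def absolute_loss(y, p):
    return abs(y - p)


def best_prediction(loss, eta, steps=100):
    """Grid search over p in [0, 1] for the prediction with the smallest risk."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda p: pointwise_risk(loss, eta, p))


eta = 0.7  # assumed true probability of a positive label
print(best_prediction(squared_loss, eta))   # the minimizer matches eta
print(best_prediction(absolute_loss, eta))  # the minimizer is pushed to an extreme
```

With the squared loss, the risk is smallest when we report the true probability.  With the absolute loss, we are rewarded for exaggerating to 1.0, which is exactly the behavior a proper loss rules out.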

Regression (also known as Value Estimation)

Regression attempts to estimate or predict, for each individual, the numerical value of some variable for that particular individual.

For example, if we ask how much a given customer will use a service, the property (or variable) to be predicted is service usage.  A model could be generated by looking at other, similar individuals in the population and their historical usage.  A regression procedure produces a model that, given a particular individual, estimates the value of the particular variable that is specific to that individual.

Regression is related to classification, but they are two different tasks.  It is best to think of it this way:  classification predicts whether something will happen, whereas regression predicts how much something will happen.  To help you understand the concept, let’s consider the simplest form of regression, a linear, bivariate regression.  This type of regression describes a fixed linear relationship between two (and only two, hence the bi in bivariate) variables.  Suppose that you are wondering if there is a connection between the time that students spend doing their math homework and the grades that they receive.  You can plot the data on a graph where the x-axis is the average number of hours that a student spends on homework per week and the y-axis represents the exam grades of the students (on a scale of 1-100, say).  You plot each data point on the graph, and the data points will typically be scattered somewhat throughout.  Regression analysis finds the single line through these points that best summarizes their distribution.

Linear regression graph
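The line in a bivariate regression is usually fitted by ordinary least squares.  Here is a minimal sketch in plain Python; the hours and grades are invented numbers, not real student data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: weekly homework hours and exam grades for five students.
hours = [1, 2, 3, 4, 5]
grades = [55, 60, 70, 75, 85]

slope, intercept = fit_line(hours, grades)
predicted_grade = slope * 6 + intercept  # estimate for a student doing 6 hours
print(slope, intercept, predicted_grade)
```

The slope answers the “how much” question:  in this made-up data, each extra hour of homework per week goes along with a higher expected grade.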

Similarity Matching

Similarity matching attempts to identify (you guessed it) similar individuals based upon data known about them.  Similarity matching can be used directly to find similar entities.

For example, a company looking for businesses that resemble its best customers can use similarity matching based upon “firmographic” data that describes the characteristics of the companies.  Similarity matching is the basis for one of the most popular methods for making product recommendations (finding people that are similar to you in terms of the products that they like, e.g. Amazon’s recommendations).  Similarity measures also underlie certain solutions to other data mining tasks such as classification, regression, and clustering.
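A simple way to sketch similarity matching is to describe each entity as a vector of numbers and rank the candidates by distance.  The company names and the three features below are hypothetical, and I am assuming the features have already been scaled to the range 0 to 1 so that no single feature dominates the distance.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical firmographic features, already scaled to [0, 1]:
# (company size, annual revenue, share of sales made online)
companies = {
    "acme": (0.9, 0.8, 0.1),
    "globex": (0.2, 0.3, 0.9),
    "initech": (0.7, 0.6, 0.3),
}
best_customer = (0.85, 0.75, 0.15)

# Rank the companies from most similar to least similar.
ranked = sorted(
    companies,
    key=lambda name: euclidean_distance(companies[name], best_customer),
)
print(ranked)
```

Euclidean distance is only one choice; other measures, such as cosine similarity, may suit other kinds of data better.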

Clustering or Cluster Analysis

Clustering attempts to group individuals in a population together by their similarity to one another, but not driven by any specific purpose.  Clustering divides data into groups (clusters) that are meaningful, useful, or both.  Cluster analysis groups these data objects based only upon the information found in the data that describes the objects and their relationships.  The goal of cluster analysis is that the objects within a group be similar or related to one another and dissimilar or unrelated to the objects in other groups.  The greater the homogeneity within a group and the greater the differences between groups, the better and more distinct the clustering.

Cluster analysis graph using DBSCAN, which assumes clusters of a similar density

Clustering is useful in preliminary domain exploration to see which natural groups exist because these groups in turn may suggest other data mining tasks or approaches for analysis.  Clustering is also used as input to decision-making processes focusing on questions such as:  “What products should we offer our customers?  How should our customer service staff be structured?”

Cluster analysis is related to other techniques that are used to divide data objects into groups.  For example, clustering can be viewed as a form of classification in that it creates a labeling of objects with class labels.  Cluster analysis, however, derives these labels only from the data.  Classification, in contrast, is supervised:  new and unlabeled objects are assigned a class label using a model that is developed from objects with already known class labels.  This is why you will sometimes hear clustering referred to as unsupervised classification.

In data mining, if you hear the term classification used without any qualifications, it typically refers to supervised classification.
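To show what “deriving labels only from data” looks like, here is a minimal sketch of k-means, one of the simplest clustering algorithms (note that this is not DBSCAN).  The points and the starting centroids are invented, and I fix the starting centroids by hand so the result is repeatable; real implementations usually pick them at random.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids


# Invented 2-D data with two obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
clusters, centroids = kmeans(points, centroids=[(1, 1), (9, 9.5)])
print(clusters)
```

No class labels were supplied anywhere; the two groups come purely from how close the points are to each other.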

Co-occurrence Grouping

Co-occurrence grouping (also known as frequent itemset mining, association rule discovery, and market-basket analysis) attempts to find associations between entities based upon the transactions involving them.

While clustering looks at similarity between objects based upon the objects’ attributes, co-occurrence grouping considers similarity of objects based upon their appearing together in transactions.  For example, a store might uncover that Doritos are purchased together with Pepsi much more frequently than we might expect.  Decision makers at the store could then take action based upon this insight, such as creating a special promotion, product display, or a combination offer.  Co-occurrence of products in purchases is a common type of grouping known as market-basket analysis.  Some recommendation systems also perform a type of affinity grouping by finding, for example, books that are frequently purchased by the same people.

The result of co-occurrence grouping is a description of items that occur together.  These descriptions usually include statistics on the frequency of the co-occurrence and an estimate of how surprising it is.
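Two statistics that are commonly reported are support (how often the items appear together) and lift (how much more often they appear together than we would expect if they were independent).  The sketch below computes both on a tiny, invented set of shopping baskets.

```python
# Invented transactions: each set is one shopping basket.
transactions = [
    {"doritos", "pepsi"},
    {"doritos", "pepsi", "salsa"},
    {"pepsi"},
    {"doritos", "pepsi"},
    {"bread", "milk"},
    {"bread"},
    {"milk", "salsa"},
    {"bread", "milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def lift(a, b):
    """Observed co-occurrence divided by the co-occurrence expected under independence."""
    return support({a, b}) / (support({a}) * support({b}))


print(support({"doritos", "pepsi"}))
print(lift("doritos", "pepsi"))
```

A lift well above 1 is the “surprising” part:  the two items show up together more often than their individual popularity would predict.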

Co-occurrence network diagram for school and technology


Profiling

Profiling (also known as behavior description) attempts to characterize the typical behavior of an individual, group, or population.

Behavior may not have a simple description.  Profiling cell phone usage, for example, might require a complex description of night and weekend airtime averages, international usage, roaming charges, text messages, etc.  Behavior can be described generally over an entire population, or down to the level of small groups or even individuals.

Profiling is often used to establish behavioral norms for anomaly detection applications such as fraud detection and monitoring for intrusions into computer systems.  For example, if we know what kind of purchases a person typically makes on a credit card, we can determine whether a new charge on the card fits that profile or not.  We can use the degree of mismatch as a suspicion score and issue an alarm if it is too high.
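One very simple way to turn a profile into a suspicion score is to measure how many standard deviations a new charge is from the cardholder’s usual spending.  This is only a sketch under that assumption; the purchase history and the alarm threshold of 3 are invented, and real fraud systems use far richer profiles.

```python
import statistics


def suspicion_score(history, new_amount):
    """How many standard deviations the new charge is from the typical charge."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(new_amount - mean) / stdev


# Invented profile: past charge amounts on one credit card.
history = [20.0, 35.0, 25.0, 30.0, 40.0]
ALARM_THRESHOLD = 3.0  # assumed cut-off for raising an alarm

for charge in (32.0, 500.0):
    score = suspicion_score(history, charge)
    status = "ALARM" if score > ALARM_THRESHOLD else "ok"
    print(charge, round(score, 2), status)
```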

The profiling process consists of several steps (from Wikipedia):

  • Preliminary Grounding: the profiling process starts with a specification of the applicable problem domain and the identification of the goals of analysis
  • Data Collection: the target dataset for analysis is formed by selecting the relevant data in light of existing domain knowledge and data understanding
  • Data Preparation: the data is preprocessed to remove noise and reduce complexity by eliminating attributes
  • Data Mining: the data is analyzed with the algorithm or heuristics developed to suit the data, model, and goals
  • Interpretation: the mined patterns are evaluated on their relevance and validity by specialists in the application domain
  • Application: the constructed profiles are applied (e.g. to categories of persons) to test and fine-tune the algorithms
  • Institutional Decision: the institution decides what actions or policies to apply to groups or individuals whose data matches a relevant profile

Link Prediction

Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link.  Link prediction is common in social networking systems.

Link prediction can also estimate the strength of a link.  For example, for recommending movies to customers one can think of a graph between customers and the movies they’ve watched or rated.  Within the graph, we search for links that do not exist between customers and movies, but that we predict should exist and should be strong.  These links form the basis for these recommendations.

Link prediction model
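A classic baseline for link prediction in a social network is to count common neighbors:  the more friends two people share, the stronger the predicted link between them.  The friendship graph below is invented, and common-neighbor counting is just one of many possible scoring rules.

```python
from itertools import combinations

# Invented friendship graph: person -> set of current friends.
friends = {
    "ana": {"ben", "cal", "dee"},
    "ben": {"ana", "cal"},
    "cal": {"ana", "ben", "dee"},
    "dee": {"ana", "cal", "eli"},
    "eli": {"dee"},
}


def predict_links(graph):
    """Score every pair that is not yet linked by its number of shared friends."""
    scores = {}
    for a, b in combinations(sorted(graph), 2):
        if b not in graph[a]:
            shared = len(graph[a] & graph[b])
            if shared:
                scores[(a, b)] = shared
    # Strongest predicted links first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(predict_links(friends))
```

The same idea carries over to the customer and movie graph:  the score plays the role of the predicted strength of a link that does not exist yet.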

Data Reduction

Data reduction attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information of the larger set.  The smaller dataset may be easier to deal with or to process.  Moreover, the smaller dataset may better reveal the information.

Data reduction usually involves loss of information – what is important is the trade-off for improved insight.  One of the primary tasks in any data reduction effort is the organization of all of the data that has been collected for a particular purpose.  A basic example of this is data de-duplication, but there are many modern software packages available that can churn through many millions of data points very quickly using industry best practices.
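De-duplication is easy to sketch in a few lines.  The records below are invented, and the second step is an assumed example of a lossy reduction, where we trade detail (the individual rows) for a smaller summary.

```python
from collections import Counter

# Invented customer records: (email, city)
records = [
    ("alice@example.com", "Boston"),
    ("bob@example.com", "Denver"),
    ("alice@example.com", "Boston"),
    ("carol@example.com", "Boston"),
    ("bob@example.com", "Denver"),
]

# Lossless reduction: drop exact duplicates while keeping the original order.
deduplicated = list(dict.fromkeys(records))

# Lossy reduction: keep only the number of distinct customers per city.
customers_per_city = Counter(city for _, city in deduplicated)

print(deduplicated)
print(customers_per_city)
```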

Causal Modeling

Causal modeling attempts to help us understand what events or actions actually influence others.  It is also commonly known as structural modeling, path modeling, and analysis of covariance structures.

Techniques for causal modeling include those involving a substantial investment in data, such as randomized controlled experiments (e.g. A/B tests), as well as sophisticated methods for drawing causal conclusions from observational data.  Both experimental and observational methods for causal modeling can generally be viewed as “counterfactual” analysis:  they attempt to understand what the difference would be between the situations, which cannot both happen, where the “treatment” event (e.g. showing an advertisement to a particular individual) were to happen and were not to happen.
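In a randomized A/B test, the counterfactual comparison is approximated by comparing two groups.  The sketch below estimates the effect of showing an advertisement as the difference in purchase rates.  The outcomes are invented, and it assumes individuals were assigned to the groups at random; it also skips the significance testing that a real analysis would need.

```python
def conversion_rate(outcomes):
    """Fraction of individuals who purchased (1 = purchased, 0 = did not)."""
    return sum(outcomes) / len(outcomes)


# Invented outcomes for randomly assigned groups.
saw_ad = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # treatment group
no_ad = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]   # control group

# Estimated average effect of the treatment on the purchase rate.
estimated_effect = conversion_rate(saw_ad) - conversion_rate(no_ad)
print(round(estimated_effect, 2))
```

No single individual is ever observed both with and without the advertisement, so the control group stands in for the situation that did not happen.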

It is vitally important that the data scientist states the exact assumptions that must hold for the causal conclusion to be valid.  When undertaking causal modeling, a person or business needs to weigh the trade-off of increasing investment to reduce the assumptions being made, versus deciding that the conclusions are good enough given the assumptions.  Even in the most carefully controlled and randomized experiment, an assumption can be made that renders the causal conclusions invalid.  A great example of this is the placebo effect in medicine, which illustrates the notorious situation where an assumption was overlooked in carefully designed randomized experimentation.

Practically every set of attributes on a questionnaire has some degree of correlation between the individual attributes.  Typically, several of them are at least moderately correlated.  There are statistical procedures, such as factor analysis, that are used for dealing with correlated independent variables, although these correlated attributes are commonly used as inputs to the regression model.

Let’s suppose that a model of soda ratings, created using standard regression, showed that both the sweetness and the calorie count were related to the overall rating of the soda.  Then the regression coefficients would indicate the impact, “all things being equal”, that changing the perceived sweetness level would have on the overall acceptance and perception.  However, since the sweetness level and the number of calories are correlated, all things are NOT equal, and there is bias in the model.

The potential “model specification error” is harder to deal with.  Regression assumes that the model (the equation it was asked to solve) is an accurate representation of the problem or the system that is being studied, and that nothing was wrongly added or left out.  In our soda example above, if the brands are identified to the respondents, then the image of the brands will have a significant impact on their ratings.  People have an entirely different perception of RC Cola than they do of Coca-Cola.  This is why blind, identified, and misidentified ratings have vastly different outcomes.

Using typical regression modeling, we can add some image attributes to the model, but this doesn’t solve the problem.  The model would probably still be misspecified, because it is almost impossible to capture every nuance of a product’s image and performance.  Inevitably, some parts of these are left out or are otherwise impossible to quantify.  A more accurate way to specify the model would be to conclude that there are a series of performance attributes that drive the overall product image and that these, in turn, drive the overall rating of the product.

Measuring the image and overall performance of a product cannot be done directly, but it can be derived from a series of indicators.  Causal modeling will derive the measures (called ‘unobserved exogenous variables’), and parcel out the impact of each on the overall rating of the product.  And since image actually has an impact on taste, the direct effect of image, and the indirect effect of image (through its impact on product performance) on the overall rating of the product can be computed.  Furthermore, if taste in turn has an impact on image, that effect can also be quantified.  This would look like the diagram below:

The flow of causality

The arrows in this diagram represent the flow of causality – or the effect if you will – in the model.  These indicate that there is a statistically significant relationship between the variables.  Sometimes, the path coefficients (the regression coefficients) are included on the arrows to indicate the impact that one variable has on the next variable.
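In a linear path model without feedback loops, the usual rule is that an indirect effect is the product of the path coefficients along the route, and the total effect is the direct effect plus the indirect effects.  The sketch below applies that rule to three made-up coefficients; they are not estimates from any real soda study.

```python
# Hypothetical standardized path coefficients (invented for illustration).
image_to_rating = 0.4        # direct path: image -> overall rating
image_to_performance = 0.5   # image -> perceived performance (taste)
performance_to_rating = 0.6  # perceived performance -> overall rating

# Indirect effect: multiply the coefficients along image -> performance -> rating.
indirect_effect = image_to_performance * performance_to_rating

# Total effect of image on the overall rating.
total_effect = image_to_rating + indirect_effect

print(round(indirect_effect, 2), round(total_effect, 2))
```

If taste also feeds back into image, the model contains a loop and this simple sum no longer applies as written.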

About jvaudio

I have master’s degrees in information systems management, project management, and computer science. I have bachelor’s degrees in technical management and finance.

I love to learn. I love to write. I love technology. I love math.
