Data mining should be thought of as a craft. It involves the application of a substantial amount of science and technology, but applying it well still calls for the judgment of an artisan. Just as a sculptor has several different tools at their disposal, so does the data scientist, and the parallel lies in knowing which tool to use, and when, to achieve the desired result. As with sculpting or any other mature art form, there is a well-understood process that places a structure around the problem, allowing for a reasonable degree of consistency, repeatability, and objectivity. As I mentioned in my article What is Data Science, the data mining process is codified by the Cross Industry Standard Process for Data Mining (CRISP-DM), illustrated below:
This process diagram makes explicit the fact that iteration is the rule rather than the exception. Going through the process once without having solved the problem is, generally speaking, not a failure. Oftentimes, the entire process is an exploration of the data, and after the first iteration the data scientist knows much more than they did previously. The next iteration can be much better informed. Let’s take a look at each of these steps in detail so that we understand the process, and then I will discuss an alternative approach from Enda Ridge that I feel represents reality much better: the Guerrilla Analytics Workflow (cool name, huh?).
Business projects seldom come pre-packaged as clear and unambiguous data mining problems; there is frequently a disconnect and a failure to understand the problem at hand. Often, recasting the problem and designing a solution is an iterative process of discovery – both of the problem itself and of the data. The CRISP-DM diagram above makes it clear that this process is a series of cycles within a cycle rather than a linear process. The initial problem formulation may not be complete or optimal, so multiple iterations may be necessary for an acceptable solution formulation to appear.
The business understanding phase represents a part of the craft where the analyst’s creativity plays a large role. Data science has some things to say, but oftentimes the key to success is a creative problem formulation by an analyst regarding how to cast the business problem as one or more data science problems. This is why it is so important for an analyst to have a high-level understanding of the fundamentals. We have a set of powerful tools to solve particular data mining problems (which I have outlined before). The early stages of the process involve designing a solution that takes advantage of these tools. This can mean structuring the problem so that one or more sub-problems involve building models for classification, regression, probability estimation, etc.
In the first phase, the design team should carefully think about the problem to be solved and about the use case. This is one of the most important fundamental aspects of data science – the careful consideration of the problem.
What exactly do we want to do? How exactly would we do it? What parts of this use case constitute possible data mining models?
As we discuss these questions in more detail, we will begin with a simplified view of the use case, but as we go forward, we will loop back and realize that the use scenario often must be adjusted to better reflect the actual business need. We will also present conceptual tools to help us frame a business problem in terms of expected value. This framing allows us to systematically decompose the problem into data mining tasks, and it is a crucial requirement for many analysts when trying to get approval for a project.
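To make the expected-value framing a little more concrete, here is a minimal sketch that weights the value of each possible outcome by its probability. The function name `expected_value` and every number in the usage example are assumptions made up for illustration; they do not come from a real project.

```python
def expected_value(outcomes):
    """Return the expected value of a decision.

    `outcomes` is a list of (probability, value) pairs. The
    probabilities are assumed to cover every outcome and sum to 1.
    """
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * v for p, v in outcomes)


# Hypothetical offer: 5% of targeted customers respond, a response is
# worth 100 in profit, and contacting a customer costs 1 either way.
ev_target = expected_value([(0.05, 100 - 1), (0.95, -1)])
```

Under these made-up numbers, targeting a customer is worthwhile only if `ev_target` is greater than zero. The probability is the piece a data mining model would estimate, while the values come from the business.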
We need to think of data as comprising the available raw material from which we can build the solution to our business problem. It is important for us to understand the strengths and limitations of the data because it is rare that there is an exact match with the problem that we are tasked with solving. Historical data often is collected for purposes unrelated to the current business problem, or, in some instances, for no explicit purpose at all. A customer database, a transaction database, and a marketing response database contain different information, may cover different, intersecting populations, and are likely to have varying degrees of reliability.
It is also quite common for the costs of data to vary. Some data is available for free (both in terms of financial cost and effort to obtain), while other data can cost a considerable amount. Some data can be purchased, whereas other data simply won’t exist and will require entire ancillary projects to arrange its collection. A crucial part of the data understanding phase is estimating the costs and benefits of each data source and deciding whether or not further investment is warranted. For example, even after all datasets have been acquired, collating them may require additional effort.
As data understanding progresses, solution paths may change direction in response and team efforts may fork from each other. Fraud detection provides a good example of this. Data mining has been used extensively for fraud detection and many fraud detection problems involve classic supervised data mining tasks. Consider the task of catching credit card fraud (the most oft-cited example): Charges show up on each customer’s account, so fraudulent charges are usually caught – if not initially by the company, then later on by the customer when they review their account activity or statement. We can assume that nearly all fraud is identified and reliably labeled, since the legitimate customer and the person perpetrating the fraud are different people and have opposite goals. Thus credit card transactions have reliable labels (fraud and legitimate) that may serve as targets for a supervised technique.
Let’s now consider the related problem of catching Medicare fraud. This is a huge problem in the US that costs billions of dollars each year. Although this might seem like a conventional fraud detection problem, as we consider the relationship of the business problem to the data, we quickly realize that the problem is significantly different. The perpetrators of fraud – the medical providers who submit false claims (and sometimes their patients) – are also legitimate users; there is no separate disinterested party who will declare exactly what the “correct” charges should be. Consequently, the Medicare billing data will not have a reliable target variable indicating fraud, and a supervised learning approach that could work for credit card fraud is not applicable. Such a problem will usually require unsupervised approaches such as profiling, clustering, anomaly detection, and co-occurrence grouping.
The fact that both of these are fraud detection problems is a superficial similarity that is actually misleading. In data understanding, we need to dig beneath the surface to uncover the structure of the business problem and the data that are available to us, and then match them to one or more data mining tasks for which we have substantial science and technology to apply. It is not unusual for a business problem to contain several data mining tasks, oftentimes of different types, and the combination of their solutions will be necessary.
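When there are no reliable labels, as in the Medicare case, one of the simplest unsupervised options is to flag records that sit far from their peers. The sketch below uses a modified z-score built on the median absolute deviation. The provider names, claim amounts, and the helper `flag_anomalies` are all hypothetical, and the 0.6745 constant and 3.5 cutoff are conventional rule-of-thumb choices rather than anything prescribed here.

```python
from statistics import median


def flag_anomalies(amount_by_provider, threshold=3.5):
    """Return providers whose amount is far from the peer median.

    Uses the modified z-score based on the median absolute deviation
    (MAD), which is less distorted by the outliers we are looking for
    than a mean/standard-deviation score would be.
    """
    values = list(amount_by_provider.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the peers are identical, so the score is undefined.
        return []
    return sorted(
        provider
        for provider, v in amount_by_provider.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Made-up average claim amounts per provider, for illustration only.
claims = {"A": 110, "B": 95, "C": 102, "D": 99, "E": 480, "F": 105}
suspects = flag_anomalies(claims)
```

A flagged provider is only a candidate for review, not a confirmed case of fraud, which is exactly why this kind of problem is harder than the credit card one.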
The analytic tools and technologies that we can bring to bear are powerful, but they impose certain requirements on the data they use. They often require data to be in a form different from how the data is provided naturally, and some conversion will be necessary. Therefore, a data preparation phase often proceeds along with data understanding, in which the data is manipulated and converted into forms that yield better results.
Common examples of data preparation are converting data into a tabular format, removing or inferring missing values, and converting data to different types. Some data mining techniques are designed for symbolic and categorical data, while others handle only numeric values. In addition, numerical values must often be normalized or scaled so that they are comparable. Standard techniques and rules of thumb are available for doing such conversions.
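The three preparation steps named above (filling in missing values, converting categorical data, and scaling numeric data) can be sketched in a few lines of plain Python. The helper names and the tiny columns are assumptions for illustration; on a real project you would more likely reach for a data-frame library.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale numeric values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column carries no signal
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Turn categorical values into 0/1 indicator columns."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


ages = [20, None, 40, 60]                  # hypothetical numeric column
plans = ["basic", "pro", "basic", "free"]  # hypothetical categorical column

ages_ready = min_max_scale(impute_mean(ages))
plans_ready = one_hot(plans)
```

Mean imputation and min-max scaling are just two of the available rules of thumb; which one is appropriate depends on the technique that will consume the data.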
One important concern during data preparation is to beware of leaks. A leak is a situation where a variable collected in historical data gives information on the target variable – information that appears in historical data but is not actually available when the decision has to be made.
Consider an example: when predicting whether a customer will be a “big spender”, the categories of the items purchased (or worse, the amount of tax paid) are very predictive, but they are not known at decision-making time. Leakage must be considered carefully during data preparation, because data preparation is typically performed after the fact, from historical data.
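One simple guard against leaks is to record, for every feature, whether it is actually known at the moment the decision is made, and to drop everything else before modeling. The sketch below does that for the “big spender” example. The feature names, the availability table, and the helper `drop_leaky_features` are all hypothetical.

```python
# Hypothetical features for the "big spender" example. Each one is
# tagged with whether it is known at the moment the decision is made.
AVAILABLE_AT_DECISION_TIME = {
    "customer_age": True,
    "past_visits": True,
    "item_categories": False,  # only known once the purchase happens
    "tax_paid": False,         # a direct function of the amount spent
}


def drop_leaky_features(row, availability=AVAILABLE_AT_DECISION_TIME):
    """Keep only features that are known when the decision is made.

    Features missing from `availability` are dropped as well, so an
    unreviewed column cannot slip into the training data.
    """
    return {k: v for k, v in row.items() if availability.get(k, False)}


historical_row = {
    "customer_age": 34,
    "past_visits": 7,
    "item_categories": ["electronics"],
    "tax_paid": 41.20,
}
training_row = drop_leaky_features(historical_row)
```

Defaulting unknown columns to “not available” is a deliberately conservative choice: it forces someone to think about each new field before it can reach the model.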
The output of modeling is some form of model or pattern that captures regularities in the data. The modeling stage is the primary place where data mining techniques are applied to the data. It is important to have some understanding of the fundamental ideas of data mining, including the sorts of techniques and algorithms that exist, because this is the part of the craft where most science and technology can be brought to bear. We will be covering modeling extensively in the near future.
The purpose of the evaluation phase is to assess the data mining results rigorously and to gain confidence that they are valid and reliable before moving on. If we look hard enough at any data set we will find patterns, but they may not survive careful scrutiny and analysis. We would like to have confidence that the models and patterns extracted from the data are true regularities and not just idiosyncrasies or sample anomalies. While it is possible to deploy results immediately after data mining, this is not advisable; it is usually far easier, cheaper, faster, and safer to test a model first in a controlled setting.
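A held-out test is one common controlled setting for checking whether a pattern is real. The sketch below builds synthetic data in which the labels are coin flips, so there is nothing to learn, and then “fits” a deliberately overfit model that simply memorizes its training rows. All names and the data are assumptions made up for the illustration.

```python
import random


def split_holdout(rows, holdout_fraction=0.3, seed=0):
    """Shuffle rows reproducibly and split into (train, holdout)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_holdout = round(len(rows) * holdout_fraction)
    return rows[n_holdout:], rows[:n_holdout]


def fit_memorizer(train):
    """A deliberately overfit 'model': memorize every training label."""
    lookup = dict(train)
    labels = [label for _, label in train]
    fallback = max(set(labels), key=labels.count)  # majority class
    return lambda key: lookup.get(key, fallback)


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(key) == label for key, label in rows) / len(rows)


# Synthetic data with NO real pattern: labels are coin flips.
rng = random.Random(42)
data = [(i, rng.choice([0, 1])) for i in range(1000)]

train, holdout = split_holdout(data)
model = fit_memorizer(train)
train_acc = accuracy(model, train)      # perfect, by construction
holdout_acc = accuracy(model, holdout)  # expected to be near chance
```

The model looks flawless on the data it was built from and no better than guessing on data it has not seen, which is the kind of idiosyncrasy the evaluation phase is meant to expose.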
Of equal importance, the evaluation phase also serves to help ensure that the model satisfies the original business goals. Recall that the primary goal of data science for business is to support decision-making, and that we started the process by focusing on the business problem that we would like to solve. Usually, a data mining solution is only a piece of the larger solution, and it needs to be evaluated as such. Furthermore, even if a model passes strict evaluation tests, there may be external considerations that make it impractical.
Evaluating the results of data mining includes both quantitative and qualitative assessments. Various stakeholders have interests in the business decision-making that will be accomplished or supported by the resultant models. In many cases, these stakeholders need to sign off on the deployment of the models, and in order to do so, need to be satisfied with the quality of the model’s decisions. What that means varies from application to application, but often stakeholders are looking to see whether the model is going to do more good than harm, and especially that the model is unlikely to make catastrophic mistakes. In order to facilitate such qualitative assessments, the data scientist must think about the comprehensibility of the model to stakeholders (not just to the data scientists). If the model itself is not comprehensible (e.g., because it is a very complex mathematical formula), how can the data scientists work to make the behavior of the model more comprehensible?
Finally, a comprehensive evaluation framework is important because getting detailed information on the performance of a deployed model may be difficult or impossible. Often, there is only limited access to the deployment environment so making a comprehensive evaluation in production is difficult. Deployed systems typically contain many moving parts and assessing the contribution of a single part is difficult. Companies with sophisticated data science teams will (wisely) build testbed environments that mirror production data as closely as possible in order to get the most realistic evaluations before taking the risk of deployment.
Nonetheless, in some cases we may want to extend evaluation into the deployment environment, for example, by instrumenting a live system to be able to conduct randomized experiments. In our churn example for AT&T, if we have decided from laboratory tests that a data-mined model will give us better churn reduction, we may want to move on to an “in vivo” evaluation, in which a live system randomly applies the model to some customers while keeping other customers as a control group. Such experiments must be carefully designed. We may also want to instrument deployed systems for evaluations to make sure that the world is not changing to the detriment of the model’s decision-making.
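As a rough sketch of the assignment step in such an experiment, the snippet below splits customers into treatment and control groups by hashing their ids. Hashing is my own substitution for a random draw: it is a common way to get a stable, effectively random split, but it is not the only one. The salt, the function names, and the customer ids are all hypothetical.

```python
import hashlib


def assign_group(customer_id, treatment_share=0.5, salt="churn-test-1"):
    """Deterministically assign a customer to 'treatment' or 'control'.

    Hashing the id with an experiment-specific salt gives a stable,
    effectively random split that does not depend on customer traits.
    """
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def churn_rate(outcomes):
    """Fraction of customers who churned (True means churned)."""
    return sum(outcomes) / len(outcomes)


# Assign 10,000 hypothetical customer ids to the two groups.
groups = {cid: assign_group(cid) for cid in range(10_000)}
```

Once outcomes are observed, `churn_rate` can be computed separately for each group and compared. Deciding whether the difference is large enough to trust is part of the careful design the experiment needs.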
In deployment, the results of data mining – and increasingly the data mining techniques themselves – are put into real use in order to realize some return on investment. The clearest cases of deployment involve implementing a predictive model in some information system or business process. In our AT&T churn example, a model for predicting the likelihood of churn could be integrated with the business process for churn management, for example, by sending special offers to customers who are predicted to be particularly at risk of leaving. A new fraud detection model may be built into a workforce management information system, to monitor accounts and create cases for fraud analysts to examine.
Increasingly, the data mining techniques themselves are deployed.
Two main reasons for deploying the data mining system itself rather than the models produced by a data mining system are:
- The world may change faster than the data science team can adapt (as with fraud and intrusion detection)
- A business has too many modeling tasks for their data science team to manually curate each model individually
In these cases, it may be best to deploy the data mining phase into production. In doing so, it is critical to instrument the process to alert the data science team of any seeming anomalies and to provide fail-safe operation.
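What that instrumentation looks like depends heavily on the system, but a minimal sketch might compare recent model scores against a baseline and fall back to a safe default while an alert is active. Everything here is an assumption for illustration: the function names, the three-standard-error threshold, the 0.5 score cutoff, and the action labels.

```python
from statistics import mean, stdev


def drift_alert(baseline_scores, recent_scores, max_shift=3.0):
    """Return True when recent scores drift away from the baseline.

    The shift of the recent mean is measured in baseline standard
    errors; `max_shift` is an arbitrary illustrative threshold.
    """
    standard_error = stdev(baseline_scores) / len(recent_scores) ** 0.5
    difference = abs(mean(recent_scores) - mean(baseline_scores))
    if standard_error == 0:
        return difference > 0
    return difference / standard_error > max_shift


def decide(model_score, alert, default_action="manual_review"):
    """Fail safe: fall back to a default action while an alert is active."""
    if alert:
        return default_action
    return "flag" if model_score >= 0.5 else "approve"


# Hypothetical score samples: a baseline and a clearly shifted recent batch.
baseline = [0.10, 0.20, 0.15, 0.12, 0.18, 0.14, 0.16, 0.11]
drifted = [0.60, 0.70, 0.65, 0.80]
alert = drift_alert(baseline, drifted)
```

With the alert raised, `decide` routes every case to manual review instead of trusting a model that may no longer match the world it was trained on.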
Deploying a model into a production system typically requires that the model be recoded for the production environment, usually for greater speed or compatibility with an existing system. This may incur substantial expense and investment. In many cases, the data science team is responsible for producing a working prototype, along with its evaluation. These are passed to a development team.
Regardless of whether deployment is successful, the process often returns to the business understanding phase. The process of mining data produces a great deal of insight into the business problem and the difficulties of its solution. A second iteration can yield an improved solution. Just the experience of thinking about the business, the data, and the performance goals often leads to new ideas for improving business performance, and even new lines of business or new business ventures.
Please note that it is not necessary to fail in deployment to start the cycle again. The evaluation phase may reveal that the results are not good enough to deploy, and we need to adjust the problem definition or get different data. This is represented by the “shortcut” link from evaluation back to business understanding in the CRISP-DM process diagram. In practice, there should be shortcuts back from each stage to each prior one because the process always retains some exploratory aspects, and a project should be flexible enough to revisit prior steps based on the discoveries that have been made.
That brings us to the Guerrilla Analytics workflow, coined by Enda Ridge. The Guerrilla Analytics workflow considers data science as the following stages (from source data through to delivery):
|Data Science Workflow|Example Disruptions|
|---|---|
|Extract: taking data from a source system, the web, front end system reports||
|Receive: storing extracted data in the analytics environment and recording appropriate tracking information||
|Load: transferring data from receipt location into an analytics environment||
|Analytics: the data preparation, reshaping, modelling and visualization needed to solve the business problem||
|Work Products and Reporting: the ad-hoc analyses and formal project deliverables||