Introduction to Alternative Analytic Techniques
The application of various technologies to the analysis of data forms the foundation of business analytics. In this post, I will briefly cover some of the alternative analytic techniques and technologies that are available so that you can become familiar with them: what their goals are, what roles they play, when to use one technique or technology over another, and when it might be beneficial to invest further in a technology or to consult experts in the field.
This post will present six different groups of related analytic techniques. Since this is a post in our data mining series, where it is appropriate we will draw comparisons and contrasts between these techniques and data mining. The main difference is that data mining focuses on the automated search for knowledge, patterns, or regularities from data. One of the most important skills for a business analyst is to be able to recognize what sort of analytic technique is appropriate for addressing a particular set of problems. If you would like a refresher on this, please read my post on transforming business problems into data mining tasks.
Statistics

The term statistics has two different uses in business analytics. The first is as a catchall term for the computation of particular numeric values of interest from data, known as summary statistics. In both of its forms, summary statistics and Statistics proper, statistics is an alternative analytic technique and is complementary to data mining.
Summary statistics should be chosen with close attention to the business problem that needs to be solved, and also to the distribution of the data being summarized.
For example, the mean (average) income in the United States, according to the Census Bureau's economic survey, was $72,641. If we were to use that figure as a measure of average income in order to make policy decisions, we would be misleading ourselves. The distribution of incomes in the US is highly skewed, with many people making relatively little and some people making a very large amount. In such cases, the arithmetic mean tells us relatively little about how much people are making. Instead, we should use a different measure of average income, such as the median. The median income, the amount where half the population makes more and half makes less, was only $53,046 according to the Census Bureau, considerably less than the mean. This example may seem obvious because we are so accustomed to hearing about the median income, but the same reasoning applies to any computation of summary statistics: we must think about the problem we would like to solve or the question we would like to answer, consider the distribution of the data, and determine whether the chosen statistic is appropriate.
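To see the gap concretely, here is a quick sketch in Python with made-up incomes (the figures below are illustrative, not Census data):

```python
import statistics

# Hypothetical skewed incomes (in dollars): many modest earners, a few very high ones.
incomes = [28_000, 31_000, 35_000, 42_000, 48_000,
           53_000, 55_000, 61_000, 95_000, 450_000]

mean_income = statistics.mean(incomes)      # pulled upward by the outlier
median_income = statistics.median(incomes)  # resistant to the outlier

print(f"mean:   ${mean_income:,.0f}")
print(f"median: ${median_income:,.0f}")
```

A handful of very high earners pulls the mean far above the median, which is why the median is usually the better summary of a heavily skewed distribution.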
Summary statistics are the basic building blocks of much data science theory and practice, and you should have a full understanding of this foundational element.
The other common use of the term statistics is to denote the field of study that goes by that name, which we might differentiate by using the proper name, Statistics. The field of Statistics provides us with a huge amount of knowledge that underlies analytics, and can be thought of as a component of the larger field of data science. For example, Statistics helps us to understand different data distributions and which statistics are appropriate to summarize each. It also helps us understand how to use data to test hypotheses and to estimate the uncertainty of conclusions. In relation to data mining, hypothesis testing can help determine whether an observed pattern is likely to be valid, a general regularity as opposed to a chance occurrence in some particular data set. Many of the techniques for extracting models or patterns from data have their roots in Statistics.
For example, a preliminary study may suggest that customers in the Northeast have a churn rate of 24.5%, whereas the nationwide average churn rate is only 14%. This may just be a chance fluctuation: churn rates are not constant, and they vary over regions and over time, so some differences are to be expected. But the Northeast rate is 1.75 times the US average, which seems unusually high. What is the chance that this is due to random variation? Statistical hypothesis testing is used to answer such questions. Closely related is the quantification of uncertainty into confidence intervals. The overall churn rate is 14%, but there is some variation; traditional statistical analysis may reveal that 95% of the time the churn rate is expected to fall between 12% and 16%. This contrasts with the complementary process of data mining, which may be seen as hypothesis generation: can we find patterns in data in the first place? Hypothesis generation should then be followed by careful hypothesis testing (generally on a different data set). In addition, data mining procedures may produce numerical estimates, and we often want to provide confidence intervals on those estimates as well. Whenever you present a statistical analysis, provide a confidence interval whenever possible.
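As a sketch of how such a test might look in Python, the following uses the churn rates above with an assumed sample size (the post gives no sample sizes, so the n here is purely illustrative):

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.14     # nationwide churn rate (the null hypothesis)
p_ne = 0.245  # observed Northeast churn rate
n_ne = 200    # assumed number of Northeast customers sampled (illustrative)

# One-sample z-test for a proportion: could 24.5% plausibly arise
# by chance from a process whose true rate is 14%?
se = sqrt(p0 * (1 - p0) / n_ne)
z = (p_ne - p0) / se
p_value = 1 - NormalDist().cdf(z)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.2g}")

# 95% confidence interval around the observed Northeast rate.
se_obs = sqrt(p_ne * (1 - p_ne) / n_ne)
lo, hi = p_ne - 1.96 * se_obs, p_ne + 1.96 * se_obs
print(f"95% CI for Northeast churn: {lo:.1%} to {hi:.1%}")
```

At this assumed sample size the p-value is tiny, so the difference would be very unlikely to be random variation; with a much smaller sample, the same rates could easily fail the test.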
One statistical term that is often heard in the context of business analytics is correlation. For example, "are there any indicators that correlate with a customer's later defection from AT&T?" As with the term statistics, correlation has both a general-purpose meaning (variations in one quantity tell us something about variations in the other) and a special technical meaning (e.g., linear correlation, based on a particular mathematical formula).
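The technical meaning can be computed directly; here is a minimal sketch of the linear (Pearson) correlation on toy data (the tenure and support-call figures are invented):

```python
from math import sqrt

def pearson(xs, ys):
    """Linear (Pearson) correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: months of service vs. monthly support calls for ten customers.
tenure = [2, 5, 8, 12, 15, 20, 24, 30, 36, 48]
support_calls = [9, 8, 8, 6, 6, 5, 4, 3, 2, 1]

r = pearson(tenure, support_calls)
print(round(r, 2))  # strongly negative: longer-tenured customers call less
```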
Database Querying

Querying a database is another alternative analytic technique. A query is a specific request for a subset of data or for statistics about data, formulated in a technical language (SQL, Structured Query Language) and posed to a database system. Many tools are available to answer one-off questions about data posed by an analyst. These tools are usually front ends to database systems based on SQL, or tools with a graphical user interface that help users formulate queries (such as SSMS, SQL Server Management Studio). For example, if the analyst can define profitable in operational terms that are computable from items in the database, then, using a query tool, the analyst could answer: "Who are the most profitable customers in the Northeast?" The analyst may then run the query to retrieve a list of the most profitable customers, likely ranked by their profitability. This activity differs fundamentally from data mining in that there is no discovery of patterns or models; we are simply answering the question directly.
Database queries are appropriate when an analyst already has an idea (or a question) about what might be an interesting subpopulation of the data, and wants to investigate that population or confirm a hypothesis about it. For example, if an analyst suspects that middle-aged women living in the Northeast have some particularly interesting churning behavior, she could compose a SQL query:
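The query itself is not reproduced here, but based on the segment described, it might look something like the following; the table and column names (Customers, age, sex, region) are assumptions for illustration, and SQLite is used from Python simply to make the sketch runnable:

```python
import sqlite3

# Build a tiny in-memory Customers table with a hypothetical schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customers (name TEXT, age INTEGER, sex TEXT, region TEXT)")
con.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?, ?)",
    [("Ann", 52, "F", "NE"), ("Bea", 44, "F", "NE"),
     ("Carl", 47, "M", "NE"), ("Dana", 38, "F", "SW")],
)

# The kind of query the analyst might compose for that segment:
rows = con.execute(
    "SELECT * FROM Customers WHERE age >= 40 AND sex = 'F' AND region = 'NE'"
).fetchall()
print(rows)  # Ann and Bea match; Carl (male) and Dana (under 40, SW) do not
```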
If a decision is made that those are the people to be targeted with an offer, a query tool can be used to retrieve all the information about them (“*”) from the Customers table in the database. For more information on SQL, please read my articles Introduction to SQL and Introduction to Advanced SQL.
In contrast, data mining could be used to come up with this query in the first place: is there a pattern or regularity in the data? A data mining procedure might examine prior customers who did and did not defect, and determine that a particular segment (say, women aged 40 or older living in the Northeast) is predictive with respect to churn rate. After translating this into a SQL query, a query tool could then be used to find the matching records in the database.
Query tools generally have the ability to execute sophisticated logic, including computing summary statistics over subpopulations, sorting, joining together multiple tables with related data, and much more. Oftentimes, data scientists become quite adept at writing queries to extract the data that they need.
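For instance, here is a sketch of computing a summary statistic over subpopulations with grouping and sorting (again with an assumed schema and invented figures):

```python
import sqlite3

# Hypothetical Orders table to illustrate summary statistics over subpopulations.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Orders (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [("NE", 120.0), ("NE", 80.0), ("SW", 50.0), ("SW", 70.0), ("SW", 60.0)],
)

# Average order amount per region, highest first: grouping, aggregation, sorting.
rows = con.execute(
    "SELECT region, AVG(amount), COUNT(*) FROM Orders "
    "GROUP BY region ORDER BY AVG(amount) DESC"
).fetchall()
print(rows)
```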
Online Analytical Processing (OLAP)

Online Analytical Processing (OLAP) provides an easy-to-use graphical user interface for querying large data collections, for the purpose of facilitating data exploration. The idea of online processing is that it is done in real time, so analysts and decision-makers can find answers to their questions quickly and efficiently. Unlike the ad hoc querying enabled by tools like SQL, for OLAP the dimensions of analysis must be pre-programmed into the OLAP system (in a cube, for example). If we foresee that we will want to explore sales volume by region and time, we can have these dimensions programmed into the system and drill down into populations, often simply by clicking, dragging, and manipulating dynamic charts.
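Under the hood, a cube is essentially a set of pre-computed aggregates along the chosen dimensions. A toy sketch of that idea in plain Python (the facts, regions, and quarters are invented):

```python
from collections import defaultdict

# Toy sales facts: (region, quarter, amount). An OLAP cube pre-aggregates
# along dimensions like these; here we roll the totals up by hand.
facts = [("NE", "Q1", 100), ("NE", "Q2", 150), ("SW", "Q1", 80),
         ("SW", "Q2", 90), ("NE", "Q1", 50)]

cube = defaultdict(float)
for region, quarter, amount in facts:
    cube[(region, quarter)] += amount  # finest grain
    cube[(region, "*")] += amount      # roll-up over time
    cube[("*", quarter)] += amount     # roll-up over region

print(cube[("NE", "Q1")])  # drill-down: NE sales in Q1
print(cube[("NE", "*")])   # NE sales across all quarters
```

Because only the pre-programmed dimensions (here region and quarter) are aggregated, any drill-down along those axes is a fast lookup, but a dimension that was never programmed in is simply unavailable.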
OLAP systems are designed to facilitate manual or visual exploration of the data by analysts. OLAP performs no modeling or automatic pattern finding on the data. As an additional contrast, data mining tools generally can incorporate new dimensions of analysis easily as part of the exploration, whereas it can be a royal pain to add a new dimension to an OLAP system after it has been created. OLAP tools can be a useful complement to data mining tools used for discovery in business data.
Data Warehousing

Data warehouses collect and coalesce data from across an enterprise, often from multiple transaction processing systems, each with its own database. Analytical systems can access data warehouses. Data warehousing may be seen as a facilitating technology of data mining. It is not always necessary, however, as most data mining does not access a data warehouse. Firms that decide to invest in data warehouses often can apply data mining more broadly and deeply throughout the organization.
For example, if a data warehouse integrates records from sales and billing as well as from human resources, it can be used to find characteristic patterns of effective salespeople. Data warehousing can stand on its own as an alternative analytic technique, but more often it serves as an important facilitating component of other analyses.
Regression Analysis

Regression analysis is one of the most common alternative analytic techniques and is widely used across statistics and data analysis. Within this context, we are less interested in explaining a particular data set than we are in extracting patterns that will generalize to other data, for the purpose of improving some business process. Typically, this will involve estimating or predicting values for cases that are not in the analyzed data set. So, for example, we are less interested in digging into the reasons for churn in a particular set of historical data, and more interested in predicting which customers who have not yet left would be the best to target with a marketing campaign to reduce future churn. In future articles, we will spend time talking about testing patterns on new data to evaluate their generality, and about techniques for reducing the tendency to find patterns that are specific to a particular set of data but do not generalize to the population from which the data originated.
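As a minimal sketch, here is an ordinary least-squares fit on toy historical data, used to predict a value for a case outside the analyzed data set (the variables and numbers are invented):

```python
# Fit a line to "historical" customers, then predict for an unseen one.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]       # e.g., years as a customer (toy data)
ys = [10.0, 14.0, 19.0, 21.0, 26.0]  # e.g., annual spend in hundreds (toy data)

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

def predict(x):
    """Apply the fitted line to a new, unseen case."""
    return intercept + slope * x

print(round(predict(6.0), 2))  # estimated spend for an unseen 6-year customer
```

The emphasis is on the last line: the model's value comes from applying it to cases that were not in the data used to fit it.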
The topic of explanatory modeling versus predictive modeling can elicit deep-seated debate, but it is important to realize that there is considerable overlap in the techniques used, and that the lessons learned from explanatory modeling do not all apply to predictive modeling.
Machine Learning and Data Mining
The latest alternative analytic technique, one that is taking the field by storm, is Machine Learning. The collection of methods for extracting (predictive) models from data, now known as machine learning methods, was developed in several fields concurrently, most notably Machine Learning, Applied Statistics, and Pattern Recognition. Machine Learning is a field of study that arose as a subfield of Artificial Intelligence concerned with methods for improving the knowledge or performance of an intelligent agent over time, in response to the agent's experience in the world. Such improvement often involves analyzing data from the environment and making predictions about unknown quantities, and over the years this data analysis aspect of machine learning has come to play a very large role in the field. Because these methods were developed in several fields concurrently, those disciplines developed close ties, and the separation between them has blurred.
The field of data mining, sometimes referred to as KDD (Knowledge Discovery and Data Mining), started as an offshoot of Machine Learning, and they remain closely linked. Both fields are concerned with the analysis of data to find useful or informative patterns. Techniques and algorithms are shared between the two; indeed, the areas are so closely related that researchers commonly participate in both communities and transition between them seamlessly.
Generally speaking, because Machine Learning is concerned with many types of performance improvement, it includes subfields such as robotics and computer vision that are not part of KDD. It is also concerned with issues of agency and cognition – how will an intelligent agent use learned knowledge to reason and act in its environment – which are not concerns of data mining.
Historically, KDD spun off from Machine Learning as a research field focused on concerns raised by examining real-world applications, and a decade and a half later the KDD community remains more concerned with applications than Machine Learning is. As such, research focused on commercial applications and the business issues of data analysis tends to gravitate towards the KDD community rather than to Machine Learning (although this is beginning to change). KDD also tends to be more concerned with the entire process of data analytics: data preparation, model learning, evaluation, and so on.
Answering Business Questions with These Techniques
Now that we know some of the available technologies and techniques, we must understand how to use them. To illustrate how these techniques apply to business analytics, consider a set of questions that may arise and the technologies that would be appropriate for answering them. These questions are all related, but each is subtly different. It is important to understand the differences in order to know which technologies to employ and which experts it may be necessary to consult.
- Who are the most profitable customers?
If profitable can be defined clearly based on existing data, this is a straightforward database query and an analyst can quickly and easily answer it. A standard query tool can be used to retrieve a set of customer records from a database. The results could be sorted by cumulative transaction amount, or some other operational indicator of profitability.
- Is there really a difference between the profitable customers and the average customer?
This is a question about a conjecture or hypothesis (in this case, that there is a difference in value to the company between the profitable customers and the average customer), and statistical hypothesis testing would be used to confirm or disconfirm it. Statistical analysis could also derive a probability or confidence bound that the difference is real. Typically, the result would be a statement such as: the value of these profitable customers is significantly different from that of the average customer, with the probability that this is due to random chance being less than 5%.
- But who really are these customers? Can I characterize them in some way?
We'd like to do more than just list the profitable customers; we'd like to describe their common characteristics. Characteristics of individual customers can be extracted from a database using techniques such as database querying, which can also be used to generate summary statistics. A deeper analysis would involve determining what characteristics differentiate profitable customers from unprofitable ones. This is the realm of data science, using data mining techniques for automated pattern finding.
- Will some particular new customer be profitable? How much revenue should I expect this customer to generate?
These questions could be addressed by data mining techniques that examine historical customer records and produce predictive models of profitability. Such techniques would generate models from historical data that could then be applied to new customers to generate predictions.
Please note that this last pair of questions poses subtly different data mining tasks. The first, a classification question, may be phrased as a prediction of whether a given new customer will be profitable (yes/no, or the probability thereof). The second may be phrased as a prediction of the numerical value that the customer will bring to the company.
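A toy sketch of the contrast, using a deliberately simple nearest-neighbor lookup (all names and numbers here are invented): the same historical records can answer both the yes/no question and the numeric one.

```python
# Historical records: customer features -> (profitable?, annual revenue).
history = [
    ((25, 10), (False, 200.0)),   # (age, monthly_usage)
    ((40, 55), (True, 1200.0)),
    ((30, 48), (True, 1100.0)),
    ((22, 5),  (False, 150.0)),
]

def nearest(features):
    """Return the historical record closest to the given features."""
    return min(history,
               key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], features)))

new_customer = (38, 52)
_, (profitable, revenue) = nearest(new_customer)
print(profitable)  # classification: will this customer be profitable?
print(revenue)     # regression: how much revenue should we expect?
```

Real data mining procedures would of course fit a model rather than look up a single neighbor, but the distinction between predicting a class and predicting a number carries over directly.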