What is Data Science?
You’ve no doubt heard the buzzwords thrown around within the technology community (e.g., big data and data mining), but you might not have a clear understanding of what exactly data science is.
Data-driven decision-making (DDD) refers to the practice of basing decisions upon the analysis of data, rather than purely on intuition. A common business example is a sales representative who selects which advertisements to run in her market based purely upon her many years of experience in the field. Alternatively, she could base her choice of advertisements on the analysis of data regarding how consumers respond to different types of ads. She could also use a combination of these two approaches. All three approaches are accepted and routinely used in business.
The benefits of having a data-driven decision-making process in place have been shown to be considerable. Economist Erik Brynjolfsson and his colleagues from MIT and the Wharton School of Business conducted a study to determine how DDD impacts the performance of a business. Their study showed that, statistically, the more data-driven a company is, the more productive it is, even when controlling for a wide range of possible confounding factors. The differences are not small. Scoring one standard deviation higher on the DDD scale was associated with a 4% to 6% increase in overall productivity. Companies with higher levels of DDD also demonstrated higher returns on assets and equity, better asset utilization, and higher market value.
Figure 1 shows data science supporting data-driven decision-making, but also overlapping with it. The implication is that in the modern world, computer systems are making many business decisions automatically, without human intervention. The most readily apparent examples are the recommendations made by Amazon and Netflix.
The Data Science Workflow You Know is Wrong
When people think about the data science workflow, they often reference the industry standard diagram:
An alternative, from the Communications of the ACM Blog, is as follows:
These diagrams do not reflect reality, and should only be considered at a very high level. In reality, data science is a highly iterative process, and many of the issues you run into will force you to modify or redo your work instead of moving on to prepare your data analysis report for management. For example, suppose that after some early analysis, a data profiling exercise reveals that part of your data extract has been truncated. It will take you a considerable amount of time to check that you didn’t corrupt the file when you uploaded it, and it is likely that you will have to obtain and load another data extract. The workflow diagrams in figures 2 and 3 are purely academic and not representative of what you will encounter in the real world, so please keep that in mind as we dive deeper into this topic.
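To make the truncated-extract scenario concrete, here is a minimal sketch of the kind of profiling check that can surface the problem early. It assumes the extract is a CSV file with a header row and that you know roughly how many rows to expect; the function name `profile_extract`, the column names, and the sample data are all hypothetical and only serve the illustration.

```python
import csv
import io


def profile_extract(lines, expected_rows=None):
    """Return a list of human-readable problems found in a CSV extract.

    `lines` is any iterable of text lines (an open file works).
    `expected_rows` is an optional row count supplied by whoever
    produced the extract (an assumption of this sketch).
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return ["extract is empty"]

    problems = []
    row_count = 0
    for row_number, row in enumerate(reader, start=1):
        row_count += 1
        # A row with the wrong number of fields often marks the point
        # where a file was cut off mid-write or mid-upload.
        if len(row) != len(header):
            problems.append(
                f"row {row_number} has {len(row)} fields, expected {len(header)}"
            )

    if expected_rows is not None and row_count != expected_rows:
        problems.append(f"found {row_count} rows, expected {expected_rows}")
    return problems


# Hypothetical extract whose last line was cut off during upload.
sample = io.StringIO("id,region,sales\n1,east,100\n2,west,250\n3,nor")
issues = profile_extract(sample, expected_rows=4)
for issue in issues:
    print(issue)
```

A check like this will not tell you why the extract is damaged, only that it is, which is exactly the point at which the tidy workflow diagram breaks down and you loop back to data acquisition.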
Distinguishing Between Data Processing and Data Science
It is not uncommon nowadays to see data processing skills, systems, and technologies being labeled as data science, but it is important to make the distinction between data processing and data science. Data processing supports data science; think of it as one of the components of data science. Data science needs access to data and benefits from the sophisticated data engineering that data processing may help facilitate, but the two cannot be considered one and the same. In practical terms, it is best to consider data science as extracting knowledge or facilitating data-driven decision-making, and data processing as the collection and sorting of information. As companies become more adept at processing massive amounts of data, they will begin asking themselves: “What can I do now that I couldn’t do before?” or “What can I do better now than I could before?” This will be the golden era of data science.
Distinguishing Between Statistics and Data Science
People new to data science or without experience in the field will often confuse statistics with data science. While the two are certainly related (statistics plays a large role in data science), they are not the same. Statistics is used for data analysis, typically as a way to extract interesting patterns from individual data sets with well-formulated queries in order to explain or examine some phenomenon. Data science aims to discover and extract actionable knowledge from the data, meaning knowledge that can be used to make decisions and predictions, not just to explain what is going on.
The articles and posts that appear on this page will provide you with the knowledge and skills needed to approach problems “data-analytically.” Going forward, I will write about the following topics in depth (along with anything else that crosses my mind or becomes relevant in the marketplace):
- Data Mining
- Predictive Modeling
- Fitting a Model to Data
- Overfitting and Its Avoidance
- Similarity, Neighbors, and Clusters
- Determining a Good Model
- Visualizing Model Performance
- Evidence and Probabilities
- Representing and Mining Text
- Stepping Towards Analytical Engineering
- Miscellaneous Data Science Tasks, Tips, and Techniques
- Data Science and Business Strategy