Introduction to Statistics
Statistics plays a pivotal role in research in the medical field, the sciences, economics, and the social sciences (among others). Statistics is often how we interpret results from the data that our experiments have generated. Typically, the process is as follows:
- Based upon some observation(s), a hypothesis, and possibly a set of alternative hypotheses, are formulated in order to explain the observed behavior
- These hypotheses are tested by experimentation in accordance with the Scientific Method. These experiments usually include the collection of data, the analysis of the results, and the comparison of these results against our hypotheses in order to determine whether our hypotheses are valid
- These steps are repeated, each time continuing to make observations, collecting and analyzing data, and improving our hypotheses
While you are probably most familiar with statistics from sports, it is also vital to governments, businesses, academics, and others, who use it to inform decisions about marketing efforts, financial planning, manufacturing, and strategic planning.
Statistics is a discipline that is concerned with the collection and analysis of data based on a probabilistic approach. What does this mean? It means that hypotheses about a general population (our observed entity) are tested on a smaller sample, and that our conclusions depend on how well the properties of the sample extend to the population at large. In effect, we are extrapolating the results observed in a subset of the data to the data as a whole.
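To make the sample-to-population idea concrete, here is a minimal Python sketch. The population is simulated (10,000 values generated from a made-up normal distribution, not real data), and the names `population` and `sample_mean`, the seed, and the sample sizes are all assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(42)  # arbitrary fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated measurements (illustrative, not real data).
population = [random.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)


def sample_mean(data, n):
    """Return the mean of a simple random sample of size n drawn without replacement."""
    return statistics.mean(random.sample(data, n))


# Compare the estimate from samples of increasing size with the population mean.
for n in (10, 100, 1_000):
    estimate = sample_mean(population, n)
    print(f"n={n:>5}: sample mean={estimate:.2f}, population mean={population_mean:.2f}")
```

Larger samples tend to produce estimates closer to the population mean, although any single sample can still miss by chance.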
Before we begin taking a look at some statistical concepts (what they are, and when and how to use them), it is important that you understand some key terms. In this article I will provide a cursory introduction and, as necessary, will explain in further detail as we explore the topic in depth.
- Data and Data Sets
- A collection of related sets of information that is composed of separate elements but can be manipulated as a unit; Observations from the environment
- Population
- A finite or infinite collection of items under consideration; A complete set of data that we wish to study or analyze. A key focus of the field of statistics is the study of characteristics of interest about a population
- Sample
- A portion drawn from a population, the study of which is intended to lead to statistical estimates of the attributes of the whole population; A subset of the data from the population which we analyze in order to learn about the population. A major objective in the field of statistics is to make inferences about a population based on properties of the sample
- Random Sample
- A sample of subjects that is randomly selected from a group and is therefore assumed to be representative of that group; A sample in which each member of the population has an equal chance of being included and in which the selection of one member is independent of the selection of all other members
- Random Variable
- A quantity having a numerical value for each member of a group, especially one whose values occur according to a frequency distribution; A variable which represents value(s) from a random sample. We will use letters at the end of the alphabet, especially x, y and z, as random variables; Also called a variate
- Independent Random Variable
- A variable that is chosen, and then measured or manipulated, by the researcher in order to study some observed behavior
- Dependent Random Variable
- A variable whose value depends on the value of one or more independent variables
- Discrete Variable
- A variable which can take a discrete set of values (e.g. cards in a deck or scores on an IQ test). Discrete variables can take either a finite or a countably infinite set of values, although for our purposes we usually consider discrete variables which only take a finite set of values; A categorical (or nominal) variable is a common kind of discrete variable whose possible values have no inherent order. For example, hair color is a categorical variable, because it can only have a limited number of values, such as red, brown, and black, that do not occur in any particular order. Different from other variable types such as continuous variables
- Continuous Variable
- A variable which can take all the values in a finite or infinite interval (e.g. weight or temperature). A continuous variable can take an infinite set of values
- Statistic
- A quantity which is calculated from a sample and is used to estimate a corresponding characteristic (i.e. parameter) about the population from which the sample is drawn
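The terms defined above can be tied together with a small Python sketch built on the deck-of-cards example. The numeric coding of ranks (ace = 1 through king = 13), the sample size of 10, and the variable names are assumptions made for illustration. Note also that `random.sample` draws without replacement, so every card is equally likely to be chosen but the draws are not strictly independent of one another.

```python
import random
import statistics

random.seed(7)  # arbitrary fixed seed so the sketch is repeatable

RANKS = range(1, 14)  # ace=1 ... king=13 (numeric coding assumed for illustration)
SUITS = ("clubs", "diamonds", "hearts", "spades")

# Population: every card in a standard 52-card deck.
population = [(rank, suit) for rank in RANKS for suit in SUITS]

# Random sample: 10 cards, each card equally likely to be selected.
sample = random.sample(population, 10)

# Random variable X: the rank of a drawn card (a discrete variable).
x_values = [rank for rank, _suit in sample]

# Statistic (computed from the sample) versus parameter (computed from the population).
sample_mean = statistics.mean(x_values)
population_mean = statistics.mean(rank for rank, _suit in population)

print(f"sample mean of X (statistic): {sample_mean}")
print(f"population mean of X (parameter): {population_mean}")
```

In practice the population is usually too large to measure directly, which is why the statistic is used as an estimate of the parameter.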
Types of Data Measurements
There are four types of data measurements that you need to be aware of (also called data scales):
|Scale|Description|Examples|
|---|---|---|
|Nominal|Provides a name. If numeric, then no scale is implied|Male, Female; 1 (Republican), 2 (Democrat), 3 (Independent)|
|Ordinal|Provides an ordered scale|1 (Excellent), 2 (Good), 3 (Fair), 4 (Poor)|
|Interval|Can be manipulated mathematically. Scale in equal increments|Temperature in centigrade (80 degrees is 20 degrees hotter than 60 degrees, which is 20 degrees hotter than 40 degrees, but 80 degrees is not twice as hot as 40 degrees)|
|Ratio|Interval scale with a meaningful zero|Temperature in Kelvin (80 kelvins is twice as hot as 40 kelvins); Weight, length, age|
Nominal data can be labeled, but it cannot be ordered or used in calculations. For example, we can’t say that Female < Male or Male > Female (as much as sexist idiots may try). Nominal data is also referred to as categorical data. Ordinal data can be compared, but it cannot be added or subtracted or calculated in any other way. For example, imagine a customer service survey that asks customers to rate performance on a scale of one through five, with 1 equating to Poor and 5 equating to Excellent (with appropriate values in between). A score of 4 is better than a score of 2, but it does not follow that it is twice as good. Nominal and ordinal data are called non-metric data.
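As a rough illustration of which summaries suit non-metric data, the Python sketch below counts labels for a nominal variable and takes a median for an ordinal one. The survey responses are made up for this example, and the variable names are assumptions.

```python
from collections import Counter
import statistics

# Hypothetical survey responses (made up for illustration).
party = ["Republican", "Democrat", "Independent", "Democrat", "Democrat"]  # nominal
scores = [5, 3, 4, 1, 4, 5, 4]  # ordinal: 1 = Poor ... 5 = Excellent

# Nominal data: counting labels and finding the most frequent one is meaningful.
party_counts = Counter(party)
most_common_party = party_counts.most_common(1)[0][0]

# Ordinal data: order is meaningful, so a median (middle value) makes sense.
# median_low always returns a value that actually appears in the data.
median_score = statistics.median_low(scores)

print(f"most common party: {most_common_party}")
print(f"median score: {median_score}")
```

Averaging the party labels would be meaningless, and averaging the scores would assume equal spacing between the ratings, which an ordinal scale does not guarantee.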
Metric data can be manipulated mathematically; that is, you can add and subtract the values and, for ratio data, meaningfully multiply and divide them as well. As we will see throughout this series, it makes sense to take the mean, standard deviation, etc. of metric data. There are two types of metric data: interval and ratio data. The difference between these two types of metric data is that ratio data has an absolute zero value. So, in the case of ratio data (but not interval data), it makes sense to say that one data element is 50% larger than another or that it is twice as effective as another.
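The temperature example can be checked numerically. The Python sketch below is only an illustration: the helper `celsius_to_kelvin` and the variable names are assumptions, and the two readings are the 80 and 40 degree values used in the centigrade example.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading (interval scale) to kelvins (ratio scale)."""
    return celsius + 273.15


hot_c, cool_c = 80.0, 40.0

# Differences are meaningful on an interval scale.
difference_c = hot_c - cool_c

# The naive Celsius ratio depends on an arbitrary zero point, so it is not meaningful.
naive_ratio = hot_c / cool_c

# Kelvin has an absolute zero, so the ratio of the same two temperatures is meaningful.
kelvin_ratio = celsius_to_kelvin(hot_c) / celsius_to_kelvin(cool_c)

print(f"difference in Celsius: {difference_c}")
print(f"naive Celsius ratio: {naive_ratio}")
print(f"ratio in kelvins: {kelvin_ratio:.3f}")
```

The Celsius ratio comes out as 2, but once the same readings are expressed in kelvins the ratio is only about 1.13, which is why "twice as hot" is not a valid statement on an interval scale.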
A random variable is classified as metric or non-metric (and, more specifically, as nominal, ordinal, interval, or ratio) according to the type of the underlying data that it represents.
I will be exploring statistics from both a mathematical perspective and a practical standpoint. In particular, I will be focusing primarily on how you can use Microsoft Excel and R (with RStudio) to conduct in-depth and detailed statistical analysis of your data. These skills are crucial to the modern business analyst and also form one of the key pillars of data science. Learning these skills will benefit you greatly in your career and personal growth.