Statistics for the Behavioral Sciences | Lesson 7 Correlation | Roger N. Morrissette, PhD |
I. Correlations [Video Lesson 7 I] [YouTube version]
A correlation is a statistical test that demonstrates the relationship between two variables. Even though you may be able to show a significant relationship between two variables, a correlation does not show a causal relationship between them. For example, although depression and self-esteem are two variables that are significantly correlated with each other, we cannot say that low self-esteem causes depression. Likewise, we cannot say that depression causes low self-esteem. The two variables may be significantly correlated, but no causal relationship is assumed. Correlations are best represented graphically by a scatterplot and best calculated using the Pearson Product Moment Correlation formula.
Let's test the hypothesis that depression scores are negatively correlated with self-esteem scores. We design our surveys and sample 8 subjects. Their data are presented below. Data for a correlation are always presented in two columns like the data set shown below. Depression scores are our X data and Self-Esteem scores are our Y data:
Depression [X] | Self-Esteem [Y] |
10 | 104 |
12 | 100 |
19 | 98 |
4 | 150 |
25 | 75 |
15 | 105 |
21 | 82 |
7 | 133 |
II. Scatterplots [Video Lesson 7 II] [YouTube version]
A scatterplot is a graphical representation of the two sets of data you are comparing. The X-axis plots your first or "X" set of data, and the Y-axis plots your second or "Y" set of data. The scatterplot can tell you two important things about the relationship between your two variables. First, it can show you whether you have a weak or strong relationship between your variables. Second, it can tell you whether your variables are negatively or positively related.
A. Scatterplots can show the strength of the relationship between two variables
1. Weak relationships will have a wide scattering of the plots
2. Strong relationships will have a minimal scattering of the plots
B. Scatterplots can show the direction or type of the relationship between two variables
1. Positive Correlation
both factors vary in the same direction
as one factor increases, the other increases
2. Negative Correlation
both factors vary in opposite directions
as one factor increases, the other decreases
3. Zero or Neutral Correlation
the two factors show no relationship to one another
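The direction described above can also be checked numerically: the sign of the sum of cross-products of deviations, Σ(X - X̄)(Y - Ȳ), matches the direction of the relationship you would see in the scatterplot. A minimal Python sketch (the function name and toy data here are illustrative, not part of the lesson):

```python
# Sketch: classify the direction of a relationship by the sign of
# the sum of cross-products of deviations, sum((x - mean_x)(y - mean_y)).
# Positive sum: the factors vary in the same direction.
# Negative sum: the factors vary in opposite directions.
# Zero sum: no linear relationship between the factors.

def relationship_direction(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if cross > 0:
        return "positive"
    if cross < 0:
        return "negative"
    return "zero"

# Toy data: as one factor increases, the other decreases
print(relationship_direction([1, 2, 3, 4], [8, 6, 4, 2]))  # negative
```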
III. The Pearson Product Moment Correlation [Correlation Coefficient] [Video Lesson 7 III] [YouTube version] [Correlation Calculation - YouTube version] [mp4 version]
The correlation coefficient is a statistic that quantifies the actual relationship between two variables. It has a range between -1.00 and +1.00; you cannot get a correlation of 1.5. A value of -1.00 would be a perfect [very strong] negative correlation, a value of +1.00 would be a perfect [very strong] positive correlation, and a value of 0.00 would be a [very weak] zero or neutral correlation. To calculate the correlation coefficient we use the Pearson Product Moment Correlation [r]:
r = [n(ΣXY) - (ΣX)(ΣY)] / √{[nΣX² - (ΣX)²] x [nΣY² - (ΣY)²]}
The formula reads: r equals, in the numerator, n [the number of pairs] multiplied by the sum of the XY products, minus the sum of X times the sum of Y. In the denominator, take the square root of the following: n times the sum of X² minus the quantity [sum of X] squared, multiplied by n times the sum of Y² minus the quantity [sum of Y] squared.
Depression [X] | Self-Esteem [Y] |
10 | 104 |
12 | 100 |
19 | 98 |
4 | 150 |
25 | 75 |
15 | 105 |
21 | 82 |
7 | 133 |
To calculate the correlation coefficient [r] for the data above, we first need to expand the columns just as we did when we calculated standard deviation. If you look at the formula above you will see that we need an X² column, a Y² column, and an X times Y [XY] column. This first step is shown below:
X | Y | X² | Y² | XY |
10 | 104 | 100 | 10816 | 1040 |
12 | 100 | 144 | 10000 | 1200 |
19 | 98 | 361 | 9604 | 1862 |
4 | 150 | 16 | 22500 | 600 |
25 | 75 | 625 | 5625 | 1875 |
15 | 105 | 225 | 11025 | 1575 |
21 | 82 | 441 | 6724 | 1722 |
7 | 133 | 49 | 17689 | 931 |
The next step is to calculate the sums of our columns:
X | Y | X² | Y² | XY |
10 | 104 | 100 | 10816 | 1040 |
12 | 100 | 144 | 10000 | 1200 |
19 | 98 | 361 | 9604 | 1862 |
4 | 150 | 16 | 22500 | 600 |
25 | 75 | 625 | 5625 | 1875 |
15 | 105 | 225 | 11025 | 1575 |
21 | 82 | 441 | 6724 | 1722 |
7 | 133 | 49 | 17689 | 931 |
113 | 847 | 1961 | 93983 | 10805 |
n = 8 |
Now we have all the information we need to solve our equation:
r = [[8 x 10805] - [113 x 847]] / square root [[[8 x 1961] - [113]²] x [[8 x 93983] - [847]²]]
r = [86440 - 95711] / square root [[15688 - 12769] x [751864 - 717409]]
r = -9271 / square root [2919 x 34455]
r = -9271 / square root [100574145]
r = -9271 / 10028.666
r = -0.9244
Our correlation coefficient is negative and very close to -1.00, which tells us that we have a strong negative relationship between our two variables. If we look at the scatterplot of our data, we can see that it is aligned with our correlation coefficient.
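The hand calculation above can be verified with a short Python sketch that applies the Pearson formula directly to the lesson's eight pairs of scores:

```python
from math import sqrt

# Depression [X] and Self-Esteem [Y] scores from the lesson
xs = [10, 12, 19, 4, 25, 15, 21, 7]
ys = [104, 100, 98, 150, 75, 105, 82, 133]

n = len(xs)                                  # n = 8 pairs
sum_x, sum_y = sum(xs), sum(ys)              # 113, 847
sum_x2 = sum(x * x for x in xs)              # 1961
sum_y2 = sum(y * y for y in ys)              # 93983
sum_xy = sum(x * y for x, y in zip(xs, ys))  # 10805

# Pearson Product Moment Correlation
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 4))  # -0.9244
```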
IV. Determining Significance [Video Lesson 7 IV] [YouTube version]
Now that we have calculated our correlation coefficient, we need to determine how significant it is. There are two ways to evaluate a correlation: the first is to calculate the Coefficient of Determination and the second is to use the R Table.
A. The Coefficient of Determination
The coefficient of determination measures how much of the variance in one factor can be explained by the variance in the factor with which it is correlated. To calculate the coefficient of determination we simply square the r value.
Coefficient of Determination = r²
For our r of -0.9244, the coefficient of determination is r² = 0.8545. This means that 85% of the variance in our depression scores can be explained by the variance in our self-esteem scores. This is a high value and suggests a very strong relationship between the two variables. Although the coefficient of determination is a good indicator of the strength of the relationship between two variables, it does not establish significance. We will need to use the R Table to confirm whether our correlation is statistically significant.
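A quick check of this arithmetic in Python, squaring the rounded r from above:

```python
r = -0.9244                 # correlation coefficient from the lesson
r_squared = r ** 2          # coefficient of determination
print(round(r_squared, 4))  # 0.8545, i.e. about 85% of variance explained
```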
B. The R Table
The R Table is located in its entirety in Appendix A in the back of the textbook, starting on page 435; a shortened version is also available at the bottom of this lecture. It gives the critical r values based on the degrees of freedom of your sample, the level of significance of the statistical test, and whether your hypothesis is one- or two-tailed. These three factors, plus the critical r values themselves, are represented in the R Table and will be explained one at a time.
1. Degrees of Freedom.
The term degrees of freedom refers to the number of scores within a data set that are free to vary. In any sample with a fixed mean, the sum of the deviation scores is equal to zero. If your sample has an n equal to 10, the first 9 scores are free to vary, but the 10th score must be the specific value that makes the sum of the deviations equal to zero. Therefore, in a single sample the degrees of freedom are equal to n - 1. The degrees of freedom for a correlation are slightly different because n equals the number of pairs, not simply the sample size. Therefore, the degrees of freedom for a correlation are n - 2. So to calculate the degrees of freedom you simply take the number of pairs and subtract two. For our data set of depression and self-esteem scores the degrees of freedom are calculated the following way:
df = n -2
df = 8 - 2
df = 6
The R Table shows the degrees of freedom values in the far left column as shown below:
Levels of Significance for a One-Tailed Test
df | .05 | .025 | .01 | .005 |
[For a Two-Tailed Test, the same columns correspond to levels .10 | .05 | .02 | .01]
1 | .988 | .997 | .9995 | .9999 |
2 | .900 | .950 | .980 | .990 |
3 | .805 | .878 | .934 | .959 |
4 | .729 | .811 | .882 | .917 |
5 | .669 | .754 | .833 | .874 |
6 | .622 | .707 | .789 | .834 |
7 | .582 | .666 | .750 | .798 |
8 | .549 | .632 | .716 | .765 |
9 | .521 | .602 | .685 | .735 |
10 | .497 | .576 | .658 | .708 |
The table continues through higher degrees of freedom in the textbook.
2. One- or Two-tailed hypotheses.
The number of tails of a hypothesis reflects whether the hypothesis predicts a direction for the effect. This concept will be discussed in greater detail in chapter 11. For now, you should know that if a correlation hypothesis simply predicts an effect without predicting either a negative or positive direction for that effect, it is considered a Two-Tailed hypothesis. If the hypothesis predicts either a negative or positive direction, then it is a One-Tailed hypothesis. Since our hypothesis as stated predicts a negative correlation, it is a One-Tailed test. The two kinds of hypothesis tests appear in the header rows of the R Table above.
3. Levels of Significance.
The levels of significance or "p values" will also be discussed in greater detail in chapters 11, 12, and 13. For now you should simply know that a level of significance of .05 [p = .05] means that there is only a 5% probability that a correlation this strong would arise by chance alone; equivalently, we can be 95% confident in the result. The .05 value is considered standard in science. Smaller levels of significance indicate stronger evidence of significance, and larger values indicate weaker evidence. This value must be given to you in the problem. For our example let's use p = .05. The levels of significance appear as the column headings of the R Table above.
4. Critical R Values.
Critical values are threshold values for significance. The absolute value of your calculated r must exceed the critical r value in the R Table for the correlation to be considered significant. The critical r values make up the body of the R Table above.
Now let's put it all together. Using the criteria of our example [a One-Tailed test, p = .05, and df = 6], we consult the R Table above to determine if our calculated r value is significant.
According to the R Table, for a One-Tailed test at p = .05 with 6 degrees of freedom, the critical value we must exceed for our calculated r value to be significant is .622.
Since the absolute value of our calculated r = -0.9244 exceeds .622,
we conclude that our correlation is significant.
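The decision rule just applied can be sketched as a small table lookup in Python. The dictionary below holds only the one-tailed p = .05 critical values for df 1 through 10 from the R Table excerpt in this lecture; the function name is illustrative:

```python
# One-tailed, p = .05 critical r values from the R Table, keyed by df
CRITICAL_R_ONE_TAILED_05 = {
    1: .988, 2: .900, 3: .805, 4: .729, 5: .669,
    6: .622, 7: .582, 8: .549, 9: .521, 10: .497,
}

def is_significant(r, n_pairs):
    """Compare |r| to the critical value at df = n_pairs - 2."""
    df = n_pairs - 2
    return abs(r) > CRITICAL_R_ONE_TAILED_05[df]

print(is_significant(-0.9244, 8))  # True: |-0.9244| > .622 at df = 6
```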