What is the degree of correlation among independent variables in a regression model called?

We don’t have your requested question, but here is a suggested video that might help.

Related Question

What is the coefficient of determination for two variables that have perfect positive linear correlation or perfect negative linear correlation? Interpret your answer.

Video Transcript

If we have two variables that are perfectly linearly correlated, either positively or negatively, then the coefficient of determination will always be one. Let me say that again: if two variables have perfect linear correlation, perfect positive or perfect negative linear correlation, then the coefficient of determination, R squared, will always be one, meaning that either of the two variables explains 100% of the variation in the other variable.
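As a quick illustration of this point (not part of the original answer), the short Python sketch below builds two perfectly linear relationships, one with a positive and one with a negative slope, and confirms that r² is 1 in both cases; the particular x and y values are made up.

```python
import numpy as np

# Perfect positive and perfect negative linear relationships (illustrative data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2.0 * x + 1.0      # perfect positive linear correlation
y_neg = -3.0 * x + 10.0    # perfect negative linear correlation

for y in (y_pos, y_neg):
    r = np.corrcoef(x, y)[0, 1]                 # Pearson correlation coefficient
    print(f"r = {r:+.3f}, r^2 = {r**2:.3f}")    # r^2 is 1.0 in both cases
```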

Correlated Chronometric and Psychometric Variables

Arthur R. Jensen, in Clocking the Mind, 2006

Multiple Correlation

A multiple correlation coefficient [R] yields the maximum degree of linear relationship that can be obtained between two or more independent variables and a single dependent variable. [R is never signed as + or −. R2 represents the proportion of the total variance in the dependent variable that can be accounted for by the independent variables.] The independent variables are each optimally weighted such that their composite will have the largest possible correlation with the dependent variable. Because the determination of these weights [beta coefficients] is, like any statistic, always affected by sampling error [so the obtained R is always inflated], the multiple R is properly “shrunken” to correct for this bias. Shrinkage of R is based on the number of independent variables and the sample size. When the number of independent variables is small relative to the sample size, the shrinkage procedure has a negligible effect. Also, the correlations among the independent variables that go into the calculation of R can be corrected for attenuation [measurement error], which increases R. Furthermore, R can be corrected for restriction of the range of ability in the particular sample when its variance on the variables entering into R differs significantly from the population variance, assuming the latter is adequately estimated. Correction of correlations for restriction of range is frequently used in studies based on students in selective colleges, because they typically represent only the upper half of the IQ distribution in the general population.
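The passage does not give the exact shrinkage formula used; one common textbook implementation of the idea is the Wherry/Ezekiel adjusted R². The sketch below applies that adjustment to the text's uncorrected R = .35 with the five Hick parameters as predictors; the sample sizes are made-up values chosen only to show that shrinkage matters for small samples and is negligible for large ones.

```python
import math

def shrunken_R2(R2, n, k):
    """Wherry/Ezekiel-type adjustment of R^2 for the number of predictors k
    and the sample size n -- one common way to implement 'shrinkage'."""
    return 1.0 - (1.0 - R2) * (n - 1) / (n - k - 1)

R = 0.35             # uncorrected multiple R quoted later in the text
k = 5                # Hick parameters: mean RT, RTSD, intercept, slope, mean MT
for n in (50, 900):  # hypothetical sample sizes
    R2_shrunk = shrunken_R2(R**2, n, k)
    print(f"n = {n}: shrunken R = {math.sqrt(max(R2_shrunk, 0.0)):.3f}")
# With a large sample the shrinkage is negligible; with a small sample it is not.
```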

Two examples of the multiple R between several RT variables and a single “IQ” score are given below. To ensure a sharp distinction between RTs based on very simple ECTs and timed scores on conventional PTs, the following examples were selected to exclude any ECTs on which the mean RTs are greater than 1 s for normal adults or 2 s for young children. Obviously, not much cogitation can occur in so little time.

The simplest example is the Hick paradigm. Jensen [1987a] obtained values of R in large samples, where the independent variables are various parameters of RT and MT derived from Hick data, viz. mean RT, RTSD, the intercept and slope of the regression of RT on bits, and mean MT.

Without corrections for attenuation and restriction of range in the samples, R=.35; with both of these corrections, R=.50. This is the best estimate we have of the population value of the largest correlation that can be obtained between a combination of variables obtained from the Hick parameters and IQ as measured by one or another single test, most often the Raven matrices.
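The two corrections mentioned here have standard textbook forms: the classical disattenuation formula for measurement error and Thorndike's Case II correction for direct restriction of range. The sketch below is only an illustration of those generic formulas, not Jensen's actual computation; the reliabilities and standard deviations are invented values.

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Classical correction for attenuation: divide the observed correlation
    by the square root of the product of the two reliabilities."""
    return r_xy / math.sqrt(rel_x * rel_y)

def correct_range_restriction(r, sd_restricted, sd_unrestricted):
    """Thorndike Case II correction for direct restriction of range on the
    selection variable (U = unrestricted SD / restricted SD)."""
    U = sd_unrestricted / sd_restricted
    return r * U / math.sqrt(1.0 - r**2 + (r**2) * (U**2))

r_obs = 0.35   # uncorrected multiple R from the text
print(disattenuate(r_obs, rel_x=0.85, rel_y=0.90))            # made-up reliabilities
print(correct_range_restriction(r_obs, sd_restricted=10.0,    # made-up SDs, e.g. a
                                sd_unrestricted=15.0))        # range-restricted sample
```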

Vernon [1988] analyzed four independent studies totaling 702 subjects. Each study used a wide variety of six to eight ECTs that were generally more complex and far more heterogeneous in their processing demands than the much simpler and more homogeneous Hick task. The average value of the multiple R [shrunken but not corrected for restriction of range] relating RT and IQ was .61; for RTSD and IQ, R was .60. For RT and RTSD combined, R was .66.


URL: //www.sciencedirect.com/science/article/pii/B9780080449395500100

Applying the Tools to Multivariate Data

J. Douglas Carroll, Paul E. Green, in Mathematical Tools for Applied Multivariate Analysis, 1997

6.2.2 Strength of Overall Relationship and Statistical Significance

The squared multiple correlation coefficient, R2, measures the proportion of variance in Y [as measured about its mean] that is accounted for by variation in X1 and X2. As mentioned in Chapter 1, the formula is

$$R^2 = 1 - \frac{\sum_{i=1}^{12} e_i^2}{\sum_{i=1}^{12}\bigl(Y_i-\bar Y\bigr)^2} = 1 - \frac{34.099}{354.25} = 0.904$$

The statistical significance of R, the positive square root of R2, is tested via the analysis of variance subtable of Table 6.2 by means of the F ratio:

F=42.25

which, with 2 and 9 degrees of freedom, is highly significant at the α = 0.01 level. Thus, as described in Chapter 1, the equivalent null hypotheses

$$R_p = 0 \qquad\text{and}\qquad \beta_1 = \beta_2 = 0$$

are rejected at the 0.01 level, and we conclude that the multiple correlation is significant.
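As a quick check of these numbers (assuming, from the sums running over i = 1 to 12, that there are n = 12 observations and k = 2 predictors), the sketch below recomputes R², the F ratio, and the 0.01 critical value; scipy is used only for the F quantile.

```python
from scipy import stats

sse, sst = 34.099, 354.25     # residual and total sums of squares (about the mean)
n, k = 12, 2                  # observations and predictors (assumed from the text)

r2 = 1.0 - sse / sst
f = (r2 / k) / ((1.0 - r2) / (n - k - 1))          # F ratio with k and n-k-1 df
f_crit = stats.f.ppf(0.99, k, n - k - 1)           # critical value at alpha = 0.01

print(f"R^2 = {r2:.3f}, F = {f:.2f}, F_crit(2, 9; 0.01) = {f_crit:.2f}")
# R^2 ~ 0.904 and F ~ 42.2, well above the 0.01 critical value (~8.02)
```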

Up to this point, then, we have established the estimating equation and measured, via R2, the strength of the overall relationship between Y versus X1 and X2.

If we look at the equation again

$$\hat Y_i = -2.263 + 1.550\,X_{i1} - 0.239\,X_{i2}$$

we see that the intercept is negative. In terms of the current problem, a negative 2.263 days of absenteeism is impossible, illustrating, of course, the possible meaninglessness of extrapolation beyond the range of the predictor variables used in developing the parameter values.

The partial regression coefficient for X1 seems reasonable; it says that predicted absenteeism increases 1.55 days per unit increase in attitude rating. This is in accord with the scatter plot [Fig. 1.2] that shows the association of Y with X1 alone.

The partial regression coefficient for X2, while small in absolute value, is negative, even though the scatter plot of Y on X2 alone [Fig. 1.2] shows a positive relationship. The key to this seeming contradiction lies in the strong positive relationship between the predictors X1 and X2 [also noted in the scatter plot of Fig. 1.2]. Indeed, the correlation between X1 and X2 is 0.95. The upshot of all of this is that once X1 is in the equation, X2 is so redundant with X1 that its inclusion leads to a negative partial regression coefficient that effectively is zero [given its large standard error].
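The redundancy effect described here is easy to reproduce with simulated data (the book's absenteeism data are not used below). In the sketch, y depends only on x1, yet x2 correlates positively with y on its own; once x1 is in the joint regression, x2 receives a near-zero partial coefficient that can come out negative purely by chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated predictors with correlation ~0.95 (illustrative, not the book's data)
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)
y = 1.5 * x1 + rng.normal(size=n)                 # y truly depends on x1 only

# Marginally, y correlates positively with BOTH predictors ...
print(np.corrcoef(y, x1)[0, 1], np.corrcoef(y, x2)[0, 1])

# ... yet in the joint regression the redundant x2 gets a partial coefficient
# close to zero (possibly negative), with a large standard error.
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)          # b[2] should be small and statistically indistinguishable from 0
```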


URL: //www.sciencedirect.com/science/article/pii/B978012160954250007X

Correlation

Milan Meloun, Jiří Militký, in Statistical Data Analysis, 2011

Problem 7.13 Significance of the relationship between the nitrogen content in soil and in corn

In Problem 7.6 the multiple correlation coefficient expressing the relationship between the nitrogen content in corn and a linear combination of organically bound nitrogen and inorganically bound nitrogen in soil was found to be $\hat R_{1(2,3)} = 0.6945$. Examine the null hypothesis H0: $R_{1(2,3)} = 0$.

Solution: According to Eq. [7.58], the test criterion
$$F_R = \frac{(18-3)\cdot 0.6945^2}{(3-1)\,\bigl(1-0.6945^2\bigr)} = 6.988$$
is higher than the quantile of the Fisher–Snedecor distribution F0.95[2, 15] = 3.682, and therefore the null hypothesis H0: $R_{1(2,3)} = 0$ is rejected at significance level α = 0.05.

Conclusion: The content of nitrogen in soil significantly affects the content of nitrogen in corn. Inorganically bound nitrogen contributes predominantly.
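The test criterion above can be reproduced directly from the values quoted in the problem (R̂ = 0.6945, and n = 18, m = 3 as implied by the degrees of freedom 18 − 3 and 3 − 1); scipy is used only for the F quantile.

```python
from scipy import stats

R, n, m = 0.6945, 18, 3          # multiple correlation, sample size, number of parameters

F_R = (R**2 / (1.0 - R**2)) * (n - m) / (m - 1)
F_crit = stats.f.ppf(0.95, m - 1, n - m)

print(f"F_R = {F_R:.3f}, F_0.95(2, 15) = {F_crit:.3f}")
# F_R ~ 6.99 > 3.68, so H0: R_1(2,3) = 0 is rejected at alpha = 0.05
```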

Case Rm > 0: To calculate the distribution of the sample multiple correlation coefficient $\hat R_m^2$, either a complicated exact expression or a convenient approximation may be used. Gurland [6] has proposed a relatively precise approximation

[7.62]
$$\frac{\hat R_m^2}{1-\hat R_m^2}\;\approx\;\frac{(n-1)\dfrac{R_m^2}{1-R_m^2}+m-1}{n-m}\;F_{r,\,n-m}$$

where the quantity $F_{r,\,n-m}$ has the F-distribution with r and (n − m) degrees of freedom. Then

[7.62a]
$$r=\frac{K(n-1)+m-1}{Z}$$

where

[7.62b]
$$Z=\frac{(n-1)K(K+2)+m-1}{(n-1)K+m-1}$$

and

[7.62c]
$$K=\frac{R_m^2}{1-R_m^2}$$

For large sample sizes, the squared multiple correlation coefficient is approximately normally distributed with mean value $E\hat R_m^2 = R_m^2$ and variance $D\hat R_m^2 = 4R_m^2\bigl(1-R_m^2\bigr)^2/(n-1)$. The random variable

[7.63]
$$u_R=\frac{\sqrt{n-1}\,\bigl(\hat R_m^2-R_m^2\bigr)}{2R_m\bigl(1-R_m^2\bigr)}$$

has the standardized normal distribution. The Fisher transformation and other transformations that speed up convergence to normality can also be used.

For the mean value of the squared multiple correlation coefficient, we have

[7.64]
$$E\hat R_m^2 = R_m^2+\frac{m-1}{n-1}\,\bigl(1-R_m^2\bigr)-\frac{2(n-m)}{n^2-1}\,R_m^2\bigl(1-R_m^2\bigr)+\cdots$$

The variance is given by

[7.65]
$$D\hat R_m^2=\frac{4R_m^2\bigl(1-R_m^2\bigr)^2(n-m)^2}{\bigl(n^2-1\bigr)(n+3)}\;\approx\;\frac{4R_m^2\bigl(1-R_m^2\bigr)^2}{n}$$
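The sketch below simply evaluates Eqs. [7.63]–[7.65] as reconstructed above for an illustrative population R², sample size, and number of parameters; the numerical values are made up and are not from the text.

```python
import math

def approx_moments(R2, n, m):
    """Leading terms of E(R^2_hat) and D(R^2_hat), Eqs. [7.64]-[7.65] as
    reconstructed above, for population R^2, sample size n, m parameters."""
    mean = R2 + (m - 1) / (n - 1) * (1 - R2) - 2 * (n - m) / (n**2 - 1) * R2 * (1 - R2)
    var = 4 * R2 * (1 - R2)**2 * (n - m)**2 / ((n**2 - 1) * (n + 3))
    return mean, var

def u_R(R2_hat, R2, n):
    """Normalized statistic of Eq. [7.63] for the large-sample normal approximation."""
    R = math.sqrt(R2)
    return math.sqrt(n - 1) * (R2_hat - R2) / (2 * R * (1 - R2))

# Illustrative values only: population R^2 = 0.5, n = 30 observations, m = 3 parameters
print(approx_moments(0.5, 30, 3))
print(u_R(R2_hat=0.58, R2=0.5, n=30))
```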

For smaller sample sizes, the estimate $\hat R_m^2$ overestimates $R_m^2$. The corrected multiple correlation coefficient is expressed by

[7.66]
$$\hat R_m^{*2}=\hat R_m^2-\frac{m-3}{n-m}\,\bigl(1-\hat R_m^2\bigr)-\frac{2(n-3)}{(n-m)^2}\,\bigl(1-\hat R_m^2\bigr)+\cdots$$

It can be seen that $\hat R_m^{*2} < \hat R_m^2$.

Here α > 0 is an analog of the shrinkage coefficient in estimators of the Stein estimator type, and 1/t serves as a regularization parameter. In this case, by Theorem 4.10, the leading part of the quadratic risk [3] is

$$R_0(\rho) = R_0(\alpha, t) = \sigma^2 - 2\alpha\,\varphi\bigl(ts(t)\bigr) + \alpha^2\,\Delta(t,t)/s^2(t).$$

If α = 1, we have

$$R_0(\rho) = R_0(1, t) = \frac{1}{s^2(t)}\,\frac{d}{dt}\Bigl[t\bigl(\sigma^2 - \kappa(t)\bigr)\Bigr].$$

In this case, the empirical risk is Remp(t) = s²(t)R0(t). For the optimum value α = αopt = s²(t)φ(ts(t))/Δ(t, t), we have

$$R_0(\rho) = R_0(\alpha_{\mathrm{opt}}, t) = \sigma^2 - \frac{s^2(t)\,\varphi^2\bigl(ts(t)\bigr)}{\Delta(t,t)}.$$

Example 1. Let λ → 0 [the transition to the case of fixed dimension under an increasing sample size N → ∞]. To simplify the formulas, we write out only the leading terms of the expressions. If λ = 0, then s(t) = 1, h(t) = n⁻¹ tr(I + tΣ)⁻¹, κ(t) = φ(t), and Δ(t, t) = φ(t) − tφ′(t). Set Σ = I. We have

$$\varphi(t)\approx\sigma^2 r^2\,\frac{t}{1+t},\qquad h(t)\approx\frac{1}{1+t},\qquad \Delta(t,t)\approx\sigma^2 r^2\,\frac{t^2}{(1+t)^2},$$

where r2 = g2/σ2 is the square of the multiple correlation coefficient. The leading part of the quadratic risk [3] is

$$R_0=\sigma^2\Bigl[1-2\alpha r^2\,\frac{t}{1+t}+\alpha^2 r^2\,\frac{t^2}{(1+t)^2}\Bigr].$$

For the optimal choice of d, as well as for the optimal choice of t, we have α = [1 + t]/t and Ropt = σ2[1 − r2], i.e., the quadratic risk [3] asymptotically attains its a priori minimum.

Example 2. Let N → ∞ and n → ∞ so that λ = n/N → λ0. Assume that the matrices Σ are nondegenerate for each n, that σ² → σ0², that r² = gᵀΣ⁻¹g/σ² → r0², and that the parameters γ → 0. Under this limit transition, for each fixed t ≥ 0, the remainder terms in Theorems 4.8-4.11 vanish. Let d = 1 and t → ∞ [the transition to the standard nonregularized regression under the increasing-dimension asymptotics]. Under these conditions,

$$s(t)\to 1-\lambda_0,\quad s'(t)\to 0,\quad \varphi\bigl(ts(t)\bigr)\to\sigma_0^2 r_0^2,\quad \kappa(t)\to\kappa(\infty)\stackrel{\mathrm{def}}{=}\sigma_0^2 r_0^2(1-\lambda_0)+\sigma_0^2\lambda_0,\quad t\kappa'(t)\to 0.$$

The quadratic risk [3] tends to R0 so that

$$\lim_{t\to\infty}\;\lim_{\gamma\to 0}\;\lim_{N\to\infty}\bigl|\,\mathbf{E}R(t)-R_0\,\bigr| = 0,$$

where R0 =def σ0²(1 − r0²)/(1 − λ0). This limit expression was obtained by I. S. Yenyukov [see in [2]]. It presents an explicit dependence of the quality of the standard regression procedure on the dimension of observations and the sample size. Note that under the same conditions, the empirical risk Remp → σ0²(1 − r0²)(1 − λ0), which is less than σ0²(1 − r0²).

Example 3. Under the same conditions as in Example 2, let the coefficients d be chosen optimally and then let t → ∞. We have α = αopt(t) = s²(t)φ(ts(t))/Δ(t, t). Then,

$$s(t)\to 1-\lambda_0,\quad \varphi\bigl(ts(t)\bigr)\to\sigma_0^2 r_0^2,\quad \Delta(t,t)\to\sigma_0^2(1-\lambda_0)\bigl[\lambda_0(1-r_0^2)+(1-\lambda_0)r_0^2\bigr],\quad \alpha_{\mathrm{opt}}\to\frac{r_0^2(1-\lambda_0)}{\lambda_0(1-r_0^2)+(1-\lambda_0)r_0^2}.$$

By [23], the quadratic risk [3] R0[t, αopt] → R0 as t → ∞, where

$$R_0=\frac{\sigma_0^2(1-r_0^2)\bigl[\lambda_0+(1-\lambda_0)r_0^2\bigr]}{\lambda_0(1-r_0^2)+(1-\lambda_0)r_0^2}\;\le\;\frac{\sigma_0^2(1-r_0^2)}{1-\lambda_0}.$$

If λ0 = 1, the optimal shrinkage coefficient αopt → 0 and the quadratic risk remains finite [it tends to σ0²] in spite of the absence of a regularization, whereas the quadratic risk of the standard linear regression tends to infinity.
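To make the dependence on the dimension-to-sample-size ratio concrete, the sketch below evaluates the two limiting risk expressions from Examples 2 and 3 as reconstructed above; σ0² and r0² are arbitrary illustrative values.

```python
def risk_standard(sigma2, r2, lam):
    """Limit risk of standard (non-regularized) regression, Example 2:
    sigma0^2 (1 - r0^2) / (1 - lambda0)."""
    return sigma2 * (1.0 - r2) / (1.0 - lam)

def risk_opt_shrinkage(sigma2, r2, lam):
    """Limit risk with the optimally chosen shrinkage coefficient, Example 3."""
    denom = lam * (1.0 - r2) + (1.0 - lam) * r2
    return sigma2 * (1.0 - r2) * (lam + (1.0 - lam) * r2) / denom

sigma2, r2 = 1.0, 0.6                      # illustrative sigma0^2 and r0^2
for lam in (0.1, 0.5, 0.9, 0.99):
    print(lam, risk_standard(sigma2, r2, lam), risk_opt_shrinkage(sigma2, r2, lam))
# As lambda0 -> 1 the standard-regression risk blows up, while the optimally
# shrunk estimator's risk stays bounded (it tends to sigma0^2).
```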


URL: //www.sciencedirect.com/science/article/pii/B9780444530493500072

Volume 3

J. Ferré, in Comprehensive Chemometrics, 2009

3.02.3.5.1 The collinearity problem

The columns of the design matrix X can be anything between orthogonal and perfectly correlated [one column being a multiple of another column, or a linear combination of several columns]. When the columns are orthogonal, the estimated regression coefficients are independent and their variance [or a combined measure of their variance] is smaller than in the non-orthogonal case for the same number of training points. Independence and low variance are important when the objective is to interpret the coefficients [the effect of the variables on the measured response y] and also to obtain predictions with a low variance. Hence, orthogonality of the x-variables is often sought by the experimenter whenever it is possible to fix the x-values [e.g., in designed experiments]. For many experimental situations, statistical experimental design provides optimal X matrices that have orthogonal columns [e.g., full factorial designs, fractional factorial designs, Plackett–Burman designs], and offers criteria such as D- and A-optimality to select the points from a list of candidates in such a way that the X matrix formed with the selected points has columns that are as close to orthogonal as possible.

The opposite situation is a perfect correlation between two or more columns of X, a situation called singularity. This can be expressed as83

[74]
$$\sum_{k=1}^{K} w_k\,\mathbf{x}_k = \mathbf{0}$$

where xk is the kth column of X and the wk are constants, not all of which are zero. In this case, the rank of XTX is less than K, and XTX is singular and cannot be inverted; hence, it is not possible to calculate the OLS estimates of the regression coefficients with Equation [7]. Usually [e.g., with near-infrared [NIR] spectra], the x-variables are neither completely correlated nor orthogonal, and the coefficients can be estimated by the OLS solution, but their variances and covariances will be larger than for the case of orthogonal x’s. The most problematic situation is the existence of nearly exact linear combinations among the independent variables, that is, when Equation [74] holds approximately. This situation is usually called collinearity, multicollinearity, or ill-conditioning. A more precise definition of collinearity can be found in Gunst84 [p 81].

Collinearity has adverse effects on the regression results. The more collinear the x-variables are, the more unstable the OLS estimators of the coefficients are: the estimated coefficients may change substantially with small changes in the observed y due to random errors. This instability means large [inflated] variances and covariances of the coefficients of the variables involved in the linear dependencies, which makes it difficult to interpret the impact of each regressor on the response. Moreover, the coefficient estimates cannot be interpreted separately, they are often too large or of the wrong sign, and the t-tests of significance [Equation [51]] can indicate that the coefficients are statistically insignificant. In addition, despite the coefficients being estimated poorly, the model can have a good fit; hence, the traditional analysis of model adequacy with summary statistics such as SSE, the multiple correlation coefficient, or residual plots will not signal the collinearity problems. Note that these statistics reflect how well the fitted model estimates the observed y but not necessarily the validity of the model for prediction. Actually, the prediction for new x measurements may be good at points whose combinations of x’s are similar to those in the training data [so collinearity in X is not that big a problem if we are only interested in predictions at points with the same collinearity pattern]. However, prediction at points that do not have the same pattern of collinearity as X, or extrapolation beyond the range of the data, can be very adversely affected and have large errors. Gunst and Mason83 show an example of the problems associated with collinearity.

Mandel85 reasoned that collinearity must be seen as a warning to limit the use of the regression model for predictions to a specific subspace of the x-space. This amounts to making a distinction between the sample domain [SD], defined by the maximum and minimum value of each x-variable, and the effective prediction domain [EPD], which is the part of the x-space in which the training data lie, and in which and near which prediction is safe. The collinearity problem and these domains are illustrated in Figure 9 for two independent variables, x1 and x2. [Similar discussions can be found in Sergent et al.,86 Belsley et al.1 [p 87], Mandel,87 and Larose88 [p 117].]

Four data sets with five points each are simulated with given values of x1 and x2 [Table 2]. The SD is the rectangle ABCD. The model was given by Equation [4], E[y∣xi] = β0 + β1xi,1 + β2xi,2, with β0 = 0.5, β1 = 0.2, and β2 = 0.15. For each point, a random error ɛi from a normal distribution with mean zero and standard deviation 0.03 was added. Hence, the measured y is yi = 0.5 + 0.2xi,1 + 0.15xi,2 + ɛi. Data sets R1 and R2 were simulated with the same values of x1 and x2, but with different values of the random error, to test the stability of the coefficients against the variation of the random error. Data sets R3 and R4 were also simulated with the same values of x1 and x2 but with different values of the random error. The independent variables in R1 and R2 illustrate a situation where x1 and x2 are not correlated with each other; that is, they are orthogonal. The x-variables in R3 and R4 illustrate a collinear situation where x1 and x2 are correlated with each other, so that as one increases, so does the other. The random error added to data sets R1 and R3 was the same, and the random error added to data sets R2 and R4 was the same. For each data set, the model ŷ = b0 + b1x1 + b2x2 [Equation [8]] was calculated. The values of x1 and x2, E[y∣xi], ɛi, and yi are listed in Table 2. The estimated coefficients, the coefficient of multiple determination, and the variance inflation factors [VIFs] [Section 3.02.3.5.3[iii]] are given in Table 3. The four models are plotted in Figure 9. Figure 9[a] plots the models of data sets R1 and R2. The points are well spread over the whole regression domain and form a solid basis for the model. Slightly larger or smaller yi values due to random error do not change the coefficients excessively. This translates into stable coefficient estimates b1 and b2, each with small variances, as follows:

Figure 9. [a] Plot of the fitted model for data sets R1 and R2 with an orthogonal model matrix X. [b] Plot of the fitted model for data sets R3 and R4, with collinear data: the x1 values increase as the corresponding values for x2 increase.

Table 2. Simulated data corresponding to Figure 9. Data sets R1 and R2 correspond to Figure 9[a]. Data sets R3 and R4 correspond to Figure 9[b]

Data sets R1 and R2 [orthogonal x’s]:

x1    x2    E[y∣xi]    ɛi (R1)    yi (R1)    ɛi (R2)    yi (R2)
0.2   0.2   0.570      0.026      0.596      −0.013     0.557
0.2   0.8   0.660      0.003      0.663      −0.033     0.627
0.5   0.5   0.675      −0.026     0.649      0.012      0.687
0.8   0.2   0.690      0.026      0.716      −0.029     0.661
0.8   0.8   0.780      −0.013     0.767      0.005      0.785

Data sets R3 and R4 [collinear x’s]:

x1    x2    E[y∣xi]    ɛi (R3)    yi (R3)    ɛi (R4)    yi (R4)
0.2   0.2   0.570      0.026      0.596      −0.013     0.557
0.3   0.2   0.590      0.003      0.593      −0.033     0.557
0.5   0.5   0.675      −0.026     0.649      0.012      0.687
0.7   0.6   0.730      0.026      0.756      −0.029     0.701
0.8   0.8   0.780      −0.013     0.767      0.005      0.785

Table 3. Estimated coefficients, coefficient of multiple determination, and VIF for the data sets in Table 2

        R1      R2      R3      R4
b0      0.536   0.473   0.509   0.489
b1      0.187   0.218   0.361   −0.099
b2      0.098   0.162   −0.038  0.473
R2      0.934   0.949   0.946   0.995
VIF1    1       1       22.7    22.7
VIF2    1       1       22.7    22.7

VIF, variance inflation factor.

$$\operatorname{var}(\mathbf b)=\begin{bmatrix}1.59&-1.39&-1.39\\-1.39&2.78&0.00\\-1.39&0.00&2.78\end{bmatrix}\sigma^2$$

Figure 9[b] illustrates the problems caused by collinearity among the columns of X in data sets R3 and R4. One of the dimensions of the x-space is very poorly spanned, with almost no data dispersion: the data vary mainly along the diagonal of the SD, whereas the perpendicular direction is hardly spanned. Consequently, the model is stable along the direction with higher x-variability, but easily perturbed by random errors in the direction of low variability. This means very poor, high-variance coefficient estimates for the variables that are involved in the collinearity:

$$\operatorname{var}(\mathbf b)=\begin{bmatrix}1.29&-5.25&3.33\\-5.25&87.18&-83.33\\3.33&-83.33&83.33\end{bmatrix}\sigma^2$$
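The two variance–covariance matrices and the VIFs in Table 3 can be reproduced directly from the x-values in Table 2, since var(b) = (XᵀX)⁻¹σ² and, with two predictors, VIF = 1/(1 − r²(x1, x2)). The sketch below does this with numpy only; the printed entries match the matrices above up to rounding.

```python
import numpy as np

# x-values from Table 2
X_orth = np.array([[0.2, 0.2], [0.2, 0.8], [0.5, 0.5], [0.8, 0.2], [0.8, 0.8]])  # R1/R2
X_coll = np.array([[0.2, 0.2], [0.3, 0.2], [0.5, 0.5], [0.7, 0.6], [0.8, 0.8]])  # R3/R4

for name, X in (("orthogonal (R1/R2)", X_orth), ("collinear (R3/R4)", X_coll)):
    D = np.column_stack([np.ones(len(X)), X])     # design matrix with intercept
    cov_unit = np.linalg.inv(D.T @ D)             # var(b) = (X'X)^-1 * sigma^2
    r12 = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
    vif = 1.0 / (1.0 - r12**2) if abs(r12) < 1 else np.inf
    print(name)
    print(np.round(cov_unit, 2))                  # matches the matrices above (rounded)
    print("r(x1, x2) =", round(r12, 3), " VIF =", round(vif, 1))
```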

The high variability associated with the estimated coefficients b1 and b2 means that different y samples may produce coefficient estimates with very different values [note how, despite the y having the same random errors as R1 and R2, the coefficients are much more affected]. Clearly, this instability is unacceptable to the experimenter if the coefficients must be interpreted. This situation, however, cannot be detected from the fit. Note how the fit of models R1 and R2 is no better than that of models R3 and R4.

Predictions are also severely affected by collinearity. A prediction at a point P2 in the direction AC will have low uncertainty for all four models. However, a point P1 near vertex A [but inside the SD] will be predicted very differently by the models from R1 and from R3. The model from R1 will produce a prediction with a low variance, whereas the model from R3 will produce a prediction with a large variance due to the uncertainty of the model in that direction. Point P1 will have a large leverage and be detected as an outlier for this model. A possible solution to these problems is to reconsider the domain, and shift from the SD to the EPD. Figure 9[b] illustrates the EPD. The two new axes correspond to the PCA decomposition used in the PCR model. PCA defines a new variable in the direction AC and another in the direction BD, and the new EPD is now defined as the ‘sample domain’ for these new variables [as the largest and smallest values of the scores along these two PCs]. In this zone, predictions are stable, and point P1, which in the original MLR model would be predicted wrongly, will now lie outside the limits of the EPD and be detected as an outlier.
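A crude version of this EPD check can be sketched with a PCA of the training x-values: project a candidate point onto the principal components and ask whether its scores fall within the training score range. The code below uses the R3 x-values from Table 2 and a hypothetical P1-like point at (0.8, 0.2); it is only an illustration of the idea, not the chapter's exact procedure.

```python
import numpy as np

X = np.array([[0.2, 0.2], [0.3, 0.2], [0.5, 0.5], [0.7, 0.6], [0.8, 0.8]])  # R3 x-values
x_new = np.array([0.8, 0.2])    # P1-like point: inside the SD, off the collinear axis

# PCA of the mean-centred training x-values
mu = X.mean(axis=0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                      # training scores on PC1, PC2
new_scores = (x_new - mu) @ Vt.T        # scores of the candidate point

# Crude EPD check: is the new point within the training score range on each PC?
lo, hi = scores.min(axis=0), scores.max(axis=0)
inside_epd = np.all((new_scores >= lo) & (new_scores <= hi))
print("scores of new point:", np.round(new_scores, 3))
print("inside EPD:", inside_epd)        # False: the P1-like point falls outside the EPD
```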

This example shows how interrelationships among the x’s can severely restrict the effective use of the model, which may only be adequate for prediction in limited regions of the predictor variables. The analyst must investigate the correlation structure among the predictor variables with regression diagnostics to determine if the multivariate data being analyzed corresponds to case R1 or to case R3. The next section identifies some possible causes of collinearity, and possible solutions.


URL: //www.sciencedirect.com/science/article/pii/B9780444527011000764

What is it called when independent variables are correlated?

Key Takeaways. Multicollinearity is a statistical concept where several independent variables in a model are correlated. Two variables are considered to be perfectly collinear if their correlation coefficient is +/- 1.0. Multicollinearity among independent variables will result in less reliable statistical inferences.

What is the independent variable known as in regression analysis?

Independent variables are also known as predictors, factors, treatment variables, explanatory variables, input variables, x-variables, and right-hand variables—because they appear on the right side of the equals sign in a regression equation.

What is correlation coefficient in regression?

Correlation in Linear Regression

The square of the correlation coefficient, r², is a useful value in linear regression. This value represents the fraction of the variation in one variable that may be explained by the other variable.
