if the coefficient of correlation between two variables x and y is negative then the regression coefficient of Y on X is
Answers
Answer:
Suppose Income increases Drinking but otherwise correlates with better health and healthier livestyle.
Then L = Liver problems might have a positive correlation with both I and D, so the model would be L = a I or L = b D, where b > 0 and possibly even a > 0 (if the effect of income via drinking is more important than the better (negative) effects).
If both I and D are included in the model, then the model would better predict the risk of liver problems:
L = c D - d I,
where c > 0 but d < 0. So the negative coefficient -d is "correct", although in a one-variable model a positive coefficient a > 0 might be correct.
BTW, I guess that c > b, as in the one-variable model L = b D the drinking might indicate higher income and hence otherwise healthier livestyle, whereas in the 2-variable model it cannot indicate that, as the income is "known" in that model. (In fact, now a bigger D probably predicts also an otherwise bad livestyle, so maybe c >> b.)
So even if the variables I and D correlate with each other, it might be wise to include them both. However, then there is a risk of overfitting if you do not have sufficiently different data points.
Partial linear squares or such might be a better method.
If you can measure a Healthy livestyle separately, its model might be L = -h H for some -h < 0. This is an example of a negative coefficient in a 1-variable model. Then maybe you would get a 3-variable model L = c' D - h' H + 0 I with c', h' >0.
(Due to the "In fact..." I guess that c > c' > b.)
Disclaimer: I'm a mathematician, not a statistician (nor medical scientist), so feel free to correct me if necessary. A better answer might be the above
The correlation coefficient of the linear relationship between the variables x and y. xi – the values of the x-variable in a sample. x – the mean of the values of the x-variable. ... ȳ – the mean of the values of the y-variable.