To formally determine the correlation coefficient and the regression line, we use the sum of squares notation ($S_{xx}$, $S_{yy}$, $S_{xy}$). This notation simplifies the formulas for both concepts.
The core of linear analysis involves measuring the variability within a single variable (variance) and the co-variability between two variables (covariance).
| Term | Definition | Formula | Purpose |
|---|---|---|---|
| $S_{xx}$ | Sum of squares for $X$ | $\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$ | Measures the total variation in the independent variable $X$. |
| $S_{yy}$ | Sum of squares for $Y$ | $\sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$ | Measures the total variation in the dependent variable $Y$. |
| $S_{xy}$ | Sum of products for $X$ and $Y$ | $\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$ | Measures the co-variability between $X$ and $Y$. |
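Each row of the table gives two algebraically equivalent forms: the definitional (deviation) form and the computational (shortcut) form. A minimal pure-Python sketch checking that they agree numerically (the small dataset here is made up for illustration):

```python
def sums_of_squares(x, y):
    """S_xx, S_yy, S_xy via the definitional (deviation) formulas."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxx, syy, sxy

def sums_of_squares_shortcut(x, y):
    """The same quantities via the computational (shortcut) formulas."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi ** 2 for xi in x) - sx ** 2 / n
    syy = sum(yi ** 2 for yi in y) - sy ** 2 / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sx * sy / n
    return sxx, syy, sxy

x = [1.0, 2.0, 4.0, 7.0]   # illustrative data, not from the notes
y = [3.0, 5.0, 6.0, 10.0]
# The two forms agree up to floating-point rounding.
```

The shortcut form needs only running totals (one pass over the data), while the deviation form is numerically safer when the raw values are large.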
The Pearson correlation coefficient ($r$) is calculated by dividing the co-variation ($S_{xy}$) by the square root of the product of the two sums of squares ($S_{xx}$ and $S_{yy}$). This normalization forces the result to lie between $-1$ and $+1$.
Formula for $r$ using Sum of Squares:
\[r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}\]
Interpretation: values of $r$ near $+1$ indicate a strong positive linear relationship, values near $-1$ a strong negative one, and values near $0$ little or no linear relationship.
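As a quick numerical sketch (pure Python, illustrative data), a perfectly linear relationship yields $r = \pm 1$:

```python
import math

def pearson_r(x, y):
    """Pearson's r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# y = 2x is perfectly linear with positive slope, so r = 1.
print(round(pearson_r([1, 2, 3], [2, 4, 6]), 6))   # 1.0
# Reversing the trend flips the sign: r = -1.
print(round(pearson_r([1, 2, 3], [6, 4, 2]), 6))   # -1.0
```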
Simple Linear Regression aims to find the line of best fit (the regression line) that minimizes the sum of squared errors between the observed $Y$ values and the predicted $\hat{Y}$ values.
Regression Equation (Line of Best Fit):
\(\hat{y} = b_0 + b_1 x\)
The slope is the estimated change in $\hat{y}$ for every one-unit increase in $x$. It is the ratio of the co-variability ($S_{xy}$) to the variability of the independent variable ($S_{xx}$).
Formula for the Slope:
\[b_1 = \frac{S_{xy}}{S_{xx}}\]
The intercept is the estimated value of $\hat{y}$ when $x = 0$. It ensures the regression line passes through the mean point $(\bar{x}, \bar{y})$ of the data.
Formula for the Intercept:
\(b_0 = \bar{y} - b_1 \bar{x}\)
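A minimal sketch of the two estimates (pure Python, illustrative data); the final assertion checks the property just stated, that the fitted line passes through $(\bar{x}, \bar{y})$:

```python
def fit_line(x, y):
    """Least-squares estimates: b1 = S_xy / S_xx, b0 = ybar - b1 * xbar."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

x = [1, 2, 3, 4, 5]        # illustrative data, not from the notes
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
# The fitted line passes through the mean point (xbar, ybar) = (3, 4).
assert abs((b0 + b1 * 3) - 4) < 1e-12
```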
The coefficient of determination is a measure of how well the regression line fits the data, specifically quantifying the proportion of variance in $Y$ explained by $X$.
Relationship to Correlation: In simple linear regression: $R^2 = r^2$.
Formula using Sum of Squares (Variance Decomposition):
\(R^2 = \frac{\text{Explained Variation (SSR)}}{\text{Total Variation (SST)}} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = \frac{SSR}{SST}\)
Interpretation: $R^2$ is the proportion of the total variability in $Y$ that is accounted for by the linear model with $X$.
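A sketch checking the variance decomposition and the identity $R^2 = r^2$ numerically (pure Python, illustrative data):

```python
import math

def r_squared(x, y):
    """R^2 = SSR / SST, with fitted values from the least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    y_hat = [b0 + b1 * xi for xi in x]
    ssr = sum((yh - my) ** 2 for yh in y_hat)  # explained variation
    sst = sum((yi - my) ** 2 for yi in y)      # total variation
    return ssr / sst

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]        # illustrative data, not from the notes
y = [2, 4, 5, 4, 5]
# In simple linear regression, R^2 equals r^2.
assert abs(r_squared(x, y) - pearson_r(x, y) ** 2) < 1e-12
```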
Given the dataset:
| | $X$ (Hours) | $Y$ (Score) | $X^2$ | $Y^2$ | $XY$ |
|---|---|---|---|---|---|
| | 2 | 60 | 4 | 3600 | 120 |
| | 4 | 75 | 16 | 5625 | 300 |
| | 5 | 80 | 25 | 6400 | 400 |
| | 7 | 90 | 49 | 8100 | 630 |
| | 8 | 95 | 64 | 9025 | 760 |
| $\Sigma$ | 26 | 400 | 158 | 32750 | 2210 |
| Mean | $\bar{x} = 5.2$ | $\bar{y} = 80$ | | | |
$S_{xx}$ (Variation in X): \(S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 158 - \frac{(26)^2}{5} = 158 - \frac{676}{5} = 158 - 135.2 = \mathbf{22.8}\)
$S_{yy}$ (Variation in Y): \(S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 32750 - \frac{(400)^2}{5} = 32750 - \frac{160000}{5} = 32750 - 32000 = \mathbf{750}\)
$S_{xy}$ (Co-variation): \(S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 2210 - \frac{(26)(400)}{5} = 2210 - \frac{10400}{5} = 2210 - 2080 = \mathbf{130}\)
Pearson’s $r$: \(r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{130}{\sqrt{(22.8)(750)}} = \frac{130}{\sqrt{17100}} \approx \frac{130}{130.77} \approx \mathbf{0.994}\)
Coefficient of Determination ($R^2$): \(R^2 = r^2 = (0.994)^2 \approx \mathbf{0.988}\)
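The hand calculation above can be reproduced with a few lines of pure Python; as a small extension (not computed in the notes), the same quantities also give the regression line for this data:

```python
import math

x = [2, 4, 5, 7, 8]        # hours studied
y = [60, 75, 80, 90, 95]   # exam scores
n = len(x)

# Shortcut formulas for the sums of squares.
sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
syy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
r = sxy / math.sqrt(sxx * syy)

print(round(sxx, 1), round(syy, 1), round(sxy, 1))  # 22.8 750.0 130.0
print(round(r, 3), round(r ** 2, 3))                # 0.994 0.988

# Extension: slope and intercept of the fitted line y-hat = b0 + b1 * x.
b1 = sxy / sxx
b0 = sum(y) / n - b1 * sum(x) / n
print(round(b1, 3), round(b0, 2))                   # 5.702 50.35
```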