Nate's Notes

Collection of notes for various classes I've taken.

October 17

Formal Calculation of Correlation and Simple Linear Regression

To formally determine the correlation coefficient and the regression line, we use the sum of squares notation ($S_{xx}$, $S_{yy}$, $S_{xy}$). This notation simplifies the formulas for both concepts.

Sum of Squares Notation

The core of linear analysis involves measuring the variability within a single variable (variance) and the co-variability between two variables (covariance).

| Term | Definition | Formula | Purpose |
|------|------------|---------|---------|
| $S_{xx}$ | Sum of squares for $X$ | $\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$ | Measures the total variation in the independent variable $X$. |
| $S_{yy}$ | Sum of squares for $Y$ | $\sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$ | Measures the total variation in the dependent variable $Y$. |
| $S_{xy}$ | Sum of products for $X$ and $Y$ | $\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$ | Measures the co-variability between $X$ and $Y$. |
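
Dividing each of these quantities by $n - 1$ gives the familiar sample statistics mentioned above: the sample variances $s_x^2 = S_{xx}/(n-1)$ and $s_y^2 = S_{yy}/(n-1)$, and the sample covariance $s_{xy} = S_{xy}/(n-1)$.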

Pearson’s Correlation Coefficient ($r$)

The Pearson coefficient ($r$) is calculated by dividing the co-variation term ($S_{xy}$) by the square root of the product of the two sums of squares ($S_{xx}$ and $S_{yy}$). This normalization forces the result to lie between $-1$ and $+1$.
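
In sum-of-squares notation:

\(r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}\)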

Simple Linear Regression (SLR)

Simple Linear Regression aims to find the line of best fit (the regression line) that minimizes the sum of squared errors between the observed $Y$ values and the predicted $\hat{Y}$ values.
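
In other words, the fitted line $\hat{y} = b_0 + b_1 x$ is chosen so that $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is as small as possible (the least-squares criterion).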

A. The Slope ($b_1$)

The slope is the estimated change in $\hat{y}$ for every one-unit increase in $x$. It is the ratio of the co-variability ($S_{xy}$) to the variability of the independent variable ($S_{xx}$).
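
Using the sum-of-squares terms:

\(b_1 = \frac{S_{xy}}{S_{xx}}\)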

B. The Intercept ($b_0$)

The intercept is the estimated value of $\hat{y}$ when $x = 0$. It ensures the regression line passes through the mean point ($\bar{x}$, $\bar{y}$) of the data.
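
It is obtained from the sample means and the slope:

\(b_0 = \bar{y} - b_1 \bar{x}\)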

Coefficient of Determination ($R^2$)

The coefficient of determination is a measure of how well the regression line fits the data, specifically quantifying the proportion of variance in $Y$ explained by $X$.
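
For simple linear regression it is simply the square of Pearson's correlation coefficient: \(R^2 = r^2\).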

Example

Given the dataset:

| $X$ (Hours) | $Y$ (Score) | $X^2$ | $Y^2$ | $XY$ |
|---|---|---|---|---|
| 2 | 60 | 4 | 3600 | 120 |
| 4 | 75 | 16 | 5625 | 300 |
| 5 | 80 | 25 | 6400 | 400 |
| 7 | 90 | 49 | 8100 | 630 |
| 8 | 95 | 64 | 9025 | 760 |
| $\Sigma x = 26$ | $\Sigma y = 400$ | $\Sigma x^2 = 158$ | $\Sigma y^2 = 32750$ | $\Sigma xy = 2210$ |

With $n = 5$: $\bar{x} = \frac{26}{5} = 5.2$ and $\bar{y} = \frac{400}{5} = 80$.

A. Calculate Sum of Squares

  1. $S_{xx}$ (Variation in X): \(S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 158 - \frac{(26)^2}{5} = 158 - \frac{676}{5} = 158 - 135.2 = \mathbf{22.8}\)

  2. $S_{yy}$ (Variation in Y): \(S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 32750 - \frac{(400)^2}{5} = 32750 - \frac{160000}{5} = 32750 - 32000 = \mathbf{750}\)

  3. $S_{xy}$ (Co-variation): \(S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 2210 - \frac{(26)(400)}{5} = 2210 - \frac{10400}{5} = 2210 - 2080 = \mathbf{130}\)

B. Determine Correlation ($r$) and $R^2$

  1. Pearson’s $r$: \(r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{130}{\sqrt{(22.8)(750)}} = \frac{130}{\sqrt{17100}} \approx \frac{130}{130.77} \approx \mathbf{0.994}\)

  2. Coefficient of Determination ($R^2$): \(R^2 = r^2 = (0.994)^2 \approx \mathbf{0.988}\)

C. Determine Regression Line ($\hat{y} = b_0 + b_1 x$)

  1. Calculate Slope ($b_1$): \(b_1 = \frac{S_{xy}}{S_{xx}} = \frac{130}{22.8} \approx \mathbf{5.702}\)
    • Interpretation: For every extra hour studied, the exam score is predicted to increase by about 5.7 points.
  2. Calculate Intercept ($b_0$): \(b_0 = \bar{y} - b_1 \bar{x} = 80 - (5.702)(5.2) \approx 80 - 29.649 \approx \mathbf{50.351}\)
    • Interpretation: A student who studies zero hours is predicted to score approximately 50.4 on the exam.
  3. Final Regression Equation: \(\hat{y} = 50.351 + 5.702 x\)
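
The entire worked example can be reproduced programmatically. The following is a minimal Python/NumPy sketch (variable names are my own, and it assumes NumPy is installed); it recomputes the sums of squares, $r$, $R^2$, and the regression coefficients from the raw data above.

```python
import numpy as np

# Hours studied (X) and exam scores (Y) from the worked example
x = np.array([2, 4, 5, 7, 8], dtype=float)
y = np.array([60, 75, 80, 90, 95], dtype=float)
n = len(x)

# Sum-of-squares quantities (computational formulas)
s_xx = np.sum(x**2) - np.sum(x)**2 / n            # 22.8
s_yy = np.sum(y**2) - np.sum(y)**2 / n            # 750.0
s_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n  # 130.0

# Pearson's r and the coefficient of determination
r = s_xy / np.sqrt(s_xx * s_yy)  # ~0.994
r_squared = r**2                 # ~0.988

# Least-squares slope and intercept
b1 = s_xy / s_xx               # ~5.702
b0 = y.mean() - b1 * x.mean()  # ~50.351

print(f"S_xx = {s_xx:.1f}, S_yy = {s_yy:.1f}, S_xy = {s_xy:.1f}")
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```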