Simple Linear Regression

Setting: One response variable with one quantitative explanatory variable.

Assumptions:
1) The response variable is normally distributed at each value of \(x\).
2) The observations (\(y\)-values) are independent.
3) The variance of the response is constant across values of \(x\).

A fundamental research goal is to investigate the manner in which a response variable depends on one or more explanatory variables. Regression modeling is a widely used technique for identifying a functional relationship between the response variable (also called the dependent variable) and explanatory variables (or independent variables). Regression provides the researcher with a model of a specified kind (linear, quadratic, logarithmic, etc.) that best fits the scatterplot of their data. In the setting of simple linear regression, there is a single explanatory variable \(x\), and the researcher attempts to fit the pattern in the scatterplot with a line of the form \(y=\beta_0+\beta_1x\). The coefficient \(\beta_0\) is the \(y\)-intercept, which is the value the model predicts for the response variable \(y\) when \(x=0\). The second coefficient \(\beta_1\) is the slope, which is the amount \(y\) tends to change when \(x\) is increased by one unit.
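In R, such a line is fit by least squares with the built-in lm() function. A minimal sketch with made-up data (all variable names here are illustrative):

    # fit y = b0 + b1*x to made-up data by least squares
    x <- c(1, 2, 3, 4, 5)
    y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
    fit <- lm(y ~ x)
    coef(fit)   # estimated intercept b0 and slope b1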

Before performing a linear regression, the researcher should first display the data visually in a scatterplot to determine whether a linear function is an appropriate choice for modeling the relationship between \(y\) and \(x\). Below are a few scatterplots depicting various trends. In graphs (a) and (b), there is a linear relationship between \(y\) and \(x\), so a linear regression model is appropriate. The first graph displays data collected in an experiment with repeated observations at each level of \(x\) chosen by the researcher, whereas the second graph displays a linear trend in \(x\)- and \(y\)-data collected on a random sample of subjects.

On the other hand, graphs (c) and (d) display trends that are non-linear. If there is a pronounced curved pattern in the scatterplot, the researcher should not calculate the correlation coefficient \(r\); instead, they should fit another type of function or transform the data. For example, graph (c) can be fit by a root function, and graph (d) can be fit by a quadratic function, as sketched below. The reader can consult the section on Multiple Regression for the non-linear cases.
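For instance, a curved pattern such as the one in graph (d) can be modeled within lm() by adding a squared term. A minimal sketch with made-up data:

    # fit a quadratic trend to made-up data with a parabolic pattern
    set.seed(2)
    x <- seq(0, 10, by = 0.5)
    y <- 2 + 0.5 * (x - 5)^2 + rnorm(length(x), sd = 0.5)
    quad <- lm(y ~ x + I(x^2))   # I() protects the squared term in the formula
    summary(quad)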

The scatterplot in (e) is an example of a weak linear trend. This occurs when the values of \(x\) do not strongly influence the values of \(y\), so a high percentage of the variation in the values of \(y\) is due to other factors not accounted for by treating \(y\) as a linear function of \(x\). The unexplained variation in \(y\) is modeled using the error term \(\varepsilon\).


Since the \(y\)-values are random, the observed slope of the regression line is random too. Thus, the researcher should test whether the slope \(\beta_1\) is non-zero, and thereby determine whether \(x\) has a statistically significant influence on \(y\).

The following hypothesis test assesses whether the explanatory variable \(x\) has a statistically significant effect on the response variable \(y\).

a) Hypothesis Test
Process: The researcher tests \(H_0: \beta_1 = 0\) against \(H_a: \beta_1 \neq 0\). The test statistic is \(t^* = \frac{b_1}{s(b_1)}\), where \(b_1\) is the estimated slope and \(s(b_1)\) is its estimated standard deviation. Under \(H_0\), \(t^*\) follows a t-distribution with \(N-2\) degrees of freedom, so the researcher rejects \(H_0\) when the p-value is less than the chosen significance level \(\alpha\).
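In R, this test is read off the standard regression output. A minimal sketch with simulated data (the names x, y, and mod are illustrative and are reused in the sketches that follow):

    # simulate a linear trend plus random error and fit the model
    set.seed(1)
    x <- 1:20
    y <- 3 + 0.5 * x + rnorm(20)
    mod <- lm(y ~ x)
    coef(summary(mod))   # Estimate, Std. Error, t value, Pr(>|t|) for b0 and b1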

b) Confidence Interval
The confidence interval for \(\beta_1\) provides an interval estimate of the change in the response variable \(y\) when the explanatory variable \(x\) is increased by 1 unit. The \((1-\alpha)100\%\) confidence interval for \(\beta_1\) is \(b_1 \pm t_{\alpha/2, N-2} \ s(b_1)\), where \(t_{\alpha/2, N-2}\) is the upper \(\alpha/2\) critical value of the t-distribution with \(N-2\) degrees of freedom.
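In R, this interval is returned by confint(). A minimal sketch, continuing with the model mod fitted above:

    confint(mod, level = 0.95)   # confidence intervals for beta0 and beta1
    # equivalently, b1 +/- t * s(b1) computed by hand:
    b1  <- coef(summary(mod))["x", "Estimate"]
    sb1 <- coef(summary(mod))["x", "Std. Error"]
    b1 + c(-1, 1) * qt(0.975, df = df.residual(mod)) * sb1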

a) The correlation \(r\) measures the direction and strength of the linear relationship between two quantitative variables, when the researcher has already decided there is a linear trend to model.

Facts about correlation: The range of the correlation \(r\) is between -1 and 1. A positive \(r\)-value indicates a positive association between the variables (as values of \(x\) increase, the values of \(y\) tend to increase as well), while a negative \(r\) indicates a negative association between the variables (as values of \(x\) increase, the values of \(y\) tend to decrease). A frequently used rule of thumb classifies the linear relationship as weak, moderate, or strong according to the magnitude of \(|r|\).

The correlation coefficient \(r\) should only be calculated when a linear trend has already been observed in the scatterplot. If \(r\) is calculated for a non-linear pattern, its value can be very misleading (see Anscombe’s quartet). For example, \(r\) could be near zero while there is a very strong parabolic pattern in the scatterplot indicative of a strong quadratic association between the response and explanatory variables (see for example graph (d) above).
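R ships with Anscombe’s quartet as the built-in data frame anscombe, so this warning is easy to verify: all four x-y pairs have nearly identical correlations even though only the first scatterplot is genuinely linear.

    # correlation of each Anscombe pair: all four are about 0.816
    sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))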


b) The coefficient of determination \(R^2\) is the fraction of the variation in the values of \(y\) that is accounted for by the linear regression line with \(x\). Therefore, \(R^2\) is a measure of how well the value of the response variable \(y\) is explained by the linear regression line with the explanatory variable \(x\). In the simple linear regression setting, the coefficient of determination is the square of the correlation coefficient \(r\). If \(r = 0.7\), then \(R^2 = 0.49\), so 49% of the variability of the response variable is accounted for by the linear regression line.
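The identity \(R^2 = r^2\) can be checked directly in R. A minimal sketch, continuing with x, y, and mod from the earlier sketch:

    cor(x, y)^2              # square of the correlation coefficient
    summary(mod)$r.squared   # R^2 reported by the fitted regression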

The value of \(R^2\) measures how well the model fits the present data; however, the true measure of a model is how well it performs in practice.

A confidence interval is used to estimate the mean value \(y\) will have at a specified \(x\)-value \(x_h\), whereas a prediction interval is used to predict a single new \(y\)-value at a specified \(x\)-value \(x_h\). For example, suppose a researcher has produced a linear regression function that relates the color loss \(y\) in a restoration to the material loss due to grinding. If the researcher wants to know how much color loss will occur on average when a specified amount of material \(x_h\) is removed, then they will calculate a confidence interval for this mean color loss. But if a dentist wants to use this model to predict how much color loss will occur at the same cut level \(x_h\) for their next patient, then they should calculate a prediction interval. In both cases, the intervals are centered at the \(y\)-value obtained from substituting the specified cut level \(x_h\) into the regression equation; however, the margin of error is greater when predicting a single \(y\)-value than when estimating the mean response.

a) Confidence interval for the mean response given \(x=x_h\)
One use of the regression model is to estimate the mean value the response variable \(y\) will have at a given value of the explanatory variable, say \(x =x_h\). The estimated value of this mean is found by plugging \(x_h\) into the formula for the regression line. However, since there is a random error in the model, a confidence interval should be used to estimate the mean value of \(y\) at this specified level for \(x\).

The \((1-\alpha)100\%\) confidence interval for the mean response given \(x=x_h\) is \(\hat{y}_h \pm t_{\alpha/2, N-2} \ s(\hat{y}_h)\), where \(\hat{y}_h=b_0+b_1 x_h\) is a point estimator for the mean response, and \(s(\hat{y}_h)=\sqrt{MSE\left\{ \frac{1}{N} + \frac{(x_h-\bar{x})^2}{\sum(x_i-\bar{x})^2} \right\}}\) is the estimated standard deviation of the mean response.
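This interval can be computed from the formula and checked against R’s predict() function. A minimal sketch, continuing with x, y, and mod from the earlier sketch (the value xh is illustrative):

    xh   <- 10                                # specified x-value
    n    <- length(y)
    mse  <- sum(residuals(mod)^2) / (n - 2)   # mean squared error
    xbar <- mean(x)
    se_h <- sqrt(mse * (1/n + (xh - xbar)^2 / sum((x - xbar)^2)))
    yh   <- coef(mod)[1] + coef(mod)[2] * xh  # point estimate of the mean response
    yh + c(-1, 1) * qt(0.975, n - 2) * se_h
    # agrees with predict(mod, data.frame(x = xh), interval = "confidence")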

b) Prediction interval for a single \(y\)-value given \(x=x_h\)
Another important use of the regression model is to predict a single new \(y\)-value at a specified \(x\)-value. The new observation to be predicted is typically the result of a new trial, independent of the trials on which the regression model was built. The \((1-\alpha)100\%\) prediction interval for the new response given \(x=x_h\) is \(\hat{y}_h \pm t_{\alpha/2, N-2} \ s(\hat{y}_{h(new)})\), where \(\hat{y}_h=b_0+b_1 x_h\) is a point estimator for the new response and \(s(\hat{y}_{h(new)})=\sqrt{MSE\left\{1+ \frac{1}{N} + \frac{(x_h-\bar{x})^2}{\sum(x_i-\bar{x})^2} \right\}}\) is the estimated standard deviation of the new response.
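Both intervals are available from predict(). A minimal sketch, continuing with the model mod from above:

    newx <- data.frame(x = 10)                    # illustrative x_h
    predict(mod, newx, interval = "confidence")   # CI for the mean response
    predict(mod, newx, interval = "prediction")   # wider PI for a single new y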

Prediction intervals and confidence intervals differ conceptually. A confidence interval is used to estimate the value of a population parameter such as a mean, whereas a prediction interval is used to estimate the value of the response variable on a new trial performed independently of the data used to produce the regression model. The confidence interval for the mean response at each given value of \(x\) has a margin of error that is smaller than the corresponding prediction interval for estimating a single new \(y\)-value.

The margin of error for a confidence or prediction interval grows larger as the specified \(x\)-value \(x_h\) gets further from the sample mean \(\bar{x}\). This corresponds to increasing uncertainty when extrapolating beyond the observed data points; however, the researcher should still be cautious in any attempt to apply the model to values of \(x\) beyond those included in the study, because there is no guarantee that the linear trend will continue.
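This behavior can be seen by comparing interval widths at different x-values. A minimal sketch, continuing with x and mod from above (the grid values are illustrative; the last one extrapolates beyond the data):

    grid <- data.frame(x = c(mean(x), max(x), max(x) + 10))
    ci <- predict(mod, grid, interval = "confidence")
    ci[, "upr"] - ci[, "lwr"]   # widths grow as x_h moves away from the mean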

Effects of Grinding of Dental Restorations on Tooth Color: 12 specimens of a composite dental material stained in target color A3.5 are ground down in 9 steps, resulting in cumulative reductions of 20µm, 50µm, 100µm, 150µm, 200µm, 250µm, 300µm, 400µm, and 500µm. At each step, the lightness (L*, from black to white) of the specimens was measured with a spectroradiometer. In this setting, the explanatory variable is the material loss in µm, and the response variable is the lightness L*. The researcher observes a linear trend in the scatterplot, so a simple linear regression is performed and the model obtained is Lightness = 73.71 + 0.016\(\times\)Cut Level.

The researcher concludes from the scatterplot that the assumption of constant variance is satisfied. The normal probability plot for L* shows a pronounced deviation from normality in that the distribution of L* is light-tailed; however, this means the normality assumption fails in a way that makes the hypothesis test more conservative (i.e., the true p-value is smaller than the p-value reported by the software). So, if the reported p-value is less than \(\alpha=0.05\), then it is still correct to reject the null hypothesis.

The hypothesis test of \(\beta_1=0\) vs \(\beta_1 \neq 0\) gives a p-value of 2E-16, so the researcher rejects the null hypothesis and concludes that cut level has a statistically significant effect on lightness L*. The correlation coefficient \(r\) is 0.918, which shows a very strong linear association between lightness and cut level. The coefficient of determination is \(R^2 = 84.19\%\), so only 15.81% of the variability in lightness is not accounted for by the linear regression line with cut level as the explanatory variable. The 95% confidence interval for the slope \(\beta_1\) is \(0.016 \pm 0.001\), so an increase in the cut level of 1µm results in an increase in lightness of between 0.015 and 0.017.

The researcher is now interested in estimating the mean value of L* at the specified cut level \(x_h=230\)µm. The 95% confidence interval for the mean value of \(y\) at this level is \(77.33 \pm 0.20\). The researcher may also want to predict the lightness of a single new observation; the 95% prediction interval at the same level \(x_h=230\)µm is \(77.33\pm2.16\). The reader should note that predicting lightness for a new single observation carries a much larger margin of error than estimating the mean value of lightness. For another specified \(x\)-value, \(x_h=350\)µm, the 95% confidence interval for the mean value of lightness is \(79.22 \pm 0.27\), and the 95% prediction interval for a new observation at the same cut level is \(79.22 \pm 2.17\). Since \(x_h=350\)µm is further from the sample mean \(\bar{x}=218\)µm, larger margins of error are obtained for both intervals.

Wear of Resin Composite: A researcher is interested in studying the effect of wear on a specific resin composite (Filtek Z250). In a laboratory setting, the researcher subjects 7 resin specimens to simulated wear challenges of 50, 100, 150, 300, 500, 750, and 1,000 thousand cycles. The wear (mean facet depth in µm) is measured using a profilometer. In this setting, the explanatory variable is the number of cycles (in thousands), and the response variable is the wear depth in µm. The researcher observes a linear trend in the scatterplot, so a simple linear regression is performed and the model obtained is Wear Depth = 28.04 + 1.25\(\times\)Cycles. Both the assumptions of normality and constant variance are satisfied.

A p-value of 2E-16 is obtained for the hypothesis test of \(\beta_1=0\) versus \(\beta_1 \neq 0\), so the researcher concludes that the number of cycles has a statistically significant effect on wear depth. The correlation coefficient \(r\) is 0.993, so a very strong linear association exists between wear depth and number of cycles. The coefficient of determination is \(R^2 = 98.57\%\), so only 1.43% of the variability in wear depth is not accounted for by the linear regression line with the number of cycles as the explanatory variable. The researcher concludes that the \(y\)-value (wear depth) is almost entirely determined by the regression line. The 95% confidence interval for the slope \(\beta_1\) is \(1.25 \pm 0.036\), so each additional thousand cycles results in an increase in wear depth of between 1.214 µm and 1.286 µm; in this setting, the slope \(\beta_1\) is the wear rate of the resin composite. The 95% confidence interval for the mean wear depth at a specified level of \(x_h=400\) (thousand cycles) is \(528.51 \pm 12.16\), and the 95% prediction interval for a new observation at the same level is \(528.51 \pm 102.57\).


R Code and Examples

Visualizing Scatterplot for Regression: R script file

###-----------------------
### create scatterplots 
###-----------------------

#1# store data

    xdata<-c(73.24,
             73.22,
             74.39,
             73.28,
             73.75,
             75.26,
             76.46,
             74.35,
             74.46,
             76.51,
             77.3,
             77.31,
             74.61,
             77.41,
             78.21,
             78.08,
             78.38,
             75.13,
             76.74,
             78.86,
             79.21
    )
    ydata<-c(20.0000,
             18.9167,
             21.0833,
             45.1667,
             50.0833,
             39.5833,
             97.1667,
             101.1667,
             96.6667,
             153.3333,
             151.1667,
             148.6667,
             199.3333,
             200.8333,
             200.5833,
             266.4167,
             258.1667,
             261.5833,
             298.6667,
             304.8333,
             299.1667
    )

#2# graph points
    
    #example 1 with basic graphing functionality
    plot(xdata,ydata)

    #add regression line
    abline(lm(ydata~xdata))
    
    #example 2 with more graphing options
    plot(x=xdata,
         y=ydata,
         pch=21,#plotting character is circle
         bg="blue",#fill color of circle
         col="blue",#boundary color of circle
         cex=.75, #75% of normal circle size
         xlab="Xlabel",
         ylab="Ylabel",
         main="Title",
         cex.main=2, #200% of normal title size
         cex.lab=1.25 #125% of normal axes label size
    )
    
    abline(lm(ydata~xdata),
           col="red", #line color
           lwd=1.5 #150% of normal line width
    )    

Regression Example 1 – Effects of Grinding of Dental Restorations on Tooth Color: Lightness Data and R script file

###-----------------------
### Regression Analysis 
###-----------------------

# Example 1 - Effects of Grinding of Dental Restorations on Tooth Color
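# The Lightness data frame is assumed to be loaded from the linked data file;
# the file name below is illustrative:
# Lightness <- read.csv("Lightness.csv")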

# Correlation
corr = cor(Lightness$CutLevel, Lightness$L)
corr

# Check Normality
qqnorm(Lightness$L)
qqline(Lightness$L)

# Regression model
mod <- lm(L~CutLevel, data = Lightness)
summary(mod)

# Confidence interval for model parameters beta0 and beta1
confint(mod)

# Graph with confidence intervals and prediction intervals
my.pred <- predict(mod, interval="confidence") 
plot( x=Lightness$CutLevel,y=Lightness$L, col = "blue",bg="blue", pch=21, cex=.5, 
      xlab=expression(paste("X = Cut Level (", mu, "m)")), ylab = "Y = Lightness")
ord <- order(Lightness$CutLevel)   # sort by CutLevel so the lines draw left to right
lines(Lightness$CutLevel[ord], my.pred[ord, 1], col="red")
lines(Lightness$CutLevel[ord], my.pred[ord, 2], col="deepskyblue1", lty=2) 
lines(Lightness$CutLevel[ord], my.pred[ord, 3], col="deepskyblue1", lty=2)

my.newpred <- predict(mod, interval = "prediction")
lines(Lightness$CutLevel[ord], my.newpred[ord, 2], col = "chartreuse2", lty = 5) 
lines(Lightness$CutLevel[ord], my.newpred[ord, 3], col = "chartreuse2", lty = 5)

legend("topleft", legend=c("95% CI", "95% PI"),
       col=c("deepskyblue1", "chartreuse2"), lty=c(2,5), cex=0.8)

# Confidence and Prediction Intervals given Xh=230 and Xh=350
newdata <- data.frame(CutLevel = c(230, 350)) 
predict(mod, newdata, interval = "confidence", level =.95)
predict(mod, newdata, interval = "prediction", level =.95)

Regression Example 2 – Wear of Resin Composite: R script file

###-----------------------
### Regression Analysis 
###-----------------------

# Example 2 - Wear of Resin Composite
# Generate Data
set.seed(168585)
beta0 = 16.5
beta1 = 1.27
n <- 10
sdev <- 50
Cycles <- rep(c(50, 100, 150, 300, 500, 750, 1000), n)
Wear_Depth <- rnorm(70, (beta0+beta1*Cycles), sd = sdev)
mydat <- data.frame(Wear_Depth=Wear_Depth, Cycles=Cycles)
corr = cor(mydat$Wear_Depth, mydat$Cycles)
corr^2
mean(Cycles)

# Create a simple linear regression model
mod <- lm(Wear_Depth~Cycles, data = mydat)
summary(mod)
# Confidence interval for model parameters beta0 and beta1
confint(mod)

# Graph with confidence intervals and prediction intervals
my.pred <- predict(mod, interval="confidence") 
plot(mydat$Wear_Depth~mydat$Cycles, col = "blue",bg="blue", pch=21, cex=.5, 
      ylab=expression(paste("Y = Wear Depth (", mu, "m)")), xlab = "X = Cycles")
ord <- order(mydat$Cycles)   # sort by Cycles so the lines draw left to right
lines(mydat$Cycles[ord], my.pred[ord, 1], col="red")
lines(mydat$Cycles[ord], my.pred[ord, 2], col="deepskyblue1", lty=2) 
lines(mydat$Cycles[ord], my.pred[ord, 3], col="deepskyblue1", lty=2)

my.newpred <- predict(mod, interval = "prediction")
lines(mydat$Cycles[ord], my.newpred[ord, 2], col = "chartreuse2", lty = 5) 
lines(mydat$Cycles[ord], my.newpred[ord, 3], col = "chartreuse2", lty = 5)

legend(50,1200, legend=c("95% CI", "95% PI"),
       col=c("deepskyblue1", "chartreuse2"), lty=c(2,5), cex=0.8)

# Confidence and Prediction Intervals given Xh=400 and Xh=550
newdata <- data.frame(Cycles = c(400, 550)) 
predict(mod, newdata, interval = "confidence", level =.95)
predict(mod, newdata, interval = "prediction", level =.95)
