Overview

The objective of this tutorial is to provide students with an introduction to linear regression using R. This tutorial covers the following topics:

  • General overview of simple and multiple linear regression
  • Running simple and multiple linear regression using R
  • Testing the significance of variables
  • Measures of model fit
  • Residual Analysis
  • Identifying outliers, leverage, and influential points

What is Simple Linear Regression?

Consider the following data pairs which are in the form \((x_1,y_1 ),(x_2,y_2), \ldots ,(x_n,y_n)\) for two variables \(x\) and \(y\).

\[ \begin{aligned} (2,14), (5,30), (8,46), (10, 56), (11,59), (12,62), (13,67),(15,80),& \\ (18,92), (20,106), (23,118), (25,126), (27,137), (28,142), (30,151)& \end{aligned} \]

The scatter plot of these data is shown below.
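
If you want to reproduce this scatter plot yourself, here is a minimal R sketch using the data pairs listed above (the vector names x and y are my own):

> x <- c(2, 5, 8, 10, 11, 12, 13, 15, 18, 20, 23, 25, 27, 28, 30)
> y <- c(14, 30, 46, 56, 59, 62, 67, 80, 92, 106, 118, 126, 137, 142, 151)
> plot(x, y, pch = 16, cex = 1.3, col = "blue", xlab = "x", ylab = "y")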

The goal of simple linear regression is to develop a linear function to explain the variation in \(y\) based on the variation in \(x\). For the above data, the following linear function best explains the relationship between \(y\) and \(x\)

\[ y = 5.54 + 4.87x \]

In this model 5.54 is called the intercept and 4.87 is called the slope. How did we arrive at these values of the slope and the intercept? How do we know \(y = 5.54 + 4.87x\) is the best model? The next section attempts to answer these questions.

Ordinary Least Squares Regression Line

This section is inspired by the example in Chapter 4 of Barreto and Howland (2006). Go to the following website: ShinyApp. Change the values of the intercept and slope. Visually try to find the intercept and slope which best represent the data.

Let us say the best values we got were 14 for the intercept and 12 for the slope.

\[ y = 14 + 12x \]

Let us look at \(x=5\). Corresponding to \(x=5\) there are two values of \(y\), the actual observed value and the predicted value. The predicted value of \(y\) is \(14 + 12 \times 5 = 74\) and the actual value of \(y\) is 79. The residual is the difference between the actual and predicted value.

Now observe how the Residual Sum of Squares changes with intercept and slope. What is the Residual Sum of Squares?

For a specific value of slope and intercept:

  • For each value of \(x\), calculate the residual which is the difference between the observed and the predicted value.
  • Square the residuals and sum them up across all values of \(x\). This can be thought of as a measure of error.

Why do we square the residuals? This is to prevent negative residuals and positive residuals from canceling each other out. Observations which are far off from the line are poorly predicted. It does not matter if they are above or below the line.

The goal in linear regression is to choose the slope and intercept such that the Residual Sum of Squares is as small as possible. Excel and R have functions which will automatically calculate the values of the slope and intercept that minimize the Residual Sum of Squares.
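
As a concrete illustration, the sketch below (reusing the x and y vectors defined earlier) computes the Residual Sum of Squares for the line reported above and compares it with the value at the lm() estimates; any other choice of intercept and slope should give a larger value.

> rss <- function(b0, b1, x, y) sum((y - (b0 + b1 * x))^2)
> rss(5.54, 4.87, x, y)                # RSS for the line reported earlier
> sum(residuals(lm(y ~ x))^2)          # RSS at the least squares estimates (the minimum)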

If the Shiny app is not working, you can repeat the above exercise using the Excel workbook Reg.xls. Go to the sheets ByEye and minSSRes.

Mathematical Foundations of Linear Regression

In simple linear regression we are fitting a function of the form:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

\(\epsilon\) corresponds to the error term, which is assumed to have zero mean and constant variance, and to be uncorrelated and normally distributed. If these assumptions are not valid, then the regression model is not acceptable.

In a multiple linear regression we are fitting a function of the form:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \epsilon \]

\(y\) is called the dependent variable.

\(x_1,x_2, \ldots ,x_n\) are called independent variables. They are also referred to as predictors, covariates, and regressors in statistical literature.

\(\beta_0,\beta_1,\beta_2, \ldots , \beta_n\) are called regression coefficients.

Note that because of the assumptions on \(\epsilon\), \(y\) will have a normal distribution with expected value:

\[ E[y \mid x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n \] If the constant variance of the error term is \(\sigma^2\), then the variance of \(y\) is:

\[ Var[y \mid x] = \sigma^2 \]

Simple Linear Regression in R

Open the dataset WatCon.csv in RStudio. This dataset corresponds to example 6.12 in Tang and Ang (2007). We want to study water consumption as a function of population.

> setwd("C:/Users/avinash/Dropbox/Teaching/Tutorial/Linear Regression")
> WCData <- read.csv("WatCon.csv", header = TRUE)
> WCData
   CITY    POP  WC
1     1  50000 100
2     2 100000 110
3     3 200000 110
4     4 250000 113
5     5 300000 125
6     6 400000 130
7     7 500000 130
8     8 600000 145
9     9 700000 155
10   10 800000 150

The first key assumption in linear regression is the existence of a linear relationship between \(y\) and \(x\). To verify this, make sure the scatter plot looks linear.

> plot(WCData$POP, WCData$WC, xlab = "Population", ylab = "Water Consumption", pch = 16, cex = 1.3, col = "blue")

> WConLR <- lm(WC~POP,data=WCData)
> summary(WConLR)

Call:
lm(formula = WC ~ POP, data = WCData)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.0967 -3.6530 -0.0098  3.7402  6.0488 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 9.893e+01  2.846e+00   34.76 5.13e-10 ***
POP         7.145e-05  6.203e-06   11.52 2.93e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.74 on 8 degrees of freedom
Multiple R-squared:  0.9431,    Adjusted R-squared:  0.936 
F-statistic: 132.7 on 1 and 8 DF,  p-value: 2.925e-06

In the above expression, lm() is the function which performs linear regression. WConLR is the name of the model object which contains the results of the linear regression.

Within lm(), you will find the formula WC ~ POP, which means WC is the dependent variable and POP is the independent variable. summary(ModelName) outputs all the necessary information. The Estimate column provides the coefficients.

  • The intercept is 98.93 and the slope is 0.00007145. If the population increases by 1, does the water consumption increase or decrease? By how much does it increase or decrease?
  • Does this mean that the impact of population is insignificant?

Now create a new variable NEWPOP which measures the population in 1000s.

> WCData$NEWPOP <- WCData$POP / 1000
> WConLR1 <- lm(WC~NEWPOP,data=WCData)
> summary(WConLR1)

Call:
lm(formula = WC ~ NEWPOP, data = WCData)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.0967 -3.6530 -0.0098  3.7402  6.0488 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 98.932363   2.845933   34.76 5.13e-10 ***
NEWPOP       0.071455   0.006203   11.52 2.93e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.74 on 8 degrees of freedom
Multiple R-squared:  0.9431,    Adjusted R-squared:  0.936 
F-statistic: 132.7 on 1 and 8 DF,  p-value: 2.925e-06
  • What is the intercept?
  • What is the coefficient of NEWPOP?

The magnitude of the coefficients depends on the units in which the independent variables are measured. Therefore, we cannot comment on the significance of a variable based on the magnitude of its coefficient alone.

If you want to determine whether an independent variable has a significant impact on the dependent variable, look at the t value column. If the absolute value of the t value is greater than 1.645, then that variable has a significant impact on the dependent variable at the 90% confidence level. Another way to study the impact of variables is to look at the p value, which corresponds to the Pr(>|t|) column. If the absolute value of the t value is small (less than 1.645 at 90% confidence), we have to remove that variable. If we are testing at 95% confidence, then the absolute value of the t value must be greater than 1.96.

The intercept and slope are sample estimates. When we check the significance of the independent variables, we are conducting the following hypothesis test to check if the true value of the coefficient (in the population) is zero or not.

\[ \begin{aligned} H_0: \beta_{1}=0 & \\ H_a: \beta_{1} \neq 0 & \end{aligned} \]

The t statistic or the p value column helps us make this decision. In this case, since the absolute value of the t value (11.52) is much greater than 1.96, we reject the null at the 95% confidence level. Hence, we reject the hypothesis that the coefficient is equal to 0 in the population at the 95% confidence level, and the variable is significant.
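
If you want to see where the t value and p value come from, here is a sketch using the fitted WConLR1 model: the t value is the estimate divided by its standard error, and the p value is the two-sided tail probability of a t distribution with the residual degrees of freedom.

> coefs <- coef(summary(WConLR1))                  # Estimate, Std. Error, t value, Pr(>|t|)
> tval <- coefs["NEWPOP", "t value"]               # equals Estimate / Std. Error
> 2 * pt(-abs(tval), df = df.residual(WConLR1))    # two-sided p value, matches Pr(>|t|)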

Calculating the coefficient estimates

Let \(\hat{\beta}_0\) and \(\hat{\beta}_1\) denote the sample estimate of \(\beta_0\) and \(\beta_1\) respectively. If you want to verify the results from R, you can estimate these values using the following formula:

\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{ \left(\sum_{i=1}^{n} x_i \right)^2}{n} \]

\[ S_{xy} = \sum_{i=1}^{n} x_iy_i - \frac{ \left(\sum_{i=1}^{n} x_i \right) \left(\sum_{i=1}^{n} y_i \right)}{n} \] \[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \] \[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

where

\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \]

and

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

For the above example,

\[ \sum_{i=1}^{n} x_i = 3900 \]

\[ \sum_{i=1}^{n} y_i = 1268 \] \[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 390 \] \[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = 126.8 \] \[ \sum_{i=1}^{n} x_i^2 = 2105000 \] \[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{ \left(\sum_{i=1}^{n} x_i \right)^2}{n} = 2105000 - \frac{15210000 }{10} = 584000 \] \[ S_{xy} = \sum_{i=1}^{n} x_iy_i - \frac{ \left(\sum_{i=1}^{n} x_i \right) \left(\sum_{i=1}^{n} y_i \right)}{n} = 536250 - \frac{4945200}{10} = 41730 \] \[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{41730}{584000} =0.07145 \]

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 126.8 - 0.07145\times 390 = 98.93 \] Therefore, the linear regression equation can be written as,

\[ \hat{y} = 98.93 + 0.07145x \] where \(\hat{y}\) is the predicted value of \(y\).
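
These calculations can be reproduced in R; below is a sketch using the WCData data frame and the NEWPOP variable created earlier (the names n, Sxx, and Sxy are my own).

> n <- nrow(WCData)
> Sxx <- sum(WCData$NEWPOP^2) - sum(WCData$NEWPOP)^2 / n
> Sxy <- sum(WCData$NEWPOP * WCData$WC) - sum(WCData$NEWPOP) * sum(WCData$WC) / n
> Sxy / Sxx                                            # slope, about 0.07145
> mean(WCData$WC) - (Sxy / Sxx) * mean(WCData$NEWPOP)  # intercept, about 98.93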

Measures of Fit

When we ran the summary() command, we also got several measures of fit at the end of the output: the residual standard error, the R-squared, the adjusted R-squared, and the F-statistic. Let us look at how these are calculated and what they imply.

In the linear regression model, one of the assumptions we made was that the residuals have constant variance. Let \(\sigma^2\) denote this variance.

The sample estimate of this variance \(\hat{\sigma}^2\) is given as

\[ \hat{\sigma}^2 = \frac{ \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 }{n-2} = 22.47 \]

\[ \hat{\sigma} = 4.74 \] The estimate \(\hat{\sigma} = 4.74\) is the residual standard error.

The coefficient of determination \(R^2\) provides an estimate of the proportion of variation in \(y\) explained by the variation in \(x\) and is given as:

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y}_i \right)^2}= 1 - \frac{179.76}{3161.6} = 0.9431 \] This information is given as Multiple R-squared in the R output. For the current example, 94.31 percent of the variability in \(y\), which is water consumption, is explained by the regression model. One must be careful in using \(R^2\) to evaluate how good the model is, particularly for multiple linear regression. More on that later.

Several statisticians prefer to use another metric called Adjusted R-squared which is shown below:

\[ R^2_{Adj} = 1 - \frac{ \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{n-k}}{ \frac{\sum_{i=1}^{n} \left( y_i - \bar{y}_i \right)^2}{n-1}} = 1 - \frac{ \frac{179.76}{8}}{ \frac{3161.6}{9}} = 0.9360 \] where \(k\) is the number of parameters being estimated, which in this example is 2 (intercept and slope).

The F-statistic (\(F_0\)) is another way of testing the significance of the independent variable. Essentially we are doing the hypothesis test:

\[ \begin{aligned} H_0: \beta_{1}=0 &\\ H_a: \beta_{1} \neq 0& \end{aligned} \] \[ F_{0} = \frac{ \sum_{i=1}^{n} \left( \hat{y}_i - \bar{y}_i \right)^2}{ \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{n-k}} = \frac{2981.84}{ \frac{179.76}{8}} = 132.7 \]

We can perform the hypothesis test by reading the \(F_{\alpha,1,n-k}\) value from an F distribution table and reject the null if \(F_0 > F_{\alpha,1,n-k}\). A quicker way is to look at the p value and reject \(H_0\) if the p value is less than \(\alpha\). In this case, the p value (2.925e-06) is well below \(\alpha = 0.05\). Therefore, we reject \(H_0\) and conclude that population is a significant variable. This test is more useful for multiple linear regression models.
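
The quantities in this section can be recomputed from the fitted WConLR1 model. Here is a sketch; the values in the comments are approximate and should match the summary() output above.

> e <- residuals(WConLR1)
> n <- nrow(WCData)
> SSres <- sum(e^2)                              # about 179.76
> SStot <- sum((WCData$WC - mean(WCData$WC))^2)  # about 3161.6
> sqrt(SSres / (n - 2))                          # residual standard error, about 4.74
> 1 - SSres / SStot                              # R-squared, about 0.9431
> 1 - (SSres / (n - 2)) / (SStot / (n - 1))      # adjusted R-squared, about 0.936
> (SStot - SSres) / (SSres / (n - 2))            # F-statistic, about 132.7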

Residual Analysis

In the linear regression model, the error term is assumed to have zero mean and constant variance, and to be uncorrelated and normally distributed. Therefore, in order to validate that the error term has these properties, we will have to conduct further tests on the residuals.

The zero-mean assumption is easily verified: determine the residuals and take their average; it should be close to 0.

> WCData$resid <- residuals(WConLR1)
> print(mean(WCData$resid))
[1] 2.664969e-16

The residuals (\(e_i\)) are also useful in identifying outliers or critical data points which should be examined further or removed. When analyzing residuals, it is generally easier to work with scaled residuals. Two popular methods of scaling are standardized residuals (\(d_i\)) and studentized residuals (\(r_i\)) for data point \(i\). According to classic books on regression, such as Montgomery, Peck, and Vining (2012), the standardized and studentized residuals can be calculated as follows:

\[ \begin{aligned} e_i &= y_i - \hat{y}_i, &i= 1,2, \ldots, n \\ d_i &= \frac{e_i}{\hat{\sigma}},& i= 1,2, \ldots, n \\ r_i &= \frac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}},& i= 1,2, \ldots, n \end{aligned} \]

where \(h_{ii}\) denotes what is called the leverage of point \(i\). More on leverage later.
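
These formulas can be applied directly in R. Below is a sketch using the fitted WConLR1 model, where hatvalues() returns the leverages \(h_{ii}\) and summary(WConLR1)$sigma returns \(\hat{\sigma}\); the object names are my own.

> e <- residuals(WConLR1)
> s <- summary(WConLR1)$sigma        # residual standard error, sigma hat
> h <- hatvalues(WConLR1)            # leverages h_ii
> d <- e / s                         # standardized residuals (textbook definition)
> r <- e / (s * sqrt(1 - h))         # studentized residuals; this matches rstandard(WConLR1)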

The standardized and studentized residuals can be calculated using the rstandard() and rstudent() commands. But here is the slightly confusing part: what R calls standardized residuals (those returned by the rstandard() command) are studentized residuals according to the Montgomery, Peck, and Vining (2012) book. The scaled residuals returned by the rstudent() command correspond to the externally studentized residuals (the formula on page 135 of Montgomery, Peck, and Vining (2012)). If you find the discussion of the different scaled residuals tedious, don't worry about it. My recommendation is to pick the residuals returned by the rstandard() command and proceed with the analysis.

> WCData$residstd <- rstandard(WConLR1)
> print(WCData$residstd)
 [1] -0.6307264  0.9515981 -0.7427578 -0.8603577  1.0378161  0.5527380
 [7] -1.0483993  0.7421398  1.4879529 -1.6438477

If the absolute value of any of the standardized residuals is high (\(>3\)), that indicates an outlier and the point should be examined further. In this case, there are no outliers.

Now let us check if the error term, i.e., the residuals, is normally distributed. We do that using a normal probability (Q-Q) plot and a histogram. The normal probability plot can be created using the qqnorm() function.

> qqnorm(WCData$residstd, ylab="Standardized Residuals", xlab="Normal Scores", main="Probability Plot") 
> qqline(WCData$residstd)

> hist(WCData$residstd, xlab="Standardized Residuals", ylab="Frequency") 

If the standardized residuals are truly normal, they should lie exactly or very close to the line in the probability plot. In this case, there appears to be a deviation from normality which is a red flag. The deviation from normality is also verified by the histogram.

The next step is to plot the standardized residuals versus the fitted values.

> plot(fitted(WConLR1), WCData$residstd, ylab="Standardized Residuals", xlab="Fitted Values") 
> abline(0,0)

In this plot, make sure that the residuals are randomly distributed about the 0 line. The residuals must also be contained within a horizontal band around the zero line. There should not be any discernible pattern in the residuals. There is a lot of subjectivity involved in this decision. If the residuals satisfy these properties (which they do in this case), then the constant variance assumption is satisfied.

If the standardized residuals start showing patterns, that implies the variance is not constant. This could mean the model is missing an important independent variable, a transformation of the data is required, or the true relationship between the dependent and independent variables is not linear.

For example, the plot shown below clearly shows a case where the variance increases with the fitted values and outliers are present.

Multiple Linear Regression

Now let us take the case of multiple linear regression. Open the dataset npts.csv. Let the dependent variable be number of trips made (ntrip). Regress ntrip against income, urban, wrkrcnt, numadult, auttrcnt, and hhsize.

Before running a multiple linear regression, it is often good practice to check the relationships between the different pairs of variables visually using scatter plots.

> NPTS <- read.csv("npts.csv", header=TRUE)
> names(NPTS)
 [1] "houseid"  "mileage"  "census_d" "census_r" "drvrcnt"  "hhloc"   
 [7] "hhsize"   "hh_0to4"  "hh_race"  "msasize"  "numadult" "popdnsty"
[13] "ref_age"  "ref_educ" "ref_sex"  "wrkrcnt"  "hh_5to21" "income"  
[19] "ntrip"    "pur_recr" "pur_shop" "pur_work" "auttrcnt" "pur_hbot"
[25] "pur_nhb"  "no_trans" "urban"   
> NPTS1 <- NPTS[c("ntrip","income","urban","wrkrcnt","numadult","auttrcnt","hhsize")]
> plot(NPTS1,pch = 16, cex = 1.3, col = "blue" )

Since almost all the variables considered are categorical, we don't see any nice linear relationships, so we expect the model fit to be lower. However, we can see evidence of positive correlation between wrkrcnt, numadult, and hhsize. Let us run the linear regression model.

> NPTSLR <- lm(ntrip ~ income + urban + wrkrcnt + numadult + auttrcnt + hhsize, data = NPTS)
> summary(NPTSLR)

Call:
lm(formula = ntrip ~ income + urban + wrkrcnt + numadult + auttrcnt + 
    hhsize, data = NPTS)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.043  -2.917  -0.657   2.268  37.028 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.018e+00  3.532e-01   2.882  0.00399 ** 
income       1.548e-05  5.697e-06   2.717  0.00664 ** 
urban       -3.676e-01  2.371e-01  -1.550  0.12122    
wrkrcnt      1.435e+00  1.478e-01   9.705  < 2e-16 ***
numadult    -1.695e+00  2.036e-01  -8.322  < 2e-16 ***
auttrcnt     6.161e-01  1.418e-01   4.344 1.47e-05 ***
hhsize       2.303e+00  1.070e-01  21.515  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.969 on 1993 degrees of freedom
Multiple R-squared:  0.3723,    Adjusted R-squared:  0.3704 
F-statistic:   197 on 6 and 1993 DF,  p-value: < 2.2e-16

The estimates of the coefficients are given in the Estimate column above.

Multi-Collinearity

Now repeat the process by adding hh_0to4 and hh_5to21. What is the result?

> NPTSLR1 <- lm(ntrip ~ income + urban + wrkrcnt + numadult + auttrcnt + hhsize + hh_0to4 + hh_5to21, data = NPTS)
> summary(NPTSLR1)

Call:
lm(formula = ntrip ~ income + urban + wrkrcnt + numadult + auttrcnt + 
    hhsize + hh_0to4 + hh_5to21, data = NPTS)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.240  -2.868  -0.601   2.237  39.002 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.090e+00  3.476e-01   3.135 0.001745 ** 
income       1.634e-05  5.607e-06   2.914 0.003611 ** 
urban       -4.415e-01  2.335e-01  -1.891 0.058791 .  
wrkrcnt      1.331e+00  1.460e-01   9.115  < 2e-16 ***
numadult    -2.084e+00  2.060e-01 -10.118  < 2e-16 ***
auttrcnt     4.878e-01  1.404e-01   3.473 0.000525 ***
hhsize       2.850e+00  1.249e-01  22.819  < 2e-16 ***
hh_0to4     -2.056e+00  2.522e-01  -8.152 6.24e-16 ***
hh_5to21            NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.889 on 1992 degrees of freedom
Multiple R-squared:  0.3926,    Adjusted R-squared:  0.3905 
F-statistic: 183.9 on 7 and 1992 DF,  p-value: < 2.2e-16

When such things happen (note the NA row for hh_5to21 and the message about singularities), it means the linear regression could not be estimated for all the variables. In this case we have what is known as multi-collinearity among the predictors: the independent variables cannot be linear functions of each other. Here, hh_0to4 + hh_5to21 + numadult = hhsize.
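
You can confirm this exact linear dependence directly from the data; a quick sketch of the check (it returns TRUE if the identity holds for every household):

> all(NPTS$hh_0to4 + NPTS$hh_5to21 + NPTS$numadult == NPTS$hhsize)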

Under such conditions it is not possible to explain the variation in ntrip using all four variables. So we will need to drop one variable and rerun the regression. I dropped hh_0to4 and reran the model. Now the model works.

> NPTSLR1 <- lm(ntrip ~ income + urban + wrkrcnt + numadult + auttrcnt + hhsize + hh_5to21, data = NPTS)
> summary(NPTSLR1)

Call:
lm(formula = ntrip ~ income + urban + wrkrcnt + numadult + auttrcnt + 
    hhsize + hh_5to21, data = NPTS)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.240  -2.868  -0.601   2.237  39.002 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.090e+00  3.476e-01   3.135 0.001745 ** 
income       1.634e-05  5.607e-06   2.914 0.003611 ** 
urban       -4.415e-01  2.335e-01  -1.891 0.058791 .  
wrkrcnt      1.331e+00  1.460e-01   9.115  < 2e-16 ***
numadult    -2.826e-02  2.862e-01  -0.099 0.921375    
auttrcnt     4.878e-01  1.404e-01   3.473 0.000525 ***
hhsize       7.944e-01  2.129e-01   3.731 0.000196 ***
hh_5to21     2.056e+00  2.522e-01   8.152 6.24e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.889 on 1992 degrees of freedom
Multiple R-squared:  0.3926,    Adjusted R-squared:  0.3905 
F-statistic: 183.9 on 7 and 1992 DF,  p-value: < 2.2e-16

You can drop any one of the four variables in this case. When we have a large number of variables, it is important to be wary of multi-collinearity. You need to watch out for multi-collinearity especially when you have indicator variables. For example, if you have an indicator variable for MALE and another indicator variable for FEMALE, you cannot include both of them in the model because MALE + FEMALE = 1.

The above cases are easy to detect. Sometimes, we may have predictors with almost perfectly linear relationships or strong correlations. Multi-collinearity can be identified by generating the correlation matrix of the predictors.

> library(corrplot)
> NPTS2 <- NPTS[c("ntrip","income","urban","wrkrcnt","numadult","auttrcnt","hhsize", "hh_5to21")]
> npts2cor <- cor(NPTS2)
> corrplot(npts2cor, method = "number")

There are strong correlations between hhsize and hh_5to21, and between hhsize and numadult. One possibility is to consider dropping hhsize. In such cases, keeping all the predictors may lead to incorrect estimates.

Sometimes one predictor is strongly related to a combination of other predictors. Such cases cannot be detected by looking at the correlation matrix. In such cases, we need to calculate the variance inflation factors (VIF). The VIF for an independent variable is obtained from the \(R^2_i\) of the linear regression of that independent variable on the other independent variables as follows.

\[ VIF_i = \frac{1}{1 - R^2_i} \] R has several packages for generating VIFs. I am going to use the car package.

> library(car)
> vif(NPTSLR1)
  income    urban  wrkrcnt numadult auttrcnt   hhsize hh_5to21 
1.252065 1.054152 1.550578 3.951683 1.715548 7.596657 4.659714 
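
To see where these numbers come from, the VIF for a single variable can be computed by hand. Here is a sketch for hhsize, regressing it on the other predictors in NPTSLR1 (the name aux is my own):

> aux <- lm(hhsize ~ income + urban + wrkrcnt + numadult + auttrcnt + hh_5to21, data = NPTS)
> 1 / (1 - summary(aux)$r.squared)   # should match the VIF reported for hhsize above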

A common rule of thumb is that a VIF greater than 4 indicates strong multi-collinearity. In our case, the VIF for hhsize is 7.6. The high VIF indicates that hhsize can be explained as a combination of the other variables. When you have several variables with VIFs greater than 4, remove the variable with the highest VIF, rerun the model, and recalculate the VIFs. Repeat the process until all VIFs are less than or equal to 4.

In this model, the worst offender is hhsize. So I will remove hhsize and recalculate the VIFs. Now all VIFs are less than 4 and we have eliminated the multi-collinearity problem.

> NPTSLR1 <- lm(ntrip ~ income + urban + wrkrcnt + numadult + auttrcnt + hh_5to21, data = NPTS)
> vif(NPTSLR1)
  income    urban  wrkrcnt numadult auttrcnt hh_5to21 
1.249541 1.052106 1.548266 1.509617 1.709820 1.140231 
> summary(NPTSLR1)

Call:
lm(formula = ntrip ~ income + urban + wrkrcnt + numadult + auttrcnt + 
    hh_5to21, data = NPTS)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.659  -2.887  -0.663   2.225  40.362 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.174e+00  3.480e-01   3.372  0.00076 ***
income       1.727e-05  5.619e-06   3.074  0.00214 ** 
urban       -4.799e-01  2.340e-01  -2.051  0.04044 *  
wrkrcnt      1.352e+00  1.464e-01   9.237  < 2e-16 ***
numadult     8.114e-01  1.775e-01   4.571 5.15e-06 ***
auttrcnt     4.575e-01  1.406e-01   3.253  0.00116 ** 
hh_5to21     2.874e+00  1.252e-01  22.961  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.905 on 1993 degrees of freedom
Multiple R-squared:  0.3883,    Adjusted R-squared:  0.3865 
F-statistic: 210.9 on 6 and 1993 DF,  p-value: < 2.2e-16

Now check all variables for significance. In our case, looking at the t statistics and p values, we find that all variables are significant, which is good. Note that numadult was not significant in the previous version of the model. This was due to the multi-collinearity issues.

If we have several variables which are not significant, then you can start removing these variables gradually from the model. Start with the variables with the lowest t statistics and remove them one by one or in groups. When you are removing in groups, you may want to do an F-test and check occasionally. More on that later.

Measures of Fit

Let us now look at the four measures of fit reported by R at the end of the summary() output. The sample estimate of the error variance \(\hat{\sigma}^2\) is given as:

\[ \hat{\sigma}^2 = \frac{ \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 }{n-k} = \frac{47944.1}{2000-7} = 24.06 \] In the above formula, \(k\) is the number of independent variables plus 1 (for the intercept).

\[ \hat{\sigma} = 4.90 \] The estimate \(\hat{\sigma} = 4.90\) is the residual standard error. The coefficient of determination \(R^2\) provides an estimate of the proportion of variation in \(y\) explained by the variation in \(x\) and is given as:

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y}_i \right)^2}= 1 - \frac{47944.1}{78384.5} = 0.388 \]

One must be careful while using \(R^2\) to judge how good the model is. In multiple linear regression, as you add variables, the \(R^2\) either remains the same or increases. That does not mean that the model is better. Because of this issue, statisticians have a preference for adjusted \(R^2\).

\[ R^2_{Adj} = 1 - \frac{ \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{n-k}}{ \frac{\sum_{i=1}^{n} \left( y_i - \bar{y}_i \right)^2}{n-1}} = 1 - \frac{ \frac{47944.1}{1993}}{ \frac{78384.5}{1999}} = 0.3865 \]

The F-statistic (\(F_0\)) is more important for multiple linear regression than for simple linear regression. The F-statistic can be used to test the significance of the regression: is there a linear relationship between the dependent variable and any of the independent variables?

\[ \begin{aligned} H_0: &\beta_{1}= \beta_2 = \ldots = \beta_{k-1} = 0 & \\ H_a: &\beta_{j} \neq 0 \,\, \text{for at least one} \,\, j & \end{aligned} \] \[ F_{0} = \frac{ \frac{\sum_{i=1}^{n} \left( \hat{y}_i - \bar{y}_i \right)^2}{k-1}}{\frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{n-k}} = \frac{ \frac{30440.4}{6} }{ \frac{47944.1}{1993}} = 210.8974 \]

We can perform the hypothesis test by reading the \(F_{\alpha,k-1,n-k}\) value from an F distribution table and reject the null if \(F_0 > F_{\alpha,k-1,n-k}\). A quicker way is to look at the p value and reject \(H_0\) if the p value is less than \(\alpha\). In this case, the p value is well below \(\alpha = 0.05\). Therefore, we reject \(H_0\) and conclude that there exists a linear relationship between ntrip and at least one of the independent variables.
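
As a sketch, the same computation can be reproduced for the NPTSLR1 model directly from the residual and total sums of squares; the values in the comments are approximate.

> e <- residuals(NPTSLR1)
> n <- nrow(NPTS)
> k <- length(coef(NPTSLR1))                             # 7 estimated parameters
> SSres <- sum(e^2)                                      # about 47944
> SStot <- sum((NPTS$ntrip - mean(NPTS$ntrip))^2)        # about 78384
> F0 <- ((SStot - SSres) / (k - 1)) / (SSres / (n - k))  # about 210.9
> pf(F0, k - 1, n - k, lower.tail = FALSE)               # p value reported by summary()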

Residual Analysis

The histogram of the standardized residuals appears normal. However, the normal probability plot shows a clear departure from normality at the ends.

> NPTS$residstd <- rstandard(NPTSLR1)
> qqnorm(NPTS$residstd, ylab="Standardized Residuals", xlab="Normal Scores", main="Probability Plot") 
> qqline(NPTS$residstd)

> hist(NPTS$residstd, xlab="Standardized Residuals", ylab="Frequency") 

> plot(fitted(NPTSLR1), NPTS$residstd, ylab="Standardized Residuals", xlab="Fitted Values") 
> abline(0,0)

The standardized residuals versus fitted values plot shows a significant presence of outliers and non-constant variance; clearly, the variance increases with the fitted values. Therefore, either we are missing other important independent variables or a variable transformation is required.

Confidence and Prediction Intervals

The R function predict() can be used to construct both confidence intervals for the mean of the dependent variable and prediction intervals for the dependent variable. Consider a household with an income of 30000, located in an urban area, with one worker, two adults, one kid between the ages of 5 and 21, and one auto. Let us construct confidence and prediction intervals on the number of trips. First, we create a new data frame with the relevant information.

> NPTSnew <- data.frame("income" = 30000, "urban" = 1, "wrkrcnt" = 1, "numadult" = 2, "auttrcnt" = 1, "hh_5to21" = 1)

The 95% confidence interval can be calculated as follows.

> predict(NPTSLR1, NPTSnew , interval = "confidence")
       fit      lwr      upr
1 7.518149 7.153435 7.882864

If you want the 99% confidence interval, then add the level argument.

> predict(NPTSLR1, NPTSnew , interval = "confidence", level = 0.99) 
       fit      lwr      upr
1 7.518149 7.038665 7.997633

As expected, the 99% confidence interval is wider. The 95% prediction interval can also be calculated as follows. As expected the prediction interval is wider than the confidence interval.

> predict(NPTSLR1, NPTSnew , interval = "prediction")
       fit       lwr      upr
1 7.518149 -2.107672 17.14397

The confint() command provides the 95% confidence intervals on the parameter estimates. The coefficient plus or minus twice its standard error is often a good approximation for the 95% confidence interval. Other confidence levels can be obtained using the level option.

> confint(NPTSLR1)
                    2.5 %        97.5 %
(Intercept)  4.910654e-01  1.856065e+00
income       6.255585e-06  2.829527e-05
urban       -9.387846e-01 -2.092456e-02
wrkrcnt      1.064960e+00  1.639100e+00
numadult     4.632729e-01  1.159467e+00
auttrcnt     1.816560e-01  7.333222e-01
hh_5to21     2.628451e+00  3.119381e+00
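
For instance, 99% intervals on the coefficients can be requested with the level argument:

> confint(NPTSLR1, level = 0.99)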

Now that we have the coefficient estimates and their confidence intervals, we can interpret the coefficients.

Let us look at income. Holding all other factors constant, a $10,000 increase in annual income results in about 0.17 more trips. Holding all other factors constant, an additional worker in the household results in 1.352 more trips. Holding all other factors constant, an additional kid whose age is between 5 and 21 results in 2.87 more trips.

Be careful when you are interpreting dummy variables. For example, urban can take two values, 1 or 0. Therefore, when you are interpreting urban, you are interpreting relative to the case when urban is 0. Holding all other factors constant, a household located in an urban area makes 0.48 fewer trips compared to a household in a non-urban area. This is a surprising result.

When you get such counterintuitive results, investigate further. Maybe your original hypothesis was incorrect. Maybe there was some issue with the data collection. If you are not able to get a consistent real world interpretation for a variable, do not use the variable in the model even if the p values look significant.

Advanced Topics

Restrictions

Consider the following regression model.

> NPTSLR_Full <- lm(ntrip ~ income + urban + wrkrcnt + numadult + auttrcnt + hh_5to21 + hh_0to4 + no_trans, data = NPTS)
> summary(NPTSLR_Full)

Call:
lm(formula = ntrip ~ income + urban + wrkrcnt + numadult + auttrcnt + 
    hh_5to21 + hh_0to4 + no_trans, data = NPTS)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.235  -2.893  -0.600   2.250  38.969 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.716e-01  4.056e-01   2.396 0.016684 *  
income       1.638e-05  5.608e-06   2.920 0.003538 ** 
urban       -3.513e-01  2.828e-01  -1.242 0.214349    
wrkrcnt      1.335e+00  1.463e-01   9.131  < 2e-16 ***
numadult     7.703e-01  1.775e-01   4.339  1.5e-05 ***
auttrcnt     4.778e-01  1.416e-01   3.375 0.000752 ***
hh_5to21     2.851e+00  1.249e-01  22.820  < 2e-16 ***
hh_0to4      7.919e-01  2.130e-01   3.718 0.000206 ***
no_trans     1.562e-01  2.763e-01   0.565 0.571872    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.89 on 1991 degrees of freedom
Multiple R-squared:  0.3927,    Adjusted R-squared:  0.3903 
F-statistic: 160.9 on 8 and 1991 DF,  p-value: < 2.2e-16

We find that both urban and no_trans are insignificant; let us say we want to remove them. Let us run a new regression model after removing urban and no_trans.

> NPTSLR_Res <- lm(ntrip ~ income + wrkrcnt + numadult + auttrcnt + hh_5to21 + hh_0to4, data = NPTS)
> summary(NPTSLR_Res)

Call:
lm(formula = ntrip ~ income + wrkrcnt + numadult + auttrcnt + 
    hh_5to21 + hh_0to4, data = NPTS)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.185  -2.856  -0.598   2.192  39.236 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.378e-01  3.213e-01   2.608  0.00918 ** 
income      1.470e-05  5.543e-06   2.652  0.00805 ** 
wrkrcnt     1.322e+00  1.460e-01   9.053  < 2e-16 ***
numadult    7.375e-01  1.768e-01   4.171 3.16e-05 ***
auttrcnt    5.398e-01  1.378e-01   3.918 9.25e-05 ***
hh_5to21    2.850e+00  1.250e-01  22.798  < 2e-16 ***
hh_0to4     8.121e-01  2.128e-01   3.816  0.00014 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.892 on 1993 degrees of freedom
Multiple R-squared:  0.3915,    Adjusted R-squared:  0.3897 
F-statistic: 213.7 on 6 and 1993 DF,  p-value: < 2.2e-16

In the new model, all variables are significant, but the R-squared and Adjusted R-squared have decreased. Is it safe to drop the two variables? This can be answered by doing a test of restrictions. Essentially, we are conducting the following hypothesis test.

\[ \begin{aligned} H_0: & \beta_{urban} = \beta_{no\_trans} = 0 &\\ H_a: & \text{At least one of} \,\,\beta_{urban}\,\, \text{or}\,\, \beta_{no\_trans} \neq 0 & \end{aligned} \]

Let \(SS_{Res}^F\) and \(SS_{Res}^R\) represent the residual sum of squares of the full and restricted model respectively. Note that

\[ SS_{Res} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \] Determine \(F_0\) as follows

\[ F_0 = \frac{ \frac{SS_{Res}^R - SS_{Res}^F }{r} }{ \frac{SS_{Res}^F}{n-k} } \]

where \(r\) denotes the number of restrictions which in this case is 2.

> NPTSLR_Full_SSE <- sum(residuals(NPTSLR_Full)^2)
> NPTSLR_Res_SSE <- sum(residuals(NPTSLR_Res)^2)
> print(NPTSLR_Full_SSE)
[1] 47603.69
> print(NPTSLR_Res_SSE)
[1] 47696.79
> r <- 2
> k <- 9
> n <- nrow(NPTS)
> num_F0 <- (NPTSLR_Res_SSE - NPTSLR_Full_SSE)/r
> den_F0 <- (NPTSLR_Full_SSE)/(n-k)
> F0 <- num_F0/den_F0
> print(F0)
[1] 1.946859

We can also perform the above analysis using the anova() command as shown below.

> anova(NPTSLR_Res,NPTSLR_Full)
Analysis of Variance Table

Model 1: ntrip ~ income + wrkrcnt + numadult + auttrcnt + hh_5to21 + hh_0to4
Model 2: ntrip ~ income + urban + wrkrcnt + numadult + auttrcnt + hh_5to21 + 
    hh_0to4 + no_trans
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1   1993 47697                           
2   1991 47604  2    93.097 1.9469  0.143

Now compare \(F_0\) against \(F_{\alpha,r,n-k}\) from the F distribution table, and reject the null if \(F_0 > F_{\alpha,r,n-k}\). An easier way is to look at the p value and check if it is less than \(\alpha\). In this case, the p value is 0.143. Since \(0.143 > 0.05\), we cannot reject the null at the 5% significance level. Therefore, urban and no_trans do not contribute significantly to the model and can be removed.

Let us now look at another type of restriction.

> NPTSLR_Full <- lm(ntrip ~ income + wrkrcnt + numadult + auttrcnt + hh_5to21 + hh_0to4, data = NPTS)
> summary(NPTSLR_Full)

Call:
lm(formula = ntrip ~ income + wrkrcnt + numadult + auttrcnt + 
    hh_5to21 + hh_0to4, data = NPTS)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.185  -2.856  -0.598   2.192  39.236 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.378e-01  3.213e-01   2.608  0.00918 ** 
income      1.470e-05  5.543e-06   2.652  0.00805 ** 
wrkrcnt     1.322e+00  1.460e-01   9.053  < 2e-16 ***
numadult    7.375e-01  1.768e-01   4.171 3.16e-05 ***
auttrcnt    5.398e-01  1.378e-01   3.918 9.25e-05 ***
hh_5to21    2.850e+00  1.250e-01  22.798  < 2e-16 ***
hh_0to4     8.121e-01  2.128e-01   3.816  0.00014 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.892 on 1993 degrees of freedom
Multiple R-squared:  0.3915,    Adjusted R-squared:  0.3897 
F-statistic: 213.7 on 6 and 1993 DF,  p-value: < 2.2e-16

In the above model, you will notice that the coefficients of numadult and hh_0to4 are very close. Let us say we want to test

\[ \begin{aligned} H_0: &\beta_{numadult} = \beta_{hh\_0to4} &\\ H_a: &\beta_{numadult} \neq \beta_{hh\_0to4}& \end{aligned} \] The original model is

\[ y = \ldots + \beta_{numadult} \times numadult + \beta_{hh\_0to4} \times hh\_0to4 + \ldots \]

If the above restriction \(\beta_{numadult} = \beta_{hh\_0to4}\) is valid, then the new model will be

\[ y = \ldots + \beta_{numadult} \times \left( numadult + hh\_0to4 \right) + \ldots \]

We can create the restricted model by removing variables numadult and hh_0to4 and creating a new variable which is the sum of numadult and hh_0to4.

> NPTS$numadult_hh_0to4 <- NPTS$numadult + NPTS$hh_0to4
> NPTSLR_Res <- lm(ntrip ~ income + wrkrcnt + auttrcnt + hh_5to21 + numadult_hh_0to4, data = NPTS)
> summary(NPTSLR_Res)

Call:
lm(formula = ntrip ~ income + wrkrcnt + auttrcnt + hh_5to21 + 
    numadult_hh_0to4, data = NPTS)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.136  -2.835  -0.591   2.188  39.313 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      8.095e-01  3.023e-01   2.677  0.00748 ** 
income           1.472e-05  5.542e-06   2.656  0.00798 ** 
wrkrcnt          1.316e+00  1.442e-01   9.126  < 2e-16 ***
auttrcnt         5.304e-01  1.330e-01   3.988 6.91e-05 ***
hh_5to21         2.853e+00  1.243e-01  22.950  < 2e-16 ***
numadult_hh_0to4 7.684e-01  1.315e-01   5.841 6.05e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.891 on 1994 degrees of freedom
Multiple R-squared:  0.3915,    Adjusted R-squared:   0.39 
F-statistic: 256.6 on 5 and 1994 DF,  p-value: < 2.2e-16

Note that the number of restrictions in this case is only 1.

> NPTSLR_Full_SSE <- sum(residuals(NPTSLR_Full)^2)
> NPTSLR_Res_SSE <- sum(residuals(NPTSLR_Res)^2)
> print(NPTSLR_Full_SSE)
[1] 47696.79
> print(NPTSLR_Res_SSE)
[1] 47698.43
> r <- 1
> k <- 7
> n <- nrow(NPTS)
> num_F0 <- (NPTSLR_Res_SSE - NPTSLR_Full_SSE)/r
> den_F0 <- (NPTSLR_Full_SSE)/(n-k)
> F0 <- num_F0/den_F0
> print(F0)
[1] 0.06842986
> anova(NPTSLR_Res,NPTSLR_Full)
Analysis of Variance Table

Model 1: ntrip ~ income + wrkrcnt + auttrcnt + hh_5to21 + numadult_hh_0to4
Model 2: ntrip ~ income + wrkrcnt + numadult + auttrcnt + hh_5to21 + hh_0to4
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1   1994 47698                           
2   1993 47697  1    1.6377 0.0684 0.7937

Since the p value (0.7937) is greater than \(\alpha = 0.05\), we fail to reject the null. Therefore, we go with the restricted model: an additional adult and an additional kid aged 4 or under have the same impact on the number of trips made.

Building a model

Do a descriptive analysis. Study the mean, median, min, and max of each independent variable of interest. Make sure the values make sense. Look at correlations between pairs of independent variables. Note that you may have to study correlations differently if they are categorical variables. In the above example, I used the Pearson correlation to study the relationship between numadult, hh_0to4, and hhsize, all of which are discrete variables. The Pearson correlation might not be the best choice in this case. When you are studying the relationship between continuous and categorical variables, or between different categorical variables, explore other correlation measures or use cross-tabulations.

You may have to recode some variables. For example, let us look at the numadult variable.

> table(NPTS$numadult)

   1    2    3    4    5    6    7    8 
 533 1219  177   54   11    3    1    2 

The number of data points drops off rapidly after numadult = 4. So one recommendation is to recode the numadult variable as 1, 2, 3, and greater than or equal to 4.
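
One way to do this recoding is sketched below; the variable name numadult_rec is my own, and pmin() simply caps the count at 4 so that the top category means 4 or more.

> NPTS$numadult_rec <- pmin(NPTS$numadult, 4)
> table(NPTS$numadult_rec)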

Study the correlation between each independent variable and the dependent variable and make sure they make sense.

One approach is to include all relevant independent variables and remove them one by one. However, such an approach might not work if there are multi-collinearity issues. How do you remove them? Start with the least significant variable and gradually remove variables one by one. If you are removing variables in groups, do F-tests and check.

The other approach is to include a small number of independent variables and gradually add new variables. In the process, continuously check for significance.

R also has packages and commands which help with variable selection. The regsubsets() command in the leaps package and the step() command will help with variable selection based on various criteria.

Even if you arrive at the final model using R packages, you may still have to play around with the model specification. For example, income can be included as a continuous or as a categorical variable. You can include income as a continuous variable directly. The other option is to categorize households based on income, create dummy variables for low income (income <= 30000), medium income (30000 < income <= 80000), and high income, and study the dummy variables.
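
Here is a sketch of one way to create such income categories; the cut points follow the text and the variable names are my own. Remember to include only two of the three dummies in a model with an intercept, otherwise you run into the multi-collinearity issue discussed earlier.

> NPTS$inc_low <- as.numeric(NPTS$income <= 30000)
> NPTS$inc_med <- as.numeric(NPTS$income > 30000 & NPTS$income <= 80000)
> NPTS$inc_high <- as.numeric(NPTS$income > 80000)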

You may have to do transformations to make the model fit the linear regression assumptions: take the log of the dependent variable or of certain independent variables, apply a Box-Cox transformation, or center and scale the variables (by subtracting the mean and dividing by the standard deviation).
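
For example, here is a sketch of a log transform of the dependent variable and a standardized version of income; the +1 inside the log is my own guard in case a household reports zero trips, and whether these transformations help depends on the data.

> NPTS$log_ntrip <- log(NPTS$ntrip + 1)              # +1 guards against zero trips
> NPTS$income_std <- as.numeric(scale(NPTS$income))  # subtract the mean, divide by the sd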

No package will identify all of these issues for you. You will have to use your intuition and judgement. The final goal is to come up with a model which is parsimonious, has good measures of fit, and whose coefficients have interpretations that make sense.

Leverage and Influence

Sometimes in linear regression, a small subset of data points can affect the overall regression results. This section provides a brief outline on how to identify such points. For a detailed discussion on leverage and influence see chapter 6 of Montgomery, Peck, and Vining (2012). Consider the following data frame.

x y col
5.2720 22.9263 black
1.0356 13.0035 black
18.0000 86.5663 blue
1.0521 11.4897 black
7.4846 42.3463 black
6.8150 39.5922 black
4.3134 29.6026 black
12.0000 20.0000 red
1.9054 14.4761 black
3.9393 23.5980 black
9.3811 49.6728 black
5.9817 34.3181 black
8.3588 40.7720 black
4.1324 25.2963 black
9.4299 42.9972 black
3.0882 18.4772 black
9.4476 60.4081 black
9.4603 56.4520 black
1.3338 12.3898 black
1.4830 8.7563 black

A leverage point is defined as a data point whose \(x\) value is far away from the regular range (or mean) of \(x\). For example, in the above data, the blue point is a leverage point. The blue point is not expected to change the intercept and slope much, but it may have an impact on overall measures of model performance.

An outlier is a data point which has a large residual, i.e., its \(y\) value is significantly different from what the model predicts. Influential points are leverage points with large residuals. Influential points are expected to have a big impact on the model coefficients. For example, in the above data, the red point is an influential point.

The coefficients and R squared from four different regressions are shown below.

  • Both Red and Blue : Both red and blue data point were included in the data
  • Red, No Blue: Only the red point was included. The blue point was removed.
  • Blue, No Red: Only the blue point was included. The red point was removed.
  • No red, No Blue: Both red and blue points were removed. Only the black points were considered.

As can be seen from the table, when the blue point was included, there were only minor changes in the intercept, slope, and R squared. However, when the red point was included, the intercept and slope changed significantly.

Dataset             Intercept    Slope   Rsquared
Both Red and Blue      7.9285   3.9912     0.7598
Red, No Blue           9.7530   3.5998     0.6158
Blue, No Red           5.8704   4.6607     0.9559
No Red, No Blue        4.8091   4.8982     0.9307

There are several metrics which help identify data points with high leverage and influence. Let fulldf be the dataframe which contains the above \(x\) and \(y\) values.

> fulldflr <- lm(y~x,data = fulldf)
> influence.measures(fulldflr)
Influence measures of
     lm(formula = y ~ x, data = fulldf) :

     dfb.1_    dfb.x    dffit  cov.r   cook.d    hat inf
1  -0.10365  0.03019 -0.14235 1.1336 1.05e-02 0.0524    
2   0.03631 -0.02812  0.03645 1.2776 7.03e-04 0.1235    
3  -0.45622  0.73580  0.78218 1.8091 3.09e-01 0.4345   *
4  -0.02453  0.01898 -0.02463 1.2777 3.21e-04 0.1230    
5   0.03305  0.03160  0.10905 1.1576 6.22e-03 0.0546    
6   0.04557  0.01485  0.10316 1.1544 5.57e-03 0.0511    
7   0.09580 -0.04552  0.11254 1.1649 6.63e-03 0.0598    
8   1.18676 -2.90200 -3.59874 0.0422 1.23e+00 0.1430   *
9  -0.03546  0.02558 -0.03604 1.2451 6.87e-04 0.1008    
10 -0.00123  0.00065 -0.00139 1.1978 1.02e-06 0.0640    
11 -0.00512  0.07577  0.12646 1.1894 8.37e-03 0.0780    
12  0.03479 -0.00288  0.05732 1.1717 1.73e-03 0.0501    
13 -0.00176 -0.00607 -0.01340 1.1960 9.50e-05 0.0629    
14  0.01943 -0.00975  0.02236 1.1938 2.65e-04 0.0617    
15  0.00362 -0.04581 -0.07572 1.2076 3.02e-03 0.0789    
16 -0.04856  0.03035 -0.05147 1.2096 1.40e-03 0.0766    
17 -0.02353  0.28352  0.46702 0.9219 1.00e-01 0.0792    
18 -0.01722  0.20058  0.32959 1.0556 5.35e-02 0.0794    
19 -0.03171  0.02402 -0.03193 1.2660 5.40e-04 0.1152    
20 -0.18429  0.13798 -0.18592 1.2215 1.80e-02 0.1113    

In the influence measures output, there are three columns I recommend looking at. For leverage, look at the hat column, and for influence look at the cook.d (Cook's distance) and dffit columns; the inf column flags points that R itself considers potentially influential. The cutoffs below can also be applied programmatically, as shown in the sketch after this list.

  • hat: Look at all data points with hat > \(2 \times \frac{k}{n}\), where \(k\) is the number of parameters estimated. In this case, \(k=2\) and therefore \(2 \times \frac{k}{n} = 2\times \frac{2}{20}= 0.2\). Data point 3 is identified as a high leverage point.
  • Cook's distance: Look at all data points with cook.d > 1.00. In this case, data point 8 is identified as a high influence point.
  • dffit: Look at all data points with absolute value of dffit > \(2 \times \sqrt{\frac{k}{n}}\). In this case \(2 \times \sqrt{\frac{k}{n}} = 0.63\). Data points 3 and 8 are identified as high influence points.
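
The same cutoffs can be applied programmatically; below is a sketch using base R's influence functions on the fulldflr model.

> k <- length(coef(fulldflr))                     # 2 estimated parameters
> n <- nrow(fulldf)                               # 20 observations
> which(hatvalues(fulldflr) > 2 * k / n)          # high leverage (hat column)
> which(cooks.distance(fulldflr) > 1)             # high influence (cook.d column)
> which(abs(dffits(fulldflr)) > 2 * sqrt(k / n))  # high influence (dffit column)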

Just because a point is identified as influential doesn't mean you need to remove it from the data set. The key is to study the point further. Is there any error in collecting the data? Do the \(y\) and \(x\) values make sense? Sometimes the outlier points provide valuable information which is missing in other points.

Modeling Nonlinear Forms as Linear Regression

Linear regression can also be used to model nonlinear functions of specific forms. For example:

\[ Y = b_0 + b_1x + b_2x^2 + \ldots + b_mx^m \]

Substituting \(x_1 = x, x_2 = x^2, \ldots, x_m = x^m\), we get a linear model.
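
In R, the powers do not have to be created by hand. The sketch below fits a quadratic (\(m = 2\)) to simulated data; the data frame mydata and its columns are my own.

> set.seed(1)
> mydata <- data.frame(x = 1:20)
> mydata$y <- 3 + 2 * mydata$x + 0.5 * mydata$x^2 + rnorm(20)  # simulated quadratic data
> quadfit <- lm(y ~ x + I(x^2), data = mydata)  # I() treats x^2 as a separate regressor
> summary(quadfit)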

\[ Y = b_0exp(b_1x) \]

The above model is a nonlinear model. Taking the natural log of both sides, we get:

\[ ln(Y) = ln(b_0) + b_1x \]

This transforms the nonlinear model into a linear model.

Open the file ConcBact.csv. Develop a scatter plot of time on the x axis and concentration on the y axis.

> CB <- read.csv("ConcBact.csv", header = TRUE)
> CB
  Time    Conc
1  0.5  180000
2  1.0  185000
3  1.5  225000
4  2.0  453000
5  2.5  534000
6  3.0  950000
7  3.5 1350000
8  4.0 1850000
> plot(CB$Time, CB$Conc)

The scatter plot clearly shows that there is a nonlinear exponential relationship.

So I assume:

\[ CONC = b_0exp(b_1T) \]

By taking natural log, you get a linear relationship:

\[ ln(CONC) = ln(b_0) + b_1T \]

Thus, I do a linear regression of natural logarithm of CONC vs time.

> CTLR <- lm(log(Conc) ~ Time, data=CB)
> summary(CTLR)

Call:
lm(formula = log(Conc) ~ Time, data = CB)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.26172 -0.10098  0.04117  0.07441  0.24688 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.4880     0.1315   87.34 1.52e-10 ***
Time          0.7317     0.0521   14.05 8.12e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1688 on 6 degrees of freedom
Multiple R-squared:  0.9705,    Adjusted R-squared:  0.9656 
F-statistic: 197.3 on 1 and 6 DF,  p-value: 8.124e-06
> plot(CB$Time, log(CB$Conc), xlab = "Time", ylab = "log(Conc)")
> abline(CTLR)

As can be seen in the above plot, the linear regression on the natural logarithm of the concentration works very well.

References

Barreto, Humberto, and Frank M Howland. 2006. Introductory Econometrics: Using Monte Carlo Simulation with Microsoft Excel. Cambridge University Press, New York.

Montgomery, Douglas C, Elizabeth A Peck, and G Geoffrey Vining. 2012. Introduction to Linear Regression Analysis. Vol. 821. John Wiley & Sons.

Tang, Wilson H, and A Ang. 2007. Probability Concepts in Engineering: Emphasis on Applications to Civil & Environmental Engineering. Wiley, Hoboken, NJ.