In this Case Study, you will practice calculating a linear regression in R and plotting the results. You will also learn about writing your own functions.

Get the data

We will be attempting to find a linear regression that models college tuition rates, based on a dataset from US News and World Report. Alas, this data is from 1995, so it is very outdated; still, we will see what we can learn from it.


Question 1:

The dataset is on our "Data" page in Sakai (tuition-final.csv); see also the related "tuition-documentation.txt" file for some further information. You already know the functions read.csv( ) and read.table( ); use one of them to read the data into R, and call the result tuition. Then use the functions you have learned previously to familiarize yourself with the data in tuition. Check out

YOUR CODE HERE
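
As a reminder of the general workflow, here is a minimal sketch of reading a CSV file and taking a first look at it. It uses a made-up toy data frame (the names toy, toy_path, and toy_read are hypothetical, and this is not the tuition dataset) written to a temporary file, so it runs without any downloads.

```r
# Hypothetical toy data standing in for a real file (NOT the tuition dataset).
toy <- data.frame(
  Name     = c("School A", "School B", "School C"),
  Enrolled = c(1200, 5400, 300),
  Public   = c(1, 2, 1)
)

# Write the toy data to a temporary CSV file so read.csv() has something to read.
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)

# Read the file back in, as you would with a downloaded dataset.
toy_read <- read.csv(toy_path)

# Common first looks at a data frame.
dim(toy_read)      # number of rows and columns
head(toy_read)     # first few rows
str(toy_read)      # type of each column
summary(toy_read)  # summary statistics for each column
```

The same four functions can be applied to any data frame once it has been read in.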

Question 2:

Make a new variable in tuition called Acc.Rate that contains the acceptance rate for each university.

YOUR CODE HERE

Question 3:

Find which row of tuition corresponds to UNC (“University of North Carolina at Chapel Hill”).

YOUR CODE HERE

Writing functions

We have seen many examples of using functions in R, like summary( ) or t.test( ). Now you will learn how to write your own functions. Defining a function means writing code that looks something like this:

my_function <- function(VAR_1, VAR_2){
  
  # do some stuff
  return(result)
  
}

Then you run the code in R to “teach” it how your function works, and after that, you can use it like you would any other pre-existing function. For example, try out the following:

add1 <- function(a, b){
  
  # add the variables
  total <- a + b
  return(total)
  
}

add2 <- function(a, b = 3){
  
  # add the variables
  total <- a + b
  return(total)
  
}

# Try adding 5 and 7
add1(5, 7)
add2(5, 7)

# Try adding one variable
add1(5)
add2(5)

Question 4:

What was the effect of b = 3 in the definition of add2( )?


Question 5:

Recall that the equations for simple linear regression are: \[\beta_1 = r \frac{S_Y}{S_X} \hspace{0.5cm} \beta_0 = \bar{Y} - \beta_1 \bar{X}\]

Write your own functions, called beta1( ) and beta0( ), that take as input some combination of Sx, Sy, r, y_bar, and x_bar, and use that to calculate \(\beta_1\) and \(\beta_0\).

YOUR CODE HERE

Question 6:

Try your function with Sx = 0. Did it work? If not, fix your function code. Explain why it would be a problem to do linear regression with \(S_X = 0\).
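
As an arithmetic check for your functions from Question 5, here is a small worked example. The summary statistics are made up for illustration (they are not values from the tuition data): suppose \(r = 0.5\), \(S_X = 2\), \(S_Y = 4\), \(\bar{X} = 10\), and \(\bar{Y} = 30\). The slope is the correlation times the ratio of the standard deviation of Y to the standard deviation of X, so \[\beta_1 = r \frac{S_Y}{S_X} = 0.5 \cdot \frac{4}{2} = 1 \hspace{0.5cm} \beta_0 = \bar{Y} - \beta_1 \bar{X} = 30 - 1 \cdot 10 = 20\]

If your beta1( ) and beta0( ) return 1 and 20 for these inputs, they agree with the formulas.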


Linear Regression by hand

Use the code below to make a scatterplot of college tuition versus average SAT score.

plot(tuition$Avg.SAT, tuition$Out.Tuition, main = "title", xlab = "label", ylab = "label", pch = 7, cex = 2, col = "blue")

Question 7:

Make your own scatterplot, but change the input of plot( ) so that it looks nice.

YOUR CODE HERE

What do pch and cex do?


Question 8:

Change the color of your scatterplot using col = tuition$Public. What did this do?


Question 9:

We have used the function abline( ) to add a vertical line or a horizontal line to a graph. However, it can also add lines by slope and intercept. Read the documentation of abline( ) until you understand how to do this. Then add a line with slope 10 and intercept 0 to your plot. Does this seem to fit the data well?
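
To illustrate the two ways of calling abline( ), here is a minimal sketch on made-up points (the vectors x and y are hypothetical, not the tuition data). In abline( ), the argument a is the intercept and b is the slope, while h and v draw horizontal and vertical lines.

```r
# Made-up points that lie roughly along y = 1 + 2x.
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 4.9, 7.2, 9.0, 10.8)

plot(x, y, main = "Toy example", xlab = "x", ylab = "y", pch = 16)

# Line given by intercept (a) and slope (b).
abline(a = 1, b = 2, col = "red")

# For comparison: a horizontal line (h) and a vertical line (v).
abline(h = 7, lty = 2)
abline(v = 3, lty = 2)
```

Note that abline( ) adds to the most recent plot, so plot( ) has to be run first.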


Question 10:

Use the functions you already know in R and the ones you created, beta1( ) and beta0( ), to find the slope and intercept for a regression line of Out.Tuition on Avg.SAT (that is, a line predicting tuition from average SAT score). Remake your scatterplot, and add the regression line. What do you conclude about the relationship between average SAT score and a college’s tuition?


Question 11:

Write a new function called predict_yval(X, Y, x_new) that takes as input a vector of values of the explanatory variable (X), a vector of values of the response variable (Y), and a new x-value for which we want a prediction (x_new). The output of the function should be the predicted y-value for x_new from the regression line of Y on X. (Hint: You can use functions inside functions.)
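
To illustrate the hint, here is a minimal sketch of one function calling another. Both functions (spread and rescale_new) are hypothetical examples that have nothing to do with regression; the point is only that a function you defined earlier can be used inside a later one.

```r
# Hypothetical helper: the range (max minus min) of a numeric vector.
spread <- function(v){
  return(max(v) - min(v))
}

# Hypothetical function that calls the helper to put a new value
# on a 0-1 scale relative to the values in v.
rescale_new <- function(v, v_new){
  return((v_new - min(v)) / spread(v))
}

# (25 - 10) / (30 - 10)
rescale_new(c(10, 20, 30), 25)
```

The helper has to be defined (run in R) before the function that uses it is called.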

YOUR CODE HERE

Now find the average SAT score and tuition of UNC and of Duke, and compare their predicted values to the truth:

# Find UNC values
x_unc = 
y_unc = 

# Find Duke values
x_duke = 
y_duke = 


# Predict tuitions
predict_yval(tuition$Avg.SAT, tuition$Out.Tuition, x_unc)
predict_yval(tuition$Avg.SAT, tuition$Out.Tuition, x_duke)

Would you say you are getting a deal at UNC? How about at Duke?

lm() and diagnostics

You now have functions to calculate the slope and intercept of a linear regression, and to predict values. As you might expect, R can already do this, using the function lm( ). In class, you saw how to read the output of lm( ). Run the following regression of Out.Tuition on Avg.SAT, and refamiliarize yourself with the output. (You can also check that your beta1 and beta0 outputs were correct, while you are at it.)

  my_lm = lm(Out.Tuition ~ Avg.SAT, data = tuition)
  summary(my_lm)
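
Besides reading summary( ), you can pull the fitted coefficients out of an lm object directly. Here is a minimal sketch using R's built-in cars dataset (stopping distance versus speed) rather than the tuition data; the names toy_lm, intercept, and slope are chosen just for this illustration.

```r
# Built-in dataset: stopping distance (dist) and speed for 50 cars.
toy_lm <- lm(dist ~ speed, data = cars)

# Named vector holding the intercept and the slope.
coef(toy_lm)

# Pull out each coefficient by name.
intercept <- coef(toy_lm)[["(Intercept)"]]
slope     <- coef(toy_lm)[["speed"]]

# Predicted stopping distance at speed 15, computed two ways.
intercept + slope * 15
predict(toy_lm, newdata = data.frame(speed = 15))
```

The column name in newdata must match the explanatory variable used in the formula, otherwise predict( ) cannot find it.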

(Description of variables, comment on how outdated it is, look at UNC 2974)

  • Explore data a bit (open ended)
  • pairs() with SAT and ACT: which one to use? in the future…
  • Write functions to calculate linear regressions
  • Use functions to find slope and intercept and plot the line on a scatterplot, for SAT or ACT (their choice)
  • Write function to predict a value (you can call other functions from inside!), predict UNC and Duke, look at actual values. Are you getting a deal?
  • Do a t-test for the betas by hand???
  • Teach them lm()
  • Run lm() on Size, do diagnostics, notice it’s bad
  • Split by public and private
  • Outliers: Expenditure per student for private schools (Caltech etc. have influential outliers)
  • Multiple regression in lm()

(ANOVA after)

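By default, plot( ) on an lm object produces four automatic diagnostic plots: residuals versus fitted values, a normal Q-Q plot of the standardized residuals, a scale-location plot (square root of the absolute standardized residuals versus fitted values), and standardized residuals versus leverage (with Cook's distance contours).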
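
Here is a minimal sketch of producing the automatic diagnostic plots and looking at the quantities behind them. It again uses the built-in cars dataset as a stand-in (not the tuition data), and the name toy_lm is chosen just for this illustration.

```r
# Fit a simple regression on the built-in cars dataset.
toy_lm <- lm(dist ~ speed, data = cars)

# Show the four default diagnostic plots in a 2-by-2 grid.
par(mfrow = c(2, 2))
plot(toy_lm)
par(mfrow = c(1, 1))  # reset to one plot per page

# The quantities the plots are built from.
head(resid(toy_lm))           # raw residuals
head(rstandard(toy_lm))       # standardized residuals
head(hatvalues(toy_lm))       # leverage of each observation
head(cooks.distance(toy_lm))  # Cook's distance of each observation
```

Points with both high leverage and a large standardized residual tend to have a large Cook's distance, which is what the last plot is designed to reveal.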