In this case study, you will continue to perform multiple regression, but you will be asked to think about which variables should or should not be included.

Preliminary checks

First we will predict the price of a laptop based on many variables, both quantitative and categorical. Begin by downloading the data as usual. By now, you should find it natural to explore basic information about a dataset and its variables after downloading.

laptops = read.csv("laptops.csv")
  
summary(laptops)
##  Max.Horizontal.Resolution Memory.Technology  Installed.Memory Processor.Speed
##  Length:99                 Length:99          Min.   : 250.0   Min.   : 867   
##  Class :character          Class :character   1st Qu.: 512.0   1st Qu.:1600   
##  Mode  :character          Mode  :character   Median :1000.0   Median :1800   
##                                               Mean   : 907.6   Mean   :1711   
##                                               3rd Qu.:1000.0   3rd Qu.:2000   
##                                               Max.   :2000.0   Max.   :2330   
##   Processor         Manufacturer         Infrared          Bluetooth        
##  Length:99          Length:99          Length:99          Length:99         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Docking.Station    Port.Replicator    Fingerprint         Subwoofer        
##  Length:99          Length:99          Length:99          Length:99         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  External.Battery       CDMA           Operating.System   Warranty.Days   
##  Length:99          Length:99          Length:99          Min.   : 365.0  
##  Class :character   Class :character   Class :character   1st Qu.: 365.0  
##  Mode  :character   Mode  :character   Mode  :character   Median : 365.0  
##                                                           Mean   : 489.9  
##                                                           3rd Qu.: 365.0  
##                                                           Max.   :1095.0  
##      Price       
##  Min.   : 539.0  
##  1st Qu.: 796.5  
##  Median :1075.0  
##  Mean   :1091.9  
##  3rd Qu.:1299.5  
##  Max.   :1899.0

Summarize the data, and fix anything that seems nonsensical. (This should be your first step before any analysis.)
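
One thing worth checking here: summary( ) reports most columns as class character, and some functions treat character columns differently from factors. The sketch below shows one way to find and convert such columns. It uses a tiny made-up data frame called toy (hypothetical values, not the laptops data), so treat it as an illustration of the idea rather than the required fix.

```r
# Toy data frame with made-up values (NOT the laptops data)
toy = data.frame(
  Bluetooth = c("Yes", "No", "Yes", "Yes"),
  Price     = c(999, 650, 1200, 1100),
  stringsAsFactors = FALSE
)

# Which columns were read in as character?
is_chr = sapply(toy, is.character)

# Convert only those columns to factors
toy[is_chr] = lapply(toy[is_chr], factor)

# Check the result: Bluetooth is now a factor with two levels
str(toy)
table(toy$Bluetooth)
```

The same pattern should carry over to a larger data frame, since it only touches the columns flagged in is_chr.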


Question 1:

  a. Run the following code:
for(i in 1:ncol(laptops)){
  par(ask = TRUE)
  plot(laptops[,i], xlab = names(laptops)[i])
}

What did this do? What was the role of the line par(ask = TRUE)? How did we use the loop to get each variable name to print on the x-axis?

As you looked at the plots, did anything stand out to you as a possible problem for regression?

  b. Alter the above code so that instead of plotting each variable alone, you plot it against Price. Comment on what you see.

Note: When you are done with this question, change the code chunks to eval = FALSE, to avoid printing all the plots in your final output.


Question 2:

For each of the following regressions, explain what is wrong with the output of lm( ), and why exactly it occurred. Explain your answers with appropriate plots or tables where possible.

# a
lm_a = lm(Price ~ Subwoofer, data = laptops)

# b
lm_b = lm(Price ~ Max.Horizontal.Resolution^2, data = laptops)
summary(lm_b)

# c
lm_c = lm(Price ~ Manufacturer + Operating.System, data = laptops)

# d
lm_d1 = lm(Price ~ Processor.Speed+Processor, data = laptops)
lm_d2 = lm(Price ~ Processor.Speed*Processor, data = laptops)

ANOVA for nested models

Recall that we can use ANOVA tests to compare two multiple regressions when one model is nested in the other. This is particularly useful when the models include many factors, because it can be hard to tell from the individual t-scores which variable is more significant.
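
As a reminder of the mechanics, here is a minimal sketch of a nested-model comparison. It uses the built-in mtcars data as a stand-in, and the model names small and big are just illustrative choices.

```r
# mtcars ships with R; it stands in for the laptops data here
small = lm(mpg ~ wt, data = mtcars)
big   = lm(mpg ~ wt + factor(cyl), data = mtcars)

# F-test of H0: all coefficients added by factor(cyl) are zero.
# small is nested in big because big contains every term in small.
cmp = anova(small, big)
cmp
```

The second row of the table gives a single F-test for the whole factor, rather than one t-score per level.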


Question 3:

Consider the following model:

  lm_3 = lm(Price ~ Port.Replicator + Bluetooth + Manufacturer, data = laptops)

If you had to remove exactly one of the three variables from the model, which one would you remove? Why?

YOUR CODE HERE

Question 4:

Consider the issue you noticed in 2(c). Soon, we will want to build our full regression model, and we will have to decide whether to include Operating.System or Manufacturer. Regress Price on each of these two variables individually. Which one would you rather include in the full model? Justify your answer.

  YOUR CODE HERE

Collinearity

Recall from lecture that one major concern in multiple regression is collinearity, or correlation between explanatory variables. One way to measure this is through the Variance Inflation Factor (VIF). Use the code below to install and load an R package that will calculate this, as well as to drop the uninformative variables we identified in Questions 1-4.

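  # Install (if needed) and load the car package, which provides vif()
  if (!requireNamespace("car", quietly = TRUE)) install.packages("car")
  library(car)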
  
  # Get rid of identified useless variables
  bad = c("Port.Replicator", "Subwoofer", "CDMA")
  lt = laptops[, !(names(laptops) %in% bad)]
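
To make the idea of a VIF concrete: for a numeric predictor, it is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R, using the built-in mtcars data as a stand-in. For numeric predictors this should match what vif( ) reports; for factors, vif( ) reports a generalized version instead.

```r
# Goal: VIF of wt in the model mpg ~ wt + disp + hp (built-in mtcars data)

# Step 1: regress the predictor of interest on the other predictors
aux = lm(wt ~ disp + hp, data = mtcars)

# Step 2: take the R-squared of that auxiliary regression
r2 = summary(aux)$r.squared

# Step 3: VIF = 1 / (1 - R^2); values near 1 mean little collinearity
vif_wt = 1 / (1 - r2)
vif_wt
```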

Question 5:

Try the following regression, and then use vif( ) to check for collinearity. Are there any variables we should be worried about? Decide which ones to remove (if any) from lt.

  lm_4 = lm(Price ~ .-Operating.System, data = lt)
    YOUR CODE HERE

Question 6:

Compare the following regressions via anova( ), and look at vif( ) for each. Make an argument for keeping either Manufacturer or Operating.System in your final regression.

  lm_5 = lm(Price ~ .-Manufacturer, data = lt)
  lm_6 = lm(Price ~ .-Operating.System, data = lt)

  YOUR CODE HERE

Narrowing down the model

We have now established a final set of candidate variables from which to predict the price of laptops. Install the R package called "leaps". This package automatically performs several types of variable selection.


Question 7:

  a. Look at the documentation for the function regsubsets( ). How many types of variable selection can be performed? What are they? Which measures of model fit does the function output?

  b. Apply regsubsets( ) to a regression predicting Price from all reasonable variables, using forward selection. Plot the results by using plot( ) on the output. Use the option scale = "adjr2" inside plot( ) to change the measure of model fit to adjusted R-squared.
  YOUR CODE HERE
  c. Using regsubsets( ) to search exhaustively, with Mallows' Cp as the measure of model fit, what is the best model for predicting Price?
  YOUR CODE HERE
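
If Mallows' Cp is unfamiliar, the sketch below computes it by hand using the usual textbook definition, Cp = SSE_p / s^2 - n + 2p, where s^2 is the error variance estimated from the full model and p counts the submodel's coefficients including the intercept. It uses the built-in mtcars data as a stand-in, and the choice of predictors is arbitrary. Smaller values, close to p, indicate a submodel that loses little relative to the full model.

```r
# Full model and a candidate submodel (built-in mtcars data)
full = lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
sub  = lm(mpg ~ wt + hp, data = mtcars)

n  = nrow(mtcars)
s2 = summary(full)$sigma^2   # error variance estimate from the full model
p  = length(coef(sub))       # coefficients in the submodel, incl. intercept

# Mallows' Cp for the submodel
cp_sub = sum(resid(sub)^2) / s2 - n + 2 * p
cp_sub
```

By construction, applying the same formula to the full model returns exactly its own number of coefficients, which is a handy sanity check.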

Question 8:

Use your final model from 7c for the following:

  a. Make a plot of the predicted prices of each laptop in the dataset versus the true prices. Hint: use predict( ). Is there anything we might be concerned about from these predictions?

    YOUR CODE HERE
  b. Look at some diagnostic plots and/or measurements for your final model, and comment on them.
    YOUR CODE HERE

Your Turn

Suppose you are consulting in marketing. One of your clients, Cooper, says "Customers treat all PC manufacturers the same. People only pay more for some brands because those laptops happen to include better features." Another client, Tina, says "No, customers have a preference for specific manufacturers, and they will pay more for these brands even if the laptops are otherwise identical."

Based on this dataset, who do you think is right, Cooper or Tina? Do you believe price differences in PCs are only due to different features, or is there a manufacturer effect as well? Be creative in your answer; go beyond your response to Question 6. Make sure to support your argument with plots and clear explanations.

Note: A "PC" in this case refers to any laptop that is not made by Apple.