In this case study, you will continue to perform multiple regression, but you will be asked to think about which variables should or should not be included.
First we will predict the price of a laptop based on many variables, both quantitative and categorical. Begin by downloading the data as usual. By now, you should find it natural to explore basic information about a dataset and its variables after downloading.
laptops = read.csv("laptops.csv")
summary(laptops)
## Max.Horizontal.Resolution Memory.Technology Installed.Memory Processor.Speed
## Length:99 Length:99 Min. : 250.0 Min. : 867
## Class :character Class :character 1st Qu.: 512.0 1st Qu.:1600
## Mode :character Mode :character Median :1000.0 Median :1800
## Mean : 907.6 Mean :1711
## 3rd Qu.:1000.0 3rd Qu.:2000
## Max. :2000.0 Max. :2330
## Processor Manufacturer Infrared Bluetooth
## Length:99 Length:99 Length:99 Length:99
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Docking.Station Port.Replicator Fingerprint Subwoofer
## Length:99 Length:99 Length:99 Length:99
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## External.Battery CDMA Operating.System Warranty.Days
## Length:99 Length:99 Length:99 Min. : 365.0
## Class :character Class :character Class :character 1st Qu.: 365.0
## Mode :character Mode :character Mode :character Median : 365.0
## Mean : 489.9
## 3rd Qu.: 365.0
## Max. :1095.0
## Price
## Min. : 539.0
## 1st Qu.: 796.5
## Median :1075.0
## Mean :1091.9
## 3rd Qu.:1299.5
## Max. :1899.0
Summarize the data, and fix anything that seems nonsensical. (This should be your first step before any analysis.)
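One common kind of nonsensical value is a numeric variable that was read in as text. The following is a minimal sketch on a made-up data frame (`toy`, `res`, and `bt` are hypothetical names, not columns of `laptops.csv`) of how such a problem might be spotted and repaired:

```r
# Hypothetical toy data: 'res' should be numeric but was read in as text
toy <- data.frame(
  res = c("1024", "1280", "unknown", "1440"),
  bt  = c("Yes", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# Check how each column is stored
sapply(toy, class)

# Convert text to numbers; entries that are not numbers become NA
toy$res <- suppressWarnings(as.numeric(toy$res))

# Store a Yes/No column as a factor so that summary() reports counts per level
toy$bt <- factor(toy$bt)

# Count the missing values created by the conversion
sum(is.na(toy$res))
```

Whether to drop, recode, or keep the resulting `NA` values depends on what the original text entries meant, so inspect them before converting.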
for(i in 1:ncol(laptops)){
par(ask = TRUE)
plot(laptops[,i], xlab = names(laptops)[i])
}
What did this do? What was the role of the line par(ask = TRUE)? How did we use the loop to get each variable name to print on the x-axis?
As you looked at the plots, did anything stand out to you as a possible problem for regression?
Price. Comment on what you see.
Note: When you are done with this question, change the code chunks to eval = FALSE, to avoid printing all the plots in your final output.
For each of the following regressions, explain what is wrong with the output of lm( ), and why exactly it occurred. Explain your answers with appropriate plots or tables where possible.
# a
lm_a = lm(Price ~ Subwoofer, data = laptops)
# b
lm_b = lm(Price ~ Max.Horizontal.Resolution^2, data = laptops)
summary(lm_b)
# c
lm_c = lm(Price ~ Manufacturer + Operating.System, data = laptops)
# d
lm_d1 = lm(Price ~ Processor.Speed+Processor, data = laptops)
lm_d2 = lm(Price ~ Processor.Speed*Processor, data = laptops)
Recall that we can use ANOVA tests to compare two multiple regressions, when one model is nested in the other. This is particularly useful when the models have many factors, so it might be hard to tell which variable is more significant from the t-scores.
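As a reminder of the mechanics, here is a minimal sketch of a nested-model comparison on simulated data. The data frame `toy` and the variables `y`, `x`, and `g` are made up for illustration and are unrelated to the laptop data.

```r
set.seed(1)
n <- 60

# Hypothetical data: y depends on x and on a three-level factor g
g <- factor(rep(c("A", "B", "C"), each = n / 3))
x <- rnorm(n)
shift <- c(0, 3, 6)[as.integer(g)]   # group effect for levels A, B, C
y <- 2 + 1.5 * x + shift + rnorm(n)
toy <- data.frame(y, x, g)

small <- lm(y ~ x, data = toy)       # nested model
big   <- lm(y ~ x + g, data = toy)   # same model plus the factor g

# F-test of whether adding g improves the fit beyond x alone
cmp <- anova(small, big)
cmp
```

The factor `g` contributes two coefficients, so the test has 2 degrees of freedom. A single F-test for the whole factor is often easier to interpret than the separate t-tests for each level.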
Consider the following model:
lm_3 = lm(Price ~ Port.Replicator + Bluetooth + Manufacturer, data = laptops)
If you had to remove exactly one of the three variables from the model, which one would you remove? Why?
YOUR CODE HERE
Consider the issue you noticed in 2(c). Soon, we will want to build our full regression model, and we will have to decide whether to include Operating.System or Manufacturer. Regress each of these two variables individually against Price. Which one would you rather include in the full model? Justify your answer.
YOUR CODE HERE
Recall from lecture that one major concern in Multiple Regression is collinearity, or correlation between explanatory variables. One way to measure this is through the Variance Inflation Factor. Use the code below to load an R package that will calculate this (install it first with install.packages("car") if needed), as well as to get rid of the useless variables we discovered in Questions 1-4.
# Load the car package, which provides vif()
# (run install.packages("car") once beforehand if it is not installed)
library(car)
# Get rid of identified useless variables
bad = c("Port.Replicator", "Subwoofer", "CDMA")
lt = laptops[, !(names(laptops) %in% bad)]
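To build intuition for what the Variance Inflation Factor measures, the sketch below computes it by hand in base R on simulated data. The names `toy`, `x1`, `x2`, `x3`, and `vif_by_hand` are hypothetical. For numeric predictors, this is the usual definition 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others. When a model contains factors, vif( ) from car reports a generalized version instead, so its output will look different.

```r
set.seed(2)
n <- 100

# Hypothetical predictors: x2 is built to be nearly a copy of x1
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

# VIF for one predictor: regress it on the others, then take 1 / (1 - R^2)
vif_by_hand <- function(var, data) {
  others <- setdiff(names(data), var)
  fit <- lm(reformulate(others, response = var), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(names(toy), vif_by_hand, data = toy)
round(vifs, 1)
```

Here `x1` and `x2` should show very large values because each is almost fully explained by the other, while `x3` should stay close to 1.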
Try the following regression, and then use vif( ) to check for collinearity. Are there any variables we should be worried about? Decide which ones to remove (if any) from lt.
lm_4 = lm(Price ~ .-Operating.System, data = lt)
YOUR CODE HERE
Compare the following regressions via anova( ), and look at vif( ) for each. Make an argument for keeping either Manufacturer or Operating.System in your final regression.
lm_5 = lm(Price ~ .-Manufacturer, data = lt)
lm_6 = lm(Price ~ .-Operating.System, data = lt)
YOUR CODE HERE
We have now established a final set of candidate variables from which to predict the price of laptops. Install the R package called "leaps". This package automatically performs several types of variable selection.
Read the documentation for regsubsets( ). How many types of variable selection can be performed? What are they? Which measures of model fit does the function output?
Apply regsubsets( ) to a regression predicting Price from all reasonable variables, using forward selection. Plot the results by using plot( ) on the output. Use the option scale = "adjr2" inside plot( ) to change the measure of model fit to be adjusted R-squared.
YOUR CODE HERE
Using regsubsets( ) to search exhaustively, and using Mallows' Cp as the measure of model fit, what is the best model for predicting Price?
YOUR CODE HERE
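For reference, Mallows' Cp can be computed by hand, which may help when interpreting the output. The sketch below uses simulated data and the standard definition Cp = RSS / s^2 - n + 2p, where s^2 is the residual variance estimate from the full model and p counts the coefficients in the candidate model, including the intercept. All names (`toy`, `full`, `mallows_cp`) are hypothetical, and this is an illustration of the formula, not a replacement for the leaps package.

```r
set.seed(3)
n <- 80

# Hypothetical data: y depends on x1 and x2 but not on x3
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 1 * toy$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = toy)
s2 <- summary(full)$sigma^2   # error variance estimate from the full model

# Cp = RSS / s2 - n + 2p, where p counts coefficients including the intercept
mallows_cp <- function(fit) {
  rss <- sum(residuals(fit)^2)
  p <- length(coef(fit))
  rss / s2 - n + 2 * p
}

cp <- c(
  x1_only = mallows_cp(lm(y ~ x1, data = toy)),
  x1_x2   = mallows_cp(lm(y ~ x1 + x2, data = toy)),
  full    = mallows_cp(full)
)
round(cp, 2)
```

By construction the full model always has Cp equal to its own number of coefficients. Candidate models are usually preferred when their Cp is small and close to p.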
Use your final model in 6c for the following:
a. Make a plot of the predicted prices of each laptop in the dataset versus the true prices. Hint: use predict( ). Is there anything we might be concerned about from these predictions?
YOUR CODE HERE
YOUR CODE HERE
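As a reminder of how predict( ) behaves, here is a minimal sketch of a predicted-versus-true plot on simulated data. The data frame `toy` and the model `fit` are hypothetical stand-ins, not the laptop model.

```r
set.seed(4)
n <- 50

# Hypothetical data standing in for a fitted price model
toy <- data.frame(x = runif(n, 1, 10))
toy$y <- 500 + 100 * toy$x + rnorm(n, sd = 80)

fit <- lm(y ~ x, data = toy)

# With no newdata argument, predict() returns fitted values for the training rows
pred <- predict(fit)

# Points near the dashed 45-degree line are predicted well
plot(toy$y, pred, xlab = "True value", ylab = "Predicted value")
abline(0, 1, lty = 2)
```

Systematic departures from the dashed line, such as a curve or a fan shape, suggest the model is missing structure or that the error variance is not constant.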
Suppose you are consulting in marketing. One of your clients, Cooper, says "Customers treat all PC manufacturers the same. People only pay more for some brands because those laptops happen to include better features." Another client, Tina, says "No, customers have a preference for specific manufacturers, and they will pay more for these brands even if the laptops are otherwise identical."
Based on this dataset, who do you think is right, Cooper or Tina? Do you believe price differences in PCs are only due to different features, or is there a manufacturer effect as well? Be creative in your answer; go beyond your response to Question 5. Make sure to support your argument with plots and clear explanations.
Note: A "PC" in this case refers to any laptop that is not made by Apple.