Dataset
Description
A data frame with 392 observations on the following 9 variables.
mpg miles per gallon
cylinders Number of cylinders between 4 and 8
displacement Engine displacement (cu. inches)
horsepower Engine horsepower
weight Vehicle weight (lbs.)
acceleration Time to accelerate from 0 to 60 mph (sec.)
year Model year (modulo 100)
origin Origin of car (1. American, 2. European, 3. Japanese)
name Vehicle name
Dataset
IMPORTING THE DATASET
DATA CLEANING
Checking for missing values
Checking for duplicates
Checking for structure in the dataset
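A minimal R sketch of these steps, assuming the data are taken from the Auto data frame in the ISLR2 package (reading a local Auto.csv with read.csv() would work the same way); the object name auto is an arbitrary choice:

# Importing the dataset (assumes the ISLR2 package is installed)
library(ISLR2)
auto <- Auto

# Checking for missing values
colSums(is.na(auto))

# Checking for duplicates
sum(duplicated(auto))

# Checking the structure of the dataset
str(auto)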
EXPLORATORY DATA ANALYSIS
CMSC 177
Exploratory Data Analysis
1. Perform exploratory data analysis on the fuel efficiency of the cars. In particular, produce the necessary summary measures and visualizations to allow you to comment on the relation, if any, of the mpg, year and origin of the cars. (Note: For origin, 1 = US, 2 = Europe, 3 = Japan.)
Scatter Plot of MPG vs. Year
There is a noticeable increase in MPG as we move right along the x-axis (year), indicating an improvement in fuel efficiency over these years. Cars manufactured in later years tend to have higher MPG, suggesting advancements in technology and design that have led to more fuel-efficient vehicles.
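One way to produce this plot in base R, reusing the auto object from the import sketch; the trend line is an optional addition:

# Scatter plot of MPG against model year
plot(auto$year, auto$mpg,
     xlab = "Model year", ylab = "Miles per gallon",
     main = "MPG vs. Year")
# Least-squares line to highlight the upward trend
abline(lm(mpg ~ year, data = auto), col = "red")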
Boxplot of MPG by Origin
The median MPG appears to be highest for cars
from Japan (3), followed by Europe (2), and then the
US (1). This suggests that, on average, cars from
Japan and Europe are more fuel-efficient than those
from the US.
There are several outliers in the data, particularly
for the US and Europe. These outliers represent cars
that have exceptionally high or low MPG compared
to the majority of cars from the same region.
The overall distribution of MPG values is different
for each region. Japanese cars tend to have higher
MPG values, while US cars tend to have lower MPG
values. European cars fall somewhere in between.
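A possible base R version of this boxplot (the region labels follow the coding 1 = US, 2 = Europe, 3 = Japan given in the prompt):

# Boxplot of MPG by region of origin
boxplot(mpg ~ origin, data = auto,
        names = c("US", "Europe", "Japan"),
        xlab = "Origin", ylab = "Miles per gallon",
        main = "MPG by Origin")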
2. Produce a scatterplot matrix which includes all of the variables in the data set.
MPG vs Other Variables
Cylinders: There is a clear negative relationship between mpg and the number of cylinders. Cars with more cylinders tend to have lower mpg.
Displacement: Similar to cylinders, there is a strong negative correlation between mpg and engine displacement. Larger engines tend to be less fuel-efficient.
Horsepower: There is also a negative correlation between mpg and horsepower, indicating that more powerful cars usually consume more fuel.
Weight: Heavier cars have lower mpg, showing a negative correlation.
Acceleration: There is a slight positive correlation between mpg and acceleration (time to reach 60 mph), suggesting that more fuel-efficient cars tend to accelerate more slowly.
Year: There is a positive trend between mpg and the year of the car, indicating improvements in fuel efficiency over time.
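The scatterplot matrix discussed above can be produced with pairs(), dropping the qualitative name column; a minimal sketch:

# Scatterplot matrix of all variables except name
pairs(auto[, names(auto) != "name"])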
3. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
For mpg:
Strong negative correlation with cylinders (-0.78), displacement (-0.81), horsepower (-0.78), and weight (-0.83), indicating that cars with higher mpg tend to have fewer cylinders, lower displacement, less horsepower, and lower weight.
Positive correlation with acceleration (0.42) and year (0.58), suggesting that more recent models, and models that take longer to reach 60 mph, tend to have higher mpg.
Positive correlation with origin (0.57), indicating that cars from certain origins (likely non-US) tend to have higher mpg.
For year:
Positive correlation with origin (0.18), indicating a trend over the years possibly linked to changes in manufacturing origins or standards.
For origin:
This attribute correlates positively with mpg, acceleration, and year, suggesting regional differences in car manufacturing that affect these attributes.
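A short sketch of the computation for (3), again excluding the name column:

# Correlation matrix of the quantitative variables, rounded for readability
round(cor(subset(auto, select = -name)), 2)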
4. Use the lm() function to perform a multiple linear
regression with mpg as the response and all other variables
except name as the predictors. Use the summary() function
to print the results. Comment on the output. In particular,
answer the following questions:
i. Is there a relationship between the
predictors and the response?
Yes. Several predictors have a statistically significant relationship with the response, since their associated p-values are well below the 0.05 significance level.
ii. Which predictors appear to have a
statistically significant relationship to
the response?
The predictors displacement, weight, year, and origin have a statistically significant relationship with the response.
iii. What does the coefficient for the
year variable suggest?
The coefficient of year is 0.7507, which is about 3/4. It suggests that, with the other predictors held fixed, mpg increases by roughly 0.75 for each additional model year, i.e. by about 3 mpg every 4 years.
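The regression commented on in (4) can be fitted as follows (a minimal sketch; fit is an arbitrary object name):

# Multiple linear regression of mpg on all predictors except name
fit <- lm(mpg ~ . - name, data = auto)
summary(fit)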
5. Use the plot() function to produce diagnostic plots of the
linear regression fit. Comment on any problems you see
with the fit. Do the residual plots suggest any unusually
large outliers? Does the leverage plot identify any
observations with unusually high leverage?
The residual plot shows the residuals (prediction errors) of the model plotted against the fitted values. If the plot shows a pattern, it may indicate that the model is not adequately capturing the underlying relationships in the data. Since the residual plot does show a pattern, the homoscedasticity assumption appears to be violated.
The normal Q-Q plot compares the quantiles of the residuals to the quantiles of a normal distribution. If the residuals are normally distributed, the points should lie approximately on a straight line. Since the points here lie close to the line, the residuals appear to be approximately normally distributed.
The scale-location plot shows the square root of the standardized residuals plotted against the fitted values. This plot helps to spot non-constant variance and potential outliers. Since no standardized residuals fall outside the range -3 to 3, the data set appears to have no unusually large outliers.
The residuals-versus-leverage plot shows the influence of each observation on the fit of the model. Observations with high leverage can have a disproportionate influence on the fit, and the Cook's distance contours (the dashed red lines) flag observations with high influence. Since no observation falls beyond the Cook's distance threshold, the data set does not appear to contain any unduly influential high-leverage points.
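The four diagnostic plots discussed above come from calling plot() on the fitted model, for example:

# Residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))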
6. Use the * and : symbols to fit linear regression models
with interaction effects. Do any interactions appear to be
statistically significant?
The interaction horsepower:displacement is statistically significant.
The interaction horsepower:origin is statistically significant.
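The exact interaction model fitted above is not shown; one possible specification that includes the two interactions mentioned is:

# Full model plus the horsepower:displacement and horsepower:origin interactions
fit_int <- lm(mpg ~ . - name + horsepower:displacement + horsepower:origin,
              data = auto)
summary(fit_int)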
7. Try a few different transformations of the variables, such as log x, √x, x^2. Comment on your findings.
Transform log(acceleration): log(acceleration) is still very significant, but less significant than acceleration.
Transform log(horsepower): log(horsepower) is more significant than horsepower.
Transform horsepower^2: squaring horsepower does not change its significance.
Transform cylinders^0.5: the square-root transformation of cylinders is more significant than cylinders.
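One way to compare each transformation with its untransformed counterpart is to swap the transformed term into the full model; the exact models used above are not shown, so this is only a sketch:

# log transformations
summary(lm(mpg ~ . - name - acceleration + log(acceleration), data = auto))
summary(lm(mpg ~ . - name - horsepower + log(horsepower), data = auto))
# polynomial and square-root transformations
summary(lm(mpg ~ . - name + I(horsepower^2), data = auto))
summary(lm(mpg ~ . - name - cylinders + sqrt(cylinders), data = auto))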
Logistic Regression & Support Vector Machine
CMSC 177
Logistic Regression & SVM
1. Create a binary variable, mpg01, that contains a 1 if mpg
contains a value above its median, and a 0 if mpg contains a
value below its median.
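A minimal sketch of this step, adding mpg01 to the auto data frame used earlier:

# Binary response: 1 if mpg is above its median, 0 otherwise
auto$mpg01 <- ifelse(auto$mpg > median(auto$mpg), 1, 0)
table(auto$mpg01)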
2. Explore the data graphically in order to investigate the
association between mpg01 and the other features. Which
of the other features seem most likely to be useful in
predicting mpg01? Describe your findings.
Weight appears to be a strong predictor
of mpg01 because lighter vehicles are
more likely to have higher fuel
efficiency (represented by mpg01 = 1).
Displacement could be a useful predictor, as smaller engines (lower displacement) are typically more fuel-efficient.
Horsepower is likely a good predictor,
indicating that less powerful engines are
more likely to be fuel-efficient.
The number of cylinders could be a
useful predictor, as vehicles with fewer
cylinders tend to be more fuel-efficient.
Based on the scatter plots, the
features that seem most likely to
be useful in predicting mpg01 are
cylinders, displacement,
horsepower, and weight. These
features show clearer trends or
separations when plotted against
mpg01, suggesting they have
stronger associations with the
binary mpg variable.
3. Split the data into a training set and a test set.
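The split ratio and seed used here are not stated; a 70/30 random split with a fixed seed is one common choice:

# Random 70/30 split into training and test sets
set.seed(1)
train_idx <- sample(seq_len(nrow(auto)), size = round(0.7 * nrow(auto)))
train <- auto[train_idx, ]
test  <- auto[-train_idx, ]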
4. Perform logistic regression on the training data in order
to predict mpg01 using the variables that seemed most
associated with mpg01 in (2). What is the test error of the
model obtained?
The model misclassifies approximately 12% of the observations
in the test dataset.
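A sketch of the logistic regression, using the predictors identified in (2) and the train/test split from the previous step; the 12% figure above comes from the authors' own split, so the number obtained here may differ:

# Logistic regression of mpg01 on the most associated predictors
glm_fit <- glm(mpg01 ~ cylinders + displacement + horsepower + weight,
               data = train, family = binomial)
glm_prob <- predict(glm_fit, test, type = "response")
glm_pred <- ifelse(glm_prob > 0.5, 1, 0)
mean(glm_pred != test$mpg01)   # test error rate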
5. Repeat part (4), but now using Naïve Bayes.
The Naive Bayes model misclassifies approximately 13% of the observations in the test dataset.
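A corresponding Naive Bayes sketch, assuming the e1071 package and the same predictors and split:

# Naive Bayes with the same predictors
library(e1071)
nb_fit <- naiveBayes(as.factor(mpg01) ~ cylinders + displacement +
                       horsepower + weight, data = train)
nb_pred <- predict(nb_fit, test)
mean(nb_pred != test$mpg01)    # test error rate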
6. Repeat part (4), but now using support vector machines.
The output of 1 indicates a test error of 100%, meaning every prediction made by the SVM model on the test dataset disagrees with the actual values of mpg01. (An error rate of 100% usually points to a coding mismatch, such as inconsistent factor levels between the predictions and the test labels, rather than a model that is genuinely always wrong.)
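The kernel and cost used for the SVM above are not stated; a linear-kernel sketch with e1071 would look like this (converting mpg01 to a factor on both sides avoids the label-coding mismatch mentioned above):

# Support vector machine with a linear kernel (kernel and cost are assumptions)
library(e1071)
svm_fit <- svm(as.factor(mpg01) ~ cylinders + displacement + horsepower + weight,
               data = train, kernel = "linear", cost = 1)
svm_pred <- predict(svm_fit, test)
mean(svm_pred != as.factor(test$mpg01))   # test error rate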