Regression Analysis
Regression analysis estimates the relationship between a dependent (response) variable and one or more independent (predictor) variables. A regression equation can be used to predict the value of the response variable Y for a given value of each predictor variable X. Regression analysis can also be used to fit curves and assess interaction effects. Regression techniques include linear regression, logistic regression, polynomial regression, and stepwise regression. These techniques are chosen to suit the research question and used to solve complex problems, forecast values, and predict trends in data.
Case - Predicting the Price of a Home
As a data analyst working with a prominent real estate company in Seattle, I have been entrusted with the task of harnessing the power of data to drive better decision-making and pricing strategies. Our mission revolves around utilizing historical data to uncover patterns and insights that can significantly impact the pricing strategies for our clients' homes. We understand that the real estate market is a dynamic and complex landscape, where a multitude of factors come into play when determining the selling price of a property. To this end, I was provided with a substantial dataset comprising historical information on a myriad of attributes associated with houses in the Seattle area, along with their corresponding selling prices.
In the subsequent sections of this report, I will outline the methodologies, findings, and recommendations derived from this analysis. Our ultimate goal is to enhance our company's ability to serve our clients by offering them a competitive edge in the challenging real estate market of Seattle.
Goal
The primary objective of this project is to develop robust regression models that can accurately predict a house's selling price based on a range of attributes. By delving into this dataset, we aim to uncover the relationships and dependencies between various house features and the ultimate selling price. The insights and models developed through this analysis will be instrumental in empowering our real estate company to make more informed, data-driven decisions when setting prices for our clients' properties. With these models at our disposal, we can help clients set competitive and attractive prices, ultimately ensuring successful and timely property sales.
Result
...
Preparing the Dataset
I will be using the housing_v2.csv file to access the housing dataset for this report. The dataset contains 23 columns and 2692 rows. However, this report will focus on 13 specific variables: price, bedrooms, bathrooms, sqft_living, sqft_above, sqft_lot, age, grade, appliance_age, crime, backyard, school_rating, and view. These variables span multiple data types.
housing <- read.csv(file="housing_v2.csv", header=TRUE, sep=",")
# converting appropriate variables to factors
housing <- within(housing, {
  view <- factor(view)
  backyard <- factor(backyard)
})
# number of columns
ncol(housing)
# number of rows
nrow(housing)
23
2692
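To confirm the data types of the selected columns, a quick structure check like the following could be used (a sketch; the column names are assumed to match the variable list below):
# Inspect the data types of the 13 variables used in this report
myvars <- c("price", "bedrooms", "bathrooms", "sqft_living", "sqft_above",
            "sqft_lot", "age", "grade", "appliance_age", "crime",
            "backyard", "school_rating", "view")
str(housing[myvars])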
Variable | What does it represent?
price | Sale price of the home
bedrooms | Number of bedrooms
bathrooms | Number of bathrooms
sqft_living | Size of the living area in sqft
sqft_above | Size of the upper level in sqft
sqft_lot | Size of the lot in sqft
age | Age of the home
grade | Measure of craftsmanship and the quality of the home
appliance_age | Average age of all appliances in the home
crime | Crime rate per 100,000 people
backyard | Home has a backyard (1) or not (0)
school_rating | Average rating of schools in the area
view | Home backs out to a lake (2), backs out to trees (1), or backs out to a road (0)

Model #1 : First Order Regression
fig 1.1
The image to the right (figure 1.1) is a scatterplot of Price against Living Area. As the living area square footage increases, the price of the home increases, so there is a positive relationship between these two variables. The correlation coefficient for Price against Living Area is 0.6895, indicating a moderately strong positive relationship in which both variables move in the same direction.


fig 1.2
The image to the left (figure 1.2) is a scatterplot of Price against Age of Home. This graph shows there isn't an obvious relationship between the two variables. The correlation coefficient for Price against Age of Home is -0.0746. This value is extremely low, showing a very weak negative relationship between these two variables (to the extent they are related, they move in opposite directions).
Considering both images, the data points do not appear to be quadratic or show curvature, so I believe a linear model will suffice.
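A sketch of how these two scatterplots could be produced in R (the plotting options are illustrative, not the exact settings used for the figures):
# Scatterplot of price against living area (figure 1.1)
plot(housing$sqft_living, housing$price, main = "Price against Living Area",
     xlab = "Living Area (sqft)", ylab = "Price", col = "pink", pch = 19)
# Scatterplot of price against age of home (figure 1.2)
plot(housing$age, housing$price, main = "Price against Age of Home",
     xlab = "Age of Home (years)", ylab = "Price", col = "pink", pch = 19)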
Reporting Results

fig 1.3
The general form of the regression model is
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6$
where $x_1$ is the living area (sqft), $x_2$ is the upper-level area (sqft), $x_3$ is the age of the home, $x_4$ is the number of bathrooms, $x_5 = 1$ for a view of trees (0 otherwise), and $x_6 = 1$ for a view of a lake (0 otherwise).
The prediction equation of the multiple regression model is obtained by substituting the coefficient estimates in figure 1.3 for the $\beta$ terms. Setting $x_5 = 1$ and $x_6 = 0$ gives the multiple regression model for view 1 (trees); setting $x_5 = 0$ and $x_6 = 1$ gives the multiple regression model for view 2 (lake).
# Correlation Matrix
myvars1 <- c("price", "sqft_living", "age")
housing_subset1 <- housing[myvars1]
print("cor")
corr_matrix <- cor(housing_subset1, method = "pearson")
round(corr_matrix, 4)
The response variable is the price. The predictor variables are living area, upper-level area, age of the home, number of bathrooms, and view (respectively). All of these variables are quantitative except view, which is qualitative. The prediction model is created using the coefficient estimates provided in figure 1.3. The estimated coefficient for the living area variable is 129.3, meaning that on average the price of a home increases by $129.30 for each additional square foot of living area, holding the other variables constant. The estimated coefficient for the lake view variable is 249,000, meaning the price of the home increases by $249,000 in the presence of a lake view, relative to a home that backs out to the road, holding the other variables constant.
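A minimal sketch of how this first-order model could be fit (the object name model1 is taken from the prediction code later in this report; the report does not show the fitting call itself):
# Fit the first-order multiple regression model for price
model1 <- lm(price ~ sqft_living + sqft_above + age + bathrooms + view,
             data = housing)
summary(model1)  # coefficient estimates, R-squared, and overall F-statistic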
R-Squared
This value measures how much of the variance in the dependent variable is explained by the independent variables in the model. It represents how well the model fits the data; a higher value signifies a better fit.
The R-squared value is 0.6029 and the adjusted R-squared value is 0.602. These values show the model explains roughly 60% of the variation in price, making it a moderately good fit for the variables included.
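Assuming the fitted model1 object from the earlier sketch, these values can be read directly from the model summary:
# Extract R-squared and adjusted R-squared from the fitted model
summary(model1)$r.squared      # multiple R-squared (0.6029)
summary(model1)$adj.r.squared  # adjusted R-squared (0.602)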
Residuals against Fitted Values
# Creating Residuals against Fitted Values Plot
residuals <- resid(model1)
fitted_values <- fitted(model1)
plot(fitted_values, residuals, main = "Residuals against Fitted Values",
     xlab = "Fitted Values", ylab = "Residuals", col = "pink", pch = 19)

A residuals against fitted values plot is a scatter plot with residuals on the y-axis and fitted values on the x-axis. This plot can be used to reveal non-linearity, non-constant variance, and outliers within the data. The residuals against fitted values plot for this model doesn't show an apparent pattern, which supports the assumption of homoscedasticity (constant variance).
Normal Q-Q Plot
# Q-Q Plot
qqnorm(residuals, pch = 19, col = "pink", frame = FALSE)
qqline(residuals, col = "red", lwd = 2)

The normal Q-Q plot helps assess whether the residuals satisfy the normality assumption. Since most points fall close to the line, it can be assumed the residuals are approximately normally distributed.
Evaluating Model #1
The null hypothesis: $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = 0$
The alternative hypothesis: $H_a:$ at least one of the coefficients $\beta_1, \dots, \beta_6$ is not equal to 0
The overall P-value is less than 2.2e-16, which is below the 5% level of significance. We can therefore reject the null hypothesis and conclude the model is statistically significant.
The F-statistic is 679.3.
Looking at each individual variable to perform the beta (coefficient) test, the null hypothesis is $H_0: \beta_i = 0$.
The alternative hypothesis is $H_a: \beta_i \neq 0$.
The P-value for the square footage of the living area (sqft_living) is < 2e-16 and the P-value for the square footage of the upper level (sqft_above) is 0.00894.
The P-value for the age of the home (age) is < 2e-16 and the P-value for the number of bathrooms (bathrooms) is 9.13e-13.
The P-value for a view of trees (view1) and for a view of a lake (view2) is < 2e-16.
Each of these P-values is below the 5% level of significance, so we reject the null hypothesis for every predictor and conclude that each variable is statistically significant.
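These individual P-values can be read from the coefficient table of the model summary (a sketch, assuming the model1 object from earlier):
# Coefficient estimates, standard errors, t-values, and P-values for each predictor
summary(model1)$coefficients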
Making Predictions Using Model #1
Scenario 1.1: A home has a 2150 sqft living area, 1050 sqft upper living area, is 15 years old, has 3 bathrooms, and backs out to the road. Including these variables, the model predicts a fitted value where the price is $459,828.20.
The 90% prediction interval provides the lower and upper bounds of the price at ($239,563, $680,093.40).
The 90% confidence interval is ($446,087.90, $473,568.50).
# making first prediction
newdata <- data.frame(sqft_living = 2150, sqft_above = 1050, age = 15, bathrooms = 3, view = '0')
print("prediction interval") prediction_pred_int <- predict(model1, newdata, interval="predict", level=0.90) round(prediction_pred_int, 4)
print("confidence interval") prediction_conf_int <- predict(model1, newdata, interval="confidence", level=0.90) round(prediction_conf_int, 4)
Scenario 1.2: A home has a 4250 sqft living area, 2100 sqft upper living area, is 5 years old, has 5 bathrooms, and backs out to the lake. Including these variables, the model predicts a fitted value where the price is $1,074,285.
The 90% prediction interval provides the lower and upper bounds of the price at ($852,522.60, $1,296,048).
The 90% confidence interval is ($1,045,117, $1,103,454).
The prediction interval means you can be 90% sure that a new individual observation will fall within the given range. The confidence interval suggests with 90% confidence that the population parameter (the mean price for homes with these characteristics) is likely to fall within the interval. The prediction interval is wider than the confidence interval because it accounts for both the uncertainty in estimating the mean response and the random variation of each individual value.
# making the second prediction
newdata <- data.frame(sqft_living = 4250, sqft_above = 2100, age = 5, bathrooms = 5, view = '2')
print("prediction interval") prediction_pred_int <- predict(model1, newdata, interval="predict", level=0.90) round(prediction_pred_int, 4)
print("confidence interval") prediction_conf_int <- predict(model1, newdata, interval="confidence", level=0.90) round(prediction_conf_int, 4)

Model #2 : Complete Second Order Regression Model with Quantitative Variables
fig 2.1
The image to the right (figure 2.1) is a scatterplot of Price against the average school rating in the area. As the average rating increases, the price of the home also increases, so there is a positive relationship between these two variables.



fig 2.2
The image to the left (figure 2.2) is a scatterplot of Price against the crime rate per 100,000 people. These variables move inversely: as the crime rate increases, the price of the home decreases.

I believe a second-order regression is appropriate because the relationships shown in the scatterplots follow a quadratic curve.
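A sketch of how these two scatterplots could be produced (plot settings are illustrative):
# Scatterplot of price against average school rating (figure 2.1)
plot(housing$school_rating, housing$price, main = "Price against School Rating",
     xlab = "Average School Rating", ylab = "Price", col = "pink", pch = 19)
# Scatterplot of price against crime rate (figure 2.2)
plot(housing$crime, housing$price, main = "Price against Crime Rate",
     xlab = "Crime Rate per 100,000 People", ylab = "Price", col = "pink", pch = 19)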
Reporting Results

fig 2.3
The general form of the complete second-order regression model is
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$
where $x_1$ is the average school rating and $x_2$ is the crime rate per 100,000 people.
The prediction equation of the multiple regression model is obtained by substituting the coefficient estimates in figure 2.3 for the $\beta$ terms.
The response variable is the price of the home. The predictor variables are the average school rating and the crime rate per 100,000 people. Both variables are quantitative, and their squared terms are included to fit the non-linear behavior. It is important to note that there is an interaction term between these two variables (shown by $x_1 x_2$). The prediction model is created using the coefficient estimates provided in figure 2.3.
What is an Interaction Term?
An interaction term is the product of two predictor variables (here $x_1 x_2$). It allows the effect of one predictor on the response to depend on the value of the other predictor; in this model, the effect of the school rating on price can change with the level of crime, and vice versa.
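A minimal sketch of how this complete second-order model could be fit (the object name model2 is an assumption; the report does not show the fitting call):
# Complete second-order model: first-order terms, interaction, and squared terms
model2 <- lm(price ~ school_rating + crime + school_rating:crime +
               I(school_rating^2) + I(crime^2), data = housing)
summary(model2)  # coefficient estimates, R-squared, and overall F-statistic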
Residuals against Fitted Values

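A sketch of the residuals-against-fitted-values plot for Model #2, assuming the model2 object from the sketch above:
# Residuals against fitted values for the second-order model
plot(fitted(model2), resid(model2), main = "Residuals against Fitted Values",
     xlab = "Fitted Values", ylab = "Residuals", col = "pink", pch = 19)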
Normal Q-Q Plot

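A sketch of the normal Q-Q plot for Model #2, assuming the same model2 object:
# Normal Q-Q plot of the second-order model residuals
qqnorm(resid(model2), pch = 19, col = "pink", frame = FALSE)
qqline(resid(model2), col = "red", lwd = 2)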