Vermont Real Estate Pricing Model

This is the term project for a regression class I took in the summer of 2021 as I worked toward my MS in Data Science at RIT. I looked at a dataset that has long interested me, the Vermont real estate tax data published weekly by the Vermont Dept. of Taxes. The model did not work well, but this project helped me get a grasp of linear (and non-linear) regression models.

Executive Summary

The price of real estate in Vermont is relatively stable, but as of summer 2021 it has climbed rapidly over the previous 18 months as low interest rates and the effects of the pandemic have created a sellers’ market. The state contains many second homes, used by residents of southern New England states and the New York / New Jersey area as getaways in the summer or winter.

This study attempts to find factors that contribute to the sales price of real estate in Vermont through a combination of continuous and categorical variables that go beyond the typical factors considered, like number of bedrooms and square footage.

The data comes from the state of Vermont tax department, which publishes roughly 30,000 real estate transactions per year with sales price and other data, in addition to town-by-town data on taxes, residential (a.k.a. ‘homestead’) ownership and more. 

In particular, the study looks at whether out-of-state buyers paid a premium; whether there is a premium paid for property in a “ski town”, as defined by the residential / non-residential ratio; whether the overall town property value impacts individual property value; and whether the length of time the seller owned the property impacts the sales price. These factors play into the current policy and political debates in Vermont about affordability and land ownership.

Introduction

The dataset
The dataset for this model comes from a random sample of 149 observations taken from more than 31,000 real estate transactions recorded in 2019 by the Vermont state tax department. These observations are matched with the grand list (total property value) for the property's town, as well as the proportion of residential property to non-residential property in that town. The latter is an indicator of whether the town is a so-called “ski town”; such towns tend to have a high proportion of second homes or rental units.

Variables
The variables were selected from those available in the state’s tax department database. The sale price of the property is the response variable. There are five continuous variables (land size, ownership duration in weeks, listed value in thousands, residential vs. non-residential percentage, and municipal grand list in ten millions) and one categorical variable, which is whether the property buyer was an in-state resident or from out of state. All of these are expected to have some impact on the sale price, at least according to general sources like the media. A note on the listed value: this is the value used by the town to calculate property tax, and it is only updated roughly every 10-15 years. It is not the price at which the property was listed for sale.

A few of the categorical variables I intended to investigate were abandoned early on as impractical. For example, there are more than 250 towns in Vermont, only a small number of which are represented in the sample, so including town as a variable, at least initially, is not practical.

The data is published in XML form for the years covered. I used R to clean the data, prune irrelevant fields, and limit the data to sales transactions (as opposed to transfers between family members), then reformatted it to CSV for use in the JMP statistical analysis program.

Objectives
The goal of this exercise is to develop a model to predict the sales price for a Vermont property, as well as to find which variables that are available in public databases impact that pricing. Several questions to explore lie at the heart of many contemporary political and policy debates around school funding and wealth distribution in the state, in particular in light of the pandemic and the influx of second home owners relocating to Vermont. These questions include: 

  • Do out-of-state buyers pay a premium? 
  • Is there a premium paid for property in a ‘ski town’, as defined by the resident / non-resident ownership ratio? 
  • Does the length of ownership positively or negatively impact the sale price? 
  • Does a town’s overall property value impact the sale price of an individual property?

Analysis

Model Adequacy
Analysis began with a least squares fit of all six variables to get an overview of the full model. The first impression is that the model with all the variables has some issues. There is some clustering in the residual by predicted plot (Figure 1.1), but not enough to cause major concern. The RSquare values are low (Figure 1.2), at around 50%, meaning the model leaves about half of the variability in sale price unexplained.
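
As a rough illustration of what this first step computes (the original analysis was run in JMP, not Python), the sketch below fits a six-regressor least squares model and reports RSquare. The data is synthetic and only mimics the shape of the sample (149 rows, 6 regressors); the function name fit_ols and every number in it are assumptions for illustration, not the Vermont data or its results.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares fit with an intercept; returns (coefficients, R-squared)."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimize ||y - A @ beta||^2
    resid = y - A @ beta
    ss_res = float(resid @ resid)                  # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())    # total variation
    return beta, 1.0 - ss_res / ss_tot

# Synthetic stand-in data, NOT the Vermont sample: 149 rows, 6 regressors,
# with only columns 2 and 4 actually driving the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(149, 6))
y = 200 + 40 * X[:, 2] + 15 * X[:, 4] + rng.normal(scale=45, size=149)

beta, r2 = fit_ols(X, y)
print("coefficients:", np.round(beta, 2))
print("R-squared:", round(r2, 3))
```

RSquare here is one minus the ratio of residual to total sum of squares, so a value near 0.5 means roughly half of the variation in the response is left unexplained.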

The normal probability plot for the residuals looks reasonable. It appears there are a few outliers, but in general the points follow the reference line (Figure 1.3). In addition, however, the PRESS statistic is very low at about 0.38 (Figure 1.4), which points to weak predictive performance and means the model has further issues.
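
For readers unfamiliar with PRESS, here is a minimal sketch of how it can be computed without refitting the model n times: each leave-one-out prediction error equals the ordinary residual divided by one minus that row's leverage. The sketch returns both raw PRESS and the R-squared derived from it. The data is synthetic and the function name press_stats is an assumption; none of it reproduces the figures in this report.

```python
import numpy as np

def press_stats(X, y):
    """Return (PRESS, PRESS-based R-squared) for an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)              # hat matrix: y_hat = H @ y
    h = np.diag(H)                         # leverages h_ii
    resid = y - H @ y                      # ordinary residuals
    loo_resid = resid / (1.0 - h)          # leave-one-out prediction errors
    press = float((loo_resid ** 2).sum())
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return press, 1.0 - press / ss_tot

# Synthetic stand-in data, NOT the Vermont sample.
rng = np.random.default_rng(1)
X = rng.normal(size=(149, 6))
y = 200 + 40 * X[:, 2] + 15 * X[:, 4] + rng.normal(scale=45, size=149)

press, press_r2 = press_stats(X, y)
print("PRESS:", round(press, 1), "PRESS R-squared:", round(press_r2, 3))
```

Raw PRESS is on the scale of the squared response, so it is mainly useful for comparing models fit to the same data; the PRESS-based R-squared is easier to read on its own.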

There are, however, at least two variables that appear to be significant, and at least one that is on the fence – the listed value and the grand list value are both significant, while the residential ownership percentage is very close. 

From this initial analysis, it does appear that there are a few points that may be influential and/or have leverage. A review of the hat matrix diagonal values shows that there are multiple potential candidates for leverage, and a plot (Figure 1.11) shows that this is indeed the case: multiple points qualify as having leverage, as they fall above the 2p/n threshold. 
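
The leverage check described above can be sketched as follows, assuming p counts the intercept along with the regressors. The hat matrix diagonal is computed from a thin QR decomposition, and rows above 2p/n are flagged. The data is synthetic, with one deliberately extreme row planted at index 10; the function name leverages is hypothetical.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix for a model with an intercept."""
    A = np.column_stack([np.ones(X.shape[0]), X])
    Q, _ = np.linalg.qr(A)                 # thin QR, so H = Q @ Q.T
    return (Q ** 2).sum(axis=1)            # row sums of squares = diag(H)

# Synthetic stand-in data, NOT the Vermont sample.
rng = np.random.default_rng(2)
X = rng.normal(size=(149, 6))
X[10] = 6.0                                # planted extreme row (hypothetical)

h = leverages(X)
n, p = X.shape[0], X.shape[1] + 1          # p counts the intercept
cutoff = 2 * p / n
flagged = np.flatnonzero(h > cutoff)
print("cutoff:", round(cutoff, 3), "flagged rows:", flagged)
```

Leverage depends only on the regressor values, not on the sale price, which is why a high-leverage point still needs a residual check before it is called influential.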

To find any influential points within those leverage points, a review of the studentized residuals shows several candidates. Plotting and labeling these with the hat matrix leverage points shows that there is at least one observation, row 17, that is influential (Figure 1.12), having both a high hat (leverage) value and a high residual value. When the model is re-run with this observation removed, there is some impact on the model: the coefficient of the residential percentage regressor changes and becomes significant (Figure 1.13). Removing this observation also makes the RSquare and PRESS values worse. Interestingly, this observation is not as influential based on the Cook’s D influence measure, which instead indicates observation 4, with the only value over 1 in the dataset, as an influential point (Figure 1.14).
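
To show how these diagnostics relate to one another, the sketch below computes leverage, internally studentized residuals, and Cook's D from a single fit. Whether JMP's studentized residuals match the internal form used here is an assumption on my part. The data is synthetic, with one point planted to be both high leverage and far from the fitted surface; the function name influence_measures is hypothetical.

```python
import numpy as np

def influence_measures(X, y):
    """Leverage, internally studentized residuals and Cook's D for OLS."""
    A = np.column_stack([np.ones(len(y)), X])
    n, p = A.shape                                  # p includes the intercept
    H = A @ np.linalg.pinv(A)
    h = np.diag(H)
    resid = y - H @ y
    mse = float(resid @ resid) / (n - p)
    student = resid / np.sqrt(mse * (1.0 - h))      # internal studentization
    cooks_d = student ** 2 / p * h / (1.0 - h)
    return h, student, cooks_d

# Synthetic stand-in data, NOT the Vermont sample.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
y = 5 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=60)
X[0] = [6.0, 6.0]          # planted point: far from the other x values...
y[0] = 40.0                # ...and far from the fitted surface

h, student, cooks_d = influence_measures(X, y)
worst = int(np.argmax(cooks_d))
print("most influential row:", worst, "Cook's D:", round(float(cooks_d[worst]), 2))
```

Cook's D combines residual size and leverage into one number, so it can rank points differently from a plot of studentized residuals against hat values, which may be why the two methods pointed at different observations here.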

A quick check of the correlation matrix shows that the variables are not highly correlated (Figure 1.5). To investigate further, a look at the VIFs shows that all the values are close to one, meaning there is little evidence of multicollinearity in this model (Figure 1.6).
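
As an illustration of what the VIFs measure, the sketch below regresses each regressor on the others and reports 1 / (1 - R_j^2). Independent columns give values near one, while adding a near-duplicate column inflates the values sharply. The data is synthetic and the function name vif is an assumption; these are not the values in Figure 1.6.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2_j = 1.0 - (resid @ resid) / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2_j)
    return out

# Synthetic stand-in data, NOT the Vermont sample.
rng = np.random.default_rng(4)
X_indep = rng.normal(size=(149, 5))                   # unrelated regressors
near_copy = X_indep[:, 0] + 0.1 * rng.normal(size=149)
X_collin = np.column_stack([X_indep, near_copy])      # column 5 shadows column 0

print("independent:", np.round(vif(X_indep), 2))
print("with a near-duplicate column:", np.round(vif(X_collin), 2))
```

The same numbers can be read off the diagonal of the inverse correlation matrix, which is a handy cross-check.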

Model selection and validation
After looking for general issues with the full model, the next step is to search for the best model using the available regressors. Review of the best models for each number of regressors (Figure 1.7) shows that there are three clear options. These are: 

A: x3, x4, x5: list value, residential percentage and grand list value; 

B: x1, x3, x4, x5: the land, list value, residential percentage and grand list value; and 

C: x1, x2, x3, x4, x5:  the final one just adds the time of ownership to B. 
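
The search over candidate models can be sketched as an exhaustive "best subset per size" loop. This is a simplified stand-in for JMP's all-possible-models report: it ranks subsets by RSquare only. The data is synthetic, the zero-based column indices do not correspond to x1 through x5 above, and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def r_squared(X, y):
    """R-squared of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

def best_subsets(X, y):
    """For each subset size, return (R-squared, columns) of the best subset."""
    k = X.shape[1]
    best = {}
    for size in range(1, k + 1):
        scored = [(r_squared(X[:, list(cols)], y), cols)
                  for cols in combinations(range(k), size)]
        best[size] = max(scored)           # highest R-squared wins
    return best

# Synthetic stand-in data, NOT the Vermont sample: only columns 2 and 4 matter.
rng = np.random.default_rng(5)
X = rng.normal(size=(149, 6))
y = 3 * X[:, 2] + 2 * X[:, 4] + rng.normal(size=149)

best = best_subsets(X, y)
for size, (r2, cols) in best.items():
    print(size, cols, round(r2, 3))
```

Because RSquare never decreases as regressors are added, the comparison across sizes still needs a penalized or predictive measure such as PRESS, which is what the comparison of Models A, B and C relies on.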

Comparison of these three candidate models begins by running regression analysis. Looking at Model A (Figure 1.8), this model has two significant variables, good VIF numbers, and an RSquare value roughly equal to the full model. It does have a better PRESS statistic (and the best PRESS statistic of the three candidate models).

Model B (Figure 1.9) has a better RSquare (although all of the models have poor RSquare values) relative to Model A, good VIF, but also a low PRESS statistic. 

Model C (Figure 1.10) displays similar values for the RSquare and VIF numbers, but has a worse PRESS statistic – the worst out of the three candidate models. 

A look at the prediction profiler (Figure 1.18) is consistent with choosing Model A, although there is little apparent difference between the three candidate models, as they predict a very similar price.

Based on this analysis, Model A is the selected model. It has marginally better error measures (RSquare, PRESS) than the full model, and uses the fewest regressors.

The first validation of this model comes from comparing the prediction formula from the original sample to the model run on a new set of 50 additional observations from the dataset. The comparison does not inspire confidence in the model, as the RSquare values are atrocious, as is the PRESS statistic (Figure 1.16). As a further validation test, I ran a training and validation split on all of the data (the original set plus the additional 50 observations). This also confirms that the model is shaky at best, with very low RSquare values and high root average squared errors (RASE) (Figure 1.17).
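
A minimal sketch of the training and validation comparison follows, assuming RASE is the square root of the mean squared prediction error on each set. The 149 / 50 split mirrors the sample sizes above, but the data is synthetic and the split is a random shuffle, which may differ from how JMP assigned rows. The function names fit and rase are hypothetical.

```python
import numpy as np

def fit(X, y):
    """OLS coefficients (intercept first)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def rase(beta, X, y):
    """Root average squared error of the model's predictions on (X, y)."""
    pred = np.column_stack([np.ones(len(y)), X]) @ beta
    return float(np.sqrt(np.mean((y - pred) ** 2)))

# Synthetic stand-in data, NOT the Vermont sample: 199 rows, 3 regressors.
rng = np.random.default_rng(6)
X = rng.normal(size=(199, 3))
y = 200 + 40 * X[:, 0] + 15 * X[:, 2] + rng.normal(scale=45, size=199)

idx = rng.permutation(199)
train, valid = idx[:149], idx[149:]        # 149 training rows, 50 held out
beta = fit(X[train], y[train])
print("training RASE:  ", round(rase(beta, X[train], y[train]), 1))
print("validation RASE:", round(rase(beta, X[valid], y[valid]), 1))
```

A validation RASE that is much larger than the training RASE is the usual sign that a model does not generalize beyond the rows it was fit on.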

Conclusions and Recommendations

This model shows some promise, but in general it is a poor model. The low RSquare values show that the model (both the full and the selected version) does a poor job of accounting for the variance in the response. From a different perspective, it is also a good starting point. There are many other variables in the dataset that could be added to the analysis; some are simple, many are complex. Cleaning and preparation of the dataset consumed a lot of time that would have been better spent on analysis.

The two significant predictor variables are somewhat obvious. The listed value of a property is the assessment of the property value by a trained, licensed appraiser. Even though it might be out of date, it is still probably a very good predictor of the sale value, as the model shows.

The grand list value also correlates with the sale price, which is not surprising either. One way that realtors set prices is by using “comparables”, which show what similar properties sold for in the recent past. A higher grand list makes it more likely that the comparables are higher as well. In addition, the listed value and the grand list are essentially derived from the same source: licensed assessors.

The analysis did not find a statistically significant relationship between the percentage of homestead vs. non-homestead properties in a town and the sale price, although the p-value was close to significant. This is somewhat surprising because the conventional wisdom is that there is a premium paid for properties in so-called “ski towns”. This regressor could perhaps be made more sensitive by finding data that would further break down homestead vs. non-homestead vs. commercial properties, to limit the sales strictly to properties that are for living in and exclude retail properties. The comparison would then be between residential and vacation homes, rather than between residential homes and vacation homes plus commercial properties.

The other regressor variables are more surprising. The analysis found no significant relationship between an in-state vs. out-of-state buyer and the sale price, which runs contrary to accepted wisdom. The thought is that out-of-state buyers pay a premium, which drives up the price for locals, but the data do not bear this out. Further exploration could refine this by creating a categorical variable for the five most common buyer states, or for the type of transaction: in-state to in-state; in-state to out-of-state; out-of-state to in-state; and out-of-state to out-of-state.

Time of ownership was not significant, and the amount of land in the property was also not significant. These two factors likely do not have a direct impact on the sale price, although the coefficient for land is positive and the coefficient for length of ownership is negative.

Written By

roromitch