Is there a relationship between market assessment value and sale price in NYC Commercial Real Estate?
NYC makes the following data sets available for public use: -PLUTO contains tax-lot information, including the city’s assessed market value of each property (for taxation purposes) -Rolling Sales Data contains property transactions and sale prices
Marrying these two sources, we can quickly determine if the Market Value
of a property (i.e., the Assessment Value) correlates at all with the Sale Price
. What we find is that certain property classes are more homogeneous than others. For example, Office buildings have a stronger correlation than Hotels. Overall, however, the market price rarely seems to track the sale price.
library(tidyverse)
library(modelr)
library(purrr)
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
fig.width = 10,
fig.height = 8
)
For this modeling exercise, the data has already been downloaded and joined:
sale_augmented <- read_rds("data/sales_augmented.rds")
We will look at a specific case, 31 West 27th Street, to guide the discussion. This office building appears in the sales data 4 times in 10 years.
A couple convenience functions for quickly glimpsing our example property:
lookat <- function(boro= 1,blck = 829,lt = 16, data = sale_augmented) {
data %>%
filter(BOROUGH == boro, BLOCK == blck, LOT == lt) %>%
arrange(desc(SALE_DATE)) %>%
glimpse()
}
lookhead <- function(boro= 1,blck = 829,lt = 16, data = sale_augmented) {
data %>%
filter(BOROUGH == boro, BLOCK == blck, LOT == lt)
}
lookat()
## Observations: 4
## Variables: 64
## $ BOROUGH <int> 1, 1, 1, 1
## $ NEIGHBORHOOD <chr> "FLATIRON", "FLATIRON", "FLATIR...
## $ BUILDING.CLASS.CATEGORY <chr> "23 LOFT BUILDINGS", "23 LOFT...
## $ TAX.CLASS.AT.PRESENT <chr> "4", "4", "4", "4"
## $ BLOCK <chr> "829", "829", "829", "829"
## $ LOT <chr> "16", "16", "16", "16"
## $ EASE.MENT <chr> NA, NA, NA, NA
## $ BUILDING.CLASS.AT.PRESENT <chr> "L1", "L1", "L1", "L1"
## $ ADDRESS <chr> "31 WEST 27TH STREET", "31 WEST...
## $ APARTMENT.NUMBER <chr> "", "", "", ""
## $ ZIP.CODE <chr> "10001", "10001", "10001", "10001"
## $ RESIDENTIAL.UNITS <dbl> 0, 0, 0, 0
## $ COMMERCIAL.UNITS <dbl> 17, 17, 17, 17
## $ TOTAL.UNITS <dbl> 17, 17, 17, 17
## $ LAND.SQUARE.FEET <dbl> 9876, 9876, 9876, 9876
## $ GROSS.SQUARE.FEET <dbl> 106800, 106800, 106800, 106800
## $ YEAR.BUILT <dbl> 1910, 1910, 1910, 1910
## $ TAX.CLASS.AT.TIME.OF.SALE <chr> "4", "4", "4", "4"
## $ BUILDING.CLASS.AT.TIME.OF.SALE <chr> "L1", "L1", "L1", "L1"
## $ SALE.PRICE <dbl> 80775000, 65000000, 45700000, 3...
## $ SALE_DATE <date> 2014-03-28, 2012-07-16, 2009-1...
## $ SALE_YEAR <dbl> 2014, 2012, 2009, 2006
## $ Address <chr> "31 WEST 27 STREET", "31 WEST 2...
## $ lat <dbl> 40.77033, 40.36161, 40.36162, NA
## $ lon <dbl> -76.82009, -77.29831, -77.29831...
## $ ZipCode <chr> "10001", "10001", "10001", NA
## $ BldgClass <chr> "L1", "L1", "L1", NA
## $ Building_Type <chr> "L", "L", "L", NA
## $ BBL_derive <chr> "1008290016", "1008290016", "10...
## $ OwnerType <chr> NA, "", "", NA
## $ OwnerName <chr> "31 W27ST 9 OWNER, LLC", "31 W ...
## $ LotArea <dbl> 9876, 9876, 9876, NA
## $ BldgArea <dbl> 106800, 106800, 108594, NA
## $ ComArea <dbl> 106800, 106800, 100000, NA
## $ ResArea <dbl> 0, 0, 8594, NA
## $ NumBldgs <dbl> 1, 1, 1, NA
## $ NumFloors <dbl> 12, 12, 12, NA
## $ UnitsRes <dbl> 0, 0, 0, NA
## $ UnitsTotal <dbl> 17, 17, 22, NA
## $ LotFront <dbl> 100, 100, 100, NA
## $ LotDepth <dbl> 98.75, 98.75, 98.75, NA
## $ BldgFront <dbl> 100, 100, 100, NA
## $ BldgDepth <dbl> 89, 89, 89, NA
## $ ProxCode <dbl> 3, 3, 3, NA
## $ IrrLotCode <chr> "N", "N", "N", NA
## $ CornerLot <chr> NA, NA, NA, NA
## $ AssessLand <dbl> 2700000, 2700000, 2700000, NA
## $ AssessTotal <dbl> 12874950, 12886650, 5220000, NA
## $ ExemptLand <dbl> 0, 0, 0, NA
## $ ExemptTotal <dbl> 0, 0, 0, NA
## $ YearBuilt <dbl> 1910, 1910, 1910, NA
## $ YearAlter1 <dbl> 1987, 1987, 1987, NA
## $ YearAlter2 <dbl> 0, 0, 0, NA
## $ OfficeArea <dbl> 96800, 0, 0, NA
## $ RetailArea <dbl> 10000, 0, 0, NA
## $ GarageArea <dbl> 0, 0, 0, NA
## $ StrgeArea <dbl> 0, 106800, 100000, NA
## $ FactryArea <dbl> 0, 0, 0, NA
## $ OtherArea <dbl> 0, 0, 0, NA
## $ LotType <dbl> 5, 5, 5, NA
## $ BsmtCode <dbl> 2, 2, 5, NA
## $ BuiltFAR <dbl> 10.81, 10.81, 11.00, NA
## $ BBL <chr> "1008290016", "1008290016", "10...
## $ CondoNo <dbl> 0, 0, 0, NA
From 2006 to 2014, SALE.PRICE
increases from $31,500,000
to $80,775,000
, a 265% increase. In roughly the same period, AssessTotal
went from $5,220,000
(in 2009) to $12,874,950
, a similar increase.
Let’s look at price as a multiple of the Assessed Value:
sale_augmented %>% filter(BOROUGH == 1, BLOCK == 829, LOT == 16) %>%
select(ADDRESS,SALE_DATE, SALE.PRICE, AssessTotal) %>%
mutate(SalePriceToAssesstmentRatio = SALE.PRICE/AssessTotal)
## # A tibble: 4 x 5
## ADDRESS SALE_DATE SALE.PRICE AssessTotal
## <chr> <date> <dbl> <dbl>
## 1 31 WEST 27 STREET 2006-06-08 31500000 NA
## 2 31 WEST 27TH STREET 2009-10-22 45700000 5220000
## 3 31 WEST 27TH STREET 2012-07-16 65000000 12886650
## 4 31 WEST 27TH STREET 2014-03-28 80775000 12874950
## # ... with 1 more variables: SalePriceToAssesstmentRatio <dbl>
Is this a consistent pattern across all the sales data?
sale_augmented %>%
select(ADDRESS,SALE_DATE, SALE.PRICE, AssessTotal) %>%
mutate(SalePriceToAssesstmentRatio = SALE.PRICE/AssessTotal) %>%
filter(!is.na(SalePriceToAssesstmentRatio), !is.infinite(SalePriceToAssesstmentRatio)) %>%
mutate(Ratio_bin = ntile(SalePriceToAssesstmentRatio,5)) %>%
group_by(Ratio_bin) %>%
summarise(average_ratio = mean(SalePriceToAssesstmentRatio)) %>%
ggplot()+
aes(x = Ratio_bin, y = average_ratio)+
geom_col()+
theme_minimal()+
labs(title = "Sale Price To Assesstment Ratio")
In many instances, the ratio is wildly off. This begs the question: which types of buildings are more homogeneous?
sale_augmented %>%
select(ADDRESS,SALE_DATE,Building_Type, SALE.PRICE, AssessTotal) %>%
mutate(SalePriceToAssesstmentRatio = SALE.PRICE/AssessTotal) %>%
filter(!is.na(SalePriceToAssesstmentRatio)) %>%
filter(AssessTotal<500000000) %>%
#filter(SALE.PRICE>5e+07) %>%
ggplot()+
aes(x = SALE.PRICE, y=AssessTotal, group = Building_Type, color = Building_Type)+
geom_point()+
geom_smooth(se=F, method = "lm")+
scale_y_continuous(labels=scales::comma)+
theme_minimal()
The building designations can be looked up in the PLUTO Data Dictionary Appendix C
A few of the building classes stand out. For example, “O” stands for ‘Office’. In addition, “H” Hotels and “I” Hospitals seem to correlate more than other classes.
We will fit a simple linear model to each building class using the nested modelr
approach.
library(modelr)
f1 <- lm(AssessTotal~SALE.PRICE, data = sale_augmented)
summary(f1)
##
## Call:
## lm(formula = AssessTotal ~ SALE.PRICE, data = sale_augmented)
##
## Residuals:
## Min 1Q Median 3Q Max
## -158958485 -2569836 -2512504 -1870750 2621421738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.533e+06 1.431e+04 177.0 <2e-16 ***
## SALE.PRICE 1.335e-01 1.034e-03 129.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10590000 on 550095 degrees of freedom
## (390644 observations deleted due to missingness)
## Multiple R-squared: 0.02944, Adjusted R-squared: 0.02944
## F-statistic: 1.669e+04 on 1 and 550095 DF, p-value: < 2.2e-16
Clearly the overall model does not fair well with an Adj R Squared sub 0.2, but what about individual classes?
# helpfer functions for modeling with groups:
group_model <- function(df) {
lm(AssessTotal ~ SALE.PRICE, data = df)
}
extract_coef <- function(model) coef(model)["SALE.PRICE"]
First we nest the data frames by category:
by_group<-
sale_augmented %>%
select(ADDRESS,SALE_DATE,Building_Type, SALE.PRICE, AssessTotal) %>%
mutate(SalePriceToAssesstmentRatio = SALE.PRICE/AssessTotal) %>%
filter(!is.na(SalePriceToAssesstmentRatio)) %>%
group_by(Building_Type) %>%
nest()
by_group
## # A tibble: 27 x 2
## Building_Type data
## <chr> <list>
## 1 B <tibble [126,091 x 5]>
## 2 C <tibble [93,813 x 5]>
## 3 D <tibble [114,460 x 5]>
## 4 S <tibble [17,402 x 5]>
## 5 K <tibble [9,499 x 5]>
## 6 L <tibble [819 x 5]>
## 7 V <tibble [14,529 x 5]>
## 8 E <tibble [3,327 x 5]>
## 9 I <tibble [520 x 5]>
## 10 M <tibble [921 x 5]>
## # ... with 17 more rows
Next, using purrr::map
and broom::glance
we can apply various functions over the nested data frames in order to fit models and extract summary statistics:
by_group %>%
mutate(Number_of_Sales = map_dbl(data,nrow)) %>%
mutate(model = map(data, group_model)) %>%
mutate(Sale_Coef = map_dbl(model,extract_coef)) %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance) %>%
arrange(-adj.r.squared) %>%
filter(p.value <= (0.05)) %>%
glimpse()
## Observations: 17
## Variables: 16
## $ Building_Type <chr> "L", "O", "N", "H", "F", "E", "B", "S", "G", "...
## $ data <list> [<# A tibble: 819 x 5, ...
## $ Number_of_Sales <dbl> 819, 3503, 132, 8726, 2636, 3327, 126091, 1740...
## $ model <list> [<1.594152e+06, 1.320010e-01, -1372752.29, 52...
## $ Sale_Coef <dbl> 0.13200104, 0.13088686, 0.11407901, 0.11723693...
## $ r.squared <dbl> 0.354059930, 0.332790155, 0.289172620, 0.20407...
## $ adj.r.squared <dbl> 0.353269305, 0.332599578, 0.283704718, 0.20398...
## $ sigma <dbl> 7875291.30, 22896266.51, 2403543.44, 5902366.8...
## $ statistic <dbl> 447.823220, 1746.224730, 52.885471, 2236.84038...
## $ p.value <dbl> 1.360568e-79, 5.494034e-310, 2.954454e-11, 0.0...
## $ df <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
## $ logLik <dbl> -14166.208, -64333.076, -2125.696, -148426.537...
## $ AIC <dbl> 28338.415, 128672.153, 4257.392, 296859.073, 8...
## $ BIC <dbl> 28352.539, 128690.637, 4266.041, 296880.295, 8...
## $ deviance <dbl> 5.067051e+16, 1.835361e+18, 7.510127e+14, 3.03...
## $ df.residual <int> 817, 3501, 130, 8724, 2634, 3325, 126089, 1740...
The top building classes are:
Class | Name |
---|---|
L | Lofts (a type of office) |
O | Office |
N | Asylums |
H | Hotel |
F | Factory/Industrial |
The very highest adjusted R-Squared achieved with a linear model is 0.35 across the “Lofts Office” group, followed closely by general “Office”. The coefficients for these classes are quite low: 0.13
. With more advanced modeling and residual analysis, these models could be improved dramatically, especially for the office classes.