In this exercise we will cover the basics of spatial regression analysis, including a discussion of spatial weight matrices, spatial models, model comparisons, and finally displaying the results using bivariate maps. To start, we will discuss the different ways to develop spatial weights for spatial regression analyses. These models were designed to address underlying spatial patterns in our data. Remember that one assumption of regression analysis is independence of the observations. If the observations are spatially autocorrelated (correlated with, or similar to, nearby observations), estimates of the coefficients may be biased and their standard errors unreliable, making real effects difficult to distinguish from random noise.
Spatial weights describe the various ways that “neighbors” are determined in a spatial analysis. Tobler’s First Law of Geography (Waldo Tobler, 1970) states that,
“everything is related to everything else, but near things are more related than distant things.”
While “related” should likely be restated as “similar”, this forms the basis for spatial autocorrelation and spatial interpolation analyses. With this in mind, we can look at the basic ways “neighborhoods” can be determined in a spatial analysis. In this exercise we will examine two common methods: contiguity (common boundaries) and distance (K-nearest neighbors).
Contiguity generally describes neighbors as individuals with a shared or common boundary. In the example below, we see a hypothetical dataset with six individual polygons.
The six polygons contain a total of eighteen (18) neighbors within the neighborhood. The number of individual neighbors is identified within each polygon. Additionally, we can define how neighbor borders are identified.
In the example above we can see the designation of neighbors can be identified by a chess analogy: horizontal/vertical cases (rooks), diagonal cases (bishops), or in all cases (queens) along a regular grid. In irregular polygons, rooks neighbors are connected along edges, bishop neighbors by points, and queens neighbors by both points and edges. So in this example, the neighborhood established using the queens case contains eight (8) neighbors, four (4) neighbors by edges in the rooks case, and four (4) neighbors by points in the bishops case.
Neighbors can also be assigned by distance or K-nearest neighbors. Using this method, distance is measured from the centroid of a polygon until a specific number of neighbors is reached.
In the distance example, the K-value is equal to the number of individuals designated as neighbors. Within the K = 1 circle, the neighborhood is identified as the single closest neighbor. K = 3 contains the three (3) nearest neighbors and K = 5 contains the five (5) nearest neighbors.
During the first half of this course we examined ordinary least squares (OLS), generalized linear models (GLM), and linear mixed models. In this exercise we will look at some basic spatial regression models, including the spatially lagged X (explanatory variable) model, the spatial lag model, and the spatial error model. Additionally, we will discuss the Spatial Durbin and Spatial Durbin Error nested models.
In a non-spatial OLS model, \(y = X \beta + \varepsilon\), we examine the relationship between one or more independent variables and a dependent variable. With a spatially lagged, or SLX model, we see an addition of a spatial component to the equation that accounts for the average value of neighboring X (independent variable) values.
\(y = X \beta + WX\theta + \varepsilon\), where \(WX\theta\) is the average value of our neighbors’ independent variable(s)
So the SLX equation examines whether variables that impact our neighbors also impact us. This is considered a local model, as there is only a one-way interaction between our neighbors and us.
In the Spatial Lag model, we drop the lagged X values that were used in the SLX model and instead add a spatially lagged y (dependent variable) value.
\(y = \rho W y + X \beta + \varepsilon\), where \(\rho W y\) is the spatially lagged y value of our neighbors
In this global model, the dependent variable among our neighbors influences our dependent variable. This creates a feedback loop in which effects on our neighbors’ y affect our y, which in turn affects our neighbors’ y.
The Spatial Error model does not include lagged dependent or independent variables, but instead includes a function of our unexplained error and that of our neighbors.
\(y = X \beta + u\), \(u = \lambda W u + \varepsilon\), where \(\lambda W u + \varepsilon\) is the function of our unexplained error (residuals) and our neighbors’ residual values
This is a global model in which higher-than-expected residual values suggest a missing explanatory variable that is spatially correlated. This would appear as unexplained clusters of spatially correlated values within our neighborhood that were not captured by the predictor variable(s).
In addition to the models discussed above, we can also examine a modern approach to model specifications. These nested models contain some lagged and/or error values and can be local or global models. These can be seen as exploratory analyses that use likelihood ratio tests to help infer the most appropriate model for our data. However, you must begin with an understanding of whether the relationships in your data are local or global in nature.
The Spatial Durbin model includes both lagged y and lagged x values. This model can be simplified into global SLX, Spatial Lag, Spatial Error models, or OLS.
\(y = \rho Wy + X\beta + WX\theta + \varepsilon\)
Because of the inclusion of the lagged y, this model would be used if the relationships in your data are global, meaning that if the y variable is impacted in one region that impact will spill over to every region in the dataset.
The Spatial Durbin Error Model does not contain a lagged y and can be simplified into a local Spatial Error, SLX, or OLS model.
\(y = X\beta + WX\theta + u\), \(u = \lambda Wu + \varepsilon\)
This nested model would be used if the relationships in your data are local, meaning changes in one region will only impact the immediate neighbors of the region and not all of the regions in the dataset.
With this basic knowledge of spatial regression and the associated models, we can now examine a dataset to develop spatial weight matrices, run and compare spatial regression models, and display the results using bivariate maps.
This will be a very code-dense exercise due to the complexity of the analysis and the amount of data required to complete all of the steps listed above. Try to keep your naming conventions straightforward so you can easily track down the data you need. When creating your own script and markdown document, consider adding descriptive text to convey the purpose of each specific analysis.
There are several specialty packages that will be used in this exercise due to the specific nature of the analyses. Some of these packages you will need to install while several others we have used in previous exercises and should already be installed.
biscale | car | cleangeo | cowplot |
---|---|---|---|
geosphere | maps | mapproj | maptools |
spatialreg | spdep | tidyverse | visreg |
To begin, we will load the following packages:
packages <- c("biscale", "car", "cleangeo", "cowplot", "geosphere", "maps",
"mapproj", "maptools", "spatialreg", "spdep", "tidyverse","visreg")
sapply(packages, require, character.only=T)
## biscale car cleangeo cowplot geosphere maps mapproj
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## maptools spatialreg spdep tidyverse visreg
## TRUE TRUE TRUE TRUE TRUE
Occasionally packages may mask functions needed from other packages. You can require the use of a function from a specific package by using syntax such as `spatialreg::impacts()`. You can also create a selective assignment like `impacts <- spatialreg::impacts` to require the use of a function from a specific package. Because of the significant amount of script involved in this exercise, you may need to refer to the help menu (e.g. `?poly2nb`) for additional information regarding specific functions used throughout the analysis.
Generally we have been using biological data for each analysis. In this exercise we will examine the effect of numerous socioeconomic factors related to the percentage of children under the age of 18 who are living below the poverty line in the Southern United States. Data for this analysis were collected from the U.S. Census Bureau, American Community Survey, Bureau of Labor Statistics, U.S. Department of Agriculture Economic Research Service, and the County Health Rankings. Independent variables include: rural, urban, manufacturing, agriculture, retail, health care, construction, less than high school degree, unemployment, income ratio, teen birth, unmarried, single mother, and uninsured, as well as race variables such as Black and Hispanic. Because this data has to be downloaded, we will need to go through the process of importing this information using the `read.csv()` command.
Once we have imported the data we can view the structure.
If forked, this data will be available in the Data folder within the repository.
In previous exercises, we used a simple implementation of `read.csv()` to import the dataset. However, this can occasionally prove problematic, as columns can take on formatting that doesn’t necessarily fit the data. So in this example we will designate the format for each column.
data <- read.csv('https://raw.githubusercontent.com/chrismgentry/Spatial-Regression/master/Data/childpov18_southfull.csv',
                 colClasses = c(rep("character", 3), rep("numeric", 32)))
Notice there are 35 different columns in this dataset. One of the columns is labeled X2016.child.poverty. This is because R does not permit column names to begin with a numeric value, so on import it prepends an X to the start of the name. It is always good practice to view your data after importing to address any potential issues like column names or formats. To fix this, we can rename that specific column:
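The renaming script is a one-liner; a minimal sketch, assuming the imported column is X2016.child.poverty and the target name is the child.pov.2016 used throughout the rest of the exercise:

colnames(data)[colnames(data) == "X2016.child.poverty"] <- "child.pov.2016" #strip the leading X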
For this exercise we will subset the data in order to reduce the time and complexity of the analysis. We will select the state of Texas for this example.
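A minimal sketch of the subset, assuming the State column stores two-letter abbreviations; it produces the tx_pov object used in the models below:

tx_pov <- subset(data, State == "TX") #Texas counties only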
For additional QA/QC, we can view a summary of our data to check for any missing data or errors.
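The call itself is a one-liner that produced the output below:

summary(tx_pov)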
## FIPS State County_Name Longitude
## Length:254 Length:254 Length:254 Min. :-106.24
## Class :character Class :character Class :character 1st Qu.:-100.78
## Mode :character Mode :character Mode :character Median : -98.54
## Mean : -98.65
## 3rd Qu.: -96.57
## Max. : -93.74
## Latitude child.pov.2016 unemployment_13 per_lhs_13
## Min. :26.13 Min. : 0.00 Min. : 2.600 Min. : 6.60
## 1st Qu.:30.13 1st Qu.:18.52 1st Qu.: 5.000 1st Qu.:16.32
## Median :31.79 Median :23.65 Median : 5.800 Median :20.00
## Mean :31.66 Mean :23.70 Mean : 6.126 Mean :21.75
## 3rd Qu.:33.18 3rd Qu.:28.88 3rd Qu.: 6.700 3rd Qu.:25.65
## Max. :36.28 Max. :56.50 Max. :15.700 Max. :55.00
## per_black_13 per_hispanic_13 per_ag_13 per_construction_13
## Min. : 0.000 Min. : 2.233 Min. : 0.5895 Min. : 0.000
## 1st Qu.: 1.155 1st Qu.:16.632 1st Qu.: 4.8213 1st Qu.: 6.268
## Median : 3.900 Median :24.625 Median : 9.1786 Median : 7.754
## Mean : 6.231 Mean :32.938 Mean :11.7818 Mean : 8.250
## 3rd Qu.: 8.784 3rd Qu.:47.754 3rd Qu.:16.8112 3rd Qu.: 9.759
## Max. :33.314 Max. :98.432 Max. :51.5464 Max. :18.153
## per_manufacturing_13 per_retail_13 per_healthss_13 per_singlemom_13
## Min. : 0.00 Min. : 0.000 Min. : 4.762 Min. : 0.000
## 1st Qu.: 4.43 1st Qu.: 9.182 1st Qu.:19.480 1st Qu.: 7.512
## Median : 7.53 Median :11.458 Median :22.204 Median : 9.586
## Mean : 7.97 Mean :10.897 Mean :22.538 Mean : 9.798
## 3rd Qu.:10.82 3rd Qu.:12.727 3rd Qu.:25.636 3rd Qu.:12.183
## Max. :29.45 Max. :19.300 Max. :38.697 Max. :21.034
## urban rural lnchildpov lnunemployment
## Min. :0.0000 Min. :0.0000 Min. :-4.605 Min. :0.9555
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 2.919 1st Qu.:1.6094
## Median :0.0000 Median :0.0000 Median : 3.163 Median :1.7579
## Mean :0.3228 Mean :0.1929 Mean : 2.995 Mean :1.7684
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.: 3.363 3rd Qu.:1.9021
## Max. :1.0000 Max. :1.0000 Max. : 4.034 Max. :2.7537
## lnlesshs lnblack lnhispanic lnag
## Min. :1.887 Min. :-4.6052 Min. :0.8034 Min. :-0.5285
## 1st Qu.:2.793 1st Qu.: 0.1442 1st Qu.:2.8113 1st Qu.: 1.5730
## Median :2.996 Median : 1.3610 Median :3.2038 Median : 2.2169
## Mean :3.013 Mean : 0.8919 Mean :3.2347 Mean : 2.1164
## 3rd Qu.:3.245 3rd Qu.: 2.1729 3rd Qu.:3.8661 3rd Qu.: 2.8220
## Max. :4.007 Max. : 3.5060 Max. :4.5894 Max. : 3.9425
## lnconstruction lnmanufacturing lnretail lnhealthss
## Min. :-4.605 Min. :-4.605 Min. :-4.605 Min. :1.561
## 1st Qu.: 1.835 1st Qu.: 1.488 1st Qu.: 2.217 1st Qu.:2.969
## Median : 2.048 Median : 2.019 Median : 2.439 Median :3.100
## Mean : 2.031 Mean : 1.776 Mean : 2.323 Mean :3.089
## 3rd Qu.: 2.278 3rd Qu.: 2.381 3rd Qu.: 2.544 3rd Qu.:3.244
## Max. : 2.899 Max. : 3.383 Max. : 2.960 Max. :3.656
## lnsinglemom lnchpov17 lnuninsured lnincome_ratio
## Min. :-4.605 Min. :-4.600 Min. :2.642 Min. :1.158
## 1st Qu.: 2.017 1st Qu.: 2.800 1st Qu.:3.155 1st Qu.:1.454
## Median : 2.260 Median : 3.125 Median :3.244 Median :1.532
## Mean : 2.155 Mean : 2.940 Mean :3.238 Mean :1.534
## 3rd Qu.: 2.500 3rd Qu.: 3.342 3rd Qu.:3.333 3rd Qu.:1.606
## Max. : 3.046 Max. : 4.132 Max. :3.653 Max. :2.267
## lnsevere_housing lnteenbirth lnunmarried
## Min. :0.010 Min. :2.507 Min. :0.010
## 1st Qu.:2.445 1st Qu.:3.909 1st Qu.:3.120
## Median :2.624 Median :4.114 Median :3.523
## Mean :2.595 Mean :4.080 Mean :3.230
## 3rd Qu.:2.800 3rd Qu.:4.317 3rd Qu.:3.859
## Max. :3.473 Max. :4.852 Max. :4.605
We can see from this that we have two columns, urban and rural, that are categorical data even though they are in numeric format. Additionally, there are a few variables that have minimum values below zero. We will address those later in this exercise. Now that we have our data subset to just the Texas counties, we can move ahead with an exploration of the data.
We are going to begin this analysis by writing a script to define our equation with an assignment operator followed by our variables. This will keep us from having to copy and paste the equation repeatedly throughout the analysis.
equation <- child.pov.2016 ~ rural + urban + lnmanufacturing + lnag +
lnretail + lnhealthss + lnconstruction + lnlesshs +
lnunemployment + lnsinglemom + lnblack + lnhispanic +
lnuninsured + lnincome_ratio + lnteenbirth + lnunmarried
This equation will be used to explain the level of child poverty by the list of variables and log variables in the dataset. We are also going to limit the use of scientific notation to numbers greater than 5 decimal places within the output and summaries. This is adjusted in the options:
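The options call itself is not shown above; a common way to set this, assuming the base R scipen penalty is what was used:

options(scipen = 5) #penalize scientific notation in printed output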
To test this relationship between child poverty and our independent variables we will start by running a simple OLS.
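A minimal sketch of the OLS fit matching the Call in the output below; the object name ols is an assumption carried through the residual tests later:

ols <- lm(equation, data = tx_pov)
summary(ols)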
##
## Call:
## lm(formula = equation, data = tx_pov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.2797 -3.7164 0.0318 3.9219 22.4168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -86.50820 12.85507 -6.730 1.27e-10 ***
## rural -1.99975 1.50036 -1.333 0.183861
## urban -0.23421 1.25471 -0.187 0.852085
## lnmanufacturing 0.12695 0.56070 0.226 0.821079
## lnag -1.10187 0.77451 -1.423 0.156149
## lnretail -1.28789 1.34550 -0.957 0.339448
## lnhealthss 5.13853 2.03292 2.528 0.012134 *
## lnconstruction -2.60999 1.10021 -2.372 0.018479 *
## lnlesshs 8.60026 2.64769 3.248 0.001330 **
## lnunemployment 8.86334 1.95041 4.544 8.79e-06 ***
## lnsinglemom 2.14337 0.92279 2.323 0.021042 *
## lnblack -0.66033 0.31448 -2.100 0.036811 *
## lnhispanic -1.26238 0.99343 -1.271 0.205072
## lnuninsured 14.37337 3.91819 3.668 0.000301 ***
## lnincome_ratio 6.00025 3.44936 1.740 0.083240 .
## lnteenbirth 1.97988 1.29756 1.526 0.128380
## lnunmarried -0.04797 0.47643 -0.101 0.919884
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.965 on 237 degrees of freedom
## Multiple R-squared: 0.5052, Adjusted R-squared: 0.4718
## F-statistic: 15.13 on 16 and 237 DF, p-value: < 2.2e-16
In this analysis we can see a number of significant (\(\alpha = 0.05\)) log variables including: employment in healthcare or construction, less than high school education, rate of unemployment, single mother households, rate of uninsured individuals, and a single race variable. The model is also statistically significant with a p-value of < 2.2e-16. You can view a series of regression plots to visualize the relationship between the response and explanatory variables using the `visreg()` function from the visreg package (visualization of regression models). For simplicity when displaying for this exercise, all of the plots will be viewed in a single frame, as sketched below.
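A sketch of one way to arrange the plots in one frame, assuming base graphics and the ols object fit above:

par(mfrow = c(4, 4)) #4 x 4 grid to hold one panel per predictor
visreg(ols)
par(mfrow = c(1, 1)) #reset the plotting frame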
Although we will not be viewing each of the models created in this example with `visreg()`, you can use this on regression models created for your related assignment.
Referring back to the section above on contiguity, we know that neighbors are described as individuals with a shared or common boundary. Those can be one of several different cases, including rook, bishop, and queen. In the example below, we run our analyses based on queens case neighbors.
In order to determine whether there are any underlying spatial relationships in our residuals, we can run various tests of our data. However, to do this we need to provide a spatial weights matrix, so we need to run the following commands to create a list of neighbors. But before that, we need to create a county polygon dataset containing a variable, the Federal Information Processing Standards (FIPS) codes, that will eventually allow us to connect the poverty data if necessary, and convert the object to a `SpatialPolygons` class.
fips <- county.fips #state/county FIPS lookup table from the maps package
fips.codes <- separate(data = fips, col = polyname, into = c("state", "county"), sep = ",")
tx_fips <- subset(fips.codes, state=="texas", select=fips)
texas <- maps::map(database = "county", regions = "texas", fill=T, plot=F)
tx_sp = map2SpatialPolygons(texas, tx_fips$fips, CRS("+proj=longlat")) #convert to SpatialPolygons
When running the next step of the process to create a list of neighbors for each county in the dataset, an error message stated that the polygons had invalid geometry. This means that when the data was digitized, one of the polygons might not have been closed, or there are overlapping vertices or lines. This can be repaired with the `cleangeo` package, which acts somewhat like `janitor` but for spatial data.
cleaned <- clgeo_Clean(tx_sp) #repair the invalid geometry
neighb.data <- poly2nb(cleaned, queen=T) #queens case neighbor list
cont.neighb <- nb2listw(neighb.data,style="W", zero.policy = TRUE) #row-standardized spatial weights
Now that we have created a spatial dataset and a list of neighbors, we can determine if there is any residual spatial dependence.
In this step we will use a Moran’s test designed for regression residuals, which examines the residuals of the OLS regression using the spatial relationship matrix. In this case the null hypothesis (H0) is “no spatial correlation in the residuals”.
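The test call is not shown above; a minimal sketch using spdep's regression-residual Moran test, assuming the ols object from earlier:

lm.morantest(ols, cont.neighb)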
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = equation, data = tx_pov)
## weights: cont.neighb
##
## Moran I statistic standard deviate = 3.0923, p-value = 0.0009931
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.095560223 -0.016773540 0.001319666
In viewing the results from our Moran’s test for regression residuals, we see that we have a significant p-value of 0.00099. So we would reject the H0 due to detection of spatial dependency in our dataset. This suggests we should be using a spatial model.
Another way to analyze the data is to use a Lagrange Multiplier Test. The function reports the estimates of simple LM tests for error dependence (LMerr), for a missing spatially lagged dependent variable (LMlag), for robust variants of these tests (RLMerr tests for error dependence in the possible presence of a missing lagged dependent variable, and RLMlag the other way round), and a portmanteau test (SARMA, in fact LMerr + RLMlag). We can run all of these at once using the following script:
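A sketch of the call, assuming spdep's lm.LMtests() with all five diagnostics requested:

lm.LMtests(ols, cont.neighb, test = "all")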
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = equation, data = tx_pov)
## weights: cont.neighb
##
## LMerr = 6.3382, df = 1, p-value = 0.01182
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = equation, data = tx_pov)
## weights: cont.neighb
##
## LMlag = 5.809, df = 1, p-value = 0.01594
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = equation, data = tx_pov)
## weights: cont.neighb
##
## RLMerr = 1.0913, df = 1, p-value = 0.2962
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = equation, data = tx_pov)
## weights: cont.neighb
##
## RLMlag = 0.56209, df = 1, p-value = 0.4534
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = equation, data = tx_pov)
## weights: cont.neighb
##
## SARMA = 6.9003, df = 2, p-value = 0.03174
The results of the five (5) Lagrange Multiplier Tests are as follows:
LMerr | LMlag | RLMerr | RLMlag | SARMA |
---|---|---|---|---|
0.012 | 0.016 | 0.296 | 0.453 | 0.032 |
Based on these results, the Lagrange Multiplier Tests suggest a spatial error or spatial lag model would improve the fit of our model. The robust tests attempt to filter out potential false positives in the data generation. I do not believe the `spdep` or `spatialreg` packages have the ability to run a true SARMA model, and in statistics courses I have taken in the past, it was suggested that SARMA models are rarely appropriate no matter the p-value. Additionally, because I infrequently use this analysis, I am not familiar enough with Spatial Autoregressive Moving Average (SARMA) models to know when they are necessary.
A handy decision tree for model selection was developed by Dr. Luc Anselin, one of the principal developers of spatial econometrics.
Following the LM test results from the OLS and the decision tree above we will continue with the spatial error and spatial lag models even though the error model is more than likely the correct path.
Although the results above indicated an error model would be the most appropriate, here we will examine the Spatially Lagged X Model, and later on the Spatial Lag Model. As described above, the SLX model accounts for the average value of neighboring X values within our model. Essentially this is a one-way interaction where our neighbors potentially have an impact on us, but that is the limit of the interaction. To run an SLX model we need to provide the dataset and the spatial weights matrix we created before.
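A minimal sketch of the fit, assuming spatialreg's lmSLX() and the SLX.model object name referenced below:

SLX.model <- spatialreg::lmSLX(equation, data = tx_pov, cont.neighb)
summary(SLX.model)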
##
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))),
## data = as.data.frame(x), weights = weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.2280 -3.7042 -0.3626 3.8252 19.7106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -127.06759 30.16906 -4.212 0.0000369 ***
## rural -2.46167 1.59469 -1.544 0.12410
## urban 0.02043 1.34435 0.015 0.98789
## lnmanufacturing -0.09738 0.62167 -0.157 0.87567
## lnag -0.73469 0.97278 -0.755 0.45091
## lnretail -0.69809 1.43939 -0.485 0.62816
## lnhealthss 4.35219 2.33117 1.867 0.06323 .
## lnconstruction -2.18978 1.16544 -1.879 0.06157 .
## lnlesshs 8.07940 2.98545 2.706 0.00734 **
## lnunemployment 6.87451 2.18694 3.143 0.00190 **
## lnsinglemom 2.41027 0.97739 2.466 0.01442 *
## lnblack -1.05893 0.39612 -2.673 0.00807 **
## lnhispanic 1.19434 1.53993 0.776 0.43882
## lnuninsured 10.42508 5.02773 2.074 0.03928 *
## lnincome_ratio 6.29879 3.64999 1.726 0.08580 .
## lnteenbirth 2.33871 1.35931 1.721 0.08674 .
## lnunmarried 0.08268 0.51228 0.161 0.87193
## lag.rural -0.40666 3.80296 -0.107 0.91494
## lag.urban -0.95068 3.06209 -0.310 0.75650
## lag.lnmanufacturing -0.45707 1.23771 -0.369 0.71227
## lag.lnag 0.08274 1.63159 0.051 0.95960
## lag.lnretail 5.42065 3.54293 1.530 0.12745
## lag.lnhealthss 9.20346 4.69963 1.958 0.05145 .
## lag.lnconstruction -3.43565 2.53190 -1.357 0.17618
## lag.lnlesshs 0.06447 5.74321 0.011 0.99105
## lag.lnunemployment 1.68346 4.89665 0.344 0.73133
## lag.lnsinglemom -3.68631 2.49604 -1.477 0.14114
## lag.lnblack 0.73559 0.61437 1.197 0.23247
## lag.lnhispanic -2.09550 2.24940 -0.932 0.35257
## lag.lnuninsured -2.86477 8.42585 -0.340 0.73418
## lag.lnincome_ratio 5.28207 8.10652 0.652 0.51535
## lag.lnteenbirth 6.43891 3.71497 1.733 0.08445 .
## lag.lnunmarried 0.14316 1.28215 0.112 0.91120
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.899 on 221 degrees of freedom
## Multiple R-squared: 0.5473, Adjusted R-squared: 0.4817
## F-statistic: 8.349 on 32 and 221 DF, p-value: < 2.2e-16
Notice that in the summary we see coefficients for our independent variables as well as those of our neighbors. Looking at the p-values, we can see that overall the lagged variables were not significant, further emphasizing the need for an error model. If this were the suggested model for your data, a more informative way to view the results would be to examine the combined impacts of the x and lagged x variables. The script `summary(impacts(SLX.model, cont.neighb, R=1000), zstats = TRUE)` will produce a matrix of direct and indirect effect coefficients, z-scores, and p-values for the total impacts of each variable. However, I am going to use the following script to limit the results to only the p-values.
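A sketch of that script: the same impacts summary, subset to its p-value matrix (the pzmat component name is an assumption about the summary object):

summary(spatialreg::impacts(SLX.model, cont.neighb, R = 1000), zstats = TRUE)$pzmat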
## Direct Indirect Total
## rural 0.122667918 0.91484305 0.474233470
## urban 0.987877742 0.75620619 0.761587350
## lnmanufacturing 0.875529947 0.71191528 0.670709149
## lnag 0.450101365 0.95955529 0.666112355
## lnretail 0.627682384 0.12601873 0.218455807
## lnhealthss 0.061907517 0.05019031 0.002894118
## lnconstruction 0.060253487 0.17479891 0.036700076
## lnlesshs 0.006804617 0.99104369 0.131309506
## lnunemployment 0.001669794 0.73099830 0.070568521
## lnsinglemom 0.013661702 0.13971252 0.614504413
## lnblack 0.007511418 0.23118307 0.549073423
## lnhispanic 0.437995184 0.35155445 0.614754961
## lnuninsured 0.038124147 0.73385842 0.304657969
## lnincome_ratio 0.084401627 0.51467051 0.171734814
## lnteenbirth 0.085338964 0.08305415 0.038675967
## lnunmarried 0.871779473 0.91109405 0.880342268
Again we see that only a few variables are significant in this analysis.
Remember that the Spatial Lag Model is a global model where the dependent variable among our neighbors influences our dependent variable. This creates a feedback loop in which effects on our neighbors’ y affect our y, which in turn affects our neighbors’ y. Although we will move on to the appropriate error model in the next section, we can run the Spatial Lag Model to examine the results. In this analysis, Rho is the spatially lagged y multiplier.
sp.lag.model <- spatialreg::lagsarlm(equation, data=tx_pov, cont.neighb)
summary(sp.lag.model, Nagelkerke = TRUE)
##
## Call:
## spatialreg::lagsarlm(formula = equation, data = tx_pov, listw = cont.neighb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.771862 -3.818987 -0.058792 3.594372 22.145552
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -83.3615118 12.2867007 -6.7847 1.163e-11
## rural -1.9341120 1.4333792 -1.3493 0.1772287
## urban -0.1509088 1.1967370 -0.1261 0.8996526
## lnmanufacturing 0.1788383 0.5358207 0.3338 0.7385567
## lnag -0.8646923 0.7428990 -1.1639 0.2444470
## lnretail -1.2277641 1.2852621 -0.9553 0.3394444
## lnhealthss 4.3264825 1.9521420 2.2163 0.0266727
## lnconstruction -2.3241880 1.0546191 -2.2038 0.0275372
## lnlesshs 7.7875446 2.5550808 3.0479 0.0023047
## lnunemployment 8.0523436 1.8984982 4.2414 2.221e-05
## lnsinglemom 2.3046173 0.8801564 2.6184 0.0088339
## lnblack -0.5809891 0.3034813 -1.9144 0.0555672
## lnhispanic -1.1766409 0.9489231 -1.2400 0.2149846
## lnuninsured 14.1425520 3.7377450 3.7837 0.0001545
## lnincome_ratio 5.1385533 3.3041076 1.5552 0.1198981
## lnteenbirth 1.7695359 1.2381580 1.4292 0.1529559
## lnunmarried -0.0094519 0.4544499 -0.0208 0.9834063
##
## Rho: 0.16893, LR test value: 5.2078, p-value: 0.022485
## Asymptotic standard error: 0.074846
## z-value: 2.2571, p-value: 0.024004
## Wald statistic: 5.0944, p-value: 0.024004
##
## Log likelihood: -841.9984 for lag model
## ML residual variance (sigma squared): 44.115, (sigma: 6.6419)
## Nagelkerke pseudo-R-squared: 0.51528
## Number of observations: 254
## Number of parameters estimated: 19
## AIC: 1722, (AIC for lm: 1725.2)
## LM test for residual autocorrelation
## test value: 1.206, p-value: 0.27212
We can see from the p-value (0.0224852) that the model is significant, with similar significant variables as in the OLS model. Again, due to the feedback loop, the summary table for this analysis is difficult to interpret: when our y increases, our neighbors’ y goes up as well, which then causes our y to increase. So once again we need to view the impacts table. As with the previous impacts table, I limited the results to the p-values.
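A sketch of that call (the R = 100 simulation count is an assumption):

summary(spatialreg::impacts(sp.lag.model, listw = cont.neighb, R = 100), zstats = TRUE)$pzmat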
## Direct Indirect Total
## rural 0.12757506250 0.28260401 0.133039892417
## urban 0.83465297481 0.83437816 0.831452575236
## lnmanufacturing 0.90479507924 0.84915562 0.892546201989
## lnag 0.27276842170 0.35785587 0.270970915639
## lnretail 0.26429903088 0.44812528 0.283618204992
## lnhealthss 0.01331585967 0.11160936 0.014145435080
## lnconstruction 0.04915142945 0.16807678 0.047490193794
## lnlesshs 0.00059355749 0.07198425 0.000485597946
## lnunemployment 0.00005391193 0.04513198 0.000008191284
## lnsinglemom 0.01099539333 0.15204483 0.014946338043
## lnblack 0.06246554441 0.17238976 0.059892453014
## lnhispanic 0.20446093470 0.33315211 0.209583152213
## lnuninsured 0.00001633591 0.11631218 0.000123353408
## lnincome_ratio 0.10657143711 0.27752753 0.113749410449
## lnteenbirth 0.15139941773 0.26476791 0.156916411580
## lnunmarried 0.94531541290 0.94690375 0.944629630679
Now we will run the Spatial Error Model that was suggested earlier in our analysis by the Moran’s and Lagrange Multiplier tests. The Spatial Error Model does not include lagged dependent or independent variables, but instead includes a function of our unexplained error and that of our neighbors. In this analysis, higher than expected residual values suggest a missing explanatory variable that is spatially correlated. Here, Lambda is the error multiplier.
sp.err.model <- spatialreg::errorsarlm(equation, data=tx_pov, cont.neighb)
summary(sp.err.model, Nagelkerke = TRUE)
##
## Call:
## spatialreg::errorsarlm(formula = equation, data = tx_pov, listw = cont.neighb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.18297 -3.50397 -0.15273 3.68751 21.88176
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -81.632091 12.895227 -6.3304 2.445e-10
## rural -2.239971 1.441003 -1.5545 0.1200764
## urban -0.057247 1.217828 -0.0470 0.9625072
## lnmanufacturing 0.082049 0.547040 0.1500 0.8807747
## lnag -0.939302 0.786109 -1.1949 0.2321362
## lnretail -1.568230 1.276888 -1.2282 0.2193847
## lnhealthss 4.244038 2.017342 2.1038 0.0353979
## lnconstruction -2.354845 1.060230 -2.2211 0.0263463
## lnlesshs 8.622796 2.651494 3.2521 0.0011457
## lnunemployment 8.590102 1.923615 4.4656 7.984e-06
## lnsinglemom 2.467036 0.889356 2.7740 0.0055379
## lnblack -0.737356 0.321786 -2.2914 0.0219374
## lnhispanic -1.183896 1.045278 -1.1326 0.2573770
## lnuninsured 14.233722 4.026419 3.5351 0.0004076
## lnincome_ratio 5.447997 3.339481 1.6314 0.1028079
## lnteenbirth 1.617612 1.203351 1.3443 0.1788654
## lnunmarried -0.018476 0.443282 -0.0417 0.9667542
##
## Lambda: 0.26132, LR test value: 6.5162, p-value: 0.01069
## Asymptotic standard error: 0.091697
## z-value: 2.8498, p-value: 0.0043747
## Wald statistic: 8.1213, p-value: 0.0043747
##
## Log likelihood: -841.3442 for error model
## ML residual variance (sigma squared): 43.553, (sigma: 6.5994)
## Nagelkerke pseudo-R-squared: 0.51777
## Number of observations: 254
## Number of parameters estimated: 19
## AIC: NA (not available for weighted model), (AIC for lm: 1725.2)
In the summary we see seven (7) significant log variables including: employment in healthcare and construction, less than high school education, rate of unemployment, single mother households, percent uninsured, and percent of population that is black.
Comparing the values* of the three (3) models, we can see that overall the Spatial Error Model provides the best fit comparatively.
SLX Model | | Lag Model | | Err Model | |
---|---|---|---|---|---|
Adj-R² | p-value | R² | p-value | R² | p-value |
0.4803 | < 2.2e-16 | 0.515 | 0.27 | 0.518 | 0.0105 |
*Because spatial analyses do not provide standard R2 values, a Nagelkerke pseudo-R2 value is calculated that is a relative measure indicating how well the model explains the data.
To test the validity of the error model we can run a Spatial Hausman Test to see if the results of the analysis support our use of the model. The test was developed by Pace and LeSage, who state that:
For a given set of variables, a divergence between the coefficient estimates from a spatial error model and OLS suggest that neither is yielding regression parameter estimates matching the underlying parameters. This calls into questions use of either OLS or a standard error model for that set of variables.
Pace, R.K. and LeSage, J.P. (2008) A spatial Hausman test. Economics Letters 101(3), pgs. 282-284
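The test call is not shown above; a minimal sketch, assuming the Hausman.test() method spatialreg provides for fitted error models:

spatialreg::Hausman.test(sp.err.model)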
##
## Spatial Hausman test (asymptotic)
##
## data: NULL
## Hausman test = 16.878, df = 17, p-value = 0.4627
Based on the p-value of 0.4627 we would not reject the H0 that the estimation method yields coefficients appropriate for a spatial error model. If the p-value had been significant, we would have concluded that neither OLS nor a standard spatial error model may be appropriate for this set of variables.
Alternatively, we could have used a nested model to explore which model would most likely be appropriate for our data. As described above, a Spatial Durbin Model contains components of OLS, SLX, Spatial Lag, and Spatial Error models. The results of this model can determine whether lagged y, error, or lagged x values are important in the model, or whether the model should be simplified to include only the lagged y values (lag model), lagged x values (SLX), the errors (error model), or a simple OLS. The Spatial Durbin Error Model contains components of OLS, SLX, and Spatial Error models. The results of this model can determine whether both errors and lagged x values are important in the model, or whether the model should be simplified to include only the lagged x values (SLX), the errors (error model), or a simple OLS.
Spatial Durbin Model | Spatial Durbin Error Model |
---|---|
\(y = \rho Wy + X\beta + WX\theta + \varepsilon\) | \(y = X\beta + WX\theta + u\), \(u = \lambda Wu + \varepsilon\) |
*Simplified to…* | *Simplified to…* |
SLX: \(y = X \beta + WX\theta + \varepsilon\) | Spatial Error: \(y = X \beta + u\), \(u = \lambda W u + \varepsilon\) |
*Simplified to…* | *Simplified to…* |
Spatial Lag: \(y = \rho W y + X \beta + \varepsilon\) | SLX: \(y = X \beta + WX\theta + \varepsilon\) |
*Simplified to…* | *Simplified to…* |
Spatial Error: \(y = X \beta + u\), \(u = \lambda W u + \varepsilon\) | OLS: \(y = X \beta + \varepsilon\) |
*Simplified to…* | |
OLS: \(y = X \beta + \varepsilon\) | |
Because we already know that an error model is the most appropriate analysis we will use the Spatial Durbin Error Model for this portion of the exercise. If we were unsure after going through the steps above, we could also run the Spatial Durbin Model.
sd.err <- spatialreg::errorsarlm(equation, tx_pov, cont.neighb, etype = "emixed") #Spatial Durbin Error Model
sdm <- spatialreg::lagsarlm(equation, tx_pov, cont.neighb, type = "mixed") #Spatial Durbin Model
Remembering that the Spatial Durbin Error Model is a local model which includes the errors and lag x values, we can view a summary of that model.
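The summary call matching the output below:

summary(sd.err, Nagelkerke = TRUE)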
##
## Call:
## spatialreg::errorsarlm(formula = equation, data = tx_pov, listw = cont.neighb,
## etype = "emixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.31268 -3.71075 -0.25827 3.84243 20.13158
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -111.6357732 31.2516592 -3.5722 0.0003541
## rural -2.6061859 1.4697087 -1.7733 0.0761845
## urban -0.0036737 1.2246298 -0.0030 0.9976065
## lnmanufacturing -0.0361237 0.5669329 -0.0637 0.9491950
## lnag -0.6387240 0.8667384 -0.7369 0.4611660
## lnretail -0.7602055 1.3324769 -0.5705 0.5683247
## lnhealthss 4.4780843 2.0978006 2.1347 0.0327891
## lnconstruction -2.2097522 1.0690477 -2.0670 0.0387315
## lnlesshs 8.1522201 2.6697969 3.0535 0.0022619
## lnunemployment 6.6995970 1.9666295 3.4066 0.0006577
## lnsinglemom 2.4974968 0.8944633 2.7922 0.0052355
## lnblack -1.0769648 0.3524474 -3.0557 0.0022455
## lnhispanic 1.2738204 1.3694293 0.9302 0.3522761
## lnuninsured 9.8773038 4.4636980 2.2128 0.0269109
## lnincome_ratio 6.2190548 3.3482208 1.8574 0.0632513
## lnteenbirth 2.3902735 1.2884053 1.8552 0.0635650
## lnunmarried 0.0320226 0.4853005 0.0660 0.9473898
## lag.rural -1.3128721 3.7998265 -0.3455 0.7297122
## lag.urban -1.9439595 2.9475806 -0.6595 0.5095682
## lag.lnmanufacturing -0.9133376 1.2381458 -0.7377 0.4607176
## lag.lnag -0.0666780 1.6347868 -0.0408 0.9674657
## lag.lnretail 5.7003753 3.4875001 1.6345 0.1021505
## lag.lnhealthss 8.5446798 4.6175020 1.8505 0.0642418
## lag.lnconstruction -3.9155614 2.6003283 -1.5058 0.1321198
## lag.lnlesshs -0.6141597 5.6967052 -0.1078 0.9141467
## lag.lnunemployment 2.4147811 4.8250298 0.5005 0.6167444
## lag.lnsinglemom -4.5989952 2.4272777 -1.8947 0.0581304
## lag.lnblack 0.8723242 0.6194842 1.4081 0.1590879
## lag.lnhispanic -2.2381753 2.1987671 -1.0179 0.3087146
## lag.lnuninsured -4.0541475 8.2617533 -0.4907 0.6236296
## lag.lnincome_ratio 5.7241026 8.1058882 0.7062 0.4800849
## lag.lnteenbirth 5.5412619 3.6139085 1.5333 0.1251981
## lag.lnunmarried -0.0052713 1.2477725 -0.0042 0.9966293
##
## Lambda: 0.26166, LR test value: 6.5802, p-value: 0.010312
## Asymptotic standard error: 0.091677
## z-value: 2.8542, p-value: 0.0043147
## Wald statistic: 8.1464, p-value: 0.0043147
##
## Log likelihood: -830.0329 for error model
## ML residual variance (sigma squared): 39.84, (sigma: 6.3119)
## Nagelkerke pseudo-R-squared: 0.55886
## Number of observations: 254
## Number of parameters estimated: 35
## AIC: NA (not available for weighted model), (AIC for lm: 1734.6)
We could also run the global Spatial Durbin Model summary using `summary(sdm)`. However, using a model-driven approach we understand that overall a global model is not appropriate for our data. Meaning, for example, that unemployment in one county could affect poverty in a neighboring county, but it is unlikely to have an impact on a county across the state or on every county within the state. So keeping that in mind, we will move on using only the Spatial Durbin Error Model. Examining the impacts matrix as we have before, we can determine whether the Spatial Durbin Error Model is the most appropriate model for our data or if we should restrict the model to a spatial error, SLX, or OLS model.
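A sketch of the impacts call, keeping only the p-values as before (R = 100 is an assumption):

summary(spatialreg::impacts(sd.err, listw = cont.neighb, R = 100), zstats = TRUE)$pzmat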
## Direct Indirect Total
## rural 0.0761844736 0.72971217 0.355952285
## urban 0.9976064575 0.50956820 0.539528715
## lnmanufacturing 0.9491949607 0.46071765 0.497839479
## lnag 0.4611660110 0.96746573 0.674062995
## lnretail 0.5683246595 0.10215051 0.213119052
## lnhealthss 0.0327890584 0.06424176 0.007398399
## lnconstruction 0.0387314564 0.13211981 0.036904181
## lnlesshs 0.0022619003 0.91414670 0.200064447
## lnunemployment 0.0006576804 0.61674440 0.068869307
## lnsinglemom 0.0052355241 0.05813040 0.422107610
## lnblack 0.0022455459 0.15908789 0.739705286
## lnhispanic 0.3522761423 0.30871460 0.626926803
## lnuninsured 0.0269109288 0.62362964 0.471568678
## lnincome_ratio 0.0632513175 0.48008491 0.186038792
## lnteenbirth 0.0635650059 0.12519813 0.064776023
## lnunmarried 0.9473897639 0.99662928 0.985967634
Based on the impacts analysis, we can see that not many of the impacts are significant, so we might consider limiting the model to either the spatial error, SLX, or OLS model. Since we have already run the analysis and determined the most appropriate of those three, we should see similar results in our likelihood ratio test. This will help us test the H0 that we should restrict the model to a simpler model. The first step is to determine if we should restrict the model from a Spatial Durbin Error model to a spatial error model.
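A sketch of the likelihood ratio test matching the output below; the LR.Sarlm() name assumes a current spatialreg release (older versions export it as LR.sarlm()):

spatialreg::LR.Sarlm(sd.err, sp.err.model)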
##
## Likelihood ratio for spatial linear models
##
## data:
## Likelihood ratio = 22.623, df = 16, p-value = 0.1242
## sample estimates:
## Log likelihood of sd.err Log likelihood of sp.err.model
## -830.0329 -841.3442
Our results show that, with a high p-value of 0.1242, we should not reject the null hypothesis, meaning that we should restrict the model to a spatial error model. To provide further evidence of that, the p-value for the comparison of the Spatial Durbin Error model and the SLX model is 0.0103, and 0.0333 for the OLS model, which means we should not restrict this to an SLX or OLS model.
In the section above we utilized contiguity functions for the creation of the neighborhood. In this example we will use distance calculations to examine the relationships in our data with K-nearest neighbors. While the previous spatial regression equations used contiguity-based spatial weight matrices, K-nearest neighbors allows us to fine-tune the distance value to fit specific criteria. For example, in a biological framework, spatial distributions such as home range, seed dispersal, migration patterns, etc. often have established spatial values. While it is possible to extend the level of neighbors via contiguity, those neighbors are still based on shared borders and not on distance values, such as K-nearest neighbors, that can be specifically tuned for the analysis. An advantage of this spatial relationship is that it ensures some neighbors for every target feature, even when feature densities vary widely across the study area.
To begin this analysis we will need to create centroids for our county polygons. This provides a location in each county on which to base the distance value. We can do that by simply using the `centroid()` function in the geosphere package to create xy data from the polygons.
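A minimal sketch, assuming the cleaned polygons from above and the all.xy object name used in the neighbor calculations below:

all.xy <- centroid(cleaned) #two-column matrix of longitude/latitude centroids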
Next, we need to create a neighbor list from our centroids based on a k-nearest value. For this example we will examine k = 1, k = 3, and k = 5. Then we need to calculate the distance value so the model can create a radius to encompass the neighbors. Finally, we need to produce the list of neighbors within the neighborhood.
#Create neighbors
all.dist.k1 <- knn2nb(knearneigh(all.xy, k=1, longlat = TRUE))
all.dist.k3 <- knn2nb(knearneigh(all.xy, k=3, longlat = TRUE))
all.dist.k5 <- knn2nb(knearneigh(all.xy, k=5, longlat = TRUE))
#Determine max k distance value to neighbor
all.max.k1 <- max(unlist(nbdists(all.dist.k1, all.xy, longlat=TRUE)))
all.max.k3 <- max(unlist(nbdists(all.dist.k3, all.xy, longlat=TRUE)))
all.max.k5 <- max(unlist(nbdists(all.dist.k5, all.xy, longlat=TRUE)))
#Calculate neighbors based on distance
all.sp.dist.k1 <- dnearneigh(all.xy, d1=0, d2=1 * all.max.k1, longlat = TRUE)
all.sp.dist.k3 <- dnearneigh(all.xy, d1=0, d2=1 * all.max.k3, longlat = TRUE)
all.sp.dist.k5 <- dnearneigh(all.xy, d1=0, d2=1 * all.max.k5, longlat = TRUE)
#Create neighbor list
all.dist.neighb.k1 <- nb2listw(all.sp.dist.k1,style="W", zero.policy = TRUE)
all.dist.neighb.k3 <- nb2listw(all.sp.dist.k3,style="W", zero.policy = TRUE)
all.dist.neighb.k5 <- nb2listw(all.sp.dist.k5,style="W", zero.policy = TRUE)
This provides us with the spatial matrices needed to run the spatial lag and spatial error distance models based on values of one (1), three (3), and five (5) neighbors. The lag and error models are similar to the models completed above, except that they use a distance calculation for the spatial matrix rather than a matrix based on contiguity. Additionally, because we already completed the Moran’s and Lagrange Multiplier tests above, we do not need to repeat that step for this exercise and will move right on to the analysis.
To calculate a distance lag model for each k-distance value we will use the following:
all.dist.lag.k1 <- spatialreg::lagsarlm(equation, data = tx_pov, listw = all.dist.neighb.k1)
all.dist.lag.k3 <- spatialreg::lagsarlm(equation, data = tx_pov, listw = all.dist.neighb.k3)
all.dist.lag.k5 <- spatialreg::lagsarlm(equation, data = tx_pov, listw = all.dist.neighb.k5)
For this example we will only view the summary for the K=1 lag model:
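The summary call matching the output below (Nagelkerke = TRUE adds the pseudo-R² line):

summary(all.dist.lag.k1, Nagelkerke = TRUE)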
##
## Call:
## spatialreg::lagsarlm(formula = equation, data = tx_pov, listw = all.dist.neighb.k1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.26420 -3.92130 -0.19287 3.75935 21.93251
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -83.780273 12.298965 -6.8120 9.627e-12
## rural -1.791795 1.439814 -1.2445 0.2133292
## urban -0.182614 1.197555 -0.1525 0.8788010
## lnmanufacturing 0.238012 0.539998 0.4408 0.6593826
## lnag -0.828526 0.742989 -1.1151 0.2647969
## lnretail -1.297936 1.285854 -1.0094 0.3127844
## lnhealthss 4.441177 1.957354 2.2690 0.0232702
## lnconstruction -2.363470 1.052807 -2.2449 0.0247731
## lnlesshs 7.864772 2.539457 3.0970 0.0019547
## lnunemployment 8.148034 1.910640 4.2646 2.003e-05
## lnsinglemom 2.294017 0.881707 2.6018 0.0092738
## lnblack -0.558226 0.302814 -1.8435 0.0652615
## lnhispanic -1.181199 0.947876 -1.2462 0.2127081
## lnuninsured 13.828022 3.745415 3.6920 0.0002225
## lnincome_ratio 5.182556 3.307955 1.5667 0.1171860
## lnteenbirth 1.846950 1.238280 1.4915 0.1358185
## lnunmarried -0.026773 0.454500 -0.0589 0.9530259
##
## Rho: 0.1861, LR test value: 5.2624, p-value: 0.02179
## Asymptotic standard error: 0.08413
## z-value: 2.212, p-value: 0.026965
## Wald statistic: 4.8931, p-value: 0.026965
##
## Log likelihood: -841.9711 for lag model
## ML residual variance (sigma squared): 44.148, (sigma: 6.6444)
## Nagelkerke pseudo-R-squared: 0.51538
## Number of observations: 254
## Number of parameters estimated: 19
## AIC: 1721.9, (AIC for lm: 1725.2)
## LM test for residual autocorrelation
## test value: 1.5471, p-value: 0.21357
To calculate a distance error model for each k-distance value we will use the following:
all.dist.err.k1 <- spatialreg::errorsarlm(equation, data = tx_pov, listw = all.dist.neighb.k1)
all.dist.err.k3 <- spatialreg::errorsarlm(equation, data = tx_pov, listw = all.dist.neighb.k3)
all.dist.err.k5 <- spatialreg::errorsarlm(equation, data = tx_pov, listw = all.dist.neighb.k5)
For this example we will only view the summary for the K=1 err model:
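The matching summary call for the error model:

summary(all.dist.err.k1, Nagelkerke = TRUE)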
##
## Call:
## spatialreg::errorsarlm(formula = equation, data = tx_pov, listw = all.dist.neighb.k1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.77306 -3.60469 -0.22649 3.77691 22.26279
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -82.994877 12.921589 -6.4230 1.336e-10
## rural -2.217650 1.447042 -1.5325 0.1253892
## urban -0.238744 1.203329 -0.1984 0.8427295
## lnmanufacturing 0.038484 0.543262 0.0708 0.9435256
## lnag -0.674507 0.783858 -0.8605 0.3895159
## lnretail -1.876566 1.264909 -1.4836 0.1379261
## lnhealthss 4.702683 2.029486 2.3172 0.0204940
## lnconstruction -2.219853 1.049585 -2.1150 0.0344315
## lnlesshs 7.826914 2.659579 2.9429 0.0032514
## lnunemployment 9.228441 1.917868 4.8118 1.496e-06
## lnsinglemom 2.320612 0.875628 2.6502 0.0080438
## lnblack -0.637487 0.324895 -1.9621 0.0497470
## lnhispanic -0.885510 1.051503 -0.8421 0.3997111
## lnuninsured 14.018928 4.074680 3.4405 0.0005806
## lnincome_ratio 5.668915 3.330218 1.7023 0.0887056
## lnteenbirth 1.768904 1.203243 1.4701 0.1415308
## lnunmarried 0.032203 0.445633 0.0723 0.9423923
##
## Lambda: 0.32346, LR test value: 7.877, p-value: 0.0050067
## Asymptotic standard error: 0.10609
## z-value: 3.0489, p-value: 0.0022969
## Wald statistic: 9.2957, p-value: 0.0022969
##
## Log likelihood: -840.6638 for error model
## ML residual variance (sigma squared): 43.289, (sigma: 6.5794)
## Nagelkerke pseudo-R-squared: 0.52035
## Number of observations: 254
## Number of parameters estimated: 19
## AIC: NA (not available for weighted model), (AIC for lm: 1725.2)
While the two models are relatively comparable, we can see that again the error model is generally a better fit for the dataset.
Distance Lag Model | | | Distance Err Model | | |
---|---|---|---|---|---|
R² | p-value | AIC | R² | p-value | AIC |
0.5154 | 0.0218 | 1721.9 | 0.5204 | 0.005 | 1719.3 |
In order to connect the poverty data to the spatial data we need a common column to merge the two datasets. For this example we will use the FIPS codes. We can create an output including variables from the original dataset as well as the analyses. For this example we will use the `tx_pov` and `all.dist.err.k1` data to create a bivariate map of variables.
To create the output we will combine columns from the poverty and error model dataset.
dist.err.data <- summary(all.dist.err.k1, correlation=TRUE, Nagelkerke = TRUE)
dist.err.output <- cbind.data.frame(tx_pov$FIPS,
dist.err.data$fitted.values,
dist.err.data$residual,
tx_pov$child.pov.2016,
tx_pov$lnsinglemom,
tx_pov$lnuninsured,
tx_pov$lnlesshs,
tx_pov$lnincome_ratio,
stringsAsFactors = FALSE)
#Renaming columns
colnames(dist.err.output) <- c("fips","fitted","resid","childpov",
"single_mom","uninsured","less_hs","income_ratio")
We will use this data to create a bivariate map. Bivariate maps are a variation of a choropleth map that portrays two separate phenomena simultaneously. The main objective is to accurately and graphically illustrate the relationship between two spatially distributed variables. In order to create this map we will use `biscale` to split the data into quantiles, rank those values, and create a value of the combined ranks of two variables, such as single mother households and child poverty. In order for this to be displayed in `ggplot` we will use the fortify function.
tx_fortify <- fortify(tx_sp)
tx_poly <- merge(x = tx_fortify, y = dist.err.output,
by.x = "id", by.y = "fips", all = TRUE)
bivariate_data <- bi_class(tx_poly, x = childpov, y = single_mom,
dim = 3, style = "quantile")
legend <- bi_legend(pal = "DkViolet",
dim = 3,
xlab = "Child Poverty",
ylab = "Single Mother\n Households",
size = 6)
Now that we have data and a legend, we need to create additional datasets to serve as a basemap for our analysis.
world <- map_data("world")
states <- map_data("state")
southern_states <- subset(states, region %in%
c("texas", "arkansas", "louisiana", "mississippi",
"alabama", "georgia", "florida", "north carolina",
"south carolina", "tennessee", "oklahoma",
"kentucky", "west virginia", "virginia",
"maryland", "delaware", "district of columbia"))
Now we can use `ggplot` to create the bivariate map, and use the `ggdraw()` and `draw_plot()` functions in `cowplot` to add the legend to the final map.
mom_pov_map <- ggplot() +
geom_polygon(data = world, aes(x=long,y=lat, group=group), fill = "gray95", color = "white") +
geom_polygon(data = states, aes(x=long,y=lat, group=group), fill = "gray", color = "white") +
geom_polygon(data = southern_states, aes(x=long,y=lat, group=group), fill = NA, size = 0.01, color = "white") +
geom_polygon(data = bivariate_data, aes(x=long, y=lat, group=group, fill = bi_class), color = "grey50", show.legend = FALSE) +
bi_scale_fill(pal = "DkViolet", dim = 3) +
coord_map("conic", lat0 = 30, xlim=c(-107,-92), ylim=c(25,37)) +
theme_void() + theme(legend.title.align=0.5) +
theme(panel.background = element_rect(fill = 'deepskyblue'),
panel.grid.major = element_line(colour = NA)) +
labs(x = "Longitude", y = "Latitude", fill = "Child Poverty",
title = "Bivariate Map of Child Poverty and Single Mother Households") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
mom_pov_map
Using `cowplot` we can add the custom legend to the bivariate map and complete the output.
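A sketch of the final composition using cowplot's ggdraw() and draw_plot(); the inset coordinates are placement assumptions:

final_map <- ggdraw() +
  draw_plot(mom_pov_map, 0, 0, 1, 1) + #full-frame map
  draw_plot(legend, x = 0.62, y = 0.10, width = 0.25, height = 0.25) #inset legend
final_map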
Now it’s your turn! Although most of this might not be applicable to your thesis research, try this analysis on a dataset of your choice or recreate this analysis choosing a different state and poverty equation/selection. If you can apply this to your thesis, then add it to your website! Otherwise create a new repository for your presentation on Thursday.