1 Introduction to Spatial Regression

In this exercise we will cover the basics of spatial regression analysis, including a discussion of spatial weight matrices, spatial models, model comparisons, and finally displaying the results using bivariate maps. To start, we will discuss the different ways to develop spatial weights for spatial regression analyses. These models were designed to address underlying spatial patterns in our data. Remember that one assumption of regression analysis is independence of the observations. If the observations are spatially autocorrelated (nearby observations are more similar, or correlated, than distant ones), the coefficient estimates may be biased and the standard errors unreliable, so real effects may be indistinguishable from random noise.

1.1 Spatial Weights

Spatial weights describe the various ways that “neighbors” are determined in a spatial analysis. Tobler’s First Law of Geography (Waldo Tobler, 1970) states that,

“everything is related to everything else, but near things are more related than distant things.”

While “related” should likely be restated as “similar”, this forms the basis for spatial autocorrelation and spatial interpolation analyses. With this in mind, we can look at the basic ways “neighborhoods” can be determined in a spatial analysis. In this exercise we will examine two common methods: contiguity (common boundaries) and distance (K-nearest neighbors).

1.1.1 Contiguity

Contiguity generally describes neighbors as individuals with a shared or common boundary. In the example below, we see a hypothetical dataset with six individual polygons.

Neighbors Example

The six polygons contain a total of eighteen (18) neighbor relationships within the neighborhood; the number of neighbors for each polygon is identified within it. Additionally, we can define how neighbor borders are identified.

Contiguity Neighbors

In the example above we can see that neighbors along a regular grid can be designated by a chess analogy: horizontal/vertical cases (rooks), diagonal cases (bishops), or all cases (queens). In irregular polygons, rook neighbors are connected along edges, bishop neighbors by points, and queen neighbors by both points and edges. So in this example, the neighborhood established using the queens case contains eight (8) neighbors: four (4) by edges (the rooks case) and four (4) by points (the bishops case).
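
As a quick check, spdep’s cell2nb() builds rook- or queen-case neighbor lists for a regular grid. This is an illustrative sketch on a hypothetical 3 x 3 grid (not part of the exercise data):

```r
library(spdep)

# Hypothetical 3 x 3 regular grid; the center cell is cell 5
rook  <- cell2nb(3, 3, type = "rook")   # neighbors share an edge
queen <- cell2nb(3, 3, type = "queen")  # neighbors share an edge or a corner

# card() returns the number of neighbors for each cell
card(rook)[5]   # 4 (the rooks case)
card(queen)[5]  # 8 (the queens case)
```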

1.1.2 Distance

Neighbors can also be assigned by distance, or K-nearest neighbors. Using this method, distance is measured outward from the centroid of a polygon until a specified number of neighbors is reached.

Distance Neighbors

In the distance example, the K-value is equal to the number of individuals designated as neighbors. Within the K = 1 circle, the neighborhood is identified as the single closest neighbor. K = 3 contains the three (3) nearest neighbors and K = 5 contains the five (5) nearest neighbors.
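
In spdep, a K-nearest-neighbors list is typically built by chaining knearneigh() and knn2nb(). A minimal sketch using made-up coordinates:

```r
library(spdep)

# Hypothetical centroid coordinates (x, y) for ten polygons
set.seed(42)
coords <- cbind(runif(10, 0, 100), runif(10, 0, 100))

# knearneigh() finds the k closest points; knn2nb() converts the result
# to a neighbors list (note: k-nearest relationships are not
# necessarily symmetric)
k1 <- knn2nb(knearneigh(coords, k = 1))
k3 <- knn2nb(knearneigh(coords, k = 3))

card(k1)  # every observation has exactly 1 neighbor
card(k3)  # every observation has exactly 3 neighbors
```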

1.2 Basic Spatial Regression Models

During the first half of this course we examined ordinary least squares (OLS), generalized linear models (GLM), and linear mixed models. In this exercise we will look at some basic spatial regression models including: the spatially lagged X (explanatory variable) model, the spatial lag model, and the spatial error model. Additionally, we will discuss the Spatial Durbin and Spatial Durbin Error nested models.

1.2.1 Spatially Lagged X Variable(s)

In a non-spatial OLS model, \(y = X \beta + \varepsilon\), we examine the relationship between one or more independent variables and a dependent variable. With a spatially lagged X, or SLX, model, we add a spatial component to the equation that accounts for the average value of neighboring X (independent variable) values.

\(y = X \beta + WX\theta + \varepsilon\),  where \(WX\) is the average value of our neighbors’ independent variable(s) and \(\theta\) its coefficient(s)

So the SLX model examines whether variables that impact our neighbors also impact us. This is considered a local model, as the interaction runs only one way: from our neighbors to us.
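
With row-standardized weights, the lagged term \(WX\) is simply the mean of each observation’s neighbors’ x values, which spdep computes with lag.listw(). A minimal sketch on a hypothetical grid:

```r
library(spdep)

# Hypothetical 3 x 3 rook grid with row-standardized weights (style = "W")
lw <- nb2listw(cell2nb(3, 3, type = "rook"), style = "W")

x  <- 1:9               # a made-up explanatory variable
Wx <- lag.listw(lw, x)  # WX: the mean of each cell's neighbors' x values

# Cell 1 (a corner) neighbors cells 2 and 4, so its lag is mean(c(2, 4))
Wx[1]  # 3
```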

1.2.2 Spatial Lag Model

In the Spatial Lag model, we drop the lagged X values that were used in the SLX model and instead add a spatially lagged y (dependent variable) value.

\(y = \rho W y + X \beta + \varepsilon\),  where \(Wy\) is the spatially lagged y value of our neighbors and \(\rho\) its coefficient

In this global model the dependent variable among our neighbors influences our dependent variable. This creates a feedback loop: effects on our neighbors’ y affect our y, which in turn affects our neighbors’ y.
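
The feedback loop comes from the reduced form \(y = (I - \rho W)^{-1}(X\beta + \varepsilon)\): the matrix inverse spreads a change in one region to every other region. A base-R sketch with a hypothetical five-region weights matrix:

```r
# Hypothetical row-standardized weights for 5 regions arranged in a line
W <- matrix(0, 5, 5)
for (i in 1:4) { W[i, i + 1] <- 1; W[i + 1, i] <- 1 }
W <- W / rowSums(W)

rho <- 0.5
multiplier <- solve(diag(5) - rho * W)  # (I - rho*W)^-1, the feedback matrix

# A unit shock to region 1's predictors raises y in every region,
# not only in region 1 and its immediate neighbor
round(multiplier %*% c(1, 0, 0, 0, 0), 3)
```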

1.2.3 Spatial Error Model

The Spatial Error model does not include lagged dependent or independent variables, but instead includes a function of our unexplained error and that of our neighbors.

\(y = X \beta + u\),   \(u = \lambda W u + \varepsilon\),  where the error term \(u\) combines our own unexplained error (residuals), \(\varepsilon\), with the spatially lagged residuals of our neighbors, \(\lambda W u\)

This is a global model where higher-than-expected residual values suggest a missing explanatory variable that is spatially correlated. This would appear as unexplained clusters of spatially correlated values within our neighborhood that were not captured by the predictor variable(s).

1.3 Nested Spatial Regression Models

In addition to the models discussed above, we can also examine a more modern approach to model specification. These nested models contain some combination of lagged and/or error terms and can be local or global models. They can be seen as exploratory analyses that use likelihood ratio tests to help infer the most appropriate model for our data. However, you must begin with an understanding of whether the relationships in your data are local or global in nature.

1.3.1 Spatial Durbin Model

The Spatial Durbin model includes both lagged y and lagged X values. This global model can be simplified into the SLX, Spatial Lag, or Spatial Error models, or OLS.

\(y = \rho Wy + X\beta + WX\theta + \varepsilon\)

Because of the inclusion of the lagged y, this model would be used if the relationships in your data are global, meaning that if the y variable is impacted in one region that impact will spill over to every region in the dataset.
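
For reference, the nesting works through parameter restrictions (a standard result, not shown in this exercise’s output): setting \(\theta = 0\) yields the Spatial Lag model, setting \(\rho = 0\) yields the SLX model, and imposing the common factor restriction \(\theta = -\rho\beta\) yields the Spatial Error model, since

\(y = \rho Wy + X\beta - \rho WX\beta + \varepsilon\)  \(\Rightarrow\)  \((I - \rho W)y = (I - \rho W)X\beta + \varepsilon\)  \(\Rightarrow\)  \(y = X\beta + (I - \rho W)^{-1}\varepsilon\),

which is the Spatial Error model with \(\lambda = \rho\). Likelihood ratio tests of these restrictions are what drive the model selection described above.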

1.3.2 Spatial Durbin Error Model

This nested model does not contain a lagged y and can be simplified into a local Spatial Error, SLX, or OLS model.

\(y = X\beta + WX\theta + u\),   \(u = \lambda Wu + \varepsilon\)

This nested model would be used if the relationships in your data are local, meaning changes in one region will only impact the immediate neighbors of the region and not all of the regions in the dataset.

2 Spatial Regression

With this basic knowledge of spatial regression and the associated models, we can now examine a dataset to:

  1. Explore the data using OLS
  2. Determine the presence of spatial autocorrelation
  3. Choose appropriate spatial weights
  4. Run spatial regression models
  5. Plot the results

This will be a very code-dense exercise due to the complexity of the analysis and the amount of data required to complete all of the steps listed above. Try to keep your naming conventions straightforward so you can easily track down the data you need. When creating your own script and markdown document, consider adding descriptive text to convey the purpose of each specific analysis.

2.1 Packages used in this exercise

There are several specialty packages that will be used in this exercise due to the specific nature of the analyses. Some of these packages you will need to install while several others we have used in previous exercises and should already be installed.

| biscale    | car   | cleangeo  | cowplot  |
| geosphere  | maps  | mapproj   | maptools |
| spatialreg | spdep | tidyverse | visreg   |

To begin, we will load the following (note that require() only loads packages; any that are missing must first be installed with install.packages()):

packages <- c("biscale", "car", "cleangeo", "cowplot", "geosphere", "maps", 
              "mapproj", "maptools", "spatialreg", "spdep", "tidyverse","visreg")
sapply(packages, require, character.only=T)
##    biscale        car   cleangeo    cowplot  geosphere       maps    mapproj 
##       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
##   maptools spatialreg      spdep  tidyverse     visreg 
##       TRUE       TRUE       TRUE       TRUE       TRUE

Occasionally packages may mask functions needed from other packages. You can require the use of a function from a specific package by using syntax such as spatialreg::impacts(). You can also create a selective function like impacts <- spatialreg::impacts to require the use of a function from a specific package. Because of the significant amount of code involved in this exercise, you may need to refer to the help menu (e.g. ?poly2nb) for additional information regarding specific functions used throughout the analysis.

2.2 Dataset

Generally we have been using biological data for each analysis. In this exercise we will examine the effect of numerous socioeconomic factors on the percentage of children under the age of 18 who are living below the poverty line in the Southern United States. Data for this analysis were collected from the U.S. Census Bureau, American Community Survey, Bureau of Labor Statistics, U.S. Department of Agriculture Economic Research Service, and the County Health Rankings. Independent variables include: rural, urban, manufacturing, agriculture, retail, health care, construction, less than high school degree, unemployment, income ratio, teen birth, unmarried, single mother, and uninsured, as well as race variables like Black and Hispanic. Because these data must be downloaded, we will import them using the read.csv() command. Once we have imported the data we can view the structure.

If forked, this data will be available in the Data folder within the repository.

In previous exercises, we used a simple implementation of read.csv() to import the dataset. However, this can occasionally prove to be problematic as columns can take on formatting that doesn’t necessarily fit the data. So in this example we will designate the format for each column.

data <- read.csv('https://raw.githubusercontent.com/chrismgentry/Spatial-Regression/master/Data/childpov18_southfull.csv', 
                 colClasses = c("character", "character", "character", 
                                "numeric", "numeric", "numeric", "numeric",
                                "numeric", "numeric", "numeric", "numeric",
                                "numeric", "numeric", "numeric", "numeric", 
                                "numeric", "numeric", "numeric", "numeric",
                                "numeric", "numeric", "numeric", "numeric",
                                "numeric", "numeric", "numeric", "numeric", 
                                "numeric", "numeric", "numeric", "numeric",
                                "numeric", "numeric", "numeric", "numeric"))
Now we can view the data:

Notice there are 35 different columns in this dataset. One of the columns is labeled X2016.child.poverty. This is because, when importing, R does not permit column names to begin with a numeric value, so it prepends an X to the start of the column name. It is always good practice to view your data when imported to address any potential issues like column names or formats. To fix this, we can rename that specific column:

names(data)[names(data)=="X2016.child.poverty"] <- "child.pov.2016"

For this exercise we will subset the data in order to reduce the time and complexity of the analysis. We will select the state of Texas for this example.

tx_pov <- data %>% subset(State == "TX")

For additional QA/QC, we can view a summary of our data to check for any missing data or errors.

summary(tx_pov)
##      FIPS              State           County_Name          Longitude      
##  Length:254         Length:254         Length:254         Min.   :-106.24  
##  Class :character   Class :character   Class :character   1st Qu.:-100.78  
##  Mode  :character   Mode  :character   Mode  :character   Median : -98.54  
##                                                           Mean   : -98.65  
##                                                           3rd Qu.: -96.57  
##                                                           Max.   : -93.74  
##     Latitude     child.pov.2016  unemployment_13    per_lhs_13   
##  Min.   :26.13   Min.   : 0.00   Min.   : 2.600   Min.   : 6.60  
##  1st Qu.:30.13   1st Qu.:18.52   1st Qu.: 5.000   1st Qu.:16.32  
##  Median :31.79   Median :23.65   Median : 5.800   Median :20.00  
##  Mean   :31.66   Mean   :23.70   Mean   : 6.126   Mean   :21.75  
##  3rd Qu.:33.18   3rd Qu.:28.88   3rd Qu.: 6.700   3rd Qu.:25.65  
##  Max.   :36.28   Max.   :56.50   Max.   :15.700   Max.   :55.00  
##   per_black_13    per_hispanic_13    per_ag_13       per_construction_13
##  Min.   : 0.000   Min.   : 2.233   Min.   : 0.5895   Min.   : 0.000     
##  1st Qu.: 1.155   1st Qu.:16.632   1st Qu.: 4.8213   1st Qu.: 6.268     
##  Median : 3.900   Median :24.625   Median : 9.1786   Median : 7.754     
##  Mean   : 6.231   Mean   :32.938   Mean   :11.7818   Mean   : 8.250     
##  3rd Qu.: 8.784   3rd Qu.:47.754   3rd Qu.:16.8112   3rd Qu.: 9.759     
##  Max.   :33.314   Max.   :98.432   Max.   :51.5464   Max.   :18.153     
##  per_manufacturing_13 per_retail_13    per_healthss_13  per_singlemom_13
##  Min.   : 0.00        Min.   : 0.000   Min.   : 4.762   Min.   : 0.000  
##  1st Qu.: 4.43        1st Qu.: 9.182   1st Qu.:19.480   1st Qu.: 7.512  
##  Median : 7.53        Median :11.458   Median :22.204   Median : 9.586  
##  Mean   : 7.97        Mean   :10.897   Mean   :22.538   Mean   : 9.798  
##  3rd Qu.:10.82        3rd Qu.:12.727   3rd Qu.:25.636   3rd Qu.:12.183  
##  Max.   :29.45        Max.   :19.300   Max.   :38.697   Max.   :21.034  
##      urban            rural          lnchildpov     lnunemployment  
##  Min.   :0.0000   Min.   :0.0000   Min.   :-4.605   Min.   :0.9555  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 2.919   1st Qu.:1.6094  
##  Median :0.0000   Median :0.0000   Median : 3.163   Median :1.7579  
##  Mean   :0.3228   Mean   :0.1929   Mean   : 2.995   Mean   :1.7684  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.: 3.363   3rd Qu.:1.9021  
##  Max.   :1.0000   Max.   :1.0000   Max.   : 4.034   Max.   :2.7537  
##     lnlesshs        lnblack          lnhispanic          lnag        
##  Min.   :1.887   Min.   :-4.6052   Min.   :0.8034   Min.   :-0.5285  
##  1st Qu.:2.793   1st Qu.: 0.1442   1st Qu.:2.8113   1st Qu.: 1.5730  
##  Median :2.996   Median : 1.3610   Median :3.2038   Median : 2.2169  
##  Mean   :3.013   Mean   : 0.8919   Mean   :3.2347   Mean   : 2.1164  
##  3rd Qu.:3.245   3rd Qu.: 2.1729   3rd Qu.:3.8661   3rd Qu.: 2.8220  
##  Max.   :4.007   Max.   : 3.5060   Max.   :4.5894   Max.   : 3.9425  
##  lnconstruction   lnmanufacturing     lnretail        lnhealthss   
##  Min.   :-4.605   Min.   :-4.605   Min.   :-4.605   Min.   :1.561  
##  1st Qu.: 1.835   1st Qu.: 1.488   1st Qu.: 2.217   1st Qu.:2.969  
##  Median : 2.048   Median : 2.019   Median : 2.439   Median :3.100  
##  Mean   : 2.031   Mean   : 1.776   Mean   : 2.323   Mean   :3.089  
##  3rd Qu.: 2.278   3rd Qu.: 2.381   3rd Qu.: 2.544   3rd Qu.:3.244  
##  Max.   : 2.899   Max.   : 3.383   Max.   : 2.960   Max.   :3.656  
##   lnsinglemom       lnchpov17       lnuninsured    lnincome_ratio 
##  Min.   :-4.605   Min.   :-4.600   Min.   :2.642   Min.   :1.158  
##  1st Qu.: 2.017   1st Qu.: 2.800   1st Qu.:3.155   1st Qu.:1.454  
##  Median : 2.260   Median : 3.125   Median :3.244   Median :1.532  
##  Mean   : 2.155   Mean   : 2.940   Mean   :3.238   Mean   :1.534  
##  3rd Qu.: 2.500   3rd Qu.: 3.342   3rd Qu.:3.333   3rd Qu.:1.606  
##  Max.   : 3.046   Max.   : 4.132   Max.   :3.653   Max.   :2.267  
##  lnsevere_housing  lnteenbirth     lnunmarried   
##  Min.   :0.010    Min.   :2.507   Min.   :0.010  
##  1st Qu.:2.445    1st Qu.:3.909   1st Qu.:3.120  
##  Median :2.624    Median :4.114   Median :3.523  
##  Mean   :2.595    Mean   :4.080   Mean   :3.230  
##  3rd Qu.:2.800    3rd Qu.:4.317   3rd Qu.:3.859  
##  Max.   :3.473    Max.   :4.852   Max.   :4.605

We can see from this that two columns, urban and rural, are categorical data even though they are stored in numeric format. Additionally, a few variables have minimum values below zero. We will address those later in this exercise. Now that our dataset includes just the Texas counties, we can move ahead with an exploration of the data.

2.3 Ordinary Least Squares

We are going to begin this analysis by defining our model formula and assigning it to an object. This will keep us from having to copy and paste the equation repeatedly throughout the analysis.

equation <- child.pov.2016 ~ rural + urban + lnmanufacturing + lnag + 
  lnretail + lnhealthss + lnconstruction + lnlesshs + 
  lnunemployment + lnsinglemom + lnblack + lnhispanic + 
  lnuninsured + lnincome_ratio + lnteenbirth + lnunmarried

This equation will be used to explain the level of child poverty using the listed variables and log-transformed variables in the dataset. We are also going to penalize the use of scientific notation in the output and summaries, so fixed notation is preferred unless it is substantially wider. This is adjusted in the options:

options(scipen = 5)

To test this relationship between child poverty and our independent variables we will start by running a simple OLS.

ols <- lm(equation, data=tx_pov)
summary(ols)
## 
## Call:
## lm(formula = equation, data = tx_pov)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2797  -3.7164   0.0318   3.9219  22.4168 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -86.50820   12.85507  -6.730 1.27e-10 ***
## rural            -1.99975    1.50036  -1.333 0.183861    
## urban            -0.23421    1.25471  -0.187 0.852085    
## lnmanufacturing   0.12695    0.56070   0.226 0.821079    
## lnag             -1.10187    0.77451  -1.423 0.156149    
## lnretail         -1.28789    1.34550  -0.957 0.339448    
## lnhealthss        5.13853    2.03292   2.528 0.012134 *  
## lnconstruction   -2.60999    1.10021  -2.372 0.018479 *  
## lnlesshs          8.60026    2.64769   3.248 0.001330 ** 
## lnunemployment    8.86334    1.95041   4.544 8.79e-06 ***
## lnsinglemom       2.14337    0.92279   2.323 0.021042 *  
## lnblack          -0.66033    0.31448  -2.100 0.036811 *  
## lnhispanic       -1.26238    0.99343  -1.271 0.205072    
## lnuninsured      14.37337    3.91819   3.668 0.000301 ***
## lnincome_ratio    6.00025    3.44936   1.740 0.083240 .  
## lnteenbirth       1.97988    1.29756   1.526 0.128380    
## lnunmarried      -0.04797    0.47643  -0.101 0.919884    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.965 on 237 degrees of freedom
## Multiple R-squared:  0.5052, Adjusted R-squared:  0.4718 
## F-statistic: 15.13 on 16 and 237 DF,  p-value: < 2.2e-16

In this analysis we can see a number of significant (\(\alpha = 0.05\)) log variables, including: employment in healthcare or construction, less than high school education, rate of unemployment, single mother households, rate of uninsured individuals, and a single race variable. The model is also statistically significant with a p-value of < 2.2e-16. You can view a series of regression plots to visualize the relationship between the response and explanatory variables using the visreg() function from the visreg package (visualization of regression models). For simplicity of display in this exercise, all of the plots will be viewed in a single frame.

par(mfrow=c(4,4), mar=c(2, 2, 2, 2))
visreg(ols)

Although we will not be viewing each of the models created in this example with visreg() you can use this on regression models created for your related assignment.

2.4 Spatial Regression Analysis, Contiguity

Referring back to the section above on contiguity, we know that neighbors are described as individuals with a shared or common boundary. These can take several different cases, including rook, bishop, and queen. In the example below, we run our analyses based on queens-case neighbors.

2.4.1 Creating a list of contiguity neighbors

In order to determine if there are any underlying spatial relationships in our residuals, we can run various tests on our data. However, these tests require a spatial weights matrix, so we need to run the following commands to create a list of neighbors. Before that, we need to create a county polygon dataset containing the Federal Information Processing Standards (FIPS) codes, which will eventually allow us to connect the poverty data if necessary, and convert the object to a SpatialPolygons class.

fips <- county.fips
fips.codes <- separate(data = fips, col = polyname, into = c("state", "county"), sep = ",")
tx_fips <- subset(fips.codes, state=="texas", select=fips)

texas <- maps::map(database = "county", regions = "texas", fill=T, plot=F)
tx_sp = map2SpatialPolygons(texas,tx_fips$fips,CRS("+proj=longlat"))

When running the next step of the process, creating a list of neighbors for each county in the dataset, an error message stated that the polygons had invalid geometry. This means that when the data were digitized, one of the polygons may not have been closed, or there are overlapping vertices or lines. This can be repaired with the cleangeo package, which acts somewhat like janitor but for spatial data.

cleaned <- clgeo_Clean(tx_sp)
neighb.data <- poly2nb(cleaned, queen=T)
cont.neighb <- nb2listw(neighb.data,style="W", zero.policy = TRUE)
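
Before running the diagnostics, it can help to sanity-check the neighbor list we just built:

```r
# Sanity check: number of regions, total links, and the distribution
# of queen-contiguity neighbor counts across the counties
summary(neighb.data)

# card() returns the neighbor count for each county; any zeros are
# why zero.policy = TRUE was set in nb2listw()
table(card(neighb.data))
```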

Now that we have created a spatial dataset and a list of neighbors, we can determine if there is any residual spatial dependence.

2.4.2 Moran’s Correlation and Lagrange Multiplier Tests

In this step we will use a Moran’s test designed for regression residuals, which examines the residuals of the OLS regression using the spatial relationship matrix. In this case the null hypothesis (H0) is “no spatial correlation in the residuals”.

lm.morantest(ols, cont.neighb)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = equation, data = tx_pov)
## weights: cont.neighb
## 
## Moran I statistic standard deviate = 3.0923, p-value = 0.0009931
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.095560223     -0.016773540      0.001319666

In viewing the results from our Moran’s test for regression residuals, we see a significant p-value of 0.00099. So we would reject the H0 due to the detection of spatial dependence in our dataset. This suggests we should be using a spatial model.
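
One way to visualize the dependence detected by the test is a Moran scatterplot of the residuals against their spatial lag, using spdep’s moran.plot():

```r
# Moran scatterplot of the OLS residuals against their spatial lag;
# an upward trend reflects the positive autocorrelation detected above
moran.plot(residuals(ols), cont.neighb, zero.policy = TRUE,
           xlab = "OLS residuals", ylab = "Spatially lagged residuals")
```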

Another way to analyze the data is to use Lagrange Multiplier tests. The function reports the estimates of simple LM tests for error dependence (LMerr) and for a missing spatially lagged dependent variable (LMlag); robust variants of these tests (RLMerr, to test for error dependence in the possible presence of a missing lagged dependent variable, and RLMlag the other way around); and a portmanteau test (SARMA, in fact LMerr + RLMlag). We can run all of these at once using the following script:

lm.LMtests(ols, cont.neighb, test="all")
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = equation, data = tx_pov)
## weights: cont.neighb
## 
## LMerr = 6.3382, df = 1, p-value = 0.01182
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = equation, data = tx_pov)
## weights: cont.neighb
## 
## LMlag = 5.809, df = 1, p-value = 0.01594
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = equation, data = tx_pov)
## weights: cont.neighb
## 
## RLMerr = 1.0913, df = 1, p-value = 0.2962
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = equation, data = tx_pov)
## weights: cont.neighb
## 
## RLMlag = 0.56209, df = 1, p-value = 0.4534
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = equation, data = tx_pov)
## weights: cont.neighb
## 
## SARMA = 6.9003, df = 2, p-value = 0.03174

The results (p-values) of the five (5) Lagrange Multiplier tests are as follows:

| LMerr | LMlag | RLMerr | RLMlag | SARMA |
| 0.012 | 0.016 | 0.296  | 0.453  | 0.032 |

Based on these results, the Lagrange Multiplier tests suggest a spatial error or spatial lag model would improve the fit of our model. The robust variants attempt to filter out potential false positives in the data generation. I do not believe the spdep or spatialreg packages have the ability to run a true SARMA model, and in statistics courses I have taken in the past, it was suggested that SARMA models are rarely appropriate no matter the p-value. Additionally, because I infrequently use this analysis, I am not familiar enough with spatial autoregressive moving average (SARMA) models to know when they are necessary.

A handy decision tree for model selection was developed by Dr. Luc Anselin, one of the principal developers of spatial econometrics.

Following the LM test results from the OLS and the decision tree above, we will continue with the spatial error and spatial lag models, even though the error model is more than likely the correct path.

2.4.3 Spatially Lagged X Model

Although the results above indicated an error model would be the most appropriate, here we will examine the Spatially Lagged X model and, later on, the Spatial Lag model. As described above, the SLX model accounts for the average value of neighboring X values within our model. Essentially this is a one-way interaction where our neighbors potentially have an impact on us, but that is the limit of the interaction. To run an SLX model we need to provide the equation, the dataset, and the spatial weights matrix we created before.

SLX.model <- spatialreg::lmSLX(equation, data=tx_pov, cont.neighb)
summary(SLX.model)
## 
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))), 
##     data = as.data.frame(x), weights = weights)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2280  -3.7042  -0.3626   3.8252  19.7106 
## 
## Coefficients:
##                       Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)         -127.06759   30.16906  -4.212 0.0000369 ***
## rural                 -2.46167    1.59469  -1.544   0.12410    
## urban                  0.02043    1.34435   0.015   0.98789    
## lnmanufacturing       -0.09738    0.62167  -0.157   0.87567    
## lnag                  -0.73469    0.97278  -0.755   0.45091    
## lnretail              -0.69809    1.43939  -0.485   0.62816    
## lnhealthss             4.35219    2.33117   1.867   0.06323 .  
## lnconstruction        -2.18978    1.16544  -1.879   0.06157 .  
## lnlesshs               8.07940    2.98545   2.706   0.00734 ** 
## lnunemployment         6.87451    2.18694   3.143   0.00190 ** 
## lnsinglemom            2.41027    0.97739   2.466   0.01442 *  
## lnblack               -1.05893    0.39612  -2.673   0.00807 ** 
## lnhispanic             1.19434    1.53993   0.776   0.43882    
## lnuninsured           10.42508    5.02773   2.074   0.03928 *  
## lnincome_ratio         6.29879    3.64999   1.726   0.08580 .  
## lnteenbirth            2.33871    1.35931   1.721   0.08674 .  
## lnunmarried            0.08268    0.51228   0.161   0.87193    
## lag.rural             -0.40666    3.80296  -0.107   0.91494    
## lag.urban             -0.95068    3.06209  -0.310   0.75650    
## lag.lnmanufacturing   -0.45707    1.23771  -0.369   0.71227    
## lag.lnag               0.08274    1.63159   0.051   0.95960    
## lag.lnretail           5.42065    3.54293   1.530   0.12745    
## lag.lnhealthss         9.20346    4.69963   1.958   0.05145 .  
## lag.lnconstruction    -3.43565    2.53190  -1.357   0.17618    
## lag.lnlesshs           0.06447    5.74321   0.011   0.99105    
## lag.lnunemployment     1.68346    4.89665   0.344   0.73133    
## lag.lnsinglemom       -3.68631    2.49604  -1.477   0.14114    
## lag.lnblack            0.73559    0.61437   1.197   0.23247    
## lag.lnhispanic        -2.09550    2.24940  -0.932   0.35257    
## lag.lnuninsured       -2.86477    8.42585  -0.340   0.73418    
## lag.lnincome_ratio     5.28207    8.10652   0.652   0.51535    
## lag.lnteenbirth        6.43891    3.71497   1.733   0.08445 .  
## lag.lnunmarried        0.14316    1.28215   0.112   0.91120    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.899 on 221 degrees of freedom
## Multiple R-squared:  0.5473, Adjusted R-squared:  0.4817 
## F-statistic: 8.349 on 32 and 221 DF,  p-value: < 2.2e-16

Notice that in the summary we see coefficients for our own independent variables as well as those of our neighbors (the lag. terms). Looking at the p-values, we can see that overall the lagged variables were not significant, further emphasizing the need for an error model. If this were the suggested model for your data, a more informative way to view the results would be to examine the combined impacts of the x and lagged x variables. Using the summary(impacts(SLX.model, cont.neighb, R=1000), zstats = TRUE) script will produce a matrix of direct-effect and indirect-effect coefficients, z-scores, and p-values for the total impacts of each variable. However, I am going to use the following script to limit the results to only the p-values.

summary(spatialreg::impacts(SLX.model, cont.neighb), zstats = TRUE)[["pzmat"]]
##                      Direct   Indirect       Total
## rural           0.122667918 0.91484305 0.474233470
## urban           0.987877742 0.75620619 0.761587350
## lnmanufacturing 0.875529947 0.71191528 0.670709149
## lnag            0.450101365 0.95955529 0.666112355
## lnretail        0.627682384 0.12601873 0.218455807
## lnhealthss      0.061907517 0.05019031 0.002894118
## lnconstruction  0.060253487 0.17479891 0.036700076
## lnlesshs        0.006804617 0.99104369 0.131309506
## lnunemployment  0.001669794 0.73099830 0.070568521
## lnsinglemom     0.013661702 0.13971252 0.614504413
## lnblack         0.007511418 0.23118307 0.549073423
## lnhispanic      0.437995184 0.35155445 0.614754961
## lnuninsured     0.038124147 0.73385842 0.304657969
## lnincome_ratio  0.084401627 0.51467051 0.171734814
## lnteenbirth     0.085338964 0.08305415 0.038675967
## lnunmarried     0.871779473 0.91109405 0.880342268

Again we see that only a few variables are significant in this analysis.

2.4.4 Spatial Lag Model

Remember that the Spatial Lag model is a global model where the dependent variable among our neighbors influences our dependent variable, creating a feedback loop: effects on our neighbors’ y affect our y, which in turn affects our neighbors’ y. Although we will move on to the appropriate error model in the next section, we can run the Spatial Lag model to examine the results. In this analysis, Rho is the spatially lagged y multiplier.

sp.lag.model <- spatialreg::lagsarlm(equation, data=tx_pov, cont.neighb)
summary(sp.lag.model, Nagelkerke = TRUE)
## 
## Call:
## spatialreg::lagsarlm(formula = equation, data = tx_pov, listw = cont.neighb)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -20.771862  -3.818987  -0.058792   3.594372  22.145552 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                    Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)     -83.3615118  12.2867007 -6.7847 1.163e-11
## rural            -1.9341120   1.4333792 -1.3493 0.1772287
## urban            -0.1509088   1.1967370 -0.1261 0.8996526
## lnmanufacturing   0.1788383   0.5358207  0.3338 0.7385567
## lnag             -0.8646923   0.7428990 -1.1639 0.2444470
## lnretail         -1.2277641   1.2852621 -0.9553 0.3394444
## lnhealthss        4.3264825   1.9521420  2.2163 0.0266727
## lnconstruction   -2.3241880   1.0546191 -2.2038 0.0275372
## lnlesshs          7.7875446   2.5550808  3.0479 0.0023047
## lnunemployment    8.0523436   1.8984982  4.2414 2.221e-05
## lnsinglemom       2.3046173   0.8801564  2.6184 0.0088339
## lnblack          -0.5809891   0.3034813 -1.9144 0.0555672
## lnhispanic       -1.1766409   0.9489231 -1.2400 0.2149846
## lnuninsured      14.1425520   3.7377450  3.7837 0.0001545
## lnincome_ratio    5.1385533   3.3041076  1.5552 0.1198981
## lnteenbirth       1.7695359   1.2381580  1.4292 0.1529559
## lnunmarried      -0.0094519   0.4544499 -0.0208 0.9834063
## 
## Rho: 0.16893, LR test value: 5.2078, p-value: 0.022485
## Asymptotic standard error: 0.074846
##     z-value: 2.2571, p-value: 0.024004
## Wald statistic: 5.0944, p-value: 0.024004
## 
## Log likelihood: -841.9984 for lag model
## ML residual variance (sigma squared): 44.115, (sigma: 6.6419)
## Nagelkerke pseudo-R-squared: 0.51528 
## Number of observations: 254 
## Number of parameters estimated: 19 
## AIC: 1722, (AIC for lm: 1725.2)
## LM test for residual autocorrelation
## test value: 1.206, p-value: 0.27212

We can see from the p-value (0.0224852) that the model is significant, with a similar set of significant variables as in the OLS model. Again, because of the feedback loop, the coefficients in the summary table are difficult to interpret directly: when our y increases, our neighbors' y increases as well, which in turn raises our y. So once again we need to view the impacts table. As with the previous impacts table, I limited the results to the p-values.

summary(spatialreg::impacts(sp.lag.model, listw = cont.neighb, R=100), zstats = TRUE)[["pzmat"]]
##                        Direct   Indirect          Total
## rural           0.12757506250 0.28260401 0.133039892417
## urban           0.83465297481 0.83437816 0.831452575236
## lnmanufacturing 0.90479507924 0.84915562 0.892546201989
## lnag            0.27276842170 0.35785587 0.270970915639
## lnretail        0.26429903088 0.44812528 0.283618204992
## lnhealthss      0.01331585967 0.11160936 0.014145435080
## lnconstruction  0.04915142945 0.16807678 0.047490193794
## lnlesshs        0.00059355749 0.07198425 0.000485597946
## lnunemployment  0.00005391193 0.04513198 0.000008191284
## lnsinglemom     0.01099539333 0.15204483 0.014946338043
## lnblack         0.06246554441 0.17238976 0.059892453014
## lnhispanic      0.20446093470 0.33315211 0.209583152213
## lnuninsured     0.00001633591 0.11631218 0.000123353408
## lnincome_ratio  0.10657143711 0.27752753 0.113749410449
## lnteenbirth     0.15139941773 0.26476791 0.156916411580
## lnunmarried     0.94531541290 0.94690375 0.944629630679
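
To make the direct/indirect decomposition concrete, here is a minimal, self-contained sketch on a hypothetical three-unit system (toy numbers, not this exercise's data): the effects matrix is \((I - \rho W)^{-1}\beta\), the direct impact is the average of its diagonal, and the indirect impact is the average of the off-diagonal spillovers.

```r
# Toy illustration (hypothetical values): impacts in a spatial lag model
rho  <- 0.17   # hypothetical spatial autoregressive parameter
beta <- 2.0    # hypothetical coefficient on a single x
W <- matrix(c(0.0, 0.5, 0.5,
              0.5, 0.0, 0.5,
              0.5, 0.5, 0.0), nrow = 3, byrow = TRUE)  # row-standardized weights

S <- solve(diag(3) - rho * W) * beta  # n x n matrix of effects of x on y
direct   <- mean(diag(S))             # average own-unit effect
total    <- mean(rowSums(S))          # average total effect
indirect <- total - direct            # average spillover to/from neighbors
```

Note that the direct effect exceeds beta itself because of the feedback loop described above.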

2.4.5 Spatial Error Model

Now we will run the Spatial Error Model that was suggested earlier in our analysis by the Moran's Correlation and Lagrange Multiplier tests. The Spatial Error Model does not include lagged dependent or independent variables; instead, it models our unexplained error as a function of our neighbors' errors. Higher than expected residual values suggest a missing explanatory variable that is spatially correlated. In this analysis, Lambda is the coefficient on the spatially correlated errors.

sp.err.model <- spatialreg::errorsarlm(equation, data=tx_pov, cont.neighb)
summary(sp.err.model, Nagelkerke = TRUE)
## 
## Call:
## spatialreg::errorsarlm(formula = equation, data = tx_pov, listw = cont.neighb)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -21.18297  -3.50397  -0.15273   3.68751  21.88176 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                   Estimate Std. Error z value  Pr(>|z|)
## (Intercept)     -81.632091  12.895227 -6.3304 2.445e-10
## rural            -2.239971   1.441003 -1.5545 0.1200764
## urban            -0.057247   1.217828 -0.0470 0.9625072
## lnmanufacturing   0.082049   0.547040  0.1500 0.8807747
## lnag             -0.939302   0.786109 -1.1949 0.2321362
## lnretail         -1.568230   1.276888 -1.2282 0.2193847
## lnhealthss        4.244038   2.017342  2.1038 0.0353979
## lnconstruction   -2.354845   1.060230 -2.2211 0.0263463
## lnlesshs          8.622796   2.651494  3.2521 0.0011457
## lnunemployment    8.590102   1.923615  4.4656 7.984e-06
## lnsinglemom       2.467036   0.889356  2.7740 0.0055379
## lnblack          -0.737356   0.321786 -2.2914 0.0219374
## lnhispanic       -1.183896   1.045278 -1.1326 0.2573770
## lnuninsured      14.233722   4.026419  3.5351 0.0004076
## lnincome_ratio    5.447997   3.339481  1.6314 0.1028079
## lnteenbirth       1.617612   1.203351  1.3443 0.1788654
## lnunmarried      -0.018476   0.443282 -0.0417 0.9667542
## 
## Lambda: 0.26132, LR test value: 6.5162, p-value: 0.01069
## Asymptotic standard error: 0.091697
##     z-value: 2.8498, p-value: 0.0043747
## Wald statistic: 8.1213, p-value: 0.0043747
## 
## Log likelihood: -841.3442 for error model
## ML residual variance (sigma squared): 43.553, (sigma: 6.5994)
## Nagelkerke pseudo-R-squared: 0.51777 
## Number of observations: 254 
## Number of parameters estimated: 19 
## AIC: NA (not available for weighted model), (AIC for lm: 1725.2)

In the summary we see seven (7) significant log variables: employment in healthcare, employment in construction, less than high school education, rate of unemployment, single mother households, percent uninsured, and percent of the population that is black.

2.4.5.1 Comparing contiguity models, model selection

Comparing the values* of the three (3) models, we can see that the Spatial Error Model provides the best overall fit.

Model    Pseudo-R2*        p-value
SLX      0.4803 (Adj-R2)   < 2.2e-16
Lag      0.515             0.27
Error    0.518             0.0105

*Because spatial analyses do not provide standard R2 values, a Nagelkerke pseudo-R2 value is calculated that is a relative measure indicating how well the model explains the data.
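
For reference, the pseudo-R2 in the table follows Nagelkerke's formula; a minimal sketch as a generic function (the log-likelihood arguments would come from the fitted model and an intercept-only null model, which are not shown here):

```r
# Nagelkerke pseudo-R^2: Cox-Snell R^2 rescaled so its maximum is 1
nagelkerke <- function(ll_model, ll_null, n) {
  cox_snell <- 1 - exp((2 / n) * (ll_null - ll_model))
  cox_snell / (1 - exp((2 / n) * ll_null))
}
```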

To test the validity of the error model we can run a Spatial Hausman Test to see if the results of the analysis verify our use of the model. Developed by Pace and LeSage, they state that:

For a given set of variables, a divergence between the coefficient estimates from a spatial error model and OLS suggests that neither is yielding regression parameter estimates matching the underlying parameters. This calls into question use of either OLS or a standard error model for that set of variables.

Pace, R.K. and LeSage, J.P. (2008) A spatial Hausman test. Economics Letters 101(3), pgs. 282-284

spatialreg::Hausman.test(sp.err.model)
## 
##  Spatial Hausman test (asymptotic)
## 
## data:  NULL
## Hausman test = 16.878, df = 17, p-value = 0.4627

Based on the p-value of 0.4627 we would not reject the H0 that the estimation method yields coefficients appropriate for a spatial error model. If the p-value had been significant, we would have concluded that neither OLS nor a spatial error model may be appropriate for this set of variables.

2.4.5.2 Nested Spatial Durbin Models

Alternatively, we could have used a nested model to explore which model would likely be appropriate for our data. As described above, a Spatial Durbin Model contains components of the OLS, SLX, Spatial Lag, and Spatial Error models. Its results can indicate whether lagged y, error, or lagged x terms are important in the model, or whether the model should be simplified to include only the lagged y values (lag model), lagged x values (SLX), the errors (error model), or a simple OLS. The Spatial Durbin Error Model contains components of the OLS, SLX, and Spatial Error models. Its results can indicate whether both errors and lagged x values are important in the model, or whether the model should be simplified to the lagged x values (SLX), the errors (error model), or a simple OLS.

Spatial Durbin Model: \(y = \rho Wy + X\beta + WX\theta + \varepsilon\)
    Simplified to SLX: \(y = X \beta + WX\theta + \varepsilon\)
    Simplified to Spatial Lag: \(y = \rho W y + X \beta + \varepsilon\)
    Simplified to Spatial Error: \(y = X \beta + u\),   \(u = \lambda W u + \varepsilon\)
    Simplified to OLS: \(y = X \beta + \varepsilon\)

Spatial Durbin Error Model: \(y = X\beta + WX\theta + u\),   \(u = \lambda Wu + \varepsilon\)
    Simplified to Spatial Error: \(y = X \beta + u\),   \(u = \lambda W u + \varepsilon\)
    Simplified to SLX: \(y = X \beta + WX\theta + \varepsilon\)
    Simplified to OLS: \(y = X \beta + \varepsilon\)

Because we already know that an error model is the most appropriate analysis, we will use the Spatial Durbin Error Model for this portion of the exercise. If we were unsure after going through the steps above, we could also run the Spatial Durbin Model.

sd.err <- spatialreg::errorsarlm(equation, tx_pov, cont.neighb, etype = "emixed")
sdm <- spatialreg::lagsarlm(equation, tx_pov, cont.neighb, type = "mixed")

Remembering that the Spatial Durbin Error Model is a local model that includes the errors and lagged x values, we can view a summary of that model.

summary(sd.err, Nagelkerke = TRUE)
## 
## Call:
## spatialreg::errorsarlm(formula = equation, data = tx_pov, listw = cont.neighb, 
##     etype = "emixed")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -21.31268  -3.71075  -0.25827   3.84243  20.13158 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                         Estimate   Std. Error z value  Pr(>|z|)
## (Intercept)         -111.6357732   31.2516592 -3.5722 0.0003541
## rural                 -2.6061859    1.4697087 -1.7733 0.0761845
## urban                 -0.0036737    1.2246298 -0.0030 0.9976065
## lnmanufacturing       -0.0361237    0.5669329 -0.0637 0.9491950
## lnag                  -0.6387240    0.8667384 -0.7369 0.4611660
## lnretail              -0.7602055    1.3324769 -0.5705 0.5683247
## lnhealthss             4.4780843    2.0978006  2.1347 0.0327891
## lnconstruction        -2.2097522    1.0690477 -2.0670 0.0387315
## lnlesshs               8.1522201    2.6697969  3.0535 0.0022619
## lnunemployment         6.6995970    1.9666295  3.4066 0.0006577
## lnsinglemom            2.4974968    0.8944633  2.7922 0.0052355
## lnblack               -1.0769648    0.3524474 -3.0557 0.0022455
## lnhispanic             1.2738204    1.3694293  0.9302 0.3522761
## lnuninsured            9.8773038    4.4636980  2.2128 0.0269109
## lnincome_ratio         6.2190548    3.3482208  1.8574 0.0632513
## lnteenbirth            2.3902735    1.2884053  1.8552 0.0635650
## lnunmarried            0.0320226    0.4853005  0.0660 0.9473898
## lag.rural             -1.3128721    3.7998265 -0.3455 0.7297122
## lag.urban             -1.9439595    2.9475806 -0.6595 0.5095682
## lag.lnmanufacturing   -0.9133376    1.2381458 -0.7377 0.4607176
## lag.lnag              -0.0666780    1.6347868 -0.0408 0.9674657
## lag.lnretail           5.7003753    3.4875001  1.6345 0.1021505
## lag.lnhealthss         8.5446798    4.6175020  1.8505 0.0642418
## lag.lnconstruction    -3.9155614    2.6003283 -1.5058 0.1321198
## lag.lnlesshs          -0.6141597    5.6967052 -0.1078 0.9141467
## lag.lnunemployment     2.4147811    4.8250298  0.5005 0.6167444
## lag.lnsinglemom       -4.5989952    2.4272777 -1.8947 0.0581304
## lag.lnblack            0.8723242    0.6194842  1.4081 0.1590879
## lag.lnhispanic        -2.2381753    2.1987671 -1.0179 0.3087146
## lag.lnuninsured       -4.0541475    8.2617533 -0.4907 0.6236296
## lag.lnincome_ratio     5.7241026    8.1058882  0.7062 0.4800849
## lag.lnteenbirth        5.5412619    3.6139085  1.5333 0.1251981
## lag.lnunmarried       -0.0052713    1.2477725 -0.0042 0.9966293
## 
## Lambda: 0.26166, LR test value: 6.5802, p-value: 0.010312
## Asymptotic standard error: 0.091677
##     z-value: 2.8542, p-value: 0.0043147
## Wald statistic: 8.1464, p-value: 0.0043147
## 
## Log likelihood: -830.0329 for error model
## ML residual variance (sigma squared): 39.84, (sigma: 6.3119)
## Nagelkerke pseudo-R-squared: 0.55886 
## Number of observations: 254 
## Number of parameters estimated: 35 
## AIC: NA (not available for weighted model), (AIC for lm: 1734.6)

We could also view the global Spatial Durbin Model summary using summary(sdm). However, using a model-driven approach, we understand that a global model is not appropriate for our data: unemployment in one county could affect poverty in a neighboring county, but it is unlikely to have an impact on a county across the state, let alone every county within the state. With that in mind, we will move on using only the Spatial Durbin Error Model. Examining the impacts matrix as before, we can determine whether the Spatial Durbin Error Model is the most appropriate model for our data or whether we should restrict the model to a spatial error, SLX, or OLS model.

summary(spatialreg::impacts(sd.err, listw = cont.neighb, R = 100), zstats = TRUE)[["pzmat"]]
##                       Direct   Indirect       Total
## rural           0.0761844736 0.72971217 0.355952285
## urban           0.9976064575 0.50956820 0.539528715
## lnmanufacturing 0.9491949607 0.46071765 0.497839479
## lnag            0.4611660110 0.96746573 0.674062995
## lnretail        0.5683246595 0.10215051 0.213119052
## lnhealthss      0.0327890584 0.06424176 0.007398399
## lnconstruction  0.0387314564 0.13211981 0.036904181
## lnlesshs        0.0022619003 0.91414670 0.200064447
## lnunemployment  0.0006576804 0.61674440 0.068869307
## lnsinglemom     0.0052355241 0.05813040 0.422107610
## lnblack         0.0022455459 0.15908789 0.739705286
## lnhispanic      0.3522761423 0.30871460 0.626926803
## lnuninsured     0.0269109288 0.62362964 0.471568678
## lnincome_ratio  0.0632513175 0.48008491 0.186038792
## lnteenbirth     0.0635650059 0.12519813 0.064776023
## lnunmarried     0.9473897639 0.99662928 0.985967634

Based on the impacts analysis, we can see that not many of the impacts are significant, so we might consider limiting the model to the spatial error, SLX, or OLS models. Since we have already run the analysis and determined the most appropriate model of those three, we should see similar results in our likelihood ratio tests. These will help us test the H0 that we should restrict the model to a simpler one. The first step is to determine whether we should restrict the Spatial Durbin Error model to a spatial error model.

spatialreg::LR.Sarlm(sd.err, sp.err.model)
## 
##  Likelihood ratio for spatial linear models
## 
## data:  
## Likelihood ratio = 22.623, df = 16, p-value = 0.1242
## sample estimates:
##       Log likelihood of sd.err Log likelihood of sp.err.model 
##                      -830.0329                      -841.3442

Our results show that, with a high p-value of 0.1242, we should not reject the null hypothesis, meaning that we should restrict the model to a spatial error model. As further evidence, the p-value for the comparison of the Spatial Durbin Error model with the SLX model is 0.0103, and 0.0333 for the OLS model, which means we should not restrict the model to an SLX or OLS model.
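
Those two additional comparisons can be run the same way; a sketch, assuming the SLX and OLS fits from earlier in the exercise are stored in objects with the hypothetical names slx.model and ols.model:

```r
spatialreg::LR.Sarlm(sd.err, slx.model)  # Spatial Durbin Error vs. SLX
spatialreg::LR.Sarlm(sd.err, ols.model)  # Spatial Durbin Error vs. OLS
```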

2.5 Spatial Regression Analysis, K-nearest neighbors

In the section above we utilized contiguity functions to create the neighborhood. In this example we will use distance calculations to examine the relationships in our data with K-nearest neighbors. While the previous spatial regression models used contiguity-based spatial weight matrices, K-nearest neighbors allows us to fine-tune the distance value to fit specific criteria. For example, in a biological framework, spatial distributions such as home ranges, seed dispersal, migration patterns, etc. often have established spatial values. While it is possible to extend the level of neighbors via contiguity, those neighbors are still based on shared borders rather than on distance values, which K-nearest neighbors can tune specifically for the analysis. An advantage of this spatial relationship is that it ensures some neighbors for every target feature, even when feature densities vary widely across the study area.

2.5.1 Creating a list of K-neighbors

To begin this analysis we will need to create centroids for our county polygons. These provide a location in each county on which to base the distance calculations. We can do that simply by using the centroid() function in the geosphere package to create xy data from the polygons.

all.xy <- centroid(tx_sp)
colnames(all.xy) <- c("x","y")

Next, we need to create a neighbor list from our centroids based on a k-nearest value. For this example we will examine k = 1, k = 3, and k = 5. Then we need to calculate the distance value so the model can create a radius to encompass the neighbors. Finally, we need to produce the list of neighbors within the neighborhood.

#Create neighbors
all.dist.k1 <- knn2nb(knearneigh(all.xy, k=1, longlat = TRUE))
all.dist.k3 <- knn2nb(knearneigh(all.xy, k=3, longlat = TRUE))
all.dist.k5 <- knn2nb(knearneigh(all.xy, k=5, longlat = TRUE))

#Determine max k distance value to neighbor
all.max.k1 <- max(unlist(nbdists(all.dist.k1, all.xy, longlat=TRUE)))
all.max.k3 <- max(unlist(nbdists(all.dist.k3, all.xy, longlat=TRUE)))
all.max.k5 <- max(unlist(nbdists(all.dist.k5, all.xy, longlat=TRUE)))

#Calculate neighbors based on distance
all.sp.dist.k1 <- dnearneigh(all.xy, d1=0, d2=1 * all.max.k1, longlat = TRUE)
all.sp.dist.k3 <- dnearneigh(all.xy, d1=0, d2=1 * all.max.k3, longlat = TRUE)
all.sp.dist.k5 <- dnearneigh(all.xy, d1=0, d2=1 * all.max.k5, longlat = TRUE)

#Create neighbor list
all.dist.neighb.k1 <- nb2listw(all.sp.dist.k1,style="W", zero.policy = TRUE)
all.dist.neighb.k3 <- nb2listw(all.sp.dist.k3,style="W", zero.policy = TRUE)
all.dist.neighb.k5 <- nb2listw(all.sp.dist.k5,style="W", zero.policy = TRUE)
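
Before fitting, it is worth confirming that every county picked up at least one neighbor in each distance band; a quick check using spdep's card(), which counts links per observation (this sketch assumes the neighbor objects created above):

```r
# Minimum number of neighbors per county for each distance band
sapply(list(k1 = all.sp.dist.k1, k3 = all.sp.dist.k3, k5 = all.sp.dist.k5),
       function(nb) min(spdep::card(nb)))
```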

This provides us with the spatial weight matrices needed to run the spatial lag and spatial error distance models based on values of one (1), three (3), and five (5) neighbors. The lag and error models are similar to those completed above, except that the spatial weights are based on distance calculations rather than contiguity. Additionally, because we already completed the Moran's Correlation and Lagrange Multiplier tests above, we do not need to repeat that step and can move right on to the analysis.

2.5.2 Distance Lag Model

To calculate a distance lag model for each k-distance value we will use the following:

all.dist.lag.k1 <- spatialreg::lagsarlm(equation, data = tx_pov, listw = all.dist.neighb.k1)
all.dist.lag.k3 <- spatialreg::lagsarlm(equation, data = tx_pov, listw = all.dist.neighb.k3)
all.dist.lag.k5 <- spatialreg::lagsarlm(equation, data = tx_pov, listw = all.dist.neighb.k5)

For this example we will only view the summary for the K=1 lag model:

summary(all.dist.lag.k1, Nagelkerke = TRUE)
## 
## Call:
## spatialreg::lagsarlm(formula = equation, data = tx_pov, listw = all.dist.neighb.k1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -21.26420  -3.92130  -0.19287   3.75935  21.93251 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                   Estimate Std. Error z value  Pr(>|z|)
## (Intercept)     -83.780273  12.298965 -6.8120 9.627e-12
## rural            -1.791795   1.439814 -1.2445 0.2133292
## urban            -0.182614   1.197555 -0.1525 0.8788010
## lnmanufacturing   0.238012   0.539998  0.4408 0.6593826
## lnag             -0.828526   0.742989 -1.1151 0.2647969
## lnretail         -1.297936   1.285854 -1.0094 0.3127844
## lnhealthss        4.441177   1.957354  2.2690 0.0232702
## lnconstruction   -2.363470   1.052807 -2.2449 0.0247731
## lnlesshs          7.864772   2.539457  3.0970 0.0019547
## lnunemployment    8.148034   1.910640  4.2646 2.003e-05
## lnsinglemom       2.294017   0.881707  2.6018 0.0092738
## lnblack          -0.558226   0.302814 -1.8435 0.0652615
## lnhispanic       -1.181199   0.947876 -1.2462 0.2127081
## lnuninsured      13.828022   3.745415  3.6920 0.0002225
## lnincome_ratio    5.182556   3.307955  1.5667 0.1171860
## lnteenbirth       1.846950   1.238280  1.4915 0.1358185
## lnunmarried      -0.026773   0.454500 -0.0589 0.9530259
## 
## Rho: 0.1861, LR test value: 5.2624, p-value: 0.02179
## Asymptotic standard error: 0.08413
##     z-value: 2.212, p-value: 0.026965
## Wald statistic: 4.8931, p-value: 0.026965
## 
## Log likelihood: -841.9711 for lag model
## ML residual variance (sigma squared): 44.148, (sigma: 6.6444)
## Nagelkerke pseudo-R-squared: 0.51538 
## Number of observations: 254 
## Number of parameters estimated: 19 
## AIC: 1721.9, (AIC for lm: 1725.2)
## LM test for residual autocorrelation
## test value: 1.5471, p-value: 0.21357

2.5.3 Distance Error Model

To calculate a distance error model for each k-distance value we will use the following:

all.dist.err.k1 <- spatialreg::errorsarlm(equation, data = tx_pov, listw = all.dist.neighb.k1)
all.dist.err.k3 <- spatialreg::errorsarlm(equation, data = tx_pov, listw = all.dist.neighb.k3)
all.dist.err.k5 <- spatialreg::errorsarlm(equation, data = tx_pov, listw = all.dist.neighb.k5)

For this example we will only view the summary for the K=1 error model:

summary(all.dist.err.k1, Nagelkerke = TRUE)
## 
## Call:
## spatialreg::errorsarlm(formula = equation, data = tx_pov, listw = all.dist.neighb.k1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -21.77306  -3.60469  -0.22649   3.77691  22.26279 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                   Estimate Std. Error z value  Pr(>|z|)
## (Intercept)     -82.994877  12.921589 -6.4230 1.336e-10
## rural            -2.217650   1.447042 -1.5325 0.1253892
## urban            -0.238744   1.203329 -0.1984 0.8427295
## lnmanufacturing   0.038484   0.543262  0.0708 0.9435256
## lnag             -0.674507   0.783858 -0.8605 0.3895159
## lnretail         -1.876566   1.264909 -1.4836 0.1379261
## lnhealthss        4.702683   2.029486  2.3172 0.0204940
## lnconstruction   -2.219853   1.049585 -2.1150 0.0344315
## lnlesshs          7.826914   2.659579  2.9429 0.0032514
## lnunemployment    9.228441   1.917868  4.8118 1.496e-06
## lnsinglemom       2.320612   0.875628  2.6502 0.0080438
## lnblack          -0.637487   0.324895 -1.9621 0.0497470
## lnhispanic       -0.885510   1.051503 -0.8421 0.3997111
## lnuninsured      14.018928   4.074680  3.4405 0.0005806
## lnincome_ratio    5.668915   3.330218  1.7023 0.0887056
## lnteenbirth       1.768904   1.203243  1.4701 0.1415308
## lnunmarried       0.032203   0.445633  0.0723 0.9423923
## 
## Lambda: 0.32346, LR test value: 7.877, p-value: 0.0050067
## Asymptotic standard error: 0.10609
##     z-value: 3.0489, p-value: 0.0022969
## Wald statistic: 9.2957, p-value: 0.0022969
## 
## Log likelihood: -840.6638 for error model
## ML residual variance (sigma squared): 43.289, (sigma: 6.5794)
## Nagelkerke pseudo-R-squared: 0.52035 
## Number of observations: 254 
## Number of parameters estimated: 19 
## AIC: NA (not available for weighted model), (AIC for lm: 1725.2)

2.5.3.1 Comparing distance models, model selection

While the two models are relatively comparable, we can see that again the error model is generally a better fit for the dataset.

Model            R2       p-value   AIC
Distance Lag     0.5154   0.0218    1721.9
Distance Error   0.5204   0.005     1719.3

3 Mapping the results

In order to connect the poverty data to spatial data we need to have a common column to merge the two datasets. For this example we will use the FIPS codes. We can create an output including variables from the original dataset as well as the analyses. For this example we will use the tx_pov and all.dist.err.k1 data to create a bivariate map of variables.

To create the output we will combine columns from the poverty and error model dataset.

dist.err.data <- summary(all.dist.err.k1, correlation=TRUE, Nagelkerke = TRUE)

dist.err.output <- cbind.data.frame(tx_pov$FIPS,
                                    dist.err.data$fitted.values, 
                                    dist.err.data$residual, 
                                    tx_pov$child.pov.2016, 
                                    tx_pov$lnsinglemom, 
                                    tx_pov$lnuninsured, 
                                    tx_pov$lnlesshs, 
                                    tx_pov$lnincome_ratio,
                                    stringsAsFactors = FALSE)

#Renaming columns
colnames(dist.err.output) <- c("fips","fitted","resid","childpov",
                        "single_mom","uninsured","less_hs","income_ratio")

We will use this data to create a bivariate map. Bivariate maps are a variation of the choropleth map that portrays two separate phenomena simultaneously, with the main objective of accurately and graphically illustrating the relationship between two spatially distributed variables. To create this map we will use biscale to split the data into quantiles, rank those values, and combine the ranks of two variables, such as single mother households and child poverty. To display the result in ggplot we will use the fortify() function.

tx_fortify <- fortify(tx_sp)

tx_poly <- merge(x = tx_fortify, y = dist.err.output, 
                 by.x = "id", by.y = "fips", all = TRUE)

bivariate_data <- bi_class(tx_poly, x = childpov, y = single_mom, 
                           dim = 3, style = "quantile")

legend <- bi_legend(pal = "DkViolet",
                    dim = 3,
                    xlab = "Child Poverty",
                    ylab = "Single Mother\n Households",
                    size = 6)

Now that we have data and a legend, we need to create additional datasets to serve as a basemap for our analysis.

world <- map_data("world")
states <- map_data("state")
southern_states <- subset(states, region %in% 
                            c("texas", "arkansas", "louisiana", "mississippi", 
                              "alabama", "georgia", "florida", "north carolina",
                              "south carolina", "tennessee", "oklahoma", 
                              "kentucky", "west virginia", "virginia", 
                              "maryland", "delaware", "district of columbia"))

Now we can use ggplot to create the bivariate map and use the ggdraw() and draw_plot() functions in cowplot to add the legend to the final map.

mom_pov_map <- ggplot() + 
  geom_polygon(data = world, aes(x=long,y=lat, group=group), fill = "gray95", color = "white") +
  geom_polygon(data = states, aes(x=long,y=lat, group=group), fill = "gray", color = "white") +
  geom_polygon(data = southern_states, aes(x=long,y=lat, group=group), fill = NA, size = 0.01, color = "white") +  
  geom_polygon(data = bivariate_data, aes(x=long, y=lat, group=group, fill = bi_class), color = "grey50", show.legend = FALSE) + 
  bi_scale_fill(pal = "DkViolet", dim = 3) +
  coord_map("conic", lat0 = 30, xlim=c(-107,-92), ylim=c(25,37)) +
  theme_void() + theme(legend.title.align=0.5) +
  theme(panel.background = element_rect(fill = 'deepskyblue'),
        panel.grid.major = element_line(colour = NA)) +
  labs(x = "Longitude", y = "Latitude", fill = "Child Poverty", 
       title = "Bivariate Map of Child Poverty and Single Mother Households") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))
mom_pov_map


Using cowplot we can add the custom legend to the bivariate map and complete the output.

final_map <- ggdraw() +
  draw_plot(mom_pov_map, 0, 0, 1, 1) +
  draw_plot(legend, 0.625, 0.025, 0.25, 0.25)
final_map
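
If you want to export the composed figure, ggplot2's ggsave() can write it to disk (a sketch; the file name and dimensions here are arbitrary):

```r
ggsave("tx_childpov_bivariate.png", plot = final_map,
       width = 10, height = 8, dpi = 300)
```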

4 YOUR TURN!

Now it’s your turn! Although most of this might not be applicable to your thesis research, try this analysis on a dataset of your choice or recreate this analysis choosing a different state and poverty equation/selection. If you can apply this to your thesis, then add it to your website! Otherwise create a new repository for your presentation on Thursday.