Technical Appendix
Data Challenges
Among the merits of our data set is that it includes dependent variables that tightly cohere to our concept of the regional perspective. Nevertheless, it creates challenges for statistical testing. Our data set contains only 420 observations, a small sample with limited statistical power and a high likelihood of Type II errors. Because a large portion of this project attempts to determine which characteristics of individuals and counties are predictive of taking a regional perspective, the small sample size poses a significant challenge, and the necessity of including many independent variables, both for hypothesis testing and for avoiding omitted variable bias, compounds it. Unfortunately, the maximum-likelihood regression models (e.g., ordered logit and ordered probit) implemented in common statistical software programs like Stata, R, and SPSS rest on a frequentist framework and omit any observation with a missing value on any variable in the model. Under such a framework, we lose degrees of freedom to the estimated coefficients while missing data simultaneously shrink the sample. Including any of the income dummy variables from our data set in a frequentist model, for instance, immediately reduces our small sample by an additional 30 percent.
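The listwise deletion behavior described above can be illustrated with a minimal R sketch. The data frame below is entirely hypothetical and is not drawn from our survey; it only shows how a frequentist routine that requires complete cases drops every row with any missing value.

```r
# Hypothetical data: 10 respondents, 3 with missing income (illustration only).
df <- data.frame(
  regional_view = c(1, 2, 3, 2, 1, 3, 2, 2, 1, 3),
  homeowner     = c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1),
  income_low    = c(0, NA, 1, 0, NA, 0, 1, NA, 0, 1)
)
nrow(df)            # 10 observations in the full data
nrow(na.omit(df))   # 7 observations survive listwise deletion: 30 percent are lost
```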
The multilevel nature of our data—responses from individuals grouped within counties that are grouped within MSAs—poses another challenge. Some of our independent variables are measured at the individual level; many are measured at the county level. Ideally, we would run a multilevel model that accounts for this feature of the data. However, including county fixed effects would further reduce our degrees of freedom. Moreover, some counties in our data have many respondents while others have far fewer. For instance, while four counties in the Atlanta MSA—Cobb, DeKalb, Fulton, and Gwinnett—each provide thirty or more respondents, sixteen of the 58 counties in our sample (about 28%) provide only one respondent each. Within a fixed effects framework, where the underlying assumption is that the fixed effects are independent of one another[i], this variation in the number of respondents per county risks wildly incorrect conclusions if the single respondent within a county is not representative of that county. Taking these issues into account, we employ a random effects framework instead. It permits the random effects to come from a common normal distribution, allowing for “shrinkage”: observations at the extremes are drawn closer to the mean (Clark and Linzer, 2015). Because the random effects framework estimates only the mean and standard deviation of that normal distribution, it preserves substantial degrees of freedom for our models.
[i] More precisely, fixed effects can be modeled as draws from a normal distribution around some mean, but with an infinite standard deviation (Clark and Linzer, 2015; Gelman and Hill, 2007).
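The shrinkage property has a simple approximate form (Gelman and Hill, 2007): a county's estimate is a precision-weighted average of its own mean and the grand mean, so counties contributing only one respondent are pulled strongly toward the center while counties with many respondents are barely moved. The R sketch below illustrates this with invented numbers; none of the values are estimates from our models.

```r
# Stylized sketch of shrinkage under partial pooling (all values invented).
# A county's estimate is a precision-weighted average of its own mean and the
# grand mean, so sparsely observed counties are pulled toward the center.
pooled_estimate <- function(y_bar_j, n_j, grand_mean, sigma2_within, tau2_between) {
  w <- (n_j / sigma2_within) / (n_j / sigma2_within + 1 / tau2_between)
  w * y_bar_j + (1 - w) * grand_mean
}
pooled_estimate(4.0, n_j = 1,  grand_mean = 2.5, sigma2_within = 1, tau2_between = 0.5)  # about 3.0: pulled two-thirds of the way toward the grand mean
pooled_estimate(4.0, n_j = 30, grand_mean = 2.5, sigma2_within = 1, tau2_between = 0.5)  # about 3.9: stays close to its own mean
```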
Bayesian Inference Strategy
Since Bayesian inference[i] provides more precise estimates when data sets are small and independent variables are collinear (Western and Jackman, 1994; Gelman and Hill, 2007, p. 262), we use it to model our data and estimate coefficients. Although our dependent variable—the individual responses to the survey questions—and some of our independent variables are measured at the individual level, many of our independent variables are measured at the county level and are invariant across individuals within the same county. A Bayesian multilevel model allows us to incorporate this feature of our data explicitly. Consequently, we model the dependent variables as functions of the individual-level variables and a county-level random effect. The county-level random effect accounts for the effect that living in a particular county has on the individuals within it, and it is in turn a function of the county-level variables.
An additional benefit of a Bayesian multilevel model is the ability to incorporate information that is not explicitly in the data. In this case, we have no MSA-level predictors. However, the counties in our data set exist in geographic space and are likely to share unmeasured characteristics with their neighboring counties. Put more simply, DeKalb County and Fulton County—contiguous neighbors and part of the Atlanta MSA—are likely to have more in common with each other than either would with Bryan County in the Savannah MSA, even if we lack data on their commonalities. Bayesian inference allows us to explicitly “tie” geographically proximate counties together in our model, even without MSA-level data. Therefore, we estimate coefficients for each independent variable, random effects at both the county and MSA levels, and the cut points common to ordered logit models. We use dummy variables for income categories, number of years living in the city, and frequency of visiting other cities to detect nonlinearities.
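The sketch below shows, in JAGS/BUGS syntax wrapped in an R string, the kind of hierarchical ordered-logit structure we describe: each response depends on individual-level predictors and a county random effect, each county effect is centered on a function of county-level predictors plus an MSA random effect, and the shared MSA effect ties counties in the same metropolitan area together. It is a minimal sketch rather than our actual model code; the object names (y, x, z, county, msa, and the dimension constants) are placeholders and the priors shown are generic.

```r
# Minimal sketch of a hierarchical ordered-logit model (placeholder names and priors).
# library(rjags)  # would be loaded to compile and sample from the model

model_string <- "
model {
  for (i in 1:N) {
    mu[i] <- inprod(beta[], x[i, ]) + a_county[county[i]]     # individual predictors + county effect
    for (k in 1:(K - 1)) {
      logit(Q[i, k]) <- cutpoint[k] - mu[i]                    # cumulative ordered-logit probabilities
    }
    pr[i, 1] <- Q[i, 1]
    for (k in 2:(K - 1)) {
      pr[i, k] <- Q[i, k] - Q[i, k - 1]
    }
    pr[i, K] <- 1 - Q[i, K - 1]
    y[i] ~ dcat(pr[i, 1:K])                                    # observed ordinal response
  }
  for (j in 1:J) {
    # County random effect centered on county-level predictors plus its MSA's effect.
    a_county[j] ~ dnorm(inprod(gamma[], z[j, ]) + a_msa[msa[j]], tau_county)
  }
  for (m in 1:M) {
    a_msa[m] ~ dnorm(0, tau_msa)       # MSA effect 'ties' counties in the same MSA together
  }
  for (p in 1:P) { beta[p]  ~ dnorm(0, 0.01) }                 # vague priors on coefficients
  for (q in 1:R) { gamma[q] ~ dnorm(0, 0.01) }
  cutpoint[1] ~ dnorm(0, 0.01)
  for (k in 2:(K - 1)) { cutpoint[k] ~ dnorm(0, 0.01) T(cutpoint[k - 1], ) }  # ordered cut points
  tau_county ~ dgamma(0.1, 0.1)
  tau_msa    ~ dgamma(0.1, 0.1)
}
"
# Fitting would look roughly like this, with data_list a hypothetical list of the objects above:
# jm <- jags.model(textConnection(model_string), data = data_list, n.chains = 3)
```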
Another benefit of Bayesian inference is that, as long as we assign a prior distribution, we can estimate missing data in the same manner that we would estimate any coefficient in the model. Since very few observations are missing for most of the independent variables in our data set, we have a good sense of the appropriate prior: all variables with missing data follow a Bernoulli distribution. Thus, we can estimate all the missing observations within the data along with the “success parameter” p that shapes the outcomes of the Bernoulli draws. This permits us to conduct reality checks on our models. If our models properly fit the data, the estimates that fill the holes of missing data should seem reasonable. In addition, simulating draws from each Bernoulli distribution with its estimated success probability should produce simulated data similar to the distribution of our observed independent variables. By contrast, we would know that our models are inaccurate representations of the data if they estimated missing values that seem unusual or misplaced, or if the success probabilities we estimate produce distributions bearing no resemblance to the observed data.
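In the same BUGS-style syntax, the idea can be sketched as follows: a dummy variable with missing entries is given a Bernoulli likelihood whose success parameter has a flat prior, and the sampler treats every missing entry as one more unknown to estimate. The fragment below is illustrative only, not our actual model code; Nonwhite is used simply because it is one of the dummy variables reported in Table A1.

```r
# Sketch (BUGS/JAGS syntax in an R string) of treating missing values in a dummy
# variable as parameters: observed entries inform p, and NA entries are imputed
# as Bernoulli draws governed by that same p.
missing_data_block <- "
  for (i in 1:N) {
    nonwhite[i] ~ dbern(p_nonwhite)   # NA values become nodes the sampler estimates
  }
  p_nonwhite ~ dunif(0, 1)            # flat prior on the success parameter
"
```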
For each model, we run three chains for 100,000 iterations, assessing chain convergence. Each chain explores the posterior distribution, which is generated from the prior distribution and the likelihood of the data. The models iteratively update their estimates, and when the three chains settle on the same peak of the posterior distribution, the model has converged. Half of the iterations are discarded as a burn-in period, ensuring that estimates drawn before the model has converged do not bias our inferences. The remaining iterations form the set from which we draw inferences about the posterior distributions of the parameters. We use the R-hat statistic to assess the convergence of the chains for each parameter in our models: the closer R-hat is to 1, the stronger the evidence that the chains have fully mixed and the models have sufficiently converged.[ii]
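In R, the corresponding workflow can be sketched with the rjags and coda packages; the object names below (model_string, data_list) are hypothetical carry-overs from the model sketch above, and the calls are shown commented because they require the actual survey data.

```r
# Hedged sketch of the sampling and convergence workflow (hypothetical object names).
# library(rjags); library(coda)
# jm   <- jags.model(textConnection(model_string), data = data_list, n.chains = 3)
# update(jm, n.iter = 50000)                                   # burn-in: discard the first half
# post <- coda.samples(jm, c("beta", "gamma", "cutpoint"), n.iter = 50000)
# gelman.diag(post, multivariate = FALSE)                      # R-hat (potential scale reduction) per parameter
# summary(post)                                                # posterior means and intervals
```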
[i] Bayesian inference derives its name from Bayes’ Rule, which allows for our prior beliefs to be updated by new data to form our “posterior” beliefs, or our beliefs after seeing the data. It takes the data as given, and estimates the likelihood that the model produced it. The frequentist framework, by contrast, takes the model as given and estimates the likelihood we would see the data we did, given the model we have specified. For a pithy explanation of the differences between frequentists and Bayesians, see https://xkcd.com/1132/.
[ii] For practical purposes, an R-hat of 1.1 signals sufficient convergence, although it may prematurely declare convergence (Gelman and Shirley, 2011). For Model 1, 88% of parameters have an R-hat of 1.0, 7.8% have an R-hat of 1.1, and 4.4% have an R-hat of 1.2. The parameters with chains that have not completely converged are 23 of the 58 county-level random effect parameters. For Model 2 and Model 3, 88% of parameters have an R-hat of 1.0, and the remaining 12% have an R-hat of 1.1, indicating that the models have more than sufficiently converged.
Checks for Model Fit
Predictive capacity is our first check of model fit. Many models, including ours, do not have prediction as their aim. We are not attempting to determine perfectly how a respondent will answer; we seek to estimate the effect of a few theoretically informed variables on the outcome in question. Accordingly, we should not expect the variables in our models to predict exactly how a respondent would answer. We should expect, however, that the suite of variables in our models performs better at predicting outcomes than choosing outcomes at random. In short, we should not expect to predict well, but we should have at least minimal predictive capacity. We comfortably clear that bar. When we use our estimated coefficients to predict how each respondent will answer, based only on the variables in our models, we correctly predict the answer the respondent actually gave considerably more often than an algorithm that picks a response for each respondent at random.[i]
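The check can be sketched in R as follows. This is not our actual code: eta, cuts, y_obs, and K are hypothetical objects holding each respondent's posterior-mean linear predictor, the estimated cut points, the observed responses, and the number of response categories.

```r
# Hedged sketch of the predictive-capacity check (hypothetical objects: eta, cuts, y_obs, K).
ordered_logit_probs <- function(eta_i, cuts) {
  cum <- plogis(c(cuts, Inf) - eta_i)   # P(y <= k) at each cut point, ending at 1
  diff(c(0, cum))                       # category probabilities
}
# pred <- sapply(eta, function(e) which.max(ordered_logit_probs(e, cuts)))
# mean(pred == y_obs)                                            # model hit rate
# mean(sample(1:K, length(y_obs), replace = TRUE) == y_obs)      # random-guess baseline
# mean(y_obs == as.integer(names(which.max(table(y_obs)))))      # modal-response baseline (footnote [i])
```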
As part of designating the prior distributions for the variables in our models, a necessary step in Bayesian inference, we built in other checks of model fit. Specifically, missing data in the survey required us to specify the distribution of each of the individual-level independent variables.[ii] Each variable is a dummy variable, enabling us to assign a Bernoulli prior to each of them. The Bernoulli distribution has a single parameter (the “success parameter” p), so we could specify our prior belief that each variable follows a Bernoulli distribution with success parameter p.
Additionally, we allowed our models to estimate p. Since the success parameter is bounded by 0 and 1 and we lacked prior beliefs about its value, we assigned the p of each variable a uniform prior bounded between 0 and 1. In other words, we specified that we knew the bounds of p and let the data tell us everything else. All three models estimate nearly identical values of p for each variable with missing data (Table A1). When we take 420 draws, equal to our sample size, from a Bernoulli distribution with the estimated p, the predicted distribution is nearly identical to the observed distribution. This is especially reassuring for the three income variables, for which nearly a third of observations are missing: the models are able both to estimate the distributions of those variables and to impute the missing observations.
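A small sketch of this check for one variable appears below, using the estimated p for the lowest income category reported in Table A1; the comparison against the observed column would use the actual survey responses, which are not reproduced here.

```r
# Sketch of the posterior predictive check on an imputed dummy variable.
set.seed(42)
p_hat <- 0.160                                  # estimated p for Income $0-25,000 (Table A1, Model 1)
sim   <- rbinom(420, size = 1, prob = p_hat)    # 420 draws, matching our sample size
mean(sim)                                       # should fall close to the observed share of 1s
# table(sim)                                    # compare with table(observed, useNA = "ifany")
```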
[i] Perhaps a stricter test of predictive capacity is examining the modal response. Instead of choosing a random response for each respondent, if we guessed that each respondent would give the answer that was most frequently given in the sample, would we guess correctly more often than the predictions of our model? All three of our models do a better job of picking the observed response than the alternative generation process.
[ii] No observations were missing in the county-level data. The complete empirical distribution of each variable was known. Therefore, we did not need to specify prior distributions for those variables.
Table A1. Estimated Bernoulli success parameter p for the individual-level dummy variables, by model.

| Variable | Model 1 | Model 2 | Model 3 |
| --- | --- | --- | --- |
| Nonwhite | 0.293 | 0.293 | 0.292 |
| Female | 0.667 | 0.668 | 0.667 |
| Homeowner | 0.831 | 0.832 | 0.832 |
| Income $0-25,000 | 0.160 | 0.160 | 0.161 |
| Income $25,000-50,000 | 0.197 | 0.199 | 0.198 |
| Income $50,000-75,000 | 0.233 | 0.234 | 0.234 |
| Resident 1-5 years | 0.263 | 0.264 | 0.263 |
| Resident 5+ years | 0.651 | 0.651 | 0.650 |
| Visits Other Cities Daily | 0.242 | 0.242 | 0.242 |
| Visits Other Cities Weekly | 0.159 | 0.159 | 0.159 |
| Visits Other Cities Monthly or Bimonthly | 0.119 | 0.119 | 0.120 |