Leveraging Bayesian Methods for Analyzing CDC BRFSS Data

In public health research, the Centers for Disease Control and Prevention’s Behavioral Risk Factor Surveillance System (BRFSS) data is a common source of valuable information. However, finding the insights buried within this rich dataset sometimes requires sophisticated analytical techniques. This is where Bayesian methods shine.

Bayesian methods offer a unique framework for analyzing BRFSS data that can yield better insights and more robust conclusions. They combine the observed data with prior knowledge or beliefs about the model’s parameters. This is helpful with datasets like the BRFSS, where one can use prior knowledge about health outcomes, population characteristics, and risk factors to provide context.

If no prior knowledge about the parameters is available, one option is an empirical Bayes analysis, in which the priors are estimated from the data themselves. One version of this is to run a Bayesian model with weak, or default, priors and then use priors derived from that model in a second model. This can help to improve model performance.

In my previous analysis of the effects of lifestyle factors on physical health, I employed frequentist methods to calculate the statistics. Here, I will apply the same methodology using Bayesian methods, following the empirical Bayes approach described above.

The code I ran in R follows. You’ll notice that it’s not much different from fitting a linear model with lm, but instead of the lm function, you fit the model with the brm function from the brms package; the bf function only defines the model formula.

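library(brms)

# Model formula: days of poor physical health regressed on lifestyle factors
formula <- bf(PHYSHLTH ~ EXERANY2 + BMI + SMOKE100 + ALCDAY4)
# First pass: run the Bayesian regression without specifying priors (brms defaults)
bayesian_model <- brm(formula, data = data, chains = 4)
# Second pass: refit with the prior specification obtained from the first model
first_pass_priors <- default_prior(object = bayesian_model, data = data)
bayesian_model_2 <- brm(formula, data = data, prior = first_pass_priors, chains = 4)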

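The second call passes the prior specification obtained from the first model into the final model. Note that default_prior() reports the default priors brms would use rather than a summary of the first model’s posterior, so priors that reflect what the first fit learned have to be built from its posterior estimates, for example with set_prior().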
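The idea of deriving priors from a first fit can be sketched as follows. This is a minimal illustration, not the code used for the results in this post: the helper priors_from_summary, the widen argument, and the placeholder numbers are all assumptions made for the example. It relies only on the fact that brms’s fixef() returns a matrix with "Estimate" and "Est.Error" columns, one row per coefficient.

```r
# Turn posterior summaries from a first fit into normal prior strings.
# `fe` is expected to look like the matrix returned by brms::fixef():
# one row per coefficient, with "Estimate" and "Est.Error" columns.
# `widen` (hypothetical tuning knob) inflates the posterior SD.
priors_from_summary <- function(fe, widen = 2) {
  stopifnot(all(c("Estimate", "Est.Error") %in% colnames(fe)), widen > 0)
  # Drop the intercept: brms treats it as a separate parameter class
  fe <- fe[rownames(fe) != "Intercept", , drop = FALSE]
  data.frame(
    coef = rownames(fe),
    prior = sprintf("normal(%.3f, %.3f)",
                    fe[, "Estimate"], widen * fe[, "Est.Error"]),
    stringsAsFactors = FALSE
  )
}

# Illustrative placeholder values only, NOT results from the BRFSS models
example_fe <- matrix(
  c( 5.0, 0.50,
    -1.0, 0.20,
     0.1, 0.02),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Intercept", "x1", "x2"), c("Estimate", "Est.Error"))
)
prior_table <- priors_from_summary(example_fe)
print(prior_table)

# With brms loaded and the first model fitted, the table could be used as:
#   prior_table <- priors_from_summary(fixef(bayesian_model))
#   informed <- do.call(c, lapply(seq_len(nrow(prior_table)), function(i)
#     set_prior(prior_table$prior[i], class = "b", coef = prior_table$coef[i])))
#   bayesian_model_2 <- brm(formula, data = data, prior = informed, chains = 4)
```

Widening the standard deviation is a deliberate choice in this sketch: the same data inform both the prior and the second fit, so a prior as narrow as the first posterior would make the final intervals look more certain than the data justify.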

The results are as follows:

They are very similar to those of the frequentist linear models, and in most cases, the estimate falls within a few hundredths of the original model’s estimate. However, the Bayesian 95% credible intervals are somewhat different from the linear model’s confidence intervals. A plot of the Bayesian model shows the distribution of the results of the MCMC simulations:

In the context of this analysis, where the dependent variable ranges from 0 to 31, the residual standard deviation (sigma) of 8.35 is quite substantial. Thus, it’s essential to consider its implications for understanding the variability of our data.

This large value indicates that this model may not fully capture all the variation present in days of poor physical health. In practical terms, it means that the predictions made by this model may deviate notably from the actual observed values. Therefore, we should approach the interpretations of the regression coefficients and predictions with caution, recognizing the inherent uncertainty introduced by this level of variability.
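To put that number in context, here is a rough back-of-the-envelope calculation using only the two values quoted above (sigma of 8.35 and an outcome range of 0 to 31). It assumes a plain normal approximation for a single prediction and ignores both parameter uncertainty and the fact that the outcome is bounded, so treat it as an illustration rather than a formal predictive interval.

```r
# Values quoted in the text: residual SD and the outcome's range in days
sigma <- 8.35
outcome_min <- 0
outcome_max <- 31

outcome_range <- outcome_max - outcome_min
# Residual SD as a share of the full outcome range
sigma_share <- sigma / outcome_range
# Width of a central 95% band for one prediction under a normal approximation
band_width <- 2 * qnorm(0.975) * sigma

cat(sprintf("sigma / range: %.2f\n", sigma_share))
cat(sprintf("95%% band width: %.1f days (range: %.0f days)\n",
            band_width, outcome_range))
```

Under these assumptions, the 95% band for an individual prediction is wider than the entire range of the outcome, which is why individual predictions from this model should be read with caution.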

Although this model provides valuable insights into the relationships between the various lifestyle factors and the number of days of poor physical health, I must acknowledge its limitations due to the residual variability. Further exploration may be necessary to understand the sources of this variability and potentially refine the model to achieve a better fit to the data.
