Contributed by Yannick Kimmel. He is currently in the NYC Data Science Academy 12-week, full-time Data Science Bootcamp program, which runs from April 11th to July 1st, 2016. This post is based on his second class project, R Shiny (due in the 4th week of the program).
Introduction
The culture of food and health, like other aspects of culture, is constantly changing and is diverse (that is, it has high variance) across the USA. Obesity affects roughly 1 in 3 Americans, while diabetes affects roughly 1 in 10. I wanted to understand the relationship between food and health demographics. I thought this data would be important for policy makers and civic leaders who are interested in changing their demographics for the better. The USDA’s Food Environment Atlas is a great resource for county-specific data on this subject. The data can be found here. The Shiny app can be found here and the code here.
USA county map
The first page of the Shiny app is a county-specific map of the continental USA. I chose to display three indicators of particular interest: obesity, diabetes
Scatter plot
On the second page, I selected 20 variables from the 9 categories in the Food Environment Atlas so that the general relationships among them can be explored. The scatter plot was created with the Plotly package. Although the y variable can be changed from obesity rates, I fixed obesity as the color variable because it is the most important metric for this project, and doing so keeps it visible on every plot. If the mouse pointer hovers over an individual data point, the specific county is displayed. As was shown on the map, diabetes
Prediction
The end goal of this project is a predictive tool for policy makers interested in seeing how different factors could affect their obesity rates, so the third page allows users to predict obesity rates. As a preliminary analysis, stepwise multiple linear regression was run on 17 variables of interest to predict obesity rates. Stepwise regression showed that at least 10 variables are significant. Basic diagnostics indicate that the model assumptions were not violated and that multiple linear regression is valid. 76% of the records were complete cases, while the rest had at least one NA; only the complete cases were used in prediction. The initial value of each slider is the mean of that variable in the dataset, and the range of each slider is the mean +/- 3 standard deviations, which lets the user choose reasonable values. The page includes the coefficients, their variance inflation factors, residual plots, a QQ plot, and a leverage plot.
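The complete-case filtering and the slider defaults described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R code behind the app, which is not shown in this post. The column names, the sample rows, and the helper names `complete_cases` and `slider_settings` are all hypothetical, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

# Hypothetical county records; None plays the role of NA in R.
rows = [
    {"pct_obese": 30.0, "pct_diabetes": 10.0, "poverty_rate": 15.0},
    {"pct_obese": 28.0, "pct_diabetes": None, "poverty_rate": 12.0},
    {"pct_obese": 34.0, "pct_diabetes": 12.0, "poverty_rate": 21.0},
    {"pct_obese": 26.0, "pct_diabetes": 8.0, "poverty_rate": 9.0},
]


def complete_cases(records):
    """Keep only the records that have no missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def slider_settings(records, column, k=3.0):
    """Return the default value and range for one slider.

    The default is the column mean and the range is mean +/- k sample
    standard deviations (k = 3 matches the description in the text).
    """
    values = [r[column] for r in records]
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (assumption)
    return {
        "min": centre - k * spread,
        "value": centre,
        "max": centre + k * spread,
    }


usable = complete_cases(rows)
for name in ("pct_diabetes", "poverty_rate"):
    print(name, slider_settings(usable, name))
```

With this toy data the lower bound for `poverty_rate` comes out negative, which is impossible for a percentage. A real app might clip each range to the variable's valid domain. Whether the original app does so is not stated.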
Data table
The last page is a data table of the counties, in which a user can search for a specific county in the USA.
Conclusions
The USA has many diverse counties, and this spread can result in many demographic differences in health and food. The scope of this project is to explore the relationship between food and economic factors across US counties. Data was taken from the USDA’s Food Environment Atlas. The diversity (or variance) of factors across counties allowed correlations with obesity rates to be identified. Users can map health indicators across the US, explore the relationships among different factors, and predict how changes in demographic factors would affect obesity rates.