Analysis of relationship between US Adults' Obesity Rate, Socioeconomic status and Environment by State

To Yin Yu(Github: tonyx1998)

Introduction

Obesity is a common, serious and costly disease in the US. According to the Centers for Disease Control and Prevention, obesity-related conditions include heart disease, stroke, type 2 diabetes and certain types of cancer. These are among the leading causes of preventable, premature death. Adults with obesity are also at greater risk during the COVID-19 pandemic, since obesity worsens outcomes from COVID-19.

However, not everyone is equally likely to be obese: obesity affects some groups more than others. Hence, this tutorial aims to analyze the correlation between socioeconomic status, environment and obesity across states. We are going to see whether some factors are more strongly linked to obesity than others.

You can learn more about obesity here:

-https://www.cdc.gov/obesity/data/adult.html

-https://www.cdc.gov/obesity/data/obesity-and-covid-19.html

Imports

In this part, we are going to import all the libraries and functionalities that are required for the rest of the tutorial.

1. Pandas (https://pandas.pydata.org/) & NumPy (https://numpy.org/): Required for dataframe manipulation
2. Matplotlib (https://matplotlib.org/) & Seaborn (https://seaborn.pydata.org/): Required for data visualization
3. scikit-learn (https://scikit-learn.org/stable/) & statsmodels (https://www.statsmodels.org/stable/index.html): Required for hypothesis testing and machine learning

Data Collection

Before we start diving into finding interesting patterns and analyzing correlations between factors, we first need some data. The data I am using for this tutorial are: annual obesity rate, air pollution, high school graduation rate, physical inactivity rate and public health funding by state (https://www.americashealthrankings.org/explore/annual/measure/Obesity/state/ALL); annual average temperature by state (https://www.ncdc.noaa.gov/cag/statewide/time-series/1/tavg/ann/6/1990-2020?base_prd=true&begbaseyear=1901&endbaseyear=2000); annual gross domestic product (GDP) in current US dollars, as a percentage of the US total, and per capita by state (https://apps.bea.gov/); and annual median household income by state (https://www.census.gov/data/tables/time-series/demo/income-poverty/historical-income-households.html). We will leave the data raw and messy in this part and clean it up in the next part of the tutorial. We will only be looking at 2015-2019 data.

Data Cleaning

For the 2015-2019 obesity dataset, only the edition, measure name, state name and value columns are useful for our analysis. As for rows, only those whose measure name is air pollution, high school graduation, physical inactivity, public health funding or obesity are meaningful for our purpose, so we will delete the rest. We will also remove rows with "United States" or "District of Columbia" as the State Name, since they are not states.
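The column and row filtering described above can be sketched as follows. The sample rows and their values are made up for illustration, and the exact column names and measure-name spellings are assumptions taken from the description, so they may need adjusting to match the downloaded file.

```python
import pandas as pd

# Tiny made-up sample standing in for the raw download (values are NOT real data)
raw = pd.DataFrame({
    "Edition": [2015, 2015, 2015, 2015, 2015],
    "Measure Name": ["Obesity", "Smoking", "Air Pollution", "Obesity", "Obesity"],
    "State Name": ["Alabama", "Alabama", "Alaska",
                   "United States", "District of Columbia"],
    "Value": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Rank": [1, 2, 3, 4, 5],
})

keep_cols = ["Edition", "Measure Name", "State Name", "Value"]
keep_measures = ["Air Pollution", "High School Graduation", "Physical Inactivity",
                 "Public Health Funding", "Obesity"]
not_states = ["United States", "District of Columbia"]

# Keep only the measures we need, drop the non-state rows, keep useful columns
row_mask = raw["Measure Name"].isin(keep_measures) & ~raw["State Name"].isin(not_states)
cleaned = raw.loc[row_mask, keep_cols].reset_index(drop=True)
print(cleaned)
```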

*** For the 2015 high school graduation data, Idaho's value is missing, but its ranking is not. Hence, I decided to use single imputation. Idaho is at rank 17; rank 16 has the value 85.5 and rank 18 has the value 85.0. I estimated Idaho's value as 85.3, approximately the midpoint (85.25) of ranks 16 and 18, and used that throughout the tutorial.
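A minimal sketch of this imputation is shown below. The two neighbouring state names are placeholders and the Rank column name is an assumption; only the ranks and values come from the text above.

```python
import numpy as np
import pandas as pd

# Rank 16 and rank 18 values are from the text; "State A"/"State B" are placeholders
hs = pd.DataFrame({
    "State Name": ["State A", "Idaho", "State B"],
    "Rank": [16, 17, 18],
    "Value": [85.5, np.nan, 85.0],
})

lower = hs.loc[hs["Rank"] == 16, "Value"].iloc[0]
upper = hs.loc[hs["Rank"] == 18, "Value"].iloc[0]
midpoint = (lower + upper) / 2  # exact midpoint of the two neighbours
print(midpoint)

# The tutorial uses 85.3, the midpoint rounded to one decimal place
hs.loc[hs["State Name"] == "Idaho", "Value"] = 85.3
```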

After handling the 2015-2019 air pollution, high school graduation, physical inactivity, public health funding and obesity dataset, we will clean up the GDP datasets. We will remove the GeoFips column. We will also melt the dataframe so that year becomes a single column, which keeps the dataframe tidy. More about tidy data here: https://cmsc320.github.io/files/tidy_data.pdf

After the GDP datasets, we will clean the annual average temperature by state.

Now we will clean up the state median income dataset. We will remove all columns except the 2015-2019 data, and also the United States row.

After cleaning up all the datasets separately, we will now combine all the useful data into one dataframe, since it would be difficult to keep track of many datasets at the same time.
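Melting a wide GDP table into tidy form might look like this. The GeoName column name and the numbers are assumptions for illustration; only GeoFips is named in the text.

```python
import pandas as pd

# Made-up wide table: one row per state, one column per year (values are placeholders)
gdp_wide = pd.DataFrame({
    "GeoFips": ["01000", "02000"],
    "GeoName": ["Alabama", "Alaska"],
    "2015": [1.0, 2.0],
    "2016": [1.5, 2.5],
})

gdp_long = (
    gdp_wide
    .drop(columns="GeoFips")                                   # not needed for analysis
    .melt(id_vars="GeoName", var_name="year", value_name="gdp")  # one row per state-year
    .astype({"year": int})                                     # year labels were strings
)
print(gdp_long)
```

Each row of the result is now one observation (a state in a year), which is the tidy layout the later merge relies on.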
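One way to combine the cleaned tables is to merge them on state and year. The sketch below assumes each table has already been reshaped to one row per state and year with columns named state and year; those names and all values are placeholders.

```python
import pandas as pd

keys = {"state": ["Alabama", "Alabama", "Alaska", "Alaska"],
        "year": [2015, 2016, 2015, 2016]}

# Placeholder values, not real data
health = pd.DataFrame({**keys, "ob_value": [1.0, 2.0, 3.0, 4.0]})
gdp = pd.DataFrame({**keys, "gdp_cap": [10.0, 20.0, 30.0, 40.0]})
income = pd.DataFrame({**keys, "median_income": [100.0, 200.0, 300.0, 400.0]})

# Chain merges so every source ends up as columns of a single dataframe
merged = (
    health
    .merge(gdp, on=["state", "year"], how="left")
    .merge(income, on=["state", "year"], how="left")
)
print(merged)
```

Using how="left" keeps every state-year from the health table, so any state missing from another source shows up as NaN rather than silently disappearing.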

We will also add a new column for normalized year, which might help ease the issue of varying scales in the upcoming sections.

We can see that we do not have temperature data for the state of Hawaii, so we will leave Hawaii out of the dataset in this tutorial.
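The normalized year column and the Hawaii exclusion can be sketched as below, assuming the combined dataframe has state, year and temp columns. The temperatures are placeholders.

```python
import numpy as np
import pandas as pd

# Placeholder rows; Hawaii has no temperature reading
df = pd.DataFrame({
    "state": ["Alabama", "Hawaii", "Alabama", "Hawaii"],
    "year": [2015, 2015, 2019, 2019],
    "temp": [60.0, np.nan, 61.0, np.nan],
})

# Normalized year: 2015 -> 0, ..., 2019 -> 4
df["norm_year"] = df["year"] - 2015

# Check which states lack temperature data before dropping anything
missing_states = df.loc[df["temp"].isna(), "state"].unique().tolist()
print(missing_states)

df = df[df["state"] != "Hawaii"].reset_index(drop=True)
```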

After all the processing and cleaning, we finally got a single dataframe that contains all the data we need.

ap_value: Average exposure of the general public to particulate matter of 2.5 microns or less measured in micrograms per cubic meter (3-year estimate)

ob_value: Percentage of adults with a body mass index of 30.0 or higher based on reported height and weight

hs_value: Percentage of high school students who graduated with a regular high school diploma within four years of starting ninth grade

phy_value: Percentage of adults who reported doing no physical activity or exercise other than their regular job in the past 30 days

hf_value: State dollars dedicated to public health and federal dollars directed to states by the Centers for Disease Control and Prevention and the Health Resources Services Administration per person

gdp: Annual gross domestic product (GDP) in current United States Dollar for all industry total

gdp_pct: Annual gross domestic product (GDP) in percent of United States for all industry total

median_income: Median household income in United States Dollar

temp: Average annual temperature in Fahrenheit

gdp_cap: Annual gross domestic product (GDP) in current United States Dollar for all industry total per capita

norm_year: Normalized year (0: 2015, 1: 2016, 2: 2017, 3: 2018, 4: 2019)

Exploratory Data Analysis and Visualization

Although we now have a single dataframe for all our data, it is still difficult to find any interesting pattern. Hence, we will be plotting some graphs with our data to visualize any meaningful patterns for our analysis.

In this plot, we are trying to see the relationship between year and obesity. We can see that obesity and year have a mild positive linear correlation. This indicates that we need to be careful when we look at the relationship between obesity and other variables, since the obesity rate tends to rise from year to year over this period.
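A scatter plot with a fitted line, like the ones discussed in this section, could be drawn with seaborn's regplot. The data here is synthetic and only stands in for the real dataframe; the same pattern applies to the other variables by changing the x column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the real data: 10 "states" per year, 2015-2019
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2015, 2020), 10)
demo = pd.DataFrame({
    "year": years,
    "ob_value": 29 + 0.5 * (years - 2015) + rng.normal(0, 3, years.size),
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.regplot(data=demo, x="year", y="ob_value", ax=ax)  # scatter + linear fit
ax.set_xlabel("Year")
ax.set_ylabel("Obesity rate (%)")
ax.set_title("Year vs obesity rate")

# Pearson correlation puts a number on what the plot suggests
r = demo["year"].corr(demo["ob_value"])
print(round(r, 2))
```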

In this plot, we will be looking at air pollution and obesity. We can see that there might be a slight positive linear relationship; however, it might be too mild to count as an interesting pattern for our purpose.

When it comes to high school graduation rate vs obesity rate, we can see a more obvious positive linear relationship. The margin of error is also smaller.

We can see that physical inactivity has the strongest positive linear relationship with obesity rate. This hints that physical inactivity rate might be positively correlated with obesity rate.

We can see that there is almost no linear relationship between public health funding and obesity.

For the GDP vs obesity rate plot, we can see a slight negative linear relationship.

For the GDP per capita vs obesity rate plot, we can see a stronger negative linear relationship than in the GDP vs obesity rate plot.

We can see a clear negative linear relationship in this plot. This might show that lower median household income is correlated with higher obesity rate.

We can see that annual average temperature vs obesity rate might also be interesting for us to study, since it shows a positive relationship.

In this heatmap, we can see the correlations between all variables in the dataframe. This helps check and verify what we found from the plots above.
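A correlation heatmap of this kind can be sketched with pandas and seaborn. The three columns below are synthetic stand-ins; with the real dataframe, the same two calls would cover every numeric column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: ob_value depends on phy_value, hf_value is unrelated noise
rng = np.random.default_rng(1)
n = 100
phy = rng.normal(25, 4, n)
demo = pd.DataFrame({
    "phy_value": phy,
    "ob_value": 15 + 0.6 * phy + rng.normal(0, 2, n),
    "hf_value": rng.normal(90, 30, n),
})

corr = demo.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
fig.tight_layout()
```

Fixing vmin and vmax at -1 and 1 keeps the colour scale symmetric, so positive and negative correlations of equal strength look equally intense.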

Hypothesis Testing and Machine Learning

In this section, we will use several machine learning models to predict obesity rate and analyze the patterns and possible correlations we found in the previous section. Before diving into the algorithms, we have to split our data into the predictors and the target (obesity rate) used to verify our results. In the last section, we found that public health funding has basically no correlation with obesity rate, so we are not going to use it as one of the predictors. Also, GDP, GDP percentage and GDP per capita are closely related measures of the same quantity, so we will only use GDP per capita as a predictor.
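Selecting predictors and holding out a test set might look like the sketch below. The exact predictor list, the 80/20 split and the random seed are assumptions; the text only says that public health funding, GDP and GDP percentage are left out. The data is random and just gives the columns something to hold.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in for the combined dataframe
rng = np.random.default_rng(0)
cols = ["ap_value", "hs_value", "phy_value", "hf_value", "gdp", "gdp_pct",
        "gdp_cap", "median_income", "temp", "norm_year", "ob_value"]
df = pd.DataFrame(rng.normal(size=(50, len(cols))), columns=cols)

# Drop the target plus the columns excluded in the text
excluded = ["ob_value", "hf_value", "gdp", "gdp_pct"]
X = df.drop(columns=excluded)
y = df["ob_value"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```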

The first machine learning model we will use is linear regression. Since we were visualizing the linear patterns between the variables and obesity in the last section, we will fit the linear regression model first and see how it performs. We will look at the mean obesity value, the mean absolute error and the coefficient of determination as indicators of whether our model is a good fit.

We can see that we have a mean of ~30.32 and a mean absolute error of ~1.79, which is about 6% of the mean. This indicates that the margin of error of our model is relatively small. However, the coefficient of determination is relatively low, which tells us that the linear relationship between the variables and obesity might not be very strong.
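To make the two metrics concrete, the sketch below fits a linear regression on synthetic data and computes the mean absolute error and the coefficient of determination both by hand and with scikit-learn. The data, split and seed are assumptions, so the printed numbers will not match the tutorial's.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear signal plus noise
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({"phy_value": rng.normal(25, 4, n)})
y = 15 + 0.6 * X["phy_value"] + rng.normal(0, 2, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# Mean absolute error: average size of the prediction errors
mae_manual = np.mean(np.abs(y_test - pred))

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"mean={y_test.mean():.2f}  MAE={mae:.2f}  MAE/mean={mae / y_test.mean():.1%}  R2={r2:.2f}")
```

Dividing the error by the mean is the same calculation as in the text, where 1.79 / 30.32 gives roughly 6%.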

Since we are not getting a very good result from the linear regression model, we are going to try a different model now: gradient boosting.

We can see that we have a slightly smaller mean absolute error of ~1.60, which is about 5% of the mean. The coefficient of determination, ~0.52 for the gradient boosting model, is also very similar to that of the linear regression model. Since there is not much improvement over linear regression, this might indicate that the linear regression model has already captured most of the relationship between the variables and obesity. However, we are still going to try some other machine learning models and see if there is any unexpected improvement or change in the results.

The next model we are looking at is a decision tree.

We can see that our results have changed dramatically. The mean absolute error is ~2.3 here, which is about 7.5% of the mean, and the coefficient of determination is only ~0.12. This tells us that the decision tree model fits our data poorly, so it might not be a good approach for our dataset.

The last machine learning model we are going to look at is a random forest.

After looking at the random forest model, we can be more certain about what we claimed above: the linear regression model captures most of the relationship between the variables in our dataset and obesity rate. The mean absolute error and coefficient of determination, ~1.76 and ~0.52, are extremely similar to those of the linear regression model. It is disappointing to see such a low coefficient of determination; however, the result is not meaningless, and we can still get some insights into the data.
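A side-by-side comparison of the four models can be sketched with one loop. Everything here is an assumption for illustration: the synthetic data, the default hyperparameters and the random seeds, so the scores will differ from the ones reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a mostly linear signal
rng = np.random.default_rng(0)
n = 245
X = pd.DataFrame({
    "phy_value": rng.normal(25, 4, n),
    "median_income": rng.normal(60000, 9000, n),
})
y = 20 + 0.5 * X["phy_value"] - 0.00005 * X["median_income"] + rng.normal(0, 2, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "linear_regression": LinearRegression(),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(random_state=0),
}

# Fit each model on the same split and collect the same two metrics
scores = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    scores[name] = {"mae": mean_absolute_error(y_test, pred),
                    "r2": r2_score(y_test, pred)}

results = pd.DataFrame(scores).T
print(results.round(3))
```

Evaluating every model on the same held-out rows is what makes the scores comparable with each other.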

Besides machine learning, we will also perform a hypothesis test.

Our null hypothesis (H0) is that socioeconomic status and environmental factors are not correlated with obesity in US states. We will now perform a t-test to find out whether we reject our null hypothesis or not.

We can see that our p-values are extremely low, so we reject H0. In other words, the data give evidence that socioeconomic status and environmental factors are correlated with obesity in US states, even though the models above explain only about half of the variation in obesity rate. There are a lot of interesting facts we found throughout the analysis, and we will discuss them in the following Conclusion and Insights section.

Conclusion and Insights

Some of the variables in our dataset, for instance physical inactivity and public health funding, seemed related to obesity to me at the beginning of the tutorial. However, after the tutorial, I learnt that it might not be as simple as I thought. The models explain only about half of the variation in obesity rate, but does this mean there is absolutely no relationship between socioeconomic status, environmental factors and obesity? Absolutely not. There is always more than one approach to a single problem, and there are many ways to examine different features. This is what makes data science and machine learning challenging and exciting: there is always more to learn from what has already been established, and I am looking forward to diving into this topic from another perspective.