Monday 09/11 – pradnyamth522.sites.umassd.edu

Since Python is my preferred language for analysis, I used my Jupyter Notebook setup and loaded each dataset as a pandas DataFrame. I used several pandas functions to understand the basic characteristics of the data, for example the number of records, the data types, and the missing values.
Looking through the three datasets provided, I found that the Obesity dataset has only 363 records, Inactivity has 1370, and Diabetes has 3142.
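
As a rough illustration of the kind of checks described above, here is a minimal sketch on a tiny made-up DataFrame. The column names (`FIPS`, `county`, `pct_diabetic`) and the values are placeholders I am assuming for the example, not the real columns from the datasets.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for one of the datasets.
# Column names and values are illustrative assumptions only.
df = pd.DataFrame({
    "FIPS": [1001, 1003, 1005, 1007],
    "county": ["A", "B", "C", "D"],
    "pct_diabetic": [9.1, np.nan, 11.4, 8.7],
})

n_rows = len(df)            # number of records
col_types = df.dtypes       # data type of each column
missing = df.isna().sum()   # count of missing values per column

print(n_rows)
print(col_types)
print(missing)
```

Running the same three calls on each of the real DataFrames is one way to get the record counts and spot missing data before combining anything.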

During the discussion with the professor, we came across the question of what we can do with these datasets and what questions they can answer. As per our discussion, what we would like to achieve is to combine the datasets and ask how one factor is related to another. So my first step was to combine the diabetes and inactivity datasets, which gave me a total of 1370 rows. After some visualizations of this data, such as histograms and quantile plots, I found that the diabetes data is skewed a little higher relative to a normal distribution and the inactivity data is skewed slightly lower.
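
The combining step can be sketched as an inner join, assuming the two datasets share a county identifier. The key name `FIPS` and the measurement columns below are hypothetical, and the numbers are made up. With an inner join, only keys present in both tables survive, which would explain a combined row count equal to the size of the smaller dataset if every one of its counties also appears in the larger one.

```python
import pandas as pd

# Made-up stand-ins for the two datasets; "FIPS" as the shared key
# and the column names are assumptions for illustration.
diabetes = pd.DataFrame({
    "FIPS": [1, 2, 3, 4, 5],
    "pct_diabetic": [8.0, 9.5, 7.2, 10.1, 8.8],
})
inactivity = pd.DataFrame({
    "FIPS": [2, 3, 5],
    "pct_inactive": [16.0, 14.5, 17.2],
})

# Inner join keeps only counties that appear in both tables.
merged = diabetes.merge(inactivity, on="FIPS", how="inner")
print(len(merged))

# Sample skewness of each measure: positive means a longer right tail,
# negative a longer left tail, and roughly zero is symmetric.
skewness = merged[["pct_diabetic", "pct_inactive"]].skew()
print(skewness)
```

The `skew()` values give a numeric companion to what the histograms and quantile plots show visually.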

To find the relation between the two variables, the first step was to compute their correlation, which came out to 0.44 (44%), not a strong correlation. The scatterplot suggested that we could fit a linear model to this data for prediction. Based on the R-squared, about 20% of the variation in the Diabetes data points is attributed to variation in Inactivity. From the residual and heteroscedasticity plots, however, we can conclude that the linear model is not effective for this dataset.
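
For a linear model with a single predictor, R-squared is simply the square of the correlation coefficient, so a correlation of 0.44 gives 0.44² ≈ 0.19, which matches the roughly 20% figure. The sketch below checks that relationship on synthetic data. The numbers are randomly generated stand-ins rather than the real dataset, and the variable names are my own.

```python
import numpy as np

# Synthetic stand-ins: x plays the role of % inactivity and
# y the role of % diabetes. These are NOT the real data.
rng = np.random.default_rng(0)
x = rng.normal(15.0, 2.0, size=500)
y = 4.0 + 0.25 * x + rng.normal(0.0, 1.0, size=500)

# Pearson correlation between the two variables.
r = np.corrcoef(x, y)[0, 1]

# Least-squares line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# R-squared: share of the variance in y explained by the fitted line.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(r, r_squared)
```

Plotting `residuals` against the fitted values is the usual next step: a funnel or curved pattern there is a sign of heteroscedasticity or a poorly specified linear model.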
