09/27 Wednesday

In my recent deep dive into the CDC dataset, I set out to untangle the interplay of three variables: obesity, inactivity, and diabetes. It quickly became clear that the relationships among them are far from linear, and that a traditional linear regression could not capture the dynamics at play. Polynomial regression, which accommodates non-linear relationships by introducing polynomial terms of the predictors, let me model that curvature directly. By including polynomial terms for obesity and inactivity, I could represent the non-linearity in their association with diabetes, and by tuning the degree of those terms I could control the model's complexity, striking a balance between accuracy and overfitting. The results were revealing: the interactions between obesity, inactivity, and diabetes show inflection points and trends that linear models cannot explain. That finding matters for public health interventions and policy-making, which need approaches that account for the non-linear character of these variables.
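
As a rough sketch of how such a fit could be set up in Python with scikit-learn (the file name and the column names OBESITY, INACTIVITY, and DIABETES are placeholders, not the actual CDC field names):

```python
# Sketch: polynomial regression of diabetes on obesity and inactivity,
# trying several degrees and comparing train vs. test R^2 to judge where
# extra flexibility turns into overfitting. Names are placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("cdc_merged.csv")                     # hypothetical merged file
X, y = df[["OBESITY", "INACTIVITY"]], df["DIABETES"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for degree in (1, 2, 3, 4):
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          round(r2_score(y_train, model.predict(X_train)), 3),
          round(r2_score(y_test, model.predict(X_test)), 3))
```

A growing gap between the training and test scores as the degree increases is the simplest warning sign that the model has crossed from capturing curvature into overfitting.
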
This investigation taught me how important it is to reach for methods like polynomial regression when working with complex variables such as obesity, inactivity, and diabetes. By recognizing non-linear interactions and modeling them explicitly, we can understand the data more thoroughly and design more effective public health plans and interventions. The exercise reinforced how much data-driven insight can shape the future health of our communities.

09/25 Monday

Through my investigation of the CDC data on obesity, inactivity, and diabetes, I have learned critical lessons about assessing prediction error and using cross-validation. The simpler validation set approach first appeared promising but has real drawbacks: because it relies on a single, arbitrary split of the data, it frequently produces inconsistent error estimates. My journey took a more interesting turn when I learned about K-fold cross-validation. In this procedure the dataset is divided into K parts (often 5 or 10) and the fit is repeated K times, with each part serving once as the validation set while the remaining parts form the training set. K-fold cross-validation offers the following benefits:

Stability: error estimates are less variable, giving a more reliable indicator of model performance.
Efficiency: every available observation is used for both training and validation, so no data is wasted.
Model selection: comparing performance across folds helps determine the ideal level of model complexity.
Accurate test error estimation: the technique gives a realistic picture of how well a model will perform in practice.

There are pitfalls, however. Predictor selection must not be left outside the procedure: choosing predictors on the full dataset and only cross-validating the final model fit can produce severe bias and artificially low error estimates. To prevent this, apply cross-validation to the whole pipeline, both predictor selection and model fitting, so the entire process is evaluated honestly. In conclusion, my experience with the CDC data taught me the value of thorough model assessment. Estimating prediction error carefully, embracing cross-validation, and avoiding these common mistakes are essential if we want to get the most trustworthy models from our data.
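
As a minimal sketch of how this could look in code (same placeholder file and column names as in the other entries, with scikit-learn assumed as the tool rather than the exact pipeline I used):

```python
# Sketch: 5-fold cross-validation to compare polynomial degrees.
# Every observation is used for validation exactly once across the folds.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("cdc_merged.csv")                     # hypothetical merged file
X, y = df[["OBESITY", "INACTIVITY"]], df["DIABETES"]

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          LinearRegression())
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(degree, -scores.mean())                      # mean CV error per degree
```

Because the pipeline builds the polynomial features inside each fold, the predictor construction is cross-validated along with the fit, which is exactly the pitfall discussed above.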

09/22 – Friday

1. Wrestling with Collinearity:
One of the key lessons I gleaned from this endeavor was the significant impact of collinearity on predictive modeling. Collinearity refers to the strong correlations between predictor variables, which can lead to unstable regression coefficients and erroneous conclusions. In my analysis, I witnessed how obesity and inactivity, two vital factors in diabetes prediction, danced a complex statistical tango. The variance inflation factor (VIF) and tolerance values became my trusty companions, helping me assess the degree of collinearity between these variables. By identifying and managing collinearity effectively, I could extract more reliable insights from the data.
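For reference, a VIF check along these lines takes only a few lines with statsmodels (file and column names are placeholders for the actual CDC fields):

```python
# Sketch: variance inflation factors for the two predictors.
# VIF values well above roughly 5-10 are the usual warning sign of collinearity.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("cdc_merged.csv")                     # hypothetical merged file
X = sm.add_constant(df[["OBESITY", "INACTIVITY"]])

for i, name in enumerate(X.columns):
    if name == "const":
        continue                                       # the intercept's VIF is not informative
    print(name, round(variance_inflation_factor(X.values, i), 2))
```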

2. The Power of T-Tests:
Another invaluable tool in my analytical arsenal was the humble t-test. With diabetes as my dependent variable and obesity and inactivity as predictors, I conducted meticulous t-tests to evaluate the statistical significance of each predictor variable’s influence on diabetes. These tests allowed me to quantify the strength and direction of the relationships, separating the signal from the noise in the data. The t-tests illuminated the nuanced interplay between obesity, inactivity, and diabetes, enabling me to make data-driven inferences and conclusions.
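In practice these per-coefficient t-tests fall straight out of an ordinary least squares fit; a statsmodels sketch (placeholder names again) looks like this:

```python
# Sketch: OLS fit of diabetes on obesity and inactivity; the summary's
# "t" and "P>|t|" columns are the t-test for each coefficient.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("cdc_merged.csv")                     # hypothetical merged file
X = sm.add_constant(df[["OBESITY", "INACTIVITY"]])
fit = sm.OLS(df["DIABETES"], X).fit()

print(fit.summary())
print(fit.tvalues)                                     # t statistics per predictor
print(fit.pvalues)                                     # corresponding p-values
```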

3. Data-Driven Insights:
My exploration into the CDC data unveiled critical insights into the dynamics of diabetes prediction. I learned that while obesity and inactivity were correlated, they had unique contributions to the prediction model. Understanding their individual impacts was crucial for crafting more effective public health interventions and strategies. Moreover, the judicious application of t-tests and diligent management of collinearity strengthened the reliability of my findings, ensuring that the conclusions drawn from the data were both robust and scientifically sound.

In conclusion, my journey through the intricate web of CDC data, with obesity, inactivity, and diabetes as protagonists, taught me the vital importance of addressing collinearity and employing t-tests in epidemiological research. These technical tools and insights not only enriched my understanding of the relationships within the data but also underscored the significance of data-driven decision-making in public health. Even after all this, though, I still feel there is a lot to learn and understand from this data.

09/20 – Wednesday

In this study, I delved into a dataset provided by the Centers for Disease Control and Prevention (CDC), with the objective of exploring the relationship between diabetes and two predictor variables, namely inactivity and obesity. The aim was to employ statistical methods to investigate whether these variables are significant predictors of diabetes. Firstly, I conducted a t-test to compare the means of two groups, namely those with diabetes and those without. It’s worth noting that t-tests are known to have certain built-in assumptions, as elucidated in the Wikipedia article on the subject. However, in many practical applications, t-tests exhibit robustness even when these assumptions are not fully met. Nonetheless, for datasets that deviate significantly from normality, the assumptions underlying t-tests can render the estimation of p-values unreliable. In order to address this issue and obtain a more reliable assessment of the significance of the observed difference in means, I employed a Monte Carlo permutation test. This computational procedure allowed me to estimate a p-value under the assumption of a null hypothesis, which posits that there is no genuine difference between the two groups. It’s crucial to emphasize that simply applying a t-test using appropriate software may not provide an intuitive understanding of how the p-value was calculated. Therefore, the adoption of a Monte Carlo procedure in this study was not only useful but also informative. It offered a more robust approach to estimating the p-value, considering the non-normal nature of the data. This analytical framework facilitated a deeper and more nuanced exploration of the relationships between inactivity, obesity, and diabetes, shedding light on potential predictors and their impact on the dataset.
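A bare-bones version of such a Monte Carlo permutation test is sketched below; group_a and group_b stand for whatever two samples are being compared, and the number of shuffles is arbitrary.

```python
# Sketch: Monte Carlo permutation test for a difference in means.
# Under the null hypothesis the group labels are exchangeable, so we shuffle
# them many times and count how often the shuffled difference is as extreme
# as the observed one.
import numpy as np

rng = np.random.default_rng(0)

def permutation_test(group_a, group_b, n_perm=10_000):
    group_a, group_b = np.asarray(group_a), np.asarray(group_b)
    observed = group_a.mean() - group_b.mean()
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                            # relabel under the null
        diff = pooled[:n_a].mean() - pooled[n_a:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_perm                              # two-sided Monte Carlo p-value
```

Calling permutation_test(sample_a, sample_b) returns the fraction of relabelings at least as extreme as the observed difference, which serves as the Monte Carlo estimate of the p-value.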

09/18 – Monday

In my extensive analysis of CDC data, I have delved into the intricate relationship between inactivity and obesity as predictor variables in the context of predicting diabetes. Employing rigorous statistical methodologies, I applied both multi-linear regression and polynomial regression techniques to elucidate their impact on our dataset. Initially, I harnessed the power of multi-linear regression, an invaluable tool for examining complex relationships among multiple variables. Utilizing this method, I constructed a predictive model that incorporated inactivity and obesity as covariates, aiming to decipher their collective influence on diabetes incidence. Through extensive analysis, it became evident that this linear model yielded some valuable insights into the relationship between the predictor variables and the response variable. However, it was apparent that the intricate nature of this association might not be adequately captured by a linear framework alone. Recognizing the need for a more nuanced approach, I subsequently explored polynomial regression, an advanced technique that allows for the incorporation of non-linear relationships within the model. By introducing polynomial terms, I sought to capture the potential curvilinear associations between inactivity and obesity in relation to diabetes. This in-depth analysis revealed that the polynomial regression model not only improved the fit of our data but also uncovered nuanced, non-linear patterns in the relationship between these predictor variables and diabetes incidence.
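As a rough illustration, the statsmodels formula interface makes it easy to put the plain linear fit and a polynomial version side by side (column and file names are placeholders):

```python
# Sketch: multiple linear regression vs. a fit with squared and interaction
# terms, compared on adjusted R^2.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cdc_merged.csv")                     # hypothetical merged file

linear = smf.ols("DIABETES ~ OBESITY + INACTIVITY", data=df).fit()
poly = smf.ols("DIABETES ~ OBESITY + INACTIVITY"
               " + I(OBESITY**2) + I(INACTIVITY**2) + OBESITY:INACTIVITY",
               data=df).fit()

print(linear.rsquared_adj, poly.rsquared_adj)          # does the curvature help?
```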

In conclusion, my meticulous examination of CDC data, employing both multi-linear and polynomial regression techniques, has provided a comprehensive understanding of the complex interplay between inactivity, obesity, and diabetes. While the multi-linear model offered valuable insights into the linear relationships, the polynomial regression model enabled a more nuanced exploration of potential non-linear associations, enriching our comprehension of the predictive factors contributing to diabetes incidence. This analytical journey has expanded our knowledge base and underscores the importance of employing diverse statistical methodologies to unravel intricate relationships in epidemiological data.

09/15 Friday

In this extensive analysis of CDC data, I conducted a rigorous investigation into the predictive value of two key variables, inactivity and obesity, for the incidence of diabetes. The main goal was to understand how these predictors interact with diabetes in complex ways while also accounting for important issues such as multicollinearity and homoscedasticity.

The phenomenon of multicollinearity, a high correlation between predictor variables, was studied carefully. To detect and address collinearity problems, I used standard statistical diagnostics such as variance inflation factor (VIF) analysis. The outcomes underlined how critical it is to address multicollinearity, because it can skew coefficient estimates and make the model harder to interpret. Applying model refinement techniques such as variable selection and regularization reduced the multicollinearity and increased the model's predictive power. Furthermore, homoscedasticity, the assumption that the variance of the error terms is constant across predictor values, was examined carefully. I used statistical tests and residual plots to check whether heteroscedasticity was present. The results indicated that the model adhered to the homoscedasticity assumption: the variability of the error terms remained roughly constant across the range of predictor values. This result matters because it supports the validity of the statistical inferences drawn from the regression model. The analysis yielded several insights. First, obesity and inactivity emerged as important diabetes predictors, reiterating their significance in public health initiatives aimed at diabetes prevention. Second, addressing multicollinearity produced a noticeable improvement in model stability and interpretability, confirming the need for careful preprocessing in regression analysis. The confirmation of homoscedasticity gave me confidence in the model's reliability for assessing diabetes risk. Overall, this analysis highlighted the crucial link between inactivity, obesity, and diabetes. By addressing multicollinearity and verifying the homoscedasticity assumption, I not only improved the quality of the predictive model but also gained knowledge that can guide public health policies and interventions aimed at reducing the diabetes epidemic. The work drives home the significance of rigorous statistical analysis in using data to tackle urgent healthcare challenges.
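The residual diagnostics described here can be sketched roughly as follows; the Breusch-Pagan test is one common formal check for heteroscedasticity, and the file and column names are placeholders.

```python
# Sketch: residual-vs-fitted plot plus a Breusch-Pagan test for the
# homoscedasticity assumption.
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("cdc_merged.csv")                     # hypothetical merged file
X = sm.add_constant(df[["OBESITY", "INACTIVITY"]])
fit = sm.OLS(df["DIABETES"], X).fit()

plt.scatter(fit.fittedvalues, fit.resid, s=10)         # look for a funnel shape
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)             # small p-value suggests heteroscedasticity
```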

09/13 – Wednesday

The analysis of CDC data on diabetes, with obesity and inactivity as predictor variables, revealed several key findings. Firstly, the p-values were statistically significant, indicating a strong relationship between the predictors (obesity and inactivity) and the occurrence of diabetes. This suggests that these factors play an important role in the development of diabetes within the population studied. However, the data exhibited heteroscedasticity, which can make predictive modeling inefficient: the variance of the errors is not constant across different levels of the predictor variables, which makes predictions less reliable and points to limitations in predicting diabetes accurately from obesity and inactivity alone. Additionally, the predictor variables obesity and inactivity were highly correlated with each other. This high correlation can lead to multicollinearity in predictive models, making it challenging to determine the unique contribution of each variable to diabetes risk. Addressing multicollinearity through techniques such as feature selection or dimensionality reduction may be necessary to build a more robust predictive model. In summary, the analysis revealed a significant relationship between obesity, inactivity, and diabetes, suggesting that these variables are important for understanding diabetes risk. However, the presence of heteroscedasticity and the high correlation between the predictors should be considered when developing predictive models for diabetes, to keep the model efficient and accurate.

09/11 – Monday

Since Python is my preferred language for analysis, I used my Jupyter notebook setup and loaded the data as pandas DataFrames. I used several pandas functions to understand the basic characteristics of the data, for example the number of records, the data types, and the missing values.
Looking through the three datasets provided, I found that the Obesity dataset has only 363 records, Inactivity has 1,370, and Diabetes has 3,142.
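A minimal sketch of that first pass (the file names are made up) looks like this:

```python
# Sketch: load the three CDC files and inspect size, dtypes, and missing values.
import pandas as pd

obesity = pd.read_csv("cdc_obesity.csv")
inactivity = pd.read_csv("cdc_inactivity.csv")
diabetes = pd.read_csv("cdc_diabetes.csv")

for name, frame in [("obesity", obesity), ("inactivity", inactivity), ("diabetes", diabetes)]:
    print(name, len(frame))                            # number of records
    print(frame.dtypes)                                # column types
    print(frame.isna().sum())                          # missing values per column
```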

During a discussion with the professor, we considered what can be done with this dataset and which questions it can answer. Our conclusion was that combining the datasets to study how one factor relates to another is what we would like to achieve. My first step, therefore, was to merge the diabetes and inactivity datasets, which gave me a total of 1,370 rows. After creating some visualizations, such as histograms and quantile (Q-Q) plots, I found that the diabetes data is skewed slightly above a normal distribution and the inactivity data is skewed slightly below it.
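A sketch of the merge and the distribution checks is below; the join key (a county FIPS code) and the column names are assumptions about how the files are laid out.

```python
# Sketch: inner-join diabetes and inactivity on the county identifier, then
# look at a histogram and a quantile (Q-Q) plot of each variable.
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats

diabetes = pd.read_csv("cdc_diabetes.csv")
inactivity = pd.read_csv("cdc_inactivity.csv")
merged = diabetes.merge(inactivity, on="FIPS", how="inner")   # ~1,370 rows expected

merged["DIABETES"].hist(bins=30)                       # distribution of diabetes rates
plt.show()

stats.probplot(merged["INACTIVITY"], dist="norm", plot=plt)   # Q-Q plot against normal
plt.show()
```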

To explore the relationship between the variables, the first step was to compute their correlation, which came out to about 0.44, not a strong correlation. From the scatterplot, it looked reasonable to fit a linear model to the data for prediction. Based on the R-squared, about 20% of the variation in diabetes can be attributed to variation in inactivity. From the residual and heteroscedasticity plots, however, I concluded that a simple linear model is not effective for this dataset.
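For completeness, the correlation and the single-predictor fit can be reproduced roughly like this (placeholder names again); the residual plots would follow the same pattern as in the 09/15 entry.

```python
# Sketch: correlation between inactivity and diabetes, then a one-predictor
# OLS fit to read off the R-squared.
import pandas as pd
import statsmodels.api as sm

merged = pd.read_csv("diabetes_inactivity_merged.csv")  # hypothetical merged file
print(merged["DIABETES"].corr(merged["INACTIVITY"]))    # ~0.44 reported above

X = sm.add_constant(merged["INACTIVITY"])
fit = sm.OLS(merged["DIABETES"], X).fit()
print(fit.rsquared)                                      # ~0.20 reported above
```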