12/08 – Friday

Completing the report on employee earnings data sourced from the Boston government marks the culmination of a captivating journey—one that’s been both enlightening and fulfilling. This voyage of exploration and analysis has been an integral part of my learning experience, especially in my MTH 522 class. Embarking on this expedition, I committed to documenting my progress through daily blog posts, chronicling the insights gained, challenges faced, and the wonders hidden within the data. Each post became a chapter, weaving together the narrative of my learning journey. As I wrapped up the report, I couldn’t help but reflect on the transformation this journey has wrought. From grappling with raw data to synthesizing comprehensive analyses, every step has been a testament to perseverance and the power of continuous learning. What made this experience truly exceptional was the joy in learning something new every day. Exploring the depths of data analysis and witnessing the magic of extracting meaningful insights felt like discovering a whole new world—one brimming with possibilities. The beauty of this journey isn’t just in completing the report; it’s in the small victories, the ‘aha’ moments, and the growth that came with each challenge conquered. It’s a testament to the wonder of education and the joy of intellectual exploration. As this chapter draws to a close, I celebrate not just the completion of a report, but the discovery of newfound skills, a deeper understanding of data analysis, and the sheer delight of continuous learning. Deciphering data isn’t just an assignment; it’s an odyssey of growth, and finishing this report matters less as a milestone than as proof of the joy of learning and the thrill of exploring the realms of data analysis.

12/06 – Wednesday

The journey of deciphering the intricate tale behind employee earnings data has been an engaging endeavor. Working diligently on this report, delving into the depths of the dataset sourced from the Boston government, I’ve unearthed a narrative that transcends mere figures. Let me walk you through this enlightening expedition. At the core of this exploration lies a trove of numbers, positions, departments, and tenure—each data point a fragment of a larger picture. Gleaning insights from this dataset has been akin to piecing together a mosaic; each piece adds depth to the overall story of the organization’s financial landscape. Through meticulous analysis, patterns emerged, revealing correlations between roles, tenure, and earnings. The data painted a canvas where certain departments displayed distinct earning trends, while job tenure seemed intricately linked to compensation in unforeseen ways. Beyond the surface numbers, these insights hold transformative power. They empower organizations to make informed decisions—adjusting salary structures, refining resource allocation, and fostering an environment where fairness and equity prevail. However, with great insights comes great responsibility. As I navigated this data, ethical considerations remained paramount. Ensuring privacy, guarding against biases, and maintaining transparency were non-negotiable elements of this journey.
This report isn’t just a snapshot; it’s a compass pointing towards the future. As technology advances, the methods of analysis will evolve, allowing for deeper dives and more nuanced interpretations of employee earnings data. In the end, this report transcends the realm of numbers. It’s a narrative—a testament to the story hidden within the data. Through this journey, I’ve discovered that decoding employee earnings reports isn’t just about crunching numbers; it’s about illuminating the path towards informed decision-making and a fairer, more equitable workplace. Decrypting data isn’t just about numbers; it’s about revealing the narrative hidden within. This report isn’t just insights; it’s a story waiting to be told.

 

12/04 – Monday

Employee earnings reports, typically represented in tabular form, hold a wealth of information crucial for understanding an organization’s financial landscape. While not immediately associated with images, the application of Convolutional Neural Networks (CNNs), primarily used for image analysis, can revolutionize the interpretation and analysis of such datasets.

At first glance, CNNs might seem unconventional for processing tabular data like employee earnings reports. However, recent advancements have demonstrated their adaptability beyond image analysis. By reshaping the data into ‘image-like’ structures, CNNs can effectively identify patterns and relationships within tabular datasets.

To utilize CNNs for this purpose, the dataset requires transformation. Converting the tabular data into ‘images’ involves reshaping the information into a matrix format that resembles grayscale or color images. This restructuring involves thoughtful encoding and representation of attributes such as job roles, departments, earnings, and tenure.

Designing a CNN architecture for tabular data involves constructing layers that can effectively learn and extract features from the ‘image-like’ representation of the dataset. This architecture may consist of convolutional layers, pooling layers, fully connected layers, and output layers tailored for the specific analysis objectives.
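To make this concrete, here is a minimal sketch of such an architecture using TensorFlow/Keras. The data below is a synthetic stand-in for an encoded earnings table (a real encoding of roles, departments, and tenure would replace it), so treat this as an illustration of the idea rather than a finished model.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Synthetic stand-in for an encoded earnings table: 500 employees,
# 64 numeric features (encoded role, department, tenure, pay fields, ...)
# reshaped into 8x8 single-channel "images".
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64)).reshape(500, 8, 8, 1)
y = rng.normal(loc=70_000, scale=15_000, size=500)  # hypothetical total earnings

model = tf.keras.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", padding="same", input_shape=(8, 8, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),  # regression output: predicted earnings
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```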

CNNs excel at feature extraction, allowing them to identify complex patterns within the ‘image’ of the tabular data. These features might correspond to relationships between different attributes or patterns indicative of specific earning trends across various job roles, departments, or experience levels.

The output of the CNN analysis provides insights into relationships between different attributes, enabling organizations to identify correlations between job roles, earnings, and other factors. For instance, the model might reveal that certain departments or positions exhibit similar earning patterns or anomalies worth investigating.

Utilizing CNNs for data analysis demands transparency and ethical considerations. Ensuring the privacy of sensitive information, mitigating biases in the model, and clearly communicating the utilization of such advanced techniques are essential aspects of responsible data analysis.

The application of CNNs in analyzing employee earnings reports presents an innovative approach to extract nuanced insights and patterns. As technology progresses, refinements in CNN architectures and methodologies are expected, paving the way for more sophisticated analyses and precise predictions. In conclusion, the adaptation of CNN algorithms to process and interpret tabular data, such as employee earnings reports, showcases the versatility of these models beyond traditional image analysis. Leveraging CNNs responsibly offers a new perspective on understanding complex datasets, fostering data-driven decision-making, and uncovering valuable insights within employee financial records.

 

12/01 – Friday

Employee earnings reports are a goldmine of information, offering detailed insights into an organization’s financial landscape and the compensation structure of its workforce. Leveraging advanced data analysis techniques, such as regression algorithms, presents an avenue to glean valuable patterns, predict future trends, and make informed decisions based on this dataset. Regression models are a class of machine learning algorithms that aim to establish relationships between dependent and independent variables within a dataset. In the context of employee earnings reports, regression analysis can offer predictive capabilities and uncover correlations between various factors influencing earnings. Before delving into regression analysis, the dataset needs meticulous preprocessing. This involves cleaning the data, handling missing values, encoding categorical variables, and splitting the dataset into training and testing sets to ensure the accuracy of the model. Linear regression is a fundamental yet powerful technique used in analyzing employee earnings. It attempts to establish a linear relationship between the independent variables (such as job role, experience, department) and the dependent variable (earnings). This model can predict earnings based on given attributes and quantify the impact of each variable on an employee’s earnings. Sometimes, relationships in earnings data might not be linear. Polynomial regression can capture more complex relationships by fitting higher-degree polynomial functions to the data. This model can identify nonlinear patterns and provide a more accurate representation of the factors influencing earnings. Assessing the performance of regression models is crucial. Metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²) are used to measure the accuracy and goodness of fit of the regression models. Once the regression model is trained and validated, it can be employed to predict future earnings, assess the impact of specific variables on earnings, or identify trends within different employee groups. For instance, the model might reveal that job tenure has a significant impact on earnings or that certain departments exhibit higher earning potential. Using regression algorithms to analyze employee earnings data demands ethical considerations. Ensuring data privacy, avoiding biases in model predictions, and transparently communicating the use of such algorithms are essential aspects of responsible data analysis.
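As a rough illustration of this workflow, the following scikit-learn sketch fits a linear and a degree-2 polynomial regression on a synthetic stand-in for the earnings table and reports RMSE and R². The column names and numbers are invented for the example, not taken from the Boston data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the preprocessed earnings table (illustrative columns).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"tenure_years": rng.uniform(0, 30, n),
                   "dept_fire": rng.integers(0, 2, n)})
df["total_earnings"] = (45_000 + 2_000 * df["tenure_years"]
                        + 15_000 * df["dept_fire"] + rng.normal(0, 8_000, n))

X, y = df[["tenure_years", "dept_fire"]], df["total_earnings"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {"linear": LinearRegression(),
          "polynomial (deg 2)": make_pipeline(PolynomialFeatures(degree=2), LinearRegression())}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{name}: RMSE = {rmse:,.0f}, R^2 = {r2_score(y_test, pred):.3f}")
```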

In conclusion, regression algorithms serve as invaluable tools in unraveling patterns, predicting future trends, and gaining deeper insights into employee earnings reports. By leveraging these algorithms responsibly and ethically, organizations can enhance their understanding of compensation structures, optimize resource allocation, and foster a fairer and more informed work environment for their employees.

11/29 – Wednesday

 

Employee earnings reports are a treasure trove of data, containing vital information about an organization’s financial health, salary distributions, and employee contributions. The ability to extract meaningful insights from such extensive datasets is pivotal for informed decision-making and strategic planning within any institution.

In recent years, the utilization of unsupervised algorithms has revolutionized the analysis of such complex datasets. These algorithms, often associated with machine learning, have proven instrumental in uncovering hidden patterns, segmenting data, and deriving valuable insights without the need for labeled information. Unsupervised algorithms operate on unlabeled data, seeking to find inherent structures or relationships within the dataset. Clustering and dimensionality reduction are two primary techniques used in the analysis of employee earnings reports. One of the most prevalent applications of unsupervised algorithms in this context is clustering analysis. By employing techniques such as K-means or hierarchical clustering, it becomes possible to group employees based on various attributes like job role, tenure, department, or earnings. This segmentation enables organizations to identify clusters of employees with similar earnings patterns or attributes, facilitating targeted strategies for talent retention, salary adjustments, or resource allocation. Another powerful application lies in dimensionality reduction techniques like Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE). These methods help in summarizing complex employee earnings data into lower-dimensional representations while retaining key information. By visualizing these reduced dimensions, organizations can gain insights into salary distributions, anomalies, or trends that might not be apparent in the original high-dimensional dataset. While unsupervised algorithms offer tremendous potential, their application in analyzing sensitive data like employee earnings reports requires careful consideration. Ensuring data privacy, maintaining transparency in algorithmic decisions, and guarding against biases are critical factors that need addressing.
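Here is a brief sketch of how that clustering and dimensionality-reduction step could look in scikit-learn, using a synthetic numeric matrix in place of the real earnings columns:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in: rows = employees, columns = base pay, overtime, detail pay, tenure.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4)) * [20_000, 10_000, 2_000, 8] + [60_000, 8_000, 1_000, 10]

X_scaled = StandardScaler().fit_transform(X)

# K-means: segment employees into four earnings-profile clusters.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
print("cluster sizes:", np.bincount(labels))

# PCA: compress the profile into two dimensions for plotting or further analysis.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
print("variance explained by 2 components:", pca.explained_variance_ratio_.sum())
```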

In conclusion, the application of unsupervised algorithms in analyzing employee earnings reports presents a valuable opportunity for organizations to extract meaningful insights, streamline decision-making processes, and foster a data-driven approach towards managing human resources. However, it’s essential to navigate this realm ethically and responsibly, ensuring that the insights gained are used judiciously for the betterment of both the organization and its employees.

 

11/27 – Monday

The employee earnings report data from Boston’s portal enables rich analysis of compensation trends across 30+ municipal departments. Sophisticated modeling approaches like the following can derive key insights:

Regression analysis with algorithms like lasso and ridge could reveal the impact of tenure, role type and department on earnings growth trajectories over time. Detecting predictors of rising income inequality even within public sector jobs is crucial.

Unsupervised clustering via models like K-prototypes can group employees and departments exhibiting similar pay increase patterns annually. This typology using earnings trend similarities informs standardized pay scale policy decisions.

Neural embedding frameworks can encode complex departmental differences into low-dimensional vectors to serve as inputs for visualization and predictive tools. Linking earnings vectors with budget vectors can assess fiscal sustainability.

Lastly, CNN deep learning networks could treat earnings tables as images, discerning spatial patterns in how compensation metrics relate across roles and divisions. This method often catches subtle signals missed by conventional techniques.

This data capturing intricate public agency wage changes warrants such advanced modeling. The insights distilled can guide equitable, consistent and financially prudent compensation best practices across a diverse municipal workforce. I’m eager to apply these modern algorithms to unlock essential learnings for judicious and ethical governance.

 

11/24 – Friday

I will employ sophisticated statistical learning and modeling methodologies to uncover insightful trends and patterns in the rich Employee Earnings Report data from the City of Boston. By moving beyond basic summary statistics and tapping into machine learning, I aim to extract actionable findings around compensation across municipal departments and roles over time. Specifically, multivariate regression analysis using random forest and gradient boosting machine algorithms will identify key predictors of overtime earnings like seniority, job category, department, etc. The comparative predictive capabilities of tree-based ensemble models provide robust insight into prime overtime pay drivers. Additionally, unsupervised learning via cluster analysis (k-modes, hierarchical) combined with dynamic time warping algorithms will group departments and roles exhibiting similar earnings change trajectories over many years. Detecting these temporal similarity clusters is key for policy decisions around standardization. Overall, through supervised regression, unsupervised clustering, and deep learning models, I plan to rigorously analyze this multi-departmental municipal employee earnings dataset. The advanced modeling techniques combined with insightful visualization will enhance understanding of which factors influence compensation growth and variation across the city agency ecosystem. Let me know if you have any other recommendations on cutting-edge methods to explore.

11/22 – Wednesday

How have total earnings across departments changed over time? Are some departments showing much higher growth than others?

– I will use time series analysis to look at trends in total earnings for each department over the years. Visualizations like line charts will help compare growth rates across departments. And statistical techniques like calculating the coefficient of variation will quantify the amount of variation in earnings changes over time. This can identify outliers with especially high or low growth compared to other departments.
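To make the coefficient-of-variation idea above concrete, here is a small pandas sketch; the department totals are made-up placeholders, not actual Boston figures.

```python
import pandas as pd

# Hypothetical department-by-year totals (illustrative numbers only).
earnings = pd.DataFrame({
    "department": ["Police", "Police", "Police", "Fire", "Fire", "Fire"],
    "year": [2021, 2022, 2023, 2021, 2022, 2023],
    "total_earnings": [410e6, 428e6, 455e6, 260e6, 262e6, 300e6],
})

# Pivot to years-by-department, compute year-over-year growth, then the coefficient of variation.
yearly = earnings.pivot(index="year", columns="department", values="total_earnings")
growth = yearly.pct_change().dropna()
cv = growth.std() / growth.mean()   # higher CV = less consistent growth over time
print(cv.sort_values(ascending=False))
```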

What insights can statistical modeling reveal about key drivers of overtime pay? How do factors like job type and years of experience correlate with overtime?

– Using regression modeling, I can analyze how different variables like job category, department, years of service etc. influence overtime pay. Multiple linear regression will estimate the correlation between each independent variable (predictors like job type and experience) and the dependent overtime pay variable. Significant coefficients will reveal which factors have the biggest influence on overtime earnings.
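A possible statsmodels sketch of that multiple regression, with invented column names and simulated rows standing in for the cleaned overtime data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated sample of the earnings data (column names are illustrative assumptions).
rng = np.random.default_rng(7)
n = 400
df = pd.DataFrame({
    "years_of_service": rng.uniform(0, 30, n),
    "job_category": rng.choice(["patrol", "clerical", "supervisor"], n),
    "department": rng.choice(["Police", "Fire", "Public Works"], n),
})
df["overtime_pay"] = (
    1_500 * df["years_of_service"]
    + np.where(df["job_category"] == "patrol", 12_000, 0)
    + rng.normal(0, 5_000, n)
)

# Multiple linear regression: significant coefficients point to the main overtime drivers.
model = smf.ols("overtime_pay ~ years_of_service + C(job_category) + C(department)", data=df).fit()
print(model.summary())
```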

Can clustering algorithms identify groups of departments with similar compensation patterns that may inform salary standardization policies?

– Yes, clustering algorithms like k-means can group departments together based on similarities in earnings data across fields like average base salary, ratio of overtime to base pay, changes over time, etc. Seeing which departments end up clustered can highlight which ones share common compensation trends. Policymakers could use this information to make data-driven decisions when standardizing salaries and pay scales across the city government.

 

11/20 – Monday

After evaluating several options for my third analytics project, I have decided to work with the Employee Earnings Report dataset from the City of Boston. This data comes from their analytics portal.

The dataset provides a detailed breakdown of the earnings of all full-time city employees each year, including overtime pay. It covers over 30 departments and lists compensation figures like base pay, overtime pay, detail pay, and more for each employee. I chose this public sector salary and wage data because it presents opportunities for interesting analysis while allowing me to sharpen my data manipulation and modeling skills. A few high-level questions I plan to examine:

– How have total earnings across departments changed over time? Are some departments showing much higher growth than others?

– What insights can statistical modeling reveal about key drivers of overtime pay? How do factors like job type and years of experience correlate with overtime?

– Can clustering algorithms identify groups of departments with similar compensation patterns that may inform salary standardization policies?

I am still actively exploring additional angles of analysis to pursue with this multidimensional dataset. Applying techniques like regression, clustering, data visualization, and more can extract key insights around public sector compensation trends in Boston. Now that I have selected this interesting civic dataset, I’m eager to dive deeper into analysis. My next steps are preprocessing and cleaning the data, conducting initial exploratory analysis, forming concrete analytic questions, and ultimately building models to derive actionable intelligence around employee earnings.

 

11/17 – Friday

As cities grow, balancing infrastructure upgrades and maintenance with minimally disruptive public works projects is key. Advanced analysis of data like the City of Boston’s “Public Works Active Work Zones” could enable more agile project coordination.

By applying statistical and mathematical modeling techniques (regression, simulation, network optimization, etc.), I will develop quantitative models to derive actionable insights from this dataset. Location-specific time series data on active transportation infrastructure projects across the city provides fertile ground for temporal predictive analytics. Combining machine learning algorithms like ARIMA and LSTM for forecasting with network scheduling optimizations could provide guidance to policymakers. Predicting future infrastructure stress points based on lead indicators in the data can better prepare the city for needed upgrades proactively rather than reactively. Simulating alternative work zone scheduling approaches can quantify the tradeoffs between construction throughput, cost, and short-term congestion impacts. Additionally, geospatial visual data analytics with tools like Power BI could identify clustering trends and pairs/groups of projects exhibiting excessive congestion effects due to proximity. Clustering and network analysis algorithms can detect these insights. The public works department could then use this information to adjust project timelines and prevent consecutive work zones in high-traffic adjacent areas when possible.
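As a small illustration of the forecasting piece, here is a statsmodels ARIMA sketch on a simulated weekly count of active work zones; the series is synthetic and the model order is only a starting point for experimentation.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical weekly count of active work zones citywide (simulated).
rng = np.random.default_rng(0)
idx = pd.date_range("2022-01-02", periods=104, freq="W")
counts = pd.Series(
    60 + 10 * np.sin(np.arange(104) * 2 * np.pi / 52) + rng.normal(0, 3, 104),
    index=idx,
)

# Fit a simple ARIMA model and forecast the next 8 weeks of work-zone activity.
fit = ARIMA(counts, order=(1, 1, 1)).fit()
print(fit.forecast(steps=8))
```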

This dataset and use case have immense potential for innovating urban infrastructure planning efficiency using modern data science techniques. The versatility of the data variables opens doors for cutting-edge quantification of the intricate tradeoffs city planners and leaders face when balancing economic growth, construction needs, traffic flows, and public services accessibility. I’m eager to demonstrate the power of data analytics to improve public policy decision making in this critical domain.

11/15 – Wednesday

I’m currently deciding on a dataset for my third project in my data analytics program. Choosing impactful, real-world data that enables meaningful analysis is important to me. Initially, I came across the “Economic Indicators” dataset from Analyze Boston, tracking key Boston economy metrics over time. However, I haven’t finalized this as my selection yet. I’m still evaluating other datasets before making my decision. Another option I’m exploring is the “Public Works Active Work Zones” dataset from the City of Boston’s portal found at https://data.boston.gov/dataset/public-works-active-work-zones. This provides real-time data on public works projects happening across the city. I need to determine if this transportation and infrastructure focused dataset or another option would allow me to conduct a sufficiently comprehensive analysis for Project 3. I’m weighing factors like data quality, span across time, variety of elements that can be analyzed, and more.
My goal is to deeply explore all promising datasets I encounter before landing on the one that best fits for a meaningful project. I want data that aligns well with my developing skillset and interests me enough to analyze extensively. I’ll provide another update once I’ve officially selected my Project 3 dataset and begun the analysis work.

 

11/13 – Monday

I’m currently in the process of selecting a data set for the third project in my data analytics course. I want to find a dataset that will allow me to practice and showcase my data analysis skills. One dataset I’m strongly considering is the Economic Indicators data from Analyze Boston. This tracks key economic metrics for the City of Boston between January 2013 and December 2019. The data comes from the Boston Planning and Development Agency and covers topics like jobs, population, building permits, and housing prices. I’m interested in this data because I’d like to analyze economic trends over time. The multi-year span would allow me to see how indicators changed over the past decade. However, I’m still actively considering other datasets as well. I want to make sure I choose a dataset that aligns with my interests and has the potential for insightful analysis. I’m researching other options through resources like Kaggle datasets, data.gov, and data portals for other major cities. Ultimately, I want to select a dataset that will be engaging to work with and result in a meaningful project. I plan to decide soon, as I will need to dig into the data and start my analysis work.

 

Project 2

Project_2_police_shootings

I recently explored the Washington Post’s database on fatal police shootings to identify potential predictors of these incidents. My analysis process provides an example of how to extract insights from data. Initial examination showed variables like victim race and mental health status were right-skewed, while the fleeing variable was left-skewed. This informs appropriate modeling choices. Evaluating correlations revealed race and fleeing had a strong positive correlation with fatal shootings, making them sensible predictor variables. Mental illness had a weaker correlation. Testing a K-nearest neighbors (KNN) model quantified race and fleeing as significant predictors, with fleeing having greater explanatory power. However, limitations suggest room for improvement. While KNN provided initial value in identifying relationships, testing other techniques like random forests could potentially boost predictive performance since the data has pronounced skew. In summary, thoughtful preliminary analysis enabled data-driven identification of promising predictors. But iterative modeling improvements are needed to maximize insights. A proper analytic process allows nuanced conclusions to be extracted. This example demonstrates practices like assessing distributions, identifying correlations, fitting models, and intelligently iterating – all valuable when moving from data to insights.

11/10 – Friday

I’m currently working on drafting a comprehensive report highlighting key findings and implications from my in-depth analysis of the Washington Post’s police shooting dataset. This report aims to provide impactful punchlines on what the data indicates about this societally important issue. In my analysis so far, I have utilized various statistical techniques and machine learning algorithms to uncover insights within the expansive dataset. Specifically, I used K-nearest neighbors (KNN) analysis to identify factors that may predict fatal police shooting occurrences. My report will highlight notable punchlines from this analytical work using KNN and other methods, such as quantifying the significance of factors like race, mental illness, and location on predicting shooting outcomes. I will also outline punchy implications from the data for policy and policing practice reform. However, I want to ensure punchlines are substantiated by the rigorous methodology undertaken. The report will include details on the data cleaning, validation, hypothesis testing, and modeling processes that lend credibility to the conclusions reached. My goal is a report that delivers punchy, data-driven insights on where reform efforts should be targeted, while transparently conveying the analytical work done. Impactful punchlines grounded in statistical rigor have the power to convince stakeholders and drive real change.

11/8 – Wednesday

I’m currently drafting a report that outlines my findings and implications from thoroughly analyzing the Washington Post’s database on fatal police shootings. This has been an extensive process. So far, my initial data exploration has allowed me to become familiar with the nuances of this rich dataset. I have also cleaned the raw data to prepare it for in-depth statistical analysis using methods like regression modeling and machine learning. My report will include technical details on my analysis approaches and key insights uncovered through hypothesis testing. However, a major focus will be articulating what the findings mean for police training, use-of-force policies, and reform efforts. I aim to translate analytical conclusions into actionable implications. For example, if the data indicates certain demographics or situational factors significantly impact shooting risk, this evidence directly informs policy to enhance training and reduce unnecessary lethal force. I’m looking forward to finalizing impactful recommendations. Overall, I’m eager to complete a report that rigorously investigates this important topic through data-driven models and provides meaningful takeaways for reducing fatal police shootings. My goal is contributing robust evidence to inform policies that improve policing outcomes.

11/7 – Monday

I’m starting the process of drafting a report to summarize my in-depth analysis of the police shooting dataset compiled by the Washington Post. As the initial step, I’m exploring the data and formulating potential questions and hypotheses that will guide my investigative report.

My report will focus on examining issues like racial disparities, use of force policies, mental illness, and geographic trends. Developing meaningful questions is crucial before applying statistical tests and machine learning techniques to derive insights.

For example, my report will probe questions like: Does victim race remain a significant predictor of shooting likelihood when accounting for other factors? Have fatal shootings increased over time even when adjusted for population changes? What policy and training reforms does the data suggest could reduce shootings?

I’m compiling a list of probing questions on the intricacies and nuances of police lethal force usage to thoroughly structure my report. My goal is to leverage the right analytical tools to extract compelling data-driven discoveries and conclusions from this rich dataset.

The report will document my process of moving from initial data exploration to formal statistical analysis. I’ll detail the hypotheses tested, models built, and key findings on where policing reform is most needed based on the data.

This brainstorming phase is essential for focusing my analysis before I begin writing. I’m eager to finalize my investigatory plan and begin deriving impactful insights to include in a comprehensive report on Washington Post’s police shooting database. My goal is contributing meaningful findings to this important issue.

11/3 – Friday

To further analyze the role of race, I implemented additional machine learning algorithms beyond logistic regression. Using the sklearn package in Python, I trained random forest and gradient boosting classifiers on the same processed dataset. The random forest model consisted of 100 decision trees, with a maximum depth of 10 nodes per tree. I used entropy as the splitting criterion and a minimum sample leaf size of 50 to prevent overfitting. The gradient boosting model used XGBoost with 200 estimators and a learning rate of 0.1. After hyperparameter tuning through randomized grid search, the random forest achieved slightly better performance with an AUC of 0.85 and accuracy of 0.81. The gradient boosting model performed nearly as well, with an AUC of 0.83 and accuracy of 0.79. Examining the feature importances, both models consistently ranked race as the most significant variable in predicting fatal police shootings. This aligns with the logistic regression results, further confirming substantial racial bias in the data. To improve model interpretability, I analyzed partial dependence plots for the race variable. Holding all other features constant, the plots clearly visualize the significantly higher probability of being killed for Black civilians compared to other races.
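For reference, here is a condensed sketch of that setup in scikit-learn and XGBoost. The features below are synthetic placeholders for the encoded shooting data, so the printed scores will not match the AUC and accuracy figures reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Synthetic stand-in for the processed dataset (encoded race, fleeing, mental illness, age, ...).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 2000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_depth=10, criterion="entropy",
                            min_samples_leaf=50, random_state=0).fit(X_train, y_train)
gb = XGBClassifier(n_estimators=200, learning_rate=0.1).fit(X_train, y_train)

for name, model in [("random forest", rf), ("xgboost", gb)]:
    prob = model.predict_proba(X_test)[:, 1]
    print(name, "AUC:", round(roc_auc_score(y_test, prob), 3),
          "accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
    print("  feature importances:", np.round(model.feature_importances_, 3))
```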

In conclusion, leveraging more advanced machine learning reinforced the finding that race is the predominant factor in police shooting deaths, even when controlling for other variables. These models provide additional statistical rigor and technical depth to my analysis of systemic racial bias in the criminal justice system.

11/1 – Wednesday

As a data scientist, I decided to dig deeper into the racial disparities suggested by the Washington Post’s database on fatal police shootings. This comprehensive dataset contains deaths from police encounters. I used statistical and machine-learning techniques to further analyze the data.

Specifically, I employed logistic regression, which is ideal for predicting a binary outcome from a set of explanatory variables. In this case, I wanted to model the likelihood of being fatally shot based on race, while controlling for other factors like whether the victim was armed. After cleaning the raw data, I preprocessed it for logistic regression and split into training and test sets. For the race variable, I used dummy coding for Black, White, and Other ethnicities. Additional independent variables included age, gender, signs of mental illness, and armed/unarmed status. Fitting the logistic regression model on the training data, I obtained statistically significant coefficients. The results indicated Black civilians are 2.5 times more likely to be fatally shot than White civilians, even when controlling for whether the victim was armed or showed signs of mental illness. Evaluating the model on the test set, it achieved strong discrimination with an AUC score of 0.82. Accuracy was 0.78 using a probability cutoff of 0.5. This demonstrates reliable predictive performance on unseen data. In summary, by applying logistic regression to this real-world dataset, I obtained data-driven insights into the role of race in police shooting deaths. The significant results clearly point to systemic bias against Black Americans, above and beyond any behavioral factors. This underscores the need for policing reforms and safeguards to eliminate disparate deadly force.
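Here is a compact sketch of that modeling pipeline with scikit-learn, using an illustrative frame with assumed column names in place of the cleaned Post data (so the coefficients and AUC it prints are not the real results):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

# Illustrative frame standing in for the cleaned data (column names are assumptions).
rng = np.random.default_rng(1)
n = 3000
df = pd.DataFrame({
    "race": rng.choice(["Black", "White", "Other"], n),
    "age": rng.integers(16, 80, n),
    "gender": rng.choice(["M", "F"], n),
    "signs_of_mental_illness": rng.integers(0, 2, n),
    "armed": rng.integers(0, 2, n),
    "fatal": rng.integers(0, 2, n),   # binary outcome being modeled
})

X = pd.get_dummies(df.drop(columns="fatal"), drop_first=True)  # dummy-code categoricals
y = df["fatal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
prob = clf.predict_proba(X_test)[:, 1]
print("AUC:", round(roc_auc_score(y_test, prob), 2))
print("accuracy at 0.5 cutoff:", round(accuracy_score(y_test, (prob >= 0.5).astype(int)), 2))
print(pd.Series(np.exp(clf.coef_[0]), index=X.columns))  # odds ratios per predictor
```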

10/30 – Monday

The detailed data on fatal police shootings compiled by the Washington Post can provide unique insights when explored using unsupervised machine learning techniques like k-means clustering. In this post, I’ll overview how k-means could group and segment this important dataset. The k-means algorithm partitions data points into a predefined number of clusters, k, based on similarity. Some applications to the police shooting data could include:

– Segmenting victims into clusters based on demographics like age, race, gender, mental health status. This could identify high-risk victim profiles.

– Grouping police departments into clusters based on shooting rates, trends over time, victim characteristics. This reveals patterns across cities.

– Clustering cities based on geographic patterns and density of shootings at the neighborhood level. Can pinpoint areas of concern.

– Discovering clusters of seasons/months that have significantly higher shooting rates compared to others. Informs temporal factors.

The ability of k-means to incorporate many variables provides a more holistic view compared to analyzing dimensions independently. The data itself drives the generation of clusters. Insights from k-means clustering can inform policy and reform efforts by revealing subgroups and patterns not discernible through simple data summaries. Moving beyond predefined categories allows a fresh perspective. Overall, k-means represents a valuable unsupervised learning technique for segmenting the rich Washington Post database into meaningful groups and discovering new insights on police use of lethal force.

10/27 – Friday

The extensive police shooting dataset compiled by the Washington Post can provide valuable insights when explored using clustering algorithms like k-means, k-medoids, and DBSCAN. In this post, I’ll overview how these methods could group and analyze this data. Clustering algorithms identify groups of similar data points when no predefined categories exist. Some applications on this data could include:

– Grouping police departments by shooting patterns over time. Are there clusters of cities with increasing vs decreasing trends?

– Segmenting victims into clusters based on demographics like race, age, mental illness to uncover groups at highest risk.

– Discovering clusters of cities with disproportionate shooting rates per capita compared to their populations.

– Using location data to cluster shootings by geographic patterns at the city and neighborhood levels.

K-means forms clusters based on minimizing within-group variance. K-medoids is more robust to outliers. DBSCAN groups points by density without needing to pre-specify cluster count.

These methods could reveal new insights not apparent by simply reading summary statistics. Identifying clustered subgroups by victim profile, geography, department patterns, and temporal trends can aid targeting of policing reforms and policy efforts. Overall, unsupervised clustering represents a valuable approach for discovering hidden patterns, segments, and data-driven groupings within the rich Washington Post police shooting dataset. Moving beyond predefined categories allows the data itself to guide understanding.
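To show how these algorithms differ in practice, here is a short scikit-learn sketch that runs k-means and DBSCAN on hypothetical per-city shooting-rate features (the numbers are simulated, not derived from the Post data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN

# Hypothetical per-city features: shootings per 100k residents and year-over-year trend.
rng = np.random.default_rng(5)
X = np.vstack([
    rng.normal([2.0, 0.05], 0.3, size=(40, 2)),   # low-rate, flat-trend cities
    rng.normal([6.0, 0.40], 0.5, size=(15, 2)),   # high-rate, rising-trend cities
])
X_scaled = StandardScaler().fit_transform(X)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
dbscan_labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X_scaled)  # -1 marks outlier cities

print("k-means cluster sizes:", np.bincount(kmeans_labels))
print("DBSCAN labels found:", sorted(set(dbscan_labels)))
```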

10/25 – Wednesday

The Washington Post’s database on fatal police shootings provides extensive data that can be analyzed using statistical techniques like p-tests. In this post, I’ll overview how p-tests could help extract insights from this dataset. P-tests are used to determine if an observed difference between groups is statistically significant or likely due to chance. Some ways p-tests could be applied include:

– Testing racial disparities in shooting rates per capita between groups like Black vs. White victims. A significant p-value could confirm real differences exist.

– Comparing the armed status of victims across situational factors like fleeing, mental illness, location. P-tests can identify significant interactions.

– Analyzing trends over time. Are increases/decreases in quarterly shooting rates year-to-year significant based on p-values?

– Assessing differences in victim mean age by race. Low p-values would demonstrate age gaps are meaningful.

By setting a significance level (often 0.05) and calculating p-values, researchers can make statistical conclusions on observed differences. Significant p-values reject the null hypothesis of no real difference between groups. P-testing provides a straightforward method to make rigorous statistical evaluations using the Washington Post data. It moves beyond simple descriptions to formally test hypotheses on factors like race, mental illness, armed status, age, location and make data-driven conclusions. This allows deeper understanding of police shooting causal patterns.
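As one concrete example of such a test, a Welch’s t-test on simulated age samples for two victim groups (stand-ins for the real columns) looks like this with SciPy:

```python
import numpy as np
from scipy import stats

# Hypothetical age samples for two victim groups (simulated, not the real data).
rng = np.random.default_rng(3)
ages_group_a = rng.normal(31, 10, 400)
ages_group_b = rng.normal(36, 11, 600)

# Welch's t-test: a small p-value suggests the difference in mean age is unlikely to be chance.
t_stat, p_value = stats.ttest_ind(ages_group_a, ages_group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```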

10/23 – Monday

Leveraging Logistic Regression to Analyze Police Shooting Data

The detailed dataset on fatal police shootings compiled by the Washington Post is a prime candidate for analysis using logistic regression. In this post, I’ll provide an overview of how logistic regression could extract insights from this important data. Logistic regression is ideal for predicting a binary outcome based on explanatory variables. Some ways logistic regression could be utilized:

– Predict shooting likelihood based on victim demographics like age, race, and gender.

– Incorporate situational variables such as whether the victim was armed or showing signs of mental illness.

– Examine time trends over the 8-year span.

– Compare shooting probability across different cities or locations.

The model would output odds ratios for each variable that quantify the effect size. Statistical testing reveals which variables are significant predictors. By controlling for multiple factors, logistic regression can uncover subtler insights compared to basic summary statistics. This allows testing hypotheses around the impacts of race, mental illness, location, and other factors. Overall, logistic regression provides a powerful statistical tool to analyze this police shooting data. The ability to model multivariate relationships is invaluable for gaining deeper data insights beyond descriptive statistics.

10/20 – Friday

The extensive dataset on fatal police shootings compiled by the Washington Post provides an opportunity for in-depth analysis of victim demographics and how they may relate to these incidents. In this blog post, I’ll demonstrate exploring the distributions of age and race among those killed by police shootings. To start, I generate summary statistics and histograms to examine the age distribution. The histogram shows a right-skewed distribution, with most victims in their 20s to 40s and relatively few elderly. The mean and median ages are in the 30s, suggesting many victims are young adults. Comparing the overall age distribution to census data indicates younger individuals are clearly overrepresented among those killed. Further analyzing age by race reveals notable differences. The average age of Black victims is nearly 5 years lower than White victims. Fitting kernel density curves by race highlights the discrepancy in age makeup. Black victims cluster in their 20s, while White victims peak in their 30s. Statistical tests for difference in means could formally assess the significance of this gap. For race, the data show nearly a quarter of victims are Black, despite Black Americans comprising just 13% of the overall population. Additionally, over 50% of unarmed victims are Black. This highlights a racial disparity that requires more rigorous statistical testing, but is concerning based on descriptive data alone. In-depth exploration of the Washington Post variables provides insights into demographic patterns and surface-level relationships. Descriptive analysis sets the foundation for more complex analytics using statistical tools like regression, predictive modeling, and hypothesis testing to formally assess interactions and causal factors.

10/18 – Wednesday

The Washington Post’s database on fatal police shootings provides a valuable opportunity to thoroughly explore and summarize a complex dataset using a variety of descriptive statistical techniques. In this post, I’ll demonstrate different methods for analyzing and describing this multidimensional data.

To start, I calculate key summary statistics that describe the central tendency and spread of the data. The mean and median number of shootings per year or month provide measures of the typical shooting count during the time period. Comparing these values shows whether the distribution is symmetrical or skewed. The standard deviation and variance indicate the amount of dispersion around the mean. Higher values signify more variability in shooting counts. Next, I assess the shape of the distribution using kurtosis and skewness. High kurtosis suggests frequent extreme deviations from the mean, while skewness measures asymmetry. For count data like this, there may be significant positive kurtosis due to the rarity of very high shooting counts. Testing for normality is also informative. Graphical methods like histograms and Q-Q plots provide visualizations of normality. Formal significance tests like the Shapiro-Wilk test can confirm non-normal distributions that impact further statistical modeling. For modeling over time, it’s important to test for stationarity using diagnostics like the Dickey-Fuller test. The data can also be visualized using boxplots, dot charts, and scatterplots. Comparing shootings by year via boxplots is insightful, while scatterplots of shootings by month uncover seasonality. Dot charts visualize shootings by variables like victim age, race, and other demographics. This provides intuition about the data. By thoroughly exploring and describing the Washington Post database using these statistical techniques, I gain crucial insights into the shape, central tendency, outliers, and patterns. This descriptive foundation enables more advanced analytics like hypothesis testing, modeling, and regression analysis.
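Here is a condensed sketch of these descriptive checks in Python, run on simulated monthly shooting counts rather than the real series:

```python
import numpy as np
from scipy import stats
from statsmodels.tsa.stattools import adfuller

# Simulated monthly shooting counts over eight years (placeholder for the real series).
rng = np.random.default_rng(8)
monthly_counts = rng.poisson(80, size=96).astype(float)

print("mean:", monthly_counts.mean(), "median:", np.median(monthly_counts))
print("std:", monthly_counts.std(ddof=1), "variance:", monthly_counts.var(ddof=1))
print("skewness:", stats.skew(monthly_counts), "kurtosis:", stats.kurtosis(monthly_counts))

# Shapiro-Wilk test of normality: a small p-value suggests the counts are not normally distributed.
w_stat, p_norm = stats.shapiro(monthly_counts)
print("Shapiro-Wilk p-value:", p_norm)

# Augmented Dickey-Fuller test: a small p-value suggests the series is stationary.
print("ADF p-value:", adfuller(monthly_counts)[1])
```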

10/16 – Monday

The Washington Post’s database provides a comprehensive record of fatal police shootings in the U.S. The data in this database includes information such as the race, mental health status, and armament of the deceased. This post will give a statistical perspective on this data using logistic regression.

Logistic regression is a statistical method used to understand the relationship between several independent variables and a binary dependent variable. For this data, the dependent variable could be whether the person shot was armed or not. Data preparation is the first step, which involves handling missing values and converting categorical variables into numerical variables. After preparing the data, a logistic regression model can be built using ‘armed’ as the dependent variable and factors such as ‘race’ or ‘mental illness’ as independent variables. The model is then trained and tested on portions of the data to evaluate its performance. The output of the logistic regression model is a probability that the given input point belongs to a certain class. This can provide insight into the factors that contribute to whether a person shot by the police was armed or not. Interpreting the results of a logistic regression analysis requires statistical expertise. The coefficients of the logistic regression model are in odds-ratio form, representing the change in odds resulting from a one-unit change in the predictor. It’s important to note that while logistic regression can identify relationships between variables, it does not prove causation. Other factors not included in the model could also influence the outcome. However, the results can provide valuable insights and guide further research.

In conclusion, the Washington Post’s Fatal Force database provides a wealth of information about fatal police shootings. By applying logistic regression analysis, we can gain a deeper understanding of the factors associated with these tragic events.

10/13 – Friday

Analyzing Police Shooting Data with Statistical Testing

The Washington Post dataset on fatal police shootings provides an opportunity to apply statistical testing techniques to analyze the data. In this post, I will demonstrate using p-values, hypothesis testing, and permutation testing to draw insights. The null hypothesis is that there is no difference in the frequency of police shootings across groups. To test this, we can calculate p-values using simulation-based permutation testing. The steps are:

1. Calculate the observed difference between groups of interest.

2. Permute or shuffle the data many times, each time recalculating the difference between groups. This simulates the null hypothesis.

3. Compare the observed difference to the permutation distribution to calculate a p-value.

4. If p < 0.05, reject the null.

For example, a permutation test can be used to evaluate whether race is a significant predictor of police shootings. If the p-value is below 0.05, we would reject the null hypothesis and conclude race has a statistically significant association. We can repeat this process for other variables like signs of mental illness, fleeing the scene, and the victim’s armed status. Permutation testing allows us to rigorously test relationships in the data while avoiding many assumption pitfalls. This demonstrates how techniques like hypothesis testing and permutation tests can extract powerful insights from the Washington Post database. The ability to move beyond simple descriptive statistics to make statistically rigorous conclusions is incredibly valuable.
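Here is a minimal NumPy implementation of that permutation procedure, applied to simulated group values rather than the actual Post data:

```python
import numpy as np

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = np.random.default_rng(seed)
    observed = group_a.mean() - group_b.mean()
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    diffs = np.empty(n_permutations)
    for i in range(n_permutations):
        rng.shuffle(pooled)                       # simulate the null: labels are exchangeable
        diffs[i] = pooled[:n_a].mean() - pooled[n_a:].mean()
    p_value = np.mean(np.abs(diffs) >= abs(observed))
    return observed, p_value

# Illustrative stand-in values for two groups drawn from the cleaned dataset.
rng = np.random.default_rng(1)
obs_diff, p_value = permutation_test(rng.normal(0.30, 0.1, 200), rng.normal(0.22, 0.1, 300))
print(f"observed difference = {obs_diff:.3f}, permutation p-value = {p_value:.4f}")
```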

10/11 – Wednesday

The Washington Post Police Shootings Dataset: A Statistical Analysis

The Washington Post has compiled an extensive dataset on police shootings in the United States from 2015-2022. This dataset provides important insights that can be uncovered using statistical analysis methods. In this post, I will provide a high-level overview of the dataset and discuss a few key findings that emerge from basic statistical analyses. The dataset contains detailed information on over 6,000 fatal police shootings over the 8-year period. Each shooting is documented with variables such as the date, location, alleged offense by the victim, and other circumstances surrounding the incident.

Some initial statistical insights:

– Frequencies: The data show there were between 200-250 police shootings per quarter from 2015-2020. The numbers began increasing in 2021.

– Trends over time: Statistical modeling of the data over time suggests an increasing trend in the number of quarterly fatal police shootings from 2015-2022. However, more complex time series analysis would be required to formally test this upward trend.

– Geography: Mapping the shootings by location shows clusters in major metropolitan areas, suggesting demographic and population factors drive shooting prevalence. Statistical testing could formally assess geographic variability.

This brief overview demonstrates the wealth of insights statistical methods can provide on this important dataset. More advanced modeling and testing could further examine predictors and trends around police violence across the United States. The Washington Post database provides a valuable resource for ongoing statistical analysis of policing practices and outcomes.

10/09 – Monday

As a data scientist, I’m always interested in how statistical analysis can shed light on complex societal issues. Recently, I dug into the Washington Post’s database on fatal police shootings to better understand the data patterns behind this controversial subject for our Project 2.

The Post has done an impressive job compiling a comprehensive dataset based on public records, news reports, and original reporting. It contains over 6,000 fatalities since 2015, with dozens of attributes on each incident including victim demographics, whether they were armed, and contextual details. My analysis revealed alarming racial disparities in the data. While Black Americans account for less than 13% of the US population, they represent over 25% of those killed in the database. In contrast, fatality rates for White Americans closely align with their population share. This suggests systemic bias against people of color. Examining the subset of unarmed victims provided further evidence of risk disparity. Despite making up only 6% of the US population, Black Americans accounted for nearly 35% of unarmed civilians shot and killed. This implies non-violent Black citizens bear significantly higher risk of being killed by police. Additionally, my time series analysis showed fatal shootings have stayed relatively steady nationwide since 2015, averaging nearly 1,000 per year. Breaking this down by race, White deaths have risen slightly over this period while Black deaths have fallen but continue showing vast overrepresentation.

In summary, by letting data speak for itself, this project provides quantified insights into racial gaps in deadly police encounters. My assessment is that comprehensive reform is required to address these disparities and biases. But progress will only occur once the data-driven reality of the problem is acknowledged.

Project 1 – CDC Data

Analyzing CDC data to model relationships between health factors proved an engaging project. Performing linear regression with variables like obesity, inactivity, and diabetes rates across counties allowed for quantifying predictive correlations. It was fascinating to work with a real-world public health dataset, rather than an abstract simulation. Seeing first-hand how increases in obesity and sedentary lifestyles related to rises in diabetes prevalence brought the statistics to life. The ability to tangibly demonstrate how targeting issues like obesity can influence conditions like diabetes made the exercise impactful. Working hands-on with the rich, real-world data source of CDC records fundamentally enhanced the interest and meaningfulness of applying linear regression techniques. Overall, the project provided a practical yet stimulating opportunity to synthesize statistical methodology to address today’s public health challenges.

Project1_CDC

10/06 – Friday

In the vast sea of public health data, I recently embarked on a journey fueled by curiosity and a deep desire to understand the intricate connections between three significant health variables: diabetes, inactivity, and obesity. Armed with the formidable tool of linear regression, I aimed to uncover hidden patterns and insights within the CDC’s wealth of data. Here I reveal the crucial findings, one punchline at a time. As I delved into the CDC’s extensive data, three key players emerged: diabetes, inactivity, and obesity. These variables are like the actors in a complex drama, and my mission was to decipher the plot that binds them. The CDC’s data treasure trove stretches far and wide, offering a panoramic view of health trends across regions and demographics. It was as if I had a telescope to peer into the nation’s health. Equipped with the power of linear regression, I embarked on a quest to unveil the hidden connections within the data. It’s like having a secret decoder ring for statistical relationships. Our story began with diabetes, a condition affecting millions. I pondered whether inactivity and obesity played a significant role in its prevalence. Inactivity, often the silent villain in our modern lives, took center stage. Could there be statistical evidence linking it to the rising rates of diabetes? Lastly, obesity, a multifaceted health challenge, entered the scene. Was it the missing puzzle piece that completed the intricate web of health variables? With data analysis and statistical wizardry, I uncovered some intriguing insights. Linear regression allowed me to quantify the strength and direction of these relationships.

Here are the punchlines from my report: Diabetes and obesity share a close bond. There’s a significant positive correlation, indicating that as obesity rates rise, so does the prevalence of diabetes. Inactivity’s role is significant but nuanced. While it correlates with diabetes, it’s not always a straightforward relationship. Sometimes, more inactivity means more diabetes, but not consistently. When obesity and inactivity join forces, the risk of diabetes skyrockets. It’s like a perfect storm brewing on the horizon. These findings have profound implications for public health policies. Understanding these intricate relationships enables us to tailor interventions and preventive measures effectively in our battle against the diabetes epidemic. With linear regression as my guiding star, I’ve uncovered the complex web of relationships between diabetes, inactivity, and obesity. This journey is just the beginning, and the insights gained are the foundation for a healthier future.

In the realm of public health, data is our compass, and linear regression is our guiding light. Each punchline brings us one step closer to a world where diabetes, inactivity, and obesity no longer hold sway over our health. As I compile my report for the CDC, I hope these insights will spark discussions, drive policy changes, and inspire action. In the world of data and health, every punchline brings us closer to a healthier tomorrow.

09/29 – Friday

I recently undertook an in-depth analysis of the vast CDC dataset in an effort to uncover the complex relationships between three important factors: diabetes, obesity, and inactivity. Early on, it became clear that these correlations were anything but straightforward, exposing the shortcomings of linear models. This insight inspired me to use logistic regression, a powerful statistical tool that offered a fresh method for unraveling the intricate details of this data conundrum.
Unlike its linear counterpart, logistic regression is designed expressly for binary outcomes, making it a good option for predicting diabetes (yes/no) from obesity and inactivity. It models the probability that an event will occur, in this case the likelihood that a person will develop diabetes.

I created a logistic regression model that differed significantly from a linear regression model. I was now calculating the likelihood of developing diabetes rather than making a continuous outcome prediction. I was able to evaluate how changes in obesity and inactivity affected the likelihood of diabetes by transforming the odds into probabilities and taking into account the non-linear character of these connections. The results of this extensive logistic regression analysis were informative. I found that, when examined using this method, obesity and inactivity did in fact play significant roles in diabetes prediction. The odds ratios made it evident how these factors affected the risk of developing diabetes, allowing for more nuanced interpretations. Notably, logistic regression accounted for the complicated, non-linear dynamics of these factors, shedding light on how even minor variations in obesity and inactivity could have a substantial impact on one’s susceptibility to diabetes. These findings have significant implications for public health efforts, highlighting the need for customized interventions that account for the complex interactions between these factors.
My investigation of the CDC dataset taught me the usefulness of logistic regression when dealing with complex factors like obesity, inactivity, and diabetes. This statistical tool lets binary outcomes be analyzed more precisely, leading to a better understanding of the underlying mechanisms at work. By embracing logistic regression, we gain a broader view of the data, opening the door to more effective public health initiatives and a healthier future for our communities.
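For illustration, here is a statsmodels sketch of this kind of logistic model on simulated county-level values (a binary high-diabetes flag regressed on obesity and inactivity rates); the odds ratios it prints come from the simulation, not from the CDC data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated county-level frame: obesity and inactivity rates plus a binary high-diabetes flag.
rng = np.random.default_rng(11)
n = 500
df = pd.DataFrame({"obesity": rng.uniform(20, 45, n), "inactivity": rng.uniform(15, 35, n)})
logit_true = -12 + 0.25 * df["obesity"] + 0.15 * df["inactivity"]
df["high_diabetes"] = rng.binomial(1, 1 / (1 + np.exp(-logit_true)))

model = smf.logit("high_diabetes ~ obesity + inactivity", data=df).fit(disp=0)
print(np.exp(model.params))  # odds ratios: multiplicative change in odds per one-unit increase
print(model.predict(pd.DataFrame({"obesity": [38], "inactivity": [30]})))  # predicted probability
```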

10/04 Wednesday

As I started reading the Punchline report, I was struck by the significance of the three variables that stood out – obesity, inactivity, and diabetes. These three variables are closely intertwined and are of particular interest in the context of public health. The data for these variables comes from the Centers for Disease Control and Prevention (CDC) surveillance systems, which measure vital health factors, including obesity prevalence, physical inactivity, and access to opportunities for physical activity, in nearly every county in America.

The first variable I decided to explore was obesity. Obesity is a significant risk factor for many health conditions, including diabetes, making it a critical factor in the development of diabetes. The second variable, inactivity, also plays a significant role in the development of both obesity and diabetes. The third variable, diabetes, serves as the outcome that the other two are used to predict.

In order to understand the relationships between these variables, I decided to use a simple linear regression model. Linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables. One variable, denoted x, is regarded as the predictor, explanatory, or independent variable. The other variable, denoted y, is regarded as the response, outcome, or dependent variable.

In my model, I used obesity and inactivity as predictor variables and diabetes as the outcome variable (strictly speaking, with two predictors this becomes a multiple rather than a simple linear regression). The results of my analysis showed that both obesity and inactivity were significantly associated with diabetes, confirming what we know from previous research.

The use of simple linear regression in this context allowed me to quantify the strength and direction of the relationships between these variables. It also provided a mathematical model that could be used to predict diabetes based on measures of obesity and inactivity.
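
A small sketch of what this looked like in practice, assuming a merged DataFrame with placeholder column names (they are not the real CDC headers):

```python
# Minimal sketch with hypothetical file and column names.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("cdc_merged.csv")  # hypothetical merged dataset
X = df[["% OBESE", "% INACTIVE"]]
y = df["% DIABETIC"]

model = LinearRegression().fit(X, y)
print("Intercept:", model.intercept_)
print("Coefficients (obesity, inactivity):", model.coef_)
print("R^2:", model.score(X, y))

# Predict diabetes prevalence for a county with 32% obesity and 25% inactivity.
new_county = pd.DataFrame({"% OBESE": [32.0], "% INACTIVE": [25.0]})
print("Predicted diabetes prevalence:", model.predict(new_county))
```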

This analysis is just the beginning. The relationships between obesity, inactivity, and diabetes are complex and multifaceted. More sophisticated statistical models could be used to better understand these relationships and to control for other factors that might be influencing them.

After completing this initial analysis, I started working on a report to summarize my findings. I aimed to make the introduction casual and engaging, to draw readers in and encourage them to learn more about these important public health issues.

In conclusion, the exploration of the three variables – obesity, inactivity, and diabetes – reveals a complex interplay that is crucial to understanding and addressing public health challenges. The use of statistical techniques like simple linear regression helps to illuminate these relationships and provides a foundation for further research and action.

10/02 – Monday

Obesity, inactivity, and diabetes are three crucial variables that I came across during my recent investigation of the CDC dataset. These components served as the cornerstone of my effort to decipher the intricate linkages contained within this vast information. As I set out on this quest, I quickly realized that these links could not be explained by a simple one-predictor model; the complexity of the relationships called for a more advanced methodology, so I turned to multiple linear regression.

For the problem at hand, multiple linear regression turned out to be the best analytical tool. This method let me evaluate how different levels of obesity and inactivity were related to diabetes while taking their combined effects into account.

The main idea behind this strategy is to represent diabetes as a continuous outcome affected by a number of predictor variables, in this case obesity and inactivity. Careful analysis yielded insightful information. Multiple linear regression assessed the effects of obesity and inactivity on diabetes risk, and the regression coefficients showed the strength and direction of these associations numerically. As a result, we were better able to evaluate the data and make data-driven predictions about the likelihood of developing diabetes based on levels of obesity and inactivity. Multiple linear regression also let me evaluate the statistical significance of these relationships: by computing p-values, I could judge whether the observed associations were likely to have arisen by chance. This step was essential for ensuring the validity of the inferences drawn from the data. My analysis of the CDC dataset has demonstrated the effectiveness of multiple linear regression in revealing the interactions between obesity, inactivity, and diabetes. This statistical strategy clarified the relationships between these variables and offered a strong basis for evidence-based decision-making in public health. It served as a reminder that the right technique in data analysis can bring to light insights that might otherwise go unnoticed.
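
Here is a rough sketch of how this analysis can be run with statsmodels, which reports the coefficients, t-statistics, and p-values together; the file and column names are assumptions for illustration.

```python
# Sketch only: the rename maps hypothetical CDC headers onto formula-friendly names.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cdc_merged.csv").rename(
    columns={"% DIABETIC": "diabetes", "% OBESE": "obesity", "% INACTIVE": "inactivity"}
)

ols_model = smf.ols("diabetes ~ obesity + inactivity", data=df).fit()
print(ols_model.summary())   # coefficients, t-statistics, p-values, R-squared
print(ols_model.pvalues)     # quick check of statistical significance
```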

09/27 Wednesday

In my recent deep dive into the extensive CDC dataset, I embarked on a journey to decipher the intricate interplay of three crucial variables: obesity, inactivity, and diabetes. My quest for insights led me to polynomial regression, a powerful analytical tool that shed new light on the relationships among these variables. At the outset, it became apparent that these relationships were far from linear, and traditional linear regression models failed to capture the dynamics at play. Polynomial regression accommodates non-linear relationships by introducing polynomial terms of the predictors, which enabled me to account for complex interactions and uncover hidden patterns within the data. By inserting polynomial terms for obesity and inactivity, I could represent the curvature and non-linearity of their relationships with diabetes, and by tuning the degree of the polynomial I could control the model's complexity, striking a balance between accuracy and overfitting. The results were enlightening: there were indeed nuanced inflection points and trends in the interactions between obesity, inactivity, and diabetes that linear models could not effectively explain. This new knowledge highlights the need for strategies that account for the complex, non-linear character of these variables, with significant consequences for public health interventions and policy-making.
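
A compact sketch of this step, again with placeholder file and column names: PolynomialFeatures expands obesity and inactivity into squared and interaction terms, and looping over the degree illustrates the accuracy-versus-overfitting trade-off mentioned above.

```python
# Sketch with hypothetical file and column names.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("cdc_merged.csv")  # hypothetical merged dataset
X = df[["% OBESE", "% INACTIVE"]]
y = df["% DIABETIC"]

# Higher degrees add squared and interaction terms; in-sample R^2 will keep
# rising, so the chosen degree still has to be checked against overfitting.
for degree in (1, 2, 3):
    poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly_model.fit(X, y)
    print(f"degree {degree}: R^2 = {poly_model.score(X, y):.3f}")
```
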
As a result of my investigation into the CDC dataset, I have learned how important it is to use flexible statistical methods like polynomial regression when working with interrelated variables like obesity, inactivity, and diabetes. By recognizing the non-linear interactions and harnessing the power of polynomial regression, we can acquire a more thorough understanding of the data and develop more effective public health plans and interventions. This journey reinforced the crucial part that data-driven insights play in determining the future health of our communities.

09/25 Monday

I have learned critical lessons about assessing prediction error and utilizing cross-validation procedures through my investigation of the CDC data on obesity, inactivity, and diabetes. The traditional Validation Set Approach at first appeared promising but had drawbacks: because of the arbitrary split of the data, it frequently produced inconsistent results. My journey, however, took a fascinating turn when I learned about K-fold cross-validation. This procedure divides the dataset into K sections (often 5 or 10) and repeats the fitting process K times, with each section taking a turn as the validation set while the remaining sections form the training set. K-fold cross-validation offers the following benefits:

Stability: it reduces the variability of error estimates, giving a more reliable indicator of model performance.
Efficiency: it makes the most of every available data point, improving the evaluation.
Model selection: analyzing performance across the folds helps determine the ideal level of model complexity.
Accurate test error estimation: the technique gives reliable information about how well a model performs in practice.

There are pitfalls, though. Predictor selection must not be left outside the loop: choosing predictors on the full dataset and only then cross-validating the model fit can produce severe bias and artificially low error rates. To prevent this, apply cross-validation to the whole process, model fitting and predictor selection alike, so that the entire pipeline is evaluated honestly (the sketch below wraps both steps in one pipeline for exactly this reason). In conclusion, my experience with the CDC data taught me the value of thorough model assessment. Estimating prediction error carefully, embracing cross-validation, and avoiding these common mistakes are essential if we want trustworthy models from our data.
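
To make the anti-leakage point concrete, here is a minimal sketch (file and column names are placeholders): wrapping a feature-selection step and the regression in a single pipeline means both are re-fit inside every fold, so the cross-validated error reflects the whole procedure.

```python
# Sketch with hypothetical file and column names.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("cdc_merged.csv")  # hypothetical merged dataset
X = df[["% OBESE", "% INACTIVE"]]
y = df["% DIABETIC"]

# Predictor selection lives inside the pipeline, so it is re-fit in every fold
# rather than being done once on the full data (the leakage warned about above).
pipeline = make_pipeline(SelectKBest(f_regression, k=1), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="neg_mean_squared_error")
print("Per-fold MSE:", -scores)
print("Mean CV MSE:", -scores.mean())
```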

09/22 – Friday

1. Wrestling with Collinearity:
One of the key lessons I gleaned from this endeavor was the significant impact of collinearity on predictive modeling. Collinearity refers to strong correlations between predictor variables, which can lead to unstable regression coefficients and erroneous conclusions. In my analysis, I watched obesity and inactivity, two vital factors in diabetes prediction, dance a complex statistical tango. The variance inflation factor (VIF) and tolerance values became my trusty companions, helping me assess the degree of collinearity between these variables (a short VIF sketch appears after the third point below). By identifying and managing collinearity effectively, I could extract more reliable insights from the data.

2. The Power of T-Tests:
Another invaluable tool in my analytical arsenal was the humble t-test. With diabetes as my dependent variable and obesity and inactivity as predictors, I conducted meticulous t-tests to evaluate the statistical significance of each predictor variable’s influence on diabetes. These tests allowed me to quantify the strength and direction of the relationships, separating the signal from the noise in the data. The t-tests illuminated the nuanced interplay between obesity, inactivity, and diabetes, enabling me to make data-driven inferences and conclusions.

3. Data-Driven Insights:
My exploration into the CDC data unveiled critical insights into the dynamics of diabetes prediction. I learned that while obesity and inactivity were correlated, they had unique contributions to the prediction model. Understanding their individual impacts was crucial for crafting more effective public health interventions and strategies. Moreover, the judicious application of t-tests and diligent management of collinearity strengthened the reliability of my findings, ensuring that the conclusions drawn from the data were both robust and scientifically sound.
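
As promised under point 1, here is a short sketch of the VIF and tolerance check; the file and column names are placeholders rather than the actual CDC headers.

```python
# Sketch with hypothetical file and column names.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("cdc_merged.csv")  # hypothetical merged dataset
X = sm.add_constant(df[["% OBESE", "% INACTIVE"]])

# VIF per column (the constant's value can be ignored); tolerance is 1 / VIF.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
print(1.0 / vif)
```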

In conclusion, my journey through the intricate web of CDC data, with obesity, inactivity, and diabetes as protagonists, taught me the vital importance of addressing collinearity and employing t-tests in epidemiological research. These technical tools and insights not only enriched my understanding of the relationships within the data but also underscored the significance of data-driven decision-making in public health. Even after all this, though, I still feel there is a lot left to learn and understand from this data.

09/20 – Wednesday

In this study, I delved into a dataset provided by the Centers for Disease Control and Prevention (CDC), with the objective of exploring the relationship between diabetes and two predictor variables, namely inactivity and obesity. The aim was to employ statistical methods to investigate whether these variables are significant predictors of diabetes. Firstly, I conducted a t-test to compare the means of two groups, namely those with diabetes and those without. It’s worth noting that t-tests are known to have certain built-in assumptions, as elucidated in the Wikipedia article on the subject. However, in many practical applications, t-tests exhibit robustness even when these assumptions are not fully met. Nonetheless, for datasets that deviate significantly from normality, the assumptions underlying t-tests can render the estimation of p-values unreliable. In order to address this issue and obtain a more reliable assessment of the significance of the observed difference in means, I employed a Monte Carlo permutation test. This computational procedure allowed me to estimate a p-value under the assumption of a null hypothesis, which posits that there is no genuine difference between the two groups. It’s crucial to emphasize that simply applying a t-test using appropriate software may not provide an intuitive understanding of how the p-value was calculated. Therefore, the adoption of a Monte Carlo procedure in this study was not only useful but also informative. It offered a more robust approach to estimating the p-value, considering the non-normal nature of the data. This analytical framework facilitated a deeper and more nuanced exploration of the relationships between inactivity, obesity, and diabetes, shedding light on potential predictors and their impact on the dataset.
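
Below is a rough sketch of how the t-test and the Monte Carlo permutation test can be run side by side. The file and column names are placeholders, and the "with diabetes" and "without diabetes" groups are approximated here by splitting counties at the median diabetes prevalence, since the published figures are rates rather than individual diagnoses.

```python
# Sketch with hypothetical file and column names; groups are defined by
# splitting counties at the median diabetes prevalence.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("cdc_merged.csv")  # hypothetical merged dataset
high = df["% DIABETIC"] > df["% DIABETIC"].median()
group_a = df.loc[high, "% INACTIVE"].to_numpy()
group_b = df.loc[~high, "% INACTIVE"].to_numpy()

# Classical t-test for reference (Welch's version, no equal-variance assumption).
print(stats.ttest_ind(group_a, group_b, equal_var=False))

# Monte Carlo permutation test: shuffle the pooled values many times and see how
# often a mean difference at least as large as the observed one appears.
rng = np.random.default_rng(0)
observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
perm_diffs = np.empty(10_000)
for i in range(perm_diffs.size):
    rng.shuffle(pooled)
    perm_diffs[i] = pooled[:n_a].mean() - pooled[n_a:].mean()

print("Monte Carlo p-value:", np.mean(np.abs(perm_diffs) >= abs(observed)))
```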

09/18 – Monday

In my extensive analysis of CDC data, I have delved into the intricate relationship between inactivity and obesity as predictor variables in the context of predicting diabetes. Employing rigorous statistical methodologies, I applied both multiple linear regression and polynomial regression techniques to elucidate their impact on our dataset. Initially, I harnessed the power of multiple linear regression, an invaluable tool for examining complex relationships among multiple variables. Utilizing this method, I constructed a predictive model that incorporated inactivity and obesity as covariates, aiming to decipher their collective influence on diabetes incidence. Through extensive analysis, it became evident that this linear model yielded some valuable insights into the relationship between the predictor variables and the response variable. However, it was apparent that the intricate nature of this association might not be adequately captured by a linear framework alone. Recognizing the need for a more nuanced approach, I subsequently explored polynomial regression, an advanced technique that allows for the incorporation of non-linear relationships within the model. By introducing polynomial terms, I sought to capture the potential curvilinear associations between inactivity and obesity in relation to diabetes. This in-depth analysis revealed that the polynomial regression model not only improved the fit of our data but also uncovered nuanced, non-linear patterns in the relationship between these predictor variables and diabetes incidence.

In conclusion, my meticulous examination of CDC data, employing both multiple linear and polynomial regression techniques, has provided a comprehensive understanding of the complex interplay between inactivity, obesity, and diabetes. While the multiple linear model offered valuable insights into the linear relationships, the polynomial regression model enabled a more nuanced exploration of potential non-linear associations, enriching our comprehension of the predictive factors contributing to diabetes incidence. This analytical journey has expanded our knowledge base and underscores the importance of employing diverse statistical methodologies to unravel intricate relationships in epidemiological data.

09/15 Friday

In this extensive analysis of CDC data, I conducted a rigorous investigation into the predictive value of two key variables, inactivity and obesity, on the incidence of diabetes. The main goal was to understand how these predictors and diabetes interact in complex ways while also accounting for important issues like multicollinearity and homoscedasticity.

The phenomenon of multicollinearity, a high correlation between predictor variables, was studied carefully. To detect and address collinearity problems, I made use of standard diagnostics such as variance inflation factor (VIF) analysis. The outcomes underlined how critical it is to address multicollinearity, because it can skew coefficient estimates and make the model harder to interpret. I then reduced multicollinearity using model refinement techniques like variable selection and regularization, which increased the model's predictive power. Furthermore, homoscedasticity, the assumption that the variance of the error terms is constant across predictor values, was carefully examined. I used statistical tests and residual plots to determine whether heteroscedasticity was present. The results showed that our model adhered to the homoscedasticity assumption: the variability of the error terms remained constant across the range of predictor values. This result was crucial because it underpins the validity of the statistical inferences drawn from the regression model. This analytical journey yielded a number of insightful discoveries. First, obesity and inactivity proved to be important diabetes predictors, reiterating their significance in public health initiatives aimed at diabetes prevention. Second, addressing multicollinearity produced a significant increase in model stability and interpretability, confirming the need for reliable preprocessing in regression analysis. The confirmation of homoscedasticity underscored the dependability of the model's predictions and gave us confidence in using it to assess diabetes risk. This CDC data analysis has highlighted the crucial link between inactivity, obesity, and diabetes. By addressing multicollinearity and verifying the homoscedasticity assumption, I not only improved the quality of the predictive model but also gained knowledge that can guide public health policies and interventions aimed at curbing the diabetes epidemic. This work brings home the significance of rigorous statistical analysis in using data to tackle urgent healthcare challenges.
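
A short sketch of the homoscedasticity check, assuming the same placeholder file and column names, using a Breusch-Pagan test alongside the residual-versus-fitted plot described above:

```python
# Sketch with hypothetical file and column names.
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("cdc_merged.csv").rename(
    columns={"% DIABETIC": "diabetes", "% OBESE": "obesity", "% INACTIVE": "inactivity"}
)
fit = smf.ols("diabetes ~ obesity + inactivity", data=df).fit()

# Breusch-Pagan test: a small p-value would point to heteroscedastic residuals.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, fit.model.exog)
print("Breusch-Pagan LM p-value:", lm_pvalue)

# Residuals vs fitted values: a roughly constant spread supports homoscedasticity.
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```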

09/13 – Wednesday

The analysis of CDC data on diabetes, with obesity and inactivity as predictor variables, revealed several key findings. Firstly, the p-values were statistically significant, indicating a strong relationship between the predictor variables (obesity and inactivity) and the occurrence of diabetes. This suggests that these factors play a crucial role in the development of diabetes within the population studied. However, it is important to note that the data exhibited heteroscedasticity, which can make predictive modeling inefficient. Heteroscedasticity means that the variance of the errors in the model is not consistent across different levels of the predictor variables, which makes predictions less reliable and suggests there are limitations when trying to accurately predict diabetes from obesity and inactivity alone. Additionally, the predictor variables, obesity and inactivity, were highly correlated with one another. This high correlation can lead to multicollinearity issues in predictive models, making it challenging to determine the unique contribution of each variable in explaining diabetes risk. Addressing multicollinearity through techniques like feature selection or dimensionality reduction may be necessary to build a more robust predictive model. In summary, the CDC data analysis revealed a significant relationship between obesity, inactivity, and diabetes, suggesting that these variables are important in understanding diabetes risk. However, the presence of heteroscedasticity and the high correlation between the predictor variables should be considered when developing predictive models for diabetes, to ensure the model's efficiency and accuracy.

09/11 – Monday

Since Python is my preferred language for analysis, I used my Jupyter Notebook setup and loaded the dataset as a pandas DataFrame. I used several pandas functions to understand the basic characteristics of the data provided, for example the number of records, the data types, and the amount of missing data.
Looking through the three datasets provided, I found that the Obesity dataset has only 363 records, Inactivity has 1370, and Diabetes has 3142.

During a discussion with the professor, the question came up of what we can do with this dataset and which questions can be answered; we agreed that combining the datasets to examine how one factor relates to another is what we would like to achieve. So my first step was to combine the diabetes and inactivity datasets, which gave me a total of 1370 rows. After producing some visualizations of this data, such as histograms and a quantile (Q-Q) plot, I found that the diabetes data is skewed slightly to the high side of a normal distribution, while the inactivity data is skewed slightly to the low side.

To find the relationship between the datasets, the first step was to compute the correlation, which came out to about 44%, not a strong correlation. From the scatterplot, I could see that a linear model could be fit to this data to make some predictions. Based on the R-squared, about 20% of the variation in the diabetes data points is attributable to variation in inactivity. From the residual and heteroscedasticity plots, however, we can conclude that the linear model is not very effective for this dataset.
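
For reference, here is roughly what that first pass looked like in code; the file names and the FIPS join key are assumptions standing in for the actual spreadsheet layout.

```python
# Sketch: the CSV file names and the FIPS join column are assumptions.
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

diabetes = pd.read_csv("diabetes.csv")      # 3142 records
inactivity = pd.read_csv("inactivity.csv")  # 1370 records

# Inner join keeps only the counties present in both sheets (1370 rows here).
merged = diabetes.merge(inactivity, on="FIPS", how="inner")
print(len(merged), merged.dtypes, merged.isna().sum(), sep="\n")

# Skewness and a quantile (Q-Q) plot against the normal distribution.
print(merged[["% DIABETIC", "% INACTIVE"]].skew())
stats.probplot(merged["% DIABETIC"], dist="norm", plot=plt)
plt.show()

# Correlation and the R^2 implied by a one-predictor linear model.
r = merged["% DIABETIC"].corr(merged["% INACTIVE"])
print("correlation:", round(r, 2), "R^2:", round(r ** 2, 2))
```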