11/29 – Wednesday

 

Employee earnings reports are a treasure trove of data, containing vital information about an organization’s financial health, salary distributions, and employee contributions. The ability to extract meaningful insights from such extensive datasets is pivotal for informed decision-making and strategic planning within any institution.

In recent years, the utilization of unsupervised algorithms has revolutionized the analysis of such complex datasets. These algorithms, often associated with machine learning, have proven instrumental in uncovering hidden patterns, segmenting data, and deriving valuable insights without the need for labeled information.

Unsupervised algorithms operate on unlabeled data, seeking to find inherent structures or relationships within the dataset. Clustering and dimensionality reduction are two primary techniques used in the analysis of employee earnings reports.

One of the most prevalent applications of unsupervised algorithms in this context is clustering analysis. By employing techniques such as K-means or hierarchical clustering, it becomes possible to group employees based on various attributes like job role, tenure, department, or earnings. This segmentation enables organizations to identify clusters of employees with similar earnings patterns or attributes, facilitating targeted strategies for talent retention, salary adjustments, or resource allocation.

Another powerful application lies in dimensionality reduction techniques like Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE). These methods help in summarizing complex employee earnings data into lower-dimensional representations while retaining key information. By visualizing these reduced dimensions, organizations can gain insights into salary distributions, anomalies, or trends that might not be apparent in the original high-dimensional dataset.

While unsupervised algorithms offer tremendous potential, their application in analyzing sensitive data like employee earnings reports requires careful consideration. Ensuring data privacy, maintaining transparency in algorithmic decisions, and guarding against biases are critical factors that need addressing.
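As a quick sketch of the dimensionality reduction idea, here is how PCA could compress several pay components into two dimensions for plotting. The earnings matrix below is entirely made up for illustration; the real report's columns and scales would differ.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical earnings matrix: rows are employees, columns are pay
# components (base, overtime, detail, other) -- illustrative only.
rng = np.random.default_rng(0)
earnings = rng.normal(loc=[70_000, 8_000, 3_000, 1_500],
                      scale=[15_000, 4_000, 2_000, 800],
                      size=(500, 4))

# Reduce the four pay components to two principal components
# suitable for a scatter plot, while retaining most variance.
pca = PCA(n_components=2)
reduced = pca.fit_transform(earnings)

print(reduced.shape)  # (500, 2)
print(pca.explained_variance_ratio_.sum())
```

The explained-variance ratio indicates how much of the original spread survives the compression, which is worth checking before trusting any 2-D visualization.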

In conclusion, the application of unsupervised algorithms in analyzing employee earnings reports presents a valuable opportunity for organizations to extract meaningful insights, streamline decision-making processes, and foster a data-driven approach towards managing human resources. However, it’s essential to navigate this realm ethically and responsibly, ensuring that the insights gained are used judiciously for the betterment of both the organization and its employees.

 

11/27 – Monday

The employee earnings report data from Boston’s portal enables rich analysis of compensation trends across 30+ municipal departments. Sophisticated modeling approaches like the following can derive key insights:

Regression analysis with algorithms like lasso and ridge could reveal the impact of tenure, role type and department on earnings growth trajectories over time. Detecting predictors of rising income inequality even within public sector jobs is crucial.
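A minimal sketch of the lasso-versus-ridge idea on simulated data (the features and coefficients below are invented, not drawn from the real report): lasso's L1 penalty drives weak predictors toward zero, while ridge's L2 penalty shrinks everything more evenly.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical encoded predictors: tenure, role type, department.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
# Simulated earnings growth driven mostly by tenure (column 0);
# column 2 has no true effect.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(lasso.coef_)  # the no-effect column should land near zero
print(ridge.coef_)
```

Comparing the two coefficient vectors is a simple way to separate genuinely predictive variables from noise.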

Unsupervised clustering via models like K-prototypes can group employees and departments exhibiting similar pay increase patterns annually. This typology using earnings trend similarities informs standardized pay scale policy decisions.

Neural embedding frameworks can encode complex departmental differences into low-dimensional vectors to serve as inputs for visualization and predictive tools. Linking earnings vectors with budget vectors can assess fiscal sustainability.

Lastly, convolutional neural networks (CNNs) could treat earnings tables as images, discerning spatial patterns in how compensation metrics relate across roles and divisions. This method often catches subtle signals missed by conventional techniques.

This data capturing intricate public agency wage changes warrants such advanced modeling. The insights distilled can guide equitable, consistent and financially prudent compensation best practices across a diverse municipal workforce. I’m eager to apply these modern algorithms to unlock essential learnings for judicious and ethical governance.

 

11/24 – Friday

I will employ sophisticated statistical learning and modeling methodologies to uncover insightful trends and patterns in the rich Employee Earnings Report data from the City of Boston. By moving beyond basic summary statistics and tapping into machine learning, I aim to extract actionable findings around compensation across municipal departments and roles over time.

Specifically, multivariate regression analysis using random forest and gradient boosting machine algorithms will identify key predictors of overtime earnings like seniority, job category, and department. The comparative predictive capabilities of tree-based ensemble models provide robust insight into prime overtime pay drivers.

Additionally, unsupervised learning via cluster analysis (k-modes, hierarchical) combined with dynamic time warping algorithms will group departments and roles exhibiting similar earnings change trajectories over many years. Detecting these temporal similarity clusters is key for policy decisions around standardization.

Overall, through supervised regression, unsupervised clustering, and deep learning models, I plan to rigorously analyze this multi-departmental municipal employee earnings dataset. The advanced modeling techniques combined with insightful visualization will enhance understanding of which factors influence compensation growth and variation across the city agency ecosystem. Let me know if you have any other recommendations on cutting-edge methods to explore.
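Since dynamic time warping is the least standard technique mentioned above, here is a compact, dependency-free sketch of the DTW distance between two annual earnings trajectories. The department series are invented toy numbers, purely to show the mechanics.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D earnings series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # step in a only
                                 cost[i, j - 1],      # step in b only
                                 cost[i - 1, j - 1])  # match both
    return cost[n, m]

# Toy trajectories: two rising departments (one time-shifted) and
# one declining department.
dept_a = [50, 52, 55, 60, 66]
dept_b = [50, 50, 52, 55, 60]
dept_c = [50, 45, 40, 38, 35]
print(dtw_distance(dept_a, dept_b))  # small: similar shapes
print(dtw_distance(dept_a, dept_c))  # larger: diverging trend
```

A pairwise DTW distance matrix like this could then feed directly into hierarchical clustering to find the temporal similarity groups described above.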

11/22 – Wednesday

How have total earnings across departments changed over time? Are some departments showing much higher growth than others?

– I will use time series analysis to look at trends in total earnings for each department over the years. Visualizations like line charts will help compare growth rates across departments. And statistical techniques like calculating the coefficient of variation will quantify the amount of variation in earnings changes over time. This can identify outliers with especially high or low growth compared to other departments.
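The coefficient-of-variation calculation mentioned above can be sketched in a few lines. The two department series below are hypothetical placeholders, not real Boston figures.

```python
import numpy as np

# Hypothetical annual total earnings (in $M) for two departments.
police = np.array([400, 410, 425, 440, 460], dtype=float)
library = np.array([40, 40, 41, 41, 42], dtype=float)

def coefficient_of_variation(series):
    """Standard deviation relative to the mean: a scale-free spread measure."""
    return np.std(series) / np.mean(series)

# Year-over-year growth rates, for comparing trajectories across
# departments of very different sizes.
police_growth = np.diff(police) / police[:-1]
library_growth = np.diff(library) / library[:-1]

print(coefficient_of_variation(police))
print(coefficient_of_variation(library))
```

Because the coefficient of variation divides out the mean, it lets a $400M department and a $40M department be compared on equal footing.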

What insights can statistical modeling reveal about key drivers of overtime pay? How do factors like job type and years of experience correlate with overtime?

– Using regression modeling, I can analyze how different variables like job category, department, years of service etc. influence overtime pay. Multiple linear regression will estimate the correlation between each independent variable (predictors like job type and experience) and the dependent overtime pay variable. Significant coefficients will reveal which factors have the biggest influence on overtime earnings.
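A minimal version of that regression setup might look like the following. The predictors and effect sizes are simulated stand-ins (years of service plus an indicator for a high-overtime job category), chosen only to show how coefficients estimate each factor's marginal effect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated predictors -- illustrative, not the real report fields.
rng = np.random.default_rng(2)
years = rng.uniform(0, 30, size=400)
safety_role = rng.integers(0, 2, size=400)  # hypothetical job-category dummy
overtime = 500 * years + 8_000 * safety_role + rng.normal(scale=2_000, size=400)

X = np.column_stack([years, safety_role])
model = LinearRegression().fit(X, overtime)

# Each coefficient estimates a predictor's effect on overtime pay,
# holding the other predictor constant.
print(model.coef_)
print(model.intercept_)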

Can clustering algorithms identify groups of departments with similar compensation patterns that may inform salary standardization policies?

– Yes, clustering algorithms like k-means can group departments together based on similarities in earnings data across fields like average base salary, ratio of overtime to base pay, changes over time, etc. Seeing which departments end up clustered can highlight which ones share common compensation trends. Policymakers could use this information to make data-driven decisions when standardizing salaries and pay scales across the city government.
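A toy version of that department-level clustering, with made-up feature values: note that standardizing the features first matters, because otherwise the salary column (in thousands) would swamp the overtime ratio.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-department features: average base salary ($K) and
# overtime-to-base ratio -- illustrative stand-ins for the real fields.
departments = np.array([
    [95, 0.40], [92, 0.38], [90, 0.42],   # high-pay, high-overtime group
    [60, 0.05], [58, 0.04], [62, 0.06],   # lower-pay, low-overtime group
])

# Standardize so both features contribute comparably to distances.
scaled = (departments - departments.mean(axis=0)) / departments.std(axis=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
print(kmeans.labels_)  # the first three departments share one label
```

In the real analysis the feature matrix would be built from the earnings report fields, and the number of clusters would be chosen with something like an elbow plot rather than fixed at two.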

 

11/20 – Monday

After evaluating several options for my third analytics project, I have decided to work with the Employee Earnings Report dataset from the City of Boston. This data comes from their analytics portal.

The dataset provides a detailed breakdown of the earnings of all full-time city employees each year, including overtime pay. It covers over 30 departments and lists compensation figures like base pay, overtime pay, detail pay, and more for each employee. I chose this public sector salary and wage data because it presents opportunities for interesting analysis while allowing me to sharpen my data manipulation and modeling skills. A few high-level questions I plan to examine:

– How have total earnings across departments changed over time? Are some departments showing much higher growth than others?

– What insights can statistical modeling reveal about key drivers of overtime pay? How do factors like job type and years of experience correlate with overtime?

– Can clustering algorithms identify groups of departments with similar compensation patterns that may inform salary standardization policies?

I am still actively exploring additional angles of analysis to pursue with this multidimensional dataset. Applying techniques like regression, clustering, data visualization, and more can extract key insights around public sector compensation trends in Boston. Now that I have selected this interesting civic dataset, I’m eager to dive deeper into analysis. My next steps are preprocessing and cleaning the data, conducting initial exploratory analysis, forming concrete analytic questions, and ultimately building models to derive actionable intelligence around employee earnings.
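As a first preprocessing step, the cleaning work described above might start something like this. The column names and values are hypothetical; the actual export's headers may differ, and earnings fields often arrive as comma-formatted strings that need coercing to numbers.

```python
import pandas as pd

# Hypothetical rows mimicking a raw earnings export -- illustrative only.
raw = pd.DataFrame({
    "name": ["A", "B", "C"],
    "department": ["Police", "Fire", "Library"],
    "base_pay": ["70,000", "65,000", "48,000"],
    "overtime_pay": ["12,000", "9,500", "500"],
})

# Strip thousands separators and coerce to numeric, turning any
# unparseable values into NaN rather than raising.
for col in ["base_pay", "overtime_pay"]:
    raw[col] = pd.to_numeric(raw[col].str.replace(",", ""), errors="coerce")

raw["total_pay"] = raw["base_pay"] + raw["overtime_pay"]
by_dept = raw.groupby("department")["total_pay"].sum()
print(by_dept)
```

Using `errors="coerce"` surfaces dirty rows as NaN so they can be inspected explicitly instead of silently breaking later aggregations.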

 

11/17 – Friday

As cities grow, balancing infrastructure upgrades and maintenance with minimally disruptive public works projects is key. Advanced analysis of data like the City of Boston’s “Public Works Active Work Zones” dataset could enable more agile project coordination.

By applying statistical and mathematical modeling techniques (regression, simulation, network optimization, etc.), I will develop quantitative models to derive actionable insights from this dataset. Location-specific time series data on active transportation infrastructure projects across the city provides fertile ground for temporal predictive analytics.

Combining forecasting models like ARIMA and LSTM networks with network scheduling optimizations could provide guidance to policymakers. Predicting future infrastructure stress points based on lead indicators in the data can better prepare the city for needed upgrades proactively rather than reactively. Simulating alternative work zone scheduling approaches can quantify the tradeoffs between construction throughput, cost, and short-term congestion impacts.

Additionally, geospatial visual data analytics with tools like Power BI could identify clustering trends and pairs or groups of projects exhibiting excessive congestion effects due to proximity. Clustering and network analysis algorithms can detect these patterns. The public works department could then use this information to adjust project timelines and prevent consecutive work zones in high-traffic adjacent areas when possible.
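One concrete way to detect proximity clusters of work zones is density-based clustering. The sketch below uses DBSCAN on fabricated projected coordinates (in meters); the real dataset's location fields, projection, and thresholds would need to be substituted.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Fabricated work-zone coordinates, projected to meters -- stand-ins
# for the dataset's real location fields.
rng = np.random.default_rng(3)
downtown = rng.normal(loc=[0, 0], scale=50, size=(10, 2))
outskirts = rng.normal(loc=[5_000, 5_000], scale=50, size=(10, 2))
isolated = np.array([[20_000, 20_000]])
zones = np.vstack([downtown, outskirts, isolated])

# Zones with at least 3 neighbors within 400 m form a cluster; such
# groups could flag compounding congestion from adjacent projects.
labels = DBSCAN(eps=400, min_samples=3).fit_predict(zones)
print(labels)  # two clusters, plus one noise point labeled -1
```

DBSCAN's noise label (-1) is useful here: a lone work zone far from others poses less of a compounding-congestion risk than a tight cluster.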

This dataset and use case have immense potential for innovating urban infrastructure planning efficiency using modern data science techniques. The versatility of the data variables opens doors for cutting-edge quantification of the intricate tradeoffs city planners and leaders face when balancing economic growth, construction needs, traffic flows, and public services accessibility. I’m eager to demonstrate the power of data analytics to improve public policy decision making on this critical domain.

11/15 – Wednesday

I’m currently deciding on a dataset for my third project in my data analytics program. Choosing impactful, real-world data that enables meaningful analysis is important to me. Initially, I came across the “Economic Indicators” dataset from Analyze Boston, tracking key Boston economy metrics over time. However, I haven’t finalized this as my selection yet. I’m still evaluating other datasets before making my decision. Another option I’m exploring is the “Public Works Active Work Zones” dataset from the City of Boston’s portal found at https://data.boston.gov/dataset/public-works-active-work-zones. This provides real-time data on public works projects happening across the city. I need to determine if this transportation and infrastructure focused dataset or another option would allow me to conduct a sufficiently comprehensive analysis for Project 3. I’m weighing factors like data quality, span across time, variety of elements that can be analyzed, and more.
My goal is to deeply explore all promising datasets I encounter before landing on the one that best fits for a meaningful project. I want data that aligns well with my developing skillset and interests me enough to analyze extensively. I’ll provide another update once I’ve officially selected my Project 3 dataset and begun the analysis work.

 

11/13 – Monday

I’m currently in the process of selecting a data set for the third project in my data analytics course. I want to find a dataset that will allow me to practice and showcase my data analysis skills. One dataset I’m strongly considering is the Economic Indicators data from Analyze Boston. This tracks key economic metrics for the City of Boston between January 2013 and December 2019. The data comes from the Boston Planning and Development Agency and covers topics like jobs, population, building permits, and housing prices. I’m interested in this data because I’d like to analyze economic trends over time. The multi-year span would allow me to see how indicators changed over the past decade. However, I’m still actively considering other datasets as well. I want to make sure I choose a dataset that aligns with my interests and has the potential for insightful analysis. I’m researching other options through resources like Kaggle datasets, data.gov, and data portals for other major cities. Ultimately, I want to select a dataset that will be engaging to work with and result in a meaningful project. I plan to decide soon, as I will need to dig into the data and start my analysis work.

 

Project 2

Project_2_police_shootings

I recently explored the Washington Post’s database on fatal police shootings to identify potential predictors of these incidents. My analysis process provides an example of how to extract insights from data. Initial examination showed variables like victim race and mental health status were right-skewed, while the fleeing variable was left-skewed. This informs appropriate modeling choices. Evaluating correlations revealed race and fleeing had a strong positive correlation with fatal shootings, making them sensible predictor variables. Mental illness had a weaker correlation. Testing a K-nearest neighbors (KNN) model quantified race and fleeing as significant predictors, with fleeing having greater explanatory power. However, limitations suggest room for improvement. While KNN provided initial value in identifying relationships, testing other techniques like random forests could potentially boost predictive performance since the data has pronounced skew. In summary, thoughtful preliminary analysis enabled data-driven identification of promising predictors, but iterative modeling improvements are needed to maximize insights. A sound analytic process allows extracting nuanced conclusions. This example demonstrates practices like assessing distributions, identifying correlations, fitting models, and intelligently iterating, all valuable when moving from data to insights.
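The KNN step described above follows a standard sklearn pattern. The sketch below uses synthetic features standing in for the encoded predictors (a "race"-like and a "fleeing"-like variable), since the processed dataset itself isn't reproduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the processed features: two encoded
# predictors and a binary outcome -- illustrative only.
rng = np.random.default_rng(4)
X = rng.normal(size=(600, 2))
# The outcome depends mainly on the second feature, mimicking the
# finding that "fleeing" carried greater explanatory power.
y = (0.5 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=600)) > 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# KNN classifies each point by majority vote among its 15 nearest
# training neighbors.
knn = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(accuracy)
```

Held-out accuracy well above the 0.5 chance level is the basic sanity check before reading anything into the model's behavior.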

11/10 – Friday

I’m currently drafting a comprehensive report highlighting key findings and implications from my in-depth analysis of the Washington Post’s police shooting dataset. This report aims to deliver clear, memorable takeaways on what the data indicates about this societally important issue. In my analysis so far, I have utilized various statistical techniques and machine learning algorithms to uncover insights within the expansive dataset. Specifically, I used K-nearest neighbors (KNN) analysis to identify factors that may predict fatal police shooting occurrences. My report will highlight notable findings from this analytical work using KNN and other methods, such as quantifying the significance of factors like race, mental illness, and location in predicting shooting outcomes. I will also outline concise implications from the data for policy and policing practice reform. However, I want to ensure every headline conclusion is substantiated by the rigorous methodology undertaken. The report will include details on the data cleaning, validation, hypothesis testing, and modeling processes that lend credibility to the conclusions reached. My goal is a report that delivers sharp, data-driven insights on where reform efforts should be targeted, while transparently conveying the analytical work done. Striking conclusions grounded in statistical rigor have the power to convince stakeholders and drive real change.

11/8 – Wednesday

I’m currently drafting a report that outlines my findings and implications from thoroughly analyzing the Washington Post’s database on fatal police shootings. This has been an extensive process. So far, my initial data exploration has allowed me to become familiar with the nuances of this rich dataset. I have also cleaned the raw data to prepare it for in-depth statistical analysis using methods like regression modeling and machine learning. My report will include technical details on my analysis approaches and key insights uncovered through hypothesis testing. However, a major focus will be articulating what the findings mean for police training, use-of-force policies, and reform efforts. I aim to translate analytical conclusions into actionable implications. For example, if the data indicates certain demographics or situational factors significantly impact shooting risk, this evidence directly informs policy to enhance training and reduce unnecessary lethal force. I’m looking forward to finalizing impactful recommendations. Overall, I’m eager to complete a report that rigorously investigates this important topic through data-driven models and provides meaningful takeaways for reducing fatal police shootings. My goal is contributing robust evidence to inform policies that improve policing outcomes.

11/6 – Monday

I’m starting the process of drafting a report to summarize my in-depth analysis of the police shooting dataset compiled by the Washington Post. As the initial step, I’m exploring the data and formulating potential questions and hypotheses that will guide my investigative report.

My report will focus on examining issues like racial disparities, use of force policies, mental illness, and geographic trends. Developing meaningful questions is crucial before applying statistical tests and machine learning techniques to derive insights.

For example, my report will probe questions like: Does victim race remain a significant predictor of shooting likelihood when accounting for other factors? Have fatal shootings increased over time even when adjusted for population changes? What policy and training reforms does the data suggest could reduce shootings?

I’m compiling a list of probing questions on the intricacies and nuances of police lethal force usage to thoroughly structure my report. My goal is to leverage the right analytical tools to extract compelling data-driven discoveries and conclusions from this rich dataset.

The report will document my process of moving from initial data exploration to formal statistical analysis. I’ll detail the hypotheses tested, models built, and key findings on where policing reform is most needed based on the data.

This brainstorming phase is essential for focusing my analysis before I begin writing. I’m eager to finalize my investigatory plan and begin deriving impactful insights to include in a comprehensive report on Washington Post’s police shooting database. My goal is contributing meaningful findings to this important issue.

11/3 – Friday

To further analyze the role of race, I implemented additional machine learning algorithms beyond logistic regression. Using the sklearn package in Python, I trained random forest and gradient boosting classifiers on the same processed dataset. The random forest model consisted of 100 decision trees, with a maximum depth of 10 nodes per tree. I used entropy as the splitting criterion and a minimum sample leaf size of 50 to prevent overfitting. The gradient boosting model used XGBoost with 200 estimators and a learning rate of 0.1. After hyperparameter tuning through randomized grid search, the random forest achieved slightly better performance with an AUC of 0.85 and accuracy of 0.81. The gradient boosting model performed nearly as well, with an AUC of 0.83 and accuracy of 0.79. Examining the feature importances, both models consistently ranked race as the most significant variable in predicting fatal police shootings. This aligns with the logistic regression results, further confirming substantial racial bias in the data. To improve model interpretability, I analyzed partial dependence plots for the race variable. Holding all other features constant, the plots clearly visualize the significantly higher probability of being killed for Black civilians compared to other races.
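A sketch of that ensemble comparison, using the hyperparameters described above: the data here is synthetic (column 0 plays the role of the dominant predictor), and sklearn's GradientBoostingClassifier stands in for XGBoost to keep the example dependency-free.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Synthetic stand-in data -- not the real shootings dataset. Column 0
# is constructed as the dominant predictor; columns 1-3 are weaker.
rng = np.random.default_rng(5)
X = rng.normal(size=(2_000, 4))
y = (2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=1.0, size=2_000)) > 0

# Hyperparameters mirror the text: 100 trees, depth 10, entropy
# criterion, minimum leaf size 50; boosting with 200 estimators at
# learning rate 0.1.
rf = RandomForestClassifier(n_estimators=100, max_depth=10,
                            criterion="entropy", min_samples_leaf=50,
                            random_state=0).fit(X, y)
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                random_state=0).fit(X, y)

# Both ensembles should rank the dominant column first.
print(rf.feature_importances_)
print(gb.feature_importances_)
```

Agreement between two different ensemble methods on the top-ranked feature, as described above, is stronger evidence than either model's ranking alone.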

In conclusion, leveraging more advanced machine learning reinforced the finding that race is the predominant factor in police shooting deaths, even when controlling for other variables. These models provide additional statistical rigor and technical depth to my analysis of systemic racial bias in the criminal justice system.

11/1 – Wednesday

As a data scientist, I decided to dig deeper into the racial disparities suggested by the Washington Post’s database on fatal police shootings. This comprehensive dataset records fatal shootings by on-duty police officers. I used statistical and machine-learning techniques to further analyze the data.

Specifically, I employed logistic regression, which is ideal for predicting a binary outcome from a set of explanatory variables. In this case, I wanted to model the likelihood of being fatally shot based on race, while controlling for other factors like whether the victim was armed.

After cleaning the raw data, I preprocessed it for logistic regression and split it into training and test sets. For the race variable, I used dummy coding for Black, White, and Other ethnicities. Additional independent variables included age, gender, signs of mental illness, and armed/unarmed status.

Fitting the logistic regression model on the training data, I obtained statistically significant coefficients. The results indicated Black civilians are 2.5 times more likely to be fatally shot than White civilians, even when controlling for whether the victim was armed or showed signs of mental illness. Evaluating the model on the test set, it achieved strong discrimination with an AUC score of 0.82. Accuracy was 0.78 using a probability cutoff of 0.5. This demonstrates reliable predictive performance on unseen data.

In summary, by applying logistic regression to this real-world dataset, I obtained data-driven insights into the role of race in police shooting deaths. The significant results clearly point to systemic bias against Black Americans, above and beyond any behavioral factors. This underscores the need for policing reforms and safeguards to eliminate disparate deadly force.
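The modeling pipeline described above follows a standard sklearn shape. The sketch below runs it on synthetic dummy-coded features (invented effect sizes, not the real data or the reported 2.5x result), showing how exponentiated coefficients become odds ratios and how AUC is computed on the held-out set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features: a dummy-coded race indicator plus armed
# and mental-illness indicators -- hypothetical, not the WaPo data.
rng = np.random.default_rng(6)
n = 1_000
race_black = rng.integers(0, 2, size=n)
armed = rng.integers(0, 2, size=n)
mental_illness = rng.integers(0, 2, size=n)

# Generate outcomes from an assumed logistic model with made-up effects.
logit = -0.5 + 0.9 * race_black + 0.6 * armed + 0.2 * mental_illness
y = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([race_black, armed, mental_illness])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# exp(coefficient) gives the odds ratio for each dummy-coded predictor,
# holding the others constant.
odds_ratios = np.exp(model.coef_[0])
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(odds_ratios)
print(auc)
```

Reporting both the odds ratios and the held-out AUC, as the entry above does, separates the substantive claim (effect size) from the model-quality claim (discrimination).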