
HR Attrition

"Information can be extracted from data just as energy can be extracted from oil." - Adeola Adesina


The purpose of this analysis is to answer one underlying question: what factors are worsening the rate of attrition at this organization? The dataset used for this report, 'HR Attrition Data.csv', comes from the HR department and contains information about employees at the organization. Many analysts begin with Exploratory Data Analysis (EDA) to identify anomalies and gain a better understanding of the data before applying complex models (Madhugiri, 2023). At this stage, adjustments like removing outliers, null values, or duplicates improve a model's performance. I will use Python with the Matplotlib, NumPy, pandas, and scikit-learn libraries to accomplish the analysis in this report.

Pre-Processing

Preprocessing is a preliminary stage that corrects erroneous data, resolves missing values, and more. This project involves multiple datasets, so the data first needs to be aggregated. After gathering the data by concatenation, the next step is feature engineering: manipulating the data by combining or mutating columns to better suit the analysis (Azevedo, 2023). A sketch of the aggregation step is shown below.
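As a rough sketch of that concatenation step (the file names here are placeholders, not the actual source files):

import pandas as pd

# Hypothetical file names; the real source files may differ.
files = ["HR_Attrition_Data_part1.csv", "HR_Attrition_Data_part2.csv"]
frames = [pd.read_csv(f) for f in files]

# Stack the datasets row-wise and rebuild a clean index.
hr_data = pd.concat(frames, ignore_index=True)

# Drop any exact duplicate rows introduced by the merge.
hr_data = hr_data.drop_duplicates().reset_index(drop=True)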

 

With the HR data, I will be creating columns such as Attrition Count, Attrition Rate, and Employee Count. Below are the DAX calculations used in Power BI (Tableau uses analogous calculated fields).

NEW COLUMN: Attrition Count =
    IF('HR_Data'[Attrition] = "Yes", 1, 0)

NEW MEASURE: Attrition Rate =
    DIVIDE(SUM('HR_Data'[Attrition Count]), SUM('HR_Data'[Employee Count]), "")

NEW MEASURE: Active Employees =
    SUM('HR_Data'[EmployeeCount]) - SUM('HR_Data'[Attrition Count])
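The same engineered fields can be mirrored on the Python side. A minimal pandas sketch (assuming the merged frame is named hr_data, a name used here for illustration):

# Flag leavers: 1 if the employee left, 0 otherwise.
hr_data["Attrition Count"] = (hr_data["Attrition"] == "Yes").astype(int)

# Each row is one employee, mirroring the Employee Count column.
hr_data["Employee Count"] = 1

# Aggregate measures equivalent to the DAX above.
attrition_rate = hr_data["Attrition Count"].sum() / hr_data["Employee Count"].sum()
active_employees = hr_data["Employee Count"].sum() - hr_data["Attrition Count"].sum()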

Oftentimes an analyst will handle a large dataset with many absent data points. A large number of missing values can skew the distribution of the data and degrade model performance. After determining the nature of the missingness (MCAR, MAR, or MNAR), there are two main remedies: imputation or data removal. When data is missing completely at random (MCAR), a deletion method such as pairwise deletion may be best. Luckily, the HR data does not contain enough missing values to obstruct the analysis: the merged dataset has 64 missing values in the 'Attrition' column and 1,000 missing values in 'TrainingTimesLastYear'. A quick audit and cleanup is sketched below.
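One way that audit and cleanup might look (the column names come from the text above; the deletion and imputation choices are illustrative, not the only valid ones):

# Count missing values per column.
print(hr_data.isnull().sum())

# Deletion: drop the rows missing the target label 'Attrition'.
hr_data = hr_data.dropna(subset=["Attrition"])

# Imputation: fill the missing training counts (median is one reasonable choice).
median_training = hr_data["TrainingTimesLastYear"].median()
hr_data["TrainingTimesLastYear"] = hr_data["TrainingTimesLastYear"].fillna(median_training)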

Baseline

Using the 'isnull().sum()' function, the program reveals no remaining missing values of concern. The total number of employees at the company is 1,470, the number of active employees is 1,233, and the attrition count is 237, giving an attrition rate of 16.12%. The biggest benefit of establishing a baseline is that it serves as a point of reference when assessing the strength of a model. Without building any model, the baseline accuracy for predicting whether an employee will leave or stay is 83.88%, the share of the majority class (Vikash-Analytics, 2016).
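These baseline figures reduce to a few lines of pandas (a sketch using the counts reported in this section):

total_employees = len(hr_data)                           # 1470
attrition_count = (hr_data["Attrition"] == "Yes").sum()  # 237
active_employees = total_employees - attrition_count     # 1233

attrition_rate = attrition_count / total_employees       # ~0.1612
baseline_accuracy = 1 - attrition_rate                   # ~0.8388: always predict "stays"

print(f"Attrition rate: {attrition_rate:.2%}, baseline accuracy: {baseline_accuracy:.2%}")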

[Screenshot: baseline employee counts and attrition rate output]

Feature Engineering

It is possible to use the Random Forest method to rank features by their importance to the target variable. After fitting, the model exposes the 'feature_importances_' attribute, whose values can be placed in a separate data frame to display the results. The top five features are 'PerformanceRating', 'WorkLifeBalance', 'YearsWithCurrManager', 'JobSatisfaction', and 'JobRole', which lines up with the results of the correlation analysis.
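A minimal version of that importance ranking (a sketch, assuming hr_data has already been numerically encoded as in the encoding step below):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X = hr_data.drop(columns=["Attrition"])
y = hr_data["Attrition"]

# Fit the forest; feature_importances_ is an attribute populated after fitting.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Collect the importances in a separate data frame and rank them.
importances = pd.DataFrame({"feature": X.columns, "importance": rf.feature_importances_})
print(importances.sort_values("importance", ascending=False).head(5))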

 

Categorical data is used for grouping and aggregating data, with the information encoded to numerical values. This data type can be nominal (e.g., gender) or ordinal (e.g., ranking). There are three main ways to accomplish encoding: Python's category_encoders library, pandas' get_dummies, or scikit-learn (Garg, 2022). Next, I encoded the target variable 'Attrition' through mapping, as shown below. It is also necessary to encode the remaining categorical variables in the dataset for later use in the predictive models; scikit-learn's 'LabelEncoder()' converts each variable to a numerical value automatically.
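Both encoding steps can be sketched as follows (a minimal illustration of the approach described above):

from sklearn.preprocessing import LabelEncoder

# Encode the target variable through an explicit mapping.
hr_data["Attrition"] = hr_data["Attrition"].map({"Yes": 1, "No": 0})

# Label-encode every remaining categorical (object-typed) column.
encoder = LabelEncoder()
for col in hr_data.select_dtypes(include="object").columns:
    hr_data[col] = encoder.fit_transform(hr_data[col])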

[Screenshots: target mapping and label encoding of the categorical variables]

Correlation Analysis

Feature selection is the process of determining which variables are most relevant to the target and should be included in a model. To begin, I performed a correlation analysis to measure the strength of association between each feature and the target. The features with the lowest correlation to the target will be dropped, in hopes of reducing noise in the upcoming models. The four variables with the strongest correlation are PerformanceRating (-0.9189), WorkLifeBalance (-0.7149), JobSatisfaction (-0.4490), and YearsWithCurrManager (-0.4368).
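The ranking can be produced with pandas (a sketch, assuming the encoded hr_data frame from the previous step):

# Pearson correlation of every numeric feature with the encoded target.
correlations = hr_data.corr(numeric_only=True)["Attrition"].drop("Attrition")

# Rank by absolute strength; weakly correlated features are candidates to drop.
print(correlations.reindex(correlations.abs().sort_values(ascending=False).index))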

[Screenshots: correlation matrix and feature correlations with Attrition]

Visualizations

Below are visualizations of the variables most correlated with attrition, further supporting the baseline. All of these variables move in the opposite direction from the target, meaning the correlation is negative: for example, as an employee's performance rating increases, they become less likely to leave. A Matplotlib sketch of one such chart follows the figures.

[Bar charts: attrition vs. PerformanceRating, WorkLifeBalance, JobSatisfaction, and YearsWithCurrManager]
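A chart like the first one can be reproduced in a few lines (a sketch; since Attrition is encoded 0/1, the group mean equals the attrition rate):

import matplotlib.pyplot as plt

# Average attrition per rating level is the attrition rate for that group.
rate_by_rating = hr_data.groupby("PerformanceRating")["Attrition"].mean()

rate_by_rating.plot(kind="bar")
plt.xlabel("Performance Rating")
plt.ylabel("Attrition Rate")
plt.title("Attrition Rate by Performance Rating")
plt.tight_layout()
plt.show()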

Model 1: Logistic Regression

The first model I will implement is a logistic regression to predict employee attrition. The first steps are splitting the dataset into training and testing sets and fitting the model to the training data. I created two logistic models: the first scored 0.9966 and the second scored a perfect 1.0000 in predicting outcomes. The variables included in both models are 'Age', 'BusinessTravel', 'Department', 'DistanceFromHome', 'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'OverTime', 'PerformanceRating', 'RelationshipSatisfaction', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', and 'YearsWithCurrManager'.
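The split-and-fit steps might look like this (a sketch; the exact test size and solver settings in the original notebook may differ):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = hr_data.drop(columns=["Attrition"])  # or restrict to the columns listed above
y = hr_data["Attrition"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model and report mean accuracy on the held-out test set.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print(log_reg.score(X_test, y_test))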

[Screenshots: logistic regression code and model scores]

Model 2: Random Forest

The second model I will implement is a Random Forest (RF) classifier. Random Forest classifiers predict discrete values, and the algorithm achieves high prediction accuracy because it tolerates outliers and noise in the data. RF classification begins with bootstrapping: the algorithm extracts subsamples, grows a decision tree on each one, and aggregates the trees' votes. To implement this, I reused the training and testing sets split from the original dataset, built the model with the 'RandomForestClassifier' function, and fit it to the training data. The model performs perfectly, with a score of 1.0 in predicting attrition outcomes.
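A sketch of that step, reusing the train/test split from the logistic regression section (the hyperparameters are illustrative):

from sklearn.ensemble import RandomForestClassifier

# Each bootstrap subsample grows one decision tree; the forest votes on the label.
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Mean accuracy on the held-out test set.
print(rf_model.score(X_test, y_test))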

[Screenshot: Random Forest model code and score]

It is important to evaluate the performance of the models explored to reach a valid conclusion about which one to use. Calling the 'classification_report' function from the sklearn library outputs the precision, recall, and F1 values. Precision reveals a model's ability to make correct positive predictions, TP/(TP + FP) (Kanstrén, 2023). Recall, also referred to as sensitivity, is the share of actual positives that were predicted correctly, TP/(TP + FN). Finally, the F1 score combines precision and recall into their harmonic mean, 2TP/(2TP + FP + FN). Generating the report is sketched below; the screenshot that follows provides the classification report for the first logistic regression model.
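As a sketch (the class names passed to target_names are labels I chose for readability):

from sklearn.metrics import classification_report

# Predict on the held-out set with the fitted logistic model from above.
y_pred = log_reg.predict(X_test)

# Per-class precision, recall, and F1, plus overall accuracy.
print(classification_report(y_test, y_pred, target_names=["Stayed", "Left"]))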

[Screenshot: classification report for the logistic regression model]

The screenshot below reveals the classification report for the RF classification model. 

[Screenshots: classification report for the Random Forest model]

By including information about an employee as model inputs, analysts can predict whether attrition is likely. Through this analysis, we found that Performance Rating, Work-Life Balance, Job Satisfaction, and Years with Current Manager were negatively correlated with attrition. It is best to attend to the possibility of attrition when performance rating, job satisfaction, or work-life balance is rated 2 or below. Attrition is also most likely to occur within an employee's first year with a manager.

Dashboard


As data is updated through Tableau and Power BI, the insights change to reflect the new reality. This is the benefit of hosting KPI dashboards on an interactive platform: the information can be swapped and refreshed to keep pace with a fast-moving organization, or to deliver real-time results.

Creating visualizations means transforming raw data into real insights that empower and support business decisions. Stakeholders depend on graphical information to make informed choices that improve the organization's performance. It lets them recognize patterns in vast amounts of data that might otherwise stay hidden, and quickly understand what is happening without the time-consuming process of data analysis. It is important for analysts to highlight the benchmark so stakeholders can identify areas for improvement. Knowing how to present data is a huge pillar of translating information for others to grasp. Effective visualization means selecting the most appropriate graph for the type of variables, clearly labeling data points, and providing relevant context. The way data is framed can change the audience's perspective, and careless framing can distort the truth.

Tableau

Power BI

References

Gao, X., Wen, J., & Zhang, C. (2019). An improved random forest algorithm for predicting employee turnover. Mathematical Problems in Engineering, 2019, 1–12. https://doi.org/10.1155/2019/4140707

Idowuadamo. (2023, November 16). IBM HR employee attrition prediction. Kaggle. https://www.kaggle.com/code/idowuadamo/ibm-hr-employee-attrition-prediction

Kanstrén, T. (2023, September 27). A look at precision, recall, and F1-score. Towards Data Science. https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec

Madhugiri, D. (2023, September 14). Exploratory data analysis (EDA): Types, tools, process. KnowledgeHut. https://www.knowledgehut.com/blog/data-science/eda-data-science

Vikash-Analytics. (2016, December 7). HR analytics: Understanding & controlling employee attrition using predictive modeling. Fusion Analytics World. https://fusionanalyticsworld.com/hr-analytics-understanding-controlling-employee-attrition-using-predictive-modeling/

