Machine Learning Meets HR: Predicting Employee Attrition with PyCaret

Written by inquiringnomad | Published 2021/08/17
Tech Story Tags: pycaret | machine-learning | hr | automated-machine-learning | shap | employee-attrition | ml | human-resources

TLDR Machine Learning (ML) can be used to help organizations produce more or, more accurately, produce efficiently. PyCaret, a low-code Python library inspired by the emerging role of citizen data scientists, is the Python version of the Caret package for R and wraps popular ML libraries such as scikit-learn. In this story, I will focus on job attrition and will demonstrate how we can use ML to help analyze the data and make predictions. An interesting case study is Unilever, which switched to a more data-driven hiring model with the help of HireVue.

Machine Learning for Business

Bolstering an organization’s productivity is probably the first application of Machine Learning (ML) in business that comes to mind. Any company that switches to a more data-driven approach would first and foremost utilize ML to produce more or, more accurately, to produce efficiently. Evaluating and quantifying the end result is pretty straightforward. However, there has been increasing interest in applications of ML in the human resources department. Employee (or potential employee) data can be a really powerful tool in the hands of HR practitioners, helping them accomplish their tasks quicker and, in some cases (like hiring), in a less biased manner.

Applications of ML in HR

Integrating ML in HR processes is not a distant future theory. It has been embraced by many enterprises already, and the trends show that it will soon become the norm. Most of the use cases fall into three major groups:

Hiring

Hiring the best available talent for a job is a two-way process. Certain Natural Language Processing tools can be a useful aid in composing a job description that sounds attractive and interesting to the candidate. On the other hand, ML can make the HR practitioner’s life easier when going through hundreds (or even thousands) of resumes, filtering out those that are not a great match for the role.

Having an algorithm do this can greatly reduce the unavoidable human bias as well as the amount of time required. An interesting case study is Unilever. They switched to a more data-driven hiring model with the help of HireVue, and this led to £1M+ annual cost savings, 96% candidate completion rates, a 90% reduction in time to hire, and a 16% increase in new-hire diversity. You can read the case study here.

Engagement

An ML-based system can process data collected via various intra-company sources and provide insights on how engaged the employees are in their work. These findings are easily missed by the human eye, especially for very big companies.

Personal Development and Learning

There are studies available that demonstrate how important development opportunities are for employee happiness. In some age groups (such as millennials), a robust personalized framework for growth is shown to be more important than salary. Development matters to the employer too: it is in their interest to keep their staff up-to-date with the latest trends and technologies. But blindly spending money on training employees is inefficient. Two steps are necessary here:

a) Identify the skill gap on an individual level and

b) Create a learning path for them.

“Individual” is a key word. The development plan must be as personalized as possible; there is no one-size-fits-all solution. This is where ML can help, by analyzing the employee’s data, highlighting missing or weak skills, generating a plan of action, and monitoring the progress made. As an example, you can read the case study of Amadeus, a travel technology company, which designed and built the Valamis-based Amadeus Learning Universe (ALU) global learning platform to train Amadeus employees, partners, and clients worldwide.

Attrition

You invested a lot of time and money in finding and attracting a great candidate, and they are performing as expected. You want to make sure that they are happy and committed to the company. Losing great talent is not only frustrating; great employees can also be really expensive to replace, especially in senior or managerial positions. There is a series of costs involved, ranging from finding new people to onboarding and training them.

ML algorithms can provide a clearer picture of the most important factors that led employees to leave in the past, as well as calculate the likelihood of a specific individual leaving in the future, based on past data. In this story, I will focus on job attrition and will demonstrate how we can use ML to help analyze the data and make predictions. The PyCaret library will be used to speed up the process.

PyCaret

Gartner defined the citizen data scientist as a person:

“Who creates or generates models that use advanced diagnostic analytics or predictive and prescriptive capabilities, but whose primary job function is outside the field of statistics and analytics.”

PyCaret is a library inspired by the emerging role of citizen data scientists. It is the Python version of the Caret package for R: essentially a wrapper around some of the most popular machine learning libraries, such as scikit-learn, XGBoost, and many others. It greatly simplifies the lifecycle of an ML experiment, from model training to deployment. Behind the scenes, PyCaret takes care of a great amount of the necessary coding, exposing easy-to-use functions to the user. This leads to a great reduction in the lines of code that actually need to be written.

From the documentation page:

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive.

The code

Setup and objective

I will be using the IBM HR Analytics Employee Attrition & Performance dataset available from Kaggle, which contains a mixture of numeric and categorical variables. The objective is to predict if an employee will leave the company (attrition). I will use Google Colab to train the model and perform the analysis. For this specific example, the dataset is assumed to be saved at this location:

/content/drive/MyDrive/Data/HR-IBM_ATTRITION/WA_Fn-UseC_-HR-Employee-Attrition.csv

This is a typical binary classification task: the target variable (Attrition) takes Yes/No values, which PyCaret will encode as 1/0.

Install / import packages and read the data

!pip install pycaret shap

from pycaret.classification import *
import pandas as pd
from pycaret.utils import enable_colab
enable_colab()
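
Since the dataset is stored in Google Drive, the drive needs to be mounted before the file can be read. This step only applies when running in Colab; /content/drive is the standard Colab mount point:

from google.colab import drive
drive.mount('/content/drive')  # authorize access so the CSV path below resolves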

df_ibm = pd.read_csv('/content/drive/MyDrive/Data/HR-IBM_ATTRITION/WA_Fn-UseC_-HR-Employee-Attrition.csv')
df_ibm.info()

Exploratory data analysis findings

After performing EDA, the major insights were:

  • There are no missing values
  • Three features have a constant value ('Over18', 'EmployeeCount', 'StandardHours') and 'EmployeeNumber' is just a row identifier; all four can be dropped
  • There are a lot of numeric features that do not follow a normal distribution
  • Some numeric features have only a few unique values
  • There is some multicollinearity present for some of the features
  • The variable to be predicted is highly imbalanced
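
These findings are easy to verify yourself. A minimal sketch of the kind of checks involved, using plain pandas on the dataframe we just loaded:

# Columns with a single unique value carry no information
constant_cols = [c for c in df_ibm.columns if df_ibm[c].nunique() == 1]
print(constant_cols)                       # ['Over18', 'EmployeeCount', 'StandardHours']

print(df_ibm.isnull().sum().sum())         # 0 -> no missing values
print(df_ibm['Attrition'].value_counts())  # heavily skewed towards 'No'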

So we will start by dropping the columns that don’t offer anything to the model

df_ibm.drop(['Over18','EmployeeCount','StandardHours','EmployeeNumber'],axis=1,inplace=True)

Setup PyCaret environment

Before running any model training, the environment must be set up. This is super easy: all we need to do is call the setup() function. But first, we need to take care of how the variables will be interpreted. PyCaret will infer the various variable types (numeric, categorical, etc.). We can override this behavior by providing our own association of variables with data types. Let’s create these lists and dictionaries first and have them ready to be passed as arguments to the setup function.

cat_vars = ['Department','EducationField','Gender','JobRole','MaritalStatus','OverTime',]
ordinal_vars = {
    'BusinessTravel' : ['Non-Travel','Travel_Rarely','Travel_Frequently'],
    'Education' : [str(x) for x in sorted(df_ibm['Education'].unique())],
    'EnvironmentSatisfaction' : [str(x) for x in sorted(df_ibm['EnvironmentSatisfaction'].unique())],
    'JobInvolvement' : [str(x) for x in sorted(df_ibm['JobInvolvement'].unique())],
    'JobLevel' : [str(x) for x in sorted(df_ibm['JobLevel'].unique())],
    'JobSatisfaction' : [str(x) for x in sorted(df_ibm['JobSatisfaction'].unique())],
    'PerformanceRating' : [str(x) for x in sorted(df_ibm['PerformanceRating'].unique())],
    'RelationshipSatisfaction' : [str(x) for x in sorted(df_ibm['RelationshipSatisfaction'].unique())],
    'StockOptionLevel' : [str(x) for x in sorted(df_ibm['StockOptionLevel'].unique())],
    'TrainingTimesLastYear' : [str(x) for x in sorted(df_ibm['TrainingTimesLastYear'].unique())],
    'WorkLifeBalance' : [str(x) for x in sorted(df_ibm['WorkLifeBalance'].unique())]

}
numeric_features = ['DailyRate','DistanceFromHome','Age','HourlyRate','MonthlyIncome','MonthlyRate','NumCompaniesWorked','PercentSalaryHike','TotalWorkingYears','YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager']

We are now ready to set up the environment.

experiment = setup(df_ibm,target='Attrition',
                   categorical_features=cat_vars,
                   train_size=0.8,
                   ordinal_features=ordinal_vars,
                   remove_multicollinearity=True,
                   multicollinearity_threshold = 0.9,
                   transformation=True,
                   normalize = True,
                   numeric_features=numeric_features,
                   session_id = 42)

Let me explain some of the parameters of this setup() method:

  • target: Name of the target column to be passed in as a string. The target variable can be either binary or multiclass.
  • train_size: Proportion of the dataset to be used for training and validation. By default, PyCaret splits the dataset into train and test sets on a 70/30 ratio when running setup; we have overridden this to 80/20.
  • categorical_features: If the inferred data types are not correct or the silent param is set to True, the categorical_features param can be used to overwrite or define the data types. It takes a list of strings with column names that are categorical.
  • ordinal_features: Encode categorical features as ordinal. For example, a categorical feature with ‘low’, ‘medium’, ‘high’ values where low < medium < high can be passed as ordinal_features = { ‘column_name’ : [‘low’, ‘medium’, ‘high’] }.
  • numeric_features: If the inferred data types are not correct or the silent param is set to True, the numeric_features param can be used to overwrite or define the data types. It takes a list of strings with column names that are numeric.
  • normalize: When set to True, it transforms the numeric features by scaling them. The type of scaling is defined by the normalize_method parameter (default is zscore).
  • transformation: When set to True, it applies the power transform to make data more Gaussian-like. The type of transformation is defined by the transformation_method parameter (default is yeo-johnson).
  • session_id: Controls the randomness of the experiment. It is equivalent to ‘random_state’ in scikit-learn. When None, a pseudo-random number is generated. This can be used for later reproducibility of the entire experiment.
  • remove_multicollinearity: When set to True, features with the inter-correlations higher than the defined threshold are removed. When two features are highly correlated with each other, the feature that is less correlated with the target variable is removed. Only considers numeric features.
  • multicollinearity_threshold: Threshold for correlated features. Ignored when remove_multicollinearity is not True.
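
A useful sanity check after setup() is to pull the transformed training data back out and confirm that the preprocessing did what we expect. A quick sketch using PyCaret’s get_config helper (available in PyCaret 2.x):

X_train = get_config('X_train')  # the preprocessed training split
print(X_train.shape)             # fewer columns if multicollinear features were removed
X_train.head()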

Model creation and training

To get a list of all the available classification models, we can simply run:

models()

This returns a DataFrame listing each estimator’s ID, name, and reference class.

We can conveniently see how all the above models perform on the train dataset:

top3 = compare_models(n_select=3,sort='AUC')

This function trains and evaluates the performance of all estimators available in the model library using cross-validation (10-fold is the default). The output of this function is a score grid with average cross-validated scores. In this example, I chose AUC to be the metric that models are sorted by. The top3 variable is now a list that holds the 3 best-performing models, and they can be referenced like normal Python list elements. Can you imagine? With a single line of code, we trained all the models and generated a metrics grid. Nice.
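
For instance, we can grab the best model from the list and retrieve the comparison table itself with pull(), which returns the last printed score grid as a pandas DataFrame:

best_model = top3[0]   # the estimator with the highest AUC
print(best_model)      # the underlying scikit-learn object with its hyperparameters
results_grid = pull()  # the metrics grid as a DataFrame, handy for further analysis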

It is worth mentioning here that PyCaret makes it very easy to create individual models, as well as to apply various ensembling techniques (boosting, bagging, blending, stacking). Check the documentation on how to perform these steps. Just to give you an idea, we could blend the top3 models like this:

blend = blend_models(top3)

But for now, we will use the top model, Logistic Regression. The next step would be to try and fine-tune the model. PyCaret has us covered.

tuned_lr = tune_model(top3[0],n_iter=70,optimize='AUC')

This function tunes the hyperparameters of a given estimator. Its output is a score grid with CV scores by fold for the best model, selected according to the optimize parameter.

The n_iter parameter is the number of parameter settings sampled in the search. Increasing n_iter may improve model performance but also increases the training time.

By default, tune_model performs a scikit-learn random grid search over a predefined search space.
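
If the default search space is not enough, tune_model also accepts a user-defined one through the custom_grid parameter. A minimal sketch for our logistic regression; the parameter values here are illustrative, not recommendations:

# Hypothetical search space for LogisticRegression
lr_grid = {
    'C': [0.01, 0.1, 1, 10],  # inverse regularization strength
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']   # a solver that supports both penalties
}
tuned_lr_custom = tune_model(top3[0], custom_grid=lr_grid, optimize='AUC')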

Model performance

We are now ready to evaluate the model’s performance on the holdout set (remember, PyCaret created one when we ran the setup). This is again very easy:

pred_holdout = predict_model(tuned_lr)
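
Besides printing a score grid for the holdout set, predict_model returns the holdout data with the predictions attached. In PyCaret 2.x the prediction columns are named Label and Score, so individual predictions can be inspected like this:

pred_holdout[['Label', 'Score']].head()  # predicted class and its probability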

PyCaret takes the burden off us to create the plots that would visually depict the model’s performance. Let’s see some examples:

plot_model(tuned_lr,'confusion_matrix')

plot_model(tuned_lr,'auc',save=True)

plot_model(tuned_lr,'learning',save=True)
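
If you prefer to browse all the available plots interactively rather than calling plot_model one plot at a time, there is also evaluate_model, which renders a widget with a button for each plot:

evaluate_model(tuned_lr)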

Before saving the model, one last useful step is to run the finalize_model() function. Once the predictions have been generated on the hold-out set using predict_model and we have chosen to deploy the specific model, we want to train it one final time on the entire dataset, including the hold-out.

final_tuned_lr = finalize_model(tuned_lr)

To save the final model, we need to run:

save_model(final_tuned_lr, 'final_tuned_lr')

This saves the entire transformation pipeline and trained model object as a transferable binary pickle file for later use.

As you can imagine, loading the saved model is as easy as:

final_tuned_lr_loaded = load_model('final_tuned_lr')
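
Because save_model stored the whole transformation pipeline, the loaded object can score raw, untransformed data directly. A sketch, where new_employees stands for a hypothetical DataFrame with the same raw columns as the training CSV:

# new_employees: hypothetical DataFrame with the original raw columns
new_predictions = predict_model(final_tuned_lr_loaded, data=new_employees)
new_predictions[['Label', 'Score']].head()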

Model interpretation - SHAP values

There is another feature I would like to show you: model interpretation. The implementation is based on SHAP (SHapley Additive exPlanations). SHAP values are a widely used approach from cooperative game theory that comes with desirable properties. They help us demystify how the model works and how each feature affects the final prediction. However, only tree-based models for binary classification are supported (Extra Trees Classifier, Decision Tree Classifier, Random Forest Classifier, and Light Gradient Boosting Machine). Let’s quickly demonstrate how straightforward model interpretation is with PyCaret. We will train and tune a Random Forest Classifier first. Instead of using the compare_models() function, we can create individual models like so:

rf = create_model('rf')

and then tune it:

tuned_rf = tune_model(rf,optimize='AUC')

Next, we will generate the SHAP values for each feature:

interpret_model(tuned_rf,save=True)

Now, how do we read this plot? The logic is that:

  • For each feature, every dot is one observation; the red color represents high feature values, while the blue color represents low values.
  • If the dots spread to the right, the feature pushes the prediction towards the positive class (Attrition = 1). Similarly, if the dots spread to the left, the feature pushes the prediction away from attrition.

Looking at the graph, we can conclude that:

Features that have a great effect on attrition

  • Low monthly income
  • Low total working years
  • Low age
  • High overtime hours

Features that have a medium effect on attrition

  • Low years at the company
  • A high number of companies worked before
  • High distance from home
  • High years since last promotion
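
Beyond the summary plot, interpret_model can also explain a single prediction. Passing plot='reason' together with an observation index produces a SHAP force plot for that one employee, showing which features pushed its prediction towards or away from attrition (the index here is arbitrary):

interpret_model(tuned_rf, plot='reason', observation=10)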

Further reading

I encourage you to check the PyCaret documentation page and try some of their tutorials. I just covered a small fraction of what the library can do here.

Written by inquiringnomad | Inquiring Nomad. Reluctant geek.
Published by HackerNoon on 2021/08/17