Learn how to build an attrition risk model using logistic regression. Step-by-step guide covering feature selection, odds ratios, and HR interventions.
You have probably lived this moment: a top performer hands in their resignation, and you are blindsided. The exit interview reveals warning signs that were hiding in plain sight — a stagnant salary, a disengaged manager, a missed promotion cycle. The data was there. You just did not have a model to surface it.
Attrition is not random. It follows patterns, and those patterns can be captured mathematically. Logistic regression is one of the most practical, interpretable tools for predicting which employees are at risk of leaving — and more importantly, why. This guide walks you through building an attrition risk model from scratch, interpreting the results in plain language, and turning predictions into retention strategies that actually work.
Logistic regression is a statistical method that predicts the probability of a binary outcome — in this case, whether an employee will leave (1) or stay (0). Unlike a black-box machine learning model, logistic regression gives you something invaluable: interpretable coefficients called odds ratios. These tell you exactly how much each factor increases or decreases the likelihood of attrition.
For HR leaders, this transparency matters. When you walk into a leadership meeting and say "employees with more than two years without a promotion are 2.4 times more likely to leave," that is a story executives understand and act on.
Linear regression predicts a continuous value (like salary). Logistic regression predicts a probability between 0 and 1. It uses a sigmoid function to squeeze predictions into that range, which makes it naturally suited for yes/no outcomes like attrition, promotion decisions, or engagement risk flags.
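The sigmoid squeeze is easy to see in a few lines. This is a toy sketch (the scores are invented for illustration), not part of the model-building workflow itself:

```python
import numpy as np

def sigmoid(z):
    # Map any real-valued score to a probability strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# A strongly negative score maps near 0 (stay), a strongly positive one near 1 (leave)
print(round(sigmoid(-4), 3))  # 0.018
print(sigmoid(0))             # 0.5
print(round(sigmoid(4), 3))   # 0.982
```

No matter how extreme the input score, the output stays inside (0, 1), which is what lets the model's output be read directly as an attrition probability.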
Before any modeling begins, you need a clean, structured dataset. The quality of your predictions depends entirely on the quality of your inputs.
Start with these core variables from your HRIS, payroll, and engagement platforms:
Handle missing values deliberately. For numeric fields like satisfaction scores, median imputation works well. For categorical fields, create an "Unknown" category rather than dropping rows. Remove duplicates and verify that your target variable (attrition: yes/no) is coded consistently.
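In pandas, those cleaning steps take only a few lines. The tiny DataFrame below is invented to illustrate the pattern:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "satisfaction": [3.0, np.nan, 4.0, 2.0],
    "department": ["Sales", None, "HR", "Sales"],
})

# Numeric field: impute with the median rather than dropping the row
df["satisfaction"] = df["satisfaction"].fillna(df["satisfaction"].median())

# Categorical field: keep the row and flag the gap explicitly
df["department"] = df["department"].fillna("Unknown")

# Remove exact duplicate rows
df = df.drop_duplicates()
```

The "Unknown" category preserves sample size and can itself turn out to be predictive (missing engagement data is sometimes a signal in its own right).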
Encode categorical variables using one-hot encoding. For example, "Department" becomes separate binary columns: Department_Sales, Department_Engineering, Department_HR. This lets the model measure the independent effect of each category.
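With pandas, `get_dummies` does this in one call (the department values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Department": ["Sales", "Engineering", "HR"]})

# One binary column per category, named Department_<value>
encoded = pd.get_dummies(df, columns=["Department"])
print(encoded.columns.tolist())
# ['Department_Engineering', 'Department_HR', 'Department_Sales']
```

One practical note: to avoid perfect multicollinearity among the dummy columns, you can pass `drop_first=True` so one category serves as the baseline.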
Attrition datasets are almost always imbalanced — typically 80-85% stayed, 15-20% left. If you train on this raw distribution, the model will learn to predict "stay" for everyone and achieve high accuracy while being useless for your actual goal.
Use SMOTE (Synthetic Minority Over-sampling Technique) or adjust class weights in your logistic regression parameters to give the minority class (leavers) proportionally more influence during training.
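SMOTE requires the separate `imbalanced-learn` package; the class-weight route stays inside scikit-learn. As a sketch of what "balanced" weighting actually computes, using an invented 85/15 split:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target: 85% stayed (0), 15% left (1) -- a typical attrition imbalance
y = np.array([0] * 85 + [1] * 15)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
# Balanced formula: n_samples / (n_classes * class_count)
# stayers: 100 / (2 * 85) = 0.59   leavers: 100 / (2 * 15) = 3.33
print(dict(zip([0, 1], np.round(weights, 2))))
```

Each leaver counts roughly 3.3 times as much as a stayer during training, which is exactly the correction `class_weight='balanced'` applies automatically when you fit the model.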
Not every variable in your dataset belongs in the model. Feature selection improves accuracy and keeps the model interpretable.
Start by examining correlations between your independent variables. If two features are highly correlated (above 0.7), keep the one with a stronger theoretical connection to attrition. For example, "Years at Company" and "Years in Current Role" often correlate highly — keep both only if you believe they capture distinct risk factors.
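A quick way to surface those pairs is a correlation matrix scan. The synthetic data below is constructed so that the two tenure variables correlate, purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
years_at_company = rng.uniform(0, 15, 200)
df = pd.DataFrame({
    "years_at_company": years_at_company,
    # Built to track tenure closely, so the pair should be flagged
    "years_in_role": years_at_company * 0.8 + rng.normal(0, 0.5, 200),
    "monthly_income": rng.uniform(3000, 12000, 200),
})

corr = df.corr().abs()
# Flag every pair of distinct features above the 0.7 threshold
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and corr.loc[a, b] > 0.7]
print(high)  # [('years_at_company', 'years_in_role')]
```

Each flagged pair is then a judgment call: keep the feature with the stronger theoretical link to attrition, or keep both only if they genuinely capture distinct risks.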
Calculate variance inflation factor (VIF) scores for each predictor. A VIF above 5 signals problematic multicollinearity, meaning the variable's effect is entangled with other predictors. Remove or combine the offending variables.
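The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others (statsmodels ships this as `variance_inflation_factor`, but it is also a few lines of scikit-learn). A sketch on synthetic data, where one feature is deliberately a near-duplicate of another:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
tenure = rng.uniform(0, 15, 300)
X = pd.DataFrame({
    "tenure": tenure,
    # Near-duplicate of tenure, so its VIF should explode
    "years_in_role": tenure * 0.9 + rng.normal(0, 0.3, 300),
    "monthly_income": rng.uniform(3000, 12000, 300),
})

def vif(df, col):
    # VIF = 1 / (1 - R^2) from regressing one predictor on the rest
    others = df.drop(columns=[col])
    r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
    return 1.0 / (1.0 - r2)

vifs = {c: round(vif(X, c), 1) for c in X.columns}
# tenure and years_in_role land far above the 5 threshold; monthly_income stays near 1
```

Anything far above 5 is a candidate for removal or for combining with its entangled partner.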
After fitting an initial model with all candidate features, examine p-values. Features with p-values above 0.05 are not contributing meaningful predictive power. Remove them iteratively using backward elimination: drop the least significant variable, refit, and repeat until all remaining features are statistically significant.
After rigorous selection, a strong attrition model typically retains 8-12 features. A common high-performing set includes: overtime frequency, years since last promotion, job satisfaction, monthly income, age, number of companies worked, work-life balance rating, and distance from home.
Split your data into training (70-80%) and testing (20-30%) sets. Use stratified splitting to maintain the same attrition ratio in both sets.
In Python, this is straightforward with scikit-learn:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stratified split keeps the attrition ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# class_weight='balanced' compensates for the rarity of leavers
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)
The class_weight='balanced' parameter automatically adjusts for imbalanced classes, giving you better sensitivity to actual attrition cases.
Accuracy alone is misleading for imbalanced data. Focus on these metrics:
Generate a confusion matrix to visualize true positives, true negatives, false positives, and false negatives. This grounds your metrics in actual employee counts.
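Here is a sketch of those calculations on invented predictions for 20 employees (1 = left, 0 = stayed):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy ground truth and predictions, constructed for illustration
y_true = np.array([0] * 14 + [1] * 6)
y_pred = np.array([0] * 12 + [1] * 2 + [1] * 4 + [0] * 2)

# ravel() returns counts in the order TN, FP, FN, TP for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")                   # TN=12 FP=2 FN=2 TP=4
print(f"precision={precision_score(y_true, y_pred):.2f}")   # 0.67
print(f"recall={recall_score(y_true, y_pred):.2f}")         # 0.67
```

In attrition work, recall on the leaver class usually matters most: a false negative is an at-risk employee you never flagged, while a false positive just means an unnecessary stay conversation.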
This is where logistic regression earns its place in HR analytics. Convert the model coefficients to odds ratios by exponentiating them:
import numpy as np
import pandas as pd

# Exponentiate the coefficients to convert log-odds into odds ratios
odds_ratios = np.exp(model.coef_[0])
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Odds Ratio': odds_ratios
}).sort_values('Odds Ratio', ascending=False)
Do not present raw odds ratios to leadership. Translate them into statements like:
These statements move conversations from "we should probably do something about retention" to "here is exactly where to invest."
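The translation itself is mechanical: an odds ratio of 2.3 means the odds of leaving rise by 130%, and a ratio below 1 means a protective effect. A small hypothetical helper (the function name and example values are invented):

```python
def odds_ratio_to_statement(feature, or_value):
    # An odds ratio of 2.3 -> +130% odds; 0.8 -> -20% odds
    pct = (or_value - 1) * 100
    direction = "increases" if or_value > 1 else "decreases"
    return f"{feature} {direction} the odds of leaving by {abs(pct):.0f}%"

print(odds_ratio_to_statement("Working overtime", 2.3))
# Working overtime increases the odds of leaving by 130%
print(odds_ratio_to_statement("High job satisfaction", 0.8))
# High job satisfaction decreases the odds of leaving by 20%
```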
A model that sits in a Jupyter notebook changes nothing. The value comes from operationalizing predictions into targeted retention programs.
Score every active employee and segment them into risk tiers:
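The scoring-and-tiering step is a few lines of pandas. The probabilities below are faked with random numbers so the sketch runs standalone, and the 0.3 / 0.6 cut points are illustrative thresholds you should calibrate to your own base rate:

```python
import numpy as np
import pandas as pd

# In practice, score active employees with the fitted model:
#   scores = pd.Series(model.predict_proba(X_active)[:, 1], index=X_active.index)
rng = np.random.default_rng(7)
scores = pd.Series(rng.random(10), name="attrition_risk")

# Bucket probabilities into tiers (cut points are assumptions, not model output)
tiers = pd.cut(scores, bins=[0, 0.3, 0.6, 1.0],
               labels=["low", "medium", "high"], include_lowest=True)
summary = tiers.value_counts()
print(summary)
```

Each tier then gets a different playbook: high-risk employees warrant immediate stay interviews, while medium-risk segments are candidates for development and recognition programs.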
Map your top odds-ratio drivers to specific programs:
Attrition drivers shift over time. Retrain your model quarterly with fresh data to keep predictions calibrated. Track whether interventions actually reduce predicted risk scores in subsequent model runs — this closes the feedback loop between prediction and action.
Building a logistic regression model is the analytical foundation. Scaling it across your organization — keeping data fresh, automating risk scores, triggering workflows — is where most HR teams hit a wall.
PeoplePilot Analytics connects directly to your HRIS, payroll, and engagement data to run attrition risk models continuously. Instead of a one-time analysis, you get a living dashboard that updates risk scores as employee data changes. Pair it with PeoplePilot Surveys to feed real-time engagement signals into your model, and PeoplePilot Learning to automatically recommend development programs for medium-risk employees before they tip into the danger zone.
The goal is not just to predict who will leave. It is to give you the time and insight to change the outcome.
Aim for at least two to three years of data with a minimum of 200-300 attrition events. Logistic regression needs sufficient examples of both outcomes (stayed and left) to learn meaningful patterns. If your dataset is small, consider combining data across business units or using regularization techniques to prevent overfitting.
Standard logistic regression assumes a linear relationship between features and the log-odds of the outcome. If you suspect non-linear effects — for example, satisfaction having a stronger impact at very low levels — you can create polynomial features or bin continuous variables into categories (e.g., low, medium, high satisfaction) before fitting the model.
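The binning approach is one `pd.cut` call before encoding. The satisfaction scores and bin edges below are invented for illustration:

```python
import pandas as pd

# Hypothetical 1-5 satisfaction scores
satisfaction = pd.Series([1.2, 2.8, 3.5, 4.9, 2.1])

# Bin the continuous score so the model can learn a different
# effect for each band instead of one linear slope
binned = pd.cut(satisfaction, bins=[1, 2.5, 4, 5],
                labels=["low", "medium", "high"], include_lowest=True)
print(binned.tolist())  # ['low', 'medium', 'medium', 'high', 'low']

# One-hot encode the bins before refitting the model
features = pd.get_dummies(binned, prefix="satisfaction")
```

If very low satisfaction is the real danger zone, the "low" dummy will pick up a large odds ratio on its own, which a single linear term would smear across the whole scale.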
Focus on odds ratios translated into plain-language statements. Avoid discussing coefficients, p-values, or sigmoid functions. Instead, present a ranked list of attrition drivers with statements like "Working overtime increases the odds of leaving by 130%." Use a simple visual showing the top five risk factors and their relative impact. Decision-makers care about what to do, not how the math works.
Treat it as an early warning, not a certainty. Schedule a confidential stay interview to understand their current experience. Review their compensation against market benchmarks, assess their growth trajectory, and check whether their manager relationship is healthy. The model gives you lead time — use it to address concerns before the employee starts interviewing elsewhere.