Learn how to use the chi-square test to measure training impact on employee performance. Step-by-step HR analytics guide with a worked example.
You invested six figures in a leadership development program last quarter. Completion rates look healthy. Managers are saying good things. But when the CFO asks whether the training actually moved the needle on performance, you need more than anecdotes. You need a statistically defensible answer.
The chi-square test of independence is one of the most accessible yet powerful tools for answering that question. It tells you whether two categorical variables --- like training completion status and performance rating --- are genuinely related or just appear that way by chance. No regression models, no coding bootcamp required. If you can build a pivot table, you can run a chi-square test.
This guide walks you through exactly how to do it, with a real-world HR example you can adapt to your own data today.
Most L&D teams report on activity metrics: enrollment numbers, completion rates, satisfaction scores. These metrics confirm that training happened. They do not confirm that training worked.
The gap between "completed" and "effective" is where organizations waste budget. A Brandon Hall Group study found that only 8% of organizations can demonstrate a clear business impact from their learning programs. The rest rely on Kirkpatrick Level 1 (reaction) data --- essentially, whether people enjoyed the workshop.
The chi-square test bridges that gap by testing whether employees who completed training are statistically more likely to achieve higher performance outcomes. It converts a hopeful correlation into a defensible conclusion.
The chi-square test of independence examines whether two categorical variables are associated. In HR terms, it answers questions like:

- Are employees who complete training more likely to receive higher performance ratings?
- Is participation in an onboarding program related to first-year retention?
- Does mentorship program enrollment relate to promotion outcomes?
The test compares what you actually observe in your data against what you would expect to see if the two variables were completely unrelated. If the gap between observed and expected values is large enough, you can conclude --- with statistical confidence --- that a real relationship exists.
Before running the test, confirm these conditions hold:

- Both variables are categorical (e.g., completed / not completed; exceeds / meets / below).
- Observations are independent, and each employee appears in exactly one cell of the table.
- The data are raw counts, not percentages or averages.
- Expected frequencies are at least 5 in most cells (see the sample size guidance below).
Suppose you rolled out a new data literacy training program across your organization. Six months later, you want to know: are employees who completed the training more likely to receive higher performance ratings?
You pull data from your HRIS and learning management system, cross-referencing training completion status with the most recent performance review cycle. Here is what you find across 300 employees:
| | Exceeds Expectations | Meets Expectations | Below Expectations | Row Total |
|---|---|---|---|---|
| Training Completed | 60 | 90 | 30 | 180 |
| Training Not Completed | 20 | 55 | 45 | 120 |
| Column Total | 80 | 145 | 75 | 300 |
At first glance, 33% of trained employees exceed expectations compared to only 17% of untrained employees. But is this difference statistically significant, or could it be random variation?
For each cell, the expected frequency equals (Row Total x Column Total) / Grand Total. This represents what you would see if training and performance were completely independent.
| | Exceeds Expectations | Meets Expectations | Below Expectations |
|---|---|---|---|
| Training Completed | (180 x 80) / 300 = 48.0 | (180 x 145) / 300 = 87.0 | (180 x 75) / 300 = 45.0 |
| Training Not Completed | (120 x 80) / 300 = 32.0 | (120 x 145) / 300 = 58.0 | (120 x 75) / 300 = 30.0 |
All expected frequencies exceed 5, so the chi-square assumptions are satisfied.
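If your contingency table comes from a spreadsheet export, the expected frequencies can also be computed programmatically. A minimal sketch in Python with NumPy, using the observed counts from the table above:

```python
import numpy as np

# Observed counts from the 2x3 contingency table above
# (rows: completed / not completed; columns: exceeds / meets / below)
observed = np.array([[60, 90, 30],
                     [20, 55, 45]])

# Expected frequency per cell: (row total x column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)   # [[180], [120]]
col_totals = observed.sum(axis=0, keepdims=True)   # [[80, 145, 75]]
expected = row_totals * col_totals / observed.sum()

print(expected)
# [[48. 87. 45.]
#  [32. 58. 30.]]

# Check the chi-square assumption: all expected frequencies >= 5
assert (expected >= 5).all()
```

The broadcasting trick (a column of row totals times a row of column totals) reproduces the hand calculation for every cell at once.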
The formula sums the squared differences between observed (O) and expected (E) values, divided by the expected value, across all cells:
χ² = Σ [(O - E)² / E]
Calculating each cell:

- Completed / Exceeds: (60 - 48)² / 48 = 3.000
- Completed / Meets: (90 - 87)² / 87 = 0.103
- Completed / Below: (30 - 45)² / 45 = 5.000
- Not Completed / Exceeds: (20 - 32)² / 32 = 4.500
- Not Completed / Meets: (55 - 58)² / 58 = 0.155
- Not Completed / Below: (45 - 30)² / 30 = 7.500

χ² = 3.000 + 0.103 + 5.000 + 4.500 + 0.155 + 7.500 = 20.258
The degrees of freedom equal (rows - 1) x (columns - 1) = (2 - 1) x (3 - 1) = 2.
At a significance level of 0.05, the critical chi-square value for 2 degrees of freedom is 5.991.
Since 20.258 is far greater than 5.991, you reject the null hypothesis. There is a statistically significant relationship between training completion and performance rating. The p-value here is less than 0.001, meaning there is less than a 0.1% probability of observing a difference this large if training and performance were truly unrelated.
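If you would rather not compute the statistic by hand, SciPy's `chi2_contingency` reproduces the entire calculation in a few lines. A sketch using the same observed counts (assumes `numpy` and `scipy` are installed):

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[60, 90, 30],    # training completed
                     [20, 55, 45]])   # training not completed

# Returns the statistic, p-value, degrees of freedom, and expected table
stat, p_value, dof, expected = chi2_contingency(observed)

print(f"chi-square = {stat:.3f}, dof = {dof}, p = {p_value:.6f}")
# chi-square = 20.259, dof = 2, p = 0.000040

# Critical value at the 0.05 significance level
critical = chi2.ppf(0.95, dof)   # 5.991
print("reject H0" if stat > critical else "fail to reject H0")
# reject H0
```

Note that `chi2_contingency` applies the Yates continuity correction only for 2x2 tables, so for this 2x3 table it matches the manual calculation exactly.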
The numbers tell a clear story. Employees who completed the data literacy training were significantly more likely to exceed performance expectations and significantly less likely to fall below expectations. Specifically:

- 33% of trained employees (60 of 180) exceeded expectations, versus 17% of untrained employees (20 of 120).
- Only 17% of trained employees (30 of 180) fell below expectations, versus 38% of untrained employees (45 of 120).
This does not prove causation on its own. Employees who voluntarily complete training may be more motivated to begin with. But combined with other evidence --- pre/post assessments, manager observations, controlled rollout designs --- the chi-square result provides a strong quantitative foundation for continued investment in the program.
Start with clean data. The most common failure point is not the statistics --- it is messy data. Ensure training completion records are accurate and performance ratings are standardized across departments. If your learning management system and performance data live in separate tools, invest the time to match records properly.
Choose meaningful categories. Collapsing a 5-point performance scale into 3 categories (Exceeds / Meets / Below) often produces cleaner results and satisfies the minimum expected frequency requirement. Avoid categories with very few observations.
Report effect size, not just significance. A statistically significant result with a tiny effect size may not justify the training investment. Cramer's V is the standard effect size measure for chi-square tests: V = sqrt(χ² / (n x min(r - 1, c - 1))), where n is the sample size and r and c are the number of rows and columns. For this example, V = sqrt(20.258 / (300 x 1)) = 0.26, indicating a moderate association.
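As a sketch, Cramer's V falls out directly from the chi-square statistic; for the 2x3 table above, min(r - 1, c - 1) = 1 (assumes `numpy` and `scipy` are installed):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[60, 90, 30],
                     [20, 55, 45]])

chi2_stat = chi2_contingency(observed)[0]
n = observed.sum()                    # 300 observations
min_dim = min(observed.shape) - 1     # min(rows, cols) - 1 = 1

# Cramer's V: effect size on a 0-to-1 scale
cramers_v = np.sqrt(chi2_stat / (n * min_dim))
print(round(cramers_v, 2))   # 0.26
```

Recent SciPy versions (1.7+) also expose this as `scipy.stats.contingency.association(observed, method="cramer")`, which returns the same value.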
Automate recurring analyses. If you run this test every review cycle, build it into your analytics workflow so results refresh automatically. This transforms a one-time study into an ongoing measurement system.
The chi-square test is ideal for categorical data, but some training impact questions require different approaches. If your outcome variable is continuous (like a numeric assessment score rather than a rating category), consider an independent samples t-test or ANOVA. If you need to control for confounding variables like tenure or department, logistic regression may be more appropriate.
The chi-square test is your starting point --- the first credible answer you can bring to a stakeholder meeting. As your analytics maturity grows, you can layer on more sophisticated methods.
There is no single minimum sample size, but a practical guideline is that at least 80% of expected cell frequencies should be 5 or greater, and no expected frequency should be less than 1. For a 2x3 contingency table like the example above, this typically means you need at least 60 to 100 observations. If your sample is smaller, use Fisher's exact test instead.
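SciPy's `fisher_exact` covers the 2x2 case when samples are small. The counts below are a hypothetical small pilot, not data from the example above:

```python
from scipy.stats import fisher_exact

# Hypothetical pilot cohort: rows = completed / not completed,
# columns = exceeds expectations / does not exceed
table = [[8, 4],
         [2, 10]]

# Exact test: no minimum expected-frequency requirement
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.1f}, p = {p_value:.4f}")
```

Note that `fisher_exact` handles only 2x2 tables; for a sparse 2x3 table like the running example, collapse the rating scale into two categories first.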
No. The chi-square test identifies a statistically significant association between two variables, but it does not establish causation. Employees who complete training may differ from those who do not in motivation, tenure, or role type. To strengthen causal claims, combine chi-square results with controlled study designs, pre/post assessments, and multivariate analysis that accounts for confounding variables.
Comparing raw percentages (e.g., "33% of trained employees exceeded expectations vs. 17% of untrained employees") is descriptive, not inferential. It tells you what happened in your sample but not whether the difference is large enough to be meaningful beyond random variation. The chi-square test adds statistical rigor by calculating the probability that the observed difference could have occurred by chance alone. This is the difference between an observation and an evidence-based conclusion.
You can run a chi-square test in Excel (using the CHISQ.TEST function), Google Sheets, Python (scipy.stats.chi2_contingency), R (chisq.test), or any modern analytics platform. The key requirement is a clean contingency table with accurate training and performance data. Platforms like PeoplePilot Analytics can automate the data preparation by connecting your LMS and performance review data in one place, eliminating the manual data matching that often introduces errors.