
FairWork: A Generic Framework For Evaluating Fairness In LLM-Based Job Recommender System

Dual-perspective, dual-granularity fairness evaluation for LLM-based job recommendation: we assess bias from both the user and the recruiter sides, at individual and group levels.

Why FairWork #

Large Language Models (LLMs) enable highly personalized recommendations, but their inherent biases can harm fairness in job recommendation, affecting users and platforms alike. While prior work often covers limited dimensions, FairWork provides a comprehensive view.

Key Problem: Existing fairness evaluation methods for job recommendation systems have limited scope and don’t account for the complex bias patterns that emerge when LLMs are used for recommendation tasks.

🤗 Demo 📁 Code


What We Built #

Our framework addresses fairness evaluation through multiple dimensions:

  • Two perspectives: fairness for users and recruiters
  • Two granularities: individual-level and group-level evaluation
  • Realistic inputs: stakeholders upload profiles and job descriptions; we align candidate qualifications with job requirements
  • Actionable metrics & analysis: sensitivity to custom attributes and their intersections; quantitative fairness metrics (SPD, EO, PPV) and background analysis

Method & System #

1. Framework Overview #

Figure 1. Individual- and Group-level fairness evaluation workflow.

Counterfactual perturbation. We inject alternative values for the sensitive attributes to build a perturbed profile $x'_u=\text{Perturb}(x_u, C_u)$, then issue the same LLM prompt for the original and each perturbed profile, collect the prediction scores, and compare outcomes across settings.
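As a rough illustration of this step (a sketch, not FairWork's actual code), the snippet below builds one perturbed profile per combination of alternative attribute values, which also covers intersectional settings such as Age & Gender; the profile fields, attribute names, and values are placeholders.

```python
# Sketch of counterfactual perturbation over sensitive attributes.
# Profile fields and candidate values are illustrative placeholders.
from itertools import product

def perturb(profile: dict, candidate_values: dict) -> list[dict]:
    """Build one perturbed profile per combination of alternative
    sensitive-attribute values C_u (the Cartesian product handles intersections)."""
    keys = list(candidate_values)
    variants = []
    for combo in product(*(candidate_values[k] for k in keys)):
        variant = dict(profile)                 # copy the original profile x_u
        variant.update(dict(zip(keys, combo)))  # inject alternative values
        variants.append(variant)
    return variants

original = {"age": "45", "gender": "female",
            "skills": "Python, SQL", "experience": "8 years"}
perturbed = perturb(original, {"age": ["25", "45", "60"],
                               "gender": ["female", "male"]})
# The same prompt is then issued for the original and each perturbed profile,
# and the resulting prediction scores are compared across settings.
```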

2. Individual-level Pipeline #

Individual Analysis
  • Inputs: one user profile + job description
  • Candidate Matching: TF-IDF retrieves resumes similar to the user’s profile; if too few matches are found, the LLM generates synthetic profiles to ensure an adequate comparison pool
  • Ranking Comparison: Track how the original candidate’s rank shifts across perturbed settings to quantify fairness deviations (see the sketch after this list)
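The sketch below shows one way the matching and ranking-comparison steps could look with scikit-learn's TfidfVectorizer; the resume pool, scores, and pool size k are assumed placeholders, not the demo's actual implementation.

```python
# Sketch of TF-IDF candidate retrieval and a simple rank-change check.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_similar(user_profile: str, resume_pool: list[str], k: int = 10) -> list[int]:
    """Return indices of the k resumes most similar to the user's profile."""
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform([user_profile] + resume_pool)
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    return sims.argsort()[::-1][:k].tolist()

def rank_of_original(original_score: float, peer_scores: list[float]) -> int:
    """1-based rank of the original candidate within the comparison pool."""
    return 1 + sum(s > original_score for s in peer_scores)

# Individual-level deviation can then be read off as the change in this rank
# when the same pool is re-scored under each perturbed profile.
```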
Individual-level input interface (screenshot).
Individual-level output results (screenshot).

3. Group-level Pipeline #

Group Analysis
  • Inputs: dataset + specified sensitive attributes
  • Industry-specific Sampling: Aggregate jobs by industry, then sample representative job postings to match with user profiles
  • Evaluation: Assess system-wide fairness with standard metrics (SPD, EO, PPV difference), sensitivity scores (prediction difference), and a detailed background analysis (educational & professional distributions); a minimal metric sketch follows this list
  • We also employ an adaptive injection strategy that preserves diversity while controlling compute as attribute intersections grow combinatorially
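For concreteness, here is a minimal sketch of how the group-level metrics named above could be computed from binary recommendation decisions; the array names and 0/1 group encoding are assumptions, not FairWork's internal API.

```python
# Sketch of SPD, Equal Opportunity gap, and PPV difference for a binary
# sensitive attribute; inputs are illustrative 0/1 arrays.
import numpy as np

def group_metrics(y_true, y_pred, group):
    """y_true: relevance/hiring labels, y_pred: recommended or not,
    group: binary sensitive-attribute membership (0/1)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    a, b = group == 0, group == 1

    def rate(mask):   # P(recommended | group)
        return y_pred[mask].mean()

    def tpr(mask):    # P(recommended | relevant, group)
        return y_pred[mask & (y_true == 1)].mean()

    def ppv(mask):    # P(relevant | recommended, group)
        return y_true[mask & (y_pred == 1)].mean()

    return {"SPD": abs(rate(a) - rate(b)),
            "EO_gap": abs(tpr(a) - tpr(b)),
            "PPV_diff": abs(ppv(a) - ppv(b))}
```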
Group-level fairness metrics (screenshot).
Detailed background analysis (screenshot).

Data & Settings #

Dataset: We use the public CareerBuilder job recommendation dataset (321,235 users; 365,668 jobs), preprocessed and clustered by industry.

Demo case study: sample 150 users and 5 jobs/user; evaluate Age and Gender on Deepseek-v3 and Llama3-8b; deterministic decoding with temperature=0.
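A single deterministic scoring query could look roughly like the sketch below, assuming an OpenAI-compatible endpoint; the base URL, model id, and prompt wording are placeholders, not the demo's actual configuration.

```python
# Sketch of one deterministic fit-scoring call (temperature=0).
# base_url, model id, and prompt are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example-llm-endpoint/v1", api_key="YOUR_KEY")

def score_fit(profile: str, job: str) -> str:
    """Ask the LLM to rate candidate-job fit on a 0-100 scale."""
    resp = client.chat.completions.create(
        model="placeholder-model",   # e.g. a Deepseek-v3 or Llama3-8b deployment
        temperature=0,               # deterministic decoding, as in the demo setting
        messages=[
            {"role": "system", "content": "You are a job recommendation assistant."},
            {"role": "user", "content": (
                f"Job description:\n{job}\n\nCandidate profile:\n{profile}\n\n"
                "Rate the candidate's fit from 0 to 100. Reply with the number only.")},
        ],
    )
    return resp.choices[0].message.content
```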


Representative Findings #

Our comprehensive evaluation revealed several important patterns:

Intersectional Effects #

  • Age & Gender combined yields larger bias than either attribute alone
  • Intersectional bias patterns vary significantly across different industries

Model Differences #

  • Deepseek-v3 shows larger group-level gaps:
    • SPD: 0.0187
    • EO: 0.0299
  • Llama3-8b demonstrates more controlled bias:
    • SPD: 0.0053
    • EO: 0.0074

Industry-Specific Patterns #

Key Finding: Bias patterns are highly industry-dependent and vary between different LLM models.
  • Deepseek-v3: Bias peaks in Sales (Age/Gender) and Education (Age&Gender)
  • Llama3-8b:
    • Age bias notable in Education/Healthcare
    • Gender bias in Technology
    • Age&Gender bias in Administrative/Healthcare

How to Use #

Individual-level Evaluation #

  1. Upload one user profile and one job description
  2. Optionally set your sensitive-attribute values
  3. The system retrieves similar candidates (TF-IDF; synthesizes profiles if needed)
  4. It then perturbs the sensitive attributes and compares the rankings

Group-level Evaluation #

  1. Upload a dataset (profiles + JDs + optional interaction labels; a hypothetical layout is sketched after this list)
  2. Choose sensitive attributes
  3. The system performs industry-aware sampling
  4. It then computes fairness metrics for both the user and recruiter perspectives
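As a rough illustration of step 1, an uploaded dataset could be organized like the records below; this is a hypothetical layout, not the demo's required schema.

```python
# Hypothetical example records: profiles, job descriptions, and optional
# interaction labels. Field names are illustrative only.
dataset = {
    "profiles": [
        {"user_id": "u1", "age": "32", "gender": "female",
         "education": "BSc Computer Science", "experience": "5 years backend development"},
    ],
    "jobs": [
        {"job_id": "j1", "industry": "Technology",
         "description": "Backend engineer, Python/SQL, 3+ years experience"},
    ],
    "interactions": [  # optional labels, e.g. applied/hired
        {"user_id": "u1", "job_id": "j1", "label": 1},
    ],
}
```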

Citation #

Reference (SIGIR ’25, Padua, Italy):

Yuhan Hu, Ziyu Lyu, Lu Bai, and Lixin Cui. 2025. FairWork: A Generic Framework For Evaluating Fairness In LLM-Based Job Recommender System. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25), July 13–18, 2025, Padua, Italy. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3726302.3730145

BibTeX
@inproceedings{Hu2025FairWork,
  title     = {FairWork: A Generic Framework For Evaluating Fairness In LLM-Based Job Recommender System},
  author    = {Yuhan Hu and Ziyu Lyu and Lu Bai and Lixin Cui},
  booktitle = {Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25)},
  year      = {2025},
  address   = {Padua, Italy},
  doi       = {10.1145/3726302.3730145}
}