AI Bias Testing Methods for Enterprise: Synthetic vs Real Data Approaches
EU AI Act Article 10 requires high-risk AI systems to be tested for bias before deployment. NYC Local Law 144 requires annual bias audits for hiring AI. Colorado SB 24-205 requires impact assessments covering bias. Here is a practical guide to what bias testing actually involves.
What regulations require bias testing
EU AI Act Article 10 (data governance for high-risk AI), NYC Local Law 144 (annual bias audit for hiring AEDTs), Colorado SB 24-205 (impact assessment covering demographic bias), ECOA/FCRA (disparate impact in credit AI), and EU GDPR Article 22 (automated decision safeguards) all require some form of bias assessment. The methods described here address the bias-assessment requirements shared across these frameworks.
What Is AI Bias — and Which Types Matter?
AI bias refers to systematic errors in AI outputs that unfairly disadvantage particular groups. Not all bias is equal from a regulatory perspective. The types most likely to create legal exposure:
Disparate impact bias
Relevant frameworks: ECOA, FCRA, EU AI Act, Colorado AI Act. The AI produces systematically different outcomes for different demographic groups, even without using protected characteristics as direct inputs. A credit model that approves loans at 30% for one race and 60% for another has disparate impact regardless of whether race is an input.
Disparate treatment bias
Relevant frameworks: employment law, ECOA, EU AI Act, GDPR. The AI treats individuals differently based on protected characteristics (race, gender, age, disability). This is direct discrimination and is easier to detect but rarer in well-designed systems.
Representation bias
Relevant frameworks: EU AI Act Article 10, NYC LL144, sector laws. The training data underrepresents certain groups, causing the AI to perform worse for those groups. A facial recognition system trained on lighter skin tones that performs poorly on darker skin tones is a classic example.
Measurement bias
Relevant frameworks: EU AI Act Article 10, ECOA for credit AI. The labels used to train the AI are themselves biased. A hiring AI trained on historical promotion data will learn from a biased historical promotion process.
Aggregation bias
Relevant frameworks: EU AI Act, FDA (for medical AI). The AI uses a single model across groups where different models would be more appropriate. A medical AI trained primarily on male patients may perform poorly on female patients.
The Core Bias Testing Methods
1. Disparate Impact Analysis (DIA)
The most widely required method. DIA measures whether an AI system produces different outcomes for different demographic groups. The key metric is the 80% rule (4/5ths rule) from US employment law: if the selection rate for any protected group is less than 80% of the rate for the group with the highest selection rate, there is prima facie adverse impact.
How to run it:
- Collect AI outputs and ground-truth outcomes for a representative test set
- Segment results by each protected characteristic (sex, race, age, disability where available)
- Calculate selection rates (or positive outcome rates) for each subgroup
- Apply the 4/5ths rule: flag any group whose rate is below 80% of the highest group's rate
- Document results and statistical significance (p-value, confidence intervals)
Limitation: DIA requires actual demographic data about test subjects — which may not be available for privacy reasons. Synthetic data approaches (see below) solve this problem.
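As a concrete illustration, here is a minimal sketch of the 4/5ths calculation described above, assuming pandas and a test-set DataFrame with hypothetical columns `group` (the protected characteristic) and `selected` (1 for a positive outcome). The significance-testing step is omitted for brevity.

```python
# Minimal sketch of a disparate impact (4/5ths rule) check.
# Assumes a DataFrame with hypothetical columns "group" and "selected".
import pandas as pd

def disparate_impact_report(df: pd.DataFrame, group_col: str = "group",
                            outcome_col: str = "selected") -> pd.DataFrame:
    """Compute selection rates per group and flag 4/5ths-rule violations."""
    rates = df.groupby(group_col)[outcome_col].mean().rename("selection_rate")
    report = rates.to_frame()
    # Impact ratio: each group's rate relative to the highest-rate group.
    report["impact_ratio"] = report["selection_rate"] / report["selection_rate"].max()
    # Prima facie adverse impact where the ratio falls below 0.8.
    report["adverse_impact_flag"] = report["impact_ratio"] < 0.8
    return report

if __name__ == "__main__":
    # Toy data: group A selected at 60%, group B at 30% -> B is flagged.
    data = pd.DataFrame({
        "group": ["A"] * 100 + ["B"] * 100,
        "selected": [1] * 60 + [0] * 40 + [1] * 30 + [0] * 70,
    })
    print(disparate_impact_report(data))
```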
2. Counterfactual Fairness Testing
Counterfactual testing asks: if everything about an individual were the same except their protected characteristic, would the AI make the same decision?
Method: Create counterfactual versions of test cases by changing only the protected characteristic (or proxies for it), rerun through the AI, and measure how often the output changes. A high counterfactual change rate indicates the AI is implicitly using the protected characteristic.
Example: Send identical job applications to a hiring AI where only the applicant name varies (English names vs non-English names). Different selection rates reveal demographic bias even when no demographic data is an input.
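A minimal sketch of that counterfactual loop, assuming a hypothetical `model_predict` callable and test cases represented as dicts; the attribute name and swap function are placeholders you would adapt to your own inputs.

```python
# Minimal sketch of counterfactual fairness testing: swap only the protected
# characteristic (or a proxy for it) and measure how often the output changes.
from typing import Callable, Iterable

def counterfactual_change_rate(cases: Iterable[dict],
                               model_predict: Callable[[dict], object],
                               attribute: str,
                               swap: Callable[[object], object]) -> float:
    """Fraction of cases whose prediction changes when only `attribute` is swapped."""
    changed, total = 0, 0
    for case in cases:
        original = model_predict(case)
        counterfactual = {**case, attribute: swap(case[attribute])}
        if model_predict(counterfactual) != original:
            changed += 1
        total += 1
    return changed / total if total else 0.0
```

In the hiring example above, `swap` could map applicant names between demographic reference lists while leaving every other field untouched; the goal is a change rate close to zero.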
3. Subgroup Performance Testing
Beyond outcome rates, test whether the AI's accuracy differs across groups. An AI that is 90% accurate for majority groups but 65% accurate for minority groups exhibits performance bias, even if it produces similar selection rates.
Metrics to compare across subgroups (see the sketch after this list):
- Accuracy, precision, recall, F1 by demographic group
- False positive rate and false negative rate by group (these should be equalised where possible)
- Calibration — are confidence scores equally reliable across groups?
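A sketch of the per-group error-rate comparison, assuming scikit-learn, a binary classifier, and hypothetical arrays `y_true`, `y_pred`, and `groups`; calibration checks would sit alongside this.

```python
# Minimal sketch of subgroup performance testing for a binary classifier.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def subgroup_error_rates(y_true, y_pred, groups) -> dict:
    """Return accuracy, FPR, and FNR per demographic group."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    results = {}
    for g in np.unique(groups):
        mask = groups == g
        tn, fp, fn, tp = confusion_matrix(y_true[mask], y_pred[mask],
                                          labels=[0, 1]).ravel()
        results[g] = {
            "accuracy": accuracy_score(y_true[mask], y_pred[mask]),
            # Large gaps in these rates between groups are the warning sign.
            "false_positive_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
            "false_negative_rate": fn / (fn + tp) if (fn + tp) else float("nan"),
        }
    return results
```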
4. Slice-Based Evaluation
A systematic approach to subgroup testing. Define all relevant population slices (e.g., female + over 50 + urban; male + under 30 + rural), measure performance on each slice, and identify which combinations of characteristics produce the worst outcomes.
Tools like SliceFinder (Google) and Errudite automate slice discovery to find performance gaps you would not have thought to look for.
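The exhaustive version of slice evaluation can be sketched directly, assuming pandas and a results DataFrame with hypothetical columns `y_true`, `y_pred`, and one column per characteristic; dedicated tools add smarter slice search and significance testing on top of this idea.

```python
# Minimal sketch of exhaustive slice-based evaluation over characteristic columns.
from itertools import product
import pandas as pd

def slice_accuracies(df: pd.DataFrame, slice_cols: list[str],
                     min_size: int = 30) -> pd.DataFrame:
    """Accuracy for every combination of slice values, skipping tiny slices."""
    rows = []
    value_sets = [df[c].unique() for c in slice_cols]
    for combo in product(*value_sets):
        mask = pd.Series(True, index=df.index)
        for col, val in zip(slice_cols, combo):
            mask &= df[col] == val
        subset = df[mask]
        if len(subset) < min_size:  # too small to be statistically meaningful
            continue
        rows.append({**dict(zip(slice_cols, combo)),
                     "n": len(subset),
                     "accuracy": (subset["y_true"] == subset["y_pred"]).mean()})
    # Worst-performing slices first.
    return pd.DataFrame(rows).sort_values("accuracy") if rows else pd.DataFrame()
```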
Synthetic Data vs Real Data for Bias Testing
| Approach | How It Works | Best For | Limitations |
|---|---|---|---|
| Real production data with labels | Use actual historical decisions with demographic labels; measure outcome rates by group | Credit, hiring AI with historical outcomes | Privacy constraints; demographic data may be absent; historical labels may themselves be biased |
| Synthetic test cases (LLM-generated) | Generate thousands of realistic test cases varying only protected characteristics; run through AI | Counterfactual testing; NLP/CV AI; when real data lacks demographic labels | Synthetic cases may not fully represent real distribution; bias in synthetic data generator possible |
| Augmented real data | Take real cases and flip protected characteristics (or proxies); measure output change | Structured data AI (credit, hiring); quantifying direct discrimination | Difficult for complex unstructured inputs; proxy identification requires domain knowledge |
| Red-teaming with adversarial inputs | Human testers create edge cases designed to surface bias | Identifying unknown failure modes; regulatory-facing bias audit | Labour intensive; limited systematic coverage; depends on tester creativity |
| Independent bias audit | Third-party auditors test the AI independently using their own methodology | NYC LL144 compliance (requires independent auditor); regulatory credibility | Cost $10,000–$100,000+; annual requirement for in-scope NYC hiring AI |
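To illustrate the synthetic and augmented rows, here is a minimal template-based sketch (not LLM generation) that produces hiring-style test cases differing only in a name proxy for demographic group. The template, profiles, and name lists are hypothetical placeholders; in practice the proxy lists would need to come from a vetted source.

```python
# Minimal sketch of template-based synthetic test-case generation:
# each profile is rendered once per name, so cases differ only in the name proxy.
from itertools import product

TEMPLATE = ("Applicant: {name}\n"
            "Experience: {years} years as a {role}\n"
            "Education: {degree}")

NAME_GROUPS = {  # illustrative placeholder proxy lists, not a validated instrument
    "group_a": ["Emily Walsh", "Greg Baker"],
    "group_b": ["Nguyen Thi Lan", "Jamal Washington"],
}
PROFILES = [
    {"years": 5, "role": "software engineer", "degree": "BSc Computer Science"},
    {"years": 12, "role": "project manager", "degree": "MBA"},
]

def generate_cases():
    """Yield (group, profile_id, text) tuples that differ only in the applicant name."""
    for (group, names), (pid, profile) in product(NAME_GROUPS.items(), enumerate(PROFILES)):
        for name in names:
            yield group, pid, TEMPLATE.format(name=name, **profile)
```

Feeding these cases through the AI and comparing selection rates per group is then the same disparate impact calculation shown earlier, without needing real demographic data.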
Satisfying EU AI Act Article 10
EU AI Act Article 10 requires training, validation, and test data for high-risk AI systems to:
- Be subject to appropriate data governance practices
- Be free of errors and complete, to the extent reasonably possible
- Have appropriate statistical properties — including in respect of the persons or groups of persons on which the system is intended to be used
- Take into account possible biases that could affect health, safety, or fundamental rights
To satisfy Article 10, your bias testing documentation should show how each of these requirements was assessed for the datasets used to train, validate, and test the system.
Bias Mitigation Strategies
Detecting bias is only step one. When bias is found, the response depends on the root cause (a sketch of the post-processing approach follows the table):
| Root Cause | Mitigation Approach | Stage |
|---|---|---|
| Imbalanced training data | Resampling, oversampling minority groups, or generating synthetic training examples | Pre-training |
| Biased labels | Relabelling with diverse annotators; measuring inter-annotator agreement by group | Pre-training |
| Model learning proxies | Adversarial debiasing, fairness constraints during training, feature selection | During training |
| Disparate output rates | Calibration post-processing, threshold adjustments by group, Platt scaling | Post-training |
| Structural use case bias | Redesign the task — if the fundamental use of AI is discriminatory, no model change will fix it | System design |
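For the post-training row, here is a minimal sketch of group-specific threshold selection, assuming model scores in [0, 1] and a shared target selection rate. Whether group-specific thresholds are legally permissible varies by jurisdiction and use case, so treat this purely as a technical illustration.

```python
# Minimal sketch of post-processing threshold adjustment by group.
# NOTE: the legality of group-specific thresholds depends on jurisdiction;
# this only shows the mechanics.
import numpy as np

def per_group_thresholds(scores, groups, target_rate: float) -> dict:
    """Find, per group, the score threshold yielding roughly the target selection rate."""
    scores, groups = np.asarray(scores), np.asarray(groups)
    thresholds = {}
    for g in np.unique(groups):
        group_scores = scores[groups == g]
        # Scores at or above the (1 - target_rate) quantile are selected.
        thresholds[g] = float(np.quantile(group_scores, 1 - target_rate))
    return thresholds

def apply_thresholds(scores, groups, thresholds) -> np.ndarray:
    """Return 0/1 decisions using each individual's group-specific cutoff."""
    scores, groups = np.asarray(scores), np.asarray(groups)
    cutoffs = np.array([thresholds[g] for g in groups])
    return (scores >= cutoffs).astype(int)
```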
Track Your Bias Testing Documentation
ComplianceIQ helps you maintain bias testing records for each AI system, track when retesting is due, and generate the documentation required for EU AI Act Article 10 compliance.
Start Your AI Compliance Assessment