AI Bias Testing Methods for Enterprise: Synthetic vs Real Data Approaches
EU AI Act Article 10 requires high-risk AI systems to be tested for bias before deployment. NYC Local Law 144 requires annual bias audits for hiring AI. Colorado SB 24-205 requires impact assessments covering bias. Here is a practical guide to what bias testing actually involves.
What regulations require bias testing
EU AI Act Article 10 (data governance for high-risk AI), NYC Local Law 144 (annual bias audit for hiring AEDTs), Colorado SB 24-205 (impact assessment covering demographic bias), ECOA/FCRA (disparate impact in credit AI), and EU GDPR Article 22 (automated decision safeguards) all require some form of bias assessment. The methods described here address the bias-assessment requirements shared across these frameworks.
What Is AI Bias — and Which Types Matter?
AI bias refers to systematic errors in AI outputs that unfairly disadvantage particular groups. Not all bias is equal from a regulatory perspective. The types most likely to create legal exposure:
Disparate impact bias
Relevant frameworks: ECOA, FCRA, EU AI Act, Colorado AI Act. The AI produces systematically different outcomes for different demographic groups, even without using protected characteristics as direct inputs. A credit model that approves loans at 30% for one race and 60% for another has disparate impact regardless of whether race is an input.
Disparate treatment bias
Relevant frameworks: employment law, ECOA, EU AI Act, GDPR. The AI treats individuals differently based on protected characteristics (race, gender, age, disability). This is direct discrimination and is easier to detect but rarer in well-designed systems.
Representation bias
Relevant frameworks: EU AI Act Article 10, NYC LL144, sector laws. The training data underrepresents certain groups, causing the AI to perform worse for those groups. A facial recognition system trained on lighter skin tones that performs poorly on darker skin tones is a classic example.
Measurement bias
Relevant frameworks: EU AI Act Article 10, ECOA for credit AI. The labels used to train the AI are themselves biased. A hiring AI trained on historical promotion data will learn from a biased historical promotion process.
Aggregation bias
Relevant frameworks: EU AI Act, FDA (for medical AI). The AI uses a single model across groups where different models would be more appropriate. A medical AI trained primarily on male patients may perform poorly on female patients.
The Core Bias Testing Methods
1. Disparate Impact Analysis (DIA)
The most widely required method. DIA measures whether an AI system produces different outcomes for different demographic groups. The key metric is the 80% rule (4/5ths rule) from US employment law: if the selection rate for any protected group is less than 80% of the rate for the group with the highest selection rate, there is prima facie adverse impact.
How to run it:
- Collect AI outputs and ground-truth outcomes for a representative test set
- Segment results by each protected characteristic (sex, race, age, disability where available)
- Calculate selection rates (or positive outcome rates) for each subgroup
- Apply the 4/5ths rule: flag any group whose rate is below 80% of the highest group's rate
- Document results and statistical significance (p-value, confidence intervals)
Limitation: DIA requires actual demographic data about test subjects — which may not be available for privacy reasons. Synthetic data approaches (see below) solve this problem.
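As a concrete illustration, here is a minimal sketch of the 4/5ths calculation described above, assuming pandas and a test-set DataFrame with hypothetical columns `group` (the protected characteristic) and `selected` (1 for a positive outcome). The significance-testing step is omitted for brevity.

```python
# Minimal sketch of a disparate impact (4/5ths rule) check.
# Assumes a DataFrame with hypothetical columns "group" and "selected".
import pandas as pd

def disparate_impact_report(df: pd.DataFrame, group_col: str = "group",
                            outcome_col: str = "selected") -> pd.DataFrame:
    """Compute selection rates per group and flag 4/5ths-rule violations."""
    rates = df.groupby(group_col)[outcome_col].mean().rename("selection_rate")
    report = rates.to_frame()
    # Impact ratio: each group's rate relative to the highest-rate group.
    report["impact_ratio"] = report["selection_rate"] / report["selection_rate"].max()
    # Prima facie adverse impact where the ratio falls below 0.8.
    report["adverse_impact_flag"] = report["impact_ratio"] < 0.8
    return report

if __name__ == "__main__":
    # Toy data: group A selected at 60%, group B at 30% -> B is flagged.
    data = pd.DataFrame({
        "group": ["A"] * 100 + ["B"] * 100,
        "selected": [1] * 60 + [0] * 40 + [1] * 30 + [0] * 70,
    })
    print(disparate_impact_report(data))
```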
2. Counterfactual Fairness Testing
Counterfactual testing asks: if everything about an individual were the same except their protected characteristic, would the AI make the same decision?
Method: Create counterfactual versions of test cases by changing only the protected characteristic (or proxies for it), rerun through the AI, and measure how often the output changes. A high counterfactual change rate indicates the AI is implicitly using the protected characteristic.
Example: Send identical job applications to a hiring AI where only the applicant name varies (English names vs non-English names). Different selection rates reveal demographic bias even when no demographic data is an input.
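A minimal sketch of that counterfactual loop, assuming a hypothetical `model_predict` callable and test cases represented as dicts; the attribute name and swap function are placeholders you would adapt to your own inputs.

```python
# Minimal sketch of counterfactual fairness testing: swap only the protected
# characteristic (or a proxy for it) and measure how often the output changes.
from typing import Callable, Iterable

def counterfactual_change_rate(cases: Iterable[dict],
                               model_predict: Callable[[dict], object],
                               attribute: str,
                               swap: Callable[[object], object]) -> float:
    """Fraction of cases whose prediction changes when only `attribute` is swapped."""
    changed, total = 0, 0
    for case in cases:
        original = model_predict(case)
        counterfactual = {**case, attribute: swap(case[attribute])}
        if model_predict(counterfactual) != original:
            changed += 1
        total += 1
    return changed / total if total else 0.0
```

In the hiring example above, `swap` could map applicant names between demographic reference lists while leaving every other field untouched; the goal is a change rate close to zero.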
3. Subgroup Performance Testing
Beyond outcome rates, test whether the AI's accuracy differs across groups. An AI that is 90% accurate for majority groups but 65% accurate for minority groups exhibits performance bias, even if it produces similar selection rates.
Metrics to compare across subgroups (see the sketch after this list):
- Accuracy, precision, recall, F1 by demographic group
- False positive rate and false negative rate by group (these should be equalised where possible)
- Calibration — are confidence scores equally reliable across groups?
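A sketch of the per-group error-rate comparison, assuming scikit-learn, a binary classifier, and hypothetical arrays `y_true`, `y_pred`, and `groups`; calibration checks would sit alongside this.

```python
# Minimal sketch of subgroup performance testing for a binary classifier.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def subgroup_error_rates(y_true, y_pred, groups) -> dict:
    """Return accuracy, FPR, and FNR per demographic group."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    results = {}
    for g in np.unique(groups):
        mask = groups == g
        tn, fp, fn, tp = confusion_matrix(y_true[mask], y_pred[mask],
                                          labels=[0, 1]).ravel()
        results[g] = {
            "accuracy": accuracy_score(y_true[mask], y_pred[mask]),
            # Large gaps in these rates between groups are the warning sign.
            "false_positive_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
            "false_negative_rate": fn / (fn + tp) if (fn + tp) else float("nan"),
        }
    return results
```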
4. Slice-Based Evaluation
A systematic approach to subgroup testing. Define all relevant population slices (e.g., female + over 50 + urban; male + under 30 + rural), measure performance on each slice, and identify which combinations of characteristics produce the worst outcomes.
Tools like SliceFinder (Google) and Errudite automate slice discovery to find performance gaps you would not have thought to look for.
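The exhaustive version of slice evaluation can be sketched directly, assuming pandas and a results DataFrame with hypothetical columns `y_true`, `y_pred`, and one column per characteristic; dedicated tools add smarter slice search and significance testing on top of this idea.

```python
# Minimal sketch of exhaustive slice-based evaluation over characteristic columns.
from itertools import product
import pandas as pd

def slice_accuracies(df: pd.DataFrame, slice_cols: list[str],
                     min_size: int = 30) -> pd.DataFrame:
    """Accuracy for every combination of slice values, skipping tiny slices."""
    rows = []
    value_sets = [df[c].unique() for c in slice_cols]
    for combo in product(*value_sets):
        mask = pd.Series(True, index=df.index)
        for col, val in zip(slice_cols, combo):
            mask &= df[col] == val
        subset = df[mask]
        if len(subset) < min_size:  # too small to be statistically meaningful
            continue
        rows.append({**dict(zip(slice_cols, combo)),
                     "n": len(subset),
                     "accuracy": (subset["y_true"] == subset["y_pred"]).mean()})
    # Worst-performing slices first.
    return pd.DataFrame(rows).sort_values("accuracy") if rows else pd.DataFrame()
```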
Synthetic Data vs Real Data for Bias Testing
| Approach | How It Works | Best For | Limitations |
|---|---|---|---|
| Real production data with labels | Use actual historical decisions with demographic labels; measure outcome rates by group | Credit, hiring AI with historical outcomes | Privacy constraints; demographic data may be absent; historical labels may themselves be biased |
| Synthetic test cases (LLM-generated) | Generate thousands of realistic test cases varying only protected characteristics; run through AI | Counterfactual testing; NLP/CV AI; when real data lacks demographic labels | Synthetic cases may not fully represent real distribution; bias in synthetic data generator possible |
| Augmented real data | Take real cases and flip protected characteristics (or proxies); measure output change | Structured data AI (credit, hiring); quantifying direct discrimination | Difficult for complex unstructured inputs; proxy identification requires domain knowledge |
| Red-teaming with adversarial inputs | Human testers create edge cases designed to surface bias | Identifying unknown failure modes; regulatory-facing bias audit | Labour intensive; limited systematic coverage; depends on tester creativity |
| Independent bias audit | Third-party auditors test the AI independently using their own methodology | NYC LL144 compliance (requires independent auditor); regulatory credibility | Cost $10,000–$100,000+; annual requirement for in-scope NYC hiring AI |
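To illustrate the synthetic and augmented rows, here is a minimal template-based sketch (not LLM generation) that produces hiring-style test cases differing only in a name proxy for demographic group. The template, profiles, and name lists are hypothetical placeholders; in practice the proxy lists would need to come from a vetted source.

```python
# Minimal sketch of template-based synthetic test-case generation:
# each profile is rendered once per name, so cases differ only in the name proxy.
from itertools import product

TEMPLATE = ("Applicant: {name}\n"
            "Experience: {years} years as a {role}\n"
            "Education: {degree}")

NAME_GROUPS = {  # illustrative placeholder proxy lists, not a validated instrument
    "group_a": ["Emily Walsh", "Greg Baker"],
    "group_b": ["Nguyen Thi Lan", "Jamal Washington"],
}
PROFILES = [
    {"years": 5, "role": "software engineer", "degree": "BSc Computer Science"},
    {"years": 12, "role": "project manager", "degree": "MBA"},
]

def generate_cases():
    """Yield (group, profile_id, text) tuples that differ only in the applicant name."""
    for (group, names), (pid, profile) in product(NAME_GROUPS.items(), enumerate(PROFILES)):
        for name in names:
            yield group, pid, TEMPLATE.format(name=name, **profile)
```

Feeding these cases through the AI and comparing selection rates per group is then the same disparate impact calculation shown earlier, without needing real demographic data.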
Satisfying EU AI Act Article 10
EU AI Act Article 10 requires training, validation, and test data for high-risk AI systems to:
- Be subject to appropriate data governance practices
- Be free of errors and complete, to the extent reasonably possible
- Have appropriate statistical properties — including in respect of the persons or groups of persons on which the system is intended to be used
- Take into account possible biases that could affect health, safety, or fundamental rights
To satisfy Article 10, your bias testing documentation should show how each of these requirements was assessed for the datasets used to train, validate, and test the system.
Bias Mitigation Strategies
Detecting bias is only step one. When bias is found, the response depends on the root cause (a sketch of the post-processing approach follows the table):
| Root Cause | Mitigation Approach | Stage |
|---|---|---|
| Imbalanced training data | Resampling, oversampling minority groups, or generating synthetic training examples | Pre-training |
| Biased labels | Relabelling with diverse annotators; measuring inter-annotator agreement by group | Pre-training |
| Model learning proxies | Adversarial debiasing, fairness constraints during training, feature selection | During training |
| Disparate output rates | Calibration post-processing, threshold adjustments by group, Platt scaling | Post-training |
| Structural use case bias | Redesign the task — if the fundamental use of AI is discriminatory, no model change will fix it | System design |
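For the post-training row, here is a minimal sketch of group-specific threshold selection, assuming model scores in [0, 1] and a shared target selection rate. Whether group-specific thresholds are legally permissible varies by jurisdiction and use case, so treat this purely as a technical illustration.

```python
# Minimal sketch of post-processing threshold adjustment by group.
# NOTE: the legality of group-specific thresholds depends on jurisdiction;
# this only shows the mechanics.
import numpy as np

def per_group_thresholds(scores, groups, target_rate: float) -> dict:
    """Find, per group, the score threshold yielding roughly the target selection rate."""
    scores, groups = np.asarray(scores), np.asarray(groups)
    thresholds = {}
    for g in np.unique(groups):
        group_scores = scores[groups == g]
        # Scores at or above the (1 - target_rate) quantile are selected.
        thresholds[g] = float(np.quantile(group_scores, 1 - target_rate))
    return thresholds

def apply_thresholds(scores, groups, thresholds) -> np.ndarray:
    """Return 0/1 decisions using each individual's group-specific cutoff."""
    scores, groups = np.asarray(scores), np.asarray(groups)
    cutoffs = np.array([thresholds[g] for g in groups])
    return (scores >= cutoffs).astype(int)
```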
Track Your Bias Testing Documentation
ComplianceIQ helps you maintain bias testing records for each AI system, track when retesting is due, and generate the documentation required for EU AI Act Article 10 compliance.
Start Your AI Compliance Assessment