Bias Testing · April 17, 2026 · 14 min read

AI Bias Testing Methods for Enterprise: Synthetic vs Real Data Approaches

EU AI Act Article 10 requires high-risk AI systems to be tested for bias before deployment. NYC Local Law 144 requires annual bias audits for hiring AI. Colorado SB 24-205 requires impact assessments covering bias. Here is a practical guide to what bias testing actually involves.

What regulations require bias testing

EU AI Act Article 10 (data governance for high-risk AI), NYC Local Law 144 (annual bias audit for hiring AEDTs), Colorado SB 24-205 (impact assessment covering demographic bias), ECOA/FCRA (disparate impact in credit AI), and EU GDPR Article 22 (automated decision safeguards) all require some form of bias assessment. The methods described here address the bias-assessment requirements these frameworks share.

What Is AI Bias — and Which Types Matter?

AI bias refers to systematic errors in AI outputs that unfairly disadvantage particular groups. Not all bias is equal from a regulatory perspective. The types most likely to create legal exposure:

Disparate impact bias

ECOA, FCRA, EU AI Act, Colorado AI Act

The AI produces systematically different outcomes for different demographic groups — even without using protected characteristics as direct inputs. A credit model that approves loans at 30% for one race and 60% for another has disparate impact regardless of whether race is an input.

Disparate treatment bias

Employment law, ECOA, EU AI Act, GDPR

The AI treats individuals differently based on protected characteristics (race, gender, age, disability). This is direct discrimination and is easier to detect but rarer in well-designed systems.

Representation bias

EU AI Act Article 10, NYC LL144, sector laws

The training data underrepresents certain groups, causing the AI to perform worse for those groups. A facial recognition system trained on lighter skin tones that performs poorly on darker skin tones is a classic example.

Measurement bias

EU AI Act Article 10, ECOA for credit AI

The labels used to train the AI are themselves biased. A hiring AI trained on historical promotion data will learn from a biased historical promotion process.

Aggregation bias

EU AI Act, FDA (for medical AI)

The AI uses a single model across groups where different models would be more appropriate. A medical AI trained primarily on male patients may perform poorly on female patients.

The Core Bias Testing Methods

1. Disparate Impact Analysis (DIA)

The most widely required method. DIA measures whether an AI system produces different outcomes for different demographic groups. The key metric is the 80% rule (4/5ths rule) from US employment law: if the selection rate for any protected group is less than 80% of the rate for the group with the highest selection rate, there is prima facie adverse impact.

How to run it:

  1. Collect AI outputs and ground-truth outcomes for a representative test set
  2. Segment results by each protected characteristic (sex, race, age, disability where available)
  3. Calculate selection rates (or positive outcome rates) for each subgroup
  4. Apply the 4/5ths rule: flag any group with <80% of the highest-rate group's outcome
  5. Document results and statistical significance (p-value, confidence intervals)
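The steps above can be sketched in a few lines of Python. The group names and counts below are illustrative, and the two-proportion z-test is one common significance check, not the only acceptable one:

```python
# Disparate impact analysis sketch: selection rates, 4/5ths-rule impact
# ratios, and a two-proportion z-test for significance. Counts are
# illustrative, not real data.
import math

def impact_ratios(results):
    """results: {group: (selected, total)}. Returns per-group
    (selection rate, impact ratio vs the highest-rate group)."""
    rates = {g: sel / tot for g, (sel, tot) in results.items()}
    highest = max(rates.values())
    return {g: (r, r / highest) for g, r in rates.items()}

def two_prop_p_value(x1, n1, x2, n2):
    """Two-sided p-value for a two-proportion z-test."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

results = {"group_a": (120, 200), "group_b": (45, 150)}
for group, (rate, ratio) in impact_ratios(results).items():
    flag = "adverse impact" if ratio < 0.8 else "ok"
    print(f"{group}: rate={rate:.2f} ratio={ratio:.2f} -> {flag}")
print(f"p = {two_prop_p_value(120, 200, 45, 150):.2e}")
```

Here group_b's 30% rate against group_a's 60% gives an impact ratio of 0.5, well under the 0.8 flag line, and the disparity is statistically significant at these sample sizes.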

Limitation: DIA requires actual demographic data about test subjects — which may not be available for privacy reasons. Synthetic data approaches (see below) solve this problem.

2. Counterfactual Fairness Testing

Counterfactual testing asks: if everything about an individual were the same except their protected characteristic, would the AI make the same decision?

Method: Create counterfactual versions of test cases by changing only the protected characteristic (or proxies for it), rerun through the AI, and measure how often the output changes. A high counterfactual change rate indicates the AI is implicitly using the protected characteristic.

Example: Send identical job applications to a hiring AI where only the applicant name varies (English names vs non-English names). Different selection rates reveal demographic bias even when no demographic data is an input.
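The name-swap experiment can be expressed as a counterfactual change rate. In this sketch `model` is a deliberately biased toy stand-in for whatever system is under test, and the names are illustrative:

```python
# Toy counterfactual fairness test: swap only the applicant name and
# measure how often the model's decision flips.
BIASED_NAMES = {"Amara", "Kofi"}          # illustrative name set

def model(application):                   # hypothetical stand-in model
    """Deliberately biased toy: keys on the applicant's name."""
    return "reject" if application["name"] in BIASED_NAMES else "advance"

def counterfactual_change_rate(cases, swap):
    """Fraction of cases whose output changes when only the
    protected attribute (here, the name) is swapped."""
    changed = sum(model(c) != model(swap(c)) for c in cases)
    return changed / len(cases)

cases = [
    {"name": "James", "years_exp": 5},
    {"name": "Amara", "years_exp": 5},
]
swap = lambda c: {**c, "name": "Amara" if c["name"] == "James" else "James"}
print(counterfactual_change_rate(cases, swap))  # 1.0 for this biased toy
```

A change rate near zero is the goal; any substantial rate means the model is sensitive to the protected characteristic (or a proxy for it) even when it is not a declared input.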

3. Subgroup Performance Testing

Beyond outcome rates, test whether the AI's accuracy differs across groups. An AI that is 90% accurate for majority groups but 65% accurate for minority groups exhibits performance bias, even if it produces similar selection rates.

Metrics to compare across subgroups:

Accuracy and error rate
False positive rate and false negative rate (equalized odds)
Precision and recall
Calibration (do predicted scores mean the same thing for each group?)
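A minimal sketch computing accuracy, false positive rate, and false negative rate per subgroup; the records and group labels are illustrative:

```python
# Per-subgroup performance metrics over labelled test records.
# Each record is (group, y_true, y_pred); data here is illustrative.
from collections import defaultdict

def subgroup_metrics(records):
    """Accuracy, FPR, and FNR for each demographic subgroup."""
    buckets = defaultdict(list)
    for group, y_true, y_pred in records:
        buckets[group].append((y_true, y_pred))
    out = {}
    for group, pairs in buckets.items():
        tp = sum(t == 1 and p == 1 for t, p in pairs)
        tn = sum(t == 0 and p == 0 for t, p in pairs)
        fp = sum(t == 0 and p == 1 for t, p in pairs)
        fn = sum(t == 1 and p == 0 for t, p in pairs)
        out[group] = {
            "accuracy": (tp + tn) / len(pairs),
            "fpr": fp / (fp + tn) if fp + tn else 0.0,
            "fnr": fn / (fn + tp) if fn + tp else 0.0,
        }
    return out

records = [("group_a", 1, 1), ("group_a", 0, 0),
           ("group_b", 1, 0), ("group_b", 0, 0)]
print(subgroup_metrics(records))
```

Large gaps in FNR are particularly important in hiring and credit contexts: a group-specific false negative is a qualified candidate wrongly rejected.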

4. Slice-Based Evaluation

A systematic approach to subgroup testing. Define all relevant population slices (e.g., female + over 50 + urban; male + under 30 + rural), measure performance on each slice, and identify which combinations of characteristics produce the worst outcomes.

Tools like SliceFinder (Google) and Errudite automate slice discovery to find performance gaps you would not have thought to look for.
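Before reaching for automated slice discovery, the basic mechanism is easy to implement by hand: enumerate every combination of the chosen characteristics and measure accuracy on each slice. The records below are illustrative:

```python
# Slice-based evaluation: enumerate every combination of the chosen
# characteristics and report per-slice accuracy.
from itertools import product

def slice_accuracy(records, dims):
    """records: dicts with characteristic fields plus a boolean
    'correct'. dims: characteristic names to slice on."""
    values = {d: sorted({r[d] for r in records}) for d in dims}
    report = {}
    for combo in product(*(values[d] for d in dims)):
        members = [r for r in records
                   if all(r[d] == v for d, v in zip(dims, combo))]
        if members:  # skip slices with no test examples
            report[combo] = sum(r["correct"] for r in members) / len(members)
    return report

records = [
    {"sex": "F", "age": "50+", "correct": True},
    {"sex": "F", "age": "50+", "correct": False},
    {"sex": "M", "age": "<30", "correct": True},
]
print(slice_accuracy(records, ["sex", "age"]))
```

Note the combinatorial growth: three characteristics with four values each already yields 64 slices, many of them thinly populated, which is exactly the problem the automated tools address.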

Synthetic Data vs Real Data for Bias Testing

| Approach | How It Works | Best For | Limitations |
| --- | --- | --- | --- |
| Real production data with labels | Use actual historical decisions with demographic labels; measure outcome rates by group | Credit, hiring AI with historical outcomes | Privacy constraints; demographic data may be absent; historical labels may themselves be biased |
| Synthetic test cases (LLM-generated) | Generate thousands of realistic test cases varying only protected characteristics; run through AI | Counterfactual testing; NLP/CV AI; when real data lacks demographic labels | Synthetic cases may not fully represent real distribution; bias in synthetic data generator possible |
| Augmented real data | Take real cases and flip protected characteristics (or proxies); measure output change | Structured data AI (credit, hiring); quantifying direct discrimination | Difficult for complex unstructured inputs; proxy identification requires domain knowledge |
| Red-teaming with adversarial inputs | Human testers create edge cases designed to surface bias | Identifying unknown failure modes; regulatory-facing bias audit | Labour intensive; limited systematic coverage; depends on tester creativity |
| Independent bias audit | Third-party auditors test the AI independently using their own methodology | NYC LL144 compliance (requires independent auditor); regulatory credibility | Cost $10,000–$100,000+; annual requirement for in-scope NYC hiring AI |
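The synthetic test case approach often starts with simple templates: identical content where only the protected-characteristic slot varies. The template and name sets below are illustrative assumptions:

```python
# Template-based synthetic test cases: identical application text,
# only the name slot varies across demographic name sets.
TEMPLATE = "Applicant: {name}. Five years of accounting experience."
NAME_SETS = {                                # illustrative name lists
    "set_a": ["James Miller", "Emily Clark"],
    "set_b": ["Amara Okafor", "Wei Zhang"],
}

def generate_cases():
    """One matched test case per name, grouped by name set."""
    return {group: [TEMPLATE.format(name=n) for n in names]
            for group, names in NAME_SETS.items()}

print(generate_cases())
```

In practice an LLM generates the variations at scale, but the matched-pair principle is the same: every case in set_a has an otherwise identical counterpart in set_b, so any difference in outcome rates is attributable to the varied characteristic.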

Satisfying EU AI Act Article 10

EU AI Act Article 10 requires that training, validation, and test data for high-risk AI systems must:

Be relevant, sufficiently representative, and to the best extent possible free of errors and complete for the intended purpose
Have appropriate statistical properties for the persons on whom the system will be used
Be examined for possible biases likely to affect health and safety or fundamental rights, with appropriate measures to detect, prevent, and mitigate them

To satisfy Article 10, your bias testing documentation should include:

Description of the demographic groups tested (which characteristics, which subgroups)
Testing methodology used (disparate impact analysis, counterfactual testing, subgroup performance)
Quantitative results: selection rates and accuracy metrics by subgroup
Statistical significance of any observed disparities
Bias mitigation steps taken if disparities were found
Post-mitigation re-testing results
Ongoing monitoring plan for bias drift post-deployment

Bias Mitigation Strategies

Detecting bias is only step one. When bias is found, the response depends on root cause:

| Root Cause | Mitigation Approach | Stage |
| --- | --- | --- |
| Imbalanced training data | Resampling, oversampling minority groups, or generating synthetic training examples | Pre-training |
| Biased labels | Relabelling with diverse annotators; measuring inter-annotator agreement by group | Pre-training |
| Model learning proxies | Adversarial debiasing, fairness constraints during training, feature selection | During training |
| Disparate output rates | Calibration post-processing, threshold adjustments by group, Platt scaling | Post-training |
| Structural use case bias | Redesign the task: if the fundamental use of AI is discriminatory, no model change will fix it | System design |
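As a sketch of the threshold-adjustment row, a post-processing step can pick a per-group score cutoff that equalizes selection rates on a calibration set. The scores below are illustrative:

```python
# Post-processing mitigation sketch: choose a per-group score threshold
# so each group's selection rate matches a target on a calibration set.
def per_group_thresholds(scores_by_group, target_rate):
    """scores_by_group: {group: [model scores]}. Returns the score
    cutoff that selects roughly target_rate of each group."""
    thresholds = {}
    for group, scores in scores_by_group.items():
        ranked = sorted(scores, reverse=True)
        k = max(1, round(target_rate * len(ranked)))
        thresholds[group] = ranked[k - 1]  # score of the k-th best
    return thresholds

# Illustrative calibration scores showing a shifted distribution.
scores = {"group_a": [0.9, 0.8, 0.7, 0.6], "group_b": [0.5, 0.4, 0.3, 0.2]}
print(per_group_thresholds(scores, target_rate=0.5))
```

One caution: explicitly group-specific thresholds can themselves raise disparate-treatment questions in some jurisdictions, so this technique should be vetted with counsel before deployment; this is a technique sketch, not legal advice.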

Track Your Bias Testing Documentation

ComplianceIQ helps you maintain bias testing records for each AI system, track when retesting is due, and generate the documentation required for EU AI Act Article 10 compliance.

Start Your AI Compliance Assessment