Autonomous validation · Run 4

71,939 synthetic profiles. Every result published.

We built an autonomous research system, inspired by Karpathy's autoresearch, that generates adversarial developmental profiles, runs them through the engine, labels the expected output, and measures where the two disagree. Four runs. 14 engine fixes shipped. 325 threshold experiments that changed nothing. Here's what we found.

None of this is clinical validation. Ground-truth labels came from an AI clinical evaluator, not licensed clinicians reviewing real children. We publish it anyway because we'd rather show our work than wait until it's perfect.

0.906 · Cohen's Kappa · Combined set
91.4% · Sensitivity · Flags real concerns
98.8% · Specificity · Doesn't cry wolf
34,587 · Observations · Domain-profile pairs scored
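
For anyone checking the arithmetic: sensitivity, specificity, and Cohen's Kappa all derive from the same 2x2 confusion matrix of engine output versus evaluator label. A minimal sketch in TypeScript (the function names are ours for illustration, not part of the engine's API):

// Counts from comparing engine output to evaluator labels per observation.
interface Confusion {
  tp: number; // engine flagged, evaluator flagged
  fp: number; // engine flagged, evaluator did not
  fn: number; // engine missed a flag the evaluator assigned
  tn: number; // both agreed: no concern
}

function sensitivity({ tp, fn }: Confusion): number {
  return tp / (tp + fn); // share of real concerns the engine flags
}

function specificity({ tn, fp }: Confusion): number {
  return tn / (tn + fp); // share of non-concerns correctly left unflagged
}

// Cohen's Kappa: observed agreement corrected for chance agreement.
function cohensKappa({ tp, fp, fn, tn }: Confusion): number {
  const n = tp + fp + fn + tn;
  const observed = (tp + tn) / n;
  // Expected agreement if engine and evaluator flagged independently
  // at their respective base rates.
  const expected =
    ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n);
  return (observed - expected) / (1 - expected);
}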

Two-phase validation

First, a hand-labeled verification set to prove the engine's logic works. Then, an adversarial stress test to find where it breaks.

Phase 1 · Hand-verified

Curated set
1,747 · Profiles
12,991 · Observations
1.000 · Kappa
100% · Sensitivity & specificity

These profiles were promoted from the adversarial set after manual review confirmed unambiguous agreement. Kappa = 1.000 is expected by construction. This validates the engine's logic on clear-cut cases, not its edge-case performance.

Phase 2 · Stress test

Adversarial set
70,192 · Profiles
21,596 · Observations
0.852 · Kappa
86.5% · Sensitivity

70,192 profiles designed to break the engine: regression patterns, sparse data, contradictory answers, preterm edge cases, single-domain flags. Specificity held at 98.0%. Of the 781 disagreements, most were under-flags. The engine is cautious. It misses some things, but it doesn't make things up.
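
An under-flag means the engine's severity landed below the evaluator's label; an over-flag is the reverse. A sketch of that triage, assuming an ordered severity scale (the scale values here are illustrative, not the engine's internal labels):

// Ordered severity scale; higher index means more concern. Illustrative values.
const SEVERITY = ["none", "yellow", "orange", "red"] as const;
type Severity = (typeof SEVERITY)[number];

type Verdict = "agree" | "under-flag" | "over-flag";

function classify(engine: Severity, evaluator: Severity): Verdict {
  const diff = SEVERITY.indexOf(engine) - SEVERITY.indexOf(evaluator);
  if (diff === 0) return "agree";
  return diff < 0 ? "under-flag" : "over-flag"; // engine below label = under-flag
}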

What "adversarial" actually means

The system generated 18 categories of edge-case profiles; the largest are listed below. Not random data. Each category targets a specific way the engine could fail.

10,578 · Regression
9,072 · Sparse data
8,786 · Single domain
6,500 · Mutations
5,616 · Exhaustive age
5,208 · Contradictory
5,073 · Clear delay
4,964 · Clear typical
4,752 · Preterm
2,897 · Borderline
2,544 · Combinatorial
1,914 · Real-world
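
To make "targeted" concrete, here is a sketch of what a regression generator could look like under our assumptions; the real generators work over full question-bank answers, and this profile shape is simplified for illustration:

// Simplified profile: per-milestone answers at two check-ins.
type Answer = "yes" | "not_yet" | "unsure";

interface Observation {
  milestoneId: string;
  earlier: Answer; // answer at the earlier check-in
  later: Answer;   // answer at the later check-in
}

// Regression generator: take a typical profile and flip one previously
// achieved milestone back to "not_yet" at the later check-in. The engine
// should escalate; an engine that only reads the latest snapshot won't.
function injectRegression(profile: Observation[], rng: () => number): Observation[] {
  const achieved = profile.filter((o) => o.earlier === "yes" && o.later === "yes");
  if (achieved.length === 0) return profile;
  const target = achieved[Math.floor(rng() * achieved.length)];
  return profile.map((o): Observation =>
    o === target ? { ...o, later: "not_yet" } : o,
  );
}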

315 critical findings need human review.

Adversarial profiles where the engine and the AI evaluator disagreed on clinically meaningful classifications. Logged and triaged. Not hidden.

How it compares

On specificity, the engine beats the published numbers for comparable tools. Sensitivity is in the same range. But our numbers come from synthetic data, so this comparison is directional at best.

ASQ-3, M-CHAT-R/F, and PEDS are clinically validated with real patient data. Ours are synthetic. You can't put them on equal footing.

Tool                  | Sensitivity | Specificity | Validation | Sample
ASQ-3                 | 82–97%      | 83–93%      | Clinical   | ~18,000
M-CHAT-R/F            | 85–95%      | 93–99%      | Clinical   |
PEDS                  | 74–79%      | 70–80%      | Clinical   | ~1,500
MyChild (adversarial) | 86.5%       | 98.0%       | Synthetic  | 21,596 obs
MyChild (combined)    | 91.4%       | 98.8%       | Synthetic  | 34,587 obs

14 improvements across 4 runs

Each run found failure modes. Each run shipped fixes. By Run 4, the system couldn't find anything left to change: all 325 threshold experiments came back "keep current parameters."

Run 1
  • Isolated single-flag downgrade
  • Cross-domain sparse correction
  • Regression bypasses evidence gate
  • Cross-domain regression protection

Run 2
  • Soft regression detection
  • Multi-regression escalation
  • Caregiver uncertainty detection
  • Developmental sequence anomaly
  • Asymmetric screening thresholds
  • Plus 5 more fixes: RF scoring, label normalization, evidence gate tuning, low_concern handling, sparse precaution

Run 3
  • 8 refinement commits polishing edge cases found in Run 2's improvements

Run 4
  • Zero changes. All 325 threshold experiments returned "keep"; current parameters are locally optimal.

We tried 325 ways to make it better. None worked.

The autoresearch system didn't just test the engine. It tried to improve it. It swept every tunable parameter: delta-not-yet interactions, delta-unsure values, infant and toddler grace periods, the yellow-orange-red thresholds. 325 experiments total.

Every single one came back "keep current parameters." The defaults we built from first principles against CDC 2022 guidance are locally optimal across the curated validation set. If we want the numbers to move from here, we need to change the algorithms, not the knobs.
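
Mechanically, each of the 325 experiments is a one-knob perturbation scored against the curated set, and "keep" means no candidate value beat the current one. A sketch of that loop (the names and scoring interface are assumptions, not the shipped code):

// One tunable knob and the candidate values to try around its current setting.
interface Experiment {
  param: string;      // e.g. "T_yellow" or an infant grace period, per the knobs above
  current: number;
  candidates: number[];
}

// Assumed to run the full curated validation set with the given parameter
// override and return a scalar quality metric (e.g. Cohen's Kappa).
declare function score(param: string, value: number): number;

function sweep(exp: Experiment): { verdict: "keep" | "change"; best: number } {
  let best = exp.current;
  let bestScore = score(exp.param, exp.current); // baseline with current value
  for (const v of exp.candidates) {
    const s = score(exp.param, v);
    if (s > bestScore) {
      bestScore = s;
      best = v;
    }
  }
  // "Keep current parameters" = no candidate strictly improved the metric.
  return { verdict: best === exp.current ? "keep" : "change", best };
}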

T_yellow = 2 · T_orange = 4 · T_red = 6 · Grace: 3w (infant) / 12w (toddler)
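
Read as a classifier, those thresholds partition each domain's concern score into bands. A minimal sketch, assuming the score is a non-negative evidence-weighted sum and inclusive cutoffs (the scoring itself, not this mapping, is where the 14 fixes landed):

type Band = "green" | "yellow" | "orange" | "red";

// Thresholds as published above. Inclusive cutoffs are our assumption.
const T_YELLOW = 2;
const T_ORANGE = 4;
const T_RED = 6;

function band(score: number): Band {
  if (score >= T_RED) return "red";
  if (score >= T_ORANGE) return "orange";
  if (score >= T_YELLOW) return "yellow";
  return "green";
}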

RBSK alignment

Mapped against India's national screening program

RBSK (Rashtriya Bal Swasthya Karyakram) screens 270 million children across India. Independently validated at 97% sensitivity and 96.4% specificity against ASQ-3. We mapped our question bank domain-by-domain against their screening items.

That doesn't make us RBSK-validated. The scoring logic hasn't been tested with Indian populations. But the developmental constructs we measure are the same ones a national public health program considers worth screening for.

What MyChild adds

  • Evidence-weighted escalation
  • Probe-based false positive reduction
  • Automated regression detection
  • Corrected age for prematurity (sketched below)
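
The corrected-age adjustment is the most mechanical of the four. A minimal sketch, assuming the standard pediatric practice of subtracting weeks of prematurity from chronological age and correcting until roughly 24 months:

const FULL_TERM_WEEKS = 40;
const CORRECTION_CUTOFF_WEEKS = 104; // ~24 months; assumed cutoff

function correctedAgeWeeks(chronologicalWeeks: number, gestationWeeks: number): number {
  // Past the cutoff, milestones are assessed against chronological age.
  if (chronologicalWeeks > CORRECTION_CUTOFF_WEEKS) return chronologicalWeeks;
  const prematurityWeeks = Math.max(0, FULL_TERM_WEEKS - gestationWeeks);
  return Math.max(0, chronologicalWeeks - prematurityWeeks);
}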

What RBSK covers that we don't

  • Ages 3 to 6 years
  • India health infrastructure integration
  • ABDM / Poshan Tracker connectivity

The original verification

Before the autoresearch system existed, we ran 108 hand-labeled developmental scenarios through the engine. This is where it all started.

98.6% · Agreement
0.971 · Cohen's Kappa
294 · Labeled observations
4 · Threshold cases
45 · Abstentions

Per-domain agreement

Four domains had a single borderline over-flag. The other four matched exactly. When the engine gets it wrong, it errs toward concern. For a screening aid, that's the right kind of wrong.

Domain               | N  | Agreement | Kappa | Notes
Gross Motor          | 72 | 100%      | 1.000 | No disagreements
Cognitive / Play     | 22 | 100%      | 1.000 | No disagreements
Self-Help / Adaptive | 14 | 100%      | 1.000 | No disagreements
Vision / Hearing     | 23 | 100%      | 1.000 | No disagreements
Expressive Language  | 67 | 98.5%     |       | 1 borderline over-flag near threshold
Social-Emotional     | 45 | 97.8%     |       | 1 borderline over-flag on sparse data
Receptive Language   | 34 | 97.1%     |       | 1 borderline over-flag with too little evidence
Fine Motor           | 17 | 94.1%     |       | 1 borderline over-flag near threshold

The honest version

What this shows

  • 70,192 adversarial profiles tried to break it. Kappa held at 0.852.
  • 98.0% specificity on adversarial data. Low false alarm rate.
  • 14 algorithmic fixes shipped directly from autoresearch findings
  • 325 threshold experiments all said "keep current parameters"
  • Question bank maps to CDC 2022 milestones and RBSK screening items
  • Open source. Every rule inspectable, every result reproducible.

What this does NOT prove

  • Clinical validation with real patient data. That needs IRB oversight and clinical partners we don't have yet.
  • Real-world sensitivity or specificity. Synthetic profiles are not real children.
  • Whether it works across languages, cultures, or care settings
  • Whether the thresholds are right for clinical use
  • 315 critical adversarial findings still need a clinician to look at them

All thresholds are hypothesis-level, aligned to CDC 2022 guidance. A prospective clinical study is next. We're looking for clinical partners.

Don't trust us. Run it yourself.

Every number on this page is reproducible. Install the engine, run the validation, check the output against what we've published. If something doesn't match, open an issue.

# Install
npm install mychild-engine

# Run the full validation suite
npx mychild-engine validate --profiles data/synthetic-profiles.json --format markdown

# Run the autoresearch system
npx mychild-engine autoresearch --runs 4 --adversarial 70000

Where the questions come from

Every question in the engine traces back to peer-reviewed developmental science. The synthetic validation approach draws on published work about when and how synthetic data can stand in for clinical data.

Primary source: CDC 2022 Revised Developmental Milestones

Zubler JM, Wiggins LD, Macias MM, et al. "Evidence-Informed Milestones for Developmental Surveillance Tools." Pediatrics. 2022;149(3):e2021052138.

The 2022 revision moved milestones from the 50th to the 75th percentile. Each milestone is a skill 75% of children can demonstrate by the listed age.

doi:10.1542/peds.2021-052138

AAP Developmental Surveillance and Screening Policy

Lipkin PH, Macias MM; Council on Children with Disabilities. "Promoting Optimal Development: Identifying Infants and Young Children With Developmental Disorders Through Developmental Surveillance and Screening." Pediatrics. 2020;145(1):e20193449.

doi:10.1542/peds.2019-3449

CDC "Learn the Signs. Act Early." Program

Public domain milestone checklists that formed the basis for the question bank's caregiver-facing wording. Monitoring tools, not screening instruments.

cdc.gov/act-early/milestones

RBSK — India's National Child Health Screening Program

Rashtriya Bal Swasthya Karyakram. Independently validated at 97% sensitivity, 96.4% specificity against ASQ-3. Our questions are mapped domain-by-domain against RBSK screening items.

Our questions map to RBSK items, but the scoring logic hasn't been tested with Indian populations.

rbsk.mohfw.gov.in

Autoresearch methodology

Adapted from Karpathy's autoresearch pattern. The system autonomously generates adversarial profiles, labels expected outputs after running them (run-then-label), and iterates on the engine when it finds failures.
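
In sketch form, one run looks something like this (the structure is ours; the real system adds triage, logging, and the threshold experiments):

interface Profile { id: string }
interface Result { severity: string }
interface Finding { profile: Profile; engine: Result; expected: Result }
interface Fix { description: string }

// Assumed interfaces, not the shipped code.
declare function generateProfiles(category: string, n: number): Profile[];
declare function runEngine(p: Profile): Result;
declare function labelExpected(p: Profile): Result; // run-then-label: the AI evaluator
declare function proposeFixes(findings: Finding[]): Fix[];

function autoresearchRun(categories: string[], perCategory: number): Fix[] {
  const findings: Finding[] = [];
  for (const cat of categories) {
    for (const profile of generateProfiles(cat, perCategory)) {
      const engine = runEngine(profile);       // engine verdict first
      const expected = labelExpected(profile); // then the evaluator's label
      if (engine.severity !== expected.severity) {
        findings.push({ profile, engine, expected });
      }
    }
  }
  // Disagreements become candidate engine fixes for the next run.
  return proposeFixes(findings);
}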

Synthetic Data Validation Methodology

"The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures." BMC Medical Informatics and Decision Making. 2019.

doi:10.1186/s12911-019-0793-0

Synthetic Data in Clinical Outcomes

"Synthetic data can aid the analysis of clinical outcomes: How much can it be trusted?" PNAS. 2024.

doi:10.1073/pnas.2414310121

Synthetic Validation of Pediatric Trust Instruments

"Synthetic Validation of Pediatric Trust Instruments Using LLMs." MedRxiv preprint. 2025.

medrxiv.org/10.1101/2025.11.25.25340922v1