Validation study

Why I'm bringing MyChild back.

I built the first version of this engine in 2014, when I was 16, and launched the app in 2016. It screens for developmental delays in children aged 1 to 24 months, based on developmental milestones from the Centers for Disease Control and Prevention (CDC), World Health Organization guidelines, and Red Flags indicators. I ran the startup for a couple of years, then wound it down.

Almost nobody is still working on this. The few who are aren't open-sourcing it. With what AI can do now, I decided to bring it back.

I'm not raising money or restarting the company. I'm open-sourcing the engine so any platform that reaches parents can use it, improve it, and help push it toward clinical validation. The current study uses synthetic data, and the self-learning model is already generating adversarial cases on its own.

The real goal is to get this validated and give it away. If you're building something for parents, that's exactly what it's for.

— Harsh Songra

The numbers so far

0.906

Cohen's Kappa

Measures how much the engine and the evaluator agree, beyond what you'd expect from random chance. 1.0 is perfect agreement. Anything above 0.8 is considered "almost perfect" in clinical research. Ours is 0.906 across the combined set.
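If you want the arithmetic behind that number, here's a minimal TypeScript sketch of Cohen's kappa over a 2x2 confusion matrix. It's illustrative only, not code from the engine:

// Cohen's kappa from a 2x2 confusion matrix (engine vs. AI evaluator).
// Illustrative sketch only; not code from the engine itself.
function cohensKappa(
  truePos: number,  // both say "concern"
  falseNeg: number, // evaluator says "concern", engine says "clear"
  falsePos: number, // engine says "concern", evaluator says "clear"
  trueNeg: number,  // both say "clear"
): number {
  const total = truePos + falseNeg + falsePos + trueNeg;
  const observed = (truePos + trueNeg) / total; // raw agreement rate
  // Chance agreement: each rater's marginal "concern" and "clear"
  // rates, multiplied as if the two raters were independent.
  const bothConcern =
    ((truePos + falseNeg) / total) * ((truePos + falsePos) / total);
  const bothClear =
    ((trueNeg + falsePos) / total) * ((trueNeg + falseNeg) / total);
  const expected = bothConcern + bothClear;
  return (observed - expected) / (1 - expected);
}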

91.4%

Sensitivity

When there's actually a concern, how often does the engine catch it? 91.4% of the time. The remaining 8.6% are misses where the engine said "looks fine" but the evaluator disagreed. For a screening tool, you want this number high.

98.8%

Specificity

When a child is developing normally, how often does the engine correctly say so? 98.8% of the time. Only 1.2% false alarms. This matters because false flags cause unnecessary worry for parents.
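Both metrics fall out of the same confusion matrix as the kappa sketch above. Again in illustrative TypeScript, not the engine's code:

// Sensitivity: of the profiles the evaluator flagged as a concern,
// what fraction did the engine also flag? Misses are false negatives.
const sensitivity = (truePos: number, falseNeg: number): number =>
  truePos / (truePos + falseNeg); // 0.914 in the combined set

// Specificity: of the profiles the evaluator cleared, what fraction
// did the engine also clear? False alarms are false positives.
const specificity = (trueNeg: number, falsePos: number): number =>
  trueNeg / (trueNeg + falsePos); // 0.988 in the combined set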

71,939

Profiles tested

Synthetic developmental profiles generated by the autoresearch system. Not real children. Each profile simulates a child's milestone responses across multiple domains and age points.
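The real schema is in the repo. As a rough illustration of what "milestone responses across multiple domains and age points" means, a profile has roughly this shape (field names are made up for this sketch, not the engine's actual types):

// Illustrative shape of one synthetic profile. Field names are invented
// for this sketch; the real schema is in the open-source repo.
interface MilestoneResponse {
  domain: string;      // e.g. "gross-motor", "communication"
  questionId: string;  // one of the engine's milestone questions
  achieved: boolean;   // simulated parent-reported status
}

interface SyntheticProfile {
  ageMonths: number;           // 1-24
  adjustedAgeMonths?: number;  // corrected age for simulated preterm cases
  responses: MilestoneResponse[];
  expectedOutcome: "concern" | "clear"; // label from the AI evaluator
}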

How we tested it

We borrowed the idea from Karpathy's autoresearch: hand-labeled profiles first to check the basic logic, then adversarial cases to see what breaks.

Phase 1

Hand-verified set

1,747

Profiles

1.000

Kappa

Curated profiles where the expected output is unambiguous. Perfect agreement by construction. This validates the engine's logic on clear-cut cases, not edge-case performance.

Phase 2

Adversarial stress test

70,192

Profiles

0.852

Kappa

70,192 profiles designed to break the engine: regression patterns, sparse data, contradictory answers, preterm edge cases. Specificity held at 98.0%. Of 781 disagreements, most were under-flags. The engine is conservative. That feels like the right failure mode for a screening tool, but it's still a miss.
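To make those four families concrete, here's one hypothetical way the mutations could look, reusing the profile sketch from earlier on this page. The actual generator is part of the autoresearch system; none of this is its real code:

// Hypothetical mutations for the four adversarial families named above.
// Reuses the SyntheticProfile sketch from the "Profiles tested" section.
type Mutation = (p: SyntheticProfile) => SyntheticProfile;

const adversarialFamilies: Record<string, Mutation> = {
  // Regression pattern: a milestone the profile had achieved is
  // reported as lost again.
  regression: (p) => ({
    ...p,
    responses: p.responses.map((r, i) =>
      i === p.responses.length - 1 ? { ...r, achieved: false } : r),
  }),
  // Sparse data: keep only a handful of responses so the engine has
  // very little signal to decide on.
  sparse: (p) => ({ ...p, responses: p.responses.slice(0, 3) }),
  // Contradictory answers: deny an early milestone while later ones
  // in the same domain stay marked as achieved.
  contradictory: (p) => ({
    ...p,
    responses: p.responses.map((r, i) =>
      i === 0 ? { ...r, achieved: false } : r),
  }),
  // Preterm edge case: widen the gap between chronological and
  // corrected age so thresholds get evaluated near a boundary.
  preterm: (p) => ({ ...p, adjustedAgeMonths: Math.max(1, p.ageMonths - 3) }),
};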

Ground-truth labels came from an AI clinical evaluator, not licensed clinicians. This is not clinical validation. We publish it because we'd rather show our work than wait until it's perfect.

How it compares

There are a handful of established screening tools that pediatricians and public health programs use around the world. We compared our numbers against theirs. On specificity, the engine is ahead. Sensitivity is in the same range. But our numbers come from synthetic data, so this comparison is directional at best.

Every tool below has been validated with real children in clinical settings. Ours hasn't. You can't put them on equal footing. We include this comparison so you can see where the engine sits relative to the field, not to claim equivalence.

Ages and Stages Questionnaire (ASQ-3)

The most widely used parent-completed screening tool in the world. Parents answer questions about their child's communication, motor skills, problem-solving, and personal-social development. Used in pediatric offices and early intervention programs across the US and internationally. Clinically validated with ~18,000 children.

82-97%

Sensitivity

83-93%

Specificity

Clinical

Modified Checklist for Autism in Toddlers (M-CHAT-R/F)

A two-stage screening tool specifically for autism spectrum disorder in toddlers aged 16-30 months. The first stage is a 20-question parent questionnaire. Children who screen positive get a structured follow-up interview to reduce false positives. Recommended by the American Academy of Pediatrics for routine screening at 18 and 24 months.

85-95%

Sensitivity

93-99%

Specificity

Clinical

Parents' Evaluation of Developmental Status (PEDS)

A 10-question tool that asks parents about their concerns across development areas. Designed for children from birth to 8 years. Instead of testing the child directly, it uses the parent's own observations and worries as the signal. Validated with ~1,500 children. Commonly used in primary care and community health settings.

74-79%

Sensitivity

70-80%

Specificity

Clinical

MyChild Engine (combined)

Our engine: 129 evidence-weighted questions across 8 developmental domains, for children aged 1 to 24 months. Unlike the tools above, it hasn't been validated with real children yet; these numbers come from 71,939 synthetic profiles. A hypothetical sketch of what evidence-weighted scoring could look like follows these numbers.

91.4%

Sensitivity

98.8%

Specificity

Synthetic
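For a sense of what "evidence-weighted" could mean mechanically, here's a hypothetical scoring sketch. The engine's actual rules, weights, and thresholds are in the open-source repo:

// Hypothetical sketch of evidence-weighted scoring. The engine's real
// rules, weights, and thresholds are in the open-source repo.
interface WeightedQuestion {
  id: string;
  domain: string; // one of the 8 developmental domains
  weight: number; // evidence weight; red-flag items would weigh more
}

// Weighted miss rate for one domain: 0 means every milestone achieved,
// 1 means every milestone missed. Unanswered questions are skipped here.
function domainScore(
  questions: WeightedQuestion[],
  answers: Map<string, boolean>, // questionId -> milestone achieved
): number {
  let missed = 0;
  let total = 0;
  for (const q of questions) {
    const a = answers.get(q.id);
    if (a === undefined) continue; // unanswered: ignored in this sketch
    total += q.weight;
    if (a === false) missed += q.weight;
  }
  return total === 0 ? 0 : missed / total;
}

// A domain whose weighted miss rate crosses its threshold gets flagged.
const isFlagged = (score: number, threshold: number): boolean =>
  score >= threshold;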

What this proves and what it doesn't

What it shows

  • 70,192 adversarial profiles tried to break it. Kappa held at 0.852.
  • 98.0% specificity on adversarial data. Low false alarm rate.
  • The autoresearch system shipped 14 fixes and tried 325 threshold changes that all came back "keep current parameters."
  • Questions based on the CDC's 2022 developmental milestone guidelines and mapped to India's RBSK (Rashtriya Bal Swasthya Karyakram) screening program.
  • Open source. Every rule inspectable, every result reproducible.

What it does not prove

  • Clinical validation with real children. That needs ethics board oversight and clinical partners we don't have yet.
  • Real-world sensitivity or specificity. Synthetic profiles are not real children.
  • Whether it works across languages, cultures, or care settings.
  • Whether the thresholds are right for clinical use. They're hypothesis-level, aligned to the 2022 developmental milestone guidelines.

A prospective clinical study is next. We're looking for clinical partners.

Go deeper

This page is the summary. The full study has the adversarial breakdown, per-domain results, threshold optimization details, India screening program alignment, and every source we used. All the numbers are reproducible.

# Install

npm install mychild-engine


# Run the validation yourself

npx mychild-engine validate --profiles data/synthetic-profiles.json --format markdown