glaud-i — 200,000-Round Milestone
A from-scratch numpy infant, after a lot of listening
April 24, 2026
A note before we start: this writeup was drafted by Claude (Anthropic's assistant) at my request, based on the project code and a conversation about where things stand. The project itself — the design, the experiments, the 200,000 rounds of training — is mine. The voice here is Claude's. I've left it that way because it captures the milestone honestly, including what didn't work. — H.A.
What glaud-i is
glaud-i is a simulated infant that learns language from scratch. It is deliberately not an LLM, not a wrapper around pre-trained embeddings, and not a PyTorch project. The whole thing runs in pure numpy on a single CPU core.
The design constraints were strict from the start:
- No shortcuts. No pre-trained embeddings, no teacher models, no word lists baked in.
- Slow learner. Real babies take months. Hours of compute is fine.
- Grows over time. Starts with minimal capacity, adds layers and heads like biological neurogenesis.
- Learning through discovery, not training. The system absorbs language from its environment, not from being drilled.
The point isn't performance. It's to see what kind of language-like behavior emerges when a tiny, honestly-built system hears character sequences over hundreds of thousands of interactions.
Architecture, briefly
Three components do most of the work.
The brain is a handwritten transformer encoder — about 100 lines of numpy matrix math. It produces a 64-dimensional brain state from input text. It started at 1 layer and 2 attention heads and grew to 3 layers and 4 heads over the run.
The organs are 27 independent character producers (a–z plus space). Each has a 64-dim sensitivity vector that responds to brain states, and they share a 27×27 lateral matrix that learns bigram patterns through pure Hebbian observation. Total parameters across all 27 organs: about 2,500.
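The organ parameter budget can be sketched directly from those shapes (the array names here are mine, not the project's):

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # a-z plus space: 27 organs

# Hypothetical containers mirroring the description: each organ owns a
# 64-dim sensitivity vector; one 27x27 lateral matrix is shared.
sensitivity = np.zeros((27, 64))  # per-organ response to brain states
lateral = np.zeros((27, 27))      # shared Hebbian bigram weights

total_params = sensitivity.size + lateral.size
print(total_params)  # 2457, i.e. "about 2,500"
```

The 27×64 sensitivity block plus the shared 27×27 lateral matrix is the whole budget, which is why the count lands just under 2,500.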
The semantic network is a graph of word co-occurrence weighted by emotional context and phonetic similarity. After training it had 262,539 vocabulary entries and 13.1 million connections.
Sitting on top of these are several smaller systems: a cognitive layer that decides whether to speak, listen, or stay silent; a reward coach that adapts the relative weights of character, semantic, and emotional reward signals; and — added partway through the run — a social battery that creates an internal drive for connection.
The run
Over roughly 200,000 training rounds, the baby:
- Listened to 420,768 conversations
- Made 293,510 speaking attempts
- Heard 753,269 emotional signals
- Produced 293,644 generated utterances
The training mixed listening (Hebbian observation by the organs, representation learning by the brain) with babbling (the organs attempt to speak, the brain provides backward pressure, the coach calibrates rewards). No round of this involved a pre-trained model.
What worked
The system trained for 200,000 rounds without collapse
This sounds modest. It isn't. Earlier versions of the system collapsed in characteristic ways — sensitivity vectors aligning, all organs producing the same character, generation devolving into single-letter loops. The current architecture maintained sensitivity diversity at roughly 1.01 across the entire run (1.0 = perfectly orthogonal, 0 = collapsed). The collapse-risk indicator stayed at 0.158. The brain's representations remained distinguishable rather than narrowing into a cone in embedding space.
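The project's exact diversity formula isn't reproduced here, but one metric with the stated behavior (1.0 for orthogonal vectors, 0 for collapse, slightly above 1.0 possible) is mean pairwise distance between unit-normalized sensitivity vectors, scaled by the orthogonal-pair distance. This is an assumption, not the project's code:

```python
import numpy as np

def sensitivity_diversity(S):
    """A plausible diversity score (an assumption, not the project's exact
    formula): mean pairwise distance between unit-normalized rows of S,
    scaled so orthogonal vectors score 1.0 and identical rows score 0.0."""
    U = S / np.linalg.norm(S, axis=1, keepdims=True)
    n = len(U)
    dists = [np.linalg.norm(U[i] - U[j])
             for i in range(n) for j in range(i + 1, n)]
    return np.mean(dists) / np.sqrt(2)  # sqrt(2) = orthogonal unit-pair distance

rng = np.random.default_rng(0)
healthy = rng.normal(size=(27, 64))              # random 64-dim vectors are
print(round(sensitivity_diversity(healthy), 1))  # near-orthogonal: ~1.0
collapsed = np.tile(rng.normal(size=64), (27, 1))
print(sensitivity_diversity(collapsed))          # 0.0 when all organs align
```

Under a metric like this, a run that holds near 1.0 means the 27 organs kept pointing in genuinely different directions of brain-state space.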
Real English emerged from observation
The lateral matrix had no English baked in. Yet after 200,000 rounds the strongest bigram patterns it learned were:
i_, e_, re, od, he, _h, oo, er, es, el
(Underscores are spaces.) These match the actual high-frequency bigrams of English text. The system has no concept of "the letter h" or "common in English" — it just tracks which character organ tends to fire after which other character organ. The empirical bigram distribution of English emerged from pure observation.
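The Hebbian observation described above reduces to a very small update; this is a minimal sketch (the real rule's learning rate and normalization are assumptions):

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 27 symbols, space last
lateral = np.zeros((27, 27))              # shared bigram weights

def hebbian_observe(text, lr=0.01):
    """Strengthen lateral[i, j] whenever character j follows character i.
    No notion of 'English' anywhere: just which organ fires after which."""
    idx = [ALPHABET.index(c) for c in text if c in ALPHABET]
    for i, j in zip(idx, idx[1:]):
        lateral[i, j] += lr

hebbian_observe("he he he")
print(ALPHABET[lateral[ALPHABET.index("h")].argmax()])  # 'e' is h's top successor
```

Feed this loop 420,000 conversations of English text and the top entries of `lateral` are exactly the empirical bigram table the run recovered.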
The semantic network produced meaningful associations
Sample associations from the network at 200,000 rounds:
- happy: walk (0.67), nice (0.67), so (0.66)
- please: see (0.53), play (0.49), nice (0.49)
These look slightly odd at first read — happy doesn't traditionally associate with walk — but they're correct given the training corpus. Words that co-occur in similar emotional contexts get similar scores. The network is a faithful map of what the baby actually heard, not a reproduction of any external lexicon.
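The co-occurrence-weighted-by-emotion idea can be sketched in a few lines. This is a hypothetical simplification: the real network also weights edges by phonetic similarity, which is omitted here, and the learning rate is invented:

```python
from collections import defaultdict

connections = defaultdict(float)  # undirected word-pair edge weights

def hear(words, emotion, lr=0.1):
    """Strengthen edges between words that co-occur in one utterance,
    scaled by how emotionally charged the utterance was (emotion in [0, 1])."""
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            connections[tuple(sorted((a, b)))] += lr * emotion

hear(["so", "happy", "nice", "walk"], emotion=0.9)
hear(["happy", "walk"], emotion=0.8)
print(round(connections[("happy", "walk")], 2))  # 0.17
```

Under an update like this, `happy`–`walk` scoring high is exactly what a corpus full of warm walk-related utterances should produce, odd as it looks next to a dictionary.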
A social instinct formed
This is the result I'd lead with if I had to pick one.
The social battery is a single scalar that decays slowly each round (about 0.005 per round) and refills from positive interactions. When it drops below 0.3, the baby is "lonely." The battery's entire intervention in the rest of the system is a single thing: when lonely, the baby's emotion input to the brain is shifted slightly negative, with a weight of 0.05.
That's it. No keyword matching. No "if lonely then say love." No hardcoded rules at all.
What the system learned over training, through the existing cross-entropy reward mechanism, was that brain states corresponding to "lonely" should preferentially activate certain character organs. By the end of the run, sensitivity vectors had shifted to favor social-coded first characters in lonely states. The instinct is encoded in the same representational machinery as everything else the system knows.
It works because the brain was already trained to encode emotion into the brain state. The battery just tints that channel. Then the organs, doing what they always do, learn through reward that some brain states (the tinted ones) should produce some characters (the ones that elicit warm responses from the teacher) more readily.
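The whole mechanism fits in a few lines. The constants (0.005 decay, 0.3 threshold, 0.05 tint) come from the description above; the refill amount and function names are my assumptions:

```python
LONELY_THRESHOLD = 0.3
DECAY = 0.005
TINT = 0.05

battery = 1.0

def step(emotion_input, positive_interaction=False):
    """Decay the battery each round; when lonely, nudge the emotion
    channel slightly negative before it reaches the brain."""
    global battery
    battery = max(0.0, battery - DECAY)
    if positive_interaction:
        battery = min(1.0, battery + 0.1)  # refill amount is an assumption
    if battery < LONELY_THRESHOLD:
        emotion_input = emotion_input - TINT  # the entire intervention
    return emotion_input

for _ in range(150):       # ~150 rounds with no warm interactions
    tinted = step(0.0)
print(battery < LONELY_THRESHOLD, tinted)  # True -0.05
```

Everything downstream of that one tint, including the shift toward social-coded first characters, is learned by the ordinary reward machinery.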
It's a very small mechanism. It produces a behavior that looks, from the outside, like an instinct.
The cognitive layer worked as designed
The speak / silent / listen gate settled at about 95% speak rate after warmup, which is the desired distribution given the architecture's preference for forced babbling during training. Surprise signals (prediction error) tracked sensibly. The system used silence and listen-requests at non-trivial rates rather than collapsing into "always speak."
What didn't work: the convergence trap
Free generation at 200,000 rounds produces variations on a single attractor:
"hello" → "laye he he he h"
"how are you" → "i he aye he i h"
"he he" → "he he he he he"
"say hello" → "he he he he he"
"say no" → "ore he anore he"
The first character is often correct. "say what" begins with w. "say good" begins with g. The first-character probe on canonical targets has been stable at 6 of 8 correct for tens of thousands of rounds. But within three or four characters, generation drops into the h e space h e space loop regardless of prompt.
What's happening is roughly this. During generation, the activation that picks each character is a sum of:
- A sensitivity signal computed once from the prompt's brain state — same value at step 0, step 5, step 10
- A lateral signal from the just-fired character — different at every step
- Fatigue and momentum dynamics
The sensitivity contribution is bounded after normalization. The lateral signal, with weights that have been reinforced for 420,000 observations, dominates everything past the first character. The system has no other position-aware signal during generation — once one character is out, the next character is essentially "what English does after this character," with the prompt's influence reduced to a constant background.
The strongest cycle in the bigram graph is h → e → space → h → e → space. So that's what generation converges to.
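The dynamics can be reproduced in a toy version (all weights here are invented for illustration): a frozen prompt-derived sensitivity signal picks the first character correctly, and then a strong lateral cycle takes over:

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz "
H, E, SP, W = 7, 4, 26, 22  # indices of h, e, space, w

lateral = np.zeros((27, 27))
lateral[H, E] = lateral[E, SP] = lateral[SP, H] = 1.0  # strongest cycle
lateral[W, H] = 0.8                                    # 'w' feeds the cycle

sensitivity = np.zeros(27)
sensitivity[W] = 1.0  # prompt-derived signal: computed once, never updated

out, prev = [], None
for _ in range(9):
    act = sensitivity.copy()       # identical at every step
    if prev is not None:
        act += 2.0 * lateral[prev]  # lateral term dominates after step 0
    prev = int(act.argmax())
    out.append(ALPHABET[prev])
print("".join(out))  # whe he he
```

The first character tracks the prompt; everything after it is the bigram graph's strongest cycle, which is the convergence trap in miniature.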
What this finding actually means
I don't think this is a failure of the architecture so much as a clean demonstration of where its actual ceiling sits. The system learned the statistical regularities of English so well that the regularities became the policy. A smaller or worse-trained system would fail differently — it would produce noise, or get stuck on individual letters. This system produces fragments of plausible English (anore looks like a piece of anymore; god aye has the rhythm of words even if it isn't one).
The diagnosis is straightforward: there is no mechanism for the prompt's influence to persist into mid-utterance generation. The brain encodes the prompt, but the encoded result is then frozen for the duration of the utterance. There is no equivalent of a baby hearing itself speak.
What's next
The most architecturally honest fix is to let the brain re-encode at each generation step, including the characters generated so far. Today:
Brain(prompt) → brain_state (frozen) → Organs generate 15 chars
Becomes:
Brain(prompt) → bs₀ → Organ fires c₁
Brain(prompt + c₁) → bs₁ → Organ fires c₂
Brain(prompt + c₁c₂) → bs₂ → Organ fires c₃
...
This adds zero parameters. It preserves all the saved weights. It matches the project's philosophy — real babies absolutely use auditory feedback on their own babbling. The cost is computational: roughly fifteen encoder passes per generation instead of one. On the existing single-CPU setup, that's a measurable but not prohibitive slowdown.
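The control-flow change is small. The `Brain` and `Organs` classes below are stand-in stubs, not the project's code; the point is only the loop structure, where `encode` sees the prompt plus everything generated so far:

```python
class Brain:
    def encode(self, text):
        # Stand-in for ~100 lines of numpy transformer math: here the
        # "brain state" is just the text itself, so the demo runs.
        return text

class Organs:
    def fire(self, state):
        # Stand-in policy: echo the last character of the state, or 'h'.
        return state[-1] if state else "h"

def generate(brain, organs, prompt, max_chars=5):
    out = ""
    for _ in range(max_chars):
        state = brain.encode(prompt + out)  # re-encoded at every step
        out += organs.fire(state)
    return out

print(generate(Brain(), Organs(), "hi"))
```

With the real brain, each `encode` call is one full encoder pass, hence roughly fifteen passes per utterance instead of one, and zero new parameters.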
Whether this fix actually escapes the attractor is the next experiment. The hypothesis is that giving sensitivity a position-aware brain state will let it compete with the lateral signal beyond step 0. The alternative — that 27 character organs with scalar sensitivity simply cannot scale to fluent production no matter how the brain pipeline is structured — is also a legitimate possibility, and the project will tell us which is true.
Either result is informative. If position-aware encoding rescues generation, it's a clean architectural finding about why frozen-context generation has a ceiling. If it doesn't, the ceiling is somewhere else, and we go look for it.
By the numbers
| Metric | Value |
| --- | --- |
| Total rounds | 200,483 |
| Conversations heard | 420,768 |
| Speaking attempts | 293,510 |
| Vocabulary in semantic network | 262,539 words |
| Semantic connections | 13.1 million |
| First-char probe (easy tier) | 6 of 8 |
| Sensitivity diversity | 1.01 |
| Collapse-risk indicator | 0.158 |
| Brain at end of run | 3 layers, 4 heads, 64-dim |
| Total organ parameters | ~2,500 |
| Compute platform | Single CPU, pure numpy |
Closing note
The findings above were the framing that came out of a conversation. Other framings exist. A reader who wanted to call the convergence trap a failure rather than a ceiling could; a reader who wanted to call the social-battery result a curiosity rather than a publishable finding could too. I'd push back on both, but the project is small enough and weird enough that nothing about it is settled.
Next milestone is whatever comes out of the per-step encoding experiment. If the baby starts generating fragments of varied English instead of he he he, that'll be a clean result worth its own writeup. If it doesn't, that'll also be worth writing up — knowing where a ceiling actually is matters more than getting past it.