Self-Improving AI: Small Model Trains on Mistakes, Beats GPT-3.5 in 2026

In a demonstration of self-directed learning, a solo researcher reports that a small AI model can dramatically improve its coding and math abilities by training only on its own mistakes and the corrections it later finds. The project, detailed in a series of experiments posted on Reddit and backed by open-source code on GitHub, reports a 7-billion-parameter model reaching 80% on the HumanEval benchmark and a self-trained model beating OpenAI's GPT-3.5 on math, all without a single line of human-written training data.

The Self-Mining Approach

The core method, which the researcher calls 'self-mining,' is deceptively simple. A base model is asked to invent a coding problem and write a few small tests for it. The same model then attempts to solve its own problem multiple times. Whenever some attempts fail the tests and others pass, a (broken attempt, working attempt) pair is saved. The model is then fine-tuned on these self-generated corrections. No human intervention. No curated datasets. Only the Python interpreter passes judgment.
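The mining step can be sketched in a few lines of Python. This is a minimal illustration, not the code from the researcher's repository: the function names passes and mine_pairs, the abs_diff problem, and the choice to pair every failing attempt with every passing one are all assumptions made for the example. The model calls are replaced by hard-coded attempts so the sketch runs on its own.

```python
import itertools


def passes(solution_src: str, test_src: str) -> bool:
    """Return True if the solution and its tests run without raising."""
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # define the candidate function
        exec(test_src, namespace)      # a failed assert raises AssertionError
        return True
    except Exception:
        return False


def mine_pairs(attempts: list[str], test_src: str) -> list[tuple[str, str]]:
    """Pair each failing attempt with each passing attempt for one problem."""
    results = [(src, passes(src, test_src)) for src in attempts]
    broken = [src for src, ok in results if not ok]
    working = [src for src, ok in results if ok]
    return list(itertools.product(broken, working))


# Hypothetical self-written problem and tests (stand-ins for model output).
tests = "assert abs_diff(5, 3) == 2\nassert abs_diff(3, 5) == 2"
attempts = [
    "def abs_diff(a, b):\n    return a - b",       # fails the second assert
    "def abs_diff(a, b):\n    return abs(a - b)",  # passes both asserts
]
pairs = mine_pairs(attempts, tests)
print(len(pairs))  # 1
```

Running model-written code with exec in the current process is unsafe and has no protection against infinite loops. A real pipeline would more plausibly run each attempt in a sandboxed subprocess with a timeout.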

First Results on Qwen 2.5

According to the researcher's GitHub repository, the technique was tested across multiple model families. Starting with Qwen 2.5 7B base, which solved 25 of HumanEval's 164 problems, the model reached 112 after training on its own mined pairs, a gain of 87 problems. Scaling up to Qwen 2.5 14B base, a 95-minute H100 run costing just $3.50 in cloud credits lifted the model to within 4 points of the Qwen team's own RLHF-tuned version.
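For readers who think in percentages, the counts above convert as follows. The snippet uses only figures quoted in this article; the helper name pass_rate is introduced here for illustration, and the hourly rate is simply the quoted cost divided by the quoted run time.

```python
HUMANEVAL_PROBLEMS = 164  # benchmark size quoted in the article


def pass_rate(solved: int, total: int = HUMANEVAL_PROBLEMS) -> float:
    """Share of benchmark problems solved, as a percentage."""
    return 100.0 * solved / total


before = pass_rate(25)    # Qwen 2.5 7B base
after = pass_rate(112)    # same model after training on mined pairs
gain = 112 - 25           # problems gained

# Implied hourly price of the 14B run: $3.50 for 95 minutes.
hourly_rate = 3.50 / (95 / 60)

print(f"before: {before:.1f}%  after: {after:.1f}%  gain: {gain} problems")
# before: 15.2%  after: 68.3%  gain: 87 problems
print(f"implied H100 rate: ${hourly_rate:.2f}/hour")
# implied H100 rate: $2.21/hour
```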

Control Experiment Validates the Method

To verify the signal wasn't noise, the researcher ran a control experiment: training on fake pairs of random garbage code that didn't pass any tests. The score stayed flat at 25 out of 164. 'The model wasn't getting smarter from generic training,' the researcher wrote. 'It was getting smarter specifically from training on its own mistakes and corrections.'

Results Across Model Families

The recipe proved robust across different model families. When applied to Meta's Llama 3.2 3B, HumanEval scores rose from 39 to 43. Qwen 2.5 Coder 7B, already a code-specialized model, saw a small lift from 83 to 87. Even Qwen 3 4B, a newer generation, jumped from 79 to 106 on HumanEval and from 135 to 148 on MBPP.

Math Performance with Adaptive Difficulty

The researcher then adapted the method for math, using SymPy, a Python library for symbolic math, to check answers instead of running unit tests. The initial attempt failed because the model generated trivial arithmetic problems. A twist was added: when the model solved a problem on every try, the next problem had to be harder; when it kept failing, the next had to be easier. This adaptive difficulty gradually pushed the model toward problems at the edge of its ability.
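The difficulty rule described above amounts to a small controller. The sketch below is one possible reading, not the researcher's implementation: the integer difficulty scale from 1 to 10, the function name next_difficulty, and the interpretation of "kept failing" as failing every try are assumptions.

```python
def next_difficulty(level: int, solved: int, tries: int,
                    lo: int = 1, hi: int = 10) -> int:
    """Raise difficulty after a clean sweep, lower it after a total miss."""
    if tries <= 0:
        raise ValueError("tries must be positive")
    if solved == tries:   # solved on every try: problem was too easy
        return min(level + 1, hi)
    if solved == 0:       # failed on every try: problem was too hard
        return max(level - 1, lo)
    return level          # mixed results: near the edge of ability, stay put


# Hypothetical run: (solved, tries) outcomes for four consecutive problems.
level = 3
history = []
for solved, tries in [(4, 4), (4, 4), (2, 4), (0, 4)]:
    level = next_difficulty(level, solved, tries)
    history.append(level)
print(history)  # [4, 5, 5, 4]
```

Mixed outcomes are the useful ones for mining, since they yield both a wrong and a right attempt for the same problem, which is why the controller holds the level steady in that case.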

The result, as the researcher notes, is striking: 'A 3B model, trained on 13 math problems it wrote for itself, beats the version of ChatGPT that broke the internet in 2022.' On GSM8K, a benchmark of grade-school math word problems, the self-trained model outperformed GPT-3.5.

Limitations and Threshold Effects

However, the approach has clear limitations. The recipe does not work on already-strong models like Qwen 3 8B or Qwen 2.5 72B, which have too few wrong attempts to mine from. It also fails on too-weak models like OLMo 2 7B, which cannot produce enough correct answers to generate useful training pairs. Additionally, training on code does not transfer to math, and HumanEval-style problems do not generalize to real-world Python libraries like pandas.

Fine-Tuning Versus Test-Time Sampling

Perhaps the most surprising finding concerns the interaction between fine-tuning and test-time sampling. The researcher expected that training would make the model better, and that sampling—asking the model multiple times and keeping the answer that passes the tests—would compound the gains. But at 36 mined pairs, training and sampling fought each other. The fine-tuning narrowed the model's output diversity so much that sampling lost the variety that made it useful.
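A toy probability model shows how training and sampling can pull against each other. It is an illustration only, not the researcher's analysis: the per-sample pass rates and the sample count are invented, and the "collapsed" case assumes the extreme where every sample is an identical copy.

```python
def p_any_pass_independent(p: float, k: int) -> float:
    """Chance that at least one of k independent samples passes the tests."""
    return 1.0 - (1.0 - p) ** k


def p_any_pass_collapsed(p: float, k: int) -> float:
    """Same chance when all k samples are identical: extra samples add nothing."""
    return p


# Hypothetical numbers, chosen only to show the effect.
base_p, tuned_p, k = 0.30, 0.50, 8

base_with_sampling = p_any_pass_independent(base_p, k)
tuned_collapsed = p_any_pass_collapsed(tuned_p, k)

print(f"base model, {k} diverse samples:    {base_with_sampling:.2f}")  # 0.94
print(f"tuned model, {k} identical samples: {tuned_collapsed:.2f}")     # 0.50
```

In this toy setup the fine-tuned model is better on a single try (0.50 against 0.30), yet the untuned model wins once eight varied samples are allowed. Real models sit somewhere between the two extremes, which is consistent with the idea of a threshold rather than a fixed rule.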

When Fine-Tuning Hurts

'There's a threshold,' the researcher wrote. 'If you have a small dataset, you might be better off not fine-tuning and just sampling from the base. The standard advice, "always fine-tune when you can," is wrong below the threshold.' This finding, which the researcher describes as the one they most want other researchers to test and try to break, challenges conventional wisdom in the field.

Critical Failure in Math Training

Another critical failure: training on pairs of (wrong answer, corrected answer) for math destroyed the model. Qwen 3 4B dropped from 60% to 14% on MATH-500. The model learned to always doubt itself, even when it was right. The fix was to mix in examples where a correct answer stays correct.
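The fix can be pictured as a data-mixing step. The sketch below is an assumption-laden illustration: the function name build_math_training_set, the keep_fraction parameter, and the 50/50 default are hypothetical, since the article does not state what ratio the researcher used.

```python
import random


def build_math_training_set(corrections, confirmations,
                            keep_fraction=0.5, seed=0):
    """Mix (wrong -> corrected) pairs with (correct -> unchanged) pairs.

    keep_fraction is the target share of 'stays correct' examples.
    """
    if not 0.0 <= keep_fraction < 1.0:
        raise ValueError("keep_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Confirmations needed so they make up keep_fraction of the final mix.
    n_keep = round(len(corrections) * keep_fraction / (1.0 - keep_fraction))
    n_keep = min(n_keep, len(confirmations))
    mixed = list(corrections) + rng.sample(list(confirmations), n_keep)
    rng.shuffle(mixed)
    return mixed


# Hypothetical examples: (first answer, final answer).
corrections = [("2+2=5", "2+2=4"), ("3*3=6", "3*3=9")]
confirmations = [("7-2=5", "7-2=5"), ("10/2=5", "10/2=5"), ("1+1=2", "1+1=2")]

mixed = build_math_training_set(corrections, confirmations)
unchanged = sum(1 for first, final in mixed if first == final)
print(len(mixed), unchanged)  # 4 2
```

Without the unchanged examples, every training pair teaches the model that its first answer was wrong, which matches the self-doubt failure described above.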

The researcher has released all code and reproduction guides on GitHub under the repository tinyforge-zero, along with adapter weights on Hugging Face. A paper is pending on arXiv once moderation clears. The work builds on existing tools like the HumanEval-infilling benchmark from OpenAI and the ai-toolkit by ostris, though the researcher noted that switching between training and sampling modes in that toolkit can cause errors—a known issue documented in GitHub issue #409.

Other researchers have previously examined the mismatch between HumanEval performance and real-world coding tasks, for example in the NaturalCodeBench study published on arXiv. The solo researcher's findings echo that concern.

Despite the limitations, the core discovery stands: small models can improve by training on their own mistakes, with reported results of 80% on HumanEval and a GSM8K score above GPT-3.5, all without human-annotated data. For a field increasingly dominated by billion-dollar training runs, this low-cost recipe opens a door for independent researchers and small teams to push the boundaries of self-improving AI.

Sources: github.com • arxiv.org • github.com
