Makes sense. This also tracks with the research on human-AI collaboration. A single model converges to the mean of its training distribution, but adversarial multi-model setups break that pattern because each model's blind spots are different.
I wrote about why single-model AI has a structural quality ceiling and why ensemble/hybrid approaches consistently outperform: https://philippdubach.com/posts/the-impossible-backhand/
I did the exact same thing! Uncanny.
I agree that models are better at different tasks: gemini-cli is superficial, codex is stubborn as a mule but dependable, and claude-cli just wants to get something working and call it done. qwen-cli, and Qwen in general, has a tendency to oscillate too much.
I also reduced the team to two for myself: codex and claude.
Agree with this. I have Codex do analysis and feedback for Claude Code. For whatever reason, Claude Code seems to produce working code more often, but it has blind spots in its own analysis that Codex does a good job of picking up. The two together feel like a step up in the state of the art.
I need a tool to put them in a loop together to get this done more efficiently…I guess I’ll plug this in as a prompt and go from there!
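A minimal sketch of that loop in Python: one model drafts, the other reviews, and the feedback is fed back into a revision. The CLI flags shown in the usage note (`claude -p`, `codex exec` for non-interactive one-shot mode) are assumptions about the installed tools; the loop itself takes plain callables, so you can swap in whatever invocation your versions support.

```python
import subprocess

def run_cli(cmd, prompt):
    # Invoke a coding CLI in one-shot mode and capture its reply.
    # cmd is the command prefix, e.g. ["claude", "-p"] -- the exact
    # flags are an assumption about each tool; adjust as needed.
    result = subprocess.run(cmd + [prompt], capture_output=True, text=True)
    return result.stdout.strip()

def review_loop(task, worker, reviewer, rounds=2):
    # worker drafts and revises; reviewer critiques each draft.
    # Both are callables (prompt -> text), so the loop is testable
    # without either CLI installed.
    draft = worker(f"Implement: {task}")
    for _ in range(rounds):
        feedback = reviewer(f"Review this for bugs and blind spots:\n{draft}")
        draft = worker(
            f"Task: {task}\nCurrent draft:\n{draft}\n"
            f"Revise per this feedback:\n{feedback}"
        )
    return draft
```

To wire in the actual tools, something like `review_loop(task, worker=lambda p: run_cli(["claude", "-p"], p), reviewer=lambda p: run_cli(["codex", "exec"], p))` should work, assuming those one-shot flags match your installs.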