• apapalns 3 hours ago

    > codebase with hundreds of thousands of lines of code and go from 0% to 80%+ coverage in the next few weeks

    I had a coworker do this with windsurf + manual driving awhile back and it was an absolute mess. Awful tests that were unmaintainable and next to useless (too much mocking, testing that the code “works the way it was written”, etc.). Writing a useful test suite is one of the most important parts of a codebase and requires careful deliberate thought. Without deep understanding of business logic (which takes time and is often lost after the initial devs move on) you’re not gonna get great tests.

    To be fair to AI, we hired a “consultant” that also got us this same level of testing so it’s not like there is a high bar out there. It’s just not the kind of problem you can solve in 2 weeks.

    • simonw 2 hours ago

      I find coding agents can produce very high quality tests if and only if you give them detailed guidance and good starting examples.

      Ask a coding agent to build tests for a project that has none and you're likely to get all sorts of messy mocks and tests that exercise internals when really you want them to exercise the top level public API of the project.

      Give them just a few starting examples that demonstrate how to create a good testable environment without mocking and test the higher level APIs and they are much less likely to make a catastrophic mess.

      You're still going to have to keep an eye on what they're doing and carefully review their work though!

      • Vinnl an hour ago

        I feel like that leaves me with the hard part of writing tests, and only saves me the bit I can usually power through quickly because it's easy to get into a flow state for it.

        • cortesoft 2 hours ago

          > I find coding agents can produce very high quality tests if and only if you give them detailed guidance and good starting examples.

          I find this to be true for all AI coding, period. When I have the problem fully solved in my head, and I write the instructions to explicitly and fully describe my solution, the code that is generated works remarkably well. If I am not sure how it should work and give more vague instructions, things don't work so well.

          • throwup238 2 hours ago

            I think they're also much better at creating useful end-to-end UI tests than unit or integration tests, but unfortunately those are hard to create self-contained environments for without bringing a lot of baggage and Docker containers, which not all agent VMs might support yet. Getting headless Qt running was a pain too, but now ChatGPT Codex can see screenshots and show them in chat (Claude Code can't show them in the chat for some reason) and it's been generating much better end-to-end tests than I've seen for unit/integration.

            • omgbear 2 hours ago

              Left to its own devices, I found Claude liked to copy the code under test into the test files to 'remove dependencies' :/

              Or would return early from playwright tests when the desired targets couldn't be found instead of failing.

              But I agree that with some guidance and a better CLAUDE.md, it can work well!

              • anandchowdhary an hour ago

                Indeed the case - luckily my codebase had some tests already and a pretty decent CLAUDE.md file so I got results I’m happy with.

              • LASR an hour ago

                There is no free lunch. The amount of prompt writing to give the LLM enough context about your codebase etc is comparable to writing the tests yourself.

                Code assistance tools might speed up your workflow by maybe 50% or even 100%, but it's not the geometric scaling that is commonly touted as the benefits of autonomous agentic AI.

                And this is not a model capability issue that goes away with newer generations. But it's a human input problem.

                • anandchowdhary an hour ago

                  I don't know if this is true.

                  For example, you can spend a few hours writing a really good set of initial tests that cover 10% of your codebase, and another few hours with an AGENTS.md that gives the LLM enough context about the rest of the codebase. But after that, there's a free* lunch because the agent can write all the other tests for you using that initial set and the context.

                  This also works with "here's how I created the Slack API integration, please create the Teams integration now" because it has enough to learn from, so that's free* too. This kind of pattern recognition means that prompting is O(1) but the model can do O(n) from that (I know, terrible analogy).

                  *Also literally becomes free as the cost of tokens approaches zero

                  • jaredsohn 30 minutes ago

                    A neat part of this is it mimics how people get onboarded onto codebases. People usually aren't figuring out how to write tests from scratch; they look at the current best practices for similar functionality in the codebase and start there. And then as they continue to work there they try to influence new best practices.

                • PunchyHamster an hour ago

                  Cleanroom design of "this is a function's interface, it does this and that, write tests for that function to pass" generally can get you pretty decent results.

                  But "throw a vague prompt in the AI's direction" does about as well as doing the same thing with an intern.

                  • id00 2 hours ago

                    I agree. It is very easy to fall into the trap of "I let AI write all the tests" and then find yourself with an unmaintainable mess where the only way to fix a broken test within a reasonable time is to blindly let the AI do it. That exposes you to a similar level of risk as running any unchecked AI code: you just can't trust that it works correctly.

                    • piker 2 hours ago

                      "My code isn't working. I know, I'll have an AI write my unit tests." Now you have two problems.

                    • cpursley 2 hours ago

                      Which language? I've found Claude very good at Elixir test coverage (surprisingly) but a dumpster fire with any sort of JS/TS testing.

                    • janaagaard 3 hours ago
                      • jdc0589 2 hours ago

                        I'm not saying OP did this, but I've actually had AI spit out some pretty stellar bash scripts, surprisingly

                        • anandchowdhary 2 hours ago

                          No, you're right. It was a pretty collaborative effort with me and Claude!

                          • svieira an hour ago

                            FYI, you're missing two patterns that allow the `--key=value` admirers and the `-alltheshortopsinasinglestring` spacebar savers among us to be happy (for the otherwise excellent options parsing code).

                               shopt -s extglob
                               case "$1" in
                                 # Flag support - allow -xyz z-takes-params
                                 -@(a|b|c)*) _flag=${1:1:1}; _rest=${1:2}; shift; set -- "-$_flag" "-$_rest" "$@";;
                                 # Param=Value support
                                 -?(-)*=*) _key=${1%%=*}; _value=${1#*=}; shift; set -- "${_key}" "$_value" "$@";;
                               esac
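
To see how the two patterns above behave inside a full parsing loop, here is a minimal sketch. The `parse_args` function, the flags `-a`, `-b`, `-c`, and `--name` are all hypothetical and chosen only for illustration; they are not taken from the project being discussed. The sketch assumes the exact-match arms come before the splitting arms, so a lone `-a` is consumed before the combined-flag pattern can match it again.

```shell
#!/usr/bin/env bash
# extglob must be enabled before the function is parsed, because the
# @(...) and ?(...) patterns below are extended globs.
shopt -s extglob

# Hypothetical parser: -a and -b are booleans, -c and --name take a value.
parse_args() {
  local a=0 b=0 c="" name="" _flag _rest _key _value
  while (($#)); do
    case "$1" in
      # Exact matches first, so single flags never reach the splitter.
      -a) a=1; shift ;;
      -b) b=1; shift ;;
      -c|--name)
        [[ $# -ge 2 ]] || { echo "missing value for $1" >&2; return 1; }
        if [[ $1 == -c ]]; then c=$2; else name=$2; fi
        shift 2 ;;
      # Combined short flags: -abc becomes -a -bc, then -b -c.
      -@(a|b|c)*) _flag=${1:1:1}; _rest=${1:2}; shift; set -- "-$_flag" "-$_rest" "$@" ;;
      # --key=value (or -k=value) becomes two separate words.
      -?(-)*=*) _key=${1%%=*}; _value=${1#*=}; shift; set -- "$_key" "$_value" "$@" ;;
      *) echo "unknown argument: $1" >&2; return 1 ;;
    esac
  done
  echo "a=$a b=$b c=$c name=$name"
}
```

Calling `parse_args -abc 7 --name=demo` rewrites the argument list step by step until only exact flags remain, so the later arms never need to know about the combined or `=` forms.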
                            • anandchowdhary an hour ago

                              Thanks for letting me know! Would you like to create a PR? Otherwise I'll add you as a Co-Authored-By!

                      • namanyayg 2 hours ago

                        Exactly what I needed! I might use it for test coverage on an ancient project I need to improve...

                        • decide1000 2 hours ago

                          How does it handle questions asked by Claude?

                          • anandchowdhary 2 hours ago

                            It sends a flag that dangerously allows Claude to just do whatever it wants and only give us the final answer. It doesn't do the back-and-forth or ask questions.

                            • CharlesW an hour ago

                              The `--dangerously-skip-permissions` flag (a.k.a. "YOLO mode") does do the back-and-forth and asks questions, so this is a bit more than that.

                              • brumar an hour ago

                                Yes. I did not look, but most probably the non-interactive mode flag is used (-p).

                                • anandchowdhary 44 minutes ago

                                  It does `claude -p "This is the prompt" --dangerously-skip-permissions --output-format json`
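
As a rough sketch of how a script might assemble that invocation safely, the snippet below builds the command as a bash array so a prompt containing spaces or quotes stays a single argument. The helper names `build_claude_cmd` and `run_once` and the `DRY_RUN` variable are hypothetical and not taken from the tool under discussion; only the three flags quoted in the comment above are used.

```shell
#!/usr/bin/env bash

# Hypothetical helper: store the invocation in the CLAUDE_CMD array.
# An array keeps the prompt as one argument regardless of its content.
build_claude_cmd() {
  local prompt=$1
  CLAUDE_CMD=(claude -p "$prompt" --dangerously-skip-permissions --output-format json)
}

# Hypothetical runner: with DRY_RUN=1 it prints the shell-quoted command
# instead of executing it, which is handy for inspecting what would run.
run_once() {
  build_claude_cmd "$1"
  if [[ ${DRY_RUN:-0} == 1 ]]; then
    printf '%q ' "${CLAUDE_CMD[@]}"
    printf '\n'
  else
    "${CLAUDE_CMD[@]}"
  fi
}
```

Running `DRY_RUN=1 run_once "This is the prompt"` prints the command without calling `claude`, so the quoting can be checked before anything is executed with permissions skipped.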

                                  • CharlesW 39 minutes ago

                                    Oh! TIL, thank you.

                            • leobg 3 hours ago

                              Missed opportunity to call it Claude Incontinent (CLI).

                                  • RonanSoleste an hour ago

                                    "Nobody will ever know what it does, what it was used for or why it is there. But it is a comprehensive application that does things. We might need it for something so we better leave it running even though it's a drain on the finances"