One trend I've noticed, framed as a logical deduction:
1. Coding assistants based on o1 and Sonnet are pretty great at coding with <50k context, but degrade rapidly beyond that.
2. Coding agents do massively better when they have a test-driven reward signal.
3. If a problem can be framed in a way that a coding agent can solve, that speeds up development at least 10x from the base case of human + assistant.
4. From (1)-(3), if you can get all the necessary context into 50k tokens and measure progress via tests, you can speed up development by 10x.
5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.
Sure enough, I see HN projects evolving in that direction.
> 3. If a problem can be framed in a way that a coding agent can solve...
This reminds me of the South Park underwear gnomes. You picked a tool and set an expectation, then just kind of hand wave over the hard part in the middle, as though framing problems "in a way coding agents can solve" is itself a well-understood or bounded problem.
Does it sometimes take 50x effort to understand a problem and the agent well enough to get that done? Are there classes of problems where it can't be done? Are either of those concerns something you can recognize before they impact you? At commercial quality, is it an accessible skill for inexperienced people or do you need a mastery of coding, the problem domain, or the coding agent to be able to rely on it? Can teams recruit people who can reliable achieve any of this? How expensive is that talent? etc
>as though framing problems "in a way coding agents can solve" is itself a well-understood or bounded problem.
It's not, but if you can A) make it cheap to try out different types of framings - not all of them have to work and B) automate everything else then the labor intensity of programming decreases drastically.
>At commercial quality, is it an accessible skill for inexperienced people
I'd expect the opposite, it would be an extremely inaccessible skill requiring high skill and high pay. But, if 2 people can deliver as much as 15 people at a higher quality and they're paid triple, it's still way cheaper overall.
I would still expect somebody following this development pattern to routinely discover a problem the LLM can't deal with and have to dive under the hood to fix it - digging down below multiple levels of abstraction. This would be Hard with a capital H.
We've had failed projects since long before LLMs. I think there is a tendency for people to gloss over this (3.) regardless, but working with an LLM it tends to become obvious much more quickly, without investing tens/hundreds of person-hours. I know it's not perfect, but I find a lot of the things people complain about would've been a problem either way - especially when people think they are going to go from 'hello world' to SaaS-billionaire in an hour.
I think mastery of the problem domain is still important, and until we have effectively infinite context windows (that work perfectly), you will need to understand how and when to refactor to maximize quality and relevance of data in context.
well according to xianshou's profile they work in finance so it makes sense to me that they would gloss over the hard part of programming when describing how AI is going to improve it
Working in one domain does not preclude knowledge of others. I work in cybersec but spent my first working decade in construction estimation for institutional builds. I can talk confidently about firewalls or the hospital you want to build.
No need to make assumptions based on a one-line hacker news profile.
> 5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.
Not necessarily. You can get the same benefits you described in (1)-(3) by using clearly defined modules in your codebase, they don't need to be separate microservices.
I wonder if we'll see a return of the kind of interface file present in C++, Ocaml, and Ada. These files, well commented, are naturally the context window to use for reference for a module.
Even if languages don't grow them back as a first class feature, some format that is auto generated from the code and doesn't include the function bodies is really what is needed here.
Python (which I mention because it is the preferred language of LLM output) has grown stub files that would work for this:
https://peps.python.org/pep-0484/#stub-files
I guess that this usecase would be an argument to include docstrings in your Python stub files, which I hadn’t considered before.
Agreed. If the microservice does not provide any value from being isolated, it is just a function call with extra steps.
I think the argument is that the extra value provided is a small enough context window for working with an LLM. Although I'd suggest making it a library if one can manage, that gives you the desired context reduction bounded by interfaces without taking on the complexities of adding an additional microservice.
I imagine throwing a test at an LLM and saying:
> hold the component under test constant (as well as the test itself), and walk the versions of the library until you can tell me where they're compatible and where they break.
If you tried to do that with a git bisect and everything in the same codebase, you'd end up varying all three (test, component, library) which is worse science than holding two constant and varying the third would be.
> I think the argument is that the extra value provided is a small enough context window for working with an LLM.
I'm not sure moving something that could work as function to a microservice would save much context. If anything, I think you are adding more context, since you would need to talk about the endpoint and having it route to the function that does what you need. When it is all over, you need to describe what the input and output is.
Oh certainly. I was arguing that if you need more isolation than a function gives you, don't jump to the conclusion that you need a service. Consider a library as a middle ground.
Yeah, I think monorepos will be better for LLMs. Easier to refactor module boundaries as context grows or requirements change.
But practices like stronger module boundaries, module docs, acceptance tests on internal dev-facing module APIs, etc are all things that will be much more valuable for LLM consumption. (And might make things more pleasant for humans too!)
So having clear requirements, a focused purpose for software, and a clear boundary of software responsibility makes for a software development task that can be accomplished?
If only people had figured out at some point that the same thing applies when communicating to human software engineers.
If human software engineers refused to work unless those conditions were met, what a wonderful world it would be.
They do implicitly: you can only be accidentally productive without those preconditions.
> you can speed up development by 10x.
If you know what you are doing, then yes. If you are a domain expert and can articulate your thoughts clearly in a prompt, you will most likely see a boost—perhaps two to three times—but ten times is unlikely. And if you don't fully understand the problem, you may experience a negative effect.
I think it also depends on how much yak-shaving is involved in the domain, regardless of expertise. Whether that’s something simple like remembering the right bash incantation or something more complex like learning enough Terraform and providers to be able to spin up cloud infrastructure.
Some projects just have a lot of stuff to do around the edges and LLMs excel at that.
You don't need microservices for that, just factor your code into libraries that can fit into the context window. Also write functions that have clear inputs and outputs and don't need to know the full state of the software.
This has always been good practice anyway.
> Coding assistants based on o1 and Sonnet are pretty great at coding with <50k context, but degrade rapidly beyond that.
I had a very similar impression (wrote more in https://hua.substack.com/p/are-longer-context-windows-all-yo...).
One framing is that effective context window (i.e. the length that the model is able to effectively reason over) determines how useful the model is. A human new grad programmer might effectively reason over 100s or 1000s of tokens but not millions - which is why we carefully scope the work and explain where to look for relevant context only. But a principal engineer might reason over many many millions of context - code yes, but also organizational and business context.
Trying to carefully select those 50k tokens is extremely difficult for LLMs/RAG today. I expect models to get much longer effective context windows but there are hardware / cost constraints which make this more difficult.
> microservices written from scratch and interacting via cleanly defined APIs.
Introducing network calls because why? How about just factoring a monolith appropriately?
50K context is an interesting number because I think there's a lot to explore with software within an order of magnitude that size. With apologies to Richard Feynman, I call it, "There's plenty of room in the middle." My idea there is the rapid expansion of computing power during the reign of Moore's law left the design space of "medium sized" programs under-explored. These would be programs in the range of 100's of kilobytes to low megabytes.
It doesn't have to be microservices. You can use modular architecture. You can use polylith. You can have boundaries in your code and mock around them.
This is a helpful breakdown of a trend, thank you
Might be a boon for test-driven development. Could turn out that AI coding is the killer app for TDD. I had a similar thought about a year ago but had forgotten, appreciate the reminder
Hey I reached out on twitter to chat :)
> 5. Therefore all new development should be ~~microservices~~ modules written from scratch and interacting via cleanly defined APIs.
We figured this out for humans almost 20 years ago. Some really good empirical research. It's the only approach to large scale software development that works.
But it requires leadership that gives a shit about the quality of their product and value long-term outcomes over short-term rewards.
By large scale do you mean large software or large amounts of developers? Because there's some absolutely massive software in terms of feature set, usefulness and even LoC (not that that is a useful measurement) etc out there made by very small teams.
I'm not sure that you've got the causal relationship the right way around here re: architecture:team size.
What does team size have to do with this? Small teams can (and should) absolutely build modularized software ...
You simply cannot build a [working/maintainable] large piece of software if everything is connected to everything and any one change may cause issues in conceptually unrelated pieces of code. As soon as your codebase is bigger than what you can fully memorize, you need modules, separation of concerns, etc.
Sure I agree with that, but microservices are just one of many ways to modularize software/achieve separation of concerns.
I assumed you were talking about team size specifically because that is the thing that a microservice architecture uniquely enables in my experience.
I think you might be missing that Swizec edited the quote, crossing out microservices and correcting it to modular systems. It seems to me you're both in violent agreement.
Ahh, the strike through doesn't render on mobile. Yes, I think we are just agreeing with each other.
On a similar note, has anyone found themselves absolutely not trusting non-code LLM output?
The code is at least testable and verifiable. For everything else I am left wondering if it's the truth or a hallucination. It incurs more mental burden that I was trying to avoid using LLM in the first place.
Absolutely. LLMs are a "need to verify" the results almost always. LLMs (for me) shine by pointing me in the right direction, getting a "first draft", or for things like code where I can test it.
It is really the only safe way to use it IMHO.
Even in most simple forms of automation, humans suffer from Automation Bias and Complacency and one of the better ways to avoid those issues is to instill a fundamental mistrust of those systems.
IMHO it is important to look at other fields and the human factors studies to understand this.
As an example ABS was originally sold as a technology that would help you 'stop faster'. Which it may do in some situations, and it is obviously mandatory in the US. But they had to shift how they 'sell' it now, to ensure that people didn't rely on it.
https://www.fmcsa.dot.gov/sites/fmcsa.dot.gov/files/docs/200...
2.18 – Antilock Braking Systems (ABS)
ABS is a computerized system that keeps your wheels from locking up during hard brake applications.
ABS is an addition to your normal brakes. It does not decrease or increase your normal braking capability. ABS only activates when wheels are about to lock up.
ABS does not necessarily shorten your stopping distance, but it does help you keep the vehicle under control during hard braking.
Transformers will always produce code that doesn't work, it doesn't matter if that is due to what they call hallucinations, Rice's theory, etc...Maintaining that mistrust is the mark of someone who understands and can leverage the technology. It is just yet another context specific tradeoff analysis that we will need to assess.
I think forcing people into the quasi-TDD thinking model, where they focus on what needs to be done first vs jumping into the implementation details will probably be a positive thing for the industry, no matter where on the spectrum LLM coding assistants arrive.
That is one of the hardest things to teach when trying to introduce TDD, focusing on what is far closer to an ADT than implementation specific unit tests to begin with is very different but very useful.
I am hopeful that required tacit experience will help get past the issues with formal frameworks that run into many barriers that block teaching that one skill.
As LLM's failure mode is Always Confident, Often Competent, and Inevitably Wrong, it is super critical to always realize the third option is likely and that you are the expert.
Agree. My biggest pain point with LLM code review tools is that they sometimes add 40 comments for a PR changing 100 lines of code. Gets noisy and hard to decipher what really matters.
Along the lines of verifiability, my take is that running a comprehensive suite of tests in CI/CD is going to be table stakes soon given that LLMs are only going to be contributing more and more code.
> On a similar note, has anyone found themselves absolutely not trusting non-code LLM output?
I'm working on a LLM chat app that is built around mistrust. The basic idea is that it is unlikely a supermajority of quality LLMs can get it wrong.
This isn't foolproof though, but it does provide some level of confidence in the answer.
Here is a quick example in which I analyze results from multiple LLMs that answered, "When did Homer Simpson go to Mars?"
https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...
If you look at the yes and no table, all except GPT-4o and GPT-4o mini said no. After asking GPT-4o who was correct, it provided "evidence" on an episode so I asked for more information on that episode. Based on what it said, it looks like the mission to Mars was a hoax and when I challenged GPT-4o on this, it agreed and said Homer never went to Mars, like others have said.
I then asked Sonnet 3.5 about the episode and it said GPT-4o misinterpreted the plot.
https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...
At this point, I am confident (but not 100% sure) Homer never went to Mars and if I really needed to know, I'll need to search the web.
It's the backwards reasoning that really frustrates me when using LLMs. You ask a question, it says sure do these things, they don't work out and you ask the LLM why not, and it replies yes that thing I told you to do wouldn't work because of these clear reasons.
It would be nice to start at the end of that chain of reasoning instead of the other side.
Another regular example is when it "invents" functions or classes that don't exist, when pressed about them, it will reply of course that won't work, that function doesn't exist.
Okay great, so don't tell me it does with such certainty, is what I would tell a human feeding me imagination as facts all the time. But of course an LLM is not reasoning in the same sense, so this reverse chain of thought is the outcome.
I am finding LLMs far more useful for soft skill topics than engineering type work, simply because of how often it leads me down a path that is eventually a dead end, because of some small detail that was wrong at the very beginning.
> I am finding LLMs far more useful for soft skill topics than engineering type work, simply because of how often it leads me down a path that is eventually a dead end, because of some small detail that was wrong at the very beginning.
Yeah I felt the same way in the beginning which is why I ended up writing my own chat app. What I've found while developing my spelling and grammar checker is that it is very unlikely for multiple LLMs to mess up at the same time. I know they will mess up, but I'm also pretty sure they won't at the same time.
So far, I've been able to successfully create working features that actually saved me time by pitting LLMs against their own responses and others. My process right now is, I'll ask 6+ models to implement something and then I will ask models to evaluate everyone's responses. More often than not, a model will find fault or make a suggestion that can be used to improve the prompt or code. And depending on my confidence level, I might repeat this a couple of times.
The issue right now is tracking this "chain of questioning" which is why I am writing my own chat app. I need an easy way to backtrack and fork from different points in the "chain of questioning". I think once we get a better understanding of what LLMs can and can't do as a group, we should be able to produce working solutions easier.
I believe that this is what chain of thought models attempt to address.
Isn't this essentially making the point of the post above you?
For comparison - if I just do a web search for "Did homer simpson go to mars" I get immediately linked to the wikipedia page for that exact episode (https://en.wikipedia.org/wiki/The_Marge-ian_Chronicles), and the plot summary is less to read than your LLM output - It clearly summarizes that Marge & Lisa (note - NOT homer) almost went to mars, but did not go. Further - the summary correctly includes the outro which does show Marge and Lisa on mars in the year 2051.
Basically - for factual content, the LLM output was a garbage game of telephone.
> Isn't this essentially making the point of the post above you?
Yes. This is why I wrote the chat app, because I mistrust LLMs, but I do find them extremely useful when you approach them with the right mindset. If answering "Did Homer Simpson go to Mars?" correctly is critical, then you can choose to require a 100% consensus, otherwise you will need a fallback plan.
When I asked all the LLMs about the Wikipedia article, they all correctly answered "No" and talked about Marge and Lisa in the future without Homer.
Relatedly, asking LLMs what happens in a TV episode, or a series in general, I usually get very low quality and mostly flat out wrong answers. That baffles me, as I thought there are multiple well structured synopses for any TV series in the training data.
Yes, it is good for suumarizing existing text, explaining something or coding; in short any generative/transformative tasks. Not good for information retrieval. Having said that even tiny Qwen 3b/7b coding llms turned out to be very useful in my use experience.
You're going to fall behind eventually, if you continue to treat LLMs with this level of skepticism, as others won't, and the output is accurate enough that it can be useful to improve the efficiency of work in a great many situations.
Rarely are day-to-day written documents (e.g. an email asking for clarification on an issue or to schedule an appointment) of such importance that the occasional error is unforgivable. In situations where a mistake is fatal, yes I would not trust GenAI. But how many of us really work in that kind of a field?
Besides, AI shines when used for creative purposes. Coming up with new ideas or rewording a paragraph for clarity isn't something one does blindly. GenAI is a coworker, not an authority. It'll generate a draft, I may edit that draft or rewrite it significantly, but to preclude it because it could error will eventually slow you down in your field.
You’re narrowly addressing LLM use cases & omitting the most problematic one - LLMs as search engine replacements.
That's the opposite of problematic, that's where an LLM shines. And before you say hallucination, when was the last time you didn't click the link in a Google search result? It's user error if you don't follow up with additional validation, exactly as you would with Google. With GenAI it's simply easier to craft specific queries.
We need a hallucination benchmark.
My experience is, o1 is very good at avoiding hallucinations and I trust it more, but o1-mini and 4o are awful.
Well given the price $15.00 / 1M input tokens and $60.00 / 1M output* tokens, I would hope so. Given the price, I think it is fair to say it is doing a lot of checks in the background.
It is expensive. But if I'm correct about o1, it means user mistrust of LLMs is going to be a short-lived thing as costs come down and more people use o1 (or better) models as their daily driver.
> mistrust of LLMs is going to be a short-lived thing as costs come down and more people use o1
I think the biggest question is, is o1 scalable. I think o1 does well because it is going back and forth hundreds if not thousands of times. Somebody mentioned in a thread that I was participating in that they let o1 crunch things for 10 minutes. It sounded like it saved them a lot work, so it was well worth it.
Whether or not o1 is practical for the general public is something we will have to wait and see.
I'm going to wager "yes" because o3-mini (High) gets equal benchmark scores to o1 despite using 1/3rd as much compute, and because the consistent trend has been towards rapid order-of-magnitude decreases in price for a fixed level of intelligence (trend has many components dovetailing, both hardware and software related). Can't forecast the future, but this would be my bet on a time horizon of < 3 years.
Here's the Go app described in the post: https://github.com/yfzhou0904/tdd-with-llm-go
Example usage from that README (and the blog post):
% go run main.go \
--spec 'develop a function to take in a large text, recognize and parse any and all ipv4 and ipv6 addresses and CIDRs contained within it (these may be surrounded by random words or symbols like commas), then return them as a list' \
--sig 'func ParseCidrs(input string) ([]*net.IPNet, error)'
The all important prompts it uses are in https://github.com/yfzhou0904/tdd-with-llm-go/blob/main/prom...I have yet to see an LLM + TDD essay where the author demonstrates any mastery of Test Driven Development.
Is the label "TDD" being hijacked for something new? Did that already happen? Are LLMs now responsible for defining TDD?
In Rust, there's a controversial practice around putting unit tests in the same file as the actual code. I was put off by it at first, but I'm finding LLM autocomplete is able to be much more effective just being able to see the tests.
No clunky loop needed.
It's gotten me back into TDD.
The benefit of this approach is that you can directly test any function in the same scope without altering its visibility: it implicitly encourages you to test all functions (and design functions in a way they can be tested, as you are writing tests as you write code), not just those part of the public api contract.
Plus you can update tests, code, and comments in one go, with visibility into them at all times.
If the LLM can't complete a task, you add a test the shows it how to do it. This is multishot incontext learning and programming by example.
As for real TDD, you start with the tests and code until they pass. I haven't used an LLM to do this in Rust yet, but in Python due its dynamic nature, it is much simpler.
You can write the tests, then have the LLM sketch the code out enough so that they pass or at least exist enough to pass a linter. Dev tools are going to feel like magic 18 months from now.
I've sometimes done the same in python. I do quite like the ergonomics.
Writing a whole load of tests up front and then coding until all the tests pass is not TDD.
We implemented something similar for our Java backend project based on my rant here: https://testdriven.com/testdriven-2-0-8354e8ad73d7 Works great! I only look at generated code if it passes the tests. Now, can we use LLMs to generate tests from requirements? Maybe, but tests are mostly declarative and are easier to write than production code most of the time. This approach also allows us to use cheaper models, because the tool will automatically tell the model about compile error and failed tests. Usually, we give it up to five attempts to fix the code.
Super interesting approach! We've been working on the opposite - always getting your Unit tests written with every PR. The idea is that you don't have to bother running or writing them, you just get them delivered in your Github repo. You can check it out here https://www.codebeaver.ai
First, I'm a fan of LLMs reducing friction in tests, but I would be concerned with the false sense of confidence here. The demo gif shows "hey I wrote your tests and they pass, go ahead and merge them"
OP makes a valid point
> Now we contend with the “who guards the guard” problem. Because LLMs are unreliable agents, it might so happen that Claude just scammed us by spitting out useless (or otherwise low-effort) test cases. [...] So it’s time to introduce some human input in the form of additional test cases, which is made extra convenient since the model already provided the overall structure of our test. If those cases pass, we can be reasonably confident in integrating this function into our codebase.
In our repos, I would love to have an LLM tool/product that helps out with test writing, but the workflow certainly needs to have some human in the loop for the time being. More like "Here I got you started with test coverage, add a few more of your own" or "Give me a few bullet points of cases that should pass or fail" and review the test code, not "go ahead and merge these tests I wrote for you"
Test driven development is sequenced the way it is for a reason. Getting a failing test first builds confidence that the test is, you know, actually testing something. And the process of writing the tests is often where the largest amount of reasoning about design choices takes place.
Having an LLM generate the tests after you've already written the code for them is super counterproductive. Who knows whether those tests actually test anything?
I know this gets into "I wanted AI to do my laundry, not my art" territory, but a far more rational division of labor is for the humans to write the tests (maybe with the assistance of an autocomplete model) and give those as context for the AI. Humans are way better at thinking of edge cases and design constraints than the models are at this point in the game.
> For best results, our project structure needs to be set up with LLM workflows in mind. Specifically, we should carefully manage and keep the cognitive load required to understand and contribute code to a project at a minimum.
What's the main barrier to doing this all the time? Sounds like a good practice in general.
> What's the main barrier to doing this all the time? Sounds like a good practice in general.
Misunderstanding of what "cognitive load" is. It's not measured in ability for a junior picked off a street at random to understand code they never saw before.
There are at least two components to cognitive load: knowledge and working memory. Human working memory is limited, meaning we can only keep track of so much code in our head at the same time, which sets an upper bound on code complexity we're able to handle. If the problem's inherent complexity is greater than that upper bound, you won't be able to effectively solve it at all.
The point of learning about the domain, learning design patterns, advanced programming techniques, the point of developing languages with powerful features, is to allow people to spend cognitive work up front, learning these techniques and tools, and then forever be able to handle more complexity within their limited working memory. The current zeitgeist of writing dumbest possible code anyone can understand, gets it exactly backwards: the dumber, more junior-friendly the code, the more working memory it takes up, for a fixed amount of problem complexity being addressed. In other words, the dumber code you insist on, the fewer and simpler problems you can solve with it before you hit the hard limits of human brains.
My own belief is that we're already witnessing it all the time - software increasingly sucks and is bug-ridden because industry is trying to save money by making the least experienced people available do most of the actual coding work. Basically, software quality and productivity, as well as the complexity of problems it can address, is being limited by working memory of junior developers.
(The industry doesn't actually stop people from learning - it just forces people who reached a minimum amount of competence and want to earn more money to switch to faux-management, which "senior" and above roles increasingly are.)
It isn't good practice. You don't want people contributing to a project who don't understand the code they submit or the project they're contributing to, because you'll just need to make that up with more effort debugging garbage code. The cognitive load required to actually learn how things work is a necessary filter for minimum effort and quality.
This is not a good idea.
If you want better tests with more cases exercising your code: write property based tests.
Tests form an executable, informal specification of what your software is supposed to do. It should absolutely be written by hand, by a human, for other humans to use and understand. Natural language is not precise enough for even informal specifications of software modules, let alone software systems.
If using LLM's to help you write the code is your jam, I can't stop you, but at least write the tests. They're more important.
As an aside, I understand how this antipathy towards TDD develops. People write unit tests, after writing the implementation, because they see it as boilerplate code that mirrors what the code they're testing already does. They're missing the point of what makes a good test useful and sufficient. I would not expect generating more tests of this nature is going to improve software much.
Edit added some wording for clarity
The confusion in this article about what TDD is demonstrates how far everything has drifted. It's interesting in terms of what it achieves, but I don't think it's useful as a comment on TDD (or, for that matter, testing).
I got massive productivity gains from having an LLM fill out my test suite.
It is like autocomplete and macros... "Based on these two unit tests, fill out the suite considering b, c, and d. Add any critical corner case tests I have missed or suggest them if they don't fit well."
It is on the human to look at the generated test to ensure a) they are comprehensive and b) useful and c) communicate clearly
Can you extend that - what was the domain, how did you start? I would like to give this a try but am not quite sure I get it?
Backend coding for web services.
In the past I would hand write 8 or 9 unit tests. Now I write the first one or two and then brain dump anything else into the LLM prompt. It then outputs mine plus 6 or more.
I delete any that seem low value or ridiculous or have a follow up prompt to ask for refinements. Then just copy/pasta back into the codebase out of the chat.
Can confirm this approach works well for us too.
That simple ? I’ll try it
> 5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.
That's what the software industry has been trying and failing at for more than a decade.
Hey, yeah, this is a fun idea. I built a little toy llm-tdd loop as a Saturday morning side project a little while back: https://github.com/zephraph/llm-tdd.
This doesn't actually work out that well in practice though because the implementations the llm tended to generate were highly specific to pass the tests. There were several times it would cheat and just return hard coded strings that matched the expects of the tests. I'm sure better prompt engineering could help, but it was a fairly funny outcome.
Something I've found more valuable is generating the tests themselves. Obviously you don't wholesale rely on what's generated. Tests can have a certain activation energy just to figure out how to set up correctly (especially if you're in a new project). Having an LLM take a first pass at it and then ensuring it's well structured and testing important codepaths instead of implementation details makes it a lot faster to write tests.
> just return hard coded strings that matched the expects of the tests
I have done literally this in test ping-pong. It's fine. It just means it's on the other half of the loop to make the tests more in-depth.
Did you show the test cases to it? Maybe blinding it would solve the tailoring problem.
I did something similar for autogenerating RSpec tests in a Rails project.
https://gist.github.com/czhu12/b3fe42454f9fdf626baeaf9c83ab3...
It basically starts from some model or controller, and then parses the Ruby code into an AST, and load all the references, and then parses that code into an AST, up to X number of files, and ships them all off to GPT4-o1 for writing a spec.
I found sometimes, without further prompting, the LLM would write specs that were so heavily mocked that it became almost useless like:
``` mock(add_two_numbers).and_return(3) ... expect(add_two_numbers(1, 2)).to_return(3) ``` (Not that bad, just an illustrating example)
But the tests it generates is quite good overall, and sometimes shockingly good.
> recognize and parse any and all ipv4 and ipv6 addresses and CIDRs contained within it (these may be surrounded by random words or symbols like commas), then return them as a list'
Did I miss the generated code and test cases? I would like to see how complete it was.
For example, for IPv4 does it only handle quad-dotted IP addresses, or does it also handle decimal and hex formats?
For that matter, should it handle those, and if so, where there clarification of what exactly 'all ipv4 ... addresses' means?
I can think of a lot of tricky cases (like 1.2.3.4.5 and 3::2::1 as invalid cases, or http://[2001:db8:4006:812::200e] to test for "symbols like commas"), and would like to see if the result handles them.
very few times we are encountered with developing from scratch
I’m not going to claim I’ve solved this and figured out “the way” to use LLMs for tests, but I’ve found that copy-and-pasting code + tests and then providing a short essay about my own reasoning of edge cases followed with something along the lines of “your job is to find out what edge cases my reasoning isn’t accounting for, cases that would expose latent properties of the implementation not exposed via its contract, cases tested for by other similar code, domain exceptions I’m not accounting for, cases that test unexplored code paths, cases that align exactly with chunking boundaries or that break chunking assumptions, or any other edge cases I’m neglecting to mention that would be useful both to catch mistakes in the current code and to handle foreseeable mistakes that could arise from refactoring in the future. Try to understand how the existing test cases are defined to catch possibly problematic inputs and extend accordingly. Take into account both the api contract and the underlying implementation and approach this matter from an adversarial perspective where the goal of the tests is to challenge the author’s assumptions and break their code” has been useful.