• rbren 7 hours ago

    I'm one of the creators of OpenHands (fka OpenDevin). I agree with most of what's been said here wrt software agents in general.

    We are not even close to the point where AI can "replace" a software engineer. Their code still needs to be reviewed and tested, at least as much as you'd scrutinize the code of a brand new engineer just out of boot camp. I've talked to companies who went all in on AI engineers, only to realize two months later that their codebase was rotting because no one was reviewing the changes.

    But once you develop some intuition for how to use them, software agents can be a _massive_ boost to productivity. ~20% of the commits to the OpenHands codebase are now authored or co-authored by OpenHands itself. I especially love asking it to do simple, tedious things like fixing merge conflicts or failing linters. It's great at getting an existing PR over the line.

    It's also important to keep in mind that these agents are literally improving on a _weekly_ basis. A few weeks ago we were at the top of the SWE-bench leaderboard; now there are half a dozen agents that have pulled ahead of us. And we're one launch away from leapfrogging back to the top. Exciting times!

    https://github.com/All-Hands-AI/OpenHands

    • jebarker 6 hours ago

      > code still needs to be reviewed and tested, at least as much as you'd scrutinize the code of a brand new engineer just out of boot camp

      > ..._massive_ boost to productivity. ~20% of the commits to the OpenHands codebase are now authored or co-authored by OpenHands itself.

      I'm having trouble reconciling these statements. Where does the productivity boost come from, given that the review burden seems much greater than it would be if you knew the commits were coming from a competent human?

      • lars512 6 hours ago

        There are often a lot of small fixes that aren't time-efficient to do by hand, but where the solution is not much code and is quick to verify.

        If the cost of setting a coding agent (e.g. aider) on a task is small, you can see whether it reaches a quick solution and just abort if it spins out. That way you can knock out a subset of these types of issues very quickly instead of leaving them in issue tracking to grow stale. That lets you up the polish on your work.
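
        Concretely, the loop looks something like this minimal sketch (assuming aider's `--message` and `--yes` flags; the task list is hypothetical): give the agent a fixed time budget per issue and throw away anything that overruns.

        ```python
        import subprocess

        # Hypothetical backlog of small, easily verified fixes.
        tasks = [
            "fix the failing flake8 warnings in utils.py",
            "resolve the deprecation warning in setup.py",
        ]

        for task in tasks:
            try:
                # --message runs one non-interactive instruction; --yes auto-confirms.
                subprocess.run(["aider", "--yes", "--message", task],
                               timeout=300, check=True)  # 5-minute budget per task
                print(f"candidate fix for {task!r}: review the diff before merging")
            except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
                # It spun out or errored: discard whatever it half-did.
                subprocess.run(["git", "checkout", "--", "."])
        ```

        The point isn't the specific flags; it's that aborting is cheap, so trying costs you almost nothing.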

        That's still quite a different story from having it do the core, most important part of your work. That feels a little further away. One of the challenges is the scout rule: the refactoring alongside a change that leaves the codebase nicer than you found it. I feel like today it's easier to get a correct change that slightly degrades codebase quality than one that maintains it.

        • jebarker 5 hours ago

          Thanks - this all makes sense - I still don't feel like this would constitute a massive productivity boost in most cases, since it's not fixing time-consuming major issues. But I can see how it's nice to have.

          • rbren 5 hours ago

            The bigger win comes not from saving keystrokes, but from saving you from a context switch.

            Merge conflicts are probably the biggest one for me. I put up a PR and move onto a new task. Someone approves, but now there are conflicts. I could switch off my task, spend 5-10 min remembering the intent of this PR and fixing the issues. Or I could just say "@openhands fix the merge conflicts" and move back to my new task.

        • lolinder 5 hours ago

          I haven't started doing this with agents, but with autocomplete models I know exactly what OP is talking about: you stop trying to use models for things that models are bad at. A lot of people complain that Copilot is more harm than good, but after a couple of months of using it I figured out when to bother and when not to bother and it's been a huge help since then.

          I imagine the same thing applies to agents. You can waste a lot of time by giving them tasks that are beyond them and then having to review complicated work that is more likely to be wrong than right. But once you develop an intuition for what they can and cannot do you can act appropriately.

          • drewbug01 6 hours ago

            I suspect that many engineers do not expend significant energy on reviewing code, especially if the change is lengthy.

            • linsomniac 6 hours ago

              > burden seems much greater than...

              Because the burden is much lower than if you were authoring the same commit yourself without any automation?

              • jebarker 6 hours ago

                Is that true? I'd like to think my commits are less burdensome to review than those of a junior dev fresh out of boot camp, especially if all that's being done is fixing linter issues. Perhaps there's a small benefit, but it doesn't seem like a major productivity boost.

                • ErikBjare 6 hours ago

                  A junior dev is not a good approximation of the strengths and weaknesses of these models.

                  • rbren 5 hours ago

                    Agreed! The comparison is great for estimating the scope of the tasks they're capable of--they do very well with bite-sized tasks that can be individually verified. But their world knowledge is that of a principal engineer!

                    I think this is why people struggle so much with agents--they see the agent perform magic, then assume it can be trusted with a larger task, where it completely falls down.

                    • jebarker 5 hours ago

                      The post I originally commented on literally made that comparison when describing the models as a massive productivity boost.

              • bufferoverflow 3 hours ago

                We've seen exponential improvements in LLMs' coding abilities. They went from almost useless to somewhat useful in like two years.

                Claude 3.5 is not bad really. I wanted to do a side project that has been on my mind for a few years, and Claude coded it in like 30 seconds.

                So to say "we are not even close" seems strange. If LLMs continue to improve, they will be comparable to mid level developers in 2-3 years, senior developers in 4-5 years.

                • Zanfa 2 hours ago

                  > So to say "we are not even close" seems strange. If LLMs continue to improve, they will be comparable to mid level developers in 2-3 years, senior developers in 4-5 years.

                  These sorts of things can’t be extrapolated. It could be six months away, or it could be a local maximum, a dead end that will take another breakthrough in ten years the way transformers were. See self-driving cars.

                • veggieroll 3 hours ago

                  What does the cost look like for running OpenHands yourself? From your docs, it looks like you recommend Sonnet at $3 per million input tokens. But I could imagine this adds up quickly if you're sending large portions of the repository as context each time.
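
                  For a rough sense of scale, here's the back-of-envelope math I have in mind; the per-step token counts are purely my assumptions, not anything from your docs:

                  ```python
                  # Illustrative assumptions: ~50k input tokens of repo context per
                  # agent step, ~1k output tokens per step, ~30 steps per task.
                  # Claude 3.5 Sonnet pricing: $3/M input tokens, $15/M output tokens.
                  input_cost = 30 * 50_000 * 3 / 1_000_000    # $4.50
                  output_cost = 30 * 1_000 * 15 / 1_000_000   # $0.45
                  print(f"~${input_cost + output_cost:.2f} per task")  # ~$4.95
                  ```

                  Prompt caching or tighter context selection would pull that down a lot, but it's easy to see how it adds up across a backlog.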

                • CGamesPlay 8 hours ago

                  As someone who uses AI coding tools daily and has done a fair amount of experimentation with different approaches (though not Devin), I feel like this tracks pretty well. The problem is that Devin and other "agentic" approaches take on more than they can handle. The best AI coders are positioned as tools for developers, rather than replacements for them.

                    GitHub Copilot is "a better tab complete". Sure, it's a neat demo that it can produce a fast inverse square root, but the real utility is that it completes repetitive code. It's like having a dynamic snippet library always available that I never have to configure.

                  Aider is the next step up the abstraction ladder. It can edit in more locations than just the current cursor position, so it can perform some more high-level edit operations. And although it also uses a smarter model than Copilot, it still isn't very "smart" at the end of the day, and will hallucinate functions and make pointless changes when you give it a problem to solve.

                  • frereubu 7 hours ago

                      When I tried Copilot the "better tab complete" felt quite annoying, in that the constantly changing suggested completion kept dragging my focus away from what I was writing. That clearly doesn't happen for you. Was it something you got used to over time, or did it just never bother you? There were elements of it I found useful, but I couldn't get over the flickering of my attention between what I was doing and the suggested completions.

                    Edit: I also really want something that takes the existing codebase in the form of a VSCode project / GitHub repo and uses that as a basis for suggestions. Does Copilot do that now?

                    • macNchz 6 hours ago

                        I tried to get used to the tab completion tools a few times but always found them distracting, like you describe. Often I'd have a complete thought, start writing the code, get a suggested completion, start reading it, realize it was wrong, but by then I'd have lost my initial thought, or at least have to pause and bring myself back to it.

                      I have, however, fully adopted chat-to-patch style workflows like Aider, I find it much less intrusive and distracting than the tab completions, since I can give it my entire thought rather than some code to try to complete.

                        I do think there's promise in more autonomous tools, but they still very much fall into the compounding-error traps that agents often do at present.

                      • CGamesPlay 4 hours ago

                        I have the automatic suggestions turned off. I use a keybind to activate it when I want it.

                        > existing codebase in the form of a VSCode project / GitHub repo and uses that as a basis for suggestions

                        What are you actually looking for? Copilot uses "all of github" via training, and your current project in the context.

                        • frereubu 4 hours ago

                          > I have the automatic suggestions turned off. I use a keybind to activate it when I want it.

                          I didn't realise you could do that. Might give it another go.

                          > Copilot uses "all of github" via training, and your current project in the context.

                          The current project context is the bit I didn't think it had. Thanks!

                        • wrsh07 5 hours ago

                            For Cursor you can chat and ask @codebase, and it will do RAG (or equivalent) to answer your question.

                          • goosejuice 7 hours ago

                            Copilot is also very slow. I'm surprised people use it to be honest. Just use Cursor.

                            • pindab0ter 6 hours ago

                              Cursor requires you to use their specific IDE though, doesn't it? With Copilot I don't have to switch contexts as it lives in my Jetbrains IDE.

                              • goosejuice 6 hours ago

                                  It's just VS Code. I greatly prefer vim, but the difference between vim + AI tools and Cursor is a no-brainer in terms of productivity. Cursor isn't without problems, but it's leagues ahead of the competition in my opinion.

                                • verdverm 6 hours ago

                                  I've been tempted to try Cursor because of vocal fans like yourself. Then I went to their website and forums yesterday. I am no longer tempted.

                                  • goosejuice 2 hours ago

                                    Because of the complaints? If so, yeah I get it. I'm there amongst them. It's kind of like Tesla FSD. There are often setbacks in releases and they definitely need to work on their communication with the community. That said, for the current price it's still worth any misgivings.

                                    • verdverm an hour ago

                                      The price is one of the issues I have with this space more generally.

                                      I do not want to pay $20/m for a capped experience

                                      I want to pay $10/m to support development, and pay for my AI usage on my own, per request, by choosing my own model and provider

                                      If I was going to shell out money, it would be for Copilot, not Cursor. I prefer my AI to be a side dish, not the main course or core experience

                                    • rob137 4 hours ago

                                      Can you say more?

                                      • verdverm an hour ago

                                        pricing model, downtime, model support, pricing model, trying to take over the experience rather than assist within my experience. This last one is big, because Cursor wants to "reimagine" how developers work. The problem is the AIs are so far from being competent, they need to be kept on the sidelines and sub'd in occasionally, not be the quarterback. Oh, did I mention pricing model?

                              • mattnewton 5 hours ago

                                I would try Cursor. It's pretty good at copy-pasting the relevant parts of the codebase in and out of the chat window. I have the tab autocomplete disabled.

                                • Aeolun 7 hours ago

                                  Cursor tab does that. Or at least, it takes other open tabs into account when making suggestions.

                                  • sincerely 6 hours ago

                                    I've been very impressed with the Gemini autocomplete suggestions in Google Colab, and it doesn't feel more or less distracting than any IDE's built-in tab suggestions.

                                    • verdverm 6 hours ago

                                      I think a lot of people who are enabling Copilot in VS Code (like I did a few days ago) are experiencing "suggested autocomplete as I type" for the first time, having never before had grey text appear below what they were writing.

                                      It is a huge distraction, especially if it changes as I write more. I turned it off almost immediately.

                                      I deeply regret turning on Copilot in VS Code. It (M$) immediately weaseled into so many places and settings; I'm still trying to scale it back. Super annoying and distracting. I'd prefer a much more opt-in approach for each feature than what they did.

                                  • the_af 5 hours ago

                                    > The best AI coders are positioned as tools for developers, rather than replacements for them.

                                    I agree with this. However, we must not delude ourselves: corporate is pushing for replacement, so there will be a big push to improve on tools like Devin. This is not a conspiracy theory; in many companies (my wife's, for example) they are openly stating it: we are going to reduce (aka "lay off") the engineering staff and use as much AI as possible.

                                    I wonder how many of us, here, understand that many jobs are going away if/when this works out for the companies. And the usual coping mechanisms, "it will only be for low-hanging fruit", "it will never happen to me because my $SKILL is not replaceable", will eventually not save you. Sure, if you are a unique expert in a unique field, but many of us don't have that luxury. Not everyone can be a cream-of-the-crop specialist. And it'll be used to drive down salaries, too.

                                    • lolinder 5 hours ago

                                      I remember when I was first getting started in the industry the big fear of the time was that offshoring was going to take all of our jobs and drive down the salaries of those that remained. In fact the opposite happened: it was in the next 10 years that salaries ballooned and tech had a hiring bubble.

                                      Companies always want to reduce staff and bad companies always try to do so before the solution has really proven itself. That's what we're seeing now. But having deep experience with these tools over many years, I'm very confident that this will backfire on companies in the medium term and create even more work for human developers who will need to come in and clean up what was left behind.

                                      (Incidentally, this also happened with offshoring— many companies ended up with large convoluted code bases that they didn't understand and that almost did what they wanted but were wrong in important ways. These companies needed local engineers to untangle the mess and get things back on track.)

                                      • senordevnyc 2 hours ago

                                        > But having deep experience with these tools over many years, I'm very confident...

                                        No one has had deep experience with these tools for any amount of time, let alone many years. They're literally just now hitting the market and are rapidly expanding their capabilities. We're at a fundamentally different place than we were just twelve months ago, and there's no reason to think 2025 will be any different.

                                        • lolinder an hour ago

                                          I was building things with GPT-2 in 2019. I have as much experience engineering with them as anyone who wasn't an AI researcher before then.

                                          And no, we're not at a fundamentally different place than we were just 12 months ago. The last 12 months had much slower growth than the 12 months before that, which had slower growth than the 12 months before that. And in the end these tools have the same weaknesses that I saw in GPT-2, just to a lesser degree.

                                          The only aspect in which we are in a fundamentally different place is that the hype has gone through the roof. The tools themselves are better, but not fundamentally different.

                                        • the_af 3 hours ago

                                          I think it's qualitatively different this time.

                                          Unlike with offshoring, this is a technological solution, which understandably is received more enthusiastically on HN. I get it. It's interesting as tech! And it's achieved remarkable things. But unlike with offshoring (which is a people thing) or magical NOCODE/CASE/etc "solutions", it seems the consensus is that AI coding assistants will eventually get there. At least a portion of even HN seems to think so. And some are cheering!

                                          The coping mechanism seems to be "it won't happen to me" or "my knowledge is too specialized", but I think this will become increasingly false. And even if your knowledge is too specialized to be replaced by AI, most engineers aren't in that position. "Well, become more specialized" is unrealistic advice, and in any case the employment pool will shrink.

                                          PS: I am offshoring (in a way). I'm not based in the US but I work remotely for a US company.

                                          • lolinder 3 hours ago

                                            > But unlike with offshoring (which is a people thing) or magical NOCODE/CASE/etc "solutions", it seems the consensus is that AI coding assistants will eventually get there.

                                            There's no consensus on that point. There are a few loud hype artists, most of whom are employed in AI and so have conflicts of interest, and who are pre-filtered to be true believers. Their logic is basically "See this trend? Trends continue, so this is inevitable!"

                                            That's bad logic. Trends do not always continue, they often slow or reverse, and this one is showing all signs of doing so already. OpenAI has come straight out and said that they don't expect to see another jump like GPT-3 to 4, and have resorted to throwing more tokens at the problems, which works with diminishing returns. I do not expect to see a return to the rapid growth we had for a year or two there.

                                            > PS: I am offshoring (in a way). I'm not based in the US but I work remotely for a US company.

                                            Yes, and this is a good example: there's a place for offshoring, but it didn't replace US devs. The same thing will happen here.

                                            • senordevnyc 2 hours ago

                                              > Trends do not always continue, they often slow or reverse, and this one is showing all signs of doing so already. OpenAI has come straight out and said that they don't expect to see another jump like GPT-3 to 4, and have resorted to throwing more tokens at the problems, which works with diminishing returns. I do not expect to see a return to the rapid growth we had for a year or two there.

                                              This feels like the declaration of someone who has spent almost no time playing with these models or keeping up with AI over the last two years. Go look at the benchmarks and leaderboards for the last 18 months and tell me we're not progressing far beyond GPT-4. Meanwhile, models are also getting faster, cheaper, gaining multi-modal capabilities, becoming cheaper to train for a given capability, etc.

                                              And of course there are diminishing returns, the latest public models are in the 90s on many of their benchmarks!

                                        • nyarlathotep_ 5 hours ago

                                          > I wonder how many of us, here, understand that many jobs are going away if/when this works out for the companies. And the usual coping mechanisms, "it will only be for low-hanging fruit", "it will never happen to me because my $SKILL is not replaceable", will eventually not save you. Sure, if you are a unique expert in a unique field, but many of us don't have that luxury. And it'll be used to drive down salaries, too.

                                          Yeah it's maddening.

                                          The cope is bizarre too: "writing code is the least important part of the job"

                                          Ok then why does nearly every company make people write code for interviews or do take home programming projects?

                                          Why do people list programming languages on their resumes if it's "least important"?

                                          Also bizarre to see people cheering on their replacements as they use all this stuff.

                                          • s1mplicissimus 5 hours ago

                                            > Ok then why does nearly every company make people write code for interviews or do take home programming projects?

                                            For the same reason they put leetcode problems to "test" an applicant's skill, or have them write mergesort on a chalkboard by hand. It gives them a warm fuzzy feeling in the tummy because now they can say "we did something to check they are competent". Why, you ask? Well, it's mostly impossible to come up with a test to verify a competency you don't have yourself. Imagine you can't distinguish red and green, are not aware of it, but want to hire people who can. That's their situation, but they cannot admit it, because it would be clear evidence that they are no good fit for their current role. Use this information responsibly ;)

                                            > Why do people list programming languages on their resumes if it's "least important"?

                                            You put the programming languages in there alongside the HR-soothing stuff because you hope that an actual software person gets to see your resume and gives you an extra vote for being a good match. Notice that most guides recommend a relatively small amount of technical content versus lots of "using my awesomeness I managed to blafoo the dingleberries in a more efficient manner to earn the company a higher bottom line".

                                            If you don't want to be a software developer, that's fine. But your questions point me towards the conclusion that you don't know a lot about software development in the first place, which doesn't say much for your ability to estimate how easy it will be to automate it using LLMs.

                                            • the_af 3 hours ago

                                              Arguing about programming is not the point, in my opinion.

                                              When AI becomes able to do most non-programming tasks too, say design or solving open-ended problems (yeah, except in trivial cases it cannot -- for now), we can have this conversation again...

                                              I think saying "well, programming is not important, what matters is $THING" is a coping mechanism. Eventually AI will do $THING acceptably enough for the bean counters to push for more layoffs.

                                        • qup 5 hours ago

                                          It's weird to talk about Aider hallucinating.

                                          That's down to whatever model you chose to use with it. Aider can use any model you like.

                                        • xmprt 12 hours ago

                                           I think one of the big problems with Devin (and AI agents in general) is that they're only ever as good as they are out of the box: they don't improve as you work with them. Sometimes their intelligence feels magical and they accomplish things within minutes that even mid-level or senior software engineers would take a few hours to do. Other times, they make simple mistakes, and no matter how much help you give, they run around in circles.

                                          A big quality that I value in junior engineers is coachability. If an AI agent can't be coached (and it doesn't look like it right now), then there's no way I'll ever enjoy using one.

                                          • ipnon 11 hours ago

                                            My first job I spent so much time reading Python docs, and the ancient art of Stack Overflow spelunking. But I could intuitively explain a solution in seconds because of my CS background. I used to encounter a certain kind of programmer often, who did not understand algorithms well but had many years of experience with a language like Ruby, and thus was faster in completing tasks because they didn't need to do the reference work that I had to do. Now I think these kinds of programmers will slowly disappear and only the ones with the fast CS intuition will remain.

                                            • joduplessis 9 hours ago

                                              I've found the opposite true as well.

                                              • halfmatthalfcat 10 hours ago

                                                I disagree. If anything, CS degrees have proven time and time again they aren't translatable into software development (which is why there's an entire degree field called Software Engineering emerging).

                                                 If anything, my gut says that the CS concepts are very easy for LLMs to recall and will be the first things replaced (if ever) by AI. Software engineer{ing,s} (project construction, integrations, scaling, organizational/external factors, etc.) will stick around for a long time.

                                                There's also the meme in the industry that self-taught, non-CS degree engineers are potentially of the most capable group. Though this is anecdotal.

                                                • ben_w 9 hours ago

                                                  > If anything, CS degrees have proven time and time again they aren't translatable into software development (which is why there's an entire degree field called Software Engineering emerging).

                                                  Emerging? I graduated in 2006 with a BEng in Software Engineering.

                                                  The difference between it and the BSc CompSci degree I started in, was that optional modules became mandatory — including an industrial placement year (paid internship).

                                                  > Software engineer{ing,s} (project construction, integrations, scaling, organizational/external factors, etc) will stick around for along time.

                                                  My gut disagrees, because LLMs are at about the same level in those things as they are in low level coding: not yet replacing humans in project level tasks any more than they do in coding tasks, but also being OK assistants for both coding and project domains. I have no reason to think either has less or more opportunity for self-training, so I expect progress to track for the foreseeable future.

                                                  (That said, the foreseeable future in this case is 1-2 years).

                                                  • viraptor 8 hours ago

                                                    > the CS concepts are very easy for LLMs to recall

                                                    They're easy to recall, but you have to know what to recall in the first place. Or even know enough of the territory to realise there's something to recall. Without enough background, you'll get a whole set of amazing tools that you have no idea what to do with.

                                                      For example, you may be able to write a long description of your problem with some ideas on how to steer the AI to give you possible solutions. And the AI may figure out what the problem is and that HyperLogLog is something that could be useful to you. And you may have the awesome programming skills to implement that. But that's a lot of maybes. It would be much faster/easier if you knew about HyperLogLog ahead of time and just asked for the implementation or a library recommendation.

                                                    Or even if you don't know about the actual solution, you'd have enough of CS vocabulary to ask: "how do I get a fast, approximate distinct count from a multiset". It would take a long imprecise description to get the same thing for a coder with no theory background.
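
                                                      To make the example concrete, here's a toy HyperLogLog in Python. It's a minimal sketch that skips the standard small/large-range corrections; real code would use a library:

                                                      ```python
                                                      import hashlib

                                                      class HyperLogLog:
                                                          """Approximate distinct count of a multiset in O(2^p) memory."""
                                                          def __init__(self, p=14):
                                                              self.p, self.m = p, 1 << p
                                                              self.reg = [0] * self.m

                                                          def add(self, item):
                                                              h = hashlib.sha256(str(item).encode()).digest()
                                                              x = int.from_bytes(h[:8], "big")              # 64-bit hash
                                                              idx = x >> (64 - self.p)                      # top p bits pick a register
                                                              rest = x & ((1 << (64 - self.p)) - 1)         # low 64-p bits
                                                              rank = (64 - self.p) - rest.bit_length() + 1  # 1 + leading zeros
                                                              self.reg[idx] = max(self.reg[idx], rank)

                                                          def count(self):
                                                              alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction for large m
                                                              return int(alpha * self.m ** 2 / sum(2.0 ** -r for r in self.reg))

                                                      hll = HyperLogLog()
                                                      for i in range(1_000_000):
                                                          hll.add(i % 100_000)
                                                      print(hll.count())  # ~100,000, typically within ~1%
                                                      ```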

                                                    • macNchz 6 hours ago

                                                      To this point, I use AI programming assistants pretty heavily and find very frequently that they will write extremely inefficient or oddly baroque implementations of what I’m asking for in their first pass, that appear as if they don’t have the “knowledge” or ability to do it better, but then they can be prodded to re-do it very easily. Frequently I look at some generated code and write back the most cursory feedback like “looks o(n^2) can you make more efficient” or “use pointers instead of nested loops” or “how about using X approach” and it will often produce something dramatically better than the initial effort. For now at least I think these tools are still most powerful in the hands of experts. (I am a self-taught programmer but have a fair bit of experience)
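
                                                        For instance, the "looks o(n^2)" nudge usually maps to a rewrite like this hypothetical before/after for collecting duplicate IDs:

                                                        ```python
                                                        def dupes_quadratic(ids):
                                                            # The kind of first pass a model might produce: scan a growing
                                                            # prefix for every element, O(n^2) overall.
                                                            return [x for i, x in enumerate(ids) if x in ids[:i]]

                                                        def dupes_linear(ids):
                                                            # After "can you make this more efficient": one pass with a set, O(n).
                                                            seen, out = set(), []
                                                            for x in ids:
                                                                if x in seen:
                                                                    out.append(x)
                                                                else:
                                                                    seen.add(x)
                                                            return out

                                                        assert dupes_quadratic([1, 2, 1, 3, 2]) == dupes_linear([1, 2, 1, 3, 2]) == [1, 2]
                                                        ```

                                                        The models rarely volunteer the second version, but they produce it readily once you name the problem.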

                                                    • cmiles74 4 hours ago

                                                      I'm not convinced an LLM is really "recalling" any CS concepts when they try to solve a problem. IMHO, we're lucky if it matches the pattern of the request against the pattern of a solution and the two are actually related. I'm no expert but I don't think there's any reason to think that an LLM is taking a CS concept and applying it to something novel in order to get a solution. If they were, I believe their success rate would be much higher.

                                                      In many places where someone might reach for something they remember from their CS coursework, there's often an open-source library or tool doing much the same thing. Understanding how these libraries and tools function is certainly valuable but, much of the time, people can get by with only a vague hunch; indeed, this is why they exist! IMHO, I would be happier with the LLM assistant if it picked reliable library code rather than writing up a sketchy solution of its own.

                                                          I'm also familiar with this idea that people who have managed to be successful in the field without a CS degree are more capable. In my opinion, this is hogwash. I think if we take a step back, we'll see that people graduating from established, top-tier CS programs are looking for higher pay than those who have come from a less expensive and (very often) business-focused program. To be fair, people from each of these backgrounds have their strengths; in many organizations a developer who has done two semesters of accounting is a real benefit, in others the ability to turn a research paper into the prototype of a new product is going to be important.

                                                          Years of experience often wash out much of these differences. People who started from business-oriented education programs may end up taking a handful of CS courses as they navigate their career; likewise, many people with a CS background end up accruing business-centered skills.

                                                      In my opinion, people start out their education at a place that they can afford, a place that is familiar to them, often a place that they feel comfortable. Someone's economic background (and of their family) plays a big role on what kind of educational program they choose when they are on the cusp of adulthood. Smart and talented people can always learn what they need to learn, regardless of their starting point.

                                                      • jpc0 6 hours ago

                                                            I honestly think the meme that non-CS-degree engineers are the most capable is selection bias.

                                                            If they had taken a CS degree, they would likely be just as, if not more, capable.

                                                            Self-learning the topics you need to make good software takes an immense amount of effort, and although the data and material are out there, it takes a lot of work to figure out.

                                                            I'm only recently starting to pick up on "magic" patterns that are actually extremely simple to understand given the right base knowledge. I can gain tons of insight from talks given in the early 2010s, but if I had watched them without the right practical experience and foundational knowledge they would have been like the title of an HN post this week [1]: gibberish.

                                                            With the right time playing with the foundational patterns and learning some of the backing knowledge, amazing patterns unlock in my mind and the magic seems simple. A great example: CSP [2]. I'd known about and used the actor model before, which I first discovered through Erlang, but with CSP I could ask the question "Why should actors be heavy?" You can put an actor into a lightweight task, spawn tons of them, and build a tree of connections. Stuff like the oneTBB flow graph [3] now makes sense and looks like a beautiful pattern, with some really interesting ideas that can be applied to more general computing than the high-performance computing it was designed for. It seems niche, but golang is built on those foundations, and the true power of concurrency in golang comes from embracing that. It fundamentally changes the way I want to structure and lay out code, and I feel like a good CS course could get you there quicker. (A toy sketch of the lightweight-actor idea follows the links below.)

                                                            Unfortunately a good CS course probably wouldn't accelerate the average CS grad's understanding of that, but it can get someone dedicated and hungry there much, much quicker. Someone fresh out of a JS bootcamp is maybe a decade away from that, if they ever even go looking for that knowledge.

                                                        1. https://news.ycombinator.com/item?id=42711751

                                                        2. https://en.m.wikipedia.org/wiki/Communicating_sequential_pro...

                                                        3. https://oneapi-spec.uxlfoundation.org/specifications/oneapi/...
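
                                                            Here's that toy sketch. Go's goroutines and channels are the canonical form of CSP; Python's asyncio tasks and queues merely stand in for them here:

                                                            ```python
                                                            import asyncio

                                                            async def worker(inbox: asyncio.Queue, outbox: asyncio.Queue):
                                                                # A tiny "actor": receive a message, transform it, pass it on.
                                                                while (msg := await inbox.get()) is not None:
                                                                    await outbox.put(msg + 1)
                                                                await outbox.put(None)  # propagate shutdown down the pipeline

                                                            async def main():
                                                                # Chain 10,000 lightweight actors; each hop adds 1 to the message.
                                                                n = 10_000
                                                                queues = [asyncio.Queue() for _ in range(n + 1)]
                                                                tasks = [asyncio.create_task(worker(queues[i], queues[i + 1]))
                                                                         for i in range(n)]
                                                                await queues[0].put(0)
                                                                await queues[0].put(None)
                                                                while (result := await queues[-1].get()) is not None:
                                                                    print(result)  # 10000
                                                                await asyncio.gather(*tasks)

                                                            asyncio.run(main())
                                                            ```

                                                            Ten thousand OS threads would be painful; ten thousand cooperative tasks are trivial, which is exactly why actors don't have to be heavy.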

                                                    • marcyb5st 11 hours ago

                                                      I completely agree with you. More precisely, I feel they are useful when you have specific tasks with limited scope.

                                                        For instance, just yesterday I was battling with a complex SQL query and had only gotten halfway there. I gave our bot the query and a half-assed description of what I wanted/what was missing, and it got it right on the first try.

                                                      • datadrivenangel 4 hours ago

                                                        Are you sure that your SQL query is correct?

                                                        • QuadmasterXLII 4 hours ago

                                                          he’s certainly sure, but lord knows if it is

                                                        • kkaatii 9 hours ago

                                                            And when working with people it's fairly easy to intervene and improve things when needed. I think the current working model with LLMs is definitely suboptimal when we cannot confine their solution space AND control where they apply a solution, precisely and in a timely way.

                                                          • llamaimperative 8 hours ago

                                                            It’s also often possible to know what a human will be bad at before they start. This allows you to delegate tasks better or vary the level of pre-work you do before getting started. This is pretty unpredictable with LLMs still.

                                                        • rco8786 7 hours ago

                                                          I'm sure a lot of folks in these comments predicted these sorts of results with surprising accuracy.

                                                          Stuff like this is why I scoff when I hear about CEOs freezing engineering hiring or saying they just don't need mid-level engineers anymore because they have AI.

                                                          I'll start believing that when I see it happening, and see actual engineers saying that AI can replace a human.

                                                          I am long AI, but I think the winning formula is small, repetitive tasks with a little too much variation to make it worth it (or possible) to automate procedurally. Pulling data from Notion into Google sheets, like these folks did initially, is probably fine. Having it manage your infrastructure and app deployments, likely not.

                                                          • davedx 12 hours ago

                                                            One thing that surprised me a little is that there doesn't seem to be an "ask for help" escape hatch in it - it would work away for literally days on a task where any human would admit they were stuck?

                                                            One of the more important features of agents is supposedly that they can stop and ask for human input when necessary. It seems it does do this for "hard stops" - like when it needed a human to setup API keys in their cloud console - but for "soft stops" it wouldn't.

                                                            By contrast, a human dev would probably throw in the towel after a couple of hours and ask a senior dev for guidance. The chat interface definitely supports that with this system but apparently the agent will churn away in a sort of "infinite thinking loop". (This matches my limited experience with other agentic systems too.)

                                                            • coffeebeqn 12 hours ago

                                                                LLMs can create infinite worlds out of the error messages they're receiving. They probably need some outside signal to stop and re-assess. I don't think LLMs have any ability to recognize, on their own, that they're lost in their own world. They'll just keep creating new, less and less coherent context for themselves.

                                                              • someothherguyy 11 hours ago

                                                                  If you correct an LLM-based agent coder, you are always right. Often, if you give it advice, it pretends to understand you, then goes on to do something different from what it said it was going to do. Likewise, it will outright lie to you, telling you it did things it didn't do. (In my experience.)

                                                                • rsynnott 9 hours ago

                                                                  So when people say these things are like junior developers, they really mean that they’re like the worst _stereotype_ of junior developers, then?

                                                                • davedx 11 hours ago

                                                                  For sure - but if I'm paying for a tool like Devin then I'd expect the infrastructure around it to do things like stop it if it looks like that has happened.

                                                                  What you often see with agentic systems is that there's an agent whose role is to "orchestrate", and that's the kind of thing the orchestrator would do: every 10 minutes or so, check the output and elapsed time and decide if the "developer" agent needs a reality check.
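
                                                                    Skeletally, something like this, where `agent` and `llm_judge` are hypothetical interfaces standing in for whatever the platform provides:

                                                                    ```python
                                                                    import time

                                                                    def orchestrate(agent, llm_judge, task, budget_s=3600, check_every_s=600):
                                                                        # Supervisor loop: periodically ask a second model whether the
                                                                        # worker agent is still making progress, and stop it if not.
                                                                        agent.start(task)
                                                                        deadline = time.time() + budget_s
                                                                        while agent.running():
                                                                            time.sleep(check_every_s)
                                                                            verdict = llm_judge(task, agent.recent_output())
                                                                            if verdict == "stuck" or time.time() > deadline:
                                                                                agent.stop()
                                                                                return "escalate to a human"
                                                                        return agent.result()
                                                                    ```

                                                                    The judge has the same blind spots as the worker, of course, but "is this transcript going in circles?" is a much easier question than "how do I solve this task?".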

                                                                  • mousetree 6 hours ago

                                                                    How would it decide if it needs a reality check? Would the thing checking have the same limitations?

                                                                  • tobyhinloopen 10 hours ago

                                                                    You can maybe have a supervisor AI agent trigger a retry / new approach

                                                                    • nejsjsjsbsb 8 hours ago

                                                                      They need impatience!

                                                                    • verdverm 6 hours ago

                                                                      I think training it to do that would be the hard part.

                                                                      - stopping is probably the easy part

                                                                      - I assume this happens during the RLHF phase

                                                                      - Does the model simply stop or does it ask a question?

                                                                      - You need a good response or interaction, depending on the query? So probably sets or decision trees of them, or agentic even? (chicken-egg problem?)

                                                                      - This happens 10s of thousands of times, having humans do it, especially with coding, is probably not realistic

                                                                      - Incumbents like M$ with Copilot may have an advantage in crafting a dataset

                                                                      • rfoo 7 hours ago

                                                                        Devin does ask for help when it can't do something. I think I've had it ask me how to use a testing suite it had trouble running.

                                                                        The problem is it really, really hates asking for help when the blocker is a skill issue; it would rather run in circles than admit it just can't do something.

                                                                        • Aeolun 7 hours ago

                                                                          So they perfectly nailed the junior engineer. It’s just that that isn’t what people are looking for.

                                                                          • rfoo 6 hours ago

                                                                            Maybe. It's pretty weird and I'm still thinking about it.

                                                                            You can't throw junior engineers under the bus when they're working on an issue they clearly can't do. Or at least it takes some effort. Instead you coach them and hope they eventually improve.

                                                                            Devin does look like junior engineers, but I've learned to just click "Terminate Session" immediately after I spotted that it was doing something hopeless. I've managed to get some real work done out of it, without much effort on my side (just check what it's doing every 10~15 minutes and type a few lines or restart session).

                                                                        • mkagenius 6 hours ago

                                                                          If they had built that in from the beginning, people would have said "it asks me for help on every other task; how is it a developer if I have to assist it all the time?"

                                                                          But now that you are okay with that, I think it's the right time to add that feature.

                                                                          • csomar 11 hours ago

                                                                            > One thing that surprised me a little is that there doesn't seem to be an "ask for help" escape hatch in it - it would work away for literally days on a task where any human would admit they were stuck?

                                                                            You are over-estimating the sophistication of their platform and infrastructure. Everyone was talking about Cursor (or maybe it was astroturfing?), but once I checked it out, it was not far from avante on neovim.

                                                                            • a1j9o94 7 hours ago

                                                                              Cursor isn't designed to do long-running tasks. As someone mentioned in another comment, it's closer to a function call than a process like Devin.

                                                                              It will only do one task at a time that it's asked to do.

                                                                            • bot403 6 hours ago

                                                                              You can set a "max work time" before it pauses, so it won't go for days endlessly spending your credits. By default it's set to 10 credits.

                                                                              So I'm not sure how the author got it to go for days.

                                                                              • ImHereToVote 11 hours ago

                                                                                There should be an energy coefficient for problems. You only get a set amount of energy to spend per issue. When the energy runs out, a human must help.

                                                                              • npilk 6 hours ago

                                                                                This feels a bit like AI image generation in 2022. The fact that it works at all is pretty mindblowing, and sometimes it produces something really good, but most of the time there are obvious mistakes, errors, etc. Of course, it only took a couple more years to get photorealistic image outputs.

                                                                                A lot of commenters here seem very quick to write off Devin / similar ideas permanently. But I'd guess in a few years the progress will be remarkable.

                                                                                One stubborn problem – when I prompt Midjourney, what I get back is often very high-quality, but somehow different than what I expected. In other words, I wouldn't have been able to describe what I wanted, but once I see the output I know it's not quite right. I suspect tools like this will run into similar issues. Maybe there will be features that can help users 'iterate' quickly.

                                                                                • muglug 5 hours ago

                                                                                  > Of course, it only took a couple more years to get photorealistic image outputs.

                                                                                  "Photorealistic" is a pretty subjective judgement, whereas "does this code produce the correct outputs" is an objective judgement. A blurry background character with three arms might not impact one's view of a "photorealistic" image, but a minor utility function returning the wrong thing will break a whole program.

                                                                                • noodletheworld 11 hours ago

                                                                                  Those “how I feel about Devin after using it” comments at the bottom are damning when you compare them to the user testimonials of people using Cursor.

                                                                                  Seems to me that agents just aren't the answer people want them to be, just a hype wave obscuring real progress in other areas (e.g. MCST) because they're easy to implement.

                                                                                  …but really, if things are easy to implement, at this point, you have to ask why they haven’t been done yet.

                                                                                  Probably, it seems, because it’s harder to implement in a way that’s useful than it superficially appears…

                                                                                  I.e. if the smart folk working on Devin can only do something of this level, anyone working on agentic systems should be worried, because it's unlikely you can do better without better underlying models.

                                                                                  • Melomomololo 9 hours ago

                                                                                    Agents are really new and would solve plenty of annoying things.

                                                                                    When I code with Claude, I have to copy paste files around.

                                                                                    But everything we do in AI is new, and outdated a few weeks later.

                                                                                    Claude is really good, but it locks you out for a while after 1-3 hours of use due to context-length limits.

                                                                                    That type of issue will be solved.

                                                                                    And local coding models are super fast on a 4090 already. Imagine a small Project DIGITS on your desktop where you allow these models more thinking time. But the thinking-style models, again, are super new.

                                                                                    Things probably are not done yet because we humans are the bottleneck right now: getting enough chips, energy, standards, and training time, and doing experiments with tech A while tech B starts to emerge from another corner of AI.

                                                                                    The 5090 was just announced, and depending on benchmarks it might be 1.x-3 times faster. If it's more than 1.5x faster, that would again be huge.

                                                                                    • llamaimperative 8 hours ago

                                                                                      Have you used Cursor, which GP actually refers to?

                                                                                      • freddref 9 hours ago

                                                                                        How is Devin different from Cursor?

                                                                                        I recently used Cursor and it felt very capable at implementing tasks across files. I get that Cursor is an IDE, but its AI functionality feels very agentic. Where do you draw the line?

                                                                                        • Xmd5a 8 hours ago

                                                                                          I had to look up MCST: it means Model-Centric Software Tools, as opposed to autonomous agents.

                                                                                          Devin is closer to a long-running process that you can interact with as it is processing tasks, whereas Cursor is closer to a function call: once you've made the call, the only thing you can do is wait for the result.

                                                                                          • noodletheworld 8 hours ago

                                                                                            It stands for Monte Carlo tree search.

                                                                                            I.e. better outputs from the models themselves, not external tooling and prompt engineering.

                                                                                            https://github.com/zz1358m/MCTS-AHD-master

                                                                                            • Xmd5a 7 hours ago

                                                                                              Thanks for the correction, I guess I was lured by yet another LLM confabulation

                                                                                      • ianbutler 12 hours ago

                                                                                        Disclosure: Working on a company in the space and have recently been compared to Devin in at least one public talk.

                                                                                        Devin has tried to do too much. There is value in producing a solid code artifact that can be handed off for review to other developers in limited capacities like P2s and minor bugs which pile up in business backlogs.

                                                                                        Focusing on specific elements of the development loop, such as fixing bugs, adding small features, running tests, and producing pull requests, is enough.

                                                                                        Businesses like Factory AI or my own are taking that approach and we're seeing real interest in our products.

                                                                                        • yoavm 11 hours ago

                                                                                          Not to take away from your opinion, but I guess time will tell? As models get better, it's possible that wide tools like Devin will work better and swallow tools that do one thing. I think companies would much rather have an AI solution that works like what they already know (developers) than one tool that works in the IDE, another that watches GitHub issues, another that reviews PRs, and one that hangs out on Slack and makes small fixes.

                                                                                          > Businesses like Factory AI or my own are taking that approach and we're seeing real interest in our products.

                                                                                          Interest isn't what tools like Devin are lacking, (un)fortunately.

                                                                                          To be clear, I do share a lot of scepticism regarding all the businesses working around AI code generation. However, that isn't because I think they'll never be able to figure it out, but because I think they are all likely to figure it out in the end, at the same time, when better models come out. And none of them will have a real advantage over the others.

                                                                                          • ianbutler 11 hours ago

                                                                                            I've recently had several enterprise level conversations with different companies and what we're being asked for is specifically the simpler approach. I think that is the level of risk they're willing to tolerate and it will still ameliorate a real issue for them.

                                                                                            The key here is that my product is no worse positioned to do more things if and when the time comes. But building a solid foundation and trust, and not having the quiet part be that your product doesn't work (which is what I was hearing as early as several months ago), means we'll hopefully still have the customer base to roll that out to.

                                                                                            I talked to Devin's CEO once at Swyx's conference last June. They're very thoughtful and very kind, so this must be very rough. But between the demo they showed then and what I'm hearing now, the product has not evolved in a way where they are providing value commensurate with their marketing or hype.

                                                                                            I'm a fan of Guillermo Rauch's (Vercel CEO) take on these things. You earn the right to take on bigger challenges and no one in this space has earned the right yet including us.

                                                                                            Devin's investment was fueled by hyperspeculation early on when no one knew what the shape of the game was. In many ways we still don't, but if you burn your reputation before we get there you may not be able to capitalize on it.

                                                                                            To be completely fair to them, taking the long view and the bank account to go with it they may still be entirely fine.

                                                                                            • likium 10 hours ago

                                                                                              > You earn the right to take on bigger challenges and no one in this space has earned the right yet including us.

                                                                                              Not entirely. We're in interesting times where products with better models can suddenly leapfrog and displace even current upstarts. Cursor won over Copilot by leveraging Claude Sonnet 3.5. They didn't "earn the right".

                                                                                              Improvements in models will help those with the existing infrastructure to benefit from them. I'm not saying Devin will win when that time comes, but a similar product might find its space quickly.

                                                                                              • kgilpin 6 hours ago

                                                                                                I just want to note that Copilot is multi model now and can also run Sonnet.

                                                                                          • morgante 9 hours ago

                                                                                            You can get a much higher hit rate with more constrained agents, but unfortunately if it's too constrained it just doesn't excite people as much.

                                                                                            E.g. the Grit agent (my company) is designed to handle larger maintenance tasks. It has a much higher success rate, with <5% of tasks rejected and 96% of PRs merged (including in some pretty huge repos).

                                                                                            It's also way less exciting. People want the flashy tool that can solve "everything."

                                                                                          • tlarkworthy 12 hours ago

                                                                                              Also trialed Devin. It's quite impressive when it understands the code formatting and local test setup, producing well-formatted code that passes the test cases, but it seems to always add extraneous changes beyond the task that can break other things. And it can't seem to undo those changes if you ask, so everything requires more cleanup. Devin opened my eyes to the power of agentic workflows with closed-loop feedback, and to the coolness of a Slack interface, but I'm going to recommend cancelling it because it's not actually saving time and it's quite expensive.

                                                                                            • huijzer 10 hours ago

                                                                                                I’ve used Cursor a lot and the conclusion doesn’t surprise me. I feel like I’m the one *forcing* the system in a certain direction, and sometimes the LLM gives a small snippet of useful code. Sometimes it goes in the wrong direction and I have to abort the suggestion and steer it another way. For me, the main benefit is having a typing assistant which can save me from typing one line here and there. Refactoring especially is where Cursor shines. Things like moving argument order around or adding/removing a parameter at function call sites are great; it has saved me a ton of typing and time already. I’m way more comfortable just quickly doing a refactoring when I see one.

                                                                                              • kromem 10 hours ago

                                                                                                Weird. I have such a different experience with Cursor.

                                                                                                Most changes occur with a quick back and forth about top level choices in chat.

                                                                                                  Followed by me grabbing appropriate interfaces and files for context so Sonnet doesn't hallucinate APIs, and then code that I'll glance over; around half the time I'll suggest one or more further changes.

                                                                                                It's been successful enough I'm currently thinking of how to adjust best practices to make things even smoother for that workflow, like better aggregating package interfaces into a single file for context, as well as some notes around encouraging more verbose commenting in a file I can provide as context as well on each generation.
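
                                                                                                  Something like this rough sketch is what I have in mind (the helper name and layout are invented, not an existing tool):

                                                                                                    from pathlib import Path

                                                                                                    # Rough sketch (invented helper): collect def/class signature lines
                                                                                                    # from a package into one file to paste into the model's context.
                                                                                                    def build_context_file(package_dir: str, out_path: str) -> None:
                                                                                                        lines = []
                                                                                                        for src in sorted(Path(package_dir).rglob("*.py")):
                                                                                                            lines.append(f"# --- {src} ---")
                                                                                                            for line in src.read_text().splitlines():
                                                                                                                if line.lstrip().startswith(("def ", "class ", "async def ")):
                                                                                                                    lines.append(line.rstrip())
                                                                                                        Path(out_path).write_text("\n".join(lines))

                                                                                                    build_context_file("mypackage", "interfaces_for_context.py")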

                                                                                                Human-centric best practices aren't always the best fit, and it's finally good enough to start rethinking those for myself.

                                                                                                • cootsnuck 4 hours ago

                                                                                                  This! I've been using Cursor regularly since late 2023. It's all about building up effective resources to tactfully inject into prompts as needed. I'll even give it sample API responses in addition to API docs. Sometimes I'll have it first distill API docs down into a more tangible implementation guide and then save that as a file in the codebase.

                                                                                                  I think I'm just a naturally verbose person by default, and I'm starting to think that has been very helpful in me getting a lot out of my use of LLMs and various LLM tools over the past 2+ years.

                                                                                                  I treat them like the improv actors they are and always do the up front work to create (with their assistance) the necessary broader context and grounding necessary for them to do their "improv" as accurately as possible.

                                                                                                  I honestly don't use them with the immediate assumption I'll save time (although that happens almost all the time), I use them because they help me tame my thoughts and focus my efforts. And that in and of itself saves me time.

                                                                                                  • huijzer 9 hours ago

                                                                                                    Interesting. What project are you working on? For me it's writing a library in Rust.

                                                                                                    • kgilpin 6 hours ago

                                                                                                      This is what’s needed to get the most out of these tools. You understand deeply how the tool works and so you’re able to optimize its inputs in order to get good results.

                                                                                                      This puts you in the top echelon of developers using AI assisted coding. Most developers don’t have this deep of an understanding and so they don’t get results as good as yours.

                                                                                                      So there’s a big question here for AI tool vendors. Is AI assisted coding a power tool for experts, or is it a tool for the “Everyman” developer that’s easy to use?

                                                                                                      Usage data shows that the most adopted AI coding tool is still ChatGPT, followed by Copilot (even if you’d think it’s Cursor from reading HN :-))

                                                                                                    • epolanski 8 hours ago

                                                                                                      I'll add a few things at which Cursor with Claude is better than us (at least in time/effort):

                                                                                                      - explaining code. Enter some legacy part of your code nobody understands: LLMs aren't limited to keeping a few things in memory like us. Even if the code is very obfuscated and poorly written, it can work out what it does and why, and suggest refactors to make it understandable

                                                                                                      - explaining and fixing bugs. Just the other day Antirez posted about debugging a Redis segfault in some C code, providing the LLM with context and a stack trace. This can be hit or miss at times, but more often than not it saves you hours

                                                                                                      - writing tests. It often comes up with many more examples and edge cases than I thought of. If it doesn't, you can always ask it to.
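
                                                                                                      For example, the kind of edge-case table it tends to produce unprompted (a sketch; parse_port is a hypothetical function under test):

                                                                                                        import pytest

                                                                                                        from mylib import parse_port  # hypothetical function under test

                                                                                                        @pytest.mark.parametrize("raw,expected", [
                                                                                                            ("80", 80),
                                                                                                            ("  8080 ", 8080),  # surrounding whitespace
                                                                                                            ("0", 0),           # lower boundary
                                                                                                            ("65535", 65535),   # upper boundary
                                                                                                        ])
                                                                                                        def test_parse_port_valid(raw, expected):
                                                                                                            assert parse_port(raw) == expected

                                                                                                        @pytest.mark.parametrize("raw", ["", "abc", "-1", "65536", None])
                                                                                                        def test_parse_port_invalid(raw):
                                                                                                            with pytest.raises((ValueError, TypeError)):
                                                                                                                parse_port(raw)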

                                                                                                      In any case I want to stress that LLMs are only as good as your data and prompts. They lack the nuance of understanding lots of context, yet I see people talking to them as if they were humans who understand the business, best practices, and so on.

                                                                                                      • moffkalast 8 hours ago

                                                                                                        That first one has always felt super crazy to me; since LLMs became a thing, I've figured out what lots of "arcane magic, don't touch" type functions genuinely do.

                                                                                                        Even if it's slightly wrong it's usually at least in the right ballpark so it gives you a very good starting point to work from. Almost everything is explainable now.

                                                                                                        • sherburt3 30 minutes ago

                                                                                                          Agreed. AI has been a godsend for understanding snippets of Perl code in our codebase that were basically unreadable before unless you were an expert.

                                                                                                          • epolanski 7 hours ago

                                                                                                            I can relate. I have been genuinely amazed more than once by how it could "understand" some very complex code nobody dared to touch, like you mention.

                                                                                                            • moffkalast an hour ago

                                                                                                              Kinda reminds me of that Glados quote, haha:

                                                                                                              "These next tests require cooperation. Consequently, they have never been solved by a human. That's where you come in. You don't know pride, you don't know fear, you don't know anything. You'll be perfect."

                                                                                                              It takes someone with no ego, no preconceptions, and infinite patience to delve in and come back alive.

                                                                                                        • timrichard 9 hours ago

                                                                                                          I think the .cursorrules and .cursorignore files might be useful here.

                                                                                                          Especially the .cursorrules file, as you can include a brief overview of the project and ground rules for suggestions, which are applied to your chat sessions and Cmd/Ctrl K inline edits.
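
                                                                                                          For illustration, a minimal .cursorrules might look something like this (contents invented, since the file is free-form text):

                                                                                                            # .cursorrules (invented example)
                                                                                                            This project is a Rust library exposing a streaming JSON parser.

                                                                                                            Ground rules for suggestions:
                                                                                                            - Prefer iterators over collecting into intermediate Vecs.
                                                                                                            - Every public item needs a doc comment.
                                                                                                            - Never add a new dependency without asking first.
                                                                                                            - Tests live next to the module they cover.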

                                                                                                        • falcor84 10 hours ago

                                                                                                          So for anyone who doubted SWE-bench's relevance to typical tasks: its stated 13.86% almost exactly matches the 3 successes out of 20 (15%) in this pilot.

                                                                                                          We're not quite there yet, but all of these issues seem solvable with current tech by applying additional training and/or "agentic" direction. I would now expect pretty much the textbook disruptive-innovation process over the next decade or so, until the typical human dev role is pushed to something more akin to the responsibilities of current-day architects and product managers. QA engineering, though, will likely see a big resurgence.

                                                                                                          • toyetic 5 hours ago

                                                                                                            >> but all of these issues seem to be solvable with current tech by applying additional training and/or "agentic" direction.

                                                                                                            Can you explain why you think this? From what I gather from other comments, it seems like if we continue on the current trajectory, at best you'd still need a dev who understands the project's context to work in tandem with the agent so the code doesn't devolve into slop.

                                                                                                            • falcor84 an hour ago

                                                                                                              > so the code doesn't devolve into slop

                                                                                                              As I see it, this is pretty much a given across all codebases, with a natural tendency of all projects to become balls of mud if the developer(s) don't actively strive to address technical debt and continuously refactor to address the new needs. But having said that, my experience is that for a given task in an unfamiliar codebase, an AI agent is already better at maintaining consistency than a typical junior developer, or even a mid-level developer who recently joined the team. And when explicitly given the task of refactoring the codebase while keeping the tests passing, the AI agents are already very capable.

                                                                                                              The biggest issue, which may be what you're alluding to, is that AI agents are currently very bad at recognizing the limits of their capabilities: they keep trying an approach long after a human dev would have given up and gone to their lead to ask for help or for the task specification to be redefined. That's definitely an issue, but I don't see any fundamental technological limitation here, rather something addressable via an engineering effort.

                                                                                                              In general, I've seen so many benchmarks fall to AI in the past decade (including SWE-bench) that I'm now quite confident that if a task performed by humans can be defined with clear numerical goals, then it's achievable by AI.

                                                                                                              And another way I'm looking at it: for any specific knowledge-work competency, it already seems much easier and more time-effective to train an AI to do well on it than to create a curriculum for humans to learn it and then have every single human go through it.

                                                                                                          • DaveMcMartin 8 hours ago

                                                                                                            This only reinforces my bias against AI agents. At this point, they are mostly just hype. I believe that for AI to replace a junior, we would need to achieve at least near-AGI, and we are far from that.

                                                                                                            • eitland 7 hours ago

                                                                                                              If by hype you mean that there isn't extreme real world value right here and right now, then I very much disagree.

                                                                                                              Closing in on 20 years since I left school and for me AI is absolutely useful, right here and right now. It is really a bicycle for the mind:

                                                                                                              It allows me to get much faster to where I want. (And like bicycles you will get a few crashes early on and possibly later as well, depending on how fast you move and how careful you are.)

                                                                                                              I might be in some sweet spot where I am both old enough to know what is going on without using an AI but also young enough to pick up the use of AI relatively effortlessly.

                                                                                                              If however by hype you mean that people still have overhyped expectations about the near future, then yes, I agree more and more.

                                                                                                              • marginalia_nu 6 hours ago

                                                                                                                I feel AI can do simple monotonous coding tasks, but I don't think programming is something it's currently very good at. Samples, yes; trivial programs, sure; but for anything non-trivial it's rarely useful.

                                                                                                                Where it really shines today is getting humans up to speed with new technologies, things that are well understood in general but maybe not well understood by you.

                                                                                                                Want to, say, build a window manager in X11, despite never having worked with X11 before? Sure, Claude will point you in the right direction and give you a simple template to work with in 30 seconds. That's an enormous time saver compared to figuring it out from scratch.

                                                                                                                Never touched node in your life but want to build a simple electron app? Sure, here's how you get started. Few hours and several follow up questions later, you're comfortable and productive in the environment.

                                                                                                                Getting off the ground with new technologies is so much easier with AI it's kind of ridiculous. The revolutionary part of AI coding is how it makes it much easier to be a true generalist, capable of working in any environment with any technology, whatever is appropriate.

                                                                                                              • thegeomaster 8 hours ago

                                                                                                                Exactly. LLMs are gullible. They will believe anything you tell them, including incorrect things they have told themselves. This amplifies errors greatly, because they don't have the capacity to step back and try a different approach, or to introspect on why they failed. They need actual guidance from somebody with common sense; if let loose in the world, they mostly just spin around in circles because they don't have this executive intelligence.

                                                                                                                • falcor84 5 hours ago

                                                                                                                  A regular single-pass LLM indeed cannot step back, but newer ones like o1/o3/Marco-o1/QwQ can, and a larger agentic system composed of multiple LLMs definitely can. There is no "fundamental" limitation here. And once we start training these larger systems from the ground up via full reinforcement learning (rather than composing existing models), the sky's the limit. I'd be very bullish about Deepmind, once they fully enter this race.

                                                                                                              • weatherlite 6 hours ago

                                                                                                                The whole idea of Devin is pointless and doomed to fail, in my humble opinion. Big tech will be quite capable of delivering AI agents/assistants very soon, and I don't think wrappers over other people's LLMs, like Devin, make a lot of sense. Can someone help me understand the value proposition / moat of this company?

                                                                                                                • dysoco 6 hours ago

                                                                                                                  I'm confused here: aren't agents/assistants basically wrappers over LLMs, or tools that interact with them, as well? Devin seems to be in this category.

                                                                                                                  • fullstackchris 6 hours ago

                                                                                                                    I recommend you look at tools like Aider or Codebuff... sure, they need to call some LLM at some point (could be your own, could be external), but the key thing is that they do complex modifications of source code using things like tree-sitter, i.e. you don't rely directly on the LLM modifying the code, but on the LLM using syntax trees to modify it. See Aider's source code: https://github.com/Aider-AI/aider/tree/main/aider/queries
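
                                                                                                                    To illustrate the idea (using Python's stdlib ast module here as a stand-in for tree-sitter): parse the source into a tree and target exact node spans instead of trusting the LLM to rewrite whole files:

                                                                                                                      import ast

                                                                                                                      source = "def greet(name):\n    return 'Hello, ' + name\n"

                                                                                                                      # Parse into a syntax tree and report each function's exact line
                                                                                                                      # span, so an edit can be applied to that span alone.
                                                                                                                      tree = ast.parse(source)
                                                                                                                      for node in ast.walk(tree):
                                                                                                                          if isinstance(node, ast.FunctionDef):
                                                                                                                              print(node.name, node.lineno, node.end_lineno)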

                                                                                                                    Simple copy-pasting of "here's my prompt, give me code" was never going to be right every time, and DEFINITELY won't work for an agent. We need to start thinking about how to use these LLMs in smarter ways (like the above-mentioned tools).

                                                                                                                    • verdverm 6 hours ago

                                                                                                                      Can Aider sit inside VS Code, understand what files I have open, and use them as context? Their docs lead me to say no, that it's an inline chat/completion experience.

                                                                                                                      • ParetoOptimal 4 hours ago

                                                                                                                        There is a /chat command and an /add command, so I'd assume a plugin like that is possible.
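
                                                                                                                        Roughly like this (an illustrative session, not exact aider output):

                                                                                                                          $ aider src/app.py           # start aider with one file in the chat
                                                                                                                          > /add src/utils.py          # pull another file into context
                                                                                                                          > fix the off-by-one in paginate()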

                                                                                                                • aurareturn 12 hours ago

                                                                                                                  What model does Devin use? How would it change if it used o1 or even o3 for times when it gets stuck?

                                                                                                                  I.e. generate the initial code using GPT-4o/Claude 3.5, then start testing the code, and when it gets stuck, use o1/o3 to help.
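
                                                                                                                  As a rough sketch of that loop (call_model and run_tests are hypothetical stand-in helpers, not a real API):

                                                                                                                    # Hypothetical escalation loop: cheap model first, expensive
                                                                                                                    # reasoning model only when the tests still fail.
                                                                                                                    def solve(task: str, max_cheap_tries: int = 3) -> str:
                                                                                                                        code = call_model("claude-3.5-sonnet", task)
                                                                                                                        for _ in range(max_cheap_tries):
                                                                                                                            failures = run_tests(code)
                                                                                                                            if not failures:
                                                                                                                                return code
                                                                                                                            code = call_model("claude-3.5-sonnet", task, failures)
                                                                                                                        # still stuck: escalate to the stronger reasoning model
                                                                                                                        return call_model("o1", task, run_tests(code))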

                                                                                                                  • monkeydust 9 hours ago

                                                                                                                    Yeah, this is what I was wondering as well. I have o1, not o1 pro, but I gather from Reddit/YouTube that o1 pro, used correctly, is superior for coding tasks.

                                                                                                                  • bodge5000 6 hours ago

                                                                                                                    What I tend to find with AI agents is that they reveal how much heavy lifting the dev is actually doing.

                                                                                                                    A personal example: my best use of AI so far has been cases where documentation was poor to nonexistent, and Claude was able to give me a solution. It wasn't a working solution, nowhere close, but it was enough for me to extrapolate and do my own research based on the structure, classes and functions it used. Basically, it gave me somewhere to start from. Whether that's worth the social, economic and environmental problems is another story.

                                                                                                                    • coffeebeqn 12 hours ago

                                                                                                                      Sounds exactly like my experience with the “agents” about a year ago. AutoGPT, or whatever it was called: it works great 1% of the time, and the rest of the time it gets stuck in the wrong places, completely unable to back out.

                                                                                                                      I’m now using o1 or Claude Sonnet 3.5 and usually one of them gets it right.

                                                                                                                      • ipnon 11 hours ago

                                                                                                                        The current frontier models are all neocortex. They have no midbrain or crocodile brain to reconcile physical, legal or moral feedback. The current state of the art is to run every LLM response through a physical/legal/moral classifier and answer with a generic "I'm sorry Dave, I'm afraid I can't do that."

                                                                                                                        We are fooled into thinking these golems have a shred of humanity, but their method of processing information is completely backward. Humans begin with a fight/flight classifier, then a social consensus regression, and only after this do we start generating tokens ... and we do this every moment of every day of our lives, uncountably often, the only prerequisite being the calories in an occasional slice of bread and butter.

                                                                                                                      • buremba 5 hours ago

                                                                                                                        The assumption about low-code tooling was that AI would get so good at writing actual code that it would make low-code tools redundant. Spending time with Windsurf, Cursor, and a bunch of VSCode extensions, it was impressive to see new projects created autonomously, but asking for new requirements or bug fixes after >10 iterations got much harder.

                                                                                                                        I had to audit the code and give specific directions on how to restructure it to avoid getting stuck as the project grew more complex. That makes me think autonomous agents will do much better on low-code tools, since their restrictions keep the agent on track. The problem is that low-code tools also get harder to scale, after maybe >200 iterations (for a medium-sized project, on average 6 months).

                                                                                                                        • pplonski86 9 hours ago

                                                                                                                          I'm working on an AI assistant in a Python notebook, aimed at helping with data science tasks. I'm not using it to do a full analysis; it would fail. What I ask is to create a code snippet for my next step in the analysis. Many times I need to manually change the code, but that's fine because the LLM speeds up my coding a lot. And it is really fantastic at writing matplotlib code for visualization. I don't remember all the matplotlib syntax to change axis labels, add annotations or change styles, and the LLM handles it well, at impressive speed.
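
                                                                                                                          For example, tweaks like these, trivial to describe but fiddly to remember (sketch with invented data):

                                                                                                                            import matplotlib.pyplot as plt

                                                                                                                            fig, ax = plt.subplots()
                                                                                                                            ax.plot([1, 2, 3, 4], [3, 7, 6, 9], marker="o")
                                                                                                                            ax.set_xlabel("Week")                     # axis labels
                                                                                                                            ax.set_ylabel("Signups")
                                                                                                                            ax.set_title("Pilot signups over time")
                                                                                                                            ax.annotate("campaign launch",            # annotation with arrow
                                                                                                                                        xy=(2, 7), xytext=(2.6, 4),
                                                                                                                                        arrowprops=dict(arrowstyle="->"))
                                                                                                                            ax.spines["top"].set_visible(False)       # style tweak
                                                                                                                            plt.show()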

                                                                                                                          • jboggan 7 hours ago

                                                                                                                            "Even more telling was that we couldn’t discern any pattern to predict which tasks would work."

                                                                                                                            I think this cuts to the core of the problem with having a human in the loop. If we cannot learn from repeated use how best to use the tool, and discern some pattern of best and worst practices, then it isn't really a tool.

                                                                                                                            • bsenftner 8 hours ago

                                                                                                                              At some point people are going to realize that using these LLM AIs is a communications problem, and by that I mean the reason various attempts to use them fail is that they are not being effectively told what to do. Vague and implied requests are not enough for an inhuman statistical construct to grasp what you're asking; it needs clearer, more detailed, and more specific instructions.

                                                                                                                              • exo-cortex 11 hours ago

                                                                                                                                I remain sceptical about the "Planet Tracker" task. The task was to debunk claims about historical positions of Jupiter and Saturn. If the task was to show that those planets were NOT in a certain (claimed) position, even an erroneous program would appear to "debunk" the claims. Did they check whether the positions Devin's code calculated were actually correct, say against a NASA database? If Devin gave arbitrary positions for the planets, it's very likely they'd differ from any claim and so appear to debunk it.

                                                                                                                                • Yenrabbit 2 hours ago

                                                                                                                                  I was able to read the code it wrote and check that (as hoped) it was using a good existing library to do the heavy lifting. And I had it make plots I could use to visually check that the values were 'reasonable'. The value in that case was simply that I didn't have to leave the couch and write the code myself (although if the result was actually needed for anything more important than a smug 'I thought so' confirmation, I would still have taken over and validated it more carefully).

                                                                                                                                • jamalaramala 5 hours ago

                                                                                                                                  > Even more concerning was Devin’s tendency to press forward with tasks that weren’t actually possible. (...)

                                                                                                                                  > Devin spent over a day attempting various approaches and hallucinating features that didn’t exist.

                                                                                                                                  One of the big problems with GenAI is its inability to know what it doesn't know.

                                                                                                                                  Because of that, they don't ask clarifying questions.

                                                                                                                                  Humans, in the same situation, would spend a lot of time learning before they could be truly productive.

                                                                                                                                  • iLoveOncall 5 hours ago

                                                                                                                                    Your statement is factually wrong: Claude 3.5v2 asks clarifying questions when needed "natively", and you can add similar instructions in your prompt for any model.

                                                                                                                                    • sitkack 5 hours ago

                                                                                                                                      The default system prompts are tuned for the naive case. LLMs being all purpose text handling tools, can be reprogrammed for any behavior you wish. This is the crux of skilled use of LLMs.

                                                                                                                                      The better the LLMs get, the worse the average prompt quality.

                                                                                                                                      • baobabKoodaa 4 hours ago

                                                                                                                                        Yep. It's fairly trivial to prompt an LLM to say "I don't know" when it doesn't know something.
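
                                                                                                                                        For example, a system-prompt fragment along these lines (wording invented for illustration, works with any model):

                                                                                                                                          You are a coding assistant. If you are not confident that an API,
                                                                                                                                          flag, or configuration option actually exists, answer "I don't know"
                                                                                                                                          and ask a clarifying question instead of guessing.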

                                                                                                                                  • paradite 10 hours ago

                                                                                                                                    I also wrote my first impressions on Devin, more focused on the user experience and analysis of its capabilities (with lots of screenshots):

                                                                                                                                    https://thegroundtruth.substack.com/p/devin-first-impression...

                                                                                                                                    • timabdulla 8 hours ago

                                                                                                                                      Your take seems much more positive than theirs. What do you think the key differences are between your experience and the one here?

                                                                                                                                      • paradite 8 hours ago

                                                                                                                                        One possible reason is that I'm using popular tech stacks (Next.js, HTML/JS for demo website and SDK). No niche frameworks or tools like nbdev (I've never heard of that).

                                                                                                                                        Also I've been prompting ChatGPT and Claude for over a year, that might help with communicating with Devin.

                                                                                                                                    • neves 5 hours ago

                                                                                                                                      Do you have good references about using AI coding assistants?

                                                                                                                                      Techniques of prompt engineering help a lot, but I really think a body of knowledge will be built up about how to use these assistants, which contexts they suit, and what the good heuristics are. They are a valuable tool, but I feel it's possible to extract more value.

                                                                                                                                      • brown_martin 6 hours ago

                                                                                                                                        I've been experimenting with code gen on and off for the last 18 months, and this is exactly in line with my experience.

                                                                                                                                        • pcwelder 6 hours ago

                                                                                                                                          Most of the problems you mentioned will likely be solved in the next iterations of Devin or similar products.

                                                                                                                                          I can say that because I work daily with Claude as an agent over mcp, and the problems you mentioned feel very familiar.

                                                                                                                                          Based on the type of issues you mentioned, Devin likely isn't using o1 yet. A workflow like o1 for planning, Claude for coding, o1 for review, etc. would work better.
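
                                                                                                                                          Roughly this shape (call_model is a hypothetical stand-in helper, not a real API):

                                                                                                                                            # Hypothetical three-stage split: reasoning model plans,
                                                                                                                                            # coding model implements, reasoning model reviews.
                                                                                                                                            def run_task(task: str) -> str:
                                                                                                                                                plan = call_model("o1", "Plan the change step by step:\n" + task)
                                                                                                                                                code = call_model("claude-3.5-sonnet", "Implement this plan:\n" + plan)
                                                                                                                                                review = call_model("o1", "Review the result against the plan:\n" + code)
                                                                                                                                                if "LGTM" not in review:
                                                                                                                                                    code = call_model("claude-3.5-sonnet", "Address these notes:\n" + review)
                                                                                                                                                return code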

                                                                                                                                          The problems you mentioned (the ssh-key issue unrelated to the script, code not following existing patterns or themes, instructions not being followed, extra abstractions, etc.) fall into that category.

                                                                                                                                          Some of the issues are likely due to the context-length problem. For example, LLMs don't work well with Jupyter notebooks because of the extra junk in .ipynb files, which will likely remain a problem.
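
                                                                                                                                          One mitigation that works today is stripping the notebook down to bare code cells before handing it to the model (a sketch; an .ipynb file is just JSON):

                                                                                                                                            import json

                                                                                                                                            # Reduce an .ipynb to bare code cells so the model sees source
                                                                                                                                            # instead of JSON metadata and base64 output blobs.
                                                                                                                                            def notebook_to_source(path: str) -> str:
                                                                                                                                                with open(path) as f:
                                                                                                                                                    nb = json.load(f)
                                                                                                                                                cells = [
                                                                                                                                                    "".join(cell["source"])
                                                                                                                                                    for cell in nb.get("cells", [])
                                                                                                                                                    if cell.get("cell_type") == "code"
                                                                                                                                                ]
                                                                                                                                                return "\n\n".join(cells)

                                                                                                                                            print(notebook_to_source("analysis.ipynb"))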

                                                                                                                                          • elicksaur 5 hours ago

                                                                                                                                            We’ll see! We’re just one year away from AGI. Just like we were last year!

                                                                                                                                          • bpicolo 6 hours ago

                                                                                                                                            An engineer that thinks it knows everything (but doesn't) and can't self-correct is about the worst combo I can think of.

                                                                                                                                            • falcor84 6 hours ago

                                                                                                                                              Well, having read too much sci-fi, I am more afraid of an AI engineer that really does know everything.

                                                                                                                                            • Over2Chars 7 hours ago

                                                                                                                                              No matter what happens with Devin specifically, I think this is a really important topic and I enjoy reading updates on this kind of review every time.

                                                                                                                                              Please keep them coming.

                                                                                                                                              • puuush 10 hours ago

                                                                                                                                                If you're going to compare other tooling, I'm curious to know what you think of our long term goals: https://github.com/charperbonaroo/robtherobot/issues/2

                                                                                                                                                • suneater921 10 hours ago

                                                                                                                                                  I can’t believe they named it after Devin Franco - guess it can take a lot of load!

                                                                                                                                                  • motbus3 8 hours ago

                                                                                                                                                    I've seen a few people testing it, and it is quite disappointing. Sometimes a task takes forever and delivers a bad result or fails completely.

                                                                                                                                                    It seems to target a few specific problems, and anything else is just too hard. I also think that, though it is expensive, it is cheap relative to the technology behind it, and they won't be able to keep that price for long.

                                                                                                                                                    • gtsop 11 hours ago

                                                                                                                                                      Honestly, I have been bitten so many times by LLM hallucinations when working in parallel with the LLM that I wouldn't trust it autonomously running anything at all. If you have tried to use imaginary APIs, imaginary configuration, and imaginary CLI arguments, you know what I mean.

                                                                                                                                                      • energy123 11 hours ago

                                                                                                                                                        > If you have tried to use imaginary APIs, imaginary configuration and imaginary cli arguments, you know what I mean

                                                                                                                                                        I see this comment a lot, but I can't help feeling it's four weeks out of date. In my experience, the version of o1 released on 2024-12-17 rarely hallucinates when asked code questions of basic to medium difficulty and given good context and a well-written prompt. If the context window is under 10k tokens, I have very high confidence that the output will be correct. GPT-4o and o1-mini, on the other hand, hallucinate a lot, and I have learned to put low trust in their output.

                                                                                                                                                        • meiraleal 10 hours ago

                                                                                                                                                          I have been feeling LLM burnout and favoring coding it all myself after a year of LLM assistance. When it gets things wrong it is too annoying. Like, I would get mad and start cursing at it, shouting out loud and in the chat.

                                                                                                                                                          • nejsjsjsbsb 8 hours ago

                                                                                                                                                            I mainly use it as a typing assist. If it suggests ahead what I was thinking it saves time.

                                                                                                                                                        • ipnon 12 hours ago

                                                                                                                                                          Now is the time for us to hold seemingly contradictory propositions: A child born today will live to see 99% of all computer code written by artificial intelligence, but the current AI boom is massively overcapitalized.

                                                                                                                                                          • zeroonetwothree 10 hours ago

                                                                                                                                                            I don't put much stock in predictions about 100 years into the future.

                                                                                                                                                            • mnky9800n 9 hours ago

                                                                                                                                                              would you like to buy a flying car?

                                                                                                                                                              • nejsjsjsbsb 8 hours ago

                                                                                                                                                                I have a Tesla in space to sell ya

                                                                                                                                                                • mnky9800n 2 hours ago

                                                                                                                                                                  i accept.

                                                                                                                                                                • throwup238 8 hours ago

                                                                                                                                                                  No, I want 140 characters.

                                                                                                                                                              • JTyQZSnP3cQGa8B 11 hours ago

                                                                                                                                                                How is it contradictory?

                                                                                                                                                                • torginus 8 hours ago

                                                                                                                                                                  I'd argue that software is being written (whether by humans or AI) in an order where each new piece adds progressively less marginal value (if we define value in the capitalist sense).

                                                                                                                                                                  Most of the value that software will ever create has already been created.

                                                                                                                                                                  The only truly valuable things still missing are those whose value is hard to translate for capitalists, or those that need some visionary work.

                                                                                                                                                                  • meiraleal 10 hours ago

                                                                                                                                                                    That's already the case if you call compilers/interpreters "AI". Just a new higher level abstraction for code.
