I had a surprising interaction with Gemini 2.5 Pro that this project reminds me of. I was asking the LLM for help using an online CAS system to solve a system of equations, and the CAS system wasn't working as I expected. After a couple back and forths with Gemini about the CAS system, Gemini just gave me the solution. I was surprised because it's the kind of thing I don't expect LLMs to be good at. It said it used Python's sympy symbolic computation package to arrive at the solution. So, yes, the marriage of fuzzy LLMs with more rigorous tools can have powerful effects.
Just like humans... we are not so good at hard number crunching, but we can invent computers that are amazing at it. And with a lot of effort we can make a program that uses a whole lot of number crunching to be ok at predicting text but kind of bad at crunching hard numbers. And then that program can predict how to create and use programs which are good at number crunching.
Small steps of nondeterministic computation, checked thoroughly with deterministic computation every so often, and the sky is the limit.
That's when A.I. starts advancing itself and needs humans in the loop no more.
> That's when A.I. starts advancing itself and needs humans in the loop no more.
You've got to put the environment back in the loop though; it needs a source of discovery and validity feedback for ideas. For math and code it's easy, for self-driving cars doable but not easy, for business ideas - how would we test them without wasting money? It varies field by field: some allow automated testing, others are slow, expensive, and rate-limited to test.
Simulation is the answer. You just need a model that's decent at economics to independently judge the outcome, unless the model itself is smart enough. Then it becomes a self-reinforcing training environment.
Now, depending on how good your simulation is, it may or may not be useful, but still, that's how you do it. Something like https://en.wikipedia.org/wiki/MuZero
Electric dreams. Simulation of what?
Simulated environment suggests the possibility of alignment during training but real time, real world, data streams are better.
But the larger point stands: you don't need an environment to explore the abstraction landscape prescribed by systems thinking. You only need the environment at the human interface.
> doable but not easy, for business ideas
That requires a lot of human psychology and advanced hard economic theory (not the fluffy academic kind). With a human-controlled monetary supply, and with most high-level business requiring illegal and immoral exploitation of law and humans in general, it's not a path machines can realistically go down, or one we'd even want machines treading down. Think scams and pure resource extraction. They won't consider many impacts outside of the bottom line.
The question is where should AI advance itself? Which direction? There are an infinite number of theorems that can be derived from a set of axioms. Infinite. AI can't prove them all. Somebody needs to tell it what it needs to do, and that is us.
Your checks don't have to be deterministic either.
Eg randomised quicksort works really well.
Couldn't disagree more.
Sorting a finite number of elements in a sequence is a very narrow application of AI, akin to playing chess. Usually very simple approaches like RL work totally fine for problems like these, but auto-regression/diffusion models have to take steps that are not well defined at all, and the next step towards solving the problem is not obvious.
As an example, imagine a robot trying to grab a tomato from a table. Its arm extends 1 meter at most, and the tomato is placed 0.98 meters away. Can it grab the tomato from where it stands, or does it need to move closer and only then try to grab it?
That computation is better done deterministically. Deterministic computation is faster, cheaper and more secure. It has to prove that $tomato_distance + $tomato_size < $arm_length. If the constraint is not satisfied, then move_closer() and check again: $tomato_distance + $tomato_size < $arm_length.
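Concretely, that kind of deterministic check is trivial to write and audit (a Python sketch; the step size and helper names are invented for illustration):

    ARM_LENGTH = 1.0  # metres; invented constant for the example
    STEP = 0.1        # how far move_closer() advances, also invented

    def can_grab(tomato_distance, tomato_size, arm_length=ARM_LENGTH):
        # The constraint above: the far edge of the tomato must be within reach.
        return tomato_distance + tomato_size < arm_length

    def grab_tomato(tomato_distance, tomato_size):
        # Move closer in fixed steps until the deterministic constraint holds.
        while not can_grab(tomato_distance, tomato_size):
            tomato_distance -= STEP  # move_closer()
        return tomato_distance       # distance at which the grab is attempted

    print(grab_tomato(0.98, 0.05))   # one step closer, then grabs from ~0.88 m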
From the paper:
> Our system employs a custom interpreter that parses "LLM-Thoughts" (represented as DSL code snippets) to generate First Order Logic programs, which are then verified by a Z3 theorem prover.
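To illustrate the flavor of that verification step (not the paper's DSL or pipeline, just a toy propositional example via the z3py bindings): assert the extracted facts and rules, then check that the negation of the conclusion is unsatisfiable.

    from z3 import Bool, Implies, Not, Solver, unsat

    # Facts and rules as a toy propositional knowledge base.
    rain = Bool("rain")
    wet_ground = Bool("wet_ground")
    slippery = Bool("slippery")

    s = Solver()
    s.add(rain)                           # fact
    s.add(Implies(rain, wet_ground))      # rule
    s.add(Implies(wet_ground, slippery))  # rule

    # The conclusion is entailed iff its negation is unsatisfiable with the premises.
    s.add(Not(slippery))
    print("entailed" if s.check() == unsat else "not entailed (counterexample exists)")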
> Sorting a finite number of elements in a sequence is a very narrow application of AI, [...]
Sorry, I did not suggest you should use AI to sort numbers. I was solely replying to this:
> Small steps of nondeterministic computation, checked thoroughly with deterministic computation every so often, and the sky is the limit.
You don't necessarily need your checks to be deterministic.
In fact, it's often better for them not to be deterministic.
See also https://fsharpforfunandprofit.com/series/property-based-test...
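For example, a property-based test in the style of that series, using Python's hypothesis library (assumed installed): the inputs are random, but the property is checked exactly.

    from collections import Counter
    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))            # random inputs...
    def test_sort_is_ordered_permutation(xs):
        ys = sorted(xs)
        assert all(a <= b for a, b in zip(ys, ys[1:]))  # ...exact property: ordered
        assert Counter(ys) == Counter(xs)               # ...and a permutation of the input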
I don't understand your claim about 'Deterministic computation is faster, cheaper and more secure.' That's not true at all.
In fact, for many problems the fastest and simplest known solutions are non-deterministic. And in eg cryptography you _need_ non-determinism to get any security at all.
Maybe the number-crunching program the text-generation program creates will, with enough effort, become good at generating text, and will in turn make another number-crunching computer and then…
Watch the movie “The Thirteenth Floor”
This is somewhat unusual: 28% on the Tomatometer, but 7 out of 10 on IMDb.
Beyond its relevancy to the parent comment, would you consider it a good movie yourself? (for a random/average HN commenter to watch)
It didn't do well critically, but audience scores on many platforms are 60-70%. It came hot on the heels of The Matrix, has similar themes, but nowhere near as ... everything compared to Matrix. I'd bet the only reason it did so poorly critically is due to the timing of the release.
It's a fine movie though.
If you like Matrix, Memento, Truman Show, Black Mirror (San Junipero, Bandersnatch), Inception, Interstellar, 12 Monkeys etc. you may also like it. These are not necessarily thematically aligned but based on vibes they cluster near it for me.
I definitely enjoyed it many years ago as a younger person.
Three movies with overlapping themes came out in mid-1999: The Matrix, The Thirteenth Floor, and eXistenZ (probably in that order of box office revenue).
I love this kind of thought. Thanks.
Parent post is talking about symbolic manipulation, not rote number crunching, which is exactly what we're supposed to be good at and machines are supposed to be bad at.
We do plenty of number crunching all the time, just not consciously.
Like the inverse kinematics required for your arm and fingers to move.
I’d argue we aren’t solving those inverse kinematics / kinetics via “number crunching” - but rather that our neuromuscular systems are analog, which I wouldn't call “number crunching” in the sense that current computers … compute.
As a psychologist, I completely agree. It absolutely is NOT number crunching. Analog computation is primary and dominant in animals. It has to be, for so many reasons. I continue to be amazed at how much IT people do NOT grasp human and animal IT. And that, I would argue, is why so many IT folks keep talking about our supposedly approaching human intelligence in technology. If they really understood human intelligence the absurdity of that statement would keep them quiet. An elegant, artful puppet is still a puppet, and without the personal history context and consciousness we possess, not to mention a vast complex of analogue computation functionality we rely upon, that puppet will only ever be a clever number-cruncher. We are so much more.
Are our brains "analog"? Or are they in fact "digital"? I would think actually more digital than analog. A synapse either triggers or it does not; there's nothing in between. In this sense it is 0 or 1.
Similarly transistor-based logic is based on such thresholds, when current or voltage reaches a certain level then a state-transition happens.
Well, no, synapses aren't binary in response.
Would you say it's more memorization than actual calculations? It certainly feels like it, when say riding a bike.
Neurons aren't crunching numbers for inverse kinematics.
I really like LLM+sympy for math. I have the LLM write me a sympy program, so I can trust that the symbolic manipulation is done correctly.
The code is also a useful artifact that can be iteratively edited and improved by both the human and LLM, with git history, etc. Running and passing tests/assertions helps to build and maintain confidence that the math remains correct.
I use helper functions to easily render from the sympy code to latex, etc.
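A minimal sketch of that flow (the equations here are invented for illustration, not from the quantum eraser work):

    import sympy as sp

    x, y = sp.symbols("x y")
    # The LLM writes this part; the human reviews and extends it.
    solution = sp.solve([sp.Eq(x + y, 3), sp.Eq(x - y, 1)], [x, y], dict=True)[0]

    # Assertions keep the math honest as the code is iterated on.
    assert solution[x] == 2 and solution[y] == 1

    # Helper-style rendering of results to LaTeX for write-ups.
    print(sp.latex(sp.Eq(x, solution[x])))  # -> x = 2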
A lot of the math behind this quantum eraser experiment was done this way.
https://github.com/paul-gauthier/entangled-pair-quantum-eras...
I get having it walk you through figuring out a problem with a tool: seems like a good idea and it clearly worked even better than expected. But deliberately coaxing an LLM into doing math correctly instead of a CAS because you’ve got one handy seems like moving apartments with dozens of bus trips rather than taking the bus to a truck rental place, just because you’ve already got a bus pass.
I feel like a better analogy is trying to rent a truck to move to a new apartment and, after the trucks repeatedly fail to work, they just hire a moving company for you to get you to leave.
All of those tools are purpose-built for moving people. LLMs are not at all built for doing math.
How did that work? Did Gemini call sympy on your machine, or is access to sympy built in and available through normal chat?
LLMs are statistical language models (d'uh) not reasoners after all. I found generating logic programs, and Prolog source specifically, to work unreasonably well, though [1], maybe because Prolog was introduced for symbolic natural language processing and there's a wealth of translation examples in the training set. Might be worth checking out Z3's alternative Datalog syntax [2] instead of its Lisp-ish SMTLib syntax.
[1]: https://quantumprolog.sgml.net/llm-demo/part1.html
[2]: https://microsoft.github.io/z3guide/docs/fixedpoints/syntax
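The Datalog engine is also reachable from z3py; a rough reachability sketch, loosely following the Z3 fixedpoint guide (the relations here are just an illustration, not the article's programs):

    from z3 import BitVecSort, BitVecVal, BoolSort, Consts, Fixedpoint, Function

    fp = Fixedpoint()
    fp.set(engine="datalog")

    s = BitVecSort(8)  # the Datalog engine wants finite domains
    edge = Function("edge", s, s, BoolSort())
    path = Function("path", s, s, BoolSort())
    a, b, c = Consts("a b c", s)

    fp.register_relation(edge, path)
    fp.declare_var(a, b, c)
    fp.rule(path(a, b), edge(a, b))                 # path(X,Y) :- edge(X,Y).
    fp.rule(path(a, c), [edge(a, b), path(b, c)])   # path(X,Z) :- edge(X,Y), path(Y,Z).

    n1, n2, n3 = (BitVecVal(i, 8) for i in (1, 2, 3))
    fp.fact(edge(n1, n2))
    fp.fact(edge(n2, n3))

    print(fp.query(path(n1, n3)))  # sat: node 3 is reachable from node 1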
Yep! Datalog syntax for Z3 is pretty neat! We used SMT [1] in our grammars paper because it allowed the most interoperability with solvers, but our technique also works with PROLOG, as tested out at the behest of reviewers at NeurIPS. I would assume that this should also work with Datalog [2].
[1] https://arxiv.org/abs/2505.20047 [2] https://github.com/antlr/grammars-v4/blob/master/datalog/dat...
Neurosymbolic systems are very likely the future, as mentioned so many times here already.
I cannot use WolframAlpha most of the time since the syntax is not that natural. WolframAlpha is good AI, it never lies.
Calculators are good AI; they rarely lie (and only due to floating-point rounding). And yes, Wikipedia says calculators are AI tech, since a Computer was once a person, and now it is a tool that shows the intelligent trait of doing math with numbers or even functions/variables/equations.
Querying a calculator or a WolframAlpha-like symbolic AI system with LLMs seems like the only use for LLMs, apart from text refactoring, that should be feasible.
Thinking LLMs know anything on their own is a huge fallacy.
This is an interesting approach.
My team has been prototyping something very similar, encoding business operations policies in LEAN. We have some internal knowledge bases (google docs / wiki pages) that we first convert to LEAN using LLMs.
Then we run the solver to verify consistency.
When a wiki page is changed, the process is run again and it's essentially a linter for process.
Can't say it moved beyond the prototyping stage though, since the LEAN conversion does require some engineers to look through it at least.
But a promising approach indeed, especially when you have a domain that requires tight legal / financial compliance.
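Not LEAN, but to give a flavor of the "linter for process" idea, here is a rough z3py sketch with made-up policy statements (a real pipeline would generate these from the wiki pages instead):

    from z3 import Bools, Implies, Not, Solver, unsat

    refund_over_30d, manager_approval, auto_approve = Bools(
        "refund_over_30d manager_approval auto_approve")

    s = Solver()
    s.add(Implies(refund_over_30d, manager_approval))    # wiki page A: old refunds need a manager
    s.add(Implies(refund_over_30d, auto_approve))        # wiki page B: old refunds are auto-approved
    s.add(Implies(auto_approve, Not(manager_approval)))  # wiki page B: auto-approval skips the manager
    s.add(refund_over_30d)                               # the case the two pages disagree on

    print("policies contradict" if s.check() == unsat else "consistent")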
The autoformalization gap is pretty difficult to bridge indeed. We explored uncertainty quantification of autoformalization on well-defined grammars in our NeurIPS 2025 paper : https://arxiv.org/abs/2505.20047 .
If you ever feel like chatting and discussing more details, happy to chat!
Could you share an example of such policy? I'm struggling to think of something defined well enough in the real world to apply in Lean.
For anyone curious about what LEAN is, like me, here’s the explanation: Lean Theorem Prover is a Microsoft project. You can find it here: https://www.microsoft.com/en-us/research/project/lean/
That’s pretty cool. It would be super useful to identify contradictory guidance systematically.
This is a very interesting area of research. I did something similar a couple of years ago using logic and probabilistic logic inference engines to make sure conclusions followed from premises.
I also used agents to synthesize, formalize, and criticize domain knowledge. Obviously, it is not a silver bullet, but it does ensure some degree of correctness.
I think introducing some degree of symbolism and agents-as-a-judge is a promising way ahead, see e.g.: https://arxiv.org/abs/2410.10934
Yep! I have read your work! Pretty cool! I also worked on a similar deep research agent for autoformalization this summer at AWS ARChecks, building on similar patterns.
Although that work is not public, you can play with the generally available product here!
[1] https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
Agent/LLM as a judge is biased and only good for bootstrapping. As capabilities get better LLM as a judge will artificially cap your performance, you need to graduate to either expert human judges or deterministic oracles.
LLMs display a form of abductive reasoning, which is not the same as judgement. The only thing in the universe we know that can display judgement is a human. However, many tasks we presume to require human judgement do not, and abductive reasoning will perform as well as a human. This, in theory, acts as a filter if used right, reducing the tasks of human judgement to those that can't be automated with similar or better precision and recall. The trick then is using LLMs and other techniques to reduce the problem space for the human to the kernel of quandary that requires human judgement, and to isolate the salient information to reduce the cognitive load as much as possible. Many, many mundane tasks can be automated in this way, and many complex tasks can be facilitated to greatly magnify the effectiveness of the human-in-the-middle's time.
Why does this have to be true? For example, if you have a different LLM that is judging than the one being judged then their biases could at least be different. Also, as their reasoning abilities improve wouldn't LLM judges approach the abilities of human judges?
LLMs have positional, response-length, and hedge-word biases (and that's just what's rigorously demonstrated in papers) that wash out differences between high-performing answers as you approach the limit of your objective. Imagine you were trying to optimize a function and the measurement function emitted biased random noise; at some point you wouldn't be able to accurately identify the impact of your changes.
Indeed - human judges suck on average. And you can prompt an llm judge to look for particular kinds of problems, then throw the ensemble of judges at an output to nitpick. (Essentially, bake in a diversity of biases through a collection of prompts.)
Am I reading this right? Statistical LLM outputs pushed through a formal logic model? Wouldn't that be a case of "crap in, crap out"?
Formal logic serves as a useful filter. In other words, "crap in, filtered crap out" - remember, evolution works with absolutely random, "crap" mutations, which then are "filtered" by the environment.
You assume it’s all crap when it clearly isn’t often enough to be useful.
That's subjective. One could argue all the things we invented in the past few thousands years were crap. Life would have been much easier in the caves, albeit shorter.
So the core idea is to use an LLM to draft reasoning as a structured, JSON domain-specific language (DSL), then deterministically translate that into first-order logic and verify it with a theorem prover (Z3).
Interesting that the final answer is provably entailed (or you get a counterexample), instead of being merely persuasive chain-of-thought.
The repo is sparse on the details unless you go digging, which perhaps makes sense if this is just meant as the artifact for the mentioned paper.
Unless I’m wrong, this is mainly an API for trying to get an LLM to generate a Z3 program which “logically” represents a real query, including known facts, inference rules, and goals. The “oversight” this introduces is in the ability to literally read the logical statement being evaluated to an answer, and running the solver to see if it holds or not.
The natural source of doubt is: who’s going to read a bunch of SMT rules manually and be able to accurately double-check them against real-world understanding? Who double checks the constants? What stops the LLM from accidentally (or deliberately, for achieving the goal) adding facts or rules that are unsound (both logically and from a real-world perspective)?
The paper reports a *51%* false positive rate on a logic benchmark! That’s shockingly high, and suggests the LLM is either bad at logical models or keeps creating unsoundnesses. Sadly, the evaluation is a bit thin on the ground about how this stacks up, and what causes it to fall short.
Yep. The paper was written last year with GPT-4o. Things have become a lot better since then with newer models.
E.g. https://arxiv.org/pdf/2505.20047 Tab 1, we compare the performance on text-only vs SMT-only. o3-mini does pretty well at mirroring its text reasoning in its SMT, vs Gemini Flash 2.0.
Illustration of this can be seen in Fig 14, 15 on Page 29.
In commercially available products like AWS Automated Reasoning Checks, you build a model from your domain (e.g. from a PDF policy document), cross verify it for correctness, and during answer generation, you only cross check whether your Q/A pairs from the LLM comply with the policy using a solver with guarantees.
This means that they can give you a 99%+ soundness guarantee, which basically means that if the service says the Q/A pair is valid or guaranteed w.r.t the policy, it is right more than 99% of the time.
https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
Re: "99% of the time" ... this is an ambiguous sample space. Soundness of results clearly depends on the questions being asked. For what set of questions does the 99% guarantee hold?
Who makes the rules?
Interesting. I wonder if you could implement tool calling with this approach so the LLM calls the tool with the formal specification and gets back the result. Just like a coding agent can run a compiler, get back errors and then self-correct.
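A sketch of what such a tool might look like on the verification side (hypothetical function name; wiring it into a specific tool-calling API is left out):

    from z3 import Solver, Z3Exception

    def check_formal_spec(smtlib_text: str) -> str:
        # Hypothetical verifier tool for an agent loop: run Z3 on an SMT-LIB spec
        # and return either the verdict or the error text so the LLM can self-correct.
        solver = Solver()
        try:
            solver.from_string(smtlib_text)          # parse the LLM-supplied spec
            return "result: " + str(solver.check())  # sat / unsat / unknown
        except Z3Exception as e:
            return "error: " + str(e)                # feed back, like compiler errors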
This is fascinating! An AI that doesn't just think out loud, but keeps a verifiable diary. It's like a philosopher with a cryptographic notary public living in its brain. Amazing work!
Cool research! I went to the repo to see what the DSL looked like but it was hard to find a clear example. It would be cool if you added a snippet to the README.
Hey! Thank you for the interest! I shall do that. Meanwhile, check out Page 11 onwards. We describe a lot of situations! (https://arxiv.org/pdf/2409.17270)
Upvoting the comment that the git repo would be much more self-contained if it had an intro to the DSL.
LLMs lack logical constraints in the generative process; they only learn probabilistic constraints. If you apply logic verification post hoc, you're not "ensuring the correctness of your LLM's reasoning" (I went down this path a year ago); you're classifying whether the LLM's statistically driven pattern generation happens to correspond to correct logic or not. The LLM's output may be wrong 100% of the time, and your theorem prover simply acts as a classifier, ensuring nothing at all.
Yep, this is a genuine problem, and this is what we term as the autoformalization gap in our follow up paper. (https://arxiv.org/abs/2505.20047)
Some LLMs are more consistent between text and SMT, while others are not. (Tab 1, Fig 14,15)
You can do uncertainty quantification with selective verification to reduce the "risk", for e.g. shown as the Area Under the Risk Coverage Curve in Tab 4.
Well, if you understand that this is a "genuine problem" then what have you done to solve it? A quick look at the abstract of your follow up paper does not reveal an answer.
And let me be clear that this is a major limitation that fundamentally breaks whatever you are trying to achieve. You start with some LLM-generated text that is, by construction, unrelated to any notion of truth or factuality, and you push it through a verifier. Now you are verifying hot air.
It's like research into the efficacy of homeopathic medicine and there's a lot of that indeed, very carefully performed and with great attention to detail. Except all of that research is trying to prove whether doing nothing at all (i.e. homeopathy) has some kind of measurable effect or not. Obviously the answer is not. So what can change that? Only making homeopathy do something instead of nothing. But that's impossible, because homeopathy is, by construction, doing nothing.
It's the same thing with LLMs. Unless you find a way to make an LLM that can generate text that is conditioned on some measure of factuality, then you can verify the output all you like, the whole thing will remain meaningless.
Probabilistic constraints are all around us. You learn that the sine function is the ratio of the length of the side of the right triangle opposite to the angle to the length of the side opposite to the right angle, so obviously the sine is always positive. Yet your thinking should be flexible enough to allow changing the definition to the ordinate of the point on the unit circle where the line corresponding to the given angle and drawn from zero intersects that circle. So your knowledge - the symbolic one - can also be probabilistic.
You're thinking along the right track but without formalization it goes nowhere fast. By layering of differential geometry on top of probability and then maybe category theoretic logic on top of that, each layer constraining the one below it, and all layers cohering, you get somewhere... There is work that's been done in this area, and I was recently interviewed by a journalist who published a high level article on it on Forbes (Why LLMs are failing) and it links to the actual technical work (at first to my high level presentation then Prof. L. Thorne McCarty's work): https://www.forbes.com/sites/hessiejones/2025/09/30/llms-are...
Why is this being down voted? I believe the author acknowledged and responded. Anything wrong?
That is exactly the kind of thing that I hope LLMs will help us achieve before the next AI winter.
I always find it amazing how many people seem to fail to use current LLMs to the fullest, even though they apparently work with them in research settings. This benchmark pipeline simply calls the OpenAI API and then painstakingly tries to parse the raw text output into a structured json format, when in reality the OpenAI API has supported structured outputs for ages now. That already ensures your model generates schema compliant output without hallucinating keys at the inference level. Today all the major providers support this feature either directly or at least indirectly via function calling. And if you run open models, you can literally write arbitrary schema (i.e. not limited to json behind the scenes) adhering inference engines yourself with rather manageable effort. I'm constantly using this in my daily work and I'm always baffled when people tell me about their hallucination problems, because so many of them can be fixed trivially these days.
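For reference, a rough sketch of the structured-outputs call being described, using the OpenAI Python SDK (the model name and schema below are placeholders, not the paper's DSL):

    from openai import OpenAI

    client = OpenAI()
    schema = {  # made-up stand-in schema for illustration
        "type": "object",
        "properties": {
            "facts": {"type": "array", "items": {"type": "string"}},
            "rules": {"type": "array", "items": {"type": "string"}},
            "goal": {"type": "string"},
        },
        "required": ["facts", "rules", "goal"],
        "additionalProperties": False,
    }

    resp = client.chat.completions.create(
        model="gpt-4o",  # any structured-outputs-capable model
        messages=[{"role": "user",
                   "content": "Formalize: all men are mortal; Socrates is a man."}],
        response_format={"type": "json_schema",
                         "json_schema": {"name": "logic_program",
                                         "schema": schema, "strict": True}},
    )
    print(resp.choices[0].message.content)  # schema-conforming JSON, no hand-rolled parsing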
Hey there! I designed and wrote most of the actual interpreter during my internship at Microsoft Research last summer. Constrained decoding for GPT-4 wasn’t available when we started designing the DSL, and besides, creating a regex to constrain this specific DSL is quite challenging.
When the grammar of the language is better defined, like SMT (https://arxiv.org/abs/2505.20047) - we are able to do this with open source LLMs.
What are you talking about? OpenAI has supported structured json output in the API since 2023. Only the current structured output API was introduced by OpenAI in summer 2024, but it was primarily a usability improvement that still runs json behind the scenes.
You're right about the 2023 JSON mode, but our project required enforcing a much more complex DSL grammar (look in Appendix for details), not just ensuring a *valid JSON object*. The newer structured output APIs are a significant improvement, but the earlier tools weren't a fit for the specific constraints we were working under at the time.
> What are you talking about?
Please edit out swipes like this from your HN comments—this is in the site guidelines: https://news.ycombinator.com/newsguidelines.html. It comes across as aggressive, and we want curious conversation here.
Your comment would be fine without that bit.
This is not meant as snide, I'm literally confused if I might have misunderstood the problem here. Because the solution would be so obvious.
I believe you! but when an internet reply leads with "what are you talking about?", it's likely to pattern-match this way for many readers. If that's not your intent, it's best to use an alternate wording.
Not to be rude, but they clarified it wasn't meant as a snide remark; why are you trying to control speech to this degree? If we don't like their tone we can downvote them anyway and self-regulate.
They clarified that their intention was good, but intent doesn't communicate itself—it needs to be disambiguated [1]. What matters in terms of moderation is not intent, but effects, i.e. effects on the system in the general case [2].
Arguably your question reduces to: why does HN have moderators at all? The answer to that is that unfortunately, the system of community + software doesn't function well on its own over time—it falls into failure modes and humans (i.e. mods) are needed to jig it out of those [3]. I say "unfortunately" because, of course, it would be so much better if this weren't needed.
You can't assess this at the level of an individual interaction, though, because it's scoped at the whole-system level. That is, we can (and do) make bad individual calls, but what's important is how the overall system functions. If you see the mods making a mistake, you're welcome to point it out (and HN users are not shy about doing so!), and we're happy to correct it. But it doesn't follow that you don't need moderators for the system to work, or even survive.
[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
[2] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
[3] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
I wouldn't find it amazing, there are so many new models, features, ways to use models that the minute you pause to take a deep dive into something specific, 43 other things have already passed by you.
I would agree if you are a normal dev who doesn't work in the field. But even then reading the documentation once a year would have brought you insane benefits regarding this particular issue. And for ML researchers there is no excuse for stuff like that at this point.
I see JSON parse errors on occasion when using OpenAI structured outputs that resolve upon retry. It seems it’s giving instructions to the LLM but validation is still up to the caller. Wondering if others see this too.
Hey, yes! This is because the DSL (Domain Specific Language) is pretty complex, and the LLM finds it hard. We prototype a much more effective version using SMT in our NeurIPS 2025 paper (https://arxiv.org/abs/2505.20047). We shall soon open source that code!
Depends on how strictly you define your types. Are you using pydantic to pass the information to the API? There are a few pitfalls with this, because not everything is fully supported and it gets turned into json behind the scenes. But in principle, the autoregressive engine will simply not allow tokens that break the supplied schema.
Not sure if I've been using it wrong but I've tried using the Zod-to-structured-output helper with GPT-5 and often gotten weird stuff like trailing commas that break a parse or seeing multiple JSON responses in the same response.
Ultimately there are still going to be bugs. For this reason and several others you'll still need it wrapped in a retry.
Yeah that sounds 100% like a user or middleware issue. Don't bother with these wrappers, they are always outdated anyways. Learn how to use the API directly, it will save you a ton of headaches. And it's really not that hard.
No, we're using the OpenAI vendored version of zod-to-json-schema via https://github.com/transitive-bullshit/openai-zod-to-json-sc..., and applying it directly to the `json_schema` field of the OpenAI API. Maybe we have a subtle bug somewhere but I'd expect a 400 response if we were genuinely sending a malformed request.
Yep from time to time.
I’d also be surprised if the models are better at writing code in some custom schema (assuming that’s not Z3's native structure) than writing code in something else. Decent models can write pretty good code and can fix a lot of their mistakes, plus you get testing/etc. setups for free.
It's a relatively new feature, also people need actual professional training to become true LLM developers using them to their fullest and not just developers that happen to call an LLM API here and there. Takes a lot of time and effort.
Yes, this can also improve said reasoning.
The secret the big companies don't want to tell you is that you can turn all their models into reasoning models that way. You even have full control over the reasoning process and can make it adhere to a specific format, e.g. the ones used in legal settings. I've built stuff like that using plain old gpt-4o and it was even better than the o series.
I posted about my year-long development effort on this very method on Reddit 25 days ago. My comment elsewhere in this thread provides a cautionary tale, and the author's response to the basic issue I raised is incomplete in that it leaves out that certain problems simply cannot be solved with LLMs (it requires logical constraints in the generative process, but LLMs lack that layer). So I've pivoted to something else since (also mentioned in my comment elsewhere in this thread).
https://www.reddit.com/r/healthIT/comments/1n81e8g/comment/n...
LLMs and their output are bounded by Rice's theorem. This is not going to ensure correctness; it’s just going to validate that the model can produce an undecidable result.
Errr, checking correctness of proofs is decidable.
I need the same with Mizar https://wiki.mizar.org/
I'm honestly confused why we can't determine how LLMs come to their decisions in the general sense. Is it not possible to log every step as the neural network / vector db / magic happens? Is it merely impractical, or is it actually something that's genuinely difficult to do?
My understanding is that it's neither impractical nor genuinely difficult, it's just that the "logging every step" approach provides explanations of their "reasoning" that are completely meaningless to us, as humans. It's like trying to understand why a person likes the color red, but not the color blue, using a database recording the position, makeup, and velocity of every atom in their brain. Theoretically, yes, that should be sufficient to explain their color preferences, in that it fully models their brain. But practically, the explanation would be phrased in terms of atomic configurations in a way that makes much less sense to us than "oh, this person likes red because they like roses".
>It's like trying to understand why a person likes the color red, but not the color blue, using a database recording the position, makeup, and velocity of every atom in their brain.
But this is an incredibly interesting problem!
Anthropic have done some great work on neural interpretability that gets at the core of this problem.
Everything happens in an opaque super-high-dimensional numerical space that was "organically grown" not engineered, so we don't really understand what's going on.
There are people doing both types. Look up surveys of mechanistic interpretability of language models and of explainable AI for neural networks. Those will give you many techniques for illustrating what's happening.
You'll also see why their applications are limited compared to what you probably hoped for.
It would be like logging a bunch of random noise from anyone's perspective except the LLM's.
I guess I'm also just confused. I get that this is _difficult_ to do, but I would think that computer scientists would be utterly dissatisfied that AI was "non-deterministic" and would poke at the problem until it could be understood.
GPT-4 allegedly has 1.8 trillion parameters.
Imagine having a bunch of 2D matrices with a combined 1.8 trillion total numbers, from which you pick out blocks of numbers in a loop and finally merge and combine them to form a token.
Good luck figuring out what number represents what.
Wouldn't that mean it's totally impractical for day-to-day usage, but a researcher or team of researchers could solve this?
Anthropic has a tool that lets them do this but apparently doing it for even one prompt can take an entire day of work.
That’s so much faster than I expected
ScholarlyArticle: "Proof of thought: Neurosymbolic program synthesis allows robust and interpretable reasoning" (2024) https://arxiv.org/abs/2409.17270 .. https://scholar.google.com/scholar?hl=en&as_sdt=0%2C43&q=%22...
RHEL knife-edge rolling kernel distribution for the proof of concept.
What industrial problems would this solve?
Reasoning? LLMs cannot reason, so why is it always assumed they reason? They mimic reasoning.
How can you know?
By thinking about what a computer is actually doing & realizing that attributing thought to an arithmetic gadget leads to all sorts of nonsensical consequences, like an arrangement of dominoes & their cascade being a thought. The metaphysics of thinking computers is incoherent & if you study computability theory you'll reach the same conclusion.
I'd say that thoughts and reasoning are two different things, you're moving the goalpost.
But what makes the computer hardware fundamentally incompatible with thinking? Compared to a brain
I've already explained it in several places. The burden of proof is on those drawing the equivalence to provide actual evidence for why they believe carbon & silicon are interchangeable & why substrate independence is a valid assumption. I have studied this problem for much longer than many people commenting on this issue & I am telling you that your position is metaphysically incoherent.
It's so funny to me that people are still adamant about this like two years after it's become a completely moot point.
Moot point? As far as I know, it’s still intensely debated, and there are some excellent papers out there providing evidence that LLMs truly are just statistical prediction machines. It’s far from an unreasonable position.
The normative importance of a fact may increase when more people start willfully ignoring it for shorter-term profit.
Imagine somebody in 2007: "It's so funny to me that people are still adamant about mortgage default risk after it's become a completely moot point because nobody cares in this housing market."
That’s nailing it really well: “willfully ignoring” is precisely what’s happening all around me. Me talking about small focused AI models, there you have everyone raving about AGI. Energy use and privacy issues of cloud vs local inference discussions end on how awesome the power of GPUs are and the jobs too. GPU backed finance with depreciation schedules past useful life seems OK for anyone chasing some short term gain. Even the job market is troubled, you can hardly tell a relevant candidate from an irrelevant one because everyone is an AI expert these days - hallucinations seem to make lying more casual.
It’s pretty clear to me there is a collective desire to ignore the problems to sell more GPU, close the next round, get that high paying AI job.
Part of me wishes humans would show the same dedication to fight climate change…
Didn't we have economists' consensus then about what's going to happen?
My point is a fact's popularity is not equal to its importance. That was a scenario to highlight how they can even have an inverse relationship.
Diving into how well/badly anybody predicted a certain economic future is a whole different can of worms.
That said: "The market can stay irrational longer than I can stay solvent." :p
Experts are adamant about this. Just take a look at https://youtu.be/iRqpsCHqLUI
No such thing as an expert consensus on anything about LLMs these days, just different forms of grift.
My point is, the question if an LLM reasons the same way a human does is about as useful as "does a submarine swim" or "can a telephone talk". The results speak for themselves.
> just different forms of grift
That sounds like a false "both sides"-ing.
It's not symmetrical, there's a lot more money (and potential to grift) hyping things up as miracle machines.
In contrast, most of the pessimists don't have a discernible profit motive.
Well yes the corporate accelerationists are certainly pushing for it the most, shoehorning the tech into things it doesn't belong in to see if they can somehow come up on top, which that in turn makes lots of people resentful towards it in a reactionary way.
You have artists who've lost work due to diffusion models, teachers who can't assign homework essays anymore, people who hate Microsoft Copilot, just anyone not wanting to be replaced by a bot or being forced to use the tech to avoid being outcompeted, people set in their ways who don't want change or imagine it being destructive, etc. It's a large crowd that one can appeal to for personal gain, politics 101. Anyone with half believable credentials can go on a talk show and say the things people want to hear, maybe sell a book or two afterwards.
Are today's models on the brink of some exponential self perpetuating shot towards superintelligence? Obviously not. Are they overhyped glorified lookup tables? Also no. Are there problems? Definitely. But I don't think it's entirely fair to dismiss a tech based on someone misappropriating it in monopolistic endeavours instead of directing dismissal towards those people themselves.
Like, similar to how Elon's douchebaggery has tainted EVs for lots of people for no practical reason, the same has Altman's done for LLMs.
LLMs do not reason. Not hard to understand.
Idk, before this people from your camp were saying LLMs can't even understand anything. Always moving the goalposts. Then it'll be they can't feel or can't something else just to be pointlessly contrarian. Anyway, wrong idea.
There have been enough cases of models providing novel results that it's clear that whatever human trait they supposedly lack they don't really need. A car does not need legs, it does things differently. Having legs would even be a major detriment and would hold it back from achieving its top performance.
That's what those brain simulating projects are conceptually btw: cars with legs or planes with flapping wings. That's why they all fail, the approach makes no sense.
This will be the exact same argument in 20 years when we’ve got examples of robots that some fraction of people claim are conscious.
If LLMs could reason, they would flourish in barely understood topics; they don't. They repeat after what humans already said over and over again all across the training data. They are a parrot, it's really not that hard to understand.
> They are a parrot
Those are some mighty parrots there, if they managed to get gold at IMO, IoI, and so on...
Well understood topics... what's so hard to understand?
>They repeat after what humans already said
>They are a parrot
Is it really much different from most people? The average Joe doesn't produce novel theories every day - he just rehashes what he's heard. Now the new goalpost seems to be that we can only say an LLM can "reason" if it matches Fields Medalists.
> Is it really much different from most people? The average Joe doesn't produce novel theories every day
You've presented a false choice.
However, the average Joe does indeed produce unique and novel thoughts every day. If that were not the case he would be brain dead. Each decision - wearing blue or red today - every tiny thought, action, feeling, indecision, crisis, or change of heart: these are just as important.
The jury may be out on how to judge what 'thought' actually is. However, what it is not is perhaps easier to perceive. My digital thermometer does not think when it tells me the temperature.
My paper and pen version of the latest LLM (quite a large bit of paper and certainly a lot of ink I might add) also does not think.
I am surprised so many in the HN community have so quickly taken to assuming as fact that LLM's think or reason. Even anthropomorphising LLM's to this end.
For a group inclined to quickly calling out 'God of the gaps' they have quite quickly invented their very own 'emergence'.
What is "novel results"? A random UUID generator also gives "novel result", every time.
Even if we're to humor the "novel" part, have they actually come up with anything truly novel? New physics? New proofs of hard math problems that didn't exist before?
Yes, exactly. There are other papers, but Google proved it most definitively imo [0], an LLM now holds the state of the art for the lowest bound on a very specific graph problem.
[0] https://research.google/blog/ai-as-a-research-partner-advanc...
That's not an LLM. AlphaEvolve is a variant of genetic search for program synthesis. Very different from the chat bot that answers questions about ingrown toenails.
This is proof of verifiable logic. Computers can not think so calling it proof of thought misrepresents what's actually happening.
I agree that "proof of thought" is a misleading name, but this whole "computers can't think" thing is making LLM skepticism seem very unscientific. There is no universally agreed upon objective definition of what it means to be able to "think" or how you would measure such a thing. The definition that these types of positions seem to rely upon is "a thing that only humans can do", which is obviously a circular one that isn't useful.
If you believe computers can think then you must be able to explain why a chain of dominoes is also thinking when I convert an LLM from transistor relay switches into the domino equivalent. If you don't fall for the marketing hype & study both the philosophical & mathematical literature on computation then it is obvious that computers (or any mechanical gadget for that matter) can not qualify for any reasonable definition of "thinking" unless you agree that all functionally equivalent manifestations of arithmetic must be considered "thinking", including cascading dominoes that correspond to the arithmetic operations in an LLM.
>If you believe computers can think then you must be able to explain why a chain of dominoes is also thinking when I convert an LLM from transistor relay switches into the domino equivalent.
Sure, but if you assume that physical reality can be simulated by a Turing machine, then (computational practicality aside) one could do the same thing with a human brain.
Unless you buy into some notion of magical thinking as pertains to human consciousness.
No magic is necessary to understand that carbon & silicon are not equivalent. The burden of proof is on those who think silicon can be a substitute for carbon & all that it entails. I don't buy into magical thinking like Turing machines being physically realizable b/c I have studied enough math & computer science to not be confused by abstractions & their physical realizations.
The proof immediately follows from the ability of silicon systems in principle to model carbon ones with arbitrary precision.
I recently wrote a simulation of water molecules & got really confused when the keyboard started getting water condensation on it. I concluded that simulating water was equivalent to manifesting it in reality & immediately stopped the simulation b/c I didn't want to short-circuit the CPU.
And your definition of thinking is?
Not arithmetic or boolean algebra. What's your definition?
I'm not the one making obtuse claims and desperately trying to trigger reactions.
> Not arithmetic or boolean algebra.
That isn’t a definition or even a coherent attempt.
For starters, what kind of cognition or computation can’t be implemented with either logic or arithmetic?
What is or is not “cognition” is going to be a higher level property than what basic universally capable substrate is used. Given such substrates can easily simulate each other, be substituted for each other.
Even digital and analog systems can be used to implement each other to arbitrary accuracy.
Cognition is a higher level concern.
> this whole "computers can't think" thing is making LLM skepticism seem very unscientific.
It's just shorthand for "that's an extraordinary claim and nobody has provided any remotely extraordinary evidence to support it."
Lots of people consider company valuations evidence of a singularity right around the corner but it requires a very specific kind of mindset to buy into that as "proof" of anything other than very compelling hype by people who have turned financial scams into an art form.
Do you understand human thinking well enough to determine what can think and what can't? We have next to no idea how an organic brain works.
I understand computers, software, & the theory of computation well enough to know that there is no algorithm or even a theoretical algorithmic construction that can be considered thought. Unless you are willing to concede that thinking is nothing more than any number of models equivalent to a Turing machine, e.g. lambda calculus, Post systems, context aware grammars, carefully laid out dominoes, permutations of bit strings, etc. then you must admit that computers are not thinking. If you believe computers are thinking then you must also admit dominoes are thinking when falling in a cascading chain.
You are confusing primitives with things built with those primitives.
Complex phenomena emerge from interactions of things that don’t exhibit that phenomena all the time.
Atoms can’t think. In no sense can you find any thinking in an atom.
They are no different from dominos in that respect.
You can pile atoms to the moon without seeing any thinking.
Yet they can still be arranged so they do think.
Sure, sufficiently advanced dominoes.
We're already at the point where LLMs can beat the Turing test. If we define thinking as something only humans can do, then we can't decide if anyone is thinking at all just by talking to them through text, because we can't tell if they're human any more.
Animals can also think. It's not restricted to one specific type of primate physiology. But it seems like you think you're nothing more than falling cascades of dominoes in which case we don't really have much to discuss. Your metaphysical assumptions are fundamentally at odds with what I consider a reasonable stance on computation & reality.
> Animals can also think
What are you saying?
Are you saying you have a clear definition for thinking, and you can demonstrate that animals pass that definition?
Then share the definition.
Or are you simply defining thinking as a common property of humans and animals, using animals and human behavior as exemplars?
A useful definition for focusing inquiry. But it does not clarify or constrain what else might or might not be enabled to think.
Or are you defining thinking as an inherent property of animals and humans that other things cannot have because they are not animals or humans?
Fine, but that's an exercise in naming. Something we are all free to do however we want. It has no explanatory power.
Hard to argue with religious beliefs.
The void created by modernity must be filled somehow so it might as well be the great programmer in the great beyond. Just as childish as religions of pre-modernity but very useful if you're a technocrat building data centers & trying to pump the valuations of companies that can benefit from all that buildout w/ promises of forthcoming utopias approximating the palace of the great programmer in the great beyond. Just a few more nuclear power plants & a few more GPU clusters is all that's needed.
Ideally it is filled with curiosity and continued exploration.
Not manufactured stop gaps or generic cynicism.
There is no reason more GPUs can’t contribute to further understanding, as one of many tools that have already assisted with relevant questions and problems.
Opt out of serious inquiry, no excuse needed, if you wish. Reframing others efforts is not necessary to do that.
I recommend taking your own advice on that one, specifically the part about reframing efforts of strangers.
I take your views to be exactly as you state them.
Then there is no need to reframe anything so you might as well get to the actual disagreement you have w/ them.
I think you are misjudging which side of the religion/non-religion divide you are on.
The people who think enough nuclear reactors & silicon chips w/ the right incantation of 0s & 1s will deliver them to an abundant utopia don't leave much room in their ideology for any doubt about the eschatological objective of their quest & mission in life. These people are definitely not on some kind of religious side of a religious vs non-religious divide.
Sure thing buddy, I'm the confused one in this entire millenarian frenzy.