The post says this in other words: in Lean, Rocq, or any other theorem prover, you get a formally verified proof, but you do NOT get a formally verified theorem statement.
So, even if the proof is correct, you need to determine if the theorem is what you want. Making that determination requires expertise. Since you cannot "run the theorem", you cannot vibe-code your way through it. E.g., there is no equivalent of "web app seems to be working!" You have to actually understand what the theorems are saying in a deep way.
Even then this seems much more promising to me compared to other areas. Writing theorem statements is much, much easier than coming up with proofs, so it's not a big deal if a human has to do that. And once that's done, getting a correct proof out of an LLM/AI model can be done fully automatically (assuming you do get a proof out of it, though!)
> Even then this seems much more promising to me compared to other areas. Writing theorem statements is much, much easier than coming up with proofs, so it's not a big deal if a human has to do that.
Not at all. Theorem crafting is hard. What's easy is standing up and proving a mathematical strawman which may or may not have any relation to the original problem.
I'm not sure this is true. Encoding theorems in dependent types takes a lot of expertise.
Even without the Lean technical details, a lot of math theorems just don't mean anything to most people. For example, I have no idea what the Navier-Stokes theorem is saying. So, I would not be able to tell you if a Lean encoding of the theorem is correct. (Unless, of course, it is trivially broken, such as assuming False.)
There is no "Navier-Stokes theorem". There is a famous class of open problems whether Navier-Stokes equations are well-posed (have solutions that don't explode in finite time) for various initial data, but that type of question is completely irrelevant for any practical purposes. I do share the feeling though. As a non-expert, I have no idea what the existing, allegedly practically relevant, formalizations of distributed algorithms actually guarantee.
Thanks. As I said, I have no idea. :)
Yeah, if you have a formal proof of something but you don't know exactly what, then you might as well vibe-code the system itself. I've been wondering if it would be possible to formally define a property (like deadlock freedom or linearizability) and then vibe-code the spec (and associated implementation).
I don't know if that's possible but it seems like that's where the value would be.
It is but we're not there yet. You can use a refinement style system like Verus in which your spec is a very high level statement of system behavior, like "this system provides the properties that paxos is supposed to guarantee"(1), and then be guaranteed that the things that implement it are preserving that spec.
But this is still double black diamond expert territory - it remains a challenging thing to identify ways to state and refine those properties to enable the proof. But it's getting easier by the year and the LLMs are getting more useful as part of it.
https://github.com/verus-lang/verus
(1) You would have to state the consistency and conditional liveness properties directly, of course, as Verus doesn't understand "what paxos does". That would look like TLA+-style statements, "at some time in the future, a command submitted now will be executed, etc."
In the case of Lean, propositions are encoded as dependent types, and the user typically has to write that encoding themselves and then make use of, e.g., tactics to derive a term of said type.
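For a rough illustration (a minimal sketch in Lean 4, not drawn from any particular project): the proposition is itself a type, and the tactic block constructs a term inhabiting that type.

    -- The statement is a type; the proof is a term of that type.
    theorem and_comm' (p q : Prop) : p ∧ q → q ∧ p := by
      intro h            -- assume a proof of `p ∧ q`
      exact ⟨h.2, h.1⟩   -- build a proof of `q ∧ p` from its two components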
Writing a statement you don't understand then later getting a proof from an LLM doesn't seem all that useless to me; in my mind, it could still be useful as exploration. Worst case scenario: you encoded a tautology and the LLM gave you a trivial proof. Best case scenario: the proposition ends up being a lemma for something you want to prove.
I do think there is a kernel of truth in what you've stated: if the user does not actually understand the statement of a proposition, then the proof is not very useful to them, since they don't know what the statement's truth implies. As someone who used to do mathematics, I still find immense value in vibe-coding away mundane computations, similar to what Lean's `simp` tactic already does, but much broader.
The worst case is that you vibe code a theorem that reads:
False => P
Then you vibe code a proof of this theorem. Then you get excited that you’ve proven P.
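In Lean 4 that failure mode is only a couple of lines; the sketch below type-checks and the "proof" is accepted, yet it tells you nothing at all about P:

    -- A vacuously true "theorem": it holds for any P, including false ones.
    theorem looks_impressive (P : Prop) : False → P := by
      intro h
      exact h.elim   -- from a proof of False, anything follows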
Some of the X discussion that prompted the OP was quite close to this. There are screenshots on X of Lean code that doesn’t compile, with Lean being blamed.
Right, the risk is encoding a vacuously true statement. I consider "false implies Q" and tautologies to be vacuously true statements, since both are truths that fail to convey any meaningful information.
In any case, the worst case scenario is having a vacuously true statement. I picked tautologies as an example since that and statements about the empty set are the first things that come into my mind.
> Third, you need to decide how far “down the stack” you want to go. That is to say, the software you want to verify operates over some kind of more complex system, for instance, maybe it’s C code which gets compiled down to X86 and runs on a particular chip, or maybe it’s a controller for a nuclear reactor and part of the system is the actual physical dynamics of the reactor. Do you really want your proof to involve specifying the semantics of the C compiler and the chip, or the way that the temperature and other variables fluctuate in the reactor?
I can appreciate what he's getting at, but my utopian vision for the future is that we won't need to reinvent the wheel like this each time we want verified software! E.g. for high-consequence systems, the hard part of compiler correctness is already handled by the efforts of CompCert, and SystemVerilog assertions for the design guarantees of processors are becoming more commonplace.
Yeah, but the problem is that programming languages and compilers change all the time, making it hard to maintain a formal model of them. Exceptions exist (CompCert C and WebAssembly are two good examples), but, for example, the semantics of raw pointers in Rust are intentionally under-defined because the compiler writers want to keep changing them.
Rust is still young, so it's taking the opportunity to thoroughly explore its options. As it grows older, it will solidify its decisions, even if it means committing to something that will be broken in the future. I look forward to the day that it becomes legacy (with all love)!
Interesting article, thanks. There is indeed a "semantic gap". However, there is also a practical solution: bidirectional LLM translation. You can verify the formal specification by back-translating it to natural language with another LLM session, allowing human review at the intent level rather than requiring expertise in e.g. Event-B syntax (see https://rochuskeller.substack.com/p/why-rust-solves-a-proble...). This addresses the concern about "mis-defining concepts" without requiring the human to be a formal methods expert. The human can review intent and invariants in natural language, not proof obligations. The AI handles the mathematical tedium while the human focuses on domain correctness, which is exactly where human expertise belongs.
> there is also a practical solution: bidirectional LLM translation
It's a solution only if the translation is proven correct. If not, you're in the same place as you started.
why do we invent these formal languages except to be more semantically precise than natural language? What does one gain besides familiarity by translation back into a more ambiguous language?
Mis-defining concepts can be extremely subtle. If you look at the allsome quantifier https://dwheeler.com/essays/allsome.html you'll see that these problems predate AI, and I struggle to see how natural language is going to help in cases like the "All Martians" case, where the confusion may be over whether Martians exist or not. Something relatively implicit.
We build pretty complex systems only based on "natural language" specifications. I think you are conflating specification ambiguity with verification accessibility.
> What does one gain besides familiarity by translation back into a more ambiguous language?
You gain intent verification. Formal languages are precise about implementation, but they are often opaque about intent. A formal specification can be "precisely wrong". E.g. you can write a perfectly precise Event-B spec that says "When the pedestrian button is pressed, the traffic light turns Green for cars"; the formalism is unambiguous, the logic is sound, the proof holds, but the intent is fatally flawed. Translating this back to natural language ("The system ensures that pressing the button turns the car light green") allows a human to instantly spot the error.
> All Martians are green
Modern LLMs are actually excellent at explicating these edge cases during back-translation if prompted correctly. If the formal spec allows vacuous truth, the back-translation agent can be instructed to explicitly flag existential assumptions. E.g. "For every Martian (assuming at least one exists), the color is Green", or "If there are no Martians, this rule is automatically satisfied". You are not translating back to casual speech; you are translating back to structured, explicit natural language that highlights exactly these kinds of edge cases.
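As a sketch of how the two readings diverge in a prover (all the names below are invented for the example), the vacuous reading is provable while the "allsome" reading additionally demands a witness:

    -- A toy universe where the "Martian" predicate happens to be empty.
    def Martian (_ : Nat) : Prop := False
    def Green (n : Nat) : Prop := n = 0

    -- "All Martians are green" is provable vacuously: there are no Martians at all.
    theorem all_martians_green : ∀ n, Martian n → Green n :=
      fun _ h => False.elim h

    -- The "allsome" reading additionally demands a witness, which does not exist here:
    -- theorem some_martian : ∃ n, Martian n := ...   -- unprovable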
Maybe it can be done, but I struggle to believe adding in that branch for every forall quantifier (which may be plentiful in a proof) is going to help make a proof more understandable. Rather I feel like it'll just balloon the number of words necessary to explain the proof. Feels like it's going to fall on the bad side of verbosity as the sibling comment said.
I think there is a misunderstanding about what is being back-translated.
We don't back-translate the proof steps (the thousands of intermediate logical derivations). That would indeed be verbose and useless.
We back-translate the specification: the Invariants, Guards, and Events.
For a traffic light system, we don't need the LLM to explain the 50 steps of predicate logic that prove inv3 holds. We just need it to translate inv3 itself:
Formal: inv3: light_NS = Green ⇒ light_EW = Red
Back-translation: 'Invariant: If the North-South light is Green, the East-West light MUST be Red.'
This isn't verbose; it's the exact concise summary of the system's safety rules. The 'verbosity' of handling edge cases (like the 'Allsome' example) only applies when the specification itself relies on subtle edge cases, in which case being verbose is exactly what you want to prevent a hidden bug. Definitions are built up layer upon layer like an onion too, with each step adding its own invariants and reducing the problem space.
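For instance, a Lean-style rendering of such an invariant might look like the sketch below (the Light type and the field names are invented for illustration; a real Event-B model would of course differ):

    inductive Light
      | Red
      | Amber
      | Green

    structure Junction where
      light_NS : Light
      light_EW : Light

    -- inv3: if the North-South light is Green, the East-West light must be Red.
    def inv3 (s : Junction) : Prop :=
      s.light_NS = Light.Green → s.light_EW = Light.Red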
I just feel like the street light example is an extremely small free-standing example. Most things that I feel are worth the effort of proving end up huge. Formal verification languages have forever been denigrated for being overly rigid and too verbose. I feel like translations into natural language can only increase that if they are accurate.
One thing I wish is that this whole discussion was less intertwined with AI. The semantic gap existed before AI, and will be run into again without AI. People have been accidentally proving the wrong thing true or false forever and will never stop, with or without AI help.
At the very least we can agree that the problem exists, and while I'm skeptical of natural language being anything but the problem we ran away from, at least you're trying something and exploring the problem space, and that can only be cheered.
My bet is that AI changes the economics of that verbosity, making it cheap to generate and check those 'huge' definitions layer by layer. The next four years will show.
I agree, if AI (or humans) have mistranslated a natural language statement to a formal statement, we should not rely on AI to correctly translate the formal statement back into natural language.
For many statements I expect it's not possible to retain the exact meaning of the formal-language sentence without the natural language becoming at least as complex, and if you don't retain meaning exactly then you're vulnerable to the kind of thing the article warns about.
> if AI (or humans) have mistranslated a natural language statement to a formal statement, we should not rely on AI to correctly translate the formal statement back into natural language.
Perhaps we must not rely on it and should find a way to make sure that it cannot fail, but I'd like to point out that these are two different problems, and it seems to me that the current crop of so-called AIs are pretty good at distilling excerpts. Perhaps that's the easier problem to solve?
> why do we invent these formal languages except to be more semantically precise than natural language
To be... more precise?
On a more serious note, I cannot recommend enough "Exactly: How Precision Engineers Created the Modern World" by Winchester. While the book talks mostly about precision in mechanical engineering, it made me appreciate _precision_ itself to a greater degree.
Rhetorical sentence? My point is that back-translation into natural language is translating into a less precise form. How is that going to help? No number of additional abstraction layers is going to solve human confusion.
Oh well, that flew over my head. You are right.
Some valid points, but I wish the authors had developed them more.
On the semantic gap between the original software and its representation in the ITP, program extraction like in Rocq probably deserves some discussion: there, the software is written natively in the ITP and you have to prove the extraction itself sound. For example, MetaRocq did this for Rocq.
For the how far down the stack problem, there are some efforts from https://deepspec.org/, but it's inherently a difficult problem and often gets less love than the lab environment projects.
Reminds me of Z notation, good times... Sure, it's nice to have a harness, or a series of tests, but they only ensure that what's done within it is done within it. If it's poorly done, then the result is still within it, but it's not magically correct.
I've had an interesting comparative experience so far using Alloy, which is a 'simpler' formal method approach that you can think of as smarter exhaustive fuzzing for your properties of interest.
We have been building GFQL, an open source graph query language that can run at the compute tier separate from whatever existing storage system (splunk, SQL, cs, ..), making graphs easy for interactive notebooks/dashboards/analytics/pipelines/agents without needing graph DBs, even at billion scale. Part of the trick is that it is a bit unusual inside too, because we are making Cypher-style queries run with columnar operations (purely vectorized for both CPU + GPU mode via a multi-target dataframe abstraction layer), but that is taking a few innovations in the interpreter algorithm, so we worry about subtle semantic bugs. Most of our bugs can surface with small (graph, query) combos, so this sounded perfect for Alloy's small-scope hypothesis!
So we are running a natural experiment on which does more bug finding:
- asking Claude Code to analyze new code or recently found issues and using that for unit test amplification
- asking Claude Code to use the Alloy model checker to prove out our space of (query, graph) for key methods
- asking Claude Code to port a large Cypher conformance test suite
So far, Claude Code as a test amplifier is doing so much better than the others that it's our default preference. The real gem is when we make it do a '5 whys root cause analysis' on why some bug got through and then it does test amplification around several categories of potential error.
We found the conformance suite to be 'class imbalanced', focusing on a couple of features we are adding next, so the jury is still out on that one. And finally... Alloy hasn't really found anything. We suspect we need to change how we use Alloy to be more reflective of the kinds of bugs we find, but 0 was disappointing.
If anyone is into this kind of thing, happy to discuss/collab!
A formal specification language is a programming language that we don't know how to compile.
If we can use AI to automatically implement a formal spec, then that formal specification language has just become a programming language.
> A formal specification language is a programming language that we don't know how to compile.
Not really, on both counts.
Firstly, they're not really programming languages in the usual sense, in that they don't describe the sequence of instructions that the computer must follow. Functional programming languages are considered 'declarative', but they're still explicit about the computational work to be done. A formal spec doesn't do this; it just expresses the intended constraints on the correspondence between input and output (very roughly speaking).
Secondly, regarding the "we don't know how to compile it" aspect: 'constraint programming' and SMT solvers essentially do this, although they're not a practical way to build most software.
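For example, a specification of sorting can be stated purely as a constraint on the input/output relation, with no algorithm in sight (a minimal Lean 4 sketch; `mySort` is just a placeholder name):

    -- A sorting specification: it relates input to output but prescribes no algorithm.
    def SortSpec (mySort : List Nat → List Nat) : Prop :=
      ∀ xs : List Nat,
        (mySort xs).Pairwise (· ≤ ·) ∧                   -- output is non-decreasing
        ∀ a : Nat, (mySort xs).count a = xs.count a      -- same elements, same counts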
It is more general than that: A programming language is a formal specification language that we know how to compile.
There are plenty of formal specifications that cannot be compiled, not even by an AI. If you use AI, how do you make sure that the AI compiler compiles correctly?
> A programming language is a formal specification language that we know how to compile
Plenty of real programming languages are ambiguous in ways that surely disqualify them as formal specification languages. A trivial example in C: decrementing an unsigned int variable that holds 0. The subtraction is guaranteed to wrap around, but the value you get depends on the platform, per the C standard.
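A minimal C sketch of that example; the wraparound itself is defined behavior, but the resulting value depends on the platform's width of unsigned int:

    #include <stdio.h>
    #include <limits.h>

    int main(void) {
        unsigned int x = 0;
        x--;  /* well-defined wraparound: x becomes UINT_MAX */
        /* UINT_MAX itself is implementation-defined, so the printed value varies by platform */
        printf("x = %u, UINT_MAX = %u\n", x, UINT_MAX);
        return 0;
    }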
> There are plenty of formal specifications that cannot be compiled, even not by an AI. If you use AI, how do you make sure that the AI compiler compiles correctly?
By proving that the code satisfies the formal spec. Getting from a formal spec to a program (in an imperative programming language, say) can be broken down into several stages of 'refinement'. A snippet from [0] :
> properties that are proved at the abstract level are maintained through refinement, hence are guaranteed to be satisfied also by later refinements.
[0] https://www.southampton.ac.uk/~tsh2n14/publications/chapters...
A formal specification language doesn't have to be deterministic.
And yes, if you can prove that the implementation is correct with respect to the formal spec, you are good, and it doesn't really matter how you got the implementation.
Refinement is one approach, personally, I just do interactive theorem proving.
Except deciding whether an implementation exists or not is itself not a tractable problem in the general case.
Slight nitpick, but isn't Agda based on the MLTT family instead of the CoC family of languages?
A fundamental problem is that program verification is intractable in the computational complexity sense. The question of whether a program satisfies a spec or not is called in the literature the model checking problem (not to be confused with model checkers, which are various algorithms for automatically solving some model checking problems). In the worst case, there is no faster way to determine whether a program satisfies a spec or not than explicitly testing each of its reachable states.
The question is, of course, what portion of programs are much easier than the worst case (to the point of making verification tractable). That is not known, but the results are not necessarily encouraging. Programs that have been deductively verified in proof assistants were not only very small, and not only written with proofs in mind, but were also restricted to use simple and less efficient algorithms to make the proofs doable. They tend to require between 10x and 1000x lines of proof per line of code.
(an old blog post of mine links to some important results: https://pron.github.io/posts/correctness-and-complexity)
There is a belief that programs that people write and more-or-less work should be tractably provable, as that's what allowed writing them in the first place (i.e. the authors must have had some vague proof of correctness in mind when writing them). I don't think this argument is true (https://pron.github.io/posts/people-dont-write-programs), and we use formal methods precisely when we want to close the gap between working more-or-less and definitely always working.
But even if verifying some program is tractable, it could still take a long time between iterations. Because it's reasonable that it would take an LLM no less than a month to prove a correct program, there's no point in stopping it before. So a "practical" approach could be: write a program, try proving it for a month, and if you haven't finished in a month, try again. That, of course, could mean waiting, say, six months before deciding whether or not it's likely to ultimately succeed. Nevertheless, I expect there will be cases where writing the program and the proof would have taken a team of humans 800 years, and an LLM could do it in 80.
All formal methods came from natural ones. Somewhere in the meta-language stack, without natural methods, any system is stuck within an inescapable well. Bad formalisms are stuck and wrong. Natural methods are bad, but not stuck. They can arrive at new formalisms.
For the case of Propositional Logic, ChatGPT reflects the current epistemological crisis. When asked for help on a question, it could not properly apply the Law of the Excluded Middle [1].
1. https://chatgpt.com/share/696b7f8a-9760-8006-a1b5-89ffd7c5d2...
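For reference, a minimal Lean 4 sketch of the distinction: the excluded middle is not provable in the constructive core and is instead supplied classically, via an axiom-backed lemma.

    -- Constructively, `p ∨ ¬p` is not derivable for an arbitrary `p`;
    -- Lean provides it classically through `Classical.em` (backed by `Classical.choice`).
    example (p : Prop) : p ∨ ¬p := Classical.em p

    -- Proof by contradiction is likewise classical:
    example (p : Prop) (h : ¬¬p) : p := Classical.byContradiction h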
I would love for this to turn out to be some internal constraint where the LLM cannot 'reason' about LEM and will always go to an understanding based in constructive logic. However, I am more ready to accept that LLMs aren't actually 'reasoning' about anything and it's an inherent flaw in how we talk about the algorithms as though they were actually thinking 'minds' instead of very fancy syntax completion machines.
The problem is that both constructive logic and "normal" logic are part of the training data. You might be able to say "using constructive logic, prove X". But even that depends on none of the non-constructive training data "leaking" into the part of the model that it uses for answering such a query. I don't think LLMs have hard partitions like that, so you may not get a purely constructive proof even if that's what you ask for. Worse, the non-constructive part may be not obvious.