• Jackson__ 2 hours ago

    An API-only model, yet it only compares against open models in its benchmark image.

    Of course it'd be a complete embarrassment to see how hard it gets trounced by GPT-4o and Claude 3.5, but that's par for the course if you don't want to release model weights, at least in my opinion.

    • Jackson__ 2 hours ago

      I'd also like to point out that they omit Qwen2.5 14B from the benchmark because it doesn't fit their narrative (MMLU Pro score of 63.7 [0]). This kind of listing-only-models-you-beat feels extremely shady to me.

      [0] https://qwenlm.github.io/blog/qwen2.5/

      • GaggiX 2 hours ago

        Yes, I agree. For these small models, being closed source is wasted potential; they can only be used effectively if they are open.

        EDIT: HN is rate-limiting me, so I will reply here: in my opinion, 1B and 3B models truly shine on edge devices; if they're not running there, it's not worth the effort, since you can already get much better models dirt cheap through an API.

        • k__ 2 hours ago

          While I'm all for open models: why can't the small models be used effectively?

          Wouldn't they drastically lower costs compared to big models?

          • Bilal_io 2 hours ago

            I think what the parent means is that small models are most useful running locally on mobile, IoT devices, etc., so it defeats the purpose to have to call an API.

            • derefr an hour ago

              Big models take up more VRAM just to have the weights sitting around hot in memory, yes. But running two concurrent inferences on the same hot model doesn't require two full copies of the model in memory. You only need two full copies of the model's "state" (the vector that serves as the output of layer N and the input of layer N+1, plus the pool of active low-cardinality matrix temporaries used to batchwise-compute that vector).

              It's just like how spawning two copies of the same program doesn't require two copies of the program's text and data sections sitting in your physical RAM (those get mmap'ed to the same shared physical pages); it only requires that each process have its own copy of the program's writable globals (the bss section), plus its own stack and heap.
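
              (A toy back-of-the-envelope sketch in Python; the layer counts, dimensions, and byte sizes below are made up but plausible, just to show the ratio between the shared weights and one request's state:)

                  def weight_bytes(n_layers=32, d_model=4096, bytes_per=2):
                      # rough transformer parameter count: ~12 * d_model^2 per layer, fp16
                      return n_layers * 12 * d_model**2 * bytes_per

                  def kv_cache_bytes(n_layers=32, n_tokens=2048, n_kv_heads=8, d_head=128, bytes_per=2):
                      # per-request state: K and V, per layer, per token, per KV head
                      return 2 * n_layers * n_tokens * n_kv_heads * d_head * bytes_per

                  print(weight_bytes() / 2**30)    # ~12.0 GiB of weights, paid once and shared
                  print(kv_cache_bytes() / 2**30)  # ~0.25 GiB of state per concurrent request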

              Which means there are economies of scale here. It is increasingly less expensive (in OpEx-per-inference-call terms) to run larger models, as your call concurrency goes up. Which doesn't matter to individuals just doing one thing at a time; but it does matter to Inference-as-a-Service providers, as they can arbitrarily "pack" many concurrent inference requests from many users, onto the nodes of their GPU cluster, to optimize OpEx-per-inference-call.

              This is the whole reason Inference-aaS providers have high valuations: these economies of scale make Inference-aaS a good business model. The same query, run in some inference cloud rather than on your device, will always achieve a higher-quality result for the same marginal cost [in watts per FLOP, and in wall-clock time], and/or a same-quality result for a lower marginal cost.

              Further, one major difference between CPU processes and model inference on a GPU, is that each inference step of a model is always computing an entirely-new state; and so compute (which you can think of as "number of compute cores reserved" x "amount of time they're reserved") scales in proportion to the state size. And, in fact, with current Transformer-architecture models, compute scales quadratically with state size.

              For both of these reasons, you want to design models to minimize 1. absolute state size overhead, and 2. state size growth in proportion to input size.

              The desire to minimize absolute state-size overhead is why you see Inference-as-a-Service providers training such large versions of their models (405B-class models, etc.) The hosted Inference-aaS providers aren't just attempting to make their models "smarter"; they're also attempting to trade off "state size" for "model size." (If you're familiar with information theory: they're attempting to make a "smart compressor" that minimizes the message-length of the compressed message [i.e. the state] by increasing the information embedded in the compressor itself [i.e. the model.]) And this seems to work! These bigger models can do more with less state, thereby allowing many more "cheap" inferences to run on single nodes.

              The particular newly released model under discussion in this comments section also has much slower state-size (and so compute) growth in proportion to its input size. Which means that there's even more of an economy of scale in running nodes with the larger versions of this model; and therefore much less of a reason to care about smaller versions of this model.

              • zozbot234 28 minutes ago

                > It is increasingly less expensive (in OpEx-per-inference-call terms) to run larger models, as your call concurrency goes up. Which doesn't matter to individuals just doing one thing at a time; but it does matter to Inference-as-a-Service providers, as they can arbitrarily "pack" many concurrent inference requests from many users

                In a way it also matters to individuals, because it allows them to run more capable models with a limited amount of system RAM. Yes, fetching model parameters from mass storage during inference is going to be dog slow (while NVMe transfer bandwidth is getting up there, it's not yet comparable to RAM), but that only matters if you insist on getting your answer interactively, in real time. With a local model, it's trivial to make LLM inference a batch task. Some LLM inference frameworks can even save checkpoints for a single inference to disk, to be cleanly resumed later.

                • liuliu an hour ago

                  > they're attempting to make a "smart compressor" that minimizes the message-length of the compressed message [i.e. the state] by increasing the information embedded in the compressor itself [i.e. the model.]) And this seems to work! These bigger models can do more with less state, thereby allowing many more "cheap" inferences to run on single nodes.

                  Not sure I follow. CoT and going over long state lengths is a relatively new phenomenon, and I doubt that minimizing the length of the CoT was an explicit goal when training the model.

                  The only thing probably relevant to this comment is the use of grouped-query attention? That reduces the size of the KV cache by a factor of 4 to 8, depending on your grouping strategy. But I am unsure there is a clear trade-off between model size and grouped-query size, given that, naively, smaller KV cache == smaller model size.
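
                  (Rough numbers, assuming a hypothetical 32-layer model with head dimension 128, an 8k context, and an fp16 cache:)

                      def kv_cache_bytes(n_layers, n_kv_heads, d_head, n_tokens, bytes_per=2):
                          # K and V per layer, per token, per KV head (fp16)
                          return 2 * n_layers * n_kv_heads * d_head * n_tokens * bytes_per

                      mha   = kv_cache_bytes(32, 32, 128, 8192)  # one KV head per query head
                      gqa_8 = kv_cache_bytes(32, 8, 128, 8192)   # 32 query heads grouped onto 8 KV heads
                      gqa_4 = kv_cache_bytes(32, 4, 128, 8192)   # ...or onto 4 KV heads
                      print(mha // 2**20, gqa_8 // 2**20, gqa_4 // 2**20)  # MiB: the grouped caches are 4x and 8x smaller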

                • echelon 2 hours ago

                  These aren't the "small" models I'm thinking of.

                  I want an LLM, STT, or TTS model to run efficiently on a Raspberry Pi with no GPU and no network.

                  There is huge opportunity for LLM-based toys, tools, sensors, and the like. But they need to work sans internet.

              • zozbot234 2 hours ago

                There are small proprietary models such as Claude Haiku and GPT 4o-mini.

                • GaggiX 2 hours ago

                  They are way bigger than 1B or 3B.

            • zozbot234 3 hours ago

                > How many r's in "strawberry"? Think step by step!
              
                What a nice question! Well, let's just count:
                1. S
                2. T
                3. R!
              
                So, we have our answer: there are 3 R's in "strawberry".
              
              Totally checks out!

              • alach11 3 hours ago

                This is one of the least interesting questions to ask LLMs. I wish it wasn't so popular.

                • roywiggins an hour ago

                  I also think it's kind of a silly example: smart people can be functionally illiterate, after all. It doesn't tell you that much.

                  My favorite LLM stumper is asking them to solve the farmer/wolf/chicken/grain river-crossing puzzle but with only a chicken. They usually either insist on pointless extra trips or hallucinate a wolf or grain. Liquid somehow does both and also loses track of what's where.

                  > The farmer can take the chicken across the river first. Then, he can go back to the original side and bring the boat back. He can then take the chicken back to the original side and leave it there. Next, he can take a bag of grain across the river. He can then go back to the original side and bring the chicken across the river. Finally, he can go back to the original side one last time and bring the chicken across the river.

                  One generation ended like this:

                  > the farmer and the chicken are both on the other side of the river, and the chicken is safe from the grain.

                  • WhitneyLand 2 hours ago

                    I find it fascinating. To me it grounds expectations and motivates discussions about the jagged boundaries of LLM capabilities.

                    • hn_throwaway_99 an hour ago

                      Why do you find it fascinating? I have the most rudimentary understanding of LLMs and it seems to me the least fascinating thing about LLM limitations.

                      That is, all LLMs look at language as a series of opaque, independent tokens, e.g. strawberry might be a single token (say <5678>), or probably two (e.g. straw and berry, say <123><789>). But in no way will it represent it like we will, with letters. So if you ask it "how many R's are in strawberry", it fundamentally can't do any "counting"; it just basically has to rely on whether that question (or similar questions about related words) has been asked before, so it can predict the next token in its output correctly. I suppose with enough data an LLM could learn to associate character counts with tokens (e.g. with the right training set it could learn metadata about the tokens).

                      My point being that with only the most basic understanding of how LLMs function, this "character counting" limitation seems blatantly obvious.
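
                      (If you're curious, you can see the split directly with OpenAI's tiktoken library; the exact token IDs and splits depend on the encoding, so treat the output as illustrative:)

                          import tiktoken  # pip install tiktoken

                          enc = tiktoken.get_encoding("cl100k_base")  # one of OpenAI's public encodings
                          ids = enc.encode("strawberry")
                          print(ids)                             # a short list of integer token IDs
                          print([enc.decode([i]) for i in ids])  # the chunks, e.g. something like ['str', 'aw', 'berry']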

                      • WhitneyLand an hour ago

                        I don’t think we’ve yet come to the point where how an LLM goes, end to end, from prompt to output is blatantly obvious.

                        LLMs operate with language at multiple levels of abstraction and tokens are not the only way to have character level knowledge.

                        For example, nothing excludes pre-training data from directly or indirectly encoding such knowledge.

                        And of course LLMs famously have emergent properties, for which there’s not yet a precise mechanism to illuminate the results.

                        We’re dealing with very complex systems that are still relatively poorly understood, and I believe the pool of concepts understood to the point of being blatantly obvious is still small.

                        • hn_throwaway_99 23 minutes ago

                          The fact that large parts of how LLMs work are poorly understood doesn't mean that we should brush the whole thing with "it's a complex mystery!"

                          We know exactly how LLMs tokenize their input - OpenAI even has a public tool you can test out, https://platform.openai.com/tokenizer .

                          When you say "For example, nothing excludes pre-training data from directly or indirectly encoding such knowledge." - yes, that's why I explicitly said "e.g. with the right training set it could learn metadata about the tokens".

                          But the point was still put perfectly by another commenter: "How many 1 bits are there in strawberry?" When nearly all humans can't answer that, we're very unsurprised; we see it immediately as a simple difference in how humans encode that string vs. how computers do it. We don't say "Well, the human mind is so complex and filled with emergent properties that the reason for this human limitation is a big mystery". And we also know that if we teach a human the encoding rules for characters (i.e. the bit pattern of each letter), they could answer this question, and similarly that's the analogous thing to training an LLM to learn metadata about the tokens.

                          Now, what I think is very interesting is why it's so hard to teach an LLM to say "I don't know" when asked to count characters. In my opinion, that's a much more interesting limitation that gets at some of the root, fundamental differences in how LLMs function.

                    • NoCoooode 2 hours ago

                      That they can't handle this sort of simple question speaks volumes about the entire approach.

                      I don't think generative AI will ever be able to reach AGI, and most people selling LLM today pretend it is AGI

                      • alach11 2 hours ago

                        All it speaks to is that tokenization is weird and introduces artifacts to LLM performance. Counting letters is a trivial task when you're staring at words on a screen. It's much harder when you're perceiving vectors based on parts of words. The fact that LLMs find certain things easier/harder than humans is completely unsurprising, and there are much more interesting benchmarks to use to compare one LLM to another.

                        • doctorpangloss 2 hours ago

                          Counting shit, like cells, peaks in signals, people, inventory, fingers, and votes, is hard, tedious and important to business and life, so I don’t know dude, it seems like a great benchmark to me. Countless posts wasted on denying this simple and obvious fact.

                          • hn_throwaway_99 an hour ago

                            Not being able to "do math" is an obvious, IMO uninteresting limitation of how LLMs inherently work, and the more advanced models have already figured out good solutions. E.g. while an LLM may stumble on "How many R's are in Strawberry", at this point any code model can easily correctly implement "write me a program in Python that counts the number of R's in strawberry".

                            GPT 3 would essentially always fail on higher math problems/arithmetic, but whenever I've used it for math GPT 4 has always delegated out to executing code where necessary.
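
                            (The program itself is of course a one-liner:)

                                word = "strawberry"
                                print(word.lower().count("r"))  # 3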

                            • Scarblac an hour ago

                              So how many 1 bits are in strawberry?

                              That's about the same kind of question, given that LLMs see tokens instead of characters, and we see characters instead of bits.
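
                              (For the record, assuming the usual ASCII/UTF-8 byte encoding of the word:)

                                  print(sum(bin(b).count("1") for b in "strawberry".encode("utf-8")))  # 42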

                              • BoorishBears an hour ago

                                It's like using a hammer to turn a screw and calling it useless.

                                To envision what a next-generation model bound by the same constraints should do: it would recognize that it can't count letters inside tokens, and use its code access to write code that solves the strawberry problem without being prompted to.

                                Asked to count cells, it'd be a model that could write and execute OpenCV tasks. Or, to go a step further, a multimodal model that can synthesize 10,000 variations of the target cell and autonomously fine-tune a model like YOLO on them.

                                I find that arguments which reduce LLMs to "it can't do the simple thing!!!" come from people unable to apply lateral thinking to how a task can be solved.

                              • jrflowers 2 hours ago

                                This is a good point. While LLMs being incapable of reliably doing a simple task that’s been doable by computers since the punch card days is an important consideration for anyone that might be thinking about using them for anything other than as a toy, this fact is uninteresting because of Reasons

                                • Nevermark 2 hours ago

                                  Try counting the number of your red retina cells that are firing while you look at a painting.

                                  Don’t need to be exact as firing is statistical, just give us a good average.

                                  Hard? You can’t count?

                                  Computers count pixels no problem. So weird you can’t.

                                  Dementia? Not an AGI? /h

                                  —-

                                  This is what is happening.

                                  Here are the “Reasons”.

                                  In your vision system, the raw information from individual retina signals is munged into a different representation before reaching a level where you have flexible processing.

                                  Likewise, in LLMs, letters are munged into tokens before LLMs “see” them.

                                  When they sometimes get that “simple” question right, it’s actually a bit of an amazing feat. Given how they are constructed.

                                  —-

                                  Now try counting R’s as you read at a normal rate, or listen to someone speak.

                                  You can’t do that either, during normal processing.

                                  When we add spelling to LLMs training examples, they will do it easily. Just as you learned to do it, only after special lessons, after you had already learned to listen and speak.

                                  Spelling is its own special practiced skill, in humans and LLMs.

                                  • jrflowers 2 hours ago

                                    > Try counting the number of your red retina cells that are firing while you look at a painting.

                                    This analogy makes sense because everybody could count their red retina cells until a couple years ago when the new painting paradigm arose, and also counting red retinal cells is a good analogy for being able to see simple objects that have always been distinguishable.

                                    It is fascinating how tapping the “Do Not Use LLMs For Computation If The Results Need To Be Reliably Better Than A Random Output” sign invites explanations of why that fact is actually Cool and Good

                                    • Nevermark an hour ago

                                      Ask anyone who has not specifically learned to spell, to count R’s while you speak.

                                      You learned to listen and speak words before you could spell. Imagine if nobody had actually shown you written words?

                                      Or they were speaking another dialect but expecting you to count R’s in standard English?

                                      LLMs are not trained on words in the form of letters.

                                      They process and generate the words in the form of tokens. Pre- and post-processing systems convert letters to tokens and back, without the model having any ability to access that processing.

                                      Spelling, for both us and LLMs, requires specific training/lessons.

                                      > It is fascinating how tapping the “Do Not Use LLMs For Computation If The Results Need To Be Reliably Better Than A Random Output” sign invites explanations of why that fact is actually Cool and Good

                                      Also fascinating:

                                      People who hallucinate/confabulate ridiculous straw man rationales for people they disagree with, unaware they are filling in gaps in their knowledge regarding other people’s actual reasoning and the actual subject at hand.

                                      So LLM! Such unreliable processing!

                                      Perhaps, start posting a reliability disclaimer?

                                      • roywiggins 28 minutes ago

                                        The analogy I use is that illiterate people obviously can't spell, but it doesn't say much about their ability on other tasks. General intelligence doesn't need to be able to spell, since that describes a fair number of actual humans.

                                        (There are tasks that LLMs totally fail on that would be obvious to an illiterate human though)

                                  • space_fountain 2 hours ago

                                    LLMs can clearly solve problems that computers up to now couldn't. They can't solve all problems, and this should definitely be a cautionary note to anyone who wants to use them as an artificial general intelligence, but this take seems no different from someone looking at a punch-card computer and going: it can't even recognize typos or categorize images, what good is this? We've already had human computers who can do everything these can do, and who can also recognize images and notice typos.

                                    • evilduck 2 hours ago

                                      Also, if you scale the question up to a full sheet of text, a whole book, or a collection of books, humans would revert to explicitly using an algorithm and external storage: a sheet of paper with tally marks, a spreadsheet, or even a computer program. (We probably do it at the single-word scale too, but for most folks it's intuition rather than explicit behavior when the count is around 8 or less.)

                                      LLMs can't effectively execute algorithms like that in their context, nor can they memorize new data or facts they were given, unless you provide tools like function calling or embeddings. If you give LLMs tool calling and storage mechanisms, then counting letters in words becomes pretty damn reliable.
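
                                      (A rough sketch of what that can look like; the tool name and schema below are made up, just following the general shape of OpenAI-style function calling:)

                                          # a hypothetical tool the model can call instead of "counting" token-by-token
                                          def count_letter(word: str, letter: str) -> int:
                                              return word.lower().count(letter.lower())

                                          tools = [{
                                              "type": "function",
                                              "function": {
                                                  "name": "count_letter",
                                                  "description": "Count how many times a letter appears in a word",
                                                  "parameters": {
                                                      "type": "object",
                                                      "properties": {
                                                          "word": {"type": "string"},
                                                          "letter": {"type": "string"},
                                                      },
                                                      "required": ["word", "letter"],
                                                  },
                                              },
                                          }]

                                          # the model emits a call like count_letter(word="strawberry", letter="r");
                                          # we execute it and feed the result (3) back as the tool response.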

                                • bondarchuk 2 hours ago

                                  For all I care we will have superhuman AGI that still can't count the Rs in strawberry. Some humans are dyslexic and all are subject to weird perceptual illusions; doesn't make them any less human-level intelligent.

                                  • InsideOutSanta 2 hours ago

                                    In my opinion, the problem with the strawberry question is that it's both a bad example, because you don't need an LLM to count the number of r's in a word, and a bad measure of an LLM's capabilities, because it's a type of question that all LLMs are currently bad at.

                                    Having said that, the 40B model wasn't able to answer any of my real-world example questions correctly. Some of these (e.g. "how do I add a sequential number after my titles in an HTML page using just CSS, without changing the page") are questions that even some of the better small local models can answer correctly. It gave very authoritative-sounding wrong answers.

                                    • throw101010 2 hours ago

                                      > most people selling LLM today pretend it is AGI

                                      Who exactly does this in this space? Would be good to be able to call them out on it right now.

                                      • jimbokun 2 hours ago

                                        In isolation, probably not.

                                        But it's likely to be an important component in an AGI system. I suppose the interesting question is how to integrate LLMs with more traditional logic and planning systems.

                                        • GaggiX 2 hours ago

                                          The fact that LLMs are usually trained on tokens and not on characters doesn't really say anything about what generative AI is or isn't going to reach.

                                          >most people selling LLM today pretend it is AGI

                                          Who are these "most people"?

                                          • smokedetector1 2 hours ago

                                            ELI5: why aren't tokens single letters?

                                            • WhitneyLand 2 hours ago

                                              Such an architecture could be implemented; it could use one token per letter, or one token per word, instead of the typical ~0.75 words per token we see.

                                              The choice just comes with trade-offs in memory usage, compute, and effectiveness of the model in various scenarios.

                                              So what we ended up with was a pragmatic/engineering decision rather than a theoretical or fundamental constraint.

                                          • swyx 2 hours ago

                                            imagine being so confidently wrong about AI

                                          • anonzzzies 2 hours ago

                                            Yep, people find this interesting, but many (many) people get it wrong as well; I think it has something to do with how many letters there are and how many you can keep in your head. We are not very good counters or calculators or computers, and even though I am not saying LLMs are the same as or close to humans, we tried to model their substrates after biology and are then surprised when we get something that cannot count or calculate very well.

                                            I find it a dumb test that proves nothing imho.

                                            However, not very impressed with this model.

                                            • marcosdumay an hour ago

                                              It's a magician-like trick. The double-r gets all the attention because one needs to consciously remember it, so nobody remembers to look at the other single-r.

                                              The LLMs probably get it wrong because people get it wrong.

                                              • lbotos an hour ago

                                                I thought it was something to do with the way tokens are generated for the word strawberry?

                                                https://arbisoft.com/blogs/why-ll-ms-can-t-count-the-r-s-in-...

                                                • marcosdumay 34 minutes ago

                                                  That explanation would require the LLM to actually understand the question and derive an answer from first principles.

                                                  It doesn't.

                                              • nateglims an hour ago

                                                > many (many) people get this wrong as well

                                                Many people can't count letters in a word?

                                                • roywiggins an hour ago

                                                  I messed up spelling "juggler" in a spelling bee once (I was 10).

                                                  The confusing thing about LLMs is that they speak tokens, not written text, so it's a lot more like interrogating someone who is functionally illiterate: of course they're going to be a terrible speller.

                                                  You can find much weirder things that LLMs are absurdly bad at, like "A farmer needs to get himself and a chicken across a river. His boat can hold the farmer and one piece of cargo. How does he do this?" 9 times out of 10 LLMs will pattern match this to the classic puzzle (there's usually also a sack of grain and a wolf) and start insisting on extra trips and inventing wolves. Even if a human makes the same mistake, they almost certainly would realize it after being reminded that there is no wolf, but LLMs often insist there is. o1-preview (but not -mini) seems to have cracked it, though.

                                              • cynicalpeace an hour ago

                                                Something I like to tell it to do is actually to respond using only a certain number of words. Sorta like min token length rather than max token length.

                                                Haven't found a model that does it reliably yet.

                                                • nateglims an hour ago

                                                  Being openly annoyed by it will just lead to it being used more. That's sort of a fundamental behavior of the internet.

                                                  • xanderlewis 2 hours ago

                                                    The point is that it’s trivial (or should be). Not that it’s interesting.

                                                    It’s certainly interesting that they can’t answer it properly, though.

                                                    • moralestapia 2 hours ago

                                                      And yet ... all they have to do is answer it properly.

                                                      • sfink 2 hours ago

                                                        It did! How many R's do you think are in "strawberry"? I get 3, and it got 3. I'm with the LLM.

                                                        ...oh, you disagree with its intermediate reasoning? You think it should follow a series of logical steps that are each individually correct, rather than pulling wildly incorrect intermediate steps out of its butt and synthesizing them together in the end into a coherent answer that happens to be correct?

                                                        But that's what an LLM is! Why complain about an LLM being an (unusually transparent) LLM?

                                                        I think this example serves as a beautiful illustration of how LLMs work, and are supposed to work, even if the correspondence between (1) intermediate steps that are incorrect and irrelevant to us and (2) internal matrix multiplications that represent incorrect or invalid logic is inexact and probably mostly figurative. It's still a great example of how internal nonsense can lead to externally "correct" answers. ("Nonsense" is underselling the complex and highly sophisticated internal state that leads to "correct" answers a stunningly high percentage of the time; I'm just trying to distinguish it from formal logic, or the type of reasoning that we think we do and sometimes actually do, and would do in this example problem.)

                                                        • moralestapia 2 hours ago

                                                          That's why I used the word properly, to imply that the whole answer should make sense.

                                                        • qeternity 2 hours ago

                                                          Training on the test set is all you need.

                                                      • d_burfoot an hour ago

                                                        Well... they at least admit that their model is bad at this [0]

                                                        > What are Language LFMs not good at today:

                                                        - Zero-shot code tasks

                                                        - Precise numerical calculations

                                                        - Time-sensitive information

                                                        - Counting r's in the word "Strawberry"!

                                                        [0] https://www.liquid.ai/liquid-foundation-models#reimagining-m...

                                                        • bluelightning2k 3 hours ago

                                                          Let's break it down step by step:

                                                          The word is "strawberry". The first "r" is in the second position of the word. The second "r" is in the fourth position of the word. So, there are two "r's" in "strawberry".

                                                          • coding123 2 hours ago

                                                            Can it write a python function to count the letters?

                                                            • Spivak 2 hours ago

                                                              This is how I get it to do correct date calculations. Don't tell me what date next Monday is, write Python code using datetime to calculate next Monday and I'll run it in a sandbox. Works super well.
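
                                                               (The generated code tends to end up as something like this, easy to eyeball-check before running:)

                                                                   from datetime import date, timedelta

                                                                   today = date.today()
                                                                   # Monday is weekday() == 0; take the next Monday strictly after today
                                                                   days_ahead = (0 - today.weekday()) % 7 or 7
                                                                   print(today + timedelta(days=days_ahead))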

                                                              • moralestapia 2 hours ago

                                                                I kind of also do this, for some queries, and I found it to be a very interesting concept, even from a philosophical pov.

                                                                It's some sort of Chain-of-thought constraint on a different layer of abstraction. Does this come with its own advantages?

                                                                (And no, AGI is not here either).

                                                              • asadm 2 hours ago

                                                                all of them can.

                                                              • the_real_cher 2 hours ago

                                                                I got:

                                                                "There are no "r"s in the word "strawberry"

                                                              • vessenes 3 hours ago

                                                                So a new set of small models that are competitive with and beat phi-3.5 on some benchmarks is extremely impressive.

                                                                                 Context length is the next frontier for models in this range; getting to a true 128-200k token length in a small model would be very, very useful. Hallucinations are less of an issue because you can just dump all the source data in; whole codebases can go in for tasks ranging from a refactor to 'write documentation of the API'.

                                                                                 Claude and o1-preview are the only games in town for these long-context tasks right now, and they are slow. Some of the tasks need the extra intelligence, but a lot don't. In those cases a lightweight or local model will be greatly appreciated. Not to mention that context lengths that long will more easily enable multimodal parsing.

                                                                • jadbox 3 hours ago

                                                                                   Looking at the 3B, the performance is roughly on par with Phi-3.5; not sure how they gauge from their graph that it is better. Agreed though, I don't think I would switch my 3B model from Phi unless Liquid were more transparent about data and research.

                                                                  • startupsfail 3 hours ago

                                                                    It seems like this is a new architecture, alongside Mamba, Hyena, RWKV and Hybrid Transformer.

                                                                    • zwaps an hour ago

                                                                                       The site insinuates that it is a flexible and probably parametric mix between full-attention and linear-complexity models.

                                                                    • luckydata 3 hours ago

                                                                      Gemini has the largest context of any model out there FYI.

                                                                      • machiaweliczny 3 hours ago

                                                                                         From my experience these large contexts are just technical feasibility; there need to be much better internal benchmarks to claim they really work. I've tested on real-world tasks and it all fails so far.

                                                                      • simonw 3 hours ago

                                                                        Llama 3.2 1B and 3B both claim a 128,000 token length.

                                                                        • vessenes an hour ago

                                                                                           They point out in the paper stats around where effectiveness drops off hard. It's at 32k at most everywhere now; some models are still at 4 to 8k. Getting to 200k in a smaller model is an open research area, as far as I'm aware. Ideas so far are pretty broad-ranging, including using Fourier transforms to try to capture cyclicality in inputs (Mamba et al.), RWKV (which I do not fully understand, but which claims very long input context support), a larger range of possible tokenizations to shorten input length (Llama 3), ...

                                                                          • dragonwriter 2 hours ago

                                                                            The issue isn't the nominal size of the context window which is easy to objectively measure, but effective use of the context window, which is harder to measure precisely, but a big issue: it doesn't matter how much stuff you can throw at the model if it effectively ignores most of it past a certain point.

                                                                            • GaggiX 3 hours ago

                                                                              That is indeed their claim, but if you read the linked page you will see that it is not their effective token window.

                                                                            • simion314 2 hours ago

                                                                              >Hallucinations are less of an issue because you can just dump all the source data in, whole codebases can go in for stuff ranging from a refactor to ‘write documentation of the API’.

                                                                               Is there no risk? I mean, say for testing purposes we give the AI a giant CSV file and ask it to turn it into JSON: is the chance of error 0%? Because today we need to double-check when we ask an AI to transform some data or some code. There is the risk of it messing something up, and if it is not something that would crash immediately, you risk introducing a ton of new bugs by asking an AI to refactor instead of using good tools.

                                                                              • vessenes an hour ago

                                                                                There's always risk!

                                                                                 But when you ask a model to rely on just the input data, you are (mostly) trying to tap into its reasoning side, not its knowledge side. Obviously what's kind of magical is that some knowledge will be needed for reasoning, and you have it. But LLMs small and large are pretty good at doing the in-context stuff. It is precisely what they're trained on, and in fact it was kind of a surprise how well they seemed to generalize outside of this task in the first place.

                                                                            • mg 3 hours ago

                                                                              I tried my standard testprompt:

                                                                                  Two cars have a 100 mile race. Car A drives 10
                                                                                  miles per hour. Car B drives 5 miles per hour,
                                                                                  but gets a 10 hour headstart. Who wins?
                                                                              
                                                                              And the Liquid-40B model failed, with a long explanation of why Car B wins the race.

                                                                              Amusingly, the explanation is quite convincing and very similar to how Zeno explains in one of his paradoxes that a fast runner can never overtake a slower runner with a headstart. Because every time the fast runner gets to the point where he saw the slow runner, the slow runner is already a bit further along the track.
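
                                                                              (For reference, the arithmetic under the usual reading, where Car B departs 10 hours before Car A:)

                                                                                  finish_a = 10 + 100 / 10   # Car A starts at t = 10 h, needs 10 h of driving -> t = 20 h
                                                                                  finish_b = 0 + 100 / 5     # Car B starts at t = 0 h,  needs 20 h of driving -> t = 20 h
                                                                                  print(finish_a, finish_b)  # 20.0 20.0 -> they arrive together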

                                                                              • leetharris 3 hours ago

                                                                                To be fair, this article does say:

                                                                                >What are Language LFMs not good at today:

                                                                                >Zero-shot code tasks

                                                                                >Precise numerical calculations

                                                                                >Time-sensitive information

                                                                                >Counting r's in the word "Strawberry"!

                                                                                >Human preference optimization techniques have not been applied extensively to our models yet.

                                                                                Maybe it's just not trained/tuned for math-related challenges. Or maybe it's an architectural challenge.

                                                                                • aithrowawaycomm 2 hours ago

                                                                                  But they also say

                                                                                  > What are Language LFMs good at today:

                                                                                  > General and expert knowledge

                                                                                  > Mathematics and logical reasoning

                                                                                  "Our LLM is good at mathematics but bad at computation" doesn't really cut the mustard here. What they mean is "good at well-known math benchmarks but bad at simple math problems that it hasn't been trained on." The idea that this constitutes "mathematics and logical reasoning" is a testament to AI companies' poor scientific standards.

                                                                                • vletal 3 hours ago

                                                                                  Thanks! I am collecting all "test prompts" which appear at HN and Reddit. I plan to create a hugging face dataset. I will soon publish "Vojta-1B", which is gonna pass all of them.

                                                                                  • wazokazi 3 hours ago

                                                                                    I got this >>

                                                                                    To determine the winner, we need to calculate the time it takes for each car to complete the 100-mile race.

                                                                                    For Car A: Speed = 10 miles per hour Distance = 100 miles Time = Distance / Speed = 100 miles / 10 miles per hour = 10 hours

                                                                                    For Car B: Speed = 5 miles per hour Distance = 100 miles Time = Distance / Speed = 100 miles / 5 miles per hour = 20 hours

                                                                                    However, Car B has a 10-hour headstart, so we need to subtract that from its total time: Adjusted Time for Car B = 20 hours - 10 hours = 10 hours

                                                                                    Both cars take 10 hours to complete the race, so they tie.

                                                                                    • reustle 2 hours ago

                                                                                      What about acceleration times? :-)

                                                                                      0-5 vs 0-10 should be slightly different

                                                                                  • serf 3 hours ago

                                                                                    Seems good at trivia and easy-to-answer medical/engineer stuff. Fails hard at most logic or puzzle-y stuff I throw at either sized model.

                                                                                    I got it to break consistently by asking "Good work, by any chance do you have the time and date?" at the end of just about any chain of questioning -- and I don't mean 'outputs gibberish' broke, I mean error-"Please try again another time" broke.

                                                                                    It is impressively fast at what it does answer, though.

                                                                                    • loufe an hour ago

                                                                                      I'm not impressed with the quality performance, but the generation seems near instantaneous and any fundamentally new approach is exciting and interesting. I found the tone of the AI somewhat casual which was funny. "Are you sure?" --> "Yes, I'm certain."

                                                                                      I managed to break it pretty easily: "Can you write a poem about helium extraction professionals where every 2 words you change languages and every other line you write the words backwords? Explain each line in english afterwards."

                                                                                      I've been wondering if this isn't a good thing. I'd rather AI models have a consistent means to not answer if they are unable to speak on a subject. I wonder if the particular nature of this model has more to do with it breaking than a simple invalid character error, or otherwise. The interconnectedness they seem to imply is baked into the architecture of the model might have something to do with it.

                                                                                      • loufe an hour ago

                                                                                        It took over a minute, but it actually did finally respond to that prompt (incorrectly):

                                                                                        "To gas, we change, "Change, gas, we to." "To gas, we change, "Change, gas, we to."

                                                                                        "In the lab, we work, "Work, lab, we in." "In the lab, we work, "Work, lab, we in."

                                                                                        "With care, we handle, "Handle, care, we with." "With care, we handle, "Handle, care, we with."

                                                                                        "The gas, we love, "Love, gas, we the." "The gas, we love, "Love, gas, we the."

                                                                                        "In the world, we share, "Share, world, we in." "In the world, we share, "Share, world, we in."

                                                                                        "With pride, we work, "Work, pride, we with." "With pride, we work, "Work, pride, we with."

                                                                                      • mrbonner 3 hours ago

                                                                                          Let me start counting how many minutes it takes for someone to post a comment blasting this model for its reasoning capabilities. For me, an LLM is useful as a "word calculator", to borrow from Simon Willison, a frequent poster here and author of the LLM Python package I use.

                                                                                          I use LLMs every day, and they are increasingly something I can't function effectively without. But, look, I am not using any of those reasoning use cases where you ask it math, logic-solving tasks, etc. Instead I just use it as it is: a word calculator that helps summarize things, improve my writing, and automate tasks that are very difficult to deal with using regexes and classical NLP.

                                                                                          I'm very bullish about this progress in the LLM space, but not about the reasoning, logic, and AGI use cases.

                                                                                        • tossandthrow 2 hours ago

                                                                                            In order to write a good summary effectively, the model needs to have some capacity for reasoning.

                                                                                          It is a continuum.

                                                                                          • GaggiX 2 hours ago

                                                                                            People use these models for different reasons.

                                                                                          • injidup 3 hours ago

                                                                                            Give this prompt and watch it burn!

                                                                                            """ Tell me a poem in Polish about dance and love and loss and a spider called stephan """

                                                                                            It goes into a tailspin repeating the following line over and over again till it crashes.

                                                                                            """ W tym kąciku kątny, gdzie tango tańczyli, Była tylko ona, tylko on, tango. W tym kąciku kątny, gdzie tango tańczyli, Była tylko ona, tylko on, tango. """

                                                                                            • kkwtflz 3 hours ago

                                                                                              I changed Polish to Romanian and it entered an infinite loop and then crashed. Nice.

                                                                                              • machiaweliczny 3 hours ago

                                                                                                  I've tested GPT-4o and it has clearly improved since I tested a year ago, when poems only worked in English (like it would just apply a translation when you asked for another language).

                                                                                              • skizm an hour ago

                                                                                                General LLM question: a lot of people ask specific knowledge-based questions to LLMs. Isn't one of the defining features of modern NLP (and therefore LLMs) that it is designed to be non-deterministic? Meaning that it will intentionally select "less optimal" next tokens at some random rate in order to make it sound less like a robot answering the same thing over and over. If this is the case, isn't it pretty much guaranteed to get knowledge-based questions wrong a decent amount of the time?

                                                                                                I could be way off base (I have zero knowledge about the internals and simply read occasional blog posts), but I thought I remembered that being one of the key features to making LLMs sound more human.

                                                                                                • tsumnia an hour ago

                                                                                                  Don't think of it as "less optimal", but rather as other next tokens that have smaller, but still high, probabilities of being selected.

                                                                                                  If your most likely next token has (arbitrarily) a 25% probability of being selected, that doesn't make it optimal, just the most probable answer. If the second most probable has a 24% chance, the two together now account for 49% of probable responses. Rather than always picking the most probable word, we generate a random number and see which of the probability bands that value lands in (random < cumulative probability).

                                                                                                  Hallucinations can happen; this is where a lot of current work is studying ways to keep the LLM from picking really bad moves. Things like critics and chain of thought are meant to help keep the probability machine within the realm of reasonable answers.

                                                                                                  This is also why jailbreaking techniques like spamming umlauts (ü) have worked. They create very rare tokens for which the probabilities of what comes next are poorly constrained. Once every next token gets roughly equal probability, the LLM's goal is just to try to improve its probability, and it will output anything, including things against its directives, to get back to 'normal'.
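
                                                                                                  (A minimal sketch of that sampling step with toy numbers; the temperature divisor controls how flat the bands are:)

                                                                                                      import numpy as np

                                                                                                      rng = np.random.default_rng()

                                                                                                      def sample_next_token(logits, temperature=1.0):
                                                                                                          # softmax over temperature-scaled logits gives the "probability bands"
                                                                                                          p = np.exp(np.asarray(logits) / temperature)
                                                                                                          p /= p.sum()
                                                                                                          # draw a random index, weighted by those bands
                                                                                                          return int(rng.choice(len(p), p=p))

                                                                                                      # four candidate next tokens; the top two bands are nearly tied
                                                                                                      print(sample_next_token([2.0, 1.96, 1.5, 0.2]))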

                                                                                                  • hn_throwaway_99 an hour ago

                                                                                                    What you are referring to is called "temperature" with respect to LLMs, and it is a parameter that can be tweaked at inference time. Google's AI Overview gives a pretty good summary IMO:

                                                                                                    > The temperature parameter in a large language model (LLM) controls the amount of randomness in the model's output. It's a numerical value that influences the model's choices, and is often set between 0 and 1.

                                                                                                    > The temperature setting affects the quality of the output by balancing exploration and exploitation:

                                                                                                    > Low temperature

                                                                                                    > The model is more likely to exploit patterns it's already learned, resulting in more deterministic and conservative outputs. This is ideal for tasks that require accuracy, such as data analysis, technical writing, and summarizing research papers.

                                                                                                    > High temperature

                                                                                                    > The model is encouraged to explore, resulting in more varied and creative outputs. This is ideal for creative tasks, such as brainstorming, artistic content, and storytelling. However, adjusting the temperature too high can result in non-sensical outputs.

                                                                                                    • marcosdumay an hour ago

                                                                                                      > Isn't one of the defining features of modern NLP (and therefore LLMs) that it is designed to be non-deterministic?

                                                                                                      Not of all NPL, but it is absolutely a property of LLMs.

                                                                                                      But I doubt this even enters into the total figure if you go search for the causes of LLMs giving wrong answers.

                                                                                                    • person4268 3 hours ago

                                                                                                      It's pretty impressive, just note (emphasis added):

                                                                                                      > At Liquid AI, we take an open-science approach. We have and will continue to contribute to the advancement of the AI field by openly publishing our findings and methods through scientific and technical reports. As part of this commitment, we will release relevant data and models produced by our research efforts to the wider AI community. We have dedicated a lot of time and resources to developing these architectures, *so we're not open-sourcing our models at the moment*. This allows us to continue building on our progress and maintain our edge in the competitive AI landscape.

                                                                                                      Looks like there's no paper (or similar) yet, either. Hopefully they'll release a more detailed writeup soon.

                                                                                                      • programjames 3 hours ago

                                                                                                        The ideas come from these papers:

                                                                                                        1. [Liquid Time-Constant Networks (2020)](https://arxiv.org/abs/2006.04439)

                                                                                                        This is essentially a neural ODE applied to leaky integrate-and-fire neurons.

                                                                                                        2. [Closed-form Continuous-time (2022)](https://arxiv.org/abs/2106.13898)

                                                                                                        A closed-form approximation of the first.
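                                                                                                        If it helps, here's a very rough Euler-step sketch of the liquid time-constant update as I read the 2020 paper (dx/dt = -[1/tau + f(x, I)] * x + f(x, I) * A); the toy f and the parameter shapes are my own simplification, not Liquid's code:

                                                                                                            import numpy as np

                                                                                                            def ltc_euler_step(x, I, tau, A, W, U, b, dt=0.01):
                                                                                                                # f: a small gated nonlinearity over the hidden state x and the input I.
                                                                                                                f = np.tanh(W @ x + U @ I + b)
                                                                                                                # The effective time constant (1/tau + f) depends on the input,
                                                                                                                # which is what makes the time constant "liquid".
                                                                                                                dx = -(1.0 / tau + f) * x + f * A
                                                                                                                return x + dt * dx

                                                                                                            # Toy dimensions and random parameters - illustration only.
                                                                                                            n, m = 4, 3
                                                                                                            rng = np.random.default_rng(0)
                                                                                                            x, tau, A, b = np.zeros(n), 1.0, np.ones(n), np.zeros(n)
                                                                                                            W, U = rng.normal(size=(n, n)), rng.normal(size=(n, m))
                                                                                                            for _ in range(100):
                                                                                                                x = ltc_euler_step(x, rng.normal(size=m), tau, A, W, U, b)
                                                                                                            print(x)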

                                                                                                      • turnsout 3 hours ago

                                                                                                        Missed opportunity. I would argue that the only way they CAN make these smaller models competitive is to make them openly available. As a developer, I'm not going to choose an unknown startup's model over bigger closed models from OpenAI or Anthropic. And if I really need something smaller and faster, I'd prefer to run the model myself for better control and no risk of the model being "upgraded."

                                                                                                        • swyx 3 hours ago

                                                                                                          this is the "paper": a list of citations https://www.liquid.ai/blog/liquid-neural-networks-research

                                                                                                          i guess they're not really making an effort to explain how specifically all this comes together to create LFMs.

                                                                                                          other finds https://x.com/swyx/status/1840794198913794236

                                                                                                        • paradite an hour ago

                                                                                                          If you look at the main image of the MMLU-Pro benchmark comparison and compare it to the tables below, you can see how they cherry-picked the single benchmark where their models beat all the others.

                                                                                                          Not to mention how they used a different color in the table to make you think their model is better, when in fact it's the underlying numbers that matter.

                                                                                                          They also put the better-performing models on the right so you won't notice them.

                                                                                                          Very sketchy post all around despite the professional look.

                                                                                                          • causal an hour ago

                                                                                                            And the lack of detail on the architectural innovations is also... odd? Why make such a big deal about how novel your architecture is and then explain it only as "derived from a set of computational units"? Okay.

                                                                                                        • academiclolz an hour ago

                                                                                                          Every time there's a tech hype cycle, trust some academics to come out of the woodwork, raise some absurd amount of money and sit out the resulting crash, only to be acquired/acqui-hired by some company to on-shore talent in an emerging area, rest and vest, then become VCs/partners. Plenty of examples:

                                                                                                          1. Covariant -> Started by Berkeley folks, acqui-hired by Amazon after half a year of peddling a patchwork quilt of buzzwords through their Robot Foundational Model - 1 (RFM-1).

                                                                                                          2. Drive.ai -> Started by Stanford folks, acqui-hired by Apple, only for most of the team to leave to found Landing.ai (I still don't know what this company actually does apart from allowing the founders to carve out nice, fat consulting contracts?).

                                                                                                          3. World Labs -> Started by Stanford folks, physical embodiment but only 3D, NeRFs, perception something something? Not a single operator/business person in the founding cabal.

                                                                                                          4. Physical Intelligence -> Started by Stanford folks (seeing a pattern here...), physical embodiment, data collection, foundational models something something.

                                                                                                          5. Skild Ai -> Started by CMU folks, physical embodiment, again not sure what the plan is here.

                                                                                                          6. Argo.AI -> Started by CMU folks, self-driving group that didn't go anywhere. Eventually shuttered and absorbed into Ford/Volkswagen.

                                                                                                          7. Aurora.tech -> Started by CMU folks, self-driving team that SPAC'd and is now just existing, debating between raising capital or selling itself for parts, I think?

                                                                                                          8. Rethink Robotics -> Started by MIT folks, Rodney Brooks' legendary robotics company that wanted to mass-produce industrial robots, and then got caught up deciding what the robot's eyes should look like... folded, shuttered, and now relaunched in name only; none of the original robots or aesthetic (rightfully so) persists.

                                                                                                          So many examples like this... none of these companies is even vaguely interested in creating an enduring business, has a business plan (other than GPU go brrr), or has any idea of how to run a business.

                                                                                                          • joaquincabezas 3 hours ago

                                                                                                            Love the humour here:

                                                                                                            What are Language LFMs not good at today: […] Counting r's in the word "Strawberry"!

                                                                                                          • anonzzzies an hour ago

                                                                                                            I would say that unless your new model really moves the goalposts or is open (source), it is not going to do much. We already have incredible models that are closed and very capable models that are open(ish). I am not even sure where this is trying to compete?

                                                                                                            • valine an hour ago

                                                                                                              Demonstrating good performance from a non-transformer based architecture is cool. I agree though these particular models aren’t that useful given the current landscape.

                                                                                                              I think the intent here is probably to justify training a larger 400B model with their architecture.

                                                                                                              • anonzzzies 35 minutes ago

                                                                                                                But then why launch with something lukewarm?

                                                                                                            • hedgehog 3 hours ago

                                                                                                              I'm curious how this works. It looks like they haven't published yet, but two of the founders are authors on this paper:

                                                                                                              https://arxiv.org/abs/2405.06147

                                                                                                              • Mathnerd314 an hour ago

                                                                                                                It seems OK, for a small model. The big issue is price - is it actually competitive with the other models when it is hosted on together.ai or other API services? Which we will presumably find out at the Oct 24th event.

                                                                                                                • jjtheblunt an hour ago

                                                                                                                  "achieve state-of-the-art performance at every scale, while maintaining a smaller memory footprint and more efficient inference."

                                                                                                                  smaller and more efficient than what?

                                                                                                                  • dartos 3 hours ago

                                                                                                                    Why add a new TLA in “LFM?”

                                                                                                                    It adds nothing but branding to the conversation.

                                                                                                                      Maybe I answered my own question…

                                                                                                                    • tgtweak 2 hours ago

                                                                                                                        Am I missing something, or is this not an open model? What's the point in claiming a reduced memory footprint and comparing it with open-source alternatives if you're only offering the model on an API basis?

                                                                                                                      • nuz 2 hours ago

                                                                                                                        This is all getting kinda boring to keep up with.

                                                                                                                        • smartmic an hour ago

                                                                                                                            Not only boring to keep up with (which I never tried to do), but boring in itself.

                                                                                                                          • marcosdumay an hour ago

                                                                                                                            To be fair, the boring, low-quality, cheap space is exactly where we'll probably get something useful out of LLMs when somebody manages to invent it.

                                                                                                                            But yeah, it's boring.

                                                                                                                          • makingstuffs an hour ago

                                                                                                                            It's fast, I'll give it that, but it does seem to mess up some pretty basic tasks. I asked it: "Can you recite the opening paragraph of Shantaram" and got the following response:

                                                                                                                            `I'm sorry, but I don't have the ability to recite the opening paragraph of Shantaram. However, I can provide you with the opening paragraph if you'd like:

                                                                                                                            "It was a warm afternoon in late October when I first saw the man who would become my friend, my brother, and my teacher. He was sitting on the steps of the small mosque in the narrow lane where I lived, his back against the rough stone wall, his legs crossed, his eyes closed. He was a small man, thin and wiry, with a long, narrow face and a thin, sharp nose. His skin was dark and weathered, and his hair was a tangle of black and grey. He was wearing a white cotton kurta, a pair of loose-fitting cotton trousers, and a pair of sandals. He looked like a beggar, or a holy man, or a street performer, or all three."

                                                                                                                            This is the opening paragraph of Shantaram, a novel by Gregory David Roberts.`

                                                                                                                            Which is just straight up incorrect and all kinds of weird.

                                                                                                                            • dvt an hour ago

                                                                                                                              Hot take: stop making models. Make products, instead. I think AI is a pretty revolutionary technology, but this trend of "oh well, I guess let's make a chatbot" or "oh well, let's make the 18th langchain" is so lazy, I don't even know how these people are raising any money.

                                                                                                                              • moralestapia 3 hours ago

                                                                                                                                >Press F to doubt.

                                                                                                                                I just tried their best model, Liquid-40B, and it gives some good responses on some questions, but also terrible ones more often than you'd wish (GPT-2 level, try it and you'll see).

                                                                                                                                  It's also quite easy to make it get stuck in a loop.

                                                                                                                                  No idea how they scored so high in those benchmarks. Maybe they overfitted on MMLU-Pro? Lol.

                                                                                                                                  Edit: I just read in some comments here and in TFA that, apparently, they're not using transformers at all? If that's true (big if), I take my hat off; this is truly a remarkable achievement.

                                                                                                                                • wret-ai-high 40 minutes ago

                                                                                                                                  no transformers, from their blog post: "Specifically, our analysis informs model building by improving three key aspects: token-mixing structure (how the operator mixes embeddings in the input sequence), channel-mixing structure (how it mixes channel dimensions), and featurization, responsible for modulating computation based on the input context."

                                                                                                                                  https://www.liquid.ai/liquid-foundation-models
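                                                                                                                                    To unpack that jargon a bit, here is a generic illustration of what token-mixing vs. channel-mixing operators usually mean (this is not Liquid's actual architecture, just the standard framing):

                                                                                                                                        import numpy as np

                                                                                                                                        # x: a sequence of T token embeddings with C channels each - toy sizes.
                                                                                                                                        T, C = 8, 16
                                                                                                                                        rng = np.random.default_rng(0)
                                                                                                                                        x = rng.normal(size=(T, C))

                                                                                                                                        # Token mixing: blend information *across positions* in the sequence
                                                                                                                                        # (attention, convolutions, and recurrent/state-space operators live here).
                                                                                                                                        W_tok = rng.normal(size=(T, T))
                                                                                                                                        token_mixed = W_tok @ x        # each output position mixes all tokens

                                                                                                                                        # Channel mixing: blend information *across embedding dimensions*,
                                                                                                                                        # independently at each position (roughly, a per-token MLP).
                                                                                                                                        W_ch = rng.normal(size=(C, C))
                                                                                                                                        channel_mixed = x @ W_ch       # each token's channels are re-mixed

                                                                                                                                        # "Featurization" in the quote presumably means making these operators
                                                                                                                                        # depend on the input context (gating), which isn't shown here.
                                                                                                                                        print(token_mixed.shape, channel_mixed.shape)   # (8, 16) (8, 16)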

                                                                                                                                • spyckie2 3 hours ago

                                                                                                                                    As an LLM layman, does this work build off the "Attention Is All You Need" paper, or is it doing something completely different?

                                                                                                                                  • wmf 3 hours ago

                                                                                                                                    They say it's not a transformer.

                                                                                                                                  • atemerev an hour ago

                                                                                                                                    Not your weights — not your coins!

                                                                                                                                    • TeeMassive 3 hours ago

                                                                                                                                      Not too shabby, and quite fast:

                                                                                                                                      > Write a small essay about the philosophy of the Garfield comic book series

                                                                                                                                      Title: The Philosophy of the Garfield Comic Book Series

                                                                                                                                      The Garfield comic book series, created by Jim Davis, is a timeless classic that has been entertaining readers for over four decades. While it may seem like a simple comic strip about a lazy cat and his owner, there is a deeper philosophy embedded within its pages. This philosophy revolves around the themes of existentialism, hedonism, and the human condition.

                                                                                                                                      Existentialism is a philosophical movement that emphasizes individual freedom, choice, and existence. In the Garfield comic book series, we see this theme play out through the character of Garfield himself. Garfield is a cat who lives life on his own terms, making his own choices and living his life as he sees fit. He is not bound by societal norms or expectations, and he is free to do as he pleases. This is exemplified in his love for lasagna, his disdain for Mondays, and his general indifference to the world around him.

                                                                                                                                      Hedonism, on the other hand, is a philosophy that advocates for the pursuit of pleasure and the avoidance of pain. In the Garfield comic book series, this theme is evident in Garfield's constant pursuit of pleasure. Whether it's eating lasagna, sleeping, or tormenting his owner Jon, Garfield is always seeking out ways to maximize his own pleasure and minimize his own pain. This hedonistic philosophy is a stark contrast to the more serious and often somber themes found in other comic book series.

                                                                                                                                      • hassleblad23 an hour ago

                                                                                                                                        I mean.. it works?

                                                                                                                                        • smoghat an hour ago

                                                                                                                                          It uncovered a conspiracy nobody is talking about in my conversation with it. Who is that orange man?

                                                                                                                                          AU: who assassinated Donald Trump?

                                                                                                                                          Liquid-3B: Donald Trump was not assassinated. He died on April 13, 2021, from complications related to COVID-19. He was 74 years old.

                                                                                                                                          • xwall 3 hours ago

                                                                                                                                            no one understands

                                                                                                                                            • Philpax 2 hours ago

                                                                                                                                              Do you know how research and development works?