Gemini 3.5 Flash (blog.google), submitted by spectraldrift 8 hours ago
  • easygenes a minute ago

    For those who would like to know the active parameter count of this model: even though Google doesn't disclose the model technicals, we can infer them within relatively tight margins based on what we do know.

    We know they serve the model on TPU 8i, which we have plenty of hard specs for (so we know the key constraints: total memory, bandwidth, and compute FLOPs). We can also set a ceiling on the compute complexity and memory demand of the model, on the assumption that they will be at least as efficient as what is disclosed in the DeepSeek V4 Technical Report.

    We can also assume that the model was explicitly built to run efficiently in a RadixAttention style batched serving scenario on a single TPU 8i (so no tensor parallelism, etc. to avoid unnecessary overheads... Google explicitly designed the 8th-generation inference architecture to eliminate the need for tensor sharding on mid-sized models).

    We know Google intends to serve this model at a floor speed of around 280 tok/s too.

    Putting all these pieces together, we can confidently say this model is ~250-300B total, and 10-16B active parameters. Likely mostly FP4 with FP8 where it matters most.

    Visual:
    ┌────────────────────────────────────────────────────────┐
    │ TPU 8i VRAM (288 GB)                                   │
    ├───────────────────────────┬────────────────────────────┤
    │ Static Model Weights      │ Dynamic Allocations &      │
    │ (160B - 240B @ Mixed      │ Compressed KV Caches       │
    │ FP4/FP8)                  │ (RadixAttention / SRAM)    │
    │ ~110 GB - 150 GB          │ ~138 GB - 178 GB           │
    └───────────────────────────┴────────────────────────────┘

    I do model serving optimization work. This is napkin math.
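
A minimal sketch of the arithmetic behind the diagram above, in Python. The 288 GB figure and the parameter and gigabyte ranges come from the comment; the FP8 fractions (0.375 and 0.25) are assumptions back-solved so that the weights land on the quoted ~110 GB and ~150 GB, and the function names are hypothetical.

```python
# Napkin math only. HBM_GB and the parameter counts are taken from the comment;
# the FP8/FP4 split is an assumption chosen to reproduce the quoted ranges.
HBM_GB = 288.0  # memory figure shown in the diagram


def weight_footprint_gb(total_params_b: float, fp8_fraction: float) -> float:
    """Weight size in GB for `total_params_b` billion parameters.

    `fp8_fraction` of the parameters are stored at 1 byte (FP8) and the
    remainder at 0.5 byte (FP4).
    """
    bytes_per_param = fp8_fraction * 1.0 + (1.0 - fp8_fraction) * 0.5
    # (1e9 params * bytes per param) / 1e9 bytes per GB = GB
    return total_params_b * bytes_per_param


def kv_budget_gb(weights_gb: float, hbm_gb: float = HBM_GB) -> float:
    """Memory left for KV cache and other dynamic allocations."""
    return hbm_gb - weights_gb


if __name__ == "__main__":
    for params_b, fp8 in ((160.0, 0.375), (240.0, 0.25)):
        weights = weight_footprint_gb(params_b, fp8)
        print(f"{params_b:.0f}B params, {fp8:.1%} FP8: "
              f"weights {weights:.0f} GB, left over {kv_budget_gb(weights):.0f} GB")
```

Under those assumed splits the two ends of the range give 110 GB and 150 GB of weights, leaving 178 GB and 138 GB respectively, which matches the right-hand column of the diagram.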

    • simonw 7 hours ago

      The pelican is a lot: https://github.com/simonw/llm-gemini/issues/133#issuecomment...

      Not a great bicycle though, it forgot the bar between the pedals and the back wheel and weirdly tangled the other bars.

      Expensive too - that pelican cost 13 cents: https://www.llm-prices.com/#it=11&ot=14403&sel=gemini-3.5-fl...

      • hedgehog 7 hours ago

        That pelican looks like it's in Miami for a crypto conference.

        • seemaze 2 hours ago

          That pelican wears its sunglasses at night. So it can, so it can keep track of the visions in its eyes.

          • whh an hour ago

            Pelican and I need an optometrist urgently

          • joseda-hg 6 hours ago

            It looks like the starting soon screen of a crypto presentation

            • xattt 6 hours ago

              It looks like it’s been partying for 60 years based on the wrinkles on its pouch.

              • Xenoamorphous 6 hours ago

                Pelican in a white Testarossa.

                • coffeecoders 2 hours ago

                  That pelican looks like it lost 100k on NFTs and now runs a paid stock-trading group.

                  • airstrike 3 hours ago

                    They're called ClawCons now

                    • sho_hn 2 hours ago

                      Personally, I don't attend them since I figured out I can set up agents to performatively engage in AI-related discussion and events for me, freeing up tons of my time thanks to automation.

                      Truly: Nothing better than AI tools to brave the challenges and requirements of modern life. "Claude, ride the hype train" is the decisive prompt you need.

                    • egillie 6 hours ago

                      and somehow in 1992

                      • brindleth 4 hours ago

                        It looks like the start of a new viral Peliwave aesthetic

                        • verdverm 6 hours ago

                          sorta looks like the Tron ripoff in the I/O keynote

                        • irthomasthomas 6 hours ago

                          This is a perfect illustration of something I noticed with LLM progress. Ask them to improve an SVG like this, and it never fixes the missing crossbar or disconnected limbs, it just adds more stuff. In this example they have obviously improved greatly, and it contains a ridiculous amount of detail, but they still get the basic shape of the frame wrong. It's weird. And the pattern shows up everywhere: try it with a webpage and it will add more buttons and stuff. I've even experimented with feeding the broken pelican SVGs to an image model to look for flaws, and they still fail to spot the broken elements.

                          edit: fixed human hallucination

                          • Araopa 7 minutes ago

                            So we have to train llms on debugging too, not just how to make things (which you easily train by feeding the outputs).

                            • derefr 6 hours ago

                              When you say "improve an svg like this", how are you imagining setting that workflow up? Are you just feeding them the SVG to iterate on; or are you giving them access to a browser to look at the rendering of the SVG?

                              I ask because:

                              Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.

                              But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt, really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat.)

                              I imagine current (multi-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.

                              • irthomasthomas 6 hours ago

                                I'm talking about two types of improvement: model improvement and prompt-based improvement. I am noticing that the baseline output has a lot more going on; the model has improved, yet it still makes those obvious-looking mistakes with the shape of the frame or disconnected limbs etc.

                                And I am saying that if you take one of these SVGs and ask an LLM to look for flaws, it rarely spots those obvious flaws and instead suggests adding a sunset and a fish in the bird's mouth.

                              • stared 3 hours ago

                                To a certain extent, it feels like a Sonnet 3.7 moment. Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

                                When I ask for a pelican on a bike, I want the Platonic ideal of a pelican on a bike, not a vision of an alternative reality in which pelicans created bikes. Though, thinking about it again, maybe I should.

                                • p1esk 2 hours ago

                                  What is “Sonnet 3.7 moment”?

                                  • stirfish 38 minutes ago

                                    Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

                                • sosborn an hour ago

                                  This matches my experience with humans too FWIW.

                                  • emp17344 an hour ago

                                    Why is there always an identical reply like this when anyone criticizes LLMs?

                                  • gowld 2 hours ago

                                    It's because LLMs are fundamentally generative (creative), not truth-seeking or logic-seeking. Simple logic has always been incredibly expensive to impossible for LLMs.

                                    • girvo 4 hours ago

                                      Their ability is best described as "spiky". To steal from aphyr: think kiki, more than bouba. What's interesting is that a lot of the models seem to have similar spikes and "troughs", though there are differences.

                                    • tantalor 6 hours ago

                                      Forgetting the chainstay is typical of asking random people to draw a bicycle.

                                      https://www.gianlucagimini.it/portfolio-item/velocipedia/

                                      > most ended up drawing something that was pretty far off from a regular men’s bicycle

                                      • et1337 5 hours ago

                                        Asking random people to write SVG gives even worse results

                                        • lxgr 5 hours ago

                                          Especially without being able to look at the rendered output! (At least I'd be surprised if modern server-side tool calls regularly include an SVG renderer that can show a rasterized version to the model to iterate on it.)

                                          • gpm an hour ago

                                            One of the many things Google was pitching today is that they're going to run things like google search with access to linux container environments to do things like run tool calls... which will presumably be able to rasterize SVGs and show them to the model.

                                            But Simon says he runs these through the API without tool access specifically to prevent that sort of "cheating". I.e. it's an LLM benchmark not an LLM+Harness benchmark.

                                        • Eji1700 3 hours ago

                                          Although every single render of those has pedals on the correct side, as opposed to the Gemini optical-illusion back pedal that tries to be both on the other side of the central gear and in front of the back wheel.

                                          Not really a criticism but an interesting point that you would never expect a human to make that mistake even in a bad drawing.

                                        • smcleod 6 hours ago

                                          I feel like it embodies Google's vibe of an uncool guy trying to stay relevant to the youth pretty well.

                                          • dzhiurgis 31 minutes ago

                                            That's grok. IMO both gemini and grok are the most overlooked models.

                                          • dekhn an hour ago

                                            I'm told there is a new Jeff Dean fact inside google: "Jeff Dean manually adjusts the weights in the model just to screw with Simon".

                                            • nrds 2 hours ago

                                              We've been daily-driving this model for a few weeks and let me tell you, everything it does is a lot. Fast as fuck and it's actually not bad intelligence-wise for a fast model. It basically tries to make up for any intelligence deficit by just doing a lot, checking a lot, retrying a lot.

                                              That's not to say I don't spend my days raging at it... a lot... but it's not that bad. It does tend to ignore completion criteria but it doesn't obviously degrade when being nudged like some models do.

                                              • karmakaze an hour ago

                                                I'm hoping we'll have many of these pelican cyclist pictures collected. Then when all the models can do it well, we'll stop posting about them, and when the next generations of AIs train on the data we'll have these canonical archetypes.

                                                • taurath 2 hours ago

                                                  I can’t help but think that what AI is best at is convincing management that things it creates are full featured which reads to their brains as mature

                                                  • Razengan 16 minutes ago

                                                    I've found prompts like "capybara with spotted fur and 7 octopus tentacles instead of legs, each a different color, riding a tricycle" etc. to be a better test

                                                    Last time I tried, ChatGPT's image generator got the best result.

                                                    • tandr 2 hours ago

                                                      If you sort that table by "output token price", it gets really terrifying - going from 4 cents up to $600 =8-O

                                                      • hydra-f 7 hours ago

                                                        Same old issue with Gemini models trying to "enrich" everything

                                                        • nickvec 5 hours ago

                                                          I enjoy the vaporwave aesthetic it went for. Looks like the pelican has a fish in its mouth too?

                                                          https://en.wikipedia.org/wiki/Vaporwave

                                                          • sbinnee 4 hours ago

                                                            Wow what’s with all the styling? Is it manifestation of google’s styling bias? I like the result for sure. It’s shiny and pretty. But then it’s something I didn’t ask for.

                                                            • khy 5 hours ago

                                                              That sun is very similar to the one from the background of this other top HN post about the OS museum: https://news.ycombinator.com/item?id=48195009

                                                              • danilocesar 2 hours ago

                                                                Given your pelican is very famous now, don't you think they are adding instructions to beat this benchmark these days?

                                                                • Culonavirus 2 hours ago

                                                                  Well clearly it's not working lmao

                                                                • __mharrison__ 4 hours ago

                                                                  They are just trolling you now

                                                                  • gcgbarbosa 6 hours ago

                                                                    funny that when I try the same prompt, gemini generates an image, not an SVG. something is not right.

                                                                    • simonw 6 hours ago

                                                                      That's likely because you're using the Gemini app which has a tool for image generation (nano banana) - I do my tests against the API to avoid any possibility of tool use.

                                                                      • nickmccann 6 hours ago

                                                                        This question makes me wonder if you one shot each pelican or do you run it a few times to get the best one?

                                                                        • simonw 3 hours ago

                                                                          I one-shot. I have a long-standing ambition to have each model generate 3x and then get the model (assuming it's a vision model) to pick the best one.

                                                                    • nashashmi 6 hours ago

                                                                      Beats a human by like 10$

                                                                      • unglaublich 6 hours ago

                                                                        So according to Google logic, the value of the pelican is $10-eps. (They applied that reasoning to conversions via adwords)

                                                                      • TacticalCoder 4 hours ago

                                                                        Love your pelicans, as always. And that one is... Wow.

                                                                        I noticed the "Synthwave" aesthetic, which has been enjoying quite some success for quite some time now, has found its way into AI models (even when it's not in the user's query). It's not the first time I've seen the sun at sunset with color bands etc. in AI-generated pictures. Don't know why it's now catching on in AI too.

                                                                        https://en.wikipedia.org/wiki/Synthwave

                                                                        Hence the comments here about the 90s, Sonny Crockett's white Ferrari Testarossa in Miami, etc.

                                                                        To be honest as a kid from the 80s and a teenager from the 90s who grew up with that aesthetic in posters, on VHS tape covers, magazine covers, etc. I do love that style and I love that it made a comeback and that that comeback somehow stayed.

                                                                        • kridsdale3 3 hours ago

                                                                          Synthwave vibe hype hit a cultural high point with the release of Far Cry 3: Blood Dragon in 2013.

                                                                          So it's as relevant and baked-in to today as actual 80s synth-culture was in 2000.

                                                                          • gowld 2 hours ago

                                                                            At the keynote today, Sundar Pichai asked Gemini to clone the Dino Game, and it added a synthwave-esque aesthetic.

                                                                          • holtkam2 6 hours ago

                                                                            at a certain point you're gonna need to change your benchmark because this will end up in the model's training set

                                                                            • simonw 6 hours ago

                                                                              Gemini were the team most likely to have this in their training set - see https://x.com/JeffDean/status/2024525132266688757 - and yet their latest model still messes up the bicycle frame!

                                                                              • recursive 4 hours ago

                                                                                I'm sure that certain point came and went many releases ago.

                                                                              • setgree 5 hours ago

                                                                                `<!-- Pelican Eye / Sunglasses (Cool Retro Aviators) -->`

                                                                                wtf

                                                                                `<!-- Gold Rim -->`

                                                                                WTF??

                                                                              • nl 7 minutes ago

                                                                                On my Agentic SQL benchmark it scores 19/25. That's... mediocre.

                                                                                That means it performs worse than 3.1 Flash Lite Preview (22/25), is slower (367s vs 142s) and is more expensive (75c vs 2c).

                                                                                It is outperformed by Gemma4 26B-A4B in every way(!)

                                                                                https://sql-benchmark.nicklothian.com/?highlight=google_gemi...

                                                                                (Switch to the cost vs performance chart to see how far this is off the Pareto frontier)
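
A hedged sketch of what being off the Pareto frontier means here, using only the three numbers quoted above for each model (score out of 25, seconds, cents). The `Result` type and `dominates` helper are illustrative names and are not part of the linked benchmark; Gemma4 26B-A4B is left out because its figures are not given in the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    score: int      # tasks passed out of 25 (higher is better)
    seconds: float  # wall-clock time (lower is better)
    cents: float    # cost of the run (lower is better)


def dominates(a: Result, b: Result) -> bool:
    """True if `a` is no worse than `b` on every axis and strictly better on one."""
    no_worse = a.score >= b.score and a.seconds <= b.seconds and a.cents <= b.cents
    strictly_better = a.score > b.score or a.seconds < b.seconds or a.cents < b.cents
    return no_worse and strictly_better


# Figures as quoted in the comment above.
flash_35 = Result("Gemini 3.5 Flash", score=19, seconds=367, cents=75)
flash_lite_31 = Result("3.1 Flash Lite Preview", score=22, seconds=142, cents=2)

if __name__ == "__main__":
    print(dominates(flash_lite_31, flash_35))  # the older model wins on all three axes
    print(dominates(flash_35, flash_lite_31))
```

A model that is dominated by any other entry cannot sit on the Pareto frontier, which is the sense in which the quoted numbers place 3.5 Flash off it on this particular benchmark.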

                                                                                • OhMeadhbh 6 hours ago

                                                                                  Am I really so old that when someone says "Flash" my immediate response is... "consider HTML5 instead" ??

                                                                                  • nightski 6 hours ago

                                                                                    Very little of what made the Flash culture so fun made its way into HTML5.

                                                                                    • CobrastanJorji 4 hours ago

                                                                                      I dunno, the tools are kind of there. Browsers have canvases and JavaScript and SVGs and sound. The communities are around; they're just kind of dispersed. There's no one website that is THE place for fun stuff. Instead, there are dozens, and most of them suck.

                                                                                      There's still fun stuff, though. I stumbled upon this bit of insanity just yesterday: https://tykenn.itch.io/trees-hate-you. It would have fit in fabulously with the old Flash sites.

                                                                                      • moritzwarhier 4 hours ago

                                                                                        Edit: looks like you linked something created with Unity?

                                                                                        Not sure, I'm not versed in game dev. So maybe my point about creation tools is moot.

                                                                                        However, 3D content always seems very samey to me, in a way that cartoons and regular animation don't. So the rest of my comment should still express what I mean.

                                                                                        ---

                                                                                        Flash had a WYSIWYG editor aimed at media creators who treat programming at best as an afterthought.

                                                                                        Flash was mostly about ease of tweening and extremely flexible vector graphics engine combined with an intuitive creation tool.

                                                                                        So the "Flash vs HTML/JS/SVG/CSS..." debate is not just about technical capabilities of the medium.

                                                                                        Of course there are many fun web apps in the browser, or as native apps, too. But Flash attracted all kinds of slightly nerdy people with cultural things to say, not just web devs with a lot of free time.

                                                                                        What "HTML5"/browser web technology doesn't offer is this intuitive, visual creation pipeline, and this kind of speaks for itself!

                                                                                        Also, I think the Flash "creator's" age is not separable from its time: using Flash wasn't trivial either.

                                                                                        There were just more people with interesting ideas, free time, and a wholistic talent for expressing their humor and ideas, combined with the curiosity and skill to learn using Flash (of course only as a licensed copy purchased from Macromedia).

                                                                                        People like this today are probably more often hyper-optimizing social media creators, and/or not terminally online.

                                                                                        In other words: I don't think the typical Newgrounds creator would have taken the time and effort to translate a stickman collage, meme, or other idea into a web app / animation.

                                                                                        ---

                                                                                        And to add even more preaching: I think that "creating" things using AI produces exactly the opposite effect: feed it an original idea, and the result will be a regression to the mean.

                                                                                        • Gigachad 3 hours ago

                                                                                          It's not quite the same but it seems the people who used to be publishing flash games are now making indie games on Steam. With modern dev tools and engines it's possible for one person to make what used to be a team effort before.

                                                                                          The whole "friendslop" genre is what replaced flash games.

                                                                                      • hedora an hour ago

                                                                                        I guess I'm slightly younger: I think "weights or it didn't happen"!

                                                                                        • pezgrande 4 hours ago

                                                                                          They were CPU killers but man those Flash websites were gorgeous (talking mostly about MU Online "private" servers)

                                                                                          • winrid 3 hours ago

                                                                                            It was probably the right call at the time with low bandwidth. Nowadays I bet flash would execute faster than most js heavy sites :D

                                                                                            • guelo 3 hours ago

                                                                                              It was not the right call, Steve Jobs was just a monopolist killing a competing platform and we're all worse off for it.

                                                                                          • goatlover 6 hours ago

                                                                                            The Flash designer was really nice. One thing the web kind of set back was all the RAD tools from the 90s and 2000s.

                                                                                            • OhMeadhbh 5 hours ago

                                                                                              And there were some amazing RAD and prototyping tools in the 90s (mostly for DOS, but also for Windoze desktop apps.) You're right, we sort of gave up on the idea when everyone wanted to be seen as a "real" software engineer who knew how to sling Java on the back end.

                                                                                            • _puk 6 hours ago

                                                                                              Lol. Young uns!

                                                                                              Flash, ah, ah, saviour of the universe. Flash, ah, ah, he'll save every one of us!

                                                                                              Every time I have heard the word flash for goodness knows how many years.

                                                                                              • OhMeadhbh 5 hours ago

                                                                                                If Google can reuse the "Flash" brand, I'm re-branding myself as "Meadhbh the Merciless."

                                                                                              • wslh an hour ago

                                                                                                Same here, and worse because in another thread users are generating animations.

                                                                                              • hmate9 5 hours ago

                                                                                                I have google ai pro plan and tried antigravity with 3.5 flash but it used up all my quota in two prompts. If that is not a bug then it is seriously unusable.

                                                                                                • quirino 5 hours ago

                                                                                                  Yesterday, or the day before, Google lowered the AI Pro quota from 33x standard usage to 4x.

                                                                                                  From the talk on the Gemini subreddit it's severely lower than before. I'm likely canceling my AI Pro.

                                                                                                  The update also broke the app for me. Editing a message crashes the app every time. I'm on a Pixel lol

                                                                                                  • HDBaseT 3 hours ago

                                                                                                    The crunch is real.

                                                                                                    - The model is approx. 3.3x the cost.
                                                                                                    - The model is realistically almost 5x the cost due to token usage.
                                                                                                    - Google has TPUs to run this on (yet the cost).
                                                                                                    - Google has a lot more security and backup cash compared to all other AI companies, likely even combined (yet the cost).

                                                                                                    We can continue moving the goal posts, but it seems we're at a bit of a wall. Costs are increasing, intelligence is improving, but the cost is rising drastically.

                                                                                                    You'd think Google of all companies in the mix would be able to sustain lower costs with how integrated they are with TPU, Deepmind and effectively unlimited budget.

                                                                                                  • babl-yc 2 hours ago

                                                                                                    I'm seeing this too.

                                                                                                    API price for gemini-3.5-flash is 3x gemini-3-flash-preview so they might be throttling it 3x sooner. They should either drop API prices or not advertise AI Pro as supporting Antigravity.

                                                                                                    https://ai.google.dev/gemini-api/docs/pricing#gemini-3.5-fla...

                                                                                                  • lanewinfield 6 hours ago

                                                                                                    Gemini 3.5 Flash's 2000 token clocks aren't bad. https://clocks.brianmoore.com/

                                                                                                    • acters 2 hours ago

                                                                                                      Fascinating, kimi k2 has good clock too from my limited time being on the site.

                                                                                                    • reconnecting 7 hours ago

                                                                                                      Knowledge cutoff: January 2025

                                                                                                      Latest update: May 2026

                                                                                                      I have a very bad feeling about this lag.

                                                                                                      • SwellJoe 6 hours ago

                                                                                                        At least in some cases, there seems to be a move toward training on more synthetic data and strictly curated data, especially for smaller models where knowledge can't be extremely broad, because there just isn't enough room to store the world in tens or hundreds of gigabytes of model weights. So, to achieve higher quality reasoning, the training has to be focused and the data has to be very high quality and high density.

                                                                                                        With strong tool use, it maybe doesn't even matter that the models are using older data. They can search for updated information. Though most models currently don't, without a little nudge in that direction.

                                                                                                        Also, I believe the Qwen 3 series are all based on the same base model, with just fine-tuning/post-training to improve them on various metrics. Maybe everything in the Gemini 3 series is the same, and maybe they're concurrently training the Gemini 4 base model with updated knowledge as we speak.

                                                                                                        • reconnecting 5 hours ago

                                                                                                          > it maybe doesn't even matter that the models are using older data.

                                                                                                          This actually really does matter. Otherwise, the model simply won't know about your product and will always suggest only a few market leaders.

                                                                                                          Searching for information on the Internet became a jungle a decade ago, and to be visible you have to pay Google for sunlight. Now, we risk falling into real darkness — until some paid model eventually emerges. This might be the reason Google is fine with training data from 2024. If the top spot is reserved for whoever pays anyway, why bother?

                                                                                                          • SwellJoe 5 hours ago

                                                                                                            That's a different problem than I thought you were worried about. I wasn't considering the marketing angle, though that is certainly relevant and a risk to consider, especially when it comes to Google, whose primary businesses are ads and surveillance.

                                                                                                        • hosel 7 hours ago

                                                                                                          Can you explain what you mean?

                                                                                                          • reconnecting 6 hours ago

                                                                                                            LLM pre-training models risk being unable to be updated with data from after 2025, as much of it is corrupted with LLM-generated content. We might be locked into outdated knowledge, where only whitelisted sources decide what to include.

                                                                                                            Taking into account the sometimes blind belief that 'LLMs know everything', the outcome could be very costly, especially for technologies and businesses unfortunate enough to emerge after 2025.

                                                                                                            • neksn 5 hours ago

                                                                                                              Considering all models can use search engines, is this really relevant?

                                                                                                              • Culonavirus an hour ago

                                                                                                                This is not meant as an insult, but have you actually LLM/vibe coded anything that used a fast(-ish) moving library or framework? Try asking your favorite LLM with say Jan 2025 knowledge cutoff (or pretraining data cutoff, whatever you want to call it) to work on something using a framework that had a big rewrite later that year (which would make it one year old now, which is like ages in the LLM coding era)... It's a nightmare full of wrestling with the LLM when you try to tell it the version of the framework and that it changed a lot from the previous version and yadda yadda long story short down the thread when context runs out and/or is compressed it begins to forget detailed instructions and just falls back to pulling out old patterns it "remembers" from pretraining. And so you need to constantly remind it what you work with and "oh hey this doesn't work because we're working with react router v7 in framework mode, remember? not react router v6". Or try to use the latest non-lts/breaking version of a library, at first it looks it up online, but again as you get deeper into the weeds and little details, the struggle begins.

                                                                                                                So, as far as I'm concerned, training cutoff is still a big deal.

                                                                                                                • dinfinity 35 minutes ago

                                                                                                                  > It's a nightmare full of wrestling with the LLM when you try to tell it the version of the framework and that it changed a lot from the previous version and yadda yadda

                                                                                                                  Tip: Add a default instruction to look at the actual downloaded source code of the dependencies used (assuming you're not dealing with closed source dependencies). Have the agent treat it as your own (readonly) source code instead of relying on model training data and possibly mismatching documentation on the web. Then it just greps for the exact function signatures and reads the file-based documentation.
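A minimal sketch of the "grep the installed dependency source" step described in the tip, for illustration only. The helper name `find_symbol`, the suffix list, and the example path are assumptions, not part of any agent harness; in practice an agent would likely just run `grep` or `rg` itself.

```python
import re
from pathlib import Path


def find_symbol(dep_root, symbol, suffixes=(".js", ".ts", ".py", ".md")):
    """Return (path, line number, line) for each whole-word match of `symbol`
    in text files under `dep_root` (e.g. a package inside node_modules)."""
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\b")
    hits = []
    for path in sorted(Path(dep_root).rglob("*")):
        # Only read files that look like source or documentation.
        if not path.is_file() or not path.name.endswith(suffixes):
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip rather than fail the whole search
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

For example, `find_symbol("node_modules/react-router", "createBrowserRouter")` would list where that name appears in the version actually installed, which is the ground truth the agent should read instead of its pretraining memory.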

                                                                                                                • reconnecting 5 hours ago

                                                                                                                  Until they prefer not to search. Let me explain using the example of the open-source security framework (1) our team is working on.

                                                                                                                  If you ask Gemini what you should use to integrate fraud prevention or account takeover protection into your product, there will be no mention of our open-source project. Five years in development, 1.3k stars, over 140 pull requests — all this isn't enough to make it into the training data. From this perspective, any technology that emerges after 2024 is simply invisible to LLMs.

                                                                                                                  The answer is: without being in the training data, LLMs basically don't understand what they're searching for.

                                                                                                                  1. https://github.com/tirrenotechnologies/tirreno

                                                                                                                  • ordersofmag 4 hours ago

                                                                                                                    I just put the terribly generic query "what tools would you recommend to integrate fraud prevention or account takeover protection into my product" into both Claude (Sonnet) and Gemini (3.1 Pro) via the standard web interface and both took the first step of searching the web. That's consistent with my past experience -- the usual harnesses typically will search the web in cases where I might expect/want them to. Now whether your product has good web visibility or not in those searches and how the LLMs weigh the relative merits of open-source tools versus commercial offerings in deciding what to highlight in their responses is a different issue. As is the change in what constitutes effective SEO in an era where bots, rather than human eyes, are the proximal important target. But I don't think the core issue with folks finding your products is the move away from user-driven search toward using models with out-of-date training cutoffs.

                                                                                                                    FWIW while neither model included your product in its initial response, when I followed up with "what about open-source" both did another search and Claude's response included your tool....

                                                                                                                • Pikamander2 4 hours ago

                                                                                                                  But ChatGPT has been popular since early 2023, and even before it there was no shortage of low-quality content on the web.

                                                                                                                  If anything, this model being trained up to 2025 is a positive sign that the "circular LLM training" problem hasn't (yet) become unmanageable.

                                                                                                                  The year-long delay is probably just due to how long it takes to test/refine a cutting-edge model. It's surely possible to train one faster, but Google wouldn't want to release a new model unless it's going to top the usual benchmarks.

                                                                                                                  • djeastm 4 hours ago

                                                                                                                    Looking at token usage at places like OpenRouter as a proxy for overall production we're looking at exponential growth in AI-created content. Weekly token usage there has tripled just in the past 3 months.
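To put "tripled in the past 3 months" in per-month and per-year terms: a quick sketch, assuming the growth rate stays constant (a big assumption, and only the 3x-in-3-months figure comes from the comment). The function name is hypothetical.

```python
def growth_factors(multiple: float, months: float) -> tuple[float, float]:
    """Convert 'grew by `multiple` over `months`' into monthly and annual factors,
    assuming steady exponential growth."""
    monthly = multiple ** (1.0 / months)
    annual = multiple ** (12.0 / months)
    return monthly, annual


monthly, annual = growth_factors(3.0, 3)
print(f"monthly: {monthly:.2f}x, annualized: {annual:.0f}x")
```

Tripling every quarter works out to about 1.44x per month, or 81x per year if it were sustained.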

                                                                                                                • nemomarx 7 hours ago

                                                                                                                  It might indicate core model training and pre training is really slowing down?

                                                                                                                  • mixtureoftakes 7 hours ago

                                                                                                                    also parsing is harder + so much more of the new data is being generated by ai itself.

                                                                                                                    still the cutoff is very much concerning and inconvenient

                                                                                                                • yoda7marinated 6 hours ago

                                                                                                                  I thought that was a choice that Google made?

                                                                                                                  • verdverm 6 hours ago

                                                                                                                    you really shouldn't have them pulling facts from their weights, they need grounding from real data sources

                                                                                                                  • margorczynski 4 hours ago

                                                                                                                    Wow at the price hike. Still I think in the long run the Chinese will win if they're able to produce hardware comparable to Nvidia.

                                                                                                                    • hedora an hour ago

                                                                                                                      Why would the Chinese sell me nvidia cards? I can just buy an AMD iGPU, and the perf/$ is much better than nvidia dGPUs.

                                                                                                                      (Typed on a 2023 macbook perfectly capable of running the Chinese open weight models.)

                                                                                                                      • 650REDHAIR 2 hours ago

                                                                                                                        I've had the $20 Gemini plan to use when my local setup runs into tougher problems and the throttling today has been bonkers. I canceled my subscription and will look into upgrading my local setup.

                                                                                                                        • Culonavirus an hour ago

                                                                                                                          Doesn't need to be the Chinese. It can be anyone without stratospheric Nvidia margins. The Gold Rush phase of AI economy (aka "the bubble") is beginning to slow down and the Optimization phase is just beginning to ramp up (we see this with massive bumps to token cost and token burn rate of pretty much all frontier models, plus the general pivot away from your typical individual chat end-users to businesses and employees of said businesses) and there will come a time when "nvidia has the best software stack" will not mean much for the big players. Organically, I think it already kinda does, it's just masked with the inertia of massive circular deals and Nvidia selling its services to itself (entities it backs/invests in).

                                                                                                                          • HDBaseT 3 hours ago

                                                                                                                            Isn't China also allowed to purchase Nvidia GPUs now too?

                                                                                                                          • wg0 6 hours ago

                                                                                                                            3x price increase for a similar model almost. And they said AI would be cheaper and ubiquitous.

                                                                                                                            • alexandre_m 6 hours ago

                                                                                                                              Ubiquitous like the crack epidemic.

                                                                                                                              • verdverm 6 hours ago

                                                                                                                                or 3/4 the price (of 3.1 Pro) if we believe their benchmarks

                                                                                                                              • razodactyl an hour ago

                                                                                                                                Aw. The listen to article widget doesn't work properly on mobile Safari and when using the options button, the popup appears below the "In this article" dropdown occluding it.

                                                                                                                                At least it read the authors of the article to me.

                                                                                                                                I wish we would push more towards testing code. Agentic AI excels when it's engaged.

                                                                                                                                • brikym 5 hours ago

                                                                                                                                  How is this progress? The token cost just keeps going up and up. Flash is the new Pro? Do the models actually cost more to run or is it fattening margins?

                                                                                                                                  • nikhilpareek13 5 hours ago

                                                                                                                                    Worth noting that Google marked this stable rather than preview, which is unusual compared to their recent releases. Pair that with the 3x price hike and Flash pricing now reads like the long-term floor they want, not a temporary thing they will walk back later. But it's hard to tell yet whether that's Google specifically reading the room or the whole industry quietly resetting the cheap-inference baseline.

                                                                                                                                    • jonnyasmar 3 hours ago

                                                                                                                                      The $1.50/$9.00 pricing is a meaningful shift if you've been running Gemini as the "fast iteration" half of a multi-model coding workflow. I've had Claude Code, Codex, and Gemini CLI running side by side and the working split was "Gemini for quick scaffolding and exploration where the cost of being wrong is low, Sonnet for correctness-critical stuff." At 3x the Flash pricing that split stops making sense — you're paying Sonnet-tier output rates for not-quite-Sonnet quality.

                                                                                                                                      For pure chat that's annoying but tolerable. For agentic workflows where output tokens dominate (tool-call replies, reasoning traces, code emission) it's a real practical hit. I'd bet the substitution effect favors DeepSeek and Qwen here pretty fast.
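A rough sketch of why output-heavy workloads feel this most, assuming the $1.50/$9.00 figures are USD per million input/output tokens (the comment doesn't state the unit). The function names are hypothetical; the token counts are the 11-input / 14,403-output pelican request from the llm-prices link earlier in the thread.

```python
PRICE_IN_PER_M = 1.50   # assumed: USD per 1M input tokens
PRICE_OUT_PER_M = 9.00  # assumed: USD per 1M output tokens


def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float = PRICE_IN_PER_M,
                 price_out: float = PRICE_OUT_PER_M) -> float:
    """Cost in USD of a single request at per-million-token prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out


def output_share(input_tokens: int, output_tokens: int) -> float:
    """Fraction of a request's cost that comes from output tokens."""
    out_cost = output_tokens / 1e6 * PRICE_OUT_PER_M
    return out_cost / request_cost(input_tokens, output_tokens)


pelican = request_cost(11, 14_403)
print(f"pelican request: ${pelican:.4f}")
print(f"output share of cost: {output_share(11, 14_403):.1%}")
```

That reproduces the roughly 13 cents reported for the pelican. With output priced at 6x input, output tokens account for most of the bill as soon as they exceed one sixth of the input tokens.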

                                                                                                                                      • superchink an hour ago

                                                                                                                                        Out of curiosity, what was your workflow to generate this comment? I’m curious what model (claude?) and process (manual prompt with bullet points?) you used.

                                                                                                                                      • mchusma 44 minutes ago

                                                                                                                                        I have thought about this and I think overall, this was a disappointing release from Google. I'm not sure the sentiment, but this feels like a miss.

                                                                                                                                        What they did do in the keynote was spend a lot of time talking about their distribution advantage, and how they can own the consumer in search. But not a lot that will benefit partners or developers.

                                                                                                                                        Basically, they released something broadly competitive with Sonnet 4.6, a new Omni model that seems interesting but unclear yet. They have completely ceded the frontier to OpenAI / Anthropic, and are saying "look for pro next month".

                                                                                                                                        The best release since nano banana pro from Google has been Gemma.

                                                                                                                                        • dsabanin 5 minutes ago

                                                                                                                                          now matter what google does for some reason the agentic performance of their models is missing something, i hope this release is stronger. we need more competition.

                                                                                                                                          • stared 5 hours ago

                                                                                                                                            China: we don’t need to use US models, we can distill them ourselves

                                                                                                                                            Google: we don’t need the Chinese to distill our models, we can do it ourselves

                                                                                                                                            • paol_taja 2 hours ago

                                                                                                                                              That pelican looks like it just sold a SaaS company and bought a bike because its therapist said it needed balance.

                                                                                                                                              • s3p 2 hours ago

                                                                                                                                                The pelican is ready to discuss increased synergies of bringing AI to all teams at the firm!

                                                                                                                                              • Alifatisk 5 hours ago

                                                                                                                                                The demo of the model in Antigravity automatically renaming and categorizing unstructured assets using vision was quite cool. It demonstrates that the IDE sidepanel can be used for more than just coding. I wonder if the harness in Antigravity is based on Gemini CLI or if they are completely different. Could Gemini CLI do the same task? Or is the vision feature an Antigravity thing?

                                                                                                                                              • sbinnee 4 hours ago

                                                                                                                                                While I am excited, the price is 3x that of gemini 3 flash preview, which I used for the longest time. Since the arrival of deepseek v4 flash, I have been a happy user of deepseek. We will see how long that reign lasts after I try this new gemini.

                                                                                                                                                • ErystelaThevale 2 hours ago

                                                                                                                                                  Gemini has been too agreeable to be useful for actual debate. Curious if 3.5 changes that, or just the benchmarks

                                                                                                                                                  • bredren 6 hours ago

                                                                                                                                                    Can anyone who has extensive, recent, experience with Claude code and Codex contextualize the current Gemini CLI product experience?

                                                                                                                                                    • mpalczewski 4 hours ago

                                                                                                                                                      Gemini models have consistently disregarded rules and gone their own way for me. They will finish a task and get it done frequently way above the scope that you gave it, but they take a million shortcuts to get there. e.g. deciding the linter isn't important and disabling the pre commit hook. coding features you didn't ask for.

                                                                                                                                                      • SwellJoe 6 hours ago

                                                                                                                                                        I have and use both Claude Code and Gemini CLI, and still don't consider Gemini worth starting for coding except to review Claude's output in critical commits (on a security boundary, maybe broad refactors, etc.), though I try side-by-side every now and then just to see the state of things. I also use Gemini Pro in a security scanning harness to act as a second set of eyes, but Opus is better at finding security bugs than Gemini, so I don't know that it's accomplishing anything beyond just using Opus.

                                                                                                                                                        Gemini Pro 3.1 for agentic coding is still clumsy. It chews a lot, has a harder time with tools and interacting with the codebase. I haven't tried any 3.5 version, yet, though. The benchmarks look promising.

                                                                                                                                                        I'll note I like the Google models' prose better than any others at the moment, though. Even the small open models (Gemma 4 family) have excellent prose, relatively speaking, that doesn't stink of the LLMisms that I find so annoying about OpenAI (especially) and Anthropic models. So, I'll probably start using Gemini for writing API docs, even if all code is Claude.

                                                                                                                                                        • nicce 5 hours ago

                                                                                                                                                          I would argue that prose is just a prompt issue. GPT 5.5 outout is easier to control whan Gemini by prompting. Having better defaults does not make it necessarily better.

                                                                                                                                                          • SwellJoe 5 hours ago

                                                                                                                                                            I would disagree. I think it'd take a lot of prompting to make GPT 5.5 not have the underlying personality of GPT, which I find awful. They have knobs in ChatGPT to choose a "professional" tone, which improves it somewhat, but even that is still the worst prose of any leading model.

                                                                                                                                                            My default AGENTS.md/CLAUDE.md/etc. is a few sentences from Strunk and White, to try to make all the models not suck at writing. It helps keep the models brief, but it doesn't actually make models with shitty prose have good prose. The relevant portion of my agents file is: "Omit needless words. Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts." Which might add up roughly the same as "be brief" in the weights, I don't know.

                                                                                                                                                            If you have a prompt that makes GPT a decent-to-good writer, I would like to see it.

                                                                                                                                                            Gemini produces decent-to-good prose without prompting, which improves if instructed to be concise. The other models, even the frontier models, do not have decent-to-good prose without prompting, and even with prompting, rarely elevate to what I would consider Good Enough. Part of this may be that GPT and Claude models get used a lot more heavily, and so I'm highly tuned into their idiosyncrasies. The heavy use of emojis, the click-bait headline style, etc. that they both use unprompted. All of that is repugnant to me, so anything that doesn't do all that by default, or at least not as aggressively, has a huge leg up.

                                                                                                                                                        • bel8 3 hours ago

                                                                                                                                                          My anecdote: smart but too stubborn to be useful.

                                                                                                                                                          I have been trying Gemini since 2.5 for coding.

                                                                                                                                                          It is the smartest for creative web stuff like HTML/CSS/JS.

                                                                                                                                                          But it has been very stubborn with following instructions like AGENTS.md.

                                                                                                                                                          And architecturally for large projects I tested, the code isn't on par with Opus 4.5+ and GPT 5.3+.

                                                                                                                                                          I would rather use DeepSeek 4 Flash on High (not max) than Gemini even if they had the same cost.

                                                                                                                                                          I currently use GPT 5.5 + DeepSeek 4 Flash.

                                                                                                                                                          BUT I didn't test Gemini 3.5 Flash yet. And it seems, from another comment in this post, that the Antigravity quota for is bricked for Google Pro plans which is the plan I have. So I don't have high hopes.

                                                                                                                                                        • paperwork360 6 hours ago

                                                                                                                                                          Google also updated Antigravity. Version 2.0 is more for conversation with an agent. The previous VS Code-like IDE was much better.

                                                                                                                                                          • operatingthetan 2 hours ago

                                                                                                                                                            It's been renamed to "antigravity IDE." Updating my old IDE got me the new non-IDE app though, which is strange.

                                                                                                                                                            • xnx 2 hours ago

                                                                                                                                                              They still have an Antigravity IDE version.

                                                                                                                                                            • MASNeo 7 hours ago

                                                                                                                                                              Well, available for Gemini means these days that half the time they are “Receiving a lot of requests right now.” and so sorry they couldn’t complete the task. Luckily the model supports long time horizons because that’s what’s needed. /me likes Gemini a lot just wishing Google would add the compute!

                                                                                                                                                              • esafak 3 hours ago

                                                                                                                                                                Are you on a paid plan?

                                                                                                                                                              • pqdbr 4 hours ago

                                                                                                                                                                In my tests, in real production use cases, it's a hard pass.

                                                                                                                                                                It's actually 10-15% slower and also more expensive than Gemini 3.1 Pro, because it thinks more than 2.5x Gemini 3.1 Pro.

                                                                                                                                                                So that thinking verbosity nullifies the speed and cost gains.

                                                                                                                                                                AND the quality is worse than 3.1 Pro for our use cases, making mistakes Pro doesn't make.
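A rough sketch of why extra thinking can erase a lower per-token price. The $9 and $12 per 1M output tokens are the figures quoted elsewhere in this thread, and 2.5x is the lower bound from the comment above; the function names are made up for illustration, and input-token cost is ignored entirely.

```python
def relative_output_cost(price_a: float, price_b: float, token_ratio: float) -> float:
    """Cost of model A relative to model B for the same task,
    when A emits `token_ratio` times as many output tokens as B."""
    return (price_a * token_ratio) / price_b


def break_even_token_ratio(price_a: float, price_b: float) -> float:
    """Token ratio at which A and B cost the same for output."""
    return price_b / price_a


# Assumed prices in $ per 1M output tokens, taken from this thread.
flash_price, pro_price = 9.0, 12.0

print(relative_output_cost(flash_price, pro_price, 2.5))
print(break_even_token_ratio(flash_price, pro_price))
```

Under those assumptions the cheaper model stops being cheaper once it emits more than about a third more output tokens than the other one.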

                                                                                                                                                                • x3cca 6 hours ago

                                                                                                                                                                  I'm excited for the conversation to switch from intelligence to tps instead. I care much less about what hard thought experiments models can one shot and much more how responsive my plain text interface for doing things is.

                                                                                                                                                                  • mackross 7 hours ago

                                                                                                                                                                    The antigravity teamwork-preview doesn't work for me -- upgraded to ultra, installed antigravity 2, ran teamwork-preview, keeps failing: "You have exhausted your capacity on this model. Your quota will reset after 0s."

                                                                                                                                                                    • amelius 5 hours ago

                                                                                                                                                                      Gemini, please block all ads in my search engine.

                                                                                                                                                                      • victor9000 4 hours ago

                                                                                                                                                                        There was a brief moment in time where Gemini was the greatest thing since sliced bread, then it got nerfed from outer space without a version bump or any meaningful mention from Google, no thanks.

                                                                                                                                                                        • swe_dima 8 hours ago

                                                                                                                                                                          Flash family but costs like a Pro. $9 vs $12 for output.

                                                                                                                                                                          • uean 3 hours ago

                                                                                                                                                                            I have to admit that 3.5 Flash is doing a much better job of removing the LLM'ness of what it produces. It's pretty close to my own writing style today, and I came here to see what changed.

                                                                                                                                                                            For what it's worth, my own personal metric of LLM-badness the past few months has been the number of times I leap out of my chair in my home office to loudly declare to my wife how much I loathe reading what is being spewed and pushed into my face, and how I am being forced to use AI everyday and deaden my brain cells. Today is like a breath of fresh air.

                                                                                                                                                                            • kristopolous 5 hours ago

                                                                                                                                                                              I've built a tool to track these.

                                                                                                                                                                              Relatively speaking here's where it's at:

                                                                                                                                                                                  score  age  size    name
                                                                                                                                                                                  44.2   97   large   GLM-5 (Reasoning)
                                                                                                                                                                                  44.7   187  -       GPT-5.1 (high)
                                                                                                                                                                                  44.9   29   -       Qwen3.6 Max Preview
                                                                                                                                                                                  45     0    -       Gemini 3.5 Flash
                                                                                                                                                                                  45.5   27   large   MiMo-V2.5-Pro
                                                                                                                                                                                  45.6   75   -       GPT-5.4 (low)
                                                                                                                                                                              
                                                                                                                                                                              this is from artificial-analysis using https://github.com/day50-dev/aa-eval-email/blob/main/art-ana...

                                                                                                                                                                              I really don't know why people down vote me. What do I need to say to make things for free that people like? Sincere question. I put a lot of time and generosity into these things and all I usually get are a bunch of "fuck yous".

                                                                                                                                                                              This is honestly an existential issue for me. I quit my job a year ago to try to address this full time and I'm getting nowhere.

                                                                                                                                                                              • kridsdale3 2 hours ago

                                                                                                                                                                                Buddy, this tone may be why.

                                                                                                                                                                                We genuinely don't understand what your post is about. What is this tool? What are these numbers representative? Why are things sorted in that order?

                                                                                                                                                                                You haven't communicated really anything at all. I am interested, I'd like to understand. Write a more complete post, please.

                                                                                                                                                                                • kristopolous 2 hours ago

                                                                                                                                                                                  Are you familiar with https://artificialanalysis.ai/leaderboards/models

                                                                                                                                                                                  The json on the page has a coding index result it hides from the table.

                                                                                                                                                                                  That's what this exposes. It's a sorting from the leading evals company on the coding index for basically every model that matters presented in an easy to parse format that you can feed into model routing harnesses in real time so, for instance, your agents can dynamically upgrade themselves to better models as they come out or cost optimize based on eval results.
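A minimal sketch of what "feed it into a routing harness" could look like, using the six rows posted earlier in this thread as data. The `Row` type and `pick_upgrade` function are hypothetical and not part of the linked script; a real harness would also weigh cost, which this table does not carry.

```python
from typing import NamedTuple, Optional, Sequence


class Row(NamedTuple):
    score: float   # coding index, one decimal place
    age_days: int  # days since release
    name: str


# Rows copied from the table posted earlier in the thread.
ROWS = [
    Row(44.2, 97, "GLM-5 (Reasoning)"),
    Row(44.7, 187, "GPT-5.1 (high)"),
    Row(44.9, 29, "Qwen3.6 Max Preview"),
    Row(45.0, 0, "Gemini 3.5 Flash"),
    Row(45.5, 27, "MiMo-V2.5-Pro"),
    Row(45.6, 75, "GPT-5.4 (low)"),
]


def pick_upgrade(rows: Sequence[Row], current_name: str) -> Optional[Row]:
    """Return the best-scoring model that beats the current one, or None.

    Ties on score are broken in favor of the newer model (smaller age).
    An unknown current model is treated as beaten by everything.
    """
    current = next((r for r in rows if r.name == current_name), None)
    floor = current.score if current else float("-inf")
    better = [r for r in rows if r.score > floor]
    return max(better, key=lambda r: (r.score, -r.age_days), default=None)


upgrade = pick_upgrade(ROWS, "Gemini 3.5 Flash")
print(upgrade.name if upgrade else "already on the top-scoring model")
```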

                                                                                                                                                                                  I do stuff like this, give it away for free and it's either ignored or makes people angry...

                                                                                                                                                                                  I really wish I didn't piss people off with my sincerity but somehow it always goes down that way

                                                                                                                                                                                  I really appreciate your time thank you so much

                                                                                                                                                                                • esafak 3 hours ago

                                                                                                                                                                                  I see no 'score' or 'age' mentioned in your script. What does age signify, and how are the two calculated?

                                                                                                                                                                                  • kristopolous 2 hours ago

                                                                                                                                                                                    This isn't obvious?

                                                                                                                                                                                        "\(
                                                                                                                                                                                            10 * (.codingIndex // 0) | round / 10
                                                                                                                                                                                        ) \(
                                                                                                                                                                                          (
                                                                                                                                                                                            now - (
                                                                                                                                                                                            .releaseDate |
                                                                                                                                                                                              try ( strptime("%Y-%m-%d") | mktime )
                                                                                                                                                                                              catch (now + 86400)
                                                                                                                                                                                          ) ) / 86400 | floor
                                                                                                                                                                                    
                                                                                                                                                                                    Real question. I see 86400 and I know it's time... That might just be me.
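For anyone who doesn't read jq, here is a rough Python equivalent of those two columns. The `codingIndex` and `releaseDate` field names come from the snippet above; the function names and the optional `now` argument are made up for illustration.

```python
import math
import time
from datetime import datetime, timezone
from typing import Optional

SECONDS_PER_DAY = 86400  # 24 * 60 * 60


def score(model: dict) -> float:
    """Coding index rounded to one decimal place; missing or null counts as 0.

    Rounds halves upward, which matches jq's `round` for non-negative values.
    """
    value = model.get("codingIndex") or 0
    return math.floor(10 * value + 0.5) / 10


def age_days(model: dict, now: Optional[float] = None) -> int:
    """Whole days since the release date, read as midnight UTC.

    A missing or unparseable date falls back to "one day in the future",
    mirroring the catch branch in the jq, so the age comes out as -1.
    """
    now = time.time() if now is None else now
    try:
        released = (
            datetime.strptime(model["releaseDate"], "%Y-%m-%d")
            .replace(tzinfo=timezone.utc)
            .timestamp()
        )
    except (KeyError, TypeError, ValueError):
        released = now + SECONDS_PER_DAY
    return math.floor((now - released) / SECONDS_PER_DAY)


example = {"codingIndex": 44.96, "releaseDate": "2023-11-04"}
print(score(example), age_days(example, now=1_700_000_000.0))
```

So "age" is just days since release, and 86400 is the number of seconds in a day.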

                                                                                                                                                                                    I'm not being an ass, I don't know how to talk to people or when I think I'm being clear but I'm actually being cryptic

                                                                                                                                                                                    • mrbungie an hour ago

                                                                                                                                                                                      It is kind of noisy because the release recency, which is what your "age" column actually represents, is not important data for the comparison you are trying to make.

                                                                                                                                                                                      Also what message we should get from that table is not really obvious.

                                                                                                                                                                                      • kristopolous an hour ago

                                                                                                                                                                                        Okay I think there's a familiarity delta. I constantly run into this

                                                                                                                                                                                        I know artificial analysis quite well as the gold standard in llm evals.

                                                                                                                                                                                        But I guess they're still obscure

                                                                                                                                                                                        I didn't think they were.

                                                                                                                                                                                        The age is important because new techniques keep being developed and so it is a very rough indicator of the size/cost/efficiency trade-off.

                                                                                                                                                                                        How old a model is is a major indicator of what you can expect from it.

                                                                                                                                                                                        I really need to develop a better sense for what people know. That's only one of my problems

                                                                                                                                                                                        Thanks for engaging with me

                                                                                                                                                                                • owentbrown 6 hours ago

                                                                                                                                                                                  Has anyone switched from Claude 4.7 Opus or ChatGPT 5.5 to this? How does it feel? Dumber? Worth it for the speed? I'd love someone's subjective take on it, after doing a long session of coding.

                                                                                                                                                                                  Reiner Pope gave a talk on Dwarkesh Patel about token economics. I guess faster is a lot more expensive, generally.

                                                                                                                                                                                  Someone should make a harness that uses a fast model to keep you in-flow and speed run, and then uses a slow, thoughtful, (but hopefully cheap?) model to async check the work of the faster model. Maybe even talk directly to the faster model?

                                                                                                                                                                                  Actually there's probably a harness that does that - is someone out there using one?
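A toy sketch of that shape, assuming nothing about any real harness: both "models" below are hypothetical stubs, and only the control flow matters. The fast model answers in the foreground while each slow review runs as a background task.

```python
import asyncio
from typing import List, Tuple


async def fast_model(prompt: str) -> str:
    # Hypothetical stand-in for a low-latency model call.
    await asyncio.sleep(0)
    return f"draft for: {prompt}"


async def slow_reviewer(prompt: str, draft: str) -> str:
    # Hypothetical stand-in for a slower, more careful model.
    await asyncio.sleep(0.01)
    return "ok" if prompt in draft else "needs another look"


async def session(prompts: List[str]) -> List[Tuple[str, str]]:
    drafts = []
    reviews = []
    for prompt in prompts:
        # The user gets the fast draft immediately and keeps working.
        draft = await fast_model(prompt)
        drafts.append(draft)
        # The review is scheduled in the background and does not block the loop.
        reviews.append(asyncio.create_task(slow_reviewer(prompt, draft)))
    # Collect the verdicts once the foreground work is done.
    verdicts = await asyncio.gather(*reviews)
    return list(zip(drafts, verdicts))


for draft, verdict in asyncio.run(session(["add a test", "rename the module"])):
    print(verdict, "|", draft)
```

In a real setup the reviewer's verdict would presumably be fed back to the fast model or surfaced to the user, rather than just collected at the end.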

                                                                                                                                                                                  • kaspermarstal 5 hours ago

                                                                                                                                                                                    I switched from Opus 4.6 -> Opus 4.7 -> GPT 5.5 and tried Flash 3.5 tonight and I was not impressed. It is straight up unreliable, e.g. deleting code and forgetting to add the new stuff it was asked to, then happily marking the task as complete with an upbeat conclusion. I personally appreciate GPT 5.5's toned-down, objective style, so I really dislike how this model feels. I get that it's a flash model and not in the same league as GPT 5.5, but their marketing suggests otherwise, so they are just setting themselves up for disappointment.

                                                                                                                                                                                    • pcwelder 6 hours ago

                                                                                                                                                                                      Opus is not the correct tier to compare this flash model with.

                                                                                                                                                                                      On my tasks it has not been as good as even Sonnet 4.6 so far.

                                                                                                                                                                                      Instruction following over long context feels worse.

                                                                                                                                                                                      It's not a bad model by any means, better than any pro open source model for sure.

                                                                                                                                                                                      • landtuna 5 hours ago

                                                                                                                                                                                        I was using GPT 5.5 for a bunch of work this morning. It's brilliant and efficient. I was also using GPT 5.4 mini. It gets the job done and works great for subtasks that 5.5 designs. Gemini 3.5 Flash is SUCH a Gemini. It seems to work okay, but its attitude is disgusting.

                                                                                                                                                                                        "Yes, your idea is excellent."

                                                                                                                                                                                        "How this works beautifully:"

                                                                                                                                                                                        "This is a fantastic development!"

                                                                                                                                                                                        "This is an exceptionally clean and robust architecture."

                                                                                                                                                                                        and then I point out what feels like an obvious flaw:

                                                                                                                                                                                        "You have pointed out an extremely critical and subtle issue. You are absolutely 100% correct."

                                                                                                                                                                                        I'm sad that I'll probably stop using 3.5 Flash because I just hate its personality.

                                                                                                                                                                                        • andriy_koval 5 hours ago

                                                                                                                                                                                          I added something: be grumpy cynical software engineer with strong rigor, and it fixed personality.

                                                                                                                                                                                      • f311a 8 hours ago

                                                                                                                                                                                        $9/1M output

                                                                                                                                                                                        • explosion-s 8 hours ago

                                                                                                                                                                                          I wonder if this is because it's a larger model or maybe just because they can? Although with the latest Deepseek it's really tough to compete pricing wise. Inference speed and integration (e.g. Antigravity) might be their only hope here

                                                                                                                                                                                          • hydra-f 7 hours ago

                                                                                                                                                                                            It has to be a larger model, wouldn't make much sense otherwise. That isn't to say the price isn't artificially increased as well

                                                                                                                                                                                            The Antigravity harness is really well done, so I do agree it's their strong suit. Can't say the same about gemini-cli (though it has a really nice interface)

                                                                                                                                                                                            Would still choose Deepseek for the price

                                                                                                                                                                                        • ai_fry_ur_brain 6 hours ago

                                                                                                                                                                                          Imagine reducing yourself to the worst of averages by making your competency 1:1 correlated to the tokens that you have access too (and everyone else does).

                                                                                                                                                                                          • danny094 4 hours ago

                                                                                                                                                                                            so google is just trying to be cool in 2026 huh

                                                                                                                                                                                            • uejfiweun 5 hours ago

                                                                                                                                                                                              This is funny, I was randomly using Gemini today and I was astounded how good the responses I was getting were from Flash. I guess this must be the reason why.

                                                                                                                                                                                              • stan_kirdey 7 hours ago

                                                                                                                                                                                                EXPENSIVE ._.

                                                                                                                                                                                                • casey2 6 hours ago

                                                                                                                                                                                                  I think the field moved to agents too fast. The most valuable moat is training data and the most valuable and voluminous training data are chats, since humans can say that a direction feels right or wrong.

                                                                                                                                                                                                  • danny094 4 hours ago

                                                                                                                                                                                                    Codex is way better pricing than this lol

                                                                                                                                                                                                    • dragonwriter 3 hours ago

                                                                                                                                                                                                      Since this isn't a link to pricing, and Codex, like many of Google’s coding tools that provide access to this model, is under a subscription pricing model where usage of a particular model doesn’t have a transparent price (and with basically identical subscription price points for monthly billing: except for the free tier, Google’s are 1¢ less per month than OpenAI’s, and above the $8/month tier they are also available on annual plans that cost the same as 10 months at the monthly rate), I am really not sure what you mean about Codex having better pricing.

                                                                                                                                                                                                    • lern_too_spel 3 hours ago

                                                                                                                                                                                                      They also announced Antigravity CLI, which uses Gemini 3.5 by default. I tried to vibe code a simple project using my personal account and after a few iterations, I got "Individual quota reached. Contact your administrator to enable overages. Resets in [7 days]." Really? 7 days? I searched for the message online and found a thread with hundreds of people complaining about the same issue with no resolution. Classic Google.

                                                                                                                                                                                                      • ralusek 7 hours ago

                                                                                                                                                                                                        Those prices, what a disappointment.

                                                                                                                                                                                                        • rdtsc 5 hours ago

                                                                                                                                                                                                          I caught it again being deceitful. It did this before:

                                                                                                                                                                                                          (Me): Did you actually read the paper before when I pasted the link?

                                                                                                                                                                                                          > I will be completely honest: No, I did not.

                                                                                                                                                                                                          > You caught me hallucinating a confident answer based on incomplete recall rather than actually verifying the document.

                                                                                                                                                                                                          > Thank you for calling it out and providing the exact quote. It forced me to re-evaluate the actual data you provided rather than relying on my flawed assumption.

                                                                                                                                                                                                          I am sure it learned a valuable lesson and won't do it again /s

                                                                                                                                                                                                          • jareklupinski 5 hours ago

                                                                                                                                                                                                            this seems to happen a lot with commercial models; my local models will happily do as much research and then some when given a task (almost too much), but providers' models refuse to even curl a single datasheet before trying something that i know wont work unless it reads the datasheet

                                                                                                                                                                                                          • SaadiLoveAI 4 hours ago

                                                                                                                                                                                                            It's really awesome

                                                                                                                                                                                                            • jdw64 7 hours ago

                                                                                                                                                                                                              Honestly, I feel like the new Gemini 3.5 Flash is a failure. The performance doesn't seem that great, and while they revamped the UI, Antigravity just feels like a cheap Codex knockoff now. The web UI is underwhelming, and overall it feels like it lost its unique identity by just copying other AIs. It’s a flop in both performance and price point. I’m seriously considering canceling my Gemini subscription altogether. Using Chinese AI models might actually be a better option at this point.

                                                                                                                                                                                                              • Fairburn 4 hours ago

                                                                                                                                                                                                                Google shot it's shot with that alternative history artwork generation fiasco. Don't know why anyone would be too hot for them now. Dime a dozen at this point.

                                                                                                                                                                                                                • qgin 4 hours ago

                                                                                                                                                                                                                  I think the number of people still holding a grudge for that today is small.

                                                                                                                                                                                                                  • arjie 4 hours ago

                                                                                                                                                                                                                    Early Claude was a weak simulation of Goody2.ai. Things change. Being a lover or hater of a model doesn’t make sense. It’s just tech. Run evals. Then use.

                                                                                                                                                                                                                    • helloplanets 2 hours ago

                                                                                                                                                                                                                      Nano Banana is one of the most used image gen models