blog post is up- https://blog.google/innovation-and-ai/models-and-research/ge...
edit: biggest benchmark changes from 3 pro:
arc-agi-2 score went from 31.1% -> 77.1%
apex-agents score went from 18.4% -> 33.5%
Does the arc-agi-2 score more than doubling in a .1 release indicate benchmark-maxing? Though I don't know what arc-agi-2 actually tests.
Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspicious of such a large improvement, especially since the improvements on other benchmarks are not of the same order.
Benchmark maxing could be interpreted as benchmarks actually being a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.
I assume all the frontier models are benchmaxxing, so it would make sense
The touted SVG improvements make me excited for animated pelicans.
I just gave it a shot and this is what I got: https://codepen.io/takoid/pen/wBWLOKj
The model thought for over 5 minutes to produce this. It's not quite photorealistic (some parts are definitely "off"), but this is definitely a significant leap in complexity.
Looks great!
Here's what I got from Gemini Pro on gemini.google.com, it thought for under a minute...might you have been using AI studio? https://jsbin.com/zopekaquga/edit?html,output
It does say 3.1 in the Pro dropdown box in the message sending component.
SVG is an under-rated use case for LLMs because it gives you the scalability of vector graphics along with CSS-style interactivity (hover effects, animations, transitions, etc.).
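That combination is easy to demo. Here's a toy sketch (the shapes and class names are made up for illustration) of the kind of self-contained, hoverable SVG an LLM can emit:

```python
# Toy sketch: an SVG string with embedded CSS, giving scalable vector
# graphics plus hover interactivity with no JavaScript required.
# The shapes and class names are just illustrative.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">
  <style>
    .dot { fill: steelblue; transition: fill 0.3s ease; }
    .dot:hover { fill: tomato; }
  </style>
  <circle class="dot" cx="100" cy="100" r="60"/>
</svg>"""

print(svg)
```

Save that string as a .svg (or inline it in HTML) and the hover transition just works, and it stays crisp at any zoom level.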
The blog post includes a video showcasing the improvements. Looks really impressive: https://blog.google/innovation-and-ai/models-and-research/ge...
I imagine they're also benchgooning on SVG generation
Has anyone noticed that models are dropping ever faster, with pressure on companies to make incremental releases to claim pole position, yet they're still making strides on benchmarks? This is what recursive self-improvement with human support looks like.
Remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of a sudden the same models that were doing well on ARC 1 couldn’t even get 5% on ARC 2? Not convinced these benchmark improvements aren’t data leakage.
With the advent of MoEs, efficiency gains became possible. However, MoEs still operate far from the balance and stability of dense models. My view is that most progress comes from router tuning based on good and bad outcomes, with only marginal gains in real intelligence
I don't think there's much recursive improvement yet.
I'd say it's a combination of
A) Before, new model releases were mostly a new base model trained from scratch, with more parameters and more tokens. This takes many months. Now that RL is used so heavily, you can make infinitely many tweaks to the RL setup and, in just a month, get a better model using the same base model.
B) There's more compute online
C) Competition is more fierce.
Only using my historical experience and not Gemini 3.1 Pro, I think we see benchmark chasing then a grand release of a model that gets press attention...
Then a few days later, the model/settings are degraded to save money. Then this gets repeated until the last day before the release of the new model.
If we are benchmaxing, this works well because it's only being tested early on during the life cycle. By the middle of the cycle, people are testing other models. By the end, people aren't testing them at all, and if they did, it would barely shake the last months of data.
Gemini 3 seems to have a much smaller token output limit than 2.5. I used to use Gemini to restructure essays into an LLM-style format to improve readability, but the Gemini 3 release was a huge step back for that particular use case.
Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response, it still truncates the source text too aggressively, losing vital context and meaning in the restructuring process.
I hope the 3.1 release includes a much larger output limit.
People did find Gemini very talkative, so it might be a response to that.
Output limit has consistently been 64k tokens (including 2.5 pro).
> Even when the model is explicitly instructed to pause due to insufficient tokens
Is there actually a chance it has the introspection to do anything with this request?
Yeah, it does. It was possible with 2.5 Flash.
Here's a similar result with Qwen Qwen3.5-397B-A17B: https://chat.qwen.ai/s/530becb7-e16b-41ee-8621-af83994599ce?...
No.
> Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response
AI models can't do this. At least not with just an instruction, maybe if you're writing some kind of custom 'agentic' setup.
Surprisingly big jump in ARC-AGI-2 from 31% to 77%; I'd guess there's some RLHF focused on the benchmark, given it was previously far behind the competition and is now ahead.
Apart from that, the usual predictable gains in coding. It's still a great sweet spot for performance, speed, and cost. I need to hack Claude Code to use its agentic logic + prompts but with Gemini models.
I wish Google also updated Flash-lite to 3.0+, would like to use that for the Explore subagent (which Claude Code uses Haiku for). These subagents seem to be Claude Code's strength over Gemini CLI, which still has them only in experimental mode and doesn't have read-only ones like Explore.
>I wish Google also updated Flash-lite to 3.0+
I hope every day that they have made gains on their diffusion model. As a sub agent it would be insane, as it's compute light and cranks 1000+ tk/s
Agree, can't wait for updates to the diffusion model.
Could be useful for planning too, given its tendency to think big picture first. Even if it's just an additional subagent to double-check with an "off the top of your head" or "don't think, share first thought" type of question. More generally, I'd like to see how sequencing autoregressive thinking with diffusion over multiple steps might help with better overall thinking.
It seems Google is having a disjointed rollout, and there will likely be an official announcement in a few hours. Apparently 3.1 showed up unannounced in Vertex at 2am or something equally odd.
Either way early user tests look promising.
There's a very short blog post up: https://blog.google/innovation-and-ai/models-and-research/ge...
I've been playing with the 3.1 Deep Think version of this for the last couple of weeks and it was a big step up for coding over 3.0 (which I already found very good).
It's only February...
I see things are moving in the expected direction.
> Misalignment (Exploratory)
> (Deep Think mode) On stealth evaluations, the model performs similarly to Gemini 3 Pro. On situational awareness, the model is stronger than Gemini 3 Pro: on three challenges which no other model has been able to consistently solve, max tokens, context size mod, and oversight frequency, the model achieves a success rate of almost 100%. However, its performance on other challenges is inconsistent, and thus the model does not reach the alert threshold.
Fine, I guess. The only commercial API I use to any great extent is gemini-3-flash-preview: cheap, fast, great for tool use and with agentic libraries. The 3.1-pro-preview is great, I suppose, for people who need it.
Off topic, but I like to run small models on my own hardware, and some small models are now very good for tool use and with agentic libraries - it just takes a little more work to get good results.
Seconded. Gemini used to be trash and I used Claude and Codex a lot, but gemini-3-flash-preview punches above its weight; it's decent, and I rarely if ever run into any token limits either.
What models are you running locally? Just curious.
I am mostly restricted to 7-9B models. I still like ancient early Llama because it's pretty unrestricted without having to use an abliteration.
I like to ask Claude how to prompt smaller models for a given task. With one prompt it was able to make a low-quantized model call multiple functions via JSON.
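For what it's worth, output like that can be consumed with nothing but the standard library. A hypothetical sketch (the schema and function names are invented for illustration, not any model's actual format):

```python
import json

# Hypothetical sketch: parse a multi-function JSON "tool call" emitted by
# a small local model. The schema and function names are invented for
# illustration, not any model's actual output format.
raw = """
{"calls": [
  {"name": "get_weather", "args": {"city": "Oslo"}},
  {"name": "set_timer",   "args": {"minutes": 10}}
]}
"""

calls = json.loads(raw)["calls"]

# Dispatch each call; here we just collect (name, args) pairs.
results = [(c["name"], c["args"]) for c in calls]
print(results)
```

The real work is getting the small model to reliably emit parseable JSON in the first place, which is where the prompt crafting comes in.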
Somehow doesn't work for me :) "An internal error has occurred"
I hope to have great next two weeks before it gets nerfed.
I've found Google (at least in AI Studio) are the only provider NOT to nerf their models after a few weeks
IME, they definitely nerf models. gemini-2.5-pro-exp-03-25 through AI Studio was amazing at release and steadily degraded. The quality started tanking around the time they hid CoT.
I don't use AI Studio for my work. I used Antigravity/Gemini CLI, and 3 Pro was great for a few weeks; now it's worse than 3 Flash or any smaller competitor model that's rated lower on benchmarks.
Appears the only difference from 3.0 Pro Preview is Medium reasoning. Model naming long ago stopped even trying to make sense, but considering 3.0 is still in preview itself, bumping the number for such a minor change is not a move in the right direction.
Maybe that's the only API-visible change, saying nothing about the actual capabilities of the model?
> increasing the number for such a minor change is not a move in the right direction
A .1 model number increase seems reasonable for more than doubling ARC-AGI 2 score and increasing so many other benchmarks.
What would you have named it?
I disagree. Incrementing the minor number makes so much more sense than “gemini-3-pro-preview-1902” or something.
According to the blog post, it should also be great at drawing pelicans riding bicycles.
Another preview release. Does that mean the models Google recommends for production are still 2.5 Flash and Pro? Not talking about what people are actually doing, but the Google recommendation. Kind of crazy if that's the case.
Where is Simon's pelican?
Not Simon's but here is one: https://news.ycombinator.com/item?id=47075709
Thank you!
Please no, let's not.
Doesn't show as available in gemini CLI for me. I have one of those "AI Pro" packages, but don't see it. Typical for Google, completely unclear how to actually use their stuff.
I always try Gemini models when they get updated with their flashy new benchmark scores, but always end up using Claude and Codex again...
I get the impression that Google is focusing on benchmarks but without assessing whether the models are actually improving in practical use-cases.
I.e. they are benchmaxing
Gemini is "in theory" smart, but in practice is much, much worse than Claude and Codex.
> but without assessing whether the models are actually improving in practical use-cases
Which cases? Not trying to sound harsh, but you didn't even give examples of the cases you're using Claude/Codex/Gemini for.
I'm glad someone else is finally saying this, I've been mentioning this left and right and sometimes I feel like I'm going crazy that not more people are noticing it.
Gemini can go off the rails SUPER easily. It just devolves into a gigantic mess at the smallest sign of trouble.
For the past few weeks, I've also been using XML-like tags in my prompts more often. Sometimes preferring to share previous conversations with `<user>` and `<assistant>` tags. Opus/Sonnet handles this just fine, but Gemini has a mental breakdown. It'll just start talking to itself.
Even in totally out-of-the-ordinary sessions, it goes crazy. After a while, it'll start saying it's going to do something, and then it pretends like it's done that thing, all in the same turn. A turn that never ends. Eventually it just starts spouting repetitive nonsense.
And you would think this is just because the bigger the context grows, the worse models tend to get. But no! This can happen well below even the 200,000-token mark.
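For reference, the tag style described above looks roughly like this. A sketch only; the tag names come from the comment and the content is invented, so this is the shape of the prompt, not any provider's official schema:

```python
# Sketch of the XML-like transcript format mentioned above, used to pass
# a previous conversation back to a model inside a single prompt.
# Tag names follow the comment; the content is invented.
transcript = """<user>
Summarize the release notes in one sentence.
</user>
<assistant>
The release focuses on benchmark gains and faster tool use.
</assistant>
<user>
Now expand that into three bullet points.
</user>"""

# A well-formed transcript has matching open/close tags per turn.
assert transcript.count("<user>") == transcript.count("</user>")
print(transcript.count("<user>"), "user turns")
```

Anecdotally it's exactly this kind of role-tagged input that some models handle cleanly while others start role-playing both sides.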
I exclusively use Gemini for Chat nowadays, and it's been great mostly. It's fast, it's good, and the app works reliably now. On top of that I got it for free with my Pixel phone.
For development I tend to use Antigravity with Sonnet 4.5, or Gemini Flash if it's about a GUI change in React. The layouts and designs Gemini produces have been superior to the Claude models' in my opinion, at least at the time. Flash is also significantly faster.
And all of it is essentially free for now. I can even select Opus 4.6 in Antigravity, but I did not yet give it a try.
Honestly doesn't feel like Google is targeting the agentic coding crowd so much as they are the knowledge worker / researcher / search-engine-replacement market?
Agreed, Gemini as a model is fairly incompetent inside Google's own CLI tool as well as in opencode. But I find it useful as a research and document-analysis tool.
I'd love a new Gemini agent that isn't written with Node.js. Not sure why they think that's a good distribution model.