Qwen3.5: Towards Native Multimodal Agents (qwen.ai)
Submitted by danielhanchen 5 hours ago
  • dash2 an hour ago

    You'll be pleased to know that it chooses "drive the car to the wash" on today's latest embarrassing LLM question.

    • WithinReason 40 minutes ago

      Is that the new pelican test?

    • simonw an hour ago
      • moffers 11 minutes ago

        I like the little spot colors it put on the ground

        • tarruda an hour ago

          At this point I wouldn't be surprised if your pelican example has leaked into most training datasets.

          I suggest switching to a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D

          • embedding-shape 42 minutes ago

            How many times do you run the generation, and how do you choose which example to ultimately post and share with the public?

            • canadiantim 33 minutes ago

              42

          • danielhanchen 4 hours ago

            For those interested, made some MXFP4 GGUFs at https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF and a guide to run them: https://unsloth.ai/docs/models/qwen3.5
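
            If you just want a quick local smoke test outside of the guide, a rough llama-cpp-python sketch would look something like the below (the filename pattern is a guess, check the repo for the actual shard names):

              from llama_cpp import Llama  # pip install llama-cpp-python

              # Pulls the GGUF from the repo above and loads it; keep n_ctx modest at first.
              llm = Llama.from_pretrained(
                  repo_id="unsloth/Qwen3.5-397B-A17B-GGUF",
                  filename="*MXFP4*.gguf",   # guessed pattern, use the real shard name
                  n_ctx=32768,
                  n_gpu_layers=-1,           # offload everything that fits
              )

              out = llm.create_chat_completion(
                  messages=[{"role": "user", "content": "Say hello in one sentence."}],
                  max_tokens=64,
              )
              print(out["choices"][0]["message"]["content"])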

            • bertili 2 hours ago

              Last Chinese New Year we would not have predicted a Sonnet 4.5-level model that runs locally, and fast, on a 2026 M5 Max MacBook Pro, but it's now a real possibility.

              • hmmmmmmmmmmmmmm 6 minutes ago

                Yeah, I wouldn't get too excited. If the rumours are true, they are training on outputs from frontier models to achieve these benchmarks.

                • echelon 42 minutes ago

                  I hope China keeps making big open weights models. I'm not excited about local models. I want to run hosted open weights models on server GPUs.

                  People can always distill them.

                  • halJordan 15 minutes ago

                    They'll keep releasing them until they overtake the market or the government loses interest. Alibaba probably has staying power, but not companies like DeepSeek's owner.

                  • lostmsu an hour ago

                    Will the 2026 M5 MacBook come with 390+ GB of RAM?

                    • alex43578 an hour ago

                      Quants will push it below 256GB without completely lobotomizing it.
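
                      Back-of-the-envelope, assuming roughly 4.5 bits per weight on average for an MXFP4-style quant (my guess, not a measured number) and ignoring KV cache and runtime overhead:

                        params = 397e9           # total parameters
                        bits_per_weight = 4.5    # assumed average, incl. higher-precision layers
                        gb = params * bits_per_weight / 8 / 1e9
                        print(f"~{gb:.0f} GB of weights")   # ~223 GB, so under 256GB is plausible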

                      • bertili an hour ago

                        Most certainly not, but the Unsloth MLX version fits in 256GB.

                        • embedding-shape an hour ago

                          Curious what the prefill and token generation speeds are. Apple hardware already seems embarrassingly slow for the prefill step, and OK for token generation, but that's with way smaller models (1/4 the size), so at this size? It might fit, but I'm guessing it would be all but unusable, sadly.

                    • tarruda an hour ago

                      Would love to see a Qwen 3.5 release in the 80-110B range, which would be perfect for 128GB devices. While Qwen3-Next is 80B, it unfortunately doesn't have a vision encoder.

                      • gunalx an hour ago

                        Sad to not see smaller distills of this model released alongside the flagship. That has historically been why I liked Qwen releases (lots of different sizes to pick from, from day one).

                      • Matl 10 minutes ago

                        Is it just me, or are the 'open source' models increasingly impractical to run on anything other than massive cloud infra, at which point you may as well go with the frontier models from Google, Anthropic, OpenAI, etc.?

                        • regularfry 3 minutes ago

                          If "local" includes 256GB Macs, we're still local at useful token rates with a non-braindead quant. I'd expect there to be a smaller version along at some point.

                        • mynti 3 hours ago

                          Does anyone know what kind of RL environments they are talking about? They mention they used 15k environments. I can think of maybe a couple hundred that make sense to me, but what is filling out that large number?

                          • robkop an hour ago

                            Rumours say you do something like:

                              Download every github repo
                                -> Classify if it could be used as an env, and what types
                                  -> Issues and PRs are great for coding rl envs
                                  -> If the software has a UI, awesome, UI env
                                  -> If the software is a game, awesome, game env
                                  -> If the software has xyz, awesome, ...
                                -> Do more detailed run checks
                                  -> Can it build
                                  -> Is it complex and/or distinct enough
                                  -> Can you verify if it reached some generated goal
                                  -> Can generated goals even be achieved
                                  -> Maybe some human review - maybe not
                                -> Generate goals
                                  -> For a coding env you can imagine you may have a LLM introduce a new bug and can see that test cases now fail. Goal for model is now to fix it
                                ... Do the rest of the normal RL env stuff
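
                            In plain Python, that filtering loop might look roughly like the sketch below; every helper is a hypothetical stand-in for an LLM call or a sandboxed build/test harness:

                              def classify_repo(repo):            # LLM: usable as an env? what type?
                                  return "coding"

                              def can_build(repo):                # detailed run check: does it even build?
                                  return True

                              def generate_goals(repo, kind):     # e.g. inject a bug so tests fail; goal = fix it
                                  return [f"make the test suite of {repo} pass again"]

                              def goal_is_achievable(repo, goal): # can the goal be reached and verified?
                                  return True

                              def build_envs(repos):
                                  envs = []
                                  for repo in repos:
                                      kind = classify_repo(repo)
                                      if kind is None or not can_build(repo):
                                          continue
                                      goals = [g for g in generate_goals(repo, kind) if goal_is_achievable(repo, g)]
                                      if goals:
                                          envs.append({"repo": repo, "kind": kind, "goals": goals})
                                  return envs

                              print(build_envs(["github.com/example/project"]))
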
                            • NitpickLawyer an hour ago

                              The real real fun begins when you consider that with every new generation of models + harnesses they become better at this. Where better can mean better at sorting good / bad repos, better at coming up with good scenarios, better at following instructions, better at navigating the repos, better at solving the actual bugs, better at proposing bugs, etc.

                              So then the next next version is even better, because it got more data / better data. And it becomes better...

                              This is mainly why we're seeing so many improvements, so fast (month to month now, versus every 3 months ~6 months ago, and every 6 months ~1 year ago). It becomes a literal "throw money at the problem" type of improvement.

                              For anything that's "verifiable" this is going to continue. For anything that is not, things can also improve with concepts like "LLM as a judge" and "council of LLMs". Slower, but it can still improve.

                              • alex43578 an hour ago

                                Judgement-based problems are still tough - LLM-as-a-judge might just bake the earlier models' biases in even deeper. Imagine if ChatGPT judged photos: anything yellow would win.

                                • NitpickLawyer 32 minutes ago

                                  Agreed. Still tough, but my point was that we're starting to see that combining methods works. The models are now good enough to create rubrics for judgement stuff. Once you have rubrics you have better judgements. The models are also better at taking pages / chapters from books and "judging" based on those (think logic books, etc). The key is that capabilities become additive, and once you unlock something, you can chain that with other stuff that was tried before. That's why test time + longer context -> IMO improvements on stuff like theorem proving. You get to explore more, combine ideas and verify at the end. Something that was very hard before (i.e. very sparse rewards) becomes tractable.
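
                                  As a toy sketch of the rubric idea (llm() here is a stand-in for any chat-completion call, and the hard-coded returns are only there so it runs):

                                    def llm(prompt):
                                        # placeholder: a real version would call a model here
                                        return "1" if "Score" in prompt else "- states the claim\n- cites evidence\n- no contradictions"

                                    def make_rubric(task):
                                        return llm(f"Write a short grading rubric for: {task}").splitlines()

                                    def judge(answer, rubric):
                                        # scoring each rubric item separately turns one fuzzy judgement
                                        # into several near-verifiable checks
                                        scores = [float(llm(f"Item: {item}\nAnswer: {answer}\nScore 0 or 1:")) for item in rubric]
                                        return sum(scores) / len(scores)

                                    rubric = make_rubric("summarise a legal contract")
                                    print(judge("some model output...", rubric))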

                            • yorwba 2 hours ago

                              Every interactive system is a potential RL environment. Every CLI, every TUI, every GUI, every API. If you can programmatically take actions to get a result, and the actions are cheap, and the quality of the result can be measured automatically, you can set up an RL training loop and see whether the results get better over time.
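
                              A minimal sketch of what that loop looks like; the env and policy here are toys, but the reset/step/score shape is the whole idea:

                                class CliEnv:
                                    """Stand-in for any interactive system: a CLI, an API, a GUI driver..."""
                                    def reset(self):
                                        self.output = ""
                                        return self.output
                                    def step(self, action):
                                        # run the action against the real system, observe the result
                                        self.output += action
                                        return self.output
                                    def score(self):
                                        # automatic quality measure, e.g. tests passing or output matching a spec
                                        return float("target" in self.output)

                                def rollout(env, policy, max_steps=5):
                                    obs = env.reset()
                                    for _ in range(max_steps):
                                        obs = env.step(policy(obs))
                                    return env.score()   # this reward is what the RL update would use

                                # trivial "policy" just so the loop runs end to end
                                print(rollout(CliEnv(), policy=lambda obs: "target"))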

                            • ggcr 4 hours ago

                              From the HuggingFace model card [1] they state:

                              > "In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use."

                              Anyone know more about this? The OSS version seems to have a 262144 context length; I guess for the 1M they'll ask you to use YaRN?

                              [1] https://huggingface.co/Qwen/Qwen3.5-397B-A17B

                              • NitpickLawyer 4 hours ago

                                Yes, it's described in this section - https://huggingface.co/Qwen/Qwen3.5-397B-A17B#processing-ult...

                                YaRN, but with some caveats: current implementations might reduce performance on short contexts, so only use YaRN for long tasks.
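
                                For reference, on earlier Qwen releases that meant a rope_scaling override along these lines; the factor and original length below are guesses for this model, the linked model-card section has the real values:

                                  from transformers import AutoConfig, AutoModelForCausalLM

                                  cfg = AutoConfig.from_pretrained("Qwen/Qwen3.5-397B-A17B")
                                  cfg.rope_scaling = {
                                      "rope_type": "yarn",
                                      "factor": 4.0,                               # guess: ~262k native -> ~1M
                                      "original_max_position_embeddings": 262144,  # guess, check the model card
                                  }
                                  # Note the caveat above: static YaRN can hurt short-context quality,
                                  # so only enable it for genuinely long tasks.
                                  model = AutoModelForCausalLM.from_pretrained(
                                      "Qwen/Qwen3.5-397B-A17B", config=cfg, device_map="auto"
                                  )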

                                Interesting that they're serving both on OpenRouter, and the -Plus is a bit cheaper for <256k ctx. So they must have more inference goodies packed in there (proprietary).

                                We'll see where the 3rd-party inference providers settle wrt cost.

                                • ggcr 3 hours ago

                                  Thanks, I totally missed that.

                                  It's basically the same as with the Qwen2.5 and Qwen3 series, but this time with 1M context and 256K native, yay :)

                                • danielhanchen 4 hours ago

                                  Unsure, but yes, most likely they use YaRN, and maybe trained a bit more on long context (or not).

                                • Alifatisk an hour ago

                                  Wow, the Qwen team is pushing out content (models + research + blog posts) at an incredible rate! Looks like omni-modality is their focus? The benchmarks look intriguing, but I can't stop thinking of the HN comments about Qwen being known for benchmaxing.

                                  • trebligdivad an hour ago

                                    Anyone else getting an automatically downloaded PDF 'ai report' when clicking on this link? It's damn annoying!

                                    • lollobomb an hour ago

                                      Yes, but does it answer questions about Tiananmen Square?

                                      • Zetaphor 6 minutes ago

                                        Why is this important to anyone actually trying to build things with these models?

                                      • ddtaylor an hour ago

                                        Does anyone know the SWE-bench scores?

                                        • isusmelj 2 hours ago

                                          Is it just me or is the page barely readable? Lots of text is light grey on a white background. I might have "dark" mode on in Chrome + macOS.

                                          • Jacques2Marais an hour ago

                                            Yes, I also see that (also using dark mode on Chrome without Dark Reader extension). I sometimes use the Dark Reader Chrome extension, which usually breaks sites' colours, but this time it actually fixes the site.

                                            • thunfischbrot an hour ago

                                              That seems fine to me. I'm more annoyed by the 2.3MB PNGs of tabular data. And if you open them at 100% zoom, they're extremely blurry.

                                              Whatever workflow led to that?

                                              • dryarzeg 2 hours ago

                                                I'm using Firefox on Linux, and I see white text on a dark background.

                                                > I might have "dark" mode on in Chrome + macOS.

                                                Probably that's the reason.