Phi-4 Bug Fixes (unsloth.ai) | Submitted by danielhanchen 12 hours ago
  • danielhanchen 12 hours ago

    Hey HN family! I found a few bugs in Phi-4, Microsoft's latest MIT-licensed LLM, which is claimed to be on par with GPT-4o mini

    1. End of sentence should be <|im_end|> not <|endoftext|>

    2. Chat template should not auto add an assistant prompt

    3. Padding token should not be EOS but <|dummy_87|>
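Fixes 1 and 3 can be sketched as a patch over a tokenizer config dict (mirroring the `eos_token`/`pad_token` fields of a Hugging Face `tokenizer_config.json`; fix 2 lives in the chat template string itself, so it is not shown here):

```python
def apply_phi4_token_fixes(config: dict) -> dict:
    """Return a copy of a tokenizer config with the EOS and padding fixes applied."""
    fixed = dict(config)
    fixed["eos_token"] = "<|im_end|>"    # fix 1: end of turn, not <|endoftext|>
    fixed["pad_token"] = "<|dummy_87|>"  # fix 3: padding must not equal EOS,
                                         # or loss masking can hide real EOS tokens
    return fixed

broken = {"eos_token": "<|endoftext|>", "pad_token": "<|endoftext|>"}
fixed = apply_phi4_token_fixes(broken)
print(fixed["eos_token"], fixed["pad_token"])
```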

    I also converted Phi-4 to Llama-arch. I uploaded GGUFs, 4bit quants, dynamic quants and all fixes to https://huggingface.co/unsloth

    I also made a Colab notebook to finetune Phi-4 on a free GPU: https://colab.research.google.com/github/unslothai/notebooks...

    • CGamesPlay 9 hours ago

      > We converted Phi-4 to Llama’s architecture for better accuracy and easier use.

      What does this mean? When I think about "model architecture", I think about the number of weights in each layer, the organization of the layers, etc. And AFAIK, it's untenable to "port" a model from one to the other without effectively retraining it. So what does it actually mean to "convert to Llama's architecture"?

      • danielhanchen 8 hours ago

        Oh, Phi-4's architecture is inspired by Llama itself, except they merged the attention matrices into one large matrix for better FLOP utilization, and likewise merged the gate/up matrices in the MLP.

        Phi-3 used to use sliding window attention, but they got rid of that in Phi-4.

        So, you can "Mistral-fy" Phi-3 and convert it to Mistral arch (by unmerging the merges), and now you can "Llama-fy" Phi-4 to Llama arch.
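A minimal sketch of the "unmerging": a fused QKV weight stacks the three projections along the output dimension, so Llama-fying amounts to slicing it back apart (plain row-lists here stand in for real tensors, and equal Q/K/V widths are assumed):

```python
def unmerge_qkv(w_qkv: list, d: int) -> tuple:
    """Split a fused QKV weight (3*d output rows) into separate Q, K, V blocks.
    Assumes Q, K, V each have d rows, i.e. no grouped-query sharing."""
    assert len(w_qkv) == 3 * d, "fused weight must stack exactly Q, K, V"
    return w_qkv[:d], w_qkv[d:2 * d], w_qkv[2 * d:]

# Toy example with d=2: each inner list is one output row of the projection.
fused = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]]
q, k, v = unmerge_qkv(fused, d=2)
print(q)  # [[1, 1], [2, 2]]
```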

        The reason accuracy increases during finetuning is that LoRA on a merged QKV learns only one A matrix, whereas unmerging it creates three A matrices - this gives the model more freedom to learn new features.
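That extra freedom is easy to count. With LoRA rank r on a d -> 3d fused QKV you learn a single r x d A matrix; after unmerging, Q, K, and V each get their own A. A sketch with illustrative sizes (not Phi-4's real dimensions):

```python
def lora_params(d_in: int, d_out: int, r: int) -> tuple:
    """Parameter counts (A, B) for one LoRA adapter on a d_in -> d_out projection."""
    return r * d_in, d_out * r

d, r = 4096, 16  # illustrative hidden size and LoRA rank

# Merged QKV: one d -> 3d projection shares a single A matrix.
a_merged, b_merged = lora_params(d, 3 * d, r)

# Unmerged: Q, K, V each get their own d -> d adapter, so three A matrices.
a_split = 3 * lora_params(d, d, r)[0]
b_split = 3 * lora_params(d, d, r)[1]

print(a_split // a_merged)  # 3: three times the A parameters to learn with
```

The B parameter count is unchanged by the split; only the A side gains capacity.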

        • Sn0wCoder 9 hours ago

          Would guess GGUF so you can run on llama.cpp, LM Studio, etc., but OP can hopefully clarify further for you.

          • danielhanchen 8 hours ago

            Yep converting to Llama arch definitely makes accessibility much better - also many fast LLM serving libraries normally support Llama, so it makes it easier to port and use!

        • sunaookami 11 hours ago

          Wasn't Phi-3 also bugged/is still bugged? Seems like Microsoft just doesn't care.

          >to be on par with GPT-4o mini

          Phi is known to overfit benchmarks. It's way, way worse than that.

          • danielhanchen 9 hours ago

            Phi-3 should be fixed as well - but yes, it also had bugs! https://x.com/danielhanchen/status/1782853167572832650

            Phi-3's sliding window should be 2048 and not 2047, and they also had chat template issues - I uploaded correct versions to https://huggingface.co/unsloth/Phi-3.5-mini-instruct
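The sliding window fix is a one-line patch against the standard `sliding_window` field of a model's `config.json` (a sketch, not the actual upload script):

```python
def fix_phi3_sliding_window(config: dict) -> dict:
    """Correct the off-by-one sliding window: 2047 -> 2048."""
    fixed = dict(config)
    if fixed.get("sliding_window") == 2047:
        fixed["sliding_window"] = 2048
    return fixed

print(fix_phi3_sliding_window({"sliding_window": 2047}))  # {'sliding_window': 2048}
```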

            • throwaway314155 10 hours ago

              Anecdotally, I've been experimenting with Phi-4 the past hour or so (so, yeah, not very comprehensive) and it's certainly a strong model. Definitely better than the previous Phi models.

              • danielhanchen 9 hours ago

                Yep Phi-4 definitely is better than Phi-3.5!

            • simonw 10 hours ago

              Huh! That may explain why I kept on getting visible <|im_end|> output when I tried running a Phi-4 GGUF file using llama.cpp.

              • danielhanchen 9 hours ago

                Oh yes exactly! I trimmed it out now :)

                The better chat template should be:

                {% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}
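For anyone who doesn't read Jinja, here is a pure-Python rendering equivalent to that template (a sketch for illustration, not the tokenizer's actual code path):

```python
def render_phi4(messages: list, add_generation_prompt: bool = False) -> str:
    """Render a chat the way the Jinja template above does: every system/user/
    assistant turn becomes <|im_start|>{role}<|im_sep|>{content}<|im_end|>,
    and the assistant header is appended only when explicitly requested."""
    out = ""
    for m in messages:
        if m["role"] in ("system", "user", "assistant"):
            out += f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>"
    if add_generation_prompt:
        out += "<|im_start|>assistant<|im_sep|>"
    return out

msgs = [{"role": "user", "content": "Hi"}]
print(render_phi4(msgs, add_generation_prompt=True))
# -> <|im_start|>user<|im_sep|>Hi<|im_end|><|im_start|>assistant<|im_sep|>
```

Note that with `add_generation_prompt=False` nothing is appended, which is exactly bug fix 2: the template must not auto-add an assistant prompt.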

              • sroussey 8 hours ago

                Can you convert to ONNX so I can try in web browser?

            • danielhanchen 7 hours ago

              Update: The Phi-4 team is actively working on adding all our fixes into the original model! https://huggingface.co/microsoft/phi-4/discussions/21

              • t1amat 10 hours ago

                Daniel’s fixes to Phi-4 make it the best scoring Phi-4 on HF’s Open LLM Leaderboard. Great job on that.

                Unsloth is a masterpiece, keep up the great work!

                • danielhanchen 9 hours ago

                  Thanks a lot!

                • NooneAtAll3 6 hours ago

                  • danielhanchen 6 hours ago

                    Sorry are there some issues with our website?

                    • NooneAtAll3 2 hours ago

                      yep, it appears for a second - then displays only this :(

                      • danielhanchen an hour ago

                        Oh no :( Do you know which device / platform?

                  • excerionsforte 7 hours ago

                    Available on Ollama already: https://ollama.com/vanilj/phi-4-unsloth

                    • tandr 2 hours ago

                      Looking at the "original" Phi-4 on Ollama, it looks like they have fixed the parameters issue for im_start/im_end.

                      • danielhanchen 7 hours ago

                        Oh fabulous! :)

                      • lostmsu 11 hours ago

                        The benchmark results of the model before and after the "fixes" do not match numbers reported in the model card: https://huggingface.co/microsoft/phi-4

                        According to Microsoft, the MATH score should be 80.4, while both the original and the "fixed" models, as run by unsloth, score just over 12.3. So either Microsoft made a few huge mistakes, or unsloth was not able to run their model correctly.

                        • danielhanchen 9 hours ago

                          Oh yes I found this to be a bit strange - I uploaded our versions and Microsoft's own version to Hugging Face's public LLM leaderboard - https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

                          You can see Microsoft's own original Phi-4 scores 12.31% - I'm unsure why. My fixes at least push it to 20%.

                          It's possibly because HF's benchmark uses "Scoring: Exact match: Was the solution generated correct and in the expected format" - that strict format requirement might be the issue.
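To illustrate why strict scoring can tank a score: under exact match, a correct answer in an unexpected format counts as wrong (a simplified sketch of the scoring rule quoted above, not the leaderboard's actual harness):

```python
def exact_match_score(prediction: str, reference: str) -> bool:
    """Strict exact match: the answer must be correct AND in the expected format."""
    return prediction.strip() == reference.strip()

# The right value in the wrong format still scores zero:
print(exact_match_score("42", "42"))                 # True
print(exact_match_score("The answer is 42.", "42"))  # False
```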

                        • adultSwim 9 hours ago

                          Are there alternatives to unsloth?

                          I would love to use it but the open/free version only handles one GPU, and it's unclear how much the paid version would cost. I have some limited access to multiple older NVidia cards and would love to make better use of them while I'm still learning. My budget for learning/projects is rather modest.

                          Hopefully they succeed. At work I could make a strong case for going with them as they allow keeping data local only, instead of relying on an API.

                          • danielhanchen 8 hours ago

                            Multi GPU support is definitely coming to Unsloth OSS! Our goal was to release it this month, but unsure on exact timelines - maybe next month!!

                            • adultSwim 5 hours ago

                              Thank you!

                              • danielhanchen an hour ago

                                I'll ping you when it comes along!

                          • make3 9 hours ago

                            "Yes it improves performance!" proceeds to show the most unconvincing stats ever

                            you can probably blow on your GPU and get a similar performance change

                          • TZubiri 9 hours ago

                            Ah yes, drawing ASCII art, the de facto benchmark for evaluating LLM quality.

                            • danielhanchen 9 hours ago

                              Anecdotal evidence was provided to show that some Redditors tested it out - but I do agree it's not rigorous to show that as an example - so I uploaded our fixed versions to Hugging Face's public LLM leaderboard here: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_... - this shows the fixes do in fact work!

                            • wsintra2022 8 hours ago

                              >Reddit comments show our fixes make Phi-4 inference much better

                              I’d like to try ‘Reddit comments show my fixes make app better’ in my next review

                              • danielhanchen 7 hours ago

                                Fixed versions are also independently scored by Hugging Face's Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

                                The Reddit LocalLlama community is actually pretty cool - tonnes of research actually comes from the community - for example kaiokendev's linear RoPE scaling, YaRN, NTK-Aware RoPE Scaling, many LLM benchmarks - many researchers use LocalLlama to share research and discuss new stuff.

                                I know a lot of AI researchers use the "LocalLlama vibe check" which essentially is an anecdotal approach to LLM evaluation - ie instead of relying on Chat LMsys or LLM benchmarks, 3rd party crowd sourced vibe checks sometimes do much better.

                                • danielhanchen 7 hours ago

                                  As an update - the Phi-4 team is actively working on incorporating all fixes! See https://huggingface.co/microsoft/phi-4/discussions/21