• MKuykendall a day ago

    Hey HN! I built this because I was tired of waiting 10 seconds for Ollama's 680MB binary to start just to run a 4GB model locally.

    Quick demo - working VSCode + local AI in 30 seconds:

      curl -L -o shimmy https://github.com/Michael-A-Kuykendall/shimmy/releases/late...
      chmod +x shimmy && ./shimmy serve   # Point VSCode/Cursor to localhost:11435

    The technical achievement: Got it down to 5.1MB by stripping everything except pure inference. Written in Rust, uses llama.cpp's engine.

    One feature I'm excited about: You can use LoRA adapters directly without converting them. Just point to your .gguf base model and .gguf LoRA - it handles the merge at runtime. Makes iterating on fine-tuned models much faster since there's no conversion step.
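    To make the LoRA workflow concrete, here is a rough sketch of what an invocation could look like. The flag names and file paths below are my assumptions for illustration, not shimmy's confirmed CLI - check `shimmy --help` for the real options:

```shell
# Hypothetical sketch - flag names and paths are assumptions, not shimmy's documented CLI.
# The idea: point at a GGUF base model plus a GGUF LoRA adapter, no conversion step.
./shimmy serve \
  --model ./models/phi-3-mini-4k-instruct.gguf \
  --lora ./adapters/my-finetune-lora.gguf
```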

    Your data never leaves your machine. No telemetry. No accounts. Just a tiny binary that makes GGUF models work with your AI coding tools.

    Would love feedback on the auto-discovery feature - it finds your models automatically so you don't need any configuration.
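    For anyone curious what auto-discovery roughly amounts to on disk, here is a self-contained sketch of that kind of scan. The search locations are assumptions on my part (a local ./models directory and Ollama's cache), and the demo directory under /tmp is made up for illustration:

```shell
# Simulate a models directory, then scan candidate locations for GGUF files -
# roughly the kind of filesystem walk auto-discovery has to do.
# The /tmp/shimmy-demo path is a stand-in; the Ollama path is an assumption.
mkdir -p /tmp/shimmy-demo/models
touch /tmp/shimmy-demo/models/phi-3-mini.gguf
find /tmp/shimmy-demo/models "$HOME/.ollama/models" -name '*.gguf' 2>/dev/null
```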

    What's your local LLM setup? Are you using LoRA adapters for anything specific?

    • sunscream89 7 hours ago

      How do I use it with ollama models?

      • MKuykendall 3 hours ago

        To use Shimmy (instead of Ollama):

          1. Install Shimmy:
          cargo install shimmy
          2. Get GGUF models (same models you'd use with Ollama):
          # Download to ./models/ directory
          huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
          # Or use existing Ollama models from ~/.ollama/models/
          3. Start serving:
          shimmy serve
          4. Use with any OpenAI-compatible client at http://localhost:11435
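        For example, plain curl should work against the OpenAI-style chat completions endpoint. The endpoint path follows the OpenAI API convention and the model name here is an assumption - use whatever model shimmy discovered:

```shell
# Assumes shimmy is already serving on its default port 11435;
# "phi-3-mini-4k-instruct" is a placeholder model name.
curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "phi-3-mini-4k-instruct", "messages": [{"role": "user", "content": "Write a haiku about Rust"}]}'
```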
      • carlos_rpn a day ago

        You may have noticed already, but the link to the binary is throwing a 404.

        • MKuykendall a day ago

          This should be fixed now!

      • stupidgeek314 a day ago

        Windows Defender tripped this for me, calling it out as Bearfoos trojan. Most likely a false positive, but jfyi.

        • MKuykendall 15 hours ago

          Try cargo install, or add a Defender exclusion - unsigned Rust binaries will often trip this.

        • homarp a day ago

          Nice, a rust tool wrapping llama.cpp

          how does it differ from llama-server?

          and from llama-swap?

          • MKuykendall a day ago

            Shimmy is designed to be "invisible infrastructure" - the simplest possible way to get local inference working with your existing AI tools. llama-server gives you more control, llama-swap gives you multi-model management.

              Key differences:
              - Architecture: llama-swap = proxy + multiple servers, Shimmy = single server
              - Resource usage: llama-swap runs multiple processes, Shimmy = one 50MB process
              - Use case: llama-swap for managing many models, Shimmy for simplicity
            • MKuykendall a day ago

              Shimmy is for when you want the absolute minimum footprint - CI/CD pipelines, quick local testing, or systems where you can't install 680MB of dependencies.

          • cat-turner 18 hours ago

            looks cool, ty! really great project, will try this out.