Never imagined our project would make it to the HN front page on Sunday!
Tabby has undergone significant development since its launch two years ago [0]. It is now a comprehensive AI developer platform featuring code completion and a codebase chat, with a team [1] / enterprise focus (SSO, Access Control, User Authentication).
Tabby's adopters [2][3] have discovered that Tabby is the only platform providing a fully self-service onboarding experience as an on-prem offering. It also delivers performance that rivals other options in the market. If you're curious, I encourage you to give it a try!
[1]: https://demo.tabbyml.com/search/how-to-add-an-embedding-api-...
[2]: https://www.reddit.com/r/LocalLLaMA/s/lznmkWJhAZ
[3]: https://www.linkedin.com/posts/kelvinmu_last-week-i-introduc...
Do you have a plugin for MSVC?
Not yet; consider subscribing to https://github.com/TabbyML/tabby/issues/322 for future updates!
Is it only compatible with Nvidia and Apple? Will this work with an AMD GPU?
Yes - AMD GPUs are supported through the Vulkan backend:
As someone unfamiliar with local AIs and eager to try, how does the “run tabby in 1 minute”[1] compare to e.g. chatgpt’s free 4o-mini? Can I run that docker command on a medium specced macbook pro and have an AI that is comparably fast and capable? Or are we not there (yet)?
Edit: looks like there is a separate page with instructions for macbooks[2] that has more context.
> The compute power of M1/M2 is limited and is likely to be sufficient only for individual usage. If you require a shared instance for a team, we recommend considering Docker hosting with CUDA or ROCm.
[1]: https://github.com/TabbyML/tabby#run-tabby-in-1-minute
docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model StarCoder-1B --device cuda --chat-model Qwen2-1.5B-Instruct
[2]: https://tabby.tabbyml.com/docs/quick-start/installation/appl...
gpt-4o-mini might not be the best point of reference for what good LLMs can do with code: https://aider.chat/docs/leaderboards/#aider-polyglot-benchma...
A teeny tiny model such as a 1.5B model is really dumb, and not good at interactively generating code in a conversational way, but models of 3B or smaller can do a good job of suggesting tab completions.
There are larger "open" models (in the 32B - 70B range) that you can run locally that should be much, much better than gpt-4o-mini at just about everything, including writing code. For a few examples, llama3.3-70b-instruct and qwen2.5-coder-32b-instruct are pretty good. If you're really pressed for RAM, qwen2.5-coder-7b-instruct or codegemma-7b-it might be okay for some simple things.
> medium specced macbook pro
medium specced doesn't mean much. How much RAM do you have? Each "B" (billion) of parameters is going to require about 1GB of RAM, as a rule of thumb. (500MB for really heavily quantized models, 2GB for un-quantized models... but, 8-bit quants use 1GB, and that's usually fine.)
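To make that rule of thumb concrete, here is a minimal sketch that turns it into arithmetic. The per-billion figures are the ones quoted above; the function name and the precision labels are made up for illustration, and the result covers model weights only (no context/KV cache or runtime overhead).

```python
# Rough RAM needed for model weights, using the rule of thumb from the comment above.
# Assumption: weights only; context (KV cache) and runtime overhead are NOT included.
GB_PER_BILLION_PARAMS = {
    "heavy-quant": 0.5,   # "really heavily quantized" models
    "8-bit": 1.0,         # 8-bit quants, the usual default
    "unquantized": 2.0,   # un-quantized models
}


def estimate_weight_ram_gb(params_billion: float, precision: str = "8-bit") -> float:
    """Return an approximate RAM requirement in GB for the model weights."""
    return params_billion * GB_PER_BILLION_PARAMS[precision]


if __name__ == "__main__":
    # Model names are the ones mentioned earlier in the thread.
    for name, size_b in [
        ("qwen2.5-coder-7b-instruct", 7),
        ("qwen2.5-coder-32b-instruct", 32),
        ("llama3.3-70b-instruct", 70),
    ]:
        print(f"{name}: ~{estimate_weight_ram_gb(size_b):.0f} GB at 8-bit")
```

Treat the output as a floor rather than a budget: long coding chats add context memory on top of the weights.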
Also, context size significantly impacts RAM/VRAM usage, and in programming those chats get big quickly.
Side question: open source models tend to be less "smart" than private ones. Do you intend to compensate by providing better context (e.g. querying relevant technology docs to feed the context)?
For something similar I use Continue.dev with ollama, it’s always nice to see more tools in the space! But as usual, you need pretty formidable hardware to run the actually good models, like the 32B version of Qwen2.5-coder.
> How to utilize multiple NVIDIA GPUs?
> Tabby only supports the use of a single GPU. To utilize multiple GPUs, you can initiate multiple Tabby instances and set CUDA_VISIBLE_DEVICES (for cuda) or HIP_VISIBLE_DEVICES (for rocm) accordingly.
So using 2 NVLinked GPUs with inference is not supported? Or is that situation different because NVLink treats the two GPUs as a single one?
> So using 2 NVLinked GPUs with inference is not supported?
To make better use of multiple GPUs, we suggest employing a dedicated backend for serving the model. Please refer to https://tabby.tabbyml.com/docs/references/models-http-api/vl... for an example
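For the simpler one-instance-per-GPU approach from the quoted FAQ, a rough sketch might look like the following. It only builds the command lines; it reuses the docker command from the README quoted earlier in the thread and adds docker's `-e` flag to set CUDA_VISIBLE_DEVICES. The host ports and the per-GPU data directories are assumptions made for illustration, not something the Tabby docs prescribe.

```python
import shlex
from pathlib import Path


def tabby_command(gpu_index: int, host_port: int) -> list[str]:
    """Build a `docker run` command that pins one Tabby instance to one GPU."""
    # Assumption: a separate data directory per instance (hypothetical layout).
    data_dir = Path.home() / f".tabby-gpu{gpu_index}"
    return [
        "docker", "run", "-it", "--gpus", "all",
        # Restrict this instance to a single GPU, as the FAQ suggests.
        "-e", f"CUDA_VISIBLE_DEVICES={gpu_index}",
        # Assumption: each instance gets its own host port, mapped to 8080 in the container.
        "-p", f"{host_port}:8080",
        "-v", f"{data_dir}:/data",
        "tabbyml/tabby", "serve",
        "--model", "StarCoder-1B",
        "--device", "cuda",
        "--chat-model", "Qwen2-1.5B-Instruct",
    ]


if __name__ == "__main__":
    # Print one command per GPU; run each in its own terminal.
    for gpu, port in [(0, 8080), (1, 8081)]:
        print(shlex.join(tabby_command(gpu, port)))
```

Note that this gives two independent servers, not one model spread across both cards, which is why the reply above points to a dedicated backend for real multi-GPU serving.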
What is the recommended hardware? GPU required? Could this run OK on an older Ryzen APU (Zen 3 with Vega 7 graphics)?
The usual bottleneck for self-hosted LLMs is memory bandwidth. It doesn't really matter if there are integrated graphics or not... the models will run at the same (very slow) speed on CPU-only. Macs are only decent for LLMs because Apple has given Apple Silicon unusually high memory bandwidth, but they're still nowhere near as fast as a high-end GPU with extremely fast VRAM.
For extremely tiny models like you would use for tab completion, even an old AMD CPU is probably going to do okay.
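The bandwidth point can be put into numbers with a back-of-the-envelope sketch. The idea is that generating each token requires reading roughly all of the model's weights once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The bandwidth figures below are round illustrative assumptions, not measurements of any particular machine, and real throughput will be lower.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed when memory bandwidth is the bottleneck.

    Assumes every generated token reads all model weights once.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # Illustrative, assumed bandwidth figures (GB/s).
    systems = {
        "desktop CPU with dual-channel RAM (assumed ~50 GB/s)": 50.0,
        "high-end GPU VRAM (assumed ~900 GB/s)": 900.0,
    }
    # Model sizes in GB, e.g. a 1B and a 32B model at 8-bit (about 1 GB per billion parameters).
    for label, bandwidth in systems.items():
        for size_gb in (1.0, 32.0):
            bound = max_tokens_per_second(bandwidth, size_gb)
            print(f"{label}, {size_gb:.0f} GB model: at most ~{bound:.1f} tokens/s")
```

This is why a tiny completion model stays usable on an older CPU while a 32B model does not: the same bandwidth is spread over far more bytes per token.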
Good to know. It also looks like you can host TabbyML as an on-premise server with docker and serve requests over a private network. Interesting to think that a self-hosted GPU server might become a thing.
Check https://www.reddit.com/r/LocalLLaMA/s/lznmkWJhAZ to see a local setup with a 3090.
That thread doesn't seem to mention hardware. It would be really helpful to just put hardware requirements in the GitHub README.
So does this run on your personal machine, or can you install it on a local company server and have everyone in the company connect to it?
Tabby is engineered for team usage and intended to be deployed on a shared server. However, with robust local computing resources, you can also run Tabby on your individual machine. Check https://www.reddit.com/r/LocalLLaMA/s/lznmkWJhAZ to see a local setup with a 3090.
How would I tell this to use an API framework it doesn’t know?
Tabby comes with built-in RAG support, so you can add this API framework to it.
Example: https://demo.tabbyml.com/search/how-to-configure-sso-in-tabb...
Settings page: https://demo.tabbyml.com/settings/providers/doc
Not a dupe, as that was nearly two years ago. https://news.ycombinator.com/newsfaq.html#reposts
Didn’t you mean to name it Spacey?
Unfortunate name. Can you connect Tabby to the OpenAI-compatible TabbyAPI? https://github.com/theroyallab/tabbyAPI
I thought that Tabby, the ssh client [1], got AI capabilities...
At least per Github, the TabbyML project is older than the TabbyAPI project.
Also, wildly more popular, to the tune of several orders of magnitude more forks and stars. If anything, this question should be asked of the TabbyAPI project.
I'm not sure what's going on with TabbyAPI's GitHub metrics, but exl2 quants are very popular among the NVIDIA local LLM crowd, and TabbyAPI comes up in tons of Reddit posts from people using it. Might be just my bubble, not saying they're not accurate, just generally surprised such a useful project has under 1k stars. On the flip side, LLMs will hallucinate about TabbyML if you ask them TabbyAPI-related questions, so I'd agree the naming is unfortunate.