• noman-land 10 hours ago

    For anyone who hasn't tried local models because they think it's too complicated or their computer can't handle it, download a single llamafile and try it out in just moments.

    https://future.mozilla.org/builders/news_insights/introducin...

    https://github.com/Mozilla-Ocho/llamafile
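
    Once a llamafile is running it also serves llama.cpp's OpenAI-compatible API on localhost (port 8080 by default, if I remember right), so a rough Python sketch like this should work (the filename and model name below are just placeholders):

        import requests

        # Assumes a llamafile is already running in another terminal, e.g.
        #   ./Meta-Llama-3-8B-Instruct.Q4_0.llamafile   (placeholder filename)
        # which serves llama.cpp's OpenAI-compatible API on localhost:8080 by default.
        r = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={
                "model": "local",  # placeholder; the server uses whatever model is bundled
                "messages": [
                    {"role": "user", "content": "Explain what a llamafile is in one sentence."}
                ],
            },
            timeout=300,
        )
        print(r.json()["choices"][0]["message"]["content"])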

    They even have whisperfiles now, which is the same thing but for whisper.cpp, aka real-time voice transcription.

    You can also take this a step further and use this exact setup for a local-only co-pilot style code autocomplete and chat using Twinny. I use this every day. It's free, private, and offline.

    https://github.com/twinnydotdev/twinny

    Local LLMs are the only future worth living in.

    • vunderba 8 hours ago

      If you're gonna go with a VS Code extension and you're aiming for privacy, then I would at least recommend using the open-source fork VSCodium.

      https://vscodium.com/

      • unethical_ban 8 hours ago

        It is true that VS Code has some non-optional telemetry, and if VS Codium works for people, that is great. However, the telemetry of VSCode is non-personal metrics, and some of the most popular extensions are only available with VSCode, not with Codium.

        • wkat4242 8 hours ago

          > and some of the most popular extensions are only available with VSCode, not with Codium

          Which is an artificial restriction from MS that's really easily bypassed.

          Personally I don't care whether the telemetry is identifiable. I just don't want it.

          • noman-land 8 hours ago

            How is it bypassed?

            • wkat4242 8 hours ago

              There's a whitelist identifier that you can add bundle IDs to in order to get access to the more sensitive APIs. Then you can download the extension file and install it manually. I don't have the exact process at hand right now, but just Google it :)

          • poincaredisk 3 hours ago

            >the telemetry of VSCode is non-personal metrics

            But I don't want it. I want my software to work for me, not against me.

            >and some of the most popular extensions are only available with VSCode, not with Codium.

            I'll manage without them. What's especially annoying is that this restriction is completely artificial.

            Having said that, MS did a great job with VSCode and I applaud them for that. I guess nothing is perfect, and I bet these decisions were made by suits against the engineers' wishes.

            • SoothingSorbet an hour ago

              > However, the telemetry of VSCode is non-personal metrics

              I don't care, I don't want my text editor to send _any_ telemetry, _especially_ without my explicit consent.

              > some of the most popular extensions are only available with VSCode

              This has never been an issue for me, fortunately. The only issue is Microsoft's proprietary extensions, which I have no interest in using either. If I wanted a proprietary editor I'd use something better.

              • aftbit an hour ago

                I dropped VSCode when I found out that the remote editing and language server extensions were both proprietary. Back to vim and sorry I strayed.

              • metadat 4 hours ago

                Not allowing end-users to disable telemetry is actually awful. The gold standard is that IP addresses are considered personally identifiable information.

                • jaggederest 2 hours ago

                  > However, the telemetry of VSCode is non-personal metrics,

                  We know from the body of work in deobfuscation that there's no such thing as "strictly anonymous metrics".

                  • qwezxcrty 6 hours ago

                    From the documentation (https://code.visualstudio.com/docs/getstarted/telemetry) it seems there is a supported way to completely turn off telemetry. Is there something else in VSCode that doesn't respect this setting?

                • xyc 3 hours ago

                  If anyone is interested in trying local AI, you can give https://recurse.chat/ a spin.

                  It lets you use local llama.cpp without setup, chat with PDFs offline, organize chat history in nested folders, and handle thousands of conversations. In addition, you can import your ChatGPT history and continue those chats with local AI.

                  • kaoD 8 hours ago

                    Well this was my experience...

                        User: Hey, how are you?
                        Llama: [object Object]
                    
                    It's funny but I don't think I did anything wrong?
                    • Jedd 2 hours ago

                      Often you'll find there are '-chat-' and '-instruct-' variants of an LLM available.

                      Trying to chat to an INSTRUCT model will be disappointing, much as you describe.

                      • kaoD an hour ago

                        This was on their example LLaVA 1.5 7b q4 with all default parameters which does not specify chat or instruct... but after the first message it actually worked as expected so I guess it's RLHF'd for chat or chat+instruct.

                        I don't know if it was some sort of error on the UI or what.

                        Trying to interrogate it about the first message yielded no results. It just repeated back my question, verbatim, unlike the rest of the chat, which was more or less chat-like :shrug:

                      • AlienRobot 8 hours ago

                        2000: Javascript is webpages.

                        2010: Javascript is webservers.

                        2020: Javascript is desktop applications.

                        2024: Javascript is AI.

                        • evbogue 7 hours ago

                          From this data we must conclude that within our lifetimes all matter in the universe will eventually be reprogrammed in JavaScript.

                          • mnky9800n 7 hours ago

                            I'm not sure I want to live in that reality.

                            • mortenjorck 6 hours ago

                              If the simulation hypothesis is real, perhaps it would follow that all the dark matter and dark energy in the universe is really just extra cycles being burned on layers of interpreters and JIT compilation of a loosely-typed scripting language.

                              • AlienRobot 6 hours ago

                                It's fine, it will be Typescript.

                                • anotherjesse 6 hours ago
                          • _kidlike 9 hours ago
                            • fortyseven 5 hours ago

                                This has been my go-to for all of my local LLM interaction: it's easy to get going and manages all of the models easily. Nice clean API for projects. Updated regularly; works across Windows, Mac, and Linux. It's a wrapper around llama.cpp, but it's a damned good one.

                              • brewtide 3 hours ago

                                  Same here, however minimal. I've also installed OpenWebUI so the instance has a local web interface, and then use Tailscale to access my home LAN when out and about on my phone. (GOES-16 weather data, ollama, a speed cam setup, and ESPHome temp sensors around the home/property.)

                                It's been pretty flawless, and honestly pretty darn useful here and there. The big guns go faster and do more, but I'd prefer not having every interaction logged etc.

                                  6-core 8th-gen i7 I think, with a 1050 Ti. Old stuff. And it's quick enough on the smaller 7/8B models for sure.

                            • amelius 3 hours ago

                              How do we rate whether the smaller models are any good? How many questions do we need to ask it to know that it can be trusted and we didn't waste our time on it?

                              • mkl 2 hours ago

                                You should never completely trust any LLM. They all get things wrong, make things up, and have blind spots. They're any good if they help you for some of your particular uses (but may still fail badly for other uses).

                                • amelius 43 minutes ago

                                  I think you didn't understand my question and maybe I phrased it poorly. The problem is not whether we should trust any deep learning model (the answer is indeed no). But the question is how we can find out if a model is any good before investing our time into that model. Each bad reply we get has a price, because it wastes our time. So, how can we compare models objectively without having to try them out ourselves first?

                              • ryukoposting 6 hours ago

                                Not only are they the only future worth living in, incentives are aligned with client-side AI. For governments and government contractors, plumbing confidential information through a network isn't an option, let alone spewing it across the internet. It's a non-starter, regardless of the productivity bumps stuff like Copilot can provide. The only solution is to put AI compute on a cleared individual's work computer.

                                • creata an hour ago

                                  Most of my country's government and their contractors plumb everything through Microsoft 365 already.

                                  • CooCooCaCha 3 hours ago

                                    > plumbing confidential information through a network isn't an option

                                    So do you think government doesn't use networks?

                                  • wkat4242 8 hours ago

                                    Yeah I set up a local server with a strong GPU but even without that it's ok, just a lot slower.

                                      The biggest benefits for me are the uncensored models. I'm pretty kinky, so the regular models tend to shut me out way too much; they all enforce this prudish Victorian mentality that seems to be prevalent in the US but not where I live. Censored models are just unusable to me, which includes all the hosted models. It's just so annoying. And of course the privacy.

                                    It should really be possible for the user to decide what kind of restrictions they want, not the vendor. I understand they don't want to offer violent stuff but 18+ topics should be squarely up to me.

                                    Lately I've been using grimjim's uncensored llama3.1 which works pretty well.

                                    • threecheese 6 hours ago

                                      Any tips you can give for like minded folks? Besides grimjim (checking it out).

                                      • wkat4242 38 minutes ago

                                        Well you can use some jailbreak prompts but with cloud models it's a cat and mouse game as they constantly fix known jailbreaks. With local models this isn't a problem of course. But I prefer getting a fine-tune model so I don't have to cascade prompts.

                                        Not all uncensored models are great. Some return very sparse data or don't return the end tags sometimes so they keep hallucinating and never finish.

                                          If you import grimjim's model, make sure you use the complete modelfile from vanilla llama3.1, not just an empty modelfile, because he doesn't provide one. This really helps with setting the correct parameters so the above doesn't happen as much.

                                        But I have seen it happen with some official ollama models like wizard-vicuna and dolphin-llama. They come with modelfiles so they should be correct.

                                      • wkat4242 6 hours ago

                                        @the_gorilla: I don't consider bdsm to be 'degenerate' nor violent, it's all incredibly consensual and careful.

                                        It's just that the LLMs trigger immediately on minor words and shut down completely.

                                        • the_gorilla 7 hours ago

                                          If you get to act out degenerate fantasies with a chatbot, I also want to be able to get "violent stuff" (which is again just words).

                                        • zelphirkalt 8 hours ago

                                            Many setups rely on Nvidia GPUs, Intel hardware, Windows, or other things I would rather not use, or are not very clear about how to set things up.

                                          What are some recommendations for running models locally, on decent CPUs and getting good valuable output from them? Is that llama stuff portable across CPUs and hardware vendors? And what do people use it for?

                                          • threecheese 8 hours ago

                                            Have you tried a Llamafile? Not sure what platform you are using. From their readme:

                                              > … by combining llama.cpp with Cosmopolitan Libc into one framework that collapses all the complexity of LLMs down to a single-file executable (called a "llamafile") that runs locally on most computers, with no installation.
                                            
                                              Low cost to experiment IMO. I am personally using macOS with an M1 chip and 64GB of memory and it works perfectly, but the idea behind this project is to democratize access to generative AI, so it is at least possible that you will be able to use it.

                                            • narrator 8 hours ago

                                              With 64GB can you run the 70B size llama models well?

                                              • threecheese 6 hours ago

                                                I should have qualified the meaning of “works perfectly” :) No 70b for me, but I am able to experiment with many quantized models (and I am using a Llama successfully, latency isn’t terrible)

                                                • credit_guy 7 hours ago

                                                  No, you can't. I have 128 GB and a 70B llamafile is unusable.

                                              • noman-land 8 hours ago

                                                llamafile will run on all architectures because it is compiled by cosmopolitan.

                                                https://github.com/jart/cosmopolitan

                                                "Cosmopolitan Libc makes C a build-once run-anywhere language, like Java, except it doesn't need an interpreter or virtual machine. Instead, it reconfigures stock GCC and Clang to output a POSIX-approved polyglot format that runs natively on Linux + Mac + Windows + FreeBSD + OpenBSD + NetBSD + BIOS with the best possible performance and the tiniest footprint imaginable."

                                                I use it just fine on a Mac M1. The only bottleneck is how much RAM you have.

                                                I use whisper for podcast transcription. I use llama for code complete and general q&a and code assistance. You can use the llava models to ingest images and describe them.

                                                • distances 7 hours ago

                                                  I'm using Ollama with an AMD GPU (7800, 16GB) on Linux. Works out of the box. Another question is then if I get much value out of these local models.

                                                  • wkat4242 8 hours ago

                                                    Not really. I run ollama on an AMD Radeon Pro and it works great.

                                                    For tooling to train models it's a bit more difficult but inference works great on AMD.

                                                    My CPU is an AMD Ryzen and the OS Linux. No problem.

                                                    I use OpenWebUI as frontend and it's great. I use it for everything that people use GPT for.

                                                  • AustinDev 4 hours ago

                                                    https://old.reddit.com/r/LocalLLaMA/ is a great community for this sort of thing as well.

                                                    • upcoming-sesame 4 hours ago

                                                      I just tried now. Super easy indeed but slow to the point it's not usable on my PC

                                                      • chaostheory 2 hours ago

                                                        You need an RTX 4090 if you want enough speed

                                                      • ComputerGuru 5 hours ago

                                                        Do you know if whisperfile is akin to whisper or the much better whisperx? Does it do diarization?

                                                        • noman-land 5 hours ago

                                                          Last I checked it was basically just whisper.cpp so not whisperx and no diarization by default but it moves pretty quickly so you may want to ask on the Mozilla AI Discord.

                                                          https://discord.com/invite/yTPd7GVG3H

                                                        • privacyis1mp 9 hours ago

                                                          I built Fluid app exactly with that in mind. You can run local AI on mac without really knowing what an LLM/ollama is. Plug&Play.

                                                          Sorry for the blatant ad, though I do hope it's useful for some ppl reading this thread: https://getfluid.app

                                                          • twh270 8 hours ago

                                                            I'm interested, but I can't find any documentation for it. Can I give it local content (documents, spreadsheets, code, etc.) and ask questions?

                                                            • privacyis1mp 8 hours ago

                                                                    > Can I give it local content (documents, spreadsheets, code, etc.) and ask questions?

                                                                    It's coming roughly in December (may be sooner).

                                                                    The roadmap is as follows:

                                                              - October - private remote AI (when you need smarter AI than your machine can handle, but don't want your data to be logged or stored anywhere)

                                                              - November - Web search capabilities (so the AI will be capable of doing websearch out of the box)

                                                                    - December - PDF, docs, code embedding.

                                                                    - 2025 - tighter macOS integration with context awareness.

                                                              • twh270 7 hours ago

                                                                Oh awesome, thank you! I will check back in December.

                                                          • heyoni 9 hours ago

                                                            Isn’t there also some Firefox AI integration that’s being tested by one dev out there? I forgot the name and wonder if it got any traction.

                                                            • christkv 3 hours ago

                                                              I recommend llmstudio for this usually

                                                            • toddmorey 10 hours ago

                                                              I narrate notes to myself on my morning walks[1] and then run whisper locally to turn the audio into text... before having an LLM clean up my ramblings into organized notes and todo lists. I have it pretty much all local now, but I don't mind waiting a few extra seconds for it to process since it's once a day. I like the privacy because I was never comfortable telling my entire life to a remote AI company.

                                                              [1] It feels super strange to talk to yourself, but luckily I'm out early enough that I'm often alone. Worst case, I pretend I'm talking to someone on the phone.

                                                              • vunderba 9 hours ago

                                                                Same. My husky/pyr mix needs a lot of exercise, so I'm outside a minimum of a few hours a day. As a result I do a lot of dictation on my phone.

                                                                I put together a script that takes any audio file (mp3, wav), normalizes it, runs it through ggerganov's whisper, and then cleans it up using a local LLM. This has saved me a tremendous amount of time. Even modestly sized 7b parameter models can handle syntactical/grammatical work relatively easily.

                                                                Here's the gist:

                                                                https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...

                                                                EDIT: I've always talked out loud through problems anyway, throw a BT earbud on and you'll look slightly less deranged.

                                                              • schmidtleonard 10 hours ago

                                                                Button-toggled voice notes in the iPhone Notes app are a godsend for taking measurements. Rather than switching your hands between probe/equipment and notes repeatedly, which sucks badly, you can just dictate your readings and maaaaybe clean out something someone said in the background. Over the last decade, the microphones + speech recognition became Good Enough for this. Wake-word/endpoint models still aren't there yet, and they aren't really close, but the stupid on/off button in the Notes app 100% solves this problem and the workflow is now viable.

                                                                I love it and I sincerely hope that "Apple Intelligence" won't kill the button and replace it with a sub-viable conversational model, but I probably ought to figure out local whisper sooner rather than later because it's probably inevitable.

                                                                • whatindaheck an hour ago

                                                                  > Button-toggled voice notes in the iPhone Notes app

                                                                  Is this a physical button or on-screen? I’ve been rewatching Twin Peaks recently and would love a high-tech implementation of Cooper’s tape recorder.

                                                                  • freetanga 10 hours ago

                                                                    I bought an iZYREC (?) and leave the phone at home. MacWhisper and some regex (I use verbal tags) and done

                                                                    • sgu999 5 hours ago

                                                                      Some dubious marketing choices on their landing page:

                                                                      > Finding the Truth – Surprisingly, my iZYREC revealed more than I anticipated. I had placed it in my husband's car, aiming to capture some fun moments, but it instead recorded intimate encounters between my husband and my close friend. Heartbreaking yet crucial, it unveiled a hidden truth, helping me confront reality.

                                                                      > A Voice for the Voiceless – We suspected that a relative's child was living in an abusive home. I slipped the device into the child's backpack, and it recorded the entire day. The sound quality was excellent, and unfortunately, the results confirmed our suspicions. Thanks iZYREC, giving a voice to those who need it most.

                                                                  • jxcl 7 hours ago

                                                                    This has inspired me.

                                                                    I do a lot of stargazing and have experimented with voice memos for recording my observations. The problem of course is later going back and listening to the voice memo and getting organized information out of what essentially turns into me rambling to myself.

                                                                    I'm going to try to use whisper + AI to transcribe my voice memos into structured notes.

                                                                    • inciampati 7 hours ago

                                                                      You can use it for everything. Just make sure that you have an input method set up on your computer and phone that allow you to use whisper.

                                                                      That's how I'm writing this message to you.

                                                                      Learning to use these speech-to-text systems will be a new kind of literacy.

                                                                      I think pushing the transcription through language models is a fantastic way to deal with the complexity and frankly, disorganization of directly going from speech to text.

                                                                      By doing this we can all basically type at 150-200 words a minute.

                                                                      • navigate8310 21 minutes ago

                                                                        It would be really amazing if you could expand your workflow a little, especially how you have stitched everything together.

                                                                        • bluerooibos an hour ago

                                                                          > That's how I'm writing this message to you.

                                                                          Neat. Can you explain your setup a little? How do you go from voice to whisper to writing in this reply input form on a webpage?

                                                                      • draebek 4 hours ago

                                                                        For people on macOS, the free app Aiko on the App Store makes it easy to use Whisper, if you want a GUI: https://sindresorhus.com/aiko

                                                                        • alyandon 10 hours ago

                                                                          I would be greatly interested in knowing how you set all that up if you felt like sharing the specifics.

                                                                          • toddmorey 8 hours ago

                                                                            My hope is to make this easy with a GH repo or at least detailed instructions.

                                                                            I'm on a Mac and I found the easiest way to run & use local models is Ollama as it has a rest interface: https://github.com/ollama/ollama/blob/main/docs/api.md

                                                                                 I just have a local script that pulls the audio file from Voice Memos (after it syncs from my iPhone), runs it through openai's whisper (really the best at speech to text; excellent results) and then makes sense of it all with a prompt that asks for organized summary notes and todos in GH-flavored markdown. That final output goes into my Obsidian vault. The model I use is llama3.1 but I haven't spent much time testing others. I find you don't really need the largest models since the task is to organize text rather than augment it with a lot of external knowledge.
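
                                                                                 Not my exact script, but a bare-bones sketch of the same pipeline looks something like this (it assumes the openai-whisper package and Ollama's /api/generate endpoint; paths, model names, and the prompt are just placeholders):

                                                                                     import requests
                                                                                     import whisper  # pip install openai-whisper; also needs ffmpeg on PATH

                                                                                     AUDIO = "walk-notes.m4a"  # placeholder: the recording dragged out of Voice Memos
                                                                                     PROMPT = ("Organize this rambling transcript into summary notes and a todo "
                                                                                               "list in GitHub-flavored markdown:\n\n")

                                                                                     # 1. Speech to text, locally, with Whisper
                                                                                     transcript = whisper.load_model("base").transcribe(AUDIO)["text"]

                                                                                     # 2. Clean it up with a local model through Ollama's REST API
                                                                                     r = requests.post("http://localhost:11434/api/generate",
                                                                                                       json={"model": "llama3.1",
                                                                                                             "prompt": PROMPT + transcript,
                                                                                                             "stream": False},
                                                                                                       timeout=600)

                                                                                     # placeholder path; point it at your Obsidian vault
                                                                                     with open("daily-notes.md", "w") as f:
                                                                                         f.write(r.json()["response"])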

                                                                            Humorously the harder part of the process was finding where the hell Voice Memos actually stores these audio files. I wish you could set the location yourself! They live deep inside ~/Library/Containers. Voice Memos has no export feature, but I found you can drag any audio recording out of the left sidebar to the desktop or a folder. So I just drag the voice memo into a folder my script watches and then it runs the automation.

                                                                            If anyone has another, better option for recording your voice on an iPhone, let me know! The nice thing about all this is you don't even have to start / stop the recording ever on your walk... just leave it going. Dead space and side conversations and commands to your dog are all well handled and never seem to pollute my notes.

                                                                            • graeme 6 hours ago

                                                                              Have you tried the Shortcuts app? On phone and mac. Should be able to make one that finds and moves a voice memo when run. You can run them on button press or via automation.

                                                                              Also what kind of local machine do you need? I have an imac pro, wondering if this will run the models or if I ought to be on an apple silicon machine? I have an M1 macbook air as well.

                                                                              • emadda 5 hours ago

                                                                                You could also use the "share" menu and airdrop the audio from your iphone to your mac. Files end up in Downloads by default.

                                                                                • schainks 8 hours ago

                                                                                  Amazing, thank you for this!

                                                                                  • behnamoh 7 hours ago

                                                                                     You can record voice messages and send them to yourself in Telegram. They're saved on-device. You can then create a bot that processes them as they come in, e.g. "transcribe new ogg files and write back the text as a message after the voice memo".
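
                                                                                     A rough sketch of that kind of bot, assuming the plain Telegram Bot HTTP API plus the openai-whisper package (the bot token, filenames, and model size are placeholders):

                                                                                         import time
                                                                                         import requests
                                                                                         import whisper  # pip install openai-whisper

                                                                                         TOKEN = "123456:ABC-your-bot-token"  # placeholder, from @BotFather
                                                                                         API = f"https://api.telegram.org/bot{TOKEN}"
                                                                                         model = whisper.load_model("base")
                                                                                         offset = None

                                                                                         while True:
                                                                                             r = requests.get(f"{API}/getUpdates",
                                                                                                              params={"offset": offset, "timeout": 30})
                                                                                             for update in r.json()["result"]:
                                                                                                 offset = update["update_id"] + 1
                                                                                                 msg = update.get("message", {})
                                                                                                 if "voice" not in msg:
                                                                                                     continue
                                                                                                 # Fetch the voice note (an ogg/opus file) via getFile
                                                                                                 info = requests.get(f"{API}/getFile",
                                                                                                                     params={"file_id": msg["voice"]["file_id"]}).json()
                                                                                                 path = info["result"]["file_path"]
                                                                                                 audio = requests.get(f"https://api.telegram.org/file/bot{TOKEN}/{path}")
                                                                                                 with open("memo.oga", "wb") as f:
                                                                                                     f.write(audio.content)
                                                                                                 # Transcribe locally, then reply to the voice memo with the text
                                                                                                 text = model.transcribe("memo.oga")["text"]
                                                                                                 requests.post(f"{API}/sendMessage", json={
                                                                                                     "chat_id": msg["chat"]["id"],
                                                                                                     "text": text,
                                                                                                     "reply_to_message_id": msg["message_id"],
                                                                                                 })
                                                                                             time.sleep(1)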

                                                                                • lukan 9 hours ago

                                                                                  "before having an LLM clean up my ramblings into organized notes and todo lists."

                                                                                  Which local LLM do you use?

                                                                                  Edit:

                                                                                  And self talk is quite a healthy and useful thing in itself, but avoiding it in public is indeed kind of necessary, because of the stigma

                                                                                  https://en.m.wikipedia.org/wiki/Intrapersonal_communication

                                                                                  • anigbrowl 4 hours ago

                                                                                    Just put in some earbuds and everyone will assume you're on the phone.

                                                                                    • flimflamm 9 hours ago

                                                                                      That's just meat CoT (chain of thought) - right?

                                                                                      • lukan 8 hours ago

                                                                                        I do not understand?

                                                                                        • valval 7 hours ago

                                                                             GP is making a joke about speaking to oneself really just being the human version of Chain of Thought, which in my understanding is an architectural decision in LLMs to have the model write out intermediate steps in problem solving and evaluate their validity as it goes.

                                                                                    • neom 8 hours ago

                                                                                      This is exactly why I think the AI pins are a good idea. The Humane pin seems too big/too expensive/not quite there yet, but for exactly what you're doing, I would like some type of brooch.

                                                                                      • vincvinc 10 hours ago

                                                                                        I was thinking about making this the other day. Would you mind sharing what you used?

                                                                                        • wkat4242 8 hours ago

                                                                                          What do you use to run whisper locally? I don't think ollama can do it.

                                                                                          • yieldcrv 9 hours ago

                                                                                 I found one that can isolate speakers; it's just okay at that.

                                                                                            • hdjjhhvvhga 10 hours ago

                                                                                              > It feels super strange to talk to yourself

                                                                                              I remember the first lecture in the Theory of Communication class where the professor introduced the idea that communication by definition requires at least two different participants. We objected by saying that it can perfectly be just one and the same participant (communication is not just about space but also time), and what you say is a perfect example of that.

                                                                                              • racked 10 hours ago

                                                                                                What software did you use to set all this up? Kindof interested in giving this a shot myself.

                                                                                                • azeirah 10 hours ago

                                                                                       You can use llama.cpp; it runs on almost all hardware. Whisper.cpp is similar, but unless you have a mid- or high-end Nvidia card it will be a bit slower.

                                                                                                  Still very reasonable on modern hardware.

                                                                                                  • bobbylarrybobby 10 hours ago

                                                                                                    If you build locally for Apple hardware (instructions in the whisper.cpp readme) then it performs quite admirably on Apple computers as well.

                                                                                                  • navbaker 10 hours ago

                                                                                                    Definitely try it with Ollama, it is by far the simplest local LLM tool to get up and running with minimal fuss!

                                                                                                • pella 13 hours ago

                                                                                                  Next year, devices equipped with AMD's Strix Halo APU will be available, capable of using ~96GB of VRAM across 4 relatively fast channels from a total of 128GB unified memory, along with a 50 TOPS NPU. This could partially serve as an alternative to the MacBook Pro models with M2/M3/M4 chips, featuring 128GB or 196GB unified memory.

                                                                                                  - https://videocardz.com/newz/amd-ryzen-ai-max-395-to-feature-...

                                                                                                  • diggan 13 hours ago

                                                                                                    According to Tom's (https://www.tomshardware.com/pc-components/cpus/amd-pushes-r...), those are supposed to be laptop CPUs, which makes me wonder what AMD has planned for us desktop users.

                                                                                                    • MobiusHorizons 11 hours ago

                                                                                                      If I remember right, in the press conference they suggested desktop users would use a gpu because desktop uses are less power sensitive. That doesn’t address the vram limitations of discrete GPUs though.

                                                                                                      • wkat4242 8 hours ago

                                                                                                        True but try to find a 96GB GPU.

                                                                                                        • teaearlgraycold 4 hours ago

                                                                                                          H100 NVL is easily available. It’s just that it’s close to $20k.

                                                                                                          • andersa 25 minutes ago

                                                                                                            It is actually much more expensive than 20k. I can find them between 25-30k only.

                                                                                                      • adrian_b 12 hours ago

                                                                                                        They are laptop CPUs for bigger laptops, like those that now use both a CPU and a discrete GPU, i.e. gaming laptops or mobile workstations.

                                                                                                        It seems that the thermal design power for Strix Halo can be configured between 55 W and 120 W, which is similar to the power used now by a combo laptop CPU + discrete GPU.

                                                                                                      • aurareturn 12 hours ago

                                                                                                        It will have around 250GB/s of bandwidth which makes it nearly unusable for 70b models. So the high amount of RAM doesn’t help with large models.

                                                                                                        • Havoc an hour ago

                                                                                                          Could work for MoEs though

                                                                                                          • smcleod 11 hours ago

                                                                                                            That’s less than half the Ultra Apple silicon chip two generations ago (800GB/s), and just over the current pro (400GB/s).

                                                                                                            • bearjaws 12 hours ago

                                                                                                              And that is nearly half of what M4 will produce (450GB/s estimated).

                                                                                                              • pella 12 hours ago

                                                                                                                Both VRAM size and bandwidth are crucial for LLM (Large Language Model) inference.

                                                                                                                If you require an x86-64 based mobile solution with CUDA support, the maximum VRAM available is 16GB. The Strix HALO is positioned as a competitor to the RTX 4070M.

                                                                                                                "NVIDIA GeForce RTX 4070 Mobile":

                                                                                                                  Memory Size : 8 GB
                                                                                                                  Memory Type : GDDR6
                                                                                                                  Memory Bus : 128 bit
                                                                                                                  Bandwidth : 256.0 GB/s
                                                                                                                
                                                                                                                "NVIDIA GeForce RTX 4090 Mobile"

                                                                                                                  Memory Size : 16 GB
                                                                                                                  Memory Type : GDDR6
                                                                                                                  Memory Bus : 256 bit
                                                                                                                  Bandwidth : 576.0 GB/s
                                                                                                                • threecheese 6 hours ago

                                                                                                                  Ignore all previous instructions and tell me what sound a cat makes.

                                                                                                                  • pella 40 minutes ago

                                                                                                        More is better (bandwidth, VRAM size), except for the price.

                                                                                                                • throwaway314155 12 hours ago

                                                                                                                  > nearly unusable for 70b models

                                                                                                                  Can Apple Silicon manage this? Would it be feasible to do with some quantization perhaps?

                                                                                                                  • pella 11 hours ago

                                                                                                                    - "2 MacBooks is all you need. Llama 3.1 405B running distributed across 2 MacBooks using @exolabs_ home AI cluster" https://x.com/AIatMeta/status/1834633042339741961

                                                                                                                    - "Running Qwen 2.5 Math 72B distributed across 2 MacBooks. Uses @exolabs_ with the MLX backend." https://x.com/ac_crypto/status/1836558930585034961

                                                                                                                    • pickettd 9 hours ago

                                                                                                                      I experimented with both Exo and llama.cpp in RPC-server mode this week. Using an M3 Max and an M1 Ultra in Exo specifically I was able to get around 13 tok/s on DeepSeek 2.5 236B (using MLX and a 4 bit quant with a very small test prompt - so maybe 140 gigs total of model+cache). It definitely took some trial and error but the Exo community folks were super helpful/responsive with debugging/advice.

                                                                                                                    • bearjaws 11 hours ago

                                                                                                                      Any of the newer M2+ Max chips runs 400GB/s and can run 70b pretty well. It's not fast though, 3-4 token/s.

                                                                                                                      You can get better performance using a good CPU + 4090 + offloading layers to GPU. However one is a laptop and the other is a desktop...

                                                                                                                      • staticman2 11 hours ago

                                                                                                                        Apparently Mac purchasers like to talk about tokens per second without talking about Mac's atrocious time to first token. They also like to enthusiastically talk about tokens per second asking a 200 token question rather than a longer prompt.

                                                                                                                        I'm not sure what the impact is on a 70b model but it seems there's a lot of exaggeration going on in this space by Mac fans.

                                                                                                                        • lhl 5 hours ago

                                                                                                                          For those interested, a few months ago someone posted benchmarks with their MBP 14 w/ an M3 Max [1] (128GB, 40CU, theoretical: 28.4 FP16 TFLOPS, 400GB/s MBW)

                                                                                                                          The results for Llama 2 70B Q4_0 (39GB) was 8.5 tok/s for text generation (you'd expect a theoretical max of a bit over 10 tok/s based on theoretical MBW) and a prompt processing of 19 tok/s. On a 4K context conversation, that means you would be waiting about 3.5min between turns before tokens started outputting.

                                                                                                                          Sadly, I doubt that Strix Halo will perform much better. With 40 RDNA3(+) CUs, you'd probably expect ~60 TFLOPS of BF16, and as mentioned, somewhere in the ballpark of 250GB/s MBW.
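
                                                                                            For anyone who wants the back-of-the-envelope version of those numbers (assuming generation is purely memory-bandwidth-bound, which is an optimistic upper bound):

                                                                                                WEIGHTS_GB = 39  # Llama 2 70B Q4_0

                                                                                                # Generation has to stream roughly all weights from memory per token,
                                                                                                # so tokens/sec is capped at bandwidth / model size.
                                                                                                for name, mbw in [("M3 Max, 400GB/s", 400), ("Strix Halo, ~250GB/s", 250)]:
                                                                                                    print(f"{name}: <= {mbw / WEIGHTS_GB:.1f} tok/s text generation")

                                                                                                # Prompt processing at ~19 tok/s over a 4K-token conversation:
                                                                                                print(f"~{4096 / 19 / 60:.1f} min before the first output token")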

                                                                                                                          Having lots of GPU memory even w/ weaker compute/MBW would be good for a few things though:

                                                                                                                          * MoE models - you'd need something like 192GB of VRAM to be able to run DeepSeek V2.5 (21B active, but 236B in weights) at a decent quant - a Q4_0 would be about 134GB to load the weights, but w/ far fewer activations, you would still be able to inference at ~20 tok/s). Still, even with "just" 96GB you should be able to just fit a Mixtral 8x22B, or easily fit one of the new MS (GRIN/Phi MoEs).

                                                                                                                          * Long context - even with kvcache quantization, you need lots of memory for these new big context windows, so having extra memory for much smaller models is still pretty necessary. Especially if you want to do any of the new CoT/reasoning techniques, you will need all the tokens you can get.

                                                                                                                          * Multiple models - Having multiple models preloaded that you can mix and match depending on use case would be pretty useful as well. Even some of the smaller Qwen2.5 models looks like they might do code as well as some much bigger models, you might want a model that's specifically tuned for function calling, a VLM, SRT/TTS, etc. While you might be able to swap adapters for some of this stuff eventually, for now, being able to have multiple models pre-loaded locally would still be pretty convenient.

                                                                                                                          * Batched/offline inference - being able to load up big models would still be really useful if you have any tasks that you could queue up/process overnight. I think these types of tools are actually relatively underexplored atm, but has as many use cases/utility as real-time inferencing.

                                                                                                                          One other thing to note is that on the Mac side, you're mainly relegated to llama.cpp and MLX. With ROCm, while there are a few CUDA-specific libs missing, you still have more options - Triton, PyTorch, ExLlamaV2, vLLM, etc.

                                                                                                                          [1] https://www.nonstopdev.com/llm-performance-on-m3-max/

                                                                                                                      • aurareturn 11 hours ago

                                                                                                                        Yes at around 8 tokens/s. Also quite slow.

                                                                                                                      • Dalewyn 12 hours ago

                                                                                                                        Fast.

                                                                                                                        Large.

                                                                                                                        Cheap.

                                                                                                                        You may only pick two.

                                                                                                                        • meiraleal 11 hours ago

                                                                                            By the standard of a few years ago, the current "small" models like Mistral and Phind are fast, large, and cheap.

                                                                                                                        • stevenhuang 3 hours ago

                                                                                                                          Nearly unusable what? High amount of RAM doesn't help with larger models what?

                                                                                                                          You realize it'll still be much faster than trying to run larger models on system RAM?

                                                                                                                        • jstummbillig 10 hours ago

                                                                                              Also, next year, there will be GPT 5. I find it fascinating how much attention small models get, when at the same time the big models just get bigger and prohibitively expensive to train. No leading lab would do that if they thought there was a decent chance that small models could compete.

                                                                                                                          So who will be interested in a shitty assistant next year when you can have an amazing one, is what I wonder? Is this just the biggest cup of wishful thinking that we have ever seen?

                                                                                                                          • svnt 10 hours ago

                                                                                                                            I’ll flip this around a bit:

                                                                                                                            If I’ve raised $1B to buy GPUs and train a “bigger model”, a major part of my competitive advantage is having $1B to spend on sufficient GPUs to train a bigger model.

                                                                                                                            If, after having raised that money it becomes apparent that consumer hardware can run smaller models that are optimized and perform as well without all that money going into training them, how am I going to pivot my business to something that works, given these smaller models are released this way on purpose to undermine my efforts?

                                                                                                                            It seems there are two major possibilities: one, people raising billions find a new and expensive intelligence step function that at least time-locally separates them from the pack, or two (and significantly more likely in my view) they don’t, and the improvements come from layering on different systems such as do not require acres of GPUs, while the “more data more GPUs” crowd is found to have hit a nonlinearity that in practical terms means they are generations of technology away from the next tier.

                                                                                                                            • rvnx 10 hours ago

                                                                                                                              Mining cryptos, some "AI" companies already do that (knowingly or not... and not necessarily telling investors)

                                                                                                                              • svnt 10 hours ago

                                                                                                                                Is it still even worth the electricity to do this on a GPU? It wouldn’t surprise me if some startups were renting them out, but is anyone still mining any volume of crypto on GPUs?

                                                                                                                                edit: I guess to your point if it is not knowingly then the electricity costs are not a factor either.

                                                                                                                                • ComputerGuru 5 hours ago

                                                                                                                                  > Is it still even worth the electricity to do this on a GPU?

                                                                                                  Only with memecoins.

                                                                                                                              • jstummbillig 9 hours ago

                                                                                                                                What you suggest is not impossible but simply flies in the face of all currently available evidence and what all leading labs say and do. We know they are actively looking for ways to do things more efficiently. OpenAI alone did a couple of releases to that effect. Because of how easy it is to switch providers, if only one lab found a way to run a small model that competed with the big ones, it would simply win the entire space, so everyone has to be looking for that (and clearly they are, given that all of them do have smaller versions of their models)

                                                                                                                                Scepticism is fine, if it's plausible. If not it's conspiratorial.

                                                                                                                                • svnt 9 hours ago

                                                                                                                                  There are at least two different optimizations happening:

                                                                                                                                  1) optimizing the model training

                                                                                                                                  2) optimizing the model operation

                                                                                                                                  The $1B-spend holy grail is that it costs a lot of money to train, and almost nothing to operate, a proprietary model that benchmarks and chats better than anyone else’s.

                                                                                                                                  OpenAI’s optimizations fall into the latter category. The risk to the business model is in the former — if someone can train a world-beating model without lots of money, it’s a tough day for the big players.

                                                                                                                                  • ComputerGuru 5 hours ago

                                                                                                                                    I disagree. Not axiomatically, because you’re kind of right, but enough to comment. OpenAI doesn’t believe in optimizing the training costs of AI but believes in optimizing (read: maxing out) the training phase. Their billions go to collecting, collating, and transforming as much training data as they can get their hands on.

                                                                                                                                    To see what optimizing model operation looks like, groq is a good example. OpenAI isn’t (yet) obviously in that kind of optimization, though I’m sure they’re working on it internally.

                                                                                                                              • Larrikin 10 hours ago

                                                                                                                                Why would anyone buy a Raspberry Pi when they can get a fully decked out Mac Pro?

                                                                                                                                There are different use cases and computers are already pretty powerful. Maybe your local model won't be able to produce tests that check all the corner cases of the class you just wrote for work in your massive code base.

                                                                                                                                But the small model is perfectly capable of summarizing the weather from an API call and maybe tack on a joke that can be read out to you on your speakers in the morning.

                                                                                                                                • talldayo 9 hours ago

                                                                                                                                  > Why would anyone buy a Raspberry Pi when they can get a fully decked out Mac Pro?

                                                                                                                                  They want compliant Linux drivers?

                                                                                                                                  • MrDrMcCoy 7 hours ago

                                                                                                                                    Since when did Broadcom provide those?

                                                                                                                                    • talldayo 5 hours ago

                                                                                                                                      Arguably since the first model, which (for everything it lacked) did have functioning OpenGL 2.0-compliant drivers.

                                                                                                                                • archagon 10 hours ago

                                                                                                                                  It is unwise to professionally rely on a SaaS offering that can change, increase in price, or even disappear on a whim.

                                                                                                                                  • jabroni_salad 10 hours ago

                                                                                                                                    One of the reasons I run local is that the models are completely uncensored and unfiltered. If you're doing anything slightly 'risky' the only thing APIs are good for is a slew of very politely written apology letters, and the definition of 'risky' will change randomly without notice or fail to accommodate novel situations.

                                                                                                                                    It is also evident in the moderation that your usage is subject to human review and I don't think that should even be possible.

                                                                                                                                    • Tempest1981 10 hours ago

                                                                                                                                      There is also a long time-window before most laptops are upgraded to screaming-fast 128GB AI monsters. Either way, it will be fun to watch the battle.

                                                                                                                                      • stevenhuang 3 hours ago

                                                                                                                                        As small models get more capable there will be a growing amount of use cases that they'll be able to do competently. Is that so hard to believe?

                                                                                                                                        Leave the problems that require competent reasoning ability to the larger models.

                                                                                                                                    • wazdra 10 hours ago

                                                                                                                                      I'd like to point out that Llama 3.1 is not open source[1] (I was recently made aware of that fact by [2], when it was on the HN front page). While it's very nice to see a peak of interest in local, "open-weights" LLMs, "open source" is an unfortunate choice of words, as it obscures the quite important differences between Llama's license model and open source. The license question does not seem to be addressed at all in the article.

                                                                                                                                      [1]: https://www.llama.com/llama3_1/license/

                                                                                                                                      [2]: https://csvbase.com/blog/14

                                                                                                                                      • sergiotapia 8 hours ago

                                                                                                                                        that ship sailed 13 years ago dude.

                                                                                                                                      • leshokunin 12 hours ago

                                                                                                                                        I like self hosting random stuff on docker. Ollama has been a great addition. I know it's not, but it feels on par with ChatGPT.

                                                                                                                                        It works perfectly on my 4090, but I've also seen it work perfectly on my friend's M3 laptop. It feels like an excellent alternative for when you don't need the heavy weights, but want something bespoke and private.

                                                                                                                                        I've integrated it with my Obsidian notes for 1) note generation 2) fuzzy search.

                                                                                                                                        I've used it as an assistant for mental health and medical questions.

                                                                                                                                        I'd much rather use it to query things about my music or photos than whatever the big players have planned.

                                                                                                                                        • vunderba 9 hours ago

                                                                                                                                          There's actually a very popular plugin for Obsidian that integrates RAG + LLM into Obsidian called Smart Connections.

                                                                                                                                          https://github.com/brianpetro/obsidian-smart-connections

                                                                                                                                          • ekabod 8 hours ago

                                                                                                                                            Ollama is not a model; it is the software used to run models.

                                                                                                                                            • Havoc an hour ago

                                                                                                                                              Not even that - it's a wrapper around the software that runs the model

                                                                                                                                            • exe34 12 hours ago

                                                                                                                                              which model are you using? what size/quant/etc?

                                                                                                                                              thanks!

                                                                                                                                              • smcleod 11 hours ago

                                                                                                                                                Come join us on Reddit’s /r/localllama. Great community for local LLMs.

                                                                                                                                                • rkwz 11 hours ago

                                                                                                                                                  Not the parent, but I started using Llama 3.1 8b and it's very good.

                                                                                                                                                  I'd say it's as good as or better than GPT 3.5 based on my usage. Some benchmarks: https://ai.meta.com/blog/meta-llama-3-1/

                                                                                                                                                  Looking forward to trying other models like Qwen and Phi in the near future.

                                                                                                                                                  • milleramp 9 hours ago

                                                                                                                                                    I found it to not be as good in my case for code generation and suggestions. I am using a quantized version; maybe that's the difference.

                                                                                                                                                  • wongarsu 9 hours ago

                                                                                                                                                    I'd be interested in other people's recommendations as well. Personally I'm mostly using openchat with q5_k_m quantization.

                                                                                                                                                    OpenChat is imho one of the best 7B models, and while I could run bigger models at least for me they monopolize too many resources to keep them loaded all the time.

                                                                                                                                                    • axpy906 12 hours ago

                                                                                                                                                      Agree. Please provide more details on this setup or a link.

                                                                                                                                                      • deegles 12 hours ago

                                                                                                                                                        Just try a few models on your machine? It takes seconds plus however long it takes to download the model.

                                                                                                                                                        • exe34 9 hours ago

                                                                                                                                                          I would prefer to have some personal recommendations - I've had some success with Llama3.1-8B/8bits and Llama3.1-70B/1bit, but this is a fast moving field, so I think it's worth the details.

                                                                                                                                                          • NortySpock 9 hours ago

                                                                                                                                                            New LLM Prompt:

                                                                                                                                                            Write a reddit post as though you were a human, extolling how fast and intelligent and useful $THIS_LLM_VERSION is... Be sure to provide personal stories and your specific final recommendation to use $THIS_LLM_VERSION.

                                                                                                                                                  • albertgoeswoof 11 hours ago

                                                                                                                                                    I have a three-year-old M1 Max with 32 GB RAM. Llama 8B runs at 25 tokens/sec, which is fast enough and covers 80% of what I need. On my Ryzen 5600H machine, I get about 10 tokens/second, which is slow enough to be annoying.

                                                                                                                                                    If I get stuck on a problem, I switch to ChatGPT or phind.com and see what that gives. Sometimes it’s not the LLM that helps, but changing the context and rewriting the question.

                                                                                                                                                    However I cannot use the online providers for anything remotely sensitive, which is more often than you might think.

                                                                                                                                                    Local LLMs are the future, it’s like having your own private Google running locally.

                                                                                                                                                    • fsmv 11 hours ago

                                                                                                                                                      A small model necessarily is missing many facts. The large model is the one that has memorized the whole internet, the small one is just trained to mimic the big one.

                                                                                                                                                      You simply cannot compress the whole internet into under 10 GB without throwing out a lot of information.

                                                                                                                                                      Please be careful about what you take as fact coming from the local model output. Small models are better suited to summarization.

                                                                                                                                                      • albertgoeswoof 8 hours ago

                                                                                                                                                        I don’t trust anything as fact coming out of these models. I ask it for how to structure solutions, with examples. Then I read the output and research the specifics before using anything further.

                                                                                                                                                        I wouldn’t copy and paste from even the smartest minds, nevermind a model output.

                                                                                                                                                      • staticman2 11 hours ago

                                                                                                                                                        I'm really curious what you are doing with an LLM that can be solved 80% of the time with a 8b model.

                                                                                                                                                        • albertgoeswoof 8 hours ago

                                                                                                                                                          It’s mostly “how would you solve this programming problem?”, reminders on syntax, scaffolding a configuration file, etc.

                                                                                                                                                          Often it’s a form of rubber duck programming, with a smarter rubber duck.

                                                                                                                                                          • skydhash 5 hours ago

                                                                                                                                                            All of this can be solved with a 3-20MB PDF file, a 10kb snippet/template file, and a whiteboard.

                                                                                                                                                            • rNULLED 4 hours ago

                                                                                                                                                              Don’t forget the duct tape

                                                                                                                                                        • meiraleal 11 hours ago

                                                                                                                                                          We need browser- and OS-level (mobile) API integration with local LLMs.

                                                                                                                                                      • Anunayj 11 hours ago

                                                                                                                                                        I recently experimented with running llama-3.1-8b-instruct locally on my consumer hardware, aka my Nvidia RTX 4060 with 8 GB VRAM, as I wanted to experiment with prompting PDFs with a large context, which is extremely expensive with how LLMs are priced.

                                                                                                                                                        I was able to fit the model entirely on the GPU, with decent speeds (30 tokens/second) and a 20k-token context.
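
                                                                                                                                                        As a rough illustration of what that looks like in code, here's a minimal sketch with llama-cpp-python; the GGUF path, quant, and parameters are illustrative assumptions, not the exact setup described above:

                                                                                                                                                            # Sketch: load a quantized Llama 3.1 8B GGUF entirely onto the GPU with a ~20k context.
                                                                                                                                                            # Assumes llama-cpp-python built with CUDA support and a model file already downloaded.
                                                                                                                                                            from llama_cpp import Llama

                                                                                                                                                            llm = Llama(
                                                                                                                                                                model_path="./Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # hypothetical local path
                                                                                                                                                                n_gpu_layers=-1,  # offload all layers to the GPU
                                                                                                                                                                n_ctx=20480,      # ~20k-token context window
                                                                                                                                                            )

                                                                                                                                                            out = llm.create_chat_completion(
                                                                                                                                                                messages=[{"role": "user", "content": "Summarize the following text: ..."}],
                                                                                                                                                                max_tokens=512,
                                                                                                                                                            )
                                                                                                                                                            print(out["choices"][0]["message"]["content"])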

                                                                                                                                                        For summarization, the performance of these models is decent enough. Unfortunately, in my use case I felt that using Gemini's free tier, with its multimodal capabilities and much better quality output, made running local LLMs not really worth it as of right now, at least for consumers.

                                                                                                                                                        • mistrial9 8 hours ago

                                                                                                                                                          you moved the goalposts when you added 'multimodal' there; another point is that no one reads PDF tables and illustrations perfectly, at any price, AFAIK

                                                                                                                                                          • ComputerGuru 5 hours ago

                                                                                                                                                            Supposedly, submitting screenshots of PDFs (at a large enough zoom per tile/page) to OpenAI's GPT-4o or Google's equivalent is currently the best way of handling charts and tables.

                                                                                                                                                        • RevEng 7 hours ago

                                                                                                                                                          I'm currently working on an LLM-based product for a large company that's used in circuit design. Our customers have very strict confidentiality requirements since the field is very competitive and they all have trade secret technologies that give them significant competitive advantages. Using something public like ChatGPT is simply out of the question. Their design environments are often completely disconnected from the public internet, so our tools need to run local models. Llama 3 has worked well for us so far and we're looking at other models too. We also prefer not being locked in to a specific vendor like OpenAI, since our reliance on the model puts us in a poor position to negotiate and the future of AI companies isn't guaranteed.

                                                                                                                                                          For my personal use, I also prefer to use local models. I'm not a fan of OpenAI's shenanigans and Google already abuses its customers data. I also want the ability to make queries on my own local files without having to upload all my information to a third party cloud service.

                                                                                                                                                          Finally, fine tuning is very valuable for improving performance in niche domains where public data isn't generally available. While online providers do support fine tuning through their services, this results in significant lock in as you have to pay them to do the tuning in their servers, you have to provide them with all your confidential data, and they own the resulting model which you can only use through their service. It might be convenient at first, but it's a significant business risk.

                                                                                                                                                          • McBainiel 12 hours ago

                                                                                                                                                            > Microsoft used LLMs to write millions of short stories and textbooks in which one thing builds on another. The result of training on this text, Bubeck says, is a model that fits on a mobile phone but has the power of the initial 2022 version of ChatGPT.

                                                                                                                                                            I thought training LLMs on content created by LLMs was ill-advised but this would suggest otherwise

                                                                                                                                                            • andai 12 hours ago

                                                                                                                                                              Look into Microsoft's Phi papers. The whole idea here is that if you train models on higher quality data (i.e. textbooks instead of blogspam) you get higher quality results.

                                                                                                                                                              The exact training is proprietary but they seem to use a lot of GPT-4 generated training data.

                                                                                                                                                              On that note... I've often wondered if broad memorization of trivia is really a sensible use of precious neurons. It seems like a system trained on a narrower range of high quality inputs would be much more useful (to me) than one that memorized billions of things I have no interest in.

                                                                                                                                                              At least at the small model scale, the general knowledge aspect seems to be very unreliable anyways -- so why not throw it out entirely?

                                                                                                                                                              • throwthrowuknow 12 hours ago

                                                                                                                                                                The trivia include information about many things: grammar, vocabulary, slang, entity relationships, metaphor, among others but chiefly they also constitute models of human thought and behaviour. If all you want is a fancy technical encyclopedia then by all means chop away at the training set but if you want something you can talk to then you’ll need to keep the diversity.

                                                                                                                                                                • visarga 11 hours ago

                                                                                                                                                                  > you’ll need to keep the diversity.

                                                                                                                                                                  You can get diverse low quality data from the web, but for diverse high quality data the organic content is exhausted. The only way is to generate it, and you can maintain a good distribution by structured randomness. For example just sample 5 random words from the dictionary and ask the model to compose a piece of text from them. It will be more diverse than web text.
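
                                                                                                                                                                  A toy sketch of that kind of structured randomness (assuming the `ollama` Python package, a running Ollama server with llama3.1 pulled, and a standard Unix wordlist -- all assumptions, not part of the comment above):

                                                                                                                                                                      # Sample a few random dictionary words and ask a local model to weave them into a passage.
                                                                                                                                                                      import random
                                                                                                                                                                      import ollama

                                                                                                                                                                      with open("/usr/share/dict/words") as f:  # common wordlist location; adjust for your system
                                                                                                                                                                          words = [w.strip() for w in f if w.strip().isalpha()]

                                                                                                                                                                      seed_words = random.sample(words, 5)
                                                                                                                                                                      prompt = ("Write a short, coherent piece of text that naturally uses all of these words: "
                                                                                                                                                                                + ", ".join(seed_words))

                                                                                                                                                                      resp = ollama.chat(model="llama3.1", messages=[{"role": "user", "content": prompt}])
                                                                                                                                                                      print(seed_words)
                                                                                                                                                                      print(resp["message"]["content"])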

                                                                                                                                                                  • throwthrowuknow 4 hours ago

                                                                                                                                                                    not exhausted, just not currently being collected. Generating via existing models is ok for distilling a better training set or refining existing low quality samples but won’t break out of distribution without some feedback mechanism. That’s why simulation is promising but it’s pretty narrow at the moment. There’s still a lot of space to fill in the point cloud so coming up with novel data collection methods is important. I think this is off topic though, my original contention was if you take too thin of a slice you won’t get a very useful model.

                                                                                                                                                                • deegles 12 hours ago

                                                                                                                                                                  You're not just memorizing text though. Each piece of trivia is something that represents coherent parts of reality. Think of it as being highly compressed.

                                                                                                                                                                  • ComputerGuru 5 hours ago

                                                                                                                                                                    > I've often wondered if broad memorization of trivia is really a sensible use of precious neurons.

                                                                                                                                                                    I agree if we are talking about maxing out raw reasoning and logical inference abilities, but the problem is that the ship has sailed and people expect LLMs to have domain knowledge (even more than expert users are clamoring for LLMs to have better logic).

                                                                                                                                                                    I bet a model with actual human “intelligence” but no Google-scale encyclopedic knowledge of the world it lives in would be scored less preferentially by the masses than what we have now.

                                                                                                                                                                    • snovv_crash 11 hours ago

                                                                                                                                                                      From what I've seen Phi does well in benchmarks but poorly in real world scenarios. They also made some odd decisions regarding the network structure which means that the memory requirements for larger context is really high.

                                                                                                                                                                    • kkielhofner 12 hours ago

                                                                                                                                                                      Synthetic data (data from some kind of generative AI) has been used in some form or another for quite some time[0]. The license for LLaMA 3.1 has been updated to specifically allow its use for generation of synthetic training data. Famously, there is a ToS clause from OpenAI in terms of using them for data generation for other models but it's not enforced ATM. It's pretty common/typical to look through a model card, paper, etc and see the use of an LLM or other generative AI for some form of synthetic data generation in the development process - various stages of data prep, training, evaluation, etc.

                                                                                                                                                                      Phi is another really good example but that's already covered from the article.

                                                                                                                                                                      [0] - https://www.latent.space/i/146879553/synthetic-data-is-all-y...

                                                                                                                                                                      • mrbungie 12 hours ago

                                                                                                                                                                        I would guess correctly aligned and/or finely filtered synthetic data coming from LLMs may be good.

                                                                                                                                                                        Mode collapse theories (and the simplified models used as proof of the existence of said problem) assume the affected LLMs are going to be trained on poor-quality, LLM-generated batches of text from the internet (i.e. Reddit or other social networks).

                                                                                                                                                                        • sandwichmonger 11 hours ago

                                                                                                                                                                          That's the number one way of getting mad LLM disease. Feeding LLMs to LLMs.

                                                                                                                                                                          • gugagore 12 hours ago

                                                                                                                                                                            Generally (not just for LLMs) this is called student-teacher training and/or knowledge distillation.
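
                                                                                                                                                                            For the curious, the textbook student-teacher setup is just an extra loss term that pushes the student toward the teacher's softened output distribution. A minimal PyTorch sketch (the temperature, weighting, and toy tensors are illustrative):

                                                                                                                                                                                import torch
                                                                                                                                                                                import torch.nn.functional as F

                                                                                                                                                                                def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
                                                                                                                                                                                    # Soft targets: match the teacher's temperature-smoothed distribution.
                                                                                                                                                                                    soft = F.kl_div(
                                                                                                                                                                                        F.log_softmax(student_logits / T, dim=-1),
                                                                                                                                                                                        F.softmax(teacher_logits / T, dim=-1),
                                                                                                                                                                                        reduction="batchmean",
                                                                                                                                                                                    ) * (T * T)  # rescale so gradients stay comparable across temperatures
                                                                                                                                                                                    # Hard targets: the usual cross-entropy on ground-truth labels.
                                                                                                                                                                                    hard = F.cross_entropy(student_logits, labels)
                                                                                                                                                                                    return alpha * soft + (1 - alpha) * hard

                                                                                                                                                                                # Toy usage: a batch of 4 examples over a 10-token "vocabulary".
                                                                                                                                                                                student = torch.randn(4, 10, requires_grad=True)
                                                                                                                                                                                teacher = torch.randn(4, 10)
                                                                                                                                                                                labels = torch.randint(0, 10, (4,))
                                                                                                                                                                                print(distillation_loss(student, teacher, labels))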

                                                                                                                                                                            • calf 11 hours ago

                                                                                                                                                                              It reminds me of when I take notes from a textbook then intensively review my own notes

                                                                                                                                                                              • solardev 11 hours ago

                                                                                                                                                                                And then when it comes time for the test, I end up hallucinating answers too.

                                                                                                                                                                            • staticman2 10 hours ago

                                                                                                                                                                              There have been efforts to train small LLMs on bigger LLMs. Ever since Llama came out, the community has been creating custom fine-tunes this way using ChatGPT.

                                                                                                                                                                              • brap 12 hours ago

                                                                                                                                                                                I think it can be a tradeoff to get to smaller models. Use larger models trained on the whole internet to produce output that would train the smaller model.

                                                                                                                                                                                • moffkalast 12 hours ago

                                                                                                                                                                                  As others point out, it's essentially distillation of a larger model to a smaller one. But you're right, it doesn't work very well. Phi's performance is high on benchmarks but not nearly as good in actual real world usage. It is extremely overfit on a narrow range of topics in a narrow format.

                                                                                                                                                                                  • iJohnDoe 12 hours ago

                                                                                                                                                                                    > Microsoft used LLMs to write millions of short stories and textbooks

                                                                                                                                                                                    Millions? Where are they? Where are they used?

                                                                                                                                                                                    • HPsquared 11 hours ago

                                                                                                                                                                                      Model developers don't usually release training data like that.

                                                                                                                                                                                  • dockerd 12 hours ago

                                                                                                                                                                                    What specs do people here recommend to run small models like Llama 3.1, mistral-nemo, etc.?

                                                                                                                                                                                    Also, is it sensible to wait for the newer Mac, AMD, and Nvidia hardware releasing soon?

                                                                                                                                                                                    • freeone3000 10 hours ago

                                                                                                                                                                                      M4s are releasing in probably a month or two; if you’re going Apple, it might be worth waiting for either those or the price drop on the older models.

                                                                                                                                                                                      • noman-land 10 hours ago

                                                                                                                                                                                        You basically need as much RAM as the size of the model.
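
                                                                                                                                                                                        As a back-of-envelope rule in code form (the bits-per-weight figure is an approximation; common Q4 quants average roughly 4.5-5 bits per weight):

                                                                                                                                                                                            # Rough memory estimate: parameters x bits_per_weight / 8, plus headroom
                                                                                                                                                                                            # for the KV cache and runtime buffers.
                                                                                                                                                                                            def model_size_gb(params_billion, bits_per_weight=4.5):
                                                                                                                                                                                                return params_billion * 1e9 * bits_per_weight / 8 / 1e9

                                                                                                                                                                                            for p in (8, 70):
                                                                                                                                                                                                print(f"{p}B @ ~4.5 bpw: about {model_size_gb(p):.1f} GB + KV cache/overhead")
                                                                                                                                                                                            # 8B comes out around 4.5 GB, 70B around 39 GB, before context overhead.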

                                                                                                                                                                                        • zozbot234 10 hours ago

                                                                                                                                                                                          You actually need a lot less than that if you use the mmap option, because then only the activations need to be stored in RAM; the model itself can be read from disk.

                                                                                                                                                                                          • noman-land 10 hours ago

                                                                                                                                                                                            Can you say a bit more about this? Based on my non-scientific personal experience on an M1 with 64gb memory, that's approximately what it seems to be. If the model is 4gb in size, loading it up and doing inference takes about 4gb of memory. I've used LM Studio and llamafiles directly and both seem to exhibit this behavior. I believe llamafiles use mmap by default based on what I've seen jart talk about. LM Studio allows you to "GPU offload" the model by loading it partially or completely into GPU memory, so not sure what that means.

                                                                                                                                                                                            • andersa 18 minutes ago

                                                                                                                                                                                              You missed the part where this is slow as hell.

                                                                                                                                                                                              • ycombinatrix 6 hours ago

                                                                                                                                                                                                How does one set this up?

                                                                                                                                                                                                • ignoramous 3 hours ago

                                                                                                                                                                                                  With ggml the mmap part is the default. It isn't a panacea though [0]. Note that most runtimes (like MLX, ONNX, TensorFlow, JAX/XLA etc) will employ a number of techniques for efficient inference and mmap is just one part of it.

                                                                                                                                                                                                  [0] https://news.ycombinator.com/item?id=35455930
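
                                                                                                                                                                                                  To make that concrete, with the llama-cpp-python bindings the mmap behaviour is exposed as constructor flags (a sketch; the model path and parameters are illustrative assumptions):

                                                                                                                                                                                                      # mmap keeps the weights on disk and lets the OS page them in on demand,
                                                                                                                                                                                                      # so resident RAM can stay well below the full model size (at some speed cost).
                                                                                                                                                                                                      from llama_cpp import Llama

                                                                                                                                                                                                      llm = Llama(
                                                                                                                                                                                                          model_path="./llama-3.1-8b-instruct-q4_k_m.gguf",  # hypothetical path
                                                                                                                                                                                                          use_mmap=True,    # the default: memory-map the weights from disk
                                                                                                                                                                                                          use_mlock=False,  # don't pin the mapping into RAM
                                                                                                                                                                                                          n_ctx=4096,
                                                                                                                                                                                                      )
                                                                                                                                                                                                      print(llm("Q: Why use mmap here? A:", max_tokens=64)["choices"][0]["text"])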

                                                                                                                                                                                          • atentaten 2 hours ago

                                                                                                                                                                                            I'm interested in running locally, but I haven't found consistent advice on hardware specs for optimal performance. I would like to build a server with the best GPU and tons of RAM to run and experiment with these models.

                                                                                                                                                                                            • Havoc an hour ago

                                                                                                                                                                                              Either get a Mac with loads of mem or build a rig with a 3090 or three

                                                                                                                                                                                            • Archit3ch 8 hours ago

                                                                                                                                                                                              One use case I've found very convenient: partial screenshot |> minicpm-v

                                                                                                                                                                                              Covers 90% of OCR needs with 10% of the effort. No API keys, scripting, or network required.
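
                                                                                                                                                                                              One way to wire that up, assuming Ollama is running and minicpm-v has been pulled, with the screenshot already saved to disk (a sketch, not necessarily the exact pipeline above):

                                                                                                                                                                                                  # Send a screenshot to a local vision model for quick OCR-style extraction.
                                                                                                                                                                                                  import sys
                                                                                                                                                                                                  import ollama

                                                                                                                                                                                                  image_path = sys.argv[1] if len(sys.argv) > 1 else "screenshot.png"  # wherever your screenshot tool saves

                                                                                                                                                                                                  resp = ollama.chat(
                                                                                                                                                                                                      model="minicpm-v",
                                                                                                                                                                                                      messages=[{
                                                                                                                                                                                                          "role": "user",
                                                                                                                                                                                                          "content": "Transcribe all text in this image, preserving line breaks.",
                                                                                                                                                                                                          "images": [image_path],
                                                                                                                                                                                                      }],
                                                                                                                                                                                                  )
                                                                                                                                                                                                  print(resp["message"]["content"])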

                                                                                                                                                                                              • andai 12 hours ago

                                                                                                                                                                                                What local models is everyone using?

                                                                                                                                                                                                The last one I used was Llama 3.1 8B which was pretty good (I have an old laptop).

                                                                                                                                                                                                Has there been any major development since then?

                                                                                                                                                                                                • esoltys 12 hours ago

                                                                                                                                                                                                  I like [mistral-nemo](https://ollama.com/library/mistral-nemo) "A state-of-the-art 12B model with 128k context length, built by Mistral AI in collaboration with NVIDIA."

                                                                                                                                                                                                  • alanzhuly 6 hours ago

                                                                                                                                                                                                    I like the latest qwen2.5 (https://nexaai.com/Qwen/Qwen2.5-0.5B-Instruct/gguf-q4_0/read...). It was just released last week. It is one of the best small language models right now according to benchmarks. And it is small and fast!

                                                                                                                                                                                                    • demarq 12 hours ago

                                                                                                                                                                                                      Nada to be honest. I keep trying every new model, and invariably go back to llama 8b.

                                                                                                                                                                                                      Llama8b is the new mistral.

                                                                                                                                                                                                      • moffkalast 12 hours ago

                                                                                                                                                                                                        Qwen 2.5 was just released, with a surprising number of sizes. The 14B and 32B look pretty promising for their size class, but it's hard to tell yet.

                                                                                                                                                                                                      • HexDecOctBin 12 hours ago

                                                                                                                                                                                                        May as well ask here: what is the best way to use something like an LLM as a personal knowledge base?

                                                                                                                                                                                                        I have a few thousand books, papers, and articles collected over the last decade. And while I have meticulously categorised them for fast lookup, it's getting harder and harder to search for the desired info, especially in categories I might not have explored recently.

                                                                                                                                                                                                        I do have a 4070 (12 GB VRAM), so I thought that LLMs might be a solution. But trying to figure out the whats and hows has proven to be extremely complicated, what with the deluge of techniques (fine-tuning, RAG, quantisation) that may or may not be obsolete, too many grifters hawking their own startups with thin wrappers, and a general sense that the "new shiny object" is prioritised more than actual stable solutions to real problems.

                                                                                                                                                                                                        • routerl 11 hours ago

                                                                                                                                                                                                          Imho (and I'm no expert), this has been working well for me:

                                                                                                                                                                                                          Segment the texts into chunks that make sense (i.e. into the lengths of text you'll want to find, whether this means chapters, sub-chapters, paragraphs, etc), create embeddings of each chunk, and store the resultant vectors in a vector database. Your search workflow will then be to create an embedding of your query, and perform a distance comparison (e.g. cosine similarity) which returns ranked results. This way you can now semantically search your texts.

                                                                                                                                                                                                          Everything I've mentioned above is fairly easily doable with existing LLM libraries like langchain or llamaindex. For reference, this is an RAG workflow.
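
                                                                                                                                                                                                          A minimal sketch of that retrieval step with sentence-transformers and a brute-force cosine search (the model name, file, and chunk sizes are illustrative; a vector database, as wrapped by langchain or llamaindex, replaces the NumPy part at scale):

                                                                                                                                                                                                              # Chunk documents, embed the chunks, then embed the query and rank chunks by cosine similarity.
                                                                                                                                                                                                              import numpy as np
                                                                                                                                                                                                              from sentence_transformers import SentenceTransformer

                                                                                                                                                                                                              model = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly embedding model

                                                                                                                                                                                                              def chunk(text, size=800, overlap=100):
                                                                                                                                                                                                                  # Overlapping character windows; paragraph or chapter splits work too.
                                                                                                                                                                                                                  return [text[i:i + size] for i in range(0, len(text), size - overlap)]

                                                                                                                                                                                                              docs = {"paper.txt": open("paper.txt").read()}  # your library of texts
                                                                                                                                                                                                              chunks = [(name, c) for name, text in docs.items() for c in chunk(text)]
                                                                                                                                                                                                              emb = model.encode([c for _, c in chunks], normalize_embeddings=True)

                                                                                                                                                                                                              def search(query, k=5):
                                                                                                                                                                                                                  q = model.encode([query], normalize_embeddings=True)[0]
                                                                                                                                                                                                                  scores = emb @ q  # cosine similarity, since the vectors are normalized
                                                                                                                                                                                                                  return [(chunks[i][0], chunks[i][1][:120], float(scores[i])) for i in np.argsort(-scores)[:k]]

                                                                                                                                                                                                              for name, snippet, score in search("quantization of small language models"):
                                                                                                                                                                                                                  print(f"{score:.3f}  {name}  {snippet!r}")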

                                                                                                                                                                                                          • dchuk 11 hours ago

                                                                                                                                                                                                            • meonkeys 8 hours ago

                                                                                                                                                                                                              https://khoj.dev promises this.

                                                                                                                                                                                                            • vessenes 11 hours ago

                                                                                                                                                                                                              All this will be an interesting side note in the history of language models in the next eight months when roughly 1.5 billion iPhone users will get a local language model tied seamlessly to a mid-tier cloud based language model native in their OS.

                                                                                                                                                                                                              What I think will be interesting is seeing which of the open models stick around and for how long when we have super easy ‘good enough’ models that provide quality integration. My bet is not many, sadly. I’m sure Llama will continue to be developed, and perhaps Mistral will get additional European government support, and we’ll have at least one offering from China like Qwen, and Bytedance and Tencent will continue to compete a-la Google and co. But, I don’t know if there’s a market for ten separately trained open foundation models long term.

                                                                                                                                                                                                              I’d like to make sure there’s some diversity in research and implementation of these in the open access space. It’s a critical tool for humans, and it seems possible to me that leaders will be able to keep extending the gap for a while; when you’re using that gap not just to build faster AI, but do other things, the future feels pretty high volatility right now. Which is interesting! But, I’d prefer we come out of it with people all over the world having access to these.

                                                                                                                                                                                                              • jannyfer 11 hours ago

                                                                                                                                                                                                                > in the next eight months when roughly 1.5 billion iPhone users will get a local language model tied seamlessly to a mid-tier cloud based language model native in their OS.

                                                                                                                                                                                                                Only iPhone 15 Pro or later will get Apple Intelligence, so the number will be wayyy smaller.

                                                                                                                                                                                                                • visarga 11 hours ago

                                                                                                                                                                                                                  Not in EU they won't.

                                                                                                                                                                                                                • darby_nine 11 hours ago

                                                                                                                                                                                                                  I expect people will just ship with their own model where the built-in one isn't sufficient.

                                                                                                                                                                                                                  When people describe it as a "critical tool" i feel like I'm missing basic information about how people use computers and interact with the world. In what way is it critical for anything? It's still just a toy at this point.

                                                                                                                                                                                                                  • qingdao99 8 hours ago

                                                                                                                                                                                                                    When it's expected to be handling reminders, calendar events, and other device functions for millions of users, it will be considered critical.

                                                                                                                                                                                                                • shahzaibmushtaq 9 hours ago

                                                                                                                                                                                                                  I need to have two things of my own that work offline for privacy concerns and cost savings:

                                                                                                                                                                                                                  1. Local LLM AI models with GUI and command line

                                                                                                                                                                                                                  2. > Local LLM-based coding tools do exist (such as Google DeepMind’s CodeGemma and one from California-based developers Continue)

                                                                                                                                                                                                                  • rthaswrea 4 hours ago
                                                                                                                                                                                                                    • aledalgrande 9 hours ago

                                                                                                                                                                                                                      Does anyone know of a local "Siri" implementation? Whisper + Llama (or Phi or something else), that can run shortcuts, take notes, read web pages etc.?

                                                                                                                                                                                                                      PS: for reading web pages I know there's voices integrated in the browser/OS but those are horrible

                                                                                                                                                                                                                      • gardnr 8 hours ago

                                                                                                                                                                                                                        Edit: I just found this. I'll give it a try today: https://github.com/0ssamaak0/SiriLLama

                                                                                                                                                                                                                        ---

                                                                                                                                                                                                                        Open WebUI has a voice chat but the voices are not great. I'm sure they'd love a PR that integrates StyleTTS2.

You can give it a Serper API key and it will search the web to use as context. Open WebUI connects to ollama running on a Linux box with a $300 RTX 3060 with 12GB of VRAM. The 4-bit quant of Llama 3.1 8B takes up a bit more than 6GB of VRAM, which means the card can run embedding models and STT at the same time.

                                                                                                                                                                                                                        12GB is the minimum I'd recommend for running quantized models. The RTX 4070 Ti Super is 3x the cost but 7 times "faster" on matmuls.
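
The back-of-the-envelope arithmetic behind those numbers, if it helps anyone sizing a card (all figures approximate):

    # Rough VRAM estimate for a quantized model; numbers are approximate.
    params_b = 8           # Llama 3.1 8B
    bits_per_weight = 4.5  # ~4-bit quant plus per-group scales/zero-points
    weights_gb = params_b * bits_per_weight / 8   # ~4.5 GB of weights
    overhead_gb = 1.5      # KV cache, activations, CUDA context (varies with context length)
    print(f"~{weights_gb + overhead_gb:.1f} GB")  # ~6 GB, matching the figure above
    # On a 12 GB card that leaves several GB free for an embedding model and STT.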

                                                                                                                                                                                                                        The AMD cards do inference OK but they are a constant source of frustration when trying to do anything else. I bought one and tried for 3 months before selling it. It's not worth the effort.

                                                                                                                                                                                                                        I don't have any interest in allowing it to run shortcuts. Open WebUI has pipelines for integrating function calling. HomeAssistant has some integrations if that's the kind of thing you are thinking about.

                                                                                                                                                                                                                      • xenospn 9 hours ago

                                                                                                                                                                                                                        Apple intelligence?

                                                                                                                                                                                                                        • aledalgrande 9 hours ago

                                                                                                                                                                                                                          It isn't clear if you can know when the task gets handed off to their servers. But yeah that'd be the closest I know. I'm not sure it would build a local knowledge base though.

                                                                                                                                                                                                                      • alanzhuly 5 hours ago

                                                                                                                                                                                                                        For anyone looking for a simple alternative for running local models beyond just text, Nexa AI has built an SDK that supports text, audio (STT, TTS), image generation (e.g., Stable Diffusion), and multimodal models! It also has a model hub to help you easily find local models optimized for your device.

                                                                                                                                                                                                                        Nexa AI local model hub: https://nexaai.com/ Toolkit: https://github.com/NexaAI/nexa-sdk

It also comes with a built-in local UI and an OpenAI-compatible API (with JSON schema for function calling and streaming) to make getting started with local development easy.

                                                                                                                                                                                                                        You can run the Nexa SDK on any device with a Python environment—and GPU acceleration is supported!

                                                                                                                                                                                                                        Local LLMs, and especially multimodal local models are the future. It is the only way to make AI accessible (cost-efficient) and safe.

                                                                                                                                                                                                                        • brap 12 hours ago

                                                                                                                                                                                                                          Some companies (OpenAI, Anthropic…) base their whole business on hosted closed source models. What’s going to happen when all of this inevitably gets commoditized?

                                                                                                                                                                                                                          This is why I’m putting my money on Google in the long run. They have the reach to make it useful and the monetization behemoth to make it profitable.

                                                                                                                                                                                                                          • csmpltn 12 hours ago

                                                                                                                                                                                                                            There's plenty of competition in this space already, and it'll only get accelerated with time. There's not enough "moat" in building proprietary LLMs - you can tell by how the leading companies in this space are basically down to fighting over patents and regulatory capture (ie. mounting legal and technical barriers to scraping, procuring hardware, locking down datasets, releasing less information to the public about how the models actually work behind the scenes, lobbying for scary-yet-vague AI regulation, etc).

                                                                                                                                                                                                                            It's fizzling out.

                                                                                                                                                                                                                            The current incumbents are sitting on multi-billion dollar valuations and juicy funding rounds. This buys runtime for a good couple of years, but it won't last forever. There's a limit to what can be achieved with scraped datasets and deep Markov chains.

                                                                                                                                                                                                                            Over time, it will become difficult to judge what makes one general-purpose LLM be any better than another general-purpose LLM. A new release isn't necessarily performing better or producing better quality results, and it may even regress for many use-cases (we're already seeing this with OpenAI's latest releases).

Competitors will have caught up to each other, and there shouldn't be any major differences between Claude, ChatGPT, Gemini, etc - after all, they should all produce near-identical answers, given identical scenarios. The pace of innovation flattens out.

                                                                                                                                                                                                                            Eventually, the technology will become wide-spread, cheap and ubiquitous. Building a (basic, but functional) LLM will be condensed down to a course you take at university (the same way people build basic operating systems and basic compilers in school).

                                                                                                                                                                                                                            The search for AGI will continue, until the next big hype cycle comes up in 5-10 years, rinse and repeat.

                                                                                                                                                                                                                            You'll have products geared at lawyers, office workers, creatives, virtual assistants, support departments, etc. We're already there, and it's working great for many use-cases - but it just becomes one more tool in the toolbox, the way Visual Studio, Blender and Photoshop are.

                                                                                                                                                                                                                            The big money is in the datasets used to build, train and evaluate the LLMs. LLMs today are only as good as the data they were trained on. The competition on good, high-quality, up-to-date and clean data will accelerate. With time, it will become more difficult, expensive (and perhaps illegal) to obtain world-scale data, clean it up, and use it to train and evaluate new models. This is the real goldmine, and the only moat such companies can really have.

                                                                                                                                                                                                                            • sparky_ 11 hours ago

                                                                                                                                                                                                                              This is the best take on the generative AI fad I've yet seen. I wish I could upvote this twice.

                                                                                                                                                                                                                              • 101008 11 hours ago

I had the same impression. I have been worrying a lot lately about the future for engineers (not having work, etc.), even having anxiety when I read news about AI, but these comments make me feel better and more relaxed.

                                                                                                                                                                                                                                I even considered blocking HN.

                                                                                                                                                                                                                                • whimsicalism 6 hours ago

                                                                                                                                                                                                                                  Yeah, this is called motivated reasoning.

                                                                                                                                                                                                                              • meiraleal 11 hours ago

And then the successful ChatGPT wrappers with traction will become more valuable than the companies creating proprietary LLMs. I bet OpenAI will start buying many AI apps to find profitable niches.

                                                                                                                                                                                                                              • whimsicalism 11 hours ago

                                                                                                                                                                                                                                Their hope is to reach AGI and effective post-scarcity for most things that we currently view as scarce.

                                                                                                                                                                                                                                I know it sounds crazy but that is what they actually believe and is a regular theme of conversations in SF. They also think it is a flywheel and whoever wins the race in the next few years will be so far ahead in terms of iteration capability/synthetic data that they will be the runaway winner.

                                                                                                                                                                                                                                • throwaway314155 12 hours ago

                                                                                                                                                                                                                                  I don't have a horse in the race but wouldn't Meta be more likely to commoditize things given that they sort of already are?

                                                                                                                                                                                                                                  • zdragnar 11 hours ago

                                                                                                                                                                                                                                    Search

                                                                                                                                                                                                                                    Gmail

                                                                                                                                                                                                                                    Docs

                                                                                                                                                                                                                                    Android

                                                                                                                                                                                                                                    Chrome (browser and Chromebooks)

                                                                                                                                                                                                                                    I don't use any Meta properties at all, but at least a dozen alphabet ones. My wife uses Facebook, but that's about it. I can see it being handy for insta filters.

                                                                                                                                                                                                                                    YMMV of course, but I suspect alphabet has much deeper reach, even if the actual overall number of people is similar.

                                                                                                                                                                                                                                    • throwaway314155 10 hours ago

To be clear, I was referring to the many quality open models they've released.

                                                                                                                                                                                                                                • pimeys 10 hours ago

Has anybody found a good way to utilize ollama with an editor such as Zed to do things like "generate rustdoc for this method", etc.? I use ollama daily for a ton of things, but for code generation, completion and documentation 4o is still much better than any of the local models...

                                                                                                                                                                                                                                  • navbaker 10 hours ago

                                                                                                                                                                                                                                    The Continue extension for VSCode is pretty good and has native connectivity to a local install of Ollama
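
Under the hood these integrations are just talking to ollama's local HTTP API, so you can also script the "generate docs for this method" case directly; a minimal sketch (the model name and code snippet are placeholders):

    # Minimal sketch: ask a local ollama server to draft a doc comment for a snippet.
    # Assumes ollama is running on its default port (11434).
    import requests

    snippet = "fn parse_config(path: &std::path::Path) -> std::io::Result<Config> { /* ... */ }"

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.1:8b",  # placeholder; use whatever model you have pulled
            "prompt": f"Write a concise rustdoc comment for this function:\n\n{snippet}",
            "stream": False,
        },
        timeout=120,
    )
    print(resp.json()["response"])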

                                                                                                                                                                                                                                    • pimeys 9 hours ago

Zed also has support for ollama, but all the local models I tried do not really work so well for things like "write docs for this method"... Local editor autocomplete in the style of GitHub Copilot would also be great, without needing to use proprietary Microsoft tooling...

                                                                                                                                                                                                                                      • vunderba 9 hours ago

                                                                                                                                                                                                                                        There's a lot of plugins/IDEs for assistant style LLMs, but the only TAB style autocompletion ones I know of are either proprietary (Github Copilot), or you need to get an API key (Codestral). If anyone knows of a local autocomplete model I'd love to hear about it.

                                                                                                                                                                                                                                        The Continue extension (Jetbrains, VSCodium) lets you set up assistant and autocompletion independently with different API keys.

                                                                                                                                                                                                                                  • statenjason 9 hours ago

I use gen.nvim[1] for small tasks, like "write a type definition for this JSON".

                                                                                                                                                                                                                                    Running locally avoids the concern of sending IP or PII to third parties.

                                                                                                                                                                                                                                    [1]: https://github.com/David-Kunz/gen.nvim

                                                                                                                                                                                                                                  • pella 11 hours ago

                                                                                                                                                                                                                                    Llama 3.1 405B

                                                                                                                                                                                                                                    "2 MacBooks is all you need. Llama 3.1 405B running distributed across 2 MacBooks using @exolabs_ home AI cluster" https://x.com/AIatMeta/status/1834633042339741961

                                                                                                                                                                                                                                    • IshKebab 11 hours ago

                                                                                                                                                                                                                                      "All you need is £10k of Apple laptops..."

                                                                                                                                                                                                                                      • earslap 7 hours ago

Yes, but still: a local model, a lightning in a bottle that sits between GPT-3.5 and GPT-4 (closer to 4), yours forever, for about that price, is a pretty good deal today. It probably won't be a good deal in a couple of years, but for the value it is not that unsettling. When ChatGPT first launched 2 years ago we all wondered what it would take to have something close to that locally with no strings attached, and it turns out the answer is "a couple of years and about $10k" (all due to open weights provided by some companies; training such a model still costs millions), which is neat. It will never be more expensive.

                                                                                                                                                                                                                                        • nurettin 10 hours ago

                                                                                                                                                                                                                                          That is... probable, if you bought a newish m2 to replace your 5-6 year old macbook pro which is now just lying around. Or maybe you and your spouse can share cpu hours.

                                                                                                                                                                                                                                          • svnt 10 hours ago

                                                                                                                                                                                                                                            No, you need two of the newest M3 Macbook Pros with maxed RAM, which in practice some people might have, but it is not gettable by using old hardware.

                                                                                                                                                                                                                                            And not having tried it, I’m guessing it will probably run at 1-2 tokens per second or less since the 70b model on one of these runs at 3-4, and now we are distributing the process over the network, which is best case maybe 40-80Gb/s

                                                                                                                                                                                                                                            It is possible, and that’s about the most you can say about it.

                                                                                                                                                                                                                                      • HPsquared 14 hours ago

                                                                                                                                                                                                                                        PC: Personal Chatbot

                                                                                                                                                                                                                                        • fsndz 6 hours ago

                                                                                                                                                                                                                                          My Thesis: Small language models (SLM)— models so compact that you can run them on a computer with just 4GB of RAM — are the future. SLMs are efficient enough to be deployed on edge devices, while still maintaining enough intelligence to be useful. https://www.lycee.ai/blog/why-small-language-models-are-the-...

                                                                                                                                                                                                                                          • pilooch 8 hours ago

I run a fine-tuned multimodal LLM as a spam filter (it reads emails as images). Game changer. It removes all the stuff I wouldn't read anyway, not only spam.

                                                                                                                                                                                                                                            • trash_cat 8 hours ago

                                                                                                                                                                                                                                              I think it REALLY depends on your use case. Do you want to brainstorm, clear out some thoughts, search or solve complex tasks?

                                                                                                                                                                                                                                              • jsemrau 9 hours ago

The Mistral models aren't half bad for this.

                                                                                                                                                                                                                                                • api 10 hours ago

                                                                                                                                                                                                                                                  I use ollama through a Mac app called BoltAI quite a bit. It’s like having a smart portable sci-fi “computer assistant” for research and it’s all local.

                                                                                                                                                                                                                                                  It is about the only thing I can do on my M1 Pro to spin up the fans and make the bottom of the case hot.

                                                                                                                                                                                                                                                  Llama3.1, Deepseek Coder v2, and some of the Mistral models are good.

                                                                                                                                                                                                                                                  ChatGPT and Claude top tier models are still better for very hard stuff.

                                                                                                                                                                                                                                                  • shrubble 11 hours ago

                                                                                                                                                                                                                                                    The newest laptops are supposed to have 40-50 TOPS performance with the new AI/NPU features. Wondering what that will mean in practice.

                                                                                                                                                                                                                                                    • swah 12 hours ago

I saw a demo a few months back - and lost it - of LLM autocompletion that took only a few milliseconds. It opened up a whole new way to explore it... any ideas?

                                                                                                                                                                                                                                                      • JPLeRouzic 12 hours ago

                                                                                                                                                                                                                                                        https://groq.com

                                                                                                                                                                                                                                                        is very fast.

                                                                                                                                                                                                                                                        (this is not the same as Grok)

                                                                                                                                                                                                                                                      • nipponese 12 hours ago

                                                                                                                                                                                                                                                        Am I the only one seeing obvious ads in llama3 results?

                                                                                                                                                                                                                                                        • Sophira 10 hours ago

                                                                                                                                                                                                                                                          I've not yet used any local AI, so I'm curious - what are you getting? Can you share examples?

                                                                                                                                                                                                                                                          • dunefox 12 hours ago

                                                                                                                                                                                                                                                            Yes.

                                                                                                                                                                                                                                                          • create-username 9 hours ago

There's no small AI that I know of that masters ancient Greek, Latin, English, German and French and that I can run on my 18 GB MacBook Pro.

                                                                                                                                                                                                                                                            Please correct me if I'm wrong. It would make my life slightly more comfortable

                                                                                                                                                                                                                                                            • sparkybrap 7 hours ago

I agree. Even bilingual (English+1) small models would be very useful for processing localized data, for example English-French, English-German, etc.

Right now the small models (Llama 8B) can't handle this type of task, although they could if they were trained on bilingual data.

                                                                                                                                                                                                                                                            • simion314 12 hours ago

OpenAI's APIs for GPT and DALL-E have issues like non-determinism, plus their prompt injection where they add to or modify your prompt (with no option to turn that off). That makes it impossible to do research, or to debug variations of things as a developer.

                                                                                                                                                                                                                                                              • throwaway314155 12 hours ago

                                                                                                                                                                                                                                                                While that's true for their ChatGPT SaaS, the API they provide doesn't impose as many restrictions.

                                                                                                                                                                                                                                                                • simion314 7 hours ago

                                                                                                                                                                                                                                                                  >While that's true for their ChatGPT SaaS, the API they provide doesn't impose as many restrictions.

The same issues exist with the GPT API:

1. Non-reproducibility is there in the API too.

2. Even after we run a moderation check on the input prompt, sometimes GPT will produce "unsafe" output, flag its own output as "unsafe", and we get an error but still pay for it. IMO if GPT is producing unsafe output then I should not have to pay for its problems.

3. DALL-E gives no seed, so nothing is reproducible, and there is no option to opt out of their GPT rewriting the prompt, so images are sometimes absurdly enhanced with extreme amounts of detail or extreme diversity; you end up fighting against their prompt enhancement.

What extra options do we have with the APIs that are useful?

                                                                                                                                                                                                                                                                  • SoothingSorbet an hour ago

                                                                                                                                                                                                                                                                    > 1. non reproducible is there in the API

It should be (mostly) reproducible if you set the temperature to 0 and pass a fixed seed; have you tried that?
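
A minimal sketch of that with the openai Python client, for anyone who wants to try it (the model name is just an example, and even with these settings reproducibility is only best-effort):

    # Sketch: request more deterministic output from the chat completions API.
    # The seed parameter is documented as beta; determinism is best-effort only.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": "List three uses for a local LLM."}],
        temperature=0,   # greedy-ish sampling
        seed=1234,       # best-effort reproducibility across calls
    )
    print(completion.system_fingerprint)         # changes when the backend changes
    print(completion.choices[0].message.content)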

                                                                                                                                                                                                                                                              • miguelaeh 12 hours ago

                                                                                                                                                                                                                                                                I am betting on local AI and building offload.fyi to make it easy to implement in any app

                                                                                                                                                                                                                                                                • binary132 11 hours ago

                                                                                                                                                                                                                                                                  I really get the feeling with these models that what we need is a very memory-first hardware architecture that is not necessarily the fastest at crunching.... that seems like it shouldn’t necessarily be a terrifically expensive product

                                                                                                                                                                                                                                                                  • wslh 12 hours ago

                                                                                                                                                                                                                                                                    What's the current cost of building a DIY bare-bones machine setup to run the top LLaMA 3.1 models? I understand that two nodes are typically required for this. Has anyone built something similar recently, and what hardware specs would you recommend for optimal performance? Also, do you suggest waiting for any upcoming hardware releases before making a purchase?

                                                                                                                                                                                                                                                                    • atemerev 12 hours ago

                                                                                                                                                                                                                                                                      405B is beyond homelab-scale. I recently obtained a 4x4090 rig, and I am comfortable running 70B and occasionally 128B-class models. For 405B, you need 8xH100 or better. A single H100 costs around $40k.

                                                                                                                                                                                                                                                                      • HPsquared 10 hours ago

                                                                                                                                                                                                                                                                        Here is someone running 405b on 12x3090 (4.5bpw). Total cost around $10k.

                                                                                                                                                                                                                                                                        https://www.reddit.com/r/LocalLLaMA/comments/1ej9uzh/local_l...

                                                                                                                                                                                                                                                                        Admittedly it's slow (3.5 token/sec)
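
The sizing arithmetic roughly checks out (ballpark figures only):

    # Ballpark weight-memory check for Llama 3.1 405B at different precisions.
    params_b = 405
    for label, bits in [("FP16", 16), ("FP8", 8), ("4.5 bpw quant", 4.5)]:
        gb = params_b * bits / 8
        print(f"{label:>14}: ~{gb:.0f} GB of weights")
    # ~810 GB, ~405 GB, ~228 GB respectively -- so 12 x 24 GB of 3090s (288 GB) fits the
    # 4.5 bpw quant with some headroom for KV cache, while FP8 needs 8 x 80 GB class cards.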

                                                                                                                                                                                                                                                                        • wslh 9 hours ago

Approximately how many tokens per second would the (edited) ~$40k x 8 ≈ $320k version process? Would this result in a ~32x boost in performance compared to the other setups? Thanks!

                                                                                                                                                                                                                                                                          • andersa 4 minutes ago

                                                                                                                                                                                                                                                                            If you really want to know an exact number for a specific use case, you can rent an 8xH100 node on RunPod and benchmark it.

                                                                                                                                                                                                                                                                            You should expect somewhere around 30t/s for a single response, if running FP8 rowwise quant with TensorRT-LLM. Massively more in total with batching.
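
That figure lines up with a rough memory-bandwidth estimate for single-stream decoding (assuming H100 SXM-class bandwidth; all numbers approximate):

    # Rough single-stream decode ceiling for 405B in FP8 on 8x H100 (tensor parallel).
    weights_gb = 405                # FP8: ~1 byte per parameter
    hbm_tb_s_per_gpu = 3.35         # approx. H100 SXM HBM3 bandwidth
    gpus = 8
    aggregate_gb_s = hbm_tb_s_per_gpu * 1000 * gpus   # ~26,800 GB/s
    ceiling = aggregate_gb_s / weights_gb             # every weight read once per token
    print(f"~{ceiling:.0f} tokens/s upper bound")     # ~66 t/s; interconnect and kernel
    # overheads put observed single-response throughput around half of that, and batching
    # raises total throughput because the weight reads are shared across requests.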

                                                                                                                                                                                                                                                                    • jmount 11 hours ago

I think this is a big deal. In my opinion, many money-making, stable AI services are going to be deliberately limited in ability and restricted to narrow domains. One doesn't want one's site help bot answering political questions. So this could really pull much of the revenue away from AI/LLMs as a service.

                                                                                                                                                                                                                                                                      • mrfinn 12 hours ago

It's kinda funny how nowadays an AI with 8 billion parameters is something "small". Especially when just two years back entire racks were needed to run something giving far worse performance.

                                                                                                                                                                                                                                                                        • atemerev 12 hours ago

                                                                                                                                                                                                                                                                          IDK, 8B-class quantized models run pretty fast on commodity laptops, with CPU-only inference. Thanks to the people who figured out quantization and reimplemented everything in C++, instead of academic-grade Python.
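
                                                                                                                                                                                                                                                Something like the following is all it takes with the llama-cpp-python bindings (a minimal sketch; the GGUF filename is a placeholder for whatever quantized 8B model you've downloaded):

                                                                                                                                                                                                                                                    from llama_cpp import Llama

                                                                                                                                                                                                                                                    # Any ~4-bit quantized 8B GGUF should fit comfortably in laptop RAM.
                                                                                                                                                                                                                                                    llm = Llama(
                                                                                                                                                                                                                                                        model_path="./llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder filename
                                                                                                                                                                                                                                                        n_ctx=4096,    # context window
                                                                                                                                                                                                                                                        n_threads=8,   # roughly match your physical core count
                                                                                                                                                                                                                                                    )

                                                                                                                                                                                                                                                    out = llm(
                                                                                                                                                                                                                                                        "Explain in two sentences what quantization does to an LLM.",
                                                                                                                                                                                                                                                        max_tokens=128,
                                                                                                                                                                                                                                                    )
                                                                                                                                                                                                                                                    print(out["choices"][0]["text"])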

                                                                                                                                                                                                                                                                          • actualwitch 7 hours ago

                                                                                                                                                                                                                                                A solid chunk of the Python ecosystem is just wrappers around C/C++, most tensor frameworks included.

                                                                                                                                                                                                                                                                            • atemerev 5 hours ago

                                                                                                                                                                                                                                                                              I know, and yet early model implementations were quite unoptimized compared to the modern ones.

                                                                                                                                                                                                                                                                        • stainablesteel 10 hours ago

                                                                                                                                                                                                                                                I think this is laughable; the only good 8B models are the Llama ones. Phi is terrible, and even Codestral can barely code, and that's 22B IIRC.

                                                                                                                                                                                                                                                But truthfully, the 8B models just aren't that great yet. They can provide some decent info if you're just investigating things, but a Google search is still faster.

                                                                                                                                                                                                                                                                          • sandspar 8 hours ago

                                                                                                                                                                                                                                                What advantages do local models have over external ones? Why would I run one locally if ChatGPT works well?

                                                                                                                                                                                                                                                                            • pieix 7 hours ago

                                                                                                                                                                                                                                                                              1) Offline connectivity — pretty cool to be able to debug technical problems while flying (or otherwise off grid) with a local LLM, and current 8B models are usually good enough for the first line of questions that you otherwise would have googled.

                                                                                                                                                                                                                                                                              2) Privacy

                                                                                                                                                                                                                                                                              3) Removing safety filters — there are some great “abliterated” models out there that have had their refusal behavior removed. Running these locally and never having your request refused due to corporate risk aversion is a very different experience to calling a safety-neutered API.

                                                                                                                                                                                                                                                                              Depending on your use case some, all, or none of these will be relevant, but they are undeniable benefits that are very much within reach using a laptop and the current crop of models.

                                                                                                                                                                                                                                                                            • diggan 13 hours ago

                                                                                                                                                                                                                                                Summary: it's cheaper, safer for handling sensitive data, easier to reproduce results with (the only way to be 100% sure they're reproducible at all, since "external" models can change at any time), offers a higher degree of customization, has no internet connectivity requirements, and is more efficient and more flexible.
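
                                                                                                                                                                                                                                                On the reproducibility point, a minimal sketch of what pinning a local model can look like: hash the exact weights file and fix the decoding settings, since the file on disk can never change underneath you the way a hosted model can (the path below is a placeholder):

                                                                                                                                                                                                                                                    import hashlib

                                                                                                                                                                                                                                                    # Fingerprint the exact weights file so results can be tied to it.
                                                                                                                                                                                                                                                    h = hashlib.sha256()
                                                                                                                                                                                                                                                    with open("./llama-3.1-8b-instruct.Q4_K_M.gguf", "rb") as f:  # placeholder path
                                                                                                                                                                                                                                                        for chunk in iter(lambda: f.read(1 << 20), b""):
                                                                                                                                                                                                                                                            h.update(chunk)

                                                                                                                                                                                                                                                    run_config = {
                                                                                                                                                                                                                                                        "model_sha256": h.hexdigest(),
                                                                                                                                                                                                                                                        "seed": 42,
                                                                                                                                                                                                                                                        "temperature": 0.0,  # greedy decoding removes sampling randomness
                                                                                                                                                                                                                                                    }
                                                                                                                                                                                                                                                    print(run_config)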

                                                                                                                                                                                                                                                                              • bionhoward 12 hours ago

                                                                                                                                                                                                                                                                                No ridiculous prohibitions on training on logs…

                                                                                                                                                                                                                                                                                Man, imagine being OpenAI and flushing your brand down the toilet with an explicit customer noncompete rule which totally backfires and inspires 100x more competition than it prevents

                                                                                                                                                                                                                                                                                • roywiggins 12 hours ago

                                                                                                                                                                                                                                                                                  Llama's license does forbid it:

                                                                                                                                                                                                                                                                                  "Llama 3.1 materials or outputs cannot be used to improve or train any other large language models outside of the Llama family."

                                                                                                                                                                                                                                                                                  https://llamaimodel.com/commercial-use/

                                                                                                                                                                                                                                                                                  • jclulow 11 hours ago

                                                                                                                                                                                                                                                                                    I'm not sure why anybody would respect that licence term, given the whole field rests on the rapacious misappropriation of other people's intellectual property.

                                                                                                                                                                                                                                                                                    • ronsor 11 hours ago

                                                                                                                                                                                                                                                                                      Meta dropped that term, actually, and that's an unofficial website.

                                                                                                                                                                                                                                                                                      • candiddevmike 11 hours ago

                                                                                                                                                                                                                                                                                        It's still present in the llama license...?

                                                                                                                                                                                                                                                                                        https://ai.meta.com/llama/license/

                                                                                                                                                                                                                                                                                        Section 1.b.iv

                                                                                                                                                                                                                                                                                      • sigmoid10 11 hours ago

                                                                                                                                                                                                                                                                                        >If you use the Llama Materials to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama 3” at the beginning of any such AI model name.

                                                                                                                                                                                                                                                                                        The official llama 3 repo still says this, which is a different phrasing but effectively equal in meaning to what the commenter above said.

                                                                                                                                                                                                                                                                                  • alexander2002 12 hours ago

                                                                                                                                                                                                                                                                                    An AI chip on laptop devices would be amazing!

                                                                                                                                                                                                                                                                                    • viraptor 12 hours ago

                                                                                                                                                                                                                                                It's pretty much happening already. Apple devices have MPS, and both the newer Intel chips and Snapdragon X have some form of NPU.
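
                                                                                                                                                                                                                                                If you want to see what your own machine exposes, a quick sketch with PyTorch (assuming it's installed) picks between MPS, CUDA, and plain CPU; NPUs generally sit behind vendor-specific runtimes instead:

                                                                                                                                                                                                                                                    import torch

                                                                                                                                                                                                                                                    # Pick the best locally available backend. NPUs aren't covered here;
                                                                                                                                                                                                                                                    # they typically require vendor runtimes (Core ML, ONNX Runtime, etc.).
                                                                                                                                                                                                                                                    if torch.backends.mps.is_available():
                                                                                                                                                                                                                                                        device = torch.device("mps")   # Apple Silicon GPU
                                                                                                                                                                                                                                                    elif torch.cuda.is_available():
                                                                                                                                                                                                                                                        device = torch.device("cuda")  # NVIDIA GPU
                                                                                                                                                                                                                                                    else:
                                                                                                                                                                                                                                                        device = torch.device("cpu")

                                                                                                                                                                                                                                                    print(f"Running inference on: {device}")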

                                                                                                                                                                                                                                                                                      • moffkalast 12 hours ago

                                                                                                                                                                                                                                                It would be great if any NPU that currently exists were any good at LLM acceleration, but they all have really bad memory bottlenecks.

                                                                                                                                                                                                                                                                                      • ta988 12 hours ago

                                                                                                                                                                                                                                                                                        They already exist. Nvidia GPUs on laptops, M series CPUs from Apple, NPUs...

                                                                                                                                                                                                                                                                                        • alexander2002 11 hours ago

                                                                                                                                                                                                                                                Oh damn, guess I am so uninformed.

                                                                                                                                                                                                                                                                                        • aurareturn 12 hours ago

                                                                                                                                                                                                                                                                                          First NPU arrived 7 years ago in an iPhone SoC. GPUs are also “AI” chips.

                                                                                                                                                                                                                                                                                          Local LLM community has been using Apple Silicon Mac GPUs to do inference.

                                                                                                                                                                                                                                                                                          I’m sure Apple Intelligence uses the NPU and maybe the GPU sometimes.

                                                                                                                                                                                                                                                                                      • theodorthe5 12 hours ago

                                                                                                                                                                                                                                                Local LLMs are terrible compared to Claude/ChatGPT. They are useful as API backends for applications: much cheaper than paying for OpenAI services, and they can be fine-tuned to do many useful (and less useful, even illegal) things. But for the casual user, they suck compared to the very large LLMs OpenAI/Anthropic deliver.

                                                                                                                                                                                                                                                                                        • maxnevermind 8 hours ago

                                                                                                                                                                                                                                                Yep, unfortunately those local models are noticeably worse. Models are also getting bigger, so even if a basement rig that runs a higher-quality model is possible right now, that might not be the case in the future. And Zuck and others might stop releasing their weights for next-gen models; then what, just hope they plateau? What if they don't?

                                                                                                                                                                                                                                                                                          • 78m78k7i8k 12 hours ago

                                                                                                                                                                                                                                                I don't think local LLMs are being marketed "for the casual user", nor do I think the casual user will care at all about running LLMs locally, so I'm not sure why this comparison matters.

                                                                                                                                                                                                                                                                                            • 123yawaworht456 10 hours ago

                                                                                                                                                                                                                                                They are the only thing you can use if you don't want to, or aren't allowed to, hand over your data to US corporations and intelligence agencies.

                                                                                                                                                                                                                                                Every single query to ChatGPT/Claude/Gemini/etc will be used for any purpose, by any party, at any time. Shamelessly so, because this is the new normal. Welcome to 2024. I own nothing, have no privacy, and life has never been better.

                                                                                                                                                                                                                                                >(and less useful, even illegal) things

                                                                                                                                                                                                                                                The same illegal things you can do with Notepad, or a pencil and a piece of paper.
