Comments Page - Evaluating LLMs for my personal use case

« Back Evaluating LLMs for my personal use casedarkcoding.netSubmitted by goranmoomin 2 days ago

sireat 2 days ago
Basically it boils down that for most queries google/gemini-2.5-flash is the workhorse fast/cheap/good enough.
Add in multimodality, 1M context and it is such a Swiss army knife.
It is cheap and performant enough to run 100k queries. (Took a bit over a day and cost around 30 Euros for a major document classification task). Yes in theory this could have been done with fine-tuned BERT or maybe even with some older methods but it saved way too much time.
There is another factor that may explain why Flash is #1 in most categories on OpenRouter - Flash has gotten reasonably decent at less common human languages.
Most cheap (including Flash Lite) and local models mostly have English focused training.
- karmakaze 2 days ago
  This was my initial assessment as well. Also note:
  > Grok I forgot about until it was too late.
  I was surprised by how much I prefer Grok to others. Even its persona is how I prefer it, detailed without volunteering unwanted information or sycophanty. In general I'd use Grok-3 more than 4 which is good enough for common uses.
  I suspect that Claude would be best, only if I gave it a long complex task with enough instructions up front so it could grind away on it while I was doing something else and not waiting on it.
- vjerancrnjak a day ago
  How do you run so many, I’m constantly exhausting the resources can’t even concurrently call 20 times?
  sireat a day ago
  While I do have multiple OpenRouter accounts(personal and organizational) I did not even look into concurrent calls - it was sequential.
  The job was set on Friday and ready on Monday. On average it was about 5k tokens (documents ranging from 1k to 200k in size) and only about 10 tokens out.
  Average response was about 1.5 seconds ~ 40 hours for full set.
  I really did some heavy prompt testing to limit output.
  Even then every few thousand queries you'd get some double token responses. That is Gemini would respond in duplicate - ie Daisy Daisy.
rplnt 2 days ago
> Almost all models got almost all my evaluations correct
I find this the most surprising. I have yet to cross 50% threshold of bullshit to possibly truth. In any kind of topic I use LLMs for.
- simonw 2 days ago
  It's useful to build up an intuition for what kind of questions LLMs can answer and what kind of questions they can't.
  Once you've done that your success rate goes way up.
  Aachen a day ago
  While it's useful to not bother when you know it's unlikely to give good results, it does also feel a bit like a cop-out to suggest that the user shouldn't be asking it certain (unspecified) things in the first place. If this is the only solution, we should just crowdsource topics or types of question it can't do >50% of the time so not everyone has to reinvent the wheel
  theshrike79 an hour ago
  If you ask an LLM to count the r's in "strawberry sherbert", it's 100% hit and miss.
  But have it create a script or program in any language you want to do the same, I'm 99% sure it'll get it right the first time.
  People use LLMs like graphing calculators, they're not. But you can have one MAKE a calculator and it'll get it right.
  mierz00 a day ago
  It’s not that simple.
  I’m making a tool to analyse financial transactions for accountants and identify things like misallocated expenses. Initially I was getting an LLM to try and analyse hundreds of transactions in one go. It was correct roughly 40-50% of the time, inconsistent and hallucinated frequently.
  I changed the method to simple yes no question and to analyse each transaction individually. Now it is correct 85% of the time and very consistent.
  Same model, same question essentially but a different way of asking it.
  Aachen 18 hours ago
  I don't see how that issue couldn't be an entry on the "not to do" or "not optimal usage" list
  rplnt a day ago
  Oftentimes I ask simple factual questions that I don't know the answer to. This is something it should excel at, yet it usually fails, at least on the first try. I guess I subconsciously ignore questions that are extremely easy to google (if you ignore the worst AI in existence) or can be found by opening the [insert keyword] wikipedia article. You don't need AI for those.
  simonw a day ago
  Amusingly enough, my rule of thumb for if an LLM is likely to be able to answer a question is "could somebody who just read the relevant Wikipedia page answer this?"
  Although that changed this year with o3 (and now GPT-5) getting really good at using Bing for search: https://simonwillison.net/2025/Apr/21/ai-assisted-search/
  apwell23 a day ago
  > It's useful to build up an intuition for what kind of questions LLMs can answer and what kind of questions they can't.
  Can you put you intuition into words so we can learn from you ?
  simonw a day ago
  I can't. That's my single biggest frustration about using LLMs: so much of what they can and cannot do comes down to intuition you need to build up over time, and I can't figure out how to express that intuition in a way that can quickly transfer to other people.
- Workaccount2 a day ago
  Would you be willing to share some of those chats?
  rplnt a day ago
  The most recent one I have was not in English. It was a translation question of a slang word between two non-English languages. It failed miserably (just made up some complete nonsense). Google had no trouble finding relevant pages or images for that word (without any extra prompt), so it was rather unique and not that obscure. Disclaimer: I'm not using any extra prompts like "don't make shit up and just tell me you don't know".
  Most recent technical I can remember (and now would be a good time to have the actual prompt) was that I asked whether MySQL has a way to run UPDATE without waiting for lock. Basically ignore rows that are locked. It (Sonnet 4 IIRC) answered of course and gave me an invalid query in the form of `UDPATE ... SKIP LOCKED`;
  I can't imagine what damage this does if people are using it for questions they don't/can't verify. Programming is relatively safe in this regard.
  But as I noted in my other reply, there will be a bias on my side, as I probably disregard questions that I know how to easily find answers to. That's not something I'd applaud AI for.
orochimaaru a day ago
I've had LLMs send me down complete rabbit holes for questions that are very specific. Here's an example:
We use glowroot (and open source JAVA APM). I was trying to compile it on my mac and some of the protobuf Maven plugins threw an issue. I gave copilot the entire pom.xml and the specific error and the versions being used. It send me on a complete wild goose chase and hallucinated like crazy even suggesting version upgrades to versions that do not exist or recommending parameters that have no use in the plugin.
Long story short, I just went to the github issues page of the maven plugin and searched and someone had posted a solution. Again, the solution wasn't new. It was suggested around apple started using ARM for their laptops. It was there in github and yet copilot hallucinated.
So, I don't feel too confident of coding assistants. Yes, they do a decent enough job to get your boilerplate done. But they're hopeless to resolve specific issues.
JSR_FDED 2 days ago
Which of these can I run locally on a 64GB Mac Mini Pro? And how much does quantization affect the quality?
- simonw 2 days ago
  I use a 64GB M2 MacBook Pro. I tend to find any model that's smaller than 32B works well (I can just about run a 70B but it's not worth it as I have to quit all other apps first).
  My current favorite to run on my machine is OpenAI's gpt-oss-20b because it only uses 11GB of RAM and it's designed to run at that quantization size.
  I also really like playing with the Qwen 3 family at various sizes and I'm fond of Mistral Small 3.2 as a vision LLM that works well.
  JSR_FDED 2 days ago
  Thanks. Do you get any value from those for coding?
  simonw 2 days ago
  Only when I'm offline (on planes for example) - I've had both Mistral Small and gpt-oss-20b be useful for Python and JavaScript stuff.
  If I have an internet connection I'll use GPT-5 or Claude 4 or Gemini 2.5 instead - they're better and they don't need me to dedicate a quarter of my RAM or run down my battery.
  mettamage a day ago
  Useful info! I have an M1 Mac with 64 GB and haven't experimented with offline models recently. I'll come back to this when I need my AI Maccie snackie, haha. Offline Apple Intelligence isn't at the level I want it, yet.
prism56 2 days ago
I'm pretty new to AI and have access to a few models in Kagi. I just never know which to pick, kind of annoys me I might not be using the best
- seeg a day ago
  I use Kagi as well and, from my experiments, deepseek gives me answers I like the most.
  brendoelfrendo a day ago
  I also use the Kagi assistant, and one thing that this post inspired me to do was to test out some of the models with and without web access, and I encourage you to do the same. Qwen, for example, behaves completely differently when it's summarizing searches vs operating without that tool. I assume this has to do with how Kagi tells it to respond, but it seems to impact some models more than others.
  prism56 a day ago
  Are the responses better or worse?
  brendoelfrendo a day ago
  I think that would require some testing and evaluation according to personal preference. My opinion is that Qwen 3 and Deepseek Chat v3.1 become much more concise when summarizing searches, but much less detailed. When used without the web search tool, they become more verbose (and express more "personality"), which may be less desirable in some contexts, but they also gave more informative answers. With Kimi K2 I seem to find the opposite; it really does well when analyzing search results (and really likes inserting tables to give break down its findings), and its "offline" version was much less detailed.
  Oh, and an interesting finding: Kagi's selector indicates that they're offering Deepseek Chat v3.1's non-reasoning version, but when I ran it without web search it appears to have messed up and output some of its chain of thought, so it clearly is thinking.
dcre a day ago
> To access their best models via the API, OpenAI now requires you to complete a Know-You-Customer process similar to opening a bank account.
This is not entirely true. It was entirely true for o3, but for GPT-5 it is only true for streaming and for reasoning summaries. I use GPT-5 with reasoning through the API without verifying my identity, I just don’t get the summary of the reasoning step. I don’t miss it, either. I never read them anymore now that the surprise has worn off.
giancarlostoro 2 days ago
Him using different ones is why I use Perplexity, I get to try different models and honestly its pretty darn decent, gives me everything in an organized way, I can see all the different links, and all the files it outputs can be downloaded as a simple zip file. It has everything from GTP5 to Deepseek R1 and even Grok.
There's other sites similar to perplexity that host multiple models as well, I have not tried the plethora of others, I feel like Perplexity does the most to make sure whatever model you pick it works right for you and all its output is usefully catalogued.
- mark_l_watson a day ago
  I also use Perplexity APIs, specifically their combined web search tool + decent models. Useful combination and easier that what I used to do: using a search API like Brave and rolling my own code to combine LLM and search.
  That said I have been having too much fun running Melisearch to build a local search index for many web sites that I use for reference and combine that with a small Python app that also uses local models running on Ollama. I will probably wrap this into an example to add to one of my books: not that practical but fun.
sandreas 2 days ago
This is an interesting overview, thank you. Different tasks, different models, all-day-usage and pretty complete (while still opinionated, which I like).
However, checking the results my personal overall winner if I had to pick only ONE probably would be
```
  deepseek/deepseek-chat-v3-0324
```
which is a good compromise between fast, cheap and good :-) Only for specific tasks (write a poem...) I would prefer a thinking model.
- iamnotagenius a day ago
  [dead]
EagnaIonat 2 days ago
> To access their best models via the API, OpenAI now requires you to complete a Know-You-Customer process similar to opening a bank account.
While this is true, you can download the OpenAI open source model and run it in Ollama.
The thinking is a little slow, but the results have been exceptional vs other local models.
https://ollama.com/library/gpt-oss
- 0x457 2 days ago
  openai/gpt-oss-120b is in this blog post.
faangguyindia 2 days ago
i use gemini flash and pro for pretty much everything. Why? they offer it free to test.
I tried signup for openai wayy too much friction, they start asking for payment without even you using any free credits, guess what that's one sure way to lose business.
same for claude, i couldn't even get claude through vertex as its available only in limited regions, and i am in asia pasific right now.
thorum 2 days ago
> Six of the eleven picked the same movie
This is surely the greatest weakness of current LLMs for any task needing a spark of creativity.
- torginus a day ago
  I have noticed this too - often when one model volunteered the wrong answer - such as making up a nonexistent API, I asked another, and it gave me the exact same thing! It's highly unlikely that two totally independent models would make up the same fictional thing.
  There must be something strange going on (most likely training on each others' wrong outputs, but I dunno)
  joseda-hg a day ago
  I've been burned by getting a deprecated version of an API Or hallucinated that a method of X library should exist in Y because they're similar
- Timwi 2 days ago
  This is definitely something very early LLMs could do that has kind of gotten beat out of them. I used to ask ChatGPT to simulate a text adventure game, but now if you try that you always get exactly the same one.
  sireat 2 days ago
  Curious, what kind of prompt gives you the same text adventure game?
  Surely it is a question of prompting some context(in UI mode) or with additional kicker of temperature (if using API)?
  At the very least some set up prompt such as "Give me 5 scenarios for text adventure game" would break the sameness?
  There have always been theories that OpenAI and other LLM providers cache some responses - this could be one hypothesis.
  karmakaze 2 days ago
  I'm now imagining 5 hipster AIs writing those stories--different in predictable ways.
mark_l_watson a day ago
Key comment author made at the end: result evals were done blind: nice.
I have always juggled a multitude of API keys and this article has almost convinced me to try Open Router. It is easy enough to purchase compute credit from Chinese companies, but I could save myself some time.
The thing I liked most in the analysis was the emphasis on speed and cost.
The speed and cost issue is important. I recently read that many AI startups in the USA are favoring faster and cheaper models from China (China is quietly overtaking America with open AI models - iX Broker: https://ixbroker.com/blog/china-is-quietly-overtaking-americ...). The Economist has a similar article but it is paywalled.