Comments Page - Show HN: Benchmarking VLMs vs. Traditional OCR

« Back Show HN: Benchmarking VLMs vs. Traditional OCRgetomni.aiSubmitted by themanmaran 3 days ago

jasonjmcghee 5 hours ago
A big takeaway for me is that Gemini Flash 2.0 is a great solution to OCR, considering accessibility, cost, accuracy, and speed.
It also has a 1M token context window, though from personal experience it seems to work better the smaller the context window is.
Seems like Google models have been slowly improving. It wasn't so long ago I completely dismissed them.
- shawabawa3 4 hours ago
  And from my personal experience with Gemini 2.0 flash Vs 2.0 pro is not even close
  I had gemini 2.0 pro read my entire hand written, stain covered, half English, half french family cookbook perfectly first time
  It's _crazy_ good. I had it output the whole thing in latex format to generate a printable document immediately too
banditelol an hour ago
Anyone have tried comparing with Qwen VL based model? I heard good things about its performance on ocr compared to other self hostable model, but haven't really tried benchmarking its performance
simonw an hour ago
The benchmark I most want to see around OCR is one that covers risks from accidental (or deliberate) prompt injection - I want to know how likely it is that a model might OCR a page and then accidentally act on instructions in that content rather than straight transcribing it as text.
I'm interested in the same thing for audio transcription too, for models like Gemini or GPT-4o audio accepting audio input.
gbertb an hour ago
How does this compare to Marker https://github.com/VikParuchuri/marker?
bn-l 2 hours ago
What is the privacy of the documents for the cloud service? There’s nothing in the privacy policy about data sent over the api.
EarlyOom 2 days ago
OCR seems to be mostly solved for 'normal' text laid out according to Latin alphabet norms (left to right, normal spacing etc.), but would love to see more adversarial examples. We've seen lots of regressions around faxed or scanned documents where the text boxes may be slightly rotated (e.g. https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-...) not to mention handwriting and poorly scanned docs. Then there's contextually dependent information like X-axis labels that are implicit from a legend somewhere, so its not clear even with the bounding boxes what the numbers refer to. This is where VLMs really shine: they can extract text then use similar examples from the page to map them into their output values when the bounding box doesn't provide this for free.
- codelion 3 hours ago
  That's a great point about the limitations of traditional OCR with rotated or poorly scanned documents. I agree that VLMs really shine when it comes to understanding context and extracting information beyond just the text itself. It's pretty cool how they can map implicit relationships, like those X-axis labels you mentioned.
betula_ai 2 days ago
Thank you for sharing this. Some of the other public models that we can host ourselves may perform in practice better than the models listed - e.g. Qwen 2.5 VL https://github.com/QwenLM/Qwen2.5-VL?tab=readme-ov-file
fzysingularity 3 days ago
What VLMs do you use when you're listing OmniAI - is this mostly wrapping the model providers like your zerox repo?