• jasonjmcghee 5 hours ago

    A big takeaway for me is that Gemini Flash 2.0 is a great solution to OCR, considering accessibility, cost, accuracy, and speed.

    It also has a 1M token context window, though from personal experience it seems to work better the smaller the context window is.

    Seems like Google models have been slowly improving. It wasn't so long ago I completely dismissed them.

    • shawabawa3 4 hours ago

      And from my personal experience with Gemini 2.0 flash Vs 2.0 pro is not even close

      I had gemini 2.0 pro read my entire hand written, stain covered, half English, half french family cookbook perfectly first time

      It's _crazy_ good. I had it output the whole thing in latex format to generate a printable document immediately too

    • banditelol an hour ago

      Anyone have tried comparing with Qwen VL based model? I heard good things about its performance on ocr compared to other self hostable model, but haven't really tried benchmarking its performance

      • simonw an hour ago

        The benchmark I most want to see around OCR is one that covers risks from accidental (or deliberate) prompt injection - I want to know how likely it is that a model might OCR a page and then accidentally act on instructions in that content rather than straight transcribing it as text.

        I'm interested in the same thing for audio transcription too, for models like Gemini or GPT-4o audio accepting audio input.

        • gbertb an hour ago

          How does this compare to Marker https://github.com/VikParuchuri/marker?

          • bn-l 2 hours ago

            What is the privacy of the documents for the cloud service? There’s nothing in the privacy policy about data sent over the api.

            • EarlyOom 2 days ago

              OCR seems to be mostly solved for 'normal' text laid out according to Latin alphabet norms (left to right, normal spacing etc.), but would love to see more adversarial examples. We've seen lots of regressions around faxed or scanned documents where the text boxes may be slightly rotated (e.g. https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-...) not to mention handwriting and poorly scanned docs. Then there's contextually dependent information like X-axis labels that are implicit from a legend somewhere, so its not clear even with the bounding boxes what the numbers refer to. This is where VLMs really shine: they can extract text then use similar examples from the page to map them into their output values when the bounding box doesn't provide this for free.

              • codelion 3 hours ago

                That's a great point about the limitations of traditional OCR with rotated or poorly scanned documents. I agree that VLMs really shine when it comes to understanding context and extracting information beyond just the text itself. It's pretty cool how they can map implicit relationships, like those X-axis labels you mentioned.

              • betula_ai 2 days ago

                Thank you for sharing this. Some of the other public models that we can host ourselves may perform in practice better than the models listed - e.g. Qwen 2.5 VL https://github.com/QwenLM/Qwen2.5-VL?tab=readme-ov-file

                • fzysingularity 3 days ago

                  What VLMs do you use when you're listing OmniAI - is this mostly wrapping the model providers like your zerox repo?