• zerop 6 hours ago

    I had been using GPT-4o for extracting insights from scanned docs, and it was doing fine. But very recently (since they launched the new model, o1), it's not working: GPT-4o refuses to extract text from images and says it can't do it, though it was doing the same thing with the same prompts until last week. I am not sure if this is an intentional downgrade tied to the new model launch, but it's really frustrating for me. I cancelled my GPT-4 premium and moved to Claude. It works well.

    • itissid 5 hours ago

      This. Inconsistency is a big problem for large tasks; you are better off making your own models to do this.

      I have seen this odd kind of inconsistency in generating the same results, sometimes in the same chat itself after starting off fine.

      I was once trying to extract handwritten dates and times from a very specific part of the page in a large PDF document, in batches of 10 pages at a time. In some documents it started by refusing, but not in other chat windows that I tried with the same document. Sometimes it would say there was an error, and then it would work in a new chat window. I am not sure why, but just starting a new chat works in these kinds of situations.

      Sometimes it will start off fine with OCR, then as the task progresses it will start hallucinating. Even though the text to be extracted follows a pattern, like dates, it could not for the life of it get them right.

      • rrrix1 17 minutes ago

        > "...you are better off making your own models to do this"

        I'm doubtful you meant what you wrote here. Using a readymade UI or API to perform an effectively magical task (for most of us) is an entirely different paradigm to "just train your own model."

        In reality, for us non-ML model training mortals, we're actually probably better off hiring a human to do basic data entry.

      • avibhu an hour ago

        Have you tried few-shot prompting? Something along the lines of:

        User: Extract x from the given scanned document. <sample_img_1>

        Assistant: <sample_img_1_output>

        User: Extract x from the given scanned document. <sample_img_2>

        Assistant: <sample_img_2_output>

        User: Extract x from the given scanned document. <query_image>

        In my experience, this seems to make the model significantly more consistent.
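
        Here's a minimal sketch of that structure against the OpenAI chat API (the file names and example outputs are placeholders):

          import base64
          from openai import OpenAI  # official openai-python v1 client

          client = OpenAI()

          def image_part(path):
              # Encode a local scan as a data URL the vision endpoint accepts
              b64 = base64.b64encode(open(path, "rb").read()).decode()
              return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

          PROMPT = "Extract x from the given scanned document."

          messages = [
              {"role": "user", "content": [{"type": "text", "text": PROMPT}, image_part("sample_1.png")]},
              {"role": "assistant", "content": "<sample_img_1_output>"},
              {"role": "user", "content": [{"type": "text", "text": PROMPT}, image_part("sample_2.png")]},
              {"role": "assistant", "content": "<sample_img_2_output>"},
              {"role": "user", "content": [{"type": "text", "text": PROMPT}, image_part("query.png")]},
          ]

          resp = client.chat.completions.create(model="gpt-4o", messages=messages, temperature=0)
          print(resp.choices[0].message.content)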

        • fragmede 5 hours ago

          Why not just switch back to GPT-4? It's still there.

          • itchyjunk 2 hours ago

            So is 4o. The problem isn't the absence of the model, it's the inconsistency.

        • pierre 7 hours ago

          Parsing docs using LVMs is the way forward (also see the OCR2 paper released last week; people are having a lot of success parsing with fine-tuned Qwen2).

          The hard part is preventing the model from ignoring parts of the page, and hallucinations (see some of the GPT-4o samples here, like the Xanax notice: https://www.llamaindex.ai/blog/introducing-llamaparse-premiu...)

          However, these models will get better, and we may soon have a good PDF-to-Markdown model.

          • fkilaiwi 29 minutes ago

            What paper are you referring to?

            • authorfly 5 hours ago

              What about combining old-school OCR with GPT visual OCR?

              If the old-school OCR output contains text that is not present in the visual output but is coherent (e.g. English sentences), you could slot it back into the missing place in the visual output.
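
              A rough sketch of that reconciliation step, assuming both outputs are plain text (the sentence splitter and the 0.85 threshold are arbitrary choices):

                import re
                from difflib import SequenceMatcher

                def sentences(text):
                    # Crude sentence splitter; short fragments are treated as incoherent
                    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if len(s.split()) >= 4]

                def missing_from_visual(classic_ocr, llm_output, threshold=0.85):
                    # Classic-OCR sentences with no close fuzzy match anywhere in the LLM output
                    llm_sents = sentences(llm_output)
                    missing = []
                    for s in sentences(classic_ocr):
                        best = max((SequenceMatcher(None, s.lower(), t.lower()).ratio()
                                    for t in llm_sents), default=0.0)
                        if best < threshold:
                            missing.append(s)
                    return missing

                # recovered = missing_from_visual(tesseract_text, gpt_markdown)

              Deciding where each recovered sentence belongs is the harder part; matching against its neighbours in the OCR stream is one option.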

            • charlie0 an hour ago

              I do this all the time for old specs, but one issue I worry about is accuracy. It's hard to confirm whether the conversions are 100% correct.

              • smusamashah 10 hours ago

                I have not found any mention of accuracy. Since it's using an LLM, how accurate is the conversion? For example, does that NASA document match the PDF 100%, or did it introduce any made-up things (hallucinations)?

                That converted NASA doc should be included in the repo and linked in the readme, if you haven't done that already.

                • uLogMicheal an hour ago

                  Couldn't one just program a linear word matching test to ensure correctness?
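
                  Something like this, assuming you have an independent text source (say, a plain pdftotext dump) to compare against - it's order-insensitive, so treat it as a crude lower bound:

                    import re
                    from collections import Counter

                    def word_recall(reference_text, llm_output):
                        # Fraction of reference words that survive into the LLM output (order ignored)
                        tokenize = lambda t: Counter(re.findall(r"[a-z0-9]+", t.lower()))
                        ref, out = tokenize(reference_text), tokenize(llm_output)
                        kept = sum(min(n, out[w]) for w, n in ref.items())
                        return kept / max(1, sum(ref.values()))

                    # e.g. route pages scoring below 0.95 to a human for review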

                  • yigitkonur35 10 hours ago

                    People are really freaked out about hallucinations, but you can totally tackle that with solid prompts. The one in the repo right now is doing a pretty good job. Keep in mind though, this project is all about maxing out context for LLMs in products that need PDF input.

                    We're not talking about some hardcore archiving system for the Library of Congress here. The goal is to boost consistency whenever you're feeding PDF context into an LLM-powered tool. Appreciate the feedback, I'll be sure to add that in.

                    • williamdclt 7 hours ago

                      I don’t think any prompting skill guarantees the absence of hallucination. And if hallucination is possible, you will usually need to worry about it

                      • Foobar8568 7 hours ago

                        As soon as you have anything other than paragraphs in a single-column layout, you will get hallucinations, random output, cut-offs, etc. Even if you say which pages to look at, the LLM will just do whatever.

                        • freedomben 5 hours ago

                          Can you give some examples of prompts you use to tackle hallucinations?

                      • jdthedisciple 8 hours ago

                        Zerox [0] was featured on here recently and does the exact same thing

                        [0] https://github.com/getomni-ai/zerox

                        • bravura 6 hours ago

                          I've also been using the Nougat models from Meta, which are trained to turn PDFs into Markdown using the Donut architecture.

                        • jdross 11 hours ago

                          How does this compare with commercial OCR APIs on a cost-per-page basis?

                          • yigitkonur35 10 hours ago

                            It is a lot cheaper! While cost-effectiveness may not be the primary advantage, this solution offers superior accuracy and consistency. Key benefits include precise table generation and output in easily editable markdown format.

                            Let's run some numbers:

                            - Average token usage per image: ~1,200
                            - Total tokens per page (including prompt): ~1,500
                            - [GPT-4o] Input token cost: $5 per million tokens
                            - [GPT-4o] Output token cost: $15 per million tokens

                            For 1,000 documents:

                            - Estimated total cost: ~$15

                            This represents excellent value considering the consistency and flexibility provided. For further cost optimization, consider:

                            1. Utilizing GPT-4o mini: reduces cost to approximately $8 per 1,000 documents
                            2. Implementing the batch API: further reduces cost to around $4 per 1,000 documents

                            I think it offers an optimal balance of affordability & reliability.

                            PS: This is one of the most affordable solutions on the market; CloudConvert charges ~$30 for 1K documents (PDFTron mode requires 4 credits).
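
                            To sanity-check the math (the ~600 output tokens per page is my assumption; the other numbers are from the list above):

                              # Rough per-page cost for GPT-4o
                              IN_TOK, OUT_TOK = 1500, 600
                              IN_PRICE, OUT_PRICE = 5 / 1e6, 15 / 1e6  # $ per token

                              per_page = IN_TOK * IN_PRICE + OUT_TOK * OUT_PRICE
                              print(f"${per_page:.4f}/page -> ${1000 * per_page:.2f} per 1,000 pages")
                              # $0.0165/page -> $16.50 per 1,000 pages, in the ballpark of the ~$15 above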

                            • johndough 10 hours ago

                              > I think it offers an optimal balance of affordability & reliability.

                              It is hard to trust "you" when ChatGPT wrote that text. You never know which part of the answer is genuine and which part was made up by ChatGPT.

                              To actually answer that question: Pricing varies quite a bit depending on what exactly you want to do with a document.

                              Text detection generally costs $1.5 per 1k pages:

                              https://cloud.google.com/vision/pricing

                              https://aws.amazon.com/textract/pricing/

                              https://azure.microsoft.com/en-us/pricing/details/ai-documen...

                              • yigitkonur35 10 hours ago

                                You've got a point, but try testing it on a tricky example like the Apollo 17 document - you know, with those sideways tables and old-school writing. You'll see all three non-AI services totally bomb. Now, if you tweak it to batch = 1 instead of 10, you'll notice there's hardly any made-up stuff. When you dial the temperature down close to zero, hallucinations are very unlikely with limited context. At worst you might get some skipped bits, but that's not a dealbreaker for folks looking to feed PDFs into AI systems. Let's face it, regular OCR already messes up so much that...
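
                                To make that concrete, the batch = 1, near-zero-temperature setup looks roughly like this (a sketch, not the repo's exact code; pdf2image needs Poppler installed):

                                  import base64, io
                                  from openai import OpenAI
                                  from pdf2image import convert_from_path

                                  client = OpenAI()
                                  for i, page in enumerate(convert_from_path("apollo17.pdf", dpi=200)):
                                      buf = io.BytesIO()
                                      page.save(buf, format="PNG")
                                      url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
                                      resp = client.chat.completions.create(
                                          model="gpt-4o",
                                          temperature=0,  # dial the temperature down, as described above
                                          messages=[{"role": "user", "content": [
                                              {"type": "text", "text": "Transcribe this page faithfully to Markdown."},
                                              {"type": "image_url", "image_url": {"url": url}},
                                          ]}],
                                      )
                                      with open(f"page_{i:03d}.md", "w") as f:
                                          f.write(resp.choices[0].message.content)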

                                • hnlmorg 5 hours ago

                                  AWS Textract does use ML and I’ve personally used it to parse tables for automated invoice processing.

                                  You wouldn’t get a Markdown document automatically generated (or at least you couldn’t when I last used it a few years ago), but you did get a structured JSON document back.

                                  That JSON document was actually better for our purposes because it gives you a confidence score and is properly structured, so floating frames, tables, and columns come through correctly in the output. This reduces the risk of hallucinations.

                                  It’s less of an out-of-the-box solution but that’s to be expected with AWS APIs.
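
                                  For reference, pulling those per-block confidence scores with boto3 looks something like this (minimal sketch; the file name is a placeholder):

                                    import boto3

                                    textract = boto3.client("textract")
                                    with open("invoice.png", "rb") as f:
                                        resp = textract.analyze_document(Document={"Bytes": f.read()},
                                                                         FeatureTypes=["TABLES"])

                                    # Flag low-confidence lines for the human-review pass
                                    for block in resp["Blocks"]:
                                        if block["BlockType"] == "LINE" and block["Confidence"] < 90:
                                            print(f'{block["Confidence"]:.1f}%  {block.get("Text", "")}')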

                                  • rescbr 2 hours ago

                                    For a similar use case I’m using Azure Document AI - at least you can ask for markdown/html output directly from it instead of parsing the output structure from Textract.

                                    And it’s cheaper too.

                                  • Propelloni 7 hours ago

                                    > you might get some skipped bits, but that's not a dealbreaker for folks looking to feed PDFs into AI systems

                                    Unless it is. We have a few hundred PDFs per month (mostly tables) where we need 100% accuracy. Currently we feed them into an OCR system and have humans check the result. I do not gain anything if I have to check the LLM output, too.

                                    • llm_trw 4 hours ago

                                      I'm currently solving this problem for work and thinking of a spin-out. What's a ballpark figure you'd be willing to pay per 1,000 pages for 99.999% character-level accuracy?

                                      • Propelloni an hour ago

                                        > I'm currently solving this problem for work and thinking of a spin-out. What's a ballpark figure you'd be willing to pay per 1,000 pages for 99.999% character-level accuracy?

                                        I guess anything up to 5¢ per page would be acceptable. But I'm afraid my company wouldn't be a customer. We are in Germany, and we deal with particularly protected private data; there is no chance that we would exfiltrate this data to a cloud service.

                                        • llm_trw an hour ago

                                          What's the total spend per quarter? For a margin that fat I'd be willing to jump through a lot of hoops if you're doing enough pages.

                                          The models (currently) fit in 24 GB of VRAM sequentially with small enough batch sizes, so a local server with consumer-grade GPUs wouldn't be impossible.

                                        • rescbr 2 hours ago

                                          At least for my use case, which is layout processing (i.e. it must output tables in some kind of table format), the OCR part (Azure Document AI or AWS Textract) dominates the cost.

                                          Running OCR on a document is twice as expensive as processing the output with the most expensive GPT offering. Intuitively this was kind of unexpected for me; I only realized it when I did some calculations in Excel.

                                          If you’re able to halve the pricing for layout output, then you’re unblocking lots of use cases out there.

                                          • Lerc 2 hours ago

                                            I guess it depends on the use case, but if it surpasses the error rate that exists in the source document then it would be difficult to argue against.

                                            Specific things like evidentiary use would want 100% but that's at a level where any document processing would be suspect.

                                            What is the typical range of error rates in PDF generation in various fields? Even robust technical documents have the occasional typo.

                                            • llm_trw 2 hours ago

                                              I'm not using generative models to fill in details not present in the original document. If there's a typo there then there will be a typo in the transcript. If you want to fix that then you can run another model on top of it.

                                              • Lerc 2 hours ago

                                                I realise that. The point is that a user is implicitly committing to the baseline error rate of whatever process created the document. If any additional loss is insignificant in proportion to that error rate, it would be unreasonable to reject the tool on that basis.

                                        • authorfly 5 hours ago

                                          I will say I have had a look at your code here. I really do value your innovation in gaining better accuracy, but I don't think it is much more accurate for obscure PDF cases - maybe it halves those obscure errors. I found it still hallucinated or failed to parse some text (e.g. unusual languages, screenshots with tiny blurred JPEG text, and images/shapes remain hallucination issues with your solution). BTW, I noticed a small typo in the prompt: "Convert document as is be creative to use markdown effectively". For me, changing this and adding text about returning "None" if the text is unreadable reduced hallucinations.

                                          Would you contrast your accuracy with Textract? Textract is 10x cheaper than this at approximately 1 cent per page (and 20x cheaper than CloudConvert). What documents make more sense to use with your tool? Is it worth waiting until gpt-4o costs drop 10x at the same quality level (i.e. not gpt-4o-mini) to use this? In my use case it's better to drop text than to hallucinate it.

                                          What do you think makes sense in relation to Textract?

                                          • llm_trw 4 hours ago

                                            That is not a tricky example. Those tables are as clear cut as clear cut can be.

                                    • magicalhippo 10 hours ago

                                      Was just looking for something like this. Does it handle converting equations to LaTeX or similar? How about rotated tables, i.e. landscape mode but the page is still portrait?

                                      • yigitkonur35 10 hours ago

                                        I messed around with some rotated tables in that Apollo 17 demo video - you can check it out in the repo if you want. It's pretty straightforward to tweak just by changing the prompt. You can customize that prompt section in the code to fit whatever you need.

                                        Oh, and if you throw in a line about LaTeX, it'll make things even more consistent. Just add it to that markdown definition part I set up. Honestly, it'll probably work pretty well as is - should be way better than those clunky old OCR systems.
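
                                        For instance, something along these lines (illustrative wording, not the exact line from the repo):

                                          - Render any mathematical notation as LaTeX inside $$ ... $$ delimiters instead of transcribing it as plain text.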

                                        • x_may 4 hours ago

                                          Check out Nougat from Meta.

                                          • nicodjimenez 8 hours ago

                                            Have you checked out Mathpix? It's another option.

                                            Disclaimer: I'm the founder.

                                            • magicalhippo an hour ago

                                              Was looking for a self-hosted solution as my needs are quite on/off, but I'll give it a whirl as it looks quite promising.

                                              • troysk 6 hours ago

                                                +1! Most LLMs can already output Mathpix Markdown. I prompt them to do so and they give me the code, and then I use a rendering library to show scalable, selectable equations. No wonder Facebook's Nougat also uses it. Good stuff!

                                            • scottmcdot 4 hours ago

                                              Does it do image to MD too?

                                              • eth0up 4 hours ago

                                                I used GPT-4o to convert heavily convoluted PDFs into CSV files. The files were Florida Lottery Pick(n) histories, which they deliberately convolute to prevent automatic searching; Ctrl-F does nothing, and a fsck-ton of special characters embellish the whole file.

                                                I had previously done this manually with regex, and despite many failed iterations along the way, I was surprised by the quality of GPT's end results. The work was done in two steps: first with pdf2text, then Python.

                                                I'm still trying to create a script to extract the latest numbers from the FL website and append them to a CSV list, without re-running the stripping script on the whole PDF every time. Why? I want people to be able to freely search the entire history of winning numbers, which, in the state's web-hosted search function, is limited to only two of its 30+ years.

                                                I know there's a more efficient method, but I don't know more than that.

                                                • mmh0000 an hour ago

                                                  This sounds like a fun and interesting challenge! I am tempted to try it on my own.

                                                  I’m surprised an LLM actually works for that purpose. It has been my experience with GPT reading PDFs that it'll get the first few entries correct and then just start making up numbers.

                                                  I’ve tried a few times having GPT-4 analyze a credit card statement, and it adds random purchases and leaves out others. And that’s with a “clean” PDF. I wouldn’t trust an LLM at all on an obfuscated PDF, at least not without thorough double-checking.

                                                  • eth0up an hour ago

                                                    > then just start making up numbers...

                                                    Absolutely! It's a fucking criminal in that regard. But that's why everything is done with hard Python code and the results are tested multiple times. As an assistant, GPT can be fabulous, but the user must run the necessary scripts on their own and be ever ready for a knife in the back at any moment.

                                                    Edit: below is an example of what it generated after a lot of debugging and hassle:

                                                    import re
                                                    import csv
                                                    from datetime import datetime

                                                    def clean_and_structure_data(text):
                                                        """Cleans and structures the extracted text data."""
                                                        # Regular expression pattern to match the lottery data
                                                        pattern = r'(\d{2}/\d{2}/\d{2})\s+(E|M)\s+(\d{1})\s-\s(\d{1})\s-\s(\d{1})\s-\s(\d{1})(?:\s+FB\s+(\d))?'
                                                        matches = re.findall(pattern, text)

                                                        structured_data = []
                                                        for match in matches:
                                                            date, draw_type, n1, n2, n3, n4, fireball = match
                                                            # Format the date to include the full year
                                                            date = datetime.strptime(date, '%m/%d/%y').strftime('%m/%d/%Y')
                                                            # Concatenate the numbers, preserving leading zeros, and enclose in quotes
                                                            numbers = f'"{n1}{n2}{n3}{n4}"'
                                                            structured_data.append({
                                                                'Date': date,
                                                                'Draw': draw_type,
                                                                'Numbers': numbers,
                                                                'Fireball': fireball or ''  # Use empty string if Fireball is None
                                                            })
                                                        return structured_data

                                                    def save_to_csv(data, output_path):
                                                        """Saves the structured data to a CSV file."""
                                                        # Sort data by date in descending order
                                                        sorted_data = sorted(data, key=lambda x: datetime.strptime(x['Date'], '%m/%d/%Y'), reverse=True)

                                                        with open(output_path, 'w', newline='') as csvfile:
                                                            fieldnames = ['Date', 'Draw', 'Numbers', 'Fireball']
                                                            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                                                            writer.writeheader()
                                                            for row in sorted_data:
                                                                writer.writerow(row)

                                                    def main():
                                                        txt_path = 'PICK4.txt'  # Ensure this path points to your actual text file
                                                        output_csv_path = 'output.csv'  # Where the CSV file should be saved

                                                        try:
                                                            with open(txt_path, 'r') as file:
                                                                text = file.read()

                                                            cleaned_data = clean_and_structure_data(text)
                                                            save_to_csv(cleaned_data, output_csv_path)
                                                            print(f"Data successfully extracted and saved to {output_csv_path}")
                                                        except Exception as e:
                                                            print(f"An error occurred: {e}")

                                                    if __name__ == "__main__":
                                                        main()

                                                  • is_true 2 hours ago

                                                    There are private APIs that have that data (current and historical).

                                                    Do you think the officially published data is 100% correct if they were trying to hide something?

                                                    • eth0up an hour ago

                                                      I am honestly not certain why they obstruct easy access to the number history. It's obviously accessible, but only through manually parsing the PDF. Their prior embedded search function, up until approximately two years ago, would return all permutations of the queried number from day 1 to the present. They modified it to exclude results more than two years old. The PDF contains the entire data set but isn't searchable. Why? Dunno. But I'm cynical.

                                                      I've also compiled a list of all numbers that have never occurred, counts of each occurrence, and a lot more. My anomaly analytics have included everything I, as an ignoramus, can throw at it: chi-squared, isolation forests, time series, and a lot of stuff I don't properly understand. Most anomalies found have been, if narrowly, within expected randomness, but I intend to fortify my proddings eventually. Although I'm actually confident I'm barking up the wrong tree, the data obfuscation is objectively dubious, whatever the reason.
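
                                                      For anyone who wants to poke at the digit frequencies themselves, here's a minimal chi-squared uniformity check against the CSV from my script upthread (scipy is an extra dependency):

                                                        import csv
                                                        from collections import Counter
                                                        from scipy.stats import chisquare

                                                        # Tally digit frequencies across all drawn Pick-4 numbers
                                                        digits = Counter()
                                                        with open("output.csv") as f:
                                                            for row in csv.DictReader(f):
                                                                digits.update(row["Numbers"].strip('"'))

                                                        counts = [digits[str(d)] for d in range(10)]
                                                        stat, p = chisquare(counts)  # H0: all ten digits are equally likely
                                                        print(counts, f"chi2={stat:.1f}, p={p:.3f}")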

                                                    • alchemist1e9 3 hours ago

                                                      Off topic, but the obvious follow-up question: why do you want people to have the ability to search the entire history?

                                                      • eth0up 2 hours ago

                                                        Thanks for asking...

                                                        1) I'm a rebel

                                                        2) I am irritated by deliberate obfuscations of public data, especially by a source that I suspect is corrupt - although my extensive analysis has not yet revealed any significant pattern anomalies in their numbers.

                                                        3) It's kind of my re-intro into python, which I never made significant progress in but always wanted to.

                                                        4) It's literally the real history of all winning numbers since inception. Individuals may have various reasons for accessing this data, but I've been using it to test for manipulation. I presume for most folks it would be curiosity, or gambler's fallacy type stuff. Regardless, it shouldn't be obfuscated.

                                                        • alchemist1e9 2 hours ago

                                                          I had suspected you were suspicious of manipulation. I have heard many rumors of lottery corruption and manipulation.

                                                          It’s certainly a big red flag if they are deliberately obstructing access to the data.

                                                          Your project makes sense, and I'd probably take 30 minutes to look at the data if I came across it. I'm somewhat decent at data and number analysis, so if there is something there and enough people can easily take a look, it might get exposed.

                                                          Interesting and good luck.