Comments Page - Show HN: PDF to MD by LLMs – Extract Text/Tables/Image Descriptives by GPT4o

« Back Show HN: PDF to MD by LLMs – Extract Text/Tables/Image Descriptives by GPT4ogithub.comSubmitted by yigitkonur35 9 months ago

Oras 9 months ago
While this is a nice development, it’s quite risky parsing documents with LLMs. In usual OCRs, you have boundaries to check, but with LLMs, you just get a black box output.
As others mentioned, consistency is key in parsing documents and consistency is not a feature of LLMs.
The output might look plausible, but without proper validation this is just a nice local playground that can’t make it to production.
- drdaeman 9 months ago
  Speaking of the devil - I've just had hallucinations with Ollama and reader-lm model (converting HTML to Markdown) the other day. In 40% of cases it spew out things that weren't in the input (not exactly surprising, knowing that it's a generative model).
  Turns out the model needs temperature of zero (and then it seem to behave well, at least in simple tests), but it wasn't in the model settings.
  https://github.com/ollama/ollama/issues/6875#issuecomment-23...
  yigitkonur35 9 months ago
  You're spot on. We shouldn't lump all LLMs together. This approach might work wonders for Anthropic and OpenAI's top-tier models, but it could fall flat with smaller, less complex ones.
  I purposely set the temperature to 0.1, thinking the LLM might need a little wiggle room when whipping up those markdown tables. You know, just enough leeway to get creative if needed.
- yigitkonur35 9 months ago
  I get your worries about LLMs and their consistency problems. But I think we can fix a lot of that using LLMs themselves for checks. If you're after top-notch accuracy, you could throw in another prompt, add some visual and text input, and double-check that nothing's lost in translation. The cheaper models are actually great for this kind of quality control. LLMs have come a long way since they first showed up, and I reckon they've stepped up their game enough to shake off that bad rap for giving mixed signals.
  Oras 9 months ago
  How would you know something is missing?
  I tried multiple OCRs before and it’s hard to tell if the output is accurate or not but just comparing manually.
  I created a tool to visualise the output of OCR [0] to see what’s missing and there are many cases that would be quite concerning especially when working with financial data.
  This tool wouldn’t work with LLMs as they don’t return the character recognition (to my knowledge), which will make it harder to evaluate them on a scale.
  If I want to use LLMs for the task, I would use them to help with training ML model to do OCR better, such as creating thousands of synthetic data to train.
  [0] https://github.com/orasik/parsevision
  yigitkonur35 9 months ago
  Wow, you knocked it out of the park! I'll be sure to use this when I tackle that evaluation.
  whiplash451 9 months ago
  If you can use an LLM for sanity checking, why can’t you use it for extraction at the first place?
  ithkuil 9 months ago
  Because currently models output a stream of tokens directly which are the performance and billing unit. Better models can do a better job at producing reasonable output but there is a limit to what can be done "on the fly".
  Some models like openai o1 started employing internal "thinking" tokens which may or may not be equivalent to performing multiple passes with the same or different models but it has a similar effect.
  One way to look at it is that if you want better results you have to put more computational resources in thinking. Also, just like humans, a team effort yields better results in producing well rounded results because you combine the strengths and you offset the weaknesses of different team members.
  You can technically wrap all this into a single black box and have it converse with you as if it was one single entity that internally uses multiple models to think and cross check etc. The output is likely not going to be in real-time though and real time conversation was until now a very important feature.
  In future we may on one hand relax the real time constraint and accept that for some tasks accuracy is more important than real time results.
  Or we may eventually have faster machines or more clever algorithms that may "think" more in shorter amounts of time.
  (Or a combination of the two)
- cyanydeez 9 months ago
  Determinism is also up there because post processing can catch and fix common errors
zerop 9 months ago
I had been using GPT4o for extracting insights from Scanned docs, it was doing fine. But very recently (since they launched new model - o1), it's not working. GPT4o is refusing to extract text from images and says it can't do it, though it was doing same thing with same prompts till last week. I am not sure if this is intentional downgrade and it can be clubbed with new model launch, but it's really frustrating for me. I cancelled my GPT4 premium and moved to claude. It works good.
- itissid 9 months ago
  This. Inconsistency is a big problem for large tasks, you are better off making your own models to do this.
  I have seen this odd kind of inconsistency in generating the same results, sometimes in the same chat itself after starting off fine.
  I was once trying to extract hand written dates and times from a large pdf document in batches of 10 pages at a time from a very specific part of the page. IN some documents it started by refusing, but not in other different chat windows that I tried with the same document. Sometimes it would say there is an error, and then it would work in a new chat window. But I am not sure why, but just starting a new chat works for these kind of situations.
  Sometimes it will start off fine with OCR, then as the task progresses, it will start hallucinating. Even though the text to be extracted follows a pattern like dates, it for the life of me could not get it right.
  rrrix1 9 months ago
  > "...you are better off making your own models to do this"
  I'm doubtful you meant what you wrote here. Using a readymade UI or API to perform an effectively magical task (for most of us) is an entirely different paradigm to "just train your own model."
  In reality, for us non-ML model training mortals, we're actually probably better off hiring a human to do basic data entry.
- avibhu 9 months ago
  Have you tried few shot prompting? Something on the lines of:
  User: Extract x from the given scanned document. <sample_img_1>
  Assistant: <sample_img_1_output>
  User: Extract x from the given scanned document. <sample_img_2>
  Assistant: <sample_img_2_output>
  User: Extract x from the given scanned document. <query_image>
  In my experience, this seems to make the model significantly more consistent.
  yigitkonur35 9 months ago
  For highly consistent responses, manually transcribing the most challenging page of the document (or engaging in multiple rounds of dialogue with Claude) and incorporating it as a few-shot example can dramatically improve overall consistency.
- yigitkonur35 9 months ago
  I ought to test this with Sonnet too and compare the results. I feel it might perform better on OCR tasks. While I went with Azure OpenAI due to fewer rate restrictions, you've got a point - Sonnet could really shine here.
- eth0up 9 months ago
  I have observed a lot of similar contradictions, where the large lout insists it can't do something that it did many times 'last week'.
  Super frustrating when really trying to accomplish something!
- fragmede 9 months ago
  Why not just switch back to GPT-4? it's still there.
  itchyjunk 9 months ago
  So is 4o. Problem isn't the absence of model, it's inconsistency.
pierre 9 months ago
Parsing docs using LVM is the way forward (also see OCR2 paper released last week, people are having ablot of success parsing with fine tunned Qwen2).
The hard part is to prevent the model ignoring some part of the page and halucinations (see some of the gpt4o sample here like the xanax notice:https://www.llamaindex.ai/blog/introducing-llamaparse-premiu...)
However this model will get better and we may soon have a good pdf to md model.
- fzysingularity 9 months ago
  We’ve been doing exactly this by doubling-down on VLMs (https://vlm.run)
  - VLMs are way better at handling layout and context where OCR systems fail miserably
  - VLMs read documents like humans do, which makes dealing with special layouts like bullets, tables, charts, footnotes much more tractable with a singular approach rather than have to special case a whole bunch of OCR + post-processing
  - VLMs are definitely more expensive, but can be specialized and distilled for accurate and cost effective inference
  In general, I think vision + LLMs can be trained to explicitly to “extract” information and avoid reasoning/hallucinating about the text. The reasoning can be another module altogether.
  yigitkonur35 9 months ago
  I did a ton of Googling before writing this code, but I couldn't find you guys anywhere. If I had, I'd have definitely used your stuff. You might want to think about running some small-scale Google Ads campaigns. They could be especially effective if you target people searching for both LLM and OCR together. Great product, congratz!
  fzysingularity 9 months ago
  Hey, thanks! DM me if you want to test it out (sudeep@vlm.run).
  Agreed on SEO - we’re redoing our landing page and searchability. We recently rebranded, hence the lack of direct search hits for LLM / OCR.
- authorfly 9 months ago
  What about combining old school OCR with GPT visual OCR?
  If your old school OCR output has output that is not present in the visual one, but is coherent (e.g. english sentences), you could get it back and slot it into the missing place from the visual output.
  yigitkonur35 9 months ago
  You're absolutely right. I use PDFTron (through CloudCovert) for full document OCR, but for pages with fewer than 100 characters, I switch to this API. It's a great combo – I get the solid OCR performance of SolidDocument for most content, but I can also handle tricky stuff like stats, old-fashioned text, or handwriting that regular OCR struggles with. That's why I added page numbers upfront.
- fkilaiwi 9 months ago
  what paper are you referring to?
  perrywky 9 months ago
  I guess this: https://arxiv.org/html/2409.01704v1
constantinum 9 months ago
There is also LLMWhisperer, a document pre-processor specifically made for LLM consumption.
As other mentioned, accuracy is the one part of solution criteria, other include, how does the preprocessing engine scale/performs at large scale, and how does it handle very complex documents like, bank loan forms with checkboxes, IRS tax forms with multi-layered nested tables etc.
https://unstract.com/llmwhisperer/
LLMWhisperer is a part of Unstract - An open-source tool for unstructured document ETL.
https://github.com/Zipstack/unstract
jdthedisciple 9 months ago
Zerox [0] was featured on here recently and does the exact same thing
[0] https://github.com/getomni-ai/zerox
smusamashah 9 months ago
I have not found any mention of accuracy. Since it's using LLM, how accurate the conversion is? As in does that NASA document match 100% with the pdf or did it introduce any made up things (hallucinations)?
That converted NASA doc should be included in repo and linked in readme if you haven't already.
- uLogMicheal 9 months ago
  Couldn't one just program a linear word matching test to ensure correctness?
- yigitkonur35 9 months ago
  People are really freaked out about hallucinations, but you can totally tackle that with solid prompts. The one in the repo right now is doing a pretty good job. Keep in mind though, this project is all about maxing out context for LLMs in products that need PDF input.
  We're not talking about some hardcore archiving system for the Library of Congress here. The goal is to boost consistency whenever you're feeding PDF context into an LLM-powered tool. Appreciate the feedback, I'll be sure to add that in.
  williamdclt 9 months ago
  I don’t think any prompting skill guarantees the absence of hallucination. And if hallucination is possible, you will usually need to worry about it
  undefined 9 months ago
  [deleted]
  Foobar8568 9 months ago
  As soon as you have something else than a paragraph in a single column layout, you will get hallucinations, random stuff, cut off etc even if you say which pages to look at, LLM will just do what ever.
  AdieuToLogic 9 months ago
  > People are really freaked out about hallucinations, but you can totally tackle that with solid prompts.
  > The goal is to boost consistency whenever you're feeding PDF context into an LLM-powered tool.
  These two assertions are contradictory.
  There are no "solid prompts" which obviate anthropomorphic "LLM hallucinations." Also, there is no deterministic consistency when "feeding PDF context" into an intrinsically non-deterministic algorithm, as any "LLM-powered tool" is by definition.
  freedomben 9 months ago
  Can you give some examples of prompts that you use that will tackle hallucinations?
  smusamashah 9 months ago
  > hallucinations, but you can totally tackle that with solid prompts.
  This is so wrong. This so much sound as if you have not used LLMs to do any real work.
bravura 9 months ago
I've also been using the nougat models from meta, which are trained to turn PDF into md using the donut architecture
- rahimnathwani 9 months ago
  Does it work well on documents that aren't academic papers?
  https://facebookresearch.github.io/nougat/
charlie0 9 months ago
I do this all the time for old specs, but one issue I worry about is accuracy. It's hard to confirm if the translations are 100% correct.
TZubiri 9 months ago
Ok attempt number 158 at parsing pdfs, here we go, this time surely it will work.
eth0up 9 months ago
I used GPT4o to convert heavily convoluted PDFs into csv files. The files were Florida Lottery Pick(n) histories, which they deliberately convolute to prevent automatic searching; ctrl-f does nothing and a fsck-ton of special characters embellish the whole file.
I had previously done so manually, with regex, and was surprised with the quality of the end results of GPT, despite many preceding failed iterations. The work was done in two steps, first with pdf2text, then python.
I'm still trying to created a script to extract the latest numbers from the FL website and append to a cvs list, without re-running the stripping script on the whole PDF every time. Why? I want people to have the ability to freely search the entire history of winning numbers, which in their web hosted search function, is limited to only two of 30+ years.
I know there's a more efficient method, but I don't know more than that.
- yigitkonur35 9 months ago
  I really appreciate you sharing your hands-on experience with a real-world scenario. It's interesting how people unfamiliar with traditional OCR often doubt LLMs, but having worked with actual documents, I know how inefficient classic OCR methods can be. So these minor errors don't alarm me at all. Your use case sounds fascinating - I might just incorporate it into my own benchmarks. Thanks again for your insightful comment!
- mmh0000 9 months ago
  This sounds like a fun and interesting challenge! I am tempted to try it on my own
  I’m surprised an LLM actually works for that purpose. It has been my experience with gpt reading pdfs that it’ll get the first few entries from a pdf correct then just start making up numbers.
  I’ve tried a few times having gpt4 analyze a credit card statement and it adds random purchases and leaves out others. And that’s with a “clean” PDF. I wouldn’t trust an llm at all on an obfuscated pdf, at least not without thorough double checking.
  eth0up 9 months ago
  >then just start making up numbers...
  Absolutely! It's a fucking criminal in that regard. But that's why everything is done with hard python code and the results are tested multiple times. As an assistant, gpt can be fabulous, but the user must run the necessary scripts on their own and be ever ready for a knife in the back at any moment.
  Edit: below is an example of what it generated after a lot of debugging and hassle:
  import re
  import csv from datetime import datetime
  def clean_and_structure_data(text): """Cleans and structures the extracted text data.""" # Regular expression pattern to match the lottery data pattern = r'(\d{2}/\d{2}/\d{2})\s+(E|M)\s+(\d{1})\s-\s(\d{1})\s-\s(\d{1})\s-\s(\d{1})(?:\s+FB\s+(\d))?' matches = re.findall(pattern, text)
  structured_data = [] for match in matches: date, draw_type, n1, n2, n3, n4, fireball = match # Format the date to include the full year date = datetime.strptime(date, '%m/%d/%y').strftime('%m/%d/%Y') # Concatenate the numbers, ensuring leading zeros are preserved, and enclose in quotes numbers = f'"{n1}{n2}{n3}{n4}"' structured_data.append({ 'Date': date, 'Draw': draw_type, 'Numbers': numbers, 'Fireball': fireball or '' # Use empty string if Fireball is None }) return structured_data
  def save_to_csv(data, output_path): """Saves the structured data to a CSV file.""" # Sort data by date in descending order sorted_data = sorted(data, key=lambda x: datetime.strptime(x['Date'], '%m/%d/%Y'), reverse=True)
  with open(output_path, 'w', newline='') as csvfile: fieldnames = ['Date', 'Draw', 'Numbers', 'Fireball'] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) writer.writeheader() for row in sorted_data: writer.writerow(row)
  def main(): # Path to the text file txt_path = 'PICK4.txt' # Ensure this path points to your actual text file output_csv_path = 'output.csv' # Ensure this path is where you want the CSV file saved
  try: with open(txt_path, 'r') as file: text = file.read() cleaned_data = clean_and_structure_data(text) save_to_csv(cleaned_data, output_csv_path) print(f"Data successfully extracted and saved to {output_csv_path}") except Exception as e: print(f"An error occurred: {e}")
  if __name__ == "__main__": main()
- ssl-3 9 months ago
  I had the same problem with a PDF schematic for a BTT Octopus 3d printer board (which is published on their Github repo).
  Unsearchable, weird characters behind the curtain, and etc.
  But I don't blame deliberate obfuscation (or any other deliberate attempt to hide information) at all.
  Instead, I simply blame incompetence.
  (There's a ton of shitty PDFs in the world; this is just an example that I've encountered recently.)
- alchemist1e9 9 months ago
  Off topic - but the obvious follow up question is why do you want people to have this ability to search the entire history?
  eth0up 9 months ago
  Thanks for asking...
  1) I'm a rebel
  2) I am irritated by deliberate obfuscations of public data, especially by a source that I suspect is corrupt. Although my extensive analysis has not yet revealed any significant pattern anomalies in their numbers.
  3) It's kind of my re-intro into python, which I never made significant progress in but always wanted to.
  4) It's literally the real history of all winning numbers since inception. Individuals may have various reasons for accessing this data, but I've been using it to test for manipulation. I presume for most folks it would be curiosity, or gambler's fallacy type stuff. Regardless, it shouldn't be obfuscated.
  alchemist1e9 9 months ago
  I had suspected you’re are suspicious of manipulation. I have heard many rumors of lottery corruption and manipulation.
  It’s certainly a big red flag if they are deliberately obstructing access to the data.
  Make sense your project and I’d probably take 30 mins to look at the data if I came across it. I’m somewhat decent at data and number analysis so if there is something and enough people can easily take a look at it, then it might get exposed.
  Interesting and good luck.
  undefined 9 months ago
  [deleted]
- is_true 9 months ago
  There are private APIs that have that data (now and history)
  Do you think the official data published is 100% correct if they were trying to hide something?
  eth0up 9 months ago
  I am honestly not certain why they obstruct easy access to the number history. It's obviously accessible, but only through manually parsing the PDF. Their prior embedded search function, approximately two years ago, would return all permutations of the queried number from day 1 to present. They modified it to exclude results more than two years old. The PDF contains the entire data set, but isn't searchable. Why? Dunno. But I'm cynical
  I've also compiled a list of all numbers that have never occurred, count of each occurrence and a lot more. My anomaly analytics have included everything, as an ignoramous, I can throw at it; chi squared; isolated forest; time series; and a lot of stuff I don't properly understand. Most anomalies found have been, if narrowly, within expected randomness, but I intend to fortify my proddings eventually. Although I'm actually confident I'm barking up the wrong tree, the data obfuscation is objectively dubious, for whatever the reason.
  is_true 9 months ago
  I've worked in the field and it could just be that the developers in charge of the new site didn't know/care how to get the data from the old system.
  eth0up 9 months ago
  Old Hanlon's razor. Maybe, but I'd rather assume malice when it comes to Florida Lottery
refulgentis 9 months ago
GPT 4o doesn't do actual OCR and there's much smaller and more effective models for specifically this problem.
I appreciate your work, intent, and sharing it. It's very important to appreciate what you're doing and its context when sharing it.
At that point, you are responsible for it, and the choices you make when communicating about it reflect on you.
- yigitkonur35 9 months ago
  I've found this method really useful for prepping PDFs before running them through AI. I mix it with traditional OCR for a hybrid approach. It's a game-changer for getting info from tricky pages. Sure, you wouldn't bet the farm on it for a big, official project, but it's still pretty solid. If you're willing to spend a bit more, you can use extra prompts to check for any context skips. It's a lot of work, though - probably best left to companies that specialize in this stuff.
  I've been testing it out on pitch decks made in Figma and saved as JPGs. Surprisingly, the LLM OCR outperformed top dogs like SolidDocuments and PDFtron. Since I'm mainly after getting good context for the LLM from PDFs, I've been using this hybrid setup, bringing in the LLM OCR for pages that need it. In my book, this API is perfect for these kinds of situations.
fzysingularity 9 months ago
One nit in the repo README - you might want to change the cost reporting to be as $15 / 1000 pages instead of documents.
KoolKat23 9 months ago
This is handy, one thing I've noticed using 3.5 Sonnet, the tables that aren't the correct orientation are more prone to incorrect output.
I know this was an issue when GPT 4 vision initially came out due to training, not sure if it's a solved problem or if your tool handles this.
jdross 9 months ago
How does this compare with commercial OCR APIs on a cost per page?
- yigitkonur35 9 months ago
  It is a lot cheaper! While cost-effectiveness may not be the primary advantage, this solution offers superior accuracy and consistency. Key benefits include precise table generation and output in easily editable markdown format.
  Let's make some numbers game:
  - Average token usage per image: ~1200 - Total tokens per page (including prompt): ~1500 - [GPT4o] Input token cost: $5 per million tokens - [GPT4o] Output token cost: $15 per million tokens
  For 1000 documents: - Estimated total cost: $15
  This represents excellent value considering the consistency and flexibility provided. For further cost optimization, consider:
  1. Utilizing GPT4 mini: Reduces cost to approximately $8 per 1000 documents 2. Implementing batch API: Further reduces cost to around $4 per 1000 documents
  I think it offers an optimal balance of affordability & reliability.
  PS: One of the most affordable solution on market, cloudconvert charges ~30$ for 1K document (pdftron mode required 4 credits)
  johndough 9 months ago
  > I think it offers an optimal balance of affordability & reliability.
  It is hard to trust "you" when ChatGPT wrote that text. You never know which part of the answer is genuine and which part was made up by ChatGPT.
  To actually answer that question: Pricing varies quite a bit depending on what exactly you want to do with a document.
  Text detection generally costs $1.5 per 1k pages:
  https://cloud.google.com/vision/pricing
  https://aws.amazon.com/textract/pricing/
  https://azure.microsoft.com/en-us/pricing/details/ai-documen...
  yigitkonur35 9 months ago
  You've got a point, but try testing it on a tricky example like the Apollo 17 document - you know, with those sideways tables and old-school writing. You'll see all three non-AI services totally bomb. Now, if you tweak it to batch = 1 instead of 10, you'll notice there's hardly any made-up stuff. When you dial down the temperature close to zero, it's super unlikely to see hallucinations with limited context. At worst, you might get some skipped bits, but that's not a dealbreaker for folks looking to feed PDFs into AI systems. Let's face it, regular OCR already messes up so much that...
  hnlmorg 9 months ago
  AWS Textract does use ML and I’ve personally used it to parse tables for automated invoice processing.
  You wouldn’t get a markdown document automatically generated (or at least you couldn’t when I last used it a few years ago) but you did get an XML document
  That XML document was actually better for our purposes because it gives you a confidence score and is properly structured, so floating frame, tables and columns would be properly structured in the output document. This reduces the risk of hallucinations.
  It’s less of an out-of-the-box solution but that’s to be expected with AWS APIs.
  rescbr 9 months ago
  For a similar use case I’m using Azure Document AI - at least you can ask for markdown/html output directly from it instead of parsing the output structure from Textract.
  And it’s cheaper too.
  malcolmhere 9 months ago
  You can get Markdown nowadays too, at least using this Python wrapper:
  https://aws-samples.github.io/amazon-textract-textractor/not...
  It's very consistent, though pricey.
  Propelloni 9 months ago
  > you might get some skipped bits, but that's not a dealbreaker for folks looking to feed PDFs into AI systems
  Unless it is. We have a few hundred PDF per month (mostly tables) where we need 100% accuracy. Currently we feed them into an OCR and have humans check the result. I do not win anything if I have to check the LLM output, too.
  llm_trw 9 months ago
  I'm currently solving this problem for work and thinking of a spin out, what's a ballpark figure you'd be willing to pay per 1000 pages for 99.999% character level accuracy?
  rescbr 9 months ago
  At least for my use case, which is Layout processing (i.e. must output tables in some kind of table format), the OCR part (Azure Document AI or AWS Textract) dominates the cost factor.
  Running OCR on a document is twice more expensive than processing the output on the most expensive GPT offering. Intuitively, this was kind of unexpected for me. Only when I did some calculations on Excel that I realized it.
  If you’re able to halve the pricing for Layout output then you’re unblocking lots of use cases out there.
  Propelloni 9 months ago
  > I'm currently solving this problem for work and thinking of a spin out, what's a ballpark figure you'd be willing to pay per 1000 pages for 99.999% character level accuracy?
  I guess anything up to 5 ¢ per page would be acceptable. But I'm afraid my company wouldn't be a customer. We are in Germany and we deal with particularly protected private data, there is no chance that we would exfiltrate this data to a cloud service.
  llm_trw 9 months ago
  What's the total spend per quarter? For a margin that fat I'd be willing to jump through a lot of hoops if you're doing enough pages.
  The models (currently) fit in 24gb vram sequentially with small enough batch sizes, so a local server with consumer grade gpus wouldn't be impossible.
  Propelloni 9 months ago
  I'll check and get back to you. How can I reach you?
  llm_trw 9 months ago
  Email at omni_vision_ai@proton.me
  Lerc 9 months ago
  I guess it depends on the use case, but if it surpasses the error rate that exists in the source document then it would be difficult to argue against.
  Specific things like evidentiary use would want 100% but that's at a level where any document processing would be suspect.
  What is the the typical range for error rate in PDF generation in various fields? Even robust technical documents have the occasional typo.
  llm_trw 9 months ago
  I'm not using generative models to fill in details not present in the original document. If there's a typo there then there will be a typo in the transcript. If you want to fix that then you can run another model on top of it.
  Lerc 9 months ago
  I realise that. The point is that a user is implicitly committing to the baseline error rate that exists in whatever means by which the document was created. If any additional loss was insignificant in proportion to that error rate then it would be unreasonable to reject it on that basis.
  yigitkonur35 9 months ago
  You're right. For my API that prepares PDFs for LLMs, fixing typos makes sense. But yeah, keeping original text is crucial for most OCR tasks.
  refulgentis 9 months ago
  You and your delegated writing are lying about cost.
  It is off by 2 orders of magnitude.
  My guess is you're using the token counting algorithm for pre-4o with the costs for 4o and later.
  That aside, I strongly suggest taking a week off from code-outside-work and use that time to reflect-as-work. The post and ensuing comments are a horror show. Don't take it too hard, it probably won't matter in the long run, no ones going to remember.
  But you'd get a lot out of taking it harder than you did in the comments I've seen, including one this morning where you replied to me. It worries me that you don't seem to understand how sloppy this work is.
  When I was 14, my math teacher gave me a 0 on a test because I just wrote the answers instead of showing work. That gave me a powerful appreciation for being precise, clear, and accurate.
  The only positive outcome is that even though there was enough upvotes for a simple, sloppy, mispurposed GPT wrapper to end up on the front page for ~16 hours, near-universally, the comments seem to understand contextually there's a lot of problems with how this was shared.
  authorfly 9 months ago
  I will say I have had a look at your code here. I really do value your innovation here in gaining better accuracy, but I don't think it's is much more accurate for obscure PDF cases - Maybe it halves those obscure errors. I found it still hallucinated or failed to parse some text (e.g. that unusual languages, screenshots with tiny blurred JPEG text, images/shapes remain hallucination issues with your solution). BTW I noticed a small typo "Convert document as is be creative to use markdown effectively" in the prompt. For me changing this and adding text about returning "None" if the text is unreadable reduced hallucinations.
  Would you contrast your accuracy with Textract? Because Textract is 10x cheaper than this at approx 1 cent per page (and 20x cheaper than Cloudconvert). What documents make more sense to use with your tool? Is it worth waiting till gpt-4o costs drop 10x with the same quality level (i.e. not gpt-4o-mini) to use this? In my use case it's better to drop than to hallucinate.
  What do you think makes sense in relation to Textract?
  fzysingularity 9 months ago
  Re: obscure PDFs, I’d love to see a PDF dataset with a whole bunch of these from different domains.
  I think in general it’s very hard to say if any approach is “good enough” until you see some serious degree of variability in the input domain.
  llm_trw 9 months ago
  That is not a tricky example. Those tables are as clear cut as clear cut can be.
magicalhippo 9 months ago
Was just looking for something like this. Does it handle equations to latex or similar? How about rotated tables, ie landscape mode but page is still portait?
- yigitkonur35 9 months ago
  I messed around with some rotating tables in that Apollo 17 demo video - you can check it out in the repo if you want. It's pretty straightforward to tweak just by changing the prompt. You can customize that prompt section in the code to fit whatever you need.
  Oh, and if you throw in a line about LaTeX, it'll make things even more consistent. Just add it to that markdown definition part I set up. Honestly, it'll probably work pretty well as is - should be way better than those clunky old OCR systems.
- x_may 9 months ago
  Check out Nougat from meta
  magicalhippo 9 months ago
  Thanks, looks very interesting, but also somewhat abandoned. Will keep an eye on it in case someone picks up the torch.
- nicodjimenez 9 months ago
  Have you checked out Mathpix? It's another option.
  Disclaimer: I'm the founder.
  troysk 9 months ago
  +1! Most LLMs can already output Mathpix markdown. I prompt it to do so and it gives the code and this use a rendering library to show the scalable and selectable equations. No wonder facebook nougat also uses it. Good stuff!
  magicalhippo 9 months ago
  Was looking for a self-hosted solution as I have quite on/off needs, but I'll give it a whirl as it looks quite promising.
devops000 9 months ago
You could transform arXiv to a markdown website
scottmcdot 9 months ago
Does it do image to MD too?
- vunderba 9 months ago
  Unless the only thing you want is a description of the image, then the real answer is NO. You can get an LLM to do something like "If you encounter an image that is not easily convertable to standard markdown, insert a [[DESCRIPTION OF IMAGE]] here." placeholder, but at that point you've lost information that may be salient to the original PDF.
  The reason is because these multimodal LLMs can give you descriptions/OCR/etc., but they cannot give you quantifiable information related to placement.
  So if there was a picture of a tiger in the middle of the page converted to a bitmap, you couldn't get the LLM to give you something like this: "Image detected at pixel position (120, 200) - (240, 500)." - because that's really want you want.
  You almost need segmentation system middleware that the LLM can forward to which can cut out these images to use in markdown syntax:
  ![A tiger burning brightly](/assets/images/tiger.png)
- undefined 9 months ago
  [deleted]
- yigitkonur35 9 months ago
  Yes, you can customize this as you wish by adding it to your prompt.
wittjeff 9 months ago
license please?
bschmidt1 9 months ago
My previous employer needs this.
I won't tell them :) :D >:D :|
gdevenyi 9 months ago
[flagged]
- undefined 9 months ago
  [deleted]