Comments Page - Extracting financial disclosure and police reports with OpenAI Structured Output

« Back Extracting financial disclosure and police reports with OpenAI Structured Outputgist.github.comSubmitted by danso 9 months ago

synthc 9 months ago
My first job (around 2010) was to extract events from financial news and police reports.
We built this huge system with tons of regexes, custom parsers, word lists, ontologies etc. It was a huge effort to get somewhat acceptable accuracy.
It is humbling to see that these days a 100 line Python script can do the same thing but better: AI has basically taken over my first job.
- dataguy_ 9 months ago
  I can see this being true to a lot of old jobs, like my brother's first job that basically was to transcribe audio tapes. whisper can do it in no time, that's crazy.
- danofsteel32 9 months ago
  I’ve had a similar experience extracting transactions from my PDF bank statements [1]. GPT-4o and GPT-4o-mini perform as well the janky regex parser I wrote a few years ago. The fact that they can zero shot the problem makes me think there’s a lot of bank statements in the training data.
  [1] https://dandavis.dev/pnc-virtual-wallet-statement-parser.htm...
- morkalork 9 months ago
  Well, your first job today would be writing that 100 line Python script then doing something 100x more interesting with the events than writing truck loads of regexs?
  HWR_14 9 months ago
  No, his first job would be a more senior developer writing 100 line Python script instead of hiring an intern to write a truck load of RegExs. After that dev saved time just writing the script over mentoring/explaining/hiring the intern, that dev would then do the more interesting things with the events.
  That is, his first job is now gone.
rcarmo 9 months ago
I’ve had pretty dismal results doing the same with spreadsheets—even with the data nicely tagged (and numbers directly adjacent to the labels) GPT-4o would completely make up figures to satisfy the JSON schema passed to it. YMMV.
- TrainedMonkey 9 months ago
  I wonder if adversarial model which looks at user input & LLM output and predicts whether output is accurate + maybe output what is not accurate. This worked pretty well for image generation.
  alach11 9 months ago
  This is a common workflow for these sorts of problems. I've done similar a few times. The downside is the additional cost.
- infecto 9 months ago
  On the flip side I have had a lot of success parsing spreadsheets and other tables into their markdown or similar representation and pulling data out of that quite accurately.
druskacik 9 months ago
Data extraction is definitely one of the most useful functions of LLM, however, in my experience a large model is necessary for a reliable extraction - I tested smaller, open-weights models and the performance was not sufficient.
I wonder, did anyone try to fine-tune a model specifically for general formatted data extraction? My naive thinking is that this should be pretty doable - after all, it's basically just restructuring the content using mostly the same tokens as input.
The reason why this would be useful (in my case) is because while large LLMs are perfectly capable of extraction, I often need to run it on millions of texts, which would be too costly. That's the reason I usually end up creating a custom small model, which is faster and cheaper. But a general small extraction-focused LLM would solve this.
I thought about fine-tuning Llama3-1B or Qwen models on larger models outputs, but my focus is currently elsewhere.
- msp26 9 months ago
  Have you looked into structured generation with a library like outlines? https://github.com/dottxt-ai/outlines
  druskacik 9 months ago
  Yes! This library is great and definitely helps, but I still had problems with performance. For example, smaller models would still hallucinate when extracting a JSON field when the given field wasn't present in the text (I'd expect null, but it provided either an incorrect value from the text, or a totally made up value).
- petercooper 9 months ago
  Jina did something related for extracting content from raw HTML and wrote about the techniques they used here: https://jina.ai/news/reader-lm-small-language-models-for-cle... .. in my tests, the 1.5bn model works extremely well, though the open model is non commercial.
chx 9 months ago
How do you know the output has anything to do with the input? Hint: you don't. You are building a castle on quicksand. As always, the only thing LLMs are usable for:
https://hachyderm.io/@inthehands/112006855076082650
> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.
> Alas, that does not remotely resemble how people are pitching this technology.
TrackerFF 9 months ago
We used GPT 4o for more or less the same stuff. Got a boatload of scanned bills we had to digitize, and GPT really nailed the task. Made a schema, and just fed the model all the bills.
Worked better than any OCR we tried.
- thenaturalist 9 months ago
  How are you going to find (not even talking about correcting) hallucinated errors?
  If money is involved and the LLM produces hallucination errors, how do you handle monetary impacts of such errors?
  How does that approach scale financially?
  ClearAndPresent 9 months ago
  Indeed. I anticipate the next Post Office Scandal(1,2) attributed to LLMs.
  1 https://en.wikipedia.org/wiki/British_Post_Office_scandal 2 https://www.postofficescandal.uk/
  rolandog 9 months ago
  That's awful.
  Reminds me of the Dutch childcare benefits scandal [0], where 26,000 families were unfairly labeled as having committed tax fraud (11,000 of which had been targeted via "risk profiling", as they had dual nationalities [1]). Bad policy + automation = disaster. The wikipedia article doesn't fully explain how some automated decisions were made (e.g. You had a typo in a form, therefore all previous benefits were clawed-back; if you owe more than €3.000,- then you're a fraudster and if you called to ask for clarification they wouldn't help you — you're officially labeled a fraudster, you see).
  Edit: couldn't find a source for my last statement, but I remember hearing it in an episode of the great Dutch News podcast. I'll see if I can find it.
  [0]: https://en.wikipedia.org/wiki/Dutch_childcare_benefits_scand...
  [1]: https://www.dutchnews.nl/2021/02/full-scale-parliamentary-in...
  rolandog 9 months ago
  Just for posterity, I couldn't find the specific podcast episode, but there are public statements from some of the victims [0] available online (translated):
  > What personally hurt you the most about how you were treated? > Derya: 'The worst thing about all of this, I think, was that I was registered as a fraudster. But I didn't know anything about that. There was a legal process for it, but they blocked it by not telling me what OGS (intent gross negligence) entailed. They had given me the qualification OGS and that was reason for judges to send me home with rejections. I didn't get any help anywhere and only now do I realize that I didn't stand a chance. All those years I fought against the heaviest sanction they could impose on me and I didn't know anything. I worked for the government. I worked very hard. And yet I was faced with wage garnishment and had to use the food bank. If I had known that I was just a fraudster and that was why I was being treated like that, I wouldn't have exhausted myself to prove that I did work hard and could pay off my debts myself. I literally and figuratively worked myself to death. And the consequences are now huge. Unfortunately.'
  [0]: https://www.bnnvara.nl/artikelen/hoe-gaat-het-nu-met-de-slac...
  is_true 9 months ago
  We tried all models from openai and google to get data from images and all of them made "mistakes".
  The images are tables with 4 columns and 10 rows of numbers and metadata above that are in a couple of fields. We had thousands of images already loaded and when we tried to check those previously loaded images we found quite a few errors.
  infecto 9 months ago
  Multimodal LLMs are not up for these tasks imo. It can describe an image but its not great on tables and numbers. Now on the other hand, using something like Textract to get the text representation of the table and then feeding that into a LLM was a massive success for us.
  is_true 9 months ago
  LLMs don't offer much value for our use case, almost all values are just numbers
  infecto 9 months ago
  Then you should be using something like Textract or other tooling in that space. Multimodal LLMs are no replacement.
  is_true 9 months ago
  We use opencv + tesseract and easyocr
  thenaturalist 9 months ago
  Curious, did that make you "fall back" to more conservative OCR?
  Or what else did you do to correct them?
  is_true 9 months ago
  We already had an OCR solution. We were exploring models in case the information source changes
  petercooper 9 months ago
  Not the OP, but if doing this at scale, I'd consider a quorum approach using several models and looking for a majority to agree (otherwise bump it for human review). You could also get two different approaches out of each model by using purely the model and external OCR + model and compare those too.
  coredog64 9 months ago
  I’m working on a problem in this space, and that’s the approach I’m taking.
  More detailed explanation: I have to OCR dense, handwritten data using technical codes. Luckily, the form designers included intermediate steps. The intermediate fields are amenable to Textract, so I can use a multimodal model to OCR the full table and then error check.
- gloosx 9 months ago
  Did you finally balance out lol? If you didn't, would you approach finding a mistake by going through each bill manually?
marcell 9 months ago
I’m making a free open source library for this, check it at http://github.com/fetchfox/fetchfox
MIT license. It’s just one line of code to get started: ‘fox.run(“get data from example.com”)’
- thenaturalist 9 months ago
  How do you plan to address prompt injection/ poisoned data for a method that simply vacuums unchecked inputs into an LLM?
  marcell 9 months ago
  It hasn’t been an issue yet, but I’m sure it will come up at some point. If you see a problem please file an issue.
  thenaturalist 9 months ago
  So assuming it would be an issue, given that you’re building such a tool, what would your approach be?
  If I put an invisible tag on my website and it tells your scraper to ignore all previous prompts, leak its entire history and send all future prompts and replies to a web address while staying silent about it, how would you handle that?
  alach11 9 months ago
  A casual look at the source shows the architecture won't allow the attacks you're talking about. Since each request runs separately, there's no way for prompt injection on one request to influence a future request. Same thing for leaking history.
4ad 9 months ago
What a sad state for humanity that we have to resort to this sort of OCR/scrapping instead of the original data being released in a machine readable format in the first place.
- TrackerFF 9 months ago
  To be fair, there are some considerations here:
  1) There's plenty of old data out there. Newspaper scans from the days before computers, or digitalization of the newspaper process. Or the original files simply got lost, so manually scanned pages is all you have.
  2) There could be policies about making the data public, but in a way that discourages data scraping.
  3) The providers of the data simply don't have the resources or incentives to develop a working API.
  And many more.
- blitzar 9 months ago
  What is even sadder is that this data (especially the more recent data) is entered first in machine readable formats then sliced and diced and spat out in a non-machine readable format.
- jxramos 9 months ago
  I'd like to see financial transactions and purchases abide by some json format standard, metadata and a list of items with full product name, quantity purchased, total unit volume/amount of product, price, and unit price.
- DrillShopper 9 months ago
  Yeah, wow, humanity is so stupid for not distributing the machine readable format for the local newspaper in 1920. Gosh we're just so dumb
tpswa 9 months ago
Cool work! Correct me if I'm wrong, but I believe to use the new OpenAI structured output that's more reliable, the response_format should be "json_schema" instead of "json_object". It's been a lot more robust for me.
- danso 9 months ago
  I may be reading the documentation wrong [0], but I think if you specify `json_schema`, you actually have to provide a schema. I get this error when I do `response_format={"type": "json_schema"}`:
  openai.BadRequestError: Error code: 400 - {'error': {'message': "Missing required parameter: 'response_format.json_schema'.", 'type': 'invalid_request_error', 'param': 'response_format.json_schema', 'code': 'missing_required_parameter'}}
  I hadn't used OpenAI for data extraction before the announcement of Structured Outputs, so not sure if `type: json_object` did something different before. But supplying only it as the response format seems to be the (low effort) way to have the API infer the structure on its own
  [0] https://platform.openai.com/docs/guides/structured-outputs/s...
- ec109685 9 months ago
  I’ve been using jsonschema since forever with function calling. Does structured output just formalize things?
  chaos_emergent 9 months ago
  function calling provides a "hint" in the form of a JSON schema for an LLM to follow. the models are trained to follow provided schemas. If you have really complicated or deeply nested models, they can become less stable at generating schema-conformant JSON.
  Structured outputs apply a context-free grammar to the prediction generation so that, for each token generation, only tokens that generate a perfectly conformant JSON schema are considered.
  The benefit of doing this is predictability, but there's a trade-off in prediction stability; apparently structured output can constrain the model to generate in a way that takes it off the "happy path" of how it assumes text should be generated.
  Happy to link you to some papers I've skimmed on it if you're interested!
  pmg0 9 months ago
  Could you share some of those papers? I had a great discussion with Marc Fischer from the LMQL team [0] on this topic while at ICML earlier this year. Their work recommended decoding to natural language templates with mad lib-style constraints to follow that “happy path” you refer to, instead of decoding to a (relatively more specific latent) JSON schema [1]. Since you provided a template and knew the targeted tokens for generation you could strip your structured content out of the message. This technique also allowed for beam search where you can optimize tokens which lead to the tokens contain your expected strings, avoiding some weird token concatenation process. Really cool stuff!
  [0] https://lmql.ai/ [1] https://arxiv.org/abs/2311.04954
  undefined 9 months ago
  [deleted]
  throwup238 9 months ago
  Structured output uses "constrained decoding" under the hood. They convert the JSON schema to a context free grammar so that when the model samples tokens, invalid tokens are masked to have a probability of zero. It's much less likely to go off the rails.
philipwhiuk 9 months ago
I'm deeply worried by the impact of hallucinations in this sort of tool.
beoberha 9 months ago
Stuff like this shows how much better the commercial models are than local models. I’ve been playing around with fairly simple structured information extraction from news articles and fail to get any kind of consistent behavior from llama3.1:8b. Claude and chatGPT do exactly what I want without fail.
- 0tfoaij 9 months ago
  OpenAI stopped releasing information about their models after gpt-3, which was 175b, but the leaks and rumours that gpt-4 is an 8x220 billion parameter model are most certainly correct. 4o is likely a distilled 220b model. Other commercial offerings are going to be in the same ballpark. Comparing these to llama 3 8b is like comparing a bicycle or a car to a train or cruise ship when you need to transport a few dozen passengers at best. There are local models in the 70-240b range that are more than capable of competing with commercial offerings if you're willing to look at anything that isn't bleeding edge state of the art.
  Baeocystin 9 months ago
  Any pointers on where we can check the best local models per amount of VRAM available? I only have consumer level cards available, but I would think something that just fits in to a 24Gb card should noticably outperform something scaled for an 8Gb card, yes?
  fnord77 9 months ago
  lm studio tells you what models fit in your available RAM, with or without quantization
- minimaxir 9 months ago
  The Berkeley Function-Calling Leaderboard tracks function calling/structured data performance from multiple models: https://gorilla.cs.berkeley.edu/leaderboard.html
  Llama isn't on there but a few finetunes of it (Hermes) are OSS.
  lolinder 9 months ago
  Llama 3 70B is on there, ranked 20.
- int_19h 9 months ago
  Your problem isn't that you're using a local model. It's that you're using an 8b model. The stuff you're comparing it to is two orders of magnitude larger.
- gdiamos 9 months ago
  I usually come to a different conclusion using the JSON output on Lamini, e.g. even with Llama 3.2 3B
  https://lamini-ai.github.io/inference/json_output
  Most of these models can read. If the relevant facts are in the prompt, they can almost always extract them correctly.
  Of course bigger models do better on more complex tasks and reasoning unless you use finetuning or memory tuning.
  dcreater 9 months ago
  You should probably disclose you're the founder of lamini.
  Do you have any publicly available validation data demonstrating 100% json compliance?
  gdiamos 9 months ago
  I am a founder. It’s not meant to be a secret.
  Obviously I’m biased, but I also spend every day using tools like this.
  Regarding json compliance, we have a formal grammar and a test suite. If you find a bug please report it. I’d appreciate having more test coverage.
- A4ET8a8uTh0 9 months ago
  << Stuff like this shows how much better the commercial models are than local models.
  I did not reach the same conclusion so I would be curious if you could provide rationale/basis for your assessment in the link. I am playing with humble llama3 8b here and results for federal register type stuff ( without going into details ) was good for what I was expecting to be.. not great.
  edit: Since you mentioned llama explicitly, could you talk a little about the data/source you are using for your resutls. You got me curious and I want to dig a little deeper.
- kgeist 9 months ago
  In my tests, Llama 3.1 8b was way worse than Llama 2 13b or Solar 13b.
- tpm 9 months ago
  In my experience the Qwen2-VL models are great at this.
- thatcat 9 months ago
  I mean, those aren't comparable models. I wonder how the 405b version compares.
  Tiberium 9 months ago
  You raise a valid point, but 4o is way smaller than 405B. And 4o mini that's described in the article is highly likely <30B (if we're talking dense models).
  maleldil 9 months ago
  Is the size of OpenAI's models public, or is this guesswork?
  qwe----3 9 months ago
  If your company has a lot of ex openai employees then you know ;)
  And the public numbers are mostly right, the latest values are likely smaller now- they have been working on down sizing everything
1oooqooq 9 months ago
if you're "parsing" structured or even semi structured data with a LLM.... sigh.
an true scotch engineer know tagged data goes into the other end. but I guess that doesn't align with openai target audience and business goals.
i guess that would be fine to clean the new training data... but then you risk extrapolating hallucinations
- danso 9 months ago
  The financial disclosures example was meant to be a toy example; with the way U.S. House members file their disclosure reports now, everything should be in a relatively predictable PDF with underlying text [0], but that wasn't always the case [1]. I think this API would've been pretty helpful to orgs like OpenSecrets who in the past had to do record and enter this data manually.
  (I wouldn't trust the API alone, but combine it with human readers/validators, i.e., let OpenAI do the data entry part, and have humans do the proofreading)
  [0] https://disclosures-clerk.house.gov/public_disc/financial-pd...
  [1] https://disclosures-clerk.house.gov/public_disc/financial-pd...
pooingcode 9 months ago
I am a huge fan of using Stuctured Output to extract data.
Huge benefit that you can lock down model performance with as you fine-tune your prompt or extend out use cases.
I wrote about it here on my blog where i replaced a project’s prompt with Structured Output using Pydantic models https://amberwilliams.io/blogs/474b0361-cbc1-4fa5-b047-c042f...
Zaheer 9 months ago
Made a small project to help extract structure from documents (pdf,jpg,etc -> JSON or CSV): https://datasqueeze.ai/
There's 10 free pages to extract if anyone wants to give it a try. I've found that just sending a pdf to models doesn't extract it properly especially with longer documents. Have tried to incorporate all best practices into this tool. It's a pet project for now. Lmk if you find it helpful!
- undefined 9 months ago
  [deleted]
- matchagaucho 9 months ago
  Similarly I've found old-school OCR is needed for more reliability.
  MarkMarine 9 months ago
  I've been using this to OCR some photos I took of books and it's remarkable at it. My first pass was just a loop where I'd OCR, feed the text to the model and ask it to normalize into a schema but I found out just sending the image to the model and asking it to OCR and turn it into the shape of data I wanted was so much more accurate.
  bagels 9 months ago
  Combining google's ocr with llm gives OCR superpowers. Tell the llm the text is from an ocr and ask it to correct it.
  saturn8601 9 months ago
  That sounds like it could be very dangerous when the LLM gets it wrong...
  bagels 9 months ago
  Depends what you're using it for. If you're relying on OCR, you've already got to accept some amount of error.
- hackernewds 9 months ago
  Is this simply the OCR bits to feed to openai structured output?
- artisandip7 9 months ago
  tried it works great, ty!
minimaxir 9 months ago
> Note that this example simply passes a PNG screenshot of the PDF to OpenAI's API — results may be different/more efficient if you send it the actual PDF.
OpenAI's API only accepts images: https://platform.openai.com/docs/guides/vision
To my knowledge, all the LLM services that take in PDF input do their own text extraction of the PDF before feeding it to an LLM.
- tyre 9 months ago
  or convert PDF to image and send that. We’ve done it for things that textract completely mangled, but sonnet has no problem. Especially tables built out of text characters from very old systems
- ec109685 9 months ago
  I don’t think it does OCR. It’s able to use the structure of the PDF to guide the parsing.
mmsc 9 months ago
Adding to the list of "now try it with"....
The SEC's EDGAR database (which is for SEC filings) is another nightmare ready to end. Extracting individual sections from a filing is, afaik, impossible pragmatically.
I tried making two parsers: https://github.com/MegaManSec/SEC-Feed-Parser and https://github.com/MegaManSec/SEC-sec-incident-notifier but they're just hacks.
Then just link it up to your automated investment platform and you're ready to go!
- infecto 9 months ago
  Would you not want to read the XBRL from the filing? I thought those are now mandatory.
  This is one of those interesting areas where its hard to innovate because the data is already available from most/all data vendors and its cheap and accurate enough that nobody is going to reinvent those processes but also too expensive for an individual to purchase.
  derivagral 9 months ago
  My (admittedly aged) experience with XBRL is that each company was able to define its own fields/format within that spec, and that most didn't agree on common names for common fields. Parsing it wasn't fun.
  infecto 9 months ago
  I have spotty education on the matter but I believe they all conform to a FASB taxonomy so there is at least a list of possible tags in use. I do wonder if any of the big data vendors actually use this though.
jsemrau 9 months ago
The SEC has a well defined API with EDGAR.
https://jdsemrau.substack.com/p/mem0-building-a-sec-10k-anal...
jxramos 9 months ago
Does anyone follow Vik's work? eg https://x.com/VikParuchuri/status/1846153661791011158
kiakiaa 9 months ago
Fine-tuning smaller models specifically for data extraction could indeed save costs for large-scale tasks; I've found tools like FetchFox helpful for efficiently extracting data from websites using AI.
myflash13 9 months ago
Is there an automated way to check results and reduce hallucinations? Would it help to do a second pass with another LLM as a sanity check to see if numbers match?
- thibaut_barrere 9 months ago
  This is what I am implementing a the moment (together with sampling for errors).
Imanari 9 months ago
Does anybody have experience with Azure Document Intelligence? How does it compare to OAIs extraction capabilities?
frays 9 months ago
Very cool and real application of LLMs. Although hallucinations is still something to be very wary of.
andrewg4445 9 months ago
This helped me a lot, thx.
black_13 9 months ago
[dead]