• alex-moon 5 hours ago

    Big fan of this write-up: it's an easy-to-understand and at the same time brutally honest example of a domain in which a) you would expect LLMs to perform very well, b) they don't, and c) the solution is to make the use of ML more targeted, a complement to human reasoning rather than a replacement for it.

    Over and over again we see businesses sinking money into "AI" where they are effectively doing a) and then calling it a day, blithely expecting profit to roll in. The day cannot come too soon when these businesses all lose their money and the hype finally dies - and we can go back to using ML the way this write-up does (i.e. the way it is meant to be used). Let's hope no critical systems (e.g. healthcare or law enforcement) make the same mistake these businesses are making before that time.

    • infecto 3 hours ago

      On the flip side, I thought the write-up was weak on details; while "brutally honest", it did not touch on how they even tried to implement an LLM in the workflow, and for all we know they were using an outdated model or a bad implementation. Your bias seems to show here, though: you have jumped so quickly into a camp that it's easy to enjoy an article that supports your worldview.

      • jerf 3 hours ago

        To be honest, I exited the article thinking the answer is "no", or at least, perilously close to "no". The same amount of work put into a conventional solution probably would have been better. That cross-product "solution" is a generalized fix for data generation from a weak data source and as near as I can tell is what is actually doing most of the lifting, not the LLM.

        That said, I'm not convinced there isn't something to the idea; I just don't know that this is the correct use of LLMs. I find myself wondering if from-scratch training of a much, much smaller model on the original data, using LLM technology but not one of the current monsters, might not work better. I also wonder if this might be a case where prompt engineering isn't the way to go, and directly sampling the resulting model would be better. Or maybe start with GPT-2 and ask it for lists of things; in a weird sort of way, GPT-2's "spaciness" and inaccuracy is sort of advantageous for this. Asking "give me a list of names" and getting "Johongle X. Boodlesmith" would be disastrous from a modern model, but for this task it's actually a win. (And I wouldn't ask GPT-2 to try to format the data; I'd probably just go for a list of nicely randomized-but-plausible values and solve all the issues like "tying the references together" conventionally.)
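
        Something like this minimal sketch is what I have in mind (GPT-2 via the Hugging Face transformers pipeline; the prompt and post-processing are just an illustration, not anything from the article):

          # Sample a small, older model for plausible-but-fake values, then handle
          # structure conventionally. Model choice and prompt are assumptions.
          from transformers import pipeline

          generator = pipeline("text-generation", model="gpt2")

          prompt = "Here is a list of customer names:\n1."
          out = generator(prompt, max_new_tokens=60, do_sample=True,
                          temperature=1.2)[0]["generated_text"]

          # Keep only lines that look like numbered entries; dedup, formatting and
          # referential integrity get handled in ordinary code afterwards.
          names = [line.split(".", 1)[-1].strip()
                   for line in out.splitlines() if line.strip()[:1].isdigit()]
          print(names)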

    • zebomon 24 minutes ago

      Good read. I wonder to what degree this kind of step-making, which I suppose is often what is happening under the hood of OpenAI's o1 "reasoning" model, is set up manually (manually as in on a case-by-case basis) as you've done here.

      I'm reminded of an evening that I spent playing Overcooked 2 with my partner recently. We made it through to the 4-star rounds, which are very challenging, and we realized that for one of the later 4-star rounds, one could reach the goal rather easily -- by taking advantage of a glitch in the way that items are stored on the map. This realization brought up an interesting conversation, as to whether or not we should then beat the round twice, once using the glitch and once not.

      With LLMs right now, I think there's still a widespread hope (wish?) that the emergent capabilities seen in scaled-up data and training epochs will yield ALL capabilities hereon. Fortunately for the users of this site, hacking together solutions seems like it's going to remain necessary for many goals.

      • WhiteOwlEd 34 minutes ago

        Building on this, human preference optimization (such as Direct Preference Optimization or Kahneman-Tversky Optimization) could be used to help refine models so they create better data.

        I wrote about this more recently in the context of using LLMs to improve data pipelines. That blog post is at: https://www.linkedin.com/posts/ralphbrooks_bigdata-dataengin...

        • SkyVoyager99 2 hours ago

          I think this article does a good job of capturing the complexities of generating test data for real-world databases. Generating mock data using LLMs for individual tables based on the naming of the fields is one thing, but doing it across multiple tables, while honoring complex relationships across them (primary-foreign keys across 1:1, 1:N, and M:N with intermediate tables), is a whole other level of challenge. And it's even harder for databases such as MongoDB, where the relationships across collections are often implicit and can best be inferred from the names of the fields.
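
          To make that ordering problem concrete, here's a rough sketch of the conventional approach (table and column names are made up): generate the parent rows first, then children that can only reference keys which already exist.

            import random, uuid

            # Parent table first, so every foreign key below has something to point at.
            users = [{"user_id": str(uuid.uuid4()), "name": f"user_{i}"} for i in range(100)]

            # 1:N -- each order picks an existing user_id, so referential integrity holds.
            orders = [{"order_id": str(uuid.uuid4()),
                       "user_id": random.choice(users)["user_id"],
                       "total": round(random.uniform(5, 500), 2)}
                      for _ in range(1000)]

            # M:N via an intermediate table: sample distinct (user, tag) pairs.
            tags = [{"tag_id": i, "label": f"tag_{i}"} for i in range(10)]
            user_tags = random.sample([(u["user_id"], t["tag_id"])
                                       for u in users for t in tags], k=200)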

          • dogma1138 5 hours ago

            Most LLMs I’ve played with are terrible at generating mock data that is in any way useful because they are strongly reinforced against anything that could be used for “recall”.

            At least when playing around with llama2 for this, you need to abliterate it to the point of lobotomy to get it to do anything, and then the usefulness drops for other reasons.

            • nonameiguess 6 minutes ago

              We faced probably the worst form of this problem you can face when working for the NRO on ground processing of satellite data. When new orbital sensor platforms are developed, new processing software has to be developed in tandem, but the software has to be developed and tested before the platforms are actually launched, so real data is impossible and you have to generate and process synthetic data instead.

              Even then, it's an entirely tractable problem. If you understand the physical characteristics and capabilities of the sensors and the basic physics of satellite imaging in general, you simply use that knowledge. You can't possibly know what you're really going to see when you get into space and look, but you at least know the mathematical characteristics the data will have.

              The entire problem here is that you need a lot of expertise to do this. It's not even expertise that I or any other software developer on the project had. We needed PhDs in orbital mechanics, atmospheric studies, and image science to do it. There isn't and probably never will be a "one-click" button to just make it happen, but this kind of thing might honestly be a great test for anyone who truly believes LLMs can reason at a level equal to human experts. Generate a form of data that has never existed, and thus cannot have been in your training set, from first principles of basic physics.

              • sgarland 3 hours ago

                IMO, nothing beats a carefully curated selection of data, randomly sampled (with correlations as needed). The problem is that you rapidly start getting into absurd levels of detail for things like postal addresses, at least if you want them to be accurate.
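
                A small sketch of what I mean, with a made-up pool: keep correlated fields together as one record, so a draw can't pair, say, Portland with an Illinois ZIP.

                  import random

                  # Curated pool; each entry keeps city/state/zip consistent.
                  ADDRESSES = [
                      {"city": "Springfield", "state": "IL", "zip": "62701"},
                      {"city": "Portland",    "state": "OR", "zip": "97201"},
                      {"city": "Albany",      "state": "NY", "zip": "12207"},
                  ]
                  NAMES = ["Ada Lovelace", "Grace Hopper", "Alan Turing"]

                  def fake_customer():
                      addr = random.choice(ADDRESSES)  # correlated fields drawn as a unit
                      return {"name": random.choice(NAMES), **addr}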

                • dartos an hour ago

                  Maybe I'm confused, but why would an LLM be better at mapping tuples to functions than a plain switch statement?

                  Especially since it doesn’t seem to totally understand the breadth of possible kinds of faked data?
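
                  For reference, the non-LLM baseline I have in mind is just a dispatch table from (column, type) tuples to generator functions; everything below is illustrative, not from the article.

                    import random, string

                    def fake_email():
                        user = "".join(random.choices(string.ascii_lowercase, k=8))
                        return f"{user}@example.com"

                    def fake_name():
                        return random.choice(["Ada Lovelace", "Alan Turing"])

                    # The "switch statement": (column name, SQL type) -> generator.
                    GENERATORS = {
                        ("email", "varchar"): fake_email,
                        ("name",  "varchar"): fake_name,
                        ("age",   "int"):     lambda: random.randint(18, 90),
                    }

                    def generate(column, sql_type):
                        # Fall back to a generic string when the tuple is unknown.
                        return GENERATORS.get((column, sql_type), lambda: "placeholder")()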

                  • yawnxyz 2 hours ago

                    ok so a long time ago I used "real-looking examples" in a bunch of client prototypes (for a big widely known company's web store) and the account managers couldn't tell whether these were new items that had been released or not... so somehow the mock data ended up in production (before it got caught and snipped)

                    ever since then I use "real-but-dumb examples" so people know at a glance that it can't possibly be real

                    the reason I don't like Latin placeholder text is b/c the word lengths are different from English, so sentence widths end up very different

                    • globalise83 10 minutes ago

                      Yes, this should be a lesson in all software engineering courses: never use real or realistic data in examples or documentation. I once made the mistake of using a realistic but totally fake configuration ID and had people use it in their production setup. Far better to use configId=justanexampleid or whatever.

                    • ShanAIDev an hour ago

                      This is a fascinating topic! The ability to generate high-fidelity mock data can significantly streamline development and testing processes. It's a smart move given the diverse tech stacks in use today.

                      Overall, this looks like a promising direction!

                      • danielbln 4 hours ago

                        Did I miss it, or did the article not mention which LLM they tried or what prompts they used? They also mention zero-shot only, meaning no in-context learning. And they didn't think to tweak the instructions after it failed the first time? I don't know; it doesn't seem like they really tried all that hard and basically just quickly checked the "yep, LLMs don't work here" box.

                        • pitah1 7 hours ago

                          The world of mock data generation is now flooded with ML/AI solutions that generate data directly, but this is a solution that understands it is better to generate metadata to help guide the data generation. I found this to be the case given that the former solutions rely on production data and retraining, are slow, demand huge resources, offer no guarantee against leaking sensitive data, and are unable to retain referential integrity.

                          As mentioned in the article, I think there is a lot of potential in this area for improvement. I've been working on a tool called Data Caterer (https://github.com/data-catering/data-caterer), which is a metadata-driven data generator that can also validate based on the generated data. Then you have full end-to-end testing using a single tool. There are also other metadata sources that can help drive these kinds of tools besides LLMs (e.g. data catalogs, data quality).
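
                          To illustrate what "metadata-driven" means here (a generic sketch, not Data Caterer's actual API): the generator is driven by a schema description rather than by production data, and the same spec can be reused for validation.

                            # Hypothetical schema spec; the field names and options are invented.
                            schema = {
                                "table": "accounts",
                                "row_count": 1000,
                                "fields": [
                                    {"name": "account_id", "type": "uuid", "unique": True},
                                    {"name": "balance", "type": "decimal", "min": 0, "max": 100000},
                                    {"name": "opened_at", "type": "date", "after": "2015-01-01"},
                                ],
                            }
                            # A generator walks this spec to emit rows; validation can reuse the
                            # same spec to check that outputs still satisfy the constraints.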

                          • chromanoid 3 hours ago

                            The article reads like it was a bullet point list inflated by AI. But maybe I am just allergic to long texts nowadays.

                            I wonder if we will use AI users to generate mock data and e2e test our applications in the near future. This would probably generate even more realistic data.

                            • benxh 4 hours ago

                              I'm pretty sure that Neosync[0] does this to a pretty good degree; it is open source and YC-funded too.

                              [0] https://www.neosync.dev/

                              • roywiggins 2 hours ago

                                a digression but

                                > this text has been the industry's standard dummy text ever since some printed in the 1500s

                                doesn't seem to be true:

                                https://slate.com/news-and-politics/2023/01/lorem-ipsum-hist...

                                • lysecret 6 hours ago

                                  This is a very good point; that's probably my number one use case for things like Copilot chat: filling in some of my types and generating some test cases.

                                  • eesmith 4 hours ago

                                    A European friend of mine told me about some of the problems of mock data generation.

                                    A hard one, at least for the legal requirements in her field, is that it must not include a real person's information.

                                    Like, if it says "John Smith, 123 Oak St." and someone actually lives there with that name, then it's a privacy violation.

                                    You end up having to use addresses that specifically do not exist, and driver's license numbers which are invalid, etc.
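
                                    A tiny sketch of that approach, leaning on values that are actually reserved for fiction/documentation where such ranges exist (there is no equivalent reserved street address, so that part still has to be checked by hand):

                                      import random

                                      def fake_contact():
                                          return {
                                              # RFC 2606 reserved domain, never a real mailbox
                                              "email": f"user{random.randint(1, 999)}@example.com",
                                              # 555-01xx numbers are reserved for fictional use
                                              "phone": f"+1-202-555-{random.randint(100, 199):04d}",
                                              # TEST-NET-1 range, reserved for documentation
                                              "ip": f"192.0.2.{random.randint(1, 254)}",
                                          }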

                                    • mungoman2 3 hours ago

                                      Surely that's only their interpretation of privacy laws, and not something tested in courts.

                                      It seems unlikely to actually break regulations if it's clear that the data has been fished out of the entropy well.

                                      • aithrowawaycomm 3 hours ago

                                        But if "fished out of the entropy well" includes "a direct copy of something which should not have been in the training data in the first place, like a corporate HR document," then that's a big problem.

                                        I don't think AI providers get to hide behind an "entropy well" defense when that entropy is a direct consequence of AI professionals' greed and laziness around data governance.

                                    • thelostdragon 3 days ago

                                      This looks quite interesting and promising.