by pncnmnp on 9/29/24, 5:06 PM with 46 comments
by alex-moon on 10/2/24, 11:09 AM
Over and over again we see businesses sinking money into "AI" where they are effectively doing a) and then calling it a day, blithely expecting profit to roll in. The day cannot come too soon when these businesses all lose their money and the hype finally dies - and we can go back to using ML the way this write-up does (i.e. the way it is meant to be used). Let's hope no critical systems (e.g. healthcare or law enforcement) make the same mistake these businesses are making before then.
by jumploops on 10/2/24, 5:25 PM
The author expected to use LLMs to just solve the mock data problem, including traversing the schema and generating the correct Rust code for DB insertions.
This demonstrates little about using LLMs for _mock data_ and more about using LLMs for understanding existing system architecture.
The latter is a hard problem, as humans are known to create messy and complex systems (see: any engineer joining a new company).
For mock data generation, we’ve[0] actually found LLMs to be fantastic, however there are a few tricks.
1. Few-shot prompting: use a couple of example "records" by inserting user/assistant messages to "prime" the context.
2. Keep the records you've generated in context, i.e. treat every record generated as a historical chat message. This helps avoid duplicates/repeats of common tropes (e.g. John Smith).
3. Split your tables into multiple generation steps: start with "users" and then for each user generate an "address" (with history!), and so on. Model your mock data creation after your schema and its constraints; don't rely on the LLM for this step.
4. Separate mock data generation and DB updates into disparate steps. First generate CSVs (or JSON/YAML) of your data, then use a separate script to insert that data. This helps avoid issues at insertion, since you can easily tweak, retry, or skip malformed data.
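A minimal Python sketch of tricks 1, 2, and 4 above. The `call_llm` stub and the record fields are made up for illustration; in practice that function would wrap whatever chat-completion API you use:

```python
import csv
import io

# Stub standing in for a real chat-completion API call.
# Here it just returns a canned CSV row so the sketch is runnable.
def call_llm(messages):
    return "Priya Raman,priya.raman@example.com"

# Trick 1: few-shot prime the context with example "records"
# expressed as alternating user/assistant messages.
few_shot = [
    {"role": "user", "content": "Generate a user record as CSV: name,email"},
    {"role": "assistant", "content": "Dana Okafor,dana.okafor@example.com"},
]

def generate_users(n):
    messages = list(few_shot)
    records = []
    for _ in range(n):
        messages.append({"role": "user",
                         "content": "Generate another distinct user record."})
        row = call_llm(messages)
        # Trick 2: keep every generated record in the chat history
        # so the model can avoid repeats like "John Smith".
        messages.append({"role": "assistant", "content": row})
        records.append(row)
    return records

# Trick 4: emit a CSV first; a separate script inserts it into the DB,
# so malformed rows can be tweaked or skipped before insertion.
def to_csv(records):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["name", "email"])
    for row in records:
        writer.writerow(next(csv.reader([row])))
    return buf.getvalue()
```
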
LLMs are fantastic tools for mock data creation, but don’t expect them to also solve the problem of understanding your legacy DB schemas and application code all at once (yet?).
by edrenova on 10/2/24, 5:07 PM
Maybe some day it gets better, but for now, we've found that a more traditional algorithmic approach is more consistent.
Transparency: founder of Neosync - open source data anonymization - github.com/nucleuscloud/neosync
by danielbln on 10/2/24, 12:57 PM
by dogma1138 on 10/2/24, 11:55 AM
At least when playing around with llama2 for this, you need to abliterate it to the point of lobotomy to do anything, and then the usefulness drops for other reasons.
by pitah1 on 10/2/24, 9:15 AM
As mentioned in the article, I think there is a lot of potential in this area for improvement. I've been working on a tool called Data Caterer (https://github.com/data-catering/data-caterer) which is a metadata-driven data generator that also can validate based on the generated data. Then you have full end-to-end testing using a single tool. There are also other metadata sources that can help drive these kinds of tools outside of using LLMs (i.e. data catalogs, data quality).
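A toy sketch of the metadata-driven idea in Python (not Data Caterer's actual API; the schema format and helpers here are made up for illustration). Each column is described by metadata, and values are derived from that metadata alone, with no LLM involved:

```python
import random
import string

# Hypothetical column metadata: type plus constraints.
SCHEMA = {
    "id":    {"type": "int", "min": 1, "max": 10_000},
    "name":  {"type": "string", "length": 8},
    "email": {"type": "email", "domain": "example.com"},
}

def gen_value(meta, rng):
    """Derive a value purely from column metadata."""
    if meta["type"] == "int":
        return rng.randint(meta["min"], meta["max"])
    if meta["type"] == "string":
        return "".join(rng.choices(string.ascii_lowercase, k=meta["length"]))
    if meta["type"] == "email":
        user = "".join(rng.choices(string.ascii_lowercase, k=6))
        return f"{user}@{meta['domain']}"
    raise ValueError(f"unknown type: {meta['type']}")

def gen_rows(schema, n, seed=0):
    rng = random.Random(seed)  # seeded for reproducible test data
    return [{col: gen_value(meta, rng) for col, meta in schema.items()}
            for _ in range(n)]
```

Because the same metadata describes both generation and constraints, validating the generated data against the schema comes for free, which is the end-to-end testing angle mentioned above.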
by SkyVoyager99 on 10/2/24, 2:54 PM
by nonameiguess on 10/2/24, 4:22 PM
Even then, it's an entirely tractable problem. If you understand the physical characteristics and capabilities of the sensors and the basic physics of satellite imaging in general, you simply use that knowledge. You can't possibly know what you're really going to see when you get into space and look, but you at least know the mathematical characteristics the data will have.
The entire problem here is you need a lot of expertise to do this. It's not even expertise I have or any other software developer had or has. We needed PhDs in orbital mechanics, atmospheric studies, and image science to do it. There isn't and probably never will be a "one-click" button to just make it happen, but this kind of thing might honestly be a great test for anyone that truly believes LLMs can reason at a level equal to human experts. Generate a form of data that has never existed, thus cannot have been in your training set, from first principles of basic physics.
by sgarland on 10/2/24, 1:00 PM
by zebomon on 10/2/24, 4:04 PM
I'm reminded of an evening that I spent playing Overcooked 2 with my partner recently. We made it through to the 4-star rounds, which are very challenging, and we realized that for one of the later 4-star rounds, one could reach the goal rather easily -- by taking advantage of a glitch in the way that items are stored on the map. This realization brought up an interesting conversation, as to whether or not we should then beat the round twice, once using the glitch and once not.
With LLMs right now, I think there's still a widespread hope (wish?) that the emergent capabilities seen in scaled-up data and training epochs will yield ALL capabilities hereon. Fortunately for the users of this site, hacking together solutions seems like it's going to remain necessary for many goals.
by yawnxyz on 10/2/24, 2:53 PM
Ever since then I use "real-but-dumb examples" so people know at a glance that it can't possibly be real.
The reason I don't like Latin placeholder text is because the word lengths differ from English, so sentence widths end up very different.
by benxh on 10/2/24, 12:49 PM
by WhiteOwlEd on 10/2/24, 3:54 PM
I wrote about this more recently in the context of using LLMs to improve data pipelines. That blog post is at: https://www.linkedin.com/posts/ralphbrooks_bigdata-dataengin...
by larodi on 10/2/24, 4:48 PM
But this is really not a breakthrough; anyone with fair knowledge of LLMs and E/R modeling should be able to devise it. The fact that not many people have interdisciplinary knowledge is very much evident from all the text2sql papers, for example, which is a similar domain.
by eesmith on 10/2/24, 12:58 PM
A hard one, at least for the legal requirements in her field, is that it must not include a real person's information.
Like, if it says "John Smith, 123 Oak St." and someone actually lives there with that name, then it's a privacy violation.
You end up having to use addresses that specifically do not exist, and driver's license numbers which are invalid, etc.
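One way to get "cannot be real" guarantees is to draw from ranges explicitly reserved for fiction and documentation: the North American Numbering Plan reserves 555-0100 through 555-0199 for fictional phone numbers, and RFC 2606 reserves example.com. A small sketch along those lines (the name format is just an obviously-fake heuristic; street addresses and license numbers have no comparable reserved ranges, so they are harder to handle this way):

```python
import random

# Line numbers reserved for fictional use in the NANP: XXX-555-0100..0199.
FICTIONAL_LINES = range(100, 200)

def fake_contact(rng):
    return {
        "name": f"Test User {rng.randint(1, 999)}",            # obviously fake name
        "email": f"user{rng.randint(1, 999)}@example.com",     # RFC 2606 domain
        "phone": f"202-555-{rng.choice(FICTIONAL_LINES):04d}", # reserved range
    }

def make_contacts(n, seed=0):
    rng = random.Random(seed)
    return [fake_contact(rng) for _ in range(n)]
```
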
by chromanoid on 10/2/24, 1:33 PM
I wonder if we will use AI users to generate mock data and e2e test our applications in the near future. This would probably generate even more realistic data.
by lysecret on 10/2/24, 10:55 AM
by roywiggins on 10/2/24, 2:03 PM
> this text has been the industry's standard dummy text ever since some printer in the 1500s
doesn't seem to be true:
https://slate.com/news-and-politics/2023/01/lorem-ipsum-hist...
by hluska on 10/3/24, 2:09 AM
“It should generate realistic data based solely on the schema, without requiring any external user input—a “one-click” solution with minimal friction.”
This is extremely ambitious and ambition will always be very cool.
by dartos on 10/2/24, 3:42 PM
Especially since it doesn’t seem to totally understand the breadth of possible kinds of faked data?
by erehweb on 10/3/24, 4:58 AM
by thelostdragon on 9/29/24, 9:13 PM