by pncnmnp on 9/29/24, 5:06 PM with 46 comments
by alex-moon on 10/2/24, 11:09 AM
Over and over again we see businesses sinking money into "AI" where they are effectively doing a) and then calling it a day, blithely expecting profit to roll in. The day cannot come too soon when these businesses all lose their money and the hype finally dies - and we can go back to using ML the way this write-up does (i.e. the way it is meant to be used). Let's hope no critical systems (e.g. healthcare or law enforcement) make the same mistake these businesses are making before then.
by jumploops on 10/2/24, 5:25 PM
The author expected to use LLMs to just solve the mock data problem, including traversing the schema and generating the correct Rust code for DB insertions.
This demonstrates little about using LLMs for _mock data_ and more about using LLMs for understanding existing system architecture.
The latter is a hard problem, as humans are known to create messy and complex systems (see: any engineer joining a new company).
For mock data generation, we’ve[0] actually found LLMs to be fantastic, however there are a few tricks.
1. Few-shot prompting: use a couple of example "records" by inserting user/assistant messages to "prime" the context.
2. Keep the records you've generated in context, i.e. treat every record generated as a historical chat message. This helps avoid duplicates/repeats of common tropes (e.g. John Smith).
3. Split your tables into multiple generation steps: start with "users" and then for each user generate an "address" (with history!), and so on. Model your mock data creation after your schema and its constraints; don't rely on the LLM for this step.
4. Separate mock data generation and DB updates into disparate steps. First generate CSVs (or JSON/YAML) of your data, then use a separate script to insert that data. This helps avoid issues at insertion, since you can easily tweak, retry, or skip malformed data.
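A minimal Python sketch of tricks 1, 2, and 4 above. The `call_llm` stub and the record fields are made up for illustration; in practice that function would wrap whatever chat-completion API you use:

```python
import csv
import io

# Stub standing in for a real chat-completion API call.
# Here it just returns a canned CSV row so the sketch is runnable.
def call_llm(messages):
    return "Priya Raman,priya.raman@example.com"

# Trick 1: few-shot prime the context with example "records"
# expressed as alternating user/assistant messages.
few_shot = [
    {"role": "user", "content": "Generate a user record as CSV: name,email"},
    {"role": "assistant", "content": "Dana Okafor,dana.okafor@example.com"},
]

def generate_users(n):
    messages = list(few_shot)
    records = []
    for _ in range(n):
        messages.append({"role": "user",
                         "content": "Generate another distinct user record."})
        row = call_llm(messages)
        # Trick 2: keep every generated record in the chat history
        # so the model can avoid repeats like "John Smith".
        messages.append({"role": "assistant", "content": row})
        records.append(row)
    return records

# Trick 4: emit a CSV first; a separate script inserts it into the DB,
# so malformed rows can be tweaked or skipped before insertion.
def to_csv(records):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["name", "email"])
    for row in records:
        writer.writerow(next(csv.reader([row])))
    return buf.getvalue()
```
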
LLMs are fantastic tools for mock data creation, but don’t expect them to also solve the problem of understanding your legacy DB schemas and application code all at once (yet?).
by edrenova on 10/2/24, 5:07 PM
Maybe some day it gets better, but for now, we've found that a more traditional algorithmic approach is more consistent.
Transparency: founder of Neosync - open source data anonymization - github.com/nucleuscloud/neosync
by danielbln on 10/2/24, 12:57 PM
by dogma1138 on 10/2/24, 11:55 AM
At least when playing around with llama2 for this, you need to abliterate it to the point of lobotomy to do anything, and then the usefulness drops for other reasons.
by pitah1 on 10/2/24, 9:15 AM
As mentioned in the article, I think there is a lot of potential in this area for improvement. I've been working on a tool called Data Caterer (https://github.com/data-catering/data-caterer) which is a metadata-driven data generator that also can validate based on the generated data. Then you have full end-to-end testing using a single tool. There are also other metadata sources that can help drive these kinds of tools outside of using LLMs (i.e. data catalogs, data quality).
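A toy sketch of the metadata-driven idea in Python (not Data Caterer's actual API; the schema format and helpers here are made up for illustration). Each column is described by metadata, and values are derived from that metadata alone, with no LLM involved:

```python
import random
import string

# Hypothetical column metadata: type plus constraints.
SCHEMA = {
    "id":    {"type": "int", "min": 1, "max": 10_000},
    "name":  {"type": "string", "length": 8},
    "email": {"type": "email", "domain": "example.com"},
}

def gen_value(meta, rng):
    """Derive a value purely from column metadata."""
    if meta["type"] == "int":
        return rng.randint(meta["min"], meta["max"])
    if meta["type"] == "string":
        return "".join(rng.choices(string.ascii_lowercase, k=meta["length"]))
    if meta["type"] == "email":
        user = "".join(rng.choices(string.ascii_lowercase, k=6))
        return f"{user}@{meta['domain']}"
    raise ValueError(f"unknown type: {meta['type']}")

def gen_rows(schema, n, seed=0):
    rng = random.Random(seed)  # seeded for reproducible test data
    return [{col: gen_value(meta, rng) for col, meta in schema.items()}
            for _ in range(n)]
```

Because the same metadata describes both generation and constraints, validating the generated data against the schema comes for free, which is the end-to-end testing angle mentioned above.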
by SkyVoyager99 on 10/2/24, 2:54 PM
by nonameiguess on 10/2/24, 4:22 PM
Even then, it's an entirely tractable problem. If you understand the physical characteristics and capabilities of the sensors and the basic physics of satellite imaging in general, you simply use that knowledge. You can't possibly know what you're really going to see when you get into space and look, but you at least know the mathematical characteristics the data will have.
The entire problem here is you need a lot of expertise to do this. It's not even expertise I have or any other software developer had or has. We needed PhDs in orbital mechanics, atmospheric studies, and image science to do it. There isn't and probably never will be a "one-click" button to just make it happen, but this kind of thing might honestly be a great test for anyone that truly believes LLMs can reason at a level equal to human experts. Generate a form of data that has never existed, thus cannot have been in your training set, from first principles of basic physics.
by sgarland on 10/2/24, 1:00 PM
by zebomon on 10/2/24, 4:04 PM
I'm reminded of an evening that I spent playing Overcooked 2 with my partner recently. We made it through to the 4-star rounds, which are very challenging, and we realized that for one of the later 4-star rounds, one could reach the goal rather easily -- by taking advantage of a glitch in the way that items are stored on the map. This realization brought up an interesting conversation, as to whether or not we should then beat the round twice, once using the glitch and once not.
With LLMs right now, I think there's still a widespread hope (wish?) that the emergent capabilities seen in scaled-up data and training epochs will yield ALL capabilities hereon. Fortunately for the users of this site, hacking together solutions seems like it's going to remain necessary for many goals.
by yawnxyz on 10/2/24, 2:53 PM
Ever since then I use "real-but-dumb examples" so people know at a glance that it can't possibly be real.
The reason I don't like Latin placeholder text is because the word lengths differ from English, so sentence widths end up very different.
by benxh on 10/2/24, 12:49 PM
by WhiteOwlEd on 10/2/24, 3:54 PM
I wrote about this more recently in the context of using LLMs to improve data pipelines. That blog post is at: https://www.linkedin.com/posts/ralphbrooks_bigdata-dataengin...
by larodi on 10/2/24, 4:48 PM
But this is really not a breakthrough; anyone with fair knowledge of LLMs and E/R modeling should be able to devise it. The fact that not many people have interdisciplinary knowledge is very much evident from all the text2sql papers, for example, which is a similar domain.
by eesmith on 10/2/24, 12:58 PM
A hard one, at least for the legal requirements in her field, is that it must not include a real person's information.
Like, if it says "John Smith, 123 Oak St." and someone actually lives there with that name, then it's a privacy violation.
You end up having to use addresses that specifically do not exist, and driver's license numbers which are invalid, etc.
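One way to get "cannot be real" guarantees is to draw from ranges explicitly reserved for fiction and documentation: the North American Numbering Plan reserves 555-0100 through 555-0199 for fictional phone numbers, and RFC 2606 reserves example.com. A small sketch along those lines (the name format is just an obviously-fake heuristic; street addresses and license numbers have no comparable reserved ranges, so they are harder to handle this way):

```python
import random

# Line numbers reserved for fictional use in the NANP: XXX-555-0100..0199.
FICTIONAL_LINES = range(100, 200)

def fake_contact(rng):
    return {
        "name": f"Test User {rng.randint(1, 999)}",            # obviously fake name
        "email": f"user{rng.randint(1, 999)}@example.com",     # RFC 2606 domain
        "phone": f"202-555-{rng.choice(FICTIONAL_LINES):04d}", # reserved range
    }

def make_contacts(n, seed=0):
    rng = random.Random(seed)
    return [fake_contact(rng) for _ in range(n)]
```
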
by chromanoid on 10/2/24, 1:33 PM
I wonder if we will use AI users to generate mock data and e2e test our applications in the near future. This would probably generate even more realistic data.
by lysecret on 10/2/24, 10:55 AM
by roywiggins on 10/2/24, 2:03 PM
> this text has been the industry's standard dummy text ever since some printer in the 1500s
doesn't seem to be true:
https://slate.com/news-and-politics/2023/01/lorem-ipsum-hist...
by hluska on 10/3/24, 2:09 AM
“It should generate realistic data based solely on the schema, without requiring any external user input—a “one-click” solution with minimal friction.”
This is extremely ambitious and ambition will always be very cool.
by dartos on 10/2/24, 3:42 PM
Especially since it doesn’t seem to totally understand the breadth of possible kinds of faked data?
by erehweb on 10/3/24, 4:58 AM
by thelostdragon on 9/29/24, 9:13 PM