by yigitkonur35 on 9/22/24, 2:05 AM with 91 comments
In testing with NASA's Apollo 17 flight documents, it successfully converted complex, multi-oriented pages into well-structured Markdown.
The project is open-source and available on GitHub. Feedback is welcome.
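For readers curious how such a pipeline fits together, here is a minimal sketch of the general approach (render each PDF page to an image, send it to a vision-capable LLM, collect Markdown). The model name, prompt, and helper names are illustrative assumptions, not the project's actual code.

```python
import base64
import io

from openai import OpenAI                 # assumed: vision-capable chat completions API
from pdf2image import convert_from_path   # assumed: requires poppler installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def page_to_data_url(page) -> str:
    """Encode a rendered PDF page (PIL image) as a base64 PNG data URL."""
    buf = io.BytesIO()
    page.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()


def pdf_to_markdown(path: str, model: str = "gpt-4o") -> str:
    """Convert every page of a PDF into Markdown via a vision LLM."""
    pages = convert_from_path(path, dpi=200)
    chunks = []
    for page in pages:
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Transcribe this page into well-structured Markdown. "
                             "Preserve tables, headings, and reading order."},
                    {"type": "image_url",
                     "image_url": {"url": page_to_data_url(page)}},
                ],
            }],
        )
        chunks.append(response.choices[0].message.content)
    return "\n\n".join(chunks)
```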
by Oras on 9/22/24, 4:40 PM
As others mentioned, consistency is key in parsing documents, and consistency is not a feature of LLMs.
The output might look plausible, but without proper validation this is just a nice local playground that can’t make it to production.
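One cheap, purely illustrative sanity check (not from the thread or the project) is to verify that numbers appearing in a plain-text extraction of the page also appear in the LLM's Markdown, which catches gross omissions and some hallucinated figures:

```python
import re


def numeric_tokens(text: str) -> set[str]:
    """Collect numeric tokens (integers and decimals) from a block of text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def check_numbers(source_text: str, markdown_output: str) -> tuple[set[str], set[str]]:
    """Return (missing, unexpected) numbers relative to the source text."""
    src, out = numeric_tokens(source_text), numeric_tokens(markdown_output)
    missing = src - out        # numbers dropped by the model
    unexpected = out - src     # numbers the model may have invented
    return missing, unexpected
```

It is only a heuristic (legitimate reformatting can change "1,000" to "1000", for instance), but it turns "looks plausible" into something measurable.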
by zerop on 9/22/24, 10:41 AM
by pierre on 9/22/24, 9:54 AM
The hard part is preventing the model from ignoring parts of the page and from hallucinating (see some of the GPT-4o samples here, like the Xanax notice: https://www.llamaindex.ai/blog/introducing-llamaparse-premiu...)
However, this model will get better, and we may soon have a good PDF-to-Markdown model.
by constantinum on 9/23/24, 2:55 AM
As others mentioned, accuracy is only one part of the solution criteria. Others include how the preprocessing engine performs and scales at large volumes, and how it handles very complex documents like bank loan forms with checkboxes, IRS tax forms with multi-layered nested tables, etc.
https://unstract.com/llmwhisperer/
LLMWhisperer is part of Unstract, an open-source tool for unstructured document ETL.
by jdthedisciple on 9/22/24, 8:35 AM
by smusamashah on 9/22/24, 6:35 AM
That converted NASA doc should be included in repo and linked in readme if you haven't already.
by bravura on 9/22/24, 10:18 AM
by charlie0 on 9/22/24, 3:25 PM
by TZubiri on 9/22/24, 7:54 PM
by eth0up on 9/22/24, 12:33 PM
I had previously done so manually, with regex, and was surprised by the quality of GPT's end results, despite many preceding failed iterations. The work was done in two steps: first with pdf2text, then with Python.
I'm still trying to create a script to extract the latest numbers from the FL website and append them to a CSV list, without re-running the stripping script on the whole PDF every time. Why? I want people to be able to freely search the entire history of winning numbers, which, in the site's hosted search function, is limited to only two of 30+ years.
I know there's a more efficient method, but I don't know more than that.
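A hedged sketch of the incremental part (the results URL, date format, and parse_draws helper are placeholders, since the FL site's actual layout isn't described here): keep the CSV keyed by draw date, read the most recent date already recorded, and append only newer rows instead of reprocessing the whole PDF.

```python
import csv
from datetime import date
from pathlib import Path

CSV_PATH = Path("winning_numbers.csv")  # columns: draw_date, numbers


def last_recorded_date() -> date | None:
    """Return the newest draw date already in the CSV, or None if empty."""
    if not CSV_PATH.exists():
        return None
    with CSV_PATH.open(newline="") as f:
        dates = [date.fromisoformat(row["draw_date"]) for row in csv.DictReader(f)]
    return max(dates) if dates else None


def append_new_draws(draws: list[dict]) -> int:
    """Append only draws newer than what's on file; returns how many were added."""
    cutoff = last_recorded_date()
    new_rows = [d for d in draws
                if cutoff is None or date.fromisoformat(d["draw_date"]) > cutoff]
    write_header = not CSV_PATH.exists()
    with CSV_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["draw_date", "numbers"])
        if write_header:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)

# `draws` would come from scraping the latest results page, e.g.:
# draws = parse_draws(requests.get(LATEST_RESULTS_URL).text)  # hypothetical helper
```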
by refulgentis on 9/22/24, 6:34 PM
I appreciate your work, your intent, and your sharing it. It's very important to understand what you're doing and its context when you share it.
At that point, you are responsible for it, and the choices you make when communicating about it reflect on you.
by fzysingularity on 9/22/24, 9:30 PM
by KoolKat23 on 9/23/24, 11:16 AM
I know this was an issue when GPT-4 Vision initially came out, due to training; I'm not sure if it's a solved problem or if your tool handles it.
by jdross on 9/22/24, 5:55 AM
by magicalhippo on 9/22/24, 6:41 AM
by devops000 on 9/23/24, 7:50 AM
by scottmcdot on 9/22/24, 12:49 PM
by wittjeff on 9/22/24, 8:03 PM
by bschmidt1 on 9/23/24, 11:07 PM
I won't tell them :) :D >:D :|