by phodo on 5/30/24, 8:24 PM with 90 comments
by bastien2 on 5/30/24, 11:39 PM
by pierre on 5/30/24, 10:34 PM
https://docs.llamaindex.ai/en/stable/getting_started/starter...
by m0shen on 5/30/24, 9:54 PM
As far as AI goes, not sure.
by Ey7NFZ3P0nzAe on 5/31/24, 5:05 PM
It supports virtually all LLMs and embeddings, including local LLMs and local embeddings. It scales surprisingly well, and I have tons of improvements to come when I have some free time or procrastinate. Don't hesitate to ask for features!
Here's the link: https://github.com/thiswillbeyourgithub/DocToolsLLM/
by constantinum on 5/31/24, 3:13 AM
There is a 20 min read on why parsing PDFs is hell: https://unstract.com/blog/pdf-hell-and-practical-rag-applica...
To parse PDFs for RAG applications, you'll need tools like LLMwhisperer[1] or unstructured.io[2].
Now back to your problem:
This solution might be overkill for your requirement, but you can try the following:
To set things up quickly, try Unstract[3], an open-source document processing tool. You can set this up and bring your own LLM models; it also supports local models. It has a GUI to write prompts to get insights from your documents.[4]
[1] https://unstract.com/llmwhisperer/ [2] https://unstructured.io/ [3] https://github.com/Zipstack/unstract [4] https://github.com/Zipstack/unstract/blob/main/docs/assets/p...
by elrostelperien on 5/30/24, 9:59 PM
Without AI, but searching the PDF content, I use Recoll (https://www.recoll.org/) or ripgrep-all (https://github.com/phiresky/ripgrep-all)
by hm-nah on 5/31/24, 2:23 PM
It’s not local, but the Azure Document Intelligence OCR service has a number of prebuilt models. The “prebuilt-read” model is $1.50/1k pages. Once you OCR your docs, you’ll have a JSON of all the text AND you get breakdowns by page/word/paragraph/tables/figures/all with bounding boxes.
Forget the Lang/Llama/Chain-theory. You can do it all in vanilla Python.
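As a rough illustration of the vanilla Python point, here is a minimal sketch that searches OCR output page by page. The JSON shape (a `pages` list whose entries carry `pageNumber` and `lines` with `content`) is an assumption modeled loosely on what such services return, so check it against your actual output. The sample data and the `find_pages` helper are hypothetical.

```python
import json

# Hypothetical, simplified OCR result; real service output has many more fields.
SAMPLE_RESULT = json.loads("""
{
  "pages": [
    {"pageNumber": 1, "lines": [{"content": "Invoice 2021-04"}, {"content": "Total due: 120 EUR"}]},
    {"pageNumber": 2, "lines": [{"content": "Terms and conditions"}]}
  ]
}
""")


def find_pages(result, query):
    """Return (page_number, line_text) pairs whose text contains the query."""
    needle = query.lower()
    hits = []
    for page in result.get("pages", []):
        for line in page.get("lines", []):
            text = line.get("content", "")
            if needle in text.lower():
                hits.append((page["pageNumber"], text))
    return hits


if __name__ == "__main__":
    for page_number, text in find_pages(SAMPLE_RESULT, "total"):
        print(f"page {page_number}: {text}")
```

Plain substring matching like this is only keyword search, but it already tells you which page of which document to open.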
by Kikawala on 5/30/24, 10:19 PM
SecureAI-Tools: https://github.com/SecureAI-Tools/SecureAI-Tools
by pixelmonkey on 5/31/24, 2:45 AM
by SoftTalker on 5/31/24, 2:23 AM
by Kikobeats on 6/7/24, 1:49 PM
Here's an example turning an arXiv paper into real text:
https://api.microlink.io/?data.html.selector=html&embed=html...
It looks like a PDF, but if you open devtools you can see it's actually a very precise HTML representation.
by theolivenbaum on 5/31/24, 11:50 AM
by brailsafe on 5/31/24, 1:17 AM
What I haven't seen suggested, though, is the built-in Spotlight. Press CMD+Space, type some unique words that might appear in the document, and Spotlight will search it. This also works surprisingly well for non-OCRd images of text, anything inside a zip file, an email, etc.
by yousnail on 5/30/24, 10:47 PM
I’ve used both for sensitive internal SOPs, and both work quite well. PrivateGPT excels at ingesting many separate documents, the other excels at customization. Both are totally offline, and can use mostly whatever models you want.
by ssahoo on 6/8/24, 2:43 PM
Get a Copilot PC with Recall enabled and quickly scan through the documents by opening them in Adobe Acrobat Reader. Voilà! You will have an SQLite DB that has your index. A few days later, Adobe could have your data in their LLM.
by gibsonf1 on 5/30/24, 9:36 PM
by pawelduda on 5/31/24, 11:07 AM
by ilaksh on 5/31/24, 3:22 AM
https://andrejusb.blogspot.com/2024/03/optimizing-receipt-pr...
But I suggest that you just skip that and use gpt-4o. They aren't actually going to steal your data.
Sort through it ahead of time to find anything with a credit card number or anything like that.
Or you could look into InternVL.
Or a combination of PaddleOCR first and then use a strong LLM via API, like gpt-4o or llama3 70b via together.ai
If you truly must do it locally, then if you have two 3090s or 4090s it might work out. Otherwise the LLMs may not be smart enough to give good results.
Leaving out the details of your hardware makes it impossible to give good advice about running locally. Other than that it's not really necessary.
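For the suggestion above to screen documents for credit card numbers before sending them to an API, a minimal sketch could look like the following. It assumes you already have the extracted text as a string. The regex and the helper names are illustrative only, and a Luhn checksum match means "looks like a card number", not proof that it is one.

```python
import re

# Candidate runs of 13-19 digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text):
    """Return digit strings in text that look like card numbers and pass Luhn."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found
```

Any document where `find_card_numbers` returns a non-empty list can be set aside for manual review instead of being uploaded.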
by bendsawyer on 5/30/24, 10:05 PM
The result is a huge step up from 'full text search' solutions, for my use case. I can have conversations with decades of documents, and it's incredibly helpful. The support scheme keeps my original documents unconnected from the machine, which I own, while updates are done over a remote link. It's great, and I feel safe.
Things change so fast in this space that there did not seem to be a cheap, stable, local alternative. I honestly doubt one is coming. This is not a one-size-fits-all problem.
by skapa_flow on 5/31/24, 8:29 AM
by phodo on 6/3/24, 6:00 AM
by westcort on 5/31/24, 1:33 AM
by hulitu on 5/31/24, 5:12 AM
Adobe Reader can search all PDFs in a directory. They hide this function though.
by kkfx on 5/31/24, 7:07 AM
ocrmypdf + ripgrep-all, or recoll (GUI+CLI Xapian wrapper) if you prefer an indexed version. For mere full-text search, nothing currently gives better results. Semantic search is still not there, and Paperless-ngx, tagspaces and so on demand way too much time for adding just a single document to be useful at a certain scale.
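A minimal sketch of that ocrmypdf + ripgrep-all workflow, driven from Python, might look like this. It assumes both `ocrmypdf` and `rga` are installed and on PATH. The folder layout and helper names are hypothetical, and `--skip-text` is used so pages that already carry a text layer are left alone.

```python
import subprocess
from pathlib import Path


def ocr_command(source, target):
    """Build the ocrmypdf call; --skip-text skips pages that already have text."""
    return ["ocrmypdf", "--skip-text", str(source), str(target)]


def search_command(pattern, directory):
    """Build a case-insensitive ripgrep-all (rga) search over a directory."""
    return ["rga", "--ignore-case", pattern, str(directory)]


def ocr_all(input_dir, output_dir):
    """OCR every PDF in input_dir into output_dir, keeping file names."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        subprocess.run(ocr_command(pdf, output_dir / pdf.name), check=True)


def search(pattern, directory):
    """Run rga and return its textual output (empty if nothing matched)."""
    result = subprocess.run(
        search_command(pattern, directory), capture_output=True, text=True
    )
    return result.stdout
```

Calling `ocr_all("scans", "searchable")` once and then `search("invoice", "searchable")` as often as needed mirrors the one-time OCR pass followed by repeated full-text queries.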
My own personal version is org-mode. I keep all my stuff org-attached, so instead of searching the PDFs I search the notes linking to them: a kind of metadata-rich, taggable, quick full-text search. Even though org-ql is there, I almost never use it, just org-roam-node-find and counsel-rg on notes. Once set up, this allows quick enough manual and variously automated archiving, but doing it on a large home directory is very long and tedious manual work. For me it's worth doing since I keep adding documents and using them, but it took more than a year to be "almost done enough" and it's still unfinished after 4 years.
by treetalker on 5/31/24, 2:33 AM
If you’re having trouble thinking of search terms to plug into HoudahSpot (or grep etc.) then I suppose you could ask a chatbot to assist your brainstorming, and then plug those terms into HoudahSpot/grep/etc.
by epirogov on 5/31/24, 10:58 AM
by dudus on 5/30/24, 9:33 PM
If you trust Google that is.
by jesterson on 5/31/24, 7:55 AM
There is no AI or any other modern fad, but full-text search (including OCR for image files inside PDFs) works great.
by 1123581321 on 5/30/24, 10:02 PM
If you're okay with some false positives, DEVONthink would work as is, actually.
by edgyquant on 5/30/24, 10:57 PM
by jeffreyq on 5/31/24, 2:28 PM
by hypefi on 6/1/24, 8:19 AM
by Tylast on 5/31/24, 9:34 AM
by sciencesama on 5/30/24, 11:58 PM
by gandalfthepink on 5/31/24, 12:39 AM
by vrighter on 6/1/24, 5:14 AM
by finack on 5/30/24, 9:49 PM
by adyashakti on 5/30/24, 8:47 PM