by gnewton77 on 8/19/24, 5:59 PM
Did some similar work with similar visualizations ~2009, on ~5.7M research articles (PDFs, private corpus) from the scientific publishers Elsevier and Springer:
Newton, G., A. Callahan & M. Dumontier. 2009. Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library. Second Workshop on Very Large Digital Libraries at the European Conference on Digital Libraries (ECDL) 2009. https://lekythos.library.ucy.ac.cy/bitstream/handle/10797/14...
I am the first author.
by minimaxir on 8/19/24, 3:59 PM
One of the now-underdiscussed features of embeddings is that you can indeed use any existing statistical modeling techniques on them out of the box, and as a bonus avoid the common NLP preprocessing nuances and pitfalls (e.g. stemming) entirely.
This post is a good example of why going straight to LLM embeddings for NLP is a pragmatic first step, especially for long documents.
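A minimal sketch of what I mean (the embedding model, placeholder texts, and labels here are just for illustration, not from the article):

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    # Placeholder corpus and labels, just to show the shape of the workflow.
    texts = ["Quarterly financial report for a retail chain ...",
             "Lecture notes on organic chemistry reaction mechanisms ..."]
    labels = ["finance", "science"]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    X = model.encode(texts)  # dense vectors; no stemming, stopword lists, or tokenizer tuning
    clf = LogisticRegression(max_iter=1000).fit(X, labels)

    print(clf.predict(model.encode(["Balance sheet and cash flow statement, FY2023"])))

Any classical model (logistic regression, gradient boosting, k-NN, clustering) slots in the same way once the text is a fixed-length vector.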
by snats on 8/19/24, 5:07 PM
Hi! Author here. I wasn't expecting this to be at the top of HN. AMA!
by whistle650 on 8/19/24, 3:58 PM
Interesting read with lots of good detail, thank you. A comment: if you are balancing the classes when you do one vs all binary training, and then use the max probability for inference, your probabilities might not be calibrated well, which could be a problem. Do you correct the probabilities before taking the argmax?
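For reference, one generic way to handle it (a sketch of a common fix, not necessarily what the post does): train each one-vs-all model with class weighting instead of resampling, then calibrate its probabilities on data with the natural class frequencies before taking the argmax, e.g. with scikit-learn's CalibratedClassifierCV:

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.linear_model import LogisticRegression

    def fit_ovr_calibrated(X, y, classes):
        # One binary model per class, trained on the natural class frequencies.
        # class_weight="balanced" stands in for resampling; CalibratedClassifierCV
        # then calibrates each model's probabilities via cross-validation.
        models = {}
        for c in classes:
            target = (y == c).astype(int)
            base = LogisticRegression(max_iter=1000, class_weight="balanced")
            models[c] = CalibratedClassifierCV(base, method="sigmoid", cv=3).fit(X, target)
        return models

    def predict_ovr(models, X, classes):
        # Stack each model's P(this class) and take the argmax across classes.
        probs = np.column_stack([models[c].predict_proba(X)[:, 1] for c in classes])
        return np.asarray(classes)[probs.argmax(axis=1)]

Without a step like this, the per-class scores reflect the artificial 50/50 training prior rather than the true class frequencies.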
by llm_trw on 8/19/24, 3:00 PM
Back in 2006 there were already multiple 1 TB collections of textbooks available as torrents. I imagine the size and number have only grown since then.
by buildbot on 8/19/24, 2:17 PM
I have 20-40 TB (pre-dedup) of PDFs - 8 TB is a lot but not even close to the total volume of PDFs available.
by guiomie on 8/19/24, 4:04 PM
Interesting and fun article! I've been experimenting with various LLM/GenAI solutions to extract tabular data from PDFs, with underwhelming results. It seems like they are good at extracting strings of text and summarizing (e.g. what was the total price? when was this printed?), but extracting reliably into a CSV has a decent margin of error.
by josh-sematic on 8/19/24, 11:39 PM
Very cool! At Airtrain we’ve also found embeddings can be very valuable for building classification models. If you’re looking to play around with a large amount of text and embeddings we actually recently deduped and embedded all of fineweb-edu (also mentioned in the article) and put the resulting dataset on Hugging Face:
https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fort...
by ned_at_codomain on 8/19/24, 3:26 PM
This is a really cool idea, thanks for sharing. I don't have that much free time these days, but I was thinking of trying a similar-but-different project not too long ago.
I wanted to make a bit of an open source tool to pull down useful time series data for the social sciences (e.g. time series of social media comments about grocery prices). Seems like LLMs have unlocked all kinds of new research angles that people aren't using yet.
I may steal some of your good ideas if I ever get to work on that side project :)
by mehulashah on 8/19/24, 8:05 PM
Classification is just a start. Wondering if it's worth doing something more -- like turning all of the text into Markdown or HTML? Would anyone find that interesting?
by pxdm on 8/19/24, 3:32 PM
My first thought on seeing the PCA embeddings scatterplot was "I wonder what pdfs are at the centre of those two clusters?" The most typical pdfs on the internet.
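You could probably answer that directly from the post's data. A rough sketch, assuming the embedding matrix and the matching list of URLs are in memory (the two-cluster KMeans is my stand-in for however the post's clusters were actually drawn):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # `embeddings` (n_docs x dim) and `urls` (n_docs) are assumed to already exist.
    coords = PCA(n_components=2).fit_transform(embeddings)  # same 2D view as the scatterplot
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)

    for c, center in enumerate(km.cluster_centers_):
        nearest = np.linalg.norm(coords - center, axis=1).argmin()
        print(f"most 'typical' PDF of cluster {c}: {urls[nearest]}")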
by muratsu on 8/19/24, 7:37 PM
I would have expected the fine-tuned model to perform much better. Would be curious to see the performance with other models.
by Thaxll on 8/19/24, 2:33 PM
First you need a good PDF library :/
by excalibur on 8/19/24, 10:47 PM
> How would you classify all the pdfs in the internet?
Definitely as 'hot dog' or 'not a hot dog'.
by Mindey on 8/20/24, 5:49 AM
Whoever says "the internet" fails to grasp how big the internet really is.
by autokad on 8/19/24, 5:04 PM
Would be interesting to see if they tried LDA (latent Dirichlet allocation) topics.
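For anyone curious, a bare-bones LDA baseline with scikit-learn would look something like this (my own sketch with a placeholder corpus, not the article's pipeline):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["...extracted text of pdf 1...", "...extracted text of pdf 2..."]  # placeholder corpus

    vec = CountVectorizer(max_features=20000, stop_words="english")
    counts = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=20, random_state=0).fit(counts)

    terms = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-10:][::-1]]  # ten highest-weight words
        print(f"topic {k}: {', '.join(top)}")

Unlike the embedding approach, this brings back the usual bag-of-words preprocessing choices, which would make for an interesting comparison.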
by layer8 on 8/19/24, 4:14 PM
I would have expected categories for product brochures and product manuals.
by TuringNYC on 8/19/24, 3:06 PM
I've been playing with https://www.aryn.ai/ for Partitioning. Curious if anyone has tried these tools for better data extraction from PDFs. Any other suggestions?
(I'm a bit disappointed that most of the discussion is about estimating the size of PDFs on the internet; I'd love to hear more about different approaches to extracting better data from the PDFs.)
by niels_bom on 8/20/24, 1:29 AM
Typo: “meats the eye”
by byteknight on 8/19/24, 6:16 PM
This seems like cool work but with a ton of "marketing hype speak" that immediately gets watered down by the first paragraph.
Ordering of statements:
1. (Title) Classifying all of the pdfs on the internet
2. (First Paragraph) Well not all, but all the PDFs in Common Crawl
3. (First Image) Well not all of them, but 500k of them.
I am not knocking the project, but while categorizing 500k PDFs is something we couldn't necessarily do well a few years ago, this is far from "the internet's PDFs".
by afh1 on 8/19/24, 1:28 PM
Interesting read, I did not know about Common Crawl. I feel like RTBF (right to be forgotten) is kind of a lost battle these days with more and more crawlers for AI and whatnot. Once on the internet there is no way back, for better or for worse. That tangent aside, 8TB is really not a lot of data; it's just 8 consumer-grade 1TB hard drives. I find it hard to believe this is "the largest corpus of PDFs online", maybe the largest public one. Not sure how representative it is of "the whole internet".
by ks2048 on 8/19/24, 5:47 PM
> I don’t have 8TB laying around, but we can be a bit more clever.... In particular I cared about a specific column called url. I really care about the urls because they essentially tell us a lot more from a website than what meats the eye.
Am I correct that it is only using the URL of the PDF to do classification? Maybe still useful, but that's quite a different story than "classifying all the pdfs".