from Hacker News

Show HN: One-Click CSV Deduplication (open-source)

by remolacha on 11/6/24, 4:28 PM with 2 comments

I made an app to fuzzy-deduplicate my Google Sheets and CRM records.

- No manual configuration required
- Works out-of-the-box on most data types (e.g. people, companies, product catalogs)

Implementation details:

- Embeds records using an E5 model
- Performs similarity search using DuckDB w/ vector similarity extension
- Does last-mile comparison and merges duplicates using Claude
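
For readers who want a feel for the embed, search, group flow described above, here is a rough stdlib-only sketch. It is not the code from the repo, and every piece is a stand-in chosen so the example runs without downloads: character-trigram counts instead of an E5 embedding, brute-force cosine similarity instead of DuckDB's vector similarity extension, and a fixed threshold instead of a last-mile comparison by Claude. The function names and the 0.5 threshold are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def embed(text, n=3):
    """Toy stand-in for an embedding model: counts of character trigrams."""
    s = f"  {text.lower().strip()} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def candidate_pairs(records, threshold=0.5):
    """Brute-force similarity search: every pair at or above the threshold.

    A vector index would replace this O(n^2) loop on real data.
    """
    vecs = [embed(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if cosine(vecs[i], vecs[j]) >= threshold
    ]


def group_duplicates(n, pairs):
    """Union-find: merge candidate pairs into groups of record indices."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    rows = ["Acme Corp", "ACME Corp", "Globex"]
    print(group_duplicates(len(rows), candidate_pairs(rows)))
```

In a pipeline like the one described, the threshold step would only produce candidates, and each candidate group would then go to the LLM for a final yes/no and merge.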

Demo video: https://youtu.be/7mZ0kdwXBwM

GitHub repo (Apache 2.0 licensed): https://github.com/SnowPilotOrg/dedupe_it

Lmk any feedback on how to make this better!

  • by OliverGilan on 11/6/24, 4:37 PM

    Curious how this scales. Just tried this with the test dataset and it was probably the slickest deduplication experience I’ve had.