by sakisv on 8/6/24, 5:52 PM with 210 comments
by brikym on 8/6/24, 10:29 PM
At the time I wrote it I thought nobody else was doing this, but now I know of at least 3 startups doing the same in NZ. It seems the inflation really stoked a lot of innovation here. The patterns are about what you'd expect. Supermarkets are up to the usual tricks, arbitrarily making pricing as complicated as possible, using 'sawtooth' methods to segment time-poor people from poor people. Often they'll segment brand-loyal vs price-sensitive people: there might be 3 popular brands of chocolate, and every week only one of them will be sold at a fair price.
by RasmusFromDK on 8/7/24, 7:55 AM
One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it).
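One way to get regex-based matching most of the way there is to normalize names into an order-insensitive key before comparing. A minimal sketch — the unit spellings and product names here are invented for illustration, not from the original project:

```python
import re

def normalize_name(name: str) -> str:
    """Collapse retailer-specific twists on a product name into a
    comparable key: lowercase, unify units, ignore word order."""
    name = name.lower()
    name = re.sub(r"[-,/&']", " ", name)  # drop decorative punctuation
    # unify common volume/weight spellings, e.g. "1.5 Ltr" -> "1.5l"
    name = re.sub(r"(\d+(?:\.\d+)?)\s*(?:ltr|litre|liter|l)\b", r"\1l", name)
    name = re.sub(r"(\d+(?:\.\d+)?)\s*(?:grams|gram|gr|g)\b", r"\1g", name)
    return " ".join(sorted(name.split()))  # order-insensitive key

# Two stores' spellings of the same product collapse to one key:
key_a = normalize_name("Coca-Cola 1.5 Ltr Bottle")
key_b = normalize_name("Bottle Coca Cola 1.5L")
```

Ambiguous leftovers still need the manual (or AI-assisted, manually verified) mapping described above.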
I've found that building the scrapers and infrastructure is somewhat the easy part. The hard part is maintaining all of the scrapers and figuring out, when a product disappears from a site, whether my scraper has an error, my scraper is being blocked, the site made a change, the site was randomly down for maintenance when I scraped it, etc.
A fun project, but challenging at times, and annoying problems to fix.
by batata004 on 8/7/24, 4:25 AM
by maerten on 8/7/24, 4:25 PM
> The second kind is nastier.
>
> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products.
I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
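The split can be as small as two functions with a file in between. A rough sketch of the idea — the directory layout and field names are hypothetical:

```python
import json
import pathlib
import time

RAW_DIR = pathlib.Path("raw")  # hypothetical layout: raw/<store>/<timestamp>.json

def save_raw(store: str, payload: str) -> pathlib.Path:
    """Stage 1: persist the response untouched, so a later parser fix
    can be re-applied to the whole history."""
    path = RAW_DIR / store / f"{int(time.time())}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload)
    return path

def parse_raw(path: pathlib.Path) -> list[dict]:
    """Stage 2: a pure function of the saved file; safe to re-run."""
    data = json.loads(path.read_text())
    return [{"name": p["name"], "price": p["price"]} for p in data["products"]]
```

When a silent site change corrupts the parsed output, you only fix and re-run stage 2.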
I have built a similar system and website for the Netherlands, as part of my master's project: https://www.superprijsvergelijker.nl/
Most of the scraping in my project is done by making simple HTTP calls to JSON APIs. For some websites, a Playwright instance is used to get a valid session cookie and circumvent bot protection and captchas. The rest of the crawler/scraper, parsers and APIs are built using Haskell and run on AWS ECS. The website is NextJS.
The main challenge I have been trying to work on, is trying to link products from different supermarkets, so that you can list prices in a single view. See for example: https://www.superprijsvergelijker.nl/supermarkt-aanbieding/6...
It works for the most part, as long as at least one correct barcode number is provided for a product.
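With barcodes as the join key, the linking step itself stays simple. A sketch, with invented store and field names:

```python
from collections import defaultdict

def link_by_barcode(catalogs: dict[str, list[dict]]) -> dict[str, dict[str, float]]:
    """Group each store's listings under the shared EAN barcode,
    yielding one price-comparison row per product."""
    linked: dict[str, dict[str, float]] = defaultdict(dict)
    for store, products in catalogs.items():
        for p in products:
            if p.get("ean"):  # only link when a barcode is present
                linked[p["ean"]][store] = p["price"]
    return dict(linked)
```

Products without a (correct) barcode fall through and need another matching strategy.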
by pcblues on 8/7/24, 7:26 AM
by seanwilson on 8/7/24, 12:11 AM
You could probably add some automated checks to not sync changes to prices/products if a sanity check fails e.g. each price shouldn't change by more than 100%, and the number of active products shouldn't change by more than 20%.
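A sanity gate like that can sit between scrape and publish. A sketch of those two checks, using the thresholds suggested above:

```python
def safe_to_sync(old: dict[str, float], new: dict[str, float],
                 max_price_jump: float = 1.0,
                 max_count_drift: float = 0.2) -> bool:
    """Refuse to publish when the scrape looks broken, rather than the
    shop having genuinely changed. Maps are SKU -> price."""
    if old:
        drift = abs(len(new) - len(old)) / len(old)
        if drift > max_count_drift:  # e.g. half the catalogue vanished
            return False
    for sku, price in new.items():
        prev = old.get(sku)
        if prev and abs(price - prev) / prev > max_price_jump:
            return False  # a price moved by more than 100%
    return True
```

A failed gate would flag the run for manual review instead of silently overwriting good data.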
by langsoul-com on 8/6/24, 11:55 PM
You'll need to constantly rotate (highly rated) residential proxies and make sure not to exhibit data-scraping patterns. Some supermarkets don't show the network requests in the network tab, so you cannot just grab that API response.
Even then, MITM-ing the mobile app (to see the network requests and data) will also get blocked without decent cover-ups.
I tried but realised it isn't worth it due to the costs and constant dev work required. In fact, some of the supermarket price comparison services just have (cheap-labour) people scrape it by hand.
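The rotation itself is the easy half. A sketch of picking the next proxy plus jittered timing — the pool URLs are placeholders, and real setups also vary headers, sessions, and request order:

```python
import itertools
import random

# Hypothetical pool of residential proxy endpoints.
PROXY_POOL = [f"http://proxy{i}.example.net:8000" for i in range(5)]

_rotation = itertools.cycle(PROXY_POOL)

def next_request_plan() -> tuple[str, float]:
    """Pick the next proxy and a randomized delay, so request timing
    doesn't form an obvious scraping pattern."""
    proxy = next(_rotation)
    delay = random.uniform(2.0, 12.0)  # jittered pause before the request
    return proxy, delay
```

The ongoing cost is exactly the point the comment makes: the proxies and the cat-and-mouse maintenance, not this code.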
by xyst on 8/6/24, 9:16 PM
For example, compare the price of oat milk at different zip codes and grocery stores. Additionally track “shrinkflation” (same price but smaller portion).
On that note, it seems you are tracking price but are you also checking the cost per gram (or ounce)? Manufacturer or store could keep price the same but offer less to the consumer. Wonder if your tool would catch this.
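Unit price is the check that exposes it. A sketch with illustrative numbers:

```python
def unit_price(price: float, grams: float) -> float:
    """Price per 100 g, so shrinkflation shows up even when the
    shelf price is unchanged."""
    return round(price / grams * 100, 4)

# Same $4.00 shelf price, package quietly shrunk from 500 g to 450 g:
before = unit_price(4.00, 500)  # 0.80 per 100 g
after = unit_price(4.00, 450)   # ~0.89 per 100 g, roughly 11% pricier
```

Tracking this per SKU requires scraping the package size as well as the price, which not every product page exposes cleanly.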
by grafraf on 8/7/24, 7:40 AM
I'm also one of the founders and the current CTO, so there has been a lot of scraping and maintaining over the years. We are scraping over 30 million prices daily.
by odysseus on 8/7/24, 2:07 AM
In Europe, that would probably be Aldi/Lidl.
In the U.S., maybe Costco/Trader Joe's.
For online, CamelCamelCamel/Amazon. (for health/beauty/some electronics but not food)
If you can buy direct from the manufacturer, sometimes that's even better. For example, I got a particular brand of soap I love at the soap's wholesaler site in bulk for less than half the retail price. For shampoo, buying the gallon size direct was way cheaper than buying from any retailer.
by andrewla on 8/6/24, 8:01 PM
The flip side to this is that very often you find that the data populating the site is in a very simple JSON format to facilitate easy rendering, ironically making the scraping process a lot more reliable.
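When a site does ship such a render feed, scraping reduces to one HTTP call and a couple of dict lookups. A sketch with an invented payload shape — every site names its fields differently:

```python
import json

# Hypothetical shape of the JSON a product-listing page fetches
# for client-side rendering.
sample = json.dumps({
    "items": [
        {"sku": "123", "title": "Oat Milk 1L", "price_cents": 349},
        {"sku": "456", "title": "Butter 500g", "price_cents": 629},
    ]
})

def extract(payload: str) -> list[tuple[str, float]]:
    """Read the render feed directly; far more stable than parsing
    the HTML it gets turned into."""
    data = json.loads(payload)
    return [(i["title"], i["price_cents"] / 100) for i in data["items"]]
```

The trade-off is that these endpoints are undocumented and can change shape without notice — the "nastier" failure mode described elsewhere in the thread.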
by ikesau on 8/6/24, 6:37 PM
I really wish supermarkets were mandated to post this information whenever the price of a particular SKU updated.
The tools that could be built with such information would do amazing things for consumers.
by xnx on 8/6/24, 6:59 PM
by gadders on 8/7/24, 2:59 PM
by ptrik on 8/7/24, 1:45 PM
What's the thought process behind using a CI server - which I thought is mainly for builds - for what essentially is a data pipeline?
by jfil on 8/8/24, 3:37 PM
One difference: I'm recording each scraping session as a HAR file (for proving provenance). mitmproxy (mitmdump) is invaluable for that.
by nosecreek on 8/6/24, 7:05 PM
by PigiVinci83 on 8/7/24, 2:11 PM
by lotsofpulp on 8/6/24, 6:36 PM
For example, I recently purchased fruit and dairy at Safeway in the western US, and after I had everything I wanted, I searched each item in the Safeway app, and it had coupons I could apply for $1.5 to $5 off per item. The other week, my wife ran into the store to buy cream cheese. While she did that, I searched the item in the app, and “clipped” a $2.30 discount, so what would have been $5.30 to someone that didn’t use the app was $3.
I am looking at the receipt now, and it is showing I would have spent $70 total if I did not apply the app discounts, but with the app discounts, I spent $53.
These price obfuscation tactics are seen in many businesses, making price tracking very difficult.
by hnrodey on 8/7/24, 2:39 PM
Something I learned recently, which might help your scrapers, is the ability in Playwright to sniff the network calls made through the browser (basically, programmatic API to the Network tab of the browser).
The boost is that you allow the website/webapp to make the API calls and then the scraper focuses on the data (rather than allowing the page to render DOM updates).
This approach falls apart if the page is doing server side rendering as there are no API calls to sniff.
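In Playwright for Python the pattern looks roughly like this. The URL filter and target site are hypothetical, and the browser part is kept behind a function since it needs an installed browser:

```python
from urllib.parse import urlparse

def is_product_api(url: str) -> bool:
    """Decide which network responses the scraper should capture."""
    path = urlparse(url).path
    return path.startswith("/api/") and "product" in path

def run() -> list:
    # Not executed on import; requires `pip install playwright`
    # and a downloaded browser.
    from playwright.sync_api import sync_playwright
    captured = []
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        # attach the listener before navigating so early calls aren't missed
        page.on("response",
                lambda r: captured.append(r.json()) if is_product_api(r.url) else None)
        page.goto("https://shop.example.com/aisle/dairy")
        page.wait_for_load_state("networkidle")
        browser.close()
    return captured
```

As the comment notes, this only helps when the page actually makes API calls; a fully server-rendered page leaves nothing to sniff.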
by mishu2 on 8/7/24, 12:37 PM
For my project (https://frankendash.com/), I also ran into issues with dynamically generated class names which change on every site update, so in the end I just went with saving a crop area from the website as an image and showing that.
by kinderjaje on 8/11/24, 1:07 PM
We built a system for admins so they can match products from Site A with products from Site B.
The scraping part was not that hard. We used our product https://automatio.co/ where possible, and where we couldn't, we built some scrapers from scratch using simple cURL or Puppeteer.
Thanks for sharing your experience, especially since I haven't used Playwright before.
by moohaad on 8/6/24, 10:10 PM
by Stubbs on 8/7/24, 10:40 AM
You just reminded me, it's probably still running today :-D
by ptrik on 8/7/24, 1:43 PM
Good lesson on cloud economics. Below a certain threshold you get a linear performance gain with a more expensive instance type. It is essentially the same amount of spending, but you save wall-clock time by running the same workload on a more expensive machine for a shorter period.
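The arithmetic, in illustrative numbers (not real instance prices):

```python
# With linear scaling, hourly rate and runtime cancel out:
# 4x the price at 4x the speed costs the same, finishes sooner.
small_rate_cents, small_hours = 10, 8   # cheap instance, 8 h job
big_rate_cents = 4 * small_rate_cents   # 4x the hourly price...
big_hours = small_hours / 4             # ...a quarter of the runtime

small_cost = small_rate_cents * small_hours  # 80 cents
big_cost = big_rate_cents * big_hours        # also 80 cents
```

Past the threshold where scaling stops being linear, the bigger instance starts costing more per unit of work.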
by scarredwaits on 8/7/24, 12:53 PM
by joelthelion on 8/7/24, 10:19 AM
by haolez on 8/6/24, 7:35 PM
by NKosmatos on 8/7/24, 12:54 AM
Nice blog post and very informative. Good to read that it costs you less than 70€ per year to run this and hope that the big supermarkets don’t block this somehow.
Have you thought of monetizing this? Perhaps with ads from the 3 big supermarkets you scrape ;-)
by antman on 8/6/24, 10:16 PM
by 6510 on 8/7/24, 1:01 PM
Someone showed me something like this a decade ago. The site had many obvious issues but it did list everything. If I remember correctly it was started to stop merchants pricing things based on who is buying.
I forget which country it was.
by cynicalsecurity on 8/7/24, 10:09 AM
Yep, AWS is hugely overrated and overpriced.
by jonatron on 8/7/24, 6:58 AM
by hk1337 on 8/6/24, 9:49 PM
by janandonly on 8/8/24, 1:52 PM
by ptrik on 8/7/24, 1:44 PM
Wonder how the data from R2 is fed into the frontend?
by Closi on 8/7/24, 11:30 AM
Although it might be hard to do with messy data.
by SebFender on 8/7/24, 11:34 AM
by Alifatisk on 8/7/24, 10:24 AM
How would one scrape those? Anyone experienced?
by throwaway346434 on 8/7/24, 2:16 PM
by Scrapemist on 8/7/24, 4:17 AM
by ptrik on 8/7/24, 1:46 PM
Why did you pick Tailscale as the solution for proxy vs scraping with something like AWS Lambda?
by mt_ on 8/7/24, 9:04 AM
by raybb on 8/7/24, 12:12 PM