from Hacker News

Tracking supermarket prices with Playwright

by sakisv on 8/6/24, 5:52 PM with 210 comments

  • by brikym on 8/6/24, 10:29 PM

    I have been doing something similar for New Zealand since the start of the year with Playwright/TypeScript, dumping parquet files to cloud storage. I've just been collecting the data; I have not yet displayed it. Most of the work is getting around reverse proxy services like Akamai and Cloudflare.

    At the time I wrote it I thought nobody else was doing it, but now I know of at least 3 startups doing the same in NZ. It seems inflation really stoked a lot of innovation here. The patterns are about what you'd expect. Supermarkets are up to the usual tricks of arbitrarily making pricing as complicated as possible, using 'sawtooth' methods to segment time-poor people from poor people. Often they'll segment brand-loyal vs price-sensitive people: there might be 3 popular brands of chocolate, and every week only one of them will be sold at a fair price.

  • by RasmusFromDK on 8/7/24, 7:55 AM

    Nice writeup. I've run into similar problems with my contact lens price comparison website https://lenspricer.com/ which I run in ~30 countries. I have found, like you, that websites changing their HTML is a pain.

    One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it).
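
    A hypothetical sketch of that kind of name normalization in TypeScript; the regexes, the sample product names, and the idea of collapsing names to a shared key are illustrative, not the commenter's actual rules:

      // Hypothetical sketch: normalize scraped product names before matching.
      // The regexes and example names are illustrative assumptions.
      function normalizeProductName(raw: string): string {
        return raw
          .toLowerCase()
          // unify common pack-size notations, e.g. "2 x 30" -> "2x30"
          .replace(/(\d+)\s*[x×]\s*(\d+)/g, "$1x$2")
          // decimal comma to decimal point, e.g. "0,5" -> "0.5"
          .replace(/(\d+),(\d+)/g, "$1.$2")
          // glue quantities to their units, e.g. "500 ml" -> "500ml"
          .replace(/(\d+(?:\.\d+)?)\s*(ml|l|g|kg)\b/g, "$1$2")
          // drop punctuation/symbols and collapse whitespace
          .replace(/[^\p{L}\p{N}\s.]/gu, " ")
          .replace(/\s+/g, " ")
          .trim();
      }

      // Both of these print the same key: "acuvue oasys 2x30"
      console.log(normalizeProductName("Acuvue Oasys 2 X 30"));
      console.log(normalizeProductName("ACUVUE® OASYS 2x30"));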

    I've found that building the scrapers and infrastructure is somewhat the easy part. The hard part is maintaining all of the scrapers and figuring out, when a product disappears from a site, whether my scraper has an error, my scraper is being blocked, the site made a change, the site was randomly down for maintenance when I scraped it, etc.

    A fun project, but challenging at times, with annoying problems to fix.

  • by batata004 on 8/7/24, 4:25 AM

    I created a similar website which got lots of interest in my city. I scrape both app and website data using a single Linode server with 2GB of RAM, 5 IPv4 addresses and 1000 IPv6 addresses (which are free), and every single product is scraped at an interval of at most 40 minutes, never more than that, with an average of 25 minutes. I use curl-impersonate and scrape JSON as much as possible, because 90% of markets serve prices from Ajax calls; for the other 10% I use regex to easily parse the HTML. You can check it at https://www.economizafloripa.com.br
  • by maerten on 8/7/24, 4:25 PM

    Nice article!

    > The second kind is nastier.
    >
    > They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products.

    I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
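
    A minimal TypeScript sketch of that split, where the scraper only persists raw pages and a separate parser can be replayed over the archive after a fix; the directory layout and the parseProduct signature are assumptions:

      // The scraper only persists raw responses; a separate parser is re-run
      // over the archive whenever it gets fixed. Paths are illustrative.
      import { mkdir, writeFile, readFile, readdir } from "node:fs/promises";
      import { join } from "node:path";

      async function saveRawPage(url: string, html: string): Promise<string> {
        const dir = join("raw", new Date().toISOString().slice(0, 10));
        await mkdir(dir, { recursive: true });
        const file = join(dir, encodeURIComponent(url) + ".html");
        await writeFile(file, html, "utf8");
        return file;
      }

      // Later (or after a parser fix), replay every stored page through the parser.
      async function reparseAll(dir: string, parseProduct: (html: string) => unknown) {
        for (const name of await readdir(dir)) {
          const html = await readFile(join(dir, name), "utf8");
          const product = parseProduct(html); // pure function: raw HTML in, record out
          console.log(name, product);
        }
      }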

    I have built a similar system and website for the Netherlands, as part of my master's project: https://www.superprijsvergelijker.nl/

    Most of the scraping in my project is done by making simple HTTP calls to JSON APIs. For some websites, a Playwright instance is used to get a valid session cookie and circumvent bot protection and captchas. The rest of the crawler/scraper, parsers and APIs are built using Haskell and run on AWS ECS. The website is NextJS.
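
    A small sketch of that pattern (browser only for the session, plain HTTP for the data), written in TypeScript rather than the commenter's Haskell; the URLs are placeholders and the cookie handling is generic:

      // Use a real browser once to pass the bot check and collect cookies,
      // then reuse the session for cheap JSON calls without a browser.
      import { chromium } from "playwright";

      async function fetchWithBrowserSession(listUrl: string, apiUrl: string) {
        const browser = await chromium.launch();
        const context = await browser.newContext();
        const page = await context.newPage();

        await page.goto(listUrl, { waitUntil: "networkidle" });
        const cookies = await context.cookies();
        await browser.close();

        const cookieHeader = cookies.map(c => `${c.name}=${c.value}`).join("; ");
        const res = await fetch(apiUrl, { headers: { cookie: cookieHeader } });
        return res.json();
      }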

    The main challenge I have been working on is linking products from different supermarkets, so that you can list prices in a single view. See for example: https://www.superprijsvergelijker.nl/supermarkt-aanbieding/6...

    It works for the most part, as long as at least one correct barcode number is provided for a product.

  • by pcblues on 8/7/24, 7:26 AM

    This is interesting because I believe the two major supermarkets in Australia could create a duopoly in anti-competitive pricing just by each employing price-analysis AI algorithms; the algorithms will likely end up cooperating to maximise profit. This can probably be done legally through publicly obtained prices, and illegally by sharing supply-cost or per-product profit data. The result is likely to be similar. Two trained AIs will maximise profit in weird ways using (super)multidimensional regression analysis (which is all AI is), and the consumer will pay for the maximised profits of ostensible competitors. If the pricing data can be obtained like this, not much more is needed to implement a duopoly-focused pair of machine learning implementations.
  • by seanwilson on 8/7/24, 12:11 AM

    > They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products. However the way they write the prices has changed and now a bag of chips doesn't cost €1.99 but €199. To catch these changes I rely on my transformation step being as strict as possible with its inputs.

    You could probably add some automated checks to not sync changes to prices/products if a sanity check fails e.g. each price shouldn't change by more than 100%, and the number of active products shouldn't change by more than 20%.
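
    A minimal sketch of such a sanity gate; the 100% and 20% thresholds come from the suggestion above, while the types and field names are made up:

      // Reject a scrape before publishing it if it looks implausible.
      interface Snapshot {
        prices: Map<string, number>; // product id -> price in euros
      }

      function passesSanityCheck(prev: Snapshot, next: Snapshot): boolean {
        // Fail if the active-product count moved by more than 20%...
        const countChange = Math.abs(next.prices.size - prev.prices.size) / prev.prices.size;
        if (countChange > 0.2) return false;

        // ...or if any individual price moved by more than 100%.
        for (const [id, newPrice] of next.prices) {
          const oldPrice = prev.prices.get(id);
          if (oldPrice !== undefined && Math.abs(newPrice - oldPrice) / oldPrice > 1.0) {
            return false;
          }
        }
        return true; // safe to publish this scrape
      }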

  • by langsoul-com on 8/6/24, 11:55 PM

    The hard thing is not scraping, but getting around the increasingly sophisticated blockers.

    You'll need to constantly rotate (highly rated) residential proxies and make sure not to exhibit data-scraping patterns. Some supermarkets don't show the network requests in the network tab, so you cannot just grab that API response.
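
    For what it's worth, Playwright accepts a proxy at browser launch, so per-run rotation can look roughly like the sketch below; the proxy hosts and credentials are placeholders, and real residential providers usually expose a single gateway plus session parameters:

      // Rotate through a proxy list, launching each scrape run behind a
      // different (placeholder) proxy. Round-robin is the simplest scheme.
      import { chromium } from "playwright";

      const proxies = [
        { server: "http://proxy-1.example.com:8000", username: "user", password: "pass" },
        { server: "http://proxy-2.example.com:8000", username: "user", password: "pass" },
      ];

      async function scrapeThrough(url: string, runIndex: number) {
        const proxy = proxies[runIndex % proxies.length];
        const browser = await chromium.launch({ proxy });
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: "domcontentloaded" });
        const html = await page.content();
        await browser.close();
        return html;
      }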

    Even then, MITM-ing the mobile app (to see the network requests and data) will also get blocked without decent cover-ups.

    I tried but realised it isn't worth it due to the costs and constant dev work required. In fact, some of the supermarket price comparison services just have (cheap labour) people scrape it.

  • by xyst on 8/6/24, 9:16 PM

    Would be nice to have price transparency for goods. It would make processes like this much easier to track by store and region.

    For example, compare the price of oat milk at different zip codes and grocery stores. Additionally track “shrinkflation” (same price but smaller portion).

    On that note, it seems you are tracking price, but are you also checking the cost per gram (or ounce)? The manufacturer or store could keep the price the same but offer less to the consumer. I wonder if your tool would catch this.
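
    A tiny illustration of tracking unit price alongside shelf price so that shrinkflation shows up even when the price doesn't move; the schema and threshold are assumptions, not the author's:

      // Store quantity with each observation so unit price can be derived.
      interface Observation {
        date: string;
        price: number;     // shelf price in euros
        quantity: number;  // pack size
        unit: "g" | "ml";
      }

      const unitPrice = (o: Observation) => o.price / o.quantity; // e.g. EUR per gram

      function detectShrinkflation(prev: Observation, curr: Observation): boolean {
        // Price unchanged (or nearly so) but the pack got smaller.
        return curr.price >= prev.price * 0.99 && curr.quantity < prev.quantity;
      }

      // Example: 1.99 for 200g -> 1.99 for 180g is flagged even though the
      // shelf price never moved.
      console.log(detectShrinkflation(
        { date: "2024-07-01", price: 1.99, quantity: 200, unit: "g" },
        { date: "2024-08-01", price: 1.99, quantity: 180, unit: "g" },
      )); // true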

  • by grafraf on 8/7/24, 7:40 AM

    We have been doing it for the Swedish market for more than 8 years. We have a website, https://www.matspar.se/, where the customer can browse all the products of all major online stores, compare the prices and add the products they want to buy to the cart. At the end of the journey, the customer can compare the total price of that cart (including shipping fees) and export the cart to the store they want to order from.

    I'm also one of the founders and the current CTO, so there has been a lot of scraping and maintenance over the years. We are scraping over 30 million prices daily.

  • by odysseus on 8/7/24, 2:07 AM

    I used to price track when I moved to a new area, but now I find it way easier to just shop at 2 markets or big box stores that consistently have low prices.

    In Europe, that would probably be Aldi/Lidl.

    In the U.S., maybe Costco/Trader Joe's.

    For online, CamelCamelCamel/Amazon (for health/beauty/some electronics, but not food).

    If you can buy direct from the manufacturer, sometimes that's even better. For example, I got a particular brand of soap I love at the soap's wholesaler site in bulk for less than half the retail price. For shampoo, buying the gallon size direct was way cheaper than buying from any retailer.

  • by andrewla on 8/6/24, 8:01 PM

    One problem that the author notes is that so much rendering is done client side via javascript.

    The flip side to this is that very often you find that the data populating the site is in a very simple JSON format to facilitate easy rendering, ironically making the scraping process a lot more reliable.

  • by ikesau on 8/6/24, 6:37 PM

    Ah, I love this. Nice work!

    I really wish supermarkets were mandated to post this information whenever the price of a particular SKU is updated.

    The tools that could be built with such information would do amazing things for consumers.

  • by xnx on 8/6/24, 6:59 PM

    Scraping tools have become more powerful than ever, but bot restrictions have become equally strict. It's hard to scrape reliably under any circumstances, let alone consistently without residential proxies.
  • by gadders on 8/7/24, 2:59 PM

    This reminds me a bit of a meme that said something along the lines of "I don't want AI to draw my art, I want AI to review my weekly grocery shop, work out which combinations of shops save me money, and then schedule the deliveries for me."
  • by ptrik on 8/7/24, 1:45 PM

    > My CI of choice is [Concourse](https://concourse-ci.org/) which describes itself as "a continuous thing-doer". While it has a bit of a learning curve, I appreciate its declarative model for the pipelines and how it versions every single input to ensure reproducible builds as much as it can.

    What's the thought process behind using a CI server - which I thought is mainly for builds - for what essentially is a data pipeline?

  • by jfil on 8/8/24, 3:37 PM

    I'm building something similar for 7 grocery vendors in Canada and am looking to talk with others who are doing this; my email is in my profile.

    One difference: I'm recording each scraping session as a HAR file (for proving provenance). mitmproxy (mitmdump) is invaluable for that.
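
    As an aside, Playwright itself can record a HAR per browser context, which may be a lighter-weight alternative to proxying through mitmdump; a minimal sketch with placeholder URL and output path:

      // Record the whole scraping session as a HAR via Playwright's
      // recordHar option; the file is written when the context closes.
      import { chromium } from "playwright";

      async function scrapeWithHar(url: string) {
        const browser = await chromium.launch();
        const context = await browser.newContext({
          recordHar: { path: "session.har" },
        });
        const page = await context.newPage();
        await page.goto(url);
        // ... scrape as usual ...
        await context.close(); // flushes the HAR, preserving the session for provenance
        await browser.close();
      }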

  • by nosecreek on 8/6/24, 7:05 PM

    Very cool! I did something similar in Canada (https://grocerytracker.ca/)
  • by PigiVinci83 on 8/7/24, 2:11 PM

    Nice article, enjoyed reading it. I’m Pier, co-founder of https://Databoutique.com, which is a marketplace for web-scraped data. If you’re willing to monetize your data extractions, you can list them on our website. We just started with the grocery industry and it would be great to have you on board.
  • by lotsofpulp on 8/6/24, 6:36 PM

    In the US, retail businesses are offering individualized and general coupons via their phone apps. I wonder if this pricing can be tracked, as it results in significant differences.

    For example, I recently purchased fruit and dairy at Safeway in the western US, and after I had everything I wanted, I searched each item in the Safeway app, and it had coupons I could apply for $1.5 to $5 off per item. The other week, my wife ran into the store to buy cream cheese. While she did that, I searched the item in the app, and “clipped” a $2.30 discount, so what would have been $5.30 to someone that didn’t use the app was $3.

    I am looking at the receipt now, and it is showing I would have spent $70 total if I did not apply the app discounts, but with the app discounts, I spent $53.

    These price obfuscation tactics are seen in many businesses, making price tracking very difficult.

  • by hnrodey on 8/7/24, 2:39 PM

    Nice job getting through all this. I kind of enjoy writing scrapers and browser automation in general. Browser automation is quite powerful and under-explored/under-utilized by the average developer.

    Something I learned recently, which might help your scrapers, is the ability in Playwright to sniff the network calls made through the browser (basically, programmatic API to the Network tab of the browser).

    The boost is that you let the website/webapp make the API calls and then the scraper focuses on the data (rather than waiting for the page to render DOM updates).

    This approach falls apart if the page is doing server side rendering as there are no API calls to sniff.
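
    A minimal sketch of that response-sniffing approach; the /api/products URL filter is a made-up example:

      // Let the page fire its own XHR/fetch calls and capture the JSON bodies
      // instead of reading the rendered DOM.
      import { chromium } from "playwright";

      async function sniffPrices(listingUrl: string) {
        const browser = await chromium.launch();
        const page = await browser.newPage();

        const captured: unknown[] = [];
        page.on("response", async (response) => {
          // Keep only the API calls we care about, e.g. anything under /api/products.
          if (response.url().includes("/api/products") && response.ok()) {
            captured.push(await response.json());
          }
        });

        await page.goto(listingUrl, { waitUntil: "networkidle" });
        await browser.close();
        return captured;
      }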

  • by mishu2 on 8/7/24, 12:37 PM

    Playwright is basically necessary for scraping nowadays, as the browser needs to do a lot of work before the web page becomes useful/readable. I remember scraping with HTTrack back in high school and most of the sites kept working...

    For my project (https://frankendash.com/), I also ran into issues with dynamically generated class names which change on every site update, so in the end I just went with saving a crop area from the website as an image and showing that.
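
    A short sketch of that screenshot fallback in Playwright; the clip coordinates and the #prices selector are placeholders:

      // Sidestep auto-generated class names by saving an image crop instead
      // of parsing the DOM.
      import { chromium } from "playwright";

      async function captureWidget(url: string) {
        const browser = await chromium.launch();
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: "networkidle" });

        // Either clip a fixed region of the viewport...
        await page.screenshot({ path: "widget.png", clip: { x: 0, y: 200, width: 600, height: 300 } });

        // ...or, if one stable selector does exist, screenshot just that element.
        await page.locator("#prices").screenshot({ path: "prices.png" });

        await browser.close();
      }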

  • by kinderjaje on 8/11/24, 1:07 PM

    A few years ago, we had a client and built a price-monitoring app for women's beauty products. They had multiple marketplaces, and like someone mentioned before, it was tricky because many products come in different sizes and EANs, and you need to be able to match them.

    We built a system for admins so they can match products from Site A with products from Site B.

    The scraping part was not that hard. We used our product https://automatio.co/ where possible, and where we couldn't, we built some scrapers from scratch using simple cURL or Puppeteer.

    Thanks for sharing your experience, especially since I hadn't used Playwright before.

  • by moohaad on 8/6/24, 10:10 PM

    Cloudflare Workers has a Browser Rendering API.
  • by Stubbs on 8/7/24, 10:40 AM

    I did something very similar, but for the price of wood from sellers here in the UK, and instead of Playwright, which I'd never heard of at the time, I used NodeRED.

    You just reminded me, it's probably still running today :-D

  • by ptrik on 8/7/24, 1:43 PM

    > I went from 4vCPUs and 16GB of RAM to 8vCPUs and 16GB of RAM, which reduced the duration by about ~20%, making it comparable to the performance I get on my MBP. Also, because I'm only using the scraping server for ~2h the difference in price is negligible.

    Good lesson on cloud economics. Below a certain threshold you get a linear performance gain from a more expensive instance type. It is essentially the same amount of spending, but you save time by running the same workload on a more expensive machine for a shorter period of time.

  • by scarredwaits on 8/7/24, 12:53 PM

    Great article and congrats on making this! It would be great to have a chat if you like, because I’ve built Zuper, also for Greek supermarkets, which has similar goals (and problems!)
  • by joelthelion on 8/7/24, 10:19 AM

    We should mutualize scraping efforts, creating a sort of Wikipedia of scraped data. I bet a ton of people and cool applications would benefit from it.
  • by haolez on 8/6/24, 7:35 PM

    I heard that some e-commerce sites will not block scrapers, but instead poison the data shown to them (e.g. subtly wrong prices). Does anyone know more about this?
  • by NKosmatos on 8/7/24, 12:54 AM

    Hey, thanks for creating https://pricewatcher.gr/en/, it is very much appreciated.

    Nice blog post and very informative. Good to read that it costs you less than 70€ per year to run this; I hope the big supermarkets don’t block it somehow.

    Have you thought of monetizing this? Perhaps with ads from the 3 big supermarkets you scrape ;-)

  • by antman on 8/6/24, 10:16 PM

    Looks great. Perhaps comparisons over more than 30 days would be interesting, or customizable ranges; it should be fast enough with a DuckDB backend.
  • by 6510 on 8/7/24, 1:01 PM

    Can someone name the South American country that has a government price comparison website? Listing all products was required by law.

    Someone showed me this a decade ago. The site had many obvious issues, but it did list everything. If I remember correctly, it was started to stop merchants from pricing things based on who is buying.

    I forget which country it was.

  • by cynicalsecurity on 8/7/24, 10:09 AM

    > My first thought was to use AWS, since that's what I'm most familiar with, but looking at the prices for a moderately-powerful EC2 instance (i.e. 4 cores and 8GB of RAM) it was going to cost much more than I was comfortable to spend for a side project.

    Yep, AWS is hugely overrated and overpriced.

  • by jonatron on 8/7/24, 6:58 AM

    If you were thinking of making a UK supermarket price comparison site, IIRC there's a company that owns all the product photos; read more at https://news.ycombinator.com/item?id=31900312
  • by hk1337 on 8/6/24, 9:49 PM

    I would be curious whether there is a price difference between what is listed online and what is physically in the store.
  • by janandonly on 8/8/24, 1:52 PM

    I live in the Netherlands, where we are blessed with a price comparison website (https://tweakers.net/pricewatch/) for gadgets.
  • by ptrik on 8/7/24, 1:44 PM

    > The data from the scraping are saved in Cloudflare's R2 where they have a pretty generous 10GB free tier which I have not hit yet, so that's another €0.00 there.

    I wonder how the data from R2 is fed into the frontend?

  • by Closi on 8/7/24, 11:30 AM

    This is great! It would be nice if the website gave a summary of which shop is actually cheapest (e.g. based on a basket of comparable goods that all retailers stock).

    Although might be hard to do with messy data.

  • by SebFender on 8/7/24, 11:34 AM

    I've worked with similar solutions for decades (for a completely different need) and in the end web changes made the solution unscalable. A fun idea to play with, but with too many error scenarios.
  • by Alifatisk on 8/7/24, 10:24 AM

    Some stores don’t have an interactive website but instead send flyers to your email with the week's deals.

    How would one scrape those? Anyone experienced?

  • by throwaway346434 on 8/7/24, 2:16 PM

  • by Scrapemist on 8/7/24, 4:17 AM

    What if you add all products to your shopping cart, save it as “favourites”, and scrape that every other day?
  • by ptrik on 8/7/24, 1:46 PM

    > While the supermarket that I was using to test things every step of the way worked fine, one of them didn't. The reason? It was behind Akamai and they had enabled a firewall rule which was blocking requests originating from non-residential IP addresses.

    Why did you pick Tailscale as the solution for proxy vs scraping with something like AWS Lambda?

  • by mt_ on 8/7/24, 9:04 AM

    What about networking costs? Is it free in Hetzner?
  • by raybb on 8/7/24, 12:12 PM

    Anyone know of one of these for Spain?