from Hacker News

Ask HN: Web scraping easy-medium-hard challenges?

by isoos on 11/17/20, 1:58 PM with 4 comments

I'd like to help a friend to learn more about web scraping (and also with test automation, but that is less fun). Are you aware of any tutorial, competition or anything in-between which has tasks with varying difficulties?

E.g. easy: iterate over the IDs of the articles and call curl on each one. Difficult: you need Puppeteer with multiple JS tricks to get through the first few pages, and the end is far away...
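For reference, the "easy" tier described above might look roughly like the sketch below, with Python's standard library standing in for curl. The URL pattern and the names `URL_TEMPLATE`, `fetch` and `scrape_by_id` are assumptions for illustration only; a real site will have its own URL scheme, rate limits and terms of use.

```python
import urllib.request
from typing import Callable, Dict, Iterable

# Hypothetical URL pattern (an assumption); substitute the target site's scheme.
URL_TEMPLATE = "https://example.com/articles/{id}"


def fetch(url: str, timeout: float = 10.0) -> str:
    """Download one page and return its body as text, much like `curl URL`."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def scrape_by_id(
    ids: Iterable[int],
    fetcher: Callable[[str], str] = fetch,
    template: str = URL_TEMPLATE,
) -> Dict[int, str]:
    """Fetch every article ID in turn and return {id: page body}."""
    pages: Dict[int, str] = {}
    for article_id in ids:
        url = template.format(id=article_id)
        try:
            pages[article_id] = fetcher(url)
        except OSError:
            # urllib's URLError and HTTPError are OSError subclasses,
            # so missing or unreachable IDs are simply skipped.
            continue
    return pages
```

Passing the fetcher in as a parameter keeps the loop testable offline, e.g. `scrape_by_id(range(1, 101))` against the live site versus a stub function in tests.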

  • by tdeck on 11/19/20, 8:08 AM

    Surprised nobody has mentioned ASP websites yet - definitely among the hardest. Those sites carry so much state in cookies rather than URLs, so you have to follow all the UI interactions in order to get to the result you're trying to parse. The markup is also typically really bloated and filled with randomly-generated IDs.
  • by ev1 on 11/17/20, 8:47 PM

    Easy: find a random WordPress blog. Crawl by category, author, or page.

    Medium: Scrape Yelp.

    Hard: Scrape Yelp and exclude all the randomly generated garbage data (false phone numbers, incorrect hours) they start feeding you, instead of blocking you, once they detect you're a bot.

    Hard, expensive: Purchase a pair of limited edition sneakers requiring 3D Secure and 2FA.

  • by quickthrower2 on 11/17/20, 9:33 PM

    An easy challenge that is also very fruitful is “scraping” RSS feeds. A lot of good information is provided by RSS and the challenge could be to aggregate and filter some RSS feeds then generate a new one.
  • by rdtwo on 11/18/20, 5:23 PM

    Medium: scrape Craigslist, create a database with your results, and graph the prices.

    Then link up reposts to track price history.

    Use image recognition to find reused images.

    Medium-hard: use web scraping to buy a PS5.
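
The repost-linking step in rdtwo's Craigslist suggestion could be sketched as below. This is one possible approach under stated assumptions, not a known-good method: it treats two listings as reposts when their titles contain the same set of words, which is a crude heuristic (comparing image hashes, as the comment suggests, would likely be more robust). The `Listing` fields and the function names are hypothetical, and the scraping itself is assumed to have already produced the records.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Listing:
    """One scraped posting (field names are illustrative assumptions)."""
    post_id: str
    title: str
    price: int
    posted: str  # ISO date (YYYY-MM-DD), so string order is chronological


def fingerprint(title: str) -> str:
    """Normalize a title: lowercase, drop punctuation, ignore word order."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(sorted(set(words)))


def price_histories(listings: List[Listing]) -> Dict[str, List[Tuple[str, int]]]:
    """Group likely reposts and return (date, price) pairs per group."""
    groups: Dict[str, List[Listing]] = defaultdict(list)
    for item in listings:
        groups[fingerprint(item.title)].append(item)

    histories: Dict[str, List[Tuple[str, int]]] = {}
    for key, items in groups.items():
        if len(items) > 1:  # a single posting has no history yet
            ordered = sorted(items, key=lambda entry: entry.posted)
            histories[key] = [(entry.posted, entry.price) for entry in ordered]
    return histories
```

Each value in the returned dictionary is a ready-made series for graphing how the asking price moved between reposts.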