from Hacker News

Taking action against scraping for hire

by pawelkobojek on 7/7/22, 12:33 PM with 227 comments

by iandanforth on 7/7/22, 2:03 PM
Collecting the rhetorical BS:
"scraping attacks"
Scraping is not an attack. Monopolists want to pretend they own your data because they get unlimited access to monetize it whereas competitors should have none.
"self-compromised"
Monopolists want to sell you thus it's imperative they maintain the fiction of "one person, one account". By admitting you own your account, they'd have to allow sharing and they wouldn't be able to provide their customers (advertisers) with reliable data about individuals.
"protect people from scraping"
Monopolists will protect themselves and call it protecting you. They will attempt to make you afraid of some other actor using your data in harmful ways so as to detract from how they monetize you and use your data in harmful ways.
"deter the abuse"
Monopolists don't want to argue about what constitutes abuse. Anything they write in their TOS is entirely for their benefit and only constrained by local law (if that). They will abuse you to the fullest extent they can get away with while arguing that any action to use your rights is "abuse."
"safeguard people against clone sites"
Monopolists want to maintain their monopoly, and there is no greater threat to that monopoly than allowing data to move freely.
--
More subtle but even more ironic rhetorical points
"for hire" / "paying for access"
Emphasizing that people making money (gasp) by providing this service is bad.
"industry leader in taking legal action" + "across many platforms and national boundaries, also requires a collective effort from platforms, policymakers and civil society"
Monopolists can pay high priced marketers to rebrand them as patriotic hero figures fighting valiantly for the little guy.
by fxtentacle on 7/7/22, 1:43 PM
Of course, Facebook wants to make it sound like scraping is illegal, when it generally isn't.
But account hijacking and mass-creation of accounts just to access private pages are clear violations of the Facebook and Instagram ToS, so they surely can sue for that.
by HeckFeck on 7/7/22, 1:22 PM
Data harvesting is moral for me, but not for thee.
by rustdeveloper on 7/7/22, 1:15 PM
“This industry makes scraping available to individuals and companies that otherwise would not have the capabilities.” - seems like web scraping companies are doing a good job :)
by PhilipA on 7/7/22, 1:42 PM
>Octopus, a US subsidiary of a Chinese national high-tech enterprise, built a cloud-based platform designed to provide paying customers access to on-demand scraping software and services.
It is interesting how they try to position this as a Chinese attack on them.
by throwaway_meta on 7/7/22, 3:28 PM
People that are criticizing this probably were also critical of the Cambridge Analytica scandal, but it would be useful to compare what happened there and here.
With Cambridge Analytica:
- Facebook allowed users (with informed consent) to allow external developers to access their data and limited data about their friends, in order to build social-enabled apps.
- CA exploited this to scrape basic profile data from a large number of users. It broke the ToS by doing so (in particular by using the data for purposes different than stated)
Here the same is happening:
- people are giving a third-party company access to their profile, which includes access to friends' data (in fact a lot more than the app platform allowed)
- the company is scraping all the data.
At the time of CA, the criticism was that Facebook didn't do enough to enforce its ToS (or maybe that the data sharing should not have been allowed in the first place? But the terms were common knowledge and the attack potential became clear only in hindsight). Here, people are criticizing Facebook for in fact enforcing its ToS.
Also note that strong enforcement against scraping is one of the mandates that came from the FTC settlement.
It seems inevitable that any news about Facebook/Meta is read in the worst possible light these days, even when the criticism is self-contradictory. I would expect less superficial commentary from HN.
by carride on 7/7/22, 1:58 PM
In the early days of FB, they convinced people that pages (or some content, sorry I do not know the FB terms) could be public for anyone to view without needing to log in to FB. This was very helpful for small businesses and communities. In many countries this is still the quickest place to make a public page. Though now, every small business or community page I want to visit is locked out unless I log in to FB. Even if I do log in, it is impossible to copy and paste the important details of a page or post, plus the UI is as ugly as it has always been.
by htrp on 7/7/22, 1:40 PM
This is different from LinkedIn v HiQ because HiQ was only scraping publicly available data that was generally accessible to the broader internet. In these two cases, the data is being scraped from FB/Insta using credentials that the client handed over or the mass creation of accounts solely for scraping purposes.
by i_have_an_idea on 7/7/22, 1:59 PM
> After paying for access to the scraping software, customers self-compromised their Facebook and Instagram accounts by providing their authentication information to Octopus
"self-compromised" lol
clearly these people just wanted an automated way to access their own data
by pclmulqdq on 7/7/22, 1:25 PM
They have to keep the walls up on their garden so they can get maximum value from harvesting.
by ok123456 on 7/7/22, 3:03 PM
Remember back when facebook grew their little network by scraping your gmail contacts.
Google blocked them.
There was animus between the two companies that resulted in Facebook not making an official android app until 2010.
by pid-1 on 7/7/22, 1:27 PM
> scraping attack
by almog on 7/7/22, 2:12 PM
Ironically, around a year ago I disclosed (using their White Hat bug bounty program) that I'm able to access recruitment data (mostly candidates' details) using a very cheap form of scraping against a 3rd party service provider. They dismissed it and instructed me to report it to the 3rd party that operates that service (which I had done beforehand, but the issue had not been fixed).
Sorry for being vague here, I haven't publicly disclosed it yet, but will probably have to if it doesn't get fixed.
by nicholasjarnold on 7/7/22, 2:53 PM
Funny story from the early days of TheFaceBook, probably around 2005ish:
I was a webmaster of a set of servers on a major university's network. I also had access (enough to run arbitrary programs that had pretty much full ingress/egress to the public internet) to a number of machines across the campus's network. Through some of my coursework and ACM chapter activities I met some other similarly minded technical people with similar levels of access.
We decided that it would be fun to use our superpowers (access + programming abilities + curiosity) to sign up for various accounts on FB and essentially scrape and friend as much as possible. At the time they had some rate limiting, some IP banning (which wasn't terrible because the Uni gave public IPv4 addrs to all machines on campus by default) and then added some early CAPTCHA which we ended up breaking pretty trivially with some Python and image recognition code.
Never got sued... :) Never really did much with the scripts or data except test that they worked. Fun times.
by cosmiccatnap on 7/7/22, 1:43 PM
I would consider this appropriate if one of the largest scraping offenders weren't the one pretending to be the offended.
by paultopia on 7/7/22, 1:47 PM
"Scraping attacks" LOL
by samsoftstuff on 7/7/22, 2:11 PM
It's like they don't know that courts just made it legal: https://techcrunch.com/2022/04/18/web-scraping-legal-court/
by Nextgrid on 7/7/22, 3:32 PM
So much bad faith in this press release but not surprising from such a disgusting company, with of course some China-related fear-mongering despite no evidence of wrongdoing.
> After paying for access to the scraping software, customers self-compromised their Facebook and Instagram accounts by providing their authentication information to Octopus.
They didn't "self-compromise" their account. They trust Octopus to act on their behalf, and unlike Facebook, Octopus' interests are most likely more aligned with their users' since their service is paid. This is no different from handing your Facebook credentials to your social media manager or secretary. There's no evidence that Octopus misused this access in any way.
> Octopus designed the software to scrape data accessible to the user when logged into their accounts, including data about their Facebook Friends such as email address, phone number, gender and date of birth, as well as Instagram followers and engagement information such as name, user profile URL, location and number of likes and comments per post.
This is either information people intend to be public or information they trust their friends to keep private. Now if Octopus was leaking the private information to third-parties it would be one thing, but so far I see no evidence Octopus was disclosing the scraped information to anyone but their customer (who is already authorized to access it).
> Meta is an industry leader in taking legal action to protect people from scraping and exposing these types of services
Translation: Meta is an industry leader in protecting its disgusting business model that hinges on keeping public data behind a walled garden with an unacceptable "privacy" policy. There wouldn't be a market for Octopus (or other scrapers) if Facebook already allowed customers to efficiently access information they're already entitled to, but that would be against their interests as their entire business hinges on information being held hostage.
They've created a problem, are selling the cure (well in this case monetizing it via ads) and are now pissed off that someone else is selling the cure for cheaper.
by Litost on 7/7/22, 2:38 PM
Anyone else heard of Tim Berners-Lee's idea of hosting your data in pods outside the relevant corps wanting access to it and you controlling what's shared and how? This is such a completely different way of doing it, I'm not sure of all the implications, be that from admin (how much effort) to security (would this be a massive hacking opportunity) etc. https://www.theregister.com/2022/01/20/tim_bernerslee/
by allenleein on 7/7/22, 1:53 PM
Ironically, Octopus reminds me of "Octopus VR" in the Silicon Valley show.
https://www.youtube.com/watch?v=ltFB4WBdDg4
by viburnum on 7/7/22, 2:08 PM
One of Facebook’s earliest acquisitions was a scraping company called Octazen.
by dangerlibrary on 7/7/22, 1:39 PM
Fingers crossed they eventually get around to suing Clearview AI out of existence.
https://www.nytimes.com/2020/01/18/technology/clearview-priv...
by oxff on 7/7/22, 2:10 PM
Pretty rich idea coming from FB, lol. They do human scraping.
by trasz on 7/7/22, 1:45 PM
We need to update the law to make sure Meta loses in cases like this.
by jmyeet on 7/7/22, 2:06 PM
I'm torn on Web scraping because the extremes at both ends of the spectrum on this issue seem unreasonable.
On one side, you have people who say any form of scraping should be disallowed, even prosecutable. This went so far that the Department of Justice on behalf of AT&T prosecuted a case of URL modification [1]. One of the few bright spots for this psychotic Supreme Court was to curtail the government's power under the CFAA by limiting what constituted "unauthorized" access [2].
On the other hand, there are those who think that any level of scraping should be fine and I think that's untenable too. Consider Yahoo indexing of Stack Overflow [3]:
> In the meantime, since Yahoo (via Slurp!) is about 0.3% of our traffic, but insists on rudely consuming a huge chunk of our prime-time bandwidth, they’re getting IP banned and blocked.
Do these "scraping extremists" think such actions should be illegal? It's actually not that far-fetched given the Ninth Circuit decided LinkedIn wrongly blocked HiQ scraping [4]. Like if you change your website with the intent that it'll make scraping more difficult, is that a problem? What if it's an unintended side effect?
Additionally, companies like Meta, Google and Apple are going to be way more accountable for abiding by data retention laws and regulations than any scraper. If it's OK to scrape FB.com completely, that information is out there forever.
I certainly think the government shouldn't prosecute on behalf of companies. At least that should expose to people how the government's #1 priority is in fact to protect the true constituents: corporations and the capital-owning class.
[1]: https://www.techdirt.com/2013/09/30/dojs-insane-argument-aga...
[2]: https://en.wikipedia.org/wiki/Van_Buren_v._United_States
[3]: https://stackoverflow.blog/2009/06/16/the-perfect-web-spider...
[4]: https://blog.ericgoldman.org/archives/2019/09/ninth-circuit-...
by romanovcode on 7/7/22, 2:23 PM
> Meta is an industry leader in taking legal action to protect people from scraping and exposing these types of services, which provide scraping as a service across multiple websites.
Sure, as long as Meta is not the one selling the data to Cambridge Analytica it's wrong.
by xvector on 7/7/22, 1:51 PM
HN is hypocritical - most commenters here are against this because "Meta bad," but at the same time, most commenters wouldn't want their posts shared privately amongst friends to be scraped and made available publicly.
by throwaway5959 on 7/7/22, 2:00 PM
Wasn’t Meta stealing news articles and not paying news organizations for them?
by NelsonMinar on 7/7/22, 2:36 PM
Octopus sounds really useful; is there an open source equivalent? I'd love to be able to scrape my own data on Facebook. Their data export feature is fairly good but far from complete.
by typon on 7/7/22, 2:32 PM
Google has turned Google Search into a walled garden by scraping people's content and serving it up on their own platter. Is anyone going to stand up to them?
by dmje on 7/7/22, 4:00 PM
Or Facebook could just open up their data. Oh wait, not their data, silly me. Everyone else's data. Keep on scraping, I say.
by rmbyrro on 7/7/22, 6:00 PM
The fact they're wasting time on that is a sign that Facebook's decay phase has already started.
by upupandup on 7/7/22, 3:21 PM
whoa, wasn't there somebody on HN who ran a web scraping shop and was boasting that they could scrape Instagram a while back? are these the same guys???
I don't know how far Facebook can get with this, I thought the LinkedIn court ruling made scraping de facto legal
by jascii on 7/7/22, 2:16 PM
So, Facebook doesn't want to share the data it wants us to share with them? Figures...
by postalrat on 7/7/22, 3:53 PM
Hey instagram/facebook/linkedin/etc: It's not your data.
by neya on 7/7/22, 2:21 PM
Evil Big Co. that literally STEALS people's personal information everywhere they go even after they've indicated they want to be left alone is now offended when someone does the same to them?
Well, color me surprised /s
Fuck Facebook. Meta. Or whatever you want to call it.
by Hedepig on 7/7/22, 1:30 PM
Is this much different from LinkedIn vs hiQ?
by throw20220707 on 7/7/22, 2:10 PM
From a GDPR point of view, this kind of 3rd party data collection is not acceptable (assuming it covers personal information, for example names of people and what they have posted). The difference with Meta's own data collection is that the users have a relationship with Meta and have given their permission for Meta to handle the data. Users also know they can contact Meta and ask them to remove the data.
3rd parties don't have the consent from users. Users don't even have an idea these companies might be holding their data.
by uhtred on 7/7/22, 4:07 PM
Fuck off Facebook you scumbags
by Komodai on 7/7/22, 1:53 PM
Is it Octopus Data Inc. aka Octoparse they are suing?
by jacooper on 7/7/22, 1:37 PM
They are still using the fb.com domain? I thought Meta is not Facebook?....