from Hacker News

Data accidentally exposed by Microsoft AI researchers

by deepersprout on 9/18/23, 2:30 PM with 226 comments

  • by saurik on 9/18/23, 3:22 PM

    A number of replies here are noting (correctly) how this doesn't have much to do with AI (despite some sentences in this article kind of implicating it; the title doesn't really, fwiw) and is more of an issue with cloud providers, confusing ways in which security tokens apply to data being shared publicly, and dealing with big data downloads (which isn't terribly new)...

    ...but one notable way in which it does implicate an AI-specific risk is how prevalent it is to use serialized Python objects to store these large opaque AI models, given how the Python serialization format was never exactly intended for untrusted data distribution and so is kind of effectively code... but stored in a way where both what that code says as well as that it is there at all is extremely obfuscated to people who download it.

    > This is particularly interesting considering the repository’s original purpose: providing AI models for use in training code. The repository instructs users to download a model data file from the SAS link and feed it into a script. The file’s format is ckpt, a format produced by the TensorFlow library. It’s formatted using Python’s pickle formatter, which is prone to arbitrary code execution by design. Meaning, an attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft’s GitHub repository would’ve been infected by it.
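
    To make the pickle point concrete, here is a minimal sketch (my own illustration, not the payload described in the article) of why loading an untrusted pickle-based checkpoint is effectively running someone else's code:

      import pickle

      # pickle lets any object name a callable (plus arguments) to be invoked
      # at load time via __reduce__ - that's the "code execution by design".
      class NotAModel:
          def __reduce__(self):
              import os
              # A real attacker could run anything here; 'id' is a harmless stand-in.
              return (os.system, ("id",))

      payload = pickle.dumps(NotAModel())

      # Anyone who "loads the model" runs the attacker's callable as a side effect.
      pickle.loads(payload)

    Which is why loading ckpt/pickle files only from trusted sources (or preferring formats that carry plain tensors, such as safetensors) matters so much for model distribution.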

  • by sillysaurusx on 9/18/23, 3:10 PM

    The article tries to play up the AI angle, but this was a pretty standard misconfiguration of a storage token. This kind of thing happens shockingly often, and it’s why frequent pentests are important.
  • by hdesh on 9/18/23, 3:01 PM

    On a lighter note - I saw a chat message that started with "Hey dude! How is it going". I'm disappointed that the response was not https://nohello.net/en/.
  • by quickthrower2 on 9/18/23, 3:51 PM

    Two of the things that make me cringe are mentioned: pickle files and SAS tokens. I get nervous dealing with Azure storage. Use RBAC. They should deprecate SAS and account keys IMO.

    SOC 2-type auditing should have been done here, so I am surprised at the reach: the SAS had no expiry, and it gave a deep level of access, including machine backups with their own tokens. A lot of lack of defence in depth going on there.

    My view is burn all secrets. Burn all environment variables. I think most systems can work based on roles. Important human access goes via username, password and other factors.

    If you are working in one cloud you don't, in theory, need secrets. If not, I had the idea the other day that proxies tightly coupled to vaults could be used as API adaptors to convert them into RBAC too. But I am not a security expert, just paranoid lol.
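
    For anyone unfamiliar with SAS tokens, a rough sketch with the azure-storage-blob Python SDK of the difference between an over-broad account SAS (roughly the shape of what was reportedly exposed) and a tightly scoped one; the account, container and blob names here are made up for illustration:

      from datetime import datetime, timedelta, timezone
      from azure.storage.blob import (
          generate_account_sas, ResourceTypes, AccountSasPermissions,
          generate_blob_sas, BlobSasPermissions,
      )

      ACCOUNT = "examplestorageacct"   # hypothetical account name
      KEY = "<account key>"            # hypothetical account key

      # Over-broad: whole account, read/write/list, expiry decades out.
      risky_sas = generate_account_sas(
          account_name=ACCOUNT,
          account_key=KEY,
          resource_types=ResourceTypes(service=True, container=True, object=True),
          permission=AccountSasPermissions(read=True, write=True, list=True),
          expiry=datetime(2050, 1, 1, tzinfo=timezone.utc),
      )

      # Scoped: one read-only blob, expiring in an hour.
      scoped_sas = generate_blob_sas(
          account_name=ACCOUNT,
          container_name="models",
          blob_name="robust_resnet50.ckpt",
          account_key=KEY,
          permission=BlobSasPermissions(read=True),
          expiry=datetime.now(timezone.utc) + timedelta(hours=1),
      )

    Neither is as good as RBAC with managed identities, since any SAS is still a bearer secret that can leak exactly as it did here.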

  • by stevanl on 9/18/23, 3:14 PM

    Looks like it was up for 2 years with that old link[1]. Fixed two months ago.

    [1] https://github.com/microsoft/robust-models-transfer/blame/a9...

  • by jl6 on 9/18/23, 4:22 PM

    Kind of incredible that someone managed to export Teams messages out from Teams…
  • by pradn on 9/18/23, 6:14 PM

    It's not reasonable to expect human security token generation to be perfectly secure all the time. The system needs to be safe overall. The organization should have set an OrgPolicy on this entire project to prevent blanket sharing of auth tokens/credentials like this. Ideally blanket access tokens should be opt-in, not opt-out.

    Google banned generation of service account keys for internally-used projects, so a stray JSON key file doesn't allow access to Google data/code. This is enforced at the highest level by OrgPolicy. There are a bunch more restrictions, too.

  • by mola on 9/18/23, 4:05 PM

    It's always funny that Wiz's big security revelations are almost always about Microsoft, when Wiz's founder was the highest-ranking person in charge of cyber security at Microsoft in his previous job.
  • by anon1199022 on 9/18/23, 2:34 PM

    Just proves how hard cloud security is now. 1-2 mistakes and you expose TBs. Insane.
  • by formerly_proven on 9/18/23, 3:43 PM

    This stands out

    > Our scan shows that this account contained 38TB of additional data — including Microsoft employees’ personal computer backups.

    Not even Microsoft has functioning corporate IT any more, with employees not just being able to make their own image-based backups, but also having to store them in some random Azure bucket that they're using for work files.

  • by bkm on 9/18/23, 3:00 PM

    Would be insane if the GPT-4 model is in there somewhere (as it's served by Azure).
  • by wodenokoto on 9/18/23, 3:46 PM

    I really dislike how Azure makes you juggle keys in order to make any two Azure things talk to each other.

    On top of that, you only have two keys for the entire storage account. It would have made much more sense if you could have unlimited, named keys for each container.

  • by kevinsundar on 9/18/23, 9:13 PM

    This is very similar to how some security researchers got access to TikTok's S3 bucket: https://medium.com/berkeleyischool/cloudsquatting-taking-ove...

    They used the same mechanism: mining Common Crawl and other publicly available web crawler data to source DNS records for S3 buckets.
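
    As a rough sketch of that technique (my own illustration; the crawl ID and domain pattern are placeholders, and it just queries the public Common Crawl CDX index via the requests library):

      import json
      import requests

      # Ask one Common Crawl index for captured URLs under a cloud-storage domain.
      INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-40-index"

      resp = requests.get(
          INDEX,
          params={
              "url": "*.blob.core.windows.net",  # every captured subdomain, i.e. storage accounts
              "output": "json",
              "limit": "50",
          },
          timeout=60,
      )

      # The index returns one JSON record per line; the hostname in each URL
      # names a storage account that has appeared somewhere on the public web.
      for line in resp.text.splitlines():
          if line.strip():
              print(json.loads(line)["url"])

    The linked write-up is about S3 and abandoned buckets specifically, so treat this only as the gist of "mine crawler data for cloud-storage hostnames".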

  • by EGreg on 9/18/23, 4:07 PM

    This seems to be a common occurrence with Big Tech and Big Government, so we better get used to it:

    https://qbix.com/blog/2023/06/12/no-way-to-prevent-this-says...

    https://qbix.com/blog/2021/01/25/no-way-to-prevent-this-says...

  • by rickette on 9/18/23, 3:25 PM

    At this point MS might as well acquire Wiz, given the number of Azure security issues they have found.
  • by lijok on 9/18/23, 7:18 PM

    I wouldn't trust MSFT with my glass of chocolate milk at this point. I would come back to lipstick all over the rim and somehow multiple leaks in the glass
  • by gumballindie on 9/18/23, 5:18 PM

    Would be cool if someone analysed it - I am fairly certain there is proprietary code and data lying around. Would be useful for future lawsuits against Microsoft and others that steal people's IP for "training" purposes.
  • by madelyn-goodman on 9/18/23, 5:37 PM

    This is so unfortunate, but a clear illustration of something I've been thinking about a lot when it comes to LLMs and AI: it seems like we're forgetting that we are just handing our data over to these companies on a silver platter in the form of our prompts. Disclosure: I do work for Tonic.ai and we are working on a way to automatically redact any information you send to an LLM - https://www.tonic.ai/solar
  • by naikrovek on 9/18/23, 4:27 PM

    Amazing how ingrained it is in some people to just go around security controls.

    Someone chose to make that SAS have a long expiry, and someone chose to make it read-write.

  • by baz00 on 9/18/23, 6:14 PM

    What's that, the second major data loss / leak event from MSFT recently?

    Is your data really safe there?

  • by h1fra on 9/18/23, 5:07 PM

    The article is focusing on AI and Teams messages for some reason, but the exposed bucket had passwords, SSH keys, credentials, .env files and most probably a lot of proprietary code. I can't even imagine the nightmare it has created internally.
  • by svaha1728 on 9/18/23, 3:57 PM

    Embrace, extend, and extinguish cybersecurity with AI. It's the Microsoft way.
  • by fithisux on 9/19/23, 4:27 AM

    My opinion is that it was not an "accident", but that they are preparing us for the era where powerful companies will "own" our data in the name of security.

    Should have been sent to prison.

  • by riwsky on 9/18/23, 3:45 PM

    If only Microsoft hadn’t named the project “robust” models transfer, they could have dodged this Hubrisbleed attack.
  • by bt1a on 9/18/23, 3:06 PM

    Don't get pickled, friends!
  • by 34679 on 9/18/23, 5:28 PM

    @ 4 mm per character:

    4e-6 km/character * 3.8e+13 characters = 152 million kilometers of text.

    Nearly 200 round trips to the moon.

  • by avereveard on 9/18/23, 3:21 PM

    Oof. Is that containing code from GitHub private repos?
  • by endisneigh on 9/18/23, 2:59 PM

    how is this sort of stuff not at least encrypted at rest?
  • by mymac on 9/18/23, 3:32 PM

    Fortunately not a whole lot of data, and surely with a little bit like that there wasn't anything important, confidential or embarrassing in there. Looking forward to Microsoft's itemised list of what was taken, as well as their GDPR-related filing.
  • by Nischalj10 on 9/18/23, 3:16 PM

    zsh, any way to download the stuff?
  • by EMCymatics on 9/18/23, 4:35 PM

    That's a lot of data.
  • by munchler on 9/18/23, 3:05 PM

    > This case is an example of the new risks organizations face when starting to leverage the power of AI more broadly, as more of their engineers now work with massive amounts of training data.

    It seems like a stretch to associate this risk with AI specifically. The era of "big data" started several years before the current AI boom.

  • by buro9 on 9/18/23, 2:56 PM

    Part of me thought "this is fine as very few could actually download 38TB".

    But that's not true as it's just so cheap to spin up a machine and some storage on a Cloud provider and deal with it later.

    It's also not true as I've got a 1Gbps internet connection and 112TB usable in my local NAS.

    All of a sudden (over a decade) all the numbers got big and massive data exfiltration just looks to be trivial.
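
    Back-of-the-envelope numbers (mine, assuming a sustained 1 Gbps link, which is optimistic):

      # How long to pull 38 TB over a 1 Gbps connection at full speed?
      size_bits = 38e12 * 8      # 38 TB in bits
      rate_bps = 1e9             # 1 Gbps
      days = size_bits / rate_bps / 86400
      print(f"~{days:.1f} days")  # roughly 3.5 days

    A few days of patience, or a handful of cloud VMs pulling in parallel, is hardly a barrier.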

    I mean, obviously that's the sales pitch... you need this vendor's monitoring and security. But it's not a bad sales pitch: you have to be able to imagine the risk in order to monitor for it, and most engineers aren't thinking that way.

  • by anyoneamous on 9/18/23, 4:20 PM

    Straight to jail.
  • by HumblyTossed on 9/18/23, 3:40 PM

    Microsoft, too big to fa.. care.