by deepersprout on 9/18/23, 2:30 PM with 226 comments
by saurik on 9/18/23, 3:22 PM
...but one notable way in which it does implicate an AI-specific risk is how common it is to store these large, opaque AI models as serialized Python objects. The Python serialization format was never exactly intended for distributing untrusted data, so a model file is effectively code... but stored in a way that obscures both what that code says and the fact that it is there at all from the people who download it.
> This is particularly interesting considering the repository’s original purpose: providing AI models for use in training code. The repository instructs users to download a model data file from the SAS link and feed it into a script. The file’s format is ckpt, a format produced by the TensorFlow library. It’s formatted using Python’s pickle formatter, which is prone to arbitrary code execution by design. Meaning, an attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft’s GitHub repository would’ve been infected by it.
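To make the "effectively code" point concrete, here is a minimal sketch of how unpickling can invoke a callable chosen by whoever produced the file. The `Payload` class and `restricted_loads` helper are hypothetical names for illustration; the callable is a harmless `int`, where an attacker would pick something far worse. The second half follows the "restricting globals" approach from the Python `pickle` documentation and is only one possible mitigation, not what the repository or article used.

```python
import io
import pickle


class Payload:
    """Hypothetical stand-in for an object hidden inside a model file."""

    def __reduce__(self):
        # On load, pickle calls the returned callable with these arguments.
        # `int` is harmless; the file's author chooses the callable.
        return (int, ("42",))


blob = pickle.dumps(Payload())

# Loading does not give back a Payload: it runs int("42") instead.
result = pickle.loads(blob)
print(type(result).__name__, result)


class NoGlobalsUnpickler(pickle.Unpickler):
    """Refuses every global lookup, so no callable can be resolved."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def restricted_loads(data: bytes):
    # Only plain containers, strings and numbers survive this loader.
    return NoGlobalsUnpickler(io.BytesIO(data)).load()


try:
    restricted_loads(blob)
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```

A loader this strict would also reject most real checkpoints, which is part of why formats that store only tensors are often suggested instead.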
by sillysaurusx on 9/18/23, 3:10 PM
by hdesh on 9/18/23, 3:01 PM
by quickthrower2 on 9/18/23, 3:51 PM
SOC2-type auditing should have been done here, so I am surprised by the reach. The SAS had no expiry, and it gave a deep level of access, including machine backups with their own tokens. There is a lot of missing defence in depth there.
My view is: burn all secrets. Burn all environment variables. I think most systems can work based on roles. Important humans get access via username, password and other factors.
If you are working in one cloud you don't, in theory, need secrets. If not, I had the idea the other day that proxies tightly coupled to vaults could be used as API adaptors to convert them into RBAC too. But I am not a security expert, just paranoid lol.
by stevanl on 9/18/23, 3:14 PM
[1] https://github.com/microsoft/robust-models-transfer/blame/a9...
by jl6 on 9/18/23, 4:22 PM
by pradn on 9/18/23, 6:14 PM
Google banned generation of service account keys for internally-used projects. So an awry JSON file doesn't allow access to Google data/code. This is enforced at the highest level by OrgPolicy. There's a bunch more restrictions, too.
by mola on 9/18/23, 4:05 PM
by anon1199022 on 9/18/23, 2:34 PM
by formerly_proven on 9/18/23, 3:43 PM
> Our scan shows that this account contained 38TB of additional data — including Microsoft employees’ personal computer backups.
Not even Microsoft has functioning corporate IT any more, with employees not just being able to make their own image-based backups, but also having to store them in some random A3 bucket that they're using for work files.
by bkm on 9/18/23, 3:00 PM
by wodenokoto on 9/18/23, 3:46 PM
Even more so, you only have two keys for the entire storage account. Would have made much more sense if you could have unlimited, named keys for each container.
by kevinsundar on 9/18/23, 9:13 PM
They used the same mechanism of using Common Crawl or other publicly available web crawler data to source DNS records for S3 buckets.
by EGreg on 9/18/23, 4:07 PM
https://qbix.com/blog/2023/06/12/no-way-to-prevent-this-says...
https://qbix.com/blog/2021/01/25/no-way-to-prevent-this-says...
by rickette on 9/18/23, 3:25 PM
by lijok on 9/18/23, 7:18 PM
by gumballindie on 9/18/23, 5:18 PM
by madelyn-goodman on 9/18/23, 5:37 PM
by naikrovek on 9/18/23, 4:27 PM
Someone chose to make that SAS have a long expiry and someone chose to make it read-write.
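Both of those choices are visible in the SAS URL itself, so they can be checked before a link is shared. Below is a rough sketch using only the standard library, assuming the usual SAS query fields `sp` (permissions) and `se` (expiry). The `audit_sas_url` helper, the 7-day threshold and the example URL are all made up for illustration, and a token tied to a stored access policy may carry its expiry elsewhere.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def audit_sas_url(url, now, max_lifetime=timedelta(days=7)):
    """Return warnings for a SAS URL based on its `sp` and `se` fields."""
    params = parse_qs(urlparse(url).query)
    warnings = []

    # Permission letters: r=read, l=list, a=add, c=create, w=write, d=delete.
    permissions = params.get("sp", [""])[0]
    risky = sorted(set(permissions) & set("acwd"))
    if risky:
        warnings.append("grants more than read/list: " + "".join(risky))

    expiry_raw = params.get("se", [None])[0]
    if expiry_raw is None:
        # Could also mean the expiry lives in a stored access policy.
        warnings.append("no expiry (se) field in the URL")
    else:
        expiry = datetime.fromisoformat(expiry_raw.replace("Z", "+00:00"))
        if expiry.tzinfo is None:
            expiry = expiry.replace(tzinfo=timezone.utc)
        if expiry - now > max_lifetime:
            warnings.append(f"expires in {(expiry - now).days} days")
    return warnings


# Hypothetical URL; account, container, dates and signature are invented.
example = (
    "https://example.blob.core.windows.net/models"
    "?sv=2020-08-04&sr=c&sp=racwdl&se=2050-01-01T00:00:00Z&sig=REDACTED"
)
now = datetime(2023, 9, 18, tzinfo=timezone.utc)
for warning in audit_sas_url(example, now):
    print(warning)
```

This only inspects what the URL claims; it does not verify the signature or what the storage account actually enforces.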
by baz00 on 9/18/23, 6:14 PM
Is your data really safe there?
by h1fra on 9/18/23, 5:07 PM
by svaha1728 on 9/18/23, 3:57 PM
by fithisux on 9/19/23, 4:27 AM
Should have been sent to prison.
by riwsky on 9/18/23, 3:45 PM
by bt1a on 9/18/23, 3:06 PM
by 34679 on 9/18/23, 5:28 PM
4e-6 * 3.8e+13 = 152 million kilometers of text.
Nearly 200 round trips to the moon.
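A quick check of that arithmetic, as a sketch. It assumes the 4e-6 factor means 4 mm (4e-6 km) of line length per character, one byte per character, 38 TB taken as 3.8e13 bytes, and an average Earth-Moon distance of about 384,400 km; the variable names are mine.

```python
SIZE_BYTES = 3.8e13        # 38 TB, decimal terabytes
KM_PER_CHAR = 4e-6         # assumed: 4 mm of text per one-byte character
MOON_DISTANCE_KM = 384_400  # average Earth-Moon distance

text_km = SIZE_BYTES * KM_PER_CHAR
round_trips = text_km / (2 * MOON_DISTANCE_KM)

print(f"{text_km / 1e6:.0f} million km of text")
print(f"{round_trips:.1f} round trips to the moon")
```

Under those assumptions the product comes to 152 million km, which is a little under 198 round trips, so "nearly 200" holds.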
by avereveard on 9/18/23, 3:21 PM
by endisneigh on 9/18/23, 2:59 PM
by mymac on 9/18/23, 3:32 PM
by Nischalj10 on 9/18/23, 3:16 PM
by EMCymatics on 9/18/23, 4:35 PM
by munchler on 9/18/23, 3:05 PM
It seems like a stretch to associate this risk with AI specifically. The era of "big data" started several years before the current AI boom.
by buro9 on 9/18/23, 2:56 PM
But that's not true as it's just so cheap to spin up a machine and some storage on a cloud provider and deal with it later.
It's also not true as I've got a 1Gbps internet connection and 112TB usable in my local NAS.
All of a sudden (over a decade) all the numbers got big and massive data exfiltration just looks to be trivial.
I mean, obviously that's the sales pitch... you need this vendor's monitoring and security, but that's not a bad sales pitch as you need to be able to imagine and think of the risk to monitor for it and most engineers aren't thinking that way.
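For a sense of scale on the "trivial" claim, here is a back-of-the-envelope sketch of how long the 38TB from the article would take over the 1Gbps link mentioned above. It assumes the link is fully saturated with no protocol overhead and that the source can serve data that fast, so treat the result as a lower bound; `transfer_days` is a hypothetical helper.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Idealised transfer time in days, ignoring overhead and throttling."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400


SIZE_BYTES = 38e12   # 38 TB, decimal terabytes
LINK_BPS = 1e9       # 1 Gbps home connection
NAS_BYTES = 112e12   # 112 TB usable

days = transfer_days(SIZE_BYTES, LINK_BPS)
print(f"about {days:.1f} days to download")
print(f"fills {SIZE_BYTES / NAS_BYTES:.0%} of the NAS")
```

So under ideal conditions the whole dataset is a few days of downloading and roughly a third of that NAS.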
by anyoneamous on 9/18/23, 4:20 PM
by HumblyTossed on 9/18/23, 3:40 PM