SANS Stormcast Monday Mar 3rd: AI Training Data Leaks; MITRE Caldera Vuln; modsecurity bypass

March 02, 2025

Episode Transcript

Common Crawl includes Common Leaks
The "Common Crawl" dataset, a large dataset created by spidering websites, contains, as expected, many API keys and other secrets. This data is often used to train large language models.
https://trufflesecurity.com/blog/research-finds-12-000-live-api-keys-and-passwords-in-deepseek-s-training-data
GitHub Repositories Exposed by Copilot
As is well known, GitHub's Copilot uses data from public GitHub repositories to train its model. However, it appears that repositories that were briefly public and later made private have been included as well, allowing Copilot users to retrieve files from these repositories.
https://www.lasso.security/blog/lasso-major-vulnerability-in-microsoft-copilot
MITRE Caldera Framework Allows Unauthenticated Code Execution
The MITRE Caldera adversary emulation framework allows for unauthenticated code execution by allowing attackers to specify compiler options.
https://medium.com/@mitrecaldera/mitre-caldera-security-advisory-remote-code-execution-cve-2025-27364-5f679e2e2a0e
ModSecurity Rule Bypass
Attackers may bypass the ModSecurity web application firewall by prepending zeros to encoded characters.
https://github.com/owasp-modsecurity/ModSecurity/security/advisories/GHSA-42w7-rmv5-4x2j
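
To make the ModSecurity item above more concrete, here is a minimal Python sketch of this class of bug. It is not ModSecurity's code: `naive_decode`, `RULE`, and `blocked` are hypothetical names invented for illustration, and the flawed decoder is an assumption about how a decoder might mishandle zero-padded numeric entities. The standard library's `html.unescape` is used as the tolerant reference decoder.

```python
import html
import re

# Hypothetical filtering rule: flag input that contains a script tag
# once it has been decoded.
RULE = re.compile(r"<\s*script", re.IGNORECASE)

# Hypothetical flawed decoder: it only recognizes decimal entities that
# do NOT start with a zero, so "&#60;" is decoded but "&#0060;" is not.
_NAIVE_ENTITY = re.compile(r"&#([1-9][0-9]{0,5});")


def naive_decode(value: str) -> str:
    """Decode decimal HTML entities, ignoring zero-padded ones (the bug)."""
    return _NAIVE_ENTITY.sub(lambda m: chr(int(m.group(1))), value)


def blocked(value: str, decode) -> bool:
    """Return True if the rule matches the value after decoding."""
    return bool(RULE.search(decode(value)))


plain = "&#60;script&#62;alert(1)&#60;/script&#62;"
padded = "&#0060;script&#0062;alert(1)&#0060;/script&#0062;"

print(blocked(plain, naive_decode))    # True: the rule sees "<script"
print(blocked(padded, naive_decode))   # False: padded entities slip through
print(blocked(padded, html.unescape))  # True: a tolerant decoder catches it
```

A browser treats both payloads the same way, so a filter has to normalize zero-padded entities exactly as the eventual consumer of the data would.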

Transcript is autogenerated by AI.

Hello and welcome to the Monday, March 3rd, 2025 edition of the SANS Internet Storm Center's Stormcast. My name is Johannes Ullrich, and today I'm recording from Baltimore, Maryland.
Well, let's start today with some stories about AI training data. The first one comes from Truffle Security. Truffle Security, of course, is the company behind TruffleHog, the very frequently used and well-respected tool that allows you to identify API keys and other secrets that you may leak in Git or other repositories. So Truffle Security took a big database of AI training data that's being offered by Common Crawl. Common Crawl has been spidering the web for many years now. They have something like 400 terabytes of data that they are offering. And it shouldn't really be a surprise, because it's the same thing that we had with Google and other web spiders that offer their data publicly: it now becomes probably straightforward to find things like API keys that people leaked on their websites.
What's a little bit tricky here is that this data is also historic data. I believe they have been doing this for the last 10 years or so. So it is not just current data. Sites like Google offer some historic data, but usually focus more on current data. They found roughly 12,000 of what Truffle Security considers live keys, which means that they work according to TruffleHog. TruffleHog has a test feature that allows you to make sure that these are not just sample or expired credentials. They point out in their paper that this number of roughly 12,000 secrets is, of course, just an estimate. There are some that they missed just because they were not formatted correctly. And then, of course, it is always a little bit tricky to figure out if they're actually being used or if they're just demo credentials.
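
As a rough illustration of the kind of pattern matching involved in finding keys in crawled pages, here is a minimal Python sketch. It is not how TruffleHog is implemented: the two patterns, the placeholder hints, and the names `scan_text` and `looks_like_placeholder` are assumptions chosen for the example, and the sketch does not check whether a key is live. The first sample key is the placeholder used in AWS documentation.

```python
import re
from typing import Iterator, NamedTuple


class Finding(NamedTuple):
    kind: str
    value: str
    line_no: int


# Illustrative patterns only; a real scanner ships many more detectors
# and confirms each hit against the provider before calling it live.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}

# Heuristic markers of documentation samples rather than real keys.
PLACEHOLDER_HINTS = ("EXAMPLE", "XXXX", "0000000000")


def scan_text(text: str) -> Iterator[Finding]:
    """Yield every pattern match in the text with its 1-based line number."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                yield Finding(kind, match.group(0), line_no)


def looks_like_placeholder(value: str) -> bool:
    """Return True if the value contains a typical demo-credential marker."""
    upper = value.upper()
    return any(hint in upper for hint in PLACEHOLDER_HINTS)


page = 'var a = "AKIAIOSFODNN7EXAMPLE";\nvar b = "AKIAABCDEFGHIJKLMNOP";'
for finding in scan_text(page):
    print(finding.kind, finding.line_no, looks_like_placeholder(finding.value))
```

Pattern matching alone cannot separate demo credentials from working ones, which is why the roughly 12,000 figure is limited to keys that passed a live check.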
They also point out that many of the credentials can be found across a large number of websites in this data repository. When I first read this, I thought that maybe these are just demo credentials. You often have, for example, the snake oil secret key that comes with Apache, which, of course, is all over the place. Well, according to Truffle Security, they believe that this is more a case of multiple websites using the same piece of JavaScript. It could identify suppliers and supply chains. So I have to really see what this all means. Overall, yes, if data is exposed, it probably got captured by someone. From my own experience, particularly for a smaller website, the vast majority of hits you get are crawlers like this. So it is no real big surprise if these credentials end up pretty quickly in repositories like Common Crawl and can then easily be abused.
The second story that's also related to training data comes from researchers at Lasso Security. What they noticed is that the training data being used by GitHub's Copilot, which, well, is Microsoft, contains data from what are now private GitHub repositories. So Copilot uses GitHub as training data. That's publicly known. That's well-established. But they only use public repositories. What Lasso Security found is that if your repository was public even for a relatively short amount of time, or when you initially set it up, it's going to be added. And GitHub Copilot doesn't necessarily remove data after it's marked as private by the author of that data. Now you may say that if it's part of that training data, it may not be such a big deal. It may just help people code a little bit. But not only that: you can actually ask Copilot to list the files in that particular repository. And with that, you basically get a very direct interface into these files that were public at the time.
Again, this is only if these files were public at a particular point in time. But they found that literally thousands of these repositories were exposed, some from big name brand companies. Just like I said earlier, if at any point in time your data was exposed, assume it got grabbed by someone and has to be considered leaked at this point.
Well, then we've got some vulnerabilities to talk about. MITRE Caldera is a framework that makes it easy to simulate adversaries, so your red teamers may use it. It implements a REST API and allows for plugins to be controlled to automate various parts of the attack scenario. Sadly, MITRE announced last week that Caldera itself is vulnerable to some interesting command injection. The vulnerability derives from the Manx and Sandcat agents. These agents are intended to be used to implement a reverse shell, but they require authentication. However, these agents have the ability to be compiled just in time for a particular platform. The attacker can then supply some compile parameters, and with that they can execute arbitrary code. This is interesting in part because it is part of an attack framework that's supposed to execute arbitrary code, but not for everybody, only for authorized users. So you definitely want to update it. There's a great post by MITRE. I like that they go into detail about what really went wrong here. But with that, they also published a proof of concept exploit. I don't see this as a vulnerability that is very likely to be exploited broadly, but it could certainly be exploited in a more targeted attack.
And we have an interesting vulnerability in ModSecurity. This vulnerability is not super severe, but it does allow bypassing of ModSecurity rules. And since enforcing those rules is the point of ModSecurity, it invalidates the tool somewhat. All you have to do is prepend HTML-encoded values with zeros. Luckily, it only affects one particular version of ModSecurity. So double check and, if necessary, update.

SANS Daily Stormcast

Johannes Ullrich