Hello and welcome to the Monday, March 3rd, 2025
edition of the SANS Internet Storm Center's Stormcast. My
name is Johannes Ullrich, and today I'm recording from
Baltimore, Maryland. Well, let's start today with some
stories about AI training data. The first one here comes
from Truffle Security. Truffle Security, of course, is the
company behind TruffleHog, the very frequently used and
well-respected tool that allows you to identify API
keys and other secrets that you may leak in Git or other
repositories and such. So Truffle Security took a big
database of AI training data that's being offered by Common
Crawl. Common Crawl has been going out and spidering the web for
many years now. They have something like 400 terabytes
of data that they are offering. And well, it
shouldn't really be a surprise, because it's the same thing
that we had with Google and other web spiders that offer
the data publicly, that it now becomes, well, probably
straightforward to find things like API keys that people
leaked on their websites. What's a little bit tricky here is that
this data is also historic data. I believe they have been doing
this for the last 10 years or so. So it is not just
current data. Now, sites like Google, they offer some
historic data, but usually focus more on current data.
They found roughly 12,000 of what Truffle Security considers
live keys, which means that they work according to
TruffleHog. TruffleHog has a little sort of test feature
that allows you to make sure that these are not just
sample or expired credentials that are being used here. They
point out in their paper that this number of roughly 12,000
secrets is, of course, just an estimate. There are some that
they missed just because they were not formatted correctly.
And then, of course, it's always a little bit tricky to figure
out if they're actually being used or if they're just demo
credentials and such. They also point out that many of
the credentials can be found across a large number of
websites in this data repository. Initially, when I
read this, I first thought that, hey, maybe these are
just demo credentials and such. You often have,
like, the snake oil secret key that comes with Apache that,
of course, is all over the place. Well,
Truffle Security believes that this is more a case of multiple
websites using the same piece of JavaScript. That could
identify, like, suppliers and supply chains and such. So I'll
have to really see what this all means. Overall, yes, if
data is exposed, it probably got captured by someone. From
my own experience, particularly for a smaller
website, the vast majority of hits you get are from
crawlers like this. So no real big surprise if these
credentials end up pretty quickly in repositories like
Common Crawl and can then easily be abused. The second
story that's also related to training data comes from
researchers at Lasso Security. And what they noticed is that
the training data being used by GitHub's Copilot, which,
well, is Microsoft, contains data from what are now private
GitHub repositories. So Copilot uses GitHub as
training data. And that's publicly known. That's
well-established. But they only use public repositories. What
Lasso Security found here is that, well, if your repository
was public even for a relatively short amount of
time, or when you initially set it up, well, it's going to be
added. And GitHub Copilot doesn't necessarily remove
data after it's marked as private by the author of that
data. And not only that, now you may say, hey, you know, if
it's part of that training data, it may not be such a big
deal. It may just help people code a little bit or such. You
can actually ask Copilot to, hey, list the files in that
particular repository. And with that, you basically get a
very direct interface into these files that were public
at the time. Again, this is only if these files were
public at a particular point in time. But they found
that literally thousands of these repositories were exposed,
some from big name-brand companies. Just like I said
earlier, if at any point in time your data was exposed,
assume it got grabbed by someone and, well, has to be
considered leaked at this point. Well, then we got some
vulnerabilities to talk about. First, MITRE Caldera. It's a
framework to make it easy to simulate adversaries, so your
red teamers may use it. It implements a REST API and
allows for plugins to be controlled to automate various
parts of the attack scenario. Sadly, MITRE announced last
week that Caldera itself is vulnerable to some interesting
command injection. The vulnerability derives from the
Manx and Sandcat agents. These agents are intended to be used
to implement a reverse shell, but they require
authentication. However, these agents have the ability to be
compiled just in time for a particular platform. And the
attacker can actually then supply some compile parameters
and with that they can execute arbitrary code. Interesting in
part because this is part of an attack framework that's
supposed to execute arbitrary code, but not for everybody,
only for authorized users here. And that's why
you definitely want to update it. There's a great
post by MITRE. Actually, I like that they go into detail about
what really went wrong here. But with that, they also did
publish a proof-of-concept exploit. I don't see this as
a vulnerability that is very likely to be exploited, but it
could certainly be exploited in a more targeted attack. And
we have an interesting vulnerability in ModSecurity.
This vulnerability is not super severe, but well, it
does allow bypassing of ModSecurity rules. And since
enforcing those rules is the point of ModSecurity, it sort of
invalidates the tool somewhat. All you have to do is
pad HTML-encoded values with leading zeros. Luckily, it only
affects one particular version of ModSecurity. So double
check and, if necessary, update.
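
To make that leading-zero trick a bit more concrete, here is a minimal sketch in Python. It is not ModSecurity's code. The `naive_decode` function and the toy `is_blocked` rule are hypothetical, and the assumption that a decoder only accepts up to three digits in a numeric entity is purely for illustration. The sketch only shows the general idea: a padded entity such as `&#0000060;` still means `<` to a browser, but a decoder that does not handle the padding leaves it encoded, so a rule looking for `<script` never sees it.

```python
import html
import re

# Hypothetical decoder (an assumption for illustration, NOT ModSecurity code):
# it only recognises decimal entities with one to three digits.
_NAIVE_ENTITY = re.compile(r"&#(\d{1,3});")


def naive_decode(value: str) -> str:
    """Decode short decimal entities; longer (zero-padded) ones are left as-is."""
    return _NAIVE_ENTITY.sub(lambda m: chr(int(m.group(1))), value)


def is_blocked(value: str, decode) -> bool:
    """Toy rule: block the input if its decoded form contains '<script'."""
    return "<script" in decode(value).lower()


plain = "&#60;script&#62;"              # ordinary encoding of <script>
padded = "&#0000060;script&#0000062;"   # same characters, padded with zeros

print(is_blocked(plain, naive_decode))    # the toy rule catches this one
print(is_blocked(padded, naive_decode))   # the padded form slips past
print(is_blocked(padded, html.unescape))  # a decoder that handles padding catches it
```

Python's standard `html.unescape` is used here only as a reference decoder that does handle the zero padding.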