Why most teams are flying blind in security operations with Bleon Proko

James Berthoty talks with Bleon Proko, cloud security researcher at Exaforce, about why cloud incident response is so much harder than traditional SOC work, what log sources most teams are ignoring, and how to build detection coverage that actually keeps up with attackers.

Summary

In this episode of SecOps Confidential, James Berthoty talks with Bleon Proko, cloud security researcher at Exaforce, about why cloud security operations are still tripping up teams that are otherwise solid at traditional SOC work. Bleon breaks down the structural gap between cloud security engineers focused on posture and SOC teams drowning in raw log sources they don't know what to do with. They get into which log sources matter most (including S3 data events and Bedrock logs that most people skip), how to approach basic detection building without getting buried in false positives, and how attackers tend to stay basic while defenders often miss things hiding in plain sight. Bleon also shares lessons from his own cloud research, including a real honeypot that caught a full threat actor team, and his framework for building detection coverage you can actually maintain.

Show Notes

Why cloud security still catches SOC teams off guard, even years into cloud adoption
The gap between CNAPP-focused cloud engineers and traditional security operations teams
Where teams go wrong, collecting logs they never actually use for detection
Which CloudTrail event categories and data sources to prioritize first
Why Bedrock logs and S3 data events are underused and what attackers do with that gap
How to approach detection engineering for cloud without drowning in false positives
Why most attacks are still basic, and why that doesn't make them easier to catch
Novel techniques like AWS Cloud Control and GCP Sys projects used for evasion and persistence
A practical framework for building cloud detection coverage: start with what you use, watch what you don't

Links

Exaforce

James Berthoty (00:01) Hello everyone, welcome back to SecOps Confidential, the place where you can get the real stories on security operations. Today I'm joined by Bleon from Exaforce. Bleon, how can the people think about you? Do a little intro for yourself here.

Bleon (00:17) Thanks for having me on the podcast. It's an honor and a pleasure to be here. My name is Bleon. I'm currently working as a researcher, a cloud security researcher at Exaforce, previously at Permiso. Before that I've been working as a security analyst for different companies in Albania, as well as independently doing pen tests and assessments mostly in Europe.

James Berthoty (00:50) Cool. Today we're going to talk about how to operationalize cloud security and incident response. Traditionally, at least in my experience, security operations teams tend to really struggle with this area. I'm interested in your theory on why, but for me it has always been because when you think of what the default security operations person is working with, it's usually Splunk plus CrowdStrike. Those tools are very much geared towards a certain kind of architecture, primarily meant for monitoring large fleets of Windows laptops and firewall logs. That seems to be probably the majority of security operations work that I've been exposed to when I see traditional SOCs.

I was actually part of bringing cloud security operations into ReliaQuest's service, and I know firsthand all of the frustrations that came along with that — all of the challenges. Really it's because we're trying to unpack a whole new type of infrastructure where every single thing can get logged and is a potential for detection. It's like this giant new playground to work with. And it's a technical skillset that most security operations teams don't have. They're very used to Windows infrastructure, but you start introducing containers and ephemeral workloads and things start getting pretty crazy. From what you've seen, why has the cloud been such a challenge for security operations teams to effectively monitor and respond to incidents on?

Bleon (02:22) I'm going to answer this question in two parts. One is not specific to cloud infrastructure. I believe the most difficult part — and having gone through this myself — is bad budgeting together with very bad management or very bad task management, together with very overworked security engineers, and lastly the products themselves offering very bad monitoring service. All of that basically happens — at least several of those will happen in many companies, no matter what infrastructure they have, no matter what requirements they have.

As far as cloud goes: cloud is not something new, it's actually been many years since we've seen infrastructure being set up in it, but the senior people doing cloud are usually people who come from infrastructure. One thing you notice if you've worked as a security engineer for on-prem infrastructures is that people have to adapt. People have to make do with whatever they have.

It's not uncommon, in the case of a breach, to not have an EDR — just an antivirus — and then close the laptop and open it somewhere to do forensics. And the forensics isn't even real forensics; you're just opening the laptop and trying to figure out what the breach is. It's totally not uncommon to not have logs stored in the SIEM because those are too noisy, or the system gets hacked so you have to physically go there to review those logs.

This creates a problem for cloud. Cloud is not controlled by us — it's controlled by the provider, or at least the backend is. If you do not enable those logs, if you do not store them somewhere, it's going to be hard for you to go back and look at them. It becomes even harder when you have to deal with specific logs of specific services that are stored in cloud which, the moment they are gone, they are gone. You cannot retrieve them anymore.

This becomes a problem of people being used to one way of doing things versus cloud providers not caring about your security beyond their own costs — because you have to care about your own security. The fact that there are a lot of services spread out that you have to look into... in my opinion, the transition from something that we are used to doing before versus something that we don't have control of is one of the main reasons why cloud has been harder to secure and detect.

James Berthoty (05:49) I want to go back into some of those log sources for if somebody is trying to spin up cloud security for the first time. But I actually want to briefly ask your opinion on bringing AI into this — cloud security was sort of the hot thing over the last 10 years, and now AI security is very much becoming the new hot thing. But the industry seems to be very confused at the moment about whether this is really a distinct discipline or not, because so many of the AI logs are really the same fundamentals — network logs, cloud logs, process and compute logs. Do you think there's some additional layer for security operations when it comes to AI and bringing AI runtime security into the SOC? Or do you think we're still trying to cover our basics on some of these cloud instances before we can even do detection and response well on AI workloads?

Bleon (06:47) When it comes to securing AI systems, I would say the logs are basically kind of the same. You have application logs, you have prompts (which are essentially injection logs), you have network logs, and then you have service maintenance logs. And also access logs to the instances where those are stored.

When it comes to utilizing AI as an added help for security — not just cloud, but any infrastructure — I always say AI is no more than a tool. Don't consider AI more than a tool; don't consider AI less than a tool. This is the problem that I see. There's this idea online that everybody wants to speak about AGI and everybody wants to speak about how in the next five years everything will be done by AI, or on the other side that AI does nothing because it's just string injection. My opinion is that AI is the same as any other tool that you use. Learn how to adapt it to your own infrastructure and then use it. If you don't see the need to do that, don't do it. If you see the need to do it, try to understand what it can offer. That's how simple it is.

We are not reinventing the wheel. We have been using automation before. We have been using rules, using automation, tasks that would go through logs or scan something. I always had automated Nessus scans when I used to work as a security engineer. Now, if you can make Nessus scans also work together with an LLM, you will have a better analysis of the entire report. Whenever I would do an automated network scan, I would have like 30,000 vulnerabilities and most of them would be something like "this SSL doesn't have this specific algorithm" — okay, we just skip that. Now that we have AI, I consider it more advanced automation. We can actually get context about whatever we do. This works for attack, this works for defense, this works for everything. You give context to something, it understands the context and gives you an output. That's a tool, and that's a very useful tool.

James Berthoty (09:32) Sounds like we're pretty well aligned on that front — AI really brings particular advantages and accelerations, but there's no fundamental change. Like Nessus is still part of that flow. People can get very abstract about "an agent's just going and doing it," and it turns out that agent's running a Nessus scan. It's not reinventing the wheel somehow.

You've done a ton of threat research and discoveries across identity planes and cloud planes. I want to start with this, since you mentioned tools as part of why this security operations transition has been difficult, and I totally agree: do you think a lot of teams are still trying to roll their own version of a cloud incident response tool? Like they're sending all of their CloudTrail and VPC flow logs to a CNAPP tool, but also sending them directly to their SIEM, and then trying to build custom detections in both and trying to work out of both tools. And if they are, what log sources are missing? Where do people typically fail to put the whole story together when they're working with a SIEM, a CNAPP, an identity tool, and maybe a SaaS tool? You end up with a picture where it can be really hard to know: do I have all the coverage I need to actually detect something malicious happening?

Bleon (11:04) I actually had this specific discussion when I first went to Horovsk-Kalcek with a guy — I'm going to be honest, I didn't catch his name, so shout out to you if you see this. His opinion was literally: "I used to work at Permiso and I'm doing what you are doing, but I'm doing it internally. I'm collecting everything and putting it in the SIEM. I have rules. I've made the basic dashboard where I have everything. Why do I need you?"

I told him: if you think that you don't need us, you don't need us. It's good that you've done that. The question to him was how well are you aware of your attacks? How good is your team at actually analyzing those attacks and finding IOCs, finding all sorts of indicators that will lead to a detection? How good are they at applying those detections? How good are they at doing threat hunting to understand if you have been breached, or have been breached by something similar? How good are they at updating them all the time? You have a lot of features — if you think that your team has all of that and can do it very well, it's totally fine. You don't need a dedicated product.

But this is the problem, and this is the hardest part to do. Not just "okay, I'm collecting everything and storing it somewhere." Okay, but how are you correlating it? Do you know all types of attacks that are happening? Are you researching new types of attacks? Are you even looking into blogs? Are you looking into different feeds to understand new things that are happening, new TTPs that are coming up, new malware that is coming up, new techniques targeting different providers and different deployments? Do you have detections that are good at adapting themselves, so that at any point they're able to detect, in a very broad scope, that an attack is happening, while generating the least amount of false positives and without missing a lot of true detections?

This is, in my opinion, the hardest part about setting up everything yourself. I see a lot of people do that. I used to work for companies that would do that — just collecting everything, putting it somewhere, writing detections. That was the hardest part. I started working at one company during the time of ProxyLogon. That was one of the hardest attacks we had to deal with, because you could deal with the most basic payloads, but the rest of our day was just going and hunting and trying to understand if something new was coming. There would always be new payloads that would pass detection. So it was constant work of one to two people just doing that. That is, in my opinion, what's hard about implementing your own full security, detection, and everything.

James Berthoty (14:31) There are two things here. One is that I don't think many people are aware of this weird gap in the industry that CNAPP created, where cloud security engineers got hired as a distinct engineering practice that was focused purely on proactive security: vulnerability management and posture management. And then on the other side you have the SOC, already overwhelmed, just getting thrown a ton of new log sources and being expected to make sense of them. You have this real-time detection gap that's existed for a lot of teams across the cloud.

In theory all of this stuff bleeds together, but most cloud security engineers just don't have comfort navigating the underlying logs and doing runtime custom detections. And at the same time, most security operations teams don't feel comfortable managing the intricacies of a Kubernetes cluster to understand the public-facing attack surface and all of those things.

I actually want to show an example of a VPC flow log, because it's my favorite example of the kind of thing you're describing — so many people just collect these logs and do nothing with them. I remember, and it's still a constant discussion, whether teams should collect their VPC flow logs at all. And if you haven't seen a VPC flow log, it's literally just: this instance on this IP connected to this port on this IP, and the traffic was allowed.

Bleon (15:58) Accepted or rejected, yeah.

James Berthoty (16:02) Unless you have this crazy robust asset inventory and you're translating all of the IP addresses and you have very specific things you're looking for, there's actually really nothing you're doing with this from a pure SIEM engineering standpoint. But what's weird is that on the other side, if you're a vendor and you have access to these logs and also access to the assets, there's actually a lot you can do with them from a security product perspective — mapping traffic flows between assets, looking for deviations, tracking egress traffic, doing DNS lookups. There's a lot of stuff you can do. But for most teams, they just send them somewhere, they eat up compute or storage, and nothing happens.
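Illustrative sketch (not from the episode): the point about flow logs only becoming useful once IPs are mapped to assets can be shown in a few lines of Python. The field order below is the default version 2 VPC flow log format; the `INVENTORY` mapping, the addresses, and the sample record are made-up placeholders.

```python
# Field order of the default (version 2) VPC flow log record.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()

# Hypothetical asset inventory; in practice this would come from a CMDB or cloud API.
INVENTORY = {"10.0.1.5": "web-frontend", "10.0.2.9": "orders-db"}


def parse_flow_record(line: str) -> dict:
    """Split one space-delimited flow log line into named fields."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def describe(record: dict, inventory: dict) -> str:
    """Replace raw IPs with asset names where the inventory knows them."""
    src = inventory.get(record["srcaddr"], record["srcaddr"])
    dst = inventory.get(record["dstaddr"], record["dstaddr"])
    return f"{src} -> {dst}:{record['dstport']} {record['action']}"


# Made-up sample record for illustration only.
sample = (
    "2 123456789012 eni-0abc1234 10.0.1.5 10.0.2.9 "
    "49152 5432 6 10 840 1700000000 1700000060 ACCEPT OK"
)
record = parse_flow_record(sample)
summary = describe(record, INVENTORY)
print(summary)
```

Without the inventory lookup the same record is just two IPs and a port, which is roughly why these logs tend to sit unused in a SIEM.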

What are the key log sources that people should make sure they have coverage for in the first place?

Bleon (17:09) I would say the data logs. You have management events in CloudTrail, but you also have data events, which include S3 and Lambda logs. One that I hadn't thought of — and this is that idea that if you don't know about something, you never use it — was Bedrock data logs. It makes complete sense for them to exist, but I never thought about it until I actually had to use them. They contain the logs of all the prompts, the questions, and the answers that the LLM produced. If you're using Bedrock, that's a very important thing to have, because —

I guess two years ago, when I was doing the AWS Quarantine Policy Bypass research, at the same time we set up a bucket permission honeypot. We enabled the logs. We ended up finding the entire threat actor team that was utilizing leaked credentials to create their own site powered by AI and others.

So I would say that the data events are one of those sources that people just collect and do nothing with. You have a huge amount of GetObject requests — why the hell are those there? Especially if they come from one identity, one source. Why does that identity need to make 10,000 requests at once? Or even 1,000 requests at once — why are they downloading all of those objects? Why are they deleting all of those objects at once? Is this normal? Was it one task that had a duplicate bucket for a specific reason and is now being deleted? Or was it something that an attacker did — trying to delete everything for some extortion, or they downloaded and then deleted?

Way too many PutObject calls on the same objects at the same time might indicate somebody is either trying to overwrite the content or, in the case of KMS ransomware, uploading a version that's been encrypted. If you don't have versioning on that bucket, you now have an encrypted, inaccessible version of the file. This is one of the things that a lot of people should not only log, but also get context from. Even just from S3 data sources, you have a lot of cases you can look into.
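Illustrative sketch (not from the episode): one simple way to turn the S3 data-event questions above into a check is to count object-level calls per identity, bucket, and time window. The field names follow the CloudTrail record layout (`eventSource`, `eventName`, `eventTime`, `userIdentity.arn`, `requestParameters.bucketName`); the threshold, window size, ARN, and bucket name are arbitrary assumptions.

```python
from collections import Counter
from datetime import datetime, timezone

WATCHED_EVENTS = {"GetObject", "PutObject", "DeleteObject"}


def find_s3_bursts(events, threshold=1000, window_seconds=300):
    """Return (identity, bucket, event_name, count) for each fixed window
    in which one identity made at least `threshold` calls of one type."""
    counts = Counter()
    for event in events:
        if event.get("eventSource") != "s3.amazonaws.com":
            continue
        if event.get("eventName") not in WATCHED_EVENTS:
            continue
        when = datetime.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ")
        epoch = int(when.replace(tzinfo=timezone.utc).timestamp())
        window = epoch // window_seconds
        identity = event.get("userIdentity", {}).get("arn", "unknown")
        bucket = (event.get("requestParameters") or {}).get("bucketName", "unknown")
        counts[(identity, bucket, event["eventName"], window)] += 1
    return sorted(
        (identity, bucket, name, count)
        for (identity, bucket, name, _window), count in counts.items()
        if count >= threshold
    )


def _event(name, second):
    # Made-up event with only the fields this sketch reads.
    return {
        "eventSource": "s3.amazonaws.com",
        "eventName": name,
        "eventTime": f"2024-01-01T00:00:{second:02d}Z",
        "userIdentity": {"arn": "arn:aws:iam::123456789012:user/example"},
        "requestParameters": {"bucketName": "example-bucket"},
    }


sample_events = [_event("GetObject", s) for s in range(5)] + [_event("PutObject", 10)]
bursts = find_s3_bursts(sample_events, threshold=5)
print(bursts)
```

A count like this only raises the question; deciding whether it was a cleanup job or an attacker still needs the context described above.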

James Berthoty (20:25) It's a great point that people don't appreciate all the intricacies of the way cloud services log their different services. Just because you turn on CloudTrail doesn't mean you have access to everything — you have to opt in to a lot of logs. Bedrock is a very similar example. AWS Glue is another. The line between what's a metric versus what is a security-relevant action is very blurred.

I think that's why we've seen a lot of older-school SIEM and EDR/XDR providers acquiring observability companies — because they need to ingest this other type of data, these low-level metrics and logs, and then try to reconstruct them to make sense of what happened.

Let's start with some basic detections, because I want to get into some of the more advanced pieces that I think Exaforce and Permiso are both great examples of — tracing identity behavior across different tools. That I think is too hard for most detection engineering teams, just not a feasible practice to sustain. So before we get there, what are some basic detection rules that people should try to build themselves from those log sources?

Bleon (21:45) Some of the old log sources, or just the ones that people usually do not use like data sources?

James Berthoty (21:52) Yeah, just your basic CloudTrail type stuff.

Bleon (21:55) Okay. My opinion is if you don't have the most basic ones — privilege escalation events — you should always enable those. These are the bare bones. If you have nothing else, enable those. In cloud, privilege escalation is almost always just updating another resource, so it's always a write event that's going to be triggered. Enable those. For a lot of services you don't use, you won't ever see that event, so you're fine. For the services you don't use at all, configure it so that any time you see that specific event, it always triggers an alert.

For other events you'll see regularly — creating roles, creating or attaching policies, inline policy additions — you can go and analyze each one of them, because you'll see a lot of events, but some you should never see. For example, with CodeStar (which got discontinued — but even with something like that): if you don't use that service and you see that event, why is it there? Why is that identity doing that when they shouldn't be? That might be Bleon just clicking around in the console trying to set up an instance. But let's presume that wasn't it. Why is Bleon doing that? Is Bleon testing, or is it actually someone else?
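Illustrative sketch (not from the episode): the "alert on any write event from a service you don't use" idea can be approximated with a small filter over CloudTrail records. The `readOnly` and `eventSource` fields are part of the CloudTrail record format; the `IN_USE_SERVICES` set, the ARN, and the sample events are hypothetical.

```python
# Hypothetical list of services this account is expected to use.
IN_USE_SERVICES = {"s3.amazonaws.com", "ec2.amazonaws.com", "iam.amazonaws.com"}


def unexpected_writes(events, in_use=IN_USE_SERVICES):
    """Return (service, event_name, identity) for write events from
    services outside the expected set."""
    alerts = []
    for event in events:
        # Treat a missing readOnly flag as a write, to err on the side of alerting.
        is_write = event.get("readOnly") is not True
        if is_write and event.get("eventSource") not in in_use:
            alerts.append((
                event.get("eventSource"),
                event.get("eventName"),
                event.get("userIdentity", {}).get("arn", "unknown"),
            ))
    return alerts


ARN = "arn:aws:iam::123456789012:user/example"  # placeholder identity
sample_events = [
    {"eventSource": "codestar.amazonaws.com", "eventName": "CreateProject",
     "readOnly": False, "userIdentity": {"arn": ARN}},
    {"eventSource": "codestar.amazonaws.com", "eventName": "ListProjects",
     "readOnly": True, "userIdentity": {"arn": ARN}},
    {"eventSource": "s3.amazonaws.com", "eventName": "PutBucketPolicy",
     "readOnly": False, "userIdentity": {"arn": ARN}},
]
alerts = unexpected_writes(sample_events)
print(alerts)
```

The design choice here is an allowlist of services in use rather than a blocklist of risky ones, which matches the "start with what you use, watch what you don't" framing.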

Next is the most common enumeration techniques. This is a hard thing to do due to how cloud works — each time you click around in the management console in any cloud provider, you're basically just executing the API. It will show you a "red marker" that you don't have access to a specific thing if you don't have access to it. But again, looking for a large burst of events coming from one specific source, which transition from one type to another to another, might be an indicator.

The next thing is more of an indicator than an event-based detection: check for unusual user agents. If you see a large number of requests with the user agent of Boto, it might be brute force. If you see a lot with the user agent of ScoutSuite, it might just be somebody scanning. It seems very basic, but this stuff shows up all the time.

There are a lot of cases where attackers don't modify anything, so we get into that metaphor of the plane with the bullet holes — we're always reinforcing the holes we already know about, but attackers are always updating their techniques and targeting the other holes. AWS has made it harder, but not impossible, to modify the user agent. Ultimately it's just an HTTP request, so an attacker can craft anything they want. Always remember: even though I'm saying to check for common malicious user agents, those will be changed. Try to check for anomalies.

This is easier said than done, because as you said, Permiso and Exaforce both do this very well, and many other MDR and security companies do too. But not a lot of people doing their own detection engineering will do this well. The question is: does an identity start and finish with the same source, the same user agent, the same device, at the times you'd expect it to be active? If you see 10 events happening at the same time that you know would take about 2 minutes to click through manually, and then immediately there's a new event from a different source — that might be an indicator of an attack.
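Illustrative sketch (not from the episode): a very small version of the baselining idea is to remember which source IP and user agent pairs each identity has used, then flag pairs that were never seen before. `sourceIPAddress` and `userAgent` are CloudTrail record fields; the ARN, addresses, and user agent strings below are placeholders, and a real baseline would also need time-of-day and session context.

```python
from collections import defaultdict


def _identity(event):
    return event.get("userIdentity", {}).get("arn", "unknown")


def _fingerprint(event):
    return (event.get("sourceIPAddress"), event.get("userAgent"))


def build_baseline(events):
    """Map each identity to the set of (source IP, user agent) pairs seen."""
    baseline = defaultdict(set)
    for event in events:
        baseline[_identity(event)].add(_fingerprint(event))
    return baseline


def new_fingerprints(events, baseline):
    """Return (identity, source IP, user agent) for pairs absent from the baseline."""
    findings = []
    for event in events:
        identity = _identity(event)
        fingerprint = _fingerprint(event)
        if fingerprint not in baseline.get(identity, set()):
            findings.append((identity, *fingerprint))
    return findings


ARN = "arn:aws:iam::123456789012:user/example"  # placeholder identity
history = [
    {"userIdentity": {"arn": ARN}, "sourceIPAddress": "198.51.100.7",
     "userAgent": "example-console-agent"},
]
today = [
    {"userIdentity": {"arn": ARN}, "sourceIPAddress": "198.51.100.7",
     "userAgent": "example-console-agent"},
    {"userIdentity": {"arn": ARN}, "sourceIPAddress": "203.0.113.50",
     "userAgent": "example-script-agent"},
]
findings = new_fingerprints(today, build_baseline(history))
print(findings)
```

Because user agents can be forged, a match against the baseline proves little; it is the unexpected change that is worth a look.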

So this is usually where I say the hardest part about detecting is starting with the basics, like understanding the most common techniques people use to exploit and privilege escalate, then understanding how they enumerate resources to get there. The next part is literally just baselining, because you can't detect an attack well if all the other indicators have been modified to look normal.

James Berthoty (28:08) That's good. There's a basic version of this stuff — like the enumeration rules — that's pretty easy to set up, but then teams quickly hit a lot of false positives. It's a constant detection tuning problem. When someone clicks on a screen they don't have access to in AWS, it'll fire off like 20 deny events all at once. If your enumeration detector says "alert if 20 events within a second," you're going to trigger a ton of false positives. So it almost requires that level of baselining.
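Illustrative sketch (not from the episode): one way to soften the "20 denies in a second" false positive is to count how many distinct services an identity was denied on, rather than the raw number of deny events, on the assumption that a single console page mostly hits one service. The error codes are ones CloudTrail commonly records in `errorCode`; the threshold and sample data are arbitrary, and the caller is assumed to pass events from a single time window.

```python
from collections import defaultdict

# Common denial error codes; other services may use different strings.
DENY_CODES = {"AccessDenied", "Client.UnauthorizedOperation"}


def broad_denies(events, min_services=3):
    """Return {identity: [services]} for identities denied on at least
    `min_services` distinct services in the supplied events."""
    denied = defaultdict(set)
    for event in events:
        if event.get("errorCode") in DENY_CODES:
            identity = event.get("userIdentity", {}).get("arn", "unknown")
            denied[identity].add(event.get("eventSource"))
    return {
        identity: sorted(services)
        for identity, services in denied.items()
        if len(services) >= min_services
    }


def _deny(arn, service):
    # Made-up denied event with only the fields this sketch reads.
    return {"errorCode": "AccessDenied", "eventSource": service,
            "userIdentity": {"arn": arn}}


CLICKER = "arn:aws:iam::123456789012:user/console-user"  # placeholder
SCANNER = "arn:aws:iam::123456789012:user/leaked-key"    # placeholder
sample_events = [_deny(CLICKER, "ec2.amazonaws.com") for _ in range(20)]
sample_events += [
    _deny(SCANNER, service)
    for service in ("ec2.amazonaws.com", "iam.amazonaws.com",
                    "lambda.amazonaws.com", "s3.amazonaws.com")
]
flagged = broad_denies(sample_events)
print(flagged)
```

This still needs tuning per environment, which is the baselining problem described above.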

And then the other piece is the identity component, which I think is what most teams are missing — having your set of root identities, tracing the role assumptions that are happening, so you can actually baseline and monitor user behavior in a way that's much more organic than just "if role starts with DevOps, then don't alert." And that's how you get stuff like the Okta breach, where a DevOps user's machine got hit by a fake gaming software installer, credentials got out as a result, and that was that.

I want to pivot, since we're getting low on time, to talk about some of your red teaming research. If someone works at a company and does a lot of their own detection engineering, and they want to get into cloud security research or start doing more red teaming type work in the cloud, where should they even start? What should they try to do and mess around with?

Bleon (30:02) I would say the first thing you need to know is what you're going to use and what you're not going to use. Cloud does a very good job of limiting you by default: in AWS a new identity can do little more than call GetCallerIdentity until you grant it more. You enable what you want to enable.

The first thing you need to ask is: why are we enabling this cloud service? What are we intending to do? Then prevent everything else. After that, understand what types of attacks can happen based on the services you're using, and build detections around those. Then continue to the third part: go and cover all the other attacks and services you do not use. The reason you do the third part is that if an attacker compromises a specific set of keys and starts enumerating other services, you otherwise won't have visibility. But if you have coverage and you detect that those credentials are being used but no access is being allowed (because you've locked everything else down), now you can do that other part of threat hunting.

At minimum: understand what you intend to use, disable everything else, build detections around what you are using, continue to monitor everything else. And then the hardest part — continually adapting your detections as new attacks come up, because there are always new things and new ways to do the same things we were doing before.

James Berthoty (32:24) A lot of those research discoveries and big exploits are very complicated identity privilege escalation chains, where you're going between different services or using assume role to pivot up and take actions you shouldn't be allowed to take. But when it comes to what attackers are actually doing day to day — do you see the lazy stuff? Brute force scanning like an Nmap equivalent, then permission enumeration, things that are very easy to detect? Or do you think people are getting better at "let me check my own permissions first, slow roll this, patiently take a foothold over time, and figure out what I'm doing without setting off alarms along the way"?

Bleon (33:18) I would go back to the analogy of the plane with the bullet holes. We always talk about the new attacker we found and the new attack that happened. But most of the time, most of the new things we're finding are very basic and rely on outdated techniques that are easy to detect. This is where we always say a company is either hacked, or they don't know they're hacked, and that might be way too much cynicism. But a technique is either known, or it's not known by anybody because nobody has spent enough time going deep on it.

To answer your question: we see a lot — and I mean a lot — of people just spamming tools. Most of the time it's "we found this, we spam tools, we get what we want and we learn." Most of the time I also suspect a lot of them are just automated scanners: "I have this credential and I will continue."

To go back to the Bedrock case I talked about: that's one example of playing in the area of the plane where there are no bullet holes. It's something that never came to mind, but the moment you see it you think, "that completely makes sense." These are the types of attacks we see less and less, mostly because we don't know the techniques — let's be fair. When we see techniques, we usually see a lot of unobfuscated stuff, people spamming tools, or a little bit of variation. Or we literally don't see anything, because we haven't reached the level of knowledge to understand how something can be weaponized.

Two examples come to mind. One is the usage of logs as an enumeration technique, and two is the usage of AWS Cloud Control as an attack tool. Both are relatively new techniques. Cloud Control in particular is interesting because it differs from standard API sessions enough that it looks completely normal — it's how AWS manages things internally, so you might not flag it at all.

And the Sys Projects in GCP are another case. They're not shown in the management console, only in the terminal console, and they're automatically generated each time somebody creates them. So it's completely normal to see 20 of them from a specific company with random IDs. Now it's your job to figure out which ones were actually created by the user and which ones were created by the attacker. If you obfuscate that well enough, you can actually persist very effectively using that technique.

Going back to the plane analogy: if you don't know about this technique, you'll brush it off as "that's just how GCP manages things, nothing unusual." But it's actually a technique an attacker is using. The Follina case was another example — there was a huge dump of random characters before the payload, and the payload was quite literally: turn your document into a zip, unzip the files, write the payload, turn it back into a document. And the moment you have it on your machine, you don't even have to open the file — it executes. These are the cases that you don't expect, and they're harder to catch precisely because we don't know the technical depth of them.

James Berthoty (38:05) There are so many edge cases in cloud that I think make it really challenging — things you just don't know. An example I often think about is the ability in AWS to refresh your session credential. Even if you've deleted a compromised user, you actually have to put an explicit block across all their actions before you delete them to make sure you've revoked their ability to refresh a session. There are just so many gotchas like that. What do you think is the best way for people to even deal with the volume of those things?

Bleon (38:43) Again, like I said before, understand what you want to use in AWS. Whenever you suspect a breach, the first thing you need to do is actually prove the breach is happening. Then go and monitor the event, monitor the identity. Now, depending on the identity's permissions, what I usually say is: if they have high permissions, limit those permissions to something less impactful, then wait to see if other semi-impactful things happen. You shouldn't care that an instance was created and spent you $20 for the five minutes it was up, because you now know the identity is breached. But you should care that the identity is able to go through all of your buckets and download or tamper with files, or go through all of your Google Drive and download all of your emails and sensitive information.

So what I would say is: if you think that is happening, restrict them from anything you consider very high-impact, then see if something happens. If you have any other indicator that the identity is breached, block them completely and go understand how the attack happened, because if they've reached that identity, there's a chance they may have already pivoted. You need to know the entire chain of attack.

Other than that, regarding temporary credentials, sessions, and all of that: I do have a research piece that I've already written. I'm just waiting to publish it because I submitted a bug to one cloud provider and I'm waiting for a response. It deals specifically with abusing temporary credentials: not just privilege escalation, but also persistence, and how you can play around with their features to stay longer with the least amount of privileges.

James Berthoty (40:55) Well, that's a great segue as we wrap up. If people want to learn more about that and follow along with your research, what's the best way to stay in touch?

Bleon (41:07) I would say LinkedIn. I do have a Twitter but I mostly don't use it — I just use it as a threat intel feed. You can always reach me on LinkedIn anytime you have any questions. And also check out Exaforce's blog. You'll see my newest research there. That's how you can stay in touch with what I do.

James Berthoty (41:37) Sounds good. Well, thank you so much for your time today, and I appreciate everyone tuning in to SecOps Confidential.

Bleon (41:44) Thank you for having me.

James Berthoty (41:46) See ya.
