Panic At The Ingest

Right-sizing your SIEM ingest is a delicate task. Too much, and you pay for capacity that you don’t need. Too little, and you constantly have to worry about going over (and possibly dropping things that you actually need).

Eventually, you settle on a level that works. Ish – inevitably, you will have occasions where things spike. What do you do when that happens?

Time to panic!

Not really. Spikes in ingest volume should be a sign to investigate, not to panic. Here’s what we did when a spike unexpectedly pushed us up against our ingest limit.

Step 1: Don’t Panic

Really, don’t panic. We’ve all heard the horror stories about giant SIEM bills. Crazy, unexpected, giant SIEM bills, like the $65 million Datadog bill. Or the joke that Cisco bought Splunk because it was cheaper than paying the bill (a joke that made the rounds in several places, including Reddit and X/Twitter). But generally speaking, SIEM companies aren’t terrible, and your vendor shouldn’t hit you with a ridiculous fee for a one-time, reasonable overage. Shout out to Panther for being a great partner: our rep followed up to make sure we were aware of and managing the spike!

Step 2: Dig

So you are aware of a spike, whether through alerting you have configured, an overage alert, or something else. Now you have to do something with that information: it’s time to dig. Some questions to ask (with a quick query sketch after the list):

  • Is the spike within normal fluctuations, and did it just happen to hit toward the end of your billing cycle?
  • Can the volume be attributed to a specific source?
  • What event(s) are causing the issue?
  • Can you narrow it down at all – to a specific account, location, etc.?
  • Does it appear malicious?
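
If your SIEM lets you export or query recent events, even a rough tally can answer the attribution question quickly. Here is a minimal sketch, assuming a hypothetical JSON-lines export of the spike window; the field name (p_source_label) follows Panther’s normalized-field style but is an assumption here, so swap in whatever your SIEM actually exposes.

```python
import json
from collections import Counter

def bytes_by_source(path: str) -> Counter:
    """Tally approximate ingest bytes per log source from a JSON-lines export."""
    totals = Counter()
    with open(path) as fh:
        for line in fh:
            event = json.loads(line)
            source = event.get("p_source_label", "unknown")
            # Use the serialized size as a rough proxy for billed bytes.
            totals[source] += len(line.encode("utf-8"))
    return totals

if __name__ == "__main__":
    for source, size in bytes_by_source("spike_window.jsonl").most_common(10):
        print(f"{source}: {size / 1_000_000:.1f} MB")
```

Even a crude pass like this usually makes the top offender obvious.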

Fluctuations

Fluctuations happen. And some of us have our ingest levels tuned so finely that a normal fluctuation hitting the billing cycle at just the right time can push us over the limit. If that’s what happened, maybe it’s time to bump up your ingest level. If that’s not an option, can you filter or sample events more aggressively to give yourself some wiggle room? Ask the hard questions: Are you getting value from the data? Do you really need to be actively monitoring and detecting on it? Or could it go to cheaper archival storage and be brought into your SIEM as needed? Yes, “log everything, always.” But there is a cost that has to be managed.
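
If filtering or sampling is on the table, the idea is to keep everything you actually detect on and thin out the purely informational noise before it hits your ingest. A minimal sketch, assuming a hypothetical pre-ingest step where you control what gets forwarded; the level names and sample rate are illustrative, not a recommendation.

```python
import random

# Hypothetical pre-ingest filter: always forward events we alert on,
# sample the chatty informational ones, and drop debug noise entirely.
KEEP_ALWAYS = {"ERROR", "WARN", "SECURITY"}
INFO_SAMPLE_RATE = 0.10  # keep roughly 10% of INFO events

def should_forward(event: dict) -> bool:
    level = str(event.get("level", "INFO")).upper()
    if level in KEEP_ALWAYS:
        return True
    if level == "DEBUG":
        return False
    return random.random() < INFO_SAMPLE_RATE
```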

Source

Next up is determining whether a specific source is causing the issue. Did a higher level of logging get turned on that resulted in more logs (looking at you, “debug” level!)? Did a log source change how things are logged? Schemas change, the way logs are pulled changes, and deployments grow; all of these can bump up ingest. Figure out where the events are coming from and whether there’s anything you need to do with the source to address it.
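
One quick way to spot a “someone turned on debug” situation is to look at the level mix for the suspect source. A sketch, again assuming a hypothetical JSON-lines export and a level field; adjust the field name to your source’s schema.

```python
import json
from collections import Counter

def level_mix(path: str) -> None:
    """Print what share of a source's events each log level accounts for."""
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            counts[str(json.loads(line).get("level", "UNKNOWN")).upper()] += 1
    total = sum(counts.values()) or 1
    for level, count in counts.most_common():
        # A sudden pile of DEBUG usually means someone cranked up verbosity.
        print(f"{level}: {count} ({100 * count / total:.1f}%)")

level_mix("suspect_source.jsonl")
```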

Narrow

Once you know the source causing the issue, can you narrow it down further? Is it related to a certain account, region, or whatever else? Is it a specific type of event? Is anything going on that would explain it, like deploying a new environment? Narrow things down as much as possible to make the fix easier.
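
The same tally trick from earlier works for narrowing; just make the grouping key more specific. A sketch using hypothetical CloudTrail-style field names (recipientAccountId, awsRegion, eventName); substitute whatever dimensions your noisy source provides.

```python
import json
from collections import Counter

def narrow(path: str, top: int = 10) -> None:
    """Count events by (account, region, event name) to see where volume concentrates."""
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            event = json.loads(line)
            counts[(
                event.get("recipientAccountId", "?"),
                event.get("awsRegion", "?"),
                event.get("eventName", "?"),
            )] += 1
    for (account, region, name), count in counts.most_common(top):
        print(f"{account} {region} {name}: {count}")

narrow("noisy_source.jsonl")
```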

Malicious

If you are on the detection and response side, this is probably where your mind went first. And that’s okay. If it isn’t where you went first, now’s the time to make sure nothing malicious is behind the spike. Are you seeing a ton of encryption events that could indicate ransomware? A spike in compute that might relate to crypto-mining? Strangeness in DNS or other network traffic that might indicate command and control? The possibilities are endless! (That’s what makes detection and response fun, right?)
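
If your SIEM supports detection-as-code, it can be worth dropping a quick rule on the suspicious pattern while you investigate the volume. Here is a minimal Panther-style rule sketch for the command-and-control angle; the query_name field and the domain list are illustrative assumptions, not a real schema or real indicators.

```python
# Hypothetical Panther-style rule: flag DNS queries to domains flagged
# as suspicious during the investigation.
SUSPECT_DOMAINS = {
    "example-c2.invalid",  # placeholder indicator, not a real IOC
    "another-c2.invalid",
}

def rule(event) -> bool:
    # 'query_name' is an assumed field name; use whatever your DNS schema calls it.
    domain = str(event.get("query_name") or "").rstrip(".").lower()
    return domain in SUSPECT_DOMAINS

def title(event) -> str:
    return f"DNS query to suspect domain: {event.get('query_name', '<unknown>')}"
```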


Step 3: Fix It

Once you have identified the problem, it’s time to fix it. If it was something malicious, you kick off your IR process and deal with that; you may not be able to shut the noisy logging down, because you need the data for the investigation and/or have bigger problems to deal with.

If it’s not malicious, what can you do to limit the damage? Ideally, you want to trim as narrowly as possible, and what you can do will depend heavily on your tools. In our case, the spike was temporary and related to an infrastructure deployment, so we could eliminate just those logs and limit the damage for the 24 hours or so we needed before our ingest reset. Panther’s Normalized Event Filters made this fix take about 30 seconds to implement, and the log source visualizations let us easily keep an eye on the filtered events and see when things settled down. This was the easiest part, which was nice.
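
I won’t reproduce the exact filter configuration here, but the shape of the condition is simple: drop the deployment-related noise, keep everything else. Expressed as a Python predicate purely for illustration (the field names and values are assumptions, and this is not Panther’s filter syntax):

```python
# Conceptual stand-in for the ingest filter we applied:
# drop only the deployment-related noise, keep everything else.
def keep_event(event: dict) -> bool:
    is_deploy_noise = (
        event.get("p_source_label") == "infra-deploy"  # assumed source label
        and str(event.get("eventName", "")).startswith("Describe")  # assumed noisy call
    )
    return not is_deploy_noise
```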

You’ll also need to determine whether a longer-term fix is needed, and yes, that includes the possibility of increasing your ingest; what’s appropriate depends on what caused the spike. You may also need to talk with your vendor if the spike caused an overage. Ideally, you have a good relationship and things like this rarely happen, so everything is fine. We have a good working relationship with Panther, so I wasn’t worried, and that let me focus on the problem rather than panicking about a potential bill. Our rep followed up, and everything is good. I am also sure that if I had needed help tuning the filters, a support engineer would have helped quickly.

Step 4: Rejoice

You’ve figured out and fixed the problem – now you can rejoice! Or at least take a breath. You may need to follow up to ensure it was a one-time thing. And we could probably all benefit from doing more event filtering and/or aggregation, but that’s for another time. You’ve solved the problem at hand.

Page Glave

Page is a Security Engineer at FloQast focusing on detection engineering and incident response. Outside of the infosec bubble, she enjoys music, creative pursuits, and the outdoors.


