Becoming a Data-Driven SOC

In our previous installment of the Detection-as-Code series, we introduced the problem of Alert Fatigue in the Security Operations Center (SOC) and how the Security Team here at FloQast planned to fight it off. As a reminder, Alert Fatigue boils down to a Security Analyst becoming so tired and overwhelmed by the influx of alerts that they could mistake an actual security incident for a false positive, simply because of the sheer number of alerts that come in every day. Our first steps against Alert Fatigue, as described in the previous article, were to build automations around our tools to make closing out tickets much simpler. However, those automations only improved our processes; they didn't change anything in our SIEM detection infrastructure.

A Shot in the Dark

When you're trying to improve your detections, it can be difficult to figure out what to focus on. As you ingest more logs, the list of LogTypes grows, and with it the list of detections. On top of that, you still can't forget the detections for your already existing LogTypes. So where do you even start?

We found ourselves asking the same thing when deciding what to improve during our monthly detection lifecycle reviews. We took a simple approach at first: query the alert counts by detection each month, sort in descending order, and focus on the top 10. Then there was the anecdotal approach: "I remember these tickets taking a long time; maybe we should add those to the list too?" Sure, this may work at first, but once the next month rolls around, you'll find yourself back at square one when you realize the same detections are still firing at the same frequency as the month prior. Then you might find yourself asking the same question we did: now what?

A Shot in the Light

Cue Detection Metrics. These are metrics we implemented into our tickets so that we weren't just guessing; we were taking a step toward being data-driven. Gone are the days of "I think we should improve this detection," and here are the days of "the numbers say we should improve this detection." But how did we accomplish this? We did it with only FIVE fields.

  1. Classify EventType/LogType: Easily classify your tickets by similar data types
    • Find a way to automatically extract this using Regex. In our workflow, we achieved this with Jira Automations whenever a ticket is opened.
  2. Classify Detection: Classify your tickets by the detection that triggered them
    • The same Regex extraction approach works here as well.
  3. Investigation Time: Figure out which tickets are taking the longest to triage. Just because a ticket is marked "In Progress" for an hour doesn't mean it actually took an hour (for example, the Analyst could just be waiting on a user response). Measuring the time the Analyst actually spent taking action gives a more accurate picture of what is taking place. We chose five options:
    • < 5 minutes
    • 5-15 mins
    • 15-30 mins
    • 30-60 mins
    • > 1 hour
  4. Classification: Specify the outcome of the alert. This helps us determine how we should improve the detection. For example, are there a lot of False Positives? Can we implement logic to filter them out? We chose five classifications:
    • True Positive – an actual event that needed investigation.
    • False Positive – there was no actual event that needed further escalation.
    • Expected Activity – routine accounts and activities that we still need to monitor for the off chance that something bad occurs.
    • Confirmed Activity – a user has confirmed that they performed the activity and given a business case for their actions.
    • Security Testing – a member of the team was testing an action.
  5. Resources Used: A list of all the tools or resources used to triage the ticket. This helps us find ways to automate the triage process and build automation around tools that appear often for a given detection.
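To make the first two fields concrete, here is a minimal sketch of the kind of Regex extraction an automation rule could run when a ticket opens. The summary format and field names below are hypothetical; your alert titles will dictate the actual pattern.

```python
import re

# Hypothetical ticket summary format, e.g. "[AWS.CloudTrail] Root Account Login triggered ..."
# Real alert titles will differ; the extraction idea is the same.
SUMMARY_RE = re.compile(r"^\[(?P<log_type>[\w.]+)\]\s+(?P<detection>.+?)\s+triggered")

def classify_ticket(summary: str) -> dict:
    """Pull the LogType and Detection name out of a ticket summary."""
    match = SUMMARY_RE.match(summary)
    if not match:
        # Fall back to a bucket an Analyst can fix by hand later.
        return {"log_type": "Unknown", "detection": "Unknown"}
    return {"log_type": match.group("log_type"), "detection": match.group("detection")}

print(classify_ticket("[AWS.CloudTrail] Root Account Login triggered for prod"))
```

The same pattern works whether the automation lives in Jira, a SOAR playbook, or a small script in your pipeline; the point is that the two classification fields get filled consistently without an Analyst typing them.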

Using the Data

Now that we'd been collecting data for a while, we had to figure out how to use it. Since we use Jira for our ticketing, we started querying by detection and building simple visualizations of our data. But Jira's dashboarding capabilities weren't dynamic or robust enough for us to take guided action on our data because, let's face it, that isn't what Jira is for. So we looked in a different direction: SuperBlocks. This tool changed the game for us. Now we could create complex, custom dashboards to use and manipulate our data in real time. We built comparable dashboards for each detection so we could take in the data at just a glance, and we made it even easier with simple date-range filters.
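The aggregation behind a dashboard like this can be sketched in a few lines. The ticket records below are invented stand-ins for data exported from Jira; the real schema and query layer will differ.

```python
from collections import Counter
from datetime import date

# Hypothetical exported ticket records; in practice these come from Jira.
tickets = [
    {"detection": "Impossible Travel", "opened": date(2023, 5, 2), "classification": "False Positive"},
    {"detection": "Impossible Travel", "opened": date(2023, 5, 9), "classification": "Expected Activity"},
    {"detection": "Root Account Login", "opened": date(2023, 5, 11), "classification": "True Positive"},
]

def counts_by_detection(tickets, start, end):
    """Count alerts per detection within a date range, most frequent first."""
    counter = Counter(
        t["detection"] for t in tickets if start <= t["opened"] <= end
    )
    return counter.most_common()

print(counts_by_detection(tickets, date(2023, 5, 1), date(2023, 5, 31)))
```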

Selecting a Target

Now that we've collected our data, ingested it, and made it human-legible for our security engineers, it's time to put it to use. To select the right detection, we filter to the last month and first look for detections with high alert counts whose investigation times fall outside the "< 5 mins" bucket. By focusing on detections that take longer to triage on average, we can quickly give time back to our Analysts by better automating their processes. We chose this route because we felt the low-hanging fruit was decreasing investigation time across the board first, meaning making sure our automation collects all the information necessary to close tickets. Next, we look at the Classification. Are there detections with a high count of False Positives? What about Expected Activity: could we create exceptions around it? After this, if the list hasn't been narrowed down enough, we look at Resources Used. Do some detections require our Analysts to open a lot of different tools to triage? Could we automate that?
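The first filter above can be expressed as a small scoring function. The thresholds and field names here are illustrative assumptions, not our production values.

```python
# A sketch of the triage-target filter: high alert volume plus a high
# share of investigation times outside the fast bucket.
FAST = "< 5 mins"

def select_targets(detection_stats, min_alerts=10, min_slow_ratio=0.5):
    """Return (detection, alert_count, slow_ratio) tuples worth improving.

    detection_stats maps a detection name to the list of investigation-time
    buckets recorded on its tickets for the period.
    """
    targets = []
    for name, times in detection_stats.items():
        if len(times) < min_alerts:
            continue  # too little volume to be worth tuning first
        slow_ratio = sum(1 for t in times if t != FAST) / len(times)
        if slow_ratio >= min_slow_ratio:
            targets.append((name, len(times), round(slow_ratio, 2)))
    # Highest volume first, so the biggest time sinks surface at the top.
    return sorted(targets, key=lambda row: row[1], reverse=True)
```

From here, the Classification and Resources Used breakdowns serve as tie-breakers on whatever this shortlist returns.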

Verifying Results

The last part of the equation is validating that what you're doing actually works. And just as the rest of this article suggests, we want the numbers to show that it's working, not to say, "I think I've noticed fewer of these alerts over the last month; it must have worked!" So we built a special dashboard called the Detection Comparer. It takes one detection and compares the data from two different date ranges so we can verify that our improvements are actually working. We show all of the metrics detailed above, along with the total count of alerts fired and the average First Response time. We made comparison easy with not only visualizations but side-by-side tables, so all of our data lines up. Now we know that our changes are working.
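The core of a before/after comparison like this is simple to sketch. The metric names below are invented for illustration; a real comparer would pull the same five fields described earlier for each date range.

```python
# A sketch of the Detection Comparer's before/after diff for one detection.
def compare_ranges(before, after):
    """Compare metric dicts from two date ranges.

    Returns each metric with its before/after values and the delta,
    so a negative delta on alert_count means the tuning reduced volume.
    """
    report = {}
    for metric in before:
        b, a = before[metric], after[metric]
        report[metric] = {"before": b, "after": a, "delta": a - b}
    return report

baseline = {"alert_count": 120, "false_positives": 90}
post_tuning = {"alert_count": 45, "false_positives": 12}
print(compare_ranges(baseline, post_tuning))
```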

How Does it Feel to be Data Driven?

After collecting data for about six months and having our dashboard built out for about two, we can now say with confidence that being data-driven has made our efforts much more focused and much more efficient. Now we can really dig into the detections that are taking up excess time for our team, decreasing the overall time spent triaging tickets and increasing the time spent improving our security posture. We're also very excited to discuss our future plans for automation based on our new analytics. But I'll save that for another day.

Ryan Cox

Ryan is a Security Engineer at FloQast with a focus in Detection and Incident Response. Outside of InfoSec, he's an avid golf/basketball/tennis player, weightlifter, guitarist, and loves to travel.
