Automating the SOC: Phase 1
Continuing our Detection-as-Code (DaC) series, we’ll be digging into a newly implemented technology and how the first Phase of its rollout has significantly decreased ticket triage time. But first, let’s recap what we’ve previously covered. We started this series by Putting DaC into Practice and Integrating our SIEM with CI/CD. This laid the foundation for our Security Operations Center, but it also brought new challenges in integrating two very complex pieces. Once we solved issues of scalability and trained the team on the new infrastructure, we started our ongoing battle with Alert Fatigue. We took the first step by automating many of our processes and took the second step by improving our detection lifecycle to become a data-driven Security Operations Center (SOC). Now it’s time to talk about step 3, and in my opinion, the fun part: SOAR Automation.
What is a SOAR?
SOAR is an abbreviation for Security Orchestration, Automation, and Response. In short, these tools allow you to create automated workflows for virtually any use case. The main use case for Phase 1 of our SOAR initiative, however, was to create an automated workflow for each of our SIEM Detections to decrease the time it takes our Security Analysts to triage a ticket.
As mentioned in our previous article, Becoming a Data-Driven SOC, we began collecting metrics on our detections in order to drive our detection improvement lifecycle. This allows us to find detections that take our Analysts excess time to triage and identify opportunities to save time. If you recall, one of the metrics we collect is Resources Used – a Select All field on our tickets where an Analyst checks every tool they used to triage a ticket from start to finish. From a top-down perspective, we can see which tools are used most frequently for each detection, which serves as a starting point for automating our recon process.
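To make the idea concrete, here is a minimal sketch of how the Resources Used field can be tallied per detection to surface automation candidates. The ticket structure and tool names are hypothetical stand-ins, not our actual ticket schema.

```python
from collections import Counter

# Hypothetical sample of closed tickets; "resources_used" mirrors the
# Select All field an Analyst fills in during triage.
tickets = [
    {"detection": "Brute Force Attempt", "resources_used": ["VirusTotal", "SIEM Search"]},
    {"detection": "Brute Force Attempt", "resources_used": ["SIEM Search"]},
    {"detection": "Impossible Travel", "resources_used": ["VirusTotal"]},
]

def top_resources(tickets, detection):
    """Count which tools Analysts reached for most often on a given detection."""
    counts = Counter()
    for ticket in tickets:
        if ticket["detection"] == detection:
            counts.update(ticket["resources_used"])
    return counts.most_common()

print(top_resources(tickets, "Brute Force Attempt"))
# [('SIEM Search', 2), ('VirusTotal', 1)]
```

The tools at the top of this list for a noisy detection are the ones worth wiring into an automated workflow first.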
First, let me detail the exact effect we’re aiming to accomplish. In our current workflow, our SIEM triggers a detection based on a Log, then sends a ticket to Slack and to Jira. These tickets include a title, a runbook, and alert context (pieces of important information). In Phase 1 of our efforts, we want our SOAR to sit between our SIEM and Slack. Our SIEM would send the Alert to our SOAR, which would kick off an automated workflow based on which detection fired, automatically collect the intelligence needed to close the ticket, then populate the Slack ticket with the enriched information. This means that in a perfect world, the Security Analyst triaging a ticket would have all the data needed to close it in one place in Slack. Easy, right?
Building it Out
The first step to achieving our automated triage was to build a workflow that would accept all tickets and filter them to the correct sub-flow. Our SOAR functions with a single Webhook that accepts all alerts from our SIEM, extracts the LogType of the Alert, and filters it to a sub-flow for that specific LogType. Inside this sub-flow, the Alert is sorted again based on the Detection. Once sorted, the Alert traverses a branch of the sub-flow that automatically calls the APIs suggested by the metrics we’ve been collecting. This is a long, iterative process that involves quite a bit of tweaking to find which information gives the best context around the detection and trimming excess data. But once it’s finished, the results are incredible.
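The routing layer described above can be sketched as a simple dispatch table. This is a hypothetical illustration of the pattern, not our SOAR vendor's actual API: one webhook entry point, keyed fan-out by LogType and Detection.

```python
# Hypothetical routing layer: one webhook handler fans alerts out to a
# branch registered per (LogType, Detection) pair.
SUB_FLOWS = {}

def sub_flow(log_type, detection):
    """Decorator that registers an enrichment branch for one detection."""
    def register(fn):
        SUB_FLOWS[(log_type, detection)] = fn
        return fn
    return register

@sub_flow("Okta", "Brute Force Attempt")
def brute_force_branch(alert):
    # A real branch would call out to enrichment APIs here.
    return {**alert, "enriched": True}

def handle_webhook(alert):
    """Single entry point for all SIEM alerts."""
    key = (alert["log_type"], alert["detection"])
    branch = SUB_FLOWS.get(key)
    if branch is None:
        return alert  # no branch yet: pass the ticket through untouched
    return branch(alert)
```

The pass-through default matters: any detection without a registered branch still reaches Slack as a normal ticket, so building out sub-flows can happen incrementally.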
Let’s take a look at a sample use case that practically any SOC sees on a daily basis: a Brute Force Alert. First, let’s think about what we’re trying to figure out when one of these alerts pops up: did the user actually forget their password, or is someone trying to brute force into their account? Let’s be honest: 99% of the time we see this alert, the user just forgot their password. But what about that 1%? We don’t want our analysts to skip over these alerts just because they see them often, but we also don’t want them to have to take the tedious steps of querying the user’s activity to figure out what’s really going on. So, we did it for them:

1. VirusTotal: The first step for any Alert that includes an IP is to run a lookup to make sure it’s not coming from a known malicious IP. We’ve saved the analyst time by extracting it from the AlertContext, automatically running the lookup, and including the results in the ticket.

2. SIEM Indicator Search: Let’s see if this IP has occurred in our logs recently. Our workflow kicks off a search over the last three days and includes an easy link to view the results from the ticket.

3. Auto-Run Query – Top Successful Login Locations (Last 90 Days): We implemented this query to show the Analyst where the user has successfully authenticated over the last three months. Do any of these IPs match? If so, we can close this with confidence that the user just forgot their password.

4. Auto-Run Query – Last 5 Successful Logins: Was the last query not good enough? Well, here are the last 5 locations the user successfully authenticated from, and the timestamps when it happened.
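The four steps above can be sketched as a single enrichment function. The lookup helpers below are hypothetical stand-ins for the real VirusTotal and SIEM API calls (which need credentials and network access), so the sketch only shows how the results get stitched onto the alert before it reaches Slack.

```python
# Stand-in lookups; real implementations would call the VirusTotal API
# and the SIEM's search API.
def virustotal_lookup(ip):
    return {"ip": ip, "malicious_votes": 0}  # placeholder verdict

def siem_indicator_search(ip, days=3):
    return {"query": f"indicator={ip} last {days}d",
            "results_link": "https://siem.example/search/abc123"}  # hypothetical URL

def top_login_locations(user, days=90):
    return [{"ip": "203.0.113.7", "logins": 41}]

def last_successful_logins(user, n=5):
    return [{"ip": "203.0.113.7", "time": "2024-01-02T09:14:00Z"}] * n

def enrich_brute_force(alert):
    """Run all four recon steps and attach the results to the ticket."""
    ip, user = alert["source_ip"], alert["user"]
    return {
        **alert,
        "virustotal": virustotal_lookup(ip),            # step 1
        "indicator_search": siem_indicator_search(ip),  # step 2
        "top_locations": top_login_locations(user),     # step 3
        "recent_logins": last_successful_logins(user),  # step 4
    }
```

Everything the Analyst previously had to gather by hand now arrives attached to the ticket itself.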
As you can see from the example, Phase 1 can really cut down the time it takes to triage a ticket. Our Analysts used to have to run queries manually, which can be a tedious process if you can’t find the runbook or can’t remember the names of the saved queries used for a specific alert. But now, it’s all done for us. So now we’re looking at how we could improve this process even further. Take the same example as above, and assume a Brute Force attempt alert came through where the source IP is the user’s most frequent successful login location over the last 90 days. This is clearly a case where the user either forgot their password or fat-fingered it a few times. So, what if this alert never actually fired and instead closed automatically?
Cue Phase 2: The Auto-Close. In this Phase, we’ll take the intel we gathered in Phase 1 and build logic around it to auto-close our alerts. This could be as simple as IP matching, but it also comes with more complex use cases. This will surely be an iterative process, just as our detection lifecycle is. Stay tuned as we slowly take down Alert Fatigue once and for all.
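As a preview, the simple IP-matching case could look something like this. This is a hypothetical sketch of the Phase 2 idea, not something we have shipped: it assumes the alert has already been enriched with a VirusTotal verdict and the user's top historical login locations.

```python
def should_auto_close(alert):
    """Close a brute-force alert only when the source IP is both clean on
    VirusTotal and one of the user's established login locations."""
    vt_clean = alert["virustotal"]["malicious_votes"] == 0
    known_location = alert["source_ip"] in {
        loc["ip"] for loc in alert["top_locations"]
    }
    return vt_clean and known_location
```

Both conditions must hold: a familiar IP with malicious votes, or a clean but never-before-seen IP, still lands in front of an Analyst.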