The Zen of Security Rules

The Zen of Security Rules (ZoSR) is a set of 22 aphorisms, inspired by the Zen of Python, intended as succinct guiding principles for sound rule development.

The adoption of rule-based security products, and of the major security operations built around them, continues to grow significantly. As adoption scales, so do the size of the environments employing these products and the size of the rule sets themselves. Optimized rules are paramount to sustaining this growth. Alert fatigue is one major concern; resource exhaustion is another. Yet too few resources, and too little focus, are devoted to writing elegant, simple, performant rules.

The Zen of Security Rules is intended to be a catalyst for a more collaborative approach to detection engineering. At Elastic, we have been building SIEM and EDR rules in the open since their inception, and have a firm grasp on how to empower community collaboration.

When it comes to designing the perfect rule, the reality is that there are more things you shouldn’t do than things you should. Simplicity is key, and applying the lessons from the Zen goes a long way.

This section breaks down some of the more applicable principles, using real-world examples from several of the most popular rule sets, including Elastic’s SIEM and EDR rules, to show what to do and what not to do. It discusses the pitfalls of poor rule design and the elegance that is possible, leading to sustainable detection engineering at scale, even in the largest environments.

We will scrutinize every detail of the process of building detection logic. In addition to the rules themselves, we will look at the resulting search results, the alerts, and the development process within GitHub, to identify unnecessary pain points during development.

Fig. 81: The Zen of Security Rules

Achieving Zen#

1. Almost all points from the Zen of Python are applicable to security rules - start there#

It’s true - you should check them out!

Fig. 82: The Zen of Python

Rules, and especially detection logic, are a lot like code, which is why this makes so much sense.

2. Favor inclusion-by-exception over exclusion-by-exception, or else endure perpetual whack-a-mole#

First, what is meant by inclusion and exclusion by exception? They are inherent patterns found in most rule logic, though they may not always be obvious. Distilled to a simple abstraction, they are two approaches to balancing assumptions and unknowns against the intent of the rule.

Let’s say you want to detect suspicious files created by shell applications that were spawned by a web server.

sequence with maxspan=5m
  [process where event.type == "start" and
     process.parent.name in ("apache2", "httpd") and
     process.name in ("sh", "bash", "zsh")] by process.entity_id
  [file where event.type == "creation" and
     // inclusion by exception
     file.path : ("/tmp/*", "/root/*", "/etc/*")] by process.parent.entity_id

sequence with maxspan=5m
  [process where event.type == "start" and
     process.parent.name in ("apache2", "httpd") and
     process.name in ("sh", "bash", "zsh")] by process.entity_id
  [file where event.type == "creation" and
     // exclusion by exception
     not file.path : ("/var/www/*", "/home/web/tmp/*")] by process.parent.entity_id

In this scenario, the assumption is about which files the web server would legitimately write, and which are therefore not malicious. In the first query, each suspicious path is explicitly included, and nothing else alerts. In the second, each known-legitimate path is explicitly excluded, and everything else alerts.

If there is ambiguity, the exclusion-based logic will likely require constant tuning to exclude newly discovered legitimate file paths that generate false positives. This process is colloquially referred to as whack-a-mole.

The other side of the trade-off is that the inclusion-based approach will miss any malicious file paths that were not anticipated. Ultimately it is about balance: does the value of identifying the unknown threat justify the additional maintenance of the rule and the noise of the false positive alerts it generates?
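The trade-off can be made concrete with a small Python model of the two file filters. This is only a sketch under stated assumptions: `fnmatchcase` globbing stands in for the EQL `:` wildcard operator (which, unlike this stand-in, is case-insensitive), and the sample paths are hypothetical, chosen purely for illustration.

```python
from fnmatch import fnmatchcase

# Patterns copied from the two queries above.
INCLUDED = ("/tmp/*", "/root/*", "/etc/*")      # inclusion by exception
EXCLUDED = ("/var/www/*", "/home/web/tmp/*")    # exclusion by exception


def matches_any(path: str, patterns: tuple) -> bool:
    """Return True if the path matches at least one glob pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def inclusion_rule_alerts(path: str) -> bool:
    # Alert only on paths explicitly listed as suspicious.
    return matches_any(path, INCLUDED)


def exclusion_rule_alerts(path: str) -> bool:
    # Alert on everything except paths explicitly listed as legitimate.
    return not matches_any(path, EXCLUDED)


# Hypothetical file paths (assumptions, not taken from any real rule set).
samples = {
    "/tmp/shell.php": "malicious, anticipated location",
    "/dev/shm/payload": "malicious, unanticipated location",
    "/var/cache/app/session.dat": "legitimate, unanticipated location",
    "/var/www/uploads/logo.png": "legitimate, anticipated location",
}

for path, description in samples.items():
    print(
        f"{path:28} inclusion={inclusion_rule_alerts(path)!s:5} "
        f"exclusion={exclusion_rule_alerts(path)!s:5} ({description})"
    )
```

In this model the inclusion filter misses the unanticipated malicious path, while the exclusion filter alerts on the unanticipated legitimate path, which is exactly the kind of alert that has to be tuned out later.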

3. Have a propensity towards performance; expensive rules must justify the cost#

// only searches processes
from logs-endpoint.events.process*
| where event.type == "start" and process.name == "cmd.exe"

// searches ALL datasources for ALL event types
from logs-*
| where event.category in ("process", "file", "registry", "network") ...

7. Detect behaviors over IOCs, except when necessary#

Rather than writing logic to detect an IOC such as a binary hash, make the rule more resilient by capturing the behavior instead. Of the two queries below, the first matches a single known hash, while the second looks for a change in UID from non-root to root after the execution of a file from the temp directory.

process where 
  // known dirtycow exploit POC  
  process.hash.sha256 == "df34e9d762c2e604ca92f005965b39f3d5c491ae429c86602f59d50276e01130"

sequence with maxspan=5m
  [process where user.id != "0"] by process.entity_id
  [process where user.id == "0" and 
     process.executable : "/tmp/*"] by process.parent.entity_id
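A rough Python illustration of why the hash-based rule is brittle: changing a single byte of a binary changes its SHA-256, so the IOC check stops matching, while a behavioral predicate does not care what the bytes are. The byte strings, function names, and the flattened parent/child arguments below are hypothetical simplifications of the sequence query above, not a real implementation of it.

```python
import hashlib

# Two hypothetical builds of the same tool; they differ by a single byte.
build_a = b"exploit payload v1"
build_b = b"exploit payload v2"

# IOC-style detection: the rule only knows the hash of the first build.
KNOWN_BAD_SHA256 = hashlib.sha256(build_a).hexdigest()


def ioc_rule(binary: bytes) -> bool:
    """Alert when the binary's hash equals the single known-bad hash."""
    return hashlib.sha256(binary).hexdigest() == KNOWN_BAD_SHA256


def behavior_rule(parent_uid: str, child_uid: str, child_executable: str) -> bool:
    """Alert when a non-root parent has a root child executing from /tmp."""
    return (
        parent_uid != "0"
        and child_uid == "0"
        and child_executable.startswith("/tmp/")
    )


for name, binary in (("build_a", build_a), ("build_b", build_b)):
    print(
        name,
        "ioc:", ioc_rule(binary),
        "behavior:", behavior_rule("1000", "0", f"/tmp/{name}"),
    )
```

The second build evades the hash check but still trips the behavioral check, because the behavior (a root process under a non-root parent, running from the temp directory) is unchanged.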

10. Detection logic formatting should emphasize logical precedence and grouping#

The two queries below contain the same logic, but the first is much easier to understand than the second. Its formatting makes precedence and grouping obvious, reducing the chance of errors when reading or tuning it later.

network where host.os.type == "windows" and network.protocol == "dns" and
    process.name != null and user.id not in ("S-1-5-18", "S-1-5-19", "S-1-5-20") and
    /* Add new WebSvc domains here */
    dns.question.name :
       (
        "raw.githubusercontent.*",
        "pastebin.*"
        ) and
        
    /* Insert noisy false positives here */
    not (
      (
        process.executable : (
          "?:\\Program Files\\*.exe",
          "?:\\Windows\\System32\\WWAHost.exe"
        )
      ) or
    
      /* Discord App */
      (process.name : "Discord.exe" and (process.code_signature.subject_name : "Discord Inc." and
       process.code_signature.trusted == true) and dns.question.name : ("discord.com", "cdn.discordapp.com", "discordapp.com")
      ) or 

      /* MS Sharepoint */
      (process.name : "Microsoft.SharePoint.exe" and (process.code_signature.subject_name : "Microsoft Corporation" and
       process.code_signature.trusted == true) and dns.question.name : "onedrive.live.com"
      ) or 

      /* Firefox */
      (process.name : "firefox.exe" and (process.code_signature.subject_name : "Mozilla Corporation" and
       process.code_signature.trusted == true)
      )
    ) 

network where host.os.type == "windows" and network.protocol == "dns" and
process.name != null and user.id not in ("S-1-5-18", "S-1-5-19", "S-1-5-20") and /* Add new WebSvc domains here */ dns.question.name : (
"raw.githubusercontent.*", "pastebin.*") and 
/* Insert noisy false positives here */ not ((process.executable : (
"?:\\Program Files\\*.exe", "?:\\Windows\\System32\\WWAHost.exe")) or
/* Discord App */ (process.name : "Discord.exe" and 
(process.code_signature.subject_name : "Discord Inc." and
process.code_signature.trusted == true) and dns.question.name : ("discord.com", "cdn.discordapp.com", "discordapp.com")) or /* MS Sharepoint */
(process.name : "Microsoft.SharePoint.exe" and (process.code_signature.subject_name : "Microsoft Corporation" and
process.code_signature.trusted == true) and dns.question.name : "onedrive.live.com") or /* Firefox */ (process.name : "firefox.exe" and
(process.code_signature.subject_name : "Mozilla Corporation" and       process.code_signature.trusted == true))) 

11. Detection logic should be resilient, but when it can’t, use multiple rules#

In this generic example, complex logic is simplified by splitting it across several rules. The benefit of this approach really shines during tuning, when the logic is continually adapted to unique edge cases.

So long as the logic of the separate rules doesn’t overlap, splitting will not create duplicate alerts, but the trade-off is that more rules must execute, potentially sacrificing performance.

process where A and B and C and D and
  (
    (E and not F) or
    (G and not H) or
    (I or not J) or
    (not K or not L)
  )

// rule 1
process where A and B and C and D and
  (E and not F)

// rule 2
process where A and B and C and D and
  (G and not H)

// rule 3
process where A and B and C and D and
  (I or not J)

// rule 4
process where A and B and C and D and
  (not K or not L)
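Whether a split preserves the coverage of the original rule depends on how its groups were joined. The brute-force truth table below is a sketch that treats E through L as opaque booleans and assumes the shared prefix `A and B and C and D` is true. It shows that the four split rules, taken together, alert exactly when the groups are joined with `or`; if the groups are joined with `and`, the split rules alert more broadly than the single rule did.

```python
from itertools import product

NAMES = "EFGHIJKL"


# The four split rules, with the shared prefix (A..D) assumed true.
def rule_1(v): return v["E"] and not v["F"]
def rule_2(v): return v["G"] and not v["H"]
def rule_3(v): return v["I"] or not v["J"]
def rule_4(v): return not v["K"] or not v["L"]


RULES = (rule_1, rule_2, rule_3, rule_4)


def any_split_rule_fires(v) -> bool:
    """True if at least one of the separate rules would alert."""
    return any(rule(v) for rule in RULES)


def combined_with_or(v) -> bool:
    """Single rule whose groups are alternatives."""
    return ((v["E"] and not v["F"]) or (v["G"] and not v["H"])
            or (v["I"] or not v["J"]) or (not v["K"] or not v["L"]))


def combined_with_and(v) -> bool:
    """Single rule that requires every group at once."""
    return ((v["E"] and not v["F"]) and (v["G"] and not v["H"])
            and (v["I"] or not v["J"]) and (not v["K"] or not v["L"]))


# Enumerate every assignment of the eight conditions.
assignments = [
    dict(zip(NAMES, values))
    for values in product([False, True], repeat=len(NAMES))
]

split_matches_or = all(
    any_split_rule_fires(v) == combined_with_or(v) for v in assignments
)
split_matches_and = all(
    any_split_rule_fires(v) == combined_with_and(v) for v in assignments
)
# Assignments where more than one split rule fires, i.e. duplicate alerts.
overlapping = sum(1 for v in assignments if sum(rule(v) for rule in RULES) > 1)

print("split == or-combined: ", split_matches_or)
print("split == and-combined:", split_matches_and)
print("assignments with overlapping alerts:", overlapping)
```

The overlap count is the practical cost of splitting: wherever more than one rule fires for the same event, the same activity is alerted on more than once, which is why keeping the split rules non-overlapping matters.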

12. Consistency in rule structure leads to predictability and fewer errors#

Consistent rule structure can include the ordering and prominence of key fields, such as event fields; ending lines with and while starting lines with not; and indenting according to the logic rather than merely at line breaks. Consistency makes anything out of the ordinary stand out, allowing for easier identification and correction.

// patterns in: field ordering, formatting, precedence
from logs-endpoint.events.*
| where event.category == "process" and event.type == "start" and
    process.name in (...) and
    not process.args like "*....*"

21. Production environments vary wildly, so minimize assumptions#

Don’t just assume that nobody runs PowerShell. Run a preemptive search to verify assumptions like this before relying on them in a rule; doing so will save a lot of work triaging false positive alerts later.
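As a minimal sketch of such a preemptive check, the Python below counts how often a process name appears, and on how many distinct hosts, in a batch of historical events. The event list, field names, and host names are hypothetical; in practice the events would come from a search over the target environment's own data.

```python
from collections import Counter

# Hypothetical process-start events (assumed shape, for illustration only).
events = [
    {"host": "ws-01", "process_name": "powershell.exe"},
    {"host": "ws-01", "process_name": "PowerShell.exe"},
    {"host": "srv-02", "process_name": "powershell.exe"},
    {"host": "ws-03", "process_name": "cmd.exe"},
]


def prevalence(events: list, process_name: str) -> tuple:
    """Return (event count, distinct host count) for a process name.

    Names are compared case-insensitively.
    """
    wanted = process_name.lower()
    per_host = Counter(
        event["host"]
        for event in events
        if event["process_name"].lower() == wanted
    )
    return sum(per_host.values()), len(per_host)


count, hosts = prevalence(events, "powershell.exe")
print(f"powershell.exe: {count} events on {hosts} hosts")
```

A non-zero result means the assumption that nobody runs the process does not hold in this environment, and a rule built on it would need tuning before it ships.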

In summary, this guide stepped through applied examples from the Elastic ecosystem, but the principles apply to any platform or implementation. We focused specifically on efficacy, performance, and maintainability as the means to develop and maintain high-value rules.