1b. Understand the data

This is essentially an audit and inventory of available data sources based on the environment from the previous step. The data could be events, logs, errors, or any other relevant information. Having a central logging capability such as a security information and event management (SIEM) platform makes this process significantly easier, and will make later stages of the detection engineering process easier still.
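
As a first pass, a quick programmatic inventory can enumerate what the environment is actually shipping. The sketch below is a minimal example, assuming an Elastic deployment and the official elasticsearch Python client (8.x); the URL, API key, and "logs-*" pattern are placeholders.

```python
# Minimal inventory sketch, assuming an Elastic SIEM deployment and the
# official elasticsearch-py 8.x client. The URL, credentials, and the
# data-stream pattern are placeholders for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")

# Enumerate data streams (e.g., logs-*) as a first-pass inventory of
# what the environment is actually sending into the SIEM.
resp = es.indices.get_data_stream(name="logs-*")
for stream in resp["data_streams"]:
    print(stream["name"])
```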

Some of the ways to understand the data include:

Understanding the shape and relationships

  • Leverage existing schemas, such as ECS/OTel, to guide this process and identify opportunities to normalize (see the normalization sketch after this list)

  • Hierarchical relationships within specific event sources

  • Unique and overlapping data sources

  • Relationships across disparate data sources

  • Data shaping (visualizations)

  • Identifying prominent or important fields

  • Leverage features such as Kibana Discover, Dashboards, or Observability

  • Review netflows

  • Preservation of raw source data vs. normalized data
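
To make the normalization bullet concrete, here is a minimal sketch of mapping a raw vendor event into ECS-style fields. The raw field names are hypothetical; the target keys follow published ECS conventions, and event.original preserves the raw source alongside the normalized copy.

```python
import json

# Hypothetical raw event from a vendor product; the raw field names are
# invented for illustration only.
raw_event = {
    "ts": "2024-05-01T12:34:56Z",
    "src": "10.0.0.5",
    "dst": "203.0.113.7",
    "proc": "powershell.exe",
}

# Map the raw fields onto ECS field names so disparate sources can be
# queried and correlated against one schema.
ecs_event = {
    "@timestamp": raw_event["ts"],
    "source.ip": raw_event["src"],
    "destination.ip": raw_event["dst"],
    "process.name": raw_event["proc"],
    # Preserve the raw source alongside the normalized copy, per the
    # raw-vs-normalized consideration above.
    "event.original": json.dumps(raw_event),
}

print(ecs_event)
```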

Understanding the verbosity and logging coverage (see the coverage-gap sketch after this list)

  • What is intentionally not logged?

  • What is intended to be logged?

  • What is logged by default vs. what requires configuration or setup?

  • What is the recommended verbosity for logging?

  • What are the performance implications or limitations?
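
One way to act on these questions is a simple coverage-gap check: compare the data sources you expect to see against those actually producing events. The sketch below is illustrative; the expected list and the observed set are placeholders you would derive from policy and from the inventory step above.

```python
# Coverage-gap sketch: expected vs. observed data sources. Both sets are
# placeholder values; in practice the observed set would be queried from
# the SIEM (e.g., a terms aggregation on a dataset field).
expected_sources = {"windows.sysmon", "linux.auditd", "aws.cloudtrail"}
observed_sources = {"windows.sysmon", "aws.cloudtrail"}

for source in sorted(expected_sources - observed_sources):
    # Each miss is either an intentional gap or a broken/missing pipeline.
    print(f"no events observed for: {source}")
```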

Understanding logging frequency and cadence (see the ingest-lag sketch after this list)

  • Is it streaming or batched?

  • Are batches pushed or pulled?

  • Does data arrive every 30 seconds or every 15 minutes?

  • Are there buffers or limits, and if so, what happens to excess events or timeouts?

  • Does a prioritization concept exist for retrieval or transmission?
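
Ingest lag, the gap between when an event occurred and when it was ingested, is a direct way to observe cadence. A minimal sketch, assuming ECS-style @timestamp plus an ingest-time field such as event.ingested (adjust to your own mappings):

```python
from datetime import datetime

# Sample events with occurrence and ingest timestamps (placeholder data).
events = [
    {"@timestamp": "2024-05-01T12:00:00Z", "event.ingested": "2024-05-01T12:00:02Z"},
    {"@timestamp": "2024-05-01T12:00:01Z", "event.ingested": "2024-05-01T12:15:03Z"},
]

for e in events:
    occurred = datetime.fromisoformat(e["@timestamp"].replace("Z", "+00:00"))
    ingested = datetime.fromisoformat(e["event.ingested"].replace("Z", "+00:00"))
    lag = (ingested - occurred).total_seconds()
    # Small, steady lag suggests streaming delivery; large or highly
    # variable lag suggests batching, which delays detection.
    print(f"ingest lag: {lag:.0f}s")
```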

Understanding how the data is captured or generated (see the polling-vs-listening sketch after this list)

  • Is it reading event logs, consuming providers, or watching API calls?

  • Does capture occur in user space or in the kernel?

  • Is it polling for the data or listening for it?

  • Is data stored or processed intermediately?

  • How susceptible is it to tampering or fabrication?
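
The polling-vs-listening distinction in particular shapes what a detection can rely on. The toy sketch below contrasts the two models; read_snapshot and subscribe are hypothetical stand-ins for a sensor API, not a real library.

```python
import time

def handle(event):
    print(event)

def poll_for_events(read_snapshot, interval_seconds=30):
    """Polling: periodically read current state. Short-lived activity
    occurring between polls can be missed entirely."""
    while True:
        for event in read_snapshot():
            handle(event)
        time.sleep(interval_seconds)

def listen_for_events(subscribe):
    """Listening: the source pushes each event as it occurs, so
    short-lived activity is captured, at the cost of buffering and
    backpressure concerns."""
    for event in subscribe():  # blocks until the next event arrives
        handle(event)
```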

This is also a recurring step and should be refreshed regularly. It is just as important that the right individuals understand the data: how the processing of the data is documented and preserved influences its understandability, and interpreting the outcomes of this step may require collaborative sessions or training.