1b. Understand the data
This is essentially an audit and inventory of the data sources available in the environment identified in the previous step. The data could be events, logs, errors, or any other relevant information. Having a central logging capability, such as a security information and event management (SIEM) platform, makes this process significantly easier, and will make later stages of the detection engineering process easier still.
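As a minimal sketch of what this inventory can look like in practice, the snippet below lists indices and document counts from an Elasticsearch-backed SIEM via the `_cat/indices` API and groups them by index-name prefix. The cluster URL, API key, and the prefix-based grouping are illustrative assumptions about the environment, not a prescribed implementation.

```python
"""Minimal data-source inventory sketch against an Elasticsearch-backed SIEM.

Assumptions (adjust for your environment): the cluster URL and API key are
placeholders, and indices follow a data-stream-style naming convention so
grouping by name prefix (e.g. "logs-windows") approximates a data source.
"""
from collections import defaultdict

import requests

ES_URL = "https://siem.example.internal:9200"  # placeholder cluster URL
API_KEY = "..."  # placeholder credential

def inventory_data_sources() -> dict[str, int]:
    """Return a mapping of data-source prefix -> total document count."""
    resp = requests.get(
        f"{ES_URL}/_cat/indices",
        params={"format": "json", "h": "index,docs.count"},
        headers={"Authorization": f"ApiKey {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()

    totals: dict[str, int] = defaultdict(int)
    for row in resp.json():
        index = row["index"]
        if index.startswith("."):  # skip internal/system indices
            continue
        # Group by the first two dash-separated segments, e.g. "logs-windows".
        prefix = "-".join(index.split("-")[:2])
        totals[prefix] += int(row["docs.count"] or 0)
    return dict(totals)

if __name__ == "__main__":
    for source, count in sorted(inventory_data_sources().items()):
        print(f"{source}: {count} documents")
```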
Some of the ways to understand the data include:
- Understanding the shape and relationships
  - Leverage existing schemas, such as the Elastic Common Schema (ECS) or OpenTelemetry (OTel), to help guide this process and identify opportunities to normalize (see the normalization sketch after this list)
  - Hierarchical relationships within specific event sources
  - Unique and overlapping data sources
  - Relationships across disparate data sources
  - Data shaping (visualizations)
- Identifying prominent or important specific fields
  - Leverage features such as Kibana Discover, Dashboards, or Observability
  - Review netflows
  - Preservation of raw source data vs. normalized data
- Understanding the verbosity and logging coverage
  - What is intentionally not logged?
  - What is intended to be logged?
  - What is logged by default vs. what requires configuration or setup?
  - What is the recommended logging verbosity?
  - What are the performance implications or limitations?
- Understanding logging frequency and cadence (see the cadence sketch after this list)
  - Is it streaming or batched?
  - Are batches pushed or pulled?
  - Does data arrive every 30 seconds or every 15 minutes?
  - Are there buffers or limits, and if so, what happens to excess data or on timeout?
  - Does a prioritization concept exist for retrieval or transmission?
- Understanding how the data is captured or generated
  - Is it reading event logs, consuming providers, or watching API calls?
  - Does it occur in userland or in the kernel?
  - Is it polling or listening for the data?
  - Is data intermediately stored or processed?
  - How susceptible is it to tampering or fabrication?
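To make the normalization point concrete, here is a small sketch of mapping a vendor-specific log record onto ECS field names. The raw field names (`ts`, `src`, `dst`, `act`) are hypothetical vendor fields invented for illustration; the targets (`@timestamp`, `source.ip`, `destination.ip`, `event.action`, `event.original`) are real ECS fields, and `event.original` shows one way to preserve the raw record alongside the normalized form. In practice this mapping would typically live in an ingest pipeline rather than application code.

```python
"""Sketch: normalizing a vendor-specific log record into ECS field names.

The raw field names ("ts", "src", "dst", "act") are hypothetical vendor
fields; the target names are real ECS fields.
"""
import json

def to_ecs(raw: dict) -> dict:
    """Map a raw vendor event to an ECS-shaped document."""
    return {
        "@timestamp": raw["ts"],                 # event time
        "source": {"ip": raw["src"]},            # ECS: source.ip
        "destination": {"ip": raw["dst"]},       # ECS: destination.ip
        "event": {
            "action": raw["act"],                # ECS: event.action
            "original": json.dumps(raw),         # preserve the raw record
        },
    }

vendor_event = {"ts": "2024-05-01T12:00:00Z", "src": "10.0.0.5",
                "dst": "203.0.113.7", "act": "deny"}
print(json.dumps(to_ecs(vendor_event), indent=2))
```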
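Logging frequency and cadence can often be measured empirically rather than read from documentation. The following sketch infers whether delivery looks streaming or batched from the inter-arrival gaps between ingest timestamps; the sample timestamps and the 10x-gap heuristic are assumptions made for illustration, not a production profiler.

```python
"""Sketch: inferring whether a data source streams or batches its events.

`timestamps` would come from querying your SIEM for recent ingest times of a
single source; the values below are fabricated, and the 10x-gap threshold
separating "streaming" from "batched" is an assumption.
"""
from datetime import datetime, timedelta

def describe_cadence(timestamps: list[datetime]) -> str:
    """Classify delivery as streaming or batched from inter-arrival gaps."""
    gaps = sorted(b - a for a, b in zip(timestamps, timestamps[1:]))
    median_gap, max_gap = gaps[len(gaps) // 2], gaps[-1]
    # Heuristic (an assumption): clusters of events separated by quiet
    # periods at least 10x the typical gap suggest batched delivery.
    if max_gap >= 10 * median_gap:
        return f"looks batched, roughly every {max_gap}"
    return f"looks streaming (median gap {median_gap})"

# Fabricated example: events landing in 15-minute batches of three.
base = datetime(2024, 5, 1, 12, 0, 0)
ts = [base + timedelta(minutes=15 * i, seconds=s)
      for i in range(4) for s in (0, 1, 2)]
print(describe_cadence(sorted(ts)))
```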
This is also a recurring step and should be refreshed regularly. It is just as important that the correct individuals have an understanding of the data: how the processing of the data is documented and preserved will influence its understandability, and interpreting the outcomes of this step may require collaborative sessions or training.