The detection engineering process: From hypothesis to high-fidelity detection

How to move from an adversary hypothesis to a deployed, high-fidelity rule and keep it accurate over time.

The detection engineering process is the structured workflow for moving from an adversary hypothesis to a deployed, high-fidelity detection that produces actionable alerts. It breaks detection creation into five discrete stages: threat modeling, logic development, testing and validation, deployment, and continuous tuning.

AI changes this process by helping teams execute each stage more consistently and at greater scale. It can help create better hypotheses, generate candidate detection logic, validate rules against expected behavior, triage bad or stale detections, recommend tuning changes, and learn from analyst feedback. The result is not just faster rule writing. It is a detection program that can identify its own weak spots and improve fidelity over time.

When each stage is followed, detections enter production with documented assumptions, known false positive characteristics, and a clear maintenance owner. When AI is applied thoughtfully, those assumptions can be checked continuously, false positive drivers can be surfaced faster, and rule maintenance can become proactive instead of reactive.

Teams that skip this structure accumulate detection debt, including aging rules nobody tuned, untested logic that fires on everything or nothing, overlapping detections that create redundant alerts, and coverage gaps that widen without anyone noticing. AI does not eliminate the need for detection engineering discipline, but it makes that discipline easier to apply across a larger and faster-moving detection library.

Strong detection engineering programs build this process explicitly rather than treating it as an implicit skill distributed unevenly across individual analysts. For context on how detection approaches have changed over time, “How detection logic has evolved” covers the shift from manual correlation to systematic engineering practice. This guide walks through each stage of the detection engineering process, where teams commonly get stuck, and how AI helps create, validate, triage, and improve detection rules throughout their lifecycle.

Stage 1: Threat modeling and hypothesis formation

Before writing any detection logic, a detection engineer needs a well-formed hypothesis, which is a specific, testable statement about what adversary behavior should look like in logs if it occurs. Hypothesis quality determines everything downstream. A weak hypothesis produces a detection that is too broad to be useful, too narrow to catch the technique it targets, or too vague for an engineer to test.

Useful inputs for hypothesis formation include MITRE ATT&CK technique descriptions and sub-techniques, threat intelligence reporting about techniques recently used against similar organizations, internal incident findings that documented a detection gap, and purple team or red team results that identified behaviors the current detection program missed.

The output of this stage is not a query yet. It is a precise description of the target behavior, the log source, the field-level evidence that should become visible when the behavior occurs, and what normal variation in that evidence looks like.

AI is valuable at this stage because many detection requests begin as broad, underspecified goals. An analyst may ask to “detect suspicious IAM abuse,” “find lateral movement,” or “catch credential misuse.” Those are not yet detection hypotheses. AI can help decompose them into testable behaviors, map those behaviors to ATT&CK techniques, identify likely telemetry sources, and call out where available logs may not support reliable detection.

For example, instead of leaving the goal as “detect suspicious IAM abuse,” an AI-assisted workflow might propose a scoped hypothesis, such as “detect AssumeRole activity from a previously unseen source ASN, followed within a defined time window by IAM policy enumeration or privilege-sensitive API calls.”

That hypothesis is more useful because it defines behavior, telemetry, sequence, and timing. It gives the detection engineer something concrete to validate.

A strong hypothesis has three properties. First, it is scoped to a specific technique, not “detect lateral movement” but “detect pass-the-hash authentication against Windows systems using NTLM where the source IP is not the account's registered workstation.” Second, it identifies the specific log source and fields that will carry evidence of the behavior. Third, it specifies what benign instances of the same evidence look like, so the detection can be designed to exclude them rather than tuning them out reactively after deployment.
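One way to make the three properties checkable is to record each hypothesis as a small structured object. The sketch below is illustrative only: the `DetectionHypothesis` class, its field names, and the `gaps` helper are assumptions made for this example, not part of any particular platform.

```python
from dataclasses import dataclass, field


@dataclass
class DetectionHypothesis:
    """Hypothetical record for a Stage 1 hypothesis (names are illustrative)."""
    technique: str                 # scoped behavior, ideally with an ATT&CK ID
    log_source: str                # where the evidence should appear
    evidence_fields: list[str]     # field-level evidence expected in that source
    benign_patterns: list[str] = field(default_factory=list)  # legitimate look-alikes

    def gaps(self) -> list[str]:
        """Return which of the three properties are still missing."""
        missing = []
        if not self.technique.strip():
            missing.append("scoped technique")
        if not self.log_source.strip() or not self.evidence_fields:
            missing.append("log source and fields")
        if not self.benign_patterns:
            missing.append("benign variation")
        return missing


pth = DetectionHypothesis(
    technique="Pass-the-hash over NTLM from a non-registered workstation (T1550.002)",
    log_source="windows_security",
    evidence_fields=["EventID", "AuthenticationPackageName", "IpAddress", "TargetUserName"],
)
print(pth.gaps())  # ['benign variation']
```

A hypothesis that still reports gaps is not ready for Stage 2. Here the benign look-alikes have not been described yet, which is exactly the work that would otherwise surface as reactive tuning after deployment.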

AI can also assist with coverage prioritization. Teams maintaining a coverage heatmap can use AI to identify underrepresented ATT&CK technique categories, compare them against the organization’s threat model, and propose candidate hypotheses for the highest-risk gaps. Detection programs that skip this prioritization step tend to build depth in already-covered areas and leave persistent gaps in technique categories that represent higher breach consequences.
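As a rough sketch of that prioritization step, the gap list can be computed by comparing the threat model against the detections the team currently trusts. The function name, the risk weights, and the input shapes below are assumptions for illustration; a real program would pull them from its own coverage heatmap.

```python
def coverage_gaps(threat_model, healthy_rules):
    """Rank techniques that have no trusted detection, highest risk first.

    threat_model:  technique ID -> risk weight (0..1) set by the organization.
    healthy_rules: technique ID -> count of working, trusted detections.
    """
    gaps = [
        (technique, weight)
        for technique, weight in threat_model.items()
        if healthy_rules.get(technique, 0) == 0
    ]
    return sorted(gaps, key=lambda gap: gap[1], reverse=True)


# Hypothetical inputs: weights and counts are made up for the example.
threat_model = {"T1078.004": 0.9, "T1110": 0.4, "T1550.002": 0.7}
healthy_rules = {"T1110": 3}

print(coverage_gaps(threat_model, healthy_rules))
# [('T1078.004', 0.9), ('T1550.002', 0.7)]
```

The already-covered technique drops out of the list, which is the point: new effort goes to the highest-risk gaps rather than adding depth where coverage exists.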

Stage 2: Writing detection logic

With a hypothesis in hand, the detection engineer translates it into logic that queries available data and produces an alert when the specified conditions are met. This stage involves three practical decisions that shape how the detection performs in production.

The first decision is choosing the detection type that best matches the hypothesis. Signature-based logic matches specific indicators and works well for known-bad coverage with continuous threat intelligence feeds. Threshold logic fires when an observable metric exceeds a defined boundary and suits brute force, enumeration, and data staging behaviors. Sequence logic looks for chains of events in defined order within a time window and encodes tradecraft patterns rather than individual indicators. Behavioral logic compares current activity against established baselines and covers techniques that blend into legitimate behavior. Correlation logic aggregates signals from multiple sources to build a higher-confidence compound finding.

Each type has different requirements for log source quality, baseline availability, and tuning effort. AI can help by turning a natural language version of a hypothesis into a repeatable query, recommending which detection type fits the hypothesis, and explaining the tradeoffs. For example, it may suggest that a static threshold is too brittle for an environment with uneven account activity and that a baseline-relative behavioral rule would be more reliable.
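To make sequence logic concrete, here is a minimal sketch in Python rather than a SIEM rule language. It assumes events are dictionaries with `time`, `principal`, and `event_name` keys, uses a 10-minute window, and treats a small set of IAM calls as enumeration; all of those choices are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Illustrative subset of enumeration calls; a real rule would use a reviewed list.
ENUM_CALLS = {"ListPolicies", "ListRoles", "GetAccountAuthorizationDetails"}


def sequence_hits(events, window=timedelta(minutes=10)):
    """Yield (principal, assume_time, enum_time) when an AssumeRole event is
    followed by an enumeration call from the same principal within `window`."""
    last_assume = {}  # principal -> time of the most recent AssumeRole
    for ev in sorted(events, key=lambda e: e["time"]):
        who = ev["principal"]
        if ev["event_name"] == "AssumeRole":
            last_assume[who] = ev["time"]
        elif ev["event_name"] in ENUM_CALLS and who in last_assume:
            if ev["time"] - last_assume[who] <= window:
                yield who, last_assume[who], ev["time"]


t0 = datetime(2024, 1, 1, 12, 0)
sample = [
    {"time": t0, "principal": "role/a", "event_name": "AssumeRole"},
    {"time": t0 + timedelta(minutes=3), "principal": "role/a", "event_name": "ListRoles"},
    {"time": t0 + timedelta(minutes=45), "principal": "role/a", "event_name": "ListPolicies"},
]
hits = list(sequence_hits(sample))
print(len(hits))  # 1
```

Only the enumeration call three minutes after the role assumption matches. The call at 45 minutes falls outside the window, which shows how much the window choice shapes what a sequence rule can see.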

The second decision is field mapping: translating the hypothesis into the exact field names and data types used by the log source. A detection that references the wrong field name, or assumes a field contains normalized data when it contains raw strings, will silently fail or produce unexpected output. This step requires direct inspection of sample log data, not just schema documentation, because production logs frequently contain format variations that documentation does not capture.
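A simple way to inspect sample data is to count how often each expected field is missing or carries an unexpected type. The sketch below assumes events have already been parsed into dictionaries; the field names (`src_ip`, `bytes_out`) and the `field_report` helper are hypothetical.

```python
def field_report(sample_events, expected):
    """Check each expected field against sample events.

    `expected` maps field name -> Python type. Returns, per field, how many
    events were missing it and how many carried a different type.
    """
    report = {name: {"missing": 0, "wrong_type": 0} for name in expected}
    for ev in sample_events:
        for name, typ in expected.items():
            if name not in ev or ev[name] is None:
                report[name]["missing"] += 1
            elif not isinstance(ev[name], typ):
                report[name]["wrong_type"] += 1
    return report


sample_events = [
    {"src_ip": "10.0.0.5", "bytes_out": 1200},
    {"src_ip": "10.0.0.9", "bytes_out": "1200"},   # raw string, not a number
    {"source_ip": "10.0.0.7", "bytes_out": 88},    # field renamed by one producer
]
report = field_report(sample_events, {"src_ip": str, "bytes_out": int})
print(report)
# {'src_ip': {'missing': 1, 'wrong_type': 0}, 'bytes_out': {'missing': 0, 'wrong_type': 1}}
```

Any nonzero count is a reason to revisit the mapping before the rule ships, because both cases would otherwise cause the rule to skip events without raising an error.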

The third decision is threshold and time window calibration. Threshold and sequence detections require boundaries, such as how many failed logins, how many API calls, how many seconds between related events, or how much data movement is anomalous. These values need calibration against actual production baseline data before deployment, not after. A cutoff pulled from convention rather than measured telemetry will either miss real behavior or fire constantly on benign activity.

AI can analyze historical telemetry and recommend starting thresholds based on observed distributions. It can show whether a proposed threshold would have generated acceptable alert volume over a representative period, whether certain users or service accounts would dominate the output, and whether a static threshold should be replaced with a per-entity baseline.
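The same calibration can be sketched by hand. The example below assumes daily per-account counts are already available; the percentile choice, the sample numbers, and both function names are illustrative assumptions, not recommended values.

```python
import statistics


def suggest_threshold(counts, pct=99):
    """Return the pct-th percentile of historical counts as a starting cutoff."""
    cuts = statistics.quantiles(counts, n=100, method="inclusive")
    return cuts[pct - 1]


def would_have_fired(counts_by_entity, threshold):
    """Count, per entity, how many historical periods exceed the threshold."""
    return {
        entity: sum(c > threshold for c in counts)
        for entity, counts in counts_by_entity.items()
    }


# Hypothetical daily failed-login counts per account.
history = {
    "alice": [2, 3, 1, 4, 2],
    "svc-backup": [40, 38, 45, 41, 39],
}
all_counts = [c for counts in history.values() for c in counts]
start = suggest_threshold(all_counts, pct=95)
print(start)
print(would_have_fired(history, threshold=10))  # {'alice': 0, 'svc-backup': 5}
```

With a static cutoff of 10, the service account would have alerted every day while the human account never would. That pattern is the signal that a per-entity baseline may fit better than a single global threshold.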

AI is also useful for rule drafting. Given a hypothesis, available telemetry, and relevant examples in a rule language such as Sigma, SPL, KQL, or YARA-L, it can generate candidate SIEM-native logic. The engineer remains responsible for review and approval, but AI reduces the blank-page problem by proposing an initial query, expected fields, correlation windows, thresholds, and enrichment requirements.

Common logic patterns detection engineers work with include process creation chains, authentication anomalies, IAM permission escalation sequences, impossible travel, unusual administrative activity, data access volume spikes, and multi-source correlations that combine identity, endpoint, network, and cloud telemetry. AI can help adapt these patterns to the organization’s actual schema and rule language rather than forcing engineers to manually translate each idea from scratch.

Stage 3: Testing and validation

Testing is what separates detection engineering from detection guessing. The goal is to validate both sides of the rule: that it fires when the targeted behavior occurs, and that it produces an acceptable false positive rate against benign traffic.

Integration testing validates the rule against a sample of real production log data, typically a representative period of benign traffic, to measure baseline false positive rate. A rule that passes unit tests, meaning it fires on synthetic examples of the target behavior and stays silent on synthetic benign events, can still fire constantly in production if the benign exclusions defined in Stage 2 did not account for real-world patterns. Integration testing reveals those gaps before analysts encounter them in a live queue.
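The difference between the two kinds of test can be shown with a toy rule expressed as a Python predicate. Everything here is an assumption for illustration: the rule, the cutoff of 20 failed logins, and the sample events.

```python
def rule(event):
    """Illustrative threshold rule: many failed logins from one source."""
    return event.get("failed_logins", 0) >= 20


def benign_hit_rate(rule_fn, benign_events):
    """Replay a rule over known-benign events; return (hit rate, matching events)."""
    if not benign_events:
        return 0.0, []
    hits = [e for e in benign_events if rule_fn(e)]
    return len(hits) / len(benign_events), hits


# Synthetic cases: the rule fires on the attack and ignores a normal user.
assert rule({"source": "attacker", "failed_logins": 60})
assert not rule({"source": "laptop-17", "failed_logins": 2})

# Replay over a (hypothetical) benign production sample.
benign_sample = [
    {"source": "laptop-17", "failed_logins": 1},
    {"source": "laptop-22", "failed_logins": 0},
    {"source": "vuln-scanner", "failed_logins": 50},  # legitimate but noisy
    {"source": "laptop-31", "failed_logins": 3},
]
rate, hits = benign_hit_rate(rule, benign_sample)
print(rate, [h["source"] for h in hits])  # 0.25 ['vuln-scanner']
```

The synthetic cases pass, yet a quarter of the benign sample still matches because the scanner was never represented in them. That is the gap a replay against production data is meant to expose.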

AI is valuable in integration testing because it can summarize why a rule fired across a large sample. Instead of forcing an engineer to manually inspect hundreds of matches, AI can group them by entity, application, source, command line, geography, parent process, or other relevant attributes. This makes it easier to see whether the rule is detecting meaningful behavior or simply matching a common administrative pattern.
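The grouping itself is straightforward to sketch. The example below assumes rule matches are dictionaries and that `user` and `parent_process` are the attributes of interest; both the helper name and the sample data are hypothetical.

```python
from collections import Counter


def top_drivers(matches, attributes, n=3):
    """For each attribute, return the n most common values among rule matches."""
    return {
        attr: Counter(m.get(attr, "<missing>") for m in matches).most_common(n)
        for attr in attributes
    }


matches = [
    {"user": "svc-deploy", "parent_process": "jenkins.exe"},
    {"user": "svc-deploy", "parent_process": "jenkins.exe"},
    {"user": "svc-deploy", "parent_process": "powershell.exe"},
    {"user": "jdoe", "parent_process": "explorer.exe"},
]
drivers = top_drivers(matches, ["user", "parent_process"])
print(drivers["user"][0])            # ('svc-deploy', 3)
print(drivers["parent_process"][0])  # ('jenkins.exe', 2)
```

When one account or parent process explains most of the matches, the rule is probably keying on an administrative pattern, and a targeted exclusion or a narrower condition is worth testing.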

For detection logic targeting high-consequence techniques, purple team validation adds another layer. The hypothesized behavior is executed in a controlled environment, and the team verifies the rule fires as expected and produces alert context sufficient for investigation. When the rule fails to fire even though the behavior occurred, the finding reveals logic gaps that synthetic test cases may have missed.

AI can help convert purple team outcomes into rule improvements. If a test execution failed to alert, AI can compare the expected behavior with the observed telemetry and suggest likely causes, such as a missing log source, a field mismatch, an overly narrow condition, an incorrect time window, or a gap between the simulated technique and the original hypothesis.

Stage 4: Deployment and documentation

A detection that passes testing is ready for production deployment, but it must be deployed with metadata that makes the detection inventory navigable and maintainable. A rule deployed without documentation becomes a liability as soon as the engineer who wrote it is no longer available to explain what it does.

Each deployed detection should carry the ATT&CK technique IDs it maps to, the log source dependencies it requires, the severity level it carries, the analyst queue it routes to, a link to or copy of the investigation runbook appropriate for this detection type, and the date it was last validated against production data. In systems that support it, estimated false positive rate and alert volume expectations are also useful to include, so anomalous output in production is more quickly recognized as a potential issue.
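A deployment pipeline can enforce this list with a small completeness check. The key names below are assumptions chosen to mirror the items above; real platforms will use their own metadata schema.

```python
# Hypothetical metadata keys, one per required item in the list above.
REQUIRED = ("attack_ids", "log_sources", "severity", "queue", "runbook", "last_validated")


def missing_metadata(rule_meta):
    """Return the required keys that are absent or empty for a rule."""
    return [key for key in REQUIRED if not rule_meta.get(key)]


meta = {
    "attack_ids": ["T1078.004"],
    "log_sources": ["aws_cloudtrail"],
    "severity": "high",
    "queue": "cloud-tier2",
    "runbook": "",            # not written yet
    "last_validated": None,   # never validated against production data
}
print(missing_metadata(meta))  # ['runbook', 'last_validated']
```

Blocking deployment while this list is non-empty is one way to keep undocumented rules out of the inventory.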

AI can make documentation less manual. Given a final rule, its hypothesis, and its test results, AI can draft the rule description, summarize the logic in plain language, identify ATT&CK mappings, list telemetry dependencies, propose triage questions, and generate an initial investigation runbook. This makes documentation more likely to exist and more likely to remain consistent across the detection library.

Naming conventions matter more than they seem. A rule named “AWS Suspicious Activity” communicates nothing actionable. A rule named “AWS IAM: Privilege Escalation via Cross-Account AssumeRole from New IP (T1078.004)” communicates the service context, the technique, the distinguishing condition, and the ATT&CK mapping in the alert title itself. Analysts receiving this alert have meaningful context before opening the runbook.
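A naming convention is easier to keep when it is checked automatically. The pattern below is one possible encoding of the "context: behavior (technique ID)" shape used in the example name; the regular expression and the `parse_rule_name` helper are assumptions, not an established standard.

```python
import re

# Assumed convention: "<service context>: <behavior> (<ATT&CK technique ID>)"
NAME_PATTERN = re.compile(
    r"^(?P<context>[^:]+): (?P<behavior>.+) \((?P<attack_id>T\d{4}(?:\.\d{3})?)\)$"
)


def parse_rule_name(name):
    """Return the name's parts as a dict, or None if it breaks the convention."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None


good = parse_rule_name(
    "AWS IAM: Privilege Escalation via Cross-Account AssumeRole from New IP (T1078.004)"
)
print(good["context"], good["attack_id"])          # AWS IAM T1078.004
print(parse_rule_name("AWS Suspicious Activity"))  # None
```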

Stage 5: Tuning and lifecycle management

No detection is final at deployment. Production environments change continuously, and detection rules that are not actively maintained degrade in accuracy over time.

Tuning addresses false positive rates that rise after deployment. The typical cause is new services, processes, or accounts added to the environment that match the detection conditions but represent legitimate behavior. These patterns may not have been present in the production sample used for integration testing. Tuning adds exclusions for new benign patterns, narrows conditions to distinguish malicious from legitimate instances more precisely, raises or lowers thresholds where the production baseline has shifted, or adds a second correlated signal to increase confidence.

AI can accelerate this tuning loop by analyzing alert outcomes and recommending specific improvements. For a noisy rule, AI might suggest excluding a known deployment service account, tightening a parent-child process relationship, narrowing activity to privileged accounts, changing a static threshold into a baseline-relative threshold, or requiring a second event within a defined time window. For a silent rule, AI might identify that a field has stopped populating, a log source has changed format, or the condition is too narrow to match the way the behavior appears in current telemetry.

Lifecycle management addresses the longer-term question of when a detection should be revised or retired. Signs that a detection needs major revision include a false positive rate that has climbed above a defined ceiling despite repeated tuning, coverage by a more precise detection that supersedes it, or dependency on a log source being decommissioned. Retired detections should be archived with a record of the reason for retirement so future engineers do not inadvertently recreate the same logic or question whether a coverage gap was intentional.

AI is especially useful for triaging bad detection rules already in production. It can identify rules that never fire, rules that fire too often, rules with stale field references, rules that duplicate stronger detections, rules whose alert outcomes show persistent false positives, and rules that no longer map cleanly to current telemetry. Instead of waiting for analysts to complain about noisy alerts, AI can continuously rank detections by likely maintenance need.
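A simplified version of that ranking can be built from per-rule statistics. The sketch below assumes each rule exposes a 30-day alert count, closed-alert outcomes, and a list of field references that no longer resolve; the cutoffs (`fp_ceiling`, `noisy_per_day`) and all names are illustrative assumptions.

```python
def maintenance_flags(stats, fp_ceiling=0.3, noisy_per_day=50):
    """Return the maintenance concerns suggested by a rule's recent statistics."""
    flags = []
    if stats["alerts_30d"] == 0:
        flags.append("silent")
    elif stats["alerts_30d"] / 30 > noisy_per_day:
        flags.append("noisy")
    closed = stats["true_positives"] + stats["false_positives"]
    if closed and stats["false_positives"] / closed > fp_ceiling:
        flags.append("high_fp")
    if stats.get("unresolved_fields"):
        flags.append("stale_fields")
    return flags


def rank_by_need(rules):
    """Sort (rule name, flags) pairs so the most-flagged rules come first."""
    flagged = [(name, maintenance_flags(stats)) for name, stats in rules.items()]
    return sorted(flagged, key=lambda item: len(item[1]), reverse=True)


rules = {
    "r1": {"alerts_30d": 0, "true_positives": 0, "false_positives": 0,
           "unresolved_fields": ["src_ip"]},
    "r2": {"alerts_30d": 3000, "true_positives": 10, "false_positives": 90},
    "r3": {"alerts_30d": 12, "true_positives": 9, "false_positives": 1},
}
for name, flags in rank_by_need(rules):
    print(name, flags)
# r1 ['silent', 'stale_fields']
# r2 ['noisy', 'high_fp']
# r3 []
```

Counting flags is a deliberately crude score. A real program would likely weight silent rules on high-risk techniques more heavily than noisy rules on low-risk ones.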

The feedback loop from analyst triage back to detection engineering completes the lifecycle. When an analyst closes an alert as a false positive, that outcome should route back to the engineering queue as tuning input. When an investigation reveals attacker behavior that no existing detection caught, it should generate a hypothesis for new detection development.

Where detection engineering processes typically break down

Certain failure modes recur across mature and immature programs alike. Understanding them in advance is more useful than diagnosing them after detection quality has declined.

Writing logic before the hypothesis is complete is a common first-stage failure. When an engineer jumps from a technique category to writing a query immediately, the result is typically an overly broad rule that fires on activity unrelated to the specific technique being targeted. The hypothesis should be specific enough that the engineer can predict before writing any code what the rule will and will not fire on.

Deploying without testing against production data is a second common failure. Rules validated only on synthetic data frequently fire on benign production activity that was not represented in the synthetic cases. This is especially common for threshold and behavioral detections, where cutoff values are only meaningful relative to actual production volumes.

Detection ownership without a maintenance schedule is a third common failure. Assigning a rule to a team member at deployment but not establishing a review schedule means the rule receives attention only when it breaks visibly. Many decay modes, like a silent log schema change that causes a field reference to stop resolving, do not break a rule in an obvious way. They simply reduce output or produce no output at all, and degraded coverage goes unnoticed until it matters.

Treating AI-generated rules as production-ready is an emerging failure mode. AI can generate useful candidate logic, but generated rules still need engineering review, schema validation, testing, and production calibration. A rule that looks plausible in a query editor can still be wrong, noisy, incomplete, or unsupported by available telemetry.

The right operating model is AI-assisted detection engineering, not AI-autonomous rule deployment. AI proposes, analyzes, and recommends. Detection engineers validate, approve, and own the final production behavior.

Frequently asked questions

How long should the detection engineering process take from hypothesis to deployment?

Simple signature and threshold detections with clean log source availability can move from hypothesis to tested, deployed rule in one to two days. Behavioral detections requiring baseline data collection, or correlation detections requiring multiple log source joins, typically take longer because the integration testing and false positive validation steps are more involved.

AI can reduce the time spent on hypothesis refinement, rule drafting, test case generation, documentation, and initial tuning analysis. It does not eliminate validation time. Mature programs still need representative production data, test infrastructure, and human approval before a rule becomes production detection logic.

How do you prioritize which detections to build next?

Prioritization should draw from ATT&CK coverage gaps weighted by the organization's specific threat model, recent threat intelligence identifying techniques actively used against similar targets, and internal red team or purple team results documenting behaviors the current program missed.

AI can support this prioritization by comparing coverage maps, incident history, available telemetry, and current detection health. For example, it can identify that a high-priority ATT&CK technique has no reliable detection, or that existing coverage depends on a noisy rule with low analyst trust. Techniques that are most frequently discussed in industry reporting are a weak prioritization signal because they reflect the broader population's threat model rather than yours.

What is a reasonable false positive rate for a deployed detection?

There is no universal threshold, but mature programs track false positive rates by detection type and maintain explicit ceilings. A threshold detection with a 10% false positive rate might be operationally acceptable if it catches high-consequence behavior and analyst time per false positive is low. A behavioral detection with a 30% false positive rate is typically too noisy to provide net value because it degrades analyst trust faster than it contributes coverage.

The important discipline is measuring the rate empirically rather than estimating it. AI can help by tracking false positive patterns, clustering recurring benign causes, and recommending whether the right fix is an exclusion, threshold adjustment, additional correlation requirement, or full rule rewrite.

How does the detection engineering process relate to SOAR playbooks?

Detection engineering produces the rule that fires the alert. SOAR playbooks define what happens after the alert fires, including what enrichment is gathered automatically, what response actions are triggered, and what information is assembled for the reviewing analyst. They operate at adjacent layers of the pipeline.

A high-fidelity detection paired with a well-designed playbook produces consistent, fast outcomes. A low-fidelity detection feeding a sophisticated playbook produces consistent, fast handling of noise. AI can improve both layers by helping write better detection logic and by recommending better enrichment, triage, and response steps based on the alert context.

What is the relationship between detection engineering process maturity and MTTD?

Mean time to detect is a function of detection coverage and detection fidelity together. More coverage reduces the probability that a technique goes undetected. Higher fidelity means the alert that fires is actionable rather than requiring extended investigation to determine whether it represents a real threat.

Process maturity improves both dimensions through systematic hypothesis formation, tested logic, lifecycle governance, and feedback-driven tuning. AI strengthens that maturity by helping identify coverage gaps, generate candidate detections, validate rules, triage bad detections, and recommend tuning actions. Programs with high coverage and high fidelity produce low MTTD. Programs with high coverage and low fidelity produce high alert volume that delays effective detection despite nominal technical coverage.
