Research · Working Paper · February 2026

System Decision Mapping

A justification engine for AI progress in regulated systems.

AI capability evolves rapidly, while autonomy decisions in regulated systems are typically made early and changed cautiously. As a result, many organizations default to limiting AI use — not because models are incapable, but because expanding autonomy requires evidence that is difficult to produce and defend.

System Decision Mapping proposes an approach for organizations that want to revisit those boundaries over time. It focuses on generating decision-level evidence that can support incremental AI enablement — without changing regulatory accountability.

The approach combines two elements: a structured inventory of system decisions informed by governing compliance standards, and a design pattern that treats autonomy as a configurable property rather than a permanent architectural choice.

● Active research · First published: February 2026

Target Audience: People designing, building, or governing AI-enabled enterprise systems in regulated environments.

The problem this solves: Enterprises can't justify expanding AI autonomy over time because no one has inventoried which decisions are constrained by regulation, which are eligible for AI ownership, and how those boundaries shift as model capabilities improve. This is that inventory, plus a design pattern for making the boundaries configurable.

What This Is

A staircase, not a guardrail

Most AI governance work asks: "Should AI do this?" System Decision Mapping asks: "Which decisions does this system actually make, which ones are structurally eligible for AI ownership today, and how does that change as models improve?"

That inversion matters. Governance without a decision inventory is policy theater. Governance with one becomes evidence — the kind regulators can evaluate, auditors can review, and architects can act on.

If you're familiar with governance frameworks like NIST AI RMF: this work doesn't replace them — it starts where they stop.

What it is not — A framework for preventing AI from acting. Not a risk checklist. Not a maturity model.
What it is — A structured method for identifying which decisions are AI-eligible today, which can earn autonomy over time, and how to make those boundaries explicit, reviewable, and upgradeable.
01
Inventory the Decision Surface

Before you can govern AI decisions, you have to know where they are. System Decision Mapping extracts them from real codebases — consistently, before reading a line of business logic.

02
Classify Autonomy Eligibility

Nine evaluation questions surface which decisions are safe to automate today, which require human approval, and which are regulatory hard floors. The governing standard predicts the structure.

03
Make Autonomy Configurable

The output is not a report — it's a design pattern. Decisions become configurable units. Autonomy is a property that can be upgraded with evidence, not a one-time architectural commitment.

Two Core Capabilities

System Decision Mapping delivers two things that are distinct but share the same foundation.

Capability-delta measurement on stable problem statements
Decision surfaces in enterprise systems are structurally stable — they persist across refactors, vendor swaps, and model upgrades. AI capability is the fast-moving layer. This capability measures the delta: for each decision problem, how does the solution path change as models improve? Which steps collapse? Which human scaffolding is no longer needed? The answer tells you where to target AI enablement next — based on evidence, not intuition.

AI-first decision design pattern — autonomy as configuration, not commitment
Each decision point is isolated, named, and tooled. Autonomy tier (Immediate / Earned / Supervised / Aspirational) is a configurable property, not a hardcoded assumption. Evidence gates upgrades. Rollback is structural — a config change, not an emergency. And every human or supervised decision trains the system as you work, so conservative configurations aren't dead ends — they're data collection.
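The configurable-autonomy idea can be expressed in a few lines. This is a minimal sketch, not the paper's implementation: the class names, the tier ordering, and the 500-sample evidence gate are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum

class Tier(Enum):
    """Autonomy tiers, ordered from least to most autonomous."""
    ASPIRATIONAL = 0
    SUPERVISED = 1
    EARNED = 2
    IMMEDIATE = 3

@dataclass
class DecisionPoint:
    """A named decision whose autonomy is configuration, not code structure."""
    name: str
    tier: Tier
    regulatory_floor: bool = False          # Q3: law requires a named human
    evidence: list = field(default_factory=list)  # outcome records backing the tier

    def upgrade(self, min_samples: int = 500) -> bool:
        """Promote one tier only when evidence clears the gate.

        A regulatory floor is the one permanent wall: it can never be
        upgraded past Aspirational, no matter how much evidence exists.
        """
        if self.regulatory_floor or len(self.evidence) < min_samples:
            return False
        self.tier = Tier(min(self.tier.value + 1, Tier.IMMEDIATE.value))
        return True

    def rollback(self) -> None:
        """Demotion is a config change, not an emergency rewrite."""
        self.tier = Tier(max(self.tier.value - 1, Tier.ASPIRATIONAL.value))
```

A conservative configuration is then literally a starting value — e.g. `DecisionPoint("UpdateLoanDelinquencyBucket", Tier.SUPERVISED)` — that later evidence can move up or down without touching the surrounding system.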
The Core Insight

Decisions are stable. AI capability changes quickly. The missing piece is a measurement that lets organizations justify upgrading AI autonomy over time without renegotiating regulatory accountability. That's what this produces.

The capability is not the bottleneck. The justification is. Regulators don't block AI because they distrust the technology — they block it because there's no documented, auditable basis for the autonomy decision. This system produces that documentation.

Four Emergent Decision Types

These types weren't designed in advance — they emerged from analyzing real production codebases. Every decision point extracted from a codebase falls into one of these four categories.

T1
Mapping Decisions — Local concept mapped to external standard. High volume. AI-learnable. Fast feedback. Prime automation candidate.
Signal: High volume + consistent inputs
T2
Drift Correction Decisions — When local practice diverges from the governing standard over time. AI-detectable. Human-correctable. Feedback velocity is the key variable.
Signal: Behavioral shift over large sample
T3
Retirement Decisions — When a code in the governing standard is deprecated. Cascade impact is typically high. Human review warranted. AI can identify and stage, but not decide.
Signal: Cascade scope + irreversibility
T4
Override Decisions — Where the standard doesn't fit the situation. Discretion required. Lowest AI confidence. The most likely to require ongoing human sign-off.
Signal: "Override / Manual / Undo" in command name

The Nine Questions

Applied to every decision point extracted from a codebase. Questions are weighted — reversibility and regulatory floor carry the most signal. A tenth question, covering adversarial risk, is in development.

Q1
Reversibility Test — What are the consequences and how easily can this be undone? Irreversibility is the single strongest argument for human review.
Q2
Money Movement Test — Does money move as a result of this decision? Financial irreversibility compounds all other risk signals.
Q3
Regulatory Test — Does law or regulation require a named human in this decision? This is the only permanent wall. Everything else is earnable.
Q4
Volume / Pattern Test — Is this repeated logic with consistent inputs? High volume + consistency = prime AI candidate.
Q5
Override Signal Test — Is there an "Override" in the command name? That's historical record of where humans didn't trust automation. Pay attention.
Q6
Discretion Test — Does this require unmeasured factors — empathy, relationship context, judgment that defies codification? Lowest AI confidence territory.
Q7
Anomaly Test — Is this detecting a pattern vs. responding to one? Detection is AI-suited. Response to edge-case anomaly requires more caution.
Q8
Cascade Test — What is the downstream impact scope? A decision that triggers a chain of other decisions needs proportionally higher confidence before autonomous execution.
Q9
Feedback Velocity Test — How fast does feedback become knowable? Fast feedback = faster confidence curve. Velocity determines the shape of the learning curve, not the destination.
Q10
Adversarial Risk Test (in development) — Is this decision's input channel susceptible to intentional or unintentional manipulation in ways that materially affect outcomes — regardless of whether the decision is made by AI or a human?

Weighting note: Questions are not equal. Reversibility (Q1) and regulatory requirement (Q3) function as hard floors — a single "yes" on either overrides all other signals. Money movement (Q2) compounds irreversibility. Volume/pattern (Q4) and feedback velocity (Q9) are the primary AI-positive signals. All others are confidence modifiers.
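The weighting note reads naturally as a small scoring routine. The sketch below is a hypothetical rendering of that logic — the answer keys, weights, and thresholds are illustrative assumptions, not the published algorithm:

```python
def classify(answers: dict) -> str:
    """Map yes/no answers to the nine questions onto an autonomy tier.

    Illustrative sketch: Q1 and Q3 act as hard floors, Q2 compounds Q1,
    Q4 and Q9 are the AI-positive signals, the rest modify confidence.
    """
    # Hard floors: a single "yes" on either overrides all other signals.
    if answers.get("q3_regulatory_floor"):
        return "Aspirational"            # law requires a named human
    if answers.get("q1_irreversible"):
        # Money movement (Q2) compounds irreversibility.
        return "Aspirational" if answers.get("q2_money_moves") else "Supervised"

    confidence = 0
    # Primary AI-positive signals.
    confidence += 2 if answers.get("q4_high_volume_pattern") else 0
    confidence += 2 if answers.get("q9_fast_feedback") else 0
    # Confidence modifiers.
    confidence -= 2 if answers.get("q6_discretion") else 0
    confidence -= 1 if answers.get("q5_override_in_name") else 0
    confidence -= 1 if answers.get("q8_wide_cascade") else 0
    confidence += 1 if answers.get("q7_detection_not_response") else 0

    if confidence >= 3:
        return "Immediate"
    if confidence >= 1:
        return "Earned"
    return "Supervised"
```

The structure matters more than the specific numbers: floors short-circuit, positive signals accumulate, and everything else only shifts confidence within the earnable range.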

Four Autonomy Tiers

Each decision point in the inventory carries an autonomy tier — a starting position, not a permanent label. The tier is a configurable property. Upgrades require evidence. Rollback is structural.

✓ Immediate

AI owns from day one. Safe to automate without a track record. No regulatory flag. Fast feedback. Fully reversible.

~ Earned

AI owns after demonstrated performance. Starts supervised, graduates when confidence is established across a meaningful sample.

⚑ Supervised

AI recommends, human approves. The tier for decisions with real downstream consequences, or that precede irreversible action. The human is a cosigner, not the sole decision-maker.

⚠ Aspirational

Human owns, AI advises only. Regulatory floor, irreversible money movement, or deep discretion. AI cannot trigger execution — it prepares the summary and surfaces the impact. Human decides.

Continuous Training — Built In

Every human or Supervised decision is a labeled example. The system trains as you work — not as a separate project. Conservative tier assignments aren't holding patterns; they're structured data collection. The path from Supervised to Earned is paved by the decisions already being made.

For Supervised and Aspirational decisions, the AI classification is deliberately withheld until after the human has made their decision. This prevents decision complacency — the well-documented tendency for humans to anchor on and defer to an AI recommendation even when their independent judgment would differ. The result is a cleaner training signal and a more defensible audit trail: human decisions are genuinely independent, and divergence from the AI is measurable rather than suppressed.
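The withheld-classification workflow is simple to express in code. A minimal sketch, with hypothetical names (`supervised_review`, `ask_human`) standing in for whatever interface a real system would use:

```python
from dataclasses import dataclass

@dataclass
class LabeledExample:
    """One training record produced by a human decision."""
    decision: str
    human_choice: str
    ai_choice: str
    diverged: bool

def supervised_review(decision: str, ai_recommendation: str, ask_human) -> LabeledExample:
    """Collect the human decision first; reveal the AI's view only afterwards.

    Withholding the AI classification until the human commits keeps the
    label independent and makes human-AI divergence measurable rather
    than suppressed.
    """
    human_choice = ask_human(decision)            # no AI hint visible at this point
    diverged = human_choice != ai_recommendation  # divergence is recorded, not hidden
    return LabeledExample(decision, human_choice, ai_recommendation, diverged)
```

The divergence flag is the payoff: the rate at which humans disagree with the withheld recommendation is exactly the evidence a tier upgrade (or demotion) has to cite.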

Monitoring, Learning & Time Dynamics

Autonomy is not a one-time classification. It is earned, monitored, and can be revoked. Feedback velocity determines how fast that earning can happen — and how fast revocation must happen when things go wrong.

Feedback Velocity — How fast outcomes become observable. Loan repayment feedback arrives in days. Healthcare ICD-10 coding drift takes months to surface in audit cycles. Same framework architecture. Completely different confidence curve shapes. Patience is an architectural variable, not a policy preference.
Learning Loops — Confidence adjusts over time as outcomes accumulate. A decision point that starts in Earned can graduate to Immediate — or get demoted to Supervised — based on observed outcomes. The tier assignment is a starting point, not a permanent label.
Drift Detection — When AI behavior diverges from expected patterns over a large sample, that is a T2 Drift Correction event. The system flags it; a human corrects it. Detection is AI-suited. The correction decision is not.
Rollback & Containment — When confidence drops below threshold, autonomy is automatically constrained. Execution reverts to the next lower confidence tier until confidence is re-established. The tier model is the enforcement mechanism, not a separate control layer.
🔄
Model Upgrade Cadence — AI model capabilities are improving roughly every 3–4 months. Autonomy tier assignments made at implementation are not permanent — and that's the point. A structured re-evaluation against the decision inventory, run on each model upgrade cycle, is how organizations justify moving decisions up the tier ladder. This is planned research.
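The rollback-and-containment rule can be sketched as a rolling-confidence monitor. The window size, the threshold, and the equating of "confidence" with recent success rate are illustrative assumptions, not the framework's specified mechanics:

```python
from collections import deque

# Tiers ordered from least to most autonomous.
TIERS = ["Aspirational", "Supervised", "Earned", "Immediate"]

class AutonomyMonitor:
    """Rolling-window confidence tracking with automatic tier containment."""

    def __init__(self, tier: str, threshold: float = 0.95, window: int = 200):
        self.tier = tier
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # most recent observed outcomes

    def record(self, success: bool) -> str:
        """Record one outcome; demote one tier if confidence drops below threshold."""
        self.outcomes.append(success)
        if len(self.outcomes) == self.outcomes.maxlen:
            confidence = sum(self.outcomes) / len(self.outcomes)
            if confidence < self.threshold and self.tier != TIERS[0]:
                # Containment: revert to the next lower tier, then re-earn.
                self.tier = TIERS[TIERS.index(self.tier) - 1]
                self.outcomes.clear()
        return self.tier
```

Note that the tier itself is the enforcement mechanism here: nothing outside the monitor needs to change for execution to become more supervised.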

NIST AI RMF → System Decision Mapping (Operational Use)

This mapping shows how System Decision Mapping can be used as structured input — by humans or AI — to identify the decision landscape implicit in the NIST AI Risk Management Framework, providing a concrete place to start without defining policy or permissions.

What risks exist in this AI system? — NIST MAP calls for identifying risk sources and system context. Mapped feature: Decision-Point Extraction (MAP). Surfaces where decisions actually occur in a live codebase, making risk locations explicit instead of inferred.

How should AI risk be assessed? — NIST MEASURE requires risk assessment but leaves methods open. Mapped feature: Nine Evaluation Questions (MEASURE, MANAGE). Provides a repeatable way to examine each decision point for reversibility, discretion, regulation, and impact.

Who is accountable for AI behavior? — NIST GOVERN establishes accountability as a requirement. Mapped feature: Four-Tier Autonomy Model (GOVERN, MANAGE). Defines when AI may act, when it must defer, and when humans retain ownership — with clear thresholds and upgrade criteria.

What happens when confidence is unclear? — NIST does not define a default posture. Mapped feature: Low-AI-Confidence Defaults (GOVERN, MANAGE). Applies a simple rule: if confidence can't be established, the decision remains human-owned until evidence supports an upgrade.

How do we monitor and respond to issues over time? — NIST calls for monitoring and response. Mapped feature: Feedback Velocity & Learning Loops (MEASURE, MANAGE). Limits autonomy based on how quickly decision outcomes can be observed and corrected.
The Key Point

NIST tells you governance is required. This tells you where it lives in the code. That's the gap — and it's where AI adoption either stalls in policy documents or moves forward with an auditable basis.

Cross-Domain Validation

Predictions derived from standards documentation — SWIFT CBPR+, ISO 20022, BMC Medical Informatics (2020), CMS ICD-10 guidelines — then tested against real codebase analysis. The governing standard predicted the decision structure before the code was read.

682 — Decision points classified across banking and healthcare
2 — Domains validated (banking, healthcare)
<5% — Max tier prediction delta vs. source-derived predictions
70% — Agreement between heuristic and reasoning-based classification

Apache Fineract — Core banking · @CommandType Java annotations · SWIFT, ISO 20022, NACHA · 416 decision points · ✓ Validated
OpenMRS — Electronic medical records · @Authorized service methods · ICD-10, SNOMED, LOINC · 266 decision points · ✓ Validated
Apache OFBiz — Supply chain / ERP · XML service definitions · GS1, UNSPSC, HS codes · 3,493 decision points · ⟳ In research
Why Supply Chain Is In Research

Banking and healthcare standards (SWIFT, ICD-10) carry mandatory compliance enforcement — messages are rejected when they don't conform. GS1 and UNSPSC are voluntary adoption standards. The framework's predictive signal is not the existence of a governing standard but the enforcement mechanism behind it. Supply chain requires a separate evaluation methodology and is planned research.

🔬
OpenMRS vocabulary tell — void/purge (irreversible), retire (archive), unvoid/unretire (override tells). Domain language encodes the autonomy classification directly.
💡
Fineract edge case contrast — LoanGoodwillCredit (5% confidence): discretionary exception, lowest-confidence territory. UpdateLoanDelinquencyBucket (85%): pure pattern matching, prime AI candidate. Same system, opposite profiles.
Feedback velocity contrast — Loan repayment feedback: days. Healthcare ICD-10 coding drift: months to audit cycles. Same architecture. Different confidence curve shapes. Patience is an architectural variable.
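Vocabulary tells like these lend themselves to a trivial name-based screen. A hedged sketch — the keyword lists are examples drawn from the OpenMRS and override observations above, not the extractor's actual tell sets:

```python
# Hypothetical tell lists, following the void/purge/retire/unvoid pattern.
IRREVERSIBLE_TELLS = ("void", "purge", "delete")
OVERRIDE_TELLS = ("override", "manual", "undo", "unvoid", "unretire")

def name_signals(command: str) -> dict:
    """Extract coarse autonomy signals from a command name alone.

    Domain language often encodes the classification directly: a name
    flags irreversibility or historical human override before any
    business logic is read.
    """
    lowered = command.lower()
    return {
        "irreversible": any(tell in lowered for tell in IRREVERSIBLE_TELLS),
        "override": any(tell in lowered for tell in OVERRIDE_TELLS),
    }
```

This is only a first-pass signal — substring matching will over-trigger (e.g. "unvoid" also contains "void") — which is why it feeds the nine questions rather than replacing them.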
Done
Framework defined — Nine questions, four decision types, four autonomy tiers, feedback velocity as architectural variable.
Done
Banking and healthcare validated — 682 decision points. Predictions sourced from SWIFT CBPR+ and ICD-10 documentation independently. All tier predictions within 5% of actuals.
Done
Reasoning baseline established — 416 Fineract decision points classified by AI reasoning (not heuristics) on February 21, 2026. 70.2% agreement with heuristic classifier. Baseline stored for next model version comparison.
In progress
Fineract AI-first reference implementation — Rebuilding the loan origination workflow with explicit tier boundaries as code constructs. Demonstrates what autonomy configuration looks like in practice and surfaces framework gaps in real time.
In research
Supply chain & IAM methodology — OFBiz extracted (3,493 business decisions). Voluntary-standard and policy-governed decision architectures require separate evaluation methodology. Planned extension.
Planned
Model upgrade assessment protocol — Structured methodology for re-evaluating autonomy tier classifications when AI model versions change. AI capabilities are improving every 3–4 months; tier classifications made at implementation are not permanent. This is the core justification mechanism.
Planned
Fourth domain validation — Testing framework against a codebase with mandatory external standard outside banking and healthcare.

The Tool: Decision-extractor

A static analysis tool that extracts and classifies AI decision points from enterprise codebases. The tool is evidence for the framework — not the contribution itself.

🔍
What it does — Scans Java codebases and XML service definitions for decision point patterns. Applies the nine questions as a scoring algorithm. Outputs structured JSON with confidence tier classifications.
🏗
Architecture patterns supported — Annotation-driven command patterns (Fineract), privilege-based service authorization (OpenMRS), XML servicedef files (OFBiz). Three different patterns, one extractor.
📤
Output — Decision point inventory with domain classification, autonomy tier, feedback velocity rating, and regulatory floor flags. Human-readable and machine-readable.
🧪
Test suite & capability-delta measurement — A curated inventory of structured decision problems, each with a model-stamped baseline classification. Re-run on each model upgrade to measure which decisions change tier and why. This is how autonomy upgrades get justified over time — not by trusting that the new model is better, but by measuring where it actually is.
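The capability-delta comparison reduces to a diff between two model-stamped runs. A minimal sketch assuming a hypothetical record schema of `{"name", "tier"}` per decision point — the tool's real output format may differ:

```python
def tier_delta(baseline: list, current: list) -> dict:
    """Compare two classification runs of the same decision inventory.

    Each argument is a parsed JSON-style list of {"name": ..., "tier": ...}
    records (an assumed schema). Returns only the decision points whose
    tier changed between the baseline model and the current one.
    """
    old = {d["name"]: d["tier"] for d in baseline}
    new = {d["name"]: d["tier"] for d in current}
    return {
        name: {"from": old[name], "to": new[name]}
        for name in old.keys() & new.keys()   # compare only shared decision points
        if old[name] != new[name]
    }
```

The output of this diff, run on each model upgrade cycle, is the evidence artifact: which decisions moved, in which direction, and under which model version.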

Feedback Welcome

System Decision Mapping documents a recurring pattern observed across production systems in regulated environments. The most useful feedback is failure cases: where the enforcement mechanism distinction breaks, where the nine questions give ambiguous results, and where the tier model doesn't map cleanly to a real decision. If you've tried to apply this to your own systems, I'd like to compare notes.