# SPDX-License-Identifier: Apache-2.0
# nvidia/nemotron-3-super-120b-a12b:free — model prompting profile (STUB)
#
# Fourth per-model profile in the library proposed by issue #464.
# Metadata and prompting guidance come from OFFICIAL Nvidia Nemotron-3
# documentation (see `official_docs` below for the citation list).
# Entries marked EMPIRICAL come from the 2026-06-10 traces recorded in
# `validated_against` and `community_traces`. `comparison_traces` is
# still empty, pending cross-tier comparison runs.

provider: openrouter
model: nvidia/nemotron-3-super-120b-a12b:free
family: nemotron-3
parameters: 120_000_000_000     # 120B total
active_parameters: 12_000_000_000 # 12B active per pass
tier: C                          # see docs/reference/models.md

context_window: 1_000_000
context_window_notes: |
  Native 1M-token window — the largest in this profile library by
  an order of magnitude. Architecture is hybrid Mamba-Transformer
  Mixture-of-Experts with Multi-Token Prediction (MTP). Pre-training
  cutoff: June 2025; post-training: February 2026. Supported
  languages: English, French, German, Italian, Japanese, Spanish,
  Chinese.

  The 1M window is real, but operators should not treat it as a free
  pass: Nvidia itself documents "Goal Drift" and "context explosion"
  as residual failure modes despite the window size. Long-horizon
  tasks benefit from explicit checkpointing regardless of available
  context.

official_docs:
  - https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16  # Hugging Face model card
  - https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/  # Nvidia announcement blog
  - https://docs.api.nvidia.com/nim/reference/nvidia-nemotron-3-super-120b-a12b  # Nvidia NIM API reference
  - https://docs.nvidia.com/nemotron/latest/usage-cookbook/Nemotron-3-Super/OpenScaffoldingResources/README.html  # Nemotron agentic-coding cookbook
  - https://build.nvidia.com/nvidia/nemotron-3-super-120b-a12b/modelcard  # Nvidia build catalog
  - https://openrouter.ai/nvidia/nemotron-3-super-120b-a12b:free          # OpenRouter routing + pricing

# --- prompting style ---------------------------------------------

prompting_style: chatml_structured
prompting_style_notes: |
  Nemotron-3 Super uses ChatML-style turn markers (`<|im_start|>` /
  `<|im_end|>`). The `<think>` (token id 12) / `</think>` (token id 13)
  tags wrap the reasoning channel. Apply via
  `tokenizer.apply_chat_template()` rather than hand-rolling markers.

  No system-prompt structure is prescribed by Nvidia — operators have
  latitude to shape the system turn for their use case.

reasoning_effort_control: true
reasoning_effort_levels: ["low", "on", "off"]  # quoted: bare on/off parse as booleans under YAML 1.1
reasoning_effort_mechanism: |
  Toggled via `chat_template_kwargs` passed through the template:
  - `enable_thinking: True` (default) — full reasoning enabled
  - `enable_thinking: True, low_effort: True` — reduced reasoning tokens
  - `enable_thinking: False` — direct-answer mode, no reasoning channel

  Unlike gpt-oss's graded `low | medium | high`, Nemotron exposes
  a binary on/off plus a `low_effort` sub-mode within "on."
reasoning_effort_defaults:
  # Values are quoted so YAML 1.1 parsers keep "on"/"off" as strings.
  summarization: "off"
  extraction: "off"
  formatting: "off"
  product_requirements: low
  code_review: "on"
  tool_use: "on"
  architecture: "on"
  debugging: "on"
  multi_document_synthesis: "on"
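
# Illustrative sketch (not part of the profile schema): one way the
# three effort levels could map onto the `chat_template_kwargs` listed
# in `reasoning_effort_mechanism`. The `effort_to_kwargs` key is a
# hypothetical name used only for this example; the flag names and
# values are the ones documented above (YAML `true`/`false` standing in
# for Python `True`/`False`).
#
# effort_to_kwargs:
#   "on":
#     enable_thinking: true
#   low:
#     enable_thinking: true
#     low_effort: true
#   "off":
#     enable_thinking: false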

source_priority_directive: optional
source_priority_directive_notes: |
  Not documented in primary Nvidia sources. Conservative default:
  include an explicit source-priority section in any skill that wires
  Nemotron-3 Super in for grounded generation.

harmony_format: false
harmony_format_notes: |
  Nemotron uses Nvidia's ChatML-style format, not OpenAI Harmony.

function_calling_format: |
  OpenAI-compatible chat-completions `tool_calls` deltas. Recommended
  tool-call parser across vLLM / SGLang / TRT-LLM is `qwen3_coder`
  (pass `--tool-call-parser qwen3_coder` to the inference engine).

  For coding-agent workflows, Nvidia recommends passing
  `chat_template_kwargs={"force_nonempty_content": True}` to suppress
  the documented "reasoning-only turn with empty visible content"
  failure mode (corroborates helmdeck ADR 053).
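
# Illustrative sketch (assumptions labeled): a chat-completions request
# body, written as YAML, that combines the sampling values and the
# coding-agent flag described in this profile. It ASSUMES an
# OpenAI-compatible server that accepts `chat_template_kwargs` in the
# request body; confirm with the serving backend's docs whether and
# where that field is honored. The message text and the tool schema are
# placeholders, not a documented helmdeck contract.
#
# model: nvidia/nemotron-3-super-120b-a12b:free
# temperature: 1.0
# top_p: 0.95
# chat_template_kwargs:
#   enable_thinking: true
#   force_nonempty_content: true
# messages:
#   - role: user
#     content: "{user message}"
# tools:
#   - type: function
#     function:
#       name: helmdeck__pack-status
#       parameters:            # assumed shape, for illustration only
#         type: object
#         properties:
#           job_id: {type: string}
#         required: [job_id]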

# --- skill-prose guidance ----------------------------------------

best_practices:
  - "Use `temperature=1.0` and `top_p=0.95` across all tasks and serving backends; Nvidia recommends these values for 'reasoning, tool calling, and general chat alike'."
  - "Toggle thinking deliberately: `enable_thinking: True` for multi-step planning, `low_effort: True` for routine reasoning, `enable_thinking: False` for short deterministic tasks."
  - "For coding agents add `force_nonempty_content: True` to chat-template kwargs — prevents the empty-content reasoning-only-turn failure mode (ADR 053)."
  - "Lean into the 1M context window for long agentic sessions — Nvidia frames the goal as 'full working context without forced truncation across long sessions.'"
  - "Pair with Nemotron-3 Nano for individual-step execution; Nemotron-3 Super excels at 'complex, multi-step activities' but Nvidia's own recommendation is a Super-plus-Nano deployment for long chains."
  - "Use the `qwen3_coder` tool-call parser across vLLM / SGLang / TRT-LLM — Nvidia's recommended choice for parsing tool-call deltas."
  - "Leverage MTP (multi-token prediction): Nvidia claims '2-3x wall-clock speedup on structured generation like code and tool calls.'"
  - "EMPIRICAL 2026-06-10: per-use-case AGENTS.md hardening matters MORE than profile guidance alone. Three specific hardenings empirically closed both Nvidia-documented failure modes (PR #481→#484 A/B: 24 calls/0 deposit → 7 calls/deposit+verify): (a) explicit tool whitelist forbidding filesystem write/read packs, (b) async pattern bounds for content.ground (call ONCE, poll max 5x, no parallel jobs), (c) invalidation rule for tool calls generated as plain-text XML."
  - "EMPIRICAL 2026-06-10: bound polling for ASYNC packs explicitly in AGENTS.md. content.ground returns a job_id + state:working; without explicit 'max 5 polls then honest timeout' guidance, Nemotron spawned 6 concurrent jobs in the 2026-06-10 baseline."
  - "EMPIRICAL 2026-06-10: honest-failure recovery survives upstream pack failures. In the hardened v2 trace the content.ground job ACTUALLY failed upstream mid-session; the agent honored the 'don't retry' rule, reported the failure honestly, and ended Turn 2 with the literal handoff line. The operator's 'deposit' reply triggered Turn 3, which fired artifact.put + verify_manifest and returned all_present:true on the un-grounded draft. The audit-callback pattern is resilient, not just clean-path."

anti_patterns:
  - "Goal drift: Nvidia's agentic-coding cookbook explicitly names 'the agent loses alignment with the original task as context accumulates' as a residual failure mode. EMPIRICALLY REPRODUCED 2026-06-10 (PR #481): baseline AGENTS.md without explicit tool whitelist saw the agent drift to filesystem write/read packs (4 write + 1 read calls) NOT prescribed by the workflow. Mitigation: AGENTS.md whitelist forbidding non-helmdeck packs (PR #484 closed this empirically)."
  - "Tool-call failures: Nvidia documents 'malformed or hallucinated function calls that break the execution loop' as the second residual failure mode. EMPIRICALLY REPRODUCED 2026-06-10 (PR #481): final assistant turn started generating `<tool_call><function=helmdeck__pack-status>...` as PLAIN TEXT instead of the OpenAI toolCall format. Mitigation: AGENTS.md invalidation rule explicitly catching plain-text tool calls (PR #484 closed this empirically)."
  - "Reasoning-only-empty-content turns: addressed via `force_nonempty_content: True` — without this flag, Nemotron may emit thoughts and stop without a visible answer (corroborates ADR 053). EMPIRICALLY OBSERVED 2026-06-10 (PR #484 v2 trace): the first 'proceed' reply triggered this; operator had to retry."
  - "Context explosion in long agentic traces: 1M context window does not eliminate goal drift; treat the window as a tool, not a guarantee."
  - "Trusting Super alone for 5+ step chains — Nvidia's own deployment guidance recommends Super + Nano handoff for 'complex, multi-step activities.'"
  - "Conflicting sampling defaults — Nvidia says `temperature=1.0, top_p=0.95` for ALL tasks; community-sourced lower values (e.g., 0.6) conflict with Nvidia's universal guidance."
  - "EMPIRICAL 2026-06-10: deploying Nemotron with PROFILE GUIDANCE BUT NO HARDENED AGENTS.md. The docs-sourced profile (ChatML format, sampling, enable_thinking, force_nonempty_content) was insufficient to prevent the documented failures from firing — both anti-patterns above reproduced in the v1 baseline. Per-use-case AGENTS.md hardening is the load-bearing layer, not the profile alone."
  - "EMPIRICAL 2026-06-10: parallel async pack jobs. The v1 baseline spawned 6 simultaneous content.ground jobs; most hung at progress:10%; only ONE completed (on a tiny 46-byte test file). Bound async polling explicitly: 'call ONCE, poll max 5x, then pack-result OR honest timeout/failure. NEVER start a parallel job.'"
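
# Illustrative sketch (HYPOTHETICAL: not the operator's AGENTS.md and
# not a helmdeck schema): the three hardenings named in the EMPIRICAL
# entries above, restated as data. Every key name is invented for this
# example, and the allowed-tool list is an assumption based on the
# tools this profile mentions; the real AGENTS.md is operator-local
# prose.
#
# agents_md_hardening_sketch:
#   tool_whitelist:
#     allowed:
#       - helmdeck__content-ground
#       - helmdeck__pack-status
#       - helmdeck__pack-result
#       - helmdeck__artifact-put
#       - helmdeck__verify_manifest
#     forbidden:
#       - filesystem write
#       - filesystem read
#   async_bounds:
#     content.ground:
#       max_calls: 1                 # call ONCE
#       max_status_polls: 5          # then pack-result OR honest timeout/failure
#       parallel_jobs_allowed: false
#   invalidate_response_if:
#     - 'tool call emitted as plain-text XML, e.g. <tool_call><function=...>'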

# --- chain-call reliability (the load-bearing helmdeck concern) --

chain_call_reliability:
  short_chains: high      # 1-2 pack calls per turn
  medium_chains: medium   # 3-4 pack calls per turn — Nvidia's design sweet spot
  long_chains: medium     # 5+ pack calls per turn — Nvidia recommends Super+Nano handoff
  notes: |
    Nvidia explicitly designed Nemotron-3 Super for "consistent,
    reliable behavior across many sequential steps" and post-trained
    it "across 15 environments in NeMo Gym covering multi-step tool
    use." The 1M context window plus MTP are marketed as mitigations
    for goal drift and tool-call failures.

    However: no quantitative chain-success benchmarks are surfaced in
    primary sources, and Nvidia itself recommends a Super + Nano
    deployment pattern for "complex, multi-step activities" — which
    implies Super alone is not the optimal long-chain executor.

    The estimate above is conservative: high confidence on 1-2 step
    chains (explicit training + Qwen-style parser support), medium on
    3-4 (Nvidia's sweet spot but no benchmarks), medium on 5+
    (Nvidia's own multi-model recommendation lowers single-model
    confidence).

    EMPIRICAL REFINEMENT 2026-06-10 (PR #481 → PR #484 A/B): chain-call
    reliability is WORKFLOW-SHAPE-DEPENDENT, not a pure model property.
    The same model on the same prompt produced 24 pack calls / 0 deposit
    with a docs-only AGENTS.md (v1 baseline) vs 7 pack calls / deposit
    + verify with all_present:true after AGENTS.md hardening (v2). The
    short/medium/long buckets above describe the model's CAPACITY; the
    actual call counts depend on whether the operator's AGENTS.md
    constrains the workflow with explicit tool whitelists, async pattern
    bounds, and plain-text tool call invalidation. See community_traces
    below for both ends of the A/B.

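# --- skill prose template for nemotron-3-super --------------------
#
# NOTE: the "Thinking", "For coding agents", and "Sampling" sections of
# the template name request-level settings (`chat_template_kwargs` and
# sampling parameters, see `reasoning_effort_mechanism`). Treat them as
# a checklist for the request, not as system-prompt text that switches
# the behavior on. The turn markers are shown for reference; per
# `prompting_style_notes`, apply them via
# `tokenizer.apply_chat_template()`.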

prompt_template: |
  <|im_start|>system
  ## Role
  {what the agent IS — voice, expertise, scope}

  ## Source priority
  1. {primary source}
  2. {secondary source}
  3. {fallback source}
  4. If not supported, say what is missing — do not guess.

  ## Constraints
  - {hard rule 1}
  - {hard rule 2}

  ## Output format
  - {required sections, in order, with explicit content shape}

  ## Success criteria
  - {criterion 1 — machine-checkable}
  - {criterion 2}

  ## Thinking
  enable_thinking: {True | False}
  low_effort: {True | False}   # only when enable_thinking is True

  ## For coding agents
  force_nonempty_content: True   # prevents reasoning-only-empty turns

  ## Sampling
  temperature=1.0, top_p=0.95   # Nvidia's universal recommendation
  <|im_end|>
  <|im_start|>user
  {user message}<|im_end|>
  <|im_start|>assistant
  <think>{reasoning trace if enable_thinking}</think>
  {final response}<|im_end|>

# --- empirical-validation pointer --------------------------------
#
# Baseline (v1) and hardened (v2) traces were captured on 2026-06-10;
# see `validated_against` and `community_traces` below. Operators
# running custom Nemotron-3 Super agents are invited to submit further
# traces via PR; see docs/howto/add-free-models.md § 7 for the
# contribution paths.

validated_against:
  - skill: tech-blog-publisher  # shape-equivalent; operator workspace per memory rule
    workspace: <operator-personal — sanitized for public docs>
    agent: <Tier C agent on nvidia/nemotron-3-super-120b-a12b:free, three-turn iterative workflow>
    baseline: <same model + prompt, docs-only AGENTS.md (no per-use-case hardening)>
    metric: real pack calls + verify_manifest call + all_present result + workflow shape adherence
    trace_dates: [2026-06-10]
    finding: |
      Direct A/B on the same Tier C model (`nvidia/nemotron-3-super-120b-a12b:free`)
      with the SAME prompt (eBPF kernel rootkit detection deep-dive) but two
      different AGENTS.md variants:

      | Metric                          | Baseline AGENTS.md | Hardened AGENTS.md |
      |---------------------------------|--------------------|--------------------|
      | Total pack calls                | 24                 | 7                  |
      | helmdeck__content-ground calls  | 6 (parallel chaos) | 1                  |
      | helmdeck__pack-status polls     | 12                 | 4                  |
      | Filesystem write/read calls     | 5 (not authorized) | 0                  |
      | helmdeck__artifact-put          | ❌                 | ✅                 |
      | helmdeck__verify_manifest       | ❌                 | ✅                 |
      | all_present                     | n/a                | TRUE               |
      | Plain-text tool calls           | YES (anti-pattern) | 0                  |
      | Decision                        | profile-not-enough | profile-works      |

      Three hardenings in the v2 AGENTS.md empirically closed BOTH
      Nvidia-documented residual failure modes (Goal Drift + Tool-Call
      Failures): (1) explicit tool whitelist forbidding filesystem
      write/read packs, (2) async pattern bounds for content.ground
      (call ONCE, poll max 5x, no parallel jobs), (3) invalidation
      rule for tool calls generated as plain-text XML.

      Resilience: the content.ground job ACTUALLY failed upstream in
      v2 (state transitioned working → failed by poll #4). The agent
      honored the "don't retry" rule, reported the failure honestly,
      ended Turn 2 with the literal handoff line. Operator's "deposit"
      reply triggered Turn 3 which fired artifact.put +
      verify_manifest correctly with the un-grounded draft, returning
      all_present:true.

      Strategic lesson: the YAML profile (sampling, reasoning controls,
      Nvidia-documented best practices) is necessary but NOT sufficient
      for reliable agentic behavior on Tier C Nemotron. Per-use-case
      AGENTS.md hardening is the load-bearing layer. The empirical
      bar for "profile-works" on Tier C models is: the workflow shape
      constrains the documented failure modes; the profile guidance
      tunes what the model CAN do.

      See community_traces[] below for the per-trace metric_summary
      data and PRs #481 (baseline) + #484 (hardened) for the
      narrative-level analysis.

# Empirical 2026-06-10: first community trace captured via
# scripts/helmdeck-trace CLI. The Nvidia-documented "Goal Drift" and
# "Tool-Call Failures" residual failure modes both fired exactly as
# described in the agentic-coding cookbook. See the first entry below.
community_traces:
  - contributor: tosin2013
    use_case: blog-drafter-iterative
    session_date: 2026-06-10
    metric_summary:
      real_pack_calls: 24       # 6 content-ground + 12 pack-status + 1 pack-result + 4 filesystem-write + 1 filesystem-read
      verify_manifest_called: false
      all_present: null
      hallucination_count: 0    # no FALSE-claim hallucinations — model just never reached the deposit step
      simplification_observed: false   # workflow EXPANDED chaotically, not simplified
    decision: profile-not-enough
    notes: |
      First Press-Nemotron trace; 15-minute session that never reached
      the deposit step despite the model running freely (no 429 from
      Nvidia upstream). Reproduces BOTH Nvidia-documented Nemotron-3
      failure modes from the agentic-coding cookbook:

      1. **Goal Drift** — agent drifted from "blog draft + deposit"
         to "spam content.ground with multiple concurrent jobs and
         write random files." Used filesystem `write`/`read` packs
         (NOT prescribed by AGENTS.md) to save outline.md, draft.md,
         temp_draft.md, test.md to the workspace dir. Six simultaneous
         content.ground jobs started; most hung at "progress: 10%";
         only ONE completed and only on a tiny 46-byte test file.

      2. **Tool-Call Failures** — final assistant turn started
         generating `<tool_call><function=helmdeck__pack-status>
         <parameter=job_id>...` as PLAIN TEXT instead of using the
         OpenAI toolCall format. This is the literal "malformed
         function call" anti-pattern Nvidia documents.

      The per-model profile (ChatML format, sampling, enable_thinking,
      force_nonempty_content) was NOT sufficient to prevent these
      failures. Per-use-case AGENTS.md hardening — explicit "no
      filesystem packs," "content.ground is async; wait for
      state:completed before retry," "one tool call at a time" — is
      the apparent next step. Iterating the Press-Nemotron AGENTS.md
      (operator-local per memory rule) sets up a v2-vs-v1 A/B for
      the next community_traces[] entry.

      Useful empirical observations beyond the failure modes:
      - content.ground is ASYNC (returns job_id + state:"working").
        AGENTS.md says "Call content.ground ONCE" but doesn't mention
        the polling pattern. Operators iterating on the Nemotron
        recipe should add explicit "Call once, then poll pack-status
        until state:completed, then call pack-result" guidance.
      - The agent has access to filesystem packs (write/read) that
        AGENTS.md never authorized. These probably come from a
        separate Claude Code MCP integration in OpenClaw. AGENTS.md
        should explicitly enumerate allowed packs.
    pr_or_issue_url: https://github.com/tosin2013/helmdeck/issues/475

  # ----- v2-hardened entry (same agent, same prompt, hardened AGENTS.md) -----
  - contributor: tosin2013
    use_case: blog-drafter-hardened-v2
    session_date: 2026-06-10
    metric_summary:
      real_pack_calls: 7        # 1 content-ground + 4 pack-status + 1 artifact-put + 1 verify-manifest
      verify_manifest_called: true
      all_present: true
      hallucination_count: 0
      simplification_observed: false   # workflow tracked the prescribed shape closely
    decision: profile-works
    notes: |
      Second nemotron trace — direct A/B against the v1 baseline above
      using the SAME model, SAME prompt (eBPF kernel rootkit detection
      topic), and a hardened AGENTS.md (operator-local per memory rule).

      ## A/B summary
      | Metric                               | v1 baseline | v2 hardened |
      |--------------------------------------|-------------|-------------|
      | Total pack calls                     | 24          | 7           |
      | helmdeck__content-ground calls       | 6           | 1           |
      | helmdeck__pack-status polls          | 12          | 4           |
      | Filesystem write/read calls          | 5           | 0           |
      | helmdeck__artifact-put called        | NO          | YES         |
      | helmdeck__verify_manifest called     | NO          | YES         |
      | all_present                          | n/a         | TRUE        |
      | Plain-text tool calls (anti-pattern) | YES         | 0           |

      ## Three hardenings that empirically closed the gap

      1. **Explicit tool whitelist** ("You MAY call ONLY these tools")
         forbidding filesystem write/read packs. Empirically: 0
         filesystem calls (vs 5 in v1). The model honored the
         negative constraint cleanly.

      2. **Async pattern bounds** for content.ground: "Call ONCE,
         poll pack-status max 5x, then pack-result OR honest timeout/
         failure. NEVER start a parallel job." Empirically: 1
         content.ground call (vs 6 in v1) + 4 pack-status polls
         (within the 5-budget).

      3. **Plain-text tool call invalidation** — explicit rule that
         tool calls generated as plain text (e.g., `<tool_call>
         <function=...>`) invalidate the response. Empirically: 0
         plain-text tool calls in v2 (vs the documented anti-pattern
         that fired in v1's final assistant turn).

      ## Resilience observation worth pinning

      The content.ground job ACTUALLY FAILED upstream in v2 (state
      transitioned working → failed by poll #4). The agent honored
      the "don't retry" rule from the hardened AGENTS.md and
      reported the failure honestly in the Turn 2 response,
      ending with the literal handoff line. Operator replied
      "deposit"; Turn 3 fired artifact.put + verify_manifest
      correctly with the un-grounded draft, returning
      all_present:true.

      This demonstrates the hardened workflow is **resilient to
      upstream pack failure**, not just clean-path. The audit-callback
      pattern survives a real upstream failure in the middle of the
      session.

      ## Decision rationale

      `profile-works`: with per-use-case AGENTS.md hardening on top of
      the docs-only profile, the audit-callback pattern fires reliably
      and the model honors complex negative constraints (no filesystem,
      no parallel jobs, no plain-text tool calls). The Nvidia-documented
      failure modes that fired uncontested in v1 are empirically
      closed in v2.

      Strategic lesson for future Nemotron operators: the YAML profile
      gives you the prompting shape, sampling, and reasoning controls
      Nvidia recommends. The AGENTS.md gives you the workflow constraints
      that turn those mechanics into reliable agentic behavior. You
      need both layers.
    pr_or_issue_url: https://github.com/tosin2013/helmdeck/issues/475

comparison_traces: []  # awaiting cross-tier comparison runs
