Punk Docs - What's New

User-facing release notes for Punk. Newest first. For engineering history, see CHANGELOG.md in the repository root.

June 2026

Runtime and gateways

Deterministic tool-plan cache hits now return fresh per-run tool-call ids, skip malformed plan arguments, and avoid counting tripwire-blocked plans as served cache hits.
Gateway validation now rejects malformed streaming flags and Anthropic side-channel controls before tracing, token counting, cache lookup, or provider pass-through, so client typos fail clearly instead of creating misleading run evidence.
Evidence packets now keep related artifact and pattern lifecycle audits even when many newer audit rows exist, savings summaries separate completed served routes from blocked rows, and the dashboard route mix uses distinct swatches for semantic cache, plan cache, and model substitution.
Built-in semantic-web sessions now handle common visible form inputs and label toggles more like browsers do, so web actions submit url, tel, date/time, checkbox, and radio values more reliably.
Live gateway responses now keep serving through temporary optimization-evidence read failures, with route explanations or trace events showing when model substitution or shadow evaluation was skipped.
The dashboard now keeps cached data scoped to the active organization and reports failed operator actions with the specific operation and server message, reducing stale cross-org displays during transient refresh failures.
Dashboard chat and agents now reject malformed request bodies without creating records or kicking off runs, and tool-call-only chat replies fail visibly instead of appearing as blank assistant messages.
OpenAI-prefixed model ids such as openai:gpt-4o-mini now call the upstream OpenAI API with the base model name, use the right pricing, and share cache/learning evidence with the unprefixed model.
Tenant runtime settings are stricter and cleaner: webhook URLs are trimmed and cannot be blank, while retention and approval exception windows must be whole positive values so bad settings cannot create surprising sweeps, retries, or exception lifetimes.
Gateway and SDK tool tracing now classify subscription/payment, access-control, credential, deployment, and public-write tool names more conservatively, so unsafe low side-effect declarations cannot bypass approval or cache safeguards.
Replay and evidence packets now preserve repeated tool completions, so common multi-call traces remain replayable instead of losing proof because the same tool name appeared twice.
Learning now accounts for declared tool side-effect levels before tools run, keeping risky non-plan patterns conservative while still allowing risky tool_plan artifacts to require human promotion.
Artifact route explanations now show rollout state, including canary admissions and stable artifact serves, so optimized responses are easier to audit.
Optimized cache hits now ignore filtered or malformed final-answer rows and malformed tool-plan rows instead of replaying them as valid responses.
SDK tool tracing now treats malformed cache hits as misses, preserves cached null results, completes policy-held tool traces, and includes run/route evidence in gateway errors.
Ignored prompt patterns now stay on the live path across caches, artifacts, semantic cache, and model substitution, with route explanations showing the skipped negative-cache decision.
Gateway responses now preserve provider stop reasons through live responses, cache hits, and streams, while semantic caching skips incomplete or filtered answers instead of replaying partial text.
Semantic-web session opens and navigations now refresh the SOM cache used by fetches, while cached redirect aliases must pass URL guards and match final-URL snapshot evidence before serving.
Successful OpenAI-compatible upstream responses that omit an assistant choice/message now fail over with evidence instead of producing blank live answers.
Provider-prefixed model ids now choose the intended failover sibling model, so prefixed Claude/OpenAI traffic does not drift to a larger default backup.
Evidence packets for optimized artifact runs now include the related promotion, rollback, quarantine, and learning audit decisions, with structured artifact and pattern IDs for support handoffs.
Deterministic tool-plan optimizations now stay scoped to the declared tool schema and side-effect level, so cached plans and learned artifacts cannot cross between read-only and riskier tool contracts.
Gateway requests now normalize accidental model-id whitespace before routing/cache keys and reject duplicate or malformed tool declarations before they enter policy, cache, or learning evidence.
Live route explanations now show when semantic-cache serving or model substitution was skipped, and cache-health stats no longer count malformed negative-cache memos as hits.
Semantic-web sessions and workflow run-history APIs now report malformed requests clearly, and non-admin dashboards no longer show a false gateway-unreachable warning from the approvals badge poll.
Punk now runs as a wire-compatible gateway for OpenAI-style chat completions and Anthropic-style messages.
Every response carries a run id and selected route so teams can inspect cost, latency, policy, traces, and savings.
Provider routing supports live providers, deterministic local mock mode, tenant BYOK keys, configurable timeouts, and auditable failover.
Direct Gemini routing now appears consistently in readiness checks, BYOK provider-key management, and cost accounting for gemini:* model ids.
Saving a tenant BYOK provider key now rejects blank values, and production readiness counts only usable tenant keys, so a bad stored key cannot silently replace a healthy platform key or make deployment checks look ready.
Provider responses that come back malformed now fail over with clearer evidence when a backup is available, and usage-less compatible providers still produce estimated token and cost accounting.
SDK chat calls now fail clearly on malformed successful gateway payloads instead of returning empty answers, and SDK tool-result caching now requires an explicit subject so read-through tools do not accidentally reuse tenant-global results.
Deterministic cache hits now stay isolated by stop sequences, tool-choice directives, and secondary sampling controls, so optimized responses respect the provider-visible request shape.
Malformed semantic-cache rows are now ignored instead of served, keeping optimized answers on valid cache evidence.
Shared SOM cache reads now ignore malformed imported snapshots, keeping semantic-web callers on valid page-runtime evidence.
OpenAI-style gateway requests now reject malformed models, empty message lists, and invalid roles before they enter traces or learning.
OpenAI-style gateway requests now also reject malformed response-format, sampling, multi-choice, and logprob controls before cache lookup, so invalid requests cannot be served from an unrelated exact-cache entry.
Anthropic-style gateway requests now reject invalid message roles and malformed max_tokens before they enter traces or learning.
Gateway validation now catches malformed numeric generation controls before tracing, and streamed Anthropic usage accounting stays accurate when upstream summaries are sparse.
Gateway validation now catches malformed message content, tool histories, tool schemas, tool choices, and Anthropic stop sequences before they enter traces, caches, learning, or provider pass-through.
Gateway validation now also rejects invalid tool-choice targets, malformed OpenAI tool-call history, missing Anthropic message content, and out-of-range Anthropic sampling controls before they enter routing or token counting.
OpenAI structured non-text content and object-shaped tool-call arguments now stay faithful in Punk's canonical traces, improving replay and routing evidence for multimodal and compatible-provider traffic.
Anthropic non-text content now carries stable identity in Punk's canonical request keys, so distinct images, documents, and tool-use blocks cannot reuse the same exact-cache entry.
Anthropic structured stream errors now fail visibly and avoid completed-run/cache evidence, so clients and operators no longer see failed streams as clean completions.
Anthropic-prefixed model ids now call the live Anthropic API with the correct Claude model name, keep cost accounting on the base model, and handle more provider/proxy SSE framing variants.
OpenAI-compatible streaming now handles common proxy/provider SSE framing quirks and upstream error frames more faithfully, avoiding empty successful-looking completions when a streamed provider actually failed.
Learned model-substitution responses no longer seed earlier cache tiers under the requested model, so turning substitution off returns traffic to live instead of replaying stale cheap-model cache hits.
Learned model-substitution settings now reject invalid mappings, substitution failover explanations clearly show which backup model actually served, and observe-mode keys report earned substitution savings without changing live traffic.
Streamed OpenAI-compatible tool-call requests now preserve tool calls even when a provider exposes only text deltas.
Optimized tool plans now stay bound to the request's declared tools: cached plans and tool_plan artifacts that reference undeclared tools fail open instead of returning an unexpected tool call.
OpenAI-compatible refusal responses no longer come back blank, and tool-using conversations keep native tool-call/tool-result structure when traffic crosses between OpenAI-style and Anthropic-style gateway wires.
Malformed JSON on OpenAI-compatible and Anthropic-compatible gateway requests now returns clear provider-shaped invalid-request errors without creating trace runs, and oversized JSON mutation/cache payloads are rejected before processing.
Exhausted failover chains now show as failed backup attempts instead of implying a backup served the response.
Generation caps, Anthropic tool TTL metadata, provider-prefixed model pricing, and malformed cache entries are now handled more consistently across gateway wires.
Semantic response-cache serving now stays on plain text-only traffic and counts hits only for live entries in the matching tenant.
Semantic-cache serving now respects request subject and sampling boundaries, and reports cache-store decisions more accurately in traces and observe-mode explanations.
Semantic-web fetches and built-in web sessions are now stricter about what they fetch: they reject upstream HTTP error pages, guard redirect targets with the caller's auth policy before following them, and block private-network access outside true open-dev mode. Malformed cached SOM snapshots are ignored and refetched, redirected SOMs are recorded under the final observed URL, fragment-only URL variants are canonicalized, unchanged SOM snapshots are no longer duplicated, and snapshot history ordering stays stable. Sessions honor disabled form controls and report malformed form actions as failed in-session actions with audit evidence, and the dashboard shows requested-to-final redirects.
Semantic-web fetch caches and snapshots now stay scoped to the authenticated tenant, including workflow and Chorus research fetches.
Semantic-web fresh fetches now reject malformed bypassCache controls instead of quietly reusing cached SOM evidence.
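The model-id notes in this section (whitespace normalization, provider prefixes such as openai:gpt-4o-mini, and shared pricing and cache evidence with the unprefixed model) can be pictured with a small sketch. This is not Punk's parser: parseModelId, cacheModelKey, and the "anthropic" prefix spelling are assumptions made for illustration, and the real gateway may treat unknown prefixes differently.

```typescript
// Hypothetical helper; Punk's actual model-id handling is not shown in these notes.
type Provider = "openai" | "anthropic" | "gemini";

interface ParsedModelId {
  provider: Provider | null; // null when the id carries no recognized prefix
  baseModel: string; // name sent upstream and used for pricing lookups
}

// Prefix spellings are assumptions; the notes only show "openai:" and "gemini:*".
const KNOWN_PROVIDERS: readonly Provider[] = ["openai", "anthropic", "gemini"];

function parseModelId(raw: string): ParsedModelId {
  const trimmed = raw.trim(); // normalize accidental whitespace before routing
  if (trimmed.length === 0) {
    throw new Error("model id must not be blank");
  }
  const colon = trimmed.indexOf(":");
  if (colon > 0) {
    const prefix = trimmed.slice(0, colon);
    const rest = trimmed.slice(colon + 1).trim();
    const provider = KNOWN_PROVIDERS.find((p) => p === prefix);
    if (provider !== undefined) {
      if (rest.length === 0) {
        throw new Error(`model id "${trimmed}" has a provider prefix but no model name`);
      }
      return { provider, baseModel: rest };
    }
  }
  // Unrecognized or absent prefix: treat the whole id as the model name.
  return { provider: null, baseModel: trimmed };
}

// In this sketch, prefixed and unprefixed ids map to the same pricing/cache key.
function cacheModelKey(raw: string): string {
  return parseModelId(raw).baseModel;
}
```

Under these assumptions, " openai:gpt-4o-mini " and "gpt-4o-mini" resolve to the same base model, which is the property the notes describe for pricing and cache evidence.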

Chorus

model: "punk/chorus" is the governed intelligence route for harder research, coding, analysis, policy, and operational work.
Chorus records claim, routing, verifier, evidence, cost, latency, and receipt traces without exposing private routing formulas or prompt internals.
Request controls choose latency, quality, research, budget, receipt, local/private, and evaluation posture from one model id.
The dashboard shows Chorus runs, receipts, claim graphs, route plans, verifier status, confidence, cost, and estimated savings.
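Because the gateway accepts OpenAI-style chat completions, selecting Chorus should amount to setting the model field. The sketch below only builds a request body; buildChorusRequest is a hypothetical helper, and the request controls mentioned above are omitted because their field names are not given in these notes.

```typescript
// Minimal OpenAI-style chat request shape; only the fields used here are modeled.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatCompletionRequest {
  model: string;
  messages: ChatMessage[];
}

// Hypothetical helper, not part of @punk/sdk.
function buildChorusRequest(prompt: string): ChatCompletionRequest {
  if (prompt.trim().length === 0) {
    throw new Error("prompt must not be empty");
  }
  return {
    model: "punk/chorus", // model id taken from the release note above
    messages: [{ role: "user", content: prompt }],
  };
}

const chorusBody = buildChorusRequest("Compare the two failover policies and cite evidence.");
console.log(JSON.stringify(chorusBody));
```

The body would then be sent to your deployment's OpenAI-style chat completions endpoint with your API key; the URL and authentication details depend on the deployment and are not assumed here.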

Workflows, chat, and agents

Workflow control-plane requests now fail clearly when a manual API run tries to impersonate the scheduler, a template instantiation sends a malformed name, or an import contains no workflows or non-object entries.
The dashboard includes chat with route-visible economics on each assistant reply.
Workflow semantic-web nodes now keep the same private-network protection through fetch redirects, session-open redirects, and action-triggered navigation that they apply to the starting URL.
A useful chat can be saved as a scheduled single-task agent.
Dashboard chat threads are now private to their owning login user within an org, and repeated chat cache hits stay scoped to the resolved user or API key.
Manual workflow runs now reject malformed input payloads instead of quietly executing with empty input.
Workflow and credential mutation requests now reject malformed JSON, manual workflow runs reject typoed trigger values, and malformed queued workflow jobs fail visibly instead of running with defaulted evidence.
Manual agent runs now reject malformed input payloads instead of quietly executing with empty input.
The visual workflow builder supports LLM nodes, web fetches, web actions, choices, variables, map iteration, MCP tool calls, notifications, templates, import/export, and node timelines.
Workflow and agent failures now show failed nodes, child gateway-run links when available, and a rerun path using the stored input.
Workflow notification steps now honor continueOnError, scheduled invalid graphs are refused without retry churn while appearing as failed runs with audit evidence, and workflow run timelines stay bounded under oversized limits.
Workflow maps now respect global runtime budgets even when per-item errors are allowed, unexpected execution errors finalize as failed runs instead of lingering as running, and attribution latency ignores still-running workflow age.
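The map behavior described above (per-item errors may be tolerated, but the global budget still applies) can be sketched as follows. This is an illustration only: runMap, allowItemErrors, and the step-count budget are hypothetical, and the real engine may measure budgets in time or cost instead of steps.

```typescript
type ItemOutcome<O> = { ok: true; value: O } | { ok: false; error: string };

interface MapRun<O> {
  status: "completed" | "failed";
  outcomes: ItemOutcome<O>[];
  reason?: string;
}

// The budget is a shared mutable counter so several nodes could draw from it.
interface Budget {
  remaining: number;
}

function runMap<I, O>(
  items: readonly I[],
  step: (item: I) => O,
  opts: { allowItemErrors: boolean; budget: Budget },
): MapRun<O> {
  const outcomes: ItemOutcome<O>[] = [];
  for (const item of items) {
    // The global budget is checked for every item, tolerated errors or not.
    if (opts.budget.remaining <= 0) {
      return { status: "failed", outcomes, reason: "global budget exhausted" };
    }
    opts.budget.remaining -= 1;
    try {
      outcomes.push({ ok: true, value: step(item) });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      outcomes.push({ ok: false, error: message });
      if (!opts.allowItemErrors) {
        // Finalize as failed rather than leaving the run in a running state.
        return { status: "failed", outcomes, reason: `item failed: ${message}` };
      }
    }
  }
  return { status: "completed", outcomes };
}
```

The point of the sketch is that tolerating item errors changes only the catch branch; the budget check sits outside it and cannot be skipped.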

Teams, billing, and onboarding

Billing usage attribution now ignores failed upstream attempts, uses the same run-quota limit the gateway enforces, and surfaces optimized or observe-mode savings ahead of lower-impact live spend.
Accounts support organizations, active-org switching, invitations, member removal, org deletion, public signup behind a flag, and email verification.
Usage metering tracks runs, tokens, spend, savings, routes, apps, agents, models, and workflows.
Savings and cache-health summaries now reflect settled run outcomes and include semantic-cache entries, making dashboard totals steadier and easier to reconcile.
Dashboard actions now preserve structured server error messages, so policy and operator failures show the actual reason instead of a generic object string.
Observe-mode cache probes and malformed cache rows no longer inflate served cache-hit counts, and failed promoted artifacts now fall back to the requested live model instead of another optimized route.
Failed provider attempts now stay visible as failed runs without consuming monthly run quota or inflating usage totals, so a healthy retry is not blocked by a previous upstream error.
Cache invalidation now clears semantic-response entries too, so an operator purge cannot leave stale semantic answers serving after the cache-health view says entries were removed.
Workflow template instantiation and imports now honor workflow plan limits, so quota enforcement is consistent across every creation path.
Plans, quotas, and optional Stripe checkout/portal/webhooks are available; self-hosted deployments can disable quotas.
New organizations get a getting-started panel and a one-click demo seed.

Governance and safety

API-key creation now rejects malformed mode, admin, app, and tenant controls instead of silently broadening key behavior, and pattern-ignore audits preserve the acting operator plus an allow verdict.
Negative feedback on optimized artifact runs now lowers the route evidence used for future serving decisions, and hybrid-artifact failures count against the same artifact safety gate.
MeshGuard-compatible policy YAML now rejects empty selectors, malformed action patterns, and invalid delegation bounds at load time, making governance misconfiguration visible instead of applying rules too broadly or missing actions silently.
Artifact feedback is now more conservative: correction text always lowers artifact confidence, later negative feedback can override an earlier positive rating for the same run, and oversized correction notes are rejected.
Tripwire blocking now covers tool-call arguments from live planning, cached plans, and promoted tool_plan artifacts, so planted decoys cannot leave through optimized tool-call routes.
Policies cover agent identity, trust tiers, side-effect levels, approvals, web actions, workflow tool calls, and audit.
Tripwire blocking now fails closed before optimized or streamed responses can reveal a planted decoy, while streamed alert-mode responses still record tripwire evidence.
Tenant-scoped admins can no longer mint API keys for other tenants, approval inboxes require admin access, and learning reports now stay scoped to the caller's tenant.
Multi-tenant operator evidence is now stricter: pattern/artifact views and feedback-driven learning ignore mismatched foreign records instead of showing or updating another tenant's optimization state.
Artifact feedback is now safer to operate: flags and corrections on optimized artifact runs lower artifact confidence once for that run, without duplicate submissions over-penalizing the artifact.
SDK-ingested trace events and user feedback are now validated more tightly so bad client input cannot pollute replay evidence or learning feedback.
Gateway and SDK identity fields now trim whitespace before tracing or caching, blank tool-cache subjects are rejected, positive artifact feedback strengthens live evidence once, and clearing a secret setting is visible in audit metadata without exposing the secret.
Risky tool plans and SDK-traced side effects now honor Punk's built-in approval gates before execution, and understated side-effect declarations can no longer downgrade obviously dangerous tools, including account disable/suspend/ban/cancel actions.
Undeclared tools now default to conservative side-effect handling across gateway tool schemas, SDK trace ingestion, and SDK traceTool; declare level 0/1 explicitly for read-only cacheable tools.
Persisted tenant settings are now revalidated before runtime use, so stale malformed rows cannot partially enable learned substitutions or weaken memory-quarantine gates.
Equal-specificity policy ties now resolve to the most restrictive verdict for the action, so peer safety policies cannot be bypassed by file load order.
Manual artifact promotion now re-checks promotion evidence and keeps unready artifacts out of live routing.
Production readiness checks highlight auth, encryption, provider, failover, worker, private-network, host-split, and billing risks before public exposure.
Evidence packets and trace integrity checks make runs easier to review and verify.
Replay-bundle errors, audit filters, and dashboard hit rates now line up more closely with the actual route evidence operators inspect.
SDK tool tracing now classifies destructive names conservatively before caching or execution, records failed wrapped-tool calls, and rejects malformed tool trace payloads before they enter evidence.
SDK tool tracing now shows cached calls and side-effect outcomes more completely, and tool-result caches are isolated by application so one app cannot reuse another app's cached tool output.
Tool-result cache TTL and subject-scope requests now reject malformed values instead of silently using a default or tenant-global scope.
Tool-result caches now preserve legitimate JSON null results as hits, so nullable read-through tools do not execute again just because their cached value is null.
Tripwires, honeytokens, streaming DLP, redaction, memory quarantine, encrypted credentials, OAuth credential refresh, and SSRF guards improve runtime safety.
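The tie-breaking rule for equal-specificity policies noted in this section can be illustrated with a short sketch. It assumes that more specific policies win outright and that verdicts can be ranked by restrictiveness; the verdict names other than "allow", the numeric specificity score, and resolveVerdict itself are hypothetical.

```typescript
// Only "allow" appears in these notes; the other verdict names are assumptions.
type Verdict = "allow" | "require_approval" | "deny";

interface PolicyMatch {
  policyId: string;
  specificity: number; // higher means a more specific selector (illustrative)
  verdict: Verdict;
}

const RESTRICTIVENESS: Record<Verdict, number> = {
  allow: 0,
  require_approval: 1,
  deny: 2,
};

// Picks the most specific match; ties go to the most restrictive verdict,
// so the result does not depend on the order in which policies were loaded.
function resolveVerdict(matches: readonly PolicyMatch[]): PolicyMatch | null {
  let winner: PolicyMatch | null = null;
  for (const m of matches) {
    if (
      winner === null ||
      m.specificity > winner.specificity ||
      (m.specificity === winner.specificity &&
        RESTRICTIVENESS[m.verdict] > RESTRICTIVENESS[winner.verdict])
    ) {
      winner = m;
    }
  }
  return winner;
}
```

With this rule, two peer policies that disagree produce the same verdict whichever file loads first.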

Learning and optimization

Patterns whose first artifact fails replay now retry synthesis after enough new evidence instead of staying stuck behind the failed proof attempt.
Learning evidence and artifact/approval operator lists now stay stable when several records land in the same millisecond, and idle-confidence decay no longer overwrites fresh artifact evidence.
Learned-route proof counters are stricter: model-substitution serves no longer masquerade as fresh artifact or semantic-cache shadow evidence, so promotion and semantic-cache gates reflect real shadow comparisons.
Broken promoted artifacts that fail structurally during live execution are now degraded immediately after falling open to live, so the same unsafe optimized route is not retried for every matching request.
Semantic-cache and artifact routes now require stronger proof before serving: patternless semantic matches keep going live until shadow evidence exists, and malformed tool-plan artifacts fail open with visible evidence.
Canary rollout now advances only after failure-free live and shadow evidence for the current rung, and promotion responses reflect the persisted canary state operators will see.
Artifact detail now shows the same promotion-gate blockers that the server enforces, the dashboard and demo avoid suggesting promotion until evidence and confidence are ready, and demo links honor the configured gateway URL.
The learning view shows evidence notes, confidence trajectories, and promotion blockers.
Manual artifact evidence refresh now updates artifact confidence and counters for newly covered runs, dedupes repeated run ids, and reports skipped historical runs.
Failed promoted-artifact executions now fail open to live while still counting against artifact route evidence, so operators see the unsafe optimized path lose confidence.
Patterns with failed, degraded, or quarantined artifacts now reappear as actionable optimization work with clear blockers instead of staying hidden behind stale states.
Replay proof now refuses ambiguous duplicate tool-result evidence instead of silently choosing one recorded result.
Evidence checks now ignore malformed historical samples and hold artifacts when selected historical evidence cannot be reconstructed completely.
Malformed negative-cache memos are now ignored instead of suppressing future optimization attempts.
Invalid promoted artifact programs now fail open to the live provider with visible failure evidence instead of serving a bad optimized response.
Tenant-local preference signals can adjust routing and promotion within hard safety gates.
Tenant-local preference now treats corrections as negative feedback, dedupes repeated feedback on one run, and ignores mismatched tenant feedback rows, so human feedback tunes future optimization more accurately.
Opt-in aggregate learning can surface anonymized shape-level signals across tenants without sharing prompts, outputs, or identities.
Caches, learned artifacts, model substitution evidence, canary rollout, rollback, and quarantine help repeated work get cheaper over time.
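The feedback rules noted on this page (corrections count as negative, repeated feedback on one run counts once, a later negative can override an earlier positive, and rows from another tenant are ignored) can be sketched as a reduction to one signal per run. The row shape and settleFeedback are hypothetical, and the choice that a later positive does not undo a negative is an assumption made to keep the sketch conservative.

```typescript
// Illustrative row shape; the real feedback schema is not shown in these notes.
interface FeedbackRow {
  tenantId: string;
  runId: string;
  rating: "up" | "down";
  correction?: string;
}

type Signal = "positive" | "negative";

// Reduces raw feedback rows to at most one settled signal per run.
function settleFeedback(rows: readonly FeedbackRow[], tenantId: string): Map<string, Signal> {
  const perRun = new Map<string, Signal>();
  for (const row of rows) {
    if (row.tenantId !== tenantId) continue; // ignore mismatched tenant rows

    const hasCorrection = row.correction !== undefined && row.correction.trim().length > 0;
    // Any correction text is treated as negative, whatever the rating says.
    const signal: Signal = row.rating === "down" || hasCorrection ? "negative" : "positive";

    const prior = perRun.get(row.runId);
    // First feedback wins unless a negative arrives later (assumption: negatives stick).
    if (prior === undefined || signal === "negative") {
      perRun.set(row.runId, signal);
    }
  }
  return perRun;
}
```

Because the result holds one entry per run, a duplicate submission cannot penalize or reward the same artifact twice.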

Developer and operator surfaces

Artifact evidence pages now stay usable for read-only operators, show clear promotion-gate ready states, and display confidence from the persisted artifact posterior so proof views match routing evidence.
Rate-limit configuration now fails visibly on malformed API/chat/public-diagnostic RPM values, and the public workflow diagnostic uses its own throttle instead of being double-counted by the generic API limiter.
Tool-result cache operator calls now reject malformed app/schema cache dimensions and cap manual TTLs at 24 hours, keeping read-through tool results scoped and fresh.
Serverless cron ticks now reject malformed JSON bodies and invalid maxJobs/maxMs bounds before touching the job queue, so scheduler misconfiguration is visible instead of silently running a default drain.
SDK and Punk Runtime Engine errors now surface the gateway's human-readable explanation, including plan-limit and validation messages, and SDK network failures include the request path and configured gateway URL.
Manual artifact replay now rejects malformed explicit run-id lists, approval inbox queries stay bounded while preserving counts, and demo terminal output stays aligned under ANSI color.
Operator mutation APIs now reject malformed or non-object JSON bodies consistently, so a bad cache-invalidation request cannot be interpreted as an intentional whole-cache purge.
Workflow list, export, and update controls now reject typoed filters, bad export ids, and empty edits instead of widening results, omitting workflows, or creating no-op version churn.
Gateway and worker startup now reject malformed numeric runtime env values clearly, so bad ports, learning cadence, worker concurrency, or retention settings fail before they distort runtime behavior.
Account, organization, billing, tripwire, and memory-governance endpoints now reject malformed JSON and bad list-query limits visibly, while invalid login/signup/invite payloads no longer burn sensitive rate-limit attempts.
Operator list APIs now reject malformed filters and pagination values explicitly, so typoed run/job/approval queries fail visibly instead of showing misleading empty histories.
Cache invalidation and tool-result cache calls now reject typoed cache types and blank tool names instead of silently no-oping or creating ambiguous cache entries.
Dashboard and demo optimized-share displays now count optimized runs against served runs, keeping policy-blocked traffic visible without diluting cache/artifact rates.
SDK chat parsing now reports non-JSON gateway/proxy responses with request context and preserves compatible-provider tool-call arguments that arrive as objects.
SDK chat calls now expose OpenAI tool calls directly and accept the gateway's existing tool/max-token request controls, and evidence packets now explain optimized cache/artifact runs more clearly.
@punk/sdk includes chat, Chorus helpers, tool tracing, feedback, web fetches, web sessions, memory influence recording, receipts, evidence packets, MCP registry helpers, prompt ingest, cache helpers, and learning APIs.
Public docs are served at /docs; the hosted reference remains punktechnologies.com.
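Several tool-result cache notes on this page (cached null results count as hits, blank subjects and malformed TTLs are rejected, caches are isolated by application, and manual TTLs are capped at 24 hours) fit in one small sketch. ToolResultCache here is a hypothetical in-memory class, not the SDK's cache helper; clamping an oversized TTL rather than rejecting it is an assumption about what "capped" means.

```typescript
const MAX_TTL_MS = 24 * 60 * 60 * 1000; // 24-hour cap mentioned in the notes

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface Entry {
  value: Json;
  expiresAt: number;
}

// A tagged result keeps "cached null" distinct from "not cached".
type Lookup = { hit: true; value: Json } | { hit: false };

class ToolResultCache {
  private readonly entries = new Map<string, Entry>();

  // Key dimensions are illustrative; argument key order affects the key here.
  private key(app: string, subject: string, tool: string, args: Json): string {
    return JSON.stringify([app, subject, tool, args]);
  }

  set(app: string, subject: string, tool: string, args: Json, value: Json,
      ttlMs: number, now: number = Date.now()): void {
    if (!Number.isInteger(ttlMs) || ttlMs <= 0) {
      throw new Error("ttlMs must be a positive integer");
    }
    if (subject.trim() === "" || tool.trim() === "") {
      throw new Error("subject and tool name must not be blank");
    }
    const ttl = Math.min(ttlMs, MAX_TTL_MS); // clamp instead of trusting the caller
    this.entries.set(this.key(app, subject, tool, args), { value, expiresAt: now + ttl });
  }

  get(app: string, subject: string, tool: string, args: Json,
      now: number = Date.now()): Lookup {
    const entry = this.entries.get(this.key(app, subject, tool, args));
    if (entry === undefined || entry.expiresAt <= now) {
      return { hit: false };
    }
    return { hit: true, value: entry.value }; // value may legitimately be null
  }
}
```

Returning a tagged Lookup instead of the bare value is the design choice that lets a nullable read-through tool skip re-execution when its cached result is null.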

Punk is in active development. This page covers user-facing milestones; smaller fixes and internal changes land continuously.
