Benchmark Question

Prompt stacks in production agents often accumulate narrow rules. A user says “product promo”, so add a product-promo patch. A user says “first few scenes”, so add a scope patch. A user says “UI is the star”, so add an interface-as-prop patch. These rules can prevent real failures, but they also make the prompt harder to reason about.

GenFlick tested a different direction: add a small first-principles reasoning kernel that asks the agent to reason from user intent, current project state, domain ontology, render path, cost, and reversibility before reaching for narrow rules.

The benchmark question was whether that kernel could preserve behavior on representative workflows without relying on an ever-growing list of textual triggers.

Harness

The test used GenFlick’s no-media agent-flow harness. This is not a toy parser test. The harness imports the same planner, phase executor, action parser, dispatcher, prompt registry, and project-context builder used by the app. With the live GPT-5.5 provider, planner and specialist calls are real model calls.

The harness skips paid media generation. Queued media work is written to run artifacts, but image and video provider APIs are not called. This makes it suitable for prompt and action-flow assessment: it checks plans, parsed actions, project state, dispatch results, and quality rubrics without spending media-generation credits.

Variants

The baseline was the current production prompt flow before the first-principles kernel.

The first-principles variant added a small reasoning layer to the base chat system prompt, planner system prompt, and scenes-for-act addendum. Its goal was to improve action selection and scope by grounding the model in the product ontology and render path.

A targeted guard variant added a narrow product/app-promo grounding rule. This represented the patchwork approach: when a known failure mode appears, add a specific rule that quarantines it.

Cases

The representative suite included:

  • A first-shot movie smoke case.
  • A simple lighthouse color edit with a second-turn targeted edit.
  • A Canvas Studio brainstorm workflow.
  • A Canvas board-management workflow.

An additional ad-hoc run tested a grounded promo prompt for GenFlick: a normal creator turns an idea into a polished cinematic clip, premium and modern, but not sci-fi.

Results

The baseline first-shot smoke case passed with a 99.0 A+ score. The first-principles variant also passed with a 98.3 A+ score. It produced the expected phase shape and did not explode into an oversized full-movie plan.

On the simple lighthouse edit, both baseline and first-principles behavior were structurally correct. The first shot passed at 97.3 A+ for the first-principles variant. The follow-up color edit landed at 89.0 B+ watch in both baseline and first-principles runs, but the behavior was the important part: one targeted_edit phase and one update_scene, with no added scenes, acts, characters, sets, props, posters, or renders.

Canvas behavior remained scoped. The brainstorm case created only canvas tiles and a stack. The board-management case mutated pins, tags, and stacks without creating new media. This is important because prompt changes for movie generation should not accidentally broaden Canvas workflows.

The targeted product-promo guard produced grounded product-ad outputs: apartment studio, laptop workstation, review monitor angle, real devices, practical screens, and repeated negative constraints against holograms or sci-fi visuals. It still tripped a generic scene/clip rubric because it created a compact 15-second ad. That looked more like a rubric calibration issue than a product failure.

Interpretation

The first-principles kernel did not beat baseline on every score, but it preserved the representative behavior. That matters because the goal was not to add a flashy new capability. The goal was to test whether a smaller general reasoning layer could hold scope and action discipline.

The simple edit result is especially useful. A rubric score can mark a narrow edit as watch if it expects richer structure, but the correct product behavior for “make the lighthouse red” is exactly one targeted update. This highlights a recurring assessment challenge: rubrics must distinguish structural richness from task fitness.

The product-promo guard demonstrates the appropriate role of patch rules. When a production failure is active and costly, a narrow guard can be useful. But once the behavior is stable, it should either become a named regression case or be distilled into an ontology/render-path principle.

Assessment Lessons

First, prompt assessments should inspect action shape, not only quality score. A 95-point output that creates unwanted documents or posters can be worse than an 89-point output that preserves scope exactly.

Second, representative suites should include non-movie surfaces. A prompt edit to the shared chat agent can affect Canvas, board management, documents, and media queues.

Third, no-media runs are valuable. They let the team evaluate planning and action discipline cheaply before spending provider credits.

Fourth, score interpretation needs task context. For simple edits, minimal mutation is a success criterion.

Fifth, targeted guards need expiration pressure. A guard should either graduate into a general principle or stay backed by a regression case that proves it still earns its prompt budget.

Limitations

The benchmark used internal rubric scores and private run artifacts. It tested a representative slice, not the full product surface. It also used GPT-5.5, which may be more capable of first-principles inference than other supported models.

Conclusion

GenFlick’s first-principles prompt-kernel test showed that a compact reasoning layer can preserve behavior across first-shot, edit, and Canvas workflows. The result supports a prompt strategy where narrow patch rules are treated as temporary containment or regression-backed exceptions, while the durable prompt core emphasizes intent, state, ontology, render path, cost, and reversibility.