Role-Typed Visual References for Multimodal Generation Agents

Abstract

Multimodal generation agents need more than a flat list of image URLs. In GenFlick's movie and canvas workflows, the same uploaded image can mean identity, style, object, set, composition, mood, start frame, end frame, continuity source, edit source, motion reference, or audio reference. This article documents a role-typed reference contract: classify the reference, state its binding strength and scope, describe the visible traits in prompt text, and attach the actual media to provider inputs that honor references. The finding generalizes to AI image, video, and storyboarding systems because provider APIs differ sharply: some support labeled multi-image editing, some accept positional arrays, some use a single start frame, and some ignore references entirely.

Problem

Image and video generation products often expose a simple user promise: “upload a reference image and the agent will use it.” That promise hides several different intents.

A user may upload a portrait and ask for the same person in a new style. They may upload a warehouse photo and mean “this is the set.” They may upload a mood board and want loose lighting and palette. They may upload a finished still and want it animated as the exact starting frame. Treating all of these as reference_image_urls gives the generation model too little information and gives the dispatcher too little authority to choose the right provider mode.

GenFlick’s internal audit found that reliability depends on three layers working together:

Classify what the reference is for.
Describe the reference’s visible traits and intended use in prompt text.
Attach the actual image, video, or audio bytes to a provider path that supports that reference type.

If any layer is missing, the user experience becomes inconsistent. A strong prompt with no attached reference invents details. An attached image with vague text can bind the wrong role. A provider that ignores references can silently spend credits while discarding the user’s source media.

Reference Roles

The useful internal contract is not a flat URL list. It is a typed reference object.

GenFlick’s reference taxonomy uses these fields:

role: identity, style, object, set, composition, mood, start_frame, end_frame, continuity, edit_source, motion, or audio.
strength: canonical, strong, or loose.
scope: character, set, prop, clip, tile, project, or turn.
source: media id, tile ordinal, attachment index, clip id, or URL.
prompt_binding: a short description of what the reference contains and how the model should use it.

This model separates “use this exact face” from “borrow this palette,” “animate this exact frame” from “make a new angle based on this still,” and “reuse this prop permanently” from “use this image only for the current generation.”

Prompt Binding

Provider APIs vary in whether references can be labeled structurally. Some accept interleaved image and text parts. Some accept only a positional array. Some accept one image. Some support start and end frames separately. Because of that variation, every generated prompt that uses references should include a compact inventory.

Example:

Reference binding:
Ref 1 (identity, strong): use the uploaded portrait for face shape, hairstyle, glasses, and likeness; render the person as a 1990s anime protagonist.
Ref 2 (style, loose): borrow the warm watercolor palette and soft rim lighting; do not copy the composition.

Prompt:
...

This text is still useful when the provider receives labeled images, and it is essential when labels are downgraded into positional order. It also makes observability and debugging much easier because the prompt log shows how the agent understood the user’s reference intent.

Provider Implications

The same internal reference contract can be mapped differently per provider.

OpenAI image generation can use multiple image references through an image-editing path. The application can cap references, prioritize the strongest identity or edit source first, and inject a label map into the prompt where the API lacks label fields.

Gemini image generation can accept image parts and text parts together, which makes it a strong target for labeled multi-reference composition and style transfer.

Fal-hosted image models may accept a flat reference array, which means order and prompt labels matter more than structural metadata.

Text-only image providers should not be used when reference fidelity is central to the request. If a request contains strong references, routing should prefer a reference-capable provider or warn clearly that the selected provider cannot honor the media.

For video, the differences are larger. Some providers use a single start frame. Some support start and end frames. Some support reference images, videos, and audio clips. Some use character or object elements rather than style images. A role-typed contract lets the app choose provider-specific fields without exposing those fields to the agent prompt as the primary abstraction.

Agent Workflow

The agent should not merely say “use the image.” It should route reference intent to the right action.

For a recurring character, a portrait can become a canonical character reference or a generation input for a newly styled character reference. For a one-off image variation, the same portrait may be scoped only to the current tile. For a video continuation, a frame can be an exact start frame rather than a loose style image.

Same-turn attachments need special handling. If a user uploads several images and immediately asks for variations, relying on titles or newest-media heuristics is fragile. The agent should receive stable same-turn attachment indices or persisted media ids so actions can point to exact sources.

Failure Modes

The audit identified several common failures:

Style references treated as identity references.
Uploaded assets described to the LLM but not attached to the downstream image model.
Canvas image generation supporting reference text in prompts while sending an empty reference list to the provider.
Exact start-frame requests treated as loose inspiration.
Text-only providers selected for reference-dependent tasks.
Multiple uploaded images collapsed into “the image” without a stable source id.

These are product-contract failures more than model failures. The model may understand the user’s intent, but the runtime has to preserve that intent through action schema, dispatcher, provider adapter, and observability.

Validation

GenFlick validated the distinction between final canonical references and generation-input references with no-media agent-flow cases. In one character identity plus style scenario, the agent correctly attached uploaded images as reference_generation_media_ids for a character, set, and prop, without assigning the uploads as final references. Separate prop/object and set/layout scenarios passed the same test.

The validation did not call paid image or video providers. It verified that planner output, parsed actions, dispatch behavior, and project state carried the right reference ids to the right assets.

General Lesson

Reference reliability is a contract problem. The application needs a role-aware reference model, the agent needs action schema that can express that model, prompts need a reference inventory, and provider adapters need honest capability routing.

For multimodal generation systems, a reference is not just media. It is an instruction about identity, style, continuity, scope, and authority. Encoding those semantics explicitly is the difference between an agent that “saw the image” and a system that can actually use it.