Introduction

AI video continuation is often evaluated visually: does the character remain the same, does the camera continue, and does the shot avoid restarting? In production use, audio discontinuity can be equally disruptive. A generated clip may continue visually while the background music restarts, changes instrumentation, jumps in loudness, or fades out at the join.

This benchmark tested five reference strategies for keeping a generated score continuous across a clip boundary. The question was practical: when extending a generated scene, should the next request include the whole prior video, only the last 10 seconds, only the last 5 seconds, the final frame plus recent audio, or a hybrid of tail video plus recent audio?

Method

Each experiment began with a 15-second source clip generated by SeeGen sd2 at 1280 by 720. From that source, the workflow extracted:

  • the full 15-second video;
  • the final 10 seconds of video;
  • the final 5 seconds of video;
  • the final frame as an image;
  • the final 7 seconds of audio as a WAV reference.
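The extraction steps above can be sketched with ffmpeg. This is a minimal illustration, not the benchmark's actual tooling: the function name, output file names, and codec choices are assumptions, and the function only builds argument lists rather than running them.

```python
from pathlib import Path
from typing import Dict, List


def build_extraction_commands(source: str, out_dir: str = "refs") -> Dict[str, List[str]]:
    """Return ffmpeg argument lists for the reference assets (hypothetical helper).

    The full 15-second clip is used as-is, so it needs no command.
    """
    out = Path(out_dir)
    base = ["ffmpeg", "-y"]

    def tail_video(seconds: int, name: str) -> List[str]:
        # -sseof seeks relative to the end of the input. Re-encoding avoids
        # starting the tail on a non-keyframe, at the cost of a re-encode.
        return base + ["-sseof", f"-{seconds}", "-i", source,
                       "-c:v", "libx264", "-c:a", "aac", str(out / name)]

    return {
        "tail_10s_video": tail_video(10, "tail_10s.mp4"),
        "tail_5s_video": tail_video(5, "tail_5s.mp4"),
        # Decode the last second and keep overwriting a single image, so the
        # file left on disk is the final frame.
        "last_frame": base + ["-sseof", "-1", "-i", source,
                              "-update", "1", str(out / "last_frame.png")],
        # Drop the video stream and write 16-bit PCM WAV.
        "tail_7s_audio": base + ["-sseof", "-7", "-i", source,
                                 "-vn", "-c:a", "pcm_s16le", str(out / "tail_7s.wav")],
    }
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the output directory exists.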

Five continuation conditions were then generated:

  • A: full 15-second video reference;
  • B: last 10-second video reference;
  • C: last 5-second video reference;
  • D: last frame plus last 7 seconds of audio;
  • E: last 5-second video plus last 7 seconds of audio.

The continuation prompt asked for the next 10 seconds of the same scene and emphasized that the music should continue rather than restart. Outputs were analyzed with five measures: spectral cosine similarity, chroma cosine similarity, source and output RMS level (dB), seam RMS delta (the change in RMS level across the join, in dB), and spectral centroid.
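The report does not say how these metrics were computed, so the following is only one plausible reading, written with the standard library for illustration. It compares single short frames using a naive DFT; a real pipeline would more likely average windowed frames from an audio library. All function names here are hypothetical.

```python
import cmath
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def magnitude_spectrum(samples):
    """Naive O(n^2) DFT magnitudes for bins 0..n/2 (fine for short frames)."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def chroma_vector(mags, sample_rate, n):
    """Fold spectral energy into 12 pitch classes (class 0 = A, tuned to 440 Hz)."""
    chroma = [0.0] * 12
    for k, m in enumerate(mags):
        freq = k * sample_rate / n
        if freq < 27.5:  # skip DC and sub-audio bins
            continue
        pitch_class = round(12 * math.log2(freq / 440.0)) % 12
        chroma[pitch_class] += m * m
    return chroma


def spectral_centroid(mags, sample_rate, n):
    """Magnitude-weighted mean frequency in Hz."""
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total
```

The two similarity measures answer different questions. A tone and its octave have almost no spectral overlap but share a pitch class, so spectral cosine is sensitive to timbre and register while chroma cosine tracks harmonic content.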

Test Set

The performed-theme test used a teenage violinist on a rainy elevated train platform. The music was diegetically motivated by the character’s violin performance: wistful piano, cello, and clear violin melody in 3/4 time.

The mood-score test used an archivist in a station lost-and-found room. The background music was non-diegetic: felt piano, warm pad, faint cello harmonics, sparse bell tones, and a slow pulse.

The action-score test used a courier sprinting across rainy rooftops. The background music was non-diegetic action scoring: percussion, low strings, synth bass, tense brass, and faster tempo.

Results

For the performed violin-theme cue, the last-10-seconds video reference was strongest by the available metrics: spectral cosine 0.956, chroma cosine 0.972, source RMS -30.4 dB, output RMS -30.2 dB, and seam RMS delta -1.7 dB. The full-video reference also had good chroma similarity at 0.945 but was much quieter at the join, with output RMS -37.9 dB and seam RMS delta -6.8 dB. The hybrid tail-video-plus-audio condition had spectral cosine 0.915 and chroma cosine 0.961: useful, but not the best for this cue.

For the quiet mood-score cue, audio-reference strategies dominated. The last-frame-plus-audio condition reached spectral cosine 0.996 and chroma cosine 0.997 with seam RMS delta 1.1 dB. The tail-video-plus-audio hybrid reached spectral cosine 0.993 and chroma cosine 0.995 with seam RMS delta 0.7 dB. In contrast, the last-5-seconds video-only condition had spectral cosine 0.685 and a severe seam RMS delta of 26.8 dB, suggesting a loudness or cue-reset failure.
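A seam check of this kind can be sketched as follows. The definition is an assumption inferred from the reported signs (a negative delta where the output was quieter at the join): RMS level of the first window of the output minus RMS level of the last window of the source. The 1-second window and the 6 dB flagging threshold are illustrative choices, not values from the benchmark.

```python
import math


def rms_db(samples, floor_db=-120.0):
    """RMS level in dB relative to full scale, clamped to a floor for silence."""
    if not samples:
        return floor_db
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square <= 0:
        return floor_db
    # 10*log10(mean square) equals 20*log10(RMS).
    return max(floor_db, 10 * math.log10(mean_square))


def seam_rms_delta_db(source, output, sample_rate, window_s=1.0):
    """Output-head level minus source-tail level (assumed sign convention)."""
    n = max(1, int(window_s * sample_rate))
    return rms_db(output[:n]) - rms_db(source[-n:])


def seam_is_suspect(delta_db, threshold_db=6.0):
    """Flag a join whose level jump exceeds a hypothetical tolerance."""
    return abs(delta_db) >= threshold_db
```

Under this threshold the 26.8 dB jump reported for the last-5-seconds video-only condition would be flagged, while the 1.1 dB and 0.7 dB deltas of the two audio-reference conditions would pass.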

For the action-score cue, nearly every condition scored highly. Listing spectral cosine first and chroma cosine second, the last-10-seconds video reference reached 0.996 and 0.997, last-frame-plus-audio 0.991 and 0.996, the hybrid 0.988 and 0.991, last-5-seconds video 0.984 and 0.993, and the full-video reference 0.978 and 0.989. The hybrid, though still high, had a larger RMS change than several alternatives.
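To make the narrow spread concrete, the reported action-cue scores can be tabulated and ranked. The numbers are taken from the paragraph above; ranking by the mean of the two similarities is an assumption made here for illustration, not a scoring rule used by the benchmark.

```python
# (spectral cosine, chroma cosine) as reported for the action-score cue.
ACTION_CUE = {
    "A full video": (0.978, 0.989),
    "B last 10 s video": (0.996, 0.997),
    "C last 5 s video": (0.984, 0.993),
    "D last frame + audio": (0.991, 0.996),
    "E last 5 s video + audio": (0.988, 0.991),
}


def mean_similarity(pair):
    """Average of spectral and chroma cosine (illustrative combined score)."""
    spectral, chroma = pair
    return (spectral + chroma) / 2


def rank_conditions(scores):
    """Return condition names sorted from highest to lowest mean similarity."""
    return sorted(scores, key=lambda name: mean_similarity(scores[name]), reverse=True)


ranking = rank_conditions(ACTION_CUE)
spread = mean_similarity(ACTION_CUE[ranking[0]]) - mean_similarity(ACTION_CUE[ranking[-1]])

for name in ranking:
    print(f"{name}: {mean_similarity(ACTION_CUE[name]):.4f}")
print(f"spread between best and worst: {spread:.4f}")
```

The gap between the best and worst condition is small enough that similarity alone does not separate them for this cue, which is why the loudness change at the join matters as a tiebreaker.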

Discussion

The strongest conclusion is that “best reference” depends on the musical role. When the source contains an on-screen performer, the recent video tail may carry useful visual and audio context together. When the music is a quiet non-diegetic score, a direct audio seed appears much more reliable than a short video tail alone. When the score is loud and rhythmic, the model may preserve enough musical structure from several reference types.

Full-video reference was not consistently best. It can preserve broad context, but it may encourage recap behavior, attenuation, or musical reset. Very short video tails can be either efficient or fragile: in the mood-score case, the last-5-seconds video-only condition produced the worst measured discontinuity.

The hybrid condition is conceptually attractive because it separates visual continuity from audio continuity. Its results were strong in the mood-score case and acceptable elsewhere, though it did not always beat the best single reference. This suggests a production default: provide recent visual context plus an explicit recent audio seed when score continuity matters, then choose tail length based on scene type.
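One way to encode that default is a small lookup from musical role to reference plan. Everything in this sketch is hypothetical: the type names, the role categories, and especially the 10-second tail combined with an audio seed for performed music, which is a combination this benchmark did not test.

```python
from dataclasses import dataclass
from enum import Enum


class MusicRole(Enum):
    PERFORMED = "performed"      # on-screen performer, e.g. the violinist
    QUIET_SCORE = "quiet_score"  # quiet non-diegetic score
    LOUD_SCORE = "loud_score"    # loud, rhythmic non-diegetic score


@dataclass(frozen=True)
class ReferencePlan:
    video_tail_s: float  # seconds of recent video to attach
    audio_tail_s: float  # seconds of recent audio to attach as a seed
    note: str


def choose_reference_plan(role: MusicRole) -> ReferencePlan:
    """Pick tail length by scene type, always including an audio seed."""
    if role is MusicRole.PERFORMED:
        # The 10 s video tail alone was best for the performed cue; adding
        # the audio seed to it is an untested extrapolation.
        return ReferencePlan(10.0, 7.0, "untested combination: verify before use")
    if role is MusicRole.QUIET_SCORE:
        return ReferencePlan(5.0, 7.0, "matches condition E")
    # Loud score: several strategies worked, so fall back to condition E.
    return ReferencePlan(5.0, 7.0, "matches condition E; other modes also scored well")
```

Keeping the plan as explicit data makes the audio seed a visible, overridable parameter rather than a side effect of the prompt.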

Limitations

The benchmark used three source scenes and simple audio metrics. Spectral and chroma similarity do not fully capture subjective musical continuity, phrase logic, or emotional fit. The source clips were generated from anime-style prompts, so results may differ for live-action footage, dialogue-heavy scenes, or clips with environmental sound instead of music.

The model and provider behavior may also change over time. These results should be treated as a method and a snapshot, not as a permanent ranking of SeeGen sd2 reference modes.

Conclusion

Score continuity should be treated as a separate design problem from visual continuity. For quiet background score, last-frame-plus-audio or tail-video-plus-audio references were strongest. For performed music, the 10-second video tail was best in this test. For loud action score, multiple reference strategies worked. Production systems should expose audio-reference selection explicitly rather than hiding it inside generic “continue video” prompting.
