# Video Generation (/docs/agent-tools/video-generation)



The video generation tool lets the agent take a prompt and optional reference media (image / video / audio), render a video in the background, and write the result directly to a path on the agent Drive or space Drive.

How it works at a glance [#how-it-works-at-a-glance]

* **Asynchronous.** The call returns right away; you can keep chatting while the video renders.
* **Written to Drive.** On success the video lands on the path you chose, and the message bubble shows a preview card.
* **Cancellable.** Queued or running jobs can be stopped from the chat UI.

Enabling the tool [#enabling-the-tool]

Open **Agent Settings → Tools → Video** and toggle the group on. If the group isn't visible in your account, video generation isn't available in your current environment — reach out to your admin.

Input parameters [#input-parameters]

| Parameter         | Required | Notes                                                                                                                                              |
| ----------------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| `prompt`          | yes      | Natural-language description of the video                                                                                                          |
| `file_path`       | yes      | Output path. Relative paths land on the agent Drive; `/space/...` writes to the space Drive                                                        |
| `attachments`     | no       | Array of reference media — each entry is a URL, asset ID, Drive path string, or explicit object `{ path? / url?, type?, role?, name?, mimeType? }` |
| `model`           | no       | Override the default model                                                                                                                         |
| `durationSeconds` | no       | 4–15 seconds; the supported range also depends on the model tier (see the capability matrix)                                                       |
| `aspectRatio`     | no       | `adaptive / 21:9 / 16:9 / 4:3 / 1:1 / 3:4 / 9:16`                                                                                                  |
| `resolution`      | no       | `480p / 720p / 1080p` (not every model supports 1080p)                                                                                             |
| `generateAudio`   | no       | Defaults to on                                                                                                                                     |
| `watermark`       | no       | Defaults to off                                                                                                                                    |
| `returnLastFrame` | no       | Ask for the final frame so you can chain another clip                                                                                              |
| `webSearch`       | no       | Allow reference web search for pure text-to-video prompts                                                                                          |
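
As an illustration of how the parameters in the table fit together, here is a minimal sketch of one set of tool-call arguments plus a small pre-flight check. The parameter names and allowed values come from the table; the prompt, paths, and the `check_request` helper are hypothetical and are not part of the tool itself.

```python
import json

# Hypothetical arguments; names follow the table above, values are illustrative.
request = {
    "prompt": "A red kite rising over a windy beach, slow tracking shot",
    "file_path": "videos/kite.mp4",  # relative path -> agent Drive
    "attachments": [
        {"path": "images/kite.png", "role": "first_frame"},
    ],
    "durationSeconds": 5,
    "aspectRatio": "16:9",
    "resolution": "720p",
    "generateAudio": True,
}

ASPECT_RATIOS = {"adaptive", "21:9", "16:9", "4:3", "1:1", "3:4", "9:16"}
RESOLUTIONS = {"480p", "720p", "1080p"}


def check_request(args: dict) -> list[str]:
    """Return a list of problems found in the arguments (empty if none)."""
    problems = []
    # `prompt` and `file_path` are the only required parameters.
    for key in ("prompt", "file_path"):
        if not args.get(key):
            problems.append(f"missing required parameter: {key}")
    if "aspectRatio" in args and args["aspectRatio"] not in ASPECT_RATIOS:
        problems.append(f"unsupported aspectRatio: {args['aspectRatio']}")
    if "resolution" in args and args["resolution"] not in RESOLUTIONS:
        problems.append(f"unsupported resolution: {args['resolution']}")
    return problems


if __name__ == "__main__":
    print(json.dumps(request, indent=2))
    print(check_request(request))
```

Setting `role` explicitly on each attachment, as in the sketch, avoids relying on the default that is chosen from the media type.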

`role` values inside `attachments` [#role-values-inside-attachments]

| role              | Media type | Typical use                                                       |
| ----------------- | ---------- | ----------------------------------------------------------------- |
| `reference_image` | image      | General reference (style, subject)                                |
| `first_frame`     | image      | First frame of the video                                          |
| `last_frame`      | image      | Last frame — pair with `first_frame` for first-to-last-frame mode |
| `reference_video` | video      | Reference clip for edit / extend flows                            |
| `reference_audio` | audio      | Reference audio for voiceover or background                       |

If you omit `role`, a sensible default is picked based on the media type.

Output path [#output-path]

* Relative path (e.g. `videos/demo.mp4`) — writes to the **agent Drive**
* Absolute path starting with `/space/` — writes to the **space Drive**
* Missing extension is normalized to `.mp4`
* The chat bubble's video card renders a preview directly from this path
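
The path rules above can be sketched as a small helper. This is an illustration rather than the tool's actual implementation: `resolve_output_path` is a hypothetical name, and treating every path that does not start with `/space/` as an agent Drive path is an assumption.

```python
import posixpath


def resolve_output_path(file_path: str) -> tuple[str, str]:
    """Return (drive, normalized_path) for a `file_path` value."""
    # Paths under /space/ go to the space Drive; everything else is
    # assumed to land on the agent Drive.
    drive = "space" if file_path.startswith("/space/") else "agent"
    # A path without an extension is normalized to .mp4.
    _root, ext = posixpath.splitext(file_path)
    if not ext:
        file_path += ".mp4"
    return drive, file_path


if __name__ == "__main__":
    print(resolve_output_path("videos/demo"))         # ('agent', 'videos/demo.mp4')
    print(resolve_output_path("/space/clips/intro"))  # ('space', '/space/clips/intro.mp4')
```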

Capability matrix [#capability-matrix]

Capability varies by model tier — pick one based on what you need:

| Tier          | Text→Video | First frame | First+last | Multimodal ref (image/video/audio) | Edit | Extend | Max resolution | Duration |
| ------------- | ---------- | ----------- | ---------- | ---------------------------------- | ---- | ------ | -------------- | -------- |
| Flagship      | ✓          | ✓           | ✓          | Full                               | ✓    | ✓      | 1080p          | 4–15s    |
| Flagship fast | ✓          | ✓           | ✓          | Full                               | ✓    | ✓      | 720p           | 4–15s    |
| Pro           | ✓          | ✓           | ✓          | Image only                         | ✗    | ✗      | 1080p          | 4–12s    |
| Lite i2v      | ✗          | ✓           | ✗          | Image only                         | ✗    | ✗      | 720p           | 2–12s    |
| Lite t2v      | ✓          | ✗           | ✗          | ✗                                  | ✗    | ✗      | 720p           | 2–12s    |

Flagship tiers can output audio when `generateAudio` is `true` (the default).
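
One way to pick a tier is to encode the matrix as data and filter it by what a job needs. The sketch below transcribes the rows of the table; the `Tier` class, its field names, and `matching_tiers` are hypothetical helpers, and the tier names are the labels from the table rather than values for the `model` parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    text_to_video: bool
    first_frame: bool
    first_last: bool
    reference_media: frozenset  # media types accepted as multimodal reference
    edit: bool
    extend: bool
    max_resolution: int  # vertical pixels, e.g. 1080 for 1080p
    min_seconds: int
    max_seconds: int


ALL = frozenset({"image", "video", "audio"})
IMG = frozenset({"image"})
NONE = frozenset()

# Rows transcribed from the capability matrix.
TIERS = [
    Tier("Flagship", True, True, True, ALL, True, True, 1080, 4, 15),
    Tier("Flagship fast", True, True, True, ALL, True, True, 720, 4, 15),
    Tier("Pro", True, True, True, IMG, False, False, 1080, 4, 12),
    Tier("Lite i2v", False, True, False, IMG, False, False, 720, 2, 12),
    Tier("Lite t2v", True, False, False, NONE, False, False, 720, 2, 12),
]


def matching_tiers(*, seconds, resolution=720, needs=(), reference=()):
    """Return the names of tiers that can handle the described job.

    needs: names of boolean Tier fields, e.g. "first_last" or "edit".
    reference: media types used as multimodal reference.
    """
    names = []
    for tier in TIERS:
        if not tier.min_seconds <= seconds <= tier.max_seconds:
            continue
        if resolution > tier.max_resolution:
            continue
        if not all(getattr(tier, need) for need in needs):
            continue
        if not set(reference) <= tier.reference_media:
            continue
        names.append(tier.name)
    return names


if __name__ == "__main__":
    print(matching_tiers(seconds=15, resolution=1080, needs=["edit"]))  # ['Flagship']
    print(matching_tiers(seconds=3, needs=["text_to_video"]))           # ['Lite t2v']
```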

Media specs [#media-specs]

**Images** — jpeg / png / webp / bmp / tiff / gif / heic / heif. ≤ 30 MB each, aspect ratio 0.4–2.5, side length 300–6000 px. Counts: 1 for first-frame, 2 for first+last, 1–9 for multimodal reference, 1–4 for lite reference.

**Videos** — mp4 / mov, H.264 or H.265 + AAC / MP3. 2–15 s each, up to 3 clips, ≤ 15 s total. 480p / 720p / 1080p, 24–60 fps.

**Audio** — wav / mp3, 2–15 s each, up to 3 clips, ≤ 15 s total, ≤ 15 MB each.
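
Checking reference media before submitting a job can save a failed run. The sketch below covers only the image limits listed above. `check_image` is a hypothetical helper, and two details are assumptions: 1 MB is taken as 1024 × 1024 bytes, and the aspect ratio is taken as width divided by height (the 0.4–2.5 range is symmetric under inversion, so the orientation of the ratio does not change the result).

```python
IMAGE_FORMATS = {"jpeg", "png", "webp", "bmp", "tiff", "gif", "heic", "heif"}
MAX_IMAGE_BYTES = 30 * 1024 * 1024  # assumes 1 MB = 1024 * 1024 bytes


def check_image(fmt: str, size_bytes: int, width: int, height: int) -> list[str]:
    """Return a list of spec violations for one reference image."""
    problems = []
    if fmt.lower() not in IMAGE_FORMATS:
        problems.append(f"unsupported format: {fmt}")
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append("file is larger than 30 MB")
    if not all(300 <= side <= 6000 for side in (width, height)):
        problems.append("side length must be 300-6000 px")
    # Aspect ratio assumed to be width / height; skip when height is invalid.
    if height > 0 and not 0.4 <= width / height <= 2.5:
        problems.append("aspect ratio must be 0.4-2.5")
    return problems


if __name__ == "__main__":
    print(check_image("png", 2_000_000, 1920, 1080))  # []
    print(check_image("png", 1_000, 3000, 1000))      # ['aspect ratio must be 0.4-2.5']
```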

Prompt tips [#prompt-tips]

Formula: **subject + action, scene + action, camera + action**.

* Be concrete. Avoid stacking abstract adjectives.
* Front-load the important parts (subject, action, camera).
* Iterate on the prompt first, then on the reference media; swap vague terms for specific descriptions.
* Text-to-video is high-variance: use it to explore ideas, and switch to image-to-video when you need a stable look.
* When using image-to-video, upload a high-quality first frame; frame quality strongly influences the result.

Aspect ratio and cropping [#aspect-ratio-and-cropping]

If `aspectRatio` differs from the input image, the backend **center-crops** along the shorter dimension so the crop region is fully inside the original. Keep `aspectRatio` close to the input image's ratio, or use `adaptive` to let the backend match it.
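
To see how much of an image a given `aspectRatio` would discard, the crop can be estimated ahead of time. The sketch below computes the largest centered box of the target ratio that fits inside the input image, which is one reading of the behavior described above; the backend's exact rounding is not documented here, and `center_crop_box` is a hypothetical helper.

```python
def center_crop_box(width: int, height: int, ratio_w: int, ratio_h: int):
    """Largest centered box with aspect ratio ratio_w:ratio_h inside the image.

    Returns (left, top, crop_width, crop_height) in pixels.
    """
    # Compare the two ratios by cross-multiplication to stay in integers.
    if width * ratio_h > height * ratio_w:
        # Image is wider than the target: keep full height, trim the sides.
        crop_h = height
        crop_w = height * ratio_w // ratio_h
    else:
        # Image is taller than (or equal to) the target: keep full width.
        crop_w = width
        crop_h = width * ratio_h // ratio_w
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, crop_w, crop_h


if __name__ == "__main__":
    print(center_crop_box(1920, 1080, 1, 1))   # (420, 0, 1080, 1080)
    print(center_crop_box(1080, 1920, 16, 9))  # (0, 656, 1080, 607)
```

In the second call a portrait image loses more than two thirds of its height to a `16:9` crop, which is the situation that `adaptive` or a closer `aspectRatio` avoids.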

Cancelling a job [#cancelling-a-job]

Queued or running jobs can be stopped directly from the message bubble. The status flips to "cancelled" right away and no more work is done in the background.

Limits [#limits]

* Intermediate state and the temporary video URL are kept for 24 hours and then cleaned up; videos already written to Drive are unaffected.
* Reference media containing real human faces is rejected.
* Per-account RPM and concurrency limits apply; if you hit a rate limit, the error surfaces in the message bubble.
* Generation time depends on length, resolution, and model — usually 30 s to a few minutes. Closing the session pauses status updates; reopening the session resumes them.

See also [#see-also]

* [Agent Tools](/en/docs/agent-tools)
* [Drive and knowledge base](/en/docs/concepts/drive-and-knowledge-base)
