Video Generation Models

Model rankings updated July 2026 based on real usage data.

OpenRouter provides access to video generation models through a single, unified API gateway. Generate videos from text prompts and reference images via an asynchronous API, and compare pricing, capabilities, and supported resolutions to find the best fit for your use case. Video generation is a new modality on OpenRouter, and available models are improving quickly. Learn more about video generation on OpenRouter.
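Because generation is asynchronous, a client typically submits a job and then checks on it until it finishes. The following is a minimal sketch of that polling loop, not OpenRouter's actual client. The status values ("pending", "completed", "failed"), the shape of the job dictionary, and the fetch_status callable are all assumptions made for illustration; check the OpenRouter documentation for the real endpoints and response fields.

```python
import time
from typing import Any, Callable, Dict

# Assumption: these status names are illustrative placeholders,
# not values confirmed by the OpenRouter API.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call fetch_status() repeatedly until the job reaches a terminal status.

    fetch_status is a hypothetical caller-supplied function that returns the
    current job as a dict with a "status" key (for example, by issuing an
    HTTP GET against whatever status URL the API returns).
    """
    deadline = clock() + timeout_s
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError("video job did not finish before the timeout")
        sleep(interval_s)
```

Passing sleep and clock as parameters keeps the loop easy to test without waiting in real time; in normal use the defaults are enough.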

Video Generation Models on OpenRouter

xAI: Grok Imagine Video 1.5

Grok Imagine Video 1.5 is an image-to-video generation model from xAI. It animates a starting image with an optional text prompt that can direct subject and camera motion, pacing, atmosphere, and physical behavior, while maintaining visual continuity across the clip. It can generate synchronized sound effects, ambience, and dialogue alongside the video.

by x-ai | $0/M input tokens | $0/M output tokens

Alibaba: HappyHorse 1.1

HappyHorse 1.1 is a video generation model from Alibaba. It generates short videos from a text prompt, a single starting image, or a set of reference images, with output up to 1080p and durations of 3 to 15 seconds. It is suited for creative content, social media clips, and image-driven animation, and improves on the prior version with stronger prompt adherence, smoother motion, and more consistent characters across frames.

by alibaba | $0/M input tokens | $0/M output tokens

Alibaba: HappyHorse 1.0

HappyHorse 1.0 is a video generation model from Alibaba. It generates short videos from a text prompt, a single starting image, or a set of reference images, with output up to 1080p and durations of 3 to 15 seconds. It is suited for creative content, social media clips, and image-driven animation across a range of aspect ratios.

by alibaba | $0/M input tokens | $0/M output tokens

xAI: Grok Imagine Video

Grok Imagine Video is xAI's fast, text-, image-, and reference-conditioned video generation model. It produces short videos (1–15 seconds, 24 fps) at 480p or 720p across seven aspect ratios: 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, and 2:3.

The model supports three generation modes: text-to-video from a prompt alone, image-to-video that animates a still input, and reference-to-video that grounds the output in up to seven reference images for consistent characters, styles, or settings.

by x-ai | $0/M input tokens | $0/M output tokens
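The limits listed for Grok Imagine Video (1–15 seconds, 480p or 720p, seven aspect ratios, up to seven reference images) can be checked on the client before a job is submitted. The sketch below is one way to do that. The VideoRequest class and its field names are hypothetical and do not reflect the actual request schema, and rejecting a request that carries both a starting image and reference images is an assumption, since the listing does not say how the modes combine.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Limits taken from the model description above.
ASPECT_RATIOS = {"1:1", "16:9", "9:16", "4:3", "3:4", "3:2", "2:3"}
RESOLUTIONS = {"480p", "720p"}
MIN_DURATION_S, MAX_DURATION_S = 1, 15
MAX_REFERENCE_IMAGES = 7


@dataclass
class VideoRequest:
    """Hypothetical request shape; field names are illustrative only."""
    prompt: Optional[str] = None
    image: Optional[str] = None  # starting image for image-to-video
    reference_images: List[str] = field(default_factory=list)
    duration_s: int = 5
    resolution: str = "720p"
    aspect_ratio: str = "16:9"


def generation_mode(req: VideoRequest) -> str:
    """Infer which of the three generation modes a request corresponds to."""
    if req.image and req.reference_images:
        # Assumption: the listing does not describe mixing these inputs.
        raise ValueError("use either a starting image or reference images")
    if req.reference_images:
        if len(req.reference_images) > MAX_REFERENCE_IMAGES:
            raise ValueError("at most seven reference images are supported")
        return "reference-to-video"
    if req.image:
        return "image-to-video"
    if req.prompt:
        return "text-to-video"
    raise ValueError("a prompt, an image, or reference images are required")


def validate(req: VideoRequest) -> str:
    """Check the documented limits and return the inferred mode."""
    if not MIN_DURATION_S <= req.duration_s <= MAX_DURATION_S:
        raise ValueError("duration must be between 1 and 15 seconds")
    if req.resolution not in RESOLUTIONS:
        raise ValueError("resolution must be 480p or 720p")
    if req.aspect_ratio not in ASPECT_RATIOS:
        raise ValueError("unsupported aspect ratio")
    return generation_mode(req)
```

For example, validate(VideoRequest(prompt="a cat surfing")) returns "text-to-video", while a request with eight reference images raises ValueError.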

Kling: Video v3.0 Pro

Kling v3.0 Pro is Kuaishou's premium video generation model, offering higher visual quality than the Standard tier. It supports text-to-video and image-to-video workflows, with first-frame and last-frame control for precise scene composition. Clips range from 3 to 15 seconds in 16:9, 9:16, or 1:1 aspect ratios. Native audio generation is available as an option.

by kwaivgi | $0/M input tokens | $0/M output tokens

Kling: Video v3.0 Standard

Kling v3.0 Standard is a video generation model from Kuaishou. It supports text-to-video and image-to-video workflows, with first-frame and last-frame control for guided scene composition. Clips range from 3 to 15 seconds in 16:9, 9:16, or 1:1 aspect ratios. Native audio generation is available as an option.

by kwaivgi | $0/M input tokens | $0/M output tokens

Google: Veo 3.1 Fast

Google's mid-tier video generation model balancing speed and quality. Veo 3.1 Fast generates high-quality video from text or image prompts with native synchronized audio, offering faster turnaround than Veo 3.1 at lower cost. Supports first-frame and last-frame conditioning, multiple resolutions and aspect ratios, and SynthID watermarking.

by google | $0/M input tokens | $0/M output tokens

Google: Veo 3.1 Lite

Google's most cost-effective video generation model, designed for high-volume applications and rapid iteration. Veo 3.1 Lite generates 720p and 1080p video from text or image prompts with native synchronized audio at less than 50% of the cost of Veo 3.1 Fast. Supports clips of 4 to 8 seconds in landscape (16:9) and portrait (9:16) formats, with SynthID watermarking. Ideal for content platforms, short-form video creation, and automated media generation.

by google | $0/M input tokens | $0/M output tokens

Kling: Video O1

Kling Video O1 is a video generation model from Kuaishou. It supports text and image inputs with video output, enabling text-to-video and image-to-video workflows. It is suited for cinematic content production, with first-frame and last-frame control for precise scene composition. It generates 5-second or 10-second clips in 16:9, 9:16, or 1:1 aspect ratios.

by kwaivgi | $0/M input tokens | $0/M output tokens

MiniMax: Hailuo 2.3

Hailuo 2.3 is a video generation model from MiniMax. It accepts text prompts and reference images as input and generates video output, supporting both text-to-video and image-to-video workflows. It is suited for creative content production, cinematic scene generation, and character animation, with a focus on realistic motion and expressive character rendering.

by minimax | $0/M input tokens | $0/M output tokens
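The clip lengths stated in the listings above differ from model to model, so a small lookup can help narrow the choice for a given duration. This sketch uses only the durations given on this page and keys them by display name, not by API model ID. It assumes every whole-second value inside a stated range is accepted, which the listings do not confirm. Models with no stated duration (Grok Imagine Video 1.5, Veo 3.1 Fast, Hailuo 2.3) are left out.

```python
from typing import Dict, FrozenSet, List


def seconds(lo: int, hi: int) -> FrozenSet[int]:
    """Whole-second durations from lo to hi inclusive (an assumption)."""
    return frozenset(range(lo, hi + 1))


# Durations as stated in the listings; keys are display names.
SUPPORTED_DURATIONS: Dict[str, FrozenSet[int]] = {
    "Alibaba: HappyHorse 1.1": seconds(3, 15),
    "Alibaba: HappyHorse 1.0": seconds(3, 15),
    "xAI: Grok Imagine Video": seconds(1, 15),
    "Kling: Video v3.0 Pro": seconds(3, 15),
    "Kling: Video v3.0 Standard": seconds(3, 15),
    "Google: Veo 3.1 Lite": seconds(4, 8),
    "Kling: Video O1": frozenset({5, 10}),
}


def models_supporting(duration_s: int) -> List[str]:
    """Return the listed models whose stated durations include duration_s."""
    return [
        name
        for name, allowed in SUPPORTED_DURATIONS.items()
        if duration_s in allowed
    ]
```

The same table shape could be extended with resolution or aspect-ratio sets where the listings state them.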