Veo 3.1 is Google DeepMind's flagship AI video model. It turns text prompts or images into short, realistic video clips and generates matching audio - dialogue, sound effects, ambient noise, and music - in a single pass. In 3D AI Studio you can use Veo 3.1 Fast, Veo 3.1 Lite, and Veo 3 without a separate Google account or API setup.

How is Veo 3.1 different from Veo 3?

Veo 3.1 builds on Veo 3 (launched May 2025) with richer, more natural audio, stronger prompt adherence, and noticeably better image-to-video quality. Veo 3 is still available in Video Studio as a lower-cost option, but Veo 3.1 is the one to choose for the most realistic results.

Does Veo 3.1 generate sound automatically?

Yes - native audio is always on. Veo 3.1 creates synchronized sound effects, ambient noise, music, and spoken dialogue while it renders the video, so the audio already matches the on-screen action. You never have to add or sync a separate audio track.

How good is Veo 3.1's lip-sync?

Put a spoken line in quotation marks and Veo matches the words to the character's mouth movements with roughly 120 millisecond accuracy, which reads as natural in most clips. It can also handle multiple speakers taking turns in the same scene.

Why does spoken dialogue sometimes sound off?

Very short speech segments are still an area Google is actively improving. For the most reliable results, keep each spoken line clear and sized to fit the clip, give the character a moment before they speak, and avoid cramming long sentences into a 4-second clip.

Can Veo 3.1 output 4K video?

Yes. Veo 3.1 Fast can render native 4K (only at the 8-second length). Veo 3.1 Lite tops out at 1080p. Both also support 720p, which is the fastest and most economical option for drafts and social content.

How long can a Veo 3.1 clip be?

Each generation is 4, 6, or 8 seconds (8 seconds is required for 1080p, 4K, or when using reference images). You can build longer sequences by using scene extension, which continues a clip from its final frames.

Should I use Text to Video or Image to Video?

Use Text to Video when you have no footage and want Veo to invent the whole scene. Use Image to Video when you need a specific character, product, or style to stay exactly the same - you upload the image and Veo only adds the motion and sound.

Can I turn a photo of a person into a talking video?

Yes. Choose Image to Video, upload the photo, and include the spoken line in quotation marks. Veo animates the person and lip-syncs the dialogue. Clear, well-lit, front-facing photos give the most natural talking results.

How many reference images can Veo 3.1 use?

Veo 3.1 accepts up to three reference images to guide a generation, which helps keep a subject or style consistent. You can also specify a first and last frame to control how the clip begins and ends.

What aspect ratios does Veo 3.1 support for social media?

Veo 3.1 supports 16:9 landscape (great for YouTube) and 9:16 vertical (ideal for TikTok, Instagram Reels, and YouTube Shorts), plus an auto option that matches your input image's shape.

How do I write a good Veo 3.1 prompt?

Name the subject, one camera move, the lighting, and the sound. For example, 'slow dolly-in on a chef plating food, warm kitchen light, sizzling pan' beats 'a chef cooking'. Put any dialogue in quotation marks, and change one element at a time when refining.

Do I need a Google account or the Gemini API to use Veo 3.1?

No. Veo 3.1 is built directly into 3D AI Studio's Video Studio. You just open the Studio, pick Veo 3.1, and generate - billing runs on 3D AI Studio credits, with no separate Google sign-up or API keys.

How many credits does a Veo 3.1 video cost?

It depends on the tier, resolution, and length. Veo 3.1 Lite is the most credit-efficient; Veo 3.1 Fast costs more, and 4K adds an extra charge. The exact credit cost is always shown in the Studio before you generate.

Is Veo 3.1 free to try?

Yes. New accounts start with free credits, so you can generate Veo 3.1 clips before deciding on a plan. Veo 3.1 Lite stretches those free credits the furthest.

Can I use Veo 3.1 videos commercially?

Yes. Videos you generate on a paid plan come with commercial rights, so you can use them in ads, social campaigns, client deliverables, product pages, and more.

How long does a Veo 3.1 generation take?

Most clips finish in anywhere from under a minute to a couple of minutes. 720p is quickest, while 1080p and 4K add time. You can move any generation to the background and keep working while it renders.

Google Veo 3.1

Veo 3.1 AI Video Generator

Make realistic videos with sound. Veo 3.1 is Google's flagship AI video model - it turns your text or images into cinematic clips with native audio, real lip-sync, lifelike motion, and resolution up to 4K. Try it free inside 3D AI Studio's Video Studio using your starter credits.

Google DeepMind · Native audio + lip-sync · Up to 4K · Text & image to video

Example prompt: “Hair blowing in the wind, cinematic volcanic landscape”

Max resolution: 4K · Audio + lip-sync: Native · Lip-sync accuracy: ~120ms · Input modes: Text + Image

Ways to create

Two ways to create with Veo 3.1

Start from a written idea, or bring an existing image to life. Both generate matching sound automatically.

Text to Video (Veo 3.1 Lite)

Example prompt: “Aerial drone shot over a misty river valley at dawn”

Describe a scene in plain English and Veo builds the whole clip from scratch - camera, lighting, motion, and sound. No footage or image needed.

Great for scenes you don't have footage of
Add dialogue in quotes for lip-synced speech
Cheapest on the Lite tier

Image to Video (Veo 3.1 Fast)

Upload a photo or an AI-generated image and describe the motion. Veo keeps the look of your image and animates it with believable movement.

Keeps your exact character or product
Up to 4K on the Fast tier
Perfect after generating an image in Image Studio

What Veo 3.1 can do

Google's most advanced video model, built for realism and sound.

Native audio

Veo 3.1 generates sound with the video in one pass - dialogue, sound effects, ambient noise, and music, all matched to the action on screen.

Talking & lip-sync

Add dialogue in your prompt with quotation marks. Veo syncs spoken words to mouth movements with around 120ms accuracy, even with multiple speakers.

Up to 4K

Render crisp 720p and 1080p, or jump to native 4K (Fast tier) for cinema-quality detail and rich textures.

Real-world physics

Gravity, weight, reflections, and lighting behave naturally, so movement looks convincing instead of 'AI-floaty'.

Scene extension

Continue a clip past its original length by extending from its final frames, so you can build longer, connected sequences.

Strong prompt following

Veo 3.1 understands long, detailed prompts - multiple subjects, specific camera moves, and mood - and sticks to them closely.

Image to Video

Turn any image into a video with sound

Upload a character, product, or scene. Veo adds lifelike motion and matching audio while preserving the look of your image. Each of these clips started from a single still image.

Motion prompt: “He lifts his hand and a spell shimmers to life, robe shifting and candle flames reacting, with a low magical hum in the audio.”

Motion prompt: “A confident hero beat: he plants his feet and glances toward the camera as leaves and dust drift past.”

Motion prompt: “She breathes softly and blinks, ears twitching, as the forest glows and tiny chimes ring in the background.”

Text to Video

Generate video from a text prompt

No image, no footage - just words. Describe the shot and Veo creates it with synchronized sound.

“Aerial drone shot over a misty river valley at dawn”

“Man in a yellow suit dancing in an empty warehouse, dynamic”

“Wild horses running through a mountain meadow, golden hour”

“Woman faces a T-Rex in a ruined city, cinematic rain”

“Pixar-style penguin sliding down an icy slope, playful”

“Sci-fi drone navigating through a foggy neon city”

How to make a video with Veo 3.1, step by step

From a blank prompt to a finished, sound-on clip in five steps - no editing software needed.

Open Veo 3.1 in Video Studio

Head to Video Studio and select Veo 3.1. Choose Veo 3.1 Fast when you want maximum quality or 4K, or Veo 3.1 Lite for the most affordable runs. Then pick your input: Text to Video or Image to Video.

Describe the scene

Write what happens in one or two clear sentences: who or what is in frame, where they are, and the single camera move you want (for example, 'slow dolly-in'). Veo follows specific, concrete prompts far better than vague ones.

Add dialogue and sound

Veo generates audio with the picture, so describe the ambient sound, and put any spoken line in quotation marks - like a guide saying "follow me" - to get lip-synced speech that lands on the right frames.

Set resolution, length and shape

Choose 720p, 1080p, or 4K (Fast tier); a length of 4, 6, or 8 seconds (8 seconds is required for 1080p and 4K); and an aspect ratio: 16:9 for YouTube and landscape, or 9:16 for TikTok, Reels, and Shorts.

Generate, refine and download

Hit generate and Veo renders with synchronized audio in about a minute on most settings. If something's off, change one thing at a time - the camera move, then the lighting - then download in HD or 4K.

Prompt guide

How to write great Veo 3.1 prompts

Veo rewards detail. The more clearly you describe the shot, the closer the result.

Name the camera move

Tell Veo how the camera should move. Specific terms work far better than 'the camera moves'.

“slow dolly-in on the subject, shallow depth of field”

Add dialogue in quotes

Put exact spoken lines in quotation marks and Veo will lip-sync them to the character.

“a barista smiles and says "your usual?"”

Set the mood and lighting

Lighting words change everything. Describe time of day, weather, and tone.

“golden hour, warm rim light, soft film grain”

Describe the sound

Veo generates audio too, so mention the sounds you want to hear in the scene.

“gentle rain, distant thunder, footsteps on wet pavement”
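The four tips above add up to a simple template: a subject, one camera move, the lighting, the sound, and any dialogue in quotation marks. As a minimal sketch, assuming you draft prompts in a script before pasting them into Video Studio, the template and the "change one element at a time" refinement habit might look like this. The `ShotPrompt` class, its field names, and `one_change_variants` are hypothetical helpers, not part of Veo or 3D AI Studio.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ShotPrompt:
    """Hypothetical container for the prompt parts this guide recommends."""
    subject: str                    # who or what is in frame
    camera: str                     # one specific camera move, e.g. "slow dolly-in"
    lighting: str                   # time of day, weather, tone
    sound: str                      # ambient sound and effects
    dialogue: Optional[str] = None  # exact spoken line, if any

    def render(self) -> str:
        # Fixed order: camera move on subject, then lighting, then sound.
        text = f"{self.camera} on {self.subject}, {self.lighting}, {self.sound}"
        if self.dialogue:
            # Quotation marks mark the line as spoken, lip-synced dialogue.
            text += f', saying "{self.dialogue}"'
        return text


def one_change_variants(base: ShotPrompt, field: str, options: list[str]) -> list[str]:
    """Render prompts that differ from `base` in exactly one field."""
    return [replace(base, **{field: value}).render() for value in options]


if __name__ == "__main__":
    chef = ShotPrompt(
        subject="a chef plating food",
        camera="slow dolly-in",
        lighting="warm kitchen light",
        sound="sizzling pan",
    )
    print(chef.render())
    # Refine by swapping only the camera move, keeping everything else fixed.
    for prompt in one_change_variants(chef, "camera", ["handheld tracking shot", "aerial drone shot"]):
        print(prompt)
```

With the chef values, `render()` returns the FAQ's example prompt, "slow dolly-in on a chef plating food, warm kitchen light, sizzling pan". Keeping the parts as separate fields makes it easy to vary one element per generation, so you can tell which change actually improved the clip.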

Veo 3.1 specs

Everything you can control when you generate.

Provider: Google DeepMind

Input modes: Text to Video · Image to Video

Resolution: 720p · 1080p · 4K (Fast tier)

Duration: 4s · 6s · 8s (extendable)

Audio: Native, always on (dialogue, SFX, music)

Lip-sync: Yes, ~120ms accuracy

Aspect ratios: 16:9 · 9:16 · auto

Tiers in Studio: Veo 3.1 Fast · Veo 3.1 Lite (+ Veo 3)

Veo 3.1 Fast vs Lite

Both generate native audio and lip-sync. Pick Fast for maximum quality and 4K, Lite for the lowest cost.

Feature	Veo 3.1 Fast	Veo 3.1 Lite
Best for	Hero shots, 4K, realism	Quick social clips, volume
Max resolution	4K	1080p
Native audio	Yes	Yes
Lip-sync	Yes	Yes
Speed	Fast	Fast
Relative cost	Higher	Lowest
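The tier limits in this table combine with the length rules from the FAQ: 4K only on Fast, Lite capped at 1080p, clips of 4, 6, or 8 seconds, 8 seconds required for 1080p, 4K, or reference images, and at most three reference images. As a minimal sketch, assuming you plan batches of generations in a script and want to catch invalid combinations before spending credits, those rules could be checked like this. `Settings`, `check_settings`, and the tier strings are hypothetical; Video Studio applies its own validation, and the limits may change.

```python
from dataclasses import dataclass

# Limits as described on this page (hypothetical encoding, not an official API).
# Resolutions are keyed by vertical pixels: 720p, 1080p, and 4K (2160).
RESOLUTION_LABELS = {720: "720p", 1080: "1080p", 2160: "4K"}
MAX_RESOLUTION = {"fast": 2160, "lite": 1080}
ALLOWED_DURATIONS = {4, 6, 8}
ALLOWED_ASPECTS = {"16:9", "9:16", "auto"}
MAX_REFERENCE_IMAGES = 3


@dataclass
class Settings:
    tier: str                # "fast" or "lite" (hypothetical labels)
    resolution: int          # 720, 1080, or 2160
    duration_s: int          # clip length in seconds
    aspect: str = "16:9"
    reference_images: int = 0


def check_settings(s: Settings) -> list[str]:
    """Return a list of problems; an empty list means the combination looks valid."""
    problems: list[str] = []
    if s.resolution not in RESOLUTION_LABELS:
        problems.append("resolution must be 720p, 1080p, or 4K")
    if s.tier not in MAX_RESOLUTION:
        problems.append(f"unknown tier {s.tier!r}")
    elif s.resolution in RESOLUTION_LABELS and s.resolution > MAX_RESOLUTION[s.tier]:
        cap = RESOLUTION_LABELS[MAX_RESOLUTION[s.tier]]
        problems.append(f"{s.tier} tier tops out at {cap}")
    if s.duration_s not in ALLOWED_DURATIONS:
        problems.append("duration must be 4, 6, or 8 seconds")
    if s.aspect not in ALLOWED_ASPECTS:
        problems.append("aspect ratio must be 16:9, 9:16, or auto")
    if s.reference_images > MAX_REFERENCE_IMAGES:
        problems.append(f"at most {MAX_REFERENCE_IMAGES} reference images")
    # 1080p, 4K, and any reference images all require the 8-second length.
    if (s.resolution >= 1080 or s.reference_images > 0) and s.duration_s != 8:
        problems.append("1080p, 4K, and reference images require an 8-second clip")
    return problems


if __name__ == "__main__":
    print(check_settings(Settings("lite", 2160, 8)))   # Lite cannot render 4K
    print(check_settings(Settings("fast", 1080, 6)))   # 1080p needs 8 seconds
```

Returning a list of problems rather than raising on the first one lets you see every conflict in a planned batch at once.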

What people make with Veo 3.1

Talking videos

Spokesperson clips, explainers, and avatars with synced dialogue.

Ads & product

Photoreal product shots and short ad creative with sound.

Social content

Vertical 9:16 clips for TikTok, Reels, and Shorts.

Cinematic scenes

4K establishing shots, trailers, and concept films.

Explainers

Narrated how-to and educational clips with on-screen action.

Music & mood

Atmospheric visuals timed to a feeling or soundtrack.

Real estate & travel

Sweeping aerial and walkthrough shots from a single image.

Character animation

Bring a portrait or mascot to life with natural movement.

One subscription

Veo works best with the rest of the studio

Generate a flawless starting image, animate it with Veo, then take it further into 3D - every tool shares one account and one credit balance.

Image Studio

Generate & edit images

Create the perfect input image with 100+ AI tools, then animate it here.

Image to 3D

Turn images into 3D

Convert any image or video frame into a production-ready 3D model.

Text to 3D

Generate 3D from text

Describe any object and get a textured 3D model in seconds.

What is Veo 3.1?

Veo 3.1 is Google DeepMind's flagship AI video generation model. It creates short, high-quality video clips from a simple text prompt or a single image - and unlike most earlier models, it generates the sound at the same time as the picture. That means dialogue, sound effects, ambient noise, and music all come out of a single generation, already matched to what's happening on screen.

Google first launched Veo 3 in May 2025 and followed with Veo 3.1 in October 2025, adding richer audio, better prompt understanding, and stronger image-to-video quality. A lower-cost Veo 3.1 Lite tier arrived in 2026 for teams that need to generate a lot of video affordably. Inside 3D AI Studio you can use Veo 3.1 Fast, Veo 3.1 Lite, and the original Veo 3 - all from the same Video Studio, with no separate Google account or API setup.

Text to Video vs Image to Video

Veo 3.1 works two ways. With Text to Video, you describe a scene and Veo invents everything - the subject, the setting, the camera, and the sound. It's the fastest way to create a shot you don't have any footage or images for, like an aerial over a misty valley or a cinematic creature scene.

With Image to Video, you start from a picture. Upload a photo, a product shot, or an AI-generated character, and Veo animates it while keeping the exact look of your image. This is the best choice when you need a specific person, product, or style to stay consistent. A popular workflow is to generate the perfect starting image in 3D AI Studio's Image Studio, then animate it with Veo here.

Why native audio matters

Most AI video tools produce silent clips, leaving you to find music and sound effects separately and line them up by hand. Veo 3.1 generates audio as part of the video, so footsteps land on the right frame, a slamming door sounds when it shuts, and a character's lips move in time with their words.

This is a big deal for talking-head videos, ads with narration, and any scene where sound sells the realism. You can specify dialogue directly in your prompt by wrapping it in quotation marks, and Veo will generate lip-synced speech with around 120 millisecond accuracy, close enough to look natural in most clips.

Veo 3.1 vs other AI video models

Veo 3.1's biggest strength is realistic audio and lip-sync. If your video needs a person talking, a product demo with narration, or believable sound effects, Veo is usually the best choice. It is also strong on physical realism: water, hair, reflections, and lighting behave much as they do in real life.

For pure cinematic length and multi-shot storytelling, you might reach for Kling 3.0 (up to 15-second single shots) or ByteDance Seedance 2.0 (multi-reference, multi-shot). The good news: 3D AI Studio gives you all of them under one subscription, so you can pick the right model for each shot instead of being locked into one.

Tips for better Veo 3.1 videos

Be specific about the camera. Phrases like 'slow dolly-in', 'aerial drone shot', or 'handheld tracking shot' guide Veo far better than 'the camera moves'. Describe the lighting and mood too - 'golden hour', 'soft studio lighting', or 'moody rain' all change the result.

For dialogue, put the exact spoken line in quotation marks. For image-to-video, start from a clean, well-lit image - Veo keeps the look of your input and adds motion, so a sharp input gives a sharp video. If a generation isn't quite right, change one thing at a time (the camera move, then the lighting, then the action) rather than rewriting the whole prompt.

Explore other video models

Every plan includes access to all of them, including Kling 3.0 and Seedance 2.0. Pick the right tool for each shot.

Start creating with Veo 3.1

Open Video Studio and generate your first clip in minutes. Free credits to start.
