QUICK INFO
| | |
| --- | --- |
| Difficulty | Intermediate |
| Time Required | 15 minutes |
| Prerequisites | Basic familiarity with text-to-video AI tools; understanding of shot types (close-up, tracking, etc.) helps but isn't required |
| Tools Needed | Kling 3.0 access via fal.ai API or playground |
What You'll Learn:
- Structure multi-shot prompts that produce coherent scene sequences
- Write dialogue prompts with character consistency and audio control
- Direct camera motion and subject movement explicitly
- Use image-to-video mode without losing source detail
Kling 3.0 interprets prompts as scene directions. If you write "a dog in a park," you'll get something generic. If you write a shot describing how the camera finds the dog, what the dog does, and what sound fills the scene, the output is noticeably different. This guide covers the specific prompt patterns that produce the best results with the model's multi-shot, audio, and motion capabilities.
Anyone who's used earlier Kling versions or similar tools like Runway or Pika will pick this up quickly. If you're brand new to AI video, you'll still follow the steps, but expect to iterate more on your first few prompts.
Getting Started
You need access to Kling 3.0 through fal.ai (the API is exclusively available there as of early February 2026). Create a fal.ai account and grab an API key from your dashboard, or use their web playground if you just want to experiment before writing code.
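If you're building against the API rather than the playground, a minimal call looks roughly like the sketch below. It uses fal.ai's Python client (`pip install fal-client`); the endpoint ID, argument names, and response shape are placeholders of mine, not confirmed values, so check the Kling 3.0 API documentation before running it.

```python
# Minimal text-to-video sketch using fal.ai's Python client.
# Requires the FAL_KEY environment variable set to your fal.ai API key.
import fal_client

result = fal_client.subscribe(
    # Placeholder endpoint ID -- confirm the real one in the Kling 3.0 API docs.
    "fal-ai/kling-video/v3/standard/text-to-video",
    arguments={
        "prompt": (
            "Tracking shot: camera follows a cyclist down a rain-slicked "
            "street at dusk. Tires hissing on wet asphalt."
        ),
        "duration": "10",  # assumed parameter name; the model supports up to 15s
    },
)
print(result["video"]["url"])  # assumed response shape
```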
The model supports text-to-video and image-to-video, outputs up to 15 seconds, and can generate multi-shot sequences of up to six shots in a single call. Native audio output is supported, including dialogue, ambient sound, and sound effects.
One thing the fal docs don't emphasize enough: the model responds to filmmaking vocabulary. Terms like "tracking shot," "POV," "shot-reverse-shot," and "macro close-up" aren't decorative. They change the output. If you know basic cinematography terms, use them. If you don't, stick with simple directional language ("camera follows," "camera holds still") and you'll be fine.
Writing Multi-Shot Prompts
This is where Kling 3.0 genuinely differs from earlier models. Instead of generating a single continuous clip, you can describe up to six distinct shots and the model will produce them as a coherent sequence with varied angles and compositions.
Step 1: Label Your Shots Explicitly
Number or label each shot. The model uses these labels to separate the sequence into distinct compositions. A master prompt sets the overall scene, then each shot prompt handles its own framing and action.
Here's the structure:
Master Prompt: [Overall scene description and tone]
Multi shot Prompt 1: [Specific framing, subject action, duration]
Multi shot Prompt 2: [Next shot, different angle or subject, duration]
Each shot prompt should specify what the camera sees, what the subject does, and roughly how long it lasts (in seconds). The model handles transitions between shots on its own; you don't need to describe cuts or fades.
Step 2: Vary Your Shots
If every shot is a medium wide, you'll get a flat sequence. Mix your framing: start wide to establish the scene, cut to a close-up for detail, pull to a tracking shot for movement. The model understands profile shots, macro close-ups, POV, and overhead angles.
A practical example from the fal.ai documentation describes a character dancing down stairs across two shots. The first shot (5 seconds) captures the start of the movement at the top with arms spreading wide. The second shot (5 seconds) follows the descent with spinning and coat movement, ending in a pose at the bottom. Different framing, continuous action, same character.
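Rendered into the structure above, that example would read something like this (my paraphrase, not the exact fal.ai wording):

Master Prompt: A dancer in a long dark coat descends a grand staircase, energetic and theatrical tone.
Multi shot Prompt 1: Wide shot from below. The dancer begins at the top of the stairs, spreading both arms wide as the first steps start. Duration: 5 seconds.
Multi shot Prompt 2: Tracking shot following the descent. The dancer spins, coat flaring with the movement, and lands in a pose on the bottom step. Duration: 5 seconds.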
Step 3: Keep Character Descriptions Identical Across Shots
If shot 1 describes "a man in a red suit" and shot 3 says "the guy in crimson," you might lose consistency. Repeat the exact description. Kling 3.0 has strong subject consistency, but you have to meet it halfway by not introducing synonyms or variations mid-sequence.
Directing Motion
Vague motion descriptions produce vague results. "The camera moves around the subject" gives the model too many options. Specify the camera's behavior over time.
Instead of "tracking shot of a woman walking," try: "Camera follows at medium shot, staying slightly behind and to the right. When she stops at the door, the camera holds still. She turns, and the camera slowly pushes in to a close-up."
That level of direction works. The model can handle long takes where the camera's relationship to the subject changes, pauses when the subject pauses, and resumes when action picks up. I'm not sure exactly how far you can push the complexity of camera instructions before things break down, but sequences with two or three distinct camera behaviors within a single shot seem reliable in the examples I've seen.
Fast-paced movement and continuous shots both benefit from explicit timing. If you want a whip pan or a sudden freeze, say so. If the camera should drift slowly, say "slowly."
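For instance, an invented cue along these lines packs both speeds into one instruction: "The camera drifts slowly across the desk, then whip pans to the door the instant it slams open."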
Writing Dialogue Prompts
This is the section that matters most if you're making anything with characters who talk. The audio system in Kling 3.0 is tightly coupled to the visual generation, so lip sync, facial expression, and voice timing are all influenced by how you write dialogue.
Step 1: Establish Characters with Labels
Every speaking character needs a unique label introduced at the start. The format is:
[Character A: Descriptive Name, voice/tone description]: "Dialogue line"
So a scene with two people might open with:
[Character A: Lead Detective, controlled serious voice]: "Let's stop pretending."
[Character B: Prime Suspect, sharp defensive voice]: "I already told you everything."
The labels (Lead Detective, Prime Suspect) must stay consistent throughout. Don't switch to pronouns or shortened names after introducing them. The model uses these labels to track who's speaking and to keep voice characteristics stable.
Step 2: Anchor Dialogue to Physical Action
This tripped me up at first. If you write the dialogue line without describing what the character physically does, the model doesn't always know who should be moving. Describe the action before the dialogue:
The detective slides a folder across the table.
Paper scraping sound.
[Lead Detective, calm but threatening tone]: "Then explain why your fingerprints are here."
The physical action (sliding the folder) and the sound effect (paper scraping) both help the model stage the moment correctly. Without them, you might get the right words but flat delivery.
Step 3: Control Speaker Transitions
Between dialogue lines, use temporal markers. "Immediately" is the most common one in the example prompts, and it signals a quick cut or reaction. "Pause" or "Silence" creates a beat. Without these markers, the model can merge two characters' speech together or lose the rhythm of the exchange.
[Character A]: "So... are you mad at me?"
Immediately, the passenger stares out the window.
[Character B, quiet cold tone]: "I don't know."
That "Immediately" does real work. It tells the model to cut to Character B's reaction right after A finishes speaking.
Tone and Emotion Keywords
Kling 3.0 supports granular voice descriptions. You can specify "raspy, deep voice," "voice cracking," "whispered," "shouting louder," "sleepy amused voice," and the model adjusts both audio output and facial performance. Bold formatting around tone descriptions in the prompt (like **shouting loudly**) seems to emphasize them, though I haven't tested whether it makes a measurable difference versus plain text.
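For instance, a line like this (my own invented example) combines a tone label with bolded emphasis:

[Character A: Ship Captain, **voice cracking**, barely above a whisper]: "We're not going to make it before the storm."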
The model also handles multiple languages, accents, and code-switching within a scene. I haven't tested multilingual dialogue personally, but the documentation claims it works.
Audio Beyond Dialogue
Native audio isn't limited to speech. You can direct ambient sound, sound effects, and music cues directly in the prompt.
The model recognizes a wide vocabulary of audio triggers. For environmental sound: "rain tapping softly on the roof," "birds chirping," "traffic noise." For incidental sound effects: "glass shattering," "footsteps in an empty hallway," "ceramic clinks sharply." For music: "low lo-fi music playing from the speakers," "a sad piano chord enters quietly," "music tightens with a rising pulse."
Place these cues where they should occur in the scene's timeline. A sound effect described right before a dialogue line will play before the character speaks. Music cues described as "entering quietly" tend to fade in rather than cut in abruptly. The placement and adjectives both matter.
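Here's a short invented example showing cue placement across a scene's timeline:

A dim kitchen at night. Rain tapping softly on the roof.
Ceramic clinks sharply as a mug is set down on the counter.
[Character A: Tired Nurse, flat exhausted voice]: "I can't cover another double shift."
A sad piano chord enters quietly.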
Image-to-Video
When starting from a reference image, the model treats it as an anchor. It preserves identity, layout, and even text/signage from the source while introducing motion.
The key here is that your prompt should describe what changes, not what's already visible. The image handles the "what." Your prompt handles the "how it moves." Focus on camera motion, environmental shifts (wind, lighting changes), or subject actions that extend from the frozen moment the image captures. If you re-describe everything in the image, you're wasting prompt space and potentially introducing contradictions.
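For example, if your source image is a product shot of a perfume bottle on a marble counter, a prompt like this (hypothetical) describes only what changes:

Slow macro push-in toward the bottle. Window light shifts gradually warmer. A thin wisp of mist drifts across the frame. Quiet room tone with a faint clock ticking.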
The docs claim the model maintains text and signage detail from source images, which would make it useful for product and advertising work. I haven't verified this with complex text layouts, so take that with appropriate skepticism.
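On the API side, image-to-video follows the same pattern as text-to-video with a source image added. As with the earlier sketch, the endpoint ID and argument names below are my assumptions, following the usual fal.ai convention of passing a hosted image URL; verify them against the API docs.

```python
# Image-to-video sketch; endpoint ID, argument names, and response shape
# are assumptions -- confirm them in the Kling 3.0 API documentation.
import fal_client

result = fal_client.subscribe(
    "fal-ai/kling-video/v3/standard/image-to-video",  # placeholder endpoint ID
    arguments={
        "image_url": "https://example.com/perfume-bottle.jpg",  # hypothetical image
        "prompt": (
            "Slow macro push-in toward the bottle. Window light shifts "
            "gradually warmer. A thin wisp of mist drifts across the frame."
        ),
        "duration": "5",  # assumed parameter name
    },
)
print(result["video"]["url"])  # assumed response shape
```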
Troubleshooting
Symptom: Characters swap voices or the wrong character speaks a line. Fix: Check that your character labels are exactly consistent throughout. If you introduced "[Character A: Mom]" and later wrote "[Mother]," that's probably the cause. Also add "Immediately" or a physical action between speaker transitions.
Symptom: Camera does something unexpected mid-shot (sudden angle change, jerky motion). Fix: Your motion description may be too vague. Replace "the camera moves" with specific direction: "camera tracks left," "camera pushes in slowly," "camera holds at medium shot." If you described multiple camera behaviors without timing cues, the model may try to execute them simultaneously.
Symptom: Audio output is silent or has mismatched ambient sound. Fix: Make sure native audio is enabled in your API call or playground settings. If ambient sound doesn't match your description, try using more specific trigger words from the audio keyword list (e.g., "fire crackling" rather than "fireplace sounds").
Symptom: Multi-shot output has inconsistent subjects between shots. Fix: Use identical character descriptions across all shot prompts. Don't paraphrase. Copy-paste the description string if you have to.
What's Next
If you're building with the API, the fal.ai Kling 3.0 API documentation covers endpoint parameters, duration settings, and response handling.
PRO TIPS
The model responds to pacing language beyond just camera direction. Words like "suddenly," "slowly," "gradually," and "sharply" influence the timing of both visual and audio elements. "A car screeches to a stop" produces a different result than "a car gradually slows down."
When writing multi-shot prompts, include duration for each shot (e.g., "Duration: 5 seconds"). The model supports up to 15 seconds total, and explicitly allocating time across shots gives you more predictable pacing than letting the model decide.
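For example, a three-shot sequence using the full 15 seconds might split time like this (an invented allocation):

Multi shot Prompt 1: [Establishing wide shot of the scene.] Duration: 4 seconds.
Multi shot Prompt 2: [Close-up on the key detail.] Duration: 5 seconds.
Multi shot Prompt 3: [Tracking shot as the action resolves.] Duration: 6 seconds.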
For dialogue-heavy scenes, describe the setting and ambient audio at the very top of the prompt before any character action. This establishes the sonic environment before the model has to process speech: "A quiet park bench in the late afternoon. Birds chirping. Wind through trees. Soft acoustic guitar music." Then introduce your characters.
Bold-formatting tone descriptions within dialogue brackets (like **shouting loudly** or **voice trembling**) appears in most of the official example prompts. Whether this is actually parsed differently from plain text is unclear, but the examples are consistent enough that it's probably worth doing.
PROMPT TEMPLATES
Two-Character Dialogue Scene
[Setting description with ambient sound cues.]
[Background music or atmospheric tone.]
[First character physical action.]
[Character A: Name/Role, voice tone description]: "Dialogue line."
Immediately, [second character reaction.]
[Character B: Name/Role, voice tone description]: "Dialogue line."
[Character A physical response.]
[Character A, shifted tone description]: "Dialogue line."
[Pause/silence/beat.]
[Character B, shifted tone description]: "Dialogue line."
Customize by: swapping the setting line and ambient audio to change genre entirely (interrogation room vs. park bench vs. parked car at night).
Example output from the fal.ai guide: A scene in a parked car at night with rain tapping on the roof, where two friends have a tense exchange about trust. Lip sync, facial expression, and voice tone all tracked the prompt's emotional arc across four dialogue lines.
Multi-Shot Action Sequence
Master Prompt: [Overall scene, character description, emotional tone.]
Multi shot Prompt 1: [Character action in specific framing, duration in seconds]
Multi shot Prompt 2: [Continuation or new angle, building on previous action, duration in seconds]
Customize by: adding a third or fourth shot prompt for longer sequences. Keep total duration under 15 seconds.
FAQ
Q: Does Kling 3.0 work with languages other than English? A: The documentation states it supports multiple languages, dialects, accents, and multilingual code-switching within a single scene. I haven't independently tested non-English dialogue, so I can't confirm quality parity.
Q: Can I control specific camera movements like dolly zooms or crane shots? A: The model understands cinematic terminology, so describing a "slow dolly zoom" or "overhead crane shot descending" should influence output. Standard terms like tracking, panning, POV, and shot-reverse-shot are explicitly called out as supported.
Q: What's the maximum duration for a single generation? A: 15 seconds. For multi-shot prompts, you allocate that time across shots (e.g., three 5-second shots, or two 7-second shots and one 1-second shot).
Q: How is image-to-video different from text-to-video in terms of prompting? A: With image-to-video, the image sets the visual baseline. Your prompt should only describe what changes: motion, camera behavior, environmental shifts. Don't re-describe what's already in the image.
Q: Is the audio generated separately or part of the same model? A: Native audio is generated alongside the video. Lip movement, facial expression, and voice timing are coupled, which is why dialogue prompts need such precise structure to work well.
RESOURCES
- fal.ai Kling 3.0 Prompting Guide: Source material for this guide, includes video examples of each prompt pattern
- Kling 3.0 API Documentation: Endpoint reference, parameters, and code examples
- fal.ai Kling 3.0 Release Announcement: Technical overview of model capabilities and architecture changes