Video · June 11, 2026 · 10 min read

Talking AI Reels & Multi-Scene Skits from a Single Product Photo: How It Works

Inside the catais video pipeline: image-to-video from real product photos, lip-synced speech, scene-by-scene skit production, and the engineering that keeps packaging and logos accurate.

Send the catais bot one message — “make a reel of Blackseed Oil” — and a few minutes later a vertical video is live on Facebook, Instagram Reels and TikTok: your actual bottle in motion, a presenter speaking a hook written in your brand voice, lips synced, your exact logo in the corner, caption attached. No camera, no editor, no upload screen.

This post walks through how that pipeline works end to end — including the unglamorous engineering that makes the difference between a demo and a production system you'd trust with your brand: fidelity rules, logo compositing, scene stitching, background jobs and quota fallbacks.

Step 1 — Start from truth: the real product photo

Everything begins with image-to-video, never text-to-video. Pure text-to-video invents a product — beautiful, wrong, and a refund waiting to happen. catais resolves the product you named in your WooCommerce catalog and pulls its actual listing photo as the anchor. The generated motion is built *around* your real item, so shape, colours and packaging stay recognisably yours. (Attach a photo in chat instead, and that becomes the anchor — handy for items not yet listed.)
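
The anchor choice described above can be sketched in a few lines. This is a minimal illustration, not the catais implementation: the function name pick_anchor_image is hypothetical, and the product dictionaries are assumed to follow the shape of WooCommerce REST API product objects (a "name" plus an "images" list whose entries carry a "src" URL).

```python
from typing import Optional


def pick_anchor_image(products: list[dict], query: str,
                      attached_photo: Optional[str] = None) -> Optional[str]:
    """Return the image URL to anchor generation on, or None if there is none.

    Hypothetical helper: `products` is assumed to look like WooCommerce
    product objects, i.e. {"name": ..., "images": [{"src": ...}, ...]}.
    """
    # A photo attached in chat always wins over the catalog listing.
    if attached_photo:
        return attached_photo

    wanted = query.strip().lower()
    if not wanted:
        return None

    # Prefer an exact name match, then fall back to a substring match.
    exact = [p for p in products if p.get("name", "").lower() == wanted]
    partial = [p for p in products if wanted in p.get("name", "").lower()]

    for product in exact + partial:
        images = product.get("images") or []
        if images and images[0].get("src"):
            return images[0]["src"]

    # No real photo found: report that, rather than inventing a product.
    return None
```

Returning None instead of a generated stand-in is the point of the sketch: without a real photo there is nothing truthful to anchor on.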

Step 2 — Compose the opening frame

If you've created a Cast presenter, the agent first composes a single opening frame: your locked character holding or presenting your product, styled by your brand kit. Anchoring the video to one composed frame is what keeps the *same* face and the *same* product consistent through the motion that follows — the single most effective consistency trick in AI video.

Step 3 — Write the words, then say them

A reel gets one spoken line, at most about 22 words, covering hook, product and a soft call-to-action. It is written in your brand voice (the same voice profile that writes your captions). The video model renders it as actual speech: lip-synced dialogue when the presenter is on camera, clean voiceover for product-only clips. Modern video models in the class of Google's Veo generate synchronized audio natively, which is what makes one-step talking video possible at all.
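
A word budget like this is easy to check before a script ever reaches the video model. The sketch below is illustrative only: the name check_spoken_line is made up, and it treats the roughly 22-word limit as a hard cap, which is an assumption about how the limit is applied.

```python
MAX_WORDS = 22  # Taken from the "~22 words" figure; treated as a hard cap here.


def check_spoken_line(line: str, max_words: int = MAX_WORDS) -> tuple[bool, int]:
    """Return (fits_budget, word_count) for a candidate spoken line.

    Words are counted by splitting on whitespace, so punctuation stays
    attached to its word and does not inflate the count.
    """
    count = len(line.split())
    return (0 < count <= max_words, count)
```

For example, check_spoken_line("Meet Blackseed Oil, cold pressed and ready. Tap to try it today.") reports 12 words, comfortably inside the budget.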

Step 4 — The fidelity rules (the part that protects you)

These rules are the difference between AI video you post proudly and AI video you delete in embarrassment. They're enforced in the pipeline, not left to luck.

  • No generated text, anywhere. Video models hallucinate lettering — fake ingredients, mangled brand names. catais forbids on-screen text generation outright; packaging surfaces stay clean rather than confidently wrong.
  • No invented claims. Prices, health claims, specs — the script is constrained to what your store and brief actually say.
  • The logo is composited, never drawn. After generation, your *uploaded* logo file is overlaid frame-accurately onto the finished video. Pixel-perfect, because no model ever touched it.
  • The product stays the hero. Prompting keeps your item recognisable rather than letting the scene swallow it.
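
To make the "composited, never drawn" rule concrete, here is one way a logo overlay could be done with ffmpeg's overlay filter. This is a sketch under assumptions, not a description of the catais pipeline: the post does not say which tool performs the overlay, and the top-right placement, the 24-pixel margin and the function name logo_overlay_command are all illustrative.

```python
def logo_overlay_command(video: str, logo: str, output: str,
                         margin: int = 24) -> list[str]:
    """Build an ffmpeg command that overlays `logo` on `video`.

    Assumptions for illustration: the logo goes in the top-right corner,
    and the audio track (if any) is copied through unchanged.
    """
    # In the overlay filter, W is the main video width and w the overlay
    # width, so x = W - w - margin pins the logo to the right edge.
    graph = f"[0:v][1:v]overlay=W-w-{margin}:{margin}[v]"
    return [
        "ffmpeg", "-y",
        "-i", video,              # input 0: the generated clip
        "-i", logo,               # input 1: the uploaded logo file
        "-filter_complex", graph,
        "-map", "[v]",            # composited video stream
        "-map", "0:a?",           # original audio, if the clip has any
        "-c:a", "copy",           # do not re-encode the speech track
        output,
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True). Because the logo arrives as a second input file, its pixels come from the upload rather than from any model output.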

Step 5 — Skits: scenes, stitched

Single AI clips run about eight seconds: a hook, not a story. Skits chain them. Ask for “a skit of Total Body Reset, 3 scenes” and the agent writes a beat-by-beat script (hook → value → call-to-action, each beat with its visual action and spoken line). It generates every scene anchored to the same opening frame so character and product persist, edits the scenes into one continuous video with proper concatenation and audio handling, and finishes with the logo overlay.
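
The beat structure and the stitching step can be sketched roughly as follows. Everything named here is hypothetical (Beat, plan_roles, concat_list), and the rule for skits longer than three scenes, where extra scenes become additional "value" beats, is an assumption. The list-file format is the one ffmpeg's concat demuxer reads, which is one possible way to join clips; the post does not name the tool actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beat:
    role: str            # "hook", "value" or "call-to-action"
    visual_action: str   # what happens on screen in this scene
    spoken_line: str     # the line delivered in this scene


def plan_roles(scene_count: int) -> list[str]:
    """Assign a role to each scene: hook first, call-to-action last.

    Assumption: any scenes in between are "value" beats.
    """
    if scene_count < 2:
        raise ValueError("a skit needs at least a hook and a call-to-action")
    return ["hook"] + ["value"] * (scene_count - 2) + ["call-to-action"]


def concat_list(scene_files: list[str]) -> str:
    """Render scene paths as a list file for ffmpeg's concat demuxer."""
    lines = []
    for path in scene_files:
        # Inside single quotes, a literal quote is written as '\''.
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"
```

Written to scenes.txt, that list can be joined with ffmpeg -f concat -safe 0 -i scenes.txt -c copy skit.mp4. Stream copy only works when every scene shares the same codec settings; otherwise the clips need re-encoding, which is one reason "proper concatenation and audio handling" is more than a single command.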

Because that's minutes of compute, skits run as background jobs: the chat acknowledges instantly, a status message updates itself with live progress — *setting the scene… 10% → writing the script… 20% → filming scene 2 of 3… 45% → editing… 90%* — and you're pinged when it's published. Telegram or web, same experience.
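
A self-updating status message boils down to a job that calls one "edit this message" function at each stage. The sketch below shows that shape only; run_skit_job and edit_status are hypothetical names, the real work is elided, and the linear spread of filming progress between 20% and 90% is an assumption (the 45% shown for scene 2 of 3 above suggests the real weighting differs).

```python
from typing import Callable

FILMING_START, FILMING_END = 20, 90  # assumed bounds of the filming stage


def status_line(label: str, percent: int) -> str:
    return f"{label}... {percent}%"


def filming_progress(scene: int, total: int) -> int:
    """Progress when filming of scene `scene` (1-based) begins."""
    return FILMING_START + (FILMING_END - FILMING_START) * (scene - 1) // total


def run_skit_job(total_scenes: int, edit_status: Callable[[str], None]) -> None:
    """Walk through the stages, rewriting one status message as it goes."""
    edit_status(status_line("setting the scene", 10))
    # ... compose the opening frame here ...
    edit_status(status_line("writing the script", 20))
    # ... write the beat-by-beat script here ...
    for scene in range(1, total_scenes + 1):
        label = f"filming scene {scene} of {total_scenes}"
        edit_status(status_line(label, filming_progress(scene, total_scenes)))
        # ... generate this scene here ...
    edit_status(status_line("editing", 90))
    # ... stitch scenes and overlay the logo here ...
    edit_status(status_line("published", 100))
```

Passing the editor in as a callable is what keeps Telegram and web identical: each front end supplies its own way to rewrite a message, and the job itself does not care which one it gets.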

Step 6 — Publish everywhere at once

The finished video posts to your Facebook Page, publishes as an Instagram Reel (container, processing wait, publish — handled), and cross-posts to TikTok via its official Content Posting API, with X receiving the campaign text. One production, four surfaces, zero uploads. (TikTok runs private-visibility while in the platform's pre-audit sandbox — tracked openly on our roadmap.)
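
The "container, processing wait, publish" sequence for a Reel looks roughly like this. It is a sketch, not catais code: the paths and parameters follow the Instagram Graph API content publishing flow as publicly documented, while publish_reel, the injected post and get callables, and the polling limits are assumptions made for illustration. Authentication and the HTTP client are deliberately left to the caller.

```python
import time
from typing import Callable

Params = dict[str, str]
Call = Callable[[str, Params], dict]


def publish_reel(ig_user_id: str, video_url: str, caption: str,
                 post: Call, get: Call,
                 poll_seconds: float = 5.0, max_polls: int = 60,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Create a Reel container, wait for processing, then publish it.

    `post` and `get` are caller-supplied functions that perform the
    authenticated HTTP request and return the decoded JSON response.
    """
    # 1. Container: hand the platform the video URL and caption.
    container = post(f"/{ig_user_id}/media", {
        "media_type": "REELS",
        "video_url": video_url,
        "caption": caption,
    })
    container_id = container["id"]

    # 2. Processing wait: poll until the container is ready.
    for _ in range(max_polls):
        status = get(f"/{container_id}", {"fields": "status_code"})["status_code"]
        if status == "FINISHED":
            break
        if status in ("ERROR", "EXPIRED"):
            raise RuntimeError(f"container {container_id} ended as {status}")
        sleep(poll_seconds)
    else:
        raise TimeoutError(f"container {container_id} still processing")

    # 3. Publish: turn the finished container into a live Reel.
    published = post(f"/{ig_user_id}/media_publish", {"creation_id": container_id})
    return published["id"]
```

Injecting post, get and sleep keeps the sequence testable without network access, and it keeps tokens out of the publishing logic.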

The reliability engineering you don't see

  • Quota fallbacks: video model capacity is finite; when a primary model's daily quota is exhausted, the pipeline falls back to faster variants instead of failing your request.
  • Honest failure reporting: if a clip can't be produced, the chat tells you *why* — and falls back to a branded image post so the moment isn't lost.
  • Idempotent publishing: retries can't double-post; cross-post results are reported per-platform (“Instagram ✅, TikTok ✅”).
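
Two of these guarantees, quota fallback and idempotent publishing, fit in a short sketch. All names are hypothetical (QuotaExhausted, generate_with_fallback, publish_once), and the in-memory dict stands in for whatever durable store a real system would need; the post does not describe how catais implements either.

```python
from typing import Callable


class QuotaExhausted(Exception):
    """Raised by a generator when a model's quota is used up."""


def generate_with_fallback(models: list[str],
                           generate: Callable[[str], str]) -> tuple[str, str]:
    """Try each model in order; return (model_used, result)."""
    for model in models:
        try:
            return model, generate(model)
        except QuotaExhausted:
            continue  # Out of quota: move on to the next variant.
    raise RuntimeError("all video models are out of quota")


def publish_once(job_id: str, platform: str,
                 publish: Callable[[str], str],
                 ledger: dict[tuple[str, str], str]) -> str:
    """Publish a job to a platform at most once, even across retries.

    `ledger` records finished (job, platform) pairs. A plain dict is
    used here; a real system would need a durable, shared store.
    """
    key = (job_id, platform)
    if key in ledger:
        return ledger[key]       # A retry: reuse the earlier result.
    result = publish(platform)
    ledger[key] = result         # Record only after success.
    return result
```

Keying the ledger per platform is also what allows per-platform reporting: a retry after a TikTok failure re-attempts TikTok alone, and the Instagram entry already in the ledger is returned untouched.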

What to ask for (cheat sheet)

  • “Make a reel of [product]”: one talking clip, posted everywhere.
  • “Make a skit of [product], 3 scenes”: a stitched story, produced in the background.
  • “Market this and post it” + an attached photo: your photo becomes the anchor.
  • Or none of the above: schedule video days, or let Autopilot turn price drops into reels automatically.

AI video is on the Business plan ($50/month). Generation is genuinely compute-expensive, and we'd rather price it transparently than throttle it secretly. If you've been meaning to “do video” since last year, the honest path is: connect your store, type one sentence, and watch your first reel publish itself.

See it on your own store.

Free plan, one page, no card — live in an afternoon.
