# Setup Captions

Add AI-powered caption generation to your video editor using OpenAI Whisper via a server-side proxy.
RVE supports AI-powered caption generation through a transcription adaptor. The built-in Whisper adaptor uses a proxy pattern: your server holds the OpenAI API key and calls the Whisper API, while the editor only ever calls your local route.
## Why a proxy?
The editor runs in the browser. If you called the OpenAI API directly from the client, your API key would be exposed in network requests for anyone to steal. By routing through your own server, the key stays secret and never leaves your backend.
## How it works
```
Editor UI → POST /api/ai/captions → Your server → api.openai.com (Whisper)
```

Unlike sounds and images, caption generation is not enabled by default. You need to:
- Create a server-side proxy route that calls OpenAI Whisper
- Pass `createWhisperTranscriptionAdaptor()` to the editor
## Step 1: Get an OpenAI API key
Sign up at platform.openai.com and create an API key. Add it to your environment:
```bash
OPENAI_API_KEY=sk-your-api-key-here
```

## Step 2: Create the caption proxy route
Create `app/api/ai/captions/route.ts`:

```ts
import { NextRequest, NextResponse } from 'next/server';

export async function POST(request: NextRequest) {
  const apiKey = process.env.OPENAI_API_KEY;
  if (!apiKey) {
    return NextResponse.json(
      { error: 'OpenAI API key not configured' },
      { status: 500 }
    );
  }

  // The adaptor sends FormData when the video is a local blob,
  // or JSON when it's a remote URL.
  let mediaBlob: Blob;
  let language: string | undefined;

  const contentType = request.headers.get('content-type') || '';
  if (contentType.includes('multipart/form-data')) {
    // Local file uploaded from browser
    const formData = await request.formData();
    const file = formData.get('file') as File | null;
    if (!file) {
      return NextResponse.json({ error: 'No file provided' }, { status: 400 });
    }
    mediaBlob = file;
    language = (formData.get('language') as string) || undefined;
  } else {
    // Remote URL — download server-side
    const body = await request.json();
    if (!body.videoSrc) {
      return NextResponse.json(
        { error: 'videoSrc is required' },
        { status: 400 }
      );
    }
    language = body.language;

    const mediaResponse = await fetch(body.videoSrc);
    if (!mediaResponse.ok) {
      return NextResponse.json(
        { error: `Failed to fetch media: ${mediaResponse.status}` },
        { status: 502 }
      );
    }
    mediaBlob = await mediaResponse.blob();
  }

  // Send to OpenAI Whisper
  const whisperForm = new FormData();
  whisperForm.append('file', mediaBlob, 'audio.mp4');
  whisperForm.append('model', 'whisper-1');
  whisperForm.append('response_format', 'verbose_json');
  whisperForm.append('timestamp_granularities[]', 'word');
  whisperForm.append('timestamp_granularities[]', 'segment');
  if (language) {
    whisperForm.append('language', language);
  }

  const whisperResponse = await fetch(
    'https://api.openai.com/v1/audio/transcriptions',
    {
      method: 'POST',
      headers: { Authorization: `Bearer ${apiKey}` },
      body: whisperForm,
    }
  );

  if (!whisperResponse.ok) {
    const error = await whisperResponse.text();
    return NextResponse.json(
      { error: `Whisper API error: ${whisperResponse.status} - ${error}` },
      { status: whisperResponse.status }
    );
  }

  const result = await whisperResponse.json();

  // Transform Whisper response into RVE formats
  const captions = transformWhisperResponse(result);

  // Build transcript for caching (generate once, reuse on subsequent clicks)
  const transcript = {
    language: language || result.language,
    createdAt: new Date().toISOString(),
    segments: (result.segments || []).map((seg: any) => ({
      text: seg.text.trim(),
      startMs: Math.round(seg.start * 1000),
      endMs: Math.round(seg.end * 1000),
      confidence: 1,
      words: (result.words || [])
        .filter((w: any) => w.start >= seg.start && w.end <= seg.end)
        .map((w: any) => ({
          word: w.word.trim(),
          startMs: Math.round(w.start * 1000),
          endMs: Math.round(w.end * 1000),
          confidence: 1,
        })),
    })),
  };

  return NextResponse.json({ captions, transcript });
}

function transformWhisperResponse(result: any) {
  const words = result.words || [];
  if (words.length === 0) return [];

  // Group words into sentences (split on punctuation or every ~8 words)
  const captions: any[] = [];
  let currentWords: any[] = [];

  for (const word of words) {
    currentWords.push({
      word: word.word.trim(),
      startMs: Math.round(word.start * 1000),
      endMs: Math.round(word.end * 1000),
      confidence: 1,
    });

    const text = word.word.trim();
    const isEndOfSentence = /[.!?]$/.test(text);
    const isTooLong = currentWords.length >= 8;

    if (isEndOfSentence || isTooLong) {
      captions.push({
        text: currentWords.map((w) => w.word).join(' '),
        startMs: currentWords[0].startMs,
        endMs: currentWords[currentWords.length - 1].endMs,
        timestampMs: null,
        confidence: 1,
        words: currentWords,
      });
      currentWords = [];
    }
  }

  // Push remaining words
  if (currentWords.length > 0) {
    captions.push({
      text: currentWords.map((w) => w.word).join(' '),
      startMs: currentWords[0].startMs,
      endMs: currentWords[currentWords.length - 1].endMs,
      timestampMs: null,
      confidence: 1,
      words: currentWords,
    });
  }

  return captions;
}
```

## Step 3: Pass the adaptor to the editor
```tsx
import { createWhisperTranscriptionAdaptor } from '@reactvideoeditor/react-video-editor/adaptors/whisper-transcription-adaptor';

<ReactVideoEditor
  adaptors={{
    transcription: createWhisperTranscriptionAdaptor(),
  }}
/>
```

That's it. Select a video on the timeline, open the AI panel, and click Generate Captions. The editor calls your proxy route, which calls Whisper, and the captions appear on the timeline with word-level timing.
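The caption boundaries come from the grouping heuristic in `transformWhisperResponse` above: a caption closes on sentence-ending punctuation or after 8 words. If you want to tune that heuristic, it is easy to exercise in isolation. This sketch runs the same loop on a hand-made word list (the words and timings are made up for illustration):

```typescript
// Hypothetical word-level output in the shape Whisper's verbose_json
// returns (timings in seconds).
const words = [
  { word: 'Hello', start: 0.0, end: 0.4 },
  { word: 'world.', start: 0.5, end: 0.9 },
  { word: 'Next', start: 1.2, end: 1.5 },
  { word: 'sentence', start: 1.6, end: 2.1 },
];

type CaptionWord = { word: string; startMs: number; endMs: number; confidence: number };

// Same heuristic as transformWhisperResponse in Step 2:
// close a caption on sentence-ending punctuation or after 8 words.
const captions: { text: string; startMs: number; endMs: number; words: CaptionWord[] }[] = [];
let current: CaptionWord[] = [];

for (const w of words) {
  current.push({
    word: w.word.trim(),
    startMs: Math.round(w.start * 1000),
    endMs: Math.round(w.end * 1000),
    confidence: 1,
  });
  if (/[.!?]$/.test(w.word.trim()) || current.length >= 8) {
    captions.push({
      text: current.map((c) => c.word).join(' '),
      startMs: current[0].startMs,
      endMs: current[current.length - 1].endMs,
      words: current,
    });
    current = [];
  }
}
// Flush any words left after the last sentence boundary
if (current.length > 0) {
  captions.push({
    text: current.map((c) => c.word).join(' '),
    startMs: current[0].startMs,
    endMs: current[current.length - 1].endMs,
    words: current,
  });
}

console.log(captions.map((c) => `${c.startMs}-${c.endMs}: ${c.text}`));
// "Hello world." closes on the period; "Next sentence" is flushed at the end.
```

Shortening the 8-word cap produces snappier captions for fast speech; lengthening it produces fewer, longer lines.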
## Custom endpoint
If your proxy route lives at a different path:
```tsx
<ReactVideoEditor
  adaptors={{
    transcription: createWhisperTranscriptionAdaptor({
      endpoint: '/my-api/transcribe',
    }),
  }}
/>
```

## Custom transcription provider
You can use any transcription service (Deepgram, AssemblyAI, Rev, etc.) by implementing the `TranscriptionAdaptor` interface:
```tsx
import type { TranscriptionAdaptor } from '@reactvideoeditor/react-video-editor/types';

const myAdaptor: TranscriptionAdaptor = {
  name: 'deepgram',
  displayName: 'Deepgram',
  async transcribe({ videoSrc, language, durationSeconds }) {
    const response = await fetch('/api/my-transcription', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ videoSrc, language, durationSeconds }),
    });
    const data = await response.json();
    return { captions: data.captions, transcript: data.transcript };
  },
};

<ReactVideoEditor
  adaptors={{
    transcription: myAdaptor,
  }}
/>
```

Your server route must return `{ captions: Caption[], transcript?: Transcript }`. Including `transcript` enables caching — the editor stores it on the video so subsequent caption generation is instant (no re-transcription). Each caption has:
```ts
{
  text: string;       // The caption text
  startMs: number;    // Start time in milliseconds
  endMs: number;      // End time in milliseconds
  timestampMs: null;
  confidence: number; // 0-1
  words: Array<{      // Word-level timing
    word: string;
    startMs: number;
    endMs: number;
    confidence: number;
  }>;
}
```
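When implementing a custom provider, it can help to fail fast if the server response doesn't match this shape, rather than letting malformed captions reach the timeline. Here is a minimal sketch of such a check; `isValidCaption` is a hypothetical helper written for illustration, not part of RVE:

```typescript
// Hypothetical runtime check a custom adaptor could run on its server's
// response before returning it to the editor. Field names follow the
// Caption shape documented above.
function isValidCaption(value: unknown): boolean {
  if (typeof value !== 'object' || value === null) return false;
  const c = value as Record<string, unknown>;
  return (
    typeof c.text === 'string' &&
    typeof c.startMs === 'number' &&
    typeof c.endMs === 'number' &&
    c.startMs <= c.endMs &&
    c.timestampMs === null &&
    typeof c.confidence === 'number' &&
    Array.isArray(c.words)
  );
}

const good = {
  text: 'Hello world.',
  startMs: 0,
  endMs: 900,
  timestampMs: null,
  confidence: 1,
  words: [],
};
// Missing fields, and startMs after endMs
const bad = { text: 'oops', startMs: 900, endMs: 0 };

console.log(isValidCaption(good)); // true
console.log(isValidCaption(bad)); // false
```

Inside a custom adaptor's `transcribe`, you might filter or reject with `data.captions.every(isValidCaption)` before returning, so a provider bug surfaces as a clear error instead of a silent timeline glitch.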