# Setup Captions

Add AI-powered caption generation to your video editor using OpenAI Whisper via a server-side proxy.
RVE supports AI-powered caption generation through a transcription adaptor. The built-in Whisper adaptor uses a proxy pattern: your server holds the OpenAI API key and calls the Whisper API, while the editor only ever calls your local route.
## Why a proxy?
The editor runs in the browser. If you called the OpenAI API directly from the client, your API key would be exposed in network requests for anyone to steal. By routing through your own server, the key stays secret and never leaves your backend.
## How it works
```
Editor UI → POST /api/ai/captions → Your server → api.openai.com (Whisper)
```

Unlike sounds and images, caption generation is not enabled by default. You need to:
- Create a server-side proxy route that calls OpenAI Whisper
- Pass `createWhisperTranscriptionAdaptor()` to the editor
## Step 1: Get an OpenAI API key
Sign up at platform.openai.com and create an API key. Add it to your environment:
```bash
OPENAI_API_KEY=sk-your-api-key-here
```

## Step 2: Create the caption proxy route
Create `app/api/ai/captions/route.ts`:

```ts
import { NextRequest, NextResponse } from 'next/server';

export async function POST(request: NextRequest) {
  const apiKey = process.env.OPENAI_API_KEY;
  if (!apiKey) {
    return NextResponse.json(
      { error: 'OpenAI API key not configured' },
      { status: 500 }
    );
  }

  // The adaptor sends FormData when the video is a local blob,
  // or JSON when it's a remote URL.
  let mediaBlob: Blob;
  let language: string | undefined;

  const contentType = request.headers.get('content-type') || '';
  if (contentType.includes('multipart/form-data')) {
    // Local file uploaded from browser
    const formData = await request.formData();
    const file = formData.get('file') as File | null;
    if (!file) {
      return NextResponse.json({ error: 'No file provided' }, { status: 400 });
    }
    mediaBlob = file;
    language = (formData.get('language') as string) || undefined;
  } else {
    // Remote URL — download server-side
    const body = await request.json();
    if (!body.videoSrc) {
      return NextResponse.json(
        { error: 'videoSrc is required' },
        { status: 400 }
      );
    }
    language = body.language;

    const mediaResponse = await fetch(body.videoSrc);
    if (!mediaResponse.ok) {
      return NextResponse.json(
        { error: `Failed to fetch media: ${mediaResponse.status}` },
        { status: 502 }
      );
    }
    mediaBlob = await mediaResponse.blob();
  }

  // Send to OpenAI Whisper
  const whisperForm = new FormData();
  whisperForm.append('file', mediaBlob, 'audio.mp4');
  whisperForm.append('model', 'whisper-1');
  whisperForm.append('response_format', 'verbose_json');
  whisperForm.append('timestamp_granularities[]', 'word');
  whisperForm.append('timestamp_granularities[]', 'segment');
  if (language) {
    whisperForm.append('language', language);
  }

  const whisperResponse = await fetch(
    'https://api.openai.com/v1/audio/transcriptions',
    {
      method: 'POST',
      headers: { Authorization: `Bearer ${apiKey}` },
      body: whisperForm,
    }
  );

  if (!whisperResponse.ok) {
    const error = await whisperResponse.text();
    return NextResponse.json(
      { error: `Whisper API error: ${whisperResponse.status} - ${error}` },
      { status: whisperResponse.status }
    );
  }

  const result = await whisperResponse.json();

  // Transform Whisper response into RVE formats
  const captions = transformWhisperResponse(result);

  // Build transcript for caching (generate once, reuse on subsequent clicks)
  const transcript = {
    language: language || result.language,
    createdAt: new Date().toISOString(),
    segments: (result.segments || []).map((seg: any) => ({
      text: seg.text.trim(),
      startMs: Math.round(seg.start * 1000),
      endMs: Math.round(seg.end * 1000),
      confidence: 1,
      words: (result.words || [])
        .filter((w: any) => w.start >= seg.start && w.end <= seg.end)
        .map((w: any) => ({
          word: w.word.trim(),
          startMs: Math.round(w.start * 1000),
          endMs: Math.round(w.end * 1000),
          confidence: 1,
        })),
    })),
  };

  return NextResponse.json({ captions, transcript });
}

function transformWhisperResponse(result: any) {
  const words = result.words || [];
  if (words.length === 0) return [];

  // Group words into sentences (split on punctuation or every ~8 words)
  const captions: any[] = [];
  let currentWords: any[] = [];

  for (const word of words) {
    currentWords.push({
      word: word.word.trim(),
      startMs: Math.round(word.start * 1000),
      endMs: Math.round(word.end * 1000),
      confidence: 1,
    });

    const text = word.word.trim();
    const isEndOfSentence = /[.!?]$/.test(text);
    const isTooLong = currentWords.length >= 8;

    if (isEndOfSentence || isTooLong) {
      captions.push({
        text: currentWords.map((w) => w.word).join(' '),
        startMs: currentWords[0].startMs,
        endMs: currentWords[currentWords.length - 1].endMs,
        timestampMs: null,
        confidence: 1,
        words: currentWords,
      });
      currentWords = [];
    }
  }

  // Push remaining words
  if (currentWords.length > 0) {
    captions.push({
      text: currentWords.map((w) => w.word).join(' '),
      startMs: currentWords[0].startMs,
      endMs: currentWords[currentWords.length - 1].endMs,
      timestampMs: null,
      confidence: 1,
      words: currentWords,
    });
  }

  return captions;
}
```

## Step 3: Pass the adaptor to the editor
```tsx
import { createWhisperTranscriptionAdaptor } from '@reactvideoeditor/react-video-editor/adaptors/whisper-transcription-adaptor';

<ReactVideoEditor
  adaptors={{
    transcription: createWhisperTranscriptionAdaptor(),
  }}
/>
```

That's it. Select a video on the timeline, open the AI panel, and click Generate Captions. The editor calls your proxy route, which calls Whisper, and the captions appear on the timeline with word-level timing.
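The caption boundaries come from the grouping heuristic in `transformWhisperResponse` above: a caption closes on sentence-ending punctuation or after 8 words. If you want to tune that heuristic, it is easy to exercise in isolation. This sketch runs the same loop on a hand-made word list (the words and timings are made up for illustration):

```typescript
// Hypothetical word-level output in the shape Whisper's verbose_json
// returns (timings in seconds).
const words = [
  { word: 'Hello', start: 0.0, end: 0.4 },
  { word: 'world.', start: 0.5, end: 0.9 },
  { word: 'Next', start: 1.2, end: 1.5 },
  { word: 'sentence', start: 1.6, end: 2.1 },
];

type CaptionWord = { word: string; startMs: number; endMs: number; confidence: number };

// Same heuristic as transformWhisperResponse in Step 2:
// close a caption on sentence-ending punctuation or after 8 words.
const captions: { text: string; startMs: number; endMs: number; words: CaptionWord[] }[] = [];
let current: CaptionWord[] = [];

for (const w of words) {
  current.push({
    word: w.word.trim(),
    startMs: Math.round(w.start * 1000),
    endMs: Math.round(w.end * 1000),
    confidence: 1,
  });
  if (/[.!?]$/.test(w.word.trim()) || current.length >= 8) {
    captions.push({
      text: current.map((c) => c.word).join(' '),
      startMs: current[0].startMs,
      endMs: current[current.length - 1].endMs,
      words: current,
    });
    current = [];
  }
}
// Flush any words left after the last sentence boundary
if (current.length > 0) {
  captions.push({
    text: current.map((c) => c.word).join(' '),
    startMs: current[0].startMs,
    endMs: current[current.length - 1].endMs,
    words: current,
  });
}

console.log(captions.map((c) => `${c.startMs}-${c.endMs}: ${c.text}`));
// "Hello world." closes on the period; "Next sentence" is flushed at the end.
```

Shortening the 8-word cap produces snappier captions for fast speech; lengthening it produces fewer, longer lines.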
## Custom endpoint
If your proxy route lives at a different path:
```tsx
<ReactVideoEditor
  adaptors={{
    transcription: createWhisperTranscriptionAdaptor({
      endpoint: '/my-api/transcribe',
    }),
  }}
/>
```

## Custom transcription provider
You can use any transcription service (Deepgram, AssemblyAI, Rev, etc.) by implementing the `TranscriptionAdaptor` interface:
```tsx
import type { TranscriptionAdaptor } from '@reactvideoeditor/react-video-editor/types';

const myAdaptor: TranscriptionAdaptor = {
  name: 'deepgram',
  displayName: 'Deepgram',
  async transcribe({ videoSrc, language, durationSeconds }) {
    const response = await fetch('/api/my-transcription', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ videoSrc, language, durationSeconds }),
    });
    const data = await response.json();
    return { captions: data.captions, transcript: data.transcript };
  },
};

<ReactVideoEditor
  adaptors={{
    transcription: myAdaptor,
  }}
/>
```

Your server route must return `{ captions: Caption[], transcript?: Transcript }`. Including `transcript` enables caching — the editor stores it on the video so subsequent caption generation is instant (no re-transcription). Each caption has:
```ts
{
  text: string;       // The caption text
  startMs: number;    // Start time in milliseconds
  endMs: number;      // End time in milliseconds
  timestampMs: null;
  confidence: number; // 0-1
  words: Array<{      // Word-level timing
    word: string;
    startMs: number;
    endMs: number;
    confidence: number;
  }>;
}
```
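When implementing a custom provider, it can help to fail fast if the server response doesn't match this shape, rather than letting malformed captions reach the timeline. Here is a minimal sketch of such a check; `isValidCaption` is a hypothetical helper written for illustration, not part of RVE:

```typescript
// Hypothetical runtime check a custom adaptor could run on its server's
// response before returning it to the editor. Field names follow the
// Caption shape documented above.
function isValidCaption(value: unknown): boolean {
  if (typeof value !== 'object' || value === null) return false;
  const c = value as Record<string, unknown>;
  return (
    typeof c.text === 'string' &&
    typeof c.startMs === 'number' &&
    typeof c.endMs === 'number' &&
    c.startMs <= c.endMs &&
    c.timestampMs === null &&
    typeof c.confidence === 'number' &&
    Array.isArray(c.words)
  );
}

const good = {
  text: 'Hello world.',
  startMs: 0,
  endMs: 900,
  timestampMs: null,
  confidence: 1,
  words: [],
};
// Missing fields, and startMs after endMs
const bad = { text: 'oops', startMs: 900, endMs: 0 };

console.log(isValidCaption(good)); // true
console.log(isValidCaption(bad)); // false
```

Inside a custom adaptor's `transcribe`, you might filter or reject with `data.captions.every(isValidCaption)` before returning, so a provider bug surfaces as a clear error instead of a silent timeline glitch.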