The Problem with Cloud-Based Captioning

When I started working on educational video content, I needed accurate captions in multiple languages. Existing solutions fell into two categories: expensive cloud services that sent video data to external servers, and inaccurate free tools that couldn't handle my Nepali accent.

Privacy was a major concern—when dealing with medical or professional content, uploading videos to third-party services wasn't an option. I needed a solution that worked entirely locally while still producing high-quality captions.

Pipeline Architecture

The core pipeline has four stages, each handling a specific part of the workflow:

1. Audio Extraction

FFmpeg handles audio extraction from video files. It supports virtually all video formats and can extract audio at various quality levels. I used:

ffmpeg -i input_video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

Key parameters: -vn drops the video stream, -acodec pcm_s16le writes uncompressed 16-bit PCM, -ar 16000 resamples to 16 kHz (the rate Whisper resamples all input to anyway), and -ac 1 downmixes to mono.
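If you want to call this step from Python rather than the shell, the command can be wrapped with subprocess. This is a minimal sketch, not the code from my repository: the names build_ffmpeg_cmd and extract_audio are hypothetical, the -y flag (overwrite existing output) is an addition, and it assumes ffmpeg is on your PATH.

```python
import subprocess


def build_ffmpeg_cmd(video_path, wav_path):
    """Build the ffmpeg argument list for 16 kHz mono 16-bit PCM output."""
    return [
        "ffmpeg",
        "-y",                    # assumption: overwrite the output if it exists
        "-i", video_path,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        wav_path,
    ]


def extract_audio(video_path, wav_path):
    """Run ffmpeg; raises subprocess.CalledProcessError if ffmpeg fails."""
    subprocess.run(
        build_ffmpeg_cmd(video_path, wav_path),
        check=True,
        capture_output=True,
    )
    return wav_path
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.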

2. Transcription with Whisper

OpenAI's Whisper model handles speech-to-text. I used the medium-sized model for better accuracy with accented speech. The base model is faster but struggles with less common accents.

import whisper

model = whisper.load_model("medium")
result = model.transcribe("audio.wav", language="en", task="transcribe")
print(result["text"])

Besides the full text, the result includes a list of segments in result["segments"], each with start and end timestamps in seconds. These timestamps are crucial for syncing captions with video.
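To make the shape of that data concrete, here is a small sketch that pulls the timing out of a Whisper-style result. The helper name segment_timings is hypothetical, and the sample dictionary is hand-written to mimic the "segments" structure rather than taken from a real transcription.

```python
def segment_timings(result):
    """Return (start, end, text) tuples from a Whisper-style result dict."""
    return [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in result.get("segments", [])
    ]


# Hand-written stand-in for the dict returned by model.transcribe(...)
sample = {
    "text": " Hello everyone. Welcome back.",
    "segments": [
        {"start": 0.0, "end": 1.8, "text": " Hello everyone."},
        {"start": 1.8, "end": 3.4, "text": " Welcome back."},
    ],
}

for start, end, text in segment_timings(sample):
    print(f"{start:6.2f} -> {end:6.2f}  {text}")
```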

3. Translation with NLLB

Meta's No Language Left Behind (NLLB) model handles translation into the target languages. Unlike many translation tools, NLLB-200 covers roughly 200 languages, including Nepali, which was essential for my use case.

from transformers import pipeline

translator = pipeline("translation", model="facebook/nllb-200-distilled-600M")
segments = result["segments"]

# NLLB uses FLORES-200 language codes; Nepali (Devanagari script) is "npi_Deva"
for seg in segments:
    translated = translator(seg["text"], src_lang="eng_Latn", tgt_lang="npi_Deva")
    seg["translation"] = translated[0]["translation_text"]

4. Subtitle Generation

The final stage converts processed data into standard subtitle formats. I generate both SRT (most compatible) and VTT (supports more styling):

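# One way to write the format_time helper used below
def format_time(seconds):
    # SRT timestamps look like HH:MM:SS,mmm
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def create_srt(segments, include_translation=False):
    srt_content = ""
    for i, seg in enumerate(segments, 1):  # SRT cue numbers start at 1
        start = format_time(seg["start"])
        end = format_time(seg["end"])
        text = seg["text"].strip()  # Whisper segment text starts with a space
        if include_translation:
            text += f"\n{seg['translation']}"
        srt_content += f"{i}\n{start} --> {end}\n{text}\n\n"
    return srt_content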
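The VTT output follows the same pattern with two differences: the file starts with a WEBVTT header line, and timestamps use a period instead of a comma before the milliseconds. A minimal sketch, with hypothetical function names (format_vtt_time, create_vtt) and no styling or cue identifiers:

```python
def format_vtt_time(seconds):
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def create_vtt(segments):
    """Build WebVTT text from segments with 'start', 'end', and 'text' keys."""
    lines = ["WEBVTT", ""]  # header, then a blank line before the first cue
    for seg in segments:
        start = format_vtt_time(seg["start"])
        end = format_vtt_time(seg["end"])
        lines.append(f"{start} --> {end}")
        lines.append(seg["text"].strip())
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


print(create_vtt([{"start": 0.0, "end": 1.5, "text": " Hi there"}]))
```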

Local-First Tradeoffs

Running everything locally has significant advantages but also requires careful consideration of hardware constraints:

Advantages

  • Privacy: No video data leaves your machine—critical for sensitive content
  • Cost: Zero per-minute API charges; just pay for hardware once
  • Offline Use: Works without internet connection
  • Customization: Fine-tune models on your own data for better accuracy

Hardware Considerations

Model sizes matter significantly. My experience:

  • Whisper Tiny: 39M parameters, fast but less accurate (8GB RAM OK)
  • Whisper Base: 74M parameters, good balance (12GB RAM recommended)
  • Whisper Medium: 769M parameters, excellent accuracy (16GB+ RAM needed)
  • CUDA Acceleration: With an NVIDIA GPU, Whisper runs roughly 10x faster than on CPU
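Those rules of thumb can be turned into a small startup check. This is only a sketch under my own assumptions: the function names are hypothetical, the RAM thresholds are copied from the list above rather than measured, and the GPU check falls back to CPU when PyTorch is not installed.

```python
def pick_model_size(ram_gb):
    """Map available RAM (in GB) to a Whisper model name using the list above."""
    if ram_gb >= 16:
        return "medium"
    if ram_gb >= 12:
        return "base"
    return "tiny"


def pick_device():
    """Return 'cuda' when PyTorch reports a usable NVIDIA GPU, else 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_model_size(12), pick_device())

# With Whisper installed, the two values could be passed along like this:
# model = whisper.load_model(pick_model_size(16), device=pick_device())
```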

User Experience Design

I built a Streamlit interface to make the tool accessible to non-technical users. The key UX decisions:

Clear Progress Feedback

Video processing can take 10-30 minutes depending on length. Users need to know exactly what's happening:

import streamlit as st

progress_bar = st.progress(0)
status_text = st.empty()

# Update as each stage completes
status_text.text("Extracting audio...")
progress_bar.progress(25)
status_text.text("Transcribing...")
progress_bar.progress(50)
status_text.text("Translating...")
progress_bar.progress(75)
status_text.text("Generating subtitles...")
progress_bar.progress(100)
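In a real run, each update has to happen between stages rather than all at once. One way to structure that is a small runner that takes the stages as (label, function) pairs and reports through a callback. This is a sketch, not the code from my app: run_stages and on_progress are hypothetical names, and the stage functions here are stubs.

```python
def run_stages(stages, on_progress):
    """Run (label, func) pairs in order.

    on_progress(label, percent) is called before each stage with the
    share of stages already finished, and once more with 100 at the end.
    """
    total = len(stages)
    results = []
    for index, (label, func) in enumerate(stages):
        on_progress(label, int(100 * index / total))
        results.append(func())
    on_progress("Done", 100)
    return results


# Stub stages standing in for the real pipeline functions
demo_stages = [
    ("Extracting audio...", lambda: "audio.wav"),
    ("Transcribing...", lambda: "segments"),
    ("Translating...", lambda: "translated segments"),
    ("Generating subtitles...", lambda: "captions.srt"),
]

run_stages(demo_stages, lambda label, pct: print(f"{pct:3d}%  {label}"))
```

In Streamlit, the callback would set status_text.text(label) and progress_bar.progress(percent) instead of printing.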

Preview Before Download

Users can preview generated captions synced with video before downloading. This catches timing issues before they become problems.

Batch Processing

For creators with multiple videos, batch processing queues the files and processes them one after another, so a large batch can run unattended overnight.
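For an unattended run, the important detail is that one bad file should not stop the rest of the queue. A minimal sketch of that idea, assuming a hypothetical process_one callable that runs the full pipeline on a single video:

```python
from pathlib import Path


def process_batch(video_paths, process_one):
    """Process videos one at a time; record failures and keep going."""
    done, failed = [], {}
    for path in map(Path, video_paths):
        try:
            process_one(path)
            done.append(path)
        except Exception as exc:  # keep the queue moving after any error
            failed[path] = str(exc)
    return done, failed


def fake_pipeline(path):
    """Stand-in for the real pipeline; fails on one file for demonstration."""
    if path.name == "corrupt.mp4":
        raise ValueError("could not read video")


done, failed = process_batch(["a.mp4", "corrupt.mp4", "b.mp4"], fake_pipeline)
print("done:", [p.name for p in done])
print("failed:", {p.name: msg for p, msg in failed.items()})
```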

Performance Optimizations

Several optimizations improved processing speed:

  • Caching: Store model weights locally; don't re-download on each run
  • Chunked Processing: Process long videos in segments to manage memory
  • Parallel Translation: Translate multiple segments concurrently
  • Optimized Formats: Use web-friendly video codecs for faster loading
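As an illustration of chunked processing, the sketch below splits a long recording into fixed-length time spans and shifts each chunk's segment timestamps back onto the full timeline. The names chunk_spans and shift_segments and the 10-minute default are assumptions for the example. An optional overlap is supported, but removing duplicated text in the overlapping region is not handled here.

```python
def chunk_spans(duration, chunk_len=600.0, overlap=0.0):
    """Return (start, end) pairs in seconds that together cover [0, duration]."""
    if chunk_len <= overlap:
        raise ValueError("chunk_len must be larger than overlap")
    spans = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        spans.append((start, end))
        if end >= duration:
            break
        start = end - overlap
    return spans


def shift_segments(segments, offset):
    """Move chunk-relative timestamps onto the full-video timeline."""
    return [
        {**seg, "start": seg["start"] + offset, "end": seg["end"] + offset}
        for seg in segments
    ]


# A 25-minute video split into 10-minute chunks
for start, end in chunk_spans(1500.0):
    print(f"{start:7.1f} -> {end:7.1f}")
```

Each chunk would be transcribed on its own, and its segments passed through shift_segments with the chunk's start time before the subtitle file is generated.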

Future Improvements

Planned enhancements include:

  • Speaker diarization to label different speakers in the video
  • Custom vocabulary support for domain-specific terminology
  • Integration with video editing software for direct export
  • Web interface hosted locally for easier team collaboration

Conclusion

Building a local caption pipeline requires more upfront effort than using a cloud service, but the benefits—in privacy, cost savings, and offline capability—make it worthwhile for serious content creators. The key is designing a modular pipeline that can be optimized and extended as needs evolve.

The complete source code and installation instructions are available on my GitHub. Feel free to adapt it for your own use case.