The Problem with Cloud-Based Captioning
When I started working on educational video content, I needed accurate captions in multiple languages. Existing solutions fell into two categories: expensive cloud services that sent video data to external servers, or inaccurate free tools that couldn't handle my Nepali accent.
Privacy was a major concern—when dealing with medical or professional content, uploading videos to third-party services wasn't an option. I needed a solution that worked entirely locally while still producing high-quality captions.
Pipeline Architecture
The core pipeline has four stages, each handling a specific part of the workflow:
1. Audio Extraction
FFmpeg handles audio extraction from video files. It supports virtually all video formats and can extract audio at various quality levels. I used:
ffmpeg -i input_video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
Key parameters: -vn drops the video stream, -acodec pcm_s16le writes uncompressed 16-bit PCM, -ar 16000 sets a 16 kHz sample rate (the rate Whisper resamples audio to internally), and -ac 1 downmixes to mono.
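If the extraction step is driven from Python rather than the shell, it might look roughly like the sketch below. It assumes ffmpeg is on the PATH; the function names build_ffmpeg_command and extract_audio are illustrative and not taken from the project's source. The -y flag (overwrite existing output) is an addition to the command shown above.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_ffmpeg_command(video_path: str, wav_path: str, sample_rate: int = 16000) -> List[str]:
    """Return the argument list for the extraction command (hypothetical helper)."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it already exists
        "-i", str(video_path),     # input video
        "-vn",                     # drop the video stream
        "-acodec", "pcm_s16le",    # 16-bit PCM audio
        "-ar", str(sample_rate),   # sample rate in Hz
        "-ac", "1",                # mono
        str(wav_path),
    ]


def extract_audio(video_path: str, wav_path: Optional[str] = None) -> Path:
    """Run ffmpeg and return the path of the WAV file it wrote."""
    out = Path(wav_path) if wav_path else Path(video_path).with_suffix(".wav")
    # check=True raises CalledProcessError if ffmpeg exits with an error
    subprocess.run(build_ffmpeg_command(video_path, str(out)), check=True, capture_output=True)
    return out
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.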
2. Transcription with Whisper
OpenAI's Whisper model handles speech-to-text. I used the medium-sized model for better accuracy with accented speech. The base model is faster but struggles with less common accents.
import whisper
model = whisper.load_model("medium")
result = model.transcribe("audio.wav", language="en", task="transcribe")
print(result["text"])
Alongside the full text, the result includes a list of segments, each with start and end timestamps, which is crucial for syncing captions with video.
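Before the segments go any further, a small cleanup pass can help. The sketch below assumes each segment is a dict with "start", "end", and "text" keys, as in Whisper's result["segments"]; clean_segments is a hypothetical helper name. Segment text often carries leading whitespace, and empty segments would otherwise become blank captions.

```python
def clean_segments(segments):
    """Keep only the fields the later stages need and drop empty segments."""
    cleaned = []
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # skip segments that contain only whitespace
        cleaned.append({
            "start": float(seg["start"]),  # seconds from the start of the audio
            "end": float(seg["end"]),
            "text": text,
        })
    return cleaned
```

In the pipeline this would be called as clean_segments(result["segments"]) right after transcription.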
3. Translation with NLLB
Meta's No Language Left Behind (NLLB) model handles translation to target languages. NLLB-200 covers about 200 languages, including Nepali, which was essential for my use case.
from transformers import pipeline
translator = pipeline("translation", model="facebook/nllb-200-distilled-600M")
segments = result["segments"]
for seg in segments:
    # NLLB uses FLORES-200 codes; Nepali in Devanagari script is "npi_Deva"
    translated = translator(seg["text"], src_lang="eng_Latn", tgt_lang="npi_Deva")
    seg["translation"] = translated[0]["translation_text"]
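Translating one segment per call is simple but leaves throughput on the table. A possible variation, sketched below, passes the text in small batches. It assumes the translator is a callable that accepts a list of strings and returns one {"translation_text": ...} dict per input, which is how the transformers translation pipeline behaves for list input. translate_segments and its batch_size parameter are illustrative names, and "npi_Deva" is the code the NLLB-200 language list uses for Nepali in Devanagari script.

```python
def translate_segments(segments, translator, src_lang="eng_Latn",
                       tgt_lang="npi_Deva", batch_size=8):
    """Attach a 'translation' key to each segment, translating in batches."""
    for i in range(0, len(segments), batch_size):
        batch = segments[i:i + batch_size]
        texts = [seg["text"].strip() for seg in batch]
        # One call per batch; outputs come back in the same order as the inputs
        outputs = translator(texts, src_lang=src_lang, tgt_lang=tgt_lang)
        for seg, out in zip(batch, outputs):
            seg["translation"] = out["translation_text"]
    return segments
```

Taking the translator as an argument also makes the function easy to test with a stand-in, without loading the 600M-parameter model.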
4. Subtitle Generation
The final stage converts processed data into standard subtitle formats. I generate both SRT (most compatible) and VTT (supports more styling):
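def format_time(seconds):
    # SRT timestamps use the form HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def create_srt(segments, translation=False):
    srt_content = ""
    for i, seg in enumerate(segments, 1):
        start = format_time(seg["start"])
        end = format_time(seg["end"])
        text = seg["text"].strip()
        if translation:
            # Add the translated line under the original text
            text += f"\n{seg['translation']}"
        srt_content += f"{i}\n{start} --> {end}\n{text}\n\n"
    return srt_content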
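The VTT output follows the same pattern with two differences: the file starts with a WEBVTT header line, and timestamps use a period rather than a comma before the milliseconds. A minimal sketch, with vtt_timestamp and create_vtt as hypothetical helper names and no styling or cue settings:

```python
def vtt_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm, the timestamp form WebVTT uses."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def create_vtt(segments):
    """Build a WebVTT document from segments with 'start', 'end', and 'text'."""
    lines = ["WEBVTT", ""]  # required header, then a blank line
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)
```

Cue numbers are optional in WebVTT, so this sketch leaves them out.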
Local-First Tradeoffs
Running everything locally has significant advantages but also requires careful consideration of hardware constraints:
Advantages
- Privacy: No video data leaves your machine—critical for sensitive content
- Cost: Zero per-minute API charges; just pay for hardware once
- Offline Use: Works without internet connection
- Customization: Fine-tune models on your own data for better accuracy
Hardware Considerations
Model sizes matter significantly. My experience:
- Whisper Tiny: 39M parameters, fast but less accurate (8GB RAM OK)
- Whisper Base: 74M parameters, good balance (12GB RAM recommended)
- Whisper Medium: 769M parameters, excellent accuracy (16GB+ RAM needed)
- CUDA Acceleration: If you have an NVIDIA GPU, Whisper runs much faster, around 10x in my experience
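One way to act on these numbers is to choose the model from the memory available and use the GPU when one is present. The sketch below hard-codes the RAM thresholds from the list above, which reflect my experience rather than official requirements. pick_model_size and load_whisper are illustrative names, and the caller is assumed to supply the RAM figure.

```python
def pick_model_size(ram_gb: float) -> str:
    """Map available RAM (in GB) to a Whisper model name using the thresholds above."""
    if ram_gb >= 16:
        return "medium"
    if ram_gb >= 12:
        return "base"
    return "tiny"


def load_whisper(ram_gb: float):
    """Load the chosen model on the GPU if CUDA is available, otherwise on the CPU."""
    # Imported here so the size logic can be used without the heavy dependencies
    import torch
    import whisper

    device = "cuda" if torch.cuda.is_available() else "cpu"
    return whisper.load_model(pick_model_size(ram_gb), device=device)
```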
User Experience Design
I built a Streamlit interface to make the tool accessible to non-technical users. The key UX decisions:
Clear Progress Feedback
Video processing can take 10-30 minutes depending on length. Users need to know exactly what's happening:
import streamlit as st
progress_bar = st.progress(0)
status_text = st.empty()
# Update as each stage completes
status_text.text("Extracting audio...")
progress_bar.progress(25)
status_text.text("Transcribing...")
progress_bar.progress(50)
status_text.text("Translating...")
progress_bar.progress(75)
status_text.text("Generating subtitles...")
progress_bar.progress(100)
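In practice each status update has to sit next to the work it describes. A rough sketch of one way to tie the two together is below: the stages are passed as (label, function) pairs and the UI update is a callback, so the same runner works with or without Streamlit. run_stages and on_update are hypothetical names, not part of the Streamlit API.

```python
def run_stages(stages, on_update):
    """Run each (label, step) pair in order, reporting progress before each step."""
    results = []
    total = len(stages)
    for index, (label, step) in enumerate(stages):
        # Report the share of stages already finished, as an integer percentage
        on_update(label, int(index * 100 / total))
        results.append(step())
    on_update("Done", 100)
    return results
```

With the widgets created above, the callback could be a small function that calls status_text.text(label) and progress_bar.progress(percent).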
Preview Before Download
Users can preview generated captions synced with video before downloading. This catches timing issues before they become problems.
Batch Processing
For creators with multiple videos, batch processing queues multiple files and processes them sequentially overnight.
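For an unattended overnight run, one bad file should not stop the rest of the queue. A minimal sketch of that idea follows, assuming a process_one callable that runs the full pipeline for a single video and raises on failure; both process_batch and process_one are hypothetical names.

```python
def process_batch(video_paths, process_one):
    """Process videos one at a time, collecting failures instead of stopping."""
    done, failed = [], []
    for path in video_paths:
        try:
            done.append((path, process_one(path)))
        except Exception as exc:
            # Record the error and move on to the next file in the queue
            failed.append((path, str(exc)))
    return done, failed
```

The failed list can be shown in the interface the next morning so that only those files need a second pass.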
Performance Optimizations
Several optimizations improved processing speed:
- Caching: Store model weights locally; don't re-download on each run
- Chunked Processing: Process long videos in segments to manage memory
- Parallel Translation: Translate multiple segments concurrently
- Optimized Formats: Use web-friendly video codecs for faster loading
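Chunked processing can be sketched as two small helpers: one that splits the audio duration into fixed-length spans, and one that shifts each chunk's segment timestamps back onto the full video's timeline. The 600-second default and both function names are assumptions for illustration. Cutting at fixed times can split a word in two, so cutting at silences would be a natural refinement.

```python
def chunk_spans(duration, chunk_len=600.0):
    """Split [0, duration) into consecutive (start, end) spans of at most chunk_len seconds."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    spans = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        spans.append((start, end))
        start = end
    return spans


def shift_segments(segments, offset):
    """Return copies of the segments with start and end moved later by offset seconds."""
    return [
        {**seg, "start": seg["start"] + offset, "end": seg["end"] + offset}
        for seg in segments
    ]
```

Each chunk is transcribed on its own, and its segments are then shifted by the chunk's start time before being merged into one list.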
Future Improvements
Planned enhancements include:
- Speaker diarization to label different speakers in the video
- Custom vocabulary support for domain-specific terminology
- Integration with video editing software for direct export
- Web interface hosted locally for easier team collaboration
Conclusion
Building a local caption pipeline requires more upfront effort than using a cloud service, but the benefits—in privacy, cost savings, and offline capability—make it worthwhile for serious content creators. The key is designing a modular pipeline that can be optimized and extended as needs evolve.
The complete source code and installation instructions are available on my GitHub. Feel free to adapt it for your own use case.