Overview
The Transcribe tab (Muninn) converts your video's spoken content into text using OpenAI's Whisper speech recognition model. The resulting transcript powers all downstream features: metadata generation, captions, AutoCut, and more.
Key Features
- GPU-accelerated processing (CUDA support)
- Multiple Whisper model sizes
- Up to 4 audio tracks per video
- Built-in translation to 50+ languages
- Automatic filler word removal
- Word-level timestamps for captions
- Voice Activity Detection (VAD)
Getting Started
Step 1: Download a Whisper Model
- Go to Application Settings > Transcription
- Click Download Models
- Select a model size (see model comparison below)
- Wait for the download to complete
Step 2: Transcribe Your First Video
- Go to the Transcribe tab (Muninn)
- Select one or more videos from the list
- Configure your settings (language, device, etc.)
- Click Start Transcription
Whisper Model Comparison
Choose the right model for your needs:
| Model | Size | Speed | Quality | VRAM |
|---|---|---|---|---|
| tiny | ~75 MB | Very Fast | Basic | ~1 GB |
| base | ~150 MB | Fast | Good | ~1 GB |
| small | ~500 MB | Medium | Better | ~2 GB |
| medium | ~1.5 GB | Slower | Great | ~5 GB |
| large-v3 | ~3 GB | Slow | Excellent | ~10 GB |
| large-v3-turbo | ~3 GB | Fast | Excellent | ~6 GB |
Recommendation: Use large-v3-turbo for the best balance of speed and quality. It's nearly as accurate as large-v3 but significantly faster.
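As a rough illustration of the table and the recommendation, the sketch below picks a model from the amount of free VRAM. The function name pick_model and the preference order are assumptions made for this example and are not part of Loki Studio; the VRAM figures are the approximate values from the table.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the comparison table.
# Ordered by preference: large-v3-turbo first (the recommendation), then
# descending quality. large-v3 and tiny are left out because a model of
# equal or better listed quality fits in the same or less VRAM.
PREFERRED_MODELS = [
    ("large-v3-turbo", 6),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
]


def pick_model(free_vram_gb: float) -> Optional[str]:
    """Return the first preferred model whose VRAM estimate fits, else None."""
    for name, needed_gb in PREFERRED_MODELS:
        if needed_gb <= free_vram_gb:
            return name
    return None


print(pick_model(8))  # an 8 GB card has room for large-v3-turbo
```

If nothing fits (the function returns None), running on the CPU is the remaining option.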
Transcription Settings
Basic Settings
- Source Language: The language spoken in your video (English, Japanese, etc.)
- Device: Where to run transcription
  - auto: Automatically detect (recommended)
  - cuda: Force GPU (NVIDIA only)
  - cpu: Force CPU (slower but always works)
- Translate To: Optional - translate to another language after transcription
- Remove Filler Words: Strip "um", "ah", "uh" from transcripts
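To give a feel for what Remove Filler Words does, here is a minimal sketch of filler stripping with a regular expression. It is not Loki Studio's actual implementation: the function name remove_fillers is hypothetical, and the filler list contains only the three words named above, while the application's real list may differ.

```python
import re

# Filler words named in the setting description; the list the application
# really uses may be longer (assumption).
FILLERS = ("um", "ah", "uh")

# Match a filler as a whole word, plus an optional trailing comma and spaces.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(FILLERS) + r")\b,?\s*",
    flags=re.IGNORECASE,
)


def remove_fillers(text: str) -> str:
    """Strip filler words and tidy up the whitespace they leave behind."""
    cleaned = _FILLER_RE.sub("", text)
    cleaned = re.sub(r"\s{2,}", " ", cleaned)          # collapse double spaces
    cleaned = re.sub(r"\s+([,.!?])", r"\1", cleaned)   # no space before punctuation
    return cleaned.strip()


print(remove_fillers("Um, so I think, uh, this works"))
```

The word boundaries keep words such as "umbrella" intact, since only standalone fillers match.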
Audio Track Configuration
Configure which of your video's audio tracks to transcribe:
- Enable/Disable: Toggle each track on or off
- Track Name: Assign a label (System Audio, Mic 1, Mic 2, Mic 3)
Multi-Track Tip: If your video has separate voice and game audio tracks, transcribe only the voice track for cleaner results.
Advanced Settings
Fine-tune transcription in Application Settings > Transcription:
Compute Settings
- Compute Type: Precision mode
  - float16: Best for most GPUs (default)
  - int8_float16: Lower VRAM usage
  - int8: Lowest VRAM
  - float32: CPU or compatibility mode
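- Beam Size: Number of candidate transcriptions kept while decoding. Higher values are more accurate but slower (1-10, default: 10)
- Patience: How long the decoder keeps exploring alternatives before settling on a result. Higher values are more thorough but slower (1.0-3.0, default: 2.0)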
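The ranges and defaults above can be summarized as a small settings object. This is only a sketch: the class ComputeSettings and the helper default_compute_type are hypothetical names for illustration, and how Loki Studio stores these settings internally is not documented here.

```python
from dataclasses import dataclass
from typing import List

COMPUTE_TYPES = ("float16", "int8_float16", "int8", "float32")


def default_compute_type(device: str) -> str:
    """float16 suits most GPUs; float32 is the CPU or compatibility mode."""
    return "float16" if device == "cuda" else "float32"


@dataclass
class ComputeSettings:
    # Defaults taken from the settings list.
    compute_type: str = "float16"
    beam_size: int = 10
    patience: float = 2.0

    def problems(self) -> List[str]:
        """Return a description of every value outside its documented range."""
        found = []
        if self.compute_type not in COMPUTE_TYPES:
            found.append(f"unknown compute type: {self.compute_type}")
        if not 1 <= self.beam_size <= 10:
            found.append(f"beam size {self.beam_size} is outside 1-10")
        if not 1.0 <= self.patience <= 3.0:
            found.append(f"patience {self.patience} is outside 1.0-3.0")
        return found


print(ComputeSettings(beam_size=0).problems())
```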
Voice Activity Detection (VAD)
VAD helps filter out non-speech audio:
- Auto: Automatically choose best method
- Off: Process all audio
- Energy: Simple volume-based detection
- Silero: Neural network-based (most accurate)
- TEN: Alternative neural method
VAD Threshold: Sensitivity (0.0-1.0, default: 0.35). Lower values are more sensitive, so more of the audio is treated as speech.
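The Energy method is the easiest one to picture. The sketch below shows one simple way volume-based detection can work: split the audio into short frames and keep the frames whose RMS level reaches a threshold. The function energy_vad, the 30 ms frame length, and the reuse of 0.35 as an RMS threshold are all assumptions for illustration; Loki Studio's own detector may scale the threshold differently.

```python
import math
from typing import List, Sequence, Tuple


def energy_vad(
    samples: Sequence[float],
    sample_rate: int,
    threshold: float = 0.35,
    frame_ms: int = 30,
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans whose RMS level is at or above threshold.

    Samples are expected to lie in the range [-1.0, 1.0].
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    spans = []
    span_start = None
    for offset in range(0, len(samples), frame_len):
        frame = samples[offset:offset + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            if span_start is None:
                span_start = offset  # a speech span begins here
        elif span_start is not None:
            spans.append((span_start / sample_rate, offset / sample_rate))
            span_start = None
    if span_start is not None:
        # Audio ended while still inside a speech span.
        spans.append((span_start / sample_rate, len(samples) / sample_rate))
    return spans


# 0.3 s of silence, 0.3 s of signal, 0.3 s of silence at 1 kHz.
audio = [0.0] * 300 + [0.5] * 300 + [0.0] * 300
print(energy_vad(audio, sample_rate=1000))
```

Lowering the threshold lets quieter frames through, which matches the "lower = more sensitive" behavior described above.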
Output Options
- Generate Subtitles (.srt): Create subtitle files automatically
- Word-level Timestamps: Enable precise timing for each word (required for karaoke captions)
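For readers unfamiliar with the .srt format, the sketch below turns timed segments into SubRip text: a running index, a "start --> end" line with HH:MM:SS,mmm timestamps, the caption text, and a blank line between entries. The function names and the (start, end, text) tuple layout are assumptions for this example, not Loki Studio's internal data model.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text.strip()}\n")
    # Joining with a newline leaves one blank line between entries.
    return "\n".join(blocks)


print(to_srt([(0.0, 1.5, "Hello"), (1.5, 3.0, "and welcome back")]))
```

With word-level timestamps, the same function could be fed one tuple per word, which is roughly the timing granularity karaoke-style captions need.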
Performance Tuning
- Batch Size: Segments processed at once (default: 24)
- Intra Threads: Threads per operation (0 = auto)
- Inter Threads: Parallel operations (default: 1)
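Batch Size can be pictured as the number of audio segments handed to the model in one go. The sketch below only illustrates that grouping; the helper name batched is hypothetical, and it does not reflect how Loki Studio schedules work internally.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 24) -> Iterator[List[T]]:
    """Yield consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


segments = list(range(50))  # stand-ins for 50 audio segments
print([len(batch) for batch in batched(segments)])
```

With the default of 24, fifty segments are processed as two full batches and one short batch.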
Batch Transcription
Process multiple videos efficiently:
- Click Select Ready to auto-select all pending videos
- Or manually select videos with checkboxes
- Click Start Transcription
- Monitor progress in the log panel
The log shows:
- Current video being processed
- Progress percentage for translations
- Speed and processing time
- Batch summary when complete
Translation
Loki Studio can translate transcripts to 50+ languages:
- Set your Source Language (what's spoken in the video)
- Set Translate To (target language)
- Run transcription as normal
- Both original and translated versions are saved
Supported languages include: English, Spanish, French, German, Italian, Portuguese, Russian, Japanese, Korean, Chinese, Arabic, Hindi, and many more.
Managing Transcriptions
View Transcription
Click the View button next to any transcribed video to preview the text.
Delete Transcription
Click Del to remove a transcription file, or use Delete Selected for batch removal.
Status Indicators
- ✓ Transcribed - Complete
- Pending - Needs transcription
Troubleshooting
CUDA out of memory error
Your GPU doesn't have enough VRAM for the selected model. Try:
- Use a smaller model (small or medium instead of large-v3)
- Switch compute type to int8_float16 or int8
- Close other GPU-intensive applications
- Use CPU mode as a fallback
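The compute type and CPU steps above amount to a fallback ladder, which can be sketched as a retry loop. Everything here is illustrative: transcribe_with_fallback and the run callback are hypothetical, and MemoryError stands in for whatever out-of-memory error the real backend raises.

```python
from typing import Callable, List, Tuple

# Attempts ordered from the default to the most conservative setting,
# following the troubleshooting advice.
FALLBACKS: List[Tuple[str, str]] = [
    ("cuda", "float16"),
    ("cuda", "int8_float16"),
    ("cuda", "int8"),
    ("cpu", "float32"),
]


def transcribe_with_fallback(
    run: Callable[[str, str], str],
) -> Tuple[str, str, str]:
    """Call run(device, compute_type) until an attempt fits in memory."""
    last_error = None
    for device, compute_type in FALLBACKS:
        try:
            return device, compute_type, run(device, compute_type)
        except MemoryError as error:  # stand-in for a CUDA out-of-memory error
            last_error = error
    raise RuntimeError("every fallback ran out of memory") from last_error


def fake_run(device: str, compute_type: str) -> str:
    """Pretend backend (assumption): only int8 or the CPU fits in memory."""
    if device == "cuda" and compute_type != "int8":
        raise MemoryError("CUDA out of memory")
    return "transcript text"


print(transcribe_with_fallback(fake_run))
```

Switching to a smaller model would be a further rung on the same ladder; it is omitted here to keep the sketch short.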
Transcription is very slow
Check these settings:
- Ensure Device is set to "auto" or "cuda" (not "cpu")
- Update your NVIDIA drivers
- Reduce Beam Size (for example, 5 instead of the default 10)
- Use large-v3-turbo instead of large-v3
Poor transcription quality
Improve accuracy with:
- Use a larger model (large-v3-turbo or large-v3)
- Increase Beam Size back to 10 (the default) if you lowered it
- Enable VAD (Silero mode)
- Set the correct source language
- Use Initial Prompt to provide context (e.g., game names, character names)
Wrong language detected
Whisper auto-detects language, but you can override it:
- Explicitly set Source Language instead of using "auto"
- Use the Initial Prompt field to hint at the language