Transcription Guide

Overview

The Transcribe tab (Muninn) converts your video's spoken content into text using OpenAI's Whisper speech recognition model. The resulting transcript powers all downstream features: metadata generation, captions, AutoCut, and more.

Key Features

GPU-accelerated processing (CUDA support)
Multiple Whisper model sizes
Up to 4 audio tracks per video
Built-in translation to 50+ languages
Automatic filler word removal
Word-level timestamps for captions
Voice Activity Detection (VAD)

Getting Started

Step 1: Download a Whisper Model

Go to Application Settings > Transcription
Click Download Models
Select a model size (see model comparison below)
Wait for the download to complete

Step 2: Transcribe Your First Video

Go to the Transcribe tab (Muninn)
Select one or more videos from the list
Configure your settings (language, device, etc.)
Click Start Transcription

Whisper Model Comparison

Choose the right model for your needs:

Model	Size	Speed	Quality	VRAM
tiny	~75 MB	Very Fast	Basic	~1 GB
base	~150 MB	Fast	Good	~1 GB
small	~500 MB	Medium	Better	~2 GB
medium	~1.5 GB	Slower	Great	~5 GB
large-v3	~3 GB	Slow	Excellent	~10 GB
large-v3-turbo	~1.6 GB	Fast	Excellent	~6 GB

Recommendation: Use large-v3-turbo for the best balance of speed and quality. It's nearly as accurate as large-v3 but significantly faster.
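If you are unsure which model your GPU can hold, the table above can be turned into a simple lookup. The following is a minimal sketch, not part of Loki Studio: the function name pick_model and the prefer_speed flag are hypothetical, and the VRAM figures are the approximate values from the table.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the comparison table in this guide.
# Ordered from highest to lowest quality as the guide ranks them.
MODEL_VRAM_GB = [
    ("large-v3", 10.0),
    ("large-v3-turbo", 6.0),
    ("medium", 5.0),
    ("small", 2.0),
    ("base", 1.0),
    ("tiny", 1.0),
]


def pick_model(free_vram_gb: float, prefer_speed: bool = True) -> Optional[str]:
    """Return the best-quality model that fits in free_vram_gb, or None."""
    candidates = [name for name, need in MODEL_VRAM_GB if need <= free_vram_gb]
    if not candidates:
        return None
    # The guide recommends large-v3-turbo over large-v3 when speed matters.
    if prefer_speed and "large-v3-turbo" in candidates:
        return "large-v3-turbo"
    return candidates[0]
```

For example, pick_model(8.0) selects large-v3-turbo, while pick_model(5.5) falls back to medium. If nothing fits, the function returns None, which corresponds to switching Device to cpu.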

Transcription Settings

Basic Settings

Source Language: The language spoken in your video (English, Japanese, etc.)
Device: Where to run transcription
- auto - Automatically detect (recommended)
- cuda - Force GPU (NVIDIA only)
- cpu - Force CPU (slower but always works)
Translate To: Optional - translate to another language after transcription
Remove Filler Words: Strip "um", "ah", "uh" from transcripts
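To illustrate what Remove Filler Words does to a transcript, here is a minimal sketch using a regular expression. It is an assumption about the general technique, not Loki Studio's actual implementation; the filler list contains only the three words named above, and the application's real list may be longer.

```python
import re

# Fillers named in this guide; the application's real list may differ.
FILLERS = ("um", "ah", "uh")

# Match a filler as a whole word, plus an optional trailing comma and spaces.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(FILLERS) + r")\b,?\s*",
    flags=re.IGNORECASE,
)


def remove_fillers(text: str) -> str:
    """Strip filler words, then tidy up leftover whitespace."""
    cleaned = _FILLER_RE.sub("", text)
    # Collapse any runs of whitespace left behind and trim the ends.
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

The word-boundary anchors keep words that merely contain a filler, such as "umbrella", intact.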

Audio Track Configuration

Configure which of your video's audio tracks to transcribe:

Enable/Disable: Toggle each track on or off
Track Name: Assign a label (System Audio, Mic 1, Mic 2, Mic 3)

Multi-Track Tip: If your video has separate voice and game audio tracks, transcribe only the voice track for cleaner results.
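At the file level, transcribing a single track amounts to selecting one audio stream from the video. The sketch below builds an ffmpeg command that extracts one track to a mono 16 kHz WAV file, the sample rate Whisper works with. This guide does not say how Loki Studio separates tracks internally, so treat this purely as an illustration; build_extract_command is a hypothetical helper, and the track limit of 4 comes from the feature list above.

```python
from typing import List


def build_extract_command(video_path: str, track_index: int, wav_path: str) -> List[str]:
    """Build an ffmpeg command that writes one audio track to a mono 16 kHz WAV."""
    if not 0 <= track_index <= 3:
        raise ValueError("track_index must be 0-3 (up to 4 tracks per video)")
    return [
        "ffmpeg",
        "-y",                          # overwrite the output file if it exists
        "-i", video_path,
        "-map", f"0:a:{track_index}",  # select the Nth audio stream of input 0
        "-vn",                         # drop the video stream
        "-ac", "1",                    # downmix to mono
        "-ar", "16000",                # resample to 16 kHz
        wav_path,
    ]
```

You could run the result with subprocess.run(cmd, check=True), provided ffmpeg is installed and on your PATH.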

Advanced Settings

Fine-tune transcription in Application Settings > Transcription:

Compute Settings

Compute Type: Numeric precision used for inference
- float16 - Best for most GPUs (default)
- int8_float16 - Lower VRAM usage
- int8 - Lowest VRAM
- float32 - CPU or compatibility mode
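Beam Size: Number of candidate transcriptions explored during decoding. Higher = more accurate but slower (1-10, default: 10)
Patience: Beam search patience factor. Higher values keep exploring alternative candidates longer before settling on a result (1.0-3.0, default: 2.0)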
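The ranges and defaults above can be summarized as a small validated settings object. This is a sketch only: the ComputeSettings class and default_for helper are hypothetical names used for illustration, and the values are taken from the list above rather than from Loki Studio's code.

```python
from dataclasses import dataclass

# Compute types listed in this guide.
COMPUTE_TYPES = ("float16", "int8_float16", "int8", "float32")


@dataclass(frozen=True)
class ComputeSettings:
    compute_type: str = "float16"  # guide default, best for most GPUs
    beam_size: int = 10            # guide default, allowed range 1-10
    patience: float = 2.0          # guide default, allowed range 1.0-3.0

    def __post_init__(self) -> None:
        if self.compute_type not in COMPUTE_TYPES:
            raise ValueError(f"unknown compute type: {self.compute_type!r}")
        if not 1 <= self.beam_size <= 10:
            raise ValueError("beam_size must be between 1 and 10")
        if not 1.0 <= self.patience <= 3.0:
            raise ValueError("patience must be between 1.0 and 3.0")


def default_for(device: str) -> ComputeSettings:
    """Pick float32 for CPU and the float16 default otherwise."""
    return ComputeSettings(compute_type="float32" if device == "cpu" else "float16")
```

Validating on construction means an out-of-range value fails immediately instead of partway through a long transcription run.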

Voice Activity Detection (VAD)

VAD filters out non-speech audio before transcription. Choose a mode:

Auto: Automatically choose best method
Off: Process all audio
Energy: Simple volume-based detection
Silero: Neural network-based (most accurate)
TEN: Alternative neural method

VAD Threshold: Sensitivity (0.0-1.0, default: 0.35). Lower = more sensitive, so more audio is treated as speech.
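To show how a volume-based detector and its threshold interact, here is a minimal sketch of energy-based VAD. It assumes audio samples normalized to the range -1.0 to 1.0 and compares each frame's RMS level directly to the threshold; how Loki Studio's Energy mode actually maps the 0.0-1.0 setting is not documented here, so the function name, frame length, and scaling are all assumptions.

```python
import math
from typing import List, Sequence


def energy_vad(samples: Sequence[float], frame_len: int = 160,
               threshold: float = 0.35) -> List[bool]:
    """Flag each full frame as speech (True) when its RMS reaches the threshold."""
    flags = []
    # Step through the signal one non-overlapping frame at a time,
    # ignoring any incomplete frame at the end.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        flags.append(rms >= threshold)
    return flags
```

Lowering the threshold marks quieter frames as speech, which matches the "lower = more sensitive" behavior described above. Neural methods such as Silero replace the RMS measure with a learned speech probability.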

Output Options

Generate Subtitles (.srt): Create subtitle files automatically
Word-level Timestamps: Enable precise timing for each word (required for karaoke captions)
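For reference, an .srt file is a sequence of numbered cues, each with a start and end time in HH:MM:SS,mmm form followed by the text. The sketch below renders timed segments in that layout; the function names are hypothetical, and the (start, end, text) tuple shape is an assumption about what a transcription produces, not Loki Studio's internal format.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp form used by .srt files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

With word-level timestamps enabled, the same layout can carry one cue per word, which is the kind of timing karaoke-style captions depend on.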

Performance Tuning

Batch Size: Segments processed at once (default: 24)
Intra Threads: Threads per operation (0 = auto)
Inter Threads: Parallel operations (default: 1)

Batch Transcription

Process multiple videos efficiently:

Click Select Ready to auto-select all pending videos
Or manually select videos with checkboxes
Click Start Transcription
Monitor progress in the log panel

The log shows:

Current video being processed
Progress percentage for translations
Speed and processing time
Batch summary when complete

Translation

Loki Studio can translate transcripts to 50+ languages:

Set your Source Language (what's spoken in the video)
Set Translate To (target language)
Run transcription as normal
Both original and translated versions are saved

Supported languages include: English, Spanish, French, German, Italian, Portuguese, Russian, Japanese, Korean, Chinese, Arabic, Hindi, and many more.

Managing Transcriptions

View Transcription

Click the View button next to any transcribed video to preview the text.

Delete Transcription

Click Del to remove a transcription file, or use Delete Selected for batch removal.

Status Indicators

✓ Transcribed - Complete
Pending - Needs transcription

Troubleshooting

CUDA out of memory error

Your GPU doesn't have enough VRAM for the selected model. Try:

Use a smaller model (for example, medium or small instead of large-v3)
Switch compute type to int8_float16 or int8
Close other GPU-intensive applications
Use CPU mode as a fallback
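The compute type and CPU steps above form a natural fallback order. The following sketch tries each combination in turn until one succeeds. It is illustrative only: transcribe_with_fallback and the run callback are hypothetical, and the sketch assumes out-of-memory failures surface as RuntimeError or MemoryError, which is common for GPU libraries but not guaranteed.

```python
from typing import Callable, List, Tuple

# Attempts ordered from the guide's GPU default down to the CPU fallback.
FALLBACK_LADDER: List[Tuple[str, str]] = [
    ("cuda", "float16"),
    ("cuda", "int8_float16"),
    ("cuda", "int8"),
    ("cpu", "float32"),
]


def transcribe_with_fallback(run: Callable[[str, str], str]) -> Tuple[str, str, str]:
    """Call run(device, compute_type) down the ladder until one attempt succeeds."""
    last_error = None
    for device, compute_type in FALLBACK_LADDER:
        try:
            return device, compute_type, run(device, compute_type)
        except (MemoryError, RuntimeError) as exc:
            # Assumed failure types; remember the error and try the next rung.
            last_error = exc
    raise RuntimeError("all fallback attempts failed") from last_error
```

The returned tuple reports which device and compute type finally worked, so you can save that combination in your settings and skip the failed attempts next time.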

Transcription is very slow

Check these settings:

Ensure Device is set to "auto" or "cuda" (not "cpu")
Update your NVIDIA drivers
Reduce Beam Size (for example, 5 instead of the default 10)
Use large-v3-turbo instead of large-v3

Poor transcription quality

Improve accuracy with:

Use a larger model (large-v3-turbo or large-v3)
Set Beam Size back to 10 (the default) if you lowered it
Enable VAD (Silero mode)
Set the correct source language
Use Initial Prompt to provide context (e.g., game names, character names)

Wrong language detected

Whisper auto-detects language, but you can override it:

Explicitly set Source Language instead of using "auto"
Use the Initial Prompt field to hint at the language