Transcription Guide

GPU-accelerated speech-to-text with Whisper AI.

Overview

The Transcribe tab (Muninn) converts your video's spoken content into text using OpenAI's Whisper AI. The transcription powers all downstream features: metadata generation, captions, AutoCut, and more.

Key Features

  • GPU-accelerated processing (CUDA support)
  • Multiple Whisper model sizes
  • Up to 4 audio tracks per video
  • Built-in translation to 50+ languages
  • Automatic filler word removal
  • Word-level timestamps for captions
  • Voice Activity Detection (VAD)

Getting Started

Step 1: Download a Whisper Model

  1. Go to Application Settings > Transcription
  2. Click Download Models
  3. Select a model size (see model comparison below)
  4. Wait for the download to complete

Step 2: Transcribe Your First Video

  1. Go to the Transcribe tab (Muninn)
  2. Select one or more videos from the list
  3. Configure your settings (language, device, etc.)
  4. Click Start Transcription

Whisper Model Comparison

Choose the right model for your needs:

Model            Size      Speed      Quality    VRAM
tiny             ~75 MB    Very Fast  Basic      ~1 GB
base             ~150 MB   Fast       Good       ~1 GB
small            ~500 MB   Medium     Better     ~2 GB
medium           ~1.5 GB   Slower     Great      ~5 GB
large-v3         ~3 GB     Slow       Excellent  ~10 GB
large-v3-turbo   ~3 GB     Fast       Excellent  ~6 GB

Recommendation: Use large-v3-turbo for the best balance of speed and quality. It's nearly as accurate as large-v3 but significantly faster.
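
If you are unsure which model your GPU can hold, the comparison above can be read as a simple lookup. The sketch below is illustrative only and is not part of Loki Studio: the function name pick_model, the prefer_accuracy flag, and the preference order are assumptions, and the VRAM figures are the approximate values from the table.

```python
# Approximate VRAM needs in GB, taken from the model comparison table.
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large-v3-turbo": 6,
    "large-v3": 10,
}


def pick_model(free_vram_gb, prefer_accuracy=False):
    """Return the first model in preference order that fits, or None.

    The order is a hypothetical policy: large-v3-turbo first (the guide's
    recommendation), with large-v3 tried first only when accuracy matters
    more than speed.
    """
    order = ["large-v3-turbo", "medium", "small", "base", "tiny"]
    if prefer_accuracy:
        order.insert(0, "large-v3")
    for name in order:
        if MODEL_VRAM_GB[name] <= free_vram_gb:
            return name
    return None  # nothing fits; CPU mode is the remaining option
```

For example, pick_model(8) returns "large-v3-turbo", while pick_model(2) returns "small". Treat the result as a starting point, since actual VRAM use also depends on compute type and batch size.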

Transcription Settings

Basic Settings

  • Source Language: The language spoken in your video (English, Japanese, etc.)
  • Device: Where to run transcription
    • auto - Automatically detect (recommended)
    • cuda - Force GPU (NVIDIA only)
    • cpu - Force CPU (slower but always works)
  • Translate To: Optional - translate to another language after transcription
  • Remove Filler Words: Strip "um", "ah", "uh" from transcripts
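
To give a feel for what Remove Filler Words does, here is a minimal sketch of whole-word filler stripping. It is not Loki Studio's actual implementation: the function name remove_fillers is hypothetical, and the word list contains only the three examples named above, while the real list may be longer.

```python
import re

# Filler words named in the setting description (assumed, not exhaustive).
FILLERS = ("um", "ah", "uh")

# Match a filler as a whole word, plus an optional trailing comma and spaces,
# so that words like "umbrella" or "yeah" are left alone.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(FILLERS) + r")\b,?\s*", re.IGNORECASE
)


def remove_fillers(text):
    """Strip filler words and tidy up the whitespace they leave behind."""
    cleaned = _FILLER_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

With this sketch, "Um, so I think, uh, we should go" becomes "so I think, we should go". It does not restore capitalization at the start of a sentence.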

Audio Track Configuration

Configure which of your video's audio tracks to transcribe:

  • Enable/Disable: Toggle each track on or off
  • Track Name: Assign a label (System Audio, Mic 1, Mic 2, Mic 3)

Multi-Track Tip: If your video has separate voice and game audio tracks, transcribe only the voice track for cleaner results.

Advanced Settings

Fine-tune transcription in Application Settings > Transcription:

Compute Settings

  • Compute Type: Precision mode
    • float16 - Best for most GPUs (default)
    • int8_float16 - Lower VRAM usage
    • int8 - Lowest VRAM
    • float32 - CPU or compatibility mode
  • Beam Size: Higher = more accurate but slower (1-10, default: 10)
  • Patience: How long to consider alternatives (1.0-3.0, default: 2.0)
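
The allowed values and defaults above can be summarized as a small settings object. This is only a sketch for readers who script or document their configuration; the class name ComputeSettings and its field names are hypothetical and do not reflect how Loki Studio stores its settings.

```python
from dataclasses import dataclass

# Compute types listed in this guide.
COMPUTE_TYPES = ("float16", "int8_float16", "int8", "float32")


@dataclass
class ComputeSettings:
    """Compute settings with the defaults and ranges documented above."""

    compute_type: str = "float16"
    beam_size: int = 10
    patience: float = 2.0

    def __post_init__(self):
        # Reject values outside the documented options and ranges.
        if self.compute_type not in COMPUTE_TYPES:
            raise ValueError(f"unknown compute type: {self.compute_type!r}")
        if not 1 <= self.beam_size <= 10:
            raise ValueError("beam_size must be between 1 and 10")
        if not 1.0 <= self.patience <= 3.0:
            raise ValueError("patience must be between 1.0 and 3.0")
```

Calling ComputeSettings() yields the documented defaults, and an out-of-range value such as beam_size=0 raises a ValueError.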

Voice Activity Detection (VAD)

VAD helps filter out non-speech audio:

  • Auto: Automatically choose best method
  • Off: Process all audio
  • Energy: Simple volume-based detection
  • Silero: Neural network-based (most accurate)
  • TEN: Alternative neural method

VAD Threshold: Sensitivity (0.0-1.0, default: 0.35). Lower = more sensitive.
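
To illustrate the idea behind the Energy method and the threshold, the sketch below marks audio frames as speech when their loudness is high enough. It is a simplified illustration, not the app's detector: the function names, the 160-sample frame length, and the choice to measure each frame relative to the loudest frame are all assumptions.

```python
import math


def frame_rms(samples):
    """Root-mean-square level of one frame of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def energy_vad(samples, frame_len=160, threshold=0.35):
    """Return one bool per full frame: True where the frame looks like speech.

    A frame counts as speech when its RMS level, divided by the RMS of the
    loudest frame, reaches the threshold. A trailing partial frame is ignored.
    """
    frames = [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    levels = [frame_rms(f) for f in frames]
    peak = max(levels, default=0.0)
    if peak == 0.0:
        return [False] * len(levels)  # pure silence
    return [level / peak >= threshold for level in levels]
```

Lowering the threshold lets quieter frames pass, which is why a lower value makes detection more sensitive.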

Output Options

  • Generate Subtitles (.srt): Create subtitle files automatically
  • Word-level Timestamps: Enable precise timing for each word (required for karaoke captions)
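
For reference, an .srt file is a numbered list of cues, each with a start and end time in HH:MM:SS,mmm form followed by the cue text. The sketch below shows how timed segments map onto that layout; the function names and the (start, end, text) tuple shape are assumptions for illustration, not the app's internal format.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```

The same structure works for word-level timestamps: each word simply becomes its own short segment.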

Performance Tuning

  • Batch Size: Segments processed at once (default: 24)
  • Intra Threads: Threads per operation (0 = auto)
  • Inter Threads: Parallel operations (default: 1)

Batch Transcription

Process multiple videos efficiently:

  1. Click Select Ready to auto-select all pending videos
  2. Or manually select videos with checkboxes
  3. Click Start Transcription
  4. Monitor progress in the log panel

The log shows:

  • Current video being processed
  • Progress percentage for translations
  • Speed and processing time
  • Batch summary when complete

Translation

Loki Studio can translate transcripts to 50+ languages:

  1. Set your Source Language (what's spoken in the video)
  2. Set Translate To (target language)
  3. Run transcription as normal
  4. Both original and translated versions are saved

Supported languages include: English, Spanish, French, German, Italian, Portuguese, Russian, Japanese, Korean, Chinese, Arabic, Hindi, and many more.

Managing Transcriptions

View Transcription

Click the View button next to any transcribed video to preview the text.

Delete Transcription

Click Del to remove a transcription file, or use Delete Selected for batch removal.

Status Indicators

  • ✓ Transcribed - Complete
  • Pending - Needs transcription

Troubleshooting

CUDA out of memory error

Your GPU doesn't have enough VRAM for the selected model. Try:

  • Use a smaller model (small or medium instead of large)
  • Switch compute type to int8_float16 or int8
  • Close other GPU-intensive applications
  • Use CPU mode as a fallback

Transcription is very slow

Check these settings:

  • Ensure Device is set to "auto" or "cuda" (not CPU)
  • Update your NVIDIA drivers
  • Reduce beam size (5 instead of 10)
  • Use large-v3-turbo instead of large-v3

Poor transcription quality

Improve accuracy with:

  • Use a larger model (large-v3-turbo or large-v3)
  • Increase beam size to 10
  • Enable VAD (Silero mode)
  • Set the correct source language
  • Use Initial Prompt to provide context (e.g., game names, character names)

Wrong language detected

Whisper auto-detects language, but you can override it:

  • Explicitly set Source Language instead of using "auto"
  • Use the Initial Prompt field to hint at the language

Need More Help?

Can't find what you're looking for? Join our Discord community for help. Every question gets a personal response.
