cleanup

2025-12-16 23:14:18 +01:00 · 2025-12-01 15:06:29 +01:00
parent 336ee5b5bd
commit 0408a1fc07
5 changed files with 23 additions and 238 deletions
--- a/PERFORMANCE_OPTIMIZATIONS.md
+++ b/PERFORMANCE_OPTIMIZATIONS.md
@@ -1,215 +0,0 @@
 # Performance Optimizations
 This document outlines the performance optimizations implemented in the Transcription API.
 ## 1. Model Management
 ### Shared Model Instance
 - **Location**: `transcription_server.py:73-137`
 - **Optimization**: Single Whisper model instance shared across all connections (gRPC, WebSocket, REST)
 - **Benefit**: Eliminates redundant model loading, reduces memory usage by ~50-80%
 ### Model Evaluation Mode
 - **Location**: `transcription_server.py:119-122`
 - **Optimization**: Set model to eval mode and disable gradient computation
 - **Benefit**: Reduces memory usage and improves inference speed by ~15-20%
 ## 2. GPU Optimizations
 ### TF32 Precision (Ampere GPUs)
 - **Location**: `transcription_server.py:105-111`
 - **Optimization**: Enable TF32 for matrix multiplications on compatible GPUs
 - **Benefit**: Up to 3x faster inference on A100/RTX 3000+ series GPUs with minimal accuracy loss
 ### cuDNN Benchmarking
 - **Location**: `transcription_server.py:110`
 - **Optimization**: Enable cuDNN autotuning for optimal convolution algorithms
 - **Benefit**: 10-30% speedup after initial warmup
 ### FP16 Inference
 - **Location**: `transcription_server.py:253`
 - **Optimization**: Use FP16 precision on CUDA devices
 - **Benefit**: 2x faster inference, 50% less GPU memory usage
 ## 3. Inference Optimizations
 ### No Gradient Context
 - **Location**: `transcription_server.py:249-260, 340-346`
 - **Optimization**: Wrap all inference calls in `torch.no_grad()` context
 - **Benefit**: 10-15% speed improvement, reduces memory usage
 ### Optimized Audio Processing
 - **Location**: `transcription_server.py:208-219`
 - **Optimization**: Direct numpy operations, inline energy calculations
 - **Benefit**: Faster VAD processing, reduced memory allocations
 ## 4. Network Optimizations
 ### gRPC Threading
 - **Location**: `transcription_server.py:512-527`
 - **Optimization**: Dynamic thread pool sizing based on CPU cores
 - **Configuration**: `max_workers = min(cpu_count * 2, 20)`
 - **Benefit**: Better handling of concurrent connections
 ### gRPC Keepalive
 - **Location**: `transcription_server.py:522-526`
 - **Optimization**: Configured keepalive and ping settings
 - **Benefit**: More stable long-running connections, faster failure detection
 ### Message Size Limits
 - **Location**: `transcription_server.py:519-520`
 - **Optimization**: 100MB message size limits for large audio files
 - **Benefit**: Support for longer audio files without chunking
 ## 5. Voice Activity Detection (VAD)
 ### Smart Filtering
 - **Location**: `transcription_server.py:162-203`
 - **Optimization**: Fast energy-based VAD to skip silent audio
 - **Configuration**:
  - Energy threshold: 0.005
  - Zero-crossing threshold: 50
 - **Benefit**: 40-60% reduction in transcription calls for audio with silence
 ### Early Return
 - **Location**: `transcription_server.py:215-217`
 - **Optimization**: Skip transcription for non-speech audio
 - **Benefit**: Reduces unnecessary inference calls, improves overall throughput
 ## 6. Anti-hallucination Filters
 ### Aggressive Filtering
 - **Location**: `transcription_server.py:262-310`
 - **Optimization**: Comprehensive hallucination detection and filtering
 - **Filters**:
  - Common hallucination phrases
  - Repetitive text
  - Low alphanumeric ratio
  - Cross-language detection
 - **Benefit**: Better transcription quality, fewer false positives
 ### Conservative Parameters
 - **Location**: `transcription_server.py:254-259`
 - **Optimization**: Tuned Whisper parameters to reduce hallucinations
 - **Settings**:
  - `temperature=0.0` (deterministic)
  - `no_speech_threshold=0.8` (high)
  - `logprob_threshold=-0.5` (strict)
  - `condition_on_previous_text=False`
 - **Benefit**: More accurate transcriptions, fewer hallucinations
 ## 7. Logging Optimizations
 ### Debug-level for VAD
 - **Location**: `transcription_server.py:216-219`
 - **Optimization**: Use DEBUG level for VAD messages instead of INFO
 - **Benefit**: Reduced log volume, better performance in high-throughput scenarios
 ## 8. REST API Optimizations
 ### Async Operations
 - **Location**: `rest_api.py`
 - **Optimization**: Fully async FastAPI with uvicorn
 - **Benefit**: Non-blocking I/O, better concurrency
 ### Streaming Responses
 - **Location**: `rest_api.py:223-278`
 - **Optimization**: Server-Sent Events for streaming transcription
 - **Benefit**: Real-time results without buffering entire response
 ### Connection Pooling
 - **Built-in**: FastAPI/Uvicorn connection pooling
 - **Benefit**: Efficient handling of concurrent HTTP connections
 ## Performance Benchmarks
 ### Typical Performance (RTX 3090, large-v3 model)
 | Metric | Value |
 |--------|-------|
 | Cold start | 5-8 seconds |
 | Transcription speed (with VAD) | 0.1-0.3x real-time |
 | Memory usage | 3-4 GB VRAM |
 | Concurrent sessions | 5-10 (GPU memory dependent) |
 | API latency | 50-200ms (excluding inference) |
 ### Without Optimizations
 | Metric | Previous | Optimized | Improvement |
 |--------|----------|-----------|-------------|
 | Inference speed | 0.2x | 0.1x | 2x faster |
 | Memory per session | 4 GB | 0.5 GB | 8x reduction |
 | Startup time | 8s | 6s | 25% faster |
 ## Recommendations
 ### For Maximum Performance
 1. **Use GPU**: CUDA is 10-50x faster than CPU
 2. **Use smaller models**: `base` or `small` for real-time applications
 3. **Enable VAD**: Reduces unnecessary transcriptions
 4. **Batch audio**: Send 3-5 second chunks for optimal throughput
 5. **Use gRPC**: Lower overhead than REST for high-frequency calls
 ### For Best Quality
 1. **Use larger models**: `large-v3` for best accuracy
 2. **Disable VAD**: If you need to transcribe everything
 3. **Specify language**: Avoid auto-detection if you know the language
 4. **Longer audio chunks**: 5-10 seconds for better context
 ### For High Throughput
 1. **Multiple replicas**: Scale horizontally with load balancer
 2. **GPU per replica**: Each replica needs dedicated GPU memory
 3. **Use gRPC streaming**: Most efficient for continuous transcription
 4. **Monitor GPU utilization**: Keep it above 80% for best efficiency
 ## Future Optimizations
 Potential improvements not yet implemented:
 1. **Batch Inference**: Process multiple audio chunks in parallel
 2. **Model Quantization**: INT8 quantization for faster inference
 3. **Faster Whisper**: Use faster-whisper library (2-3x speedup)
 4. **KV Cache**: Reuse key-value cache for streaming
 5. **TensorRT**: Use TensorRT for optimized inference on NVIDIA GPUs
 6. **Distillation**: Use distilled Whisper models (whisper-small-distilled)
 ## Monitoring
 Use these endpoints to monitor performance:
 ```bash
 # Health and metrics
 curl http://localhost:8000/health
 # Active sessions
 curl http://localhost:8000/sessions
 # GPU utilization (if nvidia-smi available)
 nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
 ```
 ## Tuning Parameters
 Key environment variables for performance tuning:
 ```env
 # Model selection (smaller = faster)
 MODEL_PATH=base  # tiny, base, small, medium, large-v3
 # Thread count (CPU inference)
 OMP_NUM_THREADS=4
 # GPU selection
 CUDA_VISIBLE_DEVICES=0
 # Enable optimizations
 ENABLE_REST=true
 ENABLE_WEBSOCKET=true
 ```
 ## Contact
 For performance issues or optimization suggestions, please open an issue on GitHub.
--- a/README.md
+++ b/README.md
@@ -4,14 +4,14 @@ A high-performance, standalone transcription service with **REST API**, **gRPC**
 ## Features
- 🚀 **Multiple API Interfaces**: REST API, gRPC, and WebSocket
+- **Multiple API Interfaces**: REST API, gRPC, and WebSocket
- 🎯 **High Performance**: Optimized with TF32, cuDNN, and efficient batching
+- **High Performance**: Optimized with TF32, cuDNN, and efficient batching
- 🧠 **Whisper Models**: Support for all Whisper models (tiny to large-v3)
+- **Whisper Models**: Support for all Whisper models (tiny to large-v3)
- 🎤 **Real-time Streaming**: Bidirectional streaming for live transcription
+- **Real-time Streaming**: Bidirectional streaming for live transcription
- 🔇 **Voice Activity Detection**: Smart VAD to filter silence and noise
+- **Voice Activity Detection**: Smart VAD to filter silence and noise
- 🚫 **Anti-hallucination**: Advanced filtering to reduce Whisper hallucinations
+- **Anti-hallucination**: Advanced filtering to reduce Whisper hallucinations
- 🐳 **Docker Ready**: Easy deployment with GPU support
+- **Docker Ready**: Easy deployment with GPU support
- 📊 **Interactive Docs**: Auto-generated API documentation (Swagger/OpenAPI)
+- **Interactive Docs**: Auto-generated API documentation (Swagger/OpenAPI)
 ## Quick Start
--- a/examples/rust-client/src/realtime_playback.rs
+++ b/examples/rust-client/src/realtime_playback.rs
@@ -207,7 +207,7 @@ async fn main() -> Result<()> {
    }
    println!("\n{}", "─".repeat(80));
-    println!("✅ Playback and transcription complete!");
+    println!("Playback and transcription complete!");
    // Keep the program alive until playback finishes
    time::sleep(Duration::from_secs(2)).await;
--- a/examples/rust-client/src/system_audio_transcribe.rs
+++ b/examples/rust-client/src/system_audio_transcribe.rs
@@ -118,7 +118,7 @@ async fn main() -> Result<()> {
 fn list_audio_devices() -> Result<()> {
    let host = cpal::default_host();
-    println!("\n📊 Available Audio Devices:");
+    println!("\n Available Audio Devices:");
    println!("{}", "─".repeat(80));
    // List input devices
@@ -138,10 +138,10 @@ fn list_audio_devices() -> Result<()> {
    // Show default device
    if let Some(device) = host.default_input_device() {
-        println!("\n⭐ Default Input: {}", device.name()?);
+        println!("\n Default Input: {}", device.name()?);
    }
-    println!("\n💡 Tips for capturing system audio:");
+    println!("\n Tips for capturing system audio:");
    println!("  Linux: Look for devices with 'monitor' in the name (PulseAudio/PipeWire)");
    println!("  Windows: Install VB-Cable or enable 'Stereo Mix' in sound settings");
    println!("  macOS: Install BlackHole or Loopback for system audio capture");
--- a/examples/rust-client/transcribe_video_call.sh
+++ b/examples/rust-client/transcribe_video_call.sh
@@ -18,7 +18,7 @@ NC='\033[0m' # No Color
 # Check dependencies
 check_dependency() {
    if ! command -v $1 &> /dev/null; then
-        echo -e "${RED}❌ $1 not found.${NC}"
+        echo -e "${RED} $1 not found.${NC}"
        echo "Please install: sudo apt-get install $2"
        return 1
    fi
@@ -27,7 +27,7 @@ check_dependency() {
 echo "Checking dependencies..."
 check_dependency "parec" "pulseaudio-utils" || exit 1
-check_dependency "sox" "sox" || echo -e "${YELLOW}⚠️  sox not installed (optional but recommended)${NC}"
+check_dependency "sox" "sox" || echo -e "${YELLOW}  sox not installed (optional but recommended)${NC}"
 # Function to find the monitor source for system audio
 find_monitor_source() {
@@ -52,11 +52,11 @@ find_monitor_source() {
 # List available sources
 if [ "$1" == "--list" ]; then
-    echo -e "${GREEN}📊 Available Audio Sources:${NC}"
+    echo -e "${GREEN} Available Audio Sources:${NC}"
    echo ""
    pactl list sources short 2>/dev/null || pacmd list-sources 2>/dev/null | grep "name:"
    echo ""
-    echo -e "${GREEN}💡 Monitor sources (system audio):${NC}"
+    echo -e "${GREEN} Monitor sources (system audio):${NC}"
    pactl list sources short 2>/dev/null | grep -i "monitor" || echo "No monitor sources found"
    exit 0
 fi
@@ -82,24 +82,24 @@ fi
 # Determine what to capture
 if [ "$1" == "--microphone" ]; then
-    echo -e "${GREEN}🎤 Using microphone input${NC}"
+    echo -e "${GREEN} Using microphone input${NC}"
    # Run the existing live-transcribe for microphone
    exec cargo run --bin live-transcribe
    exit 0
 elif [ "$1" == "--combined" ]; then
-    echo -e "${YELLOW}🎤+🔊 Combined audio capture not yet implemented${NC}"
+    echo -e "${YELLOW}+ Combined audio capture not yet implemented${NC}"
    echo "For now, please run two separate instances:"
    echo "  1. $0 (for system audio)"
    echo "  2. $0 --microphone (for mic)"
    exit 1
 elif [ "$1" == "--source" ] && [ -n "$2" ]; then
    SOURCE="$2"
-    echo -e "${GREEN}📡 Using specified source: $SOURCE${NC}"
+    echo -e "${GREEN} Using specified source: $SOURCE${NC}"
 else
    # Auto-detect monitor source
    SOURCE=$(find_monitor_source)
    if [ -z "$SOURCE" ]; then
-        echo -e "${RED}❌ Could not find system audio monitor source${NC}"
+        echo -e "${RED} Could not find system audio monitor source${NC}"
        echo ""
        echo "This might happen if:"
        echo "  1. No audio is currently playing"
@@ -111,14 +111,14 @@ else
        echo "  3. Use a specific source: $0 --source <source_name>"
        exit 1
    fi
-    echo -e "${GREEN}📡 Found system audio source: $SOURCE${NC}"
+    echo -e "${GREEN} Found system audio source: $SOURCE${NC}"
 fi
 echo ""
-echo -e "${GREEN}🎬 Starting video call transcription...${NC}"
+echo -e "${GREEN} Starting video call transcription...${NC}"
 echo -e "${YELLOW}Press Ctrl+C to stop${NC}"
 echo ""
-echo "💡 Tips for best results:"
+echo " Tips for best results:"
 echo "  • Join your video call first"
 echo "  • Use headphones to avoid echo"
 echo "  • Close other audio sources (music, videos)"