2025-12-01 15:06:29 +01:00
parent 336ee5b5bd
commit 0408a1fc07
5 changed files with 23 additions and 238 deletions

View File

@@ -1,215 +0,0 @@
# Performance Optimizations
This document outlines the performance optimizations implemented in the Transcription API.
## 1. Model Management
### Shared Model Instance
- **Location**: `transcription_server.py:73-137`
- **Optimization**: Single Whisper model instance shared across all connections (gRPC, WebSocket, REST)
- **Benefit**: Eliminates redundant model loading, reduces memory usage by ~50-80%
### Model Evaluation Mode
- **Location**: `transcription_server.py:119-122`
- **Optimization**: Set model to eval mode and disable gradient computation
- **Benefit**: Reduces memory usage and improves inference speed by ~15-20%
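A minimal sketch of the shared-instance pattern, assuming the `openai-whisper` package; `get_model` and the module-level lock are illustrative names, not necessarily the server's own:
```python
import threading

import whisper  # openai-whisper

_model = None
_lock = threading.Lock()

def get_model(name: str = "large-v3", device: str = "cuda"):
    """Load the Whisper model once; every gRPC/WebSocket/REST handler reuses it."""
    global _model
    with _lock:  # prevent two handlers racing to load the model on first use
        if _model is None:
            _model = whisper.load_model(name, device=device)
            _model.eval()  # inference-only mode
            for p in _model.parameters():
                p.requires_grad_(False)  # drop autograd state to save memory
    return _model
```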
## 2. GPU Optimizations
### TF32 Precision (Ampere GPUs)
- **Location**: `transcription_server.py:105-111`
- **Optimization**: Enable TF32 for matrix multiplications on compatible GPUs
- **Benefit**: Up to 3x faster inference on A100/RTX 3000+ series GPUs with minimal accuracy loss
### cuDNN Benchmarking
- **Location**: `transcription_server.py:110`
- **Optimization**: Enable cuDNN autotuning for optimal convolution algorithms
- **Benefit**: 10-30% speedup after initial warmup
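Both switches are one-time settings at startup; a sketch of the corresponding PyTorch flags (their exact placement in `transcription_server.py` may differ):
```python
import torch

if torch.cuda.is_available():
    # TF32 matmuls: large speedups on Ampere+ GPUs, negligible accuracy loss
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    # cuDNN autotuning: benchmark convolution kernels once, then reuse the fastest
    torch.backends.cudnn.benchmark = True
```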
### FP16 Inference
- **Location**: `transcription_server.py:253`
- **Optimization**: Use FP16 precision on CUDA devices
- **Benefit**: 2x faster inference, 50% less GPU memory usage
## 3. Inference Optimizations
### No Gradient Context
- **Location**: `transcription_server.py:249-260, 340-346`
- **Optimization**: Wrap all inference calls in `torch.no_grad()` context
- **Benefit**: 10-15% speed improvement, reduces memory usage
### Optimized Audio Processing
- **Location**: `transcription_server.py:208-219`
- **Optimization**: Direct numpy operations, inline energy calculations
- **Benefit**: Faster VAD processing, reduced memory allocations
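A sketch combining the `torch.no_grad()` wrapper, the FP16 flag from section 2, and an inline numpy energy check; `transcribe_chunk` and the silence cutoff are illustrative:
```python
import numpy as np
import torch

def transcribe_chunk(model, audio: np.ndarray) -> str:
    """audio: mono float32 PCM in [-1, 1] at 16 kHz."""
    # Cheap inline energy check directly on the raw numpy buffer
    if float(np.mean(audio * audio)) < 1e-6:
        return ""  # clearly silent, skip inference entirely
    with torch.no_grad():  # no autograd bookkeeping during inference
        result = model.transcribe(audio, fp16=torch.cuda.is_available())
    return result["text"].strip()
```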
## 4. Network Optimizations
### gRPC Threading
- **Location**: `transcription_server.py:512-527`
- **Optimization**: Dynamic thread pool sizing based on CPU cores
- **Configuration**: `max_workers = min(cpu_count * 2, 20)`
- **Benefit**: Better handling of concurrent connections
### gRPC Keepalive
- **Location**: `transcription_server.py:522-526`
- **Optimization**: Configured keepalive and ping settings
- **Benefit**: More stable long-running connections, faster failure detection
### Message Size Limits
- **Location**: `transcription_server.py:519-520`
- **Optimization**: 100MB message size limits for large audio files
- **Benefit**: Support for longer audio files without chunking
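The three settings above come together when the server is constructed; a sketch using the public `grpcio` API (the keepalive values are plausible defaults, not necessarily the committed ones):
```python
import os
from concurrent import futures

import grpc

def build_server() -> grpc.Server:
    # Scale the thread pool with available cores, capped to avoid oversubscription
    max_workers = min((os.cpu_count() or 4) * 2, 20)
    options = [
        # 100 MB limits so long recordings fit in a single message
        ("grpc.max_send_message_length", 100 * 1024 * 1024),
        ("grpc.max_receive_message_length", 100 * 1024 * 1024),
        # Keepalive pings detect dead peers quickly on long-lived streams
        ("grpc.keepalive_time_ms", 30_000),
        ("grpc.keepalive_timeout_ms", 10_000),
        ("grpc.keepalive_permit_without_calls", 1),
    ]
    return grpc.server(
        futures.ThreadPoolExecutor(max_workers=max_workers), options=options
    )
```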
## 5. Voice Activity Detection (VAD)
### Smart Filtering
- **Location**: `transcription_server.py:162-203`
- **Optimization**: Fast energy-based VAD to skip silent audio
- **Configuration**:
- Energy threshold: 0.005
- Zero-crossing threshold: 50
- **Benefit**: 40-60% reduction in transcription calls for audio with silence
### Early Return
- **Location**: `transcription_server.py:215-217`
- **Optimization**: Skip transcription for non-speech audio
- **Benefit**: Reduces unnecessary inference calls, improves overall throughput
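A sketch of an energy/zero-crossing gate using the thresholds listed above; whether a high zero-crossing count is treated as speech or as noise is an assumption here, and the server's gating logic may differ:
```python
import numpy as np

ENERGY_THRESHOLD = 0.005      # RMS below this is treated as silence
ZERO_CROSSING_THRESHOLD = 50  # minimum sign changes for speech-like activity

def is_speech(audio: np.ndarray) -> bool:
    """Fast gate over mono float32 PCM; cheap enough to run on every chunk."""
    rms = float(np.sqrt(np.mean(audio * audio)))
    if rms < ENERGY_THRESHOLD:
        return False  # early return: too quiet to bother transcribing
    crossings = int(np.count_nonzero(np.diff(np.signbit(audio).astype(np.int8))))
    return crossings >= ZERO_CROSSING_THRESHOLD
```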
## 6. Anti-hallucination Filters
### Aggressive Filtering
- **Location**: `transcription_server.py:262-310`
- **Optimization**: Comprehensive hallucination detection and filtering
- **Filters**:
- Common hallucination phrases
- Repetitive text
- Low alphanumeric ratio
- Cross-language detection
- **Benefit**: Better transcription quality, fewer false positives
### Conservative Parameters
- **Location**: `transcription_server.py:254-259`
- **Optimization**: Tuned Whisper parameters to reduce hallucinations
- **Settings**:
- `temperature=0.0` (deterministic)
- `no_speech_threshold=0.8` (high)
- `logprob_threshold=-0.5` (strict)
- `condition_on_previous_text=False`
- **Benefit**: More accurate transcriptions, fewer hallucinations
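A sketch of these parameters as passed to Whisper's `transcribe`, together with a simplified version of the text filter; the phrase list and ratios are illustrative placeholders, not the committed values:
```python
import torch

COMMON_HALLUCINATIONS = {"thank you.", "thanks for watching!", "you"}  # illustrative

def looks_hallucinated(text: str) -> bool:
    t = text.strip().lower()
    if t in COMMON_HALLUCINATIONS:
        return True
    words = t.split()
    if len(words) >= 4 and len(set(words)) <= len(words) // 3:
        return True  # heavy repetition, a classic Whisper failure mode
    alnum = sum(c.isalnum() for c in t)
    return bool(t) and alnum / len(t) < 0.5  # mostly punctuation or symbols

def transcribe_conservative(model, audio):
    with torch.no_grad():
        result = model.transcribe(
            audio,
            temperature=0.0,                   # deterministic decoding
            no_speech_threshold=0.8,           # demand strong evidence of speech
            logprob_threshold=-0.5,            # reject low-confidence segments
            condition_on_previous_text=False,  # break hallucination feedback loops
            fp16=torch.cuda.is_available(),
        )
    text = result["text"].strip()
    return "" if looks_hallucinated(text) else text
```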
## 7. Logging Optimizations
### Debug-level for VAD
- **Location**: `transcription_server.py:216-219`
- **Optimization**: Use DEBUG level for VAD messages instead of INFO
- **Benefit**: Reduced log volume, better performance in high-throughput scenarios
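A small sketch of the split; `log_vad_skip` and `chunk_ms` are illustrative:
```python
import logging

logger = logging.getLogger("transcription")

def log_vad_skip(chunk_ms: float) -> None:
    # DEBUG keeps high-volume per-chunk VAD chatter out of production INFO logs
    logger.debug("VAD: skipping non-speech chunk (%.1f ms)", chunk_ms)
```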
## 8. REST API Optimizations
### Async Operations
- **Location**: `rest_api.py`
- **Optimization**: Fully async FastAPI with uvicorn
- **Benefit**: Non-blocking I/O, better concurrency
### Streaming Responses
- **Location**: `rest_api.py:223-278`
- **Optimization**: Server-Sent Events for streaming transcription
- **Benefit**: Real-time results without buffering entire response
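A self-contained sketch of the SSE pattern with FastAPI; the endpoint path and the stubbed generator are assumptions, not the real `rest_api.py` interface:
```python
import asyncio
import json

from fastapi import FastAPI, File, UploadFile
from fastapi.responses import StreamingResponse

app = FastAPI()

async def run_transcription(audio: bytes):
    """Stub async generator; the real server yields Whisper segments as they decode."""
    for segment in ("first segment", "second segment"):
        await asyncio.sleep(0)
        yield segment

@app.post("/transcribe/stream")
async def transcribe_stream(file: UploadFile = File(...)):
    audio = await file.read()

    async def event_stream():
        # Each SSE frame is flushed as soon as its segment is ready
        async for segment in run_transcription(audio):
            yield f"data: {json.dumps({'text': segment})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```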
### Connection Pooling
- **Built-in**: FastAPI/Uvicorn connection pooling
- **Benefit**: Efficient handling of concurrent HTTP connections
## Performance Benchmarks
### Typical Performance (RTX 3090, large-v3 model)
| Metric | Value |
|--------|-------|
| Cold start | 5-8 seconds |
| Transcription speed (with VAD) | 0.1-0.3x real-time factor (lower is faster) |
| Memory usage | 3-4 GB VRAM |
| Concurrent sessions | 5-10 (GPU memory dependent) |
| API latency | 50-200ms (excluding inference) |
### With vs. Without Optimizations
| Metric | Previous | Optimized | Improvement |
|--------|----------|-----------|-------------|
| Inference speed (real-time factor) | 0.2x | 0.1x | 2x faster |
| Memory per session | 4 GB | 0.5 GB | 8x reduction |
| Startup time | 8s | 6s | 25% faster |
## Recommendations
### For Maximum Performance
1. **Use GPU**: CUDA is 10-50x faster than CPU
2. **Use smaller models**: `base` or `small` for real-time applications
3. **Enable VAD**: Reduces unnecessary transcriptions
4. **Batch audio**: Send 3-5 second chunks for optimal throughput (see the chunking sketch after this list)
5. **Use gRPC**: Lower overhead than REST for high-frequency calls
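One way to produce such chunks on the client side (a sketch; the 16 kHz rate matches Whisper's expected input, and the 4-second window is an assumption within the recommended range):
```python
import numpy as np

SAMPLE_RATE = 16_000  # Whisper's expected sample rate
CHUNK_SECONDS = 4     # inside the recommended 3-5 s window

def chunk_audio(audio: np.ndarray):
    """Yield fixed-size chunks of mono float32 PCM for submission to the API."""
    step = SAMPLE_RATE * CHUNK_SECONDS
    for start in range(0, len(audio), step):
        yield audio[start:start + step]
```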
### For Best Quality
1. **Use larger models**: `large-v3` for best accuracy
2. **Disable VAD**: If you need to transcribe everything
3. **Specify language**: Avoid auto-detection if you know the language
4. **Longer audio chunks**: 5-10 seconds for better context
### For High Throughput
1. **Multiple replicas**: Scale horizontally with load balancer
2. **GPU per replica**: Each replica needs dedicated GPU memory
3. **Use gRPC streaming**: Most efficient for continuous transcription
4. **Monitor GPU utilization**: Keep it above 80% for best efficiency
## Future Optimizations
Potential improvements not yet implemented:
1. **Batch Inference**: Process multiple audio chunks in parallel
2. **Model Quantization**: INT8 quantization for faster inference
3. **Faster Whisper**: Use faster-whisper library (2-3x speedup)
4. **KV Cache**: Reuse key-value cache for streaming
5. **TensorRT**: Use TensorRT for optimized inference on NVIDIA GPUs
6. **Distillation**: Use distilled Whisper models (e.g. Distil-Whisper)
## Monitoring
Use these endpoints to monitor performance:
```bash
# Health and metrics
curl http://localhost:8000/health
# Active sessions
curl http://localhost:8000/sessions
# GPU utilization (if nvidia-smi available)
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```
## Tuning Parameters
Key environment variables for performance tuning:
```env
# Model selection (smaller = faster)
MODEL_PATH=base # tiny, base, small, medium, large-v3
# Thread count (CPU inference)
OMP_NUM_THREADS=4
# GPU selection
CUDA_VISIBLE_DEVICES=0
# Interface toggles
ENABLE_REST=true
ENABLE_WEBSOCKET=true
```
## Contact
For performance issues or optimization suggestions, please open an issue on GitHub.

View File

@@ -4,14 +4,14 @@ A high-performance, standalone transcription service with **REST API**, **gRPC**
## Features
- 🚀 **Multiple API Interfaces**: REST API, gRPC, and WebSocket
- 🎯 **High Performance**: Optimized with TF32, cuDNN, and efficient batching
- 🧠 **Whisper Models**: Support for all Whisper models (tiny to large-v3)
- 🎤 **Real-time Streaming**: Bidirectional streaming for live transcription
- 🔇 **Voice Activity Detection**: Smart VAD to filter silence and noise
- 🚫 **Anti-hallucination**: Advanced filtering to reduce Whisper hallucinations
- 🐳 **Docker Ready**: Easy deployment with GPU support
- 📊 **Interactive Docs**: Auto-generated API documentation (Swagger/OpenAPI)
- **Multiple API Interfaces**: REST API, gRPC, and WebSocket
- **High Performance**: Optimized with TF32, cuDNN, and efficient batching
- **Whisper Models**: Support for all Whisper models (tiny to large-v3)
- **Real-time Streaming**: Bidirectional streaming for live transcription
- **Voice Activity Detection**: Smart VAD to filter silence and noise
- **Anti-hallucination**: Advanced filtering to reduce Whisper hallucinations
- **Docker Ready**: Easy deployment with GPU support
- **Interactive Docs**: Auto-generated API documentation (Swagger/OpenAPI)
## Quick Start

View File

@@ -207,7 +207,7 @@ async fn main() -> Result<()> {
}
println!("\n{}", "".repeat(80));
println!("Playback and transcription complete!");
println!("Playback and transcription complete!");
// Keep the program alive until playback finishes
time::sleep(Duration::from_secs(2)).await;

View File

@@ -118,7 +118,7 @@ async fn main() -> Result<()> {
fn list_audio_devices() -> Result<()> {
let host = cpal::default_host();
println!("\n📊 Available Audio Devices:");
println!("\n Available Audio Devices:");
println!("{}", "".repeat(80));
// List input devices
@@ -138,10 +138,10 @@ fn list_audio_devices() -> Result<()> {
// Show default device
if let Some(device) = host.default_input_device() {
println!("\n Default Input: {}", device.name()?);
println!("\n Default Input: {}", device.name()?);
}
println!("\n💡 Tips for capturing system audio:");
println!("\n Tips for capturing system audio:");
println!(" Linux: Look for devices with 'monitor' in the name (PulseAudio/PipeWire)");
println!(" Windows: Install VB-Cable or enable 'Stereo Mix' in sound settings");
println!(" macOS: Install BlackHole or Loopback for system audio capture");

View File

@@ -18,7 +18,7 @@ NC='\033[0m' # No Color
# Check dependencies
check_dependency() {
if ! command -v $1 &> /dev/null; then
echo -e "${RED} $1 not found.${NC}"
echo -e "${RED} $1 not found.${NC}"
echo "Please install: sudo apt-get install $2"
return 1
fi
@@ -27,7 +27,7 @@ check_dependency() {
echo "Checking dependencies..."
check_dependency "parec" "pulseaudio-utils" || exit 1
check_dependency "sox" "sox" || echo -e "${YELLOW}⚠️ sox not installed (optional but recommended)${NC}"
check_dependency "sox" "sox" || echo -e "${YELLOW} sox not installed (optional but recommended)${NC}"
# Function to find the monitor source for system audio
find_monitor_source() {
@@ -52,11 +52,11 @@ find_monitor_source() {
# List available sources
if [ "$1" == "--list" ]; then
echo -e "${GREEN}📊 Available Audio Sources:${NC}"
echo -e "${GREEN} Available Audio Sources:${NC}"
echo ""
pactl list sources short 2>/dev/null || pacmd list-sources 2>/dev/null | grep "name:"
echo ""
echo -e "${GREEN}💡 Monitor sources (system audio):${NC}"
echo -e "${GREEN} Monitor sources (system audio):${NC}"
pactl list sources short 2>/dev/null | grep -i "monitor" || echo "No monitor sources found"
exit 0
fi
@@ -82,24 +82,24 @@ fi
# Determine what to capture
if [ "$1" == "--microphone" ]; then
echo -e "${GREEN}🎤 Using microphone input${NC}"
echo -e "${GREEN} Using microphone input${NC}"
# Run the existing live-transcribe for microphone
exec cargo run --bin live-transcribe
exit 0
elif [ "$1" == "--combined" ]; then
echo -e "${YELLOW}🎤+🔊 Combined audio capture not yet implemented${NC}"
echo -e "${YELLOW}+ Combined audio capture not yet implemented${NC}"
echo "For now, please run two separate instances:"
echo " 1. $0 (for system audio)"
echo " 2. $0 --microphone (for mic)"
exit 1
elif [ "$1" == "--source" ] && [ -n "$2" ]; then
SOURCE="$2"
echo -e "${GREEN}📡 Using specified source: $SOURCE${NC}"
echo -e "${GREEN} Using specified source: $SOURCE${NC}"
else
# Auto-detect monitor source
SOURCE=$(find_monitor_source)
if [ -z "$SOURCE" ]; then
echo -e "${RED} Could not find system audio monitor source${NC}"
echo -e "${RED} Could not find system audio monitor source${NC}"
echo ""
echo "This might happen if:"
echo " 1. No audio is currently playing"
@@ -111,14 +111,14 @@ else
echo " 3. Use a specific source: $0 --source <source_name>"
exit 1
fi
echo -e "${GREEN}📡 Found system audio source: $SOURCE${NC}"
echo -e "${GREEN} Found system audio source: $SOURCE${NC}"
fi
echo ""
echo -e "${GREEN}🎬 Starting video call transcription...${NC}"
echo -e "${GREEN} Starting video call transcription...${NC}"
echo -e "${YELLOW}Press Ctrl+C to stop${NC}"
echo ""
echo "💡 Tips for best results:"
echo " Tips for best results:"
echo " • Join your video call first"
echo " • Use headphones to avoid echo"
echo " • Close other audio sources (music, videos)"