2025-12-01 15:06:29 +01:00
parent 336ee5b5bd
commit 0408a1fc07
5 changed files with 23 additions and 238 deletions

View File

@@ -1,215 +0,0 @@
# Performance Optimizations
This document outlines the performance optimizations implemented in the Transcription API.
## 1. Model Management
### Shared Model Instance
- **Location**: `transcription_server.py:73-137`
- **Optimization**: Single Whisper model instance shared across all connections (gRPC, WebSocket, REST)
- **Benefit**: Eliminates redundant model loading, reduces memory usage by ~50-80%
### Model Evaluation Mode
- **Location**: `transcription_server.py:119-122`
- **Optimization**: Set model to eval mode and disable gradient computation
- **Benefit**: Reduces memory usage and improves inference speed by ~15-20%
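A minimal sketch of the shared-instance pattern, assuming the `openai-whisper` package; `get_model` and the module-level lock are illustrative names, not necessarily the server's own:
```python
import threading

import whisper  # openai-whisper

_model = None
_lock = threading.Lock()

def get_model(name: str = "large-v3", device: str = "cuda"):
    """Load the Whisper model once; every gRPC/WebSocket/REST handler reuses it."""
    global _model
    with _lock:  # prevent two handlers racing to load the model on first use
        if _model is None:
            _model = whisper.load_model(name, device=device)
            _model.eval()  # inference-only mode
            for p in _model.parameters():
                p.requires_grad_(False)  # drop autograd state to save memory
    return _model
```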
## 2. GPU Optimizations
### TF32 Precision (Ampere GPUs)
- **Location**: `transcription_server.py:105-111`
- **Optimization**: Enable TF32 for matrix multiplications on compatible GPUs
- **Benefit**: Up to 3x faster inference on A100/RTX 3000+ series GPUs with minimal accuracy loss
### cuDNN Benchmarking
- **Location**: `transcription_server.py:110`
- **Optimization**: Enable cuDNN autotuning for optimal convolution algorithms
- **Benefit**: 10-30% speedup after initial warmup
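Both switches are one-time settings at startup; a sketch of the corresponding PyTorch flags (their exact placement in `transcription_server.py` may differ):
```python
import torch

if torch.cuda.is_available():
    # TF32 matmuls: large speedups on Ampere+ GPUs, negligible accuracy loss
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    # cuDNN autotuning: benchmark convolution kernels once, then reuse the fastest
    torch.backends.cudnn.benchmark = True
```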
### FP16 Inference
- **Location**: `transcription_server.py:253`
- **Optimization**: Use FP16 precision on CUDA devices
- **Benefit**: 2x faster inference, 50% less GPU memory usage
## 3. Inference Optimizations
### No Gradient Context
- **Location**: `transcription_server.py:249-260, 340-346`
- **Optimization**: Wrap all inference calls in `torch.no_grad()` context
- **Benefit**: 10-15% speed improvement, reduces memory usage
### Optimized Audio Processing
- **Location**: `transcription_server.py:208-219`
- **Optimization**: Direct numpy operations, inline energy calculations
- **Benefit**: Faster VAD processing, reduced memory allocations
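A sketch combining the `torch.no_grad()` wrapper, the FP16 flag from section 2, and an inline numpy energy check; `transcribe_chunk` and the silence cutoff are illustrative:
```python
import numpy as np
import torch

def transcribe_chunk(model, audio: np.ndarray) -> str:
    """audio: mono float32 PCM in [-1, 1] at 16 kHz."""
    # Cheap inline energy check directly on the raw numpy buffer
    if float(np.mean(audio * audio)) < 1e-6:
        return ""  # clearly silent, skip inference entirely
    with torch.no_grad():  # no autograd bookkeeping during inference
        result = model.transcribe(audio, fp16=torch.cuda.is_available())
    return result["text"].strip()
```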
## 4. Network Optimizations
### gRPC Threading
- **Location**: `transcription_server.py:512-527`
- **Optimization**: Dynamic thread pool sizing based on CPU cores
- **Configuration**: `max_workers = min(cpu_count * 2, 20)`
- **Benefit**: Better handling of concurrent connections
### gRPC Keepalive
- **Location**: `transcription_server.py:522-526`
- **Optimization**: Configured keepalive and ping settings
- **Benefit**: More stable long-running connections, faster failure detection
### Message Size Limits
- **Location**: `transcription_server.py:519-520`
- **Optimization**: 100MB message size limits for large audio files
- **Benefit**: Support for longer audio files without chunking
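The three settings above come together when the server is constructed; a sketch using the public `grpcio` API (the keepalive values are plausible defaults, not necessarily the committed ones):
```python
import os
from concurrent import futures

import grpc

def build_server() -> grpc.Server:
    # Scale the thread pool with available cores, capped to avoid oversubscription
    max_workers = min((os.cpu_count() or 4) * 2, 20)
    options = [
        # 100 MB limits so long recordings fit in a single message
        ("grpc.max_send_message_length", 100 * 1024 * 1024),
        ("grpc.max_receive_message_length", 100 * 1024 * 1024),
        # Keepalive pings detect dead peers quickly on long-lived streams
        ("grpc.keepalive_time_ms", 30_000),
        ("grpc.keepalive_timeout_ms", 10_000),
        ("grpc.keepalive_permit_without_calls", 1),
    ]
    return grpc.server(
        futures.ThreadPoolExecutor(max_workers=max_workers), options=options
    )
```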
## 5. Voice Activity Detection (VAD)
### Smart Filtering
- **Location**: `transcription_server.py:162-203`
- **Optimization**: Fast energy-based VAD to skip silent audio
- **Configuration**:
- Energy threshold: 0.005
- Zero-crossing threshold: 50
- **Benefit**: 40-60% reduction in transcription calls for audio with silence
### Early Return
- **Location**: `transcription_server.py:215-217`
- **Optimization**: Skip transcription for non-speech audio
- **Benefit**: Reduces unnecessary inference calls, improves overall throughput
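A sketch of an energy/zero-crossing gate using the thresholds listed above; whether a high zero-crossing count is treated as speech or as noise is an assumption here, and the server's gating logic may differ:
```python
import numpy as np

ENERGY_THRESHOLD = 0.005      # RMS below this is treated as silence
ZERO_CROSSING_THRESHOLD = 50  # minimum sign changes for speech-like activity

def is_speech(audio: np.ndarray) -> bool:
    """Fast gate over mono float32 PCM; cheap enough to run on every chunk."""
    rms = float(np.sqrt(np.mean(audio * audio)))
    if rms < ENERGY_THRESHOLD:
        return False  # early return: too quiet to bother transcribing
    crossings = int(np.count_nonzero(np.diff(np.signbit(audio).astype(np.int8))))
    return crossings >= ZERO_CROSSING_THRESHOLD
```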
## 6. Anti-hallucination Filters
### Aggressive Filtering
- **Location**: `transcription_server.py:262-310`
- **Optimization**: Comprehensive hallucination detection and filtering
- **Filters**:
- Common hallucination phrases
- Repetitive text
- Low alphanumeric ratio
- Cross-language detection
- **Benefit**: Better transcription quality, fewer false positives
### Conservative Parameters
- **Location**: `transcription_server.py:254-259`
- **Optimization**: Tuned Whisper parameters to reduce hallucinations
- **Settings**:
- `temperature=0.0` (deterministic)
- `no_speech_threshold=0.8` (high)
- `logprob_threshold=-0.5` (strict)
- `condition_on_previous_text=False`
- **Benefit**: More accurate transcriptions, fewer hallucinations
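A sketch of these parameters as passed to Whisper's `transcribe`, together with a simplified version of the text filter; the phrase list and ratios are illustrative placeholders, not the committed values:
```python
import torch

COMMON_HALLUCINATIONS = {"thank you.", "thanks for watching!", "you"}  # illustrative

def looks_hallucinated(text: str) -> bool:
    t = text.strip().lower()
    if t in COMMON_HALLUCINATIONS:
        return True
    words = t.split()
    if len(words) >= 4 and len(set(words)) <= len(words) // 3:
        return True  # heavy repetition, a classic Whisper failure mode
    alnum = sum(c.isalnum() for c in t)
    return bool(t) and alnum / len(t) < 0.5  # mostly punctuation or symbols

def transcribe_conservative(model, audio):
    with torch.no_grad():
        result = model.transcribe(
            audio,
            temperature=0.0,                   # deterministic decoding
            no_speech_threshold=0.8,           # demand strong evidence of speech
            logprob_threshold=-0.5,            # reject low-confidence segments
            condition_on_previous_text=False,  # break hallucination feedback loops
            fp16=torch.cuda.is_available(),
        )
    text = result["text"].strip()
    return "" if looks_hallucinated(text) else text
```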
## 7. Logging Optimizations
### Debug-level for VAD
- **Location**: `transcription_server.py:216-219`
- **Optimization**: Use DEBUG level for VAD messages instead of INFO
- **Benefit**: Reduced log volume, better performance in high-throughput scenarios
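A small sketch of the split; `log_vad_skip` and `chunk_ms` are illustrative:
```python
import logging

logger = logging.getLogger("transcription")

def log_vad_skip(chunk_ms: float) -> None:
    # DEBUG keeps high-volume per-chunk VAD chatter out of production INFO logs
    logger.debug("VAD: skipping non-speech chunk (%.1f ms)", chunk_ms)
```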
## 8. REST API Optimizations
### Async Operations
- **Location**: `rest_api.py`
- **Optimization**: Fully async FastAPI with uvicorn
- **Benefit**: Non-blocking I/O, better concurrency
### Streaming Responses
- **Location**: `rest_api.py:223-278`
- **Optimization**: Server-Sent Events for streaming transcription
- **Benefit**: Real-time results without buffering entire response
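A self-contained sketch of the SSE pattern with FastAPI; the endpoint path and the stubbed generator are assumptions, not the real `rest_api.py` interface:
```python
import asyncio
import json

from fastapi import FastAPI, File, UploadFile
from fastapi.responses import StreamingResponse

app = FastAPI()

async def run_transcription(audio: bytes):
    """Stub async generator; the real server yields Whisper segments as they decode."""
    for segment in ("first segment", "second segment"):
        await asyncio.sleep(0)
        yield segment

@app.post("/transcribe/stream")
async def transcribe_stream(file: UploadFile = File(...)):
    audio = await file.read()

    async def event_stream():
        # Each SSE frame is flushed as soon as its segment is ready
        async for segment in run_transcription(audio):
            yield f"data: {json.dumps({'text': segment})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```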
### Connection Pooling
- **Built-in**: FastAPI/Uvicorn connection pooling
- **Benefit**: Efficient handling of concurrent HTTP connections
## Performance Benchmarks
### Typical Performance (RTX 3090, large-v3 model)
| Metric | Value |
|--------|-------|
| Cold start | 5-8 seconds |
| Transcription speed (with VAD) | 0.1-0.3x real-time factor (lower is faster) |
| Memory usage | 3-4 GB VRAM |
| Concurrent sessions | 5-10 (GPU memory dependent) |
| API latency | 50-200ms (excluding inference) |
### With vs. Without Optimizations
| Metric | Previous | Optimized | Improvement |
|--------|----------|-----------|-------------|
| Inference speed (real-time factor) | 0.2x | 0.1x | 2x faster |
| Memory per session | 4 GB | 0.5 GB | 8x reduction |
| Startup time | 8s | 6s | 25% faster |
## Recommendations
### For Maximum Performance
1. **Use GPU**: CUDA is 10-50x faster than CPU
2. **Use smaller models**: `base` or `small` for real-time applications
3. **Enable VAD**: Reduces unnecessary transcriptions
4. **Batch audio**: Send 3-5 second chunks for optimal throughput (see the chunking sketch after this list)
5. **Use gRPC**: Lower overhead than REST for high-frequency calls
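One way to produce such chunks on the client side (a sketch; the 16 kHz rate matches Whisper's expected input, and the 4-second window is an assumption within the recommended range):
```python
import numpy as np

SAMPLE_RATE = 16_000  # Whisper's expected sample rate
CHUNK_SECONDS = 4     # inside the recommended 3-5 s window

def chunk_audio(audio: np.ndarray):
    """Yield fixed-size chunks of mono float32 PCM for submission to the API."""
    step = SAMPLE_RATE * CHUNK_SECONDS
    for start in range(0, len(audio), step):
        yield audio[start:start + step]
```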
### For Best Quality
1. **Use larger models**: `large-v3` for best accuracy
2. **Disable VAD**: If you need to transcribe everything
3. **Specify language**: Avoid auto-detection if you know the language
4. **Longer audio chunks**: 5-10 seconds for better context
### For High Throughput
1. **Multiple replicas**: Scale horizontally with load balancer
2. **GPU per replica**: Each replica needs dedicated GPU memory
3. **Use gRPC streaming**: Most efficient for continuous transcription
4. **Monitor GPU utilization**: Keep it above 80% for best efficiency
## Future Optimizations
Potential improvements not yet implemented:
1. **Batch Inference**: Process multiple audio chunks in parallel
2. **Model Quantization**: INT8 quantization for faster inference
3. **Faster Whisper**: Use faster-whisper library (2-3x speedup)
4. **KV Cache**: Reuse key-value cache for streaming
5. **TensorRT**: Use TensorRT for optimized inference on NVIDIA GPUs
6. **Distillation**: Use distilled Whisper models (e.g. Distil-Whisper)
## Monitoring
Use these endpoints to monitor performance:
```bash
# Health and metrics
curl http://localhost:8000/health
# Active sessions
curl http://localhost:8000/sessions
# GPU utilization (if nvidia-smi available)
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```
## Tuning Parameters
Key environment variables for performance tuning:
```env
# Model selection (smaller = faster)
MODEL_PATH=base # tiny, base, small, medium, large-v3
# Thread count (CPU inference)
OMP_NUM_THREADS=4
# GPU selection
CUDA_VISIBLE_DEVICES=0
# Interface toggles
ENABLE_REST=true
ENABLE_WEBSOCKET=true
```
## Contact
For performance issues or optimization suggestions, please open an issue on GitHub.

View File

@@ -4,14 +4,14 @@ A high-performance, standalone transcription service with **REST API**, **gRPC**
## Features
- 🚀 **Multiple API Interfaces**: REST API, gRPC, and WebSocket
- 🎯 **High Performance**: Optimized with TF32, cuDNN, and efficient batching
- 🧠 **Whisper Models**: Support for all Whisper models (tiny to large-v3)
- 🎤 **Real-time Streaming**: Bidirectional streaming for live transcription
- 🔇 **Voice Activity Detection**: Smart VAD to filter silence and noise
- 🚫 **Anti-hallucination**: Advanced filtering to reduce Whisper hallucinations
- 🐳 **Docker Ready**: Easy deployment with GPU support
- 📊 **Interactive Docs**: Auto-generated API documentation (Swagger/OpenAPI)
- **Multiple API Interfaces**: REST API, gRPC, and WebSocket
- **High Performance**: Optimized with TF32, cuDNN, and efficient batching
- **Whisper Models**: Support for all Whisper models (tiny to large-v3)
- **Real-time Streaming**: Bidirectional streaming for live transcription
- **Voice Activity Detection**: Smart VAD to filter silence and noise
- **Anti-hallucination**: Advanced filtering to reduce Whisper hallucinations
- **Docker Ready**: Easy deployment with GPU support
- **Interactive Docs**: Auto-generated API documentation (Swagger/OpenAPI)
## Quick Start

View File

@@ -207,7 +207,7 @@ async fn main() -> Result<()> {
}
println!("\n{}", "".repeat(80));
println!("Playback and transcription complete!");
println!("Playback and transcription complete!");
// Keep the program alive until playback finishes
time::sleep(Duration::from_secs(2)).await;

View File

@@ -118,7 +118,7 @@ async fn main() -> Result<()> {
fn list_audio_devices() -> Result<()> {
let host = cpal::default_host();
println!("\n📊 Available Audio Devices:");
println!("\n Available Audio Devices:");
println!("{}", "".repeat(80));
// List input devices
@@ -138,10 +138,10 @@ fn list_audio_devices() -> Result<()> {
// Show default device
if let Some(device) = host.default_input_device() {
println!("\n Default Input: {}", device.name()?);
println!("\n Default Input: {}", device.name()?);
}
println!("\n💡 Tips for capturing system audio:");
println!("\n Tips for capturing system audio:");
println!(" Linux: Look for devices with 'monitor' in the name (PulseAudio/PipeWire)");
println!(" Windows: Install VB-Cable or enable 'Stereo Mix' in sound settings");
println!(" macOS: Install BlackHole or Loopback for system audio capture");

View File

@@ -18,7 +18,7 @@ NC='\033[0m' # No Color
# Check dependencies
check_dependency() {
if ! command -v $1 &> /dev/null; then
echo -e "${RED} $1 not found.${NC}"
echo -e "${RED} $1 not found.${NC}"
echo "Please install: sudo apt-get install $2"
return 1
fi
@@ -27,7 +27,7 @@ check_dependency() {
echo "Checking dependencies..."
check_dependency "parec" "pulseaudio-utils" || exit 1
check_dependency "sox" "sox" || echo -e "${YELLOW}⚠️ sox not installed (optional but recommended)${NC}"
check_dependency "sox" "sox" || echo -e "${YELLOW} sox not installed (optional but recommended)${NC}"
# Function to find the monitor source for system audio
find_monitor_source() {
@@ -52,11 +52,11 @@ find_monitor_source() {
# List available sources
if [ "$1" == "--list" ]; then
echo -e "${GREEN}📊 Available Audio Sources:${NC}"
echo -e "${GREEN} Available Audio Sources:${NC}"
echo ""
pactl list sources short 2>/dev/null || pacmd list-sources 2>/dev/null | grep "name:"
echo ""
echo -e "${GREEN}💡 Monitor sources (system audio):${NC}"
echo -e "${GREEN} Monitor sources (system audio):${NC}"
pactl list sources short 2>/dev/null | grep -i "monitor" || echo "No monitor sources found"
exit 0
fi
@@ -82,24 +82,24 @@ fi
# Determine what to capture
if [ "$1" == "--microphone" ]; then
echo -e "${GREEN}🎤 Using microphone input${NC}"
echo -e "${GREEN} Using microphone input${NC}"
# Run the existing live-transcribe for microphone
exec cargo run --bin live-transcribe
exit 0
elif [ "$1" == "--combined" ]; then
echo -e "${YELLOW}🎤+🔊 Combined audio capture not yet implemented${NC}"
echo -e "${YELLOW}+ Combined audio capture not yet implemented${NC}"
echo "For now, please run two separate instances:"
echo " 1. $0 (for system audio)"
echo " 2. $0 --microphone (for mic)"
exit 1
elif [ "$1" == "--source" ] && [ -n "$2" ]; then
SOURCE="$2"
echo -e "${GREEN}📡 Using specified source: $SOURCE${NC}"
echo -e "${GREEN} Using specified source: $SOURCE${NC}"
else
# Auto-detect monitor source
SOURCE=$(find_monitor_source)
if [ -z "$SOURCE" ]; then
echo -e "${RED} Could not find system audio monitor source${NC}"
echo -e "${RED} Could not find system audio monitor source${NC}"
echo ""
echo "This might happen if:"
echo " 1. No audio is currently playing"
@@ -111,14 +111,14 @@ else
echo " 3. Use a specific source: $0 --source <source_name>"
exit 1
fi
echo -e "${GREEN}📡 Found system audio source: $SOURCE${NC}"
echo -e "${GREEN} Found system audio source: $SOURCE${NC}"
fi
echo ""
echo -e "${GREEN}🎬 Starting video call transcription...${NC}"
echo -e "${GREEN} Starting video call transcription...${NC}"
echo -e "${YELLOW}Press Ctrl+C to stop${NC}"
echo ""
echo "💡 Tips for best results:"
echo " Tips for best results:"
echo " • Join your video call first"
echo " • Use headphones to avoid echo"
echo " • Close other audio sources (music, videos)"