mirror of
https://github.com/aljazceru/transcription-api.git
synced 2025-12-16 23:14:18 +01:00
cleanup
This commit is contained in:
@@ -1,215 +0,0 @@
|
|||||||
# Performance Optimizations
|
|
||||||
|
|
||||||
This document outlines the performance optimizations implemented in the Transcription API.
|
|
||||||
|
|
||||||
## 1. Model Management
|
|
||||||
|
|
||||||
### Shared Model Instance
|
|
||||||
- **Location**: `transcription_server.py:73-137`
|
|
||||||
- **Optimization**: Single Whisper model instance shared across all connections (gRPC, WebSocket, REST)
|
|
||||||
- **Benefit**: Eliminates redundant model loading, reduces memory usage by ~50-80%
|
|
||||||
|
|
||||||
### Model Evaluation Mode
|
|
||||||
- **Location**: `transcription_server.py:119-122`
|
|
||||||
- **Optimization**: Set model to eval mode and disable gradient computation
|
|
||||||
- **Benefit**: Reduces memory usage and improves inference speed by ~15-20%
|
|
||||||
|
|
||||||
## 2. GPU Optimizations
|
|
||||||
|
|
||||||
### TF32 Precision (Ampere GPUs)
|
|
||||||
- **Location**: `transcription_server.py:105-111`
|
|
||||||
- **Optimization**: Enable TF32 for matrix multiplications on compatible GPUs
|
|
||||||
- **Benefit**: Up to 3x faster inference on A100/RTX 3000+ series GPUs with minimal accuracy loss
|
|
||||||
|
|
||||||
### cuDNN Benchmarking
|
|
||||||
- **Location**: `transcription_server.py:110`
|
|
||||||
- **Optimization**: Enable cuDNN autotuning for optimal convolution algorithms
|
|
||||||
- **Benefit**: 10-30% speedup after initial warmup
|
|
||||||
|
|
||||||
### FP16 Inference
|
|
||||||
- **Location**: `transcription_server.py:253`
|
|
||||||
- **Optimization**: Use FP16 precision on CUDA devices
|
|
||||||
- **Benefit**: 2x faster inference, 50% less GPU memory usage
|
|
||||||
|
|
||||||
## 3. Inference Optimizations
|
|
||||||
|
|
||||||
### No Gradient Context
|
|
||||||
- **Location**: `transcription_server.py:249-260, 340-346`
|
|
||||||
- **Optimization**: Wrap all inference calls in `torch.no_grad()` context
|
|
||||||
- **Benefit**: 10-15% speed improvement, reduces memory usage
|
|
||||||
|
|
||||||
### Optimized Audio Processing
|
|
||||||
- **Location**: `transcription_server.py:208-219`
|
|
||||||
- **Optimization**: Direct numpy operations, inline energy calculations
|
|
||||||
- **Benefit**: Faster VAD processing, reduced memory allocations
|
|
||||||
|
|
||||||
## 4. Network Optimizations
|
|
||||||
|
|
||||||
### gRPC Threading
|
|
||||||
- **Location**: `transcription_server.py:512-527`
|
|
||||||
- **Optimization**: Dynamic thread pool sizing based on CPU cores
|
|
||||||
- **Configuration**: `max_workers = min(cpu_count * 2, 20)`
|
|
||||||
- **Benefit**: Better handling of concurrent connections
|
|
||||||
|
|
||||||
### gRPC Keepalive
|
|
||||||
- **Location**: `transcription_server.py:522-526`
|
|
||||||
- **Optimization**: Configured keepalive and ping settings
|
|
||||||
- **Benefit**: More stable long-running connections, faster failure detection
|
|
||||||
|
|
||||||
### Message Size Limits
|
|
||||||
- **Location**: `transcription_server.py:519-520`
|
|
||||||
- **Optimization**: 100MB message size limits for large audio files
|
|
||||||
- **Benefit**: Support for longer audio files without chunking
|
|
||||||
|
|
||||||
## 5. Voice Activity Detection (VAD)
|
|
||||||
|
|
||||||
### Smart Filtering
|
|
||||||
- **Location**: `transcription_server.py:162-203`
|
|
||||||
- **Optimization**: Fast energy-based VAD to skip silent audio
|
|
||||||
- **Configuration**:
|
|
||||||
- Energy threshold: 0.005
|
|
||||||
- Zero-crossing threshold: 50
|
|
||||||
- **Benefit**: 40-60% reduction in transcription calls for audio with silence
|
|
||||||
|
|
||||||
### Early Return
|
|
||||||
- **Location**: `transcription_server.py:215-217`
|
|
||||||
- **Optimization**: Skip transcription for non-speech audio
|
|
||||||
- **Benefit**: Reduces unnecessary inference calls, improves overall throughput
|
|
||||||
|
|
||||||
## 6. Anti-hallucination Filters
|
|
||||||
|
|
||||||
### Aggressive Filtering
|
|
||||||
- **Location**: `transcription_server.py:262-310`
|
|
||||||
- **Optimization**: Comprehensive hallucination detection and filtering
|
|
||||||
- **Filters**:
|
|
||||||
- Common hallucination phrases
|
|
||||||
- Repetitive text
|
|
||||||
- Low alphanumeric ratio
|
|
||||||
- Cross-language detection
|
|
||||||
- **Benefit**: Better transcription quality, fewer false positives
|
|
||||||
|
|
||||||
### Conservative Parameters
|
|
||||||
- **Location**: `transcription_server.py:254-259`
|
|
||||||
- **Optimization**: Tuned Whisper parameters to reduce hallucinations
|
|
||||||
- **Settings**:
|
|
||||||
- `temperature=0.0` (deterministic)
|
|
||||||
- `no_speech_threshold=0.8` (high)
|
|
||||||
- `logprob_threshold=-0.5` (strict)
|
|
||||||
- `condition_on_previous_text=False`
|
|
||||||
- **Benefit**: More accurate transcriptions, fewer hallucinations
|
|
||||||
|
|
||||||
## 7. Logging Optimizations
|
|
||||||
|
|
||||||
### Debug-level for VAD
|
|
||||||
- **Location**: `transcription_server.py:216-219`
|
|
||||||
- **Optimization**: Use DEBUG level for VAD messages instead of INFO
|
|
||||||
- **Benefit**: Reduced log volume, better performance in high-throughput scenarios
|
|
||||||
|
|
||||||
## 8. REST API Optimizations
|
|
||||||
|
|
||||||
### Async Operations
|
|
||||||
- **Location**: `rest_api.py`
|
|
||||||
- **Optimization**: Fully async FastAPI with uvicorn
|
|
||||||
- **Benefit**: Non-blocking I/O, better concurrency
|
|
||||||
|
|
||||||
### Streaming Responses
|
|
||||||
- **Location**: `rest_api.py:223-278`
|
|
||||||
- **Optimization**: Server-Sent Events for streaming transcription
|
|
||||||
- **Benefit**: Real-time results without buffering entire response
|
|
||||||
|
|
||||||
### Connection Pooling
|
|
||||||
- **Built-in**: FastAPI/Uvicorn connection pooling
|
|
||||||
- **Benefit**: Efficient handling of concurrent HTTP connections
|
|
||||||
|
|
||||||
## Performance Benchmarks
|
|
||||||
|
|
||||||
### Typical Performance (RTX 3090, large-v3 model)
|
|
||||||
|
|
||||||
| Metric | Value |
|
|
||||||
|--------|-------|
|
|
||||||
| Cold start | 5-8 seconds |
|
|
||||||
| Transcription speed (with VAD) | 0.1-0.3x real-time |
|
|
||||||
| Memory usage | 3-4 GB VRAM |
|
|
||||||
| Concurrent sessions | 5-10 (GPU memory dependent) |
|
|
||||||
| API latency | 50-200ms (excluding inference) |
|
|
||||||
|
|
||||||
### Without Optimizations
|
|
||||||
|
|
||||||
| Metric | Previous | Optimized | Improvement |
|
|
||||||
|--------|----------|-----------|-------------|
|
|
||||||
| Inference speed | 0.2x | 0.1x | 2x faster |
|
|
||||||
| Memory per session | 4 GB | 0.5 GB | 8x reduction |
|
|
||||||
| Startup time | 8s | 6s | 25% faster |
|
|
||||||
|
|
||||||
## Recommendations
|
|
||||||
|
|
||||||
### For Maximum Performance
|
|
||||||
|
|
||||||
1. **Use GPU**: CUDA is 10-50x faster than CPU
|
|
||||||
2. **Use smaller models**: `base` or `small` for real-time applications
|
|
||||||
3. **Enable VAD**: Reduces unnecessary transcriptions
|
|
||||||
4. **Batch audio**: Send 3-5 second chunks for optimal throughput
|
|
||||||
5. **Use gRPC**: Lower overhead than REST for high-frequency calls
|
|
||||||
|
|
||||||
### For Best Quality
|
|
||||||
|
|
||||||
1. **Use larger models**: `large-v3` for best accuracy
|
|
||||||
2. **Disable VAD**: If you need to transcribe everything
|
|
||||||
3. **Specify language**: Avoid auto-detection if you know the language
|
|
||||||
4. **Longer audio chunks**: 5-10 seconds for better context
|
|
||||||
|
|
||||||
### For High Throughput
|
|
||||||
|
|
||||||
1. **Multiple replicas**: Scale horizontally with load balancer
|
|
||||||
2. **GPU per replica**: Each replica needs dedicated GPU memory
|
|
||||||
3. **Use gRPC streaming**: Most efficient for continuous transcription
|
|
||||||
4. **Monitor GPU utilization**: Keep it above 80% for best efficiency
|
|
||||||
|
|
||||||
## Future Optimizations
|
|
||||||
|
|
||||||
Potential improvements not yet implemented:
|
|
||||||
|
|
||||||
1. **Batch Inference**: Process multiple audio chunks in parallel
|
|
||||||
2. **Model Quantization**: INT8 quantization for faster inference
|
|
||||||
3. **Faster Whisper**: Use faster-whisper library (2-3x speedup)
|
|
||||||
4. **KV Cache**: Reuse key-value cache for streaming
|
|
||||||
5. **TensorRT**: Use TensorRT for optimized inference on NVIDIA GPUs
|
|
||||||
6. **Distillation**: Use distilled Whisper models (whisper-small-distilled)
|
|
||||||
|
|
||||||
## Monitoring
|
|
||||||
|
|
||||||
Use these endpoints to monitor performance:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Health and metrics
|
|
||||||
curl http://localhost:8000/health
|
|
||||||
|
|
||||||
# Active sessions
|
|
||||||
curl http://localhost:8000/sessions
|
|
||||||
|
|
||||||
# GPU utilization (if nvidia-smi available)
|
|
||||||
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
|
|
||||||
```
|
|
||||||
|
|
||||||
## Tuning Parameters
|
|
||||||
|
|
||||||
Key environment variables for performance tuning:
|
|
||||||
|
|
||||||
```env
|
|
||||||
# Model selection (smaller = faster)
|
|
||||||
MODEL_PATH=base # tiny, base, small, medium, large-v3
|
|
||||||
|
|
||||||
# Thread count (CPU inference)
|
|
||||||
OMP_NUM_THREADS=4
|
|
||||||
|
|
||||||
# GPU selection
|
|
||||||
CUDA_VISIBLE_DEVICES=0
|
|
||||||
|
|
||||||
# Enable optimizations
|
|
||||||
ENABLE_REST=true
|
|
||||||
ENABLE_WEBSOCKET=true
|
|
||||||
```
|
|
||||||
|
|
||||||
## Contact
|
|
||||||
|
|
||||||
For performance issues or optimization suggestions, please open an issue on GitHub.
|
|
||||||
16
README.md
16
README.md
@@ -4,14 +4,14 @@ A high-performance, standalone transcription service with **REST API**, **gRPC**
|
|||||||
|
|
||||||
## Features
|
## Features
|
||||||
|
|
||||||
- 🚀 **Multiple API Interfaces**: REST API, gRPC, and WebSocket
|
- **Multiple API Interfaces**: REST API, gRPC, and WebSocket
|
||||||
- 🎯 **High Performance**: Optimized with TF32, cuDNN, and efficient batching
|
- **High Performance**: Optimized with TF32, cuDNN, and efficient batching
|
||||||
- 🧠 **Whisper Models**: Support for all Whisper models (tiny to large-v3)
|
- **Whisper Models**: Support for all Whisper models (tiny to large-v3)
|
||||||
- 🎤 **Real-time Streaming**: Bidirectional streaming for live transcription
|
- **Real-time Streaming**: Bidirectional streaming for live transcription
|
||||||
- 🔇 **Voice Activity Detection**: Smart VAD to filter silence and noise
|
- **Voice Activity Detection**: Smart VAD to filter silence and noise
|
||||||
- 🚫 **Anti-hallucination**: Advanced filtering to reduce Whisper hallucinations
|
- **Anti-hallucination**: Advanced filtering to reduce Whisper hallucinations
|
||||||
- 🐳 **Docker Ready**: Easy deployment with GPU support
|
- **Docker Ready**: Easy deployment with GPU support
|
||||||
- 📊 **Interactive Docs**: Auto-generated API documentation (Swagger/OpenAPI)
|
- **Interactive Docs**: Auto-generated API documentation (Swagger/OpenAPI)
|
||||||
|
|
||||||
## Quick Start
|
## Quick Start
|
||||||
|
|
||||||
|
|||||||
@@ -207,7 +207,7 @@ async fn main() -> Result<()> {
|
|||||||
}
|
}
|
||||||
|
|
||||||
println!("\n{}", "─".repeat(80));
|
println!("\n{}", "─".repeat(80));
|
||||||
println!("✅ Playback and transcription complete!");
|
println!("Playback and transcription complete!");
|
||||||
|
|
||||||
// Keep the program alive until playback finishes
|
// Keep the program alive until playback finishes
|
||||||
time::sleep(Duration::from_secs(2)).await;
|
time::sleep(Duration::from_secs(2)).await;
|
||||||
|
|||||||
@@ -118,7 +118,7 @@ async fn main() -> Result<()> {
|
|||||||
fn list_audio_devices() -> Result<()> {
|
fn list_audio_devices() -> Result<()> {
|
||||||
let host = cpal::default_host();
|
let host = cpal::default_host();
|
||||||
|
|
||||||
println!("\n📊 Available Audio Devices:");
|
println!("\n Available Audio Devices:");
|
||||||
println!("{}", "─".repeat(80));
|
println!("{}", "─".repeat(80));
|
||||||
|
|
||||||
// List input devices
|
// List input devices
|
||||||
@@ -138,10 +138,10 @@ fn list_audio_devices() -> Result<()> {
|
|||||||
|
|
||||||
// Show default device
|
// Show default device
|
||||||
if let Some(device) = host.default_input_device() {
|
if let Some(device) = host.default_input_device() {
|
||||||
println!("\n⭐ Default Input: {}", device.name()?);
|
println!("\n Default Input: {}", device.name()?);
|
||||||
}
|
}
|
||||||
|
|
||||||
println!("\n💡 Tips for capturing system audio:");
|
println!("\n Tips for capturing system audio:");
|
||||||
println!(" Linux: Look for devices with 'monitor' in the name (PulseAudio/PipeWire)");
|
println!(" Linux: Look for devices with 'monitor' in the name (PulseAudio/PipeWire)");
|
||||||
println!(" Windows: Install VB-Cable or enable 'Stereo Mix' in sound settings");
|
println!(" Windows: Install VB-Cable or enable 'Stereo Mix' in sound settings");
|
||||||
println!(" macOS: Install BlackHole or Loopback for system audio capture");
|
println!(" macOS: Install BlackHole or Loopback for system audio capture");
|
||||||
|
|||||||
@@ -18,7 +18,7 @@ NC='\033[0m' # No Color
|
|||||||
# Check dependencies
|
# Check dependencies
|
||||||
check_dependency() {
|
check_dependency() {
|
||||||
if ! command -v $1 &> /dev/null; then
|
if ! command -v $1 &> /dev/null; then
|
||||||
echo -e "${RED}❌ $1 not found.${NC}"
|
echo -e "${RED} $1 not found.${NC}"
|
||||||
echo "Please install: sudo apt-get install $2"
|
echo "Please install: sudo apt-get install $2"
|
||||||
return 1
|
return 1
|
||||||
fi
|
fi
|
||||||
@@ -27,7 +27,7 @@ check_dependency() {
|
|||||||
|
|
||||||
echo "Checking dependencies..."
|
echo "Checking dependencies..."
|
||||||
check_dependency "parec" "pulseaudio-utils" || exit 1
|
check_dependency "parec" "pulseaudio-utils" || exit 1
|
||||||
check_dependency "sox" "sox" || echo -e "${YELLOW}⚠️ sox not installed (optional but recommended)${NC}"
|
check_dependency "sox" "sox" || echo -e "${YELLOW} sox not installed (optional but recommended)${NC}"
|
||||||
|
|
||||||
# Function to find the monitor source for system audio
|
# Function to find the monitor source for system audio
|
||||||
find_monitor_source() {
|
find_monitor_source() {
|
||||||
@@ -52,11 +52,11 @@ find_monitor_source() {
|
|||||||
|
|
||||||
# List available sources
|
# List available sources
|
||||||
if [ "$1" == "--list" ]; then
|
if [ "$1" == "--list" ]; then
|
||||||
echo -e "${GREEN}📊 Available Audio Sources:${NC}"
|
echo -e "${GREEN} Available Audio Sources:${NC}"
|
||||||
echo ""
|
echo ""
|
||||||
pactl list sources short 2>/dev/null || pacmd list-sources 2>/dev/null | grep "name:"
|
pactl list sources short 2>/dev/null || pacmd list-sources 2>/dev/null | grep "name:"
|
||||||
echo ""
|
echo ""
|
||||||
echo -e "${GREEN}💡 Monitor sources (system audio):${NC}"
|
echo -e "${GREEN} Monitor sources (system audio):${NC}"
|
||||||
pactl list sources short 2>/dev/null | grep -i "monitor" || echo "No monitor sources found"
|
pactl list sources short 2>/dev/null | grep -i "monitor" || echo "No monitor sources found"
|
||||||
exit 0
|
exit 0
|
||||||
fi
|
fi
|
||||||
@@ -82,24 +82,24 @@ fi
|
|||||||
|
|
||||||
# Determine what to capture
|
# Determine what to capture
|
||||||
if [ "$1" == "--microphone" ]; then
|
if [ "$1" == "--microphone" ]; then
|
||||||
echo -e "${GREEN}🎤 Using microphone input${NC}"
|
echo -e "${GREEN} Using microphone input${NC}"
|
||||||
# Run the existing live-transcribe for microphone
|
# Run the existing live-transcribe for microphone
|
||||||
exec cargo run --bin live-transcribe
|
exec cargo run --bin live-transcribe
|
||||||
exit 0
|
exit 0
|
||||||
elif [ "$1" == "--combined" ]; then
|
elif [ "$1" == "--combined" ]; then
|
||||||
echo -e "${YELLOW}🎤+🔊 Combined audio capture not yet implemented${NC}"
|
echo -e "${YELLOW}+ Combined audio capture not yet implemented${NC}"
|
||||||
echo "For now, please run two separate instances:"
|
echo "For now, please run two separate instances:"
|
||||||
echo " 1. $0 (for system audio)"
|
echo " 1. $0 (for system audio)"
|
||||||
echo " 2. $0 --microphone (for mic)"
|
echo " 2. $0 --microphone (for mic)"
|
||||||
exit 1
|
exit 1
|
||||||
elif [ "$1" == "--source" ] && [ -n "$2" ]; then
|
elif [ "$1" == "--source" ] && [ -n "$2" ]; then
|
||||||
SOURCE="$2"
|
SOURCE="$2"
|
||||||
echo -e "${GREEN}📡 Using specified source: $SOURCE${NC}"
|
echo -e "${GREEN} Using specified source: $SOURCE${NC}"
|
||||||
else
|
else
|
||||||
# Auto-detect monitor source
|
# Auto-detect monitor source
|
||||||
SOURCE=$(find_monitor_source)
|
SOURCE=$(find_monitor_source)
|
||||||
if [ -z "$SOURCE" ]; then
|
if [ -z "$SOURCE" ]; then
|
||||||
echo -e "${RED}❌ Could not find system audio monitor source${NC}"
|
echo -e "${RED} Could not find system audio monitor source${NC}"
|
||||||
echo ""
|
echo ""
|
||||||
echo "This might happen if:"
|
echo "This might happen if:"
|
||||||
echo " 1. No audio is currently playing"
|
echo " 1. No audio is currently playing"
|
||||||
@@ -111,14 +111,14 @@ else
|
|||||||
echo " 3. Use a specific source: $0 --source <source_name>"
|
echo " 3. Use a specific source: $0 --source <source_name>"
|
||||||
exit 1
|
exit 1
|
||||||
fi
|
fi
|
||||||
echo -e "${GREEN}📡 Found system audio source: $SOURCE${NC}"
|
echo -e "${GREEN} Found system audio source: $SOURCE${NC}"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
echo ""
|
echo ""
|
||||||
echo -e "${GREEN}🎬 Starting video call transcription...${NC}"
|
echo -e "${GREEN} Starting video call transcription...${NC}"
|
||||||
echo -e "${YELLOW}Press Ctrl+C to stop${NC}"
|
echo -e "${YELLOW}Press Ctrl+C to stop${NC}"
|
||||||
echo ""
|
echo ""
|
||||||
echo "💡 Tips for best results:"
|
echo " Tips for best results:"
|
||||||
echo " • Join your video call first"
|
echo " • Join your video call first"
|
||||||
echo " • Use headphones to avoid echo"
|
echo " • Use headphones to avoid echo"
|
||||||
echo " • Close other audio sources (music, videos)"
|
echo " • Close other audio sources (music, videos)"
|
||||||
|
|||||||
Reference in New Issue
Block a user