# Transcription API Service

A high-performance, standalone transcription service with gRPC and WebSocket support, optimized for real-time speech-to-text applications. Well suited to desktop applications, web services, and IoT devices.

## Features

- **Dual Protocol Support**: Both gRPC (recommended) and WebSocket
- **Real-Time Streaming**: Bidirectional audio streaming with immediate transcription
- **Multiple Models**: Support for all Whisper models (tiny to large-v3)
- **Language Support**: 50+ languages with automatic detection
- **Docker Ready**: Simple deployment with Docker Compose
- **Production Ready**: Health checks, monitoring, and graceful shutdown
- **Rust Client Examples**: Ready-to-use Rust client for desktop applications

## Quick Start

### Using Docker Compose (Recommended)

```bash
# Clone the repository and enter it
cd transcription-api

# Start the service (uses the 'base' model by default)
docker compose up -d

# Check logs
docker compose logs -f

# Stop the service
docker compose down
```

### Configuration

Edit `.env` or `docker-compose.yml` to configure:

```env
MODEL_PATH=base           # tiny, base, small, medium, large, large-v3
GRPC_PORT=50051           # gRPC service port
WEBSOCKET_PORT=8765       # WebSocket service port
ENABLE_WEBSOCKET=true     # Enable WebSocket support
CUDA_VISIBLE_DEVICES=0    # GPU device ID (if available)
```

## API Protocols

### gRPC (Recommended for Desktop Apps)

**Why gRPC?**

- Strongly typed with Protocol Buffers
- Excellent performance with HTTP/2
- Built-in streaming support
- Auto-generated client code
- Better error handling

**Proto Definition**: See `proto/transcription.proto`

**Service Methods**:

- `StreamTranscribe`: Bidirectional streaming for real-time transcription
- `TranscribeFile`: Single file transcription
- `GetCapabilities`: Query available models and languages
- `HealthCheck`: Service health status

### WebSocket (Alternative)

**Protocol**:

```javascript
// Connect
ws://localhost:8765

// Send audio
{
  "type": "audio",
  "data": "base64_encoded_pcm16_audio"
}

// Receive transcription
{
  "type": "transcription",
  "text": "Hello world",
  "start_time": 0.0,
  "end_time": 1.5,
  "is_final": true,
  "timestamp": 1234567890
}

// Stop
{
  "type": "stop"
}
```

## Rust Client Usage

### Installation

```toml
# Add to your Cargo.toml
[dependencies]
tonic = "0.10"
tokio = { version = "1.35", features = ["full"] }
# ... see examples/rust-client/Cargo.toml for the full list
```

### Live Microphone Transcription

```rust
use transcription_client::TranscriptionClient;

#[tokio::main]
async fn main() -> Result<()> {
    // Connect to the service
    let mut client = TranscriptionClient::connect("http://localhost:50051").await?;

    // Start streaming from the microphone
    let mut stream = client.stream_from_microphone(
        "auto",       // language
        "transcribe", // task
        "base",       // model
    ).await?;

    // Process transcriptions as they arrive
    while let Some(transcription) = stream.next().await {
        println!("{}", transcription.text);
    }

    Ok(())
}
```

### Build and Run Examples

```bash
cd examples/rust-client

# Build
cargo build --release

# Run live transcription from the microphone
cargo run --bin live-transcribe

# Transcribe a file
cargo run --bin file-transcribe -- audio.wav

# Stream a WAV file
cargo run --bin stream-transcribe -- audio.wav --realtime
```

## Audio Requirements

- **Format**: PCM16 (16-bit signed integer)
- **Sample Rate**: 16kHz
- **Channels**: Mono
- **Chunk Size**: Minimum ~500 bytes (flexible for real-time)
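Audio that is not already in this format has to be converted and resampled before it is sent. Below is a minimal sketch of one way to do that, assuming `ffmpeg` is installed and on `PATH`; the file name `audio.wav` is only an example:

```python
import subprocess

SAMPLE_RATE = 16_000   # Hz, as required by the service
BYTES_PER_SAMPLE = 2   # PCM16 = 2 bytes per sample
CHUNK_SECONDS = 0.5    # 0.5-1 second chunks work well for streaming


def to_pcm16_mono_16k(path: str) -> bytes:
    """Decode any audio file to raw PCM16, 16 kHz, mono via ffmpeg."""
    result = subprocess.run(
        ["ffmpeg", "-i", path,
         "-f", "s16le", "-acodec", "pcm_s16le",
         "-ac", "1", "-ar", str(SAMPLE_RATE),
         "pipe:1"],
        capture_output=True,
        check=True,
    )
    return result.stdout


def chunks(pcm: bytes, seconds: float = CHUNK_SECONDS):
    """Yield fixed-duration slices of raw PCM16 audio for streaming."""
    step = int(SAMPLE_RATE * BYTES_PER_SAMPLE * seconds)
    for offset in range(0, len(pcm), step):
        yield pcm[offset:offset + step]


if __name__ == "__main__":
    pcm = to_pcm16_mono_16k("audio.wav")
    print(f"{len(pcm)} bytes of PCM16 -> {len(list(chunks(pcm)))} chunks")
```

Each 0.5-second chunk at 16kHz mono PCM16 is 16,000 bytes, comfortably above the ~500-byte minimum.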
## Performance Optimization

### For Real-Time Applications

1. **Use gRPC**: Lower latency than WebSocket
2. **Small Chunks**: Send audio in 0.5-1 second chunks
3. **Model Selection**:
   - `tiny`: Fastest, lowest accuracy (real-time on CPU)
   - `base`: Good balance (near real-time on CPU)
   - `small`: Better accuracy (may lag on CPU)
   - `large-v3`: Best accuracy (requires GPU for real-time)

### GPU Acceleration

```yaml
# docker-compose.yml
environment:
  - CUDA_VISIBLE_DEVICES=0
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```

## Architecture

```
┌─────────────┐
│  Rust App   │
│  (Desktop)  │
└──────┬──────┘
       │ gRPC/HTTP2
       ▼
┌─────────────┐
│Transcription│
│   Service   │
│ ┌─────────┐ │
│ │ Whisper │ │
│ │  Model  │ │
│ └─────────┘ │
└─────────────┘
```

### Components

1. **gRPC Server**: Handles streaming audio and returns transcriptions
2. **WebSocket Server**: Alternative protocol for web clients
3. **Transcription Engine**: Whisper/SimulStreaming for speech-to-text
4. **Session Manager**: Handles multiple concurrent streams
5. **Model Cache**: Prevents re-downloading models

## Advanced Configuration

### Using SimulStreaming

For even lower latency, mount SimulStreaming:

```yaml
volumes:
  - ./SimulStreaming:/app/SimulStreaming
environment:
  - SIMULSTREAMING_PATH=/app/SimulStreaming
```

### Custom Models

Mount your own Whisper models:

```yaml
volumes:
  - ./models:/app/models
environment:
  - MODEL_PATH=/app/models/custom-model.pt
```

### Monitoring

The service exposes metrics on `/metrics` (when enabled):

```bash
curl http://localhost:9090/metrics
```

## API Reference

### gRPC Methods

#### StreamTranscribe

```protobuf
rpc StreamTranscribe(stream AudioChunk) returns (stream TranscriptionResult);
```

Bidirectional streaming for real-time transcription. Send audio chunks, receive transcriptions.

#### TranscribeFile

```protobuf
rpc TranscribeFile(AudioFile) returns (TranscriptionResponse);
```

Transcribe a complete audio file in one request.

#### GetCapabilities

```protobuf
rpc GetCapabilities(Empty) returns (Capabilities);
```

Query available models, languages, and features.

#### HealthCheck

```protobuf
rpc HealthCheck(Empty) returns (HealthStatus);
```

Check service health and status.
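For quick experiments against the unary methods above without writing a full client, the stubs generated in the Development section below can be used directly from Python. This is a minimal sketch, assuming the stubs were generated from `proto/transcription.proto` with `grpc_tools.protoc` (the module names `transcription_pb2` / `transcription_pb2_grpc` follow protoc's naming convention):

```python
# probe.py - quick HealthCheck / GetCapabilities probe over plaintext gRPC
import grpc

import transcription_pb2
import transcription_pb2_grpc


def main(target: str = "localhost:50051") -> None:
    # Plaintext channel for local testing; use grpc.secure_channel() once TLS is enabled
    with grpc.insecure_channel(target) as channel:
        stub = transcription_pb2_grpc.TranscriptionServiceStub(channel)

        # HealthCheck(Empty) -> HealthStatus
        print(stub.HealthCheck(transcription_pb2.Empty()))

        # GetCapabilities(Empty) -> Capabilities (models, languages, features)
        print(stub.GetCapabilities(transcription_pb2.Empty()))


if __name__ == "__main__":
    main()
```

`grpcurl` (see Running Tests below) covers the same ground from the command line; the Python version is mainly useful as a starting point for scripting against `StreamTranscribe`.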
## Language Support

Supports 50+ languages, including:

- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
- Chinese (zh)
- Japanese (ja)
- Korean (ko)
- And many more...

Use `"auto"` for automatic language detection.

## Troubleshooting

### Service won't start

- Check that ports 50051 and 8765 are available
- Ensure Docker has enough memory (minimum 4GB)
- Check the logs: `docker compose logs transcription-api`

### Slow transcription

- Use a smaller model (tiny or base)
- Enable GPU acceleration if available
- Reduce audio quality to 16kHz mono
- Send smaller chunks more frequently

### Connection refused

- Check firewall settings
- Ensure the service is running: `docker compose ps`
- Verify the ports in the client configuration

### High memory usage

- Models are cached in memory for performance
- Use smaller models on memory-constrained systems
- Set memory limits in `docker-compose.yml`

## Development

### Building from Source

```bash
# Install dependencies
pip install -r requirements.txt

# Generate gRPC code
python -m grpc_tools.protoc \
    -I./proto \
    --python_out=./src \
    --grpc_python_out=./src \
    ./proto/transcription.proto

# Run the service
python src/transcription_server.py
```

### Running Tests

```bash
# Test the gRPC connection
grpcurl -plaintext localhost:50051 list

# Test the health check
grpcurl -plaintext localhost:50051 transcription.TranscriptionService/HealthCheck

# Test with example audio
python test_client.py
```

## Production Deployment

### Docker Swarm

```bash
docker stack deploy -c docker-compose.yml transcription
```

### Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: transcription-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: transcription-api
  template:
    metadata:
      labels:
        app: transcription-api
    spec:
      containers:
        - name: transcription-api
          image: transcription-api:latest
          ports:
            - containerPort: 50051
              name: grpc
            - containerPort: 8765
              name: websocket
          env:
            - name: MODEL_PATH
              value: "base"
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
```

### Security

For production deployments:

1. Enable TLS for gRPC
2. Use WSS for WebSocket connections
3. Add authentication
4. Enforce rate limiting
5. Validate all input

## License

MIT License - see the LICENSE file for details.

## Contributing

Contributions are welcome! Please read CONTRIBUTING.md for guidelines.

## Support

- GitHub Issues: [Report bugs or request features]
- Documentation: [Full API documentation]
- Examples: See the `examples/` directory