# Transcription API Service

A high-performance, standalone transcription service with gRPC and WebSocket support, optimized for real-time speech-to-text applications. Perfect for desktop applications, web services, and IoT devices.

## Features

- **Dual Protocol Support**: Both gRPC (recommended) and WebSocket
- **Real-Time Streaming**: Bidirectional audio streaming with immediate transcription
- **Multiple Models**: Support for all Whisper models (tiny to large-v3)
- **Language Support**: 50+ languages with automatic detection
- **Docker Ready**: Simple deployment with Docker Compose
- **Production Ready**: Health checks, monitoring, and graceful shutdown
- **Rust Client Examples**: Ready-to-use Rust client for desktop applications

## Quick Start

### Using Docker Compose (Recommended)

```bash
# Clone the repository
git clone https://github.com/aljazceru/transcription-api.git
cd transcription-api

# Start the service (uses 'base' model by default)
docker compose up -d

# Check logs
docker compose logs -f

# Stop the service
docker compose down
```

### Configuration

Edit `.env` or `docker-compose.yml` to configure:

```env
MODEL_PATH=base          # tiny, base, small, medium, large, large-v3
GRPC_PORT=50051          # gRPC service port
WEBSOCKET_PORT=8765      # WebSocket service port
ENABLE_WEBSOCKET=true    # Enable WebSocket support
CUDA_VISIBLE_DEVICES=0   # GPU device ID (if available)
```

## API Protocols

### gRPC (Recommended for Desktop Apps)

**Why gRPC?**
- Strongly typed with Protocol Buffers
- Excellent performance with HTTP/2
- Built-in streaming support
- Auto-generated client code
- Better error handling

**Proto Definition**: See `proto/transcription.proto`

**Service Methods**:
- `StreamTranscribe`: Bidirectional streaming for real-time transcription
- `TranscribeFile`: Single file transcription
- `GetCapabilities`: Query available models and languages
- `HealthCheck`: Service health status

### WebSocket (Alternative)

**Protocol**:
```javascript
// Connect
ws://localhost:8765

// Send audio
{
  "type": "audio",
  "data": "base64_encoded_pcm16_audio"
}

// Receive transcription
{
  "type": "transcription",
  "text": "Hello world",
  "start_time": 0.0,
  "end_time": 1.5,
  "is_final": true,
  "timestamp": 1234567890
}

// Stop
{
  "type": "stop"
}
```
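
Below is a minimal Python client sketch for this message flow. It assumes the third-party `websockets` package, a buffer of 16 kHz mono PCM16 audio (`pcm_bytes`), and that the server keeps the socket open until it has sent its results; the field names follow the messages shown above.

```python
# Minimal WebSocket client sketch (assumes: pip install websockets).
import asyncio
import base64
import json

import websockets


async def transcribe(pcm_bytes: bytes) -> None:
    async with websockets.connect("ws://localhost:8765") as ws:
        # Send one base64-encoded PCM16 chunk (16 kHz, mono), then signal stop
        await ws.send(json.dumps({"type": "audio",
                                  "data": base64.b64encode(pcm_bytes).decode()}))
        await ws.send(json.dumps({"type": "stop"}))

        # Print transcription messages until the server closes the connection
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "transcription":
                print(msg["text"], "(final)" if msg.get("is_final") else "(partial)")


# asyncio.run(transcribe(open("audio.pcm", "rb").read()))
```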

## Rust Client Usage

### Installation

```toml
# Add to your Cargo.toml
[dependencies]
tonic = "0.10"
tokio = { version = "1.35", features = ["full"] }
# ... see examples/rust-client/Cargo.toml for full list
```

### Live Microphone Transcription

```rust
use transcription_client::TranscriptionClient;

#[tokio::main]
async fn main() -> Result<()> {
    // Connect to service
    let mut client = TranscriptionClient::connect("http://localhost:50051").await?;

    // Start streaming from microphone
    let mut stream = client.stream_from_microphone(
        "auto",       // language
        "transcribe", // task
        "base"        // model
    ).await?;

    // Process transcriptions as they arrive (`next` comes from StreamExt)
    while let Some(transcription) = stream.next().await {
        println!("{}", transcription.text);
    }

    Ok(())
}
```

### Build and Run Examples

```bash
cd examples/rust-client

# Build
cargo build --release

# Run live transcription from microphone
cargo run --bin live-transcribe

# Transcribe a file
cargo run --bin file-transcribe -- audio.wav

# Stream a WAV file
cargo run --bin stream-transcribe -- audio.wav --realtime
```

## Audio Requirements

- **Format**: PCM16 (16-bit signed integer); see the conversion sketch after this list
- **Sample Rate**: 16kHz
- **Channels**: Mono
- **Chunk Size**: Minimum ~500 bytes (flexible for real-time)
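
If your source audio is not already in this format, one option is to shell out to `ffmpeg` (not part of this repository; assumed to be installed) and get back raw bytes ready to stream:

```python
# Sketch: convert any audio file to raw PCM16, mono, 16 kHz via ffmpeg.
import subprocess


def to_pcm16_mono_16k(path: str) -> bytes:
    """Return raw little-endian 16-bit PCM at 16 kHz, mono."""
    result = subprocess.run(
        ["ffmpeg", "-i", path,
         "-f", "s16le",           # raw signed 16-bit little-endian samples
         "-acodec", "pcm_s16le",
         "-ac", "1",              # mono
         "-ar", "16000",          # 16 kHz
         "-"],                    # write to stdout
        capture_output=True,
        check=True,
    )
    return result.stdout


# pcm_bytes = to_pcm16_mono_16k("speech.mp3")
```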

## Performance Optimization

### For Real-Time Applications

1. **Use gRPC**: Lower latency than WebSocket
2. **Small Chunks**: Send audio in 0.5-1 second chunks (see the sketch after this list)
3. **Model Selection**:
   - `tiny`: Fastest, lowest accuracy (real-time on CPU)
   - `base`: Good balance (near real-time on CPU)
   - `small`: Better accuracy (may lag on CPU)
   - `large-v3`: Best accuracy (requires GPU for real-time)
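
As a concrete example of the chunking rule: at 16 kHz mono PCM16, one second of audio is 16,000 samples × 2 bytes = 32,000 bytes, so a 0.5 s chunk is 16,000 bytes. A simple splitter (illustrative only, not part of the repository):

```python
# Split a PCM16 mono 16 kHz buffer into ~0.5 s chunks for streaming.
SAMPLE_RATE = 16_000      # samples per second
BYTES_PER_SAMPLE = 2      # PCM16 = 2 bytes per sample
CHUNK_SECONDS = 0.5

CHUNK_BYTES = int(SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS)  # 16,000 bytes


def chunks(pcm: bytes, size: int = CHUNK_BYTES):
    """Yield fixed-size audio chunks; the last one may be shorter."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```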

### GPU Acceleration

```yaml
# docker-compose.yml
environment:
  - CUDA_VISIBLE_DEVICES=0
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```
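
Assuming the Whisper backend is PyTorch-based, a quick way to confirm the GPU is actually visible inside the container (PyTorch here is an assumption about the backend, not this project's API):

```python
# Check that CUDA is visible to PyTorch inside the container.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```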

## Architecture

```
┌─────────────┐
│  Rust App   │
│  (Desktop)  │
└──────┬──────┘
       │ gRPC/HTTP2
       ▼
┌─────────────┐
│Transcription│
│   Service   │
│  ┌────────┐ │
│  │Whisper │ │
│  │ Model  │ │
│  └────────┘ │
└─────────────┘
```

### Components

1. **gRPC Server**: Handles streaming audio and returns transcriptions
2. **WebSocket Server**: Alternative protocol for web clients
3. **Transcription Engine**: Whisper/SimulStreaming for speech-to-text
4. **Session Manager**: Handles multiple concurrent streams
5. **Model Cache**: Prevents re-downloading models

## Advanced Configuration

### Using SimulStreaming

For even lower latency, mount SimulStreaming:

```yaml
volumes:
  - ./SimulStreaming:/app/SimulStreaming
environment:
  - SIMULSTREAMING_PATH=/app/SimulStreaming
```

### Custom Models

Mount your own Whisper models:

```yaml
volumes:
  - ./models:/app/models
environment:
  - MODEL_PATH=/app/models/custom-model.pt
```

### Monitoring

The service exposes metrics on `/metrics` (when enabled):

```bash
curl http://localhost:9090/metrics
```

## API Reference

### gRPC Methods

#### StreamTranscribe
```protobuf
rpc StreamTranscribe(stream AudioChunk) returns (stream TranscriptionResult);
```

Bidirectional streaming for real-time transcription. Send audio chunks, receive transcriptions.
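
A sketch of calling this from Python, assuming stubs generated as shown under "Building from Source" (`transcription_pb2` / `transcription_pb2_grpc`); the `data` field on `AudioChunk` and `text` on `TranscriptionResult` are assumptions, so check `proto/transcription.proto` for the actual schema:

```python
# Streaming gRPC client sketch (pip install grpcio; stubs generated from the proto).
import grpc

import transcription_pb2
import transcription_pb2_grpc


def audio_chunks(pcm: bytes, chunk_bytes: int = 16_000):
    """Yield AudioChunk messages built from a raw PCM16 buffer."""
    for offset in range(0, len(pcm), chunk_bytes):
        # 'data' is an assumed field name for the raw audio payload
        yield transcription_pb2.AudioChunk(data=pcm[offset:offset + chunk_bytes])


def stream_transcribe(pcm: bytes) -> None:
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = transcription_pb2_grpc.TranscriptionServiceStub(channel)
        # Bidirectional stream: send chunks, iterate results as they arrive
        for result in stub.StreamTranscribe(audio_chunks(pcm)):
            print(result.text)  # 'text' is assumed
```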

#### TranscribeFile
```protobuf
rpc TranscribeFile(AudioFile) returns (TranscriptionResponse);
```

Transcribe a complete audio file in one request.

#### GetCapabilities
```protobuf
rpc GetCapabilities(Empty) returns (Capabilities);
```

Query available models, languages, and features.

#### HealthCheck
```protobuf
rpc HealthCheck(Empty) returns (HealthStatus);
```

Check service health and status.
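
The unary methods follow the same pattern; a short sketch, again assuming the generated stubs and that `Empty` is defined in `proto/transcription.proto`:

```python
# Unary gRPC calls sketch: query capabilities and health.
import grpc

import transcription_pb2
import transcription_pb2_grpc

with grpc.insecure_channel("localhost:50051") as channel:
    stub = transcription_pb2_grpc.TranscriptionServiceStub(channel)

    caps = stub.GetCapabilities(transcription_pb2.Empty())
    print(caps)    # available models, languages, features

    health = stub.HealthCheck(transcription_pb2.Empty())
    print(health)  # service health status
```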

## Language Support

Supports 50+ languages including:
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
- Chinese (zh)
- Japanese (ja)
- Korean (ko)
- And many more...

Use `"auto"` for automatic language detection.

## Troubleshooting

### Service won't start
- Check if ports 50051 and 8765 are available
- Ensure Docker has enough memory (minimum 4GB)
- Check logs: `docker compose logs transcription-api`

### Slow transcription
- Use a smaller model (tiny or base)
- Enable GPU if available
- Downsample audio to 16kHz mono before sending
- Send smaller chunks more frequently

### Connection refused
- Check firewall settings
- Ensure the service is running: `docker compose ps`
- Verify correct ports in client configuration

### High memory usage
- Models are cached in memory for performance
- Use smaller models on memory-limited systems
- Set memory limits in docker-compose.yml

## Development

### Building from Source

```bash
# Install dependencies
pip install -r requirements.txt

# Generate gRPC code
python -m grpc_tools.protoc \
    -I./proto \
    --python_out=./src \
    --grpc_python_out=./src \
    ./proto/transcription.proto

# Run the service
python src/transcription_server.py
```

### Running Tests

```bash
# Test gRPC connection
grpcurl -plaintext localhost:50051 list

# Test health check
grpcurl -plaintext localhost:50051 transcription.TranscriptionService/HealthCheck

# Test with example audio
python test_client.py
```

## Production Deployment

### Docker Swarm

```bash
docker stack deploy -c docker-compose.yml transcription
```

### Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: transcription-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: transcription-api
  template:
    metadata:
      labels:
        app: transcription-api
    spec:
      containers:
        - name: transcription-api
          image: transcription-api:latest
          ports:
            - containerPort: 50051
              name: grpc
            - containerPort: 8765
              name: websocket
          env:
            - name: MODEL_PATH
              value: "base"
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
```

### Security

For production:
1. Enable TLS for gRPC (see the sketch after this list)
2. Use WSS for WebSocket
3. Add authentication
4. Apply rate limiting
5. Validate all input
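
For item 1, a client-side sketch of connecting over TLS with `grpcio` (server-side TLS setup is not shown; the hostname and certificate path are placeholders):

```python
# Connect to the gRPC endpoint over TLS instead of an insecure channel.
import grpc

with open("ca.pem", "rb") as f:  # CA certificate that signed the server cert
    creds = grpc.ssl_channel_credentials(root_certificates=f.read())

channel = grpc.secure_channel("transcription.example.com:50051", creds)
# stub = transcription_pb2_grpc.TranscriptionServiceStub(channel)
```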

## License

MIT License - See LICENSE file for details

## Contributing

Contributions welcome! Please read CONTRIBUTING.md for guidelines.

## Support

- GitHub Issues: [Report bugs or request features]
- Documentation: [Full API documentation]
- Examples: See `examples/` directory