mirror of
https://github.com/aljazceru/transcription-api.git
synced 2025-12-17 07:14:24 +01:00
# Transcription API Service

A high-performance, standalone transcription service with gRPC and WebSocket support, optimized for real-time speech-to-text applications. Well suited to desktop applications, web services, and IoT devices.
## Features

- **Dual Protocol Support**: Both gRPC (recommended) and WebSocket
- **Real-Time Streaming**: Bidirectional audio streaming with immediate transcription
- **Multiple Models**: Support for all Whisper models (tiny to large-v3)
- **Language Support**: 50+ languages with automatic detection
- **Docker Ready**: Simple deployment with Docker Compose
- **Production Ready**: Health checks, monitoring, and graceful shutdown
- **Rust Client Examples**: Ready-to-use Rust clients for desktop applications
## Quick Start

### Using Docker Compose (Recommended)

```bash
# Clone the repository
git clone https://github.com/aljazceru/transcription-api.git
cd transcription-api

# Start the service (uses 'base' model by default)
docker compose up -d

# Check logs
docker compose logs -f

# Stop the service
docker compose down
```
### Configuration

Edit `.env` or `docker-compose.yml` to configure:

```env
MODEL_PATH=base          # tiny, base, small, medium, large, large-v3
GRPC_PORT=50051          # gRPC service port
WEBSOCKET_PORT=8765      # WebSocket service port
ENABLE_WEBSOCKET=true    # Enable WebSocket support
CUDA_VISIBLE_DEVICES=0   # GPU device ID (if available)
```
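A launcher or wrapper script can pick these settings up from the environment. The sketch below is a minimal illustration using only the Python standard library, assuming the defaults documented above when a variable is unset; `load_settings` is a hypothetical helper name, not part of the service's code.

```python
import os

def load_settings() -> dict:
    """Read the documented environment variables, falling back to their defaults."""
    return {
        "model": os.environ.get("MODEL_PATH", "base"),
        "grpc_port": int(os.environ.get("GRPC_PORT", "50051")),
        "websocket_port": int(os.environ.get("WEBSOCKET_PORT", "8765")),
        "enable_websocket": os.environ.get("ENABLE_WEBSOCKET", "true").lower() == "true",
    }

settings = load_settings()
print(settings["grpc_port"])
```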
## API Protocols

### gRPC (Recommended for Desktop Apps)

**Why gRPC?**
- Strongly typed with Protocol Buffers
- Excellent performance over HTTP/2
- Built-in streaming support
- Auto-generated client code
- Richer error handling

**Proto Definition**: See `proto/transcription.proto`

**Service Methods**:
- `StreamTranscribe`: Bidirectional streaming for real-time transcription
- `TranscribeFile`: Single file transcription
- `GetCapabilities`: Query available models and languages
- `HealthCheck`: Service health status
### WebSocket (Alternative)

**Protocol**:
```javascript
// Connect
ws://localhost:8765

// Send audio
{
  "type": "audio",
  "data": "base64_encoded_pcm16_audio"
}

// Receive transcription
{
  "type": "transcription",
  "text": "Hello world",
  "start_time": 0.0,
  "end_time": 1.5,
  "is_final": true,
  "timestamp": 1234567890
}

// Stop
{
  "type": "stop"
}
```
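A client frames each audio chunk by base64-encoding the raw PCM16 bytes into the `"audio"` message shown above. Here is a stdlib-only Python sketch of that framing; it assumes little-endian sample order (the common convention for PCM16, though the source does not state it) and `audio_message` is an illustrative helper name.

```python
import base64
import json
import struct

def audio_message(samples: list[int]) -> str:
    """Pack 16-bit mono samples into the JSON 'audio' frame described above."""
    # PCM16: signed 16-bit integers, assumed little-endian
    pcm16 = struct.pack(f"<{len(samples)}h", *samples)
    return json.dumps({
        "type": "audio",
        "data": base64.b64encode(pcm16).decode("ascii"),
    })

msg = audio_message([0, 1000, -1000, 32767])
print(msg)
```

The resulting string can be sent as a WebSocket text frame; the server's `"transcription"` replies are plain JSON and can be decoded with `json.loads`.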
## Rust Client Usage

### Installation

```toml
# Add to your Cargo.toml
[dependencies]
tonic = "0.10"
tokio = { version = "1.35", features = ["full"] }
# ... see examples/rust-client/Cargo.toml for full list
```
### Live Microphone Transcription

```rust
use futures::StreamExt; // `futures` crate, for `stream.next()`
use transcription_client::TranscriptionClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to the service
    let mut client = TranscriptionClient::connect("http://localhost:50051").await?;

    // Start streaming from the microphone
    let mut stream = client.stream_from_microphone(
        "auto",       // language ("auto" = detect)
        "transcribe", // task
        "base",       // model
    ).await?;

    // Process transcriptions as they arrive
    while let Some(transcription) = stream.next().await {
        println!("{}", transcription.text);
    }

    Ok(())
}
```
### Build and Run Examples

```bash
cd examples/rust-client

# Build
cargo build --release

# Run live transcription from microphone
cargo run --bin live-transcribe

# Transcribe a file
cargo run --bin file-transcribe -- audio.wav

# Stream a WAV file
cargo run --bin stream-transcribe -- audio.wav --realtime
```
## Audio Requirements

- **Format**: PCM16 (16-bit signed integer)
- **Sample Rate**: 16 kHz
- **Channels**: Mono
- **Chunk Size**: Minimum ~500 bytes (flexible for real-time)
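Audio libraries often hand back float samples in [-1.0, 1.0], so a client typically has to convert to this PCM16 format before sending. A minimal stdlib-only sketch of that conversion (assuming little-endian byte order; `floats_to_pcm16` is an illustrative name):

```python
import struct

def floats_to_pcm16(samples: list[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian PCM16 bytes."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clip to the valid range
        ints.append(int(s * 32767))  # scale to signed 16-bit
    return struct.pack(f"<{len(ints)}h", *ints)

chunk = floats_to_pcm16([0.0, 0.5, -0.5, 1.0])
print(len(chunk))  # 4 samples -> 8 bytes
```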
## Performance Optimization

### For Real-Time Applications

1. **Use gRPC**: Lower latency than WebSocket
2. **Small Chunks**: Send audio in 0.5-1 second chunks
3. **Model Selection**:
   - `tiny`: Fastest, lowest accuracy (real-time on CPU)
   - `base`: Good balance (near real-time on CPU)
   - `small`: Better accuracy (may lag on CPU)
   - `large-v3`: Best accuracy (requires GPU for real-time)
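Given the audio format above (16 kHz, mono, 2 bytes per sample), the 0.5-1 second chunk guideline translates directly into byte counts:

```python
def chunk_bytes(seconds: float, sample_rate: int = 16_000, bytes_per_sample: int = 2) -> int:
    """Byte count for a mono PCM16 chunk of the given duration."""
    return int(seconds * sample_rate * bytes_per_sample)

print(chunk_bytes(0.5))  # 16000 bytes per half-second chunk
print(chunk_bytes(1.0))  # 32000 bytes per one-second chunk
```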
### GPU Acceleration

```yaml
# docker-compose.yml
environment:
  - CUDA_VISIBLE_DEVICES=0
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```
## Architecture

```
┌─────────────┐
│  Rust App   │
│  (Desktop)  │
└──────┬──────┘
       │ gRPC/HTTP2
       ▼
┌──────────────┐
│Transcription │
│   Service    │
│  ┌────────┐  │
│  │Whisper │  │
│  │ Model  │  │
│  └────────┘  │
└──────────────┘
```
### Components

1. **gRPC Server**: Handles streaming audio and returns transcriptions
2. **WebSocket Server**: Alternative protocol for web clients
3. **Transcription Engine**: Whisper/SimulStreaming for speech-to-text
4. **Session Manager**: Handles multiple concurrent streams
5. **Model Cache**: Prevents re-downloading models
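The session manager's role (item 4) can be sketched as a thread-safe map from stream IDs to audio buffers. This is a hypothetical illustration of the idea, not the service's actual implementation; all names here are invented for the example.

```python
import threading
import uuid

class SessionManager:
    """Illustrative sketch: one audio buffer per concurrent stream."""

    def __init__(self):
        self._sessions: dict[str, bytearray] = {}
        self._lock = threading.Lock()

    def open(self) -> str:
        session_id = uuid.uuid4().hex
        with self._lock:
            self._sessions[session_id] = bytearray()
        return session_id

    def append_audio(self, session_id: str, chunk: bytes) -> int:
        """Buffer a chunk; return total bytes held for this session."""
        with self._lock:
            buf = self._sessions[session_id]
            buf.extend(chunk)
            return len(buf)

    def close(self, session_id: str) -> bytes:
        """Drop the session and return whatever audio it buffered."""
        with self._lock:
            return bytes(self._sessions.pop(session_id))

mgr = SessionManager()
sid = mgr.open()
mgr.append_audio(sid, b"\x00\x00" * 8000)          # 0.5 s of PCM16 silence
print(mgr.append_audio(sid, b"\x00\x00" * 8000))   # 32000 bytes buffered
```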
## Advanced Configuration

### Using SimulStreaming

For even lower latency, mount SimulStreaming:

```yaml
volumes:
  - ./SimulStreaming:/app/SimulStreaming
environment:
  - SIMULSTREAMING_PATH=/app/SimulStreaming
```
### Custom Models

Mount your own Whisper models:

```yaml
volumes:
  - ./models:/app/models
environment:
  - MODEL_PATH=/app/models/custom-model.pt
```
### Monitoring

The service exposes metrics on `/metrics` (when enabled):

```bash
curl http://localhost:9090/metrics
```
## API Reference

### gRPC Methods

#### StreamTranscribe
```protobuf
rpc StreamTranscribe(stream AudioChunk) returns (stream TranscriptionResult);
```

Bidirectional streaming for real-time transcription. Send audio chunks, receive transcriptions.

#### TranscribeFile
```protobuf
rpc TranscribeFile(AudioFile) returns (TranscriptionResponse);
```

Transcribe a complete audio file in one request.

#### GetCapabilities
```protobuf
rpc GetCapabilities(Empty) returns (Capabilities);
```

Query available models, languages, and features.

#### HealthCheck
```protobuf
rpc HealthCheck(Empty) returns (HealthStatus);
```

Check service health and status.
## Language Support

Supports 50+ languages including:
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
- Chinese (zh)
- Japanese (ja)
- Korean (ko)
- And many more...

Use `"auto"` for automatic language detection.
## Troubleshooting

### Service won't start
- Check whether ports 50051 and 8765 are available
- Ensure Docker has enough memory (minimum 4 GB)
- Check logs: `docker compose logs transcription-api`

### Slow transcription
- Use a smaller model (tiny or base)
- Enable GPU if available
- Downsample audio to 16 kHz mono before sending
- Send smaller chunks more frequently
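For the "16 kHz mono" tip, a stereo capture can be downmixed by averaging the two channels of each interleaved frame. A stdlib-only sketch (assuming little-endian PCM16; `stereo_to_mono` is an illustrative name, and proper sample-rate conversion is left to a real DSP library):

```python
import struct

def stereo_to_mono(pcm: bytes) -> bytes:
    """Average interleaved stereo PCM16 frames (L, R, L, R, ...) down to mono."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)]
    return struct.pack(f"<{len(mono)}h", *mono)

stereo = struct.pack("<4h", 1000, 2000, -500, 500)  # two stereo frames
print(struct.unpack("<2h", stereo_to_mono(stereo)))  # (1500, 0)
```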
### Connection refused
- Check firewall settings
- Ensure the service is running: `docker compose ps`
- Verify correct ports in client configuration

### High memory usage
- Models are cached in memory for performance
- Use smaller models on memory-limited systems
- Set memory limits in docker-compose.yml
## Development

### Building from Source

```bash
# Install dependencies
pip install -r requirements.txt

# Generate gRPC code
python -m grpc_tools.protoc \
    -I./proto \
    --python_out=./src \
    --grpc_python_out=./src \
    ./proto/transcription.proto

# Run the service
python src/transcription_server.py
```
### Running Tests

```bash
# Test gRPC connection
grpcurl -plaintext localhost:50051 list

# Test health check
grpcurl -plaintext localhost:50051 transcription.TranscriptionService/HealthCheck

# Test with example audio
python test_client.py
```
## Production Deployment

### Docker Swarm

```bash
docker stack deploy -c docker-compose.yml transcription
```

### Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: transcription-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: transcription-api
  template:
    metadata:
      labels:
        app: transcription-api
    spec:
      containers:
        - name: transcription-api
          image: transcription-api:latest
          ports:
            - containerPort: 50051
              name: grpc
            - containerPort: 8765
              name: websocket
          env:
            - name: MODEL_PATH
              value: "base"
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
```
### Security

For production:
1. Enable TLS for gRPC
2. Use WSS for WebSocket
3. Add authentication
4. Add rate limiting
5. Validate all inputs
## License

MIT License - See LICENSE file for details

## Contributing

Contributions welcome! Please read CONTRIBUTING.md for guidelines.

## Support

- GitHub Issues: [Report bugs or request features]
- Documentation: [Full API documentation]
- Examples: See `examples/` directory