# Transcription API Service

A high-performance, standalone transcription service with gRPC and WebSocket support, optimized for real-time speech-to-text applications. Perfect for desktop applications, web services, and IoT devices.

## Features

- **Dual Protocol Support**: Both gRPC (recommended) and WebSocket
- **Real-Time Streaming**: Bidirectional audio streaming with immediate transcription
- **Multiple Models**: Support for all Whisper models (tiny to large-v3)
- **Language Support**: 50+ languages with automatic detection
- **Docker Ready**: Simple deployment with Docker Compose
- **Production Ready**: Health checks, monitoring, and graceful shutdown
- **Rust Client Examples**: Ready-to-use Rust client for desktop applications

## Quick Start

### Using Docker Compose (Recommended)

```bash
# Clone the repository
git clone https://github.com/aljazceru/transcription-api.git
cd transcription-api

# Start the service (uses 'base' model by default)
docker compose up -d

# Check logs
docker compose logs -f

# Stop the service
docker compose down
```

### Configuration

Edit `.env` or `docker-compose.yml` to configure:

```env
MODEL_PATH=base          # tiny, base, small, medium, large, large-v3
GRPC_PORT=50051          # gRPC service port
WEBSOCKET_PORT=8765      # WebSocket service port
ENABLE_WEBSOCKET=true    # Enable WebSocket support
CUDA_VISIBLE_DEVICES=0   # GPU device ID (if available)
```

## API Protocols

### gRPC (Recommended for Desktop Apps)

**Why gRPC?**
- Strongly typed with Protocol Buffers
- Excellent performance with HTTP/2
- Built-in streaming support
- Auto-generated client code
- Better error handling

**Proto Definition**: See `proto/transcription.proto`

**Service Methods**:
- `StreamTranscribe`: Bidirectional streaming for real-time transcription
- `TranscribeFile`: Single file transcription
- `GetCapabilities`: Query available models and languages
- `HealthCheck`: Service health status

### WebSocket (Alternative)

**Protocol**:
```javascript
// Connect
ws://localhost:8765

// Send audio
{
  "type": "audio",
  "data": "base64_encoded_pcm16_audio"
}

// Receive transcription
{
  "type": "transcription",
  "text": "Hello world",
  "start_time": 0.0,
  "end_time": 1.5,
  "is_final": true,
  "timestamp": 1234567890
}

// Stop
{
  "type": "stop"
}
```
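
Below is a minimal Python client sketch for this message flow. It assumes the third-party `websockets` package, a buffer of 16 kHz mono PCM16 audio (`pcm_bytes`), and that the server keeps the socket open until it has sent its results; the field names follow the messages shown above.

```python
# Minimal WebSocket client sketch (assumes: pip install websockets).
import asyncio
import base64
import json

import websockets


async def transcribe(pcm_bytes: bytes) -> None:
    async with websockets.connect("ws://localhost:8765") as ws:
        # Send one base64-encoded PCM16 chunk (16 kHz, mono), then signal stop
        await ws.send(json.dumps({"type": "audio",
                                  "data": base64.b64encode(pcm_bytes).decode()}))
        await ws.send(json.dumps({"type": "stop"}))

        # Print transcription messages until the server closes the connection
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "transcription":
                print(msg["text"], "(final)" if msg.get("is_final") else "(partial)")


# asyncio.run(transcribe(open("audio.pcm", "rb").read()))
```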

## Rust Client Usage

### Installation

```toml
# Add to your Cargo.toml
[dependencies]
tonic = "0.10"
tokio = { version = "1.35", features = ["full"] }
# ... see examples/rust-client/Cargo.toml for full list
```

### Live Microphone Transcription

```rust
use transcription_client::TranscriptionClient;

#[tokio::main]
async fn main() -> Result<()> {
    // Connect to service
    let mut client = TranscriptionClient::connect("http://localhost:50051").await?;

    // Start streaming from microphone
    let mut stream = client.stream_from_microphone(
        "auto",       // language
        "transcribe", // task
        "base"        // model
    ).await?;

    // Process transcriptions as they arrive (`next` comes from StreamExt)
    while let Some(transcription) = stream.next().await {
        println!("{}", transcription.text);
    }

    Ok(())
}
```

### Build and Run Examples

```bash
cd examples/rust-client

# Build
cargo build --release

# Run live transcription from microphone
cargo run --bin live-transcribe

# Transcribe a file
cargo run --bin file-transcribe -- audio.wav

# Stream a WAV file
cargo run --bin stream-transcribe -- audio.wav --realtime
```

## Audio Requirements

- **Format**: PCM16 (16-bit signed integer); see the conversion sketch after this list
- **Sample Rate**: 16kHz
- **Channels**: Mono
- **Chunk Size**: Minimum ~500 bytes (flexible for real-time)
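
If your source audio is not already in this format, one option is to shell out to `ffmpeg` (not part of this repository; assumed to be installed) and get back raw bytes ready to stream:

```python
# Sketch: convert any audio file to raw PCM16, mono, 16 kHz via ffmpeg.
import subprocess


def to_pcm16_mono_16k(path: str) -> bytes:
    """Return raw little-endian 16-bit PCM at 16 kHz, mono."""
    result = subprocess.run(
        ["ffmpeg", "-i", path,
         "-f", "s16le",           # raw signed 16-bit little-endian samples
         "-acodec", "pcm_s16le",
         "-ac", "1",              # mono
         "-ar", "16000",          # 16 kHz
         "-"],                    # write to stdout
        capture_output=True,
        check=True,
    )
    return result.stdout


# pcm_bytes = to_pcm16_mono_16k("speech.mp3")
```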

## Performance Optimization

### For Real-Time Applications

1. **Use gRPC**: Lower latency than WebSocket
2. **Small Chunks**: Send audio in 0.5-1 second chunks (see the sketch after this list)
3. **Model Selection**:
   - `tiny`: Fastest, lowest accuracy (real-time on CPU)
   - `base`: Good balance (near real-time on CPU)
   - `small`: Better accuracy (may lag on CPU)
   - `large-v3`: Best accuracy (requires GPU for real-time)
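
As a concrete example of the chunking rule: at 16 kHz mono PCM16, one second of audio is 16,000 samples × 2 bytes = 32,000 bytes, so a 0.5 s chunk is 16,000 bytes. A simple splitter (illustrative only, not part of the repository):

```python
# Split a PCM16 mono 16 kHz buffer into ~0.5 s chunks for streaming.
SAMPLE_RATE = 16_000      # samples per second
BYTES_PER_SAMPLE = 2      # PCM16 = 2 bytes per sample
CHUNK_SECONDS = 0.5

CHUNK_BYTES = int(SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS)  # 16,000 bytes


def chunks(pcm: bytes, size: int = CHUNK_BYTES):
    """Yield fixed-size audio chunks; the last one may be shorter."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```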

### GPU Acceleration

```yaml
# docker-compose.yml
environment:
  - CUDA_VISIBLE_DEVICES=0
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```
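
Assuming the Whisper backend is PyTorch-based, a quick way to confirm the GPU is actually visible inside the container (PyTorch here is an assumption about the backend, not this project's API):

```python
# Check that CUDA is visible to PyTorch inside the container.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```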

## Architecture

```
┌─────────────┐
│  Rust App   │
│  (Desktop)  │
└──────┬──────┘
       │ gRPC/HTTP2
       ▼
┌─────────────┐
│Transcription│
│   Service   │
│  ┌────────┐ │
│  │Whisper │ │
│  │ Model  │ │
│  └────────┘ │
└─────────────┘
```

### Components

1. **gRPC Server**: Handles streaming audio and returns transcriptions
2. **WebSocket Server**: Alternative protocol for web clients
3. **Transcription Engine**: Whisper/SimulStreaming for speech-to-text
4. **Session Manager**: Handles multiple concurrent streams
5. **Model Cache**: Prevents re-downloading models

## Advanced Configuration

### Using SimulStreaming

For even lower latency, mount SimulStreaming:

```yaml
volumes:
  - ./SimulStreaming:/app/SimulStreaming
environment:
  - SIMULSTREAMING_PATH=/app/SimulStreaming
```

### Custom Models

Mount your own Whisper models:

```yaml
volumes:
  - ./models:/app/models
environment:
  - MODEL_PATH=/app/models/custom-model.pt
```

### Monitoring

The service exposes metrics on `/metrics` (when enabled):

```bash
curl http://localhost:9090/metrics
```

## API Reference

### gRPC Methods

#### StreamTranscribe
```protobuf
rpc StreamTranscribe(stream AudioChunk) returns (stream TranscriptionResult);
```

Bidirectional streaming for real-time transcription. Send audio chunks, receive transcriptions.
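
A sketch of calling this from Python, assuming stubs generated as shown under "Building from Source" (`transcription_pb2` / `transcription_pb2_grpc`); the `data` field on `AudioChunk` and `text` on `TranscriptionResult` are assumptions, so check `proto/transcription.proto` for the actual schema:

```python
# Streaming gRPC client sketch (pip install grpcio; stubs generated from the proto).
import grpc

import transcription_pb2
import transcription_pb2_grpc


def audio_chunks(pcm: bytes, chunk_bytes: int = 16_000):
    """Yield AudioChunk messages built from a raw PCM16 buffer."""
    for offset in range(0, len(pcm), chunk_bytes):
        # 'data' is an assumed field name for the raw audio payload
        yield transcription_pb2.AudioChunk(data=pcm[offset:offset + chunk_bytes])


def stream_transcribe(pcm: bytes) -> None:
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = transcription_pb2_grpc.TranscriptionServiceStub(channel)
        # Bidirectional stream: send chunks, iterate results as they arrive
        for result in stub.StreamTranscribe(audio_chunks(pcm)):
            print(result.text)  # 'text' is assumed
```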

#### TranscribeFile
```protobuf
rpc TranscribeFile(AudioFile) returns (TranscriptionResponse);
```

Transcribe a complete audio file in one request.

#### GetCapabilities
```protobuf
rpc GetCapabilities(Empty) returns (Capabilities);
```

Query available models, languages, and features.

#### HealthCheck
```protobuf
rpc HealthCheck(Empty) returns (HealthStatus);
```

Check service health and status.
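
The unary methods follow the same pattern; a short sketch, again assuming the generated stubs and that `Empty` is defined in `proto/transcription.proto`:

```python
# Unary gRPC calls sketch: query capabilities and health.
import grpc

import transcription_pb2
import transcription_pb2_grpc

with grpc.insecure_channel("localhost:50051") as channel:
    stub = transcription_pb2_grpc.TranscriptionServiceStub(channel)

    caps = stub.GetCapabilities(transcription_pb2.Empty())
    print(caps)    # available models, languages, features

    health = stub.HealthCheck(transcription_pb2.Empty())
    print(health)  # service health status
```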

## Language Support

Supports 50+ languages including:
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
- Chinese (zh)
- Japanese (ja)
- Korean (ko)
- And many more...

Use `"auto"` for automatic language detection.

## Troubleshooting

### Service won't start
- Check if ports 50051 and 8765 are available
- Ensure Docker has enough memory (minimum 4GB)
- Check logs: `docker compose logs transcription-api`

### Slow transcription
- Use a smaller model (tiny or base)
- Enable GPU if available
- Downsample audio to 16kHz mono before sending
- Send smaller chunks more frequently

### Connection refused
- Check firewall settings
- Ensure the service is running: `docker compose ps`
- Verify correct ports in client configuration

### High memory usage
- Models are cached in memory for performance
- Use smaller models on memory-limited systems
- Set memory limits in docker-compose.yml

## Development

### Building from Source

```bash
# Install dependencies
pip install -r requirements.txt

# Generate gRPC code
python -m grpc_tools.protoc \
    -I./proto \
    --python_out=./src \
    --grpc_python_out=./src \
    ./proto/transcription.proto

# Run the service
python src/transcription_server.py
```

### Running Tests

```bash
# Test gRPC connection
grpcurl -plaintext localhost:50051 list

# Test health check
grpcurl -plaintext localhost:50051 transcription.TranscriptionService/HealthCheck

# Test with example audio
python test_client.py
```

## Production Deployment

### Docker Swarm

```bash
docker stack deploy -c docker-compose.yml transcription
```

### Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: transcription-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: transcription-api
  template:
    metadata:
      labels:
        app: transcription-api
    spec:
      containers:
        - name: transcription-api
          image: transcription-api:latest
          ports:
            - containerPort: 50051
              name: grpc
            - containerPort: 8765
              name: websocket
          env:
            - name: MODEL_PATH
              value: "base"
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
```

### Security

For production:
1. Enable TLS for gRPC (see the sketch after this list)
2. Use WSS for WebSocket
3. Add authentication
4. Apply rate limiting
5. Validate all input
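
For item 1, a client-side sketch of connecting over TLS with `grpcio` (server-side TLS setup is not shown; the hostname and certificate path are placeholders):

```python
# Connect to the gRPC endpoint over TLS instead of an insecure channel.
import grpc

with open("ca.pem", "rb") as f:  # CA certificate that signed the server cert
    creds = grpc.ssl_channel_credentials(root_certificates=f.read())

channel = grpc.secure_channel("transcription.example.com:50051", creds)
# stub = transcription_pb2_grpc.TranscriptionServiceStub(channel)
```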

## License

MIT License - See LICENSE file for details

## Contributing

Contributions welcome! Please read CONTRIBUTING.md for guidelines.

## Support

- GitHub Issues: [Report bugs or request features]
- Documentation: [Full API documentation]
- Examples: See `examples/` directory