A high-performance, standalone transcription service with gRPC and WebSocket support, optimized for real-time speech-to-text applications. Perfect for desktop applications, web services, and IoT devices.
## Features
- **Dual Protocol Support**: Both gRPC (recommended) and WebSocket
- **Real-Time Streaming**: Bidirectional audio streaming with immediate transcription
- **Multiple Models**: Support for all Whisper models (tiny to large-v3)
- **Language Support**: 50+ languages with automatic detection
- **Docker Ready**: Simple deployment with Docker Compose
- **Production Ready**: Health checks, monitoring, and graceful shutdown
- **Rust Client Examples**: Ready-to-use Rust clients for desktop applications
## Quick Start
### Using Docker Compose (Recommended)
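Assuming the compose file at the repository root defines the `transcription-api` service (the name used by the troubleshooting commands below), starting the service is a single command:
```bash
# Build and start the service in the background
docker compose up -d

# Follow the logs to confirm the model has loaded
docker compose logs -f transcription-api
```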
Key environment variables:
```bash
ENABLE_WEBSOCKET=true # Enable WebSocket support
CUDA_VISIBLE_DEVICES=0 # GPU device ID (if available)
```
## API Protocols
### gRPC (Recommended for Desktop Apps)
**Why gRPC?**
- Strongly typed with Protocol Buffers
- Excellent performance with HTTP/2
- Built-in streaming support
- Auto-generated client code
- Better error handling
**Proto Definition**: See `proto/transcription.proto`
**Service Methods**:
- `StreamTranscribe`: Bidirectional streaming for real-time transcription
- `TranscribeFile`: Single file transcription
- `GetCapabilities`: Query available models and languages
- `HealthCheck`: Service health status
### WebSocket (Alternative)
**Protocol**:
```javascript
// Connect
const ws = new WebSocket("ws://localhost:8765");

// Send audio
ws.send(JSON.stringify({
  type: "audio",
  data: "base64_encoded_pcm16_audio"
}));

// Receive transcriptions
ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  // {
  //   "type": "transcription",
  //   "text": "Hello world",
  //   "start_time": 0.0,
  //   "end_time": 1.5,
  //   "is_final": true,
  //   "timestamp": 1234567890
  // }
};

// Stop
ws.send(JSON.stringify({ type: "stop" }));
```
## Rust Client Usage
### Installation
```toml
# Add to your Cargo.toml
[dependencies]
tonic = "0.10"
tokio = { version = "1.35", features = ["full"] }
anyhow = "1"      # error handling in the example below
futures = "0.3"   # StreamExt for consuming the transcription stream
# ... see examples/rust-client/Cargo.toml for full list
```
### Live Microphone Transcription
```rust
use anyhow::Result;
use futures::StreamExt;
use transcription_client::TranscriptionClient;

#[tokio::main]
async fn main() -> Result<()> {
    // Connect to the service
    let mut client = TranscriptionClient::connect("http://localhost:50051").await?;

    // Start streaming from the microphone
    let mut stream = client
        .stream_from_microphone(
            "auto",       // language
            "transcribe", // task
            "base",       // model
        )
        .await?;

    // Print transcriptions as they arrive
    while let Some(transcription) = stream.next().await {
        println!("{}", transcription.text);
    }
    Ok(())
}
```
### Build and Run Examples
```bash
# Transcribe a complete file
cargo run --bin file-transcribe -- audio.wav
# Stream a WAV file
cargo run --bin stream-transcribe -- audio.wav --realtime
```
## Audio Requirements
- **Format**: PCM16 (16-bit signed integer)
- **Sample Rate**: 16kHz
- **Channels**: Mono
- **Chunk Size**: Minimum ~500 bytes (flexible for real-time)
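At this format, one second of audio is 16,000 samples × 2 bytes = 32,000 bytes, so the 0.5-1 second chunks recommended below are 16-32 KB each. A minimal sketch of converting captured `f32` samples into the expected bytes (the helper is illustrative, not part of the example client):
```rust
/// Convert f32 samples in [-1.0, 1.0] (what most audio capture libraries
/// produce) into little-endian PCM16 bytes.
fn to_pcm16_bytes(samples: &[f32]) -> Vec<u8> {
    samples
        .iter()
        .map(|s| (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16)
        .flat_map(|s| s.to_le_bytes())
        .collect()
}

const SAMPLE_RATE: usize = 16_000;
const CHUNK_SAMPLES: usize = SAMPLE_RATE / 2; // 0.5 s = 8,000 samples = 16,000 bytes
```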
## Performance Optimization
### For Real-Time Applications
1. **Use gRPC**: Lower latency than WebSocket
2. **Small Chunks**: Send audio in 0.5-1 second chunks (see the pacing sketch after this list)
3. **Model Selection**:
- `tiny`: Fastest, lowest accuracy (real-time on CPU)
- `base`: Good balance (near real-time on CPU)
- `small`: Better accuracy (may lag on CPU)
- `large-v3`: Best accuracy (requires GPU for real-time)
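When replaying recorded audio (as `stream-transcribe --realtime` does), pacing chunks at wall-clock speed avoids flooding the server. A minimal tokio sketch, independent of the example client:
```rust
use std::time::Duration;
use tokio::{sync::mpsc::Sender, time::interval};

// Send one 0.5 s chunk of PCM16 bytes every 500 ms of wall-clock time.
async fn pace_chunks(chunks: Vec<Vec<u8>>, tx: Sender<Vec<u8>>) {
    let mut tick = interval(Duration::from_millis(500));
    for chunk in chunks {
        tick.tick().await; // wait for the next 500 ms slot
        if tx.send(chunk).await.is_err() {
            break; // receiver closed; stop sending
        }
    }
}
```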
### GPU Acceleration
```yaml
# docker-compose.yml
environment:
  - CUDA_VISIBLE_DEVICES=0
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```
## Architecture
```
┌─────────────┐
│  Rust App   │
│  (Desktop)  │
└──────┬──────┘
       │ gRPC/HTTP2
       ▼
┌─────────────┐
│Transcription│
│   Service   │
│ ┌─────────┐ │
│ │ Whisper │ │
│ │  Model  │ │
│ └─────────┘ │
└─────────────┘
```
### Components
1. **gRPC Server**: Handles streaming audio and returns transcriptions
2. **WebSocket Server**: Alternative protocol for web clients
3. **Transcription Engine**: Whisper/SimulStreaming for speech-to-text
4. **Session Manager**: Handles multiple concurrent streams
5. **Model Cache**: Prevents re-downloading models
## Advanced Configuration
### Using SimulStreaming
For even lower latency, mount SimulStreaming:
```yaml
volumes:
  - ./SimulStreaming:/app/SimulStreaming
environment:
  - SIMULSTREAMING_PATH=/app/SimulStreaming
```
### Custom Models
Mount your own Whisper models:
```yaml
volumes:
  - ./models:/app/models
environment:
  - MODEL_PATH=/app/models/custom-model.pt
```
### Monitoring
The service exposes metrics on `/metrics` (when enabled):
```bash
curl http://localhost:9090/metrics
```
## API Reference
### gRPC Methods
#### StreamTranscribe
```protobuf
rpc StreamTranscribe(stream AudioChunk) returns (stream TranscriptionResult);
```
Bidirectional streaming for real-time transcription. Send audio chunks, receive transcriptions.
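Clients that skip the example wrapper can call this RPC through tonic's generated stubs directly. A minimal sketch (requires the `tokio-stream` crate), assuming the proto package is `transcription` as the grpcurl commands under Running Tests suggest; the `data` and `text` field names are illustrative and should be checked against `proto/transcription.proto`:
```rust
use tokio_stream::wrappers::ReceiverStream;

// Stubs generated by tonic-build from proto/transcription.proto.
pub mod transcription {
    tonic::include_proto!("transcription");
}
use transcription::transcription_service_client::TranscriptionServiceClient;
use transcription::AudioChunk;

async fn stream_demo() -> Result<(), Box<dyn std::error::Error>> {
    let mut client = TranscriptionServiceClient::connect("http://localhost:50051").await?;

    // Request side: feed PCM16 chunks through a channel.
    let (tx, rx) = tokio::sync::mpsc::channel::<AudioChunk>(16);
    tokio::spawn(async move {
        // One 0.5 s chunk of silence; `data` is an assumed field name.
        let chunk = AudioChunk { data: vec![0u8; 16_000], ..Default::default() };
        let _ = tx.send(chunk).await;
    });

    // Response side: read results as the server produces them.
    let mut results = client
        .stream_transcribe(ReceiverStream::new(rx))
        .await?
        .into_inner();
    while let Some(result) = results.message().await? {
        println!("{}", result.text); // `text` is an assumed field name
    }
    Ok(())
}
```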
#### TranscribeFile
```protobuf
rpc TranscribeFile(AudioFile) returns (TranscriptionResponse);
```
Transcribe a complete audio file in one request.
#### GetCapabilities
```protobuf
rpc GetCapabilities(Empty) returns (Capabilities);
```
Query available models, languages, and features.
#### HealthCheck
```protobuf
rpc HealthCheck(Empty) returns (HealthStatus);
```
Check service health and status.
## Language Support
Supports 50+ languages including:
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
- Chinese (zh)
- Japanese (ja)
- Korean (ko)
- And many more...
Use `"auto"` for automatic language detection.
## Troubleshooting
### Service won't start
- Check if ports 50051 and 8765 are available
- Ensure Docker has enough memory (minimum 4GB)
- Check logs: `docker compose logs transcription-api`
### Slow transcription
- Use a smaller model (tiny or base)
- Enable GPU if available
- Downsample audio to 16kHz mono before sending
- Send smaller chunks more frequently
### Connection refused
- Check firewall settings
- Ensure service is running: `docker compose ps`
- Verify correct ports in client configuration
### High memory usage
- Models are cached in memory for performance
- Use smaller models for limited memory systems
- Set memory limits in docker-compose.yml
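A minimal sketch of such a limit, using the `transcription-api` service name from the troubleshooting commands (the 4g figure is illustrative; size it to the model you load):
```yaml
# docker-compose.yml
services:
  transcription-api:
    deploy:
      resources:
        limits:
          memory: 4g
```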
## Development
### Building from Source
```bash
# Install dependencies
pip install -r requirements.txt
# Generate gRPC code
python -m grpc_tools.protoc \
    -I./proto \
    --python_out=./src \
    --grpc_python_out=./src \
    ./proto/transcription.proto
# Run the service
python src/transcription_server.py
```
### Running Tests
```bash
# Test gRPC connection
grpcurl -plaintext localhost:50051 list
# Test health check
grpcurl -plaintext localhost:50051 transcription.TranscriptionService/HealthCheck
# Test with example audio
python test_client.py
```
## R&D Project Notice
This is a research and development project for exploring real-time transcription capabilities. It is not production-ready and should be used for experimentation and development purposes only.
### Known Limitations
- Memory usage scales with model size (1.5-6GB for large models)
- Single model instance shared across connections
- No authentication or rate limiting
- Not optimized for high-concurrency production use
## License
MIT License - See LICENSE file for details
## Contributing
Contributions welcome! Please read CONTRIBUTING.md for guidelines.
## Support
- **GitHub Issues**: Report bugs or request features
- **Documentation**: Full API documentation
- **Examples**: See the `examples/` directory