Mirror of https://github.com/aljazceru/transcription-api.git (synced 2025-12-16 23:14:18 +01:00)
# Transcription API Service

A high-performance, standalone transcription service with **REST API**, **gRPC**, and **WebSocket** support, optimized for real-time speech-to-text applications. Perfect for desktop applications, web services, and IoT devices.
## Features

- **Multiple API Interfaces**: REST API, gRPC, and WebSocket
- **High Performance**: Optimized with TF32, cuDNN, and efficient batching
- **Whisper Models**: Support for all Whisper models (tiny to large-v3)
- **Real-time Streaming**: Bidirectional streaming for live transcription
- **Voice Activity Detection**: Smart VAD to filter silence and noise
- **Anti-hallucination**: Advanced filtering to reduce Whisper hallucinations
- **Docker Ready**: Easy deployment with GPU support
- **Interactive Docs**: Auto-generated API documentation (Swagger/OpenAPI)
## Quick Start

### Using Docker Compose (Recommended)
```bash
# Clone the repository
git clone https://github.com/aljazceru/transcription-api.git
cd transcription-api

# Start the service (uses 'base' model by default)
docker compose up -d

# Check logs
docker compose logs -f

# Stop the service
docker compose down
```
### Configuration

Edit `.env` or `docker-compose.yml` to configure:
```env
# Model Configuration
MODEL_PATH=base              # tiny, base, small, medium, large, large-v3

# Service Ports
GRPC_PORT=50051              # gRPC service port
WEBSOCKET_PORT=8765          # WebSocket service port
REST_PORT=8000               # REST API port

# Feature Flags
ENABLE_WEBSOCKET=true        # Enable WebSocket support
ENABLE_REST=true             # Enable REST API

# GPU Configuration
CUDA_VISIBLE_DEVICES=0       # GPU device ID (if available)
```
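For GPU deployments, Docker itself also needs access to the device. A minimal `docker-compose.yml` sketch using the Compose `deploy` syntax (the service name here is illustrative and may differ from the project's actual file):

```yaml
services:
  transcription-api:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia    # requires the NVIDIA Container Toolkit on the host
              count: 1
              capabilities: [gpu]
```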
## API Endpoints

The service provides three ways to access transcription:

### 1. REST API (Port 8000)

The REST API is perfect for simple HTTP-based integrations.

#### Base URLs

- **API Docs**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **Health**: http://localhost:8000/health

#### Key Endpoints
**Transcribe File**
```bash
curl -X POST "http://localhost:8000/transcribe" \
  -F "file=@audio.wav" \
  -F "language=en" \
  -F "task=transcribe" \
  -F "vad_enabled=true"
```
**Health Check**
```bash
curl http://localhost:8000/health
```
**Get Capabilities**
```bash
curl http://localhost:8000/capabilities
```
**WebSocket Streaming** (via REST API)
```bash
# Connect to this WebSocket endpoint (with a WebSocket client, not curl)
ws://localhost:8000/ws/transcribe
```
For detailed API documentation, visit http://localhost:8000/docs after starting the service.

### 2. gRPC (Port 50051)

For high-performance, low-latency applications. See the protobuf definitions in `proto/transcription.proto`.

### 3. WebSocket (Port 8765)

Legacy WebSocket endpoint for backward compatibility.
## Usage Examples

### REST API (Python)
```python
import requests

# Transcribe a file
with open('audio.wav', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/transcribe',
        files={'file': f},
        data={
            'language': 'en',
            'task': 'transcribe',
            'vad_enabled': True
        }
    )

result = response.json()
print(result['full_text'])
```
### REST API (cURL)
```bash
# Transcribe an audio file
curl -X POST "http://localhost:8000/transcribe" \
  -F "file=@audio.wav" \
  -F "language=en"

# Health check
curl http://localhost:8000/health

# Get service capabilities
curl http://localhost:8000/capabilities
```
### WebSocket (JavaScript)
```javascript
const ws = new WebSocket('ws://localhost:8000/ws/transcribe');

ws.onopen = () => {
  console.log('Connected');

  // Send audio data (base64-encoded PCM16)
  ws.send(JSON.stringify({
    type: 'audio',
    data: base64AudioData,
    language: 'en',
    vad_enabled: true
  }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'transcription') {
    console.log('Transcription:', data.text);
  }
};

// When finished, stop transcription
ws.send(JSON.stringify({ type: 'stop' }));
```
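The WebSocket `audio` message carries base64-encoded PCM16 audio. A minimal Python sketch of producing that payload from float samples (assuming 16 kHz mono input; the helper name is illustrative, not part of the API):

```python
import base64
import struct

def float_to_pcm16_base64(samples):
    """Clamp float samples to [-1.0, 1.0], pack them as little-endian
    16-bit PCM, and return the base64-encoded payload string."""
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    pcm16 = struct.pack('<%dh' % len(samples), *(int(s * 32767) for s in clamped))
    return base64.b64encode(pcm16).decode('ascii')
```

The resulting string goes into the `data` field of the WebSocket `audio` message shown above.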
## Rust Client Usage

### Build and Run Examples
```bash
cd examples/rust-client

# Build
cargo build --release

# Run live transcription from microphone
cargo run --bin live-transcribe

# Transcribe a file
cargo run --bin file-transcribe -- audio.wav

# Stream a WAV file
cargo run --bin stream-transcribe -- audio.wav --realtime
```
## Performance Optimizations

This service includes several performance optimizations:

1. **Shared Model Instance**: A single model loaded in memory, shared across all connections
2. **TF32 & cuDNN**: Enabled on Ampere GPUs for faster inference
3. **No Gradient Computation**: `torch.no_grad()` context for inference
4. **Optimized Threading**: Dynamic thread pool sizing based on CPU cores
5. **Efficient VAD**: Fast voice activity detection to skip silent audio
6. **Batch Processing**: Processes audio in optimal chunk sizes
7. **gRPC Optimizations**: Keepalive and HTTP/2 settings tuned for performance
## Supported Formats

- **Audio**: WAV, MP3, WebM, OGG, FLAC, M4A, raw PCM16
- **Sample Rate**: 16 kHz (other rates are automatically resampled)
- **Languages**: Auto-detect or specify (en, es, fr, de, it, pt, ru, zh, ja, ko, etc.)
- **Tasks**: Transcribe, or translate to English
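Since the service resamples server-side, clients do not have to match 16 kHz. For clients that prefer to downsample before sending anyway, a naive linear-interpolation sketch (this function is illustrative only, not part of the service; production code should use a proper resampling library):

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler for mono float samples."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        # Map output index back to a fractional position in the input
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```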
## API Documentation

Full interactive API documentation is available at:

- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
## Health Monitoring

```bash
# Check service health
curl http://localhost:8000/health
```

Example response:

```json
{
  "healthy": true,
  "status": "running",
  "model_loaded": "large-v3",
  "uptime_seconds": 3600,
  "active_sessions": 2
}
```
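For automated monitoring, the response above can be evaluated programmatically. A minimal sketch (field names are taken from the example response; the helper itself is illustrative):

```python
import json

def check_health(payload):
    """Return True when a /health payload reports a healthy, running service."""
    info = json.loads(payload)
    return bool(info.get("healthy")) and info.get("status") == "running"

# Example with the documented response shape
sample = ('{"healthy": true, "status": "running", "model_loaded": "large-v3", '
          '"uptime_seconds": 3600, "active_sessions": 2}')
print(check_health(sample))  # True
```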