transcription-api/README.md

# Transcription API Service

A high-performance, standalone transcription service with **REST API**, **gRPC**, and **WebSocket** support, optimized for real-time speech-to-text applications. Perfect for desktop applications, web services, and IoT devices.

## Features

- 🚀 **Multiple API Interfaces**: REST API, gRPC, and WebSocket
- 🎯 **High Performance**: Optimized with TF32, cuDNN, and efficient batching
- 🧠 **Whisper Models**: Support for all Whisper models (tiny to large-v3)
- 🎤 **Real-time Streaming**: Bidirectional streaming for live transcription
- 🔇 **Voice Activity Detection**: Smart VAD to filter silence and noise
- 🚫 **Anti-hallucination**: Advanced filtering to reduce Whisper hallucinations
- 🐳 **Docker Ready**: Easy deployment with GPU support
- 📊 **Interactive Docs**: Auto-generated API documentation (Swagger/OpenAPI)

## Quick Start

### Using Docker Compose (Recommended)

```bash
# Clone the repository
cd transcription-api

# Start the service (uses 'base' model by default)
docker compose up -d

# Check logs
docker compose logs -f

# Stop the service
docker compose down
```

### Configuration

Edit `.env` or `docker-compose.yml` to configure:

```env
# Model Configuration
MODEL_PATH=base          # tiny, base, small, medium, large, large-v3

# Service Ports
GRPC_PORT=50051         # gRPC service port
WEBSOCKET_PORT=8765     # WebSocket service port
REST_PORT=8000          # REST API port

# Feature Flags
ENABLE_WEBSOCKET=true   # Enable WebSocket support
ENABLE_REST=true        # Enable REST API

# GPU Configuration
CUDA_VISIBLE_DEVICES=0  # GPU device ID (if available)
```

## API Endpoints

The service provides three ways to access transcription:

### 1. REST API (Port 8000)

The REST API is perfect for simple HTTP-based integrations.

#### Base URLs
- **API Docs**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **Health**: http://localhost:8000/health

#### Key Endpoints

**Transcribe File**
```bash
curl -X POST "http://localhost:8000/transcribe" \
  -F "file=@audio.wav" \
  -F "language=en" \
  -F "task=transcribe" \
  -F "vad_enabled=true"
```

**Health Check**
```bash
curl http://localhost:8000/health
```

**Get Capabilities**
```bash
curl http://localhost:8000/capabilities
```

**WebSocket Streaming** (via REST API)
```bash
# Connect to WebSocket
ws://localhost:8000/ws/transcribe
```

For detailed API documentation, visit http://localhost:8000/docs after starting the service.

### 2. gRPC (Port 50051)

For high-performance, low-latency applications. See protobuf definitions in `proto/transcription.proto`.

### 3. WebSocket (Port 8765)

Legacy WebSocket endpoint for backward compatibility.


## Usage Examples

### REST API (Python)

```python
import requests

# Transcribe a file
with open('audio.wav', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/transcribe',
        files={'file': f},
        data={
            'language': 'en',
            'task': 'transcribe',
            'vad_enabled': True
        }
    )
    result = response.json()
    print(result['full_text'])
```

### REST API (cURL)

```bash
# Transcribe an audio file
curl -X POST "http://localhost:8000/transcribe" \
  -F "file=@audio.wav" \
  -F "language=en"

# Health check
curl http://localhost:8000/health

# Get service capabilities
curl http://localhost:8000/capabilities
```

### WebSocket (JavaScript)

```javascript
const ws = new WebSocket('ws://localhost:8000/ws/transcribe');

ws.onopen = () => {
  console.log('Connected');

  // Send audio data (base64-encoded PCM16)
  ws.send(JSON.stringify({
    type: 'audio',
    data: base64AudioData,
    language: 'en',
    vad_enabled: true
  }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'transcription') {
    console.log('Transcription:', data.text);
  }
};

// Stop transcription
ws.send(JSON.stringify({ type: 'stop' }));
```

## Rust Client Usage

### Build and Run Examples

```bash
cd examples/rust-client

# Build
cargo build --release

# Run live transcription from microphone
cargo run --bin live-transcribe

# Transcribe a file
cargo run --bin file-transcribe -- audio.wav

# Stream a WAV file
cargo run --bin stream-transcribe -- audio.wav --realtime
```

## Performance Optimizations

This service includes several performance optimizations:

1. **Shared Model Instance**: Single model loaded in memory, shared across all connections
2. **TF32 & cuDNN**: Enabled for Ampere GPUs for faster inference
3. **No Gradient Computation**: `torch.no_grad()` context for inference
4. **Optimized Threading**: Dynamic thread pool sizing based on CPU cores
5. **Efficient VAD**: Fast voice activity detection to skip silent audio
6. **Batch Processing**: Processes audio in optimal chunk sizes
7. **gRPC Optimizations**: Keepalive and HTTP/2 settings tuned for performance

## Supported Formats

- **Audio**: WAV, MP3, WebM, OGG, FLAC, M4A, raw PCM16
- **Sample Rate**: 16kHz (automatically resampled)
- **Languages**: Auto-detect or specify (en, es, fr, de, it, pt, ru, zh, ja, ko, etc.)
- **Tasks**: Transcribe or Translate to English

## API Documentation

Full interactive API documentation is available at:
- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc

## Health Monitoring

```bash
# Check service health
curl http://localhost:8000/health

# Response:
{
  "healthy": true,
  "status": "running",
  "model_loaded": "large-v3",
  "uptime_seconds": 3600,
  "active_sessions": 2
}
```