
Performance and cost

Get an overview of the performance and cost of Speechmatics container deployments

Speech to text containers

This page compares the performance and estimated running cost of transcription running on standard Azure VMs. It highlights the maximum number of concurrent real-time sessions (session density) and the maximum throughput for batch jobs on a single instance.

Batch transcription

| Operating Point | CPU Standard | CPU Enhanced | GPU Standard | GPU Enhanced |
| --- | --- | --- | --- | --- |
| Lowest Processing Cost (US ¢ per hour) | 1.7 | 3.8 | 0.46 | 1.88 |
| Cost vs CPU Standard (%) | - | 224% | 27% | 110% |
| Cost vs CPU Enhanced (%) | 45% | - | 12% | 49% |
| Maximum Throughput¹ | 53.2 | 23.7 | 260 | 40 |
| Representative Real-Time Factor (RTF)² | 0.085 | 0.2 | 0.029 | 0.127 |
| Transcriber Count | 20 | 20 | 13 | 6 |

The benchmark uses the following configuration:

| Benchmark details | Value |
| --- | --- |
| Version | 14.8.0 |
| CPU | D16ds_v5 |
| GPU Standard | Standard_NC16as_T4_v3 |
| GPU Enhanced | Standard_NC8as_T4_v3 |
| Price Basis | Azure PAYG East US, Linux, Standard |

For GPU Operating Points, transcribers and inference servers were all run on a single VM node.
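
The cost figures above are effectively the VM's hourly price divided by its throughput. The sketch below shows that arithmetic; the hourly VM rates in it are illustrative assumptions, not published prices, so substitute current Azure PAYG rates for your region.

```python
# Rough sketch of how "Lowest Processing Cost" relates to VM price and
# throughput. The rates below are illustrative assumptions, NOT published
# Azure prices -- replace them with current PAYG rates for your region.

ILLUSTRATIVE_VM_RATE_CENTS_PER_HOUR = {
    "CPU (D16ds_v5)": 90.0,                         # assumed
    "GPU Standard (Standard_NC16as_T4_v3)": 120.0,  # assumed
    "GPU Enhanced (Standard_NC8as_T4_v3)": 75.0,    # assumed
}


def cost_cents_per_audio_hour(vm_cents_per_hour: float, throughput: float) -> float:
    """Cost of transcribing one hour of audio, where throughput is hours of
    audio processed per hour of runtime (footnote 1)."""
    return vm_cents_per_hour / throughput


# Example: a CPU VM at ~90 cents/hour with a throughput of 53.2 works out to
# roughly 1.7 cents per hour of audio, in line with the table above.
print(round(cost_cents_per_audio_hour(90.0, 53.2), 2))
```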

Realtime transcription

| Operating Point | CPU Standard | CPU Enhanced | GPU Standard | GPU Enhanced |
| --- | --- | --- | --- | --- |
| Lowest Processing Cost (US ¢ per hour) | 1.97 | 2.95 | 0.86 | 2.51 |
| Cost vs. CPU Standard (%) | - | 150% | 44% | 127% |
| Cost vs. CPU Enhanced (%) | 67% | - | 29% | 85% |
| Session Density | 40 | 24 | 140³ | 30³ |

This benchmark uses the following configuration⁴:

| Benchmark details | Value |
| --- | --- |
| Version | 13.4.0 |
| CPU | D16ds_v5 |
| GPU Standard | Standard_NC16as_T4_v3 |
| GPU Enhanced | Standard_NC8as_T4_v3 |
| Price Basis | Azure PAYG East US, Linux, Standard |

For GPU Operating Points, the transcribers and inference servers were run on a single VM node.

For its first session, each transcriber requires 0.25 cores for both Operating Points, with 1.2 GB of memory (Standard OP) or 3 GB of memory (Enhanced OP). Every additional session consumes 0.1 cores and 100 MB of memory.
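
A minimal sketch of that sizing rule, assuming it scales linearly per session as described; the function below is illustrative and not part of any Speechmatics tooling:

```python
# Illustrative sizing helper for one Real-Time transcriber, based on the
# figures above: the first session costs 0.25 cores plus 1.2 GB (Standard OP)
# or 3 GB (Enhanced OP); each additional session costs 0.1 cores and 0.1 GB.

def transcriber_resources(sessions: int, operating_point: str = "standard"):
    """Return (cores, memory_gb) needed by one transcriber handling `sessions`
    concurrent sessions. Names and signature are illustrative assumptions."""
    if sessions < 1:
        return 0.0, 0.0
    base_memory_gb = 1.2 if operating_point == "standard" else 3.0
    cores = 0.25 + 0.1 * (sessions - 1)
    memory_gb = base_memory_gb + 0.1 * (sessions - 1)
    return cores, memory_gb


# Example: 10 concurrent sessions on the Enhanced OP
# -> 0.25 + 9 * 0.1 = 1.15 cores and 3.0 + 9 * 0.1 = 3.9 GB of memory.
print(transcriber_resources(10, "enhanced"))
```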

Translation (GPU)

Translation running on a 4-core T4 has an RTF of roughly 0.008. It can handle up to 125 hours of batch audio per hour, or 125 Realtime Transcription streams. However, each translation target language is counted as a stream, so a single Realtime Transcription stream that requests 5 target languages places the same load on the Translation Inference Server as 5 transcription streams each requesting a single target language.
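
A small sketch of that stream accounting, using the 125-stream figure from the paragraph above (names and the capacity constant are illustrative):

```python
# Illustrative accounting for Translation load: each (stream, target language)
# pair counts as one translation stream against the ~125-stream capacity of a
# 4-core T4 Translation Inference Server quoted above.

T4_TRANSLATION_STREAM_CAPACITY = 125


def translation_load(target_languages_per_stream):
    """Total translation streams consumed, given the number of target
    languages requested by each transcription stream."""
    return sum(target_languages_per_stream)


# Example: 20 Realtime Transcription streams, each translating into 5 target
# languages, consume 100 of the 125 available translation streams.
load = translation_load([5] * 20)
print(load, load <= T4_TRANSLATION_STREAM_CAPACITY)
```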

Footnotes

  1. Throughput is measured as hours of audio per hour of runtime. A throughput of 50 would mean that in one hour, the system as a whole can transcribe 50 hours of audio.

  2. An RTF of 1 would mean that a one-hour file takes one hour to transcribe. An RTF of 0.1 would mean that a one-hour file takes six minutes to transcribe. Benchmark RTFs are representative for processing audio files over 20 minutes in duration using parallel=4.

  3. Multiple sessions are handled by a single worker configured with the required concurrency.

  4. Benchmark results reflect performance on a fully loaded inference server operating at the session density recommended for the respective GPU platform.