
Performance and cost

Get an overview of the performance and cost of Speechmatics container deployments

Speech to text containers

This page compares the performance and estimated running cost of transcription running on standard Azure VMs. It highlights the maximum number of concurrent real-time sessions (session density) and the maximum throughput for batch jobs on a single instance.

Batch transcription

| Operating Point | CPU Standard | CPU Enhanced | GPU Standard | GPU Enhanced |
| --- | --- | --- | --- | --- |
| Lowest Processing Cost (US ¢ per hour) | 1.7 | 3.8 | 0.46 | 1.88 |
| Cost vs CPU Standard (%) | - | 224% | 27% | 110% |
| Cost vs CPU Enhanced (%) | 45% | - | 12% | 49% |
| Maximum Throughput¹ | 53.2 | 23.7 | 260 | 40 |
| Representative Real-Time Factor (RTF)² | 0.085 | 0.2 | 0.029 | 0.127 |
| Transcriber Count | 20 | 20 | 13 | 6 |

The benchmark uses the following configuration:

| Benchmark details | Value |
| --- | --- |
| Version | 14.8.0 |
| CPU | D16ds_v5 |
| GPU Standard | Standard_NC16as_T4_v3 |
| GPU Enhanced | Standard_NC8as_T4_v3 |
| Price Basis | Azure PAYG East US, Linux, Standard |

For GPU Operating Points, transcribers and inference servers were all run on a single VM node.
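
The cost figures above are effectively the VM's hourly price divided by its throughput. The sketch below shows that arithmetic; the hourly VM rates in it are illustrative assumptions, not published prices, so substitute current Azure PAYG rates for your region.

```python
# Rough sketch of how "Lowest Processing Cost" relates to VM price and
# throughput. The rates below are illustrative assumptions, NOT published
# Azure prices -- replace them with current PAYG rates for your region.

ILLUSTRATIVE_VM_RATE_CENTS_PER_HOUR = {
    "CPU (D16ds_v5)": 90.0,                         # assumed
    "GPU Standard (Standard_NC16as_T4_v3)": 120.0,  # assumed
    "GPU Enhanced (Standard_NC8as_T4_v3)": 75.0,    # assumed
}


def cost_cents_per_audio_hour(vm_cents_per_hour: float, throughput: float) -> float:
    """Cost of transcribing one hour of audio, where throughput is hours of
    audio processed per hour of runtime (footnote 1)."""
    return vm_cents_per_hour / throughput


# Example: a CPU VM at ~90 cents/hour with a throughput of 53.2 works out to
# roughly 1.7 cents per hour of audio, in line with the table above.
print(round(cost_cents_per_audio_hour(90.0, 53.2), 2))
```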

Realtime transcription

| Operating Point | CPU Standard | CPU Enhanced | GPU Standard | GPU Enhanced |
| --- | --- | --- | --- | --- |
| Lowest Processing Cost (US ¢ per hour) | 1.97 | 2.95 | 0.86 | 2.51 |
| Cost vs. CPU Standard (%) | - | 150% | 44% | 127% |
| Cost vs. CPU Enhanced (%) | 67% | - | 29% | 85% |
| Session Density | 40 | 24 | 140³ | 30³ |

This benchmark uses the following configuration⁴:

| Benchmark details | Value |
| --- | --- |
| Version | 13.4.0 |
| CPU | D16ds_v5 |
| GPU Standard | Standard_NC16as_T4_v3 |
| GPU Enhanced | Standard_NC8as_T4_v3 |
| Price Basis | Azure PAYG East US, Linux, Standard |

For GPU Operating Points, the transcribers and inference servers were run on a single VM node.

For its first session, each transcriber requires 0.25 cores for both Operating Points, with 1.2 GB of memory (Standard OP) or 3 GB of memory (Enhanced OP). Every additional session consumes 0.1 cores and 100 MB of memory.
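
A minimal sketch of that sizing rule, assuming it scales linearly per session as described; the function below is illustrative and not part of any Speechmatics tooling:

```python
# Illustrative sizing helper for one Real-Time transcriber, based on the
# figures above: the first session costs 0.25 cores plus 1.2 GB (Standard OP)
# or 3 GB (Enhanced OP); each additional session costs 0.1 cores and 0.1 GB.

def transcriber_resources(sessions: int, operating_point: str = "standard"):
    """Return (cores, memory_gb) needed by one transcriber handling `sessions`
    concurrent sessions. Names and signature are illustrative assumptions."""
    if sessions < 1:
        return 0.0, 0.0
    base_memory_gb = 1.2 if operating_point == "standard" else 3.0
    cores = 0.25 + 0.1 * (sessions - 1)
    memory_gb = base_memory_gb + 0.1 * (sessions - 1)
    return cores, memory_gb


# Example: 10 concurrent sessions on the Enhanced OP
# -> 0.25 + 9 * 0.1 = 1.15 cores and 3.0 + 9 * 0.1 = 3.9 GB of memory.
print(transcriber_resources(10, "enhanced"))
```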

Translation (GPU)

Translation running on a 4-core T4 has an RTF of roughly 0.008. It can handle up to 125 hours of batch audio per hour, or 125 Realtime Transcription streams. However, each translation target language is counted as a stream, so a single Realtime Transcription stream that requests 5 target languages places the same load on the Translation Inference Server as 5 transcription streams each requesting a single target language.
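
A small sketch of that stream accounting, using the 125-stream figure from the paragraph above (names and the capacity constant are illustrative):

```python
# Illustrative accounting for Translation load: each (stream, target language)
# pair counts as one translation stream against the ~125-stream capacity of a
# 4-core T4 Translation Inference Server quoted above.

T4_TRANSLATION_STREAM_CAPACITY = 125


def translation_load(target_languages_per_stream):
    """Total translation streams consumed, given the number of target
    languages requested by each transcription stream."""
    return sum(target_languages_per_stream)


# Example: 20 Realtime Transcription streams, each translating into 5 target
# languages, consume 100 of the 125 available translation streams.
load = translation_load([5] * 20)
print(load, load <= T4_TRANSLATION_STREAM_CAPACITY)
```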

Footnotes

  1. Throughput is measured as hours of audio per hour of runtime. A throughput of 50 would mean that in one hour, the system as a whole can transcribe 50 hours of audio.

  2. An RTF of 1 would mean that a one-hour file takes one hour to transcribe. An RTF of 0.1 would mean that a one-hour file takes six minutes to transcribe. Benchmark RTFs are representative for processing audio files over 20 minutes in duration using parallel=4.

  3. Multiple sessions are handled by a single worker configured with the required concurrency.

  4. Benchmark results reflect performance on a fully loaded inference server operating at the session density recommended for the respective GPU platform.