Performance and cost
Speech to text containers
This is a comparison of the performance and estimated running costs of transcription running on standard Azure VMs. The comparison highlights the maximum number of concurrent real-time sessions (session density) and the maximum batch-job throughput on a single instance.
Batch transcription
The benchmark uses the following configuration:
For GPU Operating Points, transcribers and inference servers were all run on a single VM node.
Realtime transcription
This benchmark uses the following configuration:
For GPU Operating Points, the transcribers and inference servers were run on a single VM node.
For the first session, each transcriber requires 0.25 cores for both OPs, with 1.2 GB of memory (Standard OP) or 3 GB of memory (Enhanced OP). Each additional session consumes 0.1 cores and 100 MB of memory.
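The per-session resource rule above can be expressed as a small sizing helper. This is a minimal sketch, not part of the product; the function name `session_resources` and its return shape are assumptions for illustration, while the constants come from the figures above.

```python
def session_resources(sessions: int, op: str = "standard") -> tuple[float, float]:
    """Estimate (cores, memory_gb) a transcriber needs for a given number
    of concurrent real-time sessions, per the rule above:
      - first session: 0.25 cores, plus 1.2 GB (Standard OP) or 3 GB (Enhanced OP)
      - each additional session: 0.1 cores and 100 MB (0.1 GB)
    Hypothetical helper for capacity planning, not an official API.
    """
    if sessions < 1:
        return 0.0, 0.0
    base_mem_gb = 1.2 if op == "standard" else 3.0
    cores = 0.25 + 0.1 * (sessions - 1)
    memory_gb = base_mem_gb + 0.1 * (sessions - 1)
    return cores, memory_gb
```

For example, ten concurrent Enhanced OP sessions would come to 0.25 + 9 × 0.1 = 1.15 cores and 3 + 9 × 0.1 = 3.9 GB of memory on this model.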
Translation (GPU)
Translation running on a 4-core T4 has an RTF of roughly 0.008, so it can handle up to 125 hours of batch audio per hour (1 / 0.008 = 125), or 125 Realtime Transcription streams. However, each translation target language is counted as a stream: a single Realtime Transcription stream that requests 5 target languages puts the same load on the Translation Inference Server as 5 transcription streams that each request a single target language.
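The stream-counting rule above can be sketched as follows. This is an illustrative model only; the names `translation_stream_load` and `fits_on_t4` and the list-of-counts input are assumptions, while the 125-stream capacity comes from the RTF figure quoted above.

```python
def translation_stream_load(target_lang_counts: list[int]) -> int:
    """Effective load on the Translation Inference Server, where each
    requested target language counts as one stream. The input is the
    number of target languages requested by each real-time transcription
    stream (hypothetical representation for illustration)."""
    return sum(target_lang_counts)

def fits_on_t4(target_lang_counts: list[int], capacity: int = 125) -> bool:
    """Check the combined load against the ~125-stream capacity of a
    4-core T4 (capacity = 1 / RTF = 1 / 0.008)."""
    return translation_stream_load(target_lang_counts) <= capacity
```

Under this model, one stream requesting 5 target languages and two streams requesting 1 each amount to 7 effective streams, and 25 streams with 5 target languages each would saturate the T4.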