Improve the streaming transcription experience with Amazon Transcribe partial results stabilization

Whether you’re watching a live broadcast of your favorite soccer team, having a video chat with a vendor, or calling your bank about a loan payment, streaming speech content is everywhere. You can apply a streaming transcription service to generate subtitles for content understanding and accessibility, to create metadata to enable search, or to extract insights for call analytics. These transcription services process streaming audio content and generate partial transcription results until they provide a final transcription for a segment of continuous speech. However, some words or phrases in these partial results might change as the service further understands the context of the audio.

We’re happy to announce that Amazon Transcribe now allows you to enable and configure partial results stabilization for streaming audio transcriptions. Amazon Transcribe is an automatic speech recognition (ASR) service that enables developers to add real-time speech-to-text capabilities into their applications for on-demand and streaming content. Instead of waiting for an entire sentence to be transcribed, you can now control the stabilization level of partial results. Amazon Transcribe offers three settings: High, Medium, and Low. Setting the stabilization level to High fixes a greater portion of the partial results, with only the last few words changing during the transcription process. This feature gives you more flexibility in your streaming transcription workflows based on the user experience you want to create.

In this post, we walk through the benefits of this feature and how to enable it via the Amazon Transcribe console or the API.

How partial results stabilization works

Let’s dive deeper into this with an example.

During your daily conversations, you may think you hear a certain word or phrase, but later realize that it was incorrect based on additional context. Let’s say you were talking to someone about food, and you heard them say “Tonight, I will eat a pear…” However, when the speaker finishes, you realize they actually said “Tonight, I will eat a pair of pancakes.” Just as people revise their understanding based on the information at hand, Amazon Transcribe uses machine learning (ML) to self-correct the transcription of streaming audio based on the context it receives. To enable this, Amazon Transcribe uses partial results.

During the streaming transcription process, Amazon Transcribe outputs chunks of the results with an IsPartial flag. Results with this flag marked as true are the ones that Amazon Transcribe may change in the future depending on the additional context received. After Amazon Transcribe determines that it has sufficient context to exceed a certain confidence threshold, the results are stabilized and the IsPartial flag for that specific partial result is marked false. The window size of these partial results can range from a few words to multiple sentences depending on the stream context.
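
As a minimal sketch of how a client might act on this flag, the following Python snippet separates results that can still change from results that are final. It assumes each result is a dictionary shaped like the sample API response shown later in this post; the function name split_partial_and_final and the sample data are hypothetical and used only for illustration.

```python
from typing import Iterable, List, Tuple


def split_partial_and_final(results: Iterable[dict]) -> Tuple[List[str], List[str]]:
    """Separate transcripts that may still change from those that are final."""
    partial_updates: List[str] = []
    finals: List[str] = []
    for result in results:
        alternatives = result.get("Alternatives", [])
        if not alternatives:
            continue  # nothing transcribed for this result yet
        text = alternatives[0]["Transcript"]
        if result["IsPartial"]:
            # Amazon Transcribe may still revise this text.
            partial_updates.append(text)
        else:
            # The segment is complete and no longer changes.
            finals.append(text)
    return partial_updates, finals


# Hypothetical results for one speech segment (not real service output).
sample_results = [
    {"ResultId": "r1", "IsPartial": True,
     "Alternatives": [{"Transcript": "Tonight I will eat a pear"}]},
    {"ResultId": "r1", "IsPartial": False,
     "Alternatives": [{"Transcript": "Tonight I will eat a pair of pancakes."}]},
]

partial_updates, finals = split_partial_and_final(sample_results)
print("Partial:", partial_updates)
print("Final:", finals)
```

A display layer would typically overwrite the previous partial text for the same ResultId each time a new partial result arrives, and commit the text only when IsPartial is false.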

The following image displays how the partial results are generated (and edited) in Amazon Transcribe for streaming transcription.

Results stabilization gives you more control over the trade-off between latency and accuracy in transcription results. Depending on the use case, you may prioritize one over the other. For example, when providing live subtitles, high stabilization of results may be preferred because speed is more important than accuracy. On the other hand, for use cases like content moderation, lower stabilization is preferred because accuracy may be more important than latency.

A high stability level enables quicker stabilization of transcription results by limiting the window of context for stabilizing results, but can lead to lower overall accuracy. On the other hand, a low stability level leads to more accurate transcription results, but the partial transcription results are more likely to change.

With the streaming transcription API, you can now control the stability of the partial results in your transcription stream.

Now let’s look at how to use the feature.

Access partial results stabilization via the Amazon Transcribe console

To start using partial results stabilization on the Amazon Transcribe console, complete the following steps:

  1. On the Amazon Transcribe console, make sure you’re in a Region that supports Amazon Transcribe Streaming.

For this post, we use us-east-1.

  2. In the navigation pane, choose Real-time transcription.
  3. Under Additional settings, enable Partial results stabilization.
  4. Select your stability level.

You can choose between three levels:

  • High – Provides the most stable partial transcription results with lower accuracy compared to Medium and Low settings. Results are less likely to change as additional context is gathered.
  • Medium – Provides partial transcription results that have a balance between stability and accuracy.
  • Low – Provides relatively less stable partial transcription results with higher accuracy compared to High and Medium settings. Results get updated as additional context is gathered and utilized.

  5. Choose Start streaming to play a stream and check the results.

Access partial results stabilization via the API

In this section, we demonstrate streaming with HTTP/2. You can enable your preferred level of partial results stabilization in an API request.

You enable this feature with two input parameters, the enable-partial-results-stabilization flag and the partial-results-stability level:

POST /stream-transcription HTTP/2 
x-amzn-transcribe-language-code: LanguageCode 
x-amzn-transcribe-sample-rate: MediaSampleRateHertz 
x-amzn-transcribe-media-encoding: MediaEncoding 
x-amzn-transcribe-session-id: SessionId 
x-amzn-transcribe-enable-partial-results-stabilization: true
x-amzn-transcribe-partial-results-stability: low | medium | high
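
The headers above can be assembled programmatically. The following Python sketch builds only the header dictionary and validates the stability level. The helper name build_stream_headers and the example values are assumptions made for illustration, and a real request also needs request signing and the audio event stream, which are omitted here.

```python
from typing import Dict, Optional

VALID_STABILITY_LEVELS = ("low", "medium", "high")


def build_stream_headers(
    language_code: str,
    sample_rate_hz: int,
    media_encoding: str,
    stability: Optional[str] = None,
) -> Dict[str, str]:
    """Build the Amazon Transcribe streaming headers shown above.

    Hypothetical helper: pass stability=None to leave stabilization off.
    """
    headers = {
        "x-amzn-transcribe-language-code": language_code,
        "x-amzn-transcribe-sample-rate": str(sample_rate_hz),
        "x-amzn-transcribe-media-encoding": media_encoding,
    }
    if stability is not None:
        if stability not in VALID_STABILITY_LEVELS:
            raise ValueError(
                f"stability must be one of {VALID_STABILITY_LEVELS}, got {stability!r}"
            )
        # Both headers are sent together: the flag turns the feature on,
        # and the level selects how aggressively results are stabilized.
        headers["x-amzn-transcribe-enable-partial-results-stabilization"] = "true"
        headers["x-amzn-transcribe-partial-results-stability"] = stability
    return headers


# Example values are illustrative only.
headers = build_stream_headers("en-US", 16000, "pcm", stability="high")
for name, value in headers.items():
    print(f"{name}: {value}")
```

Validating the level on the client side surfaces a typo such as "hi" before the stream is opened.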

Enabling partial results stabilization adds a Stable flag to each item in the transcription results of the API response. If an item in a partial result has the Stable flag marked as true, the transcription of that item doesn’t change, regardless of any subsequent context identified by Amazon Transcribe. If the Stable flag is marked as false, the item may still change until the IsPartial flag for the result is marked as false.

The following code shows our API response:

{
    "Alternatives": [
        {
            "Items": [
                {
                    "Confidence": 0,
                    "Content": "Amazon",
                    "EndTime": 1.22,
                    "Stable": true,
                    "StartTime": 0.78,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                },
                {
                    "Confidence": 0,
                    "Content": "is",
                    "EndTime": 1.63,
                    "Stable": true,
                    "StartTime": 1.46,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                },
                {
                    "Confidence": 0,
                    "Content": "the",
                    "EndTime": 1.76,
                    "Stable": true,
                    "StartTime": 1.64,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                },
                {
                    "Confidence": 0,
                    "Content": "largest",
                    "EndTime": 2.31,
                    "Stable": true,
                    "StartTime": 1.77,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                },
                {
                    "Confidence": 1,
                    "Content": "rainforest",
                    "EndTime": 3.34,
                    "Stable": true,
                    "StartTime": 2.4,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                }
            ],
            "Transcript": "Amazon is the largest rainforest "
        }
    ],
    "EndTime": 4.33,
    "IsPartial": false,
    "ResultId": "f4b5d4dd-b685-4736-b883-795dc3f7f636",
    "StartTime": 0.78
}
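
One way to use the Stable flag is to render the stable part of a partial result differently from the part that may still change. The following Python sketch splits a result into its leading run of stable items and the remaining tail. The function name split_stable_prefix and the sample partial result are hypothetical, and joining items with spaces is a simplification that ignores punctuation items.

```python
from typing import Tuple


def split_stable_prefix(result: dict) -> Tuple[str, str]:
    """Return (stable_text, changeable_text) for one transcription result."""
    items = result["Alternatives"][0]["Items"]
    stable_words = []
    changeable_words = []
    seen_unstable = False
    for item in items:
        # Treat everything after the first non-stable item as changeable,
        # so only the leading stable run is shown as fixed.
        if not item.get("Stable", False):
            seen_unstable = True
        if seen_unstable:
            changeable_words.append(item["Content"])
        else:
            stable_words.append(item["Content"])
    return " ".join(stable_words), " ".join(changeable_words)


# Hypothetical partial result, trimmed to the fields this sketch reads.
sample_partial = {
    "IsPartial": True,
    "Alternatives": [
        {
            "Items": [
                {"Content": "Amazon", "Stable": True},
                {"Content": "is", "Stable": True},
                {"Content": "the", "Stable": True},
                {"Content": "largest", "Stable": True},
                {"Content": "rain", "Stable": False},
            ],
            "Transcript": "Amazon is the largest rain",
        }
    ],
}

stable_text, changeable_text = split_stable_prefix(sample_partial)
print("Stable:", stable_text)
print("May change:", changeable_text)
```

A subtitle renderer could, for example, draw the stable text in a solid color and the changeable tail in a lighter one.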

Conclusion

This post introduces the recently launched partial results stabilization feature in Amazon Transcribe. For more information, see the Amazon Transcribe Partial results stabilization documentation.

To learn more about the Amazon Transcribe Streaming Transcription API, check out Using Amazon Transcribe streaming With HTTP/2 and Using Amazon Transcribe streaming with WebSockets.


About the Author

Alex Chirayath is an SDE in the Amazon Machine Learning Solutions Lab. He helps customers adopt AWS AI services by building solutions to address common business problems.
