
Feedback: speech-to-text-universal-streaming-message-sequence

Original URL: https://www.assemblyai.com/docs/speech-to-text/universal-streaming/message-sequence
Category: speech-to-text
Generated: 05/08/2025, 4:22:55 pm



# Technical Documentation Analysis & Feedback


This documentation provides a good visual demonstration of the streaming API message sequence, but it lacks the critical context and guidance users need to understand the system and implement it effectively.

**Issue**: Users are thrown directly into JSON examples without understanding the fundamental concepts.

**Solution**: Add an introductory section:

## Overview
The Universal Streaming API processes speech in real-time, sending multiple messages as it refines its understanding of what was spoken. This page shows how a single utterance ("Hi, my name is Sonny") evolves from initial partial transcripts to the final formatted result.
### Key Concepts
- **Partial Transcripts**: Intermediate results as speech is being processed
- **Final Transcripts**: Completed transcription with two versions (unformatted and formatted)
- **Turn**: A complete speech segment from one speaker
- **Word Finalization**: Individual words become "final" before the entire turn is complete
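
To make these concepts concrete for readers, a minimal dispatch sketch could accompany the overview. It relies only on the `end_of_turn` and `turn_is_formatted` fields shown on this page; the `render*` functions are hypothetical app hooks, not part of the API:

```javascript
// Minimal sketch: route a Turn message by its lifecycle stage.
// renderPartial/renderFinal/renderFormatted are hypothetical app functions.
function routeTurnMessage(message) {
  if (!message.end_of_turn) {
    renderPartial(message.transcript); // still evolving
  } else if (!message.turn_is_formatted) {
    renderFinal(message.transcript); // unformatted final
  } else {
    renderFormatted(message.transcript); // formatted final, ends the turn
  }
}
```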

**Issue**: Timestamp values (1440, 1520, etc.) have no explanation of units.

**Solution**: Add a note before the first example:

<Note>
Timestamps in the `start` and `end` fields are in milliseconds from the beginning of the audio stream.
</Note>
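
A short helper makes the unit concrete; `formatOffset` is an illustrative name, not part of the API:

```javascript
// Convert a millisecond offset into a human-readable m:ss.mmm string.
function formatOffset(ms) {
  const minutes = Math.floor(ms / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  const millis = ms % 1000;
  return `${minutes}:${String(seconds).padStart(2, '0')}.${String(millis).padStart(3, '0')}`;
}

formatOffset(1440); // "0:01.440" — the start of "hi" in the examples below
```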

**Issue**: No explanation of when or why messages are sent.

**Solution**: Add explanatory text between examples:

### Message 1: First Word Detected
The API detects the first word "hi" but keeps `word_is_final: false` as it's still processing.
### Message 2: Word Becomes Final
Notice that "hi" now has `word_is_final: true`, indicating the API is confident in this transcription. A new word "name" appears but isn't final yet.
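
A sketch could make this progression tangible by logging each word as it flips to final. It assumes every message carries the full word list for the current turn, as the examples on this page do:

```javascript
// Sketch: report words that became final since the previous message.
let lastTurn = -1;
let finalCount = 0;

function onTurnMessage(message) {
  if (message.turn_order !== lastTurn) { // new turn: reset the counter
    lastTurn = message.turn_order;
    finalCount = 0;
  }
  const finalWords = message.words.filter(w => w.word_is_final);
  for (const w of finalWords.slice(finalCount)) {
    console.log(`"${w.text}" finalized at ${w.end}ms`);
  }
  finalCount = finalWords.length;
}
```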

Instead of raw JSON blocks, annotate key changes:

```json
{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false, // ← Still processing this turn
  "transcript": "hi my name is",
  "end_of_turn_confidence": 0.017141787335276604, // ← Low confidence = more speech expected
  "words": [
    // ... previous words now final
    {
      "start": 2320,
      "end": 2400,
      "text": "son", // ← Partial word, will likely change
      "confidence": 0.471368670463562, // ← Low confidence
      "word_is_final": false // ← Still being refined
    }
  ]
}
```
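
The `end_of_turn_confidence` annotation also suggests a practical use: treating a rising value as an early "speaker is wrapping up" hint. In this sketch, the 0.7 threshold is an illustrative assumption, not a documented value, and `prepareResponse` is a hypothetical app hook:

```javascript
// Sketch: pre-warm downstream work when the turn looks close to ending.
function onTurnUpdate(message) {
  if (!message.end_of_turn && message.end_of_turn_confidence > 0.7) {
    prepareResponse(message.transcript);
  }
}
```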

Add a visual representation:

## Timeline Visualization

```
0ms    1440ms   1600ms   1680ms   2320ms   3040ms
 |       |        |        |        |        |
        "hi"     "my"    "name"    "is"   "sonny"
         ↓        ↓        ↓        ↓        ↓
        final    final    final    final   final
```

### 3. **Summary Table**
Add a comparison table:
| Message # | Transcript State | Key Changes | End of Turn Confidence |
|-----------|------------------|-------------|------------------------|
| 1 | "hi" | First word detected | 0.68 |
| 2 | "hi my" | "hi" becomes final, "my" added | 0.004 |
| ... | ... | ... | ... |
## User Experience Pain Points
### 1. **No Error Handling Examples**
**Issue**: Only shows successful transcription.
**Solution**: Add a section:
```markdown
## Handling Corrections and Revisions
Sometimes the API will revise earlier words. Here's how a correction looks:
[Show example where earlier "final" words change in subsequent messages]
```
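
Pending a documented example, a defensive pattern worth showing is to treat each message as the source of truth for its whole turn rather than appending deltas; `render` here stands in for the app's own display function:

```javascript
// Sketch: re-render the whole turn from every message so that revised
// words (even previously "final" ones) are always displayed correctly.
const turns = new Map(); // turn_order -> latest transcript for that turn

function onMessage(message) {
  turns.set(message.turn_order, message.transcript);
  render([...turns.values()].join(' '));
}
```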

**Issue**: Developers don’t know how to handle these messages in code.

**Solution**: Add practical examples:

## Implementation Tips
### Handling Partial Transcripts
```javascript
function handlePartialTranscript(message) {
  // Only display words that are final, for a stable UI
  const finalWords = message.words.filter(word => word.word_is_final);
  // displayText stands in for the app's own render function
  displayText(finalWords.map(w => w.text).join(' '));
}
```

**Issue**: No information about message frequency or volume.

**Solution**: Add:

## Message Frequency
- Partial transcripts: Sent every 100-200ms during active speech
- Final transcripts: Sent when turn completion is detected
- Formatted transcripts: Sent immediately after unformatted final transcript
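
Given that cadence, throttling UI updates is a cheap way to avoid flicker; the 250ms interval below is an arbitrary choice, and `render` is again an assumed app function:

```javascript
// Sketch: render at most ~4 times per second, but always render turn ends.
let lastRender = 0;

function maybeRender(message) {
  const now = Date.now();
  if (message.end_of_turn || now - lastRender >= 250) {
    render(message.transcript);
    lastRender = now;
  }
}
```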

**Issue**: Some start/end times are identical between different words (1600ms for both “my” and “name”).

**Solution**: Either fix the data or explain why this occurs:

<Note>
Words may share identical timestamps when they're detected simultaneously or when timestamp precision is insufficient for very fast speech.
</Note>
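
This matters to any consumer that sorts words by timestamp. One reassuring detail worth noting alongside it: `Array.prototype.sort` is stable in modern JavaScript engines, so tied words keep their original order:

```javascript
// Sketch: order words by start time. Words sharing a start keep their
// original array order because Array.prototype.sort is stable.
function sortWordsByStart(words) {
  return [...words].sort((a, b) => a.start - b.start);
}
```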

**Issue**: No guidance on interpreting confidence values.

**Solution**: Add:

## Understanding Confidence Scores
- **0.9+**: High confidence, rarely changes
- **0.7-0.9**: Good confidence, may occasionally be revised
- **Below 0.7**: Lower confidence, more likely to change in subsequent messages
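
These bands map naturally onto UI treatment. In this sketch the thresholds come from the list above, while the class names are assumptions:

```javascript
// Sketch: style each word by its confidence band (class names are made up).
function confidenceClass(word) {
  if (word.confidence >= 0.9) return 'word-high';
  if (word.confidence >= 0.7) return 'word-medium';
  return 'word-low'; // likely to be revised; consider dimming in the UI
}
```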

## Additional Recommendations

1. Add a “What’s Next” section linking to implementation guides
2. Include common gotchas (e.g., “Don’t assume `word_is_final` means it won’t change”)
3. Add filtering examples showing how to extract only completed text
4. Include troubleshooting for common integration issues
## Suggested Page Structure

```markdown
# Streaming API: Message Sequence Breakdown
## Overview
[Conceptual explanation]
## Understanding the Timeline
[Visual timeline and timestamp explanation]
## Partial Transcripts: Step by Step
[Current examples with annotations]
## Final Transcripts
[Current examples with explanations]
## Implementation Guide
[Code examples and best practices]
## Common Scenarios
[Edge cases, corrections, error handling]
## Troubleshooting
[Common issues and solutions]
```

This restructured approach would transform the documentation from a simple example dump into a comprehensive guide that helps users both understand the concept and implement it successfully.