Feedback: guides-custom-vocab-lemur

Original URL: https://www.assemblyai.com/docs/guides/custom-vocab-lemur
Category: guides
Generated: 05/08/2025, 4:41:55 pm


Technical Documentation Analysis: LeMUR Custom Vocab Guide

This documentation provides a functional cookbook-style guide but has several clarity, structure, and completeness issues that impact user experience. The code works, but the explanation and organization need significant improvement.

Problem: The title suggests this is about “boosting transcription accuracy” but it’s actually about post-processing corrections.

Fix:

  • Change title to “Post-Process Transcriptions with Custom Vocabulary Using LeMUR”
  • Clarify in the introduction that this corrects transcripts after transcription, not during

Missing comma in word_list:

# Current (broken):
word_list = [
    'Azj-Kahet',
    'Neferess',
    "Ny'alotha",
    "Xal'atath"  # Missing comma: Python joins this with the next string literal
    "Ansurek"
]
# Fixed:
word_list = [
    'Azj-Kahet',
    'Neferess',
    "Ny'alotha",
    "Xal'atath",  # Added comma
    "Ansurek"
]
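The missing comma is easy to overlook because it is not a syntax error: Python concatenates adjacent string literals, so the list silently ends up with four entries instead of five. A minimal sketch of the effect follows; the names `broken` and `fixed` are illustrative only and do not appear in the guide.

```python
# Adjacent string literals are concatenated by Python,
# so a missing comma merges two list items into one.
broken = [
    'Azj-Kahet',
    'Neferess',
    "Ny'alotha",
    "Xal'atath"  # no comma: joined with the next literal
    "Ansurek"
]

fixed = [
    'Azj-Kahet',
    'Neferess',
    "Ny'alotha",
    "Xal'atath",
    "Ansurek"
]

print(len(broken), broken[-1])  # 4 Xal'atathAnsurek
print(len(fixed), fixed[-1])    # 5 Ansurek
```

With the broken list, neither "Xal'atath" nor "Ansurek" is present as a separate vocabulary entry, so neither term can be matched on its own.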

Broken print statement:

# Current (broken):
print(colored("Confidence is less than 0.25", correction["original_word"], correction["corrected_word"], correction["confidence"], "red"))
# Fixed:
print(colored(f"Low confidence ({correction['confidence']:.2f}): {correction['original_word']} -> {correction['corrected_word']}", "red"))
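One way to keep the message formatting separate from the coloring is sketched below. The helper name `low_confidence_message` and its `threshold` parameter are hypothetical; the dictionary keys are the ones used in the print statement above, and the sample values are made up for illustration.

```python
def low_confidence_message(correction, threshold=0.25):
    """Return a warning string for a low-confidence correction, else None."""
    if correction["confidence"] >= threshold:
        return None
    return (
        f"Low confidence ({correction['confidence']:.2f}): "
        f"{correction['original_word']} -> {correction['corrected_word']}"
    )


# Illustrative correction record (values are invented for this sketch)
example = {
    "original_word": "Zalatath",
    "corrected_word": "Xal'atath",
    "confidence": 0.2,
}
print(low_confidence_message(example))  # Low confidence (0.20): Zalatath -> Xal'atath
```

The returned string can then be passed to `colored(message, "red")` as the single text argument, which avoids the original mistake of passing the fields as extra positional arguments.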

Problem: No error handling for JSON parsing, API failures, or malformed responses.

Fix: Add comprehensive error handling:

import json

def correct_sentence(sentence, word_list = []):
    try:
        response = assemblyai.Lemur().task(
            prompt=prompt,  # prompt is defined earlier in the guide
            input_text="Sentence: {}\nWord List: {}".format(sentence, ", ".join(word_list)),
            final_model=assemblyai.LemurModel.claude3_5_sonnet
        )
        return response.response
    except Exception as e:
        print(f"Error processing sentence: {e}")
        return "[]"  # Return an empty JSON array on error

def correct_transcript(transcript, word_list = []):
    # ... existing code ...
    try:
        # Use json.loads so the call matches the json.JSONDecodeError handler below
        corrections_json = json.loads(corrections)
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON response: {e}")
        corrections_json = []
    # ... rest of function
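Catching the decode error covers invalid JSON, but a response can also be valid JSON with the wrong shape (for example an object instead of an array, or entries missing fields). A minimal sketch of a stricter parser follows. The function name `parse_corrections` and the `REQUIRED_KEYS` set are assumptions for illustration; the three keys are the ones the guide's print statement reads.

```python
import json

# Keys each correction is assumed to carry (taken from the guide's print statement)
REQUIRED_KEYS = {"original_word", "corrected_word", "confidence"}


def parse_corrections(raw):
    """Parse a model response into a list of well-formed correction dicts."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Invalid JSON, or raw was not a string at all (e.g. None)
        return []
    if not isinstance(data, list):
        # Valid JSON but not the expected array of corrections
        return []
    # Keep only entries that are dicts containing every required key
    return [
        item for item in data
        if isinstance(item, dict) and REQUIRED_KEYS.issubset(item)
    ]
```

Returning an empty list in every failure case means the caller can iterate over the result without further checks, at the cost of silently skipping corrections for that sentence.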

Problem: Code appears before explanation, making it hard to follow.

Recommended Structure:

# Post-Process Transcriptions with Custom Vocabulary Using LeMUR
## Overview
Brief explanation of what this does and when to use it
## Prerequisites
- AssemblyAI account and API key
- Python environment setup
## How It Works
Step-by-step explanation of the process
## Implementation
### Step 1: Setup
### Step 2: Basic Transcription
### Step 3: Custom Vocabulary Correction
### Step 4: Running the Complete Example
## Advanced Usage
## Cost Considerations
## Troubleshooting

Add:

## Prerequisites
### Required Dependencies
```bash
pip install -U assemblyai termcolor
```
  1. Sign up for an AssemblyAI account
  2. Get your API key from your dashboard
  3. Set your API key in the code or as an environment variable

  • Python 3.7+
  • Internet connection for API calls

Missing Information

Cost and Performance Details

Add:

## Important Considerations
### Cost Impact
- Each sentence requires a separate LeMUR API call
- For a 5-minute transcript with ~50 sentences, expect ~50 API calls
- Use `claude3_haiku` for cost optimization (up to 60% savings)
### Performance Expectations
- Processing time: ~2-3 seconds per sentence
- Rate limits: Contact support@assemblyai.com if you hit RPM limits
- Best for: Transcripts with <100 sentences for reasonable processing time
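
The figures above lend themselves to a quick back-of-the-envelope estimate. The sketch below assumes one API call per sentence, calls made sequentially, and the rough 2-3 seconds per call quoted above; these are the review's estimates, not measured benchmarks, and the function name `estimate_run` is hypothetical.

```python
def estimate_run(sentence_count, seconds_per_call=(2.0, 3.0)):
    """Estimate API calls and sequential processing time for one transcript."""
    low, high = seconds_per_call
    return {
        "api_calls": sentence_count,            # one call per sentence
        "min_seconds": sentence_count * low,    # optimistic total time
        "max_seconds": sentence_count * high,   # pessimistic total time
    }


print(estimate_run(50))
# {'api_calls': 50, 'min_seconds': 100.0, 'max_seconds': 150.0}
```

Under these assumptions, the 50-sentence example takes roughly two to two and a half minutes, and a 100-sentence transcript about twice that.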

Add:

## Configuration Options
### Confidence Threshold
```python
# Adjust this value based on your accuracy needs
confidence_threshold = 0.25  # Lower = more corrections, higher = fewer corrections
# For speed and cost optimization:
final_model=assemblyai.LemurModel.claude3_haiku
# For maximum accuracy:
final_model=assemblyai.LemurModel.claude3_5_sonnet
```

Troubleshooting Section

Add:

## Troubleshooting
### Common Issues
**Rate Limit Errors**
- Reduce request frequency or contact support for limit increase
- Consider batching sentences for processing

**JSON Parsing Errors**
- LeMUR may return malformed JSON occasionally
- The code includes error handling for this scenario

**Poor Correction Quality**
- Adjust confidence threshold
- Refine your custom vocabulary list
- Consider switching to claude3_5_sonnet for better accuracy
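
The batching suggestion above could start from something as simple as grouping sentences before they are sent. This is a minimal sketch only: the helper name `batch_sentences` and the default batch size of 10 are assumptions, and the guide's prompt would need to be adapted to return corrections for several sentences at once, which is not shown here.

```python
def batch_sentences(sentences, batch_size=10):
    """Split a list of sentences into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        sentences[i:i + batch_size]
        for i in range(0, len(sentences), batch_size)
    ]


# 25 placeholder sentences grouped into batches of 10
sentences = [f"Sentence {n}." for n in range(1, 26)]
batches = batch_sentences(sentences, batch_size=10)
print([len(b) for b in batches])  # [10, 10, 5]
```

With a batch size of 10, the 50-sentence example would need 5 requests instead of 50, which also lowers the chance of hitting request-per-minute limits.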

Add before the current example:

## Use Cases
This solution works best for correcting terms like the following (mis-transcription → corrected term):
- **Company/Brand Names**: "Sprinkler" → "Sprinklr"
- **Technical Terms**: "communities" → "Kubernetes"
- **Proper Nouns**: "Zalatath" → "Xal'atath"
- **Industry Jargon**: Medical, legal, or technical terminology
## Basic Example
Here's a simple example before we dive into the complete implementation:
```python
# Simple correction example
word_list = ["AssemblyAI", "LeMUR", "Kubernetes"]
transcript_text = "The assembly eye platform uses lemur and communities"
# After processing: "The AssemblyAI platform uses LeMUR and Kubernetes"
```

Add to functions:

def correct_transcript(transcript, word_list = []):
    # Validate the transcript first so the check below can safely read transcript.text
    if not transcript or not transcript.text:
        raise ValueError("Invalid transcript provided")
    if not word_list:
        print("Warning: Empty word list provided. No corrections will be made.")
        return transcript.text
    # ... rest of function

  1. Restructure the entire document following the suggested organization
  2. Fix all code errors before publication
  3. Add comprehensive error handling to all functions
  4. Include cost calculator or reference to help users estimate expenses
  5. Add performance benchmarks for different transcript lengths
  6. Create a simple “Quick Start” example before the complex World of Warcraft example
  7. Add environment variable setup for API keys instead of hardcoding
  8. Include validation for word list format and content
  9. Add logging options for debugging and monitoring corrections
  10. Consider adding a batch processing option to reduce API calls

This documentation has good foundational content but needs significant structural and technical improvements to provide a smooth user experience.