
What is Speaker Diarization?

Speaker diarization automatically labels who is speaking at each moment in an audio file.
Example Output:
[00:00-00:05] Speaker 1 (Customer): "Hi, I'd like to place an order"
[00:05-00:12] Speaker 2 (Agent): "Hello! I'd be happy to help you"
[00:12-00:20] Speaker 1: "Great, I need..."
[00:20-00:35] Speaker 2: "Perfect, let me look that up for you"

Available Providers

AssemblyAI

Best for: Speaker identification with high accuracy
Speakers Detected: Up to 10+ simultaneously
Accuracy: High
Cost: ~$0.0075/min (standard)
Latency: Real-time processing
Languages: All languages
Special Features: Speaker labels automatically assigned

Deepgram (Alternative)

Best for: Speed and real-time performance
Speakers Detected: Identifies multiple speakers
Accuracy: Good
Cost: Included in standard pricing
Latency: Real-time
Languages: All supported languages
Special Features: Built into standard STT

Pyannote Audio (Advanced)

Best for: Open-source, self-hosted
Speakers Detected: Unlimited
Accuracy: Very high
Cost: Free (self-hosted)
Latency: Depends on infrastructure
Languages: Language-independent
Special Features: Customizable, offline capable

Use Cases

Customer Service

Scenario: Monitoring agent performance and call quality
Benefits:
✓ Verify agent followed procedures
✓ Identify who said what in disputes
✓ Measure talk time (agent vs customer)
✓ Training and coaching improvements

Multi-Department Calls

Scenario: Call involves multiple agents/departments
Benefits:
✓ Track transfer between agents
✓ Measure hold time
✓ Identify each speaker's role
✓ Billing and routing information

Compliance & Recording

Scenario: Regulatory requirements for call recording
Benefits:
✓ Clear transcript with speaker labels
✓ Legal evidence of who said what
✓ Audit trail for compliance
✓ Risk mitigation in disputes

Call Center Analytics

Scenario: Performance tracking and improvement
Benefits:
✓ Agent talk time vs listening time
✓ Customer engagement metrics
✓ Silence/hold time detection
✓ Conversation flow analysis
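
The talk-time and silence metrics above can be derived directly from diarized segments. The following is a minimal sketch, assuming segments are dicts with `speaker`, `start`, and `end` keys (in seconds) and that segments do not overlap. The function name `talk_time_summary` is illustrative and not part of any product API.

```python
from collections import defaultdict

def talk_time_summary(segments, call_duration):
    """Return per-speaker talk time (seconds), share of speech, and silence."""
    talk = defaultdict(float)
    for seg in segments:
        talk[seg["speaker"]] += seg["end"] - seg["start"]
    total_speech = sum(talk.values())
    # Silence estimate assumes segments do not overlap.
    silence = max(call_duration - total_speech, 0.0)
    share = {spk: (t / total_speech if total_speech else 0.0) for spk, t in talk.items()}
    return {"talk_seconds": dict(talk), "talk_share": share, "silence_seconds": silence}

segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 5.0},
    {"speaker": "SPEAKER_01", "start": 5.0, "end": 12.0},
    {"speaker": "SPEAKER_00", "start": 14.0, "end": 20.0},
]
summary = talk_time_summary(segments, call_duration=20.0)
print(summary)
```

In this example, SPEAKER_00 talks for 11 seconds, SPEAKER_01 for 7 seconds, and 2 seconds are unaccounted for (silence or hold).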

Setup

AssemblyAI Setup

Step 1: Configure STT
AssemblyAI diarization requires AssemblyAI as your STT provider.
  1. Go to Settings → Developer Settings
  2. Add AssemblyAI API key (if not already done)
  3. Save configuration
Step 2: Enable Diarization in Agent
  1. Create or Edit Agent
  2. Go to “Speech-to-Text” settings
  3. Select AssemblyAI as provider
  4. Enable “Speaker Diarization” option
  5. Set speaker count:
    • Auto-detect (default)
    • Or specify 2, 3, 4+ speakers
  6. Save
Step 3: Test
  1. Make test call with multiple participants
  2. Check transcript for speaker labels
  3. Verify accuracy
  4. Adjust if needed

Deepgram Setup

Deepgram includes diarization in standard pricing.
Step 1: Ensure Deepgram is Configured
  1. Go to Settings → Developer Settings
  2. Add Deepgram API key (if not done)
  3. Save
Step 2: Enable in Agent
  1. Create or Edit Agent
  2. Select Deepgram as STT provider
  3. Enable “Diarization” option
  4. Configure speaker count
  5. Save
Step 3: Test
  1. Make test call
  2. Review transcript for speaker identification
  3. Verify labels are accurate

Pyannote Audio Setup (Advanced)

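For self-hosted or advanced use:
Step 1: Install Pyannote
pip install pyannote.audio
pip install torch
Step 2: Download Model
from pyannote.audio import Pipeline
# The model is gated on Hugging Face: accept its terms, then pass an access token.
# (pyannote.audio 3.x takes use_auth_token; newer releases may name it token.)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.0", use_auth_token="YOUR_HF_TOKEN")
Step 3: Integrate with CallIntel
Requires custom implementation. Contact support for guidance.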
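
A custom integration would likely start by converting Pyannote's output into plain segments. The sketch below assumes the pipeline result behaves like a pyannote.audio 3.x annotation, whose `itertracks(yield_label=True)` yields `(turn, track, speaker)` triples with `turn.start` and `turn.end` in seconds. The helper name `annotation_to_segments` is hypothetical, and how the segments are then passed to CallIntel is not covered here.

```python
from collections import namedtuple

def annotation_to_segments(tracks):
    """Convert (turn, track, speaker) triples into plain dict segments.

    `tracks` is expected to look like the output of
    `diarization.itertracks(yield_label=True)` in pyannote.audio 3.x,
    where each `turn` has `.start` and `.end` in seconds.
    """
    return [
        {"speaker": speaker, "start": round(turn.start, 2), "end": round(turn.end, 2)}
        for turn, _track, speaker in tracks
    ]

# With pyannote installed, the input would come from something like:
#   diarization = pipeline("call.wav", num_speakers=2)
#   tracks = diarization.itertracks(yield_label=True)
# A stand-in object is used here so the sketch runs without the model.
Turn = namedtuple("Turn", ["start", "end"])
fake_tracks = [
    (Turn(0.0, 5.0), "A", "SPEAKER_00"),
    (Turn(5.0, 10.0), "B", "SPEAKER_01"),
]
segments = annotation_to_segments(fake_tracks)
print(segments)
```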

Configuration

Speaker Count

Auto-Detect:
Best For: Unknown number of speakers
Accuracy: High
Performance: Slightly slower
Recommended: Most cases
Specify Count:
If you know there are always 2 speakers (agent + customer):
- Set to 2 speakers
- Faster processing
- More accurate identification

Common Scenarios:
- 2: Standard agent + customer call
- 3: Agent + supervisor + customer
- 4+: Conference calls or multi-agent calls
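
For reference, the speaker-count choice roughly corresponds to options on the underlying provider APIs. The sketch below is an assumption about how such a mapping could look, not CallIntel's actual implementation: AssemblyAI's transcript request accepts `speaker_labels` and an optional `speakers_expected`, and Deepgram enables diarization with the `diarize` query parameter. The function name is hypothetical; check each provider's current documentation before relying on these fields.

```python
def provider_diarization_options(provider, speaker_count=None):
    """Build diarization options for a provider request.

    speaker_count=None means auto-detect.
    """
    if provider == "assemblyai":
        options = {"speaker_labels": True}
        if speaker_count is not None:
            # Hint the expected number of speakers (e.g. 2 for agent + customer).
            options["speakers_expected"] = speaker_count
        return options
    if provider == "deepgram":
        # Deepgram turns diarization on with a query parameter; this sketch
        # does not pass a speaker count to it.
        return {"diarize": "true"}
    raise ValueError(f"unsupported provider: {provider}")

print(provider_diarization_options("assemblyai", speaker_count=2))
print(provider_diarization_options("assemblyai"))
print(provider_diarization_options("deepgram"))
```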

Minimum Speaker Duration

Some systems allow configuring minimum speech duration:
Default: 0.5 seconds
Increase to: 1-2 seconds (reduces fragmentation)
Use: When speakers change rapidly
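
To illustrate how a minimum duration reduces fragmentation, here is a small post-processing sketch. It assumes time-ordered segments as dicts with `speaker`, `start`, and `end`; the function name and merge rule are illustrative, not the provider's internal algorithm.

```python
def merge_short_segments(segments, min_duration=1.0):
    """Fold segments shorter than min_duration into the preceding turn."""
    merged = []
    for seg in segments:
        seg = dict(seg)  # copy so the input is not mutated
        duration = seg["end"] - seg["start"]
        if merged and (duration < min_duration or merged[-1]["speaker"] == seg["speaker"]):
            # Too short to trust, or the same speaker continuing:
            # extend the previous turn instead of starting a new one.
            merged[-1]["end"] = max(merged[-1]["end"], seg["end"])
        else:
            # A short segment at the very start is kept, since there is
            # no earlier turn to absorb it.
            merged.append(seg)
    return merged

segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.0},
    {"speaker": "SPEAKER_01", "start": 4.0, "end": 4.3},  # 0.3 s blip
    {"speaker": "SPEAKER_00", "start": 4.3, "end": 9.0},
    {"speaker": "SPEAKER_01", "start": 9.0, "end": 15.0},
]
smoothed = merge_short_segments(segments, min_duration=1.0)
print(smoothed)
```

The 0.3-second blip is absorbed, leaving two turns: SPEAKER_00 from 0 to 9 seconds and SPEAKER_01 from 9 to 15 seconds. The trade-off is that genuine short interjections ("yes", "okay") are also absorbed.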

Clustering Method

Algorithm used to identify speakers:
Automatic (Default): System chooses best method
Spectral Clustering: Better for many speakers
Agglomerative: More stable
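
As a rough illustration of the agglomerative approach, the sketch below clusters toy 2-D "voice embeddings" with average linkage and a cosine-distance threshold. Real systems use high-dimensional embeddings from a neural model, and the threshold value, function names, and data here are assumptions for demonstration only.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def agglomerative_cluster(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering; returns one label per embedding."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(cosine_distance(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break  # remaining clusters are treated as different speakers
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [None] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Two segments that sound like one voice, two that sound like another.
embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
labels = agglomerative_cluster(embeddings)
print(labels)
```

Merging stops once the closest remaining clusters are farther apart than the threshold, so the number of speakers falls out of the data rather than being fixed in advance.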

Output Format

Transcript with Speaker Labels

Standard Format:
[00:00-00:05] SPEAKER_00: "Hello, this is customer service"
[00:05-00:10] SPEAKER_01: "Hi! I have a question about my order"
[00:10-00:15] SPEAKER_00: "I'd be happy to help!"
[00:15-00:25] SPEAKER_01: "Great, I ordered on Tuesday and..."

Labeled JSON Output

{
  "transcript": [
    {
      "speaker": "SPEAKER_00",
      "start": 0.0,
      "end": 5.0,
      "text": "Hello, this is customer service"
    },
    {
      "speaker": "SPEAKER_01",
      "start": 5.0,
      "end": 10.0,
      "text": "Hi! I have a question about my order"
    }
  ]
}
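
The JSON output can be converted to the standard transcript format with a few lines of code. This is a minimal sketch that assumes the field names shown above (`transcript`, `speaker`, `start`, `end`, `text`); the helper names are illustrative.

```python
import json

def format_timestamp(seconds):
    """Render seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def render_transcript(payload):
    """Turn the labeled JSON structure into one line per speaker turn."""
    lines = []
    for seg in payload["transcript"]:
        span = f"[{format_timestamp(seg['start'])}-{format_timestamp(seg['end'])}]"
        lines.append(f'{span} {seg["speaker"]}: "{seg["text"]}"')
    return "\n".join(lines)

raw = """
{
  "transcript": [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 5.0,
     "text": "Hello, this is customer service"},
    {"speaker": "SPEAKER_01", "start": 5.0, "end": 10.0,
     "text": "Hi! I have a question about my order"}
  ]
}
"""
payload = json.loads(raw)
rendered = render_transcript(payload)
print(rendered)
```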

Custom Speaker Names

Option 1: Generic Labels
Use SPEAKER_00, SPEAKER_01, etc. (default)
Option 2: Role-Based
Configure custom names:
SPEAKER_00: "Agent"
SPEAKER_01: "Customer"
Option 3: Actual Names
If known, specify:
SPEAKER_00: "John (Agent)"
SPEAKER_01: "Sarah (Customer)"
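
If you post-process transcripts yourself, renaming is a simple lookup. The sketch below assumes segments are dicts with a `speaker` key; the function name is hypothetical. Note that generic labels carry no identity, so which label corresponds to the agent should be confirmed per call rather than assumed.

```python
def apply_speaker_names(segments, names):
    """Replace generic labels with custom names; unknown labels are left as-is."""
    return [
        {**seg, "speaker": names.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]

segments = [
    {"speaker": "SPEAKER_00", "text": "Hello, this is customer service"},
    {"speaker": "SPEAKER_01", "text": "Hi! I have a question about my order"},
    {"speaker": "SPEAKER_02", "text": "(unmapped third voice)"},
]
names = {"SPEAKER_00": "Agent", "SPEAKER_01": "Customer"}
renamed = apply_speaker_names(segments, names)
print([seg["speaker"] for seg in renamed])
```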

Accuracy Considerations

Factors Affecting Accuracy

Positive Factors:
  • Clear audio quality
  • Distinct speakers (different voices)
  • Normal speaking volume
  • Minimal background noise
  • Correct speaker count specified
Negative Factors:
  • Poor audio quality
  • Overlapping speakers
  • Similar voices
  • Background noise
  • Incorrect speaker count

Improving Accuracy

1. Use High-Quality Audio
Recommended:
- 16kHz or higher sample rate
- Mono or stereo
- Minimal compression
- Clear voice capture
2. Specify Correct Speaker Count
If you know: "Always 2 speakers"
→ Set to 2 (faster and more accurate)

If you're unsure:
→ Use auto-detect
3. Minimize Overlapping Speech
The system handles overlapping speech, but accuracy drops:
- Overlapping voices are harder to distinguish
- Encourage natural turn-taking
4. Use Silence Gaps
Speakers with clear separation = better accuracy
Continuous talking = harder to distinguish
Multiple speakers = higher chance of errors
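
To check a recording against the 16 kHz recommendation above, the sample rate of a WAV file can be read with Python's standard `wave` module. This is a minimal sketch for uncompressed WAV files only; the function name and the temporary test file are illustrative.

```python
import os
import tempfile
import wave

def check_wav(path, min_sample_rate=16000):
    """Report sample rate and channel count, and whether the rate meets the minimum."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    return {"sample_rate": rate, "channels": channels, "ok": rate >= min_sample_rate}

# Create one second of 8 kHz mono silence to demonstrate a failing check.
path = os.path.join(tempfile.mkdtemp(), "sample.wav")
with wave.open(path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)      # 16-bit samples
    wav.setframerate(8000)
    wav.writeframes(b"\x00\x00" * 8000)

report = check_wav(path)
print(report)
```

Traditional telephone audio is often 8 kHz, so a failing check on call recordings is not unusual; it indicates that diarization accuracy may be lower than on wideband audio.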

Cost Analysis

Pricing Comparison

AssemblyAI Diarization:
Standard: ~$0.0075/minute
With Diarization: Additional $0.005/minute
Total: ~$0.0125/minute

Monthly Cost (1000 calls, 2 min avg):
1000 × 2 × $0.0125 = $25/month

Deepgram (Diarization Included):
Standard: ~$0.0043/minute
Diarization: Included
Total: ~$0.0043/minute

Monthly Cost (1000 calls, 2 min avg):
1000 × 2 × $0.0043 = $8.60/month

Recommendation: Use Deepgram if you need diarization frequently, since it is included in the standard price.
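
The monthly estimates above follow from calls × average minutes × per-minute rate. The sketch below reproduces them so you can substitute your own volumes; the rates are the approximate figures quoted in this section and should be checked against current provider pricing.

```python
def monthly_cost(calls, avg_minutes, rate_per_minute):
    """Estimated monthly cost in dollars, rounded to cents."""
    return round(calls * avg_minutes * rate_per_minute, 2)

# Approximate per-minute rates quoted in this section.
ASSEMBLYAI_RATE = 0.0075 + 0.005   # standard + diarization add-on
DEEPGRAM_RATE = 0.0043             # diarization included

assemblyai = monthly_cost(1000, 2, ASSEMBLYAI_RATE)
deepgram = monthly_cost(1000, 2, DEEPGRAM_RATE)
print(f"AssemblyAI: ${assemblyai:.2f}/month")
print(f"Deepgram:   ${deepgram:.2f}/month")
```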

Use Cases & Examples

Sales Call Recording

Setup:
1. Agent + Customer call
2. Speaker count: 2
3. Labels:
   - SPEAKER_00: "Agent (Sales)"
   - SPEAKER_01: "Customer"
Output:
[00:00] AGENT: "Hello! Thanks for calling ABC Company"
[00:05] CUSTOMER: "Hi, I saw your ad and wanted to know more"
[00:10] AGENT: "Great! Let me tell you about our products..."
Benefits:
  • Quality assurance
  • Sales coaching
  • Compliance
  • Training

Support Escalation

Setup:
3-way call:
- Initial agent
- Supervisor
- Customer

Speaker count: 3
Labels:
- SPEAKER_00: Agent 1
- SPEAKER_01: Agent 2 (Supervisor)
- SPEAKER_02: Customer
Benefits:
  • Track escalation handling
  • Training for supervisors
  • Document decisions
  • Quality control

Conference Call Transcription

Setup:
Multiple participants: 4-6
Auto-detect speakers
Let system identify each speaker

Labels:
- SPEAKER_00: Participant 1
- SPEAKER_01: Participant 2
- SPEAKER_02: Participant 3
- etc.
Benefits:
  • Complete transcript
  • Know who said what
  • Meeting notes
  • Action item tracking

Troubleshooting

Speakers are mislabeled or confused: Check audio quality, specify the correct speaker count, and verify that the speakers' voices are clearly distinct.
Transcript is split into too many short segments: Increase the minimum speaker duration setting, or verify that poor audio quality isn't causing the segmentation.
Overlapping speech is attributed incorrectly: Diarization has limitations with simultaneous speech. Encourage natural turn-taking in your process.
Diarization costs are too high: Switch to Deepgram (diarization included), or enable diarization only for calls that need it.
Processing is slow: Specify the exact speaker count instead of using auto-detect.

Best Practices

1. Use Auto-Detect When Unknown

Default to auto-detect
System handles most cases well
Switch to specified count only when needed

2. Enable Selectively

Not every call needs diarization
Only enable for:
- Quality assurance calls
- Escalations
- Training calls
- Compliance-critical calls

Saves on cost for standard calls

3. Monitor Accuracy

Weekly Review:
- Sample 5-10 calls with diarization
- Verify speaker identification
- Check for false positives
- Adjust settings if needed

4. Clear Audio Quality

Ensure:
- Good microphone quality
- Minimal background noise
- Proper volume levels
- Good internet connection

5. Document Speaker Roles

When processing transcripts:
- Label speakers by role (Agent, Supervisor, Customer)
- Add timestamps for easy reference
- Create searchable indexes
- Archive for compliance

Advanced Features

Custom Models (Enterprise)

Some providers offer custom diarization models:
Benefits:
- Trained on your specific voices
- Better accuracy for known speakers
- Speaker identification by name
- Customized thresholds

Cost: Enterprise pricing
Setup: Contact provider

Continuous Learning

System learns over time:
- More accurate as it processes your calls
- Adapts to your speakers
- Improves baseline accuracy
- Requires ongoing feedback

Performance Metrics

Key Metrics to Track

1. Diarization Error Rate (DER)
   - Lower is better (< 5% is excellent)
   
2. Speaker Count Accuracy
   - Correct identification of number of speakers
   
3. Speaker Labeling Accuracy
   - Correct assignment of labels to speakers
   
4. Processing Latency
   - Time to generate diarization
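
As a rough illustration of how DER is computed, the sketch below compares a hand-labeled reference against system output frame by frame. It is a simplified version under stated assumptions: no overlapping speech, no forgiveness collar around boundaries, and a brute-force search for the best label mapping (practical only for a handful of speakers). Function names and the sample data are illustrative; established evaluation tools handle these details more carefully.

```python
from itertools import permutations

def label_at(segments, t):
    """Return the speaker active at time t, or None during silence."""
    for seg in segments:
        if seg["start"] <= t < seg["end"]:
            return seg["speaker"]
    return None

def diarization_error_rate(reference, hypothesis, step=0.1):
    """(missed + false alarm + confused frames) / reference speech frames."""
    end = max(seg["end"] for seg in reference + hypothesis)
    n_frames = int(round(end / step))
    # Sample both labelings at the midpoint of each frame.
    frames = [
        (label_at(reference, (i + 0.5) * step), label_at(hypothesis, (i + 0.5) * step))
        for i in range(n_frames)
    ]
    ref_speech = sum(1 for ref, _ in frames if ref is not None)
    if ref_speech == 0:
        raise ValueError("reference contains no speech")
    ref_labels = sorted({ref for ref, _ in frames if ref is not None})
    hyp_labels = sorted({hyp for _, hyp in frames if hyp is not None})
    # Pad with None so every hypothesis label can be mapped (None = unmatched).
    targets = ref_labels + [None] * max(0, len(hyp_labels) - len(ref_labels))

    def errors(mapping):
        count = 0
        for ref, hyp in frames:
            if ref is None and hyp is None:
                continue  # both agree on silence
            if ref is None or hyp is None or mapping[hyp] != ref:
                count += 1  # false alarm, miss, or speaker confusion
        return count

    best = min(
        errors(dict(zip(hyp_labels, perm)))
        for perm in permutations(targets, len(hyp_labels))
    )
    return best / ref_speech

reference = [
    {"speaker": "Agent", "start": 0.0, "end": 5.0},
    {"speaker": "Customer", "start": 5.0, "end": 10.0},
]
hypothesis = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 6.0},   # boundary is 1 s late
    {"speaker": "SPEAKER_01", "start": 6.0, "end": 10.0},
]
der = diarization_error_rate(reference, hypothesis)
print(f"DER: {der:.1%}")
```

Here one second of the customer's speech is attributed to the agent, out of ten seconds of reference speech, giving a DER of 10%.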

See Also

Speech-to-Text

Configure STT providers

Call History

View and analyze call transcripts

Quality Assurance

Use diarization for call analysis and coaching

Support

Contact Support