Speaker Identification & Diarization

What is Speaker Diarization?

Speaker diarization automatically labels who is speaking at each moment in an audio file.

Example Output:

[00:00-00:05] Speaker 1 (Customer): "Hi, I'd like to place an order"
[00:05-00:12] Speaker 2 (Agent): "Hello! I'd be happy to help you"
[00:12-00:20] Speaker 1: "Great, I need..."
[00:20-00:35] Speaker 2: "Perfect, let me look that up for you"

Available Providers

AssemblyAI (Recommended)

Best for: Speaker identification with high accuracy

Speakers Detected: 10 or more per recording
Accuracy: High
Cost: ~$0.0075/min (standard)
Latency: Real-time processing
Languages: All languages
Special Features: Speaker labels automatically assigned

Deepgram (Alternative)

Best for: Speed and real-time performance

Speakers Detected: Identifies multiple speakers
Accuracy: Good
Cost: Included in standard pricing
Latency: Real-time
Languages: All supported languages
Special Features: Built into standard STT

Pyannote Audio (Advanced)

Best for: Open-source, self-hosted

Speakers Detected: Unlimited
Accuracy: Very high
Cost: Free (self-hosted)
Latency: Depends on infrastructure
Languages: Language-independent
Special Features: Customizable, offline capable

Use Cases

Customer Service

Scenario: Monitoring agent performance and call quality

Benefits:
✓ Verify agent followed procedures
✓ Identify who said what in disputes
✓ Measure talk time (agent vs customer)
✓ Training and coaching improvements

Multi-Department Calls

Scenario: Call involves multiple agents/departments

Benefits:
✓ Track transfer between agents
✓ Measure hold time
✓ Identify each speaker's role
✓ Billing and routing information

Compliance & Recording

Scenario: Regulatory requirements for call recording

Benefits:
✓ Clear transcript with speaker labels
✓ Legal evidence of who said what
✓ Audit trail for compliance
✓ Risk mitigation in disputes

Call Center Analytics

Scenario: Performance tracking and improvement

Benefits:
✓ Agent talk time vs listening time
✓ Customer engagement metrics
✓ Silence/hold time detection
✓ Conversation flow analysis
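
Metrics like talk time and silence can be derived directly from diarized segments. The following is a minimal sketch, assuming segments carry `speaker`, `start`, and `end` (in seconds) as in the Labeled JSON Output section; the function names and the 1-second gap threshold are illustrative and not part of CallIntel.

```python
from collections import defaultdict

def talk_time_by_speaker(segments):
    """Sum speaking time in seconds for each speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)

def silence_gaps(segments, min_gap=1.0):
    """Return (start, end) gaps of at least min_gap seconds where nobody speaks."""
    ordered = sorted(segments, key=lambda s: s["start"])
    gaps = []
    latest_end = ordered[0]["end"] if ordered else 0.0
    for seg in ordered[1:]:
        if seg["start"] - latest_end >= min_gap:
            gaps.append((latest_end, seg["start"]))
        # Track the furthest end seen so overlapping segments are not counted as silence.
        latest_end = max(latest_end, seg["end"])
    return gaps

segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 5.0},
    {"speaker": "SPEAKER_01", "start": 5.0, "end": 10.0},
    {"speaker": "SPEAKER_00", "start": 13.0, "end": 15.0},
]
print(talk_time_by_speaker(segments))
print(silence_gaps(segments))
```

Dividing one speaker's total by the sum of all totals gives the agent-versus-customer talk ratio.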

Setup

AssemblyAI Setup (Recommended)

Step 1: Configure STT

AssemblyAI diarization requires using AssemblyAI as your STT provider.

Go to Settings → Developer Settings
Add AssemblyAI API key (if not already done)
Save configuration

Step 2: Enable Diarization in Agent

Create or Edit Agent
Go to “Speech-to-Text” settings
Select AssemblyAI as provider
Enable “Speaker Diarization” option
Set speaker count:
- Auto-detect (default)
- Or specify 2, 3, 4+ speakers
Save
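
For reference, when AssemblyAI is called directly rather than through the CallIntel settings, the same two choices map to request fields. This sketch only builds the request body and sends nothing. `speaker_labels` and `speakers_expected` are AssemblyAI transcript parameters; the function name and audio URL are placeholders, and how CallIntel passes these values internally is an assumption not covered by this page.

```python
import json

def build_assemblyai_request(audio_url, speaker_count=None):
    """Build a transcript request body with diarization enabled.

    speaker_count=None leaves the number of speakers to auto-detection.
    """
    body = {"audio_url": audio_url, "speaker_labels": True}
    if speaker_count is not None:
        body["speakers_expected"] = speaker_count
    return body

# Placeholder URL, for illustration only.
print(json.dumps(build_assemblyai_request("https://example.com/call.wav", 2), indent=2))
```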

Step 3: Test

Make test call with multiple participants
Check transcript for speaker labels
Verify accuracy
Adjust if needed

Deepgram Setup

Deepgram includes diarization in standard pricing.

Step 1: Ensure Deepgram is Configured

Go to Settings → Developer Settings
Add Deepgram API key (if not done)
Save

Step 2: Enable in Agent

Create or Edit Agent
Select Deepgram as STT provider
Enable “Diarization” option
Configure speaker count
Save
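
If you call Deepgram yourself instead of using the agent settings, diarization is switched on with the `diarize=true` query parameter on the `/v1/listen` endpoint. The sketch below only assembles the URL; the helper name is hypothetical, and no speaker-count parameter is shown because this page does not say how CallIntel forwards that setting.

```python
from urllib.parse import urlencode

DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"

def build_deepgram_url(diarize=True, **extra_params):
    """Return the listen endpoint URL with diarization and any extra query parameters."""
    params = {"diarize": "true" if diarize else "false", **extra_params}
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params)}"

print(build_deepgram_url(punctuate="true"))
```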

Step 3: Test

Make test call
Review transcript for speaker identification
Verify labels are accurate

Pyannote Audio Setup (Advanced)

For self-hosted or advanced use:

Step 1: Install Pyannote

pip install pyannote.audio
pip install torch

Step 2: Download Model

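from pyannote.audio import Pipeline
# "speaker-diarization" checkpoints are pipelines, not single models; this one is gated on
# Hugging Face, so accept its terms there and pass your access token (pyannote.audio 3.x).
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.0", use_auth_token="YOUR_HF_TOKEN")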
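
Running the pretrained pipeline on a file might look like the sketch below. It assumes pyannote.audio 3.x (where the token argument is `use_auth_token`) and a Hugging Face token with access to the gated checkpoint; `diarize_file` and `annotation_to_segments` are hypothetical helper names, and the output shape simply mirrors the JSON format used elsewhere on this page.

```python
def diarize_file(audio_path, hf_token, num_speakers=None):
    """Run the pretrained diarization pipeline on one audio file."""
    from pyannote.audio import Pipeline  # imported lazily; heavy dependency

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.0", use_auth_token=hf_token
    )
    # num_speakers is optional; omit it to let the pipeline estimate the count.
    if num_speakers is None:
        return pipeline(audio_path)
    return pipeline(audio_path, num_speakers=num_speakers)

def annotation_to_segments(annotation):
    """Convert a pyannote Annotation into plain dicts with speaker, start, and end."""
    return [
        {"speaker": speaker, "start": round(turn.start, 2), "end": round(turn.end, 2)}
        for turn, _, speaker in annotation.itertracks(yield_label=True)
    ]
```

A call such as `annotation_to_segments(diarize_file("call.wav", "YOUR_HF_TOKEN", num_speakers=2))` would then yield a list ready for the custom integration described in the next step.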

Step 3: Integrate with CallIntel

Requires custom implementation. Contact support for guidance.

Configuration

Speaker Count

Auto-Detect:

Best For: Unknown number of speakers
Accuracy: High
Performance: Slightly slower
Recommended: Most cases

Specify Count:

If you know there are always 2 speakers (agent + customer):
- Set to 2 speakers
- Faster processing
- More accurate identification

Common Scenarios:
- 2: Standard agent + customer call
- 3: Agent + supervisor + customer
- 4+: Conference calls or multi-agent calls

Minimum Speaker Duration

Some systems allow configuring minimum speech duration:

Default: 0.5 seconds
Increase to: 1-2 seconds (reduces fragmentation)
Use: When speakers change rapidly
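
Where a provider does not expose this setting, a similar effect can be approximated after the fact. This is a minimal post-processing sketch, assuming segments shaped like the JSON output on this page. Absorbing a short turn into the preceding one is an illustrative heuristic, not how any particular provider implements the setting.

```python
def merge_short_segments(segments, min_duration=1.0):
    """Absorb segments shorter than min_duration into the preceding segment,
    then join neighbouring segments that share a speaker.

    Only speaker/start/end are handled; any text field would need merging separately.
    """
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        seg = dict(seg)  # copy so the input list is left untouched
        too_short = seg["end"] - seg["start"] < min_duration
        if merged and (too_short or merged[-1]["speaker"] == seg["speaker"]):
            merged[-1]["end"] = max(merged[-1]["end"], seg["end"])
        else:
            merged.append(seg)
    return merged

segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.0},
    {"speaker": "SPEAKER_01", "start": 4.0, "end": 4.3},  # 0.3 s blip, likely a false change
    {"speaker": "SPEAKER_00", "start": 4.3, "end": 8.0},
    {"speaker": "SPEAKER_01", "start": 8.0, "end": 12.0},
]
print(merge_short_segments(segments))
```

With the sample data, the 0.3-second blip disappears and the two SPEAKER_00 turns become one segment from 0.0 to 8.0.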

Clustering Method

Algorithm used to identify speakers:

Automatic (Default): System chooses best method
Spectral Clustering: Better for many speakers
Agglomerative: More stable

Output Format

Transcript with Speaker Labels

Standard Format:

[00:00-00:05] SPEAKER_00: "Hello, this is customer service"
[00:05-00:10] SPEAKER_01: "Hi! I have a question about my order"
[00:10-00:15] SPEAKER_00: "I'd be happy to help!"
[00:15-00:25] SPEAKER_01: "Great, I ordered on Tuesday and..."

Labeled JSON Output

{
  "transcript": [
    {
      "speaker": "SPEAKER_00",
      "start": 0.0,
      "end": 5.0,
      "text": "Hello, this is customer service"
    },
    {
      "speaker": "SPEAKER_01",
      "start": 5.0,
      "end": 10.0,
      "text": "Hi! I have a question about my order"
    }
  ]
}
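
The JSON form converts to the standard transcript format with a few lines of code. A small sketch, assuming the field names shown above; the helper names are illustrative.

```python
import json

def format_timestamp(seconds):
    """Render a number of seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def to_transcript_lines(payload):
    """Turn the labeled JSON structure into '[MM:SS-MM:SS] SPEAKER: "text"' lines."""
    return [
        f'[{format_timestamp(seg["start"])}-{format_timestamp(seg["end"])}] '
        f'{seg["speaker"]}: "{seg["text"]}"'
        for seg in payload["transcript"]
    ]

raw = '''{"transcript": [
  {"speaker": "SPEAKER_00", "start": 0.0, "end": 5.0, "text": "Hello, this is customer service"},
  {"speaker": "SPEAKER_01", "start": 5.0, "end": 10.0, "text": "Hi! I have a question about my order"}
]}'''
for line in to_transcript_lines(json.loads(raw)):
    print(line)
```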

Custom Speaker Names

Option 1: Generic Labels

Use SPEAKER_00, SPEAKER_01, etc. (default)

Option 2: Role-Based

Configure custom names:

SPEAKER_00: "Agent"
SPEAKER_01: "Customer"

Option 3: Actual Names

If known, specify:

SPEAKER_00: "John (Agent)"
SPEAKER_01: "Sarah (Customer)"
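
If you post-process transcripts yourself, the same renaming is a dictionary lookup. A minimal sketch with a hypothetical `rename_speakers` helper; note that which generic label lands on which person can differ between recordings, so a fixed mapping is an assumption worth verifying per call.

```python
def rename_speakers(segments, names):
    """Return copies of segments with labels replaced via names; unknown labels are kept."""
    return [{**seg, "speaker": names.get(seg["speaker"], seg["speaker"])} for seg in segments]

names = {"SPEAKER_00": "Agent", "SPEAKER_01": "Customer"}
segments = [
    {"speaker": "SPEAKER_00", "text": "Hello, this is customer service"},
    {"speaker": "SPEAKER_02", "text": "(unmapped speaker)"},
]
print(rename_speakers(segments, names))
```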

Accuracy Considerations

Factors Affecting Accuracy

Positive Factors:

Clear audio quality
Distinct speakers (different voices)
Normal speaking volume
Minimal background noise
Correct speaker count specified

Negative Factors:

Poor audio quality
Overlapping speakers
Similar voices
Background noise
Incorrect speaker count

Improving Accuracy

1. Use High-Quality Audio

Recommended:
- 16kHz or higher sample rate
- Mono or stereo
- Minimal compression
- Clear voice capture
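
A quick way to check a recording against the sample-rate recommendation is to read its header. This sketch uses only the Python standard library and handles uncompressed WAV files; the `check_wav` name and the demo file it writes are illustrative.

```python
import os
import tempfile
import wave

def check_wav(path, min_sample_rate=16000):
    """Report sample rate and channel count of a WAV file and whether the rate meets the minimum."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    return {"sample_rate": rate, "channels": channels, "ok": rate >= min_sample_rate}

# Write a one-second silent demo file at 8 kHz, a typical narrowband telephony rate.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples
    wav.setframerate(8000)
    wav.writeframes(b"\x00\x00" * 8000)

print(check_wav(demo_path))
```

The demo file is reported as below the recommended rate, which is the case to watch for with telephony recordings.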

2. Specify Correct Speaker Count

If you know: "Always 2 speakers"
→ Set to 2 (faster and more accurate)

If you're unsure:
→ Use auto-detect

3. Minimize Overlapping Speech

System handles overlaps but:
- Overlapping = harder to distinguish
- Encourage natural turn-taking
- Can affect accuracy

4. Use Silence Gaps

Speakers with clear separation = better accuracy
Continuous talking = harder to distinguish
Multiple speakers = higher chance of errors

Cost Analysis

Pricing Comparison

AssemblyAI Diarization:

Standard: ~$0.0075/minute
With Diarization: Additional $0.005/minute
Total: ~$0.0125/minute

Monthly Cost (1000 calls, 2 min avg):
1000 × 2 × $0.0125 = $25/month

Deepgram (Diarization Included):

Standard: ~$0.0043/minute
Diarization: Included
Total: ~$0.0043/minute

Monthly Cost (1000 calls, 2 min avg):
1000 × 2 × $0.0043 = $8.60/month

Recommendation: Use Deepgram if you need diarization frequently, since its cost is included in the standard per-minute rate.
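
The monthly figures above follow from calls × average minutes × per-minute rate. A short sketch for re-running the comparison with your own volumes, using the approximate rates quoted on this page (check current provider pricing before relying on them):

```python
def monthly_cost(calls, avg_minutes, rate_per_minute):
    """Estimated monthly cost in dollars, rounded to cents."""
    return round(calls * avg_minutes * rate_per_minute, 2)

# Approximate per-minute rates taken from the comparison above.
RATES = {
    "AssemblyAI (with diarization)": 0.0125,
    "Deepgram (diarization included)": 0.0043,
}

for provider, rate in RATES.items():
    print(f"{provider}: ${monthly_cost(1000, 2, rate):.2f}/month")
```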

Use Cases & Examples

Sales Call Recording

Setup:

1. Agent + Customer call
2. Speaker count: 2
3. Labels:
   - SPEAKER_00: "Agent (Sales)"
   - SPEAKER_01: "Customer"

Output:

[00:00] AGENT: "Hello! Thanks for calling ABC Company"
[00:05] CUSTOMER: "Hi, I saw your ad and wanted to know more"
[00:10] AGENT: "Great! Let me tell you about our products..."

Benefits:

Quality assurance
Sales coaching
Compliance
Training

Support Escalation

Setup:

3-way call:
- Initial agent
- Supervisor
- Customer

Speaker count: 3
Labels:
- SPEAKER_00: Agent 1
- SPEAKER_01: Agent 2 (Supervisor)
- SPEAKER_02: Customer

Benefits:

Track escalation handling
Training for supervisors
Document decisions
Quality control

Conference Call Transcription

Setup:

Multiple participants: 4-6
Auto-detect speakers
Let system identify each speaker

Labels:
- SPEAKER_00: Participant 1
- SPEAKER_01: Participant 2
- SPEAKER_02: Participant 3
- etc.

Benefits:

Complete transcript
Know who said what
Meeting notes
Action item tracking

Troubleshooting

Speaker labels are incorrect

Check audio quality, specify correct speaker count, verify that speakers are clearly distinct.

Too many false speaker changes

Increase the minimum speaker duration setting, or verify that poor audio quality isn’t causing the extra segmentation.

Overlapping speech not handled properly

Diarization has limitations with simultaneous speech. Encourage natural turn-taking in your process.

Cost too high

Switch to Deepgram (includes diarization), or only enable for calls where needed.

Processing is too slow

Specify exact speaker count (don’t use auto-detect) for faster processing.

Best Practices

1. Use Auto-Detect When Unknown

Default to auto-detect
System handles most cases well
Switch to specified count only when needed

2. Enable Selectively

Not every call needs diarization
Only enable for:
- Quality assurance calls
- Escalations
- Training calls
- Compliance-critical calls

Saves on cost for standard calls

3. Monitor Accuracy

Weekly Review:
- Sample 5-10 calls with diarization
- Verify speaker identification
- Check for false positives
- Adjust settings if needed

4. Clear Audio Quality

Ensure:
- Good microphone quality
- Minimal background noise
- Proper volume levels
- Good internet connection

5. Document Speaker Roles

When processing transcripts:
- Label speakers by role (Agent, Supervisor, Customer)
- Add timestamps for easy reference
- Create searchable indexes
- Archive for compliance

Advanced Features

Custom Models (Enterprise)

Some providers offer custom diarization models:

Benefits:
- Trained on your specific voices
- Better accuracy for known speakers
- Speaker identification by name
- Customized thresholds

Cost: Enterprise pricing
Setup: Contact provider

Continuous Learning

System learns over time:
- More accurate as it processes your calls
- Adapts to your speakers
- Improves baseline accuracy
- Requires ongoing feedback

Performance Metrics

Key Metrics to Track

1. Diarization Error Rate (DER)
   - Lower is better (< 5% is excellent)
   
2. Speaker Count Accuracy
   - Correct identification of number of speakers
   
3. Speaker Labeling Accuracy
   - Correct assignment of labels to speakers
   
4. Processing Latency
   - Time to generate diarization
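
DER is conventionally defined as the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below applies that formula to durations you have already measured against a hand-labeled reference; the example numbers are made up for illustration. For scoring real annotations, a library such as `pyannote.metrics` provides a full implementation.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_reference):
    """DER = (missed + false alarm + confusion) / total reference speech, all in seconds."""
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (missed + false_alarm + confusion) / total_reference

# Illustrative durations in seconds for a two-minute call.
der = diarization_error_rate(missed=2.0, false_alarm=1.0, confusion=3.0, total_reference=120.0)
print(f"DER: {der:.1%}")
```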

See Also

Speech-to-Text

Configure STT providers

Call History

View and analyze call transcripts

Quality Assurance

Use diarization for call analysis and coaching

Support

AssemblyAI Diarization

AssemblyAI Speaker Diarization Docs

Contact Support

Email: callintel01@gmail.com