We Shipped Groq Orpheus TTS to Production - Here's What Broke

A deep-dive into the WebSocket lifecycle and audio streaming challenges we encountered deploying Groq's Orpheus TTS, and the architectural patterns that finally made it production-ready.

TL;DR

Groq's Orpheus TTS is fast. Really fast. Sub-200ms time-to-first-byte, ~100 characters/second, with voices that sound genuinely human. But when we shipped it to production, users started reporting that the second message was silent. The problem wasn't Groq. The problem wasn't Orpheus. The problem was us - specifically, our WebSocket lifecycle management and audio streaming architecture.

What We Were Building

We were adding real-time voice capabilities to our AI agent platform. Users could chat with AI agents via text, but we wanted to add voice output - the agent speaks its responses aloud.

For the Arabic-speaking market, we specifically needed Saudi dialect support. That led us to Groq's Orpheus Arabic-Saudi model (canopylabs/orpheus-arabic-saudi), which offers:

  • Four authentic Saudi dialect voices: Fahad, Sultan, Lulwa, Noura
  • Natural pronunciation with regional nuances
  • Sub-200ms time-to-first-byte latency via Groq's LPU inference
  • ~100 characters/second throughput
  • Simple, OpenAI-compatible API

The API integration was straightforward:

Python
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

response = client.audio.speech.create(
    model="canopylabs/orpheus-arabic-saudi",
    voice="noura",
    input="مرحبا بكم في تطبيقنا",
    response_format="wav"
)

# Save or stream the audio
with open("output.wav", "wb") as f:
    f.write(response.content)

Easy, right? It worked perfectly in development.


The Setup: Our Architecture

Our architecture looked like this:

Browser (Frontend)  <-- WebSocket: audio chunks -->  Backend Server  <-- HTTPS: TTS request -->  Groq API (Orpheus)

Flow:

  1. User sends a message
  2. AI agent generates a text response
  3. Backend sends text to Groq's speech endpoint
  4. Groq returns audio (WAV format)
  5. Backend streams audio chunks to frontend via WebSocket
  6. Frontend plays audio using Web Audio API
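
Steps 3-5 can be sketched in Node. This is an illustrative sketch, not our exact production code: it assumes the groq-sdk client (whose speech endpoint returns a fetch-style Response), a socket.io `socket`, and a 16 KB chunk size we picked arbitrarily for the example.

```javascript
const CHUNK_SIZE = 16 * 1024; // illustrative: 16 KB per WebSocket frame

// Split a WAV buffer into fixed-size chunks for streaming (step 5)
function* chunkAudio(buffer, size = CHUNK_SIZE) {
  for (let offset = 0; offset < buffer.length; offset += size) {
    yield buffer.subarray(offset, offset + size);
  }
}

// Steps 3-5: request speech from Groq, relay chunks to the browser
async function relayTts(socket, client, text) {
  const response = await client.audio.speech.create({
    model: 'canopylabs/orpheus-arabic-saudi',
    voice: 'noura',
    input: text,
    response_format: 'wav',
  });
  const audio = Buffer.from(await response.arrayBuffer());
  for (const chunk of chunkAudio(audio)) {
    socket.emit('audio-chunk', chunk);
  }
  socket.emit('stream-end'); // explicit completion signal
}
```

The frontend (step 6) plays chunks as they arrive; the explicit `stream-end` signal matters later in this story.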

What Broke in Production

After deploying, support tickets started rolling in:

| Issue                  | Frequency        | User Report                                                            |
|------------------------|------------------|------------------------------------------------------------------------|
| Second message silent  | ~30% of sessions | "First response plays fine, but when I send another message, no audio"  |
| Audio never completes  | Intermittent     | "The agent just stops mid-sentence"                                     |
| UI stuck on 'speaking' | ~15% of sessions | "It says the agent is speaking but there's no sound"                    |
| STOP doesn't work      | Frequent         | "I clicked stop but it kept playing"                                    |
| Memory usage climbing  | Over time        | (Internal monitoring alert)                                             |

The frustrating part: none of these reproduced locally. Our dev environment worked flawlessly.


Root Cause: Unclear State Ownership

After days of debugging, we identified the root cause. It wasn't one bug - it was an architectural flaw: we had no clear ownership of stateful resources.

Backend Problems

JavaScript DON'T DO THIS
socket.on('tts-request', async (text) => {
  // Problem 1: Creating a new WebSocket every time
  // Old connections were never explicitly closed
  const groqWs = new WebSocket(GROQ_STREAMING_URL);
  
  groqWs.on('message', (audioChunk) => {
    socket.emit('audio-chunk', audioChunk);
  });
  
  // Problem 2: Adding listeners without removing old ones
  socket.on('stop', () => {
    groqWs.close(); // Best effort, not guaranteed
  });
  
  // Problem 3: Stream end was implicit
  groqWs.on('close', () => {
    socket.emit('stream-end');
  });
});

What went wrong

  1. Multiple WebSocket connections: Each TTS request created a new connection without closing the previous one.
  2. Listener accumulation: Every request added new listeners. After 5 messages, a single STOP would trigger 5 close attempts.
  3. Implicit stream completion: We assumed groqWs.on('close') would always fire. It didn't.


The Fix: Deterministic Lifecycle Management

We rewrote both backend and frontend around three principles:

The Three Laws of Real-Time State Management

Law 1 - Ownership: Every resource must have exactly one owner at any time.
Law 2 - Cleanup: All cleanup functions must be idempotent.
Law 3 - Completion: Stream completion must be explicit, never inferred.

Single WebSocket Per Session

JavaScript FIXED
// Store the external connection on the socket itself
function cleanupExternalConnection(socket) {
  // Idempotent - safe to call multiple times
  if (socket.data.groqWs) {
    socket.data.groqWs.removeAllListeners();
    if (socket.data.groqWs.readyState === WebSocket.OPEN) {
      socket.data.groqWs.close();
    }
    socket.data.groqWs = null;
  }
  
  // Only send end signal once
  if (!socket.data.endSignalSent) {
    socket.data.endSignalSent = true;
    socket.emit('stream-end');
  }
}

socket.on('tts-request', async (text) => {
  // ALWAYS clean up previous session first
  cleanupExternalConnection(socket);
  
  // Reset state for new session
  socket.data.endSignalSent = false;
  socket.removeAllListeners('stop');
  
  // Create new connection with explicit ownership
  const groqWs = new WebSocket(GROQ_STREAMING_URL);
  socket.data.groqWs = groqWs;
  
  // ... rest of implementation
});

Timeline-Driven Audio Playback

JavaScript FIXED
class AudioStreamPlayer {
  constructor() {
    this.audioContext = new AudioContext();
    this.nextStartTime = 0;
    this.activeSources = new Set();
    this.streamEnded = false;
  }
  
  reset() {
    // Stop all active sources
    this.activeSources.forEach(source => {
      try { source.stop(); } catch (e) { }
    });
    this.activeSources.clear();
    this.nextStartTime = this.audioContext.currentTime;
    this.streamEnded = false;
  }
  
  async playChunk(pcmData) {
    // Decode raw PCM into an AudioBuffer
    // (assumes mono 16-bit PCM at 24 kHz - match your stream's format)
    const int16 = new Int16Array(pcmData);
    const audioBuffer = this.audioContext.createBuffer(1, int16.length, 24000);
    const channel = audioBuffer.getChannelData(0);
    for (let i = 0; i < int16.length; i++) channel[i] = int16[i] / 32768;

    // Track the source so reset() can stop it
    const source = this.audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(this.audioContext.destination);
    this.activeSources.add(source);
    source.onended = () => this.activeSources.delete(source);

    // Schedule on the timeline (not "now") so chunks play back-to-back
    const startTime = Math.max(
      this.nextStartTime,
      this.audioContext.currentTime
    );
    source.start(startTime);
    this.nextStartTime = startTime + audioBuffer.duration;
  }
}
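
The scheduling rule in playChunk is worth isolating: each chunk starts at whichever is later, the end of the previously scheduled audio or "now". A pure sketch of that logic:

```javascript
// Pure version of the timeline rule: back-to-back when chunks arrive
// fast, immediate when the pipeline has drained
function scheduleChunk(nextStartTime, currentTime, duration) {
  const startTime = Math.max(nextStartTime, currentTime);
  return { startTime, nextStartTime: startTime + duration };
}
```

If chunks arrive faster than real time, startTime stays glued to the tail of the previous buffer; after a network stall, playback resumes at the current time instead of being scheduled in the past.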

Testing for Production Reliability

Standard unit tests won't catch these issues. Here's what we added:

1. Multi-Message Stress Test

Send 100 consecutive messages and verify no resource leaks.

2. Rapid Interrupt Test

Issue STOP commands at random points during streaming.

3. Second Message Test

Specifically test that message 2 plays correctly after message 1.

4. Chaos Test

Introduce random latency and packet loss to verify graceful degradation.
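
A minimal harness for tests 1-3 can run entirely against mocks. The names here are illustrative - real tests would drive the actual server - but the invariants are the ones that matter: no leaked connections, and exactly one live stream at a time.

```javascript
// Mock session implementing the fixed lifecycle:
// one connection per session, idempotent cleanup
class MockSession {
  constructor() {
    this.openConnections = 0;
    this.current = null;
  }
  cleanup() {
    // Idempotent: safe to call any number of times
    if (this.current) {
      this.openConnections--;
      this.current = null;
    }
  }
  ttsRequest() {
    this.cleanup();    // always tear down the previous stream first
    this.current = {}; // stands in for a live Groq WebSocket
    this.openConnections++;
  }
  stop() {
    this.cleanup();
  }
}

// Test 1: 100 consecutive messages must never leak connections
function stressTest(session, n) {
  for (let i = 0; i < n; i++) session.ttsRequest();
  session.stop();
  return session.openConnections; // should be 0
}
```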


Key Takeaways

5 Lessons Learned

  1. Groq + Orpheus is production-ready. Your integration might not be. The Groq API did exactly what it was supposed to do. Our bugs were in our own code.
  2. Demos work because conditions are perfect. In production, users interrupt, retry, and behave unpredictably.
  3. State ownership is everything. Every resource needs exactly one owner. When you can't answer "who closes this WebSocket?", you have a bug.
  4. Make completion explicit. Never infer that a stream has ended. Send explicit signals and verify them.
  5. Test for production, not demos. Multi-message stress tests, interrupt tests, chaos tests.

Conclusion

We started this project excited about Groq's speed and Orpheus's voice quality. We ended up learning a hard lesson about real-time systems architecture.

The good news: once we fixed our lifecycle management, everything worked beautifully. Groq Orpheus now powers voice output for thousands of users, in both English and Saudi Arabic, with reliable playback and clean state management.

The lesson: if you can't clearly explain when something starts and when it ends, it will break in production.
