We Moved to Groq and Our Transcription Got 10x Faster
164x real-time speed. A 10-minute video transcribed in 3.7 seconds. Here's what changed and why it matters.

Kevin Li

Last week we flipped the switch on our new transcription backend. If you've processed a video recently, you probably noticed — it's fast now. Like, noticeably fast.
Here's what happened.
The Before
Since launch, we've been running Whisper Large V3 on GPU instances. The setup worked fine. A 5-minute video took about 20-30 seconds to transcribe, depending on server load. Not bad, not great.
The problem was scaling. GPU instances are expensive, and our queue would back up during peak hours. Users would upload a video and wait 45 seconds to a minute just for transcription. For a tool whose whole value prop is "fast captions," that wait was starting to undermine the experience.
We explored a few options: more GPU instances (expensive), smaller models (worse accuracy), batching optimizations (marginal gains). None of them felt right.
The Groq Option
Then we tested Groq's LPU (Language Processing Unit) inference for Whisper. The first benchmark made me refresh the page because I thought the timer was broken.
A 10-minute video. Transcribed in 3.7 seconds. That's 164x real-time speed.
I ran it again. Same result. Ran it on a 30-minute podcast episode. 11 seconds.
The accuracy was identical — it's the same Whisper Large V3 model, just running on different hardware. Same 8.4% word error rate, same language support, same word-level timestamps. The only difference is speed.
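For the curious, here's roughly what the benchmark looks like. This is a minimal sketch against Groq's OpenAI-compatible REST endpoint, not our production worker; the file name is made up, and it assumes you have a GROQ_API_KEY in your environment and have already pulled the audio track out of the video.
```python
import os
import time

import requests

GROQ_URL = "https://api.groq.com/openai/v1/audio/transcriptions"

def timed_transcribe(audio_path: str) -> tuple[dict, float]:
    """Send an audio file to Groq's Whisper endpoint and time the round trip."""
    start = time.perf_counter()
    with open(audio_path, "rb") as f:
        resp = requests.post(
            GROQ_URL,
            headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
            files={"file": f},
            data={"model": "whisper-large-v3", "response_format": "verbose_json"},
        )
    resp.raise_for_status()
    return resp.json(), time.perf_counter() - start

transcript, elapsed = timed_transcribe("ten-minute-video.mp3")  # illustrative file
print(f"Transcribed in {elapsed:.1f}s")
```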
What This Means for Users
Faster processing. Transcription used to be the bottleneck. Now rendering is the bottleneck (and rendering was already fast). End-to-end processing time for a typical 3-minute TikTok went from ~45 seconds to ~15 seconds.
No more queue waits. Because each transcription is so fast, the queue basically never backs up. Peak hour performance is now the same as off-peak.
Better word timestamps. This one surprised us. Groq's implementation returns slightly more precise word-level timestamps than our previous setup. We're talking millisecond-level improvements, but it makes caption animations noticeably smoother — words appear exactly when they're spoken, not 50ms early or late.
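If you want to pull word-level timestamps yourself, the OpenAI-style transcription API exposes them through verbose_json plus a word granularity flag. A sketch, with the caveat that we're assuming Groq accepts the same "timestamp_granularities[]" form-field spelling as OpenAI; check the current docs if the request is rejected.
```python
import os

import requests

GROQ_URL = "https://api.groq.com/openai/v1/audio/transcriptions"
headers = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}

# verbose_json plus word granularity returns a top-level "words" list.
data = {
    "model": "whisper-large-v3",
    "response_format": "verbose_json",
    "timestamp_granularities[]": "word",  # OpenAI-style field name (assumption)
}
with open("clip-audio.mp3", "rb") as f:  # illustrative file name
    resp = requests.post(GROQ_URL, headers=headers, files={"file": f}, data=data)
resp.raise_for_status()

for w in resp.json()["words"]:
    print(f'{w["start"]:6.2f}s  {w["word"]}')
```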
The Migration
Switching was relatively painless. Our transcription worker already abstracted the Whisper API behind an interface, so swapping the backend was mostly a config change. The tricky part was handling the response format differences and making sure our timestamp normalization worked correctly with Groq's output.
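Concretely, the abstraction looks something like the sketch below. The names are illustrative rather than lifted from our codebase; the point is that the worker only ever talks to the interface, so a new backend slots in behind it, and the normalization step smooths over per-backend quirks in the word entries.
```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float

class TranscriptionBackend(Protocol):
    def transcribe(self, audio_path: str) -> list[Word]: ...

def normalize_words(raw_words: list[dict]) -> list[Word]:
    """Map OpenAI-style verbose_json word entries onto our internal type.

    Strips the padding whitespace around each word and clamps the
    occasional negative start time that shows up at segment boundaries.
    """
    return [
        Word(
            text=w["word"].strip(),
            start=max(0.0, float(w["start"])),
            end=max(0.0, float(w["end"])),
        )
        for w in raw_words
    ]
```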
We ran both backends in parallel for a week, comparing outputs side by side. Accuracy was within the margin of error (sometimes Groq was slightly better, sometimes our old setup was, never a meaningful difference). Speed was consistently 8-12x faster.
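The side-by-side check was a small harness along these lines, scoring each new transcript against the old one with the third-party jiwer library. Names are hypothetical; the backend objects implement the TranscriptionBackend interface from the sketch above.
```python
import jiwer  # third-party word-error-rate library: pip install jiwer

def disagreement(audio_path, old_backend, new_backend):
    """Word error rate of the new backend's transcript, scored against
    the old backend's transcript for the same clip. A value near zero
    means the two outputs are effectively identical."""
    old_text = " ".join(w.text for w in old_backend.transcribe(audio_path))
    new_text = " ".join(w.text for w in new_backend.transcribe(audio_path))
    return jiwer.wer(old_text, new_text)  # 0.0 = exact match
```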
One thing we did change: because transcription is now so fast, we removed the progress polling for the transcription step. It used to show "Transcribing... 40%... 60%..." — but now it goes from "Transcribing" to "Done" so quickly that the progress bar was just flickering. We simplified it to a single "Processing" state that covers both transcription and rendering.
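In code terms it's just a smaller state machine. A hypothetical sketch of the collapsed states:
```python
from enum import Enum

# The old pipeline had a separate "transcribing" state with percent
# progress; transcription now finishes too fast for that to mean anything.
class JobState(str, Enum):
    PROCESSING = "processing"  # covers transcription and rendering
    DONE = "done"
    FAILED = "failed"
```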
The Cost Question
Groq is actually cheaper per minute of audio than running our own GPU instances. I won't share exact numbers, but the cost reduction was significant enough that we're reinvesting the savings into rendering capacity. We're now running more concurrent rendering workers, which cuts wait times even further.
99 Languages, Auto-Detected
One more thing we enabled with this migration: automatic language detection across all 99 languages that Whisper supports. Previously, we had a language selector that defaulted to English and required manual switching. Now the model detects the language automatically.
This matters more than you'd think. A lot of our users create content in multiple languages, or have videos with mixed-language audio. Removing the manual language selection step eliminates one more point of friction.
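On the API side, enabling this is mostly about what you don't send: leave the language field out of the request and Whisper detects it. A sketch under the same assumptions as the earlier snippets; verbose_json reports the detected language back in the response.
```python
import os

import requests

GROQ_URL = "https://api.groq.com/openai/v1/audio/transcriptions"
headers = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}

# No "language" field in the form data, so Whisper picks the language itself.
data = {"model": "whisper-large-v3", "response_format": "verbose_json"}
with open("clip-audio.mp3", "rb") as f:  # illustrative file name
    resp = requests.post(GROQ_URL, headers=headers, files={"file": f}, data=data)
resp.raise_for_status()

body = resp.json()
print(body["language"])    # detected language, e.g. "spanish"
print(body["text"][:80])   # first 80 chars of the transcript
```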
The new transcription backend is live for all users. If you process a video today, you're already on Groq. Let us know if you notice the speed difference — we sure did.


