Case Study

Meet-In

Real-time speech-to-speech translation. Under 1.8 seconds end to end.

Status: In active development

Domain: meet-in.app

Stack: Fly.io, Supabase, Gemini, Google TTS

Target latency: < 1.8s end-to-end

The problem with real-time translation

The difference between a translated call that feels like a conversation and one that feels like a walkie-talkie is roughly two seconds of latency. Below that threshold, the human brain accepts it as a real exchange. Above it, you start waiting, and the conversational rhythm collapses.

Meet-In's entire engineering thesis is built around staying below that threshold. You speak in your language. The other person hears it in theirs. The whole pipeline (speech recognition, translation, speech synthesis) has to fit in under 1.8 seconds.

The latency budget

If end-to-end has to be under 1.8 seconds, every component needs a slice it can't overrun. Going over budget in any one stage ruins the whole call, so the design started with the budget itself.

Audio capture & chunking: ~150ms
Network to inference region: ~100ms
Speech recognition (Gemini): ~400ms
Translation (Gemini): ~350ms
Speech synthesis (Google TTS): ~450ms
Network return + buffer: ~250ms
Total end-to-end: < 1.8s

Every stage stays within its slice. None is allowed to overrun and ruin the call.
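As a minimal sketch of how the budget could be checked in code, the snippet below sums the slices listed above and reports any stage whose measured time exceeds its slice. The stage keys, the function names, and the idea of feeding in per-stage measurements are illustrative assumptions, not Meet-In's actual instrumentation.

```python
# Per-stage slices in milliseconds, copied from the budget above.
# The dictionary keys are hypothetical names chosen for this sketch.
BUDGET_MS = {
    "audio_capture_chunking": 150,
    "network_to_inference": 100,
    "speech_recognition": 400,
    "translation": 350,
    "speech_synthesis": 450,
    "network_return_buffer": 250,
}
TARGET_MS = 1800  # end-to-end ceiling stated in the case study


def total_budget_ms(budget=BUDGET_MS):
    """Sum of all per-stage slices."""
    return sum(budget.values())


def overruns(measured_ms, budget=BUDGET_MS):
    """Return {stage: ms over its slice} for every stage that overran."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in budget.items()
        if measured_ms.get(stage, 0) > limit
    }


if __name__ == "__main__":
    print("planned total:", total_budget_ms(), "ms")
    print("headroom:", TARGET_MS - total_budget_ms(), "ms")
    # Hypothetical measurements for one utterance.
    sample = dict(BUDGET_MS, speech_synthesis=520)
    print("overruns:", overruns(sample))
```

The slices add up to 1,700 ms, which leaves about 100 ms of headroom under the 1.8 s target. A check like this makes it obvious which stage ate the margin when a call feels slow.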

Why Gemini for both ASR and translation

Splitting speech recognition and translation across two models adds latency at the handoff (serialisation, network, model load). Gemini handles both in a single call, which removes handoff overhead the budget couldn't otherwise absorb.

Multi-region from day one

Latency is mostly physics. A call between London and Tokyo has a hard floor that comes from the speed of light, no matter how fast your code is. No code can beat that floor; the most you can do is avoid adding to it by putting your compute close to your users.
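To put a rough number on that floor, the sketch below estimates the great-circle distance between two cities and the minimum round-trip time light would need to cover it. The city coordinates are approximate, and the two-thirds factor for the speed of light in optical fibre is a common rule of thumb used here as an assumption. Real cable routes are longer than the great circle, so actual RTTs will be higher.

```python
import math

EARTH_RADIUS_KM = 6371.0
C_VACUUM_KM_S = 299_792.458
FIBER_FACTOR = 2 / 3  # assumption: light in fibre travels at roughly 2/3 of c


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_rtt_ms(distance_km, speed_factor=FIBER_FACTOR):
    """Lower bound on round-trip time over the given distance."""
    return 2 * distance_km / (C_VACUUM_KM_S * speed_factor) * 1000


if __name__ == "__main__":
    # Approximate city-centre coordinates.
    london = (51.5074, -0.1278)
    tokyo = (35.6762, 139.6503)
    d = great_circle_km(*london, *tokyo)
    print(f"distance: {d:.0f} km, fibre RTT floor: {min_rtt_ms(d):.0f} ms")
```

Under these assumptions London to Tokyo comes out at roughly 9,560 km and a round-trip floor in the region of 95 ms, which is already about the size of the whole "network to inference region" slice in the budget.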

Meet-In runs on Fly.io's edge deployment, with inference nodes in multiple regions. When a call connects, traffic routes to the nearest viable node. Network RTT becomes a function of geography, not provider choice.
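The idea of "nearest viable node" can be sketched as a selection over candidate regions, skipping any that are unhealthy. This is only an illustration of the selection logic: the `Region` type, the region list, the coordinates, and the health flag are all hypothetical, and it is not a description of how Fly.io's own routing is implemented.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Region:
    code: str
    lat: float
    lon: float
    healthy: bool = True  # "viable" in the sense used above


def distance_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_viable(user_lat, user_lon, regions: Iterable[Region]) -> Optional[Region]:
    """Pick the closest healthy region, or None if none is available."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        return None
    return min(candidates,
               key=lambda r: distance_km(user_lat, user_lon, r.lat, r.lon))


# Hypothetical region set with approximate coordinates.
REGIONS = [
    Region("lhr", 51.47, -0.45),
    Region("fra", 50.03, 8.57),
    Region("nrt", 35.77, 140.39),
    Region("iad", 38.95, -77.45),
]

if __name__ == "__main__":
    paris = (48.8566, 2.3522)
    print(nearest_viable(*paris, REGIONS).code)
```

A production system would more likely rank regions by measured RTT than by map distance, since network paths do not follow great circles; distance is used here only to keep the sketch self-contained.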

Voice-first design

The product is built around natural conversation, not text boxes. There's no chat interface, no "press enter to translate". You speak, you hear the reply. Google Cloud TTS handles output that sounds like a person, not a phone tree.

This sounds obvious in retrospect, but most "translation apps" still default to text input because text is easier to engineer around. The decision to commit to voice-only as the primary mode forced every other design decision downstream.

The security audit

Before opening to users, Meet-In went through a full security audit. The audit returned 26 findings. All 26 were addressed before launch.

The findings ranged from minor (header configuration, CSP tightening) to substantive (auth flow edge cases, audio data retention policies). The lesson wasn't that the audit found issues; every audit does. The lesson was that auditing before opening to users is a different posture from auditing after, and the difference matters when the product handles voice data from real conversations.

A translated call either feels like a conversation or it feels like waiting. There's no in-between, and the threshold is around 1.8 seconds.

What I learned

Latency budgets are a forcing function for architectural decisions. Once you commit to a hard end-to-end target, every other choice (model selection, region placement, protocol design) becomes a question of "does this fit under the budget?" rather than "is this the latest tool?"

Also: in a voice-first product, the moments where the latency budget fails are visceral. Users don't think "the network was slow"; they think "this product is broken". That's a much higher bar than text products have to meet.

Current Status

Active development. Security audit complete with 26 findings addressed. Phase 0 security and Phase 1 steps 1-4 complete. Currently working on auth migration to Supabase.
