
Voice AI in Hospitality: Architecture of a Hotel Callbot

abemon | 12 min read

Why a hotel callbot

A 150-room hotel receives between 80 and 200 calls daily. Of those, 60-70% are repetitive questions: check-in/check-out times, availability, directions, parking prices, pool hours, pet policies. Each call averages 3.5 minutes. At reception, answering the phone competes with attending the guest standing in front of you. The guest in front of you always wins.

The result: missed calls. A typical hotel misses 15-35% of incoming calls. Each missed call from a potential booking has an estimated average value of EUR 180-320 (average nightly rate times average length of stay). The math is straightforward.
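As a worked illustration of that math (the call volume, miss rate, and booking-conversion figures below are assumptions picked from the ranges above, not measured data):

```python
# Back-of-the-envelope revenue at risk from missed calls.
# All inputs are illustrative assumptions within the ranges cited above.
calls_per_day = 140          # within the 80-200 calls/day range
missed_rate = 0.25           # within the 15-35% missed-call range
booking_intent_rate = 0.10   # assumed share of missed calls that were bookings
avg_booking_value = 250.0    # EUR, within the 180-320 range

missed_calls = calls_per_day * missed_rate
revenue_at_risk = missed_calls * booking_intent_rate * avg_booking_value

print(f"{missed_calls:.0f} missed calls/day, ~EUR {revenue_at_risk:.0f}/day at risk")
```

Even with conservative assumptions, the daily figure lands in the high hundreds of euros.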

A well-implemented callbot does not replace reception. It acts as a first filter that resolves 60-70% of calls without human intervention and transfers the rest to a human agent with the context already captured. This is what we built and what is running in production.

The complete architecture

The system has six layers, each with design decisions that directly affect user experience.

Layer 1: telephony and audio ingestion

The call enters through a SIP trunk (we use Twilio as the primary provider, with Vonage as backup). Audio is captured in PCM 16-bit, 16kHz, mono format. This format is the sweet spot between recognition quality and bandwidth.

A critical detail: voice activity detection (VAD). The system needs to know when the user is speaking and when they have stopped. This seems trivial but it is not. The standard WebRTC VAD works for most cases, but hotel environments have specific background noise (lobby ambience, background music, nearby conversations) that required an adjusted energy threshold.

The silence timeout (how long the system waits before considering the user has finished speaking) is set at 700ms. Less causes constant interruptions. More creates uncomfortable pauses. We arrived at 700ms after testing with 200 real calls.
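A minimal sketch of the end-of-utterance logic described above, using a plain energy threshold over 20 ms PCM frames (the threshold value and frame size are illustrative; production uses a tuned VAD, not this toy):

```python
import struct

FRAME_MS = 20
SILENCE_TIMEOUT_MS = 700   # the value tuned on 200 real calls (see text)
ENERGY_THRESHOLD = 500.0   # illustrative; lobby noise pushes this upward

def frame_energy(frame: bytes) -> float:
    """Mean absolute amplitude of a 16-bit mono PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

def end_of_utterance(frames):
    """Return the index of the frame at which the utterance is considered
    finished, i.e. after SILENCE_TIMEOUT_MS of consecutive quiet frames.
    Returns None if the user is still speaking."""
    silence_frames_needed = SILENCE_TIMEOUT_MS // FRAME_MS  # 35 frames
    silent_run = 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) < ENERGY_THRESHOLD:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i
        else:
            silent_run = 0  # speech resets the silence counter
    return None
```

Shrinking `SILENCE_TIMEOUT_MS` is exactly the "constant interruptions" failure mode; growing it creates the uncomfortable pauses.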

Layer 2: Speech-to-Text (STT)

STT converts audio to text. We use Deepgram as the primary engine. We chose it over Whisper (OpenAI) and Google Speech-to-Text after a comparative evaluation with 500 audio samples from real hotel calls in four languages.

| Engine | WER Spanish | WER English | WER French | WER German | Latency p50 |
|---|---|---|---|---|---|
| Deepgram Nova-2 | 6.2% | 4.1% | 7.8% | 8.1% | 280ms |
| Whisper Large V3 | 5.8% | 3.9% | 6.9% | 7.4% | 1.2s |
| Google V2 | 7.1% | 4.5% | 8.3% | 8.9% | 350ms |

Whisper has the lowest WER (Word Error Rate), but its latency is unacceptable for a real-time callbot. 1.2 seconds of STT plus processing time plus TTS time means the user waits 3-4 seconds before hearing a response. That does not work in a phone conversation.

Deepgram offers the best balance between accuracy and latency. It also supports streaming (transcribes while the user speaks, rather than waiting for them to finish), which reduces perceived latency.

A hotel-specific problem: proper nouns. “I have a reservation under Pfeiffer” or “I want to book the Alhambra suite.” Generic STT engines fail with proper nouns outside their vocabulary. Deepgram allows keyword boosting: a list of terms (room names, frequent guest names) that the model prioritizes during recognition. This improved proper noun accuracy by 23%.
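A sketch of how the boosted term list is wired into a Deepgram streaming request via the `keywords=term:boost` query parameter (the term list and boost weights here are illustrative; in production the list is generated from the PMS room names and expected guest surnames):

```python
from urllib.parse import urlencode

# Hotel-specific terms the model should prioritize, with boost weights.
# Illustrative examples; the real list comes from the PMS and guest data.
HOTEL_KEYWORDS = {"Pfeiffer": 2, "Alhambra": 2, "Mirador": 1.5}

def deepgram_listen_url(language: str = "es") -> str:
    """Build a Deepgram streaming /v1/listen URL with keyword boosting."""
    params = [
        ("model", "nova-2"),
        ("language", language),
        ("interim_results", "true"),   # streaming partial transcripts
        ("encoding", "linear16"),      # PCM 16-bit, matching layer 1
        ("sample_rate", "16000"),
    ]
    # Keyword boosting: repeated keywords=term:boost parameters.
    params += [("keywords", f"{term}:{boost}") for term, boost in HOTEL_KEYWORDS.items()]
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)
```

The boost weight trades recall on in-vocabulary words for accuracy on the boosted terms, so the list is kept short and hotel-specific.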

Layer 3: language understanding (NLU)

The transcribed text reaches the understanding engine. This is where the system decides what the user wants.

We use a hybrid approach. An intent classifier based on a fine-tuned model (DistilBERT for fast classification) detects the primary intent among 24 defined intents: book, check_availability, check_in, check_out, directions, services, prices, complaint, speak_to_human, and 15 more.

For simple intents (schedules, directions, hotel services), the response is generated directly from a structured knowledge base. No LLM needed. “What time is checkout?” is answered with the hotel’s data, formatted into a natural pre-templated response.
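The no-LLM path can be sketched as a template lookup over the structured knowledge base (hotel IDs, template keys, and data here are illustrative placeholders):

```python
# Structured knowledge base: one record per hotel, maintained by staff.
KNOWLEDGE_BASE = {
    "hotel_sol": {"checkout_time": "12:00", "pool_hours": "08:00-20:00"},
}

# Pre-written natural responses for simple intents.
TEMPLATES = {
    "check_out": "Checkout is at {checkout_time}. Late checkout may be available on request.",
    "services": "The pool is open from {pool_hours}.",
}

def answer_simple_intent(hotel_id: str, intent: str):
    """Return a templated answer, or None if this intent needs the LLM path."""
    template = TEMPLATES.get(intent)
    if template is None:
        return None
    return template.format(**KNOWLEDGE_BASE[hotel_id])
```

A `None` return is the signal to fall through to the complex-intent path described next.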

For complex intents (bookings, modifications, queries requiring reasoning), the system invokes an LLM (Claude 3.5 Haiku for its cost-speed balance) with the hotel’s context as the system prompt. The LLM generates the conversational response, but factual data (prices, availability) always comes from the PMS API, never from the LLM.

This separation is fundamental. The LLM is a conversation engine, not a source of truth. Data comes from the hotel management system. The LLM presents it naturally.
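That separation can be made concrete in how the request is assembled: PMS facts are injected verbatim into the prompt, and the model is instructed never to go beyond them. This is a generic chat-message sketch, not a specific SDK's types:

```python
def build_llm_messages(hotel_context: str, pms_facts: dict, user_utterance: str) -> list:
    """Assemble an LLM request where every factual figure comes verbatim
    from the PMS. The LLM phrases the answer; it never invents the data."""
    facts = "\n".join(f"- {key}: {value}" for key, value in pms_facts.items())
    system = (
        f"{hotel_context}\n\n"
        "Verified data from the PMS (the ONLY source for prices and availability):\n"
        f"{facts}\n"
        "Never state a price or availability that is not listed above."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_utterance},
    ]
```

If the PMS lookup failed, the facts section is empty and the system falls back to the retry/handoff logic in layer 4 rather than letting the model guess.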

Layer 4: PMS integration

The PMS (Property Management System) is the hotel’s nervous system. Our callbot integrates with Opera (Oracle), Mews, and Cloudbeds via their respective APIs. The integration covers:

  • Availability queries: room types, dates, real-time pricing
  • Booking creation: with confirmation number communicated to the guest by voice and SMS
  • Existing booking lookup: the guest provides their surname or booking number and the system retrieves the details
  • Modifications: date changes, room upgrades (with price confirmation)
  • Guest information: name, preferences, history (for personalized service)

PMS API latency is the most frequent bottleneck. Opera Cloud has response times of 800ms-2s. Mews is faster (200-500ms). Cloudbeds sits in between. To mitigate latency, we use a cache with a 5-minute TTL for availability and pricing (which do not change every second) and direct calls for write operations.
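The read path can be sketched as a read-through cache with a per-entry TTL (class and parameter names are illustrative; write operations bypass it entirely):

```python
import time

class TTLCache:
    """Read-through cache with a per-entry TTL, used for availability and
    pricing lookups. Writes always go straight to the PMS. Sketch only."""

    def __init__(self, ttl_seconds: float = 300.0):  # 5-minute TTL
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key, fetch):
        """Return the cached value if still fresh; otherwise call fetch()
        against the PMS and cache the result."""
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]
        value = fetch()
        self._store[key] = (value, now)
        return value
```

Availability for a given room type and date range makes a natural cache key; a stale-by-up-to-five-minutes price is acceptable, a stale booking is not.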

When the PMS does not respond within 3 seconds, the system says “One moment, please, let me check availability” and retries. If it fails after two retries, it transfers to a human agent with the captured context.
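A sketch of that retry policy (callback names are illustrative; the real PMS client enforces the 3-second timeout internally and raises on expiry):

```python
class PMSTimeout(Exception):
    """Raised by the PMS client when a call exceeds its 3-second budget."""

def query_with_fallback(pms_call, say, transfer_to_human):
    """Filler phrase on the first timeout, retries, then handoff to a
    human agent carrying the context captured so far."""
    for attempt in range(3):  # initial attempt plus two retries
        try:
            return pms_call()
        except PMSTimeout:
            if attempt == 0:
                say("One moment, please, let me check availability.")
    return transfer_to_human()
```

The filler phrase matters as much as the retry: dead air on a phone line reads as a dropped call.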

Layer 5: Text-to-Speech (TTS)

TTS converts the response to audio. We use ElevenLabs for primary voices. The reason: naturalness. ElevenLabs’ synthetic voices are the most natural we have evaluated, particularly in Spanish and English.

Each hotel has a configured voice: gender, tone, speed. A boutique hotel in Ibiza has a different voice than a business hotel in Madrid. This seems like an aesthetic detail but it impacts guest perception. In our post-call surveys, voice naturalness is the factor most correlated with overall callbot satisfaction.

TTS latency is critical. ElevenLabs offers audio streaming: playback begins before the TTS has synthesized the entire sentence. This reduces perceived latency from 1.5 seconds to under 500ms for the first syllable.
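The consuming side of that streaming is conceptually simple: feed each audio chunk to the telephony leg as it arrives rather than buffering the full utterance. A generic sketch, not ElevenLabs-specific (`chunks` stands in for any chunked HTTP or WebSocket audio stream):

```python
def stream_playback(chunks, play):
    """Feed TTS audio to the call as each chunk arrives. `chunks` is any
    iterator of audio bytes; `play` writes bytes to the telephony leg.
    Perceived latency ends at the first call to play(), not the last."""
    bytes_sent = 0
    for chunk in chunks:
        play(chunk)
        bytes_sent += len(chunk)
    return bytes_sent
```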

Layer 6: dialog management and context

The dialog manager maintains the conversation state: what the user has said, what the system has responded, where in the flow we are, and what information still needs to be collected.

For bookings, the flow is: dates -> room type -> number of guests -> contact details -> confirmation. The system collects this data throughout the conversation naturally, not as a form. If the user says “I want a double for March 15 to 17, two people,” the system extracts all data from a single utterance. If they say only “I want to book,” the system asks for missing data one piece at a time.
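The form-free collection above amounts to slot filling: extract whatever the utterance contains, then ask only for what is missing. A simplified sketch (the regex extractors are toys standing in for the NLU layer; slot names and prompts are illustrative):

```python
import re

REQUIRED_SLOTS = ["dates", "room_type", "guests", "contact"]

# Toy extractors for illustration; production extraction is done by the NLU.
SLOT_PATTERNS = {
    "dates": re.compile(r"\b(?:january|february|march|april|may)\s+\d{1,2}(?:\s+to\s+\d{1,2})?", re.I),
    "room_type": re.compile(r"\b(single|double|suite)\b", re.I),
    "guests": re.compile(r"\b(\d+|two|three|four)\s+(?:people|guests|persons)\b", re.I),
}

def update_slots(slots: dict, utterance: str) -> dict:
    """Fill any booking slots found in a single utterance."""
    for name, pattern in SLOT_PATTERNS.items():
        if name not in slots:
            match = pattern.search(utterance)
            if match:
                slots[name] = match.group(0)
    return slots

def next_question(slots: dict):
    """Ask for the first missing slot, one piece at a time; None means done."""
    prompts = {
        "dates": "For which dates would you like to book?",
        "room_type": "Which room type would you prefer?",
        "guests": "How many guests will be staying?",
        "contact": "Could I have a phone number for the confirmation?",
    }
    for slot in REQUIRED_SLOTS:
        if slot not in slots:
            return prompts[slot]
    return None  # all slots filled: move to confirmation
```

The single-utterance case ("a double for March 15 to 17, two people") fills three slots at once and leaves only the contact question.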

Context persists throughout the call. If the user asks about availability, then about parking prices, and then says “alright, I’ll book,” the system knows they mean the room they inquired about earlier, not the parking.

Multilingual: the real challenge

A hotel on the Spanish coast receives calls in Spanish, English, German, French, and occasionally Italian, Dutch, or Russian. The system detects the language automatically in the first 3-5 seconds of speech (Deepgram does this natively) and switches the entire pipeline: STT, NLU, responses, TTS.
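Switching the entire pipeline at once can be sketched as a per-language configuration lookup (voice IDs and locale keys are illustrative placeholders, not real identifiers):

```python
# Per-language pipeline configuration: once the language is detected in the
# first seconds of speech, every stage switches together.
PIPELINES = {
    "es": {"stt_language": "es", "tts_voice": "voice_es_1", "kb_locale": "es-ES"},
    "en": {"stt_language": "en", "tts_voice": "voice_en_1", "kb_locale": "en-GB"},
    "de": {"stt_language": "de", "tts_voice": "voice_de_1", "kb_locale": "de-DE"},
    "fr": {"stt_language": "fr", "tts_voice": "voice_fr_1", "kb_locale": "fr-FR"},
}
DEFAULT_LANGUAGE = "en"

def pipeline_for(detected_language: str) -> dict:
    """Return the full-stack config for the detected language, falling back
    to English for languages without a maintained knowledge base."""
    return PIPELINES.get(detected_language, PIPELINES[DEFAULT_LANGUAGE])
```

The fallback is the pragmatic answer to the occasional Italian, Dutch, or Russian caller: English handling with an early offer to transfer to a human.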

The challenge is not technological (the engines support multiple languages). The challenge is content: every response must sound natural in each language. "Le confirmo su reserva" ("I confirm your reservation") works in Spanish but does not translate literally into German and sound right. Each language has its own hospitality expressions. Keeping the knowledge base updated in 4-5 languages is ongoing work.

Production metrics

After 8 months in production across 12 hotels:

| Metric | Value |
|---|---|
| Resolution rate without human | 64% |
| Caller satisfaction (survey) | 4.1/5 |
| Average response latency | 1.8 seconds |
| Missed calls (reduction) | -78% |
| Bookings completed by callbot | 23% of total |
| Cost per call | EUR 0.12 |

The 64% resolution rate means 6 out of 10 calls resolve completely without transferring to a human. The remaining 36% are transferred, but with context (the human agent knows why the guest is calling, what language they speak, and what data has already been collected).

The EUR 0.12 cost per call compares favorably with the cost of a receptionist handling the same call (estimated at EUR 1.50-2.50 including labor cost and opportunity cost of not attending the in-person guest).

Voice AI in hospitality is not a replacement for people. It is a tool that allows people to focus on what truly matters: in-person attention, complex problem resolution, and the experience of the guest standing in front of them. The phone is handled by the machine. The experience is created by the human team.

To understand how AI-powered revenue management complements the hotel’s technology strategy, see our article on hotel revenue management with AI. And for a broader look at how AI transforms customer service, our article on AI in customer service, beyond the chatbot covers automated service patterns.

About the author

abemon engineering

Engineering team

Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.