Talk to Your AI: How Voice-First Interfaces Are Transforming Clinical Workflows


A surgeon mid-procedure needs to check a drug interaction. An ER physician working a trauma case needs to dictate a note. A nurse administering medications needs to verify a dosage. In each scenario, the clinician’s hands are occupied, their eyes are on the patient, and the information they need is locked behind a keyboard and a login screen.

Healthcare is one of the few industries where the people who need information most urgently are the least able to interact with a screen. Yet the vast majority of clinical AI tools — including many of our own — are built around text input fields and graphical dashboards. We are asking clinicians to stop what they are doing, wash their hands, find a workstation, log in, type a query, read a response, and then return to the patient.

This is why IntelMedica built Talk to LLM — a voice-first interface that lets clinicians interact with AI assistants through natural speech. Not a command-and-control voice menu. Not a dictation transcription tool. A conversational AI interface where clinicians speak naturally and receive spoken and visual responses grounded in clinical knowledge.

Why Voice Is the Natural Clinical Interface

The keyboard-and-mouse paradigm was designed for office workers sitting at desks. It was never designed for clinicians moving between exam rooms, standing at bedsides, or working in sterile environments. The mismatch between clinical workflows and text-based interfaces creates measurable friction.

The Documentation Time Tax

A 2024 study in the Annals of Internal Medicine found that physicians spend an average of 16 minutes per patient encounter on EHR documentation, roughly equal to the time spent with the patient. For a physician seeing 20 patients per day, that adds up to 320 minutes, or more than 5 hours, of typing, clicking, and navigating EHR interfaces.

Speech is fundamentally faster than typing for most people. The average typing speed for a professional is 40-60 words per minute. The average speaking speed is 130-150 words per minute. Even accounting for speech recognition errors and correction time, voice input is 2-3 times faster than keyboard input for narrative text like clinical notes.

Hands-Free Is Not Optional — It Is Clinical

In many clinical scenarios, touching a keyboard is not just inconvenient — it is a clinical hazard or an infection control violation:

Surgical and procedural environments — Sterile fields cannot be broken to interact with a device
Emergency resuscitation — Every hand is performing a clinical task
Wound care and dressing changes — Gloved hands are contaminated
Patient examination — Breaking eye contact to type disrupts the clinical relationship
Isolation rooms — Bringing devices in and out of isolation adds contamination risk

Voice interfaces are not a convenience feature for these scenarios. They are the only viable interaction modality.

Cognitive Load Reduction

Switching between clinical thinking and computer interaction imposes a cognitive switching cost. A physician mentally composing a clinical note while simultaneously navigating an EHR menu structure is performing two unrelated cognitive tasks. Research on dual-task interference in clinical settings shows that this switching degrades both tasks — the clinical note is less thorough, and the EHR interaction takes longer.

Voice interaction reduces this switching cost by allowing clinicians to express clinical thoughts in the same natural language they use for clinical reasoning. There is no translation step between thinking “the patient’s creatinine is trending upward, suggesting early renal impairment” and speaking it aloud.

Talk to LLM: Architecture Overview

Talk to LLM is built as a modular pipeline with three core stages: speech-to-text, LLM processing, and text-to-speech. Each stage runs as an independent component, so we can swap, optimize, or update one stage without taking the entire pipeline offline.
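To make the "swap components" idea concrete, here is a minimal sketch of a three-stage pipeline in Python. The interface names (`SpeechToText`, `LanguageModel`, `TextToSpeech`, `VoicePipeline`) and the stand-in stages are hypothetical, chosen for illustration; they are not the actual Talk to LLM classes.

```python
from dataclasses import dataclass
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def respond(self, text: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    """Composes three stages; any stage can be replaced independently."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def run(self, audio: bytes) -> bytes:
        transcript = self.stt.transcribe(audio)   # Stage 1
        reply = self.llm.respond(transcript)      # Stage 2
        return self.tts.synthesize(reply)         # Stage 3


# Stand-in stages (hypothetical) so the sketch runs without real models.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UppercaseLLM:
    def respond(self, text: str) -> str:
        return text.upper()


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(stt=EchoSTT(), llm=UppercaseLLM(), tts=BytesTTS())
result = pipeline.run(b"check dose")
```

Because `VoicePipeline` depends only on the three small interfaces, replacing one stage (for example, a different STT model) does not require touching the other two.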

Stage 1: Speech-to-Text (STT)

The speech recognition layer converts spoken clinical language into text. This is harder than general speech recognition for several reasons:

Medical vocabulary. Clinical speech is dense with specialized terminology — drug names, anatomical terms, eponymous conditions, procedure names, abbreviations. General-purpose speech recognition models trained on podcasts, phone calls, and YouTube videos struggle with words like “esophagogastroduodenoscopy,” “adalimumab,” or “PERRLA.”

Ambient noise. Clinical environments are noisy — monitor alarms, ventilator sounds, overhead pages, conversations between other staff members. The STT model must isolate the target speaker from environmental noise.

Accented and fatigued speech. Healthcare professionals come from diverse linguistic backgrounds, and clinical speech often occurs under stress, fatigue, or time pressure. The model must be robust to accent variation and speech patterns that deviate from training data norms.

Our STT pipeline uses a fine-tuned Whisper variant optimized for medical vocabulary. The fine-tuning process involved:

Base model selection — Whisper Large V3 as the foundation, providing strong multilingual and noise-robust performance
Medical vocabulary augmentation — Training on medical dictation corpora, clinical conference recordings, and synthetic speech generated from medical text
Noise conditioning — Augmenting training data with hospital-environment noise profiles (ICU ambient, OR ambient, ED ambient) to improve recognition accuracy in realistic acoustic conditions
Speaker diarization — Integrated speaker identification to distinguish the target clinician from other voices in the environment

Post-STT, the recognized text passes through a medical spell-check and normalization layer that corrects common recognition errors (e.g., “Tylenol” misrecognized as “tileanol”), expands abbreviations where appropriate, and normalizes medical terminology to standard forms.
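A normalization pass of this kind can be sketched as a dictionary lookup over word tokens. The tables below are tiny hypothetical examples (only the "tileanol" correction comes from the text above), and a production layer would likely need context-aware matching rather than a flat lookup.

```python
import re

# Hypothetical lookup tables for illustration only.
CORRECTIONS = {"tileanol": "Tylenol"}
ABBREVIATIONS = {"bid": "twice daily", "prn": "as needed"}


def normalize_transcript(text: str, expand_abbreviations: bool = True) -> str:
    """Correct known misrecognitions and optionally expand abbreviations."""

    def fix(match: re.Match) -> str:
        word = match.group(0)
        key = word.lower()
        if key in CORRECTIONS:
            return CORRECTIONS[key]
        if expand_abbreviations and key in ABBREVIATIONS:
            return ABBREVIATIONS[key]
        return word  # leave unknown words untouched

    # Only alphabetic tokens are considered; numbers and units pass through.
    return re.sub(r"[A-Za-z]+", fix, text)


normalized = normalize_transcript("tileanol 500 mg BID prn")
```

The `expand_abbreviations` flag reflects the "where appropriate" qualifier: some note formats keep abbreviations as dictated.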

Stage 2: LLM Processing

The recognized and normalized text is processed by the LLM layer, which serves different functions depending on the clinical context:

Documentation mode. The LLM transforms spoken clinical observations into structured clinical notes following established documentation formats (SOAP notes, H&P templates, procedure notes). The clinician speaks naturally — “Patient is a 67-year-old male presenting with three days of progressive shortness of breath, worse with exertion, associated with bilateral lower extremity edema” — and the LLM generates a properly formatted, structured clinical document.

Query mode. The clinician asks a clinical question — “What is the recommended loading dose of amiodarone for new-onset atrial fibrillation with rapid ventricular response?” — and the LLM retrieves relevant guidelines via the RAG pipeline and returns a concise, evidence-grounded answer.

Decision support mode. The clinician describes a clinical scenario, and the LLM provides relevant clinical decision support — drug interaction checks, dosage calculations, differential diagnosis considerations, or applicable clinical practice guidelines.

The LLM layer uses the same infrastructure as Doc Assist AI: Qwen 3.5 4B running on vLLM for generation, with RAG retrieval from Qdrant for guideline grounding. The anti-hallucination validation layer (using MedGemma 4B) operates on all LLM outputs before they are presented to the clinician.
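The mode dispatch and the validation gate can be sketched as follows. Everything here is an assumption made for illustration: the `Mode` enum, the prompt strings, the `retrieve`/`generate`/`validate` callables, and the choice to skip retrieval in documentation mode. The only behavior taken from the description above is that no output reaches the clinician unless validation passes.

```python
from enum import Enum
from typing import Callable, List, Optional


class Mode(Enum):
    DOCUMENTATION = "documentation"
    QUERY = "query"
    DECISION_SUPPORT = "decision_support"


# Hypothetical instruction prefixes, one per mode.
PROMPTS = {
    Mode.DOCUMENTATION: "Rewrite the dictation as a structured clinical note.",
    Mode.QUERY: "Answer the question using only the retrieved guideline text.",
    Mode.DECISION_SUPPORT: "List relevant checks for the described scenario.",
}


def process(
    text: str,
    mode: Mode,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
    validate: Callable[[str, List[str]], bool],
) -> Optional[str]:
    """Return a validated draft, or None if validation rejects it."""
    # Assumption: documentation mode restructures speech and needs no retrieval.
    context = [] if mode is Mode.DOCUMENTATION else retrieve(text)
    prompt = "\n".join([PROMPTS[mode], *context, text])
    draft = generate(prompt)
    # The validation gate runs on every output before it is presented.
    return draft if validate(draft, context) else None
```

Returning `None` on a failed validation forces the caller to handle the "nothing safe to show" case explicitly instead of displaying an unchecked draft.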

Stage 3: Text-to-Speech (TTS)

For spoken responses, the LLM output is converted to speech using a neural TTS model. The TTS component is optimized for:

Clarity over naturalness. In a noisy clinical environment, a slightly more enunciated and slower speaking rate is preferable to a natural but hard-to-hear conversational tone. Our TTS model is tuned for intelligibility at moderate volume levels.

Medical pronunciation. The TTS model correctly pronounces medical terminology, drug names, and clinical abbreviations. Hearing “metoprolol” mispronounced as “meh-TOP-ro-lol” instead of “meh-TOE-pro-lol” breaks clinician trust immediately.

Appropriate pacing. Clinical information must be delivered at a pace that allows comprehension and action. Dosage information, for example, is spoken with deliberate pauses between the drug name, dosage, route, and frequency to reduce the risk of mishearing.
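One common way to express this kind of pacing is SSML, the W3C markup for speech synthesis, using its `<break>` and `<prosody>` elements. Whether the TTS model here accepts SSML is an assumption, and the function name, pause length, and rate below are illustrative values rather than tuned settings.

```python
from xml.sax.saxutils import escape


def dosage_ssml(
    drug: str,
    dose: str,
    route: str,
    frequency: str,
    pause_ms: int = 400,   # assumed pause length between fields
    rate: str = "90%",     # assumed slightly slower speaking rate
) -> str:
    """Build an SSML utterance with a pause between each dosage field."""
    parts = [escape(p) for p in (drug, dose, route, frequency)]
    pause = f'<break time="{pause_ms}ms"/>'
    body = pause.join(parts)
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'


ssml = dosage_ssml("metoprolol", "25 milligrams", "by mouth", "twice daily")
```

Keeping the four fields as separate arguments means the pauses land on field boundaries by construction, rather than relying on punctuation in free text.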

Not all responses use TTS. Documentation mode outputs are displayed on a screen for physician review rather than read aloud — hearing your own clinical note spoken back to you is not useful. Query and decision support responses use a hybrid approach: a concise spoken summary with detailed information displayed on an available screen.

HIPAA Considerations for Voice Data

Voice data introduces HIPAA challenges beyond those of text-based systems, and we designed Talk to LLM to address each one.

Voice Is PHI

HIPAA lists biometric identifiers, including voice prints, among the identifiers that make health information individually identifiable, so a patient’s recorded voice can constitute PHI. The voice data processed by Talk to LLM, however, is primarily the clinician’s voice, not the patient’s. This distinction matters for privacy analysis but does not eliminate HIPAA concerns: the clinician’s speech contains patient information (diagnoses, medications, treatments), and the audio itself may capture patient voices in the background.

On-Premise Processing Is Non-Negotiable

For the same reasons we run our LLMs on-premise (detailed in our article on small models), the entire Talk to LLM pipeline runs locally. The Whisper STT model, the LLM inference, and the TTS generation all execute on hardware within the healthcare organization’s facility. No audio data is transmitted to external servers.

This eliminates the primary HIPAA risk of cloud-based voice processing: the transmission and storage of audio containing PHI on third-party infrastructure.

Audio Retention Policy

Raw audio from the STT pipeline is processed and discarded in real time. We do not store audio recordings by default. The pipeline is designed as a streaming processor — audio chunks are transcribed as they arrive, and the audio buffer is cleared after transcription.

The transcribed text, however, is retained according to the organization’s documentation retention policy, since it becomes part of the clinical record. This creates a clean separation: the transient biometric data (voice audio) is ephemeral, while the clinical content (transcribed text) follows standard medical record retention rules.
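The separation between ephemeral audio and retained text can be illustrated with a small streaming generator. This is a sketch under stated assumptions: `stream_transcripts` and the `transcribe` callable are hypothetical names, and overwriting a buffer in Python does not guarantee that no other copy of the audio exists in memory.

```python
from typing import Callable, Iterable, Iterator


def stream_transcripts(
    chunks: Iterable[bytearray],
    transcribe: Callable[[bytearray], str],
) -> Iterator[str]:
    """Transcribe each audio chunk, then wipe it before moving on."""
    for chunk in chunks:
        text = transcribe(chunk)
        # Overwrite the audio bytes with zeros, then release the buffer.
        chunk[:] = bytes(len(chunk))
        chunk.clear()
        # Only the text leaves this function; it follows record retention rules.
        yield text


# Stand-in "audio" and transcriber so the sketch runs without a model.
audio_chunks = [bytearray(b"first chunk"), bytearray(b"second chunk")]
texts = list(stream_transcripts(audio_chunks, lambda b: bytes(b).decode("utf-8")))
```

After the loop, `audio_chunks` holds only empty buffers while `texts` holds the transcripts, which mirrors the retention split described above.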

Ambient Voice Capture Risks

In clinical environments, the microphone may capture conversations not intended for the AI system — discussions between clinicians, patient-clinician conversations, or background talk between staff. Our pipeline addresses this with:

Wake word activation. The system activates only when the clinician speaks a configurable activation phrase (e.g., “Hey Doc,” “Note this,” or a custom phrase). The microphone is in a low-power listening mode that processes only enough audio to detect the wake word — it does not transcribe background conversation.

Speaker verification. After wake word detection, the system verifies the speaker’s voice against enrolled clinician voiceprints. This prevents unauthorized users from interacting with the system and prevents accidental activation by non-clinician voices.

Session boundaries. Each voice interaction has explicit start and end boundaries. The system does not passively record and transcribe ambient conversation. The clinician initiates a session, speaks their input, and the session ends either by explicit command (“end note”) or by silence timeout.
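The three safeguards above combine naturally into a small state machine. The sketch below is illustrative: the class name, the five-second default timeout, and the event-method names are assumptions, while the "end note" phrase and the wake word, speaker verification, and silence timeout behaviors come from the description.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class State(Enum):
    IDLE = auto()     # only wake-word detection is running
    ACTIVE = auto()   # a verified clinician session is open


@dataclass
class VoiceSession:
    silence_timeout_s: float = 5.0   # assumed default
    end_phrase: str = "end note"
    state: State = State.IDLE
    transcript: List[str] = field(default_factory=list)
    _last_speech_at: float = 0.0

    def on_wake_word(self, speaker_verified: bool, now: float) -> None:
        # A session opens only for an enrolled, verified speaker.
        if self.state is State.IDLE and speaker_verified:
            self.state = State.ACTIVE
            self.transcript = []
            self._last_speech_at = now

    def on_utterance(self, text: str, now: float) -> None:
        if self.state is not State.ACTIVE:
            return  # nothing is transcribed outside a session
        if text.strip().lower().rstrip(".") == self.end_phrase:
            self.state = State.IDLE  # explicit end command
            return
        self.transcript.append(text)
        self._last_speech_at = now

    def on_tick(self, now: float) -> None:
        # Close the session after a period of silence.
        idle_for = now - self._last_speech_at
        if self.state is State.ACTIVE and idle_for >= self.silence_timeout_s:
            self.state = State.IDLE
```

Passing `now` in explicitly, instead of reading a clock inside the class, keeps the timeout logic deterministic and easy to test.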

Real-Time vs. Batch Processing

Clinical voice AI faces a fundamental architectural choice: process audio in real time as the clinician speaks, or collect audio and process it in batch after the interaction.

Real-Time Streaming

In real-time mode, the STT model transcribes audio as it arrives, with words appearing on screen (or in the processing pipeline) within 200-500 milliseconds of being spoken. This enables:

Live feedback — The clinician can see their words being transcribed and correct errors immediately
Interrupt capability — The clinician can change direction mid-sentence (“Actually, scratch that — the patient reports the pain started three days ago, not two”)
Conversational interaction — For query and decision support modes, the system can begin processing the request before the clinician finishes speaking, reducing perceived latency

The downside of real-time processing is increased compute overhead. The STT model must process audio chunks continuously, the LLM must handle partial inputs and potentially revise its understanding as more context arrives, and the entire pipeline must maintain state across the streaming session.

Batch Processing

In batch mode, the clinician speaks their complete input, and the system processes the entire audio recording at once. This is simpler architecturally and allows for better transcription accuracy (because the model has the full context when processing each segment), but it introduces latency. A 60-second clinical note takes 60 seconds to speak plus additional processing time before the clinician sees the output.

Our Hybrid Approach

Talk to LLM uses a hybrid streaming architecture:

STT runs in real-time streaming mode — The clinician sees their words transcribed live, providing immediate feedback and error detection
LLM processing runs in batch mode — The LLM waits for a complete input segment (detected by sentence boundaries and pause detection) before generating its response. This allows the LLM to consider the full context of the input rather than processing partial sentences
TTS runs in streaming mode — The spoken response begins playing as soon as the first sentence is generated, without waiting for the complete LLM output

This hybrid approach delivers perceived responsiveness (the clinician hears the first words of the response within 1-2 seconds of finishing their input) while maintaining the accuracy benefits of batch LLM processing.
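The two boundaries in this design, where a streamed transcript becomes a complete LLM input and where an LLM response becomes speakable units, can be sketched with two small generators. The function names and the 0.8-second pause threshold are assumptions, as is the choice to require both a sentence ending and a pause; the sentence splitter is deliberately naive and would mis-split abbreviations such as "Dr.".

```python
import re
from typing import Iterable, Iterator, List, Tuple


def segment_inputs(
    words: Iterable[Tuple[str, float]],
    pause_s: float = 0.8,  # assumed pause threshold
) -> Iterator[str]:
    """Group streamed (token, start_time) pairs into complete input segments.

    A segment closes when the previous token ended a sentence AND the next
    token starts after a pause of at least pause_s seconds.
    """
    current: List[str] = []
    last_time = 0.0
    for token, start in words:
        if current:
            sentence_ended = current[-1].endswith((".", "?", "!"))
            if sentence_ended and start - last_time >= pause_s:
                yield " ".join(current)
                current = []
        current.append(token)
        last_time = start
    if current:
        yield " ".join(current)  # flush whatever remains at end of stream


def sentences_for_tts(response: str) -> Iterator[str]:
    """Yield the response one sentence at a time so playback can start early."""
    for sentence in re.split(r"(?<=[.?!])\s+", response.strip()):
        if sentence:
            yield sentence


stream = [("What", 0.0), ("dose?", 0.3), ("For", 1.5), ("adults.", 1.8)]
segments = list(segment_inputs(stream))
spoken = list(sentences_for_tts("First sentence. Second sentence."))
```

In a running system the LLM would be called once per yielded segment, and each sentence from `sentences_for_tts` would be handed to the TTS stage as soon as it is available.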

EHR Integration

A voice AI system that operates in isolation from the electronic health record is a toy. Clinical voice AI must read from and write to the EHR to be useful in real workflows.

FHIR-Based Integration

Talk to LLM integrates with EHR systems through FHIR R4 APIs (Fast Healthcare Interoperability Resources), the industry standard for clinical data exchange. The integration supports:

Reading patient context. When a clinician begins a voice session, the system pulls the current patient’s context from the EHR: active problem list, current medications, recent labs, allergies, and relevant history. This context is provided to the LLM, so when the clinician says “Add amoxicillin 500 three times daily for 10 days,” the system can check the order against the patient’s documented penicillin allergy and flag the allergy conflict before the order is placed.

Writing clinical documentation. Generated clinical notes, orders, and documentation updates are written back to the EHR via FHIR DocumentReference and other appropriate resource types. The clinician reviews and approves the content through the voice interface or a companion screen before it is committed to the patient record.

Order entry. For supported EHR systems, Talk to LLM can generate structured orders (medications, labs, imaging) from voice commands. “Order a CBC, CMP, and TSH” generates the appropriate lab orders with correct codes, which are queued for clinician electronic signature.
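The first two capabilities, reading context and writing documentation, can be sketched with plain dictionaries shaped like FHIR R4 resources. The field names (`code.text` on AllergyIntolerance; `status`, `docStatus`, `type`, `subject`, and `content.attachment` on DocumentReference) follow the FHIR R4 specification. The drug-class table, the function names, and the choice of `docStatus: "preliminary"` to represent a note awaiting clinician approval are assumptions made for this example.

```python
import base64
from typing import Dict, List

# Hypothetical drug-to-class table; a real system would use a drug knowledge base.
DRUG_CLASSES = {
    "amoxicillin": "penicillin",
    "ampicillin": "penicillin",
    "metformin": "biguanide",
}


def allergy_conflicts(ordered_drug: str, allergies: List[Dict]) -> List[str]:
    """Return documented allergy substances that match the drug or its class."""
    drug = ordered_drug.lower()
    drug_class = DRUG_CLASSES.get(drug)
    conflicts = []
    for allergy in allergies:
        substance = allergy.get("code", {}).get("text", "").lower()
        if substance and substance in (drug, drug_class):
            conflicts.append(substance)
    return conflicts


def build_document_reference(patient_id: str, note_text: str) -> Dict:
    """Wrap a generated note in a FHIR R4 DocumentReference payload."""
    encoded = base64.b64encode(note_text.encode("utf-8")).decode("ascii")
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "docStatus": "preliminary",  # assumed mapping for "pending review"
        "type": {
            "coding": [
                {
                    "system": "http://loinc.org",
                    "code": "11506-3",
                    "display": "Progress note",
                }
            ]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "content": [
            {"attachment": {"contentType": "text/plain", "data": encoded}}
        ],
    }


patient_allergies = [
    {"resourceType": "AllergyIntolerance", "code": {"text": "Penicillin"}}
]
flags = allergy_conflicts("Amoxicillin", patient_allergies)
doc = build_document_reference("123", "Follow-up visit.")
```

Matching on free-text `code.text` is the simplest possible check; coded allergies (for example, RxNorm or SNOMED CT codings) would allow more reliable class-level matching.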

Existing EHR Voice Capabilities

Major EHR vendors (Epic with Dragon integration, Cerner with their voice modules, and MEDITECH with their speech solutions) have existing voice capabilities. Talk to LLM is designed to complement rather than replace these vendor solutions:

It handles AI-augmented interactions (clinical queries, decision support, intelligent documentation) that go beyond what dictation-to-text provides
It serves as an EHR-agnostic layer that works across different EHR systems, allowing health systems with multiple EHR platforms to provide a consistent voice AI experience
It integrates with IntelMedica’s clinical AI pipeline (Doc Assist AI, specialized validation models), providing capabilities that are outside the scope of EHR-embedded voice tools

Clinical Use Cases

Voice-Activated Clinical Notes

The most immediate and high-impact use case. A physician speaks naturally during or after a patient encounter, and Talk to LLM generates a structured clinical note:

“This is a follow-up visit for Mrs. Patterson. She’s a 72-year-old woman with type 2 diabetes, hypertension, and stage 3 chronic kidney disease. Her A1C from last week came back at 7.2, down from 7.8 three months ago. She reports good compliance with metformin 1000 twice daily. No hypoglycemic episodes. Blood pressure today is 138 over 82. I’m going to continue current medications, add a spot urine for albumin-to-creatinine ratio, and see her back in three months.”

From this natural speech, the system generates a properly formatted SOAP note with structured problem list entries, medication reconciliation, and appropriate follow-up orders — all queued for physician review and signature.

Hands-Free Decision Support

During a procedure or while examining a patient:

“What is the maximum recommended dose of lidocaine with epinephrine for local anesthesia in a 70-kilogram adult?”

The system responds with a concise, evidence-sourced answer: “The maximum recommended dose of lidocaine with epinephrine is 7 milligrams per kilogram, which for a 70-kilogram patient is 490 milligrams. This corresponds to approximately 49 milliliters of 1% solution or 24.5 milliliters of 2% solution. Source: American Society of Regional Anesthesia guidelines.”
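The arithmetic in this sample answer is straightforward to reproduce. The sketch below only restates the calculation from the quoted response (7 mg/kg, 70 kg, 1% and 2% solutions); the function name is hypothetical, and the conversion relies on the standard definition that a 1% w/v solution contains 10 mg per milliliter.

```python
from typing import Tuple


def max_dose_and_volume(
    weight_kg: float,
    max_mg_per_kg: float,
    concentration_percent: float,
) -> Tuple[float, float]:
    """Return (maximum dose in mg, corresponding volume in mL)."""
    max_dose_mg = weight_kg * max_mg_per_kg
    # 1% w/v = 1 g per 100 mL = 10 mg per mL
    mg_per_ml = concentration_percent * 10
    return max_dose_mg, max_dose_mg / mg_per_ml


dose_mg, volume_1pct = max_dose_and_volume(70, 7, 1)   # 490 mg, 49.0 mL
_, volume_2pct = max_dose_and_volume(70, 7, 2)         # 24.5 mL
```

Doing this calculation deterministically in code, and having the LLM only phrase the result, is one way to keep dosage arithmetic out of free-form generation.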

Patient Handoff Summaries

At shift change, the outgoing clinician speaks a handoff summary for each patient:

“Room 4, Mr. Chen. 58-year-old male admitted last night with acute cholecystitis. Surgery consulted, planning lap chole tomorrow morning. He’s NPO after midnight. Pain controlled on IV morphine 2 mg every 4 hours PRN. Last dose was at 3 PM. Labs from this morning showed white count trending down to 12.3 from 15.1 on admission. Lipase normal. Needs a type and screen sent tonight for tomorrow’s case.”

The system generates a structured handoff document following SBAR (Situation, Background, Assessment, Recommendation) or I-PASS format, with critical action items highlighted for the incoming clinician.

Voice-Driven Prior Authorization

Integrating Talk to LLM with Doc Assist AI, a physician can initiate a prior authorization by voice:

“I need a prior auth for Mr. Rodriguez in room 12. He needs an MRI of the lumbar spine with and without contrast. He’s had six weeks of conservative therapy including physical therapy and NSAIDs without improvement. Diagnosis is lumbar radiculopathy with progressive neurological deficit.”

The system pulls the patient’s record, feeds the voice input plus EHR context into Doc Assist AI, and generates a complete prior authorization document with CMS-compliant medical necessity language — all initiated by a 20-second voice command instead of a 30-minute manual documentation process.

The Future: Ambient Clinical Intelligence

Talk to LLM, as described above, is an active voice interface — the clinician initiates interaction and speaks with intent to the system. The next frontier is ambient clinical intelligence — AI that passively listens to the clinician-patient encounter (with appropriate consent) and generates clinical documentation, identifies relevant decision support opportunities, and surfaces pertinent information without explicit clinician commands.

Ambient clinical intelligence represents a paradigm shift from “the clinician talks to the AI” to “the AI listens to the clinical encounter.” Products like Nuance DAX, Abridge, and Suki are pioneering this space, and IntelMedica’s roadmap includes ambient capabilities built on our existing Talk to LLM infrastructure.

The technical challenges for ambient mode are substantial:

Multi-speaker diarization — Reliably separating clinician speech from patient speech from family member speech in real time
Intent classification — Determining which parts of the conversation are clinically relevant documentation versus social conversation, administrative discussion, or teaching moments
Implicit consent management — Ensuring patients are aware of and consent to AI-assisted documentation in a way that does not disrupt the clinical encounter
Context persistence — Maintaining awareness of the clinical context across a complete encounter that may last 15-45 minutes with interruptions

These challenges are solvable with the technology available today, and our on-premise architecture positions us well — ambient processing generates substantially more audio data than active voice commands, making the privacy advantages of local processing even more compelling.

Conclusion

The healthcare industry’s user interface problem is not a matter of better screen design or more intuitive menus. It is a fundamental mismatch between how clinicians work and how software expects to be used. Clinicians work with their hands, their eyes, and their voices. Software expects keyboards, mice, and screens.

Voice-first interfaces do not eliminate screens — they add a natural, hands-free interaction modality that fits the clinical workflow rather than forcing the workflow to fit the technology. When a surgeon can check a drug interaction without breaking sterile field, when an ER physician can dictate a note while managing a resuscitation, when a nurse can verify a dosage without leaving the bedside — that is technology serving medicine rather than medicine serving technology.

Talk to LLM is IntelMedica’s answer to the clinical interface problem. It is built on-premise for HIPAA compliance, powered by specialized small models for speed and accuracy, and designed to integrate with existing EHR workflows rather than replace them. Because the best clinical AI in the world is worthless if the clinician cannot use it with a patient in front of them.

About IntelMedica: We build AI-powered tools that help healthcare professionals deliver better patient care.