Why We Self-Host Everything: A Healthcare AI Startup’s Guide to Sovereign Infrastructure

There is an assumption in the technology industry that cloud-first is the only serious infrastructure strategy. AWS, GCP, and Azure have spent billions convincing CTOs that managing your own hardware is a distraction from your core business. For most startups, they are right.

Healthcare AI is not most startups.

At IntelMedica, we run our entire AI infrastructure on self-hosted hardware. Our LLMs run on consumer GPUs. Our databases run on bare-metal servers. Our Kubernetes cluster spans physical machines connected by a mesh VPN with no public IP addresses. Our marginal cost per token for LLM inference is effectively zero: once the hardware is paid for, the only ongoing cost is electricity.

This is not a philosophical stance against the cloud. It is a pragmatic response to the specific constraints of building AI systems that process Protected Health Information. This article explains why we made this choice, how we built the infrastructure, what it costs, and what we have learned that other healthcare CTOs should know.

Why Cloud-First Does Not Work for Healthcare AI

The cloud has genuine advantages: elastic scaling, managed services, global availability, and offloading operational complexity. For a SaaS product serving general consumers, these advantages are decisive. For healthcare AI, they are offset by four structural problems.

Data Residency and Sovereignty

When a healthcare organization sends patient data to a cloud provider, that data is stored on infrastructure they do not own, in data centers they have not audited, managed by personnel they have not vetted. For many healthcare organizations — particularly those in the EU under GDPR, or in states with strict data privacy laws — this is a non-starter.

The regulatory trend is toward more data sovereignty requirements, not fewer. The EU’s European Health Data Space (EHDS) regulation, state-level health privacy laws in Washington, Connecticut, and other states, and the increasing scrutiny of cross-border data transfers all point in the same direction: healthcare data should stay close to the organization that owns it.

Self-hosting eliminates the data residency question entirely. The data lives on hardware you own, in a facility you control, processed by systems you operate. There is no third-party data processing agreement to negotiate because there is no third party.

HIPAA Compliance Complexity

HIPAA compliance in the cloud is not impossible, but it is expensive and operationally complex. You need a Business Associate Agreement (BAA) with every cloud service you use. Not every cloud service offers BAAs, which limits your technology choices. You need to ensure that every data path — every API call, every log entry, every cached value — stays within HIPAA-compliant services.

The practical result is that healthcare organizations in the cloud spend significant engineering effort on compliance guardrails: ensuring data does not leak into non-BAA services, managing encryption keys, auditing access logs, and maintaining documentation for regulators. This compliance overhead is a recurring cost that scales with system complexity.

Self-hosting simplifies the compliance model. You are both the covered entity and the infrastructure operator. Your compliance boundary is your physical network perimeter. Every service you run is under your direct control and subject to your security policies. This does not eliminate HIPAA compliance work — you still need access controls, audit logs, encryption, and policies — but it eliminates the multi-party compliance coordination that makes cloud HIPAA expensive.

Cost at AI Scale

The economics of cloud AI infrastructure are unfavorable for healthcare workloads. Consider the cost of running a clinical LLM:

Cloud inference (AWS Bedrock, GCP Vertex AI, Azure OpenAI): Pricing varies, but a mid-tier model running at moderate volume (100,000 clinical notes per month, averaging 2,000 tokens per note) costs roughly $8,000 to $15,000 per month in inference charges alone. This scales linearly with volume. At 500,000 notes per month, you are looking at $40,000 to $75,000 monthly.

Self-hosted inference: Two NVIDIA RTX 4060 Ti GPUs (approximately $400 each at retail) running vLLM can serve a 7-8 billion parameter model at 30-50 tokens per second each. For clinical documentation workloads, which are bursty (high volume during clinic hours, low volume overnight), two GPUs provide adequate throughput for a mid-size clinic. The ongoing cost is electricity — roughly $30-50 per month for two GPUs under moderate load.

The capital expenditure for self-hosted inference is recovered in the first month compared to cloud pricing. Over a three-year infrastructure lifecycle, the difference is six figures.
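The payback claim can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above. The per-million-token price is derived from them rather than quoted from any provider, and the calculation covers the GPUs only, not the host machines.

```python
# Figures quoted in the article. Function names are illustrative.
NOTES_PER_MONTH = 100_000
TOKENS_PER_NOTE = 2_000
GPU_COUNT = 2
GPU_PRICE_USD = 400  # approximate retail price per RTX 4060 Ti


def monthly_tokens() -> int:
    """Total tokens processed per month at the stated volume."""
    return NOTES_PER_MONTH * TOKENS_PER_NOTE


def implied_price_per_million(cloud_monthly_usd: float) -> float:
    """Cloud price per million tokens implied by a monthly bill."""
    return cloud_monthly_usd / (monthly_tokens() / 1_000_000)


def payback_days(cloud_monthly_usd: float, electricity_monthly_usd: float,
                 days_in_month: int = 30) -> float:
    """Days until the GPU purchase is covered by avoided cloud charges."""
    capex = GPU_COUNT * GPU_PRICE_USD
    daily_saving = (cloud_monthly_usd - electricity_monthly_usd) / days_in_month
    return capex / daily_saving


def three_year_saving(cloud_monthly_usd: float,
                      electricity_monthly_usd: float) -> float:
    """Net saving over 36 months, after subtracting the GPU purchase."""
    capex = GPU_COUNT * GPU_PRICE_USD
    return 36 * (cloud_monthly_usd - electricity_monthly_usd) - capex


if __name__ == "__main__":
    # Most conservative case: cheapest cloud bill, most expensive electricity.
    print(implied_price_per_million(8_000))
    print(payback_days(8_000, 50))
    print(three_year_saving(8_000, 50))
```

Even in the most conservative case (low cloud estimate, high electricity estimate), the GPUs pay for themselves within the first week and the three-year difference is well into six figures.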

This is not a hypothetical comparison. We have run both configurations. The self-hosted setup processes clinical documents at comparable quality and latency to cloud-hosted models, at a fraction of the cost.

Latency

Clinical workflows are latency-sensitive. When a clinician is dictating a note with AI assistance, a 2-second delay between speaking and seeing the transcription is acceptable. A 5-second delay is disruptive. A 10-second delay — which is not uncommon with cloud-hosted LLMs during peak hours — is workflow-breaking.

Self-hosted inference with local GPUs typically provides sub-second response times for text generation tasks. The data does not leave the local network. There is no API gateway, no load balancer, no cross-region routing — just a direct connection from the application server to the GPU.

For real-time clinical applications like ambient listening and live note generation, this latency advantage is not optional. It is a requirement.

The Stack

Our infrastructure is built on components that are individually well-understood and collectively designed for healthcare AI workloads.

K3s Kubernetes

We run K3s, a lightweight Kubernetes distribution created by Rancher (now SUSE). K3s provides the full Kubernetes API and ecosystem with a fraction of the resource overhead of upstream Kubernetes.

Our cluster runs across 8 worker nodes and a control plane node, spanning two physical Proxmox hypervisors. The nodes are a mix of configurations — some optimized for compute (CPU-heavy workloads like database operations), some equipped with GPUs (AI inference), and some optimized for storage (database persistence and model storage).

Why K3s instead of full Kubernetes? Resource efficiency. K3s uses approximately 512MB of RAM for the server process, compared to several gigabytes for a standard Kubernetes control plane. On hardware where every gigabyte matters (because we are also running LLMs on the same machines), this efficiency is meaningful.

Why Kubernetes at all? Because healthcare AI requires the orchestration primitives that Kubernetes provides:

Service discovery and load balancing for distributing inference requests across multiple GPU nodes.
Rolling updates for deploying new model versions without downtime.
Resource limits and requests for ensuring that AI inference workloads do not starve database or application processes.
Persistent volume claims for managing database storage and model artifacts.
Health checks and restart policies for maintaining high availability of clinical services.
Namespace isolation for separating production, staging, and development environments on the same cluster.
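To make several of these primitives concrete, here is a minimal sketch of a Deployment manifest for an inference service, built as a Python dictionary and printed as JSON (which kubectl accepts alongside YAML). The service name, namespace, model name, and resource sizes are hypothetical. The sketch assumes the NVIDIA device plugin is installed so that the nvidia.com/gpu resource exists, and that the vLLM container serves a /health endpoint on port 8000.

```python
import json

# Hypothetical model identifier, used only for illustration.
MODEL_NAME = "example-org/clinical-7b-awq"


def vllm_deployment(name: str = "vllm-inference",
                    namespace: str = "production",
                    replicas: int = 2) -> dict:
    """Build a Deployment manifest with rolling updates, limits, and probes."""
    probe = {"httpGet": {"path": "/health", "port": 8000}, "periodSeconds": 10}
    container = {
        "name": "vllm",
        "image": "vllm/vllm-openai:latest",
        "args": ["--model", MODEL_NAME],
        "ports": [{"containerPort": 8000}],
        # Requests and limits keep inference from starving other workloads.
        "resources": {
            "requests": {"cpu": "2", "memory": "8Gi"},
            "limits": {"memory": "12Gi", "nvidia.com/gpu": 1},
        },
        # Health checks: traffic only reaches ready pods; hung pods restart.
        "readinessProbe": probe,
        "livenessProbe": {**probe, "initialDelaySeconds": 120},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            # Replace one pod at a time. maxSurge is 0 because a small
            # cluster has no spare GPU to schedule an extra pod on.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 1, "maxSurge": 0},
            },
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(vllm_deployment(), indent=2))
```

With two replicas and maxUnavailable set to 1, one pod keeps serving while the other is replaced. The trade-off is reduced capacity during the rollout rather than a need for an idle GPU.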

Container Orchestration

Our application services run as containers orchestrated by Kubernetes. Each service — the FastAPI backend, the SvelteKit frontend, the vLLM inference server, the PostgreSQL database, supporting services like Redis and vector databases — is independently deployable, scalable, and monitorable.

We use Helm charts for packaging and deploying complex applications, and a GitOps workflow where infrastructure changes are version-controlled and applied through automated pipelines. No one SSHs into a server to make changes. Every infrastructure modification is a git commit.

Tailscale Mesh Networking

This is the component that makes self-hosted infrastructure practical for a distributed team. Tailscale provides a WireGuard-based mesh VPN that connects all of our nodes — physical servers, developer laptops, CI/CD runners — into a single, flat network with zero public IP addresses.

The implications for security are significant:

No public attack surface. None of our servers have public IP addresses. There is no firewall to misconfigure because there are no public ports to protect.
Zero-trust identity. Every connection is authenticated using Tailscale’s identity system, which integrates with our SSO provider. You cannot connect to a service without a valid identity.
Encrypted by default. All traffic between nodes is encrypted using WireGuard. There is no unencrypted path, even on the local network.
Access control lists (ACLs). Tailscale’s ACL system allows us to define which identities can access which services. A developer can reach the staging database but not production. The CI/CD runner can deploy to the cluster but cannot access patient data.

For HIPAA compliance, Tailscale’s architecture is ideal. It provides the network-level access controls, encryption, and audit logging that HIPAA’s Security Rule requires, without the complexity of traditional VPNs or the exposure of public-facing services.

Our K3s cluster runs its CNI (Container Network Interface) overlay on top of Tailscale. Flannel, the default K3s CNI, is bound to the Tailscale interfaces, so pod-to-pod traffic between nodes travels through WireGuard tunnels. This gives us cross-node encryption and lets the cluster span different physical locations without traditional VPN concentrators.

Running LLMs on Consumer GPUs

This is the section that most healthcare CTOs find surprising. You do not need A100s or H100s to run useful clinical AI models. Consumer-grade NVIDIA GPUs — the same cards gamers buy — are capable of running 7-8 billion parameter models at production-quality speeds.

Our current inference setup:

Hardware: NVIDIA RTX 3070 (8GB VRAM) and RTX 4060 Ti (16GB VRAM).
Inference server: vLLM, an open-source LLM serving engine that implements PagedAttention for efficient memory management, continuous batching for throughput optimization, and tensor parallelism for multi-GPU setups.
Models: We run quantized versions (GPTQ 4-bit and AWQ) of models in the 7-8B parameter range for clinical documentation tasks, and smaller specialized models (medical NER, ICD coding) on CPU.
Performance: The RTX 4060 Ti generates approximately 40 tokens per second for a 7B model with 4-bit quantization. For clinical note generation (average output: 300-500 tokens), this translates to a full generation time of roughly 7.5 to 12.5 seconds. For real-time transcription assistance, where outputs are streamed token-by-token, the first tokens appear in under one second, so the perceived latency stays low.
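The relationship between throughput, full generation time, and perceived latency is simple arithmetic. The sketch below uses the 40 tokens per second figure from above. The prefill time (prompt processing before the first output token) is an assumed placeholder, not a measurement.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to produce the complete output at a steady decode rate."""
    return output_tokens / tokens_per_second


def first_token_seconds(tokens_per_second: float,
                        prefill_seconds: float = 0.5) -> float:
    """Perceived latency when streaming: prefill plus one decode step.

    prefill_seconds is a hypothetical value; measure it on your hardware.
    """
    return prefill_seconds + 1.0 / tokens_per_second


if __name__ == "__main__":
    for tokens in (300, 500):
        print(tokens, "tokens:", generation_seconds(tokens, 40.0), "s")
    print("first token:", first_token_seconds(40.0), "s")
```

This is why streaming matters. The full note still takes several seconds to generate, but the clinician sees text begin to appear almost immediately.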

The key insight is that clinical AI workloads do not require frontier-scale models. In our experience, a well-fine-tuned 7B model can match or outperform much larger general-purpose models such as GPT-4 on narrow, domain-specific clinical tasks (medical note generation, ICD coding, clinical entity extraction) because it has been trained specifically for those tasks. We do not need hundreds of billions of parameters to write a SOAP note. We need 7 billion parameters that deeply understand SOAP notes.

Cost: $0 per token in usage fees. Once the hardware is purchased (roughly $800-1,200 for a capable consumer GPU), the only ongoing cost is electricity. At $0.12/kWh and 200W average power draw per GPU running around the clock, that is approximately $17 per month per GPU. For the inference volume of a mid-size clinical practice, this is effectively free.
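The electricity figure follows directly from the stated power draw and rate. The sketch below reproduces it and also derives an electricity cost per million generated tokens. That second number assumes the GPU decodes continuously at 40 tokens per second, which is an idealized bound rather than a measured workload.

```python
def monthly_electricity_usd(watts: float, usd_per_kwh: float,
                            hours_per_day: float = 24, days: int = 30) -> float:
    """Electricity cost for one device at a constant average draw."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * usd_per_kwh


def electricity_per_million_tokens(watts: float, usd_per_kwh: float,
                                   tokens_per_second: float,
                                   days: int = 30) -> float:
    """Cost per million tokens if the GPU generated tokens nonstop."""
    tokens = tokens_per_second * 86_400 * days
    return monthly_electricity_usd(watts, usd_per_kwh, days=days) / (tokens / 1e6)


if __name__ == "__main__":
    print(round(monthly_electricity_usd(200, 0.12), 2))
    print(round(electricity_per_million_tokens(200, 0.12, 40), 2))
```

A real clinic workload is bursty, so the effective cost per token will be higher than this bound, but it remains small next to per-token API pricing.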

Database Strategy

PostgreSQL is the foundation of our data layer, and we use it aggressively:

Relational data: Patient records, encounter data, schedules, user accounts — standard relational workloads that PostgreSQL handles at scale.
pgvector: The pgvector extension adds vector similarity search to PostgreSQL. We store clinical note embeddings, medication embeddings, and diagnosis embeddings as vectors and perform similarity searches directly in the database. This eliminates the need for a separate vector database for most use cases.
pg_trgm: Trigram-based fuzzy text matching for medication name search (“amoxicillin” matches “amoxycillin”), diagnostic term search, and patient name search. This is critical in clinical environments where data entry is inconsistent.
JSONB: For semi-structured clinical data that does not fit cleanly into a relational schema (FHIR resources, clinical form responses, AI model outputs), PostgreSQL’s JSONB type provides schema-flexible storage with indexing and query capabilities.
Row-Level Security (RLS): PostgreSQL’s RLS policies enforce access controls at the database level. A query from the billing module physically cannot return clinical notes, regardless of application-level bugs or misconfigurations. This is defense in depth for HIPAA compliance.
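To see why trigram matching tolerates spelling variants, here is a small pure-Python approximation of the similarity measure that pg_trgm documents: each word is lowercased and padded with two leading spaces and one trailing space, split into three-character sequences, and compared by the ratio of shared to total distinct trigrams. This is a simplified illustration, not the extension's implementation. It splits on whitespace only, whereas pg_trgm also treats other non-alphanumeric characters as separators.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of padded trigrams for every word in the text."""
    grams: set[str] = set()
    for word in text.lower().split():
        padded = f"  {word} "  # two spaces before, one after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(similarity("amoxicillin", "amoxycillin"))
    print(similarity("amoxicillin", "metformin"))
```

The two spellings of amoxicillin share 9 of 15 distinct trigrams, comfortably above pg_trgm's default match threshold of 0.3. In the database itself the equivalent comparison is the similarity() function or the % operator, typically backed by a GIN or GiST trigram index.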

Qdrant is our specialized vector database for workloads that exceed pgvector’s capabilities — specifically, large-scale clinical guideline retrieval (RAG) where we are searching across millions of document chunks with complex metadata filtering. Qdrant runs as a Kubernetes deployment with persistent storage.

The decision of when to use pgvector versus Qdrant is straightforward: if the vectors are associated with data already in PostgreSQL (patient notes, clinical entities), use pgvector. If the vectors represent an independent corpus (clinical guidelines, drug databases, medical literature), use Qdrant. This avoids the complexity of synchronizing data between two systems when it is unnecessary.

Monitoring and Observability

You cannot operate infrastructure you cannot see. Our monitoring stack:

Prometheus scrapes metrics from all services — Kubernetes node metrics, application metrics, GPU utilization, database performance, inference latency. Alerting rules notify on-call engineers of anomalies.
Grafana provides dashboards for real-time visibility into system health. We maintain dashboards for infrastructure (node health, disk usage, network traffic), application performance (request latency, error rates, throughput), and AI-specific metrics (inference latency p50/p95/p99, GPU utilization, model cache hit rates).
Loki aggregates logs from all containers. Structured logging in JSON format enables efficient search and correlation across services. For HIPAA compliance, logs that might contain PHI are filtered and stored separately with enhanced access controls.
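As an illustration of structured logging with PHI handling, the sketch below emits one JSON object per log line and masks fields that may contain PHI before the line leaves the process. This is a simplification: it redacts in place rather than routing PHI-bearing logs to separate storage as described above, and the field names are hypothetical.

```python
import json
import logging

# Hypothetical field names that may carry PHI in this example.
PHI_KEYS = {"patient_name", "mrn", "dob", "note_text"}


def redact(fields: dict) -> dict:
    """Replace the values of PHI-bearing keys with a fixed marker."""
    return {k: "[REDACTED]" if k in PHI_KEYS else v for k, v in fields.items()}


class JsonFormatter(logging.Formatter):
    """Format each record as a single JSON object with redacted fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Structured fields arrive via logger.info(..., extra={"fields": {...}}).
            "fields": redact(getattr(record, "fields", {})),
        }
        return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("inference")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("note generated",
             extra={"fields": {"mrn": "12345", "latency_ms": 412}})
```

Key-based redaction only catches PHI in known fields. Free-text messages need their own review, which is one reason to keep PHI out of the message string entirely.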

The monitoring stack itself runs on the same K3s cluster, isolated in a dedicated namespace with separate storage. This is a calculated trade-off — running monitoring on the same infrastructure it monitors creates a dependency loop — but for our current scale, the operational simplicity outweighs the risk. At larger scale, we would move monitoring to a dedicated cluster.

Cost Comparison

Here is a realistic cost comparison for a healthcare AI workload equivalent to ours, running on cloud versus self-hosted infrastructure:

Cloud (AWS as representative)

Component	Monthly Cost
EC2 instances (application servers, 3x m6i.xlarge)	$430
RDS PostgreSQL (db.r6g.xlarge, Multi-AZ)	$580
SageMaker inference (ml.g5.xlarge, on-demand)	$1,200
Bedrock API calls (clinical LLM inference)	$8,000-15,000
S3 storage (clinical documents, model artifacts)	$150
VPC, NAT Gateway, data transfer	$200
CloudWatch, logging	$100
WAF, Shield, compliance tooling	$150
Monthly total	$10,810-$17,810
Annual total	$129,720-$213,720

Self-Hosted (IntelMedica actual costs)

Component	One-Time Cost	Monthly Cost
Server hardware (2x refurbished workstations)	$3,000	—
GPUs (RTX 3070 + RTX 4060 Ti)	$1,100	—
NVMe SSDs (4x 2TB)	$600	—
RAM upgrades (128GB total)	$400	—
Network equipment	$200	—
Electricity (servers + GPUs)	—	$80
Tailscale (team plan)	—	$18/user
Internet (business-grade, static IP not needed)	—	$80
Domain and DNS	—	$5
One-time total	$5,300	—
Monthly total	—	$220
Annual total (Year 1, including hardware)	—	$7,940
Annual total (Year 2+)	—	$2,640

The Year 1 cost of self-hosting is 94% less than the low end of cloud hosting. Year 2 onwards, it is 98% less. Even if you double the hardware budget for redundancy and spare parts, self-hosting is an order of magnitude cheaper.
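The percentages can be reproduced from the two tables. The sketch below uses the line items exactly as listed and takes the stated $220 monthly total for self-hosting as given.

```python
# Line items from the cloud table, excluding Bedrock (USD per month).
CLOUD_FIXED_MONTHLY = [430, 580, 1_200, 150, 200, 100, 150]
BEDROCK_MONTHLY_LOW, BEDROCK_MONTHLY_HIGH = 8_000, 15_000

# Line items from the self-hosted table.
SELF_HOSTED_ONE_TIME = [3_000, 1_100, 600, 400, 200]
SELF_HOSTED_MONTHLY = 220  # monthly total as stated in the table


def cloud_annual(bedrock_monthly: int) -> int:
    """Annual cloud cost for a given monthly Bedrock bill."""
    return 12 * (sum(CLOUD_FIXED_MONTHLY) + bedrock_monthly)


def self_hosted_annual(year: int) -> int:
    """Annual self-hosted cost; hardware is paid once, in year 1."""
    hardware = sum(SELF_HOSTED_ONE_TIME) if year == 1 else 0
    return hardware + 12 * SELF_HOSTED_MONTHLY


def percent_saving(self_hosted: float, cloud: float) -> float:
    """How much cheaper self-hosting is, as a percentage of cloud cost."""
    return 100 * (1 - self_hosted / cloud)


if __name__ == "__main__":
    low = cloud_annual(BEDROCK_MONTHLY_LOW)
    for year in (1, 2):
        cost = self_hosted_annual(year)
        print(f"year {year}: ${cost:,} vs ${low:,}",
              f"({percent_saving(cost, low):.0f}% less)")
```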

The obvious counterargument is operational overhead: you need someone to manage the infrastructure. This is true, but the operational burden of a nine-node K3s cluster is measured in hours per month, not full-time headcount. Kubernetes handles service restarts, rolling updates, and resource scheduling automatically. Tailscale eliminates network configuration. Prometheus alerts on issues before they become outages. We estimate our infrastructure operational overhead at 10-15 hours per month, which at engineering rates adds $3,000-5,000 per month. Even with this included, self-hosting comes to roughly $3,200-5,200 per month against $10,810-17,810 for cloud, which is still less than half the cost.

Security Architecture

Self-hosting is not inherently more or less secure than cloud hosting. It is differently secure. Here is how we approach it.

No Public IP Addresses

This is the single most impactful security decision we have made. None of our servers — not the application servers, not the databases, not the GPU nodes, not the monitoring stack — have public IP addresses. They are only reachable through the Tailscale mesh network.

A server with no public IP cannot be port-scanned from the internet. It cannot be hit by automated vulnerability scanners. It cannot be targeted directly by a DDoS attack. It does not appear in Shodan or Censys. The attack surface our servers present to external adversaries is effectively zero, because nothing on them is directly reachable from outside.

Public-facing services (the website, the API gateway for mobile clients) are served through Cloudflare Tunnel, which provides DDoS protection, WAF, and TLS termination without exposing our origin servers. Traffic flows: User -> Cloudflare Edge -> Cloudflare Tunnel -> Tailscale -> Application Server. At no point is our server’s IP address visible to the internet.

Zero-Trust Networking

Tailscale implements zero-trust principles natively:

Every device is authenticated before it can join the network.
Every connection is encrypted with WireGuard.
Access Control Lists define which devices can communicate with which services.
MagicDNS provides service discovery without DNS servers that could be poisoned.
Key rotation happens automatically.

We define ACLs that enforce the principle of least privilege. The CI/CD runner can push containers to the registry and apply Kubernetes manifests, but cannot access database ports. Developer machines can access staging environments but cannot reach production databases. The monitoring stack can scrape metrics from all services but cannot write to any of them.
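The least-privilege rules above could be expressed roughly as follows. The policy dictionary mirrors the general shape of a Tailscale policy file (default deny, with accept rules listing sources and host:port destinations). The tag names, group name, and ports are hypothetical, and the is_allowed function is a toy evaluator for illustration, not Tailscale's actual matching logic (it ignores group expansion, port ranges, and CIDR matching).

```python
# Hypothetical tags and ports. Anything not accepted here is denied.
POLICY = {
    "acls": [
        # CI/CD may reach the registry and the Kubernetes API only.
        {"action": "accept", "src": ["tag:ci"],
         "dst": ["tag:registry:443", "tag:k3s-api:6443"]},
        # Developers may reach anything in staging.
        {"action": "accept", "src": ["group:developers"],
         "dst": ["tag:staging:*"]},
        # Monitoring may scrape the metrics port on every node.
        {"action": "accept", "src": ["tag:monitoring"],
         "dst": ["*:9100"]},
    ],
}


def is_allowed(policy: dict, src: str, dst: str, port: int) -> bool:
    """Return True if some accept rule covers src -> dst:port."""
    for rule in policy["acls"]:
        if rule["action"] != "accept" or src not in rule["src"]:
            continue
        for target in rule["dst"]:
            # Split on the last colon: "tag:staging:*" -> ("tag:staging", "*").
            host, _, ports = target.rpartition(":")
            if host in ("*", dst) and ports in ("*", str(port)):
                return True
    return False  # default deny


if __name__ == "__main__":
    print(is_allowed(POLICY, "tag:ci", "tag:k3s-api", 6443))
    print(is_allowed(POLICY, "tag:ci", "tag:prod-db", 5432))
    print(is_allowed(POLICY, "group:developers", "tag:prod-db", 5432))
```

Because the model is default deny, the production database needs no explicit block rule. It is unreachable simply because no rule grants access to it.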

Encryption

In transit: All inter-node traffic is encrypted via WireGuard (Tailscale). All client-facing traffic uses TLS 1.3 via Cloudflare.
At rest: Database volumes use LUKS full-disk encryption. Kubernetes secrets are encrypted with AES-256 in etcd. Backup volumes are encrypted before transfer.
Application level: PHI fields in the database are encrypted at the application level using envelope encryption, providing defense in depth even if disk-level encryption is compromised.

Access Control

Infrastructure access: SSH access to nodes is only available through the Tailscale network and requires key-based authentication. There are no passwords to brute-force.
Kubernetes access: kubectl access is restricted via RBAC policies. Developers get namespace-scoped access. Only the CI/CD pipeline and designated infrastructure engineers have cluster-admin permissions.
Database access: PostgreSQL roles are configured with minimum necessary privileges. Application services connect through PgBouncer with role-specific credentials. Row-level security provides per-query access filtering.
Audit logging: All access — SSH sessions, Kubernetes API calls, database queries, application requests — is logged, timestamped, and retained for the period required by HIPAA (six years minimum).

Scaling Strategy

The obvious question with self-hosted infrastructure is: how do you scale? The answer depends on the time horizon.

Near-term (Current to 10x Load)

Our current hardware has significant headroom. The K3s cluster is running at approximately 40% CPU utilization and 60% memory utilization during peak hours. GPU utilization peaks at 70% during busy clinic hours and drops below 10% overnight.

To handle 10x current load, we would:

Add 2-4 additional GPU nodes (consumer workstations with RTX 4060 Ti or RTX 4070 Ti cards). Cost: $2,000-4,000 per node.
Add NVMe storage to existing nodes for database scaling.
Implement horizontal pod autoscaling to distribute load dynamically.
Add a second PostgreSQL replica for read scaling.

Total cost for 10x scaling: approximately $10,000-15,000 in hardware, deployed over a weekend.

Medium-term (10x to 100x Load)

At this scale, consumer hardware reaches its limits. We would transition to:

Dedicated server-grade hardware in a colocation facility. A 42U rack with enterprise servers provides significantly more compute density and reliability than workstations.
Professional GPU hardware (NVIDIA L4 or L40S) for inference workloads that need more VRAM or higher throughput.
Multi-node PostgreSQL with Citus or native PostgreSQL 16 logical replication for database scaling.
A second K3s cluster in a separate physical location for disaster recovery.

The colocation model preserves data sovereignty (you own the hardware, the colo facility provides power, cooling, and network connectivity) while providing enterprise-grade physical infrastructure.

Long-term (Production Healthcare Deployments)

For production deployments at healthcare organizations, the infrastructure model shifts. Rather than running centralized infrastructure, we deploy on-premise at the customer site:

Pre-configured K3s clusters shipped as appliances (physical servers pre-loaded with our stack).
Tailscale provides secure connectivity between the customer deployment and our management plane.
Customer data never leaves the customer’s facility.
We provide remote monitoring and management through the Tailscale connection.
Updates are deployed via our GitOps pipeline, tested in our staging environment, and rolled out to customer clusters automatically.

This model gives healthcare organizations the data sovereignty they require while providing the managed-service experience they expect.

Lessons Learned

Two years of self-hosting healthcare AI infrastructure have taught us things that no architecture document could:

Start with fewer nodes than you think you need.

We initially planned for 16 nodes. We launched with 4 and gradually expanded to 9. The smaller cluster was easier to debug, faster to iterate on, and forced us to optimize resource utilization. Healthcare AI workloads are bursty — high during clinic hours, low overnight and weekends. Right-sizing for average load and handling bursts with queuing is more cost-effective than provisioning for peak.

Consumer GPUs are genuinely production-viable.

We expected consumer GPUs to be a stopgap until we could afford enterprise hardware. Two years later, they are still our primary inference platform. The RTX 4060 Ti, in particular, offers exceptional value: 16GB VRAM handles quantized 7-8B models comfortably, CUDA support is identical to enterprise cards, and the cards are readily available at consumer retail prices. The main limitation is VRAM — if you need to run 13B+ models at high quality, you need more than 16GB. But for the model sizes that dominate clinical AI workloads, consumer GPUs are sufficient.
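The VRAM limit is easy to estimate from parameter count and precision. The sketch below computes the memory needed for model weights alone and checks it against a card's capacity. The 20% headroom reserved for the KV cache and activations is an assumed rule of thumb, not a measured figure, and real requirements vary with context length and batch size.

```python
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for model weights only, in decimal gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def fits(params_billion: float, bits_per_weight: int, vram_gb: float,
         headroom_fraction: float = 0.2) -> bool:
    """True if weights fit after reserving headroom for KV cache etc.

    headroom_fraction is a hypothetical allowance; tune it per workload.
    """
    usable = vram_gb * (1 - headroom_fraction)
    return weight_gb(params_billion, bits_per_weight) <= usable


if __name__ == "__main__":
    for params, bits in ((7, 4), (8, 4), (13, 8), (13, 16)):
        print(f"{params}B at {bits}-bit: {weight_gb(params, bits)} GB,",
              "fits in 16 GB:", fits(params, bits, 16))
```

A 4-bit 7B model needs about 3.5 GB for weights, leaving ample room on a 16GB card, while a 13B model at 16-bit precision needs about 26 GB and does not fit at all.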

PostgreSQL does more than you think.

The temptation to add specialized databases — Redis for caching, Elasticsearch for search, a dedicated vector database, a time-series database for metrics — is strong. Resist it. PostgreSQL with the right extensions handles an enormous range of workloads. Every additional database is another system to operate, monitor, back up, and secure. We added Qdrant only after demonstrating that pgvector could not meet our RAG performance requirements at scale, and we still use pgvector for most vector workloads.

Tailscale changes the operational model.

Before Tailscale, self-hosting meant managing firewalls, VPNs, port forwarding, dynamic DNS, and SSL certificates for internal services. Tailscale eliminated all of this. The reduction in operational complexity was dramatic — easily a 50% reduction in infrastructure management time. If you are considering self-hosting, Tailscale (or a similar mesh VPN) is not optional. It is the technology that makes self-hosting practical for a small team.

Backups are not optional, and you will forget them.

We automated backups in week one and have tested restores monthly since then. PostgreSQL WAL archiving to local NAS storage with encrypted offsite copies. Kubernetes etcd snapshots. Application configuration in git. Every stateful component has an automated backup and a documented, tested restore procedure. Healthcare data is irreplaceable. The backup strategy is not something you add later.

Monitor everything, alert selectively.

We collect metrics from every component. We alert on fewer than 20 conditions. The temptation to alert on every anomaly leads to alert fatigue, which leads to ignored alerts, which leads to missed outages. Our alerting philosophy: if it requires human action within an hour, it is an alert. If it requires human action within a day, it is a daily report. If it is informational, it is a dashboard.
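This triage rule is simple enough to encode directly. The sketch below maps a condition's required response time to one of the three channels. Treating anything that can wait longer than a day as dashboard-only is an assumption, since the rule above does not cover that case explicitly.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    ALERT = "alert"
    DAILY_REPORT = "daily report"
    DASHBOARD = "dashboard"


def route(hours_until_action_needed: Optional[float]) -> Route:
    """Pick a channel from how soon a human must act.

    None means the signal is purely informational.
    """
    if hours_until_action_needed is None:
        return Route.DASHBOARD
    if hours_until_action_needed <= 1:
        return Route.ALERT
    if hours_until_action_needed <= 24:
        return Route.DAILY_REPORT
    return Route.DASHBOARD  # assumption: slower than a day is informational


if __name__ == "__main__":
    print(route(0.25))  # e.g. inference server down
    print(route(12))    # e.g. disk projected to fill tomorrow
    print(route(None))  # e.g. weekly throughput trend
```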

Recommendations for Healthcare CTOs

If you are evaluating infrastructure strategies for healthcare AI, here is our advice:

Do the cost math honestly. Include all cloud costs — not just compute and storage, but data transfer, compliance tooling, BAA premiums, and the engineering time spent on cloud-specific compliance guardrails. Compare this to the total cost of self-hosting, including hardware, electricity, operational overhead, and the engineering time spent on infrastructure management. For AI-heavy workloads, self-hosting is almost always cheaper.

Start self-hosted and move to cloud if you must, not the reverse. It is easy to migrate from self-hosted to cloud — you containerize your services and deploy to EKS/GKE/AKS. It is painful to migrate from cloud to self-hosted because you will have accumulated dependencies on cloud-specific services (SageMaker, Bedrock, RDS features) that do not have self-hosted equivalents.

Use Kubernetes from day one. Even if you start with a single node, Kubernetes provides the deployment, scaling, and management primitives you will need as you grow. K3s makes this viable on minimal hardware.

Encrypt everything, expose nothing. The simplest security model is one where there is nothing to attack. Zero public IPs. Mesh VPN for all connectivity. Encryption at rest and in transit. Row-level security in the database.

Own your inference. Cloud LLM APIs are expensive, add latency, and require sending PHI to a third party. Self-hosted inference with vLLM on consumer GPUs provides better economics, lower latency, and complete data control. In our experience, fine-tuned open-weight models deliver equivalent or better quality on clinical tasks.

Plan for the hardware lifecycle. Consumer GPUs and workstations have a 3-5 year useful life. Budget for replacement. Keep spare parts (power supplies, SSDs, RAM) on hand. Document your hardware configuration so that a replacement node can be built quickly.

Automate relentlessly. Every manual infrastructure operation is a potential error and a scaling bottleneck. GitOps for deployments. Ansible for node provisioning. Prometheus for monitoring. Automated backups with tested restores. The less you have to touch, the more reliably it runs.

Healthcare AI infrastructure is not a problem you solve once. It is a capability you build and operate continuously. Self-hosting is not the easy path — it requires infrastructure engineering skills and operational discipline. But for healthcare organizations that need data sovereignty, cost control, and low-latency AI inference, it is the right path.

IntelMedica builds and operates self-hosted AI infrastructure for healthcare. If you are a healthcare CTO evaluating your AI infrastructure strategy, we are happy to share our architecture in more detail. Visit intelmedica.com to learn more.