Private AI for your business

Capable AI on your own hardware — keeping prompts, documents, and conversations inside your network, and replacing per-seat AI licensing with a single CAPEX server.

Two AI tools I’ve built — try them now

On-prem · private

LLM Chat

A private chat assistant running on local open-weight models via Ollama, reverse-proxied to a GPU server on my network. Prompts and files never leave the LAN — no third-party API.

Try the live chat demo →
Agentic ops

Tower

A browser workspace pairing a live SSH terminal with an AI coding assistant (powered by Anthropic’s Claude). Operate infrastructure from anywhere, with per-user access and every action logged.

Open Tower (beta) Learn more

Want a private build for your organization? Request a free audit →

Proof of work

Software I’ve built, running in production

I don’t just deploy models — I build the tooling around them. Tower is a browser-based agentic workspace I built and run: a live SSH terminal paired with an AI coding assistant (Claude Code / Codex), with voice input, file attachments, and per-user key-based access. It’s how I operate infrastructure day to day — and it’s the same kind of private, auditable AI tooling I build for clients.

Open Tower (beta) Build something like this →

Why private AI

Data sovereignty

Prompts and documents never leave your premises. Sensitive data — payroll, HR files, client engagements, source code — stays on hardware you own, on a network you control.

Bypass per-seat licensing

Commercial AI subscriptions are billed per user per month. At $200–$300+ per seat, a 30-person org spends $6k–$9k monthly. A capable on-prem GPU server is a one-time CAPEX cost that pays itself back in months.

Tailored to your work

Models grounded in your actual data — SharePoint libraries, file shares, ticket history, internal wikis. Retrieval-augmented assistants answer from real company context, not generic web text.

Audit & compliance

Logs, rate limits, and access controls aligned with the rest of your security stack. AI usage is observable, reviewable, and policy-controlled — not a side-channel that bypasses governance.

No vendor lock-in

Open-weight models (Llama, Qwen, DeepSeek, etc.) running via Ollama, vLLM, or llama.cpp behind a hardened reverse proxy. Swap models, upgrade GPUs, or migrate hardware without rewriting your stack.

Agentic workflows

MCP servers, Claude Code-style coding agents, and custom orchestrators wired into ops, support, and back-office tasks. AI as a force multiplier across the work your team already does.

Where private AI fits

Concrete scenarios where keeping the model and data on-prem pays off.

Legal & professional services

Contract review, redaction, and clause search across privileged documents. Client matter material never touches a third-party API, which keeps the work inside the engagement letter.

Healthcare

Clinical-note summarisation, intake triage, and policy lookup over PHI without sending it to a SaaS endpoint — relevant under HIPAA, PIPEDA, and equivalent regimes.

Finance & deal rooms

M&A document review, internal advisor copilots, and analyst drafting tools grounded in deal data that legal will not let leave the firewall.

Engineering & software

Code review, architecture Q&A, and runbook search grounded in private repos and internal docs. Coding agents (Claude Code-style) can run against on-prem checkouts of source.

Customer support

Ticket triage, KB-grounded reply drafting, and escalation summaries. Support context (PII, account data, prior tickets) stays inside the helpdesk.

IT operations

Runbook search, incident playbook lookup, log triage, and configuration Q&A across firewall, AD, VPN, and SaaS estates — the same surface ops already manages.

Cost: SaaS vs CAPEX

A practical example for a 30-user organization. Numbers are illustrative — exact rates depend on the SaaS plan and the workload.

Per-seat SaaS AI

  • ~$250 per user per month
  • 30 users → $7,500 / month
  • $90,000 / year — recurring forever
  • Prompts and uploaded files leave your network

On-prem private LLM

  • ~$8k–$15k for a capable GPU server (one time)
  • Roughly $150 / month in power & ops
  • Pays itself back inside ~3 months at this scale
  • All data and prompts stay inside your network

What you can run

Open-weight models that self-host cleanly. Models can be swapped or upgraded as a config change — no rebuild, no vendor migration.

Llama (Meta)

Llama 3.1 8B for fast, lightweight workloads; Llama 3.3 70B for higher-quality reasoning and longer context. Strong general-purpose baseline for most business uses.

Qwen (Alibaba)

Qwen 2.5 / Qwen 3 across sizes from 7B to 72B. Coding-strong variants (Qwen Coder) make this a frequent default for engineering and ops chat assistants.

DeepSeek

DeepSeek V3 and the R1 reasoning family. Distilled smaller variants (R1-Distill-Qwen, R1-Distill-Llama) bring step-by-step reasoning to consumer-class GPUs.

Mistral / Mixtral

Efficient dense models (Mistral 7B, Small, Large) and Mixture-of- Experts variants (Mixtral 8x7B / 8x22B) that punch above their parameter weight on throughput.

Gemma (Google)

Compact, well-licensed models (Gemma 2 / 3 in 2B / 9B / 27B). Good fit when latency and footprint matter more than absolute quality.

Phi (Microsoft)

Small models tuned for reasoning quality per parameter. Useful for edge deployments and embedded assistants where a 70B server is overkill.

Embeddings and reranking models (BGE, E5, Jina, Voyage open variants) drop in alongside for retrieval-augmented assistants.

Hardware sizing

Three reference tiers. Each assumes a single host with reverse-proxy, auth, and basic monitoring already wired in. Numbers are illustrative — final spec depends on the model, context length, and concurrency target.

Entry

~$4k–$8k · small team or pilot

  • Single consumer GPU, 16–24 GB VRAM (RTX 4060 Ti 16GB / 4080 / used 3090)
  • Runs 7B–13B models comfortably; 32B at lower context
  • 1–3 concurrent users, sub-second TTFB on small models
  • Workstation-class chassis, 64 GB RAM, 2 TB NVMe

Mid

~$8k–$15k · small org production

  • RTX 4090 / 5090 or used A6000 — 24–48 GB VRAM
  • 13B–32B models at production context lengths; 70B at quantized weights
  • 5–15 concurrent light users; 30–50 with bursty usage
  • Server-class chassis, 128 GB RAM, redundant storage, IPMI

High-end

$20k+ · org-wide or RAG-heavy

  • A100 / H100 80 GB or multi-GPU pooling (vLLM tensor-parallel)
  • 70B+ models at full precision and long context
  • 20+ concurrent users; multi-tenant departmental sharing
  • Rack server, 256+ GB RAM, NVMe RAID, optional InfiniBand for clustering

How a build comes together

  1. Sizing. Pick a GPU and host platform sized to the model and concurrency the team actually needs. Open-weight models scale from 7B (laptop-class) to 70B+ (workstation/server-class).
  2. Hosting. Server lives on-prem (or in a private rack you control). Models run via Ollama / vLLM / llama.cpp behind a reverse proxy with auth tied into your existing identity provider.
  3. Grounding. Retrieval-augmented assistants point at the company data that matters: SharePoint libraries, file shares, ticket history, code repos, internal wikis.
  4. Operations. Audit logs, rate limits, and access controls align with the rest of the security stack. AI usage is observable and reviewable.
  5. Iteration. Swap models, upgrade GPUs, and add agentic workflows without rewriting the stack or breaking compliance.

What you get

Deliverables of a typical engagement — sized to fit a small org build or scaled up for a multi-department deployment.

Hardware & hosting

  • GPU server specced, procured, and built (or cloud equivalent)
  • Reverse proxy, TLS, and auth integrated with your IdP
  • Hardened OS baseline and config under version control

Models & retrieval

  • One or more open-weight models deployed (Ollama / vLLM / llama.cpp)
  • Embedding pipeline over your file shares, SharePoint, wikis, code repos
  • RAG configuration and evaluation harness for answer quality

Operations & handover

  • Monitoring + audit logging dashboards
  • Runbooks: model swaps, scaling, troubleshooting, incident response
  • IT-staff training and an optional retainer for ongoing tuning

FAQ

How many users can one server support?

A mid-tier build (single 24–48 GB GPU) reliably serves 30–50 light users on 13B–32B models, or 10–15 heavier RAG users. Scaling past that is a config change (vLLM tensor-parallel) or a second host.

What if the model goes out of date?

Open-weight models release on the order of weeks. Swapping is a config change, not a rebuild — the runbook covers it. Old models stay available so you can A/B before cutting over.

Does it integrate with M365 / Google Workspace / Slack?

Yes. Auth ties into Entra ID / Google / OIDC, and the assistant surfaces wire into M365 (SharePoint, Teams), Slack, Jira, GitHub, and most line-of-business apps via API.

Does it work without internet?

Yes. The server lives on your LAN. Internet is only needed to pull new model weights, and that can be done once or via an internal mirror. Day-to-day inference is fully offline.

How private is "private", really?

Prompts and uploaded documents never leave the box. There is no telemetry to a vendor. Audit logs are local and inspectable. You control retention, redaction, and who can see what.

What about ongoing maintenance?

Optional retainers cover model upgrades, capacity planning, prompt and RAG tuning, and feature builds (new agents, new integrations). One-shot builds with a handover are also fine for in-house teams.

Try it live

A live demo of the on-prem chat UI runs right here on this server — a private LLM answering in real time, with nothing leaving the network.

Try the live chat demo →

Want a private LLM build for your organization? Book a free audit → — we’ll scope it together.

Talk about a private AI build

Tell me what your team does and what data you'd want it grounded in. I'll reply the same day with whether on-prem AI is a fit — plus a rough shape and cost.

  • Real CAPEX-vs-subscription math for your situation
  • What's safe in-house vs. cloud, given your data
  • One quick win you can act on right away

Prefer email? info@sd-techsolutions.com · Same-day reply.

No spam, no sales funnel — it comes straight to me.

Book a free audit →